GLORIA — GEOMAR Library Ocean Research Information Access

1

Unknown

Probabilistic models of genetic variation in structured populations applied to global human studies (2016)

Hao, W., Song, M., Storey, J. D.

Oxford University Press

In: Bioinformatics

add to mindlist on the mindlist

Details

Publication Date: 2016-02-27

Description: Motivation: Modern population genetics studies typically involve genome-wide genotyping of individuals from a diverse network of ancestries. An important problem is how to formulate and estimate probabilistic models of observed genotypes that account for complex population structure. The most prominent work on this problem has focused on estimating a model of admixture proportions of ancestral populations for each individual. Here, we instead focus on modeling variation of the genotypes without requiring a higher-level admixture interpretation. Results: We formulate two general probabilistic models, and we propose computationally efficient algorithms to estimate them. First, we show how principal component analysis can be utilized to estimate a general model that includes the well-known Pritchard–Stephens–Donnelly admixture model as a special case. Noting some drawbacks of this approach, we introduce a new ‘logistic factor analysis’ framework that seeks to directly model the logit transformation of probabilities underlying observed genotypes in terms of latent variables that capture population structure. We demonstrate these advances on data from the Human Genome Diversity Panel and 1000 Genomes Project, where we are able to identify SNPs that are highly differentiated with respect to structure while making minimal modeling assumptions. Availability and Implementation: A Bioconductor R package called lfa is available at http://www.bioconductor.org/packages/release/bioc/html/lfa.html . Contact: jstorey@princeton.edu Supplementary information: Supplementary data are available at Bioinformatics online.

Print ISSN: 1367-4803

Electronic ISSN: 1460-2059

Topics: Biology , Computer Science , Medicine

Published by Oxford University Press

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

PAPER CURRENT

Fulltext

2

Unknown

The sva package for removing batch effects and other unwanted variation in high-throughput experiments (2012)

Leek, J. T., Johnson, W. E., Parker, H. S., Jaffe, A. E., Storey, J. D.

Oxford University Press

In: Bioinformatics

add to mindlist on the mindlist

Details

Publication Date: 2012-03-20

Description: : Heterogeneity and latent variables are now widely recognized as major sources of bias and variability in high-throughput experiments. The most well-known source of latent variation in genomic experiments are batch effects—when samples are processed on different days, in different groups or by different people. However, there are also a large number of other variables that may have a major impact on high-throughput measurements. Here we describe the sva package for identifying, estimating and removing unwanted sources of variation in high-throughput experiments. The sva package supports surrogate variable estimation with the sva function, direct adjustment for known batch effects with the ComBat function and adjustment for batch and latent variables in prediction problems with the fsva function. Availability: The R package sva is freely available from http://www.bioconductor.org . Contact: jleek@jhsph.edu Supplementary information: Supplementary data are available at Bioinformatics online.

Print ISSN: 1367-4803

Electronic ISSN: 1460-2059

Topics: Biology , Computer Science , Medicine

Published by Oxford University Press

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

PAPER CURRENT

Fulltext

3

Unknown

subSeq: Determining Appropriate Sequencing Depth Through Efficient Read Subsampling (2014)

Robinson, D. G., Storey, J. D.

Oxford University Press

In: Bioinformatics

add to mindlist on the mindlist

Details

Publication Date: 2014-11-26

Description: Motivation: Next-generation sequencing experiments, such as RNA-Seq, play an increasingly important role in biological research. One complication is that the power and accuracy of such experiments depend substantially on the number of reads sequenced, so it is important and challenging to determine the optimal read depth for an experiment or to verify whether one has adequate depth in an existing experiment. Results: By randomly sampling lower depths from a sequencing experiment and determining where the saturation of power and accuracy occurs, one can determine what the most useful depth should be for future experiments, and furthermore, confirm whether an existing experiment had sufficient depth to justify its conclusions. We introduce the subSeq R package, which uses a novel efficient approach to perform this subsampling and to calculate informative metrics at each depth. Availability and Implementation: The subSeq R package is available at http://github.com/StoreyLab/subSeq/ . Contact: dgrtwo@princeton.edu or jstorey@princeton.edu Supplementary information: Supplementary data are available at Bioinformatics online.

Print ISSN: 1367-4803

Electronic ISSN: 1460-2059

Topics: Biology , Computer Science , Medicine

Published by Oxford University Press

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

PAPER CURRENT

Fulltext

4

Unknown

A nested parallel experiment demonstrates differences in intensity-dependence between RNA-seq and microarrays (2015)

Robinson, D. G., Wang, J. Y., Storey, J. D.

Oxford University Press

In: Nucleic Acids Research

add to mindlist on the mindlist

Details

Publication Date: 2015-11-17

Description: Understanding the differences between microarray and RNA-Seq technologies for measuring gene expression is necessary for informed design of experiments and choice of data analysis methods. Previous comparisons have come to sometimes contradictory conclusions, which we suggest result from a lack of attention to the intensity-dependent nature of variation generated by the technologies. To examine this trend, we carried out a parallel nested experiment performed simultaneously on the two technologies that systematically split variation into four stages (treatment, biological variation, library preparation and chip/lane noise), allowing a separation and comparison of the sources of variation in a well-controlled cellular system, Saccharomyces cerevisiae . With this novel dataset, we demonstrate that power and accuracy are more dependent on per-gene read depth in RNA-Seq than they are on fluorescence intensity in microarrays. However, we carried out quantitative PCR validations which indicate that microarrays may demonstrate greater systematic bias in low-intensity genes than in RNA-seq.

Keywords: Microarray Technology, Massively Parallel (Deep) Sequencing

Print ISSN: 0305-1048

Electronic ISSN: 1362-4962

Topics: Biology

Published by Oxford University Press

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

PAPER CURRENT

Fulltext

5

Unknown

Statistical significance of variables driving systematic variation in high-dimensional data (2015)

Chung, N. C., Storey, J. D.

Oxford University Press

In: Bioinformatics

add to mindlist on the mindlist

Details

Publication Date: 2015-02-13

Description: Motivation : There are a number of well-established methods such as principal component analysis (PCA) for automatically capturing systematic variation due to latent variables in large-scale genomic data. PCA and related methods may directly provide a quantitative characterization of a complex biological variable that is otherwise difficult to precisely define or model. An unsolved problem in this context is how to systematically identify the genomic variables that are drivers of systematic variation captured by PCA. Principal components (PCs) (and other estimates of systematic variation) are directly constructed from the genomic variables themselves, making measures of statistical significance artificially inflated when using conventional methods due to over-fitting. Results : We introduce a new approach called the jackstraw that allows one to accurately identify genomic variables that are statistically significantly associated with any subset or linear combination of PCs. The proposed method can greatly simplify complex significance testing problems encountered in genomics and can be used to identify the genomic variables significantly associated with latent variables. Using simulation, we demonstrate that our method attains accurate measures of statistical significance over a range of relevant scenarios. We consider yeast cell-cycle gene expression data, and show that the proposed method can be used to straightforwardly identify genes that are cell-cycle regulated with an accurate measure of statistical significance. We also analyze gene expression data from post-trauma patients, allowing the gene expression data to provide a molecularly driven phenotype. Using our method, we find a greater enrichment for inflammatory-related gene sets compared to the original analysis that uses a clinically defined, although likely imprecise, phenotype. The proposed method provides a useful bridge between large-scale quantifications of systematic variation and gene-level significance analyses. Availability and implementation : An R software package, called jackstraw , is available in CRAN. Contact : jstorey@princeton.edu

Print ISSN: 1367-4803

Electronic ISSN: 1460-2059

Topics: Biology , Computer Science , Medicine

Published by Oxford University Press

Permalink

	Location	Call Number	Limitation	Availability

Others were also interested in ...

PAPER CURRENT

Fulltext