RNA-seq / Differential expression analysis using edgeR
Description
This tool performs an analysis for differential expression using the edgeR Bioconductor package.
Parameters
- Column describing groups [group]
- Apply normalization (yes, no) [yes]
- Dispersion method (common, tagwise) [tagwise]
- Dispersion estimate (0-1) [0.1]
- Multiple testing correction (none, Bonferroni, Holm, Hochberg, BH, BY) [BH]
- P-value cutoff (0-1) [0.05]
- Plot width (200-3200 [600]
- Plot height (200-3200) [600]
Details
Given an input table of raw counts,
the edgeR package performs statistical analysis to identify differentially expressed genomic features (genes, miRNAs,...) between two experimental conditions.
Note that in its current implementation, the tool only supports single-factor experiment designs. The experiment
conditions to be compared should be defined in the phenodata.tsv file, and the appropriate column selected using
the 'Column describing groups' parameter.
Normalization factors are calculated using the library size given by the user in the phenodata.tsv or by summing the counts for each sample.
TMM method is then used to calculate the normalization factors in order to reduce RNA compositon bias
(which can arise for example when only a small number of genes
are very highly expressed in one experiment condition but not in the other).
Dispersion is estimated using the quantile-adjusted conditional maximum likelyhood method (qCML). It can estimate a common dispersion for all the genomic features,
or a separate (tagwise) dispersion for each individual feature using an empirical Bayes strategy.
It is highly recommended to always have at least two biological replicates for each experiment condition.
If this is not possible, one can run the analysis by manually setting
the dispersion factor through the 'Dispersion estimate' parameter.
By default the dispersion estimate is set to 0.1, which is somewhere in-between what is usually observed for technical replicates (0.01) and human data (0.4).
It is recommended to experiment with different values for this parameter.
Once negative binomial models are fitted and dispersion estimates are obtained,
edgeR proceeds with testing for differential expression using the exact test, which is based on the qCML methods.
Output
The analysis output consists of the following files:
- de-list-edger.tsv: Table containing the results of the statistical testing, including fold change estimates and p-values.
- de-list-edger.bed: The BED version of the results table contains genomic coordinates and log fold change values.
- ma-plot-raw-edger.pdf: A scatter plot of the raw count value averages between experiment conditions.
- ma-plot-normalized-edger.pdf: A scatter plot of the normalized count value averages between experiment conditions.
- ma-plot-significant-edger.pdf: A scatter plot where the significantly differentially expressed features are highlighted.
- mds-plot-edger.pdf: A plot showing the results of multidimensional scaling of the data to visualize sample similarities.
- edger-log.txt: Log file.
- p-value-plot-edger.pdf: Plot of the raw and adjusted p-value distributions of the statistical test.
References
This tool uses the edgeR package for statistical analysis. Please read the following article for more detailed information:
MD Robinson, DJ McCarthy, and GK Smyth. edgeR: a bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26 (1):139Ð40, Jan 2010.
.