Analysing differential expression with RNA-seq data



CuffDiff, eXpress, aligners...???

We have gathered all of our learning material for RNA-seq data analysis on this page to help you get started!
If RNA-seq is a whole new topic to you, we suggest checking out the introduction to RNA-seq webinar (1h), where the very basic steps are explained. This video also covers the basics of using Chipster.

After the webinar, a hands-on approach might be handy. For that we suggest going through our course exercises (listed below). Open the exercises PDF document, log in to Chipster, and just follow the instructions!

Some parts of the analysis can raise more questions. We have tried to clarify those steps with more detailed instructions and tutorials. Please find the links to those on this page!

Course material

Please find below the links to our RNA-seq course materials.
Course example sessions are available on Chipster (File -> Open example session -> "course_RNAseq...").

Getting started with Chipster

Importing your data to Chipster

Tutorial videos

We have an RNA-seq playlist on our YouTube channel. This list includes tutorials for some common analysis needs:

Analysis steps



Typical analysis steps include:

  1. Quality control of the raw data (FASTQ files):
    Check whether your reads need trimming or filtering, and gather information about the reads (number of reads, quality encoding, strandedness and inner distance).
    Tools (under Quality control category): FASTQC (most used, very clear), FASTX, PRINSEQ (a bit more info on duplicates). You can also check whether your data was produced using a stranded protocol, and what the inner distance of your paired reads is, using the tool RNA-seq strandedness inference and inner distance estimation using RseQC.
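To make "number of reads" and "quality encoding" a bit more concrete, here is a minimal Python sketch of what such a check could look like. It is only an illustration and not how FASTQC or the other Chipster tools work internally. The function name summarise_fastq and the cut-off values (59 and 74) are our own assumptions, based on the commonly used rule of thumb that quality characters below ASCII 59 only occur in Phred+33 files.

```python
import io


def summarise_fastq(handle):
    """Count reads and guess the quality encoding of a FASTQ text stream.

    Hypothetical helper for illustration; assumes 4-line FASTQ records.
    """
    n_reads = 0
    min_code, max_code = None, None
    while True:
        header = handle.readline()
        if not header:
            break  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError(f"malformed FASTQ record near read {n_reads + 1}")
        n_reads += 1
        for char in qual:
            code = ord(char)
            min_code = code if min_code is None else min(min_code, code)
            max_code = code if max_code is None else max(max_code, code)

    # Rule-of-thumb thresholds (an assumption, not a guarantee).
    if min_code is None:
        encoding = "unknown"
    elif min_code < 59:
        encoding = "Phred+33"
    elif max_code > 74:
        encoding = "Phred+64"
    else:
        encoding = "ambiguous"
    return {"reads": n_reads, "encoding": encoding}


if __name__ == "__main__":
    example = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!#I\n"
    print(summarise_fastq(io.StringIO(example)))
```

With real data you would pass an open file instead of the in-memory example; a result of "ambiguous" means the quality values seen do not settle the question either way.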

  2. Preprocessing of the raw data (FASTQ files):
    If you noticed a need for trimming or filtering, such as poor-quality reads or adapter sequences, you can get rid of them now. After this, run the quality control tool(s) again to see whether the problem was solved. Most of these problems can also be dealt with in the alignment step, but getting rid of the poor-quality data here makes your files smaller and thus possibly a bit faster to work with.
    Tools (under Preprocessing category): Trimmomatic (several options, quite fast), PRINSEQ (trimming and filtering are in separate tools), FASTX (likewise in separate tools).

  3. Alignment (from FASTQ to BAM files):
    Next you align your FASTQ files against a reference genome.
    Choose the reference genome, and remember to use the same reference later on too! If your reference genome is not available, you can import it separately and use the "own genome" version of the tool. If your genome is used by many users, you can also contact us and ask if it could be added.
    If you have paired-end data, select both the read 1 and read 2 files of a particular sample as input for the aligner. Make sure you choose the correct tool: there are separate tools for single-end and paired-end data.
    If you have several FASTQ files per sample (for example, Illumina NextSeq generates 8 files per sample for paired-end data), you first need to generate list files (one for read 1 files and another for read 2 files), and give those list files and all the FASTQ files as input. Please read the manual for the Utilities/Make a list of file names tool for more info!
    Aligners also benefit from information about the library prep: the strandedness and the inner distance of the paired-end reads. Since the strandedness / library type issue can easily get a bit confusing due to inconsistent nomenclature, we gathered a summary here!
    Tools (under Alignment category): TopHat2 (recommended for RNA-seq), Bowtie, BWA. Coming: STAR ??
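The idea behind the two list files (one for read 1 files, one for read 2 files) can be sketched in Python as below. This is only an illustration of the pairing logic: in Chipster the lists are made with the Utilities/Make a list of file names tool. The function name split_read_lists and the assumption that file names carry an "_R1" or "_R2" tag (as in typical Illumina naming) are ours, so adjust the pattern if your files are named differently.

```python
import re

# Assumed naming convention: "_R1" or "_R2" followed by "_" or "."
READ_TAG = re.compile(r"_R([12])(?=[_.])")


def split_read_lists(filenames):
    """Split FASTQ file names into matching read 1 and read 2 lists."""
    read1, read2 = [], []
    for name in sorted(filenames):
        match = READ_TAG.search(name)
        if match is None:
            raise ValueError(f"no _R1/_R2 tag in file name: {name}")
        (read1 if match.group(1) == "1" else read2).append(name)

    # Every read 1 file must have its mate at the same position in read2.
    expected_mates = [READ_TAG.sub("_R2", name, count=1) for name in read1]
    if expected_mates != read2:
        raise ValueError("read 1 and read 2 files do not pair up")
    return read1, read2


if __name__ == "__main__":
    files = [
        "sample_L002_R2_001.fastq.gz",
        "sample_L001_R1_001.fastq.gz",
        "sample_L001_R2_001.fastq.gz",
        "sample_L002_R1_001.fastq.gz",
    ]
    r1, r2 = split_read_lists(files)
    print("\n".join(r1))
    print("\n".join(r2))
```

The important point is that both lists are in the same order, so that the first read 1 file is matched with the first read 2 file, and so on.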

  4. Alignment level quality control (for BAM files):
    Check the alignment quality to see how well the reads mapped to the reference, and how many of them fall on exons or span junctions.
    Tools (under Quality control category): RSeQC, CollectMultipleMetrics (Picard)

  5. Quantitation (for BAM files):
    Next we want to count the reads per gene/transcript. The quantitation is done separately for each BAM file. At this step you also need to report the library type / strandedness, so make sure you choose the correct parameter!
    Tools (under RNA-seq category): HTSeq, DEXSeq

  6. Define NGS experiment (for the count files):
    Now we need to combine all the count files (samplename.tsv) into one table (ngs-data-table.tsv), where rows represent genes and columns represent the different samples. For this, select all the blue count files and run the tool Utilities/Define NGS experiment.
    This tool generates the table and a phenodata file. The phenodata file is your way to describe your samples in Chipster: your next task is to fill in the "group" column and maybe add some more columns that describe which of your samples are "controls", "treatments", from the same batch/patient etc. You also want to make sure the "Description" column is, well, descriptive, and that the titles there are short enough, as these are used by many visualisation tools. There are a few things to keep in mind when filling in the phenodata file, so it might be advisable to check out our video tutorial (3 min) and/or the manual page on how to fill in the phenodata file.
    Tool (under Utilities category): Define NGS experiment
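What "combining the count files into one table" means can be sketched in Python as follows. This is not the code of the Define NGS experiment tool, just an illustration under our own assumptions: each count file is taken to be a two-column, tab-separated file (gene identifier, count) without a header line, and summary lines starting with "__" (as written by htseq-count) are skipped. The function names parse_counts and build_count_table are hypothetical.

```python
import csv
import io


def parse_counts(text):
    """Parse a two-column, tab-separated count file into {gene: count}."""
    counts = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(row) < 2 or row[0].startswith("__"):
            continue  # skip blank lines and htseq-count summary lines
        counts[row[0]] = int(row[1])
    return counts


def build_count_table(per_sample):
    """Build a genes-by-samples table; genes missing from a sample get 0."""
    samples = list(per_sample)
    genes = sorted(set().union(*(c.keys() for c in per_sample.values())))
    table = [["gene"] + samples]
    for gene in genes:
        table.append([gene] + [per_sample[s].get(gene, 0) for s in samples])
    return table


if __name__ == "__main__":
    control = "geneA\t10\ngeneB\t0\n__no_feature\t5\n"
    treated = "geneA\t3\ngeneC\t7\n"
    table = build_count_table(
        {"control1": parse_counts(control), "treated1": parse_counts(treated)}
    )
    for row in table:
        print("\t".join(str(cell) for cell in row))
```

Note that the table holds the raw counts as they are; no normalisation is applied at this stage.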

  7. HINT: If your FASTQ and/or BAM files are big, at this point it might be wise to make sure your session is saved, then remove the large files from your session, and save the session again with another name. It makes your session smaller and easier to handle! You can always return to the earlier session with the bigger files if you notice a need for that later on. Note that the cloud sessions are not stored forever, so for longer storage, save the sessions (also) locally! If the sessions are huge, it might be a good idea to store them in Taito.

  8. Experiment level quality control (for the count table):
    Now we have our data in one table, and it is time to do some experiment level quality control. This is the exciting part where you get to see whether your samples really are expressing differently!
    Note: the tool expects you to use raw counts (in the count table), so don't do any normalisation!
    Check from the PCA plot and heatmap that the samples cluster as you would expect (controls in one clump and so on) and that there are no outliers.
    Now you can also see if there are some possible batch effects lurking in your data (see the Drosophila example session to see what you should be looking for). If you notice something, make sure you take it into account in the statistical testing / modeling phase!
    Tool (under Quality control category): PCA and heatmap of samples using DESeq2

  9. Differential expression analysis (for the count table):

    WHERE MIGHT THAT GOOD SLIDE BE THAT WAS MADE BASED ON SEIJA'S INSTRUCTIONS?
    COULD WE ASK SEIJA FOR A SHORT VIDEO OR NOTEBOOK ON THE TOPIC? ON THIS AND ON linear modeling?
    Time for statistics! Estimating differential expression is actually fairly tricky for a few reasons: we are usually testing thousands or tens of thousands of genes (so multiple testing correction is needed), the expression values are not normally distributed (instead, negative binomial distributions and generalised linear modeling are used), and the range in which a gene's expression values vary differs from gene to gene (hence dispersion estimation). Luckily, there are tools that do all these tricky things for you.
    These tools are presented in our video tutorial:
    Differential expression analysis tools for RNA-seq tutorial video (3 min)
    Things get a bit trickier if you have multiple variables, like treatment, gender and batch, to take into account simultaneously. For these cases, you need to use the tool called edgeR for multivariate experiments:
    check out the video tutorial for that here (6 min),
    and a specific tutorial for cases where you need to use the "nested" option here (4 min).


    IF NONE OF THESE OPTIONS WORK FOR YOUR DATA, THEN....??? Wasn't there some "use your own matrix" tool?? Where??

    Note: the tools expect raw counts (in the count table) as input, so DON'T do any normalisation and don't use FPKM values or similar!
    Tools (under RNA-seq category): DESeq2, edgeR, edgeR for multivariate experiments, DEXSeq
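To give a feeling for what the multiple testing correction mentioned in this step does, here is a minimal Python sketch of the Benjamini-Hochberg procedure, one widely used correction method. It is an illustration only: the tools listed above take care of the correction for you, and the function name benjamini_hochberg and the example p-values are made up for this sketch.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so that the adjusted values
    # never increase when the raw p-values decrease.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


if __name__ == "__main__":
    raw = [0.01, 0.04, 0.03, 0.005]  # made-up p-values for four genes
    for p, q in zip(raw, benjamini_hochberg(raw)):
        print(f"raw p = {p:.3f}  adjusted p = {q:.3f}")
```

The adjusted values are never smaller than the raw ones, which is why a gene with a raw p-value below 0.05 does not necessarily stay significant when thousands of genes are tested at once.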

  10. Visualising your data:
    Video tutorial for interactive visualisations (7 min)