Basic RNA-seq processing: unix tools and IGV
RNA-seq (transcriptome sequencing) is a very powerful method for transcriptomic studies, that enables
quantification of transcript levels as well as discovery of novel transcripts and transcript isoforms. This practical will introduce some popular tools for basic processing of
RNA-seq data. The requirements for
aligning this type of data is slightly different from e.g. genome sequencing, since the
aligner needs to be able to handle splice junction-spanning reads. After alignment to a reference genome, special tools are available
to quantify the expression of known genes or to discover novel transcripts.
Common RNA-seq workflow (from bgisequence.com)
You will be introduced to the "Tuxedo suite" of tools (Bowtie,
Tophat, Cufflinks). Tophat is a splice junction-aware aligner,
that can correctly map reads to the genome even if they start in one exon and end in another. Tophat in turn
makes use of Bowtie, a simpler aligner for short reads. Once reads are aligned, Cufflinks can be used to assemble novel transcripts,
and to quantify known or novel transcripts. Cufflinks can also be used for testing differential expression
between samples. In addition to the Tuxedo tools, you will make use of Samtools for further processing of your alignments, and IGV (Integrative Genomics Viewer) for visualization.
If you want to learn more about using Tophat and Cufflinks for RNA-seq data analysis, this recent tutorial in Nature Protocols is highly recommended.
We will cover the following topics:
Students must provide answers to questions that are indicated in
italics and marked with 'Q'.
1. Human RNA-seq data files
RNA-seq data comes in many flavors, depending e.g. on the sample preparation and sequencing protocols that were used. In this lab, you will be working with "paired-end" reads (2x50 nt), that derive from human brain polyA+ RNA and were produced on the Illumina HiSeq 2000 instrument. An "unstranded" protocol was used, meaning that the direction of transcription is unknown (the DNA strand to which reads align gives no useful information). Copy the required files to your local directory:
% cp -r /home/erik/rnaseq_lab .
% cd rnaseq_lab
With paired-end data, each cDNA fragment is sequenced from two ends, in two different phases of the sequencing run. Therefore, there are two matching .fastq files that contain left and right sequence reads. Paired-end reads improve discovery of novel transcripts and increases the total read length per fragment. Typing:
% ll (Note: lower-case 'L', not numbers)
should reveal the files brain_1.fastq and brain_2.fastq (left and right reads). You will also see some other files (many are actually links) that will be useful later. To speed up the analyses in this tutorial, the brain files are small. Normal RNA-seq datasets can be 1000x larger (10-100 million reads per sample is common for RNA-seq in higher organisms).
Q1. Warm-up question: can you use the unix commands cat and wc to figure out how many reads the files contain? Hint: wc followed by a file name will reveal the number of rows, words and characters in the file, and cat will show the contents of a file.
2. Alignment with Tophat
Let's now use Tophat to align the brain reads to the human genome!
% ./tophat -p 8 -r 170 --library-type fr-unstranded -G genes.gtf -o tophat_out_brain bowtie_indexes/hg19 brain_1.fastq brain_2.fastq
This should initiate the alignment and various stats will be displayed while it runs. It is not as complicated as it might seem:-p 8 tells the aligner to use a maximum of 8 processor cores. Set this to a lower value to use less resources, e.g. when running long jobs on a shared computer.
-r 170 tells Tophat that the approximate distance between paired-end reads is 170 nt. This number will vary depending on the sample preparation protocol used and should be known by the person who performed the experiment. This only needs to be approximately correct.
-o specifices an output directory. bowtie_indexes/hg19 tells Tophat what genome (more precisely: genome index) to use. In this case we align against the human genome, build 19.
-G can be used to provide Tophat with a reference gene annotation. This will help when finding gapped alignments in cases where reads span across splice junctions. This is optional but will lead to slightly more reads being correctly aligned (in this case, genes.gtf contains the RefSeq human gene annotation).
Q2. Why was the --library-type fr-unstranded option used here? Check out the Tophat manual to figure out more.
For large datasets, alignment runs can easily take a whole night, but in this case Tophat should complete within 15 mins. While aligning, continue reading the instructions below, or familiarize yourself with the article that was mentioned earlier.
3. Inspect with Samtools
When complete, check the output directory:
% ll tophat_out_brain
Among others, you will see accepted_hits.bam which contains your alignments. This compressed file format is not "human-readable", but it can be converted to the .sam text format. Use Samtools to take a look at this file.
% samtools view tophat_out_brain/accepted_hits.bam | more (shows .bam file in .sam format - press q to exit)
% samtools view tophat_out_brain/accepted_hits.bam | wc (will tell number of aligned sequences)
% samtools view tophat_out_brain/accepted_hits.bam | cut -f1 | sort -u | wc (shows number of read pairs where at least one sequence was aligned to at least one place in the genome)
Q3. How many, and what fraction, of your read pairs were aligned?
4. View alignments in IGV
Let now visualize the alignments in the IGV browser. First, your .bam file needs to indexed for fast access. This creates an additional file, accepted_hits.bam.bai. Then we can launch IGV (this requires that your terminal can handle X windows):
% samtools index tophat_out_brain/accepted_hits.bam
% ./igv &
You can also choose to run IGV on your local computer (click Launch here), but you will then need to transfer e.g. the .bam file to your local machine. Ask one of the lab supervisors for help if needed.
Make sure that human genome build 19 (hg19) is selected in the upper left corner. The RefSeq human gene annotation loads by default - this is perfect for our purposes. Start by opening the "View, Preferences..." window. On the "Alignments" tab, set the visibility range threshold to 300 (kb). Now open the .bam file (File, Load from File...). Take a look at PLP1 gene by searching in the IGV genome position field. This gene (coding for myelin proteolipid protein) has strong brain-specific expression.
Explore IGV a bit. Try the "view as pairs" option (right-click on the alignment track). Zoom in on smaller parts of the gene by drag-clicking on the ruler. Hover over individual reads to see more info about alignments. Right-click the RefSeq annotation track and toggle between "Expanded" and "Collapsed" to see all known/annotated splice forms. Note the alignment coverage plot at the top.
Q4. Does the PLP1 gene appear to be well-covered with reads (despite your dataset being reduced to ~1/1000 of the original size)?
Q5. Are the benefits using a "splice-junction-aware" aligner such as Tophat visible?
Q6. How many transcript isoforms are known (in RefSeq) for the PLP1 gene?
Q7. Do all of them appear to be expressed in this sample? Which one(s)? (give RefSeq NM_... ID's)
Q8. How can one tell if your dataset is "stranded" or "unstranded" (see introduction)? Hint: hover over a few alignments and check the "Pair orientation" property!
Check out another gene with abundant expression in the brain: the APP gene (amyloid precursor protein). Proteolysis of APP generates beta amyloid, the main component of plaques found in the brains of Alzheimer disease patients. Compare observed reads to known splice forms in RefSeq.
Q9. Do all known splice forms appear to be expressed? (Again, you may need to switch to the "Expanded" annotation view)
5. Generate coverage plots
To save memory and reduce disk access, IGV only loads alignments for the particular chromosome window that your are looking at (.bam files can be several gigabytes in size). When you zoom out too much (remember the 300 kb limit that you set earlier), reads and alignment coverage plot will no longer be displayed. Now use IGVtools to precompute coverage data for all chromosomes, that will enable coverage plots to be shown always:
% ./igvtools count tophat_out_brain/accepted_hits.bam brain.tdf hg19
Open the brain.tdf file in IGV. Look at the complete chromosome 18.
Q10. Is any gene on this chromosome exceptional in terms of expression level? Which one? Does it make sense given its function (use e.g. NCBI Gene or simply Google to figure the full name!)?
6. De novo transcript discovery
You will now learn to use Cufflinks to discover and quantify transcripts. Cufflinks is not one piece of software, but includes several different programs. The main cufflinks program assembles transcripts ("discovers genes") from aligned reads, with or without the help of a known reference transcript annotation. It also provides quantitative expression levels for transcripts. It can also be used only to quantify known genes from a reference annotation, without assembling new transcripts.
Start by testing Cufflinks capabilities when it comes to assembling gene/transcript structures, without the help of a reference annotation:
% ./cufflinks -p 8 --library-type fr-unstranded -o cufflinks_out_brain tophat_out_brain/accepted_hits.bam
The run will finish quickly. Check out the Cufflinks manual if you need help understanding the parameters. A few files are now found in the output directory:
% ll cufflinks_out_brain
transcripts.gtf is a gene annotation file similar to the one IGV uses to display the RefSeq gene track at the bottom of the screen. Load it into IGV ("Load from File..."). A separate track, with newly discovered transcripts named CUFF.399.1 and similar, is now displayed. Now revisit the myelin proteolipid protein (PLP1) and the amyloid precursor protein (APP) genes. Compare with the reference annotation track. You may need to switch beetween collapsed and expanded views.
Q11. Did Cufflinks do a good job in assembling transcripts for the two genes? Explain what you see - e.g. why/why not.
Other files generated by Cufflinks are isoforms.fpkm.tracking and genes.fpkm.tracking. The first contains expression levels for discovered transcripts and the second contains expression levels for genes (all isoforms taken together). The unit is FPKM (fragments per kilobase per million fragments. One fragment = one read pair). You can load these into e.g. Excel or view using unix commands:
% more cufflinks_out_brain/genes.fpkm.tracking
7. Quantify known genes
Now run Cufflinks again, but only to quantify known genes. We can supply a reference annotation using the -G option. In this case you will be using a file (genes.gtf) that contains the same annotation that IGV shows by default at the bottom (RefSeq). (The -g option will instead discover new transcripts in addition to known ones.)
% ./cufflinks -p 8 -G genes.gtf --library-type fr-unstranded -o cufflinks_out_brain tophat_out_brain/accepted_hits.bam
Sort the genes.fpkm.tracking file by expression level, to find highly expressed genes in your brain sample (the 10th column is the FPKM value, rows will be shown in ascending order):
% sort -n +10 cufflinks_out_brain/genes.fpkm_tracking
Q12. What is the top expressed gene?
Now, let's check how this gene is expressed at the level of individual transcripts! Cufflinks uses sophisticated mathematics to try to determine expression levels of individual transcripts isoforms of a gene, given the aligned reads.
% cat cufflinks_out_brain/isoforms.fpkm_tracking | grep -w ABC1 (replace ABC1 with the gene of interest - case sensitive!)
Q13. What isoform is the top expressed one (give the NM_... number)?
Q14. Does it fit with what you saw by manual inspection of aligned reads? (See earlier question)
If everything worked correctly, you should now have seen the power of RNA-seq when it comes to discriminating between splice forms. Add to that the capability of discovering novel genes and transcripts, and even mutations. In combination with rapidly decreasing sequencing costs, microarrays are basically left in the dust already!
8. Find differentially expressed genes
Cufflinks includes the cuffdiff utility, which is useful for finding differentially expressed genes. cuffdiff requires only a few basic parameters. You need to supply a gene annotation (gtf) file. This can be e.g. one that you assembled yourself using cufflinks, or an established annotation. You also need to specificy .bam files for the samples you wish to compare. liver.bam contains Tophat-derived alignments from a liver RNA-seq library, which is otherwise identical to your brain data. Let's now compare brain and liver samples. Replicates can be grouped together by commas when calling cuffdiff, but in this case we only have one sample for each condition:
% ./cuffdiff -p 8 --library-type fr-unstranded -o cuffdiff_out -L liver,brain genes.gtf liver.bam tophat_out_brain/accepted_hits.bam
A number of files were now created, including statistics on differential expression for genes and individual transcript isoforms (gene_exp.diff and isoform_exp.diff):
% ll cuffdiff_out
gene_exp.diff can be opened in e.g. Excel and sorted to find differently expressed genes. If you want to try this you must first transfer the file to your local computer (ask a supervisor if needed). Otherwise, we can use some simple unix tricks to filter the file. The following line first selects genes classified as significant (marked 'yes'), then removes some unwanted columns to make it easier to read the result, and finally sorts by fold change (log2):
% grep yes cuffdiff_out/gene_exp.diff | cut -f1,10,12 | sort -n +1
Q15. Was any of the genes you looked at earlier identified as differentially expressed between the two samples?
Show your results to a supervisor!
EL, March 2012