From Reads to Counts

Step one in any RNA-Seq data analysis is generating the gene or transcript counts for each sample. This overlaps partly with steps taken in a genomics workflow:

  • Download SRA files and extract FastQ files
  • Perform quality control on raw reads (FastQC)
  • Perform trimming as needed (Trimmomatic)
  • Perform quality control post-trimming (FastQC)
  • Perform mapping against a reference genome (most often STAR)
  • Generate gene/transcript counts (using STAR, HTSeq or featureCounts)
  • Downstream analysis (chapter 4 of this document)

This chapter shows these steps for an example data set (GSE149995).

2.1 Downloading and extracting FastQ files

Given the SRA identifier for this project, select all samples and download them as an accession list (SRX identifiers). Using these identifiers as input for the prefetch command (requires the NCBI SRA-Toolkit to be installed) downloads the SRA files, which we then convert to FastQ files using the fasterq-dump command.

prefetch $(<GSE149995_SRA_ACC_RNA-Seq.csv) --output-directory \
  /local-fs/staff/marcel/GSE149995/SRA

By default, fasterq-dump performs a split-3 operation; see the SRA-toolkit manual for further details.

find /local-fs/staff/marcel/GSE149995/SRA/ -name "*.sra" | \
  parallel 'fasterq-dump -O /local-fs/staff/marcel/GSE149995/fastq/ {}'
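
To see what fasterq-dump produced, the extracted files can be listed from R; with split-3, paired-end runs yield _1.fastq/_2.fastq mate files, while single-end runs (as in this data set) yield a single fastq file per run. A minimal sketch:

## List the extracted fastq files; paired-end runs would appear as _1.fastq / _2.fastq mates
fq.files <- list.files("/local-fs/staff/marcel/GSE149995/fastq/", pattern = "\\.fastq$")
table(grepl("_[12]\\.fastq$", fq.files))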

2.2 (Optional) Annotation

The NCBI GEO page lists 9 samples with 3 conditions (WT, potent and arf7); however, this experiment includes 18 sequence files. By downloading the ‘series matrix file’ from GEO (using the GEOquery library) and combining it with the run info downloadable from the SRA, we can list the mapping of run to sample.

Note: This is an optional step for now and is only useful when there are doubts about the group/sample setup.

library(GEOquery)
library(dplyr)
library(pander)

gse <- getGEO(GEO = "GSE149995")
# Combine the condition with a sample number and replicate number
Condition <- paste(rep(gse[[1]]@phenoData@data$`genotype/variation:ch1`, each = 2),
                   rep(1:3, each = 2),
                   paste0('r', rep(1:2, 9)), sep = '_')
run_info <- read.csv(file = "data/GSE149995_Sra_RunInfo.csv")
setup <- cbind(run_info %>% dplyr::select(Run, Experiment), Condition)
pander(setup)
Run           Experiment   Condition
------------  -----------  ------------
SRR11714145   SRX8273521   WT_1_r1
SRR11714146   SRX8273521   WT_1_r2
SRR11714147   SRX8273522   WT_2_r1
SRR11714148   SRX8273522   WT_2_r2
SRR11714149   SRX8273523   WT_3_r1
SRR11714150   SRX8273523   WT_3_r2
SRR11714151   SRX8273524   potent_1_r1
SRR11714152   SRX8273524   potent_1_r2
SRR11714153   SRX8273525   potent_2_r1
SRR11714154   SRX8273525   potent_2_r2
SRR11714155   SRX8273526   potent_3_r1
SRR11714156   SRX8273526   potent_3_r2
SRR11714157   SRX8273527   arf7_1_r1
SRR11714158   SRX8273527   arf7_1_r2
SRR11714159   SRX8273528   arf7_2_r1
SRR11714160   SRX8273528   arf7_2_r2
SRR11714143   SRX8273529   arf7_3_r1
SRR11714144   SRX8273529   arf7_3_r2
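
As a quick sanity check we can verify that each experiment (sample) indeed consists of two sequencing runs, matching the 9 samples and 18 sequence files mentioned above:

## Every SRX experiment should occur exactly twice (two runs per sample)
table(setup$Experiment)
all(table(setup$Experiment) == 2)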

2.3 Quality Control

Here we perform the FastQC -> Trimmomatic -> FastQC steps.

2.3.1 FastQC

Running FastQC

library(fastqcr)

fqdir <- "/local-fs/staff/marcel/GSE149995/fastq/"
fastqc(fq.dir = fqdir,
       qc.dir = "/local-fs/staff/marcel/GSE149995/fqc",
       fastqc.path = "/usr/bin/fastqc",
       threads = 32)

Generating a QC report using MultiQC

multiqc \
  --filename ~/Development/2.1.2-Transcriptomics/GSE149995_FQC_multiqc_report.html \
  /local-fs/staff/marcel/GSE149995/fqc/ 

2.4 Trimming

Proper trimming requires experimentation and research into the sequencing technology used. The following command uses Trimmomatic (release 0.39, available on GitHub) to trim all fastq files. Note that this example is for single-end sequencing only; use the TrimmomaticPE command for paired-end data processing and the TruSeq FASTA file ending in PE instead.

cut -d ',' -f 1 data/GSE149995_Sra_RunInfo.csv | grep '^SRR' | \
  parallel 'TrimmomaticSE -threads 4 ' \
                  '/local-fs/staff/marcel/GSE149995/fastq/{}.fastq.gz ' \
                  '/local-fs/staff/marcel/GSE149995/fastq-trimmed/{}.trimmed.fastq.gz ' \
                  'ILLUMINACLIP:/homes/marcelk/Development/2.1.2-Transcriptomics/TruSeq3-SE.fa:2:30:10 ' \
                  'MINLEN:40 ' \
                  'SLIDINGWINDOW:4:20'

Re-running FastQC

fastqc(fq.dir = "/local-fs/staff/marcel/GSE149995/fastq-trimmed/",
       qc.dir = "/local-fs/staff/marcel/GSE149995/fqc-trimmed",
       fastqc.path = "/usr/bin/fastqc",
       threads = 32)
multiqc --filename \
  ~/Development/2.1.2-Transcriptomics/GSE149995_FQC_Trimmed.html \
  /local-fs/staff/marcel/GSE149995/fqc-trimmed/

2.5 Mapping with STAR

2.5.1 Building STAR index

Unlike bwa, STAR requires an extra annotation file (GFF or GTF format) describing the features on the genome. For this data set we need the FASTA file of the Arabidopsis thaliana genome and its accompanying annotation file. The genomeDir argument specifies the output location for the index.

cd /local-fs/staff/marcel/GSE149995/reference
mkdir star; cd star
STAR \
  --runThreadN 12 \
  --runMode genomeGenerate \
  --genomeDir /local-fs/staff/marcel/GSE149995/reference/star \
  --genomeFastaFiles /local-fs/staff/marcel/GSE149995/reference/TAIR10_chr_all.fas \
  --sjdbGTFfile /local-fs/staff/marcel/GSE149995/reference/TAIR10_GTF_genes.gtf \
  --genomeSAindexNbases 12 \
  --sjdbOverhang 49

2.5.2 Read mapping and transcript counting

The following command performs mapping against the reference genome, saves the result as a sorted BAM file and directly performs a read count for all transcripts (NOTE: this requires the --sjdbGTFfile parameter during index building). For each input file a ReadsPerGene.out.tab file is generated containing 4 columns:

  • column 1: gene ID
  • column 2: counts for unstranded RNA-seq
  • column 3: counts for the 1st read strand aligned with RNA (htseq-count option -s yes)
  • column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)

See an interesting discussion on BioStars regarding which column to use.
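
If it is unclear which column applies to your library preparation, a quick check once mapping (below) has finished is to compare the column totals of a single count file: for a stranded protocol, one of columns 3/4 is roughly equal to the unstranded column 2 while the other is much lower; if columns 3 and 4 are similar, the data is unstranded and column 2 should be used. A minimal sketch in R (the file name is just an example):

## Read one STAR count file, skipping the four summary lines (N_unmapped etc.)
tab <- read.table("/local-fs/staff/marcel/GSE149995/mapping/star/SRR11714145.trimmed_star_ReadsPerGene.out.tab",
                  sep = "\t", skip = 4)
## V2 = unstranded, V3 = forward stranded, V4 = reverse stranded
colSums(tab[, 2:4])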

When planning to use a different tool for calculating gene counts, you can remove the quantMode argument, which results in only the sorted BAM files being generated. Please read the manual of the tool of choice to check whether it requires an indexed BAM file, as the BAM files produced by STAR do not come with an index.
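
If the tool of choice does require an index, the coordinate-sorted BAM files can be indexed once mapping has finished, for example with samtools index or, staying in R, with the Rsamtools Bioconductor package (a minimal sketch):

library(Rsamtools)

## Create a .bai index next to every sorted BAM file produced by STAR
bam.files <- list.files('/local-fs/staff/marcel/GSE149995/mapping/star/',
                        pattern = '\\.bam$', full.names = TRUE)
indexBam(bam.files)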

find /local-fs/staff/marcel/GSE149995/fastq-trimmed/ -name "*.fastq.gz" -exec basename {} \; | \
  sed 's/\.fastq.gz$//' | parallel 'STAR --runThreadN 6 ' \
    '--genomeDir /local-fs/staff/marcel/GSE149995/reference/star ' \
    '--readFilesIn /local-fs/staff/marcel/GSE149995/fastq-trimmed/{}.fastq.gz ' \
    '--readFilesCommand zcat ' \
    '--outSAMtype BAM SortedByCoordinate ' \
    '--quantMode GeneCounts ' \
    '--outFileNamePrefix /local-fs/staff/marcel/GSE149995/mapping/star/{}_star_'

Generating a MultiQC report for the STAR mapping phase

~/miniconda3/bin/multiqc --filename \
  ~/Development/2.1.2-Transcriptomics/STAR-mapping.html \
  /local-fs/staff/marcel/GSE149995/mapping/star/

2.5.3 Combining STAR Count Files

STAR generates a [sample]_star_ReadsPerGene.out.tab file for each sample, containing the gene IDs and the count values. We can merge these files into a single data frame for further processing (see also Appendix A1 in the manual). This example takes the 2nd column for the count values; make sure you know which column to use based on the description above. We merge the data rather than simply pasting the columns together because it is not guaranteed that every file contains exactly the same transcripts.

Note: if you use the featureCounts tool, you already get a single file containing the count values for all samples.
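
Such a featureCounts table can be read in one go; a minimal sketch, assuming the default output format (a comment line, followed by a header with Geneid, five annotation columns and one count column per BAM file; the file name is hypothetical):

## Read a featureCounts output table; the first line is a comment (hence comment.char)
fc <- read.delim("featurecounts_all_samples.txt", comment.char = "#")
## Keep the gene IDs and the count columns, drop Chr/Start/End/Strand/Length
fc.counts <- fc[, -(2:6)]
head(fc.counts)

For the STAR output used here, we list the per-sample count files and define a function that reads a single sample: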

file.names <- list.files('/local-fs/staff/marcel/GSE149995/mapping/star/',
                         pattern = '_star_ReadsPerGene\\.out\\.tab$')

## Function for reading in files
read_sample <- function(file.name) {
  ## Extract the sample name for naming the column (retaining the 'SRR....' part)
  sample.name <- strsplit(file.name, ".", fixed = TRUE)[[1]][1]
  sample <- read.table(file.name, header = FALSE, sep="\t", 
                       row.names = NULL, skip = 4)
  ## Rename the count column
  names(sample)[2] <- sample.name
  ## Return a subset containing the transcript ID and sample name columns
  return(sample[c(1, 2)])
}

Now we read all count files and merge them into a single data frame.

setwd('/local-fs/staff/marcel/GSE149995/mapping/star/')

## Read the FIRST sample
counts <- read_sample(file.names[1])

## Read the remaining files and merge the contents
for (file.name in file.names[2:length(file.names)]) {
  sample <- read_sample(file.name)
  counts <- merge(counts, sample, by = 1)
}

# Set the row names to the transcript IDs
rownames(counts) <- counts$V1
counts <- counts[-1]

We now have a single data frame with count values for all samples and transcripts. The column names, however, are not very descriptive since they are just the SRR identifiers. Since we already created the sample annotation above, we rename the columns to something more meaningful:

# For each SRR id, find the corresponding column in counts and rename
# using the name in the 'Condition' column
for (i in 1:nrow(setup)) {
  idx <- grep(setup$Run[i], names(counts))
  names(counts)[idx] <- setup$Condition[i]
}

# Show first 6 columns and rows
pander(counts[1:6, 1:6])
            arf7_3_r1   arf7_3_r2   WT_1_r1   WT_1_r2   WT_2_r1   WT_2_r2
---------   ---------   ---------   -------   -------   -------   -------
AT1G01010         556         536        58        54       342       335
AT1G01020          24          26        68        62       161       145
AT1G01030           8           4         1         3         2         4
AT1G01040         363         333        69        70        76        85
AT1G01046           1           0         0         1         0         0
AT1G01050        1344        1336       959       935      2342      2272
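
This count table is the starting point for the downstream analysis (chapter 4), so it is convenient to store it on disk; a minimal sketch (file names are just suggestions):

## Store the merged count table for the downstream analysis
saveRDS(counts, file = "GSE149995_counts.rds")
## Or as a tab-separated text file
write.table(counts, file = "GSE149995_counts.txt", sep = "\t", quote = FALSE)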