C Annotating an RNA-Seq Experiment

This chapter describes annotate the data, meaning assigning names and functions to our Differentially Expressed Genes. This step can either be very easy or a bit more challenging depending on the data source of your project. The first example is relevant if your data set originated from the NCBI GEO, which ‘should’ always be annotated by default. If you have found your data set elsewhere (and do not have a GEO identifier for your project), skip to the Manually annotate your data section below.

The goal of this chapter is to - at least - find a Gene Symbol or common ID (NCBI/ Ensembl) for each gene. Using this information it will be much easier to find relevant information from other online sources to say something about the functionality and impact of your DEGs.

Even though you might be lucky and either already have your data or a simple GEO query results in everything you want/ need, you are tasked to also perform the manual annotation phase even if you will not use any of the found annotation data. Having written the code to access these sources is a good excercise in using complex Bioconductor packages. Also, you can always use this to find more information should you require this later on.

C.1 Downloading annotation data from GEO

Given the GSE**** id of your experiment you can download any available (annotation)data from GEO using the GEOquery library. For the example experiment used previously, the following code downloads two files; a series_matrix.txt.gz file that is also available for download from the NCBI GEO website. Note the destdir argument for storing it in a known location instead of the /tmp folder for later use.

library(GEOquery)
getGEO(GEO = 'GSE101942', destdir = "./data/")

Once downloaded, you can load the data using:

library(GEOquery)
gse <- getGEO(filename = './data/GSE101942_series_matrix.txt.gz')

The gse object may contain useful information about both samples and genes. Unfortunately for this example data set there is only information available regarding the samples which we can access in the phenoData slot. Slots in R objects are a method of storing multiple data objects into a single R object, in this case this object is called an ExpressionSet (see ?ExpressionSet for a description of its contents and structure). For storing information about an experiment this is very handy; you can store a data frame with the actual expression data, a list containing laboratory information, a chunk of text with an abstract, etc. all in a single object.

The most common data sets found in an ExpressionSet object: * @assayData: the actual expression data. Most common for microarray experiments and (very) rare for RNA-seq experiments * @phenoData: sample information, should be present for each GEO data set * @featureData: a data frame holding feature (=gene) information, slightly rare for RNA-seq data sets * @experimentData: information about the lab which performed the experiment

You already know how to access a column in a data frame using the $ notation, and accessing a complete data frame in the gse object is done using the @ symbol:

# Access a data frame containing phenotype information
gse@phenoData@data

Here you see that each sample is described using its sample identifier (GSM****), information about the the sample group (title column) and a number of characteristics. The actual information included depends on your data set.

  title source_name_ch1 characteristics_ch1
GSM2719212 C2A trisomic_IPSC cell type: human induced pluripotent stem cells (iPSCs)
GSM2719213 C2B trisomic_IPSC cell type: human induced pluripotent stem cells (iPSCs)
GSM2719214 C2C trisomic_IPSC cell type: human induced pluripotent stem cells (iPSCs)
GSM2719215 C244-A disomic_IPSC cell type: human induced pluripotent stem cells (iPSCs)
GSM2719216 C244-B disomic_IPSC cell type: human induced pluripotent stem cells (iPSCs)
GSM2719217 C243 disomic_IPSC cell type: human induced pluripotent stem cells (iPSCs)

Contents of the gse@phenoData@data information (selection) (continued below)

  characteristics_ch1.1
GSM2719212 ploidy: trisomic
GSM2719213 ploidy: trisomic
GSM2719214 ploidy: trisomic
GSM2719215 ploidy: disomic
GSM2719216 ploidy: disomic
GSM2719217 ploidy: disomic

If your data set has properly annotated gene information, this should be accessible in the @featureData slot. For instance, the data set GSE20489 does have this annotation available. Once loaded, we see that the following annotation columns are present for each gene in the featureData slot (following is a small subset, in total there can be well over 20 columns of information per gene). Also, the metadata listing shows for how many features (=genes, 54675) and samples (54) the expression data is present (in the @assayData slot):

# This gets a list with a single 'ExpressionSet' object
GSE20489 <- getGEO(GEO = "GSE20489", destdir = './data')
# Print some metadata of the featureData slot
GSE20489[[1]]
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 54675 features, 54 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM514737 GSM514738 ... GSM514790 (54 total)
##   varLabels: title geo_accession ... tissue:ch1 (39 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (54675 total)
##   fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
##   fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 23883607 
## Annotation: GPL570
ID Gene Title Gene Symbol ENTREZ_GENE_ID
1007_s_at discoidin domain receptor tyrosine kinase … DDR1 /// MIR4640 780 /// 100616237
1053_at replication factor C (activator 1) 2, 40kDa RFC2 5982
117_at heat shock 70kDa protein 6 (HSP70B’) HSPA6 3310
121_at paired box 8 PAX8 7849
1255_g_at guanylate cyclase activator 1A (retina) GUCA1A 2978
1294_at microRNA 5193 /// ubiquitin-like modifier … MIR5193 /// UBA7 7318 /// 100847079

Gene information stored in an ExpressionSet object for experiment GSE20489 (continued below)

Gene Ontology Biological Process
0001558 // regulation of cell growth // in…
0000278 // mitotic cell cycle // traceable…
0000902 // cell morphogenesis // inferred …
0001655 // urogenital system development /…
0007165 // signal transduction // non-trac…
0006464 // cellular protein modification p…

C.2 Manual Data Annotation

Luckily for us, R offers a number of libraries to automatically retrieve information for large sets of genes, unless you have chosen an organism for which not much data is available. This section demonstrates the use of two such libraries, starting with AnnotationDbi followed by biomaRt.

The AnnotationDbi library downloads a local copy of an organism-specific database with gene information where biomaRt uses online databases to retrieve data given a query. biomaRt offers far more data (over 1000 data fields per organism) but is more complex to use.

Since biomaRt is also relying on online databases it might be a good strategy to annotate only the genes of interest (the DEGs) instead of querying for >20.000 genes while we might only retain 20 after statistical analysis. If you are planning to use biomaRt, skip to the Discovering Differentialy Expressed Genes (DEGs) chapter first and then return to the Using R Bioconductors biomaRt section to annotate the data.

C.2.1 Using AnnotationDBI

The AnnotationDbi offers data sets for many organisms in the form of installable libraries and depending on your experiment you need to find the proper library. The example experiment contains samples of the house mouse (Mus musculus) and therefore we select its data set.

# Load the AnnotationDbi interface library
library(AnnotationDbi)
# Load the Bioconductor installation library (contains 'biocLite()')
library(BiocInstaller)

# Install and load the organism specific gene database
# 'org' for Organism
# 'Hs' for Homo sapiens
# 'eg' for Entrez Gene IDs
# (try to load (using the library function) first before installing, it might already be present)
biocLite('org.Hs.eg.db')
library(org.Hs.eg.db)

The following information types are available in this database (use the columns function to inspect).

Columns Columns Columns Columns Columns
ACCNUM ENZYME GOALL PATH UCSCKG
ALIAS EVIDENCE IPI PFAM UNIPROT
ENSEMBL EVIDENCEALL MAP PMID
ENSEMBLPROT GENENAME OMIM PROSITE
ENSEMBLTRANS GENETYPE ONTOLOGY REFSEQ
ENTREZID GO ONTOLOGYALL SYMBOL

Available fields in the database

The table below shows the data available for all information types given a randomly chosen EntrezID of ‘1080’. Note that the table has been split to show the 24 data types with their values.

.Hs.data.table <- cbind(.Hs.data[1:12,], .Hs.data[13:24,])
colnames(.Hs.data.table) <- c('Data Type', 'Example Value',
                              'Data Type', 'Example Value')
pander(.Hs.data.table, caption="Example values for each field in 'org.Hs.eg.db'")
Data Type Example Value Data Type Example Value
ACCNUM AAA35680 GOALL GO:0000003
ALIAS ABC35 IPI IPI00815998
ENSEMBL ENSG00000001626 MAP 7q31.2
ENSEMBLPROT NA OMIM 167800
ENSEMBLTRANS NA ONTOLOGY MF
ENTREZID NA ONTOLOGYALL BP
ENZYME 3.6.3.49 PATH 02010
EVIDENCE IDA PFAM PF14396
EVIDENCEALL ISS PMID 1284466
GENENAME CF transmembrane conductance regulator PROSITE PS00211
GENETYPE protein-coding REFSEQ NM_000492
GO GO:0005254 SYMBOL CFTR

Example values for each field in ‘org.Hs.eg.db’

Retrieving data from the locally stored annotation database can be done using the mapIds function which takes a number of arguments:

  • x: the local database to query
  • keys: the IDs from your own data set, most often you have ENSEMBL IDs or gene SYMBOLS. In the example data the rownames of the data set contain the gene SYMBOLS (see Table 1 on page 2)
  • column: the data column to retrieve from the database
  • keytype: the type of data that you provide. In the example data this is a gene SYMBOL
  • multiVals: a number of columns contain more then one entry for a single gene, in that case we only want to store the first one.

The output of the mapIds function is a single character vector that we can add to our data frame containing the DEGs (in this case the DEGs for the IPSC trisomic vs disomic comparison containing 828 genes stored in the ipsc.tri.vs.di data frame).

# Retrieve the ENSEMBL gene ID, e.g. 'ENSG00000001626'
ipsc.tri.vs.di$Ensembl <- mapIds(x = org.Hs.eg.db,
                           keys=row.names(ipsc.tri.vs.di),
                           column="ENSEMBL",
                           keytype="SYMBOL",
                           multiVals="first")

# Retrieve the ENTREZ gene ID, e.g. '12323'
ipsc.tri.vs.di$EntrezID <- mapIds(x = org.Hs.eg.db,
                            keys=row.names(ipsc.tri.vs.di),
                            column="ENTREZID",
                            keytype="SYMBOL",
                            multiVals="first")

# Retrieve the KEGG enzyme code, e.g. 2.7.11.17
ipsc.tri.vs.di$Enzyme <- mapIds(x = org.Hs.eg.db,
                           keys=row.names(ipsc.tri.vs.di),
                           column="ENZYME",
                           keytype="SYMBOL",
                           multiVals="first")

Unfortunately, not all organisms offer access to such information nor is the information always complete. For instance, the following table shows the number of available records for the Ensembl, Entrez en KEGG IDs downloaded:

  ENSEMBL ENTREZ ENZYME
Available 628 644 62
Missing 221 205 787

Statistics for annotation columns using AnnotationDbi

For other data sets you might have more luck, otherwise continue with the biomaRt method explained below.

C.2.2 Using biomaRt

A short explanation about biomart (source: Wikipedia):

“The purpose of the BioMarts in Ensembl Genomes is to allow the user to mine and download tables containing all the genes for a single species, genes in a specific region of a chromosome or genes on one region of a chromosome associated with an InterPro domain. The BioMarts also include filters to refine the data to be extracted and the attributes (Variant ID, Chromosome name, Ensembl ID, location, etc.) that will appear in the final table file can be selected by the user.”

The text above mentiones species, attributes and filters, and we need to combine these elements to query the biomart databases for our annotation. The following code ‘chunks’ show possible values for these elements and how to gather and store the relevant data.

The biomaRt library interfaces with the biomart.org online database. Sometimes the biomart website (which offers a browsable database) is down due to maintenance, but its many mirrors can still be used, for instance at Ensembl.

You are required to explore a number of objects and the given example code is most likely not suitable for your data/ organism. The project is well documented on the Bioconductor biomaRt website that links to the biomaRt users guide.

# Load the library
library(biomaRt)

# Use an alternative database server as the regular one sometimes has issues..
ensembl=useMart("ENSEMBL_MART_ENSEMBL", host="https://www.ensembl.org")
# Select the 'ensembl' database
ensembl <- useMart("ensembl")

Using the listDatasets function you can get a full list of available datasets. Store this list in an R object and ‘browse’ this object in RStudio to see if your organism is included. Copy the name of the dataset, this is the species element we will use.

mart.datasets <- listDatasets(ensembl)

# Select the correct dataset, for the example data we select the 'hsapiens_gene_ensembl' 
ensembl <- useDataset('hsapiens_gene_ensembl', mart = ensembl)
dataset description version
abrachyrhynchus_gene_ensembl Pink-footed goose genes (ASM259213v1) ASM259213v1
acalliptera_gene_ensembl Eastern happy genes (fAstCal1.2) fAstCal1.2
acarolinensis_gene_ensembl Green anole genes (AnoCar2.0v2) AnoCar2.0v2
acchrysaetos_gene_ensembl Golden eagle genes (bAquChr1.2) bAquChr1.2
acitrinellus_gene_ensembl Midas cichlid genes (Midas_v5) Midas_v5
amelanoleuca_gene_ensembl Giant panda genes (ASM200744v2) ASM200744v2

Subset of the 214 datasets available

Next is deciding on a filter. For this we can use the listFilters function on the ensembl object, storing the full list of filters. Here too it is wise to view this in RStudio to find the filter to use.

The filter specifies what you will use to search on. For instance, the AnnotationDbi queries above gave us Ensembl gene ID’s and we could use those with the ensembl_gene_id filter. When viewing the list of filters in RStudio you can use the search text-box in the top-right of the view, for example with ensembl to look for filters to use with this type of identifier.

filters <- listFilters(ensembl)
name description
chromosome_name Chromosome/scaffold name
start Start
end End
band_start Band Start
band_end Band End
marker_start Marker Start
marker_end Marker End
strand Strand
chromosomal_region e.g. 1:100:10000:-1, 1:100000:200000:1
with_biogrid With BioGRID Interaction data, The General Repository for Interaction Datasets ID(s)
with_ccds With CCDS ID(s)
with_chembl With ChEMBL ID(s)
with_dbass3 With DataBase of Aberrant 3’ Splice Sites ID(s)
with_dbass5 With DataBase of Aberrant 5’ Splice Sites ID(s)
with_entrezgene_trans_name With EntrezGene transcript name ID(s)

Subset of the 437 available filters

Finally we get to the point to determine what data we would like to get from the database. These are the attributes which we can get with the listAttributes function on the ensembl object. Again - and especially with the attributes since there are often >1000 of selectable options - we store the attributes and view them in RStudio to look for data that we want.

attributes <- listAttributes(ensembl)
name description page
ensembl_gene_id Gene stable ID feature_page
ensembl_gene_id_version Gene stable ID version feature_page
ensembl_transcript_id Transcript stable ID feature_page
ensembl_transcript_id_version Transcript stable ID version feature_page
ensembl_peptide_id Protein stable ID feature_page
ensembl_peptide_id_version Protein stable ID version feature_page
ensembl_exon_id Exon stable ID feature_page
description Gene description feature_page
chromosome_name Chromosome/scaffold name feature_page
start_position Gene start (bp) feature_page
end_position Gene end (bp) feature_page
strand Strand feature_page
band Karyotype band feature_page
transcript_start Transcript start (bp) feature_page
transcript_end Transcript end (bp) feature_page
transcription_start_site Transcription start site (TSS) feature_page

Subset of the 3172 available attributes

If we want to have the gene chromosome, start- and end-position as well as its description (note, just as an example, there is other, more interesting information available too!) we combine this in a character vector (attrs.get, see below). There is one caveat though, the order in which you get back the results are not the same as the input order! This means that we cannot simple combine the data with our original data set but we need to merge it together. However, to be able to merge the data we need to know which record belongs to which gene and therefore we add our selected filter (in our case the ensembl_gene_id) to the list of attributes to get. Then, when we get the results dataframe we can use the merge function in R to combine the gene information with the actual data:

merge(x = ipsc.tri.vs.di, y = results, by.x = 'Ensembl', by.y = 'ensembl_gene_id')

We now have all three needed elements (species, filter and attributes) so we are ready to query the database with the getBM function (read the help using ?getBM).

The example below retrieves data using a set of five Ensembl gene IDs since not all genes have an Ensembl ID as we’ve seen above, so we filter those out first and use this as the values parameter below.

# Set the 'attributes' values
attrs.get <- c("ensembl_gene_id", "chromosome_name", 
               "start_position","end_position", "description")

# Perform a biomaRt query using 'getBM'
results <- getBM(attributes = attrs.get,
                 filters = "ensembl_gene_id",
                 values = ipsc.tri.vs.di$Ensembl[1:5], 
                 mart = ensembl)

results$gene_length <- abs(results$end_position - results$start_position)

The results object is a data.frame with 5 columns that we can merge with our data set giving us the following annotation columns (combined from the AnnotationDBI and biomaRt libraries).

This was just an example on how to use the biomaRt library and it comes down to selecting the correct filter and looking for interesting attributes to retrieve. Further information can be found in the documentation avaialble with `vignette(‘biomaRt’)

al., Patrick Deelen et. 2015. “Calling Genotypes from Public RNA-Sequencing Data Enables Identification of Genetic Variants That Affect Gene-Expression Levels.” Genome Medicine 7 (30).
Chen, Yunshun, Aaron Lun, Davis McCarthy, Xiaobei Zhou, Mark Robinson, and Gordon Smyth. 2018. edgeR: Empirical Analysis of Digital Gene Expression Data in r. http://bioinf.wehi.edu.au/edgeR.
Daróczi, Gergely, and Roman Tsegelskyi. 2017. Pander: An r ’Pandoc’ Writer. https://CRAN.R-project.org/package=pander.
Love, Michael, Simon Anders, and Wolfgang Huber. 2017. DESeq2: Differential Gene Expression Analysis Based on the Negative Binomial Distribution. https://github.com/mikelove/DESeq2.