B Annotating an RNA-Seq Experiment

This chapter describes annotate the data, meaning assigning names and functions to our Differentially Expressed Genes. This step can either be very easy or a bit more challenging depending on the data source of your project. The first example is relevant if your data set originated from the NCBI GEO, which ‘should’ always be annotated by default. If you have found your data set elsewhere (and do not have a GEO identifier for your project), skip to the Manually annotate your data section below.

The goal of this chapter is to - at least - find a Gene Symbol or common ID (NCBI/ Ensembl) for each gene. Using this information it will be much easier to find relevant information from other online sources to say something about the functionality and impact of your DEGs.

Even though you might be lucky and either already have your data or a simple GEO query results in everything you want/ need, you are tasked to also perform the manual annotation phase even if you will not use any of the found annotation data. Having written the code to access these sources is a good excercise in using complex Bioconductor packages. Also, you can always use this to find more information should you require this later on.

B.1 Downloading annotation data from GEO

Given the GSE**** id of your experiment you can download any available (annotation)data from GEO using the GEOquery library. For the example experiment used previously, the following code downloads two files; a series_matrix.txt.gz file that is also available for download from the NCBI GEO website. Note the destdir argument for storing it in a known location instead of the /tmp folder for later use.

library(GEOquery)
getGEO(GEO = 'GSE101942', destdir = "./data/")

Once downloaded, you can load the data using:

library(GEOquery)
gse <- getGEO(filename = './data/GSE101942_series_matrix.txt.gz')

The gse object may contain useful information about both samples and genes. Unfortunately for this example data set there is only information available regarding the samples which we can access in the phenoData slot. Slots in R objects are a method of storing multiple data objects into a single R object, in this case this object is called an ExpressionSet (see ?ExpressionSet for a description of its contents and structure). For storing information about an experiment this is very handy; you can store a data frame with the actual expression data, a list containing laboratory information, a chunk of text with an abstract, etc. all in a single object.

The most common data sets found in an ExpressionSet object: * @assayData: the actual expression data. Most common for microarray experiments and (very) rare for RNA-seq experiments * @phenoData: sample information, should be present for each GEO data set * @featureData: a data frame holding feature (=gene) information, slightly rare for RNA-seq data sets * @experimentData: information about the lab which performed the experiment

You already know how to access a column in a data frame using the $ notation, and accessing a complete data frame in the gse object is done using the @ symbol:

# Access a data frame containing phenotype information
gse@phenoData@data

Here you see that each sample is described using its sample identifier (GSM****), information about the the sample group (title column) and a number of characteristics. The actual information included depends on your data set.

	title	source_name_ch1	characteristics_ch1
GSM2719212	C2A	trisomic_IPSC	cell type: human induced pluripotent stem cells (iPSCs)
GSM2719213	C2B	trisomic_IPSC	cell type: human induced pluripotent stem cells (iPSCs)
GSM2719214	C2C	trisomic_IPSC	cell type: human induced pluripotent stem cells (iPSCs)
GSM2719215	C244-A	disomic_IPSC	cell type: human induced pluripotent stem cells (iPSCs)
GSM2719216	C244-B	disomic_IPSC	cell type: human induced pluripotent stem cells (iPSCs)
GSM2719217	C243	disomic_IPSC	cell type: human induced pluripotent stem cells (iPSCs)

Contents of the gse@phenoData@data information (selection) (continued below)

	characteristics_ch1.1
GSM2719212	ploidy: trisomic
GSM2719213	ploidy: trisomic
GSM2719214	ploidy: trisomic
GSM2719215	ploidy: disomic
GSM2719216	ploidy: disomic
GSM2719217	ploidy: disomic

If your data set has properly annotated gene information, this should be accessible in the @featureData slot. For instance, the data set GSE20489 does have this annotation available. Once loaded, we see that the following annotation columns are present for each gene in the featureData slot (following is a small subset, in total there can be well over 20 columns of information per gene). Also, the metadata listing shows for how many features (=genes, 54675) and samples (54) the expression data is present (in the @assayData slot):

# This gets a list with a single 'ExpressionSet' object
GSE20489 <- getGEO(GEO = "GSE20489", destdir = './data')
# Print some metadata of the featureData slot
GSE20489[[1]]

## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 54675 features, 54 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: GSM514737 GSM514738 ... GSM514790 (54 total)
##   varLabels: title geo_accession ... tissue:ch1 (39 total)
##   varMetadata: labelDescription
## featureData
##   featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (54675 total)
##   fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
##   fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
##   pubMedIds: 23883607 
## Annotation: GPL570

ID	Gene Title	Gene Symbol	ENTREZ_GENE_ID
1007_s_at	discoidin domain receptor tyrosine kinase …	DDR1 /// MIR4640	780 /// 100616237
1053_at	replication factor C (activator 1) 2, 40kDa	RFC2	5982
117_at	heat shock 70kDa protein 6 (HSP70B’)	HSPA6	3310
121_at	paired box 8	PAX8	7849
1255_g_at	guanylate cyclase activator 1A (retina)	GUCA1A	2978
1294_at	microRNA 5193 /// ubiquitin-like modifier …	MIR5193 /// UBA7	7318 /// 100847079

Gene information stored in an ExpressionSet object for experiment GSE20489 (continued below)

Gene Ontology Biological Process
0001558 // regulation of cell growth // in…
0000278 // mitotic cell cycle // traceable…
0000902 // cell morphogenesis // inferred …
0001655 // urogenital system development /…
0007165 // signal transduction // non-trac…
0006464 // cellular protein modification p…

B.2 Manual Data Annotation

Luckily for us, R offers a number of libraries to automatically retrieve information for large sets of genes, unless you have chosen an organism for which not much data is available. This section demonstrates the use of two such libraries, starting with AnnotationDbi followed by biomaRt.

The AnnotationDbi library downloads a local copy of an organism-specific database with gene information where biomaRt uses online databases to retrieve data given a query. biomaRt offers far more data (over 1000 data fields per organism) but is more complex to use.

Since biomaRt is also relying on online databases it might be a good strategy to annotate only the genes of interest (the DEGs) instead of querying for >20.000 genes while we might only retain 20 after statistical analysis. If you are planning to use biomaRt, skip to the Discovering Differentialy Expressed Genes (DEGs) chapter first and then return to the Using R Bioconductors biomaRt section to annotate the data.

B.2.1 Using `AnnotationDBI`

The AnnotationDbi offers data sets for many organisms in the form of installable libraries and depending on your experiment you need to find the proper library. The example experiment contains samples of the house mouse (Mus musculus) and therefore we select its data set.

# Load the AnnotationDbi interface library
library(AnnotationDbi)
# Load the Bioconductor installation library (contains 'biocLite()')
library(BiocInstaller)

# Install and load the organism specific gene database
# 'org' for Organism
# 'Hs' for Homo sapiens
# 'eg' for Entrez Gene IDs
# (try to load (using the library function) first before installing, it might already be present)
biocLite('org.Hs.eg.db')
library(org.Hs.eg.db)

The following information types are available in this database (use the columns function to inspect).

Columns	Columns	Columns	Columns	Columns
ACCNUM	ENZYME	GOALL	PATH	UCSCKG
ALIAS	EVIDENCE	IPI	PFAM	UNIPROT
ENSEMBL	EVIDENCEALL	MAP	PMID
ENSEMBLPROT	GENENAME	OMIM	PROSITE
ENSEMBLTRANS	GENETYPE	ONTOLOGY	REFSEQ
ENTREZID	GO	ONTOLOGYALL	SYMBOL

Available fields in the database

The table below shows the data available for all information types given a randomly chosen EntrezID of ‘1080’. Note that the table has been split to show the 24 data types with their values.

.Hs.data.table <- cbind(.Hs.data[1:12,], .Hs.data[13:24,])
colnames(.Hs.data.table) <- c('Data Type', 'Example Value',
                              'Data Type', 'Example Value')
pander(.Hs.data.table, caption="Example values for each field in 'org.Hs.eg.db'")

Data Type	Example Value	Data Type	Example Value
ACCNUM	AAA35680	GOALL	GO:0000003
ALIAS	ABC35	IPI	IPI00816721
ENSEMBL	ENSG00000001626	MAP	7q31.2
ENSEMBLPROT	NA	OMIM	167800
ENSEMBLTRANS	NA	ONTOLOGY	MF
ENTREZID	NA	ONTOLOGYALL	BP
ENZYME	3.6.3.49	PATH	02010
EVIDENCE	IDA	PFAM	PF00664
EVIDENCEALL	ISS	PMID	1284466
GENENAME	CF transmembrane conductance regulator	PROSITE	PS50929
GENETYPE	protein-coding	REFSEQ	NM_000492
GO	GO:0005254	SYMBOL	CFTR

Example values for each field in ‘org.Hs.eg.db’

Retrieving data from the locally stored annotation database can be done using the mapIds function which takes a number of arguments:

x: the local database to query
keys: the IDs from your own data set, most often you have ENSEMBL IDs or gene SYMBOLS. In the example data the rownames of the data set contain the gene SYMBOLS (see Table 1 on page 2)
column: the data column to retrieve from the database
keytype: the type of data that you provide. In the example data this is a gene SYMBOL
multiVals: a number of columns contain more then one entry for a single gene, in that case we only want to store the first one.

The output of the mapIds function is a single character vector that we can add to our data frame containing the DEGs (in this case the DEGs for the IPSC trisomic vs disomic comparison containing 828 genes stored in the ipsc.tri.vs.di data frame).

# Retrieve the ENSEMBL gene ID, e.g. 'ENSG00000001626'
ipsc.tri.vs.di$Ensembl <- mapIds(x = org.Hs.eg.db,
                           keys=row.names(ipsc.tri.vs.di),
                           column="ENSEMBL",
                           keytype="SYMBOL",
                           multiVals="first")

# Retrieve the ENTREZ gene ID, e.g. '12323'
ipsc.tri.vs.di$EntrezID <- mapIds(x = org.Hs.eg.db,
                            keys=row.names(ipsc.tri.vs.di),
                            column="ENTREZID",
                            keytype="SYMBOL",
                            multiVals="first")

# Retrieve the KEGG enzyme code, e.g. 2.7.11.17
ipsc.tri.vs.di$Enzyme <- mapIds(x = org.Hs.eg.db,
                           keys=row.names(ipsc.tri.vs.di),
                           column="ENZYME",
                           keytype="SYMBOL",
                           multiVals="first")

Unfortunately, not all organisms offer access to such information nor is the information always complete. For instance, the following table shows the number of available records for the Ensembl, Entrez en KEGG IDs downloaded:

	ENSEMBL	ENTREZ	ENZYME
Available	626	642	62
Missing	223	207	787

Statistics for annotation columns using AnnotationDbi

For other data sets you might have more luck, otherwise continue with the biomaRt method explained below.

B.2.2 Using `biomaRt`

A short explanation about biomart (source: Wikipedia):

“The purpose of the BioMarts in Ensembl Genomes is to allow the user to mine and download tables containing all the genes for a single species, genes in a specific region of a chromosome or genes on one region of a chromosome associated with an InterPro domain. The BioMarts also include filters to refine the data to be extracted and the attributes (Variant ID, Chromosome name, Ensembl ID, location, etc.) that will appear in the final table file can be selected by the user.”

The text above mentiones species, attributes and filters, and we need to combine these elements to query the biomart databases for our annotation. The following code ‘chunks’ show possible values for these elements and how to gather and store the relevant data.

The biomaRt library interfaces with the biomart.org online database. Sometimes the biomart website (which offers a browsable database) is down due to maintenance, but its many mirrors can still be used, for instance at Ensembl.

You are required to explore a number of objects and the given example code is most likely not suitable for your data/ organism. The project is well documented on the Bioconductor biomaRt website that links to the biomaRt users guide.

# Load the library
library(biomaRt)

# Use an alternative database server as the regular one sometimes has issues..
ensembl=useMart("ENSEMBL_MART_ENSEMBL", host="https://www.ensembl.org")

# Select the 'ensembl' database
ensembl <- useMart("ensembl")

Using the listDatasets function you can get a full list of available datasets. Store this list in an R object and ‘browse’ this object in RStudio to see if your organism is included. Copy the name of the dataset, this is the species element we will use.

mart.datasets <- listDatasets(ensembl)

# Select the correct dataset, for the example data we select the 'hsapiens_gene_ensembl' 
ensembl <- useDataset('hsapiens_gene_ensembl', mart = ensembl)

dataset	description	version
abrachyrhynchus_gene_ensembl	Pink-footed goose genes (ASM259213v1)	ASM259213v1
acalliptera_gene_ensembl	Eastern happy genes (fAstCal1.3)	fAstCal1.3
acarolinensis_gene_ensembl	Green anole genes (AnoCar2.0v2)	AnoCar2.0v2
acchrysaetos_gene_ensembl	Golden eagle genes (bAquChr1.2)	bAquChr1.2
acitrinellus_gene_ensembl	Midas cichlid genes (Midas_v5)	Midas_v5
amelanoleuca_gene_ensembl	Giant panda genes (ASM200744v2)	ASM200744v2

Subset of the 214 datasets available

Next is deciding on a filter. For this we can use the listFilters function on the ensembl object, storing the full list of filters. Here too it is wise to view this in RStudio to find the filter to use.

The filter specifies what you will use to search on. For instance, the AnnotationDbi queries above gave us Ensembl gene ID’s and we could use those with the ensembl_gene_id filter. When viewing the list of filters in RStudio you can use the search text-box in the top-right of the view, for example with ensembl to look for filters to use with this type of identifier.

filters <- listFilters(ensembl)

name	description
chromosome_name	Chromosome/scaffold name
start	Start
end	End
band_start	Band Start
band_end	Band End
marker_start	Marker Start
marker_end	Marker End
strand	Strand
chromosomal_region	e.g. 1:100:10000:-1, 1:100000:200000:1
with_biogrid	With BioGRID Interaction data, The General Repository for Interaction Datasets ID(s)
with_ccds	With CCDS ID(s)
with_chembl	With ChEMBL ID(s)
with_dbass3	With DataBase of Aberrant 3’ Splice Sites ID(s)
with_dbass5	With DataBase of Aberrant 5’ Splice Sites ID(s)
with_entrezgene_trans_name	With EntrezGene transcript name ID(s)

Subset of the 438 available filters

Finally we get to the point to determine what data we would like to get from the database. These are the attributes which we can get with the listAttributes function on the ensembl object. Again - and especially with the attributes since there are often >1000 of selectable options - we store the attributes and view them in RStudio to look for data that we want.

attributes <- listAttributes(ensembl)

name	description	page
ensembl_gene_id	Gene stable ID	feature_page
ensembl_gene_id_version	Gene stable ID version	feature_page
ensembl_transcript_id	Transcript stable ID	feature_page
ensembl_transcript_id_version	Transcript stable ID version	feature_page
ensembl_peptide_id	Protein stable ID	feature_page
ensembl_peptide_id_version	Protein stable ID version	feature_page
ensembl_exon_id	Exon stable ID	feature_page
description	Gene description	feature_page
chromosome_name	Chromosome/scaffold name	feature_page
start_position	Gene start (bp)	feature_page
end_position	Gene end (bp)	feature_page
strand	Strand	feature_page
band	Karyotype band	feature_page
transcript_start	Transcript start (bp)	feature_page
transcript_end	Transcript end (bp)	feature_page
transcription_start_site	Transcription start site (TSS)	feature_page

Subset of the 3163 available attributes

If we want to have the gene chromosome, start- and end-position as well as its description (note, just as an example, there is other, more interesting information available too!) we combine this in a character vector (attrs.get, see below). There is one caveat though, the order in which you get back the results are not the same as the input order! This means that we cannot simple combine the data with our original data set but we need to merge it together. However, to be able to merge the data we need to know which record belongs to which gene and therefore we add our selected filter (in our case the ensembl_gene_id) to the list of attributes to get. Then, when we get the results dataframe we can use the merge function in R to combine the gene information with the actual data:

merge(x = ipsc.tri.vs.di, y = results, by.x = 'Ensembl', by.y = 'ensembl_gene_id')

We now have all three needed elements (species, filter and attributes) so we are ready to query the database with the getBM function (read the help using ?getBM).

The example below retrieves data using a set of five Ensembl gene IDs since not all genes have an Ensembl ID as we’ve seen above, so we filter those out first and use this as the values parameter below.

# Set the 'attributes' values
attrs.get <- c("ensembl_gene_id", "chromosome_name", 
               "start_position","end_position", "description")

# Perform a biomaRt query using 'getBM'
results <- getBM(attributes = attrs.get,
                 filters = "ensembl_gene_id",
                 values = ipsc.tri.vs.di$Ensembl[1:5], 
                 mart = ensembl)

results$gene_length <- abs(results$end_position - results$start_position)

The results object is a data.frame with 5 columns that we can merge with our data set giving us the following annotation columns (combined from the AnnotationDBI and biomaRt libraries).

This was just an example on how to use the biomaRt library and it comes down to selecting the correct filter and looking for interesting attributes to retrieve. Further information can be found in the documentation avaialble with `vignette(‘biomaRt’)

al., Patrick Deelen et. 2015. “Calling Genotypes from Public RNA-Sequencing Data Enables Identification of Genetic Variants That Affect Gene-Expression Levels.” Genome Medicine 7 (30).

Chen, Yunshun, Aaron Lun, Davis McCarthy, Xiaobei Zhou, Mark Robinson, and Gordon Smyth. 2018. edgeR: Empirical Analysis of Digital Gene Expression Data in r. http://bioinf.wehi.edu.au/edgeR.

Daróczi, Gergely, and Roman Tsegelskyi. 2017. Pander: An r ’Pandoc’ Writer. https://CRAN.R-project.org/package=pander.

Love, Michael, Simon Anders, and Wolfgang Huber. 2017. DESeq2: Differential Gene Expression Analysis Based on the Negative Binomial Distribution. https://github.com/mikelove/DESeq2.

B Annotating an RNA-Seq Experiment

B.1 Downloading annotation data from GEO

B.2 Manual Data Annotation

B.2.1 Using AnnotationDBI

B.2.2 Using biomaRt

B.2.1 Using `AnnotationDBI`

B.2.2 Using `biomaRt`