C Annotating an RNA-Seq Experiment
This chapter describes annotate the data, meaning assigning names and functions to our Differentially Expressed Genes. This step can either be very easy or a bit more challenging depending on the data source of your project. The first example is relevant if your data set originated from the NCBI GEO, which ‘should’ always be annotated by default. If you have found your data set elsewhere (and do not have a GEO identifier for your project), skip to the Manually annotate your data section below.
The goal of this chapter is to - at least - find a Gene Symbol or common ID (NCBI/ Ensembl) for each gene. Using this information it will be much easier to find relevant information from other online sources to say something about the functionality and impact of your DEGs.
Even though you might be lucky and either already have your data or a simple GEO query results in everything you want/ need, you are tasked to also perform the manual annotation phase even if you will not use any of the found annotation data. Having written the code to access these sources is a good excercise in using complex Bioconductor packages. Also, you can always use this to find more information should you require this later on.
C.1 Downloading annotation data from GEO
Given the GSE****
id of your experiment you can download any available (annotation)data from GEO using the GEOquery
library. For the example experiment used previously, the following code downloads two files; a series_matrix.txt.gz
file that is also available for download from the NCBI GEO website. Note the destdir
argument for storing it in a known location instead of the /tmp
folder for later use.
Once downloaded, you can load the data using:
The gse
object may contain useful information about both samples and genes. Unfortunately for this example data set there is only information available regarding the samples which we can access in the phenoData
slot. Slots in R objects are a method of storing multiple data objects into a single R object, in this case this object is called an ExpressionSet
(see ?ExpressionSet
for a description of its contents and structure). For storing information about an experiment this is very handy; you can store a data frame with the actual expression data, a list containing laboratory information, a chunk of text with an abstract, etc. all in a single object.
The most common data sets found in an ExpressionSet
object:
* @assayData
: the actual expression data. Most common for microarray experiments and (very) rare for RNA-seq experiments
* @phenoData
: sample information, should be present for each GEO data set
* @featureData
: a data frame holding feature (=gene) information, slightly rare for RNA-seq data sets
* @experimentData
: information about the lab which performed the experiment
You already know how to access a column in a data frame using the $
notation, and accessing a complete data frame in the gse
object is done using the @
symbol:
# Access a data frame containing phenotype information
gse@phenoData@data
Here you see that each sample is described using its sample identifier (GSM****
), information about the the sample group (title
column) and a number of characteristics. The actual information included depends on your data set.
title | source_name_ch1 | characteristics_ch1 | |
---|---|---|---|
GSM2719212 | C2A | trisomic_IPSC | cell type: human induced pluripotent stem cells (iPSCs) |
GSM2719213 | C2B | trisomic_IPSC | cell type: human induced pluripotent stem cells (iPSCs) |
GSM2719214 | C2C | trisomic_IPSC | cell type: human induced pluripotent stem cells (iPSCs) |
GSM2719215 | C244-A | disomic_IPSC | cell type: human induced pluripotent stem cells (iPSCs) |
GSM2719216 | C244-B | disomic_IPSC | cell type: human induced pluripotent stem cells (iPSCs) |
GSM2719217 | C243 | disomic_IPSC | cell type: human induced pluripotent stem cells (iPSCs) |
Contents of the gse@phenoData@data
information (selection) (continued below)
characteristics_ch1.1 | |
---|---|
GSM2719212 | ploidy: trisomic |
GSM2719213 | ploidy: trisomic |
GSM2719214 | ploidy: trisomic |
GSM2719215 | ploidy: disomic |
GSM2719216 | ploidy: disomic |
GSM2719217 | ploidy: disomic |
If your data set has properly annotated gene information, this should be accessible in the @featureData
slot. For instance, the data set GSE20489
does have this annotation available. Once loaded, we see that the following annotation columns are present for each gene in the featureData
slot (following is a small subset, in total there can be well over 20 columns of information per gene). Also, the metadata listing shows for how many features (=genes, 54675) and samples (54) the expression data is present (in the @assayData
slot):
# This gets a list with a single 'ExpressionSet' object
GSE20489 <- getGEO(GEO = "GSE20489", destdir = './data')
# Print some metadata of the featureData slot
GSE20489[[1]]
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 54675 features, 54 samples
## element names: exprs
## protocolData: none
## phenoData
## sampleNames: GSM514737 GSM514738 ... GSM514790 (54 total)
## varLabels: title geo_accession ... tissue:ch1 (39 total)
## varMetadata: labelDescription
## featureData
## featureNames: 1007_s_at 1053_at ... AFFX-TrpnX-M_at (54675 total)
## fvarLabels: ID GB_ACC ... Gene Ontology Molecular Function (16 total)
## fvarMetadata: Column Description labelDescription
## experimentData: use 'experimentData(object)'
## pubMedIds: 23883607
## Annotation: GPL570
ID | Gene Title | Gene Symbol | ENTREZ_GENE_ID |
---|---|---|---|
1007_s_at | discoidin domain receptor tyrosine kinase … | DDR1 /// MIR4640 | 780 /// 100616237 |
1053_at | replication factor C (activator 1) 2, 40kDa | RFC2 | 5982 |
117_at | heat shock 70kDa protein 6 (HSP70B’) | HSPA6 | 3310 |
121_at | paired box 8 | PAX8 | 7849 |
1255_g_at | guanylate cyclase activator 1A (retina) | GUCA1A | 2978 |
1294_at | microRNA 5193 /// ubiquitin-like modifier … | MIR5193 /// UBA7 | 7318 /// 100847079 |
Gene information stored in an ExpressionSet
object for experiment GSE20489 (continued below)
Gene Ontology Biological Process |
---|
0001558 // regulation of cell growth // in… |
0000278 // mitotic cell cycle // traceable… |
0000902 // cell morphogenesis // inferred … |
0001655 // urogenital system development /… |
0007165 // signal transduction // non-trac… |
0006464 // cellular protein modification p… |
C.2 Manual Data Annotation
Luckily for us, R offers a number of libraries to automatically retrieve information for large sets of genes, unless you have chosen an organism for which not much data is available. This section demonstrates the use of two such libraries, starting with AnnotationDbi
followed by biomaRt
.
The AnnotationDbi
library downloads a local copy of an organism-specific database with gene information where biomaRt
uses online databases to retrieve data given a query
. biomaRt
offers far more data (over 1000 data fields per organism) but is more complex to use.
Since biomaRt
is also relying on online databases it might be a good strategy to annotate only the genes of interest (the DEGs) instead of querying for >20.000 genes while we might only retain 20 after statistical analysis. If you are planning to use biomaRt
, skip to the Discovering Differentialy Expressed Genes (DEGs) chapter first and then return to the Using R Bioconductors biomaRt
section to annotate the data.
C.2.1 Using AnnotationDBI
The AnnotationDbi
offers data sets for many organisms in the form of installable libraries and depending on your experiment you need to find the proper library. The example experiment contains samples of the house mouse (Mus musculus) and therefore we select its data set.
# Load the AnnotationDbi interface library
library(AnnotationDbi)
# Load the Bioconductor installation library (contains 'biocLite()')
library(BiocInstaller)
# Install and load the organism specific gene database
# 'org' for Organism
# 'Hs' for Homo sapiens
# 'eg' for Entrez Gene IDs
# (try to load (using the library function) first before installing, it might already be present)
biocLite('org.Hs.eg.db')
library(org.Hs.eg.db)
The following information types are available in this database (use the columns
function to inspect).
Columns | Columns | Columns | Columns | Columns |
---|---|---|---|---|
ACCNUM | ENZYME | GOALL | PATH | UCSCKG |
ALIAS | EVIDENCE | IPI | PFAM | UNIPROT |
ENSEMBL | EVIDENCEALL | MAP | PMID | |
ENSEMBLPROT | GENENAME | OMIM | PROSITE | |
ENSEMBLTRANS | GENETYPE | ONTOLOGY | REFSEQ | |
ENTREZID | GO | ONTOLOGYALL | SYMBOL |
Available fields in the database
The table below shows the data available for all information types given a randomly chosen EntrezID of ‘1080’. Note that the table has been split to show the 24 data types with their values.
.Hs.data.table <- cbind(.Hs.data[1:12,], .Hs.data[13:24,])
colnames(.Hs.data.table) <- c('Data Type', 'Example Value',
'Data Type', 'Example Value')
pander(.Hs.data.table, caption="Example values for each field in 'org.Hs.eg.db'")
Data Type | Example Value | Data Type | Example Value |
---|---|---|---|
ACCNUM | AAA35680 | GOALL | GO:0000003 |
ALIAS | ABC35 | IPI | IPI00815998 |
ENSEMBL | ENSG00000001626 | MAP | 7q31.2 |
ENSEMBLPROT | NA | OMIM | 167800 |
ENSEMBLTRANS | NA | ONTOLOGY | MF |
ENTREZID | NA | ONTOLOGYALL | BP |
ENZYME | 3.6.3.49 | PATH | 02010 |
EVIDENCE | IDA | PFAM | PF14396 |
EVIDENCEALL | ISS | PMID | 1284466 |
GENENAME | CF transmembrane conductance regulator | PROSITE | PS00211 |
GENETYPE | protein-coding | REFSEQ | NM_000492 |
GO | GO:0005254 | SYMBOL | CFTR |
Example values for each field in ‘org.Hs.eg.db’
Retrieving data from the locally stored annotation database can be done using the mapIds
function which takes a number of arguments:
-
x
: the local database to query -
keys
: the IDs from your own data set, most often you haveENSEMBL
IDs or geneSYMBOLS
. In the example data therownames
of the data set contain the geneSYMBOLS
(see Table 1 on page 2) -
column
: the data column to retrieve from the database -
keytype
: the type of data that you provide. In the example data this is a geneSYMBOL
-
multiVals
: a number of columns contain more then one entry for a single gene, in that case we only want to store the first one.
The output of the mapIds
function is a single character vector that we can add to our data frame containing the DEGs (in this case the DEGs for the IPSC trisomic vs disomic comparison containing 828 genes stored in the ipsc.tri.vs.di
data frame).
# Retrieve the ENSEMBL gene ID, e.g. 'ENSG00000001626'
ipsc.tri.vs.di$Ensembl <- mapIds(x = org.Hs.eg.db,
keys=row.names(ipsc.tri.vs.di),
column="ENSEMBL",
keytype="SYMBOL",
multiVals="first")
# Retrieve the ENTREZ gene ID, e.g. '12323'
ipsc.tri.vs.di$EntrezID <- mapIds(x = org.Hs.eg.db,
keys=row.names(ipsc.tri.vs.di),
column="ENTREZID",
keytype="SYMBOL",
multiVals="first")
# Retrieve the KEGG enzyme code, e.g. 2.7.11.17
ipsc.tri.vs.di$Enzyme <- mapIds(x = org.Hs.eg.db,
keys=row.names(ipsc.tri.vs.di),
column="ENZYME",
keytype="SYMBOL",
multiVals="first")
Unfortunately, not all organisms offer access to such information nor is the information always complete. For instance, the following table shows the number of available records for the Ensembl, Entrez en KEGG IDs downloaded:
ENSEMBL | ENTREZ | ENZYME | |
---|---|---|---|
Available | 628 | 644 | 62 |
Missing | 221 | 205 | 787 |
Statistics for annotation columns using AnnotationDbi
For other data sets you might have more luck, otherwise continue with the biomaRt
method explained below.
C.2.2 Using biomaRt
A short explanation about biomart (source: Wikipedia):
“The purpose of the BioMarts in Ensembl Genomes is to allow the user to mine and download tables containing all the genes for a single species, genes in a specific region of a chromosome or genes on one region of a chromosome associated with an InterPro domain. The BioMarts also include filters to refine the data to be extracted and the attributes (Variant ID, Chromosome name, Ensembl ID, location, etc.) that will appear in the final table file can be selected by the user.”
The text above mentiones species, attributes and filters, and we need to combine these elements to query the biomart databases for our annotation. The following code ‘chunks’ show possible values for these elements and how to gather and store the relevant data.
The biomaRt
library interfaces with the biomart.org online database. Sometimes the biomart website (which offers a browsable database) is down due to maintenance, but its many mirrors can still be used, for instance at Ensembl.
You are required to explore a number of objects and the given example code is most likely not suitable for your data/ organism. The project is well documented on the Bioconductor biomaRt website that links to the biomaRt users guide.
# Load the library
library(biomaRt)
# Use an alternative database server as the regular one sometimes has issues..
ensembl=useMart("ENSEMBL_MART_ENSEMBL", host="https://www.ensembl.org")
# Select the 'ensembl' database
ensembl <- useMart("ensembl")
Using the listDatasets
function you can get a full list of available datasets. Store this list in an R object and ‘browse’ this object in RStudio to see if your organism is included. Copy the name of the dataset, this is the species element we will use.
mart.datasets <- listDatasets(ensembl)
# Select the correct dataset, for the example data we select the 'hsapiens_gene_ensembl'
ensembl <- useDataset('hsapiens_gene_ensembl', mart = ensembl)
dataset | description | version |
---|---|---|
abrachyrhynchus_gene_ensembl | Pink-footed goose genes (ASM259213v1) | ASM259213v1 |
acalliptera_gene_ensembl | Eastern happy genes (fAstCal1.2) | fAstCal1.2 |
acarolinensis_gene_ensembl | Green anole genes (AnoCar2.0v2) | AnoCar2.0v2 |
acchrysaetos_gene_ensembl | Golden eagle genes (bAquChr1.2) | bAquChr1.2 |
acitrinellus_gene_ensembl | Midas cichlid genes (Midas_v5) | Midas_v5 |
amelanoleuca_gene_ensembl | Giant panda genes (ASM200744v2) | ASM200744v2 |
Subset of the 214 datasets available
Next is deciding on a filter. For this we can use the listFilters
function on the ensembl
object, storing the full list of filters. Here too it is wise to view this in RStudio to find the filter to use.
The filter specifies what you will use to search on. For instance, the AnnotationDbi
queries above gave us Ensembl
gene ID’s and we could use those with the ensembl_gene_id
filter. When viewing the list of filters in RStudio you can use the search text-box in the top-right of the view, for example with ensembl
to look for filters to use with this type of identifier.
filters <- listFilters(ensembl)
name | description |
---|---|
chromosome_name | Chromosome/scaffold name |
start | Start |
end | End |
band_start | Band Start |
band_end | Band End |
marker_start | Marker Start |
marker_end | Marker End |
strand | Strand |
chromosomal_region | e.g. 1:100:10000:-1, 1:100000:200000:1 |
with_biogrid | With BioGRID Interaction data, The General Repository for Interaction Datasets ID(s) |
with_ccds | With CCDS ID(s) |
with_chembl | With ChEMBL ID(s) |
with_dbass3 | With DataBase of Aberrant 3’ Splice Sites ID(s) |
with_dbass5 | With DataBase of Aberrant 5’ Splice Sites ID(s) |
with_entrezgene_trans_name | With EntrezGene transcript name ID(s) |
Subset of the 437 available filters
Finally we get to the point to determine what data we would like to get from the database. These are the attributes
which we can get with the listAttributes
function on the ensembl
object. Again - and especially with the attributes since there are often >1000 of selectable options - we store the attributes and view them in RStudio to look for data that we want.
attributes <- listAttributes(ensembl)
name | description | page |
---|---|---|
ensembl_gene_id | Gene stable ID | feature_page |
ensembl_gene_id_version | Gene stable ID version | feature_page |
ensembl_transcript_id | Transcript stable ID | feature_page |
ensembl_transcript_id_version | Transcript stable ID version | feature_page |
ensembl_peptide_id | Protein stable ID | feature_page |
ensembl_peptide_id_version | Protein stable ID version | feature_page |
ensembl_exon_id | Exon stable ID | feature_page |
description | Gene description | feature_page |
chromosome_name | Chromosome/scaffold name | feature_page |
start_position | Gene start (bp) | feature_page |
end_position | Gene end (bp) | feature_page |
strand | Strand | feature_page |
band | Karyotype band | feature_page |
transcript_start | Transcript start (bp) | feature_page |
transcript_end | Transcript end (bp) | feature_page |
transcription_start_site | Transcription start site (TSS) | feature_page |
Subset of the 3172 available attributes
If we want to have the gene chromosome, start- and end-position as well as its description (note, just as an example, there is other, more interesting information available too!) we combine this in a character vector (attrs.get
, see below). There is one caveat though, the order in which you get back the results are not the same as the input order! This means that we cannot simple combine the data with our original data set but we need to merge it together. However, to be able to merge the data we need to know which record belongs to which gene and therefore we add our selected filter
(in our case the ensembl_gene_id
) to the list of attributes to get. Then, when we get the results dataframe
we can use the merge
function in R to combine the gene information with the actual data:
merge(x = ipsc.tri.vs.di, y = results, by.x = 'Ensembl', by.y = 'ensembl_gene_id')
We now have all three needed elements (species, filter and attributes) so we are ready to query the database with the getBM
function (read the help using ?getBM
).
The example below retrieves data using a set of five Ensembl gene IDs since not all genes have an Ensembl ID as we’ve seen above, so we filter those out first and use this as the values
parameter below.
# Set the 'attributes' values
attrs.get <- c("ensembl_gene_id", "chromosome_name",
"start_position","end_position", "description")
# Perform a biomaRt query using 'getBM'
results <- getBM(attributes = attrs.get,
filters = "ensembl_gene_id",
values = ipsc.tri.vs.di$Ensembl[1:5],
mart = ensembl)
results$gene_length <- abs(results$end_position - results$start_position)
The results
object is a data.frame
with 5 columns that we can merge with our data set giving us the following annotation columns (combined from the AnnotationDBI
and biomaRt
libraries).
This was just an example on how to use the biomaRt
library and it comes down to selecting the correct filter and looking for interesting attributes to retrieve. Further information can be found in the documentation avaialble with `vignette(‘biomaRt’)