# Data Analysis and Visualization

At this point we have one or more sets of DEGs from our experiment. It might be just one set or it might be a large number of comparisons made with a set of DEGs per comparison and/ or results from multiple approaches (manual, packages). This chapter offers some guidance to visualize, summarize and - finally - connect some biological relevance to the results. Browse this chapter first before choosing the visualization(s) to make for your data! If your article contains relevant figures that you would like to try and reproduce (given that enough data is available) that can be done as well instead of the visualizations shown here. Note however that the Volcano plot described below is required.

## 5.1 Volcano Plot

A volcano plot is often the first visualization of the data once the statistical tests are completed. This plot shows data for all genes and we highlight those genes that are considered DEG by using thresholds for both the (adjusted) p-value and a fold-change. Many articles describe values used for these thresholds in their methods section, otherwise a good default is 0.05 for the adjusted p-value and around 1.0 for the log-FC.

Here we use the EnhancedVolcano library that creates a ggplot2 volcano plot as shown below. This is packaged in a function deseq.volcano just because we create two figures and only need to change the input data set and plot title. The resulting scatterplot places the -log10(pvalue) values on the y-axis and the log-FC on the x-axis. This often results in a plot representing a volcanic ‘eruption’ where the fold-change influences the spread and the significance the height. Coloring is done based on the thresholds (-log10(0.05) for the adjusted and log2(2) for the log-FC) as follows:

• (dark) gray: not significant; both the FC and adjusted p-value is below the thresholds.
• green: high FC, but adjusted p-value is below the threshold.
• blue: adjusted p-value indicates significance, but FC below the threshold
• red: significant DEG

You can adjust the pCutoff and FCcutoff values depending on your own requirements, see the elaborate documentation and examples on GitHub.

library(EnhancedVolcano)

## Simple function for plotting a Volcano plot, returns a ggplot object
deseq.volcano <- function(res, datasetName) {
return(EnhancedVolcano(res, x = 'log2FoldChange', y = 'padj',
lab=rownames(res),
title = paste(datasetName, "trisomic vs disomic"),
subtitle = bquote(italic('FDR <= 0.05 and absolute FC >= 2')),
# Change text and icon sizes
labSize = 3, pointSize = 1.5, axisLabSize=10, titleLabSize=12,
subtitleLabSize=8, captionLabSize=10,
# Disable legend
legendPosition = "none",
# Set cutoffs
pCutoff = 0.05, FCcutoff = 2))
}

## Note: input data is the corrected DESeq2 output using the 'lfcShrink' function (see chapter 4)
deseq.volcano(res = res.ipsc.lfc, datasetName = "IPSC")

deseq.volcano(res = res.neuron.lfc, datasetName = "Neuron")

# Note that the most simplest case is:
EnhancedVolcano(res.ipsc, x = 'log2FoldChange', y = 'padj', lab = rownames(res.ipsc))

Notes:

• When you have used edgeR instead you need to change the names of the columns (i.e. log2FoldChange == logFC and padj == FDR) used in this example.
• In this example there are too many DEGs to annotate with their name in the plots. The EnhancedVolcano plot function has many options for annotating significant genes, see the linked manual for further information.
• This section has been rewritten to use the EnhancedVolcano function instead of creating a manual ggplot2 plot as can be seen in the old version of the course manual.

## 5.2 Venn Diagram

Another common visualization is a Venn-diagram. With our data set we’ve shown two comparisons; trisomic vs disomic in two cell types. Using a Venn-diagram we can show both the shared and unique DEGS for the IPSC trisomic vs IPSC disomic and NEUR trisomic vs NEUR disomic comparisons.

Creating a Venn-diagram can be a good method for selecting a (sub)set of DEGs, for instance by selecting the set of DEGs that are shared between groups (the Intersection, see the Wikipedia page on Set Theory or the other way around, only the DEGs that are unique for a group (the Difference). Keep this in mind when continuing on to the clustering section below where it is most likely not useful to cluster all combinations you might have.

Code examples:

• Data preparation (first code-chunk): creating vectors of DEGs for both packages. These DEGs are the gene names (taken from the row names of each DEG-analysis).
• The output of the DESeq2 results and lfcShrink functions are stored in the res.ipsc.lfc and res.ipsc.neuron objects.
• Using the VennDiagram library (second code-chunk): good for complex/ large number of comparison, highly configurable but more difficult to use. To see some examples, run the following function in the R console: example('draw.quad.venn')
• Using the gplots library (third code-chunk): Easier to use and support for complex Venn diagrams, but not good for adjusting appearance.

See the blog post “VENN DIAGRAM WITH R OR RSTUDIO: A MILLION WAYS” for more methods of creating Venn diagrams.

## Data preparation
pval_threshold <- 0.05
ipsc.degs <- row.names(res.ipsc.lfc[which(res.ipsc.lfc$padj <= pval_threshold), ]) neur.degs <- row.names(res.neuron.lfc[which(res.neuron.lfc$padj <= pval_threshold), ])

The two created vectors of gene names can be used by either taking their lengths (for the VennDiagram library) or just their names (for the gplots library).

## Venn-diagram using the VennDiagram library (see below for alternative method)
library(VennDiagram)
# Arguments for a pairwise (two-sets) venn-diagram are sizes for set1, set2 and overlap (intersect)
# Many more functions are available for triple, quad and quantuple diagrams (starting with 'draw.***')
venn.plot <- draw.pairwise.venn(length(ipsc.degs),
length(neur.degs),
# Calculate the intersection of the two sets
length( intersect(ipsc.degs, neur.degs) ),
category = c("IPSC Trisomic vs Disomic", "NEURON Trisomic vs Disomic"), scaled = F,
fill = c("light blue", "pink"), alpha = rep(0.5, 2),
cat.pos = c(0, 0))

# Actually plot the plot
grid.draw(venn.plot)

As mentioned, following is an easier alternative (especially when you have > 2 groups) with the venn function from the gplots library. This only needs a list containing either the row-numbers or the gene-names of the DEGs which is easier (but offers less adjustability to make it prettier).

## Alternative method using the gplots library
library(gplots)
# Create a Venn-diagram given just the list of gene-names for both sets
venn(list("IPSC Trisomic vs Disomic" = ipsc.degs,
"NEURON Trisomic vs Disomic" = neur.degs), )

While both analysis result in a similar amount of DEGS (849 for the IPSC and 769 for the Neuron cell types) only 96 DEGs are shared between both cell types and most are thus unique.

## 5.3 Clustering

Here, we can create a heatmap of the found DEGs for all samples where the colors show the fold-change value. Usually this heatmap is shown with a two color gradient; from red (downregulated) to green (upregulated). This is however optional and using the default colors from pheatmap for example is perfectly fine as long as a proper legend is present.

There are multiple methods of creating a heatmap (one of which you’ve already used) and most of these directly apply clustering in the visualization. This clustering can be applied to the expression pattern of a gene (row-clustering), the expression pattern of a sample (column-clustering) or both (default for pheatmap).

The following example is commonly found in publications as shows the log2 Fold Change values for the comparison that was done. The first and fifth columns show the log2(FC) as calculated by DESeq2 for the two trisomic vs disomic comparisons. The remaining columns are the log fold changes of the separate trisomic replicates vs the average disomic expression (manually calculating the Log(FC) as done in section 4.1 using the DESeq2 normalized count data from the rld.dds object). The DESeq2 calculated FC values show greater changes compared to the manually calculated FC values. The value in showing the manually calculated values too is to see if all replicates show the same patterns within groups. They are however optional unless you only have one comparison as you cannot create a heatmap with a single column.

Note that column-clustering has been turned off for this type of plot by setting cluster_cols = FALSE (all replicates are from the same group, clustering will not add any useful information).

A limitation of a heatmap will show itself when more then 100 DEGs are found; this just doesn’t fit well in a single figure and causes the clustering and especially the gene names/ labels to be unreadable. It is therefore always adviced to not only filter on the adjusted p-value, but also on the log-FC value to reduce the number of genes shown. Alternatively if the data is properly annotated, instead of showing gene names for each row, showing a GO-term clustering might reveal expression patterns for certain gene groups.

## 5.4 Pathway Analysis

In some cases you’ll see hundreds or even thousands of DEGs as a result of an analysis. These large amounts of DEGs are too much for most visualization approaches or to easily say something about the biological context for that many genes.

One approach for tackling such a large set of DEGs is pathway analysis where through different methods genes are grouped by pathway to get an overview of affected pathways in the experiment.

This section will demonstrate two methods for this analysis, one using an online platform for gene-annotation enrichment analysis and an R-method for signaling pathway impact analysis.

### 5.4.1 Gene Functional Classification using DAVID (>100 DEGs)

The online Database for Annotation, Visualization and Integrated Discovery DAVID tool suite allows for multiple - functional - annotation methods, one of these can be used for enrichment analysis based on gene annotation. To use the DAVID tools we need to have the list of gene names, ID’s or other identifyable information (no need for p-values or fold-changes). The basics of this method consist of grouping genes in pathways, count how many genes are in each pathway and compare this with what you would expect if you take a random sample from the genome with the same size as the number of your DEGs.

For example, if the RNA transport pathway contains a total of 50 genes out of the total of 25.000 for the whole mouse genome. This means 0.2% of all mouse genes have a role in this pathway.

If you uploaded a total of 1000 DEGs for analysis, you expect 0.2% = 2 genes from your DEG list to be placed in this pathway. But, it seems that there are 10 genes from the RNA transport pathway in the set of 1000 genes, which is 5 times as much as expected. This difference between expected and observed gene counts per pathway allows to say if a particular pathway is enriched.

DAVID provides a single interface to perform multiple different annotation pipelines, see the table below for available types and their use.

The following example code writes the gene symbols for both analysis approaches to a file for upload on DAVID.

# Write gene names to a file
write.table(edgeR.degs, file = "../data/edger-deg-names.txt",
row.names = FALSE, quote = FALSE, col.names = FALSE)
write.table(DESeq.degs, file = "../data/deseq-deg-names.txt",
row.names = FALSE, quote = FALSE, col.names = FALSE)

Click on the Start Analysis button at the top of the DAVID website. Then, copy the genes (A) or upload the file (B) and select the proper identifier type (OFFICIAL_GENE_SYMBOL in this example). Select the Gene List option in Step 3 and click on the Submit List button in Step 4.

This usually results in a popup that mentions that some of your IDs were found in multiple organisms. Close the popup and select your own organism (Mus musculus in this example) from the Species list. From the 940 uploaded gene-symbols, 881 were found in the mouse which we select followed by clicking on the Select Species button.

The most useful analysis is the Functional Annotation Chart, selectable from the Start Analysis page after uploading the set of genes. The output table shows many (in this case ~560) chart records in which the genes were grouped. These records orginate from various sources, such as GO-terms, Interpro, KEGG-pathways, SMART, etc. See the image below for a quick explaination of what is displayed in such a chart (mostly just a table..).

An interesting statistic to add to this table is the Fold Enrichment that shows how over- or under-represented this group is. Note that this is not based on gene-expression but only on the deviation of the of the amount of genes provided for a certain group from what is expected given the total number of uploaded gene IDs.

Now, looking at the article for a random example set with GSE ID GSE80128, one of the conclusions mentions the following on their gene-in-research (Igf2as):

It is suggested that Igf2as play a role in energy metabolism, the cell cycle, histone acetylation and muscle contraction pathways.

With almost a 1000 DEGs grouped in 560 different categories, you can find ‘evidence’ for most conclusions and the difficulty here is finding the truly important pathways. Adviced is to only check the KEGG-pathways listed in the table, sort by the corrected p-value and inspect the pathways that look interesting. For this example, amongst others, the following pathways were listed as significant (p-value < 0.05):

• Metabolic pathways: includes many pathways, 100 genes in total with a 1.6-fold enrichment
• Citrate cycle (TCA): major metabolic pathway, 9 genes with a 5.7-fold enrichment
• Cardiac Muscle contraction: 12 genes with a 3.3-fold enrichment

Clicking on the Term in the table for a KEGG-pathway shows a graphical representation of this pathway indicating the found DEGs (blinking, even), see below.

### 5.4.2 KEGG-pathway Visualization (>10 DEGs)

Knowing the interesting KEGG-pathway(s) upfront (i.e. it is listed in the article) allows for visualizations applied to that selected pathway. For this we can use the pathview website or Bioconductor library:

“… It maps and renders a wide variety of biological data on relevant pathway graphs. All users need is to supply their data and specify the target pathway. Pathview automatically downloads the pathway graph data, parses the data file, maps user data to the pathway, and render pathway graph with the mapped data.”

This section demonstrates only the use of the bioconductor package for visualizing not only the DEGs onto pathways, but also the change in expression for each gene.

## Bioconductor library (install with bioclite() if missing)
library(pathview)

## Example pathway IDs (for human, change organism key for other organisms)
data("paths.hsa")
pander(head(paths.hsa, n=5))
hsa00010 hsa00020
Glycolysis / Gluconeogenesis Citrate cycle (TCA cycle)

Table continues below

hsa00030 hsa00040 hsa00051
Pentose phosphate pathway Pentose and glucuronate interconversions Fructose and mannose metabolism
## Check 'gene.idtype' argument possibilities
data(gene.idtype.list);
pander(gene.idtype.list)

SYMBOL, GENENAME, ENSEMBL, ENSEMBLPROT, UNIGENE, UNIPROT, ACCNUM, ENSEMBLTRANS, REFSEQ, ENZYME, TAIR, PROSITE and ORF

## Prepare data for visualization
deseq.degs.logfc <- subset(deseq.results, padj < pval_threshold, select = log2FoldChange)

pander(head(deseq.degs.logfc))
log2FoldChange
ABAT 0.7499
ABCC13 -3.452
ABHD12B -3.306
AC002480.2 -0.1612
AC002480.5 -5.13
AC005062.2 -1.949
pathview(gene.data=deseq.degs.logfc,
pathway.id="00020", # TCA-cycle
species="hsa"     # Organism key
)

It is fine to use the webtool too for this purpose however it is not notacibly easier to use than the R package.

### 5.4.3 Signaling Pathway Impact Analysis (>20 DEGs)

To also look at the topology of how genes interact with each other, we can use the SPIA Bioconductor package. This package takes a list of Entrez gene IDs for all DEGs and the complete list of Entrez IDs for all present genes and evaluates - per pathway - if genes are involved and if the pathway is either inhibited or activated.

The code below uses data objects that were not always shown before, see the tables below for their contents.

# Bioconductor package for "Signaling Pathway Impact Analysis"
# (install with biocLite("SPIA") if it is not installed)
library(SPIA)
# Only used for the subset function
library(dplyr)

# Get a vector of log(FC) values for all significant genes
sig_genes <- subset(deseq.results, padj < pval_threshold, select = log2FoldChange)[[1]]
# Make it a named vector by assigning the Entrez ID's to each log(FC) value
names(sig_genes) <- subset(deseq.results, padj < pval_threshold, select = EntrezID)[[1]]
# A complete list of Entrez IDs for all genes in this experiment
all_genes <- deseq.results\$EntrezID

The input sig_genes named vector contents:

EntrezID logFC
18 0.7499
150000 -3.452
145447 -3.306
43 -3.055
51412 1.847
114 -1.926
146 2.754
60312 -1.068
3899 1.061
116986 1.222

(subset) of DESeq2 DEGs with their Entrez ID and log2(FC) value

And the all_genes vector or Entrez IDs input (note that this does not contain any expression information, just IDs):

pander(all_genes[which(idx)[100:110]])

55584, 5010, 6320, 387836, 154790, 7461, 79935, 26047, 85445, 78989 and 57699

To start processing all pathways looking for influenced signal pathway, execute the spia function for the proper organism (Mus Musculus, or mmu in this case):

# Process all signaling pathways to see if they are inhibited or activated
spia_result <- spia(de=sig_genes, all=all_genes, organism="hsa", plots=TRUE)

Running this function will print a list of (> 100) pathways it will analyze, so this will take a while:

Done pathway 1 : RNA transport..
Done pathway 2 : RNA degradation..
Done pathway 3 : PPAR signaling pathway..
Done pathway 4 : Fanconi anemia pathway..
Done pathway 5 : MAPK signaling pathway..
Done pathway 6 : ErbB signaling pathway..
Done pathway 7 : Calcium signaling pathway..
Done pathway 8 : Cytokine-cytokine receptor int..
Done pathway 9 : Chemokine signaling pathway..
Done pathway 10 : NF-kappa B signaling pathway..
Done pathway 11 : Phosphatidylinositol signaling..
Done pathway 12 : Neuroactive ligand-receptor in..
Done pathway 13 : Cell cycle..
...
Done pathway 132 : Graft-versus-host disease..
Done pathway 133 : Arrhythmogenic right ventricul..
Done pathway 134 : Dilated cardiomyopathy..
Done pathway 135 : Viral myocarditis..

The resulting data frame is ordered by a p-value and contains the following information:

Name ID pSize NDE pNDE tA
Alzheimer’s disease 05010 160 23 2.929e-06 -0.7695
Parkinson’s disease 05012 113 19 2.002e-06 0
Focal adhesion 04510 196 22 0.0002214 9.707
Glioma 05214 63 11 0.0002007 -0.6071
Neurotrophin signaling pathway 04722 120 15 0.000702 2.983

Top signaling pathways (continued below)

pPERT pG pGFdr pGFWER Status
0.548 2.302e-05 0.001654 0.002693 Inhibited
1 2.828e-05 0.001654 0.003308 Inhibited
0.018 5.352e-05 0.002087 0.006262 Activated
0.863 0.001674 0.04258 0.1958 Inhibited
0.274 0.001838 0.04258 0.2151 Activated

To get an overal picture we can create a plot where each pathway is a point which is placed according to the spia calculated over-representation p-value and pNDE (which is the probability to observe at least nDE genes on the pathway). In this case it is obvious that Alzheimer’s and Parkinson’s pathways with ID’s 05010 and 05012 respectively are the most influenced by this experiment, according to this method.

plotP(spia_result)