Discovering Data Sets

2.1 Finding Public Data Sets of Interest

This chapter describes a method of finding public data sets of interest. A data set in the context of this course refers to all data belonging to a certain gene-expression experiment, usually consisting of a number of sequencing samples combined with metadata describing the experiment.

There are a number of very large databases online that offer access to thousands of published experiments. Here we will focus on searching for and downloading gene-expression data produced with high-throughput (NGS) sequencing techniques, in particular RNA-Seq.

With thousands of freely available data sets it is possible to start performing research without having to perform your own lab experiment. For all common conditions, organisms and tissues you can download samples and compare them across multiple studies to find novel relations between genes and conditions, or to verify experiments performed in your own laboratory. For instance, the power of public data sets was demonstrated jointly by three of our alumni in an article called “Calling genotypes from public RNA-sequencing data enables identification of genetic variants that affect gene-expression levels” (Deelen et al. 2015).

Its methods section begins with the sentence “We downloaded the raw reads for all available human RNA-seq data sets …”; just how much data and work that involves will become clear later on. If you are interested in expression Quantitative Trait Loci (eQTLs) or Allele-Specific Expression (ASE) analysis, please read this very interesting paper.

But before we start diving into large and complex databases, potentially containing thousands of experiments with millions of samples and terabytes of data, we need a rough idea of what we would like to do once we have found something of interest, so that we know what we are looking for in the first place. While there is no definitive guide or protocol for processing and analyzing a gene-expression data set, the following goals can be considered:

  1. Re-do (part of) the analysis described in the accompanying publication.
    • As all data sets (except for the ones that were added very recently) are accompanied by a publication, you can gather a lot of information regarding experimental design and results from this paper alone. Most often the researchers have already performed the most interesting research on the data set, and it is therefore a good exercise to try and reproduce their results. The challenges here are getting a proper understanding of the techniques used by the researchers (from the article) and translating these into your own analysis steps. Often though, not all details are clear from the article, which means you will need to interpret their results yourself and perform analysis steps of your own to arrive at the same conclusion.
  2. Alternatively, you can come up with your own ideas and (biological) questions that you could try to answer with a given data set, instead of redoing the published work.
    • This is of course more challenging and not suitable for all data sets.
    • You might notice that some publications only focus on a small set of genes instead of the whole transcriptome. For instance, in a yeast experiment designed to measure the activity of genes involved in alcohol fermentation, the researchers might only look at genes from the glycolysis pathway. This leaves the other 6,000+ yeast genes for you to explore, possibly revealing novel relations between gene expression and the experimental setup.
    • If the analysis approach explained in the paper does not focus on a pre-selected list of genes, you might, depending on the experiment, come up with comparisons not made in the original research. For instance, in an experiment where the maternal age (at birth) is correlated with autism in the offspring, the paternal age is also known but not addressed in the publication. This could be a subject of further research (spoiler alert: no clear conclusion can be drawn from including this extra factor…).
  3. Another type of project is to evaluate different methods of normalization or statistical analysis, either to confirm the published results or to find novel genes involved in the experiment.
    • This can be viewed as a more technical research subject, which is fine to pursue; however, the final report should mainly concern the biological impact of your findings.
      • This means that when the accompanying publication shows very conclusive results, your project will probably come down to confirming those results (in effect extending point 1 from this list). With a very conclusive publication you hope to find the same results yourself, and if you do not, it may become hard to formulate a good conclusion in the end.
      • However, if the publication isn’t very specific and only states a conclusion such as ‘Benzene exposure shows increased risk of leukemia’ followed by a short list of genes that might be involved, you could try to find other candidate genes by changing the analysis approach.
  4. Although this last project goal has many risks involved and will need a very sound project proposal, it might result in the most interesting project. As shown before with the linked article, it is possible to combine data from multiple experiments.
    • You could for instance find two closely related experiments (e.g. researching the effect of a certain drug) that measure expression in different tissue types (e.g. liver and brain tissue). After analyzing both experiments, you could present a set of genes that show an effect in both tissues and, more interestingly, genes showing an effect in only one tissue.

The list above is a guideline to keep in mind while browsing for suitable data sets, and it will be extended in week three. Also, don’t worry if some of the terms above are unclear; getting a good understanding of the terminology used is part of these first few weeks.

2.2 The NCBI Gene Expression Omnibus

The easiest method of exploring and finding interesting data sets is by simply using your web browser to access a data repository. One such repository is the NCBI Gene Expression Omnibus (GEO). This repository was primarily used to store microarray data sets and describe those experiments, linking to raw data, processed data and an accompanying publication. There are however many more sources of data browsable in GEO, and for this project we will limit our searches to experiments using the RNA-Seq technology, for which over 60,000 experiments are listed. Following the “Expression profiling by high throughput sequencing” link in this summary table gives the complete listing in which you can find an interesting experiment. Besides finding an experiment that you are interested in, there are some further important requirements to account for while browsing; these are listed below, after a short example of searching GEO from R. Note that at this point we do not need to download the data yet; this is described in chapter 3.
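The same browsing can also be done from R instead of the website. Below is a minimal sketch using the rentrez package (assumed to be installed from CRAN); the field tags follow the syntax of the GEO DataSets advanced search, and the organism and keyword are example values to replace with your own interests.

```r
## A minimal sketch: search the GEO DataSets database (db = "gds") for
## RNA-Seq series using the rentrez package. The organism and the 'benzene'
## keyword are example values only.
library(rentrez)

query <- paste(
  '"expression profiling by high throughput sequencing"[DataSet Type]',
  'AND "Homo sapiens"[Organism]',
  'AND gse[Entry Type]',       # restrict the results to series (GSE) records
  'AND benzene[All Fields]'    # example keyword; replace with your own terms
)

res <- entrez_search(db = "gds", term = query, retmax = 20)

res$count   # total number of series matching the query
res$ids     # internal UIDs; entrez_summary() can turn these into GSE accessions
```

This is mainly useful for quickly checking how many series exist for a combination of terms before spending time browsing them one by one.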

  • The experiment must be published and the publication must be fully available, either freely or through the Hanze University library.
  • For each sample group, a minimum of three replicates must be present. If one or more groups have fewer than three replicates, those groups cannot be used (the rest of the data might still be usable if there are at least two groups that do have three replicates).
    • Replicates are often not very easy to spot. If you are lucky, samples are named something like “WildType - rep 2”, but most often you need to read the “Overall design” part of the page or click on the samples (with the “GSM…” ID) to make sure.
  • The available data must contain at least the count data.
    • Data is offered in multiple formats, always including the RAW data (the actual reads, which we don’t use), but also further processed data such as FPKM, RPKM and TPM values (see this interesting blog post discussing the use of these units). The tools and R analysis libraries that we will use for the downstream analysis rely on unnormalized and unprocessed data, which is the count data. These counts simply represent the number of reads mapped to a transcript.
    • This is often the hardest part to assess; sometimes it is not clear whether the provided data are the actual raw count values. Usually, you can find some (sometimes rather cryptic) information by clicking on a sample ID (the “GSM…” link) and reading the ‘Data processing’ section. This paragraph of text often mentions a ‘Supplementary_files_format_and_content’ entry that describes the actual content of the downloadable files. Below, after a short sketch for retrieving such a description from R, follow a number of examples of valid and invalid data as provided by experiments; these come from the single-sample pages where you can find such a description. Unfortunately, the authors of each experiment are free to describe their data as they please, so you will most likely not encounter the exact same descriptions:
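As a small aside before those examples: the same description can also be pulled up from R instead of the web interface. The sketch below is a minimal example using the GEOquery Bioconductor package (assumed to be installed); the GSM accession is a placeholder to replace with a sample from your candidate experiment.

```r
## A minimal sketch: read a single sample's metadata with GEOquery so the
## 'data processing' description can be checked without the website.
## The accession below is a placeholder; fill in a real GSM ID.
library(GEOquery)

gsm <- getGEO("GSM0000000")   # placeholder sample accession, replace it

## Meta() returns the SOFT header as a named list; the fields describing the
## processing and the supplementary files are the ones to read here.
meta <- Meta(gsm)
meta[grep("data_processing|supplementary", names(meta))]
```

The ‘data_processing’ entry printed here should contain the same paragraph you would otherwise read on the sample page.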