Introduction

Course 2.1.2 involves two separate projects performed on a single published data set. First, we perform a Genomics analysis similar to the case study from module 2.1.1. Then, we take the result of the mapping step, process it into read counts and perform a subsequent Transcriptomics analysis.

1.1 2.1.2 Genomics

For this Genomics analysis we are going to look at variants between the chosen data and a reference genome. One important aspect during the analysis is assessing quality, both of the input data set and of the output of intermediate steps. Once a set of variants has been determined, they can be annotated and finally linked to the biology underlying the original experiment, either reproducing the published findings or aiming for novel insights.

1.1.1 Project Deliverables

  • Project Proposal: A short written proposal of the project clearly describing the data set, the research question and the methods + tools to be used.
  • Genomics pipeline: The genomics analysis pipeline is well documented in one or more lab journals and contains the correct tools based on the research question and data set.
  • Presentation: The research results will be presented in an attractive manner and at a level appropriate for the target audience.

1.1.2 Data Processing

As in the 2.1.1 NGS & Mapping course, we will go through the same steps to process the data (i.e. quality control, trimming, mapping, variant calling, etc.). This time, however, we are not using a workflow manager such as Galaxy, but execute all tools on the command line. There are two reasons for this: the number and size of the input files are far larger than in the previously used data sets, and the Galaxy server is not capable of handling that many large files. Also, gaining proper terminal experience is very valuable.

Note on writing the pipeline: all steps need to be documented in RMarkdown documents. Each step must be introduced with a short description of the tool used and its version, followed by the command used to execute the tool. As these are terminal commands, we use code cells with the bash language. All code cells must have the option eval=FALSE set; this prevents the code from being executed when knitting the document.
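
A minimal documented step in the lab journal could look like the fragment below (the tool version and file paths are examples, not prescribed values):

````markdown
### Quality control with FastQC

Raw read quality was assessed using FastQC (v0.12.1).

```{bash fastqc-raw, eval=FALSE}
fastqc --outdir qc_raw/ raw_data/*.fastq.gz
```
````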

Furthermore, it is required to run all commands on the RStudio server provided at https://bioinf.nl/rstudio. This RStudio server is hosted on our assemblix server, which is capable of both accessing the data and running tools on large data sets; the workstations we provide in the classrooms are not. This server also has a number of tools pre-installed (FastQC, Trimmomatic, the BWA and STAR mappers, etc.). Lastly, running the tools from an RMarkdown document on the RStudio server forces you to document the commands, which simply using ssh does not.

Note on storing (intermediate) data: all data relevant to this project must be stored in the /students/2024-2025/Thema05/ directory. In there, create a project directory for your team and create sub-directories as you go. Your home folder has a hard limit of 25GB and is therefore not sufficient for storing these project files. Use commands such as chmod and chown to allow your project partners to access the data too.
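
Setting up a shared team directory could look like the sketch below. The team name is a placeholder, and for illustration the example uses a temporary directory as a stand-in for /students/2024-2025/Thema05/ so it can run anywhere:

```shell
# Stand-in for /students/2024-2025/Thema05 so the example runs anywhere;
# on the server, use the real base directory instead.
BASE="$(mktemp -d)"
PROJECT="$BASE/team_example"   # placeholder team name

# Create the project directory with sub-directories per pipeline step.
mkdir -p "$PROJECT"/{raw_data,trimmed,mapped,variants}

# Give group members read/write access (X: execute only on directories),
# and set the setgid bit so new files inherit the directory's group.
chmod -R g+rwX "$PROJECT"
chmod g+s "$PROJECT"

ls -ld "$PROJECT"
```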

Note on processing large amounts of data: building correct terminal commands to execute for instance a trimming step on a large number of files can be quite challenging. It is therefore required that test-data is used for all steps. Reasons for this are:

  • Spotting errors quickly instead of having to wait several hours to see an error message
  • Inspecting resource usage (CPU, memory, disk space) to prevent overloading the system
    • Some steps can be very memory intensive (over 1 TB on large data sets!) and can crash the system
  • Creates a small output set that can be used as a test set for the next step
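
Creating such a test set from a FASTQ file is straightforward, since every read occupies exactly four lines: taking the first 4,000 lines yields 1,000 reads. File names are placeholders; for a self-contained example, a tiny dummy FASTQ is generated first:

```shell
# Generate a tiny dummy FASTQ (2 reads) so this example is self-contained;
# in practice sample_R1.fastq.gz is a multi-GB input file.
printf '@read1\nACGT\n+\nIIII\n@read2\nTTGG\n+\nIIII\n' | gzip > sample_R1.fastq.gz

# Take the first 1,000 reads (4 lines per read = 4,000 lines) as test data.
gunzip -c sample_R1.fastq.gz | head -n 4000 | gzip > test_R1.fastq.gz

# Sanity check: count the lines in the test set
# (8 for the dummy file, since it only holds 2 reads).
gunzip -c test_R1.fastq.gz | wc -l
```

For paired-end data, remember to subset both the R1 and R2 files so the read pairs stay in sync.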

Note on the use of GATK: The Genome Analysis Toolkit (GATK) is an often used set of tools for variant calling. Many publications cite the use of GATK but unfortunately do not include all needed details to reproduce the results. Due to the complexity of GATK we do not recommend its use, unless the publication is very clear on the exact workflow and settings used. Prepare to spend a lot of time on the official GATK website browsing manuals before you can use it effectively.

1.1.3 Finding a Data Set

Finding a suitable data set can be challenging as there are a few requirements such a data set needs to meet:

  • It needs to have both a DNA- and RNA-Seq data set available.
  • The data set needs to be published (publication should be open access)
  • For the RNA-Seq part, there is a minimum number of samples required:
    • At least two sample groups with a minimum of 3 samples (replicates) per group
    • See appendix chapter A1 for examples of valid experimental setups; that chapter also contains tips on what to do with them (mostly relevant for Transcriptomics).

Data is available from the NCBI Sequence Read Archive (SRA) and the Gene Expression Omnibus (GEO).
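
Once a run accession has been selected, the raw reads can be retrieved from the SRA; one common option is the sra-tools package (whether it is pre-installed on the server is not guaranteed, and the accession below is a placeholder). As with all pipeline steps, these commands belong in an eval=FALSE chunk:

````markdown
```{bash download-reads, eval=FALSE}
prefetch SRR0000000
fasterq-dump SRR0000000 --split-files --outdir raw_data/
gzip raw_data/*.fastq
```
````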

The proper way to search for a data set suitable for both the Genomics and Transcriptomics projects is by creating a query that searches for both: ("expression profiling by high throughput sequencing"[DataSet Type]) AND "genome variation profiling by high throughput sequencing"[DataSet Type]. Whether this selection actually contains anything of interest is up to you; if you cannot find something interesting, use a different search strategy. One option is to search the GEO for RNA-Seq experiments where one of the groups is preferably a mutant.

Note that for the Genomics part, not all available samples have to be processed. Unlike transcriptomics, it is not common to have a case/control experimental setup, but rather a single condition or group with possibly a few replicates. If the data set does include many samples, it is recommended to select a subset of samples to process (do discuss the reasoning behind this selection).

1.2 2.1.2 Transcriptomics

The use of RNA-Sequencing techniques for measuring gene expression is relatively new and largely replaces microarrays, though in some cases microarrays are still used. Gene expression data gives valuable insight into the workings of cells under certain conditions. Especially when comparing, for instance, healthy and diseased samples, it can become clear which genes are causal for or influenced by a specific condition. Finding the genes of interest (genes showing differing expression across conditions, called Differentially Expressed Genes (DEGs)) is the goal of this project.

While there is no gold standard for analyzing RNA-sequencing data sets, as there are many tools (all manufacturers of sequencing equipment also deliver software packages), we will use proven R libraries for processing, visualizing and analyzing publicly available data sets. This manual describes the steps often performed in a transcriptomics experiment. Use it as a very general guideline to process the data accompanying the chosen published research. Note that this manual was originally written for processing count data and therefore does not describe the processing of NGS read files into count data; that is part of the Genomics analysis performed in this course. RNA-Seq counts describe the number of reads mapped to a gene, which corresponds to the relative number of transcripts (mRNA sequences) of that gene present in the cell at the time of sampling.
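
As an illustration, such count data is simply a table of genes by samples, where each cell holds the number of reads mapped to that gene in that sample (the gene names and values below are made up):

```
gene     control_1  control_2  control_3  treated_1  treated_2  treated_3
GENE_A         523        498        531       1204       1187       1256
GENE_B         871        902        866        845        890        859
GENE_C          12          9         15          0          2          1
```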

1.2.1 Project Deliverables

  • Project Proposal: A short written proposal of the project clearly describing the data set, the research question and the methods + libraries to be used.
  • Lab Journal: The lab journal is readable, complete and aimed at the reproducibility of the research. Relevant visualisations are included and statistical methods are performed to support the biological interpretation of the results.

1.3 Lab Journal

As you may know from previous projects, it is essential to keep a proper journal detailing every step you have taken during the analysis. This journal is to be kept in an RMarkdown file showing which steps have been taken in the analysis of the data set. Once the project is completed, this document should be knitted into a single PDF file containing text detailing the steps and any decisions you've made, R code (always visible!) and the resulting output/images.

Notes: As general advice, do not wait with knitting the whole document until the project is done, as knitting is very prone to errors and fixing these in a large document is not easy. Give each code chunk the proper attributes, including at minimum a name; this helps spot errors during knitting, as that process mentions which chunk is being processed. Note that chunk names must be unique. Also try to make proper use of chapters and sections, and include an (optional) table of contents.

1.4 Schedule

The Genomics and Transcriptomics parts both have their own schedule, as listed below.

1.4.1 Genomics

Empty rows mean that the previous topic or task will most likely take more than one session.

Week  Lesson  Subject
1     1       Finding a suitable experiment
      2       Statistics - recap and experimental setup
      3       Writing a Project Proposal
2     1       Presenting Project Proposal
      2       NGS Quality Control
      3
3     1       Read Mapping and QC
      2
      3       Variant Calling
4     1       Annotation
      2
      3
      4
5     1       Visualisation
      2

1.4.2 Transcriptomics

Week  Lesson  Subject
5     1       Writing a Project Proposal
      2
6     1       Presenting Project Proposal
      2       Read Mapping and Quantification
      3
      4       Statistics (distributions, normalization, PCA)
7     1       Exploratory Data Analysis (chapter 3)
      2
      3       Statistics (batch effects, regression, linear models)
      4       Finding Differentially Expressed Genes (chapter 4)
      5       Statistics (Enrichment analysis)
8     1       Data Analysis and Visualization (chapter 5)
      2
      3
      4
      5
      6       Hand in Lab Journal

1.5 Literature

The following (online) documents can be used throughout this course: