B Batch Loading Expression Data in R

This code example shows how to batch-load multiple files containing expression (count) data for a single sample. The data for this example can be found on GEO with ID GSE109798. Depending on the tool used to convert the mapping data into a per-sample count file, contents of these files can be (very) different. Downloading the data for this example experiment from GEO gives us a single .tar file called GSE109798_RAW.tar. Extracting this archive file nets us a folder with the following files:

file.names <- list.files('./data/GSE109798_RAW/')
Files
GSM2970149_4T1E274.isoforms.results.txt
GSM2970150_4T1E266.isoforms.results.txt
GSM2970151_4T1E247D.isoforms.results.txt
GSM2970152_4T1P2247A.isoforms.results.txt
GSM2970153_4T1P2247G.isoforms.results.txt
GSM2970154_4T1P2247F.isoforms.results.txt
GSM2970155_HCC1806E224B.isoforms.results.txt
GSM2970156_HCC1806E224A.isoforms.results.txt
GSM2970157_HCC1806E224C.isoforms.results.txt
GSM2970158_HCC1806P2232A.isoforms.results.txt
GSM2970159_HCC1806P2232B.isoforms.results.txt
GSM2970160_HCC1806P2230.isoforms.results.txt

B.0.1 Decompressing

The file extension of all these files is .txt.gz which means that all files are compressed using gzip and need to be unpacked before they can be loaded. The easiest method is using the system gunzip command on all files which can be done from within R by applying the gunzip command using the system function on each file.

## Change directory to where the files are stored
setwd('./data/GSE109798_RAW/')
sapply(file.names, FUN = function(file.name) {
  system(paste("gunzip", file.name)) 
  })

Now we can update the file.names variable since each file name has changed.

file.names <- list.files('./data/GSE109798_RAW/')

B.0.2 Determining Data Format

Next, we can inspect what the contents are of these files, assuming that they all have the same layout/ column names etc. to decide what we need to use for our analysis.

## Call the system 'head' tool to 'peek' inside the file
system(paste0("head ", "./data/GSE109798_RAW/", file.names[1]))
transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct
uc007aet.1 1 3608 3608.00 1.82 0.44 0.24 100.00
uc007aeu.1 1 3634 3634.00 0.00 0.00 0.00 0.00
uc011whv.1 10 26 26.00 0.00 0.00 0.00 0.00
uc007amd.1 100 1823 1823.00 0.00 0.00 0.00 0.00
uc007ame.1 100 4355 4355.00 1.32 0.26 0.15 100.00
uc007dac.1 1000 1403 1403.00 2.00 1.25 0.70 100.00
uc008ajp.1 10000 1078 1078.00 0.00 0.00 0.00 0.00
uc012ajs.1 10000 1753 1753.00 0.00 0.00 0.00 0.00
uc008ajq.1 10001 2046 2046.00 0.00 0.00 0.00 0.00

These files contain (much) more then just a count value for each gene as we can see columns such as (transcript) length, TPM, FPKM, etc. Also, the count-column is called expected_count which raises a few questions as well.

The expected count value is usable as it contains more information - compared to the raw count - then we actually require. The expected part results from multimapped reads where a single read mapped to multiple positions in the genome. As each transcript originates only from one location, this multimapped read is usually discarded. With the expected count though, instead of discarding the read completely it is estimated where it originates from and this is added as a fraction to the count value. So the value of 1.32 that we see on line 5 in the example above means an true count of 1 (uniquely mapped read) and the .32 (the estimated part) results from an algorithm and can mean multiple things.

As mentioned before, we require integer count data for use with packages such as DESeq2 and edgeR and there are two methods to convert the expected count to raw count data: + round the value to the nearest integer (widely accepted method and is well within the expected sampling variation), or + discard the fraction part by using for example the floor() function.

B.0.3 Loading Data

From all these columns we want to keep the transcript_id and expected_count columns and ignore the rest (we might be interested in this data later on though). As we need to lead each file separately we can define a function that reads in the data, keeping the columns of interest and returning a dataframe with this data. Note that the first line of each file is used as a header, but check before setting the header argument to TRUE, sometimes the expression data starts at line 1. The file name is then also used to name the column in the dataframe so that we know which column is which sample. This is done by splitting the file name (using strsplit) using the dot (‘.’) keeping the first part (i.e. ‘GSM2970156_HCC1806E224A’) and discarding the second part (‘isoforms.results.txt’). The strsplit function however always returns a list, in this case containing a vector with the 5 splitted elements:

## String splitting in R
## (the fixed = TRUE is required as the dot is a special character, see '?strsplit')
strsplit('GSM2970155_HCC1806E224B.isoforms.results.txt.gz', '.', fixed = TRUE)
## [[1]]
## [1] "GSM2970155_HCC1806E224B" "isoforms"               
## [3] "results"                 "txt"                    
## [5] "gz"
## Keeping the sample identifier
strsplit('GSM2970155_HCC1806E224B.isoforms.results.txt.gz', '.', fixed = TRUE)[[1]][1]
## [1] "GSM2970155_HCC1806E224B"
## Function for reading in files
read_sample <- function(file.name) {
  ## Extract the sample name for naming the column
  sample.name <- strsplit(file.name, ".", fixed = TRUE)[[1]][1]
  ## Read the data, setting the 'transcript_id' as row.names (column 1)
  sample <- read.table(file.name, header = TRUE, sep="\t", row.names = NULL)
  ## Rename the count column
  names(sample)[5] <- sample.name
  ## Return a subset containing the 'transcript_id' and sample name columns
  return(sample[c(1, 5)])
}

Applying the read_sample function to all file names gives us a set of data frames that we can merge together using the merge function. We merge the data based on the transcript id defined with the by = 1 argument pointing to the first column. We start by reading in just one file which is the ‘base’ dataframe to which we will merge the other files.

During processing it seemed that this data set is divided into two groups which is also listed on the GEO website for this project:

  • GPL11154 Illumina HiSeq 2000 (Homo sapiens)
  • GPL13112 Illumina HiSeq 2000 (Mus musculus)

where the first 6 files are from human source and the last 6 from the mouse. Therefore, the following code only shows how to read the first 6 samples and merge these into a single dataframe. Repeating this process for the other 6 files would result into another dataframe for those samples.

setwd('./data/GSE109798_RAW/')

## Read the FIRST sample
dataset <- read_sample(file.names[1])

## Read first sample group (6)
for (file.name in file.names[2:6]) {
  sample <- read_sample(file.name)
  dataset <- merge(dataset, sample, by = 1)
}

pander(head(dataset))
transcript_id GSM2970149_4T1E274 GSM2970150_4T1E266 GSM2970151_4T1E247D
uc007aet.1 1.82 0 0
uc007aeu.1 0 1 0.24
uc007aev.1 0 0 0
uc007aew.1 0 0 0.97
uc007aex.2 0 0 0
uc007aey.1 0 0 0

Table continues below

GSM2970152_4T1P2247A GSM2970153_4T1P2247G GSM2970154_4T1P2247F
0 0 0
0.13 0 0
0 0 0
0 1 0
0 0 0
0 0 0

The dataset variable now contains all data for the first 6 samples in this experiment. It is advisable to compare the number of rows in this data set with the number of rows in a single sample. It is not guaranteed that all samples have exactly the same number of genes/transcripts present (i.e., 0-values might have been discarded) which results in a final data set that has as many rows as the smallest sample. See the help of merge if this is the case because the all argument can be used to introduce extra rows for missing data.