Tutorial 2: Wrestling with data


The sad truth of a lot of science and most data analysis is that a great deal of your time is spent on mundane preparation. You convert file-types, replace missing values, parse umlauts, rejig date and time formats, reshape, subset, transform, summarise…and so on…then you can begin some analysis.

Most tutorials ignore these steps and present beginners with idealised pre-loaded data so they can dive straight into plotting or statistical models. This is fine but usually leads to great frustration when the beginner tries to apply their newly learned analytical methods to their own data. Often they find they cannot load the data (“What is this strange error message?”) or get it into the format they had in the tutorial.. without excel that is .. and some may give up at this point and succumb to finishing the analysis in excel.

So the second tutorial here tackles (some of) these problems head on as it will save you a lot of bother later.

Reading Data Into A data.frame

The first data we will use here is a tab-delimited text file that compiles NHS waiting times for common diagnostic tests. We are going to begin by reading this file into an R data.frame. From within R or RStudio use the file menu to navigate to your tutorials folder, set this folder as your working directory and double click on the file to view it - now have a look at the first few lines using the readLines() function

readLines("diagnostics.txt", 5)
## [1] "Compiled By S Henderson July2011 from http://www.dh.gov.uk/en/Publicationsandstatistics/Statistics/Performancedataandstatistics/HospitalWaitingTimesandListStatistics/Diagnostics/DH_087009"
## [2] "Diagnostic\tDate\t week1 \t week2 \t week3 \t week4 \t week5 \t week6 \t week7 \t week8 \t week9 \t week10 \t week11 \t week12 \t week13 \tweek13_plus\torder"                              
## [3] "MRI\tMay-11\t27795\t30056\t22440\t14557\t5842\t3148\t591\t359\t166\t146\t135\t99\t81\t575\t25"                                                                                              
## [4] "CT\tMay-11\t22247\t20608\t12198\t6662\t2506\t1426\t275\t129\t79\t64\t82\t24\t28\t132\t25"                                                                                                   
## [5] "NON_OBSTETRIC_ULTRASOUND\tMay-11\t52803\t60869\t45965\t29359\t11166\t5365\t929\t524\t257\t123\t24\t11\t4\t19\t25"                                                                           

Firstly note that the first line is not data but some additional information about the file. The next line is like the header or column names that you see in a typical excel worksheet but all separated by '\t' . This character is the symbol for a tab as this is a tab delimited file. The 3rd to 5th lines are data with 2 columns of characters folowed by columns of numbers.

Run the following read.table() command to read the the text file and store it as a data.frame called df:

df = read.table(file = "diagnostics.txt", skip = 1, header = T, sep = "\t")
head(df)
##                 Diagnostic   Date week1 week2 week3 week4 week5 week6
## 1                      MRI May-11 27795 30056 22440 14557  5842  3148
## 2                       CT May-11 22247 20608 12198  6662  2506  1426
## 3 NON_OBSTETRIC_ULTRASOUND May-11 52803 60869 45965 29359 11166  5365
## 4             BARIUM_ENEMA May-11  1235  1077   565   281    86    69
## 5                DEXA_SCAN May-11  4610  5052  3178  2087   641   291
## 6    AUDIOLOGY_ASSESSMENTS May-11  8538  9334  8072  6227  2807  1650
##   week7 week8 week9 week10 week11 week12 week13 week13_plus order
## 1   591   359   166    146    135     99     81         575    25
## 2   275   129    79     64     82     24     28         132    25
## 3   929   524   257    123     24     11      4          19    25
## 4     2     3     0      0      0      0      0           1    25
## 5     3     5     2      1      2      0      0           1    25
## 6   552   434   205    118     76     41     34          41    25

The first argument is the name of the data file, the second tells R to skip 1 line as its not in the same format as the rest of the data file. The third tells R that the first row of data (excepting the skipped line) has header names that will be turned into variables of our data.frame. If a data file has no header information R will assign default variable names to the data.frame (X1, X2, X3 etc..). Finally the last option tells R what the character separating each data value is. There are two commons separators in text files: tabs which are represented by {sep=’\t’} and commas (often in .csv files) which would be {sep=’,’}.

The data seems to have read in OK at first glance? To be sure we let's look a little closer using the $ operator to access a column variable of the data.frame.

# The data.frame has dimensions of 375 rows (cases) and 17 columns
# (variables)
dim(df)
## [1] 375  17
# this is a numeric vector as we would expect
df$week1
##   [1] 27795 22247 52803  1235  4610  8538  8377   367  2495   772   853
##  [12]  5437  3036  2630  6570  5310 21720 17881 41248   984  3380  6473
##  [23]  6163   304  1935   683   703  4475  2498  2020 31216 24955 59659
##  [34]  1320  4939 10223  9063   390  2903   909  1177  6216  3781  3165
##  [45]  7661 28468 23737 57464  1321  4925  9715  8788   350  2506   857
##  [56]   976  5787  3356  2666  7165 31571 26435 62062  1576  5136 10208
##  [67]  9075   334  2966   904  1085  6168  3647  2932  7686 13521 12459
##  [78] 28222   673  2552  4795  4865   225  1755   411   472  3051  1652
##  [89]  1538  3663 30170 24112 56137  1620  4952  9123  8136   329  2657
## [100]  1023  1019  5720  3365  2975  7409 28722 22836 55525  1516  4611
## [111]  9254  8639   321  2892   867   993  5953  3396  2904  7480 30013
## [122] 23733 58660  1539  4866 10006  8983   336  2530  1186  1095  6388
## [133]  3583  3222  8008 22476 19031 46389  1307  3643  8641  7163   274
## [144]  2408   903   835  5058  2830  2383  6225 29550 23271 58319  1579
## [155]  4867 10196  8584   278  2643  1052  1127  6214  3299  2949  7723
## [166] 29628 23566 57564  1584  4845 10450  8557   289  2752  1075  1056
## [177]  6151  3470  2963  7909 25720 20806 49238  1430  4655  9985  7527
## [188]   262  2576   942   955  5286  3338  2610  6937 30517 24750 59960
## [199]  1727  4936 10994  9437   305  2872  1205  1205  6610  3963  3219
## [210]  8996 28621 22930 55442  2031  4889 10296  8944   283  2677  1076
## [221]  1075  6001  3721  3068  8194 29197 23224 54259  1904  5080 10115
## [232]  9274   316  2810   988  1214  5880  3533  3033  8499 29246 23359
## [243] 56350  1892  5525 10364  8732   296  2874   966   999  5951  3703
## [254]  3014  8665 14395 13025 28754  1207  2928  5632  5067   213  1330
## [265]   533   586  3341  2101  1535  4565 28097 22593 52421  1867  4614
## [276]  9864  8242   213  2678   985  1166  5638  3588  2820  7630 27077
## [287] 21465 52743  1916  4640  9897  8974   271  2416   999  1030  5650
## [298]  3386  2916  7695 28662 22961 56641  2079  4911  9513  8201   253
## [309]  2596   928  1076  5999  3305  3099  7840 22348 18106 43846  1670
## [320]  4037  7635  6146   245  1971   742   888  4981  2833  2550  6328
## [331] 26918 21625 52785  2044  4521  9518  8519   214  2715   981  1107
## [342]  5958  3308  2789  7609 26772 22289 54007  2142  5343  9658  8355
## [353]   204  2695   872  1087  5564  3319  2619  7491 21496 18251 45583
## [364]  1774  4490  8505  7512    30  2392   804  1005  4590  2656  2381
## [375]  6212
# this is not quite what we expect?
df$week13
##   [1] 81  28  4   0   0   34  99  0   19  7   8   68  29  17  42  65  127
##  [18] 76  11  0   0   20  164 0   10  5   10  72  50  16  47  10  9   0  
##  [35] 0   15  40  0   7   5   7   42  22  13  46  38  7   5   0   0   34 
##  [52] 16  0   19  4   6   40  17  16  18  33  10  5   0   0   32  6   0  
##  [69] 18  4   17  59  9   22  38  17  3   5   0   0   52  13  0   9   3  
##  [86] 8   21  3   9   20  10  1   2   -   1   78  8   -   3   6   3   16 
## [103] 12  6   17  16  2   1   -   -   53  5   -   1   2   7   24  12  9  
## [120] 32  17  1   3   -   9   38  5   -   -   1   5   13  5   6   23  9  
## [137] 3   4   2   3   42  5   -   -   -   6   10  -   3   16  3   4   1  
## [154] -   -   9   8   -   -   5   7   11  11  4   23  3   4   5   -   1  
## [171] 22  -   -   -   14  3   5   4   4   13  6   2   1   -   1   17  -  
## [188] -   -   7   4   7   2   4   7   1   2   -   -   -   8   1   -   -  
## [205] 1   4   10  3   11  8   1   4   1   -   -   6   1   -   -   -   1  
## [222] 9   -   1   3   7   26  3   1   -   13  28  1   -   -   4   35  17 
## [239] 10  28  12  41  27  -   1   25  5   -   -   -   2   59  17  6   46 
## [256] 2   34  1   -   -   6   3   -   -   1   4   6   4   10  7   2   31 
## [273] 2   -   -   11  1   -   2   -   1   2   6   3   3   7   11  2   -  
## [290] -   6   2   -   -   -   1   6   1   7   11  4   -   1   -   -   6  
## [307] 16  -   2   3   4   11  4   10  21  3   -   4   -   1   11  1   -  
## [324] 1   3   4   7   3   11  7   11  -   12  -   -   47  5   -   8   11 
## [341] 2   21  6   2   8   8   -   18  -   -   22  4   -   7   3   4   16 
## [358] 3   16  8   -   1   17  -   -   17  2   -   10  13  4   19  3   7  
## [375] 5  
## 55 Levels: - 0 1 10 11 12 127 13 14 15 16 164 17 18 19 2 20 21 22 ... 99

Unfortunately the ‘week13’ variable is not a numeric vector it’s a vector of mode ‘factor’. A factor is intended to represent categorical data (e.g. animal: cat/dog/rabbit/aardvark). The read.table() function has made ‘week13’ a factor because of the ‘-’ characters in this column- which are non-numeric. By default all columns with any non-numeric characters will be made into factors unless we specify otherwise.

Mistakes or errors due to missing or blank data are very common and it is always advisable to check newly read data before proceeding. Try summary(df) yourself. The R language has a special data-type with special properties to represent missing data: NA. If a text file contains the ‘NA’ string this will be automatically read without otherwise affecting the data-type of the column. We can work round this problem by using a slightly altered read.table function with a new option na.string='-' :

df = read.table("diagnostics.txt", skip = 1, sep = "\t", header = T, 
    na.string = "-")
dim(df)
## [1] 375  17
df$week13
##   [1]  81  28   4   0   0  34  99   0  19   7   8  68  29  17  42  65 127
##  [18]  76  11   0   0  20 164   0  10   5  10  72  50  16  47  10   9   0
##  [35]   0  15  40   0   7   5   7  42  22  13  46  38   7   5   0   0  34
##  [52]  16   0  19   4   6  40  17  16  18  33  10   5   0   0  32   6   0
##  [69]  18   4  17  59   9  22  38  17   3   5   0   0  52  13   0   9   3
##  [86]   8  21   3   9  20  10   1   2  NA   1  78   8  NA   3   6   3  16
## [103]  12   6  17  16   2   1  NA  NA  53   5  NA   1   2   7  24  12   9
## [120]  32  17   1   3  NA   9  38   5  NA  NA   1   5  13   5   6  23   9
## [137]   3   4   2   3  42   5  NA  NA  NA   6  10  NA   3  16   3   4   1
## [154]  NA  NA   9   8  NA  NA   5   7  11  11   4  23   3   4   5  NA   1
## [171]  22  NA  NA  NA  14   3   5   4   4  13   6   2   1  NA   1  17  NA
## [188]  NA  NA   7   4   7   2   4   7   1   2  NA  NA  NA   8   1  NA  NA
## [205]   1   4  10   3  11   8   1   4   1  NA  NA   6   1  NA  NA  NA   1
## [222]   9  NA   1   3   7  26   3   1  NA  13  28   1  NA  NA   4  35  17
## [239]  10  28  12  41  27  NA   1  25   5  NA  NA  NA   2  59  17   6  46
## [256]   2  34   1  NA  NA   6   3  NA  NA   1   4   6   4  10   7   2  31
## [273]   2  NA  NA  11   1  NA   2  NA   1   2   6   3   3   7  11   2  NA
## [290]  NA   6   2  NA  NA  NA   1   6   1   7  11   4  NA   1  NA  NA   6
## [307]  16  NA   2   3   4  11   4  10  21   3  NA   4  NA   1  11   1  NA
## [324]   1   3   4   7   3  11   7  11  NA  12  NA  NA  47   5  NA   8  11
## [341]   2  21   6   2   8   8  NA  18  NA  NA  22   4  NA   7   3   4  16
## [358]   3  16   8  NA   1  17  NA  NA  17   2  NA  10  13   4  19   3   7
## [375]   5
summary(df)
##                  Diagnostic       Date         week1           week2      
##  AUDIOLOGY_ASSESSMENTS: 25   Apr-10 : 15   Min.   :   30   Min.   :   48  
##  BARIUM_ENEMA         : 25   Apr-11 : 15   1st Qu.: 1698   1st Qu.: 1579  
##  COLONOSCOPY          : 25   Aug-09 : 15   Median : 4590   Median : 4313  
##  CT                   : 25   Aug-10 : 15   Mean   : 9781   Mean   : 9225  
##  CYSTOSCOPY           : 25   Dec-09 : 15   3rd Qu.: 9188   3rd Qu.: 8464  
##  DEXA_SCAN            : 25   Dec-10 : 15   Max.   :62062   Max.   :60869  
##  (Other)              :225   (Other):285                                  
##      week3           week4           week5           week6     
##  Min.   :   45   Min.   :   34   Min.   :   26   Min.   :   9  
##  1st Qu.:  922   1st Qu.:  696   1st Qu.:  432   1st Qu.: 250  
##  Median : 3110   Median : 2204   Median : 1365   Median : 751  
##  Mean   : 6960   Mean   : 4550   Mean   : 2498   Mean   :1246  
##  3rd Qu.: 7207   3rd Qu.: 4808   3rd Qu.: 2936   3rd Qu.:1620  
##  Max.   :45965   Max.   :34563   Max.   :19508   Max.   :9365  
##                                                                
##      week7          week8           week9           week10     
##  Min.   :   1   Min.   :  0.0   Min.   :  0.0   Min.   :  0.0  
##  1st Qu.:  32   1st Qu.: 19.0   1st Qu.: 12.0   1st Qu.:  8.0  
##  Median :  92   Median : 56.5   Median : 37.0   Median : 23.0  
##  Mean   : 152   Mean   : 92.1   Mean   : 54.4   Mean   : 38.4  
##  3rd Qu.: 213   3rd Qu.:115.2   3rd Qu.: 72.8   3rd Qu.: 50.5  
##  Max.   :1412   Max.   :609.0   Max.   :444.0   Max.   :320.0  
##  NA's   :2      NA's   :11      NA's   :21      NA's   :32     
##      week11          week12           week13     week13_plus   
##  Min.   :  0.0   Min.   :  0.00   Min.   :  0   Min.   :  0.0  
##  1st Qu.:  4.0   1st Qu.:  3.25   1st Qu.:  3   1st Qu.:  5.0  
##  Median : 13.5   Median :  9.50   Median :  7   Median : 16.0  
##  Mean   : 25.7   Mean   : 17.37   Mean   : 13   Mean   : 36.9  
##  3rd Qu.: 32.0   3rd Qu.: 20.75   3rd Qu.: 16   3rd Qu.: 41.0  
##  Max.   :284.0   Max.   :128.00   Max.   :164   Max.   :575.0  
##  NA's   :39      NA's   :61       NA's   :74    NA's   :46     
##      order   
##  Min.   : 1  
##  1st Qu.: 7  
##  Median :13  
##  Mean   :13  
##  3rd Qu.:19  
##  Max.   :25  
##              

This now works fine, the column/variable 'week13 is numeric but contains the special NA data-type where ‘-’ characters were previously. If you find a strange error message telling you that the column numbers are wrong, or if the data is all messed up, you need to look through the file for special characters (apostrophes ‘ are a common problem), header or occasionally footer lines, or spaces in the data.

If you have a very large file and perhaps many human data entry errors or ambiguities e.g. the country column contains “UK”, “U.K.”, “uk”, ‘United-Kingdom” and so on, then R is possibly not the best data cleaning tool. I would recommend instead something like Google Refine prior to loading into R, or even…for smaller files…shudder…Excel.

Exercises

  1. Sometimes there is an unhelpful little comment after the last line of data e.g. “data-stops-here”. Can you write a command to check for this? (Hint ReadLines(), a command like head() but different)
  2. Occasionally with huge data files people want to just read in a smaller subset of the data and use this to explore and develop their analysis before laoding the whole dataset. Can you load the first hundred lines into a data.frame?

A Short Diversion on Dates

There is one other frequent annoyance exemplified by the diagnostics data in that the “Date” column has been recognised as a factor. R actually has a specific Date mode but as there are so many ways to write dates and times it is perhaps easier to fix post loading. Here we have an extra problem because the format is a month and year - not as Date requires a day, month and year.

# this code doesn’t give us what we want
df$Date[1:3]
## [1] May-11 May-11 May-11
## 25 Levels: Apr-10 Apr-11 Aug-09 Aug-10 Dec-09 Dec-10 Feb-10 ... Sep-10
as.Date(df$Date[1:3], format = "%b-%y")
## [1] NA NA NA

The format options here are correct, have a look at the as.Date help page and the examples at the bottom. The %b is May, then a hyphen -, and the %y denotes the last two digits of year as opposed to the full year which would be %Y 2011, but the base R version will not create a valid date without a day too. Instead we can use the zoo package which has many methods for dealing with time-series data.

library(zoo)
## Attaching package: 'zoo'
## The following object(s) are masked from 'package:base':
## 
## as.Date, as.Date.numeric
months = as.yearmon(df$Date, format = "%b-%y")
months[1:3]
## [1] "May 2011" "May 2011" "May 2011"
# If we convert this back to a Date then it will by default become the
# first day of this month.
df$Date = as.Date(months)

Dealing with times and dates in R or indeed most computing languages is quite tedious but this is the price paid for exacting precision. Further down the line having date and time data in a Date or time (POSIXct) format enables R to recognise time series data and automatically select the correct analysis or plot.

Exercises

  1. Perhaps in some cases the Monthly zoo data would be better coerced to an end of the month Date (e.g. payroll?). Try this instead.

Integrating Multiple Files

Sometimes data doesn't come in a single file but you will want to integrate it all into a single data.frame or similar. the next example uses a small set of miRNA and gene expression files for Kidney Renal Cancers taken from The Cancer Genome Atlas. We will write a script to load the data then extend it to make a function that generalises to load other data of the same format and offers a few different options for processing.

Begin by having a look at one of the expression data files in the KIRC folder:

# there are a number of files with very long somewhat redundant names
path = "./KIRC"
expFiles = dir(path)
head(expFiles)
## [1] "unc.edu__AgilentG4502A_07_3__TCGA-A3-3306-01A-01R-0864-07__gene_expression_analysis.txt"
## [2] "unc.edu__AgilentG4502A_07_3__TCGA-A3-3307-01A-01R-0864-07__gene_expression_analysis.txt"
## [3] "unc.edu__AgilentG4502A_07_3__TCGA-A3-3308-01A-01R-0864-07__gene_expression_analysis.txt"
## [4] "unc.edu__AgilentG4502A_07_3__TCGA-A3-3311-01A-01R-0864-07__gene_expression_analysis.txt"
## [5] "unc.edu__AgilentG4502A_07_3__TCGA-A3-3313-01A-01R-0864-07__gene_expression_analysis.txt"
## [6] "unc.edu__AgilentG4502A_07_3__TCGA-A3-3316-01A-01R-0864-07__gene_expression_analysis.txt"
# NB note this trick below of putting brackets around the whole expression
# to force it to echo to the screen
(numFiles = length(expFiles))
## [1] 16
# the first few lines of the 1st file look like this
readLines(paste(path, "/", expFiles[1], sep = ""), n = 8)
## [1] "barcode\tgene symbol\tvalue"                           
## [2] "TCGA-A3-3306-01A-01R-0864-07\t15E1.2\t-2.328"          
## [3] "TCGA-A3-3306-01A-01R-0864-07\t2'-PDE\t-0.6748125"      
## [4] "TCGA-A3-3306-01A-01R-0864-07\t7A5\t2.9785"             
## [5] "TCGA-A3-3306-01A-01R-0864-07\tA1BG\t-1.13666666666667" 
## [6] "TCGA-A3-3306-01A-01R-0864-07\tA2BP1\t-1.35366666666667"
## [7] "TCGA-A3-3306-01A-01R-0864-07\tA2M\t0.66125"            
## [8] "TCGA-A3-3306-01A-01R-0864-07\tA2ML1\t0.5305"           

There is a header line, and 3 columns separated by tabs. The first column contains the same sample identifier (a subset of the file name) in every row, the second a unique gene symbol for each row and the third an expression value for the gene. The other files are similar and have the same gene symbol order (take my word for it). We wish to take the second column from any file and append the 3rd column of each file to the right within a single data.frame. We could do this by looping through the files and then cbind() each new value column to our data.frame. However for large datasets it is not recommended that we use programming loops that continually reassign large data objects. Instead we will pre-assign a whole object and insert our data into it like so:

# we can read the geneSymbols from any file they are all in the same order
# I set colClasses to 'character' as the gene symbols are all unique so no
# need for them to be factors
geneSymbols = read.delim(paste(path, "/", expFiles[1], sep = ""), 
    colClasses = "character", header = T)[, 2]
geneSymbols[1:4]
## [1] "15E1.2" "2'-PDE" "7A5"    "A1BG"  
# next I work out the length of the columns from the length of geneSymbols
colLength = length(geneSymbols)
# I pre-assign a numeric matrix to hold data values as this is more memory
# efficient with enough rows for each gene symbol and enough columns for
# each file
dataMat = matrix(0, colLength, numFiles)

# generally we avoid big loops in R and later we will discuss how..  in
# this case it's unavoidable though. This may take a while but that's the
# fault of the people who put the data in lots of separate files
for (i in 1:numFiles) dataMat[, i] = read.delim(paste(path, "/", 
    expFiles[i], sep = ""), header = T)$value
head(dataMat[, 1:4])
##         [,1]  [,2]  [,3]    [,4]
## [1,] -2.3280  6035  5296 -0.8715
## [2,] -0.6748   374  8219 -0.3122
## [3,]  2.9785 12071 12868  2.3395
## [4,] -1.1367  4997  4887 -0.7248
## [5,] -1.3537  5613  5274 -0.9292
## [6,]  0.6613 13432 12576  1.3818
summary(dataMat[, 1:4])
##        V1               V2              V3              V4        
##  Min.   :-7.891   Min.   :    1   Min.   :    1   Min.   :-6.792  
##  1st Qu.:-0.744   1st Qu.: 3155   1st Qu.: 3137   1st Qu.:-0.619  
##  Median :-0.015   Median : 7066   Median : 6966   Median : 0.033  
##  Mean   :-0.071   Mean   : 6749   Mean   : 6665   Mean   : 0.030  
##  3rd Qu.: 0.566   3rd Qu.:10039   3rd Qu.: 9900   3rd Qu.: 0.575  
##  Max.   : 8.962   Max.   :14017   Max.   :13819   Max.   : 8.499  

On first look this seems OK as all the columns have numeric data, but look a little closer and you see that the range of values is very different and the 2nd and 3rd column look like integers (gene expression data shouldn't be integers). Something has gone wrong… let's try reading the 3rd column (values) of just the second file

valuesf2 = read.delim(paste(path, "/", expFiles[2], sep = ""), header = T)$value
valuesf2[1:10]
##  [1] -2.148            -0.06075          1.3675           
##  [4] -1.2785           -1.67766666666667 2.89375          
##  [7] 0.135             0.2315            2.55816666666667 
## [10] 0.7285           
## 14017 Levels: -0.000166666666666662 -0.00025 ... null
class(valuesf2)
## [1] "factor"

There's the problem it has read in the data as a factor not numeric. Why? The answer is there in the factor levels right on the end. There are null values in this file and column. By default R recognises the 'NA' string for missing data not 'null'. Incidentally null has a distinct meaning to NA explained here. The TCGA people probably meant NA rather than null and we can make R recognise it as such.

# I add the argument 'na.string=null'
valuesf2 = read.delim(paste(path, "/", expFiles[2], sep = ""), header = T, 
    na.strings = "null")$value
valuesf2[1:10]
##  [1] -2.14800 -0.06075  1.36750 -1.27850 -1.67767  2.89375  0.13500
##  [8]  0.23150  2.55817  0.72850
# these values look correct so we'll modify our loop and try again
for (i in 1:numFiles) dataMat[, i] = read.delim(paste(path, "/", 
    expFiles[i], sep = ""), header = T, na.string = "null")$value
head(dataMat[, 1:4])
##         [,1]     [,2]    [,3]    [,4]
## [1,] -2.3280 -2.14800 -1.3042 -0.8715
## [2,] -0.6748 -0.06075  0.2044 -0.3122
## [3,]  2.9785  1.36750  2.1360  2.3395
## [4,] -1.1367 -1.27850 -1.1227 -0.7248
## [5,] -1.3537 -1.67767 -1.2925 -0.9292
## [6,]  0.6613  2.89375  1.7855  1.3818
summary(dataMat[, 1:4])
##        V1               V2               V3               V4        
##  Min.   :-7.891   Min.   :-7.848   Min.   :-6.556   Min.   :-6.792  
##  1st Qu.:-0.744   1st Qu.:-0.634   1st Qu.:-0.618   1st Qu.:-0.619  
##  Median :-0.015   Median : 0.004   Median :-0.004   Median : 0.033  
##  Mean   :-0.071   Mean   : 0.004   Mean   : 0.025   Mean   : 0.030  
##  3rd Qu.: 0.566   3rd Qu.: 0.578   3rd Qu.: 0.538   3rd Qu.: 0.575  
##  Max.   : 8.962   Max.   : 8.533   Max.   : 8.720   Max.   : 8.499  
##                   NA's   :1        NA's   :1                        

Exercises

  1. OK..so what is the difference between NA and NULL then?

Some of you (?) may think that this tutorial is getting repetitive and we faced a very similar problem earlier - but this is really what these files are like, I'm not making it up. Such problems are a common occurrence in science - get used to it. Now we need to add column names that match the samples and add the gene symbol column to form a data.frame.

# the filenames contain the sample name but are redundant and very long
expFiles[1:3]
## [1] "unc.edu__AgilentG4502A_07_3__TCGA-A3-3306-01A-01R-0864-07__gene_expression_analysis.txt"
## [2] "unc.edu__AgilentG4502A_07_3__TCGA-A3-3307-01A-01R-0864-07__gene_expression_analysis.txt"
## [3] "unc.edu__AgilentG4502A_07_3__TCGA-A3-3308-01A-01R-0864-07__gene_expression_analysis.txt"
# I can pick out the central part of the name using the substr() function
samples = substr(expFiles, 30, 57)
colnames(dataMat) = samples
# now add the gene symbols
dataFrame = data.frame(geneSymbols, dataMat)
summary(dataFrame[, 1:5])
##   geneSymbols    TCGA.A3.3306.01A.01R.0864.07 TCGA.A3.3307.01A.01R.0864.07
##  15E1.2 :    1   Min.   :-7.891               Min.   :-7.848              
##  2'-PDE :    1   1st Qu.:-0.744               1st Qu.:-0.634              
##  7A5    :    1   Median :-0.015               Median : 0.004              
##  A1BG   :    1   Mean   :-0.071               Mean   : 0.004              
##  A2BP1  :    1   3rd Qu.: 0.566               3rd Qu.: 0.578              
##  A2M    :    1   Max.   : 8.962               Max.   : 8.533              
##  (Other):17808                                NA's   :1                   
##  TCGA.A3.3308.01A.01R.0864.07 TCGA.A3.3311.01A.01R.0864.07
##  Min.   :-6.556               Min.   :-6.792              
##  1st Qu.:-0.618               1st Qu.:-0.619              
##  Median :-0.004               Median : 0.033              
##  Mean   : 0.025               Mean   : 0.030              
##  3rd Qu.: 0.538               3rd Qu.: 0.575              
##  Max.   : 8.720               Max.   : 8.499              
##  NA's   :1                                                

Essentially we have the required data now. If we tidy up our RHistory and delete the non-essential commands we used to inspect and test the data then we have a Script that we can probably use to load other TCGA data files. Sometimes however other operations maybe required such as normalisation, data, centering or even imputation of missing data. If we want these options then it is perhaps best to add these as a Function.

# as a rule of thumb if the sd between sample is reduced then thats
# probably better this is not a universal rule you maywant to plot other
# data types
mean(apply(dataMat, 1, sd), na.rm = T)
## [1] 0.5502
mean(apply(scale(dataMat, scale = T, center = F), 1, sd), na.rm = T)
## [1] 0.3745
mean(apply(scale(dataMat, scale = F, center = T), 1, sd), na.rm = T)
## [1] 0.5506
mean(apply(scale(dataMat, scale = T, center = T), 1, sd), na.rm = T)
## [1] 0.3753

Probably at least scaling is useful for this data - whereas centering is more of a useful plotting feature. On the other hand the original data values are probably ratio and the original scaling has a naturalitic understanding to it. So our function will give both options.


readTCGA = function(path = "./", namesubstr = c(30, 57), scale = T, 
    center = F) {
    # the file names
    expFiles = dir(path)
    # number of files
    numFiles = length(expFiles)
    # the gene symbols
    geneSymbols = read.delim(paste(path, "/", expFiles[1], sep = ""), colClasses = "character", 
        header = T)[, 2]
    # the number of rows
    colLength = length(geneSymbols)
    # a matrix for the data
    dataMat = matrix(0, colLength, numFiles)
    # load the data into matrix
    for (i in 1:numFiles) dataMat[, i] = read.delim(paste(path, "/", expFiles[i], 
        sep = ""), header = T, na.string = "null")$value
    # fileNames change so I add an option to change the substr for future
    # flexibility
    samples = substr(expFiles, namesubstr[1], namesubstr[2])
    # add those sample names as a header
    colnames(dataMat) = samples
    # data from separate experiments is often best normalised by scaling
    # and/or centering the TCGA files often are already somewhat similar
    # though
    if (scale == T) 
        dataMat = scale(dataMat, scale = T, center = F)
    if (center == T) 
        dataMat = scale(dataMat, scale = F, center = T)
    # put it all together and return() it
    return(data.frame(geneSymbols, dataMat))
}

# we can test this function on another folder of TCGA files they are
# samples from another type of cancer
dataFrame2 = readTCGA("./Ovary")
head(dataFrame2[, 1:5])
##   geneSymbols TCGA.01.0630.11A.01R.0363.07 TCGA.01.0631.11A.01R.0363.07
## 1      15E1.2                     -1.28927                      -0.9601
## 2      2'-PDE                     -0.35622                      -0.2245
## 3         7A5                      0.05864                      -0.1909
## 4        A1BG                     -0.48738                       0.1565
## 5       A2BP1                      0.28057                       1.0078
## 6         A2M                      1.09795                       1.3679
##   TCGA.01.0633.11A.01R.0363.07 TCGA.01.0636.11A.01R.0363.07
## 1                      -1.4455                      -1.6749
## 2                      -0.3238                      -0.2981
## 3                      -0.4735                      -0.7562
## 4                      -0.4343                      -0.7533
## 5                       0.1430                       0.2862
## 6                       1.9433                       0.5732

This function automates a lot of steps but you probably cannot expect that the TCGA - or anyone else- will always stick to this format. However the function is heavily annotated and should it break in future you should easily be able to work out what is going on and make changes - or perhaps adapt it for similar datafiles.

Exercises

  1. Can you add a new option to our readTCGA function that will write the final dataFrame to a text file only if we supply a fileName e.g. file='myNewData'.
  2. Some array data are ratio of measurements whilst some are very skewed. Often the data is best log-transformed. Add an option to log transform the data (base-2 please) before it is scaled or centered.

Other Data Files or File-Sytems

In the main data-files will be rectangular tabular, usually text data. However R also has well established packages for relational databses such as MySQL and Oracle. If you are more adventurous there are even experimental packages for reading and writing data (and more than this) from distributed databases such as Hadoop/HBase. Then there are a whole set of packages that will connect to the web via HTTP requests and then parse HTML and XML documents into R data objects.

Then again there will always be just plain non-standard or free-text data files. The R language has the tools to read these line by line and parse them using regular expressions. It is tedious to write such file parsers but for some applciations this will be the only way. We do not have the time (or lets be honest expertise!) to go into the detail of such methods in this course. However here is a short example that reads email files borrowed from Machine Learning for Hackers, try and understand as much as you can:

# I don't expect you to fully understand this example that would require
# both the email files and a good knowledge of regular expressions

path = file.path("/pathtodata")


# this parent function calls the functions below
parse.email <- function(path) {
    # 1
    full.msg <- msg.full(path)
    # 2
    date <- get.date(full.msg)
    # 3
    from <- get.from(full.msg)
    # 4
    subj <- get.subject(full.msg)
    # 5
    msg <- get.msg(full.msg)
    return(c(date, from, subj, msg, path))
}

# 1 reads everything in the file
msg.full <- function(path) {
    con <- file(path, open = "rt", encoding = "latin1")
    msg <- readLines(con)
    close(con)
    return(msg)
}

# 2 Returns the date a given email message was received
get.date <- function(msg.vec) {
    date.grep <- grepl("^Date: ", msg.vec)
    date.grep <- which(date.grep == TRUE)
    date <- msg.vec[date.grep[1]]
    date <- strsplit(date, "\\+|\\-|: ")[[1]][2]
    date <- gsub("^\\s+|\\s+$", "", date)
    return(strtrim(date, 25))
}

# 3 Returns the email address of the sender for a given email message
get.from <- function(msg.vec) {
    from <- msg.vec[grepl("From: ", msg.vec)]
    from <- strsplit(from, "[\":<> ]")[[1]]
    from <- from[which(from != "" & from != " ")]
    return(from[grepl("@", from)][1])
}

# 4 Returns the subject string for a given email message
get.subject <- function(msg.vec) {
    subj <- msg.vec[grepl("Subject: ", msg.vec)]
    if (length(subj) > 0) {
        return(strsplit(subj, "Subject: ")[[1]][2])
    } else {
        return("")
    }
}

# 5 this functions takes the fullmsg but returns only the message body.
get.msg <- function(msg.vec) {
    msg <- msg.vec[seq(which(msg.vec == "")[1] + 1, length(msg.vec), 1)]
    return(paste(msg, collapse = "\n"))
}

Whilst we have introduced many of the common obstacles in reading data - we have in passing touched on the way that you should be using R to make your workflow repeatable moving from exploratory work on the console, to scripting, and towards the end within well annotated functions. In the next section we will be looking more closely at the use of functions. Indeed if you found this last section of code a little difficult to follow you will probably want to return and have another look after the next part.