The sad truth of a lot of science and most data analysis is that a great deal of your time is spent on mundane preparation. You convert file-types, replace missing values, parse umlauts, rejig date and time formats, reshape, subset, transform, summarise…and so on…then you can begin some analysis.
Most tutorials ignore these steps and present beginners with idealised pre-loaded data so they can dive straight into plotting or statistical models. This is fine but usually leads to great frustration when the beginner tries to apply their newly learned analytical methods to their own data. Often they find they cannot load the data (“What is this strange error message?”) or get it into the format they had in the tutorial.. without excel that is .. and some may give up at this point and succumb to finishing the analysis in excel.
So the second tutorial here tackles (some of) these problems head on as it will save you a lot of bother later.
The first data we will use here is a tab-delimited text file that compiles NHS waiting times for common diagnostic tests. We are going to begin by reading this file into an R data.frame. From within R or RStudio use the file menu to navigate to your tutorials folder, set this folder as your working directory and double click on the file to view it - now have a look at the first few lines using the readLines() function
readLines("diagnostics.txt", 5)
## [1] "Compiled By S Henderson July2011 from http://www.dh.gov.uk/en/Publicationsandstatistics/Statistics/Performancedataandstatistics/HospitalWaitingTimesandListStatistics/Diagnostics/DH_087009"
## [2] "Diagnostic\tDate\t week1 \t week2 \t week3 \t week4 \t week5 \t week6 \t week7 \t week8 \t week9 \t week10 \t week11 \t week12 \t week13 \tweek13_plus\torder"
## [3] "MRI\tMay-11\t27795\t30056\t22440\t14557\t5842\t3148\t591\t359\t166\t146\t135\t99\t81\t575\t25"
## [4] "CT\tMay-11\t22247\t20608\t12198\t6662\t2506\t1426\t275\t129\t79\t64\t82\t24\t28\t132\t25"
## [5] "NON_OBSTETRIC_ULTRASOUND\tMay-11\t52803\t60869\t45965\t29359\t11166\t5365\t929\t524\t257\t123\t24\t11\t4\t19\t25"
Firstly note that the first line is not data but some additional information about the file. The next line is like the header or column names that you see in a typical excel worksheet but all separated by '\t' . This character is the symbol for a tab as this is a tab delimited file. The 3rd to 5th lines are data with 2 columns of characters folowed by columns of numbers.
Run the following read.table() command to read the the text file and store it as a data.frame called df:
df = read.table(file = "diagnostics.txt", skip = 1, header = T, sep = "\t")
head(df)
## Diagnostic Date week1 week2 week3 week4 week5 week6
## 1 MRI May-11 27795 30056 22440 14557 5842 3148
## 2 CT May-11 22247 20608 12198 6662 2506 1426
## 3 NON_OBSTETRIC_ULTRASOUND May-11 52803 60869 45965 29359 11166 5365
## 4 BARIUM_ENEMA May-11 1235 1077 565 281 86 69
## 5 DEXA_SCAN May-11 4610 5052 3178 2087 641 291
## 6 AUDIOLOGY_ASSESSMENTS May-11 8538 9334 8072 6227 2807 1650
## week7 week8 week9 week10 week11 week12 week13 week13_plus order
## 1 591 359 166 146 135 99 81 575 25
## 2 275 129 79 64 82 24 28 132 25
## 3 929 524 257 123 24 11 4 19 25
## 4 2 3 0 0 0 0 0 1 25
## 5 3 5 2 1 2 0 0 1 25
## 6 552 434 205 118 76 41 34 41 25
The first argument is the name of the data file, the second tells R to skip 1 line as its not in the same format as the rest of the data file. The third tells R that the first row of data (excepting the skipped line) has header names that will be turned into variables of our data.frame. If a data file has no header information R will assign default variable names to the data.frame (X1, X2, X3 etc..). Finally the last option tells R what the character separating each data value is. There are two commons separators in text files: tabs which are represented by {sep=’\t’} and commas (often in .csv files) which would be {sep=’,’}.
The data seems to have read in OK at first glance? To be sure we let's look a little closer using the $ operator to access a column variable of the data.frame.
# The data.frame has dimensions of 375 rows (cases) and 17 columns
# (variables)
dim(df)
## [1] 375 17
# this is a numeric vector as we would expect
df$week1
## [1] 27795 22247 52803 1235 4610 8538 8377 367 2495 772 853
## [12] 5437 3036 2630 6570 5310 21720 17881 41248 984 3380 6473
## [23] 6163 304 1935 683 703 4475 2498 2020 31216 24955 59659
## [34] 1320 4939 10223 9063 390 2903 909 1177 6216 3781 3165
## [45] 7661 28468 23737 57464 1321 4925 9715 8788 350 2506 857
## [56] 976 5787 3356 2666 7165 31571 26435 62062 1576 5136 10208
## [67] 9075 334 2966 904 1085 6168 3647 2932 7686 13521 12459
## [78] 28222 673 2552 4795 4865 225 1755 411 472 3051 1652
## [89] 1538 3663 30170 24112 56137 1620 4952 9123 8136 329 2657
## [100] 1023 1019 5720 3365 2975 7409 28722 22836 55525 1516 4611
## [111] 9254 8639 321 2892 867 993 5953 3396 2904 7480 30013
## [122] 23733 58660 1539 4866 10006 8983 336 2530 1186 1095 6388
## [133] 3583 3222 8008 22476 19031 46389 1307 3643 8641 7163 274
## [144] 2408 903 835 5058 2830 2383 6225 29550 23271 58319 1579
## [155] 4867 10196 8584 278 2643 1052 1127 6214 3299 2949 7723
## [166] 29628 23566 57564 1584 4845 10450 8557 289 2752 1075 1056
## [177] 6151 3470 2963 7909 25720 20806 49238 1430 4655 9985 7527
## [188] 262 2576 942 955 5286 3338 2610 6937 30517 24750 59960
## [199] 1727 4936 10994 9437 305 2872 1205 1205 6610 3963 3219
## [210] 8996 28621 22930 55442 2031 4889 10296 8944 283 2677 1076
## [221] 1075 6001 3721 3068 8194 29197 23224 54259 1904 5080 10115
## [232] 9274 316 2810 988 1214 5880 3533 3033 8499 29246 23359
## [243] 56350 1892 5525 10364 8732 296 2874 966 999 5951 3703
## [254] 3014 8665 14395 13025 28754 1207 2928 5632 5067 213 1330
## [265] 533 586 3341 2101 1535 4565 28097 22593 52421 1867 4614
## [276] 9864 8242 213 2678 985 1166 5638 3588 2820 7630 27077
## [287] 21465 52743 1916 4640 9897 8974 271 2416 999 1030 5650
## [298] 3386 2916 7695 28662 22961 56641 2079 4911 9513 8201 253
## [309] 2596 928 1076 5999 3305 3099 7840 22348 18106 43846 1670
## [320] 4037 7635 6146 245 1971 742 888 4981 2833 2550 6328
## [331] 26918 21625 52785 2044 4521 9518 8519 214 2715 981 1107
## [342] 5958 3308 2789 7609 26772 22289 54007 2142 5343 9658 8355
## [353] 204 2695 872 1087 5564 3319 2619 7491 21496 18251 45583
## [364] 1774 4490 8505 7512 30 2392 804 1005 4590 2656 2381
## [375] 6212
# this is not quite what we expect?
df$week13
## [1] 81 28 4 0 0 34 99 0 19 7 8 68 29 17 42 65 127
## [18] 76 11 0 0 20 164 0 10 5 10 72 50 16 47 10 9 0
## [35] 0 15 40 0 7 5 7 42 22 13 46 38 7 5 0 0 34
## [52] 16 0 19 4 6 40 17 16 18 33 10 5 0 0 32 6 0
## [69] 18 4 17 59 9 22 38 17 3 5 0 0 52 13 0 9 3
## [86] 8 21 3 9 20 10 1 2 - 1 78 8 - 3 6 3 16
## [103] 12 6 17 16 2 1 - - 53 5 - 1 2 7 24 12 9
## [120] 32 17 1 3 - 9 38 5 - - 1 5 13 5 6 23 9
## [137] 3 4 2 3 42 5 - - - 6 10 - 3 16 3 4 1
## [154] - - 9 8 - - 5 7 11 11 4 23 3 4 5 - 1
## [171] 22 - - - 14 3 5 4 4 13 6 2 1 - 1 17 -
## [188] - - 7 4 7 2 4 7 1 2 - - - 8 1 - -
## [205] 1 4 10 3 11 8 1 4 1 - - 6 1 - - - 1
## [222] 9 - 1 3 7 26 3 1 - 13 28 1 - - 4 35 17
## [239] 10 28 12 41 27 - 1 25 5 - - - 2 59 17 6 46
## [256] 2 34 1 - - 6 3 - - 1 4 6 4 10 7 2 31
## [273] 2 - - 11 1 - 2 - 1 2 6 3 3 7 11 2 -
## [290] - 6 2 - - - 1 6 1 7 11 4 - 1 - - 6
## [307] 16 - 2 3 4 11 4 10 21 3 - 4 - 1 11 1 -
## [324] 1 3 4 7 3 11 7 11 - 12 - - 47 5 - 8 11
## [341] 2 21 6 2 8 8 - 18 - - 22 4 - 7 3 4 16
## [358] 3 16 8 - 1 17 - - 17 2 - 10 13 4 19 3 7
## [375] 5
## 55 Levels: - 0 1 10 11 12 127 13 14 15 16 164 17 18 19 2 20 21 22 ... 99
Unfortunately the ‘week13’ variable is not a numeric vector it’s a vector of mode ‘factor’. A factor is intended to represent categorical data (e.g. animal: cat/dog/rabbit/aardvark). The read.table() function has made ‘week13’ a factor because of the ‘-’ characters in this column- which are non-numeric. By default all columns with any non-numeric characters will be made into factors unless we specify otherwise.
Mistakes or errors due to missing or blank data are very common and it is always advisable to check newly read data before proceeding. Try summary(df) yourself. The R language has a special data-type with special properties to represent missing data: NA. If a text file contains the ‘NA’ string this will be automatically read without otherwise affecting the data-type of the column. We can work round this problem by using a slightly altered read.table function with a new option na.string='-' :
df = read.table("diagnostics.txt", skip = 1, sep = "\t", header = T,
na.string = "-")
dim(df)
## [1] 375 17
df$week13
## [1] 81 28 4 0 0 34 99 0 19 7 8 68 29 17 42 65 127
## [18] 76 11 0 0 20 164 0 10 5 10 72 50 16 47 10 9 0
## [35] 0 15 40 0 7 5 7 42 22 13 46 38 7 5 0 0 34
## [52] 16 0 19 4 6 40 17 16 18 33 10 5 0 0 32 6 0
## [69] 18 4 17 59 9 22 38 17 3 5 0 0 52 13 0 9 3
## [86] 8 21 3 9 20 10 1 2 NA 1 78 8 NA 3 6 3 16
## [103] 12 6 17 16 2 1 NA NA 53 5 NA 1 2 7 24 12 9
## [120] 32 17 1 3 NA 9 38 5 NA NA 1 5 13 5 6 23 9
## [137] 3 4 2 3 42 5 NA NA NA 6 10 NA 3 16 3 4 1
## [154] NA NA 9 8 NA NA 5 7 11 11 4 23 3 4 5 NA 1
## [171] 22 NA NA NA 14 3 5 4 4 13 6 2 1 NA 1 17 NA
## [188] NA NA 7 4 7 2 4 7 1 2 NA NA NA 8 1 NA NA
## [205] 1 4 10 3 11 8 1 4 1 NA NA 6 1 NA NA NA 1
## [222] 9 NA 1 3 7 26 3 1 NA 13 28 1 NA NA 4 35 17
## [239] 10 28 12 41 27 NA 1 25 5 NA NA NA 2 59 17 6 46
## [256] 2 34 1 NA NA 6 3 NA NA 1 4 6 4 10 7 2 31
## [273] 2 NA NA 11 1 NA 2 NA 1 2 6 3 3 7 11 2 NA
## [290] NA 6 2 NA NA NA 1 6 1 7 11 4 NA 1 NA NA 6
## [307] 16 NA 2 3 4 11 4 10 21 3 NA 4 NA 1 11 1 NA
## [324] 1 3 4 7 3 11 7 11 NA 12 NA NA 47 5 NA 8 11
## [341] 2 21 6 2 8 8 NA 18 NA NA 22 4 NA 7 3 4 16
## [358] 3 16 8 NA 1 17 NA NA 17 2 NA 10 13 4 19 3 7
## [375] 5
summary(df)
## Diagnostic Date week1 week2
## AUDIOLOGY_ASSESSMENTS: 25 Apr-10 : 15 Min. : 30 Min. : 48
## BARIUM_ENEMA : 25 Apr-11 : 15 1st Qu.: 1698 1st Qu.: 1579
## COLONOSCOPY : 25 Aug-09 : 15 Median : 4590 Median : 4313
## CT : 25 Aug-10 : 15 Mean : 9781 Mean : 9225
## CYSTOSCOPY : 25 Dec-09 : 15 3rd Qu.: 9188 3rd Qu.: 8464
## DEXA_SCAN : 25 Dec-10 : 15 Max. :62062 Max. :60869
## (Other) :225 (Other):285
## week3 week4 week5 week6
## Min. : 45 Min. : 34 Min. : 26 Min. : 9
## 1st Qu.: 922 1st Qu.: 696 1st Qu.: 432 1st Qu.: 250
## Median : 3110 Median : 2204 Median : 1365 Median : 751
## Mean : 6960 Mean : 4550 Mean : 2498 Mean :1246
## 3rd Qu.: 7207 3rd Qu.: 4808 3rd Qu.: 2936 3rd Qu.:1620
## Max. :45965 Max. :34563 Max. :19508 Max. :9365
##
## week7 week8 week9 week10
## Min. : 1 Min. : 0.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 32 1st Qu.: 19.0 1st Qu.: 12.0 1st Qu.: 8.0
## Median : 92 Median : 56.5 Median : 37.0 Median : 23.0
## Mean : 152 Mean : 92.1 Mean : 54.4 Mean : 38.4
## 3rd Qu.: 213 3rd Qu.:115.2 3rd Qu.: 72.8 3rd Qu.: 50.5
## Max. :1412 Max. :609.0 Max. :444.0 Max. :320.0
## NA's :2 NA's :11 NA's :21 NA's :32
## week11 week12 week13 week13_plus
## Min. : 0.0 Min. : 0.00 Min. : 0 Min. : 0.0
## 1st Qu.: 4.0 1st Qu.: 3.25 1st Qu.: 3 1st Qu.: 5.0
## Median : 13.5 Median : 9.50 Median : 7 Median : 16.0
## Mean : 25.7 Mean : 17.37 Mean : 13 Mean : 36.9
## 3rd Qu.: 32.0 3rd Qu.: 20.75 3rd Qu.: 16 3rd Qu.: 41.0
## Max. :284.0 Max. :128.00 Max. :164 Max. :575.0
## NA's :39 NA's :61 NA's :74 NA's :46
## order
## Min. : 1
## 1st Qu.: 7
## Median :13
## Mean :13
## 3rd Qu.:19
## Max. :25
##
This now works fine, the column/variable 'week13 is numeric but contains the special NA data-type where ‘-’ characters were previously. If you find a strange error message telling you that the column numbers are wrong, or if the data is all messed up, you need to look through the file for special characters (apostrophes ‘ are a common problem), header or occasionally footer lines, or spaces in the data.
If you have a very large file and perhaps many human data entry errors or ambiguities e.g. the country column contains “UK”, “U.K.”, “uk”, ‘United-Kingdom” and so on, then R is possibly not the best data cleaning tool. I would recommend instead something like Google Refine prior to loading into R, or even…for smaller files…shudder…Excel.
There is one other frequent annoyance exemplified by the diagnostics data in that the “Date” column has been recognised as a factor. R actually has a specific Date mode but as there are so many ways to write dates and times it is perhaps easier to fix post loading. Here we have an extra problem because the format is a month and year - not as Date requires a day, month and year.
# this code doesn’t give us what we want
df$Date[1:3]
## [1] May-11 May-11 May-11
## 25 Levels: Apr-10 Apr-11 Aug-09 Aug-10 Dec-09 Dec-10 Feb-10 ... Sep-10
as.Date(df$Date[1:3], format = "%b-%y")
## [1] NA NA NA
The format options here are correct, have a look at the as.Date help page and the examples at the bottom. The %b is May, then a hyphen -, and the %y denotes the last two digits of year as opposed to the full year which would be %Y 2011, but the base R version will not create a valid date without a day too. Instead we can use the zoo package which has many methods for dealing with time-series data.
library(zoo)
## Attaching package: 'zoo'
## The following object(s) are masked from 'package:base':
##
## as.Date, as.Date.numeric
months = as.yearmon(df$Date, format = "%b-%y")
months[1:3]
## [1] "May 2011" "May 2011" "May 2011"
# If we convert this back to a Date then it will by default become the
# first day of this month.
df$Date = as.Date(months)
Dealing with times and dates in R or indeed most computing languages is quite tedious but this is the price paid for exacting precision. Further down the line having date and time data in a Date or time (POSIXct) format enables R to recognise time series data and automatically select the correct analysis or plot.
Sometimes data doesn't come in a single file but you will want to integrate it all into a single data.frame or similar. the next example uses a small set of miRNA and gene expression files for Kidney Renal Cancers taken from The Cancer Genome Atlas. We will write a script to load the data then extend it to make a function that generalises to load other data of the same format and offers a few different options for processing.
Begin by having a look at one of the expression data files in the KIRC folder:
# there are a number of files with very long somewhat redundant names
path = "./KIRC"
expFiles = dir(path)
head(expFiles)
## [1] "unc.edu__AgilentG4502A_07_3__TCGA-A3-3306-01A-01R-0864-07__gene_expression_analysis.txt"
## [2] "unc.edu__AgilentG4502A_07_3__TCGA-A3-3307-01A-01R-0864-07__gene_expression_analysis.txt"
## [3] "unc.edu__AgilentG4502A_07_3__TCGA-A3-3308-01A-01R-0864-07__gene_expression_analysis.txt"
## [4] "unc.edu__AgilentG4502A_07_3__TCGA-A3-3311-01A-01R-0864-07__gene_expression_analysis.txt"
## [5] "unc.edu__AgilentG4502A_07_3__TCGA-A3-3313-01A-01R-0864-07__gene_expression_analysis.txt"
## [6] "unc.edu__AgilentG4502A_07_3__TCGA-A3-3316-01A-01R-0864-07__gene_expression_analysis.txt"
# NB note this trick below of putting brackets around the whole expression
# to force it to echo to the screen
(numFiles = length(expFiles))
## [1] 16
# the first few lines of the 1st file look like this
readLines(paste(path, "/", expFiles[1], sep = ""), n = 8)
## [1] "barcode\tgene symbol\tvalue"
## [2] "TCGA-A3-3306-01A-01R-0864-07\t15E1.2\t-2.328"
## [3] "TCGA-A3-3306-01A-01R-0864-07\t2'-PDE\t-0.6748125"
## [4] "TCGA-A3-3306-01A-01R-0864-07\t7A5\t2.9785"
## [5] "TCGA-A3-3306-01A-01R-0864-07\tA1BG\t-1.13666666666667"
## [6] "TCGA-A3-3306-01A-01R-0864-07\tA2BP1\t-1.35366666666667"
## [7] "TCGA-A3-3306-01A-01R-0864-07\tA2M\t0.66125"
## [8] "TCGA-A3-3306-01A-01R-0864-07\tA2ML1\t0.5305"
There is a header line, and 3 columns separated by tabs. The first column contains the same sample identifier (a subset of the file name) in every row, the second a unique gene symbol for each row and the third an expression value for the gene. The other files are similar and have the same gene symbol order (take my word for it). We wish to take the second column from any file and append the 3rd column of each file to the right within a single data.frame. We could do this by looping through the files and then cbind() each new value column to our data.frame. However for large datasets it is not recommended that we use programming loops that continually reassign large data objects. Instead we will pre-assign a whole object and insert our data into it like so:
# we can read the geneSymbols from any file they are all in the same order
# I set colClasses to 'character' as the gene symbols are all unique so no
# need for them to be factors
geneSymbols = read.delim(paste(path, "/", expFiles[1], sep = ""),
colClasses = "character", header = T)[, 2]
geneSymbols[1:4]
## [1] "15E1.2" "2'-PDE" "7A5" "A1BG"
# next I work out the length of the columns from the length of geneSymbols
colLength = length(geneSymbols)
# I pre-assign a numeric matrix to hold data values as this is more memory
# efficient with enough rows for each gene symbol and enough columns for
# each file
dataMat = matrix(0, colLength, numFiles)
# generally we avoid big loops in R and later we will discuss how.. in
# this case it's unavoidable though. This may take a while but that's the
# fault of the people who put the data in lots of separate files
for (i in 1:numFiles) dataMat[, i] = read.delim(paste(path, "/",
expFiles[i], sep = ""), header = T)$value
head(dataMat[, 1:4])
## [,1] [,2] [,3] [,4]
## [1,] -2.3280 6035 5296 -0.8715
## [2,] -0.6748 374 8219 -0.3122
## [3,] 2.9785 12071 12868 2.3395
## [4,] -1.1367 4997 4887 -0.7248
## [5,] -1.3537 5613 5274 -0.9292
## [6,] 0.6613 13432 12576 1.3818
summary(dataMat[, 1:4])
## V1 V2 V3 V4
## Min. :-7.891 Min. : 1 Min. : 1 Min. :-6.792
## 1st Qu.:-0.744 1st Qu.: 3155 1st Qu.: 3137 1st Qu.:-0.619
## Median :-0.015 Median : 7066 Median : 6966 Median : 0.033
## Mean :-0.071 Mean : 6749 Mean : 6665 Mean : 0.030
## 3rd Qu.: 0.566 3rd Qu.:10039 3rd Qu.: 9900 3rd Qu.: 0.575
## Max. : 8.962 Max. :14017 Max. :13819 Max. : 8.499
On first look this seems OK as all the columns have numeric data, but look a little closer and you see that the range of values is very different and the 2nd and 3rd column look like integers (gene expression data shouldn't be integers). Something has gone wrong… let's try reading the 3rd column (values) of just the second file
valuesf2 = read.delim(paste(path, "/", expFiles[2], sep = ""), header = T)$value
valuesf2[1:10]
## [1] -2.148 -0.06075 1.3675
## [4] -1.2785 -1.67766666666667 2.89375
## [7] 0.135 0.2315 2.55816666666667
## [10] 0.7285
## 14017 Levels: -0.000166666666666662 -0.00025 ... null
class(valuesf2)
## [1] "factor"
There's the problem it has read in the data as a factor not numeric. Why? The answer is there in the factor levels right on the end. There are null values in this file and column. By default R recognises the 'NA' string for missing data not 'null'. Incidentally null has a distinct meaning to NA explained here. The TCGA people probably meant NA rather than null and we can make R recognise it as such.
# I add the argument 'na.string=null'
valuesf2 = read.delim(paste(path, "/", expFiles[2], sep = ""), header = T,
na.strings = "null")$value
valuesf2[1:10]
## [1] -2.14800 -0.06075 1.36750 -1.27850 -1.67767 2.89375 0.13500
## [8] 0.23150 2.55817 0.72850
# these values look correct so we'll modify our loop and try again
for (i in 1:numFiles) dataMat[, i] = read.delim(paste(path, "/",
expFiles[i], sep = ""), header = T, na.string = "null")$value
head(dataMat[, 1:4])
## [,1] [,2] [,3] [,4]
## [1,] -2.3280 -2.14800 -1.3042 -0.8715
## [2,] -0.6748 -0.06075 0.2044 -0.3122
## [3,] 2.9785 1.36750 2.1360 2.3395
## [4,] -1.1367 -1.27850 -1.1227 -0.7248
## [5,] -1.3537 -1.67767 -1.2925 -0.9292
## [6,] 0.6613 2.89375 1.7855 1.3818
summary(dataMat[, 1:4])
## V1 V2 V3 V4
## Min. :-7.891 Min. :-7.848 Min. :-6.556 Min. :-6.792
## 1st Qu.:-0.744 1st Qu.:-0.634 1st Qu.:-0.618 1st Qu.:-0.619
## Median :-0.015 Median : 0.004 Median :-0.004 Median : 0.033
## Mean :-0.071 Mean : 0.004 Mean : 0.025 Mean : 0.030
## 3rd Qu.: 0.566 3rd Qu.: 0.578 3rd Qu.: 0.538 3rd Qu.: 0.575
## Max. : 8.962 Max. : 8.533 Max. : 8.720 Max. : 8.499
## NA's :1 NA's :1
Some of you (?) may think that this tutorial is getting repetitive and we faced a very similar problem earlier - but this is really what these files are like, I'm not making it up. Such problems are a common occurrence in science - get used to it. Now we need to add column names that match the samples and add the gene symbol column to form a data.frame.
# the filenames contain the sample name but are redundant and very long
expFiles[1:3]
## [1] "unc.edu__AgilentG4502A_07_3__TCGA-A3-3306-01A-01R-0864-07__gene_expression_analysis.txt"
## [2] "unc.edu__AgilentG4502A_07_3__TCGA-A3-3307-01A-01R-0864-07__gene_expression_analysis.txt"
## [3] "unc.edu__AgilentG4502A_07_3__TCGA-A3-3308-01A-01R-0864-07__gene_expression_analysis.txt"
# I can pick out the central part of the name using the substr() function
samples = substr(expFiles, 30, 57)
colnames(dataMat) = samples
# now add the gene symbols
dataFrame = data.frame(geneSymbols, dataMat)
summary(dataFrame[, 1:5])
## geneSymbols TCGA.A3.3306.01A.01R.0864.07 TCGA.A3.3307.01A.01R.0864.07
## 15E1.2 : 1 Min. :-7.891 Min. :-7.848
## 2'-PDE : 1 1st Qu.:-0.744 1st Qu.:-0.634
## 7A5 : 1 Median :-0.015 Median : 0.004
## A1BG : 1 Mean :-0.071 Mean : 0.004
## A2BP1 : 1 3rd Qu.: 0.566 3rd Qu.: 0.578
## A2M : 1 Max. : 8.962 Max. : 8.533
## (Other):17808 NA's :1
## TCGA.A3.3308.01A.01R.0864.07 TCGA.A3.3311.01A.01R.0864.07
## Min. :-6.556 Min. :-6.792
## 1st Qu.:-0.618 1st Qu.:-0.619
## Median :-0.004 Median : 0.033
## Mean : 0.025 Mean : 0.030
## 3rd Qu.: 0.538 3rd Qu.: 0.575
## Max. : 8.720 Max. : 8.499
## NA's :1
Essentially we have the required data now. If we tidy up our RHistory and delete the non-essential commands we used to inspect and test the data then we have a Script that we can probably use to load other TCGA data files. Sometimes however other operations maybe required such as normalisation, data, centering or even imputation of missing data. If we want these options then it is perhaps best to add these as a Function.
# as a rule of thumb if the sd between sample is reduced then thats
# probably better this is not a universal rule you maywant to plot other
# data types
mean(apply(dataMat, 1, sd), na.rm = T)
## [1] 0.5502
mean(apply(scale(dataMat, scale = T, center = F), 1, sd), na.rm = T)
## [1] 0.3745
mean(apply(scale(dataMat, scale = F, center = T), 1, sd), na.rm = T)
## [1] 0.5506
mean(apply(scale(dataMat, scale = T, center = T), 1, sd), na.rm = T)
## [1] 0.3753
Probably at least scaling is useful for this data - whereas centering is more of a useful plotting feature. On the other hand the original data values are probably ratio and the original scaling has a naturalitic understanding to it. So our function will give both options.
readTCGA = function(path = "./", namesubstr = c(30, 57), scale = T,
center = F) {
# the file names
expFiles = dir(path)
# number of files
numFiles = length(expFiles)
# the gene symbols
geneSymbols = read.delim(paste(path, "/", expFiles[1], sep = ""), colClasses = "character",
header = T)[, 2]
# the number of rows
colLength = length(geneSymbols)
# a matrix for the data
dataMat = matrix(0, colLength, numFiles)
# load the data into matrix
for (i in 1:numFiles) dataMat[, i] = read.delim(paste(path, "/", expFiles[i],
sep = ""), header = T, na.string = "null")$value
# fileNames change so I add an option to change the substr for future
# flexibility
samples = substr(expFiles, namesubstr[1], namesubstr[2])
# add those sample names as a header
colnames(dataMat) = samples
# data from separate experiments is often best normalised by scaling
# and/or centering the TCGA files often are already somewhat similar
# though
if (scale == T)
dataMat = scale(dataMat, scale = T, center = F)
if (center == T)
dataMat = scale(dataMat, scale = F, center = T)
# put it all together and return() it
return(data.frame(geneSymbols, dataMat))
}
# we can test this function on another folder of TCGA files they are
# samples from another type of cancer
dataFrame2 = readTCGA("./Ovary")
head(dataFrame2[, 1:5])
## geneSymbols TCGA.01.0630.11A.01R.0363.07 TCGA.01.0631.11A.01R.0363.07
## 1 15E1.2 -1.28927 -0.9601
## 2 2'-PDE -0.35622 -0.2245
## 3 7A5 0.05864 -0.1909
## 4 A1BG -0.48738 0.1565
## 5 A2BP1 0.28057 1.0078
## 6 A2M 1.09795 1.3679
## TCGA.01.0633.11A.01R.0363.07 TCGA.01.0636.11A.01R.0363.07
## 1 -1.4455 -1.6749
## 2 -0.3238 -0.2981
## 3 -0.4735 -0.7562
## 4 -0.4343 -0.7533
## 5 0.1430 0.2862
## 6 1.9433 0.5732
This function automates a lot of steps but you probably cannot expect that the TCGA - or anyone else- will always stick to this format. However the function is heavily annotated and should it break in future you should easily be able to work out what is going on and make changes - or perhaps adapt it for similar datafiles.
In the main data-files will be rectangular tabular, usually text data. However R also has well established packages for relational databses such as MySQL and Oracle. If you are more adventurous there are even experimental packages for reading and writing data (and more than this) from distributed databases such as Hadoop/HBase. Then there are a whole set of packages that will connect to the web via HTTP requests and then parse HTML and XML documents into R data objects.
Then again there will always be just plain non-standard or free-text data files. The R language has the tools to read these line by line and parse them using regular expressions. It is tedious to write such file parsers but for some applciations this will be the only way. We do not have the time (or lets be honest expertise!) to go into the detail of such methods in this course. However here is a short example that reads email files borrowed from Machine Learning for Hackers, try and understand as much as you can:
# I don't expect you to fully understand this example that would require
# both the email files and a good knowledge of regular expressions
path = file.path("/pathtodata")
# this parent function calls the functions below
parse.email <- function(path) {
# 1
full.msg <- msg.full(path)
# 2
date <- get.date(full.msg)
# 3
from <- get.from(full.msg)
# 4
subj <- get.subject(full.msg)
# 5
msg <- get.msg(full.msg)
return(c(date, from, subj, msg, path))
}
# 1 reads everything in the file
msg.full <- function(path) {
con <- file(path, open = "rt", encoding = "latin1")
msg <- readLines(con)
close(con)
return(msg)
}
# 2 Returns the date a given email message was received
get.date <- function(msg.vec) {
date.grep <- grepl("^Date: ", msg.vec)
date.grep <- which(date.grep == TRUE)
date <- msg.vec[date.grep[1]]
date <- strsplit(date, "\\+|\\-|: ")[[1]][2]
date <- gsub("^\\s+|\\s+$", "", date)
return(strtrim(date, 25))
}
# 3 Returns the email address of the sender for a given email message
get.from <- function(msg.vec) {
from <- msg.vec[grepl("From: ", msg.vec)]
from <- strsplit(from, "[\":<> ]")[[1]]
from <- from[which(from != "" & from != " ")]
return(from[grepl("@", from)][1])
}
# 4 Returns the subject string for a given email message
get.subject <- function(msg.vec) {
subj <- msg.vec[grepl("Subject: ", msg.vec)]
if (length(subj) > 0) {
return(strsplit(subj, "Subject: ")[[1]][2])
} else {
return("")
}
}
# 5 this functions takes the fullmsg but returns only the message body.
get.msg <- function(msg.vec) {
msg <- msg.vec[seq(which(msg.vec == "")[1] + 1, length(msg.vec), 1)]
return(paste(msg, collapse = "\n"))
}
Whilst we have introduced many of the common obstacles in reading data - we have in passing touched on the way that you should be using R to make your workflow repeatable moving from exploratory work on the console, to scripting, and towards the end within well annotated functions. In the next section we will be looking more closely at the use of functions. Indeed if you found this last section of code a little difficult to follow you will probably want to return and have another look after the next part.