Covers basic ideas behind getting data ready for analysis
What this course depends on
What would be useful
(Screenshot of Excel spreadsheet)
Data is ideally in neat rows and columns, one observation per row, one variable per column
(My observation: I've seen some of the crappiest, dirtiest data imaginable in Excel)
(Screenshot of a flat file, with indecipherable field at start of each row, more gibberish at end of row)
You may have to parse data to extract the parts you care about.
(Screenshot of JSON file)
The data may be highly structured, but may not be in a row-column format that you're accustomed to. It may need to be reorganized into such a row-column form.
(Screenshot of medical data)
Sometimes the data is in a form like “Take one tablet by mouth, take one half-tablet with grapefruit juice”. We may need to extract pieces of that information, such as the number of tablets.
(Screenshot of MySQL and MongoDB diagrams)
Data may be in structured databases, and you may need to be able to collect data from those databases in raw form and process it.
Data may be on the local file system, or it may be on the web, so methods are needed to obtain it no matter where it lives.
One site that will be used in the course: http://open.baltimorecity.gov
The pipeline for data analysis:
Raw data > Processing script > tidy data > data analysis > data communication
Many courses on data analysis tend to skip the first three steps. This course will focus on them.
“Data are values of quantitative or qualitative variables, belonging to a set of items”
Qualitative: country of origin, sex, treatment. Quantitative: height, weight, blood pressure.
Raw data
http://en.wikipedia.org/wiki/Raw_data
Processed data
http://en.wikipedia.org/wiki/Computer_data_processing
(Screenshot of human genome processing machine)
Example discusses how DNA data is collected, and various samples are combined. Considering that all of this is done by a machine, I don't think this was a terribly relevant example. I guess the idea is that a data scientist had to set up the procedure.
You know the raw data is in the right format if you ran no software on it, did not manipulate any of the numbers, did not remove any data, and did not summarize it in any way.
https://github.com/jtleek/datasharing
Some other important tips
https://github.com/jtleek/datasharing
I found the phrase “not contained in the tidy data” confusing. It could refer to “information” or “variables” (maybe some variables in the raw data are dropped). I think it refers to “information”. For example, the tidy data may show numbers, but it won't show what unit those numbers are in.
“Experimental study” is an unfortunate phrase, it may be an observational study
Some other important tips
In some cases it will not be possible to script every step. In that case you should provide instructions like: "Step 1: take the raw file, run version 3.1.2 of summarize software with parameters a=1, b=2, c=3".
I.e., be as detailed as possible; err on the side of too much detail rather than not enough.
Example: a study of austerity programs had serious miscalculations, which were brought to light only because the processing steps were clearly described. So if you want to be mocked by Colbert, be sure to follow these guidelines!
if(!file.exists("data")) {
dir.create("data")
}
My note: the last point is a red herring; it is useful to download any type of file.
You can go to any site that has data and copy the URL to the data file.
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile="./data/cameras.csv", method="curl")
list.files("./data")
## [1] "cameras.csv"
dateDownloaded <- date()
dateDownloaded
## [1] "Sun Jan 12 21:37:44 2014"
It is important to capture the date that you downloaded the data, because the data might change from time to time, so the date will allow someone else to select the appropriate set of data if they are trying to reproduce your work.
https://data.baltimorecity.gov/Transportation/Baltimore-Fixed-Speed-Cameras/dz54-2aru
if(!file.exists("data")) {
dir.create("data")
}
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.csv?accessType=DOWNLOAD"
download.file(fileUrl, destfile="./data/cameras.csv", method="curl")
dateDownloaded <- date()
Note about the above: The video does not show “./data/” at the start of destfile, which is clearly an error. It also does not show any verification that the working directory is where we want it to be, i.e., where the /data directory should exist, or be created. A truly reproducible script would probably confirm this.
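A minimal sketch of such a check (my addition, not from the video; the sentinel file name is hypothetical):

# Fail fast if the working directory is not the project root before touching ./data;
# "my-project.Rproj" is a hypothetical file known to sit in that root.
if (!file.exists("my-project.Rproj")) {
    stop("Unexpected working directory: ", getwd())
}
if (!file.exists("data")) {
    dir.create("data")
}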
Video says read.table can create problems with big data “unless you read it in chunks”, but no indication is given of how to do that.
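One common approach (my sketch, not from the course, and the file name is hypothetical) is to open a connection and pull a fixed number of rows at a time with read.table's nrows argument:

con <- file("./data/big.csv", "r")                 # hypothetical large file
cols <- strsplit(readLines(con, n = 1), ",")[[1]]  # read the header row once
repeat {
    chunk <- tryCatch(
        read.table(con, sep = ",", nrows = 10000,
                   col.names = cols, stringsAsFactors = FALSE),
        error = function(e) NULL)  # read.table errors once no lines remain
    if (is.null(chunk)) break
    # ... process the chunk here, e.g. accumulate running summaries ...
}
close(con)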
The read.table function defaults to sep="" (fields separated by any whitespace) and header=FALSE.
cameraData <- read.table("./data/cameras.csv")
## Error: line 1 did not have 13 elements
head(cameraData)
## Error: object 'cameraData' not found
What happened: The read.table function by default splits fields on whitespace, but this file is comma-delimited and its values contain spaces, so lines parse into inconsistent numbers of fields. The data is not loaded, and cameraData is never defined. Also, the file has a header row listing field names.
cameraData <- read.table("./data/cameras.csv", sep=",", header=TRUE)
head(cameraData)
## address direction street crossStreet
## 1 S CATON AVE & BENSON AVE N/B Caton Ave Benson Ave
## 2 S CATON AVE & BENSON AVE S/B Caton Ave Benson Ave
Now the head command shows the first several rows of the contents of cameraData (I include only two here).
You could also use read.csv here. It defaults to sep=“,” and header=TRUE, so you don't need to include those parameters (although there is no harm if you do).
cameraData <- read.csv("./data/cameras.csv")
head(cameraData)
## address direction street crossStreet
## 1 S CATON AVE & BENSON AVE N/B Caton Ave Benson Ave
## 2 S CATON AVE & BENSON AVE S/B Caton Ave Benson Ave
In the instructor's experience, the biggest trouble with reading flat files is quotation marks (' or ") placed inside data values; setting quote="" often resolves this. (What I think this means is that any such quote marks are then treated simply as data, rather than as delimiters.)
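A sketch of that fix (my addition):

# Treat stray quote characters as ordinary data, not as field delimiters
cameraData <- read.csv("./data/cameras.csv", quote = "")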
Excel files are still probably the most widely used format for sharing data
if(!file.exists("data")){dir.create("data")}
fileUrl <- "https://data.baltimorecity.gov/api/views/dz54-2aru/rows.xlsx?accessType=DOWNLOAD"
download.file(fileUrl, destfile="./data/cameras.xlsx", method="curl")
dateDownloaded <- date()
(Note that first you need to install the xlsx package. There are other packages that can read Excel data.)
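For completeness, the one-time setup (my addition; note that xlsx also depends on Java via rJava):

install.packages("xlsx")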
library(xlsx)
cameraData <- read.xlsx("./data/cameras.xlsx", sheetIndex=1, header=TRUE)
head(cameraData)
address direction street crossStreet intersection
1 S CATON & BENSON AVE N/B Caton Ave Benson Ave Caton Ave & Benson Ave
2 S CATON & BENSON AVE S/B Caton Ave Benson Ave Caton Ave & Benson Ave
1 (39.2693779962, -76.66881825297)
2 (39.2693779962, -76.66881825297)
Note that in the above, we see two rows of results, but they can't fit on a single line, so they are each split into two.
Also note that we must specify a sheetIndex, because an Excel file can contain many sheets. header=TRUE means that the first row in the sheet will contain field names.
colIndex <- 2:3
rowIndex <- 1:4
cameraData <- read.xlsx("./data/cameras.xlsx", sheetIndex=1, colIndex=colIndex, rowIndex=rowIndex)
The above reads data from the first worksheet, only from columns 2 and 3, and rows 1 through 4. Sometimes the data in a worksheet may not be in the upper left corner (in fact, there may be two “tables” in different parts of a single worksheet); sometimes we want to read only part of the data in any case.
My note: I don't see how a database is easy to distribute!
http://en.wikipedia.org/wiki/XML
From: http://www.w3schools.com/xml/simple.xml
<!-- Edited by XMLSpy -->
<breakfast_menu>
<food>
<name>Belgian Waffles</name>
<price>$5.95</price>
<description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
<calories>650</calories>
</food>
<food>
<name>Strawberry Belgian Waffles</name>
<price>$7.95</price>
<description>Light Belgian waffles covered with strawberries and whipped cream</description>
<calories>900</calories>
</food>
<food>
<name>Berry-Berry Belgian Waffles</name>
<price>$8.95</price>
<description>Light Belgian waffles covered with an assortment of fresh berries and whipped cream</description>
<calories>900</calories>
</food>
...
library(XML)
fileUrl <- "http://www.w3schools.com/xml/simple.xml"
doc <- xmlTreeParse(fileUrl, useInternalNodes=TRUE)
rootNode <- xmlRoot(doc)
xmlName(rootNode) # outputs the tag name of the document root element
[1] "breakfast_menu"
names(rootNode) # outputs tag names of direct children of root node
food food food food food
"food" "food" "food" "food" "food"
rootNode[[1]]
<food>
<name>Belgian Waffles</name>
<price>$5.95</price>
<description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
<calories>650</calories>
</food>
rootNode[[1]][[1]]
<name>Belgian Waffles</name>
xmlSApply(rootNode,xmlValue) # applies xmlValue to each child (each <food> element) of rootNode
"Belgian Waffles$5.95Two of our famous Belgian Waffles with plenty of real maple syrup"
"Strawberry Belgian Waffles$7.95Light Belgian waffles covered with strawberries and whipped cream"
...
So the xmlValue function returns only the text value of an XML node, not the tag name or attributes.
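My addition: to keep those pieces separate, apply xmlValue to the children of a single node instead; the result is a character vector named by tag.

# Apply xmlValue to each child of the first <food> node; returns a vector
# with elements named name, price, description, calories
xmlSApply(rootNode[[1]], xmlValue)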
http://www.stat.berkeley.edu/~statcur/Workshop2/Presentations/XML.pdf
My tip: If you want to learn a lot about XPath, get the O'Reilly book XSLT (Doug Tidwell).
# gets all <name> nodes within rootNode, sends them to function xmlValue
xpathSApply(rootNode, "//name", xmlValue)
[1] "Belgian Waffles" "Strawberry Belgian Waffles" "Berry-Berry Belgian Waffles"
[4] "French Toast" "Homestyle Breakfast"
xpathSApply(rootNode, "//price", xmlValue)
[1] "$5.95" "$7.95" "$8.95" "$4.50" "$6.95"
http://espn.go.com/nfl/team/_/name/bal/baltimore-ravens
This is HTML, which is similar to XML (tags, attributes).
Right-click on page and view source. (Screenshot of HTML from the Ravens page.)
Up to this point, most of the code in these notes was transcribed from the videos (plus a few sketches of my own). The following is “live” and will actually run. You need to have the XML library installed for it to work.
library(XML)
fileUrl <- "http://espn.go.com/nfl/team/_/name/bal/baltimore-ravens"
doc <- htmlTreeParse(fileUrl, useInternal = TRUE)
scores <- xpathSApply(doc, "//li[@class='score']", xmlValue)
teams <- xpathSApply(doc, "//li[@class='team-name']", xmlValue)
scores
## list()
teams
## [1] "San Francisco" "Dallas" "Washington" "New Orleans"
## [5] "Cincinnati" "Pittsburgh" "Cleveland" "Carolina"
## [9] "Indianapolis" "Tampa Bay" "Atlanta" "Cincinnati"
## [13] "Pittsburgh" "Tennessee" "New Orleans" "San Diego"
## [17] "Miami" "Jacksonville" "Houston" "Cleveland"
Note: The scores show “list()” above because the list is empty; as of now, there are no scores on the site. Instead they're listing the 2014 schedule. An example of how things can change!
http://en.wikipedia.org/wiki/JSON
https://api.github.com/users/jtleek/repos
library(jsonlite)
jsonData <- fromJSON("https://api.github.com/users/jtleek/repos")
names(jsonData)
## [1] "id" "name" "full_name"
## [4] "owner" "private" "html_url"
## [7] "description" "fork" "url"
## [10] "forks_url" "keys_url" "collaborators_url"
## [13] "teams_url" "hooks_url" "issue_events_url"
## [16] "events_url" "assignees_url" "branches_url"
## [19] "tags_url" "blobs_url" "git_tags_url"
## [22] "git_refs_url" "trees_url" "statuses_url"
## [25] "languages_url" "stargazers_url" "contributors_url"
## [28] "subscribers_url" "subscription_url" "commits_url"
## [31] "git_commits_url" "comments_url" "issue_comment_url"
## [34] "contents_url" "compare_url" "merges_url"
## [37] "archive_url" "downloads_url" "issues_url"
## [40] "pulls_url" "milestones_url" "notifications_url"
## [43] "labels_url" "releases_url" "created_at"
## [46] "updated_at" "pushed_at" "git_url"
## [49] "ssh_url" "clone_url" "svn_url"
## [52] "homepage" "size" "stargazers_count"
## [55] "watchers_count" "language" "has_issues"
## [58] "has_downloads" "has_wiki" "forks_count"
## [61] "mirror_url" "open_issues_count" "forks"
## [64] "open_issues" "watchers" "default_branch"
names(jsonData$owner) # shows the names in the owner node
## [1] "login" "id" "avatar_url"
## [4] "gravatar_id" "url" "html_url"
## [7] "followers_url" "following_url" "gists_url"
## [10] "starred_url" "subscriptions_url" "organizations_url"
## [13] "repos_url" "events_url" "received_events_url"
## [16] "type" "site_admin"
jsonData$owner$login # shows the login value within owner for each repo
## [1] "jtleek" "jtleek" "jtleek" "jtleek" "jtleek" "jtleek" "jtleek"
## [8] "jtleek" "jtleek" "jtleek" "jtleek" "jtleek" "jtleek" "jtleek"
## [15] "jtleek" "jtleek" "jtleek" "jtleek" "jtleek" "jtleek" "jtleek"
## [22] "jtleek" "jtleek" "jtleek" "jtleek" "jtleek" "jtleek" "jtleek"
## [29] "jtleek" "jtleek"
myjson <- toJSON(iris[1:3, ], pretty = TRUE) # just show the JSON for first 3 rows
# pretty=TRUE means the file is indented for easy reading
cat(myjson)
## [
## {
## "Sepal.Length" : 5.1,
## "Sepal.Width" : 3.5,
## "Petal.Length" : 1.4,
## "Petal.Width" : 0.2,
## "Species" : "setosa"
## },
## {
## "Sepal.Length" : 4.9,
## "Sepal.Width" : 3,
## "Petal.Length" : 1.4,
## "Petal.Width" : 0.2,
## "Species" : "setosa"
## },
## {
## "Sepal.Length" : 4.7,
## "Sepal.Width" : 3.2,
## "Petal.Length" : 1.3,
## "Petal.Width" : 0.2,
## "Species" : "setosa"
## }
## ]
Convert the JSON back to a data frame:
iris2 <- fromJSON(myjson)
head(iris2)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
http://www.r-bloggers.com/new-package-jsonlite-a-smarter-json-encoderdecoder/
My note: JSON can be a difficult topic. The www.json.org site is the authoritative source in my opinion.
library(data.table)
DF = data.frame(x = rnorm(9), y = rep(c("a", "b", "c"), each = 3), z = rnorm(9))
head(DF, 3)
## x y z
## 1 0.324544 a -1.07157
## 2 -0.121757 a 2.05970
## 3 0.006384 a -0.08257
DT = data.table(x = rnorm(9), y = rep(c("a", "b", "c"), each = 3), z = rnorm(9))
head(DT, 3)
## x y z
## 1: 0.41135 a 0.9590
## 2: -0.01255 a -0.4972
## 3: -2.40613 a -0.3190
tables()
## NAME NROW MB COLS KEY
## [1,] DT 9 1 x,y,z
## Total: 1MB
Subsetting rows in a data table is the same as in a data frame
DT[2, ]
## x y z
## 1: -0.01255 a -0.4972
DT[DT$y == "a", ]
## x y z
## 1: 0.41135 a 0.9590
## 2: -0.01255 a -0.4972
## 3: -2.40613 a -0.3190
But with a data table, you can subset with only a single index, in which case it is assumed you are subsetting by row.
DT[c(2, 3)] # return rows 2 and 3 of DT
## x y z
## 1: -0.01255 a -0.4972
## 2: -2.40613 a -0.3190
This is where data tables are very different from data frames.
DT[, c(2, 3)]
## [1] 2 3
(The video gives no hint why the above returns a vector containing 2 and 3. The reason: the second argument of a data table is treated as an expression to evaluate, not as column indices, and the expression c(2, 3) simply evaluates to itself.)
{
x = 1
y = 2
}
k = {
print(10)
5
} # this prints 10, but sets k to 5 because 5 is last statement in expression
## [1] 10
print(k) # this prints 5
## [1] 5
DT[, list(mean(x), sum(z))] # pass a list of expressions to evaluate on the columns
## V1 V2
## 1: -0.1398 4.923
DT[, table(y)] # this applies the table function to the column y
## y
## a b c
## 3 3 3
Therefore, if you pass a list as the second index, it is taken as a list of expressions to evaluate, and the results come back as a data table.
So how do you simply get a column? Specify its name.
DT[, y]
## [1] "a" "a" "a" "b" "b" "b" "c" "c" "c"
But what if you need a column of the data table but don't know its name? I don't know if this is the best technique, but it works (here it returns the first column, x):
DT[, DT[[1]]]
## [1] 0.41135 -0.01255 -2.40613 -1.35985 0.60222 0.45792 -0.86639 1.73805
## [9] 0.17702
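Two alternatives that seem cleaner to me (my additions, using standard data.table/R behavior):

DT[[1]]        # a data table is a list, so [[ extracts a column by position
DT[, list(x)]  # returns a one-column data table rather than a bare vector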
Use := to add a new column to a data table. (The idiomatic form is DT[, w := z^2]; the backtick form below is just how knitr echoes it.)
DT[, `:=`(w, z^2)]
## x y z w
## 1: 0.41135 a 0.9590 0.9197
## 2: -0.01255 a -0.4972 0.2472
## 3: -2.40613 a -0.3190 0.1018
## 4: -1.35985 b -0.8536 0.7286
## 5: 0.60222 b 0.5701 0.3250
## 6: 0.45792 b 1.5983 2.5547
## 7: -0.86639 c 0.4750 0.2256
## 8: 1.73805 c 2.0440 4.1780
## 9: 0.17702 c 0.9463 0.8955
Note that the above outputs DT at the same time.
When you add a new column to a data frame, R creates an entirely new copy of it. This is not the case with a data table, which makes data tables more memory-efficient.
Also note that assigning an existing data table to a new variable does not create a new copy of the data table, as it would with a data frame. Therefore, in the code below, the y column is set to 2 for all rows in both the original DT data table and the DT2 data table; in fact, they both refer to a single data table. (A sketch of how to get a true copy follows the output below.)
DT2 <- DT
DT[, `:=`(y, 2)]
## Warning: Coerced 'double' RHS to 'character' to match the column's type;
## may have truncated precision. Either change the target column to 'double'
## first (by creating a new 'double' vector length 9 (nrows of entire table)
## and assign that; i.e. 'replace' column), or coerce RHS to 'character'
## (e.g. 1L, NA_[real|integer]_, as.*, etc) to make your intent clear and for
## speed. Or, set the column type correctly up front when you create the
## table and stick to it, please.
## x y z w
## 1: 0.41135 2 0.9590 0.9197
## 2: -0.01255 2 -0.4972 0.2472
## 3: -2.40613 2 -0.3190 0.1018
## 4: -1.35985 2 -0.8536 0.7286
## 5: 0.60222 2 0.5701 0.3250
## 6: 0.45792 2 1.5983 2.5547
## 7: -0.86639 2 0.4750 0.2256
## 8: 1.73805 2 2.0440 4.1780
## 9: 0.17702 2 0.9463 0.8955
head(DT2, n = 3)
## x y z w
## 1: 0.41135 2 0.9590 0.9197
## 2: -0.01255 2 -0.4972 0.2472
## 3: -2.40613 2 -0.3190 0.1018
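If you actually want an independent copy, data.table provides the copy function (my addition; not covered in the video):

DT3 <- copy(DT)   # forces a true by-value copy
DT3[, y := "b"]   # modifies DT3 only
head(DT, 1)       # DT is unchanged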
The following adds a new column, m, whose values are calculated in a multi-statement expression. First, the x and z columns are added together for each row; then log_2 is calculated for that sum plus 5. The value of the last statement in the curly brackets is the one assigned to the m column.
DT[, `:=`(m, {
tmp <- (x + z)
log2(tmp + 5)
})]
## x y z w m
## 1: 0.41135 2 0.9590 0.9197 2.671
## 2: -0.01255 2 -0.4972 0.2472 2.167
## 3: -2.40613 2 -0.3190 0.1018 1.186
## 4: -1.35985 2 -0.8536 0.7286 1.479
## 5: 0.60222 2 0.5701 0.3250 2.626
## 6: 0.45792 2 1.5983 2.5547 2.819
## 7: -0.86639 2 0.4750 0.2256 2.204
## 8: 1.73805 2 2.0440 4.1780 3.135
## 9: 0.17702 2 0.9463 0.8955 2.614
The following sets a to TRUE for any row in which x is greater than zero; otherwise FALSE.
DT[, `:=`(a, x > 0)]
## x y z w m a
## 1: 0.41135 2 0.9590 0.9197 2.671 TRUE
## 2: -0.01255 2 -0.4972 0.2472 2.167 FALSE
## 3: -2.40613 2 -0.3190 0.1018 1.186 FALSE
## 4: -1.35985 2 -0.8536 0.7286 1.479 FALSE
## 5: 0.60222 2 0.5701 0.3250 2.626 TRUE
## 6: 0.45792 2 1.5983 2.5547 2.819 TRUE
## 7: -0.86639 2 0.4750 0.2256 2.204 FALSE
## 8: 1.73805 2 2.0440 4.1780 3.135 TRUE
## 9: 0.17702 2 0.9463 0.8955 2.614 TRUE
This example adds a new column, b, and sets it to the mean of the sum of the x and w columns, but groups by the a column. So rows with a=TRUE will have one value, and rows with a=FALSE will have another.
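The code for that example didn't make it into my notes; reconstructed from the description, it was presumably something like:

# b is the mean of x + w, computed within each group defined by the a column
DT[, b := mean(x + w), by = a]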
.N is an integer of length one, containing the number of matching elements
set.seed(123)
DT <- data.table(x = sample(letters[1:3], 1e+05, TRUE)) # create large number of records
DT[, .N, by = x] # show number of rows for each value of x
## x N
## 1: a 33387
## 2: c 33201
## 3: b 33412
DT <- data.table(x = rep(c("a", "b", "c"), each = 100), y = rnorm(300))
setkey(DT, x)
head(DT["a"]) # assumes a key is set, so this is the same as DT[x=='a']
## x y
## 1: a 0.2596
## 2: a 0.9175
## 3: a -0.7223
## 4: a -0.8083
## 5: a -0.1414
## 6: a 2.2570
head(DT[x == "a"])
## x y
## 1: a 0.2596
## 2: a 0.9175
## 3: a -0.7223
## 4: a -0.8083
## 5: a -0.1414
## 6: a 2.2570
Keys can be used to facilitate joins between data tables.
DT1 <- data.table(x = c("a", "a", "b", "dt1"), y = 1:4)
DT2 <- data.table(x = c("a", "b", "dt2"), z = 5:7)
setkey(DT1, x)
setkey(DT2, x)
merge(DT1, DT2)
## x y z
## 1: a 1 5
## 2: a 2 5
## 3: b 3 6
fread is much faster than read.table.
big_df <- data.frame(x=rnorm(1E6), y=rnorm(1E6))
file <- tempfile()
write.table(big_df, file=file, row.names=FALSE, col.names=TRUE, sep="\t", quote=FALSE)
system.time(fread(file))
user system elapsed
0.312 0.015 0.326
system.time(read.table(file, header=TRUE, sep="\t"))
user system elapsed
5.702 0.048 5.755