Getting and Cleaning Data: Week 2

Video 2-1: Reading from MySQL

mySQL

http://en.wikipedia.org/wiki/MySQL http://www.mysql.com/

Example structure

Graphic showing six database tables, with their fields, and relationships between the tables (link goes to graphic)

http://dev.mysql.com/doc/employee/en/sakila-structure.html

Step 1: Install MySQL

http://dev.mysql.com/doc/refman/5.7/en/installing.html

Step 2: Install RMySQL

My note: Also look in the forums, many issues have arisen, many people have tried to help

Example: UCSC database

http://genome.ucsc.edu/

UCSC MySQL

This link has info on how to connect to the database:

http://genome.ucsc.edu/goldenPath/help/mysql.html

Connecting and listing databases

(The following is not an R code block and so won't execute.)

ucscDb <- dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu")
result <- dbGetQuery(ucscDb, "show databases;"); dbDisconnect(ucscDb);

[1] TRUE

result

              Database
1   information_schema
2              ailMel1
3              allMis1
...

Connecting to hg19 and listing tables

The following gets a list of tables in a specific database in the server.

hg19 <- dbConnect(MySQL(),user="genome",db="hg19",host="genome-mysql.cse.ucsc.edu")
allTables <- dbListTables(hg19)
length(allTables)

[1] 10949

allTables[1:5]

[1] "HInv"    "HInvGeneMrna"    "acembly"   "acemblyClass"   "acemblyPep"

Get dimensions of a specific table

dbListFields(hg19,"affyU133Plus2")

[1]  "bin"         "matches"    "misMatches"  "repMatches"   "nCount"   "qNumInsert"
[7]  "qBaseInsert" "tNumInsert" "tBaseInsert" "strand"       "qName"    "qSize"
[13] "qStart"     "qEnd"        "tName"       "tSize"        "tStart"   "tEnd"
[19] "blockCount" "blockSizes"  "qStarts"     "tStarts"

dbGetQuery(hg19, "select count(*) from affyU133Plus2")

   count(*)
1     58463

Read from the table

dbReadTable reads all of the data from a table into a data frame. Caution is advised here, the table may have a huge number of rows.

affyData <- dbReadTable(hg19, "affyU133Plus2")
head(affyData)
# too lazy to type it all, but this shows the first several rows of all columns in the table

Select a specific subset

The difference between dbGetQuery and dbSendQuery is that the former submits the query, fetches all output records, and clears the result set all in one step, but the latter does not extract any records; you need to use fetch to do that.

The first fetch statement below gets all records. The second requests only the first 10 rows (n=10).

The code below appears to be invalid. It shows two fetch commands run on a single query, and I don't think you can do that.

query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")
affyMis <- fetch(query); quantile(affyMis$misMatches)

  0%  25%  50%  75% 100%
    1   1    2    2    3

affyMisSmall <- fetch(query,n=10); dbClearResult(query);

[1] TRUE

dim(affyMisSmall)

[1] 10 22

Don't forget to close the connection!

dbDisconnect(hg19)

[1] TRUE

Further resources

Below is a simple example from my local server. Note that for each fetch command, I first need a separate dbSendQuery.

## Loading required package: DBI
## MYSQL_HOME defined as C:\Program Files\MySQL\MySQL Server 5.6
##   Code        Name     Continent                    Region
## 1  ABW       Aruba North America                 Caribbean
## 2  AFG Afghanistan          Asia Southern and Central Asia
## 3  AGO      Angola        Africa            Central Africa
## 4  AIA    Anguilla North America                 Caribbean
## 5  ALB     Albania        Europe           Southern Europe
## 6  AND     Andorra        Europe           Southern Europe
## [1] TRUE
##    Code                 Name     Continent                    Region
## 1   ABW                Aruba North America                 Caribbean
## 2   AFG          Afghanistan          Asia Southern and Central Asia
## 3   AGO               Angola        Africa            Central Africa
## 4   AIA             Anguilla North America                 Caribbean
## 5   ALB              Albania        Europe           Southern Europe
## 6   AND              Andorra        Europe           Southern Europe
## 7   ANT Netherlands Antilles North America                 Caribbean
## 8   ARE United Arab Emirates          Asia               Middle East
## 9   ARG            Argentina South America             South America
## 10  ARM              Armenia          Asia               Middle East
## [1] TRUE
## [1] TRUE
library(RMySQL)
world <- dbConnect(MySQL(),user="root",password="yourPassword",db="world",host="localhost")
query <- dbSendQuery(world, "select * from country")
country <- fetch(query); 
head(country[,1:4])
dbClearResult(query)
query <- dbSendQuery(world, "select * from country")
country2 <- fetch(query,n=10)
country2[,1:4]
dbDisconnect(world)

Video 2-2: Reading from HDF5

HDF5

http://www.hdfgroup.org

R HDF5 package

Unusual method needed to install:

source("http://bioconductor.org/biocLite.R")
biocLite("rhdf5")

I am on Windows 7, and a prompt appeared asking me to approve updating several other packages. I entered “a” (for “all”) and the process failed, complaining that it could not write to the corresponding directories. I restarted R as an administrator and tried again, and this time it worked.

NOTE: The code below creates a file example.h5 in the same directory as this Rmd file, but it will first check for the existence of a file with that name and delete it if found.

library(rhdf5)
# my addition: delete the file, if it already exists
if (file.exists("example.h5")) {
    file.remove("example.h5")
}
## [1] TRUE
created = h5createFile("example.h5")
created
## [1] TRUE

Create groups

Note that if the groups already exist in the HDF5 file, then they are not overwritten; instead the output indicates that they already exist.

created = h5createGroup("example.h5", "foo")
created = h5createGroup("example.h5", "baa")
created = h5createGroup("example.h5", "foo/foobaa")
h5ls("example.h5")
##   group   name     otype dclass dim
## 0     /    baa H5I_GROUP           
## 1     /    foo H5I_GROUP           
## 2  /foo foobaa H5I_GROUP

Write to groups

A <- matrix(1:10, nr = 5, nc = 2)
h5write(A, "example.h5", "foo/A")
B = array(seq(0.1, 2, by = 0.1), dim = c(5, 2, 2))
attr(B, "scale") <- "liter"
h5write(B, "example.h5", "foo/foobaa/B")
h5ls("example.h5")
##         group   name       otype  dclass       dim
## 0           /    baa   H5I_GROUP                  
## 1           /    foo   H5I_GROUP                  
## 2        /foo      A H5I_DATASET INTEGER     5 x 2
## 3        /foo foobaa   H5I_GROUP                  
## 4 /foo/foobaa      B H5I_DATASET   FLOAT 5 x 2 x 2

Write a data set

The following example creates a data frame and writes it to the top of the HDF5 file's hierarchy, using the name “df”.

df = data.frame(1L:5L, seq(0, 1, length.out = 5), c("ab", "cde", "fghi", "a", 
    "s"), stringsAsFactors = F)
h5write(df, "example.h5", "df")
h5ls("example.h5")
##         group   name       otype   dclass       dim
## 0           /    baa   H5I_GROUP                   
## 1           /     df H5I_DATASET COMPOUND         5
## 2           /    foo   H5I_GROUP                   
## 3        /foo      A H5I_DATASET  INTEGER     5 x 2
## 4        /foo foobaa   H5I_GROUP                   
## 5 /foo/foobaa      B H5I_DATASET    FLOAT 5 x 2 x 2

Reading data

The h5read function reads from a specified point in an HDF5 file. I've added to the video example below to show that you can read not only the data structures you wrote into the file, but also the paths. So “foo” is a list, containing A (a matrix) and foobaa (a subpath in the HDF5 file, which in R will another list).

readA = h5read("example.h5", "foo/A")
readB = h5read("example.h5", "foo/foobaa/B")
readdf = h5read("example.h5", "df")
readA
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
foo = h5read("example.h5", "foo")
class(foo)
## [1] "list"
names(foo)
## [1] "A"      "foobaa"
class(foo$foobaa)
## [1] "list"

Writing and reading chunks

We can write to or read from subsets of data structures in HDF5 files.

h5write(c(12, 13, 14), "example.h5", "foo/A", index = list(1:3, 1))
h5read("example.h5", "foo/A")
##      [,1] [,2]
## [1,]   12    6
## [2,]   13    7
## [3,]   14    8
## [4,]    4    9
## [5,]    5   10

Notes and further resources

Video 2-3: Reading data from the web

Webscraping

Webscraping: Programmatically extracting data from the HTML code of websites

http://en.wikipedia.org/wiki/Web_scraping

Example: Google scholar page

http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en

Getting data off webpages: readLines()

The readLines function puts all of the contents of a web page into a single character vector.

con = url("http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en")
htmlCode = readLines(con)
## Warning: incomplete final line found on
## 'http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en'
close(con)
substr(htmlCode, start = 1, stop = 1000)
## [1] "<!DOCTYPE html><html><head><title>Jeff Leek - Google Scholar Citations</title><meta name=\"robots\" content=\"noarchive\"><meta http-equiv=\"Content-Type\" content=\"text/html;charset=ISO-8859-1\"><meta http-equiv=\"X-UA-Compatible\" content=\"IE=Edge\"><meta name=\"format-detection\" content=\"telephone=no\"><link rel=\"canonical\" href=\"http://scholar.google.com/citations?user=HI-I6C0AAAAJ&amp;hl=en\"><style type=\"text/css\" media=\"screen, projection\">html,body,form,table,div,h1,h2,h3,h4,h5,h6,img,ol,ul,li,button{margin:0;padding:0;border:0;}table{border-collapse:collapse;border-width:0;empty-cells:show;}#gs_top{position:relative;min-width:964px;_width:expression(document.documentElement.clientWidth<966?\"964px\":\"auto\");-webkit-tap-highlight-color:rgba(0,0,0,0);}#gs_top>*:not(#x){-webkit-tap-highlight-color:rgba(204,204,204,.5);}.gs_el_ph #gs_top,.gs_el_ta #gs_top{min-width:300px;_width:expression(document.documentElement.clientWidth<302?\"300px\":\"auto\");}body,td{font-size:13px;font-family:Arial,sans-seri"
class(htmlCode)  # character vector
## [1] "character"
length(htmlCode)  # length is 1, there is only one element in the vector
## [1] 1
nchar(htmlCode)  # number of characters in the vector
## [1] 74509

Parsing with XML

More convenient is using the XML library's htmlTreeParse function, against which we can use xpathSApply. As before, it helps to know XPath.

library(XML)
url <- "http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en"
html <- htmlTreeParse(url, useInternalNodes = T)
xpathSApply(html, "//title", xmlValue)
## [1] "Jeff Leek - Google Scholar Citations"
xpathSApply(html, "//td[@id='col-citedby']", xmlValue)
##  [1] "Cited by" "404"      "290"      "267"      "178"      "152"     
##  [7] "136"      "134"      "126"      "117"      "45"       "33"      
## [13] "32"       "30"       "23"       "15"       "15"       "12"      
## [19] "12"       "9"        "7"

GET from the httr package

This is an alternative to using htmlTreeParse. It is preferable when a website requires authorization, as shown in next section.

library(httr)
html2 = GET(url)
class(html2)
## [1] "response"
content2 = content(html2, as = "text")
class(content2)
## [1] "character"
parsedHtml = htmlParse(content2, asText = TRUE)
xpathSApply(parsedHtml, "//title", xmlValue)
## [1] "Jeff Leek - Google Scholar Citations"

Accessing websites with passwords

The following requires authentication, so returns a 401. My code below also demonstrates that the response is a list, and one of its members is status_code, which contains the 401. So you can programmatically check for a 200 code, which means everything is OK.

pg1 = GET("http://httpbin.org/basic-auth/user/passwd")
pg1
## Response [http://httpbin.org/basic-auth/user/passwd]
##   Status: 401
##   Content-type: 
## 
names(pg1)
## [1] "url"         "handle"      "status_code" "headers"     "cookies"    
## [6] "content"     "times"       "config"
pg1$status_code
## [1] 401

http://cran.r-project.org/web/packages/httr/httr.pdf

Now authenticate:

pg2 = GET("http://httpbin.org/basic-auth/user/passwd", authenticate("user", 
    "passwd"))
pg2$status_code
## [1] 200
pg2
## Response [http://httpbin.org/basic-auth/user/passwd]
##   Status: 200
##   Content-type: application/json
## {
##   "user": "user",
##   "authenticated": true
## }
names(pg2)
## [1] "url"         "handle"      "status_code" "headers"     "cookies"    
## [6] "content"     "times"       "config"

Using handles

A handle is a way of caching a URL and login, along with information such as cookies. So it represents a web session. This is important if you need to access the URL again without re-authenticating, or having the cookies overwritten. A handle will apply to everything on a site that is defined as being within a single web session. That generally (but not always) means all subpaths below the path you pass to the handle, and generally (but not always) excludes anything outside of that path, and its children.

Note that, as far as I can tell, authentication does not happen with the handle function. You first define a handle, and then use GET to authenticate, using the handle. But because the handle can retain cookies, subsequent use of that handle will not require re-authentication.

google = handle("http://google.com")
pg1 = GET(handle = google, path = "/")
pg2 = GET(handle = google, path = "search")

Notes and further resources

Video 2-4: Reading from APIs

Application programming interfaces

Many sites have public APIs that you can access

https://dev.twitter.com/docs/api/1/get/blocks/blocking

Creating an application

First, create an account. This usually means a developer account, beyond or in addition to a user account. Then you create an application.

My notes on creating an application:

After account is created, the settings are a bit different from the video (at least for me).

myapp = oauth_app("twitter", key="yourAccessToken",secret="yourAccessTokenSecret")
sig = sign_oauth1.0(myapp, token="yourAPIKey", token_secret="yourAPISecret")

The following loads the content of homeTL into json1, but the result is created using the fromJSON function from the “R JSON IO” package (that's what video says), which returns a structured R object that is not very easy to read. So the jsonlite's fromJSON and toJSON functions are used to create a more user-friendly representation of the data (it is put into a data frame).

Note that there is an error in the video. It uses jsonlite::fromJSON, in order to avoid having to load the jsonlite library first, but then it references the toJSON function with prepending it with jsonlite:: which indicates that when the author made the video, he must have already had the json lite library loaded. So I've simply loaded it in the following snippet.

library(jsonlite)
homeTL = GET("https://api.twitter.com/1.1/statuses/home_timeline.json", sig)
json1 = content(homeTL)
json2 = fromJSON(toJSON(json1))
json2[1, 1:4]
##                       created_at        id             id_str
## 1 Sun May 11 14:41:25 +0000 2014 4.655e+17 465501937341706240
##                                                                                                                                    text
## 1 Separatist referendum in #Ukraine. Vote yes, you get roses. Vote no, Molotov cocktails. http://t.co/LZD1BZOIak http://t.co/33ycqSsdR7

How did I know what URL to use?

https://dev.twitter.com/docs/api/1.1/get/search/tweets

Look at the “Resource URL” line on this page.

NOTE: When I do this, the URL I see is:

https://api.twitter.com/1.1/search/tweets.json

But note: The screenshot in the video has a URL of:

https://dev.twitter.com/docs/api/1.1/get/statuses/home_timeline

There are actually several different URLs you can use, listed at:

https://dev.twitter.com/docs/api/1.1

Also note that, whichever URL you choose, the docs page will list other parameters you can send with your request, for example, to determine how many tweets to return.

In general look at the documentation

https://dev.twitter.com/docs/api/1.1/overview

Video 2-5: Reading from other sources

There is a package for that

Interacting more directly with files

foreign package

Examples of other database packages

Reading images

Reading GIS data

Reading music (MP3) data