http://en.wikipedia.org/wiki/MySQL
http://www.mysql.com/
Graphic showing six database tables, their fields, and the relationships between the tables (the link goes to the graphic)
http://dev.mysql.com/doc/employee/en/sakila-structure.html
http://dev.mysql.com/doc/refman/5.7/en/installing.html
My note: also check the forums; many installation issues have arisen, and many people have tried to help.
This link has info on how to connect to the database:
http://genome.ucsc.edu/goldenPath/help/mysql.html
(The following is not an R code block and so won't execute.)
ucscDb <- dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu")
result <- dbGetQuery(ucscDb, "show databases;"); dbDisconnect(ucscDb);
[1] TRUE
result
Database
1 information_schema
2 ailMel1
3 allMis1
...
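For reference, here is the same sequence as runnable R; a sketch assuming the RMySQL package is installed and the UCSC public server is reachable:
library(RMySQL)
# connect anonymously to the UCSC public MySQL server and list its databases
ucscDb <- dbConnect(MySQL(), user = "genome", host = "genome-mysql.cse.ucsc.edu")
result <- dbGetQuery(ucscDb, "show databases;")
dbDisconnect(ucscDb)
head(result)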
The following gets a list of the tables in a specific database on the server.
hg19 <- dbConnect(MySQL(),user="genome",db="hg19",host="genome-mysql.cse.ucsc.edu")
allTables <- dbListTables(hg19)
length(allTables)
[1] 10949
allTables[1:5]
[1] "HInv" "HInvGeneMrna" "acembly" "acemblyClass" "acemblyPep"
dbListFields(hg19,"affyU133Plus2")
[1] "bin" "matches" "misMatches" "repMatches" "nCount" "qNumInsert"
[7] "qBaseInsert" "tNumInsert" "tBaseInsert" "strand" "qName" "qSize"
[13] "qStart" "qEnd" "tName" "tSize" "tStart" "tEnd"
[19] "blockCount" "blockSizes" "qStarts" "tStarts"
dbGetQuery(hg19, "select count(*) from affyU133Plus2")
count(*)
1 58463
dbReadTable reads all of the data from a table into a data frame. Caution is advised here: the table may have a huge number of rows.
affyData <- dbReadTable(hg19, "affyU133Plus2")
head(affyData)
# too lazy to type it all, but this shows the first several rows of all columns in the table
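A safer pattern (my sketch, reusing the hg19 connection from above) is to check the row count first and read the whole table only if it is manageably small:
# count the rows before pulling the entire table into memory
nRows <- dbGetQuery(hg19, "select count(*) from affyU133Plus2")[1, 1]
if (nRows < 1e5) {
    affyData <- dbReadTable(hg19, "affyU133Plus2")
}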
The difference between dbGetQuery and dbSendQuery is that the former submits the query, fetches all output records, and clears the result set in a single step, while the latter only submits the query; you must call fetch yourself to retrieve the records.
The first fetch statement below gets all records. The second requests only the first 10 rows (n=10).
The code below looks questionable to me: it runs two fetch commands against a single query, and once the first fetch has retrieved every record, I don't think a second fetch has anything left to return.
query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")
affyMis <- fetch(query); quantile(affyMis$misMatches)
0% 25% 50% 75% 100%
1 1 2 2 3
affyMisSmall <- fetch(query,n=10); dbClearResult(query);
[1] TRUE
dim(affyMisSmall)
[1] 10 22
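That said, repeated partial fetches against a single query are the standard DBI pattern, provided each fetch asks for a limited number of rows. Here is a sketch (my addition, not from the video) using DBI's dbHasCompleted to loop until the result set is exhausted:
query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")
while (!dbHasCompleted(query)) {
    chunk <- fetch(query, n = 1000)  # next 1000 rows (fewer on the final chunk)
    # ... process chunk here ...
}
dbClearResult(query)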
dbDisconnect(hg19)
[1] TRUE
Below is a simple example from my local server. Note that for each fetch command, I first need a separate dbSendQuery.
library(RMySQL)
## Loading required package: DBI
## MYSQL_HOME defined as C:\Program Files\MySQL\MySQL Server 5.6
world <- dbConnect(MySQL(),user="root",password="yourPassword",db="world",host="localhost")
query <- dbSendQuery(world, "select * from country")
country <- fetch(query)
head(country[,1:4])
##   Code        Name     Continent                    Region
## 1  ABW       Aruba North America                 Caribbean
## 2  AFG Afghanistan          Asia Southern and Central Asia
## 3  AGO      Angola        Africa            Central Africa
## 4  AIA    Anguilla North America                 Caribbean
## 5  ALB     Albania        Europe           Southern Europe
## 6  AND     Andorra        Europe           Southern Europe
dbClearResult(query)
## [1] TRUE
query <- dbSendQuery(world, "select * from country")
country2 <- fetch(query, n=10)
country2[,1:4]
##    Code                 Name     Continent                    Region
## 1   ABW                Aruba North America                 Caribbean
## 2   AFG          Afghanistan          Asia Southern and Central Asia
## 3   AGO               Angola        Africa            Central Africa
## 4   AIA             Anguilla North America                 Caribbean
## 5   ALB              Albania        Europe           Southern Europe
## 6   AND              Andorra        Europe           Southern Europe
## 7   ANT Netherlands Antilles North America                 Caribbean
## 8   ARE United Arab Emirates          Asia               Middle East
## 9   ARG            Argentina South America             South America
## 10  ARM              Armenia          Asia               Middle East
dbClearResult(query)  # my addition: this accounts for the second TRUE below
## [1] TRUE
dbDisconnect(world)
## [1] TRUE
An unusual method is needed to install rhdf5; it is distributed through Bioconductor rather than CRAN:
source("http://bioconductor.org/biocLite.R")
biocLite("rhdf5")
I am on Windows 7, and a prompt appeared asking me to approve updating several other packages. I entered “a” (for “all”) and the process failed, complaining that it could not write to the corresponding directories. I restarted R as an administrator and tried again, and this time it worked.
NOTE: The code below creates a file example.h5 in the same directory as this Rmd file, but it will first check for the existence of a file with that name and delete it if found.
library(rhdf5)
# my addition: delete the file, if it already exists
if (file.exists("example.h5")) {
file.remove("example.h5")
}
## [1] TRUE
created = h5createFile("example.h5")
created
## [1] TRUE
Note that if the groups already exist in the HDF5 file, then they are not overwritten; instead the output indicates that they already exist.
created = h5createGroup("example.h5", "foo")
created = h5createGroup("example.h5", "baa")
created = h5createGroup("example.h5", "foo/foobaa")
h5ls("example.h5")
## group name otype dclass dim
## 0 / baa H5I_GROUP
## 1 / foo H5I_GROUP
## 2 /foo foobaa H5I_GROUP
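For example, re-running one of the calls above leaves the group alone; h5createGroup should just complain and return FALSE (my sketch, so the exact behavior may vary slightly):
created = h5createGroup("example.h5", "foo")
created  # expect FALSE, since "foo" already exists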
A <- matrix(1:10, nr = 5, nc = 2)
h5write(A, "example.h5", "foo/A")
B = array(seq(0.1, 2, by = 0.1), dim = c(5, 2, 2))
attr(B, "scale") <- "liter"
h5write(B, "example.h5", "foo/foobaa/B")
h5ls("example.h5")
## group name otype dclass dim
## 0 / baa H5I_GROUP
## 1 / foo H5I_GROUP
## 2 /foo A H5I_DATASET INTEGER 5 x 2
## 3 /foo foobaa H5I_GROUP
## 4 /foo/foobaa B H5I_DATASET FLOAT 5 x 2 x 2
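As an aside, I believe h5write stores R attributes such as "scale" only when asked to, via its write.attributes argument, and h5readAttributes then reads them back. A sketch (the argument name is from my reading of rhdf5, so treat this as unverified):
h5write(B, "example.h5", "foo/foobaa/B", write.attributes = TRUE)
h5readAttributes("example.h5", "foo/foobaa/B")  # should include the "scale" attribute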
The following example creates a data frame and writes it to the top of the HDF5 file's hierarchy, using the name “df”.
df = data.frame(1L:5L, seq(0, 1, length.out = 5), c("ab", "cde", "fghi", "a",
"s"), stringsAsFactors = F)
h5write(df, "example.h5", "df")
h5ls("example.h5")
## group name otype dclass dim
## 0 / baa H5I_GROUP
## 1 / df H5I_DATASET COMPOUND 5
## 2 / foo H5I_GROUP
## 3 /foo A H5I_DATASET INTEGER 5 x 2
## 4 /foo foobaa H5I_GROUP
## 5 /foo/foobaa B H5I_DATASET FLOAT 5 x 2 x 2
The h5read function reads from a specified point in an HDF5 file. I've extended the video example below to show that you can read not only the data structures you wrote into the file, but also the groups. So “foo” is a list containing A (a matrix) and foobaa (a subgroup in the HDF5 file, which in R will be another list).
readA = h5read("example.h5", "foo/A")
readB = h5read("example.h5", "foo/foobaa/B")
readdf = h5read("example.h5", "df")
readA
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
foo = h5read("example.h5", "foo")
class(foo)
## [1] "list"
names(foo)
## [1] "A" "foobaa"
class(foo$foobaa)
## [1] "list"
We can write to or read from subsets of data structures in HDF5 files.
h5write(c(12, 13, 14), "example.h5", "foo/A", index = list(1:3, 1))
h5read("example.h5", "foo/A")
## [,1] [,2]
## [1,] 12 6
## [2,] 13 7
## [3,] 14 8
## [4,] 4 9
## [5,] 5 10
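The index argument works the same way for reading; continuing the example above:
h5read("example.h5", "foo/A", index = list(1:3, 1))  # just rows 1-3 of the first column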
Web scraping: programmatically extracting data from the HTML code of websites
http://en.wikipedia.org/wiki/Web_scraping
http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en
The readLines function reads the contents of a web page into a character vector, one element per line. This page's HTML arrives without line breaks, so everything lands in a single element.
con = url("http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en")
htmlCode = readLines(con)
## Warning: incomplete final line found on
## 'http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en'
close(con)
substr(htmlCode, start = 1, stop = 1000)
## [1] "<!DOCTYPE html><html><head><title>Jeff Leek - Google Scholar Citations</title><meta name=\"robots\" content=\"noarchive\"><meta http-equiv=\"Content-Type\" content=\"text/html;charset=ISO-8859-1\"><meta http-equiv=\"X-UA-Compatible\" content=\"IE=Edge\"><meta name=\"format-detection\" content=\"telephone=no\"><link rel=\"canonical\" href=\"http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en\"><style type=\"text/css\" media=\"screen, projection\">html,body,form,table,div,h1,h2,h3,h4,h5,h6,img,ol,ul,li,button{margin:0;padding:0;border:0;}table{border-collapse:collapse;border-width:0;empty-cells:show;}#gs_top{position:relative;min-width:964px;_width:expression(document.documentElement.clientWidth<966?\"964px\":\"auto\");-webkit-tap-highlight-color:rgba(0,0,0,0);}#gs_top>*:not(#x){-webkit-tap-highlight-color:rgba(204,204,204,.5);}.gs_el_ph #gs_top,.gs_el_ta #gs_top{min-width:300px;_width:expression(document.documentElement.clientWidth<302?\"300px\":\"auto\");}body,td{font-size:13px;font-family:Arial,sans-seri"
class(htmlCode) # character vector
## [1] "character"
length(htmlCode) # length is 1, there is only one element in the vector
## [1] 1
nchar(htmlCode) # number of characters in the vector
## [1] 74509
More convenient is the XML package's htmlTreeParse function, whose parsed result we can query with xpathSApply. As before, it helps to know XPath.
library(XML)
url <- "http://scholar.google.com/citations?user=HI-I6C0AAAAJ&hl=en"
html <- htmlTreeParse(url, useInternalNodes = T)
xpathSApply(html, "//title", xmlValue)
## [1] "Jeff Leek - Google Scholar Citations"
xpathSApply(html, "//td[@id='col-citedby']", xmlValue)
## [1] "Cited by" "404" "290" "267" "178" "152"
## [7] "136" "134" "126" "117" "45" "33"
## [13] "32" "30" "23" "15" "15" "12"
## [19] "12" "9" "7"
This is an alternative to using htmlTreeParse. It is preferable when a website requires authentication, as shown in the next section.
library(httr)
html2 = GET(url)
class(html2)
## [1] "response"
content2 = content(html2, as = "text")
class(content2)
## [1] "character"
parsedHtml = htmlParse(content2, asText = TRUE)
xpathSApply(parsedHtml, "//title", xmlValue)
## [1] "Jeff Leek - Google Scholar Citations"
The following requires authentication, so it returns a 401. My code below also demonstrates that the response is a list, one of whose members is status_code, which contains the 401. So you can programmatically check for a 200 code, which means everything is OK.
pg1 = GET("http://httpbin.org/basic-auth/user/passwd")
pg1
## Response [http://httpbin.org/basic-auth/user/passwd]
## Status: 401
## Content-type:
##
names(pg1)
## [1] "url" "handle" "status_code" "headers" "cookies"
## [6] "content" "times" "config"
pg1$status_code
## [1] 401
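A minimal sketch of that programmatic check (the branching logic is my addition):
if (pg1$status_code == 200) {
    txt <- content(pg1, as = "text")  # safe to read the body
} else {
    warning(paste("request failed with status", pg1$status_code))
}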
http://cran.r-project.org/web/packages/httr/httr.pdf
Now authenticate:
pg2 = GET("http://httpbin.org/basic-auth/user/passwd", authenticate("user",
"passwd"))
pg2$status_code
## [1] 200
pg2
## Response [http://httpbin.org/basic-auth/user/passwd]
## Status: 200
## Content-type: application/json
## {
## "user": "user",
## "authenticated": true
## }
names(pg2)
## [1] "url" "handle" "status_code" "headers" "cookies"
## [6] "content" "times" "config"
A handle is a way of caching a URL and login, along with information such as cookies; it represents a web session. This matters if you need to access the URL again without re-authenticating or having the cookies overwritten. A handle applies to everything on a site that falls within a single web session. That generally (but not always) means all subpaths below the path you pass to handle, and generally (but not always) excludes anything outside that path and its children.
Note that, as far as I can tell, authentication does not happen in the handle function itself. You first define a handle and then use GET to authenticate with it. But because the handle can retain cookies, subsequent use of that handle will not require re-authentication.
google = handle("http://google.com")
pg1 = GET(handle = google, path = "/")
pg2 = GET(handle = google, path = "search")
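Putting the two ideas together, here is a sketch (my own, not from the video) of authenticating once through a handle and letting a later request reuse the cached session; it assumes the site keeps sessions in cookies:
h = handle("http://httpbin.org")
# authenticate on the first request; any cookies the site returns are cached with the handle
pg1 = GET(handle = h, path = "basic-auth/user/passwd", authenticate("user", "passwd"))
# a later request through the same handle reuses those cached cookies
pg2 = GET(handle = h, path = "get")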
Many sites have public APIs that you can access.
https://dev.twitter.com/docs/api/1/get/blocks/blocking
First, create an account. This usually means a developer account, beyond or in addition to a user account. Then you create an application.
My notes on creating an application:
After the account is created, the settings are a bit different from the video (at least for me).
Note that oauth_app takes the application's API (consumer) key and secret, while the access token and its secret go to sign_oauth1.0:
myapp = oauth_app("twitter", key="yourAPIKey", secret="yourAPISecret")
sig = sign_oauth1.0(myapp, token="yourAccessToken", token_secret="yourAccessTokenSecret")
The following loads the content of homeTL into json1, where the result is produced by the fromJSON function from the RJSONIO package (that's what the video says), which returns a structured R object that is not very easy to read. So jsonlite's toJSON and fromJSON functions are used to create a more user-friendly representation of the data (it is put into a data frame).
Note that there is an error in the video. It uses jsonlite::fromJSON in order to avoid having to load the jsonlite library first, but it then references the toJSON function without prepending jsonlite::, which indicates that when the author made the video, he must already have had the jsonlite library loaded. So I've simply loaded it in the following snippet.
library(jsonlite)
homeTL = GET("https://api.twitter.com/1.1/statuses/home_timeline.json", sig)
json1 = content(homeTL)
json2 = fromJSON(toJSON(json1))
json2[1, 1:4]
## created_at id id_str
## 1 Sun May 11 14:41:25 +0000 2014 4.655e+17 465501937341706240
## text
## 1 Separatist referendum in #Ukraine. Vote yes, you get roses. Vote no, Molotov cocktails. http://t.co/LZD1BZOIak http://t.co/33ycqSsdR7
https://dev.twitter.com/docs/api/1.1/get/search/tweets
Look at the “Resource URL” line on this page.
NOTE: When I do this, the URL I see is:
https://api.twitter.com/1.1/search/tweets.json
But note: The screenshot in the video has a URL of:
https://dev.twitter.com/docs/api/1.1/get/statuses/home_timeline
There are actually several different URLs you can use, listed at:
https://dev.twitter.com/docs/api/1.1
Also note that, whichever URL you choose, the docs page will list other parameters you can send with your request, for example, to determine how many tweets to return.
https://dev.twitter.com/docs/api/1.1/overview
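For instance, httr's GET can pass such parameters through its query argument; a sketch (the q and count values are just examples):
tweets = GET("https://api.twitter.com/1.1/search/tweets.json", sig,
             query = list(q = "data science", count = 5))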