From: “Getting and Cleaning Data”, week 2 - lesson 1 by Johns Hopkins University (Coursera)
This lecture’s about reading data from a MySQL database. MySQL’s one of the most widely used open source databases, and it’s widely used because it’s, partially because it’s free but also because it works really well. And it’s, again, widely used in Internet-based applications.
Data are structured in databases, and then the databases, within each database, there’s a series of tables. And then within the tables, there a series of fields. So, you can think of each table sort of as the data set, and each field as, say, one of the calms of that data set. So, each row in the database is called a record. If you want to know a little bit more about MySQL structure, you can look at the the Wikipedia page, or there’s quite extensive documentation at MySQL.com.
The first step in the using the MySQL package is actually installing MySQL. We’re actually going to be showing in this lecture, an example of a web-facing MySQL server that we’ll be able to access, butas a act of good citizenship to the world I would appreciate if you didn’t all go and try to access that one server because we’ll probably overwhelm them with the large number of people in the class.
A better example is to try to create MySQL database on your own system and then play around with it. So, the first thing that you need to do is install RMySQL with an R.
We’re going to be using a web-facing version of a MySQL database, so that we can just show how the R of MySQL package works, without having to install MySQL and go through the process of actually building that database.
The important point to keep in mind here is that as a data scientist what role that you will have is likely to collect data from a database, and maybe later you’re going to put some data back in it. But usually, the basic data collection has already been formed before you get there, so you usually be handed a database and trying, having to get data out of it.
This is information about the human genome that’s collected by University of California Santa Cruz. It’s one of the most famous databases for genomic. And so, they have a public facing MySQL page, sort of MySQL server. And so it gives you information here on how to connect. If you have MySQL installed on your computer, how to connect to their MySQL server. And so we’re going to use that information actually just connect with an R. And what we’re going to try to do is access the database and collect some information about a particular genome that we’re interested in.
Connecting and listing databases
Load RMySQL package in R
library(RMySQL)
You’re going to use the dbConnect() command to connect to a database, and so in this case, it’s a MySQL database. You can actually use this dbConnect() to actually connect to other kinds of databases as well.
The user that we’re going to pass it is “genome”.
The host we’re going to pass just where the MySQL server is. And in this case, “genome-mysql.cse.ucsc.edu”
So this opens a connection, and that connection is given this handle: ucscDb .
ucscDb <- dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu")
* Now you can actually apply a query to that database. So, what you can do is you can use the dbGetQuery() command.
- So, what that does is it’ll go to the connection to ucscDb database. And it’ll run this MySQL command.
“show databases;” isn’t an R command; it’s actually a MySQL command that we’re sending to ucscDb database through the dbGetQuery() function.
- Then we use dbDisconnect(). This going to disconnect from the MySQL server. It’s very important that whenever you’re done analyzing data or collecting data from a MySQL server that you disconnect from that server.
And you should get a TRUE response. That true is coming from the dbDisconnect because it’s saying that we did, in fact, disconnect from the server.
result <- dbGetQuery(ucscDb, "show databases;"); dbDisconnect(ucscDb)
[1] TRUE
Then we look at result, what result will show is a list of all the databases that are available for that all the databases that are available within the MySQL server that is located at this particular host address: “genome-mysql.cse.ucsc.edu”.
result
Connecting to hg19 and listing tables
We’re going to actually focus on one in particular,and that’s hg19. Hg19 is a particular build of the human genome. So, it’s the 19th build of the human genome.
Let’s run dbConnect() command again, but instead of just passing it the user, we’ll also pass it the database (db). We’ll say, we’re going to, we want to access hg19 database within this “genome-mysql.cse.ucsc.edu” MySQL server.
hg19 <- dbConnect(MySQL(), user="genome", db="hg19", host="genome-mysql.cse.ucsc.edu")
And then what we might want to see what are the tables within that database. So remember, on a server, there might be multiple databases and within the database, they’ll be multiple tables. Each table corresponding to what you might think of as a data frame.
So, we can look at all the tables that exist in the hg19 database with dbListTables() command.
allTables <- dbListTables(hg19)
And if you look at the length of allTables, in this database, it’s very long. They’re a 111131 tables that are in that hg19 database.
length(allTables)
[1] 11131
So, if you look at the first, say, five tables, you get these tables here.
allTables[1:5]
[1] "HInv" "HInvGeneMrna" "acembly" "acemblyClass" "acemblyPep"
These are all different tables corresponding to various different elements that describe components of the human genome. Each table corresponds to a different kind of data set.
You can kind of think about it in the same way that the tidy data principle is. You thought about each data set corresponds to its own file. It’s the same way here; you get a, each different data type gets its own table.
Get dimensions of a specific table
Suppose we know what table that we’re interested in. So, our second argument “affyU133Plus2” is a particular kind of microarray, which is a measurement technology used to measure something about the genome.
So, you can actually say, okay, I want to look at this hg19 database within the MySQL server. And I want to know what are all the fields in this particular affyU133Plus2 table. We use them as arguments for dbListFields() command.
dbListFields(hg19, "affyU133Plus2")
[1] "bin" "matches" "misMatches" "repMatches" "nCount" "qNumInsert" "qBaseInsert"
[8] "tNumInsert" "tBaseInsert" "strand" "qName" "qSize" "qStart" "qEnd"
[15] "tName" "tSize" "tStart" "tEnd" "blockCount" "blockSizes" "qStarts"
[22] "tStarts"
Remember a table corresponds to something like a data frame. And the fields correspond to something like the column names that data frame. And so, you can see if we look at the fields here, there are things like bin, matches, misMatches, and so forth.
Now suppose we want to find out, how many different rows there are in this data set. So, we already know something about the columns, right? Because dbListFields() told us all the column names. There are 22 columns in hg19 data set or in affyU133Plus2 table.
Then, what we might want to know is how many rows it has. And so, what we’re going to do is send another query to the database. So, it’s dbGetQuery() again.
So again, we’re going to pass a, in quotes (“”) a MySQL command. This is a very special MySQL command that says, select count ALL from affyU133Plus2.
What it’s doing is it’s basically going to count all of the records in the table. And it’s going to return the number of records.
dbGetQuery(hg19, "select count(*) from affyU133Plus2")
Read from the table
Suppose you want to get one of the tables out. You can basically get the data frame out from the data set. You can do it using dbReadTable.
We pass a database that we’re interested in getting the table from (hg19), and the table name that we’re interested in getting the data from (affyU133Plus2).
What that’ll return is the data frame itself.
affyData <- dbReadTable(hg19, "affyU133Plus2")
head(affyData)
So, you can see this is the data that’s actually been extracted nowfrom that MySQL server, from that particular database (hg19) and that particular table (affyU133Plus2). You can extract the data one table at a time.
Select a specific subset
One important thing to keep in mind is that often if I use MySQL databases, there’ll be a huge amount of data stored. Any particular table might be gigantic and might be too big to read into R.
So, one thing that you might want to do is select only a subset of the data. So, you can do that with the dbSendQuery() command.
What you do is call dbSendQuery() command, and you give it the database (hg19). And then, we’re going to send it a different MySQL command.
The MySQL command here is select star (all), so it’s going to select all the different observations from this affyU133plus2 table, where the misMatches variable is between one and three.
Then we send this query to the database. And now, that’s stored remotely at the database.
query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3")
If we want to fetch the results of that, we use the fetch() command like this. And that’ll return this affyMis, which will give us some information about only the samples that we’ve selected using this database query.
affyMis <- fetch(query)
affyMis
Remember, we selected the misMatches between one and three, so you can see that the quantiles reflect that we only had data for which the misMatches are between one and three.
quantile(affyMis$misMatches)
0% 25% 50% 75% 100%
1 1 2 2 3
Other thing that you can do is, you can actually just fetch. Remember, when you used the **SendQuery()* command, it sent it to the database, but it didn’t try to suck the data back to your personal computer yet.
Suppose you just want to see a little bit of the data and make sure that you don’t accidentally suck down a gigantic table, you can use the fetch command, and you can tell it only bring back, say, the top 10 records. And so, what you’ve got now is just a very small data set that only contains the first 10 rows of that table.
affyMisSmall <- fetch(query, n=10)
affyMisSmall
When you do that, you need to clear the query. Remember you sent a query out to the MySQL server, and now, it’s still sitting out there. And you fetch the data back, but it didn’t stop that query from still being out there, at the MySQL Server. So, you need to do dbClearResult() to re-clear that query from the remote server, and it should return TRUE again because you’ve cleared that result.
dbClearResult(query)
[1] TRUE
If you look at the small data set where we fetched only the top 10 rows or observations, you can see that that produces a data frame that actually has a dimension of 10 rows as well. What we’ve done is select a very specific subset.
dim(affyMisSmall)
[1] 10 22
An important point here is that you can basically send any MySQL query that you would like within the quotes. So, it’s a little bit beyond the scope of this class to teach you all of the different queries. The most important ones are the queries that I’ve shown you about selecting, the sort of the total count and selecting all of the observations that are subject to some particular conditions.
But if you go to the MySQL documentation, you can actually come up with a query that will allow you to very flexibly select almost any sort of subset of rows or columns from a data set that you’re actually interested in selecting.
Don’t forget to close the connection!
Remember to close the connection. So, this is one of the most common mistakes when using accessing a database from R, is that you open a connection; you select some data out; you go happily on your way. You should try to close the connection as soon as you don’t need the connection anymore. So, immediately after extracting out the data that you’re interested in, you should close the connection and move on with the rest of your script.
dbDisconnect(hg19)
[1] TRUE
Further resources
- List of commands: http://www.pantz.org/software/mysql/mysqlcommands.html
- Do not, do not, delete, add or join things from ensembl. Only select.
- In general be careful with mySQL commands
There’s very nice collection of MySQL commands here, very nicely organized that will give you a lot more information if you want to be able to use them.
One thing to keep in mind is that I have shown an example here about public spacing MySQL server. What I’ve done primarily for this analysis is select data from that server. Please, please do not acces that server to delete, add or join things. In other words, don’t push anything back into the server, only sub things or select things out of the server. So, you should only be using the select command.
Another important component is that we again, are a large class. So, if you all hit that server at the same time, it will likely cause problems for them. It’s much better to practice on your own local version of MySQL. In careful, in general, you need to be a little bit careful with MySQL commands because you can delete data that other people are working on.
A nice blog post summarizing some other commands: http://www.r-bloggers.com/mysql-and-r/
A nice blog summarizing some other commands that are used with MySQL and R is right here, and so if you want to take a look and learn a lot more, there’s a ton more that can be learned there, and R and MySQL play very nicely together if you want to get data out.
END
---
title: "Reading Data from MySQL"
author: Jeffrey Leek
output: html_notebook
---

#### From: "Getting and Cleaning Data", week 2 - lesson 1 by Johns Hopkins University (Coursera)

<br />

This lecture's about reading data from a MySQL database. MySQL's one of the most widely used open source databases, and it's widely used because it's, partially because it's free but also because it works really well. And it's, again, widely used in Internet-based applications.

Data are structured in databases, and then the databases, within each database, there's a series of tables. And then within the tables, there a series of fields. So, you can think of each table sort of as the data set, and each field as, say, one of the calms of that data set. So, each row in the database is called a record. If you want to know a little bit more about MySQL structure, you can look at the the Wikipedia page, or there's quite extensive documentation at MySQL.com.

The first step in the using the MySQL package is actually installing MySQL. We're actually going to be showing in this lecture, an example of a web-facing MySQL server that we'll be able to access, butas a act of good citizenship to the world I would appreciate if you didn't all go and try to access that one server because we'll probably overwhelm them with the large number of people in the class.

A better example is to try to create MySQL database on your own system and then play around with it. So, the first thing that you need to do is install RMySQL with an R.

We're going to be using a web-facing version of a MySQL database, so that we can just show how the R of MySQL package works, without having to install MySQL and go through the process of actually building that database. 

The important point to keep in mind here is that as a data scientist what role that you will have is likely to collect data from a database, and maybe later you're going to put some data back in it. But usually, the basic data collection has already been formed before you get there, so you usually be handed a database and trying, having to get data out of it.

This is information about the human genome that's collected by University of California Santa Cruz. It's one of the most famous databases for genomic. And so, they have a public facing MySQL page, sort of MySQL server. And so it gives you information here on how to connect. If you have MySQL installed on your computer, how to connect to their MySQL server. And so we're going to use that information actually just connect with an R. And what we're going to try to do is access the database and collect some information about a particular genome that we're interested in.

<br \>

---

<br \>

### Connecting and listing databases

<br \>
Load **RMySQL** package in R
```{r}
library(RMySQL)
```

<br \>
You're going to use the **dbConnect()** command to connect to a database, and so in this case, it's a MySQL database. You can actually use this **dbConnect()** to actually connect to other kinds of databases as well. 

The **user** that we're going to pass it is *"genome"*.

The **host** we're going to pass just where the MySQL server is. And in this case, *"genome-mysql.cse.ucsc.edu"*

So this opens a connection, and that connection is given this handle: **ucscDb **.


```{r}
ucscDb <- dbConnect(MySQL(), user="genome", host="genome-mysql.cse.ucsc.edu")
```

<br \>
* Now you can actually apply a query to that database. So, what you can do is you can use the **dbGetQuery()** command.

* So, what that does is it'll go to the connection to **ucscDb** database. And it'll run this MySQL command.
* **"show databases;"** isn't an R command; it's actually a MySQL command that we're sending to **ucscDb**  database through the **dbGetQuery()** function.

* Then we use **dbDisconnect()**. This going to disconnect from the MySQL server. It's very important that whenever you're done analyzing data or collecting data from a MySQL server that you disconnect from that server. 
* And you should get a TRUE response. That true is coming from the dbDisconnect because it's saying that we did, in fact, disconnect from the server.
```{r}
result <- dbGetQuery(ucscDb, "show databases;"); dbDisconnect(ucscDb)
```

<br \>
Then we look at **result**, what result will show is a list of all the databases that are available for that all the databases that are available within the MySQL server that is located at this particular host address: *"genome-mysql.cse.ucsc.edu"*.

```{r}
result
```

<br \>

---

<br \>

### Connecting to hg19 and listing tables

<br \>
We're going to actually focus on one in particular,and that's **hg19**. **Hg19** is a particular build of the human genome. So, it's the 19th build of the human genome.

Let's run **dbConnect()** command again, but instead of just passing it the **user**, we'll also pass it the database (**db**). We'll say, we're going to, we want to access *hg19* database within this *"genome-mysql.cse.ucsc.edu"* MySQL server.

```{r}
hg19 <- dbConnect(MySQL(), user="genome", db="hg19", host="genome-mysql.cse.ucsc.edu")
```

<br \>
And then what we might want to see what are the tables within that database. So remember, on a server, there might be multiple databases and within the database, they'll be multiple tables. Each table corresponding to what you might think of as a *data frame*.

So, we can look at all the tables that exist in the *hg19* database with **dbListTables()** command. 

```{r}
allTables <- dbListTables(hg19)
```

<br \>
And if you look at the **length** of **allTables**, in this database, it's very long. They're a 111131 tables that are in that *hg19* database.

```{r}
length(allTables)
```

<br \>
So, if you look at the first, say, five tables, you get these tables here.

```{r}
allTables[1:5]
```

<br \>
These are all different tables corresponding to various different elements that describe components of the human genome. Each table corresponds to a different kind of data set. 

You can kind of think about it in the same way that the tidy data principle is. You thought about each data set corresponds to its own file. It's the same way here; you get a, each different data type gets its own table.

<br \>

---

<br \>

### Get dimensions of a specific table

<br \>
Suppose we know what table that we're interested in. So, our second argument *"affyU133Plus2"* is a particular kind of microarray, which is a measurement technology used to measure something about the genome. 

So, you can actually say, okay, I want to look at this *hg19* database within the MySQL server. And I want to know what are all the fields in this particular *affyU133Plus2* table. We use them as arguments for **dbListFields()** command.

```{r}
dbListFields(hg19, "affyU133Plus2")
```

<br \>
Remember a table corresponds to something like a data frame. And the fields correspond to something like the column names that data frame. And so, you can see if we look at the fields here, there are things like *bin, matches, misMatches*, and so forth.

Now suppose we want to find out, how many different rows there are in this data set. So, we already know something about the columns, right? Because **dbListFields()** told us all the column names. There are 22 columns in *hg19* data set or in *affyU133Plus2* table.

Then, what we might want to know is how many rows it has. And so, what we're going to do is send
another query to the database. So, it's **dbGetQuery()** again.

So again, we're going to pass a, in quotes ("") a MySQL command. This is a very special MySQL command that says, *select count ALL from affyU133Plus2*.

What it's doing is it's basically going to count all of the records in the table. And it's going to return the number of records.

```{r}
dbGetQuery(hg19, "select count(*) from affyU133Plus2")
```

<br \>

---

<br \>

### Read from the table

<br \>
Suppose you want to get one of the tables out. You can basically get the data frame out from the data set. You can do it using **dbReadTable**.

We pass a *database* that we're interested in getting the table from (*hg19*),
and the *table* name that we're interested in getting the data from (*affyU133Plus2*).

What that'll return is the data frame itself.

```{r}
affyData <- dbReadTable(hg19, "affyU133Plus2")
```
```{r}
head(affyData)
```

<br \>
So, you can see this is the data that's actually been extracted nowfrom that MySQL server, from that particular database (*hg19*) and that particular table (*affyU133Plus2*). You can extract the data one table at
a time.

<br \>

---

<br \>

### Select a specific subset

<br \>
One important thing to keep in mind is that often if I use MySQL databases, there'll be a huge amount of data stored. Any particular table might be gigantic and might be too big to read into R.

So, one thing that you might want to do is select only a subset of the data. So, you can do that with the **dbSendQuery()** command.

What you do is call **dbSendQuery()** command, and you give it the database (*hg19*). And then, we're going to send it a different MySQL command.

The MySQL command here is *select star (all)*, so it's going to select all the different observations from this *affyU133plus2* table, *where the* **misMatches** *variable is between one and three*. 

Then we send this query to the database. And now, that's stored remotely at the database.

```{r}
query <- dbSendQuery(hg19, "select * from affyU133Plus2 where misMatches between 1 and 3") 
```

<br \>
If we want to fetch the results of that, we use the **fetch()** command like this. And that'll return this **affyMis**, which will give us some information about only the samples that we've selected using this database query.

```{r}
affyMis <- fetch(query)
affyMis
```

<br \>
Remember, we selected the *misMatches* between one and three, so you can see that the quantiles reflect that we only had data for which the *misMatches* are between one and three.

```{r}
quantile(affyMis$misMatches)
```

<br \>
Other thing that you can do is, you can actually just fetch. Remember, when you used the **SendQuery()* command, it sent it to the database, but it didn't try to suck the data back to your personal computer yet.

Suppose you just want to see a little bit of the data and make sure that you don't accidentally suck down a gigantic table, you can use the fetch command, and you can tell it only bring back, say, the top 10 records. And so, what you've got now is just a very small data set that only contains the first 10 rows of that table.

```{r}
affyMisSmall <- fetch(query, n=10)
affyMisSmall
```

<br \>
When you do that, you need to clear the query. Remember you sent a query out to the MySQL server, and now, it's still sitting out there. And you fetch the data back, but it didn't stop that query from still being out there, at the MySQL Server. So, you need to do **dbClearResult()** to re-clear that query from the remote server, and it should return **TRUE** again because you've cleared that result.

```{r}
dbClearResult(query)
```

<br \>
If you look at the small data set where we fetched only the top 10 rows or observations, you can see that that produces a data frame that actually has a dimension of 10 rows as well. What we've done is select a very specific subset.

```{r}
dim(affyMisSmall)
```

<br \>
An important point here is that you can basically send any MySQL query that you would like within the quotes. So, it's a little bit beyond the scope of this class to teach you all of the different queries. The most important ones are the queries that I've shown you about selecting, the sort of the total count and selecting all of the observations that are subject to some particular conditions.

But if you go to the MySQL documentation, you can actually come up with a query that will allow you to very flexibly select almost any sort of subset of rows or columns from a data set that you're actually interested in selecting.

<br \>

---

<br \>

### Don't forget to close the connection!

Remember to close the connection. So, this is one of the most common mistakes when using accessing a database
from R, is that you open a connection; you select some data out; you go happily on your way. You should try to close the connection as soon as you don't need the connection anymore. So, immediately after extracting out the data that you're interested in, you should close the connection and move on with the rest of your script.

```{r}
dbDisconnect(hg19)
```

<br \>

---

<br \>

### Further resources

<br \>

* RMySQL vignette: http://cran.r-project.org/web/packages/RMySQL/RMySQL.pdf

    The RMySQL vignette is very good. It gives you access to a bunch of other MySQL commands that might be useful for selecting data for your particular examples.

<br \>

* List of commands: http://www.pantz.org/software/mysql/mysqlcommands.html
    + **Do not, do not, delete, add or join things from ensembl. Only select.**
    + In general be careful with mySQL commands

    There's very nice collection of MySQL commands here, very nicely organized that will give you a lot more information if you want to be able to use them.

    One thing to keep in mind is that I have shown an example here about public spacing MySQL server. What I've done primarily for this analysis is select data from that server. Please, please do not acces that server to delete, add or join things. In other words, don't push anything back into the server, only sub things or select things out of the server. So, you should only be using the select command.

    Another important component is that we again, are a large class. So, if you all hit that server at the same time, it will likely cause problems for them. It's much better to practice on your own local version of MySQL. In careful, in general, you need to be a little bit careful with MySQL commands because you can delete data that other people are working on. 

<br \>

* A nice blog post summarizing some other commands: http://www.r-bloggers.com/mysql-and-r/

    A nice blog summarizing some other commands that are used with MySQL and R is right here, and so if you want to take a look and learn a lot more, there's a ton more that can be learned there, and R and MySQL play very nicely together if you want to get data out.

<br \>

---

<center>END</center>

---