Setting Up Database Connection

After setting up the tables in SQL, we can use the RMySQL package to connect to the MySQL database and use an SQL script just as we would in MySQL.

library('RMySQL')
## Loading required package: DBI
drv <- dbDriver('MySQL')
con <- dbConnect(drv, user="root", password="", dbname = "tb",
                 host = "localhost")

Using the combination of the MySQL driver stored in drv and the database connection, stored in con, we can view the tables in our database

dbListTables(con)
## [1] "population" "tb"         "tb_cases"

As well as view the fields in a particular table

dbListFields(con, "tb")
## [1] "country" "year"    "sex"     "child"   "adult"   "elderly"

Querying the Database

Following is the code that does most of the work. This is the same SELECT statement that wold retrieve the data from each table and combine it together to return the data. The SQL script is stored in the query object.

query <- "SELECT t.country, t.year, sum(t.cases) AS cases, p.population,
(t.cases / p.population) * 1000 AS rate
FROM tb_cases AS t
INNER JOIN population AS p
ON p.country = t.country
AND p.year = t.year
WHERE cases is not null
GROUP BY t.country, t.year"

cases_per_1000 <- dbGetQuery(con, query)

Using dbGetQuery and calling our con connection object and the SQL script stored in query, we can store the retrieved data in an object, which I have called cases_per_1000. Because the rate returned by dividing cases by population was small, I used a multiple to return a more managable number.

dbDisconnect(con)
## [1] TRUE
dbUnloadDriver(drv)
## [1] TRUE

Creating Data Frame

To manipulate the data in R, the returned SQL data can be stored in a data frame, which I have called cases_df. Below is a preview of the first few rows:

cases_df <- data.frame(cases_per_1000)
head(cases_df)
##       country year cases population   rate
## 1 Afghanistan 1997   128   19021226 0.0054
## 2 Afghanistan 1998  1778   19496836 0.0619
## 3 Afghanistan 1999   745   19987071 0.0259
## 4 Afghanistan 2000  2666   20595360 0.0850
## 5 Afghanistan 2001  4639   21347782 0.1434
## 6 Afghanistan 2002  6509   22202806 0.1990

And here are the summary statistics for the data:

summary(cases_df)
##    country               year          cases           population       
##  Length:1691        Min.   :1995   Min.   :      0   Min.   :6.759e+05  
##  Class :character   1st Qu.:2000   1st Qu.:   2800   1st Qu.:8.842e+06  
##  Mode  :character   Median :2005   Median :   6515   Median :2.014e+07  
##                     Mean   :2004   Mean   :  25208   Mean   :6.387e+07  
##                     3rd Qu.:2009   3rd Qu.:  15266   3rd Qu.:4.868e+07  
##                     Max.   :2013   Max.   :1198289   Max.   :1.386e+09  
##       rate       
##  Min.   :0.0000  
##  1st Qu.:0.0656  
##  Median :0.1361  
##  Mean   :0.2461  
##  3rd Qu.:0.2851  
##  Max.   :4.0372

Plotting Data

Lets make a quick histogram of the rate data to see what we are working with:

hist(cases_df$rate, breaks=30)

In most years, countries have less than 1 case per 1000, and in more severe cases, there are several cases per 1000 people, which is a relatively high rate. Still, for the rates below 1 per 1000, there is a lot of variance.

Let’s take a look at two countries that greatly differ in their overall economies, the United States, and Angola (TB rates in Africa are generally higher than other parts of the world):

library(ggplot2)
US_vs_Angola <- subset(cases_df, country == 'Angola'| country == 'United States of America')
ggplot(US_vs_Angola, aes(x=US_vs_Angola$year, y=US_vs_Angola$rate)) + geom_point(aes(color = US_vs_Angola$country)) + labs(title='TB Cases United States vs. Angola', x="Year", y="TB Rate per 1000 People")

Though both have rates that are below 1 case per 1000 people, the rate of TB cases in Angola is much higher than that of the rate in the United States, which is virtually zero.