After setting up the tables in SQL, we can use the RMySQL
package to connect to the MySQL database and use an SQL script just as we would in MySQL.
library('RMySQL')
## Loading required package: DBI
drv <- dbDriver('MySQL')
con <- dbConnect(drv, user="root", password="", dbname = "tb",
host = "localhost")
Using the combination of the MySQL driver stored in drv
and the database connection, stored in con
, we can view the tables in our database
dbListTables(con)
## [1] "population" "tb" "tb_cases"
As well as view the fields in a particular table
dbListFields(con, "tb")
## [1] "country" "year" "sex" "child" "adult" "elderly"
Following is the code that does most of the work. This is the same SELECT
statement that wold retrieve the data from each table and combine it together to return the data. The SQL script is stored in the query
object.
query <- "SELECT t.country, t.year, sum(t.cases) AS cases, p.population,
(t.cases / p.population) * 1000 AS rate
FROM tb_cases AS t
INNER JOIN population AS p
ON p.country = t.country
AND p.year = t.year
WHERE cases is not null
GROUP BY t.country, t.year"
cases_per_1000 <- dbGetQuery(con, query)
Using dbGetQuery
and calling our con
connection object and the SQL script stored in query
, we can store the retrieved data in an object, which I have called cases_per_1000
. Because the rate returned by dividing cases by population was small, I used a multiple to return a more managable number.
dbDisconnect(con)
## [1] TRUE
dbUnloadDriver(drv)
## [1] TRUE
To manipulate the data in R, the returned SQL data can be stored in a data frame, which I have called cases_df
. Below is a preview of the first few rows:
cases_df <- data.frame(cases_per_1000)
head(cases_df)
## country year cases population rate
## 1 Afghanistan 1997 128 19021226 0.0054
## 2 Afghanistan 1998 1778 19496836 0.0619
## 3 Afghanistan 1999 745 19987071 0.0259
## 4 Afghanistan 2000 2666 20595360 0.0850
## 5 Afghanistan 2001 4639 21347782 0.1434
## 6 Afghanistan 2002 6509 22202806 0.1990
And here are the summary statistics for the data:
summary(cases_df)
## country year cases population
## Length:1691 Min. :1995 Min. : 0 Min. :6.759e+05
## Class :character 1st Qu.:2000 1st Qu.: 2800 1st Qu.:8.842e+06
## Mode :character Median :2005 Median : 6515 Median :2.014e+07
## Mean :2004 Mean : 25208 Mean :6.387e+07
## 3rd Qu.:2009 3rd Qu.: 15266 3rd Qu.:4.868e+07
## Max. :2013 Max. :1198289 Max. :1.386e+09
## rate
## Min. :0.0000
## 1st Qu.:0.0656
## Median :0.1361
## Mean :0.2461
## 3rd Qu.:0.2851
## Max. :4.0372
Lets make a quick histogram of the rate data to see what we are working with:
hist(cases_df$rate, breaks=30)
In most years, countries have less than 1 case per 1000, and in more severe cases, there are several cases per 1000 people, which is a relatively high rate. Still, for the rates below 1 per 1000, there is a lot of variance.
Let’s take a look at two countries that greatly differ in their overall economies, the United States, and Angola (TB rates in Africa are generally higher than other parts of the world):
library(ggplot2)
US_vs_Angola <- subset(cases_df, country == 'Angola'| country == 'United States of America')
ggplot(US_vs_Angola, aes(x=US_vs_Angola$year, y=US_vs_Angola$rate)) + geom_point(aes(color = US_vs_Angola$country)) + labs(title='TB Cases United States vs. Angola', x="Year", y="TB Rate per 1000 People")
Though both have rates that are below 1 case per 1000 people, the rate of TB cases in Angola is much higher than that of the rate in the United States, which is virtually zero.