Your task is to create an R dataframe that shows rates of tuberculosis infection by country. You should use the information from the tb database and the attached population.csv file. Your R dataframe should have the following columns:
Country, Year, Rate
Where Rate is defined as Cases/Population.
First I will extract the TB cases from the MySQL database. I will make use of a MySQL Settings file which hold authentication credentials. It is not included in the GitHub repo for obvious reasons. I will start by connecting to the database.
library(RMySQL)
library(ggplot2)
library(dplyr)
source('hidden/MySQL Settings.R')
conn <- dbConnect(MySQL(), user=user, password=password, dbname='tb', host=host)
Now that we have a connection I will pull the total cases by country and year from the database.
sql <- 'SELECT
`country`, `year`, SUM(child + adult + elderly) AS `cases`
FROM
`tb`
GROUP BY `country`, `year`
HAVING `cases` IS NOT NULL'
tb <- dbGetQuery(conn, sql)
## Warning in .local(conn, statement, ...): Decimal MySQL column 2 imported as
## numeric
# Disconnect from the database
disconnected <- dbDisconnect(conn)
head(tb)
## country year cases
## 1 Afghanistan 1997 128
## 2 Afghanistan 1998 1778
## 3 Afghanistan 1999 745
## 4 Afghanistan 2000 2666
## 5 Afghanistan 2001 4639
## 6 Afghanistan 2002 6509
I will now bring in the population data from GitHub.
url <- 'https://raw.githubusercontent.com/mikeasilva/CUNY-SPS/master/DATA607/data/population.csv'
pop <- read.csv(url)
head(pop)
## country year population
## 1 Afghanistan 1995 17586073
## 2 Algeria 1995 29315463
## 3 Angola 1995 12104952
## 4 Argentina 1995 34833168
## 5 Azerbaijan 1995 7770806
## 6 Bangladesh 1995 119869585
The next step is to wrangle the data prior to merging the two datasets together. I will be joining on the country name and year. Since there may be variations in the capitalization between the two datasets I will be capitalizing all of them.
The data from the MySQL is not encoded in a way that plays nicely with the population data, so an additional step is taken.
tb <- tb %>%
mutate(country = enc2utf8(country)) %>%
mutate(country = toupper(country))
df <- pop %>%
mutate(country = toupper(country)) %>%
merge(tb)
Since there are nrow(tb)
observations in the TB dataset and nrow(df)
in the combined dataset I want to see what fell out.
tb_countries <- unique(tb$country)
df_countires <- unique(df$country)
missing <- tb_countries[!tb_countries %in% df_countires]
missing
## [1] "C<F4>TE D'IVOIRE"
The Ivory Coast fell out of the dataset. This is due to the encoding issue and is something that I will not be addressing.
Now that the two datasets are combined I will calculate the prevalance rate:
df <- df %>%
mutate(rate = cases / population)
First we will look at the newly created rate variable
summary(df$rate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000000 0.0001817 0.0003748 0.0006051 0.0007133 0.0081348
The scale is not very useful. I will rescale it to per 100,000.
df <- df %>%
mutate(rate_per_100k = rate * 100000)
ggplot(df, aes(x=year, y=rate_per_100k)) +
geom_smooth(method='lm',formula=y~x) +
ylab("TB Infections Rate per 100,000") +
xlab('Year')
The trend show that on average TB infections have been increasing over time from 25 to 90 cases per 100,000 residents. But the summary above says there is 813 cases per 100,000 as a maximum. So I want to look at a boxplot of the data.
ggplot(df, aes(x=year, y=rate_per_100k, group=year)) +
geom_boxplot() +
ylab("TB Infections Rate per 100,000") +
xlab('Year')
The median has been on the rise but there is a lot of outliers. It seems that there was an outbreak in 2006 where TB rates increased above 400 cases for a few locations.