Hands on Lab 2-3: Combining Data from Multiple Sources

Description

Your task is to create an R dataframe that shows rates of tuberculosis infection by country. You should use the information from the tb database and the attached population.csv file. Your R dataframe should have the following columns:

Country, Year, Rate

Where Rate is defined as Cases/Population.

Step 1: Read in the Data

Step 1-A: Extract the TB Cases From MySQL Database

First I will extract the TB cases from the MySQL database. I will make use of a MySQL Settings file which hold authentication credentials. It is not included in the GitHub repo for obvious reasons. I will start by connecting to the database.

library(RMySQL)
library(ggplot2)
library(dplyr)
source('hidden/MySQL Settings.R')
conn <- dbConnect(MySQL(), user=user, password=password, dbname='tb', host=host)

Now that we have a connection I will pull the total cases by country and year from the database.

sql <- 'SELECT 
    `country`, `year`, SUM(child + adult + elderly) AS `cases`
FROM
    `tb`
GROUP BY `country`, `year`
HAVING `cases` IS NOT NULL'

tb <- dbGetQuery(conn, sql)

## Warning in .local(conn, statement, ...): Decimal MySQL column 2 imported as
## numeric

# Disconnect from the database
disconnected <- dbDisconnect(conn)

head(tb)

##       country year cases
## 1 Afghanistan 1997   128
## 2 Afghanistan 1998  1778
## 3 Afghanistan 1999   745
## 4 Afghanistan 2000  2666
## 5 Afghanistan 2001  4639
## 6 Afghanistan 2002  6509

Step 1-B: Retreive the Population From GitHub

I will now bring in the population data from GitHub.

url <- 'https://raw.githubusercontent.com/mikeasilva/CUNY-SPS/master/DATA607/data/population.csv'

pop <- read.csv(url)

head(pop)

##       country year population
## 1 Afghanistan 1995   17586073
## 2     Algeria 1995   29315463
## 3      Angola 1995   12104952
## 4   Argentina 1995   34833168
## 5  Azerbaijan 1995    7770806
## 6  Bangladesh 1995  119869585

Step 2: Wrangle the Datasets

The next step is to wrangle the data prior to merging the two datasets together. I will be joining on the country name and year. Since there may be variations in the capitalization between the two datasets I will be capitalizing all of them.

The data from the MySQL is not encoded in a way that plays nicely with the population data, so an additional step is taken.

tb <- tb %>%
  mutate(country = enc2utf8(country)) %>%
  mutate(country = toupper(country)) 

df <- pop %>%
  mutate(country = toupper(country))  %>%
  merge(tb)

Since there are nrow(tb) observations in the TB dataset and nrow(df) in the combined dataset I want to see what fell out.

tb_countries <- unique(tb$country)
df_countires <- unique(df$country)
missing <- tb_countries[!tb_countries %in% df_countires]
missing

## [1] "C<F4>TE D'IVOIRE"

The Ivory Coast fell out of the dataset. This is due to the encoding issue and is something that I will not be addressing.

Step 2: Compute the Rate

Now that the two datasets are combined I will calculate the prevalance rate:

df <- df %>%
  mutate(rate = cases / population)

Step 3: Exploratory Analysis

First we will look at the newly created rate variable

summary(df$rate)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000000 0.0001817 0.0003748 0.0006051 0.0007133 0.0081348

The scale is not very useful. I will rescale it to per 100,000.

df <- df %>%
  mutate(rate_per_100k = rate * 100000)

ggplot(df, aes(x=year, y=rate_per_100k)) +
  geom_smooth(method='lm',formula=y~x) +
  ylab("TB Infections Rate per 100,000") +
  xlab('Year')

The trend show that on average TB infections have been increasing over time from 25 to 90 cases per 100,000 residents. But the summary above says there is 813 cases per 100,000 as a maximum. So I want to look at a boxplot of the data.

ggplot(df, aes(x=year, y=rate_per_100k, group=year)) + 
  geom_boxplot() +
  ylab("TB Infections Rate per 100,000") +
  xlab('Year')

The median has been on the rise but there is a lot of outliers. It seems that there was an outbreak in 2006 where TB rates increased above 400 cases for a few locations.