The focus of this assignment is to create an R dataframe that shows rates of tuberculosis infection by country. Tuberculosis (tb) data will be sourced from a tb database in MySQL and combined with a CSV file of population data, located on GitHub. The final R dataframe will have the following columns:
CountryYearRate
Where Rate is defined as Cases/Population.
To run this code, the following packages are required:
This connection uses the RMySQL package to store the tb dataset in the tb dataframe in an unaggregated form.
mydb = dbConnect(MySQL(), user='data607', password='password', dbname='tb', host='localhost')
# return the tb query below and store the results a dataframe called tb
tb <- dbGetQuery(mydb, "select country, year, sex, child + adult + elderly as cases from tb")
## [1] TRUE
Let’s take a look at the tb dataframe:
summary(tb)
## country year sex cases
## Length:3800 Min. :1995 Length:3800 Min. : 0
## Class :character 1st Qu.:1999 Class :character 1st Qu.: 1311
## Mode :character Median :2004 Mode :character Median : 3071
## Mean :2004 Mean : 12612
## 3rd Qu.:2009 3rd Qu.: 7859
## Max. :2013 Max. :767767
## NA's :420
tb dataset variables
| Variable Name | Description |
|---|---|
country |
country; distint countries = 100 |
year |
year of tb cases recorded; ranges from min 1995 to max 2013 |
sex |
male or female; always valued |
cases |
numberic; represents the sum of child, elderly, and adult tb cases recorded for the given country, year, and sex |
In the tb dataset, we can see that there are 420 NA values for cases of tb infections recorded. Let’s see which countries are most impacted by the missing values for cases:
# Use sqldf to find the countries and years associated with the missing cases variable
missing <- sqldf("select country,
min(year) as min_year,
max(year) as max_year,
sum(case when sex = 'male' then 1 else 0 end) missing_male_cases,
sum(case when sex = 'female' then 1 else 0 end) missing_female_cases
from tb where cases is null
group by country
order by missing_male_cases desc, missing_female_cases desc
")
## Loading required package: tcltk
htmlTable(missing, caption = 'Summary of Missing Values by Country in the TB dataset')
| Summary of Missing Values by Country in the TB dataset | |||||
| country | min_year | max_year | missing_male_cases | missing_female_cases | |
|---|---|---|---|---|---|
| 1 | Mozambique | 1998 | 2013 | 16 | 16 |
| 2 | Chad | 1995 | 2008 | 9 | 9 |
| 3 | Turkey | 1995 | 2003 | 9 | 9 |
| 4 | Belarus | 1995 | 2005 | 9 | 8 |
| 5 | Congo | 1996 | 2005 | 7 | 7 |
| 6 | Ecuador | 1995 | 2002 | 7 | 7 |
| 7 | Zimbabwe | 1995 | 2001 | 7 | 7 |
| 8 | Niger | 1995 | 2004 | 6 | 6 |
| 9 | Venezuela (Bolivarian Republic of) | 1995 | 2001 | 6 | 6 |
| 10 | Zambia | 1996 | 2010 | 6 | 6 |
| 11 | Tajikistan | 1995 | 2006 | 5 | 5 |
| 12 | Colombia | 1995 | 1999 | 4 | 5 |
| 13 | Brazil | 1995 | 1998 | 4 | 4 |
| 14 | Liberia | 1995 | 2007 | 4 | 4 |
| 15 | Madagascar | 1997 | 2013 | 4 | 4 |
| 16 | Peru | 2009 | 2012 | 4 | 4 |
| 17 | Philippines | 1999 | 2002 | 4 | 4 |
| 18 | South Africa | 1995 | 1998 | 4 | 4 |
| 19 | Spain | 1996 | 2000 | 4 | 4 |
| 20 | Swaziland | 1997 | 2007 | 4 | 4 |
| 21 | Afghanistan | 1995 | 2013 | 3 | 3 |
| 22 | Algeria | 1995 | 1998 | 3 | 3 |
| 23 | Azerbaijan | 2001 | 2008 | 3 | 3 |
| 24 | Botswana | 1995 | 1998 | 3 | 3 |
| 25 | Central African Republic | 2000 | 2007 | 3 | 3 |
| 26 | China, Hong Kong SAR | 1995 | 1998 | 3 | 3 |
| 27 | Kazakhstan | 1995 | 2003 | 3 | 3 |
| 28 | Pakistan | 1996 | 2013 | 3 | 3 |
| 29 | Rwanda | 1995 | 2001 | 3 | 3 |
| 30 | Saudi Arabia | 1995 | 1997 | 3 | 3 |
| 31 | Sudan | 1996 | 2013 | 3 | 3 |
| 32 | Ukraine | 2002 | 2005 | 3 | 3 |
| 33 | Democratic People’s Republic of Korea | 1995 | 1996 | 2 | 2 |
| 34 | Dominican Republic | 1995 | 2001 | 2 | 2 |
| 35 | Ethiopia | 2012 | 2013 | 2 | 2 |
| 36 | Guatemala | 2006 | 2012 | 2 | 2 |
| 37 | Haiti | 1995 | 2009 | 2 | 2 |
| 38 | Lesotho | 1999 | 2001 | 2 | 2 |
| 39 | Mexico | 1995 | 1997 | 2 | 2 |
| 40 | Nepal | 1995 | 1997 | 2 | 2 |
| 41 | Russian Federation | 1995 | 2005 | 2 | 2 |
| 42 | Sierra Leone | 1999 | 2008 | 2 | 2 |
| 43 | Togo | 1997 | 2001 | 2 | 2 |
| 44 | United Kingdom of Great Britain and Northern Ireland | 1995 | 1996 | 2 | 2 |
| 45 | Uzbekistan | 1995 | 1997 | 2 | 2 |
| 46 | Argentina | 1995 | 1995 | 1 | 1 |
| 47 | Bangladesh | 2009 | 2009 | 1 | 1 |
| 48 | Bolivia (Plurinational State of) | 1995 | 1995 | 1 | 1 |
| 49 | Burundi | 2000 | 2000 | 1 | 1 |
| 50 | Cambodia | 1997 | 1997 | 1 | 1 |
| 51 | Cameroon | 2013 | 2013 | 1 | 1 |
| 52 | Chile | 1998 | 1998 | 1 | 1 |
| 53 | Côte d’Ivoire | 2000 | 2000 | 1 | 1 |
| 54 | Djibouti | 1995 | 1995 | 1 | 1 |
| 55 | France | 1998 | 1998 | 1 | 1 |
| 56 | Germany | 2000 | 2000 | 1 | 1 |
| 57 | Ghana | 2008 | 2008 | 1 | 1 |
| 58 | India | 2013 | 2013 | 1 | 1 |
| 59 | Indonesia | 2000 | 2000 | 1 | 1 |
| 60 | Iraq | 1996 | 1996 | 1 | 1 |
| 61 | Malawi | 2002 | 2002 | 1 | 1 |
| 62 | Malaysia | 2009 | 2009 | 1 | 1 |
| 63 | Mali | 2001 | 2001 | 1 | 1 |
| 64 | Myanmar | 2013 | 2013 | 1 | 1 |
| 65 | Nigeria | 1999 | 1999 | 1 | 1 |
| 66 | Senegal | 2013 | 2013 | 1 | 1 |
| 67 | Syrian Arab Republic | 1999 | 1999 | 1 | 1 |
| 68 | Thailand | 2013 | 2013 | 1 | 1 |
| 69 | United Republic of Tanzania | 2013 | 2013 | 1 | 1 |
| 70 | Viet Nam | 1995 | 1995 | 1 | 1 |
We can see from the table of missing values that Mozambique is most impacted, followed by the countries of Chad, Turkey, and Belarus.
An additional discovery in this dataset is that there may be scenarios where male cases are reported for a given country in a specific year but no female cases, or vice versa. This may factor into the next level of data preparation as we look to aggregate the tb data by country, year, and cases which are total reported cases among both men and women.
This step in processing will aggregate the tb dataset so that it represents the total number of tb cases among men and women for a given country by each year reported. NA or missing values will be excluded since these represent incomplete cases.
tb <- tbl_df(tb)
## Aggregation using sqldf
tb_agg_sqldf <- sqldf("select country, year, sum(cases) as cases
from tb
group by country, year
having cases is not null")
## Aggregation using dplyr
tb_agg2_dplyr <- tb %>%
group_by(country, year) %>%
summarise(cases = sum(cases)) %>%
filter(!is.na(cases)) %>% # remove NA's
arrange(country, year) # order by country and year
tb_agg2_dplyr <- as.data.frame(tb_agg2_dplyr)
Aggregation has been performed using two methods: 1.) using sqldf and 2.) using dplyr
Inspecting the results, we can see that the results of the aggregation are slightly different:
| Method Used | Rowcount |
|---|---|
| sqldf | 1690 |
| dplyr | 1689 |
To determine the difference between the two aggregated dataframes, we’ll use sqldf to isoloate the value(s):
sqldf("select country, year from tb_agg_sqldf except select country, year from tb_agg2_dplyr")
## country year
## 1 Colombia 1999
Based on the results, we can see that something in particular with Columbia in 1999 is causing a difference. The output below shows the issue:
| country | year | sex | cases | |
|---|---|---|---|---|
| 1 | Colombia | 1999 | female | |
| 2 | Colombia | 1999 | male | 8329 |
This is an incomplete case since the number of female cases for this year is missing. We’ll continue procesing using the dplyr aggregated dataframe and remove the sqldf version from further consideration.
This dataset captures the total population by country over multiple years.
# location of the population CSV file on GitHub
url <- "https://raw.githubusercontent.com/kfolsom98/DATA607/master/population.csv"
download_file <- "population.csv" # name of the file on the local machine after download
# the file will be downloaded to the working directory
downloader::download(url, download_file)
# read the population dataset into a dataframe
population <- read.csv(download_file, header=TRUE, stringsAsFactors = FALSE)
Let’s take a look at the population dataset:
summary(population)
## country year population
## Length:1900 Min. :1995 Min. : 663999
## Class :character 1st Qu.:1999 1st Qu.: 8842847
## Mode :character Median :2004 Median : 19603861
## Mean :2004 Mean : 60843799
## 3rd Qu.:2009 3rd Qu.: 47557915
## Max. :2013 Max. :1385566537
population dataset variables
| Variable Name | Description |
|---|---|
country |
country; distint countries = 100 |
year |
year of tb cases recorded; ranges from min 1995 to max 2013 |
population |
popuation of the country; no missing values |
During this step, the tb dataframe will be the driving set of data for the calculation of the variable rate. In other words, rate will only be calculated for countries where the year and the number of tb cases are complete.
# join the tb.agg dataframe to the population dataframe
# create a new variable called rate = cases/population
# remove the cases and population variables
final_tb_df <-
inner_join(tb_agg2_dplyr, population, by=c("country", "year")) %>%
mutate(rate = cases/population) %>%
select( -cases, -population)
final_tb_df <- as.data.frame(final_tb_df)
# round the rate variable to 4 digits
final_tb_df$rate <- round(final_tb_df$rate, digits = 4)
Let’s look at the a portion of the final tb dataset:
htmlTable(head(arrange(final_tb_df, country, year), 20))
| country | year | rate | |
|---|---|---|---|
| 1 | Afghanistan | 1997 | 0 |
| 2 | Afghanistan | 1998 | 0.0001 |
| 3 | Afghanistan | 1999 | 0 |
| 4 | Afghanistan | 2000 | 0.0001 |
| 5 | Afghanistan | 2001 | 0.0002 |
| 6 | Afghanistan | 2002 | 0.0003 |
| 7 | Afghanistan | 2003 | 0.0003 |
| 8 | Afghanistan | 2004 | 0.0003 |
| 9 | Afghanistan | 2005 | 0.0004 |
| 10 | Afghanistan | 2006 | 0.0005 |
| 11 | Afghanistan | 2007 | 0.0005 |
| 12 | Afghanistan | 2008 | 0.0005 |
| 13 | Afghanistan | 2009 | 0.0005 |
| 14 | Afghanistan | 2010 | 0.0005 |
| 15 | Afghanistan | 2011 | 0.0005 |
| 16 | Afghanistan | 2012 | 0.0005 |
| 17 | Algeria | 1997 | 0.0003 |
| 18 | Algeria | 1999 | 0.0002 |
| 19 | Algeria | 2000 | 0.0003 |
| 20 | Algeria | 2001 | 0.0002 |
# Alternative approach to calculate the combined dataframe using sqldf
# This approach seems slightly more intuitive than the dplyr approach for this particular problem
final_with_sqldf <- sqldf("select tb.country as country,
tb.year as year,
tb.cases/p.population as rate
from tb_agg2_dplyr tb
inner join population p on tb.country = p.country and tb.year = p.year")
highest_rates <- final_tb_df %>%
top_n( 1, year ) %>%
top_n(10) %>% arrange(desc(rate))
## Selecting by rate
ggplot(highest_rates, aes(x=country, y=rate, fill=country)) + geom_bar(stat="identity") + coord_flip() +
geom_text(aes(label=rate), vjust=0, size = 4) +
xlab("Country") + ylab("Rate of TB Infection") +
ggtitle("Top 10 Countries with the Highest TB Infection Rate")
final_tb_df %>%
top_n( 1, year ) %>%
arrange(rate) %>%
filter(row_number() <= 10 ) %>%
ggplot(aes(x=country, y=rate, fill=country)) + geom_bar(stat="identity") +
coord_flip() +
geom_text(aes(label=rate), vjust=0, size = 2) +
xlab("Country") + ylab("Rate of TB Infection") +
ggtitle("Top 10 Countries with the Lowest TB Infection Rate")