Introduction

The focus of this assignment is to create an R dataframe that shows rates of tuberculosis infection by country. Tuberculosis (tb) data will be sourced from a tb database in MySQL and combined with a CSV file of population data, located on GitHub. The final R dataframe will have the following columns:

Country
Year
Rate

Where Rate is defined as Cases/Population.

To run this code, the following packages are required:

  1. RMySQL
  2. dplyr
  3. sqldf
  4. ggplot2
  5. downloader
  6. htmlTable

Getting and Preparing the Data

Step 1. Connect to MySQL and retrieve the tb dataset stored in a database table.

This connection uses the RMySQL package to store the tb dataset in the tb dataframe in an unaggregated form.

mydb = dbConnect(MySQL(), user='data607', password='password', dbname='tb', host='localhost')

# return the tb query below and store the results a dataframe called tb
tb <- dbGetQuery(mydb, "select country, year, sex, child + adult + elderly as cases from tb")
## [1] TRUE

Let’s take a look at the tb dataframe:

summary(tb)
##    country               year          sex                cases       
##  Length:3800        Min.   :1995   Length:3800        Min.   :     0  
##  Class :character   1st Qu.:1999   Class :character   1st Qu.:  1311  
##  Mode  :character   Median :2004   Mode  :character   Median :  3071  
##                     Mean   :2004                      Mean   : 12612  
##                     3rd Qu.:2009                      3rd Qu.:  7859  
##                     Max.   :2013                      Max.   :767767  
##                                                       NA's   :420

tb dataset variables

Variable Name Description
country country; distint countries = 100
year year of tb cases recorded; ranges from min 1995 to max 2013
sex male or female; always valued
cases numberic; represents the sum of child, elderly, and adult tb cases recorded for the given country, year, and sex

In the tb dataset, we can see that there are 420 NA values for cases of tb infections recorded. Let’s see which countries are most impacted by the missing values for cases:

# Use sqldf to find the countries and years associated with the missing cases variable

missing <- sqldf("select country, 
                         min(year) as min_year, 
                         max(year) as max_year,
                         sum(case when sex = 'male' then 1 else 0 end) missing_male_cases,
                         sum(case when sex = 'female' then 1 else 0 end) missing_female_cases
                    from tb where cases is null
                   group by country 
                   order by missing_male_cases desc, missing_female_cases desc
                   ")
## Loading required package: tcltk
htmlTable(missing, caption = 'Summary of Missing Values by Country in the TB dataset')
Summary of Missing Values by Country in the TB dataset
country min_year max_year missing_male_cases missing_female_cases
1 Mozambique 1998 2013 16 16
2 Chad 1995 2008 9 9
3 Turkey 1995 2003 9 9
4 Belarus 1995 2005 9 8
5 Congo 1996 2005 7 7
6 Ecuador 1995 2002 7 7
7 Zimbabwe 1995 2001 7 7
8 Niger 1995 2004 6 6
9 Venezuela (Bolivarian Republic of) 1995 2001 6 6
10 Zambia 1996 2010 6 6
11 Tajikistan 1995 2006 5 5
12 Colombia 1995 1999 4 5
13 Brazil 1995 1998 4 4
14 Liberia 1995 2007 4 4
15 Madagascar 1997 2013 4 4
16 Peru 2009 2012 4 4
17 Philippines 1999 2002 4 4
18 South Africa 1995 1998 4 4
19 Spain 1996 2000 4 4
20 Swaziland 1997 2007 4 4
21 Afghanistan 1995 2013 3 3
22 Algeria 1995 1998 3 3
23 Azerbaijan 2001 2008 3 3
24 Botswana 1995 1998 3 3
25 Central African Republic 2000 2007 3 3
26 China, Hong Kong SAR 1995 1998 3 3
27 Kazakhstan 1995 2003 3 3
28 Pakistan 1996 2013 3 3
29 Rwanda 1995 2001 3 3
30 Saudi Arabia 1995 1997 3 3
31 Sudan 1996 2013 3 3
32 Ukraine 2002 2005 3 3
33 Democratic People’s Republic of Korea 1995 1996 2 2
34 Dominican Republic 1995 2001 2 2
35 Ethiopia 2012 2013 2 2
36 Guatemala 2006 2012 2 2
37 Haiti 1995 2009 2 2
38 Lesotho 1999 2001 2 2
39 Mexico 1995 1997 2 2
40 Nepal 1995 1997 2 2
41 Russian Federation 1995 2005 2 2
42 Sierra Leone 1999 2008 2 2
43 Togo 1997 2001 2 2
44 United Kingdom of Great Britain and Northern Ireland 1995 1996 2 2
45 Uzbekistan 1995 1997 2 2
46 Argentina 1995 1995 1 1
47 Bangladesh 2009 2009 1 1
48 Bolivia (Plurinational State of) 1995 1995 1 1
49 Burundi 2000 2000 1 1
50 Cambodia 1997 1997 1 1
51 Cameroon 2013 2013 1 1
52 Chile 1998 1998 1 1
53 Côte d’Ivoire 2000 2000 1 1
54 Djibouti 1995 1995 1 1
55 France 1998 1998 1 1
56 Germany 2000 2000 1 1
57 Ghana 2008 2008 1 1
58 India 2013 2013 1 1
59 Indonesia 2000 2000 1 1
60 Iraq 1996 1996 1 1
61 Malawi 2002 2002 1 1
62 Malaysia 2009 2009 1 1
63 Mali 2001 2001 1 1
64 Myanmar 2013 2013 1 1
65 Nigeria 1999 1999 1 1
66 Senegal 2013 2013 1 1
67 Syrian Arab Republic 1999 1999 1 1
68 Thailand 2013 2013 1 1
69 United Republic of Tanzania 2013 2013 1 1
70 Viet Nam 1995 1995 1 1

We can see from the table of missing values that Mozambique is most impacted, followed by the countries of Chad, Turkey, and Belarus.

An additional discovery in this dataset is that there may be scenarios where male cases are reported for a given country in a specific year but no female cases, or vice versa. This may factor into the next level of data preparation as we look to aggregate the tb data by country, year, and cases which are total reported cases among both men and women.

Aggregate the tb dataset

This step in processing will aggregate the tb dataset so that it represents the total number of tb cases among men and women for a given country by each year reported. NA or missing values will be excluded since these represent incomplete cases.

tb <- tbl_df(tb)

## Aggregation using sqldf
tb_agg_sqldf <- sqldf("select country, year, sum(cases) as cases
                         from tb
                        group by country, year 
                       having cases is not null")

## Aggregation using dplyr
tb_agg2_dplyr <- tb %>% 
                 group_by(country, year) %>% 
                 summarise(cases = sum(cases)) %>% 
                 filter(!is.na(cases)) %>%           # remove NA's 
                 arrange(country, year)              # order by country and year

tb_agg2_dplyr <- as.data.frame(tb_agg2_dplyr)

Aggregation has been performed using two methods: 1.) using sqldf and 2.) using dplyr

Inspecting the results, we can see that the results of the aggregation are slightly different:

Method Used Rowcount
sqldf 1690
dplyr 1689

Determine which country/value combination is not in the aggregate dataframe created by dplyr

To determine the difference between the two aggregated dataframes, we’ll use sqldf to isoloate the value(s):

sqldf("select country, year from tb_agg_sqldf except select country, year from tb_agg2_dplyr")
##    country year
## 1 Colombia 1999

Based on the results, we can see that something in particular with Columbia in 1999 is causing a difference. The output below shows the issue:

country year sex cases
1 Colombia 1999 female
2 Colombia 1999 male 8329

This is an incomplete case since the number of female cases for this year is missing. We’ll continue procesing using the dplyr aggregated dataframe and remove the sqldf version from further consideration.

Step 2. Retrieve the population dataset from GitHub and load into a dataframe.

This dataset captures the total population by country over multiple years.

# location of the population CSV file on GitHub

url <- "https://raw.githubusercontent.com/kfolsom98/DATA607/master/population.csv"

download_file <- "population.csv" # name of the file on the local machine after download

# the file will be downloaded to the working directory
downloader::download(url, download_file)

# read the population dataset into a dataframe

population <- read.csv(download_file, header=TRUE, stringsAsFactors = FALSE)

Let’s take a look at the population dataset:

summary(population)
##    country               year        population        
##  Length:1900        Min.   :1995   Min.   :    663999  
##  Class :character   1st Qu.:1999   1st Qu.:   8842847  
##  Mode  :character   Median :2004   Median :  19603861  
##                     Mean   :2004   Mean   :  60843799  
##                     3rd Qu.:2009   3rd Qu.:  47557915  
##                     Max.   :2013   Max.   :1385566537

population dataset variables

Variable Name Description
country country; distint countries = 100
year year of tb cases recorded; ranges from min 1995 to max 2013
population popuation of the country; no missing values

Step 3. Combine the tb and population datasets

Combine the tb dataset with the population dataset to calculate the rate of tuberculosis infection by country.

During this step, the tb dataframe will be the driving set of data for the calculation of the variable rate. In other words, rate will only be calculated for countries where the year and the number of tb cases are complete.

# join the tb.agg dataframe to the population dataframe
# create a new variable called rate = cases/population
# remove the cases and population variables

final_tb_df <- 
        inner_join(tb_agg2_dplyr, population, by=c("country", "year")) %>%   
        mutate(rate = cases/population) %>%                                  
        select( -cases, -population)                                         

final_tb_df <- as.data.frame(final_tb_df)

# round the rate variable to 4 digits 

final_tb_df$rate <- round(final_tb_df$rate, digits = 4)

Let’s look at the a portion of the final tb dataset:

htmlTable(head(arrange(final_tb_df, country, year), 20))
country year rate
1 Afghanistan 1997 0
2 Afghanistan 1998 0.0001
3 Afghanistan 1999 0
4 Afghanistan 2000 0.0001
5 Afghanistan 2001 0.0002
6 Afghanistan 2002 0.0003
7 Afghanistan 2003 0.0003
8 Afghanistan 2004 0.0003
9 Afghanistan 2005 0.0004
10 Afghanistan 2006 0.0005
11 Afghanistan 2007 0.0005
12 Afghanistan 2008 0.0005
13 Afghanistan 2009 0.0005
14 Afghanistan 2010 0.0005
15 Afghanistan 2011 0.0005
16 Afghanistan 2012 0.0005
17 Algeria 1997 0.0003
18 Algeria 1999 0.0002
19 Algeria 2000 0.0003
20 Algeria 2001 0.0002
# Alternative approach to calculate the combined dataframe using sqldf
# This approach seems slightly more intuitive than the dplyr approach for this particular problem

final_with_sqldf <-  sqldf("select tb.country            as country, 
                                   tb.year               as year, 
                                   tb.cases/p.population as rate
                              from tb_agg2_dplyr tb 
                              inner join population p on tb.country = p.country and tb.year = p.year")

Future Analysis Options Using the Final TB Dataset

  1. Using the latest year reported, what are the top 10 countries with the highest rates of tb infection based on population?
highest_rates <- final_tb_df %>%            
                 top_n( 1, year ) %>%  
                 top_n(10)   %>% arrange(desc(rate)) 
## Selecting by rate
ggplot(highest_rates, aes(x=country, y=rate, fill=country)) + geom_bar(stat="identity") + coord_flip() +
    geom_text(aes(label=rate), vjust=0, size = 4)  +
    xlab("Country") + ylab("Rate of TB Infection") +
    ggtitle("Top 10 Countries with the Highest TB Infection Rate")

  1. Using the latest year reported, what are the top 10 countries with the lowest rates of tb infection based on population?
final_tb_df %>%           
  top_n( 1, year ) %>%  
  arrange(rate)      %>% 
  filter(row_number() <= 10 ) %>% 
ggplot(aes(x=country, y=rate, fill=country)) + geom_bar(stat="identity") +
    coord_flip() + 
    geom_text(aes(label=rate), vjust=0, size = 2)  + 
    xlab("Country") + ylab("Rate of TB Infection") +
    ggtitle("Top 10 Countries with the Lowest TB Infection Rate")

  1. Find countries where the tb rate is increasing.