Week 3 - Combining Data from Multiple Sources

Getting and Preparing the Data

Step 1. Connect to MySQL and retrieve the tb dataset stored in a database table.

This connection uses the RMySQL package to store the tb dataset in the tb dataframe in an unaggregated form.

mydb = dbConnect(MySQL(), user='data607', password='password', dbname='tb', host='localhost')

# return the tb query below and store the results a dataframe called tb
tb <- dbGetQuery(mydb, "select country, year, sex, child + adult + elderly as cases from tb")

## [1] TRUE

Let’s take a look at the tb dataframe:

summary(tb)

##    country               year          sex                cases       
##  Length:3800        Min.   :1995   Length:3800        Min.   :     0  
##  Class :character   1st Qu.:1999   Class :character   1st Qu.:  1311  
##  Mode  :character   Median :2004   Mode  :character   Median :  3071  
##                     Mean   :2004                      Mean   : 12612  
##                     3rd Qu.:2009                      3rd Qu.:  7859  
##                     Max.   :2013                      Max.   :767767  
##                                                       NA's   :420

tb dataset variables

Variable Name	Description
`country`	country; distint countries = 100
`year`	year of tb cases recorded; ranges from min 1995 to max 2013
`sex`	male or female; always valued
`cases`	numberic; represents the sum of child, elderly, and adult tb cases recorded for the given country, year, and sex

In the tb dataset, we can see that there are 420 NA values for cases of tb infections recorded. Let’s see which countries are most impacted by the missing values for cases:

# Use sqldf to find the countries and years associated with the missing cases variable

missing <- sqldf("select country, 
                         min(year) as min_year, 
                         max(year) as max_year,
                         sum(case when sex = 'male' then 1 else 0 end) missing_male_cases,
                         sum(case when sex = 'female' then 1 else 0 end) missing_female_cases
                    from tb where cases is null
                   group by country 
                   order by missing_male_cases desc, missing_female_cases desc
                   ")

## Loading required package: tcltk

htmlTable(missing, caption = 'Summary of Missing Values by Country in the TB dataset')

	country	min_year	max_year	missing_male_cases	missing_female_cases
Summary of Missing Values by Country in the TB dataset
1	Mozambique	1998	2013	16	16
2	Chad	1995	2008	9	9
3	Turkey	1995	2003	9	9
4	Belarus	1995	2005	9	8
5	Congo	1996	2005	7	7
6	Ecuador	1995	2002	7	7
7	Zimbabwe	1995	2001	7	7
8	Niger	1995	2004	6	6
9	Venezuela (Bolivarian Republic of)	1995	2001	6	6
10	Zambia	1996	2010	6	6
11	Tajikistan	1995	2006	5	5
12	Colombia	1995	1999	4	5
13	Brazil	1995	1998	4	4
14	Liberia	1995	2007	4	4
15	Madagascar	1997	2013	4	4
16	Peru	2009	2012	4	4
17	Philippines	1999	2002	4	4
18	South Africa	1995	1998	4	4
19	Spain	1996	2000	4	4
20	Swaziland	1997	2007	4	4
21	Afghanistan	1995	2013	3	3
22	Algeria	1995	1998	3	3
23	Azerbaijan	2001	2008	3	3
24	Botswana	1995	1998	3	3
25	Central African Republic	2000	2007	3	3
26	China, Hong Kong SAR	1995	1998	3	3
27	Kazakhstan	1995	2003	3	3
28	Pakistan	1996	2013	3	3
29	Rwanda	1995	2001	3	3
30	Saudi Arabia	1995	1997	3	3
31	Sudan	1996	2013	3	3
32	Ukraine	2002	2005	3	3
33	Democratic People’s Republic of Korea	1995	1996	2	2
34	Dominican Republic	1995	2001	2	2
35	Ethiopia	2012	2013	2	2
36	Guatemala	2006	2012	2	2
37	Haiti	1995	2009	2	2
38	Lesotho	1999	2001	2	2
39	Mexico	1995	1997	2	2
40	Nepal	1995	1997	2	2
41	Russian Federation	1995	2005	2	2
42	Sierra Leone	1999	2008	2	2
43	Togo	1997	2001	2	2
44	United Kingdom of Great Britain and Northern Ireland	1995	1996	2	2
45	Uzbekistan	1995	1997	2	2
46	Argentina	1995	1995	1	1
47	Bangladesh	2009	2009	1	1
48	Bolivia (Plurinational State of)	1995	1995	1	1
49	Burundi	2000	2000	1	1
50	Cambodia	1997	1997	1	1
51	Cameroon	2013	2013	1	1
52	Chile	1998	1998	1	1
53	CÃ´te d’Ivoire	2000	2000	1	1
54	Djibouti	1995	1995	1	1
55	France	1998	1998	1	1
56	Germany	2000	2000	1	1
57	Ghana	2008	2008	1	1
58	India	2013	2013	1	1
59	Indonesia	2000	2000	1	1
60	Iraq	1996	1996	1	1
61	Malawi	2002	2002	1	1
62	Malaysia	2009	2009	1	1
63	Mali	2001	2001	1	1
64	Myanmar	2013	2013	1	1
65	Nigeria	1999	1999	1	1
66	Senegal	2013	2013	1	1
67	Syrian Arab Republic	1999	1999	1	1
68	Thailand	2013	2013	1	1
69	United Republic of Tanzania	2013	2013	1	1
70	Viet Nam	1995	1995	1	1

We can see from the table of missing values that Mozambique is most impacted, followed by the countries of Chad, Turkey, and Belarus.

An additional discovery in this dataset is that there may be scenarios where male cases are reported for a given country in a specific year but no female cases, or vice versa. This may factor into the next level of data preparation as we look to aggregate the tb data by country, year, and cases which are total reported cases among both men and women.

Aggregate the tb dataset

This step in processing will aggregate the tb dataset so that it represents the total number of tb cases among men and women for a given country by each year reported. NA or missing values will be excluded since these represent incomplete cases.

tb <- tbl_df(tb)

## Aggregation using sqldf
tb_agg_sqldf <- sqldf("select country, year, sum(cases) as cases
                         from tb
                        group by country, year 
                       having cases is not null")

## Aggregation using dplyr
tb_agg2_dplyr <- tb %>% 
                 group_by(country, year) %>% 
                 summarise(cases = sum(cases)) %>% 
                 filter(!is.na(cases)) %>%           # remove NA's 
                 arrange(country, year)              # order by country and year

tb_agg2_dplyr <- as.data.frame(tb_agg2_dplyr)

Aggregation has been performed using two methods: 1.) using sqldf and 2.) using dplyr

Inspecting the results, we can see that the results of the aggregation are slightly different:

Method Used	Rowcount
sqldf	1690
dplyr	1689

Determine which country/value combination is not in the aggregate dataframe created by dplyr

To determine the difference between the two aggregated dataframes, we’ll use sqldf to isoloate the value(s):

sqldf("select country, year from tb_agg_sqldf except select country, year from tb_agg2_dplyr")

##    country year
## 1 Colombia 1999

Based on the results, we can see that something in particular with Columbia in 1999 is causing a difference. The output below shows the issue:

	country	year	sex	cases
1	Colombia	1999	female
2	Colombia	1999	male	8329

This is an incomplete case since the number of female cases for this year is missing. We’ll continue procesing using the dplyr aggregated dataframe and remove the sqldf version from further consideration.

Step 2. Retrieve the population dataset from GitHub and load into a dataframe.

This dataset captures the total population by country over multiple years.

# location of the population CSV file on GitHub

url <- "https://raw.githubusercontent.com/kfolsom98/DATA607/master/population.csv"

download_file <- "population.csv" # name of the file on the local machine after download

# the file will be downloaded to the working directory
downloader::download(url, download_file)

# read the population dataset into a dataframe

population <- read.csv(download_file, header=TRUE, stringsAsFactors = FALSE)

Let’s take a look at the population dataset:

summary(population)

##    country               year        population        
##  Length:1900        Min.   :1995   Min.   :    663999  
##  Class :character   1st Qu.:1999   1st Qu.:   8842847  
##  Mode  :character   Median :2004   Median :  19603861  
##                     Mean   :2004   Mean   :  60843799  
##                     3rd Qu.:2009   3rd Qu.:  47557915  
##                     Max.   :2013   Max.   :1385566537

population dataset variables

Variable Name	Description
`country`	country; distint countries = 100
`year`	year of tb cases recorded; ranges from min 1995 to max 2013
`population`	popuation of the country; no missing values

Step 3. Combine the tb and population datasets

Combine the tb dataset with the population dataset to calculate the rate of tuberculosis infection by country.

During this step, the tb dataframe will be the driving set of data for the calculation of the variable rate. In other words, rate will only be calculated for countries where the year and the number of tb cases are complete.

# join the tb.agg dataframe to the population dataframe
# create a new variable called rate = cases/population
# remove the cases and population variables

final_tb_df <- 
        inner_join(tb_agg2_dplyr, population, by=c("country", "year")) %>%   
        mutate(rate = cases/population) %>%                                  
        select( -cases, -population)                                         

final_tb_df <- as.data.frame(final_tb_df)

# round the rate variable to 4 digits 

final_tb_df$rate <- round(final_tb_df$rate, digits = 4)

Let’s look at the a portion of the final tb dataset:

htmlTable(head(arrange(final_tb_df, country, year), 20))

	country	year	rate
1	Afghanistan	1997	0
2	Afghanistan	1998	0.0001
3	Afghanistan	1999	0
4	Afghanistan	2000	0.0001
5	Afghanistan	2001	0.0002
6	Afghanistan	2002	0.0003
7	Afghanistan	2003	0.0003
8	Afghanistan	2004	0.0003
9	Afghanistan	2005	0.0004
10	Afghanistan	2006	0.0005
11	Afghanistan	2007	0.0005
12	Afghanistan	2008	0.0005
13	Afghanistan	2009	0.0005
14	Afghanistan	2010	0.0005
15	Afghanistan	2011	0.0005
16	Afghanistan	2012	0.0005
17	Algeria	1997	0.0003
18	Algeria	1999	0.0002
19	Algeria	2000	0.0003
20	Algeria	2001	0.0002

# Alternative approach to calculate the combined dataframe using sqldf
# This approach seems slightly more intuitive than the dplyr approach for this particular problem

final_with_sqldf <-  sqldf("select tb.country            as country, 
                                   tb.year               as year, 
                                   tb.cases/p.population as rate
                              from tb_agg2_dplyr tb 
                              inner join population p on tb.country = p.country and tb.year = p.year")

Week 3 - Combining Data from Multiple Sources

Keith Folsom

February 13, 2016

Introduction