DATA 607 Week 3 Assignment

Reading the Data

First the population data is read in:

pop <- read.csv('./population.csv')

Then the tuberculosis data is read in, using the column names from the tb.sql file:

tb <- read.csv('./tb.csv', header = FALSE, col.names = c('country', 'year', 'sex', 'child', 'adult', 'elderly'))

Modifying the Data

Values of -1 in the child, adult, and eldery columns are converted to NA:

tb$child   <- ifelse(tb$child   == -1, NA, tb$child)
tb$adult   <- ifelse(tb$adult   == -1, NA, tb$adult)
tb$elderly <- ifelse(tb$elderly == -1, NA, tb$elderly)

The columns of tb are then summed to give the total number of tuberculosis cases:

tb$cases <- tb$child + tb$adult + tb$elderly

Linking the Data

The dpylr package is used to join the population and tb datasets:

require(dplyr)
pop_tb <- inner_join(pop, tb, by = c("country" = "country", "year" = "year"))

Calculating the Infection Rate

The number of cases is then summed across genders and the infection rate is calculated, as cases per 1,000,000 residents:

by_country_year <- group_by(pop_tb, country, year, population)
infection_rate <- summarize(by_country_year, cases=sum(cases))
infection_rate$rate <- infection_rate$cases/infection_rate$population * 1000000
infection_rate <- as.data.frame(infection_rate[, c(1, 2, 5)])

Results

A sample of the resulting dataset:

infection_rate[seq(1, nrow(infection_rate), 100), ]

##                               country year       rate
## 1                         Afghanistan 1995         NA
## 101                        Bangladesh 2000  286.91693
## 201                            Brazil 2005  226.13332
## 301                          Cameroon 2010  701.30719
## 401                          Colombia 1996         NA
## 501                Dominican Republic 2001         NA
## 601                           Georgia 2006  971.20309
## 701                             Haiti 2011 1482.22880
## 801                             Italy 1997   33.92683
## 901  Lao People's Democratic Republic 2002  329.83214
## 1001                         Malaysia 2007  581.64038
## 1101                       Mozambique 2012         NA
## 1201                          Nigeria 1998  112.61484
## 1301                           Poland 2003   77.93377
## 1401               Russian Federation 2008  841.01820
## 1501                          Somalia 2013  622.92871
## 1601             Syrian Arab Republic 1999         NA
## 1701                           Uganda 2004  754.09697
## 1801                       Uzbekistan 2009  604.86733