First the population data is read in:
pop <- read.csv('./population.csv')
Then the tuberculosis data is read in, using the column names from the tb.sql file:
tb <- read.csv('./tb.csv', header = FALSE, col.names = c('country', 'year', 'sex', 'child', 'adult', 'elderly'))
Values of -1 in the child, adult, and eldery columns are converted to NA:
tb$child <- ifelse(tb$child == -1, NA, tb$child)
tb$adult <- ifelse(tb$adult == -1, NA, tb$adult)
tb$elderly <- ifelse(tb$elderly == -1, NA, tb$elderly)
The columns of tb are then summed to give the total number of tuberculosis cases:
tb$cases <- tb$child + tb$adult + tb$elderly
The dpylr package is used to join the population and tb datasets:
require(dplyr)
pop_tb <- inner_join(pop, tb, by = c("country" = "country", "year" = "year"))
The number of cases is then summed across genders and the infection rate is calculated, as cases per 1,000,000 residents:
by_country_year <- group_by(pop_tb, country, year, population)
infection_rate <- summarize(by_country_year, cases=sum(cases))
infection_rate$rate <- infection_rate$cases/infection_rate$population * 1000000
infection_rate <- as.data.frame(infection_rate[, c(1, 2, 5)])
A sample of the resulting dataset:
infection_rate[seq(1, nrow(infection_rate), 100), ]
## country year rate
## 1 Afghanistan 1995 NA
## 101 Bangladesh 2000 286.91693
## 201 Brazil 2005 226.13332
## 301 Cameroon 2010 701.30719
## 401 Colombia 1996 NA
## 501 Dominican Republic 2001 NA
## 601 Georgia 2006 971.20309
## 701 Haiti 2011 1482.22880
## 801 Italy 1997 33.92683
## 901 Lao People's Democratic Republic 2002 329.83214
## 1001 Malaysia 2007 581.64038
## 1101 Mozambique 2012 NA
## 1201 Nigeria 1998 112.61484
## 1301 Poland 2003 77.93377
## 1401 Russian Federation 2008 841.01820
## 1501 Somalia 2013 622.92871
## 1601 Syrian Arab Republic 1999 NA
## 1701 Uganda 2004 754.09697
## 1801 Uzbekistan 2009 604.86733