This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
This week assignment is to explore how data can be combined from different sources. We will explore 2 data sources. One already residing in database (mySQL) and the other is a .csv file (population.csv)
The Metada below is based on analysis of tb.csv table and basic assumptions on how this data was derived.
tb is the file that resides in mySQL. It is an aggregated file. Each record represents various new and relapse cases of turbeculosis per coutry per year, per gender. The data is separated based on age category: child, adult, elderly. Hence, 2 rows of the data represent the total new cases, for a country, for a year; one row per gender.
country; country for which the new and relapse cases of turbeculosis have been recorded.
Assumption is that country name is by UN convention.
year; year for which the new and relapse cases of turbeculosis have been recorded.
sex; gender, the new cases are segragated by gender, possible value ‘female’ or ‘male’.
child; count for new and relapse cases of turbeculosis for children of the given gender, in the given year, in the given country. We will assume that children are age 0 - 14 years of age.
If data is not available, a NULL value will be in the attribute.
adult; count for new and relapse cases of turbeculosis for adult of the given gender, in the given year,
in the given country.
We will assume that adult are age 15 - 64 years of age.
If data is not available, a NULL value will be in the attribute.
elderly; count for new and relapse cases of turbeculosis for children of the given gender, in the given year, in the given country. We will assume that elderly are age 65 and older. If data is not available, a NULL value will be in the attribute.
The other file population.csv represents the total population per country per year. Each record represent the population count for a given country for a given year.
country; country for which the population count is recorded. Assumption is that country name is by UN convention.
year; year for which the population count is recorded.
population; Actual population count for the country for the year.
Load population.csv from loca to mySQL population.csv was loaded in mysql using the sql script: population.sql https://raw.githubusercontent.com/vbriot28/datascienceCUNY_607/master/population.sql
Compare structure of tb and population From within mySQL, some analysis was done to confirm that every country/year in tb was found in population (and vice versa). We also verify that every year in tb were within 1995-2013 range. Finally, we confirmed that gender was always ‘female’ or ‘male’, that they were no unknown. This analysis was done running queries in mySQL. We then merge (join) the files matching by country and year and set NULL = -999. We decided to keep the data segragated by gender as to be able to pursue an line of inquiry by gender.
The various sql statements can be found in the sql script: cuny_607_week3_analysis.sql https://raw.githubusercontent.com/vbriot28/datascienceCUNY_607/master/cuny_607_week3_analysis.sql
New data set = tb_population.csv the new data set which is the result of the join of the 2 tables (tb and population) has been named tb_population.csv
and can be found on github at: https://raw.githubusercontent.com/vbriot28/datascienceCUNY_607/master/tb_population.csv
tb_population.csv has the following attribute: country, year, sex, child, adult, elderly, population as defined above.
The NULL values have been replaced by -999 and will be handled in subsequent transformation in R.
data in tb_population will be modified to be aggregated accross the columns child, adult, elderly and also accross rows
for same country, year. The granularity of the data was kept to allow statistical analysis along gender lines and age group lines.
library(RCurl)
## Loading required package: bitops
library(plyr)
# data frame df_tb_raw
x <- getURL("https://raw.githubusercontent.com/vbriot28/datascienceCUNY_607/master/tb_population.csv")
df_tb_raw <- read.csv(text = x, header = FALSE, sep = ";")
head(df_tb_raw)
## V1 V2 V3 V4 V5 V6 V7
## 1 Afghanistan 1995 female -999 -999 -999 17586073
## 2 Afghanistan 1995 male -999 -999 -999 17586073
## 3 Afghanistan 1996 female -999 -999 -999 18415307
## 4 Afghanistan 1996 male -999 -999 -999 18415307
## 5 Afghanistan 1997 female 96 5 1 19021226
## 6 Afghanistan 1997 male 26 0 0 19021226
# add column name to data frame, we will use gender instead of "sex"
names(df_tb_raw) <- c("country", "year", "gender", "child", "adult", "elderly", "population")
df_tb_raw$child[df_tb_raw$child==-999] <- NA
df_tb_raw$adult[df_tb_raw$adult==-999] <- NA
df_tb_raw$elderly[df_tb_raw$elderly==-999] <- NA
df2 <- mutate(df_tb_raw, total = child + adult + elderly)
df3 <- ddply(df2, c("country", "year"), summarize, total_count = sum(as.numeric(total), na.rm = FALSE), population_count = sum(as.numeric(population))/2)
tb_rate <- mutate(df3, rate = total_count/population_count, rate_100 = round(total_count/population_count*100000), 2)
head(tb_rate)
## country year total_count population_count rate rate_100
## 1 Afghanistan 1995 NA 17586073 NA NA
## 2 Afghanistan 1996 NA 18415307 NA NA
## 3 Afghanistan 1997 128 19021226 6.729324e-06 1
## 4 Afghanistan 1998 1778 19496836 9.119428e-05 9
## 5 Afghanistan 1999 745 19987071 3.727410e-05 4
## 6 Afghanistan 2000 2666 20595360 1.294466e-04 13
We will explore the following: Is there a gender or age factor in new instance of TB Is the overall infection rate decreasing in the last 10 years or so? Where are the regions with high incidence rate (as defined as 40 per 100,000 or higher)? Where are the regions with low incidence rate (as defined as 40 per 100,000 or lower)?
Using statistical graph (ggplot) and geo represenation (rworldmap) a visual represenation of analysis will be provided.