607_Week3_Assignment

The following creates an R dataframe that shows rates of tuburculosis infection by country.

The following analysis requires the RMySQL, Hmisc, and plyr libraries, along with dependencies.

library(RMySQL)

## Loading required package: DBI

library(Hmisc)

## Loading required package: lattice

## Loading required package: survival

## Loading required package: Formula

## Loading required package: ggplot2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:base':
## 
##     format.pval, round.POSIXt, trunc.POSIXt, units

library(plyr)

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:Hmisc':
## 
##     is.discrete, summarize

First, load the population.csv and tb.csv into R list. Make sure that you have run the tb.R script on a database named tb. I have hidden the connection string and if you want to reproduce this code you will need to connect to your own database. The variable “con” refers to my connection string.

The second table was loaded into my github.

population<-read.csv("https://raw.githubusercontent.com/RobertSellers/R/master/data/population.csv", header= TRUE, sep=",")

The following demonstrates a simple select query using RMySQL along with some basic R-side refinement of the retrieved dataset.

tb_load = dbSendQuery(con, "select * from tb")
tb = fetch(tb_load, n=-1)
colnames(tb)<-c("country","year", "sex","child","adult","elderly")
head(tb)

##       country year    sex child adult elderly
## 1 Afghanistan 1995 female    NA    NA      NA
## 2 Afghanistan 1995   male    NA    NA      NA
## 3 Afghanistan 1996 female    NA    NA      NA
## 4 Afghanistan 1996   male    NA    NA      NA
## 5 Afghanistan 1997 female     5    96       1
## 6 Afghanistan 1997   male     0    26       0

First we calculate the total per record, and remove NA values.

tb$total_tb<- with(tb, child+adult+elderly)
head(tb)

##       country year    sex child adult elderly total_tb
## 1 Afghanistan 1995 female    NA    NA      NA       NA
## 2 Afghanistan 1995   male    NA    NA      NA       NA
## 3 Afghanistan 1996 female    NA    NA      NA       NA
## 4 Afghanistan 1996   male    NA    NA      NA       NA
## 5 Afghanistan 1997 female     5    96       1      102
## 6 Afghanistan 1997   male     0    26       0       26

tb<-tb[!(tb$child=="-1" | tb$adult=="-1" | tb$elderly=="-1"),]

We then, for purposes of thorough data joining, add underscores to the countrynames of both tables we with to join.

tb$country<-gsub(" ","_",tb$country)
population$country<-gsub(" ","_",population$country)

With some help consulting stackoverflow for usage of the ddply line, we first connect the year with the underscored countries, separated by a space. We then can see the data grouped by country and year inside tb2.

tb$concat <- paste(tb$country, tb$year)
tb2<-ddply(tb, .(concat), summarize,tot=sum(total_tb))
tb2$country <-lapply(strsplit(as.character(tb2$concat), " "), "[", 1)
tb2$year <-lapply(strsplit(as.character(tb2$concat), " "), "[",2)
tb2$concat <-NULL
head(tb2)

##    tot     country year
## 1  128 Afghanistan 1997
## 2 1778 Afghanistan 1998
## 3  745 Afghanistan 1999
## 4 2666 Afghanistan 2000
## 5 4639 Afghanistan 2001
## 6 6509 Afghanistan 2002

After converting the data, the two datasets are merged by country and year and the rate column, or tuburculosis rate per capita, is calculated. It can be converted into either scientific or decimal notation.

xx<- as.data.frame(lapply(tb2,unlist))
tb3<-merge(xx,population, by = c("country","year"), all=TRUE)
tb3$rate <- tb3$tot/tb3$population
tb3$rate <- format(tb3$rate, scientific = FALSE)
head(tb3)

##       country year  tot population           rate
## 1 Afghanistan 1995   NA   17586073             NA
## 2 Afghanistan 1996   NA   18415307             NA
## 3 Afghanistan 1997  128   19021226 0.000006729324
## 4 Afghanistan 1998 1778   19496836 0.000091194284
## 5 Afghanistan 1999  745   19987071 0.000037274096
## 6 Afghanistan 2000 2666   20595360 0.000129446633

Some housekeeping.

tb3$tot = NULL
tb3$population = NULL
tb3$country<-gsub("_"," ",tb3$country)
tb3$rate <- as.numeric(tb3$rate)

## Warning: NAs introduced by coercion

We can now add the data to our SQL database using the following code to refresh the database and create/repopulate the tables.

population$country<-gsub("_"," ",population$country)
dbSendQuery(con,'drop table if exists population,rates')

## <MySQLResult:0,0,1>

dbWriteTable(con, name="population",value=population)

## [1] TRUE

dbWriteTable(con, name="rates",value=tb3)

## [1] TRUE

dbDisconnect(con)

## [1] TRUE

We are no longer in need of MySQL, but it would be nice to represent this data in a visually meaningful way. To do this we can create error bars (a misnomer in this situation). This idea was modified and inspired by code found at stackoverflow.

set.seed(42)
df <- data.frame(x = tb3$country, y = tb3$rate)
df2 <- ddply(df,.(x),function(df) c(mean=mean(df$y, na.rm=TRUE),min=min(df$y, na.rm=TRUE),max=max(df$y, na.rm=TRUE)))

## Warning in min(df$y, na.rm = TRUE): no non-missing arguments to min;
## returning Inf

## Warning in max(df$y, na.rm = TRUE): no non-missing arguments to max;
## returning -Inf

df2<-df2[-60,]
df3<-subset(df2[order(df2$mean, decreasing=T),])
top <- df3[ df3$mean >= df3$mean[order(df3$mean, decreasing=TRUE)][20] , ]
df4<-subset(df2[order(df2$mean, decreasing=F),])
bottom <- df4[ df4$mean <= df4$mean[order(df4$mean, decreasing=FALSE)][20] , ]

We can calculate both the nations with the highest mean rates…

bottom <- df4[ df4$mean <= df4$mean[order(df4$mean, decreasing=FALSE)][20] , ]
with(arrange(top,top$mean),errbar(x,mean,max,min, ylim=NULL))
title(main="Tuburculosis Rates 1995-2013 - TOP 20", xlab="Percentage range with mean")

…and can calculate both the nations with the lowest mean rates.

with(arrange(bottom,rev(bottom$mean)),errbar(x,mean,max,min, ylim=NULL))
title(main="Tuburculosis Rates 1995-2013 - BOTTOM 20", xlab="Percentage range with mean")

607_Week3_Assignment

Robert Sellers

February 8, 2016