The following creates an R dataframe that shows rates of tuburculosis infection by country.
The following analysis requires the RMySQL, Hmisc, and plyr libraries, along with dependencies.
library(RMySQL)
## Loading required package: DBI
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
library(plyr)
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:Hmisc':
##
## is.discrete, summarize
First, load the population.csv and tb.csv into R list. Make sure that you have run the tb.R script on a database named tb. I have hidden the connection string and if you want to reproduce this code you will need to connect to your own database. The variable “con” refers to my connection string.
The second table was loaded into my github.
population<-read.csv("https://raw.githubusercontent.com/RobertSellers/R/master/data/population.csv", header= TRUE, sep=",")
The following demonstrates a simple select query using RMySQL along with some basic R-side refinement of the retrieved dataset.
tb_load = dbSendQuery(con, "select * from tb")
tb = fetch(tb_load, n=-1)
colnames(tb)<-c("country","year", "sex","child","adult","elderly")
head(tb)
## country year sex child adult elderly
## 1 Afghanistan 1995 female NA NA NA
## 2 Afghanistan 1995 male NA NA NA
## 3 Afghanistan 1996 female NA NA NA
## 4 Afghanistan 1996 male NA NA NA
## 5 Afghanistan 1997 female 5 96 1
## 6 Afghanistan 1997 male 0 26 0
First we calculate the total per record, and remove NA values.
tb$total_tb<- with(tb, child+adult+elderly)
head(tb)
## country year sex child adult elderly total_tb
## 1 Afghanistan 1995 female NA NA NA NA
## 2 Afghanistan 1995 male NA NA NA NA
## 3 Afghanistan 1996 female NA NA NA NA
## 4 Afghanistan 1996 male NA NA NA NA
## 5 Afghanistan 1997 female 5 96 1 102
## 6 Afghanistan 1997 male 0 26 0 26
tb<-tb[!(tb$child=="-1" | tb$adult=="-1" | tb$elderly=="-1"),]
We then, for purposes of thorough data joining, add underscores to the countrynames of both tables we with to join.
tb$country<-gsub(" ","_",tb$country)
population$country<-gsub(" ","_",population$country)
With some help consulting stackoverflow for usage of the ddply line, we first connect the year with the underscored countries, separated by a space. We then can see the data grouped by country and year inside tb2.
tb$concat <- paste(tb$country, tb$year)
tb2<-ddply(tb, .(concat), summarize,tot=sum(total_tb))
tb2$country <-lapply(strsplit(as.character(tb2$concat), " "), "[", 1)
tb2$year <-lapply(strsplit(as.character(tb2$concat), " "), "[",2)
tb2$concat <-NULL
head(tb2)
## tot country year
## 1 128 Afghanistan 1997
## 2 1778 Afghanistan 1998
## 3 745 Afghanistan 1999
## 4 2666 Afghanistan 2000
## 5 4639 Afghanistan 2001
## 6 6509 Afghanistan 2002
After converting the data, the two datasets are merged by country and year and the rate column, or tuburculosis rate per capita, is calculated. It can be converted into either scientific or decimal notation.
xx<- as.data.frame(lapply(tb2,unlist))
tb3<-merge(xx,population, by = c("country","year"), all=TRUE)
tb3$rate <- tb3$tot/tb3$population
tb3$rate <- format(tb3$rate, scientific = FALSE)
head(tb3)
## country year tot population rate
## 1 Afghanistan 1995 NA 17586073 NA
## 2 Afghanistan 1996 NA 18415307 NA
## 3 Afghanistan 1997 128 19021226 0.000006729324
## 4 Afghanistan 1998 1778 19496836 0.000091194284
## 5 Afghanistan 1999 745 19987071 0.000037274096
## 6 Afghanistan 2000 2666 20595360 0.000129446633
Some housekeeping.
tb3$tot = NULL
tb3$population = NULL
tb3$country<-gsub("_"," ",tb3$country)
tb3$rate <- as.numeric(tb3$rate)
## Warning: NAs introduced by coercion
We can now add the data to our SQL database using the following code to refresh the database and create/repopulate the tables.
population$country<-gsub("_"," ",population$country)
dbSendQuery(con,'drop table if exists population,rates')
## <MySQLResult:0,0,1>
dbWriteTable(con, name="population",value=population)
## [1] TRUE
dbWriteTable(con, name="rates",value=tb3)
## [1] TRUE
dbDisconnect(con)
## [1] TRUE
We are no longer in need of MySQL, but it would be nice to represent this data in a visually meaningful way. To do this we can create error bars (a misnomer in this situation). This idea was modified and inspired by code found at stackoverflow.
set.seed(42)
df <- data.frame(x = tb3$country, y = tb3$rate)
df2 <- ddply(df,.(x),function(df) c(mean=mean(df$y, na.rm=TRUE),min=min(df$y, na.rm=TRUE),max=max(df$y, na.rm=TRUE)))
## Warning in min(df$y, na.rm = TRUE): no non-missing arguments to min;
## returning Inf
## Warning in max(df$y, na.rm = TRUE): no non-missing arguments to max;
## returning -Inf
df2<-df2[-60,]
df3<-subset(df2[order(df2$mean, decreasing=T),])
top <- df3[ df3$mean >= df3$mean[order(df3$mean, decreasing=TRUE)][20] , ]
df4<-subset(df2[order(df2$mean, decreasing=F),])
bottom <- df4[ df4$mean <= df4$mean[order(df4$mean, decreasing=FALSE)][20] , ]
We can calculate both the nations with the highest mean rates…
bottom <- df4[ df4$mean <= df4$mean[order(df4$mean, decreasing=FALSE)][20] , ]
with(arrange(top,top$mean),errbar(x,mean,max,min, ylim=NULL))
title(main="Tuburculosis Rates 1995-2013 - TOP 20", xlab="Percentage range with mean")
…and can calculate both the nations with the lowest mean rates.
with(arrange(bottom,rev(bottom$mean)),errbar(x,mean,max,min, ylim=NULL))
title(main="Tuburculosis Rates 1995-2013 - BOTTOM 20", xlab="Percentage range with mean")