Tuberculosis is a contagious disease that has 19th century origins but still proliferates at high rates in the developing world. This exercise looks at a 20 year data set of countries and observed cases. The subset of data being evaluated queries this data after it has been transformed via ETL processing to replace null values with 0’s and aggragate infection rates per country.
Original data-set was sourced from the World Health Organization’s Tuberculosis http://www.who.int/tb/country/data/download/en/.
loads the PostgreSQL driver
library(RPostgreSQL)
## Loading required package: DBI
library(TTR)
assign connection parameters and connect to db
dbname <- "tb"
dbuser <- "postgres"
dbpass <- "postgres"
dbhost <- "localhost"
dbport <- 5432
drv <- dbDriver("PostgreSQL")
con <- dbConnect(drv, host=dbhost, port=dbport, dbname=dbname,user=dbuser, password=dbpass)
After initially loading the tb database, the objective is then to discover useful patterns that are easily explainable. The focus on of this investigation is on infection rates which we will measure using aggregate ordinal statistics such as head counts and trends regressed by year. The primary intuition is that data-reducing to one set of observations per country of infections and trends, it becomes possible to identify which countries may be facing an epidemic and which countries are succesfully reversing a trend despite high infection rates. Finally, this data set initially considers only the top 25 countries by headcount so as to isolate the observed cases of greatest consequence first. A follow-on investigation may look into details specific to countries that are at high risk as well as countries with low infection rates yet that are trending higher.
query tuberculosis rates with cleaned data
query <- dbSendQuery(
con, query <- "
select
country,
round(avg(population)),
round(avg(rate_per_100thsd)) avgRate_per_100thsd,
CASE
WHEN regr_slope(rate_per_100thsd, year) > 0 THEN 'higher'
WHEN regr_slope(rate_per_100thsd, year) < 0 THEN 'lower'
ELSE 'na'
END trend,
regr_slope(rate_per_100thsd, year) regr_slope
from tb_rates
where 1=1
and rate_per_100thsd <> 0
group by country
order by round(avg(rate_per_100thsd)) desc
limit 25 ")
rs1 <- fetch(query,n=-1)
be good citizen, closeout connect {r disconnect, eval=TRUE} dbDisconnect(con) dbUnloadDriver(drv) ```
Load Data Frame from website *Note, assigning from rs1 is duplicative but loads data from database and csv extract
theUrl <- "https://raw.githubusercontent.com/scottkarr/IS607-scottkarr-wk3/master/extract"
df_tbrates <- read.table(file = theUrl, header = TRUE, sep = ",")
df_tbrates <- rs1
colnames(df_tbrates)[2] <- "avgpopulation"
Note the few outlier countries with 100+ cases per 100 thousand persons such as South Africa. Also note that most trends show an increasing rate of ~5% however, there are outliers that exceed 20%.
boxplot(df_tbrates$avgrate_per_100thsd ~ df_tbrates$trend,
main="TB Infection Rate Distribution \r\n Top 25 ranked countries",
xlab="Trending Category",
ylab="infection rate per 100 thsd"
)
hist(
df_tbrates$avgrate_per_100thsd,
main="Distribution of Countries",
xlab="Distribution of Infection rates per 100 Thsd"
)
hist(
df_tbrates$regr_slope,
main="Trend infection rate",
xlab="Magnitude of Trend \r\n (+) increasing (-) decreasing"
)
plot(
df_tbrates$regr_slope,
df_tbrates$avgrate_per_100thsd,
main="TB Infection Rate Distribution \r\n vs. Magnitude of Trend \r\n Top 25 ranked countries",
xlab="Degree of Improvement",
ylab="Infection rates per 100 Thsd"
)