This is a brief exploratory data analysis (EDA) of epidemics using the dataset 'indicator_epidemic affected.xlsx' from gapminder.org. The dataset lists the number of deaths from various epidemics around the world by country and year, from 1970 to 2008.

library(gdata)   # provides read.xls
library(ggplot2) # provides qplot and ggplot
setwd('/Users/christopherkaalund/Documents/Study/Udacity Data Science/Data Analysis in R/Problem Set 3')
df = read.xls('indicator_epidemic affected.xlsx',sheet=1,header=TRUE)
df2 = tidyr::gather(df,'year','n',2:39)
names(df2) <- c('country','year','n')
summary(df2)

##         country          year            n          
##  Afghanistan:  38   X1970  : 143   Min.   :      0  
##  Albania    :  38   X1971  : 143   1st Qu.:      0  
##  Algeria    :  38   X1972  : 143   Median :      0  
##  Angola     :  38   X1974  : 143   Mean   :   4031  
##  Argentina  :  38   X1975  : 143   3rd Qu.:      0  
##  Australia  :  38   X1976  : 143   Max.   :6501000  
##  (Other)    :5206   (Other):4576
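
The summary above is consistent with 143 countries and 38 year columns (143 × 38 = 5,434 rows), and the zero median shows that most country-year cells are empty. As a minimal sketch of what the wide-to-long reshape does, here is a base-R version on a toy table. The data frame `wide` and its values are hypothetical and are not taken from the Gapminder file; `stack` is used here only for illustration in place of `tidyr::gather`.

```r
# Toy wide table: one row per country, one column per year (hypothetical values)
wide <- data.frame(
  country = c("A", "B"),
  X1970 = c(0, 120),
  X1971 = c(35, 0),
  stringsAsFactors = FALSE
)

# stack() puts all year columns into one 'values' column and records
# the source column name in 'ind', one year column after the other
stacked <- stack(wide[, -1])

# Reattach the country labels, repeated once per year column
long <- data.frame(
  country = rep(wide$country, times = ncol(wide) - 1),
  year = as.character(stacked$ind),
  n = stacked$values,
  stringsAsFactors = FALSE
)
print(long)

# Share of country-year cells with no reported deaths
zero_share <- mean(long$n == 0)
```

Each country now appears once per year, which is the shape the plots below expect.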

Histogram of deaths due to epidemics

The following is a histogram of the number of deaths per country-year record, for all countries and years, using a log scale on the horizontal axis. Records with zero deaths are excluded.

qplot(x=n,data=df2[df2$n>0,],fill=I('#099DD9'),binwidth=0.1) +
  scale_x_log10() +
  xlab('Number of deaths for each epidemic') +
  ggtitle('Histogram of deaths due to epidemics,\nall countries, from 1970 - 2008')
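
The `df2$n>0` filter matters because zero has no finite logarithm, so zero counts cannot be placed on a log axis. A small sketch with hypothetical counts:

```r
deaths <- c(0, 10, 1000)   # hypothetical counts, not from the dataset
log10(deaths)              # -Inf 1 3

# Keep only records that can be shown on a log scale
positive <- deaths[deaths > 0]
log_positive <- log10(positive)
```

Since the median of `n` is zero, this filter removes more than half of the rows before plotting.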

Top 5 countries for total epidemics

I made a frequency polygon of deaths for the top 5 countries with the greatest number of deaths due to epidemics.

dplyr::top_n(df2,5,n) # The 5 largest single country-year records (a country can appear more than once)

##      country  year       n
## 1      Japan X1978 2000000
## 2 Bangladesh X1991 1610700
## 3      Kenya X1994 6501000
## 4    Burundi X1999  616514
## 5    Burundi X2000  730999

df3 = dplyr::group_by(df2,country)
df4 = dplyr::summarise(df3,n=sum(n)) # total deaths in each country over the period 1970-2008
df5 = dplyr::top_n(df4,5,n) # Top 5 countries by total deaths. Result: Bangladesh, Brazil, Burundi, Japan, Kenya
df6 = dplyr::filter(df2,country %in% df5$country) # all records for the top 5 countries
qplot(x=n,data=df6[df6$n>0,],geom='freqpoly',color=country,binwidth=0.2)+scale_x_log10() # zeros dropped: they cannot be shown on a log scale
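
The two rankings in this section answer different questions: `top_n` on `df2` ranks single country-year records, while `top_n` on the per-country totals ranks countries. The following base-R sketch shows how the two can disagree. The `records` data frame and its numbers are made up for illustration.

```r
# Hypothetical long-format records
records <- data.frame(
  country = c("A", "A", "B", "C", "C", "C"),
  year = c(1970, 1971, 1970, 1970, 1971, 1972),
  n = c(500, 10, 300, 200, 200, 200),
  stringsAsFactors = FALSE
)

# Largest single country-year record
top_record <- records[which.max(records$n), ]

# Total per country, sorted from largest to smallest
totals <- aggregate(n ~ country, data = records, FUN = sum)
totals <- totals[order(-totals$n), ]
top_total <- head(totals, 1)

print(top_record)
print(top_total)
```

Here country A has the largest single record, but country C has the largest total because it has several moderate records.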

Plot of epidemics over time, all countries

The number of country-year records with at least one reported death appears to increase over the period.

df2$year = substr(df2$year,2,5) # Get rid of the X prefix on the year
df2$yearnum=as.numeric(df2$year) # Create numeric year column
qplot(x=yearnum,data=df2[df2$n>0,],fill=I('#088AA7'),binwidth=1) + 
  xlab('Year') +
  ggtitle('Number of epidemics per year, for all countries, from 1970 - 2008')
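
Note that the histogram above counts country-year records with a nonzero value, not distinct epidemics. As a rough sketch of the same two steps (cleaning the year labels, then counting nonzero records per year), here is a base-R version on hypothetical data:

```r
# Hypothetical records with the 'X' prefix that read.xls adds to year columns
toy <- data.frame(
  year = c("X1970", "X1970", "X1971", "X1971", "X1971"),
  n = c(0, 25, 40, 0, 7),
  stringsAsFactors = FALSE
)

# Strip the leading 'X' and convert to a number
toy$yearnum <- as.numeric(sub("^X", "", toy$year))

# Count records with at least one reported death, per year
counts <- table(toy$yearnum[toy$n > 0])
print(counts)
```

Using `sub("^X", "", ...)` instead of `substr` is a design choice for the sketch: it does not depend on the year always being four characters long.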

Boxplot of deaths due to epidemics over the years

The following is a boxplot of the logarithm of the number of deaths due to epidemics, for all countries, over the years 1970 - 2008.

qplot(x=year,y=log10(n),data=df2[df2$n>0,],fill=I('#088AA7'),geom='boxplot') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab('Log10(deaths)') +
  ggtitle('Deaths due to epidemics worldwide, from 1970 - 2008')
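
When reading the boxplot, it may help to remember that log10 preserves order, so the middle line of each box can be converted back to a median death count with `10^y`. The same is not true of the mean. A small check with hypothetical values:

```r
deaths <- c(10, 100, 100000)           # hypothetical, odd number of records

# Median on the log scale, converted back to a count
median_log <- median(log10(deaths))
median_back <- 10^median_log           # equals median(deaths)

# Mean on the log scale converts back to the geometric mean,
# which is far below the arithmetic mean for skewed data
mean_back <- 10^mean(log10(deaths))
arith_mean <- mean(deaths)
```

With an even number of records the median averages the two middle values, so the back-conversion is only approximate in that case.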

Barchart of the total number of deaths per year due to epidemics

df7 = dplyr::group_by(df2,year)
df8 = dplyr::summarise(df7,n=sum(n)) # total deaths in each year
ggplot(aes(x=year,y=log10(n)),data=df8) +
  geom_bar(stat='identity') +
  ylab('Log10(total deaths)') +
  ggtitle('Total number of deaths due to epidemics for the years 1970 - 2008') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
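
Because the bar heights are log10 totals, a difference of one unit between two bars corresponds to a tenfold difference in deaths. A minimal base-R sketch of the per-year totals, using hypothetical numbers and `tapply` in place of `dplyr`:

```r
# Hypothetical long-format records
toy <- data.frame(
  year = c("1970", "1970", "1971", "1971"),
  n = c(900, 100, 4, 6),
  stringsAsFactors = FALSE
)

# Sum deaths within each year, then take log10 of the totals
totals <- tapply(toy$n, toy$year, sum)
log_totals <- log10(totals)
print(log_totals)

# Gap between the two bars on the log scale
gap <- as.numeric(log_totals["1970"] - log_totals["1971"])
```

In this toy case the gap is two units, which means the 1970 total is one hundred times the 1971 total.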

Epidemics

Christopher Kaalund

8 March 2015
