This is a brief exploratory data analysis (EDA) of epidemics using the dataset 'indicator_epidemic affected.xlsx' from gapminder.org. The dataset lists the number of deaths from various epidemics around the world by country and year, from 1970 to 2008.

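library(gdata)   # provides read.xls
library(ggplot2) # provides qplot and ggplot
library(dplyr)   # provides group_by, top_n and related verbs
setwd('/Users/christopherkaalund/Documents/Study/Udacity Data Science/Data Analysis in R/Problem Set 3')
df = read.xls('indicator_epidemic affected.xlsx',sheet=1,header=TRUE)
df2 = tidyr::gather(df,'year','n',2:39) # reshape from one column per year to one row per country and year
names(df2) <- c('country','year','n')
summary(df2)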
##         country          year            n          
##  Afghanistan:  38   X1970  : 143   Min.   :      0  
##  Albania    :  38   X1971  : 143   1st Qu.:      0  
##  Algeria    :  38   X1972  : 143   Median :      0  
##  Angola     :  38   X1974  : 143   Mean   :   4031  
##  Argentina  :  38   X1975  : 143   3rd Qu.:      0  
##  Australia  :  38   X1976  : 143   Max.   :6501000  
##  (Other)    :5206   (Other):4576
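
The reshape step above turns one column per year into one row per country and year. The following is a minimal sketch of that idea on made-up data. The `wide` table and its values are hypothetical, and `pivot_longer` is used as tidyr's newer counterpart to `gather`, not as the call this analysis itself makes.

```r
library(tidyr)

# Hypothetical wide table: one row per country, one column per year.
# The values are invented for illustration, not taken from Gapminder.
wide <- data.frame(
  country = c("A", "B"),
  X1970 = c(0, 10),
  X1971 = c(5, 0),
  stringsAsFactors = FALSE
)

# Reshape to long format: one row per (country, year) pair.
long <- pivot_longer(wide, cols = -country, names_to = "year", values_to = "n")

# Strip the "X" prefix that R puts in front of column names starting with a digit.
long$year <- as.integer(sub("^X", "", long$year))

print(long)
```

The long format is what makes the later grouping by country or by year straightforward, since every observation sits in its own row.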

Histogram of deaths due to epidemics

The following is a histogram of deaths for all countries and years, using a log scale on the horizontal axis.

qplot(x=n,data=df2[df2$n>0,],fill=I('#099DD9'),binwidth=0.1) +
  scale_x_log10() +
  xlab('Number of deaths for each epidemic') +
  ggtitle('Histogram of deaths due to epidemics,\nall countries, from 1970 - 2008')
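
Rows with zero deaths are filtered out before plotting because the logarithm of zero is not finite. The base R sketch below uses invented values to show this, and to show roughly how a bin width of 0.1 on the log10 scale groups the remaining values. The exact bin edges that ggplot2 chooses may differ from this simple `floor` rule.

```r
# Invented death counts, for illustration only.
n <- c(0, 0, 12, 15, 300, 4500)

# log10(0) is -Inf, so zeros cannot be placed on a log axis.
print(log10(n))

# Keep strictly positive values, as df2[df2$n>0,] does.
positive <- n[n > 0]

# Assign each value to a bin of width 0.1 in log10 units.
bin <- floor(log10(positive) / 0.1)
print(table(bin))
```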

Top 5 countries by total deaths due to epidemics

The following is a frequency polygon of yearly deaths for the five countries with the greatest total number of deaths due to epidemics over the whole period.

dplyr::top_n(df2,5,n) # The 5 largest single-year values (one row per country and year)
##      country  year       n
## 1      Japan X1978 2000000
## 2 Bangladesh X1991 1610700
## 3      Kenya X1994 6501000
## 4    Burundi X1999  616514
## 5    Burundi X2000  730999
df3 = dplyr::group_by(df2,country)
df4 = dplyr::summarise(df3,n=sum(n)) # total deaths in each country over the period 1970~2008
df5 = dplyr::top_n(df4,5,n) # Top 5 countries by total deaths over the period. Result: Bangladesh, Brazil, Burundi, Japan, Kenya
df6 = dplyr::filter(df2,country %in% df5$country) # yearly rows for the top 5 countries
qplot(x=n,data=df6,geom='freqpoly',color=country,binwidth=0.2)+scale_x_log10()
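
Ranking by the largest single-year value and ranking by the total over all years can give different answers, so it helps to see the second approach in isolation. The sketch below uses an invented `toy` table and keeps the top 2 countries instead of 5. All names and values are hypothetical.

```r
library(dplyr)

# Invented long-format data, for illustration only.
toy <- data.frame(
  country = c("A", "A", "B", "B", "C", "C"),
  year    = c(1970, 1971, 1970, 1971, 1970, 1971),
  n       = c(100, 0, 30, 40, 5, 0),
  stringsAsFactors = FALSE
)

# Total per country, largest first.
totals <- toy %>%
  group_by(country) %>%
  summarise(total = sum(n)) %>%
  arrange(desc(total))

# Keep the top 2 countries (the analysis uses 5), then select their yearly rows.
top_countries <- head(totals$country, 2)
top_rows <- filter(toy, country %in% top_countries)

print(totals)
print(top_rows)
```

Filtering with `%in%` against the computed ranking avoids typing the country names by hand, so the selection stays correct if the data changes.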

Plot of epidemics over time, all countries

The number of country-year records with at least one death appears to increase over the period.

df2$year = substr(df2$year,2,5) # Get rid of the X prefix on the year
df2$yearnum=as.numeric(df2$year) # Create numeric year column
qplot(x=yearnum,data=df2[df2$n>0,],fill=I('#088AA7'),binwidth=1) + 
  xlab('Year') +
  ggtitle('Number of epidemics per year, for all countries, from 1970 - 2008')
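
The histogram above counts, for each year, the rows with a nonzero value. A minimal base R sketch of the same count on invented data follows. The `toy` table is hypothetical.

```r
# Invented long-format data, for illustration only.
toy <- data.frame(
  year = c("X1970", "X1970", "X1971", "X1971", "X1972", "X1972"),
  n    = c(0, 12, 7, 30, 0, 0),
  stringsAsFactors = FALSE
)

# Drop the "X" prefix and convert to a number, as in the analysis.
toy$yearnum <- as.numeric(substr(toy$year, 2, 5))

# Count the records with at least one death in each year.
counts <- tapply(toy$n > 0, toy$yearnum, sum)
print(counts)
```

Looking at the counts as a table alongside the plot makes it easier to judge whether a visual trend is driven by a few years or by the whole period.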

Boxplot of deaths due to epidemics over the years

The following is a boxplot of the logarithm of the number of deaths due to epidemics, for all countries, over the years 1970 - 2008.

qplot(x=year,y=log10(n),data=df2[df2$n>0,],fill=I('#088AA7'),geom='boxplot') + # binwidth does not apply to boxplots
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylab('Log10(deaths)') +
  ggtitle('Deaths due to epidemics worldwide, from 1970 - 2008')

Bar chart of the total number of deaths per year due to epidemics

df7 = dplyr::group_by(df2,year)
df8 = dplyr::summarise(df7,n=sum(n)) # total deaths in each year
ggplot(aes(x=year,y=log10(n)),data=df8) +
  geom_bar(stat='identity') +
  ylab('Log10(total deaths)') +
  ggtitle('Total number of deaths due to epidemics for the years 1970 - 2008') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
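
One thing to watch when plotting log10 of yearly totals is a year whose total is zero, since its logarithm is -Inf and no bar can be drawn for it. The base R sketch below illustrates this on an invented `toy` table. It uses `aggregate` as a stand-in for the grouping step, which is an assumption made for the example and not the call this analysis uses.

```r
# Invented long-format data, for illustration only.
toy <- data.frame(
  year = c("1970", "1970", "1971", "1971", "1972"),
  n    = c(10, 90, 0, 0, 1000),
  stringsAsFactors = FALSE
)

# Total per year.
totals <- aggregate(n ~ year, data = toy, FUN = sum)

# log10 of the totals; a zero total becomes -Inf.
totals$log_total <- log10(totals$n)
print(totals)
```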