This is a brief exploratory data analysis (EDA) of epidemics using the dataset 'indicator_epidemic affected.xlsx' from gapminder.org. The dataset lists the number of deaths from various epidemics around the world by country and year, from 1970 to 2008.
library(gdata) # provides read.xls
library(ggplot2) # provides qplot and ggplot
setwd('/Users/christopherkaalund/Documents/Study/Udacity Data Science/Data Analysis in R/Problem Set 3')
df = read.xls('indicator_epidemic affected.xlsx',sheet=1,header=TRUE)
df2 = tidyr::gather(df,'year','n',2:39)
names(df2) <- c('country','year','n')
summary(df2)
## country year n
## Afghanistan: 38 X1970 : 143 Min. : 0
## Albania : 38 X1971 : 143 1st Qu.: 0
## Algeria : 38 X1972 : 143 Median : 0
## Angola : 38 X1974 : 143 Mean : 4031
## Argentina : 38 X1975 : 143 3rd Qu.: 0
## Australia : 38 X1976 : 143 Max. :6501000
## (Other) :5206 (Other):4576
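The reshaping step above turns one column per year into one row per country-year pair. The following is a minimal sketch of the same idea on a made-up three-country table (the values are hypothetical, not taken from the Gapminder file). It uses `tidyr::pivot_longer`, which supersedes `gather` in current versions of tidyr, and assumes tidyr 1.1.0 or later for `names_transform`.

```r
library(tidyr)

# Hypothetical wide table: one row per country, one column per year
wide <- data.frame(
  country = c("A", "B", "C"),
  X1970 = c(0, 12, 0),
  X1971 = c(5, 0, 0),
  X1972 = c(0, 0, 300)
)

# One row per country-year pair; strip the "X" prefix that R adds to
# numeric column names and store the year as an integer
long <- pivot_longer(
  wide,
  cols = -country,
  names_to = "year",
  names_prefix = "X",
  names_transform = list(year = as.integer),
  values_to = "n"
)

print(long)
```

Converting the year while reshaping would avoid the separate `substr` step used later in this analysis, at the cost of requiring a newer tidyr.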
The following is a histogram of deaths for all countries and years, using a log scale on the horizontal axis.
qplot(x=n,data=df2[df2$n>0,],fill=I('#099DD9'),binwidth=0.1) +
scale_x_log10() +
xlab('Number of deaths for each epidemic') +
ggtitle('Histogram of deaths due to epidemics,\nall countries, from 1970 - 2008')
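The summary above shows a median of 0 and a maximum of 6,501,000, which is why the histogram drops zeros and uses a log scale. As a rough base-R sketch of what a bin width of 0.1 means on a log10 axis (the counts below are invented for illustration):

```r
# Hypothetical counts spanning several orders of magnitude (not real data)
n <- c(0, 0, 3, 40, 45, 500, 7000, 6501000)

# log10(0) is -Inf, so zeros must be dropped before using a log axis
positive <- n[n > 0]
logn <- log10(positive)

# Bins of width 0.1 in log10 units: each bin spans a factor of 10^0.1
breaks <- seq(floor(min(logn)), ceiling(max(logn)), by = 0.1)
h <- hist(logn, breaks = breaks, plot = FALSE)

print(sum(h$counts)) # number of non-zero observations that were binned
print(10^0.1)        # multiplicative width of one bin
```

Each bar in the plot therefore covers a constant ratio of death counts rather than a constant difference.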
I made a frequency polygon of deaths for the top 5 countries with the greatest number of deaths due to epidemics.
dplyr::top_n(df2,5,n) # Top 5 country-year records by number of deaths in a single year
## country year n
## 1 Japan X1978 2000000
## 2 Bangladesh X1991 1610700
## 3 Kenya X1994 6501000
## 4 Burundi X1999 616514
## 5 Burundi X2000 730999
df3 = dplyr::group_by(df2,country)
df4 = dplyr::summarise(df3,n=sum(n)) # total deaths in each country over the period 1970~2008 (sums only n, not the year column)
df5 = dplyr::top_n(df4,5,n) # Top 5 countries by total deaths over the period. Result: Bangladesh, Brazil, Burundi, Japan, Kenya
df6 = dplyr::filter(df2,country %in% df5$country) # epidemics for top 5 countries
qplot(x=n,data=df6,geom='freqpoly',color=country,binwidth=0.2)+scale_x_log10()
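The two rankings above answer different questions: the largest single country-year record versus the largest total over the whole period. A small base-R sketch on invented data (country names and values are hypothetical) shows how the two can disagree:

```r
# Hypothetical long-format data (not from the Gapminder file)
toy <- data.frame(
  country = rep(c("A", "B", "C"), each = 3),
  year = rep(1970:1972, times = 3),
  n = c(100, 0, 0,  60, 60, 60,  0, 5, 0)
)

# Largest single country-year record
largest_single <- toy[which.max(toy$n), ]

# Total per country over all years, sorted from largest to smallest
totals <- aggregate(n ~ country, data = toy, FUN = sum)
top_total <- totals[order(-totals$n), ]

print(largest_single)
print(top_total)
```

Here country A has the largest single year, while country B has the largest total, so which ranking to plot depends on the question being asked.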
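The number of epidemics recorded per year seems to be increasing over the period, as the following histogram of non-zero records by year shows.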
df2$year = substr(df2$year,2,5) # Get rid of the X prefix on the year
df2$yearnum=as.numeric(df2$year) # Create numeric year column
qplot(x=yearnum,data=df2[df2$n>0,],fill=I('#088AA7'),binwidth=1) +
xlab('Year') +
ggtitle('Number of epidemics per year, for all countries, from 1970 - 2008')
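The histogram above counts country-year records with a non-zero death count, so a country contributes at most one "epidemic" per year. A minimal base-R sketch of the same count on invented data (all values hypothetical), including the removal of the X prefix from the year labels:

```r
# Hypothetical long-format data with X-prefixed year labels (not real data)
toy <- data.frame(
  country = rep(c("A", "B", "C"), times = 3),
  year = rep(c("X1970", "X1971", "X1972"), each = 3),
  n = c(0, 0, 4,  10, 0, 7,  3, 8, 2)
)

# Drop the leading "X" and convert the remaining four characters to a number
toy$yearnum <- as.numeric(substr(toy$year, 2, 5))

# Number of non-zero records in each year
counts <- table(toy$yearnum[toy$n > 0])
print(counts)
```

Because only non-zero records are counted, a rising count may partly reflect better reporting in later years rather than more epidemics; the data alone cannot separate the two.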
The following is a boxplot of the logarithm of the number of deaths due to epidemics, for all countries, over the years 1970 - 2008.
qplot(x=year,y=log10(n),data=df2[df2$n>0,],fill=I('#088AA7'),geom='boxplot') +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylab('Log10(deaths)') +
ggtitle('Deaths due to epidemics worldwide, from 1970 - 2008')
df7 = dplyr::group_by(df2,year)
df8 = dplyr::summarise(df7,n=sum(n)) # total deaths in each year (sums only n, not the country column)
ggplot(aes(x=year,y=log10(n)),data=df8) +
geom_bar(stat='identity') +
ylab('Log10(total deaths)') +
ggtitle('Total number of deaths due to epidemics for the years 1970 - 2008') +
theme(axis.text.x = element_text(angle = 90, hjust = 1))