Problem set 4 - Udacity Data Analysis in R

## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## 
## Loading required package: XLConnectJars
## XLConnect 0.2-10 by Mirai Solutions GmbH [aut],
##   Martin Studer [cre],
##   The Apache Software Foundation [ctb, cph] (Apache POI, Apache Commons
##     Codec),
##   Stephen Colebourne [ctb, cph] (Joda-Time Java library)
## http://www.mirai-solutions.com ,
## http://miraisolutions.wordpress.com

This is an EDA study of worldwide arms imports and exports data, spanning 1960 - 2010. My motivation for studying this particular data set was simply curiousity. The data was downloaded as an Excel file from http://www.gapminder.org/data/. I also downloaded population data to look for correlations between arms imports and population.

Firstly, I examined and cleaned up exports data. I found that many cells contained NA’s. That is, there was a lot of missing data. Complete arms export data for every year from 1960 - 2010 was available only for a small fraction of countries. Arms exports are given in 1990 $US terms.

setwd('/Users/christopherkaalund/Documents/Study/Udacity Data Science/Data Analysis in R/Problem Set 4')
dfexports = readWorksheetFromFile('Arms exports.xlsx', 1, useCachedValues=TRUE)
dfexports = tidyr::gather(dfexports,'year','n',2:52) # Gather into columns
dfexports$X2011 = NULL # Delete column X2011, which is empty
names(dfexports) = c('country','year','n') # n is the value of arms exports
dfexports$year = substr(dfexports$year,2,5) # Get rid of the X prefix on the year
dfexports = dfexports[!is.na(dfexports$n),] # Remove rows with NA in n column
dfexports$n = dfexports$n/10^9 # Convert n to billions of dollars

The following code produces a bar chart of mean arms exports for the top 10 countries. Since the data is incomplete, the total number of arms exports cannot be worked out for most countries. The mean value of arms exports is still a useful quantity, although it can fluctuate greatly between years and decades. I calculated the mean number of arms exports only for those countries having >=5 years worth of data.

df.exports_by_country = dfexports %>%
  group_by(country) %>%
  summarise(
    exports_mean=mean(n),
    exports_median=median(n),
    exports_total=sum(n),
    exports_max=max(n),
    exports_min=min(n),
    count=n())
df.exports_by_country = subset(df.exports_by_country,count>=5) # Remove countries for which count is <5, as the average no good in this case
df.exports_by_country_top10 = top_n(arrange(df.exports_by_country,exports_mean),10,exports_total)
df.exports_by_country_top10$country = factor(df.exports_by_country_top10$country,levels=df.exports_by_country_top10$country,ordered=TRUE) # maintain ordering of country in the plot axis
ggplot(aes(x=country,y=exports_mean),data=df.exports_by_country_top10) +
  geom_bar(stat='identity') +
  ylab('Mean arms exports ($US billions)') +
  ggtitle('Mean arms exports by country, 1960 - 2010, top ten') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Mean arms exports can fluctuate greatly over time. The code below produces a graph of mean arms exports for all countries over time.

dfexports$year = as.integer(dfexports$year) # Plots more nicely with integer scale on x
df.exports_by_year = dfexports %>%
  group_by(year) %>%
  summarise(
    exports_mean=mean(n),
    exports_median=median(n),
    exports_total=sum(n),
    exports_max=max(n),
    exports_min=min(n),
    count=n()) %>%
  arrange(year)

ggplot(aes(x=year,y=exports_mean),data=df.exports_by_year) +
  geom_line() +
  ylab('Mean exports ($US billions)') +
  ggtitle('Mean yearly arms exports, all countries')

The below plot shows exports for countries with complete data for all years. The fluctuations in this data are still somewhat large.

countries1 = subset(df.exports_by_country, count==51)$country
df.exports_by_year_complete = subset(dfexports,country %in% countries1) %>%
  group_by(year) %>%
  summarise(
    exports_mean=mean(n),
    exports_median=median(n),
    exports_total=sum(n),
    exports_max=max(n),
    exports_min=min(n),
    count=n()) %>%
  arrange(year)

ggplot(aes(x=year,y=exports_mean),data=df.exports_by_year_complete) +
  geom_line() +
  ylab('Mean exports ($US billions)') +
  ggtitle('Mean yearly arms exports for countries\nwith data for every year from 1960 - 2010')

The following code plots exports over time for all countries and all years. The previous graph of mean exports illustrates fluctuations over time more clearly, however this graph demonstrates that most countries in most years export few arms and that a few countries export a large number of arms, although the graph does not identify those countries. A histogram and frequency polygon, with a log scale, also illustrates this point by showing that is distribution is long-tailed.

p1 = ggplot(aes(x=year,y=n),data=dfexports) +
  geom_point() +
  geom_line(stat='summary',fun.y=quantile,probs=0.1,linetype=2,color='blue') +
  geom_line(stat='summary',fun.y=quantile,probs=0.9,linetype=2,color='blue') +
  geom_line(stat='summary',fun.y=quantile,probs=0.5,color='purple') +
  ylab('Arms exports ($US billions)') +
  ggtitle('Yearly arms exports, all countries, 1960 - 2010')

p1+coord_cartesian(ylim=c(0,3)) # set upper limit of y-scale to 3 to better view 90% quantile

ggplot(aes(x=n),data=dfexports) +
  geom_histogram() +
  xlab('Arms exports ($US billions)') +
  ggtitle('Histogram of arms exports, 1960 - 2010, all countries')

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

ggplot(aes(x=n),data=dfexports) +
  geom_freqpoly() +
  scale_x_log10() +
  xlab('Arms exports ($US billions)') +
  ggtitle('Frequency polygon of arms exports, 1960 - 2010, all countries')

## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

The following code imports and cleans arms import data and produces a bar chart of the top ten countries for arms imports for the period 1960 - 2010.

dfimports = readWorksheetFromFile('Arms imports.xlsx', 1, useCachedValues=TRUE)
dfimports = tidyr::gather(dfimports,'year','n',2:52) # gather into columns
dfimports$X2011 = NULL # delete column X2011
names(dfimports) = c('country','year','n')
dfimports$year = substr(dfimports$year,2,5) # Get rid of the X prefix on the year
dfimports = dfimports[!is.na(dfimports$n),] # Remove rows with NA in n column
dfimports$n = dfimports$n/10^9 # Convert n to billions of dollars
dfimports$year = as.integer(dfimports$year) # Make year an integer

df.imports_by_country = dfimports %>%
  group_by(country) %>%
  summarise(
    imports_mean=mean(n),
    imports_median=median(n),
    imports_total=sum(n),
    imports_max=max(n),
    imports_min=min(n),
    count=n())
df.imports_by_country = subset(df.imports_by_country,count>=5) # Remove countries for which count is <5, as average no good in this case
df.imports_by_country_top10 = top_n(arrange(df.imports_by_country,imports_mean),10,imports_total)
df.imports_by_country_top10$country = factor(df.imports_by_country_top10$country,levels=df.imports_by_country_top10$country,ordered=TRUE) # keep them ordered in plot
ggplot(aes(x=country,y=imports_mean),data=df.imports_by_country_top10) +
  geom_bar(stat='identity') +
  ylab('Mean arm imports (billion USD)') +
  ggtitle('Mean yearly arms imports by country, 1960 - 2010, top ten') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

I now searched for a correlation between arms imports and exports. I expected an inverse relationship, since a country that produces arms has a reduced need to import them, and vis versa, although some countries may import particular types of arms and export others The correlation coefficient that I calculated was only 0.017. i.e. uncorrelated. This lack of correlation is evident in a graph of mean exports vs. mean imports for all countries.

df.imports_exports = inner_join(df.exports_by_country,df.imports_by_country,by="country")
p2 = ggplot(aes(x=imports_mean,y=exports_mean),data=df.imports_exports) +
  geom_point() +
  ggtitle('Arms exports vs. imports for all countries, 1960 - 2010') +
  xlab('Mean yearly arms imports ($US billions)') +
  ylab('Mean yearly arms exports ($US billions)')

p2

p2 + coord_cartesian(ylim=c(0,0.5))

cor.test(df.imports_exports$exports_mean,df.imports_exports$imports_mean) # ans: 0.017

## 
##  Pearson's product-moment correlation
## 
## data:  df.imports_exports$exports_mean and df.imports_exports$imports_mean
## t = 0.124, df = 53, p-value = 0.9018
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.2493901  0.2810611
## sample estimates:
##        cor 
## 0.01703405

Since it’s more fun to find a correlation than not, I decided to see if population and arms imports were correlated. My rationale was that a higher population results in higher levels of economic activity and therefore a greater capacity to import arms. The following code imports ‘indicator gapminder population.xlsx’ from gapminder into R. This data is relatively complete for the years of interest, 1960 - 2010.

dfpop = readWorksheetFromFile('indicator gapminder population.xlsx', 1, useCachedValues=TRUE)
dfpop = tidyr::gather(dfpop,'year','n',2:233) # gather into columns
dfpop$X2011 = NULL # delete column X2011
names(dfpop) = c('country','year','n')
dfpop$year = substr(dfpop$year,2,233) # Get rid of the X prefix on the year
dfpop = dfpop[!is.na(dfpop$n),] # Remove rows with NA in n column
dfpop$year = as.integer(dfpop$year) # Make year an integer
dfpop = subset(dfpop,dfpop$year>=1960 & dfpop$year<=2010) # Keep only the years 1960 - 2010

Having obtained the population data, I joined it to the arms import data. I did this by collecting all combinations of year-country for imports and population, and used this as the key for joining the dataframes. I then made a new column that is a combination of country and year.

dfpop$countryyear = paste(dfpop$country,as.character(dfpop$year),sep="-")
dfimports$countryyear = paste(dfimports$country,as.character(dfimports$year),sep="-")
df.imports_pop = inner_join(dfimports,dfpop,by='countryyear')

I next plotted arms imports vs. population for all countries, and made a second version of the plot with transparency and jitter to determine if there was any structure in the denser region near the origin. The correlation coefficient was only 0.33, however there does seem to be some structure in some areas of the graph.

ggplot(aes(y=n.x,x=n.y/10^9),data=df.imports_pop) +
  geom_point(aes(colour=country.x),show_guide=FALSE)+
  ggtitle('Imports vs. total population, all countries 1960 - 2010') +
  xlab('Population (billions)') +
  ylab('Arms imports expenditure ($US billions)')

ggplot(aes(y=n.x,x=n.y/10^9),data=df.imports_pop) +
  geom_point(alpha=1/10, position=position_jitter(h=0),color='blue') +
  coord_cartesian(ylim=c(0,0.1),xlim=c(0,0.05)) +
  ggtitle('Imports vs. total population, all countries 1960 - 2010') +
  xlab('Population (billions)') +
  ylab('Arms imports expenditure ($US billions)')

cor.test(df.imports_pop$n.x,df.imports_pop$n.y) # ANS: 0.33

## 
##  Pearson's product-moment correlation
## 
## data:  df.imports_pop$n.x and df.imports_pop$n.y
## t = 24.5652, df = 5148, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2992499 0.3481476
## sample estimates:
##       cor 
## 0.3239151

A subset of the data looks correlated at the higher population levels. The subset of data for which populations is >0.4 billion contains data for China and India, and so population and imports may be correlated for those countries. I created various subsets of data and calculated correlation coefficients. In particular, for India for the years 1960 - 1990, the correlation coefficient is large (0.91).

imports_pop_subset = subset(df.imports_pop, n.y>4e8) # From this it can be seen that only China and India have population >0.4 billion

chinasubset = subset(df.imports_pop, country.x=='China')
indiasubset = subset(df.imports_pop, country.x=='India')
indiasubset2 = subset(indiasubset,year.x<1990)
cor.test(chinasubset$n.x,chinasubset$n.y) # ANS: 0.544

## 
##  Pearson's product-moment correlation
## 
## data:  chinasubset$n.x and chinasubset$n.y
## t = 4.5481, df = 49, p-value = 3.572e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3168098 0.7133097
## sample estimates:
##      cor 
## 0.544826

cor.test(indiasubset$n.x,indiasubset$n.y) # ANS: 0.269

## 
##  Pearson's product-moment correlation
## 
## data:  indiasubset$n.x and indiasubset$n.y
## t = 1.9565, df = 49, p-value = 0.05611
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.006907956  0.507149532
## sample estimates:
##       cor 
## 0.2691882

cor.test(indiasubset2$n.x,indiasubset2$n.y) # ANS: 0.91

## 
##  Pearson's product-moment correlation
## 
## data:  indiasubset2$n.x and indiasubset2$n.y
## t = 11.6137, df = 28, p-value = 3.211e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8178561 0.9566379
## sample estimates:
##       cor 
## 0.9099964

I plotted arms imports vs. population for India for the years 1960 - 1990, along with a line-of-best-fit. There was a large drop in imports at around the year 1990, which accounts for the low correlation coefficient for the complete range of years. The population of India, of course, does not drop so dramatically in that year. This example illustrates that a correlation can be found if the data is subset sufficiently!

ggplot(aes(x=n.y/10^9,y=n.x),data=subset(indiasubset,year.x<1990)) +
  geom_point() +
  coord_cartesian(ylim=c(0,5)) +
  ggtitle('Imports vs. total population for India, 1960 - 1990') +
  xlab('Population (billions)') +
  ylab('Arms imports ($US billion)') +
  geom_smooth(method='lm',color='red')

ggplot(aes(year.x,n.x),data=indiasubset) +
  geom_line() +
  coord_cartesian(ylim=c(0,5)) +
  ggtitle('Arms imports, India, 1960 - 2010') +
  ylab('Arms imports ($US billion)') +
  xlab('Year')

Problem set 4 - Udacity Data Analysis in R

Christopher Kaalund

4 March 2015