STAT545A hw03: Data aggregation

Once again I will be exploring the gapminder data, but this time using plyr data aggregation techniques and without the aid of graphs!

gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
library(plyr)
library(xtable)
contMinMaxGdp <- ddply(gDat, ~continent, summarize, minGdpPercap = min(gdpPercap), 
    maxGdpPercap = max(gdpPercap))
(contMinMaxGdp <- arrange(contMinMaxGdp, minGdpPercap))
##   continent minGdpPercap maxGdpPercap
## 1    Africa        241.2        21951
## 2      Asia        331.0       113523
## 3    Europe        973.5        49357
## 4  Americas       1201.6        42952
## 5   Oceania      10039.6        34435
contMinMaxGdpXT <- xtable(contMinMaxGdp)
print(contMinMaxGdpXT, type = "html", include.rownames = FALSE)
continent minGdpPercap maxGdpPercap
Africa 241.17 21951.21
Asia 331.00 113523.13
Europe 973.53 49357.19
Americas 1201.64 42951.65
Oceania 10039.60 34435.37

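By arranging contMinMaxGdp by minGdpPercap, we find that Africa has the lowest minimum gdpPercap (241.17) and Oceania has the highest minimum (10039.60). Comparing the minGdpPercap ranking with the maxGdpPercap ranking shows that the two orders differ: Asia has the second-lowest minimum but by far the highest maximum (113523.13), while Oceania has the highest minimum but only the second-lowest maximum (34435.37).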
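
One way to make the ranking comparison concrete is to rank both columns side by side. This is only a minimal sketch: the numbers are copied from the contMinMaxGdp table printed above so that the snippet runs on its own, and `rankCompare`, `minRank`, `maxRank`, and `gdpRange` are names introduced here for illustration.

```r
# Values copied from the contMinMaxGdp table above (no download needed)
contMinMaxGdp <- data.frame(
  continent    = c("Africa", "Asia", "Europe", "Americas", "Oceania"),
  minGdpPercap = c(241.17, 331.00, 973.53, 1201.64, 10039.60),
  maxGdpPercap = c(21951.21, 113523.13, 49357.19, 42951.65, 34435.37)
)

# Rank 1 = smallest value; gdpRange is the spread between max and min
rankCompare <- transform(contMinMaxGdp,
                         minRank  = rank(minGdpPercap),
                         maxRank  = rank(maxGdpPercap),
                         gdpRange = maxGdpPercap - minGdpPercap)

# Continents whose two ranks differ have shifted position between the rankings
rankCompare[, c("continent", "minRank", "maxRank", "gdpRange")]
```

Any continent where `minRank` and `maxRank` disagree is one whose position changes between the two rankings.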

Follow-up questions: which countries are contributing to the range in gdpPercap values in Asia and Europe? Are there outliers? If so, are the same countries outliers over time?
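
A possible first step towards the follow-up questions is to report which country and year produce each continent's minimum and maximum. The sketch below is not part of the original analysis: `toyDat` is a tiny made-up stand-in for gDat (the numbers are invented, not real Gapminder values), and `extremeCountries` and `contExtremes` are hypothetical names.

```r
library(plyr)

# Toy stand-in for gDat: invented numbers, NOT real Gapminder values
toyDat <- data.frame(
  continent = c("A", "A", "A", "B", "B", "B"),
  country   = c("a1", "a2", "a3", "b1", "b2", "b3"),
  year      = c(1952, 1952, 2007, 1952, 2007, 2007),
  gdpPercap = c(300, 1500, 90000, 1000, 20000, 45000)
)

# For one continent's rows, return the country and year behind the min and max
extremeCountries <- function(x) {
  idx <- c(which.min(x$gdpPercap), which.max(x$gdpPercap))
  data.frame(stat      = c("Min", "Max"),
             country   = x$country[idx],
             year      = x$year[idx],
             gdpPercap = x$gdpPercap[idx])
}

# Apply the function to each continent and stack the results
contExtremes <- ddply(toyDat, ~continent, extremeCountries)
contExtremes
```

Swapping `toyDat` for gDat should give the same layout with the real countries, assuming gDat has been read in as shown at the top.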

# Write own function to produce a data frame in tall format
minmax <- function(x) {
    ## Label each row as Min or Max; the column is called "factor" to match the table below
    ## Values are computed in the same order as the labels
    data.frame(factor = c("Min", "Max"),
               value = c(min(x$gdpPercap), max(x$gdpPercap)))
}
# Use ddply to apply the function by continent
contMinMaxGdpTall <- ddply(gDat, ~continent, minmax)
contMinMaxGdpTallXT <- xtable(contMinMaxGdpTall)
print(contMinMaxGdpTallXT, type = "html", include.rownames = FALSE)
continent factor value
Africa Min 241.17
Africa Max 21951.21
Americas Min 1201.64
Americas Max 42951.65
Asia Min 331.00
Asia Max 113523.13
Europe Min 973.53
Europe Max 49357.19
Oceania Min 10039.60
Oceania Max 34435.37
contSpreadGdp <- ddply(gDat, ~continent, summarize, meanGdpPercap = mean(gdpPercap), 
    sdGdpPercap = sd(gdpPercap), medianGdpPercap = median(gdpPercap), madGdpPercap = mad(gdpPercap), 
    iqrGdpPercap = IQR(gdpPercap))
(contSpreadGdp <- arrange(contSpreadGdp, sdGdpPercap))  #order table by standard deviation
##   continent meanGdpPercap sdGdpPercap medianGdpPercap madGdpPercap
## 1    Africa          2194        2828            1192        775.3
## 2   Oceania         18622        6359           17983       6459.1
## 3  Americas          7136        6397            5466       3269.3
## 4    Europe         14469        9355           12082       8846.1
## 5      Asia          7902       14045            2647       2820.8
##   iqrGdpPercap
## 1         1616
## 2         8072
## 3         4402
## 4        13248
## 5         7492
contSpreadGdpXT <- xtable(contSpreadGdp)
print(contSpreadGdpXT, type = "html", include.rownames = FALSE)
continent meanGdpPercap sdGdpPercap medianGdpPercap madGdpPercap iqrGdpPercap
Africa 2193.75 2827.93 1192.14 775.32 1616.17
Oceania 18621.61 6358.98 17983.30 6459.10 8072.26
Americas 7136.11 6396.76 5465.51 3269.33 4402.43
Europe 14469.48 9355.21 12081.75 8846.05 13248.30
Asia 7902.15 14045.37 2646.79 2820.83 7492.26

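First look at the mean together with the standard deviation: Asia has the largest standard deviation (14045.37) despite a mid-range mean (7902.15), and in both Asia and Africa the standard deviation exceeds the mean, which suggests strongly right-skewed values.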

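Now look at the remaining summary statistics (median, MAD, and IQR), which are much less sensitive to outliers: Asia's median (2646.79) is far below its mean (7902.15), consistent with a few very large values pulling the mean up, while Europe has the largest MAD (8846.05) and IQR (13248.30).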
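
To see why the median, MAD, and IQR are preferred when outliers are suspected, it can help to compare two vectors that differ in a single extreme value. This is a minimal sketch with made-up numbers; `x`, `xOut`, `spread`, and `spreadTab` are hypothetical names, not part of the analysis above.

```r
# Hypothetical toy vectors: identical except for one extreme value
x    <- c(1, 2, 3, 4, 5)
xOut <- c(1, 2, 3, 4, 500)

# The same five summaries used in contSpreadGdp
spread <- function(v) {
  c(mean = mean(v), sd = sd(v), median = median(v), mad = mad(v), iqr = IQR(v))
}

# One row per vector, one column per statistic
spreadTab <- rbind(x = spread(x), xOut = spread(xOut))
spreadTab
```

In this toy case the mean and standard deviation change a lot between the two rows, while the median, MAD, and IQR stay the same.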


Pain and suffering

  1. I am absolutely hopeless at writing my own functions. I seriously struggled… Looking forward to learning more about writing functions in class!

  2. I couldn't figure out how to make “tall” format using plyr without reshaping functions… Had to seek help from fellow R users for the code above… Initial notes to myself were on the right track though: Do you need to write your own function, where min and max are factors?

  3. I couldn't figure out a way to use “arrange” within my own function. Is this possible? For now I had to do it in a separate line of code.

  4. Had to mess around with CSS a little, since I'm working on Windows :( I finally figured out how to change from the default using .Rprofile, all thanks to Jenny :)
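
On point 3, one option that seems to work is to wrap both steps in a single function: aggregate with ddply first, then call arrange on the combined result before returning it. Calling arrange inside the function that ddply applies to each continent would only sort the rows within that piece, not the final table. The sketch below uses a made-up `toyDat` (invented numbers) and a hypothetical helper name, `minMaxSorted`.

```r
library(plyr)

# Toy stand-in for gDat: invented numbers, NOT real Gapminder values
toyDat <- data.frame(
  continent = c("B", "B", "A", "A"),
  gdpPercap = c(100, 900, 500, 700)
)

# Hypothetical helper: aggregate by continent, then sort the result, in one call
minMaxSorted <- function(dat) {
  out <- ddply(dat, ~continent, summarize,
               minGdpPercap = min(gdpPercap),
               maxGdpPercap = max(gdpPercap))
  # arrange() runs on the combined table, so the final row order is controlled here
  arrange(out, minGdpPercap)
}

sortedTab <- minMaxSorted(toyDat)
sortedTab
```

With the real data, `minMaxSorted(gDat)` should reproduce the sorted contMinMaxGdp table in one step.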