Once again I will be exploring the gapminder data, but this time using plyr data aggregation techniques and without the aid of graphs!
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
library(plyr)
library(xtable)
contMinMaxGdp <- ddply(gDat, ~continent, summarize,
                       minGdpPercap = min(gdpPercap),
                       maxGdpPercap = max(gdpPercap))
(contMinMaxGdp <- arrange(contMinMaxGdp, minGdpPercap))
## continent minGdpPercap maxGdpPercap
## 1 Africa 241.2 21951
## 2 Asia 331.0 113523
## 3 Europe 973.5 49357
## 4 Americas 1201.6 42952
## 5 Oceania 10039.6 34435
contMinMaxGdpXT <- xtable(contMinMaxGdp)
print(contMinMaxGdpXT, type = "html", include.rownames = FALSE)
| continent | minGdpPercap | maxGdpPercap |
|---|---|---|
| Africa | 241.17 | 21951.21 |
| Asia | 331.00 | 113523.13 |
| Europe | 973.53 | 49357.19 |
| Americas | 1201.64 | 42951.65 |
| Oceania | 10039.60 | 34435.37 |
By arranging contMinMaxGdp by minGdpPercap, we find that Africa has the lowest minimum gdpPercap value and Oceania has the highest. Comparing the minGdpPercap ranking with the maxGdpPercap ranking, we find:
- Africa has both the lowest minGdpPercap and the lowest maxGdpPercap value
- Asia has the greatest range in gdpPercap values (about 113,192), followed by Europe (about 48,384)
- Africa has the smallest range in gdpPercap values (about 21,710), followed by Oceania (about 24,396)
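The ranges quoted above are just max minus min. As a rough sketch, they can be recomputed from the numbers in the table; the data frame below is typed in by hand from that table (rounded to two decimals) so the snippet runs on its own, and `contRangeGdp` and `rangeGdpPercap` are names I made up for illustration.

```r
library(plyr)

# Values copied from the contMinMaxGdp table above (rounded to two decimals)
contMinMaxGdp <- data.frame(
  continent = c("Africa", "Asia", "Europe", "Americas", "Oceania"),
  minGdpPercap = c(241.17, 331.00, 973.53, 1201.64, 10039.60),
  maxGdpPercap = c(21951.21, 113523.13, 49357.19, 42951.65, 34435.37),
  stringsAsFactors = FALSE
)

# Add the range as its own column, then sort from widest to narrowest
contRangeGdp <- mutate(contMinMaxGdp,
                       rangeGdpPercap = maxGdpPercap - minGdpPercap)
contRangeGdp <- arrange(contRangeGdp, desc(rangeGdpPercap))
contRangeGdp
```

With the real data, the same `mutate()` and `arrange()` calls should work directly on the `contMinMaxGdp` object produced by `ddply()`.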
Follow-up questions: which countries are contributing to the range in gdpPercap values in Asia and Europe? Are there outliers? If so, are the same countries outliers over time?
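One possible way to start on the first follow-up question is to pull out the rows that attain the min and max within each continent. This is only a sketch: `toyDat` is a made-up stand-in with the same column names as `gDat` (the values are not Gapminder data), and `extremeRows` is a hypothetical helper, not something from the course.

```r
library(plyr)

# Toy stand-in for gDat: same column names, made-up values
toyDat <- data.frame(
  country = c("A", "A", "B", "B", "C", "C"),
  continent = c("X", "X", "X", "X", "Y", "Y"),
  year = c(1952, 2007, 1952, 2007, 1952, 2007),
  gdpPercap = c(500, 900, 300, 12000, 4000, 7000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for one continent's rows, keep the row with the
# smallest gdpPercap followed by the row with the largest
extremeRows <- function(x) {
  x[c(which.min(x$gdpPercap), which.max(x$gdpPercap)),
    c("country", "year", "gdpPercap")]
}

# Apply the helper by continent
contExtremes <- ddply(toyDat, ~continent, extremeRows)
contExtremes
```

Swapping `toyDat` for `gDat` should show which country and year sit at each extreme, which is a first hint at whether a single country drives the range.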
# Write a function that returns the min and max as a data frame in tall format
minmax <- function(x) {
  ## Character vector of labels for the two statistics
  factor <- c("Min", "Max")
  ## Compute min and max (same order as the labels above)
  value <- c(min(x$gdpPercap), max(x$gdpPercap))
  ## Combine labels and values as two columns of a data frame
  data.frame(factor, value)
}
# Use ddply to apply the function by continent
contMinMaxGdpTall <- ddply(gDat, ~continent, minmax)
contMinMaxGdpTallXT <- xtable(contMinMaxGdpTall)
print(contMinMaxGdpTallXT, type = "html", include.rownames = FALSE)
| continent | factor | value |
|---|---|---|
| Africa | Min | 241.17 |
| Africa | Max | 21951.21 |
| Americas | Min | 1201.64 |
| Americas | Max | 42951.65 |
| Asia | Min | 331.00 |
| Asia | Max | 113523.13 |
| Europe | Min | 973.53 |
| Europe | Max | 49357.19 |
| Oceania | Min | 10039.60 |
| Oceania | Max | 34435.37 |
contSpreadGdp <- ddply(gDat, ~continent, summarize,
                       meanGdpPercap = mean(gdpPercap),
                       sdGdpPercap = sd(gdpPercap),
                       medianGdpPercap = median(gdpPercap),
                       madGdpPercap = mad(gdpPercap),
                       iqrGdpPercap = IQR(gdpPercap))
(contSpreadGdp <- arrange(contSpreadGdp, sdGdpPercap)) # order table by standard deviation
## continent meanGdpPercap sdGdpPercap medianGdpPercap madGdpPercap
## 1 Africa 2194 2828 1192 775.3
## 2 Oceania 18622 6359 17983 6459.1
## 3 Americas 7136 6397 5466 3269.3
## 4 Europe 14469 9355 12082 8846.1
## 5 Asia 7902 14045 2647 2820.8
## iqrGdpPercap
## 1 1616
## 2 8072
## 3 4402
## 4 13248
## 5 7492
contSpreadGdpXT <- xtable(contSpreadGdp)
print(contSpreadGdpXT, type = "html", include.rownames = FALSE)
| continent | meanGdpPercap | sdGdpPercap | medianGdpPercap | madGdpPercap | iqrGdpPercap |
|---|---|---|---|---|---|
| Africa | 2193.75 | 2827.93 | 1192.14 | 775.32 | 1616.17 |
| Oceania | 18621.61 | 6358.98 | 17983.30 | 6459.10 | 8072.26 |
| Americas | 7136.11 | 6396.76 | 5465.51 | 3269.33 | 4402.43 |
| Europe | 14469.48 | 9355.21 | 12081.75 | 8846.05 | 13248.30 |
| Asia | 7902.15 | 14045.37 | 2646.79 | 2820.83 | 7492.26 |
First look at the mean together with the standard deviation:
- Africa has the smallest mean and the smallest spread
- Oceania has the highest mean and a relatively small dispersion about the mean (the second-smallest standard deviation)
- Asia has the largest spread about the mean
Now look at the remaining summary statistics (median, MAD and IQR), which are robust to outliers:
- It becomes clear that gdpPercap for Asia has extreme outliers. The max gdpPercap value for Asia is 113523.13, whereas the median is only 2646.79, and the standard deviation (14045.37) is about five times the MAD (2820.83)
- Whilst Europe has the second-largest standard deviation, it also has the largest IQR and MAD. Its MAD (8846.05) is close to its standard deviation (9355.21), which suggests Europe's spread reflects broad variation across its values rather than a few extreme outliers
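To make the comparison in the bullets above concrete, here is a small sketch that divides the standard deviation by the MAD for each continent. By default R's `mad()` is scaled by the constant 1.4826, so it is roughly comparable to the standard deviation for normally distributed data; reading a ratio far above 1 as a sign of outliers or heavy skew is my own rule of thumb, not a formal test. The numbers are typed in from the table above, and `sdOverMad` is a name I made up.

```r
# Values copied from the contSpreadGdp table above
contSpread <- data.frame(
  continent = c("Africa", "Oceania", "Americas", "Europe", "Asia"),
  sdGdpPercap = c(2827.93, 6358.98, 6396.76, 9355.21, 14045.37),
  madGdpPercap = c(775.32, 6459.10, 3269.33, 8846.05, 2820.83),
  stringsAsFactors = FALSE
)

# Ratio of standard deviation to MAD; values near 1 mean the two
# measures of spread roughly agree
contSpread$sdOverMad <- round(contSpread$sdGdpPercap / contSpread$madGdpPercap, 2)

# Sort from smallest to largest ratio
contSpread[order(contSpread$sdOverMad), ]
```

Europe and Oceania come out close to 1, while Asia has by far the largest ratio, which is in line with the reading of the table given above.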
Pain and suffering
I am absolutely hopeless at writing my own functions. I seriously struggled… Looking forward to learning more about writing functions in class!
I couldn't figure out how to make “tall” format using plyr without reshaping functions… Had to seek help from fellow R users for the code above… Initial notes to myself were on the right track though: Do you need to write your own function, where min and max are factors?
I couldn't figure out a way to use “arrange” within my own function. Is this possible? For now I had to do the sorting in a separate line of code.
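A possible answer, as far as I can tell: a function passed to `ddply()` only ever sees one continent's rows at a time, so calling `arrange()` inside it cannot reorder the continents themselves. What does seem to work is wrapping both steps in one outer function. The sketch below uses a made-up toy data frame and a hypothetical wrapper called `minMaxByContinent`, so treat it as an illustration rather than the official way.

```r
library(plyr)

# Toy stand-in for gDat: same column names, made-up values
toyDat <- data.frame(
  continent = c("X", "X", "Y", "Y", "Z", "Z"),
  gdpPercap = c(5000, 9000, 300, 12000, 4000, 7000),
  stringsAsFactors = FALSE
)

# Hypothetical wrapper: summarise by continent, then sort the result,
# so that callers get the arranged table from a single function call
minMaxByContinent <- function(dat) {
  out <- ddply(dat, ~continent, summarize,
               minGdpPercap = min(gdpPercap),
               maxGdpPercap = max(gdpPercap))
  arrange(out, minGdpPercap)
}

minMaxByContinent(toyDat)
```

Called on `gDat`, this should return the same table as the two-step version earlier in the post.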
Had to mess around with CSS a little, since I'm working on Windows :( I finally figured out how to change from the default using .Rprofile, all thanks to Jenny :)