## Attaching package: 'reshape'
##
## The following object is masked from 'package:plyr':
##
## rename, round_any
Load the data and drop the continent Oceania, since it has only 2 countries.
This code comes from homework 3: gao-wen source | report
She first made a generalized function called print_table. This makes the code compact, although one improvement would be an option to round certain columns as desired.
print_table = function(df) {
print(xtable(df), type = "html", include.rownames = FALSE)
}
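A rounding option along these lines might look like the sketch below. It relies on the digits argument of xtable, which takes one entry for the row names plus one per column. The names make_digits and print_table_rounded, and the toy data frame, are hypothetical and not part of the original homework.

```r
# Hypothetical helper: build the digits vector that xtable expects
# (first entry is for the row names, then one entry per column).
make_digits <- function(df, digits = 2, overrides = numeric(0)) {
  d <- ifelse(sapply(df, is.numeric), digits, 0)  # non-numeric columns get 0
  if (length(overrides) > 0) d[names(overrides)] <- overrides
  c(0, unname(d))
}

# Hypothetical variant of print_table with per-column rounding.
print_table_rounded <- function(df, digits = 2, overrides = numeric(0)) {
  tab <- xtable::xtable(df, digits = make_digits(df, digits, overrides))
  print(tab, type = "html", include.rownames = FALSE)
}

# Toy data for illustration only
toy <- data.frame(country = c("Rwanda", "Cambodia"),
                  Max.Residual = c(17.31, 15.69),
                  Year = c(1992, 1977))
toyDigits <- make_digits(toy, overrides = c(Year = 0))

# Only print if xtable is installed
if (requireNamespace("xtable", quietly = TRUE)) {
  print_table_rounded(toy, overrides = c(Year = 0))
}
```

The design choice here is to default every numeric column to the same number of digits and let the caller override individual columns by name, so integer-like columns such as a year can be shown without decimals.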
She sometimes makes the code uber compact (2 lines!).
foo for objects (like data.frames) that are not going to be used later. In fact the original source is actually much more readable than the RPubs version.
trimFrac <- 0.15
print_table(
ddply(gDat, ~year, summarize,
trimmedMean = mean(lifeExp, trim=trimFrac)))
print_table(ddply(gDat, ~year, summarize, trimmedMean = mean(lifeExp, trim = trimFrac)))
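For readers unfamiliar with the trim argument used above, here is a small sketch of what it does, using made-up numbers rather than the Gapminder data. mean() sorts the values and drops the given fraction of observations from each end before averaging.

```r
# Toy values (not from gDat): one extreme observation
x <- c(1, 2, 3, 4, 100)

plain   <- mean(x)              # uses all five values
trimmed <- mean(x, trim = 0.2)  # drops the lowest and highest 20% (one value each)

# The trimmed mean equals the mean of the middle three values
middle <- mean(c(2, 3, 4))
```

With trimFrac set once at the top, as in the code above, the amount of trimming can be changed in a single place.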
Plotting this data shows that the trimmed mean life expectancy is increasing, but at a slower and slower rate.
trimFrac <- 0.15
df <- ddply(gDat, ~year, summarize, meanLifeExp = mean(lifeExp, trim = trimFrac))
xyplot(meanLifeExp ~ year, df, type = c("a", "p"))
These questions use just a subset of the data from year 2007. Why the year 2007? This is the final year for each country. Perhaps it would be better to have a year variable set to 2007 instead of hardcoding it, since it is referenced in multiple places. If there were just one variable controlling it, then it could be changed easily by modifying it in one spot.
gdp_stat <- function(gDatCntnt) {
minGDP <- min(gDatCntnt$gdpPercap)
maxGDP <- max(gDatCntnt$gdpPercap)
return(data.frame(stat.type = c("min", "max"), GDP.per.cap = c(minGDP, maxGDP)))
}
gdpByContinentTall <- ddply(subset(gDat, year == 2007), ~continent, gdp_stat)
gdpByContinentTall <- arrange(gdpByContinentTall, GDP.per.cap)
One quick fix is to get the year from the data itself and then replace all references to 2007 with finalYear:
(finalYear = max(gDat$year))
## [1] 2007
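As a sketch of how finalYear could replace the hardcoded 2007, the example below reruns the min/max summary on a small made-up data frame. The toy values are invented, and base R's split/lapply stands in for ddply so the block runs without plyr; that substitution is an assumption made for portability, not the original author's method.

```r
# Same idea as gdp_stat above: min and max GDP per capita in tall form
gdp_stat <- function(gDatCntnt) {
  data.frame(stat.type = c("min", "max"),
             GDP.per.cap = c(min(gDatCntnt$gdpPercap), max(gDatCntnt$gdpPercap)))
}

# Toy data for illustration only (not the Gapminder values)
toy <- data.frame(
  continent = c("Asia", "Asia", "Asia", "Europe", "Europe", "Europe"),
  year      = c(2002, 2007, 2007, 2002, 2007, 2007),
  gdpPercap = c(900, 1000, 30000, 5000, 6000, 45000)
)

finalYear <- max(toy$year)                # taken from the data, not hardcoded
latest <- subset(toy, year == finalYear)  # the only place the year is used

# Base-R stand-in for ddply(latest, ~continent, gdp_stat)
pieces <- lapply(split(latest, latest$continent), gdp_stat)
gdpTall <- do.call(rbind, pieces)
gdpTall$continent <- rep(names(pieces), each = 2)
gdpTall <- gdpTall[order(gdpTall$GDP.per.cap), ]
```

If the data were later extended with a newer year, only the data would change; the subsetting code would pick up the new final year automatically.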
The above results tell me something about the ordering between continents (omitting Oceania).
The tall format has max or min in one column and the value in another. This makes it very easy to plot on the same graph since we can set groups = stat.type.
gdpByContYear <- ddply(gDat, ~continent + year, gdp_stat)
xyplot(GDP.per.cap ~ year | continent,
gdpByContYear,
groups = stat.type, auto.key=TRUE, type=c("a","p")
)
There is not as much information as a box plot or violin plot, but the presentation is perhaps cleaner. But what are we really interested in? We want to compare the different continents, right? I don't really like looking back and forth between different panels, so we can swap the conditioning argument (|) and groups: | A, groups = B becomes | B, groups = A.
xyplot(GDP.per.cap ~ year | stat.type, data = gdpByContYear,
groups = continent, auto.key=TRUE, type=c("a","p"), scale="free"
)
Note the scale="free" argument, which lets each panel set its own axis scales automatically.
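The swap between the conditioning variable and groups can be tried on a small made-up data set, as in the sketch below. The toy values are invented for illustration, and the argument is spelled out as scales, the full name of the lattice argument.

```r
library(lattice)

# Toy tall data: year varies fastest, then continent, then stat.type
toy <- expand.grid(year = c(1997, 2002, 2007),
                   continent = c("Asia", "Europe"),
                   stat.type = c("min", "max"))
toy$GDP.per.cap <- c(300, 350, 400,         # Asia, min
                     5000, 6000, 7000,      # Europe, min
                     30000, 35000, 40000,   # Asia, max
                     40000, 45000, 50000)   # Europe, max

# One panel per continent, one line per statistic
byContinent <- xyplot(GDP.per.cap ~ year | continent, data = toy,
                      groups = stat.type, auto.key = TRUE, type = c("a", "p"))

# Swapped: one panel per statistic, one line per continent
byStat <- xyplot(GDP.per.cap ~ year | stat.type, data = toy,
                 groups = continent, auto.key = TRUE, type = c("a", "p"),
                 scales = "free")

# In an interactive session, print(byContinent) and print(byStat) draw the plots
```

Because the min and max values differ by orders of magnitude, free scales keep the min panel from being flattened against the axis.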
We can find “outlying” countries by looking for countries that have a large residual. The approach uses ddply: max_residual is a function that we feed into ddply without the summarize function. It is customized to give back what we want, in this case the maximum residual and the year.
max_residual <- function(cDat){
clm <- lm(lifeExp ~ I(year-min(cDat$year)), data=cDat)
largestResidual <- abs(clm$residuals)
maxRsdl <- max(largestResidual)
return(c(Max.Residual=maxRsdl,
Year=cDat[which.max(largestResidual),'year']))
}
interesting <- ddply(gDat, ~country, max_residual)
interesting <- arrange(interesting, abs(Max.Residual), decreasing = TRUE)
print_table(head(interesting))
| country | Max.Residual | Year |
|---|---|---|
| Rwanda | 17.31 | 1992.00 |
| Cambodia | 15.69 | 1977.00 |
| Swaziland | 12.00 | 2007.00 |
| Zimbabwe | 10.58 | 2002.00 |
| Lesotho | 10.04 | 2007.00 |
| Botswana | 9.33 | 2002.00 |
At the end we checked the output with head. Here we see an issue with the print_table function: it has .00 after each year, even though years are always integers. What we need instead is to use the digits argument in xtable. This could be packaged up into print_table easily since it works well with data.frames (there aren't issues with rounding factors).
numCountries <- 11 # number of interesting countries
rounded <- xtable(head(interesting, numCountries), digits = c(0, 0, 2, 0)) # just use 0 for row number and country
print(rounded, type = "html", include.rownames = FALSE)
| country | Max.Residual | Year |
|---|---|---|
| Rwanda | 17.31 | 1992 |
| Cambodia | 15.69 | 1977 |
| Swaziland | 12.00 | 2007 |
| Zimbabwe | 10.58 | 2002 |
| Lesotho | 10.04 | 2007 |
| Botswana | 9.33 | 2002 |
| South Africa | 9.31 | 2007 |
| China | 8.00 | 1962 |
| Namibia | 7.21 | 2002 |
| Gabon | 6.77 | 2007 |
| Iraq | 6.70 | 1987 |
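The residual-ranking step can be checked on made-up data where the answer is known in advance. The sketch below builds two hypothetical countries, each a straight line with one dip, and uses base R's split/lapply as a stand-in for ddply (an assumption made so the block runs without plyr).

```r
# Largest absolute residual from a per-country linear fit, plus its year
max_residual <- function(cDat) {
  fit <- lm(lifeExp ~ I(year - min(year)), data = cDat)
  absRes <- abs(residuals(fit))
  data.frame(Max.Residual = max(absRes),
             Year = cDat$year[which.max(absRes)])
}

# Toy data: country A dips by 15 years in 1992, country B by 5 years in 1977
years <- seq(1952, 2007, by = 5)
toy <- rbind(
  data.frame(country = "A", year = years,
             lifeExp = 40 + 0.3 * (years - 1952) - 15 * (years == 1992)),
  data.frame(country = "B", year = years,
             lifeExp = 50 + 0.2 * (years - 1952) - 5 * (years == 1977))
)

# Base-R stand-in for ddply(toy, ~country, max_residual)
pieces <- lapply(split(toy, toy$country), max_residual)
ranked <- do.call(rbind, pieces)
ranked$country <- names(pieces)
ranked <- ranked[order(ranked$Max.Residual, decreasing = TRUE), ]
```

Here the function flags the dip years themselves. The maximum residual is smaller than the size of the dip because the fitted line is pulled toward the outlying point.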
We used the residuals to find countries and their associated years. We can see the big residuals if we plot lifeExp vs. year with a fitted line: just look for the data points that are far from the line.
interesting <- arrange(interesting, abs(Max.Residual), decreasing = TRUE)
xyplot(lifeExp ~ year | country, gDat,
subset = country %in% head(interesting,numCountries)$country,
type = c("p", "r"))
As we can see with Iraq, the largest residual is in 1987, which isn't some dramatic year but the year before a change in trend. There are so many wars in Iraq that it is difficult to relate historical events to the data.
This answers the question "Find countries with sudden, substantial departures from the temporal trend in one of the quantitative measures." Why does the life expectancy decrease so dramatically? Two striking examples were Cambodia and Rwanda, which had clear periods of conflict in their history (see Khmer Rouge rule of Cambodia in the 1970s and the Rwandan Genocide in the 1990s). But the life expectancy doesn't just reflect an overall loss of life; it reflects changes in the average age of the population. Roughly speaking, the life expectancy changes when on average more young people are dying than old people, or vice versa. So I decided to look into the relation between population and life expectancy. Obviously I need to somehow normalize, since population is on the order of 10^6, while life expectancy tends to fluctuate in the range 20-100.
wide <- ddply(subset(gDat, subset = country %in% head(interesting,numCountries)$country),
~country, summarize,
normPop=scale(pop, center=FALSE),
normLifeExp=scale(lifeExp, center=FALSE),
year=year
)
tall <- melt(wide, id=c('country','year'))
xyplot(data=tall,
value ~ year | country,
group = variable, auto.key=TRUE
)
I think the take-home message here is that life expectancy is a much more sensitive variable than overall population for visualizing major shifts in the population. There are only a few cases where the overall population decreases over a 5-year interval (Lesotho, Rwanda, Cambodia), and at these time points the life expectancy has a more defined negative trend.
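To make the normalization step concrete, the sketch below shows what scale(x, center = FALSE) does to a single country's series: it divides by the root mean square, sqrt(sum(x^2) / (n - 1)), so population and life expectancy end up on comparable scales. The numbers are made up for illustration and the helper rms is hypothetical.

```r
# Toy series for one hypothetical country (not Gapminder values)
pop     <- c(2.5e6, 3.0e6, 2.8e6, 3.6e6)
lifeExp <- c(40, 44, 30, 46)

# scale() returns a one-column matrix; as.numeric() flattens it
normPop     <- as.numeric(scale(pop, center = FALSE))
normLifeExp <- as.numeric(scale(lifeExp, center = FALSE))

# Hypothetical helper: the root mean square that scale() divides by
rms <- function(x) sqrt(sum(x^2) / (length(x) - 1))

samePop  <- all.equal(normPop, pop / rms(pop))
sameLife <- all.equal(normLifeExp, lifeExp / rms(lifeExp))
```

Because the series are not centered, the normalized values stay positive and keep their shape, which is what lets a drop in either variable remain visible when both are drawn in the same panel.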