Simple Exploratory Data Analytics on Gapminder Data Set (II)

Data import from URL

gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)

Load the lattice, plyr, and xtable packages.

library(lattice)
library(plyr)
library(xtable)

QUESTION(1) Examine “typical” life expectancy for different years.

First, compute a trimmed mean (trim = 0.2) of life expectancy for each year

Trim <- 0.2
Trimmed_meanlifeExp <- ddply(gDat, ~ year, summarize, tMean = mean(lifeExp, trim = Trim))
Trimmed_meanlifeExp <- arrange(Trimmed_meanlifeExp, tMean)
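
To make the effect of trim = 0.2 concrete, here is a minimal base R sketch. The vector x holds made-up values (not Gapminder data) with one extreme observation; with ten values, trim = 0.2 drops the two smallest and two largest before averaging.

```r
# Toy vector (made-up values, not Gapminder data) with one extreme observation.
x <- c(40, 50, 52, 54, 56, 58, 60, 62, 64, 120)

plain_mean   <- mean(x)              # uses all 10 values
trimmed_mean <- mean(x, trim = 0.2)  # drops floor(10 * 0.2) = 2 values from each end

# Equivalent manual computation: sort, drop 2 values from each end, average the rest.
n_drop <- floor(length(x) * 0.2)
manual <- mean(sort(x)[(n_drop + 1):(length(x) - n_drop)])

print(c(plain = plain_mean, trimmed = trimmed_mean, manual = manual))
```

The trimmed mean is less sensitive to the extreme value 120 than the plain mean, which is why it is a reasonable choice for a "typical" value.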

Then show it in a table using JB’s table print function

htmlPrint <- function(x, ...,digits = 0, include.rownames = FALSE) {
   print(xtable(x, digits = digits, ...), type = 'html',
         include.rownames = include.rownames, ...)
   }
htmlPrint(Trimmed_meanlifeExp)
year tMean
1952 48
1957 51
1962 53
1967 56
1972 58
1977 60
1982 62
1987 64
1992 66
1997 67
2002 68
2007 69

Then plot it with xyplot

xyplot(tMean~year,Trimmed_meanlifeExp)

QUESTION(2) How is life expectancy changing over time on different continents?

First, compute the mean life expectancy for each continent in each year and store it in a data frame.

LifeExp_cont <- ddply(gDat,~year+continent,summarize,avglifeExp=mean(lifeExp))
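
For readers without plyr, the same kind of grouped mean can be sketched in base R with aggregate(). This is an alternative, not the method used above; the data frame toy and its values are made up for illustration.

```r
# Toy data (made-up values): two continents, two years.
toy <- data.frame(
  year      = c(1952, 1952, 1952, 2007, 2007, 2007),
  continent = c("Africa", "Africa", "Asia", "Africa", "Africa", "Asia"),
  lifeExp   = c(40, 44, 50, 52, 56, 70)
)

# One row per year/continent combination, holding the mean of lifeExp.
avg_by_group <- aggregate(lifeExp ~ year + continent, data = toy, FUN = mean)
names(avg_by_group)[3] <- "avglifeExp"  # match the column name used above

print(avg_by_group)
```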

Then plot it with stripplot

stripplot(avglifeExp ~ year, LifeExp_cont, groups = continent, auto.key = TRUE, grid = "h", jitter.data = TRUE)

Overall, life expectancy increases over time on every continent, although Africa shows a slight drop around 2000.

QUESTION(3) Depict the maximum and minimum of GDP per capita for all continents, using the year 2007.

Get the subset of the data for 2007

subset2007 <- subset(gDat, year == 2007)

Find the maximum and minimum GDP per capita for each continent in 2007

GDP_cont <- ddply(subset2007, ~ continent, summarize, maxGDP = max(gdpPercap), minGDP = min(gdpPercap))
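
As a rough check on what the per-continent summary produces, here is a base R sketch of the same split-and-summarize idea. The data frame toy2007 and its values are hypothetical, and split() with sapply() is a swapped-in alternative to ddply().

```r
# Toy one-year data (made-up values): five countries on two continents.
toy2007 <- data.frame(
  country   = c("A", "B", "C", "D", "E"),
  continent = c("Africa", "Africa", "Asia", "Asia", "Asia"),
  gdpPercap = c(300, 12000, 900, 40000, 5000)
)

# Split GDP per capita by continent, then take the max and min of each piece.
per_continent <- split(toy2007$gdpPercap, toy2007$continent)
GDP_toy <- data.frame(
  continent = names(per_continent),
  maxGDP    = sapply(per_continent, max),
  minGDP    = sapply(per_continent, min),
  row.names = NULL
)

print(GDP_toy)
```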

Plot max GDP with stripplot

stripplot(maxGDP~continent,GDP_cont,grid="h", jitter.data = TRUE)

Plot min GDP with stripplot

stripplot(minGDP~continent,GDP_cont,grid="h", jitter.data = TRUE)

QUESTION(4) Look at the spread of GDP per capita within the continents.

bwplot(gdpPercap~as.factor(year) | continent,gDat)
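
A numeric summary can complement the box plots. The sketch below computes the median and interquartile range per continent in base R; the data frame toy and its values are made up, and this summary is an addition rather than part of the analysis above.

```r
# Toy data (made-up values): five observations per continent.
toy <- data.frame(
  continent = rep(c("Africa", "Europe"), each = 5),
  gdpPercap = c(500, 1000, 1500, 2000, 2500,
                5000, 10000, 15000, 20000, 50000)
)

# Median and interquartile range of GDP per capita within each continent.
pieces <- split(toy$gdpPercap, toy$continent)
spread <- data.frame(
  continent = names(pieces),
  median    = sapply(pieces, median),
  iqr       = sapply(pieces, IQR),
  row.names = NULL
)

print(spread)
```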

QUESTION(5) Depict the number and/or proportion of countries with low life expectancy over time by continent.

Define low life expectancy as 60 years or less

lowlifeExp <- 60

Compute proportion of countries with low life expectancy over time by continent using JB’s code

ProplowlifeExp <- ddply(gDat, ~ continent + year, function(x) c(
     lowlifeExp = sum(x$lifeExp <= lowlifeExp)/nrow(x)) )
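
The proportion computed above is the count of countries at or below the threshold divided by the number of countries in the group, which equals the mean of the logical comparison. A minimal base R sketch on made-up data (toy, threshold, and isLow are hypothetical names):

```r
# Toy data (made-up values): a few countries per continent and year.
toy <- data.frame(
  continent = c("Africa", "Africa", "Africa", "Asia", "Asia",
                "Africa", "Africa", "Africa", "Asia", "Asia"),
  year      = rep(c(1952, 2007), each = 5),
  lifeExp   = c(40, 45, 62, 55, 65,
                50, 61, 70, 72, 80)
)
threshold <- 60

# TRUE where life expectancy is at or below the threshold.
toy$isLow <- toy$lifeExp <= threshold

# The mean of a logical vector is the proportion of TRUE values.
prop_low <- aggregate(isLow ~ continent + year, data = toy, FUN = mean)

print(prop_low)
```

The result is a proportion between 0 and 1; multiply by 100 to express it as a percentage.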

Plot it with stripplot, converting the proportion to a percentage so that the y-axis runs from 0% to 100%

stripplot(100 * lowlifeExp ~ year, ProplowlifeExp, groups = continent, auto.key = TRUE, ylim = c(0, 100), ylab = "Percentage of countries with low life expectancy")

QUESTION(6) Find countries with sudden, substantial departures from the temporal trend in life expectancy (lifeExp)

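My approach is to fit a separate linear model of lifeExp on year for each country, then pick the country with the largest maxResid, defined as the largest absolute residual divided by the residual standard error (sigma) of the fit.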

Fit a linear model using JB’s code with some modification

yearMin <- min(gDat$year)
jFun <- function(x) {
   jFit <- lm(lifeExp ~ I(year - yearMin), x)
   jCoef <- coef(jFit)
   names(jCoef) <- NULL
   return(c(intercept = jCoef[1],
            slope = jCoef[2],
            maxResid = max(abs(resid(jFit)))/summary(jFit)$sigma))
  }
linearModel <- ddply(gDat, ~ country, jFun)
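
To illustrate why the scaled maximum residual flags a sudden departure, here is a small base R sketch on synthetic data. The countries "Steady" and "Shock", the helper scaled_max_resid, and all values are made up; "Shock" follows the same linear trend as "Steady" except for a 15-year drop in 1992.

```r
# Synthetic data: both countries share one linear trend with a small wiggle.
years  <- seq(1952, 2007, by = 5)
wiggle <- rep(c(0.5, -0.5), length.out = length(years))
steady <- 50 + 0.3 * (years - 1952) + wiggle

# "Shock" is identical except for one sudden drop in 1992.
shock <- steady
shock[years == 1992] <- shock[years == 1992] - 15

toy <- data.frame(
  country = rep(c("Steady", "Shock"), each = length(years)),
  year    = rep(years, times = 2),
  lifeExp = c(steady, shock)
)

# Largest absolute residual, scaled by the residual standard error of the fit.
scaled_max_resid <- function(d) {
  fit <- lm(lifeExp ~ I(year - min(d$year)), data = d)
  max(abs(resid(fit))) / summary(fit)$sigma
}

# Apply the helper to each country's rows and report the most extreme one.
res <- sapply(split(toy, toy$country), scaled_max_resid)
flagged <- names(res)[which.max(res)]
print(res)
print(flagged)
```

Because the residual is divided by sigma, the measure is relative to each country's own scatter around its trend line, so a single large deviation stands out even when the overall fit is noisy.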

Then find the country with the largest maxResid and plot its data

country_interest <- linearModel[which.max(linearModel$maxResid),]
xyplot(lifeExp ~ year , gDat, subset = country %in% country_interest$country, type = c("p", "r"))