STAT 545A Homework 3

In this article, we will perform several statistical analysis on the Gapminder data dataset. Please refer to my previous articles for more information about the data.

We load the data into gDat object.

gDat <- read.delim(file = "~/R/gapminderDataFiveYear.txt", sep = "\t")

Then install the plyr package for performing data aggregation:

# install.packages('plyr', dependencies = TRUE)
require(package = plyr)
## Loading required package: plyr

We also install the xtable package whichs converts a data.frame to an HTML table.

# install.packages('xtable', dependencies = TRUE)
require(package = xtable)
## Loading required package: xtable

We perform the maximum and minimum of GDP per capita for all continents using ddply. We use ddply to operate on data frame to split it up into continent and apply summarise function from plyr package:

gdp <- ddply(.data = gDat, .(continent), .fun = summarise, MaxGdpPercap = max(gdpPercap), 
    MinGdpPercap = min(gdpPercap))
temptable <- xtable(gdp)
print(temptable, type = "html", include.rownames = TRUE)
## <!-- html table generated in R 3.0.1 by xtable 1.7-1 package -->
## <!-- Mon Sep 23 16:28:24 2013 -->
## <TABLE border=1>
## <TR> <TH>  </TH> <TH> continent </TH> <TH> MaxGdpPercap </TH> <TH> MinGdpPercap </TH>  </TR>
##   <TR> <TD align="right"> 1 </TD> <TD> Africa </TD> <TD align="right"> 21951.21 </TD> <TD align="right"> 241.17 </TD> </TR>
##   <TR> <TD align="right"> 2 </TD> <TD> Americas </TD> <TD align="right"> 42951.65 </TD> <TD align="right"> 1201.64 </TD> </TR>
##   <TR> <TD align="right"> 3 </TD> <TD> Asia </TD> <TD align="right"> 113523.13 </TD> <TD align="right"> 331.00 </TD> </TR>
##   <TR> <TD align="right"> 4 </TD> <TD> Europe </TD> <TD align="right"> 49357.19 </TD> <TD align="right"> 973.53 </TD> </TR>
##   <TR> <TD align="right"> 5 </TD> <TD> Oceania </TD> <TD align="right"> 34435.37 </TD> <TD align="right"> 10039.60 </TD> </TR>
##    </TABLE>

The table looks nice and nothing seems to be odd.

Next, we further analyzed quantitatively variations of GDP per capita over all the continents. We use standard deviation, and variance:

gdp <- ddply(.data = gDat, .(continent), .fun = summarise, variance = var(gdpPercap), 
    standardeviation = sd(gdpPercap))
temptable <- xtable(gdp)
print(temptable, type = "html", include.rownames = TRUE)
## <!-- html table generated in R 3.0.1 by xtable 1.7-1 package -->
## <!-- Mon Sep 23 16:28:25 2013 -->
## <TABLE border=1>
## <TR> <TH>  </TH> <TH> continent </TH> <TH> variance </TH> <TH> standardeviation </TH>  </TR>
##   <TR> <TD align="right"> 1 </TD> <TD> Africa </TD> <TD align="right"> 7997187.31 </TD> <TD align="right"> 2827.93 </TD> </TR>
##   <TR> <TD align="right"> 2 </TD> <TD> Americas </TD> <TD align="right"> 40918591.10 </TD> <TD align="right"> 6396.76 </TD> </TR>
##   <TR> <TD align="right"> 3 </TD> <TD> Asia </TD> <TD align="right"> 197272505.85 </TD> <TD align="right"> 14045.37 </TD> </TR>
##   <TR> <TD align="right"> 4 </TD> <TD> Europe </TD> <TD align="right"> 87520019.60 </TD> <TD align="right"> 9355.21 </TD> </TR>
##   <TR> <TD align="right"> 5 </TD> <TD> Oceania </TD> <TD align="right"> 40436668.87 </TD> <TD align="right"> 6358.98 </TD> </TR>
##    </TABLE>

The above results indicates that the dispersion of GDP per capita within african coutires are less in magnitude then others continets, such as Asia. However, we further dig to obtain information about the GDP per capita among the countries that falls under the same continent for Asia and Africa:

sDat <- subset(x = gDat, continent == c("Africa", "Asia"))
gdp <- ddply(.data = sDat, .(continent, country), .fun = summarise, variance = var(gdpPercap), 
    standardeviation = sd(gdpPercap))
temptable <- head(xtable(gdp), 10)
print(temptable, type = "html", include.rownames = TRUE)
## <!-- html table generated in R 3.0.1 by xtable 1.7-1 package -->
## <!-- Mon Sep 23 16:28:25 2013 -->
## <TABLE border=1>
## <TR> <TH>  </TH> <TH> continent </TH> <TH> country </TH> <TH> variance </TH> <TH> standardeviation </TH>  </TR>
##   <TR> <TD align="right"> 1 </TD> <TD> Africa </TD> <TD> Algeria </TD> <TD align="right"> 2006972.63 </TD> <TD align="right"> 1416.68 </TD> </TR>
##   <TR> <TD align="right"> 2 </TD> <TD> Africa </TD> <TD> Angola </TD> <TD align="right"> 1259476.49 </TD> <TD align="right"> 1122.26 </TD> </TR>
##   <TR> <TD align="right"> 3 </TD> <TD> Africa </TD> <TD> Benin </TD> <TD align="right"> 23879.64 </TD> <TD align="right"> 154.53 </TD> </TR>
##   <TR> <TD align="right"> 4 </TD> <TD> Africa </TD> <TD> Botswana </TD> <TD align="right"> 16969693.11 </TD> <TD align="right"> 4119.43 </TD> </TR>
##   <TR> <TD align="right"> 5 </TD> <TD> Africa </TD> <TD> Burkina Faso </TD> <TD align="right"> 29446.55 </TD> <TD align="right"> 171.60 </TD> </TR>
##   <TR> <TD align="right"> 6 </TD> <TD> Africa </TD> <TD> Burundi </TD> <TD align="right"> 12987.07 </TD> <TD align="right"> 113.96 </TD> </TR>
##   <TR> <TD align="right"> 7 </TD> <TD> Africa </TD> <TD> Cameroon </TD> <TD align="right"> 174875.89 </TD> <TD align="right"> 418.18 </TD> </TR>
##   <TR> <TD align="right"> 8 </TD> <TD> Africa </TD> <TD> Central African Republic </TD> <TD align="right"> 34546.81 </TD> <TD align="right"> 185.87 </TD> </TR>
##   <TR> <TD align="right"> 9 </TD> <TD> Africa </TD> <TD> Chad </TD> <TD align="right"> 37035.42 </TD> <TD align="right"> 192.45 </TD> </TR>
##   <TR> <TD align="right"> 10 </TD> <TD> Africa </TD> <TD> Comoros </TD> <TD align="right"> 100297.81 </TD> <TD align="right"> 316.70 </TD> </TR>
##    </TABLE>

The information above is too tedious to follow. We might require some summary of the results, in another article.

We might be asked to conduct the average life expectancy over the intervals [1952,2007] for each continent:

lifeExpect <- ddply(.data = gDat, .(continent, year), .fun = summarise, meanLifeExpec = mean(lifeExp), 
    medianLifeExpec = median(lifeExp))
temptable <- head(xtable(lifeExpect), 10)
print(temptable, type = "html", include.rownames = TRUE)
## <!-- html table generated in R 3.0.1 by xtable 1.7-1 package -->
## <!-- Mon Sep 23 16:28:25 2013 -->
## <TABLE border=1>
## <TR> <TH>  </TH> <TH> continent </TH> <TH> year </TH> <TH> meanLifeExpec </TH> <TH> medianLifeExpec </TH>  </TR>
##   <TR> <TD align="right"> 1 </TD> <TD> Africa </TD> <TD align="right"> 1952 </TD> <TD align="right"> 39.14 </TD> <TD align="right"> 38.83 </TD> </TR>
##   <TR> <TD align="right"> 2 </TD> <TD> Africa </TD> <TD align="right"> 1957 </TD> <TD align="right"> 41.27 </TD> <TD align="right"> 40.59 </TD> </TR>
##   <TR> <TD align="right"> 3 </TD> <TD> Africa </TD> <TD align="right"> 1962 </TD> <TD align="right"> 43.32 </TD> <TD align="right"> 42.63 </TD> </TR>
##   <TR> <TD align="right"> 4 </TD> <TD> Africa </TD> <TD align="right"> 1967 </TD> <TD align="right"> 45.33 </TD> <TD align="right"> 44.70 </TD> </TR>
##   <TR> <TD align="right"> 5 </TD> <TD> Africa </TD> <TD align="right"> 1972 </TD> <TD align="right"> 47.45 </TD> <TD align="right"> 47.03 </TD> </TR>
##   <TR> <TD align="right"> 6 </TD> <TD> Africa </TD> <TD align="right"> 1977 </TD> <TD align="right"> 49.58 </TD> <TD align="right"> 49.27 </TD> </TR>
##   <TR> <TD align="right"> 7 </TD> <TD> Africa </TD> <TD align="right"> 1982 </TD> <TD align="right"> 51.59 </TD> <TD align="right"> 50.76 </TD> </TR>
##   <TR> <TD align="right"> 8 </TD> <TD> Africa </TD> <TD align="right"> 1987 </TD> <TD align="right"> 53.34 </TD> <TD align="right"> 51.64 </TD> </TR>
##   <TR> <TD align="right"> 9 </TD> <TD> Africa </TD> <TD align="right"> 1992 </TD> <TD align="right"> 53.63 </TD> <TD align="right"> 52.43 </TD> </TR>
##   <TR> <TD align="right"> 10 </TD> <TD> Africa </TD> <TD align="right"> 1997 </TD> <TD align="right"> 53.60 </TD> <TD align="right"> 52.76 </TD> </TR>
##    </TABLE>

The results highlight the life expectancy over year and it is proportionally increasing for every continent. We could also perform the average after discarding sample at the high and low end:

lifeExpect <- ddply(.data = gDat, .(continent, year), .fun = summarise, meanLifeExpec = mean(lifeExp, 
    trim = 0.3))
temptable <- head(xtable(lifeExpect), 10)
print(temptable, type = "html", include.rownames = TRUE)
## <!-- html table generated in R 3.0.1 by xtable 1.7-1 package -->
## <!-- Mon Sep 23 16:28:25 2013 -->
## <TABLE border=1>
## <TR> <TH>  </TH> <TH> continent </TH> <TH> year </TH> <TH> meanLifeExpec </TH>  </TR>
##   <TR> <TD align="right"> 1 </TD> <TD> Africa </TD> <TD align="right"> 1952 </TD> <TD align="right"> 39.15 </TD> </TR>
##   <TR> <TD align="right"> 2 </TD> <TD> Africa </TD> <TD align="right"> 1957 </TD> <TD align="right"> 41.01 </TD> </TR>
##   <TR> <TD align="right"> 3 </TD> <TD> Africa </TD> <TD align="right"> 1962 </TD> <TD align="right"> 42.87 </TD> </TR>
##   <TR> <TD align="right"> 4 </TD> <TD> Africa </TD> <TD align="right"> 1967 </TD> <TD align="right"> 44.93 </TD> </TR>
##   <TR> <TD align="right"> 5 </TD> <TD> Africa </TD> <TD align="right"> 1972 </TD> <TD align="right"> 47.03 </TD> </TR>
##   <TR> <TD align="right"> 6 </TD> <TD> Africa </TD> <TD align="right"> 1977 </TD> <TD align="right"> 49.22 </TD> </TR>
##   <TR> <TD align="right"> 7 </TD> <TD> Africa </TD> <TD align="right"> 1982 </TD> <TD align="right"> 51.11 </TD> </TR>
##   <TR> <TD align="right"> 8 </TD> <TD> Africa </TD> <TD align="right"> 1987 </TD> <TD align="right"> 52.60 </TD> </TR>
##   <TR> <TD align="right"> 9 </TD> <TD> Africa </TD> <TD align="right"> 1992 </TD> <TD align="right"> 53.36 </TD> </TR>
##   <TR> <TD align="right"> 10 </TD> <TD> Africa </TD> <TD align="right"> 1997 </TD> <TD align="right"> 52.57 </TD> </TR>
##    </TABLE>

Analyzing data becomes more intersting and we ask ourself what is the trend of the life expectancy across the countries, interesting!. We might use our own age as benchmark magnitude to perform such computation as below:

cDat <- subset(x = gDat, subset = lifeExp < 35, select = c(continent, year, 
    country))
cFun <- function(x) {
    return(count(vars = "year", x))
}
trend <- ddply(cDat, .(continent), cFun)
temptable <- xtable(trend)
print(temptable, type = "html", include.rownames = TRUE)
## <!-- html table generated in R 3.0.1 by xtable 1.7-1 package -->
## <!-- Mon Sep 23 16:28:25 2013 -->
## <TABLE border=1>
## <TR> <TH>  </TH> <TH> continent </TH> <TH> year </TH> <TH> freq </TH>  </TR>
##   <TR> <TD align="right"> 1 </TD> <TD> Africa </TD> <TD align="right"> 1952 </TD> <TD align="right">  12 </TD> </TR>
##   <TR> <TD align="right"> 2 </TD> <TD> Africa </TD> <TD align="right"> 1957 </TD> <TD align="right">   8 </TD> </TR>
##   <TR> <TD align="right"> 3 </TD> <TD> Africa </TD> <TD align="right"> 1962 </TD> <TD align="right">   4 </TD> </TR>
##   <TR> <TD align="right"> 4 </TD> <TD> Africa </TD> <TD align="right"> 1967 </TD> <TD align="right">   1 </TD> </TR>
##   <TR> <TD align="right"> 5 </TD> <TD> Africa </TD> <TD align="right"> 1992 </TD> <TD align="right">   1 </TD> </TR>
##   <TR> <TD align="right"> 6 </TD> <TD> Asia </TD> <TD align="right"> 1952 </TD> <TD align="right">   2 </TD> </TR>
##   <TR> <TD align="right"> 7 </TD> <TD> Asia </TD> <TD align="right"> 1957 </TD> <TD align="right">   2 </TD> </TR>
##   <TR> <TD align="right"> 8 </TD> <TD> Asia </TD> <TD align="right"> 1962 </TD> <TD align="right">   1 </TD> </TR>
##   <TR> <TD align="right"> 9 </TD> <TD> Asia </TD> <TD align="right"> 1967 </TD> <TD align="right">   1 </TD> </TR>
##   <TR> <TD align="right"> 10 </TD> <TD> Asia </TD> <TD align="right"> 1977 </TD> <TD align="right">   1 </TD> </TR>
##    </TABLE>

Finally, we report on how to obtain the minimum life expectancy for each year and its associated country and continent and closely monitor the value of lm which tries to find a fitted line across several years:

mFun <- function(x) {
    m <- subset(x, subset = lifeExp == min(lifeExp), select = c(continent, country, 
        lifeExp))
    return(m)
}
lifeExpect <- ddply(.data = gDat, .(year), mFun)
temptable <- xtable(lifeExpect)
print(temptable, type = "html", include.rownames = TRUE)
## <!-- html table generated in R 3.0.1 by xtable 1.7-1 package -->
## <!-- Mon Sep 23 16:28:25 2013 -->
## <TABLE border=1>
## <TR> <TH>  </TH> <TH> year </TH> <TH> continent </TH> <TH> country </TH> <TH> lifeExp </TH>  </TR>
##   <TR> <TD align="right"> 1 </TD> <TD align="right"> 1952 </TD> <TD> Asia </TD> <TD> Afghanistan </TD> <TD align="right"> 28.80 </TD> </TR>
##   <TR> <TD align="right"> 2 </TD> <TD align="right"> 1957 </TD> <TD> Asia </TD> <TD> Afghanistan </TD> <TD align="right"> 30.33 </TD> </TR>
##   <TR> <TD align="right"> 3 </TD> <TD align="right"> 1962 </TD> <TD> Asia </TD> <TD> Afghanistan </TD> <TD align="right"> 32.00 </TD> </TR>
##   <TR> <TD align="right"> 4 </TD> <TD align="right"> 1967 </TD> <TD> Asia </TD> <TD> Afghanistan </TD> <TD align="right"> 34.02 </TD> </TR>
##   <TR> <TD align="right"> 5 </TD> <TD align="right"> 1972 </TD> <TD> Africa </TD> <TD> Sierra Leone </TD> <TD align="right"> 35.40 </TD> </TR>
##   <TR> <TD align="right"> 6 </TD> <TD align="right"> 1977 </TD> <TD> Asia </TD> <TD> Cambodia </TD> <TD align="right"> 31.22 </TD> </TR>
##   <TR> <TD align="right"> 7 </TD> <TD align="right"> 1982 </TD> <TD> Africa </TD> <TD> Sierra Leone </TD> <TD align="right"> 38.45 </TD> </TR>
##   <TR> <TD align="right"> 8 </TD> <TD align="right"> 1987 </TD> <TD> Africa </TD> <TD> Angola </TD> <TD align="right"> 39.91 </TD> </TR>
##   <TR> <TD align="right"> 9 </TD> <TD align="right"> 1992 </TD> <TD> Africa </TD> <TD> Rwanda </TD> <TD align="right"> 23.60 </TD> </TR>
##   <TR> <TD align="right"> 10 </TD> <TD align="right"> 1997 </TD> <TD> Africa </TD> <TD> Rwanda </TD> <TD align="right"> 36.09 </TD> </TR>
##   <TR> <TD align="right"> 11 </TD> <TD align="right"> 2002 </TD> <TD> Africa </TD> <TD> Zambia </TD> <TD align="right"> 39.19 </TD> </TR>
##   <TR> <TD align="right"> 12 </TD> <TD align="right"> 2007 </TD> <TD> Africa </TD> <TD> Swaziland </TD> <TD align="right"> 39.61 </TD> </TR>
##    </TABLE>
jFit <- lm(data = gDat, formula = lifeExp ~ I(year))
summary(jFit)
## 
## Call:
## lm(formula = lifeExp ~ I(year), data = gDat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -39.95  -9.65   1.70  10.33  22.16 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -585.6522    32.3140   -18.1   <2e-16 ***
## I(year)        0.3259     0.0163    20.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.6 on 1702 degrees of freedom
## Multiple R-squared:  0.19,   Adjusted R-squared:  0.189 
## F-statistic:  399 on 1 and 1702 DF,  p-value: <2e-16

Shocking!!! life expectancy in 0 A.D. year was -585.65. That's not right. So, we perform slightly different approach:

yearMin <- min(gDat$year)
jFit <- lm(data = gDat, formula = lifeExp ~ I(year - yearMin))
summary(jFit)
## 
## Call:
## lm(formula = lifeExp ~ I(year - yearMin), data = gDat)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -39.95  -9.65   1.70  10.33  22.16 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        50.5121     0.5300    95.3   <2e-16 ***
## I(year - yearMin)   0.3259     0.0163    20.0   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.6 on 1702 degrees of freedom
## Multiple R-squared:  0.19,   Adjusted R-squared:  0.189 
## F-statistic:  399 on 1 and 1702 DF,  p-value: <2e-16

Well. The results make more sense.

See you with my next article.