STAT 545A - Homework #2

By Christian Okkels

Introduction

This homework assignment presents a brief analysis of part of the Gapminder data (available here). It is also an introductory exercise in writing R Markdown files, compiling them into HTML reports, and publishing the resulting documents on RPubs and GitHub Gist.

Brief Exploration of Data

First, we load the Gapminder data.

gDat <- read.delim("gapminderDataFiveYear.txt")

Using the str() function, we can get some information about the data.frame gDat.

str(gDat)

## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

This output tells us that there are 1704 observations of 6 variables. The variables are as follows:

country
year
pop
continent
lifeExp
gdpPercap

If one is not interested in the data excerpts and wants shorter, clearer output, the number of observations and the names of the variables can instead be obtained with the nrow() and names() functions, respectively.

nrow(gDat)

## [1] 1704

names(gDat)

## [1] "country"   "year"      "pop"       "continent" "lifeExp"   "gdpPercap"
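As a small complement, dim() and ncol() report the same kind of information. The sketch below does not read the Gapminder file; it uses a tiny stand-in data frame called toy, whose name and values are invented for illustration only, so that it runs on its own.

```r
# Toy stand-in for gDat (hypothetical values, NOT Gapminder data)
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(1952L, 1957L, 1952L, 1957L),
  pop     = c(1.0e6, 1.1e6, 2.0e6, 2.3e6)
)

dims  <- dim(toy)   # c(number of rows, number of columns)
n_obs <- nrow(toy)  # number of observations (rows)
n_var <- ncol(toy)  # number of variables (columns)

print(dims)
```

Applied to gDat, the same calls should agree with the counts reported by str() above.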

For a quick statistical summary of the data.frame, one can use the function summary().

summary(gDat)

##         country          year           pop              continent  
##  Afghanistan:  12   Min.   :1952   Min.   :6.00e+04   Africa  :624  
##  Albania    :  12   1st Qu.:1966   1st Qu.:2.79e+06   Americas:300  
##  Algeria    :  12   Median :1980   Median :7.02e+06   Asia    :396  
##  Angola     :  12   Mean   :1980   Mean   :2.96e+07   Europe  :360  
##  Argentina  :  12   3rd Qu.:1993   3rd Qu.:1.96e+07   Oceania : 24  
##  Australia  :  12   Max.   :2007   Max.   :1.32e+09                 
##  (Other)    :1632                                                   
##     lifeExp       gdpPercap     
##  Min.   :23.6   Min.   :   241  
##  1st Qu.:48.2   1st Qu.:  1202  
##  Median :60.7   Median :  3532  
##  Mean   :59.5   Mean   :  7215  
##  3rd Qu.:70.8   3rd Qu.:  9325  
##  Max.   :82.6   Max.   :113523  
##

Many basic descriptive statistics can be read from this output. For example, the data span the period from 1952 to 2007. The lowest life expectancy recorded for any country in any year is 23.6, and the highest is 82.6. Another basic statistic is the mean Gross Domestic Product per capita, taken over all country-year observations, which is about 7215.
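The individual figures quoted from summary() can also be computed directly, and which.min() shows which row a minimum belongs to. This is only a sketch: the data frame toy and all of its values are made up so the example is self-contained, and they are not taken from the Gapminder file.

```r
# Toy stand-in for gDat (hypothetical values, NOT Gapminder data)
toy <- data.frame(
  country   = c("A", "A", "B", "B"),
  year      = c(1952L, 2007L, 1952L, 2007L),
  lifeExp   = c(40, 60, 65, 80),
  gdpPercap = c(500, 1500, 8000, 30000),
  stringsAsFactors = TRUE
)

year_span  <- range(toy$year)      # first and last year in the data
life_range <- range(toy$lifeExp)   # minimum and maximum life expectancy
mean_gdp   <- mean(toy$gdpPercap)  # mean over all rows (country-year pairs)

# Row holding the lowest life expectancy, with its country and year
lowest <- toy[which.min(toy$lifeExp), c("country", "year", "lifeExp")]
print(lowest)
```

Replacing toy with gDat would show which country and year the minimum life expectancy corresponds to, something summary() alone does not reveal.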

To get a better feel for the data, we can visualize different aspects of it through plots. The plots below use the lattice package, which is loaded as follows.

library(lattice)

Let's look at how the population of Denmark (where I'm from :-) ) has evolved from 1952 to 2007.

xyplot(pop ~ year, gDat, subset = country == "Denmark")

[Plot: population of Denmark (pop) against year]

Evidently, the population of Denmark has increased fairly steadily, although with an unmistakable flattening out in the 1980s. Moreover, the increase in the later years, from about 1995 to 2007, does not appear as steep as the increase observed from 1952 to the end of the 1970s.

Now, let's see how things have been going with the Gross Domestic Product per Capita. A line has also been fitted through the data points.

xyplot(gdpPercap ~ year, gDat, subset = country == "Denmark", type = c("p", 
    "r"))

[Plot: GDP per capita of Denmark (gdpPercap) against year, with fitted line]

The points appear to lie rather close to the fitted line (although a more detailed statistical analysis is needed to comment further on the goodness of fit). The data thus suggest that the GDP per capita has increased roughly linearly with time.
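One possible first step towards quantifying the goodness of fit is to fit the straight line explicitly with lm() and inspect the slope and R-squared. The sketch below is not the analysis of the Danish data: the data frame toy is synthetic, built from an assumed linear trend plus a small alternating disturbance, purely so the example runs without the Gapminder file.

```r
# Synthetic stand-in (hypothetical values, NOT the Danish Gapminder data):
# a linear trend in year plus an alternating +/- 300 disturbance
toy <- data.frame(year = seq(1952, 2007, by = 5))
toy$gdpPercap <- 10000 + 450 * (toy$year - 1952) +
  rep(c(-300, 300), length.out = nrow(toy))

# Ordinary least-squares fit of GDP per capita on year
fit <- lm(gdpPercap ~ year, data = toy)

slope     <- unname(coef(fit)["year"])  # estimated change per year
r_squared <- summary(fit)$r.squared     # share of variance explained

print(slope)
print(r_squared)
```

With the real data, one would pass subset(gDat, country == "Denmark") as the data argument; a residual plot would then help judge whether a straight line is adequate.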

As a last visualization, let's spice things up a bit and compare the change through time of the GDP per capita for China and Japan.

xyplot(gdpPercap ~ year | country, gDat, subset = (country == "Japan" | country == 
    "China"), type = c("p", "smooth"))

[Plot: GDP per capita (gdpPercap) against year, one panel each for China and Japan, with a smoothed curve]
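The subset condition in the last call joins two == comparisons with |. An equivalent way to write such a condition, which may read more easily when more countries are compared, is the %in% operator. The sketch below checks the equivalence on a short made-up factor; the vector country here is hypothetical and stands in for gDat$country.

```r
# Hypothetical stand-in for gDat$country
country <- factor(c("China", "Denmark", "Japan", "China", "Japan"))

# Condition as written in the xyplot() call above
keep_or <- country == "Japan" | country == "China"

# Same condition expressed with %in%
keep_in <- country %in% c("Japan", "China")

print(identical(keep_or, keep_in))
```

In the plotting call, this would correspond to subset = country %in% c("Japan", "China").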