Homework Number Two (Stat545a)

In this assignment we take a preliminary look at the Gapminder dataset (located here, for those who are curious).

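This will include loading the data, inspecting its structure, summarizing its variables, and plotting life expectancy over time for a single country.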

We start by loading the dataset, as well as the needed libraries:

gDat <- read.delim("gapminderDataFiveYear.txt")
library(lattice)

We now want to learn a little bit about the dataset that we are dealing with:

str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

The output shows 1704 observations of 6 variables. It also shows how each variable is coded: country and continent are factors, year is an integer, and pop, lifeExp, and gdpPercap are numeric.
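The same facts can also be pulled out programmatically rather than read off the str() printout. The sketch below is only an illustration: since the data file is not bundled here, it rebuilds a five-row stand-in (the hypothetical name `mini`) from the rounded Afghanistan values printed by str() above, then queries its dimensions and column classes.

```r
# Stand-in for gDat: the first five rows, using the rounded values
# that str() printed above. `mini` is an illustrative name, not part
# of the original analysis.
mini <- data.frame(
  country   = factor(rep("Afghanistan", 5)),
  year      = c(1952L, 1957L, 1962L, 1967L, 1972L),
  pop       = c(8425333, 9240934, 10267083, 11537966, 13079460),
  continent = factor(rep("Asia", 5)),
  lifeExp   = c(28.8, 30.3, 32, 34, 36.1),
  gdpPercap = c(779, 821, 853, 836, 740)
)

# Number of rows (observations) and columns (variables)
dims <- dim(mini)

# Class of each column, as a named character vector
col_classes <- sapply(mini, class)

print(dims)
print(col_classes)
```

Run against the full gDat, the same two calls should report the 1704 rows and 6 columns seen in the str() output.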

Out of curiosity, I want to take a peek at some of the actual data; once again, this is to get a better sense of what we are dealing with:

tail(gDat)
##       country year      pop continent lifeExp gdpPercap
## 1699 Zimbabwe 1982  7636524    Africa   60.36     788.9
## 1700 Zimbabwe 1987  9216418    Africa   62.35     706.2
## 1701 Zimbabwe 1992 10704340    Africa   60.38     693.4
## 1702 Zimbabwe 1997 11404948    Africa   46.81     792.4
## 1703 Zimbabwe 2002 11926563    Africa   39.99     672.0
## 1704 Zimbabwe 2007 12311143    Africa   43.49     469.7

Nothing too exciting to report, so I make a quick-and-dirty summary of the data:

summary(gDat)
##         country          year           pop              continent  
##  Afghanistan:  12   Min.   :1952   Min.   :6.00e+04   Africa  :624  
##  Albania    :  12   1st Qu.:1966   1st Qu.:2.79e+06   Americas:300  
##  Algeria    :  12   Median :1980   Median :7.02e+06   Asia    :396  
##  Angola     :  12   Mean   :1980   Mean   :2.96e+07   Europe  :360  
##  Argentina  :  12   3rd Qu.:1993   3rd Qu.:1.96e+07   Oceania : 24  
##  Australia  :  12   Max.   :2007   Max.   :1.32e+09                 
##  (Other)    :1632                                                   
##     lifeExp       gdpPercap     
##  Min.   :23.6   Min.   :   241  
##  1st Qu.:48.2   1st Qu.:  1202  
##  Median :60.7   Median :  3532  
##  Mean   :59.5   Mean   :  7215  
##  3rd Qu.:70.8   3rd Qu.:  9325  
##  Max.   :82.6   Max.   :113523  
## 

Among other things, we can see that Oceania is underrepresented in the dataset, with only 24 of the 1704 observations. Furthermore, life expectancy ranges from 23.6 to 82.6, and the data run from 1952 to 2007.
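To put a number on how underrepresented Oceania is, the continent counts from the summary can be converted into shares and approximate country counts. This is a rough sketch: the counts are copied from the summary() output above, and it assumes every country contributes exactly 12 rows, which is consistent with 142 countries and 1704 observations but is not checked against the data here.

```r
# Observation counts per continent, copied from the summary() output above
obs_per_continent <- c(Africa = 624, Americas = 300, Asia = 396,
                       Europe = 360, Oceania = 24)

# Assumption: each country has exactly 12 rows (142 * 12 = 1704)
obs_per_country <- 12

# Implied number of countries per continent under that assumption
countries <- obs_per_continent / obs_per_country

# Percentage of all observations contributed by each continent
share <- round(100 * obs_per_continent / sum(obs_per_continent), 1)

print(countries)
print(share)
```

Under that assumption, Oceania's 24 observations correspond to just two countries.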

Finally, I want to make a plot. Given that my fieldwork takes me to Zambia, I am curious to see what this dataset has to say about the country, so I plot its life expectancy over the years.

xyplot(lifeExp ~ year, gDat, subset = country == "Zambia", type = c("p", "r"), 
    xlab = "Year", ylab = "Life Expectancy", ylim = c(1, 60), main = "Life Expectancy in Zambia", 
    col = "red")

[Figure: life expectancy in Zambia by year, shown as points with a fitted regression line]
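In the xyplot() call, type = c("p", "r") draws the points and adds a simple linear regression line. As a minimal sketch of what that line represents, the code below fits the same kind of model with lm(). The Zambia rows are not printed in this report, so the example uses the six Zimbabwe rows from the tail() output instead; the names `zim`, `fit`, `slope`, and `p` are illustrative only.

```r
library(lattice)

# Year and lifeExp for Zimbabwe, copied from the tail() output above
zim <- data.frame(
  year    = c(1982, 1987, 1992, 1997, 2002, 2007),
  lifeExp = c(60.36, 62.35, 60.38, 46.81, 39.99, 43.49)
)

# The "r" in type = c("p", "r") corresponds to a least-squares line
# of lifeExp on year; lm() gives its intercept and slope directly
fit <- lm(lifeExp ~ year, data = zim)
slope <- coef(fit)[["year"]]   # change in life expectancy per year
print(coef(fit))

# Same style of plot as above, built as an object; print(p) draws it
p <- xyplot(lifeExp ~ year, zim, type = c("p", "r"),
            xlab = "Year", ylab = "Life Expectancy", col = "red")
```

Swapping `zim` for the Zambia subset of gDat would give the line shown in the figure.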