By Christian Okkels
This homework assignment presents a brief analysis of some of the Gapminder data (available here). It is also an introductory exercise in writing R Markdown files, compiling them into HTML reports, and publishing the results on RPubs and GitHub Gist.
First, we load the Gapminder data.
gDat <- read.delim("gapminderDataFiveYear.txt")
Using the str() function, we can get some information about the data.frame gDat.
str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
This output tells us that there are 1704 observations of 6 variables: country and continent (factors with 142 and 5 levels, respectively), year (integer), and pop, lifeExp and gdpPercap (numeric).
If one is not interested in the data excerpts and wants a shorter and clearer output, an alternative way of getting the number of observations and the names of the variables is to use the nrow() and names() functions, respectively.
nrow(gDat)
## [1] 1704
names(gDat)
## [1] "country" "year" "pop" "continent" "lifeExp" "gdpPercap"
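As a small self-contained sketch of the same idea, dim() returns the number of rows and columns in one call, and sapply() with class() lists the type of each variable. The data frame `excerpt` below is not the full data set; it is an assumption made for illustration, holding only the first five Afghanistan rows as they appear (rounded) in the str() output above.

```r
# Illustrative excerpt: first five Afghanistan rows, values copied
# (rounded) from the str() output above. Not the full Gapminder data.
excerpt <- data.frame(
  country   = factor(rep("Afghanistan", 5)),
  year      = c(1952L, 1957L, 1962L, 1967L, 1972L),
  pop       = c(8425333, 9240934, 10267083, 11537966, 13079460),
  continent = factor(rep("Asia", 5)),
  lifeExp   = c(28.8, 30.3, 32, 34, 36.1),
  gdpPercap = c(779, 821, 853, 836, 740)
)

# Number of rows and columns in a single call
dims <- dim(excerpt)
print(dims)

# Class of each variable, as a named character vector
classes <- sapply(excerpt, class)
print(classes)
```

On the full data, the corresponding calls would be dim(gDat) and sapply(gDat, class).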
For a quick statistical summary of the data.frame, one can use the function summary().
summary(gDat)
## country year pop continent
## Afghanistan: 12 Min. :1952 Min. :6.00e+04 Africa :624
## Albania : 12 1st Qu.:1966 1st Qu.:2.79e+06 Americas:300
## Algeria : 12 Median :1980 Median :7.02e+06 Asia :396
## Angola : 12 Mean :1980 Mean :2.96e+07 Europe :360
## Argentina : 12 3rd Qu.:1993 3rd Qu.:1.96e+07 Oceania : 24
## Australia : 12 Max. :2007 Max. :1.32e+09
## (Other) :1632
## lifeExp gdpPercap
## Min. :23.6 Min. : 241
## 1st Qu.:48.2 1st Qu.: 1202
## Median :60.7 Median : 3532
## Mean :59.5 Mean : 7215
## 3rd Qu.:70.8 3rd Qu.: 9325
## Max. :82.6 Max. :113523
##
Many basic descriptive statistics can be read from this output. For example, we see that the time frame of the data spans the period from 1952 to 2007. We can also see that the minimum life expectancy across all the different countries is as low as 23.60, and that the maximum life expectancy is 82.60. Another basic statistic is the average Gross Domestic Product per Capita, which is 7215.3.
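The individual figures can also be computed one column at a time with range(), min(), max() and mean(), which is handy when only a single number is needed. The sketch below runs on an illustrative excerpt (the first five Afghanistan rows, rounded as in the str() output), so its results differ from the full-data values quoted above.

```r
# Illustrative excerpt: first five Afghanistan rows, values copied
# (rounded) from the str() output. Not the full Gapminder data.
excerpt <- data.frame(
  year    = c(1952L, 1957L, 1962L, 1967L, 1972L),
  lifeExp = c(28.8, 30.3, 32, 34, 36.1)
)

# First and last year covered by the excerpt
year_span <- range(excerpt$year)
print(year_span)

# Lowest, highest and average life expectancy in the excerpt
life_min  <- min(excerpt$lifeExp)
life_max  <- max(excerpt$lifeExp)
life_mean <- mean(excerpt$lifeExp)
print(c(min = life_min, max = life_max, mean = life_mean))
```

On the full data, the same calls would be applied to gDat$year, gDat$lifeExp and gDat$gdpPercap.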
To get a better feel for the data we can visualize different aspects of it through plots. This requires the lattice library, which is loaded as follows.
library(lattice)
Let's look at how the population of Denmark (where I'm from :-) ) has evolved from 1952 to 2007.
xyplot(pop ~ year, gDat, subset = country == "Denmark")
Evidently, the population of Denmark has increased fairly steadily, although with an unmistakable flattening out in the 1980s. Moreover, the increase in the later years, from about 1995 to 2007, does not appear as steep as the increase observed from 1952 to the end of the 1970s.
Now, let's see how things have been going with the Gross Domestic Product per Capita. A line has also been fitted through the data points.
xyplot(gdpPercap ~ year, gDat, subset = country == "Denmark", type = c("p", "r"))
The points appear to lie rather close to the fitted line (although a more detailed statistical analysis is needed to further comment on the goodness of fit). The data thus suggest that the GDP per capita increases linearly with time.
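One possible first step towards such an analysis is to fit the line explicitly with lm() and inspect the coefficients and R-squared. The sketch below is only an illustration of the mechanics: the Denmark values are not reproduced in this report, so it uses the five rounded Afghanistan values from the str() output instead, and its numbers say nothing about the Danish fit.

```r
# Illustrative excerpt: first five Afghanistan rows, values copied
# (rounded) from the str() output. These are NOT the Denmark data.
excerpt <- data.frame(
  year      = c(1952L, 1957L, 1962L, 1967L, 1972L),
  gdpPercap = c(779, 821, 853, 836, 740)
)

# Ordinary least-squares fit of GDP per capita against year
fit <- lm(gdpPercap ~ year, data = excerpt)

# Intercept and slope of the fitted line
coefs <- coef(fit)
print(coefs)

# Share of variance explained by the line (between 0 and 1)
r2 <- summary(fit)$r.squared
print(r2)
```

For the plot above, the analogous call would be lm(gdpPercap ~ year, data = gDat, subset = country == "Denmark").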
As a last visualization, let's spice things up a bit and compare the change through time of the GDP per capita for China and Japan.
xyplot(gdpPercap ~ year | country, gDat,
       subset = (country == "Japan" | country == "China"),
       type = c("p", "smooth"))
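When more than a couple of countries are compared, chaining == with | becomes unwieldy, and the %in% operator expresses the same selection more compactly. The short sketch below checks that the two forms pick the same rows; the `country` vector is a made-up example for illustration, not taken from the data.

```r
# Hypothetical vector of country labels, for illustration only
country <- factor(c("China", "Denmark", "Japan", "China", "Japan"))

# Selection written with == and |, as in the plot call above
keep_or <- country == "Japan" | country == "China"

# Equivalent selection written with %in%
keep_in <- country %in% c("Japan", "China")

# Both forms select exactly the same elements
same <- identical(keep_or, keep_in)
print(same)
```

In the xyplot() call, this would correspond to subset = country %in% c("Japan", "China").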