STAT 545A Homework#4

Yiming Zhang First, loading the Gapminder data and needed packages.

gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
library(lattice)
library(plyr)
library(xtable)

Then because Oceania only have two countries, this could bring bad effect to our analysis, so we just drop this factor.

gDat <- droplevels(subset(gDat, continent != "Oceania"))

Then have a quick check of the data

str(gDat)
## 'data.frame':    1680 obs. of  6 variables:
##  $ country  : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
levels(gDat$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"

We can see that the data have been imported properly and Oceania excluded from the data.

To create an HTML table and print, I use a function “htmlPrint” written by Professor Bryan Jennifer in Homework#3 Data aggregation

htmlPrint <- function(x, ..., digits = 2, include.rownames = FALSE) {
    print(xtable(x, digits = digits, ...), type = "html", include.rownames = include.rownames, 
        ...)
}

1.How is life expectancy changing over time on different continents?

Let's check it on a wide table.

LifeExpchgebyCont <- daply(gDat, ~year + continent, summarize, MeanLifeExp = mean(lifeExp))
htmlPrint(LifeExpchgebyCont, include.rownames = TRUE)
Africa Americas Asia Europe
1952 39.14 53.28 46.31 64.41
1957 41.27 55.96 49.32 66.70
1962 43.32 58.40 51.56 68.54
1967 45.33 60.41 54.66 69.74
1972 47.45 62.39 57.32 70.78
1977 49.58 64.39 59.61 71.94
1982 51.59 66.23 62.62 72.81
1987 53.34 68.09 64.85 73.64
1992 53.63 69.57 66.54 74.44
1997 53.60 71.15 68.02 75.51
2002 53.33 72.42 69.23 76.70
2007 54.81 73.61 70.73 77.65

Then show in one pic

LifeExpchgebyCont_tall <- ddply(gDat, ~year + continent, summarize, MeanLifeExp = mean(lifeExp))
xyplot(MeanLifeExp ~ year, LifeExpchgebyCont_tall, groups = continent, auto.key = TRUE, 
    type = c("p", "a"))

plot of chunk unnamed-chunk-6

We can see that life expectancy in all continents are increasing by the time, and Asia has the highest slope.

2.Depict the maximum and minimum of GDP per capital for all continents


First get the maximum and minimum of GDP per capital for all continents

GDPbyYear <- daply(gDat, ~year + continent, summarize, Max = max(gdpPercap), 
    Min = min(gdpPercap))
htmlPrint(as.data.frame(GDPbyYear), include.rownames = TRUE)
Africa.Max Americas.Max Asia.Max Europe.Max Africa.Min Americas.Min Asia.Min Europe.Min
1952 4725.30 13990.48 108382.35 14734.23 298.85 1397.72 331.00 973.53
1957 5487.10 14847.13 113523.13 17909.49 336.00 1544.40 350.00 1353.99
1962 6757.03 16173.15 95458.11 20431.09 355.20 1662.14 388.00 1709.68
1967 18772.75 19530.37 80894.88 22966.14 412.98 1452.06 349.00 2172.35
1972 21011.50 21806.04 109347.87 27195.11 464.10 1654.46 357.00 2860.17
1977 21951.21 24072.63 59265.48 26982.29 502.32 1874.30 371.00 3528.48
1982 17364.28 25009.56 33693.18 28397.72 462.21 2011.16 424.00 3630.88
1987 11864.41 29884.35 28118.43 31540.97 389.88 1823.02 385.00 3738.93
1992 13522.16 32003.93 34932.92 33965.66 410.90 1456.31 347.00 2497.44
1997 14722.84 35767.43 40300.62 41283.16 312.19 1341.73 415.00 3193.05
2002 12521.71 39097.10 36023.11 44683.98 241.17 1270.36 611.00 4604.21
2007 13206.48 42951.65 47306.99 49357.19 277.55 1201.64 944.00 5937.03

Then plot the maximum and minimum separately.

GDPbyYear_tall <- ddply(gDat, ~year + continent, summarize, Max = max(gdpPercap), 
    Min = min(gdpPercap))
print(xyplot(Max ~ year, GDPbyYear_tall, groups = continent, auto.key = TRUE, 
    type = c("p", "a")), position = c(0, 0, 0.55, 1), more = TRUE)
print(xyplot(Min ~ year, GDPbyYear_tall, groups = continent, auto.key = TRUE, 
    type = c("p", "a")), position = c(0.45, 0, 1, 1))

plot of chunk unnamed-chunk-8

We can see that the maximum in Asia fell greatly from extremely high to the normal. And the minimal in Europe increased from normal to extremely high.

3.Look at the spread of GDP per capital within the continents

Let's look at 2007 for example. First let's just use mean to plot the line between the dots in different continents

Dat_2007 <- subset(gDat, year %in% 2007)
stripplot(gdpPercap ~ continent, Dat_2007, jitter.data = TRUE, grid = "h", type = c("p", 
    "a"))

plot of chunk unnamed-chunk-9

Then how about we change to use the median to plot the line.

Dat_2007 <- subset(gDat, year %in% 2007)
stripplot(gdpPercap ~ continent, Dat_2007, jitter.data = TRUE, grid = "h", type = c("p", 
    "a"), fun = median)

plot of chunk unnamed-chunk-10

Look at the difference between the two figures, Africa, Americas and Europe does not change a lot, only Asia does, this could mean that the gap between poor countries and rich countries are really huge, there are so many poor countries and some extremely rich countries like Singapore, Japan.

4.Depict the proportion of countries with low life expectancy over time by continent

Define the country in a particular year whose life expectancy is below 60 as a “low life expectancy country”. Then let's find the percentage of those countries in each continents within the year.

PercentLowLifeExpCountries <- daply(gDat, ~year + continent, summarize, Percentage = length(which(lifeExp < 
    60))/length(unique(country)))
htmlPrint(PercentLowLifeExpCountries, include.rownames = TRUE)
Africa Americas Asia Europe
1952 1.00 0.76 0.88 0.23
1957 1.00 0.60 0.82 0.10
1962 0.98 0.52 0.76 0.03
1967 0.96 0.44 0.76 0.03
1972 0.96 0.40 0.58 0.03
1977 0.96 0.28 0.42 0.03
1982 0.85 0.20 0.36 0.00
1987 0.77 0.08 0.24 0.00
1992 0.75 0.08 0.21 0.00
1997 0.75 0.04 0.18 0.00
2002 0.79 0.04 0.12 0.00
2007 0.77 0.00 0.09 0.00

Then let's see in a plot.

LowLifeExpCountries_tall <- ddply(gDat, ~year + continent, summarize, Percentage = length(which(lifeExp < 
    60))/length(unique(country)), nCountries = length(which(lifeExp < 60)))
xyplot(Percentage ~ year, LowLifeExpCountries_tall, groups = continent, auto.key = TRUE, 
    type = c("p", "a"))

plot of chunk unnamed-chunk-12

Let's try to show the results in a box plot.

bwplot(lifeExp ~ as.factor(year) | continent, gDat)

plot of chunk unnamed-chunk-13

5.Find counties with extremely low or high life expectancy in 1952 or that exhibit extremely rapid or slow life expectancy gains

First, get the intercept and coefficients of all country(use function written by Professor Bryan Jennifer in Data aggregation )

yearMin <- min(gDat$year)
jFun <- function(x) {
    estCoefs <- coef(lm(lifeExp ~ I(year - yearMin), x))
    names(estCoefs) <- c("intercept", "slope")
    return(estCoefs)
}
Coefs <- ddply(gDat, ~country, jFun)

Then get the top 10 countries that have lowest and those countries with highest life expectancy in 1952

htmlPrint(head(arrange(Coefs, intercept), 10))
country intercept slope
Gambia 28.40 0.58
Afghanistan 29.91 0.28
Yemen, Rep. 30.13 0.61
Sierra Leone 30.88 0.21
Guinea 31.56 0.42
Guinea-Bissau 31.74 0.27
Angola 32.13 0.21
Mali 33.05 0.38
Mozambique 34.21 0.22
Equatorial Guinea 34.43 0.31
htmlPrint(tail(arrange(Coefs, intercept), 10))
country intercept slope
Belgium 67.89 0.21
United States 68.41 0.18
United Kingdom 68.81 0.19
Canada 68.88 0.22
Switzerland 69.45 0.22
Denmark 71.03 0.12
Sweden 71.61 0.17
Netherlands 71.89 0.14
Iceland 71.96 0.17
Norway 72.21 0.13

And also those countries exhibit extremely rapid or slow life expectancy gains.

htmlPrint(head(arrange(Coefs, slope), 10))
country intercept slope
Zimbabwe 55.22 -0.09
Zambia 47.66 -0.06
Rwanda 42.74 -0.05
Botswana 52.93 0.06
Congo, Dem. Rep. 41.96 0.09
Swaziland 46.39 0.10
Lesotho 47.38 0.10
Liberia 39.84 0.10
Denmark 71.03 0.12
Uganda 44.28 0.12
htmlPrint(tail(arrange(Coefs, slope), 10))
country intercept slope
Jordan 44.06 0.57
Gambia 28.40 0.58
Tunisia 44.56 0.59
West Bank and Gaza 43.80 0.60
Yemen, Rep. 30.13 0.61
Libya 42.10 0.63
Indonesia 36.88 0.63
Saudi Arabia 40.81 0.65
Vietnam 39.01 0.67
Oman 37.21 0.77

Then get those countries that have have lower life expectancy in 1952 and also gains slow with time.

poorcountries <- Coefs[(Coefs$slope < 0.3) & (Coefs$intercept < 35), ]
htmlPrint(poorcountries)
country intercept slope
Afghanistan 29.91 0.28
Angola 32.13 0.21
Guinea-Bissau 31.74 0.27
Mozambique 34.21 0.22
Sierra Leone 30.88 0.21
Somalia 34.68 0.23

And compared with thus countries that have lower life expectancy in 1952 but gains fast

increcountries <- Coefs[(Coefs$slope > 0.5) & (Coefs$intercept < 40), ]
htmlPrint(increcountries)
country intercept slope
Gambia 28.40 0.58
India 39.27 0.51
Indonesia 36.88 0.63
Nepal 34.43 0.53
Oman 37.21 0.77
Senegal 36.75 0.50
Vietnam 39.01 0.67
Yemen, Rep. 30.13 0.61

Take an example, compare Angola and Gambia.

comparecoun <- droplevels(subset(gDat, subset = country %in% c("Angola", "Gambia")))
levels(comparecoun$country)

[1] “Angola” “Gambia”

xyplot(lifeExp ~ year, comparecoun, groups = country, type = c("p", "a"), auto.key = TRUE)

plot of chunk unnamed-chunk-19

We can see that the change between the two country happened near 1972. I want to know what happened near 1972 that had made this change to Angola.

I don't know how to get those countries directly from gDat, I tried poorcountries<- gDat[(Coefs$slope<0.3)&(Coefs$intercept<40),], but it does not work.