Yiming Zhang First, loading the Gapminder data and needed packages.
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
library(lattice)
library(plyr)
library(xtable)
Then because Oceania only have two countries, this could bring bad effect to our analysis, so we just drop this factor.
gDat <- droplevels(subset(gDat, continent != "Oceania"))
Then have a quick check of the data
str(gDat)
## 'data.frame': 1680 obs. of 6 variables:
## $ country : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
levels(gDat$continent)
## [1] "Africa" "Americas" "Asia" "Europe"
We can see that the data have been imported properly and Oceania excluded from the data.
To create an HTML table and print, I use a function “htmlPrint” written by Professor Bryan Jennifer in Homework#3 Data aggregation
htmlPrint <- function(x, ..., digits = 2, include.rownames = FALSE) {
print(xtable(x, digits = digits, ...), type = "html", include.rownames = include.rownames,
...)
}
Let's check it on a wide table.
LifeExpchgebyCont <- daply(gDat, ~year + continent, summarize, MeanLifeExp = mean(lifeExp))
htmlPrint(LifeExpchgebyCont, include.rownames = TRUE)
| Africa | Americas | Asia | Europe | |
|---|---|---|---|---|
| 1952 | 39.14 | 53.28 | 46.31 | 64.41 |
| 1957 | 41.27 | 55.96 | 49.32 | 66.70 |
| 1962 | 43.32 | 58.40 | 51.56 | 68.54 |
| 1967 | 45.33 | 60.41 | 54.66 | 69.74 |
| 1972 | 47.45 | 62.39 | 57.32 | 70.78 |
| 1977 | 49.58 | 64.39 | 59.61 | 71.94 |
| 1982 | 51.59 | 66.23 | 62.62 | 72.81 |
| 1987 | 53.34 | 68.09 | 64.85 | 73.64 |
| 1992 | 53.63 | 69.57 | 66.54 | 74.44 |
| 1997 | 53.60 | 71.15 | 68.02 | 75.51 |
| 2002 | 53.33 | 72.42 | 69.23 | 76.70 |
| 2007 | 54.81 | 73.61 | 70.73 | 77.65 |
Then show in one pic
LifeExpchgebyCont_tall <- ddply(gDat, ~year + continent, summarize, MeanLifeExp = mean(lifeExp))
xyplot(MeanLifeExp ~ year, LifeExpchgebyCont_tall, groups = continent, auto.key = TRUE,
type = c("p", "a"))
We can see that life expectancy in all continents are increasing by the time, and Asia has the highest slope.
2.Depict the maximum and minimum of GDP per capital for all continents
First get the maximum and minimum of GDP per capital for all continents
GDPbyYear <- daply(gDat, ~year + continent, summarize, Max = max(gdpPercap),
Min = min(gdpPercap))
htmlPrint(as.data.frame(GDPbyYear), include.rownames = TRUE)
| Africa.Max | Americas.Max | Asia.Max | Europe.Max | Africa.Min | Americas.Min | Asia.Min | Europe.Min | |
|---|---|---|---|---|---|---|---|---|
| 1952 | 4725.30 | 13990.48 | 108382.35 | 14734.23 | 298.85 | 1397.72 | 331.00 | 973.53 |
| 1957 | 5487.10 | 14847.13 | 113523.13 | 17909.49 | 336.00 | 1544.40 | 350.00 | 1353.99 |
| 1962 | 6757.03 | 16173.15 | 95458.11 | 20431.09 | 355.20 | 1662.14 | 388.00 | 1709.68 |
| 1967 | 18772.75 | 19530.37 | 80894.88 | 22966.14 | 412.98 | 1452.06 | 349.00 | 2172.35 |
| 1972 | 21011.50 | 21806.04 | 109347.87 | 27195.11 | 464.10 | 1654.46 | 357.00 | 2860.17 |
| 1977 | 21951.21 | 24072.63 | 59265.48 | 26982.29 | 502.32 | 1874.30 | 371.00 | 3528.48 |
| 1982 | 17364.28 | 25009.56 | 33693.18 | 28397.72 | 462.21 | 2011.16 | 424.00 | 3630.88 |
| 1987 | 11864.41 | 29884.35 | 28118.43 | 31540.97 | 389.88 | 1823.02 | 385.00 | 3738.93 |
| 1992 | 13522.16 | 32003.93 | 34932.92 | 33965.66 | 410.90 | 1456.31 | 347.00 | 2497.44 |
| 1997 | 14722.84 | 35767.43 | 40300.62 | 41283.16 | 312.19 | 1341.73 | 415.00 | 3193.05 |
| 2002 | 12521.71 | 39097.10 | 36023.11 | 44683.98 | 241.17 | 1270.36 | 611.00 | 4604.21 |
| 2007 | 13206.48 | 42951.65 | 47306.99 | 49357.19 | 277.55 | 1201.64 | 944.00 | 5937.03 |
Then plot the maximum and minimum separately.
GDPbyYear_tall <- ddply(gDat, ~year + continent, summarize, Max = max(gdpPercap),
Min = min(gdpPercap))
print(xyplot(Max ~ year, GDPbyYear_tall, groups = continent, auto.key = TRUE,
type = c("p", "a")), position = c(0, 0, 0.55, 1), more = TRUE)
print(xyplot(Min ~ year, GDPbyYear_tall, groups = continent, auto.key = TRUE,
type = c("p", "a")), position = c(0.45, 0, 1, 1))
We can see that the maximum in Asia fell greatly from extremely high to the normal. And the minimal in Europe increased from normal to extremely high.
Let's look at 2007 for example. First let's just use mean to plot the line between the dots in different continents
Dat_2007 <- subset(gDat, year %in% 2007)
stripplot(gdpPercap ~ continent, Dat_2007, jitter.data = TRUE, grid = "h", type = c("p",
"a"))
Then how about we change to use the median to plot the line.
Dat_2007 <- subset(gDat, year %in% 2007)
stripplot(gdpPercap ~ continent, Dat_2007, jitter.data = TRUE, grid = "h", type = c("p",
"a"), fun = median)
Look at the difference between the two figures, Africa, Americas and Europe does not change a lot, only Asia does, this could mean that the gap between poor countries and rich countries are really huge, there are so many poor countries and some extremely rich countries like Singapore, Japan.
Define the country in a particular year whose life expectancy is below 60 as a “low life expectancy country”. Then let's find the percentage of those countries in each continents within the year.
PercentLowLifeExpCountries <- daply(gDat, ~year + continent, summarize, Percentage = length(which(lifeExp <
60))/length(unique(country)))
htmlPrint(PercentLowLifeExpCountries, include.rownames = TRUE)
| Africa | Americas | Asia | Europe | |
|---|---|---|---|---|
| 1952 | 1.00 | 0.76 | 0.88 | 0.23 |
| 1957 | 1.00 | 0.60 | 0.82 | 0.10 |
| 1962 | 0.98 | 0.52 | 0.76 | 0.03 |
| 1967 | 0.96 | 0.44 | 0.76 | 0.03 |
| 1972 | 0.96 | 0.40 | 0.58 | 0.03 |
| 1977 | 0.96 | 0.28 | 0.42 | 0.03 |
| 1982 | 0.85 | 0.20 | 0.36 | 0.00 |
| 1987 | 0.77 | 0.08 | 0.24 | 0.00 |
| 1992 | 0.75 | 0.08 | 0.21 | 0.00 |
| 1997 | 0.75 | 0.04 | 0.18 | 0.00 |
| 2002 | 0.79 | 0.04 | 0.12 | 0.00 |
| 2007 | 0.77 | 0.00 | 0.09 | 0.00 |
Then let's see in a plot.
LowLifeExpCountries_tall <- ddply(gDat, ~year + continent, summarize, Percentage = length(which(lifeExp <
60))/length(unique(country)), nCountries = length(which(lifeExp < 60)))
xyplot(Percentage ~ year, LowLifeExpCountries_tall, groups = continent, auto.key = TRUE,
type = c("p", "a"))
Let's try to show the results in a box plot.
bwplot(lifeExp ~ as.factor(year) | continent, gDat)
First, get the intercept and coefficients of all country(use function written by Professor Bryan Jennifer in Data aggregation )
yearMin <- min(gDat$year)
jFun <- function(x) {
estCoefs <- coef(lm(lifeExp ~ I(year - yearMin), x))
names(estCoefs) <- c("intercept", "slope")
return(estCoefs)
}
Coefs <- ddply(gDat, ~country, jFun)
Then get the top 10 countries that have lowest and those countries with highest life expectancy in 1952
htmlPrint(head(arrange(Coefs, intercept), 10))
| country | intercept | slope |
|---|---|---|
| Gambia | 28.40 | 0.58 |
| Afghanistan | 29.91 | 0.28 |
| Yemen, Rep. | 30.13 | 0.61 |
| Sierra Leone | 30.88 | 0.21 |
| Guinea | 31.56 | 0.42 |
| Guinea-Bissau | 31.74 | 0.27 |
| Angola | 32.13 | 0.21 |
| Mali | 33.05 | 0.38 |
| Mozambique | 34.21 | 0.22 |
| Equatorial Guinea | 34.43 | 0.31 |
htmlPrint(tail(arrange(Coefs, intercept), 10))
| country | intercept | slope |
|---|---|---|
| Belgium | 67.89 | 0.21 |
| United States | 68.41 | 0.18 |
| United Kingdom | 68.81 | 0.19 |
| Canada | 68.88 | 0.22 |
| Switzerland | 69.45 | 0.22 |
| Denmark | 71.03 | 0.12 |
| Sweden | 71.61 | 0.17 |
| Netherlands | 71.89 | 0.14 |
| Iceland | 71.96 | 0.17 |
| Norway | 72.21 | 0.13 |
And also those countries exhibit extremely rapid or slow life expectancy gains.
htmlPrint(head(arrange(Coefs, slope), 10))
| country | intercept | slope |
|---|---|---|
| Zimbabwe | 55.22 | -0.09 |
| Zambia | 47.66 | -0.06 |
| Rwanda | 42.74 | -0.05 |
| Botswana | 52.93 | 0.06 |
| Congo, Dem. Rep. | 41.96 | 0.09 |
| Swaziland | 46.39 | 0.10 |
| Lesotho | 47.38 | 0.10 |
| Liberia | 39.84 | 0.10 |
| Denmark | 71.03 | 0.12 |
| Uganda | 44.28 | 0.12 |
htmlPrint(tail(arrange(Coefs, slope), 10))
| country | intercept | slope |
|---|---|---|
| Jordan | 44.06 | 0.57 |
| Gambia | 28.40 | 0.58 |
| Tunisia | 44.56 | 0.59 |
| West Bank and Gaza | 43.80 | 0.60 |
| Yemen, Rep. | 30.13 | 0.61 |
| Libya | 42.10 | 0.63 |
| Indonesia | 36.88 | 0.63 |
| Saudi Arabia | 40.81 | 0.65 |
| Vietnam | 39.01 | 0.67 |
| Oman | 37.21 | 0.77 |
Then get those countries that have have lower life expectancy in 1952 and also gains slow with time.
poorcountries <- Coefs[(Coefs$slope < 0.3) & (Coefs$intercept < 35), ]
htmlPrint(poorcountries)
| country | intercept | slope |
|---|---|---|
| Afghanistan | 29.91 | 0.28 |
| Angola | 32.13 | 0.21 |
| Guinea-Bissau | 31.74 | 0.27 |
| Mozambique | 34.21 | 0.22 |
| Sierra Leone | 30.88 | 0.21 |
| Somalia | 34.68 | 0.23 |
And compared with thus countries that have lower life expectancy in 1952 but gains fast
increcountries <- Coefs[(Coefs$slope > 0.5) & (Coefs$intercept < 40), ]
htmlPrint(increcountries)
| country | intercept | slope |
|---|---|---|
| Gambia | 28.40 | 0.58 |
| India | 39.27 | 0.51 |
| Indonesia | 36.88 | 0.63 |
| Nepal | 34.43 | 0.53 |
| Oman | 37.21 | 0.77 |
| Senegal | 36.75 | 0.50 |
| Vietnam | 39.01 | 0.67 |
| Yemen, Rep. | 30.13 | 0.61 |
Take an example, compare Angola and Gambia.
comparecoun <- droplevels(subset(gDat, subset = country %in% c("Angola", "Gambia")))
levels(comparecoun$country)
[1] “Angola” “Gambia”
xyplot(lifeExp ~ year, comparecoun, groups = country, type = c("p", "a"), auto.key = TRUE)
We can see that the change between the two country happened near 1972. I want to know what happened near 1972 that had made this change to Angola.
I don't know how to get those countries directly from gDat, I tried poorcountries<- gDat[(Coefs$slope<0.3)&(Coefs$intercept<40),], but it does not work.