STAT 545A Homework#6

Yiming Zhang

First, loading the Gapminder data and needed packages.

gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file=gdURL)
library(lattice)
library(plyr)
library(xtable)
library(ggplot2)
library(xtable)
## define a function for converting and printing to HTML table
htmlPrint <- function(x, ...,
                      digits = 4, include.rownames = FALSE) {
  print(xtable(x, digits = digits, ...), type = 'html',
        include.rownames = include.rownames, ...)
  }

Then have a quick check of the data.

str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

We can see that the data we have loaded is a data.frame, and there are 1704 observations and 6 variables. Notice that there are only 5 levels in variable “continent”, just try to check how many countries that each continent have.

countcountry <- ddply(gDat,~continent,summarize,count_country=length(unique(country)))
htmlPrint(countcountry)
continent count_country
Africa 52
Americas 25
Asia 33
Europe 30
Oceania 2

Then we notice that there are only 2 countries are included in Oceania, this could bring bad effects to our following analysis, so just drop the Oceania.

gDat <- droplevels(subset(gDat,continent !="Oceania"))

Have a quick check

str(gDat)
## 'data.frame':    1680 obs. of  6 variables:
##  $ country  : Factor w/ 140 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 4 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

We can see that the Oceania has been successfully dropped from the original data. Let's check out the general picture of life expectancy in these four continents without considering the year.

stripplot(lifeExp ~ continent, gDat, jitter.data = TRUE, grid = "h", type = c("p", 
    "a"), main = paste("Life expectancy in different continents"), fun = median)

plot of chunk unnamed-chunk-8

Here notice that I use median point to connect the four continents, since median is a more robust measure than mean.

In this picture, we can have the general idea that the life expectancy in Europe are higher than the other continents. Then let's check out how the life expectancy varies by time.

stripplot(lifeExp ~ factor(year) | reorder(continent, lifeExp), gDat,
          jitter.data = TRUE,
          type = c("p", "a"), fun = median, alpha = 0.4, grid = "h", 
          ## use alpha to control the transparency 
          main = paste("Life expectancy varies in time in four continents"),
          scales = list(x = list(rot = c(45, 0))))

plot of chunk unnamed-chunk-9

To have a clear view of the trend for the four continents, plot another picture that

LifeExpchgebyCont_tall <- ddply(gDat, ~year + continent, 
                                summarize, MedianLifeExp = median(lifeExp))
xyplot(MedianLifeExp ~ year, LifeExpchgebyCont_tall, groups = continent, 
       main = paste("Life expectancy by median in four continents"),
       auto.key = TRUE, type = c("p", "a"))

plot of chunk unnamed-chunk-10

From the plot we can see that the life expectancy in all the four continents are increasing, but they have different patterns. For Europe, it starts at high level but increase slowly ; for Americas, it starts little lower than Europe but it increase to the same level with Europe in recent years; for Africa, it starts at lowest level and also increase slowly; for Asia, it is the most interesting part, it start at very low level, near the Africa, but it increase very fast and finally get the same level as Americas and Europe. Also we can notice that the plots in Europe are concentrated while the plots in Asia are sparse.

GDPbyYear_tall <- ddply(gDat, ~year + continent, summarize, Max = max(gdpPercap), 
    Min = min(gdpPercap))
print(xyplot(Max ~ year, GDPbyYear_tall, groups = continent, auto.key = TRUE, 
    type = c("p", "a")), position = c(0, 0, 0.55, 1), more = TRUE)
print(xyplot(Min ~ year, GDPbyYear_tall, groups = continent, auto.key = TRUE, 
    type = c("p", "a")), position = c(0.45, 0, 1, 1))

plot of chunk unnamed-chunk-11

To have a more clear view of the distribution of the four continents on gdp per captial and life expectancy for one year (let's choose 2007), plot the density plot.

select_year <- "2007"
gDat_select_year <- subset(gDat,subset=year==select_year)
ggplot(gDat_select_year,aes(x=gdpPercap,colour=continent))+geom_density()

plot of chunk unnamed-chunk-12

ggplot(gDat_select_year,aes(x=lifeExp,colour=continent))+geom_density()

plot of chunk unnamed-chunk-12

To compare different countries simulatanously with GDP per captia and lifeExp and also with population, draw a plot that with gdpPercap as x asis, lifeExp as y asis and square root of population as size of each plot.

here I use square root of the population rather than the orginal population to reduce the huge gap difference between each country to make the plot size easy to compare.

ggplot(subset(gDat, year == select_year), aes(x = gdpPercap, y = lifeExp, colour = continent, 
    size = sqrt(pop))) + geom_point() + scale_x_log10()

plot of chunk unnamed-chunk-13

ggplot(gDat, aes(x = lifeExp, color = continent)) + geom_density()

plot of chunk unnamed-chunk-13

reorder and arrange the gapminder data

##first reorder the continent factor based on their mean of life expectancy
gDat <- within(gDat, continent <- reorder(continent, lifeExp,mean))
##then reorder the country factor in each continent based on their mean of life expectancy
gDat <- within(gDat, country <- reorder(country, lifeExp,mean))
##arrange and reorder the data itself by continent, then country and finally year
gDat <- arrange(gDat, continent, country, year)

Then print the cleaned data to file.

write.table(gDat, "gDat_cl.tsv", quote = FALSE,
            sep = "\t", row.names = FALSE)

There I finished the main task in Rscript01, then I will do the analysis in Rscript02 without importing the data again.

Here I use a function to get the intercept, slope and standard deviation of linear regression of life expectancy on year within each country.

jFun <- function(x) {
  jFit <- lm(lifeExp ~ I(year - min(gDat$year)), x)
  est <- c(coef(jFit),sd(jFit$residuals))
  names(est) <- c("intercept", "slope","sd" )
  return(est)
}

Then use the function above get the intercept and slope within each country

ests<- ddply(gDat,~country,jFun)
htmlPrint(ests)
country intercept slope sd
Sierra Leone 30.8832 0.2140 0.7861
Afghanistan 29.9073 0.2753 1.1659
Angola 32.1267 0.2093 1.3415
Guinea-Bissau 31.7370 0.2718 0.6135
Mozambique 34.2062 0.2245 2.1851
Somalia 34.6754 0.2296 1.7764
Rwanda 42.7419 -0.0458 6.2531
Liberia 39.8364 0.0960 1.6903
Equatorial Guinea 34.4303 0.3102 0.3134
Guinea 31.5570 0.4248 1.1402
Malawi 36.9104 0.2342 1.8432
Mali 33.0512 0.3768 0.4592
Nigeria 37.8595 0.2081 1.4493
Central African Republic 38.8095 0.1839 3.3605
Gambia 28.4004 0.5818 1.0942
Ethiopia 36.0282 0.3072 0.9987
Congo, Dem. Rep. 41.9611 0.0939 2.3164
Niger 35.1507 0.3421 2.0821
Burkina Faso 34.6847 0.3640 1.9518
Burundi 40.5786 0.1541 1.5358
Zambia 47.6580 -0.0604 4.3180
Eritrea 35.6953 0.3747 1.4271
Djibouti 36.2772 0.3674 1.0742
Chad 39.8094 0.2532 1.7462
Yemen, Rep. 30.1303 0.6055 1.5120
Uganda 44.2752 0.1216 3.0393
Madagascar 36.6681 0.4037 0.5336
Cambodia 37.0154 0.3959 5.3681
Tanzania 43.1084 0.1747 1.7492
Cameroon 41.2495 0.2501 3.0923
Sudan 37.8742 0.3828 0.6141
Cote d'Ivoire 44.8459 0.1306 3.7429
Benin 39.5885 0.3342 1.1200
Nepal 34.4316 0.5293 0.8812
Swaziland 46.3879 0.0951 6.3349
Bangladesh 36.1355 0.4981 0.9312
Lesotho 47.3790 0.0956 5.6578
Haiti 39.2462 0.3971 0.8016
Senegal 36.7467 0.5047 0.8890
Gabon 38.9353 0.4467 3.8654
Togo 40.9775 0.3826 2.2242
Mauritania 40.0256 0.4464 0.3886
Ghana 43.4927 0.3217 0.7373
Comoros 39.9960 0.4504 0.4564
Congo, Rep. 47.1368 0.1951 3.3815
Bolivia 38.7564 0.4999 1.1293
Zimbabwe 55.2212 -0.0930 6.8701
Kenya 47.0020 0.2065 4.1782
India 39.2698 0.5053 1.6446
Myanmar 41.4116 0.4331 2.8917
Namibia 47.1343 0.2312 4.7299
South Africa 49.3413 0.1692 4.5236
Indonesia 36.8831 0.6346 0.6155
Botswana 52.9291 0.0607 5.8277
Pakistan 43.7230 0.4058 0.3842
Mongolia 43.8264 0.4387 0.8965
Egypt 40.9680 0.5555 0.9908
Iraq 50.1135 0.2352 3.8683
Guatemala 42.1194 0.5313 0.5541
Vietnam 39.0101 0.6716 1.2525
Morocco 42.6908 0.5425 0.7217
Sao Tome and Principe 48.5276 0.3407 1.3292
Honduras 42.9924 0.5429 1.4915
Nicaragua 43.0451 0.5565 0.5706
Oman 37.2077 0.7722 2.2385
Iran 44.9790 0.4966 0.6337
Saudi Arabia 40.8141 0.6496 1.9846
Peru 44.3476 0.5277 1.0273
Algeria 43.3750 0.5693 1.2614
Libya 42.1019 0.6255 1.4682
El Salvador 46.5119 0.4771 1.8526
Turkey 46.0223 0.4972 1.0937
Jordan 44.0639 0.5717 1.8204
West Bank and Gaza 43.7984 0.6011 1.8899
Tunisia 44.5553 0.5878 1.4865
Philippines 49.4043 0.4205 0.7051
Syria 46.1013 0.5544 1.2677
Dominican Republic 48.5978 0.4712 1.4781
China 47.1905 0.5307 3.6775
Thailand 52.6564 0.3470 1.1488
Brazil 51.5120 0.3901 0.3111
Ecuador 49.0654 0.5001 0.6663
Korea, Dem. Rep. 54.9056 0.3164 3.7072
Colombia 53.4271 0.3808 1.2506
Malaysia 51.5052 0.4645 1.9908
Mauritius 55.3708 0.3485 1.6592
Korea, Rep. 49.7275 0.5554 1.1196
Mexico 53.0054 0.4510 0.9964
Bahrain 52.7492 0.4675 1.5633
Lebanon 58.6874 0.2610 1.1706
Sri Lanka 59.7915 0.2449 1.0370
Venezuela 57.5133 0.3297 1.4128
Reunion 53.9975 0.4599 1.5537
Paraguay 62.4818 0.1574 0.3732
Trinidad and Tobago 62.0523 0.1737 1.5751
Chile 54.3177 0.4768 1.1373
Bosnia and Herzegovina 58.0896 0.3498 2.1516
Panama 58.0610 0.3542 1.4464
Romania 63.9621 0.1574 1.3941
Albania 59.2291 0.3347 1.8908
Serbia 61.5343 0.2552 1.7082
Jamaica 62.6610 0.2214 1.9603
Kuwait 57.4593 0.4168 1.6808
Argentina 62.6884 0.2317 0.2787
Hungary 65.9928 0.1236 1.1319
Bulgaria 65.7373 0.1457 2.3923
Croatia 63.8558 0.2255 1.0942
Poland 64.7809 0.1962 1.5458
Costa Rica 59.1047 0.4028 1.4482
Montenegro 62.2416 0.2930 2.6257
Taiwan 61.3374 0.3272 1.2494
Portugal 61.1468 0.3372 1.0867
Slovak Republic 67.0099 0.1340 1.2393
Uruguay 65.7416 0.1833 0.5088
Cuba 62.2134 0.3212 1.6596
Singapore 61.8459 0.3409 0.6788
Czech Republic 67.5281 0.1448 0.7871
Slovenia 66.0863 0.2005 0.6778
Puerto Rico 66.9485 0.2106 1.2097
Finland 66.4490 0.2379 0.3377
Ireland 67.5415 0.1991 0.4556
Austria 66.4485 0.2420 0.3884
Germany 67.5681 0.2137 0.3967
United States 68.4138 0.1842 0.3968
Hong Kong, China 63.4286 0.3660 1.1136
Belgium 67.8919 0.2091 0.2793
Israel 66.3004 0.2671 0.3487
Greece 67.0672 0.2424 0.7422
United Kingdom 68.8085 0.1860 0.4215
Italy 66.5968 0.2697 0.3973
Spain 66.4778 0.2809 0.9660
France 67.7901 0.2385 0.2098
Denmark 71.0336 0.1213 0.3802
Japan 65.1221 0.3529 1.3055
Canada 68.8838 0.2189 0.2376
Switzerland 69.4537 0.2222 0.2049
Netherlands 71.8896 0.1367 0.3316
Norway 72.2146 0.1319 0.4669
Sweden 71.6050 0.1663 0.2019
Iceland 71.9636 0.1654 0.5214

Print it to file

write.table(ests, "ests.tsv", quote = FALSE,
            sep = "\t", row.names = FALSE)

Merge the estimated data with the original data

jDat <- merge(gDat,ests)

Use function choose the best 3 countries as the maximum slope(The Most Improved Awards), the maximum intercept(Best country in the past), , and the minimum standard deviation(The Most Stable Improvement Awards)

BestFun <- function(x){
  max_slope <- which.max(x$slope)
  max_intercept <- which.max(x$intercept)
  min_sd <- which.min(x$sd)
x[c(max_slope,max_intercept,min_sd),]
}

Similariary,use function choose the worest 3 countries as the minimum slope, the minimum intercept, and the maximum standard deviation.

WorestFun <- function(x){
  min_slope <- which.min(x$slope)
  min_intercept <- which.min(x$intercept)
  max_sd <- which.max(x$sd)
  x[c(min_slope,min_intercept,max_sd),]
}

Get the best and worest countries in Afica and merge into one data frame

Best_Africa <- BestFun(subset(jDat,subset=continent=="Africa"))
htmlPrint(Best_Africa)
country year pop continent lifeExp gdpPercap intercept slope sd
Libya 1952 1019729.0000 Africa 42.7230 2387.5481 42.1019 0.6255 1.4682
Mauritius 1952 516556.0000 Africa 50.9860 1967.9557 55.3708 0.3485 1.6592
Equatorial Guinea 1952 216964.0000 Africa 34.4820 375.6431 34.4303 0.3102 0.3134
Worest_Africa <- WorestFun(subset(jDat,subset=continent=="Africa"))
htmlPrint(Worest_Africa)
country year pop continent lifeExp gdpPercap intercept slope sd
Zimbabwe 1952 3080907.0000 Africa 48.4510 406.8841 55.2212 -0.0930 6.8701
Gambia 1952 284320.0000 Africa 30.0000 485.2307 28.4004 0.5818 1.0942
Zimbabwe 1952 3080907.0000 Africa 48.4510 406.8841 55.2212 -0.0930 6.8701
merge_Africa<- merge(Best_Africa,Worest_Africa,all=TRUE)
BestandWorest_Afria<-subset(gDat,subset=country %in% merge_Africa$country)

Get the best and worest countries in Asia and merge into one data frame

Best_Asia <- BestFun(subset(jDat,subset=continent=="Asia"))
htmlPrint(Best_Asia)
country year pop continent lifeExp gdpPercap intercept slope sd
Oman 1952 507833.0000 Asia 37.5780 1828.2303 37.2077 0.7722 2.2385
Israel 1952 1620914.0000 Asia 65.3900 4086.5221 66.3004 0.2671 0.3487
Israel 1952 1620914.0000 Asia 65.3900 4086.5221 66.3004 0.2671 0.3487
Worest_Asia <- WorestFun(subset(jDat,subset=continent=="Asia"))
htmlPrint(Worest_Asia)
country year pop continent lifeExp gdpPercap intercept slope sd
Iraq 1952 5441766.0000 Asia 45.3200 4129.7661 50.1135 0.2352 3.8683
Afghanistan 1952 8425333.0000 Asia 28.8010 779.4453 29.9073 0.2753 1.1659
Cambodia 1952 4693836.0000 Asia 39.4170 368.4693 37.0154 0.3959 5.3681
 merge_Asia<- merge(Best_Asia,Worest_Asia,all=TRUE)
BestandWorest_Asia<-subset(gDat,subset=country %in% merge_Asia$country)

Get the best and worest countries in Americas and merge into one data frame

Best_Americas <- BestFun(subset(jDat,subset=continent=="Americas"))
htmlPrint(Best_Americas)
country year pop continent lifeExp gdpPercap intercept slope sd
Nicaragua 1952 1165790.0000 Americas 42.3140 3112.3639 43.0451 0.5565 0.5706
Canada 1952 14785584.0000 Americas 68.7500 11367.1611 68.8838 0.2189 0.2376
Canada 1952 14785584.0000 Americas 68.7500 11367.1611 68.8838 0.2189 0.2376
Worest_Americas <- WorestFun(subset(jDat,subset=continent=="Americas"))
htmlPrint(Worest_Americas)
country year pop continent lifeExp gdpPercap intercept slope sd
Paraguay 1952 1555876.0000 Americas 62.6490 1952.3087 62.4818 0.1574 0.3732
Bolivia 1952 2883315.0000 Americas 40.4140 2677.3263 38.7564 0.4999 1.1293
Jamaica 1952 1426095.0000 Americas 58.5300 2898.5309 62.6610 0.2214 1.9603
merge_Americas<- merge(Best_Americas,Worest_Americas,all=TRUE)
BestandWorest_Americas<-subset(gDat,subset=country %in% merge_Americas$country)

Get the best and worest countries in Europe and merge into one data frame

Best_Europe <- BestFun(subset(jDat,subset=continent=="Europe"))
htmlPrint(Best_Europe)
country year pop continent lifeExp gdpPercap intercept slope sd
Turkey 1952 22235677.0000 Europe 43.5850 1969.1010 46.0223 0.4972 1.0937
Norway 1952 3327728.0000 Europe 72.6700 10095.4217 72.2146 0.1319 0.4669
Sweden 1952 7124673.0000 Europe 71.8600 8527.8447 71.6050 0.1663 0.2019
Worest_Europe <- WorestFun(subset(jDat,subset=continent=="Europe"))
htmlPrint(Best_Europe)
country year pop continent lifeExp gdpPercap intercept slope sd
Turkey 1952 22235677.0000 Europe 43.5850 1969.1010 46.0223 0.4972 1.0937
Norway 1952 3327728.0000 Europe 72.6700 10095.4217 72.2146 0.1319 0.4669
Sweden 1952 7124673.0000 Europe 71.8600 8527.8447 71.6050 0.1663 0.2019
merge_Europe<- merge(Best_Europe,Worest_Europe,all=TRUE)
BestandWorest_Europe<-subset(gDat,subset=country %in% merge_Europe$country)

Put it all togather

all_Best_Worest <- rbind(BestandWorest_Afria,BestandWorest_Asia,
                         BestandWorest_Americas,BestandWorest_Europe)
ests_Best_Worest<- ddply(all_Best_Worest,~country,jFun)

Print it to file

write.table(ests_Best_Worest, "Best_Worest_countries_each_continents_estimation.tsv", 
            quote = FALSE,sep = "\t", row.names = FALSE)

For Africa

ggplot(BestandWorest_Afria,aes(x=year,y=lifeExp,colour=country))+
  geom_point()+geom_line()+stat_smooth(method = "lm")+
  facet_wrap(~country)+ggtitle("The Best And Worest countries in Africa ")

plot of chunk unnamed-chunk-28

dev.print(pdf,"Best_Worest_Countries_Africa.pdf")
## pdf 
##   2

For Asia

ggplot(BestandWorest_Asia,aes(x=year,y=lifeExp,colour=country))+
  geom_point()+geom_line()+stat_smooth(method = "lm")+
  facet_wrap(~country)+ggtitle("The Best And Worest countries in Asia ")

plot of chunk unnamed-chunk-29

dev.print(pdf,"Best_Worest_Countries_Asia.pdf")
## pdf 
##   2

For Americas

ggplot(BestandWorest_Americas,aes(x=year,y=lifeExp,colour=country))+
  geom_point()+geom_line()+stat_smooth(method = "lm")+
  facet_wrap(~country)+ggtitle("The Best And Worest countries in Americas ")

plot of chunk unnamed-chunk-30

dev.print(pdf,"Best_Worest_Countries_Americas.pdf")
## pdf 
##   2

For Europe

ggplot(BestandWorest_Europe,aes(x=year,y=lifeExp,colour=country))+
  geom_point()+geom_line()+stat_smooth(method = "lm")+
  facet_wrap(~country)+ggtitle("The Best And Worest countries in Europe ")

plot of chunk unnamed-chunk-31

dev.print(pdf,"Best_Worest_Countries_Europe.pdf")
## pdf 
##   2