Mahdiar Khosravi

STAT-545A hw#4
Sept.30.2013

Reference Source

I was trying to see how someone who is more experienced than me in R and RStudio writes their code. But, as I do not know many people in class I tried to pick the source code from someone whom I've heard about from Jenny in the class and I remembered Dean Attali.
Reference code and report:

Comments on Dean's code

The code and report are very well written and narrated. This makes it comprehendible for a new R user like me. These are more wonderings than comments!
1. Is there any difference between how you defined firstLastYears and this:

firstLastYears2 = subset(gDat, year == c(min(year),max(year)))

(When I check them with identical, it says they are.)
2. Are these exactly the same? (Again identical says they are “identical”.):

Data1 <- ddply(gDat, .(continent,year), summarize, mean0 = mean(lifeExp))
Data2 <- ddply(gDat, ~ continent + year, summarize, mean0 = mean(lifeExp))

.3. I can't understand what the argument year == x$year[1] in the line 98 of this code tells.

Selecting Tasks

After I started to work on the basis of Dean's previous work, I realized that, maybe as a result of Jenny's suggestion about flexibility, Dean has carried out tasks that are somewhat different comparing to what was said in corresponding webpage. But anyway I decided to continue since the purpose of this assignment was introduced to be generating “Companion Graphics”.
I choose these tasks to produce these graphics, based on Dean's code:

Data initialization

# load required libraries
library(plyr)
library(xtable)
library(lattice)
# import the data
gDat <- read.delim("gapminderDataFiveYear.txt")
# sanity check that import was successful
str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

First of all we omit Oceania from the source data:

iDat <- droplevels(subset(gDat, continent != "Oceania"))
levels(iDat$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"

Life expectancy change over time on different continents

We use trimmed mean of the life expectancy as the definition of the “typical” life expectancy. In this table, Dean had compared the typical life expectancy in each continent per year. I have omitted the other statistics that have been calculated here. (I tried to keep mean and trimmed mean and use multi-panel conditioning by group(), but I did not succeed to use two functions at the same time for stripplot())

# compute the means and arrange the data by the highest percent difference
trimFactor = 0.15
lifeExpMeans <- ddply(iDat, .(continent,year), summarize,
                Dumm1 = mean(lifeExp, trim = 0.15))
lifeExpMeans <- arrange(lifeExpMeans,Dumm1)
names(lifeExpMeans)[3] <- paste0("Tmean",trimFactor*100, "%")
lifeExpMeans <- xtable(lifeExpMeans)
print(lifeExpMeans, type = "html", include.rownames = FALSE)
continent year Tmean15%
Africa 1952 38.95
Africa 1957 40.95
Africa 1962 43.04
Africa 1967 45.11
Asia 1952 45.79
Africa 1972 47.22
Asia 1957 49.00
Africa 1977 49.30
Africa 1982 51.26
Asia 1962 51.29
Africa 2002 52.04
Africa 1997 52.85
Africa 1987 52.92
Americas 1952 53.10
Africa 1992 53.66
Africa 2007 53.75
Asia 1967 54.71
Americas 1957 56.02
Asia 1972 57.79
Americas 1962 58.68
Asia 1977 60.55
Americas 1967 60.79
Americas 1972 62.89
Asia 1982 63.15
Americas 1977 64.88
Europe 1952 65.11
Asia 1987 65.45
Americas 1982 66.82
Asia 1992 67.09
Europe 1957 67.34
Asia 1997 68.54
Americas 1987 68.63
Europe 1962 69.10
Asia 2002 69.88
Americas 1992 70.05
Europe 1967 70.20
Europe 1972 71.14
Asia 2007 71.33
Americas 1997 71.63
Europe 1977 72.18
Americas 2002 72.85
Europe 1982 73.09
Europe 1987 73.97
Americas 2007 73.99
Europe 1992 74.77
Europe 1997 75.85
Europe 2002 76.94
Europe 2007 77.89

To present companion figures, first step can be examining typical life expectancy for different years, regardless of the continents and the countries. This is shown in Figure.1. The same plot with data associated with each continent separately, is shown in Figure.2.

Fig1 <- stripplot(lifeExp ~ factor(year) , iDat, jitter.data=TRUE, fun = function(x){
  mean(x, trim= trimFactor)
  }, type= c("p","a"), sub= "Figure.1", main= "Global")

Fig2 <- stripplot(lifeExp ~ factor(year) , iDat, group= continent, jitter.data=TRUE, fun = function(x){
  mean(x, trim= trimFactor)
  }, type= c("p","a"), sub= "Figure.2", auto.key = list(columns = nlevels(iDat$continent)))
print(Fig1,position= c(0,0,0.5,1), more=TRUE)
print(Fig2,position= c(0.5,0,1,1))

plot of chunk unnamed-chunk-7

Here we trimmed 15% of the data to evaluate the trimmed mean.
As we can see from Figure.1, typical life expectancy has increased almost constantly with time. However, when we look at the trend for different continents, we realize that there has been a consistent order among different continents throughout this time span. These results show that Americas and Asia have had a more consistent behavior and decreased their difference with Europe. Africa, on the other hand, shows a relative minimum at 2002 and is getting farther from the other continents.

World population in each of the continents at different years

Here we look at the total population of each continent in every year, and compare that to the world's total population. The data is arranged by year, where in each year group the continents are arranged from most populous to least.

worldRelativePop <- ddply(iDat, .(continent, year), function(x) {
  x <- as.list(x)
  x["continentPop"] <- sum(x$pop)
  x["worldPop"] <- sum(subset(iDat, year == x$year[1])[["pop"]])
# I can't understand what the argument:  year == x$year[1]) is.   
  x["percent"] <- round(as.numeric(x["continentPop"])/as.numeric(x["worldPop"]) * 
                              100, 2)
  as.data.frame(x[c("continentPop", "worldPop", "percent")])
}
)
worldRelativePop <- arrange(worldRelativePop, year, desc(percent))
worldRelativePop <- xtable(worldRelativePop)
print(worldRelativePop, type = "html", include.rownames = FALSE)
continent year continentPop worldPop percent
Asia 1952 1395357352.00 2396271145.00 58.23
Europe 1952 418120846.00 2396271145.00 17.45
Americas 1952 345152446.00 2396271145.00 14.40
Africa 1952 237640501.00 2396271145.00 9.92
Asia 1957 1562780599.00 2652462604.00 58.92
Europe 1957 437890351.00 2652462604.00 16.51
Americas 1957 386953916.00 2652462604.00 14.59
Africa 1957 264837738.00 2652462604.00 9.98
Asia 1962 1696357182.00 2886499456.00 58.77
Europe 1962 460355155.00 2886499456.00 15.95
Americas 1962 433270254.00 2886499456.00 15.01
Africa 1962 296516865.00 2886499456.00 10.27
Asia 1967 1905662900.00 3202877970.00 59.50
Europe 1967 481178958.00 3202877970.00 15.02
Americas 1967 480746623.00 3202877970.00 15.01
Africa 1967 335289489.00 3202877970.00 10.47
Asia 1972 2150972248.00 3560871058.00 60.41
Americas 1972 529384210.00 3560871058.00 14.87
Europe 1972 500635059.00 3560871058.00 14.06
Africa 1972 379879541.00 3560871058.00 10.67
Asia 1977 2384513556.00 3912806807.00 60.94
Americas 1977 578067699.00 3912806807.00 14.77
Europe 1977 517164531.00 3912806807.00 13.22
Africa 1977 433061021.00 3912806807.00 11.07
Asia 1982 2610135582.00 4271041990.00 61.11
Americas 1982 630290920.00 4271041990.00 14.76
Europe 1982 531266901.00 4271041990.00 12.44
Africa 1982 499348587.00 4271041990.00 11.69
Asia 1987 2871220762.00 4671903003.00 61.46
Americas 1987 682753971.00 4671903003.00 14.61
Africa 1987 574834110.00 4671903003.00 12.30
Europe 1987 543094160.00 4671903003.00 11.62
Asia 1992 3133292191.00 5089790609.00 61.56
Americas 1992 739274104.00 5089790609.00 14.52
Africa 1992 659081517.00 5089790609.00 12.95
Europe 1992 558142797.00 5089790609.00 10.97
Asia 1997 3383285500.00 5492963042.00 61.59
Americas 1997 796900410.00 5492963042.00 14.51
Africa 1997 743832984.00 5492963042.00 13.54
Europe 1997 568944148.00 5492963042.00 10.36
Asia 2002 3601802203.00 5863522750.00 61.43
Americas 2002 849772762.00 5863522750.00 14.49
Africa 2002 833723916.00 5863522750.00 14.22
Europe 2002 578223869.00 5863522750.00 9.86
Asia 2007 3811953827.00 6226463232.00 61.22
Africa 2007 929539692.00 6226463232.00 14.93
Americas 2007 898871184.00 6226463232.00 14.44
Europe 2007 586098529.00 6226463232.00 9.41

From this table it was interpreted that Asia is consistently by far the most populated continent, always making up ~60% of the world population. We depict the same phenomenon using bar charts in Figure.3:

barchart(continentPop ~ reorder(continent,continentPop)|factor(year),worldRelativePop, ylab= "Population", sub="Figure.3")

plot of chunk unnamed-chunk-9

Now we can see that it is much easier to show the same result this way.

Also, as an “interesting observation”, it was pointed out how Europe, America, and Africa changed spots over time. In the 1950's, Europe was the most populated, followed by America and Africa. As the years go by, America's relative population remains fairly constant, Europe's relative population decreases, and Africa's increases. This trend consistently continues throughout the years without exception, until at the last data point in 2007 the rankings of the three continents is completely flipped from the beginning - Africa followed by America followed by Europe. An excellent “companion graphic” for this observation can be the following figure. In Figure.4 we have suppressed the data associated with Asia so as to make the comparison among others more tangible.

xyplot(continentPop ~ reorder(continent,continentPop,fun=min), subset(worldRelativePop,
       continent !="Asia"), group = factor(year), type= c("p","a"), grid="h", ylab= "Population", 
       xlab= "Continents", auto.key=list(space= "right", title= "Years"),sub="Figure.4")

plot of chunk unnamed-chunk-10

Countries with extremely low or high life expectancy in 1952

Using the vector of the interesting countries due to having highest slopes of increasing life expectancy in 1952, we first present a table illustrating slopes and intercepts of GDP per capita and life expectancy for this countries in the year 1952:

intrestingCountries <- c("Gambia","Tunisia","Libya","Saudi Arabia",
                         "Vietnam","Oman","Guatemala","Honduras","Nicaragua",
                         "Portugal","Bosnia and Herzegovina","Turkey")

inDat <- subset(iDat, country %in% intrestingCountries)
inDaty <- subset(inDat, year == c("1952","1957"))

yearMin <- min(iDat$year)
jFun <- function(x) {
  jFitLifeExp <- lm(lifeExp ~ I(year - yearMin), x)
  jCoefLifeExp <- coef(jFitLifeExp)
  jFitGDP <- lm(gdpPercap ~ I(year - yearMin), x)
  jCoefGDP <- coef(jFitGDP)
  names(jCoefLifeExp) <- NULL
  names(jCoefGDP) <- NULL
  return(c(interceptLifeExp = jCoefLifeExp[1],
           slopeLifeExp = jCoefLifeExp[2],
           interceptGDP = jCoefGDP[1],
           slopeGDP = jCoefGDP[2]))
  }
intretCoef <- ddply(inDaty, ~ country + continent, jFun)
intretCoef <- arrange(intretCoef,slopeLifeExp)
intretCoef <- xtable(intretCoef)
print(intretCoef, type="html", include.rownames= FALSE)
country continent interceptLifeExp slopeLifeExp interceptGDP slopeGDP
Portugal Europe 59.82 0.34 3068.32 141.25
Gambia Africa 30.00 0.41 485.23 7.14
Guatemala Americas 42.02 0.42 2428.24 37.78
Vietnam Asia 40.41 0.50 605.07 14.24
Tunisia Africa 44.60 0.50 1468.48 -14.65
Oman Asia 37.58 0.50 1828.23 82.90
Libya Africa 42.72 0.51 2387.55 212.15
Honduras Americas 41.91 0.55 2194.93 5.11
Saudi Arabia Asia 39.88 0.60 6459.55 339.61
Nicaragua Americas 42.31 0.62 3112.36 69.01
Turkey Europe 43.58 0.90 1969.10 49.93
Bosnia and Herzegovina Europe 53.82 0.93 973.53 76.09

As we can see from this table, slopes and intercepts related to GDP per capita variable are very different. Plotting the data for these two variables reveals some other interesting facts:

# I needed to do something with inDat's levels for the xyplot legends,  
# This one partially works but messes inDat: 
# levels(inDat$country) <- inDat$country
xyplot(lifeExp ~ factor(year)|continent, inDat, group= country, type= c("p","a"), sub ="Figure.5")

plot of chunk unnamed-chunk-12

xyplot(gdpPercap ~ factor(year)|continent, inDat, group= country, type= c("p","a"), sub ="Figure.6")

plot of chunk unnamed-chunk-12

I had a problem with levels of data and I could not show the legends, data has kept the legend of iDat (gDat - Oceania)
The behavior of GDP per capita for Saudi Arabia and Tunisia is interesting. The results show a relative maximum for them.