STAT-545A hw#4
Sept.30.2013
I was trying to see how someone who is more experienced than me in R and RStudio writes their code. But, as I do not know many people in class I tried to pick the source code from someone whom I've heard about from Jenny in the class and I remembered Dean Attali.
Reference code and report:
The code and report are very well written and narrated. This makes it comprehendible for a new R user like me.
These are more wonderings than comments!
1. Is there any difference between how you defined firstLastYears and this:
firstLastYears2 = subset(gDat, year == c(min(year),max(year)))
(When I check them with identical, it says they are.)
2. Are these exactly the same? (Again identical says they are “identical”.):
Data1 <- ddply(gDat, .(continent,year), summarize, mean0 = mean(lifeExp))
Data2 <- ddply(gDat, ~ continent + year, summarize, mean0 = mean(lifeExp))
.3. I can't understand what the argument year == x$year[1] in the line 98 of this code tells.
After I started to work on the basis of Dean's previous work, I realized that, maybe as a result of Jenny's suggestion about flexibility, Dean has carried out tasks that are somewhat different comparing to what was said in corresponding webpage. But anyway I decided to continue since the purpose of this assignment was introduced to be generating “Companion Graphics”.
I choose these tasks to produce these graphics, based on Dean's code:
# load required libraries
library(plyr)
library(xtable)
library(lattice)
# import the data
gDat <- read.delim("gapminderDataFiveYear.txt")
# sanity check that import was successful
str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
First of all we omit Oceania from the source data:
iDat <- droplevels(subset(gDat, continent != "Oceania"))
levels(iDat$continent)
## [1] "Africa" "Americas" "Asia" "Europe"
We use trimmed mean of the life expectancy as the definition of the “typical” life expectancy.
In this table, Dean had compared the typical life expectancy in each continent per year. I have omitted the other statistics that have been calculated here.
(I tried to keep mean and trimmed mean and use multi-panel conditioning by group(), but I did not succeed to use two functions at the same time for stripplot())
# compute the means and arrange the data by the highest percent difference
trimFactor = 0.15
lifeExpMeans <- ddply(iDat, .(continent,year), summarize,
Dumm1 = mean(lifeExp, trim = 0.15))
lifeExpMeans <- arrange(lifeExpMeans,Dumm1)
names(lifeExpMeans)[3] <- paste0("Tmean",trimFactor*100, "%")
lifeExpMeans <- xtable(lifeExpMeans)
print(lifeExpMeans, type = "html", include.rownames = FALSE)
| continent | year | Tmean15% |
|---|---|---|
| Africa | 1952 | 38.95 |
| Africa | 1957 | 40.95 |
| Africa | 1962 | 43.04 |
| Africa | 1967 | 45.11 |
| Asia | 1952 | 45.79 |
| Africa | 1972 | 47.22 |
| Asia | 1957 | 49.00 |
| Africa | 1977 | 49.30 |
| Africa | 1982 | 51.26 |
| Asia | 1962 | 51.29 |
| Africa | 2002 | 52.04 |
| Africa | 1997 | 52.85 |
| Africa | 1987 | 52.92 |
| Americas | 1952 | 53.10 |
| Africa | 1992 | 53.66 |
| Africa | 2007 | 53.75 |
| Asia | 1967 | 54.71 |
| Americas | 1957 | 56.02 |
| Asia | 1972 | 57.79 |
| Americas | 1962 | 58.68 |
| Asia | 1977 | 60.55 |
| Americas | 1967 | 60.79 |
| Americas | 1972 | 62.89 |
| Asia | 1982 | 63.15 |
| Americas | 1977 | 64.88 |
| Europe | 1952 | 65.11 |
| Asia | 1987 | 65.45 |
| Americas | 1982 | 66.82 |
| Asia | 1992 | 67.09 |
| Europe | 1957 | 67.34 |
| Asia | 1997 | 68.54 |
| Americas | 1987 | 68.63 |
| Europe | 1962 | 69.10 |
| Asia | 2002 | 69.88 |
| Americas | 1992 | 70.05 |
| Europe | 1967 | 70.20 |
| Europe | 1972 | 71.14 |
| Asia | 2007 | 71.33 |
| Americas | 1997 | 71.63 |
| Europe | 1977 | 72.18 |
| Americas | 2002 | 72.85 |
| Europe | 1982 | 73.09 |
| Europe | 1987 | 73.97 |
| Americas | 2007 | 73.99 |
| Europe | 1992 | 74.77 |
| Europe | 1997 | 75.85 |
| Europe | 2002 | 76.94 |
| Europe | 2007 | 77.89 |
To present companion figures, first step can be examining typical life expectancy for different years, regardless of the continents and the countries. This is shown in Figure.1. The same plot with data associated with each continent separately, is shown in Figure.2.
Fig1 <- stripplot(lifeExp ~ factor(year) , iDat, jitter.data=TRUE, fun = function(x){
mean(x, trim= trimFactor)
}, type= c("p","a"), sub= "Figure.1", main= "Global")
Fig2 <- stripplot(lifeExp ~ factor(year) , iDat, group= continent, jitter.data=TRUE, fun = function(x){
mean(x, trim= trimFactor)
}, type= c("p","a"), sub= "Figure.2", auto.key = list(columns = nlevels(iDat$continent)))
print(Fig1,position= c(0,0,0.5,1), more=TRUE)
print(Fig2,position= c(0.5,0,1,1))
Here we trimmed 15% of the data to evaluate the trimmed mean.
As we can see from Figure.1, typical life expectancy has increased almost constantly with time. However, when we look at the trend for different continents, we realize that there has been a consistent order among different continents throughout this time span. These results show that Americas and Asia have had a more consistent behavior and decreased their difference with Europe. Africa, on the other hand, shows a relative minimum at 2002 and is getting farther from the other continents.
Here we look at the total population of each continent in every year, and compare that to the world's total population. The data is arranged by year, where in each year group the continents are arranged from most populous to least.
worldRelativePop <- ddply(iDat, .(continent, year), function(x) {
x <- as.list(x)
x["continentPop"] <- sum(x$pop)
x["worldPop"] <- sum(subset(iDat, year == x$year[1])[["pop"]])
# I can't understand what the argument: year == x$year[1]) is.
x["percent"] <- round(as.numeric(x["continentPop"])/as.numeric(x["worldPop"]) *
100, 2)
as.data.frame(x[c("continentPop", "worldPop", "percent")])
}
)
worldRelativePop <- arrange(worldRelativePop, year, desc(percent))
worldRelativePop <- xtable(worldRelativePop)
print(worldRelativePop, type = "html", include.rownames = FALSE)
| continent | year | continentPop | worldPop | percent |
|---|---|---|---|---|
| Asia | 1952 | 1395357352.00 | 2396271145.00 | 58.23 |
| Europe | 1952 | 418120846.00 | 2396271145.00 | 17.45 |
| Americas | 1952 | 345152446.00 | 2396271145.00 | 14.40 |
| Africa | 1952 | 237640501.00 | 2396271145.00 | 9.92 |
| Asia | 1957 | 1562780599.00 | 2652462604.00 | 58.92 |
| Europe | 1957 | 437890351.00 | 2652462604.00 | 16.51 |
| Americas | 1957 | 386953916.00 | 2652462604.00 | 14.59 |
| Africa | 1957 | 264837738.00 | 2652462604.00 | 9.98 |
| Asia | 1962 | 1696357182.00 | 2886499456.00 | 58.77 |
| Europe | 1962 | 460355155.00 | 2886499456.00 | 15.95 |
| Americas | 1962 | 433270254.00 | 2886499456.00 | 15.01 |
| Africa | 1962 | 296516865.00 | 2886499456.00 | 10.27 |
| Asia | 1967 | 1905662900.00 | 3202877970.00 | 59.50 |
| Europe | 1967 | 481178958.00 | 3202877970.00 | 15.02 |
| Americas | 1967 | 480746623.00 | 3202877970.00 | 15.01 |
| Africa | 1967 | 335289489.00 | 3202877970.00 | 10.47 |
| Asia | 1972 | 2150972248.00 | 3560871058.00 | 60.41 |
| Americas | 1972 | 529384210.00 | 3560871058.00 | 14.87 |
| Europe | 1972 | 500635059.00 | 3560871058.00 | 14.06 |
| Africa | 1972 | 379879541.00 | 3560871058.00 | 10.67 |
| Asia | 1977 | 2384513556.00 | 3912806807.00 | 60.94 |
| Americas | 1977 | 578067699.00 | 3912806807.00 | 14.77 |
| Europe | 1977 | 517164531.00 | 3912806807.00 | 13.22 |
| Africa | 1977 | 433061021.00 | 3912806807.00 | 11.07 |
| Asia | 1982 | 2610135582.00 | 4271041990.00 | 61.11 |
| Americas | 1982 | 630290920.00 | 4271041990.00 | 14.76 |
| Europe | 1982 | 531266901.00 | 4271041990.00 | 12.44 |
| Africa | 1982 | 499348587.00 | 4271041990.00 | 11.69 |
| Asia | 1987 | 2871220762.00 | 4671903003.00 | 61.46 |
| Americas | 1987 | 682753971.00 | 4671903003.00 | 14.61 |
| Africa | 1987 | 574834110.00 | 4671903003.00 | 12.30 |
| Europe | 1987 | 543094160.00 | 4671903003.00 | 11.62 |
| Asia | 1992 | 3133292191.00 | 5089790609.00 | 61.56 |
| Americas | 1992 | 739274104.00 | 5089790609.00 | 14.52 |
| Africa | 1992 | 659081517.00 | 5089790609.00 | 12.95 |
| Europe | 1992 | 558142797.00 | 5089790609.00 | 10.97 |
| Asia | 1997 | 3383285500.00 | 5492963042.00 | 61.59 |
| Americas | 1997 | 796900410.00 | 5492963042.00 | 14.51 |
| Africa | 1997 | 743832984.00 | 5492963042.00 | 13.54 |
| Europe | 1997 | 568944148.00 | 5492963042.00 | 10.36 |
| Asia | 2002 | 3601802203.00 | 5863522750.00 | 61.43 |
| Americas | 2002 | 849772762.00 | 5863522750.00 | 14.49 |
| Africa | 2002 | 833723916.00 | 5863522750.00 | 14.22 |
| Europe | 2002 | 578223869.00 | 5863522750.00 | 9.86 |
| Asia | 2007 | 3811953827.00 | 6226463232.00 | 61.22 |
| Africa | 2007 | 929539692.00 | 6226463232.00 | 14.93 |
| Americas | 2007 | 898871184.00 | 6226463232.00 | 14.44 |
| Europe | 2007 | 586098529.00 | 6226463232.00 | 9.41 |
From this table it was interpreted that Asia is consistently by far the most populated continent, always making up ~60% of the world population. We depict the same phenomenon using bar charts in Figure.3:
barchart(continentPop ~ reorder(continent,continentPop)|factor(year),worldRelativePop, ylab= "Population", sub="Figure.3")
Now we can see that it is much easier to show the same result this way.
Also, as an “interesting observation”, it was pointed out how Europe, America, and Africa changed spots over time. In the 1950's, Europe was the most populated, followed by America and Africa. As the years go by, America's relative population remains fairly constant, Europe's relative population decreases, and Africa's increases. This trend consistently continues throughout the years without exception, until at the last data point in 2007 the rankings of the three continents is completely flipped from the beginning - Africa followed by America followed by Europe. An excellent “companion graphic” for this observation can be the following figure. In Figure.4 we have suppressed the data associated with Asia so as to make the comparison among others more tangible.
xyplot(continentPop ~ reorder(continent,continentPop,fun=min), subset(worldRelativePop,
continent !="Asia"), group = factor(year), type= c("p","a"), grid="h", ylab= "Population",
xlab= "Continents", auto.key=list(space= "right", title= "Years"),sub="Figure.4")
Using the vector of the interesting countries due to having highest slopes of increasing life expectancy in 1952, we first present a table illustrating slopes and intercepts of GDP per capita and life expectancy for this countries in the year 1952:
intrestingCountries <- c("Gambia","Tunisia","Libya","Saudi Arabia",
"Vietnam","Oman","Guatemala","Honduras","Nicaragua",
"Portugal","Bosnia and Herzegovina","Turkey")
inDat <- subset(iDat, country %in% intrestingCountries)
inDaty <- subset(inDat, year == c("1952","1957"))
yearMin <- min(iDat$year)
jFun <- function(x) {
jFitLifeExp <- lm(lifeExp ~ I(year - yearMin), x)
jCoefLifeExp <- coef(jFitLifeExp)
jFitGDP <- lm(gdpPercap ~ I(year - yearMin), x)
jCoefGDP <- coef(jFitGDP)
names(jCoefLifeExp) <- NULL
names(jCoefGDP) <- NULL
return(c(interceptLifeExp = jCoefLifeExp[1],
slopeLifeExp = jCoefLifeExp[2],
interceptGDP = jCoefGDP[1],
slopeGDP = jCoefGDP[2]))
}
intretCoef <- ddply(inDaty, ~ country + continent, jFun)
intretCoef <- arrange(intretCoef,slopeLifeExp)
intretCoef <- xtable(intretCoef)
print(intretCoef, type="html", include.rownames= FALSE)
| country | continent | interceptLifeExp | slopeLifeExp | interceptGDP | slopeGDP |
|---|---|---|---|---|---|
| Portugal | Europe | 59.82 | 0.34 | 3068.32 | 141.25 |
| Gambia | Africa | 30.00 | 0.41 | 485.23 | 7.14 |
| Guatemala | Americas | 42.02 | 0.42 | 2428.24 | 37.78 |
| Vietnam | Asia | 40.41 | 0.50 | 605.07 | 14.24 |
| Tunisia | Africa | 44.60 | 0.50 | 1468.48 | -14.65 |
| Oman | Asia | 37.58 | 0.50 | 1828.23 | 82.90 |
| Libya | Africa | 42.72 | 0.51 | 2387.55 | 212.15 |
| Honduras | Americas | 41.91 | 0.55 | 2194.93 | 5.11 |
| Saudi Arabia | Asia | 39.88 | 0.60 | 6459.55 | 339.61 |
| Nicaragua | Americas | 42.31 | 0.62 | 3112.36 | 69.01 |
| Turkey | Europe | 43.58 | 0.90 | 1969.10 | 49.93 |
| Bosnia and Herzegovina | Europe | 53.82 | 0.93 | 973.53 | 76.09 |
As we can see from this table, slopes and intercepts related to GDP per capita variable are very different. Plotting the data for these two variables reveals some other interesting facts:
# I needed to do something with inDat's levels for the xyplot legends,
# This one partially works but messes inDat:
# levels(inDat$country) <- inDat$country
xyplot(lifeExp ~ factor(year)|continent, inDat, group= country, type= c("p","a"), sub ="Figure.5")
xyplot(gdpPercap ~ factor(year)|continent, inDat, group= country, type= c("p","a"), sub ="Figure.6")
I had a problem with levels of data and I could not show the legends, data has kept the legend of iDat (gDat - Oceania)
The behavior of GDP per capita for Saudi Arabia and Tunisia is interesting. The results show a relative maximum for them.