Dean Attali

STAT 545A hw 3
Sept 22 2013

Exercises done:

Data initialization

# load required libraries
library(plyr)
library(xtable)
# import the data
gDat <- read.delim("gapminderDataFiveYear.txt")
# sanity check that import was successful
str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Average GDP/cap in each continent when the data was first and last collected (easy)

In my previous assignment I worked out the GDP/cap in every continent per year using the 'aggregate' function. The goal here was just to show myself how easily plyr gets the same data. We only look at the data for the first and last years.

# get the data that only has the first and last years
firstLastYears = subset(gDat, year == min(year) | year == max(year))
# use plyr to pull out the wanted information
avgGdpContinent <- ddply(firstLastYears, ~year + continent, summarize, gdp = mean(gdpPercap))
avgGdpContinent <- xtable(avgGdpContinent)
print(avgGdpContinent, type = "html", include.rownames = FALSE)
year continent gdp
1952 Africa 1252.57
1952 Americas 4079.06
1952 Asia 5195.48
1952 Europe 5661.06
1952 Oceania 10298.09
2007 Africa 3089.03
2007 Americas 11003.03
2007 Asia 12473.03
2007 Europe 25054.48
2007 Oceania 29810.19

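We can see from the table above that Europe and Oceania are the big winners of the 55 years: each gained roughly 19,400 in average GDP/cap, and Europe's average more than quadrupled. Africa is the loser, with the smallest absolute gain (about 1,800). Of course, visualizing it would be much nicer, but it is forbidden.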
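To put numbers on the change between the two years, here is a small base R sketch that recomputes the absolute change and the growth ratio from the values in the table above. The column names `gdp1952`, `gdp2007`, `absChange`, and `ratio` are my own labels and not part of the original analysis.

```r
# Average GDP/cap per continent, copied from the table above
gdp <- data.frame(
  continent = c("Africa", "Americas", "Asia", "Europe", "Oceania"),
  gdp1952 = c(1252.57, 4079.06, 5195.48, 5661.06, 10298.09),
  gdp2007 = c(3089.03, 11003.03, 12473.03, 25054.48, 29810.19),
  stringsAsFactors = FALSE
)

# absolute gain and growth factor between the first and last year
gdp$absChange <- gdp$gdp2007 - gdp$gdp1952
gdp$ratio <- round(gdp$gdp2007 / gdp$gdp1952, 2)

# show the continents from largest to smallest absolute gain
print(gdp[order(-gdp$absChange), ], row.names = FALSE)
```

Sorting by `absChange` and sorting by `ratio` can give different rankings, so it is worth looking at both before naming winners and losers.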

Trimmed mean statistics for life expectancy in each continent for every year (fun)

Here, we compare the mean life expectancy in each continent per year with the trimmed mean after removing the lowest 15% and the highest 15% of values. We compute the absolute difference between the trimmed mean and the regular mean, and express it as a percentage of the regular mean.

# compute the means and arrange the data by the highest percent difference
lifeExpMeans <- arrange(ddply(gDat, .(continent, year), summarize, mean0 = mean(lifeExp), 
    mean15 = mean(lifeExp, trim = 0.15), meanDiff = abs(mean0 - mean15), percentDiff = round(meanDiff/mean0 * 
        100, 2)), desc(percentDiff))
lifeExpMeans <- xtable(lifeExpMeans)
print(lifeExpMeans, type = "html", include.rownames = FALSE)
continent year mean0 mean15 meanDiff percentDiff
Africa 2002 53.33 52.04 1.29 2.42
Africa 2007 54.81 53.75 1.06 1.93
Asia 1977 59.61 60.55 0.94 1.57
Africa 1997 53.60 52.85 0.75 1.40
Asia 1952 46.31 45.79 0.52 1.13
Europe 1952 64.41 65.11 0.70 1.09
Europe 1957 66.70 67.34 0.64 0.95
Asia 1987 64.85 65.45 0.60 0.93
Asia 2002 69.23 69.88 0.64 0.93
Americas 1982 66.23 66.82 0.59 0.90
Asia 2007 70.73 71.33 0.61 0.86
Asia 1982 62.62 63.15 0.53 0.84
Asia 1992 66.54 67.09 0.55 0.83
Asia 1972 57.32 57.79 0.47 0.82
Europe 1962 68.54 69.10 0.56 0.82
Africa 1987 53.34 52.92 0.42 0.80
Americas 1972 62.39 62.89 0.50 0.80
Americas 1987 68.09 68.63 0.54 0.79
Africa 1957 41.27 40.95 0.31 0.76
Americas 1977 64.39 64.88 0.49 0.76
Asia 1997 68.02 68.54 0.52 0.76
Americas 1992 69.57 70.05 0.49 0.70
Americas 1997 71.15 71.63 0.48 0.68
Europe 1967 69.74 70.20 0.46 0.66
Africa 1962 43.32 43.04 0.28 0.65
Africa 1982 51.59 51.26 0.33 0.65
Asia 1957 49.32 49.00 0.32 0.65
Americas 1967 60.41 60.79 0.38 0.63
Americas 2002 72.42 72.85 0.43 0.60
Africa 1977 49.58 49.30 0.28 0.56
Asia 1962 51.56 51.29 0.27 0.53
Europe 1972 70.78 71.14 0.37 0.52
Americas 2007 73.61 73.99 0.38 0.51
Africa 1967 45.33 45.11 0.22 0.49
Africa 1972 47.45 47.22 0.23 0.49
Americas 1962 58.40 58.68 0.28 0.48
Africa 1952 39.14 38.95 0.19 0.47
Europe 1997 75.51 75.85 0.35 0.46
Europe 1992 74.44 74.77 0.33 0.45
Europe 1987 73.64 73.97 0.33 0.44
Europe 1982 72.81 73.09 0.29 0.39
Americas 1952 53.28 53.10 0.18 0.34
Europe 1977 71.94 72.18 0.25 0.34
Europe 2002 76.70 76.94 0.24 0.31
Europe 2007 77.65 77.89 0.24 0.31
Americas 1957 55.96 56.02 0.05 0.10
Asia 1967 54.66 54.71 0.05 0.09
Africa 1992 53.63 53.66 0.03 0.06
Oceania 1952 69.25 69.25 0.00 0.00
Oceania 1957 70.30 70.30 0.00 0.00
Oceania 1962 71.09 71.09 0.00 0.00
Oceania 1967 71.31 71.31 0.00 0.00
Oceania 1972 71.91 71.91 0.00 0.00
Oceania 1977 72.85 72.85 0.00 0.00
Oceania 1982 74.29 74.29 0.00 0.00
Oceania 1987 75.32 75.32 0.00 0.00
Oceania 1992 76.94 76.94 0.00 0.00
Oceania 1997 78.19 78.19 0.00 0.00
Oceania 2002 79.74 79.74 0.00 0.00
Oceania 2007 80.72 80.72 0.00 0.00

We can see that even after trimming 15% from both ends of the life expectancies in each continent, the largest difference between the trimmed mean and the regular mean is less than 2.5%. This means (pun not intended) that the extreme values barely move the mean: within each continent in a given year, there is no large group of outlier countries pulling the average life expectancy far up or down. Africa shows the largest such effect, as 3 of the top 5 rows belong to Africa. It's also nice to see how Oceania has 0% difference because there are not enough countries in it to trim, so the trimmed mean uses the same data as the regular mean.
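As a rough illustration of why a tiny group is left untouched by trimming: `mean(x, trim = 0.15)` drops `floor(n * 0.15)` observations from each end of the sorted values, which is 0 when there are only two values. The vectors `two` and `ten` below are made-up numbers for the sketch, not taken from the data.

```r
# Two values: floor(2 * 0.15) = 0 observations are dropped from each end,
# so the trimmed mean is the same as the regular mean
two <- c(69.1, 69.4)
sameForTwo <- isTRUE(all.equal(mean(two), mean(two, trim = 0.15)))
print(sameForTwo)

# Ten values with one low outlier: floor(10 * 0.15) = 1 observation is
# dropped from each end, so the outlier no longer affects the result
ten <- c(30, 50, 51, 52, 53, 54, 55, 56, 57, 58)
regularMean <- mean(ten)
trimmedMean <- mean(ten, trim = 0.15)

# the same trimmed mean computed by hand: sort, drop first and last value
manualTrimmed <- mean(sort(ten)[2:9])

print(c(regular = regularMean, trimmed = trimmedMean, manual = manualTrimmed))
```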

Absolute and relative world population in each of the continents (very fun)

Here we look at the total population of each continent in every year, and compare that to the world's total population. The data is arranged by year, where in each year group the continents are arranged from most populous to least.

worldRelativePop <- ddply(gDat, .(continent, year), function(.data) {
    .data <- as.list(.data)
    .data["continentPop"] <- sum(.data$pop)
    .data["worldPop"] <- sum(subset(gDat, year == .data$year[1])[["pop"]])
    .data["percent"] <- round(as.numeric(.data["continentPop"])/as.numeric(.data["worldPop"]) * 
        100, 2)
    quickdf(.data[c("continentPop", "worldPop", "percent")])
})
worldRelativePop <- arrange(worldRelativePop, year, desc(percent))
worldRelativePop <- xtable(worldRelativePop)
print(worldRelativePop, type = "html", include.rownames = FALSE)
continent year continentPop worldPop percent
Asia 1952 1395357352.00 2406957151.00 57.97
Europe 1952 418120846.00 2406957151.00 17.37
Americas 1952 345152446.00 2406957151.00 14.34
Africa 1952 237640501.00 2406957151.00 9.87
Oceania 1952 10686006.00 2406957151.00 0.44
Asia 1957 1562780599.00 2664404580.00 58.65
Europe 1957 437890351.00 2664404580.00 16.43
Americas 1957 386953916.00 2664404580.00 14.52
Africa 1957 264837738.00 2664404580.00 9.94
Oceania 1957 11941976.00 2664404580.00 0.45
Asia 1962 1696357182.00 2899782974.00 58.50
Europe 1962 460355155.00 2899782974.00 15.88
Americas 1962 433270254.00 2899782974.00 14.94
Africa 1962 296516865.00 2899782974.00 10.23
Oceania 1962 13283518.00 2899782974.00 0.46
Asia 1967 1905662900.00 3217478384.00 59.23
Europe 1967 481178958.00 3217478384.00 14.96
Americas 1967 480746623.00 3217478384.00 14.94
Africa 1967 335289489.00 3217478384.00 10.42
Oceania 1967 14600414.00 3217478384.00 0.45
Asia 1972 2150972248.00 3576977158.00 60.13
Americas 1972 529384210.00 3576977158.00 14.80
Europe 1972 500635059.00 3576977158.00 14.00
Africa 1972 379879541.00 3576977158.00 10.62
Oceania 1972 16106100.00 3576977158.00 0.45
Asia 1977 2384513556.00 3930045807.00 60.67
Americas 1977 578067699.00 3930045807.00 14.71
Europe 1977 517164531.00 3930045807.00 13.16
Africa 1977 433061021.00 3930045807.00 11.02
Oceania 1977 17239000.00 3930045807.00 0.44
Asia 1982 2610135582.00 4289436840.00 60.85
Americas 1982 630290920.00 4289436840.00 14.69
Europe 1982 531266901.00 4289436840.00 12.39
Africa 1982 499348587.00 4289436840.00 11.64
Oceania 1982 18394850.00 4289436840.00 0.43
Asia 1987 2871220762.00 4691477418.00 61.20
Americas 1987 682753971.00 4691477418.00 14.55
Africa 1987 574834110.00 4691477418.00 12.25
Europe 1987 543094160.00 4691477418.00 11.58
Oceania 1987 19574415.00 4691477418.00 0.42
Asia 1992 3133292191.00 5110710260.00 61.31
Americas 1992 739274104.00 5110710260.00 14.47
Africa 1992 659081517.00 5110710260.00 12.90
Europe 1992 558142797.00 5110710260.00 10.92
Oceania 1992 20919651.00 5110710260.00 0.41
Asia 1997 3383285500.00 5515204472.00 61.34
Americas 1997 796900410.00 5515204472.00 14.45
Africa 1997 743832984.00 5515204472.00 13.49
Europe 1997 568944148.00 5515204472.00 10.32
Oceania 1997 22241430.00 5515204472.00 0.40
Asia 2002 3601802203.00 5886977579.00 61.18
Americas 2002 849772762.00 5886977579.00 14.43
Africa 2002 833723916.00 5886977579.00 14.16
Europe 2002 578223869.00 5886977579.00 9.82
Oceania 2002 23454829.00 5886977579.00 0.40
Asia 2007 3811953827.00 6251013179.00 60.98
Africa 2007 929539692.00 6251013179.00 14.87
Americas 2007 898871184.00 6251013179.00 14.38
Europe 2007 586098529.00 6251013179.00 9.38
Oceania 2007 24549947.00 6251013179.00 0.39

There might be a nicer, easier way to achieve this, but I didn't know how. I tried plain old 'summarize', but it did not let me compute the total population across all continents within each year. I'm not sure if this kind of splitting is available with plyr. Since I couldn't get what I wanted with plyr, I looked at the source code of the summarize function and altered it a little to get what I needed.
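On the question of a simpler route: one possible sketch, using base R rather than plyr, is to sum the population per continent and year first and then add the per-year world total with `ave()`. The `toy` data frame below is invented so the block runs on its own; with the real data, `gDat` would take its place.

```r
# Made-up stand-in for gDat: two continents, three countries, two years
toy <- data.frame(
  continent = c("A", "A", "B", "A", "A", "B"),
  country   = c("a1", "a2", "b1", "a1", "a2", "b1"),
  year      = c(1952, 1952, 1952, 1957, 1957, 1957),
  pop       = c(10, 30, 60, 20, 40, 140),
  stringsAsFactors = FALSE
)

# step 1: total population of each continent in each year
byContinent <- aggregate(pop ~ continent + year, data = toy, FUN = sum)
names(byContinent)[names(byContinent) == "pop"] <- "continentPop"

# step 2: world total per year, repeated on every row of that year
byContinent$worldPop <- ave(byContinent$continentPop, byContinent$year, FUN = sum)

# step 3: each continent's share of the world population, in percent
byContinent$percent <- round(byContinent$continentPop / byContinent$worldPop * 100, 2)

# arrange by year, most populous continent first within each year
byContinent <- byContinent[order(byContinent$year, -byContinent$percent), ]
print(byContinent, row.names = FALSE)
```

Doing the aggregation in two steps avoids having to reach back into the full data set from inside the per-group function.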
We can see that Asia is consistently by far the most populated continent, always making up ~60% of the world population.
One very interesting observation is how Europe, the Americas, and Africa changed places over time. In 1952, Europe was the most populated of the three, followed by the Americas and Africa. As the years go by, the Americas' relative population remains fairly constant at around 14.5%, Europe's decreases, and Africa's increases. This trend continues throughout the years without exception, until at the last data point in 2007 the ranking of the three continents is completely flipped from the beginning: Africa, followed by the Americas, followed by Europe.

A list of all countries that at some point had their population size decrease (very fun)

The world can be a very cruel place. Many countries have gone through genocides, massive natural disasters, or other events that have caused them to lose a significant portion of their population. For example, the Khmer Rouge in Cambodia killed off a large fraction of the Cambodian population in the 1970's. As a result, the country's population actually shrank from 1972 to 1977. It is interesting to see what other countries went through a population decrease at some point.

# get all the years that we have data for
years <- unique(gDat$year)
# get a list of all countries
allCountries = levels(gDat$country)
# initialize a vector for the poor countries that experienced a population
# decrease
resultCountries = vector(mode = "character")

# go through every country, and see if its population in the previous data
# year is larger than the current population. If that is true for any given
# year, add the country to our results list
for (iCountry in allCountries) {
    for (idxYear in seq(years)[-1]) {
        prevYear = years[idxYear - 1]
        curYear = years[idxYear]
        prevYearData = gDat[intersect(which(gDat$year == prevYear), which(gDat$country == 
            iCountry)), ]
        curYearData = gDat[intersect(which(gDat$year == curYear), which(gDat$country == 
            iCountry)), ]
        prevPop = prevYearData[["pop"]]
        curPop = curYearData[["pop"]]
        if (prevPop >= curPop) {
            resultCountries = append(resultCountries, iCountry)
            break
        }
    }
}
print(resultCountries)
##  [1] "Afghanistan"            "Bosnia and Herzegovina"
##  [3] "Bulgaria"               "Cambodia"              
##  [5] "Croatia"                "Czech Republic"        
##  [7] "Equatorial Guinea"      "Germany"               
##  [9] "Guinea-Bissau"          "Hungary"               
## [11] "Ireland"                "Kuwait"                
## [13] "Lebanon"                "Lesotho"               
## [15] "Liberia"                "Montenegro"            
## [17] "Poland"                 "Portugal"              
## [19] "Romania"                "Rwanda"                
## [21] "Serbia"                 "Slovenia"              
## [23] "Somalia"                "South Africa"          
## [25] "Switzerland"            "Trinidad and Tobago"   
## [27] "West Bank and Gaza"

There might be a non-forloop way to do this, but I couldn't figure it out.
As we can see, there are 27 countries that at some point had their population decrease, and Cambodia is indeed one of them.
I wanted to show this data in a dataframe, but I'm on a plane without WiFi at midnight before this is due, and I can't find out how to build a data frame from scratch, so I'll just leave it as a list :)
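For what it's worth, here is one possible loop-free sketch in base R that also ends with a data frame built from scratch. It assumes the rows can be sorted by country and year, and it flags a country when any consecutive difference in population is negative (a strict decrease, whereas the loop above also counts an unchanged population). The `toy` data and the names `everDecreased` and `resultDf` are made up for the illustration.

```r
# Made-up stand-in for gDat: one country that only grows, one that shrinks once
toy <- data.frame(
  country = rep(c("Growing", "Shrinking"), each = 3),
  year    = rep(c(1952, 1957, 1962), times = 2),
  pop     = c(100, 110, 120, 100, 90, 95),
  stringsAsFactors = FALSE
)

# make sure each country's rows are in chronological order
toy <- toy[order(toy$country, toy$year), ]

# for each country: did the population drop between any two consecutive years?
everDecreased <- tapply(toy$pop, toy$country, function(p) any(diff(p) < 0))

# build a data frame from scratch out of the matching country names
resultDf <- data.frame(
  country = names(everDecreased)[everDecreased],
  stringsAsFactors = FALSE
)
print(resultDf)
```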