Data Aggregation

Huiting Ma

This homework will focus on the following parts:

Data Import

First, the Gapminder data can be imported from here. Then, make sure that you are working in the correct directory on your computer.

setwd("C:/Users/user/Desktop/545A")
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(gdURL)
str(gDat)
## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Data Aggregation

Now, let us get started. In order to do the following steps, make sure you have already installed “plyr” and “xytable” packages.

library(xtable)
library(lattice)
library(plyr)

Get the Maximum and Minimum of GDP per Capita for all Continents-Wide

The first example that I want to discuss is how to get the maximum and minimum of GDP per capica for all continents.I will introduce two ways to achieve our results.

Way # 1

The following function helps to determine the maximum and minimum GDP per capita for each continent (for example “Africa”). Also, this function can give a good names for the resutls.

hcontinent <- "Africa"
hDat <- subset(gDat, continent == hcontinent)
hFun <- function(x) {
    maxgdpPercap <- max(x$gdpPercap)
    mingdpPercap <- min(x$gdpPercap)
    names(maxgdpPercap) <- "maxgdpPercap"
    names(mingdpPercap) <- "mingdpPercap"
    return(c(maxgdpPercap, mingdpPercap))
}
hFun(hDat)
## maxgdpPercap mingdpPercap 
##      21951.2        241.2

Now, Let us place this function inside a ddply().

GetmaxminPercapway1 <- ddply(gDat, ~continent, hFun)

Yes! We got our results. However, we want to sort the table on minimum GDP per capital. Then, we need apply arrange().

GetmaxminPercapway1 <- arrange(GetmaxminPercapway1, mingdpPercap)

Last step, we can display the table through apply xtable().

GetmaxminPercapway1 <- xtable(GetmaxminPercapway1)
print(GetmaxminPercapway1, type = "html", include.rownames = FALSE)
continent maxgdpPercap mingdpPercap
Africa 21951.21 241.17
Asia 113523.13 331.00
Europe 49357.19 973.53
Americas 42951.65 1201.64
Oceania 34435.37 10039.60

Based on the table above, we noticed that the minimum GDP per capita in African is much less than Oceania.

Way # 2

In order to simplify the code, we can try this:

GetmaxminPercapway2 <- ddply(gDat, ~continent, summarize, maxgdpPercap = max(gdpPercap), 
    mingdpPercap = min(gdpPercap))

GetmaxminPercapway2 <- arrange(GetmaxminPercapway2, mingdpPercap)

Now, we can print out our results through applying xytable.

GetmaxminPercapway2 <- xtable(GetmaxminPercapway2)
print(GetmaxminPercapway2, type = "html", include.rownames = FALSE)
continent maxgdpPercap mingdpPercap
Africa 21951.21 241.17
Asia 113523.13 331.00
Europe 49357.19 973.53
Americas 42951.65 1201.64
Oceania 34435.37 10039.60

Here we go, the results are the same as way # 1.

Get the Maximum and Minimum of GDP per Capita for all Continents-Long

Next table we are going to talk about is how to change our “wide” format to “long” format. We could do that through reshape(), but today we will focus on using ddply().

GetmaxminPercapLong <- ddply(gDat, ~continent, summarize, gdpPercapSummary = c(max(gdpPercap), 
    min(gdpPercap)), SummaryDescribe = factor(c("maximum", "minimum")))

Now, we can print out our results through applying xytable.

GetmaxminPercapLong <- xtable(GetmaxminPercapLong)
print(GetmaxminPercapLong, type = "html", include.rownames = FALSE)
continent gdpPercapSummary SummaryDescribe
Africa 21951.21 maximum
Africa 241.17 minimum
Americas 42951.65 maximum
Americas 1201.64 minimum
Asia 113523.13 maximum
Asia 331.00 minimum
Europe 49357.19 maximum
Europe 973.53 minimum
Oceania 34435.37 maximum
Oceania 10039.60 minimum

The above table gives us the same information as “wide” format.

Get the Spread of GDP per Capita for all Continents

Now, we can look at the spread of GDP per capita within the continents.

GetSpread <- ddply(gDat, ~continent, summarize, sdGdpPercap = sd(gdpPercap), 
    varGdpPercap = var(gdpPercap), medGdpPercap = median(gdpPercap), IQRgdpPercap = IQR(gdpPercap))
GetSpread <- arrange(GetSpread, sdGdpPercap)

Then, we can print out our answers.

GetSpread <- xtable(GetSpread)
print(GetSpread, type = "html", include.rownames = FALSE)
continent sdGdpPercap varGdpPercap medGdpPercap IQRgdpPercap
Africa 2827.93 7997187.31 1192.14 1616.17
Oceania 6358.98 40436668.87 17983.30 8072.26
Americas 6396.76 40918591.10 5465.51 4402.43
Europe 9355.21 87520019.60 12081.75 13248.30
Asia 14045.37 197272505.85 2646.79 7492.26

Based on the table above, Asia is experiencing the hugest varability amoung all contingents.

Compute a Trimmed Mean of Life Expectancy for Different Years

Trimmed mean is also known as truncated mean. It is a statistical measure of central tendency. The following steps are introducing a way to compute a trimmed mean of life expectancy for diffferent years.

trimmedMeanByYear <- ddply(gDat, ~year, summarize, trimmedmean = mean(lifeExp, 
    trim = 1))

Now, we can print out our results through applying xytable.

trimmedMeanByYear <- xtable(trimmedMeanByYear)
print(trimmedMeanByYear, type = "html", include.rownames = FALSE)
year trimmedmean
1952 45.14
1957 48.36
1962 50.88
1967 53.83
1972 56.53
1977 59.67
1982 62.44
1987 65.83
1992 67.70
1997 69.39
2002 70.83
2007 71.94

According to the above table, the trimmed mean is increasing from 1952 to 2007.

Compare Life Expectancy over time on Different Continents

Another table is to compare life expectancy over time on different continents.

LifeExpByYearContinent <- ddply(gDat, ~continent + year, summarize, LifeExpMean = mean(lifeExp))

Now, we can print out our results through applying xytable.

LifeExpByYearContinent <- xtable(LifeExpByYearContinent)
print(LifeExpByYearContinent, type = "html", include.rownames = FALSE)
continent year LifeExpMean
Africa 1952 39.14
Africa 1957 41.27
Africa 1962 43.32
Africa 1967 45.33
Africa 1972 47.45
Africa 1977 49.58
Africa 1982 51.59
Africa 1987 53.34
Africa 1992 53.63
Africa 1997 53.60
Africa 2002 53.33
Africa 2007 54.81
Americas 1952 53.28
Americas 1957 55.96
Americas 1962 58.40
Americas 1967 60.41
Americas 1972 62.39
Americas 1977 64.39
Americas 1982 66.23
Americas 1987 68.09
Americas 1992 69.57
Americas 1997 71.15
Americas 2002 72.42
Americas 2007 73.61
Asia 1952 46.31
Asia 1957 49.32
Asia 1962 51.56
Asia 1967 54.66
Asia 1972 57.32
Asia 1977 59.61
Asia 1982 62.62
Asia 1987 64.85
Asia 1992 66.54
Asia 1997 68.02
Asia 2002 69.23
Asia 2007 70.73
Europe 1952 64.41
Europe 1957 66.70
Europe 1962 68.54
Europe 1967 69.74
Europe 1972 70.78
Europe 1977 71.94
Europe 1982 72.81
Europe 1987 73.64
Europe 1992 74.44
Europe 1997 75.51
Europe 2002 76.70
Europe 2007 77.65
Oceania 1952 69.25
Oceania 1957 70.30
Oceania 1962 71.09
Oceania 1967 71.31
Oceania 1972 71.91
Oceania 1977 72.85
Oceania 1982 74.29
Oceania 1987 75.32
Oceania 1992 76.94
Oceania 1997 78.19
Oceania 2002 79.74
Oceania 2007 80.72

The results show that Oceania has longer life expectancy than Africa.

Count the number of countries with low life expectancy over time by continent

If the life expectancy for a country is less than 60, I defined this country to be low life expectancy.

benchmark <- 60
NumlowLifeEXbyContinent <- ddply(gDat, ~continent + year, summarize, CountLowLifeEX = sum(lifeExp < 
    benchmark))

Now, we can print out our results.

NumlowLifeEXbyContinent <- xtable(NumlowLifeEXbyContinent)
print(NumlowLifeEXbyContinent, type = "html", include.rownames = FALSE)
continent year CountLowLifeEX
Africa 1952 52
Africa 1957 52
Africa 1962 51
Africa 1967 50
Africa 1972 50
Africa 1977 50
Africa 1982 44
Africa 1987 40
Africa 1992 39
Africa 1997 39
Africa 2002 41
Africa 2007 40
Americas 1952 19
Americas 1957 15
Americas 1962 13
Americas 1967 11
Americas 1972 10
Americas 1977 7
Americas 1982 5
Americas 1987 2
Americas 1992 2
Americas 1997 1
Americas 2002 1
Americas 2007 0
Asia 1952 29
Asia 1957 27
Asia 1962 25
Asia 1967 25
Asia 1972 19
Asia 1977 14
Asia 1982 12
Asia 1987 8
Asia 1992 7
Asia 1997 6
Asia 2002 4
Asia 2007 3
Europe 1952 7
Europe 1957 3
Europe 1962 1
Europe 1967 1
Europe 1972 1
Europe 1977 1
Europe 1982 0
Europe 1987 0
Europe 1992 0
Europe 1997 0
Europe 2002 0
Europe 2007 0
Oceania 1952 0
Oceania 1957 0
Oceania 1962 0
Oceania 1967 0
Oceania 1972 0
Oceania 1977 0
Oceania 1982 0
Oceania 1987 0
Oceania 1992 0
Oceania 1997 0
Oceania 2002 0
Oceania 2007 0

By looking at the table above, we can find that there are a lot of countries in Africa which have low life expectancy. However, the overall life expectancy in Oceania is much higher!! (No country has low life expectancy).

Compute the proportion of countries with low life expectancy over time by continent

The last table we will look at is a extension of above question. We will try to get the proportion of countries with low life expectancy instead the number of countries.

benchmark <- 60
ProlowLifeEXbyContinent <- ddply(gDat, ~continent + year, summarize, PropLowLifeEXbyContinent = sum(lifeExp < 
    benchmark)/length(lifeExp))

Now, let us see the results!

ProlowLifeEXbyContinent <- xtable(ProlowLifeEXbyContinent)
print(ProlowLifeEXbyContinent, type = "html", include.rownames = FALSE)
continent year PropLowLifeEXbyContinent
Africa 1952 1.00
Africa 1957 1.00
Africa 1962 0.98
Africa 1967 0.96
Africa 1972 0.96
Africa 1977 0.96
Africa 1982 0.85
Africa 1987 0.77
Africa 1992 0.75
Africa 1997 0.75
Africa 2002 0.79
Africa 2007 0.77
Americas 1952 0.76
Americas 1957 0.60
Americas 1962 0.52
Americas 1967 0.44
Americas 1972 0.40
Americas 1977 0.28
Americas 1982 0.20
Americas 1987 0.08
Americas 1992 0.08
Americas 1997 0.04
Americas 2002 0.04
Americas 2007 0.00
Asia 1952 0.88
Asia 1957 0.82
Asia 1962 0.76
Asia 1967 0.76
Asia 1972 0.58
Asia 1977 0.42
Asia 1982 0.36
Asia 1987 0.24
Asia 1992 0.21
Asia 1997 0.18
Asia 2002 0.12
Asia 2007 0.09
Europe 1952 0.23
Europe 1957 0.10
Europe 1962 0.03
Europe 1967 0.03
Europe 1972 0.03
Europe 1977 0.03
Europe 1982 0.00
Europe 1987 0.00
Europe 1992 0.00
Europe 1997 0.00
Europe 2002 0.00
Europe 2007 0.00
Oceania 1952 0.00
Oceania 1957 0.00
Oceania 1962 0.00
Oceania 1967 0.00
Oceania 1972 0.00
Oceania 1977 0.00
Oceania 1982 0.00
Oceania 1987 0.00
Oceania 1992 0.00
Oceania 1997 0.00
Oceania 2002 0.00
Oceania 2007 0.00

We can find that more than 50% countries in Africa have low life expectancy.