In this homework, we want to execute a series of data aggregation tasks on a set of data.
Let's import the Gapminder data set again:
SourceDat <- read.delim(file = "gapminderDataFiveYear.txt")
We can check the structure and tail of the imported source data to see whether the data was imported properly:
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
## country year pop continent lifeExp gdpPercap
## 1701 Zimbabwe 1992 10704340 Africa 60.38 693.4
## 1702 Zimbabwe 1997 11404948 Africa 46.81 792.4
## 1703 Zimbabwe 2002 11926563 Africa 39.99 672.0
## 1704 Zimbabwe 2007 12311143 Africa 43.49 469.7
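The calls that produce this output are not shown; presumably they are `str()` and `tail()`. Here is a minimal sketch on a two-row toy table built from the last two Zimbabwe rows above (the names `toy_txt` and `toy` are made up for illustration). One caveat: since R 4.0.0, `read.delim()` no longer converts strings to factors by default, so `stringsAsFactors = TRUE` is assumed here to get the `Factor` columns shown in the output.

```r
# Toy stand-in for gapminderDataFiveYear.txt: header plus two Zimbabwe rows
toy_txt <- paste0(
  "country\tyear\tpop\tcontinent\tlifeExp\tgdpPercap\n",
  "Zimbabwe\t2002\t11926563\tAfrica\t39.99\t672.0\n",
  "Zimbabwe\t2007\t12311143\tAfrica\t43.49\t469.7\n"
)

# stringsAsFactors = TRUE reproduces the pre-4.0 default (strings become factors)
toy <- read.delim(text = toy_txt, stringsAsFactors = TRUE)

str(toy)       # column types and a preview of the values
tail(toy, 1)   # last row only
```

On the real data, `str(SourceDat)` and `tail(SourceDat, 4)` should give output of the same shape as above.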
Now we want to take a look at the minimum and maximum of GDP per capita for each continent. To do that,
we can use the ddply() function from the plyr package.
library(plyr)
GDPExterWide <- ddply(SourceDat, ~continent, summarize, MinGDP = round(min(gdpPercap),
2), MaxGDP = round(max(gdpPercap), 2))
To show the result in a more pleasant way, we can use the xtable package and draw a table, sorted by MinGDP.
library(xtable)
print(xtable(arrange(GDPExterWide, MinGDP)), type = "html", include.rownames = FALSE)
| continent | MinGDP | MaxGDP |
|---|---|---|
| Africa | 241.17 | 21951.21 |
| Asia | 331.00 | 113523.13 |
| Europe | 973.53 | 49357.19 |
| Americas | 1201.64 | 42951.65 |
| Oceania | 10039.60 | 34435.37 |
As we can see from the results, the continents would not be in the same order if we sorted them by MaxGDP instead (Africa, Oceania, Americas, Europe, Asia). Africa is lowest on both measures, but in general a low minimum does not imply a low maximum: Asia contains the country with the second-lowest minimum GDP per capita and also the country with by far the highest maximum.
On the other hand, we should pay attention not to misinterpret the difference between minimum and maximum of GDP per capita for each continent. An important parameter to be considered in that regard is the number of countries in each continent:
(The chunk I wrote for this table is hideous and I avoided bringing it here)
| Continent | Africa | Americas | Asia | Europe | Oceania |
|---|---|---|---|---|---|
| Countries | 52 | 25 | 33 | 30 | 2 |
Now we can see that, for example, Oceania, which is the only continent whose minimum and maximum GDP per capita are of the same order of magnitude, comprises only two countries!
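The chunk behind the country counts is not shown, so here is one possible way to get them, as a sketch only. It uses base R on a small made-up data.frame (`toy`, with hypothetical countries "A", "B" and "C") rather than the real data, and counts the distinct countries within each continent.

```r
# Toy stand-in for SourceDat: each country appears once per year
toy <- data.frame(
  country   = c("A", "A", "B", "B", "C", "C"),
  continent = c("Africa", "Africa", "Africa", "Africa", "Oceania", "Oceania"),
  year      = c(2002, 2007, 2002, 2007, 2002, 2007)
)

# Split the country column by continent, then count distinct countries in each piece
n_countries <- sapply(
  split(toy$country, toy$continent),
  function(x) length(unique(x))
)
n_countries
```

With plyr loaded, `ddply(SourceDat, ~continent, summarize, Countries = length(unique(country)))` should give the same counts in tall format on the real data.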
With a very similar method we can review the spread of GDP per capita within the different continents. Here, the table is sorted by median absolute deviation.
| continent | SD | MAD | IQR |
|---|---|---|---|
| Africa | 2827.93 | 775.32 | 1616.17 |
| Asia | 14045.37 | 2820.83 | 7492.26 |
| Americas | 6396.76 | 3269.33 | 4402.43 |
| Oceania | 6358.98 | 6459.10 | 8072.26 |
| Europe | 9355.21 | 8846.05 | 13248.30 |
As we can see again, the continents do not follow the same order for standard deviation and interquartile range as they do for median absolute deviation. However, at this stage, it is not quite clear to me what these statistics tell us.
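One possible reading, offered as a sketch rather than a conclusion about the Gapminder data: the standard deviation is sensitive to a few extreme values, while MAD and IQR are not. The toy example below uses two made-up groups, "X" (with one outlier) and "Y" (evenly spread), and base R instead of ddply().

```r
# Two hypothetical groups: X has one extreme value, Y is evenly spread
toy <- data.frame(
  continent = rep(c("X", "Y"), each = 5),
  gdpPercap = c(1, 2, 3, 4, 100, 10, 20, 30, 40, 50)
)

# One row of spread statistics per group
spread <- do.call(rbind, lapply(
  split(toy$gdpPercap, toy$continent),
  function(x) data.frame(SD = sd(x), MAD = mad(x), IQR = IQR(x))
))

# Sort by median absolute deviation, as in the table above
spread <- spread[order(spread$MAD), ]
spread
```

Group X ends up with the larger SD but the smaller MAD and IQR, because a single outlier inflates the SD. A continent with a high SD but a modest MAD, like Asia in the table above, may be in a similar situation. On the real data, the ddply() analogue would presumably be `ddply(SourceDat, ~continent, summarize, SD = sd(gdpPercap), MAD = mad(gdpPercap), IQR = IQR(gdpPercap))`.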
In this part, we want to check the average life expectancy around the world in the different years of our study. The 12.5% and 25% (interquartile) trimmed means are also shown in the following table:
# Two choices of trim fraction. This chunk might be nicer, and more
# practical, if we defined a function, I think.
TP1 <- 0.125
TP2 <- 0.25
AveLExPYear <- ddply(SourceDat, ~year, summarize, Average = mean(lifeExp), Dumm1 = mean(lifeExp,
trim = TP1), Dumm2 = mean(lifeExp, trim = TP2))
names(AveLExPYear)[3:4] <- paste0("Tmean", 100 * c(TP1, TP2))
AveLExPYear <- arrange(AveLExPYear, Average)
# It was already sorted, but we can do it just to make sure
print(xtable(AveLExPYear), type = "html", include.rownames = FALSE)
| year | Average | Tmean12.5 | Tmean25 |
|---|---|---|---|
| 1952 | 49.06 | 48.42 | 47.34 |
| 1957 | 51.51 | 51.16 | 50.28 |
| 1962 | 53.61 | 53.51 | 52.79 |
| 1967 | 55.68 | 55.85 | 55.43 |
| 1972 | 57.65 | 58.05 | 58.08 |
| 1977 | 59.57 | 60.17 | 60.47 |
| 1982 | 61.53 | 62.22 | 62.70 |
| 1987 | 63.21 | 64.07 | 64.77 |
| 1992 | 64.16 | 65.34 | 66.19 |
| 1997 | 65.01 | 66.21 | 67.25 |
| 2002 | 65.69 | 66.95 | 68.31 |
| 2007 | 67.01 | 68.34 | 69.69 |
Interestingly, both the average and the trimmed mean life expectancy have increased monotonically over time around the world. However, this does not remove the need to investigate how positive factors (e.g. health science and general comfort) and negative factors (e.g. war and disease) correlate with life expectancy.
Also, as we can see here, the trimmed means show a more pronounced increase in life expectancy over time: from 1952 to 2007 the plain average rises by 17.95 years, the 12.5% trimmed mean by 19.92 and the 25% trimmed mean by 22.35. The more data we trim from the two ends, the more noticeable the increase.
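The code comment above suggests wrapping the trimmed means in a function. A minimal sketch of what that could look like follows; the name `trimmed_means` and its `trims` argument are hypothetical, not part of the original chunk.

```r
# Hypothetical helper: the plain mean plus one trimmed mean per trim fraction
trimmed_means <- function(x, trims = c(0.125, 0.25)) {
  out <- c(mean(x), sapply(trims, function(tr) mean(x, trim = tr)))
  names(out) <- c("Average", paste0("Tmean", 100 * trims))
  out
}

# Toy vector with one large outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 100)
res <- trimmed_means(x)
res
```

Here the outlier pulls the plain mean up to 16, while both trimmed means are 4.5. Applied per year to the real data, something like `t(sapply(split(SourceDat$lifeExp, SourceDat$year), trimmed_means))` should reproduce the columns of the table above.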
An easy way to get the average life expectancy for each continent in each year is to call ddply() with two grouping variables, namely continent and year.
AvelExPYearCont <- ddply(SourceDat, ~continent + year, summarize, Average = mean(lifeExp))
But let's first take a look at the structure of the output data.frame using str():
## 'data.frame': 60 obs. of 3 variables:
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ Average : num 39.1 41.3 43.3 45.3 47.5 ...
As we can see here, we have a data.frame with 60 rows. Since ddply() returns its result in tall format by default, it can be awkward to present as a regular table. Here is a table showing 10 random rows of this data.frame:
Dumm1 <- AvelExPYearCont[sample(nrow(AvelExPYearCont), 10), ]
print(xtable(Dumm1), type = "html", include.rownames = FALSE)
| continent | year | Average |
|---|---|---|
| Asia | 1997 | 68.02 |
| Asia | 2002 | 69.23 |
| Asia | 1962 | 51.56 |
| Americas | 1977 | 64.39 |
| Americas | 1962 | 58.40 |
| Europe | 1957 | 66.70 |
| Oceania | 1987 | 75.32 |
| Africa | 1972 | 47.45 |
| Europe | 1972 | 70.78 |
| Asia | 1977 | 59.61 |
However, this is apparently a common challenge and people have found a solution to it. Following the instructions from a discussion on Stack Overflow, we can transform our data.frame to the wide format. In this table you can see the average life expectancy for each continent in the different years.
AvelExPYearContW <- reshape(AvelExPYearCont, idvar = "continent", timevar = "year",
direction = "wide")
names(AvelExPYearContW)[-1] <- unique(SourceDat$year)
print(xtable(AvelExPYearContW), type = "html", include.rownames = FALSE)
| continent | 1952 | 1957 | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Africa | 39.14 | 41.27 | 43.32 | 45.33 | 47.45 | 49.58 | 51.59 | 53.34 | 53.63 | 53.60 | 53.33 | 54.81 |
| Americas | 53.28 | 55.96 | 58.40 | 60.41 | 62.39 | 64.39 | 66.23 | 68.09 | 69.57 | 71.15 | 72.42 | 73.61 |
| Asia | 46.31 | 49.32 | 51.56 | 54.66 | 57.32 | 59.61 | 62.62 | 64.85 | 66.54 | 68.02 | 69.23 | 70.73 |
| Europe | 64.41 | 66.70 | 68.54 | 69.74 | 70.78 | 71.94 | 72.81 | 73.64 | 74.44 | 75.51 | 76.70 | 77.65 |
| Oceania | 69.25 | 70.30 | 71.09 | 71.31 | 71.91 | 72.85 | 74.29 | 75.32 | 76.94 | 78.19 | 79.74 | 80.72 |
Interpretation of the results is much easier this way. We can see that the trend is upward for all five continents. Except for Africa in the years 1997 and 2002, where we can see a slight decrease, the average life expectancy in every continent has gradually increased over time.
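The claim about Africa can be checked mechanically with diff(), which returns the change from each survey year to the next. This sketch hard-codes the Africa row of the table above; the variable names are made up for illustration.

```r
# Survey years and the Africa row from the wide table above
years  <- seq(1952, 2007, by = 5)
africa <- c(39.14, 41.27, 43.32, 45.33, 47.45, 49.58,
            51.59, 53.34, 53.63, 53.60, 53.33, 54.81)

# Change from each year to the next (11 differences for 12 years)
changes <- diff(africa)

# Years in which the average dropped relative to the previous survey
decreased <- years[-1][changes < 0]
decreased
```

The same check applied to the other four rows should return an empty vector, since those rows increase in every step.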
Now we want to get a sense of the number of countries in each continent whose life expectancy is lower than a certain “global” life expectancy index. To do this, we first need to define the global life expectancy.
Given the increasing trend over time that we saw earlier, defining a single constant global life expectancy would not be a good idea. Therefore, we should have a global life expectancy for each year.
One way to do this is to average the continent averages for each year in the AvelExPYearContW data.frame. However, this approach ignores the number of countries in each continent and gives a relatively higher global life expectancy. It is better to average life expectancy over all countries together, as in the AveLExPYear data.frame.
For example, compare the two values for 2007: 71.5021 from averaging the continent averages, and 67.0074 from averaging over all countries. The latter value is from the AveLExPYear data.frame.
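The gap between the two numbers comes from weighting. Averaging the continent averages gives Oceania (2 countries) the same weight as Africa (52 countries). A small sketch, using the rounded 2007 continent averages and the country counts from the tables above (variable names are made up):

```r
# 2007 continent averages and country counts, copied from the tables above
cont_mean_2007 <- c(Africa = 54.81, Americas = 73.61, Asia = 70.73,
                    Europe = 77.65, Oceania = 80.72)
n_countries    <- c(Africa = 52, Americas = 25, Asia = 33,
                    Europe = 30, Oceania = 2)

# Every continent counts equally: the "average of averages"
unweighted <- mean(cont_mean_2007)

# Every country counts equally: weight each continent by its number of countries
weighted <- weighted.mean(cont_mean_2007, w = n_countries)

c(unweighted = unweighted, weighted = weighted)
```

The unweighted value is about 71.50 and the weighted one about 67.01, matching the two figures quoted above up to rounding of the inputs.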
AveLExGlob <- round(AveLExPYear$Average, 2)
names(AveLExGlob) <- AveLExPYear$year
AveLExGlob
## 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
## 49.06 51.51 53.61 55.68 57.65 59.57 61.53 63.21 64.16 65.01 65.69 67.01
Now, based on this global life expectancy vector, we count the number of countries in each continent and year whose life expectancy is lower than the corresponding vector element. Since the initial tall result is not pleasant to show, we use the reshape() function once again.
LowLExCount <- ddply(SourceDat, ~continent + year, summarize, countries = length(lifeExp[lifeExp <
AveLExGlob[as.character(year)]]))
LowLExCount <- reshape(LowLExCount, direction = "wide", idvar = "continent",
timevar = "year")
names(LowLExCount)[-1] <- unique(SourceDat$year)
print(xtable(LowLExCount), type = "html", include.rownames = FALSE)
| continent | 1952 | 1957 | 1962 | 1967 | 1972 | 1977 | 1982 | 1987 | 1992 | 1997 | 2002 | 2007 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Africa | 50 | 50 | 50 | 50 | 50 | 49 | 48 | 46 | 46 | 45 | 45 | 45 |
| Americas | 9 | 9 | 8 | 6 | 6 | 7 | 7 | 5 | 3 | 2 | 2 | 2 |
| Asia | 22 | 20 | 19 | 18 | 18 | 14 | 12 | 12 | 11 | 10 | 10 | 10 |
| Europe | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| Oceania | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Looking through this table, we see that the Americas and Asia have been improving over the years: the number of countries in these continents with a low life expectancy, by our definition, is decreasing. Africa persistently holds a large number of countries with low life expectancy, although the count has been slowly falling since 1977.
Life expectancy in Oceania has been above the global value every year, and Europe has had at most one country below it (none since 1992).
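The counting step relies on looking up each row's year in a named vector via as.character(year). The sketch below isolates that idea with base R on a made-up data.frame: the two global values are taken from AveLExGlob above and the Africa rows from the Zimbabwe data, while the Oceania rows are hypothetical. It uses tapply() instead of ddply() plus reshape(), which gives a wide table directly.

```r
# Global life expectancy for two years, named by year (values from AveLExGlob)
glob <- c("2002" = 65.69, "2007" = 67.01)

toy <- data.frame(
  continent = c("Africa", "Africa", "Oceania", "Oceania"),
  year      = c(2002, 2007, 2002, 2007),
  lifeExp   = c(39.99, 43.49, 80.0, 81.0)   # Oceania values are hypothetical
)

# Look up each row's threshold by year name, then flag rows below it (1 or 0)
threshold <- unname(glob[as.character(toy$year)])
toy$below <- as.integer(toy$lifeExp < threshold)

# Sum the flags per continent and year; the result is already in wide format
wide <- tapply(toy$below, list(toy$continent, toy$year), sum)
wide
```

Indexing by character matters here: `glob[2002]` would ask for the 2002nd element rather than the element named "2002".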