I attempted several tasks from Homework #3 on data aggregation. Some were completed successfully and others were not. What follows is an outline of what went right and wrong in each task.
Before undertaking the R work, some packages and data had to be imported.
library(plyr)
library(xtable)
gDat <- read.delim("gapminderDataFiveYear.txt")
Aside from importing the relevant data, this code loads both the 'plyr' package for data aggregation and the 'xtable' package for pretty tables.
As can be seen below, in this task I used the 'ddply' function from the 'plyr' package to create a 5x3 data frame called contGdp. It holds 3 variables: continent, maxGdpPercap and minGdpPercap (the maximum and minimum gross domestic product per capita). From this data frame I created a second one, contOrdGdp, by applying the 'arrange' function to the maxGdpPercap column. This sorts the continents by maximum GDP per capita, largest first.
contGdp <- ddply(gDat, ~continent, summarise,
                 maxGdpPercap = max(gdpPercap),
                 minGdpPercap = min(gdpPercap))
contOrdGdp <- arrange(contGdp, desc(maxGdpPercap))
print(xtable(contOrdGdp), type = "html", include.rownames = FALSE)
| continent | maxGdpPercap | minGdpPercap |
|---|---|---|
| Asia | 113523.13 | 331.00 |
| Europe | 49357.19 | 973.53 |
| Americas | 42951.65 | 1201.64 |
| Oceania | 34435.37 | 10039.60 |
| Africa | 21951.21 | 241.17 |
With the continents now ordered by maxGdpPercap, there appears to be little relation between this variable and minGdpPercap. For example, row 1 shows that although Asia had the largest maximum GDP per capita (113523.13), its minimum (331.00) was the second smallest of any continent. This suggests a large disparity between high and low GDP per capita in Asia. Oceania, by contrast, ranks fourth on the maximum but has by far the largest minimum. On the whole there seems to be little relationship between the extreme values of the two variables.
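One way to put a number on the disparity described above is the ratio of maximum to minimum GDP per capita for each continent. The following is only a sketch in base R: it retypes the values from the table rather than recomputing them from gDat, and the names gdpRange and ratio are my own additions for illustration.

```r
# Values retyped from the table above (not recomputed from gDat)
gdpRange <- data.frame(
  continent    = c("Asia", "Europe", "Americas", "Oceania", "Africa"),
  maxGdpPercap = c(113523.13, 49357.19, 42951.65, 34435.37, 21951.21),
  minGdpPercap = c(331.00, 973.53, 1201.64, 10039.60, 241.17)
)

# Ratio of the largest to the smallest observation per continent
# (illustrative column, not part of the original homework)
gdpRange$ratio <- gdpRange$maxGdpPercap / gdpRange$minGdpPercap

# Sort so the continent with the widest spread comes first
gdpRange <- gdpRange[order(gdpRange$ratio, decreasing = TRUE), ]
print(gdpRange)
```

On these figures Asia's maximum is several hundred times its minimum, while Oceania's is only a few times larger.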
In this case I created a 60x3 data frame with the 3 variables continent, year and meanLifeExp (mean life expectancy).
contLifeExp <- ddply(gDat, continent ~ year, summarise, meanLifeExp = mean(lifeExp))
print(xtable(contLifeExp), type = "html", include.rownames = FALSE)
| continent | year | meanLifeExp |
|---|---|---|
| Africa | 1952 | 39.14 |
| Africa | 1957 | 41.27 |
| Africa | 1962 | 43.32 |
| Africa | 1967 | 45.33 |
| Africa | 1972 | 47.45 |
| Africa | 1977 | 49.58 |
| Africa | 1982 | 51.59 |
| Africa | 1987 | 53.34 |
| Africa | 1992 | 53.63 |
| Africa | 1997 | 53.60 |
| Africa | 2002 | 53.33 |
| Africa | 2007 | 54.81 |
| Americas | 1952 | 53.28 |
| Americas | 1957 | 55.96 |
| Americas | 1962 | 58.40 |
| Americas | 1967 | 60.41 |
| Americas | 1972 | 62.39 |
| Americas | 1977 | 64.39 |
| Americas | 1982 | 66.23 |
| Americas | 1987 | 68.09 |
| Americas | 1992 | 69.57 |
| Americas | 1997 | 71.15 |
| Americas | 2002 | 72.42 |
| Americas | 2007 | 73.61 |
| Asia | 1952 | 46.31 |
| Asia | 1957 | 49.32 |
| Asia | 1962 | 51.56 |
| Asia | 1967 | 54.66 |
| Asia | 1972 | 57.32 |
| Asia | 1977 | 59.61 |
| Asia | 1982 | 62.62 |
| Asia | 1987 | 64.85 |
| Asia | 1992 | 66.54 |
| Asia | 1997 | 68.02 |
| Asia | 2002 | 69.23 |
| Asia | 2007 | 70.73 |
| Europe | 1952 | 64.41 |
| Europe | 1957 | 66.70 |
| Europe | 1962 | 68.54 |
| Europe | 1967 | 69.74 |
| Europe | 1972 | 70.78 |
| Europe | 1977 | 71.94 |
| Europe | 1982 | 72.81 |
| Europe | 1987 | 73.64 |
| Europe | 1992 | 74.44 |
| Europe | 1997 | 75.51 |
| Europe | 2002 | 76.70 |
| Europe | 2007 | 77.65 |
| Oceania | 1952 | 69.25 |
| Oceania | 1957 | 70.30 |
| Oceania | 1962 | 71.09 |
| Oceania | 1967 | 71.31 |
| Oceania | 1972 | 71.91 |
| Oceania | 1977 | 72.85 |
| Oceania | 1982 | 74.29 |
| Oceania | 1987 | 75.32 |
| Oceania | 1992 | 76.94 |
| Oceania | 1997 | 78.19 |
| Oceania | 2002 | 79.74 |
| Oceania | 2007 | 80.72 |
Mean life expectancy was higher in 2007 than in 1952 for every continent, but the rate of increase has not been consistent. For example, over the 55 years covered Africa's mean life expectancy increased by around 16 years, whilst Asia's increased by about 24 years. Asia has the largest increase, though this is partially due to a low starting mean life expectancy. Africa's increase also stalled between 1992 and 2002, when its mean slipped from 53.63 to 53.33 years. All continents bar Africa had a mean life expectancy in 2007 of between about 70 and 81 years, while Africa's was around 55 years. This is the most noticeable statistic drawn from the table, as it shows a real issue with life expectancy in Africa, even in the present day.
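The size of each continent's increase can be checked directly from the first and last rows of each block in the table. The sketch below retypes those 1952 and 2007 values by hand; the names lifeGain, le1952, le2007 and gain are assumptions of mine, not objects from the homework code.

```r
# 1952 and 2007 means retyped from the table above
lifeGain <- data.frame(
  continent = c("Africa", "Americas", "Asia", "Europe", "Oceania"),
  le1952    = c(39.14, 53.28, 46.31, 64.41, 69.25),
  le2007    = c(54.81, 73.61, 70.73, 77.65, 80.72)
)

# Change in mean life expectancy over the 55 years covered
lifeGain$gain <- lifeGain$le2007 - lifeGain$le1952

# Largest gain first
print(lifeGain[order(lifeGain$gain, decreasing = TRUE), c("continent", "gain")])
```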
Here I computed 3 variables: year, MeanLifeExp (mean yearly global life expectancy) and TrimLifeExp (mean yearly global life expectancy with trim = 0.1, which drops the lowest and highest 10% of values before averaging). This created a 12x3 data frame named yearLifeExp.
yearLifeExp <- ddply(gDat, ~year, summarise,
                     MeanLifeExp = mean(lifeExp),
                     TrimLifeExp = mean(lifeExp, trim = 0.1))
print(xtable(yearLifeExp), type = "html", include.rownames = FALSE)
| year | MeanLifeExp | TrimLifeExp |
|---|---|---|
| 1952 | 49.06 | 48.58 |
| 1957 | 51.51 | 51.27 |
| 1962 | 53.61 | 53.58 |
| 1967 | 55.68 | 55.87 |
| 1972 | 57.65 | 58.01 |
| 1977 | 59.57 | 60.10 |
| 1982 | 61.53 | 62.12 |
| 1987 | 63.21 | 63.92 |
| 1992 | 64.16 | 65.19 |
| 1997 | 65.01 | 66.02 |
| 2002 | 65.69 | 66.72 |
| 2007 | 67.01 | 68.11 |
By observation, the trim reduces the mean slightly for the first 3 years (1952, 1957 and 1962). From 1967 onwards, however, the trimmed mean is higher than the ordinary mean. Again the difference is slight, never much more than 1 year. Having said this, such a change might not be considered small in the context of some data inference.
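To see how a trimmed mean can end up higher than the ordinary mean, here is a small sketch with made-up numbers (the vector x is hypothetical, not taken from gDat). With 10 values and trim = 0.1, mean() drops one value from each end before averaging, so a single very low value stops pulling the result down. This is only one plausible reading of the later years in the table, not something the table alone proves.

```r
# Hypothetical life expectancies with one very low value
x <- c(30, 60, 62, 64, 66, 68, 70, 72, 74, 76)

# Ordinary mean uses all 10 values, including the low outlier
plainMean <- mean(x)

# trim = 0.1 removes the lowest and highest 10% (one value each here)
trimMean <- mean(x, trim = 0.1)

print(c(plain = plainMean, trimmed = trimMean))
```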
For this question I presented the maximum and minimum values for each continent within one column.
GdpMaxMin <- ddply(gDat, ~continent, summarise,
                   factor = c("Max Gdp", "Min Gdp"),
                   GdpPercap = c(max(gdpPercap), min(gdpPercap)))
print(xtable(GdpMaxMin), type = "html", include.rownames = FALSE)
| continent | factor | GdpPercap |
|---|---|---|
| Africa | Max Gdp | 21951.21 |
| Africa | Min Gdp | 241.17 |
| Americas | Max Gdp | 42951.65 |
| Americas | Min Gdp | 1201.64 |
| Asia | Max Gdp | 113523.13 |
| Asia | Min Gdp | 331.00 |
| Europe | Max Gdp | 49357.19 |
| Europe | Min Gdp | 973.53 |
| Oceania | Max Gdp | 34435.37 |
| Oceania | Min Gdp | 10039.60 |
The table above shows the same results as in Q1), in a slightly different format. To avoid repeating what I have already said, I will leave this question here!
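The link between the two layouts can be sketched in base R: starting from a wide table with one column each for the maximum and minimum, the values are interleaved into a single column with a label beside each. The values are retyped from the tables above, and the names wide and long are assumptions used only for this illustration.

```r
# Wide layout, values retyped from the Q1) table (alphabetical order)
wide <- data.frame(
  continent    = c("Africa", "Americas", "Asia", "Europe", "Oceania"),
  maxGdpPercap = c(21951.21, 42951.65, 113523.13, 49357.19, 34435.37),
  minGdpPercap = c(241.17, 1201.64, 331.00, 973.53, 10039.60)
)

# Long layout: two rows per continent, max first and then min.
# rbind() stacks max over min; as.vector() reads the result column by column,
# which interleaves them as max1, min1, max2, min2, ...
long <- data.frame(
  continent = rep(wide$continent, each = 2),
  factor    = rep(c("Max Gdp", "Min Gdp"), times = nrow(wide)),
  GdpPercap = as.vector(rbind(wide$maxGdpPercap, wide$minGdpPercap))
)
print(long)
```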
This task was unsuccessful after many hours of work! This time I created a function, lessThanCount, to count the number of countries with a life expectancy under a specific value (here I tested 66) for a specific year. The function worked for a fixed bound, but I couldn't get it to take the upper life expectancy bound as a varying argument, or to run within a data frame to give the appropriate count of countries with low life expectancy over time by continent. With more work I believe this would work, but I didn't have the time by this point to go any further. I aimed to add this count as a column to the lifeExpCount data frame below.
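# Count the rows of x from year y whose life expectancy is below 66
lessThanCount <- function(x, y) {
  r <- 0
  for (i in seq_len(nrow(x))) {
    # Skip rows that belong to other years
    if (y != x$year[[i]]) {
      next
    }
    # The bound of 66 is hard-coded here
    if (x$lifeExp[[i]] < 66) {
      r <- r + 1
    }
  }
  return(r)
}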
lifeExpCount <- ddply(gDat, continent ~ year, summarise, MeanGloLife = mean(lifeExp))
print(xtable(lifeExpCount), type = "html", include.rownames = FALSE)
| continent | year | MeanGloLife |
|---|---|---|
| Africa | 1952 | 39.14 |
| Africa | 1957 | 41.27 |
| Africa | 1962 | 43.32 |
| Africa | 1967 | 45.33 |
| Africa | 1972 | 47.45 |
| Africa | 1977 | 49.58 |
| Africa | 1982 | 51.59 |
| Africa | 1987 | 53.34 |
| Africa | 1992 | 53.63 |
| Africa | 1997 | 53.60 |
| Africa | 2002 | 53.33 |
| Africa | 2007 | 54.81 |
| Americas | 1952 | 53.28 |
| Americas | 1957 | 55.96 |
| Americas | 1962 | 58.40 |
| Americas | 1967 | 60.41 |
| Americas | 1972 | 62.39 |
| Americas | 1977 | 64.39 |
| Americas | 1982 | 66.23 |
| Americas | 1987 | 68.09 |
| Americas | 1992 | 69.57 |
| Americas | 1997 | 71.15 |
| Americas | 2002 | 72.42 |
| Americas | 2007 | 73.61 |
| Asia | 1952 | 46.31 |
| Asia | 1957 | 49.32 |
| Asia | 1962 | 51.56 |
| Asia | 1967 | 54.66 |
| Asia | 1972 | 57.32 |
| Asia | 1977 | 59.61 |
| Asia | 1982 | 62.62 |
| Asia | 1987 | 64.85 |
| Asia | 1992 | 66.54 |
| Asia | 1997 | 68.02 |
| Asia | 2002 | 69.23 |
| Asia | 2007 | 70.73 |
| Europe | 1952 | 64.41 |
| Europe | 1957 | 66.70 |
| Europe | 1962 | 68.54 |
| Europe | 1967 | 69.74 |
| Europe | 1972 | 70.78 |
| Europe | 1977 | 71.94 |
| Europe | 1982 | 72.81 |
| Europe | 1987 | 73.64 |
| Europe | 1992 | 74.44 |
| Europe | 1997 | 75.51 |
| Europe | 2002 | 76.70 |
| Europe | 2007 | 77.65 |
| Oceania | 1952 | 69.25 |
| Oceania | 1957 | 70.30 |
| Oceania | 1962 | 71.09 |
| Oceania | 1967 | 71.31 |
| Oceania | 1972 | 71.91 |
| Oceania | 1977 | 72.85 |
| Oceania | 1982 | 74.29 |
| Oceania | 1987 | 75.32 |
| Oceania | 1992 | 76.94 |
| Oceania | 1997 | 78.19 |
| Oceania | 2002 | 79.74 |
| Oceania | 2007 | 80.72 |
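For the unfinished lessThanCount task, one possible way forward is to let R do the counting with a vectorised comparison: sum(lifeExp < bound) counts the values below the bound, and it can be applied to each continent and year group. The following is a sketch under assumptions, not a tested solution to the homework. It uses base R's aggregate() on a small made-up data frame (toy), and the names countBelow, bound and nLow are hypothetical.

```r
# Made-up data: two continents, two years, three countries each
toy <- data.frame(
  continent = rep(rep(c("A", "B"), each = 3), times = 2),
  year      = rep(c(1952, 2007), each = 6),
  lifeExp   = c(40, 50, 70, 60, 67, 72, 55, 66, 80, 65, 70, 75)
)

# Count how many values fall strictly below a bound that can be varied
countBelow <- function(lifeExp, bound) sum(lifeExp < bound)

# Count of low life expectancies per continent and year
lowCount <- aggregate(lifeExp ~ continent + year, data = toy,
                      FUN = function(v) countBelow(v, bound = 66))
names(lowCount)[3] <- "nLow"

# Mean per continent and year, then join the count on as an extra column
meanLife <- aggregate(lifeExp ~ continent + year, data = toy, FUN = mean)
names(meanLife)[3] <- "MeanGloLife"
result <- merge(meanLife, lowCount, by = c("continent", "year"))
print(result)

# With plyr loaded, the same idea would presumably be written as:
# ddply(gDat, ~continent + year, summarise, nLow = sum(lifeExp < 66))
```

Keeping the bound as an argument of countBelow means other cut-offs can be tried without rewriting the function.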