Presets:
Sys.setenv(lang = "EN")
library("lattice")
The following work is based on the work of Christian Okkels: code | report
All that I write here will appear within the quote box as this sentence itself. My work will appear at the end of each section.
This homework assignment considers various data aggregation tasks in R using the plyr library. The data are those of the Gapminder project (available here).
First, we load the necessary libraries:
## library(lattice) # not needed here, because we aren't doing plots.
library(plyr)
library(xtable)
Then we import the Gapminder data into the data.frame gDat:
commented the first line since I don't copy it to my directory.
# gDat <- read.delim('gapminderDataFiveYear.txt')
gDat <- read.delim("http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt")
A quick check that the data import went well is done by use of the str() function:
str(gDat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
From the results we see that we have to do with a data.frame with 1704 observations of 6 variables. The variables are: country, year, pop (population), continent, lifeExp (life expectancy), and gdpPercap (GDP per capita). The types of the variables and their contents seem to make sense.
Now we begin the data aggregation, in which we consider the following tasks:
In this task, we construct a data.frame with 3 variables: the continent, the minimum GDP per capita for that continent (among all the countries in it), and the maximum GDP per capita. This is achieved in plyr by use of the ddply() function:
maxminGDPperCapByCont <- ddply(gDat, ~continent, summarize, minGDPperCap = min(gdpPercap),
maxGDPperCap = max(gdpPercap))
str(maxminGDPperCapByCont)
## 'data.frame': 5 obs. of 3 variables:
## $ continent : Factor w/ 5 levels "Africa","Americas",..: 1 2 3 4 5
## $ minGDPperCap: num 241 1202 331 974 10040
## $ maxGDPperCap: num 21951 42952 113523 49357 34435
From the last command we see that the saved quantity maxminGDPperCapByCont is a data.frame itself. However, it only consists of 5 observations (corresponding to the 5 continents) of 3 variables (continent, minimum GDP per capita, and maximum GDP per capita). As it is such a short data.frama, we can as well show the full size. This can be done in a nice table as follows:
tableData <- xtable(maxminGDPperCapByCont)
print(tableData, type = "html", include.rownames = FALSE)
| continent | minGDPperCap | maxGDPperCap |
|---|---|---|
| Africa | 241.17 | 21951.21 |
| Americas | 1201.64 | 42951.65 |
| Asia | 331.00 | 113523.13 |
| Europe | 973.53 | 49357.19 |
| Oceania | 10039.60 | 34435.37 |
Right now, these results are sorted alphabetically according the continent. We can sort on e.g. the maximum GDP per capita using the arrange() function:
tableData <- arrange(tableData, maxGDPperCap)
tableData <- xtable(tableData)
print(tableData, type = "html", include.rownames = FALSE)
| continent | minGDPperCap | maxGDPperCap |
|---|---|---|
| Africa | 241.17 | 21951.21 |
| Oceania | 10039.60 | 34435.37 |
| Americas | 1201.64 | 42951.65 |
| Europe | 973.53 | 49357.19 |
| Asia | 331.00 | 113523.13 |
First, we notice that Asia is far ahead in the lead when it comes to the maximum GDP per capita, with a value more than twice as large as the second place. After Asia comes Europe and the Americas, which are very close. Then we have Oceania, and finally Asia. It is interesting to see that the continents rank very differently with respect to the maximum GDP per capita than they would have for the minimum GDP per capita. For instance, Asia–although ranking in first place for the maximum–still has the second lowest minimum GDP per capita. This indicates a very large spread across the continent's countries. The situation is less extreme for Europe and the Americas; these ranked 2nd and 3rd, respectively, with respect to the maximum, and they would rank 3rd and 2nd, respectively, if sorted by the minimum GDP per capita. Another rather extreme case is Oceania, which has the second lowest maximum GDP per capita. However, it springs to the eye that this is the continent with the largest minimum GDP per capita, with a value 8-10 times greater than the 2nd and 3rd place. Finally, there is Asia, which preserves its ranking; it has both the lowest maximum and the lowest minimum GDP per capita.
Here I provide a strip plot where continents are ordered by the difference between the minimum and the maximum of GDP per capita:
maxminGDPperCapByCont.Reorder = maxminGDPperCapByCont[order(maxminGDPperCapByCont$maxGDPperCap -
maxminGDPperCapByCont$minGDPperCap), ]
maxminGDPperCapByCont.Reorder = within(maxminGDPperCapByCont.Reorder, continent <- reorder(continent,
maxGDPperCap - minGDPperCap))
stripplot(maxGDPperCap + minGDPperCap ~ continent, maxminGDPperCapByCont.Reorder,
type = "l", auto.key = TRUE)
Now it is clear that the difference between maximum and minimum is more affected by maximum than minimum.
In this task we consider the spread of the GDP per capita within the continents. The different measures of spread are:
sd() and var()mad()IQR()We now create a data.frame containing these measures of the spread of the GDP per capita for each continent. This is again done by means of ddply(), where we just compute even more things than above.
spreadGDPperCapByCont <- ddply(gDat, ~continent, summarize, sdGDPperCap = sd(gdpPercap),
varGDPperCap = var(gdpPercap), madGDPperCap = mad(gdpPercap), iqrGDPperCap = IQR(gdpPercap))
str(spreadGDPperCapByCont)
## 'data.frame': 5 obs. of 5 variables:
## $ continent : Factor w/ 5 levels "Africa","Americas",..: 1 2 3 4 5
## $ sdGDPperCap : num 2828 6397 14045 9355 6359
## $ varGDPperCap: num 8.00e+06 4.09e+07 1.97e+08 8.75e+07 4.04e+07
## $ madGDPperCap: num 775 3269 2821 8846 6459
## $ iqrGDPperCap: num 1616 4402 7492 13248 8072
# spreadGDPperCapByCont
Here, we also see that the saved variable spreadGDPperCapByCont retains the data.frame structure. It contains 5 variables; the continent and the four measures of spread of the GDP per capita.
Let's show the results in a table:
tableData <- xtable(spreadGDPperCapByCont)
print(tableData, type = "html", include.rownames = FALSE)
| continent | sdGDPperCap | varGDPperCap | madGDPperCap | iqrGDPperCap |
|---|---|---|---|---|
| Africa | 2827.93 | 7997187.31 | 775.32 | 1616.17 |
| Americas | 6396.76 | 40918591.10 | 3269.33 | 4402.43 |
| Asia | 14045.37 | 197272505.85 | 2820.83 | 7492.26 |
| Europe | 9355.21 | 87520019.60 | 8846.05 | 13248.30 |
| Oceania | 6358.98 | 40436668.87 | 6459.10 | 8072.26 |
Again, this is sorted in the default alphabetical order. Sorting on the standard deviation gives us:
tableData <- arrange(tableData, sdGDPperCap)
tableData <- xtable(tableData)
print(tableData, type = "html", include.rownames = FALSE)
| continent | sdGDPperCap | varGDPperCap | madGDPperCap | iqrGDPperCap |
|---|---|---|---|---|
| Africa | 2827.93 | 7997187.31 | 775.32 | 1616.17 |
| Oceania | 6358.98 | 40436668.87 | 6459.10 | 8072.26 |
| Americas | 6396.76 | 40918591.10 | 3269.33 | 4402.43 |
| Europe | 9355.21 | 87520019.60 | 8846.05 | 13248.30 |
| Asia | 14045.37 | 197272505.85 | 2820.83 | 7492.26 |
From highest to lowest standard deviation of the GDP per capita, the continents are thus: Asia, Europe, Americas, Oceania, and Africa. In fact, the task from above (with the minimum and maximum GDP per capita) gave some indication of this ranking. We saw there that Asia–although having the largest maximum by far–had the second lowest minimum. This indicates a very large spread, and that is exactly what wee see in the table above. In the task above, Asia was the continent with both the lowest minimum and maximum, indicating a low spread. And this is also precisely what the table above tells us. Now, the continents obviously rank the same with respect to the standard deviation and the variance, seeing that the latter is merely the square of the former. However, it is interesting to notice that this ranking order is, in most cases, not retained for the two other measures of spread. For instance, Europe has the highest spread in GDP per capita when looking at the mean absolute deviation and the interquartile range. And Asia, with the highest value for the standard deviation, is pushed back to the second last place when considering the MAD measure. It is also interesting to see that Oceania and the Americas are ever so close with respect to the standard deviation, but differ wildly–up to a factor of 2–for the MAD and IQR measures. Africa, on the other hand, is the only continent that retains its position, having the lowest spread of GDP per capita across all of the different meaures considered. What we can gather qualitatively from this is, of course, that all of the countries in Africa are relatively poor, whereas the other continents generally have both poor and much richer countries.
Here I provide a bar chart for standard deviation and IQR where countries are ordered by standard deviation:
spreadGDPperCapByCont.Reorder = within(spreadGDPperCapByCont, continent <- reorder(continent,
sdGDPperCap))
barchart(sdGDPperCap + iqrGDPperCap ~ continent, spreadGDPperCapByCont.Reorder,
auto.key = TRUE)
Comparing SD and IQR, we can guess that the large SD of Asia is due to the presence of outliers and the large SD of Europe is not due to outliers. To verify this, a violin plot is provided:
bwplot(gdpPercap ~ continent, within(gDat, continent <- reorder(continent, gdpPercap,
FUN = sd)), panel = panel.violin)
The plot verifies our guess that Europe has large variation and yet has few outliers.
This task concerns the average life expectancy for different years. We will also consider an alternative measure, namely the so-called trimmed, or truncated, mean.
First, we find the average life expectancy (averaged over all countries) for each year:
avgLifeExpByYear <- ddply(gDat, ~year, summarize, avgLifeExp = mean(lifeExp))
str(avgLifeExpByYear)
## 'data.frame': 12 obs. of 2 variables:
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ avgLifeExp: num 49.1 51.5 53.6 55.7 57.6 ...
The last command gives us a feel of the data in avgLifeExpByYear. Evidently, we have a data.frame with 12 observations (corresponding to the 12 years, which are separated by 5 year gaps) and two variables. The variables are, of course, the year and the average life expectancy.
The results are shown in the table below.
tableData <- xtable(avgLifeExpByYear)
print(tableData, type = "html", include.rownames = FALSE)
| year | avgLifeExp |
|---|---|
| 1952 | 49.06 |
| 1957 | 51.51 |
| 1962 | 53.61 |
| 1967 | 55.68 |
| 1972 | 57.65 |
| 1977 | 59.57 |
| 1982 | 61.53 |
| 1987 | 63.21 |
| 1992 | 64.16 |
| 1997 | 65.01 |
| 2002 | 65.69 |
| 2007 | 67.01 |
We observe a clear trend in these data; the average life expectancy (across all countries in the world) has increased throughout the years. But not only that; looking more closely at the values, we see that the increases in the 5-year time periods are roughly 2 for the first many elements. This pattern of constant increases is broken, however, when we get to about 1990; here, the increases in the average life expectancy start to sort of flatten out, with differences in value between two consecutive periods of below 1. The increase seems to pick up again, though, between the last two data points, i.e. between the years 2002 and 2007.
Next, we will apply the trimmed, or truncated mean. This is another statistical measure of so-called central tendency, much like the regular mean and the median. As the name suggests, the trimmed mean calculates the mean after “trimming” the data sample by discarding a certain amount of entries at the high and low end.
The trimmed mean is included in R through the trim() argument in the function mean(). The trim factor, which determines the amount of data entries to be discarded, is a fraction between 0 and 0.5. A trim factor of zero thus reduces the trimmed mean to the regular mean, whereas a trim factor of e.g. 0.4 implies that 40% of observations are discarded from each end of the data before the mean is computed.
Usually, [for most statistical applications, 5 to 25 percent of the ends are discarded][wikipedia:trimmed-mean], corresponding to a trim factor between 0.05 and 0.25.
Using a trim factor of 0.1 to discard 10% of the data from each end gives:
avgLifeExpByYear <- ddply(gDat, ~year, summarize, avgLifeExp = mean(lifeExp,
trim = 0.1))
str(avgLifeExpByYear)
## 'data.frame': 12 obs. of 2 variables:
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ avgLifeExp: num 48.6 51.3 53.6 55.9 58 ...
Let's show these data in a table:
tableData <- xtable(avgLifeExpByYear)
print(tableData, type = "html", include.rownames = FALSE)
| year | avgLifeExp |
|---|---|
| 1952 | 48.58 |
| 1957 | 51.27 |
| 1962 | 53.58 |
| 1967 | 55.87 |
| 1972 | 58.01 |
| 1977 | 60.10 |
| 1982 | 62.12 |
| 1987 | 63.92 |
| 1992 | 65.19 |
| 1997 | 66.02 |
| 2002 | 66.72 |
| 2007 | 68.11 |
Evidently, the trend remains the same; the average life expectancy increases with time. The numbers, however, are not the same as above. Let's make it easier to compare the regular mean of the life expectancy with the trimmed mean by creating a data.frame containing both measures as variables.
avgLifeExpByYear <- ddply(gDat, ~year, summarize, avgLifeExp = mean(lifeExp),
avgTrimLifeExp = mean(lifeExp, trim = 0.1))
The data are shown in the table below.
tableData <- xtable(avgLifeExpByYear)
print(tableData, type = "html", include.rownames = FALSE)
| year | avgLifeExp | avgTrimLifeExp |
|---|---|---|
| 1952 | 49.06 | 48.58 |
| 1957 | 51.51 | 51.27 |
| 1962 | 53.61 | 53.58 |
| 1967 | 55.68 | 55.87 |
| 1972 | 57.65 | 58.01 |
| 1977 | 59.57 | 60.10 |
| 1982 | 61.53 | 62.12 |
| 1987 | 63.21 | 63.92 |
| 1992 | 64.16 | 65.19 |
| 1997 | 65.01 | 66.02 |
| 2002 | 65.69 | 66.72 |
| 2007 | 67.01 | 68.11 |
Although they are naturally not equal, the mean values for each year are rather close. The general trend of life expectancy increasing with time also remains the same. Using the trimmed mean also results in increases of magnitude roughly 2. Like it was for the regular mean, however, this trend flattens out a bit around 1992, after which the increases are smaller. It does seem to pick up again for the newest data, though we would need more observations to properly comment on whether the trend increases or flattens out.
The most appropriate plot for this section is of course xyplot:
xyplot(avgLifeExp + avgTrimLifeExp ~ year, avgLifeExpByYear, type = "l")
I also try xyplot for life expectancy of some countries, which is sometimes a good way to see both the trend and the variation over the years.
set.seed(100)
sampleCont = sample(unique(gDat$country), 40)
xyplot(lifeExp ~ year, subset(gDat, country %in% sampleCont), group = country,
type = "l")
According to the plot, there are some countries whose life expectancy is greatly dropped after 1990, which makes the mean life expectancy grow slower after 1990. Also, because of some countries who have low life expentancy throughout the years, the trimmed mean does not catch this drop well. In other words, the trimmed mean fails to throw away the countries with great drop, which just leads to roughly the same value as untrimmed mean, therefore providing no meaningful information.
This task considers the question of how life expectancy changes over time on different continents.
Thus, compared to the above task, we do not just want the average life expectancy (across all countries) for each year; rather, we want the average life expectancy (across the countries for each respective continent) for each year.
In plyr, this can be achieved by adding ~ continent + year (rather than just ~ continent) in the ddply() function call.
avgLifeExpByContAndYear <- ddply(gDat, ~continent + year, summarize, avgLifeExp = mean(lifeExp))
Showing these results requires a slightly longer table:
tableData <- xtable(avgLifeExpByContAndYear)
print(tableData, type = "html", include.rownames = FALSE)
| continent | year | avgLifeExp |
|---|---|---|
| Africa | 1952 | 39.14 |
| Africa | 1957 | 41.27 |
| Africa | 1962 | 43.32 |
| Africa | 1967 | 45.33 |
| Africa | 1972 | 47.45 |
| Africa | 1977 | 49.58 |
| Africa | 1982 | 51.59 |
| Africa | 1987 | 53.34 |
| Africa | 1992 | 53.63 |
| Africa | 1997 | 53.60 |
| Africa | 2002 | 53.33 |
| Africa | 2007 | 54.81 |
| Americas | 1952 | 53.28 |
| Americas | 1957 | 55.96 |
| Americas | 1962 | 58.40 |
| Americas | 1967 | 60.41 |
| Americas | 1972 | 62.39 |
| Americas | 1977 | 64.39 |
| Americas | 1982 | 66.23 |
| Americas | 1987 | 68.09 |
| Americas | 1992 | 69.57 |
| Americas | 1997 | 71.15 |
| Americas | 2002 | 72.42 |
| Americas | 2007 | 73.61 |
| Asia | 1952 | 46.31 |
| Asia | 1957 | 49.32 |
| Asia | 1962 | 51.56 |
| Asia | 1967 | 54.66 |
| Asia | 1972 | 57.32 |
| Asia | 1977 | 59.61 |
| Asia | 1982 | 62.62 |
| Asia | 1987 | 64.85 |
| Asia | 1992 | 66.54 |
| Asia | 1997 | 68.02 |
| Asia | 2002 | 69.23 |
| Asia | 2007 | 70.73 |
| Europe | 1952 | 64.41 |
| Europe | 1957 | 66.70 |
| Europe | 1962 | 68.54 |
| Europe | 1967 | 69.74 |
| Europe | 1972 | 70.78 |
| Europe | 1977 | 71.94 |
| Europe | 1982 | 72.81 |
| Europe | 1987 | 73.64 |
| Europe | 1992 | 74.44 |
| Europe | 1997 | 75.51 |
| Europe | 2002 | 76.70 |
| Europe | 2007 | 77.65 |
| Oceania | 1952 | 69.25 |
| Oceania | 1957 | 70.30 |
| Oceania | 1962 | 71.09 |
| Oceania | 1967 | 71.31 |
| Oceania | 1972 | 71.91 |
| Oceania | 1977 | 72.85 |
| Oceania | 1982 | 74.29 |
| Oceania | 1987 | 75.32 |
| Oceania | 1992 | 76.94 |
| Oceania | 1997 | 78.19 |
| Oceania | 2002 | 79.74 |
| Oceania | 2007 | 80.72 |
For Africa, the general trend for the first many time periods is that the average life expectancy increases. However, something sad–but interesting from a statistical point of view–happens after 1987. The average life expectancy flattens out. In fact, from 1992-1997 and 1997-2002, the average life expectancy decreases! However, it increases again in the period 2002-2007. The average life expectancy over time for the Americas shows the same general trend. Here, however, there is no real flattening, nor any decreases; just a steady increase throughout the years. It is also interesting to notice the magnitude of the average life expectancy for e.g. Africa and the Americas; by 2007, the average life expectancy for Africa has only reached about the same level as the Americas had in 1952! Asia also undergoes a gradual increase in life expectancy over time. It starts lower than the Americas, but the increases are larger, and by 2007 Asia has almost reached the same level as the Americas. The average life expectancy in Europe has also increased with time. But the increases are small; Europe starts out with a very high life expectancy (much higher than what Africa had in 2007) and ends in 2007 at an even higher level. But the increases are still small; whereas e.g. Asia goes from roughly 46 to 71 (an increase of 25) from 1952-2007, Europe goes from 64 to 78 (an increase of only 14). Lastly, there is Oceania. Here, we also observe an increasing trend. The average life expectancy starts out in 1952 at an even higher level than Europe did in the same year (and actually at about the same level that Asia had reached in 2007), and it also ends even higher than Europe by 2007. Oceania is thus the continent with the largest initial (in the year 1952) and final (in 2007) average life expectancy. But it is also the continent with the smallest total increase; the increase from 1952 to 2007 is merely about 11—even lower than Europe.
On a final note, it is both a timely and a tiring task to assess these trends from the data.frame and the table of data. It would be much better, easier, and informative to visualize the average life expectancy over time for each continent as five different plots in the same figure.
Again, xyplot is the most appropriate plot for this section.
xyplot(avgLifeExp ~ year, avgLifeExpByContAndYear, group = continent, type = "l",
auto.key = TRUE)
Combining this plot with the previous section, some African countries seem to drive the slow growth in the life expectancy after 1990. If we exclude Africa, the mean life expectancy over the years becomes:
avgLifeExpByYear.NAfrica <- ddply(subset(gDat, continent != "Africa"), ~year,
summarize, avgLifeExp = mean(lifeExp), avgTrimLifeExp = mean(lifeExp, trim = 0.1))
xyplot(avgLifeExp + avgTrimLifeExp ~ year, avgLifeExpByYear.NAfrica, type = "l")
Now the plot seems to exhibit a linear trend throughout the observation period, which means that we should look at African countries if we are concerned with low life expectancy growth after 1990.
In this task we will count the number of countries with a life expectancy lower than some threshold, for each year. We will do this in a few short steps.
First, we consider the subset of gDat for which ´lifeExp´ is less than a certain threshold (this number can be e.g. the mean life expectancy, or any other other value).
lifeExpThreshold = 35
jDat <- subset(gDat, lifeExp < lifeExpThreshold)
Then we can use this data.frame jDat as input to ddply(). In the call to ddply() we also specify that we want to consider each continent and each year. Moreover, we define a variable that holds the number of countries in that continent.
The result gives us the number of countries with a life expectancy lower than the specified threshold over time and by continent.
nCountriesWithLowLifeExpOverTime <- ddply(jDat, ~continent + year, summarize,
nCountriesWithLowLifeExp = length(unique(country)))
str(nCountriesWithLowLifeExpOverTime)
## 'data.frame': 10 obs. of 3 variables:
## $ continent : Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 3 3 3 3 3
## $ year : int 1952 1957 1962 1967 1992 1952 1957 1962 1967 1977
## $ nCountriesWithLowLifeExp: int 12 8 4 1 1 2 2 1 1 1
Of course, the new data.frame jDat didn't have to specified. The computations can be put into a single command:
nCountriesWithLowLifeExpOverTime <- ddply(subset(gDat, lifeExp < 35), ~continent +
year, summarize, nCountriesWithLowLifeExp = length(unique(country)))
The results (using 35 as the threshold/benchmark value) are shown in the table below.
tableData <- xtable(nCountriesWithLowLifeExpOverTime)
print(tableData, type = "html", include.rownames = FALSE)
| continent | year | nCountriesWithLowLifeExp |
|---|---|---|
| Africa | 1952 | 12 |
| Africa | 1957 | 8 |
| Africa | 1962 | 4 |
| Africa | 1967 | 1 |
| Africa | 1992 | 1 |
| Asia | 1952 | 2 |
| Asia | 1957 | 2 |
| Asia | 1962 | 1 |
| Asia | 1967 | 1 |
| Asia | 1977 | 1 |
First, we notice that only Africa and Asia are represented. As we saw in the above task, these were also the continents with the lowest average life expectancy over time. Now, from the last column—the number of countries in the continent with a life expectancy lower than the benchmark, for a given year—we observe a decreasin trend. Fortunately, the number of countries with a life expectancy lower than the threshold decreases with time. Of course, as there are so few countries in Asia represented here, this is mostly relevant for Africa. Let's try to increase the threshold value to 50 to get some more data:
nCountriesWithLowLifeExpOverTime <- ddply(subset(gDat, lifeExp < 50), ~continent +
year, summarize, nCountriesWithLowLifeExp = length(unique(country)))
tableData <- xtable(nCountriesWithLowLifeExpOverTime)
print(tableData, type = "html", include.rownames = FALSE)
| continent | year | nCountriesWithLowLifeExp |
|---|---|---|
| Africa | 1952 | 50 |
| Africa | 1957 | 49 |
| Africa | 1962 | 47 |
| Africa | 1967 | 39 |
| Africa | 1972 | 36 |
| Africa | 1977 | 28 |
| Africa | 1982 | 24 |
| Africa | 1987 | 20 |
| Africa | 1992 | 20 |
| Africa | 1997 | 20 |
| Africa | 2002 | 22 |
| Africa | 2007 | 18 |
| Americas | 1952 | 9 |
| Americas | 1957 | 8 |
| Americas | 1962 | 6 |
| Americas | 1967 | 2 |
| Americas | 1972 | 2 |
| Americas | 1977 | 1 |
| Asia | 1952 | 22 |
| Asia | 1957 | 18 |
| Asia | 1962 | 17 |
| Asia | 1967 | 12 |
| Asia | 1972 | 6 |
| Asia | 1977 | 5 |
| Asia | 1982 | 3 |
| Asia | 1987 | 1 |
| Asia | 1992 | 1 |
| Asia | 1997 | 1 |
| Asia | 2002 | 1 |
| Asia | 2007 | 1 |
| Europe | 1952 | 1 |
| Europe | 1957 | 1 |
Now we also have a few data entries for the Americas and Europe. And the numbers of countries in the last column are also larger. There was only a single country in Europe with a life expectancy lower than 50 in 1952 and 1957. For the later years, there are none. For the Americas, the trend is a decrease. And after 1977, there are not any countries here with a life expectancy lower than 50 either. The trends for Asia and Africa are the same; the number of countries with a life expectancy lower than 50 decreases with time. Asia only has one such country from 1987 and all the way up to the most recent time. Africa, on the other, still has as many as 18 such countries in 2007. Looking back in time, though, this is still better than the initial number of 50 such countries in 1952.
Finally, it is not much fun to assess these trends from the long table of data above. It would be much better to visualize the data for the different continents by plotting the number of countries against time. This would make trends more visible, allow for easier comparisons between continents, and also make the data more understandable for others.
xyplot for the data with life expectancy threshold 50 is as follows:
xyplot(nCountriesWithLowLifeExp ~ year, nCountriesWithLowLifeExpOverTime, group = continent,
type = "l", auto.key = TRUE, )
Removing the data points with no country whose life expectancy is below 50 was really good with respect to data visualization because we can clearly see in the plot when there is no country whose life expectancy is below 50 for each continent. Also, because of treating zero as missing, we can see without paying much attention that there are still some countries in Asia with low life expectancy although the number of such countries is close to 0.