stat545a-2013-hw04_lee-woo

Presets:

Sys.setenv(lang = "EN")
library("lattice")

The following work is based on the work of Christian Okkels: code | report

All that I write here will appear within the quote box as this sentence itself. My work will appear at the end of each section.

Introduction & loading of libraries and data

This homework assignment considers various data aggregation tasks in R using the plyr library. The data are those of the Gapminder project (available here).

First, we load the necessary libraries:

## library(lattice) # not needed here, because we aren't doing plots.
library(plyr)
library(xtable)

Then we import the Gapminder data into the data.frame gDat:

commented the first line since I don't copy it to my directory.

# gDat <- read.delim('gapminderDataFiveYear.txt')
gDat <- read.delim("http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt")

A quick check that the data import went well is done by use of the str() function:

str(gDat)

## 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ pop      : num  8425333 9240934 10267083 11537966 13079460 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

From the results we see that we have to do with a data.frame with 1704 observations of 6 variables. The variables are: country, year, pop (population), continent, lifeExp (life expectancy), and gdpPercap (GDP per capita). The types of the variables and their contents seem to make sense.

Now we begin the data aggregation, in which we consider the following tasks:

Maximum and minimum GDP per capita for all continents (“wide format”)
The spread of GDP per capita within the continents
A trimmed mean of life expectancy by year
The change of life expectancy over time for different continents
Counting the number of countries with low life expectancy over time by continent

Maximum and minimum GDP per capita for all continents (“wide format”)

In this task, we construct a data.frame with 3 variables: the continent, the minimum GDP per capita for that continent (among all the countries in it), and the maximum GDP per capita. This is achieved in plyr by use of the ddply() function:

maxminGDPperCapByCont <- ddply(gDat, ~continent, summarize, minGDPperCap = min(gdpPercap), 
    maxGDPperCap = max(gdpPercap))
str(maxminGDPperCapByCont)

## 'data.frame':    5 obs. of  3 variables:
##  $ continent   : Factor w/ 5 levels "Africa","Americas",..: 1 2 3 4 5
##  $ minGDPperCap: num  241 1202 331 974 10040
##  $ maxGDPperCap: num  21951 42952 113523 49357 34435

From the last command we see that the saved quantity maxminGDPperCapByCont is a data.frame itself. However, it only consists of 5 observations (corresponding to the 5 continents) of 3 variables (continent, minimum GDP per capita, and maximum GDP per capita). As it is such a short data.frama, we can as well show the full size. This can be done in a nice table as follows:

tableData <- xtable(maxminGDPperCapByCont)
print(tableData, type = "html", include.rownames = FALSE)

continent	minGDPperCap	maxGDPperCap
Africa	241.17	21951.21
Americas	1201.64	42951.65
Asia	331.00	113523.13
Europe	973.53	49357.19
Oceania	10039.60	34435.37

Right now, these results are sorted alphabetically according the continent. We can sort on e.g. the maximum GDP per capita using the arrange() function:

tableData <- arrange(tableData, maxGDPperCap)
tableData <- xtable(tableData)
print(tableData, type = "html", include.rownames = FALSE)

continent	minGDPperCap	maxGDPperCap
Africa	241.17	21951.21
Oceania	10039.60	34435.37
Americas	1201.64	42951.65
Europe	973.53	49357.19
Asia	331.00	113523.13

First, we notice that Asia is far ahead in the lead when it comes to the maximum GDP per capita, with a value more than twice as large as the second place. After Asia comes Europe and the Americas, which are very close. Then we have Oceania, and finally Asia. It is interesting to see that the continents rank very differently with respect to the maximum GDP per capita than they would have for the minimum GDP per capita. For instance, Asia–although ranking in first place for the maximum–still has the second lowest minimum GDP per capita. This indicates a very large spread across the continent's countries. The situation is less extreme for Europe and the Americas; these ranked 2nd and 3rd, respectively, with respect to the maximum, and they would rank 3rd and 2nd, respectively, if sorted by the minimum GDP per capita. Another rather extreme case is Oceania, which has the second lowest maximum GDP per capita. However, it springs to the eye that this is the continent with the largest minimum GDP per capita, with a value 8-10 times greater than the 2nd and 3rd place. Finally, there is Asia, which preserves its ranking; it has both the lowest maximum and the lowest minimum GDP per capita.

Here I provide a strip plot where continents are ordered by the difference between the minimum and the maximum of GDP per capita:

maxminGDPperCapByCont.Reorder = maxminGDPperCapByCont[order(maxminGDPperCapByCont$maxGDPperCap - 
    maxminGDPperCapByCont$minGDPperCap), ]
maxminGDPperCapByCont.Reorder = within(maxminGDPperCapByCont.Reorder, continent <- reorder(continent, 
    maxGDPperCap - minGDPperCap))

stripplot(maxGDPperCap + minGDPperCap ~ continent, maxminGDPperCapByCont.Reorder, 
    type = "l", auto.key = TRUE)

plot of chunk unnamed-chunk-7

Now it is clear that the difference between maximum and minimum is more affected by maximum than minimum.

The spread of GDP per capita within the continents

In this task we consider the spread of the GDP per capita within the continents. The different measures of spread are:

standard deviation or variance: sd() and var()
median absolute deviation: mad()
interquartile range: IQR()

We now create a data.frame containing these measures of the spread of the GDP per capita for each continent. This is again done by means of ddply(), where we just compute even more things than above.

spreadGDPperCapByCont <- ddply(gDat, ~continent, summarize, sdGDPperCap = sd(gdpPercap), 
    varGDPperCap = var(gdpPercap), madGDPperCap = mad(gdpPercap), iqrGDPperCap = IQR(gdpPercap))
str(spreadGDPperCapByCont)

## 'data.frame':    5 obs. of  5 variables:
##  $ continent   : Factor w/ 5 levels "Africa","Americas",..: 1 2 3 4 5
##  $ sdGDPperCap : num  2828 6397 14045 9355 6359
##  $ varGDPperCap: num  8.00e+06 4.09e+07 1.97e+08 8.75e+07 4.04e+07
##  $ madGDPperCap: num  775 3269 2821 8846 6459
##  $ iqrGDPperCap: num  1616 4402 7492 13248 8072

# spreadGDPperCapByCont

Here, we also see that the saved variable spreadGDPperCapByCont retains the data.frame structure. It contains 5 variables; the continent and the four measures of spread of the GDP per capita. Let's show the results in a table:

tableData <- xtable(spreadGDPperCapByCont)
print(tableData, type = "html", include.rownames = FALSE)

continent	sdGDPperCap	varGDPperCap	madGDPperCap	iqrGDPperCap
Africa	2827.93	7997187.31	775.32	1616.17
Americas	6396.76	40918591.10	3269.33	4402.43
Asia	14045.37	197272505.85	2820.83	7492.26
Europe	9355.21	87520019.60	8846.05	13248.30
Oceania	6358.98	40436668.87	6459.10	8072.26

Again, this is sorted in the default alphabetical order. Sorting on the standard deviation gives us:

tableData <- arrange(tableData, sdGDPperCap)
tableData <- xtable(tableData)
print(tableData, type = "html", include.rownames = FALSE)

continent	sdGDPperCap	varGDPperCap	madGDPperCap	iqrGDPperCap
Africa	2827.93	7997187.31	775.32	1616.17
Oceania	6358.98	40436668.87	6459.10	8072.26
Americas	6396.76	40918591.10	3269.33	4402.43
Europe	9355.21	87520019.60	8846.05	13248.30
Asia	14045.37	197272505.85	2820.83	7492.26

From highest to lowest standard deviation of the GDP per capita, the continents are thus: Asia, Europe, Americas, Oceania, and Africa. In fact, the task from above (with the minimum and maximum GDP per capita) gave some indication of this ranking. We saw there that Asia–although having the largest maximum by far–had the second lowest minimum. This indicates a very large spread, and that is exactly what wee see in the table above. In the task above, Asia was the continent with both the lowest minimum and maximum, indicating a low spread. And this is also precisely what the table above tells us. Now, the continents obviously rank the same with respect to the standard deviation and the variance, seeing that the latter is merely the square of the former. However, it is interesting to notice that this ranking order is, in most cases, not retained for the two other measures of spread. For instance, Europe has the highest spread in GDP per capita when looking at the mean absolute deviation and the interquartile range. And Asia, with the highest value for the standard deviation, is pushed back to the second last place when considering the MAD measure. It is also interesting to see that Oceania and the Americas are ever so close with respect to the standard deviation, but differ wildly–up to a factor of 2–for the MAD and IQR measures. Africa, on the other hand, is the only continent that retains its position, having the lowest spread of GDP per capita across all of the different meaures considered. What we can gather qualitatively from this is, of course, that all of the countries in Africa are relatively poor, whereas the other continents generally have both poor and much richer countries.

Here I provide a bar chart for standard deviation and IQR where countries are ordered by standard deviation:

spreadGDPperCapByCont.Reorder = within(spreadGDPperCapByCont, continent <- reorder(continent, 
    sdGDPperCap))
barchart(sdGDPperCap + iqrGDPperCap ~ continent, spreadGDPperCapByCont.Reorder, 
    auto.key = TRUE)

plot of chunk unnamed-chunk-11

Comparing SD and IQR, we can guess that the large SD of Asia is due to the presence of outliers and the large SD of Europe is not due to outliers. To verify this, a violin plot is provided:

bwplot(gdpPercap ~ continent, within(gDat, continent <- reorder(continent, gdpPercap, 
    FUN = sd)), panel = panel.violin)

plot of chunk unnamed-chunk-12

The plot verifies our guess that Europe has large variation and yet has few outliers.

A trimmed mean of life expectancy by year

This task concerns the average life expectancy for different years. We will also consider an alternative measure, namely the so-called trimmed, or truncated, mean.

First, we find the average life expectancy (averaged over all countries) for each year:

avgLifeExpByYear <- ddply(gDat, ~year, summarize, avgLifeExp = mean(lifeExp))
str(avgLifeExpByYear)

## 'data.frame':    12 obs. of  2 variables:
##  $ year      : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ avgLifeExp: num  49.1 51.5 53.6 55.7 57.6 ...

The last command gives us a feel of the data in avgLifeExpByYear. Evidently, we have a data.frame with 12 observations (corresponding to the 12 years, which are separated by 5 year gaps) and two variables. The variables are, of course, the year and the average life expectancy. The results are shown in the table below.

tableData <- xtable(avgLifeExpByYear)
print(tableData, type = "html", include.rownames = FALSE)

year	avgLifeExp
1952	49.06
1957	51.51
1962	53.61
1967	55.68
1972	57.65
1977	59.57
1982	61.53
1987	63.21
1992	64.16
1997	65.01
2002	65.69
2007	67.01

We observe a clear trend in these data; the average life expectancy (across all countries in the world) has increased throughout the years. But not only that; looking more closely at the values, we see that the increases in the 5-year time periods are roughly 2 for the first many elements. This pattern of constant increases is broken, however, when we get to about 1990; here, the increases in the average life expectancy start to sort of flatten out, with differences in value between two consecutive periods of below 1. The increase seems to pick up again, though, between the last two data points, i.e. between the years 2002 and 2007.

Next, we will apply the trimmed, or truncated mean. This is another statistical measure of so-called central tendency, much like the regular mean and the median. As the name suggests, the trimmed mean calculates the mean after “trimming” the data sample by discarding a certain amount of entries at the high and low end. The trimmed mean is included in R through the trim() argument in the function mean(). The trim factor, which determines the amount of data entries to be discarded, is a fraction between 0 and 0.5. A trim factor of zero thus reduces the trimmed mean to the regular mean, whereas a trim factor of e.g. 0.4 implies that 40% of observations are discarded from each end of the data before the mean is computed. Usually, [for most statistical applications, 5 to 25 percent of the ends are discarded][wikipedia:trimmed-mean], corresponding to a trim factor between 0.05 and 0.25.

Using a trim factor of 0.1 to discard 10% of the data from each end gives:

avgLifeExpByYear <- ddply(gDat, ~year, summarize, avgLifeExp = mean(lifeExp, 
    trim = 0.1))
str(avgLifeExpByYear)

## 'data.frame':    12 obs. of  2 variables:
##  $ year      : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ avgLifeExp: num  48.6 51.3 53.6 55.9 58 ...

Let's show these data in a table:

tableData <- xtable(avgLifeExpByYear)
print(tableData, type = "html", include.rownames = FALSE)

year	avgLifeExp
1952	48.58
1957	51.27
1962	53.58
1967	55.87
1972	58.01
1977	60.10
1982	62.12
1987	63.92
1992	65.19
1997	66.02
2002	66.72
2007	68.11

Evidently, the trend remains the same; the average life expectancy increases with time. The numbers, however, are not the same as above. Let's make it easier to compare the regular mean of the life expectancy with the trimmed mean by creating a data.frame containing both measures as variables.

avgLifeExpByYear <- ddply(gDat, ~year, summarize, avgLifeExp = mean(lifeExp), 
    avgTrimLifeExp = mean(lifeExp, trim = 0.1))

The data are shown in the table below.

tableData <- xtable(avgLifeExpByYear)
print(tableData, type = "html", include.rownames = FALSE)

year	avgLifeExp	avgTrimLifeExp
1952	49.06	48.58
1957	51.51	51.27
1962	53.61	53.58
1967	55.68	55.87
1972	57.65	58.01
1977	59.57	60.10
1982	61.53	62.12
1987	63.21	63.92
1992	64.16	65.19
1997	65.01	66.02
2002	65.69	66.72
2007	67.01	68.11

Although they are naturally not equal, the mean values for each year are rather close. The general trend of life expectancy increasing with time also remains the same. Using the trimmed mean also results in increases of magnitude roughly 2. Like it was for the regular mean, however, this trend flattens out a bit around 1992, after which the increases are smaller. It does seem to pick up again for the newest data, though we would need more observations to properly comment on whether the trend increases or flattens out.

The most appropriate plot for this section is of course xyplot:

xyplot(avgLifeExp + avgTrimLifeExp ~ year, avgLifeExpByYear, type = "l")

plot of chunk unnamed-chunk-19

I also try xyplot for life expectancy of some countries, which is sometimes a good way to see both the trend and the variation over the years.

set.seed(100)
sampleCont = sample(unique(gDat$country), 40)
xyplot(lifeExp ~ year, subset(gDat, country %in% sampleCont), group = country, 
    type = "l")

plot of chunk unnamed-chunk-20

According to the plot, there are some countries whose life expectancy is greatly dropped after 1990, which makes the mean life expectancy grow slower after 1990. Also, because of some countries who have low life expentancy throughout the years, the trimmed mean does not catch this drop well. In other words, the trimmed mean fails to throw away the countries with great drop, which just leads to roughly the same value as untrimmed mean, therefore providing no meaningful information.

The change of life expectancy over time for different continents

This task considers the question of how life expectancy changes over time on different continents. Thus, compared to the above task, we do not just want the average life expectancy (across all countries) for each year; rather, we want the average life expectancy (across the countries for each respective continent) for each year. In plyr, this can be achieved by adding ~ continent + year (rather than just ~ continent) in the ddply() function call.

avgLifeExpByContAndYear <- ddply(gDat, ~continent + year, summarize, avgLifeExp = mean(lifeExp))

Showing these results requires a slightly longer table:

tableData <- xtable(avgLifeExpByContAndYear)
print(tableData, type = "html", include.rownames = FALSE)

continent	year	avgLifeExp
Africa	1952	39.14
Africa	1957	41.27
Africa	1962	43.32
Africa	1967	45.33
Africa	1972	47.45
Africa	1977	49.58
Africa	1982	51.59
Africa	1987	53.34
Africa	1992	53.63
Africa	1997	53.60
Africa	2002	53.33
Africa	2007	54.81
Americas	1952	53.28
Americas	1957	55.96
Americas	1962	58.40
Americas	1967	60.41
Americas	1972	62.39
Americas	1977	64.39
Americas	1982	66.23
Americas	1987	68.09
Americas	1992	69.57
Americas	1997	71.15
Americas	2002	72.42
Americas	2007	73.61
Asia	1952	46.31
Asia	1957	49.32
Asia	1962	51.56
Asia	1967	54.66
Asia	1972	57.32
Asia	1977	59.61
Asia	1982	62.62
Asia	1987	64.85
Asia	1992	66.54
Asia	1997	68.02
Asia	2002	69.23
Asia	2007	70.73
Europe	1952	64.41
Europe	1957	66.70
Europe	1962	68.54
Europe	1967	69.74
Europe	1972	70.78
Europe	1977	71.94
Europe	1982	72.81
Europe	1987	73.64
Europe	1992	74.44
Europe	1997	75.51
Europe	2002	76.70
Europe	2007	77.65
Oceania	1952	69.25
Oceania	1957	70.30
Oceania	1962	71.09
Oceania	1967	71.31
Oceania	1972	71.91
Oceania	1977	72.85
Oceania	1982	74.29
Oceania	1987	75.32
Oceania	1992	76.94
Oceania	1997	78.19
Oceania	2002	79.74
Oceania	2007	80.72

For Africa, the general trend for the first many time periods is that the average life expectancy increases. However, something sad–but interesting from a statistical point of view–happens after 1987. The average life expectancy flattens out. In fact, from 1992-1997 and 1997-2002, the average life expectancy decreases! However, it increases again in the period 2002-2007. The average life expectancy over time for the Americas shows the same general trend. Here, however, there is no real flattening, nor any decreases; just a steady increase throughout the years. It is also interesting to notice the magnitude of the average life expectancy for e.g. Africa and the Americas; by 2007, the average life expectancy for Africa has only reached about the same level as the Americas had in 1952! Asia also undergoes a gradual increase in life expectancy over time. It starts lower than the Americas, but the increases are larger, and by 2007 Asia has almost reached the same level as the Americas. The average life expectancy in Europe has also increased with time. But the increases are small; Europe starts out with a very high life expectancy (much higher than what Africa had in 2007) and ends in 2007 at an even higher level. But the increases are still small; whereas e.g. Asia goes from roughly 46 to 71 (an increase of 25) from 1952-2007, Europe goes from 64 to 78 (an increase of only 14). Lastly, there is Oceania. Here, we also observe an increasing trend. The average life expectancy starts out in 1952 at an even higher level than Europe did in the same year (and actually at about the same level that Asia had reached in 2007), and it also ends even higher than Europe by 2007. Oceania is thus the continent with the largest initial (in the year 1952) and final (in 2007) average life expectancy. But it is also the continent with the smallest total increase; the increase from 1952 to 2007 is merely about 11—even lower than Europe.

On a final note, it is both a timely and a tiring task to assess these trends from the data.frame and the table of data. It would be much better, easier, and informative to visualize the average life expectancy over time for each continent as five different plots in the same figure.

Again, xyplot is the most appropriate plot for this section.

xyplot(avgLifeExp ~ year, avgLifeExpByContAndYear, group = continent, type = "l", 
    auto.key = TRUE)

plot of chunk unnamed-chunk-23

Combining this plot with the previous section, some African countries seem to drive the slow growth in the life expectancy after 1990. If we exclude Africa, the mean life expectancy over the years becomes:

avgLifeExpByYear.NAfrica <- ddply(subset(gDat, continent != "Africa"), ~year, 
    summarize, avgLifeExp = mean(lifeExp), avgTrimLifeExp = mean(lifeExp, trim = 0.1))

xyplot(avgLifeExp + avgTrimLifeExp ~ year, avgLifeExpByYear.NAfrica, type = "l")

plot of chunk unnamed-chunk-24

Now the plot seems to exhibit a linear trend throughout the observation period, which means that we should look at African countries if we are concerned with low life expectancy growth after 1990.

Counting the number of countries with low life expectancy over time by continent

In this task we will count the number of countries with a life expectancy lower than some threshold, for each year. We will do this in a few short steps.

First, we consider the subset of gDat for which ´lifeExp´ is less than a certain threshold (this number can be e.g. the mean life expectancy, or any other other value).

lifeExpThreshold = 35
jDat <- subset(gDat, lifeExp < lifeExpThreshold)

Then we can use this data.frame jDat as input to ddply(). In the call to ddply() we also specify that we want to consider each continent and each year. Moreover, we define a variable that holds the number of countries in that continent. The result gives us the number of countries with a life expectancy lower than the specified threshold over time and by continent.

nCountriesWithLowLifeExpOverTime <- ddply(jDat, ~continent + year, summarize, 
    nCountriesWithLowLifeExp = length(unique(country)))
str(nCountriesWithLowLifeExpOverTime)

## 'data.frame':    10 obs. of  3 variables:
##  $ continent               : Factor w/ 5 levels "Africa","Americas",..: 1 1 1 1 1 3 3 3 3 3
##  $ year                    : int  1952 1957 1962 1967 1992 1952 1957 1962 1967 1977
##  $ nCountriesWithLowLifeExp: int  12 8 4 1 1 2 2 1 1 1

Of course, the new data.frame jDat didn't have to specified. The computations can be put into a single command:

nCountriesWithLowLifeExpOverTime <- ddply(subset(gDat, lifeExp < 35), ~continent + 
    year, summarize, nCountriesWithLowLifeExp = length(unique(country)))

The results (using 35 as the threshold/benchmark value) are shown in the table below.

tableData <- xtable(nCountriesWithLowLifeExpOverTime)
print(tableData, type = "html", include.rownames = FALSE)

continent	year	nCountriesWithLowLifeExp
Africa	1952	12
Africa	1957	8
Africa	1962	4
Africa	1967	1
Africa	1992	1
Asia	1952	2
Asia	1957	2
Asia	1962	1
Asia	1967	1
Asia	1977	1

First, we notice that only Africa and Asia are represented. As we saw in the above task, these were also the continents with the lowest average life expectancy over time. Now, from the last column—the number of countries in the continent with a life expectancy lower than the benchmark, for a given year—we observe a decreasin trend. Fortunately, the number of countries with a life expectancy lower than the threshold decreases with time. Of course, as there are so few countries in Asia represented here, this is mostly relevant for Africa. Let's try to increase the threshold value to 50 to get some more data:

nCountriesWithLowLifeExpOverTime <- ddply(subset(gDat, lifeExp < 50), ~continent + 
    year, summarize, nCountriesWithLowLifeExp = length(unique(country)))

tableData <- xtable(nCountriesWithLowLifeExpOverTime)
print(tableData, type = "html", include.rownames = FALSE)

continent	year	nCountriesWithLowLifeExp
Africa	1952	50
Africa	1957	49
Africa	1962	47
Africa	1967	39
Africa	1972	36
Africa	1977	28
Africa	1982	24
Africa	1987	20
Africa	1992	20
Africa	1997	20
Africa	2002	22
Africa	2007	18
Americas	1952	9
Americas	1957	8
Americas	1962	6
Americas	1967	2
Americas	1972	2
Americas	1977	1
Asia	1952	22
Asia	1957	18
Asia	1962	17
Asia	1967	12
Asia	1972	6
Asia	1977	5
Asia	1982	3
Asia	1987	1
Asia	1992	1
Asia	1997	1
Asia	2002	1
Asia	2007	1
Europe	1952	1
Europe	1957	1

Now we also have a few data entries for the Americas and Europe. And the numbers of countries in the last column are also larger. There was only a single country in Europe with a life expectancy lower than 50 in 1952 and 1957. For the later years, there are none. For the Americas, the trend is a decrease. And after 1977, there are not any countries here with a life expectancy lower than 50 either. The trends for Asia and Africa are the same; the number of countries with a life expectancy lower than 50 decreases with time. Asia only has one such country from 1987 and all the way up to the most recent time. Africa, on the other, still has as many as 18 such countries in 2007. Looking back in time, though, this is still better than the initial number of 50 such countries in 1952.

Finally, it is not much fun to assess these trends from the long table of data above. It would be much better to visualize the data for the different continents by plotting the number of countries against time. This would make trends more visible, allow for easier comparisons between continents, and also make the data more understandable for others.

xyplot for the data with life expectancy threshold 50 is as follows:

xyplot(nCountriesWithLowLifeExp ~ year, nCountriesWithLowLifeExpOverTime, group = continent, 
    type = "l", auto.key = TRUE, )

plot of chunk unnamed-chunk-31

Removing the data points with no country whose life expectancy is below 50 was really good with respect to data visualization because we can clearly see in the plot when there is no country whose life expectancy is below 50 for each continent. Also, because of treating zero as missing, we can see without paying much attention that there are still some countries in Asia with low life expectancy although the number of such countries is close to 0.