Mina Park
In this exercise, we work with the plyr package, which lets us split, apply, and combine data in R. We use data from the Gapminder project.
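For readers new to the idea, split-apply-combine can be spelled out by hand in base R. The sketch below uses a small made-up data frame (toy holds invented values, not Gapminder data) and is only meant to illustrate the three steps that a plyr call such as ddply() bundles together.

```r
# Toy data: made-up values for illustration only, not Gapminder data.
toy <- data.frame(
  continent = c("Asia", "Asia", "Europe", "Europe", "Europe"),
  lifeExp   = c(60, 70, 75, 80, 85),
  stringsAsFactors = FALSE
)

# Split: one data frame per continent.
pieces <- split(toy, toy$continent)

# Apply: reduce each piece to a one-row summary.
summaries <- lapply(pieces, function(piece) {
  data.frame(continent   = piece$continent[1],
             meanLifeExp = mean(piece$lifeExp),
             stringsAsFactors = FALSE)
})

# Combine: stack the per-group summaries back into one data frame.
result <- do.call(rbind, summaries)
rownames(result) <- NULL
print(result)
```

Each of the three steps is a separate line here, which makes it easier to see what happens to the data before relying on a single ddply() call.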
1.) Load libraries, data, and perform a superficial check of data import
library(plyr)
Dat <- read.delim("gapminderDataFiveYear.txt")
str(Dat)
## 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ pop : num 8425333 9240934 10267083 11537966 13079460 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ gdpPercap: num 779 821 853 836 740 ...
summary(Dat)
## country year pop continent
## Afghanistan: 12 Min. :1952 Min. :6.00e+04 Africa :624
## Albania : 12 1st Qu.:1966 1st Qu.:2.79e+06 Americas:300
## Algeria : 12 Median :1980 Median :7.02e+06 Asia :396
## Angola : 12 Mean :1980 Mean :2.96e+07 Europe :360
## Argentina : 12 3rd Qu.:1993 3rd Qu.:1.96e+07 Oceania : 24
## Australia : 12 Max. :2007 Max. :1.32e+09
## (Other) :1632
## lifeExp gdpPercap
## Min. :23.6 Min. : 241
## 1st Qu.:48.2 1st Qu.: 1202
## Median :60.7 Median : 3532
## Mean :59.5 Mean : 7215
## 3rd Qu.:70.8 3rd Qu.: 9325
## Max. :82.6 Max. :113523
##
head(Dat, 5)
## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.80 779.4
## 2 Afghanistan 1957 9240934 Asia 30.33 820.9
## 3 Afghanistan 1962 10267083 Asia 32.00 853.1
## 4 Afghanistan 1967 11537966 Asia 34.02 836.2
## 5 Afghanistan 1972 13079460 Asia 36.09 740.0
We notice that the data is stored in a data frame with 1704 observations of six variables: country, year, pop, continent, lifeExp, and gdpPercap.
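The visual check above can be backed by a few programmatic assertions. This is a minimal sketch: check_import() is a hypothetical helper (it is not part of plyr), and DatSample is a stand-in built from the first two rows printed by head(), so that the example runs without the data file.

```r
# Hypothetical helper: stops with an error if the import looks wrong.
check_import <- function(df) {
  expected <- c("country", "year", "pop", "continent", "lifeExp", "gdpPercap")
  stopifnot(
    is.data.frame(df),
    identical(names(df), expected),  # same columns, same order
    !any(is.na(df)),                 # no missing values anywhere
    all(df$lifeExp > 0),
    all(df$pop > 0)
  )
  invisible(TRUE)
}

# Stand-in for Dat: the first two rows shown by head(Dat, 5).
DatSample <- data.frame(
  country   = c("Afghanistan", "Afghanistan"),
  year      = c(1952L, 1957L),
  pop       = c(8425333, 9240934),
  continent = c("Asia", "Asia"),
  lifeExp   = c(28.80, 30.33),
  gdpPercap = c(779.4, 820.9),
  stringsAsFactors = FALSE
)

check_import(DatSample)  # passes silently when all checks hold
```

With the real data, the call would presumably be check_import(Dat); the specific checks are assumptions about what a sane import looks like and can be adjusted.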
2.) We want to investigate life expectancy and GDP over time, by continent
AvgLifeExpAndGdpByYearAndCont <- ddply(Dat, ~year + continent, summarize, meanLifeExp = mean(lifeExp),
meanGdp = mean(gdpPercap))
AvgLifeExpAndGdpByYearAndCont
## year continent meanLifeExp meanGdp
## 1 1952 Africa 39.14 1253
## 2 1952 Americas 53.28 4079
## 3 1952 Asia 46.31 5195
## 4 1952 Europe 64.41 5661
## 5 1952 Oceania 69.25 10298
## 6 1957 Africa 41.27 1385
## 7 1957 Americas 55.96 4616
## 8 1957 Asia 49.32 5788
## 9 1957 Europe 66.70 6963
## 10 1957 Oceania 70.30 11599
## 11 1962 Africa 43.32 1598
## 12 1962 Americas 58.40 4902
## 13 1962 Asia 51.56 5729
## 14 1962 Europe 68.54 8365
## 15 1962 Oceania 71.09 12696
## 16 1967 Africa 45.33 2050
## 17 1967 Americas 60.41 5668
## 18 1967 Asia 54.66 5971
## 19 1967 Europe 69.74 10144
## 20 1967 Oceania 71.31 14495
## 21 1972 Africa 47.45 2340
## 22 1972 Americas 62.39 6491
## 23 1972 Asia 57.32 8187
## 24 1972 Europe 70.78 12480
## 25 1972 Oceania 71.91 16417
## 26 1977 Africa 49.58 2586
## 27 1977 Americas 64.39 7352
## 28 1977 Asia 59.61 7791
## 29 1977 Europe 71.94 14284
## 30 1977 Oceania 72.85 17284
## 31 1982 Africa 51.59 2482
## 32 1982 Americas 66.23 7507
## 33 1982 Asia 62.62 7434
## 34 1982 Europe 72.81 15618
## 35 1982 Oceania 74.29 18555
## 36 1987 Africa 53.34 2283
## 37 1987 Americas 68.09 7793
## 38 1987 Asia 64.85 7608
## 39 1987 Europe 73.64 17214
## 40 1987 Oceania 75.32 20448
## 41 1992 Africa 53.63 2282
## 42 1992 Americas 69.57 8045
## 43 1992 Asia 66.54 8640
## 44 1992 Europe 74.44 17062
## 45 1992 Oceania 76.94 20894
## 46 1997 Africa 53.60 2379
## 47 1997 Americas 71.15 8889
## 48 1997 Asia 68.02 9834
## 49 1997 Europe 75.51 19077
## 50 1997 Oceania 78.19 24024
## 51 2002 Africa 53.33 2599
## 52 2002 Americas 72.42 9288
## 53 2002 Asia 69.23 10174
## 54 2002 Europe 76.70 21712
## 55 2002 Oceania 79.74 26939
## 56 2007 Africa 54.81 3089
## 57 2007 Americas 73.61 11003
## 58 2007 Asia 70.73 12473
## 59 2007 Europe 77.65 25054
## 60 2007 Oceania 80.72 29810
This gives us the data we are looking for, namely life expectancy and GDP over time per continent. But to see trends within continents, we want to see the data presented by continent.
arrange(AvgLifeExpAndGdpByYearAndCont, continent)
## year continent meanLifeExp meanGdp
## 1 1952 Africa 39.14 1253
## 2 1957 Africa 41.27 1385
## 3 1962 Africa 43.32 1598
## 4 1967 Africa 45.33 2050
## 5 1972 Africa 47.45 2340
## 6 1977 Africa 49.58 2586
## 7 1982 Africa 51.59 2482
## 8 1987 Africa 53.34 2283
## 9 1992 Africa 53.63 2282
## 10 1997 Africa 53.60 2379
## 11 2002 Africa 53.33 2599
## 12 2007 Africa 54.81 3089
## 13 1952 Americas 53.28 4079
## 14 1957 Americas 55.96 4616
## 15 1962 Americas 58.40 4902
## 16 1967 Americas 60.41 5668
## 17 1972 Americas 62.39 6491
## 18 1977 Americas 64.39 7352
## 19 1982 Americas 66.23 7507
## 20 1987 Americas 68.09 7793
## 21 1992 Americas 69.57 8045
## 22 1997 Americas 71.15 8889
## 23 2002 Americas 72.42 9288
## 24 2007 Americas 73.61 11003
## 25 1952 Asia 46.31 5195
## 26 1957 Asia 49.32 5788
## 27 1962 Asia 51.56 5729
## 28 1967 Asia 54.66 5971
## 29 1972 Asia 57.32 8187
## 30 1977 Asia 59.61 7791
## 31 1982 Asia 62.62 7434
## 32 1987 Asia 64.85 7608
## 33 1992 Asia 66.54 8640
## 34 1997 Asia 68.02 9834
## 35 2002 Asia 69.23 10174
## 36 2007 Asia 70.73 12473
## 37 1952 Europe 64.41 5661
## 38 1957 Europe 66.70 6963
## 39 1962 Europe 68.54 8365
## 40 1967 Europe 69.74 10144
## 41 1972 Europe 70.78 12480
## 42 1977 Europe 71.94 14284
## 43 1982 Europe 72.81 15618
## 44 1987 Europe 73.64 17214
## 45 1992 Europe 74.44 17062
## 46 1997 Europe 75.51 19077
## 47 2002 Europe 76.70 21712
## 48 2007 Europe 77.65 25054
## 49 1952 Oceania 69.25 10298
## 50 1957 Oceania 70.30 11599
## 51 1962 Oceania 71.09 12696
## 52 1967 Oceania 71.31 14495
## 53 1972 Oceania 71.91 16417
## 54 1977 Oceania 72.85 17284
## 55 1982 Oceania 74.29 18555
## 56 1987 Oceania 75.32 20448
## 57 1992 Oceania 76.94 20894
## 58 1997 Oceania 78.19 24024
## 59 2002 Oceania 79.74 26939
## 60 2007 Oceania 80.72 29810
Using “arrange” gives us data arranged by continent. Note: we can also get data arranged in this order by putting the variables in the order of “~continent+year” when we use ddply().
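In general, both life expectancy and GDP per capita rise over time on every continent, although not always steadily: Africa's mean GDP per capita falls between 1977 and 1992, and its mean life expectancy stalls between 1992 and 2002. A figure would be a great way of visualizing this trend.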
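Short of a figure, a quick way to test whether a trend is steady is to look at the change between consecutive survey years. The sketch below copies the Africa rows from the summary table above into plain vectors (so it runs on its own) and uses diff() to list the years in which a mean value was lower than five years earlier.

```r
# Africa rows copied from the year-by-continent summary table.
years       <- seq(1952, 2007, by = 5)
meanLifeExp <- c(39.14, 41.27, 43.32, 45.33, 47.45, 49.58,
                 51.59, 53.34, 53.63, 53.60, 53.33, 54.81)
meanGdp     <- c(1253, 1385, 1598, 2050, 2340, 2586,
                 2482, 2283, 2282, 2379, 2599, 3089)

# diff() returns the change from one survey year to the next, so
# years[-1] lines up each change with the later of the two years.
lifeExpDrops <- years[-1][diff(meanLifeExp) < 0]
gdpDrops     <- years[-1][diff(meanGdp) < 0]

print(lifeExpDrops)  # years where mean life expectancy fell
print(gdpDrops)      # years where mean GDP per capita fell
```

The same check could be applied to the other continents by copying their rows, or to the full summary with a grouped call.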
3.) We want to look at life expectancy and GDP over time, by continent, in a “wide” format
LifeExpByYearAndCont.Wide <- daply(Dat, ~year + continent, summarize, avgLifeExp = mean(lifeExp))
LifeExpByYearAndCont.Wide
## continent
## year Africa Americas Asia Europe Oceania
## 1952 39.14 53.28 46.31 64.41 69.25
## 1957 41.27 55.96 49.32 66.7 70.3
## 1962 43.32 58.4 51.56 68.54 71.09
## 1967 45.33 60.41 54.66 69.74 71.31
## 1972 47.45 62.39 57.32 70.78 71.91
## 1977 49.58 64.39 59.61 71.94 72.85
## 1982 51.59 66.23 62.62 72.81 74.29
## 1987 53.34 68.09 64.85 73.64 75.32
## 1992 53.63 69.57 66.54 74.44 76.94
## 1997 53.6 71.15 68.02 75.51 78.19
## 2002 53.33 72.42 69.23 76.7 79.74
## 2007 54.81 73.61 70.73 77.65 80.72
GdpByYearAndCont.Wide <- daply(Dat, ~year + continent, summarize, avgGdp = mean(gdpPercap))
GdpByYearAndCont.Wide
## continent
## year Africa Americas Asia Europe Oceania
## 1952 1253 4079 5195 5661 10298
## 1957 1385 4616 5788 6963 11599
## 1962 1598 4902 5729 8365 12696
## 1967 2050 5668 5971 10144 14495
## 1972 2340 6491 8187 12480 16417
## 1977 2586 7352 7791 14284 17284
## 1982 2482 7507 7434 15618 18555
## 1987 2283 7793 7608 17214 20448
## 1992 2282 8045 8640 17062 20894
## 1997 2379 8889 9834 19077 24024
## 2002 2599 9288 10174 21712 26939
## 2007 3089 11003 12473 25054 29810
Presenting data in a wide format gives a more compact table and makes it easier to compare trends across continents. As the name implies, it is “wide”, whereas the previous data was presented in a “tall” format. Note: daply() returns an array rather than a data frame, which is another output type that the plyr package supports.
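If a plain numeric matrix is wanted for the wide table (for example, to pass on to a plotting function), one base R option is tapply(). The sketch below is an alternative to the daply() calls above, not a description of how plyr works internally, and it uses a small made-up data frame rather than the Gapminder data.

```r
# Toy data: made-up values for illustration only.
toy <- data.frame(
  year      = c(1952, 1952, 1952, 2007, 2007, 2007),
  continent = c("Asia", "Asia", "Europe", "Asia", "Europe", "Europe"),
  lifeExp   = c(40, 50, 65, 70, 78, 80),
  stringsAsFactors = FALSE
)

# Wide: one row per year, one column per continent, cells are group means.
wide <- tapply(toy$lifeExp,
               list(year = toy$year, continent = toy$continent),
               mean)
print(wide)

# Tall again: one row per year-continent combination.
tall <- as.data.frame(as.table(wide), responseName = "meanLifeExp")
print(tall)
```

Converting back with as.table() shows that the wide and tall layouts hold the same numbers and differ only in shape.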
4.) We want to look at the number and proportion of countries with low life expectancy over time, by continent
4.1) Before doing this for the entire dataset, I first want to try it for the year 2007 only.
Dat2007 <- subset(Dat, year == 2007)
avgLifeExp2007 <- mean(Dat2007$lifeExp)
avgLifeExp2007 #67
## [1] 67.01
We will treat a life expectancy below 67, roughly the 2007 mean of 67.01, as low life expectancy.
Dat2007LowLifeExp <- ddply(Dat2007, ~continent + country, summarize, lowLifeExp = lifeExp <
67)
Dat2007LowLifeExp
## continent country lowLifeExp
## 1 Africa Algeria FALSE
## 2 Africa Angola TRUE
## 3 Africa Benin TRUE
## 4 Africa Botswana TRUE
## 5 Africa Burkina Faso TRUE
## 6 Africa Burundi TRUE
## 7 Africa Cameroon TRUE
## 8 Africa Central African Republic TRUE
## 9 Africa Chad TRUE
## 10 Africa Comoros TRUE
## 11 Africa Congo, Dem. Rep. TRUE
## 12 Africa Congo, Rep. TRUE
## 13 Africa Cote d'Ivoire TRUE
## 14 Africa Djibouti TRUE
## 15 Africa Egypt FALSE
## 16 Africa Equatorial Guinea TRUE
## 17 Africa Eritrea TRUE
## 18 Africa Ethiopia TRUE
## 19 Africa Gabon TRUE
## 20 Africa Gambia TRUE
## 21 Africa Ghana TRUE
## 22 Africa Guinea TRUE
## 23 Africa Guinea-Bissau TRUE
## 24 Africa Kenya TRUE
## 25 Africa Lesotho TRUE
## 26 Africa Liberia TRUE
## 27 Africa Libya FALSE
## 28 Africa Madagascar TRUE
## 29 Africa Malawi TRUE
## 30 Africa Mali TRUE
## 31 Africa Mauritania TRUE
## 32 Africa Mauritius FALSE
## 33 Africa Morocco FALSE
## 34 Africa Mozambique TRUE
## 35 Africa Namibia TRUE
## 36 Africa Niger TRUE
## 37 Africa Nigeria TRUE
## 38 Africa Reunion FALSE
## 39 Africa Rwanda TRUE
## 40 Africa Sao Tome and Principe TRUE
## 41 Africa Senegal TRUE
## 42 Africa Sierra Leone TRUE
## 43 Africa Somalia TRUE
## 44 Africa South Africa TRUE
## 45 Africa Sudan TRUE
## 46 Africa Swaziland TRUE
## 47 Africa Tanzania TRUE
## 48 Africa Togo TRUE
## 49 Africa Tunisia FALSE
## 50 Africa Uganda TRUE
## 51 Africa Zambia TRUE
## 52 Africa Zimbabwe TRUE
## 53 Americas Argentina FALSE
## 54 Americas Bolivia TRUE
## 55 Americas Brazil FALSE
## 56 Americas Canada FALSE
## 57 Americas Chile FALSE
## 58 Americas Colombia FALSE
## 59 Americas Costa Rica FALSE
## 60 Americas Cuba FALSE
## 61 Americas Dominican Republic FALSE
## 62 Americas Ecuador FALSE
## 63 Americas El Salvador FALSE
## 64 Americas Guatemala FALSE
## 65 Americas Haiti TRUE
## 66 Americas Honduras FALSE
## 67 Americas Jamaica FALSE
## 68 Americas Mexico FALSE
## 69 Americas Nicaragua FALSE
## 70 Americas Panama FALSE
## 71 Americas Paraguay FALSE
## 72 Americas Peru FALSE
## 73 Americas Puerto Rico FALSE
## 74 Americas Trinidad and Tobago FALSE
## 75 Americas United States FALSE
## 76 Americas Uruguay FALSE
## 77 Americas Venezuela FALSE
## 78 Asia Afghanistan TRUE
## 79 Asia Bahrain FALSE
## 80 Asia Bangladesh TRUE
## 81 Asia Cambodia TRUE
## 82 Asia China FALSE
## 83 Asia Hong Kong, China FALSE
## 84 Asia India TRUE
## 85 Asia Indonesia FALSE
## 86 Asia Iran FALSE
## 87 Asia Iraq TRUE
## 88 Asia Israel FALSE
## 89 Asia Japan FALSE
## 90 Asia Jordan FALSE
## 91 Asia Korea, Dem. Rep. FALSE
## 92 Asia Korea, Rep. FALSE
## 93 Asia Kuwait FALSE
## 94 Asia Lebanon FALSE
## 95 Asia Malaysia FALSE
## 96 Asia Mongolia TRUE
## 97 Asia Myanmar TRUE
## 98 Asia Nepal TRUE
## 99 Asia Oman FALSE
## 100 Asia Pakistan TRUE
## 101 Asia Philippines FALSE
## 102 Asia Saudi Arabia FALSE
## 103 Asia Singapore FALSE
## 104 Asia Sri Lanka FALSE
## 105 Asia Syria FALSE
## 106 Asia Taiwan FALSE
## 107 Asia Thailand FALSE
## 108 Asia Vietnam FALSE
## 109 Asia West Bank and Gaza FALSE
## 110 Asia Yemen, Rep. TRUE
## 111 Europe Albania FALSE
## 112 Europe Austria FALSE
## 113 Europe Belgium FALSE
## 114 Europe Bosnia and Herzegovina FALSE
## 115 Europe Bulgaria FALSE
## 116 Europe Croatia FALSE
## 117 Europe Czech Republic FALSE
## 118 Europe Denmark FALSE
## 119 Europe Finland FALSE
## 120 Europe France FALSE
## 121 Europe Germany FALSE
## 122 Europe Greece FALSE
## 123 Europe Hungary FALSE
## 124 Europe Iceland FALSE
## 125 Europe Ireland FALSE
## 126 Europe Italy FALSE
## 127 Europe Montenegro FALSE
## 128 Europe Netherlands FALSE
## 129 Europe Norway FALSE
## 130 Europe Poland FALSE
## 131 Europe Portugal FALSE
## 132 Europe Romania FALSE
## 133 Europe Serbia FALSE
## 134 Europe Slovak Republic FALSE
## 135 Europe Slovenia FALSE
## 136 Europe Spain FALSE
## 137 Europe Sweden FALSE
## 138 Europe Switzerland FALSE
## 139 Europe Turkey FALSE
## 140 Europe United Kingdom FALSE
## 141 Oceania Australia FALSE
## 142 Oceania New Zealand FALSE
Here, I asked R to flag which countries have a life expectancy below 67. As you can see, the output is a data frame whose “lowLifeExp” column is a logical vector of TRUEs and FALSEs.
nCountriesLowLifeExpByCont <- ddply(Dat2007LowLifeExp, ~continent, summarize,
nCountriesLowLifeExp = length(which(lowLifeExp == TRUE)))
nCountriesLowLifeExpByCont
## continent nCountriesLowLifeExp
## 1 Africa 45
## 2 Americas 2
## 3 Asia 10
## 4 Europe 0
## 5 Oceania 0
To find the number of countries per continent with low life expectancy, I asked R to count the number of times “TRUE” shows up under “lowLifeExp”. In R, this means using which() to find the positions that are “TRUE” in the “lowLifeExp” vector and length() to count them. Personal story: Figuring this out turned out to be a source of great pain and frustration, and took an embarrassing amount of time.
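A shorter idiom for the same count, offered here as an alternative rather than a correction: R treats TRUE as 1 and FALSE as 0 in arithmetic, so sum() counts the TRUEs in a logical vector and mean() gives their proportion. The vectors below are made up for illustration.

```r
# Made-up logical vector standing in for a lowLifeExp column.
lowLifeExp <- c(TRUE, FALSE, TRUE, TRUE, FALSE)

nWhich <- length(which(lowLifeExp == TRUE))  # the approach used above
nSum   <- sum(lowLifeExp)                    # TRUE counts as 1, FALSE as 0
prop   <- mean(lowLifeExp)                   # share of TRUEs

print(c(nWhich = nWhich, nSum = nSum, prop = prop))

# The two approaches differ when values are missing:
withNA <- c(TRUE, NA, FALSE)
nWhichNA <- length(which(withNA == TRUE))  # which() silently drops NA
nSumNA   <- sum(withNA)                    # NA propagates
nSumNArm <- sum(withNA, na.rm = TRUE)      # ignore NA explicitly
```

The Gapminder columns used here have no missing values as far as the summary() output shows, so both approaches give the same counts on this data.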
4.2) Now that I have the hang of it, I am moving on to the entire dataset. I will use an arbitrary life expectancy of 50 as the threshold for low life expectancy.
DatLowLifeExp <- ddply(Dat, ~continent + country + year, summarize, lowLifeExp = lifeExp <
50)
First, we look at the number of countries with low life expectancy per continent and year.
nCountriesLowLifeExpByContandYear.Wide <- daply(DatLowLifeExp, ~year + continent,
summarize, nCountriesLowLifeExp = length(which(lowLifeExp == TRUE)))
nCountriesLowLifeExpByContandYear.Wide
## continent
## year Africa Americas Asia Europe Oceania
## 1952 50 9 22 1 0
## 1957 49 8 18 1 0
## 1962 47 6 17 0 0
## 1967 39 2 12 0 0
## 1972 36 2 6 0 0
## 1977 28 1 5 0 0
## 1982 24 0 3 0 0
## 1987 20 0 1 0 0
## 1992 20 0 1 0 0
## 1997 20 0 1 0 0
## 2002 22 0 1 0 0
## 2007 18 0 1 0 0
Next, we look at the proportion of countries with low life expectancy per continent and year.
propCountriesLowLifeExpByContAndYear.Wide <- daply(DatLowLifeExp, ~year + continent,
summarize, propLowLifeExp = (length(which(lowLifeExp == TRUE))/length(unique(country))))
propCountriesLowLifeExpByContAndYear.Wide
## continent
## year Africa Americas Asia Europe Oceania
## 1952 0.9615 0.36 0.6667 0.03333 0
## 1957 0.9423 0.32 0.5455 0.03333 0
## 1962 0.9038 0.24 0.5152 0 0
## 1967 0.75 0.08 0.3636 0 0
## 1972 0.6923 0.08 0.1818 0 0
## 1977 0.5385 0.04 0.1515 0 0
## 1982 0.4615 0 0.09091 0 0
## 1987 0.3846 0 0.0303 0 0
## 1992 0.3846 0 0.0303 0 0
## 1997 0.3846 0 0.0303 0 0
## 2002 0.4231 0 0.0303 0 0
## 2007 0.3462 0 0.0303 0 0
To generate a data frame with both the number and proportion of countries with low life expectancy per continent, we include them all in our output.
propAndNCountriesLowLifeExpByContAndYear <- ddply(DatLowLifeExp, ~continent +
year, summarize, nCountriesLowLifeExp = length(which(lowLifeExp == TRUE)),
nCountries = length(unique(country)), propLowLifeExp = (length(which(lowLifeExp ==
TRUE))/length(unique(country))))
propAndNCountriesLowLifeExpByContAndYear
## continent year nCountriesLowLifeExp nCountries propLowLifeExp
## 1 Africa 1952 50 52 0.96154
## 2 Africa 1957 49 52 0.94231
## 3 Africa 1962 47 52 0.90385
## 4 Africa 1967 39 52 0.75000
## 5 Africa 1972 36 52 0.69231
## 6 Africa 1977 28 52 0.53846
## 7 Africa 1982 24 52 0.46154
## 8 Africa 1987 20 52 0.38462
## 9 Africa 1992 20 52 0.38462
## 10 Africa 1997 20 52 0.38462
## 11 Africa 2002 22 52 0.42308
## 12 Africa 2007 18 52 0.34615
## 13 Americas 1952 9 25 0.36000
## 14 Americas 1957 8 25 0.32000
## 15 Americas 1962 6 25 0.24000
## 16 Americas 1967 2 25 0.08000
## 17 Americas 1972 2 25 0.08000
## 18 Americas 1977 1 25 0.04000
## 19 Americas 1982 0 25 0.00000
## 20 Americas 1987 0 25 0.00000
## 21 Americas 1992 0 25 0.00000
## 22 Americas 1997 0 25 0.00000
## 23 Americas 2002 0 25 0.00000
## 24 Americas 2007 0 25 0.00000
## 25 Asia 1952 22 33 0.66667
## 26 Asia 1957 18 33 0.54545
## 27 Asia 1962 17 33 0.51515
## 28 Asia 1967 12 33 0.36364
## 29 Asia 1972 6 33 0.18182
## 30 Asia 1977 5 33 0.15152
## 31 Asia 1982 3 33 0.09091
## 32 Asia 1987 1 33 0.03030
## 33 Asia 1992 1 33 0.03030
## 34 Asia 1997 1 33 0.03030
## 35 Asia 2002 1 33 0.03030
## 36 Asia 2007 1 33 0.03030
## 37 Europe 1952 1 30 0.03333
## 38 Europe 1957 1 30 0.03333
## 39 Europe 1962 0 30 0.00000
## 40 Europe 1967 0 30 0.00000
## 41 Europe 1972 0 30 0.00000
## 42 Europe 1977 0 30 0.00000
## 43 Europe 1982 0 30 0.00000
## 44 Europe 1987 0 30 0.00000
## 45 Europe 1992 0 30 0.00000
## 46 Europe 1997 0 30 0.00000
## 47 Europe 2002 0 30 0.00000
## 48 Europe 2007 0 30 0.00000
## 49 Oceania 1952 0 2 0.00000
## 50 Oceania 1957 0 2 0.00000
## 51 Oceania 1962 0 2 0.00000
## 52 Oceania 1967 0 2 0.00000
## 53 Oceania 1972 0 2 0.00000
## 54 Oceania 1977 0 2 0.00000
## 55 Oceania 1982 0 2 0.00000
## 56 Oceania 1987 0 2 0.00000
## 57 Oceania 1992 0 2 0.00000
## 58 Oceania 1997 0 2 0.00000
## 59 Oceania 2002 0 2 0.00000
## 60 Oceania 2007 0 2 0.00000
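The count, group size, and proportion in the table above can also be computed in base R with aggregate(). The sketch below is an assumption-laden illustration: toy is a made-up data frame with one row per country and year (the same layout as the Gapminder data), and threshold mirrors the cutoff of 50 used above.

```r
# Toy data: made-up values, one row per country and year.
toy <- data.frame(
  continent = c(rep("Africa", 3), rep("Europe", 2),
                rep("Africa", 3), rep("Europe", 2)),
  country   = c("A", "B", "C", "D", "E", "A", "B", "C", "D", "E"),
  year      = c(rep(1952, 5), rep(2007, 5)),
  lifeExp   = c(40, 45, 55, 48, 66, 48, 52, 60, 75, 80),
  stringsAsFactors = FALSE
)

threshold <- 50
toy$low <- toy$lifeExp < threshold  # TRUE where life expectancy is low

# Count of low-life-expectancy countries per continent and year.
res <- aggregate(low ~ continent + year, data = toy, FUN = sum)
names(res)[names(res) == "low"] <- "nCountriesLowLifeExp"

# Group size; with one row per country, this is the number of countries.
res$nCountries <- aggregate(low ~ continent + year, data = toy,
                            FUN = length)$low

res$propLowLifeExp <- res$nCountriesLowLifeExp / res$nCountries
print(res)
```

Because each country appears once per year, length() of the group matches the length(unique(country)) used above; if the data had repeated rows per country, the two would differ.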
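Overall, the number and proportion of countries with low life expectancy fall over time on every continent that has any such countries, apart from a brief rise in Africa between 1997 and 2002 (from 20 to 22 countries); Oceania has none in any year. However, there are stark differences across continents - in 2007, about 35% of African countries (18 of 52) still have a life expectancy below 50, compared with a single country in Asia and none elsewhere.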
In future exercises, I would like to investigate the relationship between GDP and life expectancy over time by continent. I would also like to identify countries that have “interesting stories” in this regard.
To sum up: From this exercise, I have gathered that “plyr” is a very useful package, and I can imagine various handy uses for it. Personal reflection: I've also realized that, at my current level of R programming, using R is like trying to write a book in a foreign language without understanding its grammar and with a limited or non-existent vocabulary. Here's to becoming R-literate!