In this assignment, we prepare ourselves to take advantage of graphical figures in data aggregation tasks. Similar to last assignment, we have used Gapminder data.I want to emphasize that I have used the assignment #3 codes of two students: Rebecca Johnston and Jinyuan Zhang
We start with loading data and checking the structure of the input:
## Loading Data
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
library(plyr)
library(xtable)
library(lattice)
Now let's get rid of the Oceania continent which has only two countires and it will not give us informative figures and tables.
## Omit Oceania From the data, as it contains only 2 countries
iDat <- droplevels(subset(gDat, continent != "Oceania"))
Now we report the average life expectancy for different years.
iDat <- within(iDat, continent <- reorder(continent, lifeExp)) ## Change the order of continents based on their life expectancy
MeanLife = ddply(gDat, ~year, summarize, MeanLifeExp = mean(lifeExp))
MeanLifeTable = xtable(MeanLife)
## Plotting the Average of Life Expectancy over time
xyplot(MeanLifeExp ~ year, data = MeanLife, grid = "h", type = c("p", "a"))
As we see the average life expectancy increases with time. Obviously, the overal trend is easier to be seen using a figure rather than a table.
In this part we want to see how life expectancy changes over time on different continents.
## Plotting Life Expectancies over time across different continents
stripplot(lifeExp ~ as.factor(year), iDat, groups = continent, auto.key = list(reverse.rows = TRUE),
jitter.data = TRUE, grid = "h", type = c("p", "a"), fun = median)
All the continents face an increase in the overal life expectancy. Among them, Europe has the highest median of life expectancy. As we see Asia has the fastest growth in the life expectancy over time.
In this part we want to depict the number of countries with low life expectancy over time across continents. We have chosen a benchmark of 50 for defining the low-life-expectancy. Naturally, a country with life expectancy lower than this benchmark, is counted as a low-life expectancy country.
## Counting the Low-Life-Expectancy across continents with benchmark = 50
## Counter Function
Benchmark = 50
Count = c(rep(0, nrow(iDat)))
for (i in 1:nrow(iDat)) {
if (iDat$lifeExp[i] < Benchmark) {
Count[i] = 1
} else {
Count[i] = 0
}
}
newiDat = cbind(iDat, Count)
countlifeExp = ddply(newiDat, .(continent, year), summarize, CountryCount = sum(Count))
countlifeExp = xtable(countlifeExp)
print(countlifeExp, type = "html", include.rownames = FALSE)
| continent | year | CountryCount |
|---|---|---|
| Africa | 1952 | 50.00 |
| Africa | 1957 | 49.00 |
| Africa | 1962 | 47.00 |
| Africa | 1967 | 39.00 |
| Africa | 1972 | 36.00 |
| Africa | 1977 | 28.00 |
| Africa | 1982 | 24.00 |
| Africa | 1987 | 20.00 |
| Africa | 1992 | 20.00 |
| Africa | 1997 | 20.00 |
| Africa | 2002 | 22.00 |
| Africa | 2007 | 18.00 |
| Asia | 1952 | 22.00 |
| Asia | 1957 | 18.00 |
| Asia | 1962 | 17.00 |
| Asia | 1967 | 12.00 |
| Asia | 1972 | 6.00 |
| Asia | 1977 | 5.00 |
| Asia | 1982 | 3.00 |
| Asia | 1987 | 1.00 |
| Asia | 1992 | 1.00 |
| Asia | 1997 | 1.00 |
| Asia | 2002 | 1.00 |
| Asia | 2007 | 1.00 |
| Americas | 1952 | 9.00 |
| Americas | 1957 | 8.00 |
| Americas | 1962 | 6.00 |
| Americas | 1967 | 2.00 |
| Americas | 1972 | 2.00 |
| Americas | 1977 | 1.00 |
| Americas | 1982 | 0.00 |
| Americas | 1987 | 0.00 |
| Americas | 1992 | 0.00 |
| Americas | 1997 | 0.00 |
| Americas | 2002 | 0.00 |
| Americas | 2007 | 0.00 |
| Europe | 1952 | 1.00 |
| Europe | 1957 | 1.00 |
| Europe | 1962 | 0.00 |
| Europe | 1967 | 0.00 |
| Europe | 1972 | 0.00 |
| Europe | 1977 | 0.00 |
| Europe | 1982 | 0.00 |
| Europe | 1987 | 0.00 |
| Europe | 1992 | 0.00 |
| Europe | 1997 | 0.00 |
| Europe | 2002 | 0.00 |
| Europe | 2007 | 0.00 |
This long table does not seem to be a nice object to show to audience (to be used in a presentation or paper or etc.) However, this is a good one to be used in plotting the data in a figure; a figure which is a nice object to be shown to others. Way more compact and informative than the table above.
## Plotting the Low-Life-Expectancy counts over time for continents
xyplot(CountryCount ~ year, data = countlifeExp, group = continent, auto.key = TRUE)
An overal decrease in the low-life-expectancy countries is observable from these figures.
Next, we are going to depict the maximum and minimum of GDP per capita for all continents.
## Identifying the Minimum and Maximum GDP Per Capita for different
## continents reorder the data with respect to gdpPercap
iDat <- within(iDat, continent <- reorder(continent, gdpPercap))
## Rebecca Johnston gDat-> iDat Write own function to produce a data frame in
## tall format
minmax <- function(x) {
## Make character vector to specify min and max
factor = c("Min", "Max")
## Specify function to compute min and max (same order as line above)
gdpPerCapita = c(min(x$gdpPercap), max(x$gdpPercap))
## Make factor and value two columns in a data frame
data.frame(factor, gdpPerCapita)
}
contMinMaxGdpTall <- ddply(iDat, ~continent, minmax)
contMinMaxGdpTable <- xtable(contMinMaxGdpTall)
print(contMinMaxGdpTable, type = "html", include.rownames = FALSE)
| continent | factor | gdpPerCapita |
|---|---|---|
| Africa | Min | 241.17 |
| Africa | Max | 21951.21 |
| Americas | Min | 1201.64 |
| Americas | Max | 42951.65 |
| Asia | Min | 331.00 |
| Asia | Max | 113523.13 |
| Europe | Min | 973.53 |
| Europe | Max | 49357.19 |
## Bar Chart for the Minimum and Maximum of GDP per Capita over the entire
## time period for all the continents
barchart(gdpPerCapita ~ factor | continent, contMinMaxGdpTall)
Since, it is awkward to ignore the year here, as there are strong temporal trends in GDP per capita, we will try to bring the year variable into our visual display. We have shown the results separately for each year.
## Use a similar way to generate the continent-wise minmax; this time
## separately for each year
contMinMaxGdpTall <- ddply(iDat, ~continent + year, minmax)
## Comparing different continents over time based on their minimum and
## maximum GDP per Capita
xyplot(gdpPerCapita ~ continent | as.factor(year), data = contMinMaxGdpTall,
group = factor, auto.key = TRUE, grid = "h", type = c("p", "a"))
Now, we want to look at the spread of GDP per capita within the continents.
## Rebecca's code - Mean and Median Omitted
contSpreadGdp <- ddply(iDat, ~continent + year, summarize, sdGdpPercap = sd(gdpPercap),
madGdpPercap = mad(gdpPercap), iqrGdpPercap = IQR(gdpPercap))
contSpreadGdp <- arrange(contSpreadGdp, sdGdpPercap) #order table by standard deviation
contSpreadGdpXT <- xtable(contSpreadGdp)
print(contSpreadGdpXT, type = "html", include.rownames = FALSE)
| continent | year | sdGdpPercap | madGdpPercap | iqrGdpPercap |
|---|---|---|---|---|
| Africa | 1952 | 982.95 | 696.94 | 919.90 |
| Africa | 1957 | 1134.51 | 712.10 | 966.09 |
| Africa | 1962 | 1461.84 | 786.26 | 1028.89 |
| Africa | 1987 | 2566.53 | 814.02 | 2021.71 |
| Africa | 1992 | 2644.08 | 728.29 | 1963.99 |
| Africa | 1997 | 2820.73 | 771.40 | 2064.48 |
| Africa | 1967 | 2847.72 | 742.99 | 1110.30 |
| Africa | 2002 | 2972.65 | 819.40 | 2534.31 |
| Americas | 1952 | 3001.73 | 1265.22 | 1511.74 |
| Europe | 1952 | 3114.06 | 2983.64 | 3995.66 |
| Africa | 1982 | 3242.63 | 899.40 | 1958.93 |
| Africa | 1972 | 3286.85 | 952.11 | 1476.34 |
| Americas | 1957 | 3312.38 | 1917.27 | 2269.16 |
| Americas | 1962 | 3421.74 | 1719.81 | 2430.39 |
| Africa | 2007 | 3618.16 | 1032.21 | 3130.55 |
| Europe | 1957 | 3677.95 | 3692.10 | 5202.35 |
| Africa | 1977 | 4142.40 | 1015.21 | 2035.67 |
| Americas | 1967 | 4160.89 | 2076.92 | 2545.56 |
| Europe | 1962 | 4199.19 | 4225.83 | 5557.55 |
| Europe | 1967 | 4724.98 | 5134.80 | 6619.24 |
| Americas | 1972 | 4754.40 | 2229.77 | 2778.00 |
| Americas | 1977 | 5355.60 | 2260.26 | 2918.17 |
| Europe | 1972 | 5509.69 | 6001.40 | 7465.31 |
| Americas | 1982 | 5530.49 | 3463.59 | 4739.39 |
| Europe | 1977 | 5874.46 | 6864.29 | 8692.38 |
| Europe | 1982 | 6453.23 | 7646.98 | 9451.86 |
| Americas | 1987 | 6665.04 | 3292.12 | 3666.65 |
| Americas | 1992 | 7047.09 | 3231.02 | 3697.55 |
| Europe | 1987 | 7482.96 | 9014.97 | 11047.02 |
| Americas | 1997 | 7874.23 | 3934.24 | 5082.98 |
| Asia | 1987 | 8090.26 | 4824.52 | 9938.89 |
| Asia | 1982 | 8701.18 | 4921.98 | 11511.36 |
| Americas | 2002 | 8895.82 | 3167.47 | 3939.29 |
| Europe | 1992 | 9109.80 | 12065.97 | 16367.13 |
| Americas | 2007 | 9713.21 | 4773.60 | 6249.22 |
| Asia | 1992 | 9727.43 | 4282.12 | 13430.26 |
| Europe | 1997 | 10065.46 | 12347.82 | 17242.93 |
| Asia | 1997 | 11094.18 | 4315.99 | 17799.80 |
| Asia | 2002 | 11150.72 | 4497.79 | 17141.28 |
| Europe | 2002 | 11197.36 | 13027.36 | 18651.51 |
| Europe | 2007 | 11800.34 | 12506.17 | 19006.06 |
| Asia | 1977 | 11815.78 | 3638.67 | 10034.17 |
| Asia | 1967 | 14062.59 | 1969.57 | 5070.53 |
| Asia | 2007 | 14154.94 | 4566.12 | 19863.98 |
| Asia | 1962 | 16415.86 | 1428.06 | 3361.71 |
| Asia | 1952 | 18634.89 | 921.11 | 2285.64 |
| Asia | 1972 | 19087.50 | 2775.31 | 7547.82 |
| Asia | 1957 | 19506.52 | 1292.32 | 2496.68 |
But how to find an insight into these measures using figures. The most basic way might be to draw the empirical density of GDP per Capita for different continents.
## Empirical Density Function of GDP per Capita for all the continent
densityplot(~gdpPercap, iDat, plot.points = FALSE, ref = TRUE, group = continent,
auto.key = list(columns = nlevels(iDat$continent)), n = 400)
We see that Europe has less symmetry in its density compared to other continents. This partly shows why Europe has the most IQR and MAD measures. On the other hand, Asia has a nearly symmetric part in its density in addition to a non-symmetric fat tail on the right. This shows why it has the biggest Standard deviation but a medium MAD. There is a lot to say about this density which the table above could not say.
We can also use the boxplots in order to compare the spread of GDP per capita data among different continents. We have chosen a part of data for this part.
## Taking a look on box plots for a couple of years hDat contains a part of
## data due to the years 1982, 1987, 1992,1997, 2002, 2007
hDat <- subset(iDat, year %in% c(1982, 1987, 1992, 1997, 2002, 2007)) #
bwplot(gdpPercap ~ as.factor(year) | continent, hDat)
I think these sorts of simple figures have a lot more to say about the spread of the data rather than that information we have had in the tables. To see how tables might be boring we have selected three years 1957, 1982 and 2007 and we draw the “numbers"from the table above. I do not think that these numbers could be interpreted as easy as what we have in a figure such as density plot.
## Comparing different spread measures for three chosen years:1957, 1982,
## 2007 hDat contains a part of data due to the years 1957, 1982, 2007
hDat <- subset(iDat, year %in% c(1957, 1982, 2007)) #
contSpreadGdp <- ddply(hDat, ~continent + year, summarize, sdGdpPercap = sd(gdpPercap),
madGdpPercap = mad(gdpPercap), iqrGdpPercap = IQR(gdpPercap))
contSpreadGdp <- arrange(contSpreadGdp, sdGdpPercap) #order table by standard deviation
## Evaluatation of different spread measure for each continent over time
iQRplot <- xyplot(iqrGdpPercap ~ continent, data = contSpreadGdp, group = year,
auto.key = list(reverse.rows = TRUE), grid = "h", type = c("p", "a"))
MADplot <- xyplot(madGdpPercap ~ continent, data = contSpreadGdp, group = year,
auto.key = list(reverse.rows = TRUE), grid = "h", type = c("p", "a"))
SDplot <- xyplot(sdGdpPercap ~ continent, data = contSpreadGdp, group = year,
auto.key = list(reverse.rows = TRUE), grid = "h", type = c("p", "a"))
print(iQRplot, position = c(0, 0, 0.33, 1), more = TRUE)
print(MADplot, position = c(0.33, 0, 0.66, 1), more = TRUE)
print(SDplot, position = c(0.66, 0, 1, 1))