STAT 545A Homework #4 Visualize a Quantitative Variable

In this assignment, we practice using graphical figures in data aggregation tasks. As in the last assignment, we use the Gapminder data. I want to emphasize that I have reused code from the assignment #3 submissions of two students: Rebecca Johnston and Jinyuan Zhang.

Loading the Data

We start with loading data and checking the structure of the input:

## Loading Data
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
library(plyr)
library(xtable)
library(lattice)

Now let's drop the continent Oceania, which has only two countries and therefore would not give us informative figures and tables.

## Omit Oceania From the data, as it contains only 2 countries
iDat <- droplevels(subset(gDat, continent != "Oceania"))
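The droplevels() call matters because subset() removes the rows but keeps "Oceania" as an (empty) factor level, which would otherwise still show up in legends and panels. A minimal sketch on a made-up three-row data frame (the name toy and its values are hypothetical, not part of the Gapminder data):

```r
## Hypothetical toy data; values are invented for illustration only
toy <- data.frame(
    continent = factor(c("Asia", "Oceania", "Europe")),
    pop = c(1, 2, 3)
)

## subset() drops the Oceania row but keeps the unused factor level
kept <- subset(toy, continent != "Oceania")
levelsBefore <- levels(kept$continent)

## droplevels() removes factor levels that no longer occur in the data
kept <- droplevels(kept)
levelsAfter <- levels(kept$continent)

print(levelsBefore)
print(levelsAfter)
```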

Mean Life Expectancy over Time

Now we report the average life expectancy for different years.

iDat <- within(iDat, continent <- reorder(continent, lifeExp))  ## Change the order of continents based on their life expectancy 
MeanLife = ddply(gDat, ~year, summarize, MeanLifeExp = mean(lifeExp))
MeanLifeTable = xtable(MeanLife)
## Plotting the Average of Life Expectancy over time
xyplot(MeanLifeExp ~ year, data = MeanLife, grid = "h", type = c("p", "a"))

Figure: mean life expectancy versus year.

As we can see, the average life expectancy increases over time. The overall trend is easier to see in a figure than in a table.
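The reorder() call at the top of this section only changes the order of the factor levels (and hence the order of groups in later plots), not the data. By default it orders the levels by the mean of the second argument within each level. A small sketch on made-up numbers (toy is a hypothetical stand-in for iDat):

```r
## Hypothetical toy data; life expectancies are invented
toy <- data.frame(
    continent = factor(c("Africa", "Africa", "Asia", "Asia", "Europe", "Europe")),
    lifeExp = c(40, 50, 60, 70, 50, 56)
)

## Factor levels start out in alphabetical order
alphabetical <- levels(toy$continent)

## reorder() sorts the levels by mean lifeExp within each continent
## (group means here: Africa 45, Europe 53, Asia 65)
toy$continent <- reorder(toy$continent, toy$lifeExp)
byMean <- levels(toy$continent)

print(alphabetical)
print(byMean)
```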

Life Expectancy over time for different continents

In this part we want to see how life expectancy changes over time on different continents.

## Plotting Life Expectancies over time across different continents
stripplot(lifeExp ~ as.factor(year), iDat, groups = continent, auto.key = list(reverse.rows = TRUE), 
    jitter.data = TRUE, grid = "h", type = c("p", "a"), fun = median)

Figure: life expectancy by year, grouped by continent, with lines joining the continent medians.

Life expectancy increases overall on every continent. Among them, Europe has the highest median life expectancy. As we can see, Asia shows the fastest growth in life expectancy over time.

Low-Life-Expectancy Countries

In this part we depict the number of countries with low life expectancy over time, across continents. We chose a benchmark of 50 years to define low life expectancy: a country whose life expectancy is below this benchmark in a given year is counted as a low-life-expectancy country.

## Counting the Low-Life-Expectancy across continents with benchmark = 50
## Counter Function
Benchmark = 50
Count = c(rep(0, nrow(iDat)))
for (i in 1:nrow(iDat)) {
    if (iDat$lifeExp[i] < Benchmark) {
        Count[i] = 1
    } else {
        Count[i] = 0
    }
}
newiDat = cbind(iDat, Count)
countlifeExp = ddply(newiDat, .(continent, year), summarize, CountryCount = sum(Count))
countlifeExp = xtable(countlifeExp)
print(countlifeExp, type = "html", include.rownames = FALSE)
continent year CountryCount
Africa 1952 50.00
Africa 1957 49.00
Africa 1962 47.00
Africa 1967 39.00
Africa 1972 36.00
Africa 1977 28.00
Africa 1982 24.00
Africa 1987 20.00
Africa 1992 20.00
Africa 1997 20.00
Africa 2002 22.00
Africa 2007 18.00
Asia 1952 22.00
Asia 1957 18.00
Asia 1962 17.00
Asia 1967 12.00
Asia 1972 6.00
Asia 1977 5.00
Asia 1982 3.00
Asia 1987 1.00
Asia 1992 1.00
Asia 1997 1.00
Asia 2002 1.00
Asia 2007 1.00
Americas 1952 9.00
Americas 1957 8.00
Americas 1962 6.00
Americas 1967 2.00
Americas 1972 2.00
Americas 1977 1.00
Americas 1982 0.00
Americas 1987 0.00
Americas 1992 0.00
Americas 1997 0.00
Americas 2002 0.00
Americas 2007 0.00
Europe 1952 1.00
Europe 1957 1.00
Europe 1962 0.00
Europe 1967 0.00
Europe 1972 0.00
Europe 1977 0.00
Europe 1982 0.00
Europe 1987 0.00
Europe 1992 0.00
Europe 1997 0.00
Europe 2002 0.00
Europe 2007 0.00

This long table is not a nice object to show to an audience (in a presentation, a paper, etc.). However, it is a good input for plotting the data in a figure, which is a much nicer object to show to others: far more compact and informative than the table above.

## Plotting the Low-Life-Expectancy counts over time for continents
xyplot(CountryCount ~ year, data = countlifeExp, group = continent, auto.key = TRUE)

Figure: number of low-life-expectancy countries versus year, grouped by continent.

An overall decrease in the number of low-life-expectancy countries is visible in this figure.
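The counting step in this section loops over rows one at a time. Since a comparison in R is already vectorized, the same 0/1 flag can be built in one line. A minimal sketch, using a made-up data frame and base R's aggregate() in place of ddply (toy and toyCounts are hypothetical names):

```r
## Hypothetical toy data standing in for iDat; values are invented
toy <- data.frame(
    continent = c("Africa", "Africa", "Asia", "Asia", "Europe", "Europe"),
    year = c(1952, 2007, 1952, 2007, 1952, 2007),
    lifeExp = c(38.5, 52.1, 45.2, 71.3, 64.4, 78.0)
)
Benchmark <- 50

## The comparison returns TRUE/FALSE for every row at once;
## as.integer() converts that to 1/0, replacing the explicit loop
toy$Count <- as.integer(toy$lifeExp < Benchmark)

## Sum the flags within each continent/year combination
toyCounts <- aggregate(Count ~ continent + year, data = toy, FUN = sum)
print(toyCounts)
```

With real data each continent/year group contains many countries, so the sum gives the number of countries below the benchmark in that group.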

Maximum and Minimum of GDP per Capita

Next, we are going to depict the maximum and minimum of GDP per capita for all continents.

## Identifying the minimum and maximum GDP per capita for different continents
## Reorder the continents with respect to gdpPercap
iDat <- within(iDat, continent <- reorder(continent, gdpPercap))
## Adapted from Rebecca Johnston's code (gDat -> iDat)
## Write own function to produce a data frame in tall format
minmax <- function(x) {
    ## Make character vector to specify min and max
    factor = c("Min", "Max")
    ## Specify function to compute min and max (same order as line above)
    gdpPerCapita = c(min(x$gdpPercap), max(x$gdpPercap))
    ## Make factor and value two columns in a data frame
    data.frame(factor, gdpPerCapita)
}
contMinMaxGdpTall <- ddply(iDat, ~continent, minmax)
contMinMaxGdpTable <- xtable(contMinMaxGdpTall)
print(contMinMaxGdpTable, type = "html", include.rownames = FALSE)
continent factor gdpPerCapita
Africa Min 241.17
Africa Max 21951.21
Americas Min 1201.64
Americas Max 42951.65
Asia Min 331.00
Asia Max 113523.13
Europe Min 973.53
Europe Max 49357.19
## Bar Chart for the Minimum and Maximum of GDP per Capita over the entire
## time period for all the continents
barchart(gdpPerCapita ~ factor | continent, contMinMaxGdpTall)

Figure: bar chart of minimum and maximum GDP per capita, one panel per continent.

Since there are strong temporal trends in GDP per capita, it is awkward to ignore the year here, so we bring the year variable into our visual display. The results are shown separately for each year.

## Use a similar way to generate the continent-wise minmax; this time
## separately for each year
contMinMaxGdpTall <- ddply(iDat, ~continent + year, minmax)
## Comparing different continents over time based on their minimum and
## maximum GDP per Capita
xyplot(gdpPerCapita ~ continent | as.factor(year), data = contMinMaxGdpTall, 
    group = factor, auto.key = TRUE, grid = "h", type = c("p", "a"))

Figure: minimum and maximum GDP per capita by continent, one panel per year.
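The minmax() helper in this section returns two rows per group, which is what makes the result "tall". To make the mechanics explicit, here is a rough base-R sketch of the same split-apply-combine idea on made-up numbers. The names toy, minmaxToy and stat are hypothetical; stat is used instead of factor only to avoid reusing the name of the base function factor().

```r
## Hypothetical toy data; GDP values are invented
toy <- data.frame(
    continent = c("Africa", "Africa", "Asia", "Asia"),
    gdpPercap = c(300, 2000, 500, 9000)
)

## Return one row for the minimum and one row for the maximum
minmaxToy <- function(x) {
    data.frame(
        stat = c("Min", "Max"),
        gdpPerCapita = c(min(x$gdpPercap), max(x$gdpPercap))
    )
}

## Split by continent, apply the helper to each piece, then stack the pieces
pieces <- lapply(split(toy, toy$continent), minmaxToy)
tall <- do.call(rbind, pieces)
## Each piece contributes two rows, so repeat each group name twice
tall$continent <- rep(names(pieces), each = 2)
rownames(tall) <- NULL
print(tall)
```

ddply() performs these three steps in one call and adds the grouping columns itself.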

Spread of GDP per Capita within the continents

Now, we want to look at the spread of GDP per capita within the continents.

## Rebecca's code - Mean and Median Omitted
contSpreadGdp <- ddply(iDat, ~continent + year, summarize, sdGdpPercap = sd(gdpPercap), 
    madGdpPercap = mad(gdpPercap), iqrGdpPercap = IQR(gdpPercap))
contSpreadGdp <- arrange(contSpreadGdp, sdGdpPercap)  #order table by standard deviation
contSpreadGdpXT <- xtable(contSpreadGdp)
print(contSpreadGdpXT, type = "html", include.rownames = FALSE)
continent year sdGdpPercap madGdpPercap iqrGdpPercap
Africa 1952 982.95 696.94 919.90
Africa 1957 1134.51 712.10 966.09
Africa 1962 1461.84 786.26 1028.89
Africa 1987 2566.53 814.02 2021.71
Africa 1992 2644.08 728.29 1963.99
Africa 1997 2820.73 771.40 2064.48
Africa 1967 2847.72 742.99 1110.30
Africa 2002 2972.65 819.40 2534.31
Americas 1952 3001.73 1265.22 1511.74
Europe 1952 3114.06 2983.64 3995.66
Africa 1982 3242.63 899.40 1958.93
Africa 1972 3286.85 952.11 1476.34
Americas 1957 3312.38 1917.27 2269.16
Americas 1962 3421.74 1719.81 2430.39
Africa 2007 3618.16 1032.21 3130.55
Europe 1957 3677.95 3692.10 5202.35
Africa 1977 4142.40 1015.21 2035.67
Americas 1967 4160.89 2076.92 2545.56
Europe 1962 4199.19 4225.83 5557.55
Europe 1967 4724.98 5134.80 6619.24
Americas 1972 4754.40 2229.77 2778.00
Americas 1977 5355.60 2260.26 2918.17
Europe 1972 5509.69 6001.40 7465.31
Americas 1982 5530.49 3463.59 4739.39
Europe 1977 5874.46 6864.29 8692.38
Europe 1982 6453.23 7646.98 9451.86
Americas 1987 6665.04 3292.12 3666.65
Americas 1992 7047.09 3231.02 3697.55
Europe 1987 7482.96 9014.97 11047.02
Americas 1997 7874.23 3934.24 5082.98
Asia 1987 8090.26 4824.52 9938.89
Asia 1982 8701.18 4921.98 11511.36
Americas 2002 8895.82 3167.47 3939.29
Europe 1992 9109.80 12065.97 16367.13
Americas 2007 9713.21 4773.60 6249.22
Asia 1992 9727.43 4282.12 13430.26
Europe 1997 10065.46 12347.82 17242.93
Asia 1997 11094.18 4315.99 17799.80
Asia 2002 11150.72 4497.79 17141.28
Europe 2002 11197.36 13027.36 18651.51
Europe 2007 11800.34 12506.17 19006.06
Asia 1977 11815.78 3638.67 10034.17
Asia 1967 14062.59 1969.57 5070.53
Asia 2007 14154.94 4566.12 19863.98
Asia 1962 16415.86 1428.06 3361.71
Asia 1952 18634.89 921.11 2285.64
Asia 1972 19087.50 2775.31 7547.82
Asia 1957 19506.52 1292.32 2496.68

But how can we gain insight into these measures using figures? The most basic way might be to draw the empirical density of GDP per capita for each continent.

## Empirical density function of GDP per capita for all the continents
densityplot(~gdpPercap, iDat, plot.points = FALSE, ref = TRUE, group = continent, 
    auto.key = list(columns = nlevels(iDat$continent)), n = 400)

Figure: empirical density of GDP per capita, grouped by continent.

We see that Europe's density is less symmetric than those of the other continents. This partly explains why Europe has the largest IQR and MAD values. On the other hand, Asia's density has a nearly symmetric part in addition to an asymmetric fat tail on the right. This explains why Asia has the largest standard deviation but only a medium MAD. The density has a lot to say that the table above could not.
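The point about a fat right tail inflating the standard deviation while leaving the MAD and IQR nearly unchanged can be checked on a tiny example. This is only an illustrative sketch with made-up numbers (x, xOut and spread are hypothetical names), not the Gapminder data:

```r
## Nine evenly spaced made-up values
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
## Same values, but the largest one is replaced by an extreme outlier
xOut <- c(1, 2, 3, 4, 5, 6, 7, 8, 900)

## Collect the three spread measures used in the table above
spread <- function(v) c(sd = sd(v), mad = mad(v), iqr = IQR(v))

## One row per data set: sd reacts strongly to the outlier,
## while mad and IQR depend only on the middle of the data
comparison <- rbind(clean = spread(x), withOutlier = spread(xOut))
print(comparison)
```

A few very rich countries in a continent play the role of the outlier here, which is consistent with a large standard deviation next to a moderate MAD.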

We can also use boxplots to compare the spread of GDP per capita among continents. We use a subset of the data for this part.

## Taking a look at box plots for a couple of years
## hDat contains the subset of the data for the years 1982, 1987, 1992, 1997, 2002, 2007
hDat <- subset(iDat, year %in% c(1982, 1987, 1992, 1997, 2002, 2007))
bwplot(gdpPercap ~ as.factor(year) | continent, hDat)

Figure: box plots of GDP per capita by year, one panel per continent.

I think these simple figures have a lot more to say about the spread of the data than the information we had in the tables. To see how tables might be boring, we select three years (1957, 1982 and 2007) and plot the "numbers" from the table above. I do not think these numbers can be interpreted as easily as a figure such as the density plot.

## Comparing different spread measures for three chosen years: 1957, 1982, 2007
## hDat contains the subset of the data for the years 1957, 1982, 2007
hDat <- subset(iDat, year %in% c(1957, 1982, 2007))
contSpreadGdp <- ddply(hDat, ~continent + year, summarize, sdGdpPercap = sd(gdpPercap), 
    madGdpPercap = mad(gdpPercap), iqrGdpPercap = IQR(gdpPercap))
contSpreadGdp <- arrange(contSpreadGdp, sdGdpPercap)  #order table by standard deviation
## Evaluation of different spread measures for each continent over time
iQRplot <- xyplot(iqrGdpPercap ~ continent, data = contSpreadGdp, group = year, 
    auto.key = list(reverse.rows = TRUE), grid = "h", type = c("p", "a"))
MADplot <- xyplot(madGdpPercap ~ continent, data = contSpreadGdp, group = year, 
    auto.key = list(reverse.rows = TRUE), grid = "h", type = c("p", "a"))
SDplot <- xyplot(sdGdpPercap ~ continent, data = contSpreadGdp, group = year, 
    auto.key = list(reverse.rows = TRUE), grid = "h", type = c("p", "a"))
print(iQRplot, position = c(0, 0, 0.33, 1), more = TRUE)
print(MADplot, position = c(0.33, 0, 0.66, 1), more = TRUE)
print(SDplot, position = c(0.66, 0, 1, 1))

Figure: IQR, MAD and standard deviation of GDP per capita by continent, grouped by year, in three side-by-side panels.
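The layout above places each plot with hand-computed position rectangles. As a possible alternative, print() for lattice objects also accepts a split argument of the form c(column, row, number of columns, number of rows), which avoids working out the fractions by hand. A minimal sketch on made-up numbers (toySpread, sdToy and iqrToy are hypothetical stand-ins for contSpreadGdp and its columns):

```r
library(lattice)

## Hypothetical toy data; spread values are invented
toySpread <- data.frame(
    continent = factor(rep(c("Africa", "Asia", "Europe"), times = 2)),
    year = rep(c(1957, 2007), each = 3),
    sdToy = c(1, 9, 3, 3, 7, 6),
    iqrToy = c(1, 2, 5, 3, 9, 10)
)

## Build the two plots without drawing them yet
sdPlot <- xyplot(sdToy ~ continent, data = toySpread, groups = year,
    auto.key = TRUE, type = c("p", "a"))
iqrPlot <- xyplot(iqrToy ~ continent, data = toySpread, groups = year,
    auto.key = TRUE, type = c("p", "a"))

## Draw them side by side on a 2-column, 1-row grid;
## more = TRUE keeps the page open for the next plot
print(sdPlot, split = c(1, 1, 2, 1), more = TRUE)
print(iqrPlot, split = c(2, 1, 2, 1))
```

The position argument used in the original code remains the more flexible choice when the panels should not have equal widths.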