STAT 545A Homework #4 Visualize a Quantitative Variable

In this assignment, we prepare ourselves to take advantage of graphical figures in data aggregation tasks. Similar to last assignment, we have used Gapminder data.I want to emphasize that I have used the assignment #3 codes of two students: Rebecca Johnston and Jinyuan Zhang

Loading the Data

We start with loading data and checking the structure of the input:

## Loading Data
gdURL <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
gDat <- read.delim(file = gdURL)
library(plyr)
library(xtable)
library(lattice)

Now let's get rid of the Oceania continent which has only two countires and it will not give us informative figures and tables.

## Omit Oceania From the data, as it contains only 2 countries
iDat <- droplevels(subset(gDat, continent != "Oceania"))

Mean Life Expectancy over Time

Now we report the average life expectancy for different years.

iDat <- within(iDat, continent <- reorder(continent, lifeExp))  ## Change the order of continents based on their life expectancy 
MeanLife = ddply(gDat, ~year, summarize, MeanLifeExp = mean(lifeExp))
MeanLifeTable = xtable(MeanLife)
## Plotting the Average of Life Expectancy over time
xyplot(MeanLifeExp ~ year, data = MeanLife, grid = "h", type = c("p", "a"))

plot of chunk unnamed-chunk-2

As we see the average life expectancy increases with time. Obviously, the overal trend is easier to be seen using a figure rather than a table.

Life Expectancy over time for different continents

In this part we want to see how life expectancy changes over time on different continents.

## Plotting Life Expectancies over time across different continents
stripplot(lifeExp ~ as.factor(year), iDat, groups = continent, auto.key = list(reverse.rows = TRUE), 
    jitter.data = TRUE, grid = "h", type = c("p", "a"), fun = median)

plot of chunk unnamed-chunk-3

All the continents face an increase in the overal life expectancy. Among them, Europe has the highest median of life expectancy. As we see Asia has the fastest growth in the life expectancy over time.

Low-Life-Expectancy Countries

In this part we want to depict the number of countries with low life expectancy over time across continents. We have chosen a benchmark of 50 for defining the low-life-expectancy. Naturally, a country with life expectancy lower than this benchmark, is counted as a low-life expectancy country.

## Counting the Low-Life-Expectancy across continents with benchmark = 50
## Counter Function
Benchmark = 50
Count = c(rep(0, nrow(iDat)))
for (i in 1:nrow(iDat)) {
    if (iDat$lifeExp[i] < Benchmark) {
        Count[i] = 1
    } else {
        Count[i] = 0
    }
}
newiDat = cbind(iDat, Count)
countlifeExp = ddply(newiDat, .(continent, year), summarize, CountryCount = sum(Count))
countlifeExp = xtable(countlifeExp)
print(countlifeExp, type = "html", include.rownames = FALSE)

continent	year	CountryCount
Africa	1952	50.00
Africa	1957	49.00
Africa	1962	47.00
Africa	1967	39.00
Africa	1972	36.00
Africa	1977	28.00
Africa	1982	24.00
Africa	1987	20.00
Africa	1992	20.00
Africa	1997	20.00
Africa	2002	22.00
Africa	2007	18.00
Asia	1952	22.00
Asia	1957	18.00
Asia	1962	17.00
Asia	1967	12.00
Asia	1972	6.00
Asia	1977	5.00
Asia	1982	3.00
Asia	1987	1.00
Asia	1992	1.00
Asia	1997	1.00
Asia	2002	1.00
Asia	2007	1.00
Americas	1952	9.00
Americas	1957	8.00
Americas	1962	6.00
Americas	1967	2.00
Americas	1972	2.00
Americas	1977	1.00
Americas	1982	0.00
Americas	1987	0.00
Americas	1992	0.00
Americas	1997	0.00
Americas	2002	0.00
Americas	2007	0.00
Europe	1952	1.00
Europe	1957	1.00
Europe	1962	0.00
Europe	1967	0.00
Europe	1972	0.00
Europe	1977	0.00
Europe	1982	0.00
Europe	1987	0.00
Europe	1992	0.00
Europe	1997	0.00
Europe	2002	0.00
Europe	2007	0.00

This long table does not seem to be a nice object to show to audience (to be used in a presentation or paper or etc.) However, this is a good one to be used in plotting the data in a figure; a figure which is a nice object to be shown to others. Way more compact and informative than the table above.

## Plotting the Low-Life-Expectancy counts over time for continents
xyplot(CountryCount ~ year, data = countlifeExp, group = continent, auto.key = TRUE)

plot of chunk unnamed-chunk-5

An overal decrease in the low-life-expectancy countries is observable from these figures.

Maximum and Minimum of GDP per Capita

Next, we are going to depict the maximum and minimum of GDP per capita for all continents.

## Identifying the Minimum and Maximum GDP Per Capita for different
## continents reorder the data with respect to gdpPercap
iDat <- within(iDat, continent <- reorder(continent, gdpPercap))
## Rebecca Johnston gDat-> iDat Write own function to produce a data frame in
## tall format
minmax <- function(x) {
    ## Make character vector to specify min and max
    factor = c("Min", "Max")
    ## Specify function to compute min and max (same order as line above)
    gdpPerCapita = c(min(x$gdpPercap), max(x$gdpPercap))
    ## Make factor and value two columns in a data frame
    data.frame(factor, gdpPerCapita)
}
contMinMaxGdpTall <- ddply(iDat, ~continent, minmax)
contMinMaxGdpTable <- xtable(contMinMaxGdpTall)
print(contMinMaxGdpTable, type = "html", include.rownames = FALSE)

continent	factor	gdpPerCapita
Africa	Min	241.17
Africa	Max	21951.21
Americas	Min	1201.64
Americas	Max	42951.65
Asia	Min	331.00
Asia	Max	113523.13
Europe	Min	973.53
Europe	Max	49357.19

## Bar Chart for the Minimum and Maximum of GDP per Capita over the entire
## time period for all the continents
barchart(gdpPerCapita ~ factor | continent, contMinMaxGdpTall)

plot of chunk unnamed-chunk-6

Since, it is awkward to ignore the year here, as there are strong temporal trends in GDP per capita, we will try to bring the year variable into our visual display. We have shown the results separately for each year.

## Use a similar way to generate the continent-wise minmax; this time
## separately for each year
contMinMaxGdpTall <- ddply(iDat, ~continent + year, minmax)
## Comparing different continents over time based on their minimum and
## maximum GDP per Capita
xyplot(gdpPerCapita ~ continent | as.factor(year), data = contMinMaxGdpTall, 
    group = factor, auto.key = TRUE, grid = "h", type = c("p", "a"))

plot of chunk unnamed-chunk-7

Spread of GDP per Capita within the continents

Now, we want to look at the spread of GDP per capita within the continents.

## Rebecca's code - Mean and Median Omitted
contSpreadGdp <- ddply(iDat, ~continent + year, summarize, sdGdpPercap = sd(gdpPercap), 
    madGdpPercap = mad(gdpPercap), iqrGdpPercap = IQR(gdpPercap))
contSpreadGdp <- arrange(contSpreadGdp, sdGdpPercap)  #order table by standard deviation
contSpreadGdpXT <- xtable(contSpreadGdp)
print(contSpreadGdpXT, type = "html", include.rownames = FALSE)

continent	year	sdGdpPercap	madGdpPercap	iqrGdpPercap
Africa	1952	982.95	696.94	919.90
Africa	1957	1134.51	712.10	966.09
Africa	1962	1461.84	786.26	1028.89
Africa	1987	2566.53	814.02	2021.71
Africa	1992	2644.08	728.29	1963.99
Africa	1997	2820.73	771.40	2064.48
Africa	1967	2847.72	742.99	1110.30
Africa	2002	2972.65	819.40	2534.31
Americas	1952	3001.73	1265.22	1511.74
Europe	1952	3114.06	2983.64	3995.66
Africa	1982	3242.63	899.40	1958.93
Africa	1972	3286.85	952.11	1476.34
Americas	1957	3312.38	1917.27	2269.16
Americas	1962	3421.74	1719.81	2430.39
Africa	2007	3618.16	1032.21	3130.55
Europe	1957	3677.95	3692.10	5202.35
Africa	1977	4142.40	1015.21	2035.67
Americas	1967	4160.89	2076.92	2545.56
Europe	1962	4199.19	4225.83	5557.55
Europe	1967	4724.98	5134.80	6619.24
Americas	1972	4754.40	2229.77	2778.00
Americas	1977	5355.60	2260.26	2918.17
Europe	1972	5509.69	6001.40	7465.31
Americas	1982	5530.49	3463.59	4739.39
Europe	1977	5874.46	6864.29	8692.38
Europe	1982	6453.23	7646.98	9451.86
Americas	1987	6665.04	3292.12	3666.65
Americas	1992	7047.09	3231.02	3697.55
Europe	1987	7482.96	9014.97	11047.02
Americas	1997	7874.23	3934.24	5082.98
Asia	1987	8090.26	4824.52	9938.89
Asia	1982	8701.18	4921.98	11511.36
Americas	2002	8895.82	3167.47	3939.29
Europe	1992	9109.80	12065.97	16367.13
Americas	2007	9713.21	4773.60	6249.22
Asia	1992	9727.43	4282.12	13430.26
Europe	1997	10065.46	12347.82	17242.93
Asia	1997	11094.18	4315.99	17799.80
Asia	2002	11150.72	4497.79	17141.28
Europe	2002	11197.36	13027.36	18651.51
Europe	2007	11800.34	12506.17	19006.06
Asia	1977	11815.78	3638.67	10034.17
Asia	1967	14062.59	1969.57	5070.53
Asia	2007	14154.94	4566.12	19863.98
Asia	1962	16415.86	1428.06	3361.71
Asia	1952	18634.89	921.11	2285.64
Asia	1972	19087.50	2775.31	7547.82
Asia	1957	19506.52	1292.32	2496.68

But how to find an insight into these measures using figures. The most basic way might be to draw the empirical density of GDP per Capita for different continents.

## Empirical Density Function of GDP per Capita for all the continent
densityplot(~gdpPercap, iDat, plot.points = FALSE, ref = TRUE, group = continent, 
    auto.key = list(columns = nlevels(iDat$continent)), n = 400)

plot of chunk unnamed-chunk-9

We see that Europe has less symmetry in its density compared to other continents. This partly shows why Europe has the most IQR and MAD measures. On the other hand, Asia has a nearly symmetric part in its density in addition to a non-symmetric fat tail on the right. This shows why it has the biggest Standard deviation but a medium MAD. There is a lot to say about this density which the table above could not say.

We can also use the boxplots in order to compare the spread of GDP per capita data among different continents. We have chosen a part of data for this part.

## Taking a look on box plots for a couple of years hDat contains a part of
## data due to the years 1982, 1987, 1992,1997, 2002, 2007
hDat <- subset(iDat, year %in% c(1982, 1987, 1992, 1997, 2002, 2007))  # 
bwplot(gdpPercap ~ as.factor(year) | continent, hDat)

plot of chunk unnamed-chunk-10

I think these sorts of simple figures have a lot more to say about the spread of the data rather than that information we have had in the tables. To see how tables might be boring we have selected three years 1957, 1982 and 2007 and we draw the “numbers"from the table above. I do not think that these numbers could be interpreted as easy as what we have in a figure such as density plot.

## Comparing different spread measures for three chosen years:1957, 1982,
## 2007 hDat contains a part of data due to the years 1957, 1982, 2007
hDat <- subset(iDat, year %in% c(1957, 1982, 2007))  # 
contSpreadGdp <- ddply(hDat, ~continent + year, summarize, sdGdpPercap = sd(gdpPercap), 
    madGdpPercap = mad(gdpPercap), iqrGdpPercap = IQR(gdpPercap))
contSpreadGdp <- arrange(contSpreadGdp, sdGdpPercap)  #order table by standard deviation
## Evaluatation of different spread measure for each continent over time
iQRplot <- xyplot(iqrGdpPercap ~ continent, data = contSpreadGdp, group = year, 
    auto.key = list(reverse.rows = TRUE), grid = "h", type = c("p", "a"))
MADplot <- xyplot(madGdpPercap ~ continent, data = contSpreadGdp, group = year, 
    auto.key = list(reverse.rows = TRUE), grid = "h", type = c("p", "a"))
SDplot <- xyplot(sdGdpPercap ~ continent, data = contSpreadGdp, group = year, 
    auto.key = list(reverse.rows = TRUE), grid = "h", type = c("p", "a"))
print(iQRplot, position = c(0, 0, 0.33, 1), more = TRUE)
print(MADplot, position = c(0.33, 0, 0.66, 1), more = TRUE)
print(SDplot, position = c(0.66, 0, 1, 1))

plot of chunk unnamed-chunk-11