Final Project STAT 545A - Daniel Dinsdale

Introduction

This document outlines the final project, which can be accessed here. The link includes instructions on how to run the pipeline through R.

Life Expectancy Analysis

The initial plots show how mean life expectancy (in years) has changed in each of the five continents since 1952. The first plot shows the mean values, whilst the second fits a linear model to the data for each continent. I fit the same kind of model country by country later in the document.
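
The data frame lifeExpCont used in the first plot is built elsewhere in the pipeline. As a minimal sketch of how such a summary could be produced, assuming one row per country and year in gDat, base R's aggregate() gives the mean per continent and year. The toy values below are made up for illustration and are not the real Gapminder figures.

```r
# Toy stand-in for the Gapminder data frame gDat; the values are made up
# purely for illustration and are not real life expectancies.
gDat <- data.frame(
  continent = c("Africa", "Africa", "Africa", "Africa",
                "Europe", "Europe", "Europe", "Europe"),
  country   = c("A", "B", "A", "B", "C", "D", "C", "D"),
  year      = c(1952, 1952, 1957, 1957, 1952, 1952, 1957, 1957),
  lifeExp   = c(40, 44, 42, 46, 64, 68, 66, 70)
)

# One row per continent-year combination, holding the mean of lifeExp.
lifeExpCont <- aggregate(lifeExp ~ continent + year, data = gDat, FUN = mean)

# Rename the summary column to match the name used in the plotting code.
names(lifeExpCont)[names(lifeExpCont) == "lifeExp"] <- "meanLifeExp"
print(lifeExpCont)
```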

p1 <- ggplot(lifeExpCont, aes(x = year, y = meanLifeExp, colour = continent))
p1 + geom_line() + ylab("Mean Life Expectancy (Years)") + xlab("Year")

p2 <- ggplot(gDat, aes(x = year, y = lifeExp, colour = continent))
p2 + geom_smooth(method = "lm") + ylab("Life Expectancy (Years)") + xlab("Year")

Following this I also produced a stacked histogram. Note that the colours used here are the default palette; later on I experiment with RColorBrewer.

p3 <- ggplot(gDat, aes(x = lifeExp, fill = continent))
p3 + geom_histogram(binwidth = 1.5) + ylab("Count") + xlab("Life Expectancy")

This plot shows how many observations at each life expectancy come from each continent. For example, a life expectancy of around 40 was far more likely to come from Africa than from any other continent; in fact Oceania has no life expectancy this low. The histogram has to be taken with a pinch of salt, however, as these are counts and the continents do not all have the same amount of data. Oceania, for instance, has very few data points.
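
To illustrate the caveat about counts, here is a small sketch of how raw counts per bin can be turned into shares within each continent, so that continents of different sizes become comparable. The vectors are toy data with made-up values, and the two coarse bins are chosen only to keep the output short.

```r
# Toy stand-ins for gDat$continent and gDat$lifeExp (made-up values).
continent <- c(rep("Africa", 6), rep("Europe", 3), "Oceania")
lifeExp   <- c(38, 41, 43, 47, 52, 55, 66, 70, 73, 71)

# Two coarse life expectancy bins: (30,50] and (50,80].
bin <- cut(lifeExp, breaks = c(30, 50, 80))

# Raw counts, as in the histogram.
counts <- table(continent, bin)

# Share of each continent's own observations in each bin (rows sum to 1).
shares <- prop.table(counts, margin = 1)

print(counts)
print(round(shares, 2))
```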

From here I reordered the continents within the file by mean life expectancy (as opposed to alphabetical order). This swapped Asia with the Americas, and the new file was named “gapminderOrderedContinents.txt”. Note that at this point I removed Oceania to prevent it from cluttering up my plots. With this new file I also wrote a function that fits a linear model of life expectancy for each country. Thanks to Jenny for providing examples of this in lectures. Using the newly ordered file I added three columns: intercept, slope and residual. I then wrote the result to a file named “gapminderWithInterceptsOrdered.txt”.
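
The script that performs these steps is not reproduced in this document, so the following is only a sketch of one way they could look in base R. The helper name fitCountry is hypothetical, the toy data are made up, and the definition of the residual column (largest absolute residual) is an assumption, since the text does not say which residual summary was stored. The shift by 1952 matches the x axis used in the later plots.

```r
# Toy stand-in for gDat; values are made up for illustration only.
gDat <- data.frame(
  continent = factor(rep(c("Americas", "Asia"), each = 6)),
  country   = factor(rep(c("P", "Q", "R", "S"), each = 3)),
  year      = rep(c(1952, 1957, 1962), times = 4),
  lifeExp   = c(60, 62, 64, 50, 53, 56, 40, 42, 45, 58, 59, 61)
)

# Reorder continent levels by increasing mean life expectancy
# instead of the default alphabetical order.
gDat$continent <- reorder(gDat$continent, gDat$lifeExp, FUN = mean)

# Hypothetical helper: fit lifeExp against years since 1952 for one country.
fitCountry <- function(d) {
  fit <- lm(lifeExp ~ I(year - 1952), data = d)
  data.frame(
    continent = d$continent[1],
    country   = d$country[1],
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    residual  = max(abs(resid(fit)))  # assumption: largest absolute residual
  )
}

# Apply the helper to each country and stack the one-row results.
coefs <- do.call(rbind, lapply(split(gDat, gDat$country, drop = TRUE), fitCountry))
print(coefs)
```

Merging coefs back onto the yearly data by country would give the three extra columns described above.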

Linear Models

With the new file I used a second R script to do the following analysis.

First of all I identified the three best and three worst countries for mean life expectancy in every continent except Oceania (which I ignore from now on). I wrote two new files containing only these extreme countries: “gapminderbestworstcoef.txt” holds the intercept, slope and residual, and “gapminderbestworstlife.txt” holds the yearly life expectancy. From these I could produce some interesting plots. Note that within continents the countries are all arranged in order of life expectancy.
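
As a rough sketch of the selection step, assuming a data frame with one mean life expectancy per country, the three lowest and three highest countries in each continent can be picked by sorting within continent. The helper bestWorst and the data frame meanLife are hypothetical names, and the values are made up.

```r
# Toy per-country mean life expectancies for two continents (made-up values).
meanLife <- data.frame(
  continent   = rep(c("Asia", "Europe"), each = 7),
  country     = paste0("c", 1:14),
  meanLifeExp = c(45, 52, 58, 61, 66, 70, 74, 68, 70, 71, 73, 75, 76, 78)
)

# Hypothetical helper: sort one continent by mean life expectancy and keep
# the n lowest and n highest rows (assumes at least 2 * n countries).
bestWorst <- function(d, n = 3) {
  d <- d[order(d$meanLifeExp), ]
  d[c(seq_len(n), seq(nrow(d) - n + 1, nrow(d))), ]
}

# Apply the helper to each continent and stack the results.
extremes <- do.call(rbind, lapply(split(meanLife, meanLife$continent), bestWorst))
print(extremes)
```

Because each continent is sorted before the rows are picked, the result is already arranged in order of life expectancy within continent.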

p4 <- ggplot(subset(bestWorstCountry, continent == "Asia"), aes(x = year - 1952, 
    y = lifeExp, colour = country))
p4 + geom_point() + geom_abline(aes(intercept = intercept, slope = slope), data = subset(bestWorstCoefs, 
    continent == "Asia")) + ylab("Life Expectancy") + xlab("Year-1952") + facet_grid(~country)

p5 <- ggplot(subset(bestWorstCountry, continent == "Africa"), aes(x = year - 
    1952, y = lifeExp, colour = country))
p5 + geom_point() + geom_abline(aes(intercept = intercept, slope = slope), data = subset(bestWorstCoefs, 
    continent == "Africa")) + ylab("Life Expectancy") + xlab("Year-1952") + 
    facet_grid(~country)

p6 <- ggplot(subset(bestWorstCountry, continent == "Americas"), aes(x = year - 
    1952, y = lifeExp, colour = country))
p6 + geom_point() + geom_abline(aes(intercept = intercept, slope = slope), data = subset(bestWorstCoefs, 
    continent == "Americas")) + ylab("Life Expectancy") + xlab("Year-1952") + 
    facet_grid(~country)

p7 <- ggplot(subset(bestWorstCountry, continent == "Europe"), aes(x = year - 
    1952, y = lifeExp, colour = country))
p7 + geom_point() + geom_abline(aes(intercept = intercept, slope = slope), data = subset(bestWorstCoefs, 
    continent == "Europe")) + ylab("Life Expectancy") + xlab("Year-1952") + 
    facet_grid(~country)

These scatterplots show, for the three best and three worst countries per continent by life expectancy, the data with my fitted linear model overlaid. Zimbabwe in Africa has an interesting plot, since its life expectancy drops drastically from the late 1980s onwards.

Colours

Finally, here are some plots produced in the name of experimentation! I tried RColorBrewer to improve the colours within the plots and also tried scaling the size of the points by population. Note that I used Jenny's file of gapminder country colours, and that the final plot produces a warning message because I altered the y limits to exclude an outlier!
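
The size aesthetic in the bubble plots is sqrt(pop/pi). Presumably the idea is that this is the radius of a circle whose area equals the population, so that a country with four times the population gets twice the radius. The short check below uses made-up populations; how the size scale then turns these values into drawn point sizes depends on the ggplot2 version.

```r
# Made-up populations, each four times the previous one.
pop <- c(1e6, 4e6, 16e6)

# Radius of a circle whose area equals the population.
radius <- sqrt(pop / pi)

# Recover the area from the radius to confirm the relationship.
area <- pi * radius^2

print(data.frame(pop = pop, radius = radius, area = area))
```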

p8 <- ggplot(jDat, aes(x = lifeExp, y = gdpPercap))
p8 + geom_point() + facet_wrap(~continent) + aes(colour = country) + scale_color_manual(values = jColors) + 
    theme(legend.position = "none")

p9 <- ggplot(subset(jDat, year == 2007), aes(x = lifeExp, y = gdpPercap))
p9 + geom_point(aes(size = sqrt(pop/pi)), pch = 21, show.legend = FALSE) + scale_size_continuous(range = c(1, 
    40)) + facet_wrap(~continent) + aes(fill = country) + scale_fill_manual(values = jColors)

p10 <- ggplot(subset(jDat, year == 1952), aes(x = lifeExp, y = gdpPercap))
p10 + scale_y_continuous(limits = c(0, 20000)) + geom_point(aes(size = sqrt(pop/pi)), 
    pch = 21, show.legend = FALSE) + scale_size_continuous(range = c(1, 40)) + 
    facet_wrap(~continent) + aes(fill = country) + scale_fill_manual(values = jColors)
## Warning: Removed 1 rows containing missing values (geom_point).

Danny