Preamble

Let's load some R packages.

library(ggplot2)
library(reshape2)

Introduction

In the beginning, it is important to download and source your data.

* If you are recording data by hand in Excel, document thoroughly how you are doing it.
* Downloaded data should cite the link it came from.

setwd("~/kai_r_markdown/")
gd_url <- "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt"
#download.file(url = gd_url, destfile = "gapminder_data.txt", method="curl")
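
One way to keep the download step repeatable without fetching the file on every run is sketched below. The helper name `fetch_once` and the `.source` note file are assumptions made for illustration; they are not part of the original workflow.

```r
# Hypothetical helper: download a file only if it is not already on disk,
# and keep a small plain-text note recording where it came from.
fetch_once <- function(url, destfile) {
  if (!file.exists(destfile)) {
    download.file(url = url, destfile = destfile)
  }
  note <- paste0(destfile, ".source")
  if (!file.exists(note)) {
    # Record the URL and the date the note was written
    writeLines(c(paste("url:", url), paste("recorded:", Sys.Date())), note)
  }
  invisible(destfile)
}

# Possible usage with the variables defined above:
# fetch_once(gd_url, "gapminder_data.txt")
```

Keeping the source note next to the data file means the link travels with the data even if the script is lost.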

Load the data [1].

gapminder_df <- read.table("gapminder_data.txt", sep="\t", header=TRUE)
head(gapminder_df)
##       country year      pop continent lifeExp gdpPercap
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134
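
Before going further, it can be worth confirming that the table has the columns shown in the output above. The following is a minimal sketch; the function name `check_columns` is hypothetical, and the expected column names are simply copied from the `head()` output.

```r
# Hypothetical helper: stop with a clear message if expected columns are missing.
check_columns <- function(df,
                          expected = c("country", "year", "pop",
                                       "continent", "lifeExp", "gdpPercap")) {
  missing_cols <- setdiff(expected, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# Possible usage with the data frame loaded above:
# check_columns(gapminder_df)
```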

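Let's find out what the worst country in the world is, measured here by the lowest recorded life expectancy.

subset(gapminder_df, lifeExp == min(lifeExp))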
##      country year     pop continent lifeExp gdpPercap
## 1287  Rwanda 1992 7290203    Africa  23.599  737.0686

And the best, by the highest life expectancy:

subset(gapminder_df, lifeExp == max(lifeExp))
##     country year       pop continent lifeExp gdpPercap
## 798   Japan 2007 127467972      Asia  82.603  31656.07
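
Both extremes can also be pulled out in a single step. The sketch below assumes a toy data frame, `toy`, built from three rows that appear in the outputs above; on the real data the same indexing would be applied to `gapminder_df`.

```r
# Toy data copied from rows printed earlier in this document (an assumption
# made so the example runs on its own).
toy <- data.frame(
  country = c("Afghanistan", "Rwanda", "Japan"),
  year    = c(1952, 1992, 2007),
  lifeExp = c(28.801, 23.599, 82.603)
)

# which.min / which.max return the row index of the first minimum / maximum
extremes <- toy[c(which.min(toy$lifeExp), which.max(toy$lifeExp)), ]
extremes
```

Note that `which.min` and `which.max` return only the first match, whereas `subset(..., lifeExp == min(lifeExp))` keeps every tied row.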

Clearly…

Okay, let's just plot life expectancy.

# year is converted to a factor so that each year gets its own boxes
p1 <- ggplot(gapminder_df, aes(x=factor(year), y=lifeExp)) + geom_boxplot(aes(color=continent))
p1

p2 <- ggplot(gapminder_df, aes(x=lifeExp, y=gdpPercap)) + geom_point() + scale_y_log10() + stat_smooth(method="lm")
p2

p3 <- ggplot(gapminder_df, aes(x=year, y=gdpPercap)) + geom_point() + scale_y_log10() + stat_smooth(method="lm") + facet_wrap(~continent)
p3

Does population relate to life expectancy?

p4 <- ggplot(gapminder_df, aes(x=pop, y=lifeExp)) + geom_point(aes(color=continent)) + geom_text(aes(label=country))
p4
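
A plot can be backed up with a number. The sketch below computes a Pearson correlation between log population and life expectancy, assuming only the six Afghanistan rows printed in the `head()` output earlier, so it says nothing about the full dataset. The names `afg` and `r_pop_life` are illustrative.

```r
# Six Afghanistan rows copied from the head() output above
afg <- data.frame(
  pop     = c(8425333, 9240934, 10267083, 11537966, 13079460, 14880372),
  lifeExp = c(28.801, 30.332, 31.997, 34.020, 36.088, 38.438)
)

# Population spans orders of magnitude across countries, so use a log scale
r_pop_life <- cor(log10(afg$pop), afg$lifeExp)
r_pop_life
```

Within a single country both quantities mostly rise over time, so a high value here may reflect a shared time trend rather than any direct link.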

Let's take a closer look at China and India.

china_india <- subset(gapminder_df, country %in% c("China", "India"))
p5 <- ggplot(china_india, aes(x=pop, y=lifeExp)) + geom_point(aes(color=country, size=sqrt(pop))) + facet_wrap(~country) + stat_smooth(formula=y~poly(x,3), method="lm")
p5
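
The smoother in the plot above fits a cubic polynomial by least squares. The same kind of model can be fitted directly with `lm()`, sketched here on the six Afghanistan rows from the `head()` output (an assumption for the sake of a runnable example; the plot itself uses China and India).

```r
# Six Afghanistan rows copied from the head() output above
afg <- data.frame(
  pop     = c(8425333, 9240934, 10267083, 11537966, 13079460, 14880372),
  lifeExp = c(28.801, 30.332, 31.997, 34.020, 36.088, 38.438)
)

# Same formula shape as stat_smooth(formula = y ~ poly(x, 3), method = "lm")
fit <- lm(lifeExp ~ poly(pop, 3), data = afg)

coef(fit)               # intercept plus three polynomial terms
summary(fit)$r.squared  # share of variance explained by the cubic fit
```

With only a dozen or so points per country, a cubic has enough freedom to follow almost any smooth trend, so a close fit on its own is weak evidence of a relationship.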

Speculation: Why are China and India the only two countries where life expectancy correlates with population growth?

Works Cited

[1] D. H. Huson, A. F. Auch, J. Qi, and S. C. Schuster, “MEGAN analysis of metagenomic data,” Genome Res., vol. 17, no. 3, pp. 377–386, Mar. 2007.