Synopsis

This page performs exploratory data analysis on the Gapminder data, which records life expectancy, GDP per capita, and population by country. The analysis in this assignment uses the gapminder_unfiltered data frame specifically.

In general, the data show that GDP per capita and life expectancy have been increasing over time across all continents and in most countries. Population has remained mostly the same, with no drastic changes over time. The countries with the largest populations are China, India, and the United States. The countries with the longest life expectancies are Japan, Hong Kong, and Iceland.

Packages Required

library(gapminder) # data from gapminder.org
library(dplyr) # used to perform data transformation and manipulation
library(ggplot2) # used for data visualization
library(prettydoc) # document themes for R Markdown

Source Code

The gapminder_unfiltered data frame was not filtered on year or for complete data. Inspection of the data and of the gapminder documentation (opened with the code below) shows that the data set has six variables.

?gapminder
  • Country - which country the row is associated with
  • Continent - which continent the country belongs to
  • Year - which year the measures relate to (1950 to 2007 in this data frame; the filtered gapminder data frame runs from 1952 to 2007 in increments of 5 years)
  • Life Expectancy - life expectancy at birth, in years
  • Population - population of the country
  • GDP per Capita - GDP per capita for the country
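The variable list above can also be checked against the data frame itself rather than the help page. A minimal sketch using only base R functions on the packaged data; the name col_classes is introduced here for illustration:

```r
library(gapminder)

# Column names exactly as they are stored in the data frame
names(gapminder_unfiltered)

# Storage class of each column (hypothetical variable name col_classes)
col_classes <- sapply(gapminder_unfiltered, class)
col_classes
```

The stored names (for example lifeExp and gdpPercap) are the ones used in the filtering and plotting code later on this page.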

Data Description

The number of rows (observations) and columns (variables) in the gapminder_unfiltered data frame are shown below:

nrow(gapminder_unfiltered) # total number of observations
## [1] 3313
ncol(gapminder_unfiltered) # total number of variables
## [1] 6

Below are basic summary statistics for each of the six variables noted above (min, max, median, and mean for the numeric variables; row counts for country and continent):

summary(gapminder_unfiltered)
##            country        continent         year         lifeExp     
##  Czech Republic:  58   Africa  : 637   Min.   :1950   Min.   :23.60  
##  Denmark       :  58   Americas: 470   1st Qu.:1967   1st Qu.:58.33  
##  Finland       :  58   Asia    : 578   Median :1982   Median :69.61  
##  Iceland       :  58   Europe  :1302   Mean   :1980   Mean   :65.24  
##  Japan         :  58   FSU     : 139   3rd Qu.:1996   3rd Qu.:73.66  
##  Netherlands   :  58   Oceania : 187   Max.   :2007   Max.   :82.67  
##  (Other)       :2965                                                 
##       pop              gdpPercap       
##  Min.   :5.941e+04   Min.   :   241.2  
##  1st Qu.:2.680e+06   1st Qu.:  2505.3  
##  Median :7.560e+06   Median :  7825.8  
##  Mean   :3.177e+07   Mean   : 11313.8  
##  3rd Qu.:1.961e+07   3rd Qu.: 17355.8  
##  Max.   :1.319e+09   Max.   :113523.1  
## 

There are no missing values in the data set:

sum(is.na(gapminder_unfiltered))
## [1] 0
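A single total hides which column a missing value would come from. As a sketch of a slightly more detailed check, the count can be broken out per column; na_by_column is a name chosen here for illustration:

```r
library(gapminder)

# is.na() returns a logical matrix; colSums() counts the TRUE
# entries (missing values) in each column separately
na_by_column <- colSums(is.na(gapminder_unfiltered))
na_by_column
```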

The following continents are in the data set:

unique(gapminder_unfiltered$continent)
## [1] Asia     Europe   Africa   Americas FSU      Oceania 
## Levels: Africa Americas Asia Europe FSU Oceania

The total number of countries in the data set is shown below:

countries <- unique(gapminder_unfiltered$country)
length(countries)
## [1] 187
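Because the data frame was not filtered for complete data, countries may contribute different numbers of rows. One way to see this, sketched with dplyr's count(); obs_per_country is a name introduced here for illustration:

```r
library(gapminder)
library(dplyr)

# Number of rows (years observed) per country, largest first
obs_per_country <- count(gapminder_unfiltered, country, sort = TRUE)
head(obs_per_country)

# Smallest and largest number of observations for any country
range(obs_per_country$n)
```

Countries with many rows will carry more weight in any statistic computed over all rows, which is worth keeping in mind when reading the continent-level plots.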

Exploratory Data Analysis

1. Distribution of GDP per capita in 2007

# filter to obtain the data for 2007 
gapminder2007 <- filter(gapminder_unfiltered, year == 2007)
# plot the distribution of GDP per capita in 2007
ggplot(gapminder2007, aes(gdpPercap)) +
  geom_histogram(bins = 80, fill="darkgreen") + 
  ggtitle("2007 GDP Distribution across Countries") +
  xlab("GDP per Capita") + ylab("Count") +
  theme_classic()

2. Difference in GDP distribution for continents in 2007

# for the gapminder data for 2007, group by continent
# to show the average GDP per continent
GDPbyCont <- gapminder2007 %>% 
  group_by(continent) %>% 
  summarize(avgGDP = mean(gdpPercap))
# plot the avgGDP for each continent, to show the 
# GDP by continent in 2007
ggplot(GDPbyCont, 
  aes(x = continent, y = avgGDP, color = continent)) +
  geom_point(size=4) +
  ggtitle("2007 GDP by Continent") +
  xlab("Continent") + ylab("Avg. GDP per Capita") +
  theme_classic()
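The plot above compares only the continent means. To look at the spread within each continent, one option is a box plot. This is a sketch rather than part of the original analysis, and the object name gdp_boxplot is introduced here for illustration:

```r
library(gapminder)
library(dplyr)
library(ggplot2)

# Same 2007 subset as used in the sections above
gapminder2007 <- filter(gapminder_unfiltered, year == 2007)

# One box per continent: median, quartiles, and outlying countries
gdp_boxplot <- ggplot(gapminder2007, aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  ggtitle("2007 GDP per Capita Distribution by Continent") +
  xlab("Continent") + ylab("GDP per Capita") +
  theme_classic()
gdp_boxplot
```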

3. Top 10 countries with largest GDP per capita in 2007

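# arrange gapminder data for 2007 in order from largest
# to smallest GDP per capita, then keep the first 10 rows
# (top_n() without a wt argument ranks by the last column,
# so head() on the sorted data states the intent explicitly)
head(arrange(gapminder2007, desc(gdpPercap)), 10)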
## # A tibble: 10 x 6
##             country continent  year lifeExp       pop gdpPercap
##              <fctr>    <fctr> <int>   <dbl>     <int>     <dbl>
## 1             Qatar      Asia  2007  75.588    907229  82010.98
## 2      Macao, China      Asia  2007  80.718    456989  54589.82
## 3            Norway    Europe  2007  80.196   4627926  49357.19
## 4            Brunei      Asia  2007  77.118    386511  48014.59
## 5            Kuwait      Asia  2007  77.588   2505559  47306.99
## 6         Singapore      Asia  2007  79.972   4553009  47143.18
## 7     United States  Americas  2007  78.242 301139947  42951.65
## 8           Ireland    Europe  2007  78.885   4109086  40676.00
## 9  Hong Kong, China      Asia  2007  82.208   6980412  39724.98
## 10      Switzerland    Europe  2007  81.701   7554661  37506.42

4. US GDP per Capita over Time

# filter to obtain only US data
USgapminder <- filter(gapminder_unfiltered, country == "United States")
# plot data of GDP per capita over time for the US
ggplot(USgapminder,aes(year,gdpPercap)) + 
  geom_smooth() + 
  ggtitle("US GDP per Capita over Time") + 
  ylab("GDP Per Capita") +
  xlab("Year") + theme_classic()

5. Percent growth/decline in GDP per capita in 2007 for the US

# Find US GDP per capita in 2007
GDP2007 <- select(filter(USgapminder, year == 2007),gdpPercap)
# Find US GDP per capita in 2005 (the last year with GDP
# data before 2007)
GDP2005 <- select(filter(USgapminder, year == 2005),gdpPercap)
# Calculate the percent change in GDP per capita
((GDP2007 - GDP2005)/GDP2005)*100
##   gdpPercap
## 1  3.065828

US GDP per capita increased by approximately 3% from 2005 to 2007.
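The percent-change arithmetic can be wrapped in a small function so it is reusable for other countries or years. In this sketch, pct_change is a hypothetical helper name (it is not part of dplyr or gapminder), and pull() is used so the values are plain numbers rather than one-column data frames:

```r
library(gapminder)
library(dplyr)

# Hypothetical helper: percent change from an old value to a new value
pct_change <- function(old, new) {
  (new - old) / old * 100
}

# Illustrative numbers only (not taken from the data set)
pct_change(old = 200, new = 210)  # 5

# Applied to the US values for 2005 and 2007
us <- filter(gapminder_unfiltered, country == "United States")
gdp_2005 <- pull(filter(us, year == 2005), gdpPercap)
gdp_2007 <- pull(filter(us, year == 2007), gdpPercap)
us_change <- pct_change(gdp_2005, gdp_2007)
us_change
```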

6. Historical growth/decline in GDP per capita for the US

Based on the graph in part 4, GDP per capita in the US has been increasing steadily over time.

In addition, the data show that GDP per capita has been increasing over time across all continents. Moreover, population has remained fairly steady across all continents, and life expectancy has increased overall.

# gdp by continent
ggplot(gapminder_unfiltered, 
       aes(year,gdpPercap, colour = continent)) + 
  geom_smooth() + ggtitle("GDP over Time by Continent") +
  xlab("Year") + ylab("Avg. GDP per Capita") +
  theme_classic()

# population by continent
ggplot(gapminder_unfiltered, 
       aes(year,pop, colour = continent)) + 
  geom_smooth() + ggtitle("Population over Time by Continent") +
  xlab("Year") + ylab("Avg. Population") +
  theme_classic()

# life expectancy by continent
ggplot(gapminder_unfiltered, 
       aes(year, lifeExp, colour = continent)) + 
  geom_smooth() + ggtitle("Life Expectancy over Time by Continent") +
  xlab("Year") + ylab("Life Expectancy") +
  theme_classic()
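The synopsis names the most populous and longest-lived countries, but none of the code above ranks countries on those variables. One way to check those statements is sketched below. It assumes the ranking refers to 2007, which the synopsis does not state, and the names top_pop and top_life are introduced here for illustration:

```r
library(gapminder)
library(dplyr)

# Assumption: rank countries using the 2007 observations
gapminder2007 <- filter(gapminder_unfiltered, year == 2007)

# Three most populous countries in 2007
top_pop <- gapminder2007 %>%
  arrange(desc(pop)) %>%
  select(country, pop) %>%
  head(3)
top_pop

# Three countries with the highest life expectancy in 2007
top_life <- gapminder2007 %>%
  arrange(desc(lifeExp)) %>%
  select(country, lifeExp) %>%
  head(3)
top_life
```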