Synopsis

This R Markdown Notebook is my report for the Data Wrangling with R class assignment for Week 4.
The following report uses dplyr and ggplot2 to transform and visualize data for exploratory data analysis.

Data Source

For this report we are looking at the gapminder_unfiltered data set which is included as a part of the gapminder package.

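The dataset contains “an excerpt of the data available at Gapminder.org. For each of 142 countries, the package provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007.” Note that this description refers to the filtered gapminder data; the unfiltered version used here covers more countries and years, as the data description shows.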

Packages required

library(tidyverse)
library(gapminder)
library(dplyr)
library(ggplot2)
library(scales)

About the data

The data is not filtered on year or complete cases and has 3313 rows.

The variables are summarized in the data description that follows.


Data description

# number of rows
nrow(gapminder_unfiltered)
## [1] 3313
# number of variables
ncol(gapminder_unfiltered)
## [1] 6
# structure of data
str(gapminder_unfiltered, give.attr = FALSE)
## Classes 'tbl_df', 'tbl' and 'data.frame':    3313 obs. of  6 variables:
##  $ country  : Factor w/ 187 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 6 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...
# sample rows
head(gapminder_unfiltered)
# summary stats
summary(gapminder_unfiltered)
##            country        continent         year         lifeExp     
##  Czech Republic:  58   Africa  : 637   Min.   :1950   Min.   :23.60  
##  Denmark       :  58   Americas: 470   1st Qu.:1967   1st Qu.:58.33  
##  Finland       :  58   Asia    : 578   Median :1982   Median :69.61  
##  Iceland       :  58   Europe  :1302   Mean   :1980   Mean   :65.24  
##  Japan         :  58   FSU     : 139   3rd Qu.:1996   3rd Qu.:73.66  
##  Netherlands   :  58   Oceania : 187   Max.   :2007   Max.   :82.67  
##  (Other)       :2965                                                 
##       pop              gdpPercap       
##  Min.   :5.941e+04   Min.   :   241.2  
##  1st Qu.:2.680e+06   1st Qu.:  2505.3  
##  Median :7.560e+06   Median :  7825.8  
##  Mean   :3.177e+07   Mean   : 11313.8  
##  3rd Qu.:1.961e+07   3rd Qu.: 17355.8  
##  Max.   :1.319e+09   Max.   :113523.1  
## 

We can see that we have data for 6 continents, and a total of 187 countries are represented in our data. We can also see that 58 years are covered, from 1950 to 2007.
The summary shows no missing (NA) values, but let us explore further before concluding that the data is complete.
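
The absence of NA values can also be checked directly rather than read off the summary. The following is a minimal sketch; the names na_per_column and any_missing are introduced here for illustration. Note that it only detects explicit NA entries, not country-year rows that are absent altogether.

```r
library(gapminder)

# Count explicit NA values in each column; all zeros means none are present.
na_per_column <- colSums(is.na(gapminder_unfiltered))
na_per_column

# Single TRUE/FALSE answer for the whole data frame.
any_missing <- any(is.na(gapminder_unfiltered))
any_missing
```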


ggplot(data = gapminder_unfiltered) +
  geom_bar(mapping = aes(x = year)) +
  labs(x = "Year", y = "Number of observations") +
  ggtitle("Number of observations per year")

table(gapminder_unfiltered$year)
## 
## 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 
##   39   24  144   24   24   24   24  144   25   25   26   26  151   26   26 
## 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 
##   27   27  156   27   27   27   27  168   32   27   27   27  171   27   27 
## 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 
##   27   27  171   27   27   27   27  171   27   27   32   33  183   33   33 
## 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 
##   33   33  184   33   33   33   33  187   33   32   30   18  183

We can see that the number of entries varies from year to year, which means that we do have missing data: not all countries have data for every year.

But the plot and the table show that the data is considerably more complete for every fifth year from 1952 to 2007.
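
The well-covered years can also be picked out programmatically. This is a sketch under one assumption: the cutoff of 100 rows is an illustrative threshold, chosen because the table shows that all remaining years have at most 39 rows. The name well_covered_years is introduced here for illustration.

```r
library(dplyr)
library(gapminder)

# Tabulate rows per year, then keep only the years with more than 100 rows.
# The cutoff of 100 is an assumption made for illustration.
well_covered_years <- gapminder_unfiltered %>%
  count(year) %>%
  filter(n > 100)

well_covered_years$year
```

The result should be the twelve five-yearly observations from 1952 to 2007, which is why the questions that follow focus on 2007.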

  1. For the year 2007, what is the distribution of GDP per capita across all countries?
gapminder_unfiltered %>%
  filter(year == 2007) %>%
  ggplot() +
    geom_histogram(mapping = aes(gdpPercap),
                   binwidth = 5000,
                   boundary = 0,
                   fill = "#9ad3de",
                   color = "#006BA6") +
  scale_x_continuous(breaks = pretty_breaks(n = 10)) +
  labs(x = "GDP Per Capita", y = "Number of Countries") +
  ggtitle("Distribution of GDP per capita across Countries")

As the histogram shows, most countries have a GDP per capita of less than $10,000.
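
To put a number on "most", the share of countries below the $10,000 mark can be computed directly. This is a minimal sketch; data_2007 and share_below_10k are names introduced here for illustration.

```r
library(dplyr)
library(gapminder)

# Keep only the 2007 observations (one row per country).
data_2007 <- gapminder_unfiltered %>%
  filter(year == 2007)

# The mean of a logical vector is the proportion of TRUE values,
# i.e. the share of countries with GDP per capita below 10000.
share_below_10k <- mean(data_2007$gdpPercap < 10000)
share_below_10k
```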


  2. For the year 2007, how do the distributions differ across the different continents?
gapminder_unfiltered %>%
  filter(year == 2007) %>%
  ggplot() +
    geom_boxplot(mapping = aes(x = continent,
                               y = gdpPercap),
                 fill = "#00A896") + 
    scale_y_continuous(breaks = pretty_breaks(n = 10)) +
    labs(y = "GDP Per Capita", x = "Continent") +
    ggtitle("Distribution of GDP per capita across continents") +
    coord_flip()

The box plots show the distribution of GDP per capita across the continents. Let us sort the continents by median and remove the outliers (countries with a gdpPercap of 55000 or more) to get a slightly better picture.

# Order the continents by their median 2007 GDP per capita, highest first
lvls <- gapminder_unfiltered %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(medianGdpPercap = median(gdpPercap)) %>%
  arrange(desc(medianGdpPercap)) %>%
  select(continent)

gapminder_unfiltered %>%
  filter(year == 2007, gdpPercap < 55000) %>%
  ggplot() +
    geom_boxplot(mapping = aes(x = factor(continent, levels = lvls$continent),
                               y = gdpPercap),
                 fill = "#00A896") + 
    scale_y_continuous(breaks = pretty_breaks(n = 10)) +
    labs(y = "GDP Per Capita", x = "Continent") +
    ggtitle("Distribution of GDP per capita across continents") +
    coord_flip()

With this we can see the distributions more clearly. Africa has the lowest median GDP per capita, while Europe has the highest.
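
The ranking read off the box plots can be cross-checked with a small summary table. This is a sketch; continent_medians, medianGdpPercap and n_countries are names introduced here for illustration. Unlike the trimmed plot, it uses all 2007 observations, including the outliers.

```r
library(dplyr)
library(gapminder)

# Median GDP per capita and number of countries per continent in 2007,
# sorted from the highest median to the lowest.
continent_medians <- gapminder_unfiltered %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(medianGdpPercap = median(gdpPercap),
            n_countries = n()) %>%
  arrange(desc(medianGdpPercap))

continent_medians
```

Because the median is robust to extreme values, the ordering of the continents should not depend on whether the outliers are dropped from the plot.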


  3. For the year 2007, what are the top 10 countries with the largest GDP per capita?
gapminder_unfiltered %>%
  filter(year == 2007) %>%
  filter(rank(desc(gdpPercap)) <= 10) %>%
  arrange(desc(gdpPercap)) %>%
  ggplot() +
  geom_bar(mapping = aes(x = factor(country, levels = rev(unique(country))),
                         y = gdpPercap,
                         fill = continent),
           stat = "identity") +
  labs(x = "Country", y = "GDP Per Capita") +
  ggtitle("Top 10 countries with largest GDP per capita") +
  coord_flip()
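
The same top 10 can be listed as a table to accompany the chart. This is a sketch that assumes dplyr 1.0.0 or later, where slice_max() is available; it is an alternative to the rank() filter used for the plot, and top10_2007 is a name introduced here for illustration.

```r
library(dplyr)
library(gapminder)

# slice_max() keeps the n rows with the largest values of a column and
# returns them in descending order. with_ties = FALSE guarantees
# exactly 10 rows even if two countries share a value.
top10_2007 <- gapminder_unfiltered %>%
  filter(year == 2007) %>%
  slice_max(gdpPercap, n = 10, with_ties = FALSE) %>%
  select(country, continent, gdpPercap)

top10_2007
```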


  4. Plot the GDP per capita for your country of origin for all years available.
gapminder_unfiltered %>%
  filter(country == 'India') %>%
  ggplot() +
    geom_line(mapping = aes(x = year, y = gdpPercap)) +
    labs(x = "Year", y = "GDP Per Capita") +
    ggtitle("India's GDP per capita over the years")


  5. What was the percent growth (or decline) in GDP per capita in 2007?
percent_growth_2007 <- gapminder_unfiltered %>%
  filter(country == 'India') %>%
  arrange(year) %>%
  mutate(change = (gdpPercap - lag(gdpPercap))/lag(gdpPercap) * 100) %>%
  filter(year == 2007) %>%
  select(year, change)

India’s GDP per capita in 2007 was 40.39% higher than in the preceding observation for India in the data.
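
The calculation relies on lag(), which shifts a vector by one position so that each value can be compared with the one before it. The sketch below shows the mechanics on a toy vector; the values in toy_gdp are hypothetical and chosen only so the result is easy to verify by hand.

```r
library(dplyr)

# Hypothetical values, not taken from the gapminder data.
toy_gdp <- c(100, 110, 121)

# lag() shifts the vector by one position, so each element is paired
# with its predecessor; the first element has none and yields NA.
toy_change <- (toy_gdp - lag(toy_gdp)) / lag(toy_gdp) * 100
toy_change
```

The first element is NA and the other two are each a 10% increase. The change is always measured against the previous row, so the rows must be sorted by year first, and the result describes growth since the last available observation rather than growth within a single year.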


  6. What has been the historical growth (or decline) in GDP per capita for your country?
India_gdpPercap <- gapminder_unfiltered %>%
  filter(country == 'India') %>%
  arrange(year) %>%
  select(year, gdpPercap)
India_gdpPercap %>%
  summarize(change = last(gdpPercap) - first(gdpPercap))

India’s GDP per capita was $546.57 in 1952 and $2452.21 in 2007.

India’s GDP per capita grew by $1905.64 over these years.
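
The absolute change can also be expressed in relative terms. The sketch below uses only the two rounded figures reported in this section, so its results are approximate; the compound annual rate additionally assumes steady growth across the whole period, which is a simplification. All variable names are introduced here for illustration.

```r
# Rounded values as reported in this section.
gdp_1952 <- 546.57
gdp_2007 <- 2452.21

# Absolute change in dollars.
absolute_change <- gdp_2007 - gdp_1952

# Total percent change relative to the 1952 value.
percent_change <- absolute_change / gdp_1952 * 100

# Compound annual growth rate in percent, assuming steady growth.
years_elapsed <- 2007 - 1952
annual_growth <- ((gdp_2007 / gdp_1952)^(1 / years_elapsed) - 1) * 100

round(c(absolute_change, percent_change, annual_growth), 2)
```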