Synopsis

The purpose of this assignment is to perform some basic exploratory analysis in R. The dataset covers life expectancy, population, and GDP per capita for countries around the world from 1950 to 2007. The analysis shows clear differences among the continents in GDP per capita and, for the United States specifically, generally steady growth in GDP per capita over time.

Packages Required

For this assignment, three packages are required: gapminder, dplyr, and ggplot2. They are loaded and described below.

## Load the necessary libraries ##
library(gapminder) ## Used to collect the data.
library(dplyr) ## Used for data manipulation
library(ggplot2) ## Used for data visualization

Source Code

For this assignment, I will be using data from the gapminder package, which is sourced from https://www.gapminder.org/data/. The data contain 6 variables: country, continent, year, life expectancy, population, and GDP per capita. Country, continent, and year identify where and when each observation was recorded. Life expectancy is the number of years a person is expected to live at birth. Population is a measure of the total number of people living in each country, and GDP per capita is the total GDP of the country divided by its total population.

## Load the required data
data_gap <- gapminder_unfiltered

Data Description

To investigate the structure of the data, I will use the str() function as demonstrated below.

## Structure of the data ##
str(data_gap)
## Classes 'tbl_df', 'tbl' and 'data.frame':    3313 obs. of  6 variables:
##  $ country  : Factor w/ 187 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 6 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

As one can see, there are 6 variables contained in this dataset with 3,313 observations of those variables. Currently, there are two categorical variables and 4 numerical variables. However, since “year” really isn’t a numerical variable in this case, I will change it to a categorical variable.

## Change year to categorical variable and show the structure of the data
data_gap$year <- factor(data_gap$year)
str(data_gap)
## Classes 'tbl_df', 'tbl' and 'data.frame':    3313 obs. of  6 variables:
##  $ country  : Factor w/ 187 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 6 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : Factor w/ 58 levels "1950","1951",..: 3 8 13 18 23 28 33 38 43 48 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

We can now see that “year” is a categorical variable, and we are confident in the structure of the data. Next, we will use the code below to check for any missing values.

## Check for missing values ##
sum(is.na(data_gap))
## [1] 0

Fortunately, there are no missing values to deal with in this dataset.
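
A single total can hide where missing values sit, so a per-column count is a useful companion to the check above. The following is a minimal sketch in base R; the `toy` data frame and its values are made up for illustration and stand in for data_gap.

```r
## Hypothetical toy data standing in for data_gap (values are made up)
toy <- data.frame(
  country = c("A", "B", "C"),
  lifeExp = c(70.1, NA, 65.3),
  pop     = c(1000L, 2000L, NA)
)

## Number of missing values in each column
na_by_col <- colSums(is.na(toy))
print(na_by_col)

## Number of rows with at least one missing value
n_incomplete <- sum(!complete.cases(toy))
print(n_incomplete)
```

Applied to the real data, colSums(is.na(data_gap)) should return 0 for every column, given the total of 0 reported above.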

Next, we will provide some basic summary statistics for the data using the following code.

## Summary statistics ##
summary(data_gap)
##            country        continent         year         lifeExp     
##  Czech Republic:  58   Africa  : 637   2002   : 187   Min.   :23.60  
##  Denmark       :  58   Americas: 470   1997   : 184   1st Qu.:58.33  
##  Finland       :  58   Asia    : 578   1992   : 183   Median :69.61  
##  Iceland       :  58   Europe  :1302   2007   : 183   Mean   :65.24  
##  Japan         :  58   FSU     : 139   1977   : 171   3rd Qu.:73.66  
##  Netherlands   :  58   Oceania : 187   1982   : 171   Max.   :82.67  
##  (Other)       :2965                   (Other):2234                  
##       pop              gdpPercap       
##  Min.   :5.941e+04   Min.   :   241.2  
##  1st Qu.:2.680e+06   1st Qu.:  2505.3  
##  Median :7.560e+06   Median :  7825.8  
##  Mean   :3.177e+07   Mean   : 11313.8  
##  3rd Qu.:1.961e+07   3rd Qu.: 17355.8  
##  Max.   :1.319e+09   Max.   :113523.1  
## 

While this does provide some interesting initial information on life expectancy, population, and GDP per capita, it offers little information about the categorical variables beyond counts. One interesting point, however, is that “FSU” is listed as a continent in the data. Let’s investigate what this level represents.

## Investigate FSU ## 
tmp <- data_gap %>%
  filter(continent == "FSU") %>%
  droplevels()
tmp$country %>% levels()
## [1] "Armenia"    "Belarus"    "Georgia"    "Kazakhstan" "Latvia"    
## [6] "Lithuania"  "Russia"     "Ukraine"    "Uzbekistan"

Based on the above output, one can conclude that “FSU” stands for Former Soviet Union. While this appears to be a mis-categorization (a political grouping from a certain time period rather than a continent), I will leave the data as it stands and revisit it should it affect the end analysis. Other than “FSU”, the continents in the data are Africa, the Americas, Asia, Europe, and Oceania. From the structure output, we can also conclude that the years in the data run from 1950 to 2007; most countries are recorded in increments of 5 years, while some have data for every year. Finally, we will compute a few basic summary statistics: the number of observations and countries per continent, the mean life expectancy per continent, the mean population per continent, and the mean GDP per capita per continent.

## Other general summary statistics ##
data_gap %>% 
  group_by(continent) %>% 
  summarize(n_obs = n(), n_countries = n_distinct(country), avglifeexp = mean(lifeExp), avgpop = mean(pop), meanGDP = mean(gdpPercap))
## # A tibble: 6 × 6
##   continent n_obs n_countries avglifeexp   avgpop   meanGDP
##      <fctr> <int>       <int>      <dbl>    <dbl>     <dbl>
## 1    Africa   637          53   49.03680  9728850  2175.859
## 2  Americas   470          36   67.09195 39416728 10802.574
## 3      Asia   578          43   62.41587 95444180 10073.938
## 4    Europe  1302          35   72.72164 15315944 16551.178
## 5       FSU   139           9   68.84430 31793002  7326.686
## 6   Oceania   187          11   69.74691  5424172 14057.097

As one can surmise from the output above, Africa has the largest number of unique countries and the lowest life expectancy by far. Europe has the highest average life expectancy over this time, and Asia has the largest average population.
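
The earlier observation that most countries are recorded every 5 years while some have yearly data can be checked by counting observations per country. The following is a minimal sketch in base R, using a made-up `toy` data frame in place of data_gap.

```r
## Hypothetical toy data: country "A" is recorded every 5 years, "B" every year
toy <- data.frame(
  country = c(rep("A", 3), rep("B", 11)),
  year    = c(1952, 1957, 1962, 1952:1962)
)

## Observations per country, largest first
obs_per_country <- sort(table(toy$country), decreasing = TRUE)
print(obs_per_country)

## Countries with more observations than a 5-year schedule would give
## over the same span (1952, 1957, 1962 -> 3 observations)
yearly <- names(obs_per_country)[obs_per_country > 3]
print(yearly)
```

With the real data, table(data_gap$country) plays the same role. The summary() output above already hints at the answer: countries such as Denmark and Japan have 58 observations each, one for every year level.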

Exploratory Data Analysis

Problem 1: For the year 2007, what is the distribution of GDP per capita across all countries?

## Problem 1: Distribution of GDP per Capita across all countries ##
data_gap %>% 
  filter(year == "2007") %>%
  summarize(min_GDP = min(gdpPercap), max_GDP = max(gdpPercap), avg_GDP = mean(gdpPercap), med_GDP = median(gdpPercap), sd_GDP = sd(gdpPercap)) 
## # A tibble: 1 × 5
##    min_GDP  max_GDP  avg_GDP  med_GDP   sd_GDP
##      <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1 277.5519 82010.98 12403.13 6873.262 13829.02
data_gap %>% 
  filter(year == "2007") %>% ## Restrict the histogram to 2007, matching the summary above
  ggplot() + 
    geom_histogram(mapping = aes(x = gdpPercap), binwidth = 5000)

As one can see from the statistical outputs and the histogram, the distribution of GDP per capita is heavily concentrated near 0, with a long tail to the right as GDP per capita increases. Because a few high-value outliers pull the mean ($12,403.13) upward, the median GDP per capita of $6,873.26 is the better measure of a typical country.
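
The effect described above, where a few large values pull the mean away from the median, can be illustrated with a small example. The values below are made up and are not taken from the gapminder data.

```r
## Hypothetical GDP per capita values: four modest countries and one outlier
gdp <- c(500, 1500, 3000, 7000, 80000)

avg_gdp <- mean(gdp)    ## pulled upward by the single large value
med_gdp <- median(gdp)  ## the middle value, unaffected by the outlier's size

print(c(mean = avg_gdp, median = med_gdp))

## Replacing the outlier with an even larger value changes the mean only
gdp_bigger <- replace(gdp, 5, 160000)
print(c(mean = mean(gdp_bigger), median = median(gdp_bigger)))
```

In this toy case the mean moves from 18,400 to 34,400 while the median stays at 3,000, which is why the median is the more robust summary for right-skewed data.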

Problem 2: For the year 2007, how do the distributions differ across the different continents?

## Problem 2: Distribution of GDP per Capita by Continent ##
data_gap %>% 
  filter(year == "2007") %>%
  group_by(continent) %>% 
  summarize(min_GDP = min(gdpPercap), max_GDP = max(gdpPercap), avg_GDP = mean(gdpPercap), med_GDP = median(gdpPercap), sd_GDP = sd(gdpPercap))
## # A tibble: 6 × 6
##   continent   min_GDP  max_GDP   avg_GDP   med_GDP    sd_GDP
##      <fctr>     <dbl>    <dbl>     <dbl>     <dbl>     <dbl>
## 1    Africa  277.5519 13206.48  3091.230  1463.249  3583.240
## 2  Americas 1201.6372 42951.65 11940.902  9065.801  9542.837
## 3      Asia  944.0000 82010.98 15338.057  4889.250 18864.574
## 4    Europe 2604.7505 49357.19 24174.153 25885.565 11747.064
## 5       FSU 2211.1589 16666.51  9522.539 10273.774  5357.088
## 6   Oceania 1827.0966 34435.37 13156.979  5143.615 13150.012
data_gap %>% 
  filter(year == "2007") %>%
  ## Density-scaled frequency polygons, one per continent (after_stat() needs ggplot2 >= 3.3.0)
  ggplot(mapping = aes(x = gdpPercap, y = after_stat(density))) +
    geom_freqpoly(mapping = aes(colour = continent), binwidth = 5000)

data_gap %>% 
  filter(year == "2007") %>%
  ggplot(mapping = aes(x = continent, y = gdpPercap)) + ## Boxplot
    geom_boxplot()

To answer this question, I first calculated some simple summary statistics for each continent’s GDP per capita. Africa contains the lowest single GDP per capita ($277.55) and has the lowest overall average. The Americas enjoy a significantly higher average GDP per capita, but with a minimum of $1,201.64, there are still countries well below that average. Asia has the widest spread between its countries, with the highest overall GDP per capita ($82,010.98) but also a minimum below $1,000. Europe has the highest average and median by a wide margin. Oceania’s average is close to that of the Americas, but its median ($5,143.62) is much lower, which suggests that a few high-income countries pull its average up.

I then created two charts to validate these conclusions. The first is a frequency polygon that shows the density of each continent’s countries at each GDP per capita level, and the second is a boxplot that gives a visual representation of several of the statistics listed above. The graphs are consistent with the individual statistics. Africa’s distribution sits close to 0 with a narrow spread. Asia is also clustered at low values but has a longer tail toward the right. The Americas are shifted farther to the right than Asia with a shorter tail. Europe is shifted farthest to the right, while Oceania has a larger spread, with most of its countries at low values and a few much higher.
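
For readers less familiar with boxplots, the quantities they draw can be computed directly. The sketch below uses base R's boxplot.stats() on a made-up vector. Note that geom_boxplot() computes its hinges as quartiles, which can differ slightly from these hinges for small samples, so this is an illustration and not a reproduction of the chart above.

```r
## Hypothetical values with one clear outlier
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

bs <- boxplot.stats(x)

## Lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$stats)

## Points lying more than 1.5 times the hinge spread beyond the box
print(bs$out)
```

Here the value 100 is flagged as an outlier, and the upper whisker stops at 9, the largest value that is not an outlier.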

Problem 3: For the year 2007, what are the top 10 countries with the largest GDP per capita?

## Problem 3: Top 10 Countries with the largest GDP per capita in 2007 ##
data_gap %>%
  filter(year == "2007") %>%
  mutate(GDPRank = min_rank(desc(gdpPercap))) %>% 
  filter(GDPRank <= 10) %>%
  arrange(desc(gdpPercap)) %>% 
  ggplot(mapping = aes(x = reorder(country, -gdpPercap), y = gdpPercap)) + 
    geom_bar(stat = "identity")
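
The ranking step above relies on dplyr's min_rank(desc(...)), which gives tied values the same rank. The same idea can be sketched in base R with rank(); the `toy` data frame and its values are invented for illustration.

```r
## Hypothetical toy data with a tie for first place
toy <- data.frame(
  country   = c("A", "B", "C", "D"),
  gdpPercap = c(100, 300, 200, 300)
)

## Rank from largest to smallest; ties share the smallest rank,
## which mirrors min_rank(desc(gdpPercap)) in dplyr
toy$GDPRank <- rank(-toy$gdpPercap, ties.method = "min")

## Keep the rows ranked 1 or 2 and sort them from largest to smallest
top2 <- toy[toy$GDPRank <= 2, ]
top2 <- top2[order(-top2$gdpPercap), ]
print(top2)
```

Because tied values share a rank, a filter such as GDPRank <= 10 can return more than 10 rows when there is a tie at the cutoff.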

Problem 4: Plot the GDP per capita for your country of origin for all years available.

## Problem 4: Plot the GDP per capita for your country of origin for all years available.
data_gap %>% 
  filter(country == "United States") %>% 
  ggplot(mapping = aes(x = year, y = gdpPercap, group = 1)) +
  geom_line()
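
Because year was converted to a factor earlier, the plot above needs group = 1 so that geom_line() connects the points. An alternative is to convert the factor back to numbers before plotting. The sketch below shows a common pitfall in that conversion, using a made-up factor in place of data_gap$year.

```r
## Hypothetical factor of years, similar to data_gap$year after conversion
year_f <- factor(c(1952, 1957, 1962))

## Pitfall: as.numeric() on a factor returns the internal level codes
codes <- as.numeric(year_f)
print(codes)

## Converting through character recovers the original years
years <- as.numeric(as.character(year_f))
print(years)
```

With a numeric year on the x axis, ggplot2 treats the axis as continuous and the group = 1 workaround should no longer be needed.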

Problem 5: What was the percent growth (or decline) in GDP per capita in 2007?

## Problem 5: What was the percent growth (or decline) in GDP per capita in 2007?
data_gap %>% 
  filter(country == "United States") %>% 
  mutate(gdpGrowth = ((gdpPercap - lag(gdpPercap)) / lag(gdpPercap))) %>%
  filter(year == "2007")
## # A tibble: 1 × 7
##         country continent   year lifeExp       pop gdpPercap  gdpGrowth
##          <fctr>    <fctr> <fctr>   <dbl>     <int>     <dbl>      <dbl>
## 1 United States  Americas   2007  78.242 301139947  42951.65 0.03065828

In 2007, the GDP per capita in the United States increased 3.1%.
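
The growth rate used above is (current value - previous value) / previous value. The sketch below reproduces that calculation in base R, without dplyr's lag(), on a made-up series.

```r
## Hypothetical GDP per capita series for three consecutive years
gdp <- c(100, 110, 99)

## Previous year's value; the first year has no predecessor
prev <- c(NA, head(gdp, -1))

## Growth rate: (current - previous) / previous
growth <- (gdp - prev) / prev
print(round(growth * 100, 1))
```

As with lag(), this assumes the rows are already sorted by year and belong to a single country.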

Problem 6: What has been the historical growth (or decline) in GDP per capita for your country?

## Problem 6: What has been the historical growth (or decline) in GDP per capita for your country?
data_gap %>% 
  filter(country == "United States") %>% 
  mutate(gdpGrowth = ((gdpPercap - lag(gdpPercap)) / lag(gdpPercap))) %>%
  ggplot(mapping = aes(x = year, y = gdpGrowth, group = 1)) +
  geom_line()

From the graph above, one can see that since 1952, GDP per capita in the United States has decreased in only 8 years; in the remaining years it grew at a rate between 0% and 5%.
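
The number of declining years can also be counted from the growth rates directly instead of being read off the graph. The following is a minimal sketch with made-up growth rates, not the actual United States figures.

```r
## Hypothetical yearly growth rates; the first is NA because it has no prior year
growth <- c(NA, 0.031, -0.012, 0.045, -0.003, 0.020)

## Number of years with a decline, ignoring the missing first value
n_decline <- sum(growth < 0, na.rm = TRUE)
print(n_decline)

## Range of growth in the years without a decline
grow_range <- range(growth[!is.na(growth) & growth >= 0])
print(grow_range)
```

Applied to the gdpGrowth column computed above, the same two expressions would give the count of declining years and the range of growth in the other years.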