Synopsis

We will use Gapminder data to analyze GDP per capita and its growth across countries and continents, and then home in on the USA, my home country. We’ll use dplyr for data manipulation, and ggplot2 and base R for graphing.

Packages Required

library(gapminder)
library(dplyr)
library(ggplot2)

Source Code

The dataset contains six variables: country, continent, year, lifeExp (life expectancy), pop (population) and gdpPercap (GDP per capita).

GDP per capita is expressed in 2005 international dollars.

Data Description

The data come from gapminder_unfiltered, which is available in the gapminder package loaded in the Packages Required section. The Gapminder organization seeks to improve understanding of global development. More information about the organization is available at <www.gapminder.org/ignorance>.

The dataset, gapminder_unfiltered, includes 3313 observations of 6 variables. The datatypes for each variable are below:

df <- gapminder_unfiltered
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame':    3313 obs. of  6 variables:
##  $ country  : Factor w/ 187 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 6 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

There are 187 countries and 6 continents included. The countries and continents are listed below.

Additionally, as we’ll see, the table includes data from 1950 to 2007.

unique(df$country)
##   [1] Afghanistan              Albania                 
##   [3] Algeria                  Angola                  
##   [5] Argentina                Armenia                 
##   [7] Aruba                    Australia               
##   [9] Austria                  Azerbaijan              
##  [11] Bahamas                  Bahrain                 
##  [13] Bangladesh               Barbados                
##  [15] Belarus                  Belgium                 
##  [17] Belize                   Benin                   
##  [19] Bhutan                   Bolivia                 
##  [21] Bosnia and Herzegovina   Botswana                
##  [23] Brazil                   Brunei                  
##  [25] Bulgaria                 Burkina Faso            
##  [27] Burundi                  Cambodia                
##  [29] Cameroon                 Canada                  
##  [31] Cape Verde               Central African Republic
##  [33] Chad                     Chile                   
##  [35] China                    Colombia                
##  [37] Comoros                  Congo, Dem. Rep.        
##  [39] Congo, Rep.              Costa Rica              
##  [41] Cote d'Ivoire            Croatia                 
##  [43] Cuba                     Cyprus                  
##  [45] Czech Republic           Denmark                 
##  [47] Djibouti                 Dominican Republic      
##  [49] Ecuador                  Egypt                   
##  [51] El Salvador              Equatorial Guinea       
##  [53] Eritrea                  Estonia                 
##  [55] Ethiopia                 Fiji                    
##  [57] Finland                  France                  
##  [59] French Guiana            French Polynesia        
##  [61] Gabon                    Gambia                  
##  [63] Georgia                  Germany                 
##  [65] Ghana                    Greece                  
##  [67] Grenada                  Guadeloupe              
##  [69] Guatemala                Guinea                  
##  [71] Guinea-Bissau            Guyana                  
##  [73] Haiti                    Honduras                
##  [75] Hong Kong, China         Hungary                 
##  [77] Iceland                  India                   
##  [79] Indonesia                Iran                    
##  [81] Iraq                     Ireland                 
##  [83] Israel                   Italy                   
##  [85] Jamaica                  Japan                   
##  [87] Jordan                   Kazakhstan              
##  [89] Kenya                    Korea, Dem. Rep.        
##  [91] Korea, Rep.              Kuwait                  
##  [93] Latvia                   Lebanon                 
##  [95] Lesotho                  Liberia                 
##  [97] Libya                    Lithuania               
##  [99] Luxembourg               Macao, China            
## [101] Madagascar               Malawi                  
## [103] Malaysia                 Maldives                
## [105] Mali                     Malta                   
## [107] Martinique               Mauritania              
## [109] Mauritius                Mexico                  
## [111] Micronesia, Fed. Sts.    Moldova                 
## [113] Mongolia                 Montenegro              
## [115] Morocco                  Mozambique              
## [117] Myanmar                  Namibia                 
## [119] Nepal                    Netherlands             
## [121] Netherlands Antilles     New Caledonia           
## [123] New Zealand              Nicaragua               
## [125] Niger                    Nigeria                 
## [127] Norway                   Oman                    
## [129] Pakistan                 Panama                  
## [131] Papua New Guinea         Paraguay                
## [133] Peru                     Philippines             
## [135] Poland                   Portugal                
## [137] Puerto Rico              Qatar                   
## [139] Reunion                  Romania                 
## [141] Russia                   Rwanda                  
## [143] Samoa                    Sao Tome and Principe   
## [145] Saudi Arabia             Senegal                 
## [147] Serbia                   Sierra Leone            
## [149] Singapore                Slovak Republic         
## [151] Slovenia                 Solomon Islands         
## [153] Somalia                  South Africa            
## [155] Spain                    Sri Lanka               
## [157] Sudan                    Suriname                
## [159] Swaziland                Sweden                  
## [161] Switzerland              Syria                   
## [163] Taiwan                   Tajikistan              
## [165] Tanzania                 Thailand                
## [167] Timor-Leste              Togo                    
## [169] Tonga                    Trinidad and Tobago     
## [171] Tunisia                  Turkey                  
## [173] Turkmenistan             Uganda                  
## [175] Ukraine                  United Arab Emirates    
## [177] United Kingdom           United States           
## [179] Uruguay                  Uzbekistan              
## [181] Vanuatu                  Venezuela               
## [183] Vietnam                  West Bank and Gaza      
## [185] Yemen, Rep.              Zambia                  
## [187] Zimbabwe                
## 187 Levels: Afghanistan Albania Algeria Angola Argentina Armenia ... Zimbabwe
unique(df$continent)
## [1] Asia     Europe   Africa   Americas FSU      Oceania 
## Levels: Africa Americas Asia Europe FSU Oceania
unique(df$year)
##  [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007 1950 1951
## [15] 1953 1954 1955 1956 1958 1959 1960 1961 1963 1964 1965 1966 1968 1969
## [29] 1970 1971 1973 1974 1975 1976 1978 1979 1980 1981 1983 1984 1985 1986
## [43] 1988 1989 1990 1991 1993 1994 1995 1996 1998 1999 2000 2001 2003 2004
## [57] 2005 2006

Missing Data

There is no missing data in this dataset.

sum(is.na(df))
## [1] 0
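
A grand total of zero settles the question here, but when the total is non-zero a per-column count is usually more informative. A minimal sketch on a small made-up data frame (the `toy` object and its values are hypothetical, not Gapminder data):

```r
# Hypothetical toy data frame with two deliberately missing values
toy <- data.frame(
  year      = c(1952L, 1957L, 1962L),
  lifeExp   = c(28.8, NA, 32.0),
  gdpPercap = c(779, 821, NA)
)

# Count missing values per column instead of one grand total
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Number of rows with no missing value in any column
complete_rows <- sum(complete.cases(toy))
print(complete_rows)
```

The same two calls could be applied to df to confirm that every column, not just the total, is free of missing values.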

Exploratory Data Analysis

Question 1 - check the distribution of GDP per Capita for 2007

We’ll create a dataframe specifically for 2007, run a summary on it and plot a histogram to check the distribution. The distribution is heavily right-skewed: the median is about 6,900 while the maximum exceeds 82,000. The top of the list is led by Qatar and its oil money and Macao and its gambling money!

#let's create a dataframe for 2007 since we'll be using this year a bit
df07 <- df %>% filter(year == 2007)

summary(df07$gdpPercap)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   277.6  2147.0  6873.0 12400.0 19000.0 82010.0
hist(df07$gdpPercap, breaks = 20)

#We can use arrange to see the richest countries on a per capita basis
head(df07 %>% arrange(desc(gdpPercap)))
## # A tibble: 6 x 6
##        country continent  year lifeExp     pop gdpPercap
##         <fctr>    <fctr> <int>   <dbl>   <int>     <dbl>
## 1        Qatar      Asia  2007  75.588  907229  82010.98
## 2 Macao, China      Asia  2007  80.718  456989  54589.82
## 3       Norway    Europe  2007  80.196 4627926  49357.19
## 4       Brunei      Asia  2007  77.118  386511  48014.59
## 5       Kuwait      Asia  2007  77.588 2505559  47306.99
## 6    Singapore      Asia  2007  79.972 4553009  47143.18
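
With a distribution this skewed, a log scale often makes the histogram easier to read. A minimal sketch, assuming a handful of made-up GDP per capita values (the `gdp` vector is hypothetical and only mimics the shape seen above):

```r
# Hypothetical right-skewed values, not actual Gapminder observations
gdp <- c(300, 900, 2000, 5000, 7000, 12000, 20000, 45000, 82000)

# A mean far above the median is a quick sign of right skew
skew_ratio_raw <- mean(gdp) / median(gdp)

# After a log10 transform the mean and median sit close together
skew_ratio_log <- mean(log10(gdp)) / median(log10(gdp))
print(c(raw = skew_ratio_raw, log10 = skew_ratio_log))

# Bin the log-transformed values; drop plot = FALSE to draw the histogram
h <- hist(log10(gdp), breaks = 5, plot = FALSE)
print(h$counts)
```

On the real data the equivalent call would be hist(log10(df07$gdpPercap), breaks = 20).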

Question 2 - gdp/capita by continent

We will group the 2007 data by continent so we can see how gdp/capita is distributed within each continent. Europe isn’t doing so poorly. The box plot illustrates the spread for each continent.

df07 %>% group_by(continent) %>% summarise(mean = mean(gdpPercap),
 median = median(gdpPercap), min = min(gdpPercap), max = max(gdpPercap))
## # A tibble: 6 x 5
##   continent      mean    median       min      max
##      <fctr>     <dbl>     <dbl>     <dbl>    <dbl>
## 1    Africa  3091.230  1463.249  277.5519 13206.48
## 2  Americas 11940.902  9065.801 1201.6372 42951.65
## 3      Asia 15338.057  4889.250  944.0000 82010.98
## 4    Europe 24174.153 25885.565 2604.7505 49357.19
## 5       FSU  9522.539 10273.774 2211.1589 16666.51
## 6   Oceania 13156.979  5143.615 1827.0966 34435.37
df07cont <- df07 %>% group_by(continent)

#refer to columns by bare name inside aes() rather than with $
ggplot(data = df07cont, mapping = aes(x = continent, y = gdpPercap)) +
  geom_boxplot() +
  ggtitle("Distribution compare for GDP/capita")

Some interesting observations: Europe is substantially higher than all other continents. Asia is also very spread out, with a few countries at the top of our distribution.

Question 3 - let’s look at the top 10 gdp/capita

We’ll run a simple arrange on the 2007 dataframe and keep the top 10 with top_n. We already commented on Qatar.

df07 %>% arrange(desc(gdpPercap)) %>% top_n(n = 10, wt = gdpPercap)
## # A tibble: 10 x 6
##             country continent  year lifeExp       pop gdpPercap
##              <fctr>    <fctr> <int>   <dbl>     <int>     <dbl>
## 1             Qatar      Asia  2007  75.588    907229  82010.98
## 2      Macao, China      Asia  2007  80.718    456989  54589.82
## 3            Norway    Europe  2007  80.196   4627926  49357.19
## 4            Brunei      Asia  2007  77.118    386511  48014.59
## 5            Kuwait      Asia  2007  77.588   2505559  47306.99
## 6         Singapore      Asia  2007  79.972   4553009  47143.18
## 7     United States  Americas  2007  78.242 301139947  42951.65
## 8           Ireland    Europe  2007  78.885   4109086  40676.00
## 9  Hong Kong, China      Asia  2007  82.208   6980412  39724.98
## 10      Switzerland    Europe  2007  81.701   7554661  37506.42

Another note is that the lifeExp is quite high in these rich countries. Correlation or causation?

Question 4 - USA gdp per capita

We’ll plot the GDP per Capita over the years for the USA to see growth and declines. I wonder what happened in 1981…

dfUSA <- df %>% filter(country == 'United States') %>% select(gdpPercap, year)

ggplot(data = dfUSA, aes(x = year, y = gdpPercap)) + geom_point() +
  ggtitle("GDP per Capita vs. Year for United States") + xlab("Year") +
  ylab("GDP per Capita")

There’s been a steady increase with a few crashes here and there mostly due to irresponsible bankers.

Question 5 - let’s look at GDP % change for USA

First we’ll define a function, pctchange, to simplify things.

#let's define a function for percent change
#dplyr::lag() shifts the vector by one position, so the first result is NA
pctchange <- function(x) {
  pct <- 100 * ((x - dplyr::lag(x)) / dplyr::lag(x))
  return(pct)
}
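
The percent-change formula can be checked by hand on a tiny vector. A minimal base-R sketch, assuming a one-position shift equivalent to what lag() does in the function above; `shift_one` and `pctchange_base` are hypothetical names introduced only for this illustration:

```r
# Hypothetical helper: shift a vector one position to the right, padding with NA
shift_one <- function(x) c(NA, head(x, -1))

# Base-R variant of the percent-change formula: 100 * (x - previous) / previous
pctchange_base <- function(x) {
  prev <- shift_one(x)
  100 * (x - prev) / prev
}

# 100 -> 110 is +10%, 110 -> 99 is -10%; the first element has no predecessor
res <- pctchange_base(c(100, 110, 99))
print(res)
```

The leading NA is why the first year has to be dropped before taking a mean or median of the changes.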

We can use dplyr to calculate the gdp % change for 2007. There’s no data for 2006 for the USA, so we’ll average the change over the gap by dividing it by the number of years elapsed since the previous observation. This applies to any missing year, not just 2006, but it looks like the USA has data for every year aside from 2006.

df %>% filter(country == "United States") %>% 
  mutate(gdp_per_capita_change_pct = pctchange(gdpPercap)/
           (year - lag(year))) %>% 
  filter(year == 2007) %>% 
  select(country, year, gdp_per_capita_change_pct)
## # A tibble: 1 x 3
##         country  year gdp_per_capita_change_pct
##          <fctr> <int>                     <dbl>
## 1 United States  2007                  1.532914

We had 1.53% growth per year, averaged over 2005 to 2007. Not good.
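
Dividing the total percent change by the number of years elapsed gives a simple average, which slightly overstates the constant yearly rate when growth compounds. A minimal sketch comparing the two, assuming made-up values two years apart (the numbers are hypothetical, not the USA figures):

```r
# Hypothetical observations two years apart (not actual Gapminder values)
year  <- c(2005, 2007)
value <- c(100, 121)

years_elapsed <- diff(year)
total_pct     <- 100 * (value[2] - value[1]) / value[1]

# Simple average: total percent change divided by years elapsed
simple_avg_pct <- total_pct / years_elapsed

# Compound annual rate: the constant yearly rate that yields the same total
compound_pct <- 100 * ((value[2] / value[1])^(1 / years_elapsed) - 1)

print(c(simple = simple_avg_pct, compound = compound_pct))
```

Over a gap of only two years and growth of a few percent, the difference between the two is small, so the simple average used in this analysis is a reasonable shortcut.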

Question 6 - let’s look at USA GDP growth over the years

Here we’ll create a dataframe with the % change over the years so we can plot the data. We’ll omit 2007 to avoid the missing 2006 inflating the results and then we’ll take the mean and median of the growth. We’ll show a boxplot and then a linechart of the growth.

#Let's define a dataframe that we can plot and analyze for GDP % change
#we won't include 2007 because 2006 is missing. We could have chosen to take
#an average over the period for missing data like in Q5 if we wanted to.
USA_GDP_Change <- df %>% filter(country == "United States", year < 2007) %>% 
  mutate(gdp_per_capita_change_pct = pctchange(gdpPercap))

#We'll use summarise to find the mean and median through all of the years
#1950 is dropped because its % change is NA (there is no prior year)
USA_GDP_Change %>% filter(year > 1950) %>%
  summarise(mean = mean(gdp_per_capita_change_pct), median = median(gdp_per_capita_change_pct))
## # A tibble: 1 x 2
##       mean   median
##      <dbl>    <dbl>
## 1 2.176289 2.530872
ggplot(data = USA_GDP_Change, mapping = aes(x = country, 
y = gdp_per_capita_change_pct)) + geom_boxplot() + 
  ggtitle("Distribution for GDP % changes for USA")

ggplot(data = USA_GDP_Change, aes(x = year, 
y = gdp_per_capita_change_pct)) + geom_line() +
  ggtitle("GDP % Change Across Years") + xlab("Year") +
  ylab("GDP % Change")

Typically we’re at about 2.18% or 2.53% growth in the USA, depending on whether we use the mean or the median. From the boxplot, the interquartile range runs from about 1% to 3.7%.
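
The quartiles read off a boxplot can also be computed directly. A minimal sketch, assuming a short made-up vector of yearly percent changes with a leading NA like the one pctchange produces (the `growth` values are hypothetical):

```r
# Hypothetical yearly % changes; the first is NA because it has no prior year
growth <- c(NA, -2, 1, 2, 3, 4)

# na.rm = TRUE skips the leading NA instead of filtering the row out first
growth_mean   <- mean(growth, na.rm = TRUE)
growth_median <- median(growth, na.rm = TRUE)

# First and third quartiles bound the box in a boxplot; IQR() is their difference
quartiles  <- quantile(growth, probs = c(0.25, 0.75), na.rm = TRUE)
growth_iqr <- IQR(growth, na.rm = TRUE)

print(c(mean = growth_mean, median = growth_median, iqr = growth_iqr))
print(quartiles)
```

Applied to USA_GDP_Change$gdp_per_capita_change_pct, the same calls would give the quartiles numerically rather than by eye.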

We can see the crashes and recoveries in the x-y line plot. We haven’t had a >3.75% growth year since the early 80s, despite the push in IT.