We will use Gapminder data to analyze growth in GDP per capita across countries and continents, and then home in on the USA, my home country. We’ll use dplyr for data manipulation, and ggplot2 and base R for graphing.
library(gapminder)
library(dplyr)
library(ggplot2)
The variables contained in this dataset are described below:
GDP per capita is expressed in 2005 international dollars.
The gapminder dataset is derived from gapminder_unfiltered, which is available in the gapminder package specified in the Required Packages section. The Gapminder organization seeks to improve understanding of global development. More information about the organization is available at <www.gapminder.org/ignorance>.
The dataset, gapminder_unfiltered, includes 3313 observations of 6 variables. The datatypes for each variable are below:
df <- gapminder_unfiltered
str(df)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3313 obs. of 6 variables:
## $ country : Factor w/ 187 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 6 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
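There are 187 countries and 6 continent groupings included (the continent variable lists FSU alongside the geographic continents). The countries and continents are listed below.
Additionally, as we’ll see, the table includes data from 1950 to 2007, although not every year is present for every country.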
unique(df$country)
## [1] Afghanistan Albania
## [3] Algeria Angola
## [5] Argentina Armenia
## [7] Aruba Australia
## [9] Austria Azerbaijan
## [11] Bahamas Bahrain
## [13] Bangladesh Barbados
## [15] Belarus Belgium
## [17] Belize Benin
## [19] Bhutan Bolivia
## [21] Bosnia and Herzegovina Botswana
## [23] Brazil Brunei
## [25] Bulgaria Burkina Faso
## [27] Burundi Cambodia
## [29] Cameroon Canada
## [31] Cape Verde Central African Republic
## [33] Chad Chile
## [35] China Colombia
## [37] Comoros Congo, Dem. Rep.
## [39] Congo, Rep. Costa Rica
## [41] Cote d'Ivoire Croatia
## [43] Cuba Cyprus
## [45] Czech Republic Denmark
## [47] Djibouti Dominican Republic
## [49] Ecuador Egypt
## [51] El Salvador Equatorial Guinea
## [53] Eritrea Estonia
## [55] Ethiopia Fiji
## [57] Finland France
## [59] French Guiana French Polynesia
## [61] Gabon Gambia
## [63] Georgia Germany
## [65] Ghana Greece
## [67] Grenada Guadeloupe
## [69] Guatemala Guinea
## [71] Guinea-Bissau Guyana
## [73] Haiti Honduras
## [75] Hong Kong, China Hungary
## [77] Iceland India
## [79] Indonesia Iran
## [81] Iraq Ireland
## [83] Israel Italy
## [85] Jamaica Japan
## [87] Jordan Kazakhstan
## [89] Kenya Korea, Dem. Rep.
## [91] Korea, Rep. Kuwait
## [93] Latvia Lebanon
## [95] Lesotho Liberia
## [97] Libya Lithuania
## [99] Luxembourg Macao, China
## [101] Madagascar Malawi
## [103] Malaysia Maldives
## [105] Mali Malta
## [107] Martinique Mauritania
## [109] Mauritius Mexico
## [111] Micronesia, Fed. Sts. Moldova
## [113] Mongolia Montenegro
## [115] Morocco Mozambique
## [117] Myanmar Namibia
## [119] Nepal Netherlands
## [121] Netherlands Antilles New Caledonia
## [123] New Zealand Nicaragua
## [125] Niger Nigeria
## [127] Norway Oman
## [129] Pakistan Panama
## [131] Papua New Guinea Paraguay
## [133] Peru Philippines
## [135] Poland Portugal
## [137] Puerto Rico Qatar
## [139] Reunion Romania
## [141] Russia Rwanda
## [143] Samoa Sao Tome and Principe
## [145] Saudi Arabia Senegal
## [147] Serbia Sierra Leone
## [149] Singapore Slovak Republic
## [151] Slovenia Solomon Islands
## [153] Somalia South Africa
## [155] Spain Sri Lanka
## [157] Sudan Suriname
## [159] Swaziland Sweden
## [161] Switzerland Syria
## [163] Taiwan Tajikistan
## [165] Tanzania Thailand
## [167] Timor-Leste Togo
## [169] Tonga Trinidad and Tobago
## [171] Tunisia Turkey
## [173] Turkmenistan Uganda
## [175] Ukraine United Arab Emirates
## [177] United Kingdom United States
## [179] Uruguay Uzbekistan
## [181] Vanuatu Venezuela
## [183] Vietnam West Bank and Gaza
## [185] Yemen, Rep. Zambia
## [187] Zimbabwe
## 187 Levels: Afghanistan Albania Algeria Angola Argentina Armenia ... Zimbabwe
unique(df$continent)
## [1] Asia Europe Africa Americas FSU Oceania
## Levels: Africa Americas Asia Europe FSU Oceania
unique(df$year)
## [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007 1950 1951
## [15] 1953 1954 1955 1956 1958 1959 1960 1961 1963 1964 1965 1966 1968 1969
## [29] 1970 1971 1973 1974 1975 1976 1978 1979 1980 1981 1983 1984 1985 1986
## [43] 1988 1989 1990 1991 1993 1994 1995 1996 1998 1999 2000 2001 2003 2004
## [57] 2005 2006
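There are no missing (NA) values in this dataset, although some country-year rows are absent altogether (the USA, for example, has no row for 2006).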
sum(is.na(df))
## [1] 0
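Note that `sum(is.na(df))` only counts NA cells, so it cannot see years for which a country has no row at all. Here is a minimal base R sketch for spotting such gaps. It assumes the full range is 1950 to 2007, as printed by `unique(df$year)`. The helper name `missing_years` and the example vector are illustrative, not part of the gapminder package.

```r
# Hypothetical helper: which years in the expected range have no row?
missing_years <- function(years_present, expected = 1950:2007) {
  setdiff(expected, years_present)
}

# Illustrative input: every year from 1950 to 2007 except 2006
example_years <- c(1950:2005, 2007)
gaps <- missing_years(example_years)
print(gaps)
```

On the real data this could be called as `missing_years(df$year[df$country == "United States"])`.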
We’ll create a dataframe specifically for 2007, run a summary on it, and plot a histogram to check the distribution. The distribution is clearly right-skewed: the median GDP per capita is about 6,900 while the mean is about 12,400, so half of the countries sit below roughly 6,900. At the top we find Qatar with its oil money and Macao with its gambling money!
#let's create a dataframe for 2007 since we'll be using this year a bit
df07 <- df %>% filter(year == 2007)
summary(df07$gdpPercap)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 277.6 2147.0 6873.0 12400.0 19000.0 82010.0
hist(df07$gdpPercap, breaks = 20)
#We can use arrange to see the richest countries on a per capita basis
head(df07 %>% arrange(desc(gdpPercap)))
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Qatar Asia 2007 75.588 907229 82010.98
## 2 Macao, China Asia 2007 80.718 456989 54589.82
## 3 Norway Europe 2007 80.196 4627926 49357.19
## 4 Brunei Asia 2007 77.118 386511 48014.59
## 5 Kuwait Asia 2007 77.588 2505559 47306.99
## 6 Singapore Asia 2007 79.972 4553009 47143.18
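The skew in the 2007 distribution can also be checked numerically instead of read off the histogram. The sketch below uses made-up values, and `share_below` is a hypothetical helper name. It computes the share of values under a threshold and compares the mean to the median, where a mean well above the median suggests a right-skewed distribution.

```r
# Hypothetical helper: fraction of values strictly below a threshold
share_below <- function(x, threshold) {
  mean(x < threshold)
}

# Made-up, right-skewed values for illustration only (not Gapminder data)
toy_gdp <- c(500, 1500, 4000, 9000, 30000, 80000)

below_5000 <- share_below(toy_gdp, 5000)          # 3 of the 6 values
skewed_right <- mean(toy_gdp) > median(toy_gdp)   # mean pulled up by the tail
print(below_5000)
print(skewed_right)
```

Calling `share_below(df07$gdpPercap, 5000)` would give the actual share of countries under 5000 in 2007.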
We will group the 2007 data by continent to see how GDP per capita is distributed within each one. Europe isn’t doing so poorly. The box plot illustrates the distribution for each continent.
df07 %>% group_by(continent) %>% summarise(mean = mean(gdpPercap),
median = median(gdpPercap), min = min(gdpPercap), max = max(gdpPercap))
## # A tibble: 6 x 5
## continent mean median min max
## <fctr> <dbl> <dbl> <dbl> <dbl>
## 1 Africa 3091.230 1463.249 277.5519 13206.48
## 2 Americas 11940.902 9065.801 1201.6372 42951.65
## 3 Asia 15338.057 4889.250 944.0000 82010.98
## 4 Europe 24174.153 25885.565 2604.7505 49357.19
## 5 FSU 9522.539 10273.774 2211.1589 16666.51
## 6 Oceania 13156.979 5143.615 1827.0966 34435.37
df07cont <- df07 %>% group_by(continent)
#refer to columns by name inside aes() rather than with df07cont$...
ggplot(data = df07cont, mapping = aes(x = continent,
y = gdpPercap)) + geom_boxplot() +
ggtitle("Distribution of GDP per capita by continent")
Some interesting observations: Europe is substantially higher than all other continents. Asia is also very spread out, with a few countries at the top of the overall distribution.
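One rough way to put a number on "spread out" is the ratio of the richest country to the median country within each continent. This ratio is an illustrative choice, not a standard dispersion measure. The sketch below uses the median and max values copied from the 2007 per-continent summary table.

```r
# Values copied from the 2007 per-continent summary table
cont <- data.frame(
  continent = c("Africa", "Americas", "Asia", "Europe", "FSU", "Oceania"),
  median = c(1463.249, 9065.801, 4889.250, 25885.565, 10273.774, 5143.615),
  max = c(13206.48, 42951.65, 82010.98, 49357.19, 16666.51, 34435.37),
  stringsAsFactors = FALSE
)

# Ratio of the richest country to the median country in each group
cont$max_to_median <- cont$max / cont$median

# Sort so the most spread-out group comes first
cont <- cont[order(-cont$max_to_median), ]
print(cont)
most_spread <- cont$continent[1]
```

By this measure Asia comes out on top, with its richest country at roughly 17 times its median, while Europe and FSU are the most compact.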
We’ll sort the 2007 dataframe by GDP per capita and keep the top 10. We already commented on Qatar.
df07 %>% arrange(desc(gdpPercap)) %>% top_n(n = 10, wt = gdpPercap)
## # A tibble: 10 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Qatar Asia 2007 75.588 907229 82010.98
## 2 Macao, China Asia 2007 80.718 456989 54589.82
## 3 Norway Europe 2007 80.196 4627926 49357.19
## 4 Brunei Asia 2007 77.118 386511 48014.59
## 5 Kuwait Asia 2007 77.588 2505559 47306.99
## 6 Singapore Asia 2007 79.972 4553009 47143.18
## 7 United States Americas 2007 78.242 301139947 42951.65
## 8 Ireland Europe 2007 78.885 4109086 40676.00
## 9 Hong Kong, China Asia 2007 82.208 6980412 39724.98
## 10 Switzerland Europe 2007 81.701 7554661 37506.42
Another note is that the lifeExp is quite high in these rich countries. Correlation or causation?
We’ll plot the GDP per Capita over the years for the USA to see growth and declines. I wonder what happened in 1981…
dfUSA <- df %>% filter(country == 'United States') %>% select(gdpPercap, year)
ggplot(data = dfUSA, aes(x = year, y = gdpPercap)) + geom_point() +
ggtitle("GDP per Capita vs. Year for United States") + xlab("Year") +
ylab("GDP per Capita")
There’s been a steady increase, with a few crashes here and there, mostly due to irresponsible bankers.
First we’ll define a function, pctchange, to simplify things.
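#let's define a function for percent change between consecutive rows
pctchange <- function(x) {
  #dplyr::lag shifts x by one position, so the first result is NA
  pct <- 100 * ((x - dplyr::lag(x)) / dplyr::lag(x))
  return(pct)
}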
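To see what this calculation returns, here is a base R stand-in that does the same shift-and-compare without dplyr. The name `pctchange_base` and the input series are made up for illustration.

```r
# Base R stand-in for pctchange(): shift x by one position, then compare
pctchange_base <- function(x) {
  previous <- c(NA, head(x, -1))   # same one-step shift dplyr::lag(x) performs
  100 * (x - previous) / previous
}

# Made-up series that grows 10% at each step
toy <- c(100, 110, 121)
changes <- pctchange_base(toy)
print(changes)
```

The first element is NA because there is no earlier value to compare against, which is why the first year has to be dropped before taking a mean or median.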
We can use dplyr to calculate the GDP per capita % change for 2007. There’s no data for 2006, so we’ll divide the change since the last available year by the number of years elapsed, which gives a simple average per year. This applies to any missing year, not just 2006, but it looks like the USA has data for every year aside from 2006.
df %>% filter(country == "United States") %>%
arrange(year) %>%
mutate(gdp_per_capita_change_pct = pctchange(gdpPercap)/
(year - lag(year))) %>%
filter(year == 2007) %>%
select(country, year, gdp_per_capita_change_pct)
## # A tibble: 1 x 3
## country year gdp_per_capita_change_pct
## <fctr> <int> <dbl>
## 1 United States 2007 1.532914
We had 1.53% growth per year, using a simple average over 2005 to 2007. Not good.
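Dividing the total change by the number of years gives a simple average. An alternative, not used in this analysis, is a compound annual growth rate: the constant yearly rate that would produce the same total change. This sketch compares the two, assuming a two-year gap (2005 to 2007) and recovering the total change from the 1.532914 figure in the output.

```r
# Total % change over the two-year gap, recovered from the reported figure
years_elapsed <- 2
total_pct <- 1.532914 * years_elapsed

# Simple average: total change divided by the number of years
simple_rate <- total_pct / years_elapsed

# Compound annual growth rate: the constant yearly rate that
# reproduces the same total change over the same period
compound_rate <- 100 * ((1 + total_pct / 100)^(1 / years_elapsed) - 1)

print(c(simple = simple_rate, compound = compound_rate))
```

Over such a short gap the two differ only slightly (the compound rate is about 1.52%), so the simple average is a reasonable shortcut here.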
Here we’ll create a dataframe with the % change over the years so we can plot the data. We’ll omit 2007 to avoid the missing 2006 inflating the results and then we’ll take the mean and median of the growth. We’ll show a boxplot and then a linechart of the growth.
#Let's define a dataframe that we can plot and analyze for GDP % change
#we won't include 2007 because 2006 is missing. We could have chosen to take
#an average over the period for missing data, as in the 2007 calculation, if we wanted to.
USA_GDP_Change <- df %>% filter(country == "United States", year < 2007) %>%
mutate(gdp_per_capita_change_pct = pctchange(gdpPercap))
#We'll use summarise to find the mean and median through all of the years
#(1950 is dropped because it has no previous year, so its change is NA)
USA_GDP_Change %>% filter(year > 1950) %>%
summarise(mean = mean(gdp_per_capita_change_pct), median = median(gdp_per_capita_change_pct))
## # A tibble: 1 x 2
## mean median
## <dbl> <dbl>
## 1 2.176289 2.530872
ggplot(data = USA_GDP_Change, mapping = aes(x = country,
y = gdp_per_capita_change_pct)) + geom_boxplot() +
ggtitle("Distribution for GDP % changes for USA")
ggplot(data = USA_GDP_Change, aes(x = year,
y = gdp_per_capita_change_pct)) + geom_line() +
ggtitle("GDP % Change Across Years") + xlab("Year") +
ylab("GDP % Change")
Typical growth in the USA is about 2.18% per year using the mean, or 2.53% using the median. From the boxplot, the interquartile range is roughly 1% to 3.7%.
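The interquartile range can also be computed directly with base R's `quantile()` and `IQR()` instead of being read off the boxplot. The growth rates below are made up for illustration.

```r
# Made-up yearly growth rates in percent, for illustration only
toy_growth <- c(-2, 1, 2, 3, 4, 7)

# First and third quartiles (R's default quantile method, type 7)
quartiles <- quantile(toy_growth, probs = c(0.25, 0.75))

# Width of the interquartile range
spread <- IQR(toy_growth)

print(quartiles)
print(spread)
```

On the real data, `quantile(USA_GDP_Change$gdp_per_capita_change_pct, c(0.25, 0.75), na.rm = TRUE)` would do the same. The `na.rm = TRUE` is needed because the first year's change is NA.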
We can see the crashes and recoveries in the line plot. We haven’t had a year with more than 3.75% growth since the early 80s, despite the push in IT.