The purpose of this assignment is to perform some basic exploratory analysis in R. The dataset used for this assignment covers life expectancy, population, and GDP per capita for countries around the world from 1950 to 2007. The analysis shows clear differences among the continents in GDP per capita and, for the United States specifically, growth in GDP per capita in most years.
For this assignment, 3 different packages are required: gapminder, dplyr, and ggplot2. Those packages are loaded and described below.
## Load the necessary libraries ##
library(gapminder) ## Used to collect the data.
library(dplyr) ## Used for data manipulation
library(ggplot2) ## Used for data visualization
For this assignment, I will be using data from the gapminder library, which can be sourced from https://www.gapminder.org/data/. The data contain 6 variables: country, continent, year, life expectancy, population, and GDP per capita. Country, continent, and year identify where and when each observation was recorded. Life expectancy is the number of years a person is expected to live, measured at birth. Population is a measure of the total number of people living in each country, and GDP per capita is the total GDP of the country divided by its total population.
## Load the required data
data_gap <- gapminder_unfiltered
To investigate the structure of the data, I will use the str() function as demonstrated below.
## Structure of the data ##
str(data_gap)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3313 obs. of 6 variables:
## $ country : Factor w/ 187 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 6 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
As one can see, the dataset contains 3,313 observations of 6 variables. Currently, there are two categorical variables and four numerical variables. However, since “year” is not really a numerical quantity in this analysis, I will change it to a categorical variable.
## Change year to categorical variable and show the structure of the data
data_gap$year <- factor(data_gap$year)
str(data_gap)
## Classes 'tbl_df', 'tbl' and 'data.frame': 3313 obs. of 6 variables:
## $ country : Factor w/ 187 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 6 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : Factor w/ 58 levels "1950","1951",..: 3 8 13 18 23 28 33 38 43 48 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
We can see that “year” is now a categorical variable, and we are confident in the structure of the data. Next, we will use the code below to check for any missing values.
## Check for missing values ##
sum(is.na(data_gap))
## [1] 0
Fortunately, there are no missing values to deal with in this dataset.
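A single total can hide where missing values sit when they do exist. As a minimal sketch, the same check can be broken down by column with base R's colSums(). The small data frame `toy` and its values are hypothetical and only stand in for data_gap, which has no missing values.

```r
## Hypothetical toy data with two missing values (not from gapminder)
toy <- data.frame(
  country = c("A", "B", "C"),
  lifeExp = c(70.1, NA, 65.3),
  pop     = c(100L, 200L, NA)
)

## Count missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

## The overall total matches sum(is.na(toy))
total_na <- sum(na_per_column)
print(total_na)
```

Running colSums(is.na(data_gap)) on the real data should return zero for every column, consistent with the total of 0 above.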
Next, we will provide some basic summary statistics for the data using the following code.
## Summary statistics ##
summary(data_gap)
## country continent year lifeExp
## Czech Republic: 58 Africa : 637 2002 : 187 Min. :23.60
## Denmark : 58 Americas: 470 1997 : 184 1st Qu.:58.33
## Finland : 58 Asia : 578 1992 : 183 Median :69.61
## Iceland : 58 Europe :1302 2007 : 183 Mean :65.24
## Japan : 58 FSU : 139 1977 : 171 3rd Qu.:73.66
## Netherlands : 58 Oceania : 187 1982 : 171 Max. :82.67
## (Other) :2965 (Other):2234
## pop gdpPercap
## Min. :5.941e+04 Min. : 241.2
## 1st Qu.:2.680e+06 1st Qu.: 2505.3
## Median :7.560e+06 Median : 7825.8
## Mean :3.177e+07 Mean : 11313.8
## 3rd Qu.:1.961e+07 3rd Qu.: 17355.8
## Max. :1.319e+09 Max. :113523.1
##
While this does provide some interesting initial information on life expectancy, population, and GDP per capita, the function provides little information for the categorical variables. However, one interesting point to note is that “FSU” is listed as a continent in the data. Let’s investigate what this level represents.
## Investigate FSU ##
tmp <- data_gap %>%
  filter(continent == "FSU") %>%
  droplevels()
tmp$country %>% levels()
## [1] "Armenia" "Belarus" "Georgia" "Kazakhstan" "Latvia"
## [6] "Lithuania" "Russia" "Ukraine" "Uzbekistan"
Based on the above output, one can conclude that “FSU” represents the Former Soviet Union. While this appears to be a miscategorization tied to a particular time period, I will leave the data as it stands and will revisit it should it affect the end analysis. Other than “FSU”, the continents in the data are Africa, the Americas, Asia, Europe, and Oceania. From the output of str(), we can also see that the years in the data run from 1950 to 2007; most countries have observations in 5-year increments, while some have data for each individual year. Finally, we will compute a few basic summary statistics by continent: the number of countries, mean life expectancy, mean population, and mean GDP per capita.
## Other general summary statistics ##
data_gap %>%
  group_by(continent) %>%
  summarize(n_obs = n(), n_countries = n_distinct(country),
            avglifeexp = mean(lifeExp), avgpop = mean(pop),
            meanGDP = mean(gdpPercap))
## # A tibble: 6 × 6
## continent n_obs n_countries avglifeexp avgpop meanGDP
## <fctr> <int> <int> <dbl> <dbl> <dbl>
## 1 Africa 637 53 49.03680 9728850 2175.859
## 2 Americas 470 36 67.09195 39416728 10802.574
## 3 Asia 578 43 62.41587 95444180 10073.938
## 4 Europe 1302 35 72.72164 15315944 16551.178
## 5 FSU 139 9 68.84430 31793002 7326.686
## 6 Oceania 187 11 69.74691 5424172 14057.097
As one can surmise from the output above, Africa has the largest number of unique countries and the lowest life expectancy by far. Europe has the highest average life expectancy over this time, and Asia has the largest average population.
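One caveat for these continent averages: they are taken over observations, and the number of observations per country differs (Europe has 1,302 observations for 35 countries, Africa 637 for 53), so countries with yearly data carry more weight. A minimal sketch of the difference, using a hypothetical toy data frame rather than data_gap:

```r
## Hypothetical toy data: country "A" has four rows, country "B" has one
toy <- data.frame(
  country = c("A", "A", "A", "A", "B"),
  lifeExp = c(80, 80, 80, 80, 50)
)

## Mean over all observations: country "A" dominates
pooled_mean <- mean(toy$lifeExp)

## Mean of per-country means: each country counts once
country_means <- tapply(toy$lifeExp, toy$country, mean)
balanced_mean <- mean(country_means)

print(c(pooled = pooled_mean, balanced = balanced_mean))
```

Which of the two is appropriate depends on the question being asked; the table above uses the pooled version.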
## Problem 1: Distribution of GDP per Capita across all countries ##
data_gap %>%
  filter(year == "2007") %>%
  summarize(min_GDP = min(gdpPercap), max_GDP = max(gdpPercap),
            avg_GDP = mean(gdpPercap), med_GDP = median(gdpPercap),
            sd_GDP = sd(gdpPercap))
## # A tibble: 1 × 5
## min_GDP max_GDP avg_GDP med_GDP sd_GDP
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 277.5519 82010.98 12403.13 6873.262 13829.02
data_gap %>%
  filter(year == "2007") %>% ## Use the same 2007 filter as the summary statistics
  ggplot() +
  geom_histogram(mapping = aes(x = gdpPercap), binwidth = 5000)
As one can see from the summary statistics and the histogram, the distribution of GDP per capita is heavily concentrated near 0, with a long tail to the right as GDP per capita increases. A few outliers appear to be pulling the mean upward, so the median GDP per capita of $6,873.26 is the better measure of a typical country.
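To illustrate why the median is less sensitive than the mean in a right-skewed distribution, the sketch below uses a small hypothetical vector (the values are made up, not gapminder figures). The log10 transform at the end is one common way to inspect such data and is shown only as an option.

```r
## Hypothetical right-skewed values with one large outlier
gdp_toy <- c(500, 1000, 2000, 4000, 80000)

mean_gdp   <- mean(gdp_toy)    ## pulled upward by the outlier
median_gdp <- median(gdp_toy)  ## does not depend on how large the outlier is

## On a log10 scale the values are spread far more evenly
log_gdp <- log10(gdp_toy)

print(c(mean = mean_gdp, median = median_gdp))
print(round(log_gdp, 2))
```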
## Problem 2: Distribution of GDP per Capita by Continent ##
data_gap %>%
  filter(year == "2007") %>%
  group_by(continent) %>%
  summarize(min_GDP = min(gdpPercap), max_GDP = max(gdpPercap),
            avg_GDP = mean(gdpPercap), med_GDP = median(gdpPercap),
            sd_GDP = sd(gdpPercap))
## # A tibble: 6 × 6
## continent min_GDP max_GDP avg_GDP med_GDP sd_GDP
## <fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Africa 277.5519 13206.48 3091.230 1463.249 3583.240
## 2 Americas 1201.6372 42951.65 11940.902 9065.801 9542.837
## 3 Asia 944.0000 82010.98 15338.057 4889.250 18864.574
## 4 Europe 2604.7505 49357.19 24174.153 25885.565 11747.064
## 5 FSU 2211.1589 16666.51 9522.539 10273.774 5357.088
## 6 Oceania 1827.0966 34435.37 13156.979 5143.615 13150.012
data_gap %>%
  filter(year == "2007") %>%
  group_by(continent) %>%
  ggplot(mapping = aes(x = gdpPercap, y = ..density..)) + ## Multiple histogram plot
  geom_freqpoly(mapping = aes(colour = continent), binwidth = 5000)
data_gap %>%
  filter(year == "2007") %>%
  group_by(continent) %>%
  ggplot(mapping = aes(x = continent, y = gdpPercap)) + ## Boxplot
  geom_boxplot()
To answer this question, I first calculated some simple summary statistics for each continent’s GDP per capita. Africa contains the lowest single GDP per capita and has the lowest overall average. The Americas enjoy a significantly higher average GDP per capita, but with a minimum of about $1,202, there are still countries well below that average. Asia has the widest spread between its countries, with the highest overall GDP per capita but also a minimum of only $944. Europe and Oceania both have similar distributions with the main difference being that Europe has a much higher maximum GDP per capita.
I then created two charts to validate these conclusions. The first is a frequency polygon that shows the density of each continent’s countries at each GDP per capita level, and the second is a boxplot that gives a visual representation of several of the statistics listed above. As one can see, the graphs confirm what the individual statistics stated. Africa’s distribution is shifted close to 0 with a narrow spread. Asia is also clustered near Africa but has a bigger tail spreading toward the right. The Americas are shifted farther to the right than Asia with a smaller tail. Europe and Oceania are shifted even further to the right, but Europe is more narrowly clustered while Oceania has a larger spread.
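For reading the boxplot: by default, geom_boxplot() draws whiskers out to the most extreme points within 1.5 times the interquartile range of the box and plots anything beyond that as individual points. Base R's boxplot.stats() applies a similar 1.5 × IQR rule (its hinges can differ slightly from the quartiles ggplot2 uses), so it offers a quick way to list which values would be flagged. The vector below is hypothetical.

```r
## Hypothetical GDP per capita values for one continent
gdp_toy <- c(900, 1500, 2500, 4900, 6000, 9000, 82000)

stats <- boxplot.stats(gdp_toy)  ## default coef = 1.5

## Lower whisker, lower hinge, median, upper hinge, upper whisker
print(stats$stats)

## Values beyond the whiskers, which a boxplot draws as separate points
outliers <- stats$out
print(outliers)
```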
## Problem 3: Top 10 Countries with the largest GDP per capita in 2007 ##
data_gap %>%
  filter(year == "2007") %>%
  mutate(GDPRank = min_rank(desc(gdpPercap))) %>%
  filter(GDPRank <= 10) %>%
  arrange(desc(gdpPercap)) %>%
  ggplot(mapping = aes(x = reorder(country, -gdpPercap), y = gdpPercap)) +
  geom_bar(stat = "identity")
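Since the bar chart does not print the underlying values, a small table of the top countries can be handy alongside it. The sketch below shows the same ranking idea in base R on a hypothetical data frame (country names and values are invented). Note that min_rank() gives tied values the same rank, so a filter on rank can return more rows than the cutoff when ties occur, whereas head() on a sorted data frame always returns exactly n rows.

```r
## Hypothetical data: five countries, top 3 wanted
toy <- data.frame(
  country   = c("A", "B", "C", "D", "E"),
  gdpPercap = c(1200, 48000, 35000, 900, 41000)
)

## Sort descending by GDP per capita and keep the first three rows
top3 <- head(toy[order(-toy$gdpPercap), ], 3)
print(top3)
```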
## Problem 4: Plot the GDP per capita for your country of origin for all years available.
data_gap %>%
  filter(country == "United States") %>%
  ggplot(mapping = aes(x = year, y = gdpPercap, group = 1)) +
  geom_line()
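Because year was converted to a factor earlier, the x-axis here is discrete, which is why group = 1 is needed for geom_line() to connect the points. If a continuous axis is preferred, the factor labels can be converted back to numbers. As a sketch with a toy factor, note that as.integer() on a factor returns the internal level codes, not the years:

```r
## Toy factor mimicking the converted year column
year_f <- factor(c(1952, 1957, 1962))

## Pitfall: returns the level codes 1, 2, 3
codes <- as.integer(year_f)

## Convert via the labels to recover the actual years
years <- as.numeric(as.character(year_f))

print(codes)
print(years)
```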
## Problem 5: What was the percent growth (or decline) in GDP per capita in 2007 for your country of origin?
data_gap %>%
  filter(country == "United States") %>%
  mutate(gdpGrowth = ((gdpPercap - lag(gdpPercap)) / lag(gdpPercap))) %>%
  filter(year == "2007")
## # A tibble: 1 × 7
## country continent year lifeExp pop gdpPercap gdpGrowth
## <fctr> <fctr> <fctr> <dbl> <int> <dbl> <dbl>
## 1 United States Americas 2007 78.242 301139947 42951.65 0.03065828
In 2007, the GDP per capita in the United States increased 3.1%.
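The growth rate here is (current - previous) / previous, where lag() supplies the value from the previous row, so the result relies on the rows being ordered by year. A minimal check of that arithmetic in base R, on hypothetical values chosen so the answer is easy to verify by hand:

```r
## Hypothetical GDP per capita series, ordered by year
gdp_toy <- c(100, 110, 121)

## Previous value for each row; the first row has no predecessor
previous <- c(NA, head(gdp_toy, -1))

## Period-over-period growth rate
growth <- (gdp_toy - previous) / previous
print(growth)  ## NA for the first row, then 10% growth twice
```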
## Problem 6: What has been the historical growth (or decline) in GDP per capita for your country?
data_gap %>%
  filter(country == "United States") %>%
  mutate(gdpGrowth = ((gdpPercap - lag(gdpPercap)) / lag(gdpPercap))) %>%
  ggplot(mapping = aes(x = year, y = gdpGrowth, group = 1)) +
  geom_line()
From the graph above, one can see that since 1952, GDP per capita in the United States has declined in only 8 years; in the remaining years, it grew at a rate between 0% and 5%.
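Reading counts off a line chart is error-prone, so the number of declining years can also be computed directly. The sketch below does this on a hypothetical series and adds the compound annual growth rate as a single summary figure; neither the values nor the variable names come from the analysis above.

```r
## Hypothetical GDP per capita series, one value per year
gdp_toy <- c(100, 110, 99, 121)

## Year-over-year growth rates
growth <- diff(gdp_toy) / head(gdp_toy, -1)

## Number of years in which GDP per capita declined
n_declines <- sum(growth < 0)

## Compound annual growth rate over the whole span
n_periods <- length(gdp_toy) - 1
cagr <- (gdp_toy[length(gdp_toy)] / gdp_toy[1])^(1 / n_periods) - 1

print(n_declines)
print(round(cagr, 4))
```

Applied to the United States series, the same two lines for growth and n_declines would give an exact count to compare against the chart.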