The gapminder data come from the Gapminder site itself.
In the gapminder dataset, there are 6 variables and 1704 observations.
Below are the 6 variables used in the dataset including the purpose of each.
In this report, I will be analyzing the gapminder dataset, which contains data on life expectancy, GDP per capita and population by country. The first part of my analysis looks at the identifiers and the number of unique values each one takes. The second part explores the summary statistics of the dataset, including the mean, median and IQR. The last part looks at the relationships between variables and whether they are correlated. To conclude, I will pose three questions that came up during my analysis.
library(tidyverse)
library(gapminder)
library(knitr)
library(ggplot2)
nrow(gapminder)
## [1] 1704
ncol(gapminder)
## [1] 6
summary(gapminder)
## country continent year lifeExp
## Afghanistan: 12 Africa :624 Min. :1952 Min. :23.60
## Albania : 12 Americas:300 1st Qu.:1966 1st Qu.:48.20
## Algeria : 12 Asia :396 Median :1980 Median :60.71
## Angola : 12 Europe :360 Mean :1980 Mean :59.47
## Argentina : 12 Oceania : 24 3rd Qu.:1993 3rd Qu.:70.85
## Australia : 12 Max. :2007 Max. :82.60
## (Other) :1632
## pop gdpPercap
## Min. :6.001e+04 Min. : 241.2
## 1st Qu.:2.794e+06 1st Qu.: 1202.1
## Median :7.024e+06 Median : 3531.8
## Mean :2.960e+07 Mean : 7215.3
## 3rd Qu.:1.959e+07 3rd Qu.: 9325.5
## Max. :1.319e+09 Max. :113523.1
##
Identifiers: These are variables that are observed rather than measured and are used to identify which observation a metric belongs to. Examples from the gapminder dataset include country, continent and year.
Metrics: These are measured values that are quantitative in nature. Examples from the gapminder dataset include life expectancy, population and GDP per capita.
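The split between identifiers and metrics can be written out explicitly. This is a minimal sketch; the vectors `identifiers` and `metrics` are names I am introducing here for illustration, and the grouping itself is my own reading of the variables rather than something stored in the data.

```r
library(gapminder)

# Storage class of each column (first class only, for a compact view)
col_classes <- sapply(gapminder, function(x) class(x)[1])
col_classes

# Hand-written grouping used in this report (an assumption, not metadata)
identifiers <- c("country", "continent", "year")
metrics <- setdiff(names(gapminder), identifiers)
metrics
```

Note that the column class alone does not separate the two groups, since both `year` (an identifier) and `pop` (a metric) are stored as integers.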
unique_identifiers <- gapminder %>%
summarise(unique_country = n_distinct(country),
unique_continent = n_distinct(continent),
unique_year = n_distinct(year))
unique_identifiers %>%
kable()
| unique_country | unique_continent | unique_year |
|---|---|---|
| 142 | 5 | 12 |
unique_country <- gapminder %>%
select(country) %>%
unique()
head(unique_country) %>%
kable()
| country |
|---|
| Afghanistan |
| Albania |
| Algeria |
| Angola |
| Argentina |
| Australia |
unique_continent <- gapminder %>%
select(continent) %>%
unique()
unique_continent %>%
kable()
| continent |
|---|
| Asia |
| Europe |
| Africa |
| Americas |
| Oceania |
unique_year <- gapminder %>%
select(year) %>%
unique()
unique_year %>%
kable()
| year |
|---|
| 1952 |
| 1957 |
| 1962 |
| 1967 |
| 1972 |
| 1977 |
| 1982 |
| 1987 |
| 1992 |
| 1997 |
| 2002 |
| 2007 |
unique(unlist(lapply(gapminder, function (x) which(is.na(x)))))
## integer(0)
I believe that the country and continent identifiers should be analyzed together because they are connected: each country belongs to a continent, so the continent is determined by the country. The year identifier should be analyzed on its own because it is unrelated to the other identifiers. Also, there are no missing values in the dataset.
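The claim that country is nested within continent, and that nothing is missing, can be checked directly. The sketch below is one possible way to do it; the object names are my own and are not used elsewhere in the report.

```r
library(dplyr)
library(gapminder)

# Each country should appear under exactly one continent
continents_per_country <- gapminder %>%
  distinct(country, continent) %>%
  count(country, name = "n_continents")

all(continents_per_country$n_continents == 1)

# Number of distinct countries within each continent
countries_per_continent <- gapminder %>%
  distinct(country, continent) %>%
  count(continent, name = "n_countries")
countries_per_continent

# Missing values per column; every count is 0 if nothing is missing
missing_per_column <- colSums(is.na(gapminder))
missing_per_column
```

If the first check returns `TRUE`, grouping by continent never splits a country across groups, which is what makes the by-continent summaries later in the report meaningful.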
stats_lifeExp <- gapminder %>%
summarise(MIN_lifeExp = min(lifeExp),
Q1_lifeExp = quantile(lifeExp,0.25),
MED_lifeExp = median(lifeExp),
MEAN_lifeExp = mean(lifeExp),
Q3_lifeExp = quantile(lifeExp,0.75),
MAX_lifeExp = max(lifeExp),
SD_lifeExp = sd(lifeExp),
IQR_lifeExp = IQR(lifeExp))
prettyNum(stats_lifeExp, big.mark = ",", scientific = FALSE) %>%
kable()
| | x |
|---|---|
| MIN_lifeExp | 23.599 |
| Q1_lifeExp | 48.198 |
| MED_lifeExp | 60.7125 |
| MEAN_lifeExp | 59.47444 |
| Q3_lifeExp | 70.8455 |
| MAX_lifeExp | 82.603 |
| SD_lifeExp | 12.91711 |
| IQR_lifeExp | 22.6475 |
stats_pop <- gapminder %>%
summarise(MIN_pop = min(pop),
Q1_pop = quantile(pop,0.25),
MED_pop = median(pop),
MEAN_pop = mean(pop),
Q3_pop = quantile(pop,0.75),
MAX_pop = max(pop),
SD_pop = sd(pop),
IQR_pop = IQR(pop))
prettyNum(stats_pop, big.mark = ",", scientific = FALSE) %>%
kable()
| | x |
|---|---|
| MIN_pop | 60,011 |
| Q1_pop | 2,793,664 |
| MED_pop | 7,023,596 |
| MEAN_pop | 29,601,212 |
| Q3_pop | 19,585,222 |
| MAX_pop | 1,318,683,096 |
| SD_pop | 106,157,897 |
| IQR_pop | 16,791,558 |
stats_gdpPercap <- gapminder %>%
summarise(MIN_gdpPercap = min(gdpPercap),
Q1_gdpPercap = quantile(gdpPercap,0.25),
MED_gdpPercap = median(gdpPercap),
MEAN_gdpPercap = mean(gdpPercap),
Q3_gdpPercap = quantile(gdpPercap,0.75),
MAX_gdpPercap = max(gdpPercap),
SD_gdpPercap = sd(gdpPercap),
IQR_gdpPercap = IQR(gdpPercap))
prettyNum(stats_gdpPercap, big.mark = ",", scientific = FALSE) %>%
kable()
| | x |
|---|---|
| MIN_gdpPercap | 241.1659 |
| Q1_gdpPercap | 1,202.06 |
| MED_gdpPercap | 3,531.847 |
| MEAN_gdpPercap | 7,215.327 |
| Q3_gdpPercap | 9,325.462 |
| MAX_gdpPercap | 113,523.1 |
| SD_gdpPercap | 9,857.455 |
| IQR_gdpPercap | 8,123.402 |
stats_lifeExp2 <- gapminder %>%
group_by(continent) %>%
summarise(MIN_lifeExp = min(lifeExp),
Q1_lifeExp = quantile(lifeExp,0.25),
MED_lifeExp = median(lifeExp),
MEAN_lifeExp = mean(lifeExp),
Q3_lifeExp = quantile(lifeExp,0.75),
MAX_lifeExp = max(lifeExp),
SD_lifeExp = sd(lifeExp),
IQR_lifeExp = IQR(lifeExp))
stats_lifeExp2 %>%
kable()
| continent | MIN_lifeExp | Q1_lifeExp | MED_lifeExp | MEAN_lifeExp | Q3_lifeExp | MAX_lifeExp | SD_lifeExp | IQR_lifeExp |
|---|---|---|---|---|---|---|---|---|
| Africa | 23.599 | 42.37250 | 47.7920 | 48.86533 | 54.41150 | 76.442 | 9.150210 | 12.0390 |
| Americas | 37.579 | 58.41000 | 67.0480 | 64.65874 | 71.69950 | 80.653 | 9.345088 | 13.2895 |
| Asia | 28.801 | 51.42625 | 61.7915 | 60.06490 | 69.50525 | 82.603 | 11.864532 | 18.0790 |
| Europe | 43.585 | 69.57000 | 72.2410 | 71.90369 | 75.45050 | 81.757 | 5.433178 | 5.8805 |
| Oceania | 69.120 | 71.20500 | 73.6650 | 74.32621 | 77.55250 | 81.235 | 3.795611 | 6.3475 |
stats_pop2 <- gapminder %>%
group_by(continent) %>%
summarise(MIN_pop = min(pop),
Q1_pop = quantile(pop,0.25),
MED_pop = median(pop),
MEAN_pop = mean(pop),
Q3_pop = quantile(pop,0.75),
MAX_pop = max(pop),
SD_pop = sd(pop),
IQR_pop = IQR(pop))
stats_pop2 %>%
kable()
| continent | MIN_pop | Q1_pop | MED_pop | MEAN_pop | Q3_pop | MAX_pop | SD_pop | IQR_pop |
|---|---|---|---|---|---|---|---|---|
| Africa | 60011 | 1342075 | 4579311 | 9916003 | 10801490 | 135031164 | 15490923 | 9459415 |
| Americas | 662850 | 2962359 | 6227510 | 24504795 | 18340309 | 301139947 | 50979430 | 15377950 |
| Asia | 120447 | 3844393 | 14530831 | 77038722 | 46300348 | 1318683096 | 206885205 | 42455955 |
| Europe | 147962 | 4331500 | 8551125 | 17169765 | 21802867 | 82400996 | 20519438 | 17471367 |
| Oceania | 1994794 | 3199213 | 6403492 | 8874672 | 14351625 | 20434176 | 6506342 | 11152413 |
stats_gdpPercap2 <- gapminder %>%
group_by(continent) %>%
summarise(MIN_gdpPercap = min(gdpPercap),
Q1_gdpPercap = quantile(gdpPercap,0.25),
MED_gdpPercap = median(gdpPercap),
MEAN_gdpPercap = mean(gdpPercap),
Q3_gdpPercap = quantile(gdpPercap,0.75),
MAX_gdpPercap = max(gdpPercap),
SD_gdpPercap = sd(gdpPercap),
IQR_gdpPercap = IQR(gdpPercap))
stats_gdpPercap2 %>%
kable()
| continent | MIN_gdpPercap | Q1_gdpPercap | MED_gdpPercap | MEAN_gdpPercap | Q3_gdpPercap | MAX_gdpPercap | SD_gdpPercap | IQR_gdpPercap |
|---|---|---|---|---|---|---|---|---|
| Africa | 241.1659 | 761.247 | 1192.138 | 2193.755 | 2377.417 | 21951.21 | 2827.930 | 1616.170 |
| Americas | 1201.6372 | 3427.779 | 5465.510 | 7136.110 | 7830.210 | 42951.65 | 6396.764 | 4402.431 |
| Asia | 331.0000 | 1056.993 | 2646.787 | 7902.150 | 8549.256 | 113523.13 | 14045.373 | 7492.262 |
| Europe | 973.5332 | 7213.085 | 12081.749 | 14469.476 | 20461.386 | 49357.19 | 9355.213 | 13248.301 |
| Oceania | 10039.5956 | 14141.859 | 17983.304 | 18621.609 | 22214.117 | 34435.37 | 6358.983 | 8072.258 |
gapminder %>%
ggplot(aes(x = lifeExp)) +
geom_histogram(binwidth = 0.3) +
labs(title = "Distribution of Life Expectancy",
x = 'Life Expectancy',
y = 'Frequency')
gapminder %>%
ggplot(aes(x = pop)) +
geom_histogram(bins = 100) +
labs(title = "Distribution of Population",
x = 'Population',
y = 'Frequency')
gapminder %>%
ggplot(aes(x = gdpPercap)) +
geom_histogram(bins = 100) +
labs(title = "Distribution of GDP per Capita",
x = 'GDP per Capita',
y = 'Frequency')
gapminder %>%
ggplot(aes(x = as_factor(year), y = lifeExp)) +
geom_boxplot() +
facet_wrap(vars(continent), nrow = 1) +
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "Distribution of Life Expectancy",
x = 'Year',
y = 'Life Expectancy')
gapminder %>%
ggplot(aes(x = as_factor(year), y = pop)) +
geom_boxplot() +
facet_wrap(vars(continent), nrow = 1) +
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "Distribution of Population",
x = 'Year',
y = 'Population')
gapminder %>%
ggplot(aes(x = as_factor(year), y = gdpPercap)) +
geom_boxplot() +
facet_wrap(vars(continent), nrow = 1) +
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "Distribution of GDP per Capita",
x = 'Year',
y = 'GDP per Capita')
Life Expectancy: Looking at the boxplots of life expectancy grouped by continent, the distribution in Africa is fairly consistent except for a few points. The minimum recorded life expectancy in Africa was 23.599, which is far below the continent's overall average of 48.865. In addition, the distributions for Asia and Europe show a few outliers that drop well below the rest of the continent in the mid to late 1900s.
Population: Population is fairly stable within each continent except for Asia and the Americas. The boxplots for Asia show outliers far above the continent's average population in every recorded year. The outliers for the Americas are less extreme, but over quite a few years some countries recorded populations much higher than the continental average.
GDP per Capita: In the distribution of GDP per capita, the Asia boxplots show a few outliers from the 1950s to the 1970s with significantly higher GDP per capita than the rest of the continent. The Americas show the same pattern to a lesser extent.
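To put names to the outliers described above, the points beyond the upper whisker can be listed. This sketch assumes the usual boxplot rule of Q3 + 1.5 × IQR within each continent and year; `geom_boxplot()` computes its hinges slightly differently, so the flagged set may not match the plotted points exactly. The column `upper_fence` and the object names are my own.

```r
library(dplyr)
library(gapminder)

# Flag observations above Q3 + 1.5 * IQR within each continent and year
pop_outliers <- gapminder %>%
  group_by(continent, year) %>%
  mutate(upper_fence = quantile(pop, 0.75) + 1.5 * IQR(pop)) %>%
  ungroup() %>%
  filter(pop > upper_fence)

# Countries flagged most often, by continent
pop_outliers %>%
  count(continent, country, sort = TRUE)

# The single largest GDP per capita observation in the data
top_gdp <- gapminder %>%
  slice_max(gdpPercap, n = 1)
top_gdp
```

Swapping `pop` for `lifeExp` and testing against a lower fence (Q1 − 1.5 × IQR) would list the low life expectancy outliers in the same way.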
Overall, we can see trends in each of the metric variables, as life expectancy, population and GDP per capita are all generally increasing across all continents. For example, looking at the distribution of life expectancy, we can see a clear upward trend in Africa, the Americas, Asia and Europe. The one surprising distribution is the limited growth of population in Europe and Oceania. From the 1950s to the early 2000s, their population numbers have been fairly constant and have not grown significantly like those of the other continents. One thing I would like to look into further is how these metric variables are correlated with one another and whether there are any ties between them.
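The trends described above can also be read from numbers rather than from the boxplots. The following is a rough sketch, with `pop_growth` and `lifeExp_trend` as names I am introducing for illustration; it compares total population in the first and last recorded years and tracks median life expectancy per continent.

```r
library(dplyr)
library(gapminder)

# Total population per continent in 1952 and 2007, then the ratio.
# pop is stored as integer, so convert to numeric to avoid overflow in sum()
pop_growth <- gapminder %>%
  filter(year %in% c(1952, 2007)) %>%
  group_by(continent, year) %>%
  summarise(total_pop = sum(as.numeric(pop)), .groups = "drop") %>%
  group_by(continent) %>%
  summarise(growth_ratio = total_pop[year == 2007] / total_pop[year == 1952])
pop_growth

# Median life expectancy per continent and year
lifeExp_trend <- gapminder %>%
  group_by(continent, year) %>%
  summarise(median_lifeExp = median(lifeExp), .groups = "drop")
lifeExp_trend
```

A `growth_ratio` close to 1 would indicate a population that has barely changed between 1952 and 2007.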
gapminder %>%
select(gdpPercap, lifeExp) %>%
cor()
## gdpPercap lifeExp
## gdpPercap 1.0000000 0.5837062
## lifeExp 0.5837062 1.0000000
gapminder %>%
ggplot(aes(x = gdpPercap,y = lifeExp, color = continent))+
geom_point()+
facet_wrap(vars(year), nrow = 3) +
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "GDP per Capita vs Life Expectancy",
x = 'GDP per Capita',
y = 'Life expectancy')
One thing that is certainly striking about this chart is how low life expectancy is in Africa, and that it does not appear to catch up with the other continents even in the most recent panels (the data end in 2007). In addition, the continents with greater GDP per capita also have higher life expectancy, as is evident with Europe, Asia and the Americas. There seems to be a large gap between the majority of first world countries and third world countries. This is a problem in our society nowadays, as not everyone in the world has equal resources based on location, and this inevitably affects how much people earn in their lifetime and how long they are expected to live.
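The link between GDP per capita and life expectancy can be examined a little further. Since GDP per capita is strongly right-skewed (its mean of 7,215 is roughly twice its median of 3,532), a log scale may describe the relationship better than the raw values. The sketch below is one way to check this; using `log10()` and computing the correlation per year are my own choices, not part of the analysis above.

```r
library(dplyr)
library(gapminder)

# Pearson correlation on the raw scale (same value as reported earlier)
cor_raw <- cor(gapminder$gdpPercap, gapminder$lifeExp)

# Correlation after log-transforming GDP per capita
cor_log <- cor(log10(gapminder$gdpPercap), gapminder$lifeExp)

c(raw = cor_raw, log10 = cor_log)

# The same log-scale correlation, computed separately for each year
cor_by_year <- gapminder %>%
  group_by(year) %>%
  summarise(cor_log_gdp = cor(log10(gdpPercap), lifeExp))
cor_by_year
```

If the log-scale correlation is noticeably higher, adding `scale_x_log10()` to the scatter plot would likely make the pattern easier to see.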