Introduction

  • The gapminder data come from the Gapminder website and are loaded here through the gapminder R package.

  • In the gapminder dataset, there are 6 variables and 1704 observations.

  • The 6 variables in the dataset and the purpose of each are listed below.

    • country : Identifies the country each observation belongs to.
    • continent : Identifies the continent in which the country is located.
    • year : Identifies the year the statistics were recorded for each country.
    • lifeExp : Life expectancy at birth, in years, for the country in that year.
    • pop : Total population of the country.
    • gdpPercap : GDP per capita, the economic output of the country per person.

In this report, I analyze the gapminder dataset, which contains life expectancy, GDP per capita and population by country and year. The first part of the analysis looks at the identifiers and the number of unique values each one takes. The second part explores summary statistics for the metrics, including the mean, median and IQR. The last part looks at the relationships between variables and whether they are correlated. To conclude, I pose three questions that arose during the analysis.

library(tidyverse)
library(gapminder)
library(knitr)
library(ggplot2)
nrow(gapminder)
## [1] 1704
ncol(gapminder)
## [1] 6
summary(gapminder)
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 

Description of the Variables

  • Identifiers : These variables label each observation rather than measure anything. Together they identify which country, continent and year a set of metrics belongs to. Examples from the gapminder dataset include country, continent and year.

  • Metrics : These are measured values that are quantitative in nature. Examples from the gapminder dataset include life expectancy, population and GDP per capita.

Explore the Identifiers

  • Calculate the number of unique values for each identifier and list each of them.
unique_identifiers <- gapminder %>% 
  summarise(unique_country = n_distinct(country),
            unique_continent = n_distinct(continent),
            unique_year = n_distinct(year))

unique_identifiers %>% 
  kable()
unique_country  unique_continent  unique_year
           142                 5           12
unique_country <- gapminder %>% 
  select(country) %>% 
  unique()

head(unique_country) %>% 
  kable()
country
Afghanistan
Albania
Algeria
Angola
Argentina
Australia
unique_continent <- gapminder %>% 
  select(continent) %>% 
  unique()

unique_continent %>% 
  kable()
continent
Asia
Europe
Africa
Americas
Oceania
unique_year <- gapminder %>% 
  select(year) %>% 
  unique()

unique_year %>% 
  kable()
year
1952
1957
1962
1967
1972
1977
1982
1987
1992
1997
2002
2007
unique(unlist(lapply(gapminder, function (x) which(is.na(x)))))
## integer(0)

I believe that the country and continent identifiers should be analyzed together because they are connected: a country's continent is determined by its location. The year identifier should be analyzed on its own because it is unrelated to the other identifiers. The check above returns integer(0), so there are no missing values in the dataset.
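The claim that continent depends on country can be checked directly. The following is a minimal sketch, not part of the original analysis; the object names (continents_per_country, countries_per_continent, missing_per_column) are illustrative. It also counts missing values column by column as an alternative to the check above.

```r
library(dplyr)
library(gapminder)

# Count how many distinct continents each country appears under.
continents_per_country <- gapminder %>%
  group_by(country) %>%
  summarise(n_continents = n_distinct(continent), .groups = "drop")

# TRUE when every country maps to exactly one continent.
one_continent_each <- all(continents_per_country$n_continents == 1)
print(one_continent_each)

# Countries per continent, using one row per country.
countries_per_continent <- gapminder %>%
  distinct(country, continent) %>%
  count(continent, name = "n_countries")
print(countries_per_continent)

# Missing values per column; a vector of zeros means nothing is missing.
missing_per_column <- colSums(is.na(gapminder))
print(missing_per_column)
```

If the first check prints TRUE, continent adds no information beyond country, which supports analyzing the two identifiers together.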

Explore the Metrics

  • Summary Statistics for each metric variable
stats_lifeExp <- gapminder %>% 
  summarise(MIN_lifeExp = min(lifeExp),
            Q1_lifeExp = quantile(lifeExp,0.25),
            MED_lifeExp = median(lifeExp),
            MEAN_lifeExp = mean(lifeExp),
            Q3_lifeExp = quantile(lifeExp,0.75),
            MAX_lifeExp = max(lifeExp),
            SD_lifeExp = sd(lifeExp),
            IQR_lifeExp = IQR(lifeExp))

prettyNum(stats_lifeExp, big.mark = ",", scientific = FALSE) %>% 
  kable()
x
MIN_lifeExp 23.599
Q1_lifeExp 48.198
MED_lifeExp 60.7125
MEAN_lifeExp 59.47444
Q3_lifeExp 70.8455
MAX_lifeExp 82.603
SD_lifeExp 12.91711
IQR_lifeExp 22.6475
stats_pop <- gapminder %>% 
  summarise(MIN_pop = min(pop),
            Q1_pop = quantile(pop,0.25),
            MED_pop = median(pop),
            MEAN_pop = mean(pop),
            Q3_pop = quantile(pop,0.75),
            MAX_pop = max(pop),
            SD_pop = sd(pop),
            IQR_pop = IQR(pop))

prettyNum(stats_pop, big.mark = ",", scientific = FALSE) %>% 
  kable()
x
MIN_pop 60,011
Q1_pop 2,793,664
MED_pop 7,023,596
MEAN_pop 29,601,212
Q3_pop 19,585,222
MAX_pop 1,318,683,096
SD_pop 106,157,897
IQR_pop 16,791,558
stats_gdpPercap <- gapminder %>% 
  summarise(MIN_gdpPercap = min(gdpPercap),
            Q1_gdpPercap = quantile(gdpPercap,0.25),
            MED_gdpPercap = median(gdpPercap),
            MEAN_gdpPercap = mean(gdpPercap),
            Q3_gdpPercap = quantile(gdpPercap,0.75),
            MAX_gdpPercap = max(gdpPercap),
            SD_gdpPercap = sd(gdpPercap),
            IQR_gdpPercap = IQR(gdpPercap))

prettyNum(stats_gdpPercap, big.mark = ",", scientific = FALSE) %>% 
  kable()
x
MIN_gdpPercap 241.1659
Q1_gdpPercap 1,202.06
MED_gdpPercap 3,531.847
MEAN_gdpPercap 7,215.327
Q3_gdpPercap 9,325.462
MAX_gdpPercap 113,523.1
SD_gdpPercap 9,857.455
IQR_gdpPercap 8,123.402
  • Summary statistics grouped by continent
stats_lifeExp2 <- gapminder %>%
  group_by(continent) %>% 
  summarise(MIN_lifeExp = min(lifeExp),
            Q1_lifeExp = quantile(lifeExp,0.25),
            MED_lifeExp = median(lifeExp),
            MEAN_lifeExp = mean(lifeExp),
            Q3_lifeExp = quantile(lifeExp,0.75),
            MAX_lifeExp = max(lifeExp),
            SD_lifeExp = sd(lifeExp),
            IQR_lifeExp = IQR(lifeExp))

stats_lifeExp2 %>% 
  kable()
continent  MIN_lifeExp  Q1_lifeExp  MED_lifeExp  MEAN_lifeExp  Q3_lifeExp  MAX_lifeExp  SD_lifeExp  IQR_lifeExp
Africa          23.599    42.37250      47.7920      48.86533    54.41150       76.442    9.150210      12.0390
Americas        37.579    58.41000      67.0480      64.65874    71.69950       80.653    9.345088      13.2895
Asia            28.801    51.42625      61.7915      60.06490    69.50525       82.603   11.864532      18.0790
Europe          43.585    69.57000      72.2410      71.90369    75.45050       81.757    5.433178       5.8805
Oceania         69.120    71.20500      73.6650      74.32621    77.55250       81.235    3.795611       6.3475
stats_pop2 <- gapminder %>% 
  group_by(continent) %>% 
  summarise(MIN_pop = min(pop),
            Q1_pop = quantile(pop,0.25),
            MED_pop = median(pop),
            MEAN_pop = mean(pop),
            Q3_pop = quantile(pop,0.75),
            MAX_pop = max(pop),
            SD_pop = sd(pop),
            IQR_pop = IQR(pop))

stats_pop2 %>% 
  kable()
continent  MIN_pop   Q1_pop   MED_pop  MEAN_pop    Q3_pop     MAX_pop     SD_pop   IQR_pop
Africa       60011  1342075   4579311   9916003  10801490   135031164   15490923   9459415
Americas    662850  2962359   6227510  24504795  18340309   301139947   50979430  15377950
Asia        120447  3844393  14530831  77038722  46300348  1318683096  206885205  42455955
Europe      147962  4331500   8551125  17169765  21802867    82400996   20519438  17471367
Oceania    1994794  3199213   6403492   8874672  14351625    20434176    6506342  11152413
stats_gdpPercap2 <- gapminder %>% 
  group_by(continent) %>% 
  summarise(MIN_gdpPercap = min(gdpPercap),
            Q1_gdpPercap = quantile(gdpPercap,0.25),
            MED_gdpPercap = median(gdpPercap),
            MEAN_gdpPercap = mean(gdpPercap),
            Q3_gdpPercap = quantile(gdpPercap,0.75),
            MAX_gdpPercap = max(gdpPercap),
            SD_gdpPercap = sd(gdpPercap),
            IQR_gdpPercap = IQR(gdpPercap))

stats_gdpPercap2 %>% 
  kable()
continent  MIN_gdpPercap  Q1_gdpPercap  MED_gdpPercap  MEAN_gdpPercap  Q3_gdpPercap  MAX_gdpPercap  SD_gdpPercap  IQR_gdpPercap
Africa          241.1659       761.247       1192.138        2193.755      2377.417       21951.21      2827.930       1616.170
Americas       1201.6372      3427.779       5465.510        7136.110      7830.210       42951.65      6396.764       4402.431
Asia            331.0000      1056.993       2646.787        7902.150      8549.256      113523.13     14045.373       7492.262
Europe          973.5332      7213.085      12081.749       14469.476     20461.386       49357.19      9355.213      13248.301
Oceania       10039.5956     14141.859      17983.304       18621.609     22214.117       34435.37      6358.983       8072.258
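The summary code above repeats the same eight statistics for each metric. As a possible alternative, the sketch below reshapes the three metrics into long form and computes the statistics once per continent and metric. It is an illustration rather than the method used in this report, and the names metric_stats, metric and value are assumptions introduced here.

```r
library(dplyr)
library(tidyr)
library(gapminder)

metric_stats <- gapminder %>%
  # Stack the three metric columns into name/value pairs.
  pivot_longer(c(lifeExp, pop, gdpPercap),
               names_to = "metric", values_to = "value") %>%
  group_by(continent, metric) %>%
  # Same eight statistics as above, computed once for every group.
  summarise(MIN  = min(value),
            Q1   = quantile(value, 0.25),
            MED  = median(value),
            MEAN = mean(value),
            Q3   = quantile(value, 0.75),
            MAX  = max(value),
            SD   = sd(value),
            IQR  = IQR(value),
            .groups = "drop")

print(metric_stats)
```

The result has one row per continent and metric (15 rows), which can be filtered by metric before passing it to kable().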
  • Histograms of each metric variable
gapminder %>% 
  ggplot(aes(x = lifeExp)) +
  geom_histogram(binwidth = 0.3) +
  labs(title = "Distribution of Life Expectancy",
       x = 'Life Expectancy',
       y = 'Frequency')

gapminder %>% 
  ggplot(aes(x = pop)) +
  geom_histogram(bins = 100) +
  labs(title = "Distribution of Population",
       x = 'Population',
       y = 'Frequency')

gapminder %>% 
  ggplot(aes(x = gdpPercap)) +
  geom_histogram(bins = 100) +
  labs(title = "Distribution of GDP per Capita",
       x = 'GDP per Capita',
       y = 'Frequency')

  • Boxplots of each metric variable faceted by continent
gapminder %>% 
  ggplot(aes(x = as_factor(year), y = lifeExp)) +
  geom_boxplot() +
  facet_wrap(vars(continent), nrow = 1) +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Distribution of Life Expectancy",
       x = 'Year',
       y = 'Life Expectancy')

gapminder %>% 
  ggplot(aes(x = as_factor(year), y = pop)) +
  geom_boxplot() +
  facet_wrap(vars(continent), nrow = 1) +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Distribution of Population",
       x = 'Year',
       y = 'Population')

gapminder %>% 
  ggplot(aes(x = as_factor(year), y = gdpPercap)) +
  geom_boxplot() +
  facet_wrap(vars(continent), nrow = 1) +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "Distribution of GDP per Capita",
       x = 'Year',
       y = 'GDP per Capita')

  • For each metric variable, is there any evidence of outliers in the data overall?

Life Expectancy: In the boxplots of life expectancy by continent, the distribution in Africa is fairly consistent except for a few points. The minimum recorded life expectancy in Africa was 23.599, far below the continent's average of 48.865. The Asia and Europe panels also show a few outliers that fall well below the rest of their continent in the mid-to-late 1900s.

Population: Population is fairly tightly distributed within each continent except for Asia and the Americas. In the boxplots, Asia has significant outliers with populations much greater than the continent's average in every year recorded. The outliers in the Americas are less extreme, but in quite a few years some countries there also recorded populations well above the average.

GDP per Capita: There are a few outliers in the Asia boxplot: from the 1950s to the 1970s, some observations show a GDP per capita significantly higher than the continent's average. The Americas show the same pattern to a lesser extent.
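The outliers read off the boxplots can also be listed by name. The sketch below applies the usual 1.5 × IQR rule within each continent and year. The helper flag_outliers is a hypothetical function written for this illustration, and because it uses quantile() rather than the hinges that geom_boxplot() computes, a few borderline points may be classified differently from the plots.

```r
library(dplyr)
library(gapminder)

# Hypothetical helper: TRUE for values beyond 1.5 * IQR from the quartiles.
flag_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Countries flagged as population outliers within their continent and year,
# with the number of years in which each one is flagged.
pop_outliers <- gapminder %>%
  group_by(continent, year) %>%
  filter(flag_outliers(pop)) %>%
  ungroup() %>%
  count(continent, country, name = "n_years", sort = TRUE)

print(pop_outliers)
```

Replacing pop with lifeExp or gdpPercap in the filter() call gives the corresponding list for the other two metrics.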

  • What are the most striking features of these variables? What trends can you comment on? Are there particular observations that seem worth exploring further?

Overall, all three metrics trend upward: life expectancy, population and GDP per capita are generally increasing across all continents. For example, the distribution of life expectancy shows a clear upward trend in Africa, the Americas, Asia and Europe. The one surprising distribution is the limited population growth in Europe and Oceania: from the 1950s to the early 2000s, population there stayed fairly constant, without the significant growth seen in the other continents. One thing I would like to explore further is how these metrics are correlated with one another and whether there are any ties between them.
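These trends can be put into numbers by comparing the first and last years in the data. The following is a rough sketch under assumptions of my own: it summarises each continent by its median life expectancy and total population, and the names continent_change, lifeExp_gain and pop_ratio are illustrative. Population is converted to numeric before summing because continent totals can exceed R's integer range.

```r
library(dplyr)
library(tidyr)
library(gapminder)

continent_change <- gapminder %>%
  filter(year %in% c(1952, 2007)) %>%
  group_by(continent, year) %>%
  summarise(median_lifeExp = median(lifeExp),
            # as.numeric() avoids integer overflow in the continent totals.
            total_pop = sum(as.numeric(pop)),
            .groups = "drop") %>%
  # One row per continent, with a column for each statistic and year.
  pivot_wider(names_from = year,
              values_from = c(median_lifeExp, total_pop)) %>%
  mutate(lifeExp_gain = median_lifeExp_2007 - median_lifeExp_1952,
         pop_ratio = total_pop_2007 / total_pop_1952)

print(continent_change)
```

A pop_ratio close to 1 indicates little population growth over the period, so comparing this column across continents is one way to test the observation about Europe and Oceania.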

Explore the Relationship between Variables

  • Linear correlation between GDP per capita and life expectancy
gapminder %>% 
  select(gdpPercap, lifeExp) %>% 
  cor()
##           gdpPercap   lifeExp
## gdpPercap 1.0000000 0.5837062
## lifeExp   0.5837062 1.0000000
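
The coefficient above is a Pearson correlation on the raw scale, which measures only linear association. The sketch below compares it with two alternatives: the correlation after a log10 transform of GDP per capita, and the Spearman rank correlation. Treating the relationship as roughly log-linear is an assumption made for this illustration, not a result established in this report, and the object names are illustrative.

```r
library(dplyr)
library(gapminder)

# Pearson correlation on the raw scale (the value reported above).
cor_raw <- cor(gapminder$gdpPercap, gapminder$lifeExp)

# Pearson correlation after log-transforming GDP per capita.
cor_log <- cor(log10(gapminder$gdpPercap), gapminder$lifeExp)

# Spearman rank correlation, which only assumes a monotonic relationship.
cor_rank <- cor(gapminder$gdpPercap, gapminder$lifeExp, method = "spearman")

print(c(raw = cor_raw, log10 = cor_log, spearman = cor_rank))

# The same comparison within each year, since pooling all years mixes
# the cross-country pattern with the trend over time.
cor_by_year <- gapminder %>%
  group_by(year) %>%
  summarise(r_raw = cor(gdpPercap, lifeExp),
            r_log = cor(log10(gdpPercap), lifeExp),
            .groups = "drop")

print(cor_by_year)
```

If the log-scale and rank correlations come out noticeably higher than the raw one, that suggests the relationship is monotonic but not linear.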
  • Scatter plot of GDP per capita against life expectancy
gapminder %>% 
  ggplot(aes(x = gdpPercap,y = lifeExp, color = continent))+
  geom_point()+
  facet_wrap(vars(year), nrow = 3) +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title = "GDP per Capita vs Life Expectancy",
       x = 'GDP per Capita',
       y = 'Life expectancy')

  • What is striking about this chart? Describe your conclusions

One striking feature of this chart is how low life expectancy is in Africa compared with the other continents, a gap that is still visible in the most recent panels (the data end in 2007). In addition, the continents with greater GDP per capita also have higher life expectancy, as is evident for Europe, Asia and the Americas. There seems to be a large gap between most first world countries and third world countries. This is a problem in our society nowadays: not everyone in the world has equal resources, and location inevitably affects how much people earn in their lifetime and how long they are expected to live.

Conclusion

  1. What can we do to improve the standard of living in third world countries located in Africa, Asia and elsewhere?
  2. Using the current data visualizations, can we project where these metrics will be by 2050, and how will their landscape change?
  3. This analysis shows the population of Asia growing rapidly. Should we be concerned about the standard of living in Asia if the population gets too large, and will this eventually affect life expectancy there?