1 Introduction

More than at any other time in history, humanity has made remarkable progress in science and technology. The twenty-first century has already seen advances ranging from the rapid development and deployment of COVID-19 vaccines to virtual reality commerce in the metaverse, where avatars in our likeness meet in real time with business associates across the world. Despite these advancements, much of humanity has been left behind. The most basic human needs are not being met: clean water, sanitary living conditions, nutritious food, health care, and even COVID-19 vaccinations are not accessible to most of the world.1

Most of us are so unaware of what occurs within our own neighborhoods that a lack of knowledge of the world beyond our own zip codes is commonplace. Facts appear murky and vague.

For this project, I therefore draw on the Gapminder database of global macroeconomic and public health indicators to explore and analyse the progression of 142 countries on 5 continents from 1952 to 2007, in 5-year increments. Specifically, I provide an overview of the dataset and then focus on two key indicators of a country’s wealth: life expectancy and GDP per capita (gross domestic product per capita).


2 Data

2.0.1 About the data

Data for this project originates from the Gapminder Foundation, a Swedish-based non-profit whose mission “is to fight devastating ignorance with a fact-based worldview everyone can understand.”2 The dataset is a subset of the Gapminder database and is imported from the gapminder package in R. The data was collected over a 55-year period from 1952 to 2007, in 5-year increments. It comprises key macroeconomic and public health indicators for 142 countries on 5 continents: GDP per capita (gross domestic product per capita), life expectancy from birth, and population.


2.0.2 Import and Tidy the data

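  • The Gapminder dataset is imported into R for cleaning, wrangling, exploration and analysis.
# Load the packages used throughout this report and import the Gapminder data
library(tidyverse)  # dplyr, tidyr, ggplot2 and the %>% pipe
library(skimr)      # skim() and focus()
library(gapminder)
data(gapminder)
view(gapminder)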

To learn about the dataset, we take a snapshot of it: column names, data dimensions, and a statistical summary consisting of the minimum, maximum, median, mean, and first and third quartiles.

# summary(), colnames() and dim() are used to describe the dataset
summary(gapminder)
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 
colnames(gapminder)
## [1] "country"   "continent" "year"      "lifeExp"   "pop"       "gdpPercap"
dim(gapminder)
## [1] 1704    6

The dataset contains 1704 observations and 6 variables.


Since this is a real-world dataset it may suffer from the problem of missing data. I use the skim and focus functions to ascertain if this is the case.

# skim() provides a snapshot of the dataset: dimensions, column types, missing values and means.
gapminder %>%
  skim() %>%
  focus(n_missing, numeric.mean)
Data summary
Name                      Piped data
Number of rows            1704
Number of columns         6
_______________________
Column type frequency:
  factor                  2
  numeric                 4
________________________
Group variables           None

Variable type: factor

skim_variable   n_missing
country                 0
continent               0

Variable type: numeric

skim_variable   n_missing          mean
year                    0       1979.50
lifeExp                 0         59.47
pop                     0   29601212.32
gdpPercap               0       7215.33

Neither the factor variables nor the numeric variables contain missing values.


For easier readability and more insight into the data, it is rendered as an interactive table.

# datatable() from the DT package renders the data as an interactive table
library(DT)
datatable(data = gapminder,
          rownames = FALSE,
          filter ="top",
          options = list(autoWidth = TRUE))

To gain further insight I look at the first five rows, and last five rows of the dataset.

#head displays the first five rows of the data set 
head(gapminder,5)
## # A tibble: 5 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
# tail displays the last five rows of data set

tail(gapminder,5)
## # A tibble: 5 x 6
##   country  continent  year lifeExp      pop gdpPercap
##   <fct>    <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Zimbabwe Africa     1987    62.4  9216418      706.
## 2 Zimbabwe Africa     1992    60.4 10704340      693.
## 3 Zimbabwe Africa     1997    46.8 11404948      792.
## 4 Zimbabwe Africa     2002    40.0 11926563      672.
## 5 Zimbabwe Africa     2007    43.5 12311143      470.

We see that each observation of Gapminder represents a unique pair of a country and a year in 5 year increments. For instance, the first observation represents country statistics for Afghanistan in the year 1952, the second for Afghanistan in 1957 and so on.

For each combination of country and year, the dataset also contains variables describing the country’s demographics over time: the continent (for example, Asia), the life expectancy in years, the population, and the GDP per capita. GDP per capita measures a country’s total economic output per person. It is calculated by dividing the GDP of a country by its total population, and it is a common measure of a country’s wealth.10
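The division described above can be checked with a minimal sketch. It assumes the figures for Afghanistan in 1952 that are printed in this report (the population from the head() output and the total GDP from the totalGDP column derived in the Analysis section); the variable names are illustrative only.

```r
# Figures for Afghanistan in 1952, copied from the output shown in this report
total_gdp  <- 6567086330   # total GDP (the derived totalGDP column)
population <- 8425333      # total population (pop)

# GDP per capita = total GDP / total population
gdp_per_capita <- total_gdp / population
round(gdp_per_capita)
```

The rounded result should agree with the gdpPercap value of about 779 shown for that row.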

Each variable has one consistent data type. Some are numeric, such as life expectancy, population, year and GDP per capita, whereas others are factors (categorical), such as continent and country.


Tidying the data

Here, I rename the variables year, pop and gdpPercap for better clarity and insight.

# Rename variables for more clarity and insight
gapminder_rename <- gapminder %>%
  rename(five_year_increment = year,
         total_country_population = pop,
         gdpPerCap_US_Dollars = gdpPercap)

I check again for any missing data within the renamed dataset.

#checking for missing data
gapminder_rename %>%
  is.na() %>% 
  sum()
## [1] 0
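The single total above does not say which column a missing value would be in. As a hedged sketch of a per-column check, the toy data frame below copies two rows from the head() output and deliberately blanks one value; the NA is hypothetical and does not occur in the real Gapminder data.

```r
# Two rows copied from the head() output; the NA is inserted on purpose
toy <- data.frame(
  country = c("Afghanistan", "Afghanistan"),
  year    = c(1952L, 1957L),
  lifeExp = c(28.8, NA)   # hypothetical gap, not present in the real data
)

# is.na() flags each cell; colSums() counts the flags column by column
missing_per_column <- colSums(is.na(toy))
missing_per_column
```

Applied to gapminder_rename, the same call would be expected to return 0 for every column, consistent with the total of 0 above.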

3 Analysis

Now that the Gapminder data has been imported, cleaned and tidied, it can be analysed. I begin with general questions of the data, such as the median life expectancy and the median GDP per capita of each continent from 1952 to 2007.

The variables I focus on are:

  • lifeExp Life Expectancy from Birth
  • gdpPerCap_US_Dollars GDP per capita (US $)
  • total_country_population Total population of each country

The aggregate function tells us the median life expectancy of each continent from 1952-2007.

# lifeExp disaggregated with continent for median lifeExp of continents
aggregate(lifeExp ~ continent, gapminder_rename, median)
##   continent lifeExp
## 1    Africa 47.7920
## 2  Americas 67.0480
## 3      Asia 61.7915
## 4    Europe 72.2410
## 5   Oceania 73.6650

We see that the African continent has the shortest median life expectancy, at about 47.8 years, whereas Oceania has the longest, at about 73.7 years.
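To make the formula interface of aggregate() more concrete, here is a small sketch on the ten rows printed by head() and tail() earlier. Because those rows cover only two countries, the grouping variable here is country rather than continent; that substitution is an assumption made purely for illustration.

```r
# Life expectancy values copied from the head() and tail() output
subset_rows <- data.frame(
  country = rep(c("Afghanistan", "Zimbabwe"), each = 5),
  lifeExp = c(28.8, 30.3, 32.0, 34.0, 36.1,
              62.4, 60.4, 46.8, 40.0, 43.5)
)

# "lifeExp ~ country" reads as: summarise lifeExp within each level of country
medians <- aggregate(lifeExp ~ country, data = subset_rows, FUN = median)
medians
```

Swapping FUN = median for mean, min or max changes the summary without changing the grouping.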


I next want to determine the top 20 countries above the 80th percentile for life expectancy in 2002.

# ntile(), mutate() and top_n() determine the top 20 countries above the 80th percentile in 2002.
gapminder %>% 
  filter(year==2002) %>% 
  mutate(percent_rank=ntile(lifeExp,100)) %>% 
  filter(percent_rank>80) %>% 
  arrange(desc(percent_rank)) %>% 
  top_n(20,wt=percent_rank) %>% 
  select(continent,country,lifeExp,percent_rank)
## # A tibble: 20 x 4
##    continent country          lifeExp percent_rank
##    <fct>     <fct>              <dbl>        <int>
##  1 Asia      Japan               82            100
##  2 Asia      Hong Kong, China    81.5           99
##  3 Europe    Switzerland         80.6           98
##  4 Europe    Iceland             80.5           97
##  5 Oceania   Australia           80.4           96
##  6 Europe    Italy               80.2           95
##  7 Europe    Sweden              80.0           94
##  8 Europe    Spain               79.8           93
##  9 Americas  Canada              79.8           92
## 10 Asia      Israel              79.7           91
## 11 Europe    France              79.6           90
## 12 Oceania   New Zealand         79.1           89
## 13 Europe    Norway              79.0           88
## 14 Europe    Austria             79.0           87
## 15 Asia      Singapore           78.8           86
## 16 Europe    Germany             78.7           85
## 17 Europe    Netherlands         78.5           84
## 18 Europe    United Kingdom      78.5           83
## 19 Europe    Finland             78.4           82
## 20 Europe    Belgium             78.3           81

This analysis reveals that, in 2002, the countries above the 80th percentile were located primarily in Europe, Asia and Oceania. Only one country from the Americas, Canada, appeared.


For comparison I calculate the countries in the lower 20th percentile for life expectancy, during the same year.

# ntile(), mutate() and top_n() determine the bottom 20 countries below the 20th percentile in 2002.
gapminder %>% 
  filter(year==2002) %>% 
  mutate(percent_rank=ntile(lifeExp,100)) %>% 
  filter(percent_rank<20) %>% 
  arrange(percent_rank) %>% 
  top_n(-20,wt=percent_rank) %>% 
  select(continent,country,lifeExp,percent_rank)
## # A tibble: 20 x 4
##    continent country                  lifeExp percent_rank
##    <fct>     <fct>                      <dbl>        <int>
##  1 Africa    Zambia                      39.2            1
##  2 Africa    Zimbabwe                    40.0            1
##  3 Africa    Angola                      41.0            2
##  4 Africa    Sierra Leone                41.0            2
##  5 Asia      Afghanistan                 42.1            3
##  6 Africa    Central African Republic    43.3            3
##  7 Africa    Liberia                     43.8            4
##  8 Africa    Rwanda                      43.4            4
##  9 Africa    Mozambique                  44.0            5
## 10 Africa    Swaziland                   43.9            5
## 11 Africa    Congo, Dem. Rep.            45.0            6
## 12 Africa    Lesotho                     44.6            6
## 13 Africa    Guinea-Bissau               45.5            7
## 14 Africa    Malawi                      45.0            7
## 15 Africa    Nigeria                     46.6            8
## 16 Africa    Somalia                     45.9            8
## 17 Africa    Botswana                    46.6            9
## 18 Africa    Cote d'Ivoire               46.8            9
## 19 Africa    Burundi                     47.4           10
## 20 Africa    Uganda                      47.8           10

This analysis reveals that the 20 countries with the lowest life expectancy in 2002, all within the lowest 10 percentile buckets, were located almost entirely in Africa. Afghanistan, located in Asia, was the only exception. The lower life expectancy in these countries was primarily attributed to wars.12
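The two tables above differ in shape: the bottom countries share percent_rank values in pairs, while the top countries each have their own. A short sketch of how ntile() splits 142 rows into 100 buckets shows why. It assumes dplyr 1.0.0 or later, where any larger buckets are placed first; earlier versions may distribute rows differently.

```r
library(dplyr)

# 142 countries ranked into 100 buckets, as in the 2002 analysis
buckets <- ntile(seq_len(142), 100)

# Count how many rows land in each bucket
bucket_sizes <- table(buckets)

# Inspect the first and last buckets and the point where the size changes
bucket_sizes[c("1", "42", "43", "100")]
```

With 142 rows and 100 buckets there are 42 extra rows, so the lowest 42 buckets hold two countries each and the remaining 58 hold one.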


The aggregate function shows us the median GDP per capita (US $) of each continent from 1952-2007.

# gdpPercap disaggregated with continent for median gdpPercap of continents
aggregate(gdpPerCap_US_Dollars~ continent, gapminder_rename,median)
##   continent gdpPerCap_US_Dollars
## 1    Africa             1192.138
## 2  Americas             5465.510
## 3      Asia             2646.787
## 4    Europe            12081.749
## 5   Oceania            17983.304

We see that the African continent has the lowest median GDP per capita at $1,192.14, whereas Oceania has the highest at $17,983.30. Across the whole period, Africa sits well below every other continent on this measure. This raises another question of the data. Since the analysis so far suggests that higher GDP per capita is associated with longer life expectancy, is total GDP also associated with longer life expectancy? To tease this insight from the data, we can derive a new column, totalGDP, using the mutate function.

# filter() and mutate() derive a new column, totalGDP (every year in the data is after 1950, so the filter keeps all rows)
gapminder %>% 
  filter(year>1950) %>% 
  mutate(totalGDP=pop*gdpPercap)
## # A tibble: 1,704 x 7
##    country     continent  year lifeExp      pop gdpPercap     totalGDP
##    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>        <dbl>
##  1 Afghanistan Asia       1952    28.8  8425333      779.  6567086330.
##  2 Afghanistan Asia       1957    30.3  9240934      821.  7585448670.
##  3 Afghanistan Asia       1962    32.0 10267083      853.  8758855797.
##  4 Afghanistan Asia       1967    34.0 11537966      836.  9648014150.
##  5 Afghanistan Asia       1972    36.1 13079460      740.  9678553274.
##  6 Afghanistan Asia       1977    38.4 14880372      786. 11697659231.
##  7 Afghanistan Asia       1982    39.9 12881816      978. 12598563401.
##  8 Afghanistan Asia       1987    40.8 13867957      852. 11820990309.
##  9 Afghanistan Asia       1992    41.7 16317921      649. 10595901589.
## 10 Afghanistan Asia       1997    41.8 22227415      635. 14121995875.
## # ... with 1,694 more rows

To further test my assumption that life expectancy is tied to GDP per capita, I calculate a percentile rank (percent_rank) from the lifeExp variable and then sort the rows by GDP per capita in descending order. This reveals the life expectancy percentile of the observations with the highest GDP per capita.

#Testing assumption that lifeExp is related to GDP_PerCapita.
gapminder %>% 
  filter(year>1950) %>% 
  mutate(percent_rank=ntile(lifeExp,100)) %>% 
  arrange(desc(gdpPercap))
## # A tibble: 1,704 x 7
##    country   continent  year lifeExp     pop gdpPercap percent_rank
##    <fct>     <fct>     <int>   <dbl>   <int>     <dbl>        <int>
##  1 Kuwait    Asia       1957    58.0  212846   113523.           45
##  2 Kuwait    Asia       1972    67.7  841934   109348.           64
##  3 Kuwait    Asia       1952    55.6  160000   108382.           40
##  4 Kuwait    Asia       1962    60.5  358266    95458.           50
##  5 Kuwait    Asia       1967    64.6  575003    80895.           57
##  6 Kuwait    Asia       1977    69.3 1140357    59265.           68
##  7 Norway    Europe     2007    80.2 4627926    49357.           99
##  8 Kuwait    Asia       2007    77.6 2505559    47307.           96
##  9 Singapore Asia       2007    80.0 4553009    47143.           99
## 10 Norway    Europe     2002    79.0 4535591    44684.           98
## # ... with 1,694 more rows

In just the first 10 observations, sorted by GDP per capita, the life expectancy percent_rank in the rightmost column is at or above the 40th percentile in every row, and the 2002 and 2007 rows are all above the 95th. Kuwait is especially prominent: its life expectancy moved from the 40th percentile in 1952 to the 96th percentile in 2007. Note that Kuwait’s highest GDP per capita values in this table come from 1952 to 1977, so its life expectancy kept rising even after GDP per capita had fallen from its peak. External research on Kuwait for that time frame indicates that the country went through nearly forty years of modernization, from 1946 to 1982, following the discovery of commercial oil reserves.5


Before I can support any claim that GDP per capita is correlated with life expectancy, I must also look at the percent_rank of countries with the lowest GDP per capita.

#ascending  percent_rank of countries with low GDP_per capita from 1952
gapminder %>% 
  filter(year>1950) %>% 
  mutate(percent_rank=ntile(lifeExp,100)) %>% 
  arrange(gdpPercap)
## # A tibble: 1,704 x 7
##    country          continent  year lifeExp      pop gdpPercap percent_rank
##    <fct>            <fct>     <int>   <dbl>    <int>     <dbl>        <int>
##  1 Congo, Dem. Rep. Africa     2002    45.0 55379852      241.           18
##  2 Congo, Dem. Rep. Africa     2007    46.5 64606759      278.           22
##  3 Lesotho          Africa     1952    42.1   748747      299.           12
##  4 Guinea-Bissau    Africa     1952    32.5   580653      300.            1
##  5 Congo, Dem. Rep. Africa     1997    42.6 47798986      312.           13
##  6 Eritrea          Africa     1952    35.9  1438760      329.            3
##  7 Myanmar          Asia       1952    36.3 20092996      331             3
##  8 Lesotho          Africa     1957    45.0   813338      336.           18
##  9 Burundi          Africa     1952    39.0  2445618      339.            6
## 10 Eritrea          Africa     1957    38.0  1542611      344.            5
## # ... with 1,694 more rows
#ascending  percent_rank of countries with low GDP_per capita in 2007
gapminder %>% 
  filter(year==2007) %>% 
  mutate(percent_rank=ntile(lifeExp,100)) %>% 
  arrange(gdpPercap)
## # A tibble: 142 x 7
##    country                continent  year lifeExp     pop gdpPercap percent_rank
##    <fct>                  <fct>     <int>   <dbl>   <int>     <dbl>        <int>
##  1 Congo, Dem. Rep.       Africa     2007    46.5  6.46e7      278.            7
##  2 Liberia                Africa     2007    45.7  3.19e6      415.            5
##  3 Burundi                Africa     2007    49.6  8.39e6      430.           10
##  4 Zimbabwe               Africa     2007    43.5  1.23e7      470.            4
##  5 Guinea-Bissau          Africa     2007    46.4  1.47e6      579.            6
##  6 Niger                  Africa     2007    56.9  1.29e7      620.           18
##  7 Eritrea                Africa     2007    58.0  4.91e6      641.           19
##  8 Ethiopia               Africa     2007    52.9  7.65e7      691.           14
##  9 Central African Repub~ Africa     2007    44.7  4.37e6      706.            5
## 10 Gambia                 Africa     2007    59.4  1.69e6      753.           21
## # ... with 132 more rows

Looking at the countries with the lowest GDP per capita reveals that they do indeed have lower life expectancy percentiles than countries with higher GDP per capita.


Here, a more granular analysis compares two countries, South Africa and Ireland, focusing on life expectancy. I chose South Africa because it has recently been in the headlines as the country where the Omicron variant of COVID-19 was first reported, and it has one of the lowest vaccination rates in the world.3 By contrast, Ireland was chosen because it has one of the highest vaccination rates in the world.4


For comparison between South Africa and Ireland, I analyze which country has the longer life expectancy from birth.

# select(), filter() and group_by() are used to compare the lifeExp of the two countries.

gapminder_rename%>%
  select(country,lifeExp)%>%
  filter(country=="Ireland"|
           country=="South Africa")%>%
  group_by(country)%>%
  summarise(Average_life=mean(lifeExp))
## # A tibble: 2 x 2
##   country      Average_life
##   <fct>               <dbl>
## 1 Ireland              73.0
## 2 South Africa         54.0

We observe that Ireland, at 73 years, has an average life expectancy about 19 years greater than South Africa, at 54 years. This difference raises a question: is it statistically significant?

A t-test is run to determine whether the difference in life expectancy between South Africa and Ireland is statistically significant.

# A new data frame, gapminder_t, is created to perform a t-test.
gapminder_t <- gapminder_rename %>%
  select(country, lifeExp) %>%        # select() narrows down the columns we work with
  filter(country == "South Africa" |
           country == "Ireland")      # filter() keeps only the rows for these two countries

The t-test is used to compare average life expectancy between South Africa and Ireland.

# Apply a Welch two-sample t-test to the newly created data frame.
t.test(lifeExp ~ country, data = gapminder_t)
## 
##  Welch Two Sample t-test
## 
## data:  lifeExp by country
## t = 10.067, df = 19.109, p-value = 4.466e-09
## alternative hypothesis: true difference in means between group Ireland and group South Africa is not equal to 0
## 95 percent confidence interval:
##  15.07022 22.97794
## sample estimates:
##      mean in group Ireland mean in group South Africa 
##                   73.01725                   53.99317

T-Test analysis

From this analysis we see that the average life expectancy from 1952 to 2007 is 54 years in South Africa and 73 years in Ireland. Our observed difference is therefore about 19 years. The null hypothesis is that there is no actual difference in average life expectancy between the two countries, that is, that the true difference is 0. The question is: if the null hypothesis were true, what are the chances that our sample would show a difference at least as large as the one we observed?

That probability is the p-value, which is 4.466e-09 and extremely close to 0. Because such a result would be so unlikely under the null hypothesis, we can reject the assumption that the difference between the means is 0 and accept the alternative, which is that there is a real difference. To support this, we have a 95 percent confidence interval for where the actual difference is likely to be, which is between about 15 and 23 years.
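To show where the numbers in the t.test() output come from, the sketch below recomputes Welch’s statistic by hand. It assumes the gapminder package is installed and uses the standard Welch formulas (difference in means over its standard error, with the Welch-Satterthwaite degrees of freedom); the variable names are illustrative and not part of the original analysis.

```r
library(gapminder)

ireland      <- gapminder$lifeExp[gapminder$country == "Ireland"]
south_africa <- gapminder$lifeExp[gapminder$country == "South Africa"]

# Squared standard error of each group mean: sample variance / sample size
se2_ireland      <- var(ireland) / length(ireland)
se2_south_africa <- var(south_africa) / length(south_africa)

# t statistic: observed difference in means divided by its standard error
mean_diff <- mean(ireland) - mean(south_africa)
std_error <- sqrt(se2_ireland + se2_south_africa)
t_stat    <- mean_diff / std_error

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (se2_ireland + se2_south_africa)^2 /
  (se2_ireland^2 / (length(ireland) - 1) +
     se2_south_africa^2 / (length(south_africa) - 1))

# Two-sided p-value and 95% confidence interval for the difference in means
p_value  <- 2 * pt(-abs(t_stat), df = df_welch)
conf_int <- mean_diff + c(-1, 1) * qt(0.975, df = df_welch) * std_error

c(t = t_stat, df = df_welch, p = p_value)
conf_int
```

These values should agree with the t, df, p-value and confidence interval reported by t.test() above. The fractional degrees of freedom arise because the Welch test does not assume equal variances in the two groups.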


After conducting the t-test, I am interested in whether life expectancy in South Africa, despite being low, is on an upward trajectory. I use the years 1997, 2002 and 2007.

#filter for South African lifeExp over 10 years
gapminder_rename %>%
  filter(
    continent == "Africa",
    country == "South Africa",
    five_year_increment %in% c(1997, 2002, 2007)
  )
## # A tibble: 3 x 6
##   country  continent five_year_increm~ lifeExp total_country_p~ gdpPerCap_US_Do~
##   <fct>    <fct>                 <int>   <dbl>            <int>            <dbl>
## 1 South A~ Africa                 1997    60.2         42835005            7479.
## 2 South A~ Africa                 2002    53.4         44433622            7711.
## 3 South A~ Africa                 2007    49.3         43997828            9270.

After running the chunk, I am surprised to see that life expectancy in South Africa is on a downward trajectory, going from 60 to 49 years in 10 years. The population increased from 1997 to 2002 but then decreased by 435,794 between 2002 and 2007. After conducting external research for that time period, I learned that the reduction in life expectancy was primarily attributed to the HIV/AIDS epidemic in South Africa.6
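The changes quoted above can be reproduced with diff(), which subtracts each value from the one that follows it. This is a minimal sketch using the three South Africa rows exactly as printed (life expectancy rounded to one decimal place); the data frame name is illustrative.

```r
# South Africa rows copied from the output above
south_africa <- data.frame(
  year    = c(1997L, 2002L, 2007L),
  lifeExp = c(60.2, 53.4, 49.3),
  pop     = c(42835005, 44433622, 43997828)
)

# Change between consecutive five-year observations
pop_change     <- diff(south_africa$pop)      # 1997 to 2002, then 2002 to 2007
lifeExp_change <- diff(south_africa$lifeExp)

pop_change
lifeExp_change
```

A positive entry is an increase over the preceding five-year step and a negative entry is a decrease.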

I run the same analysis for Ireland to determine if life expectancy is going down from 1997-2007.

gapminder_rename %>%
  filter(
    continent == "Europe",
    country == "Ireland",
    five_year_increment %in% c(1997, 2002, 2007)
  )
## # A tibble: 3 x 6
##   country continent five_year_increm~ lifeExp total_country_po~ gdpPerCap_US_Do~
##   <fct>   <fct>                 <int>   <dbl>             <int>            <dbl>
## 1 Ireland Europe                 1997    76.1           3667233           24522.
## 2 Ireland Europe                 2002    77.8           3879155           34077.
## 3 Ireland Europe                 2007    78.9           4109086           40676.

Life expectancy in Ireland is on an upward trajectory, from 76 years to almost 79 years during a 10-year window. Total population is also on an upward trend.


4 Visualization

To gain insight into the data, it is helpful to represent it visually for ease of comprehension.


For the Gapminder dataset, I start with a boxplot. It effectively displays the distribution of the continuous variable lifeExp (Life Expectancy from Birth) against the continent variable.

# Boxplot showing the distribution of life expectancy, disaggregated by continent.
boxplot(lifeExp ~ continent, data = gapminder_rename,
        main = "Life Expectancy from Birth Continent Data from 1952-2007",
        xlab = "Continent", ylab = "(lifeExp) Life Expectancy from Birth", col = "skyblue")

We observe that the median life expectancy for the African continent is approximately 48 years, compared to Oceania, which has a median life expectancy of approximately 74 years.


I now want to take a look at the relationship between two numeric variables, with lifeExp as the dependent variable on the y-axis and gdpPerCap_US_Dollars as the independent variable on the x-axis, using a scatter plot. However, when I first plot the data, the relationship isn’t particularly linear and GDP per capita is very skewed; therefore, I try a log transformation.11 The log transformation is useful in that it compresses the long right tail of highly skewed data.

#plot of numeric variables lifeExp and gdpPerCap
plot(lifeExp~gdpPerCap_US_Dollars,data = gapminder_rename,col="green",main="Life Expectancy as a function of GDP-PerCapita")

#variables lifeExp and gdpPerCap plot with log transformation
plot(lifeExp~log(gdpPerCap_US_Dollars),data = gapminder_rename,col="green",main="Life Expectancy as a Function of GDP PerCapita - Log Scale")

This plot shows that countries with a higher GDP per capita tend to have a higher life expectancy. This makes sense, because countries with higher GDP per capita tend to have more access to health care, as well as basic necessities such as clean drinking water, nutritious food and shelter, and less internal strife such as war.7
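The effect of the log transformation on skew can be seen with a small sketch that uses only the minimum, median and maximum of gdpPercap from the summary() output earlier in this report. The helper function upper_vs_lower() is hypothetical and written just for this illustration.

```r
# Minimum, median and maximum of gdpPercap, copied from the summary() output
gdp <- c(min = 241.2, median = 3531.8, max = 113523.1)

# How much longer is the upper tail (median to max) than the lower (min to median)?
upper_vs_lower <- function(x) {
  unname((x["max"] - x["median"]) / (x["median"] - x["min"]))
}

raw_ratio <- upper_vs_lower(gdp)       # on the original dollar scale
log_ratio <- upper_vs_lower(log(gdp))  # after the log transformation

c(raw = raw_ratio, log = log_ratio)
```

On the raw scale the upper tail is tens of times longer than the lower one, while on the log scale the two are of similar length, which is why the log-scale scatter plot is easier to read.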


I now look at the 5 continents individually to examine life expectancy as a function of GDP per capita and to see whether there is a correlation.

library(ggplot2)
gapminder %>% 
  filter(gdpPercap<50000) %>% #filters what data is fed into graph 
  ggplot(aes(x= log(gdpPercap),y=lifeExp,col=continent,size=pop))+ # determines how the variables will be mapped onto the canvas
  geom_point(alpha=0.1)+
  geom_smooth(method=lm)+
  facet_wrap(~continent)+  #produces 5 continents on a single canvas.
  labs(title = "Life Expectancy vs.Gross Domestic Product Per Capita within 5 Continents")           
## `geom_smooth()` using formula 'y ~ x'

This faceted graph represents life expectancy from birth as a function of GDP per capita across five separate panels, one for each continent. Europe and Oceania both have high life expectancies, with medians above 72 years over the period, followed by the Americas and Asia, whose values are lower on average and more spread out. Africa, unfortunately, has a median life expectancy of about 48 years. This outcome is consistent with our findings in the previous boxplot.
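The fitted lines suggest a relationship in each panel; one way to put a number on it is a Pearson correlation per continent. The sketch below assumes the gapminder package is installed and applies the same gdpPercap < 50000 filter as the plot. It is an illustrative addition rather than part of the original analysis, and a correlation does not by itself establish that GDP per capita causes longer life expectancy.

```r
library(gapminder)

# Same filter as the faceted plot
gm <- gapminder[gapminder$gdpPercap < 50000, ]

# Split the rows by continent and correlate log GDP per capita with lifeExp
cor_by_continent <- sapply(
  split(gm, gm$continent),
  function(d) cor(log(d$gdpPercap), d$lifeExp)
)

round(cor_by_continent, 2)
```

Values near 1 indicate a strong positive linear relationship on the log scale, and values near 0 indicate little linear relationship.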


I wanted to finish this paper on a positive note. Therefore, my last question of the Gapminder data is: which countries saw their population grow by at least 500% between 1952 and 2007, that is, to at least six times its 1952 level?

The analysis focuses on the span of 55 years from 1952 to 2007. I define a new data frame called “population_growth”, starting with the select and filter functions, and then calculate the percentage change between the two selected years. I then arrange the output so that the countries with the largest population growth appear first.

# Produces a data frame of the countries whose population grew by at least 500% between 1952 and 2007
population_growth<-gapminder_rename %>%
  select(country,five_year_increment,total_country_population,continent) %>% 
  filter(five_year_increment==1952|five_year_increment==2007) %>% 
  spread(key = five_year_increment,value = total_country_population) %>% 
  mutate(pop_increment_growth =`2007`-`1952`,
         pop_increment_percent = round(pop_increment_growth/`1952`*100,2)) %>% 
  arrange(desc(pop_increment_percent)) %>% 
  filter(pop_increment_percent >= 500) %>% 
  mutate(country=factor(country,levels = country)) %>% 
  select(-pop_increment_growth)

I now have a new data frame called population_growth. Let’s take a look.

head(population_growth)
## # A tibble: 6 x 5
##   country       continent  `1952`   `2007` pop_increment_percent
##   <fct>         <fct>       <int>    <int>                 <dbl>
## 1 Kuwait        Asia       160000  2505559                 1466.
## 2 Jordan        Asia       607914  6053193                  896.
## 3 Djibouti      Africa      63149   496374                  686.
## 4 Saudi Arabia  Asia      4005677 27601038                  589.
## 5 Oman          Asia       507833  3204897                  531.
## 6 Cote d'Ivoire Africa    2977019 18013409                  505.
tail(population_growth)
## # A tibble: 6 x 5
##   country       continent  `1952`   `2007` pop_increment_percent
##   <fct>         <fct>       <int>    <int>                 <dbl>
## 1 Kuwait        Asia       160000  2505559                 1466.
## 2 Jordan        Asia       607914  6053193                  896.
## 3 Djibouti      Africa      63149   496374                  686.
## 4 Saudi Arabia  Asia      4005677 27601038                  589.
## 5 Oman          Asia       507833  3204897                  531.
## 6 Cote d'Ivoire Africa    2977019 18013409                  505.
glimpse(population_growth)
## Rows: 6
## Columns: 5
## $ country               <fct> Kuwait, Jordan, Djibouti, Saudi Arabia, Oman, Co~
## $ continent             <fct> Asia, Asia, Africa, Asia, Asia, Africa
## $ `1952`                <int> 160000, 607914, 63149, 4005677, 507833, 2977019
## $ `2007`                <int> 2505559, 6053193, 496374, 27601038, 3204897, 180~
## $ pop_increment_percent <dbl> 1465.97, 895.73, 686.04, 589.05, 531.09, 505.08

We see that from 1952 to 2007, Kuwait experienced population growth of over 1,400 percent. This is attributed to increased immigration to the country.8
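The reshaping and percentage calculation can be traced on two countries with a short sketch. It uses tidyr’s pivot_wider(), the newer replacement for spread(), which is a swapped-in function rather than the one used in the analysis above. The population figures are copied from the output above, and the column names pct_increase and times_1952 are illustrative.

```r
library(dplyr)
library(tidyr)

# Long format: one row per country and year (values copied from the output above)
pop_long <- tibble(
  country = rep(c("Kuwait", "Jordan"), each = 2),
  year    = rep(c(1952L, 2007L), times = 2),
  pop     = c(160000L, 2505559L, 607914L, 6053193L)
)

# Wide format: one column per year, then the growth measures
pop_wide <- pop_long %>%
  pivot_wider(names_from = year, values_from = pop) %>%
  mutate(
    pct_increase = round((`2007` - `1952`) / `1952` * 100, 2),
    times_1952   = round(`2007` / `1952`, 2)
  )

pop_wide
```

The two measures are linked: a percentage increase of p corresponds to a 2007 population of (1 + p/100) times the 1952 population, so the 500% threshold means at least six times the 1952 level.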

Here, I use a table to represent the countries whose population grew by at least 500% of its 1952 level.

#renaming data_frame for better clarity.
population_growth_rename<-population_growth %>% 
  rename(percent_increase=pop_increment_percent)

# Table of the six countries that meet the threshold.
population_growth_rename %>% 
  knitr::kable(caption = "Table 1: Countries with at Least a 500% Population Increase Between 1952 and 2007")
Table 1: Countries with at Least a 500% Population Increase Between 1952 and 2007

country        continent     1952      2007  percent_increase
Kuwait         Asia        160000   2505559           1465.97
Jordan         Asia        607914   6053193            895.73
Djibouti       Africa       63149    496374            686.04
Saudi Arabia   Asia       4005677  27601038            589.05
Oman           Asia        507833   3204897            531.09
Cote d’Ivoire  Africa     2977019  18013409            505.08

Here, I use a lollipop plot to show which countries experienced population growth of at least 500% from 1952 to 2007. I use the lollipop plot because it clearly and simply depicts the relationship between a numeric variable (percent_increase) and a categorical variable (country).

# Plot inspiration: reference 9

ggplot(population_growth_rename,
       aes(x=country,y=percent_increase,color = continent))+
  geom_segment( aes(x=country,xend = country,y = 0,yend=percent_increase),
                color = "skyblue")+
  geom_point(size = 2)+
  geom_text(aes(label=paste0(percent_increase,"%")),size = 2.5 ,nudge_y = 130)+
  theme_light()+
  coord_flip()+
  theme(panel.grid.major.y=element_blank(),
        panel.grid.minor =element_blank(),
        panel.border = element_blank() 
        )+
  labs(title = "Countries with a Five-Hundred Percent Population increase from 1952-2007")+
  xlab("")+
  ylab("Population Growth from 1952 to 2007 (Percentage)") 

This graph clearly shows that Kuwait has had major population growth, as also noted in Table 1. It went from a population of 160,000 in 1952 to 2,505,559 in 2007, a span of 55 years. This is more than 15 times its 1952 population. This increase, as noted earlier, is primarily attributed to immigration. Immigrants represent approximately 70% of the population of Kuwait, while Kuwaiti citizens account for between 28% and 32%. As of 2011, the population growth rate was 1.986%. Expatriates are attracted to Kuwait primarily because of the employment opportunities.13

5 Reflection

The Gapminder dataset is fascinating, and the mission of the Gapminder Foundation aligns with my sensibilities and values. As a new user of R, I found it compelling to explore this data and build my statistical and R skills. That said, all has not been rosy; this has been a steep learning curve.

Until I worked on this project, I didn’t appreciate the impact of the decisions I made regarding the variables, not just with the Gapminder dataset but with all datasets. One could easily spend hours working on just a pair of variables in an attempt to tease insight from the data. As I reflect on this paper, I would be interested in analyzing more recent data from 2008-2021 to determine how key country wealth indicators such as GDP and GDP per capita are impacting countries in the twenty-first century and in the age of COVID-19.

In spite of the challenges, I enjoyed the process of working on this paper. Every decision I made had an impact on the outcomes of my analysis. I will continue to work on the Gapminder data as well as Gapminder-adjacent projects, such as life expectancy and health outcomes as a function of per capita income in the United States, so that I may continue to grow my skills in data science. It is a marathon, not a sprint.

6 Conclusion

The wealth of a country cannot be measured by its GDP alone; just as important are the quality of life and health of its citizens, seen through the lens of longevity and GDP per capita.

This project examined key disparities in global wealth, represented by the key indicators life expectancy, total country population and GDP per capita. Although my analysis was an overview of the state of these indicators from 1952-2007, it provided much insight. For instance, we saw that the African continent has the lowest median GDP per capita and life expectancy of the five continents, and that South Africa’s life expectancy fell well below Ireland’s, whereas countries such as Kuwait saw dramatic changes in their key indicators, linked to the commercialization of oil and immigration. A few aspects of this data also surprised me. I was surprised to learn that life expectancy in South Africa decreased from 1997 to 2007, in part due to HIV/AIDS, and that Canada was the only country in the Americas in 2002 with a life expectancy from birth above the 80th percentile.

Ultimately, this project taught me the importance of the data-driven analysis, methods, choices and techniques needed to gain insight into a dataset, and to never make assumptions or let bias seep in. Discovering the nuances of a dataset and tuning our approach accordingly is just as important as the data itself. It is more important to keep the goal in mind than to prioritize a routine method of achieving it.