1. Study

1.1 Purpose

We decided to focus on exploring relationships in the data.

Our research question was “What are the differences in population, life expectancy, and GDP between the earliest (1952) and latest (2007) data collection points in the data across current G20 members?”

Note that we excluded Russia, the European Union, and the African Union because they are not available in the data set.
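The exclusion can be checked against the country names in the data. The following is a minimal sketch, assuming the gapminder package is installed; the object names `g20_members`, `missing_members`, and `available_members` are introduced here for illustration, and the spellings "Korea, Rep." and "Turkey" follow the data set rather than official names.

```r
library(gapminder)

# Current G20 members, spelled as they appear in gapminder where possible.
# `g20_members` is a name introduced here for illustration.
g20_members <- c("Argentina", "Australia", "Brazil", "Canada", "China",
                 "France", "Germany", "India", "Indonesia", "Italy",
                 "Japan", "Mexico", "Korea, Rep.", "Russia",
                 "Saudi Arabia", "South Africa", "Turkey",
                 "United Kingdom", "United States",
                 "European Union", "African Union")

# Members with no matching country level in the data
missing_members <- setdiff(g20_members, levels(gapminder$country))
missing_members

# Members that can be analysed
available_members <- intersect(g20_members, levels(gapminder$country))
length(available_members)
```

Comparing names with `setdiff()` before filtering makes any silent mismatch in spelling visible, which a chain of `==` comparisons would not.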

1.2 Data

We used a subset of the original data from gapminder, which was provided for the purposes of this assignment.

The main data frame gapminder has 1704 rows and 6 variables:

  1. country (categorical; nominal): factor with 142 levels or country names
  2. continent (categorical; nominal): factor with 5 levels or continent names
  3. year (numeric; discrete): ranges from 1952 to 2007 in increments of 5 years
  4. lifeExp (numeric; continuous): life expectancy at birth, in years
  5. pop (numeric; count): population
  6. gdpPercap (numeric; continuous): GDP per capita (in USD and inflation-adjusted)

The supplemental data frame gapminder_unfiltered was not filtered on year or for complete data and has 3313 rows.

For each of 142 countries, the package provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007 (Bryan 2023).
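The dimensions described above can be confirmed with a few base R calls. This is a small sketch that assumes only that the gapminder package is installed; `rows_per_country` is a name introduced here.

```r
library(gapminder)

dim(gapminder)                  # rows and columns of the main data frame
nlevels(gapminder$country)      # number of distinct countries
nlevels(gapminder$continent)    # number of distinct continents
sort(unique(gapminder$year))    # survey years
nrow(gapminder_unfiltered)      # rows in the unfiltered supplement

# Every country should contribute one row per survey year
rows_per_country <- table(gapminder$country)
all(rows_per_country == length(unique(gapminder$year)))
```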

1.3 Visualization

A first approach to answering our RQ is to plot each of the three variables of interest in a connected dot plot. This visualization makes it easy to see how each G20 country changed on a given variable between 1952 and 2007.

A second approach is a bubble chart that plots life expectancy (years) against GDP per capita (USD), with bubble size reflecting population. This visualization shows how life expectancy varies with GDP per capita and allows population in 1952 to be contrasted with population in 2007.

A third and final approach is principal components analysis (PCA). This technique can help us visualize the three dimensions of the gapminder data more clearly by projecting the data onto the first two principal components (i.e., the two orthogonal directions along which the data vary most). The resulting biplot shows where each country is located relative to the other countries on PC1 and PC2, and how the three variables (life expectancy, GDP per capita, and population) load on those components. This visualization can help us understand the differences between countries in life expectancy, GDP per capita, and population in 1952 versus 2007. In other words, we can tell from this analysis which variables a country scores highly on in 1952 and in 2007, which in turn shows what trends and correlations are present in the countries' data for these years.
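To make the terms concrete, here is a minimal sketch of what PCA computes for one year, assuming the gapminder package is installed. The object names (`g20_names`, `X`, `pca`, `var_explained`, `projected`) are introduced here for illustration and are not part of the analysis code in Section 2.

```r
library(gapminder)

# G20 members available in gapminder (illustrative vector name)
g20_names <- c("Argentina", "Australia", "Brazil", "Canada", "China",
               "France", "Germany", "India", "Indonesia", "Italy",
               "Japan", "Mexico", "Korea, Rep.", "Saudi Arabia",
               "South Africa", "Turkey", "United Kingdom", "United States")

g20_2007 <- subset(gapminder, year == 2007 & country %in% g20_names)

# Numeric matrix with one row per country
X <- as.matrix(g20_2007[, c("lifeExp", "pop", "gdpPercap")])
rownames(X) <- as.character(g20_2007$country)

# Standardise the variables, then extract the principal components
pca <- prcomp(X, scale. = TRUE)

pca$rotation          # loadings: weight of each variable on each PC
head(pca$x[, 1:2])    # scores: position of each country on PC1 and PC2

# Share of the total variance captured by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Scores are the standardised data projected onto the loadings
projected <- scale(X) %*% pca$rotation
all.equal(unname(projected), unname(pca$x))
```

Because the three variables are on very different scales, `scale. = TRUE` matters here: without it, population would dominate the first component.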

2. Plots

2.1 Version

version
##                _                           
## platform       aarch64-apple-darwin20      
## arch           aarch64                     
## os             darwin20                    
## system         aarch64, darwin20           
## status                                     
## major          4                           
## minor          3.1                         
## year           2023                        
## month          06                          
## day            16                          
## svn rev        84548                       
## language       R                           
## version.string R version 4.3.1 (2023-06-16)
## nickname       Beagle Scouts
options(scipen = 999) # no scientific notation

2.2 Libraries

library(gapminder)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(ggfortify) # this package is for "ggplot2" to visualize "prcomp()"
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
library(grid)

2.3 Plot 1: Cleveland Dot Plot

# data manipulation for plotting
# G20 members available in gapminder
g20.countries <- c("Argentina", "Australia", "Brazil", "Canada", "China",
                   "France", "Germany", "India", "Indonesia", "Italy",
                   "Japan", "Mexico", "Korea, Rep.", "Saudi Arabia",
                   "South Africa", "Turkey", "United Kingdom",
                   "United States")

# keep the 1952 and 2007 rows for those countries
g20.col1952 <- gapminder |> filter(year == 1952, country %in% g20.countries)
g20.col2007 <- gapminder |> filter(year == 2007, country %in% g20.countries)

g20.col <- rbind(g20.col1952, g20.col2007)

# plotting
a <- ggplot(data = g20.col, 
            aes(x = pop, 
                y = reorder(country, 
                            lifeExp)))+
  geom_line(aes(group = country))+
  geom_point(aes(color = as.factor(year)))+
  ylab("")+
  xlab("Population\n ")+
  scale_color_discrete(guide = "none")+ # guide = FALSE is deprecated in ggplot2 >= 3.3.4
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

b <- ggplot(data = g20.col, 
            aes(x = gdpPercap, 
                y = reorder(country, 
                            lifeExp)))+
  geom_line(aes(group = country))+
  geom_point(aes(color = as.factor(year)))+
  ylab("")+
  xlab("GDP Per Capita\n(USD)")+
  scale_color_discrete(name = "Year")+
  theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

c <- ggplot(data = g20.col, 
            aes(x = lifeExp, 
                y = reorder(country, 
                            lifeExp)))+
  geom_line(aes(group = country))+
  geom_point(aes(color = as.factor(year)))+
  ylab("Country")+
  xlab("Life Expectancy\n(Years)")+
  scale_color_discrete(guide = "none") # guide = FALSE is deprecated in ggplot2 >= 3.3.4
grid.arrange(ncol = 3, c, a, b)

One evident result from this plot is that most G20 countries, with the exception of South Africa, substantially increased in life expectancy and GDP per capita between 1952 and 2007. On the plot's absolute scale, however, only China and India stand out with a substantial increase in population.
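The visual impression can be backed by a table of changes per country. This is a sketch under our own naming, assuming the gapminder, dplyr, and tidyr packages are installed; `g20_names`, `g20_change`, and the derived column names are introduced here. It reports both absolute and relative population change, since the two can rank countries differently.

```r
library(gapminder)
library(dplyr)
library(tidyr)

# G20 members available in gapminder (illustrative vector name)
g20_names <- c("Argentina", "Australia", "Brazil", "Canada", "China",
               "France", "Germany", "India", "Indonesia", "Italy",
               "Japan", "Mexico", "Korea, Rep.", "Saudi Arabia",
               "South Africa", "Turkey", "United Kingdom", "United States")

g20_change <- gapminder |>
  filter(year %in% c(1952, 2007), country %in% g20_names) |>
  select(country, year, lifeExp, pop, gdpPercap) |>
  # one row per country, with columns such as lifeExp_1952 and lifeExp_2007
  pivot_wider(names_from = year,
              values_from = c(lifeExp, pop, gdpPercap)) |>
  mutate(lifeExp_diff    = lifeExp_2007 - lifeExp_1952,
         pop_diff        = pop_2007 - pop_1952,
         pop_ratio       = pop_2007 / pop_1952,
         gdpPercap_ratio = gdpPercap_2007 / gdpPercap_1952) |>
  arrange(desc(pop_diff))

g20_change |>
  select(country, lifeExp_diff, pop_diff, pop_ratio, gdpPercap_ratio)
```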

2.4 Plot 2: Bubble Chart

# data manipulation for plotting
gapminder2007 <- subset(gapminder,
                        gapminder$year == c(2007)) |>
  select(-c("year")) |>
  remove_rownames() |>
  column_to_rownames(var = "country")

g20.2007 <- gapminder2007[c("Argentina",
                     "Australia",
                     "Brazil",
                     "Canada",
                     "China",
                     "France",
                     "Germany",
                     "India",
                     "Indonesia",
                     "Italy",
                     "Japan",
                     "Mexico",
                     "Korea, Rep.",
                     #"Russia", # Russia is not in the data
                     "Saudi Arabia",
                     "South Africa",
                     "Turkey",
                     "United Kingdom",
                     "United States"),
                   c("continent", "lifeExp", "pop", "gdpPercap")]

gapminder1952 <- subset(gapminder,
                        gapminder$year == c(1952)) |>
  select(-c("year")) |>
  remove_rownames() |>
  column_to_rownames(var = "country")

g20.1952 <- gapminder1952[c("Argentina",
                            "Australia",
                            "Brazil",
                            "Canada",
                            "China",
                            "France",
                            "Germany",
                            "India",
                            "Indonesia",
                            "Italy",
                            "Japan",
                            "Mexico",
                            "Korea, Rep.",
                            #"Russia",
                            "Saudi Arabia",
                            "South Africa",
                            "Turkey",
                            "United Kingdom",
                            "United States"),
                          c("continent", "lifeExp", "pop", "gdpPercap")]

g20.1952.y <- g20.1952
g20.1952.y$year <- rep(1952, nrow(g20.1952))

g20.2007.y <- g20.2007
g20.2007.y$year <- rep(2007, nrow(g20.2007))

g20 <- rbind(g20.1952.y, g20.2007.y)
# plotting
ggplot(data = g20,
       aes(x = gdpPercap,
           y = lifeExp,
           colour = as.factor(continent),
           size = pop))+
  geom_point(alpha = 0.8)+
  # a single size scale: adding scale_size_continuous() afterwards would replace
  # scale_size_area() and discard max_size
  scale_size_area(name = "Population", max_size = 20)+
  scale_colour_discrete(name = "Continent")+
  xlab("GDP Per Capita\n(USD)")+
  ylab("Life Expectancy\n(Years)")+
  ggtitle("Bubble Chart of GDP Per Capita by Life Expectancy")+
  facet_wrap(year ~ .)

In contrast to the previous plot, this plot shows how the variables relate to each other. It is evident that life expectancy tends to be higher where GDP per capita is higher, with South Africa as an exception. This result is plausible: greater economic resources per person are generally associated with longer lives, although the plot alone cannot establish causation.
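The strength of this relationship can be summarised with a rank correlation per year. This is a minimal sketch, assuming the gapminder package is installed; the names `g20_names`, `g20_ends`, and `rho` are introduced here, and Spearman's coefficient is our choice because the relationship in the plot does not look linear.

```r
library(gapminder)

# G20 members available in gapminder (illustrative vector name)
g20_names <- c("Argentina", "Australia", "Brazil", "Canada", "China",
               "France", "Germany", "India", "Indonesia", "Italy",
               "Japan", "Mexico", "Korea, Rep.", "Saudi Arabia",
               "South Africa", "Turkey", "United Kingdom", "United States")

g20_ends <- subset(gapminder,
                   year %in% c(1952, 2007) & country %in% g20_names)

# Spearman rank correlation between GDP per capita and life expectancy,
# computed separately for each of the two years
rho <- sapply(split(g20_ends, g20_ends$year), function(d) {
  cor(d$gdpPercap, d$lifeExp, method = "spearman")
})
rho
```

With only 18 countries per year, these coefficients describe the sample and should not be read as precise estimates.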

2.5 Plot 3: PCA Biplot

pca2007 <- prcomp(g20.2007[, 2:ncol(g20.2007)],
                  scale. = TRUE)

p <- autoplot(pca2007, 
              data = g20.2007, 
              label = TRUE,
              label.size = 3,
              shape = FALSE, 
              colour = "continent",
              loadings = TRUE,
              loadings.label = TRUE,
              loadings.colour = "black",
              loadings.label.colour = "black")

pca1952 <- prcomp(g20.1952[, 2:ncol(g20.1952)],
                  scale. = TRUE)

q <- autoplot(pca1952, 
              data = g20.1952, 
              label = TRUE,
              label.size = 3,
              shape = FALSE, 
              colour = "continent",
              loadings = TRUE,
              loadings.label = TRUE,
              loadings.colour = "black",
              loadings.label.colour = "black")
f <- subplot(q, p)
f |> layout(showlegend = FALSE)

Note. The left plot depicts the PCA results for 1952 and the right plot those for 2007. Taking the European countries, Canada, the U.S., and China as examples, we can see that in 1952 (i) the European countries and Canada load highly on life expectancy and GDP per capita; (ii) the U.S. loads highly on population, life expectancy, and GDP per capita; and (iii) China loads highly on population but not on GDP per capita or life expectancy. In 2007, (i) the European countries and Canada still load highly on life expectancy and GDP per capita and have moved closer to the U.S.; and (ii) China, though still loading primarily on population, now also loads on life expectancy. Compared with the bubble chart, the biplot reveals more vividly the dimensions on which the countries differ from one another and over time.
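The reading of the two biplots can be cross-checked against the numeric loadings and scores. The sketch below assumes the gapminder package is installed; `fit_g20_pca()` is a hypothetical helper written for this illustration, as are `g20_names` and `pca_by_year`. Note that the sign of each principal component is arbitrary, so an axis may appear flipped between 1952 and 2007 without any change in the underlying structure.

```r
library(gapminder)

# G20 members available in gapminder (illustrative vector name)
g20_names <- c("Argentina", "Australia", "Brazil", "Canada", "China",
               "France", "Germany", "India", "Indonesia", "Italy",
               "Japan", "Mexico", "Korea, Rep.", "Saudi Arabia",
               "South Africa", "Turkey", "United Kingdom", "United States")

# Hypothetical helper: fit a scaled PCA on the G20 rows of one survey year
fit_g20_pca <- function(yr) {
  d <- subset(gapminder, year == yr & country %in% g20_names)
  X <- as.matrix(d[, c("lifeExp", "pop", "gdpPercap")])
  rownames(X) <- as.character(d$country)
  prcomp(X, scale. = TRUE)
}

pca_by_year <- lapply(c("1952" = 1952, "2007" = 2007), fit_g20_pca)

# Loadings of the three variables on PC1 and PC2, per year
lapply(pca_by_year, function(p) round(p$rotation[, 1:2], 2))

# Scores of selected countries on PC1 and PC2, per year
lapply(pca_by_year, function(p) {
  round(p$x[c("China", "United States", "Canada"), 1:2], 2)
})
```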

3. References

Bryan, Jennifer. 2023. “Data from Gapminder.” https://www.gapminder.org/data/.