We decided to focus on exploring relationships in the data.
Our research question was “What are the differences in population, life expectancy, and GDP between the earliest (1952) and latest (2007) data collection points in the data across current G20 members?”
Note that we excluded Russia, the European Union, and the African Union because they are not available in the data set.
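The exclusion can be checked against the country names in the gapminder package used throughout this report. The sketch below is illustrative only: the vector name `g20_members` is ours, the spellings follow the ones used in the code later in this report (for example "Korea, Rep."), and "Russia" is listed only so that a missing member shows up in the result.

```r
library(gapminder)

# G20 member countries, spelled as in gapminder (g20_members is an illustrative name).
# The European Union and African Union are blocs rather than countries, so they are left out.
g20_members <- c("Argentina", "Australia", "Brazil", "Canada", "China",
                 "France", "Germany", "India", "Indonesia", "Italy",
                 "Japan", "Mexico", "Korea, Rep.", "Russia", "Saudi Arabia",
                 "South Africa", "Turkey", "United Kingdom", "United States")

# Members with no matching country name in the data
missing_members <- setdiff(g20_members, levels(gapminder$country))
missing_members
```

Any name returned by `setdiff()` has no rows in the data and therefore cannot be included in the comparison.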
We used a subset of the original data from gapminder, which was provided for the purposes of this assignment.
The main data frame gapminder has 1704 rows and 6 variables: country, continent, year, lifeExp, pop, and gdpPercap.
The supplemental data frame gapminder_unfiltered was not filtered on year or for complete data and has 3313 rows.
For each of 142 countries, the package provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007 (Bryan 2023).
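The figures quoted above can be confirmed with a few base R calls. This is a minimal sketch that assumes only that the gapminder package is installed; the object names `n_countries` and `years` are ours.

```r
library(gapminder)

# Rows and columns of the main data frame
dim(gapminder)

# Rows of the unfiltered supplemental data frame
nrow(gapminder_unfiltered)

# Number of distinct countries and the years covered
n_countries <- nlevels(gapminder$country)
years <- sort(unique(gapminder$year))
n_countries
years
```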
A first approach we can use to answer our RQ is to look at the three variables of interest with a connected dot plot. This visualization makes it easy to see how each G20 country changed on a given variable between 1952 and 2007.
A second approach we can use to answer our RQ is a bubble chart of GDP per capita (USD) against life expectancy (years), with bubble size reflecting population. This visualization shows how life expectancy changes as GDP per capita increases and lets us contrast population in 1952 with population in 2007.
A third and final approach we can use to answer our RQ is through
principal components analysis (PCA). This technique can help us
visualize the three dimensions of the gapminder data more
clearly, by projecting the data onto the first two principal components
(i.e., the two dimensions of the data with the largest variability).
Furthermore, this analysis can help us understand where each country is
located, with respect to the other countries, on PC1 and PC2, and how
each country loads on the eigenvectors (life expectancy, GDP per capita,
and population). This visualization can help us understand the
differences between countries in life expectancy, GDP, and population in
1952 versus 2007. In other words, we can tell from this analysis what
features a country loads highly on in 1952 and 2007; this can show what
trends and correlations occurred within the country’s data for these
years.
version
## _
## platform aarch64-apple-darwin20
## arch aarch64
## os darwin20
## system aarch64, darwin20
## status
## major 4
## minor 3.1
## year 2023
## month 06
## day 16
## svn rev 84548
## language R
## version.string R version 4.3.1 (2023-06-16)
## nickname Beagle Scouts
options(scipen = 999) # no scientific notation
library(gapminder)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(ggfortify) # this package is for "ggplot2" to visualize "prcomp()"
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
library(grid)
# data manipulation for plotting
# G20 members available in gapminder (Russia is not in the data)
g20.countries <- c("Argentina", "Australia", "Brazil", "Canada", "China",
                   "France", "Germany", "India", "Indonesia", "Italy",
                   "Japan", "Mexico", "Korea, Rep.", "Saudi Arabia",
                   "South Africa", "Turkey", "United Kingdom", "United States")
# keep the G20 rows for each of the two endpoint years
g20.col1952 <- gapminder |>
  filter(year == 1952, country %in% g20.countries)
g20.col2007 <- gapminder |>
  filter(year == 2007, country %in% g20.countries)
# stack the two years into one data frame
g20.col <- rbind(g20.col1952, g20.col2007)
# plotting
a <- ggplot(data = g20.col,
aes(x = pop,
y = reorder(country,
lifeExp)))+
geom_line(aes(group = country))+
geom_point(aes(color = as.factor(year)))+
ylab("")+
xlab("Population\n ")+
scale_color_discrete(guide = "none")+ # guide = FALSE is deprecated in ggplot2 >= 3.3.4
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank())
b <- ggplot(data = g20.col,
aes(x = gdpPercap,
y = reorder(country,
lifeExp)))+
geom_line(aes(group = country))+
geom_point(aes(color = as.factor(year)))+
ylab("")+
xlab("GDP Per Capita\n(USD)")+
scale_color_discrete(name = "Year")+
theme(axis.text.y = element_blank(),
axis.ticks.y = element_blank())
c <- ggplot(data = g20.col,
aes(x = lifeExp,
y = reorder(country,
lifeExp)))+
geom_line(aes(group = country))+
geom_point(aes(color = as.factor(year)))+
ylab("Country")+
xlab("Life Expectancy\n(Years)")+
scale_color_discrete(guide = "none") # guide = FALSE is deprecated in ggplot2 >= 3.3.4
grid.arrange(ncol = 3, c, a, b)
One evident result from this plot is that most G20 countries, with the exception of South Africa, have substantially increased in life expectancy and GDP per capita. However, on this absolute scale, only China and India stand out as having substantially increased in population.
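The visual impression can be backed by a small table of changes per country. The sketch below is one possible way to compute it in base R and is not part of the original analysis; the names `g20_names`, `g20_ends`, `wide`, and `change` are ours, and expressing population as a growth factor (2007 divided by 1952) is our assumption about a useful summary, since it puts countries of very different size on a comparable footing.

```r
library(gapminder)

# Same 18 G20 members used in the plots above (g20_names is an illustrative name)
g20_names <- c("Argentina", "Australia", "Brazil", "Canada", "China",
               "France", "Germany", "India", "Indonesia", "Italy",
               "Japan", "Mexico", "Korea, Rep.", "Saudi Arabia",
               "South Africa", "Turkey", "United Kingdom", "United States")

# Keep only G20 rows for the two endpoint years
g20_ends <- as.data.frame(
  subset(gapminder, country %in% g20_names & year %in% c(1952, 2007))
)

# Put 1952 and 2007 side by side, one row per country
wide <- merge(subset(g20_ends, year == 1952),
              subset(g20_ends, year == 2007),
              by = "country", suffixes = c("_1952", "_2007"))

# Change between the two years for each variable
change <- data.frame(
  country   = wide$country,
  lifeExp   = wide$lifeExp_2007 - wide$lifeExp_1952,      # years gained
  gdpPercap = wide$gdpPercap_2007 - wide$gdpPercap_1952,  # difference in GDP per capita
  pop_ratio = wide$pop_2007 / wide$pop_1952               # population growth factor
)

# Sort by population growth factor, largest first
change[order(-change$pop_ratio), ]
```

Absolute differences match what the dot plot displays, whereas the growth factor can rank countries differently from what the absolute population axis suggests.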
# data manipulation for plotting
gapminder2007 <- subset(gapminder,
gapminder$year == c(2007)) |>
select(-c("year")) |>
remove_rownames() |>
column_to_rownames(var = "country")
g20.2007 <- gapminder2007[c("Argentina",
"Australia",
"Brazil",
"Canada",
"China",
"France",
"Germany",
"India",
"Indonesia",
"Italy",
"Japan",
"Mexico",
"Korea, Rep.",
#"Russia", # Russia is not in the data
"Saudi Arabia",
"South Africa",
"Turkey",
"United Kingdom",
"United States"),
c("continent", "lifeExp", "pop", "gdpPercap")]
gapminder1952 <- subset(gapminder,
gapminder$year == c(1952)) |>
select(-c("year")) |>
remove_rownames() |>
column_to_rownames(var = "country")
g20.1952 <- gapminder1952[c("Argentina",
"Australia",
"Brazil",
"Canada",
"China",
"France",
"Germany",
"India",
"Indonesia",
"Italy",
"Japan",
"Mexico",
"Korea, Rep.",
#"Russia",
"Saudi Arabia",
"South Africa",
"Turkey",
"United Kingdom",
"United States"),
c("continent", "lifeExp", "pop", "gdpPercap")]
g20.1952.y <- g20.1952
g20.1952.y$year <- rep(1952, nrow(g20.1952))
g20.2007.y <- g20.2007
g20.2007.y$year <- rep(2007, nrow(g20.2007))
g20 <- rbind(g20.1952.y, g20.2007.y)
# plotting
ggplot(data = g20,
aes(x = gdpPercap,
y = lifeExp,
colour = as.factor(continent),
size = pop))+
geom_point(alpha = 0.8)+
scale_size_area(max_size = 20, name = "Population")+ # one size scale, so max_size is not overridden
scale_colour_discrete(name = "Continent")+
xlab("GDP Per Capita\n(USD)")+
ylab("Life Expectancy\n(Years)")+
ggtitle("Bubble Chart of GDP Per Capita by Life Expectancy")+
facet_wrap(year ~ .)
In contrast to the previous plot, this plot shows how the variables relate to each other. It suggests that life expectancy tends to be higher where GDP per capita is higher (South Africa being the exception). This pattern is plausible: as more capital becomes available per person, people tend to live longer.
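The strength of this relationship can be summarized with a correlation coefficient for each year. The sketch below is our addition rather than part of the original analysis: it uses the Spearman rank correlation, which we chose because it does not assume a linear relationship, and the names `g20_names`, `g20_ends`, and `rho_by_year` are ours.

```r
library(gapminder)

# Same 18 G20 members used in the bubble chart (g20_names is an illustrative name)
g20_names <- c("Argentina", "Australia", "Brazil", "Canada", "China",
               "France", "Germany", "India", "Indonesia", "Italy",
               "Japan", "Mexico", "Korea, Rep.", "Saudi Arabia",
               "South Africa", "Turkey", "United Kingdom", "United States")

# Keep only G20 rows for the two endpoint years
g20_ends <- as.data.frame(
  subset(gapminder, country %in% g20_names & year %in% c(1952, 2007))
)

# Spearman rank correlation between GDP per capita and life expectancy, per year
rho_by_year <- sapply(split(g20_ends, g20_ends$year), function(d) {
  cor(d$gdpPercap, d$lifeExp, method = "spearman")
})
rho_by_year
```

With only 18 countries per year, these coefficients are descriptive summaries and should not be read as evidence of a causal effect.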
pca2007 <- prcomp(g20.2007[, 2:ncol(g20.2007)],
scale. = TRUE)
p <- autoplot(pca2007,
data = g20.2007,
label = TRUE,
label.size = 3,
shape = FALSE,
colour = "continent",
loadings = TRUE,
loadings.label = TRUE,
loadings.colour = "black",
loadings.label.colour = "black")
pca1952 <- prcomp(g20.1952[, 2:ncol(g20.1952)],
scale. = TRUE)
q <- autoplot(pca1952,
data = g20.1952,
label = TRUE,
label.size = 3,
shape = FALSE,
colour = "continent",
loadings = TRUE,
loadings.label = TRUE,
loadings.colour = "black",
loadings.label.colour = "black")
f <- subplot(q, p)
f |> layout(showlegend = FALSE)
Note. The left plot depicts the PCA results for 1952 and the right for 2007. Taking the European countries, Canada, the U.S., and China as our examples, we can see in 1952 that (i) the European countries and Canada load highly on life expectancy and GDP per capita; (ii) the U.S. loads highly on population, life expectancy, and GDP; and (iii) China loads highly on population but not on GDP and life expectancy. In 2007, we can see that (i) the European countries and Canada still load highly on life expectancy and GDP and have migrated closer to the U.S.; and (ii) China, though still loading primarily on population, now also loads on life expectancy. Compared with the bubble chart, the biplot reveals more vividly the dimensions on which the countries differ from one another and over time.
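The numbers behind a biplot can also be inspected directly. The sketch below repeats the 2007 PCA in base R, assuming the same three standardized variables; the names `g20_names`, `g20_2007`, `vars`, and `pca` are ours. In `prcomp()` output, the variables have loadings (`rotation`) and the countries have scores (`x`), so a country that "loads highly" on a variable in the text above is one whose score lies far along that variable's loading arrow.

```r
library(gapminder)

# Same 18 G20 members used in the biplots (g20_names is an illustrative name)
g20_names <- c("Argentina", "Australia", "Brazil", "Canada", "China",
               "France", "Germany", "India", "Indonesia", "Italy",
               "Japan", "Mexico", "Korea, Rep.", "Saudi Arabia",
               "South Africa", "Turkey", "United Kingdom", "United States")

g20_2007 <- as.data.frame(subset(gapminder, country %in% g20_names & year == 2007))

# PCA on the three numeric variables, standardized to unit variance
vars <- g20_2007[, c("lifeExp", "pop", "gdpPercap")]
rownames(vars) <- as.character(g20_2007$country)
pca <- prcomp(vars, scale. = TRUE)

# Proportion of variance explained by each component
summary(pca)$importance["Proportion of Variance", ]

# Loadings: weight of each variable on each component
pca$rotation

# Scores: position of each country on PC1 and PC2
head(pca$x[, 1:2])
```

The sign of a principal component is arbitrary, so only the relative directions of loadings and scores carry meaning when comparing the 1952 and 2007 panels.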