POLS 3000: Introduction to Political Research

Lab 2: Exploring the GapMinder Dataset

Author

Austin Knuppe

Published

January 20, 2023

In this lab, we are going to apply the skills we learned in week 1 to a real-life dataset from Gapminder, a Swedish non-profit.

Before we begin, we need to get organized:

Save the lastname-lab-2.r file from Canvas (Lab 2 on the Modules page) in the labs subfolder of pols-3000 on your computer or cloud storage (e.g., Dropbox or OneDrive).
Download the gapminder-data.csv data file and save it to the data subfolder of pols-3000.

NOTE: To receive full credit for this lab, please annotate each line of code with a # comment. For example:

mean(x = my_numbers) # find the mean of the my_numbers vector

Step 1: Load R Packages

In the code below, I’ve demonstrated how to install and load the tidyverse package. In the space below, we’ll add code to install and load the gapminder package:

# install.packages("tidyverse") # only need to install the package once
# install.packages("knitr") # only need to install the package once
install.packages("gapminder") # install GapMinder package

library(tidyverse) # load the tidyverse package
library(knitr) # load the knitr package to create the table below
library(gapminder) # load GapMinder package
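
If you re-run your script often, one option is to install a package only when it is missing. The sketch below assumes a helper named install_if_missing, which is a hypothetical name and not part of the lab files or any package; requireNamespace() and install.packages() are standard R functions.

```r
# Hypothetical helper (not part of the lab files): install a package only if
# it is not already available on this computer.
install_if_missing <- function(pkg) {
  already_installed <- requireNamespace(pkg, quietly = TRUE) # TRUE if the package can be loaded
  if (!already_installed) {
    install.packages(pkg) # download and install the package from CRAN
  }
  invisible(already_installed) # report whether the package was already there
}

install_if_missing("stats") # "stats" ships with R, so nothing is installed here
```

In the lab script, you could call install_if_missing("gapminder") in place of the install.packages() line.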

Step 2: Load GapMinder Dataset

Next, we need to load the dataset into RStudio. For this lab, we will load the GapMinder dataset from a .csv file (text file) saved on our computer.

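Note: Windows users will need a full file path in the read.csv function, for example C:\Users\austinknuppe\OneDrive\pols-3000\data. Inside an R string, write the path with forward slashes (/) or doubled backslashes (\\), because a single backslash is treated as an escape character.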

gapminder_data <- read.csv("~/pols-3000/data/gapminder-data.csv")
# load .csv file from the following file path
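
File paths are a common source of errors in this step. The sketch below writes a tiny made-up CSV file to a temporary folder and reads it back, so it should run on any computer. The file name toy-data.csv and the values in it are placeholders for illustration and are not the lab data.

```r
# Build a path from its pieces; file.path() joins them with "/",
# which R accepts on Windows, macOS, and Linux.
toy_path <- file.path(tempdir(), "toy-data.csv") # placeholder file name

# Made-up values, for illustration only
toy <- data.frame(country = c("A", "B"), year = c(2000, 2005))
write.csv(toy, toy_path, row.names = FALSE) # save the toy data as a .csv file

file.exists(toy_path)          # check that the file is where we think it is
toy_data <- read.csv(toy_path) # load the .csv file from that path
toy_data                       # print the data frame
```

If read.csv() reports that it cannot open your file, running file.exists() on the same path is a quick way to check whether the path is the problem.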

Step 3: Exploratory Data Analysis

Let’s begin by looking at the basic structure of the dataset. The variables in this dataset are described below:

Name	Description
`country`	name of the country
`continent`	name of the country’s continent
`year`	year of the measurement, ranging from 1952 to 2007 in 5-year increments
`lifeExp`	life expectancy at birth, in years
`pop`	population
`gdpPercap`	GDP per capita (US dollars, inflation-adjusted)

How many rows and columns are in the dataset?
What do the rows represent? What about the columns?

Enter your response as a comment (#) in the code chunk below:

knitr::kable(head(gapminder_data)) # print out the first 6 rows (the default for head())

# There are XXX rows and XXX columns.

# The rows represent XXX and the columns represent XXX.
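
A few base R functions report the size and layout of a data frame. The sketch below uses a small made-up data frame (toy is a placeholder name, not the lab data) so you can see the pattern before applying it to gapminder_data.

```r
# Made-up data frame, for illustration only
toy <- data.frame(
  name  = c("A", "B", "C"),
  score = c(90, 85, 70)
)

nrow(toy)  # number of rows: 3
ncol(toy)  # number of columns: 2
dim(toy)   # rows and columns together: 3 2
names(toy) # column names: "name" "score"
```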

Class and Structure of Objects

Next, let’s look at the class and structure of each column in the dataset. For example, if we wanted to know the class of the column country, we would enter:

class(gapminder_data$country)
str(gapminder_data$country)

To select a particular variable from the data frame, you can use the $ operator. So gapminder_data$country will be a vector of just the country column of the gapminder_data data frame.
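
As a minimal sketch of the $ operator, the example below pulls single columns out of a small made-up data frame (toy is a placeholder name, not the lab data) and inspects them.

```r
# Made-up data frame, for illustration only
toy <- data.frame(
  name  = c("A", "B", "C"),
  score = c(90, 85, 70)
)

toy$name         # a vector holding just the name column
class(toy$name)  # the type of the name column (text is "character" in R 4.0 or later)
class(toy$score) # "numeric", because score holds numbers
str(toy$score)   # compact summary: the type plus the first few values
```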

In the code chunk below, enter the appropriate code to discover the class and structure of the year and lifeExp variables. Remember to annotate your code with #.

class()
str()

Basic Functions

Let’s continue our exploratory data analysis with some basic functions.

First, use the range() function to discover the year range for gapminder_data$year.

range()

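Next, use the min() and max() functions to discover the minimum and maximum life expectancy (lifeExp) for countries in the dataset. The first two lines show the pattern using the pop variable; complete the empty min() and max() calls for lifeExp: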

min(gapminder_data$pop) # Here's the minimum value of `pop`
max(gapminder_data$pop) # Here's the maximum value of `pop`

min()
max()

Now use the summary() function to tell me the min, max, mean, and median for the gdpPercap variable.

summary()
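
If you want to try the functions in this section before applying them to the lab data, the sketch below runs them on a short made-up vector (x is a placeholder name) where the answers are easy to check by hand.

```r
x <- c(3, 7, 1, 9, 5) # made-up numbers, for illustration only

range(x)   # smallest and largest value: 1 9
min(x)     # smallest value: 1
max(x)     # largest value: 9
summary(x) # Min. 1, 1st Qu. 3, Median 5, Mean 5, 3rd Qu. 7, Max. 9
```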

Filtering and Subsetting

As was explained in the textbook, there are several different ways to subset or filter columns and rows from a dataset.

In the code chunk below, use comments to tell me what each of the following lines does:

gapminder_data[, "continent"] # enter comment here
gapminder_data[1, ] # enter comment here
gapminder_data[1:3, "pop"] # enter comment here
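
Square brackets follow the pattern data[rows, columns], and leaving one side empty means "all of them". The sketch below shows that pattern on a small made-up data frame (toy is a placeholder name, not the lab data), using different selections from the ones you are asked to explain.

```r
# Made-up data frame, for illustration only
toy <- data.frame(
  name  = c("A", "B", "C"),
  score = c(90, 85, 70),
  year  = c(2001, 2002, 2003)
)

toy[2:3, c("name", "score")] # rows 2 and 3, the name and score columns
toy[toy$score > 80, ]        # every column, but only rows where score is above 80
toy[, c("name", "year")]     # every row, but only the name and year columns
```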

We can also use the filter() and group_by() functions to find values or statistics for subsets of the data. In plain language, the code chunk below says:

Select the gapminder_data dataset
Keep only the rows with Asia as the continent
Group remaining data by country
Find the mean gdpPercap (GDP-per-capita) for each country
Print results for all 33 Asian countries in the console

In the comments below, tell me which country has the lowest and highest GDP-per-capita in Asia:

gapminder_data %>% # select data frame
  filter(continent == "Asia") %>% # filter for Asia observations
  group_by(country) %>% # group by country name
  summarize(mean = mean(gdpPercap, na.rm = TRUE)) %>%  # find mean
  arrange(-mean) %>%  # arrange output from highest to lowest values
  print(n = 33) # print all 33 results to the console
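
To see each step of a pipeline like this on data small enough to check by hand, here is a sketch using a made-up data frame. The names toy, region, city, and income are placeholders for illustration, and desc() is the dplyr function for sorting in descending order.

```r
library(dplyr) # part of the tidyverse; provides filter(), group_by(), summarize()

# Made-up data, for illustration only
toy <- data.frame(
  region = c("East", "East", "East", "East", "West", "West"),
  city   = c("A", "A", "B", "B", "C", "C"),
  income = c(10, 20, 40, 60, 5, 15)
)

toy_means <- toy %>%                        # start with the toy data frame
  filter(region == "East") %>%              # keep only rows in the East region
  group_by(city) %>%                        # group the remaining rows by city
  summarize(mean_income = mean(income)) %>% # compute one mean per city
  arrange(desc(mean_income))                # sort from highest to lowest mean

toy_means # city B (mean 50) is listed first, then city A (mean 15)
```

Because the output is sorted from highest to lowest, the first and last rows give the highest and lowest group means.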

Step 4: Visualize Data Findings (Extra Credit)

Let’s visualize some key trends from the GapMinder data.

Boxplot of Average Life Expectancy

The following code chunk creates a boxplot of average life expectancy by continent:

ggplot(gapminder, mapping = aes(x = continent, y = lifeExp)) + # set x and y axes
  geom_boxplot() + # create a box plot
  labs(x = "Label X Axis Here", y = "Label Y Axis Here", # add labels/ title
       title = "Add Your Own Title Here")

Which continent has the highest median life expectancy (the line inside each box)? What about the lowest?
What factors might best explain this variation?

Scatterplot of Relationship between GDP-Per-Capita and Average Life Expectancy

The following code chunk creates a scatter plot of the relationship between GDP-per-capita and life expectancy:

ggplot(gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + # set x and y axes
  geom_point() + # add data points to the plot
  geom_smooth(method = "loess") + # add a trend line 
  labs(x = "Label X Axis Here", y = "Label Y Axis Here", # add labels/ title
       title = "Add Your Own Title Here")

In a short paragraph, answer the following three questions:

What does the scatter plot reveal about the relationship between GDP-per-capita and life expectancy?
Which level of GDP-per-capita corresponds to the highest life expectancy?
Why might the highest GDP-per-capita (above $80,000) not produce the highest life expectancy?