POLS 3000: Introduction to Political Research
Lab 2: Exploring the GapMinder Dataset
In this lab, we are going to apply the skills we learned in week 1 to a real-life dataset from the Swedish non-profit Gapminder.
Before we begin, we need to get organized:
- Save the lastname-lab-2.r file from Canvas (Lab 2 on the Modules page) in the labs subfolder of pols-3000 on your computer or cloud storage (e.g., Dropbox or OneDrive).
- Download the gapminder-data.csv data file and save it to the data subfolder of pols-3000.
NOTE: To receive full credit for this lab, please annotate each line of code with #. For example:
mean(x = my_numbers) # find the mean of the my_numbers vector
Step 1: Load R Packages
In the code below, I’ve demonstrated how to install and load the tidyverse package. In the space below, we’ll add code to install and load the gapminder package:
# install.packages("tidyverse") # only need to install the package once
# install.packages("knitr") # only need to install the package once
install.packages("gapminder") # install GapMinder package
library(tidyverse) # load the tidyverse package
library(knitr) # load the knitr package to create the table below
library(gapminder) # load the gapminder package
Step 2: Load GapMinder Dataset
Next, we need to load the dataset into RStudio. For this lab, we will load the GapMinder dataset from a .csv file (text file) saved on our computer.
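Note: Windows users will need a file path like the following in the read.csv function: C:\Users\austinknuppe\OneDrive\pols-3000\data. When you type it inside the quotation marks, write each backslash as / or as \\, because R treats a single \ as the start of an escape sequence.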
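As a rough sketch of how file paths can be written in R, the snippet below builds the path with file.path() and shows two ways of typing a Windows path inside an R string. The user name yourname is a placeholder I made up for illustration; substitute your own folder.

```r
# Build the path piece by piece; file.path() joins the parts with "/"
csv_path <- file.path("~", "pols-3000", "data", "gapminder-data.csv")
print(csv_path)

# On Windows, each backslash inside an R string must be doubled ...
# ("yourname" is a placeholder, not a real account)
win_path_escaped <- "C:\\Users\\yourname\\OneDrive\\pols-3000\\data\\gapminder-data.csv"
# ... or replaced with forward slashes, which R also accepts on Windows
win_path_forward <- "C:/Users/yourname/OneDrive/pols-3000/data/gapminder-data.csv"

# Both strings describe the same location once the slashes are normalized
same_location <- identical(
  gsub("\\", "/", win_path_escaped, fixed = TRUE),
  win_path_forward
)
print(same_location)

# Check that the file is where you expect before calling read.csv()
print(file.exists(path.expand(csv_path)))
```

If the last line prints FALSE, the file is probably in a different folder than the path says, so it is worth checking before running read.csv().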
gapminder_data <- read.csv("~/pols-3000/data/gapminder-data.csv") # load .csv file from this file path
Step 3: Exploratory Data Analysis
Let’s begin by looking at the basic structure of the dataset. The variables in this data are described below:
| Name | Description |
|---|---|
| country | name of the country |
| continent | name of the country’s continent |
| year | year of the measurement, ranging from 1952 to 2007 in 5-year increments |
| lifeExp | life expectancy at birth, in years |
| pop | population |
| gdpPercap | GDP per capita (US dollars, inflation-adjusted) |
- How many rows and columns are in the dataset?
- What do the rows represent? What about the columns?
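If you are unsure how to count rows and columns, here is a minimal sketch on a small made-up data frame (the toy object and its values are invented for illustration and are not part of the GapMinder data). The same functions should work on any data frame.

```r
# A made-up data frame: 4 rows (observations) and 3 columns (variables)
toy <- data.frame(
  student = c("Ana", "Ben", "Cy", "Dee"),
  year    = c(2020, 2020, 2021, 2021),
  score   = c(88, 92, 79, 95)
)

toy_dim <- dim(toy) # rows first, then columns
print(toy_dim)

print(nrow(toy))  # number of rows only
print(ncol(toy))  # number of columns only
print(names(toy)) # the column (variable) names
```

In this toy example each row is one student in one year, and each column is one thing we measured about that student.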
Enter your response as a comment (#) in the code chunk below:
knitr::kable(head(gapminder_data)) # print out the first 6 rows (the head() default)
# There are XXX rows and XXX columns.
# The rows represent XXX and the columns represent XXX.
Class and Structure of Objects
Next, let’s look at the class and structure of each column in the dataset. For example, if we wanted to know the class of the column country, we would enter:
class(gapminder_data$country)
str(gapminder_data$country)
To select a particular variable from the data frame, you can use the $ operator. So gapminder_data$country will be a vector of just the country column of the gapminder_data data frame.
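To see how $, class(), and str() behave together, here is a small sketch on a made-up data frame. The toy object, its column names, and its values are all invented for illustration.

```r
# A made-up data frame with one text column and two number columns
toy <- data.frame(
  city       = c("A", "B", "C"),
  population = c(52000L, 115000L, 87000L), # the L suffix makes these integers
  avg_temp   = c(48.5, 52.1, 51.3),
  stringsAsFactors = FALSE                 # keep text as character, not factor
)

city_col <- toy$city # $ pulls one column out as a plain vector
print(city_col)

class(toy$city)       # "character"
class(toy$population) # "integer"
class(toy$avg_temp)   # "numeric"

str(toy$avg_temp) # str() shows the type, the length, and the first few values
```

class() gives a one-word answer, while str() adds a quick preview of what is stored in the column.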
In the code chunk below, enter the appropriate code to discover the class and structure of the year and lifeExp variables. Remember to annotate your code with #.
class()
str()
Basic Functions
Let’s continue our exploratory data analysis with some basic functions.
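As a warm-up, you can try these functions on a short made-up vector before using them on the real data. The ages vector below is invented for illustration.

```r
# A made-up numeric vector
ages <- c(23, 35, 41, 19, 62)

range(ages) # smallest and largest value together: 19 62
min(ages)   # 19
max(ages)   # 62

# summary() reports the min, quartiles, median, mean, and max in one call
ages_summary <- summary(ages)
print(ages_summary)

# A missing value (NA) makes min() and max() return NA unless na.rm = TRUE
ages_with_gap <- c(ages, NA)
max(ages_with_gap)               # NA
max(ages_with_gap, na.rm = TRUE) # 62
```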
First, use the range() function to discover the year range for gapminder_data$year.
range()
Next, use the min() and max() functions to discover the minimum and maximum life expectancy (lifeExp) for countries in the dataset:
min(gapminder_data$pop) # Here's the minimum value of `pop`
max(gapminder_data$pop) # Here's the maximum value of `pop`
min()
max()
Now use the summary() function to tell me the min, max, mean, and median for the gdpPercap variable.
summary()
Filtering and Subsetting
As was explained in the textbook, there are several different ways to subset or filter columns and rows from a dataset.
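As a quick reminder of the bracket syntax, the general pattern is data[rows, columns], and leaving one side blank means "all of them". Here is a minimal sketch on a made-up data frame (the toy object and its values are invented for illustration):

```r
# A made-up data frame
toy <- data.frame(
  name  = c("A", "B", "C", "D"),
  score = c(10, 20, 30, 40),
  group = c("x", "y", "x", "y"),
  stringsAsFactors = FALSE
)

# Rows 1 and 4, every column (blank after the comma = all columns)
first_and_last <- toy[c(1, 4), ]
print(first_and_last)

# Every row, two named columns (blank before the comma = all rows)
two_columns <- toy[, c("name", "score")]
print(two_columns)

# Rows that meet a condition, one named column (returns a vector)
high_scorers <- toy[toy$score > 15, "name"]
print(high_scorers)
```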
In the code chunk below, use comments to tell me what each of the following lines of code does:
gapminder_data[, "continent"] # enter comment here
gapminder_data[1, ] # enter comment here
gapminder_data[1:3, "pop"] # enter comment here
We can also use the filter() and group_by() functions to find values or statistics for subsets of the data. In plain language, the code chunk below says:
- Select the gapminder_data dataset
- Keep only the rows with Asia as the continent
- Group the remaining data by country
- Find the mean gdpPercap (GDP-per-capita) for each country
- Arrange the results from highest to lowest mean
- Print results for all 33 Asian countries in the console
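The same sequence of steps can be sketched on a tiny made-up data frame, which makes it easier to check each step by hand. The toy object, its column names, and its values are invented for illustration, and the sketch assumes the dplyr package (part of the tidyverse) is installed.

```r
library(dplyr) # provides %>%, filter(), group_by(), summarize(), arrange()

# A made-up data frame: incomes recorded for cities in three regions
toy <- data.frame(
  region = c("North", "North", "South", "South", "South", "East"),
  city   = c("A", "A", "B", "B", "C", "D"),
  income = c(10, 20, 30, 50, 5, 100),
  stringsAsFactors = FALSE
)

result <- toy %>%                                      # start with the data frame
  filter(region == "South") %>%                        # keep only the South rows
  group_by(city) %>%                                   # one group per city
  summarize(mean_income = mean(income, na.rm = TRUE)) %>% # one mean per group
  arrange(desc(mean_income))                           # highest mean first

print(result)
```

City B has two rows (30 and 50), so its mean is 40, while city C has one row with a mean of 5. The first row of the result is the highest mean and the last row is the lowest.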
In the comments below, tell me which country has the lowest and highest GDP-per-capita in Asia:
gapminder_data %>% # select data frame
  filter(continent == "Asia") %>% # filter for Asia observations
  group_by(country) %>% # group by country name
  summarize(mean = mean(gdpPercap, na.rm = TRUE)) %>% # find mean
  arrange(-mean) %>% # arrange output from highest to lowest values
  print(n = 33) # print all 33 results to the console
Step 4: Visualize Data Findings (Extra Credit)
Let’s visualize some key trends from the GapMinder data.
Boxplot of Average Life Expectancy
The following code chunk creates a boxplot of average life expectancy by continent:
ggplot(gapminder, mapping = aes(x = continent, y = lifeExp)) + # set x and y axes
  geom_boxplot() + # create a box plot
  labs(x = "Label X Axis Here", y = "Label Y Axis Here", # add labels/title
       title = "Add Your Own Title Here")
Which continent has the highest average life expectancy? What about the lowest?
What factors might best explain this variation?
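When reading the boxplot, it may help to recall what each box encodes. By default, geom_boxplot() draws the middle line at the median (not the mean), the box edges at the 25th and 75th percentiles, and separate points for values more than 1.5 times the interquartile range beyond the box. The sketch below computes those pieces by hand for a made-up vector (values is invented for illustration):

```r
# A made-up vector with one unusually large value
values <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)

box_median <- median(values)                          # the line inside the box
quartiles  <- quantile(values, probs = c(0.25, 0.75)) # the bottom and top of the box
box_iqr    <- IQR(values)                             # the height of the box

# Points beyond these fences are drawn as individual dots
lower_fence <- quartiles[["25%"]] - 1.5 * box_iqr
upper_fence <- quartiles[["75%"]] + 1.5 * box_iqr
outliers    <- values[values < lower_fence | values > upper_fence]

print(box_median)   # 5
print(mean(values)) # pulled upward by the single large value
print(outliers)     # 100
```

Notice that the median stays at 5 while the mean is much larger, which is why the two can tell different stories when a group has a few extreme values.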
Scatterplot of the Relationship between GDP-Per-Capita and Average Life Expectancy
The following code chunk creates a scatter plot of the relationship between GDP-per-capita and life expectancy:
ggplot(gapminder, mapping = aes(x = gdpPercap, y = lifeExp)) + # set x and y axes
  geom_point() + # add data points to the plot
  geom_smooth(method = "loess") + # add a trend line
  labs(x = "Label X Axis Here", y = "Label Y Axis Here", # add labels/title
       title = "Add Your Own Title Here")
In a short paragraph, answer the following three questions:
What does the scatter plot reveal about the relationship between GDP-per-capita and life expectancy?
Which level of GDP-per-capita corresponds to the highest life expectancy?
Why might the highest GDP-per-capita values (above $80,000) not produce the highest life expectancy?