Module 5: Inferential Statistics
Workshop 5: Learning Inferential Statistics!
Open this Project on RStudio.Cloud!
In this workshop, you will work with three datasets from Japanese prefectural elections, applying our new inferential statistics to interpret the relationship between key traits of candidates and elections. At each step, please be thinking about what direction (positive/negative) and strength of association these statistics indicate. Please load the packages below to begin!
0. Packages
Please load the following packages:
library(tidyverse) # contains dplyr, ggplot2, readr
# - one package to rule them all!
library(viridis)
library(infer) # our new inferential statistics package
Task 1: Difference of Means
Import your dataset, which contains a list of electoral districts in specific elections from 2000 to 2017. Filter it to just the years 2010 and 2011. Below, you will learn to calculate the difference in mean voter turnout between these years.
# Import and filter your data
myturnout <- read_csv("japan_turnout.csv") %>%
  filter(election_year %in% c(2010, 2011))
# View the first 6 rows with the head() function
myturnout %>%
  head()
Calculate the mean turnout rate for each year, using group_by() and summarize().
myturnout %>%
  group_by(election_year) %>%
  summarize(mean = mean(turnout_rate))
Run your first difference of means test (a.k.a. a t-test)! Be sure to put the treatment group (2011) first, and the control group (2010) second. The statistic field contains your t-statistic, and the estimate contains your raw difference of means. (Don’t worry about the other columns for now.)
myturnout %>%
  t_test(formula = turnout_rate ~ election_year,
         order = c("2011", "2010"))
Learning Check 1:
How much did voter turnout increase on average in the 2011 elections compared to 2010? How different was average voter turnout in the 2012 elections compared to 2010? Note: you will need to re-import and re-filter your data.
# Be sure to tidy up afterwards
remove(myturnout)
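To see where the statistic and estimate columns from Task 1 come from, here is a minimal sketch that builds them by hand. The turnout numbers are made up for illustration (they are not from japan_turnout.csv), and the sketch assumes the default unequal-variance (Welch) form of the test, which is what base R's t.test() uses.

```r
# Hypothetical turnout rates (percent); NOT taken from japan_turnout.csv
toy <- data.frame(
  election_year = c(2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011),
  turnout_rate  = c(52, 55, 49, 58, 60, 57, 63, 61)
)

# Treatment group (2011) first, control group (2010) second
treated <- toy$turnout_rate[toy$election_year == 2011]
control <- toy$turnout_rate[toy$election_year == 2010]

# Raw difference of means (treatment minus control)
estimate <- mean(treated) - mean(control)

# Welch standard error: each group's variance divided by its group size
std_error <- sqrt(var(treated) / length(treated) +
                  var(control) / length(control))

# t-statistic: the difference of means measured in standard errors
statistic <- estimate / std_error

# Base R's t.test() for comparison (Welch by default)
check <- t.test(treated, control)

print(c(estimate = estimate, statistic = statistic))
```

Swapping the order of the two groups flips the sign of both numbers, which is why the order argument matters when you interpret the direction of the difference.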
Task 2: Crosstabulation and the Chi-squared statistic
Next, we’re going to investigate a dataset where each row represents a person running for office in a specific district race in a given year. Filter it to just the year 2011.
mycandidates <- read_csv("japan_incumbency.csv") %>%
  filter(election_year == 2011)
Calculate the total number of candidates by status and party, using group_by() and summarize(). Which set of categories overlap the most?
mycandidates %>%
  group_by(candidate_status, candidate_party) %>%
  summarize(count = n())
Now, calculate the chi-squared statistic and evaluate how much these categories differ from what we would expect if they were not related. Is the chi-squared statistic close to zero (1, 2, 3, 4), or much larger than zero (anything bigger)? What does this tell us about incumbency rates and party affiliation?
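Before running the test, it may help to see what "what we would expect if they were not related" means in numbers. The sketch below works through a made-up 2x2 table; the counts and the category labels (incumbent, challenger, party_a, party_b) are hypothetical and are not from japan_incumbency.csv.

```r
# Hypothetical counts; NOT taken from japan_incumbency.csv
observed <- matrix(
  c(30, 10,
    20, 40),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    candidate_status = c("incumbent", "challenger"),
    candidate_party  = c("party_a", "party_b")
  )
)

# Expected count in each cell if status and party were unrelated:
# (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-squared: squared gaps between observed and expected counts,
# each scaled by the expected count, then summed over all cells
chi_squared <- sum((observed - expected)^2 / expected)

# Base R check; correct = FALSE switches off the continuity
# correction that chisq.test() applies to 2x2 tables by default
check <- chisq.test(observed, correct = FALSE)

print(expected)
print(chi_squared)
```

In this toy table the observed counts sit far from the expected ones, so the statistic lands well above zero. The call that follows runs the same kind of test on the real candidate data.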
mycandidates %>%
  chisq_test(formula = candidate_status ~ candidate_party)
Learning Check 2:
Task 3: Correlation
Finally, let’s examine a similar dataset that narrows in on the individuals who ran for office each year. Did younger candidates win greater shares of the vote in 2011? Let’s find out. Please import your data and, you guessed it, filter it to elections in 2011.
myresults <- read_csv("japan_voteshare.csv") %>%
  filter(election_year == 2011)
Let’s use the correlation coefficient (Pearson’s r) to evaluate how closely correlated these two variables are. What do you find? Are they positively or negatively correlated, and how strong is the relationship? (Is it closer to -1, 0, or 1?)
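As a point of reference, here is a minimal sketch of what Pearson's r measures and how missing values are handled. The ages and vote shares are made up (they are not from japan_voteshare.csv), and vote share is written as a proportion purely as an assumption for the example.

```r
# Hypothetical ages and vote shares; NOT taken from japan_voteshare.csv
toy <- data.frame(
  candidate_age = c(35, 42, 50, 58, 63, NA),
  voteshare     = c(0.41, 0.38, 0.30, 0.27, 0.22, 0.35)
)

# Keep only rows where both variables are observed
complete <- toy[complete.cases(toy), ]

x <- complete$candidate_age
y <- complete$voteshare

# Pearson's r: covariance scaled by both standard deviations,
# which pins the result between -1 and 1
r_by_hand <- cov(x, y) / (sd(x) * sd(y))

# Same quantity from cor(), letting it drop the incomplete pair itself
r_builtin <- cor(toy$candidate_age, toy$voteshare,
                 use = "pairwise.complete.obs")

print(c(r_by_hand = r_by_hand, r_builtin = r_builtin))
```

In this toy data, older candidates have lower vote shares, so r comes out negative. The call that follows computes the same coefficient on the real 2011 results.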
myresults %>%
  summarize(cor = cor(candidate_age, voteshare,
                      use = "pairwise.complete.obs"))
# the argument use = "pairwise.complete.obs" is like the na.rm = TRUE for the cor() function.
Learning Check 3:
Learning Check 4: Synthesis Moment!
group_by() and summarize()? Try it out!
Learning Check 5: Synthesis Moment!
Challenge: Putting it all together!
Learning Check 6: Choose your own adventure
Great work! You’re done!