Module 5: Inferential Statistics

Workshop 5: Learning Inferential Statistics!

Open this Project on RStudio.Cloud!

In this workshop, you will work with three datasets from Japanese prefectural elections, applying our new inferential statistics to interpret the relationships between key traits of candidates and elections. At each step, think about the direction (positive/negative) and strength of association that each statistic indicates. Please load the packages below to begin!

0. Packages

Please load the following packages:

library(tidyverse) # contains dplyr, ggplot2, readr
# - one package to rule them all!
library(viridis) # colorblind-friendly color palettes
library(infer) # our new inferential statistics package



Task 1: Difference of Means

Import your dataset, which contains a list of electoral districts in specific elections from 2000 to 2017. Filter it to just the years 2010 and 2011. Below, you will learn to calculate the difference in mean voter turnout between these years.

# Import and filter your data
myturnout <- read_csv("japan_turnout.csv") %>%
  filter(election_year %in% c(2010, 2011)) 

# View the first 6 rows with the head() function
myturnout %>%
  head()

Calculate the mean turnout rate for each year, using group_by() and summarize().

myturnout %>%
  group_by(election_year) %>%
  summarize(mean = mean(turnout_rate))

Run your first difference of means test (a.k.a. a t-test)! Be sure to put the treatment group (2011) first, and the control group (2010) second. The statistic column contains your t-statistic, and the estimate column contains your raw difference of means. (Don’t worry about the other columns for now.)

myturnout %>%
  t_test(formula = turnout_rate ~ election_year, 
         order = c("2011", "2010"))
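
To see where those two numbers come from, here is a minimal base-R sketch on made-up turnout rates (not values from japan_turnout.csv). It assumes that t_test() wraps base R's t.test(), whose default is Welch's unequal-variance version; the vectors treatment and control are hypothetical stand-ins for the 2011 and 2010 districts.

```r
# Toy turnout rates (made-up numbers, NOT from japan_turnout.csv)
treatment <- c(52, 55, 49, 58, 61)  # hypothetical stand-in for 2011 districts
control   <- c(48, 50, 47, 53, 51)  # hypothetical stand-in for 2010 districts

# Raw difference of means: treatment minus control
estimate <- mean(treatment) - mean(control)

# Welch standard error: each group's variance divided by its group size
std_error <- sqrt(var(treatment) / length(treatment) +
                  var(control) / length(control))

# t-statistic: the difference of means, measured in standard errors
t_stat <- estimate / std_error

# Cross-check against base R's t.test() (Welch by default)
check <- t.test(treatment, control)

print(estimate)
print(t_stat)
print(unname(check$statistic))
```

A t-statistic far from zero means the gap between the group means is large relative to the noise within the groups. Swapping the order of the two groups flips the sign of both the estimate and the statistic.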


Learning Check 1:

  • How much did voter turnout increase on average in the 2011 elections compared to 2010?
  • How different was voter turnout on average in the 2012 elections compared to 2010?
  • Note: you will need to re-import and re-filter your data.
# Be sure to tidy up afterwards
remove(myturnout)



Task 2: Crosstabulation and the Chi-squared statistic

Next, we’re going to investigate a dataset in which each row represents a person running for office in a specific district race in a given year. Filter it to just the year 2011.

mycandidates <- read_csv("japan_incumbency.csv") %>%
  filter(election_year == 2011)

Calculate the total number of candidates by status and party, using group_by() and summarize(). Which combination of status and party contains the most candidates?

mycandidates %>% 
  group_by(candidate_status, candidate_party) %>%
  summarize(count = n())

Now, calculate the chi-squared statistic and evaluate how much these categories differ from what we would expect if they were not related. Is the chi-squared statistic close to zero (say, 1 to 4), or much larger than zero (anything bigger)? What does this tell us about incumbency rates and party affiliation?

mycandidates %>%
  chisq_test(formula = candidate_status ~ candidate_party)
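
To make "what we would expect if they were not related" concrete, here is a minimal base-R sketch that builds the chi-squared statistic by hand. The counts and the party names "A" and "B" are made up for illustration and do not come from japan_incumbency.csv.

```r
# Toy counts (made-up, NOT from japan_incumbency.csv):
# rows = candidate status, columns = two hypothetical parties
observed <- matrix(c(30, 10,
                     20, 40),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(status = c("incumbent", "challenger"),
                                   party  = c("A", "B")))

# Expected counts if status and party were unrelated:
# (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Chi-squared: squared gaps between observed and expected counts,
# each scaled by the expected count, then summed over all cells
chi_sq <- sum((observed - expected)^2 / expected)

# Cross-check with base R; correct = FALSE switches off the continuity
# correction that chisq.test() applies to 2x2 tables by default
check <- chisq.test(observed, correct = FALSE)

print(expected)
print(chi_sq)
print(unname(check$statistic))
```

When the observed counts sit close to the expected ones, every term in the sum is small and the statistic stays near zero. Large gaps, like the ones in this toy table, push it well above zero.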


Learning Check 2:

Repeat this for the most recent available election, in 2017, and compare the results. Has the association between party and incumbency grown stronger or weaker since 2011? How do you know?



Task 3: Correlation

Finally, let’s examine a similar dataset that narrows in on the individuals who ran for office each year. Did younger candidates win greater shares of the vote in 2011? Let’s find out. Please import your data and, you guessed it, filter it to elections in 2011.

myresults <- read_csv("japan_voteshare.csv") %>%
  filter(election_year == 2011)

Let’s use the correlation coefficient (Pearson’s r) to evaluate how closely correlated these two variables are. What do you find? Are they positively or negatively correlated, and how strong is the relationship? (Is it closer to -1, 0, or 1?)

myresults %>%
  summarize(cor = cor(candidate_age, voteshare, 
                      use = "pairwise.complete.obs"))
# The argument use = "pairwise.complete.obs" works like na.rm = TRUE does in mean(): cor() drops rows with missing values.
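
As a sketch of what cor() is doing, the base-R example below computes Pearson's r by hand and shows the effect of the use argument. The age and share vectors are made-up toy values, not rows from japan_voteshare.csv.

```r
# Toy data (made-up, NOT from japan_voteshare.csv); note the NAs
age   <- c(35, 42, NA, 58, 63, 70)
share <- c(40, 38, 30, NA, 25, 20)

# Keep only the rows where both values are present
complete <- !is.na(age) & !is.na(share)
x <- age[complete]
y <- share[complete]

# Pearson's r: covariance divided by the product of the standard deviations
r_by_hand <- cov(x, y) / (sd(x) * sd(y))

# Same answer from cor(), which drops the incomplete pairs for us
r_builtin <- cor(age, share, use = "pairwise.complete.obs")

# Without `use`, a single missing value makes the whole result NA
r_default <- cor(age, share)

print(r_by_hand)
print(r_builtin)
print(r_default)
```

In this toy data, older candidates have lower vote shares, so r is negative. Values near -1 or 1 indicate a strong linear relationship, and values near 0 a weak one.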


Learning Check 3:

How does the correlation in 2011 compare with the correlation in 2012, when the LDP swept to power? What does this tell you about the new wave of elected officials in 2012?


Learning Check 4: Synthesis Moment!

But wait! Could we actually find out the correlations for every election year, using group_by() and summarize()? Try it out!



Learning Check 5: Synthesis Moment!

Please make a line graph using geom_line() to illustrate the changing correlation between age and voteshare over time. What change do you observe?



Challenge: Putting it all together!

Learning Check 6: Choose your own adventure

Together in your small group, please design and implement your own statistical test that best captures an interesting difference using one of the three datasets above. Then, design a visualization that highlights that association in some way. As always, please add labels (with labs()) and colors, and describe your findings in a brief paragraph.



Great work! You’re done!