Getting Started

We will be doing a modified version of Exploration 10.2 from the ISI textbook, using data on the 1970 lottery for the Vietnam draft to explore correlation coefficients.

Case Study

From 1969 to 1972, the US Selective Service held lotteries to determine which young men would be drafted into the armed forces. The lottery held in 1970 was for young men born in 1951, and those whose birthdays were assigned a draft number lower than 126 were drafted if they were classified as being available for military service. The intent was for it to be a fair and random lottery.

Learning goals

  • Work on our visualization skills in R
  • Learn how to calculate the correlation coefficient in R
  • Use a simulation-based approach for inference on the correlation coefficient

Packages and Data

We’ll use the tidyverse package for much of the data wrangling and visualization.

library(tidyverse) 

The data were downloaded from https://www.randomservices.org/random/data/Draft.html and are available to you on Canvas in the file Vietnam.csv.

The variables in the data set are:

  • Month - the calendar month, as a number (e.g., January = 1, February = 2, etc.)
  • Day - the sequential date, i.e., the day of the year counted from January 1
  • Number - the draft number

Instructions

  • Except for the code needed to load the tidyverse, delete all the text prior to Exercises
  • After Exercises the only text that should appear are the questions and your answers to them
  • Delete all the comments that have been left for you in the code chunks
  • Include all code in the final submission

Exercises

Exercise 1: Exploring the data

Because we are downloading the data from Canvas, we will need to load it into R. Remember to save the data set in the same folder as this RMarkdown file, or else change the working directory so that R knows where to find the data file. For more details on how this works, and some error messages that might pop up, take a look at Lab 1.
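
If read_csv() cannot find the file, it can help to check where R is looking before loading the data. The following is a minimal sketch in base R, assuming the file is named Vietnam.csv as described above; the variable names data_file and found are made up for illustration and are not part of the lab.

```r
# Sketch: confirm that R can see the data file before calling read_csv().
# "Vietnam.csv" is the file name given in this lab; data_file and found
# are hypothetical helper names used only in this example.
getwd()                            # prints the folder R is currently using

data_file <- "Vietnam.csv"
found <- file.exists(data_file)    # TRUE only if the file is in that folder

if (!found) {
  message("Could not find ", data_file, " in ", getwd())
}
```

If the message appears, either move the file into the folder shown by getwd() or change the working directory, as described in Lab 1.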

NOTE: Remember to delete eval=FALSE

draft <- read_csv("Vietnam.csv")

  1. In a perfectly fair, random lottery, what should be the value of the correlation coefficient between the variable sequential date (Day) and draft number (Number)?

NOTE: Remember to delete eval=FALSE

# Use ggplot and the Vietnam draft data (draft) to do the following:
#   1. Plot the draft number (Number) against the sequential date (Day)
#   2. Include informative titles and axis labels
ggplot( ,aes( )) +
  geom_point() +
  labs()   

  2. Based on the scatterplot of draft numbers and sequential dates, does there appear to be an association between the two variables? Does this appear to have been a fair, random lottery?

NOTE: Remember to delete eval=FALSE

#Complete the code below to:
#  1. Group by month (Month) and then 
#  2. Summarize the data by the median of the draft number (Number)
options(pillar.sigfig = 4)
draft %>%
  group_by() %>%
  summarize()

NOTE: The line options(pillar.sigfig = 4) in the code above tells the tibble to print 4 significant digits instead of the default 3, so the medians are not rounded in the output.

  3. Comment on any pattern or trend you see in the medians. Remember that the month of the year is coded as a number, so 1 = January, 2 = February, etc. Does what you see make you reconsider your answers to questions 1 and 2?

NOTE: Remember to delete eval=FALSE

# The function cor will calculate the correlation between two variables
# Run this code without making any changes to it
cor(draft$Month,draft$Number)

  4. What value do you find for the correlation coefficient, and is it consistent with the idea that this was a fair and random lottery?
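
To get a feel for how cor() behaves, here is a small sketch on made-up vectors (not the draft data). The names x, y_up, and y_down are invented for this illustration. It shows the two extreme values the correlation coefficient can take and that the order of the arguments does not matter.

```r
# Made-up vectors for illustration only (not the draft data)
x      <- 1:10
y_up   <- 2 * x + 1      # points fall exactly on an increasing line
y_down <- -3 * x + 50    # points fall exactly on a decreasing line

r_up   <- cor(x, y_up)     # equals 1 (up to rounding): perfect positive linear association
r_down <- cor(x, y_down)   # equals -1 (up to rounding): perfect negative linear association
r_swap <- cor(y_up, x)     # same as r_up: swapping the arguments changes nothing
```

Real data rarely fall exactly on a line, so observed correlations land strictly between these two extremes.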

Exercise 2: Simulation-Based Inference for the Correlation Coefficient

We want to use simulation to determine whether random chance is a plausible explanation for the observed correlation coefficient.

  1. What is our observed statistic?

  2. What are the steps you would take to simulate different values of the statistic?

  3. How would you use the simulated values to evaluate the strength of evidence?

  4. Which of the following are our null and alternative hypotheses? Delete the two incorrect answers.

  1. \(H_0: \rho = 0\); \(H_A: \rho \ne 0\)

  2. \(H_0: \rho = 0\); \(H_A: \rho > 0\)

  3. \(H_0: \rho = 0\); \(H_A: \rho < 0\)

# Run the code below 5 times to get 5 different values of the correlation coefficient
# You don't need to make any changes to this code - just run it
shuffle <- sample(draft$Number, 366, replace = FALSE)
cor(draft$Month, shuffle)

  5. Type your results from running the above code five times into the table below. Type each value between the two vertical lines (|) on the third row, beneath both the repetition number and the :---:.

| Repetition | 1 | 2 | 3 | 4 | 5 |
|:---:|:---:|:---:|:---:|:---:|:---:|
| r |  |  |  |  |  |

NOTE: Remember to delete eval=FALSE

#You don't need to make any changes to this code - just run it
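# Pre-allocate the vector that will hold the simulated correlation coefficients
r <- numeric(100000)
for (i in 1:100000) {
  # Shuffle the draft numbers, then recompute the correlation with Month
  shuffle <- sample(draft$Number, 366, replace = FALSE)
  r[i] <- cor(draft$Month, shuffle)
}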
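
Because sample() draws a new random shuffle each time it runs, the simulated values will change every time the document is knit. A minimal sketch of how set.seed() makes a shuffle repeatable is shown below, using made-up numbers rather than the draft data. The seed value 2024 and the names toy, first, and second are arbitrary choices for this illustration and are not part of the lab.

```r
# Made-up values for illustration only (not the draft data)
toy <- 1:10

set.seed(2024)                                        # arbitrary seed
first <- sample(toy, length(toy), replace = FALSE)    # one random shuffle

set.seed(2024)                                        # reset to the same seed
second <- sample(toy, length(toy), replace = FALSE)   # repeats the same shuffle

same <- identical(first, second)   # TRUE: same seed, same shuffle
```

Setting the seed once near the top of a document is usually enough. Resetting it immediately before every shuffle would produce the same shuffle each time, which would defeat the purpose of the simulation.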

NOTE: Remember to delete eval=FALSE

#Complete the code below by defining:
# 1. The data set of simulated correlation coefficients to plot
# 2. The aesthetic that should be plotted (i.e., the simulated correlation coefficients)
# 3. The geom needed to produce a histogram
r <- data.frame(r)
ggplot(  ,aes(  )) +
  geom_  ()    

  6. On what value is your simulated distribution centered? Why does this make sense?

NOTE: Remember to delete eval=FALSE

# Filter your simulated correlation coefficients for all values that are:
#  1. greater than or equal to the positive value of our observed correlation coefficient
#  2. less than or equal to the negative value of our observed correlation coefficient
r %>%
filter()
#Divide the number of observations from your output above by 100000 to get the p-value

  7. Based on your p-value, what conclusion do you draw regarding whether the 1970 draft was a fair, random process?
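
For reference, the overall logic of a shuffle-based test for a correlation coefficient can be sketched in base R on made-up data. This is only an illustration under stated assumptions, not the solution to the exercises: the vectors x and y, the seed, and the choice of 1000 shuffles are all invented here, and replicate() with mean() stands in for the loop and filter() steps used in the lab.

```r
# Made-up data for illustration only (not the draft data)
set.seed(1)                  # arbitrary seed so the sketch is repeatable
x <- 1:20                    # hypothetical explanatory values
y <- sample(1:20)            # hypothetical responses
obs <- cor(x, y)             # observed statistic for the toy data

n_sims <- 1000               # far fewer than the lab's 100000, for speed

# Shuffle y, recompute the correlation, and repeat n_sims times
sims <- replicate(n_sims, cor(x, sample(y)))

# Two-sided p-value: proportion of shuffled correlations that are
# at least as far from 0 as the observed one
p_value <- mean(abs(sims) >= abs(obs))
```

Comparing absolute values counts both tails at once, which matches filtering for values at or above the positive observed correlation or at or below its negative.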