We will be doing a modified version of Exploration 10.2 from the ISI textbook, using data on the 1970 lottery for the Vietnam draft to explore correlation coefficients.
From 1969 to 1972, the US Selective Service held lotteries to determine which young men would be drafted into the armed forces. The lottery held in 1970 was for young men born in 1951, and those whose birthdays were assigned a draft number lower than 126 were drafted if they were classified as available for military service. The intent was for it to be a fair and random lottery.
We’ll use the tidyverse package for much of the data wrangling and visualization.
library(tidyverse)
The data were downloaded from https://www.randomservices.org/random/data/Draft.html and are available to you on Canvas in the file Vietnam.csv.
The variables in the data set are:
Month - the calendar month, as a number (e.g., January = 1, February = 2, etc.)
Day - the sequential date
Number - the draft number

Because we are downloading the data from Canvas, we will need to load it into R. Remember to save the data set to the same folder where we have saved this RMarkdown file, or else to change the working directory so that R knows where to find the data file. For more details on how this works, and some error messages that might pop up, take a look at Lab 1.
NOTE: Remember to delete eval=FALSE
draft <- read_csv("Vietnam.csv")
What is the relationship between the sequential date (Day) and draft number (Number)?
NOTE: Remember to delete eval=FALSE
# Use ggplot and the data on the Vietnam draft (draft) to do the following:
# 1. Plot the draft number (Number) against the sequential date (Day)
# 2. Include informative titles and axis labels
ggplot( ,aes( )) +
  geom_point() +
  labs()
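One possible completion of the chunk above, assuming the Day and Number columns described earlier (the title and axis labels are placeholders you can adjust):

```r
# Scatterplot of draft number against sequential date
ggplot(draft, aes(x = Day, y = Number)) +
  geom_point() +
  labs(title = "Draft number by sequential date, 1970 Vietnam draft lottery",
       x = "Sequential date (Day)",
       y = "Draft number (Number)")
```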
NOTE: Remember to delete eval=FALSE
# Complete the code below to:
# 1. Group by month (Month) and then
# 2. Summarize the data by the median of the draft number (Number)
options(pillar.sigfig = 4)
draft %>%
  group_by() %>%
  summarize()
NOTE: The options(pillar.sigfig = 4) call in the chunk above tells R to display four significant digits in the tibble output.
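One possible way to fill in the chunk (the summary column name median_number is just an illustrative choice):

```r
# Median draft number within each month
draft %>%
  group_by(Month) %>%
  summarize(median_number = median(Number))
```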
NOTE: Remember to delete eval=FALSE
# The function cor will calculate the correlation between two variables
# Run this code without making any changes to it
cor(draft$Month,draft$Number)
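By default, cor() computes the Pearson correlation coefficient,

\[
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}},
\]

where here the \(x_i\) are the months and the \(y_i\) are the draft numbers.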
We want to use simulation to determine if random chance is a plausible explanation for the observed correlation coefficient.
What is our observed statistic?
What are the steps you would take to simulate different values of the statistic?
How would you use the simulated values to evaluate the strength of evidence?
Which of the following are our null and alternative hypotheses? Delete the two incorrect answers.
\(H_0: \rho = 0\); \(H_A: \rho \ne 0\)
\(H_0: \rho = 0\); \(H_A: \rho > 0\)
\(H_0: \rho = 0\); \(H_A: \rho < 0\)
# Run the code below 5 times to get 5 different values of the correlation coefficient
# You don't need to make any changes to this code - just run it
shuffle <- sample(draft$Number,366,replace=F)
cor(draft$Month,shuffle)
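As an optional alternative to re-running the chunk by hand, replicate() can generate the five shuffled correlations in one call:

```r
# Five correlations, each from an independent shuffle of the draft numbers
replicate(5, cor(draft$Month, sample(draft$Number, 366, replace = F)))
```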
Record each of your five correlation coefficients in the table below. To add a value, type it between the vertical bars (|) on the third row, beneath both the number of the repetition and the :---:.

| Repetition | 1 | 2 | 3 | 4 | 5 |
|:---:|:---:|:---:|:---:|:---:|:---:|
| r | | | | | |
NOTE: Remember to delete eval=FALSE
# You don't need to make any changes to this code - just run it
r <- numeric()
for(i in 1:100000){
  shuffle <- sample(draft$Number,366,replace=F)
  r[i] <- cor(draft$Month,shuffle)
}
NOTE: Remember to delete eval=FALSE
# Complete the code below by defining:
# 1. The data set of simulated correlation coefficients to plot
# 2. The aesthetic that should be plotted (i.e., the simulated correlation coefficients)
# 3. The geom needed to produce a histogram
r <- data.frame(r)
ggplot( ,aes( )) +
  geom_ ()
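One possible completion, using the data frame r created above (its single column is also named r):

```r
# Histogram of the 100,000 simulated correlation coefficients
ggplot(r, aes(x = r)) +
  geom_histogram()
```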
NOTE: Remember to delete eval=FALSE
# Filter your simulated correlation coefficients for all values that are:
# 1. greater than or equal to the positive value of our observed correlation coefficient
# 2. less than or equal to the negative value of our observed correlation coefficient
r %>%
  filter()
# Divide the number of observations from your output above by 100000 to get the p-value
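A sketch of one way to complete this last step, where r_obs and extreme are just illustrative names for the observed correlation computed earlier with cor() and for the filtered simulated values:

```r
# Observed correlation between month and draft number (r_obs is an illustrative name)
r_obs <- cor(draft$Month, draft$Number)

# Keep simulated correlations at least as extreme as the observed one, in either direction
extreme <- r %>%
  filter(r >= abs(r_obs) | r <= -abs(r_obs))

# Simulation-based two-sided p-value
nrow(extreme) / 100000
```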