This is the fourth course in the Professional Certificate in Data Science program, a series of courses that prepares you to do data analysis in R, from simple computations to machine learning. Statistical inference and modeling are indispensable for analyzing data affected by chance, and thus essential for data scientists. In this course, you will learn these key concepts through a motivating case study on election forecasting.
This course will show you how inference and modeling can be applied to develop the statistical approaches that make polls an effective tool, and how to do this using R. You will learn the concepts necessary to define estimates and margins of error, and how to use them to make reasonably accurate predictions and to estimate the precision of your forecast.
Once you learn this, you will be able to understand two concepts that are ubiquitous in data science: confidence intervals and p-values.
Finally, to understand statements about the probability of a candidate winning, you will learn about Bayesian modeling. At the end of the course, we will put it all together to recreate a simplified version of an election forecast model and apply it to the 2016 US presidential election.
The textbook for the Data Science course series is freely available online.
Section 1: Parameters and Estimates
You will learn how to estimate population parameters.
Section 2: The Central Limit Theorem in Practice
You will apply the central limit theorem to assess how close a sample estimate is to the population parameter of interest.
Section 3: Confidence Intervals and p-Values
You will learn how to calculate confidence intervals and learn about the relationship between confidence intervals and p-values.
Section 4: Statistical Models
You will learn about statistical models in the context of election forecasting.
Section 5: Bayesian Statistics
You will learn about Bayesian statistics through looking at examples from rare disease diagnosis and baseball.
Section 6: Election Forecasting
You will learn about election forecasting, building on what you’ve learned in the previous sections about statistical modeling and Bayesian statistics.
Section 7: Association Tests
You will learn how to use association and chi-squared tests to perform inference for binary, categorical, and ordinal data through an example looking at research funding rates.
Section 1 introduces you to parameters and estimates.
After completing Section 1, you will be able to:
The textbook for this section is available here
What is the expected value of this random variable \(\ S\)?
Possible Answers
A. \(\ E(S)=25(1−p)\)
B. \(\ E(S)=25p\)
C. \(\ E(S)=\sqrt{25 p (1-p)}\)
D. \(\ E(S)=p\)
What is the standard error of S?
Possible Answers
A. \(\ SE(S)=25p(1−p)\)
B. \(\ SE(S)=\sqrt{25p}\)
C. \(\ SE(S)=25(1−p)\)
D. \(\ SE(S)=\sqrt{25 p (1-p)}\)
What is the expected value of \(\ \bar{X}\)?
Possible Answers
A. \(\ E(\bar{X})=p\)
B. \(\ E(\bar{X})=Np\)
C. \(\ E(\bar{X})=N(1−p)\)
D. \(\ E(\bar{X})=1−p\)
The variable \(\ N\) represents the sample size and \(\ p\) is the proportion of Democrats in the population.
Possible Answers
A. \(\ SE(\bar{X})=\sqrt{Np(1−p)}\)
B. \(\ SE(\bar{X})=\sqrt{p(1−p)/N}\)
C. \(\ SE(\bar{X})=\sqrt{p(1−p)}\)
D. \(\ SE(\bar{X})=\sqrt{N}\)
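The properties of the sample average asked about above can be checked with a small Monte Carlo simulation. The sketch below is only an illustration: the values p = 0.45, N = 25 and B = 10000 are assumptions chosen for the example, and the theoretical standard error uses the same formula that appears in the exercise code that follows, sqrt(p * (1 - p) / N).

```r
# Illustrative values (assumptions, not part of the exercise)
set.seed(1)
p <- 0.45   # assumed proportion of Democrats
N <- 25     # number of people polled
B <- 10000  # number of simulated polls

# Each replication draws N voters (1 = Democrat, 0 = Republican)
# and records the sample average
x_bar <- replicate(B, mean(sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - p, p))))

# Compare the simulated mean and spread of the sample average with theory
c(simulated_mean = mean(x_bar), theory_mean = p)
c(simulated_se = sd(x_bar), theory_se = sqrt(p * (1 - p) / N))
```

The simulated and theoretical values should agree closely, though not exactly, because the simulation itself is subject to chance.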
Calculate the standard error se of a sample average when you poll 25 people in the population. Generate a sequence of 100 proportions of Democrats p that vary from 0 (no Democrats) to 1 (all Democrats). Plot se versus p for the 100 different proportions.
Instructions
Use the seq function to generate a vector of 100 values of p that range from 0 to 1.
Use the sqrt function to generate a vector of standard errors for all values of p.
Use the plot function to generate a plot with p on the x-axis and se on the y-axis.
# `N` represents the number of people polled
N <- 25
# Create a variable `p` that contains 100 proportions ranging from 0 to 1 using the `seq` function
p <- seq(0,1, length = 100)
# Create a variable `se` that contains the standard error of each sample average
se <- sqrt(p * (1 - p)/N)
# Plot `p` on the x-axis and `se` on the y-axis
plot(p, se)
Plot p versus se when the sample sizes equal N=25, N=100, and N=1000.
Instructions
Use the sqrt function to generate a vector of standard errors se for all values of p.
Use the plot function to generate a plot with p on the x-axis and se on the y-axis.
Use the ylim argument to keep the y-axis limits constant across all three plots. The lower limit should be equal to 0 and the upper limit should equal the highest calculated standard error across all values of p and N.
# The vector `p` contains 100 proportions of Democrats ranging from 0 to 1 using the `seq` function
p <- seq(0, 1, length = 100)
# The vector `sample_sizes` contains the three sample sizes
sample_sizes <- c(25, 100, 1000)
# Write a for-loop that calculates the standard error `se` for every value of `p` for each of the three samples sizes `N` in the vector `sample_sizes`. Plot the three graphs, using the `ylim` argument to standardize the y-axis across all three plots.
for (N in sample_sizes) {
  se <- sqrt(p * (1 - p) / N)
  plot(p, se, ylim = c(0, 0.5 / sqrt(25)))
}
Which derivation correctly uses the rules we learned about sums of random variables and scaled random variables to derive the expected value of \(\ d\)?
Possible Answers
A. \(\ E[\bar{X}−(1−\bar{X})]=E[2\bar{X}−1] =2E[\bar{X}]−1 = N(2p−1) = Np−N(1−p)\)
B. \(\ E[\bar{X}−(1−\bar{X})]=E[\bar{X}−1] =E[\bar{X}]−1 =p−1\)
C. \(\ E[\bar{X}−(1−\bar{X})]=E[2\bar{X}−1] =2E[\bar{X}]−1 =2\sqrt{p(1−p)}−1 =p−(1−p)\)
D. \(\ E[\bar{X}−(1−\bar{X})]=E[2\bar{X}−1] =2E[\bar{X}]−1 =2p−1 =p−(1−p)\)
Which derivation correctly uses the rules we learned about sums of random variables and scaled random variables to derive the standard error of \(\ d\)?
Possible Answers
A. \(\ SE[\bar{X}−(1−\bar{X})]=SE[2\bar{X}−1] =2SE[\bar{X}] =2\sqrt{p/N}\)
B. \(\ SE[\bar{X}−(1−\bar{X})]=SE[2\bar{X}−1]=2SE[\bar{X}−1]=2\sqrt{p(1−p)/N}−1\)
C. \(\ SE[\bar{X}−(1−\bar{X})]=SE[2\bar{X}−1] =2SE[\bar{X}] =2\sqrt{p(1−p)/N}\)
D. \(\ SE[\bar{X}−(1−\bar{X})]=SE[\bar{X}−1] =SE[\bar{X}] =\sqrt{p(1−p)/N}\)
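One way to sanity-check a candidate derivation for the spread is to simulate it. The sketch below is a rough illustration under assumed values (p = 0.45, N = 25, B = 10000): it simulates the estimate \(\ 2\bar{X}−1\) many times and compares its average and standard deviation with \(\ 2p−1\) and with the expression 2*sqrt(p*(1-p)/N) used in the exercise code that follows.

```r
# Illustrative values (assumptions chosen for this sketch)
set.seed(1)
p <- 0.45   # assumed proportion of Democrats
N <- 25     # number of people polled
B <- 10000  # number of simulated polls

# Simulate the spread estimate 2 * X_bar - 1 for each poll
d_hat <- replicate(B, {
  X <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - p, p))
  2 * mean(X) - 1
})

# Compare simulation with the candidate formulas
c(simulated_mean = mean(d_hat), theory_mean = 2 * p - 1)
c(simulated_se = sd(d_hat), theory_se = 2 * sqrt(p * (1 - p) / N))
```

Subtracting the constant 1 shifts every simulated value by the same amount, so it changes the average but not the variability.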
Use the sqrt function to calculate the standard error of the spread \(\ 2\bar{X}−1\).
# `N` represents the number of people polled
N <- 25
# `p` represents the proportion of Democratic voters
p <- 0.45
# Calculate the standard error of the spread. Print this value to the console.
2*sqrt(p*(1-p)/N)
## [1] 0.1989975
A. This sample size is sufficient because the expected value of our estimate \(\ 2\bar{X}−1\) is d so our prediction will be right on.
B. This sample size is too small because the standard error is larger than the spread.
C. This sample size is sufficient because the standard error of about 0.2 is much smaller than the spread of 10%.
D. Without knowing p, we have no way of knowing that increasing our sample size would actually improve our standard error.
In Section 2, you will look at the Central Limit Theorem in practice.
After completing Section 2, you will be able to:
The textbook for this section is available here
Write a function called take_sample that takes the proportion of Democrats p and the sample size N as arguments and returns the sample average of Democrats (1) and Republicans (0). Calculate the sample average if the proportion of Democrats equals 0.45 and the sample size is 100.
Instructions
Define a function called take_sample that takes p and N as arguments.
Use the sample function as the first statement in your function to sample N elements from a vector of options where Democrats are assigned the value ‘1’ and Republicans are assigned the value ‘0’.
Use the mean function as the second statement in your function to find the average value of the random sample.
# Write a function called `take_sample` that takes `p` and `N` as arguments and returns the average value of a randomly sampled population.
take_sample <- function(p, N){
  X <- sample(c(0,1), size = N, replace = TRUE, prob = c(1 - p, p))
  mean(X)
}
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Define `p` as the proportion of Democrats in the population being polled
p <- 0.45
# Define `N` as the number of people polled
N <- 100
# Call the `take_sample` function to determine the sample average of `N` randomly selected people from a population containing a proportion of Democrats equal to `p`. Print this value to the console.
take_sample(p,N)
## [1] 0.46
Replicate the random sampling 10,000 times and calculate \(\ p−\bar{X}\) for each random sample. Save these differences as a vector called errors. Find the average of errors and plot a histogram of the distribution.
Instructions
The function take_sample that you defined in the previous exercise has already been run for you.
Use the replicate function to replicate subtracting the result of take_sample from the value of p 10,000 times.
Use the mean function to calculate the average of the differences between the sample average and actual value of p.
# Define `p` as the proportion of Democrats in the population being polled
p <- 0.45
# Define `N` as the number of people polled
N <- 100
# The variable `B` specifies the number of times we want the sample to be replicated
B <- 10000
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Create an object called `errors` that replicates subtracting the result of the `take_sample` function from `p` for `B` replications
errors <- replicate(B, p - take_sample(p, N))
# Calculate the mean of the errors. Print this value to the console.
mean(errors)
## [1] -4.9e-05
The errors object has already been loaded for you. Use the hist function to plot a histogram of the values contained in the vector errors. Which statement best describes the distribution of the errors?
Possible Answers
A. The errors are all about 0.05.
B. The errors are all about -0.05.
C. The errors are symmetrically distributed around 0.
D. The errors range from -1 to 1.
What is the average size of the error if we define the size by taking the absolute value \(\ |p−\bar{X}|\)?
Instructions
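Use the sample code to generate errors, a vector of \(\ p−\bar{X}\).
Calculate the absolute value of errors using the abs function.
Calculate the average of these values using the mean function.
# Define `p` as the proportion of Democrats in the population being polled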
p <- 0.45
# Define `N` as the number of people polled
N <- 100
# The variable `B` specifies the number of times we want the sample to be replicated
B <- 10000
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# We generated `errors` by subtracting the estimate from the actual proportion of Democratic voters
errors <- replicate(B, p - take_sample(p, N))
# Calculate the mean of the absolute value of each simulated error. Print this value to the console.
mean(abs(errors))
## [1] 0.039267
errors rather than the average of the absolute values. As we have discussed, the standard error is the square root of the average squared distance \(\ (\bar{X}−p)^2\). The standard deviation is defined as the square root of the average squared distance.
Calculate the standard deviation of the spread.
Instructions
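Use the sample code to generate errors, a vector of \(\ p−\bar{X}\).
Use ^2 to square the distances.
Calculate the average squared distance using the mean function.
Calculate the square root of this value using the sqrt function.
# Define `p` as the proportion of Democrats in the population being polled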
p <- 0.45
# Define `N` as the number of people polled
N <- 100
# The variable `B` specifies the number of times we want the sample to be replicated
B <- 10000
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# We generated `errors` by subtracting the estimate from the actual proportion of Democratic voters
errors <- replicate(B, p - take_sample(p, N))
# Calculate the standard deviation of `errors`
sqrt(mean(errors^2))
## [1] 0.04949939
Estimate the standard error given an expected value of 0.45 and a sample size of 100.
Instructions
Calculate the standard error using the sqrt function
# Define `p` as the expected value equal to 0.45
p <- 0.45
# Define `N` as the sample size
N <- 100
# Calculate the standard error
sqrt(p*(1-p)/N)
## [1] 0.04974937
\(\ \hat{SE}(\bar{X})\)
Instructions
Simulate a poll X using the sample function.
When using the sample function, create a vector using c() that contains all possible polling options where ‘1’ indicates a Democratic voter and ‘0’ indicates a Republican voter.
When using the sample function, use replace = TRUE within the sample function to indicate that sampling from the vector should occur with replacement.
When using the sample function, use prob = within the sample function to indicate the probabilities of selecting either element (0 or 1) within the vector of possibilities.
Use the mean function to calculate the average of the simulated poll, X_bar.
Calculate the standard error of X_bar using the sqrt function and print the result.
# Define `p` as a proportion of Democratic voters to simulate
p <- 0.45
# Define `N` as the sample size
N <- 100
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Define `X` as a random sample of `N` voters with a probability of picking a Democrat ('1') equal to `p`
X <- sample(0:1, N, replace = TRUE, prob = c(1 - p, p))
# Define `X_bar` as the average sampled proportion
X_bar <- mean(X)
# Calculate the standard error of the estimate. Print the result to the console.
sqrt(X_bar*(1-X_bar)/N)
## [1] 0.04983974
Create a plot of the largest standard error for \(\ N\) ranging from 100 to 5,000. Based on this plot, how large does the sample size have to be to have a standard error of about 1%?
Possible Answers
A. 100
B. 500
C. 2,500
D. 4,000
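A minimal sketch of the plot described above follows. It assumes that the "largest" standard error is the one at p = 0.5, where p(1 - p) reaches its maximum; the grid of 100 sample sizes and the dashed reference line at 0.01 are choices made for this illustration, not part of the original exercise.

```r
# Sample sizes from 100 to 5,000 (grid length is an arbitrary choice)
N <- seq(100, 5000, length.out = 100)

# p = 0.5 maximizes p * (1 - p), so it gives the largest standard error
p <- 0.5
se <- sqrt(p * (1 - p) / N)

# Plot the largest standard error against sample size
plot(N, se, type = "l", xlab = "Sample size N", ylab = "Largest standard error")
abline(h = 0.01, lty = 2)  # reference line at a standard error of 1%

# Solve sqrt(p * (1 - p) / N) = 0.01 for N
N_needed <- p * (1 - p) / 0.01^2
N_needed
```

Reading where the curve crosses the dashed line gives the same sample size as solving the equation directly.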
Possible Answers
A. practically equal to \(\ p\).
B. approximately normal with expected value \(\ p\) and standard error \(\ \sqrt{p(1−p)/N}\).
C. approximately normal with expected value \(\ \bar{X}\) and standard error \(\ \sqrt{\bar{X}(1−\bar{X})/N}\).
D. not a random variable.
Recall the vector errors that contained, for each simulated sample, the difference between the actual value p and our estimate \(\ \hat{X}\). The errors \(\ \bar{X}−p\) are:
Possible Answers
A. practically equal to 0.
B. approximately normal with expected value 0 and standard error \(\ \sqrt{p(1−p)/N}\).
C. approximately normal with expected value p and standard error \(\ \sqrt{p(1−p)/N}\).
D. not a random variable.
Make a qq-plot of the errors you generated previously to see if they follow a normal distribution.
Instructions
Use the qqnorm function to produce a qq-plot of the errors.
Use the qqline function to plot a line showing a normal distribution.
# Define `p` as the proportion of Democrats in the population being polled
p <- 0.45
# Define `N` as the number of people polled
N <- 100
# The variable `B` specifies the number of times we want the sample to be replicated
B <- 10000
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Generate `errors` by subtracting the estimate from the actual proportion of Democratic voters
errors <- replicate(B, p - take_sample(p, N))
# Generate a qq-plot of `errors` with a qq-line showing a normal distribution
qqnorm(errors)
qqline(errors)
Instructions
Use pnorm to define the probability that a value will be greater than 0.5.
# Define `p` as the proportion of Democrats in the population being polled
p <- 0.45
# Define `N` as the number of people polled
N <- 100
# Calculate the probability that the estimated proportion of Democrats in the population is greater than 0.5. Print this value to the console.
1 - pnorm(0.5, mean = p, sd = sqrt(p*(1-p)/N))
## [1] 0.1574393
What is the CLT approximation for the probability that your error is equal or larger than 0.01?
Instructions
Calculate the standard error of the sample average using the sqrt function.
Use pnorm twice to define the probabilities that a value will be less than 0.01 or -0.01.
# Define `N` as the number of people polled
N <- 100
# Define `X_hat` as the sample average
X_hat <- 0.51
# Define `se_hat` as the standard error of the sample average
se_hat <- sqrt(X_hat*(1-X_hat)/N)
# Calculate the probability that the error is 0.01 or larger
1 - pnorm(.01, 0, se_hat) + pnorm(-0.01, 0, se_hat)
## [1] 0.8414493
In Section 3, you will look at confidence intervals and p-values.
After completing Section 3, you will be able to:
The textbook for this section is available here
We will use all the national polls that ended within a few weeks before the election.
Assume there are only two candidates and construct a 95% confidence interval for the election night proportion p.
Instructions
Use filter to subset the data set for the poll data you want. Include polls that ended on or after October 31, 2016 (enddate). Only include polls that took place in the United States. Call this filtered object polls.
Use nrow to make sure you created a filtered object polls that contains the correct number of rows.
Extract the sample size N from the first poll in your subset object polls.
Convert the percentage of Clinton voters (rawpoll_clinton) from the first poll in polls to a proportion, X_hat. Print this value to the console.
Find the standard error of X_hat given N. Print this result to the console.
Calculate the 95% confidence interval of this estimate using the qnorm function.
Save the lower and upper confidence intervals as an object called ci. Save the lower confidence interval first.
# Load the libraries and the data
library(dslabs)
library(dplyr)
data(polls_us_election_2016)
# Generate an object `polls` that contains data filtered for polls that ended on or after October 31, 2016 in the United States
polls <- filter(polls_us_election_2016, enddate >= "2016-10-31" & state == "U.S.")
# How many rows does `polls` contain? Print this value to the console.
nrow(polls)
## [1] 70
# Assign the sample size of the first poll in `polls` to a variable called `N`. Print this value to the console.
N <- head(polls$samplesize,1)
N
## [1] 2220
# For the first poll in `polls`, assign the estimated percentage of Clinton voters to a variable called `X_hat`. Print this value to the console.
X_hat <- (head(polls$rawpoll_clinton,1)/100)
X_hat
## [1] 0.47
# Calculate the standard error of `X_hat` and save it to a variable called `se_hat`. Print this value to the console.
se_hat <- sqrt(X_hat*(1-X_hat)/N)
se_hat
## [1] 0.01059279
# Use `qnorm` to calculate the 95% confidence interval for the proportion of Clinton voters. Save the lower and then the upper confidence interval to a variable called `ci`.
qnorm(0.975)
## [1] 1.959964
ci <- c(X_hat - qnorm(0.975)*se_hat, X_hat + qnorm(0.975)*se_hat)
Create a new object called pollster_results that contains the pollster’s name, the end date of the poll, the proportion of voters who declared a vote for Clinton, the standard error of this estimate, and the lower and upper bounds of the confidence interval for the estimate.
Instructions
Use the mutate function to define four new columns: X_hat, se_hat, lower, and upper. Temporarily add these columns to the polls object that has already been loaded for you.
In the X_hat column, convert the raw poll results for Clinton to a proportion.
In the se_hat column, calculate the standard error of X_hat for each poll using the sqrt function.
In the lower column, calculate the lower bound of the 95% confidence interval using the qnorm function.
In the upper column, calculate the upper bound of the 95% confidence interval using the qnorm function.
Use the select function to select the columns from polls to save to the new object pollster_results.
# The `polls` object that filtered all the data by date and nation has already been loaded. Examine it using the `head` function.
head(polls)
# Create a new object called `pollster_results` that contains columns for pollster name, end date, X_hat, se_hat, lower confidence interval, and upper confidence interval for each poll.
polls <- mutate(polls,
                X_hat = rawpoll_clinton / 100,
                se_hat = sqrt(X_hat * (1 - X_hat) / samplesize),
                lower = X_hat - qnorm(0.975) * se_hat,
                upper = X_hat + qnorm(0.975) * se_hat)
pollster_results <- select(polls, pollster, enddate, X_hat, se_hat, lower, upper)
Add a logical variable called hit to pollster_results that states if the confidence interval included the true proportion p=0.482 or not. What proportion of confidence intervals included p?
Instructions
Use the mutate function to define a new variable called ‘hit’.
Use logical expressions to determine if each set of lower and upper span the actual proportion.
Use the mean function to determine the average value in hit and summarize the results using summarize.
Save the result as an object called avg_hit.
# The `pollster_results` object has already been loaded. Examine it using the `head` function.
head(pollster_results)
# Add a logical variable called `hit` that indicates whether the actual value exists within the confidence interval of each poll. Summarize the average `hit` result to determine the proportion of polls with confidence intervals that include the actual value. Save the result as an object called `avg_hit`.
avg_hit <- pollster_results %>% mutate(hit=(lower<0.482 & upper>0.482)) %>% summarize(mean(hit))
avg_hit
Possible Answers
A. 0.05
B. 0.31
C. 0.50
D. 0.95
In this case, it is more informative to estimate the spread or the difference between the proportion of two candidates d, or 0.482−0.461=0.021 for this election.
Assume that there are only two parties and that \(\ d=2p−1\). Construct a 95% confidence interval for the difference in proportions on election night.
Instructions
Use the mutate function to define a new variable called ‘d_hat’ in polls. The new variable subtracts the proportion of Trump voters from the proportion of Clinton voters.
Extract the sample size N from the first poll in your subset object polls.
Extract the difference in proportions of voters d_hat from the first poll in your subset object polls.
Use the formula above to calculate p from d_hat. Assign p to the variable X_hat.
Find the standard error of the spread given N.
Calculate the 95% confidence interval of this estimate of the difference in proportions, d_hat, using the qnorm function.
Save the lower and upper confidence intervals as an object called ci. Save the lower confidence interval first.
# Add a statement to this line of code that will add a new column named `d_hat` to `polls`. The new column should contain the difference in the proportion of voters.
polls <- polls_us_election_2016 %>% filter(enddate >= "2016-10-31" & state == "U.S.") %>%
mutate(d_hat = rawpoll_clinton/100 - rawpoll_trump/100)
# Assign the sample size of the first poll in `polls` to a variable called `N`. Print this value to the console.
N <- polls$samplesize[1]
# Assign the difference `d_hat` of the first poll in `polls` to a variable called `d_hat`. Print this value to the console.
d_hat <- polls$d_hat[1]
d_hat
## [1] 0.04
# Assign proportion of votes for Clinton to the variable `X_hat`.
X_hat <- (d_hat+1)/2
# Calculate the standard error of the spread and save it to a variable called `se_hat`. Print this value to the console.
se_hat <- 2*sqrt(X_hat*(1-X_hat)/N)
se_hat
## [1] 0.02120683
# Use `qnorm` to calculate the 95% confidence interval for the difference in the proportions of voters. Save the lower and then the upper confidence interval to a variable called `ci`.
ci <- c(d_hat - qnorm(0.975)*se_hat, d_hat + qnorm(0.975)*se_hat)
Create a new object called pollster_results that contains the pollster’s name, the end date of the poll, the difference in the proportion of voters who declared a vote for either candidate, the standard error of this estimate, and the lower and upper bounds of the confidence interval for the estimate.
Instructions
Use the mutate function to define four new columns: ‘X_hat’, ‘se_hat’, ‘lower’, and ‘upper’. Temporarily add these columns to the polls object that has already been loaded for you.
In the X_hat column, calculate the proportion of voters for Clinton using d_hat.
In the se_hat column, calculate the standard error of the spread for each poll using the sqrt function.
In the lower column, calculate the lower bound of the 95% confidence interval using the qnorm function.
In the upper column, calculate the upper bound of the 95% confidence interval using the qnorm function.
Use the select function to select the columns from polls to save to the new object pollster_results.
# The subset `polls` data with 'd_hat' already calculated has been loaded. Examine it using the `head` function.
head(polls)
# Create a new object called `pollster_results` that contains columns for pollster name, end date, d_hat, lower confidence interval of d_hat, and upper confidence interval of d_hat for each poll.
pollster_results <- polls %>%
  mutate(X_hat = (d_hat + 1) / 2) %>%
  mutate(se_hat = 2 * sqrt(X_hat * (1 - X_hat) / samplesize)) %>%
  mutate(lower = d_hat - qnorm(0.975) * se_hat) %>%
  mutate(upper = d_hat + qnorm(0.975) * se_hat) %>%
  select(pollster, enddate, d_hat, lower, upper)
pollster_results
Instructions
Use the mutate function to define a new variable within pollster_results called hit.
Use logical expressions to determine if each set of lower and upper span the actual difference in proportions of voters.
Use the mean function to determine the average value in hit and summarize the results using summarize.
Save the result as an object called avg_hit.
# The `pollster_results` object has already been loaded. Examine it using the `head` function.
head(pollster_results)
# Add a logical variable called `hit` that indicates whether the actual value (0.021) exists within the confidence interval of each poll. Summarize the average `hit` result to determine the proportion of polls with confidence intervals that include the actual value. Save the result as an object called `avg_hit`.
avg_hit <- pollster_results %>% mutate(hit = lower <= 0.021 & upper >= 0.021) %>% summarize(mean(hit))
To motivate our next exercises, calculate the difference between each poll’s estimate \(\ \bar{d}\) and the actual \(\ d=0.021\). Stratify this difference, or error, by pollster in a plot.
Instructions
Define a new variable errors that contains the difference between the estimated difference between the proportion of voters and the actual difference on election day, 0.021.
To create the plot of errors by pollster, add a layer with the function geom_point. The aesthetic mappings require a definition of the x-axis and y-axis variables. So the code looks like the example below, but you fill in the variables for x and y.
data %>% ggplot(aes(x = , y = )) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
# Add variable called `error` to the object `polls` that contains the difference between d_hat and the actual difference on election day. Then make a plot of the error stratified by pollster.
polls %>% mutate(error = d_hat - 0.021) %>%
  ggplot(aes(x = pollster, y = error)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
You can use dplyr tools group_by and n to group data by a variable of interest and then count the number of observations in the groups. The function filter filters data piped into it by your specified condition.
For example:
data %>% group_by(variable_for_grouping) %>%
  filter(n() >= 5)
Instructions
Define a new variable errors that contains the difference between the estimated difference between the proportion of voters and the actual difference on election day, 0.021.
Group the data by pollster using the group_by function.
Use ggplot to create the plot of errors by pollster.
Add a layer with the function geom_point.
# The `polls` object has already been loaded. Examine it using the `head` function.
head(polls)
# Add variable called `error` to the object `polls` that contains the difference between d_hat and the actual difference on election day. Then make a plot of the error stratified by pollster, but only for pollsters who took 5 or more polls.
polls %>% mutate(error = d_hat - 0.021) %>%
  group_by(pollster) %>%
  filter(n() >= 5) %>%
  ggplot(aes(pollster, error)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
In Section 4, you will look at statistical models in the context of election polling and forecasting.
After completing Section 4, you will be able to:
The textbook for this section is available here
Let’s revisit the heights dataset. For now, consider x to be the heights of all males in the data set. Mathematically speaking, x is our population. Using the urn analogy, we have an urn with the values of x in it.
What are the population average and standard deviation of our population?
Instructions
Create a vector x that contains heights for all males in the population.
Calculate the average of x.
Calculate the standard deviation of x.
# Load the 'dslabs' package and data contained in 'heights'
library(dslabs)
library(dplyr)
data(heights)
# Make a vector of heights from all males in the population
x <- heights %>% filter(sex == "Male") %>%
.$height
# Calculate the population average. Print this value to the console.
mean(x)
## [1] 69.31475
# Calculate the population standard deviation. Print this value to the console.
sd(x)
## [1] 3.611024
Instructions
Use the sample function to sample N values from x.
# The vector of all male heights in our population `x` has already been loaded for you. You can examine the first six elements using `head`.
head(x)
## [1] 75 70 68 74 61 67
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Define `N` as the number of people measured
N <- 50
# Define `X` as a random sample from our population `x`
X <- sample(x, N, replace = TRUE)
# Calculate the sample average. Print this value to the console.
mean(X)
## [1] 70.47293
# Calculate the sample standard deviation. Print this value to the console.
sd(X)
## [1] 3.426742
Possible Answers
A. It is identical to μ.
B. It is a random variable with expected value μ and standard error \(\ \sigma/\sqrt{N}\).
C. It is a random variable with expected value μ and standard error σ.
D. It underestimates μ.
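The behavior of a sample average drawn from a fixed population can be explored by simulation. The sketch below is only an illustration: it uses a hypothetical stand-in population generated with rnorm (not the heights data), with N = 50 and B = 10000 chosen for the example, and compares the simulated sample averages with the population average and with \(\ \sigma/\sqrt{N}\).

```r
set.seed(1)

# Hypothetical stand-in population (an assumption; not the `heights` data)
x <- rnorm(1000, mean = 69, sd = 3.6)
mu <- mean(x)                     # population average
sigma <- sqrt(mean((x - mu)^2))   # population standard deviation

N <- 50     # sample size
B <- 10000  # number of simulated samples

# Draw B samples with replacement and record each sample average
x_bar <- replicate(B, mean(sample(x, N, replace = TRUE)))

# The sample average varies from sample to sample
c(simulated_mean = mean(x_bar), population_mean = mu)
c(simulated_se = sd(x_bar), theory_se = sigma / sqrt(N))
```

The population standard deviation is computed with a divisor of length(x) here because x is treated as the whole population rather than a sample.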
Construct a 95% confidence interval for μ.
Instructions
Use the sd and sqrt functions to define the standard error se.
Calculate the 95% confidence interval using the qnorm function. Save the lower then the upper confidence interval to a variable called ci.
# The vector of all male heights in our population `x` has already been loaded for you. You can examine the first six elements using `head`.
head(x)
## [1] 75 70 68 74 61 67
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Define `N` as the number of people measured
N <- 50
# Define `X` as a random sample from our population `x`
X <- sample(x, N, replace = TRUE)
# Define `se` as the standard error of the estimate. Print this value to the console.
X_hat <- mean(X)
se_hat <- sd(X)
se <- se_hat / sqrt(N)
se
## [1] 0.4846145
# Construct a 95% confidence interval for the population average based on our sample. Save the lower and then the upper confidence interval to a variable called `ci`.
ci <- c(qnorm(0.025, mean(X), se), qnorm(0.975, mean(X), se))
Instructions
Use the replicate function to replicate the sample code for B <- 10000 simulations. Save the results of the replicated code to a variable called res. The replicated code should complete the following steps:
1. Use the sample function to sample N values from x. Save the sampled heights as a vector called X.
2. Create an object called interval that contains the 95% confidence interval for each of the samples. Use the same formula you used in the previous exercise to calculate this interval.
3. Use the between function to determine if μ is contained within the confidence interval of that simulation.
Finally, use the mean function to determine the proportion of results in res that contain mu.
# Define `mu` as the population average
mu <- mean(x)
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Define `N` as the number of people measured
N <- 50
# Define `B` as the number of times to run the model
B <- 10000
# Define an object `res` that contains a logical vector for simulated intervals that contain mu
res <- replicate(B, {
  X <- sample(x, N, replace = TRUE)
  X_hat <- mean(X)
  se_hat <- sd(X)
  se <- se_hat / sqrt(N)
  interval <- c(qnorm(0.025, mean(X), se), qnorm(0.975, mean(X), se))
  between(mu, interval[1], interval[2])
})
# Calculate the proportion of results in `res` that include mu. Print this value to the console.
mean(res)
## [1] 0.9479
Is there a poll bias? Make a plot of the spreads for each poll.
Instructions
Use ggplot to plot the spread for each of the two pollsters.
Define the x- and y-axes using aes() within the ggplot function.
Use geom_boxplot to make a boxplot of the data.
Use geom_point to add data points to the plot.
# Load the libraries and data you need for the following exercises
library(dslabs)
library(dplyr)
library(ggplot2)
data("polls_us_election_2016")
# These lines of code filter for the polls we want and calculate the spreads
polls <- polls_us_election_2016 %>%
filter(pollster %in% c("Rasmussen Reports/Pulse Opinion Research","The Times-Picayune/Lucid") &
enddate >= "2016-10-15" &
state == "U.S.") %>%
mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100)
# Make a boxplot with points of the spread for each pollster
polls %>% ggplot(aes(pollster, spread)) + geom_boxplot() + geom_point()
We will model the observed data \(\ Y_{ij}\) in the following way:
\(\ Y_{ij} = d + b_i + \varepsilon_{ij}\)
with \(\ i=1,2\) indexing the two pollsters, \(\ b_i\) the bias for pollster \(\ i\), and \(\ \varepsilon_{ij}\) poll-to-poll chance variability. We assume the \(\ \varepsilon_{ij}\) are independent from each other, have expected value 0 and standard deviation \(\ \sigma_i\) regardless of \(\ j\).
Which of the following statements best reflects what we need to know to determine if our data fit the urn model?
Possible Answers
A. Is \(\ \varepsilon_{ij} = 0\)?
B. How close are \(\ Y_{ij}\) to \(\ d\)?
C. Is \(\ b_1 \neq b_2\)?
D. Are \(\ b_1 = 0\) and \(\ b_2 = 0\)?
\(\ Y_{ij} = d + b_i + \varepsilon_{ij}\)
On the right side of this model, only \(\ \varepsilon_{ij}\) is a random variable. The other two values are constants.
What is the expected value of \(\ Y_{ij}\)?
Possible Answers
A. \(\ d+b_1\)
B. \(\ b_1 + \varepsilon_{ij}\)
C. \(\ d\)
D. \(\ d + b_1 + \varepsilon_{ij}\)
What is the expected value and standard error of \(\ \bar{Y}_1\)?
Possible Answers
A. The expected value is \(\ d + b_1\) and the standard error is \(\ \sigma_1\)
B. The expected value is \(\ d\) and the standard error is \(\ \sigma_1/\sqrt{N_1}\)
C. The expected value is \(\ d + b_1\) and the standard error is \(\ \sigma_1/\sqrt{N_1}\)
D. The expected value is \(\ d\) and the standard error is \(\ \sigma_1+\sqrt{N_1}\)
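As a sketch of the reasoning behind these options, assume (as the model states) that the errors are independent, and read \(\ N_1\) as the number of polls from pollster 1. Linearity of expectation and the variance of a sum of independent terms then give:

```latex
\begin{aligned}
\mathrm{E}[\bar{Y}_1] &= \frac{1}{N_1}\sum_{j=1}^{N_1} \mathrm{E}[Y_{1j}]
  = \frac{1}{N_1}\sum_{j=1}^{N_1} \left(d + b_1 + \mathrm{E}[\varepsilon_{1j}]\right) = d + b_1, \\
\mathrm{SE}[\bar{Y}_1] &= \sqrt{\frac{1}{N_1^2}\sum_{j=1}^{N_1} \mathrm{Var}(\varepsilon_{1j})}
  = \sqrt{\frac{\sigma_1^2}{N_1}} = \frac{\sigma_1}{\sqrt{N_1}}.
\end{aligned}
```

The same steps apply to \(\ \bar{Y}_2\) with the index 1 replaced by 2.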
What is the expected value and standard error of \(\ \bar{Y}_2\)?
Possible Answers
A. The expected value is \(\ d + b_2\) and the standard error is \(\ \sigma_2\)
B. The expected value is \(\ d\) and the standard error is \(\ \sigma_2/\sqrt{N_2}\)
C. The expected value is \(\ d + b_2\) and the standard error is \(\ \sigma_2/\sqrt{N_2}\)
D. The expected value is \(\ d\) and the standard error is \(\ \sigma_2 + \sqrt{N_2}\)
Possible Answers
A. \(\ (b_2 - b_1)^2\)
B. \(\ b_2 - b_1/\sqrt{N}\)
C. \(\ b_2 + b_1\)
D. \(\ b_2 - b_1\)
Possible Answers
A. \(\ \sqrt{\sigma_2^2/N_2 + \sigma_1^2/N_1}\)
B. \(\ \sqrt{\sigma_2/N_2 + \sigma_1/N_1}\)
C. \(\ (\sigma_2^2/N_2 + \sigma_1^2/N_1)^2\)
D. \(\ \sigma_2^2/N_2 + \sigma_1^2/N_1\)
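A short simulation can be used to check how the difference \(\ \bar{Y}_2 - \bar{Y}_1\) behaves under the model above. The sketch below uses made-up values for `d`, `b`, `sigma`, and `N` (none of them come from the poll data) and compares the simulated mean and standard deviation of the difference with \(\ b_2 - b_1\) and \(\ \sqrt{\sigma_2^2/N_2 + \sigma_1^2/N_1}\).

```r
# Illustrative values only; none of these come from the poll data
set.seed(2016)
d <- 0.02                 # hypothetical true spread
b <- c(-0.01, 0.03)       # hypothetical pollster biases b_1, b_2
sigma <- c(0.02, 0.03)    # hypothetical poll-to-poll standard deviations
N <- c(16, 25)            # hypothetical number of polls per pollster

# Draw one set of polls per pollster and return the difference of the averages
diff_of_averages <- function() {
  Y1 <- d + b[1] + rnorm(N[1], 0, sigma[1])
  Y2 <- d + b[2] + rnorm(N[2], 0, sigma[2])
  mean(Y2) - mean(Y1)
}

diffs <- replicate(10000, diff_of_averages())

# Values implied by the model
theory_mean <- b[2] - b[1]
theory_se <- sqrt(sigma[2]^2 / N[2] + sigma[1]^2 / N[1])

# Compare simulation and theory
c(simulated = mean(diffs), theory = theory_mean)
c(simulated = sd(diffs), theory = theory_se)
```

Note that `d` cancels in the difference, which is why the difference of the averages carries information about \(\ b_2 - b_1\) only.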
Compute the estimates of \(\ \sigma_1\) and \(\ \sigma_2\).
Instructions
Store the pollster names and the standard deviations of their spreads in an object called sigma.
# The `polls` data have already been loaded for you. Use the `head` function to examine them.
head(polls)
# Create an object called `sigma` that contains a column for `pollster` and a column for `s`, the standard deviation of the spread
sigma <- polls %>% group_by(pollster) %>% summarize(s = sd(spread))
# Print the contents of sigma to the console
sigma
Possible Answers
A. The central limit theorem cannot tell us anything because this difference is not the average of a sample.
B. Because \(\ Y_{ij}\) are approximately normal, the averages are normal too.
C. If we assume \(\ N_2\) and \(\ N_1\) are large enough, \(\ \bar{Y}_2\) and \(\ \bar{Y}_1\), and their difference, are approximately normal.
D. These data do not contain vectors of 0 and 1, so the central limit theorem does not apply.
Construct a 95% confidence interval for the difference between \(\ b_2\) and \(\ b_1\). Does this interval contain zero?
Instructions
Use pipes %>% to pass the data polls on to functions that will group by pollster and summarize the average spread, standard deviation, and number of polls per pollster.
Calculate the 95% confidence interval using the qnorm function. Save the lower and then the upper confidence interval to a variable called ci.
# The `polls` data have already been loaded for you. Use the `head` function to examine them.
head(polls)
## [1] "state" "startdate" "enddate"
## [4] "pollster" "grade" "samplesize"
## [7] "population" "rawpoll_clinton" "rawpoll_trump"
## [10] "rawpoll_johnson" "rawpoll_mcmullin" "adjpoll_clinton"
## [13] "adjpoll_trump" "adjpoll_johnson" "adjpoll_mcmullin"
## [16] "spread"
# Create an object called `res` that summarizes the average, standard deviation, and number of polls for the two pollsters.
res <- polls %>% group_by(pollster) %>% summarize(avg=mean(spread), s = sd(spread), N=n())
res
# Store the difference between the larger average and the smaller in a variable called `estimate`. Print this value to the console.
estimate <- max(res$avg) - min(res$avg)
estimate
## [1] 0.05229167
# Store the standard error of the estimates as a variable called `se_hat`. Print this value to the console.
se_hat <- sqrt(res$s[2]^2/res$N[2] + res$s[1]^2/res$N[1])
se_hat
## [1] 0.007031433
# Calculate the 95% confidence interval of the spreads. Save the lower and then the upper confidence interval to a variable called `ci`.
ci <- c(estimate - qnorm(0.975)*se_hat, estimate + qnorm(0.975)*se_hat)
Compute a p-value to relay the fact that chance does not explain the observed pollster effect.
Instructions
Use the pnorm function to calculate the probability that a random value is larger than the observed ratio of the estimate to the standard error.
# We made an object `res` to summarize the average, standard deviation, and number of polls for the two pollsters.
res <- polls %>% group_by(pollster) %>%
summarize(avg = mean(spread), s = sd(spread), N = n())
# The variables `estimate` and `se_hat` contain the spread estimates and standard error, respectively.
estimate <- res$avg[2] - res$avg[1]
se_hat <- sqrt(res$s[2]^2/res$N[2] + res$s[1]^2/res$N[1])
# Calculate the p-value
2 * (1 - pnorm(estimate / se_hat, 0, 1))
## [1] 1.030287e-13
\(\ \frac{\bar{Y}_2 - \bar{Y}_1}{\sqrt{s_2^2/N_2 + s_1^2/N_1}}\)
Later we will learn of another approximation for the distribution of this statistic for values of \(\ N_2\) and \(\ N_1\) that aren’t large enough for the CLT.
Note that our data has more than two pollsters. We can also test for pollster effect using all pollsters, not just two. The idea is to compare the variability across polls to variability within polls. We can construct statistics to test for effects and approximate their distribution. The area of statistics that does this is called Analysis of Variance or ANOVA. We do not cover it here, but ANOVA provides a very useful set of tools to answer questions such as: is there a pollster effect?
Compute the average and standard deviation for each pollster and examine the variability across the averages and how it compares to the variability within the pollsters, summarized by the standard deviation.
Instructions
Group the polls data by pollster.
Create an object called var that contains three columns: pollster, mean spread, and standard deviation.
Name the column for the mean avg and the column for standard deviation s.
# Execute the following lines of code to filter the polling data and calculate the spread
polls <- polls_us_election_2016 %>%
filter(enddate >= "2016-10-15" &
state == "U.S.") %>%
group_by(pollster) %>%
filter(n() >= 5) %>%
mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100) %>%
ungroup()
# Create an object called `var` that contains columns for the pollster, mean spread, and standard deviation. Print the contents of this object to the console.
var <- polls %>% group_by(pollster) %>% summarize(avg = mean(spread), s = sd(spread))
var
In Section 5, you will learn about Bayesian statistics through looking at examples from rare disease diagnosis and baseball.
After completing Section 5, you will be able to:
The textbook for this section is available here
Based on what we’ve learned throughout this course, which statement best describes a potential flaw in Sir Meadow’s reasoning?
Possible Answers
A. Sir Meadow assumed the second death was independent of the first son being affected, thereby ignoring possible genetic causes.
B. There is no flaw. The multiplicative rule always applies in this way: Pr(A and B)=Pr(A)Pr(B)
C. Sir Meadow should have added the probabilities: Pr(A and B)=Pr(A)+Pr(B)
D. The rate of SIDS is too low to perform these types of statistics.
What is the probability of both of Sally Clark’s sons dying of SIDS?
Instructions
# Define `Pr_1` as the probability of the first son dying of SIDS
Pr_1 <- 1/8500
# Define `Pr_2` as the probability of the second son dying of SIDS
Pr_2 <- 1/100
# Calculate the probability of both sons dying of SIDS. Print this value to the console.
Pr_1*Pr_2
## [1] 1.176471e-06
\(\ \mbox{Pr}(\mbox{mother is a murderer} \mid \mbox{two children found dead with no evidence of harm})\)
Possible Answers
A. \(\ \frac{\mbox{Pr}(\mbox{two children found dead with no evidence of harm}) \mbox{Pr}(\mbox{mother is a murderer})}{\mbox{Pr}(\mbox{two children found dead with no evidence of harm})}\)
B. \(\ \mbox{Pr}(\mbox{two children found dead with no evidence of harm})\mbox{Pr}(\mbox{mother is a murderer} )\)
C. \(\ \frac{\mbox{Pr}(\mbox{two children found dead with no evidence of harm} \mid \mbox{mother is a murderer} ) \mbox{Pr}(\mbox{mother is a murderer})}{\mbox{Pr}(\mbox{two children found dead with no evidence of harm})}\)
D. 1/8500
\(\ \mbox{Pr}(\mbox{two children found dead with no evidence of harm} \mid \mbox{mother is a murderer} ) = 0.50\)
Assume that the murder rate among mothers is 1 in 1,000,000.
\(\ \mbox{Pr}(\mbox{mother is a murderer} ) = 1/1,000,000\)
According to Bayes’ rule, what is the probability of:
\(\ \mbox{Pr}(\mbox{mother is a murderer} \mid \mbox{two children found dead with no evidence of harm})\)
Instructions
Use Bayes’ rule to calculate the probability that the mother is a murderer, considering the rates of murdering mothers in the population, the probability that two siblings die of SIDS, and the probability that a murderer kills children without leaving evidence of physical harm.
# Define `Pr_1` as the probability of the first son dying of SIDS
Pr_1 <- 1/8500
# Define `Pr_2` as the probability of the second son dying of SIDS
Pr_2 <- 1/100
# Define `Pr_B` as the probability of both sons dying of SIDS
Pr_B <- Pr_1*Pr_2
# Define Pr_A as the rate of mothers that are murderers
Pr_A <- 1/1000000
# Define Pr_BA as the probability that two children die without evidence of harm, given that their mother is a murderer
Pr_BA <- 0.50
# Define Pr_AB as the probability that a mother is a murderer, given that her two children died with no evidence of physical harm. Print this value to the console.
Pr_AB <- Pr_BA*Pr_A/Pr_B
Pr_AB
## [1] 0.425
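Written out by hand, the calculation above is a direct application of Bayes’ rule. Here A stands for "mother is a murderer" and B for "two children found dead with no evidence of harm", matching the variable names `Pr_A`, `Pr_B`, and `Pr_BA` in the code; as in the code, Pr(B) is taken to be the probability of both sons dying of SIDS.

```latex
\begin{aligned}
\Pr(A \mid B) &= \frac{\Pr(B \mid A)\,\Pr(A)}{\Pr(B)}
  = \frac{0.50 \times \frac{1}{1{,}000{,}000}}{\frac{1}{8500} \times \frac{1}{100}} \\
  &= 0.50 \times \frac{850{,}000}{1{,}000{,}000} = 0.425
\end{aligned}
```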
In addition to misusing the multiplicative rule as we saw earlier, what else did Sir Meadow miss?
Possible Answers
A. He made an arithmetic error in forgetting to divide by the rate of SIDS in siblings.
B. He did not take into account how rare it is for a mother to murder her children.
C. He mixed up the numerator and denominator of Bayes’ rule.
D. He did not take into account murder rates in the population.
The CLT tells us that the average of these spreads is approximately normal. Calculate a spread average and provide an estimate of the standard error.
Instructions
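Calculate the average of the spreads in polls and call it avg in the final table.
Calculate an estimate of the standard error of the spreads and call it se in the final table.
Use the mean and sd functions nested within summarize to find the average and standard deviation of the grouped spread data.
Save your results in an object called results.
# Load the libraries and poll data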
library(dplyr)
library(dslabs)
data(polls_us_election_2016)
# Create an object `polls` that contains the spread of predictions for each candidate in Florida during the last polling days
polls <- polls_us_election_2016 %>%
filter(state == "Florida" & enddate >= "2016-11-04" ) %>%
mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100)
# Examine the `polls` object using the `head` function
head(polls)
# Create an object called `results` that has two columns containing the average spread (`avg`) and the standard error (`se`). Print the results to the console.
results <- polls %>% summarize(avg = mean(spread), se = sd(spread)/sqrt(n()))
results
What are the interpretations of μ and τ?
Possible Answers
A. μ and τ are arbitrary numbers that let us make probability statements about d.
B. μ and τ summarize what we would predict for Florida before seeing any polls.
C. μ and τ summarize what we want to be true. We therefore set μ at 0.10 and τ at 0.01.
D. The choice of prior has no effect on the Bayesian analysis.
Use the formulas for the posterior distribution to calculate the expected value of the posterior distribution if we set μ=0 and τ=0.01.
Instructions
Identify which elements stored in the object results represent σ and Y.
Estimate B using σ and τ.
Estimate the posterior distribution using B, μ, and Y.
# The `results` object has already been loaded. Examine the values stored: `avg` and `se` of the spread
results
# Define `mu` and `tau`
mu <- 0
tau <- 0.01
# Define a variable called `sigma` that contains the standard error in the object `results`
sigma <- results$se
# Define a variable called `Y` that contains the average in the object `results`
Y <- results$avg
# Define a variable `B` using `sigma` and `tau`. Print this value to the console.
B <- sigma^2 / (sigma^2 + tau^2)
B
## [1] 0.342579
# Calculate the expected value of the posterior distribution
B * mu + (1 - B) * Y
## [1] 0.002731286
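The posterior formulas used in this exercise can be collected into one small helper. This is only a sketch: the function name `posterior_normal` and the input values below are hypothetical and are not taken from the Florida polls. It assumes a normal prior with mean `mu` and standard deviation `tau`, and an observed average `Y` with standard error `sigma`.

```r
# Posterior summary for a normal prior N(mu, tau^2) and an observed average Y
# with standard error sigma (hypothetical helper, not part of the course code)
posterior_normal <- function(mu, tau, Y, sigma) {
  B <- sigma^2 / (sigma^2 + tau^2)             # weight given to the prior mean
  list(
    B = B,
    mean = B * mu + (1 - B) * Y,               # posterior expected value
    se = sqrt(1 / (1 / sigma^2 + 1 / tau^2))   # posterior standard error
  )
}

# Hypothetical inputs, chosen only to illustrate the behaviour
post <- posterior_normal(mu = 0, tau = 0.01, Y = 0.004, sigma = 0.007)
post$B
post$mean
post$se
```

Because B lies between 0 and 1, the posterior mean always falls between the prior mean and the observed average; a smaller `tau` (a more confident prior) pushes B toward 1 and pulls the estimate toward `mu`.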
Instructions
# Here are the variables we have defined
mu <- 0
tau <- 0.01
sigma <- results$se
Y <- results$avg
B <- sigma^2 / (sigma^2 + tau^2)
# Compute the standard error of the posterior distribution. Print this value to the console.
sqrt(1 / (1 / sigma ^2 + 1 / tau ^2))
## [1] 0.005853024
Instructions
Calculate the 95% credible interval using the qnorm function.
Save the interval as ci, with the lower bound first.
# Here are the variables we have defined in previous exercises
mu <- 0
tau <- 0.01
sigma <- results$se
Y <- results$avg
B <- sigma^2 / (sigma^2 + tau^2)
se <- sqrt( 1/ (1/sigma^2 + 1/tau^2))
# Construct the 95% credible interval. Save the lower and then the upper confidence interval to a variable called `ci`.
est <- B * mu + (1 - B) * Y
est
## [1] 0.002731286
ci <- c(est - qnorm(0.975) * se, est + qnorm(0.975) * se)
ci
## [1] -0.008740432 0.014203003
Instructions
Using the pnorm function, calculate the probability that the spread in Florida was less than 0.
# Assign the expected value of the posterior distribution to the variable `exp_value`
exp_value <- B*mu + (1-B)*Y
# Assign the standard error of the posterior distribution to the variable `se`
se <- sqrt( 1/ (1/sigma^2 + 1/tau^2))
# Using the `pnorm` function, calculate the probability that the actual spread was less than 0 (in Trump's favor). Print this value to the console.
pnorm(0, exp_value, se)
## [1] 0.3203769
Change the prior standard deviation τ to take values ranging from 0.005 to 0.05 and observe how the probability of Trump winning Florida changes by making a plot.
Instructions
Create a vector of values of tau called taus by executing the sample code.
Create a function using function(){} called p_calc that first calculates B given tau and sigma and then calculates the probability of Trump winning, as we did in the previous exercise.
Apply the p_calc function across all the new values of taus.
Use the plot function to plot τ on the x-axis and the new probabilities on the y-axis.
# Define the variables from previous exercises
mu <- 0
sigma <- results$se
Y <- results$avg
# Define a variable `taus` as different values of tau
taus <- seq(0.005, 0.05, len = 100)
# Create a function called `p_calc` that generates `B` and calculates the probability of the spread being less than 0
p_calc <- function(tau) {
B <- sigma ^ 2 / (sigma^2 + tau^2)
se <- sqrt(1 / (1/sigma^2 + 1/tau^2))
exp_value <- B * mu + (1 - B) * Y
pnorm(0, exp_value, se)
}
# Create a vector called `ps` by applying the function `p_calc` across values in `taus`
ps <- p_calc(taus)
# Plot `taus` on the x-axis and `ps` on the y-axis
plot(taus, ps)
In Section 6, you will learn about election forecasting, building on what you’ve learned in the previous sections about statistical modeling and Bayesian statistics.
After completing Section 6, you will be able to:
Understand how pollsters use hierarchical models to forecast the results of elections.
Incorporate multiple sources of variability into a mathematical model to make predictions.
Construct confidence intervals that better model deviations such as those seen in election data using the t-distribution.
There are 2 assignments that use the DataCamp platform for you to practice your coding skills.
The textbook for this section is available here
Instructions
Use the pipe %>% to pass the poll object on to the mutate function, which creates new variables.
Create a variable called X_hat that contains the estimate of the proportion of Clinton voters for each poll.
Create a variable called se that contains the standard error of the spread.
Calculate the confidence intervals using the qnorm function and your calculated se.
Use the select function to keep the following columns: state, startdate, enddate, pollster, grade, spread, lower, upper.
# Load the libraries and data
library(dplyr)
library(dslabs)
data("polls_us_election_2016")
# Create a table called `polls` that filters by state, date, and reports the spread
polls <- polls_us_election_2016 %>%
filter(state != "U.S." & enddate >= "2016-10-31") %>%
mutate(spread = rawpoll_clinton/100 - rawpoll_trump/100)
# Create an object called `cis` that has the columns indicated in the instructions
cis <- polls %>% mutate(X_hat = (spread+1)/2, se = 2*sqrt(X_hat*(1-X_hat)/samplesize),
lower = spread - qnorm(0.975)*se, upper = spread + qnorm(0.975)*se) %>%
select(state, startdate, enddate, pollster, grade, spread, lower, upper)
Add the actual results to the cis table you just created using the left_join function as shown in the sample code. Now determine how often the 95% confidence interval includes the actual result.
Instructions
Create an object called p_hits that contains the proportion of intervals that contain the actual spread using the following steps.
Use the mutate function to create a new variable called hit that contains a logical vector for whether the actual_spread falls between the lower and upper confidence intervals.
Summarize the proportion of values in hit that are true as a variable called proportion_hits.
# Add the actual results to the `cis` data set
add <- results_us_election_2016 %>% mutate(actual_spread = clinton/100 - trump/100) %>% select(state, actual_spread)
ci_data <- cis %>% mutate(state = as.character(state)) %>% left_join(add, by = "state")
# Create an object called `p_hits` that summarizes the proportion of confidence intervals that contain the actual value. Print this object to the console.
p_hits <- ci_data %>% mutate(hit = lower <= actual_spread & upper >= actual_spread) %>% summarize(proportion_hits = mean(hit))
p_hits
Instructions
Create an object called p_hits that contains the proportion of intervals that contain the actual spread using the following steps.
Use the mutate function to create a new variable called hit that contains a logical vector for whether the actual_spread falls between the lower and upper confidence intervals.
Use the group_by function to group the data by pollster.
Use the filter function to filter for pollsters that have more than 5 polls.
Summarize the proportion of values in hit that are true as a variable called proportion_hits. Also create new variables for the number of polls by each pollster using the n() function and the grade of each poll.
Use the arrange function to arrange the proportion_hits in descending order.
# The `cis` data have already been loaded for you
add <- results_us_election_2016 %>% mutate(actual_spread = clinton/100 - trump/100) %>% select(state, actual_spread)
ci_data <- cis %>% mutate(state = as.character(state)) %>% left_join(add, by = "state")
# Create an object called `p_hits` that summarizes the proportion of hits for each pollster that has at least 5 polls.
p_hits <- ci_data %>% mutate(hit = lower <= actual_spread & upper >= actual_spread) %>%
group_by(pollster) %>%
filter(n() >= 5) %>%
summarize(proportion_hits = mean(hit), n = n(), grade = grade[1]) %>%
arrange(desc(proportion_hits))
p_hits
Instructions
Create an object called p_hits that contains the proportion of intervals that contain the actual spread using the following steps.
Use the mutate function to create a new variable called hit that contains a logical vector for whether the actual_spread falls between the lower and upper confidence intervals.
Use the group_by function to group the data by state.
Use the filter function to filter for states that have more than 5 polls.
Summarize the proportion of values in hit that are true as a variable called proportion_hits. Also create new variables for the number of polls in each state using the n() function.
Use the arrange function to arrange the proportion_hits in descending order.
# The `cis` data have already been loaded for you
add <- results_us_election_2016 %>% mutate(actual_spread = clinton/100 - trump/100) %>% select(state, actual_spread)
ci_data <- cis %>% mutate(state = as.character(state)) %>% left_join(add, by = "state")
# Create an object called `p_hits` that summarizes the proportion of hits for each state that has more than 5 polls.
p_hits <- ci_data %>% mutate(hit = lower <= actual_spread & upper >= actual_spread) %>%
group_by(state) %>%
filter(n() >= 5) %>%
summarize(proportion_hits = mean(hit), n = n()) %>%
arrange(desc(proportion_hits))
p_hits
Instructions
Using ggplot, set the aesthetic with state as the x-variable and proportion of hits as the y-variable.
Use geom_bar to indicate that we want to plot a barplot. Specify stat = "identity" to indicate that the height of the bar should match the value.
Use coord_flip to flip the axes so the states are displayed from top to bottom and proportions are displayed from left to right.
# The `p_hits` data have already been loaded for you. Use the `head` function to examine it.
head(p_hits)
# Make a barplot of the proportion of hits for each state
p_hits %>% mutate(state = reorder(state, proportion_hits)) %>%
ggplot(aes(state, proportion_hits)) +
geom_bar(stat = "identity") +
coord_flip()
Add two columns to the cis table by computing, for each poll, the difference between the predicted spread and the actual spread, and define a column hit that is true if the signs are the same.
Instructions
Use the mutate function to add two new variables to the cis object: error and hit.
For the error variable, subtract the actual spread from the spread.
For the hit variable, return “TRUE” if the poll predicted the actual winner.
Save the new table as an object called errors.
Use the tail function to examine the last 6 rows of errors.
# The `cis` data have already been loaded. Examine it using the `head` function.
cis <- cis %>% mutate(state = as.character(state)) %>% left_join(add, by = "state")
head(cis)
# Create an object called `errors` that calculates the difference between the predicted and actual spread and indicates if the correct winner was predicted
errors <- cis %>% mutate(error = spread - actual_spread, hit = sign(spread) == sign(actual_spread))
# Examine the last 6 rows of `errors`
tail(errors)
Create an object called p_hits that contains the proportion of instances when the sign of the actual spread matches the predicted spread for states with more than 5 polls.
Make a barplot based on the result from the previous exercise that shows the proportion of times the sign of the spread matched the actual result for the data in p_hits.
Instructions
Use the group_by function to group the data by state.
Use the filter function to filter for states that have more than 5 polls.
Summarize the proportion of values in hit that are true as a variable called proportion_hits. Also create new variables for the number of polls in each state using the n() function.
Using ggplot, set the aesthetic with state as the x-variable and proportion of hits as the y-variable.
Use geom_bar to indicate that we want to plot a barplot.
Use coord_flip to flip the axes so the states are displayed from top to bottom and proportions are displayed from left to right.
# Create an object called `errors` that calculates the difference between the predicted and actual spread and indicates if the correct winner was predicted
errors <- cis %>% mutate(error = spread - actual_spread, hit = sign(spread) == sign(actual_spread))
# Create an object called `p_hits` that summarizes the proportion of hits for each state that has more than 5 polls
p_hits <- errors %>% group_by(state) %>%
filter(n() >= 5) %>%
summarize(proportion_hits = mean(hit), n = n())
# Make a barplot of the proportion of hits for each state
p_hits %>% mutate(state = reorder(state, proportion_hits)) %>%
ggplot(aes(state, proportion_hits)) +
geom_bar(stat = "identity") +
coord_flip()
Make a histogram of the errors. What is the median of these errors?
Instructions
Use the hist function to generate a histogram of the errors.
Use the median function to compute the median error.
## [1] 0.037
Create a boxplot to examine if the bias was general to all states or if it affected some states differently. Filter the data to include only pollsters with grades B+ or higher.
Instructions
Use the filter function to filter the data for polls with grades equal to A+, A, A-, or B+.
Use the reorder function to order the state data by error.
Using ggplot, set the aesthetic with state as the x-variable and error as the y-variable.
Use geom_boxplot to indicate that we want to plot a boxplot.
Use geom_point to add data points as a layer.
# Create a boxplot showing the errors by state for polls with grades B+ or higher
errors %>% filter(grade %in% c("A+","A","A-","B+") | is.na(grade)) %>%
mutate(state = reorder(state, error)) %>%
ggplot(aes(state, error)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
geom_boxplot() +
geom_point()
Instructions
Use the filter function to filter the data for polls with grades equal to A+, A, A-, or B+.
Group the filtered data by state using group_by.
Use the filter function to filter the data for states with at least 5 polls.
Use the reorder function to order the state data by error.
Using ggplot, set the aesthetic with state as the x-variable and error as the y-variable.
Use geom_boxplot to indicate that we want to plot a boxplot.
Use geom_point to add data points as a layer.
# Create a boxplot showing the errors by state for states with at least 5 polls with grades B+ or higher
errors %>% filter(grade %in% c("A+","A","A-","B+") | is.na(grade)) %>%
group_by(state) %>%
filter(n() >= 5) %>%
ungroup() %>%
mutate(state = reorder(state, error)) %>%
ggplot(aes(state, error)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
geom_boxplot() +
geom_point()
Calculate the probability of seeing t-distributed random variables being more than 2 in absolute value when the degrees of freedom are 3.
Instructions
Use the pt function to calculate the probability of seeing a value less than or equal to the argument.
# Calculate the probability of seeing t-distributed random variables being more than 2 in absolute value when 'df = 3'.
1 - pt(2, 3) + pt(-2, 3)
## [1] 0.139326
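To put this number in context, the short sketch below computes the same two-sided tail probability in two equivalent ways and compares it with the corresponding probability for a standard normal variable. The variable names are illustrative only.

```r
# Two-sided tail probability P(|T| > 2) for a t-distribution with 3 degrees of freedom
p_t <- 1 - pt(2, df = 3) + pt(-2, df = 3)

# By symmetry of the t-distribution, this equals twice the lower tail
p_t_sym <- 2 * pt(-2, df = 3)

# The same probability for a standard normal variable
p_norm <- 2 * pnorm(-2)

c(t_df3 = p_t, t_df3_symmetric = p_t_sym, normal = p_norm)
```

With only 3 degrees of freedom the t-distribution puts roughly three times as much probability beyond ±2 as the normal distribution does, which is why t-based intervals are wider for small samples.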
Make a plot and notice when this probability converges to the normal distribution’s 5%.
Instructions
Generate a vector called df that contains a sequence of numbers from 3 to 50.
Using function, make a function called pt_func that recreates the calculation for the probability that a value is greater than 2 as an absolute value for any given degrees of freedom.
Use sapply to apply the pt_func function across all values contained in df. Call these probabilities probs.
Use the plot function to plot df on the x-axis and probs on the y-axis.
# Generate a vector 'df' that contains a sequence of numbers from 3 to 50
df <- seq(3,50)
# Make a function called 'pt_func' that calculates the probability that a value is more than |2| for any degrees of freedom
pt_func <- function(n) {
1 - pt(2, n) + pt(-2, n)
}
# Generate a vector 'probs' that uses the `pt_func` function to calculate the probabilities
probs <- sapply(df, pt_func)
# Plot 'df' on the x-axis and 'probs' on the y-axis
plot(df, probs)
Re-do this Monte Carlo simulation, but now instead of N=50, use N=15. Notice what happens to the proportion of hits.
Instructions
Use the replicate function to carry out the simulation. Specify the number of times you want the code to run and, within brackets, the three lines of code that should run.
First, use the sample function to randomly sample N values from x.
Second, create a vector called interval that calculates the 95% confidence interval for the sample. You will use the qnorm function.
Third, use the between function to determine if the population mean mu is contained between the confidence intervals.
Save the results of the simulation to a vector called res.
Use the mean function to determine the proportion of hits in res.
# Load the necessary libraries and data
library(dslabs)
library(dplyr)
data(heights)
# Use the sample code to generate 'x', a vector of male heights
x <- heights %>% filter(sex == "Male") %>%
.$height
# Create variables for the mean height 'mu', the sample size 'N', and the number of times the simulation should run 'B'
mu <- mean(x)
N <- 15
B <- 10000
# Use the `set.seed` function to make sure your answer matches the expected result after random sampling
set.seed(1)
# Generate a logical vector 'res' that contains the results of the simulations
res <- replicate(B, {
X <- sample(x, N, replace=TRUE)
interval <- mean(X) + c(-1,1)*qnorm(0.975)*sd(X)/sqrt(N)
between(mu, interval[1], interval[2])
})
# Calculate the proportion of times the simulation produced values within the 95% confidence interval. Print this value to the console.
mean(res)
## [1] 0.9331
What is the proportion of 95% confidence intervals that span the actual mean height now?
Instructions
Use the replicate function to carry out the simulation. Specify the number of times you want the code to run and, within brackets, the three lines of code that should run.
First, use the sample function to randomly sample N values from x.
Second, create a vector called interval that calculates the 95% confidence interval for the sample. Remember to use the qt function this time to generate the confidence interval.
Third, use the between function to determine if the population mean mu is contained between the confidence intervals.
Save the results of the simulation to a vector called res.
Use the mean function to determine the proportion of hits in res.
# The vector of filtered heights 'x' has already been loaded for you. Calculate the mean.
mu <- mean(x)
# Use the same sampling parameters as in the previous exercise.
set.seed(1)
N <- 15
B <- 10000
# Generate a logical vector 'res' that contains the results of the simulations using the t-distribution
res <- replicate(B, {
s <- sample(x, N, replace = TRUE)
interval <- c(mean(s) - qt(0.975, N - 1) * sd(s) / sqrt(N), mean(s) + qt(0.975, N - 1) * sd(s) / sqrt(N))
between(mu, interval[1], interval[2])
})
# Calculate the proportion of times the simulation produced values within the 95% confidence interval. Print this value to the console.
mean(res)
## [1] 0.9512
Possible Answers
A. The t-distribution takes the variability into account and generates larger confidence intervals.
B. Because the t-distribution shifts the intervals in the direction towards the actual mean.
C. This was just a chance occurrence. If we run it again, the CLT will work better.
D. The t-distribution is always a better approximation than the normal distribution.
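The effect seen in the two simulations above can be reproduced without the heights data. The sketch below is an illustration under stated assumptions: it draws samples from a normal population with a hypothetical mean and standard deviation (`mu = 69`, `sigma = 3`), builds both the normal-based and the t-based interval from each sample, and records how often each one covers `mu`.

```r
set.seed(1)
N <- 15        # small sample size, as in the exercise
B <- 10000     # number of simulated samples
mu <- 69       # hypothetical population mean
sigma <- 3     # hypothetical population standard deviation

res <- replicate(B, {
  X <- rnorm(N, mu, sigma)
  se <- sd(X) / sqrt(N)
  z_half <- qnorm(0.975) * se            # half-width of the normal-based interval
  t_half <- qt(0.975, df = N - 1) * se   # half-width of the t-based interval
  c(z_hit = abs(mean(X) - mu) <= z_half,
    t_hit = abs(mean(X) - mu) <= t_half,
    ratio = t_half / z_half)
})

# Coverage of each interval type and the ratio of their widths
rowMeans(res)
```

Both intervals are centered at the same sample mean, so the t-based interval never shifts toward `mu`; it is simply wider by the constant factor `qt(0.975, N - 1) / qnorm(0.975)`, and therefore covers `mu` at least as often.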
In Section 7, you will learn how to use association and chi-squared tests to perform inference for binary, categorical, and ordinal data through an example looking at research funding rates.
After completing Section 7, you will be able to:
The textbook for this section is available here
In this exercise, filter the errors data for just polls with grades A- and C-. Calculate the proportion of times each grade of poll predicted the correct winner.
Instructions
library(tidyr)
# The 'errors' data have already been loaded. Examine them using the `head` function.
head(errors)
# Generate an object called 'totals' that contains the numbers of good and bad predictions for polls rated A- and C-
totals <- errors %>%
filter(grade %in% c("A-", "C-")) %>%
group_by(grade,hit) %>%
summarize(num = n()) %>%
spread(grade, num)
# Print the proportion of hits for grade A- polls to the console
totals[[2,3]]/sum(totals[[3]])
## [1] 0.8030303
# Print the proportion of hits for grade C- polls to the console
totals[[2,2]]/sum(totals[[2]])
## [1] 0.8614958
Use a chi-squared test to determine if these proportions are different.
Instructions
Use the chisq.test function to perform the chi-squared test. Save the results to an object called chisq_test.
# Perform a chi-squared test on the hit data. Save the results as an object called 'chisq_test'.
chisq_test <- totals %>%
select(-hit) %>%
chisq.test()
chisq_test
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: .
## X-squared = 2.1053, df = 1, p-value = 0.1468
# Print the p-value to the console
chisq_test$p.value
## [1] 0.1467902
Calculate the odds ratio to determine the magnitude of the difference in performance between these two grades of polls.
Instructions
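Calculate the odds that a grade C- poll predicts the correct winner. Save this result to a variable called odds_C.
Calculate the odds that a grade A- poll predicts the correct winner. Save this result to a variable called odds_A.
Calculate the odds ratio that tells us how many times larger the odds of a grade A- poll predicting the winner are than the odds of a grade C- poll.
# Generate a variable called `odds_C` that contains the odds of getting the prediction right for grade C- polls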
odds_C <- (totals[[2,2]] / sum(totals[[2]])) /
(totals[[1,2]] / sum(totals[[2]]))
# Generate a variable called `odds_A` that contains the odds of getting the prediction right for grade A- polls
odds_A <- (totals[[2,3]] / sum(totals[[3]])) /
(totals[[1,3]] / sum(totals[[3]]))
# Calculate the odds ratio to determine how many times larger the odds are for grade A- polls than for grade C- polls
odds_A/odds_C
## [1] 0.6554539
Based on what we learned in the last section, which statement reflects the best interpretation of this result?
Possible Answers
A. The p-value is below 0.05, so there is a significant difference. Grade A- polls are significantly better at predicting winners.
B. The p-value is too close to 0.05 to call this a significant difference. We do not observe a difference in performance.
C. The p-value is below 0.05, but the odds ratio is very close to 1. There is not a scientifically significant difference in performance.
D. The p-value is below 0.05 and the odds ratio indicates that grade A- polls perform significantly better than grade C- polls.
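For readers who want to experiment with the odds ratio and the chi-squared test outside the polling data, the sketch below repeats both calculations on a small table of made-up counts. The table `tab` and the group names are hypothetical; only base R functions are used.

```r
# Hypothetical 2x2 table of counts (made-up numbers):
# rows are prediction outcomes, columns are two groups of polls
tab <- matrix(c(20, 80,    # group_1: misses, hits
                10, 90),   # group_2: misses, hits
              nrow = 2,
              dimnames = list(hit = c("FALSE", "TRUE"),
                              group = c("group_1", "group_2")))

# Odds of a correct prediction within each group: hits divided by misses
odds <- tab["TRUE", ] / tab["FALSE", ]

# Odds ratio of group_2 relative to group_1
odds_ratio <- odds[["group_2"]] / odds[["group_1"]]

# Chi-squared test of whether the hit rate differs between the groups
test <- chisq.test(tab)

odds
odds_ratio
test$p.value
```

Dividing the proportion of hits by the proportion of misses, as the exercise code does, gives the same odds as dividing the raw counts, because the group totals cancel. An odds ratio above 1 favors the group in the numerator and one below 1 favors the group in the denominator; the p-value indicates whether the difference is larger than chance alone would plausibly produce.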