Inference and Modeling

  • Course Instructor: Rafael Irizarry

Abstract

This is the fourth in a series of courses in the Professional Certificate in Data Science program, which prepares you to do data analysis in R, from simple computations to machine learning. Statistical inference and modeling are indispensable for analyzing data affected by chance, and thus essential for data scientists. In this course, you will learn these key concepts through a motivating case study on election forecasting.

This course will show you how inference and modeling can be applied to develop the statistical approaches that make polls an effective tool, and how to implement these approaches in R. You will learn the concepts necessary to define estimates and margins of error, and how to use them to make predictions relatively well and to provide an estimate of the precision of your forecast.

Once you learn this you will be able to understand two concepts that are ubiquitous in data science: confidence intervals and p-values.

Finally, to understand statements about the probability of a candidate winning, you will learn about Bayesian modeling. At the end of the course, we will put it all together to recreate a simplified version of an election forecast model and apply it to the 2016 US presidential election.

The textbook for the Data Science course series is freely available online.

Learning Objectives

  • The concepts necessary to define estimates and margins of error (populations, parameters, estimates, and standard errors) in order to make predictions about data
  • How to use models to aggregate data from different sources
  • The very basics of Bayesian statistics and predictive modeling

Course Overview

Section 1: Parameters and Estimates
You will learn how to estimate population parameters.

Section 2: The Central Limit Theorem in Practice
You will apply the central limit theorem to assess how close a sample estimate is to the population parameter of interest.

Section 3: Confidence Intervals and p-Values
You will learn how to calculate confidence intervals and learn about the relationship between confidence intervals and p-values.

Section 4: Statistical Models
You will learn about statistical models in the context of election forecasting.

Section 5: Bayesian Statistics
You will learn about Bayesian statistics through looking at examples from rare disease diagnosis and baseball.

Section 6: Election Forecasting
You will learn about election forecasting, building on what you’ve learned in the previous sections about statistical modeling and Bayesian statistics.

Section 7: Association Tests
You will learn how to use association and chi-squared tests to perform inference for binary, categorical, and ordinal data through an example looking at research funding rates.

Section 1 Overview

Section 1 introduces you to parameters and estimates.

After completing Section 1, you will be able to:

  • Understand how to use a sampling model to perform a poll.
  • Explain the terms population, parameter, and sample as they relate to statistical inference.
  • Use a sample to estimate the population proportion from the sample average.
  • Calculate the expected value and standard error of the sample average.

The textbook for this section is available here

Assessment 1.1: Parameters and Estimates

  1. Polling - expected value of S
    Suppose you poll a population in which a proportion \(\ p\) of voters are Democrats and \(\ 1−p\) are Republicans. Your sample size is \(\ N=25\). Consider the random variable \(\ S\), which is the total number of Democrats in your sample.

What is the expected value of this random variable \(\ S\)?

Possible Answers

A. \(\ E(S)=25(1−p)\)
B. \(\ E(S)=25p\)
C. \(\ E(S)=\sqrt{25 p (1-p)}\)
D. \(\ E(S)=p\)

  1. Polling - standard error of S
    Again, consider the random variable S, which is the total number of Democrats in your sample of 25 voters. The variable p describes the proportion of Democrats in the sample, whereas 1−p describes the proportion of Republicans.

What is the standard error of S?

Possible Answers

A. \(\ SE(S)=25p(1−p)\)
B. \(\ SE(S)=\sqrt{25p}\)
C. \(\ SE(S)=25(1−p)\)
D. \(\ SE(S)=\sqrt{25 p (1-p)}\)

  1. Polling - expected value of X-bar
    Consider the random variable \(\ S/N\), which is equivalent to the sample average that we have been denoting as \(\ \bar{X}\). The variable \(\ N\) represents the sample size and \(\ p\) is the proportion of Democrats in the population.

What is the expected value of \(\ \bar{X}\)?

Possible Answers

A. \(\ E(\bar{X})=p\)
B. \(\ E(\bar{X})=Np\)
C. \(\ E(\bar{X})=N(1−p)\)
D. \(\ E(\bar{X})=1−p\)

  1. Polling - standard error of X-bar
    What is the standard error of the sample average, \(\ \bar{X}\)?

The variable \(\ N\) represents the sample size and \(\ p\) is the proportion of Democrats in the population.

Possible Answers

A. \(\ SE(\bar{X})=\sqrt{Np(1−p)}\)
B. \(\ SE(\bar{X})=\sqrt{p(1−p)/N}\)
C. \(\ SE(\bar{X})=\sqrt{p(1−p)}\)
D. \(\ SE(\bar{X})=\sqrt{N}\)

  1. se versus p
    Write a line of code that calculates the standard error se of a sample average when you poll 25 people in the population. Generate a sequence of 100 proportions of Democrats p that vary from 0 (no Democrats) to 1 (all Democrats).

Plot se versus p for the 100 different proportions.

Instructions

  • Use the seq function to generate a vector of 100 values of p that range from 0 to 1.
  • Use the sqrt function to generate a vector of standard errors for all values of p.
  • Use the plot function to generate a plot with p on the x-axis and se on the y-axis.
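A minimal sketch of one way to do this in base R, assuming N = 25 as stated above:

N <- 25
p <- seq(0, 1, length.out = 100)    # 100 proportions from 0 (no Democrats) to 1 (all Democrats)
se <- sqrt(p * (1 - p) / N)         # standard error of the sample average for each p
plot(p, se)                         # the standard error peaks at p = 0.5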

  1. Multiple plots of se versus p
    Using the same code as in the previous exercise, create a for-loop that generates three plots of p versus se when the sample sizes equal N=25, N=100, and N=1000.

Instructions

  • Your for-loop should contain two lines of code to be repeated for three different values of N.
  • The first line within the for-loop should use the sqrt function to generate a vector of standard errors se for all values of p.
  • The second line within the for-loop should use the plot function to generate a plot with p on the x-axis and se on the y-axis.
  • Use the ylim argument to keep the y-axis limits constant across all three plots. The lower limit should be equal to 0 and the upper limit should equal the highest calculated standard error across all values of p and N.
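A sketch of one possible for-loop, reusing the p vector from the previous exercise; the upper y-axis limit of 0.1 is the largest standard error, reached at p = 0.5 with N = 25:

p <- seq(0, 1, length.out = 100)
for(N in c(25, 100, 1000)){
  se <- sqrt(p * (1 - p) / N)       # standard errors for this sample size
  plot(p, se, ylim = c(0, 0.1))     # fixed y-axis so the three plots are comparable
}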

  1. Expected value of d
    Our estimate for the difference in proportions of Democrats and Republicans is \(\ d=\bar{X}−(1−\bar{X})\).

Which derivation correctly uses the rules we learned about sums of random variables and scaled random variables to derive the expected value of d?

Possible Answers

A. \(\ E[\bar{X}−(1−\bar{X})]=E[2\bar{X}−1] =2E[\bar{X}]−1 = N(2p−1) = Np−N(1−p)\)
B. \(\ E[\bar{X}−(1−\bar{X})]=E[\bar{X}−1] =E[\bar{X}]−1 =p−1\)
C. \(\ E[\bar{X}−(1−\bar{X})]=E[2\bar{X}−1] =2E[\bar{X}]−1 =2\sqrt{p(1−p)}−1 =p−(1−p)\)
D. \(\ E[\bar{X}−(1−\bar{X})]=E[2\bar{X}−1] =2E[\bar{X}]−1 =2p−1 =p−(1−p)\)

  1. Standard error of d
    Our estimate for the difference in proportions of Democrats and Republicans is \(\ d=\bar{X}−(1−\bar{X})\).

Which derivation correctly uses the rules we learned about sums of random variables and scaled random variables to derive the standard error of \(\ d\)?

Possible Answers

A. \(\ SE[\bar{X}−(1−\bar{X})]=SE[2\bar{X}−1] =2SE[\bar{X}] =2\sqrt{p/N}\)
B. \(\ SE[\bar{X}−(1−\bar{X})]=SE[2\bar{X}−1]=2SE[\bar{X}−1]=2\sqrt{p(1−p)/N}−1\)
C. \(\ SE[\bar{X}−(1−\bar{X})]=SE[2\bar{X}−1] =2SE[\bar{X}] =2\sqrt{p(1−p)/N}\)
D. \(\ SE[\bar{X}−(1−\bar{X})]=SE[\bar{X}−1] =SE[\bar{X}] =\sqrt{p(1−p)/N}\)

  1. Standard error of the spread
    Say the actual proportion of Democratic voters is \(\ p=0.45\). In this case, the Republican party is winning by a relatively large margin of \(\ d=−0.1\), or a 10% margin of victory. What is the standard error of the spread \(\ 2\bar{X}−1\) in this case?

Use the sqrt function to calculate the standard error of the spread \(\ 2\bar{X}−1\).
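A sketch consistent with the output shown below, assuming p = 0.45 and N = 25:

p <- 0.45
N <- 25
2 * sqrt(p * (1 - p) / N)           # SE of the spread 2*X_bar - 1 is twice the SE of X_bar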

## [1] 0.1989975
  1. Sample size
    So far we have said that the difference between the proportion of Democratic voters and Republican voters is about 10% and that the standard error of this spread is about 0.2 when N=25. Select the statement that explains why this sample size is sufficient or not.

A. This sample size is sufficient because the expected value of our estimate \(\ 2\bar{X}−1\) is d so our prediction will be right on.
B. This sample size is too small because the standard error is larger than the spread.
C. This sample size is sufficient because the standard error of about 0.2 is much smaller than the spread of 10%.
D. Without knowing p, we have no way of knowing that increasing our sample size would actually improve our standard error.

Section 2 Overview

In Section 2, you will look at the Central Limit Theorem in practice.

After completing Section 2, you will be able to:

  • Use the Central Limit Theorem to calculate the probability that a sample estimate \(\ \bar{X}\) is close to the population proportion \(\ p\).
  • Run a Monte Carlo simulation to corroborate theoretical results built using probability theory.
  • Estimate the spread based on estimates of \(\ \bar{X}\) and \(\ \hat{SE}(\bar{X})\).
  • Understand why bias can mean that larger sample sizes aren’t necessarily better.

The textbook for this section is available here

Assessment 2.1: Introduction to Inference

  1. Sample average
    Write a function called take_sample that takes the proportion of Democrats p and the sample size N as arguments and returns the sample average of a random sample in which Democrats are 1 and Republicans are 0.

Calculate the sample average if the proportion of Democrats equals 0.45 and the sample size is 100.

Instructions

  • Define a function called take_sample that takes p and N as arguments.
  • Use the sample function as the first statement in your function to sample N elements from a vector of options where Democrats are assigned the value ‘1’ and Republicans are assigned the value ‘0’.
  • Use the mean function as the second statement in your function to find the average value of the random sample.
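One possible definition of take_sample, shown as a sketch; the printed sample average will vary with the random seed:

take_sample <- function(p, N){
  X <- sample(c(1, 0), size = N, replace = TRUE, prob = c(p, 1 - p))   # 1 = Democrat, 0 = Republican
  mean(X)                                                              # sample average
}
take_sample(0.45, 100)              # exact value depends on the random seed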
## [1] 0.46
  1. Distribution of errors - 1
    Assume the proportion of Democrats in the population p equals 0.45 and that your sample size N is 100 polled voters. The take_sample function you defined previously generates our estimate, \(\ \bar{X}\).

Replicate the random sampling 10,000 times and calculate \(\ p−\bar{X}\) for each random sample. Save these differences as a vector called errors. Find the average of errors and plot a histogram of the distribution.

Instructions

  • The function take_sample that you defined in the previous exercise has already been run for you.
  • Use the replicate function to replicate subtracting the result of take_sample from the value of p 10,000 times.
  • Use the mean function to calculate the average of the differences between the sample average and actual value of p. 
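A sketch of the simulation, assuming take_sample from the previous exercise; the average of the errors depends on the random seed but should be very close to 0:

p <- 0.45
N <- 100
B <- 10000
errors <- replicate(B, p - take_sample(p, N))   # error of each simulated poll
mean(errors)                                    # approximately 0
hist(errors)                                    # roughly symmetric around 0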
## [1] -4.9e-05
  1. Distribution of errors - 2
    In the last exercise, you made a vector of differences between the actual value for \(\ p\) and an estimate, \(\ \bar{X}\). We called these differences between the actual and estimated values errors.

The errors object has already been loaded for you. Use the hist function to plot a histogram of the values contained in the vector errors. Which statement best describes the distribution of the errors?

Possible Answers

A. The errors are all about 0.05.
B. The errors are all about -0.05.
C. The errors are symmetrically distributed around 0.
D. The errors range from -1 to 1.

  1. Average size of error
    The error \(\ p−\bar{X}\) is a random variable. In practice, the error is not observed because we do not know the actual proportion of Democratic voters, \(\ p\). However, we can describe the size of the error by constructing a simulation.

What is the average size of the error if we define the size by taking the absolute value \(\ ∣p−\bar{X}∣\)?

Instructions

  • Use the sample code to generate errors, a vector of \(\ ∣p−\bar{X}∣\).
  • Calculate the absolute value of errors using the abs function.
  • Calculate the average of these values using the mean function.
## [1] 0.039267
  1. Standard deviation of the spread
    The standard error is related to the typical size of the error we make when predicting. We say size because, as we just saw, the errors are centered around 0. In that sense, the typical error is 0. For mathematical reasons related to the central limit theorem, we actually use the standard deviation of errors rather than the average of the absolute values.

As we have discussed, the standard error is the square root of the average squared distance \(\ (\bar{X}−p)^2\). Here, estimate it by taking the square root of the average of the squared errors.

Calculate the standard deviation of the spread.

Instructions

  • Use the sample code to generate errors, a vector of \(\ ∣p−\bar{X}∣\).
  • Use ^2 to square the distances.
  • Calculate the average squared distance using the mean function.
  • Calculate the square root of these values using the sqrt function.
## [1] 0.04949939
  1. Estimating the standard error
    The theory we just learned tells us what this standard deviation is going to be because it is the standard error of \(\ \bar{X}\).

Estimate the standard error given an expected value of 0.45 and a sample size of 100.

Instructions

Calculate the standard error using the sqrt function
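A sketch consistent with the output below, using the theoretical formula with p = 0.45 and N = 100:

p <- 0.45
N <- 100
sqrt(p * (1 - p) / N)               # theoretical standard error of X_bar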

## [1] 0.04974937
  1. Standard error of the estimate
    In practice, we don’t know \(\ p\), so we construct an estimate of the theoretical prediction by plugging in \(\ \bar{X}\) for \(\ p\). Calculate the standard error of the estimate:

\(\ \hat{SE}(\bar{X})\)

Instructions

  • Simulate a poll X using the sample function.
  • When using the sample function, create a vector using c() that contains all possible polling options where ‘1’ indicates a Democratic voter and ‘0’ indicates a Republican voter.
  • When using the sample function, use replace = TRUE within the sample function to indicate that sampling from the vector should occur with replacement.
  • When using the sample function, use prob = within the sample function to indicate the probabilities of selecting either element (0 or 1) within the vector of possibilities.
  • Use the mean function to calculate the average of the simulated poll, X_bar.
  • Calculate the standard error of the X_bar using the sqrt function and print the result.
## [1] 0.04983974
  1. Plotting the standard error
    The standard error estimates obtained from the Monte Carlo simulation, the theoretical prediction, and the estimate of the theoretical prediction are all very close, which tells us that the theory is working. This gives us a practical approach to knowing the typical error we will make if we predict \(\ p\) with \(\ \bar{X}\). The theoretical result gives us an idea of how large a sample size is required to obtain the precision we need. Earlier we learned that the largest standard errors occur for \(\ p=0.5\).

Create a plot of the largest standard error for \(\ N\) ranging from 100 to 5,000. Based on this plot, how large does the sample size have to be to have a standard error of about 1%?

Possible Answers

A. 100
B. 500
C. 2,500
D. 4,000

  1. Distribution of X-hat
    For N=100, the central limit theorem tells us that the distribution of \(\ \bar{X}\) is…

Possible Answers

A. practically equal to $ p $.
B. approximately normal with expected value \(\ p\) and standard error \(\ \sqrt{p(1−p)/N}\).
C. approximately normal with expected value \(\ \bar{X}\) and standard error \(\ \sqrt{\bar{X}(1−\bar{X})/N}\).
D. not a random variable.

  1. Distribution of the errors
    We calculated a vector errors that contained, for each simulated sample, the difference between the actual value p and our estimate \(\ \bar{X}\).

The errors \(\ \bar{X}−p\) are:

Possible Answers

A. practically equal to 0.
B. approximately normal with expected value 0 and standard error \(\ \sqrt{p(1−p)/N}\).
C. approximately normal with expected value p and standard error \(\ \sqrt{p(1−p)/N}\).
D. not a random variable.

  1. Plotting the errors
    Make a qq-plot of the errors you generated previously to see if they follow a normal distribution.

Instructions

  • Run the supplied code
  • Use the qqnorm function to produce a qq-plot of the errors.
  • Use the qqline function to plot a line showing a normal distribution.

  1. Estimating the probability of a specific value of X-bar
    If \(\ p=0.45\) and \(\ N=100\), use the central limit theorem to estimate the probability that \(\ \bar{X} > 0.5\).

Instructions

Use pnorm to define the probability that a value will be greater than 0.5.

## [1] 0.1574393
  1. Estimating the probability of a specific error size
    Assume you are in a practical situation and you don’t know \(\ p\). Take a sample of size N=100 and obtain a sample average of \(\ \bar{X}=0.51\).

What is the CLT approximation for the probability that your error is equal or larger than 0.01?

Instructions

  • Calculate the standard error of the sample average using the sqrt function.
  • Use pnorm twice to define the probabilities that a value will be less than 0.01 or -0.01.
  • Calculate the probability that the error will be 0.01 or larger.
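A sketch of the CLT approximation, plugging in X_hat = 0.51 for p; it is consistent with the output shown below:

N <- 100
X_hat <- 0.51
se_hat <- sqrt(X_hat * (1 - X_hat) / N)               # plug-in estimate of the standard error
1 - (pnorm(0.01 / se_hat) - pnorm(-0.01 / se_hat))    # Pr(|X_bar - p| >= 0.01)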
## [1] 0.8414493

Section 3 Overview

In Section 3, you will look at confidence intervals and p-values.

After completing Section 3, you will be able to:

  • Calculate confidence intervals of different sizes around an estimate.
  • Understand that a confidence interval is a random interval with the given probability of falling on top of the parameter.
  • Explain the concept of “power” as it relates to inference.
  • Understand the relationship between p-values and confidence intervals and explain why reporting confidence intervals is often preferable.

The textbook for this section is available here

Assessment 3.1: Confidence Intervals and p-Values

  1. Confidence interval for p
    For the following exercises, we will use actual poll data from the 2016 election. The exercises will contain pre-loaded data from the dslabs package.

We will use all the national polls that ended within a few weeks before the election.

Assume there are only two candidates and construct a 95% confidence interval for the election night proportion p.

Instructions

  • Use filter to subset the data set for the poll data you want. Include polls that ended on or after October 31, 2016 (enddate). Only include polls that took place in the United States. Call this filtered object polls.
  • Use nrow to make sure you created a filtered object polls that contains the correct number of rows.
  • Extract the sample size N from the first poll in your subset object polls.
  • Convert the percentage of Clinton voters (rawpoll_clinton) from the first poll in polls to a proportion, X_hat. Print this value to the console.
  • Find the standard error of X_hat given N. Print this result to the console.
  • Calculate the 95% confidence interval of this estimate using the qnorm function.
  • Save the lower and upper confidence intervals as an object called ci. Save the lower confidence interval first.
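A sketch of one possible solution using the dslabs and dplyr packages; the output below shows the number of polls, N, X_hat, its standard error, and the qnorm(0.975) multiplier:

library(dslabs)
library(dplyr)
data(polls_us_election_2016)

polls <- polls_us_election_2016 %>%
  filter(enddate >= "2016-10-31" & state == "U.S.")   # national polls ending on or after Oct 31, 2016
nrow(polls)

N <- polls$samplesize[1]                      # sample size of the first poll
X_hat <- polls$rawpoll_clinton[1] / 100       # Clinton percentage converted to a proportion
se_hat <- sqrt(X_hat * (1 - X_hat) / N)       # standard error of X_hat
qnorm(0.975)                                  # z value used for a 95% confidence interval
ci <- c(X_hat - qnorm(0.975) * se_hat, X_hat + qnorm(0.975) * se_hat)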
## [1] 70
## [1] 2220
## [1] 0.47
## [1] 0.01059279
## [1] 1.959964
  1. Pollster results for p
    Create a new object called pollster_results that contains the pollster’s name, the end date of the poll, the proportion of voters who declared a vote for Clinton, the standard error of this estimate, and the lower and upper bounds of the confidence interval for the estimate.

Instructions

  • Use the mutate function to define four new columns: X_hat, se_hat, lower, and upper. Temporarily add these columns to the polls object that has already been loaded for you.
  • In the X_hat column, convert the raw poll results for Clinton to a proportion.
  • In the se_hat column, calculate the standard error of X_hat for each poll using the sqrt function.
  • In the lower column, calculate the lower bound of the 95% confidence interval using the qnorm function.
  • In the upper column, calculate the upper bound of the 95% confidence interval using the qnorm function.
  • Use the select function to select the columns from polls to save to the new object pollster_results.
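A sketch of one way to build pollster_results, assuming the filtered polls object from the previous exercise:

pollster_results <- polls %>%
  mutate(X_hat = rawpoll_clinton / 100,                      # proportion of Clinton voters
         se_hat = sqrt(X_hat * (1 - X_hat) / samplesize),    # standard error of X_hat
         lower = X_hat - qnorm(0.975) * se_hat,              # lower bound of the 95% CI
         upper = X_hat + qnorm(0.975) * se_hat) %>%          # upper bound of the 95% CI
  select(pollster, enddate, X_hat, se_hat, lower, upper)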
  1. Comparing to actual results - p
    The final tally for the popular vote was Clinton 48.2% and Trump 46.1%. Add a column called hit to pollster_results that states if the confidence interval included the true proportion p=0.482 or not. What proportion of confidence intervals included p?

Instructions

  • Use the mutate function to define a new variable called ‘hit’.
  • Use logical expressions to determine whether the values in lower and upper span the actual proportion.
  • Use the mean function to determine the average value in hit and summarize the results using summarize.
  • Save the result as an object called avg_hit.
  1. Theory of confidence intervals
    If these confidence intervals are constructed correctly, and the theory holds up, what proportion of confidence intervals should include p?

Possible Answers

A. 0.05
B. 0.31
C. 0.50
D. 0.95

  1. Confidence interval for d
    A much smaller proportion of the polls than expected produce confidence intervals containing p. Notice that most polls that fail to include p are underestimating. The rationale for this is that undecided voters historically divide evenly between the two main candidates on election day.

In this case, it is more informative to estimate the spread, that is, the difference between the proportions of the two candidates, d, which for this election is 0.482−0.461=0.021.

Assume that there are only two parties and that \(\ d=2p−1\). Construct a 95% confidence interval for the difference in proportions on election night.

Instructions

  • Use the mutate function to define a new variable called ‘d_hat’ in polls. The new variable subtracts the proportion of Trump voters from the proportion of Clinton voters.
  • Extract the sample size N from the first poll in your subset object polls.
  • Extract the difference in proportions of voters d_hat from the first poll in your subset object polls.
  • Use the formula above to calculate p from d_hat. Assign p to the variable X_hat.
  • Find the standard error of the spread given N.
  • Calculate the 95% confidence interval of this estimate of the difference in proportions, d_hat, using the qnorm function.
  • Save the lower and upper confidence intervals as an object called ci. Save the lower confidence interval first.
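A sketch consistent with the output below (d_hat of the first poll and the standard error of the spread), assuming the filtered polls object from the earlier exercises:

polls <- polls %>%
  mutate(d_hat = rawpoll_clinton / 100 - rawpoll_trump / 100)   # estimated spread for each poll
N <- polls$samplesize[1]
d_hat <- polls$d_hat[1]
X_hat <- (d_hat + 1) / 2                       # recover p from d = 2p - 1
se_hat <- 2 * sqrt(X_hat * (1 - X_hat) / N)    # SE of the spread is twice the SE of X_hat
ci <- c(d_hat - qnorm(0.975) * se_hat, d_hat + qnorm(0.975) * se_hat)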
## [1] 0.04
## [1] 0.02120683
  1. Pollster results for d
    Create a new object called pollster_results that contains the pollster’s name, the end date of the poll, the difference in the proportion of voters who declared a vote for either candidate, the standard error of this estimate, and the lower and upper bounds of the confidence interval for the estimate.

Instructions

  • Use the mutate function to define four new columns: ‘X_hat’, ‘se_hat’, ‘lower’, and ‘upper’. Temporarily add these columns to the polls object that has already been loaded for you.
  • In the X_hat column, calculate the proportion of voters for Clinton using d_hat.
  • In the se_hat column, calculate the standard error of the spread for each poll using the sqrt function.
  • In the lower column, calculate the lower bound of the 95% confidence interval using the qnorm function.
  • In the upper column, calculate the upper bound of the 95% confidence interval using the qnorm function.
  • Use the select function to select the columns from polls to save to the new object pollster_results.
  1. Comparing to actual results - d
    What proportion of confidence intervals for the difference in the proportion of voters included d, the actual difference on election day?

Instructions

  • Use the mutate function to define a new variable within pollster_results called hit.
  • Use logical expressions to determine whether the values in lower and upper span the actual difference in proportions of voters.
  • Use the mean function to determine the average value in hit and summarize the results using summarize.
  • Save the result as an object called avg_hit.
  1. Comparing to actual results by pollster
    Although the proportion of confidence intervals that include the actual difference in the proportion of voters increases substantially, it is still lower than 0.95. In the next chapter, we learn the reason for this.

To motivate our next exercises, calculate the difference between each poll’s estimate \(\ \bar{d}\) and the actual \(\ d=0.021\). Stratify this difference, or error, by pollster in a plot.

Instructions

  • Define a new variable errors that contains the difference between the estimated difference between the proportion of voters and the actual difference on election day, 0.021.
  • To create the plot of errors by pollster, add a layer with the function geom_point. The aesthetic mappings require a definition of the x-axis and y-axis variables. So the code looks like the example below, but you fill in the variables for x and y.
  • The last line of the example code adjusts the x-axis labels so that they are easier to read.
data %>% ggplot(aes(x = , y = )) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

  1. Comparing to actual results by pollster - multiple polls
    Remake the plot you made for the previous exercise, but only for pollsters that took five or more polls.

You can use dplyr tools group_by and n to group data by a variable of interest and then count the number of observations in the groups. The function filter filters data piped into it by your specified condition.

For example:

data %>% group_by(variable_for_grouping) %>%
    filter(n() >= 5)

Instructions

  • Define a new variable errors that contains the difference between the estimated difference between the proportion of voters and the actual difference on election day, 0.021.
  • Group the data by pollster using the group_by function.
  • Filter the data by pollsters with 5 or more polls.
  • Use ggplot to create the plot of errors by pollster.
  • Add a layer with the function geom_point.

Section 4 Overview

In Section 4, you will look at statistical models in the context of election polling and forecasting.

After completing Section 4, you will be able to:

  • Understand how aggregating data from different sources, as poll aggregators do for poll data, can improve the precision of a prediction.
  • Understand how to fit a multilevel model to the data to forecast, for example, election results.
  • Explain why a simple aggregation of data is insufficient to combine results because of factors such as pollster bias.
  • Use a data-driven model to account for additional types of sampling variability such as pollster-to-pollster variability.

The textbook for this section is available here

Assessment 4.1: Statistical Models

  1. Heights Revisited
    We have been using urn models to motivate the use of probability models. However, most data science applications are not related to data obtained from urns. More common are data that come from individuals. Probability plays a role because the data come from a random sample. The random sample is taken from a population and the urn serves as an analogy for the population.

Let’s revisit the heights dataset. For now, consider x to be the heights of all males in the data set. Mathematically speaking, x is our population. Using the urn analogy, we have an urn with the values of x in it.

What are the population average and standard deviation of our population?

Instructions

  • Execute the lines of code that create a vector x that contains heights for all males in the population.
  • Calculate the average of x.
  • Calculate the standard deviation of x.
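A sketch consistent with the output below, using the heights data from dslabs:

library(dslabs)
library(dplyr)
data(heights)

x <- heights %>% filter(sex == "Male") %>% pull(height)   # heights of all males: our population
mean(x)     # population average
sd(x)       # population standard deviation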
## [1] 69.31475
## [1] 3.611024
  1. Sample the population of heights
    Call the population average computed above μ and the standard deviation σ. Now take a sample of size 50, with replacement, and construct an estimate for μ and σ.

Instructions

  • Use the sample function to sample N values from x.
  • Calculate the mean of the sampled heights.
  • Calculate the standard deviation of the sampled heights.
## [1] 75 70 68 74 61 67
## [1] 70.47293
## [1] 3.426742
  1. Sample and Population Averages
    What does the central limit theorem tell us about the sample average and how it is related to μ, the population average?

Possible Answers

A. It is identical to μ.
B. It is a random variable with expected value μ and standard error \(\ \sigma/\sqrt{N}\).
C. It is a random variable with expected value μ and standard error σ.
D. It underestimates μ.

  1. Confidence Interval Calculation
    We will use \(\ \bar{X}\) as our estimate of the average height in the population from our sample of size N. We know from previous exercises that the standard error of \(\ \bar{X}-\mu\) is \(\ \sigma/\sqrt{N}\).

Construct a 95% confidence interval for μ.

Instructions

  • Use the sd and sqrt functions to define the standard error se
  • Calculate the 95% confidence intervals using the qnorm function. Save the lower then the upper confidence interval to a variable called ci.
## [1] 75 70 68 74 61 67
## [1] 0.4846145
  1. Monte Carlo Simulation for Heights
    Now run a Monte Carlo simulation in which you compute 10,000 confidence intervals as you have just done. What proportion of these intervals include μ?

Instructions

  • Use the replicate function to replicate the sample code for B <- 10000 simulations. Save the results of the replicated code to a variable called res. The replicated code should complete the following steps:
    1. Use the sample function to sample N values from x. Save the sampled heights as a vector called X.
    2. Create an object called interval that contains the 95% confidence interval for each of the samples. Use the same formula you used in the previous exercise to calculate this interval.
    3. Use the between function to determine if μ is contained within the confidence interval of that simulation.
  • Finally, use the mean function to determine the proportion of results in res that contain mu.
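A sketch of the Monte Carlo simulation, assuming x, mu, and N <- 50 from the previous exercises (between comes from dplyr); the exact proportion depends on the random seed:

library(dplyr)    # for between()
mu <- mean(x)
N <- 50
B <- 10000
res <- replicate(B, {
  X <- sample(x, N, replace = TRUE)                                  # one random sample of heights
  interval <- mean(X) + c(-1, 1) * qnorm(0.975) * sd(X) / sqrt(N)    # 95% confidence interval
  between(mu, interval[1], interval[2])                              # does the interval cover mu?
})
mean(res)         # proportion of intervals that cover mu, close to 0.95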
## [1] 0.9479
  1. Visualizing Polling Bias
    In this section, we used visualization to motivate the presence of pollster bias in election polls. Here we will examine that bias more rigorously. Let’s consider two pollsters that conducted daily polls, and look at national polls for the month before the election.

Is there a poll bias? Make a plot of the spreads for each poll.

Instructions

  • Use ggplot to plot the spread for each of the two pollsters.
  • Define the x- and y-axes using aes() within the ggplot function.
  • Use geom_boxplot to make a boxplot of the data.
  • Use geom_point to add data points to the plot.

  1. Defining Pollster Bias
    The data do seem to suggest there is a difference between the pollsters. However, these data are subject to variability. Perhaps the differences we observe are due to chance. Under the urn model, both pollsters should have the same expected value: the election day difference, \(\ d\).

We will model the observed data \(\ Y_{ij}\) in the following way:

\(\ Y_{ij} = d + b_i + \varepsilon_{ij}\)

with \(\ i=1,2\) indexing the two pollsters, \(\ b_i\) the bias for pollster \(\ i\), and \(\ \varepsilon_{ij}\) the poll-to-poll chance variability. We assume the \(\ \varepsilon_{ij}\) are independent of each other, have expected value 0, and have standard deviation \(\ \sigma_i\) regardless of \(\ j\).

Which of the following statements best reflects what we need to know to determine if our data fit the urn model?

Possible Answers

A. Is \(\ \varepsilon_{ij}=0\)?
B. How close are \(\ Y_{ij}\) to \(\ d\)?
C. Is \(\ b_1 \neq b_2\)?
D. Are \(\ b_1=0\) and \(\ b_2=0\)?

  1. Derive Expected Value
    We modelled the observed data \(\ Y_{ij}\) as:

\(\ Y_{ij} = d + b_i + \varepsilon_{ij}\)

On the right side of this model, only \(\ \varepsilon_{ij}\) is a random variable. The other two values are constants.

What is the expected value of \(\ Y_{ij}\)?

Possible Answers

A. \(\ d+b_1\)
B. \(\ b_1 + \varepsilon_{ij}\)
C. \(\ d\)
D. \(\ d + b_1 + \varepsilon_{ij}\)

  1. Expected Value and Standard Error of Poll 1
    Suppose we define \(\ \bar{Y}_1\) as the average of poll results from the first pollster and \(\ \sigma_1\) as the standard deviation of that pollster's polls.

What is the expected value and standard error of \(\ \bar{Y}_1\)?

Possible Answers

A. The expected value is \(\ d+b_1\) and the standard error is \(\ \sigma_1\)
B. The expected value is \(\ d\) and the standard error is \(\ \sigma_1/\sqrt{N_1}\)
C. The expected value is \(\ d+b_1\) and the standard error is \(\ \sigma_1/\sqrt{N_1}\)
D. The expected value is \(\ d\) and the standard error is \(\ \sigma_1+\sqrt{N_1}\)

  1. Expected Value and Standard Error of Poll 2
    Now we define \(\ \bar{Y}_2\) as the average of poll results from the second pollster.

What is the expected value and standard error of \(\ \bar{Y}_2\)?

Possible Answers

A. The expected value is \(\ d+b_2\) and the standard error is \(\ \sigma_2\)
B. The expected value is \(\ d\) and the standard error is \(\ \sigma_2/\sqrt{N_2}\)
C. The expected value is \(\ d+b_2\) and the standard error is \(\ \sigma_2/\sqrt{N_2}\)
D. The expected value is \(\ d\) and the standard error is \(\ \sigma_2 + \sqrt{N_2}\)

  1. Difference in Expected Values Between Polls
    Using what we learned by answering the previous questions, what is the expected value of \(\ \bar{Y}_2 - \bar{Y}_1\)?

Possible Answers

A. \(\ (b_2 - b_1)^2\)
B. \(\ b_2 - b_1/\sqrt{N}\)
C. \(\ b_2 + b_1\)
D. \(\ b_2 - b_1\)

  1. Standard Error of the Difference Between Polls
    Using what we learned by answering the questions above, what is the standard error of \(\ \bar{Y}_2 - \bar{Y}_1\)?

Possible Answers

A. \(\ \sqrt{\sigma_2^2/N_2 + \sigma_1^2/N_1}\)
B. \(\ \sqrt{\sigma_2/N_2 + \sigma_1/N_1}\)
C. \(\ (\sigma_2^2/N_2 + \sigma_1^2/N_1)^2\)
D. \(\ \sigma_2^2/N_2 + \sigma_1^2/N_1\)

  1. Compute the Estimates
    The answer to the previous question depends on \(\ σ_1\) and \(\ σ_2\), which we don’t know. We learned that we can estimate these values using the sample standard deviation.

Compute the estimates of \(\ σ_1\) and \(\ σ_2\).

Instructions

  • Group the data by pollster.
  • Summarize the standard deviation of the spreads for each of the two pollsters.
  • Store the pollster names and standard deviations of the spreads (σ) in an object called sigma
  1. Probability Distribution of the Spread
    What does the central limit theorem tell us about the distribution of the differences between the pollster averages, \(\ \bar{Y}_2 - \bar{Y}_1\)?

Possible Answers

A. The central limit theorem cannot tell us anything because this difference is not the average of a sample.
B. Because \(\ Y_{ij}\) are approximately normal, the averages are normal too.
C. If we assume \(\ N_2\) and \(\ N_1\) are large enough, \(\ \bar{Y}_2\) and \(\ \bar{Y}_1\), and their difference, are approximately normal.
D. These data do not contain vectors of 0 and 1, so the central limit theorem does not apply.

  1. Calculate the 95% Confidence Interval of the Spreads
    We have constructed a random variable that has expected value \(\ b_2−b_1\), the pollster bias difference. If our model holds, then this random variable has an approximately normal distribution. The standard error of this random variable depends on \(\ σ_1\) and \(\ σ_2\), but we can use the sample standard deviations we computed earlier. We have everything we need to answer our initial question: is \(\ b_2−b_1\) different from 0?

Construct a 95% confidence interval for the difference between \(\ b_2\) and \(\ b_1\). Does this interval contain zero?

Instructions

  • Use pipes %>% to pass the data polls on to functions that will group by pollster and summarize the average spread, standard deviation, and number of polls per pollster.
  • Calculate the estimate by subtracting the average spreads.
  • Calculate the standard error using the standard deviations of the spreads and the sample size.
  • Calculate the 95% confidence intervals using the qnorm function. Save the lower and then the upper confidence interval to a variable called ci.
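A sketch of one possible solution, assuming polls is the pre-loaded data set for the two pollsters (its column names are printed below) with a spread column; it is consistent with the estimate and standard error shown:

names(polls)                                   # columns available in the pre-loaded polls data
res <- polls %>%
  group_by(pollster) %>%
  summarize(avg = mean(spread), s = sd(spread), N = n())
estimate <- res$avg[2] - res$avg[1]            # difference between the two pollster averages
se_hat <- sqrt(res$s[2]^2 / res$N[2] + res$s[1]^2 / res$N[1])
ci <- c(estimate - qnorm(0.975) * se_hat, estimate + qnorm(0.975) * se_hat)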
##  [1] "state"            "startdate"        "enddate"         
##  [4] "pollster"         "grade"            "samplesize"      
##  [7] "population"       "rawpoll_clinton"  "rawpoll_trump"   
## [10] "rawpoll_johnson"  "rawpoll_mcmullin" "adjpoll_clinton" 
## [13] "adjpoll_trump"    "adjpoll_johnson"  "adjpoll_mcmullin"
## [16] "spread"
## [1] 0.05229167
## [1] 0.007031433
  1. Calculate the P-value
    The confidence interval tells us there is a relatively strong pollster effect resulting in a difference of about 5%. Random variability does not seem to explain it.

Compute a p-value to relay the fact that chance does not explain the observed pollster effect.

Instructions

  • Use the pnorm function to calculate the probability that a random value is larger than the observed ratio of the estimate to the standard error.
  • Multiply the probability by 2, because this is the two-tailed test.
## [1] 1.030287e-13
  1. Comparing Within-Poll and Between-Poll Variability
    We compute a statistic called the t-statistic by dividing our estimate of \(\ b_2−b_1\) by its estimated standard error:

\(\ \frac{\bar{Y}_2 - \bar{Y}_1}{\sqrt{s_2^2/N_2 + s_1^2/N_1}}\)

Later we will learn of another approximation for the distribution of this statistic for values of \(\ N_2\) and \(\ N_1\) that aren’t large enough for the CLT.

Note that our data has more than two pollsters. We can also test for pollster effect using all pollsters, not just two. The idea is to compare the variability across polls to variability within polls. We can construct statistics to test for effects and approximate their distribution. The area of statistics that does this is called Analysis of Variance or ANOVA. We do not cover it here, but ANOVA provides a very useful set of tools to answer questions such as: is there a pollster effect?

Compute the average and standard deviation for each pollster and examine the variability across the averages and how it compares to the variability within the pollsters, summarized by the standard deviation.

Instructions

  • Group the polls data by pollster.
  • Summarize the average and standard deviation of the spreads for each pollster.
  • Create an object called var that contains three columns: pollster, mean spread, and standard deviation.
  • Be sure to name the column for mean avg and the column for standard deviation s.

Section 5 Overview

In Section 5, you will learn about Bayesian statistics through looking at examples from rare disease diagnosis and baseball.

After completing Section 5, you will be able to:

  • Apply Bayes’ theorem to calculate the probability of A given B.
  • Understand how to use hierarchical models to make better predictions by considering multiple levels of variability.
  • Compute a posterior probability using an empirical Bayesian approach.
  • Calculate a 95% credible interval from a posterior probability.

The textbook for this section is available here

Assessment 5.1: Bayesian Statistics

  1. Statistics in the Courtroom
    In 1999 in England Sally Clark was found guilty of the murder of two of her sons. Both infants were found dead in the morning, one in 1996 and another in 1998, and she claimed the cause of death was sudden infant death syndrome (SIDS). No evidence of physical harm was found on the two infants so the main piece of evidence against her was the testimony of Professor Sir Roy Meadow, who testified that the chances of two infants dying of SIDS was 1 in 73 million. He arrived at this figure by finding that the rate of SIDS was 1 in 8,500 and then calculating that the chance of two SIDS cases was 8,500 × 8,500 ≈ 73 million.

Based on what we’ve learned throughout this course, which statement best describes a potential flaw in Sir Meadow’s reasoning?

Possible Answers

A. Sir Meadow assumed the second death was independent of the first son being affected, thereby ignoring possible genetic causes.
B. There is no flaw. The multiplicative rule always applies in this way: Pr(A and B)=Pr(A)Pr(B)
C. Sir Meadow should have added the probabilities: Pr(A and B)=Pr(A)+Pr(B)
D. The rate of SIDS is too low to perform these types of statistics.

  1. Recalculating the SIDS Statistics
    Let’s assume that there is in fact a genetic component to SIDS and that the probability Pr(second case of SIDS ∣ first case of SIDS) = 1/100 is much higher than 1 in 8,500.

What is the probability of both of Sally Clark’s sons dying of SIDS?

Instructions

  • Calculate the probability of both sons dying to SIDS.
## [1] 1.176471e-06
  1. Bayes’ Rule in the Courtroom
    Many press reports stated that the expert claimed that the probability of Sally Clark being innocent was 1 in 73 million. Perhaps the jury and judge also interpreted the testimony this way. This probability can be written as:

\(\ \mbox{Pr}(\mbox{mother is a murderer} \mid \mbox{two children found dead with no evidence of harm})\)

According to Bayes’ rule, which of the following expressions is equal to this probability?

Possible Answers

A. \(\ \frac{\mbox{Pr}(\mbox{two children found dead with no evidence of harm}) \mbox{Pr}(\mbox{mother is a murderer})}{\mbox{Pr}(\mbox{two children found dead with no evidence of harm})}\)

B. \(\ \mbox{Pr}(\mbox{two children found dead with no evidence of harm})\mbox{Pr}(\mbox{mother is a murderer} )\)

C. \(\ \frac{\mbox{Pr}(\mbox{two children found dead with no evidence of harm} \mid \mbox{mother is a murderer} ) \mbox{Pr}(\mbox{mother is a murderer})}{\mbox{Pr}(\mbox{two children found dead with no evidence of harm})}\)

D. 1/8500

  1. Calculate the Probability
    Assume that the probability of a murderer finding a way to kill her two children without leaving evidence of physical harm is:

\(\ \mbox{Pr}(\mbox{two children found dead with no evidence of harm} \mid \mbox{mother is a murderer} ) = 0.50\)

Assume that the murder rate among mothers is 1 in 1,000,000.

\(\ \mbox{Pr}(\mbox{mother is a murderer} ) = 1/1,000,000\)

According to Bayes’ rule, what is the probability of:

\(\ \mbox{Pr}(\mbox{mother is a murderer} \mid \mbox{two children found dead with no evidence of harm})\)

Instructions

Use Bayes’ rule to calculate the probability that the mother is a murderer, considering the rates of murdering mothers in the population, the probability that two siblings die of SIDS, and the probability that a murderer kills children without leaving evidence of physical harm.
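A sketch of the calculation, consistent with the output below; note that it approximates the denominator Pr(two children found dead with no evidence of harm) by the probability of two SIDS deaths:

Pr_1 <- 1 / 8500        # Pr(first son dies of SIDS)
Pr_2 <- 1 / 100         # Pr(second son dies of SIDS | first son died of SIDS)
Pr_B <- Pr_1 * Pr_2     # approximate Pr(two children found dead with no evidence of harm)
Pr_A <- 1 / 1000000     # Pr(mother is a murderer)
Pr_BA <- 0.50           # Pr(evidence | mother is a murderer)
Pr_BA * Pr_A / Pr_B     # Bayes' rule: Pr(mother is a murderer | two children found dead with no evidence of harm)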

## [1] 0.425
  1. Misuse of Statistics in the Courts
    After Sally Clark was found guilty, the Royal Statistical Society issued a statement saying that there was “no statistical basis” for the expert’s claim. They expressed concern at the “misuse of statistics in the courts”. Eventually, Sally Clark was acquitted in June 2003.

In addition to misusing the multiplicative rule as we saw earlier, what else did Sir Meadow miss?

Possible Answers

A. He made an arithmetic error in forgetting to divide by the rate of SIDS in siblings.
B. He did not take into account how rare it is for a mother to murder her children.
C. He mixed up the numerator and denominator of Bayes’ rule.
D. He did not take into account murder rates in the population.

  1. Back to Election Polls
    Florida is one of the most closely watched states in the U.S. election because it has many electoral votes and the election is generally close. Create a table with the poll spread results from Florida taken during the last days before the election using the sample code.

The CLT tells us that the average of these spreads is approximately normal. Calculate a spread average and provide an estimate of the standard error.

Instructions

  • Calculate the average of the spreads. Call this average avg in the final table.
  • Calculate an estimate of the standard error of the spreads. Call this standard error se in the final table.
  • Use the mean and sd functions nested within summarize to find the average and standard deviation of the grouped spread data.
  • Save your results in an object called results.
  1. The Prior Distribution
    Assume a Bayesian model sets the prior distribution for Florida’s election night spread d to be normal with expected value μ and standard deviation τ.

What are the interpretations of μ and τ?

Possible Answers

A. μ and τ are arbitrary numbers that let us make probability statements about d.
B. μ and τ summarize what we would predict for Florida before seeing any polls.
C. μ and τ summarize what we want to be true. We therefore set μ at 0.10 and τ at 0.01.
D. The choice of prior has no effect on the Bayesian analysis.

  1. Estimate the Posterior Distribution
    The CLT tells us that our estimate of the spread \(\ \hat{d}\) has a normal distribution with expected value d and standard deviation σ, which we calculated in a previous exercise.

Use the formulas for the posterior distribution to calculate the expected value of the posterior distribution if we set μ=0 and τ=0.01.

Instructions

  • Define μ and τ. Identify which elements stored in the object results represent σ and Y.
  • Estimate B using σ and τ.
  • Estimate the posterior distribution using B, μ, and Y.
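A sketch of the posterior calculation, assuming the results object (with avg and se) from the Florida polls exercise above; it is consistent with the two values printed below (the shrinkage factor B and the posterior expected value):

mu <- 0
tau <- 0.01
sigma <- results$se                   # standard error of the observed average spread
Y <- results$avg                      # observed average spread
B <- sigma^2 / (sigma^2 + tau^2)      # weight placed on the prior mean
B
B * mu + (1 - B) * Y                  # posterior expected value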
## [1] 0.342579
## [1] 0.002731286
  1. Standard Error of the Posterior Distribution
    Compute the standard error of the posterior distribution.

Instructions

  • Using the variables we have defined so far, calculate the standard error of the posterior distribution.
  • Print this value to the console.
## [1] 0.005853024
  1. Constructing a Credible Interval
    Using the fact that the posterior distribution is normal, create an interval that has a 95% chance of occurring, centered at the posterior expected value. Note that we call these credible intervals.

Instructions

  • Calculate the 95% credible intervals using the qnorm function.
  • Save the lower and upper confidence intervals as an object called ci. Save the lower confidence interval first.
## [1] 0.002731286
## [1] -0.008740432  0.014203003
  1. Odds of Winning Florida
    According to this analysis, what was the probability that Trump wins Florida?

Instructions

  • Using the pnorm function, calculate the probability that the spread in Florida was less than 0.
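A sketch, assuming B, mu, Y, sigma, and tau from the previous exercises; a negative spread corresponds to a Trump win, so the probability is the posterior mass below 0:

exp_value <- B * mu + (1 - B) * Y                 # posterior expected value
se <- sqrt(1 / (1 / sigma^2 + 1 / tau^2))         # posterior standard error
pnorm(0, exp_value, se)                           # Pr(spread < 0), i.e., Trump wins Florida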
## [1] 0.3203769
  1. Change the Priors
    We had set the prior standard deviation τ to 0.01, reflecting that these races are often close.

Change the prior standard deviation to include values ranging from 0.005 to 0.05 and observe how the probability of Trump winning Florida changes by making a plot.

Instructions

  • Create a vector of values of taus by executing the sample code.
  • Create a function using function(){} called p_calc that first calculates B given tau and sigma and then calculates the probability of Trump winning, as we did in the previous exercise.
  • Apply your p_calc function across all the new values of taus.
  • Use the plot function to plot τ on the x-axis and the new probabilities on the y-axis.

Section 6 Overview

In Section 6, you will learn about election forecasting, building on what you’ve learned in the previous sections about statistical modeling and Bayesian statistics.

After completing Section 6, you will be able to:

  • Understand how pollsters use hierarchical models to forecast the results of elections.
  • Incorporate multiple sources of variability into a mathematical model to make predictions.
  • Construct confidence intervals that better model deviations such as those seen in election data using the t-distribution.

There are 2 assignments that use the DataCamp platform for you to practice your coding skills.

The textbook for this section is available here

Assessment 6.1: Election Forecasting

  1. Confidence Intervals of Polling Data
    For each poll in the polling data set, use the CLT to create a 95% confidence interval for the spread. Create a new table called cis that contains columns for the lower and upper limits of the confidence intervals.

Instructions

  • Use pipes %>% to pass the poll object on to the mutate function, which creates new variables.
  • Create a variable called X_hat that contains the estimate of the proportion of Clinton voters for each poll.
  • Create a variable called se that contains the standard error of the spread.
  • Calculate the confidence intervals using the qnorm function and your calculated se.
  • Use the select function to keep the following columns: state, startdate, enddate, pollster, grade, spread, lower, upper.
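A sketch of one way to build cis, assuming the pre-loaded polls object includes samplesize and spread columns:

cis <- polls %>%
  mutate(X_hat = (spread + 1) / 2,                           # implied Clinton proportion, from d = 2p - 1
         se = 2 * sqrt(X_hat * (1 - X_hat) / samplesize),    # standard error of the spread
         lower = spread - qnorm(0.975) * se,
         upper = spread + qnorm(0.975) * se) %>%
  select(state, startdate, enddate, pollster, grade, spread, lower, upper)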
  1. Compare to Actual Results
    You can add the final result to the cis table you just created using the left_join function as shown in the sample code.

Now determine how often the 95% confidence interval includes the actual result.

Instructions

  • Create an object called p_hits that contains the proportion of intervals that contain the actual spread using the following steps.
  • Use the mutate function to create a new variable called hit that contains a logical vector for whether the actual_spread falls between the lower and upper confidence intervals.
  • Summarize the proportion of values in hit that are true as a variable called proportion_hits.
  1. Stratify by Pollster and Grade
    Now find the proportion of hits for each pollster. Show only pollsters with more than 5 polls and order them from best to worst. Show the number of polls conducted by each pollster and the FiveThirtyEight grade of each pollster.

Instructions

  • Create an object called p_hits that contains the proportion of intervals that contain the actual spread using the following steps.
  • Use the mutate function to create a new variable called hit that contains a logical vector for whether the actual_spread falls between the lower and upper confidence intervals.
  • Use the group_by function to group the data by pollster.
  • Use the filter function to filter for pollsters that have more than 5 polls.
  • Summarize the proportion of values in hit that are true as a variable called proportion_hits. Also create new variables for the number of polls by each pollster using the n() function and the grade of each poll.
  • Use the arrange function to arrange the proportion_hits in descending order.
  1. Stratify by State
    Repeat the previous exercise, but instead of pollster, stratify by state. Here we can’t show grades.

Instructions

  • Create an object called p_hits that contains the proportion of intervals that contain the actual spread using the following steps.
  • Use the mutate function to create a new variable called hit that contains a logical vector for whether the actual_spread falls between the lower and upper confidence intervals.
  • Use the group_by function to group the data by state.
  • Use the filter function to filter for states that have more than 5 polls.
  • Summarize the proportion of values in hit that are true as a variable called proportion_hits. Also create new variables for the number of polls in each state using the n() function.
  • Use the arrange function to arrange the proportion_hits in descending order.
  1. Plotting Prediction Results
    Make a barplot based on the result from the previous exercise.

Instructions

  • Reorder the states in order of the proportion of hits.
  • Using ggplot, set the aesthetic with state as the x-variable and proportion of hits as the y-variable.
  • Use geom_bar to indicate that we want to plot a barplot. Specify stat = "identity" to indicate that the height of the bar should match the value.
  • Use coord_flip to flip the axes so the states are displayed from top to bottom and proportions are displayed from left to right.

  1. Predicting the Winner
    Even if a forecaster’s confidence intervals are incorrect, the overall predictions will do better if they correctly call the winner.

Add two columns to the cis table by computing, for each poll, the difference between the predicted spread and the actual spread, and define a column hit that is true if the signs are the same.

Instructions

  • Use the mutate function to add two new variables to the cis object: error and hit.
  • For the error variable, subtract the actual spread from the spread.
  • For the hit variable, return “TRUE” if the poll predicted the actual winner.
  • Save the new table as an object called errors.
  • Use the tail function to examine the last 6 rows of errors.
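A sketch, assuming cis already has the actual_spread column joined in as described above:

errors <- cis %>%
  mutate(error = spread - actual_spread,                # difference between predicted and actual spread
         hit = sign(spread) == sign(actual_spread))     # TRUE if the poll called the winner correctly
tail(errors)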
  1. Plotting Prediction Results
    Create an object called p_hits that contains the proportion of instances when the sign of the actual spread matches the predicted spread for states with more than 5 polls.

Make a barplot based on the result from the previous exercise that shows the proportion of times the sign of the spread matched the actual result for the data in p_hits.

Instructions

  • Use the group_by function to group the data by state.
  • Use the filter function to filter for states that have more than 5 polls.
  • Summarize the proportion of values in hit that are true as a variable called proportion_hits. Also create new variables for the number of polls in each state using the n() function.
  • To make the plot, follow these steps:
  • Reorder the states in order of the proportion of hits.
  • Using ggplot, set the aesthetic with state as the x-variable and proportion of hits as the y-variable.
  • Use geom_bar to indicate that we want to plot a barplot.
  • Use coord_flip to flip the axes so the states are displayed from top to bottom and proportions are displayed from left to right.

  1. Plotting the Errors
    In the previous graph, we see that most states’ polls predicted the correct winner 100% of the time. Only a few states’ polls were incorrect more than 25% of the time. Wisconsin got every single poll wrong. In Pennsylvania and Michigan, more than 90% of the polls had the signs wrong.

Make a histogram of the errors. What is the median of these errors?

Instructions

  • Use the hist function to generate a histogram of the errors
  • Use the median function to compute the median error

## [1] 0.037
  1. Plot Bias by State
    We see that, at the state level, the median error was slightly in favor of Clinton. The distribution is not centered at 0, but at 0.037. This value represents the general bias we described in an earlier section.

Create a boxplot to examine if the bias was general to all states or if it affected some states differently. Filter the data to include only pollsters with grades B+ or higher.

Instructions

  • Use the filter function to filter the data for polls with grades equal to A+, A, A-, or B+.
  • Use the reorder function to order the state data by error.
  • Using ggplot, set the aesthetic with state as the x-variable and error as the y-variable.
  • Use geom_boxplot to indicate that we want to plot a boxplot.
  • Use geom_point to add data points as a layer.
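A sketch of the plot, assuming the errors table from the earlier exercises and the ggplot2 package:

library(ggplot2)
errors %>%
  filter(grade %in% c("A+", "A", "A-", "B+")) %>%    # keep polls from higher-graded pollsters
  mutate(state = reorder(state, error)) %>%          # order states by their error
  ggplot(aes(x = state, y = error)) +
  geom_boxplot() +
  geom_point()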

  1. Filter Error Plot
    Some of these states only have a few polls. Repeat the previous exercise to plot the errors for each state, but only include states with five good polls or more.

Instructions

  • Use the filter function to filter the data for polls with grades equal to A+, A, A-, or B+.
  • Group the filtered data by state using group_by.
  • Use the filter function to filter the data for states with at least 5 polls.
  • Use the reorder function to order the state data by error.
  • Using ggplot, set the aesthetic with state as the x-variable and error as the y-variable.
  • Use geom_boxplot to indicate that we want to plot a boxplot.
  • Use geom_point to add data points as a layer.

Assessment 6.2: The t-Distribution

  1. Using the t-Distribution
    We know that, with a normal distribution, only 5% of values are more than 2 standard deviations away from the mean.

Calculate the probability of seeing t-distributed random variables being more than 2 in absolute value when the degrees of freedom are 3.

Instructions

Use the pt function to calculate the probability of seeing a value less than or equal to the argument.
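A sketch consistent with the output below:

1 - (pt(2, df = 3) - pt(-2, df = 3))    # Pr(|T| > 2) for a t-distribution with 3 degrees of freedom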

## [1] 0.139326
  1. Plotting the t-distribution
    Now use sapply to compute the same probability for degrees of freedom from 3 to 50.

Make a plot and notice when this probability converges to the normal distribution’s 5%.

Instructions

  • Make a vector called df that contains a sequence of numbers from 3 to 50.
  • Using function, make a function called pt_func that recreates the calculation for the probability that a value is greater than 2 as an absolute value for any given degrees of freedom.
  • Use sapply to apply the pt_func function across all values contained in df. Call these probabilities probs.
  • Use the plot function to plot df on the x-axis and probs on the y-axis.

  1. Sampling From the Normal Distribution
    In a previous section, we repeatedly took random samples of 50 heights from a distribution of heights. We noticed that about 95% of the samples had confidence intervals spanning the true population mean.

Re-do this Monte Carlo simulation, but now instead of N=50, use N=15. Notice what happens to the proportion of hits.

Instructions

  • Use the replicate function to carry out the simulation. Specify the number of times you want the code to run and, within brackets, the three lines of code that should run.
  • First use the sample function to randomly sample N values from x.
  • Second, create a vector called interval that calculates the 95% confidence interval for the sample. You will use the qnorm function.
  • Third, use the between function to determine if the population mean mu is contained between the confidence intervals.
  • Save the results of the Monte Carlo function to a vector called res.
  • Use the mean function to determine the proportion of hits in res.
## [1] 0.9331
  1. Sampling from the t-Distribution
    N=15 is not that big. We know that heights are normally distributed, so the t-distribution should apply. Repeat the previous Monte Carlo simulation using the t-distribution instead of using the normal distribution to construct the confidence intervals.

What proportion of the 95% confidence intervals span the actual mean height now?

Instructions

  • Use the replicate function to carry out the simulation. Specify the number of times you want the code to run and, within brackets, the three lines of code that should run.
  • First use the sample function to randomly sample N values from x.
  • Second, create a vector called interval that calculates the 95% confidence interval for the sample. Remember to use the qt function this time to generate the confidence interval.
  • Third, use the between function to determine if the population mean mu is contained between the confidence intervals.
  • Save the results of the Monte Carlo function to a vector called res.
  • Use the mean function to determine the proportion of hits in res.
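A sketch of the t-based simulation, assuming x, mu <- mean(x), and the between function from dplyr as before; the exact proportion depends on the random seed:

N <- 15
B <- 10000
res <- replicate(B, {
  X <- sample(x, N, replace = TRUE)
  interval <- mean(X) + c(-1, 1) * qt(0.975, df = N - 1) * sd(X) / sqrt(N)   # t-based 95% CI
  between(mu, interval[1], interval[2])
})
mean(res)      # typically much closer to 0.95 than the qnorm-based intervals with N = 15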
## [1] 0.9512
  1. Why the t-Distribution?
    Why did the t-distribution confidence intervals work so much better?

Possible Answers

A. The t-distribution takes the variability into account and generates larger confidence intervals.
B. Because the t-distribution shifts the intervals in the direction towards the actual mean.
C. This was just a chance occurrence. If we run it again, the CLT will work better.
D. The t-distribution is always a better approximation than the normal distribution.

Section 7 Overview

In Section 7, you will learn how to use association and chi-squared tests to perform inference for binary, categorical, and ordinal data through an example looking at research funding rates.

After completing Section 7, you will be able to:

  • Use association and chi-squared tests to perform inference on binary, categorical, and ordinal data.
  • Calculate an odds ratio to get an idea of the magnitude of an observed effect.

The textbook for this section is available here

Assessment 7.1: Association and Chi-Squared Tests

  1. Comparing Proportions of Hits
    In a previous exercise, we determined whether or not each poll predicted the correct winner for their state in the 2016 U.S. presidential election. Each poll was also assigned a grade by the poll aggregator. Now we’re going to determine if polls rated A- made better predictions than polls rated C-.

In this exercise, filter the errors data for just polls with grades A- and C-. Calculate the proportion of times each grade of poll predicted the correct winner.

Instructions

  • Filter errors for grades A- and C-.
  • Group the data by grade and hit.
  • Summarize the number of hits for each grade.
  • Generate a two-by-two table containing the number of hits and misses for each grade.
  • Calculate the proportion of times each grade was correct.
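A simplified sketch, assuming the errors table from the previous assessment with grade and hit columns; the two proportions printed below are for C- and A- polls, respectively:

library(dplyr)
ac <- errors %>% filter(grade %in% c("A-", "C-"))
two_by_two <- table(ac$hit, as.character(ac$grade))   # hits and misses for each grade
mean(ac$hit[ac$grade == "C-"])                        # proportion of correct calls for C- polls
mean(ac$hit[ac$grade == "A-"])                        # proportion of correct calls for A- polls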
## [1] 0.8030303
## [1] 0.8614958
  1. Chi-squared Test
    We found that the A- polls predicted the correct winner about 86% of the time in their states and C- polls predicted the correct winner about 80% of the time.

Use a chi-squared test to determine if these proportions are different.

Instructions

  • Use the chisq.test function to perform the chi-squared test. Save the results to an object called chisq_test.
  • Print the p-value of the test to the console.
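A sketch of the test, assuming a two-by-two table of hits and misses by grade like the one constructed above; chisq.test applies Yates' continuity correction by default for 2 x 2 tables, as in the output below:

chisq_test <- chisq.test(two_by_two)   # chi-squared test on the 2 x 2 table of hits by grade
chisq_test
chisq_test$p.value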
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  .
## X-squared = 2.1053, df = 1, p-value = 0.1468
## [1] 0.1467902
  1. Odds Ratio Calculation
    It doesn’t look like the grade A- polls performed significantly differently than the grade C- polls in their states.

Calculate the odds ratio to determine the magnitude of the difference in performance between these two grades of polls.

Instructions

  • Calculate the odds that a grade C- poll predicts the correct winner. Save this result to a variable called odds_C.
  • Calculate the odds that a grade A- poll predicts the correct winner. Save this result to a variable called odds_A.
  • Calculate the odds ratio that tells us how many times larger the odds of a grade A- poll are at predicting the winner than the odds of a grade C- poll.
## [1] 0.6554539
  1. Significance
    We did not find meaningful differences between the poll results from grade A- and grade C- polls in this subset of the data, which only contains polls for about a week before the election. Imagine we expanded our analysis to include all election polls and repeated the analysis. In this hypothetical scenario, the p-value for the difference in prediction success is 0.0015 and the odds ratio describing the effect size of the performance of grade A- over grade C- polls is 1.07.

Based on what we learned in the last section, which statement reflects the best interpretation of this result?

Possible Answers

A. The p-value is below 0.05, so there is a significant difference. Grade A- polls are significantly better at predicting winners.
B. The p-value is too close to 0.05 to call this a significant difference. We do not observe a difference in performance.
C. The p-value is below 0.05, but the odds ratio is very close to 1. There is not a scientifically significant difference in performance.
D. The p-value is below 0.05 and the odds ratio indicates that grade A- polls perform significantly better than grade C- polls.