---
title: "Data Science: Inference and Modelling part I"
subtitle: "HarvardX: PH125.4x"
author: "Rafael Irizarry"
output: html_notebook
---

</br>

<h2>Sampling Model Parameters and Estimates</h2>

</br>

To help us understand the connection between polls and the probability theory that we have learned, let's
construct a scenario that we can work through together and that is similar to the one that pollsters face.

We will use an urn of beads instead of voters. And because pollsters compete with other pollsters for media attention, we will imitate that by holding our own competition with a \$25 prize. The challenge is to guess the spread between the proportion of blue and red beads in this urn. Before making a prediction, you can take a sample, with replacement, from the urn.

To mimic the fact that running polls is expensive, it will cost you \$0.10 per bead you sample. So if your sample size is 250 and you win, you'll break even, as you'll have to pay me \$25 to collect your \$25.

Your entry into the competition can be an interval. If the interval you submit contains the true proportion,
you get back half of what you paid and pass to the second phase of the competition.

In the second phase of the competition, the entry with the smallest interval is selected as the winner.

The dslabs package includes a function that shows a random draw from the urn that we just saw. Here's the code that you can write to see a sample.

```{r}
library(tidyverse)
library(dslabs)
ds_theme_set()

# And here is a sample with 25 beads.
take_poll(25)
```


OK, now that you know the rules, think about how you would construct your interval. How many beads would you sample, and so on?

Notice that we have just described a simple sampling model for opinion polls. The beads inside the urn represent the individuals that will vote on election day. Those that will vote Republican are represented with red beads and the Democrats with blue beads. For simplicity, assume there are no other colors, that there are just two parties.

We want to predict the proportion of blue beads in the urn. Let's call this quantity **p**, which in turn tells us the proportion of red beads, $1 - p$, and the spread, $p - (1 - p)$, which simplifies to $2p - 1$.

In statistics textbooks, the beads in the urn are called the *population*. The proportion of blue beads in the population, **p**, is called a *parameter*. The 25 beads that we saw in the earlier plot are called a *sample*.

The task of statistical inference is to predict the parameter, p, using the observed data in the sample. Now, can we do this with just the 25 observations we showed you?

Well, they are certainly informative. For example, given that we see 13 red and 12 blue, it is unlikely that p is bigger than 0.9 or smaller than 0.1, because if it were, it would be improbable to see 13 red and 12 blue. But are we ready to predict with certainty that there are more red beads than blue?

OK, what we want to do is construct an estimate of p using only the information we observe. An estimate can be thought of as a summary of the observed data that we think is informative about the parameter of interest.
It seems intuitive to think that the proportion of blue beads in the sample, which in this case is 0.48, must be at least related to the actual proportion p. But do we simply predict p to be 0.48? 

First, note that the sample proportion is a random variable. If we run the command `take_poll(25)`, say, four times, we get four different answers: each time the sample is different, and so is the sample proportion.

```{r}
par(mfrow = c(2, 2))
take_poll(25)
take_poll(25)
take_poll(25)
take_poll(25)
```

Note that in the four random samples we show, the sample proportion ranges from 0.44 to 0.6. By describing the distribution of this random variable, we'll be able to gain insights into how good this estimate is and how we can make it better.
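
The variability seen in the four plots can also be mimicked without `take_poll()`. The sketch below is only an illustration under stated assumptions: it invents an urn whose proportion of blue beads is `p_assumed = 0.45` (a made-up value, not the one inside the dslabs urn) and uses base R's `sample()` to draw 25 beads with replacement four times.

```{r}
# Hypothetical urn: the proportion of blue beads is an assumption
# made only for this illustration.
p_assumed <- 0.45
n_beads <- 25

set.seed(1)  # fix the seed so the sketch is reproducible

# Draw four samples of 25 beads with replacement and record the
# proportion of blue beads in each one.
sample_proportions <- replicate(4, {
  beads <- sample(c("blue", "red"), size = n_beads, replace = TRUE,
                  prob = c(p_assumed, 1 - p_assumed))
  mean(beads == "blue")
})

sample_proportions
```

Each run with a different seed gives four different proportions, which is exactly what it means for the sample proportion to be a random variable.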

</br>
</br>

---

</br>

<h2>The Sample Average</h2>

</br>


Taking an opinion poll is being modeled as taking a random sample from an urn. We are proposing the use of the proportion of blue beads in our sample as an estimate of the parameter p. Once we have this estimate, we can easily report an estimate of the spread, $2p - 1$, by plugging it in for p.

But for simplicity, we will illustrate the concept of statistical inference for estimating p. We will use our knowledge of probability to defend our use of the sample proportion, and quantify how close we think it is to the population proportion p.

We start by defining the random variable **X**. X is going to be 1 if we pick a blue bead at random, and 0 if it's red. This implies that we're assuming that the population, the beads in the urn,
are a list of 0s and 1s.

If we sample N beads, then the average of the draws $X_1$ through $X_N$ is equivalent to the proportion of blue beads in our sample. This is because adding the Xs is equivalent to counting the blue beads, and dividing by the total N turns this into a proportion. We use the symbol $\bar{X}$ to represent this average. In general, in statistics textbooks, a bar on top of a symbol means the average.
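
To see that averaging the 0s and 1s is the same as computing the proportion of blue beads, here is a minimal sketch. The urn proportion `p_assumed` and the variable names are assumptions chosen for illustration; they are not part of the dslabs package.

```{r}
set.seed(2)
p_assumed <- 0.45  # hypothetical proportion of blue beads in the urn
N <- 25            # sample size

# Draw N beads with replacement.
beads <- sample(c("blue", "red"), size = N, replace = TRUE,
                prob = c(p_assumed, 1 - p_assumed))

# Code each draw as X_i = 1 for a blue bead and X_i = 0 for a red one.
x <- ifelse(beads == "blue", 1, 0)

# Count the blue beads directly, then compare with the average of the Xs.
n_blue <- sum(beads == "blue")
x_bar <- sum(x) / N

c(x_bar = x_bar, proportion_blue = n_blue / N)
```

The two numbers agree because adding the 0s and 1s counts the blue beads.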

The theory we just learned about the sum of draws becomes useful because we know the distribution of the sum, N times X-bar. We therefore also know the distribution of the average X-bar, because N is a nonrandom constant.

$$N \bar{X} = N \times \frac{X_1 + X_2 + \dots + X_N}{N} = X_1 + X_2 + \dots + X_N$$

For simplicity, let's assume that the draws are independent. After we see each sample bead, we return it to the urn. It's a sample with replacement. In this case, what do we know about the distribution of the sum of draws?

First, we know that the expected value of the sum of draws is N times the average of the values in the urn. We know that the average of the 0s and 1s in the urn must be the proportion p, the value we want to estimate.

Here, we encounter an important difference with what we did in the probability module. We don't know what is in the urn. We know there are blue and red beads, but we don't know how many of each. This is what we're trying to find out. We're trying to estimate p.

Just like we use variables to represent unknowns in systems of equations, in statistical inference we define parameters to represent unknown parts of our models. In the urn model we are using to mimic an opinion poll,
we do not know the proportion of blue beads in the urn. We define the parameter p to represent this quantity. We are going to estimate this parameter.

Note that the ideas presented here, on how we estimate parameters and provide insights into how good these estimates are, extend to many data science tasks.

For example, we may ask, what is the difference in health improvement between patients receiving treatment and a control group?

We may ask, what are the health effects of smoking on a population?
What are the differences across racial groups in fatal shootings by police?
What is the rate of change in life expectancy in the US during the last 10 years?

All these questions can be framed as a task of estimating a parameter from a sample.

</br>
</br>

---

</br>

<h2>Properties of Our Estimate</h2>

</br>

To understand how good our estimate is, we'll describe the statistical properties of the random variable we
just defined, the sample proportion. 

$$\bar{X} = \frac{X_1 + X_2 + ... +X_N}{N}$$

Note that, if we multiply by N, N times X-bar is a sum of independent draws, so the rules we covered in the probability module apply.

$$N \bar{X} = N \times \frac{X_1 + X_2 + \dots + X_N}{N} = X_1 + X_2 + \dots + X_N$$

Using what we have learned, the expected value of the sum N times X-bar is N times the average of the values in the urn, p.

$$E(N\bar{X}) = N \times p$$

So, dividing by the nonrandom constant N gives us that the expected value of the average X-bar is p. We can write it using our mathematical notation like this.

$$E(\bar{X}) = p$$
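
A Monte Carlo simulation can make this result concrete. The sketch below assumes a hypothetical urn with `p_assumed = 0.45`; in a real poll we would not know this value, so this is a check of the theory rather than something we could run on the competition urn.

```{r}
set.seed(3)
p_assumed <- 0.45  # hypothetical proportion of blue beads (unknown in practice)
N <- 25            # beads per poll
B <- 10000         # number of simulated polls

# For each simulated poll, draw N beads coded as 0 (red) or 1 (blue)
# with replacement and compute the sample average X-bar.
x_bar <- replicate(B, {
  x <- sample(c(0, 1), size = N, replace = TRUE,
              prob = c(1 - p_assumed, p_assumed))
  mean(x)
})

# The average of the simulated X-bars should be close to p_assumed.
mean(x_bar)
```

Individual polls land above or below `p_assumed`, but their long-run average settles near it.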

We can also use what we learned to figure out the standard error. We know that the standard error of the sum is the square root of N times the standard deviation of the values in the urn.

Can we compute the standard deviation of the values in the urn? We learned a formula that tells us that it's 1 minus 0 times the square root of p times 1 minus p, which is just the square root of p times 1 minus p.

$$(1-0)\sqrt{p(1-p)}$$
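
As a quick check of this formula, we can build an explicit urn and compute the standard deviation of its values directly. The urn below, with 1,000 beads of which 450 are blue, is a made-up example. Note that R's `sd()` divides by n - 1, so the sketch computes the population standard deviation by hand instead.

```{r}
# Hypothetical urn: 450 blue beads (coded 1) and 550 red beads (coded 0).
urn <- rep(c(1, 0), times = c(450, 550))

# The average of the 0s and 1s is the proportion of blue beads.
p <- mean(urn)

# Population standard deviation computed from its definition.
sd_direct <- sqrt(mean((urn - p)^2))

# Standard deviation from the formula for an urn with two values, 1 and 0.
sd_formula <- (1 - 0) * sqrt(p * (1 - p))

c(direct = sd_direct, formula = sd_formula)
```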

Because we are dividing the sum by N, we arrive at the following formula for the standard error of the average: the square root of p times 1 minus p, divided by the square root of N.

$$SE(\bar{X}) = \sqrt{p(1-p)/N}$$
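
The same kind of simulation can be used to check the standard error formula. This sketch again assumes a hypothetical `p_assumed = 0.45`. It uses the fact that the sum of N independent 0/1 draws follows a binomial distribution, so `rbinom()` can stand in for drawing the beads one by one.

```{r}
set.seed(4)
p_assumed <- 0.45  # hypothetical proportion of blue beads (unknown in practice)
N <- 25            # beads per poll
B <- 10000         # number of simulated polls

# Each element is the number of blue beads in one poll of size N;
# dividing by N turns the sum into the sample average X-bar.
x_bar <- rbinom(B, size = N, prob = p_assumed) / N

# Compare the spread of the simulated X-bars with the formula.
se_simulated <- sd(x_bar)
se_formula <- sqrt(p_assumed * (1 - p_assumed) / N)

c(simulated = se_simulated, formula = se_formula)
```

The two values should be close, with the small difference due to the finite number of simulated polls.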

This result reveals the power of polls. The expected value of the sample proportion, X-bar, is the parameter of interest, p. $E(\bar{X}) = p$

And we can make the standard error as small as we want by increasing the sample size, N. $SE(\bar{X}) = \sqrt{p(1-p)/N}$


The law of large numbers tells us that, with a large enough poll, our estimate converges to p. If we take a large enough poll to make our standard error, say, about 0.01, we'll be quite certain about who will win.

But how large does a poll have to be for the standard error to be this small? One problem is that we do not know p, so we can't actually compute the standard error.

For illustrative purposes, let's assume that p is 0.51 and make a plot of the standard error versus the sample size N. Here it is.

<img src = "images/Estimate1.png"/>

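You can see that the standard error drops as N grows, but slowly. With p equal to 0.51, a standard error of 0.01 takes a poll of about 2,500 people, and even then the standard error is as large as the gap between p and 0.5. To bring it down to about 0.005, we would need a poll of about 10,000 people. We rarely see polls of this size due, in part, to costs. We'll give other reasons later.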
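
A curve like the one in the plot can be sketched with a few lines of base R. The range of sample sizes (10 to 100,000) and the logarithmic x-axis are assumptions made for this illustration; the original figure may have been drawn differently.

```{r}
p <- 0.51  # the value of p assumed for illustration in the text

# Sample sizes from 10 to 100,000, evenly spaced on a log scale.
N <- 10^seq(1, 5, length.out = 100)

# Standard error of X-bar for each sample size.
se <- sqrt(p * (1 - p) / N)

# Plot the standard error against the sample size on a log x-axis.
plot(N, se, type = "l", log = "x",
     xlab = "Sample size N", ylab = "Standard error of X-bar")
```

Because the standard error shrinks with the square root of N, cutting it in half requires four times as many people.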

From the RealClearPolitics table we saw earlier, we learned that the sample sizes in opinion polls range from 500 to 3,500. For a sample size of 1,000, if we set p to be 0.51, the standard error is about 0.016, or 1.6 percentage points.
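
The standard error for any given poll size follows directly from the formula. The short sketch below evaluates it for three sample sizes in the range mentioned above, again assuming p is 0.51 for illustration.

```{r}
p <- 0.51                  # value of p assumed for illustration
N <- c(500, 1000, 3500)    # typical poll sizes

# Standard error of X-bar for each poll size.
se <- sqrt(p * (1 - p) / N)

data.frame(N = N, standard_error = round(se, 4))
```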

So even with large polls, for close elections, X-bar can lead us astray if we don't realize it's a random variable.

But we can actually say more about how close we can get to the parameter p. We'll do that in the next video.
