---
title: "Data Science: Inference and Modelling part I (EXERCISE)"
subtitle: "Exercises from HarvardX: PH125.4x"
author: "Rafael Irizarry"
output: html_notebook
---

---

<h3>Exercise 1. Polling - expected value of `S`</h3>


Suppose you poll a population in which a proportion *p* of voters are Democrats and *1-p* are Republicans. Your sample size is **N=25**. Consider the random variable *S*, which is the total number of Democrats in your sample.

What is the expected value of this random variable *S*?

</br>

<h4>Answer:</h4>

$\mbox{E}(S) = 25p$
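
One way to see this, sketched under the usual assumption that each sampled voter is an independent draw $X_i$ equal to 1 for a Democrat and 0 for a Republican (the $X_i$ notation is introduced here for illustration only):

$$\begin{aligned}
S &= \sum_{i=1}^{25} X_i \\
\mbox{E}(S) &= \sum_{i=1}^{25} \mbox{E}(X_i) = 25p
\end{aligned}$$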

</br>

---

<h3>Exercise 2. Polling - standard error of `S`</h3>


Again, consider the random variable *S*, which is the total number of Democrats in your sample of 25 voters. The variable *p* describes the proportion of Democrats in the population, whereas *1-p* describes the proportion of Republicans.

What is the standard error of *S*?

</br>

<h4>Answer:</h4>

$\mbox{SE}(S) = \sqrt{25 p (1-p)}$
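
As a rough check of this formula, and of $\mbox{E}(S) = 25p$ from Exercise 1, the sketch below simulates many polls with `rbinom`. The value `p = 0.45`, the number of replicates `B`, and the seed are illustrative assumptions, not part of the exercise.

```{r}
# Illustrative assumptions: p, B and the seed are arbitrary choices
set.seed(1)
N <- 25
p <- 0.45
B <- 100000

# Each draw is the number of Democrats in one simulated poll of N voters
S <- rbinom(B, size = N, prob = p)

# Compare the simulated mean and standard deviation with the theoretical values
c(simulated_mean = mean(S), theory_mean = N * p)
c(simulated_se = sd(S), theory_se = sqrt(N * p * (1 - p)))
```

With this many replicates the simulated and theoretical values should agree closely, though not exactly.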

</br>

---

<h3>Exercise 3. Polling - expected value of $\bar{X}$</h3>


Consider the random variable *S/N*, which is equivalent to the sample average that we have been denoting as $\bar{X}$. The variable *N* represents the sample size and *p* is the proportion of Democrats in the population.

What is the expected value of $\bar{X}$?

</br>

<h4>Answer:</h4>

$\mbox{E}(\bar{X}) = p$

</br>

---

<h3>Exercise 4. Polling - standard error of $\bar{X}$</h3>


What is the standard error of the sample average, $\bar{X}$?

The variable *N* represents the sample size and *p* is the proportion of Democrats in the population.

</br>

<h4>Answer:</h4>

$\mbox{SE}(\bar{X}) = \sqrt{p (1-p) / N}$
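
A short derivation, assuming only the result $\mbox{SE}(S) = \sqrt{N p (1-p)}$ from Exercise 2 (written there with $N = 25$) and the rule that dividing a random variable by a constant divides its standard error by that constant:

$$\begin{aligned}
\mbox{SE}(\bar{X}) &= \mbox{SE}(S/N) = \frac{\mbox{SE}(S)}{N} \\
&= \frac{\sqrt{N p (1-p)}}{N} = \sqrt{\frac{p (1-p)}{N}}
\end{aligned}$$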

</br>

---

<h3>Exercise 5. `se` versus `p`</h3>


Write a line of code that calculates the standard error `se` of a sample average when you poll 25 people in the population. Generate a sequence of 100 proportions of Democrats `p` that vary from 0 (no Democrats) to 1 (all Democrats).

Plot `se` versus `p` for the 100 different proportions.


**Instructions**

* Use the `seq` function to generate a vector of 100 values of `p` that range from 0 to 1.
* Use the `sqrt` function to generate a vector of standard errors for all values of `p`.
* Use the `plot` function to generate a plot with `p` on the x-axis and `se` on the y-axis.

</br>

<h4>Answer:</h4>

```{r}
# `N` represents the number of people polled
N <- 25

# Create a variable `p` that contains 100 proportions ranging from 0 to 1 using the `seq` function
p <- seq(0, 1, length.out=100)

# Create a variable `se` that contains the standard error of each sample average
se <- sqrt(p * (1-p) / N)

# Plot `p` on the x-axis and `se` on the y-axis
plot(p, se)
```


</br>

---

<h3>Exercise 6. Multiple plots of `se` versus `p`</h3>


Using the same code as in the previous exercise, create a for-loop that generates three plots of `p` versus `se` when the sample sizes equal *N=25*, *N=100*, and *N=1000*.


**Instructions**

* Your for-loop should contain two lines of code to be repeated for three different values of *N*.
* The first line within the for-loop should use the `sqrt` function to generate a vector of standard errors `se` for all values of `p`.
* The second line within the for-loop should use the `plot` function to generate a plot with `p` on the x-axis and `se` on the y-axis.
* Use the `ylim` argument to keep the y-axis limits constant across all three plots. The lower limit should be equal to 0 and the upper limit should equal the highest calculated standard error across all values of `p` and `N`.

</br>

<h4>Answer:</h4>

```{r}
# `p` holds 100 proportions of Democrats from 0 to 1, built with `seq`
p <- seq(0, 1, length.out = 100)

# `sample_sizes` holds the three sample sizes
sample_sizes <- c(25, 100, 1000)

# For each sample size `N`, compute the standard error `se` for every value of `p` and plot it.
# The upper `ylim` is fixed at the largest possible standard error (p = 0.5, smallest N),
# so the y-axis is identical across all three plots.
for (N in sample_sizes) {
    se <- sqrt(p * (1-p) / N)
    plot(p, se, ylim = c(0, 0.5/sqrt(min(sample_sizes))))
}
```
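
The upper limit asked for in the instructions (the highest standard error across all values of `p` and `N`) can also be computed from the data rather than typed in by hand. The sketch below is one possible approach, not the course's reference answer: it stores the standard errors in a matrix named `se_by_n` (a hypothetical name) and overlays the three curves in a single panel with `matplot`.

```{r}
p <- seq(0, 1, length.out = 100)
sample_sizes <- c(25, 100, 1000)

# One column of standard errors per sample size (a 100 x 3 matrix)
se_by_n <- sapply(sample_sizes, function(n) sqrt(p * (1 - p) / n))

# Highest standard error across all values of p and N
y_max <- max(se_by_n)

# Overlay the three curves on a shared y-axis
matplot(p, se_by_n, type = "l", lty = 1, col = 1:3,
        ylim = c(0, y_max), xlab = "p", ylab = "se")
legend("topright", legend = paste("N =", sample_sizes), col = 1:3, lty = 1)
```

Drawing all three curves on one set of axes makes the effect of the sample size easier to compare than three separate plots.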



</br>

---

<h3>Exercise 7. Expected value of `d`</h3>


Our estimate for the difference in proportions of Democrats and Republicans is $d = \bar{X} - (1-\bar{X})$.

Which derivation correctly uses the rules we learned about sums of random variables and scaled random variables to derive the expected value of `d`?

</br>

<h4>Answer:</h4>

$$\begin{eqnarray}
\mbox{E}[\bar{X} - (1-\bar{X})] &=& \mbox{E}[2\bar{X} - 1] \\
&=& 2\mbox{E}[\bar{X}] - 1 \\
&=& 2p - 1 \\
&=& p - (1-p)
\end{eqnarray}$$

</br>

---

<h3>Exercise 8. Standard error of `d`</h3>


Our estimate for the difference in proportions of Democrats and Republicans is $d = \bar{X} - (1-\bar{X})$.

Which derivation correctly uses the rules we learned about sums of random variables and scaled random variables to derive the standard error of `d`?

</br>

<h4>Answer:</h4>

$$\begin{eqnarray}
\mbox{SE}[\bar{X} - (1-\bar{X})] &=& \mbox{SE}[2\bar{X} - 1] \\
&=& 2\mbox{SE}[\bar{X}] \\
&=& 2\sqrt{p(1-p)/N}
\end{eqnarray}$$
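
The subtracted constant shifts `d` without changing how much it varies, which is why only the factor of 2 survives in the standard error. A small simulation can illustrate this alongside the expected value $2p - 1$ from Exercise 7. It is only a sketch: `N = 25`, `p = 0.45`, the number of replicates `B`, and the seed are illustrative assumptions.

```{r}
# Illustrative assumptions: N, p, B and the seed are arbitrary choices
set.seed(1)
N <- 25
p <- 0.45
B <- 100000

# Sample proportion of Democrats in each simulated poll
x_bar <- rbinom(B, size = N, prob = p) / N

# Estimated spread for each simulated poll
d <- 2 * x_bar - 1

# Compare the simulated values with the derived formulas
c(simulated_mean = mean(d), theory_mean = 2 * p - 1)
c(simulated_se = sd(d), theory_se = 2 * sqrt(p * (1 - p) / N))
```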

</br>

---

<h3>Exercise 9. Standard error of the spread</h3>


Say the actual proportion of Democratic voters is $p=0.45$. In this case, the Republican party is winning by a relatively large margin of $d= -0.1$, or a 10% margin of victory. What is the standard error of the spread $2\bar{X}-1$ in this case?


**Instructions**

* Use the `sqrt` function to calculate the standard error of the spread $2\bar{X}-1$.

</br>

<h4>Answer:</h4>

```{r}
# `N` represents the number of people polled
N <- 25

# `p` represents the proportion of Democratic voters
p <- 0.45

# Calculate the standard error of the spread. Print this value to the console.
2 * sqrt(p * (1-p) / N)
```


</br>

---

<h3>Exercise 10. Sample size</h3>

So far we have said that the difference between the proportion of Democratic voters and Republican voters is about 10% and that the standard error of this spread is about 0.2 when N=25. Select the statement that explains why this sample size is sufficient or not.

</br>

<h4>Answer:</h4>

This sample size is too small because the standard error is larger than the spread.
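
To see how the picture changes with more respondents, the sketch below recomputes the standard error of the spread for the sample sizes used in Exercise 6. Comparing the standard error with the size of the spread is used here only as the rough yardstick from the answer above, not as a formal criterion. The names `spread` and `se_spread` are introduced for illustration.

```{r}
# p is the assumed proportion of Democratic voters from Exercise 9
p <- 0.45
spread <- abs(2 * p - 1)

# Sample sizes borrowed from Exercise 6 for illustration
sample_sizes <- c(25, 100, 1000)

# Standard error of the spread for each sample size
se_spread <- 2 * sqrt(p * (1 - p) / sample_sizes)

# Flag the sample sizes where the standard error is larger than the spread
data.frame(N = sample_sizes,
           se_spread = se_spread,
           se_exceeds_spread = se_spread > spread)
```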

</br>

---

<center>END</center>