Probability and Statistics Homework 4

Problem 1: NBC Pilot Survey

Part A

Construct a filtered data set containing only viewer responses where Show == “Living with Ed” or Show == “My Name is Earl”. Then construct a 95% confidence interval for the difference in mean viewer response to the Q1_Happy question for these two shows. Is there evidence that one show consistently produces a higher mean Q1_Happy response among viewers?

1) Question

The question is whether there is statistically significant evidence that one show consistently produces a higher mean Q1_Happy response among viewers. To start, let's take the sample means of the Q1_Happy responses for Show == “Living with Ed” and Show == “My Name is Earl”. The results below suggest that Living with Ed produces a higher happiness score, on average.

##  Living with Ed My Name is Earl 
##        3.926829        3.777778

2) Approach

In my approach I will use the ‘do’ function in R to recompute the difference in means between the two shows on bootstrapped data many times; in this case, 10,000 times.
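
The resampling loop described above can be sketched in base R. This is a minimal illustration, not the original analysis code: the data frame `shows_sim`, its simulated ratings and group sizes, and the helper `diff_in_means` are all hypothetical stand-ins for the filtered survey data, and `replicate` is used here in place of `do`.

```r
set.seed(1)

# Hypothetical stand-in for the filtered survey data: 1-5 ratings for two shows.
shows_sim <- data.frame(
  Show = rep(c("Living with Ed", "My Name is Earl"), times = c(40, 50)),
  Q1_Happy = sample(1:5, size = 90, replace = TRUE)
)

# Difference in mean rating: "My Name is Earl" minus "Living with Ed".
diff_in_means <- function(d) {
  m <- tapply(d$Q1_Happy, d$Show, mean)
  unname(m["My Name is Earl"] - m["Living with Ed"])
}

# Resample rows with replacement and recompute the difference each time.
n_boot <- 10000
boot_diffs <- replicate(n_boot, {
  idx <- sample(nrow(shows_sim), replace = TRUE)
  diff_in_means(shows_sim[idx, ])
})

# Percentile 95% confidence interval from the bootstrap distribution.
ci <- quantile(boot_diffs, probs = c(0.025, 0.975))
print(ci)
```

If the resulting interval contains zero, the resampled data do not rule out "no difference" between the two group means.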

3) Results

##       name      lower     upper level     method   estimate
## 1 diffmean -0.3919316 0.1068697  0.95 percentile -0.1490515

The results above summarize the distribution of 10,000 bootstrapped differences in means (My Name is Earl minus Living with Ed). The lower bound of the 95% confidence interval is -0.39 and the upper bound is 0.11, meaning that the middle 95% of the bootstrapped differences in means fall between those values. We can visualize that distribution using the following histogram.

4) Conclusion

Since zero is well within the 95% confidence interval for differences in means for Q1_Happy for the two shows, I am unable to say that there is a statistically significant difference in happiness response between the two shows.

Part B

Consider the shows “The Biggest Loser” and “The Apprentice: Los Angeles.” Which reality/contest show made people feel more annoyed? Construct a filtered data set containing only viewer responses where Show == “The Biggest Loser” or Show == “The Apprentice: Los Angeles”. Then construct a 95% confidence interval for the difference in mean viewer response to the Q1_Annoyed question for these two shows. Is there evidence that one show consistently produces a higher mean Q1_Annoyed response among viewers?

1) Question

Similar to Part A, this question asks whether there is statistically significant evidence that one show consistently produces a higher mean Q1_Annoyed response among viewers. Again, let's take sample means from the Q1_Annoyed question for Show == “The Biggest Loser” and Show == “The Apprentice: Los Angeles.” The results below suggest that ‘The Apprentice’ produces a higher annoyed score, on average.

## The Apprentice: Los Angeles           The Biggest Loser 
##                    2.307229                    2.036232

2) Approach

Replicating the approach from Part A, we calculate the difference in means many times using bootstrapped data. We then save these differences to a vector to visualize in the results.

3) Results

##       name      lower       upper level     method  estimate
## 1 diffmean -0.5202125 -0.01874834  0.95 percentile -0.270997

Again, the 95% confidence interval above is calculated from 10,000 bootstrapped values of diff_mean (The Biggest Loser minus The Apprentice: Los Angeles), where the lower bound is -0.52 and the upper bound is -0.02. We can visualize the distribution below:

4) Conclusion

Although it is close, zero falls just outside the 95% confidence interval from the Monte Carlo simulation of bootstrapped samples of the difference in mean annoyed ratings between the two shows. Therefore, there is statistically significant evidence of a difference in how annoying viewers find the two shows, with ‘The Apprentice: Los Angeles’ rated as more annoying on average.

Part C

Construct a filtered data set containing only viewer responses where Show == “Dancing with the Stars”. Assuming this sample of respondents is representative of TV viewers more broadly, what proportion of American TV watchers would we expect to give a response of 4 or greater to the “Q2_Confusing” question? Form a 95% confidence interval for this proportion and report your results.

1) Question

This question is similar to the last two, except that we need to create a new variable indicating whether or not a viewer found Dancing with the Stars confusing. We then calculate a confidence interval to estimate the proportion of viewers for whom this variable == TRUE.

2) Approach

First I filter the dataset to only include responses to ‘Dancing with the Stars’, and then I create a boolean ‘confused’ variable that is TRUE if Q2_Confusing >= 4. Similar to earlier approaches, we then run a Monte Carlo simulation with bootstrapped data, where each simulation calculates its own proportion of viewers that found the show to be confusing.
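
A minimal base R sketch of this proportion bootstrap follows. The data frame `dwts_sim`, its sample size, and its simulated response probabilities are hypothetical placeholders for the real filtered responses; only the thresholding at 4 and the resampling step mirror the approach described above.

```r
set.seed(2)

# Hypothetical stand-in for the "Dancing with the Stars" responses (1-5 scale).
dwts_sim <- data.frame(
  Q2_Confusing = sample(1:5, size = 200, replace = TRUE,
                        prob = c(0.55, 0.25, 0.12, 0.05, 0.03))
)

# Indicator for a response of 4 or greater.
dwts_sim$confused <- dwts_sim$Q2_Confusing >= 4

# Each bootstrap replicate recomputes the proportion of TRUE values.
boot_props <- replicate(10000, {
  mean(sample(dwts_sim$confused, replace = TRUE))
})

# Point estimate and percentile 95% confidence interval.
prop_hat <- mean(dwts_sim$confused)
ci <- quantile(boot_props, probs = c(0.025, 0.975))
print(prop_hat)
print(ci)
```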

3) Results

##        name      lower     upper level     method   estimate
## 1 prop_TRUE 0.03867403 0.1160221  0.95 percentile 0.07734807

4) Conclusion

From the above confidence interval we can see that 95% of the bootstrapped estimates of the proportion of viewers that found Dancing with the Stars confusing fall between 0.039 and 0.116, with a point estimate of 0.077. Therefore I estimate that 7.7% of viewers find the show confusing, with a 95% confidence interval of roughly 3.9% to 11.6%.

Problem 2: EBay

1. Question

The question is to measure the difference in the revenue ratio variable between treatment and control group DMAs. Results will help to determine whether Google ads had/have a significant effect on consumer behavior.

2. Approach

See below the means of the rev_ratio variable for adwords_pause == 0 vs adwords_pause == 1, indicating that there is on average a difference of about -0.05 in the revenue ratio between the treatment group and the control group.

##         0         1 
## 0.9488775 0.8965961

Similar to the approach used in Problem 1, I use bootstrapped samples to recalculate the difference between means. I save the results to a vector and find the 95% confidence interval for diff_mean.

##       name       lower       upper level     method    estimate
## 1 diffmean -0.09136935 -0.01277595  0.95 percentile -0.05228145

3. Results

The 95% confidence interval for our bootstrapped difference in means has a lower bound of -0.09 and an upper bound of -0.01, meaning that 95% of our bootstrapped differences in means fall between those bounds.

For clarity, see the below distribution of bootstrapped differences in means. It is clear that over 95% of the 10,000 instances of diff_mean are less than zero.

4. Conclusion

Since zero does not fall within our bootstrapped confidence interval, the difference in rev_ratio between the control group and the treatment group is statistically significant. In other words, the results from bootstrapping align with what our original difference in means suggested: that there is a negative impact on revenue when eBay does not purchase ads from Google.

Problem 3: Iron Bank

Use Monte Carlo simulation (with at least 100000 simulations) to calculate a p-value under the null hypothesis that, over the long run, securities trades from the Iron Bank are flagged at the same 2.4% baseline rate as that of other traders.

Include the following items in your write-up: (i) the null hypothesis that you are testing; (ii) the test statistic you used to measure evidence against the null hypothesis; (iii) a plot of the probability distribution of the test statistic, assuming that the null hypothesis is true; (iv) the p-value itself; (v) and a one-sentence conclusion about the extent to which you think the null hypothesis looks plausible in light of the data.

(i)

Our null hypothesis is that securities trades from the Iron Bank are flagged at the same 2.4% baseline rate as that of other traders. Put differently, the null hypothesis is that the Iron Bank is not engaging in illegal trades.

(ii)

The test statistic used to measure evidence against the null hypothesis is the number of Iron Bank trades flagged by the SEC: 70 out of 2021.

(iii)

See the below distribution of the test statistic, assuming our null hypothesis is true.

The probability of observing 70 or more flagged trades out of 2021, assuming that the null hypothesis is true, appears to be very small. A count of 70 flagged trades sits on the far right tail of the distribution, well beyond the 95th percentile.

(iv)

Our p-value, calculated in R, is 0.00196.
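
A minimal sketch of how such a p-value can be simulated in base R is shown here. It assumes each of the 2021 trades is flagged independently at the 2.4% baseline rate, so the flagged count under the null is binomial; the variable names are illustrative, and the simulated value will vary slightly with the random seed.

```r
set.seed(3)

n_trades <- 2021        # Iron Bank trades examined
base_rate <- 0.024      # baseline flag rate under the null hypothesis
observed_flags <- 70    # trades actually flagged
n_sims <- 100000        # number of Monte Carlo simulations

# Simulate the number of flagged trades under the null hypothesis.
sim_flags <- rbinom(n_sims, size = n_trades, prob = base_rate)

# One-sided p-value: share of simulations with at least as many flags as observed.
p_value <- mean(sim_flags >= observed_flags)
print(p_value)
```

The comparison uses `>=` because the p-value measures the chance of a result at least as extreme as the 70 flags observed.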

(v)

The null hypothesis that the Iron Bank is only engaging in legal trades appears unlikely given the notably low p-value of about 0.002, which indicates a roughly 0.2% chance that 70 or more trades would be flagged if the Iron Bank's trades were flagged at the baseline rate.

Problem 4: Milk Demand, Revisited

Your task is to use bootstrapping to quantify your uncertainty regarding the price elasticity of demand for milk based on this data. For this problem you should turn in a single figure showing the bootstrap sampling distribution for the elasticity parameter. This figure should have an informative caption in which you explain what is shown in the figure and also quote a 95% bootstrap confidence interval for the elasticity, rounded to two decimal places.

Methodology: We are asked to calculate the price elasticity of demand for milk. Let \(d\) = milk demanded, represented by the ‘sales’ variable, and \(p\) = price of milk, represented by the ‘price’ variable. See the following scatterplot of log(price) against log(sales).

Since the relationship is approximately linear on the log-log scale, we are able to apply a power law, \(d(p) = A p^{\beta}\), and represent it as follows,

\(log(d) = log(A) + \beta \cdot log(p)\)

where the Beta coefficient represents the price elasticity of demand, or the average percentage change in d for a 1% change in p. A log-log regression will provide a Beta coefficient which is the price elasticity. The following plot displays the distribution of the elasticity parameter calculated from 10,000 bootstrapped samples.

Here, ‘log.price.’ represents the Beta coefficient, which is the price elasticity of demand for milk. We see that the distribution appears approximately normal and falls roughly between -1.8 and -1.4, implying that demand for milk is elastic with respect to price. Indeed, R tells us that the Beta coefficient has a 95% bootstrap confidence interval of [-1.78, -1.46].
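
The elasticity bootstrap described above can be sketched in base R as follows. The data frame `milk_sim` is simulated from a hypothetical power law (its intercept, slope of -1.6, noise level, and sample size are assumptions chosen only for illustration), and 2,000 replicates are used to keep the example quick.

```r
set.seed(4)

# Hypothetical stand-in for the milk data, generated from a power law.
n <- 100
price <- runif(n, min = 1, max = 5)
sales <- exp(4.7 - 1.6 * log(price) + rnorm(n, sd = 0.25))
milk_sim <- data.frame(price = price, sales = sales)

# Refit the log-log regression on each bootstrap resample and keep the slope.
boot_elasticity <- replicate(2000, {
  idx <- sample(nrow(milk_sim), replace = TRUE)
  fit <- lm(log(sales) ~ log(price), data = milk_sim[idx, ])
  unname(coef(fit)["log(price)"])
})

# Percentile 95% confidence interval for the elasticity, to two decimals.
ci <- quantile(boot_elasticity, probs = c(0.025, 0.975))
print(round(ci, 2))
```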

Problem 5: Standard Error Calculations

Part A

Suppose that \(X_1, \ldots, X_N \sim Bernoulli(p)\) and that \(Y_1, \ldots, Y_M \sim Bernoulli(q)\) (all independent). We will consider \(\hat{p} = \bar{X}_N \text{ and } \hat{q} = \bar{Y}_M\) as estimators of \(p\) and \(q\), respectively.

i.

Show that \(E[\hat{p} − \hat{q}] = p − q\), the true difference in success probabilities.

Let \(\hat{p}\) = \(\frac{X_1 + ... + X_N}{N}\) and \(\hat{q}\) = \(\frac{Y_1 + ... + Y_M}{M}\). We have the following.

\(E[\hat{p}-\hat{q}] = E[\frac{X_1 + ... + X_N}{N} - \frac{Y_1 + ... + Y_M}{M}]\)

Pulling out the constants \(\frac{1}{N}\) and \(\frac{1}{M}\), and using that the expectation of a sum is the sum of the expectations, we have:

\(E[\hat{p}-\hat{q}] = \frac{1}{N}(E[X_1] + ... + E[X_N]) - \frac{1}{M}(E[Y_1]+ ... + E[Y_M])\)

Since \(X_i \sim Bernoulli(p)\) and \(Y_j \sim Bernoulli(q)\), we have \(E[X_i] = p\) and \(E[Y_j] = q\), so we can simplify further as:

\(E[\hat{p}-\hat{q}] = \frac{Np}{N} - \frac{Mq}{M} = p - q\).

ii.

Use what you know of probability to compute the standard error of \(\hat{p}\), i.e. the standard deviation of the sampling distribution of \(\hat{p}\).

To calculate the standard error we can use the variance of \(\hat{p}\), as follows:

\(SE[\hat{p}] = \sqrt{var[\hat{p}]} = \sqrt{var[\frac{X_1 + ... + X_N}{N}]}\)

We can take out the constant \(\frac{1}{N}\) and separate the variance as a sum of individual variances, since we are given the variables are independent.

\(SE[\hat{p}] = \sqrt{\frac{1}{N^2}(var[X_1]+ ... + var[X_N])}\)

Finally, substituting for the variance of \(X \sim Bernoulli(p)\) and summing, we have,

\(SE[\hat{p}] = \sqrt{\frac{N \cdot p(1-p)}{N^2}} = \sqrt{\frac{p(1-p)}{N}}\)

iii.

Compute the standard error of \(\hat{\Delta} = \hat{p} − \hat{q}\) as an estimator of the true difference \(\Delta = p − q\).

We can use the formula referenced in office hours for the variance of a sum, but modify to form a difference:

\(var[\hat{p} - \hat{q}] = var[\hat{p} + (-\hat{q})] = var[\hat{p}] + (-1)^2var[\hat{q}] + 2(1)(-1) \cdot Cov(\hat{p},\hat{q})\)

Since the variables are independent, \(Cov[\hat{p}, \hat{q}] = 0\), thus,

\(var[\hat{p} − \hat{q}] = var[\hat{p}] + var[\hat{q}]\)

Taking the square root and substituting from the previous question, we get:

\(SE[\hat{\Delta}] = \sqrt{var[\hat{p} - \hat{q}]} = \sqrt{\frac{p(1-p)}{N}+\frac{q(1-q)}{M}}\)
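
As a quick numerical sanity check, the formula \(\sqrt{p(1-p)/N + q(1-q)/M}\) can be compared against a simulation in base R. The values of `p`, `q`, `N`, and `M` below are arbitrary choices for illustration and do not come from the problem.

```r
set.seed(5)

# Arbitrary illustrative parameters.
p <- 0.3; q <- 0.5
N <- 200; M <- 150
n_sims <- 100000

# Each sample proportion is a binomial count divided by its sample size.
p_hat <- rbinom(n_sims, size = N, prob = p) / N
q_hat <- rbinom(n_sims, size = M, prob = q) / M
delta_hat <- p_hat - q_hat

# Compare the analytic standard error with the simulated standard deviation.
se_formula <- sqrt(p * (1 - p) / N + q * (1 - q) / M)
se_sim <- sd(delta_hat)
print(c(formula = se_formula, simulated = se_sim))

# The simulated mean of the estimator should sit close to p - q.
print(mean(delta_hat))
```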

Part B

Given our added information that \(E[X_i] = \mu_X\) and \(var[X_i] = \sigma_X^2\), as well as \(E[Y_i] = \mu_Y\) and \(var[Y_i] = \sigma_Y^2\), we are able to replace the Bernoulli expected value of \(p, q\) with \(\mu_X, \mu_Y\) and the variance \(p(1-p), q(1-q)\) with \(\sigma_X^2, \sigma_Y^2\). Repeating steps from Part A but performing the above substitution, we have:

\(\begin{aligned} E[\hat{\Delta}] &= E[\bar{X}_N] - E[\bar{Y}_M] = \mu_X - \mu_Y \\ SE[\hat{\Delta}] &= \sqrt{\frac{\sigma_X^2}{N}+\frac{\sigma_Y^2}{M}}. \end{aligned}\)
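
For completeness, the intermediate steps behind this result can be written out as below. This sketch assumes, as in Part A, that all of the \(X_i\) and \(Y_j\) are independent.

\(\begin{aligned} E[\bar{X}_N] &= \frac{1}{N}\sum_{i=1}^{N} E[X_i] = \frac{N\mu_X}{N} = \mu_X \\ var[\bar{X}_N] &= \frac{1}{N^2}\sum_{i=1}^{N} var[X_i] = \frac{N\sigma_X^2}{N^2} = \frac{\sigma_X^2}{N} \\ var[\hat{\Delta}] &= var[\bar{X}_N] + var[\bar{Y}_M] = \frac{\sigma_X^2}{N}+\frac{\sigma_Y^2}{M} \end{aligned}\)

The same two lines hold for \(\bar{Y}_M\) with \(\mu_Y\), \(\sigma_Y^2\), and \(M\) in place of \(\mu_X\), \(\sigma_X^2\), and \(N\), and taking the square root of the last line gives the standard error.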

Joseph Williams

2023-08-07
