Homework 0

Problem 1 (30 pts)

The length of a critical part, measured in mm, in a manufacturing process varies according to an N(μ, \(σ^2_0\)) distribution. Here \(σ^2_0\) is unknown. This is the model assumption. Engineers plan to observe an i.i.d. sample of n = 10 parts and record \(X_1\), ..., \(X_{10}\). The observed data are given below.

{12.2, 12.0, 12.2, 11.9, 12.4, 12.6, 12.1, 12.2, 12.9, 12.4}

We would like to construct a 95 percent confidence interval for μ.

(1) What is the formula for the confidence interval?

\[ \begin{align} \left(\overline{X}- z_{0.05/2} \frac{σ_0}{\sqrt{n}} , \overline{X}+ z_{0.05/2} \frac{σ_0}{\sqrt{n}} \right) \end{align} \]

(2) What is the computed 95 percent confidence interval based on the given data?

Plugging in the sample mean 12.29, the sample standard deviation 0.296 in place of the unknown \(σ_0\), and \(z_{0.025}\) ≈ 1.96 gives 12.29 ± 0.18, or approximately (12.1, 12.5). We are 95 percent confident that the population mean μ lies in this interval.

(3) Please give your interpretation of both (1) and (2).

The formula in (1) gives the 95 percent confidence interval for the unknown mean μ. The confidence level comes from \(\alpha\) = 0.05: \(z_{0.05/2}\) is the upper 0.025 quantile of the standard normal distribution, approximately 1.96. Applied to the data, the formula yields the interval (12.1, 12.5). The interpretation is that the interval, viewed as a random quantity before the data are observed, contains μ with probability 0.95. That is to say, if we were to repeat the sampling process many times, about 95% of the resulting intervals would contain the true value of μ.
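The interval in (2) can be reproduced with a short script. This is a minimal sketch using only the Python standard library; the function name `z_interval` is a hypothetical helper, and it assumes, as the answer above does, that the sample standard deviation is plugged in for \(σ_0\) and a normal critical value is used.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

lengths = [12.2, 12.0, 12.2, 11.9, 12.4, 12.6, 12.1, 12.2, 12.9, 12.4]

def z_interval(data, confidence=0.95):
    """Return (lower, upper) for the mean, plugging s in for sigma_0."""
    n = len(data)
    x_bar = mean(data)
    s = stdev(data)  # sample standard deviation (n - 1 denominator)
    # Upper alpha/2 quantile of the standard normal, about 1.96 for 95%.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * s / sqrt(n)
    return x_bar - half_width, x_bar + half_width

lower, upper = z_interval(lengths)
print(f"95% CI: ({lower:.3f}, {upper:.3f})")
```

Rounded to one decimal place, the endpoints agree with the interval (12.1, 12.5) reported above.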

Problem 2 (20 pts)

The length of a critical part, measured in mm, in a manufacturing process varies according to an N(μ, \(σ^2_0\)) distribution. Here \(σ^2_0\) is unknown. This is the model assumption. Engineers plan to observe an i.i.d. sample of n = 10 parts and record \(X_1\), ..., \(X_{10}\). The observed data are given below.

{12.2, 12.0, 12.2, 11.9, 12.4, 12.6, 12.1, 12.2, 12.9, 12.4}

The safety standard requires that a proper functioning machine has μ = 12.2. We would like to conduct a hypothesis test to find out if this is true based on the observations:

\(H_0\) : μ = 12.2 versus \(H_1\) : μ ≠ 12.2

Please conduct this test at the 5% significance level using the critical region method. Your test should include four elements: (a) the null and alternative hypotheses, (b) the test statistic, (c) the associated critical region or decision rule, and (d) the decision.

(a) the null and alternative hypotheses

null: \(H_0\) : μ = 12.2 versus alternative: \(H_1\) : μ ≠ 12.2 (note that the alternative hypothesis is two-sided)

(b) the test statistic

Our test statistic, T, is the pivotal quantity for μ evaluated at the hypothesized value \(μ_0\) = 12.2. The standard deviation \(σ_0\) is unknown, so it is estimated from the data below. The statistic is

\[ \begin{align} T = \frac{\overline{X}- μ_0}{σ_0 / \sqrt{n} } = \frac{\overline{X}- 12.2}{σ_0 / \sqrt{10} } \end{align} \]

Also note that under the null hypothesis, our test statistic, T, follows a standard normal distribution.

The standard deviation can be estimated from the observed data with the sample standard deviation

\[ \begin{align} s = \sqrt{ \frac{\sum_{i=1}^{n}\left( x_i - \overline{x} \right)^2 }{ n - 1 } } \end{align} \]

where n is the sample size and \(\overline{x}\) is the sample mean, which for the observed data is 12.29, or ≈ 12.3.

With this mean, the sum of squared deviations is 0.789, so s = \(\sqrt{0.789 / 9}\) ≈ 0.296, or ≈ 0.30.

Substituting the estimate 0.3 for \(σ_0\), the test statistic becomes

\[ \begin{align} T = \frac{\overline{X}- 12.2}{0.3 / \sqrt{10} } \end{align} \]

(c) the associated critical region or decision rule

For a two-sided test with a significance level of 5%, or \(\alpha\) = 0.05, the critical values are approximately ±1.96. Therefore the critical region is

\[ \begin{align} T < -1.96 \ \text{ or } \ T > 1.96 \end{align} \]

The decision rule is then as follows. Given the observed value of the test statistic,

\[ \begin{align} t_{obs} = \frac{ \left( \overline{x}- 12.2 \right) }{ 0.3 / \sqrt{10}} \end{align} \]

if \(t_{obs} < -1.96\) or \(t_{obs} > 1.96\), or equivalently if | \(t_{obs}\) | > 1.96, we reject \(H_0\). Otherwise, we fail to reject \(H_0\).

(d) the decision

Using the rounded values 12.3 for the sample mean and 0.3 for the standard deviation, the observed value is \(t_{obs}\) = (12.3 − 12.2) / (0.3 / \(\sqrt{10}\)) ≈ 1.054. (With the unrounded values 12.29 and 0.296, \(t_{obs}\) ≈ 0.961.) Since | \(t_{obs}\) | is not greater than 1.96 in either case, we fail to reject the null hypothesis.
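The critical region test can be sketched in a few lines of Python. This is an illustrative check rather than part of the required answer: the helper name `z_statistic` is hypothetical, and the script works from the unrounded sample mean and sample standard deviation, so its statistic comes out near 0.96 rather than the 1.054 obtained from rounded inputs. The decision is the same either way.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

lengths = [12.2, 12.0, 12.2, 11.9, 12.4, 12.6, 12.1, 12.2, 12.9, 12.4]
mu_0 = 12.2    # value of the mean under H0
alpha = 0.05   # significance level

def z_statistic(data, mu_0):
    """Standardized distance of the sample mean from mu_0, with s plugged in for sigma_0."""
    return (mean(data) - mu_0) / (stdev(data) / sqrt(len(data)))

t_obs = z_statistic(lengths, mu_0)
z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96

# Two-sided critical region: reject H0 when |t_obs| exceeds the critical value.
reject = abs(t_obs) > z_crit
print(f"t_obs = {t_obs:.3f}, critical value = {z_crit:.3f}, reject H0: {reject}")
```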

Problem 3 (20 pts)

The same problem as above, but please conduct it using the p-value method. Your test should consist of (a) the null and alternative hypotheses, (b) the test statistic, (c) the p-value, and (d) the interpretation of the p-value.

(a) the null and alternative hypotheses

null: \(H_0\) : μ = 12.2 versus alternative: \(H_1\) : μ ≠ 12.2 (note that the alternative hypothesis is two-sided)

(b) the test statistic

Our test statistic, T, is the pivotal quantity for μ evaluated at the hypothesized value \(μ_0\) = 12.2. The standard deviation \(σ_0\) is unknown, so it is estimated from the data below. The statistic is

\[ \begin{align} T = \frac{\overline{X}- μ_0}{σ_0 / \sqrt{n} } = \frac{\overline{X}- 12.2}{σ_0 / \sqrt{10} } \end{align} \]

Also note that under the null hypothesis, our test statistic, T, follows a standard normal distribution.

The standard deviation can be estimated from the observed data with the sample standard deviation

\[ \begin{align} s = \sqrt{ \frac{\sum_{i=1}^{n}\left( x_i - \overline{x} \right)^2 }{ n - 1 } } \end{align} \]

where n is the sample size and \(\overline{x}\) is the sample mean, which for the observed data is 12.29, or ≈ 12.3.

With this mean, the sum of squared deviations is 0.789, so s = \(\sqrt{0.789 / 9}\) ≈ 0.296, or ≈ 0.30.

Substituting the estimate 0.3 for \(σ_0\), the test statistic becomes

\[ \begin{align} T = \frac{\overline{X}- 12.2}{0.3 / \sqrt{10} } \end{align} \]

(c) the p-value

For a two-sided test, the p-value is calculated under \(H_0\) as follows:

\[ \begin{align} p = 2 \, P(T > | t_{obs}|) \end{align} \]

From the previous problem, \(t_{obs}\) ≈ 1.054. Since T is standard normal under \(H_0\), this gives p = 0.2918829, or p ≈ 0.29.

(d) the interpretation of the p-value

Since p ≈ 0.29 ≥ \(\alpha\), where \(\alpha\) = 0.05 is our significance level, we fail to reject the null hypothesis. The p-value is the probability, computed assuming \(H_0\) is true, of observing a test statistic at least as extreme as the one actually observed. It measures the strength of the evidence against the null hypothesis without reference to a particular significance level: the smaller the p-value, the stronger the evidence against the null hypothesis. Note that the p-value is not the probability that the null hypothesis is true.
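The p-value calculation can be checked with the standard normal CDF from the Python standard library. The sketch below is illustrative only, and `two_sided_p_value` is a hypothetical helper name. It evaluates the p-value both at the rounded statistic 1.054 used above and at the statistic computed from the unrounded sample mean and sample standard deviation.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def two_sided_p_value(t_obs):
    """p = 2 * P(T > |t_obs|) for a standard normal T."""
    return 2 * (1 - NormalDist().cdf(abs(t_obs)))

lengths = [12.2, 12.0, 12.2, 11.9, 12.4, 12.6, 12.1, 12.2, 12.9, 12.4]
mu_0 = 12.2

# Statistic from the rounded inputs used in the text.
p_rounded = two_sided_p_value(1.054)

# Statistic from the unrounded sample mean and sample standard deviation.
t_unrounded = (mean(lengths) - mu_0) / (stdev(lengths) / sqrt(len(lengths)))
p_unrounded = two_sided_p_value(t_unrounded)

print(f"p (t_obs = 1.054)           = {p_rounded:.4f}")
print(f"p (t_obs = {t_unrounded:.3f}, unrounded) = {p_unrounded:.4f}")
```

Both p-values are well above 0.05, so the conclusion does not depend on the rounding.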

Problem 4 (10 pts)

Suppose instead the safety standard requires μ = 12.3. The test should be \(H_0\) : μ = 12.3 versus \(H_1\) : μ ≠ 12.3. Without formally repeating the hypothesis test procedure, based on your answer to Problem 1 (2) alone, could you tell right away whether this hypothesis test is significant at the 5% level? Explain.

From Problem 1 (2), the 95% confidence interval computed from the observed data is (12.1, 12.5). With repeated sampling, intervals constructed this way contain the true mean 95% of the time. A two-sided test at the 5% level rejects \(H_0\) : μ = \(μ_0\) exactly when \(μ_0\) falls outside the 95% confidence interval, because the test and the interval are built from the same statistic and the same critical value 1.96. Since the hypothesized value 12.3 lies inside (12.1, 12.5), we fail to reject \(H_0\) : μ = 12.3, so the test is not significant at the 5% level.
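The link between the confidence interval and the two-sided test can be verified directly. The following sketch, with hypothetical helper names `inside_interval` and `rejects`, checks several candidate values of the hypothesized mean and shows that a value is rejected at the 5% level exactly when it lies outside the 95% interval. It assumes the same plug-in normal approach used in Problems 1 and 2.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

lengths = [12.2, 12.0, 12.2, 11.9, 12.4, 12.6, 12.1, 12.2, 12.9, 12.4]
alpha = 0.05

x_bar = mean(lengths)
std_error = stdev(lengths) / sqrt(len(lengths))
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
lower, upper = x_bar - z_crit * std_error, x_bar + z_crit * std_error

def inside_interval(mu_0):
    """True when mu_0 lies in the 95% confidence interval."""
    return lower < mu_0 < upper

def rejects(mu_0):
    """True when the two-sided 5% test rejects H0: mu = mu_0."""
    return abs((x_bar - mu_0) / std_error) > z_crit

for mu_0 in (12.0, 12.2, 12.3, 12.6):
    print(f"mu_0 = {mu_0}: inside CI = {inside_interval(mu_0)}, reject H0 = {rejects(mu_0)}")
```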

Problem 5 (10 pts)

Suppose a linear CEF model describing the relationship between advertising spending (X, in thousands of dollars) and sales (Y, in thousands of units) is given by: Y = 10 + 2 log(X) + ε. Interpret the slope coefficient.

This is a level-log model, since X enters through log(X). The slope coefficient divided by 100, or \(\frac{\beta_1}{100}\), is the average difference in Y associated with a 1% difference in X. Specifically in this model, a 1% difference in advertising spending is associated with an average difference of 2/100 = 0.02 in Y, that is, 0.02 thousand units (about 20 units) of sales.

Consider a linear CEF model for the effect of age (X, in years) on the natural log of salary (log(Y), where Y is in dollars): log(Y) = 3 + 0.55X − 0.01\(X^2\) + ε. How do you interpret the slope coefficients 0.55 and −0.01? What is the predictive effect of a one-unit change in age on salary?

The linear coefficient is positive and the quadratic coefficient is negative, so salary initially rises with age but with diminishing returns: as X grows, the increase slows and eventually Y decreases. Because this is a log-level model, 100 times a difference in log(Y) is approximately the percentage difference in Y. The predictive effect of a one-unit difference in age is therefore not constant. Differentiating gives a slope of 0.55 − 0.02X in log(Y), so a one-year difference in age is associated with an approximate 100(0.55 − 0.02X)% average difference in salary. The coefficient 0.55 is the slope at X = 0, roughly a 55% difference in salary per year of age. The coefficient −0.01 governs the curvature: each additional year of age lowers the effect of age by about 2 percentage points, and the effect turns negative beyond X = 27.5.

What is the key assumption for the above interpretation?

The interpretation relies on the approximation that 100 times a difference in logs equals the percentage difference in levels. This approximation is good only when the difference is small: the percentage difference in level should be less than about 10 percent, or equivalently the difference in log should be less than about 0.1.
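The interpretations in this problem, and the small-difference assumption behind them, can be checked numerically. The sketch below is an illustration only, with hypothetical helper names: `level_log_change` evaluates the exact change in Y for the level-log model, and `log_level_quadratic` compares the approximate and exact percentage changes in salary for a one-year increase in age.

```python
from math import exp, log

def level_log_change(beta_1, pct_change):
    """Exact change in Y when X rises by pct_change percent in Y = b0 + b1*log(X)."""
    return beta_1 * log(1 + pct_change / 100)

def log_level_quadratic(age, b1=0.55, b2=-0.01):
    """Return (approx_pct, exact_pct) change in Y when age rises from age to age + 1."""
    # log Y(age + 1) - log Y(age) for log(Y) = b0 + b1*X + b2*X^2
    delta_log = b1 + b2 * (2 * age + 1)
    approx_pct = 100 * delta_log             # log difference read as a percentage
    exact_pct = 100 * (exp(delta_log) - 1)   # exact percentage change in Y
    return approx_pct, exact_pct

print(f"Level-log, 1% more X: change in Y = {level_log_change(2, 1):.4f}")
for age in (0, 20, 25, 30):
    approx_pct, exact_pct = log_level_quadratic(age)
    print(f"age {age:2d}: approx {approx_pct:6.1f}%, exact {exact_pct:6.1f}%")
```

The two percentages are close when the log difference is small, as at age 25, and diverge noticeably when it is large, as at age 0.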

Problem 6 (10 pts)

An observational study finds a strong correlation between the number of ice cream sales (X) and the number of drowning incidents (Y) in a given area during the summer months. The regression model is given by Y = 50 + 15X + ε, where Y is the number of drowning incidents and X is the number of ice cream sales in thousands.

• Explain why the relationship between ice cream sales and drowning incidents cannot be viewed as causal in this observational study.

• Propose a method to establish a causal relationship between these two variables.


First, correlation does not imply causation. An observed relationship between ice cream sales and drowning incidents in an observational study therefore cannot be viewed as causal. The association could be the result of coincidence, or of omitted confounding variables such as seasonality or temperature: summer months with higher temperatures lead both to more ice cream consumption (due to the heat) and to more people going swimming, which raises the risk of drowning. Here, hot weather is a confounding variable that affects both X and Y.

A strong method would be to conduct a randomized controlled trial to see whether one variable really has an effect on the other, but this raises ethical concerns: assigning random people to drown in order to see whether ice cream sales increase or decrease is neither practical nor moral. Alternatively, we could close all ice cream stands or give away free ice cream to see whether the number of drowning accidents changes, but this would also be impractical. Thus, we would have to use methods that attempt to eliminate the confounding variables in order to see whether the correlation between ice cream sales and drowning incidents reflects a causal relationship. For example, we can attempt to control for fluctuations in ice cream prices that would affect sales, or for swimming ability as a factor in drowning incidents.
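The confounding argument can be illustrated with a small simulation. Everything in the sketch below is synthetic and assumed for illustration: the data-generating numbers are made up, temperature drives both variables, and drownings are generated with no dependence on ice cream sales at all. The helper names `pearson` and `residuals` are hypothetical. Removing the linear effect of temperature from both variables makes the strong raw correlation disappear.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def residuals(ys, xs):
    """Residuals from a simple least-squares regression of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the synthetic data are reproducible
n = 5000
temperature = [rng.gauss(25, 5) for _ in range(n)]
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 3) for t in temperature]  # no ice cream term

raw = pearson(ice_cream, drownings)
adjusted = pearson(residuals(ice_cream, temperature), residuals(drownings, temperature))
print(f"raw correlation:                    {raw:.3f}")
print(f"correlation after removing weather: {adjusted:.3f}")
```

In real data the confounders are not all known or measured, which is why adjustment of this kind supports a causal claim only under the assumption that no important confounder has been left out.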