Exercise 8.2

  1. The equation of the regression line is: \[ \widehat{weight} = 120.07 - 1.93 \cdot \text{parity} \]

  2. The slope is \(\beta_\text{parity} = -1.93\), which means that the weight of non-first borns (\(\text{parity} = 1\)) is, on average, 1.93 oz. lower than that of first borns (\(\text{parity} = 0\)).

    • First born, \(\text{parity} = 0\): \[\widehat{weight} = 120.07\]

    • Non-first born, \(\text{parity} = 1\): \[\widehat{weight} = 120.07 - 1.93 = 118.14\]
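    As a quick check in R (the predict_weight helper is hypothetical, built from the coefficients above):

    # fitted baby weight (oz.) as a function of parity
    predict_weight <- function(parity) 120.07 - 1.93 * parity
    predict_weight(c(0, 1))
    ## [1] 120.07 118.14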

  3. Null hypothesis \(H_0\): \(\beta_\text{parity} = 0\)
    Alternative hypothesis \(H_A\): \(\beta_\text{parity} \neq 0\)

    From the table, the estimate for \(\beta_\text{parity}\) has a p-value of 10.5%, which is greater than the significance level of \(\alpha = 0.05\); we therefore fail to reject \(H_0\) and conclude that the relationship between birth weight and parity is not statistically significant.
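    As a rough check of that p-value (a sketch; the standard error of 1.19 for \(\beta_\text{parity}\) and the sample of \(n = 1236\) births, hence \(df = 1234\), are taken from the textbook table and are not reproduced above):

    # two-sided p-value for t = estimate / SE under H0
    round(2 * pt(-1.93 / 1.19, df = 1234), 3)
    ## [1] 0.105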

Exercise 8.4

  1. The equation of the regression line is: \[ \widehat{\text{days_absent}} = 18.93 - 9.11 \cdot \text{eth} + 3.10 \cdot \text{sex} + 2.15 \cdot \text{lrn} \]

  2. The interpretation of each slope is:

    • \(\beta_\text{eth} = -9.11\): holding all else constant, \(\text{eth} = 1\) (not aboriginal) is associated with 9.11 fewer days absent, on average, than \(\text{eth} = 0\) (aboriginal).

    • \(\beta_\text{sex} = 3.10\): holding all else constant, \(\text{sex} = 1\) (male) is associated with 3.10 more days absent, on average, than \(\text{sex} = 0\) (female).

    • \(\beta_\text{lrn} = 2.15\): holding all else constant, \(\text{lrn} = 1\) (slow learner) is associated with 2.15 more days absent, on average, than \(\text{lrn} = 0\) (average learner).

  3. Observation: \(y_1 = 2\)
    Prediction: \(\hat{y}_1 = 18.93 - 9.11 \cdot 0 + 3.10 \cdot 1 + 2.15 \cdot 1 = 24.18\)
    Residual: \(e_1 = y_1 - \hat{y}_1 = 2 - 24.18 = -22.18\)
    The model over-predicts the days absent for the first observation.
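    The same arithmetic in R (\(\text{eth} = 0\), \(\text{sex} = 1\), \(\text{lrn} = 1\) for the first observation):

    y_hat <- 18.93 - 9.11 * 0 + 3.10 * 1 + 2.15 * 1   # prediction
    y_hat
    ## [1] 24.18
    2 - y_hat                                          # residual
    ## [1] -22.18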

  4. \(n = 146\)
    \(k = 3\)
    \(n - 1 = 145\)
    \(n - k - 1 = 142\)
    \(Var(e_i) = 240.57\)
    \(Var(y_i) = 264.17\)
    \[R^2 = 1 - \frac{Var(e_i)}{Var(y_i)} = 1 - 240.57 / 264.17 = 0.0893\] \[R^2_{adj} = 1 - \frac{Var(e_i)}{Var(y_i)} \cdot \frac{n-1}{n-k-1} = 0.0701\]

    # R^2
    1 - 240.57 / 264.17
    ## [1] 0.08933641
    # adjusted R^2
    1 - 240.57 / 264.17 * 145 / 142
    ## [1] 0.07009704

Exercise 8.8

The variable \(\text{lrn}\) (learner status) should be removed, since dropping it improves the model's \(R^2_{adj}\) from 0.0701 to 0.0723.
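A sketch of the comparison, assuming the data are the quine dataset from the MASS package (the exercise does not name its data source, so this is an assumption):

    library(MASS)
    full    <- lm(Days ~ Eth + Sex + Lrn, data = quine)  # full model
    reduced <- lm(Days ~ Eth + Sex, data = quine)        # lrn removed
    summary(full)$adj.r.squared     # ~0.0701, per the exercise
    summary(reduced)$adj.r.squared  # ~0.0723, per the exercise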

Exercise 8.16

  1. The data suggest that lower temperatures are associated with a greater frequency of damaged O-rings. For instance, on mission 1, when the temperature was 53 degrees F, 5 out of 6 O-rings were damaged.

  2. Key components of the table:

    • First column: the intercept or the explanatory variable (temperature) in the logistic regression.
    • Second column (“Estimate”): point estimate for the intercept or variable coefficient, based on the dataset.
    • Third column (“Std. Error”): standard error of the point estimate for the intercept or variable coefficient.
    • Fourth column (“z value”): z-value for the point estimate assuming the null hypothesis of 0, i.e.: \[z = \frac{\text{Estimate}}{\text{Std. Error}}\]
    • Fifth column (“Pr(>|z|)”): p-value corresponding to the z-value in the fourth column, based on a standard normal distribution.
  3. The equation for the logistic model is:
    \[logit(p_i) = log \left(\frac{p_i}{1-p_i}\right) = 11.6630 - 0.2162 \cdot \text{temp} \] This is equivalent to:
    \[p_i = \frac{e^{11.6630 - 0.2162 \cdot \text{temp}}}{1 + e^{11.6630 - 0.2162 \cdot \text{temp}}}\]
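    Such a model can be fit in R with glm; a minimal sketch, assuming a hypothetical data frame orings with per-mission counts of damaged and undamaged O-rings and the launch temperature:

    # binomial logistic regression of damage counts on temperature
    fit <- glm(cbind(damaged, undamaged) ~ temp, data = orings, family = binomial)
    summary(fit)  # yields the Estimate / Std. Error / z value / Pr(>|z|) table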

  4. Yes, the concerns are justified, assuming that the conditions for logistic regression are satisfied. The p-value for the temperature coefficient is approximately zero (reported as 0.0000), which indicates that the relationship between \(logit(p_i)\) and temperature is statistically significant. Furthermore, the negative sign of the coefficient indicates that lower temperatures are associated with higher probabilities of damage.
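    A quick check of that p-value (a sketch; the standard error of 0.0532 for the temperature coefficient is taken from the table and is not reproduced above):

    -0.2162 / 0.0532                         # z value
    ## [1] -4.06391
    signif(2 * pnorm(-0.2162 / 0.0532), 2)   # two-sided p-value
    ## [1] 4.8e-05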

    For instance, the model predicts a substantial probability of damaged O-rings (\(p_i > 44\%\)) for temperatures in the mid-50s and below, compared to a negligible probability at higher launch temperatures:

    • When \(temp = 80\):
      \[logit(p_i) = 11.6630 - 0.2162 \cdot 80 = -5.633\] Then \[p_i = \frac{e^{-5.633}}{1+e^{-5.633}} = 0.0036\]
    # logit and probability at temp = 80
    (x <- 11.663 - 0.2162 * 80)
    ## [1] -5.633
    exp(x) / (1 + exp(x))
    ## [1] 0.003565071
    • When \(temp = 55\):
      \[logit(p_i) = 11.6630 - 0.2162 \cdot 55 = -0.228\] Then \[p_i = \frac{e^{-0.228}}{1+e^{-0.228}} = 0.4432\]
    # logit and probability at temp = 55
    (x <- 11.663 - 0.2162 * 55)
    ## [1] -0.228
    exp(x) / (1 + exp(x))
    ## [1] 0.4432456

Exercise 8.18

  1. The equation for the logistic model is:
    \[log \left(\frac{\hat{p}_i}{1-\hat{p}_i}\right) = 11.6630 - 0.2162 \cdot \text{temp}\] which is equivalent to:
    \[\hat{p}_i = \frac{e^{11.6630 - 0.2162 \cdot \text{temp}}}{1 + e^{11.6630 - 0.2162 \cdot \text{temp}}}\]

    • 51 degrees F: \(\hat{p}_{51} = 0.654\)
    • 53 degrees F: \(\hat{p}_{53} = 0.551\)
    • 55 degrees F: \(\hat{p}_{55} = 0.443\)
    # probability function
    prob <- function(t) {
        p <- exp(11.6630 - 0.2162 * t) / (1 + exp(11.6630 - 0.2162 * t))
        return(p)
    }
    round(prob(seq(51, 71, 2)), 3)
    ##  [1] 0.654 0.551 0.443 0.341 0.251 0.179 0.124 0.084 0.056 0.037 0.024
  2. See graph below.

    library(ggplot2)
    xrange <- seq(51, 71, 2)
    df <- data.frame(
        temp = xrange,
        prob = prob(xrange)
    )
    ggplot(df, aes(x = temp, y = prob)) + geom_point() + geom_smooth(se = FALSE) + 
        labs(x = "Temperature (degrees F)", y = "Probability of damage", 
             title = "Probability of O-ring damage vs. temperature")
    ## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
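    Since a loess smooth through 11 points only approximates the fitted curve, an alternative is to draw the model's logistic curve directly (a sketch, reusing the prob function and df data frame defined above):

    ggplot(df, aes(x = temp, y = prob)) + geom_point() +
        stat_function(fun = prob, color = "blue") +
        labs(x = "Temperature (degrees F)", y = "Probability of damage",
             title = "Probability of O-ring damage vs. temperature")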

  3. One concern is the limited size of the dataset: it includes only 23 observations (one per shuttle mission), covering 138 O-ring outcomes (6 O-rings per mission). Given the limited sample size, one has to be careful in drawing conclusions and taking actions on that basis.

    In terms of the conditions for logistic regression, it is assumed here that both are satisfied:
    • Linear relationship between each predictor variable and \(logit(\hat{p}_i)\), holding all else constant: to verify this condition, we would need to see plots of the residuals vs. the fitted outcomes as well as the residuals vs. the temperature variable; it would also help to see a plot of the outcomes vs. the predicted probabilities.
    • Independence of each outcome: we could evaluate this condition by plotting the residuals against their order of collection.
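    A sketch of that last check, reusing the hypothetical fit object from the glm sketch in Exercise 8.16:

    # deviance residuals in order of collection, to eyeball independence
    plot(residuals(fit, type = "deviance"), type = "b",
         xlab = "Order of collection", ylab = "Deviance residual")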