8.2 | Baby weights, Part II

a. Write the equation of the regression line.

\(\widehat{birth\ weight} = 129.07 - (1.93 \times parity)\)


b. Interpret the slope in this context, and calculate the predicted birth weight of first borns and others.

The slope is the estimated difference in average birth weight between babies who are not first-born and babies who are. For a first-born (parity = 0), the model predicts a birth weight of 129.07 oz. For a baby who is not a first-born (parity = 1), it predicts 129.07 - 1.93 = 127.14 oz, which is 1.93 oz less.
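As a quick check, the fitted line can be evaluated at both values of the indicator. This is a minimal sketch that assumes `parity` is coded 0 for first-borns and 1 otherwise; the helper name `predict_bwt` is made up for illustration.

```r
# Coefficients taken from the regression line in part (a)
b0 <- 129.07   # intercept: predicted weight (oz) when parity = 0
b1 <- -1.93    # slope: change in predicted weight when parity = 1

# Hypothetical helper; assumes parity is coded 0 = first-born, 1 = otherwise
predict_bwt <- function(parity) b0 + b1 * parity

predict_bwt(c(0, 1))
## [1] 129.07 127.14
```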


c. Is there a statistically significant relationship between the average birth weight and parity?

The p-value for the parity coefficient is 0.1052. Because this is greater than 0.05, the relationship is not statistically significant at the 5% significance level.




8.4 | Absenteeism, Part I

a. Write the equation of the regression line.

\(\widehat{absenteeism} = 18.93 - (9.11 \times eth) + (3.10 \times sex) + (2.15 \times lrn)\)


b. Interpret each one of the slopes in this context.

Holding all other variables constant, the model predicts that, on average:

  • Non-aboriginal students are absent 9.11 fewer days than aboriginal students.

  • Male students are absent 3.10 more days than female students.

  • Slow learners are absent 2.15 more days than average learners.


c. Calculate the residual for the first observation in the data set: a student who is aboriginal, male, a slow learner, and missed two days of school.

Observed = 2 absent days

Predicted = \(18.93 - (9.11 \times 0) + (3.10 \times 1) + (2.15 \times 1)\) = 24.18 absent days

Residual = observed - predicted = 2 - 24.18 = -22.18 days
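The same arithmetic can be scripted so the sum is easy to re-check. This is a sketch only; it assumes the coding eth = 0 (aboriginal), sex = 1 (male), lrn = 1 (slow learner), and the variable names are illustrative.

```r
# Coefficients from the regression line in part (a)
coefs <- c(intercept = 18.93, eth = -9.11, sex = 3.10, lrn = 2.15)

# First observation; the leading 1 multiplies the intercept
# Assumed coding: eth = 0 (aboriginal), sex = 1 (male), lrn = 1 (slow learner)
x <- c(1, 0, 1, 1)
observed <- 2

predicted <- sum(coefs * x)
residual  <- observed - predicted

round(c(predicted = predicted, residual = residual), 2)
## predicted  residual 
##     24.18    -22.18 
```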


d. The variance of the residuals is 240.57, and the variance of the number of absent days for all students in the data set is 264.17. Calculate the \(R^2\) and the adjusted \(R^2\). Note that there are 146 observations in the data set.

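\(R^2 = 0.0893\)

\(Adjusted\ R^2 = 0.0701\)

    var.q4 <- 264.17   # variance of absent days for all students
    res.q4 <- 240.57   # variance of the residuals
    n.q4 <- 146
    k.q4 <- 3
    
    # R^2 is the share of variance NOT left in the residuals
    r2.q4 <- 1 - (res.q4 / var.q4)
    
    adjr2.q4 <- 1 - ( ( (1 - r2.q4) * (n.q4 - 1) ) / (n.q4 - k.q4 - 1) )
    
    round(r2.q4, 4)
## [1] 0.0893
    round(adjr2.q4, 4)
## [1] 0.0701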
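Written out, the two definitions give the following, where \(e_i\) denotes the residuals, \(y_i\) the observed absent days, \(n\) the number of observations, and \(k\) the number of predictors (the symbols are chosen here for illustration):

```latex
\[
R^2 = 1 - \frac{\mathrm{Var}(e_i)}{\mathrm{Var}(y_i)}
    = 1 - \frac{240.57}{264.17} \approx 0.0893
\]
\[
R^2_{adj} = 1 - \frac{\mathrm{Var}(e_i)}{\mathrm{Var}(y_i)} \times \frac{n - 1}{n - k - 1}
          = 1 - \frac{240.57}{264.17} \times \frac{146 - 1}{146 - 3 - 1} \approx 0.0701
\]
```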




8.8 | Absenteeism, Part II

Which variable, if any, should be removed from the model first?

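Remove the learner status variable (lrn) first. The model that omits it (the “No learner status” row) has the highest adjusted \(R^2\) of the candidate models, higher than that of the full model.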




8.16 | Challenger Disaster, Part I

a. Examine these data and describe what you observe with respect to the relationship between temperatures and damaged O-rings.

There are more damaged O-rings at lower temperatures than higher ones.


b. Describe the key components of the summary table in words.

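The summary table reports the coefficients of a logistic regression, so the estimates are on the log-odds scale. With every one-degree increase in temperature, the log odds that an O-ring is damaged decrease by 0.2162. The intercept, 11.6630, is the estimated log odds of damage at a temperature of 0 degrees, which is far outside the range of the observed data.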

A p-value of less than 0.05 tells us that the relationship between temperature and damaged O-rings is statistically significant.
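Because the model is logistic, the temperature coefficient is easier to read after exponentiating it, which turns a change in log odds into a multiplicative change in the odds. A short sketch, using only the slope reported above:

```r
# Slope for temperature from the summary table (log-odds scale)
b_temp <- -0.2162

# Multiplicative change in the odds of damage per 1 degree F increase
odds_ratio <- exp(b_temp)
round(odds_ratio, 3)
## [1] 0.806

# The same change expressed as a percentage
round(100 * (odds_ratio - 1), 1)
## [1] -19.4
```

In other words, each additional degree is associated with roughly a 19% drop in the odds of O-ring damage.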


c. Write out the logistic model using the point estimates of the model parameters.

\(\log_e\left(\frac{\hat{p}}{1-\hat{p}}\right) = 11.6630 - 0.2162 \times Temperature\)
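Solving this equation for \(\hat{p}\) gives the form needed to turn a temperature into an estimated probability of damage:

```latex
\[
\hat{p} = \frac{e^{11.6630 - 0.2162 \times Temperature}}
               {1 + e^{11.6630 - 0.2162 \times Temperature}}
\]
```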


d. Based on the model, do you think concerns regarding O-rings are justified?

Yes. The model suggests a statistically significant relationship between lower temperatures and a higher probability of O-ring damage. Based on this result, it would be important to monitor the ambient temperature closely before a future launch to make sure it does not drop low enough to put the O-rings at risk.




8.18 | Challenger Disaster, Part II

a. Use the logistic model to calculate the probability that an O-ring will become damaged at each of the following ambient temperatures: 51, 53, and 55 degrees Fahrenheit.

The probability that an O-ring will become damaged…

  • At 51 degrees = 65.4%

  • At 53 degrees = 55.1%

  • At 55 degrees = 44.3%

    p <- function(temp) {
      d <- 11.6630 - 0.2162 * temp
      phat <- exp(d) / (1 + exp(d))
      print(round(phat, 3))
    }
    
    p(51)
## [1] 0.654
    p(53)
## [1] 0.551
    p(55)
## [1] 0.443
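An equivalent, vectorized way to get the same three values is base R's `plogis()`, which computes the logistic function. This is an alternative sketch, not the approach used above; the variable names are illustrative.

```r
# Point estimates from the summary table
b0 <- 11.6630
b1 <- -0.2162

temps <- c(51, 53, 55)

# plogis(x) is the logistic function exp(x) / (1 + exp(x))
phat <- plogis(b0 + b1 * temps)

# Label each probability with its temperature
round(setNames(phat, temps), 3)
##    51    53    55 
## 0.654 0.551 0.443 
```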

b. Add the model-estimated probabilities from part (a) on the plot, then connect these dots using a smooth curve to represent the model-estimated probabilities.

The black line below represents the model-estimated probabilities for temperatures ranging from 51 to 81 degrees Fahrenheit.

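    library(dplyr)       # %>%, mutate(), select()
    library(knitr)       # kable()
    library(kableExtra)  # kable_styling()
    library(ggplot2)     # used for the plot further down
    
    # Temperatures (degrees Fahrenheit) at which to evaluate the model
    prob_temp <- 51:81
    
    prob <- data.frame(prob_temp)
    
    # Model-estimated probability of damage at each temperature
    d <- 11.6630 - 0.2162 * prob$prob_temp
    prob$prob_damage <- round(exp(d) / (1 + exp(d)), 3)
    
    head(prob, 10) %>% 
      kable("html") %>% 
      kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))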

prob_temp   prob_damage
51          0.654
52          0.604
53          0.551
54          0.497
55          0.443
56          0.391
57          0.341
58          0.294
59          0.251
60          0.213

    temp <- c(53,57,58,63,66,67,67,67,68,69,70,70,70,70,72,73,75,75,76,76,78,79,81)
    
    damaged <- c(5,1,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0)
    
    undamaged <- c(1,5,5,5,6,6,6,6,6,6,5,6,5,6,6,6,6,5,6,6,6,6,6)
    
    challenger <- data.frame(temp, damaged, undamaged) %>% 
      mutate(prob_damage = damaged / (damaged+undamaged)) %>% 
      select(temp, prob_damage)
    
    ggplot(challenger, aes(x=temp, y=prob_damage)) +
      geom_point(col="#ADD8E6", size=3) +
      labs(x="Temperature (Fahrenheit)", y="Probability of damage") +
      geom_line(data=prob, aes(x=prob_temp, y=prob_damage))
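The point estimates behind the fitted curve can also be recovered by fitting the logistic model directly to the launch data. The sketch below assumes six O-rings per launch (as the `damaged` and `undamaged` vectors imply) and uses a binomial GLM with a two-column response. If the data match the table the exercise used, the coefficients should come out close to 11.6630 and -0.2162.

```r
temp    <- c(53,57,58,63,66,67,67,67,68,69,70,70,70,70,72,73,75,75,76,76,78,79,81)
damaged <- c(5,1,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0)

# Assumption: six O-rings per launch, so undamaged = 6 - damaged
undamaged <- 6 - damaged

# Binomial GLM with the default logit link;
# the response is (damaged, undamaged) counts per launch
fit <- glm(cbind(damaged, undamaged) ~ temp, family = binomial)

round(coef(fit), 4)
```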


c. Describe any concerns you may have regarding applying logistic regression in this application, and note any assumptions that are required to accept the model’s validity.

Logistic regression requires two conditions to be met:

  • Each predictor \(x_i\) is linearly related to logit\((p_i)\) if all other predictors are held constant. It is difficult to verify that this condition is met with the small number of observations in the dataset.

  • Each outcome \(Y_i\) is independent of the other outcomes. To check this condition, we would need to plot the residuals in the order of their collection and investigate other possible sources of dependence.

More broadly, temperature may not be the only factor affecting O-ring damage, and O-ring failure is not the only possible cause of a space shuttle disaster.