DATA 606 Assignment 8

8.2 Baby weights, Part II.

The equation of the regression line is:

\[\widehat{birthweight} = 120.07 - 1.93 \times parity\]

The slope means that a baby who is not first born is predicted to weigh 1.93 ounces less, on average, than a first-born baby. A baby who is not first born has a predicted birth weight of 118.14 ounces; a first-born baby has a predicted birth weight of 120.07 ounces.

(not_first_born = 120.07 - 1.93 * 1 )

## [1] 118.14

(first_born = 120.07)

## [1] 120.07

There is no statistically significant relationship between average birth weight and parity: the p-value for the parity coefficient, 0.1052, exceeds the 0.05 threshold for statistical significance.

8.4 Absenteeism

The equation of the regression model is:

\[\hat{y} = 18.93 - 9.11 \times eth + 3.10 \times sex + 2.15 \times lrn\]

The slopes of the regression model may be interpreted as follows:

eth: a student who is not aboriginal is predicted to have 9.11 fewer days absent, all else held constant.

sex: a male student is predicted to have 3.10 more days absent, all else held constant.

lrn: a slow learner is predicted to have 2.15 more days absent, all else held constant.

The residual for the first observation is:

eth = 0
sex = 1
lrn = 1
days = 2

(fitted_value = 18.93 - 9.11 * eth + 3.10 * sex + 2.15 * lrn )

## [1] 24.18

(residual = days - fitted_value )

## [1] -22.18

The residual is -22.18 days. That is, the model predicts 22.18 more absent days for this student than were actually observed.

As shown below, \(R^2 = 8.93\)% and the adjusted-\(R^2\) equals 7.01%.

(R_squared = 1 - (240.57/264.17) )

## [1] 0.08933641

n = 146
k = 3
( R_squared_adj = 1 - (240.57/264.17) * ( n - 1)/(n - k - 1) )

## [1] 0.07009704
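For reference, the two calculations above can be written as formulas. This sketch assumes, as in the textbook exercise, that 240.57 is the variance of the residuals and 264.17 is the variance of the number of days absent, with \(n = 146\) observations and \(k = 3\) predictors:

\[R^2 = 1 - \frac{Var(e_i)}{Var(y_i)} = 1 - \frac{240.57}{264.17} \approx 0.0893\]

\[R^2_{adj} = 1 - \frac{Var(e_i)}{Var(y_i)} \times \frac{n - 1}{n - k - 1} = 1 - \frac{240.57}{264.17} \times \frac{145}{142} \approx 0.0701\]

The factor \((n - 1)/(n - k - 1)\) is greater than 1 whenever \(k \geq 1\), so the adjusted value is always the smaller of the two.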

8.8 Absenteeism, Part II.

We prefer the model with the highest adjusted R-squared. This is model 4, which omits learner status and has an adjusted R-squared of 7.23%. So we should remove the lrn (learner status) variable.

8.16 Challenger disaster, Part I.

The data suggest that lower temperatures are associated with more damaged O-rings.
The coefficient for the temp predictor is negative, so the estimated probability of O-ring damage decreases as the temperature increases. The intercept of 11.663 is the estimated log-odds of damage at a temperature of zero degrees, which lies far outside the observed range and so has little practical meaning on its own.
The logistic model equation has the form:

\[\log\left( \frac{p}{1-p} \right) = 11.663 - 0.2162 \times temp\]

This is equivalent to:

\[p = \frac{ \exp( 11.663 - 0.2162 \times temp )}{ 1 + \exp( 11.663 - 0.2162 \times temp )}\]

Based on the model, concerns about the O-rings are justified. Cold temperature was the primary factor in the failure of the Challenger mission. For example, at a temperature of 53 degrees, the logistic model predicts a damage probability of 55.1%.

t = 53

(prob = exp( 11.663 - 0.2162 * t)/ ( 1 + exp(11.663 - 0.2162 * t)))

## [1] 0.5509228
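Two further quantities follow directly from the fitted coefficients and may help in reading the model. The sketch below uses only the intercept and slope quoted above; the names `b0`, `b1`, `odds_ratio`, and `temp_half` are illustrative and not part of the original analysis.

```r
# Coefficients copied from the fitted logistic model in the text
b0 <- 11.663    # intercept: log-odds of damage at temp = 0
b1 <- -0.2162   # change in log-odds per one-degree increase in temp

# Multiplicative change in the odds of damage per one-degree increase
odds_ratio <- exp(b1)

# Temperature at which the predicted probability is exactly 0.5,
# i.e. where the log-odds b0 + b1 * temp equal zero
temp_half <- -b0 / b1

round(odds_ratio, 3)   # about 0.806
round(temp_half, 2)    # about 53.95
```

Under this model, each additional degree multiplies the odds of damage by roughly 0.81, and the predicted probability crosses 50% just below 54 degrees, which agrees with the 55.1% estimate at 53 degrees.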

8.18 Challenger disaster, Part II.

To calculate the probability of O-ring damage, we write the fitted logistic model as the function below. It gives a damage probability of 65.4% at 51 degrees, 55.1% at 53 degrees, and 44.3% at 55 degrees.

model_prob <- function(t){
  
    return( exp( 11.663 - 0.2162 * t) / ( 1 + exp( 11.663 - 0.2162 * t )))
}

(model_prob(51 ) )

## [1] 0.6540297

(model_prob(53 ) )

## [1] 0.5509228

(model_prob(55 ) )

## [1] 0.4432456
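As a cross-check, the same probabilities can be obtained with R's built-in inverse-logit function `plogis()` from the stats package. This is a minimal sketch; `model_prob_plogis` and `temps` are illustrative names, and `model_prob` is redefined here only so that the block runs on its own.

```r
# Hand-written version, as in the text
model_prob <- function(t) {
  exp(11.663 - 0.2162 * t) / (1 + exp(11.663 - 0.2162 * t))
}

# Same model via the built-in inverse-logit (logistic CDF) function
model_prob_plogis <- function(t) {
  plogis(11.663 - 0.2162 * t)
}

# Both functions are vectorized, so all three temperatures go in one call
temps <- c(51, 53, 55)
probs <- model_prob_plogis(temps)
round(probs, 4)

# The two versions agree up to floating-point error
all.equal(model_prob(temps), probs)
```

Using `plogis()` avoids typing the linear predictor twice, which removes one opportunity for a transcription error.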

First, we reproduce the empirical data in a plot, then we add the fitted line of the logistic regression model.

library(knitr)
library(tidyverse)
library(kableExtra)

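# Observed data: launch temperature and number of damaged O-rings per mission
raw_data <- data.frame( mission = 1:23,
                        temp = c( 53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70, 72, 73, 75, 75, 76, 76, 78, 79, 81 ) ,
                        damage = c( 5, 1, 1, 1,  0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0 , 0 ) )

# freq: observed share of damaged O-rings (damage / 6); fitted: model-estimated probability
raw_data <- raw_data %>% mutate( freq = damage / 6.0 , fitted = model_prob(temp))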

Display the raw data from actual missions first.

knitr::kable( raw_data, digits = 4 ) %>% kable_styling( bootstrap_options = c("striped", "hover") )

mission	temp	damage	freq	fitted
1	53	5	0.8333	0.5509
2	57	1	0.1667	0.3406
3	58	1	0.1667	0.2939
4	63	1	0.1667	0.1237
5	66	0	0.0000	0.0687
6	67	0	0.0000	0.0561
7	67	0	0.0000	0.0561
8	67	0	0.0000	0.0561
9	68	0	0.0000	0.0457
10	69	0	0.0000	0.0372
11	70	1	0.1667	0.0301
12	70	0	0.0000	0.0301
13	70	1	0.1667	0.0301
14	70	0	0.0000	0.0301
15	72	0	0.0000	0.0198
16	73	0	0.0000	0.0160
17	75	0	0.0000	0.0104
18	75	1	0.1667	0.0104
19	76	0	0.0000	0.0084
20	76	0	0.0000	0.0084
21	78	0	0.0000	0.0055
22	79	0	0.0000	0.0044
23	81	0	0.0000	0.0029

To render the plot, add rows to the data frame for temperatures at which no mission occurred. For these rows, freq is set to the model-estimated probability rather than an observed frequency, and the mission numbers are placeholders.

# Hypothetical temperatures: fitted is computed from temp, and freq is set equal to fitted
raw_data = add_row( raw_data, mission = 24, temp = 51, damage = 0, fitted = model_prob(temp), freq = fitted )
raw_data = add_row( raw_data, mission = 25, temp = 55, damage = 0, fitted = model_prob(temp), freq = fitted )
raw_data = add_row( raw_data, mission = 26, temp = 59, damage = 0, fitted = model_prob(temp), freq = fitted )
raw_data = add_row( raw_data, mission = 27, temp = 61, damage = 0, fitted = model_prob(temp), freq = fitted )
raw_data = add_row( raw_data, mission = 28, temp = 65, damage = 0, fitted = model_prob(temp), freq = fitted )


ggplot( data=raw_data, aes( x= temp, y = freq)) + geom_point() + geom_line( aes( x= temp, y = fitted), color = "red") +
  ggtitle("Challenger O-ring probability of damage with logistic fitted model in red")
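One possible alternative for drawing the fitted curve is to evaluate the model on a separate grid of temperatures, so the observed missions and the model estimates stay in different data frames. The sketch below is an assumption about how this could be organized, not the approach used in this assignment; `curve_grid` is a hypothetical name.

```r
# Evenly spaced temperatures covering the range discussed in the text
curve_grid <- data.frame(temp = seq(51, 81, by = 1))

# Model-estimated probability of damage at each grid temperature,
# using the coefficients of the fitted logistic model
curve_grid$fitted <- plogis(11.663 - 0.2162 * curve_grid$temp)

# Inspect the first few rows of the grid
head(curve_grid, 3)
```

The grid could then be supplied to the line layer alone, for example `geom_line(data = curve_grid, aes(x = temp, y = fitted), color = "red")`, so that `geom_point()` shows only observed frequencies.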

The two conditions the textbook gives for the validity of logistic regression are that each predictor is linearly related to the logit of the probability and that the observations are independent. The independence of each mission seems reasonable. The linear relationship between temperature and the logit is likely to break down as temperatures approach freezing, where O-ring damage is no longer in doubt. Within the range of 51 to 81 degrees, however, the logistic model seems plausible. At the low temperature of the Challenger launch, the predicted probability of damage is very high.