Data 606 - Homework 8

Heather Geiger - May 3, 2018

Question 8.2

(1.93 * parity) + 120.07 = baby_weight_in_ounces
On average, babies born to parous mothers weigh 1.93 ounces less than babies born to first-time mothers.
No, the relationship is not statistically significant.

Question 8.4

eth = 0
sex = 1
lrn = 1
actual_missed_days_of_school = 2
calculated_missed_days_of_school = (-9.11 * eth) + (3.10 * sex) + (2.15 * lrn) + 18.93 
calculated_missed_days_of_school

## [1] 24.18

actual_missed_days_of_school - calculated_missed_days_of_school

## [1] -22.18

proportion_variance_not_explained <- 240.57/264.17
unadjusted_Rsquared <- 1 - proportion_variance_not_explained
n = 146
k = 3
adjustment_factor <- (n - 1)/(n - k - 1)
adjusted_Rsquared <- 1 - (proportion_variance_not_explained*adjustment_factor)
round(unadjusted_Rsquared,digits=4)

## [1] 0.0893

round(adjusted_Rsquared,digits=4)

## [1] 0.0701

missed_days_of_school = (-9.11 * eth) + (3.10 * sex) + (2.15 * lrn) + 18.93
The residual here is -22.18, meaning that the student missed over 22 days of school less than would be predicted by the model.
The unadjusted R-squared is 0.0893, while the adjusted R-squared is slightly lower at 0.0701.

Question 8.8

Using a backwards elimination process, we would remove learner status from the model, because the adjusted R-squared increases from 0.0701 to 0.0723 when we remove this variable.

Question 8.16

calculated_probabilities <- c()

for(temperature in seq(from=0,to=100,by=5))
{
calculated_probabilities <- c(calculated_probabilities,exp((-0.2162 * temperature) + 11.6630)/(1 + exp((-0.2162 * temperature) + 11.6630)))
}

plot(seq(from=0,to=100,by=5),calculated_probabilities,type="o",
xlab="Temperature",
ylab="Calculated probability of O-ring failure")

Based on just looking at the data, it would appear that the O-rings are more likely to be damaged at lower temperatures.
The model can be written as log(p/(1 - p)) = (-0.2162 * temperature) + 11.6630. Taking out of log format, this means that p = exp((-0.2162 * temperature) + 11.6630)/(1 + exp((-0.2162 * temperature) + 11.6630)). This means that as we increase the temperature, the probability of O-ring failure decreases. However, this occurs in a non-linear fashion, with the plot above showing that the greatest decrease in predicted probability occurs somewhere between 45 and 60 degrees Farenheit.
See #2.
Yes, based on the model concerns regarding O-rings seem justified. The association between lower temperature and higher probability of O-ring failure is extremely significant.

Question 8.18

original_data <- data.frame(Temperature = c(53,57,58,63,66,67,67,67,68,69,70,70,70,70,72,73,75,75,76,76,78,79,81),
    Num.damaged = c(5,1,1,1,rep(0,times=6),1,0,1,rep(0,times=4),1,rep(0,times=5)),
    stringsAsFactors=FALSE)

original_data <- data.frame(original_data,Failure.rate = original_data$Num.damaged/6,stringsAsFactors=FALSE)

calculated_probabilities <- c()

for(temperature in c(51,53,55,57,59,61,63,65,67,69,71))
{
calculated_probabilities <- c(calculated_probabilities,exp((-0.2162 * temperature) + 11.6630)/(1 + exp((-0.2162 * temperature) + 11.6630)))
}

calculated_probabilities[1:3]

## [1] 0.6540297 0.5509228 0.4432456

ggplot(original_data,
aes(Temperature,Failure.rate)) +
geom_point(alpha=1/10,size=4) +
xlab("Temperature (Farenheit)") +
ylab("Probability of damage") +
coord_cartesian(xlim = c(51,81)) +
coord_cartesian(ylim = c(0,1)) +
geom_point(data = data.frame(Temperature = seq(from=51,to=71,by=2),Failure.rate = calculated_probabilities),size=4,colour="red") +
geom_smooth(data = data.frame(Temperature = seq(from=51,to=71,by=2),Failure.rate = calculated_probabilities),colour="red")

## `geom_smooth()` using method = 'loess'

Let’s also plot actual and predicted logit(p) vs. temperature.

logit <- function(p){return(log(p/(1 - p)))}

actual_logit_p <- logit(original_data$Failure.rate)

calculated_probabilities <- c()

for(temperature in original_data$Temperature)
{
calculated_probabilities <- c(calculated_probabilities,exp((-0.2162 * temperature) + 11.6630)/(1 + exp((-0.2162 * temperature) + 11.6630)))
}

calculated_logit_p <- logit(calculated_probabilities)

logit(0) = -Inf, so some values won’t plot. Let’s plot these as if actual probability were very low but non-zero.

actual_logit_p[actual_logit_p == -Inf] <- min(calculated_logit_p)

plot(original_data$Temperature,
actual_logit_p,
xlab="Temperature",
ylab="Logit(p)")

lines(original_data$Temperature,
calculated_logit_p,
col="red",
type="o")

for(i in 1:5)
{
abline(h = logit((i/6)),lty=2)
text(80,logit(i/6),paste0("logit(",i,"/6)"))
}

abline(h = logit(min(calculated_probabilities)),lty=2)
text(55,logit(min(calculated_probabilities)) + 0.1,paste0("logit(",signif(min(calculated_probabilities),2),")"))

The calculated probability of O-ring damage is 0.654 at 51 degrees Farenheit, 0.551 at 53 degrees, and 0.443 at 55 degrees.
See plot above.
Based on this plot, it is a bit concerning that our model seems largely driven by the one data point that is also a strong outlier (low x value means it has more influence, and it also has a much higher y value than any of the other points). One of the major conditions for logistic regression is that each x value is linearly related to logit(p) if all other predictors are held constant. Here we see in the second plot that the relationship between temperature and logit(p) does not seem to be modeled that well by a linear model. Then again, we are also very limited by only being able to look at failure rates in intervals of 1/6. So maybe we can continue using the linear model, with the caveat that with only 6 O-rings we are limited in the degree we can confirm how well the model fits.