Multiple and Logistic Regression

Statistics and Probability for Data Analytics

CUNY MSDS DATA 606

Rose Koh

2018/04/27

Assignments

Chapter 8 - Multiple and Logistic Regression

  Practice: 8.1, 8.3, 8.7, 8.15, 8.17
  Graded: 8.2, 8.4, 8.8, 8.16, 8.18
  Presentation 8.1

\[\widehat{body\_weight} = 123.05 -8.94 * smoke\]

The slope is the difference in predicted birth weight between babies born to smokers and babies born to non-smokers: babies born to mothers who smoke are predicted to weigh 8.94 ounces less.

\[\widehat{body\_weight\_smoker} = 123.05 -8.94 * 1 = 114.11\]

\[\widehat{body\_weight\_non\_smoker} = 123.05 -8.94 * 0 = 123.05\]
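The two predictions above can be reproduced with a short sketch. The helper name `predict_weight` is hypothetical and simply wraps the fitted line quoted in the text, with `smoke` coded 1 for smokers and 0 for non-smokers.

```r
# Fitted line from the text: predicted birth weight given smoking status.
# `predict_weight` is an illustrative helper, not part of the original analysis.
predict_weight <- function(smoke) {
  123.05 - 8.94 * smoke
}

# smoke = 1 for smoker mothers, smoke = 0 for non-smoker mothers
pred <- predict_weight(c(smoker = 1, non_smoker = 0))
pred

# The gap between the two predictions is exactly the slope.
gap <- pred[["smoker"]] - pred[["non_smoker"]]
gap
```

Because `smoke` only takes the values 0 and 1, the slope is the whole difference between the two group predictions.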

From the regression summary table, the p-value for smoke is reported as 0.0000, that is, very close to zero.

We can carry out a hypothesis test to see whether birth weight differs between the two groups.

\[H_0: \beta_1 = 0\]

Under the null hypothesis, there is no difference in average birth weight between babies born to smoker and non-smoker mothers.

\[H_a: \beta_1 \neq 0\]

Under the alternative hypothesis, there is a difference in average birth weight between babies born to smoker and non-smoker mothers.

The p-value corresponds to a two-sided test. Because it is very small, we reject the null hypothesis that the birth weights of babies born to smoker and non-smoker mothers do not differ. The data provide strong evidence that the slope is not zero, so there is a statistically significant association between maternal smoking and birth weight. The association is negative: babies born to mothers who smoke have a lower predicted birth weight.

\[\widehat{body\_weight} = 120.07 - 1.93 * parity\]

The predicted birth weight of babies who were not first born is 1.93 ounces lower than that of first-born babies.

The t-value and p-value are -1.62 and 0.1052. Because the p-value is greater than 0.05, we fail to reject the null hypothesis: the data do not show a statistically significant relationship between average birth weight and parity.
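As a rough consistency check on the reported statistics, the sketch below works backwards from the two-sided p-value to the size of the test statistic it implies. It assumes a standard normal reference distribution, which is only a large-sample approximation to the t distribution (the sample size is not given here). The names `p_value` and `z_implied` are illustrative.

```r
# Two-sided p-value for parity, as reported in the text.
p_value <- 0.1052

# Magnitude of the test statistic implied by that p-value,
# ASSUMING a standard normal reference distribution (large-sample approximation).
z_implied <- qnorm(1 - p_value / 2)
round(z_implied, 2)

# Compare against the usual 5% significance level.
not_significant <- p_value > 0.05
not_significant
```

The implied magnitude is about 1.62, and the p-value exceeds 0.05, so the slope for parity is not significant at the 5% level.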

\[\widehat{absent\_days} = 18.93 - 9.11 * eth + 3.10 * sex + 2.15 * lrn\]

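eth: holding the other variables constant, the model predicts 9.11 fewer absent days for non-aboriginal children.

sex: holding the other variables constant, the model predicts 3.10 more absent days for male children.

lrn: holding the other variables constant, the model predicts 2.15 more absent days for slow learners.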
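The slope interpretations can be checked by switching one indicator at a time. This is a minimal sketch: `predict_absent` is a hypothetical helper around the fitted equation, and the 0/1 coding (1 = non-aboriginal, 1 = male, 1 = slow learner) is inferred from the interpretations above.

```r
# Fitted equation from the text; `predict_absent` is an illustrative helper.
# Assumed coding: eth = 1 non-aboriginal, sex = 1 male, lrn = 1 slow learner.
predict_absent <- function(eth, sex, lrn) {
  18.93 - 9.11 * eth + 3.10 * sex + 2.15 * lrn
}

# Reference group: all indicators set to 0, which gives the intercept.
baseline <- predict_absent(eth = 0, sex = 0, lrn = 0)

# Change one indicator at a time; each difference recovers one slope.
effects <- c(
  eth = predict_absent(1, 0, 0) - baseline,
  sex = predict_absent(0, 1, 0) - baseline,
  lrn = predict_absent(0, 0, 1) - baseline
)
effects
```

Each difference matches the corresponding coefficient because the model has no interaction terms.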

\[\widehat{absent\_days} = 18.93 - 9.11 * eth + 3.10 * sex + 2.15 * lrn\]

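# Predicted absent days for eth = 0, sex = 1, lrn = 1
absent <- 18.93 - 9.11 * (0) + 3.10 * (1) + 2.15 * (1)
# Residual = observed (2 absent days) - predicted
residual <- 2 - absent
residual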

## [1] -22.18

# R-squared = 1 - (variance of residuals) / (variance of outcome)
# Adjusted R-squared = 1 - (variance of residuals) / (variance of outcome) * (n - 1) / (n - k - 1), where n is the number of observations and k is the number of predictors in the model.

r2 <- 1 - (240.57/264.17)
r2.adjusted <- 1 - (240.57/264.17)*((146-1)/(146-3-1))

r2

## [1] 0.08933641

r2.adjusted

## [1] 0.07009704

Removing learner status (lrn) raises the adjusted R-squared from 0.0701 to 0.0723, so lrn should be dropped from the model.
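The comparison behind that decision can be sketched with a small helper. `adjusted_r2` is a hypothetical function name. The variances, sample size, and the 0.0723 figure are the values quoted in this write-up, and the reduced model itself is not refit here.

```r
# Adjusted R-squared from the residual variance, outcome variance,
# sample size n, and number of predictors k.
# `adjusted_r2` is an illustrative helper, not a base R function.
adjusted_r2 <- function(var_resid, var_outcome, n, k) {
  1 - (var_resid / var_outcome) * (n - 1) / (n - k - 1)
}

# Full model: eth, sex, and lrn (k = 3) on n = 146 observations.
adj_full <- adjusted_r2(240.57, 264.17, n = 146, k = 3)

# Adjusted R-squared without lrn, taken as given from the text.
adj_no_lrn <- 0.0723

# Backward elimination drops a predictor when doing so raises adjusted R-squared.
drop_lrn <- adj_no_lrn > adj_full
drop_lrn
```

Adjusted R-squared penalizes extra predictors, which is why a model with fewer variables can score higher even though its plain R-squared cannot.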

Missions launched at colder temperatures appear to have had more damaged O-rings than missions launched at average or higher temperatures. This is most noticeable around 50 degrees, particularly at 53 degrees: the colder the ambient temperature, the more likely the O-rings were to be damaged. Above about 57 degrees, damage appears to be much less common.
The slope for temperature is negative and statistically significant (its p-value is reported as 0.0000). Because this is a logistic regression, each 1-degree increase in temperature lowers the estimated log-odds of O-ring damage by 0.2162. The intercept is a simplification that is not interpretable in this context, because we have no observations near a temperature of 0 degrees.

\[\log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = 11.6630 - 0.2162 * temp\]

Here \(\hat{p}\) is the estimated probability that an O-ring is damaged.
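Assuming the coefficients are on the log-odds (logit) scale, as in a logistic regression, the slope is easier to read after exponentiating it into an odds ratio. The variable names in this sketch are illustrative.

```r
# Slope for temperature from the fitted model (log-odds scale).
slope <- -0.2162

# Odds ratio for a 1-degree increase in temperature.
odds_ratio_1 <- exp(slope)
round(odds_ratio_1, 3)

# Odds ratio for a 10-degree increase: the effect multiplies on the odds scale.
odds_ratio_10 <- exp(10 * slope)
round(odds_ratio_10, 3)
```

An odds ratio below 1 means the odds of damage shrink as temperature rises: by roughly a fifth per degree under this model.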

Yes. The p-value for temperature is reported as 0.0000, so the association is statistically significant and the concerns about O-rings at low temperatures are justified.

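# Temperatures at which to evaluate the model
temp <- c(51, 53, 55)
# Log-odds of O-ring damage, then the inverse logit to get probabilities
logit <- 11.6630 - 0.2162 * temp
answer <- exp(logit) / (1 + exp(logit))
answer

At 51, 53, and 55 degrees, the estimated probabilities of O-ring damage are 0.6540297, 0.5509228, and 0.4432456, respectively.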

library(ggplot2)
data <- data.frame(prob=c(0.341, 0.251, 0.179, 0.124, 0.084, 0.056, 0.037, 0.024, 0.654, 0.551, 0.443), 
                   temp=c(57, 59, 61, 63, 65, 67, 69, 71, 51, 53, 55))

ggplot(data, aes(temp, prob)) +
  geom_point() +
  geom_smooth(se = FALSE, method = 'loess')
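The probabilities typed into the data frame above can also be generated directly from the fitted model, which avoids copying rounded values by hand. This sketch uses base R's `plogis()` (the inverse logit); the names `temps` and `model_curve` are illustrative.

```r
# Temperatures used in the plot, in increasing order.
temps <- seq(51, 71, by = 2)

# plogis(x) is exp(x) / (1 + exp(x)), the inverse logit.
probs <- plogis(11.6630 - 0.2162 * temps)

# Rounded to three decimals to match the values plotted above.
model_curve <- data.frame(temp = temps, prob = round(probs, 3))
model_curve
```

Since these points come from a known logistic curve, the model equation could also be plotted directly on a finer temperature grid instead of smoothing the rounded points with loess.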

Logistic regression requires two conditions:

Each predictor x is linearly related to logit(p) when all other predictors are held constant.
Each outcome y is independent of the other outcomes.

Both conditions are difficult to verify.

There were only 23 missions, which is too small a sample to check the first condition.
It is also unclear whether the O-ring outcomes are independent of one another.

To conclude, it is uncertain whether logistic regression is appropriate given the available information.