Homework 8

Problem 8.2

The equation for linear regression is:

$\hat{y} = \beta_{0} + \beta_{1}x$

We can estimate $\beta_{0}$ and $\beta_{1}$ with the point estimates for the intercept, $b_{0}$, and the slope, $b_{1}$, which are given in the table. Therefore, the equation for the line is:

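$\hat{y} = 120.07 - 1.93x$

The slope represents the change in the predicted $y$ when $x$ increases by 1. Here $x$ is an indicator variable (0 for first born, 1 otherwise), so the predicted birth weight of babies who are not first born is, on average, 1.93 ounces lower than that of first borns.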

For first borns, we replace $x$ by 0, and for babies that are not first born, we substitute a 1 for $x$.
```
b0 <- 120.07
b1 <- -1.93
first <- b0 + b1*0
nonfirst <- b0 + b1*1
first
```
```
[1] 120.07
```
```
nonfirst
```
```
[1] 118.14
```
The predicted birth weight for first borns is 120.07 ounces, while the predicted birth weight for non-first borns is 118.14 ounces.

The hypotheses are:

$H_{0}: \beta_{1} = 0$

$H_{a}: \beta_{1} \ne 0$

The $t$ value is given in the table to be -1.62, which is calculated using the following equation:

$t = \frac{b_{1}-\beta_{1}}{SE_{b_{1}}}$

The corresponding $P$-value is 0.1052. At a significance level of 0.05, we fail to reject the null hypothesis: the data do not provide sufficient evidence of a relationship between birth weight and birth order.
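As a rough check on the reported $P$-value, the sketch below recomputes it from the $t$ value in the table. The degrees of freedom are not restated here, so it assumes the sample is large enough for the $t$ distribution to be well approximated by the standard normal. The names `t_stat` and `p_value` are chosen for this illustration.

```r
# t value reported in the regression table
t_stat <- -1.62
# Two-sided P-value, assuming a large sample so the normal approximation applies
p_value <- 2 * pnorm(-abs(t_stat))
round(p_value, 4)
# Compare with the 0.05 significance level
p_value < 0.05
```

Under that assumption the result agrees with the table to about four decimal places, and the comparison confirms that the null hypothesis is not rejected.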

Problem 8.4

The general equation for a regression line with $k$ predictors is:

$\hat{y} = \beta_{0} + \sum_{i=1}^{k}\beta_{i}x_{i}$

Therefore, for three variables, the equation is:

$\hat{y} = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \beta_{3}x_{3}$

We can replace the $\beta_{i}$ values with the point estimates given in the table to get our final equation:

$\hat{y} = 18.93 - 9.11x_{1} + 3.10x_{2} + 2.15x_{3}$

Each slope represents the change in the predicted $y$ when the corresponding $x$ increases by 1, holding the other variables constant. There are three slopes in this equation. Students who are not aboriginal are predicted to miss 9.11 fewer days, on average, than aboriginal students. Male students are predicted to miss 3.10 more days, on average, than female students. Finally, students classified as slow learners are predicted to miss 2.15 more days, on average, than students who are not.

First, we need to calculate $\hat{y}$ by replacing the $x_{i}$ values in our regression model with the values from the first row ($x_{1}=0$, $x_{2}=1$, $x_{3}=1$, with 2 days missed). The residual is the difference between the actual number of days missed and the predicted value:

$e_{i} = y_{i} - \hat{y}_{i}$
```
b <- c(18.93,-9.11,3.10,2.15)
x <- c(1,0,1,1)
y <- 2
yhat <- 0
for (i in 1:length(b)) {
  yhat <- yhat + b[i]*x[i]
}
yhat
```
```
[1] 24.18
```
```
e <- y - yhat
e
```
```
[1] -22.18
```
The residual is -22.18 days.
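
The loop above is equivalent to a dot product of the coefficient vector and the predictor vector. Below is a minimal vectorized sketch of the same calculation, reusing the values from the first row. The labels `eth`, `sex`, and `lrn` in the comments are shorthand assumed here for ethnicity, sex, and learner status.

```r
# Coefficients: intercept, eth, sex, lrn
b <- c(18.93, -9.11, 3.10, 2.15)
# First row: leading 1 for the intercept, then eth = 0, sex = 1, lrn = 1
x <- c(1, 0, 1, 1)
y <- 2
yhat <- sum(b * x)   # predicted days absent
e <- y - yhat        # residual: observed minus predicted
c(yhat = yhat, residual = e)
```

The negative residual means the model overestimates the number of days this student missed.
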
The following were given in the problem statement:

$Var(e_{i})=240.57, Var(y_{i})=264.17, n=146, k=3$

The equations for calculating $R^{2}$ and $R^{2}_{adj}$ are:

$R^{2} = 1 - \frac{Var(e_{i})}{Var(y_{i})}$

$R^{2}_{adj} = 1 - \left[\frac{Var(e_{i})}{Var(y_{i})} \times \frac{n-1}{n-k-1}\right]$

We can use R to solve this:
```
vare <- 240.57
vary <- 264.17
n <- 146
k <- 3
R2 <- 1 - (vare/vary)
R2a <- 1 - ((vare/vary)*((n-1)/(n-k-1)))
R2
```
```
[1] 0.08933641
```
```
R2a
```
```
[1] 0.07009704
```
We can see that $R^{2}$ is approximately 0.0893 and $R^{2}_{adj}$ is approximately 0.0701.

Problem 8.8

The variable that should be removed is learner status, since the model without it has a larger $R^{2}_{adj}$ than the baseline full model. In other words, removing this variable improves $R^{2}_{adj}$.
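
One step of backward elimination with adjusted $R^{2}$ can be sketched as picking the candidate model with the largest value and comparing it with the full model. In the sketch below, the full-model value is the 0.0701 computed in Problem 8.4. The other three numbers are illustrative placeholders, not the values from the exercise table, and the names `adj_r2` and `best` are introduced here.

```r
# Adjusted R^2 for the full model and for each model with one variable removed.
# Only the "full" entry comes from Problem 8.4; the rest are hypothetical placeholders.
adj_r2 <- c(full = 0.0701, no_ethnicity = -0.0030, no_sex = 0.0680, no_learner = 0.0720)
best <- names(which.max(adj_r2))
best
# Drop a variable only if the reduced model beats the full model
if (best != "full") {
  message("Remove the variable omitted in: ", best)
} else {
  message("Keep the full model")
}
```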

Problem 8.16

Looking at the data, colder temperatures appear to be associated with more damaged O-rings. At 53 degrees, there were 5 damaged O-rings, and at each of the next three lowest temperatures (57, 58, and 63 degrees) there was 1. The frequency of damaged O-rings appears to go down as temperature increases.

The first thing that stands out to me is the slope. It is negative: for every 1 degree increase in temperature, the log odds of a damaged O-ring decrease by 0.2162, so the odds are multiplied by $e^{-0.2162} \approx 0.81$. In addition, the $P$-value is reported as 0 in the table, which means it is very small rather than exactly zero. This provides strong evidence that temperature is related to whether O-rings are damaged.

The equation for logistic regression is:

$\text{log}_{e}\left(\frac{p_{i}}{1-p_{i}}\right) = \beta_{0} + \sum_{k=1}^{m} \beta_{k}x_{k,i}$

Given the data in the table, the logistic regression line is:

$\text{log}\left(\frac{\hat{p}}{1-\hat{p}}\right) = 11.6630 - 0.2162\times\text{Temperature}$

As stated in part b, the $P$-value for temperature is essentially 0, so the relationship is statistically significant. Damaged O-rings are therefore more likely at colder temperatures than at warmer ones.
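
To make the slope easier to interpret, the sketch below converts it to an odds ratio and finds the temperature at which the fitted probability of damage is 0.5. It uses only the two coefficients from the table. The names `odds_ratio` and `temp_half` are introduced for this illustration.

```r
b0 <- 11.6630
b1 <- -0.2162
# Multiplicative change in the odds of damage per 1 degree increase
odds_ratio <- exp(b1)
# The log odds are 0 (probability 0.5) where b0 + b1 * temperature = 0
temp_half <- -b0 / b1
round(c(odds_ratio = odds_ratio, temp_half = temp_half), 3)
```

Under this model, each additional degree multiplies the odds of damage by about 0.81, and the fitted probability crosses 0.5 near 54 degrees.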

Problem 8.18

For this problem, I decided to take all the data given in problem 8.16 and create my own logistic regression model using R. I placed the data from 8.16 into vectors and then I created a for loop that listed all the trials and gave a 1 for damaged and 0 for not damaged. I then used the glm function to get the logistic regression line, which matched the problem statement.

```
n <- 6
temp <- c(53,57,58,63,66,67,67,67,68,69,70,70,70,70,72,73,75,75,76,76,78,79,81)
damaged <- c(5,1,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0)
tempall <- NULL
damagedall <- NULL
for (i in 1:length(temp)) {
  tempall <- c(tempall,rep(temp[i],n))
  damagedall <- c(damagedall,rep(1,damaged[i]))
  damagedall <- c(damagedall,rep(0,n-damaged[i]))
}
equation <- glm(damagedall ~ tempall, family=binomial)
summary(equation)
```
```

Call:
glm(formula = damagedall ~ tempall, family = binomial)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2646  -0.3395  -0.2472  -0.1299   3.0216  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) 11.66299    3.29616   3.538 0.000403 ***
tempall     -0.21623    0.05318  -4.066 4.77e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 76.745  on 137  degrees of freedom
Residual deviance: 54.759  on 136  degrees of freedom
AIC: 58.759

Number of Fisher Scoring iterations: 6
```
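
The expanded 0/1 data can also be left in grouped form. Below is a sketch of an equivalent fit, assuming the same 6 O-rings per launch: `glm` accepts a two-column matrix of successes and failures as the response, and the coefficient estimates should match those above. The names `oring` and `grouped` are introduced here.

```r
n <- 6
temp <- c(53,57,58,63,66,67,67,67,68,69,70,70,70,70,72,73,75,75,76,76,78,79,81)
damaged <- c(5,1,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0)
oring <- data.frame(temp = temp, damaged = damaged, undamaged = n - damaged)
# Response is cbind(successes, failures), one row per launch
grouped <- glm(cbind(damaged, undamaged) ~ temp, family = binomial, data = oring)
round(coef(grouped), 4)
```

The residual deviance and degrees of freedom differ from the ungrouped fit because the data are aggregated by launch, but the estimated intercept and slope are the same.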

This part asks for the probability of a damaged O-ring at 51, 53 and 55 degrees. Since I already fit the regression model, I can calculate the probability at any temperature. My selected range is 51 to 83 degrees. The R code below uses the predict function to do this.
```
probability <- data.frame(temp=c(51:83),p=rep(-1,length(c(51:83))))
probability[,2] <- predict(equation,newdata=data.frame(tempall=probability[,1]),type="response")
probability[probability$temp==51,2]
```
```
[1] 0.6536388
```
```
probability[probability$temp==53,2]
```
```
[1] 0.5504788
```
```
probability[probability$temp==55,2]
```
```
[1] 0.4427862
```
The probabilities at 51, 53 and 55 degrees are 0.654, 0.550 and 0.443, respectively.
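
These values can be cross-checked by hand with the inverse logit. The sketch below uses the rounded coefficients from Problem 8.16, so the results may differ slightly in the third decimal place from the `predict` output. The names `temps` and `p_hand` are introduced for this check.

```r
b0 <- 11.6630
b1 <- -0.2162
temps <- c(51, 53, 55)
# plogis(z) = 1 / (1 + exp(-z)) turns log odds into a probability
p_hand <- plogis(b0 + b1 * temps)
round(p_hand, 3)
```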

We can use the probability data frame from part a to plot the data.

```
library(ggplot2)
gginit <- ggplot(probability,aes(x=temp,y=p))
ggtype <- geom_point()
# p already holds the model's predicted probabilities, so connect the points
# directly instead of refitting a binomial glm to non-integer values
line <- geom_line()
gginit + ggtype + line + xlab("Temperature") + ylab("Probability of Damaged O-ring")
```

The assumptions for this model include that each O-ring outcome is a binary event, that all observations are independent, and that the sample size is large enough. The model also treats temperature as the only explanatory variable. My main concern is that the model does not take into account the degree of damage to an O-ring. Since I am not a NASA engineer, I do not know whether a slightly damaged O-ring would cause a catastrophe. It may be better to model damage on a scale. There may also be factors besides temperature that contribute to damage, which this model does not account for, so we may need to look at other variables as well.