8.2

a) Write the equation of the regression line. Birth weight = 120.07 - 1.93 * parity

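b) Interpret the slope in this context, and calculate the predicted birth weight of first borns and others. Taking parity as 0 for first borns and 1 for all other babies, the slope says that the predicted birth weight of babies who are not first born is 1.93 ounces lower than that of first borns. The predicted birth weight is 120.07 ounces for first borns and 120.07 - 1.93 = 118.14 ounces for others.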
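The two predictions asked for in (b) can be checked with a short R sketch. It assumes that parity is coded 0 for first borns and 1 for all other babies; `predict_weight` is a hypothetical helper name used only for illustration.

```r
# Fitted line from part (a); the coding 0 = first born, 1 = otherwise is an assumption
predict_weight <- function(parity) 120.07 - 1.93 * parity

first_born <- predict_weight(0)  # intercept only
others     <- predict_weight(1)  # intercept plus slope
c(first_born = first_born, others = others)
```

The difference between the two predictions is the slope itself, which is why the slope of a 0/1 predictor reads as a difference in group averages.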

c) Is there a statistically significant relationship between the average birth weight and parity? Because the p-value is greater than 0.05, we conclude that there is no statistically significant relationship between average birth weight and parity.

8.4

a) Write the equation of the regression line. Days Absent = 18.93 - 9.11*ethn + 3.10*sex + 2.15*lrn

b) Interpret each one of the slopes in this context. For ethnicity, holding all other variables constant, predicted absenteeism is 9.11 days lower when the student is not aboriginal.

For sex, holding all other variables constant, predicted absenteeism is 3.10 days higher when the student is male.

For learner status, holding all other variables constant, predicted absenteeism is 2.15 days higher when the student is a slow learner (the coefficient is positive).
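The phrase "holding all other variables constant" can be made concrete with a small R sketch: change one predictor from 0 to 1, keep the other two fixed, and compare the predictions. The function name `predict_absent` is hypothetical; the coefficients come from the equation in part (a).

```r
# Fitted model from part (a); each predictor is treated as a 0/1 indicator
predict_absent <- function(ethn, sex, lrn) {
  18.93 - 9.11 * ethn + 3.10 * sex + 2.15 * lrn
}

# Switch one predictor from 0 to 1 while the other two stay at 0
effect_ethn <- predict_absent(1, 0, 0) - predict_absent(0, 0, 0)
effect_sex  <- predict_absent(0, 1, 0) - predict_absent(0, 0, 0)
effect_lrn  <- predict_absent(0, 0, 1) - predict_absent(0, 0, 0)
c(ethn = effect_ethn, sex = effect_sex, lrn = effect_lrn)
```

Each difference equals the corresponding slope, and its sign gives the direction of the change in predicted days absent.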

c) Calculate the residual for the first observation in the data set: a student who is aboriginal, male, a slow learner, and missed 2 days of school.

ethn<- 0
sex<- 1
lrn<- 1
missedtwodays<-2
predicted<-18.93-9.11*ethn+3.10*sex+2.15*lrn
missedtwodays-predicted

## [1] -22.18

The residual is -22.18 days, so the model overpredicts this student's absences by 22.18 days.

d) The variance of the residuals is 240.57, and the variance of the number of absent days for all students in the data set is 264.17. Calculate the R2 and the adjusted R2. Note that there are 146 observations in the data set.

n<- 146
k<-3
varianceresidual<-240.57
varianceallstudents<-264.17
rsquared<-1-(varianceresidual/varianceallstudents)
rsquared

## [1] 0.08933641

adjustedrsquared<-1-(varianceresidual/varianceallstudents)*((n-1)/(n-k-1))
adjustedrsquared

## [1] 0.07009704

R squared is approximately 0.0893, and adjusted R squared is approximately 0.0701.

8.6

a) Calculate a 95% confidence interval for the coefficient of height, and interpret it in the context of the data.

0.34+(1.96*0.13)

## [1] 0.5948

0.34-(1.96*0.13)

## [1] 0.0852

We are 95% confident that, holding diameter constant, each additional foot of tree height is associated with an average increase of 0.09 to 0.59 cubic feet in tree volume.
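The interval above uses the normal critical value 1.96. A minimal sketch of the same calculation as a reusable function follows; `coef_ci` is a hypothetical helper, and the `df = 28` in the second call is an illustrative assumption only, since the sample size is not stated in this write-up.

```r
# Interval = estimate +/- critical value * standard error
coef_ci <- function(estimate, se, df = NULL, level = 0.95) {
  alpha <- 1 - level
  # Use the normal critical value unless residual degrees of freedom are supplied
  crit <- if (is.null(df)) qnorm(1 - alpha / 2) else qt(1 - alpha / 2, df)
  c(lower = estimate - crit * se, upper = estimate + crit * se)
}

ci_normal <- coef_ci(0.34, 0.13)           # z-based, close to the 1.96 used above
ci_t      <- coef_ci(0.34, 0.13, df = 28)  # df = 28 is a hypothetical value
ci_normal
ci_t
```

With a small number of observations the t-based interval is somewhat wider than the normal-based one, so the choice of critical value can matter at the margins.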

b) One tree in this sample is 79 feet tall, has a diameter of 11.3 inches, and is 24.2 cubic feet in volume. Determine if the model overestimates or underestimates the volume of this tree, and by how much.

-57.99 + 0.34*79 + 4.71*11.3

## [1] 22.093

(-57.99 + 0.34*79 + 4.71*11.3) - 24.2

## [1] -2.107

The model predicts a volume of 22.093 cubic feet for this tree, so it underestimates the actual volume of 24.2 cubic feet by 2.107 cubic feet.
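The usual sign convention is residual = observed - predicted, so a positive residual means the model underestimates. A short sketch of that check, using the coefficients and tree measurements quoted above (variable names are illustrative):

```r
observed  <- 24.2                               # cubic feet, from the exercise
predicted <- -57.99 + 0.34 * 79 + 4.71 * 11.3   # fitted volume for this tree
residual  <- observed - predicted               # positive means the model underestimates

direction <- if (residual > 0) "underestimates" else "overestimates"
cat("The model", direction, "the volume by", round(abs(residual), 3), "cubic feet\n")
```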

8.8

Which, if any, variable should be removed from the model first? The learner status variable should be removed first, because the adjusted R-squared is largest when it is dropped.

8.10

Based on this table, which variable should be added to the model first? Ethnicity should be added first, because it gives the highest adjusted R-squared and has a statistically significant p-value.

8.12

Should the company use the adjusted R2 or the p-value approach in selecting variables for their recommendation system? Because the goal is to optimize prediction accuracy, the company should use the adjusted R-squared approach.

8.14

Using the plots given below, determine if this regression model is appropriate for these data. The first plot checks for nearly normal residuals: the distribution of the residuals is nearly normal, but there are some large outliers in the lower tail.

The second plot checks for constant variability of the residuals: there is some structure in the middle of the plot, and the variance of the residuals does not appear constant overall.

The third plot checks for independent residuals: the residuals plotted in the order of their collection show a random scatter, so there is no apparent structure that would indicate a problem.

The fourth and fifth plots check for linear relationships between the response variable and the explanatory variables. The residuals plotted against IQ are randomly distributed around 0, with some outliers, and the residuals plotted against gender are also randomly distributed around 0.

8.18

exp(11.663 - 0.2162*51)/(1+exp(11.663 - 0.2162*51))

## [1] 0.6540297

exp(11.663 - 0.2162*53)/(1+exp(11.663 - 0.2162*53))

## [1] 0.5509228

exp(11.663 - 0.2162*55)/(1+exp(11.663 - 0.2162*55))

## [1] 0.4432456

The model-estimated probabilities of damage are 0.654 at a temperature of 51 degrees, 0.551 at 53 degrees, and 0.443 at 55 degrees.
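The three calculations above apply the inverse logit to the fitted log-odds. Base R provides this as `plogis()`, so the same probabilities can be obtained in one vectorized step; the variable names below are illustrative.

```r
# Fitted log-odds of damage, using the coefficients from the calculations above
temps    <- c(51, 53, 55)
log_odds <- 11.663 - 0.2162 * temps

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
p_damage <- plogis(log_odds)
round(setNames(p_damage, temps), 3)
```

Passing a longer temperature vector gives the full set of points needed for plotting the curve.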

b) Add the model-estimated probabilities from part (a) on the plot, then connect these dots using a smooth curve to represent the model-estimated probabilities.

Temp<-c(51,53,55,57,59,61,63,65,67,69,71)
Probabilitydamage<-c(0.654,0.551,0.443,0.341,0.251,0.179,0.124,0.084,0.056,0.037,0.024)
Challenger<-data.frame(Temp,Probabilitydamage)
Challenger

##    Temp Probabilitydamage
## 1    51             0.654
## 2    53             0.551
## 3    55             0.443
## 4    57             0.341
## 5    59             0.251
## 6    61             0.179
## 7    63             0.124
## 8    65             0.084
## 9    67             0.056
## 10   69             0.037
## 11   71             0.024

require(ggplot2)

## Loading required package: ggplot2

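# Draw the fitted logistic curve directly; geom_smooth(method = "glm") defaults to a straight-line Gaussian fit
ggplot(Challenger, aes(y = Probabilitydamage, x = Temp)) + geom_point() + stat_function(fun = function(temp) plogis(11.663 - 0.2162 * temp))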

c) Describe any concerns you may have regarding applying logistic regression in this application, and note any assumptions that are required to accept the model’s validity.

HW 8

Carola Rojas

3/12/2018
