Week 8 Homework

require("ggplot2")

## Loading required package: ggplot2

8.2 Baby Weights

Exercise 8.1 introduces a data set on birth weight of babies. Another variable we consider is parity, which is 0 if the child is the first born, and 1 otherwise. The summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in ounces, from parity.

a) Write the equation of the regression line.

Weight = 120.07 - 1.93(Parity)

b) Interpret the slope in this context, and calculate the predicted birth weight of first borns and others.

The slope of -1.93 means that, on average, babies who are not first born are predicted to weigh 1.93 ounces less than first-born babies.

According to this model, a first-born baby (parity = 0) weighs, on average, 120.07 ounces.

According to this model, a non-first-born baby (parity = 1) weighs, on average, 120.07 - 1.93 = 118.14 ounces.
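The two predictions can be reproduced with a short R sketch. The coefficients come from the regression line in part (a); the helper name `predict_weight` is an illustrative assumption, not part of the original analysis.

```r
# Coefficients from the fitted line: Weight = 120.07 - 1.93 * Parity
b0 <- 120.07
b1 <- -1.93

# Hypothetical helper: predicted birth weight (ounces) for a 0/1 parity indicator
predict_weight <- function(parity) b0 + b1 * parity

first_born     <- predict_weight(0)  # parity = 0, first born
not_first_born <- predict_weight(1)  # parity = 1, not first born

c(first_born = first_born, not_first_born = not_first_born)
```

Because parity is a 0/1 indicator, the intercept is the prediction for first borns and the slope is the difference between the two groups.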

c) Is there a statistically significant relationship between the average birth weight and parity?

Because the p-value is greater than .05 and the t statistic is between -2 and 2, there is not a statistically significant relationship between average birth weight and parity.

8.4 Absenteeism

Researchers interested in the relationship between absenteeism from school and certain demographic characteristics of children collected data from 146 randomly sampled students in rural New South Wales, Australia, in a particular school year. Below are three observations from this data set.

a) Write the equation of the regression line.

Days = 18.93 - 9.11(eth) + 3.10(sex) + 2.15(lrn)

b) Interpret each one of the slopes in this context.

According to this model, students who are not aboriginal are predicted to miss 9.11 fewer days on average than aboriginal students, holding sex and learner status constant.

According to this model, male students are predicted to miss 3.10 more days on average than female students, holding ethnicity and learner status constant.

According to this model, slow learners are predicted to miss 2.15 more days on average than other learners, holding ethnicity and sex constant.

c) Calculate the residual for the first observation in the data set: a student who is aboriginal, male, a slow learner, and missed 2 days of school.

Days(predicted) = 18.93 - 9.11(0) + 3.10(1) + 2.15(1)

Days(predicted) = 24.18

Days(actual) = 2

residual = actual - predicted = 2 - 24.18 = -22.18

The negative residual means the model overestimates this student's absences by 22.18 days.
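A residual is the observed value minus the predicted value, so the calculation can be checked with the sketch below. The function name `predict_days` is hypothetical, and the 0/1 coding (eth = 0 for aboriginal, sex = 1 for male, lrn = 1 for slow learner) is taken from the interpretations in part (b).

```r
# Hypothetical helper implementing the fitted line from part (a)
predict_days <- function(eth, sex, lrn) {
  18.93 - 9.11 * eth + 3.10 * sex + 2.15 * lrn
}

predicted <- predict_days(eth = 0, sex = 1, lrn = 1)  # aboriginal, male, slow learner
observed  <- 2                                        # days actually missed

# Residual = observed - predicted; a negative value means the model overestimates
residual <- observed - predicted
residual
```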

d) The variance of the residuals is 240.57, and the variance of the number of absent days for all students in the data set is 264.17. Calculate the R2 and the adjusted R2. Note that there are 146 observations in the data set.

R2 = 1 - (240.57)/(264.17)
R2

## [1] 0.08933641

AdjR2 = 1 - (240.57/264.17)*((146-1)/(146-3-1))
AdjR2

## [1] 0.07009704

8.6 Cherry Trees

Timber yield is approximately equal to the volume of a tree; however, this value is difficult to measure without first cutting the tree down. Instead, other variables, such as height and diameter, may be used to predict a tree’s volume and yield. Researchers wanting to understand the relationship between these variables for black cherry trees collected data from 31 such trees in the Allegheny National Forest, Pennsylvania. Height is measured in feet, diameter in inches (at 54 inches above ground), and volume in cubic feet.

a) Calculate a 95% confidence interval for the coefficient of height, and interpret it in the context of the data.

For a 95% confidence interval, alpha = .05. With 31 - 2 - 1 = 28 degrees of freedom, the critical value is t = 2.05.

CI = (b1 - t * se, b1 + t * se)

Lower = .34 - 2.05 * .13
Lower

## [1] 0.0735

Upper = .34 + 2.05 * .13
Upper

## [1] 0.6065

CI = (0.0735, 0.6065)

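We are 95% confident that the true coefficient of height lies between 0.0735 and 0.6065: holding diameter constant, each additional foot of height is associated with an average increase of 0.0735 to 0.6065 cubic feet in volume.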
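The critical value of 2.05 can be obtained from the t distribution rather than a table. The sketch below assumes the model has two predictors (height and diameter) and uses the estimate 0.34 and standard error 0.13 from the calculation above; the variable names are illustrative.

```r
b1 <- 0.34   # estimated coefficient of height
se <- 0.13   # its standard error
n  <- 31     # number of trees
k  <- 2      # number of predictors (height, diameter)

df     <- n - k - 1          # residual degrees of freedom
t_star <- qt(0.975, df)      # two-sided 95% critical value

ci <- b1 + c(-1, 1) * t_star * se
round(c(t_star = t_star, lower = ci[1], upper = ci[2]), 4)
```

Using the unrounded critical value changes the endpoints only in the fourth decimal place, and the interval still excludes zero.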

b) One tree in this sample is 79 feet tall, has a diameter of 11.3 inches, and is 24.2 cubic feet in volume. Determine if the model overestimates or underestimates the volume of this tree, and by how much.

Volume(predicted) = -57.99 + .34(height) + 4.71(diameter)

Volume = -57.99 + .34*(79) + 4.71*(11.3)
Volume

## [1] 22.093

Volume(predicted) = 22.093

Volume(actual) = 24.2

The model underestimates the volume of this tree by 24.2 - 22.093 = 2.107 cubic feet.

8.8 Absenteeism

Exercise 8.4 considers a model that predicts the number of days absent using three predictors: ethnic background (eth), gender (sex), and learner status (lrn). The table below shows the adjusted R-squared for the model as well as adjusted R-squared values for all models we evaluate in the first step of the backwards elimination process.

Which, if any, variable should be removed from the model first?

The learner status variable should be removed first, because the model without it has a higher adjusted R2 than the full model.
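The first backward-elimination step can be sketched in R as follows. This assumes that the `quine` data set shipped with the MASS package (146 children from rural New South Wales, with columns `Eth`, `Sex`, `Lrn`, and `Days`) is the data behind this exercise; that correspondence is an assumption, and its factor coding may differ from the 0/1 coding used in the exercise.

```r
library(MASS)  # assumption: quine matches the absenteeism data in the exercise

# Full model with all three predictors
full     <- lm(Days ~ Eth + Sex + Lrn, data = quine)
adj_full <- summary(full)$adj.r.squared

# Refit the model dropping one predictor at a time and record adjusted R2
predictors <- c("Eth", "Sex", "Lrn")
adj_drop <- sapply(predictors, function(p) {
  reduced <- update(full, as.formula(paste(". ~ . -", p)))
  summary(reduced)$adj.r.squared
})

adj_full
adj_drop
# A variable is a removal candidate only if dropping it raises adjusted R2
adj_drop[adj_drop > adj_full]
```

If no reduced model beats the full model, backward elimination stops and keeps every predictor.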

8.10 Absenteeism

Exercise 8.4 provides regression output for the full model, including all explanatory variables available in the data set, for predicting the number of days absent from school. In this exercise we consider a forward-selection algorithm and add variables to the model one-at-a-time. The table below shows the p-value and adjusted R2 of each model where we include only the corresponding predictor. Based on this table, which variable should be added to the model first?

Ethnicity should be added to the model first: its single-predictor model has the best adjusted R2, and its p-value is below .05, meaning it is statistically significant.
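The table for the first forward-selection step can be rebuilt with a short sketch. As an assumption, it uses the `quine` data set from the MASS package as a stand-in for the exercise data; the object name `first_step` is illustrative.

```r
library(MASS)  # assumption: quine matches the absenteeism data in the exercise

predictors <- c("Eth", "Sex", "Lrn")

# Fit one single-predictor model per variable and collect p-value and adjusted R2
first_step <- t(sapply(predictors, function(p) {
  fit <- summary(lm(reformulate(p, response = "Days"), data = quine))
  c(p_value = fit$coefficients[2, "Pr(>|t|)"],
    adj_r2  = fit$adj.r.squared)
}))

# Sort so the strongest candidate for the first addition is on top
first_step[order(-first_step[, "adj_r2"]), ]
```

Each predictor here is a two-level factor, so the second row of the coefficient table is the only slope and its p-value is the one to compare.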

8.12 Movie Lovers

Suppose an online media streaming company is interested in building a movie recommendation system. The website maintains data on the movies in their database (genre, length, cast, director, budget, etc.) and additionally collects data from their subscribers (demographic information, previously watched movies, how they rated previously watched movies, etc.). The recommendation system will be deemed successful if subscribers actually watch, and rate highly, the movies recommended to them. Should the company use the adjusted R2 or the p-value approach in selecting variables for their recommendation system?

I think the p-value approach would be better for selecting variables here. The p-value will show if the variable is statistically significant in the model.

8.14 GPA and IQ

A regression model for predicting GPA from gender and IQ was fit, and both predictors were found to be statistically significant. Using the plots given below, determine if this regression model is appropriate for these data.

Looking at the plots, the residuals seem mostly normal and follow the regression line very well. It seems like the regression should have a strong R2 value. There are just a few lower values that do not follow the regression line. It is also good to note that there does not seem to be correlation between the residuals and fitted values. The regression model looks appropriate for these data.

8.18 Challenger Disaster

a) The data provided in the previous exercise are shown in the plot. The logistic model fit to these data may be written as:

log(^p / (1 - ^p)) = 11.6630 - 0.2162 x Temperature

where ^p is the model-estimated probability that an O-ring will become damaged. Use the model to calculate the probability that an O-ring will become damaged at each of the following ambient temperatures: 51, 53, and 55 degrees Fahrenheit. The model-estimated probabilities for several additional ambient temperatures are provided below, where subscripts indicate the temperature:

^p57 = 0.341 ^p59 = 0.251 ^p61 = 0.179 ^p63 = 0.124

^p65 = 0.084 ^p67 = 0.056 ^p69 = 0.037 ^p71 = 0.024

Temperature1 = 51

Damage1 = 11.6630 - 0.2162 * Temperature1
P1 = exp(Damage1) / (1 + exp(Damage1))



P1

## [1] 0.6540297

Temperature2 = 53

Damage2 = 11.6630 - 0.2162 * Temperature2
P2 = exp(Damage2) / (1 + exp(Damage2))



P2

## [1] 0.5509228

Temperature3 = 55

Damage3 = 11.6630 - 0.2162 * Temperature3
P3 = exp(Damage3) / (1 + exp(Damage3))



P3

## [1] 0.4432456
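The three calculations above can also be done in one vectorized step. This sketch uses base R's `plogis()`, the inverse logit, which is equivalent to `exp(x) / (1 + exp(x))`; the variable names are illustrative.

```r
temps <- c(51, 53, 55)

# Log-odds of damage from the fitted model in part (a)
log_odds <- 11.6630 - 0.2162 * temps

# plogis() maps log-odds to probabilities: exp(x) / (1 + exp(x))
p_hat <- plogis(log_odds)

round(setNames(p_hat, temps), 3)
```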

b) Add the model-estimated probabilities from part (a) on the plot, then connect these dots using a smooth curve to represent the model-estimated probabilities.

Temperature <- c(53,57,58,63,66,67,67,67,68,69,70,70,70,70,72,73,75,75,76,76,78,79,81)

Damaged <- c(5,1,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0)

Undamaged <- c(1,5,5,5,6,6,6,6,6,6,5,6,5,6,6,6,6,5,6,6,6,6,6)

Challenger <- data.frame(Temperature, Damaged, Undamaged)
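library(ggplot2)
# Plot the observed proportion of damaged O-rings per launch (out of 6).
# The family goes in method.args; each point is weighted by its number of O-rings.
ggplot(Challenger, aes(x = Temperature, y = Damaged / (Damaged + Undamaged))) + geom_point() +
stat_smooth(aes(weight = Damaged + Undamaged), method = 'glm', method.args = list(family = 'binomial'))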

Temp <- seq(from = 51, to = 71, by = 2)
Prob <- c(P1, P2, P3, 0.341, 0.251, 0.179, 0.124, 0.084, 0.056, 0.037, 0.024)
plot(Temp, Prob, type = "o", col = "red")
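As a cross-check, the smooth curve can be computed directly from a fitted model instead of from hand-entered probabilities. This is a sketch under the assumption that the logistic model in part (a) was fit to the 23 launches listed above, with six O-rings per launch; if so, the estimated coefficients should be close to 11.6630 and -0.2162. The names `fit` and `grid` are illustrative.

```r
Temperature <- c(53,57,58,63,66,67,67,67,68,69,70,70,70,70,72,73,75,75,76,76,78,79,81)
Damaged     <- c(5,1,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0)
Undamaged   <- 6 - Damaged  # six O-rings per launch, as in the data above

# Binomial GLM on (successes, failures) pairs; the logit link is the default
fit <- glm(cbind(Damaged, Undamaged) ~ Temperature, family = binomial)
coef(fit)

# Model-estimated probabilities on a fine temperature grid for a smooth curve
grid <- data.frame(Temperature = seq(51, 81, by = 0.5))
grid$Prob <- predict(fit, newdata = grid, type = "response")
head(grid)
```

The `grid` data frame can then be drawn as a line over the observed points with whichever plotting approach is already in use.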

c) Describe any concerns you may have regarding applying logistic regression in this application, and note any assumptions that are required to accept the model’s validity.

To apply logistic regression in this application, two conditions must hold.

The first condition is:

Each predictor xi must be linearly related to logit(pi) if all other predictors are held constant.

This condition is tough to verify with only 23 launches.

The second condition is:

Each outcome Yi is independent of the other outcomes.

Each launch should be independent of the others, so this condition is reasonably met.