8.2 Baby weights, Part II. Exercise 8.1 introduces a data set on birth weight of babies. Another variable we consider is parity, which is 0 if the child is the first born, and 1 otherwise. The summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in ounces, from parity.

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 120.07 | 0.60 | 199.94 | 0.0000 |
| parity | -1.93 | 1.19 | -1.62 | 0.1052 |

  1. Write the equation of the regression line.
    Answer: The general equation of a regression line is
    \(y = c + a_1 x_1 + a_2 x_2 + \dots + a_n x_n\), where \(x_1, x_2, \dots, x_n\) are the independent variables and \(a_1, a_2, \dots, a_n\) are their coefficients (parameter estimates).

\(\widehat{\text{birthweight}} = 120.07 - 1.93 \times \text{parity}\)

  2. Interpret the slope in this context, and calculate the predicted birth weight of first borns and others.
    Answer:
    The slope indicates that babies who are not first born are predicted to weigh, on average, 1.93 ounces less at birth than first-born babies.
    Predicted birth weight of first borns (parity = 0): \(120.07 - 1.93 \times 0 = 120.07\) ounces.
    Predicted birth weight of others (parity = 1): \(120.07 - 1.93 \times 1 = 118.14\) ounces.

  3. Is there a statistically significant relationship between the average birth weight and parity?
    Answer:
    No. The p-value for the parity coefficient is 0.1052, which is greater than 0.05, so we fail to reject the null hypothesis that the slope is zero. The data do not provide convincing evidence of a relationship between average birth weight and parity.
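
As a quick numerical check of the fitted line, the sketch below plugs both parity codes into the equation. The coefficients are copied from the summary table; the helper name `predict_weight` is a hypothetical name introduced only for illustration.

```r
# Coefficients copied from the regression summary table (units: ounces).
intercept <- 120.07
slope_parity <- -1.93

# Hypothetical helper: predicted average birth weight for a parity code (0 or 1).
predict_weight <- function(parity) intercept + slope_parity * parity

predicted <- c(first_born = predict_weight(0), not_first_born = predict_weight(1))
print(predicted)

# The gap between the two groups equals the slope.
gap <- unname(predicted["not_first_born"] - predicted["first_born"])
print(gap)
```

Because parity is a 0/1 indicator, the slope is simply the difference between the two group predictions.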


8.4 Absenteeism, Part I. Researchers interested in the relationship between absenteeism from school and certain demographic characteristics of children collected data from 146 randomly sampled students in rural New South Wales, Australia, in a particular school year. Below are three observations from this data set.

| | eth | sex | lrn | days |
|---|---|---|---|---|
| 1 | 0 | 1 | 1 | 2 |
| 2 | 0 | 1 | 1 | 11 |
| 146 | 1 | 0 | 0 | 37 |

The summary table below shows the results of a linear regression model for predicting the average number of days absent based on ethnic background (eth: 0 - aboriginal, 1 - not aboriginal), sex (sex: 0 - female, 1 - male), and learner status (lrn: 0 - average learner, 1 - slow learner).

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 18.93 | 2.57 | 7.37 | 0.0000 |
| eth | -9.11 | 2.60 | -3.51 | 0.0000 |
| sex | 3.10 | 2.64 | 1.18 | 0.2411 |
| lrn | 2.15 | 2.65 | 0.81 | 0.4177 |

(a) Write the equation of the regression line.
Answer: The general equation of a regression line is
\(y = c + a_1 x_1 + a_2 x_2 + \dots + a_n x_n\), where \(x_1, x_2, \dots, x_n\) are the independent variables and \(a_1, a_2, \dots, a_n\) are their coefficients (parameter estimates).

\(\widehat{\text{absent}} = 18.93 - 9.11 \times \text{eth} + 3.10 \times \text{sex} + 2.15 \times \text{lrn}\)

(b) Interpret each one of the slopes in this context.
Answer:
The signs of the slopes indicate that predicted absenteeism is higher for males and for slow learners, and lower for non-aboriginal students.
eth (ethnic background): The model predicts that non-aboriginal students are absent 9.11 fewer days, on average, than aboriginal students, all else held constant.
sex: The model predicts that male students are absent 3.10 more days, on average, than female students, all else held constant.
lrn (learner status): The model predicts that slow learners are absent 2.15 more days, on average, than average learners, all else held constant.

(c) Calculate the residual for the first observation in the data set: a student who is aboriginal, male, a slow learner, and missed 2 days of school.
Answer:

# Model prediction: 
# Given for this case: eth=0, sex=1, lrn=1
Absent <- 18.93 - 9.11 * (0) + 3.10 * (1) + 2.15 * (1)
Residual <- 2 - Absent
paste("Residual of absence days calculated: ", Residual)
## [1] "Residual of absence days calculated:  -22.18"
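
The "all else held constant" reading of each slope can be checked numerically: two students who differ in only one indicator have predictions that differ by exactly that indicator's coefficient. The sketch below uses the coefficients from the summary table; `predict_absent` is a hypothetical helper introduced for illustration.

```r
# Coefficients copied from the regression summary table.
coefs <- c(intercept = 18.93, eth = -9.11, sex = 3.10, lrn = 2.15)

# Hypothetical helper: predicted days absent for given 0/1 indicator values.
predict_absent <- function(eth, sex, lrn) {
  unname(coefs["intercept"] + coefs["eth"] * eth +
           coefs["sex"] * sex + coefs["lrn"] * lrn)
}

# Change one indicator at a time, keeping the other two fixed.
diff_eth <- predict_absent(1, 1, 1) - predict_absent(0, 1, 1)
diff_sex <- predict_absent(0, 1, 1) - predict_absent(0, 0, 1)
diff_lrn <- predict_absent(0, 1, 1) - predict_absent(0, 1, 0)

print(c(eth = diff_eth, sex = diff_sex, lrn = diff_lrn))
```

Each difference reproduces the corresponding slope regardless of which values the other two indicators are fixed at.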
(d) The variance of the residuals is 240.57, and the variance of the number of absent days for all students in the data set is 264.17. Calculate the \(R^2\) and the adjusted \(R^2\). Note that there are 146 observations in the data set.
Answer:
n <- 146 # Number of observations used to fit the model
k <- 3   # number of predictor variables in the model
var_Abs_Res <- 240.57 # Variance of residual
var_Abs_All <- 264.17      # Variance for absence of all students

R2 <- 1 - (var_Abs_Res/var_Abs_All)  # R2
R2
## [1] 0.08933641
paste("Rsquare of absenteeism model: ", R2)
## [1] "Rsquare of absenteeism model:  0.0893364121588371"
adjustedR2 <- 1 - (var_Abs_Res / var_Abs_All) * ( (n-1) / (n-k-1) ) # Adjusted R2
paste("Adjusted Rsquare of absenteeism model: ", adjustedR2)
## [1] "Adjusted Rsquare of absenteeism model:  0.0700970405847281"

So \(R^2 \approx 0.0893\) and adjusted \(R^2 \approx 0.0701\): the model explains only about 9% of the variability in days absent.

8.6 Cherry trees. Timber yield is approximately equal to the volume of a tree; however, this value is difficult to measure without first cutting the tree down. Instead, other variables, such as height and diameter, may be used to predict a tree’s volume and yield. Researchers wanting to understand the relationship between these variables for black cherry trees collected data from such trees in the Allegheny National Forest, Pennsylvania. Height is measured in feet, diameter in inches (at 54 inches above ground), and volume in cubic feet.

| | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -57.99 | 8.64 | -6.71 | 0.00 |
| height | 0.34 | 0.13 | 2.61 | 0.01 |
| diameter | 4.71 | 0.26 | 17.82 | 0.00 |

dia<-c(83,86,88,105,107,108,110,110,111,112,113,114,114,117,120,129,129,133,137,138,140,142,145,160,163,173,175,179,180,180,206)
ht<-c(70,65,63,72,81,83,66,75,80,75,79,76,76,69,75,74,85,86,71,64,78,80,74,72,77,81,82,80,80,80,87)
vol<-c(103,103,102,164,188,197,156,182,226,199,242,210,214,213,191,222,338,274,257,249,345,317,363,383,426,554,557,583,515,510,770)
dia<-dia/10
vol<-vol/10
meas <- cbind(dia,ht,vol)
meas
##        dia ht  vol
##  [1,]  8.3 70 10.3
##  [2,]  8.6 65 10.3
##  [3,]  8.8 63 10.2
##  [4,] 10.5 72 16.4
##  [5,] 10.7 81 18.8
##  [6,] 10.8 83 19.7
##  [7,] 11.0 66 15.6
##  [8,] 11.0 75 18.2
##  [9,] 11.1 80 22.6
## [10,] 11.2 75 19.9
## [11,] 11.3 79 24.2
## [12,] 11.4 76 21.0
## [13,] 11.4 76 21.4
## [14,] 11.7 69 21.3
## [15,] 12.0 75 19.1
## [16,] 12.9 74 22.2
## [17,] 12.9 85 33.8
## [18,] 13.3 86 27.4
## [19,] 13.7 71 25.7
## [20,] 13.8 64 24.9
## [21,] 14.0 78 34.5
## [22,] 14.2 80 31.7
## [23,] 14.5 74 36.3
## [24,] 16.0 72 38.3
## [25,] 16.3 77 42.6
## [26,] 17.3 81 55.4
## [27,] 17.5 82 55.7
## [28,] 17.9 80 58.3
## [29,] 18.0 80 51.5
## [30,] 18.0 80 51.0
## [31,] 20.6 87 77.0
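  1. Calculate a 95% confidence interval for the coefficient of height, and interpret it in the context of the data.
    Answer:
    The model has k = 2 predictors and n = 31 trees, so the degrees of freedom are df = n - k - 1 = 28. Because n is small, we use the t table to find the two-tailed multiplier for \(\alpha = 0.05\) (0.025 in each tail), which gives \(t^*_{28} \approx 2.05\).

\(\text{Confidence interval} = (b_1 - t^* \times SE,\; b_1 + t^* \times SE)\)
\(\text{Confidence interval} = (0.34 - 2.05 \times 0.13,\; 0.34 + 2.05 \times 0.13)\)
\(\text{Confidence interval} = (0.0735,\; 0.6065)\)

We are 95% confident that, holding diameter constant, each additional foot of height is associated with an average increase of about 0.07 to 0.61 cubic feet in volume.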
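
The multiplier can also be computed rather than read from a table. The sketch below assumes the residual degrees of freedom are n - k - 1 with k = 2 predictors (height and diameter) and n = 31 trees, and uses the estimate and standard error from the summary table. Because `qt()` returns an unrounded multiplier, the bounds differ slightly from a hand calculation that uses 2.05.

```r
# Estimate and standard error for height, copied from the summary table.
b_height <- 0.34
se_height <- 0.13

n <- 31          # trees in the sample
k <- 2           # predictors in the model (height, diameter)
df <- n - k - 1  # residual degrees of freedom

# Two-sided 95% multiplier from the t distribution.
t_star <- qt(0.975, df = df)

# Confidence interval: estimate +/- multiplier * standard error.
ci <- b_height + c(-1, 1) * t_star * se_height

print(round(t_star, 3))
print(round(ci, 4))
```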

  2. One tree in this sample is 79 feet tall, has a diameter of 11.3 inches, and is 24.2 cubic feet in volume. Determine if the model overestimates or underestimates the volume of this tree, and by how much.
    Answer:
    Using the estimates in the regression equation:
    \(\widehat{vol} = -57.99 + 0.34 \times ht + 4.71 \times dia\)
    \(\widehat{vol} = -57.99 + 0.34 \times 79 + 4.71 \times 11.3\)
    \(\widehat{vol} = 22.093\)

The actual volume is 24.2 cubic feet and the predicted volume is 22.093 cubic feet, so the residual is 24.2 - 22.093 = 2.107. The model underestimates the volume of this tree by about 2.11 cubic feet.
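
The hand calculations for this exercise can be cross-checked by refitting the model. The sketch below assumes that R's built-in `trees` data set (columns `Girth`, `Height`, `Volume`) is the same black cherry data listed above, with `Girth` playing the role of diameter; that correspondence is an assumption, not something stated in the exercise. Small differences from the hand-computed values are expected because the table reports rounded coefficients.

```r
# Assumption: the built-in `trees` data match the exercise data
# (Girth = diameter in inches, Height in feet, Volume in cubic feet).
fit <- lm(Volume ~ Height + Girth, data = trees)

# Coefficients rounded the same way as the summary table.
print(round(coef(fit), 2))

# 95% confidence interval for the height coefficient.
ci_height <- confint(fit, "Height", level = 0.95)
print(ci_height)

# Row 11 is the tree that is 79 ft tall with an 11.3 in diameter.
tree <- trees[11, ]
residual_11 <- unname(tree$Volume - predict(fit, newdata = tree))
print(residual_11)  # positive residual means the model underestimates
```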


8.8 Absenteeism, Part II. Exercise 8.4 considers a model that predicts the number of days absent using three predictors: ethnic background (eth), gender (sex), and learner status (lrn). The table below shows the adjusted R-squared for the model as well as adjusted R-squared values for all models we evaluate in the first step of the backwards elimination process.

| | Model | Adjusted \(R^2\) |
|---|---|---|
| 1 | Full model | 0.0701 |
| 2 | No ethnicity | -0.0033 |
| 3 | No sex | 0.0676 |
| 4 | No learner status | 0.0723 |

Which, if any, variable should be removed from the model first?
Answer:
Removing learner status (lrn) raises the adjusted \(R^2\) from 0.0701 to 0.0723, and it is the only removal that improves on the full model. Learner status should therefore be removed from the model first.
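
One step of backward elimination by adjusted \(R^2\) can be sketched as follows. The values are copied from the table; the vector names (`full`, `no_eth`, `no_sex`, `no_lrn`) are labels made up for this illustration.

```r
# Adjusted R-squared values copied from the table.
adj_r2 <- c(full = 0.0701, no_eth = -0.0033, no_sex = 0.0676, no_lrn = 0.0723)

# Candidate models: each drops one variable from the full model.
candidates <- adj_r2[names(adj_r2) != "full"]

# Pick the candidate with the highest adjusted R-squared ...
best <- names(which.max(candidates))

# ... and drop that variable only if it beats the full model.
improves <- candidates[[best]] > adj_r2[["full"]]

print(best)
print(improves)
```

If no candidate beat the full model, the procedure would stop and keep all three predictors.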
***

8.10 Absenteeism, Part III. Exercise 8.4 provides regression output for the full model, including all explanatory variables available in the data set, for predicting the number of days absent from school. In this exercise we consider a forward-selection algorithm and add variables to the model one-at-a-time. The table below shows the p-value and adjusted \(R^2\) of each model where we include only the corresponding predictor. Based on this table, which variable should be added to the model first?

| variable | ethnicity | sex | learner status |
|---|---|---|---|
| p-value | 0.0007 | 0.3142 | 0.5870 |
| \(R^2_{adj}\) | 0.0714 | 0.0001 | 0 |
Answer:
Ethnicity (eth) should be added first. Its p-value (0.0007) is the smallest and the only one below 0.05, and its single-predictor model has the highest adjusted \(R^2\) (0.0714), so both selection criteria point to the same variable.
***

8.12 Movie lovers, Part II. Suppose an online media streaming company is interested in building a movie recommendation system. The website maintains data on the movies in their database (genre, length, cast, director, budget, etc.) and additionally collects data from their subscribers (demographic information, previously watched movies, how they rated previously watched movies, etc.). The recommendation system will be deemed successful if subscribers actually watch, and rate highly, the movies recommended to them. Should the company use the adjusted \(R^2\) or the p-value approach in selecting variables for their recommendation system?
Answer:
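The company should use the adjusted \(R^2\) approach. Its goal is prediction accuracy (recommending movies that subscribers will actually watch and rate highly), not identifying which individual predictors are statistically significant. The p-value approach is better suited to the latter, while selecting on adjusted \(R^2\) targets how well the model predicts.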


8.14 GPA and IQ. A regression model for predicting GPA from gender and IQ was fit, and both predictors were found to be statistically significant. Using the plots given below, determine if this regression model is appropriate for these data.
Answer:
On the normal quantile plot, almost all of the residuals lie close to the line; only the first few points (the lowest residuals) fall well away from it. The residuals also show no apparent pattern against the fitted values, which is consistent with independent errors.
Apart from the few outlying low residuals, the basic conditions for regression appear to hold, so the model seems reasonably appropriate for these data.
***

8.18 Challenger disaster, Part II. Exercise 8.16 introduced us to O-rings that were identified as a plausible explanation for the breakup of the Challenger space shuttle 73 seconds into takeoff in 1986. The investigation found that the ambient temperature at the time of the shuttle launch was closely related to the damage of O-rings, which are a critical component of the shuttle. See this earlier exercise if you would like to browse the original data.
(a) The data provided in the previous exercise are shown in the plot. The logistic model fit to these data may be written as

\(\log\left(\frac{p}{1-p}\right) = 11.6630 - 0.2162 \times Temperature\)

where p is the model-estimated probability that an O-ring will become damaged. Use the model to calculate the probability that an O-ring will become damaged at each of the following ambient temperatures: 51, 53, and 55 degrees Fahrenheit. The model-estimated probabilities for several additional ambient temperatures are provided below, where subscripts indicate the temperature:

\(p_{57} = 0.341 \quad p_{59} = 0.251 \quad p_{61} = 0.179 \quad p_{63} = 0.124\)
\(p_{65} = 0.084 \quad p_{67} = 0.056 \quad p_{69} = 0.037 \quad p_{71} = 0.024\)

Answer:

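p <- function(temp)
{
  # Log-odds of O-ring damage from the fitted logistic model
  log_odds <- 11.6630 - 0.2162 * temp
  # Inverse logit converts the log-odds to a probability
  p_hat <- exp(log_odds) / (1 + exp(log_odds))
  # Return the probability as a percentage, rounded to 2 decimals
  return (round(p_hat*100,2))
}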

# Finding probabilities for different temperatures.
paste("O-Ring Failure Probability at Temp=51 F: ", p(51),"%") ; 
## [1] "O-Ring Failure Probability at Temp=51 F:  65.4 %"
paste("O-Ring Failure Probability at Temp=53 F: ", p(53),"%") ;
## [1] "O-Ring Failure Probability at Temp=53 F:  55.09 %"
paste("O-Ring Failure Probability at Temp=55 F: ", p(55),"%")
## [1] "O-Ring Failure Probability at Temp=55 F:  44.32 %"
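
As a cross-check on the hand-written inverse logit, base R's `plogis()` applies the same transformation, \(p = e^{\eta} / (1 + e^{\eta})\), to a vector of log-odds. The sketch below only reuses the model coefficients given in the exercise; the variable names are illustrative.

```r
# Temperatures of interest, in degrees Fahrenheit.
temps <- c(51, 53, 55)

# Log-odds from the fitted model given in the exercise.
log_odds <- 11.6630 - 0.2162 * temps

# plogis() is the logistic CDF, i.e. the inverse logit.
prob <- plogis(log_odds)
names(prob) <- paste0("p_", temps)

print(round(prob, 3))
```

The probabilities fall as temperature rises, continuing the pattern of the values provided for 57 through 71 degrees.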
(b) Add the model-estimated probabilities from part (a) on the plot, then connect these dots using a smooth curve to represent the model-estimated probabilities.
Answer:
temperature <- c(53,57,58,63,66,67,67,67,68,69,70,70,70,70,72,73,75,75,76,76,78,79,81)

damaged <- c(5,1,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0)

undamaged <- c(1,5,5,5,6,6,6,6,6,6,5,6,5,6,6,6,6,5,6,6,6,6,6)

ShuttleMission <- data.frame(temperature, damaged, undamaged)
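library(ggplot2)
# Plot the proportion of damaged O-rings (out of 6 per mission) against temperature.
# The family must be passed through method.args; quasibinomial accepts proportions.
ggplot(ShuttleMission, aes(x = temperature, y = damaged / 6)) + geom_point() +
  stat_smooth(method = "glm", formula = y ~ x,
              method.args = list(family = "quasibinomial"))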

temp.x <- seq(from = 51, to = 71, by = 2)
y <- c(p(51)/100, p(53)/100, p(55)/100, 0.341, 0.251, 0.179, 0.124, 0.084, 0.056, 0.037, 0.024)
plot(temp.x, y, type = "o", col = "blue")
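
For comparison with the coefficients quoted in the exercise, the model can be refit from the mission data. This is a sketch under the assumption that each mission has 6 O-rings and that the grouped binomial form `cbind(damaged, undamaged)` is how the original model was fit; the exercise does not state the fitting method.

```r
# Mission data as listed in the answer to this part.
temperature <- c(53,57,58,63,66,67,67,67,68,69,70,70,70,70,72,73,75,75,76,76,78,79,81)
damaged <- c(5,1,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0)
undamaged <- 6 - damaged  # assumption: 6 O-rings per mission

# Grouped logistic regression: successes and failures per mission.
fit <- glm(cbind(damaged, undamaged) ~ temperature, family = binomial)
print(round(coef(fit), 4))

# Model-estimated damage probabilities at the temperatures from part (a).
new_temps <- data.frame(temperature = c(51, 53, 55))
pred <- predict(fit, newdata = new_temps, type = "response")
print(round(pred, 3))
```

Using `type = "response"` returns probabilities directly, so no manual inverse-logit step is needed.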

(c) Describe any concerns you may have regarding applying logistic regression in this application, and note any assumptions that are required to accept the model’s validity.
Answer:
    There are two key conditions for fitting a logistic regression model: