Multiple and Logistic Regression.
Graded: 8.2, 8.4, 8.8, 8.16, 8.18
8.2 Baby weights, Part II.
Exercise 8.1 introduces a data set on birth weight of babies. Another variable we consider is parity, which is 0 if the child is the first born, and 1 otherwise. The summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in ounces, from parity.
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   120.07       0.60  199.94   0.0000
parity         -1.93       1.19   -1.62   0.1052
- Write the equation of the regression line.
\(\widehat{weight} = 120.07 - 1.93 * {parity}\)
- Interpret the slope in this context, and calculate the predicted birth weight of first borns and others.
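The model predicts that non-first-born babies weigh, on average, 1.93 ounces less than first-born babies. The predicted birth weight is 120.07 ounces for first borns (parity = 0) and 120.07 - 1.93 = 118.14 ounces for others (parity = 1).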
- Is there a statistically significant relationship between the average birth weight and parity?
\(H_0: \beta_1 = 0\)
\(H_A: \beta_1 \neq 0\)
The test statistic is t = -1.62, with a p-value of 0.1052. Since the p-value exceeds 0.05, we fail to reject the null hypothesis. The data do not provide strong evidence that there is a difference in weight between first-born and non-first-born babies.
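The arithmetic behind these answers can be sketched in R. The coefficients come from the summary table above. The degrees of freedom are an assumption: n = 1236 is the sample size quoted for this data set in Exercise 8.3, so the p-value may differ slightly from the table, which was computed from unrounded estimates.

```r
# Coefficients and standard error from the Exercise 8.2 summary table
b0 <- 120.07
b1 <- -1.93
se_b1 <- 1.19

# Predicted birth weight (ounces) for first borns (parity = 0) and others (parity = 1)
pred <- b0 + b1 * c(first_born = 0, other = 1)

# t statistic and two-sided p-value for H0: beta_1 = 0
# Assumption: n = 1236 observations, so df = n - 2
t_stat <- b1 / se_b1
p_value <- 2 * pt(-abs(t_stat), df = 1236 - 2)

round(pred, 2)
round(c(t = t_stat, p = p_value), 4)
```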
8.4 Absenteeism, Part I.
Researchers interested in the relationship between absenteeism from school and certain demographic characteristics of children collected data from 146 randomly sampled students in rural New South Wales, Australia, in a particular school year. Below are three observations from this data set.
    eth sex lrn days
1     0   1   1    2
2     0   1   1   11
146   1   0   0   37
The summary table below shows the results of a linear regression model for predicting the average number of days absent based on ethnic background (eth: 0 - aboriginal, 1 - not aboriginal), sex (sex: 0 - female, 1 - male), and learner status (lrn: 0 - average learner, 1 - slow learner)
            Estimate Std. Error t value Pr(>|t|)
(Intercept)    18.93       2.57    7.37   0.0000
eth            -9.11       2.60   -3.51   0.0000
sex             3.10       2.64    1.18   0.2411
lrn             2.15       2.65    0.81   0.4177
- Write the equation of the regression line.
\(\widehat{days_{absent}} = 18.93 - 9.11eth + 3.10sex + 2.15lrn\)
- Interpret each one of the slopes in this context.
- Holding the other variables constant, the model predicts that non-aboriginal students will be absent 9.11 fewer days than aboriginal students.
- Holding the other variables constant, the model predicts that males will be absent 3.1 days more than females.
- Holding the other variables constant, the model predicts that slow learners will be absent 2.15 days more than average learners.
- Calculate the residual for the first observation in the data set: a student who is aboriginal, male, a slow learner, and missed 2 days of school.
eth <- 0
sex <- 1
lrn <- 1
yhat <- 18.93 - 9.11 * eth + 3.1 * sex + 2.15 * lrn
2 - yhat
## [1] -22.18
- The variance of the residuals is 240.57, and the variance of the number of absent days for all students in the data set is 264.17. Calculate the R2 and the adjusted R2. Note that there are 146 observations in the data set.
var_ei <- 240.57
var_yi <- 264.17
(rsquared <- 1 - var_ei/var_yi)
## [1] 0.08933641
n <- 146
k <- 3
(adj_rsquared <- 1 - var_ei/var_yi * (n - 1)/(n - k - 1))
## [1] 0.07009704
8.8 Absenteeism, Part II.
Exercise 8.4 considers a model that predicts the number of days absent using three predictors: ethnic background (eth), gender (sex), and learner status (lrn). The table below shows the adjusted R-squared for the model as well as adjusted R-squared values for all models we evaluate in the first step of the backwards elimination process.
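Learner status should be removed first. In backward elimination we drop the variable whose removal gives the highest adjusted R-squared, provided that value is higher than the adjusted R-squared of the full model.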
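One step of backward elimination by adjusted R-squared can be sketched as follows. The data here are simulated purely for illustration and are not the absenteeism data; the helper `adj_r2` and the data frame `sim` are hypothetical names introduced for this sketch.

```r
# Simulated data only: these are NOT the absenteeism data from the exercise
set.seed(1)
n <- 146
sim <- data.frame(
  eth = rbinom(n, 1, 0.5),
  sex = rbinom(n, 1, 0.5),
  lrn = rbinom(n, 1, 0.5)
)
sim$days <- 19 - 9 * sim$eth + rnorm(n, sd = 15)

# Hypothetical helper: adjusted R-squared of a linear model
adj_r2 <- function(formula, data) summary(lm(formula, data = data))$adj.r.squared

predictors <- c("eth", "sex", "lrn")
full <- adj_r2(days ~ eth + sex + lrn, sim)

# Adjusted R-squared after dropping each predictor in turn
drop_one <- sapply(predictors, function(v) {
  keep <- setdiff(predictors, v)
  adj_r2(reformulate(keep, response = "days"), sim)
})

# Candidate for removal: the predictor whose removal gives the highest
# adjusted R-squared; remove it only if that beats the full model
candidate <- names(which.max(drop_one))
remove_it <- max(drop_one) > full
```

If `remove_it` is `FALSE`, the full model is kept and elimination stops.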
8.16 Challenger disaster, Part I.
On January 28, 1986, a routine launch was anticipated for the Challenger space shuttle. Seventy-three seconds into the flight, disaster happened: the shuttle broke apart, killing all seven crew members on board. An investigation into the cause of the disaster focused on a critical seal called an O-ring, and it is believed that damage to these O-rings during a shuttle launch may be related to the ambient temperature during the launch. The table below summarizes observational data on O-rings for 23 shuttle missions, where the mission order is based on the temperature at the time of the launch. Temp gives the temperature in Fahrenheit, Damaged represents the number of damaged O-rings, and Undamaged represents the number of O-rings that were not damaged.
- Each column of the table above represents a different shuttle mission. Examine these data and describe what you observe with respect to the relationship between temperatures and damaged O-rings.
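Damage is concentrated at the lower temperatures: the coldest launch (53 degrees) had 5 damaged O-rings and the four coldest launches each had at least one, whereas most launches at 66 degrees or warmer had none. This suggests a negative relationship between temperature and O-ring damage.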
Temperature <- c(53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70, 72,
73, 75, 75, 76, 76, 78, 79, 81)
Damaged <- c(5, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
0, 0)
shuttles <- data.frame(Temperature, Damaged_indicator = as.integer(Damaged >
0))
plot(shuttles)
- Failures have been coded as 1 for a damaged O-ring and 0 for an undamaged O-ring, and a logistic regression model was fit to these data. A summary of this model is given below. Describe the key components of this summary table in words.
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  11.6630     3.2963    3.54   0.0004
Temperature  -0.2162     0.0532   -4.07   0.0000
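The intercept estimate is 11.6630 and the slope estimate for Temperature is -0.2162, both on the log-odds scale. The negative slope means the estimated odds of O-ring damage decrease as temperature increases, and the small p-values (0.0004 and 0.0000) indicate that both estimates differ significantly from zero.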
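The remaining columns of the summary table follow from the estimates and standard errors, which the sketch below recomputes. The inputs are the rounded table values, so the results can differ from the table in the last digit; the odds ratio is an extra quantity not shown in the table.

```r
# Estimates and standard errors from the logistic regression summary table
est <- c(intercept = 11.6630, temperature = -0.2162)
se  <- c(intercept = 3.2963,  temperature = 0.0532)

# z value = estimate / standard error; two-sided p-value from the normal distribution
z <- est / se
p_value <- 2 * pnorm(-abs(z))

# Multiplicative change in the odds of damage per one-degree increase in temperature
odds_ratio <- exp(est[["temperature"]])

round(z, 2)
signif(p_value, 2)
round(odds_ratio, 3)
```

An odds ratio below 1 says that each additional degree Fahrenheit lowers the estimated odds of damage.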
- Write out the logistic model using the point estimates of the model parameters.
\(\log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = 11.6630 - 0.2162 * {Temperature}\)
- Based on the model, do you think concerns regarding O-rings are justified? Explain.
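Yes. At the lowest observed temperatures (53 to 55 degrees Fahrenheit), the model estimates the chance of a damaged O-ring to be greater than 40%, and the negative slope implies the risk rises further as the temperature drops.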
Temperature <- seq(53, 81, 1)
oring_model <- function(Temperature) {
value <- 11.663 - 0.2162 * Temperature
p <- exp(value)/(1 + exp(value))
return(p)
}
temp_damage <- cbind(Temperature, p = oring_model(Temperature))
plot(temp_damage)
8.18 Challenger disaster, Part II.
Exercise 8.16 introduced us to O-rings that were identified as a plausible explanation for the breakup of the Challenger space shuttle 73 seconds into takeoff in 1986. The investigation found that the ambient temperature at the time of the shuttle launch was closely related to the damage of O-rings, which are a critical component of the shuttle. See this earlier exercise if you would like to browse the original data.
- The data provided in the previous exercise are shown in the plot. The logistic model fit to these data may be written as \(\log\left(\frac{\hat{p}}{1 - \hat{p}}\right) = 11.6630 - 0.2162 * {Temperature}\), where \(\hat{p}\) is the model-estimated probability that an O-ring will become damaged. Use the model to calculate the probability that an O-ring will become damaged at each of the following ambient temperatures: 51, 53, and 55 degrees Fahrenheit. The model-estimated probabilities for several additional ambient temperatures are provided below, where subscripts indicate the temperature:
Temperature <- c(51, 53, 55)
(p <- oring_model(Temperature))
## [1] 0.6540297 0.5509228 0.4432456
- Add the model-estimated probabilities from part (a) on the plot, then connect these dots using a smooth curve to represent the model-estimated probabilities.
(temp_damage <- rbind(temp_damage, cbind(Temperature, p)))
##       Temperature           p
## [1,] 53 0.550922830
## [2,] 54 0.497050034
## [3,] 55 0.443245647
## [4,] 56 0.390740650
## [5,] 57 0.340649763
## [6,] 58 0.293882838
## [7,] 59 0.251091387
## [8,] 60 0.212654228
## [9,] 61 0.178697069
## [10,] 62 0.149135196
## [11,] 63 0.123727017
## [12,] 64 0.102128054
## [13,] 65 0.083938432
## [14,] 66 0.068740463
## [15,] 67 0.056125657
## [16,] 68 0.045712204
## [17,] 69 0.037154789
## [18,] 70 0.030148728
## [19,] 71 0.024430239
## [20,] 72 0.019774295
## [21,] 73 0.015991141
## [22,] 74 0.012922227
## [23,] 75 0.010436032
## [24,] 76 0.008424090
## [25,] 77 0.006797363
## [26,] 78 0.005483026
## [27,] 79 0.004421698
## [28,] 80 0.003565071
## [29,] 81 0.002873921
## [30,] 51 0.654029738
## [31,] 53 0.550922830
## [32,] 55 0.443245647
plot(temp_damage)
- Describe any concerns you may have regarding applying logistic regression in this application, and note any assumptions that are required to accept the model’s validity.
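- Each outcome must be independent of the others. This is doubtful here, because O-rings on the same mission share the same launch conditions.
- The logit of the damage probability must be linearly related to temperature. This is hard to verify with only 23 missions and few damage events.
- The prediction at 51 degrees extrapolates below the coldest observed launch (53 degrees), so it should be treated with caution.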
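One rough way to look at these concerns is to refit the model and compare observed damage proportions with the fitted probabilities. This is a sketch under a stated assumption: the Undamaged counts are not reproduced in this write-up, so the code assumes six O-rings per mission and sets Undamaged to 6 minus Damaged.

```r
# Temperature and Damaged counts as entered in Exercise 8.16
Temperature <- c(53, 57, 58, 63, 66, 67, 67, 67, 68, 69, 70, 70, 70, 70, 72,
                 73, 75, 75, 76, 76, 78, 79, 81)
Damaged <- c(5, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0,
             0, 0)

# Assumption: six O-rings per mission
Undamaged <- 6 - Damaged

# Logistic regression on grouped (damaged, undamaged) counts
fit <- glm(cbind(Damaged, Undamaged) ~ Temperature, family = binomial)
coef(fit)

# Observed proportion of damaged O-rings next to the fitted probability
check <- data.frame(Temperature,
                    observed = Damaged / 6,
                    fitted = fitted(fit))
head(check)
```

Large gaps between `observed` and `fitted` at particular temperatures would hint that the straight-line relationship on the logit scale is a poor description, though with this little data the comparison is only suggestive.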
Practice: 8.1, 8.3, 8.7, 8.15, 8.17
8.1 Baby weights, Part I.
The Child Health and Development Studies investigate a range of topics. One study considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. Here, we study the relationship between smoking and weight of the baby. The variable smoke is coded 1 if the mother is a smoker, and 0 if not. The summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in ounces, based on the smoking status of the mother.
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   123.05       0.65  189.60   0.0000
smoke          -8.94       1.03   -8.65   0.0000
The variability within the smoker and non-smoker groups is about equal and the distributions are symmetric. With these conditions satisfied, it is reasonable to apply the model. (Note that we don’t need to check linearity since the predictor has only two levels.)
- Write the equation of the regression line.
\(\widehat{weight} = 123.05 - 8.94*{smoke}\)
Interpret the slope in this context, and calculate the predicted birth weight of babies born to smoker and non-smoker mothers.
Is there a statistically significant relationship between the average birth weight and smoking?
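The last two questions can be worked through numerically with the table values, as in the sketch below. The degrees of freedom are an assumption (n = 1236 is the sample size quoted in Exercise 8.3), and the t statistic is recomputed from rounded values, so it differs slightly from the table.

```r
# Coefficients and standard error from the Exercise 8.1 summary table
b0 <- 123.05
b1 <- -8.94
se_b1 <- 1.03

# Predicted birth weight (ounces) for non-smoker (smoke = 0) and smoker (smoke = 1) mothers
pred <- b0 + b1 * c(non_smoker = 0, smoker = 1)

# Test of H0: beta_1 = 0 against a two-sided alternative
# Assumption: n = 1236 observations, so df = n - 2
t_stat <- b1 / se_b1
p_value <- 2 * pt(-abs(t_stat), df = 1236 - 2)

round(pred, 2)
round(t_stat, 2)
p_value < 0.05
```

Under these assumptions the slope says babies born to smokers are predicted to weigh 8.94 ounces less on average, and the very small p-value points to a statistically significant relationship.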
8.3 Baby weights, Part III.
We considered the variables smoke and parity, one at a time, in modeling birth weights of babies in Exercises 8.1 and 8.2. A more realistic approach to modeling infant weights is to consider all possibly related variables at once. Other variables of interest include length of pregnancy in days (gestation), mother’s age in years (age), mother’s height in inches (height), and mother’s pregnancy weight in pounds (weight). Below are three observations from this data set.
The summary table below shows the results of a regression model for predicting the average birth weight of babies based on all of the variables included in the data set.
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -80.41      14.35   -5.60   0.0000
gestation       0.44       0.03   15.26   0.0000
parity         -3.33       1.13   -2.95   0.0033
age            -0.01       0.09   -0.10   0.9170
height          1.15       0.21    5.63   0.0000
weight          0.05       0.03    1.99   0.0471
smoke          -8.40       0.95   -8.81   0.0000
- Write the equation of the regression line that includes all of the variables.
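\(\widehat{birth\_weight} = -80.41 + 0.44 * {gestation} - 3.33 * {parity} - 0.01 * {age} + 1.15 * {height} + 0.05 * {weight} - 8.40 * {smoke}\)
Here weight on the right-hand side is the mother’s pregnancy weight in pounds.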
Interpret the slopes of gestation and age in this context.
The coefficient for parity is different than in the linear model shown in Exercise 8.2. Why might there be a difference?
Calculate the residual for the first observation in the data set.
The variance of the residuals is 249.28, and the variance of the birth weights of all babies in the data set is 332.57. Calculate the R2 and the adjusted R2. Note that there are 1,236 observations in the data set.
var_ei <- 249.28
var_yi <- 332.57
1 - var_ei/var_yi
## [1] 0.2504435
n <- 1236
k <- 6
1 - var_ei/var_yi * (n - 1)/(n - k - 1)
## [1] 0.2467842
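Since the same calculation appears in Exercises 8.4 and 8.3, it can be wrapped in a small function. `r_squared` is a hypothetical helper name introduced here; the inputs are the variances and sample sizes quoted in the two exercises.

```r
# Hypothetical helper: R-squared and adjusted R-squared from the residual
# variance, the outcome variance, the sample size n and the number of predictors k
r_squared <- function(var_resid, var_y, n, k) {
  ratio <- var_resid / var_y
  c(r2 = 1 - ratio,
    adj_r2 = 1 - ratio * (n - 1) / (n - k - 1))
}

# Exercise 8.4 (absenteeism) and Exercise 8.3 (baby weights)
absent_fit <- r_squared(240.57, 264.17, n = 146, k = 3)
weight_fit <- r_squared(249.28, 332.57, n = 1236, k = 6)

round(absent_fit, 4)
round(weight_fit, 4)
```

The adjusted value is always somewhat smaller than the unadjusted one, and the gap shrinks as n grows relative to k.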
8.15 Possum classification, Part I.
The common brushtail possum of the Australia region is a bit cuter than its distant cousin, the American opossum (see Figure 7.5 on page 334). We consider 104 brushtail possums from two regions in Australia, where the possums may be considered a random sample from the population. The first region is Victoria, which is in the eastern half of Australia and traverses the southern coast. The second region consists of New South Wales and Queensland, which make up eastern and northeastern Australia. We use logistic regression to differentiate between possums in these two regions. The outcome variable, called population, takes value 1 when a possum is from Victoria and 0 when it is from New South Wales or Queensland. We consider five predictors: sex male (an indicator for a possum being male), head length, skull width, total length, and tail length. Each variable is summarized in a histogram. The full logistic regression model and a reduced model after variable selection are summarized in the table.
- Examine each of the predictors. Are there any outliers that are likely to have a very large influence on the logistic regression model?
Possible outliers: total length = 75, skull width = 70.
- The summary table for the full model indicates that at least one variable should be eliminated when using the p-value approach for variable selection: head length. The second component of the table summarizes the reduced model following variable selection. Explain why the remaining estimates change between the two models.
8.17 Possum classification, Part II.
A logistic regression model was proposed for classifying common brushtail possums into their two regions in Exercise 8.15. The outcome variable took value 1 if the possum was from Victoria and 0 otherwise.
              Estimate     SE     Z Pr(>|Z|)
(Intercept)    33.5095 9.9053  3.38   0.0007
sex male       -1.4207 0.6457 -2.20   0.0278
skull width    -0.2787 0.1226 -2.27   0.0231
total length    0.5687 0.1322  4.30   0.0000
tail length    -1.8057 0.3599 -5.02   0.0000
- Write out the form of the model. Also identify which of the variables are positively associated when controlling for other variables.
- Suppose we see a brushtail possum at a zoo in the US, and a sign says the possum had been captured in the wild in Australia, but it doesn’t say which part of Australia. However, the sign does indicate that the possum is male, its skull is about 63 mm wide, its tail is 37 cm long, and its total length is 83 cm. What is the reduced model’s computed probability that this possum is from Victoria? How confident are you in the model’s accuracy of this probability calculation?
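Both parts can be worked through numerically with the reduced model's estimates, as in the sketch below. The model form is logit(p) = b0 + b1 * sex_male + b2 * skull_width + b3 * total_length + b4 * tail_length, and the underscore variable names are introduced here for the code only.

```r
# Estimates of the reduced model from the summary table
coefs <- c(intercept = 33.5095, sex_male = -1.4207, skull_width = -0.2787,
           total_length = 0.5687, tail_length = -1.8057)

# Predictors with a positive association, controlling for the others
positive <- names(coefs)[-1][coefs[-1] > 0]

# The zoo possum: male, skull 63 mm wide, total length 83 cm, tail 37 cm
possum <- c(intercept = 1, sex_male = 1, skull_width = 63,
            total_length = 83, tail_length = 37)

# Log-odds from the linear predictor, then the inverse logit for the probability
log_odds <- sum(coefs * possum[names(coefs)])
p_victoria <- plogis(log_odds)

positive
round(log_odds, 4)
round(p_victoria, 4)
```

The computed probability of being from Victoria is well below 1%. How far to trust it is a separate question: the model was fit to 104 wild possums, and nothing here checks how well it classifies possums it has not seen.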