DATA606_Chapter 8 - Multiple and Logistic Regression

Chapter 8 - Multiple and Logistic Regression

Practice: 8.1, 8.3, 8.7, 8.15, 8.17

8.5.1 Introduction to multiple regression

8.1 Baby weights, Part I. The Child Health and Development Studies investigate a range of topics. One study considered all pregnancies between 1960 and 1967 among women in the Kaiser Foundation Health Plan in the San Francisco East Bay area. Here, we study the relationship between smoking and weight of the baby. The variable smoke is coded 1 if the mother is a smoker, and 0 if not. The summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in ounces, based on the smoking status of the mother.

	Estimate	Std. Error	t value	Pr(>|t|)
(Intercept)	123.05	0.65	189.60	0
smoke	-8.94	1.03	-8.65	0

The variability within the smokers and non-smokers are about equal and the distributions are symmetric. With these conditions satisfied, it is reasonable to apply the model. (Note that we don’t need to check linearity since the predictor has only two levels.)

Write the equation of the regression line.

\[\hat{bodyweight} = 123.05 - 8.94 \times smoke\]

Interpret the slope in this context, and calculate the predicted birth weight of babies born to smoker and non-smoker mothers.

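The slope is the estimated difference in average birth weight between babies born to smokers and babies born to non-smokers: babies of mothers who smoke are predicted to weigh 8.94 ounces less, on average, than babies of mothers who do not smoke.

Predicted birth weight of babies born to smokers: \[\hat{bodyweight}_{smoker} = 123.05 - 8.94 \times 1 = 114.11 \text{ ounces}\]

Predicted birth weight of babies born to non-smokers: \[\hat{bodyweight}_{non\text{-}smoker} = 123.05 - 8.94 \times 0 = 123.05 \text{ ounces}\]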
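
The two predictions can be checked with a minimal Python sketch. The coefficients are copied from the summary table; the function name `predict_birth_weight` is a hypothetical helper introduced only for this illustration.

```python
# Coefficients copied from the summary table above (birth weight in ounces).
INTERCEPT = 123.05
SLOPE_SMOKE = -8.94


def predict_birth_weight(smoke: int) -> float:
    """Predicted average birth weight (oz); smoke is 1 for smokers, 0 otherwise."""
    return INTERCEPT + SLOPE_SMOKE * smoke


smoker = predict_birth_weight(1)
non_smoker = predict_birth_weight(0)

print(f"smoker: {smoker:.2f} oz, non-smoker: {non_smoker:.2f} oz")
# The difference between the two predictions equals the slope.
print(f"difference: {smoker - non_smoker:.2f} oz")
```

Because smoke only takes the values 0 and 1, the difference between the two predictions is the slope itself.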

Is there a statistically significant relationship between the average birth weight and smoking?

The hypotheses are \(H_0: \beta_1 = 0\) (the true difference in average birth weight between babies born to smokers and non-smokers is zero) and \(H_A: \beta_1 \neq 0\) (the difference is not zero).

The p-value in the table corresponds to exactly this two-sided test, and it is approximately 0. Because the p-value is so small, we reject the null hypothesis and conclude that there is a statistically significant relationship between birth weight and smoking: babies born to smokers weigh less on average.
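
To see why the reported p-value rounds to 0, the test statistic can be recomputed from the rounded table entries. This is only a sketch: it assumes a standard normal approximation to the t distribution, which is reasonable only if the sample is large, and it uses the rounded estimate and standard error, so the statistic differs slightly from the -8.65 shown in the table.

```python
import math

estimate = -8.94   # slope for smoke, from the summary table
std_error = 1.03   # its standard error, from the summary table

# Test statistic for H0: slope = 0. The rounded inputs give about -8.68,
# while the table reports -8.65 (presumably from unrounded values).
t_stat = estimate / std_error

# Two-sided p-value under a standard normal approximation (an assumption):
# 2 * (1 - Phi(|t|)) equals erfc(|t| / sqrt(2)).
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

print(f"t = {t_stat:.2f}, two-sided p = {p_value:.2e}")
```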

8.3 Baby weights, Part III. We considered the variables smoke and parity, one at a time, in modeling birth weights of babies in Exercises 8.1 and 8.2. A more realistic approach to modeling infant weights is to consider all possibly related variables at once. Other variables of interest include length of pregnancy in days (gestation), mother’s age in years (age), mother’s height in inches (height), and mother’s pregnancy weight in pounds (weight). Below are three observations from this data set.

The summary table below shows the results of a regression model for predicting the average birth weight of babies based on all of the variables included in the data set.

Write the equation of the regression line that includes all of the variables.

\[\hat{bodyweight} = -80.41 + 0.44 \times gestation - 3.33 \times parity - 0.01 \times age + 1.15 \times height + 0.05 \times weight - 8.40 \times smoke\]

Interpret the slopes of gestation and age in this context.

The slope of gestation means that, holding the other variables constant, the predicted birth weight increases by 0.44 ounces for each additional day of pregnancy. The slope of age means that, holding the other variables constant, the predicted birth weight decreases by 0.01 ounces for each additional year of the mother’s age.

The coefficient for parity is different than in the linear model shown in Exercise 8.2. Why might there be a difference?

Exercise 8.2 considers only one variable, parity. The coefficient in Exercise 8.3 is the estimate obtained when the other variables are also accounted for in the multiple regression model. The change in the coefficient of parity may be due to collinearity between parity and the other predictor variables.

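Calculate the residual for the first observation in the data set.

For the first observation, the birth weight is 120 ounces, with gestation = 284, parity = 0, age = 27, height = 62, weight = 100, and smoke = 0.

\[e_i = y_i - \hat{y}_i = 120 - (-80.41 + 0.44 \times 284 - 3.33 \times 0 - 0.01 \times 27 + 1.15 \times 62 + 0.05 \times 100 - 8.40 \times 0) = 120 - 120.58 = -0.58\]

The model overpredicts this baby's birth weight by 0.58 ounces.
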
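
The residual is the observed value minus the predicted value, and the calculation can be sketched in Python as a check. The coefficients and the observation's values are taken from the expression above; the dictionary layout and the `predict` helper are assumptions made for this illustration.

```python
# Coefficients as written in the regression equation above.
INTERCEPT = -80.41
COEFFICIENTS = {
    "gestation": 0.44,
    "parity": -3.33,
    "age": -0.01,
    "height": 1.15,
    "weight": 0.05,
    "smoke": -8.40,
}

# First observation, with values taken from the calculation above.
first_obs = {"gestation": 284, "parity": 0, "age": 27,
             "height": 62, "weight": 100, "smoke": 0}
observed_bwt = 120  # observed birth weight in ounces


def predict(obs: dict) -> float:
    """Predicted birth weight (oz) for one observation."""
    return INTERCEPT + sum(COEFFICIENTS[name] * obs[name] for name in COEFFICIENTS)


predicted = predict(first_obs)
residual = observed_bwt - predicted  # observed minus predicted

print(f"predicted = {predicted:.2f}, residual = {residual:.2f}")
```

A negative residual means the model's prediction is higher than the observed birth weight.
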
The variance of the residuals is 249.28, and the variance of the birth weights of all babies in the data set is 332.57. Calculate the R2 and the adjusted R2. Note that there are 1,236 observations in the data set.

\[R^2 = 1-\frac{\mathrm{Var}(e_i)}{\mathrm{Var}(y_i)} = 1-\frac{249.28}{332.57}=0.2504\]

\[R^2_{adj} = 1-\frac{\mathrm{Var}(e_i)}{\mathrm{Var}(y_i)}\times \frac{n-1}{n-k-1} = 1-\frac{249.28}{332.57} \times \frac{1236-1}{1236-6-1}=0.2468\]
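
The same two quantities can be reproduced with a few lines of Python. This is a sketch using only the numbers given in the exercise; the variable names are chosen for this illustration.

```python
var_residuals = 249.28   # variance of the residuals
var_outcome = 332.57     # variance of the birth weights
n = 1236                 # number of observations
k = 6                    # number of predictors in the full model

r_squared = 1 - var_residuals / var_outcome
# The adjustment inflates the unexplained share by (n - 1) / (n - k - 1).
adj_r_squared = 1 - (var_residuals / var_outcome) * (n - 1) / (n - k - 1)

print(f"R^2 = {r_squared:.4f}, adjusted R^2 = {adj_r_squared:.4f}")
```

With 1,236 observations and only 6 predictors, the penalty factor is close to 1, so the adjusted value is only slightly smaller than the unadjusted one.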

8.5.2 Model selection

8.7 Baby weights, Part IV. Exercise 8.3 considers a model that predicts a newborn’s weight using several predictors (gestation length, parity, age of mother, height of mother, weight of mother, smoking status of mother). The table below shows the adjusted R-squared for the full model as well as adjusted R-squared values for all models we evaluate in the first step of the backwards elimination process.

	Model	Adjusted R2
1	Full model	0.2541
2	No gestation	0.1031
3	No parity	0.2492
4	No age	0.2547
5	No height	0.2311
6	No weight	0.2536
7	No smoking status	0.2072

Which, if any, variable should be removed from the model first?

The fourth model, without age, has the highest adjusted \(R^2\), 0.2547, which is higher than the adjusted \(R^2\) of the full model, 0.2541. Because eliminating age leads to a model with a higher adjusted \(R^2\), we drop age from the model first.
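
The first step of backward elimination with adjusted \(R^2\) can be sketched as follows. The values come from the table above; the function `first_variable_to_drop` is a hypothetical helper written for this illustration, not part of any library.

```python
# Adjusted R^2 values from the table above.
FULL_MODEL_ADJ_R2 = 0.2541
ADJ_R2_WITHOUT = {
    "gestation": 0.1031,
    "parity": 0.2492,
    "age": 0.2547,
    "height": 0.2311,
    "weight": 0.2536,
    "smoking status": 0.2072,
}


def first_variable_to_drop(full_adj_r2, adj_r2_without):
    """Return the variable whose removal gives the largest adjusted R^2,
    or None if no removal improves on the full model."""
    candidate = max(adj_r2_without, key=adj_r2_without.get)
    if adj_r2_without[candidate] > full_adj_r2:
        return candidate
    return None


to_drop = first_variable_to_drop(FULL_MODEL_ADJ_R2, ADJ_R2_WITHOUT)
print(to_drop)
```

In a full backward elimination, this step would be repeated on the reduced model until no removal increases the adjusted \(R^2\).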

8.5.4 Introduction to logistic regression

8.15 Possum classification, Part I. The common brushtail possum of the Australia region is a bit cuter than its distant cousin, the American opossum (see Figure 7.5 on page 334). We consider 104 brushtail possums from two regions in Australia, where the possums may be considered a random sample from the population. The first region is Victoria, which is in the eastern half of Australia and traverses the southern coast. The second region consists of New South Wales and Queensland, which make up eastern and northeastern Australia.

We use logistic regression to differentiate between possums in these two regions. The outcome variable, called population, takes value 1 when a possum is from Victoria and 0 when it is from New South Wales or Queensland. We consider five predictors: sex male (an indicator for a possum being male), head length, skull width, total length, and tail length. Each variable is summarized in a histogram. The full logistic regression model and a reduced model after variable selection are summarized in the table.

[Figures: Australia possum]

Examine each of the predictors. Are there any outliers that are likely to have a very large influence on the logistic regression model?

There are several outliers at both the lower and higher ends of head_length, at the higher end of skull_width, and at the lower end of total_length. These outliers are unlikely to have a large influence on the logistic regression model because the sample size is large enough.

The summary table for the full model indicates that at least one variable should be eliminated when using the p-value approach for variable selection: head length. The second component of the table summarizes the reduced model following variable selection. Explain why the remaining estimates change between the two models.

Removing the variable head_length changed the remaining estimates because head_length is correlated with other variables in the model. For example, the p-values of sex_male and skull_width both decreased, suggesting that a possum's head length may be related to its sex and skull width. Most likely, head length is correlated with skull width.

8.17 Possum classification, Part II. A logistic regression model was proposed for classifying common brushtail possums into their two regions in Exercise 8.15. The outcome variable took value 1 if the possum was from Victoria and 0 otherwise.

	Estimate	SE	Z	Pr(>|Z|)
(Intercept)	33.5095	9.9053	3.38	0.0007
sex male	-1.4207	0.6457	-2.20	0.0278
skull width	-0.2787	0.1226	-2.27	0.0231
total length	0.5687	0.1322	4.30	0.0000
tail length	-1.8057	0.3599	-5.02	0.0000

Write out the form of the model. Also identify which of the variables are positively associated when controlling for other variables.

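The logistic model relating the probability \(\hat{p}_i\) that a possum is from Victoria to the predictors sex_male (\(x_{1,i}\)), skull_width (\(x_{2,i}\)), total_length (\(x_{3,i}\)), and tail_length (\(x_{4,i}\)) is:

\[\log\left(\frac{\hat{p}_i}{1-\hat{p}_i}\right) = 33.5095 - 1.4207\times x_{1,i} - 0.2787\times x_{2,i} + 0.5687\times x_{3,i} - 1.8057\times x_{4,i}\]

Only total length is positively associated with the outcome (a possum being from Victoria) when controlling for the other variables; it is the only predictor with a positive coefficient.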

Suppose we see a brushtail possum at a zoo in the US, and a sign says the possum had been captured in the wild in Australia, but it doesn’t say which part of Australia. However, the sign does indicate that the possum is male, its skull is about 63 mm wide, its tail is 37 cm long, and its total length is 83 cm. What is the reduced model’s computed probability that this possum is from Victoria? How confident are you in the model’s accuracy of this probability calculation?

\[p_i= \frac{e^{\beta_0 + \beta_1\times x_{1,i} + \beta_2\times x_{2,i} + \beta_3\times x_{3,i} + \beta_4\times x_{4,i}}} {1 + e^{\beta_0 + \beta_1\times x_{1,i} + \beta_2\times x_{2,i} + \beta_3\times x_{3,i} + \beta_4\times x_{4,i}}}\]

\[\beta_0 + \beta_1\times x_{1,i} + \beta_2\times x_{2,i} + \beta_3\times x_{3,i} + \beta_4\times x_{4,i} = 33.5095 - 1.4207\times 1 - 0.2787\times 63 + 0.5687\times 83 - 1.8057\times 37 = -5.0781\]

\[p_i= \frac{e^{-5.0781}} {1 + e^{-5.0781}} = 0.0062\]
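
The probability calculation can be checked with a short Python sketch. The coefficients are those of the reduced model in the table above and the possum's measurements come from the question; the dictionary keys are names chosen for this illustration.

```python
import math

# Reduced-model coefficients from the summary table above.
INTERCEPT = 33.5095
COEFFICIENTS = {
    "sex_male": -1.4207,
    "skull_width": -0.2787,
    "total_length": 0.5687,
    "tail_length": -1.8057,
}

# The zoo possum described in the question:
# male, skull 63 mm wide, total length 83 cm, tail 37 cm.
possum = {"sex_male": 1, "skull_width": 63, "total_length": 83, "tail_length": 37}

# Linear predictor (log-odds of being from Victoria).
log_odds = INTERCEPT + sum(COEFFICIENTS[name] * possum[name] for name in COEFFICIENTS)

# Inverse logit: algebraically the same as e^x / (1 + e^x).
probability = 1 / (1 + math.exp(-log_odds))

print(f"log-odds = {log_odds:.4f}, P(Victoria) = {probability:.4f}")
```

A negative log-odds corresponds to a probability below 0.5, which matches the small probability computed here.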

The reduced model's computed probability that this possum is from Victoria is 0.0062, which is very low. Because of this very low probability, I would tend to believe that the possum is not from Victoria, but there are additional factors to consider. For example, the possum might have a mutation that makes it much smaller than normal, or it might have been caught at a young age and, under the pampering of the zoo, grown much larger than average.


Yun Mai

May 4, 2017
