7.6 Husbands and wives, Part I. The Great Britain Office of Population Census and Surveys once collected data on a random sample of 170 married couples in Britain, recording the age (in years) and heights (converted here to inches) of the husbands and wives. The scatterplot on the left shows the wife’s age plotted against her husband’s age, and the plot on the right shows wife’s height plotted against husband’s height.

  1. Describe the relationship between husbands’ and wives’ ages.

The husband’s age predicts his wife’s age well: the relationship between the two variables is positive, linear, and strong.

  1. Describe the relationship between husbands’ and wives’ heights.

Husbands’ and wives’ heights appear to be nearly independent: one does little to predict the other, and the correlation between them is very weak.

  1. Which plot shows a stronger correlation? Explain your reasoning.

The first plot (ages) shows the stronger correlation, close to 1: the points vary little around a line with a positive slope, so the husband’s age predicts the wife’s age well.

  1. Data on heights were originally collected in centimeters, and then converted to inches. Does this conversion affect the correlation between husbands’ and wives’ heights?

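No, the correlation will remain the same. Correlation is unitless, and converting centimeters to inches only rescales each height by a positive constant, which does not change it.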
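
A quick way to check this is to simulate some heights and compare the correlation before and after conversion. This is only a sketch: the vectors `husband_cm` and `wife_cm` are made-up data, not the survey data, and the parameters used to generate them are arbitrary.

```r
# Hypothetical heights in centimeters; simulated, not the survey data.
set.seed(1)
husband_cm <- rnorm(170, mean = 175, sd = 7)
wife_cm    <- 0.3 * husband_cm + rnorm(170, mean = 110, sd = 6)

# Convert both variables to inches (1 inch = 2.54 cm).
husband_in <- husband_cm / 2.54
wife_in    <- wife_cm / 2.54

r_cm <- cor(husband_cm, wife_cm)
r_in <- cor(husband_in, wife_in)

# TRUE: the correlation is the same in either unit (up to floating-point error).
isTRUE(all.equal(r_cm, r_in))
```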

7.12 Trees. The scatterplots below show the relationship between height, diameter, and volume of timber in 31 felled black cherry trees. The diameter of the tree is measured 4.5 feet above the ground.

  1. Describe the relationship between volume and height of these trees.

The points are widely scattered, so the relationship between volume and height is weak: there is a lot of variation in volume at any given height, and height alone predicts volume poorly.

  1. Describe the relationship between volume and diameter of these trees.

There appears to be a small cluster of trees close together, followed by a second group that follows a clear positive trend. The points in the first cluster may be worth investigating as potentially influential, while the second group shows a stronger positive correlation.

  1. Suppose you have height and diameter measurements for another black cherry tree. Which of these variables would be preferable to use to predict the volume of timber in this tree using a simple linear regression model? Explain your reasoning.

Diameter is the better predictor: it has a much stronger correlation with volume than height does, so I would use diameter in the simple linear regression model.

7.18 Correlation, Part II. What would be the correlation between the annual salaries of males and females at a company if for a certain type of position men always made

  1. $5,000 more than women?

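The correlation would be exactly 1. Adding a constant to every woman’s salary shifts the points but leaves them on a straight line with positive slope.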

  1. 25% more than women?

The correlation would be exactly 1. Men’s salary = 1.25 × women’s salary is a straight line with positive slope.

  1. 15% less than women?

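The correlation would still be exactly 1, not negative. Men’s salary = 0.85 × women’s salary is a straight line with positive slope: men earn less, but the two salaries still rise together.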
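
The three cases can be checked numerically. The sketch below uses a made-up vector of salaries (`women` is hypothetical, chosen only for illustration) and applies each rule to it.

```r
# Hypothetical women's salaries for one type of position (illustrative only).
women <- c(42000, 48000, 55000, 61000, 70000)

men_plus_5000 <- women + 5000   # (a) $5,000 more than women
men_plus_25   <- 1.25 * women   # (b) 25% more than women
men_minus_15  <- 0.85 * women   # (c) 15% less than women

# Each correlation equals 1 (up to floating-point error).
r_all <- c(cor(women, men_plus_5000),
           cor(women, men_plus_25),
           cor(women, men_minus_15))
r_all
```

Any exact linear rule with a positive multiplier puts the points on a line with positive slope, which is why all three cases give the same correlation.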

7.24 Nutrition at Starbucks, Part I. The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.

  1. Describe the relationship between number of calories and amount of carbohydrates (in grams) that Starbucks food menu items contain.

There is a positive, roughly linear relationship, but the variability in carbohydrates increases as the calorie count increases.

  1. In this scenario, what are the explanatory and response variables?

Calories are the explanatory variable and carbohydrates are the response variable.

  1. Why might we want to fit a regression line to these data?

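A regression line would let us predict the amount of carbohydrates in a menu item from its calorie count, which is the only figure Starbucks lists on the display items, and it would show how well calories predict carbs.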

  1. Do these data meet the conditions required for fitting a least squares line?

The data show a weak but roughly linear relationship. However, the variability of the points increases as calories increase, so the constant-variability condition is not met. I cannot tell from the plot whether the observations are independent of one another.

7.30 Cats, Part I. The following regression output is for predicting the heart weight (in g) of cats from their body weight (in kg). The coefficients are estimated using a dataset of 144 domestic cats.

cats<- matrix(c(-0.357,0.692,-0.515,0.607,4.034,0.250,16.119,0.000), byrow=TRUE, nrow=2)
colnames(cats)= c('Estimates', 'Std Error','T value', 'Pr(>|t|)')
rownames(cats)= c('Intercept','body wt')

cats
##           Estimates Std Error T value Pr(>|t|)
## Intercept    -0.357     0.692  -0.515    0.607
## body wt       4.034     0.250  16.119    0.000
  1. Write out the linear model.

HeartWt = -0.357 + 4.034 * BodyWt

where HeartWt is the predicted heart weight (in g) and BodyWt is the body weight (in kg).

  1. Interpret the intercept.

The intercept says that a cat with a body weight of 0 kg is expected to have a heart weight of -0.357 g. This is not a meaningful value; the intercept only serves to adjust the height of the line.

  1. Interpret the slope.

For each additional kilogram of body weight, the model predicts that the cat’s heart weight increases by 4.034 grams, on average.
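
To make the intercept and slope concrete, the fitted line can be wrapped in a small helper. This is a sketch: `predict_heart_wt` is a hypothetical name introduced here, and the coefficients are taken from the regression table above (heart weight in g, body weight in kg).

```r
# Hypothetical helper: predicted heart weight (g) from body weight (kg),
# using the coefficients in the regression table.
predict_heart_wt <- function(body_wt_kg) {
  -0.357 + 4.034 * body_wt_kg
}

# Two cats whose body weights differ by exactly 1 kg.
pred_3kg <- predict_heart_wt(3)
pred_2kg <- predict_heart_wt(2)

# The difference between the two predictions equals the slope.
slope_check <- pred_3kg - pred_2kg
slope_check
```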

  1. Interpret R2.

R2 is the proportion of the variability in the response that is explained by the model. In this case, body weight explains about 64.66% of the variability in the cats’ heart weights.

  1. Calculate the correlation coefficient.
R2<- .6466
round(sqrt(R2),4)
## [1] 0.8041

7.36 Beer and blood alcohol content. Many people believe that gender, weight, drinking habits, and many other factors are much more important in predicting blood alcohol content (BAC) than simply considering the number of drinks a person consumed. Here we examine data from sixteen student volunteers at Ohio State University who each drank a randomly assigned number of cans of beer. These students were evenly divided between men and women, and they differed in weight and drinking habits. Thirty minutes later, a police officer measured their blood alcohol content (BAC) in grams of alcohol per deciliter of blood. The scatterplot and regression table summarize the findings.

bac<- matrix(c(-0.0127,0.0126,-1.00,0.3320,0.0180,0.0024,7.48,0.0000), byrow= T, nrow=2)
colnames(bac)=c('Estimate','Std. Error','t value','Pr(>|t|)')
rownames(bac)=c('(Intercept)','beers')
bac
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.0127     0.0126   -1.00    0.332
## beers         0.0180     0.0024    7.48    0.000
  1. Describe the relationship between the number of cans of beer and BAC.

There is a positive, linear, and fairly strong relationship between the number of cans of beer and BAC: the more cans a student drank, the higher the BAC tended to be.

  1. Write the equation of the regression line. Interpret the slope and intercept in context.
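
BAC = -0.0127 + 0.0180 * beers

Slope: for each additional can of beer consumed, the model predicts that BAC increases by 0.0180 grams per deciliter, on average.

Intercept: a student who has consumed 0 cans of beer is predicted to have a BAC of -0.0127. A negative BAC is impossible, so the intercept has no meaningful interpretation here; it only sets the height of the line.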

  1. Do the data provide strong evidence that drinking more cans of beer is associated with an increase in blood alcohol? State the null and alternative hypotheses, report the p-value, and state your conclusion.

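H0: β1 = 0. The true slope is zero; the number of cans of beer is not associated with BAC.

HA: β1 > 0. The true slope is positive; drinking more cans of beer is associated with a higher BAC.

The slope has t = 7.48 and the reported p-value is approximately 0. Since the p-value is far below 0.05, we reject the null hypothesis: the data provide strong evidence that drinking more cans of beer is associated with an increase in BAC.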
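
The regression table rounds the p-value to 0.000. As a rough check, it can be recovered from the t statistic, assuming the usual n - 2 = 14 degrees of freedom for a simple regression on sixteen students.

```r
t_stat   <- 7.48     # t value for beers, from the regression table
deg_free <- 16 - 2   # sixteen students, two estimated coefficients

# One-sided p-value for a positive slope.
p_one_sided <- pt(t_stat, df = deg_free, lower.tail = FALSE)
# Two-sided p-value, which is what the Pr(>|t|) column reports.
p_two_sided <- 2 * p_one_sided

c(p_one_sided, p_two_sided)
```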

  1. The correlation coefficient for number of cans of beer and BAC is 0.89. Calculate R2 and interpret it in context.
R2<- .89^2
R2
## [1] 0.7921

R2 indicates that about 79% of the variation in BAC is explained by the number of cans of beer consumed.

  1. Suppose we visit a bar, ask people how many drinks they have had, and also take their BAC. Do you think the relationship between number of drinks and BAC would be as strong as the relationship found in the Ohio State study?

Probably not; the relationship would likely be weaker. A bar is not a controlled experiment: we would not know over how much time the drinks were consumed, the drinks would not be standardized and would differ in alcohol concentration, and the number of drinks would be self-reported.

7.42 Babies. Is the gestational age (time between conception and birth) of a low birth-weight baby useful in predicting head circumference at birth? Twenty-five low birth-weight babies were studied at a Harvard teaching hospital; the investigators calculated the regression of head circumference (measured in centimeters) against gestational age (measured in weeks). The estimated regression line is head circumference = 3.91 + 0.78 × gestational age

  1. What is the predicted head circumference for a baby whose gestational age is 28 weeks?
gest_age<- 28
head_circumference<- 3.91 + .78 * gest_age
head_circumference
## [1] 25.75
  1. The standard error for the coefficient of gestational age is 0.35, which is associated with df = 23. Does the model provide strong evidence that gestational age is significantly associated with head circumference?

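The test statistic is T = 0.78 / 0.35 = 2.23 with df = 23. The two-sided p-value is between 0.02 and 0.05, so at the 5% significance level the model provides evidence that gestational age is significantly associated with head circumference. Note that 0.78 is the slope of the regression line, not the correlation coefficient.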
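
The test for the slope can be sketched in a few lines, using the slope from the estimated regression line and the standard error and degrees of freedom given in the exercise. The variable names are introduced here for illustration.

```r
slope    <- 0.78   # estimated coefficient of gestational age
se_slope <- 0.35   # its standard error
deg_free <- 23     # degrees of freedom given in the exercise

# Test statistic for H0: slope = 0, and its two-sided p-value.
t_stat  <- slope / se_slope
p_value <- 2 * pt(abs(t_stat), df = deg_free, lower.tail = FALSE)

round(t_stat, 2)
p_value
```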

8.2 Baby weights, Part II. Exercise 8.1 introduces a data set on birth weight of babies. Another variable we consider is parity, which is 0 if the child is the first born, and 1 otherwise. The summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in ounces, from parity.

bw<- matrix(c(120.07,0.60,199.94,0.0000,-1.93,1.19,-1.62,0.1052), byrow= T, nrow=2)
colnames(bw)=c('Estimate','Std. Error','t value','Pr(>|t|)')
rownames(bw)=c('(Intercept)','parity')
bw
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   120.07       0.60  199.94   0.0000
## parity         -1.93       1.19   -1.62   0.1052
  1. Write the equation of the regression line.

BabyWt = 120.07 - 1.93 * parity

  1. Interpret the slope in this context, and calculate the predicted birth weight of first born and others.

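The slope says that babies who are not first born (parity = 1) are predicted to weigh 1.93 ounces less, on average, than first-born babies (parity = 0). The predicted birth weight of a first-born baby is 120.07 ounces, and the predicted birth weight of other babies is 120.07 - 1.93 = 118.14 ounces.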

  1. Is there a statistically significant relationship between the average birth weight and parity?

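The p-value for parity is 0.1052, which is greater than 0.05, so we fail to reject the null hypothesis of no relationship. The data do not provide convincing evidence of a statistically significant relationship between average birth weight and parity.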

8.4 Absenteeism, Part I. Researchers interested in the relationship between absenteeism from school and certain demographic characteristics of children collected data from 146 randomly sampled students in rural New South Wales, Australia, in a particular school year. Below are three observations from this data set.

The summary table below shows the results of a linear regression model for predicting the average number of days absent based on ethnic background (eth: 0 - aboriginal, 1 - not aboriginal), sex (sex: 0 - female, 1 - male), and learner status (lrn: 0 - average learner, 1 - slow learner).

eb<- matrix(c(18.93,2.57,7.37,0.0000,-9.11,2.60,-3.51,0.0000,3.10,2.64,1.18,0.2411,2.15,2.65,0.81,0.4177), byrow= T, nrow=4)
colnames(eb)=c('Estimate','Std. Error','t value','Pr(>|t|)')
rownames(eb)=c('(Intercept)','eth','sex','lrn')
eb
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    18.93       2.57    7.37   0.0000
## eth            -9.11       2.60   -3.51   0.0000
## sex             3.10       2.64    1.18   0.2411
## lrn             2.15       2.65    0.81   0.4177
  1. Write the equation of the regression line.

avg_days_absent = 18.93 - 9.11 * eth + 3.10 * sex + 2.15 * lrn

  1. Interpret each one of the slopes in this context.

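eth: holding sex and learner status constant, non-aboriginal students (1) are predicted to miss 9.11 fewer days on average than aboriginal students (0).

sex: holding ethnic background and learner status constant, male students (1) are predicted to miss 3.10 more days on average than female students (0).

lrn: holding ethnic background and sex constant, slow learners (1) are predicted to miss 2.15 more days on average than average learners (0).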

  1. Calculate the residual for the first observation in the data set: a student who is aboriginal, male, a slow learner, and missed 2 days of school.
eth<- 0
sex<- 1
lrn<- 1
days_missed<- 2

avg_days_predicted<- 18.93 - 9.11 * eth + 3.10 * sex + 2.15 * lrn
avg_days_predicted
## [1] 24.18
residual<- days_missed - avg_days_predicted
residual
## [1] -22.18
  1. The variance of the residuals is 240.57, and the variance of the number of absent days for all students in the data set is 264.17. Calculate the R2 and the adjusted R2. Note that there are 146 observations in the data set.
n<- 146
k<- 3
varRes<- 240.57
varStudents<- 264.17
R2<- 1-(varRes / varStudents)
round(R2,4)
## [1] 0.0893
adjR2<- 1-(varRes/varStudents) * ((n-1)/(n-k-1))
round(adjR2,4)
## [1] 0.0701

8.6 Cherry trees. Timber yield is approximately equal to the volume of a tree, however, this value is difficult to measure without first cutting the tree down. Instead, other variables, such as height and diameter, may be used to predict a tree’s volume and yield. Researchers wanting to understand the relationship between these variables for black cherry trees collected data from 31 such trees in the Allegheny National Forest, Pennsylvania. Height is measured in feet, diameter in inches (at 54 inches above ground), and volume in cubic feet.

ct<- matrix(c(-57.99,8.64,-6.71,0.00,0.34,0.13,2.61,0.01,4.71,0.26,17.82,0.00), byrow= T, nrow=3)
colnames(ct)=c('Estimate','Std. Error','t value','Pr(>|t|)')
rownames(ct)=c('(Intercept)','ht','diameter')
ct
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -57.99       8.64   -6.71     0.00
## ht              0.34       0.13    2.61     0.01
## diameter        4.71       0.26   17.82     0.00
  1. Calculate a 95% confidence interval for the coefficient of height, and interpret it in the context of the data.

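The confidence interval is b1 ± t* × SE, where t* comes from the t distribution with n - k - 1 = 31 - 2 - 1 = 28 degrees of freedom (t* ≈ 2.05).

b1<- .34
t_star<- 2.05
se<- .13
lower.bound<- b1 - t_star * se
upper.bound<- b1 + t_star * se
CI<- c(lower.bound, upper.bound)
CI
## [1] 0.0735 0.6065

We are 95% confident that, holding diameter constant, each additional foot of height is associated with an increase of 0.07 to 0.61 cubic feet in volume, on average.
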
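
The critical value of 2.05 used in this interval can be checked with `qt`, assuming 31 - 2 - 1 = 28 residual degrees of freedom (31 trees, two predictors). The names `t_exact` and `ci_height` are introduced here for illustration.

```r
n <- 31   # trees in the sample
k <- 2    # predictors: height and diameter

# Critical value for a 95% interval: the 97.5th percentile of t with n - k - 1 df.
t_exact <- qt(0.975, df = n - k - 1)
round(t_exact, 2)

# Interval for the height coefficient using the unrounded critical value.
ci_height <- 0.34 + c(-1, 1) * t_exact * 0.13
round(ci_height, 2)
```
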
  1. One tree in this sample is 79 feet tall, has a diameter of 11.3 inches, and is 24.2 cubic feet in volume. Determine if the model overestimates or underestimates the volume of this tree, and by how much.
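actVol<- 24.2
ht<- 79
diam<- 11.3
estVol<- -57.99 + .34 * ht + 4.71 * diam
round(estVol,1)
## [1] 22.1
round(actVol - estVol,1)
## [1] 2.1

The predicted volume is below the actual volume, so the model underestimates the volume of this tree by about 2.1 cubic feet.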

8.8 Absenteeism, Part II. Exercise 8.4 considers a model that predicts the number of days absent using three predictors: ethnic background (eth), gender (sex), and learner status (lrn). The table below shows the adjusted R-squared for the model as well as adjusted R-squared values for all models we evaluate in the first step of the backwards elimination process.

Model<-c('Full model','No eth','No sex','No learner status')
AdjR2<-c(0.0701,-0.0033,0.0676,0.0723)  
abs<- data.frame(Model,AdjR2)
abs
##               Model   AdjR2
## 1        Full model  0.0701
## 2            No eth -0.0033
## 3            No sex  0.0676
## 4 No learner status  0.0723

Which, if any, variable should be removed from the model first?
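
One way to read the table is sketched below, assuming the usual backward-elimination rule for adjusted R2: drop the variable whose removal gives the largest adjusted R2, provided that value beats the full model. The data frame name `elim` is introduced here for illustration.

```r
elim <- data.frame(
  Model = c('Full model', 'No eth', 'No sex', 'No learner status'),
  AdjR2 = c(0.0701, -0.0033, 0.0676, 0.0723),
  stringsAsFactors = FALSE
)

full_adjR2 <- elim$AdjR2[elim$Model == 'Full model']
candidates <- elim[elim$Model != 'Full model', ]
best       <- candidates[which.max(candidates$AdjR2), ]

# Only drop a variable if doing so improves on the full model.
decision <- if (best$AdjR2 > full_adjR2) best$Model else 'Keep the full model'
decision
```

Under that rule, learner status is the variable to remove first, since dropping it raises the adjusted R2 from 0.0701 to 0.0723.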

8.10 Absenteeism, Part III. Exercise 8.4 provides regression output for the full model, including all explanatory variables available in the data set, for predicting the number of days absent from school. In this exercise we consider a forward-selection algorithm and add variables to the model one-at-a-time. The table below shows the p-value and adjusted R2 of each model where we include only the corresponding predictor. Based on this table, which variable should be added to the model first?

abs1<- matrix(c(0.0007,0.3142,0.5870,0.0714,0.0001,0), byrow= T, nrow=2)
colnames(abs1)=c('ethnicity','sex','learner status')
rownames(abs1)=c('p-value','R2adj')
abs1
##         ethnicity    sex learner status
## p-value    0.0007 0.3142          0.587
## R2adj      0.0714 0.0001          0.000
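
A similar sketch for the first forward-selection step, assuming the usual rule of adding the predictor with the highest adjusted R2 (here the same choice as the smallest p-value). The object name `fwd` is introduced for illustration.

```r
fwd <- matrix(c(0.0007, 0.3142, 0.5870,
                0.0714, 0.0001, 0),
              byrow = TRUE, nrow = 2,
              dimnames = list(c('p-value', 'R2adj'),
                              c('ethnicity', 'sex', 'learner status')))

# Predictor with the highest adjusted R2, and with the smallest p-value.
by_adjR2  <- names(which.max(fwd['R2adj', ]))
by_pvalue <- names(which.min(fwd['p-value', ]))
c(by_adjR2, by_pvalue)
```

Both criteria point to ethnicity as the first variable to add.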

8.12 Movie lovers, Part II. Suppose an online media streaming company is interested in building a movie recommendation system. The website maintains data on the movies in their database (genre, length, cast, director, budget, etc.) and additionally collects data from their subscribers (demographic information, previously watched movies, how they rated previously watched movies, etc.). The recommendation system will be deemed successful if subscribers actually watch, and rate highly, the movies recommended to them. Should the company use the adjusted R2 or the p-value approach in selecting variables for their recommendation system?

They should use the adjusted R2 approach. The company’s goal is prediction accuracy: the system succeeds if subscribers watch and highly rate the recommended movies, not if each predictor is statistically significant. The p-value approach is more appropriate when the goal is to understand which variables are significantly associated with the response.

8.14 GPA and IQ. A regression model for predicting GPA from gender and IQ was fit, and both predictors were found to be statistically significant. Using the plots given below, determine if this regression model is appropriate for these data.

Looking at the residuals against the fitted values, the points appear to be randomly scattered, with no apparent pattern or relationship.

8.18 Challenger disaster, Part II. Exercise 8.16 introduced us to O-rings that were identified as a plausible explanation for the breakup of the Challenger space shuttle 73 seconds into takeoff in 1986. The investigation found that the ambient temperature at the time of the shuttle launch was closely related to the damage of O-rings, which are a critical component of the shuttle. See this earlier exercise if you would like to browse the original data.

  1. The data provided in the previous exercise are shown in the plot. The logistic model fit to these data may be written as:

$$\log\left(\frac{\hat{p}}{1-\hat{p}}\right) = 11.6630 - 0.2162 \times \text{Temperature}$$

where $\hat{p}$ is the model-estimated probability that an O-ring will become damaged. Use the model to calculate the probability that an O-ring will become damaged at each of the following ambient temperatures: 51, 53, and 55 degrees Fahrenheit. The model-estimated probabilities for several additional ambient temperatures are provided below, where subscripts indicate the temperature:

$\hat{p}_{57} = 0.341$, $\hat{p}_{59} = 0.251$, $\hat{p}_{61} = 0.179$, $\hat{p}_{63} = 0.124$, $\hat{p}_{65} = 0.084$, $\hat{p}_{67} = 0.056$, $\hat{p}_{69} = 0.037$, $\hat{p}_{71} = 0.024$

temp51<- 51
temp53<- 53
temp55<- 55
  damagedOring51<- 11.6630 - 0.2162 * temp51  
  damagedOring53<- 11.6630 - 0.2162 * temp53 
  damagedOring55<- 11.6630 - 0.2162 * temp55 
round(phat51<- exp(damagedOring51)/(1+exp(damagedOring51)),2) 
## [1] 0.65
round(phat53<- exp(damagedOring53)/(1+exp(damagedOring53)),2)
## [1] 0.55
round(phat55<- exp(damagedOring55)/(1+exp(damagedOring55)),2)
## [1] 0.44

The probability the O-ring will be damaged at 51 degrees is 65%

The probability the O-ring will be damaged at 53 degrees is 55%

The probability the O-ring will be damaged at 55 degrees is 44%
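
As a cross-check, base R's `plogis` is the inverse of the logit, so it turns the linear predictor into a probability in one step. The sketch below (the names `temps` and `p_hat` are introduced here) recomputes the three probabilities and also reproduces the values the exercise supplies for 57 through 71 degrees.

```r
# Temperatures from 51 to 71 degrees Fahrenheit in steps of 2.
temps <- seq(51, 71, by = 2)

# plogis(x) = exp(x) / (1 + exp(x)), the inverse logit.
p_hat <- plogis(11.6630 - 0.2162 * temps)

data.frame(Temperature = temps, p_hat = round(p_hat, 3))
```

These points, joined by a smooth curve, give the model-estimated probabilities asked for in the next part.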

  1. Add the model-estimated probabilities from part (a) on the plot, then connect these dots using a smooth curve to represent the model-estimated probabilities.

  2. Describe any concerns you may have regarding applying logistic regression in this application, and note any assumptions that are required to accept the model’s validity.