7.6 Husbands and wives, Part I. The Great Britain Office of Population Census and Surveys once collected data on a random sample of 170 married couples in Britain, recording the age (in years) and heights (converted here to inches) of the husbands and wives. The scatterplot on the left shows the wife’s age plotted against her husband’s age, and the plot on the right shows wife’s height plotted against husband’s height.
The husband’s age appears to predict his wife’s age well: there is a strong, positive, linear relationship between the two variables.
The husband’s and wife’s heights show at most a very weak relationship; one is a poor predictor of the other.
The first plot (ages) shows the stronger correlation, close to 1: the observations have a positive direction and vary little around a straight line, so the x variable predicts the y variable well.
No, the correlation will remain the same: converting units of measurement is a linear rescaling, and correlation is unaffected by it.
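The claim that a unit conversion leaves the correlation unchanged can be checked numerically. The sketch below uses simulated heights (hypothetical values, not the survey data) and compares the correlation in centimeters with the correlation in inches; all variable names are illustrative.

```r
# Simulated (hypothetical) heights in centimeters; NOT the survey data.
set.seed(1)
husband_cm <- rnorm(170, mean = 175, sd = 7)
wife_cm <- 0.3 * husband_cm + rnorm(170, mean = 110, sd = 6)

# Convert both variables to inches (1 inch = 2.54 cm).
husband_in <- husband_cm / 2.54
wife_in <- wife_cm / 2.54

r_cm <- cor(husband_cm, wife_cm)
r_in <- cor(husband_in, wife_in)
all.equal(r_cm, r_in)  # the two correlations agree up to floating-point error
```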
7.12 Trees. The scatterplots below show the relationship between height, diameter, and volume of timber in 31 felled black cherry trees. The diameter of the tree is measured 4.5 feet above the ground.
The observations are widely scattered, which implies high variability and a weak correlation between the two variables, so one predicts the other poorly.
There appears to be a small cluster of closely spaced points followed by a larger group that trends in a positive direction. The points in the first cluster could be influential and are worth investigating; the second group shows a stronger positive correlation.
Diameter seems to be the better predictor because it has a strong correlation with volume, so I would prefer to use it.
7.18 Correlation, Part II. What would be the correlation between the annual salaries of males and females at a company if for a certain type of position men always made
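Perfect positive correlation (r = 1): a fixed difference in salary is an exact linear relationship with positive slope.
Perfect positive correlation (r = 1): a fixed percentage difference is also an exact linear relationship with positive slope.
Perfect positive correlation (r = 1), not negative: even when men always make a fixed percentage less, men’s salaries are a positive multiple of women’s salaries, so the slope is still positive.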
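A quick simulation shows how correlation behaves under deterministic pay rules of this kind. The salaries and the specific amounts below (5,000 more, 25% more, 15% less) are hypothetical values chosen for illustration, not data from the exercise.

```r
# Hypothetical salaries for women in one type of position (simulated, not real data).
set.seed(42)
women <- runif(50, min = 40000, max = 90000)

# Three illustrative deterministic rules for men's salaries (assumed values).
men_fixed_more <- women + 5000   # a fixed amount more
men_pct_more <- women * 1.25     # a fixed percentage more
men_pct_less <- women * 0.85     # a fixed percentage less

r_fixed_more <- cor(women, men_fixed_more)
r_pct_more <- cor(women, men_pct_more)
r_pct_less <- cor(women, men_pct_less)
c(r_fixed_more, r_pct_more, r_pct_less)
```

Each rule is an exact linear function of the women’s salary with a positive slope, so all three correlations come out at 1 (up to floating-point error).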
7.24 Nutrition at Starbucks, Part I. The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.
There is a positive correlation, but the residuals appear to show higher variation as the calorie count increases.
Calories are the explanatory variable and carbs are the response variable.
It would be interesting to fit a line to see how well calories can predict carbs.
The data show a weak but linear relationship. The variability of the residuals increases as calories increase, so the constant variability condition is not met. I am not able to discern whether the observations are independent of one another.
7.30 Cats, Part I. The following regression output is for predicting the heart weight (in g) of cats from their body weight (in kg). The coefficients are estimated using a dataset of 144 domestic cats.
cats<- matrix(c(-0.357,0.692,-0.515,0.607,4.034,0.250,16.119,0.000), byrow=TRUE, nrow=2)
colnames(cats)= c('Estimate','Std. Error','t value','Pr(>|t|)')
rownames(cats)= c('(Intercept)','body wt')
cats
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -0.357      0.692  -0.515    0.607
## body wt        4.034      0.250  16.119    0.000
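# Linear model: heart_wt = -0.357 + 4.034 * body_wt
slope<- 4.034
body_wt<- 1 # example: a cat weighing 1 kg
heart_wt<- -0.357 + slope * body_wt
heart_wt
## [1] 3.677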
The intercept says that a cat weighing 0 kg would have a heart that weighs -0.357 g. This is nonsensical; the intercept only serves to adjust the height of the line.
For each increase of 1 kg in the cat’s body weight, the model predicts the cat’s heart to weigh 4.034 g more, on average.
R2 is the proportion of variability in the response variable explained by the explanatory variable; in this case, 64.66% of the variation in heart weight is explained by the cat’s body weight.
R2<- .6466
round(sqrt(R2),4)
## [1] 0.8041
7.36 Beer and blood alcohol content. Many people believe that gender, weight, drinking habits, and many other factors are much more important in predicting blood alcohol content (BAC) than simply considering the number of drinks a person consumed. Here we examine data from sixteen student volunteers at Ohio State University who each drank a randomly assigned number of cans of beer. These students were evenly divided between men and women, and they differed in weight and drinking habits. Thirty minutes later, a police officer measured their blood alcohol content (BAC) in grams of alcohol per deciliter of blood. The scatterplot and regression table summarize the findings.
bac<- matrix(c(-0.0127,0.0126,-1.00,0.3320,0.0180,0.0024,7.48,0.0000), byrow= T, nrow=2)
colnames(bac)=c('Estimate','Std. Error','t value','Pr(>|t|)')
rownames(bac)=c('(Intercept)','beers')
bac
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.0127     0.0126   -1.00    0.332
## beers         0.0180     0.0024    7.48    0.000
There is a positive linear relationship between the variables. The number of cans of beer gives a good prediction of BAC.
# Linear model: BAC = -0.0127 + 0.0180 * beers
slope<- .018
beers<- 1 # example: one can of beer
BAC<- -0.0127 + slope * beers
BAC
## [1] 0.0053
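The slope implies that each additional can of beer is associated with an average increase of 0.018 in BAC. The intercept implies that a person who has consumed 0 cans of beer would have a BAC of -0.0127, which is not meaningful because BAC cannot be negative.
H0: The true slope is zero; the number of beers consumed is not associated with BAC. HA: The true slope is not zero; the number of beers consumed is associated with BAC.
Since the p-value is very low (approximately 0), we reject the null hypothesis in favor of the alternative, which states there is a relationship between the variables.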
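The test statistic and p-value behind this conclusion can be reproduced from the regression table. This is a minimal sketch that assumes the usual t-test for a slope with n - 2 degrees of freedom; the variable names are illustrative.

```r
# Values copied from the regression table above.
estimate <- 0.0180
std_error <- 0.0024
n <- 16                               # sixteen student volunteers

t_stat <- estimate / std_error        # differs slightly from 7.48 because the table is rounded
df <- n - 2                           # simple linear regression: intercept and slope
p_value <- 2 * pt(-abs(t_stat), df)   # two-sided p-value
t_stat
p_value
```

The p-value is far below 0.05, which matches the 0.000 shown in the table.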
R2<- .89^2
R2
## [1] 0.7921
The R2 statistic indicates that about 79% of the variation in BAC is explained by the number of beers consumed.
There are factors that could affect measurements taken in a bar. Because it would not be a controlled experiment, it is hard to say over how much time the drinks were consumed, and because the drinks would not be standardized, their alcohol concentrations would differ. The relationship would therefore likely be weaker.
7.42 Babies. Is the gestational age (time between conception and birth) of a low birth-weight baby useful in predicting head circumference at birth? Twenty-five low birth-weight babies were studied at a Harvard teaching hospital; the investigators calculated the regression of head circumference (measured in centimeters) against gestational age (measured in weeks). The estimated regression line is head circumference = 3.91 + 0.78 × gestational age
gest_age<- 28
head_circumference<- 3.91 + .78 * gest_age
head_circumference
## [1] 25.75
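The value 0.78 is the slope of the model, not the correlation coefficient: each additional week of gestational age is associated with an average increase of 0.78 cm in head circumference. The slope is positive, so the relationship is positive, but the slope alone does not measure the strength of the correlation.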
8.2 Baby weights, Part II. Exercise 8.1 introduces a data set on birth weight of babies. Another variable we consider is parity, which is 0 if the child is the first born, and 1 otherwise. The summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in ounces, from parity.
bw<- matrix(c(120.07,0.60,199.94,0.0000,-1.93,1.19,-1.62,0.1052), byrow= T, nrow=2)
colnames(bw)=c('Estimate','Std. Error','t value','Pr(>|t|)')
rownames(bw)=c('(Intercept)','parity')
bw
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   120.07       0.60  199.94   0.0000
## parity         -1.93       1.19   -1.62   0.1052
BabyWt= 120.07 - 1.93 * parity
Parity is an indicator variable, so the slope compares two groups: babies who are not first born (parity = 1) are predicted to weigh, on average, 1.93 ounces less than first-born babies (parity = 0).
The p-value for parity (0.1052) is greater than 0.05, so we fail to reject the null hypothesis that there is no relationship between average birth weight and parity.
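Because parity only takes the values 0 and 1, the model yields exactly two predicted weights. A short sketch using the coefficients from the table (the variable names are illustrative):

```r
# Coefficients from the regression table above.
intercept <- 120.07
slope_parity <- -1.93

# Parity is 0 for first-born babies and 1 otherwise.
parity <- c(first_born = 0, not_first_born = 1)
predicted_wt <- intercept + slope_parity * parity   # predicted weight in ounces
predicted_wt
```

The two predictions differ by the slope, 1.93 ounces.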
8.4 Absenteeism, Part I. Researchers interested in the relationship between absenteeism from school and certain demographic characteristics of children collected data from 146 randomly sampled students in rural New South Wales, Australia, in a particular school year. Below are three observations from this data set.
The summary table below shows the results of a linear regression model for predicting the average number of days absent based on ethnic background (eth: 0 - aboriginal, 1 - not aboriginal), sex (sex: 0 - female, 1 - male), and learner status (lrn: 0 - average learner, 1 - slow learner).
eb<- matrix(c(18.93,2.57,7.37,0.0000,-9.11,2.60,-3.51,0.0000,3.10,2.64,1.18,0.2411,2.15,2.65,0.81,0.4177), byrow= T, nrow=4)
colnames(eb)=c('Estimate','Std. Error','t value','Pr(>|t|)')
rownames(eb)=c('(Intercept)','eth','sex','lrn')
eb
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    18.93       2.57    7.37   0.0000
## eth            -9.11       2.60   -3.51   0.0000
## sex             3.10       2.64    1.18   0.2411
## lrn             2.15       2.65    0.81   0.4177
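avg_days_absent<- 18.93 - 9.11 * eth + 3.10 * sex + 2.15 * lrn
The slope for eth: holding the other variables constant, non-aboriginal students (1) are predicted to be absent 9.11 fewer days on average than aboriginal students (0).
The slope for sex: holding the other variables constant, males (1) are predicted to be absent 3.10 more days on average than females (0).
The slope for lrn: holding the other variables constant, slow learners (1) are predicted to be absent 2.15 more days on average than average learners (0).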
eth<- 0
sex<- 1
lrn<- 1
days_missed<- 2
avg_days_predicted<- 18.93 - 9.11 * eth + 3.10 * sex + 2.15 * lrn
avg_days_predicted
## [1] 24.18
residual<- days_missed - avg_days_predicted
residual
## [1] -22.18
n<- 146
k<- 3
varRes<- 240.57
varStudents<- 264.17
R2<- 1-(varRes / varStudents)
round(R2,4)
## [1] 0.0893
adjR2<- 1-(varRes/varStudents) * ((n-1)/(n-k-1))
round(adjR2,4)
## [1] 0.0701
8.6 Cherry trees. Timber yield is approximately equal to the volume of a tree, however, this value is difficult to measure without first cutting the tree down. Instead, other variables, such as height and diameter, may be used to predict a tree’s volume and yield. Researchers wanting to understand the relationship between these variables for black cherry trees collected data from 31 such trees in the Allegheny National Forest, Pennsylvania. Height is measured in feet, diameter in inches (at 54 inches above ground), and volume in cubic feet.
ct<- matrix(c(-57.99,8.64,-6.71,0.00,0.34,0.13,2.61,0.01,4.71,0.26,17.82,0.00), byrow= T, nrow=3)
colnames(ct)=c('Estimate','Std. Error','t value','Pr(>|t|)')
rownames(ct)=c('(Intercept)','ht','diameter')
ct
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -57.99       8.64   -6.71     0.00
## ht              0.34       0.13    2.61     0.01
## diameter        4.71       0.26   17.82     0.00
Confidence interval for the coefficient of height: lower.bound = b1 - t * se, upper.bound = b1 + t * se
b1<- .34
t<- 2.05
se<- .13
lower.bound<- b1 - t * se
upper.bound<- b1 + t * se
CI<- c(lower.bound, upper.bound)
CI
## [1] 0.0735 0.6065
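The critical value t = 2.05 used above appears to be a rounded t quantile. The sketch below recomputes it with `qt()`, assuming a 95% confidence level (the level is not stated in the text) and n - k - 1 degrees of freedom for a model with k = 2 predictors; the names `t_star` and `ci` are illustrative.

```r
n <- 31                                # trees in the sample
k <- 2                                 # predictors: height and diameter
t_star <- qt(0.975, df = n - k - 1)    # critical value, assuming a 95% interval
round(t_star, 2)

b1 <- 0.34                             # estimated coefficient of height
se <- 0.13                             # its standard error
ci <- b1 + c(-1, 1) * t_star * se      # interval using the unrounded critical value
round(ci, 4)
```

The interval lies entirely above zero, which is consistent with the small p-value for height in the table.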
actVol<-24.2
ht<- 79
diam<- 11.31
estVol<- -57.99 + .34 * ht + 4.71 * diam
round(estVol,1)
## [1] 22.1
Thus the model underestimated the volume of this tree: the residual is 24.2 - 22.1 = 2.1 cubic feet.
8.8 Absenteeism, Part II. Exercise 8.4 considers a model that predicts the number of days absent using three predictors: ethnic background (eth), gender (sex), and learner status (lrn). The table below shows the adjusted R-squared for the model as well as adjusted R-squared values for all models we evaluate in the first step of the backwards elimination process.
Model<-c('Full model','No eth','No sex','No learner status')
AdjR2<-c(0.0701,-0.0033,0.0676,0.0723)
abs<- data.frame(Model,AdjR2)
abs
##               Model   AdjR2
## 1        Full model  0.0701
## 2            No eth -0.0033
## 3            No sex  0.0676
## 4 No learner status  0.0723
Which, if any, variable should be removed from the model first?
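In backward elimination with adjusted R-squared, the variable to drop first is the one whose removal gives the highest adjusted R-squared, provided that value exceeds the full model’s. A minimal sketch that re-enters the table values under illustrative names (`models`, `adj_r2`):

```r
# Adjusted R-squared values from the table above.
models <- c('Full model', 'No eth', 'No sex', 'No learner status')
adj_r2 <- c(0.0701, -0.0033, 0.0676, 0.0723)

best <- which.max(adj_r2)                                  # model with the highest adjusted R-squared
models[best]
improves <- adj_r2[best] > adj_r2[models == 'Full model']  # does it beat the full model?
improves
```

Removing learner status raises the adjusted R-squared from 0.0701 to 0.0723, so learner status should be removed first.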
8.10 Absenteeism, Part III. Exercise 8.4 provides regression output for the full model, including all explanatory variables available in the data set, for predicting the number of days absent from school. In this exercise we consider a forward-selection algorithm and add variables to the model one-at-a-time. The table below shows the p-value and adjusted R2 of each model where we include only the corresponding predictor. Based on this table, which variable should be added to the model first?
abs1<- matrix(c(0.0007,0.3142,0.5870,0.0714,0.0001,0), byrow= T, nrow=2)
colnames(abs1)=c('ethnicity','sex','learner status')
rownames(abs1)=c('p-value','R2adj')
abs1
##         ethnicity    sex learner status
## p-value    0.0007 0.3142          0.587
## R2adj      0.0714 0.0001          0.000
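In forward selection, the first variable added is the one with the smallest p-value or, under the adjusted R-squared criterion, the largest adjusted R-squared. The sketch below checks both criteria on the table values; the vector names are illustrative.

```r
# Single-predictor model summaries from the table above.
predictors <- c('ethnicity', 'sex', 'learner status')
p_values <- c(0.0007, 0.3142, 0.5870)
adj_r2_single <- c(0.0714, 0.0001, 0.0000)

first_by_p <- predictors[which.min(p_values)]          # smallest p-value
first_by_r2 <- predictors[which.max(adj_r2_single)]    # largest adjusted R-squared
c(first_by_p, first_by_r2)
```

Both criteria agree: ethnicity should be added to the model first.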
8.12 Movie lovers, Part II. Suppose an online media streaming company is interested in building a movie recommendation system. The website maintains data on the movies in their database (genre, length, cast, director, budget, etc.) and additionally collects data from their subscribers (demographic information, previously watched movies, how they rated previously watched movies, etc.). The recommendation system will be deemed successful if subscribers actually watch, and rate highly, the movies recommended to them. Should the company use the adjusted R2 or the p-value approach in selecting variables for their recommendation system?
They should use the adjusted R2 approach to select variables. The system is judged by how well it predicts which movies subscribers will watch and rate highly, and selecting on adjusted R2 targets prediction accuracy, whereas the p-value approach is better suited to identifying which predictors are statistically significant.
8.14 GPA and IQ. A regression model for predicting GPA from gender and IQ was fit, and both predictors were found to be statistically significant. Using the plots given below, determine if this regression model is appropriate for these data.
Looking at the plot against the fitted values, the observations appear randomly scattered, with no apparent pattern or remaining relationship among the variables.
8.18 Challenger disaster, Part II. Exercise 8.16 introduced us to O-rings that were identified as a plausible explanation for the breakup of the Challenger space shuttle 73 seconds into takeoff in 1986. The investigation found that the ambient temperature at the time of the shuttle launch was closely related to the damage of O-rings, which are a critical component of the shuttle. See this earlier exercise if you would like to browse the original data.
log(p̂ / (1 − p̂)) = 11.6630 − 0.2162 × Temperature
where p̂ is the model-estimated probability that an O-ring will become damaged. Use the model to calculate the probability that an O-ring will become damaged at each of the following ambient temperatures: 51, 53, and 55 degrees Fahrenheit. The model-estimated probabilities for several additional ambient temperatures are provided below, where subscripts indicate the temperature:
p̂_57 = 0.341, p̂_59 = 0.251, p̂_61 = 0.179, p̂_63 = 0.124, p̂_65 = 0.084, p̂_67 = 0.056, p̂_69 = 0.037, p̂_71 = 0.024
temp51<- 51
temp53<- 53
temp55<- 55
damagedOring51<- 11.6630 - 0.2162 * temp51
damagedOring53<- 11.6630 - 0.2162 * temp53
damagedOring55<- 11.6630 - 0.2162 * temp55
round(phat51<- exp(damagedOring51)/(1+exp(damagedOring51)),2)
## [1] 0.65
round(phat53<- exp(damagedOring53)/(1+exp(damagedOring53)),2)
## [1] 0.55
round(phat55<- exp(damagedOring55)/(1+exp(damagedOring55)),2)
## [1] 0.44
The model-estimated probability that an O-ring will be damaged at 51 degrees is about 65%.
The model-estimated probability that an O-ring will be damaged at 53 degrees is about 55%.
The model-estimated probability that an O-ring will be damaged at 55 degrees is about 44%.
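The same calculation can be done for several temperatures at once. This sketch uses base R’s `plogis()`, the inverse of the logit, and includes 57 degrees as a check against the value 0.341 given in the exercise; the vector names are illustrative.

```r
# Model coefficients from the exercise.
temperature <- c(51, 53, 55, 57)
log_odds <- 11.6630 - 0.2162 * temperature

# plogis(x) computes exp(x) / (1 + exp(x)), the inverse of the logit.
p_hat <- plogis(log_odds)
round(p_hat, 3)
```

Working with a vector of temperatures also makes it straightforward to extend the calculation to the other temperatures listed in the exercise.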
Add the model-estimated probabilities from part (a) on the plot, then connect these dots using a smooth curve to represent the model-estimated probabilities.
Describe any concerns you may have regarding applying logistic regression in this application, and note any assumptions that are required to accept the model’s validity.