7.6 Husbands and wives, Part I.

  1. Describe the relationship between husbands’ and wives’ ages.

Strong positive linear correlation.

  1. Describe the relationship between husbands’ and wives’ heights.

Weak positive association that looks roughly linear. The points are widely scattered, so the relationship is much less clear than it is for ages.

  1. Which plot shows a stronger correlation? Explain your reasoning.

The age plot, because the points cluster much more tightly around a straight line. The strength of a correlation depends on how closely the points follow a line, not on how steep that line is.

  1. Data on heights were originally collected in centimeters, and then converted to inches. Does this conversion affect the correlation between husbands’ and wives’ heights?

No. Converting centimeters to inches only rescales both variables by a constant factor, and correlation is unaffected by a change of units.
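
A quick check of this claim, as a sketch on simulated heights (the numbers below are made up for illustration and are not the exercise data). Dividing both variables by 2.54 should leave the correlation unchanged:

```r
# Simulated heights in centimeters; illustrative values only, not the exercise data
set.seed(1)
husband_cm <- rnorm(50, mean = 175, sd = 7)
wife_cm <- 0.4 * husband_cm + rnorm(50, mean = 92, sd = 5)

# Convert both variables to inches (1 inch = 2.54 cm)
husband_in <- husband_cm / 2.54
wife_in <- wife_cm / 2.54

r_cm <- cor(husband_cm, wife_cm)
r_in <- cor(husband_in, wife_in)

# TRUE if the two correlations agree up to floating-point tolerance
all.equal(r_cm, r_in)
```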

7.12 Trees. The scatterplots below show the relationship between height, diameter, and volume of timber in 31 felled black cherry trees. The diameter of the tree is measured 4.5 feet above the ground.

  1. Describe the relationship between volume and height of these trees.

Looks like positive correlation but perhaps non-linear.

  1. Describe the relationship between volume and diameter of these trees.

Positive linear correlation.

  1. Suppose you have height and diameter measurements for another black cherry tree. Which of these variables would be preferable to use to predict the volume of timber in this tree using a simple linear regression model? Explain your reasoning.

I would prefer to use diameter, because the scatterplots indicate that it has the stronger positive linear relationship with volume, so it should predict the volume better.

7.18 Correlation, Part II. What would be the correlation between the annual salaries of males and females at a company if for a certain type of position men always made

  1. $5,000 more than women?

salaryMen = salaryWomen + 5000. Every point lies exactly on a line with a positive slope, so the correlation is exactly 1.

  1. 25% more than women?

salaryMen = salaryWomen + 0.25*salaryWomen = 1.25*salaryWomen. This is again a perfect positive linear relationship, so the correlation is exactly 1.

  1. 15% less than women?

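salaryMen = salaryWomen - 0.15*salaryWomen = 0.85*salaryWomen. The slope (0.85) is still positive: men's salaries rise whenever women's salaries rise, so the correlation is again exactly 1.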
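
A small check with made-up salaries (the five values for women are hypothetical). In each scenario the men's salary is a positive linear function of the women's salary, so each correlation should come out as 1, up to floating-point rounding:

```r
# Hypothetical salaries for women in one type of position (made-up values)
salary_women <- c(42000, 48000, 55000, 61000, 70000)

r_plus_5000 <- cor(salary_women, salary_women + 5000)  # men make $5,000 more
r_25_more <- cor(salary_women, 1.25 * salary_women)    # men make 25% more
r_15_less <- cor(salary_women, 0.85 * salary_women)    # men make 15% less

c(r_plus_5000, r_25_more, r_15_less)
```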

7.24 Nutrition at Starbucks, Part I. The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.

  1. Describe the relationship between number of calories and amount of carbohydrates (in grams) that Starbucks food menu items contain.

Positive linear correlation.

  1. In this scenario, what are the explanatory and response variables?

The response variable is the amount of carbohydrates a menu item has in grams (Carbs), and the explanatory variable is its calorie content (Calories).

  1. Why might we want to fit a regression line to these data?

Because we are trying to predict the amount of carbs based on the calorie count.

  1. Do these data meet the conditions required for fitting a least squares line?

The data appear to follow a linear trend in the scatterplot, and the histogram suggests the residuals are nearly normal. However, the residual plot does not show constant variability, so the constant variance condition is not met.

7.30 Cats, Part I. The following regression output is for predicting the heart weight (in g) of cats from their body weight (in kg).

  1. Write out the linear model.

heartweight= -0.357+4.034bodywt

  1. Interpret the intercept.

The intercept is -0.357, which means that a cat with a body weight of 0 kg would be predicted to have a heart weight of -0.357 g. This is not meaningful in reality, since a cat cannot weigh 0 kg; the intercept only serves to set the height of the line.

  1. Interpret the slope.

The slope is 4.034, which indicates that for each additional 1 kg of body weight, a cat's heart weight is expected to be 4.034 g higher on average.

  1. Interpret R2.

R-squared is 64.66%, which means that 64.66% of the variability in cats' heart weights is explained by their body weights.

  1. Calculate the correlation coefficient.
sqrt(0.6466)
## [1] 0.8041144
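
The square root only gives the magnitude of the correlation; its sign comes from the slope. A minimal sketch using the slope and R-squared quoted above (the variable names are my own):

```r
slope <- 4.034       # slope from the linear model above
r_squared <- 0.6466  # R-squared from the regression output

# The correlation takes the sign of the slope, which is positive here
r <- sign(slope) * sqrt(r_squared)
r
```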

7.36 Beer and blood alcohol content. Many people believe that gender, weight, drinking habits, and many other factors are much more important in predicting blood alcohol content (BAC) than simply considering the number of drinks a person consumed. Here we examine data from sixteen student volunteers at Ohio State University who each drank a randomly assigned number of cans of beer. These students were evenly divided between men and women, and they differed in weight and drinking habits. Thirty minutes later, a police officer measured their blood alcohol content (BAC) in grams of alcohol per deciliter of blood. The scatterplot and regression table summarize the findings.

  1. Describe the relationship between the number of cans of beer and BAC.

Positive linear correlation.

  1. Write the equation of the regression line. Interpret the slope and intercept in context.

BAC= -0.0127+0.0180beers

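The slope is 0.0180, which indicates that for each additional can of beer, BAC is expected to increase by 0.0180 grams per deciliter on average. The intercept is -0.0127, which indicates that a student who drinks 0 beers would be predicted to have a BAC of -0.0127. A negative BAC is impossible, so the intercept has no practical meaning here.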

  1. Do the data provide strong evidence that drinking more cans of beer is associated with an increase in blood alcohol? State the null and alternative hypotheses, report the p-value, and state your conclusion.

Ho: slope=0 (No association between beers and BAC) Ha: Slope does not equal zero. (Association between beers and BAC).

The p-value in the regression table rounds to zero, so we reject the null hypothesis and conclude that there is a statistically significant association between the number of beers consumed and BAC.

  1. The correlation coefficient for number of cans of beer and BAC is 0.89. Calculate R2 and interpret it in context.
0.89^2
## [1] 0.7921

The R-squared is 0.7921, which indicates that 79.21% of the variability in BAC is explained by the number of cans of beer consumed.

  1. Suppose we visit a bar, ask people how many drinks they have had, and also take their BAC. Do you think the relationship between number of drinks and BAC would be as strong as the relationship found in the Ohio State study?

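Probably not. In the Ohio State study the number of beers was randomly assigned and BAC was measured thirty minutes later for everyone. In a bar, the number of drinks is self-reported, the drinks differ in type and strength, and the time since the last drink varies from person to person, so the relationship would likely be weaker.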

7.42 Babies. Is the gestational age (time between conception and birth) of a low birth-weight baby useful in predicting head circumference at birth? Twenty-five low birth-weight babies were studied at a Harvard teaching hospital; the investigators calculated the regression of head circumference (measured in centimeters) against gestational age (measured in weeks). The estimated regression line is head_circumference= 3.91 + 0.78*gestational_age

  1. What is the predicted head circumference for a baby whose gestational age is 28 weeks?
hc <-3.91+0.78*28
hc 
## [1] 25.75

The predicted head circumference is 25.75 cm.

  1. The standard error for the coefficient of gestational age is 0.35, which is associated with df = 23. Does the model provide strong evidence that gestational age is significantly associated with head circumference?

Ho: slope=0, No association Ha: slope not =0, Association

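qt(0.975, 23)
## [1] 2.068658
tscore <- (0.78-0)/0.35
tscore
## [1] 2.228571

The alternative hypothesis is two-sided, so the critical value at the 5% significance level is the 0.975 quantile of the t distribution with 23 degrees of freedom. The t-score is greater than this critical value, so we reject the null hypothesis and conclude that gestational age is significantly associated with head circumference.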
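
The same conclusion can be reached through a p-value. The sketch below assumes a two-sided test and uses only the estimate, standard error, and degrees of freedom given in the exercise; the variable names are my own:

```r
estimate <- 0.78  # slope for gestational age
se_slope <- 0.35  # standard error of the slope
df_slope <- 23    # degrees of freedom given in the exercise

t_stat <- (estimate - 0) / se_slope

# Two-sided p-value: probability in both tails beyond |t_stat|
p_value <- 2 * pt(-abs(t_stat), df_slope)
p_value
```

A p-value below 0.05 leads to rejecting the null hypothesis at the 5% level.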

8.2 Baby weights, Part II. Exercise 8.1 introduces a data set on birth weight of babies. Another variable we consider is parity, which is 0 if the child is the first born, and 1 otherwise. The summary table below shows the results of a linear regression model for predicting the average birth weight of babies, measured in ounces, from parity.

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   120.07       0.60  199.94   0.0000
parity         -1.93       1.19   -1.62   0.1052

  1. Write the equation of the regression line.

birthweight= 120.07-1.93parity

  1. Interpret the slope in this context, and calculate the predicted birth weight of first borns and others.

The slope is -1.93, which indicates that babies who are not first born (parity = 1) are predicted to weigh 1.93 ounces less, on average, than first borns.

##First borns
bw1 <- 120.07-1.93*0
bw1
## [1] 120.07
##Others
bw2 <- 120.07-1.93*1
bw2
## [1] 118.14

We expect a first born to be 120.07 ounces and the others to be 118.14 ounces.

  1. Is there a statistically significant relationship between the average birth weight and parity?

Ho: slope=0, No association Ha: slope not =0, Association. The p-value is 0.1052, which is greater than 0.05, so we fail to reject the null hypothesis. The data do not provide strong evidence of an association between average birth weight and parity.

8.4 Absenteeism, Part I. Researchers interested in the relationship between absenteeism from school and certain demographic characteristics of children collected data from 146 randomly sampled students in rural New South Wales, Australia, in a particular school year. Below are three observations from this data set.

      eth  sex  lrn  days
1       0    1    1     2
2       0    1    1    11
...
146     1    0    0    37

The summary table below shows the results of a linear regression model for predicting the average number of days absent based on ethnic background (eth: 0 - aboriginal, 1 - not aboriginal), sex (sex: 0 - female, 1 - male), and learner status (lrn: 0 - average learner, 1 - slow learner).

            Estimate Std. Error t value Pr(>|t|)
(Intercept)    18.93       2.57    7.37   0.0000
eth            -9.11       2.60   -3.51   0.0000
sex             3.10       2.64    1.18   0.2411
lrn             2.15       2.65    0.81   0.4177

  1. Write the equation of the regression line.

Daysabsent= 18.93-9.11eth+3.10sex+2.15lrn

  1. Interpret each one of the slopes in this context.

The slope for ethnic background is -9.11, which indicates that students who are not aboriginal (eth=1) are expected to be absent 9.11 fewer days on average than aboriginal students. The slope for sex is 3.10, which indicates that male students (sex=1) are expected to be absent 3.10 more days on average than female students. The slope for learner status is 2.15, which indicates that slow learners (lrn=1) are expected to be absent 2.15 more days on average than average learners. Each interpretation holds all other variables constant.

  1. Calculate the residual for the first observation in the data set: a student who is aboriginal, male, a slow learner, and missed 2 days of school.
#eth=0, sex=1, lrn=1
ada <- 2 #actual days absent
eda <- 18.93-9.11*0+3.10*1+2.15*1 #expected days absent
residual <- ada-eda
residual
## [1] -22.18
  1. The variance of the residuals is 240.57, and the variance of the number of absent days for all students in the data set is 264.17. Calculate the R2 and the adjusted R2. Note that there are 146 observations in the data set.
n <- 146
k <- 3
var1 <- 240.57
var2 <- 264.17 
rsquared <- 1-(var1/var2)
rsquared
## [1] 0.08933641
adjusted <- 1-(var1/var2)*((n-1)/(n-k-1))
adjusted
## [1] 0.07009704
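
As a cross-check, the sketch below wraps both formulas, R2 = 1 - Var(residuals)/Var(response) and adjusted R2 = 1 - (Var(residuals)/Var(response)) * (n-1)/(n-k-1), in a small helper. The function name `r2_summary` is hypothetical. The adjusted value should round to the 0.0701 listed for the full model in Exercise 8.8.

```r
# Hypothetical helper: R-squared and adjusted R-squared from the two variances
r2_summary <- function(var_resid, var_total, n, k) {
  ratio <- var_resid / var_total
  r2 <- 1 - ratio
  # The adjustment penalizes the ratio for the k predictors used
  adj_r2 <- 1 - ratio * (n - 1) / (n - k - 1)
  c(r2 = r2, adj_r2 = adj_r2)
}

fit_summary <- r2_summary(var_resid = 240.57, var_total = 264.17, n = 146, k = 3)
round(fit_summary, 4)
```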

8.6 Cherry trees. Timber yield is approximately equal to the volume of a tree, however, this value is difficult to measure without first cutting the tree down. Instead, other variables, such as height and diameter, may be used to predict a tree’s volume and yield. Researchers wanting to understand the relationship between these variables for black cherry trees collected data from 31 such trees in the Allegheny National Forest, Pennsylvania. Height is measured in feet, diameter in inches (at 54 inches above ground), and volume in cubic feet.

            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -57.99       8.64   -6.71     0.00
height          0.34       0.13    2.61     0.01
diameter        4.71       0.26   17.82     0.00

  1. Calculate a 95% confidence interval for the coefficient of height, and interpret it in the context of the data.
n <-31
k <- 2 #number of predictors (height and diameter)
df <- n-k-1
t <-qt(0.975, df)
se <- 0.13
slope <-0.34
lower <- slope-(t*se)
lower
## [1] 0.07370707
upper <-slope+(t*se)
upper
## [1] 0.6062929

We are 95% confident that the coefficient of height is between about 0.0737 and 0.6063. This means that, holding diameter constant, for each additional foot of height we would expect the volume of a tree to be between 0.0737 and 0.6063 cubic feet larger on average.

  1. One tree in this sample is 79 feet tall, has a diameter of 11.3 inches, and is 24.2 cubic feet in volume. Determine if the model overestimates or underestimates the volume of this tree, and by how much.
##height=79, diameter=11.3, volume=24.2
av <- 24.2 #actual volume
av
## [1] 24.2
ev <- -57.99+0.34*79+4.71*11.3 #expected volume
ev
## [1] 22.093
av-ev
## [1] 2.107

The model underestimates the volume of this tree by 2.107 cubic feet.

8.8 Absenteeism, Part II. Exercise 8.4 considers a model that predicts the number of days absent using three predictors: ethnic background (eth), gender (sex), and learner status (lrn). The table below shows the adjusted R-squared for the model as well as adjusted R-squared values for all models we evaluate in the first step of the backwards elimination process.

    Model               Adjusted R2
1   Full model               0.0701
2   No ethnicity            -0.0033
3   No sex                   0.0676
4   No learner status        0.0723

Which, if any, variable should be removed from the model first?

Removing learner status (lrn) is the only change that raises the adjusted R-squared above the full model's value (0.0723 versus 0.0701), so that variable should be removed first.
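
The first backward-elimination step can be sketched as a lookup over the table values. The vector name `adj_r2` and its labels are my own, and the values are copied from the table in the exercise:

```r
# Adjusted R-squared for the full model and for each model with one variable dropped
adj_r2 <- c("Full model" = 0.0701,
            "No ethnicity" = -0.0033,
            "No sex" = 0.0676,
            "No learner status" = 0.0723)

# Candidate model with the highest adjusted R-squared
best <- names(which.max(adj_r2))
best

# Drop a variable only if the reduced model beats the full model
improves <- adj_r2[[best]] > adj_r2[["Full model"]]
improves
```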

8.10 Absenteeism, Part III. Exercise 8.4 provides regression output for the full model, including all explanatory variables available in the data set, for predicting the number of days absent from school. In this exercise we consider a forward-selection algorithm and add variables to the model one-at-a-time. The table below shows the p-value and adjusted R2 of each model where we include only the corresponding predictor. Based on this table, which variable should be added to the model first?

variable   ethnicity     sex  learner status
p-value       0.0007  0.3142          0.5870
R2 adj        0.0714  0.0001               0

Based on this table, we should add the ethnicity variable first. It has the smallest p-value (0.0007), which indicates a statistically significant association with days absent, and the highest adjusted R-squared (0.0714) of the three single-predictor models.

8.12 Movie lovers, Part II. Suppose an online media streaming company is interested in building a movie recommendation system. The website maintains data on the movies in their database (genre, length, cast, director, budget, etc.) and additionally collects data from their subscribers (demographic information, previously watched movies, how they rated previously watched movies, etc.). The recommendation system will be deemed successful if subscribers actually watch, and rate highly, the movies recommended to them. Should the company use the adjusted R2 or the p-value approach in selecting variables for their recommendation system?

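The adjusted R2 approach. The system is judged on how well it predicts which movies subscribers will watch and rate highly, and adjusted R2 is the better criterion when prediction accuracy is the goal. The p-value approach is more useful when the aim is to identify which variables are statistically significant predictors of the response.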

8.14 GPA and IQ. A regression model for predicting GPA from gender and IQ was fit, and both predictors were found to be statistically significant. Using the plots given below, determine if this regression model is appropriate for these data.

Based on the plots, most of the observations lie along the line in the normal quantile plot, apart from a few outliers, so the residuals look nearly normal. The residuals show no apparent pattern against the fitted values, which supports the independence and constant variability conditions. Overall, the regression model appears appropriate for these data.

8.18 Challenger disaster, Part II. Exercise 8.16 introduced us to O-rings that were identified as a plausible explanation for the breakup of the Challenger space shuttle 73 seconds into takeoff in 1986. The investigation found that the ambient temperature at the time of the shuttle launch was closely related to the damage of O-rings, which are a critical component of the shuttle. See this earlier exercise if you would like to browse the original data.

  1. The data provided in the previous exercise are shown in the plot. The logistic model fit to these data may be written as log(p̂/(1 - p̂)) = 11.6630 - 0.2162*Temperature, where p̂ is the model-estimated probability that an O-ring will become damaged. Use the model to calculate the probability that an O-ring will become damaged at each of the following ambient temperatures: 51, 53, and 55 degrees Fahrenheit. The model-estimated probabilities for several additional ambient temperatures are provided below, where subscripts indicate the temperature: p̂_57 = 0.341, p̂_59 = 0.251, p̂_61 = 0.179, p̂_63 = 0.124, p̂_65 = 0.084, p̂_67 = 0.056, p̂_69 = 0.037, p̂_71 = 0.024
damage51 <- 11.6630-0.2162*51
pdamaged51 <- exp(damage51)/(1+exp(damage51))
pdamaged51
## [1] 0.6540297
damage53 <- 11.6630-0.2162*53
pdamaged53 <-exp(damage53)/(1+exp(damage53))
pdamaged53
## [1] 0.5509228
damage55 <- 11.6630-0.2162*55
pdamaged55 <-exp(damage55)/(1+exp(damage55))
pdamaged55
## [1] 0.4432456
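
The three calculations above can also be done in one vectorized step. This sketch uses base R's `plogis()`, the inverse logit, which computes exp(x)/(1+exp(x)); the names `temps`, `log_odds`, and `p_hat` are my own:

```r
temps <- c(51, 53, 55)

# Log-odds of damage from the fitted logistic model
log_odds <- 11.6630 - 0.2162 * temps

# Convert log-odds to probabilities with the inverse logit
p_hat <- plogis(log_odds)
round(setNames(p_hat, temps), 3)
```

The rounded values should match the three probabilities computed above (0.654, 0.551, and 0.443).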
  1. Add the model-estimated probabilities from part (a) on the plot, then connect these dots using a smooth curve to represent the model-estimated probabilities.
damage51 <- 11.6630-0.2162*51
pdamaged51 <- exp(damage51)/(1+exp(damage51))
damage53 <- 11.6630-0.2162*53
pdamaged53 <-exp(damage53)/(1+exp(damage53))
damage55 <- 11.6630-0.2162*55
pdamaged55 <-exp(damage55)/(1+exp(damage55))

model <- c(pdamaged51, pdamaged53, pdamaged55, 0.341, 0.251, 0.179, 0.124, 0.084, 0.056, 0.037, 0.024)
tempmod <- c(51, 53, 55, 57, 59, 61, 63, 65, 67, 69, 71)

temperature <- c(53,57,58,63,66,67,67,67,68,69,70,70,70,70,72,73,75,75,76,76,78,79,81)

damaged <- c(5,1,1,1,0,0,0,0,0,0,1,0,1,0,0,0,0,1,0,0,0,0,0)

undamaged <- c(1,5,5,5,6,6,6,6,6,6,5,6,5,6,6,6,6,5,6,6,6,6,6)
probdamage <- damaged/(damaged+undamaged)

library(ggplot2)
mydata <- data.frame(temperature, damaged, undamaged, probdamage)
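# family must be passed to glm() through method.args;
# the weight aesthetic gives the number of O-rings behind each proportion
ggplot(mydata,aes(x=temperature, y=probdamage)) +geom_point() +
stat_smooth(aes(weight = damaged+undamaged), method = 'glm', method.args = list(family = 'binomial'))
## `geom_smooth()` using formula 'y ~ x'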

plot(tempmod, model, type="o", col="green")

  1. Describe any concerns you may have regarding applying logistic regression in this application, and note any assumptions that are required to accept the model’s validity.

Logistic regression requires that each predictor (xi) be linearly related to logit(pi) when all other predictors are held constant, which is hard to verify here with data from only 23 launches. It also requires that the outcomes be independent. That seems reasonable from one launch to the next, although the O-rings on a single launch share the same conditions and may not be independent of one another.