Chapter 7 Homework - Linear Regression

7.24

The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.

Question7.24 Image

  1. There looks to be a positive, but weak, correlation between calories and carbs.

  2. In this scenario, we are looking to predict carbs based on calories, so carbs would be the response variable and calories would be the explanatory variable.

  3. A regression line gives us a simple way to predict carbs from calorie count: a linear equation takes the calorie count as displayed on the menu and returns a predicted amount of carbs.

  4. I would say no. While the residuals are roughly normally distributed, their variability is not constant. Based on the center graph above, we can see that the variance of the residuals increases as calorie count increases. This violates the constant variability condition for fitting a linear regression.
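
The constant variability concern in part 4 can also be checked numerically. Below is a minimal sketch on simulated data (not the Starbucks data, which isn't available here). The names sim_calories and sim_carbs and every number in the simulation are made up for illustration. It fits a line and compares the residual spread for the lower-calorie and higher-calorie halves of the data.

```r
# Simulated data only: the noise grows with x to mimic the fan shape
# described above. None of these numbers come from the Starbucks data.
set.seed(7)
n <- 200
sim_calories <- runif(n, min = 100, max = 500)
sim_carbs <- 10 + 0.1 * sim_calories +
  rnorm(n, mean = 0, sd = 0.03 * sim_calories)

fit <- lm(sim_carbs ~ sim_calories)
res <- resid(fit)

# Split the residuals at the median calorie value and compare spreads.
low_half <- sim_calories <= median(sim_calories)
sd_low <- sd(res[low_half])
sd_high <- sd(res[!low_half])

# A ratio well above 1 suggests the variability is not constant.
sd_high / sd_low
```

This is only a rough check. The residual plot is still the main tool, but a ratio far from 1 backs up what the plot shows.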

7.26

Question7.26 Image

  1. Using the correlation coefficient (R) and the standard deviations of the response and explanatory variables, we can calculate the slope of the regression line (b1 = R * sy/sx):
sy <- 9.41
sx <- 10.37
r <- 0.67

b1 <- (sy/sx)*r
b1
## [1] 0.6079749

We also know that the regression line passes through the point of means, (107.20, 171.14). Plugging this point into the point-slope line formula gives us the y-intercept, which combined with our slope gives the final regression equation:

y - y0 = b1 * (x - x0)

y - 171.14 = ~0.6079 * (x - 107.20)

yint <- (b1 * -107.20) + 171.14

yint
## [1] 105.9651

So the final regression line is:

y = 0.6079749x + 105.9651

regression_line <- function(x){return(0.6079749*x + 105.9651)}
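
The same two steps (slope from R and the standard deviations, intercept from the point of means) can be wrapped into one helper. This is just a sketch. The function line_from_summary and its argument names are my own, not something from the textbook.

```r
# Hypothetical helper: least-squares line from summary statistics.
line_from_summary <- function(r, s_x, s_y, x_bar, y_bar) {
  slope <- r * s_y / s_x               # b1 = R * sy / sx
  intercept <- y_bar - slope * x_bar   # line passes through (x_bar, y_bar)
  c(intercept = intercept, slope = slope)
}

# Values used in this question.
coefs <- line_from_summary(r = 0.67, s_x = 10.37, s_y = 9.41,
                           x_bar = 107.20, y_bar = 171.14)
coefs
```

Returning a named vector keeps the intercept and slope together, so they can be pulled out by name (for example coefs[["slope"]]) instead of being retyped by hand.
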
  2. The y-intercept in this case is the predicted height (y) when shoulder girth (x) is zero. That would never happen (a person can’t have a 0 cm shoulder girth), so the intercept only serves to position the line.

The slope tells us that for each additional 1 cm of shoulder girth, the model predicts an additional 0.608 cm (approximately) of height.

  3. R-squared is the correlation coefficient (R), squared, i.e.:
r^2
## [1] 0.4489

This means that approximately 45% of the variation in height can be explained by our regression line/relationship to shoulder girth.

In general the closer the value is to 1 or 100%, the better the model explains the variance, so 45% is somewhat weak.

  4. Using the regression_line function defined above, we plug in the 100 cm value:
d_height <- regression_line(100)

d_height
## [1] 166.7626
  5. Actual height is 160 cm. The residual is actual - predicted, therefore:
d_resid <- 160 - d_height

d_resid
## [1] -6.76259

This means that the regression model overestimates this student’s height by approximately 6.8 cm.

  6. For a student with shoulder girth 56 cm, this model would not be a good fit. Since this girth is nearly 5 standard deviations below our sample mean, we can assume it is outside the range of data the model was built on. Using the model here would be extrapolation, and that is discouraged.
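
To back up the "nearly 5 standard deviations" claim, here is the quick calculation, using the sample mean and standard deviation of shoulder girth from the slope calculation earlier in this question. The variable names are mine.

```r
x_bar <- 107.20   # mean shoulder girth (cm)
s_x <- 10.37      # standard deviation of shoulder girth (cm)
girth <- 56       # shoulder girth of the student in question (cm)

# Number of standard deviations between this girth and the sample mean.
z <- (girth - x_bar) / s_x
z
```

The result is a little under 5 in absolute value and negative, since 56 cm is below the mean.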

7.30

Question7.30 Image

  1. The linear model written out would look like this:

y = yint + x*slope

y = -0.357 + 4.034x

  2. The intercept in this case would be the predicted heart weight (y) when body weight (x) is 0. This would never actually be observed, because body weight won’t be 0 and heart weight can’t be negative.

  3. The slope says that for each one-unit increase in body weight, the model predicts heart weight to increase by 4.034 units.

  4. R-squared is the percentage of variation in the response variable (y, or heart weight) that’s explained by our model. Here that is 64.66%.

  5. The correlation coefficient is R, which is the square root of R-squared with the same sign as the slope (positive here), therefore:

r_7.30 <- sqrt(0.6466)

r_7.30
## [1] 0.8041144
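
One detail worth making explicit: sqrt() always returns a non-negative number, so the sign of R has to come from the sign of the slope. A small sketch, with variable names of my own choosing:

```r
slope <- 4.034        # slope from the linear model above
r_squared <- 0.6466   # R-squared used in the calculation above

# R takes the sign of the slope; sqrt() alone would lose a negative sign.
r <- sign(slope) * sqrt(r_squared)
r
```

Here the slope is positive, so the result matches the plain square root. With a negative slope the same line would return a negative R.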

7.40

  1. The regression line passes through the point of means, so using the slope-intercept line formula (y = mx + b), we can plug in the means and the intercept to solve for the slope:
y0 <- 3.9983
x0 <- -0.0883
yint <- 4.010

slope <- (y0 - yint)/x0 

slope
## [1] 0.1325028
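
As a sanity check, the fitted line should pass back through the point of means. This sketch repeats the values above so it can be run on its own. The name y_at_mean is mine, and all.equal is used instead of == to allow for floating-point rounding.

```r
y0 <- 3.9983     # mean of the response (y)
x0 <- -0.0883    # mean of the explanatory variable (x)
yint <- 4.010    # intercept

slope <- (y0 - yint) / x0

# Predicted response at the mean of x; this should reproduce y0.
y_at_mean <- yint + slope * x0
isTRUE(all.equal(y_at_mean, y0))
```
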
  2. The slope we determined is positive, so the relationship is positive. The p-values for the t-tests in the summary statistics are reported as 0 (that is, smaller than the rounding shown), meaning a slope this far from zero is very unlikely to be due to chance alone. Therefore, the data do provide convincing evidence that the relationship between evaluation and beauty is positive.
  3. The conditions for linear regression are:

Linearity - This is judged from the residuals plot in the top left rather than the Normal QQ plot (which checks normality of the residuals). The residuals show no apparent curved pattern, so this condition appears satisfied.

Constant Variability - Based on the top left chart above, the spread of the residuals stays roughly constant across the range of the explanatory variable.

Nearly Normal Residuals - The histogram in the top right shows a nearly normal distribution, and the points on the Normal QQ plot fall close to a straight line.

Independent Observations - Based on the time series plot in the bottom right, the residuals show no pattern over time, so it is reasonable to treat these as independent observations.
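
For reference, the four diagnostic views used above can be produced with base R. The sketch below runs on simulated data (the course evaluation data isn't included here), so sim_x, sim_y, and all the numbers are placeholders. Only the plotting calls matter.

```r
# Simulated placeholder data; not the evaluation/beauty data set.
set.seed(40)
n <- 150
sim_x <- rnorm(n)
sim_y <- 4 + 0.13 * sim_x + rnorm(n, sd = 0.5)

fit <- lm(sim_y ~ sim_x)
res <- resid(fit)

par(mfrow = c(2, 2))   # 2 x 2 grid, filled row by row

# Top left: linearity and constant variability.
plot(sim_x, res, main = "Residuals vs. x")
abline(h = 0, lty = 2)

# Top right and bottom left: nearly normal residuals.
hist(res, main = "Histogram of residuals")
qqnorm(res)
qqline(res)

# Bottom right: independence (residuals in order of collection).
plot(seq_along(res), res, main = "Residuals in order of collection")
abline(h = 0, lty = 2)
```

Laying the plots out with par(mfrow = c(2, 2)) puts the residuals plot in the top left, the histogram in the top right, and the order plot in the bottom right, which matches how the charts are referred to above.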