The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.
Question 7.24 Image
There looks to be a positive, but weak, correlation between calories and carbs.
In this scenario, we are looking to predict carbs based on calories, so carbs would be the response variable and calories would be the explanatory variable.
A regression line gives a simple way to predict carbs from the calorie count: a linear equation takes the calorie count displayed on the menu and returns a predicted amount of carbs.
I would say no. While the residuals are roughly normally distributed, their variability is not constant. Based on the center graph above, we can see that the variance of the residuals increases as calorie count increases. This violates the constant variability condition for fitting a least-squares line.
Question 7.26 Image
sy <- 9.41
sx <- 10.37
r <- 0.67
b1 <- (sy/sx)*r
b1
## [1] 0.6079749
We also know that the point of means, (107.20, 171.14), lies on the regression line. Substituting it into the point-slope form of a line gives the y-intercept, which together with the slope gives the final regression equation:
y - y0 = b1 * (x - x0)
y - 171.14 = ~0.6079 * (x - 107.20)
yint <- (b1 * -107.20) + 171.14
yint
## [1] 105.9651
So the final regression line is:
y = 0.6079749x + 105.9651
regression_line <- function(x){return(0.6079749*x + 105.9651)}
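The two steps above (slope from the standard deviations and the correlation, intercept from the point of means) can be collected into one helper. This is only a minimal sketch: the function name `line_from_summary` and its argument names are hypothetical and chosen here for illustration.

```r
# Hypothetical helper: least-squares line from summary statistics only.
line_from_summary <- function(x_mean, y_mean, sx, sy, r) {
  slope <- (sy / sx) * r                 # b1 = (s_y / s_x) * r
  intercept <- y_mean - slope * x_mean   # the line passes through (x_mean, y_mean)
  c(intercept = intercept, slope = slope)
}

# Summary statistics from this exercise: x = shoulder girth, y = height
fit <- line_from_summary(x_mean = 107.20, y_mean = 171.14,
                         sx = 10.37, sy = 9.41, r = 0.67)
fit
```

The result should agree with the intercept and slope computed step by step above, up to rounding.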
The slope means that for each additional centimeter of shoulder girth, the predicted height increases by about 0.608 cm.
r^2
## [1] 0.4489
This means that approximately 45% of the variation in height can be explained by our regression line/relationship to shoulder girth.
In general the closer the value is to 1 or 100%, the better the model explains the variance, so 45% is somewhat weak.
d_height <- regression_line(100)
d_height
## [1] 166.7626
d_resid <- 160 - d_height
d_resid
## [1] -6.76259
A negative residual means the observed height is below the predicted height, so the regression model overestimates this student's height by approximately 6.8 cm.
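The residual calculation can be wrapped in a small function so the sign convention (observed minus predicted) stays explicit. This is a sketch under the coefficients computed above; `residual_at` is a hypothetical name.

```r
# Fitted line from this exercise (coefficients as computed above)
regression_line <- function(x) 0.6079749 * x + 105.9651

# Hypothetical helper: residual = observed - predicted
residual_at <- function(x, y_observed) y_observed - regression_line(x)

res <- residual_at(x = 100, y_observed = 160)
res

# A negative residual means the observation lies below the line,
# i.e. the model overestimates for this observation.
res < 0
```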
Question 7.30 Image
y = yint + x*slope
y = -0.357 + 4.034x
The intercept, -0.357, is the predicted heart weight (y) when body weight (x) is 0. This has no practical meaning, because body weight won’t be 0 and heart weight can’t be negative.
The slope means that for each one-unit increase in body weight, the predicted heart weight increases by 4.034 units.
R-squared is the percentage of variation in the response variable (y, or heart weight) that is explained by the model. Here it is 64.66%, so about 65% of the variation in heart weight is explained by body weight.
The correlation coefficient is R, the square root of R-squared, with the same sign as the slope (positive here), therefore:
r_7.30 <- sqrt(0.6466)
r_7.30
## [1] 0.8041144
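Taking the square root of R-squared recovers only the magnitude of the correlation; the sign has to come from the slope. A minimal sketch of that rule, using a hypothetical helper name:

```r
# Hypothetical helper: correlation from R-squared and the sign of the slope.
r_from_r2 <- function(r_squared, slope) sign(slope) * sqrt(r_squared)

# Values from this exercise: R-squared = 0.6466, slope = 4.034 (positive)
r_from_r2(0.6466, slope = 4.034)

# Illustration only: a negative slope would flip the sign of r
r_from_r2(0.6466, slope = -4.034)
```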
y0 <- 3.9983
x0 <- -0.0883
yint <- 4.010
slope <- (y0 - yint)/x0
slope
## [1] 0.1325028
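The calculation above relies on the least-squares line passing through the point of means, so that y0 = yint + slope * x0 and therefore slope = (y0 - yint) / x0. A sketch of the same step as a function (the name `slope_from_means` is hypothetical), which only works when the mean of x is non-zero:

```r
# Hypothetical helper: slope from the point of means and a known intercept.
slope_from_means <- function(x_mean, y_mean, intercept) {
  stopifnot(x_mean != 0)           # the formula divides by the mean of x
  (y_mean - intercept) / x_mean
}

b1 <- slope_from_means(x_mean = -0.0883, y_mean = 3.9983, intercept = 4.010)
b1

# Sanity check: plugging the mean of x back in should return the mean of y
4.010 + b1 * (-0.0883)
```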
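Linearity - The straight-line trend in the Normal QQ plot speaks to normality of the residuals rather than linearity of the relationship. Linearity is judged from the residual plot (top left chart above): provided it shows no curved pattern, this condition is satisfied.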
Constant Variability - Based on the top left chart above, the spread of the residuals is roughly constant across the range of the data.
Nearly Normal Residual distribution - The histogram in the top right shows a nearly normal distribution.
Independent Observations - Based on the plot of residuals against order of data collection in the bottom right, the residuals show no pattern over time, so it is reasonable to treat the observations as independent.
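The four diagnostic plots referred to above can be reproduced for any fitted line with base R. The sketch below uses synthetic data, since the exercise's data are not included here; the simulated values, the seed, and the panel layout are assumptions made for illustration only.

```r
# Synthetic data for illustration only; NOT the data from this exercise.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- 4 + 0.13 * x + rnorm(n, sd = 0.5)

fit <- lm(y ~ x)
res <- resid(fit)

op <- par(mfrow = c(2, 2))
# Top left: linearity and constant variability
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)
# Top right: nearly normal residuals
hist(res, main = "Histogram of residuals", xlab = "Residuals")
# Bottom left: nearly normal residuals
qqnorm(res)
qqline(res)
# Bottom right: independence (residuals in order of collection)
plot(seq_along(res), res, xlab = "Order of data collection",
     ylab = "Residuals", main = "Residuals vs order")
abline(h = 0, lty = 2)
par(op)
```

Each condition maps to one panel: curvature or fanning in the first panel points to a linearity or variability problem, while trends in the last panel suggest dependence between observations.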