The scatterplot suggests that there is a positive relationship between calories and carbohydrates, i.e., higher calorie counts are associated with higher amounts of carbs. The relationship appears to be only weakly linear.
Explanatory variable: number of calories
Response variable: amount of carbohydrates (in grams)
We might want to predict the amount of carbs in a Starbucks food menu item for which we only know the number of calories. For instance, the question states that Starbucks only lists the number of calories on the menu items.
Explanatory variable: shoulder girth (in cm)
Response variable: height (in cm)
\(\bar{x} = 107.20\)
\(s_x = 10.37\)
\(\bar{y} = 171.14\)
\(s_y = 9.41\)
\(R = 0.67\)
\(b_1 = \frac{s_y}{s_x} \cdot R = \frac{9.41}{10.37} \cdot 0.67 = 0.608\)
Point-slope: \(y - \bar{y} = b_1 (x - \bar{x})\)
\(\implies y = (\bar{y} - b_1 \bar{x}) + b_1 x\)
\(\implies y = 106.0 + 0.608 \cdot x\)
Or in words:
\[\text{height (in cm)} = 106.0 + 0.608 \cdot \text{shoulder girth (in cm)}\]
m_x <- 107.20
s_x <- 10.37
m_y <- 171.14
s_y <- 9.41
R <- 0.67
(b1 <- s_y / s_x * R)
## [1] 0.6079749
(b0 <- m_y - b1 * m_x)
## [1] 105.9651
Slope: \(b_1 = 0.608\)
This indicates that each additional cm of shoulder girth is associated with an additional 0.608 cm of height, as predicted by the linear model.
Intercept: \(b_0 = 106.0\)
This indicates that if the linear model is correct, then a shoulder girth of 0 cm is associated with a height of 106.0 cm, assuming that the model is applicable to this range of values. However, a shoulder girth of 0 cm is not meaningful for the range of observations in the dataset, so the intercept simply serves to move the regression line up (vertically).
\(R^2 = 0.67^2 = 0.449\)
The \(R^2\) of 0.449 indicates that \(\approx\) 45% of the variability in the response variable (height) can be explained by the variability in the explanatory variable (shoulder girth), in the context of the linear model.
R^2
## [1] 0.4489
\(x = 100\)
Then \(\hat{y} = 106.0 + 0.608 \cdot 100 = 166.8\), so the predicted height is 166.8 cm.
\(y = 160\), so the residual is \(e = y - \hat{y} = 160 - 166.8 = -6.8\). This means that the model over-predicts the height of this individual by 6.8 cm.
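The prediction and residual above can be reproduced with a short R sketch. The variable names (`y_hat`, `e`) are chosen here for illustration, and the coefficients are the rounded values from the fitted line.

```r
# Rounded coefficients of the fitted line: height = 106.0 + 0.608 * shoulder girth
b0 <- 106.0
b1 <- 0.608

x <- 100  # observed shoulder girth (cm)
y <- 160  # observed height (cm)

# Predicted height at x = 100
(y_hat <- b0 + b1 * x)

# Residual = observed - predicted; a negative value means the model over-predicts
(e <- y - y_hat)
```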
No, it would not be appropriate to apply the linear model in this case. Looking at the dataset on which the linear model was estimated, the range of shoulder girths is roughly 85-135 cm, while the range of heights is roughly 145-200 cm. A shoulder girth of 56 cm for the one-year-old is clearly outside the range of values in the dataset, so applying the linear model in this instance would be an example of extrapolation; there should be no expectation that the linear model would be applicable. In fact, if we applied the linear model to the one-year-old, the model would predict a height of 140 cm, which would imply an exceptionally tall baby!
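One way to make the extrapolation concern explicit is to flag predictions whose input falls outside the observed range. The sketch below is illustrative only: `predict_height` and `girth_range` are hypothetical names, and the 85-135 cm range is the rough range read off the scatterplot, not an exact value.

```r
# Rounded coefficients of the fitted line
b0 <- 106.0
b1 <- 0.608

# Approximate observed range of shoulder girths in cm (an assumption, read off the plot)
girth_range <- c(85, 135)

# Hypothetical helper: predict height and flag inputs outside the observed range
predict_height <- function(girth) {
  data.frame(
    girth = girth,
    predicted_height = b0 + b1 * girth,
    extrapolated = girth < girth_range[1] | girth > girth_range[2]
  )
}

# 100 cm lies inside the observed range; 56 cm (the one-year-old) does not
(pred <- predict_height(c(100, 56)))
```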
106 + 0.608 * 56
## [1] 140.048
Explanatory variable: body weight (in kg)
Response variable: heart weight (in g)
Intercept: \(b_0 = -0.357\)
This indicates that the model predicts a heart weight of -0.357 g for a body weight of 0 kg, assuming the model is valid at 0 kg. In the context of the observational data, a body weight of 0 kg is not meaningful, so the intercept simply serves to translate the regression line vertically (i.e., move it down in this instance).
Slope: \(b_1 = 4.034\)
This indicates that each additional kg of body weight is associated with an additional 4.034 g of heart weight, as predicted by the linear model.
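As a quick check of the slope interpretation, the difference between predictions at two body weights one kg apart should equal the slope. The helper name `heart_weight` and the body weights of 2 kg and 3 kg are hypothetical values chosen for illustration, not taken from the data.

```r
# Coefficients reported for the heart weight model
b0 <- -0.357
b1 <- 4.034

# Hypothetical helper: predicted heart weight (g) for a given body weight (kg)
heart_weight <- function(body_kg) b0 + b1 * body_kg

# Predictions one kg apart differ by exactly the slope
(diff_per_kg <- heart_weight(3) - heart_weight(2))
```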
\(R^2 = 64.66\)%, which indicates that roughly 65% of the variability in the response variable (heart weight) can be explained by variability in the explanatory variable (body weight), in the context of the linear model.
\(R = \sqrt{R^2} = \sqrt{0.6466} = 0.804\), so the correlation coefficient is approximately 0.80 (taking the positive root because the slope is positive).
sqrt(0.6466)
## [1] 0.8041144
\(\bar{x} = -0.0883\)
\(\bar{y} = 3.9983\)
The slope is \(b_1 = 0.133\).
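This can be shown in either of the following two ways: from the t-value and standard error of the slope, \(b_1 = t \cdot SE = 4.13 \cdot 0.0322\), or from the fact that the regression line passes through \((\bar{x}, \bar{y})\), so that \(b_1 = (b_0 - \bar{y}) / (-\bar{x})\).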
4.13 * 0.0322
## [1] 0.132986
m_x <- -0.0883
m_y <- 3.9983
b_0 <- 4.010
(b_1 <- (b_0 - m_y) / (- m_x))
## [1] 0.1325028
Based on the test statistic for the slope estimate, it appears that the slope of the relationship is positive. For instance, if we take the null hypothesis to be \(H_0: \beta_1 \le 0\) and the alternative hypothesis to be \(H_A: \beta_1 \gt 0\), where \(\beta_1\) is the population slope estimated by \(b_1\), then the t-value of 4.13 and \(df = 463 - 2 = 461\) correspond to a one-tailed p-value of 0.00002, which is \(\ll \alpha = 0.05\); therefore we reject \(H_0\) and conclude that the slope is indeed positive.
However, reviewing the scatterplot, it appears that the relationship between beauty scores and teaching evaluation scores is at best very weakly linear. Assuming this is true, then the linear model would not be appropriate for this dataset, in which case the slope estimate is not valid. It would be good to see the \(R^2\) to evaluate the model fit.
t <- 4.13
df <- 463 - 2
pt(t, df = df, lower.tail = FALSE)
## [1] 2.153774e-05
Linearity: only weakly satisfied. The scatterplot and the plot of residuals vs. beauty scores show only a slight linear relationship. It would be good to see the \(R^2\) value.
Normally distributed residuals: not satisfied. The distribution of the residuals appears to be left-skewed, so it may not be well approximated by the normal distribution. This is apparent in the histogram of residuals as well as the normal quantile-quantile plot.
Constant variance of residuals: not satisfied. The residuals do not exhibit constant variance, as their spread appears wider for negative than for positive beauty scores.
Independence of observations: unclear, but probably not satisfied. It appears from the plot of residuals vs. order of data collection that there may be a slight positive relationship between residuals and the order of data collection. In other words, evaluations that were collected later appear to have slightly higher residuals, and evaluations that were collected earlier appear to have greater variance in residuals.
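The four diagnostic plots referred to above could be produced along the following lines. This is only a sketch on simulated stand-in data, not the course-evaluation dataset: the variable names (`beauty`, `score`), the noise level, and the seed are all assumptions made for illustration, so the plots will not show the skew or non-constant variance described above.

```r
# Simulated stand-in data (NOT the real evaluations); names and sd are assumptions
set.seed(1)
n <- 463
beauty <- rnorm(n)
score <- 4.010 + 0.133 * beauty + rnorm(n, sd = 0.5)

# Fit the simple linear model and extract the residuals
fit <- lm(score ~ beauty)
res <- resid(fit)

op <- par(mfrow = c(2, 2))
# Linearity and constant variance: residuals vs. explanatory variable
plot(beauty, res, main = "Residuals vs. beauty")
abline(h = 0, lty = 2)
# Normality: histogram and normal quantile-quantile plot of residuals
hist(res, main = "Histogram of residuals")
qqnorm(res)
qqline(res)
# Independence: residuals vs. order of data collection (here, row order)
plot(seq_along(res), res, main = "Residuals vs. order")
abline(h = 0, lty = 2)
par(op)
```

With the real data, `score` and `beauty` would be replaced by the corresponding columns of the dataset, and the row order would need to reflect the actual order of data collection.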