Nutrition at Starbucks, Part I. (8.22, p. 326) The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.
##Answer (a): The scatterplot shows a positive, roughly linear relationship between the number of calories and the amount of carbohydrates (in grams): the points trend up and to the right and lie fairly close to a straight line.
##Answer (b): The amount of carbohydrates (in grams) is the response variable since we are interested in predicting it and the number of calories is the explanatory (or predictor) variable.
##Answer (c): Fitting a regression line lets us predict the response variable (carbohydrates) for a given value of the explanatory variable (calories) by plugging that value into the fitted equation: intercept + slope * calories.
##Answer (d): There are four conditions that must be met. 1) Linearity: the data should show a linear trend, which they do here; neither the scatterplot nor the residual plot shows any curvature. 2) Nearly normal residuals: the histogram of the residuals is unimodal and roughly symmetric, so this condition is met. 3) Constant variance: this condition is not satisfied; in the residual plot, the spread of the points increases as we move to the right of the graph. 4) Independent observations: the observations are not part of a time series, so we treat them as independent. Because the constant variance condition fails, the conditions for fitting a least squares line are not all met.
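##A minimal sketch of how these conditions could be checked in R. The Starbucks data are not included in this document, so the values below are simulated with a spread that grows with calories; the variable names and every number are assumptions made only for illustration.

```r
# Synthetic data only: these are NOT the Starbucks measurements.
set.seed(1)
calories <- runif(80, min = 80, max = 500)
# Noise whose spread grows with calories, to mimic a fan-shaped residual plot.
carbs <- 8 + 0.1 * calories + rnorm(80, sd = 0.03 * calories)

fit <- lm(carbs ~ calories)
res <- resid(fit)

# Residuals vs. fitted values: look for curvature and for changing spread.
plot(fitted(fit), res, xlab = "Fitted carbs (g)", ylab = "Residual")
abline(h = 0, lty = 2)

# Histogram of residuals: look for a unimodal, roughly symmetric shape.
hist(res, main = "Residuals", xlab = "Residual")

# Rough numeric check of constant variance: compare the residual SD
# in the lower and upper halves of the fitted values.
upper_half <- fitted(fit) > median(fitted(fit))
c(sd_low = sd(res[!upper_half]), sd_high = sd(res[upper_half]))
```

##A clearly larger residual SD in the upper half would point to the same non-constant variance described in condition 3.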
Body measurements, Part I. (8.13, p. 316) Researchers studying anthropometry collected body girth measurements and skeletal diameter measurements, as well as age, weight, height and gender for 507 physically active individuals. The scatterplot below shows the relationship between height and shoulder girth (over deltoid muscles), both measured in centimeters.
##Answer (a): Shoulder girth and height are moderately positively correlated: the scatterplot slopes up and to the right, and the points cluster around an upward-sloping straight line.
##Answer (b): A change in units will not affect the direction, form, or strength of the relationship; only the scale on the axis changes.
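##A quick way to see this, sketched with simulated data (the study's 507 measurements are not reproduced here, so every value below is an assumption for illustration): converting shoulder girth from centimeters to inches leaves the correlation unchanged.

```r
# Synthetic data only: NOT the measurements from the study.
set.seed(42)
girth_cm <- rnorm(100, mean = 107.20, sd = 10.37)
height_cm <- 106 + 0.6 * girth_cm + rnorm(100, sd = 7)

girth_in <- girth_cm / 2.54   # convert centimeters to inches

r_cm <- cor(girth_cm, height_cm)
r_in <- cor(girth_in, height_cm)

# Correlation is unchanged by rescaling a variable by a positive constant.
all.equal(r_cm, r_in)
```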
Body measurements, Part III. (8.24, p. 326) Exercise above introduces data on shoulder girth and height of a group of individuals. The mean shoulder girth is 107.20 cm with a standard deviation of 10.37 cm. The mean height is 171.14 cm with a standard deviation of 9.41 cm. The correlation between height and shoulder girth is 0.67.
##Answer (a): We start by calculating the slope, b1, which is the correlation between height and shoulder girth multiplied by the ratio of the standard deviation of height to the standard deviation of shoulder girth. Because the least squares line passes through the point of means, we then substitute the slope and the two means into the equation of the line and solve for the intercept, b0. With both coefficients, the equation is
##predicted height = 105.965 + (0.608 * shoulder girth)
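sy <- 9.41    # standard deviation of height (cm)
sx <- 10.37   # standard deviation of shoulder girth (cm)
r <- 0.67     # correlation between height and shoulder girth
y <- 171.14   # mean height (cm)
x <- 107.2    # mean shoulder girth (cm)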
b1 <- (sy / sx) * r; b1
## [1] 0.6079749
b0 <- y - (b1 * x); b0
## [1] 105.9651
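##The same calculation can be wrapped in a small function so it can be reused with other summary statistics. This is only a sketch; the function name ls_from_summary and its argument names are made up for illustration.

```r
# Hypothetical helper: least squares line from summary statistics.
ls_from_summary <- function(x_bar, y_bar, s_x, s_y, r) {
  slope <- r * s_y / s_x                 # b1 = r * (s_y / s_x)
  intercept <- y_bar - slope * x_bar     # line passes through (x_bar, y_bar)
  c(intercept = intercept, slope = slope)
}

# Summary statistics given in the exercise (x = shoulder girth, y = height).
coefs <- ls_from_summary(x_bar = 107.20, y_bar = 171.14,
                         s_x = 10.37, s_y = 9.41, r = 0.67)
round(coefs, 3)
```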
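##Answer (b): The intercept b0 is the predicted height, 105.97 cm, for a shoulder girth of 0 cm; since a shoulder girth of 0 cm is not possible, the intercept only sets the height of the line and has no practical meaning. The slope b1 gives the expected change in height for a one-unit increase in x, i.e. for each additional centimeter of shoulder girth, height is expected to increase by 0.608 cm on average.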
##Answer (c): R-squared is the square of the correlation coefficient and describes the proportion of the variation in the response variable that is explained by the least squares line, i.e. shoulder girth explains 44.89% of the variation in height.
Rsquared <- 0.67 * 0.67; Rsquared
## [1] 0.4489
##Answer (d): Using the coefficients calculated earlier and a shoulder girth of 100 cm, we predict the height to be 166.76 cm.
Height <- 105.9651 + (0.6079749 * 100); Height
## [1] 166.7626
##Answer (e): The residual is the difference between the actual and the predicted value of height, which in this case is 160 - 166.7626 = -6.7626. Since the actual value is lower than the predicted value, the residual is negative, which means that the model overestimated the height by 6.7626 cm.
residual <- 160 - 166.7626; residual
## [1] -6.7626
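##Parts (d) and (e) can be combined into one small sketch that predicts height from shoulder girth and then computes the residual. The function name predict_height is made up for illustration; the coefficients, the girth of 100 cm, and the observed height of 160 cm are the values used above.

```r
b0 <- 105.9651    # intercept from part (a)
b1 <- 0.6079749   # slope from part (a)

# Hypothetical helper: predicted height (cm) for a given shoulder girth (cm).
predict_height <- function(girth_cm) b0 + b1 * girth_cm

predicted <- predict_height(100)
observed <- 160

# Residual = observed - predicted; a negative value means the
# model overestimated the observed height.
residual <- observed - predicted
c(predicted = predicted, residual = residual)
```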
##Answer (f): A shoulder girth of 56 cm is well outside the range of values used to fit this model. Using the model here would be extrapolation, so the predicted height of the child could be badly wrong, and the linear model should not be used for this prediction.
Cats, Part I. (8.26, p. 327) The following regression output is for predicting the heart weight (in g) of cats from their body weight (in kg). The coefficients are estimated using a dataset of 144 domestic cats.
##Answer (a): predicted heart weight (in g) = -0.357 + 4.034 * body weight (in kg).
##Answer (b): When the body weight of the cat is 0 kg, the predicted heart weight is -0.357 g. This is not a meaningful value: a cat cannot weigh 0 kg, and a heart weight cannot be negative. The intercept only serves to set the height of the line.
##Answer (c): For every additional kilogram of body weight, the heart weight of the cat is expected to increase by 4.034 grams on average.
##Answer (d): 64.66% of the variation in the heart weight is explained by the least squares regression line.
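##Answer (e): The correlation coefficient is the square root of R-squared, taken with the sign of the slope. Since the slope is positive, the correlation coefficient is 0.8041144.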
corr <- sqrt(0.6466); corr
## [1] 0.8041144
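##The square root alone is always non-negative, so the sign has to come from the slope. A minimal sketch that makes this explicit, using the R-squared and slope values from the regression output (the variable names are illustrative):

```r
r_squared <- 0.6466   # R-squared from the regression output
slope <- 4.034        # estimated slope for body weight

# The correlation has the same sign as the slope.
r <- sign(slope) * sqrt(r_squared)
r
```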
Rate my professor. (8.44, p. 340) Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching related characteristics, such as the physical appearance of the instructor. Researchers at University of Texas, Austin collected data on teaching evaluation score (higher score means better) and standardized beauty score (a score of 0 means average, negative score means below average, and a positive score means above average) for a sample of 463 professors. The scatterplot below shows the relationship between these variables, and also provided is a regression output for predicting teaching evaluation score from beauty score.
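##Answer (a): The least squares line passes through the point of means, so we can substitute the mean evaluation score (3.9983), the intercept (4.010) and the mean beauty score (-0.0883) into the linear model: 3.9983 = 4.010 + b1 * (-0.0883). Rearranging, we get the slope b1 = 0.1325.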
b1 <- (3.9983 - 4.010) / -0.0883; b1
## [1] 0.1325028
##Answer (b): Let’s perform a hypothesis test with the null hypothesis that the slope is 0 and the alternative one-tailed hypothesis that the slope is > 0.
##H0: b1 = 0
##Ha: b1 > 0
##The test statistic is 4.115. With 463 - 2 = 461 degrees of freedom, the one-sided p-value for this t-statistic is less than 0.005. Since the p-value is less than the significance level of 0.05, we reject the null hypothesis that the slope is 0 in favor of the alternative hypothesis. Hence, the data provide convincing evidence that the slope of the relationship between teaching evaluation and beauty is positive.
tstat <- 0.1325028/0.0322; tstat
## [1] 4.114994
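##The p-value can also be computed directly from the t-distribution instead of being read from a table. A minimal sketch, assuming the one-sided alternative stated above and n - 2 degrees of freedom for simple linear regression (the variable names are illustrative):

```r
b1 <- 0.1325028     # slope from part (a)
se_b1 <- 0.0322     # standard error of the slope used above
n <- 463            # number of professors in the sample
deg_free <- n - 2   # degrees of freedom for simple linear regression

t_stat <- b1 / se_b1

# One-sided p-value: P(T > t_stat) under H0: b1 = 0.
p_one_sided <- pt(t_stat, df = deg_free, lower.tail = FALSE)
p_one_sided < 0.05
```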
##Answer (c): There are four conditions that must be met. 1) Linearity: the data should show a linear trend, which they do here; the residual plot doesn't show any curvature and the points are randomly scattered about zero. 2) Nearly normal residuals: the histogram of the residuals is unimodal and approximately symmetric. 3) Constant variance: this condition is satisfied; in the residual plot, the spread of the points stays roughly the same across the range of beauty scores. 4) Independent observations: we assume the professors were sampled independently of one another, so this condition is reasonable.