Chapter 8 - Introduction to Linear Regression

Nutrition at Starbucks, Part I. (8.22, p. 326) The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.

Describe the relationship between number of calories and amount of carbohydrates (in grams) that Starbucks food menu items contain.

##Answer (a): Based on the scatterplot, the number of calories and the amount of carbohydrates (in grams) have a positive, roughly linear relationship: the points trend up and to the right and lie reasonably close to a straight line, though the spread in carbohydrates widens at higher calorie counts.

In this scenario, what are the explanatory and response variables?

##Answer (b): The amount of carbohydrates (in grams) is the response variable since we are interested in predicting it and the number of calories is the explanatory (or predictor) variable.

Why might we want to fit a regression line to these data?

##Answer (c): Fitting a regression line lets us predict the response variable (carbohydrates) from a given value of the explanatory variable (calories) by plugging that value into the fitted equation with its estimated intercept and slope.

Do these data meet the conditions required for fitting a least squares line?

##Answer (d): Four conditions must be met. 1). Linearity - The data should show a linear trend, which they do here: neither the scatterplot nor the residual plot shows any curvature. 2). Nearly normal residuals - The histogram of the residuals is unimodal and symmetric, so this condition is reasonably met. 3). Constant variance - This condition is not satisfied: in the residual plot, the spread of the points increases as we move to the right of the graph. 4). Independent observations - This appears reasonable, since the observations are not part of any time series data. Because the constant variance condition fails, the data do not fully meet the conditions for fitting a least squares line.

Body measurements, Part I. (8.13, p. 316) Researchers studying anthropometry collected body girth measurements and skeletal diameter measurements, as well as age, weight, height and gender for 507 physically active individuals. The scatterplot below shows the relationship between height and shoulder girth (over deltoid muscles), both measured in centimeters.

Describe the relationship between shoulder girth and height.

##Answer (a): Shoulder girth and height are moderately positively correlated: the scatterplot slopes up and to the right, and the points cluster around an upward-sloping straight line.

How would the relationship change if shoulder girth was measured in inches while the units of height remained in centimeters?

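##Answer (b): A change in units will not affect the direction, form, or strength of the relationship, because the correlation is unit-free. Only the numerical value of the slope would change, since it would then be expressed in centimeters of height per inch of shoulder girth.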
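
A quick simulation can make this concrete. The sketch below uses made-up data, not the study's measurements; the sample size, means, and noise level are arbitrary assumptions chosen only for illustration. It converts the explanatory variable from centimeters to inches and compares the correlation and the slope before and after.

```r
# Simulated illustration only: these are NOT the anthropometry data.
set.seed(1)
girth_cm  <- rnorm(200, mean = 107, sd = 10)            # hypothetical shoulder girths (cm)
height_cm <- 106 + 0.6 * girth_cm + rnorm(200, sd = 7)  # hypothetical heights (cm)

girth_in <- girth_cm / 2.54   # the same girths expressed in inches

r_cm <- cor(girth_cm, height_cm)
r_in <- cor(girth_in, height_cm)

slope_cm <- unname(coef(lm(height_cm ~ girth_cm))[2])   # cm of height per cm of girth
slope_in <- unname(coef(lm(height_cm ~ girth_in))[2])   # cm of height per inch of girth

all.equal(r_cm, r_in)                 # the correlation does not depend on units
all.equal(slope_in, 2.54 * slope_cm)  # the slope is rescaled by the conversion factor
```

Both comparisons should hold up to floating-point tolerance: rescaling a variable leaves the correlation alone and only rescales the slope.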

Body measurements, Part III. (8.24, p. 326) Exercise 8.13 introduces data on shoulder girth and height of a group of individuals. The mean shoulder girth is 107.20 cm with a standard deviation of 10.37 cm. The mean height is 171.14 cm with a standard deviation of 9.41 cm. The correlation between height and shoulder girth is 0.67.

Write the equation of the regression line for predicting height.

##Answer (a): We start with the slope, b1, which is the correlation between height and shoulder girth multiplied by the ratio of the standard deviation of height to the standard deviation of shoulder girth: b1 = r * (sy / sx). Because the least squares line passes through the point of means, the intercept is b0 = mean height - (b1 * mean shoulder girth). Substituting the intercept and the slope, we get the equation

##Predicted height = 105.965 + (0.608 * shoulder girth)

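sy <- 9.41       # standard deviation of height (cm)
sx <- 10.37      # standard deviation of shoulder girth (cm)
r <- 0.67        # correlation between height and shoulder girth
y_bar <- 171.14  # mean height (cm)
x_bar <- 107.2   # mean shoulder girth (cm)

# Slope: correlation times the ratio of standard deviations
b1 <- (sy / sx) * r; b1

## [1] 0.6079749

# Intercept: the line passes through (x_bar, y_bar)
b0 <- y_bar - (b1 * x_bar); b0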

## [1] 105.9651
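
The same calculation can be wrapped in a small reusable function. This is only a sketch: `ls_line` and its argument names are hypothetical helpers introduced here for illustration, not something defined in the textbook or in base R.

```r
# Hypothetical helper: least squares line computed from summary statistics.
ls_line <- function(x_bar, s_x, y_bar, s_y, r) {
  b1 <- r * s_y / s_x        # slope
  b0 <- y_bar - b1 * x_bar   # intercept: the line passes through (x_bar, y_bar)
  c(intercept = b0, slope = b1)
}

# Summary statistics given in the exercise (shoulder girth = x, height = y)
fit <- ls_line(x_bar = 107.20, s_x = 10.37, y_bar = 171.14, s_y = 9.41, r = 0.67)
round(fit, 3)
```

Keeping the inputs named makes it harder to swap the two standard deviations by accident, which would invert the ratio in the slope.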

Interpret the slope and the intercept in this context.

##Answer (b): The intercept, b0 = 105.965 cm, is the predicted height for a person whose shoulder girth is 0 cm. That value is not meaningful on its own, since no one has a shoulder girth of 0; it only sets the vertical position of the line. The slope says that for each additional centimeter of shoulder girth, the model predicts height to be 0.608 cm greater, on average.

Calculate \(R^2\) of the regression line for predicting height from shoulder girth, and interpret it in the context of the application.

##Answer (c): R-squared is the square of the correlation coefficient and gives the proportion of variation in the response variable that is explained by the least squares line. Here, shoulder girth explains 44.89% of the variation in height.

Rsquared <- 0.67 * 0.67; Rsquared

## [1] 0.4489

A randomly selected student from your class has a shoulder girth of 100 cm. Predict the height of this student using the model.

##Answer (d): Plugging a shoulder girth of 100 cm into the fitted equation with the coefficients calculated earlier, we predict a height of 166.76 cm.

Height <- 105.9651 + (0.6079749 * 100); Height

## [1] 166.7626

The student from part (d) is 160 cm tall. Calculate the residual, and explain what this residual means.

##Answer (e): The residual is the difference between the observed and the predicted value of height, which in this case is 160 - 166.7626 = -6.7626. Since the observed value is lower than the predicted value, the model overestimated the height of this student by 6.7626 cm.

residual <- 160 - 166.7626; residual

## [1] -6.7626
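
Parts (d) and (e) can also be chained so that the residual is computed from the prediction rather than from a retyped number. The sketch below assumes the rounded coefficients reported above; `predict_height` is a hypothetical helper name used only for illustration.

```r
# Coefficients computed earlier in this exercise
b0 <- 105.9651
b1 <- 0.6079749

# Hypothetical helper: predicted height (cm) for a given shoulder girth (cm)
predict_height <- function(girth_cm) b0 + b1 * girth_cm

observed  <- 160                    # the student's actual height (cm)
predicted <- predict_height(100)    # the model's prediction for a 100 cm girth
residual  <- observed - predicted   # negative means the model overestimated

c(predicted = predicted, residual = residual)
```

The sign convention is observed minus predicted, so a negative residual corresponds to a point that falls below the regression line.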

A one year old has a shoulder girth of 56 cm. Would it be appropriate to use this linear model to predict the height of this child?

##Answer (f): No. A shoulder girth of 56 cm is well outside the range of values used to estimate this model, so using the model here would be extrapolation. The linear relationship may not hold that far from the observed data, and we might arrive at an incorrect prediction for the height of the child.

Cats, Part I. (8.26, p. 327) The following regression output is for predicting the heart weight (in g) of cats from their body weight (in kg). The coefficients are estimated using a dataset of 144 domestic cats.

Write out the linear model.

##Answer (a): Predicted heart weight (in g) = -0.357 + 4.034 * body weight (in kg).

Interpret the intercept.

##Answer (b): The intercept says that a cat with a body weight of 0 kg is predicted to have a heart weight of -0.357 g. This is not a meaningful value: a cat cannot weigh 0 kg, and a heart weight cannot be negative. The intercept here serves only to adjust the height of the line.

Interpret the slope.

##Answer (c): For each additional kilogram of body weight, the heart weight of a cat is expected to be 4.034 grams higher, on average.

Interpret \(R^2\).

##Answer (d): 64.66% of the variation in the heart weight is explained by the least squares regression line.

Calculate the correlation coefficient.

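##Answer (e): The correlation coefficient is the square root of R-squared, taken with the sign of the slope. The slope is positive, so the correlation coefficient is 0.8041144.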

corr <- sqrt(0.6466); corr

## [1] 0.8041144
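
Note that `sqrt()` only returns the magnitude of the correlation; its direction has to come from the slope. A minimal sketch of that rule follows, where `r_from_r2` is a hypothetical helper name introduced for illustration.

```r
# Hypothetical helper: recover r from R-squared, using the sign of the slope
r_from_r2 <- function(r_squared, slope) sign(slope) * sqrt(r_squared)

# Values from the regression output: R-squared = 64.66%, slope = 4.034
r_cats <- r_from_r2(r_squared = 0.6466, slope = 4.034)
r_cats
```

Had the slope been negative, the same R-squared would correspond to a negative correlation of the same magnitude.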

Rate my professor. (8.44, p. 340) Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching related characteristics, such as the physical appearance of the instructor. Researchers at University of Texas, Austin collected data on teaching evaluation score (higher score means better) and standardized beauty score (a score of 0 means average, negative score means below average, and a positive score means above average) for a sample of 463 professors. The scatterplot below shows the relationship between these variables, and also provided is a regression output for predicting teaching evaluation score from beauty score.

Given that the average standardized beauty score is -0.0883 and average teaching evaluation score is 3.9983, calculate the slope. Alternatively, the slope may be computed using just the information provided in the model summary table.

##Answer (a): The least squares line passes through the point of means, so 3.9983 = 4.010 + b1 * (-0.0883). Rearranging, we get b1 = 0.1325, which is the slope.

b1 <- (3.9983 - 4.010) / -0.0883; b1

## [1] 0.1325028

Do these data provide convincing evidence that the slope of the relationship between teaching evaluation and beauty is positive? Explain your reasoning.

##Answer (b): Let’s perform a hypothesis test with the null hypothesis that the true slope is 0 and the one-sided alternative hypothesis that the true slope is greater than 0.

##H0: β1 = 0

##Ha: β1 > 0

##The test statistic is 4.115. With 461 degrees of freedom (n - 2 = 463 - 2) and this t-statistic, the p-value is < 0.005. Since the p-value is less than the assumed significance level of 0.05, we reject the null hypothesis that the slope is 0 in favor of the alternative hypothesis. Hence, there is convincing evidence that the slope of the relationship between teaching evaluation and beauty is positive.

tstat <- 0.1325028/0.0322; tstat

## [1] 4.114994
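
The p-value itself can be obtained from the t distribution with base R's `pt()`. The sketch below assumes the usual n - 2 degrees of freedom for simple linear regression and a one-sided alternative (slope greater than 0), so it takes the upper tail.

```r
t_stat <- 0.1325028 / 0.0322   # slope estimate divided by its standard error
df     <- 463 - 2              # n - 2 for simple linear regression

# One-sided p-value for Ha: slope > 0 (upper tail of the t distribution)
p_value <- pt(t_stat, df = df, lower.tail = FALSE)

p_value < 0.005   # compare against the threshold quoted in the answer
```

For a two-sided alternative the tail probability would be doubled, which is what the p-value column of a standard regression summary reports.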

List the conditions required for linear regression and check if each one is satisfied for this model based on the following diagnostic plots.

##Answer (c): Four conditions must be met. 1). Linearity - The data should show a linear trend, which they do here: the residual plot doesn’t show any curvature, and the points are randomly scattered about zero. 2). Nearly normal residuals - The histogram of the residuals is unimodal and approximately symmetric. 3). Constant variance - This condition is satisfied: in the residual plot, the spread of the points stays roughly the same across the range. 4). Independent observations - We assume this holds, provided the professors were sampled independently of one another.
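
For reference, diagnostic plots of this kind can be produced with base R. The sketch below is an illustration on simulated data only: the evaluation data are not reproduced here, so `beauty`, `score`, and every parameter used to generate them are assumptions.

```r
# Simulated illustration only: NOT the evaluation data from the study.
set.seed(42)
n      <- 463
beauty <- rnorm(n, mean = 0, sd = 0.8)               # hypothetical standardized scores
score  <- 4.0 + 0.13 * beauty + rnorm(n, sd = 0.5)   # hypothetical evaluation scores

fit <- lm(score ~ beauty)
res <- resid(fit)

# Linearity and constant variance: look for curvature or a fan shape
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Nearly normal residuals: look for strong skew or extreme outliers
hist(res, main = "Residuals", xlab = "Residual")
qqnorm(res); qqline(res)

# Independence: look for patterns in the order of data collection
plot(res, type = "b", xlab = "Order of collection", ylab = "Residual")
```

Each plot targets one condition, so they are best read separately rather than combined into a single overall judgment.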