Chapter 7 - Introduction to Linear Regression

Graded: 7.24, 7.26, 7.30, 7.40

7.24 Nutrition at Starbucks, Part I. The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.

  1. Describe the relationship between number of calories and amount of carbohydrates (in grams) that Starbucks food menu items contain.

The relationship between number of calories and amount of carbohydrates is positive and roughly linear: as the number of calories increases, the amount of carbs tends to increase as well.

  1. In this scenario, what are the explanatory and response variables?

The explanatory variable is the number of calories, and the response variable is the amount of carbs.

  1. Why might we want to fit a regression line to these data?

We might want to fit a regression line to these data to check whether there is a linear association between these 2 variables, and to predict the amount of carbs from the number of calories that Starbucks displays for any given food item. This could help with planning for daily nutritional needs.

  1. Do these data meet the conditions required for fitting a least squares line?

The conditions required for fitting a least squares line are:

  1. Linear relationship between explanatory and response variables. Based on the first scatter plot, this does seem to be the case.

  2. Residuals are normally distributed or nearly normal: Based on the third graph (histogram of residuals), it looks like the residuals are not normal, but a little left skewed.

  3. Constant variability in residuals: Based on the second graph (scatter plot of residuals), this does not seem to be the case - it looks like the variance of residuals increases as the independent variable (number of calories) increases.

  4. Independent observations: This seems like a plausible assumption in this case, i.e. there is no reason to assume that the calorie or carb content of one Starbucks food item depends on that of another.

Based on the above, it can be stated that not all conditions for fitting a least squares line are met in this dataset.
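The three diagnostic views referred to above (scatterplot, residual plot, histogram of residuals) can be produced along the following lines. This is only a sketch on simulated stand-in data, since the Starbucks data set itself is not loaded in this document; the variable names `calories` and `carbs` and the simulation settings are assumptions chosen to mimic the pattern described.

```r
# Simulated stand-in data (NOT the Starbucks data): carbs rise with calories,
# and the scatter grows with calories, mimicking the pattern described above.
set.seed(1)
calories <- runif(80, min = 80, max = 500)
carbs    <- 8 + 0.1 * calories + rnorm(80, sd = 0.03 * calories)

fit <- lm(carbs ~ calories)

# Three diagnostic views, side by side
par(mfrow = c(1, 3))
plot(calories, carbs, main = "Scatterplot")         # check linearity
abline(fit)
plot(calories, resid(fit), main = "Residuals")      # check constant variability
abline(h = 0, lty = 2)
hist(resid(fit), main = "Histogram of residuals")   # check near-normal residuals
```

A fan shape in the middle panel is the visual signature of non-constant variability.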

7.26 Body measurements, Part III. Exercise 7.15 introduces data on shoulder girth and height of a group of individuals. The mean shoulder girth is 107.20 cm with a standard deviation of 10.37 cm. The mean height is 171.14 cm with a standard deviation of 9.41 cm. The correlation between height and shoulder girth is 0.67.

  1. Write the equation of the regression line for predicting height.
fivenum(bdims$sho.gi)
## [1]  85.90  99.45 108.20 116.55 134.80
fivenum(bdims$hgt)
## [1] 147.2 163.8 170.3 177.8 198.1
avg.girth<-107.2
sd.girth<-10.37
avg.height<-171.14
sd.height<-9.41
corr.girth.ht<-0.67

(b1<-((sd.height)/(sd.girth)) * corr.girth.ht)
## [1] 0.6079749
(b0<-avg.height-b1*avg.girth)
## [1] 105.9651

height = 105.9651 + 0.6079749 x girth
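The same calculation can be wrapped in a small helper that builds the least squares line from summary statistics alone. The function name `line_from_summary` is hypothetical and introduced here only for illustration; the inputs are the values given in the exercise.

```r
# Least squares line from summary statistics:
#   slope     b1 = r * (s_y / s_x)
#   intercept b0 = ybar - b1 * xbar
line_from_summary <- function(xbar, sx, ybar, sy, r) {
  b1 <- r * sy / sx
  b0 <- ybar - b1 * xbar
  c(intercept = b0, slope = b1)
}

coefs <- line_from_summary(xbar = 107.20, sx = 10.37,
                           ybar = 171.14, sy = 9.41, r = 0.67)
round(coefs, 4)
```

The rounded values should agree with the `b0` and `b1` printed by the R output above.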

  1. Interpret the slope and the intercept in this context.

The slope means that for each additional 1 cm of shoulder girth, height is expected to be higher by about 0.608 cm on average.

The intercept of 105.97 cm is the predicted height when shoulder girth = 0. Since a shoulder girth of 0 is not possible, the intercept cannot be taken literally in this context; it only sets the vertical position of the line. The minimum shoulder girth in this data set is about 85.9 cm, so a prediction at 0 cm would be an extrapolation far outside the range of this dataset.

  1. Calculate R2 of the regression line for predicting height from shoulder girth, and interpret it in the context of the application.
(r2<-corr.girth.ht^2)
## [1] 0.4489

The R2 for this regression model is 44.89%, which means that about 44.89% of the variability in height is explained by shoulder girth.
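Squaring the correlation works here because, for a regression with a single explanatory variable, R2 equals the squared correlation between the two variables. A quick way to check that identity is sketched below on R's built-in `cars` data, which is used purely as a stand-in and has nothing to do with the body measurements.

```r
# Built-in 'cars' data, used only as a stand-in to check the identity
fit <- lm(dist ~ speed, data = cars)

r           <- cor(cars$speed, cars$dist)
r2_from_cor <- r^2                       # squared correlation
r2_from_fit <- summary(fit)$r.squared    # R2 reported by the fitted model

all.equal(r2_from_cor, r2_from_fit)
```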

  1. A randomly selected student from your class has a shoulder girth of 100 cm. Predict the height of this student using the model.
(ht=b0+b1*100)
## [1] 166.7626

The predicted height of this student is about 166.76 cm.

  1. The student from part (d) is 160 cm tall. Calculate the residual, and explain what this residual means.

\[ e = y - \widehat{y} \]

ht.predicted<-ht
ht.actual<-160

(e<-ht.actual-ht.predicted)
## [1] -6.762581

The residual of -6.76 cm is negative, which means the model overestimates this student's height: the predicted height is 6.76 cm more than the actual height.

  1. A one year old has a shoulder girth of 56 cm. Would it be appropriate to use this linear model to predict the height of this child?

A shoulder girth of 56 cm is much lower than the minimum shoulder girth of 85.90 cm in this data set, so using this regression model here would involve extrapolation. That is not appropriate, because there is no evidence that the same linear relationship holds outside the range of this dataset.

(ht=b0+b1*56)
## [1] 140.0117

The predicted height of 140 cm is much higher than the typical height of a one-year-old. This further reinforces the point that the above model is likely based on adults and may not hold for children.
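The prediction, residual, and extrapolation steps above can be collected into one sketch that warns whenever a shoulder girth falls outside the observed range. The function name `predict_height` is hypothetical; the coefficients and the range (85.90 to 134.80 cm) are copied from the R output earlier in this exercise.

```r
# Coefficients and observed girth range taken from the output above
b0 <- 105.9651
b1 <- 0.6079749
girth_range <- c(85.90, 134.80)

# Predict height (cm) from shoulder girth (cm), warning on extrapolation
predict_height <- function(girth) {
  outside <- girth < girth_range[1] | girth > girth_range[2]
  if (any(outside)) {
    warning("shoulder girth outside observed range: prediction is an extrapolation")
  }
  b0 + b1 * girth
}

pred_100  <- predict_height(100)                   # inside the observed range
resid_100 <- 160 - pred_100                        # residual = observed - predicted
pred_56   <- suppressWarnings(predict_height(56))  # extrapolation, warning silenced
```

Calling `predict_height(56)` without `suppressWarnings()` raises the warning, which is the point of the guard.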

7.30 Cats, Part I. The following regression output is for predicting the heart weight (in g) of cats from their body weight (in kg). The coefficients are estimated using a dataset of 144 domestic cats.

              Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     -0.357        0.692    -0.515      0.607
body wt          4.034        0.250    16.119      0.000

s = 1.452    R2 = 64.66%    R2adj = 64.41%

str(cats)
## 'data.frame':    144 obs. of  3 variables:
##  $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Bwt: num  2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
##  $ Hwt: num  7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...
  1. Write out the linear model. heart.wt = -0.357 + 4.034 x body.wt

  2. Interpret the intercept. The intercept of -0.357 g is the predicted heart weight when the body weight is 0 kg. Since the body weight cannot possibly be 0, and a heart weight cannot be negative, the intercept is not to be taken literally. It only sets the vertical position of the line of best fit.

  3. Interpret the slope. The slope of 4.034 means that for each additional kg of body weight, a cat's heart weight is expected to be higher by about 4.034 g on average.

  4. Interpret R2. The R2 of 64.66% denotes the extent of variability in heart weight explained by the body weight, which is the only explanatory variable in this model.

  5. Calculate the correlation coefficient.

(corr=sqrt(0.6466))
## [1] 0.8041144

The correlation coefficient is 0.80. The positive root is used because the slope is positive. This indicates a strong, positive relationship between the cat’s body weight and heart weight.
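Taking the square root of R2 gives only the magnitude of the correlation; the sign has to come from the slope. The sketch below refits the model and checks this, assuming that the `cats` data frame shown in the `str()` output is the one shipped with the MASS package (its column names `Bwt` and `Hwt` match).

```r
# Assumption: 'cats' is the data set from the MASS package
data(cats, package = "MASS")

fit <- lm(Hwt ~ Bwt, data = cats)
b1  <- coef(fit)[["Bwt"]]
r2  <- summary(fit)$r.squared

# sqrt() gives only the magnitude; the slope supplies the sign
r_from_r2 <- sign(b1) * sqrt(r2)
r_direct  <- cor(cats$Bwt, cats$Hwt)

all.equal(r_from_r2, r_direct)
```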

7.40 Rate my professor. Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching related characteristics, such as the physical appearance of the instructor.

Researchers at University of Texas, Austin collected data on teaching evaluation score (higher score means better) and standardized beauty score (a score of 0 means average, negative score means below average, and a positive score means above average) for a sample of 463 professors. The scatterplot below shows the relationship between these variables, and also provided is a regression output for predicting teaching evaluation score from beauty score.

  1. Given that the average standardized beauty score is -0.0883 and average teaching evaluation score is 3.9983, calculate the slope. Alternatively, the slope may be computed using just the information provided in the model summary table.
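(b1 <- (3.9983 - 4.010) / (-0.0883))
## [1] 0.1325028

The slope is about 0.1325. The calculation uses the fact that the least squares line passes through the point of means, so slope = (mean evaluation - intercept) / mean beauty, with 4.010 taken as the intercept from the regression output.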
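The slope calculation relies on the least squares line passing through the point of means, which gives b1 = (ybar - b0) / xbar. A minimal sketch follows; the helper name `slope_from_means` is hypothetical, and 4.010 is assumed to be the intercept from the regression output, as used in the calculation above.

```r
# The least squares line passes through (xbar, ybar):
#   ybar = b0 + b1 * xbar   =>   b1 = (ybar - b0) / xbar
slope_from_means <- function(xbar, ybar, b0) (ybar - b0) / xbar

b1 <- slope_from_means(xbar = -0.0883, ybar = 3.9983, b0 = 4.010)
b1
```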

  1. Do these data provide convincing evidence that the slope of the relationship between teaching evaluation and beauty is positive? Explain your reasoning.

Yes, the data provide convincing evidence that the slope of the relationship between teaching evaluation and beauty is positive. The regression output reports a large t-value and a very small p-value for the slope, so we reject the null hypothesis that the slope is 0.
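The t-value and p-value for a slope come from dividing the estimate by its standard error and comparing the result with a t distribution on n - 2 degrees of freedom. The sketch below reproduces that by hand on R's built-in `cars` data, used only as a stand-in because the professor ratings and their regression table are not reproduced in this document.

```r
fit <- lm(dist ~ speed, data = cars)   # stand-in data, not the professor ratings
tab <- summary(fit)$coefficients

est <- tab["speed", "Estimate"]
se  <- tab["speed", "Std. Error"]
n   <- nrow(cars)

t_stat  <- est / se                                    # statistic for H0: slope = 0
p_two   <- 2 * pt(-abs(t_stat), df = n - 2)            # two-sided, as in summary()
p_upper <- pt(t_stat, df = n - 2, lower.tail = FALSE)  # one-sided, HA: slope > 0
```

For a positive estimated slope, the one-sided p-value is half of the two-sided one printed in the regression table.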

  1. List the conditions required for linear regression and check if each one is satisfied for this model based on the following diagnostic plots.

The conditions required for linear regression are:

  1. Linear relationship between explanatory and response variables. The first scatter plot indicates a mild, positive relationship between the 2 variables, with a fair degree of scatter, so the linear association is weak.

  2. Residuals are normally distributed or nearly normal: The histogram of residuals indicates a little skewness.

  3. Constant variability in residuals: Based on the scatter plot of residuals against beauty score, the residuals are spread fairly randomly around 0. It does look like the variability of the residuals decreases somewhat as the score moves from less beautiful to more beautiful.

  4. Independent observations: This seems like a plausible assumption in this case, i.e. there is no reason to assume that the evaluation scores depend on each other. This is also supported by the absence of a strong trend in the residuals plotted by order of data collection.