Source files: https://github.com/djlofland/DATA606_F2019/tree/master/Homework8

Nutrition at Starbucks, Part I. (8.22, p. 326) The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.

Exercise 1

  1. Describe the relationship between number of calories and amount of carbohydrates (in grams) that Starbucks food menu items contain.

There is a positive, roughly linear relationship with moderate correlation. The spread of the residuals increases as calories increase (which could be problematic), and the residuals appear approximately normal.

Exercise 2

  1. In this scenario, what are the explanatory and response variables?

Explanatory: calories. Response: carbohydrates (in grams).

Exercise 3

  1. Why might we want to fit a regression line to these data?

To understand the relationship and to predict the carbohydrate content of other menu items from their calorie counts.

Exercise 4

  1. Do these data meet the conditions required for fitting a least squares line?

Mostly. The only concern is that the variability of the residuals increases with calorie count (non-constant variance).


Body measurements, Part I. (8.13, p. 316) Researchers studying anthropometry collected body girth measurements and skeletal diameter measurements, as well as age, weight, height and gender for 507 physically active individuals. The scatterplot below shows the relationship between height and shoulder girth (over deltoid muscles), both measured in centimeters.

Exercise 5

  1. Describe the relationship between shoulder girth and height.

There is a moderately strong positive linear relationship, but the points may fall into two segments (i.e., two lines); gender is possibly a segmenting factor.

Exercise 6

  1. How would the relationship change if shoulder girth was measured in inches while the units of height remained in centimeters?

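Changing the units doesn’t change the relationship. The slope would change (it is multiplied by 2.54, since 1 inch = 2.54 cm), but the direction, form, and strength of the relationship between the variables, including the correlation, stay the same.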
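The effect of a unit change can be checked numerically. The sketch below uses simulated (hypothetical) girth and height values, not the study data, and converts girth from centimeters to inches; under these assumptions the correlation is unchanged while the slope is multiplied by 2.54.

```r
# Simulated data for illustration only; these are NOT the study measurements.
set.seed(1)
girth_cm  <- rnorm(200, mean = 107.20, sd = 10.37)
height_cm <- 106 + 0.6 * girth_cm + rnorm(200, sd = 7)

girth_in <- girth_cm / 2.54   # the same girths expressed in inches

# Correlation does not depend on the units of either variable
r_cm <- cor(girth_cm, height_cm)
r_in <- cor(girth_in, height_cm)

# The slope is "cm of height per unit of girth", so it scales with the unit
slope_cm <- unname(coef(lm(height_cm ~ girth_cm))[2])
slope_in <- unname(coef(lm(height_cm ~ girth_in))[2])

all.equal(r_cm, r_in)                  # TRUE
all.equal(slope_in, 2.54 * slope_cm)   # TRUE
```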


Body measurements, Part III. (8.24, p. 326) The exercise above introduces data on shoulder girth and height of a group of individuals. The mean shoulder girth is 107.20 cm with a standard deviation of 10.37 cm. The mean height is 171.14 cm with a standard deviation of 9.41 cm. The correlation between height and shoulder girth is 0.67.

Exercise 7

  1. Write the equation of the regression line for predicting height.
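# Summary statistics given in the problem
R <- 0.67               # correlation between shoulder girth and height
sd_y <- 9.41            # standard deviation of height (cm)
sd_x <- 10.37           # standard deviation of shoulder girth (cm)
b1 <- R * sd_y / sd_x   # slope
x <- 107.20             # mean shoulder girth (cm)
y <- 171.14             # mean height (cm)
b0 <- y - b1 * x        # intercept: the line passes through (x, y)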

\[\widehat{\text{height}} = 105.9650878 + 0.6079749 \times \text{girth}\]

Exercise 8

  1. Interpret the slope and the intercept in this context.

Slope: for each additional 1 cm of shoulder girth, we expect height to increase by 0.608 cm on average.

Intercept: theoretically, if shoulder girth were 0 cm, the predicted height would be about 106 cm. This is meaningless in practice, since the observed data stop well short of a shoulder girth of 0.

Exercise 9

  1. Calculate \(R^2\) of the regression line for predicting height from shoulder girth, and interpret it in the context of the application.

\(R^2\) = 0.67\(^2\) = 0.4489, which means that about 44.89% of the variability in height is explained by shoulder girth.

Exercise 10

  1. A randomly selected student from your class has a shoulder girth of 100 cm. Predict the height of this student using the model.
height <- b0 + 100 * b1

166.7625805 cm

Exercise 11

  1. The student from part (d) is 160 cm tall. Calculate the residual, and explain what this residual means.

Residual: -6.7625805 cm (observed minus predicted, 160 - 166.7625805). The negative residual means that the model overestimated the height; the student is actually a little shorter than predicted.
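The residual calculation can be reproduced from the summary statistics alone. This is a minimal, self-contained sketch; the variable names `mean_girth`, `mean_height`, `predicted`, and `observed` are chosen here for illustration.

```r
# Summary statistics from the problem statement
R <- 0.67
sd_y <- 9.41             # sd of height (cm)
sd_x <- 10.37            # sd of shoulder girth (cm)
mean_girth  <- 107.20
mean_height <- 171.14

b1 <- R * sd_y / sd_x                 # slope
b0 <- mean_height - b1 * mean_girth   # intercept

predicted <- b0 + b1 * 100   # predicted height at a girth of 100 cm
observed  <- 160             # the student's actual height (cm)
residual  <- observed - predicted
residual                     # about -6.76
```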

Exercise 12

  1. A one year old has a shoulder girth of 56 cm. Would it be appropriate to use this linear model to predict the height of this child?

No - that would be extrapolating beyond the bounds of our dataset.
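Since the raw measurements are not reproduced here, one rough way to see how far a 56 cm girth lies from the observed data is to express it in standard deviations from the mean girth. This is only a heuristic sketch, not a formal rule for when extrapolation begins.

```r
mean_girth  <- 107.20   # mean shoulder girth (cm), from the problem
sd_girth    <- 10.37    # sd of shoulder girth (cm), from the problem
child_girth <- 56       # the one year old's shoulder girth (cm)

# Distance from the mean in standard deviation units
z <- (child_girth - mean_girth) / sd_girth
z   # about -4.94
```

A value almost five standard deviations below the mean is far outside the range that a sample of physically active adults would be expected to cover.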


Cats, Part I. (8.26, p. 327) The following regression output is for predicting the heart weight (in g) of cats from their body weight (in kg). The coefficients are estimated using a dataset of 144 domestic cats.

Exercise 13

  1. Write out the linear model.
summary(m_cats_hwt_bwt)
## 
## Call:
## lm(formula = cats$Hwt ~ cats$Bwt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5694 -0.9634 -0.0921  1.0426  5.1238 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.3567     0.6923  -0.515    0.607    
## cats$Bwt      4.0341     0.2503  16.119   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.452 on 142 degrees of freedom
## Multiple R-squared:  0.6466, Adjusted R-squared:  0.6441 
## F-statistic: 259.8 on 1 and 142 DF,  p-value: < 2.2e-16

\[\widehat{\text{heart\_weight}} = -0.3567 + 4.0341 \times \text{body\_weight}\]

Exercise 14

  1. Interpret the intercept.

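The intercept has no practical meaning here, because a body weight of 0 kg is impossible and lies far outside the available data points. Taken literally, the model predicts a heart weight of -0.357 g for a cat weighing 0 kg. The intercept merely positions the line so that it passes through the available data points.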

Exercise 15

  1. Interpret the slope.

For each 1 kg increase in body weight, we expect heart weight to increase by 4.0341 g on average.
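The slope interpretation can be illustrated with the fitted coefficients from the summary output. The helper `predict_heart_weight()` below is a hypothetical function written for this sketch; it is not part of the original analysis.

```r
# Coefficients copied from the regression output above
b0 <- -0.3567   # intercept (g)
b1 <- 4.0341    # slope (g of heart weight per kg of body weight)

# Hypothetical helper: predicted heart weight (g) for a body weight in kg
predict_heart_weight <- function(body_weight_kg) {
  b0 + b1 * body_weight_kg
}

predict_heart_weight(3)                             # 11.7456
predict_heart_weight(3) - predict_heart_weight(2)   # equals the slope, 4.0341
```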

Exercise 16

  1. Interpret \(R^2\).

Body weight explains 64.66% of the variation in heart weight (\(R^2\) = 0.6466; 0.6441 is the adjusted value). The remaining ~35% of variation might be due to other, unmeasured causes or to random variation; the model cannot tell us which.

Exercise 17

  1. Calculate the correlation coefficient.
c <- cor(cats$Hwt,cats$Bwt)
sqrt(0.6466)
## [1] 0.8041144

The correlation coefficient is approximately 0.804. It is positive because the slope is positive.


Rate my professor. (8.44, p. 340) Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching related characteristics, such as the physical appearance of the instructor. Researchers at University of Texas, Austin collected data on teaching evaluation score (higher score means better) and standardized beauty score (a score of 0 means average, negative score means below average, and a positive score means above average) for a sample of 463 professors. The scatterplot below shows the relationship between these variables, and also provided is a regression output for predicting teaching evaluation score from beauty score.

Exercise 18

  1. Given that the average standardized beauty score is -0.0883 and average teaching evaluation score is 3.9983, calculate the slope. Alternatively, the slope may be computed using just the information provided in the model summary table.
summary(m_eval_beauty)
## 
## Call:
## lm(formula = eval ~ beauty)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.80015 -0.36304  0.07254  0.40207  1.10373 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.01002    0.02551 157.205  < 2e-16 ***
## beauty       0.13300    0.03218   4.133 4.25e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5455 on 461 degrees of freedom
## Multiple R-squared:  0.03574,    Adjusted R-squared:  0.03364 
## F-statistic: 17.08 on 1 and 461 DF,  p-value: 4.247e-05

Slope: 0.13300, Intercept: 4.01002
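The slope can also be recovered from the two averages given in the question, because a least squares line always passes through the point of means. The sketch below takes the intercept from the summary table; the small difference from 0.13300 comes from rounding in the reported averages.

```r
mean_beauty <- -0.0883   # average standardized beauty score (given)
mean_eval   <- 3.9983    # average teaching evaluation score (given)
intercept   <- 4.01002   # from the regression output

# mean_eval = intercept + slope * mean_beauty, solved for the slope
slope <- (mean_eval - intercept) / mean_beauty
slope   # about 0.133
```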

Exercise 19

  1. Do these data provide convincing evidence that the slope of the relationship between teaching evaluation and beauty is positive? Explain your reasoning.

Yes. The estimated slope is positive (0.133) and its p-value (p = 0.00004247) is way below 0.05, so there is convincing evidence that beauty is a positive predictor of evaluation score. However, this trend explains very little of the variation we see in the data: beauty accounts for only about 3.6% of the observed variation (\(R^2\) = 0.03574). This suggests that while beauty is a factor, it’s a minor one at best.
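The test statistic and p-value for the slope can be recomputed from the estimate and standard error in the summary table. This is a sketch using the rounded values printed there, so the results agree with the output only up to rounding; the one-sided p-value corresponds to the alternative that the slope is positive.

```r
estimate  <- 0.13300   # slope estimate from the regression output
std_error <- 0.03218   # its standard error
deg_free  <- 461       # residual degrees of freedom (463 - 2)

t_stat <- estimate / std_error   # about 4.13

# Two-sided p-value, comparable to Pr(>|t|) in the summary
p_two_sided <- 2 * pt(-abs(t_stat), df = deg_free)

# One-sided p-value for H_A: slope > 0
p_one_sided <- pt(t_stat, df = deg_free, lower.tail = FALSE)

c(t = t_stat, two_sided = p_two_sided, one_sided = p_one_sided)
```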

Exercise 20

  1. List the conditions required for linear regression and check if each one is satisfied for this model based on the following diagnostic plots.

The conditions, each followed by the check based on the diagnostic plots:

  1. Residuals are normally distributed: the residuals appear randomly distributed.
  2. Residuals are similar across values of the explanatory variable, with no clear trends: there are no obvious patterns in the residuals across the domain of the beauty scores.
  3. No outliers with high leverage or influence: there are no obvious outliers that might affect the linear relationship.
  4. The data in scatterplot form appear linear, with no obvious non-linear patterns: beauty and eval appear to have a rough linear trend.
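Diagnostic plots of this kind can be produced in base R. The sketch below uses simulated (hypothetical) data with roughly the same slope, intercept, and residual spread as the fitted model, because the evaluation data are not reproduced here; the variable `eval_score` is a stand-in name.

```r
# Simulated data for illustration only; NOT the professor evaluation data.
set.seed(42)
beauty     <- rnorm(463)
eval_score <- 4.01 + 0.133 * beauty + rnorm(463, sd = 0.55)

fit <- lm(eval_score ~ beauty)
res <- resid(fit)

par(mfrow = c(1, 3))

# Linearity: scatterplot with the fitted line
plot(beauty, eval_score, main = "Scatterplot")
abline(fit)

# Constant variability: residuals against fitted values
plot(fitted(fit), res, main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# Nearly normal residuals: normal Q-Q plot
qqnorm(res)
qqline(res)
```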