Grando 7 Homework

if (Sys.info()["sysname"] == "Windows") {
    setwd("~/Masters/DATA606/Week7/Homework")
} else {
    setwd("~/Documents/Masters/DATA606/Week7/Homework")
}
require(ggplot2)
## Loading required package: ggplot2

7.24 Nutrition at Starbucks, Part I. The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.

(a) Describe the relationship between number of calories and amount of carbohydrates (in grams) that Starbucks food menu items contain.

Answer:

There is a weak to moderate positive linear relationship between calories and carbohydrates.

(b) In this scenario, what are the explanatory and response variables?

Answer:

The explanatory variable is the calorie count and the response variable is the amount of carbohydrates (in grams).

(c) Why might we want to fit a regression line to these data?

Answer:

A regression line lets us predict the amount of carbohydrates in an item from its listed calorie count. If we were on a diet and wanted to limit not just calorie intake but also carbohydrate intake, we could use this prediction to make a reasonable estimate of the carbohydrates in a given item.

(d) Do these data meet the conditions required for fitting a least squares line?

Answer:

The variability of the data around the line increases with larger values of calories, so the constant variability condition is not met. This indicates that a linear regression model is not appropriate for these data.
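One way to check the constant variability condition numerically is to compare the spread of the residuals at low and high values of the explanatory variable. The sketch below uses simulated data, since the Starbucks data are not included here; the variable names and simulation parameters are assumptions chosen only to produce a fan-shaped pattern.

```r
# Simulated data for illustration only; these are NOT the Starbucks values.
set.seed(1)
calories <- runif(200, min = 100, max = 500)
# The noise standard deviation grows with calories, mimicking a fan shape
carbs <- 10 + 0.1 * calories + rnorm(200, sd = 0.02 * calories)

fit <- lm(carbs ~ calories)
res <- resid(fit)

# Compare the residual spread below and above the median calorie count
low <- calories <= median(calories)
sd_low <- sd(res[low])
sd_high <- sd(res[!low])
c(sd_low = sd_low, sd_high = sd_high)
```

A clearly larger spread in the upper half, as in this simulation, suggests that the constant variability condition is violated.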

7.26 Body measurements, Part III. Exercise 7.15 introduces data on shoulder girth and height of a group of individuals. The mean shoulder girth is 107.20 cm with a standard deviation of 10.37 cm. The mean height is 171.14 cm with a standard deviation of 9.41 cm. The correlation between height and shoulder girth is 0.67.

(a) Write the equation of the regression line for predicting height.

Answer:

First, calculate the slope:

sx <- 10.37
sy <- 9.41
r <- 0.67
(b1 <- r * sy/sx)
## [1] 0.6079749

Next, the least squares regression line passes through the point \((\overline{x}, \overline{y})\), so \(\overline{y} = b_0 + b_1 \overline{x}\). Therefore, \(b_0 = \overline{y} - b_1 \overline{x}\).

y_bar <- 171.14
x_bar <- 107.2
(b0 <- y_bar - b1 * x_bar)
## [1] 105.9651

Therefore, the linear regression line is represented with this equation:

\[\text{height} = 105.97 + 0.61 \times \text{shoulder girth}\]

(b) Interpret the slope and the intercept in this context.

Answer:

The slope (b1): For each additional centimeter in shoulder girth, the model predicts an additional 0.61 cm in height, on average.

The intercept (b0): When the shoulder girth is zero, the expected height is 105.97 cm. However, a shoulder girth of zero is not a realistic measurement, so the intercept has no practical meaning on its own.
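The slope and intercept calculations from part (a) can be wrapped in a small helper that works from summary statistics alone. This is a minimal sketch; the function name `ls_line` and its argument names are hypothetical and not part of the exercise.

```r
# Hypothetical helper: least squares line from summary statistics.
# x_bar, y_bar: sample means; sx, sy: sample standard deviations; r: correlation
ls_line <- function(x_bar, y_bar, sx, sy, r) {
    slope <- r * sy / sx                  # slope of the least squares line
    intercept <- y_bar - slope * x_bar    # line passes through (x_bar, y_bar)
    c(intercept = intercept, slope = slope)
}

# Summary statistics given in the exercise
coefs <- ls_line(x_bar = 107.20, y_bar = 171.14, sx = 10.37, sy = 9.41, r = 0.67)
round(coefs, 2)
```

Note that the slope differs from the correlation because it is rescaled by the ratio of the two standard deviations.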

(c) Calculate R2 of the regression line for predicting height from shoulder girth, and interpret it in the context of the application.

Answer:

The r-squared value is the square of correlation, so r-squared is:

0.67^2
## [1] 0.4489

About 44.9% of the variability in height is accounted for by the model, i.e. explained by the shoulder girth.

(d) A randomly selected student from your class has a shoulder girth of 100 cm. Predict the height of this student using the model.

Answer:

The expected height is:

b0 + b1 * 100
## [1] 166.7626

(e) The student from part (d) is 160 cm tall. Calculate the residual, and explain what this residual means.

Answer:

A residual is the expected value subtracted from the observed, which in this case is:

160 - (b0 + b1 * 100)
## [1] -6.762581

A negative value means that the model overestimates the height of this person.
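Parts (d) and (e) can be combined into two small functions. The sketch below recomputes the coefficients from the summary statistics so that it is self-contained; the names `predict_height` and `height_residual` are hypothetical.

```r
# Coefficients recomputed from the summary statistics in part (a)
girth_slope <- 0.67 * 9.41 / 10.37
girth_intercept <- 171.14 - girth_slope * 107.20

# Hypothetical helper: predicted height (cm) for a given shoulder girth (cm)
predict_height <- function(girth) girth_intercept + girth_slope * girth

# Residual = observed height minus predicted height
height_residual <- function(observed, girth) observed - predict_height(girth)

predict_height(100)
height_residual(160, 100)
```

A residual below zero means the observed point lies under the regression line.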

(f) A one year old has a shoulder girth of 56 cm. Would it be appropriate to use this linear model to predict the height of this child?

Answer:

No. The data appear to be for adults, since the mean shoulder girth is 107.2 cm and the standard deviation is 10.37 cm. A measurement of 56 cm is about 5 standard deviations below the mean, which is almost certainly outside the range of the data, so the prediction would be an extrapolation.
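The distance of 56 cm from the mean, in standard deviations, can be checked directly. The variable names below are chosen for illustration.

```r
girth_mean <- 107.20
girth_sd <- 10.37

# Number of standard deviations between 56 cm and the mean shoulder girth
z <- (56 - girth_mean) / girth_sd
z
```

The result is close to -5, which supports treating this prediction as an extrapolation.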

7.30 Cats, Part I. The following regression output is for predicting the heart weight (in g) of cats from their body weight (in kg). The coefficients are estimated using a dataset of 144 domestic cats.

(a) Write out the linear model.

Answer:

\[\text{heart weight (g)} = -0.357 + 4.034 \times \text{body weight (kg)}\]

(b) Interpret the intercept.

Answer:

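The expected heart weight for a body weight of zero kilograms is -0.357 grams. This value has no practical meaning, since a cat cannot weigh zero kilograms and a heart weight cannot be negative; the intercept only adjusts the height of the line.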

(c) Interpret the slope.

Answer:

For each additional kilogram of body weight, we expect heart weight to increase by 4.034 grams, on average.
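The slope interpretation can be illustrated by predicting at two body weights that differ by one kilogram. The function name `predict_heart_weight` is hypothetical, and the inputs of 2 kg and 3 kg are arbitrary values assumed to be plausible for a domestic cat.

```r
# Fitted model from part (a): heart weight (g) from body weight (kg)
predict_heart_weight <- function(body_weight_kg) {
    -0.357 + 4.034 * body_weight_kg
}

# Arbitrary illustrative inputs, one kilogram apart
pred <- predict_heart_weight(c(2, 3))
pred
diff(pred)  # difference between the two predictions equals the slope
```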

(d) Interpret R2.

Answer:

Body weight (kg) explains 64.66% of the variability in heart weight (g).

(e) Calculate the correlation coefficient.

Answer:

The correlation coefficient is the square root of the r-squared value, with the same sign as the slope. Since the slope is positive, the correlation coefficient is:

sqrt(0.6466)
## [1] 0.8041144
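Because the square root alone loses the direction of the relationship, a slightly more general sketch takes the sign from the slope. The variable names here are chosen for illustration.

```r
r_squared <- 0.6466
cat_slope <- 4.034

# The correlation has the same sign as the slope of the regression line
r_cats <- sign(cat_slope) * sqrt(r_squared)
r_cats
```

For a model with a negative slope, the same expression would return a negative correlation.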

(a) Given that the average standardized beauty score is -0.0883 and average teaching evaluation score is 3.9983, calculate the slope. Alternatively, the slope may be computed using just the information provided in the model summary table.

Answer:

Using the information given:

\[\overline{y} = b_0 + b_1 \overline{x}\]

\[\frac{\overline{y} - b_0}{\overline{x}} = b_1\]

y_bar <- 3.9983
x_bar <- -0.0883
b0 <- 4.01
((y_bar - b0)/x_bar)
## [1] 0.1325028

Using the table provided:

t_value <- 4.13
se <- 0.0322
t_value * se
## [1] 0.132986
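The table-based calculation works because the t value in a regression summary is the slope estimate divided by its standard error, under the null hypothesis that the true slope is zero:

\[T = \frac{b_1 - 0}{SE_{b_1}} \quad\Longrightarrow\quad b_1 = T \times SE_{b_1} = 4.13 \times 0.0322 \approx 0.1330\]

The small difference from the first estimate (0.1325) is consistent with rounding in the reported intercept, t value, and standard error.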

(b) Do these data provide convincing evidence that the slope of the relationship between teaching evaluation and beauty is positive? Explain your reasoning.

Answer:

Yes. The p-value for the slope in the model summary is approximately zero, so if the null hypothesis (slope = 0) were true, we would almost never observe a point estimate as extreme as this one. We can also calculate the 95% confidence interval for the slope; it contains neither zero nor any negative values, confirming that there is convincing evidence that the slope is positive:

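# Note: df = n - 1 is used here; simple linear regression has n - 2 = 461
# residual degrees of freedom, which changes the interval only negligibly.
t_95 <- qt(p = 0.975, df = (463 - 1))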
slope <- t_value * se
slope - t_95 * se
## [1] 0.06970939
slope + t_95 * se
## [1] 0.1962626
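Since the question asks specifically whether the slope is positive, a one-sided p-value can also be computed from the t value in the table. This is a sketch that assumes n - 2 degrees of freedom, the usual choice for a model with one predictor; the variable names are chosen for illustration.

```r
t_stat <- 4.13
n_obs <- 463

# One-sided p-value for the alternative hypothesis that the slope is positive
p_one_sided <- pt(t_stat, df = n_obs - 2, lower.tail = FALSE)
p_one_sided
```

A value far below 0.05 agrees with the conclusion drawn from the confidence interval.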

(c) List the conditions required for linear regression and check if each one is satisfied for this model based on the following diagnostic plots.

Answer:

When fitting a least squares line, we generally require:

  1. Linearity: Given the p-values indicated, and that the graph of the actual data points has not been provided, we can assume the data appear to be linear.

  2. Nearly normal residuals: The data appear to have a left skew based on the histogram and normal probability plot, which is a cause for concern when fitting the data to a linear model.

  3. Constant variability: The residual plots appear to show that the data has the same amount of variability regardless of the x value.

  4. Independent observations: It appears there is a time-series plot provided as one of the residual graphs. There does not appear to be a trend in this data.

The data provided appear to meet all the conditions except the requirement for nearly normal residuals; therefore, we should investigate this issue or proceed with caution.
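Diagnostic plots like the ones referred to above can be produced with base R. The sketch below uses simulated data, since the course evaluation data are not included here; the variable names and simulation parameters are assumptions for illustration only.

```r
# Simulated data for illustration only; NOT the course evaluation data.
set.seed(42)
beauty <- rnorm(463)
score <- 4 + 0.13 * beauty + rnorm(463, sd = 0.5)

fit_sim <- lm(score ~ beauty)
res_sim <- resid(fit_sim)

par(mfrow = c(2, 2))
# Linearity and constant variability: residuals against fitted values
plot(fitted(fit_sim), res_sim, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)
# Nearly normal residuals: histogram and normal probability plot
hist(res_sim, xlab = "Residuals", main = "Histogram of residuals")
qqnorm(res_sim)
qqline(res_sim)
# Independence: residuals in order of data collection
plot(seq_along(res_sim), res_sim, xlab = "Order of data collection",
     ylab = "Residuals", main = "Residuals vs order")
abline(h = 0, lty = 2)
par(mfrow = c(1, 1))
```

Each panel corresponds to one or more of the four conditions listed above, so the same layout can be reused for any simple linear model.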