The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.
The explanatory variable is calories and the response variable is carbohydrates (grams).
We fit a regression line to see how well calories predict carbohydrates compared with a baseline such as simply using the mean. The residuals, the differences between observed and predicted values, indicate whether there is a real relationship between the explanatory and response variables. If the residuals are large for most observed values, the explanatory variable does not explain the response effectively, and the model is not a good approximation for predicting the outcome we want.
Even though the relationship appears linear in the scatterplot, and the residuals are nearly normal, we do not satisfy the criteria for constant variability. A simple linear model is inadequate for modeling these data.
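As a sketch of how these conditions could be checked in R, assuming the data were loaded into a data frame called starbucks with columns calories and carb (both names are assumptions for illustration, not taken from the original output):
# Sketch only: assumes a data frame `starbucks` with columns `calories` and `carb`
m <- lm(carb ~ calories, data = starbucks)
summary(m)                                   # slope, intercept, and R-squared

plot(starbucks$calories, resid(m),           # residuals vs. explanatory variable
     xlab = "Calories", ylab = "Residuals")  # a fan shape indicates non-constant variability
abline(h = 0, lty = 2)
hist(resid(m))                               # check for nearly normal residuals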
Exercise 7.15 introduces data on shoulder girth and height of a group of individuals. The mean shoulder girth is 107.20 cm with a standard deviation of 10.37 cm. The mean height is 171.14 cm with a standard deviation of 9.41 cm. The correlation between height and shoulder girth is 0.67.
responseMean = 171.14      # mean height (cm)
responseSD = 9.41          # standard deviation of height (cm)
R = 0.67                   # correlation between height and shoulder girth
explanatoryMean = 107.2    # mean shoulder girth (cm)
explanatorySD = 10.37      # standard deviation of shoulder girth (cm)
slope = (responseSD/explanatorySD)*R               # b1 = R * s_y / s_x
intercept = responseMean - slope*explanatoryMean   # b0 = ybar - b1 * xbar
# Evaluate the regression line y = intercept + slope * x
regressionFunction = function(x, slope, intercept){
  y = (x * slope) + intercept
  return(y)
}
girth = 1:150                                                # shoulder girths (cm) to evaluate
predictedHeight = sapply(girth, regressionFunction, slope, intercept)
plot(girth, predictedHeight, type = 'l',
     xlab = 'Shoulder girth (cm)', ylab = 'Height (cm)')
Equation:
height = 105.9650878 + 0.6079749 * shoulder girth
The slope is 0.608, which means that for every 10 cm increase in shoulder girth, the model predicts an additional 6.08 cm of height (or, for every 1 cm increase in shoulder girth, an additional 0.608 cm of height). We must remember that not all values make sense when plugged into this linear regression equation. For example, a shoulder girth of 0 cm (which cannot happen) would predict a height of 105.965 cm, which does not make sense.
R.squared <- R^2
paste("R squared: ", round(R.squared,3))
## [1] "R squared: 0.449"
This means that 44.9% of the variation in height in these data is explained by the linear model, i.e. by shoulder girth.
student.shoulder <- 100 # in cm
student.height <- 105.9650878 + 0.6079749 * student.shoulder
paste("The estimated height of a student with a shoulder girth of 100 cm is: ", round(student.height,3), "cm.")
## [1] "The estimated height of a student with a shoulder girth of 100 cm is: 166.763 cm."
# residual = height(observed) - height(predicted)
residual <- 160 - student.height   # the student's observed height is 160 cm
paste("The residual is: ", round(residual,3))
## [1] "The residual is: -6.763"
Residual = observed - predicted, which in this case is -6.763: the student's actual height of 160 cm is about 6.76 cm less than the height the line predicts.
The original data only cover shoulder girths of roughly 80 to 140 cm, which is a limitation. The variability appears constant, and the correlation is strong and linear. While predicting outside that range would be inappropriate extrapolation, I would do it in the absence of any other model.
The following regression output is for predicting the heart weight (in g) of cats from their body weight (in kg). The coefficients are estimated using a dataset of 144 domestic cats.
Heart Weight (g) = -0.357 + 4.034 * Body Weight (kg)
The intercept means that a cat with a body weight of 0 kg would have an average heart weight of -0.357 grams. This is obviously a theoretical value that serves only to anchor the linear model; it has no meaningful interpretation on its own.
The slope means that for each additional kilogram of body weight, the average heart weight of a cat increases by 4.034 grams.
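For example, plugging a hypothetical 3 kg cat into the equation (the 3 kg value is chosen here only for illustration and is not from the dataset):
cat.body <- 3                            # hypothetical body weight in kg
cat.heart <- -0.357 + 4.034 * cat.body   # predicted heart weight in g
paste("Estimated heart weight: ", round(cat.heart, 3), "g")
# expected output: "Estimated heart weight:  11.745 g"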
Body weight (in kg) explains 64.66% of the variability in the heart weight (in g) of the cat.
paste("Correlation coefficient: ", round(sqrt(.6466),3))
## [1] "Correlation coefficient: 0.804"
Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching-related characteristics, such as the physical appearance of the instructor. Researchers at the University of Texas at Austin collected data on teaching evaluation score (a higher score means a better evaluation) and standardized beauty score (a score of 0 means average, a negative score means below average, and a positive score means above average) for a sample of 463 professors. The scatterplot below shows the relationship between these variables, and a regression output for predicting teaching evaluation score from beauty score is also provided.
# The regression line passes through the point of means, so b1 = (ybar - b0) / xbar
prof.B1 <- (3.9983 - 4.010)/(-0.0883)
paste("B1 or the slope: ", round(prof.B1, 3))
## [1] "B1 or the slope: 0.133"
Since the slope is positive, the relationship is positive. If we set up a hypothesis test with H0: B1 = 0 and HA: B1 > 0, then based on the summary table the p-value is nearly 0. That p-value is for a two-sided test, so it will be even closer to 0 for the one-sided test. We reject the null hypothesis: there is convincing evidence that the relationship between teaching evaluation score and beauty score is positive.
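As a sketch of how the one-sided test could be carried out if the raw data were available, assuming a data frame evals with columns score and bty_avg (both names are assumptions, not from the output shown):
m.prof <- lm(score ~ bty_avg, data = evals)    # assumed data frame and column names
two.sided.p <- summary(m.prof)$coefficients["bty_avg", "Pr(>|t|)"]
one.sided.p <- two.sided.p / 2                 # halve the two-sided p-value because HA is one-sided
one.sided.p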
- Linearity: Based on the scatterplot, there may be a weak linear relationship. There is no evident pattern in the residual plot.
- Nearly normal residuals: The histogram of the residuals exhibits a left skew, and the points drift away from the normal probability line at each end.
- Constant variability: Based on the residual plot, there appears to be constant variability in the data.
- Independent observations: The observations are not a time series and can be assumed to be independent (unless there is evidence that students copied each other's evaluations).