Nutrition at Starbucks, Part I. (8.22, p. 326) The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.

  1. Describe the relationship between number of calories and amount of carbohydrates (in grams) that Starbucks food menu items contain.

Answer The relationship is positive and roughly linear: items with more calories tend to contain more carbohydrates. The correlation coefficient is 0.675, and the p-value of the test is \(1.673 \times 10^{-11}\), well below the significance level \(\alpha = 0.05\), so we conclude that calories and carbs are significantly correlated.

summary(m_carb_cals)
## 
## Call:
## lm(formula = carb ~ calories, data = starbucks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.477  -7.476  -1.029  10.127  28.644 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.94356    4.74600   1.884   0.0634 .  
## calories     0.10603    0.01338   7.923 1.67e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.29 on 75 degrees of freedom
## Multiple R-squared:  0.4556, Adjusted R-squared:  0.4484 
## F-statistic: 62.77 on 1 and 75 DF,  p-value: 1.673e-11
res <- cor.test(starbucks$calories, starbucks$carb, 
                    method = "pearson")
res
## 
##  Pearson's product-moment correlation
## 
## data:  starbucks$calories and starbucks$carb
## t = 7.9229, df = 75, p-value = 1.673e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5313531 0.7809149
## sample estimates:
##      cor 
## 0.674999
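
The two outputs above report the same test in two ways. As a minimal sketch, using only the numbers printed above (the variable names are illustrative), the correlation reproduces both the Multiple R-squared and the t statistic of the slope:

```r
# Values copied from the cor.test() output above.
r  <- 0.674999   # sample correlation
df <- 75         # n - 2 degrees of freedom

r_squared <- r^2                            # compare with Multiple R-squared (0.4556)
t_stat    <- r * sqrt(df) / sqrt(1 - r^2)   # t statistic for H0: true correlation = 0
p_value   <- 2 * pt(abs(t_stat), df, lower.tail = FALSE)  # two-sided p-value

round(r_squared, 4)
round(t_stat, 3)
p_value
```

In simple linear regression, the test of zero correlation and the test of zero slope are equivalent, which is why both outputs show the same t value and p-value.
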
  1. In this scenario, what are the explanatory and response variables?

Answer

Explanatory variable: Calories on the x axis.

Response variable: Carbohydrates on the y axis.

  1. Why might we want to fit a regression line to these data?

Answer We are interested in predicting the amount of carbs a menu item has based on its calorie content.
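
As a small sketch of what such a prediction looks like, the coefficients from the `summary(m_carb_cals)` output above can be wrapped in a helper. The function name `predict_carbs` and the 300-calorie item are hypothetical, chosen only for illustration:

```r
# Coefficients copied from the regression output above.
b0 <- 8.94356   # intercept (grams of carbs)
b1 <- 0.10603   # slope (grams of carbs per calorie)

# Hypothetical helper: predicted carbs for a given calorie count.
predict_carbs <- function(calories) {
  b0 + b1 * calories
}

predict_carbs(300)   # a hypothetical 300-calorie menu item
```
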

  1. Do these data meet the conditions required for fitting a least squares line?

Answer The following four conditions were checked with the diagnostic plots, and each appears to be satisfied.

Linearity: The trend appears to be linear.

Nearly normal residuals: The residuals fall around the line with no obvious outliers.

Constant variability: The variance of the residuals is roughly constant.

Independent observations: The menu items are not time series observations, so there is no obvious dependence between them.

par(mfrow=c(2,2))
plot(m_carb_cals)

#vif(m_carb_cals)

#perform Durbin-Watson test
#dwtest(m_carb_cals)
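
The diagnostic plots can be complemented by a few numeric checks. The sketch below is only an illustration on simulated data, since the `starbucks` data frame is not reproduced here; every value in it (seed, calorie range, coefficients, noise level) is an assumption, not the real menu data:

```r
set.seed(1)                                   # reproducible simulated data
n <- 77                                       # same sample size as above (75 df + 2)
calories_sim <- runif(n, 80, 500)             # hypothetical calorie values
carb_sim <- 9 + 0.1 * calories_sim + rnorm(n, sd = 12)  # hypothetical linear relationship

fit_sim <- lm(carb_sim ~ calories_sim)
res <- resid(fit_sim)

# Constant variability: compare the residual spread for low vs. high fitted values.
low <- fitted(fit_sim) <= median(fitted(fit_sim))
spread <- c(sd_low = sd(res[low]), sd_high = sd(res[!low]))
spread

# Nearly normal residuals: a formal complement to the Q-Q plot.
shapiro_p <- shapiro.test(res)$p.value
shapiro_p
```

Two similar standard deviations and a large Shapiro-Wilk p-value would be consistent with the constant variability and normality conditions. The same checks could be run on `m_carb_cals` in place of `fit_sim`.
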

Body measurements, Part I. (8.13, p. 316) Researchers studying anthropometry collected body girth measurements and skeletal diameter measurements, as well as age, weight, height and gender for 507 physically active individuals. The scatterplot below shows the relationship between height and shoulder girth (over deltoid muscles), both measured in centimeters.


  1. Describe the relationship between shoulder girth and height.

Answer

The relationship between shoulder girth and height is positive and roughly linear: as shoulder girth increases, height tends to increase. In most cases, a person with a larger shoulder girth is also taller.

  1. How would the relationship change if shoulder girth was measured in inches while the units of height remained in centimeters?

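Answer If shoulder girth were measured in inches while height remained in centimeters, the direction and strength of the relationship would stay the same, because correlation does not depend on the units of measurement. Only the slope of a fitted line would change, by the conversion factor of 2.54 cm per inch.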
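
A quick way to convince yourself of this is to convert units on simulated data. The sketch below uses made-up girth and height values (the seed, means, and noise level are assumptions), not the study data:

```r
set.seed(42)                                             # reproducible simulated data
girth_cm  <- rnorm(50, mean = 107, sd = 10)              # hypothetical shoulder girths (cm)
height_cm <- 106 + 0.6 * girth_cm + rnorm(50, sd = 7)    # hypothetical heights (cm)

girth_in <- girth_cm / 2.54                              # same girths in inches

cor_cm <- cor(girth_cm, height_cm)
cor_in <- cor(girth_in, height_cm)

slope_cm <- coef(lm(height_cm ~ girth_cm))[[2]]
slope_in <- coef(lm(height_cm ~ girth_in))[[2]]

all.equal(cor_cm, cor_in)   # correlation is unchanged by the unit conversion
slope_in / slope_cm         # slope is rescaled by the conversion factor
```
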


Body measurements, Part III. (8.24, p. 326) Exercise above introduces data on shoulder girth and height of a group of individuals. The mean shoulder girth is 107.20 cm with a standard deviation of 10.37 cm. The mean height is 171.14 cm with a standard deviation of 9.41 cm. The correlation between height and shoulder girth is 0.67.

  1. Write the equation of the regression line for predicting height.

Answer

#general equation is y = a + b*x, where y is height, x is shoulder girth, a is the intercept, and b is the slope

shoulder_girth_mean <- 107.20
shoulder_girth_sd <- 10.37
height_mean <- 171.14
height_sd <- 9.41
correlation <- 0.67 #this is r value

#b is slope which is calculated by b=r*Sy/Sx

slope <- correlation * (height_sd / shoulder_girth_sd)
slope 
## [1] 0.6079749
intercept <- height_mean - slope * shoulder_girth_mean
intercept
## [1] 105.9651
cat('The equation of the regression line for predicting height is height = 105.9651 + .61 * shoulder girth ')
## The equation of the regression line for predicting height is height = 105.9651 + .61 * shoulder girth
  1. Interpret the slope and the intercept in this context.

Answer

Slope: For every 1 cm increase in shoulder girth, the model predicts an additional 0.61 cm of height, on average.

Intercept: The model predicts a height of 105.97 cm for a shoulder girth of 0 cm. A girth of 0 cm is not physically meaningful, so the intercept only serves to position the line.
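
Both interpretations can be checked numerically. The sketch below recomputes the line from the summary statistics given in the exercise; the helper name `predict_height` is hypothetical:

```r
# Summary statistics given in the exercise.
girth_mean  <- 107.20
girth_sd    <- 10.37
height_mean <- 171.14
height_sd   <- 9.41
r           <- 0.67

slope     <- r * height_sd / girth_sd
intercept <- height_mean - slope * girth_mean

# Hypothetical helper: predicted height (cm) for a shoulder girth (cm).
predict_height <- function(girth) intercept + slope * girth

predict_height(girth_mean)                  # the line passes through the point of means
predict_height(101) - predict_height(100)   # one extra cm of girth adds `slope` cm of height
predict_height(0)                           # the intercept: prediction at a girth of 0 cm
```
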

  1. Calculate \(R^2\) of the regression line for predicting height from shoulder girth, and interpret it in the context of the application.
r_squared <- correlation^2
r_squared
## [1] 0.4489
cat('In the context of the data, this means the linear model explains',r_squared*100, '% of the variation in height.')
## In the context of the data, this means the linear model explains 44.89 % of the variation in height.
  1. A randomly selected student from your class has a shoulder girth of 100 cm. Predict the height of this student using the model.
randomstudentheight = intercept + slope * 100
randomstudentheight
## [1] 166.7626
cat('height of randomly selected student is : ' ,randomstudentheight)
## height of randomly selected student is :  166.7626
  1. The student from part (d) is 160 cm tall. Calculate the residual, and explain what this residual means.

Answer The residual is calculated as observed minus predicted, \(e_i = y_i - \hat{y}_i\). A negative residual means that the predicted value is higher than the observed value.

residual_error <- 160 - randomstudentheight
residual_error
## [1] -6.762581
cat('Since the residual is negative, the observed point lies below the regression line: the model overestimates the height of this student by', abs(residual_error), 'cm.')
## Since the residual is negative, the observed point lies below the regression line: the model overestimates the height of this student by 6.762581 cm.
  1. A one year old has a shoulder girth of 56 cm. Would it be appropriate to use this linear model to predict the height of this child?

Answer The original data set has shoulder girths (the explanatory variable) between roughly 85 and 135 cm. A girth of 56 cm is far outside that range, so it would not be appropriate to use this linear model for the prediction. Extrapolating means making an unreliable bet that the approximately linear relationship remains valid where it has not been observed.
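
To see why extrapolation is risky here, one can plug the child's girth into the fitted line anyway. This is a minimal sketch that reuses the rounded coefficients computed earlier; `predict_height` is a hypothetical helper name:

```r
# Coefficients of the fitted line computed earlier in this exercise.
intercept <- 105.9651
slope     <- 0.6079749

# Hypothetical helper: predicted height (cm) for a shoulder girth (cm).
predict_height <- function(girth_cm) intercept + slope * girth_cm

predict_height(56)   # shoulder girth of the one-year-old
```

The model returns a height of about 140 cm, which is implausible for a one-year-old and illustrates how the line can fail outside the range of the observed data.
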


Cats, Part I. (8.26, p. 327) The following regression output is for predicting the heart weight (in g) of cats from their body weight (in kg). The coefficients are estimated using a dataset of 144 domestic cats.


  1. Write out the linear model.

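Answer

With \(y\) the heart weight (g) and \(x\) the body weight (kg), the linear model has the form \(\hat{y} = b_0 + b_1 x\), where in general \(b_1 = R \, s_y / s_x\). Here the estimates of \(\beta_0\) and \(\beta_1\) are read directly from the first column of the regression output, titled "Estimate", which gives

\(\widehat{\text{heart weight}} = -0.357 + 4.034 \times \text{body weight}\)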

  1. Interpret the intercept.

Answer The intercept is -0.357 g. It says that the predicted heart weight is -0.357 g when the body weight is 0 kg. This is not meaningful, since a cat can't have a body weight of 0 kg or a negative heart weight; the intercept only serves to position the line.

  1. Interpret the slope.

Answer The slope is approximately 4 g per kg. For each additional 1 kg of body weight, the predicted heart weight increases by about 4.034 g.

par(mar = c(3.7, 3.7, 0.5, 0.5), las = 1, mgp = c(2.5, 0.7, 0), 
    cex.lab = 1.5, cex.axis = 1.5)
plot(cats$Hwt ~ cats$Bwt, 
     xlab = "Body weight (kg)", ylab = "Heart weight (g)", 
     pch = 19, col = COL[1,2],
     xlim = c(2,4), ylim = c(5, 20.5), axes = FALSE)
axis(1, at = seq(2, 4, 0.5))
axis(2, at = seq(5, 20, 5))
box()
abline(m_cats_hwt_bwt)
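
The fitted line drawn above can also be evaluated directly. As a sketch, using the estimates from the answer to the first part, the hypothetical helper `predict_heart_weight` returns predictions for a few body weights inside the plotted range of 2 to 4 kg:

```r
# Estimates from the regression output for the cats data.
b0 <- -0.357   # intercept (g)
b1 <- 4.034    # slope (g of heart weight per kg of body weight)

# Hypothetical helper: predicted heart weight (g) for a body weight (kg).
predict_heart_weight <- function(body_weight_kg) b0 + b1 * body_weight_kg

predictions <- predict_heart_weight(c(2, 3, 4))
predictions

diff(predictions)   # each additional kg adds b1 grams to the prediction
```
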

  1. Interpret \(R^2\).

Answer \(R^2 = 64.66\%\). This means that 64.66% of the variability in heart weight is explained by body weight.

  2. Calculate the correlation coefficient.

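Answer \(R^2\) tells us what percent of the variability in the response variable \(y\) is explained by least-squares regression on the explanatory variable \(x\). The correlation coefficient is the square root of \(R^2\), taken with the sign of the slope, which is positive here.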

ccof <- sqrt(0.6466)

cat('The correlation coefficient is:', ccof)
## The correlation coefficient is: 0.8041144
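
Taking a square root always returns a non-negative number, so the sign of the correlation has to come from the slope. A minimal sketch that makes this step explicit, using the \(R^2\) and slope estimate from this exercise:

```r
r_squared <- 0.6466   # R-squared from the regression output
slope     <- 4.034    # slope estimate from the regression output

# The correlation has the same sign as the slope of the least squares line.
r <- sign(slope) * sqrt(r_squared)
r
```

With a negative slope, the same code would return a negative correlation of the same magnitude.
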

Rate my professor. (8.44, p. 340) Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching related characteristics, such as the physical appearance of the instructor. Researchers at University of Texas, Austin collected data on teaching evaluation score (higher score means better) and standardized beauty score (a score of 0 means average, negative score means below average, and a positive score means above average) for a sample of 463 professors. The scatterplot below shows the relationship between these variables, and also provided is a regression output for predicting teaching evaluation score from beauty score.


  1. Given that the average standardized beauty score is -0.0883 and average teaching evaluation score is 3.9983, calculate the slope. Alternatively, the slope may be computed using just the information provided in the model summary table.

Answer

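#The least squares line passes through the point of means, so y_bar = b0 + b1 * x_bar

b0 <- 4.010    # intercept, from the regression output

x <- -0.0883   # average standardized beauty score
y <- 3.9983    # average teaching evaluation score

b1 <- (y - b0)/x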

cat('the slope is ', b1)
## the slope is  0.1325028
  1. Do these data provide convincing evidence that the slope of the relationship between teaching evaluation and beauty is positive? Explain your reasoning.

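Answer Yes. The slope calculated above is positive, but a positive point estimate alone is not convincing evidence; that comes from the test of \(H_0: \beta_1 = 0\) against \(H_A: \beta_1 > 0\). The p-value reported for the beauty coefficient in the regression output is close to 0, so we reject \(H_0\): these data provide convincing evidence that the slope of the relationship between teaching evaluation and beauty is positive.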
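
The test for a positive slope can be sketched as a one-sided t-test. The regression table is not reproduced in this document, so the standard error below is an assumed value used only to show the mechanics; replace it with the value from the actual output:

```r
b1    <- 0.1325   # slope estimate, rounded from the calculation above
se_b1 <- 0.0322   # ASSUMED standard error of the slope (not taken from this document)
n     <- 463      # number of professors in the sample

df     <- n - 2          # degrees of freedom for simple linear regression
t_stat <- b1 / se_b1     # test statistic for H0: beta1 = 0

# One-sided p-value for HA: beta1 > 0.
p_one_sided <- pt(t_stat, df, lower.tail = FALSE)

t_stat
p_one_sided
```

Under this assumed standard error, the p-value is far below 0.05, which is the kind of evidence needed to conclude that the slope is positive.
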

  1. List the conditions required for linear regression and check if each one is satisfied for this model based on the following diagnostic plots.

Answer

Linearity: The trend appears to be linear.

Nearly normal residuals: As shown in the residuals distribution and Q-Q plot, they are in fact nearly normal.

Constant variability: The scatterplot of the residuals does appear to have constant variability.

Independent observations: There is no clear evidence for or against independence. The 463 sampled professors are likely fewer than 10% of all professors nationwide, so this condition can reasonably be assumed to be satisfied.