7.24 Nutrition at Starbucks, Part I. The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.
The relationship between the number of calories and the amount of carbohydrates is positive and roughly linear: as the number of calories increases, the amount of carbs tends to increase as well.
The explanatory variable is the number of calories, and the response variable is the amount of carbs.
We might want to fit a regression line to these data to check whether there is a linear association between these two variables, and to predict the amount of carbs based on the number of calories published by Starbucks for any given food item. This could help with planning for daily nutritional needs.
The conditions required for fitting a least squares line are:
Linear relationship between explanatory and response variable: Based on the first scatter plot, this does seem to be the case.
Residuals are normally distributed or nearly normal: Based on the third graph (histogram of residuals), the residuals do not look exactly normal; they are a little left skewed.
Constant variability in residuals: Based on the second graph (scatter plot of residuals), this does not seem to be the case. The variance of the residuals appears to increase as the explanatory variable (number of calories) increases.
Independent observations: This seems like a plausible assumption in this case, i.e. there is no reason to assume that the calorie or carb content of one Starbucks food item depends on that of another.
Based on the above, it can be stated that not all conditions for fitting a least squares line are met in this dataset.
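The constant-variability check above can also be done numerically. The following is a minimal sketch on simulated stand-in data, not the actual Starbucks measurements: the sample size, coefficients, and noise model are all assumptions chosen only to mimic residuals that fan out as calories increase.

```r
# Simulated stand-in data (NOT the Starbucks measurements): the noise
# standard deviation grows with calories, mimicking a fan-shaped residual plot.
set.seed(1)
calories <- seq(80, 500, length.out = 100)
carb <- 8 + 0.1 * calories + rnorm(100, sd = 0.02 * calories)

fit <- lm(carb ~ calories)
res <- resid(fit)

# Compare the residual spread in the lower and upper halves of the calorie range.
lower <- res[calories <= median(calories)]
upper <- res[calories > median(calories)]
spread <- c(lower = sd(lower), upper = sd(upper))
print(spread)
```

A clearly larger spread in the upper half points to non-constant variability. On real data, plot(calories, res) and hist(res) would give the visual versions of the same checks.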
7.26 Body measurements, Part III. Exercise 7.15 introduces data on shoulder girth and height of a group of individuals. The mean shoulder girth is 107.20 cm with a standard deviation of 10.37 cm. The mean height is 171.14 cm with a standard deviation of 9.41 cm. The correlation between height and shoulder girth is 0.67.
fivenum(bdims$sho.gi)
## [1] 85.90 99.45 108.20 116.55 134.80
fivenum(bdims$hgt)
## [1] 147.2 163.8 170.3 177.8 198.1
avg.girth<-107.2
sd.girth<-10.37
avg.height<-171.14
sd.height<-9.41
corr.girth.ht<-0.67
(b1<-((sd.height)/(sd.girth)) * corr.girth.ht)
## [1] 0.6079749
(b0<-avg.height-b1*avg.girth)
## [1] 105.9651
height = 105.9651 + 0.6079749 x girth
The intercept of 105.96 cm represents the predicted height when shoulder girth = 0. Since a shoulder girth of 0 is not possible, the intercept cannot be taken literally in this context; it serves only to adjust the height of the regression line. The minimum shoulder girth in this data set is about 85.9 cm, so a prediction at a shoulder girth of 0 cm would be an extrapolation far outside the range of the data.
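For reference, the two R computations above correspond to the standard least squares formulas for a line fitted from summary statistics. The notation here is an assumption, not taken from the exercise: x is shoulder girth, y is height, s denotes a standard deviation, and R is the correlation.

\[ b_1 = \frac{s_y}{s_x} R = \frac{9.41}{10.37} \times 0.67 \approx 0.608 \]

\[ b_0 = \bar{y} - b_1 \bar{x} = 171.14 - b_1 \times 107.20 \approx 105.97 \]

The second formula follows from the fact that the least squares line always passes through the point of means.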
(r2<-corr.girth.ht^2)
## [1] 0.4489
The R2 for this regression model is 44.89%, which means that about 44.89% of the variability in height is explained by shoulder girth.
(ht=b0+b1*100)
## [1] 166.7626
The predicted height of this student is about 166.76 cm.
\[ e = y - \widehat{y} \]
ht.predicted<-ht
ht.actual<-160
(e<-ht.actual-ht.predicted)
## [1] -6.762581
This residual of -6.76 means that the model overestimates this student's height: the predicted height is 6.76 cm more than the actual height.
A shoulder girth of 56 cm is much lower than the minimum shoulder girth of 85.90 cm in this data set, so using this regression model for this case would involve extrapolation. Assuming that the same linear model holds outside the range of this dataset is not appropriate, since the relationship might be different in a different data range.
(ht=b0+b1*56)
## [1] 140.0117
The predicted height of 140 cm is much higher than the average height of one-year-old children. This further reinforces the point that the above model is likely based on adults and may not hold for children.
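The extrapolation concern can be made explicit in code. This is a minimal sketch: predict_height() is a hypothetical helper written for illustration (it is not part of any package), and the girth range is taken from the fivenum(bdims$sho.gi) output shown earlier.

```r
# Fitted line from the exercise's summary statistics.
b1 <- (9.41 / 10.37) * 0.67
b0 <- 171.14 - b1 * 107.20

# Observed shoulder-girth range (min and max from fivenum(bdims$sho.gi)).
girth_range <- c(85.90, 134.80)

# Hypothetical helper: predicts height and warns when the input lies
# outside the observed range, i.e. when the prediction is an extrapolation.
predict_height <- function(girth) {
  outside <- girth < girth_range[1] | girth > girth_range[2]
  if (any(outside)) {
    warning("girth outside observed range; prediction is an extrapolation")
  }
  b0 + b1 * girth
}

print(predict_height(100))                   # inside the range, no warning
print(suppressWarnings(predict_height(56)))  # extrapolated value
```

The warning does not make the prediction wrong by itself; it only flags that the linear model has not been checked in that region.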
7.30 Cats, Part I. The following regression output is for predicting the heart weight (in g) of cats from their body weight (in kg). The coefficients are estimated using a dataset of 144 domestic cats.
            Estimate  Std.Error  t value  Pr(>|t|)
(Intercept)   -0.357      0.692   -0.515     0.607
body wt        4.034      0.250   16.119     0.000
s = 1.452 R2 = 64.66% R2adj = 64.41%
str(cats)
## 'data.frame': 144 obs. of 3 variables:
## $ Sex: Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ Bwt: num 2 2 2 2.1 2.1 2.1 2.1 2.1 2.1 2.1 ...
## $ Hwt: num 7 7.4 9.5 7.2 7.3 7.6 8.1 8.2 8.3 8.5 ...
Write out the linear model. heart.wt = -0.357 + 4.034 x body.wt
Interpret the intercept. The intercept is the predicted heart weight when the body weight is 0. Since the body weight cannot possibly be 0, the intercept is not to be taken literally. It serves only to adjust the height of the line of best fit.
Interpret the slope. The slope of 4.034 means that for each additional kg of body weight, the cat's heart weight is expected to be higher by 4.034 g on average.
Interpret R2. The R2 of 64.66% means that about 64.66% of the variability in heart weight is explained by body weight, which is the only explanatory variable in this model.
Calculate the correlation coefficient.
(corr=sqrt(0.6466))
## [1] 0.8041144
The correlation coefficient is 0.80 (positive, because the slope is positive), which indicates a strong, positive relationship between the cat's body weight and heart weight.
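The regression output and the correlation can be cross-checked by refitting the model. This sketch assumes that cats is the data frame shipped with the MASS package, which matches the str(cats) output above (144 rows; Sex, Bwt, Hwt) but is not stated in the exercise.

```r
# Assumption: `cats` is the data set from the MASS package.
library(MASS)

fit <- lm(Hwt ~ Bwt, data = cats)  # heart weight (g) on body weight (kg)
coefs <- coef(summary(fit))        # estimate, std. error, t value, p-value
print(round(coefs, 3))

r2 <- summary(fit)$r.squared
r <- cor(cats$Bwt, cats$Hwt)

# In simple linear regression, R^2 equals the squared correlation,
# and the correlation takes the sign of the slope.
print(c(r2 = r2, r_squared = r^2, r = r))
```

Computing cor() directly avoids having to decide the sign by hand, which sqrt(R2) alone cannot provide.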
7.40 Rate my professor. Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching related characteristics, such as the physical appearance of the instructor.
Researchers at University of Texas, Austin collected data on teaching evaluation score (higher score means better) and standardized beauty score (a score of 0 means average, negative score means below average, and a positive score means above average) for a sample of 463 professors. The scatterplot below shows the relationship between these variables, and also provided is a regression output for predicting teaching evaluation score from beauty score.
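# The least squares line passes through (x-bar, y-bar), so b1 = (y-bar - b0) / x-bar, with x-bar = -0.0883, y-bar = 3.9983, b0 = 4.010
(b1 = (3.9983 - 4.010)/(-0.0883))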
## [1] 0.1325028
The slope is approximately 0.133.
Yes, the data provide convincing evidence that the slope of the relationship between teaching evaluation and beauty is positive, because the t-value for the slope estimate is large and the corresponding p-value is small.
The conditions required for linear regression are:
Linear relationship between explanatory and response variable: The first scatter plot indicates a mild, positive relationship between the two variables. The considerable scatter indicates that the relationship is weak.
Residuals are normally distributed or nearly normal: The histogram of residuals indicates a little skewness.
Constant variability in residuals: Based on the second scatter plot (residuals against beauty score), the residuals appear to be spread fairly randomly around 0. It does look like the variance of the residuals decreases somewhat as the beauty score moves from below average to above average.
Independent observations: This seems like a plausible assumption in this case, i.e. there is no reason to assume that the evaluations of different professors depend on each other. This is also borne out by the absence of a strong trend in the residuals plotted in order of data collection.