Source files: https://github.com/djlofland/DATA606_F2019/tree/master/Homework8
Nutrition at Starbucks, Part I. (8.22, p. 326) The scatterplot below shows the relationship between the number of calories and amount of carbohydrates (in grams) Starbucks food menu items contain. Since Starbucks only lists the number of calories on the display items, we are interested in predicting the amount of carbs a menu item has based on its calorie content.
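There is a positive, roughly linear relationship with moderate correlation. The spread of the residuals increases as calories increase, which could be problematic; the residuals themselves look approximately normal.
Explanatory variable: calories. Response variable: carbohydrates (in grams).
To quantify the relationship so that we can predict the carbohydrate content of menu items from their listed calories.
Mostly … the only concern is that the variability of the residuals increases with calorie count, so the constant-variability condition is questionable.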
Body measurements, Part I. (8.13, p. 316) Researchers studying anthropometry collected body girth measurements and skeletal diameter measurements, as well as age, weight, height and gender for 507 physically active individuals. The scatterplot below shows the relationship between height and shoulder girth (over deltoid muscles), both measured in centimeters.
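There is a moderately strong positive linear relationship, but the points look like they might form 2 segments (i.e. 2 lines) … possibly gender is a segmenting factor.
Changing the units does not change the relationship: its form, direction, and strength (the correlation) stay the same. Only the slope is rescaled; with girth in inches and height in centimeters, the slope would be 2.54 times larger.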
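The effect of a unit change can be checked numerically. The sketch below uses a small set of made-up girth and height values (hypothetical, not the study's data) and two helper functions, `pearson_r` and `ols_slope`, written here for illustration. It converts girth from centimeters to inches and compares the correlation and the least-squares slope before and after.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def ols_slope(xs, ys):
    """Least-squares slope for predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical values for illustration only; NOT the anthropometry data.
girth_cm = [95.0, 100.0, 105.0, 110.0, 115.0, 120.0]
height_cm = [160.0, 168.0, 166.0, 175.0, 172.0, 181.0]

CM_PER_INCH = 2.54
girth_in = [g / CM_PER_INCH for g in girth_cm]

r_cm = pearson_r(girth_cm, height_cm)
r_in = pearson_r(girth_in, height_cm)
slope_cm = ols_slope(girth_cm, height_cm)   # cm of height per cm of girth
slope_in = ols_slope(girth_in, height_cm)   # cm of height per inch of girth

print(f"r (girth in cm)     = {r_cm:.6f}")
print(f"r (girth in inches) = {r_in:.6f}")
print(f"slope ratio (in/cm) = {slope_in / slope_cm:.4f}")
```

The two correlations should agree up to floating-point error, while the slope ratio should come out at 2.54, the conversion factor.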
Body measurements, Part III. (8.24, p. 326) Exercise above introduces data on shoulder girth and height of a group of individuals. The mean shoulder girth is 107.20 cm with a standard deviation of 10.37 cm. The mean height is 171.14 cm with a standard deviation of 9.41 cm. The correlation between height and shoulder girth is 0.67.
\[\widehat{\text{height}} = 105.9650878 + 0.6079749 \times \text{girth}\]
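The coefficients in this equation follow from the summary statistics quoted in the exercise, using the standard least-squares identities: slope = r × (SD of response / SD of predictor), and an intercept chosen so the line passes through the point of means. A minimal sketch of that arithmetic (variable names are chosen here for illustration):

```python
# Summary statistics quoted in the exercise text.
mean_girth, sd_girth = 107.20, 10.37    # shoulder girth, cm
mean_height, sd_height = 171.14, 9.41   # height, cm
r = 0.67                                # correlation between height and girth

# Least-squares slope from the correlation and the two standard deviations.
slope = r * sd_height / sd_girth

# The fitted line passes through (mean_girth, mean_height).
intercept = mean_height - slope * mean_girth

# For simple linear regression, R^2 is the squared correlation.
r_squared = r ** 2

print(f"slope     = {slope:.7f}")
print(f"intercept = {intercept:.7f}")
print(f"R^2       = {r_squared:.4f}")
```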
Slope: for each additional 1 cm of shoulder girth, we expect height to be 0.608 cm greater on average.
Intercept: in theory, a shoulder girth of 0 cm corresponds to a height of about 106 cm. This is meaningless, because the observed shoulder girths stop well short of 0.
\(R^2 = 0.67^2 = 0.4489\): about 44.89% of the variability in height is explained by shoulder girth.
Predicted height: 166.7625805 cm
Residual: -6.7625805 cm. The negative residual means the model overestimated the height; the student is actually a little shorter than predicted.
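The prediction and residual can be reproduced from the fitted line. The exercise's inputs are not restated in this write-up, so the sketch below assumes a shoulder girth of 100 cm and an observed height of 160 cm. Both values are inferred from the reported prediction and residual and should be read as assumptions. Because it uses the rounded coefficients, the results match the reported figures only to about five decimal places.

```python
# Coefficients of the fitted line, as reported in the regression equation.
intercept = 105.9650878
slope = 0.6079749

def predict_height(girth_cm):
    """Predicted height (cm) for a given shoulder girth (cm)."""
    return intercept + slope * girth_cm

# Assumed inputs, inferred from the reported prediction and residual.
girth = 100.0             # shoulder girth in cm (assumption)
observed_height = 160.0   # observed height in cm (assumption)

predicted = predict_height(girth)
residual = observed_height - predicted   # observed minus predicted

print(f"predicted height = {predicted:.4f} cm")
print(f"residual         = {residual:.4f} cm")
```

A negative residual means the observed value lies below the fitted line, so the model overestimated.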
No. That would mean extrapolating beyond the range of shoulder girths in our dataset.
Cats, Part I. (8.26, p. 327) The following regression output is for predicting the heart weight (in g) of cats from their body weight (in kg). The coefficients are estimated using a dataset of 144 domestic cats.
##
## Call:
## lm(formula = cats$Hwt ~ cats$Bwt)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.5694 -0.9634 -0.0921  1.0426  5.1238
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -0.3567     0.6923  -0.515    0.607
## cats$Bwt      4.0341     0.2503  16.119   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.452 on 142 degrees of freedom
## Multiple R-squared: 0.6466, Adjusted R-squared: 0.6441
## F-statistic: 259.8 on 1 and 142 DF, p-value: < 2.2e-16
\[\widehat{\text{heart\_weight}} = -0.3567 + 4.0341 \times \text{body\_weight}\]
The intercept has no meaningful interpretation, since body_weight = 0 has no physical meaning and we would be extrapolating beyond the available data points. The intercept merely anchors the straight line drawn through the available data points.
For each 1 kg increase in body weight, we estimate an average increase of 4.0341 g in heart weight.
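Body weight explains 64.66% of the variation in heart weight (Multiple R-squared = 0.6466). The remaining ~35% of variation might be due to other (unknown) causes or just random variation; we cannot tell from this model.
## [1] 0.8041144
The correlation coefficient is approximately 0.8041, the positive square root of \(R^2\), since the slope is positive.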
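For a simple linear regression, the correlation can be recovered from the regression output as the square root of the Multiple R-squared, with the sign of the slope. A minimal sketch using the rounded values printed in the output above (so the result is only as precise as those inputs):

```python
import math

# Values read from the regression output for the cats data.
r_squared = 0.6466   # Multiple R-squared
slope = 4.0341       # estimated coefficient for body weight

# |r| is sqrt(R^2); r takes the sign of the slope.
r = math.copysign(math.sqrt(r_squared), slope)

print(f"r = {r:.7f}")
```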
Rate my professor. (8.44, p. 340) Many college courses conclude by giving students the opportunity to evaluate the course and the instructor anonymously. However, the use of these student evaluations as an indicator of course quality and teaching effectiveness is often criticized because these measures may reflect the influence of non-teaching related characteristics, such as the physical appearance of the instructor. Researchers at University of Texas, Austin collected data on teaching evaluation score (higher score means better) and standardized beauty score (a score of 0 means average, negative score means below average, and a positive score means above average) for a sample of 463 professors. The scatterplot below shows the relationship between these variables, and also provided is a regression output for predicting teaching evaluation score from beauty score.
##
## Call:
## lm(formula = eval ~ beauty)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.80015 -0.36304  0.07254  0.40207  1.10373
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  4.01002    0.02551 157.205  < 2e-16 ***
## beauty       0.13300    0.03218   4.133 4.25e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5455 on 461 degrees of freedom
## Multiple R-squared: 0.03574, Adjusted R-squared: 0.03364
## F-statistic: 17.08 on 1 and 461 DF, p-value: 4.247e-05
Slope: 0.13300, Intercept: 4.01002
While beauty is a statistically significant predictor of evaluation score (p = 0.00004247, far below 0.05), the trend explains very little of the variation we see in the data: beauty accounts for only ~3.574% of the observed variation (Multiple R-squared = 0.03574). This suggests that while beauty is a factor, it is a minor one at best.
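Two of the figures behind this conclusion can be checked directly from the regression output: the t value for the slope is the estimate divided by its standard error, and the share of variation explained is the Multiple R-squared expressed as a percentage. A small sketch of that arithmetic, using the rounded values printed in the output:

```python
# Values read from the regression output for eval ~ beauty.
estimate = 0.13300     # slope estimate for beauty
std_error = 0.03218    # standard error of the slope
r_squared = 0.03574    # Multiple R-squared

# Test statistic for H0: slope = 0.
t_value = estimate / std_error

# Share of variation in evaluation score explained by beauty.
percent_explained = 100 * r_squared

print(f"t value           = {t_value:.3f}")
print(f"percent explained = {percent_explained:.3f}%")
```

A large t value alongside a small R-squared is what one would expect with a large sample (461 residual degrees of freedom here): the slope is estimated precisely even though it explains little of the scatter.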
Conditions: 1. residuals are normally distributed, 2. residuals are similar across values of the explanatory variable with no clear trends, 3. no outliers with high leverage or influence, 4. the data in scatterplot form appear linear with no obvious non-linear patterns.
Checks: the residuals appear randomly distributed, with no obvious patterns across the domain of the beauty scores. There are no obvious outliers that might affect the linear relationship. Beauty and eval appear to have a rough linear trend.