Some homework problems are taken from the book. If you do not own the book, these questions can be found in the BAS320HomeworkPacket.pdf file up in the Homework section on Blackboard. Some questions may be slightly modified, so always answer the requested questions as written up here when they are provided. Write your (brief) answer immediately after the : (if R code is required, have the R code be AFTER your completed response).
Question 1: Ch 4 Exercise 3 (p. 191). Use the AUTO dataset. We have discussed this dataset on many occasions, and the goal is to predict FuelEfficiency from a car’s characteristics (the units of all values are in metric and not in standard US measurements).
data(AUTO)
a:: Make the scatterplot matrix of the variables in the dataframe. Do the relationships of each predictor with FuelEfficiency look roughly linear? If not, which ones don’t?
Response: Weight is the only predictor that shows a linear relationship. All other predictors show curvature.
pairs(FuelEfficiency~., data = AUTO)
b: Fit the “full model” (with all variables in it) using the “twidle dot” shortcut. Examine summary as well as check_regression on the model (with extra=TRUE). Discuss at least one issue with the regression that potentially makes it a bad model.
Response: The QQ plot shows the plotted points to be in the confidence intervals, except for a couple of points, but the thing that makes me question this model is the normality of the histogram. It is unimodel, but definitely shows some inconsistencies with regards to normality.
M <- lm(FuelEfficiency~. , data = AUTO)
summary(M)
##
## Call:
## lm(formula = FuelEfficiency ~ ., data = AUTO)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0108 -2.7731 0.2733 1.8362 11.9854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 192.43775 23.53161 8.178 4.62e-12 ***
## CabVolume -0.01565 0.02283 -0.685 0.495
## Horsepower 0.39221 0.08141 4.818 7.13e-06 ***
## TopSpeed -1.29482 0.24477 -5.290 1.11e-06 ***
## Weight -1.85980 0.21336 -8.717 4.22e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.653 on 77 degrees of freedom
## Multiple R-squared: 0.8733, Adjusted R-squared: 0.8667
## F-statistic: 132.7 on 4 and 77 DF, p-value: < 2.2e-16
check_regression(M, extra=TRUE)
##
## Tests of Assumptions: ( sample size n = 82 ):
## Linearity
## p-value for CabVolume : 0.4865
## p-value for Horsepower : 0
## p-value for TopSpeed : 1e-04
## p-value for Weight : 0
## p-value for overall model : 0.2518
## Equal Spread: p-value is 2e-04
## Normality: p-value is 0.0032
##
## Advice: if n<25 then all tests must be passed.
## If n >= 25 and test is failed, refer to diagnostic plot to see if violation is severe
## or is small enough to be ignored.
##
## Press [enter] to continue to Predictor vs. Residuals plots or q (then Return) to quit ( 4 plots to show )
c: *Nonlinearities may exist in the relationships, so let us try polynomial fits. Fit four separate simple linear regressions (one for each predictor) and run choose_order (default max order of 6) on each. Report the selected orders for each predictor based on the 0.005 heuristic.
Cabvolume 3
Horsepower 4
TopSpeed 2
Weight 4
P <- lm(FuelEfficiency~CabVolume, data = AUTO)
P2 <- lm(FuelEfficiency~Horsepower, data = AUTO)
P3 <- lm(FuelEfficiency~TopSpeed, data = AUTO)
P4 <- lm(FuelEfficiency~Weight, data = AUTO)
choose_order(P, max.order = 6)
choose_order(P2, max.order = 6)
choose_order(P3, max.order = 6)
choose_order(P4, max.order = 6)
d: Fit a multiple regression model with the relevant polynomial terms and call it M.poly. Report \(R^2_{adj}\) and the \(RMSE\) of this model and compare them to the model in b:. Comment on whether these extra terms have “significantly” improved the model. Note: your command should look something like (replace the question marks with the selected orders):
Response: New: Adjusted R-squared: 0.9113, Residual standard error: 2.979 Old: Adjusted R-squared: 0.8667, Residual standard error: 3.653 I would say they have improved the model, but “significantly” is a relative term and I do not know what counts as significant. Personally I would say it is significant because of how much the RMSE is reduced more than the increase in the p-value, but both are good indicators of a significant improvement.
M.poly <- lm(FuelEfficiency~poly(CabVolume,3)+poly(Horsepower,4)+ poly(TopSpeed,2)+poly(Weight,4),data=AUTO)
summary(M.poly)
##
## Call:
## lm(formula = FuelEfficiency ~ poly(CabVolume, 3) + poly(Horsepower,
## 4) + poly(TopSpeed, 2) + poly(Weight, 4), data = AUTO)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4999 -1.6573 -0.1364 1.4637 9.5628
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.7817 0.3290 102.692 <2e-16 ***
## poly(CabVolume, 3)1 -1.9769 4.1312 -0.479 0.6338
## poly(CabVolume, 3)2 9.9651 4.5269 2.201 0.0311 *
## poly(CabVolume, 3)3 -4.1992 3.4636 -1.212 0.2296
## poly(Horsepower, 4)1 30.0910 142.3503 0.211 0.8332
## poly(Horsepower, 4)2 -20.4550 38.4338 -0.532 0.5963
## poly(Horsepower, 4)3 -7.2160 7.4455 -0.969 0.3359
## poly(Horsepower, 4)4 7.2371 4.2185 1.716 0.0908 .
## poly(TopSpeed, 2)1 -30.5153 94.9502 -0.321 0.7489
## poly(TopSpeed, 2)2 26.3873 22.6416 1.165 0.2479
## poly(Weight, 4)1 -92.4703 62.4722 -1.480 0.1434
## poly(Weight, 4)2 12.6912 9.3905 1.352 0.1810
## poly(Weight, 4)3 -0.8139 3.9999 -0.203 0.8394
## poly(Weight, 4)4 8.6876 3.8339 2.266 0.0266 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.979 on 68 degrees of freedom
## Multiple R-squared: 0.9256, Adjusted R-squared: 0.9113
## F-statistic: 65.05 on 13 and 68 DF, p-value: < 2.2e-16
e: Run check_regression on the polynomial model (be sure to add extra=TRUE). Comment on whether you think the model’s assumptions are satisfied.
Response: The assumption for Linearity of the overall model pass with a p-value larger then .05, but normality and equal spread of the model fails with a p-value smller than .05. Overall I do not think the assumptions are satisfied.
check_regression(M.poly, extra = TRUE)
##
## Tests of Assumptions: ( sample size n = 82 ):
## Linearity
## p-value for poly(CabVolume, 3)1 : 0.1644
## p-value for poly(CabVolume, 3)2 : 0.0527
## p-value for poly(CabVolume, 3)3 : 0.1821
## p-value for poly(Horsepower, 4)1 : 0
## p-value for poly(Horsepower, 4)2 : 0
## p-value for poly(Horsepower, 4)3 : 0
## p-value for poly(Horsepower, 4)4 : 0
## p-value for poly(TopSpeed, 2)1 : 1e-04
## p-value for poly(TopSpeed, 2)2 : 0
## p-value for poly(Weight, 4)1 : 0
## p-value for poly(Weight, 4)2 : 0
## p-value for poly(Weight, 4)3 : 0
## p-value for poly(Weight, 4)4 : 0
## p-value for overall model : 0.0822
## Equal Spread: p-value is 0.0445
## Normality: p-value is 0.03
##
## Advice: if n<25 then all tests must be passed.
## If n >= 25 and test is failed, refer to diagnostic plot to see if violation is severe
## or is small enough to be ignored.
##
## Press [enter] to continue to Predictor vs. Residuals plots or q (then Return) to quit ( 13 plots to show )
##
## Press [enter] to continue to Predictor vs. Residuals plots or q (then Return) to quit ( 4 sets of plots to go )
##
## Press [enter] to continue to Predictor vs. Residuals plots or q (then Return) to quit ( 3 sets of plots to go )
##
## Press [enter] to continue to Predictor vs. Residuals plots or q (then Return) to quit ( 2 sets of plots to go )
##
## Press [enter] to continue to Predictor vs. Residuals plots or q (then Return) to quit ( 1 sets of plots to go )
Question 2: Ch 4 Exercise 4 (p. 191). Use EX4.BIKE.
data("EX4.BIKE")
a: Fit a model predicting Demand from all predictors and look at its summary. Examine the influence plot for this model.
Z <- lm(Demand~., data = EX4.BIKE)
summary(Z)
##
## Call:
## lm(formula = Demand ~ ., data = EX4.BIKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2966.6 -984.7 -195.6 1129.8 3420.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4884.152 483.998 10.091 < 2e-16 ***
## AvgTemp 63.309 59.831 1.058 0.291
## EffectiveAvgTemp 68.452 55.547 1.232 0.219
## AvgHumidity -33.945 5.244 -6.473 2.76e-10 ***
## AvgWindspeed -76.116 14.343 -5.307 1.83e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1365 on 409 degrees of freedom
## Multiple R-squared: 0.4006, Adjusted R-squared: 0.3947
## F-statistic: 68.33 on 4 and 409 DF, p-value: < 2.2e-16
influence_plot(Z)
## $Leverage
## [1] 337
a1: There is one extremely influential point in this dataset. Show the influential row, then explain what makes it so unusual through the use of the scatterplot matrix (its not possible to see why its unusual with summary on the individuals variables).
Response: This point is so unusual because on 3 out of the 4 predictors, this point is in the farthest quartile. It is also in the largest quartile for AvgTemp and in the lowest for EffectiveAvgTemp, which just doesnt make sense to have 2 opposite ends for redundent predictors. It is just an unusual point. AvgHumidity is the only predictor showing it to be a “regular” point.
#EX4.BIKE[?,] #Replacing the ? with the row number will print out that row to the screen
#pairs() #scatterplot matrix
#summary(EX4.BIKE) #summaries of each variable
EX4.BIKE[337,]
## Demand AvgTemp EffectiveAvgTemp AvgHumidity AvgWindspeed
## 337 7148 29.65667 12.12 57.08333 15.50073
pairs(Demand~., data = EX4.BIKE)
summary(EX4.BIKE)
## Demand AvgTemp EffectiveAvgTemp AvgHumidity
## Min. : 705 Min. : 4.407 Min. : 5.083 Min. :18.79
## 1st Qu.:3645 1st Qu.:14.837 1st Qu.:18.071 1st Qu.:52.00
## Median :4736 Median :21.388 Median :25.473 Median :62.10
## Mean :4808 Mean :20.915 Mean :24.406 Mean :62.25
## 3rd Qu.:6200 3rd Qu.:27.009 3rd Qu.:30.699 3rd Qu.:72.26
## Max. :8714 Max. :35.328 Max. :40.246 Max. :97.04
## AvgWindspeed
## Min. : 1.500
## 1st Qu.: 9.167
## Median :11.896
## Mean :12.585
## 3rd Qu.:15.323
## Max. :34.000
a2: The influential point represents a data error, so we are justified in removing it. Re-fit the model without this point and examine the results of summary on the model. Comment on the change in \(R^2\) and \(RMSE\) (has it gotten better by throwing away the point?) and the significance of predictors using this new model.
Response: Yes, the R^2, RMSE, and R^2adj have all increased. Also, each predictor has a significant p-value after removing row 337. Previously, AvgTemp and EffectiveAvgTemp were not statistically significant in the multiple regression, but they are now. Old: R^2= 0.4006 RMSE= 1365 R^2adj= 0.3947 New: R^2= 0.4223 RMSE= 1339 R^2adj= 0.4166
M.new <- lm(Demand~.,data=EX4.BIKE[-337,])
#replace ? by the offending row number so the model is fit WITHOUT that row
summary(M.new)
##
## Call:
## lm(formula = Demand ~ ., data = EX4.BIKE[-337, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -3051.2 -955.7 -220.3 1123.0 3162.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4107.533 510.360 8.048 9.26e-15 ***
## AvgTemp -376.312 121.236 -3.104 0.00204 **
## EffectiveAvgTemp 476.857 112.611 4.235 2.83e-05 ***
## AvgHumidity -36.682 5.186 -7.073 6.62e-12 ***
## AvgWindspeed -64.506 14.344 -4.497 8.99e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1339 on 408 degrees of freedom
## Multiple R-squared: 0.4223, Adjusted R-squared: 0.4166
## F-statistic: 74.56 on 4 and 408 DF, p-value: < 2.2e-16
summary(lm(Demand~.,data=EX4.BIKE))
##
## Call:
## lm(formula = Demand ~ ., data = EX4.BIKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2966.6 -984.7 -195.6 1129.8 3420.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4884.152 483.998 10.091 < 2e-16 ***
## AvgTemp 63.309 59.831 1.058 0.291
## EffectiveAvgTemp 68.452 55.547 1.232 0.219
## AvgHumidity -33.945 5.244 -6.473 2.76e-10 ***
## AvgWindspeed -76.116 14.343 -5.307 1.83e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1365 on 409 degrees of freedom
## Multiple R-squared: 0.4006, Adjusted R-squared: 0.3947
## F-statistic: 68.33 on 4 and 409 DF, p-value: < 2.2e-16
b: The strength of the relationship between Demand and the other predictors may not be a constant. For example, the strength of the relationship between Demand and AvgTemp may depend on AvgWindspeed (on warm days, wind speed may not matter, but on cold days, wind speed may matter a lot). Fit a model with all two-way interactions (but without that influential point) and use see_interactions to scan for possible interactions (note: add the argument many=TRUE or there will be too many plots). Does it look like we should include interactions into the model? Why or why not?
Response: No, the interaction between AvgTemp and EffectiveAvgTemp is the only interaction with a p-value that is statistically signific, but when analyzing it visually it looks to have an interaction that is weak. I do not think we should include interactions into the model.
#model with all two way interactions omitting the offending row
#see interactions, with argument extra=TRUE
T <- lm(Demand~AvgTemp*AvgWindspeed, data = EX4.BIKE[-337,])
M.int <- lm(Demand~.^2, data = EX4.BIKE[-337,])
summary(M.int)
##
## Call:
## lm(formula = Demand ~ .^2, data = EX4.BIKE[-337, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2791.5 -929.1 -148.5 1092.5 3008.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2904.853 2041.973 1.423 0.15564
## AvgTemp 128.474 699.806 0.184 0.85443
## EffectiveAvgTemp 398.094 647.452 0.615 0.53899
## AvgHumidity -89.059 29.648 -3.004 0.00283 **
## AvgWindspeed -96.565 67.110 -1.439 0.15095
## AvgTemp:EffectiveAvgTemp -10.210 1.418 -7.201 2.97e-12 ***
## AvgTemp:AvgHumidity -2.737 8.906 -0.307 0.75871
## AvgTemp:AvgWindspeed 23.087 20.796 1.110 0.26761
## EffectiveAvgTemp:AvgHumidity 3.535 8.426 0.420 0.67506
## EffectiveAvgTemp:AvgWindspeed -22.274 18.733 -1.189 0.23513
## AvgHumidity:AvgWindspeed 1.231 0.815 1.511 0.13159
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1266 on 402 degrees of freedom
## Multiple R-squared: 0.4912, Adjusted R-squared: 0.4786
## F-statistic: 38.82 on 10 and 402 DF, p-value: < 2.2e-16
see_interactions(M.int)
c: Fit a model predicting Demand from AvgTemp and AvgWindspeed, including the interaction (but omitting the offensive row). Examine the results of summary and visualize_model. Is the interaction statistically significant?
Response: No, the p-value of the interaction term is 0.4288
#model with all two way interactions omitting the offending row
#see interactions, with argument many=TRUE
B <- lm(Demand~AvgTemp*AvgWindspeed, data = EX4.BIKE[-337,])
summary(B)
##
## Call:
## lm(formula = Demand ~ AvgTemp * AvgWindspeed, data = EX4.BIKE[-337,
## ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -3550.1 -1083.8 -198.1 1096.3 3527.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3096.950 598.465 5.175 3.58e-07 ***
## AvgTemp 110.519 28.709 3.850 0.000137 ***
## AvgWindspeed -83.023 42.864 -1.937 0.053446 .
## AvgTemp:AvgWindspeed 1.715 2.165 0.792 0.428764
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1432 on 409 degrees of freedom
## Multiple R-squared: 0.3381, Adjusted R-squared: 0.3332
## F-statistic: 69.63 on 3 and 409 DF, p-value: < 2.2e-16
visualize_model(B)
##
## Interaction term has p-value 0.4288
Question 3: Ch 4 Exercise 5 (p. 192). Use TIPS. The percentage tip left on a bill TipPercentage has an interesting relationship with the bill amount Bill} and number of people at the table PartySize.
data(TIPS)
a: Fit a model predicting the tip percentage from the bill amount and size of party, including the interaction. Look at visualize_model and summary. You’ll see that the interaction is significant but PartySize itself is not. Would we be justified in dropping the PartySize term from the model? Why or why not?
Response: No, we would not be justified, because of the signifiance of the interaction with other predictors in the model (specifically Bill). Party size is a fairly week interaction compared to how strong the bill interaction is, but it is strong enough to state that we need to keep it in the model.
Q <- lm(TipPercentage~ Bill*PartySize, data= TIPS)
visualize_model(Q)
##
## Interaction term has p-value 0.02681
summary(Q)
##
## Call:
## lm(formula = TipPercentage ~ Bill * PartySize, data = TIPS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.570 -3.406 -0.548 2.448 51.173
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.80750 2.86608 9.004 < 2e-16 ***
## Bill -0.52721 0.12608 -4.182 4.06e-05 ***
## PartySize -1.75579 1.16011 -1.513 0.1315
## Bill:PartySize 0.09334 0.04190 2.228 0.0268 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.706 on 240 degrees of freedom
## Multiple R-squared: 0.1382, Adjusted R-squared: 0.1274
## F-statistic: 12.83 on 3 and 240 DF, p-value: 8.413e-08
b: Write out the overall regression equation. Round each coefficient to two decimal places.
Response: TipPercentage = 25.8075 - 0.52721Bill - 1.75579PartySize + 0.09334Bill:PartySize
c: Now write out the implicit regression equations between tip percentage and bill for parties of size 1 and for parties of size 4.
Equation when party is size 1 TipPercentage = 24.05171 - 0.52721Bill + 0.09334Bill
Equation when party is size 4 TipPercentage = 18.7847 - 0.52721Bill + 0.37337Bill
d: What is the story that the data tells us? In other words, describe how the relationship between tip percentage and bill amount changes as the party size gets bigger. Does it get weaker, stronger? Is it positive, negative?
Response: The data teslls us that the interaction between TipPercentage and PartySize is strongly dependent on the Bill size. It also tells us that the interaction betweeen TipPercentage and Bill is depend on PartySize, but not strongly dependent. As party size gets larger it gets stronger. The relationship is negative, but it gets less negative as PartySize gets larger.
e: A better predictor of tip percentage may be the bill amount per person Define a variable to be Bill divided by PartySize (call it BPP, the bill per person, and add it to the TIPS dataframe). Then fit a simple linear regression predicting tip percentage from it. Examine the summary output, and comment on whether this is a better model than predicting tip percentage from bill, party size, and the interaction from a:
Response: It is not a better model, because the R^2, R^2adj, and the RMSE are all worse then the orinal interaction.However, both are statistically significant. The R^2 adjusted is almost the same as the R^2 on both models. We can see the R^2 and the adj both are higher on the old and the RMSE is smaller so the orignial interaction is the best.
OLD:R^2= 0.1382 RMSE= 5.706 R^2adj= 0.1274 New:R^2= 0.09864 RMSE= 5.811 R^2adj= 0.09491
PL\(NewV <- PL\)x2/PL$x3
#Code defining and adding bill per person to the dataframe
#Code fitting the simple linear regression
#Code looking at summary
BPP <- (TIPS$Bill/TIPS$PartySize)
V <- lm(TipPercentage~BPP, data = TIPS)
summary(V)
##
## Call:
## lm(formula = TipPercentage ~ BPP, data = TIPS)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.694 -3.235 -0.624 2.437 52.113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21.2734 1.0753 19.783 < 2e-16 ***
## BPP -0.6582 0.1279 -5.146 5.5e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.811 on 242 degrees of freedom
## Multiple R-squared: 0.09864, Adjusted R-squared: 0.09491
## F-statistic: 26.48 on 1 and 242 DF, p-value: 5.502e-07