Homework 7 - T/R - James Reagan Matlock

Some homework problems are taken from the book. If you do not own the book, these questions can be found in the BAS320HomeworkPacket.pdf file up in the Homework section on Blackboard. Some questions may be slightly modified, so always answer the requested questions as written up here when they are provided. Write your (brief) answer immediately after the : (if R code is required, have the R code be AFTER your completed response).

Question 1: Ch 4 Exercise 3 (p. 191). Use the AUTO dataset. We have discussed this dataset on many occasions, and the goal is to predict FuelEfficiency from a car’s characteristics (the units of all values are in metric and not in standard US measurements).

data(AUTO)

a:: Make the scatterplot matrix of the variables in the dataframe. Do the relationships of each predictor with FuelEfficiency look roughly linear? If not, which ones don’t?

Response: Weight is the only predictor that shows a linear relationship. All other predictors show curvature.

pairs(FuelEfficiency~., data = AUTO)

b: Fit the “full model” (with all variables in it) using the “twidle dot” shortcut. Examine summary as well as check_regression on the model (with extra=TRUE). Discuss at least one issue with the regression that potentially makes it a bad model.

Response: The QQ plot shows the plotted points to be in the confidence intervals, except for a couple of points, but the thing that makes me question this model is the normality of the histogram. It is unimodel, but definitely shows some inconsistencies with regards to normality.

M <- lm(FuelEfficiency~. , data = AUTO)
summary(M)
## 
## Call:
## lm(formula = FuelEfficiency ~ ., data = AUTO)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0108 -2.7731  0.2733  1.8362 11.9854 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 192.43775   23.53161   8.178 4.62e-12 ***
## CabVolume    -0.01565    0.02283  -0.685    0.495    
## Horsepower    0.39221    0.08141   4.818 7.13e-06 ***
## TopSpeed     -1.29482    0.24477  -5.290 1.11e-06 ***
## Weight       -1.85980    0.21336  -8.717 4.22e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.653 on 77 degrees of freedom
## Multiple R-squared:  0.8733, Adjusted R-squared:  0.8667 
## F-statistic: 132.7 on 4 and 77 DF,  p-value: < 2.2e-16
check_regression(M, extra=TRUE)

## 
## Tests of Assumptions: ( sample size n = 82 ):
## Linearity
##    p-value for CabVolume : 0.4865 
##    p-value for Horsepower : 0 
##    p-value for TopSpeed : 1e-04 
##    p-value for Weight : 0 
##    p-value for overall model : 0.2518 
## Equal Spread:  p-value is 2e-04 
## Normality:  p-value is 0.0032 
## 
## Advice:  if n<25 then all tests must be passed.
## If n >= 25 and test is failed, refer to diagnostic plot to see if violation is severe
##  or is small enough to be ignored.
## 
## Press [enter] to continue to Predictor vs. Residuals plots or q (then Return) to quit ( 4 plots to show )

c: *Nonlinearities may exist in the relationships, so let us try polynomial fits. Fit four separate simple linear regressions (one for each predictor) and run choose_order (default max order of 6) on each. Report the selected orders for each predictor based on the 0.005 heuristic.

Cabvolume 3

Horsepower 4

TopSpeed 2

Weight 4

P <- lm(FuelEfficiency~CabVolume, data = AUTO)
P2 <- lm(FuelEfficiency~Horsepower, data = AUTO)
P3 <- lm(FuelEfficiency~TopSpeed, data = AUTO)
P4 <- lm(FuelEfficiency~Weight, data = AUTO)
choose_order(P, max.order = 6)
choose_order(P2, max.order = 6)
choose_order(P3, max.order = 6)
choose_order(P4, max.order = 6)

d: Fit a multiple regression model with the relevant polynomial terms and call it M.poly. Report $R^2_{adj}$ and the $RMSE$ of this model and compare them to the model in b:. Comment on whether these extra terms have “significantly” improved the model. Note: your command should look something like (replace the question marks with the selected orders):

Response: New: Adjusted R-squared: 0.9113, Residual standard error: 2.979 Old: Adjusted R-squared: 0.8667, Residual standard error: 3.653 I would say they have improved the model, but “significantly” is a relative term and I do not know what counts as significant. Personally I would say it is significant because of how much the RMSE is reduced more than the increase in the p-value, but both are good indicators of a significant improvement.

M.poly <- lm(FuelEfficiency~poly(CabVolume,3)+poly(Horsepower,4)+ poly(TopSpeed,2)+poly(Weight,4),data=AUTO)
summary(M.poly)
## 
## Call:
## lm(formula = FuelEfficiency ~ poly(CabVolume, 3) + poly(Horsepower, 
##     4) + poly(TopSpeed, 2) + poly(Weight, 4), data = AUTO)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.4999 -1.6573 -0.1364  1.4637  9.5628 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           33.7817     0.3290 102.692   <2e-16 ***
## poly(CabVolume, 3)1   -1.9769     4.1312  -0.479   0.6338    
## poly(CabVolume, 3)2    9.9651     4.5269   2.201   0.0311 *  
## poly(CabVolume, 3)3   -4.1992     3.4636  -1.212   0.2296    
## poly(Horsepower, 4)1  30.0910   142.3503   0.211   0.8332    
## poly(Horsepower, 4)2 -20.4550    38.4338  -0.532   0.5963    
## poly(Horsepower, 4)3  -7.2160     7.4455  -0.969   0.3359    
## poly(Horsepower, 4)4   7.2371     4.2185   1.716   0.0908 .  
## poly(TopSpeed, 2)1   -30.5153    94.9502  -0.321   0.7489    
## poly(TopSpeed, 2)2    26.3873    22.6416   1.165   0.2479    
## poly(Weight, 4)1     -92.4703    62.4722  -1.480   0.1434    
## poly(Weight, 4)2      12.6912     9.3905   1.352   0.1810    
## poly(Weight, 4)3      -0.8139     3.9999  -0.203   0.8394    
## poly(Weight, 4)4       8.6876     3.8339   2.266   0.0266 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.979 on 68 degrees of freedom
## Multiple R-squared:  0.9256, Adjusted R-squared:  0.9113 
## F-statistic: 65.05 on 13 and 68 DF,  p-value: < 2.2e-16

e: Run check_regression on the polynomial model (be sure to add extra=TRUE). Comment on whether you think the model’s assumptions are satisfied.

Response: The assumption for Linearity of the overall model pass with a p-value larger then .05, but normality and equal spread of the model fails with a p-value smller than .05. Overall I do not think the assumptions are satisfied.

check_regression(M.poly, extra = TRUE)

## 
## Tests of Assumptions: ( sample size n = 82 ):
## Linearity
##    p-value for poly(CabVolume, 3)1 : 0.1644 
##    p-value for poly(CabVolume, 3)2 : 0.0527 
##    p-value for poly(CabVolume, 3)3 : 0.1821 
##    p-value for poly(Horsepower, 4)1 : 0 
##    p-value for poly(Horsepower, 4)2 : 0 
##    p-value for poly(Horsepower, 4)3 : 0 
##    p-value for poly(Horsepower, 4)4 : 0 
##    p-value for poly(TopSpeed, 2)1 : 1e-04 
##    p-value for poly(TopSpeed, 2)2 : 0 
##    p-value for poly(Weight, 4)1 : 0 
##    p-value for poly(Weight, 4)2 : 0 
##    p-value for poly(Weight, 4)3 : 0 
##    p-value for poly(Weight, 4)4 : 0 
##    p-value for overall model : 0.0822 
## Equal Spread:  p-value is 0.0445 
## Normality:  p-value is 0.03 
## 
## Advice:  if n<25 then all tests must be passed.
## If n >= 25 and test is failed, refer to diagnostic plot to see if violation is severe
##  or is small enough to be ignored.
## 
## Press [enter] to continue to Predictor vs. Residuals plots or q (then Return) to quit ( 13 plots to show )

## 
## Press [enter] to continue to Predictor vs. Residuals plots or q (then Return) to quit ( 4 sets of plots to go )

## 
## Press [enter] to continue to Predictor vs. Residuals plots or q (then Return) to quit ( 3 sets of plots to go )

## 
## Press [enter] to continue to Predictor vs. Residuals plots or q (then Return) to quit ( 2 sets of plots to go )

## 
## Press [enter] to continue to Predictor vs. Residuals plots or q (then Return) to quit ( 1 sets of plots to go )

Question 2: Ch 4 Exercise 4 (p. 191). Use EX4.BIKE.

data("EX4.BIKE")

a: Fit a model predicting Demand from all predictors and look at its summary. Examine the influence plot for this model.

Z <- lm(Demand~., data = EX4.BIKE)
summary(Z)
## 
## Call:
## lm(formula = Demand ~ ., data = EX4.BIKE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2966.6  -984.7  -195.6  1129.8  3420.7 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4884.152    483.998  10.091  < 2e-16 ***
## AvgTemp            63.309     59.831   1.058    0.291    
## EffectiveAvgTemp   68.452     55.547   1.232    0.219    
## AvgHumidity       -33.945      5.244  -6.473 2.76e-10 ***
## AvgWindspeed      -76.116     14.343  -5.307 1.83e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1365 on 409 degrees of freedom
## Multiple R-squared:  0.4006, Adjusted R-squared:  0.3947 
## F-statistic: 68.33 on 4 and 409 DF,  p-value: < 2.2e-16
influence_plot(Z)

## $Leverage
## [1] 337

a1: There is one extremely influential point in this dataset. Show the influential row, then explain what makes it so unusual through the use of the scatterplot matrix (its not possible to see why its unusual with summary on the individuals variables).

Response: This point is so unusual because on 3 out of the 4 predictors, this point is in the farthest quartile. It is also in the largest quartile for AvgTemp and in the lowest for EffectiveAvgTemp, which just doesnt make sense to have 2 opposite ends for redundent predictors. It is just an unusual point. AvgHumidity is the only predictor showing it to be a “regular” point.

#EX4.BIKE[?,]    #Replacing the ? with the row number will print out that row to the screen
#pairs()  #scatterplot matrix
#summary(EX4.BIKE) #summaries of each variable
EX4.BIKE[337,]
##     Demand  AvgTemp EffectiveAvgTemp AvgHumidity AvgWindspeed
## 337   7148 29.65667            12.12    57.08333     15.50073
pairs(Demand~., data = EX4.BIKE)

summary(EX4.BIKE)
##      Demand        AvgTemp       EffectiveAvgTemp  AvgHumidity   
##  Min.   : 705   Min.   : 4.407   Min.   : 5.083   Min.   :18.79  
##  1st Qu.:3645   1st Qu.:14.837   1st Qu.:18.071   1st Qu.:52.00  
##  Median :4736   Median :21.388   Median :25.473   Median :62.10  
##  Mean   :4808   Mean   :20.915   Mean   :24.406   Mean   :62.25  
##  3rd Qu.:6200   3rd Qu.:27.009   3rd Qu.:30.699   3rd Qu.:72.26  
##  Max.   :8714   Max.   :35.328   Max.   :40.246   Max.   :97.04  
##   AvgWindspeed   
##  Min.   : 1.500  
##  1st Qu.: 9.167  
##  Median :11.896  
##  Mean   :12.585  
##  3rd Qu.:15.323  
##  Max.   :34.000

a2: The influential point represents a data error, so we are justified in removing it. Re-fit the model without this point and examine the results of summary on the model. Comment on the change in $R^2$ and $RMSE$ (has it gotten better by throwing away the point?) and the significance of predictors using this new model.

Response: Yes, the R^2, RMSE, and R^2adj have all increased. Also, each predictor has a significant p-value after removing row 337. Previously, AvgTemp and EffectiveAvgTemp were not statistically significant in the multiple regression, but they are now. Old: R^2= 0.4006 RMSE= 1365 R^2adj= 0.3947 New: R^2= 0.4223 RMSE= 1339 R^2adj= 0.4166

M.new <- lm(Demand~.,data=EX4.BIKE[-337,])  
#replace ? by the offending row number so the model is fit WITHOUT that row
summary(M.new)
## 
## Call:
## lm(formula = Demand ~ ., data = EX4.BIKE[-337, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3051.2  -955.7  -220.3  1123.0  3162.4 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4107.533    510.360   8.048 9.26e-15 ***
## AvgTemp          -376.312    121.236  -3.104  0.00204 ** 
## EffectiveAvgTemp  476.857    112.611   4.235 2.83e-05 ***
## AvgHumidity       -36.682      5.186  -7.073 6.62e-12 ***
## AvgWindspeed      -64.506     14.344  -4.497 8.99e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1339 on 408 degrees of freedom
## Multiple R-squared:  0.4223, Adjusted R-squared:  0.4166 
## F-statistic: 74.56 on 4 and 408 DF,  p-value: < 2.2e-16
summary(lm(Demand~.,data=EX4.BIKE))
## 
## Call:
## lm(formula = Demand ~ ., data = EX4.BIKE)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2966.6  -984.7  -195.6  1129.8  3420.7 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4884.152    483.998  10.091  < 2e-16 ***
## AvgTemp            63.309     59.831   1.058    0.291    
## EffectiveAvgTemp   68.452     55.547   1.232    0.219    
## AvgHumidity       -33.945      5.244  -6.473 2.76e-10 ***
## AvgWindspeed      -76.116     14.343  -5.307 1.83e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1365 on 409 degrees of freedom
## Multiple R-squared:  0.4006, Adjusted R-squared:  0.3947 
## F-statistic: 68.33 on 4 and 409 DF,  p-value: < 2.2e-16

b: The strength of the relationship between Demand and the other predictors may not be a constant. For example, the strength of the relationship between Demand and AvgTemp may depend on AvgWindspeed (on warm days, wind speed may not matter, but on cold days, wind speed may matter a lot). Fit a model with all two-way interactions (but without that influential point) and use see_interactions to scan for possible interactions (note: add the argument many=TRUE or there will be too many plots). Does it look like we should include interactions into the model? Why or why not?

Response: No, the interaction between AvgTemp and EffectiveAvgTemp is the only interaction with a p-value that is statistically signific, but when analyzing it visually it looks to have an interaction that is weak. I do not think we should include interactions into the model.

#model with all two way interactions omitting the offending row
#see interactions, with argument extra=TRUE
T <- lm(Demand~AvgTemp*AvgWindspeed, data = EX4.BIKE[-337,])
M.int <- lm(Demand~.^2, data = EX4.BIKE[-337,])
summary(M.int)
## 
## Call:
## lm(formula = Demand ~ .^2, data = EX4.BIKE[-337, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2791.5  -929.1  -148.5  1092.5  3008.9 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   2904.853   2041.973   1.423  0.15564    
## AvgTemp                        128.474    699.806   0.184  0.85443    
## EffectiveAvgTemp               398.094    647.452   0.615  0.53899    
## AvgHumidity                    -89.059     29.648  -3.004  0.00283 ** 
## AvgWindspeed                   -96.565     67.110  -1.439  0.15095    
## AvgTemp:EffectiveAvgTemp       -10.210      1.418  -7.201 2.97e-12 ***
## AvgTemp:AvgHumidity             -2.737      8.906  -0.307  0.75871    
## AvgTemp:AvgWindspeed            23.087     20.796   1.110  0.26761    
## EffectiveAvgTemp:AvgHumidity     3.535      8.426   0.420  0.67506    
## EffectiveAvgTemp:AvgWindspeed  -22.274     18.733  -1.189  0.23513    
## AvgHumidity:AvgWindspeed         1.231      0.815   1.511  0.13159    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1266 on 402 degrees of freedom
## Multiple R-squared:  0.4912, Adjusted R-squared:  0.4786 
## F-statistic: 38.82 on 10 and 402 DF,  p-value: < 2.2e-16
see_interactions(M.int)

c: Fit a model predicting Demand from AvgTemp and AvgWindspeed, including the interaction (but omitting the offensive row). Examine the results of summary and visualize_model. Is the interaction statistically significant?

Response: No, the p-value of the interaction term is 0.4288

#model with all two way interactions omitting the offending row
#see interactions, with argument many=TRUE
B <- lm(Demand~AvgTemp*AvgWindspeed, data = EX4.BIKE[-337,])
summary(B)
## 
## Call:
## lm(formula = Demand ~ AvgTemp * AvgWindspeed, data = EX4.BIKE[-337, 
##     ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3550.1 -1083.8  -198.1  1096.3  3527.9 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3096.950    598.465   5.175 3.58e-07 ***
## AvgTemp               110.519     28.709   3.850 0.000137 ***
## AvgWindspeed          -83.023     42.864  -1.937 0.053446 .  
## AvgTemp:AvgWindspeed    1.715      2.165   0.792 0.428764    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1432 on 409 degrees of freedom
## Multiple R-squared:  0.3381, Adjusted R-squared:  0.3332 
## F-statistic: 69.63 on 3 and 409 DF,  p-value: < 2.2e-16
visualize_model(B)

## 
## Interaction term has p-value 0.4288

Question 3: Ch 4 Exercise 5 (p. 192). Use TIPS. The percentage tip left on a bill TipPercentage has an interesting relationship with the bill amount Bill} and number of people at the table PartySize.

data(TIPS)

a: Fit a model predicting the tip percentage from the bill amount and size of party, including the interaction. Look at visualize_model and summary. You’ll see that the interaction is significant but PartySize itself is not. Would we be justified in dropping the PartySize term from the model? Why or why not?

Response: No, we would not be justified, because of the signifiance of the interaction with other predictors in the model (specifically Bill). Party size is a fairly week interaction compared to how strong the bill interaction is, but it is strong enough to state that we need to keep it in the model.

Q <- lm(TipPercentage~ Bill*PartySize, data= TIPS)
visualize_model(Q)

## 
## Interaction term has p-value 0.02681
summary(Q)
## 
## Call:
## lm(formula = TipPercentage ~ Bill * PartySize, data = TIPS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.570  -3.406  -0.548   2.448  51.173 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    25.80750    2.86608   9.004  < 2e-16 ***
## Bill           -0.52721    0.12608  -4.182 4.06e-05 ***
## PartySize      -1.75579    1.16011  -1.513   0.1315    
## Bill:PartySize  0.09334    0.04190   2.228   0.0268 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.706 on 240 degrees of freedom
## Multiple R-squared:  0.1382, Adjusted R-squared:  0.1274 
## F-statistic: 12.83 on 3 and 240 DF,  p-value: 8.413e-08

b: Write out the overall regression equation. Round each coefficient to two decimal places.

Response: TipPercentage = 25.8075 - 0.52721Bill - 1.75579PartySize + 0.09334Bill:PartySize

c: Now write out the implicit regression equations between tip percentage and bill for parties of size 1 and for parties of size 4.

Equation when party is size 1 TipPercentage = 24.05171 - 0.52721Bill + 0.09334Bill

Equation when party is size 4 TipPercentage = 18.7847 - 0.52721Bill + 0.37337Bill

d: What is the story that the data tells us? In other words, describe how the relationship between tip percentage and bill amount changes as the party size gets bigger. Does it get weaker, stronger? Is it positive, negative?

Response: The data teslls us that the interaction between TipPercentage and PartySize is strongly dependent on the Bill size. It also tells us that the interaction betweeen TipPercentage and Bill is depend on PartySize, but not strongly dependent. As party size gets larger it gets stronger. The relationship is negative, but it gets less negative as PartySize gets larger.

e: A better predictor of tip percentage may be the bill amount per person Define a variable to be Bill divided by PartySize (call it BPP, the bill per person, and add it to the TIPS dataframe). Then fit a simple linear regression predicting tip percentage from it. Examine the summary output, and comment on whether this is a better model than predicting tip percentage from bill, party size, and the interaction from a:

Response: It is not a better model, because the R^2, R^2adj, and the RMSE are all worse then the orinal interaction.However, both are statistically significant. The R^2 adjusted is almost the same as the R^2 on both models. We can see the R^2 and the adj both are higher on the old and the RMSE is smaller so the orignial interaction is the best.
OLD:R^2= 0.1382 RMSE= 5.706 R^2adj= 0.1274 New:R^2= 0.09864 RMSE= 5.811 R^2adj= 0.09491

PL$NewV <- PL$x2/PL$x3

#Code defining and adding bill per person to the dataframe
#Code fitting the simple linear regression
#Code looking at summary
BPP <- (TIPS$Bill/TIPS$PartySize)
V <- lm(TipPercentage~BPP, data = TIPS)
summary(V)
## 
## Call:
## lm(formula = TipPercentage ~ BPP, data = TIPS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.694  -3.235  -0.624   2.437  52.113 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  21.2734     1.0753  19.783  < 2e-16 ***
## BPP          -0.6582     0.1279  -5.146  5.5e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.811 on 242 degrees of freedom
## Multiple R-squared:  0.09864,    Adjusted R-squared:  0.09491 
## F-statistic: 26.48 on 1 and 242 DF,  p-value: 5.502e-07

Homework 7 - T/R - James Reagan Matlock

WORTH 2 HOMEWORKS