3. The analyst was so impressed with your answers to Exercise 5 in Section 3.4 (cars04.csv) that your advice has been sought regarding the next stage in the data analysis, namely an analysis of the effects of different aspects of a car on its suggested retail price. Data are available for all 234 cars on the following variables:
cars <- read.csv('https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20621/Week%206/cars04.csv', header = TRUE)
head(cars)
## Vehicle.Name Hybrid SuggestedRetailPrice DealerCost
## 1 Chevrolet Aveo 4dr 0 11690 10965
## 2 Chevrolet Aveo LS 4dr hatch 0 12585 11802
## 3 Chevrolet Cavalier 2dr 0 14610 13697
## 4 Chevrolet Cavalier 4dr 0 14810 13884
## 5 Chevrolet Cavalier LS 2dr 0 16385 15357
## 6 Dodge Neon SE 4dr 0 13670 12849
## EngineSize Cylinders Horsepower CityMPG HighwayMPG Weight WheelBase
## 1 1.6 4 103 28 34 2370 98
## 2 1.6 4 103 28 34 2348 98
## 3 2.2 4 140 26 37 2617 104
## 4 2.2 4 140 26 37 2676 104
## 5 2.2 4 140 26 37 2617 104
## 6 2.0 4 132 29 36 2581 105
## Length Width
## 1 167 66
## 2 153 66
## 3 183 69
## 4 183 68
## 5 183 69
## 6 174 67
\[Y = \text{Suggested Retail Price; } x_1 = \text{Engine size; } x_2 = \text{Cylinders; } x_3 = \text{Horsepower; } x_4 = \text{Highway mpg; } x_5 = \text{Weight}\]
\(x_6\) = Wheel Base; and \(x_7\) = Hybrid, a dummy variable which is 1 for so-called hybrid cars. The first model considered for these data was:
\[Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \beta_5x_5 + \beta_6x_6 + \beta_7x_7 + e\]
Output from model (6.36) appears on the following pages (Fig 6.53).
var_names <- c('SuggestedRetailPrice', 'EngineSize', 'Cylinders', 'Horsepower', 'HighwayMPG','Weight', 'WheelBase', 'Hybrid')
cars_df <- subset(cars, select=var_names)
head(cars_df)
## SuggestedRetailPrice EngineSize Cylinders Horsepower HighwayMPG Weight
## 1 11690 1.6 4 103 34 2370
## 2 12585 1.6 4 103 34 2348
## 3 14610 2.2 4 140 37 2617
## 4 14810 2.2 4 140 37 2676
## 5 16385 2.2 4 140 37 2617
## 6 13670 2.0 4 132 36 2581
## WheelBase Hybrid
## 1 98 0
## 2 98 0
## 3 104 0
## 4 104 0
## 5 104 0
## 6 105 0
lmod <- lm(SuggestedRetailPrice ~ ., cars_df)
summary(lmod)
##
## Call:
## lm(formula = SuggestedRetailPrice ~ ., data = cars_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17436 -4134 173 3561 46392
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -68965.793 16180.381 -4.262 2.97e-05 ***
## EngineSize -6957.457 1600.137 -4.348 2.08e-05 ***
## Cylinders 3564.755 969.633 3.676 0.000296 ***
## Horsepower 179.702 16.411 10.950 < 2e-16 ***
## HighwayMPG 637.939 202.724 3.147 0.001873 **
## Weight 11.911 2.658 4.481 1.18e-05 ***
## WheelBase 47.607 178.070 0.267 0.789444
## Hybrid 431.759 6092.087 0.071 0.943562
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7533 on 226 degrees of freedom
## Multiple R-squared: 0.7819, Adjusted R-squared: 0.7751
## F-statistic: 115.7 on 7 and 226 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lmod)
plot(cars_df)
The adjusted R-squared of 0.775 suggests that this model fits reasonably well, but the diagnostic plots show violations of the basic assumptions of a linear regression model, including linearity. The data appear skewed, and there are leverage points that may be influencing the fit.
The curved pattern in the residual plots suggests that the variables may need to be adjusted, potentially by transforming them; there may also be a quadratic-like component to the relationship.
Judging from Cook's distance, point 223 is a bad leverage point.
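A quick sketch of how such points could be flagged numerically from the fitted object `lmod`; the cutoffs 4/n and 2p/n are common heuristics rather than part of the original analysis:

```r
# Identify potentially influential observations in the first model
cd  <- cooks.distance(lmod)
lev <- hatvalues(lmod)
n <- nrow(cars_df)
p <- length(coef(lmod))
which(cd > 4 / n)                  # large Cook's distance
which(lev > 2 * p / n)             # high leverage
cars$Vehicle.Name[which.max(cd)]   # the single most influential car
```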
The multivariate version of the Box-Cox method was used to transform the predictors, while a log transformation was used for the response variable to improve interpretability. This resulted in the following model:
\[\log(Y) = \beta_0 + \beta_1x^{0.25}_{1} + \beta_2\log(x_2) + \beta_3\log(x_3) + \beta_4 \frac{1}{x_4} + \beta_5x_5 + \beta_6\log(x_6) + \beta_7x_7 + e\]
Output from model (6.37) appears on the following pages. In that output a “t” at the start of a variable name means that the variable has been transformed according to model (6.37).
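For reference, a hedged sketch of how the multivariate Box-Cox estimates for the predictors might be obtained with `car::powerTransform`; the exact call below is an assumption about how the transformations were chosen, not part of the original output:

```r
# Jointly estimate Box-Cox power transformations for the strictly positive
# predictors (the dummy variable Hybrid is left out); uses the car package
pt <- car::powerTransform(cbind(EngineSize, Cylinders, Horsepower,
                                HighwayMPG, Weight, WheelBase) ~ 1,
                          data = cars_df)
summary(pt)   # the rounded powers should be close to those used in model (6.37)
```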
lmod_car2 <- lm(log(SuggestedRetailPrice) ~ I(EngineSize^(0.25)) + I(log(Cylinders)) + I(log(Horsepower)) + I(1/HighwayMPG) + Weight + I(log(WheelBase)) + Hybrid, cars_df)
summary(lmod_car2)
##
## Call:
## lm(formula = log(SuggestedRetailPrice) ~ I(EngineSize^(0.25)) +
## I(log(Cylinders)) + I(log(Horsepower)) + I(1/HighwayMPG) +
## Weight + I(log(WheelBase)) + Hybrid, data = cars_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42288 -0.10983 -0.00203 0.10279 0.70068
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.703e+00 2.010e+00 2.838 0.00496 **
## I(EngineSize^(0.25)) -1.575e+00 3.332e-01 -4.727 4.01e-06 ***
## I(log(Cylinders)) 2.335e-01 1.204e-01 1.940 0.05359 .
## I(log(Horsepower)) 8.992e-01 8.876e-02 10.130 < 2e-16 ***
## I(1/HighwayMPG) 8.029e-01 4.758e+00 0.169 0.86614
## Weight 5.043e-04 6.367e-05 7.920 1.07e-13 ***
## I(log(WheelBase)) -6.385e-02 4.715e-01 -0.135 0.89240
## Hybrid 6.422e-01 1.150e-01 5.582 6.78e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1789 on 226 degrees of freedom
## Multiple R-squared: 0.8621, Adjusted R-squared: 0.8578
## F-statistic: 201.8 on 7 and 226 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lmod_car2)
lmod_car3 <- update(lmod_car2, . ~ . - I(1/HighwayMPG) - I(log(WheelBase)))
summary(lmod_car3)
##
## Call:
## lm(formula = log(SuggestedRetailPrice) ~ I(EngineSize^(0.25)) +
## I(log(Cylinders)) + I(log(Horsepower)) + Weight + Hybrid,
## data = cars_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42224 -0.11001 -0.00099 0.10191 0.70205
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.422e+00 3.291e-01 16.474 < 2e-16 ***
## I(EngineSize^(0.25)) -1.591e+00 3.157e-01 -5.041 9.45e-07 ***
## I(log(Cylinders)) 2.375e-01 1.186e-01 2.003 0.0463 *
## I(log(Horsepower)) 9.049e-01 8.305e-02 10.896 < 2e-16 ***
## Weight 5.029e-04 5.203e-05 9.666 < 2e-16 ***
## Hybrid 6.340e-01 1.080e-01 5.870 1.53e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1781 on 228 degrees of freedom
## Multiple R-squared: 0.862, Adjusted R-squared: 0.859
## F-statistic: 284.9 on 5 and 228 DF, p-value: < 2.2e-16
After dropping the two insignificant terms, the adjusted R-squared is essentially unchanged and the overall F-statistic increases, so the reduced model is preferred.
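A partial F-test comparing the reduced and full transformed models would confirm this more formally; a minimal sketch using the fitted objects from above:

```r
# Does dropping 1/HighwayMPG and log(WheelBase) lose explanatory power?
anova(lmod_car3, lmod_car2)
```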
As a further refinement, one could use a regular expression to extract the manufacturer name from Vehicle.Name and include it in the model as a factor (i.e., a set of dummy variables).
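A minimal sketch of that idea, assuming the manufacturer is simply the first word of Vehicle.Name (this fails for two-word makes such as "Land Rover", so it is only illustrative):

```r
# Extract the first word of the vehicle name as the manufacturer
cars_df$Manufacturer <- factor(sub("^(\\S+).*$", "\\1", cars$Vehicle.Name))
# lm() expands the factor into dummy variables automatically
lmod_make <- update(lmod_car3, . ~ . + Manufacturer, data = cars_df)
summary(lmod_make)
```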
An avid fan of the PGA tour with limited background in statistics has sought your help in answering one of the age-old questions in golf, namely, what is the relative importance of each different aspect of the game on average prize money in professional golf?
The following data on the top 196 tour players in 2006 can be found on the book web site in the file pgatour2006.csv.
golf <- read.csv("https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20621/Week%206/pgatour2006.csv", header=TRUE)
head(golf)
## Name TigerWoods PrizeMoney AveDrivingDistance
## 1 Aaron Baddeley 0 60661 288.3
## 2 Adam Scott 0 262045 301.1
## 3 Alex Aragon 0 3635 302.6
## 4 Alex Cejka 0 17516 288.8
## 5 Arjun Atwal 0 16683 287.7
## 6 Arron Oberholser 0 107294 285.0
## DrivingAccuracy GIR PuttingAverage BirdieConversion SandSaves
## 1 60.73 58.26 1.745 31.36 54.80
## 2 62.00 69.12 1.767 30.39 53.61
## 3 51.12 59.11 1.787 29.89 37.93
## 4 66.40 67.70 1.777 29.33 45.13
## 5 63.24 64.04 1.761 29.32 52.44
## 6 62.53 69.27 1.775 29.20 47.20
## Scrambling BounceBack PuttsPerRound
## 1 59.37 19.30 27.96
## 2 57.94 19.35 29.28
## 3 50.78 16.80 29.20
## 4 54.82 17.05 29.46
## 5 57.07 18.21 28.93
## 6 57.67 20.00 29.56
lm_golf <- lm(PrizeMoney ~ DrivingAccuracy + GIR + PuttingAverage + BirdieConversion + SandSaves + Scrambling + PuttsPerRound, golf)
summary(lm_golf)
##
## Call:
## lm(formula = PrizeMoney ~ DrivingAccuracy + GIR + PuttingAverage +
## BirdieConversion + SandSaves + Scrambling + PuttsPerRound,
## data = golf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -81239 -26260 -6521 17539 420230
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1165233.1 587382.9 -1.984 0.048737 *
## DrivingAccuracy -1835.8 889.2 -2.065 0.040326 *
## GIR 9671.3 3309.4 2.922 0.003899 **
## PuttingAverage -47435.3 521566.4 -0.091 0.927631
## BirdieConversion 10426.0 3049.6 3.419 0.000771 ***
## SandSaves 1182.1 744.8 1.587 0.114184
## Scrambling 4741.3 2400.8 1.975 0.049749 *
## PuttsPerRound 5267.5 35765.7 0.147 0.883070
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50140 on 188 degrees of freedom
## Multiple R-squared: 0.4064, Adjusted R-squared: 0.3843
## F-statistic: 18.39 on 7 and 188 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_golf)
golf_df <- subset(golf, select=c('PrizeMoney','DrivingAccuracy','GIR','PuttingAverage','BirdieConversion','SandSaves', 'Scrambling', 'PuttsPerRound'))
plot(golf_df)
lm_log <- lm(log(PrizeMoney) ~ ., golf_df)
summary(lm_log)
##
## Call:
## lm(formula = log(PrizeMoney) ~ ., data = golf_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.71949 -0.48608 -0.09172 0.44561 2.14013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.194300 7.777129 0.025 0.980095
## DrivingAccuracy -0.003530 0.011773 -0.300 0.764636
## GIR 0.199311 0.043817 4.549 9.66e-06 ***
## PuttingAverage -0.466304 6.905698 -0.068 0.946236
## BirdieConversion 0.157341 0.040378 3.897 0.000136 ***
## SandSaves 0.015174 0.009862 1.539 0.125551
## Scrambling 0.051514 0.031788 1.621 0.106788
## PuttsPerRound -0.343131 0.473549 -0.725 0.469601
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6639 on 188 degrees of freedom
## Multiple R-squared: 0.5577, Adjusted R-squared: 0.5412
## F-statistic: 33.87 on 7 and 188 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_log)
hist(golf_df$PrizeMoney)
The histogram of PrizeMoney is strongly right-skewed, so the analyst is justified in requesting a log transformation of the response variable.
The output and diagnostic plots for the log-transformed model are shown above.
There are some outliers and potential leverage points in the diagnostics that need to be investigated further.
When simplifying the model, we should remove one insignificant predictor at a time (see the sketch below), because removing one variable can make a previously insignificant predictor statistically significant.
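A minimal sketch of one backward-elimination step, removing the predictor with the largest p-value (PuttingAverage) and refitting; the object name `lm_log2` is just illustrative:

```r
# Drop the least significant predictor and refit; repeating until all remaining
# predictors are significant is manual backward elimination (step(lm_log) automates this)
lm_log2 <- update(lm_log, . ~ . - PuttingAverage)
summary(lm_log2)
```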
**3. Using the divusa data:**
require(faraway)
## Loading required package: faraway
data(divusa, package='faraway')
head(divusa)
## year divorce unemployed femlab marriage birth military
## 1 1920 8.0 5.2 22.70 92.0 117.9 3.2247
## 2 1921 7.2 11.7 22.79 83.0 119.8 3.5614
## 3 1922 6.6 6.7 22.88 79.7 111.2 2.4553
## 4 1923 7.1 2.4 22.97 85.2 110.5 2.2065
## 5 1924 7.2 5.0 23.06 80.3 110.9 2.2889
## 6 1925 7.2 3.2 23.15 79.2 106.6 2.1735
Use divorce as the response and unemployed, femlab, marriage, and military as predictors. Compute the condition numbers and interpret their meanings.
div_lm <- lm(divorce ~ unemployed + femlab + marriage + military, data=divusa)
summary(div_lm)
##
## Call:
## lm(formula = divorce ~ unemployed + femlab + marriage + military,
## data = divusa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5080 -1.8613 -0.2785 1.5483 5.3526
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.17511 3.94663 -3.338 0.00134 **
## unemployed 0.10809 0.06892 1.568 0.12115
## femlab 0.51252 0.03686 13.906 < 2e-16 ***
## marriage 0.08384 0.03359 2.496 0.01486 *
## military -0.01805 0.01985 -0.909 0.36612
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.305 on 72 degrees of freedom
## Multiple R-squared: 0.8434, Adjusted R-squared: 0.8347
## F-statistic: 96.92 on 4 and 72 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(div_lm)
plot(div_lm$fitted.values, divusa$divorce)
The plot of fitted values against observed divorce rates suggests a violation of the linearity assumption of the regression; this needs a closer look.
pairs(divusa[,c(2,3,4,5,7)])
Let’s take a look at the Y variable and its distribution.
hist(divusa$divorce, col='green', main="Divorce in the USA (divorce per 1000 women aged 15 or more)", breaks=20)
There appears to be a bimodal pattern.
Let’s see what happens if we take the log(divorce) and create the model.
div_lm2 <- lm(log(divorce) ~ unemployed + femlab + marriage + military, data=divusa)
summary(div_lm2)
##
## Call:
## lm(formula = log(divorce) ~ unemployed + femlab + marriage +
## military, data = divusa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25296 -0.11826 -0.01309 0.09829 0.30294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1783155 0.2581306 0.691 0.492
## unemployed 0.0074488 0.0045074 1.653 0.103
## femlab 0.0407824 0.0024106 16.918 < 2e-16 ***
## marriage 0.0095098 0.0021973 4.328 4.77e-05 ***
## military -0.0002198 0.0012980 -0.169 0.866
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1508 on 72 degrees of freedom
## Multiple R-squared: 0.8796, Adjusted R-squared: 0.873
## F-statistic: 131.6 on 4 and 72 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(div_lm2)
There are still a number of issues with this linear regression model. The adjusted R-squared is 0.873, which is better than the previous model, and femlab and marriage remain statistically significant. That said, let's try a Box-Cox transformation.
library(MASS)
## Warning: package 'MASS' was built under R version 3.4.3
boxcox(divorce ~ unemployed + femlab + marriage + military, data=divusa)
Let's try the inverse transformation (lambda = -1), as the Box-Cox plot suggests.
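To read off the suggested power more precisely, the profile returned by boxcox() can be inspected directly; a small sketch using the same formula as above:

```r
# boxcox() returns the grid of lambda values and their profile log-likelihoods
bc <- boxcox(divorce ~ unemployed + femlab + marriage + military,
             data = divusa, plotit = FALSE)
bc$x[which.max(bc$y)]   # lambda maximising the likelihood; expected to be near -1
```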
div_lm3 <- lm(1/divorce ~ unemployed + femlab + marriage + military, data=divusa)
summary(div_lm3)
##
## Call:
## lm(formula = 1/divorce ~ unemployed + femlab + marriage + military,
## data = divusa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.021302 -0.008442 0.001082 0.008205 0.021169
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.091e-01 1.882e-02 16.426 < 2e-16 ***
## unemployed -4.906e-04 3.286e-04 -1.493 0.140
## femlab -3.549e-03 1.757e-04 -20.196 < 2e-16 ***
## marriage -1.071e-03 1.602e-04 -6.684 4.2e-09 ***
## military -7.501e-05 9.462e-05 -0.793 0.431
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01099 on 72 degrees of freedom
## Multiple R-squared: 0.9077, Adjusted R-squared: 0.9026
## F-statistic: 177 on 4 and 72 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(div_lm3)
Again, the adjusted R-squared improved, although R-squared values are not directly comparable across models with different response transformations.
Now let's try the loess() function for a nonparametric comparison.
div_lm6 <- loess(divorce ~ unemployed + femlab + marriage + military, data=divusa)
summary(div_lm6)
## Call:
## loess(formula = divorce ~ unemployed + femlab + marriage + military,
## data = divusa)
##
## Number of Observations: 77
## Equivalent Number of Parameters: 20.89
## Residual Standard Error: 2.867
## Trace of smoother matrix: 26.61 (exact)
##
## Control settings:
## span : 0.75
## degree : 2
## family : gaussian
## surface : interpolate cell = 0.2
## normalize: TRUE
## parametric: FALSE FALSE FALSE FALSE
## drop.square: FALSE FALSE FALSE FALSE
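For a rough visual comparison of the loess fit with the earlier linear fit, a small sketch; plotting both sets of fitted values over time is one assumed way to summarise a four-predictor loess surface, not part of the original output:

```r
# Fitted values from the linear and loess models plotted over time against the data
plot(divusa$year, divusa$divorce, type = "l",
     xlab = "Year", ylab = "Divorces per 1000 women")
lines(divusa$year, fitted(div_lm),  col = "red",  lty = 2)
lines(divusa$year, fitted(div_lm6), col = "blue", lty = 3)
legend("topleft", legend = c("Observed", "Linear fit", "Loess fit"),
       col = c("black", "red", "blue"), lty = 1:3)
```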
vif(divusa[, c(3,4,5,7)])
## unemployed femlab marriage military
## 1.753836 2.689578 2.780902 1.242909
Reference: https://stats.stackexchange.com/questions/169445/how-to-interpret-a-vif-of-4
- VIF = 1: not correlated
- 1 < VIF < 5: moderately correlated
- VIF >= 5: highly correlated
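Since the exercise asks for condition numbers rather than VIFs, here is a minimal sketch following the eigenvalue approach used in Faraway's text; working from the raw (uncentred) predictor matrix of div_lm is an assumption made here:

```r
# Condition numbers from the eigenvalues of X'X for the four predictors
x <- model.matrix(div_lm)[, -1]   # drop the intercept column
e <- eigen(t(x) %*% x)
sqrt(e$values[1] / e$values)      # condition indices; values above ~30 suggest collinearity
```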
In the model-building phase, military and unemployed were not statistically significant. Let's remove these two.
div_lm4 <- lm(divorce ~ femlab + marriage, data=divusa)
summary(div_lm4)
##
## Call:
## lm(formula = divorce ~ femlab + marriage, data = divusa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7249 -1.9049 0.1659 1.4827 6.3876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.88696 2.86111 -2.757 0.00735 **
## femlab 0.46902 0.02996 15.655 < 2e-16 ***
## marriage 0.04194 0.02686 1.562 0.12259
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.339 on 74 degrees of freedom
## Multiple R-squared: 0.8343, Adjusted R-squared: 0.8298
## F-statistic: 186.3 on 2 and 74 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(div_lm4)
vif(divusa[, c(4,5)])
## femlab marriage
## 1.726273 1.726273
Not particularly: the VIFs for femlab and marriage are both about 1.73, indicating only mild collinearity between the remaining predictors.
8. For the divusa data, fit a model with divorce as the response and the other variables, except year, as predictors. Check for serial correlation.
divorce_lm <- lm(divorce ~ . - year, data=divusa)
summary(divorce_lm)
##
## Call:
## lm(formula = divorce ~ . - year, data = divusa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8611 -0.8916 -0.0496 0.8650 3.8300
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.48784 3.39378 0.733 0.4659
## unemployed -0.11125 0.05592 -1.989 0.0505 .
## femlab 0.38365 0.03059 12.543 < 2e-16 ***
## marriage 0.11867 0.02441 4.861 6.77e-06 ***
## birth -0.12996 0.01560 -8.333 4.03e-12 ***
## military -0.02673 0.01425 -1.876 0.0647 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.65 on 71 degrees of freedom
## Multiple R-squared: 0.9208, Adjusted R-squared: 0.9152
## F-statistic: 165.1 on 5 and 71 DF, p-value: < 2.2e-16
Reference: http://www.statisticshowto.com/durbin-watson-test-coefficient/, https://www.safaribooksonline.com/library/view/the-r-book/9780470510247/ch010-sec018.html
The Durbin-Watson test checks whether the residuals from a linear model or GLM are autocorrelated; it is implemented as durbinWatsonTest() in the car package.
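For intuition, the statistic itself is just the ratio of the sum of squared successive residual differences to the residual sum of squares; a quick sketch computed by hand, which should agree with the car output below:

```r
# Durbin-Watson statistic computed directly from the residuals
r <- residuals(divorce_lm)
sum(diff(r)^2) / sum(r^2)   # expected to be close to the ~0.30 reported below
```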
library(car)
## Warning: package 'car' was built under R version 3.4.3
##
## Attaching package: 'car'
## The following objects are masked from 'package:faraway':
##
## logit, vif
durbinWatsonTest(divorce_lm)
## lag Autocorrelation D-W Statistic p-value
## 1 0.8260933 0.2998834 0
## Alternative hypothesis: rho != 0
The D-W statistic of 0.30 is well below 2, indicating strong positive autocorrelation in the residuals. With a p-value of essentially 0, we reject the null hypothesis of no autocorrelation: there is serial correlation in this model.
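One way to visualise the serial correlation and then account for it is to plot successive residuals against each other and refit with AR(1)-correlated errors via nlme::gls; this is a hedged sketch of a possible next step, not part of the original analysis:

```r
# Successive residuals plotted against each other show the positive autocorrelation
r <- residuals(divorce_lm)
plot(head(r, -1), tail(r, -1), xlab = "residual i", ylab = "residual i + 1")

# Refit the same mean model allowing AR(1)-correlated errors
library(nlme)
gls_mod <- gls(divorce ~ unemployed + femlab + marriage + birth + military,
               correlation = corAR1(form = ~ year), data = divusa)
summary(gls_mod)
```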