3. The analyst was so impressed with your answers to Exercise 5 in Section 3.4 (cars04.csv) that your advice has been sought regarding the next stage in the data analysis, namely an analysis of the effects of different aspects of a car on its suggested retail price. Data are available for all 234 cars on the following variables:
cars <- read.csv('https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20621/Week%206/cars04.csv', header = TRUE)
head(cars)
## Vehicle.Name Hybrid SuggestedRetailPrice DealerCost
## 1 Chevrolet Aveo 4dr 0 11690 10965
## 2 Chevrolet Aveo LS 4dr hatch 0 12585 11802
## 3 Chevrolet Cavalier 2dr 0 14610 13697
## 4 Chevrolet Cavalier 4dr 0 14810 13884
## 5 Chevrolet Cavalier LS 2dr 0 16385 15357
## 6 Dodge Neon SE 4dr 0 13670 12849
## EngineSize Cylinders Horsepower CityMPG HighwayMPG Weight WheelBase
## 1 1.6 4 103 28 34 2370 98
## 2 1.6 4 103 28 34 2348 98
## 3 2.2 4 140 26 37 2617 104
## 4 2.2 4 140 26 37 2676 104
## 5 2.2 4 140 26 37 2617 104
## 6 2.0 4 132 29 36 2581 105
## Length Width
## 1 167 66
## 2 153 66
## 3 183 69
## 4 183 68
## 5 183 69
## 6 174 67
\[Y = \text{Suggested Retail Price; } x_1 = \text{Engine size; } x_2 = \text{Cylinders; } x_3 = \text{Horsepower; } x_4 = \text{Highway mpg; } x_5 = \text{Weight}\]
\(x_6\) = Wheel Base; and \(x_7\) = Hybrid, a dummy variable which is 1 for so-called hybrid cars. The first model considered for these data was:
\[Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \beta_5x_5 + \beta_6x_6 + \beta_7x_7 + e\]
Output from model (6.36) appears on the following pages (Fig 6.53).
var_names <- c('SuggestedRetailPrice', 'EngineSize', 'Cylinders', 'Horsepower', 'HighwayMPG','Weight', 'WheelBase', 'Hybrid')
cars_df <- subset(cars, select=var_names)
head(cars_df)
## SuggestedRetailPrice EngineSize Cylinders Horsepower HighwayMPG Weight
## 1 11690 1.6 4 103 34 2370
## 2 12585 1.6 4 103 34 2348
## 3 14610 2.2 4 140 37 2617
## 4 14810 2.2 4 140 37 2676
## 5 16385 2.2 4 140 37 2617
## 6 13670 2.0 4 132 36 2581
## WheelBase Hybrid
## 1 98 0
## 2 98 0
## 3 104 0
## 4 104 0
## 5 104 0
## 6 105 0
lmod <- lm(SuggestedRetailPrice ~ ., cars_df)
summary(lmod)
##
## Call:
## lm(formula = SuggestedRetailPrice ~ ., data = cars_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17436 -4134 173 3561 46392
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -68965.793 16180.381 -4.262 2.97e-05 ***
## EngineSize -6957.457 1600.137 -4.348 2.08e-05 ***
## Cylinders 3564.755 969.633 3.676 0.000296 ***
## Horsepower 179.702 16.411 10.950 < 2e-16 ***
## HighwayMPG 637.939 202.724 3.147 0.001873 **
## Weight 11.911 2.658 4.481 1.18e-05 ***
## WheelBase 47.607 178.070 0.267 0.789444
## Hybrid 431.759 6092.087 0.071 0.943562
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7533 on 226 degrees of freedom
## Multiple R-squared: 0.7819, Adjusted R-squared: 0.7751
## F-statistic: 115.7 on 7 and 226 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lmod)
plot(cars_df)
The adjusted R-squared of 0.775 suggests that this model fits reasonably well, but the diagnostic plots show violations of the basic assumptions of a linear regression model, including linearity. The data appear skewed, and there are leverage points that may be influencing the fit.
The curved pattern in the residual plots suggests that the variables may need to be adjusted, potentially by transforming them; there may also be a quadratic-like component to the relationship.
Judging from Cook's distance, point 223 is a bad leverage point.
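A quick sketch of how such points could be flagged numerically from the fitted object `lmod`; the cutoffs 4/n and 2p/n are common heuristics rather than part of the original analysis:

```r
# Identify potentially influential observations in the first model
cd  <- cooks.distance(lmod)
lev <- hatvalues(lmod)
n <- nrow(cars_df)
p <- length(coef(lmod))
which(cd > 4 / n)                  # large Cook's distance
which(lev > 2 * p / n)             # high leverage
cars$Vehicle.Name[which.max(cd)]   # the single most influential car
```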
The multivariate version of the Box-Cox method was used to transform the predictors, while a log transformation was used for the response variable to improve interpretability. This resulted in the following model:
\[\log(Y) = \beta_0 + \beta_1x^{0.25}_{1} + \beta_2\log(x_2) + \beta_3\log(x_3) + \beta_4 \frac{1}{x_4} + \beta_5x_5 + \beta_6\log(x_6) + \beta_7x_7 + e\]
Output from model (6.37) appears on the following pages. In that output a “t” at the start of a variable name means that the variable has been transformed according to model (6.37).
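For reference, a hedged sketch of how the multivariate Box-Cox estimates for the predictors might be obtained with `car::powerTransform`; the exact call below is an assumption about how the transformations were chosen, not part of the original output:

```r
# Jointly estimate Box-Cox power transformations for the strictly positive
# predictors (the dummy variable Hybrid is left out); uses the car package
pt <- car::powerTransform(cbind(EngineSize, Cylinders, Horsepower,
                                HighwayMPG, Weight, WheelBase) ~ 1,
                          data = cars_df)
summary(pt)   # the rounded powers should be close to those used in model (6.37)
```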
lmod_car2 <- lm(log(SuggestedRetailPrice) ~ I(EngineSize^(0.25)) + I(log(Cylinders)) + I(log(Horsepower)) + I(1/HighwayMPG) + Weight + I(log(WheelBase)) + Hybrid, cars_df)
summary(lmod_car2)
##
## Call:
## lm(formula = log(SuggestedRetailPrice) ~ I(EngineSize^(0.25)) +
## I(log(Cylinders)) + I(log(Horsepower)) + I(1/HighwayMPG) +
## Weight + I(log(WheelBase)) + Hybrid, data = cars_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42288 -0.10983 -0.00203 0.10279 0.70068
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.703e+00 2.010e+00 2.838 0.00496 **
## I(EngineSize^(0.25)) -1.575e+00 3.332e-01 -4.727 4.01e-06 ***
## I(log(Cylinders)) 2.335e-01 1.204e-01 1.940 0.05359 .
## I(log(Horsepower)) 8.992e-01 8.876e-02 10.130 < 2e-16 ***
## I(1/HighwayMPG) 8.029e-01 4.758e+00 0.169 0.86614
## Weight 5.043e-04 6.367e-05 7.920 1.07e-13 ***
## I(log(WheelBase)) -6.385e-02 4.715e-01 -0.135 0.89240
## Hybrid 6.422e-01 1.150e-01 5.582 6.78e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1789 on 226 degrees of freedom
## Multiple R-squared: 0.8621, Adjusted R-squared: 0.8578
## F-statistic: 201.8 on 7 and 226 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lmod_car2)
lmod_car3 <- update(lmod_car2, . ~ . - I(1/HighwayMPG) - I(log(WheelBase)))
summary(lmod_car3)
##
## Call:
## lm(formula = log(SuggestedRetailPrice) ~ I(EngineSize^(0.25)) +
## I(log(Cylinders)) + I(log(Horsepower)) + Weight + Hybrid,
## data = cars_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.42224 -0.11001 -0.00099 0.10191 0.70205
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.422e+00 3.291e-01 16.474 < 2e-16 ***
## I(EngineSize^(0.25)) -1.591e+00 3.157e-01 -5.041 9.45e-07 ***
## I(log(Cylinders)) 2.375e-01 1.186e-01 2.003 0.0463 *
## I(log(Horsepower)) 9.049e-01 8.305e-02 10.896 < 2e-16 ***
## Weight 5.029e-04 5.203e-05 9.666 < 2e-16 ***
## Hybrid 6.340e-01 1.080e-01 5.870 1.53e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1781 on 228 degrees of freedom
## Multiple R-squared: 0.862, Adjusted R-squared: 0.859
## F-statistic: 284.9 on 5 and 228 DF, p-value: < 2.2e-16
After dropping the two insignificant terms, the adjusted R-squared is essentially unchanged and the overall F-statistic increases, so the reduced model is preferred.
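A partial F-test comparing the reduced and full transformed models would confirm this more formally; a minimal sketch using the fitted objects from above:

```r
# Does dropping 1/HighwayMPG and log(WheelBase) lose explanatory power?
anova(lmod_car3, lmod_car2)
```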
As a further refinement, one could use a regular expression to extract the manufacturer name from Vehicle.Name and include it in the model as a factor (i.e., a set of dummy variables).
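A minimal sketch of that idea, assuming the manufacturer is simply the first word of Vehicle.Name (this fails for two-word makes such as "Land Rover", so it is only illustrative):

```r
# Extract the first word of the vehicle name as the manufacturer
cars_df$Manufacturer <- factor(sub("^(\\S+).*$", "\\1", cars$Vehicle.Name))
# lm() expands the factor into dummy variables automatically
lmod_make <- update(lmod_car3, . ~ . + Manufacturer, data = cars_df)
summary(lmod_make)
```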
An avid fan of the PGA tour with limited background in statistics has sought your help in answering one of the age-old questions in golf, namely, what is the relative importance of each different aspect of the game on average prize money in professional golf?
The following data on the top 196 tour players in 2006 can be found on the book web site in the file pgatour2006.csv.
golf <- read.csv("https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20621/Week%206/pgatour2006.csv", header=TRUE)
head(golf)
## Name TigerWoods PrizeMoney AveDrivingDistance
## 1 Aaron Baddeley 0 60661 288.3
## 2 Adam Scott 0 262045 301.1
## 3 Alex Aragon 0 3635 302.6
## 4 Alex Cejka 0 17516 288.8
## 5 Arjun Atwal 0 16683 287.7
## 6 Arron Oberholser 0 107294 285.0
## DrivingAccuracy GIR PuttingAverage BirdieConversion SandSaves
## 1 60.73 58.26 1.745 31.36 54.80
## 2 62.00 69.12 1.767 30.39 53.61
## 3 51.12 59.11 1.787 29.89 37.93
## 4 66.40 67.70 1.777 29.33 45.13
## 5 63.24 64.04 1.761 29.32 52.44
## 6 62.53 69.27 1.775 29.20 47.20
## Scrambling BounceBack PuttsPerRound
## 1 59.37 19.30 27.96
## 2 57.94 19.35 29.28
## 3 50.78 16.80 29.20
## 4 54.82 17.05 29.46
## 5 57.07 18.21 28.93
## 6 57.67 20.00 29.56
lm_golf <- lm(PrizeMoney ~ DrivingAccuracy + GIR + PuttingAverage + BirdieConversion + SandSaves + Scrambling + PuttsPerRound, golf)
summary(lm_golf)
##
## Call:
## lm(formula = PrizeMoney ~ DrivingAccuracy + GIR + PuttingAverage +
## BirdieConversion + SandSaves + Scrambling + PuttsPerRound,
## data = golf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -81239 -26260 -6521 17539 420230
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1165233.1 587382.9 -1.984 0.048737 *
## DrivingAccuracy -1835.8 889.2 -2.065 0.040326 *
## GIR 9671.3 3309.4 2.922 0.003899 **
## PuttingAverage -47435.3 521566.4 -0.091 0.927631
## BirdieConversion 10426.0 3049.6 3.419 0.000771 ***
## SandSaves 1182.1 744.8 1.587 0.114184
## Scrambling 4741.3 2400.8 1.975 0.049749 *
## PuttsPerRound 5267.5 35765.7 0.147 0.883070
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50140 on 188 degrees of freedom
## Multiple R-squared: 0.4064, Adjusted R-squared: 0.3843
## F-statistic: 18.39 on 7 and 188 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_golf)
golf_df <- subset(golf, select=c('PrizeMoney','DrivingAccuracy','GIR','PuttingAverage','BirdieConversion','SandSaves', 'Scrambling', 'PuttsPerRound'))
plot(golf_df)
lm_log <- lm(log(PrizeMoney) ~ ., golf_df)
summary(lm_log)
##
## Call:
## lm(formula = log(PrizeMoney) ~ ., data = golf_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.71949 -0.48608 -0.09172 0.44561 2.14013
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.194300 7.777129 0.025 0.980095
## DrivingAccuracy -0.003530 0.011773 -0.300 0.764636
## GIR 0.199311 0.043817 4.549 9.66e-06 ***
## PuttingAverage -0.466304 6.905698 -0.068 0.946236
## BirdieConversion 0.157341 0.040378 3.897 0.000136 ***
## SandSaves 0.015174 0.009862 1.539 0.125551
## Scrambling 0.051514 0.031788 1.621 0.106788
## PuttsPerRound -0.343131 0.473549 -0.725 0.469601
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6639 on 188 degrees of freedom
## Multiple R-squared: 0.5577, Adjusted R-squared: 0.5412
## F-statistic: 33.87 on 7 and 188 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_log)
hist(golf_df$PrizeMoney)
The histogram of PrizeMoney is strongly right-skewed, so the analyst is justified in requesting a log transformation of the response variable.
The output and diagnostic plots for the log-transformed model are shown above.
There are some outliers and potential leverage points in the diagnostics that need to be investigated further.
When simplifying the model, we should remove one insignificant predictor at a time (see the sketch below), because removing one variable can make a previously insignificant predictor statistically significant.
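A minimal sketch of one backward-elimination step, removing the predictor with the largest p-value (PuttingAverage) and refitting; the object name `lm_log2` is just illustrative:

```r
# Drop the least significant predictor and refit; repeating until all remaining
# predictors are significant is manual backward elimination (step(lm_log) automates this)
lm_log2 <- update(lm_log, . ~ . - PuttingAverage)
summary(lm_log2)
```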
**3. Using the divusa data:**
require(faraway)
## Loading required package: faraway
data(divusa, package='faraway')
head(divusa)
## year divorce unemployed femlab marriage birth military
## 1 1920 8.0 5.2 22.70 92.0 117.9 3.2247
## 2 1921 7.2 11.7 22.79 83.0 119.8 3.5614
## 3 1922 6.6 6.7 22.88 79.7 111.2 2.4553
## 4 1923 7.1 2.4 22.97 85.2 110.5 2.2065
## 5 1924 7.2 5.0 23.06 80.3 110.9 2.2889
## 6 1925 7.2 3.2 23.15 79.2 106.6 2.1735
Use divorce as the response and unemployed, femlab, marriage, and military as predictors. Compute the condition numbers and interpret their meanings.
div_lm <- lm(divorce ~ unemployed + femlab + marriage + military, data=divusa)
summary(div_lm)
##
## Call:
## lm(formula = divorce ~ unemployed + femlab + marriage + military,
## data = divusa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5080 -1.8613 -0.2785 1.5483 5.3526
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.17511 3.94663 -3.338 0.00134 **
## unemployed 0.10809 0.06892 1.568 0.12115
## femlab 0.51252 0.03686 13.906 < 2e-16 ***
## marriage 0.08384 0.03359 2.496 0.01486 *
## military -0.01805 0.01985 -0.909 0.36612
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.305 on 72 degrees of freedom
## Multiple R-squared: 0.8434, Adjusted R-squared: 0.8347
## F-statistic: 96.92 on 4 and 72 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(div_lm)
plot(div_lm$fitted.values, divusa$divorce)
The plot of fitted values against observed divorce rates suggests a violation of the linearity assumption of the regression; this needs a closer look.
pairs(divusa[,c(2,3,4,5,7)])
Let’s take a look at the Y variable and its distribution.
hist(divusa$divorce, col='green', main="Divorce in the USA (divorce per 1000 women aged 15 or more)", breaks=20)
There appears to be a bimodal pattern.
Let’s see what happens if we take the log(divorce) and create the model.
div_lm2 <- lm(log(divorce) ~ unemployed + femlab + marriage + military, data=divusa)
summary(div_lm2)
##
## Call:
## lm(formula = log(divorce) ~ unemployed + femlab + marriage +
## military, data = divusa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25296 -0.11826 -0.01309 0.09829 0.30294
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1783155 0.2581306 0.691 0.492
## unemployed 0.0074488 0.0045074 1.653 0.103
## femlab 0.0407824 0.0024106 16.918 < 2e-16 ***
## marriage 0.0095098 0.0021973 4.328 4.77e-05 ***
## military -0.0002198 0.0012980 -0.169 0.866
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1508 on 72 degrees of freedom
## Multiple R-squared: 0.8796, Adjusted R-squared: 0.873
## F-statistic: 131.6 on 4 and 72 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(div_lm2)
There are still a number of issues with this linear regression model. The adjusted R-squared is 0.873, which is better than the previous model, and femlab and marriage remain statistically significant. That said, let's try a Box-Cox transformation.
library(MASS)
## Warning: package 'MASS' was built under R version 3.4.3
boxcox(divorce ~ unemployed + femlab + marriage + military, data=divusa)
Let's try the inverse transformation (lambda = -1), as the Box-Cox plot suggests.
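To read off the suggested power more precisely, the profile returned by boxcox() can be inspected directly; a small sketch using the same formula as above:

```r
# boxcox() returns the grid of lambda values and their profile log-likelihoods
bc <- boxcox(divorce ~ unemployed + femlab + marriage + military,
             data = divusa, plotit = FALSE)
bc$x[which.max(bc$y)]   # lambda maximising the likelihood; expected to be near -1
```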
div_lm3 <- lm(1/divorce ~ unemployed + femlab + marriage + military, data=divusa)
summary(div_lm3)
##
## Call:
## lm(formula = 1/divorce ~ unemployed + femlab + marriage + military,
## data = divusa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.021302 -0.008442 0.001082 0.008205 0.021169
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.091e-01 1.882e-02 16.426 < 2e-16 ***
## unemployed -4.906e-04 3.286e-04 -1.493 0.140
## femlab -3.549e-03 1.757e-04 -20.196 < 2e-16 ***
## marriage -1.071e-03 1.602e-04 -6.684 4.2e-09 ***
## military -7.501e-05 9.462e-05 -0.793 0.431
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01099 on 72 degrees of freedom
## Multiple R-squared: 0.9077, Adjusted R-squared: 0.9026
## F-statistic: 177 on 4 and 72 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(div_lm3)
Again, the adjusted R-squared improved, although R-squared values are not directly comparable across models with different response transformations.
Now let's try the loess() function for a nonparametric comparison.
div_lm6 <- loess(divorce ~ unemployed + femlab + marriage + military, data=divusa)
summary(div_lm6)
## Call:
## loess(formula = divorce ~ unemployed + femlab + marriage + military,
## data = divusa)
##
## Number of Observations: 77
## Equivalent Number of Parameters: 20.89
## Residual Standard Error: 2.867
## Trace of smoother matrix: 26.61 (exact)
##
## Control settings:
## span : 0.75
## degree : 2
## family : gaussian
## surface : interpolate cell = 0.2
## normalize: TRUE
## parametric: FALSE FALSE FALSE FALSE
## drop.square: FALSE FALSE FALSE FALSE
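For a rough visual comparison of the loess fit with the earlier linear fit, a small sketch; plotting both sets of fitted values over time is one assumed way to summarise a four-predictor loess surface, not part of the original output:

```r
# Fitted values from the linear and loess models plotted over time against the data
plot(divusa$year, divusa$divorce, type = "l",
     xlab = "Year", ylab = "Divorces per 1000 women")
lines(divusa$year, fitted(div_lm),  col = "red",  lty = 2)
lines(divusa$year, fitted(div_lm6), col = "blue", lty = 3)
legend("topleft", legend = c("Observed", "Linear fit", "Loess fit"),
       col = c("black", "red", "blue"), lty = 1:3)
```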
vif(divusa[, c(3,4,5,7)])
## unemployed femlab marriage military
## 1.753836 2.689578 2.780902 1.242909
Reference: https://stats.stackexchange.com/questions/169445/how-to-interpret-a-vif-of-4
- VIF = 1: not correlated
- 1 < VIF < 5: moderately correlated
- VIF >= 5: highly correlated
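Since the exercise asks for condition numbers rather than VIFs, here is a minimal sketch following the eigenvalue approach used in Faraway's text; working from the raw (uncentred) predictor matrix of div_lm is an assumption made here:

```r
# Condition numbers from the eigenvalues of X'X for the four predictors
x <- model.matrix(div_lm)[, -1]   # drop the intercept column
e <- eigen(t(x) %*% x)
sqrt(e$values[1] / e$values)      # condition indices; values above ~30 suggest collinearity
```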
In the model-building phase, military and unemployed were not statistically significant. Let's remove these two.
div_lm4 <- lm(divorce ~ femlab + marriage, data=divusa)
summary(div_lm4)
##
## Call:
## lm(formula = divorce ~ femlab + marriage, data = divusa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7249 -1.9049 0.1659 1.4827 6.3876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.88696 2.86111 -2.757 0.00735 **
## femlab 0.46902 0.02996 15.655 < 2e-16 ***
## marriage 0.04194 0.02686 1.562 0.12259
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.339 on 74 degrees of freedom
## Multiple R-squared: 0.8343, Adjusted R-squared: 0.8298
## F-statistic: 186.3 on 2 and 74 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(div_lm4)
vif(divusa[, c(4,5)])
## femlab marriage
## 1.726273 1.726273
Not particularly: the VIFs for femlab and marriage are both about 1.73, indicating only mild collinearity between the remaining predictors.
8. For the divusa data, fit a model with divorce as the response and the other variables, except year, as predictors. Check for serial correlation.
divorce_lm <- lm(divorce ~ . - year, data=divusa)
summary(divorce_lm)
##
## Call:
## lm(formula = divorce ~ . - year, data = divusa)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8611 -0.8916 -0.0496 0.8650 3.8300
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.48784 3.39378 0.733 0.4659
## unemployed -0.11125 0.05592 -1.989 0.0505 .
## femlab 0.38365 0.03059 12.543 < 2e-16 ***
## marriage 0.11867 0.02441 4.861 6.77e-06 ***
## birth -0.12996 0.01560 -8.333 4.03e-12 ***
## military -0.02673 0.01425 -1.876 0.0647 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.65 on 71 degrees of freedom
## Multiple R-squared: 0.9208, Adjusted R-squared: 0.9152
## F-statistic: 165.1 on 5 and 71 DF, p-value: < 2.2e-16
Reference: http://www.statisticshowto.com/durbin-watson-test-coefficient/, https://www.safaribooksonline.com/library/view/the-r-book/9780470510247/ch010-sec018.html
The Durbin-Watson test checks whether the residuals from a linear model or GLM are autocorrelated; it is implemented as durbinWatsonTest() in the car package.
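For intuition, the statistic itself is just the ratio of the sum of squared successive residual differences to the residual sum of squares; a quick sketch computed by hand, which should agree with the car output below:

```r
# Durbin-Watson statistic computed directly from the residuals
r <- residuals(divorce_lm)
sum(diff(r)^2) / sum(r^2)   # expected to be close to the ~0.30 reported below
```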
library(car)
## Warning: package 'car' was built under R version 3.4.3
##
## Attaching package: 'car'
## The following objects are masked from 'package:faraway':
##
## logit, vif
durbinWatsonTest(divorce_lm)
## lag Autocorrelation D-W Statistic p-value
## 1 0.8260933 0.2998834 0
## Alternative hypothesis: rho != 0
The D-W statistic of 0.30 is well below 2, indicating strong positive autocorrelation in the residuals. With a p-value of essentially 0, we reject the null hypothesis of no autocorrelation: there is serial correlation in this model.
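One way to visualise the serial correlation and then account for it is to plot successive residuals against each other and refit with AR(1)-correlated errors via nlme::gls; this is a hedged sketch of a possible next step, not part of the original analysis:

```r
# Successive residuals plotted against each other show the positive autocorrelation
r <- residuals(divorce_lm)
plot(head(r, -1), tail(r, -1), xlab = "residual i", ylab = "residual i + 1")

# Refit the same mean model allowing AR(1)-correlated errors
library(nlme)
gls_mod <- gls(divorce ~ unemployed + femlab + marriage + birth + military,
               correlation = corAR1(form = ~ year), data = divusa)
summary(gls_mod)
```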