Residual analysis in the model with log(wages) and interaction.

load("/Users/arminda/Desktop/Arminda/AID/LRS/Contingency.RData")
lm3w
## 
## Call:
## lm(formula = log(wages) ~ education + workexp + unionmember + 
##     south + occupation + female + female * workexp, data = wages)
## 
## Coefficients:
##            (Intercept)               education                 workexp  
##               1.158463                0.071404                0.014051  
##            unionmember                   south         occupationSales  
##               0.196405               -0.110057               -0.344788  
##     occupationClerical       occupationService  occupationProfessional  
##              -0.202401               -0.381997               -0.036711  
##        occupationOther                  female          workexp:female  
##              -0.189099               -0.085820               -0.006936
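
If the object lm3w is not available from the .RData file, the same model can be re-fitted directly, assuming the wages data frame has been loaded (a sketch based on the Call shown above):

lm3w <- lm(log(wages) ~ education + workexp + unionmember + south + occupation +
             female + female*workexp, data = wages)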

Plots and Normality

Besides the typical diagnostic plots that R produces by default with the function plot(object.lm):

par(mfrow=c(2,2))
plot(lm3w, which=c(1:4), ask=F)

we can save the residuals and the fitted values to plot them separately, using ggplot():

library("dplyr")
library("ggplot2")
wages <- dplyr::mutate(wages, resid=residuals(lm3w), fv=fitted(lm3w), predwage=exp(fv)) 
head(wages) 
##   education south female workexp unionmember wages age ethnicity occupation
## 1         8     0      1      21           0  5.10  35  Hispanic      Other
## 2         9     0      1      42           0  4.95  57     White      Other
## 3        12     0      0       1           0  6.67  19     White      Other
## 4        12     0      0       4           0  4.00  22     White      Other
## 5        12     0      0      17           0  7.50  35     White      Other
## 6        13     0      0       9           1 13.07  28     White      Other
##          sector married         res       resid       fv predwage           Di
## 1 Manufacturing       1  0.02504699  0.02504699 1.604194 4.973847 4.345787e-05
## 2 Manufacturing       1 -0.22562513 -0.22562513 1.825013 6.202874 5.223779e-04
## 3 Manufacturing       0  0.05735287  0.05735287 1.840267 6.298220 6.514003e-05
## 4         Other       0 -0.49612462 -0.49612462 1.882419 6.569377 1.229871e-03
## 5         Other       1 -0.05017450 -0.05017450 2.065078 7.885909 5.234100e-07
## 6         Other       0  0.34983768  0.34983768 2.220482 9.211768 1.163574e-03
##      Leverage n
## 1 0.017794032 1
## 2 0.027427466 2
## 3 0.013888409 3
## 4 0.011737654 4
## 5 0.008144814 5
## 6 0.016393749 6
ggplot(wages, aes(x=fv, y=resid))+geom_point()+geom_text(label=rownames(wages))

Normality of the residuals, visually: the residuals are not far from normality, except for two outliers, observations 171 and 200. They are extreme enough that any formal test would reject the assumption of normality. Let’s check that these two observations are, in fact, outliers. The null hypothesis is \(H_0:\,\) the observation is not an outlier:

wages[c(171,200),]
##     education south female workexp unionmember wages age ethnicity occupation
## 171        14     0      1       1           0  44.5  21     White Management
## 200        12     0      0      24           0   1.0  42     White Management
##     sector married       res     resid       fv  predwage         Di   Leverage
## 171  Other       0  1.716070  1.716070 2.079419  7.999821 0.04770244 0.03006997
## 200  Other       1 -2.352531 -2.352531 2.352531 10.512145 0.06398884 0.02317375
##       n
## 171 171
## 200 200
library("car")
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
outlierTest(lm3w, cutoff=0.05)
##      rstudent unadjusted p-value Bonferroni p
## 200 -5.726492         1.7340e-08   9.2596e-06
## 171  4.131709         4.1946e-05   2.2399e-02
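# A sketch of what outlierTest() reports: the Bonferroni p-value is the unadjusted
# p-value of the externally studentized residual multiplied by n (capped at 1)
rstud <- rstudent(lm3w)
p_unadj <- 2 * pt(abs(rstud["200"]), df = df.residual(lm3w) - 1, lower.tail = FALSE)
c(unadjusted = p_unadj, Bonferroni = min(1, nrow(wages) * p_unadj))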
library("tseries")
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
qqPlot(wages$resid, dist="norm")

## [1] 200 171
jarque.bera.test(wages$resid)
## 
##  Jarque Bera Test
## 
## data:  wages$resid
## X-squared = 65.557, df = 2, p-value = 5.773e-15

Removing these observations:

jarque.bera.test(wages$resid[-c(171,200)])
## 
##  Jarque Bera Test
## 
## data:  wages$resid[-c(171, 200)]
## X-squared = 0.46768, df = 2, p-value = 0.7915

Constant variance.

In the plots above we still see a hint of lower variance at low fitted values, with the spread becoming more regular as wages increase. We check constant variance with, for instance, the Breusch-Pagan test (the null being that the residuals are homoscedastic):

library("lmtest") #after linear model tests
bptest(lm3w)
## 
##  studentized Breusch-Pagan test
## 
## data:  lm3w
## BP = 17.707, df = 11, p-value = 0.08864

The p-value is close to our nominal (and admittedly subjective) level of \(0.05\), but above it, so we do not reject the null: let’s assume the variance of these residuals is constant.
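
Note that bptest() performs the studentized (Koenker) version of the test by default; the original Breusch-Pagan statistic, which relies on normally distributed errors, can be requested with studentize = FALSE:

bptest(lm3w, studentize = FALSE)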

An alternative test of constant variance is the Goldfeld-Quandt test. It splits the observations at a specified breakpoint, fits the model on each segment, and rejects the null if the two residual variances differ. Type help(gqtest) for more information:

gqtest(lm3w)
## 
##  Goldfeld-Quandt test
## 
## data:  lm3w
## GQ = 0.96821, df1 = 255, df2 = 255, p-value = 0.6017
## alternative hypothesis: variance increases from segment 1 to 2
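
Because the split depends on how the observations are ordered, it can be informative to sort them by a regressor suspected of driving the variance before splitting. A sketch, using arguments described in help(gqtest):

gqtest(lm3w, order.by = ~ workexp, data = wages, fraction = 0.2)  # omit the central 20% of observations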

Independence of the residuals.

Let’s check this assumption using, for instance, the Durbin-Watson test. The null hypothesis is the independence (no autocorrelation) of the residuals.

dwtest(lm3w)
## 
##  Durbin-Watson test
## 
## data:  lm3w
## DW = 1.9345, p-value = 0.1654
## alternative hypothesis: true autocorrelation is greater than 0

Influential Observations.

Basically, three types of unusual observations may appear in a regression model: outliers, leverage points and influential points. There are many different measures to assess the degree to which a point is an outlier, a leverage point or an influential point, but we will use three: residuals (for outliers), hat-values (for leverage) and Cook’s distance (for influence). Be warned that many more measures exist, although they are usually related to one of the three just mentioned.

We have already identified the outliers of our model: observations 171 and 200. They are outliers because there is a large difference between the wage (\(Y\) value) we observed for them and what the model predicts, \(\hat{Y}\). Standardized residuals with values beyond \(\pm 2\) or \(\pm 3\) should be further inspected (see the sketch after the table below).

wages[c(171,200),c(6,14)]
##     wages       fv
## 171  44.5 2.079419
## 200   1.0 2.352531
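
A quick numerical check of that rule of thumb, as a sketch using the internally standardized residuals (rstudent() would give the externally studentized version):

rs <- rstandard(lm3w)   # standardized residuals
which(abs(rs) > 3)      # extreme outlier candidates
sum(abs(rs) > 2)        # how many exceed the milder cutoff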

The leverage (\(h_i\)) of a point measures how far that observation lies from the others in terms of the explanatory variables only (not the dependent variable); see slides 29 to 32 of the second set of slides on Regression Analysis for an illustration. Leverage is measured through the hat-value: the rule of thumb is that observations with values larger than \(\frac{2(p+1)}{n}\) are considered potentially influential (\(p\) is the number of predictors and the \(+1\) accounts for the intercept).
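
As a sketch, this cutoff can be computed directly from the fitted model and compared with the hat-values R stores for every observation:

p <- length(coef(lm3w)) - 1   # number of predictors (intercept excluded)
n <- nrow(wages)
cut_h <- 2 * (p + 1) / n      # rule-of-thumb leverage cutoff
sum(hatvalues(lm3w) > cut_h)  # how many observations exceed it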

Cook’s distance (\(D_i\)) combines a measure of discrepancy (the difference between \(Y\) and \(\hat{Y}\)) with leverage. The classical criterion \(D_i > 1\) is risky, as it can leave out noteworthy points; our rule of thumb will be values larger than \(\frac{4}{n-p-1}\).
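
The corresponding sketch for Cook's distance, reusing the n and p defined above and the built-in cooks.distance() function:

cut_D <- 4 / (n - p - 1)             # rule-of-thumb Cook's distance cutoff
which(cooks.distance(lm3w) > cut_D)  # observations above the cutoff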

To obtain these measures, we are going to use the built-in R-function influence.measures():

InF=influence.measures(lm3w)

With summary(InF) you get the observations that R judges influential, based on many more criteria than the three we are using. To keep only the columns we are interested in, together with the residuals:

# Columns 15 and 16 of InF$infmat are cook.d and hat for this model (12 coefficients)
wages <- dplyr::mutate(wages, Di=InF$infmat[,15], Leverage=InF$infmat[,16], n=rownames(InF$infmat))
head(wages)
##   education south female workexp unionmember wages age ethnicity occupation
## 1         8     0      1      21           0  5.10  35  Hispanic      Other
## 2         9     0      1      42           0  4.95  57     White      Other
## 3        12     0      0       1           0  6.67  19     White      Other
## 4        12     0      0       4           0  4.00  22     White      Other
## 5        12     0      0      17           0  7.50  35     White      Other
## 6        13     0      0       9           1 13.07  28     White      Other
##          sector married         res       resid       fv predwage           Di
## 1 Manufacturing       1  0.02504699  0.02504699 1.604194 4.973847 5.723859e-06
## 2 Manufacturing       1 -0.22562513 -0.22562513 1.825013 6.202874 6.935196e-04
## 3 Manufacturing       0  0.05735287  0.05735287 1.840267 6.298220 2.322252e-05
## 4         Other       0 -0.49612462 -0.49612462 1.882419 6.569377 1.482924e-03
## 5         Other       1 -0.05017450 -0.05017450 2.065078 7.885909 1.085233e-05
## 6         Other       0  0.34983768  0.34983768 2.220482 9.211768 9.698191e-04
##      Leverage n
## 1 0.019317639 1
## 2 0.028317334 2
## 3 0.015077283 3
## 4 0.012922872 4
## 5 0.009314255 5
## 6 0.016861891 6

and filter this data frame to find the observations that exceed each cutoff. For leverage:

wages %>% filter(Leverage >0.04) %>% select(n, wages, education, workexp, female, Leverage)
##      n wages education workexp female   Leverage
## 1   32  4.00        12      46      1 0.04075805
## 2   34  5.00        17       1      1 0.04029067
## 3   63  7.00         3      55      0 0.06215793
## 4   76  3.00         6      43      1 0.04594829
## 5  168  4.84        15       1      0 0.04213041
## 6  218 12.50        12      43      0 0.04553410
## 7  223  6.40        12      45      1 0.04838451
## 8  229  3.35        13       2      1 0.04027242
## 9  231 19.98        14      44      0 0.05309616
## 10 240 13.71        12      43      0 0.05249917
## 11 347  6.00         4      54      0 0.06729734
## 12 351  3.75         2      16      0 0.07189593
## 13 356  3.50         9      48      0 0.04296142
## 14 406  8.93         8      47      0 0.04457778
## 15 414  3.50         9      47      0 0.04286220
## 16 427  9.50         9      46      1 0.04045953
## 17 525  5.75         9      34      1 0.05204557
mean(wages$workexp)
## [1] 17.8221

These are high-leverage points because their work experience is substantially below or above the mean (observation 351 is also extreme in education, with only 2 years).
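
A small sketch to quantify how extreme these points are, expressing work experience in standard deviations from its mean for the high-leverage observations:

round(scale(wages$workexp)[wages$Leverage > 0.04], 2)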

For Cook’s distance: many points (\(30\)) exceed the cutoff of \(0.007\), but two of them are much larger than all the others:

 wages %>% filter(Di >0.02) %>% select(n, wages, education, workexp, female, Di)
##     n wages education workexp female         Di
## 1 171  44.5        14       1      1 0.04376649
## 2 200   1.0        12      24      0 0.06344948
## 3 410  25.0        14       4      0 0.02002423
## 4 414   3.5         9      47      0 0.02224673

which are again the outliers already identified. The default diagnostic plots at the beginning of this document include one based on Cook’s distance; there, three points immediately stick out from the rest: observations 200, 171 and 414.

influencePlot(lm3w, labels=NULL,id.method="noteworthy")
## Warning in plot.window(...): "id.method" is not a graphical parameter
## Warning in plot.xy(xy, type, ...): "id.method" is not a graphical parameter
## Warning in axis(side = side, at = at, labels = labels, ...): "id.method" is not
## a graphical parameter

## Warning in axis(side = side, at = at, labels = labels, ...): "id.method" is not
## a graphical parameter
## Warning in box(...): "id.method" is not a graphical parameter
## Warning in title(...): "id.method" is not a graphical parameter
## Warning in plot.xy(xy.coords(x, y), type = type, ...): "id.method" is not a
## graphical parameter

##         StudRes        Hat        CookD
## 171  4.13170920 0.03073788 4.376649e-02
## 200 -5.72649166 0.02404034 6.344948e-02
## 347 -0.07018317 0.06729734 2.967345e-05
## 351  0.43023806 0.07189593 1.196803e-03
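
The warnings above appear because id.method and labels are arguments from older versions of car; in recent versions (an assumption about the installed package, car 3.x) point identification is controlled through the id argument instead:

influencePlot(lm3w, id = list(n = 4))  # label the 4 most noteworthy points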

Box-Cox transformation.

We start from our original model, lm1w, plus the interaction effect term, where the response variable was \(Y=\) hourly wages. The function powerTransform() is applied to the whole model to see whether there is a transformation of the \(Y\) variable that improves the distribution of the residuals.

lm4w=update(lm1w, ~.+workexp:female)
summary(powerTransform(lm4w))
## bcPower Transformation to Normality 
##    Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## Y1   -0.0153           0      -0.1365       0.1059
## 
## Likelihood ratio test that transformation parameter is equal to 0
##  (log transformation)
##                              LRT df    pval
## LR test, lambda = (0) 0.06123304  1 0.80456
## 
## Likelihood ratio test that no transformation is needed
##                            LRT df       pval
## LR test, lambda = (1) 255.4845  1 < 2.22e-16

As shown in the output, the estimated \(\lambda=-0.015\) is very close to \(0\). Further down in the output, the hypothesis that \(\lambda=0\), which corresponds to the logarithmic transformation, cannot be rejected (p-value \(=0.8046\)), while the hypothesis that no transformation is needed (\(\lambda=1\)) is strongly rejected. Since transformations with an easier interpretation are preferred, the logarithmic transformation is chosen as a very good one for our \(Y\) variable (even if not exactly the maximum-likelihood one). We have already seen the improvement that this transformation brings to the distribution of the residuals.

library("MASS")
boxcox(lm1w)

The Log-Likelihood function shows how close the best value for \(\lambda\) is to \(0\).
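
To read the optimal value off the profile rather than the plot, a small sketch using the list that boxcox() returns:

bc <- boxcox(lm1w, plotit = FALSE)
bc$x[which.max(bc$y)]   # lambda with the highest profile log-likelihood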