Abstract

I will be using Linear Models with R (Faraway), to explore solving a problem where the errors have an issue and robust regression techniques are needed.

The Data

The data looks at divorce rates in the USA from 1920 - 1996.

## 'data.frame':    77 obs. of  7 variables:
##  $ year      : int  1920 1921 1922 1923 1924 1925 1926 1927 1928 1929 ...
##  $ divorce   : num  8 7.2 6.6 7.1 7.2 7.2 7.5 7.8 7.8 8 ...
##  $ unemployed: num  5.2 11.7 6.7 2.4 5 3.2 1.8 3.3 4.2 3.2 ...
##  $ femlab    : num  22.7 22.8 22.9 23 23.1 ...
##  $ marriage  : num  92 83 79.7 85.2 80.3 79.2 78.7 77 74.1 75.5 ...
##  $ birth     : num  118 120 111 110 111 ...
##  $ military  : num  3.22 3.56 2.46 2.21 2.29 ...
##       year         divorce        unemployed         femlab     
##  Min.   :1920   Min.   : 6.10   Min.   : 1.200   Min.   :22.70  
##  1st Qu.:1939   1st Qu.: 8.70   1st Qu.: 4.200   1st Qu.:27.47  
##  Median :1958   Median :10.60   Median : 5.600   Median :37.10  
##  Mean   :1958   Mean   :13.27   Mean   : 7.173   Mean   :38.58  
##  3rd Qu.:1977   3rd Qu.:20.30   3rd Qu.: 7.500   3rd Qu.:47.80  
##  Max.   :1996   Max.   :22.80   Max.   :24.900   Max.   :59.30  
##     marriage          birth           military     
##  Min.   : 49.70   Min.   : 65.30   Min.   : 1.940  
##  1st Qu.: 61.90   1st Qu.: 68.90   1st Qu.: 3.469  
##  Median : 74.10   Median : 85.90   Median : 9.102  
##  Mean   : 72.97   Mean   : 88.89   Mean   :12.365  
##  3rd Qu.: 80.00   3rd Qu.:107.30   3rd Qu.:14.266  
##  Max.   :118.10   Max.   :122.90   Max.   :86.641

**The year shows values from 1920 - 1996, that makes up the 77 observations that are part of the dataset.

  • divorce per 1,000 women aged 15 or more. or more.
  • unemployed tells us the unemployment rate for the year.
  • femlab is the percent of female participation in the labor force 16 and over.
  • marriage marriages per 1,000 unmarried women aged 16.
  • birth another “per 1,000” women aged 15-44.
  • military another “per 1,000” military personnel.

The time series also has very clean data, as we can see from the summary there are no NA values and all types are of numeric.

Problem Statement

Using the divusa data, fit a regression model with divorce as the response and unemployed, femlab, marriage, birth and military as predictors.

  1. Make two graphical checks for correlated errors. What do you conclude?
  2. Allow for serial correlation with an AR(1) model for the errors. (Hint: Use maximum likelihood to estimate the parameters in the GLS fit by gls(…, method=“ML”, …)). What is the estimated correlation and is it significant? Does the GLS model change which variables are found to be significant?
  3. Speculate why there might be correlation in the errors.

Regression Model

## 
## Call:
## lm(formula = divorce ~ unemployed + femlab + marriage + birth + 
##     military, data = divusa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8611 -0.8916 -0.0496  0.8650  3.8300 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.48784    3.39378   0.733   0.4659    
## unemployed  -0.11125    0.05592  -1.989   0.0505 .  
## femlab       0.38365    0.03059  12.543  < 2e-16 ***
## marriage     0.11867    0.02441   4.861 6.77e-06 ***
## birth       -0.12996    0.01560  -8.333 4.03e-12 ***
## military    -0.02673    0.01425  -1.876   0.0647 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.65 on 71 degrees of freedom
## Multiple R-squared:  0.9208, Adjusted R-squared:  0.9152 
## F-statistic: 165.1 on 5 and 71 DF,  p-value: < 2.2e-16
## unemployed     femlab   marriage      birth   military 
##   2.252888   3.613276   2.864864   2.585485   1.249596

The R-squared is very high and the multi-collinearity checks out as well.

The next thing we will look at is to see if there are any problems with the errors (specifically if they are correlated).

Graphical Checks for Correlated Errors

## [1] "Correlation between successive residuals: 0.846979231301387"

We are able to see that there is a clear positive relationship between the residuals and a linear trend. The cor package verifies this when the residuals from the model are given with a value of .847 between successive residuals.

Serial Correlation with an AR(1) Model

## Generalized least squares fit by maximum likelihood
##   Model: divorce ~ unemployed + femlab + marriage + birth + military 
##   Data: divusa 
##        AIC      BIC    logLik
##   179.9523 198.7027 -81.97613
## 
## Correlation Structure: AR(1)
##  Formula: ~year 
##  Parameter estimate(s):
##       Phi 
## 0.9715486 
## 
## Coefficients:
##                 Value Std.Error   t-value p-value
## (Intercept) -7.059682  5.547193 -1.272658  0.2073
## unemployed   0.107643  0.045915  2.344395  0.0219
## femlab       0.312085  0.095151  3.279878  0.0016
## marriage     0.164326  0.022897  7.176766  0.0000
## birth       -0.049909  0.022012 -2.267345  0.0264
## military     0.017946  0.014271  1.257544  0.2127
## 
##  Correlation: 
##            (Intr) unmply femlab marrig birth 
## unemployed -0.420                            
## femlab     -0.802  0.240                     
## marriage   -0.516  0.607  0.307              
## birth      -0.379  0.041  0.066 -0.094       
## military   -0.036  0.436 -0.311  0.530  0.128
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -1.4509327 -0.9760939 -0.6164694  1.1375377  2.1593261 
## 
## Residual standard error: 2.907665 
## Degrees of freedom: 77 total; 71 residual
## Approximate 95% confidence intervals
## 
##  Correlation structure:
##         lower      est.     upper
## Phi 0.6528097 0.9715486 0.9980192
## attr(,"label")
## [1] "Correlation structure:"
## 
##  Residual standard error:
##      lower       est.      upper 
##  0.7974404  2.9076645 10.6020628

We can see above that the Phi value almost outputs a complete positive relationship at .97! Our confidence interval check also checks that at 95% the AR(1) correlations is significant.

The GLS model also did have a different result of what variables it found significant. The GLS model only found marriage and femlab to be statistically significant.

Infer Correlation in Errors

## 
##  Durbin-Watson test
## 
## data:  divorce ~ unemployed + femlab + marriage + birth + military
## DW = 0.29988, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0

We can see further above that the Dubin-Watson statistic confirms the auto correlation.

I would say that the correlation of errors exist because our time series is using year as a period. The errors for divorce rate therefore would be quite similar from year to year with a positive trend, which all in all if you think about it is unfortunately common knowledge and makes sense.