1. Using the divusa data, fit a regression model with divorce as the response and unemployed, femlab, marriage, birth and military as predictors.
##   id year divorce unemployed femlab marriage birth military
## 1  1 1920     8.0        5.2  22.70     92.0 117.9   3.2247
## 2  2 1921     7.2       11.7  22.79     83.0 119.8   3.5614
## 3  3 1922     6.6        6.7  22.88     79.7 111.2   2.4553
## 4  4 1923     7.1        2.4  22.97     85.2 110.5   2.2065
## 5  5 1924     7.2        5.0  23.06     80.3 110.9   2.2889
## 6  6 1925     7.2        3.2  23.15     79.2 106.6   2.1735
## [1] 77  8
## [1] "id"         "year"       "divorce"    "unemployed" "femlab"    
## [6] "marriage"   "birth"      "military"

Variable Description

Year: the year from 1920-1996
Divorce: divorce per 1000 women aged 15 or more (response variable)
Unemployed: unemployment rate (explanatory variable)
Femlab: percent female participation in labor force aged 16+ (explanatory variable)
Marriage: marriages per 1000 unmarried women aged 16+ (explanatory variable)
Birth: births per 1000 women aged 15-44 (explanatory variable)
Military: military personnel per 1000 population (explanatory variable)

Questions

  1. Make two graphical checks for correlated errors. What do you conclude?
## 
## Call:
## lm(formula = divorce ~ . - id, data = divusa)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9087 -0.9212 -0.0935  0.7447  3.4689 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 380.14761   99.20371   3.832 0.000274 ***
## year         -0.20312    0.05333  -3.809 0.000297 ***
## unemployed   -0.04933    0.05378  -0.917 0.362171    
## femlab        0.80793    0.11487   7.033 1.09e-09 ***
## marriage      0.14977    0.02382   6.287 2.42e-08 ***
## birth        -0.11695    0.01470  -7.957 2.19e-11 ***
## military     -0.04276    0.01372  -3.117 0.002652 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.513 on 70 degrees of freedom
## Multiple R-squared:  0.9344, Adjusted R-squared:  0.9288 
## F-statistic: 166.2 on 6 and 70 DF,  p-value: < 2.2e-16

Based on the residual plot, there appears to be some pattern as the residuals are completely random/scattered. All variables except ‘unemployed’ are expected to be significant under this model.

## 
##  Durbin-Watson test
## 
## data:  divorce ~ . - id
## DW = 0.37429, p-value < 2.2e-16
## alternative hypothesis: true autocorrelation is greater than 0
## 
## Call:
## lm(formula = divorce ~ . - id, data = divusa_auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.07659 -0.38330 -0.02652  0.30530  3.06192 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -54.060448  61.056815  -0.885   0.3791    
## year            0.030428   0.032833   0.927   0.3573    
## unemployed     -0.053001   0.027008  -1.962   0.0538 .  
## femlab         -0.007163   0.083131  -0.086   0.9316    
## marriage        0.026072   0.014850   1.756   0.0836 .  
## birth          -0.048301   0.008798  -5.490 6.45e-07 ***
## military        0.014675   0.008103   1.811   0.0746 .  
## LAST_YEAR_DIV   0.808416   0.055674  14.521  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7579 on 68 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.9838, Adjusted R-squared:  0.9822 
## F-statistic: 590.7 on 7 and 68 DF,  p-value: < 2.2e-16

This second residual plot features smaller residuals and a weaker pattern especially in the lower half of the dataset. Now only the ‘birth’ variable is significant.

  1. Allow for serial correlation with an AR(1) model for the errors. (Hint: Use maximum likelihood to estimate the parameters in the GLS fit by gls(…, method=“ML”, …)). What is the estimated correlation and is it significant? Does the GLS model change which variables are found to be significant?
## Generalized least squares fit by maximum likelihood
##   Model: divorce ~ . - id 
##   Data: divusa 
##        AIC     BIC    logLik
##   175.7777 196.872 -78.88886
## 
## Correlation Structure: AR(1)
##  Formula: ~year 
##  Parameter estimate(s):
##       Phi 
## 0.9682691 
## 
## Coefficients:
##                 Value Std.Error   t-value p-value
## (Intercept) -453.4103 184.44766 -2.458206  0.0164
## year           0.2339   0.09671  2.418909  0.0182
## unemployed     0.0955   0.04479  2.132264  0.0365
## femlab        -0.0167   0.16636 -0.100121  0.9205
## marriage       0.1573   0.02238  7.025867  0.0000
## birth         -0.0312   0.02274 -1.373247  0.1741
## military       0.0444   0.01777  2.497031  0.0149
## 
##  Correlation: 
##            (Intr) year   unmply femlab marrig birth 
## year       -1.000                                   
## unemployed  0.102 -0.114                            
## femlab      0.830 -0.843  0.226                     
## marriage    0.119 -0.134  0.612  0.279              
## birth      -0.362  0.352 -0.001 -0.256 -0.134       
## military   -0.637  0.636  0.266 -0.657  0.326  0.312
## 
## Standardized residuals:
##        Min         Q1        Med         Q3        Max 
## -1.9802473 -1.3946446 -0.4085795  0.5871811  1.9777468 
## 
## Residual standard error: 2.649147 
## Degrees of freedom: 77 total; 70 residual
## Approximate 95% confidence intervals
## 
##  Correlation structure:
##         lower      est.     upper
## Phi 0.8692724 0.9682691 0.9925951
## attr(,"label")
## [1] "Correlation structure:"
## 
##  Residual standard error:
##    lower     est.    upper 
## 1.291057 2.649147 5.435842

The gls function detects significant autocorrelation as indicated by Phi = 0.968. The interval does not come close to overlaping with a value of zero. Also, the significant variables have also changed relative to a linear model ignoring autocorrelation. All but the two variables ‘femlab’ and ‘birth’ are significant, with ‘marriage’ being the most significant.

  1. Speculate why there might be correlation in the errors.

Correlation in the errors may be due to the overall larger trends that dwarf the impact of the explanatory variables. In the initial model, there were patterns in the residuals in the latter half of the 20th century, owing to the significant changes starting in the late 1960’s. The dataset does mention marriage rate, but since these are annual statistics it does not capture the time between marriage and divorce.