Data621-Week06-Discussion

Dataset divusa is part of faraway package. It contains divorce rates data collected in the USA from 1920-1996. Dataset has 77 observations on seven variables.

(a) The regression model using unemployed, femlab, marriage, birth and military as predictors and divorce as a response variable.

library(faraway)
library(dplyr)
library(ggplot2)
library(GGally)

#Load data
divusa.df <- divusa %>% select(divorce,unemployed,femlab,marriage,birth,military)

#Linear model
divusa.lm <- lm(divorce ~ unemployed + femlab + marriage + birth + military, data = divusa.df)

sumary(divusa.lm)

##              Estimate Std. Error t value  Pr(>|t|)
## (Intercept)  2.487845   3.393779  0.7331   0.46594
## unemployed  -0.111252   0.055925 -1.9893   0.05052
## femlab       0.383649   0.030587 12.5430 < 2.2e-16
## marriage     0.118674   0.024414  4.8609 6.772e-06
## birth       -0.129959   0.015595 -8.3334 4.027e-12
## military    -0.026734   0.014247 -1.8764   0.06471
## 
## n = 77, p = 6, Residual SE = 1.65042, R-Squared = 0.92

summary(divusa.lm)

## 
## Call:
## lm(formula = divorce ~ unemployed + femlab + marriage + birth + 
##     military, data = divusa.df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8611 -0.8916 -0.0496  0.8650  3.8300 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.48784    3.39378   0.733   0.4659    
## unemployed  -0.11125    0.05592  -1.989   0.0505 .  
## femlab       0.38365    0.03059  12.543  < 2e-16 ***
## marriage     0.11867    0.02441   4.861 6.77e-06 ***
## birth       -0.12996    0.01560  -8.333 4.03e-12 ***
## military    -0.02673    0.01425  -1.876   0.0647 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.65 on 71 degrees of freedom
## Multiple R-squared:  0.9208, Adjusted R-squared:  0.9152 
## F-statistic: 165.1 on 5 and 71 DF,  p-value: < 2.2e-16

I used two functions summary and sumary to display output of the linear model. sumary function is part of faraway package and displays minimal information. The difference is summary function indicates if the variable is contributing to the model with * or . next to p-value.

Output interpretation,

When female participation in the labor force(femlab), marriages(marriage), females giving births(birth) and military service(military) are held constant, for 1% increase in unemployment rate divorce rate decreases by 0.11125.
When unemployment(unemployed) rate, marriages(marriage), females giving births(birth) and military service(military) are held constant, for 1% rate increase in female participation in labor force, divorce rate increases by 0.38365.
When female participation in the labor force(femlab), unemployment(unemployed) rate, females giving births(birth) and military service(military) are held constant, for 1% increase in marriage rate, divorce rate increases by 0.11867.
As the family size increases due to births by 1% divorce rate decreases by 0.12996. When female participation in the labor force(femlab), unemployment(unemployed) rate, marriages(marriage) and military service(military) are held constant.
Per 1000 females observed, as military service rate increases by 1% divorce rate decreases by 0.02673. When female participation in the labor force(femlab), unemployment(unemployed) rate, females giving births(birth) and marriages(marriage) are held constant.
p-value show variables femlab, marriage and birth are more significant in estimating divorce rate compared to unemployed and military.
\(R^2\) value is 0.9208, it means model explains 92% of the observed variation in divorce rate.
The value Residual standard error(\(S\)) = \(1.65\) is the estimated standard deviation of the regression errors. Roughly, it is the average absolute size of a residual.

(b) Variance inflation factors (VIFs), explains estimated coefficients are inflated due to collinearity between variables.

#ggpairs(divusa.df)

my_fn <- function(data, mapping){
  p <- ggplot(data = data, mapping = mapping) + 
    geom_point() + 
    geom_smooth(method=lm, color="blue")
  p
}

g = ggpairs(divusa.df, lower = list(continuous = my_fn))
g

vif(divusa.lm)

## unemployed     femlab   marriage      birth   military 
##   2.252888   3.613276   2.864864   2.585485   1.249596

VIF value of femlab is high compared to other variables. It means standard error for female participation in the labor force is 1.9008619 times larger than it would have been without collinearity.
Female participation in the labor force(femlab) and divorce are strongly correlated(\(r = 0.91\)).
Correlations between variables birth and marriage is also fairly strong(\(r = 0.67\)). It means we can choose to remove either predictor from the model. I lean towards removing marriage because, when children are in the picture, there is a slightly higher chance couple will stay married. On the other hand, having children does not mean the couple is married. If one is not married, divorce cannot happen. This leaves us to experiment with the removal of one variable at a time and check for the model that explains data better.

(c) Model without marriage variable.

#Load data
divusa.df1 <- divusa %>% select(divorce,unemployed,femlab,birth,military)

#Linear model
divusa.lm1 <- lm(divorce ~ ., data = divusa.df1)

summary(divusa.lm1)

## 
## Call:
## lm(formula = divorce ~ ., data = divusa.df1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6312 -0.9141 -0.3110  0.5808  7.8938 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.82713    2.82584   4.893 5.87e-06 ***
## unemployed  -0.21265    0.05949  -3.575 0.000631 ***
## femlab       0.29859    0.02876  10.382 5.91e-16 ***
## birth       -0.11698    0.01761  -6.641 5.04e-09 ***
## military    -0.01249    0.01598  -0.782 0.437048    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.892 on 72 degrees of freedom
## Multiple R-squared:  0.8945, Adjusted R-squared:  0.8886 
## F-statistic: 152.6 on 4 and 72 DF,  p-value: < 2.2e-16

round(cor(divusa.df1),2)

##            divorce unemployed femlab birth military
## divorce       1.00      -0.21   0.91 -0.72     0.02
## unemployed   -0.21       1.00  -0.26 -0.31    -0.40
## femlab        0.91      -0.26   1.00 -0.60     0.05
## birth        -0.72      -0.31  -0.60  1.00     0.14
## military      0.02      -0.40   0.05  0.14     1.00

vif(divusa.lm1)

## unemployed     femlab      birth   military 
##   1.939450   2.430627   2.509711   1.196751

In the absence of marriage variable in the dataset, military variable is no longer significant due to high p-value.
There is a significant increase in intercept(\(\beta_0\)).
VIF value of femlab has decreased to \(2.430627\). It means standard error for female participation in the labor force is reduced to 1.5590468.
\(R^2\) value has decreased to 0.8945.
Residual standard error(\(S\)) also increased to 1.892. This means range increases for predicted interval and confidence interval.

Model without marriage variable.

#Load data
divusa.df2 <- divusa %>% select(divorce,unemployed,femlab,marriage,military)

#Linear model
divusa.lm2 <- lm(divorce ~ ., data = divusa.df2)

summary(divusa.lm2)

## 
## Call:
## lm(formula = divorce ~ ., data = divusa.df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5080 -1.8613 -0.2785  1.5483  5.3526 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -13.17511    3.94663  -3.338  0.00134 ** 
## unemployed    0.10809    0.06892   1.568  0.12115    
## femlab        0.51252    0.03686  13.906  < 2e-16 ***
## marriage      0.08384    0.03359   2.496  0.01486 *  
## military     -0.01805    0.01985  -0.909  0.36612    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.305 on 72 degrees of freedom
## Multiple R-squared:  0.8434, Adjusted R-squared:  0.8347 
## F-statistic: 96.92 on 4 and 72 DF,  p-value: < 2.2e-16

round(cor(divusa.df2),2)

##            divorce unemployed femlab marriage military
## divorce       1.00      -0.21   0.91    -0.53     0.02
## unemployed   -0.21       1.00  -0.26    -0.27    -0.40
## femlab        0.91      -0.26   1.00    -0.65     0.05
## marriage     -0.53      -0.27  -0.65     1.00     0.26
## military      0.02      -0.40   0.05     0.26     1.00

vif(divusa.lm2)

## unemployed     femlab   marriage   military 
##   1.753836   2.689578   2.780902   1.242909

In the absence of birth variable in the dataset, unemployed and military variables are no longer significant due to high p-value.
There is a significant decrease in intercept(\(\beta_0\)). When all the variables are zero, divorce has negative value, which does not make sense.
VIF value of femlab has decreased to \(2.689578\). It means standard error for female participation in the labor force is reduced to 1.6399933.
\(R^2\) value has decreased to 0.8434.
Residual standard error(\(S\)) also increased to 2.305. This means range increases for predicted interval and confidence interval.

Conclusion: Removal of variables has not improved the model. In fact, it has an adverse effect on parameters \(R^2\), Residual standard error(\(S\)) and intercept(\(\beta_0\)).

Data621-Week06-Discussion

Pavan Akula

March 8, 2018