Discussion 12

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Set up

This is a simple dataset that has a standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland in the 1888. See https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/swiss.html. We will use Fertility as our predicted variable in this discussion.

Fertility: Common standardized fertility measure

Agriculture: Percent of males involved in agriculture as occupation

Examination: Percent of draftees receiving highest mark on army examination

Education: Percent education beyond primary school for draftees.

Catholic: Percent Catholic (as opposed to Protestant)

Infant.Mortality: Percent of live births who live less than 1 year

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.3
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.6.3
## Warning: package 'tibble' was built under R version 3.6.3
## Warning: package 'tidyr' was built under R version 3.6.3
## Warning: package 'readr' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.3
## Warning: package 'stringr' was built under R version 3.6.2
## Warning: package 'forcats' was built under R version 3.6.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.6.2
## corrplot 0.84 loaded
data("swiss")
df <- swiss
str(df)
## 'data.frame':    47 obs. of  6 variables:
##  $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
##  $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
##  $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
##  $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
##  $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
##  $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
head(df)
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6
corrplot::corrplot(cor(df), method = 'number')

Education and Examination seem to play big roles in negatively affecting fertility. Not surprisingly, percent Catholic has a mild but positive influence on fertility.

Let’s build a base model using all variables to predict fertility.

mod1 <- lm(Fertility ~ ., data = df)
summary(mod1)
## 
## Call:
## lm(formula = Fertility ~ ., data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2743  -5.2617   0.5032   4.1198  15.3213 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
## Agriculture      -0.17211    0.07030  -2.448  0.01873 *  
## Examination      -0.25801    0.25388  -1.016  0.31546    
## Education        -0.87094    0.18303  -4.758 2.43e-05 ***
## Catholic          0.10412    0.03526   2.953  0.00519 ** 
## Infant.Mortality  1.07705    0.38172   2.822  0.00734 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared:  0.7067, Adjusted R-squared:  0.671 
## F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10
par(mfrow = c(2, 2))
plot(mod1)

Let’s transform some variables in our df in order to meet the discussion requirements.

df2 <- df %>%
    dplyr::mutate(CatholicDominant = dplyr::case_when(Catholic > 50 ~ 1, TRUE ~ 0),
                  Education2 = Education^2,
                  CatholicDominantInf.Mor = CatholicDominant * Infant.Mortality)
mod2 <- lm(Fertility ~ as.factor(CatholicDominant) + Education2 + CatholicDominantInf.Mor, data = df2)
summary(mod2)
## 
## Call:
## lm(formula = Fertility ~ as.factor(CatholicDominant) + Education2 + 
##     CatholicDominantInf.Mor, data = df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.999  -5.646   1.276   4.472  18.998 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   70.247346   1.781919  39.422  < 2e-16 ***
## as.factor(CatholicDominant)1 -17.477310  13.909260  -1.257   0.2157    
## Education2                    -0.015853   0.002867  -5.530 1.76e-06 ***
## CatholicDominantInf.Mor        1.262223   0.665306   1.897   0.0645 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.758 on 43 degrees of freedom
## Multiple R-squared:  0.5405, Adjusted R-squared:  0.5084 
## F-statistic: 16.86 on 3 and 43 DF,  p-value: 2.187e-07
par(mfrow = c(2, 2))
plot(mod2)

The transformation performed much worse than the base model, i.e. adjusted R-squared .67 vs. .51. In addition, it violates the assumption of constant variance in the residuals and therefore it is not suitable for the linear regression.