Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
This is a simple dataset that has a standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland in the 1888. See https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/swiss.html. We will use Fertility as our predicted variable in this discussion.
Fertility: Common standardized fertility measure
Agriculture: Percent of males involved in agriculture as occupation
Examination: Percent of draftees receiving highest mark on army examination
Education: Percent education beyond primary school for draftees.
Catholic: Percent Catholic (as opposed to Protestant)
Infant.Mortality: Percent of live births who live less than 1 year
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.6.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3 v purrr 0.3.3
## v tibble 3.1.0 v dplyr 1.0.5
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## Warning: package 'ggplot2' was built under R version 3.6.3
## Warning: package 'tibble' was built under R version 3.6.3
## Warning: package 'tidyr' was built under R version 3.6.3
## Warning: package 'readr' was built under R version 3.6.2
## Warning: package 'purrr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.3
## Warning: package 'stringr' was built under R version 3.6.2
## Warning: package 'forcats' was built under R version 3.6.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.6.2
## corrplot 0.84 loaded
data("swiss")
df <- swiss
str(df)
## 'data.frame': 47 obs. of 6 variables:
## $ Fertility : num 80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
## $ Agriculture : num 17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
## $ Examination : int 15 6 5 12 17 9 16 14 12 16 ...
## $ Education : int 12 9 5 7 15 7 7 8 7 13 ...
## $ Catholic : num 9.96 84.84 93.4 33.77 5.16 ...
## $ Infant.Mortality: num 22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
head(df)
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
corrplot::corrplot(cor(df), method = 'number')
Education and Examination seem to play big roles in negatively affecting fertility. Not surprisingly, percent Catholic has a mild but positive influence on fertility.
Let’s build a base model using all variables to predict fertility.
mod1 <- lm(Fertility ~ ., data = df)
summary(mod1)
##
## Call:
## lm(formula = Fertility ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2743 -5.2617 0.5032 4.1198 15.3213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
## Agriculture -0.17211 0.07030 -2.448 0.01873 *
## Examination -0.25801 0.25388 -1.016 0.31546
## Education -0.87094 0.18303 -4.758 2.43e-05 ***
## Catholic 0.10412 0.03526 2.953 0.00519 **
## Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
## F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
par(mfrow = c(2, 2))
plot(mod1)
Let’s transform some variables in our df in order to meet the discussion requirements.
df2 <- df %>%
dplyr::mutate(CatholicDominant = dplyr::case_when(Catholic > 50 ~ 1, TRUE ~ 0),
Education2 = Education^2,
CatholicDominantInf.Mor = CatholicDominant * Infant.Mortality)
mod2 <- lm(Fertility ~ as.factor(CatholicDominant) + Education2 + CatholicDominantInf.Mor, data = df2)
summary(mod2)
##
## Call:
## lm(formula = Fertility ~ as.factor(CatholicDominant) + Education2 +
## CatholicDominantInf.Mor, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.999 -5.646 1.276 4.472 18.998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 70.247346 1.781919 39.422 < 2e-16 ***
## as.factor(CatholicDominant)1 -17.477310 13.909260 -1.257 0.2157
## Education2 -0.015853 0.002867 -5.530 1.76e-06 ***
## CatholicDominantInf.Mor 1.262223 0.665306 1.897 0.0645 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.758 on 43 degrees of freedom
## Multiple R-squared: 0.5405, Adjusted R-squared: 0.5084
## F-statistic: 16.86 on 3 and 43 DF, p-value: 2.187e-07
par(mfrow = c(2, 2))
plot(mod2)
The transformation performed much worse than the base model, i.e. adjusted R-squared .67 vs. .51. In addition, it violates the assumption of constant variance in the residuals and therefore it is not suitable for the linear regression.