Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
library(tidyverse)
I’ll admit, I was just looking for a relatively simple dataset that comes with R. Combing through the options in data(), I found \(swiss\). According to the documentation at https://stat.ethz.ch/R-manual/R-patched/library/datasets/html/swiss.html, the dataset contains a standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland around 1888.
data("swiss")
sw <- swiss
dim(sw)
## [1] 47 6
summary(sw)
## Fertility Agriculture Examination Education
## Min. :35.00 Min. : 1.20 Min. : 3.00 Min. : 1.00
## 1st Qu.:64.70 1st Qu.:35.90 1st Qu.:12.00 1st Qu.: 6.00
## Median :70.40 Median :54.10 Median :16.00 Median : 8.00
## Mean :70.14 Mean :50.66 Mean :16.49 Mean :10.98
## 3rd Qu.:78.45 3rd Qu.:67.65 3rd Qu.:22.00 3rd Qu.:12.00
## Max. :92.50 Max. :89.70 Max. :37.00 Max. :53.00
## Catholic Infant.Mortality
## Min. : 2.150 Min. :10.80
## 1st Qu.: 5.195 1st Qu.:18.15
## Median : 15.140 Median :20.00
## Mean : 41.144 Mean :19.94
## 3rd Qu.: 93.125 3rd Qu.:21.70
## Max. :100.000 Max. :26.60
The dimensions output confirms 47 observations and 6 variables. The summary output confirms there is no missing data and that all values fall between 0 and 100, as expected since each variable is a percentage (or, in the case of Fertility, a standardized measure on a 0–100 scale).
The following scatterplots show the relationship of each independent variable with the dependent variable (Fertility).
sw %>%
  ggplot(aes(x = Agriculture, y = Fertility)) +
  geom_point() +
  labs(title = 'Agriculture vs Fertility') +
  geom_smooth(method = 'lm', formula = y ~ x)

sw %>%
  ggplot(aes(x = Examination, y = Fertility)) +
  geom_point() +
  labs(title = 'Examination vs Fertility') +
  geom_smooth(method = 'lm', formula = y ~ x)

sw %>%
  ggplot(aes(x = Education, y = Fertility)) +
  geom_point() +
  labs(title = 'Education vs Fertility') +
  geom_smooth(method = 'lm', formula = y ~ x)

sw %>%
  ggplot(aes(x = Infant.Mortality, y = Fertility)) +
  geom_point() +
  labs(title = 'Infant.Mortality vs Fertility') +
  geom_smooth(method = 'lm', formula = y ~ x)

sw %>%
  ggplot(aes(x = Catholic, y = Fertility)) +
  geom_point() +
  labs(title = 'Catholic vs Fertility') +
  geom_smooth(method = 'lm', formula = y ~ x)
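As an aside, the five charts above could also be produced with a single faceted plot; a minimal sketch, assuming the tidyverse approach of pivoting the predictors to long form:

# A more compact alternative (a sketch): pivot the predictors to long form
# and draw all five panels with one faceted ggplot call.
sw %>%
  pivot_longer(cols = -Fertility, names_to = "predictor", values_to = "value") %>%
  ggplot(aes(x = value, y = Fertility)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ x) +
  facet_wrap(~ predictor, scales = "free_x")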
Agriculture shows a slight positive linear relationship with Fertility.
Examination shows a clear negative relationship with Fertility.
Education shows a negative relationship with Fertility, but the fitted slope appears to be driven by one extreme outlier and four other outlying points.
Infant.Mortality shows a positive relationship with Fertility, though judging from the plot I would argue it is weak: the points are fairly evenly scattered rather than following a clear linear trend.
Catholic shows a slight positive relationship with Fertility. The plot also shows that all but about three provinces are clearly either majority Catholic or majority Protestant, which is the motivation for the dichotomous variable created below.
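To back up that observation, a quick tally can count how many provinces sit in the middle of the Catholic range (a sketch; the 25–75% band is an arbitrary choice on my part):

# Illustrative check: how many provinces are neither strongly Catholic
# nor strongly Protestant? (25-75% is an arbitrary "mixed" band.)
sum(sw$Catholic > 25 & sw$Catholic < 75)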
# Create dichotomous variable: flag provinces with a Catholic majority (>= 50%)
sw$isCatholic <- ifelse(sw$Catholic >= 50.0, "Y", "N")
# Numeric 0/1 version, used later to build the interaction term
sw$isCatholicNum <- ifelse(sw$isCatholic == "Y", 1, 0)
As the scatterplot above shows, most provinces are either highly Catholic or highly Protestant. With that understanding, I created a dichotomous variable \(isCatholic\): each province whose population is 50% or more Catholic is labeled ‘Y’ (majority Catholic), and the remaining provinces are labeled ‘N’ (not majority Catholic).
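Optionally, the flag could be stored as a factor with an explicit level order; a minimal sketch (not required, since lm() and ggplot2 will coerce the character column and default to "N" as the reference level anyway):

# Optional: make the reference level explicit rather than alphabetical.
sw$isCatholic <- factor(sw$isCatholic, levels = c("N", "Y"))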
sw %>%
  ggplot(aes(x = isCatholic, y = Fertility)) +
  geom_boxplot() +
  labs(title = 'Is Catholic vs Fertility')
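The group difference suggested by the boxplot could also be checked formally; a sketch, not required by the prompt:

# Two-sample t-test of Fertility between majority-Catholic and other provinces.
t.test(Fertility ~ isCatholic, data = sw)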
As a baseline, I naively fit a multiple regression model using all of the original variables.
sw_lm <- lm(Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality, data=sw)
summary(sw_lm)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Examination + Education +
## Catholic + Infant.Mortality, data = sw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2743 -5.2617 0.5032 4.1198 15.3213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
## Agriculture -0.17211 0.07030 -2.448 0.01873 *
## Examination -0.25801 0.25388 -1.016 0.31546
## Education -0.87094 0.18303 -4.758 2.43e-05 ***
## Catholic 0.10412 0.03526 2.953 0.00519 **
## Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
## F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
plot(sw_lm)
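By default, plot() on an lm object draws the four diagnostic panels one at a time; a minor presentation tweak (a sketch) puts them on a single page:

# Arrange the four base-R diagnostic plots in a 2 x 2 grid.
par(mfrow = c(2, 2))
plot(sw_lm)
par(mfrow = c(1, 1))  # reset the plotting layout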
The most significant variable appears to be Education, followed by Catholic and Infant.Mortality. The adjusted R-squared is 0.671, which indicates the model explains about 67% of the variability in Fertility after adjusting for the number of predictors: not bad, but not great either. Looking at the Residuals vs Fitted and Normal Q-Q plots, I think a case could be made that a linear model is appropriate here, though it could be improved.
In order to meet the prompt’s requirements, I chose to create a quadratic term by calculating the square of Infant.Mortality.
sw$Infant.Mortality.2 <- sw$Infant.Mortality^2
To create the dichotomous-by-quantitative interaction term, I multiplied Education by isCatholicNum (1 for majority Catholic, 0 otherwise). The dichotomous variable was created earlier in the analysis based on the initial scatterplot of Catholic.
sw$Edu.by.IsCath <- sw$isCatholicNum * sw$Education
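As an aside, R’s formula interface can build both of these terms directly, without creating extra columns. The sketch below is not identical to the model fit next (it also keeps the Education main effect), but it is the more conventional way to specify a dichotomous-by-quantitative interaction:

# Alternative specification: I() builds the quadratic term inline, and
# isCatholic * Education expands to both main effects plus the interaction.
sw_alt_lm <- lm(Fertility ~ isCatholic * Education + I(Infant.Mortality^2), data = sw)
summary(sw_alt_lm)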
Next, I fit a multiple regression model using the interaction, dichotomous, and quadratic terms.
sw_t_lm <- lm(Fertility ~ Edu.by.IsCath + isCatholic + Infant.Mortality.2, data=sw)
summary(sw_t_lm)
##
## Call:
## lm(formula = Fertility ~ Edu.by.IsCath + isCatholic + Infant.Mortality.2,
## data = sw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.7460 -4.8623 -0.2472 4.7304 18.7462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.57981 4.64342 11.108 3.23e-14 ***
## Edu.by.IsCath -1.36998 0.26218 -5.225 4.82e-06 ***
## isCatholicY 21.14329 3.54171 5.970 4.05e-07 ***
## Infant.Mortality.2 0.03755 0.01120 3.354 0.00167 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.522 on 43 degrees of freedom
## Multiple R-squared: 0.5649, Adjusted R-squared: 0.5346
## F-statistic: 18.61 on 3 and 43 DF, p-value: 6.887e-08
plot(sw_t_lm)
The resulting model shows significance for each of the three terms, which is a good sign. Interpreting the coefficients: the intercept (51.58) is the predicted Fertility for a non-majority-Catholic province with an infant mortality of zero (an extrapolation, since no province is anywhere near zero); isCatholicY (21.14) means that, at an Education level of zero, a majority-Catholic province is predicted to have Fertility about 21 points higher than a non-Catholic one, with the gap shrinking as Education rises because of the interaction; Edu.by.IsCath (-1.37) means that, within majority-Catholic provinces, each additional percentage point of Education is associated with roughly a 1.37-point drop in Fertility (in this model Education has no effect for non-Catholic provinces); and Infant.Mortality.2 (0.038) means Fertility rises with the square of infant mortality, so the effect of infant mortality strengthens at higher levels. Overall, the adjusted R-squared is 0.5346, so the model accounts for only about 53% of the variability in Fertility. That is above 50%, but it falls short of the 0.671 achieved by the naive baseline model; Education may not have been the best choice for the interaction given the influence of its outliers. The Residuals vs Fitted plot isn’t horrible, but it isn’t great either, and the Normal Q-Q plot looks reasonable except for one extreme outlier. Since this model does not reach the level of the baseline model, I would judge it not appropriate for this dataset.
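To put that baseline comparison on slightly firmer footing, one option (a sketch; the two models are not nested, so an F-test comparison does not apply) is to compare them on information criteria:

# Compare the baseline and transformed models; lower AIC/BIC indicates the
# better fit after penalizing model complexity.
AIC(sw_lm, sw_t_lm)
BIC(sw_lm, sw_t_lm)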