Using the “swiss” dataset, build the best multiple regression model you can for the variable Fertility.

Let’s start by taking a quick look at the data set.

head(swiss)
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

The data set is a collection of summary statistics for the french speaking parts of Switzerland in the year 1888. Our goal is to use multiple regression to predict the Fertility rate given the other variables.

We will start with the full model, fitting the multiple regression to all 5 of the possible predictors of Fertility in the data set. Once that is complete, we will look at the summary and make note of the adjusted \(R^2\) value to gauge how good of a fit it is.

#Perform Multiple Regression on the full model
full_model <- lm(Fertility ~ ., data = swiss)
summary(full_model)
## 
## Call:
## lm(formula = Fertility ~ ., data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2743  -5.2617   0.5032   4.1198  15.3213 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
## Agriculture      -0.17211    0.07030  -2.448  0.01873 *  
## Examination      -0.25801    0.25388  -1.016  0.31546    
## Education        -0.87094    0.18303  -4.758 2.43e-05 ***
## Catholic          0.10412    0.03526   2.953  0.00519 ** 
## Infant.Mortality  1.07705    0.38172   2.822  0.00734 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared:  0.7067, Adjusted R-squared:  0.671 
## F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10
#Pull R^2 for comparison
rsqadj_full <- summary(full_model)$adj.r.squared
rsqadj_full
## [1] 0.670971

The adjusted \(R^2\) value is just about \(0.671\). While this is an alright fit, we may be able to find a better one by removing unimportant predictors from the model. We will do this by using backwards elimination. This method involves removing a predictor from the model and seeing if the fit improves at all. We can easily do this with a for loop in R:

#Exclude each predictor 1 at a time to see if R-squared adjusted improves in any of the resulting models
for(i in 2:6){
  temp_model <- lm(Fertility ~ ., data = swiss[-i])
  print(summary(temp_model)$adj.r.squared)
}
## [1] 0.6318526
## [1] 0.670714
## [1] 0.5014173
## [1] 0.6104921
## [1] 0.6164364

None of the resulting models seem to produce a better fit than the full one, so we will stick with that. Note: We could also use the P-value as the criterion for how significant a predictor is, but for this example I will stick to \(R^2\)

Now that we have determined which predictors to keep in the model, we have to check that the necessary conditions are met for the multiple regression model to be appropriate. They are very similar to the conditions required by simple linear regression.

1. Near Normal Residuals - Our first task is to make sure that the residuals are normally distributed around zero. This is easy enough to check with the following histogram.

#Check for Near Normal Distribution of Residuals
hist(full_model$residuals, main = "Distribution of Residuals", xlab = "Residual", ylab = "Frequency")

Though not a perfect normal distribution, there are no outliers and its shape is close enough. We can say the first condition is satisfied.

2. Constant Variance - We also need to check that the variance is relatively constant. Below we have a plot of the absolute value of the residuals vs. the fitted values. We see no obvious pattern, the residuals are just as variable for lower fitted values as they are for large ones.

#Check for Near Constant variability
plot(fitted(full_model), abs(full_model$residuals), main = "Magnitude of Residuals vs. Predicted Values",
     xlab = "Predicted Value", ylab = "|Residual|")

3. Linearity - This condition is takes a bit more to verify. We have to show that each predictor variable is linearly related to the outcome for the model to be valid. We can do this by plotting the residuals along each of the predictor variables to see if any pattern emerges.

#Agriculture
plot(swiss$Agriculture, full_model$residuals, main = "Residuals vs. Agriculture",
     xlab = "Agriculture", ylab = "Residuals")
abline(h = 0, col = "red")

#Examination
plot(swiss$Examination, full_model$residuals, main = "Residuals vs. Examination",
     xlab = "Examination", ylab = "Residuals")
abline(h = 0, col = "red")

#Education
plot(swiss$Education, full_model$residuals, main = "Residuals vs. Education",
     xlab = "Education", ylab = "Residuals")
abline(h = 0, col = "red")

#Catholic
plot(swiss$Catholic, full_model$residuals, main = "Residuals vs. Catholic",
     xlab = "Catholic", ylab = "Residuals")
abline(h = 0, col = "red")

#Infant.Mortality
plot(swiss$Infant.Mortality, full_model$residuals, main = "Residuals vs. Infant Mortality",
     xlab = "Infant Mortality", ylab = "Residuals")
abline(h = 0, col = "red")

For all variables, the residuals remain centered around zero and no other pattern is identifiable. We can say that each predictor is linearly related to the outcome.

4. Independent Observations - This condition is more difficult to prove. This data set was gathered by observation not experimentation, so we can not rule out unknown relationships between the various predictors. The model seems to be a good one, but it should come with the caveat that data gathered 2 centuries ago may not be enough to settle a debate.

Then build a logistic regression model for predicting Fertility>70.0. Post your solutions, interpretation, and code for your peers to see.

#create new data frame which we can modify
my_swiss <- swiss

#Create categorical variable "High_Fertility" which is true when Fertility is greater than 70
my_swiss$High_Fertility <- my_swiss$Fertility > 70

#Remove the original fertility variable
my_swiss <- subset(my_swiss, select = -Fertility)

#Perform logistic regression
full_lr_model <- glm(High_Fertility ~ ., data = my_swiss, family = "binomial")
summary(full_lr_model)
## 
## Call:
## glm(formula = High_Fertility ~ ., family = "binomial", data = my_swiss)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)  
## (Intercept)       4.82826    5.25607   0.919   0.3583  
## Agriculture      -0.09615    0.04011  -2.397   0.0165 *
## Examination      -0.32116    0.13844  -2.320   0.0203 *
## Education        -0.12078    0.08610  -1.403   0.1607  
## Catholic          0.02078    0.01376   1.509   0.1312  
## Infant.Mortality  0.29078    0.21051   1.381   0.1672  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 65.135  on 46  degrees of freedom
## Residual deviance: 32.887  on 41  degrees of freedom
## AIC: 44.887
## 
## Number of Fisher Scoring iterations: 6

The summary gives us some interesting results. First looking at the P-values for each predictor’s coefficient, we see that all are less than 20%, with agriculture and examination being the 2 most statistically significant having P values of around 1.5% and 2,0% each. Both of these variables have negative coefficients, suggesting that them having higher values decreases the odds of having high fertility (Fertility > 70.0). The only 2 predictors that seem to positively contribute to the chance of high fertility is catholic and infant mortality.

For this model to be appropriate we must satisfy 2 conditions:

1. Each outcome must be independent of other outcomes - This condition is hard to verify since this was not an experimental study. The regions observed are not independent of each other, but physically linked in space. High fertility in one region could very reasonably affect the fertility of another region through any number of means. We can not comfortably say that this condition is satisfied.

2. Each predictor is linearly related to logit(p) when all other are held constant - This condition is also hard to verify given the relatively small sample size. We would likely need a larger sample size to prove this as well.

In the end, Logistic Regression may not be ideal for this case. The conditions needed are simply not met by the data we are studying.