1. Multivariate regression: Using the ‘swiss’ dataset build the best multiple regression model you can for the variable fertility.
mydata <- swiss
cor(mydata)
##                   Fertility Agriculture Examination   Education   Catholic
## Fertility         1.0000000  0.35307918  -0.6458827 -0.66378886  0.4636847
## Agriculture       0.3530792  1.00000000  -0.6865422 -0.63952252  0.4010951
## Examination      -0.6458827 -0.68654221   1.0000000  0.69841530 -0.5727418
## Education        -0.6637889 -0.63952252   0.6984153  1.00000000 -0.1538589
## Catholic          0.4636847  0.40109505  -0.5727418 -0.15385892  1.0000000
## Infant.Mortality  0.4165560 -0.06085861  -0.1140216 -0.09932185  0.1754959
##                  Infant.Mortality
## Fertility              0.41655603
## Agriculture           -0.06085861
## Examination           -0.11402160
## Education             -0.09932185
## Catholic               0.17549591
## Infant.Mortality       1.00000000
multi_regression_model <- lm(Fertility ~ ., data = mydata)
summary(multi_regression_model)
## 
## Call:
## lm(formula = Fertility ~ ., data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2743  -5.2617   0.5032   4.1198  15.3213 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
## Agriculture      -0.17211    0.07030  -2.448  0.01873 *  
## Examination      -0.25801    0.25388  -1.016  0.31546    
## Education        -0.87094    0.18303  -4.758 2.43e-05 ***
## Catholic          0.10412    0.03526   2.953  0.00519 ** 
## Infant.Mortality  1.07705    0.38172   2.822  0.00734 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared:  0.7067, Adjusted R-squared:  0.671 
## F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10
par(mfrow = c(2,2))
plot(multi_regression_model)

Looking at the p-values in this summary, the coefficient for Examination is the only one that is not statistically significant at the 0.05 level (p = 0.315), so there is little evidence that Examination is related to Fertility once the other predictors are in the model. This may partly reflect how strongly Examination overlaps with Education (r = 0.70) and Agriculture (r = -0.69) in the correlation matrix above. As such, I will fit another multiple regression that does not include the Examination variable.
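
A minimal way to check that overlap directly is to look at variance inflation factors; this sketch assumes the car package is installed.

library(car)
# VIFs for the full model: values well above ~5 would indicate that the
# Examination coefficient is hard to estimate because it is largely
# predictable from the other regressors.
vif(multi_regression_model)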

multi_regression_model_two <- lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality, data = mydata)
summary(multi_regression_model_two)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6765  -6.0522   0.7514   3.1664  16.1422 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
## Agriculture      -0.15462    0.06819  -2.267  0.02857 *  
## Education        -0.98026    0.14814  -6.617 5.14e-08 ***
## Catholic          0.12467    0.02889   4.315 9.50e-05 ***
## Infant.Mortality  1.07844    0.38187   2.824  0.00722 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6707 
## F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10
par(mfrow = c(2,2))
plot(multi_regression_model_two)

After removing Examination, the t-statistics for Education, Catholic, and Infant.Mortality grew in magnitude (Agriculture's shrank slightly). Larger absolute t-values correspond to smaller p-values, i.e., stronger evidence against the null hypothesis that a coefficient is zero. The overall F-statistic is also higher (24.42 vs. 19.76) with a smaller p-value (1.7e-10 vs. 5.6e-10), and the adjusted R-squared is essentially unchanged (0.6707 vs. 0.671), so the reduced model explains Fertility about as well with one fewer predictor.
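
To make that comparison explicit, a partial F-test of the nested models and a side-by-side AIC could be run on the two fits (a sketch, not run here):

# Does adding Examination back significantly improve the fit?
anova(multi_regression_model_two, multi_regression_model)
# AIC comparison of the two linear models (lower is better)
AIC(multi_regression_model, multi_regression_model_two)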

2. Logistic Regression: Then build a logistic regression model for predicting Fertility > 70.0. Post your solutions, interpretation, and code for your peers to see.
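# Binary response: 1 when a region's Fertility exceeds 70, 0 otherwise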
high_fertility <- ifelse(swiss$Fertility > 70, 1, 0)
log_regression <- glm(high_fertility ~ Agriculture + Examination  + Education + Catholic + Infant.Mortality, data = mydata, family = binomial)
summary(log_regression)
## 
## Call:
## glm(formula = high_fertility ~ Agriculture + Examination + Education + 
##     Catholic + Infant.Mortality, family = binomial, data = mydata)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)  
## (Intercept)       4.82826    5.25607   0.919   0.3583  
## Agriculture      -0.09615    0.04011  -2.397   0.0165 *
## Examination      -0.32116    0.13844  -2.320   0.0203 *
## Education        -0.12078    0.08610  -1.403   0.1607  
## Catholic          0.02078    0.01376   1.509   0.1312  
## Infant.Mortality  0.29078    0.21051   1.381   0.1672  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 65.135  on 46  degrees of freedom
## Residual deviance: 32.887  on 41  degrees of freedom
## AIC: 44.887
## 
## Number of Fisher Scoring iterations: 6
library(ggplot2)
predicted_prob <- predict(log_regression, type = "response")
ggplot(mydata, aes(x = Agriculture, y = predicted_prob)) + geom_point() + 
  stat_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE, color = "purple") +
  labs(title = 'Logistic Regression for Fertility over 70', x = '% of Agriculture Workers (per region)',
       y = 'Probability of Fertility being over 70')
## `geom_smooth()` using formula = 'y ~ x'
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!

Looking at this summary of the logistic regression, only Agriculture (p = 0.017) and Examination (p = 0.020) are statistically significant at the 0.05 level. As such, I will remove the other variables from the model to see if there is any improvement.

log_regression_two <- glm(high_fertility ~ Agriculture + Examination, data = mydata, family = binomial)
summary(log_regression_two)
## 
## Call:
## glm(formula = high_fertility ~ Agriculture + Examination, family = binomial, 
##     data = mydata)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  9.69552    3.30710   2.932 0.003371 ** 
## Agriculture -0.06706    0.03098  -2.165 0.030420 *  
## Examination -0.37609    0.11286  -3.332 0.000861 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 65.135  on 46  degrees of freedom
## Residual deviance: 40.484  on 44  degrees of freedom
## AIC: 46.484
## 
## Number of Fisher Scoring iterations: 5
predicted_prob_two <- predict(log_regression_two, type = "response")
ggplot(mydata, aes(x = Agriculture, y = predicted_prob_two)) + geom_point() + 
  stat_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE, color = "purple") +
  labs(title = 'Logistic Regression for Fertility over 70', x = '% of Agriculture Workers (per region)',
       y = 'Probability of Fertility being over 70')
## `geom_smooth()` using formula = 'y ~ x'
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
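
It is worth noting that the AIC reported above is actually slightly lower for the full logistic model (44.89 vs. 46.48). A likelihood-ratio test on the nested fits would make the comparison explicit (a sketch, not run here):

# Does the full model fit significantly better than Agriculture + Examination?
anova(log_regression_two, log_regression, test = "Chisq")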

Looking at this summary of the reduced logistic regression, Agriculture has a modest negative relationship with the odds of Fertility exceeding 70: its coefficient on the log-odds scale is -0.067, it is statistically significant (p = 0.030), and its standard error is small (0.031), so the estimate is fairly precise. Still, the upward slope of the stat_smooth line is concerning to me given the negative coefficient on Agriculture, so if anyone has any tips on how to make sense of this discrepancy please let me know.
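
One thought I had: stat_smooth() refits its own logistic curve of the predicted probabilities against Agriculture alone (which is also what triggers the non-integer #successes warning), so its slope follows Agriculture's positive marginal correlation with Fertility (0.35 in the correlation matrix) rather than the model's negative conditional coefficient. A sketch of plotting the fitted model's own curve instead, with Examination held at its mean (untested here):

# Grid of Agriculture values with Examination fixed at its mean, so the curve
# reflects log_regression_two's (negative) Agriculture coefficient.
agri_grid <- data.frame(
  Agriculture = seq(min(mydata$Agriculture), max(mydata$Agriculture), length.out = 100),
  Examination = mean(mydata$Examination)
)
agri_grid$prob <- predict(log_regression_two, newdata = agri_grid, type = "response")

ggplot(mydata, aes(x = Agriculture, y = predicted_prob_two)) + geom_point() +
  geom_line(data = agri_grid, aes(x = Agriculture, y = prob), color = "purple") +
  labs(title = 'Fitted Probability of Fertility over 70 (Examination at its mean)',
       x = '% of Agriculture Workers (per region)',
       y = 'Probability of Fertility being over 70')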