Using the “swiss” dataset, build the best multiple regression model you can for the variable Fertility.
# load swiss data set
mydata <- swiss
View(mydata)
# perform correlation analysis on swiss data
cor(mydata)
## Fertility Agriculture Examination Education Catholic
## Fertility 1.0000000 0.35307918 -0.6458827 -0.66378886 0.4636847
## Agriculture 0.3530792 1.00000000 -0.6865422 -0.63952252 0.4010951
## Examination -0.6458827 -0.68654221 1.0000000 0.69841530 -0.5727418
## Education -0.6637889 -0.63952252 0.6984153 1.00000000 -0.1538589
## Catholic 0.4636847 0.40109505 -0.5727418 -0.15385892 1.0000000
## Infant.Mortality 0.4165560 -0.06085861 -0.1140216 -0.09932185 0.1754959
## Infant.Mortality
## Fertility 0.41655603
## Agriculture -0.06085861
## Examination -0.11402160
## Education -0.09932185
## Catholic 0.17549591
## Infant.Mortality 1.00000000
After performing correlation analysis on the Swiss dataset, the results indicate the strongest relationship with the dependent variable Fertility are both Education and Exminination however the other variables may still important in forming an accurate regression.
# build multivariate regression for Fertility as dependant including the other five variables
MR_model <- lm(Fertility ~ Agriculture + Examination + Education + Catholic +
Infant.Mortality, data=mydata)
summary(MR_model)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Examination + Education +
## Catholic + Infant.Mortality, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2743 -5.2617 0.5032 4.1198 15.3213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
## Agriculture -0.17211 0.07030 -2.448 0.01873 *
## Examination -0.25801 0.25388 -1.016 0.31546
## Education -0.87094 0.18303 -4.758 2.43e-05 ***
## Catholic 0.10412 0.03526 2.953 0.00519 **
## Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
## F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
# perform residual plot analysis
par(mfrow=c(2,2))
plot(MR_model)
If we are operating at a 95% Confidence Level, in order for a variable to be considered significant it should have a t-value > 2 and a p-value < .05. Therefore the results from the summary indicate that variables, Agriculture, Education, Catholic, and Infant.Mortality are all statistically significant exhibited by t-values over 2 (or less than -2), and p-value < .05.
However, variable Examination, which, during correlation analysis, was one of the strongest correlation with Fertility is not statistically significant. Therefore, I run a second multivariate regression, excluding the Examination variable, and compare the residual plot analysis.
# multivariate regression excluding variable Examination
MR_model2 <- lm(Fertility ~ Agriculture + Education + Catholic +
Infant.Mortality, data=mydata)
summary(MR_model2)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = mydata)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6765 -6.0522 0.7514 3.1664 16.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
## Agriculture -0.15462 0.06819 -2.267 0.02857 *
## Education -0.98026 0.14814 -6.617 5.14e-08 ***
## Catholic 0.12467 0.02889 4.315 9.50e-05 ***
## Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
# perform residual plot analysis
par(mfrow=c(2,2))
plot(MR_model2)
After excluding variable Examination from the model, this model apears to be a better fit, since the the t-values increased, R^2 sremianed close to the original value, including a high F-Statistic and a low p-value. These results further justified by the residual plot analysis as the second model appears to better fit the data.
Then build a logistic regression model for predicting Fertility>70.0. Post your solutions, interpretation, and code for your peers to see.
# load library ggplot2
library(ggplot2)
# create binary variable for Fertility>70
binary_fertility <- ifelse(mydata$Fertility > 70, 1, 0)
# build a logistic regression model for predicting Fertility>70
logistic <- glm(binary_fertility ~ .,, family = "binomial", data = mydata)
summary(logistic) # print results
##
## Call:
## glm(formula = binary_fertility ~ ., family = "binomial", data = mydata)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.191e+03 6.338e+05 -0.002 0.999
## Fertility 1.836e+01 1.206e+04 0.002 0.999
## Agriculture 5.366e-01 2.437e+03 0.000 1.000
## Examination 3.743e+00 7.690e+03 0.000 1.000
## Education 1.021e+00 2.596e+04 0.000 1.000
## Catholic 6.181e-02 6.071e+02 0.000 1.000
## Infant.Mortality -9.375e+00 2.644e+04 0.000 1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 6.5135e+01 on 46 degrees of freedom
## Residual deviance: 7.8098e-09 on 40 degrees of freedom
## AIC: 14
##
## Number of Fisher Scoring iterations: 25
# Plot Predicted data and original data points
ggplot(mydata, aes(x= Fertility, y= binary_fertility)) + geom_point() +
stat_smooth(method="glm", color="green", se=FALSE,
method.args = list(family=binomial)) +
labs(title = "Logistic Regression for Fertility > 70",
x = "Fertility",
y = "Probability of Fertility > 70")
## `geom_smooth()` using formula = 'y ~ x'
The Akaike information criterion (AIC) helps us estimate the prediction error and therefore helps determine the relative quality of statistical models for a given set of data. Lower AIC values indicate a better-fit model, so given an AIC of 14 suggests that while the model does capture a significant amount of the variance, there is room for improvement.