MULTIVARIATE REGRESSION

Using the “swiss” dataset, build the best multiple regression model you can for the variable Fertility.

# load swiss data set 
mydata <- swiss
View(mydata) 

# perform correlation analysis on swiss data 
cor(mydata)
##                   Fertility Agriculture Examination   Education   Catholic
## Fertility         1.0000000  0.35307918  -0.6458827 -0.66378886  0.4636847
## Agriculture       0.3530792  1.00000000  -0.6865422 -0.63952252  0.4010951
## Examination      -0.6458827 -0.68654221   1.0000000  0.69841530 -0.5727418
## Education        -0.6637889 -0.63952252   0.6984153  1.00000000 -0.1538589
## Catholic          0.4636847  0.40109505  -0.5727418 -0.15385892  1.0000000
## Infant.Mortality  0.4165560 -0.06085861  -0.1140216 -0.09932185  0.1754959
##                  Infant.Mortality
## Fertility              0.41655603
## Agriculture           -0.06085861
## Examination           -0.11402160
## Education             -0.09932185
## Catholic               0.17549591
## Infant.Mortality       1.00000000

After performing correlation analysis on the Swiss dataset, the results indicate that the strongest relationships with the dependent variable Fertility are Education (-0.66) and Examination (-0.65); however, the other variables may still be important in forming an accurate regression.
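As a quick check of that ranking, the absolute correlations with Fertility can be sorted directly in base R (a minimal sketch reusing the mydata object loaded above; output omitted):

# rank the predictors by the absolute value of their correlation with Fertility
fert_cor <- cor(mydata)[, "Fertility"]
sort(abs(fert_cor[names(fert_cor) != "Fertility"]), decreasing = TRUE)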

# build multivariate regression with Fertility as the dependent variable, including the other five variables
MR_model <- lm(Fertility ~ Agriculture + Examination + Education + Catholic + 
                 Infant.Mortality, data=mydata)
summary(MR_model)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Examination + Education + 
##     Catholic + Infant.Mortality, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2743  -5.2617   0.5032   4.1198  15.3213 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
## Agriculture      -0.17211    0.07030  -2.448  0.01873 *  
## Examination      -0.25801    0.25388  -1.016  0.31546    
## Education        -0.87094    0.18303  -4.758 2.43e-05 ***
## Catholic          0.10412    0.03526   2.953  0.00519 ** 
## Infant.Mortality  1.07705    0.38172   2.822  0.00734 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared:  0.7067, Adjusted R-squared:  0.671 
## F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10
# perform residual plot analysis
par(mfrow=c(2,2)) # arrange the four diagnostic plots in a 2x2 grid
plot(MR_model)    # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage

If we are operating at a 95% confidence level, a variable is considered significant when its absolute t-value exceeds roughly 2 and its p-value is below .05. The results from the summary therefore indicate that Agriculture, Education, Catholic, and Infant.Mortality are all statistically significant, each exhibiting an absolute t-value above 2 and a p-value below .05.
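The same check can be done programmatically; a minimal base-R sketch that pulls the p-values out of the fitted model's summary (output omitted):

# extract the coefficient p-values and keep those below the .05 threshold
pvals <- summary(MR_model)$coefficients[, "Pr(>|t|)"]
pvals[pvals < 0.05]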

However, Examination, which showed one of the strongest correlations with Fertility during correlation analysis, is not statistically significant in the model. This is likely a multicollinearity effect: Examination is itself strongly correlated with Education (0.70), so once Education is in the model, Examination adds little independent information. Therefore, I run a second multivariate regression excluding the Examination variable and compare the residual plot analysis.
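One way to quantify that overlap is with variance inflation factors. A sketch assuming the car package is installed (its vif() function works directly on lm objects; values are not reproduced here):

# variance inflation factors; values well above ~5 suggest problematic multicollinearity
library(car)
vif(MR_model)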

# multivariate regression excluding variable Examination 
MR_model2 <- lm(Fertility ~ Agriculture + Education + Catholic + 
                  Infant.Mortality, data=mydata)
summary(MR_model2)
## 
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic + 
##     Infant.Mortality, data = mydata)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6765  -6.0522   0.7514   3.1664  16.1422 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
## Agriculture      -0.15462    0.06819  -2.267  0.02857 *  
## Education        -0.98026    0.14814  -6.617 5.14e-08 ***
## Catholic          0.12467    0.02889   4.315 9.50e-05 ***
## Infant.Mortality  1.07844    0.38187   2.824  0.00722 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6707 
## F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10
# perform residual plot analysis
par(mfrow=c(2,2)) # arrange the four diagnostic plots in a 2x2 grid
plot(MR_model2)   # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage

After excluding the Examination variable, the second model appears to be a better fit: the absolute t-values of Education and Catholic increased, the adjusted R-squared remained essentially unchanged (0.6707 vs. 0.671), and the F-statistic rose from 19.76 to 24.42 with a p-value that stayed extremely low. These results are further supported by the residual plot analysis, as the second model appears to better fit the data.
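The two models can also be compared formally; a minimal base-R sketch using a partial F-test and AIC (the resulting tables are not reproduced here):

# partial F-test: does dropping Examination significantly worsen the fit?
anova(MR_model2, MR_model)
# information criteria; the lower AIC favors the simpler model
AIC(MR_model, MR_model2)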

LOGISTIC REGRESSION

Then build a logistic regression model for predicting Fertility>70.0. Post your solutions, interpretation, and code for your peers to see.

# load library ggplot2
library(ggplot2)

# create binary variable for Fertility>70 
binary_fertility <- ifelse(mydata$Fertility > 70, 1, 0)
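Before fitting, it is worth checking how the binary outcome splits; a quick sketch (counts omitted):

# inspect the class balance of the new binary outcome
table(binary_fertility)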

# build a logistic regression model for predicting Fertility>70
logistic <- glm(binary_fertility ~ ., family = "binomial", data = mydata)
summary(logistic) # print results 
## 
## Call:
## glm(formula = binary_fertility ~ ., family = "binomial", data = mydata)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)
## (Intercept)      -1.191e+03  6.338e+05  -0.002    0.999
## Fertility         1.836e+01  1.206e+04   0.002    0.999
## Agriculture       5.366e-01  2.437e+03   0.000    1.000
## Examination       3.743e+00  7.690e+03   0.000    1.000
## Education         1.021e+00  2.596e+04   0.000    1.000
## Catholic          6.181e-02  6.071e+02   0.000    1.000
## Infant.Mortality -9.375e+00  2.644e+04   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 6.5135e+01  on 46  degrees of freedom
## Residual deviance: 7.8098e-09  on 40  degrees of freedom
## AIC: 14
## 
## Number of Fisher Scoring iterations: 25
# plot predicted probabilities and the original data points
ggplot(mydata, aes(x = Fertility, y = binary_fertility)) +
  geom_point() +
  stat_smooth(method = "glm", color = "green", se = FALSE,
              method.args = list(family = binomial)) +
  labs(title = "Logistic Regression for Fertility > 70",
       x = "Fertility",
       y = "Probability of Fertility > 70")
## `geom_smooth()` using formula = 'y ~ x'

The Akaike information criterion (AIC) estimates relative prediction error and therefore helps determine the relative quality of statistical models for a given set of data; lower AIC values indicate a better-fitting model. The AIC of 14 here should be read with caution, though: the residual deviance is essentially zero, the standard errors are enormous, and every p-value is near 1, which is the classic signature of complete separation. That is expected, because the formula binary_fertility ~ . includes Fertility itself as a predictor while the outcome is defined directly from Fertility, so the model can classify every observation perfectly.
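A more informative model would drop Fertility from the right-hand side so the remaining predictors have to do the work; a minimal sketch (its coefficients and AIC are not reproduced here, so treat the comparison as illustrative):

# refit without Fertility, since the outcome is defined directly from it
logistic2 <- glm(binary_fertility ~ . - Fertility,
                 family = "binomial", data = mydata)
summary(logistic2) # compare this AIC against the separated model's AIC of 14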