Using the “swiss” dataset, build the best multiple regression model you can for the variable Fertility. Then build a logistic regression model for predicting Fertility>70.0. Post your solutions / interpretation / code for your peers to see.
library(psych)
describe(swiss$Fertility)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 47 70.14 12.49 70.4 70.66 10.23 35 92.5 57.5 -0.46 0.26
## se
## X1 1.82
Next, we look at the correlations between the variables to decide which predictor is most strongly related to Fertility.
cor(swiss)
## Fertility Agriculture Examination Education Catholic
## Fertility 1.0000000 0.35307918 -0.6458827 -0.66378886 0.4636847
## Agriculture 0.3530792 1.00000000 -0.6865422 -0.63952252 0.4010951
## Examination -0.6458827 -0.68654221 1.0000000 0.69841530 -0.5727418
## Education -0.6637889 -0.63952252 0.6984153 1.00000000 -0.1538589
## Catholic 0.4636847 0.40109505 -0.5727418 -0.15385892 1.0000000
## Infant.Mortality 0.4165560 -0.06085861 -0.1140216 -0.09932185 0.1754959
## Infant.Mortality
## Fertility 0.41655603
## Agriculture -0.06085861
## Examination -0.11402160
## Education -0.09932185
## Catholic 0.17549591
## Infant.Mortality 1.00000000
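The full matrix is a bit hard to scan, so here is a small sketch that pulls out only the Fertility row and sorts the predictors by the absolute size of their correlation. The object names `fert_cor` and `ranked` are just illustrative choices, not part of the assignment.

```r
# Correlation of every variable with Fertility (one row of the matrix)
fert_cor <- cor(swiss)["Fertility", ]

# Drop Fertility's correlation with itself (always 1)
fert_cor <- fert_cor[names(fert_cor) != "Fertility"]

# Sort by absolute value so strong negative correlations rank high too
ranked <- fert_cor[order(abs(fert_cor), decreasing = TRUE)]
ranked
```

Sorting on the absolute value matters here because the two strongest relationships are negative.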
From the correlations, Education has the strongest relationship with Fertility (r = -0.66), with Examination close behind (r = -0.65). Both relationships are negative. Plotting Fertility against Education shows whether the relationship is roughly linear.
plot(swiss$Fertility~swiss$Education, xlab='Education', ylab='Fertility')
abline(lm(swiss$Fertility~swiss$Education))
sample1<-lm(Fertility~Education, data=swiss)
summary(sample1)
##
## Call:
## lm(formula = Fertility ~ Education, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.036 -6.711 -1.011 9.526 19.689
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.6101 2.1041 37.836 < 2e-16 ***
## Education -0.8624 0.1448 -5.954 3.66e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.446 on 45 degrees of freedom
## Multiple R-squared: 0.4406, Adjusted R-squared: 0.4282
## F-statistic: 35.45 on 1 and 45 DF, p-value: 3.659e-07
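Although the Education coefficient is highly significant (p = 3.66e-07), the model explains only about 44% of the variability in Fertility (R-squared = 0.4406). The slope says that a one-unit increase in Education is associated with a drop of about 0.86 in Fertility. The unexplained variability suggests that other variables also matter. Adding predictors such as Examination might raise the R-squared, although Examination and Education are themselves correlated (0.70), so the gain could be smaller than expected.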
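Since the assignment asks for a multiple regression, one possible next step is to compare the Education-only model with larger candidates. The sketch below is only one approach: it assumes backward selection by AIC via `step()` is acceptable for "best", and the names `simple`, `full`, and `reduced` are made up for illustration.

```r
data(swiss)

# Baseline: the single-predictor model fitted above
simple <- lm(Fertility ~ Education, data = swiss)

# Candidate: all five remaining variables as predictors
full <- lm(Fertility ~ ., data = swiss)

# Backward selection by AIC (one of several possible selection strategies)
reduced <- step(full, direction = "backward", trace = 0)

# Adjusted R-squared penalises extra predictors, so it is a fairer comparison
sapply(list(simple = simple, full = full, reduced = reduced),
       function(m) summary(m)$adj.r.squared)

# F-test: do the extra predictors improve on Education alone?
anova(simple, full)
```

Residual plots (for example `plot(reduced)`) would still be worth checking before settling on any of these models.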
For the second part (predicting Fertility > 70), I started by counting how many of the 47 observations exceed that threshold:
greaterthan<-subset(swiss,swiss$Fertility>70)
table(greaterthan$Fertility>70)
##
## TRUE
## 24
I did not know how to continue on with this model… Any suggestions are greatly appreciated.
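One way this could be continued, as a rough sketch rather than a finished answer: instead of subsetting, create a 0/1 indicator for Fertility > 70 on the full dataset and fit it with `glm()` using the binomial family. The names `swiss_bin`, `HighFertility`, and `logit_fit` are hypothetical, Education is used as the only predictor simply because it was the strongest one above, and the 0.5 cutoff is an arbitrary assumption.

```r
data(swiss)

# Keep all 47 rows and add a binary outcome: 1 if Fertility > 70, else 0
swiss_bin <- swiss
swiss_bin$HighFertility <- as.integer(swiss_bin$Fertility > 70)

# Logistic regression of the indicator on Education
logit_fit <- glm(HighFertility ~ Education, data = swiss_bin, family = binomial)
summary(logit_fit)

# Exponentiated coefficients are odds ratios
exp(coef(logit_fit))

# Predicted probabilities, classified with an (assumed) 0.5 cutoff
pred_prob  <- predict(logit_fit, type = "response")
pred_class <- as.integer(pred_prob > 0.5)

# Confusion table: predicted vs. actual class
table(Predicted = pred_class, Actual = swiss_bin$HighFertility)
```

Other predictors can be added to the formula in the same way as in the linear model, but Fertility itself has to stay out of the right-hand side because the outcome is derived from it.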