swiss
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Broye 83.8 70.2 16 7 92.85
## Glane 92.4 67.8 14 8 97.16
## Gruyere 82.4 53.3 12 7 97.67
## Sarine 82.9 45.2 16 13 91.38
## Veveyse 87.1 64.5 14 6 98.61
## Aigle 64.1 62.0 21 12 8.52
## Aubonne 66.9 67.5 14 7 2.27
## Avenches 68.9 60.7 19 12 4.43
## Cossonay 61.7 69.3 22 5 2.82
## Echallens 68.3 72.6 18 2 24.20
## Grandson 71.7 34.0 17 8 3.30
## Lausanne 55.7 19.4 26 28 12.11
## La Vallee 54.3 15.2 31 20 2.15
## Lavaux 65.1 73.0 19 9 2.84
## Morges 65.5 59.8 22 10 5.23
## Moudon 65.0 55.1 14 3 4.52
## Nyone 56.6 50.9 22 12 15.14
## Orbe 57.4 54.1 20 6 4.20
## Oron 72.5 71.2 12 1 2.40
## Payerne 74.2 58.1 14 8 5.23
## Paysd'enhaut 72.0 63.5 6 3 2.56
## Rolle 60.5 60.8 16 10 7.72
## Vevey 58.3 26.8 25 19 18.46
## Yverdon 65.4 49.5 15 8 6.10
## Conthey 75.5 85.9 3 2 99.71
## Entremont 69.3 84.9 7 6 99.68
## Herens 77.3 89.7 5 2 100.00
## Martigwy 70.5 78.2 12 6 98.96
## Monthey 79.4 64.9 7 3 98.22
## St Maurice 65.0 75.9 9 9 99.06
## Sierre 92.2 84.6 3 3 99.46
## Sion 79.3 63.1 13 13 96.83
## Boudry 70.4 38.4 26 12 5.62
## La Chauxdfnd 65.7 7.7 29 11 13.79
## Le Locle 72.7 16.7 22 13 11.22
## Neuchatel 64.4 17.6 35 32 16.92
## Val de Ruz 77.6 37.6 15 7 4.97
## ValdeTravers 67.6 18.7 25 7 8.65
## V. De Geneve 35.0 1.2 37 53 42.34
## Rive Droite 44.7 46.6 16 29 50.43
## Rive Gauche 42.8 27.7 22 29 58.33
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
## Broye 23.6
## Glane 24.9
## Gruyere 21.0
## Sarine 24.4
## Veveyse 24.5
## Aigle 16.5
## Aubonne 19.1
## Avenches 22.7
## Cossonay 18.7
## Echallens 21.2
## Grandson 20.0
## Lausanne 20.2
## La Vallee 10.8
## Lavaux 20.0
## Morges 18.0
## Moudon 22.4
## Nyone 16.7
## Orbe 15.3
## Oron 21.0
## Payerne 23.8
## Paysd'enhaut 18.0
## Rolle 16.3
## Vevey 20.9
## Yverdon 22.5
## Conthey 15.1
## Entremont 19.8
## Herens 18.3
## Martigwy 19.4
## Monthey 20.2
## St Maurice 17.8
## Sierre 16.3
## Sion 18.1
## Boudry 20.3
## La Chauxdfnd 20.5
## Le Locle 18.9
## Neuchatel 23.0
## Val de Ruz 20.0
## ValdeTravers 19.5
## V. De Geneve 18.0
## Rive Droite 18.2
## Rive Gauche 19.3
?swiss
hist(swiss$Fertility, main = "Distribution of Fertility", xlab = "Fertility")
The dependent variable “Fertility” looks close enough to normal to fit a linear regression without having to transform the variable.
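As a quick double-check of that impression, we can also look at a normal Q-Q plot (just a sketch using base R; the histogram above is the main evidence):
qqnorm(swiss$Fertility, main = "Normal Q-Q Plot of Fertility")
qqline(swiss$Fertility) # points hugging the line suggest approximate normality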
cor(swiss)
## Fertility Agriculture Examination Education Catholic
## Fertility 1.0000000 0.35307918 -0.6458827 -0.66378886 0.4636847
## Agriculture 0.3530792 1.00000000 -0.6865422 -0.63952252 0.4010951
## Examination -0.6458827 -0.68654221 1.0000000 0.69841530 -0.5727418
## Education -0.6637889 -0.63952252 0.6984153 1.00000000 -0.1538589
## Catholic 0.4636847 0.40109505 -0.5727418 -0.15385892 1.0000000
## Infant.Mortality 0.4165560 -0.06085861 -0.1140216 -0.09932185 0.1754959
## Infant.Mortality
## Fertility 0.41655603
## Agriculture -0.06085861
## Examination -0.11402160
## Education -0.09932185
## Catholic 0.17549591
## Infant.Mortality 1.00000000
This correlation matrix shows the strength and direction of the linear relationship between each pair of variables in the dataset. Because “Fertility” is our dependent variable, we focus on the correlations between “Fertility” and the 5 other variables.
While “Examination” and “Education” have the strongest correlations with “Fertility”, the other variables could still be important in forming an accurate regression.
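To isolate the column we care about, the Fertility correlations can be pulled straight out of the matrix (a small convenience sketch):
round(cor(swiss)[-1, "Fertility"], 2) # each predictor's correlation with Fertility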
First, I will fit a regression using all of the available variables in the dataset.
To test whether the predictors are jointly significant, we can set up a hypothesis test…
Ho: b1 = b2 = b3 = b4 = b5 = 0
Ha: at least one bi =/= 0
mylm1 <- lm(Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality, data = swiss)
summary(mylm1)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Examination + Education +
## Catholic + Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2743 -5.2617 0.5032 4.1198 15.3213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
## Agriculture -0.17211 0.07030 -2.448 0.01873 *
## Examination -0.25801 0.25388 -1.016 0.31546
## Education -0.87094 0.18303 -4.758 2.43e-05 ***
## Catholic 0.10412 0.03526 2.953 0.00519 **
## Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
## F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
If we are operating at a 95% Confidence Level, a variable is considered significant when its p-value is below .05, which for these degrees of freedom corresponds roughly to |t| > 2.
In this first model we see that “Agriculture”, “Education”, “Catholic”, and “Infant.Mortality” all clear that bar, while “Examination”, which had one of the strongest correlations with “Fertility”, does not (p = 0.32).
These results make me think that the association between “Fertility” and “Examination” is largely accounted for by the other predictors: the correlation matrix shows that “Examination” is itself strongly correlated with “Education” (about 0.70) and “Agriculture” (about -0.69), so it adds little once those variables are in the model.
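One way to check that suspicion is to look at multicollinearity directly, for example with variance inflation factors (a sketch that assumes the car package is installed):
# install.packages("car") # only needed once
library(car)
vif(mylm1) # a large value for Examination would mean it overlaps heavily with the other predictors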
mylm2 <- lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality, data = swiss)
summary(mylm2)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6765 -6.0522 0.7514 3.1664 16.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
## Agriculture -0.15462 0.06819 -2.267 0.02857 *
## Education -0.98026 0.14814 -6.617 5.14e-08 ***
## Catholic 0.12467 0.02889 4.315 9.50e-05 ***
## Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
After taking “Examination” out of the model, the t-values of the remaining variables increased (their coefficients are estimated more precisely without the collinear predictor) and the adjusted R^2 stayed virtually the same (0.671 in both models). With a higher F-statistic and a lower overall p-value, this simpler model seems a little better.
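Because mylm2 is nested inside mylm1, we can also compare them formally with a partial F-test (a quick sketch using base R's anova(); for a single dropped term this is equivalent to the t-test on Examination above):
anova(mylm2, mylm1) # tests whether adding Examination back significantly improves the fit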
par(mfrow=c(2,2))
plot(mylm1)
par(mfrow=c(2,2))
plot(mylm2)
The diagnostic plots above let us compare the residuals of the two models. While both look reasonable, the second model's residuals appear slightly closer to normal, reaffirming that it is the stronger model.
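For a numeric check to go along with the plots, one option is a Shapiro-Wilk test on each model's residuals (again just a sketch; the visual comparison is the main evidence):
shapiro.test(residuals(mylm1))
shapiro.test(residuals(mylm2)) # a larger p-value is consistent with more normal-looking residuals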
par(mfrow=c(1,2))
predicted_values_m1 <- as.numeric(predict(mylm1))
plot(predicted_values_m1, swiss$Fertility, main = "MyLM1", xlab = "Predicted Fertility", ylab = "Actual Fertility")
abline(0,1)
predicted_values_m2 <- as.numeric(predict(mylm2))
plot(predicted_values_m2, swiss$Fertility, main = "MyLM2", xlab = "Predicted Fertility", ylab = "Actual Fertility")
abline(0,1)
The second linear model still seems to be a better model than the first.
Fertility(predicted) = 62.10131 + (-0.15462)Agriculture + (-0.98026)Education + (0.12467)Catholic + (1.07844)Infant.Mortality
We can test this model by plugging in the X values for an observation, for example Courtelary…
Fertility(Courtelary) = 62.10131 + (-0.15462)(17) + (-0.98026)(12) + (0.12467)(9.96) + (1.07844)(22.2)
62.10131 + -0.15462*17 + -0.98026*12 + 0.12467*9.96 + 1.07844*22.2
## [1] 72.89273
Fertility(Courtelary, predicted) = 72.89
Fertility(Courtelary, actual) = 80.2
The prediction misses by about 7 points, but with an adjusted R^2 of only .67 I did not expect the model to predict perfectly.
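Rather than plugging the numbers in by hand, the same prediction can be obtained with predict(); this is a sketch and should reproduce the value above up to rounding of the coefficients:
predict(mylm2, newdata = swiss["Courtelary", ]) # fitted Fertility for Courtelary
swiss["Courtelary", "Fertility"] # actual value, 80.2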
I realized my R^2 did not meet the requirement of .70, so I decided to try transforming some of the variables to see whether doing so would affect the R^2.
swiss$Catholic2 <- (swiss$Catholic)^2
swiss$logFertility <- log(swiss$Fertility)
mylm3 <- lm(logFertility ~ Agriculture + Education + Catholic2 + Infant.Mortality, data = swiss)
summary(mylm3)
##
## Call:
## lm(formula = logFertility ~ Agriculture + Education + Catholic2 +
## Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.22878 -0.08112 0.00703 0.05307 0.20123
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.146e+00 1.391e-01 29.806 < 2e-16 ***
## Agriculture -2.476e-03 9.978e-04 -2.481 0.017170 *
## Education -1.631e-02 2.120e-03 -7.692 1.52e-09 ***
## Catholic2 1.632e-05 4.136e-06 3.946 0.000296 ***
## Infant.Mortality 1.684e-02 5.499e-03 3.063 0.003813 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1038 on 42 degrees of freedom
## Multiple R-squared: 0.7443, Adjusted R-squared: 0.7199
## F-statistic: 30.56 on 4 and 42 DF, p-value: 6.079e-12
After squaring the variable “Catholic” and taking the log of “Fertility”, the t-value for the Catholic term increased and its p-value decreased. Furthermore, the R^2 increased to .7443 and the overall p-value decreased.
Interpreting these results, the R^2 value tells us that about 74% of the variation in log Fertility is explained by variation in the X's. Furthermore, based on the p-values, each independent variable is statistically significant, and the model as a whole is significant, at a 95% Confidence Level.
logFertility(predicted) = 4.146 + (-2.476e-03)Agriculture + (-1.631e-02)Education + (1.632e-05)Catholic^2 + (1.684e-02)Infant.Mortality
logFertility(Courtelary) = 4.146 + (-2.476e-03)(17) + (-1.631e-02)(12) + (1.632e-05)(9.96^2) + (1.684e-02)(22.2)
4.146e+00 + (-2.476e-03)*17 + (-1.631e-02)*12 + (1.632e-05)*9.96^2 + (1.684e-02)*22.2
## [1] 4.283655
log(80.2)
## [1] 4.384524
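The predicted log fertility of about 4.28 is reasonably close to the actual log(80.2) of about 4.38. To put the prediction back on the original Fertility scale, we can exponentiate the fitted value (a sketch using predict(); note that exponentiating a log-scale prediction is only an approximation of the mean response):
pred_log <- predict(mylm3, newdata = swiss["Courtelary", ])
exp(pred_log) # back-transformed prediction for Courtelary
swiss["Courtelary", "Fertility"] # actual value, 80.2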