swiss
## Fertility Agriculture Examination Education Catholic
## Courtelary 80.2 17.0 15 12 9.96
## Delemont 83.1 45.1 6 9 84.84
## Franches-Mnt 92.5 39.7 5 5 93.40
## Moutier 85.8 36.5 12 7 33.77
## Neuveville 76.9 43.5 17 15 5.16
## Porrentruy 76.1 35.3 9 7 90.57
## Broye 83.8 70.2 16 7 92.85
## Glane 92.4 67.8 14 8 97.16
## Gruyere 82.4 53.3 12 7 97.67
## Sarine 82.9 45.2 16 13 91.38
## Veveyse 87.1 64.5 14 6 98.61
## Aigle 64.1 62.0 21 12 8.52
## Aubonne 66.9 67.5 14 7 2.27
## Avenches 68.9 60.7 19 12 4.43
## Cossonay 61.7 69.3 22 5 2.82
## Echallens 68.3 72.6 18 2 24.20
## Grandson 71.7 34.0 17 8 3.30
## Lausanne 55.7 19.4 26 28 12.11
## La Vallee 54.3 15.2 31 20 2.15
## Lavaux 65.1 73.0 19 9 2.84
## Morges 65.5 59.8 22 10 5.23
## Moudon 65.0 55.1 14 3 4.52
## Nyone 56.6 50.9 22 12 15.14
## Orbe 57.4 54.1 20 6 4.20
## Oron 72.5 71.2 12 1 2.40
## Payerne 74.2 58.1 14 8 5.23
## Paysd'enhaut 72.0 63.5 6 3 2.56
## Rolle 60.5 60.8 16 10 7.72
## Vevey 58.3 26.8 25 19 18.46
## Yverdon 65.4 49.5 15 8 6.10
## Conthey 75.5 85.9 3 2 99.71
## Entremont 69.3 84.9 7 6 99.68
## Herens 77.3 89.7 5 2 100.00
## Martigwy 70.5 78.2 12 6 98.96
## Monthey 79.4 64.9 7 3 98.22
## St Maurice 65.0 75.9 9 9 99.06
## Sierre 92.2 84.6 3 3 99.46
## Sion 79.3 63.1 13 13 96.83
## Boudry 70.4 38.4 26 12 5.62
## La Chauxdfnd 65.7 7.7 29 11 13.79
## Le Locle 72.7 16.7 22 13 11.22
## Neuchatel 64.4 17.6 35 32 16.92
## Val de Ruz 77.6 37.6 15 7 4.97
## ValdeTravers 67.6 18.7 25 7 8.65
## V. De Geneve 35.0 1.2 37 53 42.34
## Rive Droite 44.7 46.6 16 29 50.43
## Rive Gauche 42.8 27.7 22 29 58.33
## Infant.Mortality
## Courtelary 22.2
## Delemont 22.2
## Franches-Mnt 20.2
## Moutier 20.3
## Neuveville 20.6
## Porrentruy 26.6
## Broye 23.6
## Glane 24.9
## Gruyere 21.0
## Sarine 24.4
## Veveyse 24.5
## Aigle 16.5
## Aubonne 19.1
## Avenches 22.7
## Cossonay 18.7
## Echallens 21.2
## Grandson 20.0
## Lausanne 20.2
## La Vallee 10.8
## Lavaux 20.0
## Morges 18.0
## Moudon 22.4
## Nyone 16.7
## Orbe 15.3
## Oron 21.0
## Payerne 23.8
## Paysd'enhaut 18.0
## Rolle 16.3
## Vevey 20.9
## Yverdon 22.5
## Conthey 15.1
## Entremont 19.8
## Herens 18.3
## Martigwy 19.4
## Monthey 20.2
## St Maurice 17.8
## Sierre 16.3
## Sion 18.1
## Boudry 20.3
## La Chauxdfnd 20.5
## Le Locle 18.9
## Neuchatel 23.0
## Val de Ruz 20.0
## ValdeTravers 19.5
## V. De Geneve 18.0
## Rive Droite 18.2
## Rive Gauche 19.3
?swiss
hist(swiss$Fertility, main = "Distribution of Fertility", xlab = "Fertility")
The dependent variable “Fertility” looks close enough to normal to fit a linear regression without having to transform the variable.
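As a quick double-check of that impression, we can also look at a normal Q-Q plot (just a sketch using base R; the histogram above is the main evidence):
qqnorm(swiss$Fertility, main = "Normal Q-Q Plot of Fertility")
qqline(swiss$Fertility) # points hugging the line suggest approximate normality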
cor(swiss)
## Fertility Agriculture Examination Education Catholic
## Fertility 1.0000000 0.35307918 -0.6458827 -0.66378886 0.4636847
## Agriculture 0.3530792 1.00000000 -0.6865422 -0.63952252 0.4010951
## Examination -0.6458827 -0.68654221 1.0000000 0.69841530 -0.5727418
## Education -0.6637889 -0.63952252 0.6984153 1.00000000 -0.1538589
## Catholic 0.4636847 0.40109505 -0.5727418 -0.15385892 1.0000000
## Infant.Mortality 0.4165560 -0.06085861 -0.1140216 -0.09932185 0.1754959
## Infant.Mortality
## Fertility 0.41655603
## Agriculture -0.06085861
## Examination -0.11402160
## Education -0.09932185
## Catholic 0.17549591
## Infant.Mortality 1.00000000
This correlation matrix shows the strength and direction of the linear relationship between each pair of variables in the dataset. Because “Fertility” is our dependent variable, we focus on the correlations between “Fertility” and the 5 other variables.
While “Examination” and “Education” have the strongest correlations with “Fertility”, the other variables could still be important in forming an accurate regression.
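To isolate the column we care about, the Fertility correlations can be pulled straight out of the matrix (a small convenience sketch):
round(cor(swiss)[-1, "Fertility"], 2) # each predictor's correlation with Fertility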
First, I will fit a regression using all of the available variables in the dataset.
To test whether the predictors are jointly significant, we can set up a hypothesis test…
Ho: b1 = b2 = b3 = b4 = b5 = 0
Ha: at least one bi =/= 0
mylm1 <- lm(Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality, data = swiss)
summary(mylm1)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Examination + Education +
## Catholic + Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2743 -5.2617 0.5032 4.1198 15.3213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.91518 10.70604 6.250 1.91e-07 ***
## Agriculture -0.17211 0.07030 -2.448 0.01873 *
## Examination -0.25801 0.25388 -1.016 0.31546
## Education -0.87094 0.18303 -4.758 2.43e-05 ***
## Catholic 0.10412 0.03526 2.953 0.00519 **
## Infant.Mortality 1.07705 0.38172 2.822 0.00734 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared: 0.7067, Adjusted R-squared: 0.671
## F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
If we are operating at a 95% Confidence Level, a variable is considered significant when its p-value is below .05, which for these degrees of freedom corresponds roughly to |t| > 2.
In this first model we see that “Agriculture”, “Education”, “Catholic”, and “Infant.Mortality” all clear that bar, while “Examination”, which had one of the strongest correlations with “Fertility”, does not (p = 0.32).
These results make me think that the association between “Fertility” and “Examination” is largely accounted for by the other predictors: the correlation matrix shows that “Examination” is itself strongly correlated with “Education” (about 0.70) and “Agriculture” (about -0.69), so it adds little once those variables are in the model.
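One way to check that suspicion is to look at multicollinearity directly, for example with variance inflation factors (a sketch that assumes the car package is installed):
# install.packages("car") # only needed once
library(car)
vif(mylm1) # a large value for Examination would mean it overlaps heavily with the other predictors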
mylm2 <- lm(Fertility ~ Agriculture + Education + Catholic + Infant.Mortality, data = swiss)
summary(mylm2)
##
## Call:
## lm(formula = Fertility ~ Agriculture + Education + Catholic +
## Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6765 -6.0522 0.7514 3.1664 16.1422
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.10131 9.60489 6.466 8.49e-08 ***
## Agriculture -0.15462 0.06819 -2.267 0.02857 *
## Education -0.98026 0.14814 -6.617 5.14e-08 ***
## Catholic 0.12467 0.02889 4.315 9.50e-05 ***
## Infant.Mortality 1.07844 0.38187 2.824 0.00722 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared: 0.6993, Adjusted R-squared: 0.6707
## F-statistic: 24.42 on 4 and 42 DF, p-value: 1.717e-10
After taking “Examination” out of the model, the t-values of the remaining variables increased (their coefficients are estimated more precisely without the collinear predictor) and the adjusted R^2 stayed virtually the same (0.671 in both models). With a higher F-statistic and a lower overall p-value, this simpler model seems a little better.
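Because mylm2 is nested inside mylm1, we can also compare them formally with a partial F-test (a quick sketch using base R's anova(); for a single dropped term this is equivalent to the t-test on Examination above):
anova(mylm2, mylm1) # tests whether adding Examination back significantly improves the fit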
par(mfrow=c(2,2))
plot(mylm1)
par(mfrow=c(2,2))
plot(mylm2)
The diagnostic plots above let us compare the residuals of the two models. While both look reasonable, the second model's residuals appear slightly closer to normal, reaffirming that it is the stronger model.
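For a numeric check to go along with the plots, one option is a Shapiro-Wilk test on each model's residuals (again just a sketch; the visual comparison is the main evidence):
shapiro.test(residuals(mylm1))
shapiro.test(residuals(mylm2)) # a larger p-value is consistent with more normal-looking residuals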
par(mfrow=c(1,2))
predicted_values_m1 <- as.numeric(predict(mylm1))
plot(predicted_values_m1, swiss$Fertility, main = "MyLM1", xlab = "Predicted Fertility", ylab = "Actual Fertility")
abline(0,1)
predicted_values_m2 <- as.numeric(predict(mylm2))
plot(predicted_values_m2, swiss$Fertility, main = "MyLM2", xlab = "Predicted Fertility", ylab = "Actual Fertility")
abline(0,1)
The second linear model still seems to be a better model than the first.
Fertility(predicted) = 62.10131 + (-0.15462)Agriculture + (-0.98026)Education + (0.12467)Catholic + (1.07844)Infant.Mortality
We can test this model by plugging in the X values for an observation, for example Courtelary…
Fertility(Courtelary) = 62.10131 + (-0.15462)(17) + (-0.98026)(12) + (0.12467)(9.96) + (1.07844)(22.2)
62.10131 + -0.15462*17 + -0.98026*12 + 0.12467*9.96 + 1.07844*22.2
## [1] 72.89273
Fertility(Courtelary, predicted) = 72.89
Fertility(Courtelary, actual) = 80.2
The prediction misses by about 7 points, but with an adjusted R^2 of only .67 I did not expect the model to predict perfectly.
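Rather than plugging the numbers in by hand, the same prediction can be obtained with predict(); this is a sketch and should reproduce the value above up to rounding of the coefficients:
predict(mylm2, newdata = swiss["Courtelary", ]) # fitted Fertility for Courtelary
swiss["Courtelary", "Fertility"] # actual value, 80.2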
I realized my R^2 did not meet the requirement of .70, so I decided to try transforming some of the variables to see whether doing so would affect the R^2.
swiss$Catholic2 <- (swiss$Catholic)^2
swiss$logFertility <- log(swiss$Fertility)
mylm3 <- lm(logFertility ~ Agriculture + Education + Catholic2 + Infant.Mortality, data = swiss)
summary(mylm3)
##
## Call:
## lm(formula = logFertility ~ Agriculture + Education + Catholic2 +
## Infant.Mortality, data = swiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.22878 -0.08112 0.00703 0.05307 0.20123
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.146e+00 1.391e-01 29.806 < 2e-16 ***
## Agriculture -2.476e-03 9.978e-04 -2.481 0.017170 *
## Education -1.631e-02 2.120e-03 -7.692 1.52e-09 ***
## Catholic2 1.632e-05 4.136e-06 3.946 0.000296 ***
## Infant.Mortality 1.684e-02 5.499e-03 3.063 0.003813 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1038 on 42 degrees of freedom
## Multiple R-squared: 0.7443, Adjusted R-squared: 0.7199
## F-statistic: 30.56 on 4 and 42 DF, p-value: 6.079e-12
After squaring the variable “Catholic” and taking the log of “Fertility”, the t-value for the Catholic term increased and its p-value decreased. Furthermore, the R^2 increased to .7443 and the overall p-value decreased.
Interpreting these results, the R^2 value tells us that about 74% of the variation in log Fertility is explained by variation in the X's. Furthermore, based on the p-values, each independent variable is statistically significant, and the model as a whole is significant, at a 95% Confidence Level.
logFertility(predicted) = 4.146 + (-2.476e-03)Agriculture + (-1.631e-02)Education + (1.632e-05)Catholic^2 + (1.684e-02)Infant.Mortality
logFertility(Courtelary) = 4.146 + (-2.476e-03)(17) + (-1.631e-02)(12) + (1.632e-05)(9.96^2) + (1.684e-02)(22.2)
4.146e+00 + (-2.476e-03)*17 + (-1.631e-02)*12 + (1.632e-05)*9.96^2 + (1.684e-02)*22.2
## [1] 4.283655
log(80.2)
## [1] 4.384524
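The predicted log fertility of about 4.28 is reasonably close to the actual log(80.2) of about 4.38. To put the prediction back on the original Fertility scale, we can exponentiate the fitted value (a sketch using predict(); note that exponentiating a log-scale prediction is only an approximation of the mean response):
pred_log <- predict(mylm3, newdata = swiss["Courtelary", ])
exp(pred_log) # back-transformed prediction for Courtelary
swiss["Courtelary", "Fertility"] # actual value, 80.2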