Question 1

In the swiss dataset, there are 5 potential predictors. My first goal is to determine which predictors to include in the “best multiple regression model” for the Fertility variable.

data(swiss)
str(swiss)
## 'data.frame':    47 obs. of  6 variables:
##  $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
##  $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
##  $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
##  $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
##  $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
##  $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...
cor(swiss)
##                   Fertility Agriculture Examination   Education   Catholic
## Fertility         1.0000000  0.35307918  -0.6458827 -0.66378886  0.4636847
## Agriculture       0.3530792  1.00000000  -0.6865422 -0.63952252  0.4010951
## Examination      -0.6458827 -0.68654221   1.0000000  0.69841530 -0.5727418
## Education        -0.6637889 -0.63952252   0.6984153  1.00000000 -0.1538589
## Catholic          0.4636847  0.40109505  -0.5727418 -0.15385892  1.0000000
## Infant.Mortality  0.4165560 -0.06085861  -0.1140216 -0.09932185  0.1754959
##                  Infant.Mortality
## Fertility              0.41655603
## Agriculture           -0.06085861
## Examination           -0.11402160
## Education             -0.09932185
## Catholic               0.17549591
## Infant.Mortality       1.00000000
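
The correlation matrix above is easier to read when the Fertility row is pulled out and sorted by absolute value. This is a small base-R sketch; the names r and r_sorted are my own and not part of the original analysis.

```r
data(swiss)

# Correlation of each predictor with Fertility.
# Fertility is column 1, so drop its correlation with itself.
r <- cor(swiss)["Fertility", -1]

# Sort by absolute correlation, strongest first
r_sorted <- r[order(abs(r), decreasing = TRUE)]
round(r_sorted, 2)
```

Given the matrix printed above, the order is Education, Examination, Catholic, Infant.Mortality, Agriculture.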

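Fertility is most strongly correlated with Education (-0.66) and Examination (-0.65), and more moderately with Catholic (0.46). Education and Examination are also strongly correlated with each other (0.70). I start with Education and Catholic, then add predictors one at a time using lm() in R…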

# Education and Catholic
lm(Fertility ~ Education + Catholic, data = swiss)
## 
## Call:
## lm(formula = Fertility ~ Education + Catholic, data = swiss)
## 
## Coefficients:
## (Intercept)    Education     Catholic  
##     74.2337      -0.7883       0.1109
summary(lm(Fertility ~ Education + Catholic, data = swiss))
## 
## Call:
## lm(formula = Fertility ~ Education + Catholic, data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.042  -6.578  -1.431   6.122  14.322 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 74.23369    2.35197  31.562  < 2e-16 ***
## Education   -0.78833    0.12929  -6.097 2.43e-07 ***
## Catholic     0.11092    0.02981   3.721  0.00056 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.331 on 44 degrees of freedom
## Multiple R-squared:  0.5745, Adjusted R-squared:  0.5552 
## F-statistic:  29.7 on 2 and 44 DF,  p-value: 6.849e-09
# Education and Catholic and Infant.Mortality
lm(Fertility ~ Education + Catholic + Infant.Mortality, data = swiss)
## 
## Call:
## lm(formula = Fertility ~ Education + Catholic + Infant.Mortality, 
##     data = swiss)
## 
## Coefficients:
##      (Intercept)         Education          Catholic  Infant.Mortality  
##         48.67707          -0.75925           0.09607           1.29615
summary(lm(Fertility ~ Education + Catholic + Infant.Mortality, data= swiss))
## 
## Call:
## lm(formula = Fertility ~ Education + Catholic + Infant.Mortality, 
##     data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.4781  -5.4403  -0.5143   4.1568  15.1187 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      48.67707    7.91908   6.147 2.24e-07 ***
## Education        -0.75925    0.11680  -6.501 6.83e-08 ***
## Catholic          0.09607    0.02722   3.530  0.00101 ** 
## Infant.Mortality  1.29615    0.38699   3.349  0.00169 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.505 on 43 degrees of freedom
## Multiple R-squared:  0.6625, Adjusted R-squared:  0.639 
## F-statistic: 28.14 on 3 and 43 DF,  p-value: 3.15e-10
# Education and Catholic and Infant.Mortality and Agriculture
lm(Fertility ~ Education + Catholic + Infant.Mortality + Agriculture, data = swiss)
## 
## Call:
## lm(formula = Fertility ~ Education + Catholic + Infant.Mortality + 
##     Agriculture, data = swiss)
## 
## Coefficients:
##      (Intercept)         Education          Catholic  Infant.Mortality  
##          62.1013           -0.9803            0.1247            1.0784  
##      Agriculture  
##          -0.1546
summary(lm(Fertility ~ Education + Catholic + Infant.Mortality + Agriculture, data = swiss))
## 
## Call:
## lm(formula = Fertility ~ Education + Catholic + Infant.Mortality + 
##     Agriculture, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.6765  -6.0522   0.7514   3.1664  16.1422 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      62.10131    9.60489   6.466 8.49e-08 ***
## Education        -0.98026    0.14814  -6.617 5.14e-08 ***
## Catholic          0.12467    0.02889   4.315 9.50e-05 ***
## Infant.Mortality  1.07844    0.38187   2.824  0.00722 ** 
## Agriculture      -0.15462    0.06819  -2.267  0.02857 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.168 on 42 degrees of freedom
## Multiple R-squared:  0.6993, Adjusted R-squared:  0.6707 
## F-statistic: 24.42 on 4 and 42 DF,  p-value: 1.717e-10
fit <- lm(Fertility ~ Education + Catholic + Infant.Mortality + Agriculture, data = swiss)
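
The three nested fits above can also be compared side by side instead of reading each summary separately. The sketch below refits them with update() and collects the adjusted R-squared values; the object names m2, m3, m4, and adj_r2 are assumptions of mine, and the sequential F-test from anova() is an addition to the original workflow, not a replacement for it.

```r
data(swiss)

# The three nested models fitted above
m2 <- lm(Fertility ~ Education + Catholic, data = swiss)
m3 <- update(m2, . ~ . + Infant.Mortality)
m4 <- update(m3, . ~ . + Agriculture)
models <- list(m2 = m2, m3 = m3, m4 = m4)

# Adjusted R-squared for each model, in the order the predictors were added
adj_r2 <- sapply(models, function(m) summary(m)$adj.r.squared)
round(adj_r2, 4)

# Sequential F-tests: does each added predictor reduce the residual sum of squares?
anova(m2, m3, m4)
```

The adjusted R-squared values should match the summaries above (0.5552, 0.639, 0.6707), rising with each added predictor.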

par(mfrow = c(2, 2))
plot(fit)

Observations:
Plot 1 (Residuals vs Fitted): I see random scatter around zero, which is consistent with the linearity assumption.
Plot 2 (normal Q-Q plot): The data points follow the dashed line, which is consistent with the normality assumption.
Plot 3 (Scale-Location): A roughly horizontal line with points spread randomly and evenly. This “non-funneling” shape indicates homoscedasticity (constant variance), not heteroscedasticity.
The swiss dataset includes observations from 47 different provinces, so I assume independence, treating the provinces as distinct regions that do not affect each other.
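
The visual checks can be backed up with numbers. The sketch below is an addition of mine, not part of the original analysis: it runs a Shapiro-Wilk test on the residuals and computes a studentized (Koenker) Breusch-Pagan statistic by hand in base R, as n times the R-squared of a regression of the squared residuals on the predictors. The names res, sq_res, aux, bp_stat, and bp_p are hypothetical.

```r
data(swiss)
fit <- lm(Fertility ~ Education + Catholic + Infant.Mortality + Agriculture, data = swiss)
res <- residuals(fit)

# Normality of residuals: a large p-value gives no evidence against normality
sw <- shapiro.test(res)
sw$p.value

# Constant variance: regress the squared residuals on the same predictors
sq_res <- res^2
aux <- lm(sq_res ~ Education + Catholic + Infant.Mortality + Agriculture, data = swiss)

# Studentized Breusch-Pagan statistic, chi-squared with 4 df (one per predictor)
bp_stat <- nrow(swiss) * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 4, lower.tail = FALSE)
c(statistic = bp_stat, p_value = bp_p)
```

In both tests a small p-value would count against the assumption, so they complement the plots and do not replace them.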

Question 2

For a logistic regression, I need to add a binary variable to the swiss dataset:

swiss$greater_than_70 <- as.integer(swiss$Fertility > 70)
table(swiss$greater_than_70)
## 
##  0  1 
## 23 24
# My starting point is the model from linear regression
summary(glm(greater_than_70 ~ Education + Catholic + Infant.Mortality + Agriculture, family = binomial(),  data = swiss ))
## 
## Call:
## glm(formula = greater_than_70 ~ Education + Catholic + Infant.Mortality + 
##     Agriculture, family = binomial(), data = swiss)
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)   
## (Intercept)      -0.94524    4.13457  -0.229   0.8192   
## Education        -0.21212    0.08825  -2.404   0.0162 * 
## Catholic          0.03316    0.01187   2.795   0.0052 **
## Infant.Mortality  0.22203    0.17261   1.286   0.1983   
## Agriculture      -0.04973    0.02675  -1.859   0.0631 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 65.135  on 46  degrees of freedom
## Residual deviance: 40.634  on 42  degrees of freedom
## AIC: 50.634
## 
## Number of Fisher Scoring iterations: 5

Observations:
I see that Catholic and Education have the smallest Pr(>|z|) values and are the only predictors significant at the 0.05 level, so I will eliminate Infant.Mortality and Agriculture to simplify the model.

glm(greater_than_70 ~ Education + Catholic, family = binomial(),  data = swiss)
## 
## Call:  glm(formula = greater_than_70 ~ Education + Catholic, family = binomial(), 
##     data = swiss)
## 
## Coefficients:
## (Intercept)    Education     Catholic  
##     0.37798     -0.11882      0.02343  
## 
## Degrees of Freedom: 46 Total (i.e. Null);  44 Residual
## Null Deviance:       65.13 
## Residual Deviance: 48.26     AIC: 54.26
summary(glm(greater_than_70 ~ Education + Catholic, family = binomial(),  data = swiss))
## 
## Call:
## glm(formula = greater_than_70 ~ Education + Catholic, family = binomial(), 
##     data = swiss)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.377979   0.718516   0.526   0.5988  
## Education   -0.118823   0.059196  -2.007   0.0447 *
## Catholic     0.023428   0.009328   2.511   0.0120 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 65.135  on 46  degrees of freedom
## Residual deviance: 48.262  on 44  degrees of freedom
## AIC: 54.262
## 
## Number of Fisher Scoring iterations: 5

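Observations:
Education and Catholic remain significant at the 0.05 level, although the AIC rises from 50.634 to 54.262, so the simpler model comes at some cost in fit.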
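
Because the reduced model is nested in the four-predictor model, the two can be compared directly with a likelihood-ratio test on the difference in residual deviance. This is a sketch I am adding, not a step from the original analysis; the names full, reduced, dev_diff, df_diff, and p_value are assumptions.

```r
data(swiss)
swiss$greater_than_70 <- as.integer(swiss$Fertility > 70)

full <- glm(greater_than_70 ~ Education + Catholic + Infant.Mortality + Agriculture,
            family = binomial(), data = swiss)
reduced <- glm(greater_than_70 ~ Education + Catholic,
               family = binomial(), data = swiss)

# Likelihood-ratio test by hand: drop in deviance vs. chi-squared on the df difference
dev_diff <- deviance(reduced) - deviance(full)
df_diff <- df.residual(reduced) - df.residual(full)
p_value <- pchisq(dev_diff, df = df_diff, lower.tail = FALSE)
c(dev_diff = dev_diff, df = df_diff, p_value = p_value)

# The same comparison through anova(), plus AIC for both models
anova(reduced, full, test = "Chisq")
AIC(reduced, full)
```

With the deviances printed above (48.262 and 40.634), the difference is 7.628 on 2 degrees of freedom, which corresponds to a p-value of roughly 0.02. This suggests the two dropped predictors may still contribute jointly even though neither is individually significant at the 0.05 level.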

exp(coef(glm(greater_than_70 ~ Education + Catholic, family = binomial(),  data = swiss)))
## (Intercept)   Education    Catholic 
##   1.4593328   0.8879647   1.0237044

Interpretation: To predict whether a province’s Fertility exceeded 70, I built a logistic regression model using the swiss dataset. I initially included all four predictors from my best linear model (Education, Catholic, Agriculture, and Infant.Mortality), but Agriculture and Infant.Mortality were not statistically significant (p > 0.05) and were removed. The final model retained only Education and Catholic as predictors.

For every one-unit (one percentage point) increase in Education, the odds of Fertility exceeding 70 decrease by 11.2%, holding Catholic constant. For every one-unit (one percentage point) increase in Catholic, the odds of Fertility exceeding 70 increase by 2.4%, holding Education constant.
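
The percentages quoted here come from the odds ratios as (odds ratio - 1) × 100. The sketch below reproduces that arithmetic and adds Wald confidence intervals on the odds-ratio scale with confint.default(); the intervals are my addition, and the names logit_fit, odds_ratios, and pct_change are hypothetical.

```r
data(swiss)
swiss$greater_than_70 <- as.integer(swiss$Fertility > 70)
logit_fit <- glm(greater_than_70 ~ Education + Catholic,
                 family = binomial(), data = swiss)

# Odds ratios are the exponentiated coefficients
odds_ratios <- exp(coef(logit_fit))

# Percent change in the odds per one-unit increase (intercept dropped)
pct_change <- (odds_ratios[-1] - 1) * 100
round(pct_change, 1)

# Wald 95% confidence intervals, exponentiated onto the odds-ratio scale
exp(confint.default(logit_fit))
```

The rounded percent changes are -11.2 for Education and 2.4 for Catholic, matching the figures quoted.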

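Per percentage point, Education has the larger effect on the odds, suggesting that provinces with higher education levels are less likely to have high fertility rates. Catholic has the smaller p-value (0.012 vs. 0.045), so Education is the stronger predictor per unit, not necessarily the more statistically certain one.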
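
Because Education and Catholic are measured on different spreads, one way to compare their influence is the odds ratio for a one-standard-deviation increase in each. This is a sketch of that rescaling, which I am adding as an assumption about how "stronger" could be checked; the names sds and or_per_sd are hypothetical.

```r
data(swiss)
swiss$greater_than_70 <- as.integer(swiss$Fertility > 70)
logit_fit <- glm(greater_than_70 ~ Education + Catholic,
                 family = binomial(), data = swiss)

# Standard deviation of each predictor
sds <- sapply(swiss[c("Education", "Catholic")], sd)

# Odds ratio for a one-SD increase: exp(coefficient * SD)
or_per_sd <- exp(coef(logit_fit)[names(sds)] * sds)
round(or_per_sd, 2)
```

The further an odds ratio is from 1 on this scale, the larger that predictor's effect for a typical change in its value.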