data_605_discussion

Set up

This is a simple dataset that has a standardized fertility measure and socio-economic indicators for each of 47 French-speaking provinces of Switzerland in the 1888. See https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/swiss.html. We will use Fertility as our predicted variable in this discussion.

Fertility: Common standardized fertility measure

Agriculture: Percent of males involved in agriculture as occupation

Examination: Percent of draftees receiving highest mark on army examination

Education: Percent education beyond primary school for draftees.

Catholic: Percent Catholic (as opposed to Protestant)

Infant.Mortality: Percent of live births who live less than 1 year

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.6.3

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.3     v purrr   0.3.3
## v tibble  3.1.0     v dplyr   1.0.5
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## Warning: package 'ggplot2' was built under R version 3.6.3

## Warning: package 'tibble' was built under R version 3.6.3

## Warning: package 'tidyr' was built under R version 3.6.3

## Warning: package 'readr' was built under R version 3.6.2

## Warning: package 'purrr' was built under R version 3.6.2

## Warning: package 'dplyr' was built under R version 3.6.3

## Warning: package 'stringr' was built under R version 3.6.2

## Warning: package 'forcats' was built under R version 3.6.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(corrplot)

## Warning: package 'corrplot' was built under R version 3.6.2

## corrplot 0.84 loaded

data("swiss")
df <- swiss
str(df)

## 'data.frame':    47 obs. of  6 variables:
##  $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
##  $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
##  $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
##  $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
##  $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
##  $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...

head(df)

##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

corrplot::corrplot(cor(df), method = 'number')

Education and Examination seem to play big roles in negatively affecting fertility. Not surprisingly, percent Catholic has a mild but positive influence on fertility.

Let’s build a base model using all variables to predict fertility.

mod1 <- lm(Fertility ~ ., data = df)
summary(mod1)

## 
## Call:
## lm(formula = Fertility ~ ., data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2743  -5.2617   0.5032   4.1198  15.3213 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      66.91518   10.70604   6.250 1.91e-07 ***
## Agriculture      -0.17211    0.07030  -2.448  0.01873 *  
## Examination      -0.25801    0.25388  -1.016  0.31546    
## Education        -0.87094    0.18303  -4.758 2.43e-05 ***
## Catholic          0.10412    0.03526   2.953  0.00519 ** 
## Infant.Mortality  1.07705    0.38172   2.822  0.00734 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared:  0.7067, Adjusted R-squared:  0.671 
## F-statistic: 19.76 on 5 and 41 DF,  p-value: 5.594e-10

par(mfrow = c(2, 2))
plot(mod1)

Let’s transform some variables in our df in order to meet the discussion requirements.

df2 <- df %>%
    dplyr::mutate(CatholicDominant = dplyr::case_when(Catholic > 50 ~ 1, TRUE ~ 0),
                  Education2 = Education^2,
                  CatholicDominantInf.Mor = CatholicDominant * Infant.Mortality)

mod2 <- lm(Fertility ~ as.factor(CatholicDominant) + Education2 + CatholicDominantInf.Mor, data = df2)
summary(mod2)

## 
## Call:
## lm(formula = Fertility ~ as.factor(CatholicDominant) + Education2 + 
##     CatholicDominantInf.Mor, data = df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.999  -5.646   1.276   4.472  18.998 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   70.247346   1.781919  39.422  < 2e-16 ***
## as.factor(CatholicDominant)1 -17.477310  13.909260  -1.257   0.2157    
## Education2                    -0.015853   0.002867  -5.530 1.76e-06 ***
## CatholicDominantInf.Mor        1.262223   0.665306   1.897   0.0645 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.758 on 43 degrees of freedom
## Multiple R-squared:  0.5405, Adjusted R-squared:  0.5084 
## F-statistic: 16.86 on 3 and 43 DF,  p-value: 2.187e-07

par(mfrow = c(2, 2))
plot(mod2)

The transformation performed much worse than the base model, i.e. adjusted R-squared .67 vs. .51. In addition, it violates the assumption of constant variance in the residuals and therefore it is not suitable for the linear regression.

data_605_discussion_12

Jimmy Ng

11/10/2021

Discussion 12

Set up