library(datasets)
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
data(swiss)
The ‘swiss’ data set contains a standardized fertility measure and several socio-economic indicators for each of 47 French-speaking provinces of Switzerland in about 1888. Its variables are:
[,1] Fertility: Ig, ‘common standardized fertility measure’
[,2] Agriculture: % of males involved in agriculture as occupation
[,3] Examination: % draftees receiving highest mark on army examination
[,4] Education: % education beyond primary school for draftees
[,5] Catholic: % ‘catholic’ (as opposed to ‘protestant’)
[,6] Infant.Mortality: live births who live less than 1 year
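Before plotting, it can help to confirm that the loaded data matches the description above. The following is a minimal sketch using only base R; the object names `n_provinces` and `variable_names` are ours, chosen for illustration.

```r
library(datasets)
data(swiss)

# Number of provinces (rows) and the names of the six variables (columns)
n_provinces <- nrow(swiss)
variable_names <- names(swiss)

print(n_provinces)
print(variable_names)

# Compact overview of each column's type and first few values
str(swiss)
```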
Let’s look at the pairwise relationships between the variables:
# Scatterplot matrix; the lower panels add a fitted trend line to each scatterplot
pairs_plot <- ggpairs(swiss, lower = list(continuous = "smooth"))
pairs_plot
The scatterplot matrix shows that the fertility rate is positively correlated with the agricultural employment rate.
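The positive correlation read off the plot can also be checked numerically. This is a small sketch using base R's `cor()` and `cor.test()`; the names `r` and `ct` are ours, chosen for illustration.

```r
library(datasets)
data(swiss)

# Pearson correlation between fertility and agricultural employment
r <- cor(swiss$Fertility, swiss$Agriculture)
print(r)

# Test of the null hypothesis that the true correlation is zero
ct <- cor.test(swiss$Fertility, swiss$Agriculture)
print(ct$p.value)
```

For a single predictor, this test is equivalent to the t-test on the slope of a simple linear regression, so its p-value should agree with the one reported for Agriculture in a one-variable model.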
Fit a simple linear regression of the standardized fertility rate on the agricultural employment rate:
summary(lm(Fertility ~ Agriculture, data = swiss))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.3043752 4.25125562 14.185074 3.216304e-18
## Agriculture 0.1942017 0.07671176 2.531577 1.491720e-02
This confirms it: the regression coefficient (slope) of the agricultural employment rate is 0.1942017 with p < 0.05 (p ≈ 0.015), so the two are positively associated. Now let’s use the fertility rate as the dependent variable and all the other indicators as independent variables, and fit a multiple linear regression model:
summary(lm(Fertility ~ . , data = swiss))$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.9151817 10.70603759 6.250229 1.906051e-07
## Agriculture -0.1721140 0.07030392 -2.448142 1.872715e-02
## Examination -0.2580082 0.25387820 -1.016268 3.154617e-01
## Education -0.8709401 0.18302860 -4.758492 2.430605e-05
## Catholic 0.1041153 0.03525785 2.952969 5.190079e-03
## Infant.Mortality 1.0770481 0.38171965 2.821568 7.335715e-03
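The p-value of the agricultural employment rate is still less than 0.05 (p ≈ 0.019), but its partial regression coefficient is now -0.1721140. Holding the other indicators fixed, fertility is negatively associated with agricultural employment, which is the opposite sign from the unadjusted result found earlier. This reversal of an association after adjusting for other variables is called Simpson’s Paradox.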
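To see the sign change side by side, we can pull the Agriculture coefficient out of both models. This is a minimal sketch; the object names `fit_simple`, `fit_full`, `slopes` and `agri_cor` are ours. The last step prints how Agriculture correlates with the other variables, which may hint at which predictors are driving the reversal.

```r
library(datasets)
data(swiss)

# Unadjusted model: Agriculture is the only predictor
fit_simple <- lm(Fertility ~ Agriculture, data = swiss)

# Adjusted model: all other indicators are included as predictors
fit_full <- lm(Fertility ~ ., data = swiss)

# Agriculture coefficient from each model
slopes <- c(
  unadjusted = unname(coef(fit_simple)["Agriculture"]),
  adjusted   = unname(coef(fit_full)["Agriculture"])
)
print(slopes)

# Correlation of Agriculture with every variable in the data set
agri_cor <- round(cor(swiss)["Agriculture", ], 2)
print(agri_cor)
```

When a predictor is strongly correlated with other predictors that also relate to the outcome, its unadjusted and adjusted coefficients can differ in size and even in sign, as they do here.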