Regression - Discussion 11

Swiss Education vs Agriculture Regression

I am using R's built-in swiss dataset, which records a fertility measure and socioeconomic indicators for 47 French-speaking Swiss provinces around 1888. I will focus on the relationship between education and agriculture.

data(swiss)
head(swiss, 6)
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

Visualization

The first scatter plot shows agriculture as a function of education. The second takes the log of education, so that a linear regression is a reasonable fit.

library(ggplot2)
ggplot(swiss, aes(Education, Agriculture)) + geom_point() +
  ggtitle("Education vs Agriculture")

ggplot(swiss, aes(log(Education),Agriculture)) + geom_point() +
ggtitle("Log Education vs Agriculture") 

Linear Model

Fit agriculture ~ log(education).

attach(swiss)
lm <- lm(Agriculture ~ log(Education))
lm
##
## Call:
## lm(formula = Agriculture ~ log(Education))
##
## Coefficients:
##    (Intercept)  log(Education)
##          91.28          -19.35
eq <- paste0("y = ", round(coef(lm)[2], 3), "*x + ", round(coef(lm)[1], 3))
ggplot(swiss, aes(log(Education), Agriculture)) + geom_point() +
  geom_abline(intercept = coef(lm)[1], slope = coef(lm)[2]) +
  ggtitle(paste("Log Education vs Agriculture:", eq))
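Because the predictor is on a log scale, the slope is easier to read as the effect of multiplying education by a constant than as the effect of adding one unit. The sketch below refits the same model under the name `fit` (a name chosen for this example, not one used above) and compares predictions at two illustrative education values, 5 and 10, which I picked only to show a doubling.

```r
# Refit the model from this section; `fit` is a name chosen for this sketch.
fit <- lm(Agriculture ~ log(Education), data = swiss)

slope <- coef(fit)[["log(Education)"]]

# With log(Education) as the predictor, doubling Education shifts the
# predicted Agriculture value by slope * log(2), whatever the starting level.
effect_of_doubling <- slope * log(2)

# Predicted Agriculture at Education = 5 and Education = 10 (illustrative values).
new_data <- data.frame(Education = c(5, 10))
pred <- predict(fit, newdata = new_data)

print(effect_of_doubling)
print(pred)
```

The difference between the two predictions equals `slope * log(2)`, so the same shift would apply when going from 10 to 20 or from 3 to 6.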

Conditions

• The Swiss provinces used to fit the line are independent of one another.

• The relationship between log education and agriculture is linear.

• The residuals from the regression line are nearly normal.

• The variability of the agriculture observations around the regression line is roughly constant across values of log education.
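Two of these conditions can be checked numerically as well as by eye. The following is a minimal sketch, not part of the original analysis: it refits the model as `fit` (a name chosen here), runs a Shapiro-Wilk test on the residuals, and compares the residual spread in the lower and upper halves of the fitted values. Splitting at the median is my own rough choice for a constant-variance check, not a formal test.

```r
# Refit the model; `fit` and the helper names below are chosen for this sketch.
fit <- lm(Agriculture ~ log(Education), data = swiss)
res <- resid(fit)

# Normality: Shapiro-Wilk test on the residuals.
# A small p-value is evidence against nearly normal residuals.
sw <- shapiro.test(res)

# Constant variance (rough check): compare the residual standard deviation
# for observations with fitted values below vs above the median fitted value.
lower_half <- fitted(fit) <= median(fitted(fit))
sd_lower <- sd(res[lower_half])
sd_upper <- sd(res[!lower_half])

print(sw)
print(c(sd_lower = sd_lower, sd_upper = sd_upper))
```

For a visual counterpart to the normality test, `qqnorm(res)` followed by `qqline(res)` draws a normal Q-Q plot of the same residuals.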

Model Quality

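• This model has an intercept of 91.277 and a slope of -19.348.

• To assess the model quality, we look at the Multiple R-squared value, which lies between 0 and 1; values closer to 1 mean the model explains more of the variability in the response. Here R^2 = 0.4569, so log education explains about 46% of the variability in agriculture and more than half is left unexplained, which indicates that this model is not ideal.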

summary(lm)
##
## Call:
## lm(formula = Agriculture ~ log(Education))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -37.181 -14.089  -0.326  13.974  28.291
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)      91.277      7.048  12.950  < 2e-16 ***
## log(Education)  -19.348      3.145  -6.152 1.85e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.92 on 45 degrees of freedom
## Multiple R-squared:  0.4569, Adjusted R-squared:  0.4448
## F-statistic: 37.85 on 1 and 45 DF,  p-value: 1.854e-07
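To make the R-squared figure less of a black box, it can be recomputed from the residuals. This is a small sketch that refits the model as `fit` (a name chosen for this example) and applies the usual definitions: R-squared is one minus the ratio of the residual sum of squares to the total sum of squares, and the adjusted version penalizes for the number of predictors.

```r
# Refit the model; `fit` is a name chosen for this sketch.
fit <- lm(Agriculture ~ log(Education), data = swiss)

# Residual sum of squares: variation left over after the fit.
ss_res <- sum(resid(fit)^2)

# Total sum of squares: variation of Agriculture around its mean.
ss_tot <- sum((swiss$Agriculture - mean(swiss$Agriculture))^2)

r_squared <- 1 - ss_res / ss_tot

# Adjusted R-squared with n observations and p = 1 predictor.
n <- nrow(swiss)
p <- 1
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(c(r_squared = r_squared, adj_r_squared = adj_r_squared))
```

Both values should agree with the Multiple R-squared and Adjusted R-squared lines in the summary output above.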

Residual Analysis

The residuals show no obvious trend around zero, which suggests that a linear model in log education is an appropriate form for the data.

swiss$resid <- resid(lm)
ggplot(swiss, aes(log(Education), resid)) + geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "blue") +
  ggtitle("Residual Plot")

ggplot(swiss, aes(resid)) +
  geom_histogram(binwidth = 5) +
  ggtitle("Residual Histogram")