I am using the built-in R dataset of Swiss socioeconomic and fertility factors from 1888. I will focus on the relationship between education and agriculture.
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6
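For reference, a preview like the one above can be reproduced with base R alone, since `swiss` ships with the `datasets` package and needs no download. A minimal sketch:

```r
# swiss is part of the datasets package that loads with base R
data(swiss)

# First six provinces and all six variables
head(swiss)

# Dimensions: 47 provinces by 6 variables
dim(swiss)
```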
The scatter plot below shows agriculture as a function of education. We take the log of education so that a linear regression can be fit.
ggplot(swiss, aes(log(Education), Agriculture)) + geom_point() +
  ggtitle("Log Education vs Agriculture")

Fit Agriculture ~ log(Education).
##
## Call:
## lm(formula = Agriculture ~ log(Education))
##
## Coefficients:
##    (Intercept)  log(Education)
##          91.28          -19.35
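The fitting call itself is not shown above. The following is a minimal sketch of how a model like this could be fit. The object name `fit` and the `data = swiss` argument are assumptions: the printed call has no `data` argument, and the rest of this analysis refers to the fitted model as `lm`.

```r
# Hypothetical name `fit`; the original stores the model under a different name
fit <- lm(Agriculture ~ log(Education), data = swiss)

# Named vector: intercept first, then the slope on log(Education)
coef(fit)
```

Passing `data = swiss` avoids attaching the dataset, and a name other than `lm` keeps the fitted object from shadowing the `lm()` function.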
# `lm` here is the fitted model object, not the lm() function
# Build the equation label: slope first, then intercept
eq = paste0("y = ", round(coef(lm)[2], 3), "*x + ", round(coef(lm)[1], 3))
ggplot(swiss, aes(log(Education), Agriculture)) + geom_point() +
  geom_abline(intercept = coef(lm)[1], slope = coef(lm)[2]) +
  ggtitle(paste("Log Education vs Agriculture:", eq))

The linear regression relies on the following conditions:
The Swiss provinces used to fit the line are independent.
The relationship between the log education and agriculture is linear.
The residuals from the regression line are nearly normal.
The variability of the agriculture observations around the regression line is constant.
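These conditions can be checked informally. The sketch below refits the model under the assumed name `fit` and inspects the residuals with a normal Q-Q plot, a Shapiro-Wilk test, and a residuals-versus-fitted plot. The choice of these checks is an illustration, not something the original analysis reports.

```r
fit <- lm(Agriculture ~ log(Education), data = swiss)
res <- resid(fit)

# Normality: points close to the reference line suggest nearly normal residuals
qqnorm(res)
qqline(res)

# Formal complement to the Q-Q plot; a small p-value would cast doubt on normality
sw <- shapiro.test(res)
sw$p.value

# Constant variance: the vertical spread should look similar across fitted values
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

Independence cannot be verified from the residuals alone; it depends on how the provinces were sampled.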
This model has an intercept of 91.277 and a slope of -19.348.
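Because the predictor is on a log scale, the slope of -19.348 on log(Education) is easiest to read as the effect of a multiplicative change in education. R's `log()` is the natural log, so doubling Education shifts predicted Agriculture by about -19.348 × log(2) ≈ -13.4 points. The sketch below reproduces this and a single prediction; the name `fit` and the Education value of 10 are illustrative assumptions.

```r
fit <- lm(Agriculture ~ log(Education), data = swiss)
b <- coef(fit)

# Change in predicted Agriculture when Education doubles: slope * log(2)
doubling_effect <- b[["log(Education)"]] * log(2)

# Predicted Agriculture for a province with Education = 10 (illustrative value)
pred_10 <- predict(fit, newdata = data.frame(Education = 10))

round(doubling_effect, 1)  # about -13.4
round(unname(pred_10), 1)  # about 46.7
```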
To assess the model quality, we look at the Multiple R-squared value. Values closer to one indicate a better fit. Here R^2 = 0.4569, meaning log education explains only about 46% of the variance in agriculture, so this model is not ideal.
##
## Call:
## lm(formula = Agriculture ~ log(Education))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -37.181 -14.089  -0.326  13.974  28.291
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)      91.277      7.048  12.950  < 2e-16 ***
## log(Education)  -19.348      3.145  -6.152 1.85e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.92 on 45 degrees of freedom
## Multiple R-squared: 0.4569, Adjusted R-squared: 0.4448
## F-statistic: 37.85 on 1 and 45 DF, p-value: 1.854e-07
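To make the R-squared figure concrete: it is one minus the ratio of the residual sum of squares to the total sum of squares, and in a regression with a single predictor it equals the squared correlation between the response and that predictor. A sketch, again using the assumed name `fit`:

```r
fit <- lm(Agriculture ~ log(Education), data = swiss)

# R-squared from its definition: 1 - SS_residual / SS_total
ss_res <- sum(resid(fit)^2)
ss_tot <- sum((swiss$Agriculture - mean(swiss$Agriculture))^2)
r2_manual <- 1 - ss_res / ss_tot

# With one predictor, R-squared equals the squared Pearson correlation
r2_cor <- cor(swiss$Agriculture, log(swiss$Education))^2

# Both should agree with the value reported by summary()
c(manual = r2_manual, cor = r2_cor, summary = summary(fit)$r.squared)
```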
The residuals show no trend against log(Education), which suggests that a linear form is appropriate for these data.
swiss$resid <- resid(lm)
ggplot(swiss, aes(log(Education), resid)) + geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "blue") +
  ggtitle("Residual Plot")