We will fit simple linear regressions of Fertility on each of 2 variables from the dataset “swiss”:
Education
Agriculture
For each model, we will test the following assumptions:
Linear relationship between dependent and independent variables
Normal distribution (with histogram)
plot.default(swiss$Fertility~swiss$Agriculture,type="p",main="Scatterplot for Fertility with Variable Agriculture")
Scatterplot shows weak correlation, with positive trend, and some linearity.
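To put a number on the strength of the association read off the scatterplot, one option is the Pearson correlation coefficient. This is a minimal sketch; the name `r_agri` is introduced here for illustration and is not part of the original analysis.

```r
data(swiss)

# Pearson correlation between Fertility and Agriculture
r_agri <- cor(swiss$Fertility, swiss$Agriculture)
r_agri

# In a simple linear regression the squared correlation equals R-squared
r_agri^2
```

A coefficient near 0 indicates a weak linear association, and a value near 1 or -1 a strong one.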
hist(swiss$Agriculture,freq=FALSE,main="Histogram for Agriculture",col="BLUE")
# Overlay the normal density with the sample mean and standard deviation
curve(dnorm(x,mean(swiss$Agriculture),sd(swiss$Agriculture)),add=TRUE,col="RED",lwd=2)
The histogram suggests that the variable Agriculture is approximately normally distributed.
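A histogram alone gives only a visual impression of normality. As a complementary check, a normal Q-Q plot and a Shapiro-Wilk test could be used. This is a sketch under the assumption that these checks suit the analysis; `sw_agri` is a name introduced here for illustration.

```r
data(swiss)

# Normal Q-Q plot: points close to the reference line suggest approximate normality
qqnorm(swiss$Agriculture, main = "Normal Q-Q Plot for Agriculture")
qqline(swiss$Agriculture, col = "red", lwd = 2)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw_agri <- shapiro.test(swiss$Agriculture)
sw_agri$p.value
```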
data(swiss)
model1<- lm(Fertility~Agriculture, data=swiss)
summary(model1)
##
## Call:
## lm(formula = Fertility ~ Agriculture, data = swiss)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -25.5374  -7.8685  -0.6362   9.0464  24.4858
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.30438    4.25126  14.185   <2e-16 ***
## Agriculture  0.19420    0.07671   2.532   0.0149 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.82 on 45 degrees of freedom
## Multiple R-squared: 0.1247, Adjusted R-squared: 0.1052
## F-statistic: 6.409 on 1 and 45 DF, p-value: 0.01492
Here we have a linear model with an intercept of 60.3044, which is the predicted Fertility when Agriculture is 0, and a slope of 0.1942. For every one-unit increase in Agriculture, predicted Fertility increases by 0.1942. The slope is statistically significant at the 5% level (p = 0.0149) but not at the 1% level, so the evidence for the relationship is moderate rather than strong.
The R squared value means that only about 12% of the variation in Fertility (10.5% adjusted) is explained by the linear model.
On its own, Agriculture is not a good predictor of Fertility.
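The meaning of the intercept and slope can be illustrated by predicting Fertility at two Agriculture values that differ by one unit. This is a minimal sketch; the values 50 and 51 and the names `b`, `new_obs`, and `pred` are chosen here for illustration only.

```r
data(swiss)
model1 <- lm(Fertility ~ Agriculture, data = swiss)

# Estimated coefficients: intercept and slope
b <- coef(model1)
b

# Predicted Fertility at Agriculture = 50 and Agriculture = 51
new_obs <- data.frame(Agriculture = c(50, 51))
pred <- predict(model1, newdata = new_obs)
pred

# The difference between the two predictions equals the slope
diff(pred)
```

Each prediction is the intercept plus the slope times the Agriculture value, so a one-unit increase in Agriculture changes the prediction by exactly the slope.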
plot.default(swiss$Fertility~swiss$Education,type="p",main="Scatterplot for Fertility with Variable Education")
The scatterplot shows a negative, somewhat linear relationship between Fertility and Education, with a cluster of points at Education values around 9-10. If we remove the outlier near an Education value of 50, we should see a more visible trend.
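One way to explore the effect of the outlier is to redraw the scatterplot on a subset of the data that excludes it. This is a sketch only: the cutoff `Education < 50` and the name `swiss_trim` are assumptions made here for illustration, not part of the original analysis.

```r
data(swiss)

# Assumed cutoff: treat observations with Education >= 50 as the outlier
swiss_trim <- subset(swiss, Education < 50)

# Scatterplot without the outlier
plot(Fertility ~ Education, data = swiss_trim,
     main = "Scatterplot for Fertility with Variable Education (outlier removed)")

# Correlation with and without the outlier, for comparison
c(all     = cor(swiss$Fertility, swiss$Education),
  trimmed = cor(swiss_trim$Fertility, swiss_trim$Education))
```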
hist(swiss$Education,freq=FALSE,main="Histogram for Education",col="YELLOW")
# Overlay the normal density with the sample mean and standard deviation
curve(dnorm(x,mean(swiss$Education),sd(swiss$Education)),add=TRUE,col="RED",lwd=2)
The histogram is strongly skewed right, with a long tail toward high values, so the mean is larger than the median. The distribution is only loosely normal.
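The direction of the skew can be checked numerically by comparing the mean and the median: a mean above the median points to a long right tail. This is a minimal sketch; `edu_mean` and `edu_median` are names introduced here for illustration.

```r
data(swiss)

# Compare the mean and median of Education
edu_mean <- mean(swiss$Education)
edu_median <- median(swiss$Education)
c(mean = edu_mean, median = edu_median)

# TRUE when the mean exceeds the median (consistent with right skew)
edu_mean > edu_median
```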
data(swiss)
model2<- lm(Fertility~Education, data=swiss)
summary(model2)
##
## Call:
## lm(formula = Fertility ~ Education, data = swiss)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -17.036  -6.711  -1.011   9.526  19.689
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  79.6101     2.1041  37.836  < 2e-16 ***
## Education    -0.8624     0.1448  -5.954 3.66e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.446 on 45 degrees of freedom
## Multiple R-squared: 0.4406, Adjusted R-squared: 0.4282
## F-statistic: 35.45 on 1 and 45 DF, p-value: 3.659e-07
Here we have a linear model with an intercept of 79.61, which is the predicted Fertility when Education is 0, and a slope of -0.8624. For every one-unit increase in Education, predicted Fertility decreases by 0.8624. The p-value (3.66e-07) indicates a high level of statistical significance.
The R squared value means that about 44% of the variation in Fertility (43% adjusted) is explained by the linear model. While better than Agriculture, Education alone is still not a strong predictor of Fertility.
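The normality assumption in linear regression formally concerns the residuals rather than the predictor itself, so a complementary check is to inspect the residuals of each fitted model. This is a minimal sketch that refits both models so the block is self-contained; `res1` and `res2` are names introduced here for illustration.

```r
data(swiss)
model1 <- lm(Fertility ~ Agriculture, data = swiss)
model2 <- lm(Fertility ~ Education, data = swiss)

# Residuals of each model
res1 <- residuals(model1)
res2 <- residuals(model2)

# Side-by-side normal Q-Q plots of the residuals
op <- par(mfrow = c(1, 2))
qqnorm(res1, main = "Residuals: Agriculture model")
qqline(res1, col = "red", lwd = 2)
qqnorm(res2, main = "Residuals: Education model")
qqline(res2, col = "red", lwd = 2)
par(op)

# Least-squares residuals from a model with an intercept average to zero
c(mean(res1), mean(res2))
```

Points that follow the reference line closely suggest the residuals are roughly normal. Systematic curvature or heavy tails suggest otherwise.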