We will fit a simple linear regression of Fertility on each of two variables from the dataset “swiss”:

  1. Education

  2. Agriculture

For each model, we will test the following assumptions:

  1. Linear relationship between dependent and independent variables

  2. Normal distribution of the independent variable (checked with a histogram)

  1. AGRICULTURE
# Scatterplot of Fertility against Agriculture, using the formula interface of plot()
plot(Fertility~Agriculture,data=swiss,type="p",main="Scatterplot for Fertility with Variable Agriculture")

The scatterplot shows a weak, positive, and roughly linear relationship between Fertility and Agriculture.
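To put a number on the visual impression from the scatterplot, the Pearson correlation can be computed directly. This is a minimal sketch; the name `r_agri` is introduced here for illustration only.

```r
# Pearson correlation between Fertility and Agriculture in the built-in swiss data
data(swiss)
r_agri <- cor(swiss$Fertility, swiss$Agriculture)
r_agri      # the sign gives the direction, the magnitude the strength of the linear association
r_agri^2    # in simple regression this equals the R-squared of lm(Fertility ~ Agriculture)
```

A value close to 0 indicates a weak linear association; values close to 1 or -1 indicate a strong one.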

# Density-scaled histogram with the matching normal density curve overlaid for comparison
hist(swiss$Agriculture,freq=FALSE,main="Histogram for Agriculture",col="BLUE")
# Draw the theoretical curve directly instead of simulating draws, so swiss is not modified
curve(dnorm(x,mean(swiss$Agriculture),sd(swiss$Agriculture)),add=TRUE,col="RED",lwd=2)

Judging by the histogram, the variable Agriculture is approximately normally distributed.
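A histogram is only a visual check. As a possible complement, a normal Q-Q plot and a Shapiro-Wilk test can be run on the same variable. This is a sketch rather than part of the original analysis; `sw_agri` is a name introduced here for illustration.

```r
data(swiss)
# Q-Q plot: points lying close to the reference line suggest approximate normality
qqnorm(swiss$Agriculture, main = "Normal Q-Q Plot for Agriculture")
qqline(swiss$Agriculture, col = "red", lwd = 2)
# Shapiro-Wilk test: a small p-value is evidence against normality
sw_agri <- shapiro.test(swiss$Agriculture)
sw_agri
```

With only 47 observations the test has limited power, so it is best read alongside the plots rather than on its own.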

# Load a fresh copy of swiss, then regress Fertility on Agriculture
data(swiss)
model1 <- lm(Fertility ~ Agriculture, data = swiss)
summary(model1)
## 
## Call:
## lm(formula = Fertility ~ Agriculture, data = swiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.5374  -7.8685  -0.6362   9.0464  24.4858 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 60.30438    4.25126  14.185   <2e-16 ***
## Agriculture  0.19420    0.07671   2.532   0.0149 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.82 on 45 degrees of freedom
## Multiple R-squared:  0.1247, Adjusted R-squared:  0.1052 
## F-statistic: 6.409 on 1 and 45 DF,  p-value: 0.01492

Here we have a linear model with an intercept of 60.3044, which is the predicted Fertility when Agriculture is 0, and a slope of 0.1942. For every one-unit increase in Agriculture, predicted Fertility increases by 0.1942. The slope is statistically significant at the 5% level (p = 0.0149), but not at the 1% level.
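The roles of the intercept and slope can be seen by asking the fitted model for predictions. The sketch below uses the illustrative values 0, 1, and 50 for Agriculture (chosen here, not taken from the analysis); at Agriculture = 50 the fitted line gives roughly 60.30 + 0.1942 × 50 ≈ 70.0.

```r
data(swiss)
model1 <- lm(Fertility ~ Agriculture, data = swiss)
# Predicted Fertility at Agriculture = 0, 1 and 50 (illustrative values)
new_agri <- data.frame(Agriculture = c(0, 1, 50))
pred <- predict(model1, newdata = new_agri)
pred
# The prediction at 0 is the intercept; moving from 0 to 1 adds exactly the slope
unname(pred[2] - pred[1])
coef(model1)["Agriculture"]
```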

The R-squared value means that only about 12% of the variation in Fertility is explained by the linear model (R-squared = 0.1247, adjusted R-squared = 0.1052).

On its own, Agriculture is not a good predictor of Fertility.
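In linear regression the normality assumption is usually stated for the residuals rather than for the predictor itself, so it can be useful to look at them as well. The following is a minimal sketch of such a check, not part of the original analysis; `res1` is a name introduced here.

```r
data(swiss)
model1 <- lm(Fertility ~ Agriculture, data = swiss)
res1 <- residuals(model1)
# Residuals vs fitted values: a patternless band around zero supports linearity
plot(fitted(model1), res1, xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted for model1")
abline(h = 0, lty = 2)
# Histogram of the residuals as a visual normality check
hist(res1, freq = FALSE, main = "Histogram of Residuals for model1", col = "BLUE")
```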

  2. EDUCATION
# Scatterplot of Fertility against Education, using the formula interface of plot()
plot(Fertility~Education,data=swiss,type="p",main="Scatterplot for Fertility with Variable Education")

The scatterplot shows a negative, somewhat linear relationship between Fertility and Education, with a cluster of points at Education values of around 9-10. If we remove the outlier at an Education value of about 50, the trend may become more visible.
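The effect of that outlier can be explored by dropping it and comparing the result with the full data. This sketch assumes the outlier meant above is the single observation with the largest Education value; `swiss_trim` is a name introduced here for illustration.

```r
data(swiss)
# Drop the observation with the largest Education value (assumed to be the outlier)
swiss_trim <- swiss[-which.max(swiss$Education), ]
plot(Fertility ~ Education, data = swiss_trim,
     main = "Fertility vs Education, Largest Education Value Removed")
# Compare the correlation with and without that observation
cor(swiss$Fertility, swiss$Education)
cor(swiss_trim$Fertility, swiss_trim$Education)
```

Whether dropping a point is justified is a separate question; the comparison only shows how much that one observation influences the picture.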

# Density-scaled histogram with the matching normal density curve overlaid for comparison
hist(swiss$Education,freq=FALSE,main="Histogram for Education",col="YELLOW")
# Draw the theoretical curve directly; the original overwrote swiss$Education with random draws
curve(dnorm(x,mean(swiss$Education),sd(swiss$Education)),add=TRUE,col="RED",lwd=2)

The histogram is strongly skewed right, with a long tail toward high Education values, so the mean is larger than the median. The distribution is only roughly normal.
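The relationship between the mean and the median mentioned above can be checked directly from the data. This is a small sketch; `edu_mean` and `edu_median` are names introduced here.

```r
data(swiss)
edu_mean <- mean(swiss$Education)
edu_median <- median(swiss$Education)
c(mean = edu_mean, median = edu_median)
# TRUE when the mean exceeds the median
edu_mean > edu_median
```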

# Load a fresh copy of swiss, then regress Fertility on Education
data(swiss)
model2 <- lm(Fertility ~ Education, data = swiss)
summary(model2)
## 
## Call:
## lm(formula = Fertility ~ Education, data = swiss)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -17.036  -6.711  -1.011   9.526  19.689 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  79.6101     2.1041  37.836  < 2e-16 ***
## Education    -0.8624     0.1448  -5.954 3.66e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.446 on 45 degrees of freedom
## Multiple R-squared:  0.4406, Adjusted R-squared:  0.4282 
## F-statistic: 35.45 on 1 and 45 DF,  p-value: 3.659e-07

Here we have a linear model with an intercept of 79.6, which is the predicted Fertility when Education is 0, and a slope of -0.86. For every one-unit increase in Education, predicted Fertility decreases by 0.8624. The p-value (3.66e-07) indicates a high level of statistical significance.
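A confidence interval conveys the same information as the p-value while also showing the plausible range of the slope. The sketch below uses `confint()` from base R; the 95% level and the name `ci2` are choices made here for illustration.

```r
data(swiss)
model2 <- lm(Fertility ~ Education, data = swiss)
# 95% confidence intervals for the intercept and the slope
ci2 <- confint(model2, level = 0.95)
ci2
# An interval for Education that excludes 0 agrees with the small p-value
ci2["Education", ]
```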

The R-squared value means that about 44% of the variation in Fertility is explained by the linear model (R-squared = 0.4406, adjusted R-squared = 0.4282). While better than Agriculture, Education is still not a strong predictor of Fertility.
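The two models can be compared side by side by extracting R-squared from each summary. This is a minimal sketch; the vector name `r2` is introduced here, and the values should match those reported in the two summaries.

```r
data(swiss)
model1 <- lm(Fertility ~ Agriculture, data = swiss)
model2 <- lm(Fertility ~ Education, data = swiss)
# Share of the variation in Fertility explained by each single-predictor model
r2 <- c(Agriculture = summary(model1)$r.squared,
        Education   = summary(model2)$r.squared)
round(r2, 4)
```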