The main objective of this document is to introduce regression techniques using the R programming language. We will use the Boston housing dataset in this tutorial. A copy of the dataset can be found in the ‘dataset’ folder. For more information about this dataset, please go to: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
First, we have to load the required libraries and read the dataset.
library(ggplot2)
library(dplyr)
houses.df = read.csv('dataset/housing.csv')
Now, let’s take a look at the first 6 rows of the dataset:
head(houses.df)
## CRIM ZN INDUS CH NOX RM AGE DIS R TAX PRAT B LSTAT MEDV
## 1 0.12744 0 6.91 0 0.448 6.770 2.9 5.7209 3 233 17.9 385.41 4.84 26.6
## 2 0.07896 0 12.83 0 0.437 6.273 6.0 4.2515 5 398 18.7 394.92 6.78 24.1
## 3 0.19539 0 10.81 0 0.413 6.245 6.2 5.2873 4 305 19.2 377.17 7.54 23.4
## 4 0.15936 0 6.91 0 0.448 6.211 6.5 5.7209 3 233 17.9 394.46 7.44 24.7
## 5 0.14150 0 6.91 0 0.448 6.169 6.6 5.7209 3 233 17.9 383.37 5.81 25.3
## 6 0.08826 0 10.81 0 0.413 6.417 6.6 5.2873 4 305 19.2 383.73 6.72 24.2
And check the summary of the whole dataset:
summary(houses.df)
## CRIM ZN INDUS CH
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## NOX RM AGE DIS
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## R TAX PRAT B
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## LSTAT MEDV
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
We can see that all variables are numeric, which is excellent for performing the analysis. However, bear in mind that in real life we usually don’t encounter clean datasets like this.
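If we wanted to confirm that ourselves, a quick check for missing values is a good habit. The snippet below is a minimal sketch; for this dataset, every count should be zero.
# Count missing values per column
colSums(is.na(houses.df))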
Correlation is a statistical measure that suggests the level of dependence between two variables. Its value is always between \(-1\) and \(1\): values close to \(-1\) or \(1\) indicate a strong (negative or positive) relationship, while values close to \(0\) indicate a weak one.
In R, we can use the cor(x, y) function to get the correlation between two variables. For example, let’s examine whether AGE is somehow correlated with MEDV.
cor(houses.df$AGE, houses.df$MEDV)
## [1] -0.3769546
There is a negative correlation between AGE and MEDV. However, it is not a strong correlation, since the value is closer to \(0\) than \(-1\). Maybe we should not use the AGE variable in our predictive model. But we will discuss this later.
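Before settling on a single predictor, it can be useful to look at how every variable correlates with MEDV at once. The snippet below is a small sketch using base R only; since all columns are numeric, cor() can be applied to the whole data frame:
# Correlation of each variable with MEDV, ordered by absolute strength
medv.cor = cor(houses.df)[, 'MEDV']
sort(abs(medv.cor), decreasing = TRUE)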
A scatterplot can help to visualize the relationship between the dependent variable (MEDV) and the independent variable (AGE):
ggplot(houses.df, aes(AGE, MEDV)) + geom_point()
Regression is a ‘generic’ term describing a set of techniques capable of predicting a response (dependent) variable from other variables, called predictors or independent variables. This predictive capability of a regression model is related to the correlations between the dependent and independent variables.
More specifically, we use a linear regression technique to model a continuous variable Y (response) as a mathematical function of one or more X variables (predictors). Then, we can use this regression model to predict Y when only X is known. This mathematical function can be generalized as follows:
\[ \hat{Y} = \beta_1 + \beta_2*X \] where, \(\beta_1\) is the intercept and \(\beta_2\) is the slope. They are known as regression coefficients.
The Boston housing dataset is a classic dataset used, in most cases, to validate regression models. Usually, we try to predict MEDV values based on some of the other variables.
First, we will build a model to predict MEDV based only on AGE. In this case, we are using a simple linear regression. For that, we will use the lm() function. The lm() function takes two main arguments: (i) the formula, and (ii) the data. The data is typically a data.frame and the formula is an object of class formula.
linear.model = lm(MEDV ~ AGE, data = houses.df)
print(linear.model)
##
## Call:
## lm(formula = MEDV ~ AGE, data = houses.df)
##
## Coefficients:
## (Intercept) AGE
## 30.9787 -0.1232
We have just built the linear model and stored it in the linear.model variable. If we print the content of that variable, we will see the values of the ‘coefficients’: intercept (\(\beta_1\)) and AGE (\(\beta_2\)). We can now derive the relationship between the predictor and the response in the form of a mathematical formula for MEDV as a function of AGE:
\[ \hat{MEDV} = 30.979 -0.123 * AGE \]
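As a quick sanity check, we can compare this hand-written formula with the predict() function, which applies the fitted model for us. The AGE values below are arbitrary, made-up examples:
# Predict MEDV for a few hypothetical AGE values
new.ages = data.frame(AGE = c(10, 50, 90))
predict(linear.model, newdata = new.ages)
# The same result (up to rounding), computed manually from the coefficients above
30.979 - 0.123 * c(10, 50, 90)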
ggplot(houses.df, aes(AGE, MEDV)) +
geom_point() +
geom_smooth(method = lm)
The regression line shows what we had imagined: there is a weak negative correlation between AGE and MEDV. MEDV decreases slowly while AGE increases.
Before putting our model into production, it is necessary to perform a diagnosis to evaluate how good the model is at predicting new values. For that, the summary() function will help:
summary(linear.model)
##
## Call:
## lm(formula = MEDV ~ AGE, data = houses.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.097 -5.138 -1.958 2.397 31.338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.97868 0.99911 31.006 <2e-16 ***
## AGE -0.12316 0.01348 -9.137 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.527 on 504 degrees of freedom
## Multiple R-squared: 0.1421, Adjusted R-squared: 0.1404
## F-statistic: 83.48 on 1 and 504 DF, p-value: < 2.2e-16
There are a lot of numbers here. How do we interpret them? I will not detail everything in this tutorial, but there are a few things we cannot overlook.
The summary result shows two kinds of p-values: one for the model as a whole, and one for each predictor variable (the Pr(>|t|) column under ‘Coefficients’). These values tell us how significant the linear model is. By the book, a model or predictor is statistically significant when its p-value is less than a pre-determined threshold (usually \(0.05\)). Note the stars at the end of each row in the Coefficients table. They are a visual indication of significance: the more stars beside a variable’s p-value, the more significant the variable.
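If we ever need these p-values programmatically (for example, to filter predictors later on), they can be extracted directly from the summary object. A minimal sketch:
# Coefficient table as a matrix; the last column holds the p-values
coef.table = summary(linear.model)$coefficients
coef.table[, 'Pr(>|t|)']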
The multiple R-squared (\(0.1421\)) indicates that the model accounts for \(14.21\%\) of the variance in the data.
The residual standard error (\(8.527\)) represents the typical size of the error, in units of MEDV, when predicting MEDV from AGE with our model.
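To make these two numbers less abstract, we can recompute both by hand from the residuals. This is only a sketch to illustrate where they come from; the results should match the summary output up to rounding:
res = residuals(linear.model)
# Residual standard error: square root of the residual sum of squares over the residual degrees of freedom
sqrt(sum(res^2) / df.residual(linear.model))
# Multiple R-squared: proportion of the variance of MEDV explained by the model
1 - sum(res^2) / sum((houses.df$MEDV - mean(houses.df$MEDV))^2)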
A model that explains only \(14.21\%\) of the variance in the data used to train it is not a good choice. Therefore, we need to find something better. We could try other predictor variables one at a time, but usually the most common approach is to use several variables at the same time. In that case, we have a multiple linear regression.
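Conceptually, the multiple linear model simply extends the earlier formula with one coefficient per predictor:
\[ \hat{Y} = \beta_1 + \beta_2*X_2 + \beta_3*X_3 + \dots + \beta_k*X_k \]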
In R, we also use the lm() function to create multiple linear models. In the next example, let’s try adding the RM variable, which indicates the number of rooms in the house. But first, let’s check its correlation with MEDV:
cor(houses.df$RM, houses.df$MEDV)
## [1] 0.6953599
ggplot(houses.df, aes(RM, MEDV)) +
geom_point() +
geom_smooth(method = lm)
In fact, RM seems like a good choice. Let’s now build our multiple linear model:
mlinear.model = lm(MEDV ~ AGE + RM, data = houses.df)
summary(mlinear.model)
##
## Call:
## lm(formula = MEDV ~ AGE + RM, data = houses.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.555 -2.882 -0.274 2.293 40.799
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.27740 2.85676 -8.848 < 2e-16 ***
## AGE -0.07278 0.01029 -7.075 5.02e-12 ***
## RM 8.40158 0.41208 20.388 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.316 on 503 degrees of freedom
## Multiple R-squared: 0.5303, Adjusted R-squared: 0.5284
## F-statistic: 283.9 on 2 and 503 DF, p-value: < 2.2e-16
There are several interesting things now. First, the residual standard error decreased. Also, our R-squared is now \(0.53\), which means that our new model explains \(53\%\) of the variance in the data. We can also infer that both AGE and RM are statistically significant.
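A quick way to put the two models side by side is to pull the adjusted R-squared directly from each summary object; a small sketch:
# Adjusted R-squared of the simple and the two-variable models
summary(linear.model)$adj.r.squared
summary(mlinear.model)$adj.r.squared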
It seems that adding a new variable was a good move. However, this does not mean that we should use all the variables. Let’s build a new model using all variables and see what we get:
mlinear02.model = lm(MEDV ~ ., data = houses.df)
summary(mlinear02.model)
##
## Call:
## lm(formula = MEDV ~ ., data = houses.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## CRIM -1.080e-01 3.286e-02 -3.287 0.001087 **
## ZN 4.642e-02 1.373e-02 3.382 0.000778 ***
## INDUS 2.056e-02 6.150e-02 0.334 0.738288
## CH 2.687e+00 8.616e-01 3.118 0.001925 **
## NOX -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## RM 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## AGE 6.922e-04 1.321e-02 0.052 0.958229
## DIS -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## R 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## TAX -1.233e-02 3.760e-03 -3.280 0.001112 **
## PRAT -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## B 9.312e-03 2.686e-03 3.467 0.000573 ***
## LSTAT -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
Notice that our metrics seem better now. However, the p-values of the variables show that not all of them are statistically significant. In fact, AGE is not significant anymore, since its p-value is higher than \(0.05\). In other words, if we remove the non-significant variables, we could still have a good model that is much simpler (since it has fewer variables). Let’s check if our intuition is correct:
mlinear03.model = lm(MEDV ~ CRIM + ZN + CH + NOX + RM + DIS + R + TAX + PRAT + B + LSTAT, data = houses.df)
summary(mlinear03.model)
##
## Call:
## lm(formula = MEDV ~ CRIM + ZN + CH + NOX + RM + DIS + R + TAX +
## PRAT + B + LSTAT, data = houses.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## CRIM -0.108413 0.032779 -3.307 0.001010 **
## ZN 0.045845 0.013523 3.390 0.000754 ***
## CH 2.718716 0.854240 3.183 0.001551 **
## NOX -17.376023 3.535243 -4.915 1.21e-06 ***
## RM 3.801579 0.406316 9.356 < 2e-16 ***
## DIS -1.492711 0.185731 -8.037 6.84e-15 ***
## R 0.299608 0.063402 4.726 3.00e-06 ***
## TAX -0.011778 0.003372 -3.493 0.000521 ***
## PRAT -0.946525 0.129066 -7.334 9.24e-13 ***
## B 0.009291 0.002674 3.475 0.000557 ***
## LSTAT -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
Our new model seems a little better than the previous one: the multiple R-squared is essentially the same, while the adjusted R-squared improved slightly and we are using two fewer predictors.
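If we want a more formal comparison than eyeballing the two summaries, a partial F-test with anova() can tell us whether dropping INDUS and AGE costs us any explanatory power; a brief sketch:
# Compare the reduced model against the full model with a partial F-test
anova(mlinear03.model, mlinear02.model)
A large p-value here would indicate that the dropped variables do not add meaningful explanatory power to the model.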
In the end, we have a multiple linear model that explains \(74.06\%\) of the variance. We could try other strategies to improve performance, but that is beyond the scope of this tutorial.
See you later.