The main objective of this document is to introduce regression techniques using the R programming language. We will use the Boston housing dataset in this tutorial. A copy of the dataset can be found in the ‘dataset’ folder. For more information about this dataset, please go to: https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
First, we have to load the required libraries and read the dataset.
library(ggplot2)
library(dplyr)
houses.df = read.csv('dataset/housing.csv')
Now, let’s take a look at the first 6 rows of the dataset:
head(houses.df)
## CRIM ZN INDUS CH NOX RM AGE DIS R TAX PRAT B LSTAT MEDV
## 1 0.12744 0 6.91 0 0.448 6.770 2.9 5.7209 3 233 17.9 385.41 4.84 26.6
## 2 0.07896 0 12.83 0 0.437 6.273 6.0 4.2515 5 398 18.7 394.92 6.78 24.1
## 3 0.19539 0 10.81 0 0.413 6.245 6.2 5.2873 4 305 19.2 377.17 7.54 23.4
## 4 0.15936 0 6.91 0 0.448 6.211 6.5 5.7209 3 233 17.9 394.46 7.44 24.7
## 5 0.14150 0 6.91 0 0.448 6.169 6.6 5.7209 3 233 17.9 383.37 5.81 25.3
## 6 0.08826 0 10.81 0 0.413 6.417 6.6 5.2873 4 305 19.2 383.73 6.72 24.2
And check the summary of the whole dataset:
summary(houses.df)
## CRIM ZN INDUS CH
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## NOX RM AGE DIS
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## R TAX PRAT B
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## LSTAT MEDV
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
We can see that all variables are numeric, which is excellent for performing the analysis. However, bear in mind that in real life we usually don’t encounter clean datasets like this.
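If we wanted to confirm that ourselves, a quick check for missing values is a good habit. The snippet below is a minimal sketch; for this dataset, every count should be zero.
# Count missing values per column
colSums(is.na(houses.df))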
Correlation is a statistical measure that suggests the level of dependence between two variables. Its value is always between \(-1\) and \(1\): values close to \(-1\) or \(1\) indicate a strong (negative or positive) relationship, while values close to \(0\) indicate a weak one.
In R, we can use the cor(x, y) function to get the correlation between two variables. For example, let’s examine whether AGE is somehow correlated with MEDV.
cor(houses.df$AGE, houses.df$MEDV)
## [1] -0.3769546
There is a negative correlation between AGE and MEDV. However, it is not a strong correlation, since the value is closer to \(0\) than \(-1\). Maybe we should not use the AGE variable in our predictive model. But we will discuss this later.
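Before settling on a single predictor, it can be useful to look at how every variable correlates with MEDV at once. The snippet below is a small sketch using base R only; since all columns are numeric, cor() can be applied to the whole data frame:
# Correlation of each variable with MEDV, ordered by absolute strength
medv.cor = cor(houses.df)[, 'MEDV']
sort(abs(medv.cor), decreasing = TRUE)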
A scatterplot can help to visualize the relationship between the dependent variable (MEDV) and the independent variable (AGE):
ggplot(houses.df, aes(AGE, MEDV)) + geom_point()
Regression is a ‘generic’ term describing a set of techniques capable of predicting a response (dependent) variable from other variables, called predictors or independent variables. This predictive capability of a regression model is related to the correlations between the dependent and independent variables.
More specifically, we use a linear regression technique to model a continuous variable Y (response) as a mathematical function of one or more X variables (predictors). Then, we can use this regression model to predict Y when only X is known. This mathematical function can be generalized as follows:
\[ \hat{Y} = \beta_1 + \beta_2*X \] where, \(\beta_1\) is the intercept and \(\beta_2\) is the slope. They are known as regression coefficients.
The Boston housing dataset is a classic dataset used, in most cases, to validate regression models. Usually, we try to predict MEDV values based on some of the other variables.
First, we will build a model to predict MEDV based only on AGE. In this case, we are using a simple linear regression. For that, we will use the lm() function. The lm() function takes two main arguments: (i) the formula, and (ii) the data. The data is typically a data.frame and the formula is an object of class formula.
linear.model = lm(MEDV ~ AGE, data = houses.df)
print(linear.model)
##
## Call:
## lm(formula = MEDV ~ AGE, data = houses.df)
##
## Coefficients:
## (Intercept) AGE
## 30.9787 -0.1232
We have just built the linear model and stored it in the linear.model variable. If we print the content of that variable, we will see the values of the ‘coefficients’: intercept (\(\beta_1\)) and AGE (\(\beta_2\)). We can now derive the relationship between the predictor and the response in the form of a mathematical formula for MEDV as a function of AGE:
\[ \hat{MEDV} = 30.979 -0.123 * AGE \]
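As a quick sanity check, we can compare this hand-written formula with the predict() function, which applies the fitted model for us. The AGE values below are arbitrary, made-up examples:
# Predict MEDV for a few hypothetical AGE values
new.ages = data.frame(AGE = c(10, 50, 90))
predict(linear.model, newdata = new.ages)
# The same result (up to rounding), computed manually from the coefficients above
30.979 - 0.123 * c(10, 50, 90)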
ggplot(houses.df, aes(AGE, MEDV)) +
geom_point() +
geom_smooth(method = lm)
The regression line shows what we had imagined: there is a weak negative correlation between AGE and MEDV. MEDV decreases slowly while AGE increases.
Before putting our model into production, it is necessary to perform a diagnosis to evaluate how good the model is at predicting new values. For that, the summary() function will help:
summary(linear.model)
##
## Call:
## lm(formula = MEDV ~ AGE, data = houses.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.097 -5.138 -1.958 2.397 31.338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.97868 0.99911 31.006 <2e-16 ***
## AGE -0.12316 0.01348 -9.137 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.527 on 504 degrees of freedom
## Multiple R-squared: 0.1421, Adjusted R-squared: 0.1404
## F-statistic: 83.48 on 1 and 504 DF, p-value: < 2.2e-16
There are a lot of numbers here. How do we interpret them? I will not detail everything in this tutorial, but there are a few things we cannot overlook.
The summary result shows two kinds of p-values: one for the model as a whole, and one for each predictor variable (the Pr(>|t|) column under ‘Coefficients’). These values tell us how significant the linear model is. By the book, a model or predictor is statistically significant when its p-value is less than a pre-determined threshold (usually \(0.05\)). Note the stars at the end of each row in the Coefficients table. They are a visual indication of significance: the more stars beside a variable’s p-value, the more significant the variable.
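If we ever need these p-values programmatically (for example, to filter predictors later on), they can be extracted directly from the summary object. A minimal sketch:
# Coefficient table as a matrix; the last column holds the p-values
coef.table = summary(linear.model)$coefficients
coef.table[, 'Pr(>|t|)']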
The multiple R-squared (\(0.1421\)) indicates that the model accounts for \(14.21\%\) of the variance in the data.
The residual standard error (\(8.527\)) represents the typical size of the error, in units of MEDV, when predicting MEDV from AGE with our model.
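To make these two numbers less abstract, we can recompute both by hand from the residuals. This is only a sketch to illustrate where they come from; the results should match the summary output up to rounding:
res = residuals(linear.model)
# Residual standard error: square root of the residual sum of squares over the residual degrees of freedom
sqrt(sum(res^2) / df.residual(linear.model))
# Multiple R-squared: proportion of the variance of MEDV explained by the model
1 - sum(res^2) / sum((houses.df$MEDV - mean(houses.df$MEDV))^2)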
A model that explains only \(14.21\%\) of the variance in the data used to train it is not a good choice. Therefore, we need to find something better. We could try other predictor variables one at a time, but usually the most common approach is to use several variables at the same time. In that case, we have a multiple linear regression.
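Conceptually, the multiple linear model simply extends the earlier formula with one coefficient per predictor:
\[ \hat{Y} = \beta_1 + \beta_2*X_2 + \beta_3*X_3 + \dots + \beta_k*X_k \]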
In R, we also use the lm() function to create multiple linear models. In the next example, let’s try adding the RM variable, which indicates the number of rooms in the house. But first, let’s check its correlation with MEDV:
cor(houses.df$RM, houses.df$MEDV)
## [1] 0.6953599
ggplot(houses.df, aes(RM, MEDV)) +
geom_point() +
geom_smooth(method = lm)
In fact, RM seems like a good choice. Let’s now build our multiple linear model:
mlinear.model = lm(MEDV ~ AGE + RM, data = houses.df)
summary(mlinear.model)
##
## Call:
## lm(formula = MEDV ~ AGE + RM, data = houses.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.555 -2.882 -0.274 2.293 40.799
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -25.27740 2.85676 -8.848 < 2e-16 ***
## AGE -0.07278 0.01029 -7.075 5.02e-12 ***
## RM 8.40158 0.41208 20.388 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.316 on 503 degrees of freedom
## Multiple R-squared: 0.5303, Adjusted R-squared: 0.5284
## F-statistic: 283.9 on 2 and 503 DF, p-value: < 2.2e-16
There are several interesting things now. First, the residual standard error decreased. Also, our R-squared is now \(0.53\), which means that our new model explains \(53\%\) of the variance in the data. We can also infer that both AGE and RM are statistically significant.
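A quick way to put the two models side by side is to pull the adjusted R-squared directly from each summary object; a small sketch:
# Adjusted R-squared of the simple and the two-variable models
summary(linear.model)$adj.r.squared
summary(mlinear.model)$adj.r.squared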
It seems that adding a new variable was a good move. However, this does not mean that we should use all the variables. Let’s build a new model using all variables and see what we get:
mlinear02.model = lm(MEDV ~ ., data = houses.df)
summary(mlinear02.model)
##
## Call:
## lm(formula = MEDV ~ ., data = houses.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## CRIM -1.080e-01 3.286e-02 -3.287 0.001087 **
## ZN 4.642e-02 1.373e-02 3.382 0.000778 ***
## INDUS 2.056e-02 6.150e-02 0.334 0.738288
## CH 2.687e+00 8.616e-01 3.118 0.001925 **
## NOX -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## RM 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## AGE 6.922e-04 1.321e-02 0.052 0.958229
## DIS -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## R 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## TAX -1.233e-02 3.760e-03 -3.280 0.001112 **
## PRAT -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## B 9.312e-03 2.686e-03 3.467 0.000573 ***
## LSTAT -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
Notice that our metrics seem better now. However, the p-values of the variables show that not all of them are statistically significant. In fact, AGE is not significant anymore, since its p-value is higher than \(0.05\). In other words, if we remove the non-significant variables, we could still have a good model that is much simpler (since it has fewer variables). Let’s check if our intuition is correct:
mlinear03.model = lm(MEDV ~ CRIM + ZN + CH + NOX + RM + DIS + R + TAX + PRAT + B + LSTAT, data = houses.df)
summary(mlinear03.model)
##
## Call:
## lm(formula = MEDV ~ CRIM + ZN + CH + NOX + RM + DIS + R + TAX +
## PRAT + B + LSTAT, data = houses.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5984 -2.7386 -0.5046 1.7273 26.2373
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.341145 5.067492 7.171 2.73e-12 ***
## CRIM -0.108413 0.032779 -3.307 0.001010 **
## ZN 0.045845 0.013523 3.390 0.000754 ***
## CH 2.718716 0.854240 3.183 0.001551 **
## NOX -17.376023 3.535243 -4.915 1.21e-06 ***
## RM 3.801579 0.406316 9.356 < 2e-16 ***
## DIS -1.492711 0.185731 -8.037 6.84e-15 ***
## R 0.299608 0.063402 4.726 3.00e-06 ***
## TAX -0.011778 0.003372 -3.493 0.000521 ***
## PRAT -0.946525 0.129066 -7.334 9.24e-13 ***
## B 0.009291 0.002674 3.475 0.000557 ***
## LSTAT -0.522553 0.047424 -11.019 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.736 on 494 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7348
## F-statistic: 128.2 on 11 and 494 DF, p-value: < 2.2e-16
Our new model seems a little better than the previous one: the multiple R-squared is essentially the same, while the adjusted R-squared improved slightly and we are using two fewer predictors.
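If we want a more formal comparison than eyeballing the two summaries, a partial F-test with anova() can tell us whether dropping INDUS and AGE costs us any explanatory power; a brief sketch:
# Compare the reduced model against the full model with a partial F-test
anova(mlinear03.model, mlinear02.model)
A large p-value here would indicate that the dropped variables do not add meaningful explanatory power to the model.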
In the end, we have a multiple linear model that explains \(74.06\%\) of the variance. We could try other strategies to improve performance, but that is beyond the scope of this tutorial.
See you later.