Introduction

In statistics, linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).

Linear regression has many practical uses. Most applications fall into one of two broad categories: prediction (forecasting the response for new values of the explanatory variables) and explanation (quantifying how much of the variation in the response can be attributed to the explanatory variables).

Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the “lack of fit” in some other norm (as with least absolute deviations regression), or by minimizing a penalized version of the least squares cost function as in ridge regression (L2-norm penalty) and lasso (L1-norm penalty).

Source: Linear regression
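
As a quick illustration of the fitting approaches mentioned above, the sketch below fits the same toy data by ordinary least squares with lm() and by penalized least squares with the glmnet package. This is only a sketch: the variables x and y are made up for the example and are not part of this tutorial's datasets, and it assumes the glmnet package is installed.

# A minimal sketch of least squares vs. penalized least squares (toy data only)
library(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 3), ncol = 3)              # three explanatory variables
y <- drop(2 + x %*% c(1.5, -0.5, 0) + rnorm(100))  # scalar response

ols   <- lm(y ~ x)                                 # ordinary least squares
ridge <- glmnet(x, y, alpha = 0, lambda = 0.1)     # L2-norm penalty (ridge)
lasso <- glmnet(x, y, alpha = 1, lambda = 0.1)     # L1-norm penalty (lasso)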

Simple Linear Regression

In simple linear regression, each observation consists of two values: one for the dependent variable and one for the independent variable. Here, we will fit a linear regression model to Salary_Data.csv, where years of experience is the independent variable and salary is the dependent variable; that is, we model Salary = b0 + b1 × YearsExperience plus an error term. The dataset can be downloaded from the following link:

Dataset: Salary_Data.csv

Importing and splitting the dataset into training and test data

# Importing the dataset
dataset <- read.csv('Salary_Data.csv')

library(caTools)

# Splitting the dataset into the Training set and Test set
set.seed(100)
split <- sample.split(dataset$Salary, SplitRatio = 2/3)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)
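
A quick sanity check on the split: with a 2/3 ratio, the 30 observations in Salary_Data.csv should end up as roughly 20 training rows and 10 test rows, which is what the tables below show.

# Verify how many observations landed in each set
nrow(training_set)
nrow(test_set)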

library(hwriter)

cat(hwrite(training_set, border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
YearsExperience Salary
1.1 39343
1.3 46205
1.5 37731
2 43525
2.2 39891
2.9 56642
3.2 54445
3.2 64445
3.7 57189
4 56957
4.1 57081
5.1 66029
5.3 83088
5.9 81363
6.8 91738
7.9 101302
8.7 109431
9 105582
10.3 122391
10.5 121872
cat(hwrite(test_set, border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
YearsExperience Salary
3 60150
3.9 63218
4 55794
4.5 61111
4.9 67938
6 93940
7.1 98273
8.2 113812
9.5 116969
9.6 112635

Fitting Simple Linear Regression to the Training set

regressor <- lm(formula = Salary ~ YearsExperience,
               data = training_set)
summary(regressor)
## 
## Call:
## lm(formula = Salary ~ YearsExperience, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7299.9 -3352.4  -201.1  2843.6  8577.5 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      26458.7     2219.3   11.92 5.61e-10 ***
## YearsExperience   9190.2      386.3   23.79 4.73e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5080 on 18 degrees of freedom
## Multiple R-squared:  0.9692, Adjusted R-squared:  0.9675 
## F-statistic: 565.9 on 1 and 18 DF,  p-value: 4.733e-15
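
The coefficient estimates define the fitted line Salary ≈ 26458.7 + 9190.2 × YearsExperience, so a prediction can also be read straight off the coefficients; for example, for five years of experience:

# Manual prediction from the fitted coefficients (5 years of experience)
coef(regressor)
26458.7 + 9190.2 * 5  # about 72410, the same value predict() would return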

Predicting the Test set results

y_pred <- predict(regressor, newdata = test_set)
cat(hwrite(y_pred, border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
Row Predicted.Salary
7 54029.41
11 62300.61
12 63219.64
15 67814.75
16 71490.84
20 81600.09
22 91709.34
24 101818.59
27 113765.89
28 114684.91
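
To put a single number on how well the model generalises, we can compare these predictions with the actual test-set salaries, for example via the root-mean-square error:

# Root-mean-square error of the test-set predictions
sqrt(mean((test_set$Salary - y_pred)^2))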

Visualising the Training set results

library(ggplot2)
gtrain <- ggplot() +
  geom_point(aes(x = training_set$YearsExperience, y = training_set$Salary),
             colour = 'red') +
  geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
            colour = 'blue') +
  ggtitle('Salary vs Experience (Training set)') +
  xlab('Years of experience') +
  ylab('Salary')
print(gtrain)

Visualising the Test set results

library(ggplot2)
gtest <- ggplot() +
  geom_point(aes(x = test_set$YearsExperience, y = test_set$Salary),
             colour = 'red') +
  geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
            colour = 'blue') +
  ggtitle('Salary vs Experience (Test set)') +
  xlab('Years of experience') +
  ylab('Salary')
print(gtest)

Multiple Linear Regression

A linear regression model that contains more than one predictor variable is called a multiple linear regression model. Here, we will fit a linear regression model to 50_Startups.csv, where R.D.Spend, Administration, Marketing.Spend, and State are the independent variables and Profit is the dependent variable. The dataset can be downloaded from the following link:

Dataset: 50_Startups.csv

Importing and preprocessing the dataset

# Importing the dataset
dataset <- read.csv('50_Startups.csv')

library(hwriter)

cat(hwrite(head(dataset), border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
R.D.Spend Administration Marketing.Spend State Profit
165349.2 136897.80 471784.1 New York 192261.8
162597.7 151377.59 443898.5 California 191792.1
153441.5 101145.55 407934.5 Florida 191050.4
144372.4 118671.85 383199.6 New York 182902.0
142107.3 91391.77 366168.4 Florida 166187.9
131876.9 99814.71 362861.4 New York 156991.1

Since our State variable is categorical, we need to encode its string values ('California', 'Florida', 'New York') as factor levels.

# Encoding categorical data
dataset$State <- factor(dataset$State,
                       levels = c('California', 'Florida', 'New York'),
                       labels = c(1, 2, 3))

cat(hwrite(head(dataset), border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
R.D.Spend Administration Marketing.Spend State Profit
165349.2 136897.80 471784.1 3 192261.8
162597.7 151377.59 443898.5 1 191792.1
153441.5 101145.55 407934.5 2 191050.4
144372.4 118671.85 383199.6 3 182902.0
142107.3 91391.77 366168.4 2 166187.9
131876.9 99814.71 362861.4 3 156991.1
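
Behind the scenes, lm() expands a factor such as State into dummy (indicator) variables, which is why the model summary below reports separate State2 and State3 coefficients. One way to inspect that encoding is model.matrix(), shown here purely for illustration:

# Inspect the dummy-variable encoding lm() will use for the State factor
head(model.matrix(~ State, data = dataset))
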
# Splitting the dataset into the Training set and Test set
library(caTools)
set.seed(100)
split <- sample.split(dataset$Profit, SplitRatio = 0.8)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)

Fitting Multiple Linear Regression to the Training set

regressor <- lm(formula = Profit ~ .,
               data = training_set)

summary(regressor)
## 
## Call:
## lm(formula = Profit ~ ., data = training_set)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32380  -4586   -190   4940  20038 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.359e+04  8.111e+03   5.374 5.61e-06 ***
## R.D.Spend        8.146e-01  5.828e-02  13.977 1.18e-15 ***
## Administration   1.858e-02  6.098e-02   0.305    0.762    
## Marketing.Spend  2.873e-02  2.083e-02   1.379    0.177    
## State2          -2.938e+02  3.871e+03  -0.076    0.940    
## State3          -1.878e+03  4.104e+03  -0.458    0.650    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9938 on 34 degrees of freedom
## Multiple R-squared:  0.9496, Adjusted R-squared:  0.9421 
## F-statistic:   128 on 5 and 34 DF,  p-value: < 2.2e-16

NOTE: The summary reports a p-value for each independent variable. These p-values suggest that not all the independent variables are equally significant in our regression model, so we can eliminate the non-significant ones using backward elimination.

Predicting the Test set results

y_pred <- predict(regressor, newdata = test_set)
cat(hwrite(y_pred, border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
Row Predicted.Profit
7 159646.89
12 134474.31
28 112970.57
32 96882.68
34 96580.69
36 86677.78
39 64738.04
42 72315.83
44 57729.60
48 46105.94
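
As with the simple model, it is useful to set the predictions next to the actual profits and compute an error measure on the unseen data:

# Compare predicted and actual profits on the test set
data.frame(Actual = test_set$Profit, Predicted = y_pred)
sqrt(mean((test_set$Profit - y_pred)^2))  # root-mean-square error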

Backward Elimination Model

Forward selection has a notable drawback: each addition of a new variable may render one or more of the already included variables non-significant. An alternative that avoids this is backward elimination. Under this approach, one starts by fitting a model with all the variables of interest. The least significant variable is then dropped, provided it is not significant at the chosen critical level (here, SL = 0.05). We continue by successively re-fitting reduced models and applying the same rule until all remaining variables are statistically significant.

backwardElimination <- function(x, sl) {
  numVars <- length(x)  # number of columns in x (predictors plus Profit)
  for (i in 1:numVars) {
    # Refit the model on the variables that remain in x
    regressor <- lm(formula = Profit ~ ., data = x)
    # Largest p-value among the non-intercept coefficient rows considered
    maxVar <- max(coef(summary(regressor))[2:numVars, "Pr(>|t|)"])
    if (maxVar > sl) {
      # Drop the variable whose coefficient has that largest p-value
      j <- which(coef(summary(regressor))[2:numVars, "Pr(>|t|)"] == maxVar)
      x <- x[, -j]
    }
    numVars <- numVars - 1
  }
  return(summary(regressor))
}

SL <- 0.05
print(backwardElimination(training_set, SL))
## 
## Call:
## lm(formula = Profit ~ ., data = x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32393  -4874    134   5177  18628 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.707e+04  3.085e+03   15.26   <2e-16 ***
## R.D.Spend   8.724e-01  3.390e-02   25.73   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9751 on 38 degrees of freedom
## Multiple R-squared:  0.9457, Adjusted R-squared:  0.9443 
## F-statistic: 662.2 on 1 and 38 DF,  p-value: < 2.2e-16

After backward elimination, only R.D.Spend remains statistically significant, so we refit the model on that variable alone.

regressor <- lm(formula = Profit ~ R.D.Spend,
                data = training_set)
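
For comparison, base R also provides step(), which performs backward elimination by AIC rather than by p-values; it may retain a slightly different set of variables, so the sketch below is offered only as an alternative.

# Alternative: backward elimination by AIC using the built-in step() function
full_model <- lm(Profit ~ ., data = training_set)
reduced_model <- step(full_model, direction = "backward", trace = 0)
summary(reduced_model)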

Visualising the Training set results

library(ggplot2)
gtrain <- ggplot() +
  geom_point(aes(x = training_set$R.D.Spend, y = training_set$Profit),
             colour = 'red') +
  geom_line(aes(x = training_set$R.D.Spend, y = predict(regressor, newdata = training_set)),
            colour = 'blue') +
  ggtitle('Profit vs RnD Spend (Training set)') +
  xlab('RnD Spend') +
  ylab('Profit')
print(gtrain)

Visualising the Test set results

gtest <- ggplot() +
  geom_point(aes(x = test_set$R.D.Spend, y = test_set$Profit),
             colour = 'red') +
  geom_line(aes(x = training_set$R.D.Spend, y = predict(regressor, newdata = training_set)),
            colour = 'blue') +
  ggtitle('Profit vs RnD Spend (Test set)') +
  xlab('RnD Spend') +
  ylab('Profit')
print(gtest)