In statistics, linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).
Linear regression has many practical uses. Most applications fall into one of two broad categories: prediction, where a fitted model is used to forecast the response for new values of the explanatory variables, and explanation, where the model is used to quantify the strength of the relationship between the response and the explanatory variables.
Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the “lack of fit” in some other norm (as with least absolute deviations regression), or by minimizing a penalized version of the least squares cost function as in ridge regression (L2-norm penalty) and lasso (L1-norm penalty).
Source: Linear regression
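As a quick aside (not part of this tutorial's workflow), the penalized variants mentioned above can be fitted in R with the glmnet package; the sketch below assumes glmnet is installed and simply uses the built-in mtcars data as illustrative predictors and response:

library(glmnet)
x <- as.matrix(mtcars[, c("wt", "hp")])  # example predictor matrix
y <- mtcars$mpg                          # example response
ridge_fit <- glmnet(x, y, alpha = 0)     # alpha = 0 gives the L2 (ridge) penalty
lasso_fit <- glmnet(x, y, alpha = 1)     # alpha = 1 gives the L1 (lasso) penalty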
In simple linear regression, each observation consists of two values: one for the dependent variable and one for the independent variable. Here, we will fit a linear regression model to Salary_Data.csv, where years of experience is the independent variable and salary is the dependent variable. The dataset can be downloaded from the following link:
Dataset: Salary_Data.csv
# Importing the dataset
dataset <- read.csv('Salary_Data.csv')
# Splitting the dataset into the Training set and Test set
library(caTools)
set.seed(100)
split <- sample.split(dataset$Salary, SplitRatio = 2/3)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)
library(hwriter)
cat(hwrite(training_set, border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| YearsExperience | Salary |
| 1.1 | 39343 |
| 1.3 | 46205 |
| 1.5 | 37731 |
| 2 | 43525 |
| 2.2 | 39891 |
| 2.9 | 56642 |
| 3.2 | 54445 |
| 3.2 | 64445 |
| 3.7 | 57189 |
| 4 | 56957 |
| 4.1 | 57081 |
| 5.1 | 66029 |
| 5.3 | 83088 |
| 5.9 | 81363 |
| 6.8 | 91738 |
| 7.9 | 101302 |
| 8.7 | 109431 |
| 9 | 105582 |
| 10.3 | 122391 |
| 10.5 | 121872 |
cat(hwrite(test_set, border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| YearsExperience | Salary |
| 3 | 60150 |
| 3.9 | 63218 |
| 4 | 55794 |
| 4.5 | 61111 |
| 4.9 | 67938 |
| 6 | 93940 |
| 7.1 | 98273 |
| 8.2 | 113812 |
| 9.5 | 116969 |
| 9.6 | 112635 |
# Fitting Simple Linear Regression to the Training set
regressor <- lm(formula = Salary ~ YearsExperience,
                data = training_set)
summary(regressor)
##
## Call:
## lm(formula = Salary ~ YearsExperience, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7299.9 -3352.4 -201.1 2843.6 8577.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26458.7 2219.3 11.92 5.61e-10 ***
## YearsExperience 9190.2 386.3 23.79 4.73e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5080 on 18 degrees of freedom
## Multiple R-squared: 0.9692, Adjusted R-squared: 0.9675
## F-statistic: 565.9 on 1 and 18 DF, p-value: 4.733e-15
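From the coefficient estimates above, the fitted line is approximately Salary ≈ 26458.7 + 9190.2 × YearsExperience, i.e. each additional year of experience is associated with an increase of about 9190 in salary. The coefficients can also be pulled out of the model directly; the 5-year figure below is just an illustrative input:

coef(regressor)                               # (Intercept) and YearsExperience estimates
coef(regressor)[1] + coef(regressor)[2] * 5   # fitted salary at 5 years of experience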
# Predicting the Test set results
y_pred <- predict(regressor, newdata = test_set)
cat(hwrite(y_pred, border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| Row | Predicted Salary |
| 7 | 54029.4089643145 |
| 11 | 62300.6140971915 |
| 12 | 63219.6368897334 |
| 15 | 67814.7508524428 |
| 16 | 71490.8420226103 |
| 20 | 81600.0927405711 |
| 22 | 91709.3434585318 |
| 24 | 101818.594176493 |
| 27 | 113765.890479537 |
| 28 | 114684.913272079 |
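To quantify how close these predictions are to the actual test-set salaries, one simple check (not part of the original walkthrough) is the root-mean-square error:

# Root-mean-square error of the test-set predictions (illustrative check)
sqrt(mean((test_set$Salary - y_pred)^2))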
library(ggplot2)
gtrain <- ggplot() +
geom_point(aes(x = training_set$YearsExperience, y = training_set$Salary),
colour = 'red') +
geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
colour = 'blue') +
ggtitle('Salary vs Experience (Training set)') +
xlab('Years of experience') +
ylab('Salary')
print(gtrain)
library(ggplot2)
gtest <- ggplot() +
geom_point(aes(x = test_set$YearsExperience, y = test_set$Salary),
colour = 'red') +
geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
colour = 'blue') +
ggtitle('Salary vs Experience (Test set)') +
xlab('Years of experience') +
ylab('Salary')
print(gtest)
A linear regression model that contains more than one predictor variable is called a multiple linear regression model. Here, we will fit a multiple linear regression model to 50_Startups.csv, where R.D.Spend, Administration, Marketing.Spend, and State are the independent variables and Profit is the dependent variable. The dataset can be downloaded from the following link:
Dataset: 50_Startups.csv
# Importing the dataset
dataset <- read.csv('50_Startups.csv')
library(hwriter)
cat(hwrite(head(dataset), border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| R.D.Spend | Administration | Marketing.Spend | State | Profit |
| 165349.2 | 136897.80 | 471784.1 | New York | 192261.8 |
| 162597.7 | 151377.59 | 443898.5 | California | 191792.1 |
| 153441.5 | 101145.55 | 407934.5 | Florida | 191050.4 |
| 144372.4 | 118671.85 | 383199.6 | New York | 182902.0 |
| 142107.3 | 91391.77 | 366168.4 | Florida | 166187.9 |
| 131876.9 | 99814.71 | 362861.4 | New York | 156991.1 |
Since our State variable is categorical, we need to encode its string values (‘California’, ‘Florida’, ‘New York’) as a factor.
# Encoding categorical data
dataset$State <- factor(dataset$State,
levels = c('California', 'Florida', 'New York'),
labels = c(1, 2, 3))
cat(hwrite(head(dataset), border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| R.D.Spend | Administration | Marketing.Spend | State | Profit |
| 165349.2 | 136897.80 | 471784.1 | 3 | 192261.8 |
| 162597.7 | 151377.59 | 443898.5 | 1 | 191792.1 |
| 153441.5 | 101145.55 | 407934.5 | 2 | 191050.4 |
| 144372.4 | 118671.85 | 383199.6 | 3 | 182902.0 |
| 142107.3 | 91391.77 | 366168.4 | 2 | 166187.9 |
| 131876.9 | 99814.71 | 362861.4 | 3 | 156991.1 |
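Although State is now a single factor column, lm() will internally expand it into dummy (indicator) variables, one per non-reference level; that is why the regression summary below reports State2 and State3 rather than a single State coefficient. The expansion can be inspected directly (illustrative only):

# Dummy variables lm() will create for the State factor (first level is the reference)
head(model.matrix(~ State, data = dataset))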
# Splitting the dataset into the Training set and Test set
library(caTools)
set.seed(100)
split <- sample.split(dataset$Profit, SplitRatio = 0.8)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)
# Fitting Multiple Linear Regression to the Training set
regressor <- lm(formula = Profit ~ .,
                data = training_set)
summary(regressor)
##
## Call:
## lm(formula = Profit ~ ., data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32380 -4586 -190 4940 20038
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.359e+04 8.111e+03 5.374 5.61e-06 ***
## R.D.Spend 8.146e-01 5.828e-02 13.977 1.18e-15 ***
## Administration 1.858e-02 6.098e-02 0.305 0.762
## Marketing.Spend 2.873e-02 2.083e-02 1.379 0.177
## State2 -2.938e+02 3.871e+03 -0.076 0.940
## State3 -1.878e+03 4.104e+03 -0.458 0.650
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9938 on 34 degrees of freedom
## Multiple R-squared: 0.9496, Adjusted R-squared: 0.9421
## F-statistic: 128 on 5 and 34 DF, p-value: < 2.2e-16
NOTE: The p-values above show that the independent variables are not all equally significant in our regression model; only R.D.Spend is significant at the 0.05 level. We can therefore eliminate the non-significant variables using backward elimination.
# Predicting the Test set results
y_pred <- predict(regressor, newdata = test_set)
cat(hwrite(y_pred, border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| Row | Predicted Profit |
| 7 | 159646.887121386 |
| 12 | 134474.311934385 |
| 28 | 112970.573563469 |
| 32 | 96882.6762970742 |
| 34 | 96580.68587341 |
| 36 | 86677.7833668439 |
| 39 | 64738.0433185301 |
| 42 | 72315.8256659785 |
| 44 | 57729.6018142506 |
| 48 | 46105.94444621 |
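To see how these predictions line up with the actual profits (a check not shown in the original walkthrough), the two can be placed side by side:

# Actual vs. predicted profit on the test set (illustrative comparison)
cbind(actual = test_set$Profit, predicted = y_pred)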
Forward selection has drawbacks, including the fact that each addition of a new variable may render one or more of the already included variables non-significant. An alternative approach which avoids this is backward elimination. Under this approach, one starts by fitting a model with all the variables of interest. Then the least significant variable is dropped, so long as it is not significant at our chosen critical level. We continue by successively re-fitting reduced models and applying the same rule until all remaining variables are statistically significant.
# Backward elimination: repeatedly refit the model, dropping the predictor with the
# highest p-value, until every remaining predictor is significant at level sl.
# (Note: this simple implementation matches coefficient rows to data columns by
# position; the 3-level State factor breaks that assumption slightly, but it does
# not change the result on this dataset.)
backwardElimination <- function(x, sl) {
  numVars <- length(x)
  for (i in c(1:numVars)) {
    regressor <- lm(formula = Profit ~ ., data = x)
    # largest p-value among the current predictors
    maxVar <- max(coef(summary(regressor))[c(2:numVars), "Pr(>|t|)"])
    if (maxVar > sl) {
      # drop the data column at the position of that p-value
      j <- which(coef(summary(regressor))[c(2:numVars), "Pr(>|t|)"] == maxVar)
      x <- x[, -j]
    }
    numVars <- numVars - 1
  }
  return(summary(regressor))
}
SL <- 0.05                               # significance level
dataset <- dataset[, c(1, 2, 3, 4, 5)]   # keep all five columns
print(backwardElimination(training_set, SL))
##
## Call:
## lm(formula = Profit ~ ., data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32393 -4874 134 5177 18628
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.707e+04 3.085e+03 15.26 <2e-16 ***
## R.D.Spend 8.724e-01 3.390e-02 25.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9751 on 38 degrees of freedom
## Multiple R-squared: 0.9457, Adjusted R-squared: 0.9443
## F-statistic: 662.2 on 1 and 38 DF, p-value: < 2.2e-16
After backward elimination, only R.D.Spend remains significant for our regression analysis, and the reduced model is approximately Profit ≈ 47070 + 0.8724 × R.D.Spend. So, we fit our training data using only that variable.
# Fitting the reduced model (R.D.Spend only) to the Training set
regressor <- lm(formula = Profit ~ R.D.Spend,
                data = training_set)
library(ggplot2)
gtrain <- ggplot() +
geom_point(aes(x = training_set$R.D.Spend, y = training_set$Profit),
colour = 'red') +
geom_line(aes(x = training_set$R.D.Spend, y = predict(regressor, newdata = training_set)),
colour = 'blue') +
ggtitle('Profit vs RnD Spend (Training set)') +
xlab('RnD Spend') +
ylab('Profit')
print(gtrain)
gtest <- ggplot() +
  geom_point(aes(x = test_set$R.D.Spend, y = test_set$Profit),
             colour = 'red') +
  geom_line(aes(x = training_set$R.D.Spend, y = predict(regressor, newdata = training_set)),
            colour = 'blue') +
  ggtitle('Profit vs RnD Spend (Test set)') +
  xlab('RnD Spend') +
  ylab('Profit')
print(gtest)