In statistics, linear regression is a linear approach to modelling the relationship between a scalar response (or dependent variable) and one or more explanatory variables (or independent variables).
Linear regression has many practical uses. Most applications fall into one of two broad categories: prediction, where a fitted model is used to forecast the response for new values of the explanatory variables, and explanation, where the model is used to quantify the strength of the relationship between the response and the explanatory variables.
Linear regression models are often fitted using the least squares approach, but they may also be fitted in other ways, such as by minimizing the “lack of fit” in some other norm (as with least absolute deviations regression), or by minimizing a penalized version of the least squares cost function as in ridge regression (L2-norm penalty) and lasso (L1-norm penalty).
Source: Linear regression
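As a quick aside (not part of this tutorial's workflow), the penalized variants mentioned above can be fitted in R with the glmnet package; the sketch below assumes glmnet is installed and simply uses the built-in mtcars data as illustrative predictors and response:

library(glmnet)
x <- as.matrix(mtcars[, c("wt", "hp")])  # example predictor matrix
y <- mtcars$mpg                          # example response
ridge_fit <- glmnet(x, y, alpha = 0)     # alpha = 0 gives the L2 (ridge) penalty
lasso_fit <- glmnet(x, y, alpha = 1)     # alpha = 1 gives the L1 (lasso) penalty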
In simple linear regression, each observation consists of two values: one for the dependent variable and one for the independent variable. Here, we will fit a linear regression model to Salary_Data.csv, where years of experience is the independent variable and salary is the dependent variable. The dataset can be downloaded from the following link:
Dataset: Salary_Data.csv
# Importing the dataset
dataset <- read.csv('Salary_Data.csv')
# Splitting the dataset into the Training set and Test set
library(caTools)
set.seed(100)
split <- sample.split(dataset$Salary, SplitRatio = 2/3)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)
library(hwriter)
cat(hwrite(training_set, border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| YearsExperience | Salary |
| 1.1 | 39343 |
| 1.3 | 46205 |
| 1.5 | 37731 |
| 2 | 43525 |
| 2.2 | 39891 |
| 2.9 | 56642 |
| 3.2 | 54445 |
| 3.2 | 64445 |
| 3.7 | 57189 |
| 4 | 56957 |
| 4.1 | 57081 |
| 5.1 | 66029 |
| 5.3 | 83088 |
| 5.9 | 81363 |
| 6.8 | 91738 |
| 7.9 | 101302 |
| 8.7 | 109431 |
| 9 | 105582 |
| 10.3 | 122391 |
| 10.5 | 121872 |
cat(hwrite(test_set, border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| YearsExperience | Salary |
| 3 | 60150 |
| 3.9 | 63218 |
| 4 | 55794 |
| 4.5 | 61111 |
| 4.9 | 67938 |
| 6 | 93940 |
| 7.1 | 98273 |
| 8.2 | 113812 |
| 9.5 | 116969 |
| 9.6 | 112635 |
# Fitting Simple Linear Regression to the Training set
regressor <- lm(formula = Salary ~ YearsExperience,
                data = training_set)
summary(regressor)
##
## Call:
## lm(formula = Salary ~ YearsExperience, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7299.9 -3352.4 -201.1 2843.6 8577.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26458.7 2219.3 11.92 5.61e-10 ***
## YearsExperience 9190.2 386.3 23.79 4.73e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5080 on 18 degrees of freedom
## Multiple R-squared: 0.9692, Adjusted R-squared: 0.9675
## F-statistic: 565.9 on 1 and 18 DF, p-value: 4.733e-15
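From the coefficient estimates above, the fitted line is approximately Salary ≈ 26458.7 + 9190.2 × YearsExperience, i.e. each additional year of experience is associated with an increase of about 9190 in salary. The coefficients can also be pulled out of the model directly; the 5-year figure below is just an illustrative input:

coef(regressor)                               # (Intercept) and YearsExperience estimates
coef(regressor)[1] + coef(regressor)[2] * 5   # fitted salary at 5 years of experience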
# Predicting the Test set results
y_pred <- predict(regressor, newdata = test_set)
cat(hwrite(y_pred, border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| Row | Predicted Salary |
| 7 | 54029.4089643145 |
| 11 | 62300.6140971915 |
| 12 | 63219.6368897334 |
| 15 | 67814.7508524428 |
| 16 | 71490.8420226103 |
| 20 | 81600.0927405711 |
| 22 | 91709.3434585318 |
| 24 | 101818.594176493 |
| 27 | 113765.890479537 |
| 28 | 114684.913272079 |
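To quantify how close these predictions are to the actual test-set salaries, one simple check (not part of the original walkthrough) is the root-mean-square error:

# Root-mean-square error of the test-set predictions (illustrative check)
sqrt(mean((test_set$Salary - y_pred)^2))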
library(ggplot2)
gtrain <- ggplot() +
geom_point(aes(x = training_set$YearsExperience, y = training_set$Salary),
colour = 'red') +
geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
colour = 'blue') +
ggtitle('Salary vs Experience (Training set)') +
xlab('Years of experience') +
ylab('Salary')
print(gtrain)
library(ggplot2)
gtest <- ggplot() +
geom_point(aes(x = test_set$YearsExperience, y = test_set$Salary),
colour = 'red') +
geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
colour = 'blue') +
ggtitle('Salary vs Experience (Test set)') +
xlab('Years of experience') +
ylab('Salary')
print(gtest)
A linear regression model that contains more than one predictor variable is called a multiple linear regression model. Here, we will fit a multiple linear regression model to 50_Startups.csv, where R.D.Spend, Administration, Marketing.Spend, and State are the independent variables and Profit is the dependent variable. The dataset can be downloaded from the following link:
Dataset: 50_Startups.csv
# Importing the dataset
dataset <- read.csv('50_Startups.csv')
library(hwriter)
cat(hwrite(head(dataset), border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| R.D.Spend | Administration | Marketing.Spend | State | Profit |
| 165349.2 | 136897.80 | 471784.1 | New York | 192261.8 |
| 162597.7 | 151377.59 | 443898.5 | California | 191792.1 |
| 153441.5 | 101145.55 | 407934.5 | Florida | 191050.4 |
| 144372.4 | 118671.85 | 383199.6 | New York | 182902.0 |
| 142107.3 | 91391.77 | 366168.4 | Florida | 166187.9 |
| 131876.9 | 99814.71 | 362861.4 | New York | 156991.1 |
Since our State variable is categorical, we need to encode its string values (‘California’, ‘Florida’, ‘New York’) as a factor.
# Encoding categorical data
dataset$State <- factor(dataset$State,
levels = c('California', 'Florida', 'New York'),
labels = c(1, 2, 3))
cat(hwrite(head(dataset), border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| R.D.Spend | Administration | Marketing.Spend | State | Profit |
| 165349.2 | 136897.80 | 471784.1 | 3 | 192261.8 |
| 162597.7 | 151377.59 | 443898.5 | 1 | 191792.1 |
| 153441.5 | 101145.55 | 407934.5 | 2 | 191050.4 |
| 144372.4 | 118671.85 | 383199.6 | 3 | 182902.0 |
| 142107.3 | 91391.77 | 366168.4 | 2 | 166187.9 |
| 131876.9 | 99814.71 | 362861.4 | 3 | 156991.1 |
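Although State is now a single factor column, lm() will internally expand it into dummy (indicator) variables, one per non-reference level; that is why the regression summary below reports State2 and State3 rather than a single State coefficient. The expansion can be inspected directly (illustrative only):

# Dummy variables lm() will create for the State factor (first level is the reference)
head(model.matrix(~ State, data = dataset))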
# Splitting the dataset into the Training set and Test set
library(caTools)
set.seed(100)
split <- sample.split(dataset$Profit, SplitRatio = 0.8)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)
# Fitting Multiple Linear Regression to the Training set
regressor <- lm(formula = Profit ~ .,
                data = training_set)
summary(regressor)
##
## Call:
## lm(formula = Profit ~ ., data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32380 -4586 -190 4940 20038
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.359e+04 8.111e+03 5.374 5.61e-06 ***
## R.D.Spend 8.146e-01 5.828e-02 13.977 1.18e-15 ***
## Administration 1.858e-02 6.098e-02 0.305 0.762
## Marketing.Spend 2.873e-02 2.083e-02 1.379 0.177
## State2 -2.938e+02 3.871e+03 -0.076 0.940
## State3 -1.878e+03 4.104e+03 -0.458 0.650
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9938 on 34 degrees of freedom
## Multiple R-squared: 0.9496, Adjusted R-squared: 0.9421
## F-statistic: 128 on 5 and 34 DF, p-value: < 2.2e-16
NOTE: The p-values above show that the independent variables are not all equally significant in our regression model; only R.D.Spend is significant at the 0.05 level. We can therefore eliminate the non-significant variables using backward elimination.
# Predicting the Test set results
y_pred <- predict(regressor, newdata = test_set)
cat(hwrite(y_pred, border = 1, table.frame='void', width='300px', table.style='padding: 100px', row.names=FALSE, row.style=list('font-weight:bold')))
| Row | Predicted Profit |
| 7 | 159646.887121386 |
| 12 | 134474.311934385 |
| 28 | 112970.573563469 |
| 32 | 96882.6762970742 |
| 34 | 96580.68587341 |
| 36 | 86677.7833668439 |
| 39 | 64738.0433185301 |
| 42 | 72315.8256659785 |
| 44 | 57729.6018142506 |
| 48 | 46105.94444621 |
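To see how these predictions line up with the actual profits (a check not shown in the original walkthrough), the two can be placed side by side:

# Actual vs. predicted profit on the test set (illustrative comparison)
cbind(actual = test_set$Profit, predicted = y_pred)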
Forward selection has drawbacks, including the fact that each addition of a new variable may render one or more of the already included variables non-significant. An alternative approach which avoids this is backward elimination. Under this approach, one starts by fitting a model with all the variables of interest. Then the least significant variable is dropped, so long as it is not significant at our chosen critical level. We continue by successively re-fitting reduced models and applying the same rule until all remaining variables are statistically significant.
# Backward elimination: repeatedly refit the model, dropping the predictor with the
# highest p-value, until every remaining predictor is significant at level sl.
# (Note: this simple implementation matches coefficient rows to data columns by
# position; the 3-level State factor breaks that assumption slightly, but it does
# not change the result on this dataset.)
backwardElimination <- function(x, sl) {
  numVars <- length(x)
  for (i in c(1:numVars)) {
    regressor <- lm(formula = Profit ~ ., data = x)
    # largest p-value among the current predictors
    maxVar <- max(coef(summary(regressor))[c(2:numVars), "Pr(>|t|)"])
    if (maxVar > sl) {
      # drop the data column at the position of that p-value
      j <- which(coef(summary(regressor))[c(2:numVars), "Pr(>|t|)"] == maxVar)
      x <- x[, -j]
    }
    numVars <- numVars - 1
  }
  return(summary(regressor))
}
SL <- 0.05                               # significance level
dataset <- dataset[, c(1, 2, 3, 4, 5)]   # keep all five columns
print(backwardElimination(training_set, SL))
##
## Call:
## lm(formula = Profit ~ ., data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32393 -4874 134 5177 18628
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.707e+04 3.085e+03 15.26 <2e-16 ***
## R.D.Spend 8.724e-01 3.390e-02 25.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9751 on 38 degrees of freedom
## Multiple R-squared: 0.9457, Adjusted R-squared: 0.9443
## F-statistic: 662.2 on 1 and 38 DF, p-value: < 2.2e-16
After backward elimination, only R.D.Spend remains significant for our regression analysis, and the reduced model is approximately Profit ≈ 47070 + 0.8724 × R.D.Spend. So, we fit our training data using only that variable.
# Fitting the reduced model (R.D.Spend only) to the Training set
regressor <- lm(formula = Profit ~ R.D.Spend,
                data = training_set)
library(ggplot2)
gtrain <- ggplot() +
geom_point(aes(x = training_set$R.D.Spend, y = training_set$Profit),
colour = 'red') +
geom_line(aes(x = training_set$R.D.Spend, y = predict(regressor, newdata = training_set)),
colour = 'blue') +
ggtitle('Profit vs RnD Spend (Training set)') +
xlab('RnD Spend') +
ylab('Profit')
print(gtrain)
gtest <- ggplot() +
  geom_point(aes(x = test_set$R.D.Spend, y = test_set$Profit),
             colour = 'red') +
  geom_line(aes(x = training_set$R.D.Spend, y = predict(regressor, newdata = training_set)),
            colour = 'blue') +
  ggtitle('Profit vs RnD Spend (Test set)') +
  xlab('RnD Spend') +
  ylab('Profit')
print(gtest)