Introduction
In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. In simple linear regression, there is only one independent variable, denoted x.
In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data. Such models are called linear models. A simple linear model takes this form:
\[
y = b_0 + b_1x
\] where \(x\) is the independent variable, \(y\) is the dependent variable, \(b_0\) is the intercept, and \(b_1\) is the slope.
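As a brief illustration of what "estimated from the data" can mean: the usual choice is ordinary least squares, which picks \(b_0\) and \(b_1\) to minimize the sum of squared differences between the observed and fitted values of \(y\). Assuming \(n\) observations \((x_i, y_i)\) with sample means \(\bar{x}\) and \(\bar{y}\) (notation introduced here for illustration, not taken from the text above), the estimates are:
\[
b_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1\bar{x}
\]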
Simple Linear Regression in R
To build a simple linear regression model in R, we first import the dataset from a CSV file.
dataset = read.csv("Salary_Data.csv")
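Before going further it can help to check that the import produced the expected columns. The sketch below does not use Salary_Data.csv; it writes a tiny stand-in file with made-up values and the two column names used later in this section (YearsExperience and Salary), then inspects the result. The names csv_path and toy_dataset are hypothetical.

```r
# Write a tiny stand-in CSV to a temporary file (values are made up).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("YearsExperience,Salary",
             "1.0,40000",
             "2.5,48000",
             "4.0,60000"), csv_path)

toy_dataset <- read.csv(csv_path)

str(toy_dataset)                     # column names and types
print(head(toy_dataset))             # first rows
print(colSums(is.na(toy_dataset)))   # count of missing values per column
```

The same three checks can be run on the real dataset object after the read.csv call above.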
Then we split the dataset into a training set and a test set.
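library(caTools) # install.packages('caTools') to install caTools
# Mark about 2/3 of the rows as TRUE (training) and the rest as FALSE (test)
split = sample.split(dataset$Salary, SplitRatio = 2/3)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)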
The sample.split function takes a vector and a split ratio and returns a logical vector, stored here in split, whose elements are either TRUE or FALSE. We then use the subset function to divide the dataset according to those values: rows marked TRUE go to the training set and rows marked FALSE go to the test set.
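To make the logical-mask idea concrete, here is a minimal sketch that uses only base R. It does not call sample.split; instead, sample stands in for it to build a similar TRUE/FALSE vector, and set.seed is added so the random split is repeatable. The data frame toy and all of its values are made up for illustration.

```r
# Toy data frame standing in for the salary data (values are made up).
toy <- data.frame(YearsExperience = 1:9,
                  Salary = c(40, 42, 47, 52, 55, 61, 63, 70, 74) * 1000)

set.seed(123)  # fix the random number generator so the split is repeatable

# Build a logical mask with TRUE for 2/3 of the rows, in random order.
n_train <- round(nrow(toy) * 2 / 3)
mask <- sample(rep(c(TRUE, FALSE), times = c(n_train, nrow(toy) - n_train)))

toy_train <- subset(toy, mask == TRUE)   # rows where the mask is TRUE
toy_test  <- subset(toy, mask == FALSE)  # the remaining rows

print(mask)
print(nrow(toy_train))
print(nrow(toy_test))
```

Calling set.seed before sample.split has the same effect on the real split: rerunning the notebook then yields the same training and test sets.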
Now we can build the simple linear regression model based on the training set.
regressor = lm(formula = Salary ~ YearsExperience, data = training_set)
The lm function is used to fit linear models. The formula argument accepts a formula object that describes the model to be fitted. Salary ~ YearsExperience specifies a linear model in which Salary is the dependent variable and YearsExperience is the independent variable.
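As a rough check of what lm computes for a formula of this shape, the sketch below fits a model on a small made-up data frame and compares the coefficients with the closed-form least-squares estimates (slope = covariance-style sum divided by the sum of squared deviations of x; intercept = mean of y minus slope times mean of x). The names toy, fit, b1_manual and b0_manual are hypothetical.

```r
# Made-up data that lies roughly on a straight line.
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))

# Same call shape as in the text: response on the left of ~, predictor on the right.
fit <- lm(formula = y ~ x, data = toy)
b <- coef(fit)  # named vector with entries "(Intercept)" and "x"

# Closed-form least-squares estimates for one predictor.
b1_manual <- sum((toy$x - mean(toy$x)) * (toy$y - mean(toy$y))) /
  sum((toy$x - mean(toy$x))^2)
b0_manual <- mean(toy$y) - b1_manual * mean(toy$x)

print(b)
print(c(b0_manual, b1_manual))
```

The two printed vectors should agree up to floating-point rounding, which is one way to see that the fitted line has the form y = b_0 + b_1x.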
Once we have the model, we can get a summary of it by typing:
summary(regressor)
Call:
lm(formula = Salary ~ YearsExperience, data = training_set)

Residuals:
   Min     1Q Median     3Q    Max
 -7338  -4418   -225   2959  11803

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)      26105.9     2687.2   9.715 1.39e-08 ***
YearsExperience   9256.5      423.3  21.867 2.05e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5850 on 18 degrees of freedom
Multiple R-squared: 0.9637, Adjusted R-squared: 0.9617
F-statistic: 478.2 on 1 and 18 DF, p-value: 2.053e-14
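The numbers in this output can also be read programmatically from the summary object. The sketch below uses a small made-up data frame rather than the salary data, and shows that each t value is the estimate divided by its standard error (in the output above, 26105.9 / 2687.2 is about 9.715). The names toy, fit, s, coef_table and t_manual are hypothetical.

```r
# Made-up data; six points, so the residual degrees of freedom are 6 - 2 = 4.
toy <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1, 11.7))
fit <- lm(y ~ x, data = toy)
s <- summary(fit)

# Matrix with columns Estimate, Std. Error, t value, Pr(>|t|).
coef_table <- s$coefficients
t_manual <- coef_table[, "Estimate"] / coef_table[, "Std. Error"]

print(coef_table)
print(t_manual)          # should match the "t value" column
print(s$r.squared)       # Multiple R-squared
print(s$adj.r.squared)   # Adjusted R-squared
print(s$sigma)           # Residual standard error
print(fit$df.residual)   # residual degrees of freedom
```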
We can then apply the model to the test set to predict the salaries in it.
salary_predict = predict(regressor, newdata = test_set)
Then we can compare the predicted values with the actual values.
print(data.frame(Salary = test_set$Salary, Salary_Predict = salary_predict))
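Printing the two columns side by side gives a visual comparison; a single error figure is often useful as well. The following is a minimal sketch, on made-up data, of two common summaries of prediction error: root mean squared error (RMSE) and mean absolute error (MAE). Neither metric is part of the original walkthrough, and the names train, test, pred, rmse and mae are hypothetical.

```r
# Made-up training and test data.
train <- data.frame(x = c(1, 2, 3, 4, 5, 6),
                    y = c(2.1, 3.9, 6.2, 7.8, 10.1, 11.7))
test <- data.frame(x = c(7, 8),
                   y = c(14.2, 15.8))

fit <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = test)  # one prediction per test row

errors <- test$y - pred               # actual minus predicted
rmse <- sqrt(mean(errors^2))          # penalizes large errors more heavily
mae <- mean(abs(errors))              # average size of the error

print(data.frame(actual = test$y, predicted = pred, error = errors))
print(c(RMSE = rmse, MAE = mae))
```

Both metrics are in the units of the dependent variable, so on the salary data they would read as an error in salary units.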
We can also plot the data with ggplot2 to visualize the linear model we have built.
library(ggplot2)
ggplot() +
  geom_point(aes(x = training_set$YearsExperience, y = training_set$Salary), colour = 'green') +
  geom_point(aes(x = test_set$YearsExperience, y = test_set$Salary), colour = 'red') +
  geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)), colour = 'blue') +
  ggtitle('Salary vs Experience (Green: Training Set, Red: Test Set)') +
  xlab('Years of experience') +
  ylab('Salary')
