Introduction

In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X. In simple linear regression, there is only one independent variable, denoted x.

In linear regression, the relationship is modeled using linear predictor functions whose unknown parameters are estimated from the data. Such models are called linear models. With a single independent variable, the model takes this form:

\[ y = b_0 + b_1x \] where \(x\) is the independent variable, \(y\) is the dependent variable, \(b_0\) is the intercept, and \(b_1\) is the slope.
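For a single predictor, the least-squares estimates of the two parameters have a closed form: the slope is the sample covariance of x and y divided by the sample variance of x, and the intercept is the mean of y minus the slope times the mean of x. The following is a minimal sketch of that calculation. The x and y values are made up for illustration and lie exactly on the line y = 1 + 2x, so both approaches should recover an intercept of 1 and a slope of 2.

```r
# Toy data (hypothetical values, chosen to lie exactly on y = 1 + 2x)
x = c(1, 2, 3, 4, 5)
y = c(3, 5, 7, 9, 11)

# Closed-form least-squares estimates for simple linear regression
b1 = cov(x, y) / var(x)        # slope
b0 = mean(y) - b1 * mean(x)    # intercept

# The same estimates obtained from lm(), for comparison
fit = lm(y ~ x)
print(c(b0 = b0, b1 = b1))
print(coef(fit))
```

In practice there is no need to compute these by hand, since lm does it for us; the sketch only shows what "estimated from the data" means here.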

Simple Linear Regression in R

To build the simple linear regression model in R, first we will import the dataset from a CSV file.

dataset = read.csv("Salary_Data.csv")

Then we split the dataset into a training set and a test set.

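library(caTools)  # install.packages('caTools') if the package is not installed
# Randomly mark about 2/3 of the rows as TRUE (training) and the rest as FALSE (test)
split = sample.split(dataset$Salary, SplitRatio = 2/3)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)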

The sample.split function takes a vector and a split ratio and returns a logical vector of the same length, here called split, in which TRUE marks the observations assigned to the training set and FALSE marks the rest. We then use the subset function to separate the rows of the dataset according to the TRUE and FALSE values in the split vector.
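To see what the split vector looks like, here is a small sketch on a made-up vector. The values of y and the seed are arbitrary and used only for illustration, and the example assumes the caTools package is installed.

```r
library(caTools)  # assumes caTools is installed

set.seed(123)  # arbitrary seed, only so the example is repeatable

# Hypothetical response values, used only to show the shape of the result
y = c(10, 20, 30, 40, 50, 60, 70, 80, 90)

split = sample.split(y, SplitRatio = 2/3)
print(split)  # logical vector with one element per element of y

# Logical indexing keeps the elements where the mask is TRUE (or FALSE)
train_y = y[split]
test_y = y[!split]
print(train_y)
print(test_y)
```

Because the split is random, the rows that end up in each set (and therefore the fitted coefficients) change from run to run unless a seed is fixed with set.seed.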

Now we can build the simple linear regression model based on the training set.

regressor = lm(formula = Salary ~ YearsExperience, data = training_set)

The lm function is used to fit linear models. The formula argument accepts an object of class formula that describes the model to be fitted. Salary ~ YearsExperience specifies a linear model in which Salary is the dependent variable and YearsExperience is the independent variable.
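To make the formula notation concrete, here is a small sketch that uses R's built-in cars dataset as a stand-in for the salary data. In it, dist ~ speed plays the role of Salary ~ YearsExperience, and coef returns the estimated intercept and slope.

```r
# Built-in dataset used as a stand-in for Salary_Data.csv
data(cars)

# Response on the left of ~, predictor on the right
fit = lm(formula = dist ~ speed, data = cars)

# Named vector: "(Intercept)" is b_0, "speed" is b_1
b = coef(fit)
print(b)

# The fitted values are b_0 + b_1 * x for each observation
manual = b["(Intercept)"] + b["speed"] * cars$speed
print(all.equal(unname(manual), unname(fitted(fit))))
```

The intercept is included automatically; it does not need to be written in the formula.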

Once we have the model, we can print a summary of it by typing:

summary(regressor)

Call:
lm(formula = Salary ~ YearsExperience, data = training_set)

Residuals:
   Min     1Q Median     3Q    Max 
 -7338  -4418   -225   2959  11803 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      26105.9     2687.2   9.715 1.39e-08 ***
YearsExperience   9256.5      423.3  21.867 2.05e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5850 on 18 degrees of freedom
Multiple R-squared:  0.9637,    Adjusted R-squared:  0.9617 
F-statistic: 478.2 on 1 and 18 DF,  p-value: 2.053e-14
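The Estimate column holds the fitted intercept and slope, so for this particular split the model is roughly Salary = 26105.9 + 9256.5 × YearsExperience. The sketch below plugs the printed values back in; the 5 years of experience is an arbitrary input chosen for illustration. It also checks two relationships that hold for this kind of summary: each t value is the estimate divided by its standard error, and with a single predictor the F-statistic is the square of the slope's t value.

```r
# Values copied from the summary output above
b0 = 26105.9; se_b0 = 2687.2
b1 = 9256.5;  se_b1 = 423.3

# Predicted salary for a hypothetical employee with 5 years of experience
years = 5
predicted = b0 + b1 * years
print(predicted)

# t value = Estimate / Std. Error
# (agrees with the printed 9.715 and 21.867 up to rounding of the inputs)
t_b0 = b0 / se_b0
t_b1 = b1 / se_b1
print(c(t_b0, t_b1))

# With one predictor, F equals the squared t value of the slope
# (printed F-statistic: 478.2)
print(t_b1^2)
```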

We can then apply this model to the test set to predict the salary of each observation in it.

salary_predict = predict(regressor, newdata = test_set)

Then we can compare the predicted values with the actual values.

print(data.frame(Salary = test_set$Salary, Salary_Predict = salary_predict))
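Comparing rows by eye works for a small test set, but a single number that summarises the prediction error is often easier to track. Two common choices, neither of which is used in the walkthrough itself, are the root mean squared error (RMSE) and the mean absolute error (MAE). The sketch below uses the built-in cars dataset and a base-R random split as stand-ins, so it runs without Salary_Data.csv or caTools; the seed is arbitrary.

```r
data(cars)
set.seed(42)  # arbitrary seed for a repeatable split

# Base-R split: 2/3 of the rows (rounded down) for training, the rest for testing
n = nrow(cars)
train_idx = sample(seq_len(n), size = floor(2/3 * n))
train = cars[train_idx, ]
test = cars[-train_idx, ]

fit = lm(dist ~ speed, data = train)
pred = predict(fit, newdata = test)

# Side-by-side comparison of actual and predicted values
comparison = data.frame(Actual = test$dist, Predicted = pred)
print(head(comparison))

# Error summaries, both in the units of the response variable
rmse = sqrt(mean((test$dist - pred)^2))
mae = mean(abs(test$dist - pred))
print(c(RMSE = rmse, MAE = mae))
```

For the salary model, the same two lines would use test_set$Salary and salary_predict in place of test$dist and pred.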

We can also plot the data to visualize the linear model we have built. We will use ggplot2 for this.

library(ggplot2)
ggplot(mapping = aes(x = YearsExperience, y = Salary)) +
  geom_point(data = training_set, colour = 'green') +
  geom_point(data = test_set, colour = 'red') +
  geom_line(data = training_set, aes(y = predict(regressor, newdata = training_set)), colour = 'blue') +
  ggtitle('Salary vs Experience (Green: Training Set, Red: Test Set)') +
  xlab('Years of experience') +
  ylab('Salary')
