Untitled

Simple Linear Regression

In this example, I would demonestrate how to create a simple linear regression. Furthermore, I would make a residual dignoses to weigh our model quality and how far does it fit to the problem.

The dataset I have is basically consists of one independent variable(year of experience) and one dependant variable (Salary). I would build a linear regression model to predict the salary at certain point. The relation between two variables is linear.

Importing the dataset

library(RCurl)

## Loading required package: bitops

myfile = getURL('https://raw.githubusercontent.com/salma71/Machine-learning-tutorial/master/Part%202%20-%20Regression/Section%204%20-%20Simple%20Linear%20Regression/Salary_Data.csv', ssl.verifyhost=FALSE, ssl.verifypeer=FALSE)

dataset = read.csv(textConnection(myfile), header = TRUE)
head(dataset)

Splitting the dataset into the Training set and Test set

# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Salary, SplitRatio = 2/3)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

Fitting Simple Linear Regression to the Training set

regressor = lm(formula = Salary ~ YearsExperience,
               data = training_set)
summary(regressor)

## 
## Call:
## lm(formula = Salary ~ YearsExperience, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7325.1 -3814.4   427.7  3559.7  8884.6 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        25592       2646   9.672 1.49e-08 ***
## YearsExperience     9365        421  22.245 1.52e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5391 on 18 degrees of freedom
## Multiple R-squared:  0.9649, Adjusted R-squared:  0.963 
## F-statistic: 494.8 on 1 and 18 DF,  p-value: 1.524e-14

The model says that for a salary to increase one unit it required to increase years of experience variable by 9365 units. Also, we can see from the model, the Adjusted R-squared is 0.96. Means that the model can predict 96% of the data - which is good.

Residual dignosis

par(mfrow=c(2,2)) # Split the plotting panel into a 2 x 2 grid
plot(regressor) # Plot the model information

The first plot (residuals vs. fitted values) is a simple scatterplot between residuals and predicted values. It look a bit random but so far equally distributed around the center.

The second plot (normal Q-Q) is a normal probability plot. It gives so far a straight line which means that the errors are distributed normally.

The third plot (Scale-Location), like the the first, looks random. No patterns. However, it gave me a strange inversed v-shape.

The last plot (Cook’s distance) tells us which points have the greatest influence on the regression (leverage points). We see that points 7, 9 and 12 have great influence on the model.