Regression Overview

In this script, I present an overview of regression techniques for machine learning in R.

First Step

The first step, after loading the libraries, is to read in the data and get to know it a bit.

#---- LIBRARIES ---- 
library(tidyverse)

#---- READ IN DATA ----
dataset <- read.csv("/Users/saina/Desktop/machine learning/Data (1).csv")
# to get to know the data

head(dataset)
summary(dataset)
##    Country               Age            Salary       Purchased        
##  Length:10          Min.   :27.00   Min.   :48000   Length:10         
##  Class :character   1st Qu.:35.00   1st Qu.:54000   Class :character  
##  Mode  :character   Median :38.00   Median :61000   Mode  :character  
##                     Mean   :38.78   Mean   :63778                     
##                     3rd Qu.:44.00   3rd Qu.:72000                     
##                     Max.   :50.00   Max.   :83000                     
##                     NA's   :1       NA's   :1
# SANITY CHECK FLAGS
# write a reusable function to count NAs for everything later (a sketch follows below)

# count NAs in every column
dataset %>%
  summarise_all(~ sum(is.na(.)))
# plotting to know the data
dataset %>%
  keep(is.numeric) %>% 
  gather() %>% 
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") + # each variable gets its own axis range
  geom_histogram(bins = 20) +
  # + scale_x_continuous(labels = scales::comma) # disable scientific notation (needs the scales package)
  theme_bw()
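As flagged in the comments above, here is a minimal sketch of a reusable NA-counting helper (the name count_nas is just for illustration):

# hypothetical helper: count NAs per column of any data frame
count_nas <- function(df) {
  sapply(df, function(col) sum(is.na(col)))
}
# count_nas(dataset)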

 

Second Step

Machine learning models need clean data sets as inputs. That means we need to take care of NAs and categorical values, since the models would not handle those well. Two of the four variables have missing values; let’s deal with that first.

# quick look for outliers
##range(dataset$Age, na.rm = T)
##range(dataset$Salary, na.rm = T)

# replacing NAs with the column means. Sometimes the median is used instead to be robust to outliers, but there seems to be no need for that here.
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)
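A quick re-check that no NAs remain:

sum(is.na(dataset)) # should now return 0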

Next, we encode the categorical variables as factors with numeric labels.

# encoding categorical data
dataset$Country = factor(dataset$Country,
                         levels = c("France", "Spain", "Germany"),
                         labels = c(1, 2, 3))
dataset$Purchased = factor(dataset$Purchased,
                           levels = c("No", "Yes"),
                           labels = c(0, 1))
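A quick structural check confirms the encoding (str() is one simple way to do this):

str(dataset) # Country and Purchased should now show up as factors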


Next, we need to split off a subset of the data as the training set, so that we can teach our model how the data behaves. We also need a test set to evaluate how the model is doing and to predict the behavior of similar, unseen data.

# splitting the data set into the train and test
library(caTools)
set.seed(123) # set a random seed so the same split can be reproduced in future runs
split = sample.split(dataset$Purchased, SplitRatio = 0.8) # creates a TRUE/FALSE logical vector
training_set = subset(dataset, split == TRUE) # 8 observations
test_set = subset(dataset, split == FALSE) # 2 observations

# with Euclidean distance in mind, we want the independent variables on the same scale.
# we can standardize ((x - mean) / sd) or normalize ((x - min) / (max - min)).
# feature scaling
training_set[,2:3] = scale(training_set[,2:3])
test_set[,2:3] = scale(test_set[,2:3]) # note: ideally the test set would be scaled with the training set's mean and sd
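For reference, here is a minimal sketch of the min-max normalization mentioned above (left commented out so we don’t transform the already-scaled columns twice):

normalize <- function(x) (x - min(x)) / (max(x) - min(x))
# training_set[,2:3] <- lapply(training_set[,2:3], normalize)
# test_set[,2:3] <- lapply(test_set[,2:3], normalize) # strictly, the test set should reuse the training min/max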

 

Now, let’s start modeling.   

Simple linear regression 

Let’s predict salary based on years of experience. We’ll also keep a running scenario in mind for the later sections: a new hire tells us they were about to hit level 6 at their old company and has therefore requested a salary of 160k, and we want to check whether that’s the right salary for someone between levels 6 and 7. We’ll start with linear regression models. In building linear regression models we need to be careful about the assumptions, but we won’t get into them here.

dataset <- read.csv("/Users/saina/Desktop/machine learning/Salary_Data.csv") # new data uploaded

set.seed(123)
split <- sample.split(dataset$Salary, SplitRatio = 2/3) # a larger training share (e.g. 70-80%) is usually better, but 2/3 is fine here since we only have 30 observations
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)
# no need for feature scaling here; simple linear regression works fine on the raw values

 

Our model will learn the relationship between the two variables from the training set, and then we can predict the salary for other, similar data (we’ll evaluate this using the test set).

# Fitting Simple Linear Regression to the Training set
regressor <- lm(formula = Salary ~ YearsExperience,
               data = training_set)

summary(regressor) # years of experience is highly statistically significant in predicting Salary
## 
## Call:
## lm(formula = Salary ~ YearsExperience, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7325.1 -3814.4   427.7  3559.7  8884.6 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        25592       2646   9.672 1.49e-08 ***
## YearsExperience     9365        421  22.245 1.52e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5391 on 18 degrees of freedom
## Multiple R-squared:  0.9649, Adjusted R-squared:  0.963 
## F-statistic: 494.8 on 1 and 18 DF,  p-value: 1.524e-14
# Predicting the Test set results
y_pred <- predict(regressor, newdata = test_set)

# now let's see how the model did
# visualizing the training set regression results
ggplot() +
  geom_point(aes(x = training_set$YearsExperience, y = training_set$Salary),
             colour = 'red') +
  geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
            colour = 'blue') +
  ggtitle("Salary vs Experience") +
  xlab("Years of experience") +
  ylab("Salary")

# visualizing the test set regression results
ggplot() +
  geom_point(aes(x = test_set$YearsExperience, y = test_set$Salary),
             colour = 'red') +
  geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)), # keeping what we trained on
            colour = 'blue') +
  ggtitle("Salary vs Experience") +
  xlab("Years of experience") +
  ylab("Salary")

 

Multiple linear regression

Next, we will create a model to help a customer assess where and in which companies they should invest to maximize profit: 

dataset <- read.csv("/Users/saina/Desktop/machine learning/50_Startups.csv") # new data uploaded

# encoding categorical data
dataset$State = factor(dataset$State,
                       levels = c('New York', 'California', 'Florida'),
                       labels = c(1, 2, 3))

set.seed(123)
split <- sample.split(dataset$Profit, SplitRatio = 0.8)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)

# fitting Multiple Linear Regression to the training set
regressor <- lm(formula = Profit ~ .,
               data = training_set)
summary(regressor) # the only significant variable was R&D
## 
## Call:
## lm(formula = Profit ~ ., data = training_set)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33128  -4865      5   6098  18065 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.965e+04  7.637e+03   6.501 1.94e-07 ***
## R.D.Spend        7.986e-01  5.604e-02  14.251 6.70e-16 ***
## Administration  -2.942e-02  5.828e-02  -0.505    0.617    
## Marketing.Spend  3.268e-02  2.127e-02   1.537    0.134    
## State2           1.213e+02  3.751e+03   0.032    0.974    
## State3           2.376e+02  4.127e+03   0.058    0.954    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9908 on 34 degrees of freedom
## Multiple R-squared:  0.9499, Adjusted R-squared:  0.9425 
## F-statistic:   129 on 5 and 34 DF,  p-value: < 2.2e-16
# predicting the test set results
y_pred <- predict(regressor, newdata = test_set)

 

But let’s do backward elimination to find a more parsimonious model:

## no need to worry about the dummy variable trap; R takes care of that for us
regressor <- lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
               data = training_set)
summary(regressor)
## 
## Call:
## lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + 
##     State, data = training_set)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33128  -4865      5   6098  18065 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.965e+04  7.637e+03   6.501 1.94e-07 ***
## R.D.Spend        7.986e-01  5.604e-02  14.251 6.70e-16 ***
## Administration  -2.942e-02  5.828e-02  -0.505    0.617    
## Marketing.Spend  3.268e-02  2.127e-02   1.537    0.134    
## State2           1.213e+02  3.751e+03   0.032    0.974    
## State3           2.376e+02  4.127e+03   0.058    0.954    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9908 on 34 degrees of freedom
## Multiple R-squared:  0.9499, Adjusted R-squared:  0.9425 
## F-statistic:   129 on 5 and 34 DF,  p-value: < 2.2e-16
regressor <- lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
               data = training_set) # dropping the one with the highest p value
summary(regressor)
## 
## Call:
## lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend, 
##     data = training_set)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33117  -4858    -36   6020  17957 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.970e+04  7.120e+03   6.980 3.48e-08 ***
## R.D.Spend        7.983e-01  5.356e-02  14.905  < 2e-16 ***
## Administration  -2.895e-02  5.603e-02  -0.517    0.609    
## Marketing.Spend  3.283e-02  1.987e-02   1.652    0.107    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9629 on 36 degrees of freedom
## Multiple R-squared:  0.9499, Adjusted R-squared:  0.9457 
## F-statistic: 227.6 on 3 and 36 DF,  p-value: < 2.2e-16
regressor <- lm(formula = Profit ~ R.D.Spend + Marketing.Spend,
               data = training_set) # dropping the one with the highest p value
summary(regressor)
## 
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = training_set)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33294  -4763   -354   6351  17693 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.638e+04  3.019e+03  15.364   <2e-16 ***
## R.D.Spend       7.879e-01  4.916e-02  16.026   <2e-16 ***
## Marketing.Spend 3.538e-02  1.905e-02   1.857   0.0713 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9533 on 37 degrees of freedom
## Multiple R-squared:  0.9495, Adjusted R-squared:  0.9468 
## F-statistic: 348.1 on 2 and 37 DF,  p-value: < 2.2e-16
# Marketing.Spend is only marginally significant (p ≈ 0.07), so at the 5% level we remove it as well.

So the final model is:  

regressor <- lm(formula = Profit ~ R.D.Spend,
               data = training_set) # dropping the one with the highest p value
summary(regressor)
## 
## Call:
## lm(formula = Profit ~ R.D.Spend, data = training_set)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34334  -4894   -340   6752  17147 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.902e+04  2.748e+03   17.84   <2e-16 ***
## R.D.Spend   8.563e-01  3.357e-02   25.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9836 on 38 degrees of freedom
## Multiple R-squared:  0.9448, Adjusted R-squared:  0.9434 
## F-statistic: 650.8 on 1 and 38 DF,  p-value: < 2.2e-16
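With the reduced model in hand, we can re-run the test set predictions; the comparison data frame below is just a sketch for inspection:

# predicting the test set results with the final (reduced) model
y_pred <- predict(regressor, newdata = test_set)
data.frame(actual = test_set$Profit, predicted = round(y_pred))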

 

And if we’re interested in an automated process for this, here’s a function for backward elimination in multiple regression:  

# x is the data set (with Profit as the response), sl is the significance level
# note: removing columns by coefficient index assumes each predictor maps to a single
# coefficient; the multi-level State factor only works out here by coincidence
backwardElimination <- function(x, sl) {
  numVars = length(x)
  for (i in c(1:numVars)) {
    regressor = lm(formula = Profit ~ ., data = x)
    maxVar = max(coef(summary(regressor))[c(2:numVars), "Pr(>|t|)"])
    if (maxVar > sl) {
      j = which(coef(summary(regressor))[c(2:numVars), "Pr(>|t|)"] == maxVar)
      x = x[, -j] # drop the predictor with the highest p-value
    }
    numVars = numVars - 1
  }
  return(summary(regressor))
}

SL = 0.05
dataset = dataset[, c(1,2,3,4,5)] # keep all five columns
backwardElimination(training_set, SL)
## 
## Call:
## lm(formula = Profit ~ ., data = x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34334  -4894   -340   6752  17147 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.902e+04  2.748e+03   17.84   <2e-16 ***
## R.D.Spend   8.563e-01  3.357e-02   25.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9836 on 38 degrees of freedom
## Multiple R-squared:  0.9448, Adjusted R-squared:  0.9434 
## F-statistic: 650.8 on 1 and 38 DF,  p-value: < 2.2e-16
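If you’d rather not hand-roll this, base R’s step() does backward selection using AIC rather than p-values; here’s a minimal sketch (a different criterion, so the result can differ):

# AIC-based backward elimination with the built-in step() function
full_model <- lm(Profit ~ ., data = training_set)
step_model <- step(full_model, direction = "backward", trace = FALSE)
summary(step_model)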

Polynomial regression

 

Let’s estimate a person’s salary from their position level. We’ll teach our model the relationship between level and salary.

dataset <- read.csv("/Users/saina/Desktop/machine learning/Position_Salaries.csv") # new data uploaded
dataset <- dataset[,2:3] # no need for categorical variable

# the data set is too small to be worth splitting into training and test sets
# linear:
lin_reg <- lm(formula = Salary ~ .,
             data = dataset)
# summary(lin_reg)
# fitting Polynomial Regression to the data set
dataset$Level2 <- dataset$Level^2
# adding higher-degree terms to the model
dataset$Level3 <- dataset$Level^3
dataset$Level4 <- dataset$Level^4
poly_reg <- lm(formula = Salary ~ .,
              data = dataset)
summary(poly_reg)
## 
## Call:
## lm(formula = Salary ~ ., data = dataset)
## 
## Residuals:
##      1      2      3      4      5      6      7      8      9     10 
##  -8357  18240   1358 -14633 -11725   6725  15997  10006 -28695  11084 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  184166.7    67768.0   2.718  0.04189 * 
## Level       -211002.3    76382.2  -2.762  0.03972 * 
## Level2        94765.4    26454.2   3.582  0.01584 * 
## Level3       -15463.3     3535.0  -4.374  0.00719 **
## Level4          890.2      159.8   5.570  0.00257 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20510 on 5 degrees of freedom
## Multiple R-squared:  0.9974, Adjusted R-squared:  0.9953 
## F-statistic: 478.1 on 4 and 5 DF,  p-value: 1.213e-06
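As a side note, the same degree-4 fit can be written without manually creating the power columns, using poly() with raw = TRUE (a sketch; the fitted values should match poly_reg):

# equivalent formulation of the degree-4 model
poly_reg2 <- lm(Salary ~ poly(Level, 4, raw = TRUE), data = dataset)
# summary(poly_reg2)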

 

Let’s plot the linear model:  

# Visualizing the Linear Regression results

dataset <- read.csv("/Users/saina/Desktop/machine learning/Position_Salaries.csv") # data uploaded
dataset <- dataset[,2:3] # no need for categorical variable

# the data set is too small to be worth splitting into training and test sets
# linear:
lin_reg <- lm(formula = Salary ~ .,
             data = dataset)

ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = 'red') +
  geom_line(aes(x = dataset$Level, y = predict(lin_reg, newdata = dataset)),
            colour = 'blue') +
  ggtitle('Truth or Bluff (Linear Regression)') +
  xlab('Level') +
  ylab('Salary')

 

But would a polynomial model improve the prediction?

# Visualizing the Polynomial Regression results
dataset$Level2 <- dataset$Level^2

poly_reg <- lm(formula = Salary ~ .,
              data = dataset)

ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = 'red') +
  geom_line(aes(x = dataset$Level, y = predict(poly_reg, newdata = dataset)),
            colour = 'blue') +
  ggtitle('Truth or Bluff (Polynomial Regression)') +
  xlab('Level') +
  ylab('Salary')

 

What if we add another polynomial term?

# Visualising the Polynomial Regression results (degree 3)

dataset$Level3 <- dataset$Level^3
poly_reg <- lm(formula = Salary ~ .,
              data = dataset)


ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = 'red') +
  geom_line(aes(x = dataset$Level, y = predict(poly_reg, newdata = dataset)),
            colour = 'blue') +
  ggtitle('Truth or Bluff (Polynomial Regression)') +
  xlab('Level') +
  ylab('Salary')

 

OK, but what about going to an even higher-degree model?

dataset$Level4 <- dataset$Level^4
poly_reg <- lm(formula = Salary ~ .,
              data = dataset)


ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = 'red') +
  geom_line(aes(x = dataset$Level, y = predict(poly_reg, newdata = dataset)),
            colour = 'blue') +
  ggtitle('Truth or Bluff (Polynomial Regression)') +
  xlab('Level') +
  ylab('Salary')

 

And what if we want a smoother, higher-resolution curve for this model?

dataset <- read.csv("/Users/saina/Desktop/machine learning/Position_Salaries.csv") # new data uploaded
dataset <- dataset[,2:3] # no need for categorical variable

# a finer grid of levels gives a smoother, more incremental prediction curve
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.1)
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = 'red') +
  geom_line(aes(x = x_grid, y = predict(poly_reg,
                                        newdata = data.frame(Level = x_grid,
                                                             Level2 = x_grid^2,
                                                             Level3 = x_grid^3,
                                                             Level4 = x_grid^4))),
            colour = 'blue') +
  ggtitle('Truth or Bluff (Polynomial Regression)') +
  xlab('Level') +
  ylab('Salary')

 

Let’s compare the predictions of the linear model and our final polynomial model:

# Predicting a new result with Linear Regression
predict(lin_reg, data.frame(Level = 6.5))
##        1 
## 330378.8
# Predicting a new result with Polynomial Regression
predict(poly_reg, data.frame(Level = 6.5,
                             Level2 = 6.5^2,
                             Level3 = 6.5^3,
                             Level4 = 6.5^4))
##        1 
## 158862.5

 

Very different! The polynomial model’s prediction (about 159k) is far closer to the 160k the new hire requested than the linear model’s 330k.

SVR  

dataset <- read.csv("/Users/saina/Desktop/machine learning/Position_Salaries.csv") # new data uploaded
dataset <- dataset[,2:3] # no need for categorical variable

library(e1071)
regressor <- svm(formula = Salary ~ .,
                data = dataset,
                 type = 'eps-regression', # 'eps-regression' is the standard SVR type; classification problems use a classification type instead
                 kernel = 'radial') # Gaussian (RBF) kernel for a non-linear fit

summary(regressor)
## 
## Call:
## svm(formula = Salary ~ ., data = dataset, type = "eps-regression", 
##     kernel = "radial")
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  1 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  6
# Predicting a new result
y_pred<-predict(regressor, data.frame(Level = 6.5))

# Visualising the SVR results
# install.packages('ggplot2')
library(ggplot2)
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = 'red') +
  geom_line(aes(x = dataset$Level, y = predict(regressor, newdata = dataset)),
            colour = 'blue') +
  ggtitle('Truth or Bluff (SVR)') +
  xlab('Level') +
  ylab('Salary')

# Visualising the SVR results (for higher resolution and smoother curve)
# install.packages('ggplot2')
library(ggplot2)
x_grid <- seq(min(dataset$Level), max(dataset$Level), 0.1)
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = 'red') +
  geom_line(aes(x = x_grid, y = predict(regressor, newdata = data.frame(Level = x_grid))),
            colour = 'blue') +
  ggtitle('Truth or Bluff (SVR)') +
  xlab('Level') +
  ylab('Salary')
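The svm() defaults (cost = 1, gamma = 1, epsilon = 0.1) already do well here; if we wanted to tune them, e1071’s tune() runs a simple grid search. A minimal sketch with arbitrary ranges:

# rough hyperparameter grid search for the SVR model
set.seed(123)
tuned <- tune(svm, Salary ~ ., data = dataset,
              ranges = list(cost = c(1, 10, 100), gamma = c(0.1, 0.5, 1)))
# tuned$best.parameters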

 

Quite a good prediction trend!  

Decision Tree

dataset <- read.csv("/Users/saina/Desktop/machine learning/Position_Salaries.csv") # new data uploaded
dataset <- dataset[,2:3] # no need for categorical variable

library(rpart)
regressor = rpart(formula = Salary ~ .,
                  data = dataset,
                  control = rpart.control(minsplit = 1)) # without lowering minsplit, the tree may not split at all and would just return the overall average

# Predicting a new result with Decision Tree Regression
y_pred = predict(regressor, data.frame(Level = 6.5))

# Visualising the Decision Tree Regression results (higher resolution)
# install.packages('ggplot2')
library(ggplot2)
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.01)
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = 'red') +
  geom_line(aes(x = x_grid, y = predict(regressor, newdata = data.frame(Level = x_grid))),
            colour = 'blue') +
  ggtitle('Truth or Bluff (Decision Tree Regression)') +
  xlab('Level') +
  ylab('Salary')

# Plotting the tree
plot(regressor)
text(regressor)
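The base plot() output is fairly bare; if the rpart.plot package is installed, it gives a much cleaner drawing of the same tree (a sketch, assuming the package is available):

# install.packages('rpart.plot')
library(rpart.plot)
rpart.plot(regressor) # nicer rendering of the fitted tree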

 

You can clearly see the intervals (the tree’s terminal leaves) that define each predicted value.

Random Forest

dataset <- read.csv("/Users/saina/Desktop/machine learning/Position_Salaries.csv") # new data uploaded
dataset <- dataset[,2:3] # no need for categorical variable

library(randomForest)
set.seed(1234)
regressor = randomForest(x = dataset[-2],
                         y = dataset$Salary,
                         ntree = 500) # number of trees; worth experimenting with, since more trees give the prediction more (and finer) steps, which can improve it in some cases (see the sketch below)

# Predicting a new result with Random Forest Regression
y_pred = predict(regressor, data.frame(Level = 6.5))

# Visualising the Random Forest Regression results (higher resolution)
# install.packages('ggplot2')
library(ggplot2)
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.01) # a coarser grid would make the step transitions look slanted rather than vertical
ggplot() +
  geom_point(aes(x = dataset$Level, y = dataset$Salary),
             colour = 'red') +
  geom_line(aes(x = x_grid, y = predict(regressor, newdata = data.frame(Level = x_grid))),
            colour = 'blue') +
  ggtitle('Truth or Bluff (Random Forest Regression)') +
  xlab('Level') +
  ylab('Salary')
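To see the effect of ntree mentioned in the comment above, here’s a quick sketch that refits the forest with a few different (arbitrarily chosen) values:

# how the prediction at Level = 6.5 moves with the number of trees
for (n in c(10, 100, 500)) {
  set.seed(1234)
  rf <- randomForest(x = dataset[-2], y = dataset$Salary, ntree = n)
  print(c(ntree = n, prediction = unname(predict(rf, data.frame(Level = 6.5)))))
}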

 

What’s the predicted salary then?  

y_pred
##        1 
## 160907.7

 

As you can see, this model’s prediction (about 161k) is very close to the 160k requested by the new hire!