In this script, I'm presenting an overview of regression models for machine learning in R.
The first step, after loading the libraries, is to read in the data and get to know it a bit.
#---- LIBRARIES ----
library(tidyverse)
#---- READ IN DATA ----
dataset <- read.csv("/Users/saina/Desktop/machine learning/Data (1).csv")
# to get to know the data
head(dataset)
summary(dataset)
## Country Age Salary Purchased
## Length:10 Min. :27.00 Min. :48000 Length:10
## Class :character 1st Qu.:35.00 1st Qu.:54000 Class :character
## Mode :character Median :38.00 Median :61000 Mode :character
## Mean :38.78 Mean :63778
## 3rd Qu.:44.00 3rd Qu.:72000
## Max. :50.00 Max. :83000
## NA's :1 NA's :1
# SANITY CHECK FLAGS
# count the NAs in every column
dataset %>%
  summarise(across(everything(), ~ sum(is.na(.))))
# plotting to get to know the data
dataset %>%
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") + # each facet gets its own axis range
  geom_histogram(bins = 20) +
  # scale_x_continuous(labels = scales::comma) + # to disable scientific notation
  theme_bw()
Machine learning models need clean data sets as inputs. That means we need to take care of NAs and categorical values, since most models do not handle them well. Two of the four variables have missing values, so let's deal with those first.
# quick look for outliers
##range(dataset$Age, na.rm = T)
##range(dataset$Salary, na.rm = T)
# replacing NAs with the column averages. Sometimes the median is used instead, to be robust to outliers, but there seems to be no need for that here.
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)
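As a side note, the same imputation can be written more compactly with dplyr. This is just a sketch of an equivalent approach (assuming dplyr >= 1.0 for across()); the rest of the script does not depend on it.
# dplyr sketch: impute the column mean into every NA of Age and Salary in one step
dataset <- dataset %>%
  mutate(across(c(Age, Salary), ~ replace(., is.na(.), mean(., na.rm = TRUE))))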
Next, we encode the categorical variables as factors with numeric labels.
# encoding categorical data
dataset$Country = factor(dataset$Country,
                         levels = c("France", "Spain", "Germany"),
                         labels = c(1, 2, 3))
dataset$Purchased = factor(dataset$Purchased,
                           levels = c("No", "Yes"),
                           labels = c(0, 1))
Next, we split the data into a training set, which we use to teach the model how the data behaves, and a test set, which we use to evaluate how well the model predicts data it has not seen.
# splitting the data set into the train and test
library(caTools)
set.seed(123) # set a random seed so that the same split can be reproduced in future runs
split = sample.split(dataset$Purchased, SplitRatio = 0.8) # creates a TRUE/FALSE logical vector
training_set = subset(dataset, split == TRUE) # 8 observations
test_set = subset(dataset, split == FALSE)    # 2 observations
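# side note: a base R alternative to sample.split(), shown only as a sketch; the rest of
# the script keeps the caTools version. sample.split() stratifies on the outcome
# (Purchased), while this plain index draw does not.
alt_idx      <- sample(seq_len(nrow(dataset)), size = floor(0.8 * nrow(dataset)))
training_alt <- dataset[alt_idx, ]  # illustrative objects, not used later in the script
test_alt     <- dataset[-alt_idx, ]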
# with Euclidean distance in mind, we need the independent variables on the same scale.
# We can standardize ((x - mean) / sd) or normalize ((x - min) / (max - min)).
# feature scaling
training_set[,2:3] = scale(training_set[,2:3])
test_set[,2:3] = scale(test_set[,2:3])
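One refinement worth mentioning: the two scale() calls above standardize each set with its own mean and standard deviation. To keep the test set on exactly the training scale (and avoid leaking test-set information into preprocessing), we could reuse the training parameters. A minimal sketch, meant as an alternative to the two lines above rather than an addition to them:
# sketch: scale the test set with the training set's centers and spreads
train_scaled <- scale(training_set[, 2:3])
test_set[, 2:3] <- scale(test_set[, 2:3],
                         center = attr(train_scaled, "scaled:center"),
                         scale  = attr(train_scaled, "scaled:scale"))
training_set[, 2:3] <- train_scaled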
Now, let's start modeling.
Let's predict salary based on years of experience. Imagine we have a new hire who told us they were about to hit level 6 at their old company, and they have therefore requested a salary of 160k; we want to make sure that's the right salary for someone between levels 6 and 7 (we'll come back to that scenario with the position-level data further down). We'll start with linear regression models. When building linear regression models we need to be careful about the assumptions, but we won't get into them here.
dataset <- read.csv("/Users/saina/Desktop/machine learning/Salary_Data.csv") # new data uploaded
set.seed(123)
split <- sample.split(dataset$Salary, SplitRatio = 2/3) # 70 percent is usually better but it's fine here since we have 30 observations
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)
# no need for feature scaling here; simple linear regression doesn't require it
Our model will learn the correlation between the two variables in the training set, and then we can predict the salary for new observations (we'll check the quality of those predictions using the test set).
# Fitting Simple Linear Regression to the Training set
regressor <- lm(formula = Salary ~ YearsExperience,
data = training_set)
summary(regressor) # years of experience is highly statistically significant in predicting Salary
##
## Call:
## lm(formula = Salary ~ YearsExperience, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7325.1 -3814.4 427.7 3559.7 8884.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25592 2646 9.672 1.49e-08 ***
## YearsExperience 9365 421 22.245 1.52e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5391 on 18 degrees of freedom
## Multiple R-squared: 0.9649, Adjusted R-squared: 0.963
## F-statistic: 494.8 on 1 and 18 DF, p-value: 1.524e-14
# Predicting the Test set results
y_pred <- predict(regressor, newdata = test_set)
# now let's see how the model did
# visualizing the training set regression results
ggplot() +
geom_point(aes(x = training_set$YearsExperience, y = training_set$Salary),
colour = 'red') +
geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
colour = 'blue') +
ggtitle("Salary vs Experience") +
xlab("Years of experience") +
ylab("Salary")
# visualizing the test set regression results
ggplot() +
geom_point(aes(x = test_set$YearsExperience, y = test_set$Salary),
colour = 'red') +
geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)), # keeping what we trained on
colour = 'blue') +
ggtitle("Salary vs Experience") +
xlab("Years of experience") +
ylab("Salary")
Next, we will build a model to help a client assess where, and in which kind of company, they should invest to maximize profit:
dataset <- read.csv("/Users/saina/Desktop/machine learning/50_Startups.csv") # new data uploaded
# encoding categorical data
dataset$State = factor(dataset$State,
levels = c('New York', 'California', 'Florida'),
labels = c(1, 2, 3))
set.seed(123)
split <- sample.split(dataset$Profit, SplitRatio = 0.8) #
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)
# fitting Multiple Linear Regression to the training set
regressor <- lm(formula = Profit ~ .,
data = training_set)
summary(regressor) # the only significant predictor is R&D spend
##
## Call:
## lm(formula = Profit ~ ., data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33128 -4865 5 6098 18065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.965e+04 7.637e+03 6.501 1.94e-07 ***
## R.D.Spend 7.986e-01 5.604e-02 14.251 6.70e-16 ***
## Administration -2.942e-02 5.828e-02 -0.505 0.617
## Marketing.Spend 3.268e-02 2.127e-02 1.537 0.134
## State2 1.213e+02 3.751e+03 0.032 0.974
## State3 2.376e+02 4.127e+03 0.058 0.954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9908 on 34 degrees of freedom
## Multiple R-squared: 0.9499, Adjusted R-squared: 0.9425
## F-statistic: 129 on 5 and 34 DF, p-value: < 2.2e-16
# predicting the test set results
y_pred <- predict(regressor, newdata = test_set)
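Before simplifying the model, it can help to eyeball the predictions against the actual profits in the test set; a quick sketch:
# sketch: compare predicted and actual profit for the test observations
data.frame(actual = test_set$Profit,
           predicted = y_pred,
           error = test_set$Profit - y_pred)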
But let's do backward elimination to find the most parsimonious model:
# backward elimination, step by step
# no need to worry about the dummy variable trap; R takes care of that for us
regressor <- lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
data = training_set)
summary(regressor)
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend +
## State, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33128 -4865 5 6098 18065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.965e+04 7.637e+03 6.501 1.94e-07 ***
## R.D.Spend 7.986e-01 5.604e-02 14.251 6.70e-16 ***
## Administration -2.942e-02 5.828e-02 -0.505 0.617
## Marketing.Spend 3.268e-02 2.127e-02 1.537 0.134
## State2 1.213e+02 3.751e+03 0.032 0.974
## State3 2.376e+02 4.127e+03 0.058 0.954
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9908 on 34 degrees of freedom
## Multiple R-squared: 0.9499, Adjusted R-squared: 0.9425
## F-statistic: 129 on 5 and 34 DF, p-value: < 2.2e-16
regressor <- lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
data = training_set) # dropping the one with the highest p value
summary(regressor)
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
## data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33117 -4858 -36 6020 17957
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.970e+04 7.120e+03 6.980 3.48e-08 ***
## R.D.Spend 7.983e-01 5.356e-02 14.905 < 2e-16 ***
## Administration -2.895e-02 5.603e-02 -0.517 0.609
## Marketing.Spend 3.283e-02 1.987e-02 1.652 0.107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9629 on 36 degrees of freedom
## Multiple R-squared: 0.9499, Adjusted R-squared: 0.9457
## F-statistic: 227.6 on 3 and 36 DF, p-value: < 2.2e-16
regressor <- lm(formula = Profit ~ R.D.Spend + Marketing.Spend,
data = training_set) # dropping the one with the highest p value
summary(regressor)
##
## Call:
## lm(formula = Profit ~ R.D.Spend + Marketing.Spend, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33294 -4763 -354 6351 17693
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.638e+04 3.019e+03 15.364 <2e-16 ***
## R.D.Spend 7.879e-01 4.916e-02 16.026 <2e-16 ***
## Marketing.Spend 3.538e-02 1.905e-02 1.857 0.0713 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9533 on 37 degrees of freedom
## Multiple R-squared: 0.9495, Adjusted R-squared: 0.9468
## F-statistic: 348.1 on 2 and 37 DF, p-value: < 2.2e-16
# Marketing.Spend is still above the 0.05 threshold, so we should remove it as well.
So the final model is:
regressor <- lm(formula = Profit ~ R.D.Spend,
data = training_set) # dropping the one with the highest p value
summary(regressor)
##
## Call:
## lm(formula = Profit ~ R.D.Spend, data = training_set)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34334 -4894 -340 6752 17147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.902e+04 2.748e+03 17.84 <2e-16 ***
## R.D.Spend 8.563e-01 3.357e-02 25.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9836 on 38 degrees of freedom
## Multiple R-squared: 0.9448, Adjusted R-squared: 0.9434
## F-statistic: 650.8 on 1 and 38 DF, p-value: < 2.2e-16
And if we’re interested in an automated process for this, here’s a function for backward elimination in multiple regression:
# automated backward elimination: x is the data set (with Profit as the dependent
# variable) and sl is the significance level; regressor is the model refitted at each step
backwardElimination <- function(x, sl) {
  numVars = length(x)
  for (i in c(1:numVars)) {
    regressor = lm(formula = Profit ~ ., data = x)
    # largest p value among the remaining predictors
    maxVar = max(coef(summary(regressor))[c(2:numVars), "Pr(>|t|)"])
    if (maxVar > sl) {
      # drop the predictor with the largest p value, then refit on the next pass
      j = which(coef(summary(regressor))[c(2:numVars), "Pr(>|t|)"] == maxVar)
      x = x[, -j]
    }
    numVars = numVars - 1
  }
  return(summary(regressor))
}
SL = 0.05
dataset = dataset[, c(1,2,3,4,5)]
backwardElimination(training_set, SL)
##
## Call:
## lm(formula = Profit ~ ., data = x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34334 -4894 -340 6752 17147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.902e+04 2.748e+03 17.84 <2e-16 ***
## R.D.Spend 8.563e-01 3.357e-02 25.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9836 on 38 degrees of freedom
## Multiple R-squared: 0.9448, Adjusted R-squared: 0.9434
## F-statistic: 650.8 on 1 and 38 DF, p-value: < 2.2e-16
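For comparison, R also ships with step(), which automates backward selection using AIC rather than p-values. A minimal sketch on the same training set (the variables it keeps may differ slightly from the p-value-based result above):
# sketch: backward selection by AIC with the built-in step() function
full_model <- lm(Profit ~ ., data = training_set)
step_model <- step(full_model, direction = "backward", trace = FALSE)
summary(step_model)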
Let's estimate a person's salary using data about position levels. We'll teach our model the correlation between level and salary.
dataset <- read.csv("/Users/saina/Desktop/machine learning/Position_Salaries.csv") # new data uploaded
dataset <- dataset[,2:3] # no need for categorical variable
# the data is too tiny so we won't be doing the train/test sets
# linear:
lin_reg <- lm(formula = Salary ~ .,
data = dataset)
# summary(lin_reg)
# fitting Polynomial Regression to the data set
dataset$Level2 <- dataset$Level^2
# adding higher-degree terms to the model
dataset$Level3 <- dataset$Level^3
dataset$Level4 <- dataset$Level^4
poly_reg <- lm(formula = Salary ~ .,
data = dataset)
summary(poly_reg)
##
## Call:
## lm(formula = Salary ~ ., data = dataset)
##
## Residuals:
## 1 2 3 4 5 6 7 8 9 10
## -8357 18240 1358 -14633 -11725 6725 15997 10006 -28695 11084
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 184166.7 67768.0 2.718 0.04189 *
## Level -211002.3 76382.2 -2.762 0.03972 *
## Level2 94765.4 26454.2 3.582 0.01584 *
## Level3 -15463.3 3535.0 -4.374 0.00719 **
## Level4 890.2 159.8 5.570 0.00257 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20510 on 5 degrees of freedom
## Multiple R-squared: 0.9974, Adjusted R-squared: 0.9953
## F-statistic: 478.1 on 4 and 5 DF, p-value: 1.213e-06
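As a side note, instead of building Level2 through Level4 by hand, the same quartic fit can be expressed with poly(); a sketch (raw = TRUE reproduces the plain powers used above, so the fit is the same):
# sketch: equivalent quartic model without manual power columns
poly_reg2 <- lm(Salary ~ poly(Level, 4, raw = TRUE), data = dataset)
predict(poly_reg2, data.frame(Level = 6.5)) # new data only needs a Level column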
Let’s plot the linear model:
# Visualizing the Linear Regression results
dataset <- read.csv("/Users/saina/Desktop/machine learning/Position_Salaries.csv") # data uploaded
dataset <- dataset[,2:3] # no need for categorical variable
# the data is too tiny so we won't be doing the train/test sets
# linear:
lin_reg <- lm(formula = Salary ~ .,
data = dataset)
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary),
colour = 'red') +
geom_line(aes(x = dataset$Level, y = predict(lin_reg, newdata = dataset)),
colour = 'blue') +
ggtitle('Truth or Bluff (Linear Regression)') +
xlab('Level') +
ylab('Salary')
But would a polynomial model improve the prediction?
# Visualizing the Polynomial Regression results
dataset$Level2 <- dataset$Level^2
poly_reg <- lm(formula = Salary ~ .,
data = dataset)
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary),
colour = 'red') +
geom_line(aes(x = dataset$Level, y = predict(poly_reg, newdata = dataset)),
colour = 'blue') +
ggtitle('Truth or Bluff (Polynomial Regression)') +
xlab('Level') +
ylab('Salary')
What if we add another polynomial term?
# Visualising the Polynomial Regression results with a cubic term added
dataset$Level3 <- dataset$Level^3
poly_reg <- lm(formula = Salary ~ .,
data = dataset)
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary),
colour = 'red') +
geom_line(aes(x = dataset$Level, y = predict(poly_reg, newdata = dataset)),
colour = 'blue') +
ggtitle('Truth or Bluff (Polynomial Regression)') +
xlab('Level') +
ylab('Salary')
OK, but what about going to an even higher-degree model?
dataset$Level4 <- dataset$Level^4
poly_reg <- lm(formula = Salary ~ .,
data = dataset)
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary),
colour = 'red') +
geom_line(aes(x = dataset$Level, y = predict(poly_reg, newdata = dataset)),
colour = 'blue') +
ggtitle('Truth or Bluff (Polynomial Regression)') +
xlab('Level') +
ylab('Salary')
And finally, let's look at the degree-4 fit at a finer resolution, predicting over a denser grid of levels:
dataset <- read.csv("/Users/saina/Desktop/machine learning/Position_Salaries.csv") # new data uploaded
dataset <- dataset[,2:3] # no need for categorical variable
# build a finer grid of levels to get a smoother prediction curve
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.1)
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary),
colour = 'red') +
geom_line(aes(x = x_grid, y = predict(poly_reg,
newdata = data.frame(Level = x_grid,
Level2 = x_grid^2,
Level3 = x_grid^3,
Level4 = x_grid^4))),
colour = 'blue') +
ggtitle('Truth or Bluff (Polynomial Regression)') +
xlab('Level') +
ylab('Salary')
Let’s compare the predictions of the linear model with our final model:
# Predicting a new result with Linear Regression
predict(lin_reg, data.frame(Level = 6.5))
## 1
## 330378.8
# Predicting a new result with Polynomial Regression
predict(poly_reg, data.frame(Level = 6.5,
Level2 = 6.5^2,
Level3 = 6.5^3,
Level4 = 6.5^4))
## 1
## 158862.5
Very different! Next, let's fit a support vector regression (SVR) model to the same data.
dataset <- read.csv("/Users/saina/Desktop/machine learning/Position_Salaries.csv") # new data uploaded
dataset <- dataset[,2:3] # no need for categorical variable
library(e1071)
regressor <- svm(formula = Salary ~ .,
                 data = dataset,
                 type = 'eps-regression', # machine type: eps-regression for SVR; classification would use e.g. C-classification
                 kernel = 'radial')       # radial (Gaussian) kernel gives a non-linear fit
summary(regressor)
##
## Call:
## svm(formula = Salary ~ ., data = dataset, type = "eps-regression",
## kernel = "radial")
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1
## gamma: 1
## epsilon: 0.1
##
##
## Number of Support Vectors: 6
# Predicting a new result
y_pred<-predict(regressor, data.frame(Level = 6.5))
# Visualising the SVR results
# install.packages('ggplot2')
library(ggplot2)
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary),
colour = 'red') +
geom_line(aes(x = dataset$Level, y = predict(regressor, newdata = dataset)),
colour = 'blue') +
ggtitle('Truth or Bluff (SVR)') +
xlab('Level') +
ylab('Salary')
# Visualising the SVR results (for higher resolution and smoother curve)
# install.packages('ggplot2')
library(ggplot2)
x_grid <- seq(min(dataset$Level), max(dataset$Level), 0.1)
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary),
colour = 'red') +
geom_line(aes(x = x_grid, y = predict(regressor, newdata = data.frame(Level = x_grid))),
colour = 'blue') +
ggtitle('Truth or Bluff (SVR)') +
xlab('Level') +
ylab('Salary')
Quite a good prediction trend!
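If we wanted to tune the SVR rather than rely on the defaults, e1071 provides tune() for a small grid search over cost and gamma; a sketch, with purely illustrative grid values:
# sketch: grid search over SVR hyperparameters (grid values chosen for illustration only)
set.seed(123)
tuned <- tune(svm, Salary ~ ., data = dataset,
              ranges = list(cost = c(1, 10, 100), gamma = c(0.1, 0.5, 1)))
summary(tuned)
tuned$best.model
With that aside, let's move on to a decision tree regression on the same data.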
dataset <- read.csv("/Users/saina/Desktop/machine learning/Position_Salaries.csv") # new data uploaded
dataset <- dataset[,2:3] # no need for categorical variable
library(rpart)
regressor = rpart(formula = Salary ~ .,
                  data = dataset,
                  control = rpart.control(minsplit = 1)) # without this control the tree may not split at all and would just predict the overall average; tuning it is part of improving model performance
# Predicting a new result with Decision Tree Regression
y_pred = predict(regressor, data.frame(Level = 6.5))
# Visualising the Decision Tree Regression results (higher resolution)
# install.packages('ggplot2')
library(ggplot2)
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.01)
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary),
colour = 'red') +
geom_line(aes(x = x_grid, y = predict(regressor, newdata = data.frame(Level = x_grid))),
colour = 'blue') +
ggtitle('Truth or Bluff (Decision Tree Regression)') +
xlab('Level') +
ylab('Salary')
# Plotting the tree
plot(regressor)
text(regressor)
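If the base plot is hard to read, the rpart.plot package (assuming it is installed) draws a cleaner version of the same tree; a sketch:
# sketch: a tidier tree plot
# install.packages('rpart.plot')
library(rpart.plot)
rpart.plot(regressor)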
You can clearly see the intervals that define each predicted value. Finally, let's fit a random forest regression, which averages the predictions of many trees:
dataset <- read.csv('Position_Salaries.csv')
dataset <- dataset[2:3]
library(randomForest)
set.seed(1234)
regressor = randomForest(x = dataset[-2],
                         y = dataset$Salary,
                         ntree = 500) # number of trees; worth experimenting with. More trees give the prediction more steps, which can improve it in some cases
# Predicting a new result with Random Forest Regression
y_pred = predict(regressor, data.frame(Level = 6.5))
# Visualising the Random Forest Regression results (higher resolution)
# install.packages('ggplot2')
library(ggplot2)
x_grid = seq(min(dataset$Level), max(dataset$Level), 0.01) # with a coarser grid, the steps in the plotted curve would no longer look vertical
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary),
colour = 'red') +
geom_line(aes(x = x_grid, y = predict(regressor, newdata = data.frame(Level = x_grid))),
colour = 'blue') +
ggtitle('Truth or Bluff (Random Forest Regression)') +
xlab('Level') +
ylab('Salary')
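As the ntree comment above suggests, it is worth experimenting with the number of trees. A quick sketch comparing the level-6.5 prediction across forest sizes (the exact numbers will depend on the seed):
# sketch: how the prediction at level 6.5 moves as the forest grows
sapply(c(10, 100, 500), function(n) {
  set.seed(1234)
  rf <- randomForest(x = dataset[-2], y = dataset$Salary, ntree = n)
  predict(rf, data.frame(Level = 6.5))
})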
What’s the predicted salary then?
y_pred
## 1
## 160907.7
As you can see, this model's prediction is much closer to the 160k requested by the new hire!