Data Preprocessing
Importing the dataset
dataset = read.csv('Data.csv')
Training set and test set
## 1
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.7)
training.set = subset(dataset, split == T)
test.set = subset(dataset, split == F)
## 2
# set.seed(123)
# split = createDataPartition(dataset$Purchased, p = 0.7, list = F)
# training = dataset[split, ]
# testing = dataset[-split, ]
- Change the split target to the dependent variable of the model.
- The training set is used to build the model; the test set is used to evaluate it and should be touched only once, after the model is finalised.
Encoding categorical data
# Recode variables misidentified as numeric into categorical (factor) variables
summary(dataset)
dataset$Country = factor(dataset$Country,
levels = c('France', 'Spain', 'Germany'),
labels = c(1, 2, 3))
dataset$Purchased = factor(dataset$Purchased,
levels = c('No', 'Yes'),
labels = c(0, 1))
# Create dummy variable from categorical (factor) variable (one to many)
# dummy = dummyVars( ~ am, data = mtcars)
# dataset = predict(dummy, newdata = mtcars) %>% data.frame(mtcars[, -c(9)])
# Convert a continuous variable into a categorical (factor) variable (binning)
# cut2(dataset$continuous.numeric.variable, g = number of groups) # Hmisc
- Turn character variables into factor variables so the ML algorithms can handle them.
Taking care of missing data
# Total number of missing values
sum(is.na(dataset))
# Individual number in column of missing data
colSums(is.na(dataset))
# Replacing missing data
dataset$Age = ifelse(is.na(dataset$Age),
ave(dataset$Age,
FUN = function(x) mean(x, na.rm = T)),
dataset$Age)
# Imputing missing data (recommended)
# miss = preProcess(dataset[, -4], method = "knnImpute")
# dataset = predict(miss, dataset[, c(-4)]) %>% data.frame(dataset$Purchased)
# Removing variables (columns) that are mostly NA
# all.na = sapply(dataset, function(x) mean(is.na(x))) > 0.95
# dataset = dataset[, all.na == F]
# Removing observations (rows) that have NA
# dataset = na.omit(dataset)
- Check which columns contain NA values.
- Mean imputation applies to numerical variables. Use Ctrl+F to find and replace the column name when repeating the step, and Ctrl+I to re-indent the code.
Variable selection
# Checking variables with nearly zero variance
# nzv = nearZeroVar(dataset, saveMetrics = T); nzv
# Removing variables with nearly zero variance
# nzv = nearZeroVar(dataset)
# dataset = dataset[, -nzv]
# Remove identification-only variables by selecting columns with index numbers
# dataset = dataset[, c(?) or -c(?)]
# Select columns by name with a regular expression
# dataset = dataset[, grep("regex", names(dataset))]
- Identify variables with marginal variability that are unlikely to be good predictors.
- If the near-zero-variance (nzv) column shows TRUE, the variable can be dropped because its values are nearly constant.
- Regular expression (regex) cheatsheet: https://tinyurl.com/y4obdv4p
- For example, "^a" matches apple, acer, a_type; "b$" matches bomb, type_b; "^a|b" matches a_type, b_type. (https://tinyurl.com/yxhjargn)
Feature scaling (standardization) (optional)
## 1
# dataset[, -ncol(dataset)] = scale(dataset[, -ncol(dataset)])
## 2
# scale = preProcess(dataset[, names(dataset) != "y"], method = c("center", "scale")) # or method = "BoxCox"
# dataset = predict(scale, dataset[, names(dataset) != "y"]) %>% data.frame(y = dataset$y)
- Standardization = (x - mean(x)) / sd(x). It rescales the data to a mean of 0 and a standard deviation of 1.
- Normalization = (x - min(x)) / diff(range(x)). It rescales the data into the range [0, 1]. (See the sketch below.)
- The dependent variable does not need scaling; all independent variables should be put on the same scale, with one rule: scale the continuous variables and exclude the categorical ones.
- The scaling and preprocessing parameters learned on the training set must be applied to the test set as well.
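- A minimal sketch of both formulas on a made-up numeric vector (the values below are hypothetical):
# x = c(25, 30, 35, 48, 52)
# (x - mean(x)) / sd(x)         # standardization: mean 0, sd 1
# (x - min(x)) / diff(range(x)) # normalization: range [0, 1]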
Exploring & analyzing data
## 1
# library('GGally')
# p1 = ggpairs(data = training.set, # data source
# columns = c("var", ...), # select columns
# mapping = aes(col = var), # color
# lower = list(continuous = wrap('smooth', method = 'lm'))) # lm line
## 2
# featurePlot(x = train[, c("id.var.1", "id.var.2", "id.var.3")],
# y = train$dependent.var,
# plot = "pairs")
## 3
# qplot(data = train, x = in.var.1, y = d.var, col|fill = in.var.2, size = in.var.3, geom = c("boxplot", "jitter", "density"))
## 4
# corMatrix = cor(train)
# corrplot(corMatrix, order = "FPC", method = "color", type = "lower",
# tl.cex = 0.8, tl.col = rgb(0, 0, 0))
## 5
# table(row.variable, column.variable)
# prop.table(table(row.variable, column.variable), margin = 1 or 2) # 1: proportions within rows, 2: within columns
- Do EDA only on the training set; do not use the test set for exploration.
- Things to look for: imbalance, outliers, skewed distributions.
- Plots and tables are the main tools for EDA (see the sketch below for the ## 5 tables).
- If many predictors are highly correlated, a principal component analysis (PCA) can be performed as a preprocessing step for a more compact analysis.
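- A minimal sketch of the ## 5 tables, assuming the Data.csv training.set from above (Country and Purchased are just the columns used earlier):
# tab = table(training.set$Country, training.set$Purchased) # counts: rows = Country, columns = Purchased
# prop.table(tab, margin = 1) # proportions within each row (Country)
# prop.table(tab, margin = 2) # proportions within each column (Purchased)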
Dimensionality reduction - PCA (optional)
# library('caret')
# pca = preProcess(x = training[, -ncol(training)],
# method = 'pca',
# pcaComp = 2 | thresh = 0.8 or 0.9)
# training = predict(pca, training)
# training = training[, c(pc1, pc2, y)]
# testing = predict(pca, testing)
# testing = testing[, c(pc1, pc2, y)]
- The basic PCA idea is that we might not need every variable (predictor, independent variable).
- PCA is a weighted combination of predictors.
- We pick this combination to capture the most information (variability).
- Good: reduces the number of predictors and reduces noise (due to averaging).
- Bad: loses interpretability (the components are hard to explain).
- It is easily affected by outliers, so doing EDA first is a good way to mitigate this weakness.
- More dimensionality reduction examples appear at the end (dimensionality reduction section).
Regression
- Regression predicts continuous data and can be evaluated with root mean squared error (RMSE) and R-squared (see the sketch below).
- It is supervised learning: there is a label (dependent variable, the correct answer).
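- A minimal sketch of both metrics computed by hand on hypothetical actual/predicted vectors:
# actual = c(39343, 46205, 37731, 43525)    # made-up salaries
# predicted = c(40000, 45000, 39000, 42000)
# sqrt(mean((actual - predicted)^2))                                # RMSE
# 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)  # R-squared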
Simple linear regression
## 1
# Data preprocessing
rm(list = ls())
dataset = read.csv('Salary_Data.csv')
set.seed(123)
split = sample.split(dataset$Salary, SplitRatio = 2/3)
training.set = subset(dataset, split == T)
test.set = subset(dataset, split == F)
# Model fitting
mod = lm(Salary ~ YearsExperience, data = training.set)
summary(mod)
# Causation
confint(mod, level = 0.95) # estimate of interval
data.frame(summary(mod)$coef[, 1], confint(mod, level = 0.95)) # estimate and its interval
# Prediction
predict(mod, newdata = test.set,
interval = 'confidence',
level = 0.95) # linear-combination of interval
predict(mod, newdata = test.set) # vector
y.pred = augment(mod, newdata = test.set) # matrix
# Error rate evaluating (RMSE)
library('forecast')
accuracy(y.pred$.fitted, y.pred$Salary) # RMSE on test 6553
accuracy(augment(mod, newdata = training.set)$.fitted, augment(mod, newdata = training.set)$Salary) # RMSE on training 5114
# Model visualising
plot.training = ggplot() +
geom_point(aes(x = training.set$YearsExperience,
y = training.set$Salary),
col = 'red') +
geom_line(aes(x = training.set$YearsExperience,
y = predict(mod, newdata = training.set)),
col = 'blue') +
labs(title = 'Salary vs Experience',
subtitle = 'Training set',
x = 'Years of experience',
y = 'Salary')
plot.test = ggplot() +
geom_point(aes(x = test.set$YearsExperience,
y = test.set$Salary),
col = 'red') +
geom_line(aes(x = training.set$YearsExperience,
y = predict(mod, newdata = training.set)),
col = 'blue') +
labs(title = 'Salary vs Experience',
subtitle = 'Test set',
x = 'Years of experience',
y = 'Salary')
grid.arrange(plot.training, plot.test, nrow = 1)
## 2
# fit = train(y ~ x, data = dataset, method = "lm")
# mod = fit$finalModel
- Feature scaling does not need to be applied manually; linear regression is unaffected by the scale of the predictors.
Multiple linear regression
## 1
# Data preprocessing
rm(list = ls())
dataset = read.csv('50_Startups.csv')
dataset$State = factor(dataset$State,
levels = c('New York',
'California',
'Florida'),
labels = c(1, 2, 3))
set.seed(123)
split = sample.split(dataset$Profit, SplitRatio = 0.8)
training.set = subset(dataset, split == T)
test.set = subset(dataset, split == F)
# Model fitting
# Variable selection
# All-in
mod.all = lm(Profit ~ ., data = training.set)
summary(mod.all)
sc.all = glance(mod.all)
# Backward elimination
mod.back = step(mod.all, direction = 'backward')
summary(mod.back)
sc.back = glance(mod.back)
# Forward selection
mod.null = lm(Profit ~ 1, data = training.set)
mod.for = step(mod.null, scope = list(lower = mod.null,
upper = mod.all),
direction = 'forward')
summary(mod.for)
sc.for = glance(mod.for)
# Stepwise regression
mod.step = step(mod.all, direction = 'both')
summary(mod.step)
sc.step = glance(mod.step)
# Finalised model
sc = data.frame('type' = c('All-in',
'Backward elimination',
'Forward selection',
'Stepwise regression'),
'adj.r.squared' = c(sc.all$adj.r.squared,
sc.back$adj.r.squared,
sc.for$adj.r.squared,
sc.step$adj.r.squared),
'AIC' = c(sc.all$AIC,
sc.back$AIC,
sc.for$AIC,
sc.step$AIC),
'BIC' = c(sc.all$BIC,
sc.back$BIC,
sc.for$BIC,
sc.step$BIC))
mod = lm(Profit ~ R.D.Spend + Marketing.Spend, data = training.set)
summary(mod)
# Checking assumptions (optional)
# {r, fig.height = 5, fig.width = 5} # chunk options for a better, square shape
par(mfrow = c(2, 2)); plot(mod)
# Checking others (optional)
library('car')
vif(mod) # variance inflation factor
varImp(mod, decreasing = T) # variable importance for regression and classification
# Prediction
predict(mod, newdata = test.set)
y.pred = augment(mod, newdata = test.set)
## 2
# fit = train(data = training, preProcess = c("center", "scale"), y ~ ., method = "glm" | "lm") # model fitting
# mod = fit$finalModel # model final
# pred = predict(mod, newdata = testing); pred # prediction
# postResample(pred, testing$y) # evaluation (RMSE, R-squared)
- Feature scaling does not need to be applied manually; linear regression is unaffected by the scale of the predictors.
- It automatically identifies categorical variables and expands them into dummy variables, dropping the base level to prevent exact collinearity (multicollinearity); see the sketch after this list.
- Look for the common agreement across all the variable selection methods.
- Check that the assumptions on the error term (residuals) hold:
- Residuals vs Fitted: homoscedasticity (constant error variance)
- Normal Q-Q: normal distribution (error mean = 0)
- Residuals vs Leverage: no influential outliers
- ACF: no serial correlation (autocorrelation) (error covariance = 0)
- The finalised model's summary gives the causation and its predict() gives the prediction.
- The regression modeling process: theory building, data collection, data cleaning, training/test split, variable selection (feature selection), improving by goodness of fit, checking the finalised model's assumptions, evaluating on the test set (neither underfit nor overfit), and causation & prediction with the finalised model.
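- A minimal sketch of the dummy-variable expansion mentioned above, using the 50_Startups training.set already loaded:
# head(model.matrix(Profit ~ State, data = training.set)) # only State2 and State3 appear; State1 is the dropped base level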
Polynomial non-linear regression
# Data preprocessing
rm(list = ls())
dataset = read.csv('Position_Salaries.csv')
dataset = dataset[, c(2, 3)]
# Model fitting
mod.nlm = lm(Salary ~ Level + I(Level^2) + I(Level^3) + I(Level^4),
data = dataset)
summary(mod.nlm)
mod.lm = lm(Salary ~ Level,
data = dataset)
summary(mod.lm)
# Model visualising
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary)) +
geom_line(aes(x = dataset$Level,
y = predict(mod.lm, newdata = dataset)),
col = 'red') +
labs(title = 'Simple linear regression',
subtitle = 'SLR',
x = 'Level',
y = 'Salary') +
geom_point(aes(x = 6.5,
y = predict(mod.lm,
newdata = data.frame(Level = 6.5))),
shape = 18, size = 3) +
annotate(geom = 'text',
x = 7.2,
y = predict(mod.lm,
newdata = data.frame(Level = 6.5)),
label = as.integer
(predict(mod.lm,
newdata = data.frame(Level = 6.5))))
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary)) +
geom_line(aes(x = dataset$Level,
y = predict(mod.nlm, newdata = dataset)),
col = 'red') +
labs(title = 'Polynomial non-linear regression',
subtitle = 'PNLR',
x = 'Level',
y = 'Salary') +
geom_point(aes(x = 6.5,
y = predict(mod.nlm,
newdata = data.frame(Level = 6.5))),
shape = 18, size = 3) +
annotate(geom = 'text',
x = 7.2,
y = predict(mod.nlm,
newdata = data.frame(Level = 6.5)),
label = as.integer
(predict(mod.nlm,
newdata = data.frame(Level = 6.5))))
- Feature scaling does not need to be applied manually; the model is unaffected by the scale of the predictors.
- It works on non-linear problems.
Support vector regression
# Data preprocessing
rm(list = ls())
dataset = read.csv('Position_Salaries.csv')
dataset = dataset[, c(2, 3)]
# Model fitting
library('e1071')
mod = svm(Salary ~ .,
data = dataset,
type = 'eps-regression',
kernel = 'linear')
summary(mod)
# Model visualising
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary)) +
geom_line(aes(x = dataset$Level,
y = predict(mod, newdata = dataset)),
col = 'red') +
labs(title = 'Support vector regression',
subtitle = 'SVR-L',
x = 'Level',
y = 'Salary') +
geom_point(aes(x = 6.5,
y = predict(mod,
newdata = data.frame(Level = 6.5))),
shape = 18, size = 3) +
annotate(geom = 'text',
x = 7.2,
y = predict(mod,
newdata = data.frame(Level = 6.5)),
label = as.integer
(predict(mod,
newdata = data.frame(Level = 6.5))))
- It is a linear regression model.
- Feature scaling does not need to be applied manually; svm() scales the inputs by default (scale = TRUE).
- Support vector regression is for regression; support vector machine is for classification.
- A linear kernel gives a linear model; other kernel types give non-linear models.
- It is not biased by outliers.
Kernel support vector regression
# Data preprocessing
rm(list = ls())
dataset = read.csv('Position_Salaries.csv')
dataset = dataset[, c(2, 3)]
# Model fitting
library('e1071')
mod = svm(Salary ~ .,
data = dataset,
type = 'eps-regression',
kernel = 'radial')
summary(mod)
# Model visualising
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary)) +
geom_line(aes(x = dataset$Level,
y = predict(mod, newdata = dataset)),
col = 'red') +
labs(title = 'Kernel support vector regression - non-linear',
subtitle = 'Kernel SVR-NL',
x = 'Level',
y = 'Salary') +
geom_point(aes(x = 6.5,
y = predict(mod,
newdata = data.frame(Level = 6.5))),
shape = 18, size = 3) +
annotate(geom = 'text',
x = 7.2,
y = predict(mod,
newdata = data.frame(Level = 6.5)),
label = as.integer
(predict(mod,
newdata = data.frame(Level = 6.5))))
- It is a non-linear regression model.
- Feature scaling does not need to be applied manually; svm() scales the inputs by default (scale = TRUE).
- The process maps the data into a kernel function space, fits the regression there, and projects it back to the original space to obtain the non-linear fit.
- It is not biased by outliers.
Decision tree regression
# Data preprocessing
rm(list = ls())
dataset = read.csv('Position_Salaries.csv')
dataset = dataset[, c(2, 3)]
# Model fitting
library('rpart')
set.seed(123)
mod = rpart(Salary ~ .,
data = dataset,
method = 'anova',
control = rpart.control(minsplit = 1,
xval = 10))
# Tree pruning
plotcp(mod)
# mod = prune(mod, cp = XXX) # from cptable
printcp(mod)
# Generating the tree and the rule
library('rpart.plot')
options(scipen = 999)
rpart.rules(mod, cover = T, roundint = F)
prp(mod, roundint = F, type = 3, extra = 101, under = T, digits = -2,
box.palette = 'auto')
# Model visualising
x.grid = seq(min(dataset$Level), max(dataset$Level), 0.001)
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary)) +
geom_line(aes(x = x.grid,
y = predict(mod,
newdata = data.frame(Level = x.grid))),
col = 'red') +
labs(title = 'Decision tree regression',
subtitle = 'CART-R',
x = 'Level',
y = 'Salary') +
geom_point(aes(x = 6.5,
y = predict(mod,
newdata = data.frame(Level = 6.5))),
shape = 18, size = 3) +
annotate(geom = 'text',
x = 7.2,
y = predict(mod,
newdata = data.frame(Level = 6.5)),
label = as.integer
(predict(mod,
newdata = data.frame(Level = 6.5))))
- It is a non-linear regression model, and it also works on linear problems.
- It does not need feature scaling, since it does not rely on Euclidean distance.
- It is a non-continuous (piecewise constant) model, so x.grid is used to draw the step-shaped predictions densely.
- The tree and the rules are more useful in decision tree classification.
- The cptable points to two trees of interest: the minimum-error tree has the lowest xerror on the cross-validated data; the best pruned tree is the smallest tree within one xstd of that, which adds a bonus for simplicity.
- Passing cp (the complexity parameter) to prune() builds the pruned tree; choose the cp value whose point sits under the dashed line in plotcp(mod) (see the sketch below).
- It is good at interpretability.
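- A minimal sketch of picking cp from the cptable with the minimum-xerror rule (the one-xstd rule is the alternative), assuming the mod object above:
# best.cp = mod$cptable[which.min(mod$cptable[, 'xerror']), 'CP']
# mod.pruned = prune(mod, cp = best.cp)
# printcp(mod.pruned)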
Random forest regression
# Data preprocessing
rm(list = ls())
dataset = read.csv('Position_Salaries.csv')
dataset = dataset[, c(2, 3)]
# Model fitting
library('randomForest')
set.seed(123)
mod = randomForest(Salary ~ .,
data = dataset,
ntree = 500)
# Variable importance
importance(mod)
varImpPlot(mod,
main = 'RF-R_Variable contribution')
# Model visualising
x.grid = seq(min(dataset$Level), max(dataset$Level), 0.001)
ggplot() +
geom_point(aes(x = dataset$Level, y = dataset$Salary)) +
geom_line(aes(x = x.grid,
y = predict(mod,
newdata = data.frame(Level = x.grid))),
col = 'red') +
labs(title = 'Random forest regression',
subtitle = 'RF-R',
x = 'Level',
y = 'Salary') +
geom_point(aes(x = 6.5,
y = predict(mod,
newdata = data.frame(Level = 6.5))),
shape = 18, size = 3) +
annotate(geom = 'text',
x = 7.2,
y = predict(mod,
newdata = data.frame(Level = 6.5)),
label = as.integer
(predict(mod,
newdata = data.frame(Level = 6.5))))
- It is ensemble learning: taking multiple algorithms, or the same algorithm multiple times, and combining them to make something more powerful than the original.
- Averaging many trees makes it more stable and accurate than a single decision tree regression, but it loses the beauty of the decision tree, which is interpretable and transparent: the tree and the rules.
- It is also a non-linear, non-continuous model that does not need feature scaling.
- Evaluating variable importance shows how much each variable contributes.
Classification
- Classification predicts categorical data and can be evaluated with accuracy and kappa (see the sketch below).
- It is supervised learning: there is a label (dependent variable, the correct answer).
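- A minimal sketch of accuracy and kappa computed by hand from a hypothetical 2x2 confusion matrix:
# cm = matrix(c(50, 10, 5, 35), nrow = 2)                 # rows = predicted, columns = actual (made-up counts)
# accuracy = sum(diag(cm)) / sum(cm)                      # observed agreement
# expected = sum(rowSums(cm) * colSums(cm)) / sum(cm)^2   # agreement expected by chance
# (accuracy - expected) / (1 - expected)                  # kappa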
Logistic regression
## 1
# Data preprocessing
rm(list = ls())
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[, c(3:5)]
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
dataset[, -ncol(dataset)] = scale(dataset[, -ncol(dataset)])
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training.set = subset(dataset, split == T)
test.set = subset(dataset, split == F)
# Reference level changing (optional)
# dataset$Purchased = relevel(dataset$Purchased, ref = '1')
# Model fitting
mod = glm(Purchased ~ .,
data = training.set,
family = binomial)
summary(mod)
# Prediction
y.pred.prob = predict(mod, type = 'response',
newdata = test.set[, -3])
y.pred = factor(ifelse(y.pred.prob > 0.5, 1, 0), levels = c(0, 1))
# Confusion matrix evaluating
library('caret')
confusionMatrix(y.pred, test.set[, 3])
# Model visualising
set = training.set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid.set = expand.grid(X1, X2)
colnames(grid.set) = c('Age', 'EstimatedSalary')
y.grid.prob = predict(mod, type = 'response',
newdata = grid.set)
y.grid = factor(ifelse(y.grid.prob > 0.5, 1, 0), levels = c(0, 1))
plot(set[, -3],
main = 'Logistic regression (Training set)',
xlab = 'Age', ylab = 'Estimated salary',
xlim = range(X1),
ylim = range(X2))
points(grid.set, pch = '.',
col = ifelse(y.grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21,
bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
set = test.set
plot(set[, -3],
main = 'Logistic regression (Test set)',
xlab = 'Age', ylab = 'Estimated salary',
xlim = range(X1),
ylim = range(X2))
points(grid.set, pch = '.',
col = ifelse(y.grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21,
bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
## 2
# fit = train(data = training, preProcess = c("center", "scale"), y ~ ., method = "glm" | "lm") # model fitting
# pred = predict(fit, newdata = testing) # prediction # default threshold 0.5 (pred.prob > ifelse > pred)
# confusionMatrix(pred, testing$y) # evaluation
- It needs to apply feature scaling.
- It is a linear classification model.
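- A minimal sketch of reading the fitted coefficients as odds ratios (assumes the mod object above; the inputs are scaled, so one unit is one standard deviation):
# exp(coef(mod))    # odds ratio per one-unit increase in each predictor
# exp(confint(mod)) # 95% intervals on the odds-ratio scale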
K-nearest neighbors
# Data preprocessing
rm(list = ls())
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[, c(3:5)]
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
dataset[, -ncol(dataset)] = scale(dataset[, -ncol(dataset)])
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training.set = subset(dataset, split == T)
test.set = subset(dataset, split == F)
# K-value choosing
library('class')
library('caret')
library('scales')
accuracy.set = data.frame(k = seq(1, 20, 1), accuracy = rep(0, 20))
for (i in 1:20) {
y.pred = knn(train = training.set[, -3],
test = test.set[, -3],
cl = training.set[, 3],
k = i)
accuracy.set[i, 2] = confusionMatrix(y.pred,
test.set[, 3])$overall[1]
}
ggplot(data = accuracy.set, aes(x = k, y = accuracy)) +
geom_point() +
geom_line(linetype = 'dashed') +
scale_x_continuous(breaks = pretty_breaks(nrow(accuracy.set))) +
scale_y_continuous(breaks = pretty_breaks()) +
labs(title = 'Accuracy vs K-value',
subtitle = 'Best KNN',
x = 'K-value',
y = 'Accuracy')
# Real time searching
y.pred = knn(train = training.set[, -3],
test = test.set[, -3],
cl = training.set[, 3],
k = 4)
# Confusion matrix evaluating
confusionMatrix(y.pred, test.set[, 3])
# Model visualising
set = training.set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid.set = expand.grid(X1, X2)
colnames(grid.set) = c('Age', 'EstimatedSalary')
y.grid = knn(train = training.set[, -3],
test = grid.set,
cl = training.set[, 3],
k = 4)
plot(set[, -3],
main = 'KNN (Training set)',
xlab = 'Age', ylab = 'Estimated salary',
xlim = range(X1),
ylim = range(X2))
points(grid.set, pch = '.',
col = ifelse(y.grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21,
bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
set = test.set
plot(set[, -3],
main = 'KNN (Test set)',
xlab = 'Age', ylab = 'Estimated salary',
xlim = range(X1),
ylim = range(X2))
points(grid.set, pch = '.',
col = ifelse(y.grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21,
bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
- It needs feature scaling because it measures Euclidean distance.
- It does not build a model but performs a real-time search, which is why it is called a lazy learner.
- It is time-consuming because of the real-time searching.
- It is a non-linear classification model.
- When several k values give the same accuracy, choose the lower k; it better captures the local structure (see the distance sketch below).
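- A minimal sketch of the Euclidean distance KNN relies on, showing why the features must share a scale (the two observations are made up):
# a = c(Age = 0.5, EstimatedSalary = -1.2) # already-scaled observation (hypothetical)
# b = c(Age = -0.3, EstimatedSalary = 0.8)
# sqrt(sum((a - b)^2))                     # same value as dist(rbind(a, b))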
Support vector machine
# Data preprocessing
rm(list = ls())
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[, c(3:5)]
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
dataset[, -ncol(dataset)] = scale(dataset[, -ncol(dataset)])
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training.set = subset(dataset, split == T)
test.set = subset(dataset, split == F)
# Model fitting
library('e1071')
set.seed(123)
mod = svm(Purchased ~ .,
data = training.set,
type = 'C-classification',
kernel = 'linear')
summary(mod)
# Prediction
y.pred = predict(mod, newdata = test.set[, -3])
# Confusion matrix evaluating
library('caret')
confusionMatrix(y.pred, test.set[, 3])
# Model visualising
set = training.set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid.set = expand.grid(X1, X2)
colnames(grid.set) = c('Age', 'EstimatedSalary')
y.grid = predict(mod, newdata = grid.set)
plot(set[, -3],
main = 'SVM-L (Training set)',
xlab = 'Age', ylab = 'Estimated salary',
xlim = range(X1),
ylim = range(X2))
points(grid.set, pch = '.',
col = ifelse(y.grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21,
bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
set = test.set
plot(set[, -3],
main = 'SVM-L (Test set)',
xlab = 'Age', ylab = 'Estimated salary',
xlim = range(X1),
ylim = range(X2))
points(grid.set, pch = '.',
col = ifelse(y.grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21,
bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
- It needs to apply feature scaling.
- It is a linear classification model, based on the kernel type chosen.
- It suits linearly separable data.
- It is not biased by outliers.
- It is not sensitive to overfitting.
Kernel support vector machine
# Data preprocessing
rm(list = ls())
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[, c(3:5)]
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
dataset[, -ncol(dataset)] = scale(dataset[, -ncol(dataset)])
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training.set = subset(dataset, split == T)
test.set = subset(dataset, split == F)
# Model fitting
library('e1071')
set.seed(123)
mod = svm(Purchased ~ .,
data = training.set,
type = 'C-classification',
kernel = 'radial')
summary(mod)
# Prediction
y.pred = predict(mod, newdata = test.set[, -3])
# Confusion matrix evaluating
library('caret')
confusionMatrix(y.pred, test.set[, 3])
# Model visualising
set = training.set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid.set = expand.grid(X1, X2)
colnames(grid.set) = c('Age', 'EstimatedSalary')
y.grid = predict(mod, newdata = grid.set)
plot(set[, -3],
main = 'Kernel SVM-NL (Training set)',
xlab = 'Age', ylab = 'Estimated salary',
xlim = range(X1),
ylim = range(X2))
points(grid.set, pch = '.',
col = ifelse(y.grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21,
bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
set = test.set
plot(set[, -3],
main = 'Kernel SVM-NL (Test set)',
xlab = 'Age', ylab = 'Estimated salary',
xlim = range(X1),
ylim = range(X2))
points(grid.set, pch = '.',
col = ifelse(y.grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21,
bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
- It needs to apply feature scaling.
- It is a non-linear classification model, based on the kernel type chosen.
- It suits non-linearly separable data.
- One approach is to map the data into a higher-dimensional space, find a linearly separable solution there, and project it back to the original space as a non-linear boundary; the downside is that this is highly compute-intensive, so it is not the preferred route.
- The kernel trick achieves the same effect without explicitly mapping to the higher-dimensional space, which is why the kernel support vector machine is the better choice for non-linearly separable data.
- The common kernel functions are the Gaussian RBF (radial basis function) kernel (kernel = 'radial'), the sigmoid kernel (kernel = 'sigmoid'), and the polynomial kernel (kernel = 'polynomial'); see the comparison sketch below.
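- A minimal sketch comparing the kernel types on the objects above (a rough, untuned check):
# for (k in c('linear', 'radial', 'sigmoid', 'polynomial')) {
#   m = svm(Purchased ~ ., data = training.set, type = 'C-classification', kernel = k)
#   acc = mean(predict(m, newdata = test.set[, -3]) == test.set[, 3])
#   print(c(k, round(acc, 3)))
# }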
Naive bayes
## 1
# Data preprocessing
rm(list = ls())
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[, c(3:5)]
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
dataset[, -ncol(dataset)] = scale(dataset[, -ncol(dataset)])
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training.set = subset(dataset, split == T)
test.set = subset(dataset, split == F)
# Model fitting
library('e1071')
mod = naiveBayes(x = training.set[, -3],
y = training.set[, 3])
# Prediction
y.pred = predict(mod, newdata = test.set[, -3])
# Confusion matrix evaluating
library('caret')
confusionMatrix(y.pred, test.set[, 3])
# Model visualising
set = training.set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid.set = expand.grid(X1, X2)
colnames(grid.set) = c('Age', 'EstimatedSalary')
y.grid = predict(mod, newdata = grid.set)
plot(set[, -3],
main = 'Naive bayes (Training set)',
xlab = 'Age', ylab = 'Estimated salary',
xlim = range(X1),
ylim = range(X2))
points(grid.set, pch = '.',
col = ifelse(y.grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21,
bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
set = test.set
plot(set[, -3],
main = 'Naive bayes (Test set)',
xlab = 'Age', ylab = 'Estimated salary',
xlim = range(X1),
ylim = range(X2))
points(grid.set, pch = '.',
col = ifelse(y.grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21,
bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
## 2
# fit = train(data = training, y ~ ., method = "nb")
# pred = predict(fit, newdata = testing)
- It needs to apply feature scaling.
- It is a non-linear classification model.
- It is not biased by outliers.
- Good: fast speed, reasonably accurate.
- Bad: it relies on additional assumptions, chiefly that the features are independent within each class (see the sketch below).
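- A minimal sketch of what the "naive" assumption looks like inside the fitted e1071 object (each predictor gets its own per-class table and is treated as independent):
# mod$apriori # class prior probabilities
# mod$tables  # per-class mean and sd for each predictor, one table per (assumed independent) variable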
Decision tree classification
## 1
# Data preprocessing
rm(list = ls())
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[, c(3:5)]
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training.set = subset(dataset, split == T)
test.set = subset(dataset, split == F)
# # Scaling needed only if using the model visualising block below (optional)
# training.set[, -ncol(dataset)] = scale(training.set[, -ncol(dataset)])
# test.set[, -ncol(dataset)] = scale(test.set[, -ncol(dataset)])
# Model fitting
library('rpart')
set.seed(123)
mod = rpart(Purchased ~ .,
data = training.set,
method = 'class',
control = rpart.control(xval = 10))
# Tree pruning
plotcp(mod)
# mod = prune(mod, cp = XXX) # from cptable
printcp(mod)
# Generating the tree and the rule
library('rpart.plot')
options(scipen = 999)
prp(mod, roundint = F, type = 3, extra = 101, under = T, digits = -2,
box.palette = 'auto')
# Prediction
y.pred = predict(mod, newdata = test.set[, -3], type = 'class')
# Confusion matrix evaluating
library('caret')
confusionMatrix(y.pred, test.set[, 3])
# # Model visualising
# set = training.set
# X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
# X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
# grid.set = expand.grid(X1, X2)
# colnames(grid.set) = c('Age', 'EstimatedSalary')
# y.grid = predict(mod, newdata = grid.set, type = 'class')
# plot(set[, -3],
# main = 'CART-C (Training set)',
# xlab = 'Age', ylab = 'Estimated salary',
# xlim = range(X1),
# ylim = range(X2))
# points(grid.set, pch = '.',
# col = ifelse(y.grid == 1, 'springgreen3', 'tomato'))
# points(set, pch = 21,
# bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
# set = test.set
# plot(set[, -3],
# main = 'CART-C (Test set)',
# xlab = 'Age', ylab = 'Estimated salary',
# xlim = range(X1),
# ylim = range(X2))
# points(grid.set, pch = '.',
# col = ifelse(y.grid == 1, 'springgreen3', 'tomato'))
# points(set, pch = 21,
# bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
## 2
# fit = train(data = training, y ~ ., method = "rpart")
# mod = fit$finalModel
# fancyRpartPlot(mod) # library(rattle)
# pred = predict(fit, newdata = testing) # prediction # classification - type
# confusionMatrix(pred, testing$y) # evaluation
- It does not need feature scaling, since it does not rely on Euclidean distance; if the model visualising block is used, the data would need scaling to get a workable numeric grid, but that makes the tree and the rules harder to read.
- To get readable trees and rules, rerun the fit without feature scaling.
- Good: easy to interpret, better performance in non-linear settings.
- Bad: without pruning/cross-validation it can overfit, and it is unstable under uncertainty.
- The bagging process extends it to random forest classification.
Random forest classification
## 1
# Data preprocessing
rm(list = ls())
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[, c(3:5)]
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training.set = subset(dataset, split == T)
test.set = subset(dataset, split == F)
# Scaling only for the model visualising grid
training.set[, -ncol(dataset)] = scale(training.set[, -ncol(dataset)])
test.set[, -ncol(dataset)] = scale(test.set[, -ncol(dataset)])
# Model fitting
library('randomForest')
set.seed(123)
mod = randomForest(Purchased ~ .,
data = training.set,
ntree = 500)
# Variable importance
importance(mod)
varImpPlot(mod,
main = 'RF-C_Variable contribution')
# Prediction
y.pred = predict(mod, newdata = test.set[, -3], type = 'class')
# Confusion matrix evaluating
library('caret')
confusionMatrix(y.pred, test.set[, 3])
# Model visualising
set = training.set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid.set = expand.grid(X1, X2)
colnames(grid.set) = c('Age', 'EstimatedSalary')
y.grid = predict(mod, newdata = grid.set, type = 'class')
plot(set[, -3],
main = 'RF-C (Training set)',
xlab = 'Age', ylab = 'Estimated salary',
xlim = range(X1),
ylim = range(X2))
points(grid.set, pch = '.',
col = ifelse(y.grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21,
bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
set = test.set
plot(set[, -3],
main = 'RF-C (Test set)',
xlab = 'Age', ylab = 'Estimated salary',
xlim = range(X1),
ylim = range(X2))
points(grid.set, pch = '.',
col = ifelse(y.grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21,
bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
## 2
# fit = train(data = training,
# y ~ .,
# method = "rf",
# prox = T)
# mod = fit$finalModel
# getTree(mod, k = 2) # get the 2nd tree
# pred = predict(fit, newdata = testing)
# confusionMatrix(pred, testing$y)
# testing$pred.right = pred == testing$y # a way to show up the right classification
# qplot(data = testing, x, y, col = pred.right, main = "newdata predictions")
- It is ensemble learning: stable and accurate, but not transparent; a non-linear, non-continuous model that does not need feature scaling.
- Good: accuracy.
- Bad: slow speed, less interpretability, easy overfitting.
XGBoost
## 1
# Data preprocessing
rm(list = ls())
dataset = read.csv('Churn_Modelling.csv')
dataset = dataset[, c(4:14)]
dataset$Geography = as.numeric(factor(dataset$Geography,
levels = c('France',
'Spain',
'Germany'),
labels = c(1, 2, 3)))
dataset$Gender = as.numeric(factor(dataset$Gender,
levels = c('Female',
'Male'),
labels = c(1, 2)))
set.seed(123)
split = sample.split(dataset$Exited, SplitRatio = 0.75)
training.set = subset(dataset, split == T)
test.set = subset(dataset, split == F)
# Model fitting
library('xgboost')
mod = xgboost(data = as.matrix(training.set[, -ncol(training.set)]),
label = training.set$Exited,
nrounds = 10)
# Prediction
y.pred = predict(mod, newdata = as.matrix(test.set[, -11]))
y.pred = ifelse(y.pred >= 0.5, 1, 0)
# Confusion matrix evaluating
library('caret')
confusionMatrix(table(y.pred, test.set[, 11]))
# K-fold cross validation applying
library('caret')
folds = createFolds(training.set$Exited, k = 10)
cv = lapply(folds, function(x) {
training.fold = training.set[-x, ]
test.fold = training.set[x, ]
mod = xgboost(data = as.matrix(training.fold[, -11]),
label = training.fold$Exited,
nrounds = 10)
y.pred = predict(mod, newdata = as.matrix(test.fold[, -11]))
y.pred = ifelse(y.pred >= 0.5, 1, 0)
cm = confusionMatrix(table(y.pred, test.fold[, 11]))
accuracy = as.numeric(cm$overall[1])
return(accuracy)
})
accuracy = mean(as.numeric(cv))
## 2
# fit = train(data = training, y ~ ., method = "gbm", verbose = F) ## gbmboost for random forest classification boosting
# pred = predict(fit, newdata = testing)
# mod = fit$finalModel
# summary(mod)
- High performance: accurate, fast execution, and it keeps the interpretation (no need for scaling).
- This part also contains model selection, which stabilises the accuracy estimate (K-fold cross validation; see the caret sketch below). Other boosting variants, such as adaboost, gbm, mboost, and gamboost, reinforce the algorithm in different ways; the basic idea is to gather weak predictors, weight them, and add them up into one strong predictor.
- Boosting can be used with any subset of classifiers.
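- A minimal sketch of letting caret run the 10-fold cross-validation instead of the manual lapply (assumes the caret xgbTree wrapper and its dependencies are installed):
# library('caret')
# ctrl = trainControl(method = 'cv', number = 10)
# fit = train(factor(Exited) ~ ., data = training.set, method = 'xgbTree', trControl = ctrl)
# head(fit$results) # accuracy and kappa averaged across the folds for each parameter combination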
Dimensionality reduction
- Feature selection (variable selection) was already covered under multiple linear regression.
- Feature extraction: linear datasets: PCA, LDA; non-linear datasets: Kernel PCA.
Principal component analysis
## 1
# Data preprocessing
rm(list = ls())
dataset = read.csv('Wine.csv')
dataset[, -ncol(dataset)] = scale(dataset[, -ncol(dataset)])
set.seed(123)
split = sample.split(dataset$Customer_Segment, SplitRatio = 0.8)
training.set = subset(dataset, split == T)
test.set = subset(dataset, split == F)
# PCA applying
library('caret')
library('e1071')
pca = preProcess(x = training.set[, -ncol(training.set)],
method = 'pca',
pcaComp = 2)
training.set = predict(pca, training.set)
training.set = training.set[, c(2, 3, 1)]
test.set = predict(pca, test.set)
test.set = test.set[, c(2, 3, 1)]
# Model fitting
library('e1071')
set.seed(123)
mod = svm(Customer_Segment ~ .,
data = training.set,
type = 'C-classification',
kernel = 'linear')
summary(mod)
# Prediction
y.pred = predict(mod, newdata = test.set[, -3])
# Confusion matrix evaluating
library('caret')
confusionMatrix(table(y.pred, test.set[, 3]))
# Model visualising
set = training.set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid.set = expand.grid(X1, X2)
colnames(grid.set) = c('PC1', 'PC2')
y.grid = predict(mod, newdata = grid.set)
plot(set[, -3],
main = 'SVM-L (Training set)',
xlab = 'PC1', ylab = 'PC2',
xlim = range(X1),
ylim = range(X2))
points(grid.set, pch = '.',
col = ifelse(y.grid == 2, 'deepskyblue',
ifelse(y.grid == 1, 'springgreen3', 'tomato')))
points(set, pch = 21,
bg = ifelse(set[, 3] == 2, 'blue3',
ifelse(set[, 3] == 1, 'green4', 'red3')))
set = test.set
plot(set[, -3],
main = 'SVM-L (Test set)',
xlab = 'PC1', ylab = 'PC2',
xlim = range(X1),
ylim = range(X2))
points(grid.set, pch = '.',
col = ifelse(y.grid == 2, 'deepskyblue',
ifelse(y.grid == 1, 'springgreen3', 'tomato')))
points(set, pch = 21,
bg = ifelse(set[, 3] == 2, 'blue3',
ifelse(set[, 3] == 1, 'green4', 'red3')))
## 2
# pca = preProcess(x = training[, -ncol(training)], method = 'pca', pcaComp = 2 | thresh = 0.8 or 0.9)
# training = predict(pca, training)
# testing = predict(pca, testing)
- It is unsupervised learning: the reduction uses only the independent variables (x), and it suits linearly separable datasets.
- It captures the directions with the most variation (see the sketch below).
- It is highly affected by outliers.
- It is highly useful for dimensionality reduction of the independent variables.
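- A minimal sketch of checking how much variance the two kept components explain, using base-R prcomp on the scaled Wine predictors loaded above:
# pc = prcomp(dataset[, -ncol(dataset)])               # predictors were already scaled above
# summary(pc)$importance['Cumulative Proportion', 1:2] # share of variance captured by PC1 + PC2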
Linear discriminant analysis
## 1
# Data preprocessing
rm(list = ls())
dataset = read.csv('Wine.csv')
dataset[, -ncol(dataset)] = scale(dataset[, -ncol(dataset)])
set.seed(123)
split = sample.split(dataset$Customer_Segment, SplitRatio = 0.8)
training.set = subset(dataset, split == T)
test.set = subset(dataset, split == F)
# LDA applying
library('MASS')
lda = lda(formula = Customer_Segment ~ .,
data = training.set)
training.set = predict(lda, training.set) %>% as.data.frame()
training.set = training.set[, c(5, 6, 1)]
test.set = predict(lda, test.set) %>% as.data.frame()
test.set = test.set[, c(5, 6, 1)]
# Model fitting
library('e1071')
set.seed(123)
mod = svm(class ~ .,
data = training.set,
type = 'C-classification',
kernel = 'linear')
summary(mod)
# Prediction
y.pred = predict(mod, newdata = test.set[, -3])
# Confusion matrix evaluating
library('caret')
confusionMatrix(table(y.pred, test.set[, 3]))
# Model visualising
set = training.set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid.set = expand.grid(X1, X2)
colnames(grid.set) = c('x.LD1', 'x.LD2')
y.grid = predict(mod, newdata = grid.set)
plot(set[, -3],
main = 'SVM-L (Training set)',
xlab = 'LD1', ylab = 'LD2',
xlim = range(X1),
ylim = range(X2))
points(grid.set, pch = '.',
col = ifelse(y.grid == 2, 'deepskyblue',
ifelse(y.grid == 1, 'springgreen3', 'tomato')))
points(set, pch = 21,
bg = ifelse(set[, 3] == 2, 'blue3',
ifelse(set[, 3] == 1, 'green4', 'red3')))
set = test.set
plot(set[, -3],
main = 'SVM-L (Test set)',
xlab = 'LD1', ylab = 'LD2',
xlim = range(X1),
ylim = range(X2))
points(grid.set, pch = '.',
col = ifelse(y.grid == 2, 'deepskyblue',
ifelse(y.grid == 1, 'springgreen3', 'tomato')))
points(set, pch = 21,
bg = ifelse(set[, 3] == 2, 'blue3',
ifelse(set[, 3] == 1, 'green4', 'red3')))
## 2
# lda = train(data = training, y ~ ., method = "lda")
# training = predict(lda, training)
# testing = predict(lda, testing)
- It is supervised learning: the reduction uses the dependent variable (y) in addition to the independent variables, and it suits linearly separable datasets.
- It maximizes the separation between the known categories.
- It is highly affected by outliers.
- It is highly useful for dimensionality reduction guided by the dependent variable.
- It projects the classification from a high dimension (m classes) down to at most m - 1 discriminants (see the sketch below).
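- A minimal sketch of how much class separation each discriminant carries (uses the lda object above; with 3 classes there are at most 2 discriminants):
# round(lda$svd^2 / sum(lda$svd^2), 3) # proportion of between-class variance per linear discriminant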
Kernel principal component analysis
# Data preprocessing
rm(list = ls())
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[, c(3:5)]
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))
dataset[, -ncol(dataset)] = scale(dataset[, -ncol(dataset)])
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training.set = subset(dataset, split == T)
test.set = subset(dataset, split == F)
# Kernel PCA applying
library('kernlab')
kpca = kpca(~.,
data = training.set[, -3],
kernel = 'rbfdot',
features = 2)
training.set.pca = predict(kpca, training.set) %>% as.data.frame()
training.set.pca$Purchased = training.set$Purchased
test.set.pca = predict(kpca, test.set) %>% as.data.frame()
test.set.pca$Purchased = test.set$Purchased
# Model fitting
mod = glm(Purchased ~ .,
data = training.set.pca,
family = binomial)
summary(mod)
# Prediction
y.pred.prob = predict(mod, type = 'response',
newdata = test.set.pca[, -3])
y.pred = factor(ifelse(y.pred.prob > 0.5, 1, 0), levels = c(0, 1))
# Confusion matrix evaluating
library('caret')
confusionMatrix(y.pred, test.set.pca[, 3])
# Model visualising
set = training.set.pca
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid.set = expand.grid(X1, X2)
colnames(grid.set) = c('V1', 'V2')
y.grid.prob = predict(mod, type = 'response',
newdata = grid.set)
y.grid = factor(ifelse(y.grid.prob > 0.5, 1, 0), levels = c(0, 1))
plot(set[, -3],
main = 'Logistic regression (Training set)',
xlab = 'PC1', ylab = 'PC2',
xlim = range(X1),
ylim = range(X2))
points(grid.set, pch = '.',
col = ifelse(y.grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21,
bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
set = test.set.pca
plot(set[, -3],
main = 'Logistic regression (Test set)',
xlab = 'PC1', ylab = 'PC2',
xlim = range(X1),
ylim = range(X2))
points(grid.set, pch = '.',
col = ifelse(y.grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21,
bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
- It is unsupervised learning: the reduction uses only the independent variables (x), and it suits non-linearly separable datasets.
- It captures the directions with the most variation.
- It is highly affected by outliers.
- It is highly useful for dimensionality reduction of the independent variables.
- Its principle is to map the dataset into a higher dimension, where it becomes linearly separable, and then apply PCA there.