Classification methods applied to an imbalanced big dataset
1) Project description:
This analysis is part of the final project for the course Machine Learning 1: classification methods, taught at the University of Warsaw. The final project consisted of two parts, regression and classification, prepared together with Lashari Gochiashvili.
I would like to share the part I was responsible for; hopefully it will be helpful to someone.
The purpose of this analysis is to apply classification methods to a big dataset in order to classify cars by a symboling safety level: secure, neutral and risky.
Classification is a supervised machine learning technique whose purpose is to identify to which of a set of categories a new observation belongs.
In this analysis several methods will be applied, such as learning vector quantization, multinomial regression, penalized multinomial regression, k-nearest neighbors, support vector machines, linear discriminant analysis and quadratic discriminant analysis.
Additionally, several techniques will be applied in order to find the best model performance, among which we can find down-sampling, cross-validation, tuning the models with different parameters and pre-processing.
All of these techniques are based on the caret package, one of the best-known R packages for machine learning.
2) Data description:
The dataset used in this analysis, cars, can be found on the OpenML website: https://www.openml.org/d/1398
cars is an artificial dataset generated with the BNG (Bayesian Network Generator) method, and it is based on the popular dataset with the same name that can be found in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Automobile
- First we will load all the necessary packages and the data:
library(dplyr)
library(caret)
library(ggplot2)
library(corrplot)
library(tibble)
library(nnet)
library(mlbench)
library(randomForest)
library(nnet)
library(stargazer)
library(DMwR)
library(party)
library(e1071)
library(kernlab)
library(scales)
library(class)
library(psych)
library(knitr)
library(expss)
library(reshape2)
library(pROC)
library(MASS)

[1] "The dataset cars initially has 1000000 rows and 26 columns"
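The code that loads the data and prints the message above is not shown in the post; a minimal sketch is given below, assuming the OpenML dataset has been exported to a local file (the file name is an illustrative assumption):

# Sketch: load the cars dataset from a local CSV export of https://www.openml.org/d/1398
cars <- read.csv("cars.csv", stringsAsFactors = FALSE)

# Report the initial dimensions of the dataset
print(paste("The dataset cars initially has",
            nrow(cars), "rows and", ncol(cars), "columns"))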
2.1) Features description:
The dataset basically contains several characteristics of cars, an insurance risk rating and normalized losses in use as compared to other cars.
Below we can find more details about the different features:
- normalized.losses: numerical - the relative average loss payment per insured vehicle year; this variable was normalized
- make: ordinal with 22 levels - car brand - "volkswagen", "volvo", "nissan", "porsche", "honda", "subaru", "mazda", "jaguar", "dodge", "mercury", "toyota", "chevrolet", "mercedes-benz", "peugot", "mitsubishi", "plymouth", "bmw", "saab", "isuzu", "renault", "alfa-romero" and "audi"
- fuel.type: ordinal with 2 levels - type of fuel - "diesel" and "gas"
- aspiration: ordinal with 2 levels - type of aspiration (it refers to engine breathing) - "turbo" and "std"
- num.of.doors: ordinal with 2 levels - number of doors - "four" and "two"
- body.style: ordinal with 5 levels - body style (shape) of the car - "hatchback", "sedan", "wagon", "hardtop" and "convertible"
- drive.wheels: ordinal with 3 levels - the kind of drive wheel - "fwd", "rwd" and "4wd"
- engine.location: ordinal with 2 levels - the location of the engine - "front" and "rear"
- wheel.base: numerical - the distance between the centers of the front and rear wheels
- length: numerical - the length of the car
- width: numerical - the width of the car
- height: numerical - the height of the car
- curb.weight: numerical - the total mass of the vehicle with standard equipment and all necessary operating consumables
- engine.type: ordinal with 7 levels - the kind of engine - "ohc", "ohcv", "l", "rotor", "ohcf", "dohc" and "dohcv"
- num.of.cylinders: ordinal with 7 levels - the number of cylinders - "four", "eight", "six", "three", "five", "twelve" and "two"
- engine.size: numerical - the size of the engine
- fuel.system: ordinal with 8 levels - the kind of fuel system - "2bbl", "mfi", "mpfi", "1bbl", "idi", "spdi", "4bbl" and "spfi"
- bore: numerical - the diameter of each cylinder
- stroke: numerical - the distance travelled by the piston in each cycle
- compression.ratio: numerical - a measure based on the relative volumes of the combustion chamber and the cylinder
- horsepower: numerical - the horsepower of the engine
- peak.rpm: numerical - the power band of the engine
- city.mpg: numerical - the distance travelled by the vehicle relative to the amount of fuel consumed in the city
- highway.mpg: numerical - the distance travelled by the vehicle relative to the amount of fuel consumed on the highway
- price: numerical - the price of the car
- symboling: ordinal with 7 levels - indicates how safe the car is - -3: highly secure, -2: moderately secure, -1: slightly secure, 0: neutral (neither safe nor risky), +1: slightly risky, +2: moderately risky, +3: highly risky
For the purpose of this research, first of all, the target variable symboling will be transformed to three levels:
- Secure, corresponding to symboling levels -3, -2 and -1
- Neutral, neither safe nor risky, corresponding to symboling level 0
- Risky, corresponding to symboling levels +3, +2 and +1
Afterwards, the variable will be converted to a factor.
cars$symboling <-
plyr::revalue(as.character(cars$symboling),
c("-3" = "secure",
"-2" = "secure",
"-1" = "secure",
"0" = "neutral",
"1" = "risky",
"2" = "risky",
"3" = "risky")) %>%
as.factor()

Below we can find a barplot of the target variable symboling:
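The plotting code is not shown in the post; a minimal sketch of how such a barplot could be produced with ggplot2 is given below (the styling choices are assumptions):

# Sketch: barplot of the target variable symboling
ggplot(cars, aes(x = symboling, fill = symboling)) +
  geom_bar() +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "symboling", y = "Number of cars") +
  theme_minimal()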
As we can see in the graph, the levels are imbalanced. There are many more cars with the symboling level risky than secure, hence we should take this into consideration later on, perhaps by applying some type of resampling.
2.2) Numeric variables:
There are 15 numeric variables.
We can find below a density plot of these variables:
Below we can find the main statistical moments for the numeric variables:
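A table like the one below can be obtained with the describe function of the psych package; the following is only a sketch, assuming the numeric column names are stored in the vector cars_numeric_vars (a name used later in the analysis):

# Sketch: basic descriptive statistics (n, mean, sd, min, max, range, se)
describe(cars[, cars_numeric_vars], fast = TRUE) %>%
  kable(align = "l", digits = 2)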
| | vars | n | mean | sd | min | max | range | se |
|---|---|---|---|---|---|---|---|---|
| normalized.losses | 1 | 1000000 | 115.92 | 35.06 | 37.63 | 273.72 | 236.10 | 0.04 |
| wheel.base | 2 | 1000000 | 98.94 | 6.10 | 82.48 | 127.19 | 44.71 | 0.01 |
| length | 3 | 1000000 | 174.89 | 12.00 | 134.02 | 218.77 | 84.75 | 0.01 |
| width | 4 | 1000000 | 65.96 | 2.14 | 60.66 | 75.89 | 15.23 | 0.00 |
| height | 5 | 1000000 | 53.77 | 2.46 | 46.98 | 62.50 | 15.52 | 0.00 |
| curb.weight | 6 | 1000000 | 2562.55 | 514.15 | 1520.87 | 4716.21 | 3195.34 | 0.51 |
| engine.size | 7 | 1000000 | 124.86 | 40.56 | 9.89 | 418.54 | 408.65 | 0.04 |
| bore | 8 | 1000000 | 3.33 | 0.27 | 2.55 | 4.07 | 1.52 | 0.00 |
| stroke | 9 | 1000000 | 3.25 | 0.32 | 1.69 | 4.46 | 2.77 | 0.00 |
| compression.ratio | 10 | 1000000 | 9.99 | 3.87 | -11.81 | 42.16 | 53.97 | 0.00 |
| horsepower | 11 | 1000000 | 104.43 | 39.09 | 38.14 | 303.92 | 265.78 | 0.04 |
| peak.rpm | 12 | 1000000 | 5132.16 | 550.69 | 3406.97 | 6931.56 | 3524.59 | 0.55 |
| city.mpg | 13 | 1000000 | 24.27 | 6.32 | 10.51 | 54.82 | 44.31 | 0.01 |
| highway.mpg | 14 | 1000000 | 30.11 | 6.72 | 11.93 | 61.70 | 49.77 | 0.01 |
| price | 15 | 1000000 | 13474.80 | 8024.06 | -11856.91 | 63918.82 | 75775.73 | 8.02 |
2.3) Categorical variables:
There are 11 categorical variables, one of them is the target variable.
We can find below a barplot of these variables:
First of all, we will create cross-tabulation tables of the target variable symboling against the rest of the ordinal variables. This way we will be able to see, in percentages, how the different variables are distributed along the target variable:
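The tables below report column percentages; a minimal sketch of how one of them could be computed with base R is shown here (the published tables were most likely produced with the expss package, so this is only an illustration):

# Sketch: percentage of each symboling level within each body style
round(100 * prop.table(table(cars$symboling, cars$body.style), margin = 2), 1)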
- For body.style and drive.wheels:
| | #Total | convertible | hardtop | hatchback | sedan | wagon | 4wd | fwd | rwd |
|---|---|---|---|---|---|---|---|---|---|
| cars$symboling | |||||||||
| neutral | 32.4 | 34.2 | 36.1 | 28.6 | 31.8 | 37.8 | 33.6 | 31.8 | 32.1 |
| risky | 54.9 | 51.1 | 50.2 | 58.8 | 56.4 | 48.6 | 52.5 | 56.4 | 54.7 |
| secure | 12.7 | 14.7 | 13.7 | 12.5 | 11.8 | 13.6 | 13.9 | 11.7 | 13.1 |
| #Total cases | 1000000 | 95879 | 103877 | 289881 | 378870 | 131493 | 231220 | 419063 | 349717 |
- Most common cars overall:
  - For body.style - sedan
  - For drive.wheels - fwd
- Most secure and risky cars by feature:
  - For body.style - convertible is the most secure
  - For body.style - hatchback is the most risky
  - For drive.wheels - 4wd is the most secure
  - For drive.wheels - fwd is the most risky
- For fuel.type, aspiration, num.of.doors and engine.location:
| | #Total | diesel | gas | std | turbo | four | two | front | rear |
|---|---|---|---|---|---|---|---|---|---|
| cars$symboling | |||||||||
| neutral | 32.4 | 39.4 | 30.9 | 34.2 | 28.9 | 38.6 | 25.4 | 31.6 | 37.5 |
| risky | 54.9 | 50.8 | 55.8 | 52.8 | 58.9 | 47.8 | 62.8 | 55 | 54.2 |
| secure | 12.7 | 9.8 | 13.3 | 13 | 12.2 | 13.6 | 11.8 | 13.3 | 8.2 |
| #Total cases | 1000000 | 175444 | 824556 | 655664 | 344336 | 527269 | 472731 | 879494 | 120506 |
- Most common cars overall:
  - For fuel.type - gas
  - For aspiration - std
  - For num.of.doors - four
  - For engine.location - front
- Most secure and risky cars by feature:
  - For fuel.type - gas is more likely to be either risky or secure
  - For fuel.type - diesel is the most neutral
  - For aspiration - std is the most secure
  - For aspiration - turbo is the most risky
  - For num.of.doors - four is the most secure
  - For num.of.doors - two is the most risky
  - For engine.location - front is more likely to be either risky or secure
  - For engine.location - rear is the most neutral
- For engine.type:
| | #Total | dohc | dohcv | l | ohc | ohcf | ohcv | rotor |
|---|---|---|---|---|---|---|---|---|
| cars$symboling | ||||||||
| neutral | 32.4 | 25.9 | 26.6 | 41.2 | 35.3 | 31.9 | 30.3 | 27.4 |
| risky | 54.9 | 59.7 | 57.6 | 45.7 | 56.7 | 50 | 56.2 | 56.1 |
| secure | 12.7 | 14.4 | 15.9 | 13.2 | 8 | 18.1 | 13.6 | 16.5 |
| #Total cases | 1000000 | 111796 | 91894 | 111844 | 349015 | 124575 | 119278 | 91598 |
- Most common cars overall:
  - For engine.type - ohc
- Most secure and risky cars by feature:
  - For engine.type - ohcf is the most secure
  - For engine.type - dohc is the most risky
- For num.of.cylinders:
| | #Total | eight | five | four | six | three | twelve | two |
|---|---|---|---|---|---|---|---|---|
| cars$symboling | ||||||||
| neutral | 32.4 | 24.2 | 27 | 41.9 | 33 | 22 | 26.7 | 34.6 |
| risky | 54.9 | 61.6 | 58.2 | 47.9 | 51.2 | 65.6 | 60.1 | 53.8 |
| secure | 12.7 | 14.2 | 14.8 | 10.1 | 15.8 | 12.4 | 13.2 | 11.6 |
| #Total cases | 1000000 | 111681 | 110885 | 296802 | 154559 | 108764 | 101802 | 115507 |
- Most common cars overall:
  - For num.of.cylinders - four
- Most secure and risky cars by feature:
  - For num.of.cylinders - six is the most secure
  - For num.of.cylinders - three is the most risky
- For fuel.system:
| | #Total | 1bbl | 2bbl | 4bbl | idi | mfi | mpfi | spdi | spfi |
|---|---|---|---|---|---|---|---|---|---|
| cars$symboling | |||||||||
| neutral | 32.4 | 37 | 28.6 | 38.3 | 32.7 | 41 | 28 | 35.5 | 39.9 |
| risky | 54.9 | 49 | 57.7 | 49.5 | 60.8 | 46.4 | 56 | 52.6 | 48.4 |
| secure | 12.7 | 13.9 | 13.7 | 12.2 | 6.6 | 12.6 | 15.9 | 11.9 | 11.8 |
| #Total cases | 1000000 | 81366 | 196893 | 62690 | 175407 | 58088 | 290196 | 74199 | 61161 |
- Most common cars overall:
  - For fuel.system - mpfi
- Most secure and risky cars by feature:
  - For fuel.system - mpfi is the most secure
  - For fuel.system - idi is the most risky
- For make:
| | #Total | dodge | honda | jaguar | mazda | mercury | nissan | porsche | subaru | toyota | volkswagen | volvo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c$symboling | ||||||||||||
| neutral | 32 | 30.4 | 35.3 | 45.1 | 31.7 | 29.4 | 34.8 | 27.6 | 47.7 | 20.9 | 24.8 | 29 |
| risky | 54.8 | 55.5 | 54 | 44.3 | 56.5 | 58.7 | 54 | 60.1 | 35.8 | 64.5 | 65.6 | 51.1 |
| secure | 13.2 | 14.1 | 10.8 | 10.6 | 11.8 | 11.8 | 11.2 | 12.3 | 16.5 | 14.6 | 9.6 | 19.9 |
| #Total cases | 538539 | 44742 | 47643 | 41011 | 55339 | 35865 | 54910 | 38312 | 50981 | 73356 | 46977 | 49403 |
| | #Total | alfa-romero | audi | bmw | chevrolet | isuzu | mercedes-benz | mitsubishi | peugot | plymouth | renault | saab |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| d$symboling | ||||||||||||
| neutral | 32.8 | 28.9 | 29.6 | 42.6 | 31 | 30.4 | 30 | 25.4 | 55.2 | 26.9 | 32.1 | 23.8 |
| risky | 55 | 59.1 | 58.1 | 47.5 | 56.8 | 56.4 | 54.1 | 63.7 | 36.2 | 55.9 | 53.7 | 66.5 |
| secure | 12.2 | 11.9 | 12.2 | 10 | 12.1 | 13.1 | 15.9 | 10.8 | 8.6 | 17.1 | 14.2 | 9.7 |
| #Total cases | 461461 | 36683 | 42728 | 45119 | 35423 | 37596 | 45049 | 50748 | 50414 | 39119 | 33884 | 44698 |
- Most common cars overall:
  - For make - toyota
- Most secure and risky cars by feature:
  - For make - volvo is the most secure
  - For make - saab is the most risky
3) Cleaning the data:
Before applying any machine learning algorithm it is important to have clean data. It can improve the results and also reduce the computational time. In some cases we would not even be able to apply the algorithms without this step.
3.1) Var. transformation:
3.1.1) Encoding and conversion to factors:
All the ordinal variables of the dataset cars are stored as character, except for the target variable symboling. All of these variables should be transformed to factors, which we will do with the function as.factor().
Additionally, integer encoding will be applied. This encoding transforms the character features into numbers without losing any information or having any impact on the final results. In a large dataset like this one this step is important in order to use less memory when saving the files. Apart from that, it matters for the models which can only handle numeric variables, if these features are saved as numeric.
Firstly, the character variables will be converted to factors for the models that are not sensitive to the data distribution; afterwards they will be transformed to numeric for the models that are sensitive to it.
After applying as.factor(), we should check that no character column remains:
[1] FALSE
It is FALSE, hence we can continue further.
3.1.2) Scaling the data:
There are many numeric variables with different scales, and this can be a problem for algorithms that are based on Euclidean distance.
The caret package gives the possibility to apply preProcess to scale the data when training the model. However, in our case we will scale the data beforehand, because some algorithms are applied from other packages.
The selected scaling method is range, which scales the numeric data to the interval [0, 1] but leaves the factors unchanged.
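The scaling code for the original dataset is not reproduced above; a minimal sketch, based on the preProcess call used later for the numeric copy cars2, could be (the object name and the use of cars_numeric_vars are assumptions):

# Sketch: scale all numeric variables of cars to the [0, 1] range
cars_preProcess_range <- preProcess(cars[, cars_numeric_vars], method = c("range"))
cars[, cars_numeric_vars] <- predict(cars_preProcess_range, cars[, cars_numeric_vars])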
3.2) Missing values:
[1] FALSE
There are no missing values, so we can continue.
3.3) Unique variables:
Finally, we should check the numeric variables in order to be sure that they are unique.
To consider a variable unique, a threshold of 500k unique values was selected, i.e. half of the maximum possible number of unique values: 0.5 * 1,000,000 = 500,000.
In order to be able to check this, the function find_if_unique_length was created:
find_if_unique_length <- function(x) {
cars_numeric_vars_index <- c()
for (i in cars_numeric_vars) {
cars_numeric_vars_index <-
c(cars_numeric_vars_index, grep(i, colnames(x)))
}
cars_numeric_vars_unique <- c()
for (i in cars_numeric_vars_index) {
a <- length(unique(x[, i]))
if (a < (0.5) * dim(cars)[1]) {
cars_numeric_vars_unique <-
c(cars_numeric_vars_unique,
paste(colnames(cars[i]), "NOT unique"))
}
else {
cars_numeric_vars_unique <-
c(cars_numeric_vars_unique, paste(colnames(cars[i]), "UNIQUE"))
}
}
return(cars_numeric_vars_unique)
}

Now we will pass the function to our dataset:
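The call that produced the table of results below is not shown; it could look roughly like this (the kable formatting is an assumption):

# Sketch: apply the uniqueness check to the full dataset and display the results
data.frame(Results = find_if_unique_length(cars)) %>%
  kable(align = "l")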
| Results |
|---|
| normalized.losses UNIQUE |
| wheel.base UNIQUE |
| length UNIQUE |
| width UNIQUE |
| height UNIQUE |
| curb.weight UNIQUE |
| engine.size UNIQUE |
| bore UNIQUE |
| stroke UNIQUE |
| compression.ratio UNIQUE |
| horsepower UNIQUE |
| peak.rpm UNIQUE |
| city.mpg UNIQUE |
| highway.mpg UNIQUE |
| price UNIQUE |
As per the selected threshold of 500k, we can conclude that all the numeric variables are unique, so we are ready to divide our data into two parts.
4) Data partitioning:
4.1) For models that can manage ordinal variables:
We are ready to divide the data into two samples, train and test. The train sample is used to train the model, and the test sample is used to make predictions and verify the performance of the model.
The data will be divided into 70% training and 30% test, with the help of the function createDataPartition from the caret package, as sketched below:
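The partitioning code itself is not reproduced in this post; a minimal sketch is given below, assuming the split is stratified by the target variable (the object names cars_train and cars_test follow their later usage, the rest is an assumption):

# Sketch: stratified 70/30 split of the data into training and test samples
set.seed(16)
train_index <- createDataPartition(cars$symboling, p = 0.7, list = FALSE)
cars_train <- cars[train_index, ]
cars_test  <- cars[-train_index, ]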
4.1.1) Training sample:
4.1.2) Test sample:
4.2) For the models sensitive to the data distribution:
Some models are sensitive to the data distribution because they are based on Euclidean distance.
In our dataset there are many factors, and this can be time consuming for some of these models, hence all the factors will be transformed to numeric:
cars2 <- cars[-26]
indx <- sapply(cars2, is.factor)
cars2[indx] <- lapply(cars2[indx], function(x) as.numeric(as.character(x)))
cars2 <- cbind(cars2, cars[26])
cars_preProces1 <-
preProcess(cars2, method = c("range"))
cars2 <- predict(cars_preProces1,
                      cars2)

4.2.1) Training sample:
4.2.2) Test sample:
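The corresponding partitioning code is not shown; assuming the same partition index is reused for the numeric, scaled copy of the data (the name cars_train1 is taken from its later usage in the KNN section, cars_test1 is a hypothetical counterpart), it could look like this:

# Sketch: reuse the same 70/30 split for the fully numeric, scaled data
cars_train1 <- cars2[train_index, ]
cars_test1  <- cars2[-train_index, ]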
5) Feature selection:
Feature selection will be applied only to the cars_train sample:
5.1) Correlation between features:
We will check the correlation between all the numerical variables, using the list cars_numeric_vars.
Firstly, the correlation will be inspected graphically with corrplot:
It seems that the variables are not highly correlated; there are no values close to dark blue or dark red.
To be sure, we will check the maximum and minimum correlations; for the maximum, the values of one on the diagonal should be removed, because they correspond to each variable evaluated against itself:
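The exact code behind the plot and the two tables below is not reproduced in the post; a minimal sketch of such a check could look as follows (the corrplot styling options are assumptions):

# Sketch: correlation matrix of the numeric features in the training sample
cars_cor <- cor(cars_train[, cars_numeric_vars])
corrplot(cars_cor, method = "color", type = "upper", tl.cex = 0.7)

# Maximum and minimum off-diagonal correlations
diag(cars_cor) <- NA
max(cars_cor, na.rm = TRUE)
min(cars_cor, na.rm = TRUE)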
| Maximum.correlation |
|---|
| 0.17139 |
| Minimum.correlation |
|---|
| -0.16221 |
The maximum correlation is 0.17139 and the minimum -0.16221, hence no variable needs to be omitted.
5.2) Relationship with the target variable:
Now we will check whether the character variables have a relationship with the target variable symboling, using the list cars_mult_bin_vars.
In order to check this, we will use ANOVA, with the created function result_aov_pvalue:
result_aov_pvalue <- function(data, var) {
result <- c()
for (i in var) {
if (summary(aov(data[, 10] ~ data[, i]))[[1]][["Pr(>F)"]][1] < 0.05) {
result <-
c(result,
paste("Reject H0 -", i, "impact in symboling"))
}
else {
result <-
c(result,
paste("NO reject H0 -", i, "has not impact in symboling"))
}
}
return(result)
}

We will apply the function to our data:
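The call itself is not shown; it presumably looked roughly like the sketch below. Note that inside the function the target is referenced by column position (data[, 10]), so this sketch assumes that symboling sits in that position of the data frame that is passed in:

# Sketch: ANOVA-based check of every character/ordinal feature against symboling
data.frame(Decision = result_aov_pvalue(cars_train, cars_mult_bin_vars)) %>%
  kable(align = "l")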
| Decision |
|---|
| Reject H0 - make impact in symboling |
| Reject H0 - fuel.type impact in symboling |
| Reject H0 - aspiration impact in symboling |
| Reject H0 - num.of.doors impact in symboling |
| Reject H0 - body.style impact in symboling |
| Reject H0 - drive.wheels impact in symboling |
| Reject H0 - engine.location impact in symboling |
| Reject H0 - engine.type impact in symboling |
| Reject H0 - num.of.cylinders impact in symboling |
| Reject H0 - fuel.system impact in symboling |
The null hypothesis is that the variable has no impact on the target variable (symboling).
In all cases we reject this null hypothesis. Hence, considering 5% as the level of significance, we can conclude that all the character variables have an impact on the target variable.
5.3) Variables with near zero variance:
Variables with zero or near zero variance can have a negative impact on the final result of the applied algorithm, so it is important to check for them.
The function nearZeroVar from the caret package will be used:
cars_nzv_stats <- nearZeroVar(cars_train,
                              saveMetrics = TRUE)
saveRDS(cars_nzv_stats, "cars_nzv_stats.rds")

cars_nzv_stats <- readRDS("cars_nzv_stats.rds")
cars_nzv_stats_res <- cars_nzv_stats %>%
  rownames_to_column("variable") %>%
  arrange(-zeroVar, -nzv, -freqRatio)
cars_nzv_stats_res[c(1, 4:5)] %>%
  kable(align = "l", digits = 2)

cars_nzv_unsel <- c(cars_nzv_stats_res[1][cars_nzv_stats_res[5] == TRUE])
sort(cars_nzv_unsel) %>%
  knitr::knit_print()

character(0)
As we can see, there is no variable with TRUE for nzv or zeroVar, so this method does not suggest omitting any feature.
It can be concluded that no variables should be omitted based on the near zero variance check.
5.4) Linear combinations in the dataset:
We will check whether the dataset cars contains linear combinations among the features, using the function findLinearCombos:
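The exact call is not reproduced in the post. findLinearCombos expects a numeric matrix, so a sketch of the check, assuming it is run on the fully numeric copy of the training data, could be:

# Sketch: look for exact linear combinations among the numeric features
findLinearCombos(as.matrix(cars_train1 %>% dplyr::select(-symboling)))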
$linearCombos
list()

$remove
NULL
The remove element is NULL, so there are no linear combinations in our data.
5.5) Rank features - Learning Vector Quantization:
Learning Vector Quantization will be applied in order to rank the features by their importance with respect to symboling.
This algorithm is applied with down-sampling and without cross-validation, and it can give an idea of which features could be omitted.
set.seed(16)
ctrl_cvnone <- trainControl(method = "none",
sampling = "down")
rank_features <-
train(
symboling ~ .,
data = cars_train,
method = "lvq",
trControl = ctrl_cvnone
)
saveRDS(rank_features, "rank_features.rds")

Learning Vector Quantization
700001 samples, 25 predictors, 3 classes: 'neutral', 'risky', 'secure'
No pre-processing. Resampling: None. Additional sampling using down-sampling.

Most important features by level:
- neutral: engine.type
- risky: num.of.doors
- secure: engine.type

Least important features by level:
- neutral: compression.ratio
- risky: compression.ratio
- secure: compression.ratio
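The per-level ranking above can be extracted from the trained model with caret's varImp function; a minimal sketch (the plotting call is optional):

# Sketch: class-specific feature importance for the LVQ model
lvq_importance <- varImp(rank_features, scale = TRUE)
print(lvq_importance)
plot(lvq_importance, top = 10)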
As we can see, the feature engine.type is one of the most important for all the levels.
On the other hand, if we needed to omit some features, we would probably choose compression.ratio, as it is the least important for all the levels.
5.6) Conclusions:
The previous analysis did not suggest omitting any feature, but the variable make has many levels. Consequently, it can be a problem in terms of computational time when training the different models.
For this reason, the less demanding models will be used to test whether omitting the feature make gives much worse results.
6) Application of Classification algorithms:
6.1) Functions:
Two functions were created in order to evaluate the results of the models:
6.1.1) Accuracy - accuracy_multinom:
The first one, accuracy_multinom, returns the accuracy measures used to compare the performance of the models:
accuracy_multinom <- function(predicted, real) {
ctable_m <- table(predicted,
real)
accuracy <- (100 * sum(diag(ctable_m)) / sum(ctable_m))
base_ <- diag(ctable_m) / colSums(ctable_m)
balanced_accuracy <- mean(100 * ifelse(is.na(base_), 0, base_))
base_2 <- diag(ctable_m) / rowSums(ctable_m)
correctly_predicted <-
mean(100 * ifelse(is.na(base_2), 0, base_2))
return(
data.frame(
accuracy = accuracy,
balanced_accuracy = balanced_accuracy,
balanced_correctly_predicted = correctly_predicted
)
)
}

6.1.2) Graph - plot_model_fitted:
The function plot_model_fitted returns a ggplot graph of the prediction results, so that we can see how the predictions are divided among the levels:
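The body of this function is not shown in the post; a minimal sketch of what it could look like is given below (the plot styling is an assumption):

# Hypothetical sketch of plot_model_fitted: barplot of the predicted classes
plot_model_fitted <- function(fitted) {
  data.frame(fitted = fitted) %>%
    ggplot(aes(x = fitted, fill = fitted)) +
    geom_bar() +
    scale_y_continuous(labels = scales::comma) +
    labs(x = "Predicted level", y = "Number of cars") +
    theme_minimal()
}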
6.2) MLR - Multinomial Logistic Regression:
The first model to be applied will be multinomial logistic regression. It is a classification method that generalizes logistic regression to multiclass problems, in this case three classes or levels.
6.2.1) Train the data:
The first step in the application of a classification method is to train the model. To do that, the algorithm is applied to the training sample, in this case cars_train.
The maximum number of iterations will be set to 1000, because the default is 100 and the algorithm may need somewhere between those two numbers of iterations to converge.
set.seed(16)
mlr_multinomial <- multinom(symboling ~ .,
data = cars_train,
maxit = 1000)
saveRDS(mlr_multinomial, "mlr_multinomial.rds")

| Residual.Deviance | AIC |
|---|---|
| 1222691 | 1222955 |
6.2.2) Prediction on test sample:
After the model is trained, we are ready to make predictions on the test sample:
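The prediction code is not reproduced at this point; it presumably mirrors the calls shown later for the model without make, roughly as follows:

# Sketch: predicted classes and class probabilities on the test sample
set.seed(16)
mlr_multinomial_fitted <- predict(mlr_multinomial, cars_test)
mlr_multinomial_fitted_prob <- predict(mlr_multinomial, cars_test, type = "prob")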
6.2.3) Results:
First of all, we will see how the predictions are divided among the levels, with the help of the function plot_model_fitted; we will also see a table with the distribution:
| mlr_multinomial_fitted | Freq |
|---|---|
| neutral | 69965 |
| risky | 228513 |
| secure | 1521 |
As we can see in the above graph and table, not many cars were predicted as secure; this is one of the consequences of working with imbalanced data.
Most of the cars were predicted as risky: 228,513 cars.
Below we can find a table comparing the predictions with the real levels in the test sample:
| predicted / real | neutral | risky | secure |
|---|---|---|---|
| neutral | 39401 | 22588 | 7976 |
| risky | 57423 | 141773 | 29317 |
| secure | 242 | 436 | 843 |
In the table we can see that the risky level falls from 228,513 to 141,773. This is because only 141,773 of the cars predicted as risky were predicted correctly; the remaining ones actually belong to the other levels: 57,423 neutral and 29,317 secure.
Finally, we will apply the function accuracy_multinom in order to see the different measures of accuracy of our model:
| accuracy | balanced_accuracy | balanced_correctly_predicted |
|---|---|---|
| 60.67254 | 42.94378 | 57.92697 |
Below there is the ROC measure:
| Multiclass.ROC |
|---|
| 0.68041 |
The accuracy is not really high; moreover, the balanced accuracy is penalized by the fact that the data are imbalanced.
- We will check whether the accuracy drops a lot when the feature make is omitted:
set.seed(16)
mlr_multinomial_sel <- multinom(symboling ~ .,
data = cars_train %>%
dplyr::select(-make),
maxit = 1000)
saveRDS(mlr_multinomial_sel, "mlr_multinomial_sel.rds")

mlr_multinomial_sel <- readRDS("mlr_multinomial_sel.rds")
data.frame(Residual.Deviance = round(mlr_multinomial_sel[["deviance"]], 2), AIC =
round(mlr_multinomial_sel[["AIC"]], 2)) %>%
  kable(align = "l", digits = 2)

| Residual.Deviance | AIC |
|---|---|
| 1778282 | 1778462 |
As we can see, the AIC and residual deviance increase in comparison with the model that includes make.
set.seed(16)
mlr_multinomial_sel_fitted <- predict(mlr_multinomial_sel,
cars_test)
saveRDS(mlr_multinomial_sel_fitted, "mlr_multinomial_sel_fitted.rds")

set.seed(16)
mlr_multinomial_sel_fitted_prob <- predict(mlr_multinomial_sel,
cars_test,
type= "prob")
saveRDS(mlr_multinomial_sel_fitted_prob, "mlr_multinomial_sel_fitted_prob.rds")

| accuracy | balanced_accuracy | balanced_correctly_predicted |
|---|---|---|
| 59.31553 | 41.1457 | 57.39944 |
| Multiclass.ROC |
|---|
| 0.6634 |
As we can see, the accuracy was reduced, but not significantly; the same happens with the ROC.
We can conclude that removing the feature make does not have much impact on the final results of the models.
6.3) PMR - Penalized Multinomial Regression:
The caret package provides many models; one of them is Penalized Multinomial Regression, the closest one to the plain multinomial model.
The benefit of applying this algorithm through the caret package is that we can use resampling and cross-validation. Additionally, as in the previous plain multinomial model, we will set 1,000 as the maximum number of iterations.
For this model we will apply cross-validation with 5 folds:
6.3.1) Train the data:
set.seed(16)
ctrl_cv <- trainControl(method = "cv",
number = 5)
options(scipen=999)
pmr_multinomial <- train(
symboling ~ .,
data = cars_train,
method = "multinom",
trControl = ctrl_cv,
maxit = 1000
)
saveRDS(pmr_multinomial, "pmr_multinomial.rds")

Penalized Multinomial Regression
700001 samples, 25 predictors, 3 classes: 'neutral', 'risky', 'secure'
No pre-processing. Resampling: Cross-Validated (5 fold). Summary of sample sizes: 560001, 560002, 560000, 560000, 560001.
Resampling results across tuning parameters:

| decay | Accuracy | Kappa |
|---|---|---|
| 0.0000 | 0.6060191 | 0.2201044 |
| 0.0001 | 0.6060163 | 0.2200993 |
| 0.1000 | 0.6060234 | 0.2201166 |

Accuracy was used to select the optimal model using the largest value. The final value used for the model was decay = 0.1.
6.3.2) Prediction on test sample:
6.3.3) Results:
Below we can find the optimal model values:
| Residual.Deviance | AIC |
|---|---|
| 1222698 | 1222962 |
The AIC and residual deviance obtained are practically the same as for the plain multinomial model (in fact marginally higher).
The optimal decay value, i.e. the one that gives the highest accuracy, is 0.1.
| pmr_multinomial_fitted | Freq |
|---|---|
| neutral | 69965 |
| risky | 228512 |
| secure | 1522 |
As we can see in the above graph and table, not many cars were predicted as secure; this is one of the consequences of working with imbalanced data.
Below we can find a table comparing the predictions with the real levels in the test sample:
| predicted / real | neutral | risky | secure |
|---|---|---|---|
| neutral | 39399 | 22592 | 7974 |
| risky | 57425 | 141769 | 29318 |
| secure | 242 | 436 | 844 |
In the table we can see that the risky level falls from 228,512 to 141,769. This is because only 141,769 of the cars predicted as risky were predicted correctly; the remaining ones actually belong to the other levels: 57,425 neutral and 29,318 secure.
Finally, we will apply the function accuracy_multinom in order to see the different measures of accuracy of our model:
| accuracy | balanced_accuracy | balanced_correctly_predicted |
|---|---|---|
| 60.67087 | 42.94316 | 57.93529 |
| Multiclass.ROC |
|---|
| 0.68041 |
As we can see, all the accuracy measures are practically the same as in the first model, and the ROC is unchanged.
6.4) Test - Choosing the kind of sampling:
As we saw, the multinom model of the caret package is not really demanding in terms of computational time, so we will use it to test which sampling method should be applied in KNN and SVM.
We will check whether the accuracy changes when we apply no sampling and down-sampling.
6.4.1) No sampling:
set.seed(16)
ctrl_cv <- trainControl(method = "cv",
number = 2)
options(scipen=999)
pmr_multinomial_nosam <- train(
symboling ~ .,
data = cars_train,
method = "multinom",
trControl = ctrl_cv,
maxit = 1000
)
saveRDS(pmr_multinomial_nosam, "pmr_multinomial_nosam.rds")

Penalized Multinomial Regression
700001 samples, 49 predictors, 3 classes: 'neutral', 'risky', 'secure'
No pre-processing. Resampling: Cross-Validated (2 fold). Summary of sample sizes: 350001, 350000.
Resampling results across tuning parameters:

| decay | Accuracy | Kappa |
|---|---|---|
| 0.0000 | 0.5930292 | 0.1864252 |
| 0.0001 | 0.5930320 | 0.1864315 |
| 0.1000 | 0.5930263 | 0.1864106 |

Accuracy was used to select the optimal model using the largest value. The final value used for the model was decay = 0.0001.
set.seed(16)
pmr_multinomial_nosam_fitted <- predict(pmr_multinomial_nosam,
cars_test)
saveRDS(pmr_multinomial_nosam_fitted, "pmr_multinomial_nosam_fitted.rds")

set.seed(16)
pmr_multinomial_nosam_fitted_prob <- predict(pmr_multinomial_nosam,
cars_test,
type= "prob")
saveRDS(pmr_multinomial_nosam_fitted_prob, "pmr_multinomial_nosam_fitted_prob.rds")

Below we can find the predicted results by class:
| accuracy | balanced_accuracy | balanced_correctly_predicted |
|---|---|---|
| 59.32253 | 41.14495 | 57.35972 |
| Multiclass.ROC |
|---|
| 0.66332 |
6.4.2) Down-sampling:
set.seed(16)
ctrl_cv <- trainControl(method = "cv",
number = 2,
sampling = "down",
classProbs = TRUE,
summaryFunction = fiveStats)
options(scipen=999)
pmr_multinomial_down <- train(
symboling ~ .,
data = cars_train,
method = "multinom",
trControl = ctrl_cv,
maxit = 1000
)
saveRDS(pmr_multinomial_down, "pmr_multinomial_down.rds")

Penalized Multinomial Regression
700001 samples, 49 predictors, 3 classes: 'neutral', 'risky', 'secure'
No pre-processing. Resampling: Cross-Validated (2 fold). Summary of sample sizes: 350001, 350000. Additional sampling using down-sampling.
Resampling results across tuning parameters:

| decay | Accuracy | Kappa |
|---|---|---|
| 0.0000 | 0.4769293 | 0.2014660 |
| 0.0001 | 0.4768107 | 0.2012208 |
| 0.1000 | 0.4765350 | 0.2011376 |

Accuracy was used to select the optimal model using the largest value. The final value used for the model was decay = 0.
set.seed(16)
pmr_multinomial_down_fitted <- predict(pmr_multinomial_down,
cars_test)
saveRDS(pmr_multinomial_down_fitted, "pmr_multinomial_down_fitted.rds")

set.seed(16)
pmr_multinomial_down_fitted_prob <- predict(pmr_multinomial_down,
cars_test,
type= "prob")
saveRDS(pmr_multinomial_down_fitted_prob, "pmr_multinomial_down_fitted_prob.rds")

Below we can find the predicted results by class:
As we can see in the above graph, the predicted classes are no longer imbalanced, hence down-sampling deals with this problem.
| accuracy | balanced_accuracy | balanced_correctly_predicted |
|---|---|---|
| 47.56482 | 48.51109 | 46.26112 |
| Multiclass.ROC |
|---|
| 0.66779 |
6.4.3) Conclusions of the test:
The above results are typical when working with imbalanced data. In this case one cannot trust the accuracy, because the predictor tends to predict the class that has the most observations.
We can see that the accuracy is greater without sampling, but the ROC measure is greater with down-sampling, hence the model with down-sampling is preferable.
Applying down-sampling is really efficient: we obtain a better predictor and also reduce the computational time, which is important with big samples like this one.
Applying ROSE or SMOTE sampling would probably give even better results, but the computational time would probably increase drastically.
On the other hand, it is not productive to use the feature make, which contains 22 levels, because the computational time increases a lot while, as we saw before with the multinomial model, the difference in performance would probably be small.
We can conclude that down-sampling will be applied in the next algorithms, and the feature make will be omitted.
6.5) KNN - K-Nearest Neighbors:
The k-nearest neighbors algorithm is one of the best-known classification methods. The input consists of the k closest training samples in the feature space.
It uses Euclidean distance to measure the distance between neighbors; for this reason it is important to have scaled data for this algorithm. In our case the data was scaled before with the range method, to the interval [0, 1].
The algorithm will be applied with 3-fold cross-validation and down-sampling.
Additionally, we will use tuneGrid in order to customize more parameters; in this case several values of k will be tested, corresponding to the sequence seq(5, 145, 28). Hence, 6 values will be tested: 5, 33, 61, 89, 117 and 145. The algorithm will return the final model with the greatest accuracy.
6.5.1) Train the data:
set.seed(16)
k_value <- data.frame(k = c(seq(5, 145, 28)))
ctrl_cv3 <- trainControl(method= "cv",
number= 3,
sampling= "down")
knn_model <- train(symboling ~ .,
cars_train1 %>%
dplyr::select(-make),
method = "knn",
trControl = ctrl_cv3,
tuneGrid = k_value)
saveRDS(knn_model, "knn_model.rds")

k-Nearest Neighbors
700001 samples, 24 predictors, 3 classes: 'neutral', 'risky', 'secure'
No pre-processing. Resampling: Cross-Validated (3 fold). Summary of sample sizes: 466668, 466667, 466667. Additional sampling using down-sampling.
Resampling results across tuning parameters:

| k | Accuracy | Kappa |
|---|---|---|
| 5 | 0.5119107 | 0.2495125 |
| 33 | 0.5291850 | 0.2834744 |
| 61 | 0.5237907 | 0.2793799 |
| 89 | 0.5207107 | 0.2765361 |
| 117 | 0.5149907 | 0.2712024 |
| 145 | 0.5114621 | 0.2666035 |

Accuracy was used to select the optimal model using the largest value. The final value used for the model was k = 33.
6.5.2) Prediction on test sample:
6.5.3) Results:
Below we can find the optimal k value for the given sequence:
[1] 33
We can plot the results of the cross-validation:
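The plot itself is not reproduced here, but it can be generated directly from the train object, for example:

# Cross-validated accuracy of the KNN model for each tested value of k
plot(knn_model)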
| knn_model_fitted | Freq |
|---|---|
| neutral | 69699 |
| risky | 192140 |
| secure | 38160 |
Most of the cars were predicted as risky: 192,140 cars.
Below we can find a table comparing the predictions with the real levels in the test sample:
| predicted / real | neutral | risky | secure |
|---|---|---|---|
| neutral | 23643 | 38309 | 7747 |
| risky | 62051 | 105717 | 24372 |
| secure | 11372 | 20771 | 6017 |
Finally, we will apply the function accuracy_multinom in order to see the different measures of accuracy of our model:
| accuracy | balanced_accuracy | balanced_correctly_predicted |
|---|---|---|
| 45.12582 | 34.76174 | 34.9034 |
| Multiclass.ROC |
|---|
| 0.53859 |
The accuracy and ROC measure are much worse than in the previous cases.
Probably this algorithm is not the most adequate one when the dependent variable has multiple classes.
The support vector machine is a more complex algorithm. There are several types: the linear one creates a line or hyperplane which separates the data into classes. There are other kinds of SVM, such as polynomial and radial; these two are more demanding in terms of computational time, hence it is not a good idea to apply them to a large dataset.
During this analysis SVM was applied; however, the model had not finished training after two days of processing, so the process was stopped and the results will not be presented here.
6.6) LDA - Linear Discriminant Analysis:
Now Linear discriminant analysis will be applied. This algorithm is popular when dealing with a multiclass dependent variable, like our variable symboling.
It finds a linear combination of features that characterizes or separates two or more classes of objects.
It will be applied with repeated cross-validation (10 folds, repeated 3 times) and down-sampling. As it is not based on Euclidean distance, it can be applied to cars_train instead of cars_train1. Additionally, the feature make will not be omitted, because this model handles ordinal variables with many levels very well.
6.6.1) Train the data
set.seed(16)
ctrl_cv10 <- trainControl(method= "repeatedcv",
number = 10,
repeats = 3,
sampling= "down")
lda_model <- train(
symboling ~ .,
data = cars_train,
method = "lda",
trControl = ctrl_cv10)
saveRDS(lda_model, "lda_model.rds")

Linear Discriminant Analysis
700001 samples, 25 predictors, 3 classes: 'neutral', 'risky', 'secure'
No pre-processing. Resampling: Cross-Validated (10 fold, repeated 3 times). Summary of sample sizes: 630001, 630002, 630001, 630000, 630000, 630001, ... Additional sampling using down-sampling.
Resampling results:

| Accuracy | Kappa |
|---|---|
| 0.4973712 | 0.225034 |
6.6.2) Prediction on test sample
6.6.3) Results
| lda_model_fitted | Freq |
|---|---|
| neutral | 98889 |
| risky | 113509 |
| secure | 87601 |
As we can see in the above graph and table, secure is still the least frequently predicted level, although thanks to down-sampling the predictions are much more balanced than in the earlier models.
Most of the cars were predicted as risky: 113,509 cars.
Below we can find a table comparing the predictions with the real levels in the test sample:
| predicted / real | neutral | risky | secure |
|---|---|---|---|
| neutral | 49984 | 39584 | 9321 |
| risky | 23677 | 80268 | 9564 |
| secure | 23405 | 44945 | 19251 |
Finally, we will apply the function accuracy_multinom in order to see the different measures of accuracy of our model:
| accuracy | balanced_accuracy | balanced_correctly_predicted |
|---|---|---|
| 49.8345 | 50.22731 | 47.74548 |
| Multiclass.ROC |
|---|
| 0.68485 |
The ROC obtained is similar to that of the Multinomial Logistic Regression.
6.7) QDA - Quadratic Discriminant Analysis:
Similarly to the case of SVM, discriminant analysis has variants that use other kinds of combinations rather than linear ones, for example quadratic; this variant will be applied here.
It will be applied with repeated cross-validation (10 folds, repeated 3 times) and down-sampling. As it is not based on Euclidean distance, it can be applied to cars_train instead of cars_train1. Additionally, the feature make will not be omitted, because this model handles ordinal variables with many levels well.
6.7.1) Train the data
set.seed(16)
ctrl_cv10 <- trainControl(method= "repeatedcv",
number = 10,
repeats = 3,
sampling= "down")
qda_model <- train(
symboling ~ .,
data = cars_train,
method = "qda",
trControl = ctrl_cv10)
saveRDS(qda_model, "qda_model.rds")

Quadratic Discriminant Analysis
700001 samples, 25 predictors, 3 classes: 'neutral', 'risky', 'secure'
No pre-processing. Resampling: Cross-Validated (10 fold, repeated 3 times). Summary of sample sizes: 630001, 630002, 630001, 630000, 630000, 630001, ... Additional sampling using down-sampling.
Resampling results:

| Accuracy | Kappa |
|---|---|
| 0.5382359 | 0.269335 |
6.7.2) Prediction on test sample
6.7.3) Results
| qda_model_fitted | Freq |
|---|---|
| neutral | 94351 |
| risky | 127163 |
| secure | 78485 |
As we can see in the above graph and table, secure is again the least frequently predicted level, although the predictions remain fairly balanced thanks to down-sampling.
Most of the cars were predicted as risky: 127,163 cars.
Below we can find a table comparing the predictions with the real levels in the test sample:
| predicted / real | neutral | risky | secure |
|---|---|---|---|
| neutral | 51702 | 34740 | 7909 |
| risky | 25727 | 90720 | 10716 |
| secure | 19637 | 39337 | 19511 |
Finally, we will apply the function accuracy_multinom in order to see the different measures of accuracy of our model:
| accuracy | balanced_accuracy | balanced_correctly_predicted |
|---|---|---|
| 53.97785 | 53.15866 | 50.33285 |
| Multiclass.ROC |
|---|
| 0.71601 |
The ROC measure is the largest one obtained so far, so this is one of the best models, given that we cannot trust the accuracy.
7) Summary and conclusions:
7.1) Summary:
| Model name | Accuracy | Bal. accuracy | Bal. correct. accuracy | ROC | Var. select. | CV | Fold | Resampling |
|---|---|---|---|---|---|---|---|---|
| MLR | 60.67 | 42.94 | 57.93 | 0.68 | No | No | 0 | No |
| MLR -make | 59.32 | 41.15 | 57.40 | 0.66 | Yes | No | 0 | No |
| PMR | 60.67 | 42.94 | 57.94 | 0.68 | No | Yes | 5 | No |
| PMR no sam | 59.32 | 41.14 | 57.36 | 0.66 | No | Yes | 2 | No |
| PMR sam | 47.56 | 48.51 | 46.26 | 0.67 | No | Yes | 2 | Down |
| KNN | 45.13 | 34.76 | 34.90 | 0.54 | Yes | Yes | 3 | Down |
| LDA | 49.83 | 50.23 | 47.75 | 0.68 | No | Yes | 10 | Down |
| QDA | 53.98 | 53.16 | 50.33 | 0.72 | No | Yes | 10 | Down |
7.2) Conclusions:
- For the dataset cars the best classification model is Quadratic Discriminant Analysis with repeated cross-validation and down-sampling
- With imbalanced data one cannot trust the accuracy to compare models, because the model tends to predict the most common class; other kinds of measures are recommended, in our case ROC was used
- K-nearest neighbors and support vector machines are models that do not work very well with a multiclass dependent variable; it would be better to use them with a binomial dependent variable
- Before applying demanding models, it is convenient to analyse ceteris paribus what is expected to happen; in our case we analysed what happens when down-sampling is applied
- Computational time matters and should be taken into consideration when applying machine learning models, hence data preparation is really important in these cases. In certain instances it is recommended to parallelize the process, when the algorithm allows it (the doParallel and doMC packages can be used)
- In the case of big data with imbalanced classes it is really productive to use down-sampling, both in terms of computational time and model performance
- When performing a Machine Learning analysis it is important to have enough memory in the system to be able to save the results
8) References:
- Class materials provided by Piotr Wójcik PhD at the course “Machine Learning 1”, University of Warsaw, 2020
- Photo source: https://unsplash.com/photos/FkJ3aNGeFMY
- https://machinelearningmastery.com/feature-selection-with-the-caret-r-package/
- https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
- http://topepo.github.io/caret/index.html
- https://machinelearningmastery.com/compare-the-performance-of-machine-learning-algorithms-in-r/