Classification methods applied to an imbalanced big dataset
1) Project description:
This analysis is part of the final project for the course Machine Learning 1: classification methods, taught at the University of Warsaw. The final project consisted of two parts, regression and classification, prepared together with Lashari Gochiashvili.
I would like to share the part I was responsible for; hopefully it will be helpful to someone.
The purpose of this analysis is to apply classification methods to a big dataset in order to classify cars by a symboling safety level: secure, neutral and risky.
Classification is a supervised machine learning technique whose purpose is to identify to which of a set of categories a new observation belongs.
In this analysis several methods will be applied, such as learning vector quantization, multinomial regression, penalized multinomial regression, k-nearest neighbors, support vector machines, linear discriminant analysis and quadratic discriminant analysis.
Additionally, several techniques will be applied in order to find the best model performance, among which we can find down-sampling, cross-validation, tuning the models with different parameters and pre-processing.
All of these techniques are based on the caret package, one of the best-known R packages for machine learning.
2) Data description:
The dataset used in this analysis, cars, can be found on the OpenML website: https://www.openml.org/d/1398
cars is an artificial dataset generated with the BNG (Bayesian Network Generator) method, and it is based on the popular dataset with the same name that can be found in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Automobile
- First we will load all the necessary packages and the data:
library(dplyr)
library(caret)
library(ggplot2)
library(corrplot)
library(tibble)
library(nnet)
library(mlbench)
library(randomForest)
library(nnet)
library(stargazer)
library(DMwR)
library(party)
library(e1071)
library(kernlab)
library(scales)
library(class)
library(psych)
library(knitr)
library(expss)
library(reshape2)
library(pROC)
library(MASS)

[1] "The dataset cars initially has 1000000 rows and 26 columns"
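The code that loads the data and prints the message above is not shown in the post; a minimal sketch is given below, assuming the OpenML dataset has been exported to a local file (the file name is an illustrative assumption):

# Sketch: load the cars dataset from a local CSV export of https://www.openml.org/d/1398
cars <- read.csv("cars.csv", stringsAsFactors = FALSE)

# Report the initial dimensions of the dataset
print(paste("The dataset cars initially has",
            nrow(cars), "rows and", ncol(cars), "columns"))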
2.1) Features description:
The dataset basically contains several characteristics of cars, an insurance risk rating and normalized losses in use as compared to other cars.
Below we can find more details about the different features:
- normalized.losses: numerical - the relative average loss payment per insured vehicle year; this variable was normalized
- make: ordinal with 22 levels - car brand - "volkswagen", "volvo", "nissan", "porsche", "honda", "subaru", "mazda", "jaguar", "dodge", "mercury", "toyota", "chevrolet", "mercedes-benz", "peugot", "mitsubishi", "plymouth", "bmw", "saab", "isuzu", "renault", "alfa-romero" and "audi"
- fuel.type: ordinal with 2 levels - type of fuel - "diesel" and "gas"
- aspiration: ordinal with 2 levels - type of aspiration (it refers to engine breathing) - "turbo" and "std"
- num.of.doors: ordinal with 2 levels - number of doors - "four" and "two"
- body.style: ordinal with 5 levels - body style (shape) of the car - "hatchback", "sedan", "wagon", "hardtop" and "convertible"
- drive.wheels: ordinal with 3 levels - the kind of drive wheel - "fwd", "rwd" and "4wd"
- engine.location: ordinal with 2 levels - the location of the engine - "front" and "rear"
- wheel.base: numerical - the distance between the centers of the front and rear wheels
- length: numerical - the length of the car
- width: numerical - the width of the car
- height: numerical - the height of the car
- curb.weight: numerical - the total mass of the vehicle with standard equipment and all necessary operating consumables
- engine.type: ordinal with 7 levels - the kind of engine - "ohc", "ohcv", "l", "rotor", "ohcf", "dohc" and "dohcv"
- num.of.cylinders: ordinal with 7 levels - the number of cylinders - "four", "eight", "six", "three", "five", "twelve" and "two"
- engine.size: numerical - the size of the engine
- fuel.system: ordinal with 8 levels - the kind of fuel system - "2bbl", "mfi", "mpfi", "1bbl", "idi", "spdi", "4bbl" and "spfi"
- bore: numerical - the diameter of each cylinder
- stroke: numerical - the distance travelled by the piston in each cycle
- compression.ratio: numerical - a measure based on the relative volumes of the combustion chamber and the cylinder
- horsepower: numerical - the horsepower of the engine
- peak.rpm: numerical - the power band of the engine
- city.mpg: numerical - the distance travelled by the vehicle relative to the amount of fuel consumed in the city
- highway.mpg: numerical - the distance travelled by the vehicle relative to the amount of fuel consumed on the highway
- price: numerical - the price of the car
- symboling: ordinal with 7 levels - indicates how safe the car is - -3: highly secure, -2: moderately secure, -1: slightly secure, 0: neutral (neither safe nor risky), +1: slightly risky, +2: moderately risky, +3: highly risky
For the purpose of this research, first of all, the target variable symboling will be transformed to three levels:
- Secure, corresponding to symboling levels -3, -2 and -1
- Neutral, neither safe nor risky, corresponding to symboling level 0
- Risky, corresponding to symboling levels +3, +2 and +1
Afterwards, the variable will be converted to a factor.
cars$symboling <-
plyr::revalue(as.character(cars$symboling),
c("-3" = "secure",
"-2" = "secure",
"-1" = "secure",
"0" = "neutral",
"1" = "risky",
"2" = "risky",
"3" = "risky")) %>%
as.factor()

Below we can find a barplot of the target variable symboling:
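The plotting code is not shown in the post; a minimal sketch of how such a barplot could be produced with ggplot2 is given below (the styling choices are assumptions):

# Sketch: barplot of the target variable symboling
ggplot(cars, aes(x = symboling, fill = symboling)) +
  geom_bar() +
  scale_y_continuous(labels = scales::comma) +
  labs(x = "symboling", y = "Number of cars") +
  theme_minimal()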
As we can see in the graph, the levels are imbalanced. There are many more cars with the symboling level risky than secure, hence we should take this into consideration later on, perhaps by applying some type of resampling.
2.2) Numeric variables:
There are 15 numeric variables.
We can find below a density plot of these variables:
Below we can find the main statistical moments for the numeric variables:
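A table like the one below can be obtained with the describe function of the psych package; the following is only a sketch, assuming the numeric column names are stored in the vector cars_numeric_vars (a name used later in the analysis):

# Sketch: basic descriptive statistics (n, mean, sd, min, max, range, se)
describe(cars[, cars_numeric_vars], fast = TRUE) %>%
  kable(align = "l", digits = 2)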
| | vars | n | mean | sd | min | max | range | se |
|---|---|---|---|---|---|---|---|---|
| normalized.losses | 1 | 1000000 | 115.92 | 35.06 | 37.63 | 273.72 | 236.10 | 0.04 |
| wheel.base | 2 | 1000000 | 98.94 | 6.10 | 82.48 | 127.19 | 44.71 | 0.01 |
| length | 3 | 1000000 | 174.89 | 12.00 | 134.02 | 218.77 | 84.75 | 0.01 |
| width | 4 | 1000000 | 65.96 | 2.14 | 60.66 | 75.89 | 15.23 | 0.00 |
| height | 5 | 1000000 | 53.77 | 2.46 | 46.98 | 62.50 | 15.52 | 0.00 |
| curb.weight | 6 | 1000000 | 2562.55 | 514.15 | 1520.87 | 4716.21 | 3195.34 | 0.51 |
| engine.size | 7 | 1000000 | 124.86 | 40.56 | 9.89 | 418.54 | 408.65 | 0.04 |
| bore | 8 | 1000000 | 3.33 | 0.27 | 2.55 | 4.07 | 1.52 | 0.00 |
| stroke | 9 | 1000000 | 3.25 | 0.32 | 1.69 | 4.46 | 2.77 | 0.00 |
| compression.ratio | 10 | 1000000 | 9.99 | 3.87 | -11.81 | 42.16 | 53.97 | 0.00 |
| horsepower | 11 | 1000000 | 104.43 | 39.09 | 38.14 | 303.92 | 265.78 | 0.04 |
| peak.rpm | 12 | 1000000 | 5132.16 | 550.69 | 3406.97 | 6931.56 | 3524.59 | 0.55 |
| city.mpg | 13 | 1000000 | 24.27 | 6.32 | 10.51 | 54.82 | 44.31 | 0.01 |
| highway.mpg | 14 | 1000000 | 30.11 | 6.72 | 11.93 | 61.70 | 49.77 | 0.01 |
| price | 15 | 1000000 | 13474.80 | 8024.06 | -11856.91 | 63918.82 | 75775.73 | 8.02 |
2.3) Categorical variables:
There are 11 categorical variables, one of them is the target variable.
We can find below a barplot of these variables:
First of all, we will create cross-tabulation tables of the target variable symboling against the rest of the ordinal variables. This way we will be able to see, in percentages, how the different variables are distributed along the target variable:
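The tables below report column percentages; a minimal sketch of how one of them could be computed with base R is shown here (the published tables were most likely produced with the expss package, so this is only an illustration):

# Sketch: percentage of each symboling level within each body style
round(100 * prop.table(table(cars$symboling, cars$body.style), margin = 2), 1)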
- For body.style and drive.wheels:
| | #Total | convertible | hardtop | hatchback | sedan | wagon | 4wd | fwd | rwd |
|---|---|---|---|---|---|---|---|---|---|
| cars$symboling | |||||||||
| neutral | 32.4 | 34.2 | 36.1 | 28.6 | 31.8 | 37.8 | 33.6 | 31.8 | 32.1 |
| risky | 54.9 | 51.1 | 50.2 | 58.8 | 56.4 | 48.6 | 52.5 | 56.4 | 54.7 |
| secure | 12.7 | 14.7 | 13.7 | 12.5 | 11.8 | 13.6 | 13.9 | 11.7 | 13.1 |
| #Total cases | 1000000 | 95879 | 103877 | 289881 | 378870 | 131493 | 231220 | 419063 | 349717 |
- Most common cars overall:
  - For body.style - sedan
  - For drive.wheels - fwd
- Most secure and risky cars by feature:
  - For body.style - convertible is the most secure
  - For body.style - hatchback is the most risky
  - For drive.wheels - 4wd is the most secure
  - For drive.wheels - fwd is the most risky
- For fuel.type, aspiration, num.of.doors and engine.location:
| | #Total | diesel | gas | std | turbo | four | two | front | rear |
|---|---|---|---|---|---|---|---|---|---|
| cars$symboling | |||||||||
| neutral | 32.4 | 39.4 | 30.9 | 34.2 | 28.9 | 38.6 | 25.4 | 31.6 | 37.5 |
| risky | 54.9 | 50.8 | 55.8 | 52.8 | 58.9 | 47.8 | 62.8 | 55 | 54.2 |
| secure | 12.7 | 9.8 | 13.3 | 13 | 12.2 | 13.6 | 11.8 | 13.3 | 8.2 |
| #Total cases | 1000000 | 175444 | 824556 | 655664 | 344336 | 527269 | 472731 | 879494 | 120506 |
- Most common cars overall:
  - For fuel.type - gas
  - For aspiration - std
  - For num.of.doors - four
  - For engine.location - front
- Most secure and risky cars by feature:
  - For fuel.type - gas is more likely to be either risky or secure
  - For fuel.type - diesel is the most neutral
  - For aspiration - std is the most secure
  - For aspiration - turbo is the most risky
  - For num.of.doors - four is the most secure
  - For num.of.doors - two is the most risky
  - For engine.location - front is more likely to be either risky or secure
  - For engine.location - rear is the most neutral
- For engine.type:
| | #Total | dohc | dohcv | l | ohc | ohcf | ohcv | rotor |
|---|---|---|---|---|---|---|---|---|
| cars$symboling | ||||||||
| neutral | 32.4 | 25.9 | 26.6 | 41.2 | 35.3 | 31.9 | 30.3 | 27.4 |
| risky | 54.9 | 59.7 | 57.6 | 45.7 | 56.7 | 50 | 56.2 | 56.1 |
| secure | 12.7 | 14.4 | 15.9 | 13.2 | 8 | 18.1 | 13.6 | 16.5 |
| #Total cases | 1000000 | 111796 | 91894 | 111844 | 349015 | 124575 | 119278 | 91598 |
- Most common cars overall:
  - For engine.type - ohc
- Most secure and risky cars by feature:
  - For engine.type - ohcf is the most secure
  - For engine.type - dohc is the most risky
- For num.of.cylinders:
| | #Total | eight | five | four | six | three | twelve | two |
|---|---|---|---|---|---|---|---|---|
| cars$symboling | ||||||||
| neutral | 32.4 | 24.2 | 27 | 41.9 | 33 | 22 | 26.7 | 34.6 |
| risky | 54.9 | 61.6 | 58.2 | 47.9 | 51.2 | 65.6 | 60.1 | 53.8 |
| secure | 12.7 | 14.2 | 14.8 | 10.1 | 15.8 | 12.4 | 13.2 | 11.6 |
| #Total cases | 1000000 | 111681 | 110885 | 296802 | 154559 | 108764 | 101802 | 115507 |
- Most common cars overall:
  - For num.of.cylinders - four
- Most secure and risky cars by feature:
  - For num.of.cylinders - six is the most secure
  - For num.of.cylinders - three is the most risky
- For fuel.system:
| | #Total | 1bbl | 2bbl | 4bbl | idi | mfi | mpfi | spdi | spfi |
|---|---|---|---|---|---|---|---|---|---|
| cars$symboling | |||||||||
| neutral | 32.4 | 37 | 28.6 | 38.3 | 32.7 | 41 | 28 | 35.5 | 39.9 |
| risky | 54.9 | 49 | 57.7 | 49.5 | 60.8 | 46.4 | 56 | 52.6 | 48.4 |
| secure | 12.7 | 13.9 | 13.7 | 12.2 | 6.6 | 12.6 | 15.9 | 11.9 | 11.8 |
| #Total cases | 1000000 | 81366 | 196893 | 62690 | 175407 | 58088 | 290196 | 74199 | 61161 |
- Most common cars overall:
  - For fuel.system - mpfi
- Most secure and risky cars by feature:
  - For fuel.system - mpfi is the most secure
  - For fuel.system - idi is the most risky
- For make:
| | #Total | dodge | honda | jaguar | mazda | mercury | nissan | porsche | subaru | toyota | volkswagen | volvo |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| c$symboling | ||||||||||||
| neutral | 32 | 30.4 | 35.3 | 45.1 | 31.7 | 29.4 | 34.8 | 27.6 | 47.7 | 20.9 | 24.8 | 29 |
| risky | 54.8 | 55.5 | 54 | 44.3 | 56.5 | 58.7 | 54 | 60.1 | 35.8 | 64.5 | 65.6 | 51.1 |
| secure | 13.2 | 14.1 | 10.8 | 10.6 | 11.8 | 11.8 | 11.2 | 12.3 | 16.5 | 14.6 | 9.6 | 19.9 |
| #Total cases | 538539 | 44742 | 47643 | 41011 | 55339 | 35865 | 54910 | 38312 | 50981 | 73356 | 46977 | 49403 |
| | #Total | alfa-romero | audi | bmw | chevrolet | isuzu | mercedes-benz | mitsubishi | peugot | plymouth | renault | saab |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| d$symboling | ||||||||||||
| neutral | 32.8 | 28.9 | 29.6 | 42.6 | 31 | 30.4 | 30 | 25.4 | 55.2 | 26.9 | 32.1 | 23.8 |
| risky | 55 | 59.1 | 58.1 | 47.5 | 56.8 | 56.4 | 54.1 | 63.7 | 36.2 | 55.9 | 53.7 | 66.5 |
| secure | 12.2 | 11.9 | 12.2 | 10 | 12.1 | 13.1 | 15.9 | 10.8 | 8.6 | 17.1 | 14.2 | 9.7 |
| #Total cases | 461461 | 36683 | 42728 | 45119 | 35423 | 37596 | 45049 | 50748 | 50414 | 39119 | 33884 | 44698 |
- Most common cars overall:
  - For make - toyota
- Most secure and risky cars by feature:
  - For make - volvo is the most secure
  - For make - saab is the most risky
3) Cleaning the data:
Before applying any machine learning algorithm it is important to have clean data. It can improve the results and also reduce the computational time. In some cases we would not even be able to apply the algorithms without this step.
3.1) Var. transformation:
3.1.1) Encoding and conversion to factors:
All the ordinal variables of the dataset cars are stored as character, except for the target variable symboling. All of these variables should be transformed to factors, which we will do with the function as.factor().
Additionally, integer encoding will be applied. This encoding transforms the character features into numbers without losing any information or having any impact on the final results. In a large dataset like this one this step is important in order to use less memory when saving the files. Apart from that, it matters for the models which can only handle numeric variables, if these features are saved as numeric.
Firstly, the character variables will be converted to factors for the models that are not sensitive to the data distribution; afterwards they will be transformed to numeric for the models that are sensitive to it.
After applying as.factor(), we should check that no character column remains:
[1] FALSE
It is FALSE, hence we can continue further.
3.1.2) Scaling the data:
There are many numeric variables with different scales, and this can be a problem for algorithms that are based on Euclidean distance.
The caret package gives the possibility to apply preProcess to scale the data when training the model. However, in our case we will scale the data beforehand, because some algorithms are applied from other packages.
The selected scaling method is range, which scales the numeric data to the interval [0, 1] but leaves the factors unchanged.
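The scaling code for the original dataset is not reproduced above; a minimal sketch, based on the preProcess call used later for the numeric copy cars2, could be (the object name and the use of cars_numeric_vars are assumptions):

# Sketch: scale all numeric variables of cars to the [0, 1] range
cars_preProcess_range <- preProcess(cars[, cars_numeric_vars], method = c("range"))
cars[, cars_numeric_vars] <- predict(cars_preProcess_range, cars[, cars_numeric_vars])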
3.2) Missing values:
[1] FALSE
There are no missing values, so we can continue.
3.3) Unique variables:
Finally, we should check the numeric variables in order to be sure that they are unique.
To consider a variable unique, a threshold of 500k unique values was selected, i.e. half of the maximum possible number of unique values: 0.5 * 1,000,000 = 500,000.
In order to be able to check this, the function find_if_unique_length was created:
find_if_unique_length <- function(x) {
cars_numeric_vars_index <- c()
for (i in cars_numeric_vars) {
cars_numeric_vars_index <-
c(cars_numeric_vars_index, grep(i, colnames(x)))
}
cars_numeric_vars_unique <- c()
for (i in cars_numeric_vars_index) {
a <- length(unique(x[, i]))
if (a < (0.5) * dim(cars)[1]) {
cars_numeric_vars_unique <-
c(cars_numeric_vars_unique,
paste(colnames(cars[i]), "NOT unique"))
}
else {
cars_numeric_vars_unique <-
c(cars_numeric_vars_unique, paste(colnames(cars[i]), "UNIQUE"))
}
}
return(cars_numeric_vars_unique)
}

Now we will pass the function to our dataset:
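The call that produced the table of results below is not shown; it could look roughly like this (the kable formatting is an assumption):

# Sketch: apply the uniqueness check to the full dataset and display the results
data.frame(Results = find_if_unique_length(cars)) %>%
  kable(align = "l")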
| Results |
|---|
| normalized.losses UNIQUE |
| wheel.base UNIQUE |
| length UNIQUE |
| width UNIQUE |
| height UNIQUE |
| curb.weight UNIQUE |
| engine.size UNIQUE |
| bore UNIQUE |
| stroke UNIQUE |
| compression.ratio UNIQUE |
| horsepower UNIQUE |
| peak.rpm UNIQUE |
| city.mpg UNIQUE |
| highway.mpg UNIQUE |
| price UNIQUE |
As per the selected threshold of 500k, we can conclude that all the numeric variables are unique, so we are ready to divide our data into two parts.
4) Data partitioning:
4.1) For models that can manage ordinal variables:
We are ready to divide the data into two samples, train and test. The train sample is used to train the model, and the test sample is used to make predictions and verify the performance of the model.
The data will be divided into 70% training and 30% test, with the help of the function createDataPartition from the caret package, as sketched below:
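The partitioning code itself is not reproduced in this post; a minimal sketch is given below, assuming the split is stratified by the target variable (the object names cars_train and cars_test follow their later usage, the rest is an assumption):

# Sketch: stratified 70/30 split of the data into training and test samples
set.seed(16)
train_index <- createDataPartition(cars$symboling, p = 0.7, list = FALSE)
cars_train <- cars[train_index, ]
cars_test  <- cars[-train_index, ]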
4.1.1) Training sample:
4.1.2) Test sample:
4.2) For the models sensitive to the data distribution:
Some models are sensitive to the data distribution because they are based on Euclidean distance.
In our dataset there are many factors, and this can be time consuming for some of these models, hence all the factors will be transformed to numeric:
cars2 <- cars[-26]
indx <- sapply(cars2, is.factor)
cars2[indx] <- lapply(cars2[indx], function(x) as.numeric(as.character(x)))
cars2 <- cbind(cars2, cars[26])
cars_preProces1 <-
preProcess(cars2, method = c("range"))
cars2 <- predict(cars_preProces1,
                      cars2)

4.2.1) Training sample:
4.2.2) Test sample:
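The corresponding partitioning code is not shown; assuming the same partition index is reused for the numeric, scaled copy of the data (the name cars_train1 is taken from its later usage in the KNN section, cars_test1 is a hypothetical counterpart), it could look like this:

# Sketch: reuse the same 70/30 split for the fully numeric, scaled data
cars_train1 <- cars2[train_index, ]
cars_test1  <- cars2[-train_index, ]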
5) Feature selection:
Feature selection will be applied only to the cars_train sample:
5.1) Correlation between features:
We will check the correlation between all the numerical variables, using the list cars_numeric_vars.
Firstly, the correlation will be inspected graphically with corrplot:
It seems that the variables are not highly correlated; there are no values close to dark blue or dark red.
To be sure, we will check the maximum and minimum correlations; for the maximum, the values of one on the diagonal should be removed, because they correspond to each variable evaluated against itself:
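The exact code behind the plot and the two tables below is not reproduced in the post; a minimal sketch of such a check could look as follows (the corrplot styling options are assumptions):

# Sketch: correlation matrix of the numeric features in the training sample
cars_cor <- cor(cars_train[, cars_numeric_vars])
corrplot(cars_cor, method = "color", type = "upper", tl.cex = 0.7)

# Maximum and minimum off-diagonal correlations
diag(cars_cor) <- NA
max(cars_cor, na.rm = TRUE)
min(cars_cor, na.rm = TRUE)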
| Maximum.correlation |
|---|
| 0.17139 |
| Minimum.correlation |
|---|
| -0.16221 |
The maximum correlation is 0.17139 and the minimum -0.16221, hence no variable needs to be omitted.
5.2) Relationship with the target variable:
Now we will check whether the character variables have a relationship with the target variable symboling, using the list cars_mult_bin_vars.
In order to check this, we will use ANOVA, with the created function result_aov_pvalue:
result_aov_pvalue <- function(data, var) {
result <- c()
for (i in var) {
if (summary(aov(data[, 10] ~ data[, i]))[[1]][["Pr(>F)"]][1] < 0.05) {
result <-
c(result,
paste("Reject H0 -", i, "impact in symboling"))
}
else {
result <-
c(result,
paste("NO reject H0 -", i, "has not impact in symboling"))
}
}
return(result)
}

We will apply the function to our data:
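The call itself is not shown; it presumably looked roughly like the sketch below. Note that inside the function the target is referenced by column position (data[, 10]), so this sketch assumes that symboling sits in that position of the data frame that is passed in:

# Sketch: ANOVA-based check of every character/ordinal feature against symboling
data.frame(Decision = result_aov_pvalue(cars_train, cars_mult_bin_vars)) %>%
  kable(align = "l")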
| Decision |
|---|
| Reject H0 - make impact in symboling |
| Reject H0 - fuel.type impact in symboling |
| Reject H0 - aspiration impact in symboling |
| Reject H0 - num.of.doors impact in symboling |
| Reject H0 - body.style impact in symboling |
| Reject H0 - drive.wheels impact in symboling |
| Reject H0 - engine.location impact in symboling |
| Reject H0 - engine.type impact in symboling |
| Reject H0 - num.of.cylinders impact in symboling |
| Reject H0 - fuel.system impact in symboling |
The null hypothesis is that the variable has no impact on the target variable (symboling).
In all cases we reject this null hypothesis. Hence, considering 5% as the level of significance, we can conclude that all the character variables have an impact on the target variable.
5.3) Variables with near zero variance:
Variables with zero or near zero variance can have a negative impact on the final result of the applied algorithm, so it is important to check for them.
The function nearZeroVar from the caret package will be used:
cars_nzv_stats <- nearZeroVar(cars_train,
                              saveMetrics = TRUE)
saveRDS(cars_nzv_stats, "cars_nzv_stats.rds")

cars_nzv_stats <- readRDS("cars_nzv_stats.rds")
cars_nzv_stats_res <- cars_nzv_stats %>%
  rownames_to_column("variable") %>%
  arrange(-zeroVar, -nzv, -freqRatio)
cars_nzv_stats_res[c(1, 4:5)] %>%
  kable(align = "l", digits = 2)

cars_nzv_unsel <- c(cars_nzv_stats_res[1][cars_nzv_stats_res[5] == TRUE])
sort(cars_nzv_unsel) %>%
  knitr::knit_print()

character(0)
As we can see, there is no variable with TRUE for nzv or zeroVar, so this method does not suggest omitting any feature.
It can be concluded that no variables should be omitted based on the near zero variance check.
5.4) Linear combinations in the dataset:
We will check whether the dataset cars contains linear combinations among the features, using the function findLinearCombos:
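The exact call is not reproduced in the post. findLinearCombos expects a numeric matrix, so a sketch of the check, assuming it is run on the fully numeric copy of the training data, could be:

# Sketch: look for exact linear combinations among the numeric features
findLinearCombos(as.matrix(cars_train1 %>% dplyr::select(-symboling)))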
$linearCombos
list()

$remove
NULL
The remove element is NULL, so there are no linear combinations in our data.
5.5) Rank features - Learning Vector Quantization:
Learning Vector Quantization will be applied in order to rank the features by their importance with respect to symboling.
This algorithm is applied with down-sampling and without cross-validation, and it can give an idea of which features could be omitted.
set.seed(16)
ctrl_cvnone <- trainControl(method = "none",
sampling = "down")
rank_features <-
train(
symboling ~ .,
data = cars_train,
method = "lvq",
trControl = ctrl_cvnone
)
saveRDS(rank_features, "rank_features.rds")

Learning Vector Quantization
700001 samples, 25 predictors, 3 classes: 'neutral', 'risky', 'secure'
No pre-processing. Resampling: None. Additional sampling using down-sampling.

Most important features by level:
- neutral: engine.type
- risky: num.of.doors
- secure: engine.type

Least important features by level:
- neutral: compression.ratio
- risky: compression.ratio
- secure: compression.ratio
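The per-level ranking above can be extracted from the trained model with caret's varImp function; a minimal sketch (the plotting call is optional):

# Sketch: class-specific feature importance for the LVQ model
lvq_importance <- varImp(rank_features, scale = TRUE)
print(lvq_importance)
plot(lvq_importance, top = 10)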
As we can see, the feature engine.type is one of the most important for all the levels.
On the other hand, if we needed to omit some features, we would probably choose compression.ratio, as it is the least important for all the levels.
5.6) Conclusions:
The previous analysis did not suggest omitting any feature, but the variable make has many levels. Consequently, it can be a problem in terms of computational time when training the different models.
For this reason, the less demanding models will be used to test whether omitting the feature make gives much worse results.
6) Application of Classification algorithms:
6.1) Functions:
Two functions were created in order to evaluate the results of the models:
6.1.1) Accuracy - accuracy_multinom:
The first one, accuracy_multinom, returns the accuracy measures used to compare the performance of the models:
accuracy_multinom <- function(predicted, real) {
ctable_m <- table(predicted,
real)
accuracy <- (100 * sum(diag(ctable_m)) / sum(ctable_m))
base_ <- diag(ctable_m) / colSums(ctable_m)
balanced_accuracy <- mean(100 * ifelse(is.na(base_), 0, base_))
base_2 <- diag(ctable_m) / rowSums(ctable_m)
correctly_predicted <-
mean(100 * ifelse(is.na(base_2), 0, base_2))
return(
data.frame(
accuracy = accuracy,
balanced_accuracy = balanced_accuracy,
balanced_correctly_predicted = correctly_predicted
)
)
}

6.1.2) Graph - plot_model_fitted:
The function plot_model_fitted returns a ggplot graph of the prediction results, so that we can see how the predictions are divided among the levels:
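The body of this function is not shown in the post; a minimal sketch of what it could look like is given below (the plot styling is an assumption):

# Hypothetical sketch of plot_model_fitted: barplot of the predicted classes
plot_model_fitted <- function(fitted) {
  data.frame(fitted = fitted) %>%
    ggplot(aes(x = fitted, fill = fitted)) +
    geom_bar() +
    scale_y_continuous(labels = scales::comma) +
    labs(x = "Predicted level", y = "Number of cars") +
    theme_minimal()
}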
6.2) MLR - Multinomial Logistic Regression:
The first model to be applied will be multinomial logistic regression. It is a classification method that generalizes logistic regression to multiclass problems, in this case three classes or levels.
6.2.1) Train the data:
The first step in the application of a classification method is to train the model. To do that, the algorithm is applied to the training sample, in this case cars_train.
The maximum number of iterations will be set to 1000, because the default is 100 and the algorithm may need somewhere between those two numbers of iterations to converge.
set.seed(16)
mlr_multinomial <- multinom(symboling ~ .,
data = cars_train,
maxit = 1000)
saveRDS(mlr_multinomial, "mlr_multinomial.rds")

| Residual.Deviance | AIC |
|---|---|
| 1222691 | 1222955 |
6.2.2) Prediction on test sample:
After the model is trained, we are ready to make predictions on the test sample:
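The prediction code is not reproduced at this point; it presumably mirrors the calls shown later for the model without make, roughly as follows:

# Sketch: predicted classes and class probabilities on the test sample
set.seed(16)
mlr_multinomial_fitted <- predict(mlr_multinomial, cars_test)
mlr_multinomial_fitted_prob <- predict(mlr_multinomial, cars_test, type = "prob")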
6.2.3) Results:
First of all, we will see how the predictions are divided among the levels, with the help of the function plot_model_fitted; we will also see a table with the distribution:
| mlr_multinomial_fitted | Freq |
|---|---|
| neutral | 69965 |
| risky | 228513 |
| secure | 1521 |
As we can see in the above graph and table, not many cars were predicted as secure; this is one of the consequences of working with imbalanced data.
Most of the cars were predicted as risky: 228,513 cars.
Below we can find a table comparing the predictions with the real levels in the test sample:
| predicted / real | neutral | risky | secure |
|---|---|---|---|
| neutral | 39401 | 22588 | 7976 |
| risky | 57423 | 141773 | 29317 |
| secure | 242 | 436 | 843 |
In the table we can see that the risky level falls from 228,513 to 141,773. This is because only 141,773 of the cars predicted as risky were predicted correctly; the remaining ones actually belong to the other levels: 57,423 neutral and 29,317 secure.
Finally, we will apply the function accuracy_multinom in order to see the different measures of accuracy of our model:
| accuracy | balanced_accuracy | balanced_correctly_predicted |
|---|---|---|
| 60.67254 | 42.94378 | 57.92697 |
Below there is the ROC measure:
| Multiclass.ROC |
|---|
| 0.68041 |
The accuracy is not really high; moreover, the balanced accuracy is penalized by the fact that the data are imbalanced.
- We will check whether the accuracy drops a lot when the feature make is omitted:
set.seed(16)
mlr_multinomial_sel <- multinom(symboling ~ .,
data = cars_train %>%
dplyr::select(-make),
maxit = 1000)
saveRDS(mlr_multinomial_sel, "mlr_multinomial_sel.rds")

mlr_multinomial_sel <- readRDS("mlr_multinomial_sel.rds")
data.frame(Residual.Deviance = round(mlr_multinomial_sel[["deviance"]], 2), AIC =
round(mlr_multinomial_sel[["AIC"]], 2)) %>%
  kable(align = "l", digits = 2)

| Residual.Deviance | AIC |
|---|---|
| 1778282 | 1778462 |
As we can see, the AIC and residual deviance increase in comparison with the model that includes make.
set.seed(16)
mlr_multinomial_sel_fitted <- predict(mlr_multinomial_sel,
cars_test)
saveRDS(mlr_multinomial_sel_fitted, "mlr_multinomial_sel_fitted.rds")

set.seed(16)
mlr_multinomial_sel_fitted_prob <- predict(mlr_multinomial_sel,
cars_test,
type= "prob")
saveRDS(mlr_multinomial_sel_fitted_prob, "mlr_multinomial_sel_fitted_prob.rds")

| accuracy | balanced_accuracy | balanced_correctly_predicted |
|---|---|---|
| 59.31553 | 41.1457 | 57.39944 |
| Multiclass.ROC |
|---|
| 0.6634 |
As we can see, the accuracy was reduced, but not significantly; the same happens with the ROC.
We can conclude that removing the feature make does not have much impact on the final results of the models.
6.3) PMR - Penalized Multinomial Regression:
The caret package provides many models; one of them is Penalized Multinomial Regression, the closest one to the plain multinomial model.
The benefit of applying this algorithm through the caret package is that we can use resampling and cross-validation. Additionally, as in the previous plain multinomial model, we will set 1,000 as the maximum number of iterations.
For this model we will apply cross-validation with 5 folds:
6.3.1) Train the data:
set.seed(16)
ctrl_cv <- trainControl(method = "cv",
number = 5)
options(scipen=999)
pmr_multinomial <- train(
symboling ~ .,
data = cars_train,
method = "multinom",
trControl = ctrl_cv,
maxit = 1000
)
saveRDS(pmr_multinomial, "pmr_multinomial.rds")

Penalized Multinomial Regression
700001 samples, 25 predictors, 3 classes: 'neutral', 'risky', 'secure'
No pre-processing. Resampling: Cross-Validated (5 fold). Summary of sample sizes: 560001, 560002, 560000, 560000, 560001.
Resampling results across tuning parameters:

| decay | Accuracy | Kappa |
|---|---|---|
| 0.0000 | 0.6060191 | 0.2201044 |
| 0.0001 | 0.6060163 | 0.2200993 |
| 0.1000 | 0.6060234 | 0.2201166 |

Accuracy was used to select the optimal model using the largest value. The final value used for the model was decay = 0.1.
6.3.2) Prediction on test sample:
6.3.3) Results:
Below we can find the optimal model values:
| Residual.Deviance | AIC |
|---|---|
| 1222698 | 1222962 |
The AIC and residual deviance obtained are practically the same as for the plain multinomial model (in fact marginally higher).
The optimal decay value, i.e. the one that gives the highest accuracy, is 0.1.
| pmr_multinomial_fitted | Freq |
|---|---|
| neutral | 69965 |
| risky | 228512 |
| secure | 1522 |
As we can see in the above graph and table, not many cars were predicted as secure; this is one of the consequences of working with imbalanced data.
Below we can find a table comparing the predictions with the real levels in the test sample:
| predicted / real | neutral | risky | secure |
|---|---|---|---|
| neutral | 39399 | 22592 | 7974 |
| risky | 57425 | 141769 | 29318 |
| secure | 242 | 436 | 844 |
In the table we can see that the risky level falls from 228,512 to 141,769. This is because only 141,769 of the cars predicted as risky were predicted correctly; the remaining ones actually belong to the other levels: 57,425 neutral and 29,318 secure.
Finally, we will apply the function accuracy_multinom in order to see the different measures of accuracy of our model:
| accuracy | balanced_accuracy | balanced_correctly_predicted |
|---|---|---|
| 60.67087 | 42.94316 | 57.93529 |
| Multiclass.ROC |
|---|
| 0.68041 |
As we can see, all the accuracy measures are practically the same as in the first model, and the ROC is unchanged.
6.4) Test - Choosing the kind of sampling:
As we saw, the multinom model of the caret package is not really demanding in terms of computational time, so we will use it to test which sampling method should be applied in KNN and SVM.
We will check whether the accuracy changes when we apply no sampling and down-sampling.
6.4.1) No sampling:
set.seed(16)
ctrl_cv <- trainControl(method = "cv",
number = 2)
options(scipen=999)
pmr_multinomial_nosam <- train(
symboling ~ .,
data = cars_train,
method = "multinom",
trControl = ctrl_cv,
maxit = 1000
)
saveRDS(pmr_multinomial_nosam, "pmr_multinomial_nosam.rds")

Penalized Multinomial Regression
700001 samples, 49 predictors, 3 classes: 'neutral', 'risky', 'secure'
No pre-processing. Resampling: Cross-Validated (2 fold). Summary of sample sizes: 350001, 350000.
Resampling results across tuning parameters:

| decay | Accuracy | Kappa |
|---|---|---|
| 0.0000 | 0.5930292 | 0.1864252 |
| 0.0001 | 0.5930320 | 0.1864315 |
| 0.1000 | 0.5930263 | 0.1864106 |

Accuracy was used to select the optimal model using the largest value. The final value used for the model was decay = 0.0001.
set.seed(16)
pmr_multinomial_nosam_fitted <- predict(pmr_multinomial_nosam,
cars_test)
saveRDS(pmr_multinomial_nosam_fitted, "pmr_multinomial_nosam_fitted.rds")

set.seed(16)
pmr_multinomial_nosam_fitted_prob <- predict(pmr_multinomial_nosam,
cars_test,
type= "prob")
saveRDS(pmr_multinomial_nosam_fitted_prob, "pmr_multinomial_nosam_fitted_prob.rds")

Below we can find the predicted results by class:
| accuracy | balanced_accuracy | balanced_correctly_predicted |
|---|---|---|
| 59.32253 | 41.14495 | 57.35972 |
| Multiclass.ROC |
|---|
| 0.66332 |
6.4.2) Down-sampling:
set.seed(16)
ctrl_cv <- trainControl(method = "cv",
number = 2,
sampling = "down",
classProbs = TRUE,
summaryFunction = fiveStats)
options(scipen=999)
pmr_multinomial_down <- train(
symboling ~ .,
data = cars_train,
method = "multinom",
trControl = ctrl_cv,
maxit = 1000
)
saveRDS(pmr_multinomial_down, "pmr_multinomial_down.rds")

Penalized Multinomial Regression
700001 samples, 49 predictors, 3 classes: 'neutral', 'risky', 'secure'
No pre-processing. Resampling: Cross-Validated (2 fold). Summary of sample sizes: 350001, 350000. Additional sampling using down-sampling.
Resampling results across tuning parameters:

| decay | Accuracy | Kappa |
|---|---|---|
| 0.0000 | 0.4769293 | 0.2014660 |
| 0.0001 | 0.4768107 | 0.2012208 |
| 0.1000 | 0.4765350 | 0.2011376 |

Accuracy was used to select the optimal model using the largest value. The final value used for the model was decay = 0.
set.seed(16)
pmr_multinomial_down_fitted <- predict(pmr_multinomial_down,
cars_test)
saveRDS(pmr_multinomial_down_fitted, "pmr_multinomial_down_fitted.rds")

set.seed(16)
pmr_multinomial_down_fitted_prob <- predict(pmr_multinomial_down,
cars_test,
type= "prob")
saveRDS(pmr_multinomial_down_fitted_prob, "pmr_multinomial_down_fitted_prob.rds")

Below we can find the predicted results by class:
As we can see in the above graph, the predicted classes are no longer imbalanced, hence down-sampling deals with this problem.
| accuracy | balanced_accuracy | balanced_correctly_predicted |
|---|---|---|
| 47.56482 | 48.51109 | 46.26112 |
| Multiclass.ROC |
|---|
| 0.66779 |
6.4.3) Conclusions of the test:
The above results are typical when working with imbalanced data. In this case one cannot trust the accuracy, because the predictor tends to predict the class that has the most observations.
We can see that the accuracy is greater without sampling, but the ROC measure is greater with down-sampling, hence the model with down-sampling is preferable.
Applying down-sampling is really efficient: we obtain a better predictor and also reduce the computational time, which is important with big samples like this one.
Applying ROSE or SMOTE sampling would probably give even better results, but the computational time would probably increase drastically.
On the other hand, it is not productive to use the feature make, which contains 22 levels, because the computational time increases a lot while, as we saw before with the multinomial model, the difference in performance would probably be small.
We can conclude that down-sampling will be applied in the next algorithms, and the feature make will be omitted.
6.5) KNN - K-Nearest Neighbors:
The k-nearest neighbors algorithm is one of the best-known classification methods. The input consists of the k closest training samples in the feature space.
It uses Euclidean distance to measure the distance between neighbors; for this reason it is important to have scaled data for this algorithm. In our case the data was scaled before with the range method, to the interval [0, 1].
The algorithm will be applied with 3-fold cross-validation and down-sampling.
Additionally, we will use tuneGrid in order to customize more parameters; in this case several values of k will be tested, corresponding to the sequence seq(5, 145, 28). Hence, 6 values will be tested: 5, 33, 61, 89, 117 and 145. The algorithm will return the final model with the greatest accuracy.
6.5.1) Train the data:
set.seed(16)
k_value <- data.frame(k = c(seq(5, 145, 28)))
ctrl_cv3 <- trainControl(method= "cv",
number= 3,
sampling= "down")
knn_model <- train(symboling ~ .,
cars_train1 %>%
dplyr::select(-make),
method = "knn",
trControl = ctrl_cv3,
tuneGrid = k_value)
saveRDS(knn_model, "knn_model.rds")

k-Nearest Neighbors
700001 samples, 24 predictors, 3 classes: 'neutral', 'risky', 'secure'
No pre-processing. Resampling: Cross-Validated (3 fold). Summary of sample sizes: 466668, 466667, 466667. Additional sampling using down-sampling.
Resampling results across tuning parameters:

| k | Accuracy | Kappa |
|---|---|---|
| 5 | 0.5119107 | 0.2495125 |
| 33 | 0.5291850 | 0.2834744 |
| 61 | 0.5237907 | 0.2793799 |
| 89 | 0.5207107 | 0.2765361 |
| 117 | 0.5149907 | 0.2712024 |
| 145 | 0.5114621 | 0.2666035 |

Accuracy was used to select the optimal model using the largest value. The final value used for the model was k = 33.
6.5.2) Prediction on test sample:
6.5.3) Results:
Below we can find the optimal k value for the given sequence:
[1] 33
We can plot the results of the cross-validation:
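The plot itself is not reproduced here, but it can be generated directly from the train object, for example:

# Cross-validated accuracy of the KNN model for each tested value of k
plot(knn_model)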
| knn_model_fitted | Freq |
|---|---|
| neutral | 69699 |
| risky | 192140 |
| secure | 38160 |
Most of the cars were predicted as risky: 192,140 cars.
Below we can find a table comparing the predictions with the real levels in the test sample:
| predicted / real | neutral | risky | secure |
|---|---|---|---|
| neutral | 23643 | 38309 | 7747 |
| risky | 62051 | 105717 | 24372 |
| secure | 11372 | 20771 | 6017 |
Finally, we will apply the function accuracy_multinom in order to see the different measures of accuracy of our model:
| accuracy | balanced_accuracy | balanced_correctly_predicted |
|---|---|---|
| 45.12582 | 34.76174 | 34.9034 |
| Multiclass.ROC |
|---|
| 0.53859 |
The accuracy and ROC measure are much worse than in the previous cases.
Probably this algorithm is not the most adequate one when the dependent variable has multiple classes.
The support vector machine is a more complex algorithm. There are several types: the linear one creates a line or hyperplane which separates the data into classes. There are other kinds of SVM, such as polynomial and radial; these two are more demanding in terms of computational time, hence it is not a good idea to apply them to a large dataset.
During this analysis SVM was applied; however, the model had not finished training after two days of processing, so the process was stopped and the results will not be presented here.
6.6) LDA - Linear Discriminant Analysis:
Now Linear discriminant analysis will be applied. This algorithm is popular when dealing with a multiclass dependent variable, like our variable symboling.
It finds a linear combination of features that characterizes or separates two or more classes of objects.
It will be applied with repeated cross-validation (10 folds, repeated 3 times) and down-sampling. As it is not based on Euclidean distance, it can be applied to cars_train instead of cars_train1. Additionally, the feature make will not be omitted, because this model handles ordinal variables with many levels very well.
6.6.1) Train the data
set.seed(16)
ctrl_cv10 <- trainControl(method= "repeatedcv",
number = 10,
repeats = 3,
sampling= "down")
lda_model <- train(
symboling ~ .,
data = cars_train,
method = "lda",
trControl = ctrl_cv10)
saveRDS(lda_model, "lda_model.rds")

Linear Discriminant Analysis
700001 samples, 25 predictors, 3 classes: 'neutral', 'risky', 'secure'
No pre-processing. Resampling: Cross-Validated (10 fold, repeated 3 times). Summary of sample sizes: 630001, 630002, 630001, 630000, 630000, 630001, ... Additional sampling using down-sampling.
Resampling results:

| Accuracy | Kappa |
|---|---|
| 0.4973712 | 0.225034 |
6.6.2) Prediction on test sample
6.6.3) Results
| lda_model_fitted | Freq |
|---|---|
| neutral | 98889 |
| risky | 113509 |
| secure | 87601 |
As we can see in the above graph and table, secure is still the least frequently predicted level, although thanks to down-sampling the predictions are much more balanced than in the earlier models.
Most of the cars were predicted as risky: 113,509 cars.
Below we can find a table comparing the predictions with the real levels in the test sample:
| predicted / real | neutral | risky | secure |
|---|---|---|---|
| neutral | 49984 | 39584 | 9321 |
| risky | 23677 | 80268 | 9564 |
| secure | 23405 | 44945 | 19251 |
Finally, we will apply the function accuracy_multinom in order to see the different measures of accuracy of our model:
| accuracy | balanced_accuracy | balanced_correctly_predicted |
|---|---|---|
| 49.8345 | 50.22731 | 47.74548 |
| Multiclass.ROC |
|---|
| 0.68485 |
The ROC obtained is similar to that of the Multinomial Logistic Regression.
6.7) QDA - Quadratic Discriminant Analysis:
Similarly to the case of SVM, discriminant analysis has variants that use other kinds of combinations rather than linear ones, for example quadratic; this variant will be applied here.
It will be applied with repeated cross-validation (10 folds, repeated 3 times) and down-sampling. As it is not based on Euclidean distance, it can be applied to cars_train instead of cars_train1. Additionally, the feature make will not be omitted, because this model handles ordinal variables with many levels well.
6.7.1) Train the data
set.seed(16)
ctrl_cv10 <- trainControl(method= "repeatedcv",
number = 10,
repeats = 3,
sampling= "down")
qda_model <- train(
symboling ~ .,
data = cars_train,
method = "qda",
trControl = ctrl_cv10)
saveRDS(qda_model, "qda_model.rds")

Quadratic Discriminant Analysis
700001 samples, 25 predictors, 3 classes: 'neutral', 'risky', 'secure'
No pre-processing. Resampling: Cross-Validated (10 fold, repeated 3 times). Summary of sample sizes: 630001, 630002, 630001, 630000, 630000, 630001, ... Additional sampling using down-sampling.
Resampling results:

| Accuracy | Kappa |
|---|---|
| 0.5382359 | 0.269335 |
6.7.2) Prediction on test sample
6.7.3) Results
| qda_model_fitted | Freq |
|---|---|
| neutral | 94351 |
| risky | 127163 |
| secure | 78485 |
As we can see in the above graph and table, secure is again the least frequently predicted level, although the predictions remain fairly balanced thanks to down-sampling.
Most of the cars were predicted as risky: 127,163 cars.
Below we can find a table comparing the predictions with the real levels in the test sample:
| predicted / real | neutral | risky | secure |
|---|---|---|---|
| neutral | 51702 | 34740 | 7909 |
| risky | 25727 | 90720 | 10716 |
| secure | 19637 | 39337 | 19511 |
Finally, we will apply the function accuracy_multinom in order to see the different measures of accuracy of our model:
| accuracy | balanced_accuracy | balanced_correctly_predicted |
|---|---|---|
| 53.97785 | 53.15866 | 50.33285 |
| Multiclass.ROC |
|---|
| 0.71601 |
The ROC measure is the largest one obtained so far, so this is one of the best models, given that we cannot trust the accuracy.
7) Summary and conclusions:
7.1) Summary:
| Model name | Accuracy | Bal. accuracy | Bal. correct. accuracy | ROC | Var. select. | CV | Fold | Resampling |
|---|---|---|---|---|---|---|---|---|
| MLR | 60.67 | 42.94 | 57.93 | 0.68 | No | No | 0 | No |
| MLR -make | 59.32 | 41.15 | 57.40 | 0.66 | Yes | No | 0 | No |
| PMR | 60.67 | 42.94 | 57.94 | 0.68 | No | Yes | 5 | No |
| PMR no sam | 59.32 | 41.14 | 57.36 | 0.66 | No | Yes | 2 | No |
| PMR sam | 47.56 | 48.51 | 46.26 | 0.67 | No | Yes | 2 | Down |
| KNN | 45.13 | 34.76 | 34.90 | 0.54 | Yes | Yes | 3 | Down |
| LDA | 49.83 | 50.23 | 47.75 | 0.68 | No | Yes | 10 | Down |
| QDA | 53.98 | 53.16 | 50.33 | 0.72 | No | Yes | 10 | Down |
7.2) Conclusions:
- For the dataset cars the best classification model is Quadratic Discriminant Analysis with repeated cross-validation and down-sampling
- With imbalanced data one cannot trust the accuracy to compare models, because the model tends to predict the most common class; other kinds of measures are recommended, in our case ROC was used
- K-nearest neighbors and support vector machines are models that do not work very well with a multiclass dependent variable; it would be better to use them with a binomial dependent variable
- Before applying demanding models, it is convenient to analyse ceteris paribus what is expected to happen; in our case we analysed what happens when down-sampling is applied
- Computational time matters and should be taken into consideration when applying machine learning models, hence data preparation is really important in these cases. In certain instances it is recommended to parallelize the process, when the algorithm allows it (the doParallel and doMC packages can be used)
- In the case of big data with imbalanced classes it is really productive to use down-sampling, both in terms of computational time and model performance
- When performing a Machine Learning analysis it is important to have enough memory in the system to be able to save the results
8) References:
- Class materials provided by Piotr Wójcik PhD at the course “Machine Learning 1”, University of Warsaw, 2020
- Photo source: https://unsplash.com/photos/FkJ3aNGeFMY
- https://machinelearningmastery.com/feature-selection-with-the-caret-r-package/
- https://machinelearningmastery.com/why-one-hot-encode-data-in-machine-learning/
- http://topepo.github.io/caret/index.html
- https://machinelearningmastery.com/compare-the-performance-of-machine-learning-algorithms-in-r/