1) Project description:

This analysis is part of the final project of the course Machine Learning 1: classification methods, taught at the University of Warsaw. The final project consisted of two parts, regression and classification, carried out together with Lashari Gochiashvili.

I would like to share the part I was responsible for; hopefully it will be helpful to someone.

The purpose of this analysis is to apply classification methods to a big dataset, in order to classify cars by their symboling safety level: secure, neutral and risky.

Classification is a supervised machine learning technique whose purpose is to identify to which of a set of categories a new observation belongs.

In this analysis several methods will be applied, such as learning vector quantization, multinomial regression, penalized multinomial regression, k-nearest neighbors, support vector machine, linear discriminant analysis and quadratic discriminant analysis.

Additionally, several techniques will be applied in order to find the best model performance, among them down-sampling, cross-validation, tuning the model over different parameters and pre-processing.

All of these techniques are based on the caret package, one of the best-known R packages for machine learning.

2) Data description:

The dataset used in this analysis, cars, can be found on the OpenML website: https://www.openml.org/d/1398

cars is an artificial dataset generated with the BNG (Bayesian Network Generated) method, and it is based on the popular dataset of the same name that can be found in the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Automobile

  • First we will load all the necessary packages and the data:
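A minimal sketch of this step, assuming the dataset has been downloaded from OpenML to a local CSV file (the file name below is hypothetical, and the original report may load additional or different packages):

```r
# Packages used throughout the analysis
library(caret)
library(nnet)
library(MASS)
library(psych)
library(corrplot)
library(pROC)

# Read the data exported from https://www.openml.org/d/1398 ("cars_BNG.csv" is a hypothetical local file name)
cars <- read.csv("cars_BNG.csv", stringsAsFactors = FALSE)

print(paste("The dataset cars initially has", nrow(cars), "rows and", ncol(cars), "columns"))
```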

[1] “The dataset cars initially has 1000000 rows and 26 columns”

2.1) Features description:

The dataset basically contains several characteristics of cars, an insurance risk rating and normalized losses in use as compared to other cars.

Below we can find more details about the different features:

  1. normalized.losses: numerical - The relative average loss payment per insured vehicle year; this variable was normalized
  2. make: ordinal with 22 levels - Car brand - “volkswagen”, “volvo”, “nissan”, “porsche”, “honda”, “subaru”, “mazda”, “jaguar”, “dodge”, “mercury”, “toyota”, “chevrolet”, “mercedes-benz”, “peugot”, “mitsubishi”, “plymouth”, “bmw”, “saab”, “isuzu”, “renault”, “alfa-romero” and “audi”
  3. fuel.type: ordinal with 2 levels - Type of fuel - “diesel” and “gas”
  4. aspiration: ordinal with 2 levels - Type of aspiration, i.e. how the engine takes in air - “turbo” and “std”
  5. num.of.doors: ordinal with 2 levels - Number of doors - “four” and “two”
  6. body.style: ordinal with 5 levels - Body style of the car (shape) - “hatchback” “sedan” “wagon” “hardtop” and “convertible”
  7. drive.wheels: ordinal with 3 levels - The kind of drive wheel - “fwd” “rwd” “4wd”
  8. engine.location: ordinal with 2 levels - The location of the engine - “front” “rear”
  9. wheel.base: numerical - It is the distance between the centers of the front and rear wheels
  10. length: numerical - The length of the car
  11. width: numerical - The width of the car
  12. height: numerical - The height of the car
  13. curb.weight: numerical - The total mass of a vehicle with standard equipment and all necessary operating consumables
  14. engine.type: ordinal with 7 levels - The kind of engine - “ohc” “ohcv” “l” “rotor” “ohcf” “dohc” “dohcv”
  15. num.of.cylinders: ordinal with 7 levels - The number of cylinders - “four” “eight” “six” “three” “five” “twelve” “two”
  16. engine.size: numerical - The size of the engine
  17. fuel.system: ordinal with 8 levels - The kind of fuel system - “2bbl” “mfi” “mpfi” “1bbl” “idi” “spdi” “4bbl” “spfi”
  18. bore: numerical - The diameter of each cylinder
  19. stroke: numerical - The distance travelled by the piston in each cycle
  20. compression.ratio: numerical - Measure based on the relative volumes of the combustion chamber and the cylinder
  21. horsepower: numerical - The horse power of the engine
  22. peak.rpm: numerical - Power band of the engine
  23. city.mpg: numerical - It is the related distance traveled by a vehicle and the amount of fuel consumed in city
  24. highway.mpg: numerical - It is the related distance traveled by a vehicle and the amount of fuel consumed in highway
  25. price: numerical - Price of the car
  26. symboling: ordinal with 7 levels - Indicates how safe the car is - -3: highly secure, -2: moderately secure, -1: slightly secure, 0: neutral (neither safe nor risky), +1: slightly risky, +2: moderately risky and +3: highly risky

For the purpose of this research, first of all, the target variable symboling will be transformed to three levels:

  • Secure, corresponding with symboling levels -3, -2 and -1
  • Neutral, neither safe nor risky, corresponding with symboling level 0
  • Risky, corresponding with symboling levels +1, +2 and +3

Afterwards, it will be transformed to a factor, as sketched below.
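A minimal sketch of this recoding, assuming symboling is read as an integer:

```r
# Group the seven symboling levels into three classes and convert to factor
cars$symboling <- ifelse(cars$symboling < 0, "secure",
                  ifelse(cars$symboling == 0, "neutral", "risky"))
cars$symboling <- as.factor(cars$symboling)
```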

Below we can find a barplot of the target variable symboling:

As we can see in the graph, the levels are imbalanced. There are many more cars with symboling level risky than secure, hence we should take this into consideration further on, perhaps by applying some type of resampling.

2.2) Numeric variables:

There are 15 numeric variables.

We can find below a density plot of these variables:

Below we can find the main statistical moments for the numeric variables:

vars n mean sd min max range se
normalized.losses 1 1000000 115.92 35.06 37.63 273.72 236.10 0.04
wheel.base 2 1000000 98.94 6.10 82.48 127.19 44.71 0.01
length 3 1000000 174.89 12.00 134.02 218.77 84.75 0.01
width 4 1000000 65.96 2.14 60.66 75.89 15.23 0.00
height 5 1000000 53.77 2.46 46.98 62.50 15.52 0.00
curb.weight 6 1000000 2562.55 514.15 1520.87 4716.21 3195.34 0.51
engine.size 7 1000000 124.86 40.56 9.89 418.54 408.65 0.04
bore 8 1000000 3.33 0.27 2.55 4.07 1.52 0.00
stroke 9 1000000 3.25 0.32 1.69 4.46 2.77 0.00
compression.ratio 10 1000000 9.99 3.87 -11.81 42.16 53.97 0.00
horsepower 11 1000000 104.43 39.09 38.14 303.92 265.78 0.04
peak.rpm 12 1000000 5132.16 550.69 3406.97 6931.56 3524.59 0.55
city.mpg 13 1000000 24.27 6.32 10.51 54.82 44.31 0.01
highway.mpg 14 1000000 30.11 6.72 11.93 61.70 49.77 0.01
price 15 1000000 13474.80 8024.06 -11856.91 63918.82 75775.73 8.02

2.3) Categorical variables:

There are 11 categorical variables; one of them is the target variable.

We can find below a barplot of these variables:

First of all, we will create cross tables of the target variable symboling against the rest of the ordinal variables. In this way we will be able to see, in percentages, how the different variables are distributed along the target variable:
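A sketch of such a cross table in percentages, using base R only (the original report may use a dedicated helper):

```r
# Column percentages of symboling within each level of a categorical feature
crosstab_pct <- function(data, var) {
  round(100 * prop.table(table(data$symboling, data[[var]]), margin = 2), 1)
}
crosstab_pct(cars, "body.style")
```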

  1. For body.style and drive.wheels:
 #Total   convertible   hardtop   hatchback   sedan   wagon   4wd   fwd   rwd 
 cars$symboling 
   neutral  32.4 34.2 36.1 28.6 31.8 37.8 33.6 31.8 32.1
   risky  54.9 51.1 50.2 58.8 56.4 48.6 52.5 56.4 54.7
   secure  12.7 14.7 13.7 12.5 11.8 13.6 13.9 11.7 13.1
   #Total cases  1000000 95879 103877 289881 378870 131493 231220 419063 349717
  • Most total common cars:
    • For body.style - sedan
    • For drive.wheels - fwd
  • Most secure and risky cars by features:
    • For body.style - convertible is the most secure car
    • For body.style - hatchback is the most risky car
    • For drive.wheels - 4wd is the most secure car
    • For drive.wheels - fwd is the most risky car
  2. For fuel.type, aspiration, num.of.doors and engine.location:
 #Total   diesel   gas   std   turbo   four   two   front   rear 
 cars$symboling 
   neutral  32.4 39.4 30.9 34.2 28.9 38.6 25.4 31.6 37.5
   risky  54.9 50.8 55.8 52.8 58.9 47.8 62.8 55 54.2
   secure  12.7 9.8 13.3 13 12.2 13.6 11.8 13.3 8.2
   #Total cases  1000000 175444 824556 655664 344336 527269 472731 879494 120506
  • Most total common cars:
    • For fuel.type - gas
    • For aspiration - std
    • For num.of.doors - four
    • For engine.location - front
  • Most secure and risky cars by features:
    • For fuel.type - gas is more likely to be risky or secure
    • For fuel.type - diesel is the most neutral car
    • For aspiration - std is the most secure car
    • For aspiration - turbo is the most risky car
    • For num.of.doors - four is the most secure car
    • For num.of.doors - two is the most risky car
    • For engine.location - front is more likely to be risky or secure
    • For engine.location - rear is the most neutral
  3. For engine.type:
 #Total   dohc   dohcv   l   ohc   ohcf   ohcv   rotor 
 cars$symboling 
   neutral  32.4 25.9 26.6 41.2 35.3 31.9 30.3 27.4
   risky  54.9 59.7 57.6 45.7 56.7 50 56.2 56.1
   secure  12.7 14.4 15.9 13.2 8 18.1 13.6 16.5
   #Total cases  1000000 111796 91894 111844 349015 124575 119278 91598
  • Most total common cars:
    • For engine.type - ohc
  • Most secure and risky cars by features:
    • For engine.type - ohcf is the most secure car
    • For engine.type - dohc is the most risky car
  4. For num.of.cylinders:
 #Total   eight   five   four   six   three   twelve   two 
 cars$symboling 
   neutral  32.4 24.2 27 41.9 33 22 26.7 34.6
   risky  54.9 61.6 58.2 47.9 51.2 65.6 60.1 53.8
   secure  12.7 14.2 14.8 10.1 15.8 12.4 13.2 11.6
   #Total cases  1000000 111681 110885 296802 154559 108764 101802 115507
  • Most total common cars:
    • For num.of.cylinders - four
  • Most secure and risky cars by features:
    • For num.of.cylinders - six is the most secure car
    • For num.of.cylinders - three is the most risky car
  5. For fuel.system:
 #Total   1bbl   2bbl   4bbl   idi   mfi   mpfi   spdi   spfi 
 cars$symboling 
   neutral  32.4 37 28.6 38.3 32.7 41 28 35.5 39.9
   risky  54.9 49 57.7 49.5 60.8 46.4 56 52.6 48.4
   secure  12.7 13.9 13.7 12.2 6.6 12.6 15.9 11.9 11.8
   #Total cases  1000000 81366 196893 62690 175407 58088 290196 74199 61161
  • Most total common cars:
    • For fuel.system - mpfi
  • Most secure and risky cars by features:
    • For fuel.system - mpfi is the most secure car
    • For fuel.system - idi is the most risky car
  6. For make:
 #Total   dodge   honda   jaguar   mazda   mercury   nissan   porsche   subaru   toyota   volkswagen   volvo 
 c$symboling 
   neutral  32 30.4 35.3 45.1 31.7 29.4 34.8 27.6 47.7 20.9 24.8 29
   risky  54.8 55.5 54 44.3 56.5 58.7 54 60.1 35.8 64.5 65.6 51.1
   secure  13.2 14.1 10.8 10.6 11.8 11.8 11.2 12.3 16.5 14.6 9.6 19.9
   #Total cases  538539 44742 47643 41011 55339 35865 54910 38312 50981 73356 46977 49403
 #Total   alfa-romero   audi   bmw   chevrolet   isuzu   mercedes-benz   mitsubishi   peugot   plymouth   renault   saab 
 d$symboling 
   neutral  32.8 28.9 29.6 42.6 31 30.4 30 25.4 55.2 26.9 32.1 23.8
   risky  55 59.1 58.1 47.5 56.8 56.4 54.1 63.7 36.2 55.9 53.7 66.5
   secure  12.2 11.9 12.2 10 12.1 13.1 15.9 10.8 8.6 17.1 14.2 9.7
   #Total cases  461461 36683 42728 45119 35423 37596 45049 50748 50414 39119 33884 44698
  • Most total common cars:
    • For make - toyota
  • Most secure and risky cars by features:
    • For make - volvo is the most secure car
    • For make - saab is the most risky car

3) Cleaning the data:

Before applying any machine learning algorithm it is important to have clean data. It can improve the results and also reduce the computational time; in some cases we would not even be able to apply the algorithms without this step.

3.1) Var. transformation:

3.1.1) Encoding and conversion to factors:

All the ordinal variables of the dataset cars are stored as character, except for the target variable symboling. All these variables should be transformed to factors; we will do that with the function as.factor().

Additionally, integer encoding will be applied. This type of label encoding transforms the character features into numbers without losing any information or affecting the final results. In large datasets like this one this step is important in order to use less memory when saving the files. Apart from that, it is important for the models that can only handle numeric variables.

Firstly, the character variables will be converted to factors for the models that are not sensitive to the data distribution; afterwards they will be transformed to numeric for the models that are sensitive to it.

After applying as.factor(), we should check that no character variables remain:
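A minimal sketch of the conversion and the subsequent check:

```r
# Convert every character column to a factor
char_cols <- sapply(cars, is.character)
cars[char_cols] <- lapply(cars[char_cols], as.factor)

# TRUE would mean that some character column is still present
any(sapply(cars, is.character))
```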

[1] FALSE

It is FALSE, hence we can continue further.

3.1.2) Scaling the data:

There are many numeric variables with different scales, and this can be a problem for the algorithms that are based on Euclidean distance.

The caret package gives the possibility of applying preProcess to scale the data when training the model. However, in our case we will scale the data beforehand, because some algorithms are applied from other packages.

The selected scaling method is range, which scales the numeric data into the interval [0, 1] and leaves the factors untouched.
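A sketch of this step with caret's preProcess (the range method only affects numeric columns):

```r
# Scale all numeric variables into [0, 1]
range_pp <- preProcess(cars, method = "range")
cars     <- predict(range_pp, cars)
```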

3.2) Missing values:
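We check whether the dataset contains any missing values; a minimal sketch:

```r
# TRUE would indicate at least one missing value somewhere in the dataset
any(is.na(cars))
```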

[1] FALSE

There are no missing values, so we can continue.

3.3) Unique variables:

Finally, we should check the numeric variables in order to be sure that they are unique.

To consider a variable as unique, a threshold of 500k was selected, i.e. half of the maximum possible number of unique values: 0.5 * 1,000,000 = 500,000.

In order to check this, the function find_if_unique_length was created:
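A sketch of such a helper (the original find_if_unique_length may differ in its details):

```r
# Label a numeric column as UNIQUE when it has more distinct values than the
# chosen threshold (half of the number of rows)
find_if_unique_length <- function(x, threshold = 500000) {
  if (length(unique(x)) > threshold) "UNIQUE" else "NOT UNIQUE"
}
```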

Now we will pass the function to our dataset:
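For example, applied to all numeric columns:

```r
# Apply the uniqueness check to every numeric variable of the dataset
sapply(cars[sapply(cars, is.numeric)], find_if_unique_length)
```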

Results
normalized.losses UNIQUE
wheel.base UNIQUE
length UNIQUE
width UNIQUE
height UNIQUE
curb.weight UNIQUE
engine.size UNIQUE
bore UNIQUE
stroke UNIQUE
compression.ratio UNIQUE
horsepower UNIQUE
peak.rpm UNIQUE
city.mpg UNIQUE
highway.mpg UNIQUE
price UNIQUE

As per the selected threshold of 500k, we can conclude that all the variables are unique, so we are ready to divide our data into two parts.

4) Data partitioning:

4.1) For models that can manage ordinal variables:

We are ready to divide the data in two samples, train and test. The train sample is used to train the model, and the test sample is used to make the prediction and verify the performance of the model.

The data will be divided into 70% training and 30% test, with the help of the createDataPartition function from the caret package:
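A sketch of the stratified 70/30 split (the seed value is arbitrary):

```r
set.seed(123)
train_idx  <- createDataPartition(cars$symboling, p = 0.7, list = FALSE)
cars_train <- cars[train_idx, ]
cars_test  <- cars[-train_idx, ]
```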

4.1.1) Training sample:

4.2) For the models sensitive to data distribution:

Some models are sensitive to the data distribution because they are based on Euclidean distance.

In our dataset there are many factors, and this can be time-consuming for some models, hence all these factors will be transformed to numeric:
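A sketch of this conversion, assuming the numeric-encoded copies are called cars_train1 and cars_test1:

```r
# Integer-encode every factor column except the target variable
cars_train1 <- cars_train
factor_cols <- setdiff(names(cars_train1)[sapply(cars_train1, is.factor)],
                       "symboling")
cars_train1[factor_cols] <- lapply(cars_train1[factor_cols], as.integer)

cars_test1 <- cars_test
cars_test1[factor_cols] <- lapply(cars_test1[factor_cols], as.integer)
```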

5) Features selection:

It will be applied only to the cars_train sample:

5.1) Correlation between features:

We will check the correlation between all the numerical variables, so we will use the list cars_numeric_vars.

Firstly, the correlation will be checked with a corrplot graph:
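A sketch of the correlation plot, assuming cars_numeric_vars holds the names of the numeric columns:

```r
# Correlation matrix of the numeric features and its visualisation
cor_matrix <- cor(cars_train[, cars_numeric_vars])
corrplot(cor_matrix, method = "color", type = "upper")
```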

It seems that the variables are not highly correlated; there are no values close to dark blue or dark red.

To be sure, we will check the maximum and minimum correlations; for the maximum, the value of one should be removed, because each variable is perfectly correlated with itself:
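A minimal sketch of this check, reusing the cor_matrix computed above:

```r
# Ignore the diagonal (each variable correlates perfectly with itself)
off_diag <- cor_matrix
diag(off_diag) <- NA
max(off_diag, na.rm = TRUE)
min(off_diag, na.rm = TRUE)
```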

Maximum.correlation
0.17139
Minimum.correlation
-0.16221

The maximum correlation was 0.17139 and the minimum -0.16221, hence there is no variable to be omitted.

5.2) Relationship with the target variable:

Now we will check whether the categorical variables have a relationship with the target variable symboling, so we will use the list cars_mult_bin_vars.

To check this, we will use ANOVA with the created function result_aov_pvalue:
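A sketch of one possible implementation, assuming symboling is temporarily coded as numeric for the aov() call (the original result_aov_pvalue may differ):

```r
# One-way ANOVA of the (numerically coded) target against a categorical
# feature; returns the decision at the given significance level
result_aov_pvalue <- function(data, var, alpha = 0.05) {
  pval <- summary(aov(as.numeric(data$symboling) ~ data[[var]]))[[1]][["Pr(>F)"]][1]
  if (pval < alpha) {
    paste("Reject H0 -", var, "impact in symboling")
  } else {
    paste("Fail to reject H0 -", var, "has no impact in symboling")
  }
}
```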

We will apply the function to our data:
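For example, assuming cars_mult_bin_vars holds the names of the categorical features:

```r
sapply(cars_mult_bin_vars, function(v) result_aov_pvalue(cars_train, v))
```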

Decision
Reject H0 - make impact in symboling
Reject H0 - fuel.type impact in symboling
Reject H0 - aspiration impact in symboling
Reject H0 - num.of.doors impact in symboling
Reject H0 - body.style impact in symboling
Reject H0 - drive.wheels impact in symboling
Reject H0 - engine.location impact in symboling
Reject H0 - engine.type impact in symboling
Reject H0 - num.of.cylinders impact in symboling
Reject H0 - fuel.system impact in symboling

The null hypothesis means that the variable has no impact on the target variable (symboling).

In all cases we reject this null hypothesis. Hence, considering a 5% level of significance, we can conclude that all the categorical variables have an impact on the target variable.

5.3) Variables with near zero variance:

Variables with zero or near-zero variance can have a negative impact on the final result of the applied algorithm; for this reason it is important to check for them.

The function nearZeroVar from the caret package will be used:
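A minimal sketch of this check:

```r
# Flag variables with zero or near-zero variance
nzv_metrics <- nearZeroVar(cars_train, saveMetrics = TRUE)
rownames(nzv_metrics)[nzv_metrics$zeroVar | nzv_metrics$nzv]
```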

character(0)

As we can see, there is no variable with TRUE for nzv or zeroVar, so this check does not suggest omitting any feature.

It can be concluded that no variables should be omitted based on the near-zero-variance method.

5.4) Linear combinations in the dataset:

We will check whether there are linear combinations among the variables in the dataset cars with the function findLinearCombos:
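A minimal sketch of this check on the numeric features:

```r
# Detect exact linear combinations among the numeric columns
findLinearCombos(cars_train[, cars_numeric_vars])
```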

$linearCombos
list()

$remove
NULL

The $remove element is NULL, so there are no linear combinations in our data.

5.5) Rank features - Learning Vector Quantization:

Learning vector quantization will be applied in order to rank the features by importance in relation to symboling.

This algorithm is applied with down-sampling and without cross-validation, and it can give an idea of which feature could be omitted.
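A sketch of this step with caret, assuming the integer-encoded training data is used; the single tuning combination (size, k) is hypothetical, since with no resampling caret requires exactly one combination:

```r
set.seed(123)
lvq_ctrl  <- trainControl(method = "none", sampling = "down")
lvq_model <- train(x = cars_train1[, setdiff(names(cars_train1), "symboling")],
                   y = cars_train1$symboling,
                   method = "lvq", trControl = lvq_ctrl,
                   tuneGrid = data.frame(size = 100, k = 5))

# Class-wise (ROC-based) variable importance
varImp(lvq_model, scale = FALSE)
```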

Learning Vector Quantization

700001 samples
    25 predictor
     3 classes: ‘neutral’, ‘risky’, ‘secure’

No pre-processing
Resampling: None
Additional sampling using down-sampling

  • Most important features by level:
    • neutral: engine.type
    • risky: num.of.doors
    • secure: engine.type
  • Least important features by level:
    • neutral: compression.ratio
    • risky: compression.ratio
    • secure: compression.ratio

As we can see, the feature engine.type is one of the most important for all the levels.

On the other hand, in case we need to omit some features, we would probably choose compression.ratio, as it is the least important for all the levels.

5.6) Conclusions:

The previous analysis did not suggest omitting any feature, but the variable make has many levels. Consequently, it can be a problem in terms of computational time when training the different models.

For this reason, the less demanding models will be used to test whether omitting the feature make gives much worse results.

6) Application of Classification algorithms:

6.2) MLR - Multinomial Logistic Regression:

The first model to be applied will be multinomial logistic regression. It is a classification method that generalizes logistic regression to multiclass problems, in this case three classes or levels.

6.2.1) Train the data:

The first step in the application of a classification method is to train the model. To do that, the algorithm should be applied to the training sample, in this case cars_train.

The maximum number of iterations will be set to 1000, because the default is 100 and the algorithm may need more iterations than that to converge.
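A sketch of the training call with nnet::multinom (maxit raised as described above):

```r
# Multinomial logistic regression on the training sample
mlr_model <- multinom(symboling ~ ., data = cars_train, maxit = 1000)
```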

Residual.Deviance AIC
1222691 1222955

6.2.3) Results:

First of all, we will see how the predictions are divided by levels, with the help of the function plot_model_fitted; we will also see a table with the distribution:

mlr_multinomial_fitted Freq
neutral 69965
risky 228513
secure 1521

As we can see in the above graph and table, very few cars are predicted as secure; this is one of the consequences of working with imbalanced data.

Most of the cars were predicted as risky, 228,513 cars.

Below we can find a table comparing the predictions (rows) with the actual levels (columns) on the test sample:

neutral risky secure
neutral 39401 22588 7976
risky 57423 141773 29317
secure 242 436 843

In the table we can see that the risky level falls from 228,513 to 141,773. This is because only 141,773 of the cars predicted as risky were actually risky; the rest belong to other levels, 57,423 neutral and 29,317 secure.

Finally, we will apply the function accuracy_multinom in order to see the different measures of accuracy of our model:

accuracy balanced_accuracy balanced_correctly_predicted
60.67254 42.94378 57.92697

Below we can find the ROC measure:
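A sketch of how this measure can be obtained with the pROC package:

```r
# Multiclass AUC (Hand & Till) from the predicted class probabilities
mlr_probs <- predict(mlr_model, newdata = cars_test, type = "probs")
multiclass.roc(cars_test$symboling, mlr_probs)
```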

Multiclass.ROC
0.68041

The accuracy is not really high; moreover, the balanced accuracy is penalized by the fact that the data is imbalanced.

  • We will see whether the accuracy is reduced a lot when the feature make is omitted:
Residual.Deviance AIC
1778282 1778462

As we can see, the AIC and residual deviance increase in comparison with the model that includes make.

accuracy balanced_accuracy balanced_correctly_predicted
59.31553 41.1457 57.39944
Multiclass.ROC
0.6634

As we can see, the accuracy was reduced, but not significantly; the same occurs with the ROC.

We can conclude that removing the feature make does not have a large impact on the final results of the models.

6.3) PMR - Penalized Multinomial Regression:

The caret package provides several models; one of them is penalized multinomial regression, the closest one to the plain multinomial model.

The benefit of applying this algorithm through the caret package is that we will be able to apply resampling and cross-validation. Additionally, as in the previous plain multinomial model, we will use 1,000 as the maximum number of iterations.

For this model we will apply cross-validation with 5 folds:

6.3.1) Train the data:
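A sketch of the training call (the decay grid is an assumption based on the values reported below):

```r
set.seed(123)
pmr_ctrl  <- trainControl(method = "cv", number = 5)
pmr_model <- train(symboling ~ ., data = cars_train,
                   method = "multinom", maxit = 1000, trace = FALSE,
                   trControl = pmr_ctrl,
                   tuneGrid = expand.grid(decay = c(0, 0.0001, 0.1)))
```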

Penalized Multinomial Regression

700001 samples
    25 predictor
     3 classes: ‘neutral’, ‘risky’, ‘secure’

No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 560001, 560002, 560000, 560000, 560001
Resampling results across tuning parameters:

  decay   Accuracy   Kappa
  0.0000  0.6060191  0.2201044
  0.0001  0.6060163  0.2200993
  0.1000  0.6060234  0.2201166

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was decay = 0.1.

6.3.3) Results:

Below we can find the optimal model values:

Residual.Deviance AIC
1222698 1222962

The AIC and residual deviance obtained are almost identical to (in fact slightly higher than) those of the plain multinomial model.

The optimal decay value, the one with the highest cross-validated accuracy, is 0.1.

pmr_multinomial_fitted Freq
neutral 69965
risky 228512
secure 1522

As we can see in the above graph and table, very few cars are predicted as secure; this is one of the consequences of working with imbalanced data.

Below we can find a table comparing the predictions (rows) with the actual levels (columns) on the test sample:

neutral risky secure
neutral 39399 22592 7974
risky 57425 141769 29318
secure 242 436 844

In the table we can see that the risky level falls from 228,512 to 141,769. This is because only 141,769 of the cars predicted as risky were actually risky; the rest belong to other levels, 57,425 neutral and 29,318 secure.

Finally, we will apply the function accuracy_multinom in order to see the different measures of accuracy of our model:

accuracy balanced_accuracy balanced_correctly_predicted
60.67087 42.94316 57.93529
Multiclass.ROC
0.68041

As we can see, the results are essentially the same as for the plain multinomial model; the differences in accuracy and ROC are negligible.

6.4) Test - Choosing the kind of sampling:

As we saw, the multinom model of the caret package is not very demanding in terms of computational time, so we will use this model to test which sampling method should be applied in KNN and SVM.

We will see how the accuracy changes when we apply no sampling versus down-sampling.
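A sketch of the two trainControl configurations compared below; sampling = "down" switches on caret's internal down-sampling:

```r
ctrl_nosamp <- trainControl(method = "cv", number = 2)
ctrl_down   <- trainControl(method = "cv", number = 2, sampling = "down")
```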

6.4.1) No sampling:

Penalized Multinomial Regression

700001 samples
    49 predictor
     3 classes: ‘neutral’, ‘risky’, ‘secure’

No pre-processing
Resampling: Cross-Validated (2 fold)
Summary of sample sizes: 350001, 350000
Resampling results across tuning parameters:

  decay   Accuracy   Kappa
  0.0000  0.5930292  0.1864252
  0.0001  0.5930320  0.1864315
  0.1000  0.5930263  0.1864106

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was decay = 0.0001.

Below we can find the predicted results by class:

accuracy balanced_accuracy balanced_correctly_predicted
59.32253 41.14495 57.35972
Multiclass.ROC
0.66332

6.4.2) Down-sampling:

Penalized Multinomial Regression

700001 samples
    49 predictor
     3 classes: ‘neutral’, ‘risky’, ‘secure’

No pre-processing
Resampling: Cross-Validated (2 fold)
Summary of sample sizes: 350001, 350000
Additional sampling using down-sampling

Resampling results across tuning parameters:

  decay   Accuracy   Kappa
  0.0000  0.4769293  0.2014660
  0.0001  0.4768107  0.2012208
  0.1000  0.4765350  0.2011376

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was decay = 0.

Below we can find the predicted results by class:

As we can see in the above graph, the predicted classes are no longer imbalanced; down-sampling deals with this problem.

accuracy balanced_accuracy balanced_correctly_predicted
47.56482 48.51109 46.26112
Multiclass.ROC
0.66779

6.4.4) Conclusions of the test:

The above results are typical when working with imbalanced data. In this case one cannot trust the accuracy, because the predictor tends to predict the class with the most observations.

We can see that the accuracy is greater without sampling, but the ROC measure is greater with down-sampling, hence the model with down-sampling is preferable.

Applying down-sampling is very efficient: we obtain a better predictor and also reduce the computational time, which is important for big samples like this one.

Applying ROSE or SMOTE sampling would probably give even better results, but the computational time would likely increase drastically.

On the other hand, it is not productive to use the feature make, which contains 22 levels, because the computational time increases a lot while the difference in performance is probably small, as we saw before with the multinomial model.

We can conclude that down-sampling will be applied in the next algorithms, and the feature make will be omitted.

6.5) KNN - K-Nearest Neighbors:

The k-nearest neighbors algorithm is one of the best-known classification methods. The prediction for a new observation is based on the k closest training samples in the feature space.

It uses Euclidean distance to measure the distance between neighbors; for this reason it is important for this algorithm to have scaled data. In our case the data was scaled beforehand with the range method, into the interval [0, 1].

The algorithm will be applied with cross-validation, 3 folds, and down-sampling.

Additionally, we will use tuneGrid in order to customize more parameters; in this case several values of k will be tested, 5, 33, 61, 89, 117 and 145 (i.e. seq(5, 145, 28)). The algorithm will return the final model with the greatest accuracy.

6.5.1) Train the data:
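A sketch of the training call, assuming the integer-encoded training set without make (cars_train1); the k grid follows the values reported below:

```r
set.seed(123)
knn_ctrl  <- trainControl(method = "cv", number = 3, sampling = "down")
knn_model <- train(symboling ~ .,
                   data = cars_train1[, setdiff(names(cars_train1), "make")],
                   method = "knn", trControl = knn_ctrl,
                   tuneGrid = expand.grid(k = seq(5, 145, 28)))
```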

k-Nearest Neighbors

700001 samples
    24 predictor
     3 classes: ‘neutral’, ‘risky’, ‘secure’

No pre-processing
Resampling: Cross-Validated (3 fold)
Summary of sample sizes: 466668, 466667, 466667
Additional sampling using down-sampling

Resampling results across tuning parameters:

  k    Accuracy   Kappa
    5  0.5119107  0.2495125
   33  0.5291850  0.2834744
   61  0.5237907  0.2793799
   89  0.5207107  0.2765361
  117  0.5149907  0.2712024
  145  0.5114621  0.2666035

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was k = 33.

6.5.3) Results:

Below we can find the optimal k value for the given sequence:

[1] 33

We can plot the results of the cross-validation:

knn_model_fitted Freq
neutral 69699
risky 192140
secure 38160

Most of the cars were predicted as risky, 192,140 cars.

Below we can find a table comparing the predictions (rows) with the actual levels (columns) on the test sample:

neutral risky secure
neutral 23643 38309 7747
risky 62051 105717 24372
secure 11372 20771 6017

Finally, we will apply the function accuracy_multinom in order to see the different measures of accuracy of our model:

accuracy balanced_accuracy balanced_correctly_predicted
45.12582 34.76174 34.9034
Multiclass.ROC
0.53859

The accuracy and ROC measures are much worse than in the previous cases.

Probably this algorithm is not the most adequate one when the dependent variable has multiple classes.

  • The support vector machine is a more complex algorithm. There are several types; the linear one creates a line or hyperplane which separates the data into classes. Other kinds of SVM, such as the polynomial and radial ones, are even more demanding in terms of computational time, hence it is not a good idea to apply them to a large dataset.

During this analysis SVM was applied; however, the model had not finished training after two days of processing, hence the process was stopped and the results will not be presented here.

6.6) LDA - Linear Discriminant Analysis:

Now Linear discriminant analysis will be applied. This algorithm is popular when dealing with a multiclass dependent variable, like our variable symboling.

It finds a linear combination of features that characterizes or separates two or more classes of objects.

In this case it will be applied with repeated cross-validation, 10 folds repeated 3 times, and down-sampling. As it is not based on Euclidean distance, it can be applied to cars_train instead of cars_train1. Additionally, the feature make will not be omitted, because this model handles ordinal variables with many levels very well.

6.6.1) Train the data
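A sketch of the training call (repeated 10-fold cross-validation with down-sampling):

```r
set.seed(123)
lda_ctrl  <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                          sampling = "down")
lda_model <- train(symboling ~ ., data = cars_train,
                   method = "lda", trControl = lda_ctrl)
```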

Linear Discriminant Analysis

700001 samples
    25 predictor
     3 classes: ‘neutral’, ‘risky’, ‘secure’

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 630001, 630002, 630001, 630000, 630000, 630001, …
Additional sampling using down-sampling

Resampling results:

  Accuracy   Kappa
  0.4973712  0.225034

6.6.3) Results

lda_model_fitted Freq
neutral 98889
risky 113509
secure 87601

As we can see in the above graph and table, the predicted levels are now much more balanced, thanks to the down-sampling.

The largest predicted class is risky, with 113,509 cars.

Below we can find a table comparing the predictions (rows) with the actual levels (columns) on the test sample:

neutral risky secure
neutral 49984 39584 9321
risky 23677 80268 9564
secure 23405 44945 19251

Finally, we will apply the function accuracy_multinom in order to see the different measures of accuracy of our model:

accuracy balanced_accuracy balanced_correctly_predicted
49.8345 50.22731 47.74548
Multiclass.ROC
0.68485

The ROC obtained is similar to that of the multinomial logistic regression.

6.7) QDA - Quadratic Discriminant Analysis:

Similarly to the case of SVM, there are related algorithms that use another kind of combination rather than a linear one, for example a quadratic one; quadratic discriminant analysis will be applied here.

In this case it will be applied with repeated cross-validation, 10 folds repeated 3 times, and down-sampling. As it is not based on Euclidean distance, it can be applied to cars_train instead of cars_train1. Additionally, the feature make will not be omitted, because this model handles ordinal variables with many levels well.

6.7.1) Train the data
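A sketch of the training call (same resampling setup as for LDA, with method = "qda"):

```r
set.seed(123)
qda_ctrl  <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                          sampling = "down")
qda_model <- train(symboling ~ ., data = cars_train,
                   method = "qda", trControl = qda_ctrl)
```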

Quadratic Discriminant Analysis

700001 samples
    25 predictor
     3 classes: ‘neutral’, ‘risky’, ‘secure’

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 630001, 630002, 630001, 630000, 630000, 630001, …
Additional sampling using down-sampling

Resampling results:

  Accuracy   Kappa
  0.5382359  0.269335

6.7.3) Results

qda_model_fitted Freq
neutral 94351
risky 127163
secure 78485

As we can see in the above graph and table, the predicted levels are now much more balanced, thanks to the down-sampling.

The largest predicted class is risky, with 127,163 cars.

Below we can find a table comparing the predictions (rows) with the actual levels (columns) on the test sample:

neutral risky secure
neutral 51702 34740 7909
risky 25727 90720 10716
secure 19637 39337 19511

Finally, we will apply the function accuracy_multinom in order to see the different measures of accuracy of our model:

accuracy balanced_accuracy balanced_correctly_predicted
53.97785 53.15866 50.33285
Multiclass.ROC
0.71601

The ROC measure is the largest one so far, so this is one of the best models, since we cannot rely on the accuracy alone.

7) Summary and conclusions:

7.1) Summary:

Model name Accuracy Bal. accuracy Bal. correct. accuracy ROC Var. select. CV Fold Resampling
MLR 60.67 42.94 57.93 0.68 No No 0 No
MLR -make 59.32 41.15 57.40 0.66 Yes No 0 No
PMR 60.67 42.94 57.94 0.68 No Yes 5 No
PMR no sam 59.32 41.14 57.36 0.66 No Yes 2 No
PMR sam 47.56 48.51 46.26 0.67 No Yes 2 Down
KNN 45.13 34.76 34.90 0.54 Yes Yes 3 Down
LDA 49.83 50.23 47.75 0.68 No Yes 10 Down
QDA 53.98 53.16 50.33 0.72 No Yes 10 Down

7.2) Conclusions:

  • For the dataset cars the best classification model is quadratic discriminant analysis with repeated cross-validation and down-sampling
  • With imbalanced data one cannot trust the accuracy to compare models, because they tend to predict the most common class; hence other kinds of measures are recommended, in our case ROC
  • K-nearest neighbors and support vector machines are models that do not work very well with a multiclass dependent variable; they would be better suited to a binomial dependent variable
  • Before applying demanding models, it is convenient to analyze, ceteris paribus, what is expected to happen; in our case we analyzed what happens when down-sampling is applied
  • Computational time matters and should be taken into consideration when applying machine learning models, hence data preparation is really important in these cases. In certain instances it is recommended to parallelize the process when the algorithm allows it (the doParallel and doMC packages can be used)
  • When dealing with big data and imbalanced classes it is really productive to use down-sampling, both in terms of computational time and model performance
  • When performing a machine learning analysis it is important to have enough memory in the system to be able to save the results