Objectives
Perform an analysis of the dataset used in Homework #2 using the SVM algorithm.
Compare the results with those from the previous homework.
Answer questions, such as:
i. Which algorithm is recommended to get more accurate results?
ii. Is it better for classification or regression scenarios?
iii. Do you agree with the recommendations?
iv. Why?
HW Modification
In HW2, I used two different decision trees and an ensemble technique. After reviewing that work, I think running PCA before modeling would save some computation and also give me a chance to review PCA.
Therefore, in HW3 I am dropping the second decision tree model created in HW2 and comparing the first decision tree, random forest, and SVM. In addition, the data used here is based on the feature selection produced by PCA, so the data has fewer dimensions, and all further analysis is based on the selected features.
There is no hard-coded visualization this time because I did that in HW2 and have since found a function that handles it in one step. In addition, following the potential improvement I noted at the end of HW2, the hyperparameters are tuned with cross-validation before modeling, as we discussed. As a result, each model created here should be considered “optimal” within its tuning grid.
Data Exploration
The same data is loaded, and the function ggpairs takes care of the visualization for both numerical and categorical data. The downside is that with many features the individual panels become too small to show much detail.
The plot shows the pairwise relationships between features as well as each feature's own distribution, with the target value color coded. Take the plot in the top-left corner as an example: a decision boundary is very hard to draw because the churn and non-churn distributions overlap with very little difference.
The same is true of several other density plots, and the points in the scatter plots are mixed together. As a result, features that do not seem to provide significant information for prediction can be dropped from the data.
Feature Selection
Because PCA only works on numeric data, the categorical variables are one-hot encoded so that they can take part in the PCA procedure. The target variable is excluded temporarily.
## [1] "dimension before one-hot encoding: (10000, 11)"
## [1] "dimension after one-hot encoding: (10000, 13)"
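The encoding step can be illustrated with a minimal sketch in plain Python (the report itself was produced in R). The helper `one_hot` and the toy rows are hypothetical and only show the idea: each categorical column is replaced by one 0/1 indicator column per level.

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 indicator column per level."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every column except the one being encoded.
        new_row = {key: value for key, value in row.items() if key != column}
        for level in levels:
            new_row[f"{column}.{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded


# Toy rows with hypothetical values (not taken from the data set).
rows = [
    {"credit_score": 600, "country": "France", "gender": "Female"},
    {"credit_score": 650, "country": "Spain", "gender": "Male"},
    {"credit_score": 700, "country": "Germany", "gender": "Female"},
]

encoded = one_hot(one_hot(rows, "country"), "gender")
print(list(encoded[0].keys()))
```

Note that the indicator columns of one variable always sum to 1, so a full one-hot encoding introduces exact linear dependencies among the columns. This is consistent with the last two principal components in the PCA summary having a standard deviation of essentially zero.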
From the PCA result, we see that no single component explains much of the total variance, so there is a trade-off. I would like to capture as much variance as possible while keeping components whose eigenvalue is greater than or equal to 1. PC7 and PC8 both have eigenvalues near 1 (the squares of their standard deviations, about 1.000 and 0.997), but the cumulative variance differs: about 74% through PC7 versus 81% through PC8. So I keep enough components to account for 81% of the total variance.
I select the first 8 components and plot the feature contributions. Interestingly, age, one of the most important features in DT1 from HW2, turns out not to be important here. All features above the red dashed line are selected for modeling, and the target variable is added back.
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.4169 1.3654 1.2263 1.05529 1.04111 1.00734 1.00019
## Proportion of Variance 0.1544 0.1434 0.1157 0.08566 0.08338 0.07806 0.07695
## Cumulative Proportion  0.1544 0.2978 0.4135 0.49918 0.58256 0.66062 0.73757
##                            PC8     PC9    PC10    PC11      PC12      PC13
## Standard deviation     0.99838 0.98563 0.94887 0.73689 3.588e-15 2.926e-15
## Proportion of Variance 0.07667 0.07473 0.06926 0.04177 0.000e+00 0.000e+00
## Cumulative Proportion  0.81424 0.88897 0.95823 1.00000 1.000e+00 1.000e+00
##   gender.Female gender.Male country.France country.Spain credit_score
## 1             1           0              1             0          619
## 2             1           0              0             1          608
## 3             1           0              1             0          502
## 4             1           0              1             0          699
## 5             1           0              0             1          850
## 6             0           1              0             1          645
##   country.Germany products_number churn
## 1               0               1     1
## 2               0               1     0
## 3               0               3     1
## 4               0               2     0
## 5               0               2     0
## 6               0               2     1
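The component selection in this section can be checked by hand from the standard deviations in the PCA summary. The sketch below is plain Python (the report itself was produced in R). It squares each standard deviation to get the eigenvalue, then applies the "eigenvalue at least 1" rule and a cumulative-variance target. The 80% target and the helper `components_for` are assumptions made for illustration.

```python
# Standard deviations of PC1..PC13 as printed in the PCA summary.
sdev = [1.4169, 1.3654, 1.2263, 1.05529, 1.04111, 1.00734, 1.00019,
        0.99838, 0.98563, 0.94887, 0.73689, 3.588e-15, 2.926e-15]

# The eigenvalue of each component is the square of its standard deviation.
eigenvalues = [s ** 2 for s in sdev]
total = sum(eigenvalues)

# Proportion of variance and its running sum.
proportion = [ev / total for ev in eigenvalues]
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

# Rule 1: keep components whose eigenvalue is at least 1.
n_kaiser = sum(1 for ev in eigenvalues if ev >= 1.0)


def components_for(target, cumulative):
    """Smallest number of components whose cumulative variance reaches target."""
    for k, c in enumerate(cumulative, start=1):
        if c >= target:
            return k
    return len(cumulative)


# Rule 2: keep enough components to reach an assumed 80% variance target.
n_target = components_for(0.80, cumulative)

print("eigenvalue >= 1:", n_kaiser, "components")
print("80% variance:   ", n_target, "components")
print("cumulative variance through PC8:", round(cumulative[7], 4))
```

Under these rules the strict eigenvalue cutoff stops at PC7, because PC8's eigenvalue is just under 1, while the variance target requires PC8. That gap is the trade-off described in the text.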
Preprocessing
The data is split 3:1, with the training set taking three parts of the original data and the test set one part. The class ratio of the target variable, shown in the first result below, is imbalanced, which can lead to an unstable tree structure, so it needs to be addressed. I tried up-sampling in the previous HW, so down-sampling is used here instead. A second reason is that SVM training becomes slow as the data set grows. The result of down-sampling is shown in the middle box, and the last box is the dimension of the training set.
##
##    0    1
## 7963 2037
## train.y
##    0    1
## 1527 1527
## [1] 3054    7
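The split and down-sampling steps can be sketched as follows in plain Python (the report itself was produced in R). The functions `stratified_split` and `down_sample`, the seed, and the rounding rule (truncating each class to 75%) are assumptions for illustration; the R tooling used for the report may round or sample differently. Only the class counts 7963 and 2037 come from the output above.

```python
import random
from collections import Counter


def stratified_split(labels, train_frac, rng):
    """Split indices into train/test while keeping the class ratio."""
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(train_frac * len(idxs))  # assumed rounding: truncate
        train_idx += idxs[:cut]
        test_idx += idxs[cut:]
    return train_idx, test_idx


def down_sample(indices, labels, rng):
    """Randomly drop rows so every class is as small as the rarest class."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    n_min = min(len(idxs) for idxs in by_class.values())
    kept = []
    for idxs in by_class.values():
        kept += rng.sample(idxs, n_min)
    return kept


rng = random.Random(42)
labels = [0] * 7963 + [1] * 2037  # class counts from the first output box

train_idx, test_idx = stratified_split(labels, 0.75, rng)
balanced_idx = down_sample(train_idx, labels, rng)

counts = Counter(labels[i] for i in balanced_idx)
print(dict(counts), len(balanced_idx), len(test_idx))
```

Down-sampling is applied to the training indices only, so the test set keeps the original class ratio and still reflects the imbalance the models will face.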
Model Training
DT
## CART
##
## 3054 samples
## 7 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2749, 2749, 2748, 2748, 2748, 2749, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.0009823183 0.6810772 0.3621451
## 0.0011460380 0.6817383 0.3634848
## 0.0016371971 0.6905704 0.3811684
## 0.0019646365 0.6879528 0.3759149
## 0.0026195154 0.6928569 0.3856985
## 0.0028378083 0.6928569 0.3856985
## 0.0032743942 0.6915433 0.3830621
## 0.0056756167 0.6876196 0.3752413
## 0.0150622135 0.6742016 0.3484254
## 0.1784544859 0.6005447 0.2016187
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.002837808.
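The 10-fold resampling reported in the output above can be sketched in plain Python (the report itself was produced in R). The function `k_fold_indices` and the seed are hypothetical. The sketch only shows how 3054 rows are divided into 10 folds and why the training sizes alternate between two values; the R tooling may also balance classes across folds, which this sketch does not do.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Return k (training, held_out) index pairs covering all samples."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds, round-robin.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        held_out = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((training, held_out))
    return splits


splits = k_fold_indices(3054, 10)
train_sizes = [len(training) for training, _ in splits]
print(train_sizes)
```

Because 3054 is not a multiple of 10, four folds hold 306 rows and six hold 305, so each model is fit on either 2748 or 2749 rows. For every candidate hyperparameter value, accuracy is averaged over the 10 held-out folds, and the value with the largest average is kept.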
RF
## Random Forest
##
## 3054 samples
## 7 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2749, 2748, 2748, 2748, 2748, 2749, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.6833411 0.3667275
## 4 0.6928473 0.3857343
## 7 0.6283394 0.2566930
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 4.
SVM
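The commonly used SVM kernels are Linear, Radial and Polynomial, so all three versions of SVM are tuned. The best tuned radial and polynomial models reach nearly identical cross-validated accuracy (0.6935 and 0.6938), well above the linear kernel (0.5871). The SVM with the radial kernel is selected.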
## Support Vector Machines with Radial Basis Function Kernel
##
## 3054 samples
## 7 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2750, 2748, 2749, 2748, 2749, 2749, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.6908983 0.3817969
## 0.50 0.6935180 0.3870367
## 1.00 0.6931923 0.3863786
##
## Tuning parameter 'sigma' was held constant at a value of 0.1845787
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.1845787 and C = 0.5.
## Support Vector Machines with Linear Kernel
##
## 3054 samples
## 7 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2749, 2748, 2750, 2748, 2749, 2748, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 1.000000 0.5870901 0.1741707
## 1.052632 0.5870901 0.1741707
## 1.105263 0.5870901 0.1741707
## 1.157895 0.5870901 0.1741707
## 1.210526 0.5870901 0.1741707
## 1.263158 0.5870901 0.1741707
## 1.315789 0.5870901 0.1741707
## 1.368421 0.5870901 0.1741707
## 1.421053 0.5870901 0.1741707
## 1.473684 0.5870901 0.1741707
## 1.526316 0.5870901 0.1741707
## 1.578947 0.5870901 0.1741707
## 1.631579 0.5870901 0.1741707
## 1.684211 0.5870901 0.1741707
## 1.736842 0.5870901 0.1741707
## 1.789474 0.5870901 0.1741707
## 1.842105 0.5870901 0.1741707
## 1.894737 0.5870901 0.1741707
## 1.947368 0.5870901 0.1741707
## 2.000000 0.5870901 0.1741707
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 1.
## Support Vector Machines with Polynomial Kernel
##
## 3054 samples
## 7 predictor
## 2 classes: '0', '1'
##
## Pre-processing: centered (7), scaled (7)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 2748, 2749, 2749, 2749, 2748, 2748, ...
## Resampling results across tuning parameters:
##
## degree scale C Accuracy Kappa
## 1 0.001 0.25 0.5356166 0.07320046
## 1 0.001 0.50 0.5857795 0.17180649
## 1 0.001 1.00 0.5870845 0.17417385
## 1 0.010 0.25 0.5870845 0.17417385
## 1 0.010 0.50 0.5870845 0.17417385
## 1 0.010 1.00 0.5870845 0.17417385
## 1 0.100 0.25 0.5870845 0.17417385
## 1 0.100 0.50 0.5870845 0.17417385
## 1 0.100 1.00 0.5870845 0.17417385
## 2 0.001 0.25 0.5851248 0.17049967
## 2 0.001 0.50 0.5870845 0.17417385
## 2 0.001 1.00 0.5870845 0.17417385
## 2 0.010 0.25 0.5870845 0.17417385
## 2 0.010 0.50 0.5870845 0.17417385
## 2 0.010 1.00 0.5969067 0.19383588
## 2 0.100 0.25 0.6761459 0.35242772
## 2 0.100 0.50 0.6764738 0.35300905
## 2 0.100 1.00 0.6764738 0.35300905
## 3 0.001 0.25 0.5854441 0.17091718
## 3 0.001 0.50 0.5870845 0.17417385
## 3 0.001 1.00 0.5870845 0.17417385
## 3 0.010 0.25 0.5923219 0.18465292
## 3 0.010 0.50 0.5975603 0.19514307
## 3 0.010 1.00 0.5991953 0.19841062
## 3 0.100 0.25 0.6931769 0.38641923
## 3 0.100 0.50 0.6935037 0.38707283
## 3 0.100 1.00 0.6938305 0.38772642
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were degree = 3, scale = 0.1 and C = 1.
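The three kernels tuned above differ only in how they measure similarity between two observations. The sketch below is plain Python (the report itself was produced in R) and assumes the common parameterization suggested by the tuning parameter names `sigma`, `degree` and `scale`. The `offset` parameter and its value of 1 are assumptions, and the two input vectors are hypothetical.

```python
import math


def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, degree=3, scale=0.1, offset=1.0):
    # Assumed form: (scale * <x, y> + offset) ** degree
    return (scale * dot(x, y) + offset) ** degree


def rbf_kernel(x, y, sigma=0.1845787):
    # Assumed form: exp(-sigma * ||x - y||^2)
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * squared_distance)


# Two hypothetical, already centered and scaled observations.
x = [1.0, 0.0, -1.0]
y = [0.5, 1.0, 0.0]

print("linear:    ", linear_kernel(x, y))
print("polynomial:", polynomial_kernel(x, y))
print("radial:    ", rbf_kernel(x, y))
```

With `degree = 1`, the polynomial kernel is just a rescaled linear kernel, which matches the tuning table: every degree-1 row stays at or below the linear kernel's accuracy of about 0.587, and accuracy only improves once `degree` and `scale` are large enough to bend the boundary.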
Model Performance
##       model accuracy sensitivity specificity precision recall
## 1        DT   0.6901      0.6931      0.6784    0.8938 0.6931
## 2        RF   0.7033      0.7082      0.6843    0.8975 0.7082
## 3 svmRadial   0.7033      0.7077      0.6863     0.898 0.7077
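The columns of this table all derive from a 2x2 confusion matrix. The sketch below is plain Python (the report itself was produced in R) and shows the formulas. The report does not print its confusion matrices, so the four counts are an assumption: they were chosen to be consistent with the DT row when class 0 is treated as the positive class, and should not be read as reported results.

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard metrics from the four cells of a binary confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),       # same definition as sensitivity
    }


# ASSUMED counts (not printed in the report), with class 0 as "positive".
dt = classification_metrics(tp=1380, fn=611, fp=164, tn=346)

for name, value in dt.items():
    print(f"{name:12s}{value:.4f}")
```

Recall and sensitivity share one formula, which is why those two columns are identical in every row. Under the assumption that class 0 is the positive class, the high precision mostly reflects that class 0 is the majority in the test set, so it says little about how well churners (class 1) are identified; specificity is the more informative column for that.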
Discussion
- Which algorithm is recommended to get more accurate results?
From the table above, the accuracy scores are close: random forest and the radial SVM tie at 0.7033, and the decision tree follows at 0.6901.
- Is it better for classification or regression scenarios?
There are two main types of decision trees: classification trees, the more common type, and regression trees. Which one applies depends on the scenario: a classification tree predicts a categorical target (such as churn here), while a regression tree predicts a continuous one.
- Do you agree with the recommendations?
No, I do not agree with the recommendations.
- Why?
I think that ensemble models and support vector machines should work better in general. Random forest averages the outcome of multiple trees, and its accuracy is very close to the best result here. Therefore, I would say that random forest generalizes better than a single decision tree: the particular subset of data largely determines the shape of a decision tree, so if the subset changes, the tree structure, and with it the accuracy, may change as well. In addition, SVM is at its core a linear, maximum-margin classifier (a regression variant, support vector regression, also exists), and it relies on a kernel to handle data that is not linearly separable. Therefore, if the data cannot be separated by a hyperplane even with a suitable kernel, SVM won’t do a good job. In conclusion, a strong decision tree result could be just luck, where the sampled subset happens to define a good tree structure.