As we can see, the main problem for this project is class imbalance: nearly 70% of our data was scored as sentiment level 5. To address this, I tried several approaches, including SMOTE, ROSE, down-sampling, and over-sampling. In the end, the best results came from performing a PCA on both datasets and fitting the following models.
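To make two of the steps mentioned above concrete (down-sampling and PCA), here is a minimal sketch using caret on synthetic data. The data frame, the variable names, and the choice of 25 components (borrowed from the 25 predictors reported in the model summaries) are assumptions for illustration, not the original code of this project.

```r
library(caret)

set.seed(42)

# Synthetic stand-in for the real data (assumption): 600 rows, 40 numeric
# features, and a 6-level sentiment factor dominated by level "5".
n <- 600
features <- as.data.frame(matrix(rnorm(n * 40), nrow = n))
sentiment <- factor(
  sample(0:5, n, replace = TRUE, prob = c(0.13, 0.03, 0.04, 0.09, 0.11, 0.60)),
  levels = 0:5
)

# Class shares before resampling: level "5" dominates.
print(round(prop.table(table(sentiment)), 3))

# Down-sampling: every class is reduced to the size of the rarest one.
balanced <- downSample(x = features, y = sentiment, yname = "sentiment")
print(table(balanced$sentiment))

# PCA: project the predictors onto 25 principal components.
# caret centers and scales the data whenever "pca" is requested.
pca <- preProcess(features, method = c("center", "scale", "pca"), pcaComp = 25)
components <- predict(pca, features)
print(dim(components))
```

`upSample()` has the same interface as `downSample()` if over-sampling is preferred. SMOTE and ROSE live in separate packages and are left out of this sketch.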
## Model used to predict iPhone sentiment
## Random Forest
##
## 12973 samples
## 25 predictor
## 6 classes: '0', '1', '2', '3', '4', '5'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10377, 10378, 10380, 10377, 10380
## Resampling results across tuning parameters:
##
##   mtry  Accuracy   Kappa
##    2    0.7627377  0.5438048
##   13    0.7621212  0.5430289
##   25    0.7615051  0.5423955
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
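A summary like the one above is what caret prints for a random forest tuned with 5-fold cross-validation. The sketch below shows one way to produce it. The synthetic data and the object names (`x`, `y`, `ctrl`, `fit`) are assumptions, and the real predictors would be the 25 columns used in this project.

```r
library(caret)

set.seed(42)

# Synthetic stand-in (assumption): 25 numeric predictors, 6-level outcome.
n <- 300
x <- as.data.frame(matrix(rnorm(n * 25), nrow = n))
y <- factor(sample(0:5, n, replace = TRUE), levels = 0:5)

# 5-fold cross-validation, as in "Resampling: Cross-Validated (5 fold)".
ctrl <- trainControl(method = "cv", number = 5)

# method = "rf" requires the randomForest package. With 25 predictors,
# tuneLength = 3 makes caret evaluate mtry = 2, 13 and 25.
fit <- train(x = x, y = y, method = "rf", trControl = ctrl, tuneLength = 3)

print(fit$results[, c("mtry", "Accuracy", "Kappa")])
print(fit$bestTune)
```

In the reported results the three `mtry` values differ by only about 0.001 in accuracy, so the selected value is close to a tie rather than a clear winner.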
## Confusion matrix of the iPhone model
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    0    1    2    3    4    5
##          0 1392    1    0    0    6    7
##          1    0   41    0    0    1    1
##          2    3    2  136    1    1    7
##          3    1    0    0  837    0    4
##          4    1    0    0    0  648   13
##          5  565  346  318  350  783 7508
##
## Overall Statistics
##
## Accuracy : 0.8142
## 95% CI : (0.8074, 0.8208)
## No Information Rate : 0.5812
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6489
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.7095 0.105128  0.29956  0.70455  0.45031   0.9958
## Specificity            0.9987 0.999841  0.99888  0.99958  0.99879   0.5652
## Pos Pred Value         0.9900 0.953488  0.90667  0.99406  0.97885   0.7607
## Neg Pred Value         0.9507 0.973009  0.97520  0.97107  0.93575   0.9897
## Prevalence             0.1512 0.030062  0.03500  0.09157  0.11092   0.5812
## Detection Rate         0.1073 0.003160  0.01048  0.06452  0.04995   0.5787
## Detection Prevalence   0.1084 0.003315  0.01156  0.06490  0.05103   0.7608
## Balanced Accuracy      0.8541 0.552485  0.64922  0.85206  0.72455   0.7805
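The overall statistics of the iPhone model can be recomputed by hand from its confusion matrix, which helps to see where the numbers come from. The base-R sketch below copies the counts from the output above; the variable names are mine. It also shows that the matrix sums to 12973, the number of samples the model was trained on, and that about 98% of the misclassified samples were predicted as class 5.

```r
# iPhone confusion matrix copied from the output above
# (rows = prediction, columns = reference).
cm <- matrix(
  c(1392,   1,   0,   0,   6,    7,
       0,  41,   0,   0,   1,    1,
       3,   2, 136,   1,   1,    7,
       1,   0,   0, 837,   0,    4,
       1,   0,   0,   0, 648,   13,
     565, 346, 318, 350, 783, 7508),
  nrow = 6, byrow = TRUE,
  dimnames = list(Prediction = as.character(0:5),
                  Reference  = as.character(0:5))
)

n <- sum(cm)                      # total number of scored samples

# Accuracy: share of samples on the diagonal.
accuracy <- sum(diag(cm)) / n

# No Information Rate: accuracy of always predicting the largest class.
nir <- max(colSums(cm)) / n

# Cohen's kappa: agreement beyond what the marginals alone would give.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa <- (accuracy - expected) / (1 - expected)

# Share of all errors that were predicted as class "5".
errors <- n - sum(diag(cm))
share_as_5 <- (sum(cm["5", ]) - cm["5", "5"]) / errors

print(round(c(Accuracy = accuracy, NIR = nir, Kappa = kappa,
              ErrorsPredictedAs5 = share_as_5), 4))
```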
## Model used to predict Galaxy sentiment
## Random Forest
##
## 12911 samples
## 25 predictor
## 6 classes: '0', '1', '2', '3', '4', '5'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10330, 10329, 10329, 10327, 10329
## Resampling results across tuning parameters:
##
##   mtry  Accuracy   Kappa
##    2    0.7564095  0.5148671
##   13    0.7565644  0.5160284
##   25    0.7562549  0.5161070
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 13.
## Confusion matrix of the Galaxy model
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    0    1    2    3    4    5
##          0 1259    4    4    6    9   52
##          1    0   40    0    0    1    0
##          2    3    1  104    2    3    6
##          3    3    3    1  776    9   48
##          4    8    1    0    3  583   18
##          5  423  333  341  388  812 7667
##
## Overall Statistics
##
## Accuracy : 0.8078
## 95% CI : (0.8009, 0.8145)
## No Information Rate : 0.6034
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6225
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.74233 0.104712 0.231111  0.66043  0.41143   0.9841
## Specificity           0.99331 0.999920 0.998796  0.99455  0.99739   0.5514
## Pos Pred Value        0.94378 0.975610 0.873950  0.92381  0.95106   0.7695
## Neg Pred Value        0.96225 0.973427 0.972952  0.96695  0.93218   0.9579
## Prevalence            0.13136 0.029587 0.034854  0.09101  0.10975   0.6034
## Detection Rate        0.09751 0.003098 0.008055  0.06010  0.04516   0.5938
## Detection Prevalence  0.10332 0.003176 0.009217  0.06506  0.04748   0.7717
## Balanced Accuracy     0.86782 0.552316 0.614954  0.82749  0.70441   0.7677
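The per-class statistics follow from a one-vs-rest reading of the confusion matrix. As a sketch, the base-R code below rebuilds sensitivity, specificity, and balanced accuracy for the Galaxy model from the counts printed above; the variable names are mine. It makes the effect of the imbalance visible: class 5 is almost always recovered, while class 1 is found only about 10% of the time.

```r
# Galaxy confusion matrix copied from the output above
# (rows = prediction, columns = reference).
cm <- matrix(
  c(1259,   4,   4,   6,   9,   52,
       0,  40,   0,   0,   1,    0,
       3,   1, 104,   2,   3,    6,
       3,   3,   1, 776,   9,   48,
       8,   1,   0,   3, 583,   18,
     423, 333, 341, 388, 812, 7667),
  nrow = 6, byrow = TRUE,
  dimnames = list(Prediction = as.character(0:5),
                  Reference  = as.character(0:5))
)

# One-vs-rest counts for each class.
tp <- diag(cm)               # predicted k and truly k
fn <- colSums(cm) - tp       # truly k but predicted as something else
fp <- rowSums(cm) - tp       # predicted k but truly something else
tn <- sum(cm) - tp - fn - fp # everything else

sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(rbind(Sensitivity = sensitivity,
                  Specificity = specificity,
                  `Balanced Accuracy` = balanced_accuracy), 4))
```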