Project Introduction

In this project I try to find a relation between how people feel about iPhone and Samsung Galaxy phones and what they write about them on different web pages. I used different word combinations (keys) to score a set of variables for each phone on every web page.

## These are the training set distributions for iphonesentiment

## These are the training set distributions for galaxysentiment

Initial Analysis

As the distributions show, the main problem for this project is class imbalance: most of the data was scored with a level 5 sentiment (about 58% of the iPhone training set and 60% of the Galaxy training set, according to the prevalence figures in the confusion matrices). To address this, I tried several approaches, including SMOTE, ROSE, down-sampling and over-sampling. In the end, the best result came from performing a PCA on both datasets and using the following models.
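
To make the resampling idea concrete, here is a minimal sketch of plain random down-sampling and over-sampling. It is an illustration only, not the code used for this report: the original analysis appears to have been run in R with caret, while this sketch uses the Python standard library, and the function name `resample_balanced` and the toy rows are assumptions. The class counts are the column totals of the iPhone confusion matrix. SMOTE and ROSE go further by generating synthetic samples, which this sketch does not attempt.

```python
import random
from collections import Counter, defaultdict


def resample_balanced(rows, label_of, mode="down", seed=0):
    """Return a class-balanced copy of rows (hypothetical helper).

    mode="down": sample every class down to the smallest class size,
                 without replacement.
    mode="up":   keep every row and top each class up to the largest
                 class size by drawing extra rows with replacement.
    """
    if mode not in ("down", "up"):
        raise ValueError("mode must be 'down' or 'up'")
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    sizes = [len(members) for members in by_class.values()]
    target = min(sizes) if mode == "down" else max(sizes)

    balanced = []
    for members in by_class.values():
        if mode == "down":
            balanced.extend(rng.sample(members, target))
        else:
            balanced.extend(members)
            balanced.extend(rng.choices(members, k=target - len(members)))
    rng.shuffle(balanced)
    return balanced


# Class counts: column totals of the iPhone confusion matrix in this report.
IPHONE_COUNTS = {0: 1962, 1: 390, 2: 454, 3: 1188, 4: 1439, 5: 7540}

# Toy rows of the form (row_id, label); real rows would carry the predictors.
rows = [(i, label) for label, n in IPHONE_COUNTS.items() for i in range(n)]

down = resample_balanced(rows, label_of=lambda r: r[1], mode="down")
up = resample_balanced(rows, label_of=lambda r: r[1], mode="up")

down_counts = Counter(label for _, label in down)
up_counts = Counter(label for _, label in up)
```

Down-sampling leaves every class with as many rows as the smallest class and discards most of the level 5 rows. Over-sampling keeps all rows but repeats minority rows many times, which can encourage overfitting to those repeats.
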

## Model used to predict iphone sentiment
## Random Forest 
## 
## 12973 samples
##    25 predictor
##     6 classes: '0', '1', '2', '3', '4', '5' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 10377, 10378, 10380, 10377, 10380 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7627377  0.5438048
##   13    0.7621212  0.5430289
##   25    0.7615051  0.5423955
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion matrix of the iphone Model
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5
##          0 1392    1    0    0    6    7
##          1    0   41    0    0    1    1
##          2    3    2  136    1    1    7
##          3    1    0    0  837    0    4
##          4    1    0    0    0  648   13
##          5  565  346  318  350  783 7508
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8142          
##                  95% CI : (0.8074, 0.8208)
##     No Information Rate : 0.5812          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6489          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.7095 0.105128  0.29956  0.70455  0.45031   0.9958
## Specificity            0.9987 0.999841  0.99888  0.99958  0.99879   0.5652
## Pos Pred Value         0.9900 0.953488  0.90667  0.99406  0.97885   0.7607
## Neg Pred Value         0.9507 0.973009  0.97520  0.97107  0.93575   0.9897
## Prevalence             0.1512 0.030062  0.03500  0.09157  0.11092   0.5812
## Detection Rate         0.1073 0.003160  0.01048  0.06452  0.04995   0.5787
## Detection Prevalence   0.1084 0.003315  0.01156  0.06490  0.05103   0.7608
## Balanced Accuracy      0.8541 0.552485  0.64922  0.85206  0.72455   0.7805
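
The overall and per-class statistics above follow directly from the confusion matrix. The sketch below recomputes a few of them in plain Python as a cross-check; it is not part of the original R workflow, and the function name `summarize` is an assumption. Rows are predictions and columns are reference labels, as in the printed matrix.

```python
# iPhone confusion matrix from this report: rows = prediction, columns = reference.
IPHONE_CM = [
    [1392,   1,   0,   0,   6,    7],
    [   0,  41,   0,   0,   1,    1],
    [   3,   2, 136,   1,   1,    7],
    [   1,   0,   0, 837,   0,    4],
    [   1,   0,   0,   0, 648,   13],
    [ 565, 346, 318, 350, 783, 7508],
]


def summarize(cm):
    """Compute accuracy, no-information rate, Cohen's kappa and per-class rates."""
    k = len(cm)
    n = sum(sum(row) for row in cm)
    row_totals = [sum(row) for row in cm]                             # predicted counts
    col_totals = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # reference counts

    observed = sum(cm[i][i] for i in range(k)) / n                    # accuracy
    # Agreement expected by chance, from the row and column marginals.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    kappa = (observed - expected) / (1 - expected)

    return {
        "n": n,
        "accuracy": observed,
        "no_information_rate": max(col_totals) / n,
        "kappa": kappa,
        "sensitivity": [cm[j][j] / col_totals[j] for j in range(k)],
        "pos_pred_value": [cm[i][i] / row_totals[i] for i in range(k)],
    }


stats = summarize(IPHONE_CM)
for name in ("accuracy", "no_information_rate", "kappa"):
    print(f"{name}: {stats[name]:.4f}")
```

The matrix sums to 12973, the same as the number of training samples, which suggests it was computed on the training data itself. That would explain why its accuracy (0.8142) is higher than the cross-validated accuracy (0.7627).
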
## Model used to predict galaxy sentiment
## Random Forest 
## 
## 12911 samples
##    25 predictor
##     6 classes: '0', '1', '2', '3', '4', '5' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 10330, 10329, 10329, 10327, 10329 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.7564095  0.5148671
##   13    0.7565644  0.5160284
##   25    0.7562549  0.5161070
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 13.
## Confusion matrix of the galaxy Model
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5
##          0 1259    4    4    6    9   52
##          1    0   40    0    0    1    0
##          2    3    1  104    2    3    6
##          3    3    3    1  776    9   48
##          4    8    1    0    3  583   18
##          5  423  333  341  388  812 7667
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8078          
##                  95% CI : (0.8009, 0.8145)
##     No Information Rate : 0.6034          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6225          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.74233 0.104712 0.231111  0.66043  0.41143   0.9841
## Specificity           0.99331 0.999920 0.998796  0.99455  0.99739   0.5514
## Pos Pred Value        0.94378 0.975610 0.873950  0.92381  0.95106   0.7695
## Neg Pred Value        0.96225 0.973427 0.972952  0.96695  0.93218   0.9579
## Prevalence            0.13136 0.029587 0.034854  0.09101  0.10975   0.6034
## Detection Rate        0.09751 0.003098 0.008055  0.06010  0.04516   0.5938
## Detection Prevalence  0.10332 0.003176 0.009217  0.06506  0.04748   0.7717
## Balanced Accuracy     0.86782 0.552316 0.614954  0.82749  0.70441   0.7677
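
One way to see the effect of the class imbalance in these matrices is to ask where the errors go. The sketch below is a plain Python illustration, not part of the original analysis, and its helper names are assumptions. It computes the share of misclassifications that fall under each predicted class for the Galaxy model, plus the unweighted mean of the per-class sensitivities (macro recall).

```python
# Galaxy confusion matrix from this report: rows = prediction, columns = reference.
GALAXY_CM = [
    [1259,   4,   4,   6,   9,   52],
    [   0,  40,   0,   0,   1,    0],
    [   3,   1, 104,   2,   3,    6],
    [   3,   3,   1, 776,   9,   48],
    [   8,   1,   0,   3, 583,   18],
    [ 423, 333, 341, 388, 812, 7667],
]


def error_share_by_predicted_class(cm):
    """Fraction of all misclassifications that were predicted as each class."""
    k = len(cm)
    errors = [sum(cm[i][j] for j in range(k) if j != i) for i in range(k)]
    total = sum(errors)
    return [e / total for e in errors]


def macro_recall(cm):
    """Unweighted mean of per-class sensitivity (recall)."""
    k = len(cm)
    col_totals = [sum(cm[i][j] for i in range(k)) for j in range(k)]
    return sum(cm[j][j] / col_totals[j] for j in range(k)) / k


shares = error_share_by_predicted_class(GALAXY_CM)
for label, share in enumerate(shares):
    print(f"errors predicted as class {label}: {share:.1%}")
print(f"macro recall: {macro_recall(GALAXY_CM):.4f}")
```

For this matrix, more than nine in ten errors are pages from other classes predicted as level 5, and the macro recall is well below the overall accuracy. Both point to the majority class still dominating the predictions.
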

Final Results

##  iphonesentiment galaxysentiment
##  Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:0.000  
##  Median :3.000   Median :3.000  
##  Mean   :2.489   Mean   :2.635  
##  3rd Qu.:5.000   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000

From this analysis we can conclude that the relationship between the sentiment towards each phone and the number of keys found in the web pages is not very clear. Sentiment is subjective and human behaviour is hard to predict. The results show only a very small difference in favour of the Galaxy: a mean sentiment of 2.635 against 2.489 for the iPhone, with the same median (3) and the same quartiles for both.