This study analyzes more than 6000 thousand web pages extracted from the Common Crawl repository in December 2016. My purpose was to create a dataset of words denoting sentiment about mobile devices, so I needed to extract some information from those pages.

This information was extracted with Amazon Web Services, using Hadoop and MapReduce. The extracted information consisted of raw sentiment counts from individual web pages. From these counts I created a dependent variable, using a Python script that weights the counts and summarizes them into y-variables for iPhone and Galaxy. This is the Sentiment dataset.

The goal of this study was to create a model to predict sentiment toward the iPhone and the Samsung Galaxy.

For this job I use R, so I first load the libraries that I need for the study.

library(ggplot2)
library(dplyr)
library(caret)
library(arules)
library(corrplot)

PRE PROCESS

First I load the data extracted from the web pages. This information is split across two files: one with all the count attributes and one with the sentiment scores. From them I create two new datasets, one for iPhone and one for Galaxy.

sentiment<- read.csv("~/sentiment.csv")
grupo<-read.csv("~/grup.csv")
iPhoneLargeMatrix<-cbind(grupo, iphoneSentiment =sentiment$iphoneSentiment)
GalaxyLargeMatrix<-cbind(grupo, galaxySentiment=sentiment$galaxySentiment)

Now let’s view the attributes of the matrices and their dimensions.

dim(iPhoneLargeMatrix)
## [1] 92281    60
names(iPhoneLargeMatrix)
##  [1] "id"              "iphone"          "samsunggalaxy"  
##  [4] "sonyxperia"      "nokialumina"     "htcphone"       
##  [7] "ios"             "googleandroid"   "iphonecampos"   
## [10] "samsungcampos"   "sonycampos"      "nokiacampos"    
## [13] "htccampos"       "iphonecamneg"    "samsungcamneg"  
## [16] "sonycamneg"      "nokiacamneg"     "htccamneg"      
## [19] "iphonecamunc"    "samsungcamunc"   "sonycamunc"     
## [22] "nokiacamunc"     "htccamunc"       "iphonedispos"   
## [25] "samsungdispos"   "sonydispos"      "nokiadispos"    
## [28] "htcdispos"       "iphonedisneg"    "samsungdisneg"  
## [31] "sonydisneg"      "nokiadisneg"     "htcdisneg"      
## [34] "iphonedisunc"    "samsungdisunc"   "sonydisunc"     
## [37] "nokiadisunc"     "htcdisunc"       "iphoneperpos"   
## [40] "samsungperpos"   "sonyperpos"      "nokiaperpos"    
## [43] "htcperpos"       "iphoneperneg"    "samsungperneg"  
## [46] "sonyperneg"      "nokiaperneg"     "htcperneg"      
## [49] "iphoneperunc"    "samsungperunc"   "sonyperunc"     
## [52] "nokiaperunc"     "htcperunc"       "iosperpos"      
## [55] "googleperpos"    "iosperneg"       "googleperneg"   
## [58] "iosperunc"       "googleperunc"    "iphoneSentiment"

We can see that we have positive, negative and uncertain (neutral) counts about each phone’s camera, display and performance (hardware), as well as about the performance of each operating system.

One of the features is the sentiment score for Galaxy and iPhone, so let’s look at how the scores are distributed and at their summary statistics.

summary(iPhoneLargeMatrix$iphoneSentiment)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -689.000    0.000    0.000    4.297    0.000 5600.000
summary(GalaxyLargeMatrix$galaxySentiment)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -668.000    0.000    0.000    1.692    0.000 5600.000

Here we see that there are negative values. This is due to the system of weights used to score the attributes: attributes that denote a negative opinion are given a weight of -10, and positive ones a weight of 10.
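The weighting script itself is not shown in this report. As a rough sketch of the idea only, assuming that each positive count contributes +10, each negative count contributes -10, and uncertain counts contribute nothing (the zero weight, the helper name `weighted_sentiment` and the toy counts are all assumptions), a per-page score could be computed like this:

```r
# Hypothetical helper: collapse raw counts into one weighted score per page.
# Weights of +10 / -10 follow the text; ignoring the "unc" columns is an assumption.
weighted_sentiment <- function(counts, pos_cols, neg_cols, w_pos = 10, w_neg = -10) {
  w_pos * rowSums(counts[, pos_cols, drop = FALSE]) +
    w_neg * rowSums(counts[, neg_cols, drop = FALSE])
}

# Toy data using column names from the dataset above.
toy <- data.frame(
  iphonecampos = c(2, 0, 1),
  iphonedispos = c(1, 0, 0),
  iphonecamneg = c(0, 3, 1),
  iphonedisneg = c(0, 1, 0)
)

toy_score <- weighted_sentiment(
  toy,
  pos_cols = c("iphonecampos", "iphonedispos"),
  neg_cols = c("iphonecamneg", "iphonedisneg")
)
print(toy_score)
```

The real script evidently uses further weights as well, since observed scores such as -689 are not multiples of 10.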

To visualize the distribution of this sentiment I will use two kinds of plots: a simple index plot and a histogram.

plot(iPhoneLargeMatrix$iphoneSentiment, ylab = "Sentiment", main = "Iphone Sentiment")

hist(iPhoneLargeMatrix$iphoneSentiment, xlim = c(-10,15), ylim = c(0,90000), breaks = 1000, 
     xlab = "Iphone sentiment", main="Histogram of Iphone Sentiment")

In the first plot we can see two clear outliers, and the histogram shows how they affect the distribution.

Let’s see the same plots for Galaxy. Outliers affect this distribution as well: the same two outliers appear, and they could distort a later discretization. So, in order to get a better Kappa and not just a better accuracy, I will exclude some of them. First I exclude the instances with no information.

iPhoneLargeMatrix$Mean_row<-rowMeans(iPhoneLargeMatrix, na.rm = TRUE)
nuevoIphone<-filter(iPhoneLargeMatrix, Mean_row !=0)
nuevoIphone$Mean_row<-NULL

GalaxyLargeMatrix$Mean_row <- rowMeans(GalaxyLargeMatrix, na.rm = TRUE)
nuevoGalaxy<- filter(GalaxyLargeMatrix, Mean_row != 0)
nuevoGalaxy$Mean_row<-NULL
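Note that `rowMeans()` above is taken over every column, including `id` and the sentiment score, so a row's mean is zero only when all of those values sum to zero. A possible variant, sketched here on toy data with hypothetical names (`toy`, `count_cols`, `toy_nonempty`), tests only the count columns. It is shown as an alternative, not as what produced the figures in this report, and using it would change the row counts reported below.

```r
# Toy frame mimicking the layout: an id, some count columns, and the target.
toy <- data.frame(
  id              = 1:4,
  iphone          = c(0, 2, 0, 1),
  iphonecampos    = c(0, 1, 0, 0),
  iphoneSentiment = c(0, 10, 0, 0)
)

# Columns that hold counts (everything except the id and the target).
count_cols <- setdiff(names(toy), c("id", "iphoneSentiment"))

# Keep rows where at least one count is non-zero.
toy_nonempty <- toy[rowSums(toy[, count_cols, drop = FALSE] != 0) > 0, ]
print(toy_nonempty$id)
```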

To visualize the distribution of the sentiment in more detail, I’m going to discretize the data.

dIphone <- discretize(nuevoIphone$iphoneSentiment, "fixed", categories= c(-Inf,-300,-200,-100, -50, -10, -1, 1, 10, 50, 100,200,300,Inf))
dGalaxy <- discretize(nuevoGalaxy$galaxySentiment, "fixed", categories = c(-Inf,-300,-200,-100, -50, -10, -1, 1, 10, 50,100,200,300, Inf))

And now I check the distribution of these buckets.

summary(dIphone)
## [-Inf,-300)        -300 [-200,-100) [-100, -50) [ -50, -10) [ -10,  -1) 
##           8           0          11         136         730        2041 
## [  -1,   1) [   1,  10) [  10,  50) [  50, 100) [ 100, 200) [ 200, 300) 
##       67786        5774       14195         996         403          90 
## [ 300, Inf] 
##         109
summary(dGalaxy)
## [-Inf,-300)        -300 [-200,-100) [-100, -50) [ -50, -10) [ -10,  -1) 
##           3           0           3          31         258         397 
## [  -1,   1) [   1,  10) [  10,  50) [  50, 100) [ 100, 200) [ 200, 300) 
##       84617        2477        3800         472         143          31 
## [ 300, Inf] 
##          47
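The same kind of left-closed binning can be reproduced with base R's `cut()`, which may help when checking on which side a boundary value such as -1 or 1 falls. A minimal sketch on toy values (the vector `x` is illustrative and not taken from the dataset):

```r
x <- c(-350, -10, -1, 0, 0.5, 1, 10, 400)
breaks <- c(-Inf, -300, -200, -100, -50, -10, -1, 1, 10, 50, 100, 200, 300, Inf)

# right = FALSE makes every interval closed on the left: [a, b)
bins <- cut(x, breaks = breaks, right = FALSE)

# Show only the buckets that actually received values.
print(table(droplevels(bins)))
```

With these breaks, a score of exactly 0 lands in the bucket starting at -1, together with every value from -1 up to but not including 1.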

I also check the standard deviation to analyze this distribution and decide which outliers to delete.

sdiphone<-sd(nuevoIphone$iphoneSentiment)
sdgalaxy<-sd(nuevoGalaxy$galaxySentiment)
sdiphone
## [1] 32.29157
sdgalaxy
## [1] 25.50328

With all this information we can draw a density plot, which will help show the tendency of the sentiment. I want to plot real sentiment without considering 0, because it is neutral, so I create a new variable to plot. In this case I treat values more than about one standard deviation from zero as outliers; I chose a cut-off of 40, slightly above both standard deviations, because it shows the tendency better in both datasets. We can see that in the next plots.

plotterIPHONE <- nuevoIphone %>% filter( -40 < iphoneSentiment & 40 > iphoneSentiment & iphoneSentiment != 0)
plotterGALAXY <- nuevoGalaxy %>% filter( -40 < galaxySentiment & 40 > galaxySentiment & galaxySentiment !=0)

Now I draw a density plot of the sentiment to see its distribution.

ggplot(plotterIPHONE, aes(x = iphoneSentiment)) + geom_density()

ggplot(plotterGALAXY, aes(x = galaxySentiment)) + geom_density()

At this point we have preprocessed the instances of both datasets. Now we have to do the same for the attributes: select which ones will be our predictors, and identify which are highly correlated and therefore not needed to create and evaluate a predictive model.

To check the correlations and subset the attributes, I first set aside the attribute to predict.

iphonefiltered<- select(plotterIPHONE, -iphoneSentiment)
galaxyfiltered<- select(plotterGALAXY, -galaxySentiment)

Now I analyze the correlation of attributes.

correIphone <- cor(iphonefiltered)
correGalaxy <- cor(galaxyfiltered)

And also investigate the statistics of the new objects.

summary(correIphone[upper.tri(correIphone)])
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.151500  0.001912  0.069250  0.142500  0.182600  0.998800
summary(correGalaxy[upper.tri(correGalaxy)])
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.152900 -0.003486  0.049840  0.129000  0.184900  0.999700

Next, select a cut-off for highly correlated features. In this case the cut-off is a correlation of 0.80.

corrIphone2 <- findCorrelation(correIphone, cutoff = .80)
corrGalaxy2 <- findCorrelation(correGalaxy, cutoff = .80) 

And now build a matrix without the highly correlated attributes.

iPhoneRes <- iphonefiltered[,-corrIphone2]
GalaxyRes <- galaxyfiltered[, -corrGalaxy2]
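To illustrate what a correlation filter of this kind does, here is a simplified base R sketch. It is not caret's implementation of `findCorrelation()`; it only follows the same general idea of dropping, from each pair above the cut-off, the variable that is on average more correlated with the rest. The helper name `drop_correlated` and the toy data are hypothetical.

```r
# Hypothetical helper: greedy removal of one variable from each highly correlated pair.
drop_correlated <- function(cor_mat, cutoff = 0.80) {
  a <- abs(cor_mat)
  diag(a) <- 0
  dropped <- character(0)
  while (length(a) > 1 && max(a) > cutoff) {
    # Locate the most strongly correlated remaining pair.
    idx <- which(a == max(a), arr.ind = TRUE)[1, ]
    # Drop whichever member is, on average, more correlated with everything else.
    worse <- idx[which.max(c(mean(a[idx[1], ]), mean(a[idx[2], ])))]
    dropped <- c(dropped, colnames(a)[worse])
    a <- a[-worse, -worse, drop = FALSE]
  }
  dropped
}

set.seed(1)
n <- 200
base <- rnorm(n)
toy <- data.frame(
  a = base,
  b = base + rnorm(n, sd = 0.05),  # nearly a copy of a
  c = rnorm(n)                     # unrelated
)
to_drop <- drop_correlated(cor(toy))
print(to_drop)
```

In the toy data only one of `a` and `b` is dropped, and `c` is kept because it is not strongly correlated with anything.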

Let’s see which attributes are going to be used to predict.

names(iPhoneRes)
##  [1] "id"            "iphone"        "samsunggalaxy" "sonyxperia"   
##  [5] "nokialumina"   "htcphone"      "ios"           "googleandroid"
##  [9] "iphonecampos"  "samsungcampos" "sonycampos"    "iphonecamneg" 
## [13] "iphonecamunc"  "samsungcamunc" "sonycamunc"    "nokiacamunc"  
## [17] "htccamunc"     "iphonedispos"  "sonydispos"    "iphonedisneg" 
## [21] "samsungdisneg" "htcdisneg"     "iphonedisunc"  "samsungdisunc"
## [25] "iphoneperpos"  "iphoneperneg"  "samsungperneg" "htcperneg"    
## [29] "iphoneperunc"  "samsungperunc" "sonyperunc"    "htcperunc"    
## [33] "iosperpos"     "googleperpos"
names(GalaxyRes)
##  [1] "id"            "iphone"        "samsunggalaxy" "sonyxperia"   
##  [5] "nokialumina"   "htcphone"      "ios"           "googleandroid"
##  [9] "iphonecampos"  "samsungcampos" "sonycampos"    "htccampos"    
## [13] "sonycamneg"    "nokiacamneg"   "iphonecamunc"  "samsungcamunc"
## [17] "sonycamunc"    "htccamunc"     "iphonedispos"  "nokiadispos"  
## [21] "htcdispos"     "samsungdisneg" "sonydisneg"    "htcdisneg"    
## [25] "iphonedisunc"  "samsungdisunc" "samsungperpos" "sonyperpos"   
## [29] "htcperpos"     "iphoneperneg"  "htcperneg"     "iphoneperunc" 
## [33] "samsungperunc" "htcperunc"     "iosperpos"     "googleperneg"

At this point I add back the sentiment attribute for each phone that we excluded before.

iPhoneRes<- mutate(iPhoneRes, iPhoneSentiment = plotterIPHONE$iphoneSentiment)
GalaxyRes <- mutate(GalaxyRes, GalaxySentiment = plotterGALAXY$galaxySentiment)

And plot the correlation matrix.

corrplot(cor(iPhoneRes), order = "hclust")

corrplot(cor(GalaxyRes), order = "hclust")

Now we discretize again to see how the sentiment is distributed after filtering, categorize it and label it.

LabIphone<- discretize(iPhoneRes$iPhoneSentiment, "fixed", categories = c(-Inf,-15,-5,5,15,Inf))
LabGalaxy<- discretize(GalaxyRes$GalaxySentiment, "fixed", categories = c(-Inf,-15,-5,5,15,Inf))
summary(LabIphone)
## [-Inf, -15) [ -15,  -5) [  -5,   5) [   5,  15) [  15, Inf] 
##         643        2037        5602        9278        4462
summary(LabGalaxy)
## [-Inf, -15) [ -15,  -5) [  -5,   5) [   5,  15) [  15, Inf] 
##         213         393        2429        1534        2045

With 5 buckets we get a reasonable distribution, so we use it.

iPhoneRes$iPhoneSentiment<- LabIphone
GalaxyRes$GalaxySentiment<- LabGalaxy

And label them.

levels(iPhoneRes$iPhoneSentiment)<- c("very negative", "negative", "neutral", "good", "very good")
levels(GalaxyRes$GalaxySentiment)<- c("very negative", "negative", "neutral", "good", "very good")
summary(iPhoneRes$iPhoneSentiment)
## very negative      negative       neutral          good     very good 
##           643          2037          5602          9278          4462
summary(GalaxyRes$GalaxySentiment)
## very negative      negative       neutral          good     very good 
##           213           393          2429          1534          2045

At this point the preprocessing is finished, so now we begin creating the models.

CREATION OF PREDICTIVE MODELS

To create the models I divide each dataset into a training set and a test set, training the models on the former and testing them on instances that were not used to fit the model. This way I evaluate the accuracy and Kappa of each model.

IPHONE MODELS

I create the partition and check the size of each set.

set.seed(333)
trainIndex1<- createDataPartition(iPhoneRes$iPhoneSentiment, p= .7, list = FALSE)
Train1 <- iPhoneRes[ trainIndex1,]
Test1  <- iPhoneRes[-trainIndex1,]
nrow(Train1)
## [1] 15418
nrow(Test1)
## [1] 6604
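For a factor outcome, `createDataPartition()` samples within each class, so the training and test sets keep roughly the same class proportions. The idea can be sketched in base R as follows; the helper name `stratified_index` and the toy labels are hypothetical, and this is an approximation of the behaviour rather than caret's own code.

```r
# Hypothetical helper: sample a fraction p of the row indices within each class.
stratified_index <- function(y, p = 0.7) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(i) {
    i[sample.int(length(i), round(p * length(i)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(333)
y <- factor(rep(c("negative", "neutral", "good"), times = c(20, 50, 30)))
train_idx <- stratified_index(y, p = 0.7)

# Class counts in the training and test parts.
print(table(y[train_idx]))
print(table(y[-train_idx]))
```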

We have a classification problem, so the models that I am going to use are C5.0, KNN and Random Forest.

__________________________________MODEL C5.0___________________________________________________________________

ctrl<-trainControl(method = "repeatedcv", repeats = 3)

C5Iphone<- train(iPhoneSentiment~., data= Train1, method= "C5.0", trControl = ctrl,preProc = c("center", "scale"))
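With caret's default of `number = 10`, `trainControl(method = "repeatedcv", repeats = 3)` requests 10-fold cross-validation repeated three times, which gives 30 resamples for each candidate tuning setting. A minimal sketch of how such fold assignments can be generated is shown below; the helper `repeated_folds` is hypothetical, and unlike caret it does not balance the classes within folds.

```r
# Hypothetical helper: assign each of n rows to one of k folds, repeated r times.
repeated_folds <- function(n, k = 10, r = 3) {
  lapply(seq_len(r), function(rep_i) sample(rep(seq_len(k), length.out = n)))
}

set.seed(333)
folds <- repeated_folds(n = 25, k = 10, r = 3)

# Each repeat is a different random assignment of the same rows to 10 folds.
print(folds[[1]])

# Total number of train/validation splits: repeats times folds.
print(length(folds) * length(unique(folds[[1]])))
```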

Now we use the model on the test set to make predictions and evaluate them with a confusion matrix.

IPHONEPRED<-predict(C5Iphone, Test1)

confusionMatrix(Test1$iPhoneSentiment, IPHONEPRED)
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      very negative negative neutral good very good
##   very negative           166       13       4    3         6
##   negative                 14      542      37    9         9
##   neutral                   1       35    1534   79        31
##   good                      1       10      86 2613        73
##   very good                 1        1      18   24      1294
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9311          
##                  95% CI : (0.9247, 0.9371)
##     No Information Rate : 0.4131          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9028          
##  Mcnemar's Test P-Value : 9.526e-06       
## 
## Statistics by Class:
## 
##                      Class: very negative Class: negative Class: neutral
## Sensitivity                       0.90710         0.90183         0.9136
## Specificity                       0.99595         0.98851         0.9704
## Pos Pred Value                    0.86458         0.88707         0.9131
## Neg Pred Value                    0.99735         0.99016         0.9706
## Prevalence                        0.02771         0.09101         0.2542
## Detection Rate                    0.02514         0.08207         0.2323
## Detection Prevalence              0.02907         0.09252         0.2544
## Balanced Accuracy                 0.95153         0.94517         0.9420
##                      Class: good Class: very good
## Sensitivity               0.9578           0.9158
## Specificity               0.9561           0.9915
## Pos Pred Value            0.9389           0.9671
## Neg Pred Value            0.9699           0.9774
## Prevalence                0.4131           0.2140
## Detection Rate            0.3957           0.1959
## Detection Prevalence      0.4214           0.2026
## Balanced Accuracy         0.9570           0.9537
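A note on reading these tables: caret's `confusionMatrix()` expects the predictions as its first argument and the reference labels as its second. The calls in this report pass the test labels first, so the rows labelled "Prediction" actually hold the true classes. Accuracy and Kappa are symmetric in the two arguments and are unaffected, but the per-class sensitivity and positive predictive value are swapped relative to the usual reading. As a check, accuracy and Kappa can be recomputed by hand from the counts in the C5.0 table above; the variable names below are illustrative.

```r
# Counts copied from the C5.0 iPhone table above (rows and columns in the printed order).
cm <- matrix(
  c(166,  13,    4,    3,    6,
     14, 542,   37,    9,    9,
      1,  35, 1534,   79,   31,
      1,  10,   86, 2613,   73,
      1,   1,   18,   24, 1294),
  nrow = 5, byrow = TRUE
)

n  <- sum(cm)
po <- sum(diag(cm)) / n                      # observed agreement (accuracy)
pe <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa_hat <- (po - pe) / (1 - pe)            # Cohen's Kappa

print(round(c(accuracy = po, kappa = kappa_hat), 4))
```

Kappa corrects accuracy for the agreement that would be expected by chance given the class frequencies, which is why it is the more demanding metric on an unbalanced outcome like this one.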

This model gives good results: accuracy 0.9311 and Kappa 0.9028 on the test set.

__________________________________________KNN___________________________________________________________________

ctrl<-trainControl(method = "repeatedcv", repeats = 3)

KNNIphone<- train(iPhoneSentiment~., data= Train1, method= "knn", trControl = ctrl,preProc = c("center", "scale"))
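Centering and scaling matter for KNN in particular, because it is distance based and a feature with a large numeric range would otherwise dominate the distance. With `preProc = c("center", "scale")`, the constants are estimated on the training data and then reused when predicting on new data. The sketch below shows that idea in base R on toy values; the object names are hypothetical.

```r
# Toy training and test sets with features on very different scales.
train_x <- data.frame(count = c(0, 1, 2, 3), big = c(100, 300, 200, 400))
test_x  <- data.frame(count = c(1, 2),       big = c(250, 350))

# Learn the centering and scaling constants on the training rows only.
mu    <- colMeans(train_x)
sigma <- apply(train_x, 2, sd)

# Apply the same constants to both sets.
train_s <- scale(train_x, center = mu, scale = sigma)
test_s  <- scale(test_x,  center = mu, scale = sigma)

print(round(colMeans(train_s), 10))
print(test_s)
```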

And now the prediction and confusion matrix with KNN.

PredKNNIphone<-predict(KNNIphone, Test1)
confusionMatrix(Test1$iPhoneSentiment, PredKNNIphone)
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      very negative negative neutral good very good
##   very negative           123       36      22    5         6
##   negative                 16      485      78   24         8
##   neutral                   3       26    1504  116        31
##   good                      3        5     189 2490        96
##   very good                 1        2      61  134      1140
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8695          
##                  95% CI : (0.8611, 0.8775)
##     No Information Rate : 0.4193          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8146          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: very negative Class: negative Class: neutral
## Sensitivity                       0.84247         0.87545         0.8112
## Specificity                       0.98932         0.97917         0.9629
## Pos Pred Value                    0.64062         0.79378         0.8952
## Neg Pred Value                    0.99641         0.98849         0.9289
## Prevalence                        0.02211         0.08389         0.2807
## Detection Rate                    0.01863         0.07344         0.2277
## Detection Prevalence              0.02907         0.09252         0.2544
## Balanced Accuracy                 0.91589         0.92731         0.8871
##                      Class: good Class: very good
## Sensitivity               0.8992           0.8899
## Specificity               0.9236           0.9628
## Pos Pred Value            0.8947           0.8520
## Neg Pred Value            0.9270           0.9732
## Prevalence                0.4193           0.1940
## Detection Rate            0.3770           0.1726
## Detection Prevalence      0.4214           0.2026
## Balanced Accuracy         0.9114           0.9264

Here we see that KNN is somewhat worse than C5.0 (accuracy 0.8695, Kappa 0.8146).

______________________________________________RANDOMFOREST______________________________________________________

Now I will try a Random Forest.

library(randomForest)
RFmodelIphone<-randomForest(iPhoneSentiment~., data= Train1)
RFpredIphone <- predict(RFmodelIphone, Test1)

And analyze it with the confusion matrix.

confusionMatrix(Test1$iPhoneSentiment, RFpredIphone)
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      very negative negative neutral good very good
##   very negative           115       29      33    3        12
##   negative                  6      464     113   16        12
##   neutral                   0        3    1613   34        30
##   good                      0        1     199 2505        78
##   very good                 0        2      66  102      1168
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8881          
##                  95% CI : (0.8802, 0.8956)
##     No Information Rate : 0.4028          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8412          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: very negative Class: negative Class: neutral
## Sensitivity                       0.95041         0.92986         0.7969
## Specificity                       0.98812         0.97592         0.9854
## Pos Pred Value                    0.59896         0.75941         0.9601
## Neg Pred Value                    0.99906         0.99416         0.9165
## Prevalence                        0.01832         0.07556         0.3065
## Detection Rate                    0.01741         0.07026         0.2442
## Detection Prevalence              0.02907         0.09252         0.2544
## Balanced Accuracy                 0.96927         0.95289         0.8912
##                      Class: good Class: very good
## Sensitivity               0.9417           0.8985
## Specificity               0.9295           0.9679
## Pos Pred Value            0.9001           0.8729
## Neg Pred Value            0.9594           0.9749
## Prevalence                0.4028           0.1969
## Detection Rate            0.3793           0.1769
## Detection Prevalence      0.4214           0.2026
## Balanced Accuracy         0.9356           0.9332

RESULTS FOR IPHONE

The best prediction comes from C5.0, with accuracy 0.9311 and Kappa 0.9028. We also have to consider that during preprocessing we deleted instances where all values were 0, and removed outliers by deleting instances with a sentiment value of 40 or more, or of -40 or less. The third step was to delete instances with 0 sentiment. Because the goal is to predict sentiment, we have to check its density and how it is distributed to get a good predictive model. I also deleted the highly correlated attributes.

SAMSUNG GALAXY ANALYSIS

First, again create the training and test datasets.

set.seed(123)
trainIndex2<- createDataPartition(GalaxyRes$GalaxySentiment, p= .7, list = FALSE)
Train2 <- GalaxyRes[ trainIndex2,]
Test2  <- GalaxyRes[-trainIndex2,]
nrow(Train2)
## [1] 4633
nrow(Test2)
## [1] 1981

___________________________C5.0_______________________________________________________________________________

ctrl<-trainControl(method = "repeatedcv", repeats = 3)

C5Galaxy<- train(GalaxySentiment~., data= Train2, method= "C5.0", trControl = ctrl,preProc = c("center", "scale"))

Make the prediction and confusion matrix.

C5Galaxypred<- predict(C5Galaxy, Test2)
confusionMatrix(Test2$GalaxySentiment, C5Galaxypred)
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      very negative negative neutral good very good
##   very negative            36       16       5    1         5
##   negative                  8       62      35    8         4
##   neutral                   2        9     672   31        14
##   good                      3        5      64  339        49
##   very good                 5        4       9   47       548
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8364          
##                  95% CI : (0.8194, 0.8525)
##     No Information Rate : 0.3963          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7676          
##  Mcnemar's Test P-Value : 0.0002158       
## 
## Statistics by Class:
## 
##                      Class: very negative Class: negative Class: neutral
## Sensitivity                       0.66667         0.64583         0.8561
## Specificity                       0.98599         0.97082         0.9532
## Pos Pred Value                    0.57143         0.52991         0.9231
## Neg Pred Value                    0.99062         0.98176         0.9098
## Prevalence                        0.02726         0.04846         0.3963
## Detection Rate                    0.01817         0.03130         0.3392
## Detection Prevalence              0.03180         0.05906         0.3675
## Balanced Accuracy                 0.82633         0.80833         0.9046
##                      Class: good Class: very good
## Sensitivity               0.7958           0.8839
## Specificity               0.9222           0.9522
## Pos Pred Value            0.7370           0.8940
## Neg Pred Value            0.9428           0.9474
## Prevalence                0.2150           0.3130
## Detection Rate            0.1711           0.2766
## Detection Prevalence      0.2322           0.3094
## Balanced Accuracy         0.8590           0.9181

__________________________________KNN_________________________________________________________________________

ctrl<-trainControl(method = "repeatedcv")

KNNGalaxy<- train(GalaxySentiment~., data= Train2, method= "knn", trControl = ctrl,preProc = c("center", "scale"))

PredKNNGalaxy<-predict(KNNGalaxy, Test2)
confusionMatrix(Test2$GalaxySentiment, PredKNNGalaxy)
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      very negative negative neutral good very good
##   very negative            28       17       5    5         8
##   negative                  9       59      32   14         3
##   neutral                   2       16     626   51        33
##   good                      3        5      77  329        46
##   very good                 7        4      22  100       480
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7683          
##                  95% CI : (0.7491, 0.7867)
##     No Information Rate : 0.3847          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6724          
##  Mcnemar's Test P-Value : 9.174e-06       
## 
## Statistics by Class:
## 
##                      Class: very negative Class: negative Class: neutral
## Sensitivity                       0.57143         0.58416         0.8215
## Specificity                       0.98188         0.96915         0.9163
## Pos Pred Value                    0.44444         0.50427         0.8599
## Neg Pred Value                    0.98905         0.97747         0.8915
## Prevalence                        0.02473         0.05098         0.3847
## Detection Rate                    0.01413         0.02978         0.3160
## Detection Prevalence              0.03180         0.05906         0.3675
## Balanced Accuracy                 0.77666         0.77665         0.8689
##                      Class: good Class: very good
## Sensitivity               0.6593           0.8421
## Specificity               0.9116           0.9057
## Pos Pred Value            0.7152           0.7830
## Neg Pred Value            0.8882           0.9342
## Prevalence                0.2519           0.2877
## Detection Rate            0.1661           0.2423
## Detection Prevalence      0.2322           0.3094
## Balanced Accuracy         0.7855           0.8739

__________________________________________RANDOMFOREST_________________________________________________________

RFmodelGalaxy<-randomForest(GalaxySentiment~., data= Train2)
RFpredGalaxy <- predict(RFmodelGalaxy, Test2)
confusionMatrix(Test2$GalaxySentiment, RFpredGalaxy)
## Confusion Matrix and Statistics
## 
##                Reference
## Prediction      very negative negative neutral good very good
##   very negative            36       11       9    2         5
##   negative                 11       51      41   12         2
##   neutral                   0        0     673   39        16
##   good                      1        2      80  333        44
##   very good                 2        2      24   56       529
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8188          
##                  95% CI : (0.8011, 0.8355)
##     No Information Rate : 0.4175          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7407          
##  Mcnemar's Test P-Value : 3.138e-12       
## 
## Statistics by Class:
## 
##                      Class: very negative Class: negative Class: neutral
## Sensitivity                       0.72000         0.77273         0.8138
## Specificity                       0.98602         0.96554         0.9523
## Pos Pred Value                    0.57143         0.43590         0.9245
## Neg Pred Value                    0.99270         0.99195         0.8771
## Prevalence                        0.02524         0.03332         0.4175
## Detection Rate                    0.01817         0.02574         0.3397
## Detection Prevalence              0.03180         0.05906         0.3675
## Balanced Accuracy                 0.85301         0.86913         0.8831
##                      Class: good Class: very good
## Sensitivity               0.7534           0.8876
## Specificity               0.9175           0.9394
## Pos Pred Value            0.7239           0.8630
## Neg Pred Value            0.9283           0.9510
## Prevalence                0.2231           0.3009
## Detection Rate            0.1681           0.2670
## Detection Prevalence      0.2322           0.3094
## Balanced Accuracy         0.8354           0.9135

RESULTS FOR GALAXY

As in the case of iPhone, the best algorithm is C5.0, with accuracy 0.8364 and Kappa 0.7676. In this case there are fewer instances available to create and evaluate the model than for iPhone.
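To compare the six models side by side, the test-set metrics reported in the confusion matrices can be collected in one small table. The data frame below only restates those numbers; the object names are illustrative.

```r
# Test-set metrics copied from the confusion matrices in this report.
results <- data.frame(
  device   = rep(c("iPhone", "Galaxy"), each = 3),
  model    = rep(c("C5.0", "KNN", "RandomForest"), times = 2),
  accuracy = c(0.9311, 0.8695, 0.8881, 0.8364, 0.7683, 0.8188),
  kappa    = c(0.9028, 0.8146, 0.8412, 0.7676, 0.6724, 0.7407)
)

# Best model per device, ranked by Kappa.
best <- do.call(
  rbind,
  lapply(split(results, results$device), function(d) d[which.max(d$kappa), ])
)

print(results)
print(best[, c("device", "model", "kappa")])
```

For both devices the ranking by Kappa is C5.0, then Random Forest, then KNN.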

CONCLUSIONS

The iPhone is a device with more lovers and haters, as we can see from the sentiment distribution, whereas more people have no opinion about the Galaxy.

I have also tried to see which attributes carry more weight in the sentiment analysis. For this I used WEKA with the CFS algorithm.

I first apply the function to the iPhoneLargeMatrix dataset to predict iPhone sentiment. The result is:

Selected attributes: 10,14,25,27,40,55 : 6 iphonecampos htccampos iphonedispos sonydispos iphoneperpos iosperpos

And now I apply it to GalaxyLargeMatrix dataset, to predict Galaxy sentiment:

Selected attributes: 11,12,26,41,56 : 5 samsungcampos sonycampos samsungdispos samsungperpos googleperpos