This study analyzes more than 6000 thousand web pages extracted from the Common Crawl repository in December 2016. My purpose was to build a dataset of words denoting sentiment about mobile devices, so I first needed to extract that information from the pages.
The information was extracted with Amazon Web Services, using Hadoop and MapReduce. The output was raw sentiment counts for individual web pages. From these counts I created a dependent variable with a Python script that applies weights and summarizes them into y-variables for iPhone and Galaxy. This is the sentiment dataset.
The goal of this study was to create a model that predicts sentiment toward the iPhone and the Samsung Galaxy.
For this job I use R, so I first load the libraries needed for the study.
library(ggplot2)
library(dplyr)
library(caret)
library(arules)
library(corrplot)
First I load the data extracted from the web pages. This information is split across two files: one with all the count attributes and one with the sentiment values. From them I create two new datasets, one for iPhone and one for Galaxy.
sentiment<- read.csv("~/sentiment.csv")
grupo<-read.csv("~/grup.csv")
iPhoneLargeMatrix<-cbind(grupo, iphoneSentiment =sentiment$iphoneSentiment)
GalaxyLargeMatrix<-cbind(grupo, galaxySentiment=sentiment$galaxySentiment)
Now let’s view the variables (attributes) of the matrices and their dimensions.
dim(iPhoneLargeMatrix)
## [1] 92281 60
names(iPhoneLargeMatrix)
## [1] "id" "iphone" "samsunggalaxy"
## [4] "sonyxperia" "nokialumina" "htcphone"
## [7] "ios" "googleandroid" "iphonecampos"
## [10] "samsungcampos" "sonycampos" "nokiacampos"
## [13] "htccampos" "iphonecamneg" "samsungcamneg"
## [16] "sonycamneg" "nokiacamneg" "htccamneg"
## [19] "iphonecamunc" "samsungcamunc" "sonycamunc"
## [22] "nokiacamunc" "htccamunc" "iphonedispos"
## [25] "samsungdispos" "sonydispos" "nokiadispos"
## [28] "htcdispos" "iphonedisneg" "samsungdisneg"
## [31] "sonydisneg" "nokiadisneg" "htcdisneg"
## [34] "iphonedisunc" "samsungdisunc" "sonydisunc"
## [37] "nokiadisunc" "htcdisunc" "iphoneperpos"
## [40] "samsungperpos" "sonyperpos" "nokiaperpos"
## [43] "htcperpos" "iphoneperneg" "samsungperneg"
## [46] "sonyperneg" "nokiaperneg" "htcperneg"
## [49] "iphoneperunc" "samsungperunc" "sonyperunc"
## [52] "nokiaperunc" "htcperunc" "iosperpos"
## [55] "googleperpos" "iosperneg" "googleperneg"
## [58] "iosperunc" "googleperunc" "iphoneSentiment"
We have positive, uncertain (neutral) and negative counts for each phone’s camera, display and hardware performance, and for the performance of each operating system.
One of the attributes is the sentiment toward Galaxy and iPhone, so let’s look at how these values are distributed and at their summary statistics.
summary(iPhoneLargeMatrix$iphoneSentiment)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -689.000 0.000 0.000 4.297 0.000 5600.000
summary(GalaxyLargeMatrix$galaxySentiment)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -668.000 0.000 0.000 1.692 0.000 5600.000
Here we see that there are negative values. This comes from the system of weights used to score the attributes: every attribute that denotes a negative opinion is weighted -10, and every positive one 10.
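As a rough illustration of this weighting, the sketch below scores three made-up pages from their camera counts. The original Python script is not shown here, so the counts, the variable names `w_pos`, `w_neg`, `w_unc`, and especially the zero weight for uncertain counts are assumptions, not the actual implementation.

```r
# Hypothetical counts for three pages; column names follow the dataset's pattern.
counts <- data.frame(
  iphonecampos = c(2, 0, 1),  # positive camera mentions
  iphonecamneg = c(0, 3, 1),  # negative camera mentions
  iphonecamunc = c(1, 0, 0)   # uncertain camera mentions
)

w_pos <- 10    # weight stated in the text for positive attributes
w_neg <- -10   # weight stated in the text for negative attributes
w_unc <- 0     # assumption: uncertain counts do not move the score

# Weighted sum per page.
sentiment <- w_pos * counts$iphonecampos +
  w_neg * counts$iphonecamneg +
  w_unc * counts$iphonecamunc

print(sentiment)
```

A page with more negative than positive counts ends up below zero, which is how negative values can appear in the summaries.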
To visualize the distribution of this sentiment I will use two kinds of plots: a simple plot of the values and a histogram.
plot(iPhoneLargeMatrix$iphoneSentiment, ylab = "Sentiment", main = "Iphone Sentiment")
hist(iPhoneLargeMatrix$iphoneSentiment, xlim = c(-10,15), ylim = c(0,90000), breaks = 1000,
xlab = "Iphone sentiment", main="Histogram of Iphone Sentiment")
In the first plot we can see two clear outliers and, looking at the histogram, they affect the distribution.
Let’s see the same plots for Galaxy. Here the outliers also affect the distribution. There are the same two outliers, and they can affect a later discretization. So, in order to get a better Kappa and not just accuracy, I will exclude some of them. First I will exclude the instances with no information.
iPhoneLargeMatrix$Mean_row<-rowMeans(iPhoneLargeMatrix, na.rm = TRUE)
nuevoIphone<-filter(iPhoneLargeMatrix, Mean_row !=0)
nuevoIphone$Mean_row<-NULL
GalaxyLargeMatrix$Mean_row <- rowMeans(GalaxyLargeMatrix, na.rm = TRUE)
nuevoGalaxy<- filter(GalaxyLargeMatrix, Mean_row != 0)
nuevoGalaxy$Mean_row<-NULL
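One thing to keep in mind about this filter: the row mean is taken over every column, including `id`, so a row whose counts are all zero may still have a nonzero mean and be kept. A possible variant, shown only as a sketch on made-up data, checks the count columns alone. The helper names `feature_cols` and `has_info` are hypothetical.

```r
# Toy data: row 1 carries no information apart from its id.
toy <- data.frame(
  id              = 1:3,
  iphone          = c(0, 2, 0),
  ios             = c(0, 0, 1),
  iphoneSentiment = c(0, 10, -10)
)

# With id included, no row mean is exactly zero, so nothing is dropped.
print(rowMeans(toy))

# Variant: ignore id and keep rows that have at least one nonzero value.
feature_cols <- setdiff(names(toy), "id")
has_info <- rowSums(toy[, feature_cols] != 0, na.rm = TRUE) > 0
kept <- toy[has_info, ]
print(kept)
```

Counting nonzero entries also avoids the case where positive and negative values cancel out to a mean of zero.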
To look at the distribution of the sentiment in more detail, I’m going to discretize the data.
dIphone <- discretize(nuevoIphone$iphoneSentiment, "fixed", categories= c(-Inf,-300,-200,-100, -50, -10, -1, 1, 10, 50, 100,200,300,Inf))
dGalaxy <- discretize(nuevoGalaxy$galaxySentiment, "fixed", categories = c(-Inf,-300,-200,-100, -50, -10, -1, 1, 10, 50,100,200,300, Inf))
And now I check the distribution of these buckets.
summary(dIphone)
## [-Inf,-300) -300 [-200,-100) [-100, -50) [ -50, -10) [ -10, -1)
## 8 0 11 136 730 2041
## [ -1, 1) [ 1, 10) [ 10, 50) [ 50, 100) [ 100, 200) [ 200, 300)
## 67786 5774 14195 996 403 90
## [ 300, Inf]
## 109
summary(dGalaxy)
## [-Inf,-300) -300 [-200,-100) [-100, -50) [ -50, -10) [ -10, -1)
## 3 0 3 31 258 397
## [ -1, 1) [ 1, 10) [ 10, 50) [ 50, 100) [ 100, 200) [ 200, 300)
## 84617 2477 3800 472 143 31
## [ 300, Inf]
## 47
I also check the standard deviation to analyze this distribution and decide which outliers to delete.
sdiphone<-sd(nuevoIphone$iphoneSentiment)
sdgalaxy<-sd(nuevoGalaxy$galaxySentiment)
sdiphone
## [1] 32.29157
sdgalaxy
## [1] 25.50328
With all this information we can make a density plot, which will help to see the tendency of the sentiment. I want to plot real sentiment without considering 0, because it is neutral, so I create a new variable to plot. In this case I treat values beyond roughly one SD as outliers. I have chosen 40 as the cut-off, slightly above the SDs of 32.3 and 25.5, because it shows the tendency better in both datasets. We can see that in the next plot.
plotterIPHONE <- nuevoIphone %>% filter( -40 < iphoneSentiment & 40 > iphoneSentiment & iphoneSentiment != 0)
plotterGALAXY <- nuevoGalaxy %>% filter( -40 < galaxySentiment & 40 > galaxySentiment & galaxySentiment !=0)
Now I make a density plot of the sentiment to see its distribution.
ggplot(plotterIPHONE, aes(x = iphoneSentiment)) + geom_density()
ggplot(plotterGALAXY, aes(x = galaxySentiment)) + geom_density()
At this point we have preprocessed the instances of both datasets. Now we have to do the same for the attributes: select which attributes will be our predictors, and remove those that are highly correlated and therefore not necessary to create and evaluate a predictive model.
To check the correlations and subset the attributes, I first set aside the attribute to predict.
iphonefiltered<- select(plotterIPHONE, -iphoneSentiment)
galaxyfiltered<- select(plotterGALAXY, -galaxySentiment)
Now I analyze the correlation of attributes.
correIphone <- cor(iphonefiltered)
correGalaxy <- cor(galaxyfiltered)
I also look at the summary statistics of the new correlation matrices.
summary(correIphone[upper.tri(correIphone)])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.151500 0.001912 0.069250 0.142500 0.182600 0.998800
summary(correGalaxy[upper.tri(correGalaxy)])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.152900 -0.003486 0.049840 0.129000 0.184900 0.999700
Next I select a cut-off for highly correlated features. In this case the cut-off is a correlation of 0.80.
corrIphone2 <- findCorrelation(correIphone, cutoff = .80)
corrGalaxy2 <- findCorrelation(correGalaxy, cutoff = .80)
And now I build a matrix without the highly correlated attributes.
iPhoneRes <- iphonefiltered[,-corrIphone2]
GalaxyRes <- galaxyfiltered[, -corrGalaxy2]
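To make the idea of a correlation cut-off concrete, here is a small base R sketch on made-up data that lists the pairs above 0.80. It only illustrates the principle and is not how `findCorrelation()` chooses which column of a pair to drop. The names `toy`, `high_pairs` and `drop_idx` are hypothetical. It also shows a guard that is worth having in general: negative indexing with an empty index vector selects zero columns.

```r
# Toy data: y is an exact multiple of x, z is unrelated to both.
toy <- data.frame(
  x = 1:10,
  y = 2 * (1:10),
  z = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)

corr <- cor(toy)

# Keep only the upper triangle so each pair is reported once.
corr[lower.tri(corr, diag = TRUE)] <- NA
high <- which(abs(corr) > 0.80, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  r    = corr[high]
)
print(high_pairs)

# Guard for the case where nothing is flagged.
drop_idx <- integer(0)  # what an empty result would look like
kept <- if (length(drop_idx) > 0) toy[, -drop_idx, drop = FALSE] else toy
print(names(kept))
```

In this analysis both index vectors are non-empty, so the plain negative indexing works as written.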
Let’s see which attributes are going to be used to predict.
names(iPhoneRes)
## [1] "id" "iphone" "samsunggalaxy" "sonyxperia"
## [5] "nokialumina" "htcphone" "ios" "googleandroid"
## [9] "iphonecampos" "samsungcampos" "sonycampos" "iphonecamneg"
## [13] "iphonecamunc" "samsungcamunc" "sonycamunc" "nokiacamunc"
## [17] "htccamunc" "iphonedispos" "sonydispos" "iphonedisneg"
## [21] "samsungdisneg" "htcdisneg" "iphonedisunc" "samsungdisunc"
## [25] "iphoneperpos" "iphoneperneg" "samsungperneg" "htcperneg"
## [29] "iphoneperunc" "samsungperunc" "sonyperunc" "htcperunc"
## [33] "iosperpos" "googleperpos"
names(GalaxyRes)
## [1] "id" "iphone" "samsunggalaxy" "sonyxperia"
## [5] "nokialumina" "htcphone" "ios" "googleandroid"
## [9] "iphonecampos" "samsungcampos" "sonycampos" "htccampos"
## [13] "sonycamneg" "nokiacamneg" "iphonecamunc" "samsungcamunc"
## [17] "sonycamunc" "htccamunc" "iphonedispos" "nokiadispos"
## [21] "htcdispos" "samsungdisneg" "sonydisneg" "htcdisneg"
## [25] "iphonedisunc" "samsungdisunc" "samsungperpos" "sonyperpos"
## [29] "htcperpos" "iphoneperneg" "htcperneg" "iphoneperunc"
## [33] "samsungperunc" "htcperunc" "iosperpos" "googleperneg"
At this point I add back the sentiment attribute for each phone that we set aside before.
iPhoneRes<- mutate(iPhoneRes, iPhoneSentiment = plotterIPHONE$iphoneSentiment)
GalaxyRes <- mutate(GalaxyRes, GalaxySentiment = plotterGALAXY$galaxySentiment)
And plot correlation matrix.
corrplot(cor(iPhoneRes), order = "hclust")
corrplot(cor(GalaxyRes), order = "hclust")
Now we discretize again to see how the sentiment is distributed after filtering, group it into categories and label them.
LabIphone<- discretize(iPhoneRes$iPhoneSentiment, "fixed", categories = c(-Inf,-15,-5,5,15,Inf))
LabGalaxy<- discretize(GalaxyRes$GalaxySentiment, "fixed", categories = c(-Inf,-15,-5,5,15,Inf))
summary(LabIphone)
## [-Inf, -15) [ -15, -5) [ -5, 5) [ 5, 15) [ 15, Inf]
## 643 2037 5602 9278 4462
summary(LabGalaxy)
## [-Inf, -15) [ -15, -5) [ -5, 5) [ 5, 15) [ 15, Inf]
## 213 393 2429 1534 2045
With 5 buckets we get a good distribution, so we use them.
iPhoneRes$iPhoneSentiment<- LabIphone
GalaxyRes$GalaxySentiment<- LabGalaxy
And label them.
levels(iPhoneRes$iPhoneSentiment)<- c("very negative", "negative", "neutral", "good", "very good")
levels(GalaxyRes$GalaxySentiment)<- c("very negative", "negative", "neutral", "good", "very good")
summary(iPhoneRes$iPhoneSentiment)
## very negative negative neutral good very good
## 643 2037 5602 9278 4462
summary(GalaxyRes$GalaxySentiment)
## very negative negative neutral good very good
## 213 393 2429 1534 2045
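The same fixed-breakpoint bucketing and labelling can be sketched in base R with `cut()`. This is only an illustration on made-up values; the analysis itself uses `discretize()` from arules. The intervals are closed on the left, matching the bucket names printed above.

```r
# Made-up sentiment scores, including values that sit exactly on a breakpoint.
scores <- c(-20, -15, -5, 0, 5, 15, 39)

labels5 <- c("very negative", "negative", "neutral", "good", "very good")

# right = FALSE gives intervals of the form [a, b).
buckets <- cut(
  scores,
  breaks = c(-Inf, -15, -5, 5, 15, Inf),
  labels = labels5,
  right  = FALSE
)

print(data.frame(scores, buckets))
```

A score of exactly 5 lands in "good" and a score of exactly -5 in "neutral", because each lower bound belongs to its own bucket.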
At this point the preprocessing is finished, so now we begin creating the models.
To create the models I divide each dataset into a training set and a test set, so that each model is trained on one part and tested on instances that were not used to build it. This way I evaluate the accuracy and Kappa of each model.
IPHONE MODELS
I create the partition and check the size of each set.
set.seed(333)
trainIndex1<- createDataPartition(iPhoneRes$iPhoneSentiment, p= .7, list = FALSE)
Train1 <- iPhoneRes[ trainIndex1,]
Test1 <- iPhoneRes[-trainIndex1,]
nrow(Train1)
## [1] 15418
nrow(Test1)
## [1] 6604
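The partition above is stratified by the sentiment class, so each class keeps roughly the same share in both sets. The base R sketch below shows that idea on made-up labels. It is an illustration of stratified sampling under simple assumptions, not the exact procedure that `createDataPartition()` follows.

```r
set.seed(333)

# Made-up, unbalanced class labels.
labels <- factor(rep(c("a", "b", "c"), times = c(100, 50, 10)))

# Sample 70% of the row indices inside each class separately.
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

train <- labels[train_idx]
test  <- labels[-train_idx]

print(table(train))
print(table(test))
```

Sampling within each class keeps the rare classes represented in the test set, which matters here because "very negative" is a small class.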
We have a classification problem, so the models that I am going to use are C5.0, KNN and Random Forest.
__________________________________MODEL C5.0___________________________________________________________________
ctrl<-trainControl(method = "repeatedcv", repeats = 3)
C5Iphone<- train(iPhoneSentiment~., data= Train1, method= "C5.0", trControl = ctrl,preProc = c("center", "scale"))
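Now we use it on the test set to make the prediction and evaluate it with a confusion matrix. Note that the calls in this study pass the true labels first and the predictions second, so in the printed tables the rows labelled "Prediction" hold the true classes and the columns labelled "Reference" hold the predicted ones. Accuracy and Kappa do not depend on this order, but per-class statistics such as Sensitivity and Pos Pred Value swap roles.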
IPHONEPRED<-predict(C5Iphone, Test1)
confusionMatrix(Test1$iPhoneSentiment, IPHONEPRED)
## Confusion Matrix and Statistics
##
## Reference
## Prediction very negative negative neutral good very good
## very negative 166 13 4 3 6
## negative 14 542 37 9 9
## neutral 1 35 1534 79 31
## good 1 10 86 2613 73
## very good 1 1 18 24 1294
##
## Overall Statistics
##
## Accuracy : 0.9311
## 95% CI : (0.9247, 0.9371)
## No Information Rate : 0.4131
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9028
## Mcnemar's Test P-Value : 9.526e-06
##
## Statistics by Class:
##
## Class: very negative Class: negative Class: neutral
## Sensitivity 0.90710 0.90183 0.9136
## Specificity 0.99595 0.98851 0.9704
## Pos Pred Value 0.86458 0.88707 0.9131
## Neg Pred Value 0.99735 0.99016 0.9706
## Prevalence 0.02771 0.09101 0.2542
## Detection Rate 0.02514 0.08207 0.2323
## Detection Prevalence 0.02907 0.09252 0.2544
## Balanced Accuracy 0.95153 0.94517 0.9420
## Class: good Class: very good
## Sensitivity 0.9578 0.9158
## Specificity 0.9561 0.9915
## Pos Pred Value 0.9389 0.9671
## Neg Pred Value 0.9699 0.9774
## Prevalence 0.4131 0.2140
## Detection Rate 0.3957 0.1959
## Detection Prevalence 0.4214 0.2026
## Balanced Accuracy 0.9570 0.9537
We can see good results with this model.
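To show where the two headline numbers come from, the sketch below recomputes accuracy and Cohen's Kappa from the counts in the C5.0 confusion matrix printed above. The variable names are hypothetical; the counts are copied from the output.

```r
# Counts from the C5.0 confusion matrix (rows and columns in the order:
# very negative, negative, neutral, good, very good).
cm <- matrix(
  c(166,  13,    4,    3,    6,
     14, 542,   37,    9,    9,
      1,  35, 1534,   79,   31,
      1,  10,   86, 2613,   73,
      1,   1,   18,   24, 1294),
  nrow = 5, byrow = TRUE
)

n <- sum(cm)

# Observed agreement: share of cases on the diagonal (overall accuracy).
observed <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Kappa rescales accuracy so that 0 means chance-level agreement.
kappa_hat <- (observed - expected) / (1 - expected)

print(round(c(accuracy = observed, kappa = kappa_hat), 4))
```

Kappa is lower than accuracy because part of the agreement would be expected by chance alone, given how unevenly the classes are distributed.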
__________________________________________KNN___________________________________________________________________
ctrl<-trainControl(method = "repeatedcv", repeats = 3)
KNNIphone<- train(iPhoneSentiment~., data= Train1, method= "knn", trControl = ctrl,preProc = c("center", "scale"))
And now the prediction and Confusion matrix with KNN
PredKNNIphone<-predict(KNNIphone, Test1)
confusionMatrix(Test1$iPhoneSentiment, PredKNNIphone)
## Confusion Matrix and Statistics
##
## Reference
## Prediction very negative negative neutral good very good
## very negative 123 36 22 5 6
## negative 16 485 78 24 8
## neutral 3 26 1504 116 31
## good 3 5 189 2490 96
## very good 1 2 61 134 1140
##
## Overall Statistics
##
## Accuracy : 0.8695
## 95% CI : (0.8611, 0.8775)
## No Information Rate : 0.4193
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8146
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: very negative Class: negative Class: neutral
## Sensitivity 0.84247 0.87545 0.8112
## Specificity 0.98932 0.97917 0.9629
## Pos Pred Value 0.64062 0.79378 0.8952
## Neg Pred Value 0.99641 0.98849 0.9289
## Prevalence 0.02211 0.08389 0.2807
## Detection Rate 0.01863 0.07344 0.2277
## Detection Prevalence 0.02907 0.09252 0.2544
## Balanced Accuracy 0.91589 0.92731 0.8871
## Class: good Class: very good
## Sensitivity 0.8992 0.8899
## Specificity 0.9236 0.9628
## Pos Pred Value 0.8947 0.8520
## Neg Pred Value 0.9270 0.9732
## Prevalence 0.4193 0.1940
## Detection Rate 0.3770 0.1726
## Detection Prevalence 0.4214 0.2026
## Balanced Accuracy 0.9114 0.9264
Here we see it is a little worse than C5.0.
______________________________________________RANDOMFOREST______________________________________________________
Now I will try a Random Forest.
library(randomForest)
RFmodelIphone<-randomForest(iPhoneSentiment~., data= Train1)
RFpredIphone <- predict(RFmodelIphone, Test1)
And analyze it with the confusion matrix.
confusionMatrix(Test1$iPhoneSentiment, RFpredIphone)
## Confusion Matrix and Statistics
##
## Reference
## Prediction very negative negative neutral good very good
## very negative 115 29 33 3 12
## negative 6 464 113 16 12
## neutral 0 3 1613 34 30
## good 0 1 199 2505 78
## very good 0 2 66 102 1168
##
## Overall Statistics
##
## Accuracy : 0.8881
## 95% CI : (0.8802, 0.8956)
## No Information Rate : 0.4028
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8412
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: very negative Class: negative Class: neutral
## Sensitivity 0.95041 0.92986 0.7969
## Specificity 0.98812 0.97592 0.9854
## Pos Pred Value 0.59896 0.75941 0.9601
## Neg Pred Value 0.99906 0.99416 0.9165
## Prevalence 0.01832 0.07556 0.3065
## Detection Rate 0.01741 0.07026 0.2442
## Detection Prevalence 0.02907 0.09252 0.2544
## Balanced Accuracy 0.96927 0.95289 0.8912
## Class: good Class: very good
## Sensitivity 0.9417 0.8985
## Specificity 0.9295 0.9679
## Pos Pred Value 0.9001 0.8729
## Neg Pred Value 0.9594 0.9749
## Prevalence 0.4028 0.1969
## Detection Rate 0.3793 0.1769
## Detection Prevalence 0.4214 0.2026
## Balanced Accuracy 0.9356 0.9332
RESULTS OF IPHONE
The best prediction comes from C5.0, with Accuracy 0.9311 and Kappa 0.9028. We also have to consider the preprocessing. First, I deleted the instances with all values at 0. Second, I deleted the instances with a sentiment value of 40 or more, or of -40 or less, which removes the outliers. Third, I deleted the instances with 0 sentiment. Since the goal is to predict sentiment, we have to check its density and how it is distributed to get a good predictive model. I also deleted the highly correlated attributes.
SAMSUNG GALAXY ANALYSIS
First, again, create the training and test datasets.
set.seed(123)
trainIndex2<- createDataPartition(GalaxyRes$GalaxySentiment, p= .7, list = FALSE)
Train2 <- GalaxyRes[ trainIndex2,]
Test2 <- GalaxyRes[-trainIndex2,]
nrow(Train2)
## [1] 4633
nrow(Test2)
## [1] 1981
___________________________C5.0_______________________________________________________________________________
ctrl<-trainControl(method = "repeatedcv", repeats = 3)
C5Galaxy<- train(GalaxySentiment~., data= Train2, method= "C5.0", trControl = ctrl,preProc = c("center", "scale"))
Make the prediction and confusion matrix
C5Galaxypred<- predict(C5Galaxy, Test2)
confusionMatrix(Test2$GalaxySentiment, C5Galaxypred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction very negative negative neutral good very good
## very negative 36 16 5 1 5
## negative 8 62 35 8 4
## neutral 2 9 672 31 14
## good 3 5 64 339 49
## very good 5 4 9 47 548
##
## Overall Statistics
##
## Accuracy : 0.8364
## 95% CI : (0.8194, 0.8525)
## No Information Rate : 0.3963
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7676
## Mcnemar's Test P-Value : 0.0002158
##
## Statistics by Class:
##
## Class: very negative Class: negative Class: neutral
## Sensitivity 0.66667 0.64583 0.8561
## Specificity 0.98599 0.97082 0.9532
## Pos Pred Value 0.57143 0.52991 0.9231
## Neg Pred Value 0.99062 0.98176 0.9098
## Prevalence 0.02726 0.04846 0.3963
## Detection Rate 0.01817 0.03130 0.3392
## Detection Prevalence 0.03180 0.05906 0.3675
## Balanced Accuracy 0.82633 0.80833 0.9046
## Class: good Class: very good
## Sensitivity 0.7958 0.8839
## Specificity 0.9222 0.9522
## Pos Pred Value 0.7370 0.8940
## Neg Pred Value 0.9428 0.9474
## Prevalence 0.2150 0.3130
## Detection Rate 0.1711 0.2766
## Detection Prevalence 0.2322 0.3094
## Balanced Accuracy 0.8590 0.9181
__________________________________KNN_________________________________________________________________________
ctrl<-trainControl(method = "repeatedcv")
KNNGalaxy<- train(GalaxySentiment~., data= Train2, method= "knn", trControl = ctrl,preProc = c("center", "scale"))
PredKNNGalaxy<-predict(KNNGalaxy, Test2)
confusionMatrix(Test2$GalaxySentiment, PredKNNGalaxy)
## Confusion Matrix and Statistics
##
## Reference
## Prediction very negative negative neutral good very good
## very negative 28 17 5 5 8
## negative 9 59 32 14 3
## neutral 2 16 626 51 33
## good 3 5 77 329 46
## very good 7 4 22 100 480
##
## Overall Statistics
##
## Accuracy : 0.7683
## 95% CI : (0.7491, 0.7867)
## No Information Rate : 0.3847
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6724
## Mcnemar's Test P-Value : 9.174e-06
##
## Statistics by Class:
##
## Class: very negative Class: negative Class: neutral
## Sensitivity 0.57143 0.58416 0.8215
## Specificity 0.98188 0.96915 0.9163
## Pos Pred Value 0.44444 0.50427 0.8599
## Neg Pred Value 0.98905 0.97747 0.8915
## Prevalence 0.02473 0.05098 0.3847
## Detection Rate 0.01413 0.02978 0.3160
## Detection Prevalence 0.03180 0.05906 0.3675
## Balanced Accuracy 0.77666 0.77665 0.8689
## Class: good Class: very good
## Sensitivity 0.6593 0.8421
## Specificity 0.9116 0.9057
## Pos Pred Value 0.7152 0.7830
## Neg Pred Value 0.8882 0.9342
## Prevalence 0.2519 0.2877
## Detection Rate 0.1661 0.2423
## Detection Prevalence 0.2322 0.3094
## Balanced Accuracy 0.7855 0.8739
__________________________________________RANDOMFOREST_________________________________________________________
RFmodelGalaxy<-randomForest(GalaxySentiment~., data= Train2)
RFpredGalaxy <- predict(RFmodelGalaxy, Test2)
confusionMatrix(Test2$GalaxySentiment, RFpredGalaxy)
## Confusion Matrix and Statistics
##
## Reference
## Prediction very negative negative neutral good very good
## very negative 36 11 9 2 5
## negative 11 51 41 12 2
## neutral 0 0 673 39 16
## good 1 2 80 333 44
## very good 2 2 24 56 529
##
## Overall Statistics
##
## Accuracy : 0.8188
## 95% CI : (0.8011, 0.8355)
## No Information Rate : 0.4175
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7407
## Mcnemar's Test P-Value : 3.138e-12
##
## Statistics by Class:
##
## Class: very negative Class: negative Class: neutral
## Sensitivity 0.72000 0.77273 0.8138
## Specificity 0.98602 0.96554 0.9523
## Pos Pred Value 0.57143 0.43590 0.9245
## Neg Pred Value 0.99270 0.99195 0.8771
## Prevalence 0.02524 0.03332 0.4175
## Detection Rate 0.01817 0.02574 0.3397
## Detection Prevalence 0.03180 0.05906 0.3675
## Balanced Accuracy 0.85301 0.86913 0.8831
## Class: good Class: very good
## Sensitivity 0.7534 0.8876
## Specificity 0.9175 0.9394
## Pos Pred Value 0.7239 0.8630
## Neg Pred Value 0.9283 0.9510
## Prevalence 0.2231 0.3009
## Detection Rate 0.1681 0.2670
## Detection Prevalence 0.2322 0.3094
## Balanced Accuracy 0.8354 0.9135
RESULTS FOR GALAXY
As with the iPhone, the best algorithm is C5.0, with Accuracy 0.8364 and Kappa 0.7676. In this case there are fewer instances to create and evaluate the model than for the iPhone.
The iPhone is a device with more lovers and haters, as the sentiment distribution shows, whereas more people have no opinion about the Galaxy.
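This difference can be quantified with the bucket counts reported earlier by `summary(dIphone)` and `summary(dGalaxy)`. The short sketch below computes the share of pages in the neutral `[-1, 1)` bucket for each phone. The variable names are hypothetical; the counts are copied from that output.

```r
# Counts copied from the discretized summaries earlier in the study.
iphone_neutral <- 67786   # pages in [-1, 1) for iPhone
galaxy_neutral <- 84617   # pages in [-1, 1) for Galaxy
n_rows <- 92279           # sum of all buckets in each summary

share_iphone <- iphone_neutral / n_rows
share_galaxy <- galaxy_neutral / n_rows

print(round(c(iphone = share_iphone, galaxy = share_galaxy), 3))
```

Roughly 73% of the pages are neutral about the iPhone against roughly 92% for the Galaxy, which is consistent with the smaller Galaxy training set.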
I have also tried to see which attributes carry more weight in the sentiment analysis. For this I used WEKA with the CFS algorithm.
I first apply the function to the iPhoneLargeMatrix dataset, to predict iPhone sentiment. The result is:
Selected attributes: 10,14,25,27,40,55 : 6 iphonecampos htccampos iphonedispos sonydispos iphoneperpos iosperpos
And now I apply it to GalaxyLargeMatrix dataset, to predict Galaxy sentiment:
Selected attributes: 11,12,26,41,56 : 5 samsungcampos sonycampos samsungdispos samsungperpos googleperpos