Helio (a fictional app developing company) is working with a government health agency to create a suite of smart phone medical apps for use by aid workers in developing countries. This suite of apps will enable the aid workers to manage local health conditions by facilitating communication with medical professionals located elsewhere (one of the apps, for example, enables specialists in communicable diseases to diagnose conditions by examining images and other patient data uploaded by local aid workers). The government agency requires that the app suite be bundled with one model of smart phone. Helio is in the process of evaluating potential handset models to determine which one to bundle their software with. After completing an initial investigation, Helio has created a short list of five devices that are all capable of executing the app suite’s functions. To help Helio narrow their list down to one device, they have asked us to examine the prevalence of positive and negative attitudes toward these devices on the web. The goal of this project is to provide our client with a report that contains an analysis of sentiment toward the target devices, as well as a description of the methods and processes we used to arrive at our conclusions.
We will gauge sentiment across the web toward two devices: the iPhone and the Samsung Galaxy. This involves using Amazon Web Services (EC2, EMR and S3) to build a data matrix (over 20,000 instances) from the Common Crawl (https://commoncrawl.org/big-picture/what-we-do/). This report covers only the analysis of that dataset, not how it was collected in AWS.
Below we import the libraries needed for this project and register additional processor cores to handle the large datasets. We have two datasets, iphone_df and galaxy_df, which have the columns iphonesentiment and galaxysentiment respectively. These columns hold a sentiment rating toward each phone, assigned manually by a team who read each instance:
0: very negative
1: negative
2: somewhat negative
3: somewhat positive
4: positive
5: very positive
#Additional processor cores
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(corrplot)
## corrplot 0.92 loaded
library(caret)
## Loading required package: lattice
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
detectCores() # 8 cores available
## [1] 8
cl<- makeCluster(4)
#Register Cluster
registerDoParallel(cl)
#Confirm how many workers are registered with R
getDoParWorkers() # result is 4
## [1] 4
iphone_df<- read.csv("C:\\Users\\domsi\\OneDrive\\Documents\\M5\\iphone_smallmatrix_labeled_8d.csv")
galaxy_df<- read.csv("C:\\Users\\domsi\\OneDrive\\Documents\\M5\\galaxy_smallmatrix_labeled_9d.csv")
#No missing data
sum(is.na(iphone_df))
## [1] 0
sum(is.na(galaxy_df))
## [1] 0
Taking a look at our data sets, we can see a large number of variables. Each variable is the same in both datasets, with the only difference being the last column, the ‘sentiment’ towards the phone.
We have the number of times each of the phone types below was counted in each document:
iphone, samsunggalaxy, sonyxperia, nokialumina, htcphone
For each of these phones, we have counted the number of positive, negative or unclear words or expressions near each of the attributes below (or their synonyms):
Operating system (iOS or Google Android only; positive, negative or unclear)
Camera (positive, negative or unclear)
Display (positive, negative or unclear)
Performance (positive, negative or unclear)
options(max.print=1000000)
head(iphone_df)
## iphone samsunggalaxy sonyxperia nokialumina htcphone ios googleandroid
## 1 1 0 0 0 0 0 0
## 2 1 0 0 0 0 0 0
## 3 1 0 0 0 0 0 0
## 4 1 0 0 0 0 0 0
## 5 1 0 0 0 0 0 0
## 6 41 0 0 0 0 6 0
## iphonecampos samsungcampos sonycampos nokiacampos htccampos iphonecamneg
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 1 0 0 0 0 3
## samsungcamneg sonycamneg nokiacamneg htccamneg iphonecamunc samsungcamunc
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 7 0
## sonycamunc nokiacamunc htccamunc iphonedispos samsungdispos sonydispos
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 1 0 0
## nokiadispos htcdispos iphonedisneg samsungdisneg sonydisneg nokiadisneg
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 3 0 0 0
## htcdisneg iphonedisunc samsungdisunc sonydisunc nokiadisunc htcdisunc
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 4 0 0 0 0
## iphoneperpos samsungperpos sonyperpos nokiaperpos htcperpos iphoneperneg
## 1 0 0 0 0 0 0
## 2 1 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 1 0 0 0 0 0
## 5 1 0 0 0 0 0
## 6 0 0 0 0 0 0
## samsungperneg sonyperneg nokiaperneg htcperneg iphoneperunc samsungperunc
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 1 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## sonyperunc nokiaperunc htcperunc iosperpos googleperpos iosperneg
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## googleperneg iosperunc googleperunc iphonesentiment
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 4
head(galaxy_df)
## iphone samsunggalaxy sonyxperia nokialumina htcphone ios googleandroid
## 1 1 0 0 0 0 0 0
## 2 1 0 0 0 0 0 0
## 3 1 1 0 0 0 0 0
## 4 0 0 0 0 1 0 0
## 5 1 0 0 0 0 0 0
## 6 2 0 0 0 0 0 0
## iphonecampos samsungcampos sonycampos nokiacampos htccampos iphonecamneg
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 1 1 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 1 0 0 0 0 0
## samsungcamneg sonycamneg nokiacamneg htccamneg iphonecamunc samsungcamunc
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## sonycamunc nokiacamunc htccamunc iphonedispos samsungdispos sonydispos
## 1 0 0 0 0 0 0
## 2 0 0 0 1 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## nokiadispos htcdispos iphonedisneg samsungdisneg sonydisneg nokiadisneg
## 1 0 0 0 0 0 0
## 2 0 0 1 0 0 0
## 3 0 0 0 0 0 0
## 4 0 1 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## htcdisneg iphonedisunc samsungdisunc sonydisunc nokiadisunc htcdisunc
## 1 0 0 0 0 0 0
## 2 0 1 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 1
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## iphoneperpos samsungperpos sonyperpos nokiaperpos htcperpos iphoneperneg
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 1 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## samsungperneg sonyperneg nokiaperneg htcperneg iphoneperunc samsungperunc
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 1 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## sonyperunc nokiaperunc htcperunc iosperpos googleperpos iosperneg
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 1 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## googleperneg iosperunc googleperunc galaxysentiment
## 1 0 0 0 5
## 2 0 0 0 3
## 3 0 0 0 3
## 4 0 0 0 0
## 5 0 0 0 1
## 6 0 0 0 0
Below we can see very similar distributions of sentiment toward the two phones. The Samsung Galaxy has a slightly higher share of ‘very positive’ ratings than the iPhone, with the difference made up in the ‘very negative’ ratings.
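As a minimal sketch of how such a comparison can be computed, the block below tabulates the share of each sentiment class for the two phones. It uses only the first six ratings printed by head() above as toy input; the helper name class_share is an assumption introduced for illustration.

```r
# Toy input: the first six ratings shown by head() for each dataset.
# The full columns are iphone_df$iphonesentiment and galaxy_df$galaxysentiment.
iphone_ratings <- c(0, 0, 0, 0, 0, 4)
galaxy_ratings <- c(5, 3, 3, 0, 1, 0)

# Share of instances in each sentiment class. All six levels are kept,
# even when a class is absent from the sample.
class_share <- function(ratings) {
  prop.table(table(factor(ratings, levels = 0:5)))
}

shares <- rbind(iphone = class_share(iphone_ratings),
                galaxy = class_share(galaxy_ratings))
round(shares, 2)
```

Passing the resulting matrix to barplot(shares, beside = TRUE) would give a side-by-side view of the two distributions.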
Below we have the two correlation plots for the iPhone and the Samsung Galaxy. We want to focus on the top line here, sentiment, to see whether any features are strongly correlated with it, negatively or positively.
Both graphs look the same, which is expected, and sentiment toward the iPhone and the Samsung Galaxy also shows similar correlations. This is not too surprising given the similarities in the histograms above. Nothing on either graph shows a strong positive correlation with sentiment. Interestingly, the count of ‘samsung’ followed by ‘galaxy’ has a noticeable negative correlation with both ‘galaxysentiment’ and ‘iphonesentiment’.
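The "top line" of a correlation plot can also be read off numerically. The sketch below does this on a hypothetical toy data frame (the values are made up for illustration). It assumes the sentiment column is still numeric, since cor() does not accept factors.

```r
# Hypothetical toy frame: two count features and a numeric rating.
toy <- data.frame(
  iphone        = c(1, 1, 41, 2, 1, 3),
  samsunggalaxy = c(0, 1, 0, 2, 0, 3),
  sentiment     = c(5, 3, 4, 1, 5, 0)
)

# Pearson correlation matrix of all columns.
cor_matrix <- cor(toy)

# Keep only the sentiment row, without its correlation with itself.
sentiment_cor <- cor_matrix["sentiment", colnames(cor_matrix) != "sentiment"]

# Order features from most negative to most positive correlation.
sort(sentiment_cor)
```

The same matrix can be handed to corrplot::corrplot(cor_matrix) to draw the plot itself.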
Let's look at how our variables are set up. Every variable is a count of a word's appearances on a website, other than the sentiment column, which is categorical (definitions above), so we will convert that column to a factor.
str(iphone_df)
## 'data.frame': 12973 obs. of 59 variables:
## $ iphone : int 1 1 1 1 1 41 1 1 1 1 ...
## $ samsunggalaxy : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sonyxperia : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nokialumina : int 0 0 0 0 0 0 0 0 0 0 ...
## $ htcphone : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ios : int 0 0 0 0 0 6 0 0 0 0 ...
## $ googleandroid : int 0 0 0 0 0 0 0 0 0 0 ...
## $ iphonecampos : int 0 0 0 0 0 1 1 0 0 0 ...
## $ samsungcampos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sonycampos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nokiacampos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ htccampos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ iphonecamneg : int 0 0 0 0 0 3 1 0 0 0 ...
## $ samsungcamneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sonycamneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nokiacamneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ htccamneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ iphonecamunc : int 0 0 0 0 0 7 1 0 0 0 ...
## $ samsungcamunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sonycamunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nokiacamunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ htccamunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ iphonedispos : int 0 0 0 0 0 1 13 0 0 0 ...
## $ samsungdispos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sonydispos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nokiadispos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ htcdispos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ iphonedisneg : int 0 0 0 0 0 3 10 0 0 0 ...
## $ samsungdisneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sonydisneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nokiadisneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ htcdisneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ iphonedisunc : int 0 0 0 0 0 4 9 0 0 0 ...
## $ samsungdisunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sonydisunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nokiadisunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ htcdisunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ iphoneperpos : int 0 1 0 1 1 0 5 3 0 0 ...
## $ samsungperpos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sonyperpos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nokiaperpos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ htcperpos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ iphoneperneg : int 0 0 0 0 0 0 4 1 0 0 ...
## $ samsungperneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sonyperneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nokiaperneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ htcperneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ iphoneperunc : int 0 0 0 1 0 0 5 0 0 0 ...
## $ samsungperunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sonyperunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nokiaperunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ htcperunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ iosperpos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ googleperpos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ iosperneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ googleperneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ iosperunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ googleperunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ iphonesentiment: int 0 0 0 0 0 4 4 0 0 0 ...
str(galaxy_df)
## 'data.frame': 12911 obs. of 59 variables:
## $ iphone : int 1 1 1 0 1 2 1 1 4 1 ...
## $ samsunggalaxy : int 0 0 1 0 0 0 0 0 0 0 ...
## $ sonyxperia : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nokialumina : int 0 0 0 0 0 0 0 0 0 0 ...
## $ htcphone : int 0 0 0 1 0 0 0 0 0 0 ...
## $ ios : int 0 0 0 0 0 0 0 0 0 0 ...
## $ googleandroid : int 0 0 0 0 0 0 0 0 0 0 ...
## $ iphonecampos : int 0 0 1 0 0 1 0 0 0 0 ...
## $ samsungcampos : int 0 0 1 0 0 0 0 0 0 0 ...
## $ sonycampos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nokiacampos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ htccampos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ iphonecamneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ samsungcamneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sonycamneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nokiacamneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ htccamneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ iphonecamunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ samsungcamunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sonycamunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nokiacamunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ htccamunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ iphonedispos : int 0 1 0 0 0 0 2 0 0 0 ...
## $ samsungdispos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sonydispos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nokiadispos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ htcdispos : int 0 0 0 1 0 0 0 0 0 0 ...
## $ iphonedisneg : int 0 1 0 0 0 0 0 0 0 0 ...
## $ samsungdisneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sonydisneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nokiadisneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ htcdisneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ iphonedisunc : int 0 1 0 0 0 0 0 0 0 0 ...
## $ samsungdisunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sonydisunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nokiadisunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ htcdisunc : int 0 0 0 1 0 0 0 0 0 0 ...
## $ iphoneperpos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ samsungperpos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sonyperpos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nokiaperpos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ htcperpos : int 0 0 0 1 0 0 0 0 0 0 ...
## $ iphoneperneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ samsungperneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sonyperneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nokiaperneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ htcperneg : int 0 0 0 1 0 0 0 0 0 0 ...
## $ iphoneperunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ samsungperunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ sonyperunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nokiaperunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ htcperunc : int 0 0 0 1 0 0 0 0 0 0 ...
## $ iosperpos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ googleperpos : int 0 0 0 0 0 0 0 0 0 0 ...
## $ iosperneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ googleperneg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ iosperunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ googleperunc : int 0 0 0 0 0 0 0 0 0 0 ...
## $ galaxysentiment: int 5 3 3 0 1 0 3 5 5 5 ...
iphone_df$iphonesentiment<-as.factor(iphone_df$iphonesentiment)
galaxy_df$galaxysentiment<-as.factor(galaxy_df$galaxysentiment)
Below, we train four different classification models: Random Forest, C5.0, Support Vector Machine and K Nearest Neighbours. We train each of these models on both our iPhone and Galaxy data. For these models, we use 5-fold cross-validation repeated once, and use all features. This gives an indication of which model works best for our data, which we can then fine-tune to try to improve accuracy later. We show the code for training the iPhone Random Forest, and then hide all other code, showing only the results, which are compared later on.
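The idea behind 5-fold cross-validation can be sketched in base R as follows. In the report caret performs the splitting internally through trainControl(). This plain, unstratified version with a hypothetical row count only illustrates the mechanics.

```r
set.seed(123)
n <- 20   # hypothetical number of training rows
k <- 5    # number of folds

# Randomly assign each row to one of k folds of equal size.
fold_id <- sample(rep(seq_len(k), length.out = n))

# Each fold is held out once while the other k - 1 folds are used for fitting.
holdout_sizes <- integer(k)
for (fold in seq_len(k)) {
  holdout_rows <- which(fold_id == fold)
  train_rows   <- which(fold_id != fold)
  # A model would be fitted on train_rows and scored on holdout_rows here;
  # the k scores are then averaged.
  holdout_sizes[fold] <- length(holdout_rows)
}
holdout_sizes
```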
set.seed(123)
IndexTrain<- createDataPartition(y=iphone_df$iphonesentiment,
p=0.70,
list=FALSE)
training_set_iphone<-iphone_df[IndexTrain,]
testing_set_iphone<-iphone_df[-IndexTrain,]
# Define the control parameters for our model
responses_ctrl_iphone<- trainControl(method='repeatedcv', number=5,repeats=1)
Random Forest uses an ensemble (forest) of decision trees to predict the outcome of a classification problem. The class predicted by the majority of the trees is the output of the Random Forest.
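The majority vote can be illustrated with a small sketch. The matrix of per-tree predictions below is hypothetical, and the helper majority_vote is introduced only for this example. The randomForest package does this aggregation internally.

```r
# Hypothetical predictions from five trees (columns) for three websites (rows).
tree_votes <- matrix(
  c(5, 5, 0, 5, 4,
    0, 0, 0, 5, 0,
    3, 4, 4, 4, 5),
  nrow = 3, byrow = TRUE
)

# The forest's prediction is the most frequent class among the trees.
majority_vote <- function(votes) {
  counts <- table(votes)
  as.integer(names(counts)[which.max(counts)])
}

forest_prediction <- apply(tree_votes, 1, majority_vote)
forest_prediction
```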
#Fit the model
RandomForestFit_iphone<-train(iphonesentiment~.,data = training_set_iphone,method='rf', trControl=responses_ctrl_iphone)
#Results:
RandomForestResults_iphone<- RandomForestFit_iphone$results
RandomForestResults_iphone
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.7007591 0.3722918 0.003156604 0.01003563
## 2 30 0.7727620 0.5622495 0.006183294 0.01484825
## 3 58 0.7637347 0.5491715 0.005424906 0.01320563
#predictions
PredsRF_iphone<- predict(RandomForestFit_iphone, newdata=testing_set_iphone)
#Confusion matrix
RFConfMat_iphone<- confusionMatrix(data=PredsRF_iphone,testing_set_iphone$iphonesentiment)
#accuracy
RF_iphone_accuracy<- postResample(PredsRF_iphone,testing_set_iphone$iphonesentiment)
C5.0 builds either a decision tree or a rule-based model. It works by repeatedly splitting the data at the point where the most information is gained.
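To make "most information gained" concrete, the sketch below computes entropy-based information gain for two candidate splits on hypothetical toy data. It is a simplified illustration. The splitting criterion actually used by C5.0 includes refinements that are not shown here.

```r
# Shannon entropy (in bits) of a vector of class labels.
entropy <- function(labels) {
  p <- prop.table(table(labels))
  -sum(p * log2(p))
}

# Information gain from splitting `labels` by a logical condition.
info_gain <- function(labels, condition) {
  n <- length(labels)
  left  <- labels[condition]
  right <- labels[!condition]
  entropy(labels) -
    (length(left) / n) * entropy(left) -
    (length(right) / n) * entropy(right)
}

# Hypothetical toy data: count of the word "iphone" and a sentiment class.
iphone_count <- c(0, 0, 0, 0, 2, 3, 1, 4)
sentiment    <- c(0, 0, 0, 0, 5, 5, 5, 5)

gain_clean <- info_gain(sentiment, iphone_count <= 0)  # separates the classes
gain_mixed <- info_gain(sentiment, iphone_count <= 2)  # leaves one side mixed
c(clean = gain_clean, mixed = gain_mixed)
```

A tree learner would prefer the first split because it has the larger gain. The table that follows is the cross-validated output of the C5.0 fit on the iPhone data, not of this sketch.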
## model winnow trials Accuracy Kappa AccuracySD KappaSD
## 7 rules FALSE 1 0.7703402 0.5543348 0.004215102 0.008725390
## 10 rules TRUE 1 0.7697890 0.5537985 0.004330741 0.009172138
## 1 tree FALSE 1 0.7684691 0.5512736 0.002969988 0.005521612
## 4 tree TRUE 1 0.7684681 0.5518640 0.004332674 0.009162804
## 8 rules FALSE 10 0.7602120 0.5366753 0.003850594 0.009972375
## 11 rules TRUE 10 0.7582304 0.5366311 0.004773835 0.008185420
## 2 tree FALSE 10 0.7624127 0.5434079 0.003080477 0.007475143
## 5 tree TRUE 10 0.7606525 0.5409539 0.002789537 0.004851069
## 9 rules FALSE 20 0.7602120 0.5366753 0.003850594 0.009972375
## 12 rules TRUE 20 0.7582304 0.5366311 0.004773835 0.008185420
## 3 tree FALSE 20 0.7624127 0.5434079 0.003080477 0.007475143
## 6 tree TRUE 20 0.7606525 0.5409539 0.002789537 0.004851069
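SVM works by finding a hyperplane that separates the data into groups: for data with n features, the hyperplane has n-1 dimensions. For example, in a 2-D data set, SVM would find the line that partitions the data into groups most effectively, that is, with the widest margin between the groups.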
## cost Accuracy Kappa AccuracySD KappaSD
## 1 0.25 0.6928355 0.3693173 0.012570454 0.02987336
## 2 0.50 0.7069225 0.4055675 0.009558239 0.02073187
## 3 1.00 0.7089031 0.4124570 0.008390388 0.01815390
K Nearest Neighbours classifies a data point using the k points closest to it, on the assumption that its nearest points will have similar characteristics to itself.
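A minimal sketch of the idea, using Euclidean distance and an unweighted vote on hypothetical toy data, is shown below. The tuning columns in the results table (kmax, distance, kernel) suggest the report used a weighted variant, so this is an illustration of the principle rather than the fitted model.

```r
# Predict the class of `query` from the k nearest rows of `features`.
knn_predict <- function(features, labels, query, k = 3) {
  # Euclidean distance from the query to every training row.
  distances <- sqrt(rowSums(sweep(features, 2, query)^2))
  nearest <- order(distances)[seq_len(k)]
  # Unweighted vote among the k nearest neighbours.
  votes <- table(labels[nearest])
  names(votes)[which.max(votes)]
}

# Hypothetical toy data: two count features and a sentiment class.
features <- matrix(c(0, 0,
                     1, 0,
                     0, 1,
                     5, 4,
                     4, 5,
                     5, 5), ncol = 2, byrow = TRUE)
labels <- c("0", "0", "0", "5", "5", "5")

near_high <- knn_predict(features, labels, query = c(4, 4), k = 3)
near_low  <- knn_predict(features, labels, query = c(0, 0), k = 3)
c(near_high, near_low)
```

The table that follows is the cross-validated output of the KNN fit on the iPhone data, not of this sketch.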
## kmax distance kernel Accuracy Kappa AccuracySD KappaSD
## 1 5 2 optimal 0.3080481 0.1530814 0.010067531 0.010854351
## 2 7 2 optimal 0.3203788 0.1564489 0.009636541 0.010723476
## 3 9 2 optimal 0.3283052 0.1609767 0.006996242 0.008372339
set.seed(123)
IndexTrain_galaxy<- createDataPartition(y=galaxy_df$galaxysentiment,
p=0.70,
list=FALSE)
training_set_galaxy<-galaxy_df[IndexTrain_galaxy,]
testing_set_galaxy<-galaxy_df[-IndexTrain_galaxy,]
training_set_galaxy$galaxysentiment<- as.factor(training_set_galaxy$galaxysentiment)
testing_set_galaxy$galaxysentiment<- as.factor(testing_set_galaxy$galaxysentiment)
#responses_ctrl stays the same
responses_ctrl_galaxy<- trainControl(method='repeatedcv', number=5,repeats=1)
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.7054209 0.3568951 0.004052458 0.01169008
## 2 30 0.7611691 0.5248937 0.006761353 0.01478376
## 3 58 0.7523198 0.5127047 0.006921926 0.01539424
## model winnow trials Accuracy Kappa AccuracySD KappaSD
## 7 rules FALSE 1 0.7631656 0.5257385 0.009737038 0.02183258
## 10 rules TRUE 1 0.7648244 0.5291823 0.007977222 0.01847665
## 1 tree FALSE 1 0.7633872 0.5263596 0.009892377 0.02218260
## 4 tree TRUE 1 0.7628333 0.5253238 0.007895657 0.01852420
## 8 rules FALSE 10 0.7535426 0.5059946 0.010567289 0.02336173
## 11 rules TRUE 10 0.7502237 0.4971887 0.011906873 0.02942906
## 2 tree FALSE 10 0.7556443 0.5130673 0.010086479 0.02426922
## 5 tree TRUE 10 0.7508851 0.5003618 0.010474820 0.02631869
## 9 rules FALSE 20 0.7535426 0.5059946 0.010567289 0.02336173
## 12 rules TRUE 20 0.7502237 0.4971887 0.011906873 0.02942906
## 3 tree FALSE 20 0.7556443 0.5130673 0.010086479 0.02426922
## 6 tree TRUE 20 0.7508851 0.5003618 0.010474820 0.02631869
## cost Accuracy Kappa AccuracySD KappaSD
## 1 0.25 0.7012168 0.3626839 0.010386018 0.02472235
## 2 0.50 0.7044235 0.3739690 0.008082921 0.01692360
## 3 1.00 0.7054203 0.3821431 0.006988075 0.01542300
## kmax distance kernel Accuracy Kappa AccuracySD KappaSD
## 1 5 2 optimal 0.6659261 0.4140002 0.03182499 0.03844816
## 2 7 2 optimal 0.7042064 0.4525665 0.01743256 0.02190136
## 3 9 2 optimal 0.7287630 0.4839317 0.03115862 0.04170870
We can see in the graphs below that the C5.0 and Random Forest algorithms have similar accuracy and Kappa scores.
In the above graphs, accuracy is the ratio of the number of correct predictions to the total number of predictions. For example, if the model predicts 5 (very positive) for a website whose assigned sentiment rating is 5, that is a correct prediction. We can analyse this further with a confusion matrix.
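Both metrics can be computed by hand from a confusion matrix. The sketch below uses hypothetical toy predictions. Kappa is calculated as Cohen's Kappa, which corrects accuracy for the agreement expected by chance.

```r
# Hypothetical toy predictions and reference ratings.
actual    <- factor(c(5, 5, 5, 0, 0, 3, 4, 5), levels = 0:5)
predicted <- factor(c(5, 5, 0, 0, 5, 3, 5, 5), levels = 0:5)

conf <- table(Prediction = predicted, Reference = actual)
n <- sum(conf)

# Accuracy: share of predictions that match the reference.
accuracy <- sum(diag(conf)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(conf) * colSums(conf)) / n^2

# Cohen's Kappa: accuracy corrected for chance agreement.
cohen_kappa <- (accuracy - expected) / (1 - expected)

c(accuracy = accuracy, kappa = cohen_kappa)
```

Kappa is lower than accuracy whenever a model could score well simply by favouring the most common class, which is why both are reported.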
We will compare a poor predictive model (KNN for iPhone) with a better one (Random Forest for iPhone).
We can see the K Nearest Neighbour model makes a huge number of errors where it predicts ‘0’ but the correct value is ‘5’ (1,731 cases). This is a severe error, giving a rating of ‘very negative’ instead of ‘very positive’. There could be many reasons for this. KNN is particularly susceptible to irrelevant features, which could be a factor given the size of our dataset.
KNNConfMat_iphone
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5
## 0 532 99 85 94 222 1731
## 1 6 4 2 3 5 49
## 2 1 1 18 1 4 27
## 3 5 1 5 233 7 27
## 4 6 4 2 5 134 33
## 5 38 8 24 20 59 395
##
## Overall Statistics
##
## Accuracy : 0.3383
## 95% CI : (0.3234, 0.3534)
## No Information Rate : 0.5815
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1714
##
## Mcnemar's Test P-Value : <2e-16
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.9048 0.034188 0.132353 0.65449 0.31090 0.1746
## Specificity 0.3243 0.982772 0.990943 0.98727 0.98554 0.9085
## Pos Pred Value 0.1925 0.057971 0.346154 0.83813 0.72826 0.7261
## Neg Pred Value 0.9503 0.970427 0.969255 0.96595 0.91986 0.4420
## Prevalence 0.1512 0.030077 0.034961 0.09152 0.11080 0.5815
## Detection Rate 0.1368 0.001028 0.004627 0.05990 0.03445 0.1015
## Detection Prevalence 0.7103 0.017738 0.013368 0.07147 0.04730 0.1398
## Balanced Accuracy 0.6146 0.508480 0.561648 0.82088 0.64822 0.5416
We can now look at our Random Forest confusion matrix, which has far fewer errors. It is noticeable that the RF model predicts ‘5’ far more often than any other class; 5 is the majority class in the dataset (as we saw in the histograms). This shows in the 72.92% positive predictive value for Class: 5. This measure (also called precision) is the number of True Positives (predicting a 5 when the actual value is 5) divided by the sum of True Positives and False Positives (predicting a 5 when the actual value is another class).
RFConfMat_iphone
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5
## 0 376 0 1 0 4 8
## 1 1 0 0 0 0 1
## 2 1 1 17 0 0 2
## 3 2 0 1 236 4 3
## 4 5 0 1 4 143 10
## 5 203 116 116 116 280 2238
##
## Overall Statistics
##
## Accuracy : 0.7738
## 95% CI : (0.7603, 0.7868)
## No Information Rate : 0.5815
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5611
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.63946 0.0000000 0.125000 0.66292 0.33179 0.9894
## Specificity 0.99606 0.9994699 0.998934 0.99717 0.99422 0.4896
## Pos Pred Value 0.96658 0.0000000 0.809524 0.95935 0.87730 0.7292
## Neg Pred Value 0.93945 0.9699074 0.969243 0.96707 0.92273 0.9708
## Prevalence 0.15116 0.0300771 0.034961 0.09152 0.11080 0.5815
## Detection Rate 0.09666 0.0000000 0.004370 0.06067 0.03676 0.5753
## Detection Prevalence 0.10000 0.0005141 0.005398 0.06324 0.04190 0.7889
## Balanced Accuracy 0.81776 0.4997350 0.561967 0.83005 0.66300 0.7395
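The Class: 5 figures can be checked by hand from the Random Forest confusion matrix above. The sketch below re-enters the printed counts and computes the positive predictive value (precision) and sensitivity; the variable names are introduced for illustration.

```r
# Random Forest confusion matrix for the iPhone test set, copied from the
# output above (rows = predictions, columns = reference).
classes <- as.character(0:5)
rf_conf <- matrix(
  c(376,   0,   1,   0,   4,    8,
      1,   0,   0,   0,   0,    1,
      1,   1,  17,   0,   0,    2,
      2,   0,   1, 236,   4,    3,
      5,   0,   1,   4, 143,   10,
    203, 116, 116, 116, 280, 2238),
  nrow = 6, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

true_pos <- rf_conf["5", "5"]

# Precision: of everything predicted as 5, how much really is 5.
precision_5 <- true_pos / sum(rf_conf["5", ])    # TP / (TP + FP)

# Sensitivity: of everything that really is 5, how much was found.
sensitivity_5 <- true_pos / sum(rf_conf[, "5"])  # TP / (TP + FN)

round(c(precision = precision_5, sensitivity = sensitivity_5), 4)
```

The high sensitivity combined with lower precision reflects the model's tendency to assign the majority class.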
To improve our models, we can select or remove certain features if they are not providing any use. There are different ways of doing this.
The distribution of values within a variable (or feature) gives an indication of how much information that variable holds. Features with zero or near-zero variance can therefore be removed, as they carry little information but increase computation time.
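As a rough sketch of how near-zero variance can be detected, the block below computes two metrics per column: the frequency ratio of the two most common values, and the percentage of unique values. The helper nzv_metrics and the toy columns are hypothetical. The cut-offs 95/5 and 10 mirror the documented defaults of caret's nearZeroVar(), but this is not a reimplementation of that function.

```r
# Frequency ratio: count of the most common value over the second most common.
# Percent unique: distinct values as a percentage of all rows.
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  c(freqRatio = freq_ratio,
    percentUnique = 100 * length(counts) / length(x))
}

# Hypothetical toy columns: one almost constant, one varied.
toy <- data.frame(
  sonycampos = c(rep(0, 99), 1),
  iphone     = rep(c(0, 1, 2, 3), times = 25)
)

metrics <- t(sapply(toy, nzv_metrics))

# Flag a column when one value dominates and there are few distinct values.
flagged <- metrics[, "freqRatio"] > 95 / 5 & metrics[, "percentUnique"] < 10
metrics
flagged
```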
Below we have created two datasets (iphoneNZV and galaxyNZV) with the less informative variables removed. Both keep the same features, apart from each dataset's own sentiment column.
The galaxyNZV dataset has only two features relating to Samsung: “samsunggalaxy” and “galaxysentiment”. This raises the question, “Is most sentiment toward the Samsung Galaxy driven by positive or negative reviews of the iPhone?”, which we hope to resolve.
nzvMetricsiPhone <- nearZeroVar(iphone_df, saveMetrics = TRUE)
nzviPhone<- nearZeroVar(iphone_df, saveMetrics = FALSE)
iphoneNZV<- iphone_df[,-nzviPhone]
iphoneNZV$iphonesentiment <-as.factor(iphoneNZV$iphonesentiment)
training_set_iphone_NZV<-iphoneNZV[IndexTrain,]
testing_set_iphone_NZV<-iphoneNZV[-IndexTrain,]
print("Remaining features in NZV iPhone dataset:")
## [1] "Remaining features in NZV iPhone dataset:"
colnames(training_set_iphone_NZV)
## [1] "iphone" "samsunggalaxy" "htcphone" "iphonecampos"
## [5] "iphonecamunc" "iphonedispos" "iphonedisneg" "iphonedisunc"
## [9] "iphoneperpos" "iphoneperneg" "iphoneperunc" "iphonesentiment"
nzvMetricsgalaxy <- nearZeroVar(galaxy_df, saveMetrics = TRUE)
nzvgalaxy<- nearZeroVar(galaxy_df, saveMetrics = FALSE)
galaxyNZV<- galaxy_df[,-nzvgalaxy]
galaxyNZV$galaxysentiment <-as.factor(galaxyNZV$galaxysentiment)
training_set_galaxy_NZV<-galaxyNZV[IndexTrain_galaxy,]
testing_set_galaxy_NZV<-galaxyNZV[-IndexTrain_galaxy,]
print("Remaining features in NZV galaxy dataset:")
## [1] "Remaining features in NZV galaxy dataset:"
colnames(training_set_galaxy_NZV)
## [1] "iphone" "samsunggalaxy" "htcphone" "iphonecampos"
## [5] "iphonecamunc" "iphonedispos" "iphonedisneg" "iphonedisunc"
## [9] "iphoneperpos" "iphoneperneg" "iphoneperunc" "galaxysentiment"
We will now run our two most successful models, C5.0 and Random Forest, with the above feature selection to see if we can improve accuracy.
#Random Forest
RandomForestFit_iphone_NZV<-train(iphonesentiment~.,data = training_set_iphone_NZV,method='rf', trControl=responses_ctrl_iphone)
RandomForestResults_iphone_NZV<- RandomForestFit_iphone_NZV$results
print("Random Forest Results:")
## [1] "Random Forest Results:"
RandomForestResults_iphone_NZV
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.7586686 0.5249740 0.007129869 0.01612044
## 2 6 0.7545957 0.5238594 0.003768706 0.01013590
## 3 11 0.7485400 0.5160934 0.005833476 0.01413955
#Predictions
PredsRF_iphone_NZV<- predict(RandomForestFit_iphone_NZV, newdata=testing_set_iphone_NZV)
RFConfMat_iphone_NZV<- confusionMatrix(data=PredsRF_iphone_NZV,testing_set_iphone_NZV$iphonesentiment)
RF_iphone_NZV_accuracy<-postResample(PredsRF_iphone_NZV,testing_set_iphone_NZV$iphonesentiment)
#C5 decision tree
C5Fit_iphone_NZV<-train(iphonesentiment~.,data = training_set_iphone_NZV,method='C5.0', trControl=responses_ctrl_iphone)
C5Results_iphone_NZV<-C5Fit_iphone_NZV$results
print("C5 results:")
## [1] "C5 results:"
C5Results_iphone_NZV
## model winnow trials Accuracy Kappa AccuracySD KappaSD
## 7 rules FALSE 1 0.7530516 0.5137567 0.009230818 0.02476595
## 10 rules TRUE 1 0.7522811 0.5124292 0.009621745 0.02551985
## 1 tree FALSE 1 0.7544856 0.5193471 0.004815998 0.01068204
## 4 tree TRUE 1 0.7543755 0.5195277 0.005398680 0.01145251
## 8 rules FALSE 10 0.7465593 0.5038092 0.003820829 0.01103318
## 11 rules TRUE 10 0.7461190 0.5035769 0.005623192 0.01137125
## 2 tree FALSE 10 0.7451299 0.5036052 0.006617427 0.01349713
## 5 tree TRUE 10 0.7422680 0.4963711 0.009844885 0.02156069
## 9 rules FALSE 20 0.7465593 0.5038092 0.003820829 0.01103318
## 12 rules TRUE 20 0.7461190 0.5035769 0.005623192 0.01137125
## 3 tree FALSE 20 0.7451299 0.5036052 0.006617427 0.01349713
## 6 tree TRUE 20 0.7422680 0.4963711 0.009844885 0.02156069
PredsC5_iphone_NZV<- predict(C5Fit_iphone_NZV, newdata=testing_set_iphone_NZV)
C5ConfMat_iphone_NZV<- confusionMatrix(data=PredsC5_iphone_NZV,testing_set_iphone_NZV$iphonesentiment)
C5_iphone_NZV_accuracy<-postResample(PredsC5_iphone_NZV,testing_set_iphone_NZV$iphonesentiment)
## [1] "Random Forest results:"
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.7547522 0.5000197 0.009318872 0.02301893
## 2 6 0.7505497 0.4994057 0.009057025 0.02333364
## 3 11 0.7430277 0.4896735 0.008238344 0.02085541
## [1] "C5 results:"
## model winnow trials Accuracy Kappa AccuracySD KappaSD
## 7 rules FALSE 1 0.7535414 0.4989467 0.005175917 0.012051023
## 10 rules TRUE 1 0.7535414 0.4989467 0.005175917 0.012051023
## 1 tree FALSE 1 0.7521041 0.4965346 0.006377437 0.014390995
## 4 tree TRUE 1 0.7521041 0.4965346 0.006377437 0.014390995
## 8 rules FALSE 10 0.7349552 0.4547650 0.003335738 0.007459253
## 11 rules TRUE 10 0.7349552 0.4547650 0.003335738 0.007459253
## 2 tree FALSE 10 0.7337393 0.4558775 0.009443116 0.020439343
## 5 tree TRUE 10 0.7337393 0.4558775 0.009443116 0.020439343
## 9 rules FALSE 20 0.7349552 0.4547650 0.003335738 0.007459253
## 12 rules TRUE 20 0.7349552 0.4547650 0.003335738 0.007459253
## 3 tree FALSE 20 0.7337393 0.4558775 0.009443116 0.020439343
## 6 tree TRUE 20 0.7337393 0.4558775 0.009443116 0.020439343
We will use Random Forest, our best performing model, to run recursive feature elimination (RFE), a feature selection algorithm, on our data set of 58 predictor variables. This gradually eliminates the least useful features, leaving us with the best selection. We will use a 1,000-row sample of both the iPhone and Galaxy data to save computation time, as this is a computationally heavy process.
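The elimination loop at the heart of RFE can be sketched in a few lines. To keep the example small and dependency-free, it swaps in a linear model with absolute t-values as the importance measure, on simulated data. caret's rfe() instead ranks features with Random Forest importance and scores each subset size by resampling, so this is an illustration of the principle only.

```r
set.seed(1)
n <- 200

# Simulated data: x1 and x2 drive the outcome, x3 and x4 are pure noise.
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
toy$y <- 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n, sd = 0.5)

features   <- setdiff(names(toy), "y")
eliminated <- character(0)

# Refit on the remaining features, drop the weakest, and repeat.
while (length(features) > 1) {
  fit <- lm(reformulate(features, response = "y"), data = toy)
  importance <- abs(coef(summary(fit))[features, "t value"])
  weakest <- names(which.min(importance))
  eliminated <- c(eliminated, weakest)
  features <- setdiff(features, weakest)
}

eliminated  # order in which features were dropped
features    # the last feature standing
```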
#iphone
iphoneSample <- iphone_df[sample(1:nrow(iphone_df),1000,replace=FALSE),]
ctrl_iphone <- rfeControl(functions = rfFuncs,
method = "repeatedcv",
repeats = 5,
verbose = FALSE)
rfeResults_iphone <- rfe(iphoneSample[,1:58],
iphoneSample$iphonesentiment,
sizes=(1:58),
rfeControl=ctrl_iphone)
plot(rfeResults_iphone, type=c("g", "o"))
iphoneRFE<- iphone_df[,predictors(rfeResults_iphone)]
iphoneRFE$iphonesentiment <- iphone_df$iphonesentiment
iphoneRFE$iphonesentiment <-as.factor(iphoneRFE$iphonesentiment)
training_set_iphone_RFE<-iphoneRFE[IndexTrain,]
testing_set_iphone_RFE<-iphoneRFE[-IndexTrain,]
print("Remaining features in RFE iphone dataset:")
## [1] "Remaining features in RFE iphone dataset:"
colnames(training_set_iphone_RFE)
## [1] "iphone" "googleandroid" "samsunggalaxy" "iphonedisunc"
## [5] "htcphone" "sonyxperia" "iphonedispos" "iphoneperpos"
## [9] "iphonedisneg" "ios" "iphonecamunc" "htcdispos"
## [13] "htccampos" "iphonecampos" "iphonecamneg" "iphoneperunc"
## [17] "iphoneperneg" "htcperpos" "htcperneg" "htcdisunc"
## [21] "htcperunc" "htccamneg" "htcdisneg" "samsungdispos"
## [25] "samsungperpos" "samsungperunc" "iphonesentiment"
## [1] "Remaining features in RFE galaxy dataset:"
## [1] "iphone" "samsunggalaxy" "googleandroid" "iphonedisunc"
## [5] "htcphone" "ios" "iphonedispos" "iphonecamunc"
## [9] "iphoneperpos" "htccampos" "iphonedisneg" "htcdispos"
## [13] "iphoneperneg" "sonyxperia" "iphonecamneg" "iphonecampos"
## [17] "htccamunc" "htcperpos" "iphoneperunc" "htcdisneg"
## [21] "htcdisunc" "htccamneg" "samsungcamunc" "samsungdispos"
## [25] "samsungcampos" "htcperneg" "iosperneg" "samsungperpos"
## [29] "galaxysentiment"
#Random Forest
RandomForestFit_iphone_RFE<-train(iphonesentiment~.,data = training_set_iphone_RFE,method='rf', trControl=responses_ctrl_iphone)
summary(RandomForestFit_iphone_RFE)
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 9083 factor numeric
## err.rate 3500 -none- numeric
## confusion 42 -none- numeric
## votes 54498 matrix numeric
## oob.times 9083 -none- numeric
## classes 6 -none- character
## importance 26 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 9083 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 26 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 6 -none- character
## param 0 -none- list
RandomForestResults_iphone_RFE<- RandomForestFit_iphone_RFE$results
print("Random Forest results:")
## [1] "Random Forest results:"
RandomForestResults_iphone_RFE
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.7127635 0.4054042 0.007719913 0.01748074
## 2 14 0.7701238 0.5589461 0.010412889 0.02038441
## 3 26 0.7627470 0.5490002 0.011105589 0.02078794
#Predictions
PredsRF_iphone_RFE<- predict(RandomForestFit_iphone_RFE, newdata=testing_set_iphone_RFE)
RFConfMat_iphone_RFE<- confusionMatrix(data=PredsRF_iphone_RFE,testing_set_iphone_RFE$iphonesentiment)
RF_iphone_RFE_accuracy<-postResample(PredsRF_iphone_RFE,testing_set_iphone_RFE$iphonesentiment)
#C5
C5Fit_iphone_RFE<-train(iphonesentiment~.,data = training_set_iphone_RFE,method='C5.0', trControl=responses_ctrl_iphone)
summary(C5Fit_iphone_RFE)
##
## Call:
## (function (x, y, trials = 1, rules = FALSE, weights = NULL, control
## = 0.25, minCases = 2, fuzzyThreshold = FALSE, sample = 0, earlyStopping
## = TRUE, label = "outcome", seed = 3913L))
##
##
## C5.0 [Release 2.07 GPL Edition] Thu Jul 20 14:22:31 2023
## -------------------------------
##
## Class specified by attribute `outcome'
##
## Read 9083 cases (27 attributes) from undefined.data
##
## Rules:
##
## Rule 1: (310, lift 6.6)
## iphone <= 0
## htcphone > 0
## -> class 0 [0.997]
##
## Rule 2: (180/3, lift 6.5)
## iphone <= 2
## sonyxperia > 0
## iphoneperpos <= 0
## iphonedisneg <= 1
## -> class 0 [0.978]
##
## Rule 3: (645/14, lift 6.5)
## iphone <= 0
## googleandroid <= 0
## -> class 0 [0.977]
##
## Rule 4: (80/1, lift 6.4)
## iphone > 0
## iphonedispos <= 3
## samsungperpos > 2
## -> class 0 [0.976]
##
## Rule 5: (82/2, lift 6.4)
## iphone <= 2
## sonyxperia <= 0
## htccampos > 1
## -> class 0 [0.964]
##
## Rule 6: (140/5, lift 6.3)
## iphone <= 2
## sonyxperia <= 0
## htccampos > 0
## iphoneperunc <= 0
## -> class 0 [0.958]
##
## Rule 7: (12, lift 6.1)
## sonyxperia > 0
## iphonedisneg <= 1
## iphonecamunc > 0
## htcperunc <= 0
## -> class 0 [0.929]
##
## Rule 8: (7, lift 5.9)
## iphonedisunc <= 0
## iphonedispos > 0
## iphoneperpos > 4
## iphonecamneg > 0
## -> class 0 [0.889]
##
## Rule 9: (14/1, lift 5.8)
## samsunggalaxy > 0
## iphonedisunc <= 0
## iphonecamunc > 0
## -> class 0 [0.875]
##
## Rule 10: (16/4, lift 4.8)
## iphone <= 1
## iphonedisunc <= 0
## iphonedispos > 0
## iphoneperpos > 0
## iphonecamneg > 1
## -> class 0 [0.722]
##
## Rule 11: (45, lift 28.0)
## iphone <= 0
## googleandroid > 0
## htcphone <= 0
## -> class 2 [0.979]
##
## Rule 12: (117/1, lift 10.7)
## iphone <= 1
## samsunggalaxy <= 0
## iphonedisunc > 0
## iphoneperpos <= 0
## iphonedisneg <= 0
## iphonecamneg <= 0
## iphoneperunc <= 0
## iphoneperneg <= 0
## -> class 3 [0.983]
##
## Rule 13: (162/2, lift 10.7)
## iphone <= 1
## samsunggalaxy <= 0
## iphonedispos > 0
## iphonedispos <= 2
## iphoneperpos <= 0
## iphonedisneg <= 0
## iphonecampos <= 0
## iphoneperunc <= 0
## -> class 3 [0.982]
##
## Rule 14: (81/1, lift 10.7)
## iphone <= 1
## iphonedisunc > 0
## iphonedispos > 0
## iphonedispos <= 1
## iphoneperpos <= 0
## iphonecamneg <= 0
## iphoneperunc <= 0
## -> class 3 [0.976]
##
## Rule 15: (115/2, lift 10.6)
## iphone <= 2
## samsunggalaxy <= 0
## iphonedisunc > 0
## iphonedisunc <= 1
## iphoneperpos <= 0
## iphonedisneg <= 0
## iphonecamneg <= 0
## iphoneperunc <= 0
## iphoneperneg <= 0
## -> class 3 [0.974]
##
## Rule 16: (102/2, lift 10.6)
## iphone > 0
## googleandroid > 0
## iphoneperpos <= 0
## -> class 3 [0.971]
##
## Rule 17: (145/11, lift 10.0)
## iphone > 0
## samsunggalaxy > 0
## sonyxperia <= 0
## iphoneperpos <= 0
## iphonecamunc <= 0
## -> class 3 [0.918]
##
## Rule 18: (6, lift 9.6)
## iphoneperpos <= 3
## iphonecamunc > 2
## htccampos > 0
## samsungperpos <= 2
## -> class 3 [0.875]
##
## Rule 19: (11/2, lift 8.4)
## iphone <= 2
## googleandroid <= 0
## sonyxperia <= 0
## htccampos > 0
## htccampos <= 1
## iphoneperunc > 0
## -> class 3 [0.769]
##
## Rule 20: (9/2, lift 7.9)
## googleandroid <= 0
## iphonedisneg > 4
## htcdisunc > 1
## -> class 3 [0.727]
##
## Rule 21: (4/2, lift 5.5)
## iphone > 2
## sonyxperia > 0
## -> class 3 [0.500]
##
## Rule 22: (184, lift 9.0)
## iphonedisunc > 0
## iphonedispos > 0
## iphoneperpos <= 0
## htccampos <= 0
## iphonecampos > 0
## iphonecampos <= 2
## iphonecamneg > 0
## -> class 4 [0.995]
##
## Rule 23: (173, lift 9.0)
## iphone > 7
## iphonedisunc > 0
## iphoneperpos <= 0
## iphonedisneg <= 4
## -> class 4 [0.994]
##
## Rule 24: (174, lift 9.0)
## ios > 3
## -> class 4 [0.994]
##
## Rule 25: (142/8, lift 8.4)
## iphonedisneg > 4
## htccampos <= 0
## iphonecampos > 0
## -> class 4 [0.938]
##
## Rule 26: (13/1, lift 7.8)
## iphone > 2
## iphonecamunc <= 2
## htccampos > 0
## samsungperpos <= 2
## -> class 4 [0.867]
##
## Rule 27: (3, lift 7.2)
## iphonedisunc <= 0
## iphonedispos > 0
## iphoneperpos <= 0
## iphonedisneg <= 0
## iphonecamneg > 0
## -> class 4 [0.800]
##
## Rule 28: (3, lift 7.2)
## iphone > 2
## iphoneperpos > 3
## iphonecamunc > 2
## htccampos > 0
## -> class 4 [0.800]
##
## Rule 29: (8/1, lift 7.2)
## iphone > 1
## googleandroid > 0
## sonyxperia <= 0
## iphonedisneg > 0
## samsungperpos <= 2
## -> class 4 [0.800]
##
## Rule 30: (4/1, lift 6.0)
## sonyxperia > 0
## htcperunc > 0
## samsungperpos <= 2
## -> class 4 [0.667]
##
## Rule 31: (4/1, lift 6.0)
## iphonedispos > 3
## samsungperpos > 2
## -> class 4 [0.667]
##
## Rule 32: (1161/298, lift 1.3)
## iphone <= 2
## iphoneperpos > 0
## iphonedisneg <= 1
## iphonecamunc <= 0
## -> class 5 [0.743]
##
## Rule 33: (1281/344, lift 1.3)
## iphone <= 2
## iphoneperpos > 0
## iphonedisneg <= 1
## htcperunc <= 0
## -> class 5 [0.731]
##
## Rule 34: (197/57, lift 1.2)
## iphonedispos <= 0
## iphonedisneg > 0
## iphonecamneg <= 0
## -> class 5 [0.709]
##
## Rule 35: (7599/2370, lift 1.2)
## iphone > 0
## googleandroid <= 0
## samsunggalaxy <= 0
## sonyxperia <= 0
## ios <= 3
## htccampos <= 0
## -> class 5 [0.688]
##
## Default class: 5
##
##
## Evaluation on training data (9083 cases):
##
## Rules
## ----------------
## No Errors
##
## 35 1986(21.9%) <<
##
##
## (a) (b) (c) (d) (e) (f) <-classified as
## ---- ---- ---- ---- ---- ----
## 908 6 3 457 (a): class 0
## 1 272 (b): class 1
## 1 45 4 1 267 (c): class 2
## 3 543 286 (d): class 3
## 6 3 346 653 (e): class 4
## 9 8 6 5255 (f): class 5
##
##
## Attribute usage:
##
## 99.67% iphone
## 92.62% googleandroid
## 89.23% sonyxperia
## 87.58% htccampos
## 85.58% ios
## 85.45% samsunggalaxy
## 24.46% iphonedisneg
## 24.41% iphoneperpos
## 14.83% iphonecamunc
## 14.22% htcperunc
## 7.65% iphonedispos
## 6.37% iphonecamneg
## 5.34% iphonecampos
## 4.90% iphoneperunc
## 4.38% iphonedisunc
## 3.91% htcphone
## 1.52% iphoneperneg
## 1.23% samsungperpos
## 0.10% htcdisunc
##
##
## Time: 0.1 secs
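The numbers printed next to each rule can be reproduced by hand, which helps when reading the summary above. The sketch below assumes the usual C5.0 conventions: a rule's confidence is the Laplace ratio (n - m + 1) / (n + 2), where n is the number of covered training cases and m the number misclassified, and lift is that confidence divided by the relative frequency of the predicted class. The class counts are the row sums of the training confusion matrix in the summary; the helper names `rule_confidence` and `rule_lift` are illustrative and not part of the C50 package.

```r
# Laplace-corrected confidence of a rule covering n cases, m of them wrongly
rule_confidence <- function(n, m = 0) (n - m + 1) / (n + 2)

# Class counts among the 9083 training cases (row sums of the training
# confusion matrix printed in the C5.0 summary)
class_counts <- c("0" = 1374, "1" = 273, "2" = 318,
                  "3" = 832, "4" = 1008, "5" = 5278)
class_prior <- class_counts / sum(class_counts)

# Lift = rule confidence / prior frequency of the predicted class
rule_lift <- function(n, m, class) rule_confidence(n, m) / class_prior[[class]]

conf_rule1  <- rule_confidence(310, 0)      # Rule 1: (310, lift 6.6) -> class 0
lift_rule1  <- rule_lift(310, 0, "0")
conf_rule2  <- rule_confidence(180, 3)      # Rule 2: (180/3, lift 6.5) -> class 0
lift_rule35 <- rule_lift(7599, 2370, "5")   # Rule 35: (7599/2370, lift 1.2) -> class 5

print(round(c(conf_rule1 = conf_rule1, lift_rule1 = lift_rule1,
              conf_rule2 = conf_rule2, lift_rule35 = lift_rule35), 3))
```

Read this way, the class 5 rules cover thousands of cases but have a lift close to 1, so they add little beyond always guessing the majority class, whereas the small rules for classes 2, 3 and 4 are far more informative per case covered.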
C5Results_iphone_RFE<-C5Fit_iphone_RFE$results
print("C5 results:")
## [1] "C5 results:"
C5Results_iphone_RFE
## model winnow trials Accuracy Kappa AccuracySD KappaSD
## 7 rules FALSE 1 0.7736401 0.5605874 0.006314254 0.01362950
## 10 rules TRUE 1 0.7724287 0.5580897 0.006726264 0.01461212
## 1 tree FALSE 1 0.7723182 0.5581867 0.006855322 0.01496038
## 4 tree TRUE 1 0.7714374 0.5566386 0.007558973 0.01672058
## 8 rules FALSE 10 0.7577841 0.5350669 0.007699640 0.01233984
## 11 rules TRUE 10 0.7603164 0.5408349 0.007322959 0.01449994
## 2 tree FALSE 10 0.7600967 0.5404993 0.008928601 0.01856165
## 5 tree TRUE 10 0.7587769 0.5393712 0.005322350 0.01286143
## 9 rules FALSE 20 0.7577841 0.5350669 0.007699640 0.01233984
## 12 rules TRUE 20 0.7603164 0.5408349 0.007322959 0.01449994
## 3 tree FALSE 20 0.7600967 0.5404993 0.008928601 0.01856165
## 6 tree TRUE 20 0.7587769 0.5393712 0.005322350 0.01286143
PredsC5_iphone_RFE<- predict(C5Fit_iphone_RFE, newdata=testing_set_iphone_RFE)
C5ConfMat_iphone_RFE<- confusionMatrix(data=PredsC5_iphone_RFE,testing_set_iphone_RFE$iphonesentiment)
c5_iphone_RFE_accuracy<-postResample(PredsC5_iphone_RFE,testing_set_iphone_RFE$iphonesentiment)
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.7156003 0.3884700 0.006967015 0.02045213
## 2 15 0.7599619 0.5240426 0.009561368 0.02497072
## 3 28 0.7517766 0.5128873 0.009651644 0.02502883
We can see from the graph below that our Random Forest with Recursive Feature Elimination (RFE) has the highest accuracy and kappa, slightly better than our Random Forest model with all features.
####iphone####
feature_model_names<-c('RF', 'C5', 'RF NZV', 'c5 NZV', 'RF RFE','c5 RFE')
iphone_feature_accuracy<- c(RF_iphone_accuracy[1],C5_iphone_accuracy[1],RF_iphone_NZV_accuracy[1],C5_iphone_NZV_accuracy[1],RF_iphone_RFE_accuracy[1],c5_iphone_RFE_accuracy[1])
iphone_feature_kappa <-c(RF_iphone_accuracy[2],C5_iphone_accuracy[2],RF_iphone_NZV_accuracy[2],C5_iphone_NZV_accuracy[2],RF_iphone_RFE_accuracy[2],c5_iphone_RFE_accuracy[2])
iphone_feature_accuracy_df<-data.frame(feature_model_names,iphone_feature_accuracy,iphone_feature_kappa)
ggplot(iphone_feature_accuracy_df, aes(x=iphone_feature_accuracy,y=iphone_feature_kappa,color=feature_model_names))+geom_point(aes(shape=feature_model_names, color=feature_model_names),size=5) + labs(title='iPhone adjusted feature model accuracy metrics')+xlab('Accuracy')+ylab('Kappa')
Looking at the confusion matrix, most of the errors are instances wrongly predicted as 5: predictions for the other classes are fairly precise, but many instances whose true rating is lower are assigned a 5. In other words, our model tends to over-predict positive sentiment.
RFConfMat_iphone_RFE
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5
## 0 375 0 1 0 4 9
## 1 1 0 0 0 0 1
## 2 1 1 17 0 0 2
## 3 2 0 1 236 4 5
## 4 4 0 1 4 145 11
## 5 205 116 116 116 278 2234
##
## Overall Statistics
##
## Accuracy : 0.773
## 95% CI : (0.7595, 0.7861)
## No Information Rate : 0.5815
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5601
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.6378 0.0000000 0.125000 0.66292 0.33643 0.9876
## Specificity 0.9958 0.9994699 0.998934 0.99660 0.99422 0.4896
## Pos Pred Value 0.9640 0.0000000 0.809524 0.95161 0.87879 0.7289
## Neg Pred Value 0.9392 0.9699074 0.969243 0.96705 0.92322 0.9661
## Prevalence 0.1512 0.0300771 0.034961 0.09152 0.11080 0.5815
## Detection Rate 0.0964 0.0000000 0.004370 0.06067 0.03728 0.5743
## Detection Prevalence 0.1000 0.0005141 0.005398 0.06375 0.04242 0.7879
## Balanced Accuracy 0.8168 0.4997350 0.561967 0.82976 0.66532 0.7386
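To make the summary statistics above less of a black box, the following sketch recomputes overall accuracy and Cohen's kappa directly from the printed matrix using base R only. The matrix values are copied from the output above; the variable names are illustrative.

```r
# iPhone Random Forest RFE test-set confusion matrix, copied from the output
# above (rows = predictions, columns = reference labels)
cm <- matrix(
  c(375,   0,   1,   0,   4,    9,
      1,   0,   0,   0,   0,    1,
      1,   1,  17,   0,   0,    2,
      2,   0,   1, 236,   4,    5,
      4,   0,   1,   4, 145,   11,
    205, 116, 116, 116, 278, 2234),
  nrow = 6, byrow = TRUE,
  dimnames = list(Prediction = as.character(0:5),
                  Reference  = as.character(0:5))
)

n <- sum(cm)
observed <- sum(diag(cm)) / n                      # overall accuracy
expected <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
cohen_kappa <- (observed - expected) / (1 - expected)

# Cases wrongly assigned to class 5: row "5" minus its diagonal entry
false_class5 <- sum(cm["5", ]) - cm["5", "5"]

print(round(c(accuracy = observed, kappa = cohen_kappa), 4))
print(false_class5)
```

Kappa discounts the agreement a model would reach by chance given the class frequencies, which is why it sits well below accuracy here: class 5 makes up more than half of the test set.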
For Galaxy our original Random Forest model has the highest accuracy and Kappa score, followed by C5 and Random Forest RFE.
The overall accuracy of our Galaxy predictions is slightly lower than that of iPhone (0.7667 vs 0.773). However, the Galaxy model has a slightly higher ‘Pos Pred Value’ for class ‘5’ (0.7417 vs 0.7289), meaning that when it predicts a ‘5’ it is correct slightly more often. So if we are trying to predict positive sentiment towards phones, our Galaxy model would be slightly more reliable.
RFConfMat_galaxy
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1 2 3 4 5
## 0 355 0 3 3 7 24
## 1 0 0 1 0 1 2
## 2 1 0 17 0 1 6
## 3 1 3 0 217 6 24
## 4 4 0 1 5 122 24
## 5 147 111 113 127 288 2257
##
## Overall Statistics
##
## Accuracy : 0.7667
## 95% CI : (0.7531, 0.78)
## No Information Rate : 0.6037
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5349
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity 0.69882 0.000000 0.125926 0.61648 0.28706 0.9658
## Specificity 0.98900 0.998935 0.997859 0.99034 0.99013 0.4876
## Pos Pred Value 0.90561 0.000000 0.680000 0.86454 0.78205 0.7417
## Neg Pred Value 0.95602 0.970520 0.969319 0.96271 0.91844 0.9034
## Prevalence 0.13123 0.029450 0.034875 0.09093 0.10979 0.6037
## Detection Rate 0.09171 0.000000 0.004392 0.05606 0.03152 0.5831
## Detection Prevalence 0.10127 0.001033 0.006458 0.06484 0.04030 0.7861
## Balanced Accuracy 0.84391 0.499468 0.561892 0.80341 0.63860 0.7267
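The ‘Pos Pred Value’ comparison for class 5 can be checked by hand from the last row of each confusion matrix. This is a minimal base R sketch; the counts are copied from the two matrices above and the helper `ppv` is a made-up name for illustration.

```r
# Row "5" of each confusion matrix: how the cases *predicted* as class 5
# are spread over the true (reference) classes 0..5
iphone_pred5 <- c("0" = 205, "1" = 116, "2" = 116, "3" = 116, "4" = 278, "5" = 2234)
galaxy_pred5 <- c("0" = 147, "1" = 111, "2" = 113, "3" = 127, "4" = 288, "5" = 2257)

# Positive predictive value = correct class-5 predictions / all class-5 predictions
ppv <- function(pred_row, class = "5") unname(pred_row[class] / sum(pred_row))

ppv_iphone <- ppv(iphone_pred5)
ppv_galaxy <- ppv(galaxy_pred5)

print(round(c(iphone = ppv_iphone, galaxy = ppv_galaxy), 4))
```

In both models roughly a quarter of the class 5 predictions belong to a lower class, so the gap between the two phones is small next to the shared tendency to over-predict 5.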
We can try to increase the model accuracy even further by reducing the number of levels of the dependent variable we are trying to predict, from 6 (0-5) to 4 (1-4).
We will combine ‘0’ and ‘1’ into ‘1’, leave ‘2’ and ‘3’ as they are, and combine ‘4’ and ‘5’ into ‘4’.
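As a quick illustration of this mapping, here is a base R sketch on a small made-up vector. It uses a named lookup table rather than `recode`, purely so that it runs without extra packages; the toy ratings are not taken from the dataset.

```r
# Toy sentiment ratings on the original 0-5 scale (made-up values)
sentiment <- c(0, 1, 2, 3, 4, 5, 5, 0)

# Lookup table: original level -> merged level
merge_map <- c("0" = 1, "1" = 1, "2" = 2, "3" = 3, "4" = 4, "5" = 4)

# Index the lookup table by the rating, then store the result as a factor
recoded <- factor(unname(merge_map[as.character(sentiment)]), levels = 1:4)

# Cross-tabulate to confirm that every original level lands where intended
print(table(original = sentiment, recoded = recoded))
```

The cross-tabulation is a cheap sanity check worth running on the real data too, since a mistyped level in the mapping would otherwise silently produce `NA` values.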
We will use our best model, Random Forest RFE for iPhone:
#### Recoding the dependent variable ####
iphone_new<-iphone_df
# recode sentiment to combine factor levels 0 & 1 and 4 & 5
iphone_new$iphonesentiment <- recode(iphone_new$iphonesentiment, '0' = 1, '1' = 1, '2' = 2, '3' = 3, '4' = 4, '5' = 4)
iphone_new$iphonesentiment<-as.factor(iphone_new$iphonesentiment)
#Using best model
iphone_new_RFE<- iphone_new[,predictors(rfeResults_iphone)]
iphone_new_RFE$iphonesentiment<-iphone_new$iphonesentiment
#Train/test split using the same index as before
training_set_iphone_RFE_new<-iphone_new_RFE[IndexTrain,]
testing_set_iphone_RFE_new<-iphone_new_RFE[-IndexTrain,]
#Random Forest
RandomForestFit_iphone_RFE_new<-train(iphonesentiment~.,data = training_set_iphone_RFE_new,method='rf', trControl=responses_ctrl_iphone)
RandomForestFit_iphone_RFE_new_results<- RandomForestFit_iphone_RFE_new$results
RandomForestFit_iphone_RFE_new_results
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.7880660 0.4153755 0.003572810 0.01138683
## 2 14 0.8475164 0.6243311 0.004832229 0.01284738
## 3 26 0.8424506 0.6153212 0.005958003 0.01420274
PredsRF_iphone_RFE_new<- predict(RandomForestFit_iphone_RFE_new, newdata=testing_set_iphone_RFE_new)
RFConfMat_iphone_RFE_new<- confusionMatrix(data=PredsRF_iphone_RFE_new,testing_set_iphone_RFE_new$iphonesentiment)
RF_iphone_RFE_accuracy_new<-postResample(PredsRF_iphone_RFE_new,testing_set_iphone_RFE_new$iphonesentiment)
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.7861719 0.3711835 0.005611703 0.01764327
## 2 30 0.8410405 0.5896094 0.004894515 0.01176612
## 3 58 0.8356206 0.5797126 0.006034997 0.01233874
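We can clearly see that we have improved on our previous best in both accuracy and kappa: for iPhone, the best cross-validated accuracy rises from about 0.77 to 0.85 and kappa from about 0.56 to 0.62. Bear in mind that the two models are scored on different scales (6 classes versus 4), so part of the gain comes from having fewer classes to confuse.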
recode_iphone_model_names<-c('RF RFE', 'RF RFE recode')
iphone_recode_accuracy<- c(RF_iphone_RFE_accuracy[1],RF_iphone_RFE_accuracy_new[1])
iphone_recode_kappa <-c(RF_iphone_RFE_accuracy[2],RF_iphone_RFE_accuracy_new[2])
iphone_recode_accuracy_df<-data.frame(recode_iphone_model_names,iphone_recode_accuracy,iphone_recode_kappa)
ggplot(iphone_recode_accuracy_df, aes(x=iphone_recode_accuracy,y=iphone_recode_kappa,color=recode_iphone_model_names))+geom_point(aes(shape=recode_iphone_model_names, color=recode_iphone_model_names),size=5) + labs(title='iPhone recode comparison')+xlab('Accuracy')+ylab('Kappa')
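One way to gauge how much of this improvement comes from the coarser scale alone is to re-score the existing 6-class predictions after merging the classes, without retraining anything. The sketch below does this in base R for the iPhone Random Forest RFE test-set confusion matrix shown earlier. It is a rough check under that assumption, not a replacement for the retrained model.

```r
# 6-class iPhone RF RFE test-set confusion matrix from earlier in the report
# (rows = predictions, columns = reference labels)
cm <- matrix(
  c(375,   0,   1,   0,   4,    9,
      1,   0,   0,   0,   0,    1,
      1,   1,  17,   0,   0,    2,
      2,   0,   1, 236,   4,    5,
      4,   0,   1,   4, 145,   11,
    205, 116, 116, 116, 278, 2234),
  nrow = 6, byrow = TRUE,
  dimnames = list(as.character(0:5), as.character(0:5))
)

# Same merge as the recode step: 0 and 1 -> 1, 2 -> 2, 3 -> 3, 4 and 5 -> 4
merge_map <- c("0" = "1", "1" = "1", "2" = "2", "3" = "3", "4" = "4", "5" = "4")

# Merge prediction rows first, then reference columns
merged_rows <- rowsum(cm, group = merge_map[rownames(cm)])
collapsed <- t(rowsum(t(merged_rows), group = merge_map[colnames(cm)]))

accuracy_6 <- sum(diag(cm)) / sum(cm)
accuracy_4 <- sum(diag(collapsed)) / sum(collapsed)

print(collapsed)
print(round(c(six_class = accuracy_6, merged = accuracy_4), 4))
```

If the re-scored accuracy lands close to that of the retrained 4-class model, this suggests that most of the gain may come from merging neighbouring ratings rather than from the model separating the classes any better.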
As discussed in the introduction, we have collated a data set of over 20,000 unseen instances on which to predict sentiment. This was done using Amazon Web Services (EC2, EMR and S3) to scan and map information taken from the Common Crawl (https://commoncrawl.org/big-picture/what-we-do/).
We will use the recoded sentiment ratings (1-4) for the below, since the predictions come from the recoded Random Forest RFE model.
####iPhone####
iphone_unseen_df<- read.csv("C:\\Users\\domsi\\OneDrive\\Documents\\M5\\Outputs\\iphoneLargeMatrix.csv")
#Using best model: Recursive Feature Elimination
iphone_unseen_df_RFE<- iphone_unseen_df[,predictors(rfeResults_iphone)]
#Random Forest
PredsRF_iphone_RFE_unseen<- predict(RandomForestFit_iphone_RFE_new, newdata=iphone_unseen_df_RFE)
print("Our predictions for iphone sentiment are as below:")
## [1] "Our predictions for iphone sentiment are as below:"
summary(PredsRF_iphone_RFE_unseen)
## 1 2 3 4
## 16034 517 895 6841
## [1] "Our predictions for iphone sentiment are as below:"
## 1 2 3 4
## 15849 556 815 7067
The probability distributions below are remarkably similar. Galaxy has slightly more positive sentiment (4) and slightly less negative sentiment (1) than the iPhone. We have seen through our analysis that the most important features for both Galaxy and iPhone tend to be related to the iPhone, so it is no surprise that we see similar predictions. The predictions also show far more negative than positive sentiment towards both phones. This could be because people are more likely to write a review after a bad experience, so we could instead judge sentiment by which phone attracts the smaller number of negative reviews.
plot_ly(iphone_unseen_df_RFE, x=~PredsRF_iphone_RFE_unseen, type='histogram',histnorm='probability') %>% layout(title='Probability distribution of sentiment towards iPhones')
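To put numbers on the comparison, the predicted class counts printed above can be turned into shares. This sketch assumes that the second block of counts is the Galaxy output, as the discussion implies, even though its printed label says iphone. The variable names are illustrative.

```r
# Predicted class counts copied from the two summaries above
iphone_counts <- c("1" = 16034, "2" = 517, "3" = 895, "4" = 6841)
galaxy_counts <- c("1" = 15849, "2" = 556, "3" = 815, "4" = 7067)  # assumed to be Galaxy

# Share of each predicted sentiment class per phone
shares <- rbind(iphone = prop.table(iphone_counts),
                galaxy = prop.table(galaxy_counts))
print(round(shares, 3))

# Difference in percentage points (Galaxy minus iPhone)
print(round(100 * (shares["galaxy", ] - shares["iphone", ]), 2))
```

Both phones are scored on the same number of pages, so the counts are directly comparable, and the differences between them amount to about one percentage point or less in every class.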