Project description

Helio (a fictional app development company) is working with a government health agency to create a suite of smartphone medical apps for use by aid workers in developing countries. This suite of apps will enable the aid workers to manage local health conditions by facilitating communication with medical professionals located elsewhere (one of the apps, for example, enables specialists in communicable diseases to diagnose conditions by examining images and other patient data uploaded by local aid workers). The government agency requires that the app suite be bundled with one model of smartphone. Helio is in the process of evaluating potential handset models to determine which one to bundle their software with. After completing an initial investigation, Helio has created a shortlist of five devices that are all capable of executing the app suite’s functions. To help Helio narrow their list down to one device, they have asked us to examine the prevalence of positive and negative attitudes toward these devices on the web. The goal of this project is to provide our client with a report that contains an analysis of sentiment toward the target devices, as well as a description of the methods and processes we used to arrive at our conclusions.

Approach

We will gauge sentiment across the web toward two devices: the iPhone and the Samsung Galaxy. This involves using Amazon Web Services (EC2, EMR and S3) to build a data matrix of over 20,000 instances from the Common Crawl (https://commoncrawl.org/big-picture/what-we-do/). Below we cover only the analysis of this dataset, not how it was procured in AWS.

Below we import the necessary libraries for this project and register additional processor cores to handle the large datasets. We have two datasets, iphone_df and galaxy_df, with columns iphonesentiment and galaxysentiment respectively. These give a rating of sentiment towards each phone, assigned by a team who manually reviewed each instance:

0: very negative

1: negative

2: somewhat negative

3: somewhat positive

4: positive

5: very positive

#Additional processor cores
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
library(plotly)
## Loading required package: ggplot2
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(corrplot)
## corrplot 0.92 loaded
library(caret)
## Loading required package: lattice
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
detectCores() # 8 cores available
## [1] 8
cl<- makeCluster(4)

#Register Cluster
registerDoParallel(cl)

#Confirm how many clusters are available and assigned to R
getDoParWorkers() # result is 4
## [1] 4
iphone_df<- read.csv("C:\\Users\\domsi\\OneDrive\\Documents\\M5\\iphone_smallmatrix_labeled_8d.csv")
galaxy_df<- read.csv("C:\\Users\\domsi\\OneDrive\\Documents\\M5\\galaxy_smallmatrix_labeled_9d.csv")

#No missing data
sum(is.na(iphone_df))
## [1] 0
sum(is.na(galaxy_df))
## [1] 0

Taking a look at our datasets, we can see a large number of variables. The variables are the same in both datasets; the only difference is the last column, the ‘sentiment’ towards the respective phone.

We have a count of the number of times each of the phone types below was mentioned in a document:

iphone, samsunggalaxy, sonyxperia, nokialumina, htcphone

For each of these phones, we counted the number of positive, negative and unclear words or expressions appearing near each of the attributes below (or their synonyms):

Operating system (only ios or googleandroid; positive, negative or unclear)

Camera (positive, negative or unclear)

Display (positive, negative or unclear)

Performance (positive, negative or unclear)

options(max.print=1000000)
head(iphone_df)
##   iphone samsunggalaxy sonyxperia nokialumina htcphone ios googleandroid
## 1      1             0          0           0        0   0             0
## 2      1             0          0           0        0   0             0
## 3      1             0          0           0        0   0             0
## 4      1             0          0           0        0   0             0
## 5      1             0          0           0        0   0             0
## 6     41             0          0           0        0   6             0
##   iphonecampos samsungcampos sonycampos nokiacampos htccampos iphonecamneg
## 1            0             0          0           0         0            0
## 2            0             0          0           0         0            0
## 3            0             0          0           0         0            0
## 4            0             0          0           0         0            0
## 5            0             0          0           0         0            0
## 6            1             0          0           0         0            3
##   samsungcamneg sonycamneg nokiacamneg htccamneg iphonecamunc samsungcamunc
## 1             0          0           0         0            0             0
## 2             0          0           0         0            0             0
## 3             0          0           0         0            0             0
## 4             0          0           0         0            0             0
## 5             0          0           0         0            0             0
## 6             0          0           0         0            7             0
##   sonycamunc nokiacamunc htccamunc iphonedispos samsungdispos sonydispos
## 1          0           0         0            0             0          0
## 2          0           0         0            0             0          0
## 3          0           0         0            0             0          0
## 4          0           0         0            0             0          0
## 5          0           0         0            0             0          0
## 6          0           0         0            1             0          0
##   nokiadispos htcdispos iphonedisneg samsungdisneg sonydisneg nokiadisneg
## 1           0         0            0             0          0           0
## 2           0         0            0             0          0           0
## 3           0         0            0             0          0           0
## 4           0         0            0             0          0           0
## 5           0         0            0             0          0           0
## 6           0         0            3             0          0           0
##   htcdisneg iphonedisunc samsungdisunc sonydisunc nokiadisunc htcdisunc
## 1         0            0             0          0           0         0
## 2         0            0             0          0           0         0
## 3         0            0             0          0           0         0
## 4         0            0             0          0           0         0
## 5         0            0             0          0           0         0
## 6         0            4             0          0           0         0
##   iphoneperpos samsungperpos sonyperpos nokiaperpos htcperpos iphoneperneg
## 1            0             0          0           0         0            0
## 2            1             0          0           0         0            0
## 3            0             0          0           0         0            0
## 4            1             0          0           0         0            0
## 5            1             0          0           0         0            0
## 6            0             0          0           0         0            0
##   samsungperneg sonyperneg nokiaperneg htcperneg iphoneperunc samsungperunc
## 1             0          0           0         0            0             0
## 2             0          0           0         0            0             0
## 3             0          0           0         0            0             0
## 4             0          0           0         0            1             0
## 5             0          0           0         0            0             0
## 6             0          0           0         0            0             0
##   sonyperunc nokiaperunc htcperunc iosperpos googleperpos iosperneg
## 1          0           0         0         0            0         0
## 2          0           0         0         0            0         0
## 3          0           0         0         0            0         0
## 4          0           0         0         0            0         0
## 5          0           0         0         0            0         0
## 6          0           0         0         0            0         0
##   googleperneg iosperunc googleperunc iphonesentiment
## 1            0         0            0               0
## 2            0         0            0               0
## 3            0         0            0               0
## 4            0         0            0               0
## 5            0         0            0               0
## 6            0         0            0               4
head(galaxy_df)
##   iphone samsunggalaxy sonyxperia nokialumina htcphone ios googleandroid
## 1      1             0          0           0        0   0             0
## 2      1             0          0           0        0   0             0
## 3      1             1          0           0        0   0             0
## 4      0             0          0           0        1   0             0
## 5      1             0          0           0        0   0             0
## 6      2             0          0           0        0   0             0
##   iphonecampos samsungcampos sonycampos nokiacampos htccampos iphonecamneg
## 1            0             0          0           0         0            0
## 2            0             0          0           0         0            0
## 3            1             1          0           0         0            0
## 4            0             0          0           0         0            0
## 5            0             0          0           0         0            0
## 6            1             0          0           0         0            0
##   samsungcamneg sonycamneg nokiacamneg htccamneg iphonecamunc samsungcamunc
## 1             0          0           0         0            0             0
## 2             0          0           0         0            0             0
## 3             0          0           0         0            0             0
## 4             0          0           0         0            0             0
## 5             0          0           0         0            0             0
## 6             0          0           0         0            0             0
##   sonycamunc nokiacamunc htccamunc iphonedispos samsungdispos sonydispos
## 1          0           0         0            0             0          0
## 2          0           0         0            1             0          0
## 3          0           0         0            0             0          0
## 4          0           0         0            0             0          0
## 5          0           0         0            0             0          0
## 6          0           0         0            0             0          0
##   nokiadispos htcdispos iphonedisneg samsungdisneg sonydisneg nokiadisneg
## 1           0         0            0             0          0           0
## 2           0         0            1             0          0           0
## 3           0         0            0             0          0           0
## 4           0         1            0             0          0           0
## 5           0         0            0             0          0           0
## 6           0         0            0             0          0           0
##   htcdisneg iphonedisunc samsungdisunc sonydisunc nokiadisunc htcdisunc
## 1         0            0             0          0           0         0
## 2         0            1             0          0           0         0
## 3         0            0             0          0           0         0
## 4         0            0             0          0           0         1
## 5         0            0             0          0           0         0
## 6         0            0             0          0           0         0
##   iphoneperpos samsungperpos sonyperpos nokiaperpos htcperpos iphoneperneg
## 1            0             0          0           0         0            0
## 2            0             0          0           0         0            0
## 3            0             0          0           0         0            0
## 4            0             0          0           0         1            0
## 5            0             0          0           0         0            0
## 6            0             0          0           0         0            0
##   samsungperneg sonyperneg nokiaperneg htcperneg iphoneperunc samsungperunc
## 1             0          0           0         0            0             0
## 2             0          0           0         0            0             0
## 3             0          0           0         0            0             0
## 4             0          0           0         1            0             0
## 5             0          0           0         0            0             0
## 6             0          0           0         0            0             0
##   sonyperunc nokiaperunc htcperunc iosperpos googleperpos iosperneg
## 1          0           0         0         0            0         0
## 2          0           0         0         0            0         0
## 3          0           0         0         0            0         0
## 4          0           0         1         0            0         0
## 5          0           0         0         0            0         0
## 6          0           0         0         0            0         0
##   googleperneg iosperunc googleperunc galaxysentiment
## 1            0         0            0               5
## 2            0         0            0               3
## 3            0         0            0               3
## 4            0         0            0               0
## 5            0         0            0               1
## 6            0         0            0               0

Sentiment histograms

Below we can see very similar-looking sentiment distributions for the two phones. The Samsung Galaxy has slightly more ‘very positive’ ratings than the iPhone, with the difference mirrored by slightly fewer ‘very negative’ ratings.
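The histogram code is hidden in the report; a minimal sketch of how these plots could be produced with the plotly package loaded above (an assumption, not the author’s exact code) is:

#Sentiment histograms (sketch)
plot_ly(iphone_df, x=~iphonesentiment, type='histogram') %>% layout(title='Distribution of iPhone sentiment')
plot_ly(galaxy_df, x=~galaxysentiment, type='histogram') %>% layout(title='Distribution of Samsung Galaxy sentiment')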

Correlation plots

Below we have the two correlation plots, one for the iPhone dataset and one for the Samsung Galaxy dataset. We want to focus on the sentiment row to see whether any features are strongly correlated with it, positively or negatively.

The two plots look almost identical, which is expected given that the feature columns are the same, and sentiment towards the iPhone and the Galaxy shows similar correlation patterns as well. This is not too surprising given the similarities in the histograms above. Nothing in either plot shows a strong positive correlation with sentiment; interestingly, the samsunggalaxy count has a noticeable negative correlation with both galaxysentiment and iphonesentiment.
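The plotting code is hidden; a minimal sketch using the corrplot package loaded above (assuming a plain Pearson correlation over all numeric columns) would be:

#Correlation plots (sketch)
corrplot(cor(iphone_df), tl.cex=0.5, title='iPhone correlation matrix', mar=c(0,0,1,0))
corrplot(cor(galaxy_df), tl.cex=0.5, title='Galaxy correlation matrix', mar=c(0,0,1,0))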

Let’s look at how our variables are set up. All variables are counts of a word’s appearances on a website, other than the sentiment column, which is categorical (definitions above), so we will convert the sentiment columns to factors.

str(iphone_df)
## 'data.frame':    12973 obs. of  59 variables:
##  $ iphone         : int  1 1 1 1 1 41 1 1 1 1 ...
##  $ samsunggalaxy  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sonyxperia     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nokialumina    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ htcphone       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ios            : int  0 0 0 0 0 6 0 0 0 0 ...
##  $ googleandroid  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ iphonecampos   : int  0 0 0 0 0 1 1 0 0 0 ...
##  $ samsungcampos  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sonycampos     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nokiacampos    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ htccampos      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ iphonecamneg   : int  0 0 0 0 0 3 1 0 0 0 ...
##  $ samsungcamneg  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sonycamneg     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nokiacamneg    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ htccamneg      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ iphonecamunc   : int  0 0 0 0 0 7 1 0 0 0 ...
##  $ samsungcamunc  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sonycamunc     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nokiacamunc    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ htccamunc      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ iphonedispos   : int  0 0 0 0 0 1 13 0 0 0 ...
##  $ samsungdispos  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sonydispos     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nokiadispos    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ htcdispos      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ iphonedisneg   : int  0 0 0 0 0 3 10 0 0 0 ...
##  $ samsungdisneg  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sonydisneg     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nokiadisneg    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ htcdisneg      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ iphonedisunc   : int  0 0 0 0 0 4 9 0 0 0 ...
##  $ samsungdisunc  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sonydisunc     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nokiadisunc    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ htcdisunc      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ iphoneperpos   : int  0 1 0 1 1 0 5 3 0 0 ...
##  $ samsungperpos  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sonyperpos     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nokiaperpos    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ htcperpos      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ iphoneperneg   : int  0 0 0 0 0 0 4 1 0 0 ...
##  $ samsungperneg  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sonyperneg     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nokiaperneg    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ htcperneg      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ iphoneperunc   : int  0 0 0 1 0 0 5 0 0 0 ...
##  $ samsungperunc  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sonyperunc     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nokiaperunc    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ htcperunc      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ iosperpos      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ googleperpos   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ iosperneg      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ googleperneg   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ iosperunc      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ googleperunc   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ iphonesentiment: int  0 0 0 0 0 4 4 0 0 0 ...
str(galaxy_df)
## 'data.frame':    12911 obs. of  59 variables:
##  $ iphone         : int  1 1 1 0 1 2 1 1 4 1 ...
##  $ samsunggalaxy  : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ sonyxperia     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nokialumina    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ htcphone       : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ ios            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ googleandroid  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ iphonecampos   : int  0 0 1 0 0 1 0 0 0 0 ...
##  $ samsungcampos  : int  0 0 1 0 0 0 0 0 0 0 ...
##  $ sonycampos     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nokiacampos    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ htccampos      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ iphonecamneg   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ samsungcamneg  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sonycamneg     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nokiacamneg    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ htccamneg      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ iphonecamunc   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ samsungcamunc  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sonycamunc     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nokiacamunc    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ htccamunc      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ iphonedispos   : int  0 1 0 0 0 0 2 0 0 0 ...
##  $ samsungdispos  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sonydispos     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nokiadispos    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ htcdispos      : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ iphonedisneg   : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ samsungdisneg  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sonydisneg     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nokiadisneg    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ htcdisneg      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ iphonedisunc   : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ samsungdisunc  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sonydisunc     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nokiadisunc    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ htcdisunc      : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ iphoneperpos   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ samsungperpos  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sonyperpos     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nokiaperpos    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ htcperpos      : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ iphoneperneg   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ samsungperneg  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sonyperneg     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nokiaperneg    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ htcperneg      : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ iphoneperunc   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ samsungperunc  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ sonyperunc     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nokiaperunc    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ htcperunc      : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ iosperpos      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ googleperpos   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ iosperneg      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ googleperneg   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ iosperunc      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ googleperunc   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ galaxysentiment: int  5 3 3 0 1 0 3 5 5 5 ...
iphone_df$iphonesentiment<-as.factor(iphone_df$iphonesentiment)
galaxy_df$galaxysentiment<-as.factor(galaxy_df$galaxysentiment)

Training models

Below we train four different classification models: Random Forest, C5.0, Support Vector Machine and K-Nearest Neighbours. We train each model on both the iPhone and the Galaxy data, using 5-fold cross-validation repeated once and all features. This gives an indication of which model works best for our data, which we can then fine-tune to try to improve accuracy later. We show the code for training the iPhone Random Forest and hide the code for the other models, showing only their results, which are compared later on.

iPhones

set.seed(123)

IndexTrain<- createDataPartition(y=iphone_df$iphonesentiment,
                                 p=0.70,
                                 list=FALSE)
training_set_iphone<-iphone_df[IndexTrain,]
testing_set_iphone<-iphone_df[-IndexTrain,]

# Define the control parameters for our model
responses_ctrl_iphone<- trainControl(method='repeatedcv', number=5,repeats=1)

Random Forest

Random Forest uses an ensemble (forest) of decision trees to predict the outcome of a classification problem. The class predicted by the majority of the trees is the output of the Random Forest.

#Fit the model

RandomForestFit_iphone<-train(iphonesentiment~.,data = training_set_iphone,method='rf', trControl=responses_ctrl_iphone)

#Results:

RandomForestResults_iphone<- RandomForestFit_iphone$results

RandomForestResults_iphone
##   mtry  Accuracy     Kappa  AccuracySD    KappaSD
## 1    2 0.7007591 0.3722918 0.003156604 0.01003563
## 2   30 0.7727620 0.5622495 0.006183294 0.01484825
## 3   58 0.7637347 0.5491715 0.005424906 0.01320563
#predictions
PredsRF_iphone<- predict(RandomForestFit_iphone, newdata=testing_set_iphone)

#Confusion matrix

RFConfMat_iphone<- confusionMatrix(data=PredsRF_iphone,testing_set_iphone$iphonesentiment)

#accuracy
RF_iphone_accuracy<- postResample(PredsRF_iphone,testing_set_iphone$iphonesentiment)

C5

C5.0 classification models use either a decision tree or a rule-based model. They work by repeatedly splitting the data at the point where the most information is gained.
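The C5.0 training code is hidden; a minimal sketch following the same pattern as the Random Forest above (object names other than C5_iphone_accuracy, which is reused later, are assumptions) would be:

#C5.0 decision tree (sketch)
C5Fit_iphone<- train(iphonesentiment~., data = training_set_iphone, method='C5.0', trControl=responses_ctrl_iphone)
C5Results_iphone<- C5Fit_iphone$results
PredsC5_iphone<- predict(C5Fit_iphone, newdata=testing_set_iphone)
C5_iphone_accuracy<- postResample(PredsC5_iphone, testing_set_iphone$iphonesentiment)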

##    model winnow trials  Accuracy     Kappa  AccuracySD     KappaSD
## 7  rules  FALSE      1 0.7703402 0.5543348 0.004215102 0.008725390
## 10 rules   TRUE      1 0.7697890 0.5537985 0.004330741 0.009172138
## 1   tree  FALSE      1 0.7684691 0.5512736 0.002969988 0.005521612
## 4   tree   TRUE      1 0.7684681 0.5518640 0.004332674 0.009162804
## 8  rules  FALSE     10 0.7602120 0.5366753 0.003850594 0.009972375
## 11 rules   TRUE     10 0.7582304 0.5366311 0.004773835 0.008185420
## 2   tree  FALSE     10 0.7624127 0.5434079 0.003080477 0.007475143
## 5   tree   TRUE     10 0.7606525 0.5409539 0.002789537 0.004851069
## 9  rules  FALSE     20 0.7602120 0.5366753 0.003850594 0.009972375
## 12 rules   TRUE     20 0.7582304 0.5366311 0.004773835 0.008185420
## 3   tree  FALSE     20 0.7624127 0.5434079 0.003080477 0.007475143
## 6   tree   TRUE     20 0.7606525 0.5409539 0.002789537 0.004851069

Support Vector Machine

SVM works by finding a hyperplane (a line in two dimensions, a plane in three, and so on) that separates the classes with the largest possible margin. For example, in a 2-D dataset, SVM finds the line that partitions the data into groups most effectively.
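The SVM training code is hidden; judging by the single ‘cost’ tuning parameter in the results it was likely a linear SVM. A sketch (the method string and object names are assumptions) would be:

#Support Vector Machine (sketch)
SVMFit_iphone<- train(iphonesentiment~., data = training_set_iphone, method='svmLinear2', trControl=responses_ctrl_iphone)
SVMResults_iphone<- SVMFit_iphone$results
PredsSVM_iphone<- predict(SVMFit_iphone, newdata=testing_set_iphone)
SVM_iphone_accuracy<- postResample(PredsSVM_iphone, testing_set_iphone$iphonesentiment)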

##   cost  Accuracy     Kappa  AccuracySD    KappaSD
## 1 0.25 0.6928355 0.3693173 0.012570454 0.02987336
## 2 0.50 0.7069225 0.4055675 0.009558239 0.02073187
## 3 1.00 0.7089031 0.4124570 0.008390388 0.01815390

K Nearest Neighbour

K-Nearest Neighbours classifies a data point using the k points closest to it, on the assumption that its nearest neighbours will have similar characteristics to itself.
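The KNN training code is hidden; the kmax/distance/kernel parameters in the results suggest caret’s ‘kknn’ method. A sketch (object names other than KNNConfMat_iphone, which is used later, are assumptions) would be:

#K-Nearest Neighbours (sketch)
KNNFit_iphone<- train(iphonesentiment~., data = training_set_iphone, method='kknn', trControl=responses_ctrl_iphone)
PredsKNN_iphone<- predict(KNNFit_iphone, newdata=testing_set_iphone)
KNNConfMat_iphone<- confusionMatrix(data=PredsKNN_iphone, testing_set_iphone$iphonesentiment)
KNN_iphone_accuracy<- postResample(PredsKNN_iphone, testing_set_iphone$iphonesentiment)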

##   kmax distance  kernel  Accuracy     Kappa  AccuracySD     KappaSD
## 1    5        2 optimal 0.3080481 0.1530814 0.010067531 0.010854351
## 2    7        2 optimal 0.3203788 0.1564489 0.009636541 0.010723476
## 3    9        2 optimal 0.3283052 0.1609767 0.006996242 0.008372339

Samsung Galaxy

set.seed(123)

IndexTrain_galaxy<- createDataPartition(y=galaxy_df$galaxysentiment,
                                 p=0.70,
                                 list=FALSE)
training_set_galaxy<-galaxy_df[IndexTrain_galaxy,]
testing_set_galaxy<-galaxy_df[-IndexTrain_galaxy,]

training_set_galaxy$galaxysentiment<- as.factor(training_set_galaxy$galaxysentiment)
testing_set_galaxy$galaxysentiment<- as.factor(testing_set_galaxy$galaxysentiment)

#responses_ctrl stays the same
responses_ctrl_galaxy<- trainControl(method='repeatedcv', number=5,repeats=1)

Random Forest

##   mtry  Accuracy     Kappa  AccuracySD    KappaSD
## 1    2 0.7054209 0.3568951 0.004052458 0.01169008
## 2   30 0.7611691 0.5248937 0.006761353 0.01478376
## 3   58 0.7523198 0.5127047 0.006921926 0.01539424

C5

##    model winnow trials  Accuracy     Kappa  AccuracySD    KappaSD
## 7  rules  FALSE      1 0.7631656 0.5257385 0.009737038 0.02183258
## 10 rules   TRUE      1 0.7648244 0.5291823 0.007977222 0.01847665
## 1   tree  FALSE      1 0.7633872 0.5263596 0.009892377 0.02218260
## 4   tree   TRUE      1 0.7628333 0.5253238 0.007895657 0.01852420
## 8  rules  FALSE     10 0.7535426 0.5059946 0.010567289 0.02336173
## 11 rules   TRUE     10 0.7502237 0.4971887 0.011906873 0.02942906
## 2   tree  FALSE     10 0.7556443 0.5130673 0.010086479 0.02426922
## 5   tree   TRUE     10 0.7508851 0.5003618 0.010474820 0.02631869
## 9  rules  FALSE     20 0.7535426 0.5059946 0.010567289 0.02336173
## 12 rules   TRUE     20 0.7502237 0.4971887 0.011906873 0.02942906
## 3   tree  FALSE     20 0.7556443 0.5130673 0.010086479 0.02426922
## 6   tree   TRUE     20 0.7508851 0.5003618 0.010474820 0.02631869

Support Vector Machine

##   cost  Accuracy     Kappa  AccuracySD    KappaSD
## 1 0.25 0.7012168 0.3626839 0.010386018 0.02472235
## 2 0.50 0.7044235 0.3739690 0.008082921 0.01692360
## 3 1.00 0.7054203 0.3821431 0.006988075 0.01542300

K Nearest Neighbour

##   kmax distance  kernel  Accuracy     Kappa AccuracySD    KappaSD
## 1    5        2 optimal 0.6659261 0.4140002 0.03182499 0.03844816
## 2    7        2 optimal 0.7042064 0.4525665 0.01743256 0.02190136
## 3    9        2 optimal 0.7287630 0.4839317 0.03115862 0.04170870

Accuracy comparison

We can see in the graphs below that the C5.0 and Random Forest algorithms achieve similar accuracies and Kappa scores.
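The comparison plots are generated by hidden code; a sketch of how they are likely built (the C5/SVM/KNN accuracy objects are assumptions, created with postResample in the same way as RF_iphone_accuracy above) is:

#Model accuracy comparison (sketch)
model_names<- c('RF', 'C5', 'SVM', 'KNN')
iphone_model_accuracy<- c(RF_iphone_accuracy[1], C5_iphone_accuracy[1], SVM_iphone_accuracy[1], KNN_iphone_accuracy[1])
iphone_model_kappa<- c(RF_iphone_accuracy[2], C5_iphone_accuracy[2], SVM_iphone_accuracy[2], KNN_iphone_accuracy[2])
iphone_model_accuracy_df<- data.frame(model_names, iphone_model_accuracy, iphone_model_kappa)
ggplot(iphone_model_accuracy_df, aes(x=iphone_model_accuracy, y=iphone_model_kappa, color=model_names)) + geom_point(aes(shape=model_names, color=model_names), size=5) + labs(title='iPhone model accuracy metrics') + xlab('Accuracy') + ylab('Kappa')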

Confusion Matrices

In the graphs above, accuracy is the ratio of correct predictions to the total number of predictions. For example, if the model predicts 5 (very positive) and the sentiment rating given to that website is 5, that is a correct prediction. We can analyse this further with a confusion matrix.
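As a quick check (sketch), the same accuracy figure can be recomputed from a confusion matrix table as the sum of its diagonal (correct predictions) divided by the total number of test instances:

#Accuracy from a confusion matrix table (sketch)
knn_table<- KNNConfMat_iphone$table
sum(diag(knn_table)) / sum(knn_table) # matches the Accuracy reported below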

We will compare a poor predictive model (KNN for the iPhone) with a better one (Random Forest for the iPhone).

We can see the K-Nearest Neighbours model makes a huge number of errors where it predicts ‘0’ but the correct value is ‘5’, giving ‘very negative’ instead of ‘very positive’. This could be due to a number of reasons; KNN is particularly susceptible to irrelevant features, which, given the number of features in our dataset, could be a factor.

KNNConfMat_iphone
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5
##          0  532   99   85   94  222 1731
##          1    6    4    2    3    5   49
##          2    1    1   18    1    4   27
##          3    5    1    5  233    7   27
##          4    6    4    2    5  134   33
##          5   38    8   24   20   59  395
## 
## Overall Statistics
##                                           
##                Accuracy : 0.3383          
##                  95% CI : (0.3234, 0.3534)
##     No Information Rate : 0.5815          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.1714          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.9048 0.034188 0.132353  0.65449  0.31090   0.1746
## Specificity            0.3243 0.982772 0.990943  0.98727  0.98554   0.9085
## Pos Pred Value         0.1925 0.057971 0.346154  0.83813  0.72826   0.7261
## Neg Pred Value         0.9503 0.970427 0.969255  0.96595  0.91986   0.4420
## Prevalence             0.1512 0.030077 0.034961  0.09152  0.11080   0.5815
## Detection Rate         0.1368 0.001028 0.004627  0.05990  0.03445   0.1015
## Detection Prevalence   0.7103 0.017738 0.013368  0.07147  0.04730   0.1398
## Balanced Accuracy      0.6146 0.508480 0.561648  0.82088  0.64822   0.5416

We can now look at our Random Forest confusion matrix, which has far fewer errors. It is noticeable that the RF model predicts many more ‘5’ values, which is the majority class in the dataset (as we saw in the histograms). This is reflected in the 72.92% positive predictive value for Class: 5. This measurement (precision) is the number of True Positives (predicting a 5 when it is a 5) divided by the total of True Positives and False Positives (predicting a 5 when the actual value is something else); a quick check after the matrix below reproduces this value.

RFConfMat_iphone
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5
##          0  376    0    1    0    4    8
##          1    1    0    0    0    0    1
##          2    1    1   17    0    0    2
##          3    2    0    1  236    4    3
##          4    5    0    1    4  143   10
##          5  203  116  116  116  280 2238
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7738          
##                  95% CI : (0.7603, 0.7868)
##     No Information Rate : 0.5815          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5611          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0  Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.63946 0.0000000 0.125000  0.66292  0.33179   0.9894
## Specificity           0.99606 0.9994699 0.998934  0.99717  0.99422   0.4896
## Pos Pred Value        0.96658 0.0000000 0.809524  0.95935  0.87730   0.7292
## Neg Pred Value        0.93945 0.9699074 0.969243  0.96707  0.92273   0.9708
## Prevalence            0.15116 0.0300771 0.034961  0.09152  0.11080   0.5815
## Detection Rate        0.09666 0.0000000 0.004370  0.06067  0.03676   0.5753
## Detection Prevalence  0.10000 0.0005141 0.005398  0.06324  0.04190   0.7889
## Balanced Accuracy     0.81776 0.4997350 0.561967  0.83005  0.66300   0.7395
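As a worked check (a sketch using the values printed above), the Class: 5 precision is the 2238 correct ‘5’ predictions divided by everything in the ‘5’ prediction row:

2238 / (203 + 116 + 116 + 116 + 280 + 2238) # = 0.7292, the Pos Pred Value for Class: 5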

Feature Selection

To improve our models, we can remove features that provide little useful information. There are several ways of doing this.

Near-zero variance variables

The distribution of values within a variable (or feature) gives an indication of how much information that variable holds. Features with zero or near-zero variance can therefore be removed: they add little information but increase computation time.

Below we create two datasets (iphoneNZV and galaxyNZV) with the less informative variables removed. The remaining features are the same for both, except that each keeps its own phone’s sentiment column.

The galaxyNZV dataset has only two features relating to Samsung: “samsunggalaxy” and “galaxysentiment”. This raises a question we hope to resolve: is most sentiment towards Samsung Galaxys driven by positive or negative reviews of iPhones?

nzvMetricsiPhone <- nearZeroVar(iphone_df, saveMetrics = TRUE)

nzviPhone<- nearZeroVar(iphone_df, saveMetrics = FALSE)

iphoneNZV<- iphone_df[,-nzviPhone]

iphoneNZV$iphonesentiment <-as.factor(iphoneNZV$iphonesentiment)

training_set_iphone_NZV<-iphoneNZV[IndexTrain,]
testing_set_iphone_NZV<-iphoneNZV[-IndexTrain,]

print("Remaing features in NZV iPhone dataset:")
## [1] "Remaing features in NZV iPhone dataset:"
colnames(training_set_iphone_NZV)
##  [1] "iphone"          "samsunggalaxy"   "htcphone"        "iphonecampos"   
##  [5] "iphonecamunc"    "iphonedispos"    "iphonedisneg"    "iphonedisunc"   
##  [9] "iphoneperpos"    "iphoneperneg"    "iphoneperunc"    "iphonesentiment"
nzvMetricsgalaxy <- nearZeroVar(galaxy_df, saveMetrics = TRUE)

nzvgalaxy<- nearZeroVar(galaxy_df, saveMetrics = FALSE)

galaxyNZV<- galaxy_df[,-nzvgalaxy]

galaxyNZV$galaxysentiment <-as.factor(galaxyNZV$galaxysentiment)

training_set_galaxy_NZV<-galaxyNZV[IndexTrain_galaxy,]
testing_set_galaxy_NZV<-galaxyNZV[-IndexTrain_galaxy,]

print("Remaing features in NZV galaxy dataset:")
## [1] "Remaing features in NZV galaxy dataset:"
colnames(training_set_galaxy_NZV)
##  [1] "iphone"          "samsunggalaxy"   "htcphone"        "iphonecampos"   
##  [5] "iphonecamunc"    "iphonedispos"    "iphonedisneg"    "iphonedisunc"   
##  [9] "iphoneperpos"    "iphoneperneg"    "iphoneperunc"    "galaxysentiment"

We will now run our two most successful models, C5.0 and Random Forest, on the datasets with the above feature selection to see if we can improve accuracy.

iPhone

#Random Forest
RandomForestFit_iphone_NZV<-train(iphonesentiment~.,data = training_set_iphone_NZV,method='rf', trControl=responses_ctrl_iphone)

RandomForestResults_iphone_NZV<- RandomForestFit_iphone_NZV$results

print("Random Forest Results:")
## [1] "Random Forest Results:"
RandomForestResults_iphone_NZV
##   mtry  Accuracy     Kappa  AccuracySD    KappaSD
## 1    2 0.7586686 0.5249740 0.007129869 0.01612044
## 2    6 0.7545957 0.5238594 0.003768706 0.01013590
## 3   11 0.7485400 0.5160934 0.005833476 0.01413955
#Predictions

PredsRF_iphone_NZV<- predict(RandomForestFit_iphone_NZV, newdata=testing_set_iphone_NZV)

RFConfMat_iphone_NZV<- confusionMatrix(data=PredsRF_iphone_NZV,testing_set_iphone_NZV$iphonesentiment)

RF_iphone_NZV_accuracy<-postResample(PredsRF_iphone_NZV,testing_set_iphone_NZV$iphonesentiment)

#C5 decision tree

C5Fit_iphone_NZV<-train(iphonesentiment~.,data = training_set_iphone_NZV,method='C5.0', trControl=responses_ctrl_iphone)

C5Results_iphone_NZV<-C5Fit_iphone_NZV$results
print("C5 results:")
## [1] "C5 results:"
C5Results_iphone_NZV
##    model winnow trials  Accuracy     Kappa  AccuracySD    KappaSD
## 7  rules  FALSE      1 0.7530516 0.5137567 0.009230818 0.02476595
## 10 rules   TRUE      1 0.7522811 0.5124292 0.009621745 0.02551985
## 1   tree  FALSE      1 0.7544856 0.5193471 0.004815998 0.01068204
## 4   tree   TRUE      1 0.7543755 0.5195277 0.005398680 0.01145251
## 8  rules  FALSE     10 0.7465593 0.5038092 0.003820829 0.01103318
## 11 rules   TRUE     10 0.7461190 0.5035769 0.005623192 0.01137125
## 2   tree  FALSE     10 0.7451299 0.5036052 0.006617427 0.01349713
## 5   tree   TRUE     10 0.7422680 0.4963711 0.009844885 0.02156069
## 9  rules  FALSE     20 0.7465593 0.5038092 0.003820829 0.01103318
## 12 rules   TRUE     20 0.7461190 0.5035769 0.005623192 0.01137125
## 3   tree  FALSE     20 0.7451299 0.5036052 0.006617427 0.01349713
## 6   tree   TRUE     20 0.7422680 0.4963711 0.009844885 0.02156069
PredsC5_iphone_NZV<- predict(C5Fit_iphone_NZV, newdata=testing_set_iphone_NZV)

C5ConfMat_iphone_NZV<- confusionMatrix(data=PredsC5_iphone_NZV,testing_set_iphone_NZV$iphonesentiment)

C5_iphone_NZV_accuracy<-postResample(PredsC5_iphone_NZV,testing_set_iphone_NZV$iphonesentiment)

Galaxy

## [1] "Random Forest results:"
##   mtry  Accuracy     Kappa  AccuracySD    KappaSD
## 1    2 0.7547522 0.5000197 0.009318872 0.02301893
## 2    6 0.7505497 0.4994057 0.009057025 0.02333364
## 3   11 0.7430277 0.4896735 0.008238344 0.02085541
## [1] "C5 results:"
##    model winnow trials  Accuracy     Kappa  AccuracySD     KappaSD
## 7  rules  FALSE      1 0.7535414 0.4989467 0.005175917 0.012051023
## 10 rules   TRUE      1 0.7535414 0.4989467 0.005175917 0.012051023
## 1   tree  FALSE      1 0.7521041 0.4965346 0.006377437 0.014390995
## 4   tree   TRUE      1 0.7521041 0.4965346 0.006377437 0.014390995
## 8  rules  FALSE     10 0.7349552 0.4547650 0.003335738 0.007459253
## 11 rules   TRUE     10 0.7349552 0.4547650 0.003335738 0.007459253
## 2   tree  FALSE     10 0.7337393 0.4558775 0.009443116 0.020439343
## 5   tree   TRUE     10 0.7337393 0.4558775 0.009443116 0.020439343
## 9  rules  FALSE     20 0.7349552 0.4547650 0.003335738 0.007459253
## 12 rules   TRUE     20 0.7349552 0.4547650 0.003335738 0.007459253
## 3   tree  FALSE     20 0.7337393 0.4558775 0.009443116 0.020439343
## 6   tree   TRUE     20 0.7337393 0.4558775 0.009443116 0.020439343

Recursive Feature Elimination

We will use Random Forest, our best performing model, to run Recursive Feature Elimination (RFE), a feature selection algorithm, on our dataset of 58 predictor variables. RFE gradually eliminates the least useful features, leaving us with the best selection. Because this is computationally heavy, we use a 1,000-row sample of each dataset (iPhone and Galaxy) to save computation time.

#iphone

iphoneSample <- iphone_df[sample(1:nrow(iphone_df),1000,replace=FALSE),]

ctrl_iphone <- rfeControl(functions = rfFuncs, 
                   method = "repeatedcv",
                   repeats = 5,
                   verbose = FALSE)
rfeResults_iphone <- rfe(iphoneSample[,1:58],
                  iphoneSample$iphonesentiment,
                  sizes=(1:58),
                  rfeControl=ctrl_iphone)

plot(rfeResults_iphone, type=c("g", "o"))

iphoneRFE<- iphone_df[,predictors(rfeResults_iphone)]

iphoneRFE$iphonesentiment <- iphone_df$iphonesentiment

iphoneRFE$iphonesentiment <-as.factor(iphoneRFE$iphonesentiment)

training_set_iphone_RFE<-iphoneRFE[IndexTrain,]
testing_set_iphone_RFE<-iphoneRFE[-IndexTrain,]

print("Remaining features in RFE iphone dataset:")
## [1] "Remaining features in RFE iphone dataset:"
colnames(training_set_iphone_RFE)
##  [1] "iphone"          "googleandroid"   "samsunggalaxy"   "iphonedisunc"   
##  [5] "htcphone"        "sonyxperia"      "iphonedispos"    "iphoneperpos"   
##  [9] "iphonedisneg"    "ios"             "iphonecamunc"    "htcdispos"      
## [13] "htccampos"       "iphonecampos"    "iphonecamneg"    "iphoneperunc"   
## [17] "iphoneperneg"    "htcperpos"       "htcperneg"       "htcdisunc"      
## [21] "htcperunc"       "htccamneg"       "htcdisneg"       "samsungdispos"  
## [25] "samsungperpos"   "samsungperunc"   "iphonesentiment"

## [1] "Remaining features in RFE galaxy dataset:"
##  [1] "iphone"          "samsunggalaxy"   "googleandroid"   "iphonedisunc"   
##  [5] "htcphone"        "ios"             "iphonedispos"    "iphonecamunc"   
##  [9] "iphoneperpos"    "htccampos"       "iphonedisneg"    "htcdispos"      
## [13] "iphoneperneg"    "sonyxperia"      "iphonecamneg"    "iphonecampos"   
## [17] "htccamunc"       "htcperpos"       "iphoneperunc"    "htcdisneg"      
## [21] "htcdisunc"       "htccamneg"       "samsungcamunc"   "samsungdispos"  
## [25] "samsungcampos"   "htcperneg"       "iosperneg"       "samsungperpos"  
## [29] "galaxysentiment"

iPhone

#Random Forest
RandomForestFit_iphone_RFE<-train(iphonesentiment~.,data = training_set_iphone_RFE,method='rf', trControl=responses_ctrl_iphone)

summary(RandomForestFit_iphone_RFE)
##                 Length Class      Mode     
## call                4  -none-     call     
## type                1  -none-     character
## predicted        9083  factor     numeric  
## err.rate         3500  -none-     numeric  
## confusion          42  -none-     numeric  
## votes           54498  matrix     numeric  
## oob.times        9083  -none-     numeric  
## classes             6  -none-     character
## importance         26  -none-     numeric  
## importanceSD        0  -none-     NULL     
## localImportance     0  -none-     NULL     
## proximity           0  -none-     NULL     
## ntree               1  -none-     numeric  
## mtry                1  -none-     numeric  
## forest             14  -none-     list     
## y                9083  factor     numeric  
## test                0  -none-     NULL     
## inbag               0  -none-     NULL     
## xNames             26  -none-     character
## problemType         1  -none-     character
## tuneValue           1  data.frame list     
## obsLevels           6  -none-     character
## param               0  -none-     list
RandomForestResults_iphone_RFE<- RandomForestFit_iphone_RFE$results

print("Random Forest results:")
## [1] "Random Forest results:"
RandomForestResults_iphone_RFE
##   mtry  Accuracy     Kappa  AccuracySD    KappaSD
## 1    2 0.7127635 0.4054042 0.007719913 0.01748074
## 2   14 0.7701238 0.5589461 0.010412889 0.02038441
## 3   26 0.7627470 0.5490002 0.011105589 0.02078794
#Predictions

PredsRF_iphone_RFE<- predict(RandomForestFit_iphone_RFE, newdata=testing_set_iphone_RFE)

RFConfMat_iphone_RFE<- confusionMatrix(data=PredsRF_iphone_RFE,testing_set_iphone_RFE$iphonesentiment)

RF_iphone_RFE_accuracy<-postResample(PredsRF_iphone_RFE,testing_set_iphone_RFE$iphonesentiment)

#C5

C5Fit_iphone_RFE<-train(iphonesentiment~.,data = training_set_iphone_RFE,method='C5.0', trControl=responses_ctrl_iphone)

summary(C5Fit_iphone_RFE)
## 
## Call:
## (function (x, y, trials = 1, rules = FALSE, weights = NULL, control
##  = 0.25, minCases = 2, fuzzyThreshold = FALSE, sample = 0, earlyStopping
##  = TRUE, label = "outcome", seed = 3913L))
## 
## 
## C5.0 [Release 2.07 GPL Edition]      Thu Jul 20 14:22:31 2023
## -------------------------------
## 
## Class specified by attribute `outcome'
## 
## Read 9083 cases (27 attributes) from undefined.data
## 
## Rules:
## 
## Rule 1: (310, lift 6.6)
##  iphone <= 0
##  htcphone > 0
##  ->  class 0  [0.997]
## 
## Rule 2: (180/3, lift 6.5)
##  iphone <= 2
##  sonyxperia > 0
##  iphoneperpos <= 0
##  iphonedisneg <= 1
##  ->  class 0  [0.978]
## 
## Rule 3: (645/14, lift 6.5)
##  iphone <= 0
##  googleandroid <= 0
##  ->  class 0  [0.977]
## 
## Rule 4: (80/1, lift 6.4)
##  iphone > 0
##  iphonedispos <= 3
##  samsungperpos > 2
##  ->  class 0  [0.976]
## 
## Rule 5: (82/2, lift 6.4)
##  iphone <= 2
##  sonyxperia <= 0
##  htccampos > 1
##  ->  class 0  [0.964]
## 
## Rule 6: (140/5, lift 6.3)
##  iphone <= 2
##  sonyxperia <= 0
##  htccampos > 0
##  iphoneperunc <= 0
##  ->  class 0  [0.958]
## 
## Rule 7: (12, lift 6.1)
##  sonyxperia > 0
##  iphonedisneg <= 1
##  iphonecamunc > 0
##  htcperunc <= 0
##  ->  class 0  [0.929]
## 
## Rule 8: (7, lift 5.9)
##  iphonedisunc <= 0
##  iphonedispos > 0
##  iphoneperpos > 4
##  iphonecamneg > 0
##  ->  class 0  [0.889]
## 
## Rule 9: (14/1, lift 5.8)
##  samsunggalaxy > 0
##  iphonedisunc <= 0
##  iphonecamunc > 0
##  ->  class 0  [0.875]
## 
## Rule 10: (16/4, lift 4.8)
##  iphone <= 1
##  iphonedisunc <= 0
##  iphonedispos > 0
##  iphoneperpos > 0
##  iphonecamneg > 1
##  ->  class 0  [0.722]
## 
## Rule 11: (45, lift 28.0)
##  iphone <= 0
##  googleandroid > 0
##  htcphone <= 0
##  ->  class 2  [0.979]
## 
## Rule 12: (117/1, lift 10.7)
##  iphone <= 1
##  samsunggalaxy <= 0
##  iphonedisunc > 0
##  iphoneperpos <= 0
##  iphonedisneg <= 0
##  iphonecamneg <= 0
##  iphoneperunc <= 0
##  iphoneperneg <= 0
##  ->  class 3  [0.983]
## 
## Rule 13: (162/2, lift 10.7)
##  iphone <= 1
##  samsunggalaxy <= 0
##  iphonedispos > 0
##  iphonedispos <= 2
##  iphoneperpos <= 0
##  iphonedisneg <= 0
##  iphonecampos <= 0
##  iphoneperunc <= 0
##  ->  class 3  [0.982]
## 
## Rule 14: (81/1, lift 10.7)
##  iphone <= 1
##  iphonedisunc > 0
##  iphonedispos > 0
##  iphonedispos <= 1
##  iphoneperpos <= 0
##  iphonecamneg <= 0
##  iphoneperunc <= 0
##  ->  class 3  [0.976]
## 
## Rule 15: (115/2, lift 10.6)
##  iphone <= 2
##  samsunggalaxy <= 0
##  iphonedisunc > 0
##  iphonedisunc <= 1
##  iphoneperpos <= 0
##  iphonedisneg <= 0
##  iphonecamneg <= 0
##  iphoneperunc <= 0
##  iphoneperneg <= 0
##  ->  class 3  [0.974]
## 
## Rule 16: (102/2, lift 10.6)
##  iphone > 0
##  googleandroid > 0
##  iphoneperpos <= 0
##  ->  class 3  [0.971]
## 
## Rule 17: (145/11, lift 10.0)
##  iphone > 0
##  samsunggalaxy > 0
##  sonyxperia <= 0
##  iphoneperpos <= 0
##  iphonecamunc <= 0
##  ->  class 3  [0.918]
## 
## Rule 18: (6, lift 9.6)
##  iphoneperpos <= 3
##  iphonecamunc > 2
##  htccampos > 0
##  samsungperpos <= 2
##  ->  class 3  [0.875]
## 
## Rule 19: (11/2, lift 8.4)
##  iphone <= 2
##  googleandroid <= 0
##  sonyxperia <= 0
##  htccampos > 0
##  htccampos <= 1
##  iphoneperunc > 0
##  ->  class 3  [0.769]
## 
## Rule 20: (9/2, lift 7.9)
##  googleandroid <= 0
##  iphonedisneg > 4
##  htcdisunc > 1
##  ->  class 3  [0.727]
## 
## Rule 21: (4/2, lift 5.5)
##  iphone > 2
##  sonyxperia > 0
##  ->  class 3  [0.500]
## 
## Rule 22: (184, lift 9.0)
##  iphonedisunc > 0
##  iphonedispos > 0
##  iphoneperpos <= 0
##  htccampos <= 0
##  iphonecampos > 0
##  iphonecampos <= 2
##  iphonecamneg > 0
##  ->  class 4  [0.995]
## 
## Rule 23: (173, lift 9.0)
##  iphone > 7
##  iphonedisunc > 0
##  iphoneperpos <= 0
##  iphonedisneg <= 4
##  ->  class 4  [0.994]
## 
## Rule 24: (174, lift 9.0)
##  ios > 3
##  ->  class 4  [0.994]
## 
## Rule 25: (142/8, lift 8.4)
##  iphonedisneg > 4
##  htccampos <= 0
##  iphonecampos > 0
##  ->  class 4  [0.938]
## 
## Rule 26: (13/1, lift 7.8)
##  iphone > 2
##  iphonecamunc <= 2
##  htccampos > 0
##  samsungperpos <= 2
##  ->  class 4  [0.867]
## 
## Rule 27: (3, lift 7.2)
##  iphonedisunc <= 0
##  iphonedispos > 0
##  iphoneperpos <= 0
##  iphonedisneg <= 0
##  iphonecamneg > 0
##  ->  class 4  [0.800]
## 
## Rule 28: (3, lift 7.2)
##  iphone > 2
##  iphoneperpos > 3
##  iphonecamunc > 2
##  htccampos > 0
##  ->  class 4  [0.800]
## 
## Rule 29: (8/1, lift 7.2)
##  iphone > 1
##  googleandroid > 0
##  sonyxperia <= 0
##  iphonedisneg > 0
##  samsungperpos <= 2
##  ->  class 4  [0.800]
## 
## Rule 30: (4/1, lift 6.0)
##  sonyxperia > 0
##  htcperunc > 0
##  samsungperpos <= 2
##  ->  class 4  [0.667]
## 
## Rule 31: (4/1, lift 6.0)
##  iphonedispos > 3
##  samsungperpos > 2
##  ->  class 4  [0.667]
## 
## Rule 32: (1161/298, lift 1.3)
##  iphone <= 2
##  iphoneperpos > 0
##  iphonedisneg <= 1
##  iphonecamunc <= 0
##  ->  class 5  [0.743]
## 
## Rule 33: (1281/344, lift 1.3)
##  iphone <= 2
##  iphoneperpos > 0
##  iphonedisneg <= 1
##  htcperunc <= 0
##  ->  class 5  [0.731]
## 
## Rule 34: (197/57, lift 1.2)
##  iphonedispos <= 0
##  iphonedisneg > 0
##  iphonecamneg <= 0
##  ->  class 5  [0.709]
## 
## Rule 35: (7599/2370, lift 1.2)
##  iphone > 0
##  googleandroid <= 0
##  samsunggalaxy <= 0
##  sonyxperia <= 0
##  ios <= 3
##  htccampos <= 0
##  ->  class 5  [0.688]
## 
## Default class: 5
## 
## 
## Evaluation on training data (9083 cases):
## 
##          Rules     
##    ----------------
##      No      Errors
## 
##      35 1986(21.9%)   <<
## 
## 
##     (a)   (b)   (c)   (d)   (e)   (f)    <-classified as
##    ----  ----  ----  ----  ----  ----
##     908                 6     3   457    (a): class 0
##                         1         272    (b): class 1
##       1          45     4     1   267    (c): class 2
##       3               543         286    (d): class 3
##       6                 3   346   653    (e): class 4
##       9                 8     6  5255    (f): class 5
## 
## 
##  Attribute usage:
## 
##   99.67% iphone
##   92.62% googleandroid
##   89.23% sonyxperia
##   87.58% htccampos
##   85.58% ios
##   85.45% samsunggalaxy
##   24.46% iphonedisneg
##   24.41% iphoneperpos
##   14.83% iphonecamunc
##   14.22% htcperunc
##    7.65% iphonedispos
##    6.37% iphonecamneg
##    5.34% iphonecampos
##    4.90% iphoneperunc
##    4.38% iphonedisunc
##    3.91% htcphone
##    1.52% iphoneperneg
##    1.23% samsungperpos
##    0.10% htcdisunc
## 
## 
## Time: 0.1 secs
C5Results_iphone_RFE<-C5Fit_iphone_RFE$results

print("C5 results:")
## [1] "C5 results:"
C5Results_iphone_RFE
##    model winnow trials  Accuracy     Kappa  AccuracySD    KappaSD
## 7  rules  FALSE      1 0.7736401 0.5605874 0.006314254 0.01362950
## 10 rules   TRUE      1 0.7724287 0.5580897 0.006726264 0.01461212
## 1   tree  FALSE      1 0.7723182 0.5581867 0.006855322 0.01496038
## 4   tree   TRUE      1 0.7714374 0.5566386 0.007558973 0.01672058
## 8  rules  FALSE     10 0.7577841 0.5350669 0.007699640 0.01233984
## 11 rules   TRUE     10 0.7603164 0.5408349 0.007322959 0.01449994
## 2   tree  FALSE     10 0.7600967 0.5404993 0.008928601 0.01856165
## 5   tree   TRUE     10 0.7587769 0.5393712 0.005322350 0.01286143
## 9  rules  FALSE     20 0.7577841 0.5350669 0.007699640 0.01233984
## 12 rules   TRUE     20 0.7603164 0.5408349 0.007322959 0.01449994
## 3   tree  FALSE     20 0.7600967 0.5404993 0.008928601 0.01856165
## 6   tree   TRUE     20 0.7587769 0.5393712 0.005322350 0.01286143
PredsC5_iphone_RFE<- predict(C5Fit_iphone_RFE, newdata=testing_set_iphone_RFE)

C5ConfMat_iphone_RFE<- confusionMatrix(data=PredsC5_iphone_RFE,testing_set_iphone_RFE$iphonesentiment)

c5_iphone_RFE_accuracy<-postResample(PredsC5_iphone_RFE,testing_set_iphone_RFE$iphonesentiment)

Galaxy

##   mtry  Accuracy     Kappa  AccuracySD    KappaSD
## 1    2 0.7156003 0.3884700 0.006967015 0.02045213
## 2   15 0.7599619 0.5240426 0.009561368 0.02497072
## 3   28 0.7517766 0.5128873 0.009651644 0.02502883

Model comparison

iPhone

We can see from the graph below that our Random Forest with Recursive Feature Elimination has among the highest Accuracy and Kappa, essentially matching our Random Forest model with all features while using fewer features.

####iphone####

feature_model_names<-c('RF', 'C5', 'RF NZV', 'c5 NZV', 'RF RFE','c5 RFE')
iphone_feature_accuracy<- c(RF_iphone_accuracy[1],C5_iphone_accuracy[1],RF_iphone_NZV_accuracy[1],C5_iphone_NZV_accuracy[1],RF_iphone_RFE_accuracy[1],c5_iphone_RFE_accuracy[1])
iphone_feature_kappa <-c(RF_iphone_accuracy[2],C5_iphone_accuracy[2],RF_iphone_NZV_accuracy[2],C5_iphone_NZV_accuracy[2],RF_iphone_RFE_accuracy[2],c5_iphone_RFE_accuracy[2])

iphone_feature_accuracy_df<-data.frame(feature_model_names,iphone_feature_accuracy,iphone_feature_kappa)

ggplot(iphone_feature_accuracy_df, aes(x=iphone_feature_accuracy,y=iphone_feature_kappa,color=feature_model_names))+geom_point(aes(shape=feature_model_names, color=feature_model_names),size=5) + labs(title='iPhone adjusted feature model accuracy metrics')+xlab('Accuracy')+ylab('Kappa')

Looking at the confusion matrix, we can see a large number of false predictions of ‘5’, but otherwise fairly accurate predictions, so our model is prone to over-predicting positive sentiment.

RFConfMat_iphone_RFE
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5
##          0  375    0    1    0    4    9
##          1    1    0    0    0    0    1
##          2    1    1   17    0    0    2
##          3    2    0    1  236    4    5
##          4    4    0    1    4  145   11
##          5  205  116  116  116  278 2234
## 
## Overall Statistics
##                                           
##                Accuracy : 0.773           
##                  95% CI : (0.7595, 0.7861)
##     No Information Rate : 0.5815          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5601          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 0  Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.6378 0.0000000 0.125000  0.66292  0.33643   0.9876
## Specificity            0.9958 0.9994699 0.998934  0.99660  0.99422   0.4896
## Pos Pred Value         0.9640 0.0000000 0.809524  0.95161  0.87879   0.7289
## Neg Pred Value         0.9392 0.9699074 0.969243  0.96705  0.92322   0.9661
## Prevalence             0.1512 0.0300771 0.034961  0.09152  0.11080   0.5815
## Detection Rate         0.0964 0.0000000 0.004370  0.06067  0.03728   0.5743
## Detection Prevalence   0.1000 0.0005141 0.005398  0.06375  0.04242   0.7879
## Balanced Accuracy      0.8168 0.4997350 0.561967  0.82976  0.66532   0.7386

Galaxy

For the Galaxy, our original Random Forest model has the highest accuracy and Kappa score, followed by C5.0 and Random Forest with RFE.

The overall accuracy of our Galaxy predictions is slightly lower than that of the iPhone; however, the Galaxy model has a slightly higher ‘Pos Pred Value’ for class ‘5’ (see the check after the matrix below). This means that if we are trying to predict positive sentiment towards phones, our Galaxy model would be slightly more precise.

RFConfMat_galaxy
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1    2    3    4    5
##          0  355    0    3    3    7   24
##          1    0    0    1    0    1    2
##          2    1    0   17    0    1    6
##          3    1    3    0  217    6   24
##          4    4    0    1    5  122   24
##          5  147  111  113  127  288 2257
## 
## Overall Statistics
##                                         
##                Accuracy : 0.7667        
##                  95% CI : (0.7531, 0.78)
##     No Information Rate : 0.6037        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.5349        
##                                         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: 0 Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity           0.69882 0.000000 0.125926  0.61648  0.28706   0.9658
## Specificity           0.98900 0.998935 0.997859  0.99034  0.99013   0.4876
## Pos Pred Value        0.90561 0.000000 0.680000  0.86454  0.78205   0.7417
## Neg Pred Value        0.95602 0.970520 0.969319  0.96271  0.91844   0.9034
## Prevalence            0.13123 0.029450 0.034875  0.09093  0.10979   0.6037
## Detection Rate        0.09171 0.000000 0.004392  0.05606  0.03152   0.5831
## Detection Prevalence  0.10127 0.001033 0.006458  0.06484  0.04030   0.7861
## Balanced Accuracy     0.84391 0.499468 0.561892  0.80341  0.63860   0.7267
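For a direct comparison, the class-5 precision can be pulled from the two confusion matrix objects (a sketch; the byClass row and column names follow caret’s standard output):

RFConfMat_galaxy$byClass["Class: 5", "Pos Pred Value"]     # ~0.742
RFConfMat_iphone_RFE$byClass["Class: 5", "Pos Pred Value"] # ~0.729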

Reducing sentiment levels

We can try to increase model accuracy further by reducing the number of levels of the dependent variable we are trying to predict, from six (0-5) to four (1-4).

We will combine ‘0’ and ‘1’ into ‘1’, leave ‘2’ and ‘3’ as they are, and combine ‘4’ and ‘5’ into ‘4’.

iPhone

We will use our best model, Random Forest RFE for iPhone:

#### Recoding the dependent variable ####

iphone_new<-iphone_df

# recode sentiment to combine factor levels 0 & 1 and 4 & 5
iphone_new$iphonesentiment <- recode(iphone_new$iphonesentiment, '0' = 1, '1' = 1, '2' = 2, '3' = 3, '4' = 4, '5' = 4) 

iphone_new$iphonesentiment<-as.factor(iphone_new$iphonesentiment)

#Using best model
iphone_new_RFE<- iphone_new[,predictors(rfeResults_iphone)]
iphone_new_RFE$iphonesentiment<-iphone_new$iphonesentiment

#Recursive Feature Elimination

training_set_iphone_RFE_new<-iphone_new_RFE[IndexTrain,]
testing_set_iphone_RFE_new<-iphone_new_RFE[-IndexTrain,]

#Random Forest

RandomForestFit_iphone_RFE_new<-train(iphonesentiment~.,data = training_set_iphone_RFE_new,method='rf', trControl=responses_ctrl_iphone)

RandomForestFit_iphone_RFE_new_results<- RandomForestFit_iphone_RFE_new$results

RandomForestFit_iphone_RFE_new_results
##   mtry  Accuracy     Kappa  AccuracySD    KappaSD
## 1    2 0.7880660 0.4153755 0.003572810 0.01138683
## 2   14 0.8475164 0.6243311 0.004832229 0.01284738
## 3   26 0.8424506 0.6153212 0.005958003 0.01420274
PredsRF_iphone_RFE_new<- predict(RandomForestFit_iphone_RFE_new, newdata=testing_set_iphone_RFE_new)

RFConfMat_iphone_RFE_new<- confusionMatrix(data=PredsRF_iphone_RFE_new,testing_set_iphone_RFE_new$iphonesentiment)

RF_iphone_RFE_accuracy_new<-postResample(PredsRF_iphone_RFE_new,testing_set_iphone_RFE_new$iphonesentiment)

Galaxy

##   mtry  Accuracy     Kappa  AccuracySD    KappaSD
## 1    2 0.7861719 0.3711835 0.005611703 0.01764327
## 2   30 0.8410405 0.5896094 0.004894515 0.01176612
## 3   58 0.8356206 0.5797126 0.006034997 0.01233874

Model comparison

We can clearly see that for both phones we have managed to improve on our previous best models.

iPhone

recode_iphone_model_names<-c('RF RFE', 'RF RFE recode')
iphone_recode_accuracy<- c(RF_iphone_RFE_accuracy[1],RF_iphone_RFE_accuracy_new[1])
iphone_recode_kappa <-c(RF_iphone_RFE_accuracy[2],RF_iphone_RFE_accuracy_new[2])

iphone_recode_accuracy_df<-data.frame(recode_iphone_model_names,iphone_recode_accuracy,iphone_recode_kappa)

ggplot(iphone_recode_accuracy_df, aes(x=iphone_recode_accuracy,y=iphone_recode_kappa,color=recode_iphone_model_names))+geom_point(aes(shape=recode_iphone_model_names, color=recode_iphone_model_names),size=5) + labs(title='iPhone recode comparison')+xlab('Accuracy')+ylab('Kappa')

Galaxy

Predicting on unseen data

As discussed in the introduction, we have collated a dataset with over 20,000 unseen instances on which to predict sentiment. This was done using Amazon Web Services (EC2, EMR and S3) to scan and map information from the Common Crawl (https://commoncrawl.org/big-picture/what-we-do/).

We will use our best recoded models, which predict the four combined sentiment levels (1-4), for the predictions below.

iPhone

####iPhone####

iphone_unseen_df<- read.csv("C:\\Users\\domsi\\OneDrive\\Documents\\M5\\Outputs\\iphoneLargeMatrix.csv")

#Using best model: Recursive Feature Elimination

iphone_unseen_df_RFE<- iphone_unseen_df[,predictors(rfeResults_iphone)]

#Random Forest

PredsRF_iphone_RFE_unseen<- predict(RandomForestFit_iphone_RFE_new, newdata=iphone_unseen_df_RFE)

print("Our predictions for iphone sentiment are as below:")
## [1] "Our predictions for iphone sentiment are as below:"
summary(PredsRF_iphone_RFE_unseen)
##     1     2     3     4 
## 16034   517   895  6841

Samsung Galaxy

## [1] "Our predictions for iphone sentiment are as below:"
##     1     2     3     4 
## 15849   556   815  7067

Conclusion

The probability distribution graphs below are remarkably similar. The Galaxy has slightly higher positive sentiment (4) and slightly lower negative sentiment (1) than the iPhone. Our analysis showed that the most important features for both the Galaxy and the iPhone tend to relate to the iPhone, so it is no surprise that we see similar predictions. The predictions also show far more negative sentiment towards these phones than positive. This could be because people are more likely to write about a bad experience, so we may want to adjust our perspective and weight negative reviews less heavily when judging overall sentiment.

plot_ly(iphone_unseen_df_RFE, x=~PredsRF_iphone_RFE_unseen, type='histogram',histnorm='probability') %>% layout(title='Probability distribution of sentiment towards iPhones')
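The corresponding Galaxy plot would be built the same way; a sketch, assuming the hidden Galaxy prediction object is named PredsRF_galaxy_RFE_unseen (a hypothetical name, as the Galaxy prediction code is not shown):

plot_ly(x=~PredsRF_galaxy_RFE_unseen, type='histogram', histnorm='probability') %>% layout(title='Probability distribution of sentiment towards Samsung Galaxy phones')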