Our objective is to identify one optimized model to predict the overall sentiment toward iPhones and one optimized model to predict the overall sentiment toward Samsung Galaxy handsets.
We are working with a government health agency to create a suite of smartphone medical apps for use by aid workers in developing countries. This suite of apps will enable the aid workers to manage local health conditions by facilitating communication with medical professionals located elsewhere. The government agency requires that the app suite be bundled with one model of smartphone. This will help them limit purchase costs and ensure uniformity when training aid workers to use the device.
We were given a short list of devices that are all capable of executing the app suite’s functions, and we were asked to examine the prevalence of positive and negative attitudes toward these devices on the web. Our goal is to narrow this list down to one device by conducting a broad-based web sentiment analysis to gain insight into the attitudes toward the devices.
For the second part of the project, we will investigate predictive models using machine learning methods. We will apply these models to the Large Matrix file to complete the analysis of overall sentiment toward both iPhone and Samsung Galaxy.
For training, we have one iPhone dataset and one Galaxy dataset that we will use to develop our models to predict the overall sentiment. They include the counts of relevant words (sentiment lexicons) for about 12,000 instances (web pages). The values in the device sentiment columns represent the overall sentiment toward the device on a scale of 0-5, manually entered by a team of coworkers who read each web page and rated its sentiment. The scale is as follows:

* 0: very negative
* 1: negative
* 2: somewhat negative
* 3: somewhat positive
* 4: positive
* 5: very positive
For the prediction dataset, we will use the large matrix we collected from AWS in the previous step. The modeling process follows.
# call libraries and set seed.
library(readr)
library(caret)
library(plotly)
library(corrplot)
library(doParallel)
library(dplyr)
set.seed(123)
# set up parallel processing
# Find how many cores are on your machine
detectCores()
# Create Cluster with desired number of cores.
cl <- makeCluster(2)
# Register Cluster
registerDoParallel(cl)
# Confirm how many cores are now "assigned" to R and RStudio
getDoParWorkers() # Result 2
# Stop the cluster after all modeling tasks are finished (run this at the end).
stopCluster(cl)
# upload small matrix for training
iphone_smallmatrix <- read.csv("iphone_smallmatrix_labeled_8d.csv")
names(iphone_smallmatrix)
## [1] "iphone" "samsunggalaxy" "sonyxperia" "nokialumina"
## [5] "htcphone" "ios" "googleandroid" "iphonecampos"
## [9] "samsungcampos" "sonycampos" "nokiacampos" "htccampos"
## [13] "iphonecamneg" "samsungcamneg" "sonycamneg" "nokiacamneg"
## [17] "htccamneg" "iphonecamunc" "samsungcamunc" "sonycamunc"
## [21] "nokiacamunc" "htccamunc" "iphonedispos" "samsungdispos"
## [25] "sonydispos" "nokiadispos" "htcdispos" "iphonedisneg"
## [29] "samsungdisneg" "sonydisneg" "nokiadisneg" "htcdisneg"
## [33] "iphonedisunc" "samsungdisunc" "sonydisunc" "nokiadisunc"
## [37] "htcdisunc" "iphoneperpos" "samsungperpos" "sonyperpos"
## [41] "nokiaperpos" "htcperpos" "iphoneperneg" "samsungperneg"
## [45] "sonyperneg" "nokiaperneg" "htcperneg" "iphoneperunc"
## [49] "samsungperunc" "sonyperunc" "nokiaperunc" "htcperunc"
## [53] "iosperpos" "googleperpos" "iosperneg" "googleperneg"
## [57] "iosperunc" "googleperunc" "iphonesentiment"
str(iphone_smallmatrix$iphonesentiment)
## int [1:12973] 0 0 0 0 0 4 4 0 0 0 ...
plot_ly(iphone_smallmatrix, x = ~iphonesentiment, type = 'histogram')
sum(is.na(iphone_smallmatrix))
## [1] 0
There are no missing values in our dataset.
Near Zero Variance
Zero or near-zero-variance variables are predictors that are constant, or nearly constant, across all samples. Such predictors carry little information, and they can even break some of the models we are building, so we will identify and remove all near-zero-variance variables.
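To see which predictors qualify, caret's nearZeroVar() can report its per-column diagnostics when saveMetrics = TRUE (a minimal sketch; the actual removal is performed in the code below):
# inspect the variance metrics caret uses to flag near-zero-variance predictors
nzvMetrics <- nearZeroVar(iphone_smallmatrix, saveMetrics = TRUE)
head(nzvMetrics)     # freqRatio, percentUnique, zeroVar and nzv flags per predictor
sum(nzvMetrics$nzv)  # number of predictors flagged as near zero variance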
Recursive Feature Elimination
Recursive Feature Elimination (RFE) builds models to implement backwards selection of predictors based on a predictor-importance ranking. The least important predictors are removed and the model is rebuilt; the process repeats recursively until the optimal subset of predictors is found. This subset can then be used to produce an accurate model.
Remove Near-Zero-Variance Variables
# Examine Feature Variance
nzv <- nearZeroVar(iphone_smallmatrix, saveMetrics = FALSE)
iphoneNZV <- iphone_smallmatrix[,-nzv]
str(iphoneNZV)
Apply Recursive Feature Elimination
Here, we use random forest to build the models for RFE, with repeated cross-validation as the fit control to avoid overfitting.
# sample the data before using RFE
iphoneSample <- iphone_smallmatrix[sample(1:nrow(iphone_smallmatrix), 1000, replace=FALSE),]
# Set up rfeControl with random forest, repeated cross validation and no updates
ctrl <- rfeControl(functions = rfFuncs,
                   method = "repeatedcv",
                   repeats = 5,
                   verbose = FALSE)
# Use rfe and omit the response variable (attribute 59 iphonesentiment)
rfeResults <- rfe(iphoneSample[, 1:58],
                  iphoneSample$iphonesentiment,
                  sizes = (1:58),
                  rfeControl = ctrl)
# Get results
rfeResults
## Recursive feature selection
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
## The top 5 variables (out of 20): iphone, googleandroid, iphonedispos, iphonedisneg, samsunggalaxy
# Plot results
plot(rfeResults, type=c("g", "o"))
As shown in the plot, the optimal subset is the one with the lowest RMSE (Root Mean Squared Error); it consists of 20 variables.
# create new data set with rfe recommended features
iphoneRFE <- iphone_smallmatrix[,predictors(rfeResults)]
# add the dependent variable to iphoneRFE
iphoneRFE$iphonesentiment <- iphone_smallmatrix$iphonesentiment
The data type of the dependent variable “iphonesentiment” must be factor for the classification models to run correctly. Here, we convert “iphonesentiment” to a factor in all three datasets.
## Preprocessing
iphone_smallmatrix$iphonesentiment <- as.factor(iphone_smallmatrix$iphonesentiment)
iphoneNZV$iphonesentiment <- as.factor(iphoneNZV$iphonesentiment)
iphoneRFE$iphonesentiment <- as.factor(iphoneRFE$iphonesentiment)
Cross-validation is a resampling procedure. It splits a given data sample into k groups, each called a fold. In each iteration, one fold is held out as the test set while the model is trained on the remaining k-1 folds; this repeats until every fold has served as the test set. The final results/scores are averaged across iterations, which helps avoid overfitting.
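For illustration, caret's createFolds() generates such folds directly; a minimal sketch using our small matrix:
# split the rows into 10 folds; 'folds' holds the held-out row indices of each fold
folds <- createFolds(iphone_smallmatrix$iphonesentiment, k = 10)
# a model would be trained on the other nine folds, tested on folds[[i]],
# and the scores averaged over all ten iterations
str(folds[[1]])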
Both the C5.0 and Random Forest algorithms belong to the tree-model family. A tree model is a flowchart-like structure that splits the sample on the most informative variable at each node. Each node then splits again, and the process repeats until the subsamples cannot be split any further.
C5.0 is robust at processing a large number of variables, such as our dataset with its 58 independent variables, and usually does not take long to train.
Random Forest, on the other hand, usually takes longer to train, depending on the number of trees. It works by constructing multiple decision trees and outputting the mode of their classes for classification problems. Its main advantage is that it avoids the overfitting that a single decision tree is prone to.
SVM (Support Vector Machine) works by assigning examples to one category or the other. Given labeled training examples, the algorithm builds an optimal hyperplane that categorizes new examples. With kernel functions, SVM can build non-linear decision boundaries, which makes it able to classify higher-dimensional problems.
The K-Nearest Neighbors algorithm is based on the assumption that similar things exist close to each other, where K is the number of neighbors considered. It captures similarity by calculating the distance between two points. KNN is simple and easy to implement, but it can become time-consuming as the dataset grows.
KKNN is a weighted K-Nearest Neighbors classifier: it weights the neighbors according to their distances. It can also be used for regression.
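As a toy illustration of the distance-and-weighting idea (this is the concept only, not the kknn package API):
# Euclidean distance between two points, the similarity measure behind KNN
euclid <- function(a, b) sqrt(sum((a - b)^2))
d <- euclid(c(1, 2), c(4, 6))  # distance = 5
# KKNN-style weighting: nearer neighbors get larger weights, e.g. inverse distance
w <- 1 / d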
10-fold cross validation
#10 fold cross validation
fitControl <- trainControl(method = "cv", number = 10)
We will first train on the original small-matrix dataset with all four algorithms. We will identify the optimal algorithm and then apply it to the near-zero-variance dataset and the recursive-feature-elimination dataset.
# define a 70%/30% train/test split of the small matrix
inT_iphone_smallmatrix <- createDataPartition(iphone_smallmatrix$iphonesentiment,
                                              p = .70, list = FALSE)
iphone_s_train <- iphone_smallmatrix[inT_iphone_smallmatrix, ]
iphone_s_test <- iphone_smallmatrix[-inT_iphone_smallmatrix, ]
#train C5.0 model
iphoneSmallC50 <- train(iphonesentiment ~ ., data = iphone_s_train, method = "C5.0",
                        trControl = fitControl)
#prediction
iphoneSmallC50_pred <- predict(iphoneSmallC50, iphone_s_test)
## random forest
iphoneSmallrf <- train(iphonesentiment ~ ., data = iphone_s_train, method = "rf",
                       trControl = fitControl)
#prediction
iphoneSmallrf_pred <- predict(iphoneSmallrf, iphone_s_test)
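#train SVM model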
iphoneSmallSVM <- train(iphonesentiment ~ ., data = iphone_s_train,
                        method = "svmLinear2", trControl = fitControl)
#prediction
iphoneSmallSVM_pred <- predict(iphoneSmallSVM, iphone_s_test)
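#train weighted k-nearest neighbors model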
iphoneSmallkknn <- train(iphonesentiment ~ ., data = iphone_s_train, method = "kknn",
                         trControl = fitControl)
#prediction
iphoneSmall_predkknn <- predict(iphoneSmallkknn, iphone_s_test)
Kappa Score
The Kappa score compares observed accuracy with expected accuracy. Observed accuracy is simply the proportion of instances classified correctly. Expected accuracy is the accuracy any random classifier would be expected to achieve; it is determined by the number of instances of each class together with the distribution of the classifier's predictions. In general, the Kappa score is less misleading than accuracy alone.
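The hand computation below makes the definition concrete (the 2x2 confusion matrix is made up; postResample() below reports the same statistic for our models):
# Cohen's kappa by hand on a made-up 2x2 confusion matrix
cm <- matrix(c(40, 10, 5, 45), nrow = 2)          # rows: predicted, cols: actual
po <- sum(diag(cm)) / sum(cm)                     # observed accuracy = 0.85
pe <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2  # expected accuracy by chance = 0.5
(po - pe) / (1 - pe)                              # kappa = 0.7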
#evaluate C5.0
postResample(iphoneSmallC50_pred, iphone_s_test$iphonesentiment)
#evaluate random forest
postResample(iphoneSmallrf_pred, iphone_s_test$iphonesentiment)
#evaluate SVM
postResample(iphoneSmallSVM_pred, iphone_s_test$iphonesentiment)
#evaluate KKNN
postResample(iphoneSmall_predkknn, iphone_s_test$iphonesentiment)
As the metrics above show, C5.0 and Random Forest performed best among the four models. Since C5.0 takes significantly less time to train than Random Forest, we will use C5.0 for the remaining training jobs.
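As an optional side-by-side check (a minimal sketch, assuming the four train() objects above are still in memory), caret's resamples() can summarize the cross-validation metrics of all four models at once:
# collect and compare the cross-validation results of the four models
cvResults <- resamples(list(C50 = iphoneSmallC50, RF = iphoneSmallrf,
                            SVM = iphoneSmallSVM, KKNN = iphoneSmallkknn))
summary(cvResults)  # Accuracy and Kappa distributions across the folds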
## create training and testing datasets using iphoneNZV
inT_iphoneNZV <- createDataPartition(iphoneNZV$iphonesentiment,
                                     p = .70, list = FALSE)
iphoneNZV_train <- iphoneNZV[inT_iphoneNZV, ]
iphoneNZV_test <- iphoneNZV[-inT_iphoneNZV, ]
## apply C5.0
iphoneNZVc50 <- train(iphonesentiment ~ ., data = iphoneNZV_train, method = "C5.0",
                      trControl = fitControl)
# make predictions
iphoneNZVc50_pred <- predict(iphoneNZVc50, iphoneNZV_test)
# create training and testing datasets using iphoneRFE
inT_iphoneRFE <- createDataPartition(iphoneRFE$iphonesentiment,
                                     p = .70, list = FALSE)
iphoneRFE_train <- iphoneRFE[inT_iphoneRFE, ]
iphoneRFE_test <- iphoneRFE[-inT_iphoneRFE, ]
# apply C5.0
iphoneRFEc50 <- train(iphonesentiment ~ ., data = iphoneRFE_train, method = "C5.0",
                      trControl = fitControl)
# make predictions
iphoneRFEc50_pred <- predict(iphoneRFEc50, iphoneRFE_test)
#evaluate nzv c5.0
postResample(iphoneNZVc50_pred, iphoneNZV_test$iphonesentiment)
#evaluate rfe c5.0
postResample(iphoneRFEc50_pred, iphoneRFE_test$iphonesentiment)
The NZV model and the RFE model have very similar accuracy and Kappa scores, but neither is ideal yet.
By now we have tried a variety of tuned algorithms on a variety of datasets that were preprocessed and feature-selected. The accuracy is still not desirable, and it is no longer improving.
We note that the “iphonesentiment” variable currently has six levels. We will combine “very negative” with “negative” and “very positive” with “positive” to reduce the number of factor levels. Hopefully this will help the algorithm perform better.
Create a dataset with new factor levels for “iphonesentiment”.
# create a new dataset that will be used for recoding sentiment
iphoneRC <- iphone_smallmatrix
# recode sentiment to combine factor levels 0 & 1 and 4 & 5
# (iphonesentiment was already converted to a factor above, so the
# replacement values must be characters for dplyr::recode)
iphoneRC$iphonesentiment <- recode(iphoneRC$iphonesentiment, '0' = '1', '1' = '1',
                                   '2' = '2', '3' = '3', '4' = '4', '5' = '4')
# ensure iphonesentiment is a factor
iphoneRC$iphonesentiment <- as.factor(iphoneRC$iphonesentiment)
# inspect results
str(iphoneRC$iphonesentiment)
## Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 4 4 1 1 1 ...
Train the dataset and make predictions.
# create training and testing sets
inT_iphoneRC <- createDataPartition(iphoneRC$iphonesentiment,
                                    p = .70, list = FALSE)
iphoneRC_train <- iphoneRC[inT_iphoneRC, ]
iphoneRC_test <- iphoneRC[-inT_iphoneRC, ]
## apply c5.0 algorithm
iphoneC5.0_RC <- train(iphonesentiment ~ ., data = iphoneRC_train, method = "C5.0",
                       trControl = fitControl)
# make predictions
iphoneC5.0_RCpred <- predict(iphoneC5.0_RC, iphoneRC_test)
Evaluate the model
#evaluate
postResample(iphoneC5.0_RCpred, iphoneRC_test$iphonesentiment)
Model | Accuracy | Kappa |
---|---|---|
iphoneC5.0_recode | 0.85 | 0.63 |
By using the recoding method, we were able to increase the accuracy to 0.85 and the Kappa score to 0.63. This is a good result! Next, we will work on the Galaxy dataset, and then compare the two models using additional statistics.
Since the Galaxy small matrix is a separate dataset, the best-performing model on the iPhone small matrix may not be the best one for Galaxy. We applied to the Galaxy dataset all the preprocessing, feature selection, and feature engineering steps we applied to the iPhone dataset, and the final optimal model was again C5.0 trained on the recoded dataset. Here we omit the repeated steps and show only the construction of the final model on the recoded dataset.
# upload galaxy dataset
galaxySmall <- read_csv("galaxy_smallmatrix_labeled_9d.csv")
# apply recode method
# create a new dataset that will be used for recoding sentiment
galaxyRC <- galaxySmall
# recode sentiment to combine factor levels 0 & 1 and 4 & 5
galaxyRC$galaxysentiment <- recode(galaxyRC$galaxysentiment, '0' = 1, '1' = 1, '2' = 2, '3' = 3, '4' = 4, '5' = 4)
# make galaxysentiment a factor
galaxyRC$galaxysentiment <- as.factor(galaxyRC$galaxysentiment)
# create training and testing datasets
inT_galaxyRC <- createDataPartition(galaxyRC$galaxysentiment,
                                    p = .70, list = FALSE)
galaxyRC_train <- galaxyRC[inT_galaxyRC, ]
galaxyRC_test <- galaxyRC[-inT_galaxyRC, ]
# apply C5.0 to train the model
galaxyC5.0_RC <- train(galaxysentiment ~ ., data = galaxyRC_train, method = "C5.0",
                       trControl = fitControl)
# make predictions
galaxyC5.0_RCpred <- predict(galaxyC5.0_RC, galaxyRC_test)
# evaluate
postResample(galaxyC5.0_RCpred, galaxyRC_test$galaxysentiment)
Model | Accuracy | Kappa |
---|---|---|
galaxyC5.0_recode | 0.84 | 0.59 |
The accuracy and Kappa scores for the iPhone and Galaxy models are shown above. Next we examine sensitivity and specificity.
Sensitivity measures the proportion of actual positives that are correctly predicted; specificity measures the proportion of actual negatives that are correctly predicted. Comparing these two statistics is useful for classification tasks in which certain classes are more important than others.
# confusion matrix for both models
cm_galaxyC5.0_RC <- confusionMatrix(galaxyC5.0_RCpred, galaxyRC_test$galaxysentiment)
cm_iphoneC5.0_RC <- confusionMatrix(iphoneC5.0_RCpred, iphoneRC_test$iphonesentiment)
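Per-class sensitivity and specificity can be read from the byClass slot of each confusionMatrix object; a quick look using the two objects above:
# per-class sensitivity and specificity for both models
cm_iphoneC5.0_RC$byClass[, c("Sensitivity", "Specificity")]
cm_galaxyC5.0_RC$byClass[, c("Sensitivity", "Specificity")]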
The sensitivity of the “Somewhat Negative” class is very low for both models, as is the sensitivity of “Somewhat Positive” for Galaxy. However, as the predictions below show, both “Somewhat Negative” and “Somewhat Positive” make up only a small share of the overall sentiment distribution, so for this task we are safe to say our models are optimal and trustworthy.
# upload the large matrix, delete the id column
iphoneLargeMatrix <- read_csv("LargeMatrix copy.csv")
iphoneLargeMatrix$id <- NULL
# make the prediction for iphone
iphoneLargeMatrix_pred <- predict(iphoneC5.0_RC, iphoneLargeMatrix)
summary(iphoneLargeMatrix_pred)
# make the prediction for galaxy (again removing the id column first)
galaxylargeMatrix <- read_csv("LargeMatrix copy.csv")
galaxylargeMatrix$id <- NULL
galaxylargeMatrix_pred <- predict(galaxyC5.0_RC, galaxylargeMatrix)
summary(galaxylargeMatrix_pred)
Handset | Negative | Somewhat Negative | Somewhat Positive | Positive |
---|---|---|---|---|
iPhone | 12644 | 850 | 1876 | 14048 |
Galaxy | 12383 | 874 | 1767 | 14394 |
Ratings are concentrated in the Negative and Positive classes. The middle classes (somewhat negative and somewhat positive) make up only about 9% of all ratings analyzed.
Positive ratings slightly outnumber negative ones for both handsets: about 5% more positive than negative ratings for iPhone, and about 7.5% more for Galaxy, measured relative to the combined positive and negative counts (see the sketch below).
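As a sanity check, these percentages can be reproduced from the counts in the table above (a minimal sketch):
# reproduce the quoted percentages from the summary counts
iphone <- c(neg = 12644, swneg = 850, swpos = 1876, pos = 14048)
galaxy <- c(neg = 12383, swneg = 874, swpos = 1767, pos = 14394)
round(100 * (iphone["swneg"] + iphone["swpos"]) / sum(iphone), 1)                  # ~9.3% neutral
round(100 * (iphone["pos"] - iphone["neg"]) / (iphone["pos"] + iphone["neg"]), 1)  # ~5.3% for iPhone
round(100 * (galaxy["pos"] - galaxy["neg"]) / (galaxy["pos"] + galaxy["neg"]), 1)  # ~7.5% for Galaxy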
The analysis above shows little difference between the rating distributions for iPhone and Galaxy; both devices received slightly more positive than negative ratings.
This concludes our sentiment analysis project.