Our objective is to identify one optimized model to predict the overall sentiment toward iPhones and one optimized model to predict the overall sentiment toward Samsung Galaxy handsets.
We are working with a government health agency to create a suite of smartphone medical apps for use by aid workers in developing countries. This suite of apps will enable the aid workers to manage local health conditions by facilitating communication with medical professionals located elsewhere. The government agency requires that the app suite be bundled with one model of smartphone. This will help them limit purchase costs and ensure uniformity when training aid workers to use the device.
We were given a short list of devices that are all capable of executing the app suite’s functions, and we were asked to examine the prevalence of positive and negative attitudes toward these devices on the web. Our goal is to narrow this list down to one device by conducting a broad-based web sentiment analysis to gain insight into the attitudes toward the devices.
For the second part of the project, we will investigate predictive models using machine learning methods. We will apply these models to the Large Matrix file to complete the analysis of overall sentiment toward both iPhone and Samsung Galaxy.
For training, we have one iPhone dataset and one Galaxy dataset that we will use to develop our models to predict the overall sentiment. They include the counts of relevant words (sentiment lexicons) for about 12,000 instances (web pages). The values in the device sentiment columns represent the overall sentiment toward the device on a scale of 0-5, manually entered by a team of coworkers who read each web page and rated its sentiment. The scale is as follows:

* 0: very negative
* 1: negative
* 2: somewhat negative
* 3: somewhat positive
* 4: positive
* 5: very positive
For the prediction dataset, we will use the large matrix we collected from AWS in the previous step. The modeling process follows.
# call libraries and set seed.
library(readr)
library(caret)
library(plotly)
library(corrplot)
library(doParallel)
library(dplyr)
set.seed(123)
# set up parallel processing
# Find how many cores are on your machine
detectCores()
# Create Cluster with desired number of cores.
cl <- makeCluster(2)
# Register Cluster
registerDoParallel(cl)
# Confirm how many cores are now "assigned" to R and RStudio
getDoParWorkers() # Result 2
# Stop the cluster after all modeling tasks are finished (run this at the end).
stopCluster(cl)
# upload small matrix for training
iphone_smallmatrix <- read.csv("iphone_smallmatrix_labeled_8d.csv")
names(iphone_smallmatrix)
## [1] "iphone" "samsunggalaxy" "sonyxperia" "nokialumina"
## [5] "htcphone" "ios" "googleandroid" "iphonecampos"
## [9] "samsungcampos" "sonycampos" "nokiacampos" "htccampos"
## [13] "iphonecamneg" "samsungcamneg" "sonycamneg" "nokiacamneg"
## [17] "htccamneg" "iphonecamunc" "samsungcamunc" "sonycamunc"
## [21] "nokiacamunc" "htccamunc" "iphonedispos" "samsungdispos"
## [25] "sonydispos" "nokiadispos" "htcdispos" "iphonedisneg"
## [29] "samsungdisneg" "sonydisneg" "nokiadisneg" "htcdisneg"
## [33] "iphonedisunc" "samsungdisunc" "sonydisunc" "nokiadisunc"
## [37] "htcdisunc" "iphoneperpos" "samsungperpos" "sonyperpos"
## [41] "nokiaperpos" "htcperpos" "iphoneperneg" "samsungperneg"
## [45] "sonyperneg" "nokiaperneg" "htcperneg" "iphoneperunc"
## [49] "samsungperunc" "sonyperunc" "nokiaperunc" "htcperunc"
## [53] "iosperpos" "googleperpos" "iosperneg" "googleperneg"
## [57] "iosperunc" "googleperunc" "iphonesentiment"
str(iphone_smallmatrix$iphonesentiment)
## int [1:12973] 0 0 0 0 0 4 4 0 0 0 ...
plot_ly(iphone_smallmatrix, x = ~iphonesentiment, type = 'histogram')
sum(is.na(iphone_smallmatrix))
## [1] 0
There are no missing values in our dataset.
Near Zero Variance
Zero or near-zero-variance variables are predictors that are constant, or nearly constant, across all samples. Such predictors carry little information, and they can even break some of the models we are building, so we will identify and remove all near-zero-variance variables.
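To see which predictors qualify, caret's nearZeroVar() can report its per-column diagnostics when saveMetrics = TRUE (a minimal sketch; the actual removal is performed in the code below):
# inspect the variance metrics caret uses to flag near-zero-variance predictors
nzvMetrics <- nearZeroVar(iphone_smallmatrix, saveMetrics = TRUE)
head(nzvMetrics)     # freqRatio, percentUnique, zeroVar and nzv flags per predictor
sum(nzvMetrics$nzv)  # number of predictors flagged as near zero variance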
Recursive Feature Elimination
Recursive Feature Elimination (RFE) builds models to implement backwards selection of predictors based on a predictor-importance ranking. The least important predictors are removed and the model is rebuilt; the process repeats recursively until the optimal subset of predictors is found. This subset can then be used to produce an accurate model.
Remove Near-Zero-Variance Variables
# Examine Feature Variance
nzv <- nearZeroVar(iphone_smallmatrix, saveMetrics = FALSE)
iphoneNZV <- iphone_smallmatrix[,-nzv]
str(iphoneNZV)
Apply Recursive Feature Elimination
Here, we use random forest to build the models for RFE, with repeated cross-validation as the fit control to avoid overfitting.
# sample the data before using RFE
iphoneSample <- iphone_smallmatrix[sample(1:nrow(iphone_smallmatrix), 1000, replace=FALSE),]
# Set up rfeControl with random forest, repeated cross validation and no updates
ctrl <- rfeControl(functions = rfFuncs,
                   method = "repeatedcv",
                   repeats = 5,
                   verbose = FALSE)
# Use rfe and omit the response variable (attribute 59 iphonesentiment)
rfeResults <- rfe(iphoneSample[, 1:58],
                  iphoneSample$iphonesentiment,
                  sizes = (1:58),
                  rfeControl = ctrl)
# Get results
rfeResults
## Recursive feature selection
## Outer resampling method: Cross-Validated (10 fold, repeated 5 times)
## The top 5 variables (out of 20): iphone, googleandroid, iphonedispos, iphonedisneg, samsunggalaxy
# Plot results
plot(rfeResults, type=c("g", "o"))
As shown in the plot, the optimal subset is the one with the lowest RMSE (Root Mean Squared Error); it consists of 20 variables.
# create new data set with rfe recommended features
iphoneRFE <- iphone_smallmatrix[,predictors(rfeResults)]
# add the dependent variable to iphoneRFE
iphoneRFE$iphonesentiment <- iphone_smallmatrix$iphonesentiment
The data type of the dependent variable “iphonesentiment” must be factor for the classification models to run correctly. Here, we convert “iphonesentiment” to a factor in all three datasets.
## Preprocessing
iphone_smallmatrix$iphonesentiment <- as.factor(iphone_smallmatrix$iphonesentiment)
iphoneNZV$iphonesentiment <- as.factor(iphoneNZV$iphonesentiment)
iphoneRFE$iphonesentiment <- as.factor(iphoneRFE$iphonesentiment)
Cross-validation is a resampling procedure. It splits a given data sample into k groups, each called a fold. In each iteration, one fold is held out as the test set while the model is trained on the remaining k-1 folds; this repeats until every fold has served as the test set. The final results/scores are averaged across iterations, which helps avoid overfitting.
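For illustration, caret's createFolds() generates such folds directly; a minimal sketch using our small matrix:
# split the rows into 10 folds; 'folds' holds the held-out row indices of each fold
folds <- createFolds(iphone_smallmatrix$iphonesentiment, k = 10)
# a model would be trained on the other nine folds, tested on folds[[i]],
# and the scores averaged over all ten iterations
str(folds[[1]])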
Both the C5.0 and Random Forest algorithms belong to the tree-model family. A tree model is a flowchart-like structure that splits the sample on the most informative variable at each node. Each node then splits again, and the process repeats until the subsamples cannot be split any further.
C5.0 is robust at processing a large number of variables, such as our dataset with its 58 independent variables, and usually does not take long to train.
Random Forest, on the other hand, usually takes longer to train, depending on the number of trees. It works by constructing multiple decision trees and outputting the mode of their classes for classification problems. Its main advantage is that it avoids the overfitting that a single decision tree is prone to.
SVM (Support Vector Machine) works by assigning examples to one category or the other. Given labeled training examples, the algorithm builds an optimal hyperplane that categorizes new examples. With kernel functions, SVM can build non-linear decision boundaries, which makes it able to classify higher-dimensional problems.
The K-Nearest Neighbors algorithm is based on the assumption that similar things exist close to each other, where K is the number of neighbors considered. It captures similarity by calculating the distance between two points. KNN is simple and easy to implement, but it can become time-consuming as the dataset grows.
KKNN is a weighted K-Nearest Neighbors classifier: it weights the neighbors according to their distances. It can also be used for regression.
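As a toy illustration of the distance-and-weighting idea (this is the concept only, not the kknn package API):
# Euclidean distance between two points, the similarity measure behind KNN
euclid <- function(a, b) sqrt(sum((a - b)^2))
d <- euclid(c(1, 2), c(4, 6))  # distance = 5
# KKNN-style weighting: nearer neighbors get larger weights, e.g. inverse distance
w <- 1 / d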
10-fold cross validation
#10 fold cross validation
fitControl <- trainControl(method = "cv", number = 10)
We will first train on the original small-matrix dataset with all four algorithms. We will identify the optimal algorithm and then apply it to the near-zero-variance dataset and the recursive-feature-elimination dataset.
# define a 70%/30% train/test split of the small matrix
inT_iphone_smallmatrix <- createDataPartition(iphone_smallmatrix$iphonesentiment,
                                              p = .70, list = FALSE)
iphone_s_train <- iphone_smallmatrix[inT_iphone_smallmatrix, ]
iphone_s_test <- iphone_smallmatrix[-inT_iphone_smallmatrix, ]
#train C5.0 model
iphoneSmallC50 <- train(iphonesentiment ~ ., data = iphone_s_train, method = "C5.0",
                        trControl = fitControl)
#prediction
iphoneSmallC50_pred <- predict(iphoneSmallC50, iphone_s_test)
## random forest
iphoneSmallrf <- train(iphonesentiment ~ ., data = iphone_s_train, method = "rf",
                       trControl = fitControl)
#prediction
iphoneSmallrf_pred <- predict(iphoneSmallrf, iphone_s_test)
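#train SVM model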
iphoneSmallSVM <- train(iphonesentiment ~ ., data = iphone_s_train,
                        method = "svmLinear2", trControl = fitControl)
#prediction
iphoneSmallSVM_pred <- predict(iphoneSmallSVM, iphone_s_test)
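#train weighted k-nearest neighbors model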
iphoneSmallkknn <- train(iphonesentiment ~ ., data = iphone_s_train, method = "kknn",
                         trControl = fitControl)
#prediction
iphoneSmall_predkknn <- predict(iphoneSmallkknn, iphone_s_test)
Kappa Score
The Kappa score compares observed accuracy with expected accuracy. Observed accuracy is simply the proportion of instances classified correctly. Expected accuracy is the accuracy any random classifier would be expected to achieve; it is determined by the number of instances of each class together with the distribution of the classifier's predictions. In general, the Kappa score is less misleading than accuracy alone.
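The hand computation below makes the definition concrete (the 2x2 confusion matrix is made up; postResample() below reports the same statistic for our models):
# Cohen's kappa by hand on a made-up 2x2 confusion matrix
cm <- matrix(c(40, 10, 5, 45), nrow = 2)          # rows: predicted, cols: actual
po <- sum(diag(cm)) / sum(cm)                     # observed accuracy = 0.85
pe <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2  # expected accuracy by chance = 0.5
(po - pe) / (1 - pe)                              # kappa = 0.7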
#evaluate C5.0
postResample(iphoneSmallC50_pred, iphone_s_test$iphonesentiment)
#evaluate random forest
postResample(iphoneSmallrf_pred, iphone_s_test$iphonesentiment)
#evaluate SVM
postResample(iphoneSmallSVM_pred, iphone_s_test$iphonesentiment)
#evaluate KKNN
postResample(iphoneSmall_predkknn, iphone_s_test$iphonesentiment)
As the metrics above show, C5.0 and Random Forest performed best among the four models. Since C5.0 takes significantly less time to train than Random Forest, we will use C5.0 for the remaining training jobs.
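As an optional side-by-side check (a minimal sketch, assuming the four train() objects above are still in memory), caret's resamples() can summarize the cross-validation metrics of all four models at once:
# collect and compare the cross-validation results of the four models
cvResults <- resamples(list(C50 = iphoneSmallC50, RF = iphoneSmallrf,
                            SVM = iphoneSmallSVM, KKNN = iphoneSmallkknn))
summary(cvResults)  # Accuracy and Kappa distributions across the folds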
## create training and testing datasets using iphoneNZV
inT_iphoneNZV <- createDataPartition(iphoneNZV$iphonesentiment,
                                     p = .70, list = FALSE)
iphoneNZV_train <- iphoneNZV[inT_iphoneNZV, ]
iphoneNZV_test <- iphoneNZV[-inT_iphoneNZV, ]
## apply C5.0
iphoneNZVc50 <- train(iphonesentiment ~ ., data = iphoneNZV_train, method = "C5.0",
                      trControl = fitControl)
# make predictions
iphoneNZVc50_pred <- predict(iphoneNZVc50, iphoneNZV_test)
# create training and testing datasets using iphoneRFE
inT_iphoneRFE <- createDataPartition(iphoneRFE$iphonesentiment,
                                     p = .70, list = FALSE)
iphoneRFE_train <- iphoneRFE[inT_iphoneRFE, ]
iphoneRFE_test <- iphoneRFE[-inT_iphoneRFE, ]
# apply C5.0
iphoneRFEc50 <- train(iphonesentiment ~ ., data = iphoneRFE_train, method = "C5.0",
                      trControl = fitControl)
# make predictions
iphoneRFEc50_pred <- predict(iphoneRFEc50, iphoneRFE_test)
#evaluate nzv c5.0
postResample(iphoneNZVc50_pred, iphoneNZV_test$iphonesentiment)
#evaluate rfe c5.0
postResample(iphoneRFEc50_pred, iphoneRFE_test$iphonesentiment)
The NZV model and the RFE model have very similar accuracy and Kappa scores, but neither is ideal yet.
By now we have tried a variety of tuned algorithms on a variety of datasets that were preprocessed and feature-selected. The accuracy is still not desirable, and it is no longer improving.
We note that the “iphonesentiment” variable currently has six levels. We will combine “very negative” with “negative” and “very positive” with “positive” to reduce the number of factor levels. Hopefully this will help the algorithm perform better.
Create a dataset with new factor levels for “iphonesentiment”.
# create a new dataset that will be used for recoding sentiment
iphoneRC <- iphone_smallmatrix
# recode sentiment to combine factor levels 0 & 1 and 4 & 5
# (iphonesentiment was already converted to a factor above, so the
# replacement values must be characters for dplyr::recode)
iphoneRC$iphonesentiment <- recode(iphoneRC$iphonesentiment, '0' = '1', '1' = '1',
                                   '2' = '2', '3' = '3', '4' = '4', '5' = '4')
# ensure iphonesentiment is a factor
iphoneRC$iphonesentiment <- as.factor(iphoneRC$iphonesentiment)
# inspect results
str(iphoneRC$iphonesentiment)
## Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 4 4 1 1 1 ...
Train the dataset and make predictions.
# create training and testing sets
inT_iphoneRC <- createDataPartition(iphoneRC$iphonesentiment,
                                    p = .70, list = FALSE)
iphoneRC_train <- iphoneRC[inT_iphoneRC, ]
iphoneRC_test <- iphoneRC[-inT_iphoneRC, ]
## apply c5.0 algorithm
iphoneC5.0_RC <- train(iphonesentiment ~ ., data = iphoneRC_train, method = "C5.0",
                       trControl = fitControl)
# make predictions
iphoneC5.0_RCpred <- predict(iphoneC5.0_RC, iphoneRC_test)
Evaluate the model
#evaluate
postResample(iphoneC5.0_RCpred, iphoneRC_test$iphonesentiment)
Model | Accuracy | Kappa |
---|---|---|
iphoneC5.0_recode | 0.85 | 0.63 |
By using the recoding method, we were able to increase the accuracy to 0.85 and the Kappa score to 0.63. This is a good result! Next, we will work on the Galaxy dataset, and then compare the two models using additional statistics.
Since the Galaxy small matrix is a separate dataset, the best-performing model on the iPhone small matrix may not be the best one for Galaxy. We applied to the Galaxy dataset all the preprocessing, feature selection, and feature engineering steps we applied to the iPhone dataset, and the final optimal model was again C5.0 trained on the recoded dataset. Here we omit the repeated steps and show only the construction of the final model on the recoded dataset.
# upload galaxy dataset
galaxySmall <- read_csv("galaxy_smallmatrix_labeled_9d.csv")
# apply recode method
# create a new dataset that will be used for recoding sentiment
galaxyRC <- galaxySmall
# recode sentiment to combine factor levels 0 & 1 and 4 & 5
galaxyRC$galaxysentiment <- recode(galaxyRC$galaxysentiment, '0' = 1, '1' = 1, '2' = 2, '3' = 3, '4' = 4, '5' = 4)
# make galaxysentiment a factor
galaxyRC$galaxysentiment <- as.factor(galaxyRC$galaxysentiment)
# create training and testing datasets
inT_galaxyRC <- createDataPartition(galaxyRC$galaxysentiment,
                                    p = .70, list = FALSE)
galaxyRC_train <- galaxyRC[inT_galaxyRC, ]
galaxyRC_test <- galaxyRC[-inT_galaxyRC, ]
# apply C5.0 to train the model
galaxyC5.0_RC <- train(galaxysentiment ~ ., data = galaxyRC_train, method = "C5.0",
                       trControl = fitControl)
# make predictions
galaxyC5.0_RCpred <- predict(galaxyC5.0_RC, galaxyRC_test)
# evaluate
postResample(galaxyC5.0_RCpred, galaxyRC_test$galaxysentiment)
Model | Accuracy | Kappa |
---|---|---|
galaxyC5.0_recode | 0.84 | 0.59 |
The accuracy and Kappa scores for the iPhone and Galaxy models are shown above. Next we examine sensitivity and specificity.
Sensitivity measures the proportion of actual positives that are correctly predicted; specificity measures the proportion of actual negatives that are correctly predicted. Comparing these two statistics is useful for classification tasks in which certain classes are more important than others.
# confusion matrix for both models
cm_galaxyC5.0_RC <- confusionMatrix(galaxyC5.0_RCpred, galaxyRC_test$galaxysentiment)
cm_iphoneC5.0_RC <- confusionMatrix(iphoneC5.0_RCpred, iphoneRC_test$iphonesentiment)
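Per-class sensitivity and specificity can be read from the byClass slot of each confusionMatrix object; a quick look using the two objects above:
# per-class sensitivity and specificity for both models
cm_iphoneC5.0_RC$byClass[, c("Sensitivity", "Specificity")]
cm_galaxyC5.0_RC$byClass[, c("Sensitivity", "Specificity")]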
The sensitivity of the “Somewhat Negative” class is very low for both models, as is the sensitivity of “Somewhat Positive” for Galaxy. However, as the predictions below show, both “Somewhat Negative” and “Somewhat Positive” make up only a small share of the overall sentiment distribution, so for this task we are safe to say our models are optimal and trustworthy.
# upload the large matrix, delete the id column
iphoneLargeMatrix <- read_csv("LargeMatrix copy.csv")
iphoneLargeMatrix$id <- NULL
# make the prediction for iphone
iphoneLargeMatrix_pred <- predict(iphoneC5.0_RC, iphoneLargeMatrix)
summary(iphoneLargeMatrix_pred)
# make the prediction for galaxy (again removing the id column first)
galaxylargeMatrix <- read_csv("LargeMatrix copy.csv")
galaxylargeMatrix$id <- NULL
galaxylargeMatrix_pred <- predict(galaxyC5.0_RC, galaxylargeMatrix)
summary(galaxylargeMatrix_pred)
Handset | Negative | Somewhat Negative | Somewhat Positive | Positive |
---|---|---|---|---|
iPhone | 12644 | 850 | 1876 | 14048 |
Galaxy | 12383 | 874 | 1767 | 14394 |
Ratings are concentrated in the Negative and Positive classes. The middle classes (somewhat negative and somewhat positive) make up only about 9% of all ratings analyzed.
Positive ratings slightly outnumber negative ones for both handsets: about 5% more positive than negative ratings for iPhone, and about 7.5% more for Galaxy, measured relative to the combined positive and negative counts (see the sketch below).
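As a sanity check, these percentages can be reproduced from the counts in the table above (a minimal sketch):
# reproduce the quoted percentages from the summary counts
iphone <- c(neg = 12644, swneg = 850, swpos = 1876, pos = 14048)
galaxy <- c(neg = 12383, swneg = 874, swpos = 1767, pos = 14394)
round(100 * (iphone["swneg"] + iphone["swpos"]) / sum(iphone), 1)                  # ~9.3% neutral
round(100 * (iphone["pos"] - iphone["neg"]) / (iphone["pos"] + iphone["neg"]), 1)  # ~5.3% for iPhone
round(100 * (galaxy["pos"] - galaxy["neg"]) / (galaxy["pos"] + galaxy["neg"]), 1)  # ~7.5% for Galaxy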
The analysis above shows little difference between the rating distributions for iPhone and Galaxy; both devices received slightly more positive than negative ratings.
This concludes our sentiment analysis project.