The goal of this analysis is to identify one optimized model to predict the overall sentiment toward the iPhone and one optimized model to predict the overall sentiment toward Samsung Galaxy handsets.

Overview

Background Info

  • We are working with a government health agency to create a suite of smartphone medical apps that facilitate communication between medical professionals and aid workers in developing countries. The agency will provide aid workers with technical support services, but it needs to limit that support to a single model of smartphone and operating system. This will also help limit purchase costs and ensure uniformity when training aid workers to use the device.

Objective

  • We were given a short list of devices that are all capable of executing the app suite’s functions, and we were asked to analyze the positive and negative attitudes toward these smartphones online in order to narrow the list down to a single device. An extensive web sentiment analysis will be performed to gain insight into the attitudes toward these devices.

  • For the second part of the project, we will try various feature selection and feature engineering methods to generate the best predictive models, and we will then apply those models to the Large Matrix.csv file to complete our sentiment analysis for both the iPhone and the Samsung Galaxy.

Dataset Info

  • Two training datasets: one iPhone dataset and one Galaxy dataset. They include the counts of relevant words (sentiment lexicons) for about 12,000 instances (web pages). The values in the device sentiment columns represent the overall sentiment toward the device on a scale of 0-5. The overall sentiment value was manually entered by a team of coworkers who read each web page and rated its sentiment. The scale is as follows: 0 = very negative, 1 = negative, 2 = somewhat negative, 3 = somewhat positive, 4 = positive, 5 = very positive.
  • Testing dataset: the Large Matrix.csv file collected from Common Crawl in Part 1 of this analysis (see Sentiment Analysis Part 1).

iPhone Section

Preprocessing and Feature Selection

Initial Exploration

# Call libraries and set seed
library(caret)
library(readr)
library(plotly)
library(dplyr)
library(tidyr) 
library(corrplot)
library(ggplot2) 
set.seed(520)
  • Because we are processing three large datasets, parallel processing is applied to reduce overall computing time.
# Set up parallel processing
library(doParallel)
# Find how many cores are on your machine
detectCores()  # result [8]
# Create Cluster with desired number of cores. 
cl <- makeCluster(4)
# Register Cluster
registerDoParallel(cl)
# Confirm how many cores are now "assigned" to R and RStudio
getDoParWorkers() # result [4]
## Import iphone dataset
iphoneDF <- read_csv("C:/Dev/Data Analysis/Course 4/Task 3/iphone_smallmatrix_labeled_8d.csv")
  • All independent variables are numeric.
  • Examples of Attributes:
    • iOS – counts mentions of iOS on a webpage
    • iphonecampos – counts positive sentiment mentions of the iphone camera
    • galaxydisneg – counts negative sentiment mentions of the Galaxy display
    • htcperunc – counts the unclear sentiment mentions of HTC performance
# Check general data structure/info
str(iphoneDF)
summary(iphoneDF)
# Check all attributes of iphone DF
names(iphoneDF)
##  [1] "iphone"          "samsunggalaxy"   "sonyxperia"      "nokialumina"    
##  [5] "htcphone"        "ios"             "googleandroid"   "iphonecampos"   
##  [9] "samsungcampos"   "sonycampos"      "nokiacampos"     "htccampos"      
## [13] "iphonecamneg"    "samsungcamneg"   "sonycamneg"      "nokiacamneg"    
## [17] "htccamneg"       "iphonecamunc"    "samsungcamunc"   "sonycamunc"     
## [21] "nokiacamunc"     "htccamunc"       "iphonedispos"    "samsungdispos"  
## [25] "sonydispos"      "nokiadispos"     "htcdispos"       "iphonedisneg"   
## [29] "samsungdisneg"   "sonydisneg"      "nokiadisneg"     "htcdisneg"      
## [33] "iphonedisunc"    "samsungdisunc"   "sonydisunc"      "nokiadisunc"    
## [37] "htcdisunc"       "iphoneperpos"    "samsungperpos"   "sonyperpos"     
## [41] "nokiaperpos"     "htcperpos"       "iphoneperneg"    "samsungperneg"  
## [45] "sonyperneg"      "nokiaperneg"     "htcperneg"       "iphoneperunc"   
## [49] "samsungperunc"   "sonyperunc"      "nokiaperunc"     "htcperunc"      
## [53] "iosperpos"       "googleperpos"    "iosperneg"       "googleperneg"   
## [57] "iosperunc"       "googleperunc"    "iphonesentiment"
plot_ly(iphoneDF, x = ~iphonesentiment, type = 'histogram')

  • The histogram above shows the distribution of iPhone sentiment scores, which is strongly skewed toward the positive end of the scale (5).
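  • To quantify that skew, a quick sketch tabulating the counts and proportions of each sentiment level (iphonesentiment is still numeric at this point):
# Quantify the skew: raw counts and proportions per sentiment level
table(iphoneDF$iphonesentiment)
round(prop.table(table(iphoneDF$iphonesentiment)), 3)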

## Check for missing values
sum(is.na(iphoneDF)) 
  • There are no missing values in the iPhone dataset.
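  • corrplot was loaded earlier; as a quick sketch (not part of the feature selection below), the pairwise correlations between the lexicon counts can be visualized to spot redundant features:
# Sketch: visualize pairwise correlations between all numeric features
corrData <- cor(iphoneDF)
corrplot(corrData, tl.cex = 0.5)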

Feature Selection

Examine Feature Variance

Remove Zero Variance
  • Zero and near-zero variance variables are predictors that are constant or almost constant across samples. These predictors are not only uninformative, they can also break some models. Therefore, we need to remove them before feeding the data into the models.
# nearZeroVar() with saveMetrics = FALSE returns a vector of column indexes
nzv <- nearZeroVar(iphoneDF, saveMetrics = FALSE) 
# Create a new data set and remove near zero variance features
iphoneNZV <- iphoneDF[,-nzv]
str(iphoneNZV)
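  • To see why columns were flagged, nearZeroVar() can also return the underlying metrics; a minimal sketch with saveMetrics = TRUE:
# Sketch: inspect frequency-ratio and percent-unique metrics per column
nzvMetrics <- nearZeroVar(iphoneDF, saveMetrics = TRUE)
head(nzvMetrics)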
Recursive Feature Elimination (RFE)
  • Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the optimal subset of features is reached. This backwards selection is based on a ranking of predictor importance, and it can be used to produce an accurate model.
  • Caret’s rfe() function with random forest will evaluate subsets of every requested size and return a final list of recommended features.
  • We used 10-fold cross-validation repeated 5 times to reduce the risk of overfitting.
# Let's sample the data before using RFE
iphoneSample <- iphoneDF[sample(1:nrow(iphoneDF), 1000, replace=FALSE),]
# Set up RFE Control with randomforest, repeated cross validation and no updates
ctrl <- rfeControl(functions = rfFuncs, 
                   method = "repeatedcv",
                   repeats = 5,
                   verbose = FALSE)
# Use RFE and omit the response variable (attribute 59 iphonesentiment) 
rfeResults <- rfe(iphoneSample[,1:58], 
                  iphoneSample$iphonesentiment, 
                  sizes= c(1:58), 
                  rfeControl= ctrl)

# Get results
rfeResults
  • The top 5 variables (out of 11) are iphone, googleandroid, iphonedisunc, samsunggalaxy, iphonedispos
## Plot results
plot(rfeResults, type=c("g", "o"))

  • The plot above shows that the optimal subset is the one with the lowest RMSE (Root Mean Squared Error); it consists of 11 variables.
## Create new data set with rfe recommended features
iphoneRFE <- iphoneDF[,predictors(rfeResults)]

## Add the dependent variable to iphoneRFE
iphoneRFE$iphonesentiment <- iphoneDF$iphonesentiment

Preprocessing

  • Because this is a classification problem, we need to convert the dependent variable “iphonesentiment” to a factor.
# Factorize the dependent variable 
iphoneDF$iphonesentiment <- as.factor(iphoneDF$iphonesentiment)
iphoneNZV$iphonesentiment <- as.factor(iphoneNZV$iphonesentiment)
iphoneRFE$iphonesentiment <- as.factor(iphoneRFE$iphonesentiment)
str(iphoneDF$iphonesentiment)
##  Factor w/ 6 levels "0","1","2","3",..: 1 1 1 1 1 5 5 1 1 1 ...

Model Building on iphoneDF

C5.0 and Random Forest
  • C5.0 and Random Forest both belong to the tree-model family. Both work by splitting the sample on the field that provides the maximum information gain. Each subsample defined by the first split is then split again, usually on a different field, and the process repeats until the subsamples cannot be split any further. Finally, the lowest-level splits are reexamined, and those that do not contribute significantly to the value of the model are removed or pruned.
  • C5.0 is more robust when processing a large number of variables and usually takes less training time.
  • Random Forest normally requires a longer training time because it consists of a large number of relatively uncorrelated trees that protect each other from their individual errors. It is more conservative, but it also prevents the overfitting that a single decision tree is prone to.
Support Vector Machine
  • A Support Vector Machine (SVM) works by assigning examples to one category or the other. The algorithm builds an optimal hyperplane that categorizes new examples. An advantage of SVM is that, with kernel functions, it can build non-linear decision boundaries and classify higher-dimensional problems. SVM is highly preferred by many because it produces significant accuracy with less computing power.
KKNN
  • The K-Nearest Neighbor classifier requires selecting the number of nearest neighbors, and it captures similarity by calculating the distance between two points. It is simple and easy to implement. However, it is a slow, “lazy” learner: it requires the full training data at prediction time, depends on the value of k, and suffers from the curse of dimensionality because it is distance-based.
  • KKNN is a weighted K-Nearest Neighbor classifier: it weights the neighbors according to their distances. It can also be used for regression.
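  • For reference, caret tunes the “kknn” method over kmax, distance, and kernel. A sketch with an explicit tuning grid (the values below are illustrative assumptions, not the grid used in this analysis):
# Sketch: an explicit (illustrative) tuning grid for method = "kknn"
kknnGrid <- expand.grid(kmax = c(5, 7, 9),
                        distance = 2,
                        kernel = c("rectangular", "triangular", "optimal"))
# Pass it to train() via the tuneGrid argument, e.g.:
# train(iphonesentiment~., data = training, method = "kknn",
#       trControl = fitControl, tuneGrid = kknnGrid)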

Model Training and Testing on original iphoneDF

  • First, we will train all four models on the “out of the box” (original) iPhone dataset to find the best-performing classifier.
  • Second, we will apply the best classifier to the near-zero variance dataset and the recursive feature elimination dataset.

Model Training on original iphoneDF

  • C5.0 Model
# Define a 70%/30% train/test split of the iphoneDF
inTraining <- createDataPartition(iphoneDF$iphonesentiment, p = .70, list = FALSE)
training <- iphoneDF[inTraining,]
testing <- iphoneDF[-inTraining,]

# 10 fold cross validation 
fitControl <- trainControl(method = "cv", number = 10)

# C5.0 training
C50 <- train(iphonesentiment~., data = training, method = "C5.0", trControl=fitControl)

# Testing 
prediction_C50 <- predict(C50, testing)
  • Random Forest Model
# Use RandomForest with 10-fold cross validation 
rf <- train(iphonesentiment~., data = training, method = "rf", trControl=fitControl)

# Testing 
prediction_rf<- predict(rf, testing)
  • SVM Model
# Use SVM with 10-fold cross validation 
svm <- train(iphonesentiment~., data = training, method = "svmLinear", trControl=fitControl)

# Testing 
prediction_svm<- predict(svm, testing)
  • KKNN Model
# Use KKNN with 10-fold cross validation 
kknn <- train(iphonesentiment~., data = training, method = "kknn", trControl=fitControl)

# Testing 
prediction_kknn<- predict(kknn, testing)

Model Evaluation on original iphoneDF

  • Kappa Score: the Kappa score is a metric that compares observed accuracy with expected accuracy. Observed accuracy is simply the proportion of instances classified correctly across the entire confusion matrix. Expected accuracy is the accuracy that a random classifier would be expected to achieve given the marginal totals of the confusion matrix. In general, Kappa is less misleading than accuracy alone.
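  • As a minimal sketch of that definition, Kappa can be computed by hand from a confusion matrix (using the C5.0 predictions above as an example):
# Sketch: compute Kappa by hand from a confusion matrix
cmTab <- table(predicted = prediction_C50, actual = testing$iphonesentiment)
observedAcc <- sum(diag(cmTab)) / sum(cmTab)
expectedAcc <- sum(rowSums(cmTab) * colSums(cmTab)) / sum(cmTab)^2
(observedAcc - expectedAcc) / (1 - expectedAcc)  # Kappa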
# Evaluate C5.0 Model
postResample(prediction_C50, testing$iphonesentiment)
# Evaluate RF Model
postResample(prediction_rf, testing$iphonesentiment)
# Evaluate SVM Model
postResample(prediction_svm, testing$iphonesentiment)
# Evaluate KKNN Model
postResample(prediction_kknn, testing$iphonesentiment)
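  • To compare the four classifiers in one view, caret’s resamples() can collect the cross-validation results for a side-by-side summary and plot (a sketch; bwplot() comes from lattice, which caret attaches):
# Sketch: compare cross-validation results across the four models
resamps <- resamples(list(C50 = C50, RF = rf, SVM = svm, KKNN = kknn))
summary(resamps)
bwplot(resamps)  # box-and-whisker comparison of Accuracy and Kappa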

  • The comparison above shows that Random Forest is the best of the four models. Therefore, we will use the Random Forest model for the remainder of the analysis.
Model Training on iphoneNZV Dataset
# Define a 70%/30% train/test split of the iphoneNZV
inTraining_iphoneNZV <- createDataPartition(iphoneNZV$iphonesentiment, p = .70, list = FALSE)
training_NZV <- iphoneNZV[inTraining_iphoneNZV,]
testing_NZV <- iphoneNZV[-inTraining_iphoneNZV,]

# Apply RandomForest with 10-fold cross validation on iphoneNZV
rf_NZV <- train(iphonesentiment~., data = training_NZV, method = "rf", trControl=fitControl)

# Testing 
prediction_rf_NZV<- predict(rf_NZV, testing_NZV)
Model Training on iphoneRFE Dataset
# Define a 70%/30% train/test split of the iphoneRFE
inTraining_iphoneRFE <- createDataPartition(iphoneRFE$iphonesentiment, p = .70, list = FALSE)
training_RFE <- iphoneRFE[inTraining_iphoneRFE,]
testing_RFE <- iphoneRFE[-inTraining_iphoneRFE,]

# Apply RandomForest with 10-fold cross validation on iphoneRFE
rf_RFE <- train(iphonesentiment~., data = training_RFE, method = "rf", trControl=fitControl)
# Testing 
prediction_rf_RFE<- predict(rf_RFE, testing_RFE)
Model Evaluation on iphoneNZV and RFE Datasets with Random Forest Model
postResample(prediction_rf_NZV, testing_NZV$iphonesentiment)
postResample(prediction_rf_RFE, testing_RFE$iphonesentiment)
NZV and RFE Comparison

  • The RFE dataset shows a higher Accuracy and Kappa value than the NZV dataset with the Random Forest model, as the sketch below tabulates.
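# Sketch: NZV vs. RFE results in one table
round(rbind(NZV = postResample(prediction_rf_NZV, testing_NZV$iphonesentiment),
            RFE = postResample(prediction_rf_RFE, testing_RFE$iphonesentiment)), 4)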

Feature Engineering-PCA

  • Principal Component Analysis (PCA) is a form of feature engineering that replaces the original features with principal components: uncorrelated mathematical representations of their variance.
  • Caret’s preProcess() function takes “pca” as a method argument. Setting the thresh parameter states the proportion of variance you want PCA to capture in the model.
Create object containing centered, scaled PCA components from training and testing set
# Data = training and testing from iphoneDF (no feature selection) 
# Excluded the dependent variable and set threshold to .95
preprocessParams <- preProcess(training[,-59], method=c("center", "scale", "pca"), thresh = 0.95)
print(preprocessParams)
## Created from 9083 samples and 58 variables
## 
## Pre-processing:
##   - centered (58)
##   - ignored (0)
##   - principal component signal extraction (58)
##   - scaled (58)
## 
## PCA needed 26 components to capture 95 percent of the variance
# Use predict to apply the PCA parameters, create training, exclude the dependent
train.pca <- predict(preprocessParams, training[,-59])

# Add the dependent to training
train.pca$iphonesentiment <- training$iphonesentiment

# Use predict to apply the PCA parameters, create testing, exclude the dependent
test.pca <- predict(preprocessParams, testing[,-59])

# Add the dependent to testing
test.pca$iphonesentiment <- testing$iphonesentiment
Model Training and Testing on PCA Parameters
# 10 fold cross validation 
fitControl <- trainControl(method = "cv", number = 10)

# Apply RandomForest Model with 10-fold cross validation on Principal Component Analysis 
rf_pca <- train(iphonesentiment~., data = train.pca, method = "rf", trControl=fitControl)

# Testing 
prediction_rf_pca<- predict(rf_pca, test.pca)
Model Evaluation on PCA Parameter
# Evaluate the model
postResample(prediction_rf_pca, test.pca$iphonesentiment)
##  Accuracy     Kappa 
## 0.7606684 0.5382464

Feature Engineering-Recode

  • Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work well.
    • Initially, we have six sentiment categories for both iPhone and Galaxy, derived from the original datasets and ranging from 0 (very negative) to 5 (very positive).
    • After a series of trial and error across various algorithms, we found that several of the dependent variable’s factor levels had very poor Sensitivity and Balanced Accuracy (see the sketch below). Therefore, we decided to combine the redundant levels, reducing them to 4 levels ranging from 1 to 4 (negative to positive).
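    • A sketch of that per-class check, using caret’s confusionMatrix() on the Random Forest predictions from the out-of-the-box run above:
# Sketch: per-class Sensitivity and Balanced Accuracy for the 6-level model
cm <- confusionMatrix(prediction_rf, testing$iphonesentiment)
cm$byClass[, c("Sensitivity", "Balanced Accuracy")]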
Create a dataset with reduced factor level for dependent variable
# Create a new dataset that will be used for recoding sentiment
iphoneRC <- iphoneDF

# iphonesentiment was factorized earlier, so convert it back to numeric
# before recoding
iphoneRC$iphonesentiment <- as.numeric(as.character(iphoneRC$iphonesentiment))

# Recode sentiment to combine factor levels 0 & 1 and 4 & 5
iphoneRC$iphonesentiment <- recode(iphoneRC$iphonesentiment, '0' = 1, '1' = 1, '2' = 2, '3' = 3, '4' = 4, '5' = 4) 

# Make iphonesentiment a factor
iphoneRC$iphonesentiment <- as.factor(iphoneRC$iphonesentiment)

# Examine the data structure of 'iphonesentiment'
str(iphoneRC$iphonesentiment)
##  Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 4 4 1 1 1 ...
Model Training and Testing on iphoneRC Dataset
# Define a 70%/30% train/test split of the iphoneRC
inTrainingRC <- createDataPartition(iphoneRC$iphonesentiment, p = .70, list = FALSE)
training_RC <- iphoneRC[inTrainingRC,]
testing_RC <- iphoneRC[-inTrainingRC,]

# Use The Best RandomForest with 10-fold cross validation on Recoding the Dependant variable
rf_RC <- train(iphonesentiment~., data = training_RC, method = "rf", trControl=fitControl)
# Testing 
prediction_rf_RC<- predict(rf_RC, testing_RC) 
Model Evaluation on iphoneRC Dataset
# Evaluate the model
postResample(prediction_rf_RC, testing_RC$iphonesentiment)
##  Accuracy     Kappa 
## 0.8514139 0.6315268
  • This is by far the best algorithm and dataset combination: the Random Forest model applied to the recoded dependent variable! Next, we will work on the Galaxy dataset, and then apply these models to the Large Matrix.csv file for more statistical insights.

Samsung Galaxy Section

# Importing SamsungDF 
samsungDF <- read_csv("C:/Dev/Data Analysis/Course 4/Task 3/galaxy_smallmatrix_labeled_9d.csv")

# Create a new dataset that will be used for recoding sentiment
samsungRC <- samsungDF

# Recode sentiment to combine factor levels 0 & 1 and 4 & 5
samsungRC$galaxysentiment <- recode(samsungRC$galaxysentiment, '0' = 1, '1' = 1, '2' = 2, '3' = 3, '4' = 4, '5' = 4) 

# Make galaxysentiment a factor
samsungRC$galaxysentiment <- as.factor(samsungRC$galaxysentiment)

# Define a 70%/30% train/test split of the samsungRC
inTrainingRC_samsung <- createDataPartition(samsungRC$galaxysentiment, p = .70, list = FALSE)
training_RC_samsung <- samsungRC[inTrainingRC_samsung,]
testing_RC_samsung <- samsungRC[-inTrainingRC_samsung,]

# 10 fold cross validation 
fitControl <- trainControl(method = "cv", number = 10)

# Apply The Best Random Forest Model
rf_RC_samsung <- train(galaxysentiment~., data = training_RC_samsung, method = "rf", trControl=fitControl)

# Testing 
prediction_rf_RC_samsung<- predict(rf_RC_samsung, testing_RC_samsung) 

# Evaluate the model 
postResample(prediction_rf_RC_samsung, testing_RC_samsung$galaxysentiment)
## Accuracy    Kappa 
## 0.843750 0.599699
Accuracy and Kappa Score for iPhone and Galaxy Comparison
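A sketch collecting the final test-set metrics of the two recoded Random Forest models into one table:
# Sketch: final Accuracy and Kappa for the two recoded models
round(rbind(iPhone = postResample(prediction_rf_RC, testing_RC$iphonesentiment),
            Galaxy = postResample(prediction_rf_RC_samsung, testing_RC_samsung$galaxysentiment)), 4)
  • Both models perform comparably, with the iPhone model slightly ahead on Accuracy (0.8514 vs. 0.8438) and Kappa (0.6315 vs. 0.5997).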

Make Predictions on the Large Matrix

# Apply Model to Large Matrix (22461 Observations)
iphoneLargeMatrix <- read_csv("C:/Dev/Data Analysis/Course 4/Task 3/iphoneLargeMatrix.csv")

# Remove the 1st column id from iphoneLargeMatrix 
iphoneLargeMatrix$id <- NULL

# Make predictions for iphone
finalPred_iphone <- predict(rf_RC, iphoneLargeMatrix)
summary(finalPred_iphone)

Pie Chart Comparisons

library(plotly)
pieData <- data.frame(COM = c("negative", "somewhat negative", "somewhat positive", "positive"), 
                      values = c(9467, 614, 1407, 10790 ))
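The counts above were copied from summary(finalPred_iphone); a minimal sketch deriving them programmatically instead (labels assume the 4-level recode used for training):
# Sketch: build the pie-chart data directly from the prediction counts
pieData <- data.frame(COM = c("negative", "somewhat negative",
                              "somewhat positive", "positive"),
                      values = as.vector(summary(finalPred_iphone)))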

# Create pie chart
plot_ly(pieData, labels = ~COM, values = ~ values, type = "pie",
              textposition = 'inside',
              textinfo = 'label+percent', 
              insidetextfont = list(color = '#FFFFFF'),
              hoverinfo = 'text',
              text = ~paste( values),
              marker = list(line = list(color = '#FFFFFF', width = 1)),
              showlegend = F) %>%
  layout(title = 'iPhone Sentiment on Large Matrix', 
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
# Create two pie charts (side by side)
# summary(iphoneRC)  # gives the iphonesentiment level counts used in the values vector below
pieData_iphoneRC <- data.frame(COM = c("negative", "somewhat negative", "somewhat positive", "positive"), 
                    values = c( 2352, 454, 1188, 8979 ))
# summary(samsungRC)  # gives the galaxysentiment level counts used in the values vector below
pieData_samsungRC <- data.frame(COM = c("negative", "somewhat negative", "somewhat positive", "positive"), 
                              values = c( 2078, 450, 1175, 9208 ))

plot_ly(pieData_iphoneRC, labels = ~COM, values = ~ values, type = "pie", title = 'iPhone Sentiment', 
        domain = list(x = c(0, 0.5), y = c(0, 1))) %>%
  add_trace(data = pieData_samsungRC, labels = ~COM, values = ~ values, type = "pie", title = 'Samsung Sentiment', 
            domain = list(x = c(0.52, 1), y = c(0, 1)))
# Stop Cluster
stopCluster(cl)

Sentiment Comparison and Business Implications

Both training sets show predominantly positive sentiment toward their respective handsets, while the Large Matrix predictions for the iPhone are split more evenly between positive and negative. This concludes our sentiment analysis project.