Background

Helio is working with a government health agency to create a suite of smart phone medical apps for use by aid workers in developing countries. This suite of apps will enable the aid workers to manage local health conditions by facilitating communication with medical professionals located elsewhere (one of the apps, for example, enables specialists in communicable diseases to diagnose conditions by examining images and other patient data uploaded by local aid workers). The government agency requires that the app suite be bundled with one model of smart phone. Helio is in the process of evaluating potential handset models to determine which one to bundle their software with. After completing an initial investigation, Helio has created a short list of five devices that are all capable of executing the app suite’s functions. To help Helio narrow their list down to one device, they have asked us to examine the prevalence of positive and negative attitudes toward these devices on the web.

Objective

The objective is to investigate predictive models using machine learning methods. These models will be applied to the Large Matrix file to complete the analysis of overall sentiment toward both iPhone and Samsung Galaxy. In this task machine learning methods will be used to predict the overall sentiment toward iPhones.

Dataset Information

iphone_smallmatrix_labeled_8d.csv is the data matrix used in this task to develop models that predict the overall sentiment toward the iPhone. It contains the counts of relevant words (sentiment lexicons) for about 12,000 instances (web pages). The values in the device sentiment column (the last column in the matrix) represent the overall sentiment toward the device on a scale of 0 to 5. The overall sentiment value was entered manually by a team of coworkers who read each web page and rated its sentiment.

The scale is as follows:

0: very negative
1: negative
2: somewhat negative
3: somewhat positive
4: positive
5: very positive

Finally, the models will be applied to the Large Matrix, which was created on AWS in the previous task, to predict sentiment.

Load libraries

library(doParallel)
library(readxl)
library(dplyr)
library(tidyverse)
library(tidyr)
library(plotly)
library(corrplot)
library(caret)
library(e1071)
library(kknn)
library(rmarkdown)

Set up parallel processing

Since we are dealing with a large dataset, parallel processing is performed to reduce computing time.

# Find out how many cores there are on my laptop
detectCores() # Result = 8

# Create cluster with desired number of cores. 
cl <- makeCluster(2)

# Register cluster
registerDoParallel(cl)

# Confirm how many cores are now assigned to R and Rstudio
getDoParWorkers() # Result = 2

Import dataset

iphoneDF <- read.csv("C:/Users/Y.S. Kim/Desktop/Ubiqum/Sentiment Analysis/Dataset and csv/iphone_smallmatrix_labeled_8d.csv")

Inspect attributes within dataset

##  [1] "iphone"          "samsunggalaxy"   "sonyxperia"      "nokialumina"    
##  [5] "htcphone"        "ios"             "googleandroid"   "iphonecampos"   
##  [9] "samsungcampos"   "sonycampos"      "nokiacampos"     "htccampos"      
## [13] "iphonecamneg"    "samsungcamneg"   "sonycamneg"      "nokiacamneg"    
## [17] "htccamneg"       "iphonecamunc"    "samsungcamunc"   "sonycamunc"     
## [21] "nokiacamunc"     "htccamunc"       "iphonedispos"    "samsungdispos"  
## [25] "sonydispos"      "nokiadispos"     "htcdispos"       "iphonedisneg"   
## [29] "samsungdisneg"   "sonydisneg"      "nokiadisneg"     "htcdisneg"      
## [33] "iphonedisunc"    "samsungdisunc"   "sonydisunc"      "nokiadisunc"    
## [37] "htcdisunc"       "iphoneperpos"    "samsungperpos"   "sonyperpos"     
## [41] "nokiaperpos"     "htcperpos"       "iphoneperneg"    "samsungperneg"  
## [45] "sonyperneg"      "nokiaperneg"     "htcperneg"       "iphoneperunc"   
## [49] "samsungperunc"   "sonyperunc"      "nokiaperunc"     "htcperunc"      
## [53] "iosperpos"       "googleperpos"    "iosperneg"       "googleperneg"   
## [57] "iosperunc"       "googleperunc"    "iphonesentiment"

Class of iphonesentiment

The attribute we want to predict:

## [1] "integer"

Check for missing data (NA)

## [1] FALSE
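The code chunks behind the three outputs above are not shown in this report. As a minimal sketch, calls along the following lines would produce output of the same shape. `toyDF` is a hypothetical stand-in for `iphoneDF`, with made-up values.

```r
# Toy stand-in for iphoneDF (hypothetical values, three of the 59 columns)
toyDF <- data.frame(iphone          = c(1L, 0L, 2L),
                    ios             = c(0L, 0L, 1L),
                    iphonesentiment = c(0L, 5L, 3L))

# Attribute names
names(toyDF)

# Class of the attribute we want to predict
class(toyDF$iphonesentiment)

# TRUE if any value in the data frame is missing
anyNA(toyDF)
```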

Histogram of iphonesentiment
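The plot itself did not survive in this version of the report. The sketch below shows one way to draw such a histogram. It uses base graphics rather than plotly, and `toySentiment` is randomly generated, hypothetical data, not the real ratings.

```r
# Hypothetical ratings on the 0-5 scale (not the real data)
set.seed(1)
toySentiment <- sample(0:5, 200, replace = TRUE)

# One bar per rating: breaks are placed halfway between the integer values
h <- hist(toySentiment,
          breaks = seq(-0.5, 5.5, by = 1),
          main   = "Toy sentiment ratings",
          xlab   = "iphonesentiment")

# Number of instances per rating
h$counts
```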

Pre-processing (Feature engineering and feature selection)

Select only iphone/ios columns

# Select relevant columns for iphone
iphone_relevant_columns <- iphoneDF %>% 
  select(starts_with("iphone"), starts_with("ios"), iphonesentiment)

Checking for collinearity

cor(iphone_relevant_columns$iosperpos, iphone_relevant_columns$iosperneg)

## [1] 0.9323823

cor(iphone_relevant_columns$iosperpos, iphone_relevant_columns$iosperunc)

## [1] 0.9050794

cor(iphone_relevant_columns$ios, iphone_relevant_columns$iphone)

## [1] 0.9220603

# Remove columns due to collinearity 
iphone_nocol <- iphone_relevant_columns %>% 
  select(-c(iosperneg, iosperunc, ios))

Corrplot after removal of attributes due to collinearity

# create a new dataset
iphoneCOR <- iphone_nocol

Removal of attributes with (near) zero variance

# Examine feature variance: nearZeroVar() with saveMetrics = TRUE returns an object containing a table including: frequency ratio, percentage unique, zero variance and near zero variance 
nzvMetrics <- nearZeroVar(iphoneCOR, saveMetrics = TRUE)
nzvMetrics

##                  freqRatio percentUnique zeroVar   nzv
## iphone            5.041322    0.20812457   FALSE FALSE
## iphonecampos     10.524697    0.23124952   FALSE FALSE
## iphonecamneg     19.517529    0.13104139   FALSE  TRUE
## iphonecamunc     16.764205    0.16187466   FALSE FALSE
## iphonedispos      6.792440    0.24666615   FALSE FALSE
## iphonedisneg     10.084428    0.18499961   FALSE FALSE
## iphonedisunc     11.471875    0.20812457   FALSE FALSE
## iphoneperpos      9.297834    0.19270793   FALSE FALSE
## iphoneperneg     11.054137    0.16958298   FALSE FALSE
## iphoneperunc     13.018349    0.12333308   FALSE FALSE
## iphonesentiment   3.843017    0.04624990   FALSE FALSE
## iosperpos       153.373494    0.09249981   FALSE  TRUE

# nearZeroVar() with saveMetrics = FALSE returns a vector of column indices
nzv <- nearZeroVar(iphoneCOR, saveMetrics = FALSE)
nzv

## [1]  3 12

Near-zero-variance columns: iphonecamneg and iosperpos

# create a new data set and remove near zero variance features
iphoneNZV <- iphoneCOR[,-nzv]
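To make the two metrics in the table above more concrete, here is a rough base R sketch of how they can be computed for a single column. It assumes the documented defaults of nearZeroVar() (freqCut = 95/5, uniqueCut = 10). `nzvSketch` is a hypothetical helper written for illustration, not caret's implementation.

```r
# Hypothetical helper: frequency ratio and percent unique for one column
nzvSketch <- function(x, freqCut = 95 / 5, uniqueCut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  zeroVar <- length(counts) <= 1
  # Ratio of the most common value's count to the second most common value's count
  freqRatio <- if (zeroVar) 0 else counts[[1]] / counts[[2]]
  # Number of distinct values as a percentage of the number of rows
  percentUnique <- 100 * length(counts) / length(x)
  data.frame(freqRatio     = freqRatio,
             percentUnique = percentUnique,
             zeroVar       = zeroVar,
             nzv           = zeroVar ||
               (freqRatio > freqCut && percentUnique <= uniqueCut))
}

# A column dominated by zeros is flagged, a balanced column is not
rare     <- nzvSketch(c(rep(0, 98), 1, 2))
balanced <- nzvSketch(rep(0:1, 50))
rbind(rare = rare, balanced = balanced)
```

Under these assumed defaults, iphonecamneg (freqRatio 19.5) lies just above the cut-off of 19, while iphonecamunc (16.8) lies below it, which matches the nzv column in the table.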

Recursive Feature Elimination

# Sample the data (original dataset) before using RFE
set.seed(123)
iphoneSample <- iphoneDF[sample(1:nrow(iphoneDF), 1000, replace=FALSE),]

# Set up rfeControl with randomforest, repeated cross validation and no updates
ctrl <- rfeControl(functions = rfFuncs, 
                   method = "repeatedcv",
                   repeats = 5,
                   verbose = FALSE)

# Use rfe and omit the response variable (attribute sentiment) 
rfeResults <- rfe(iphoneSample[,1:58], 
                  iphoneSample$iphonesentiment, 
                  sizes=(1:58), 
                  rfeControl=ctrl)

RFE selected 18 attributes, the subset size with the lowest RMSE.

# create new data set with rfe recommended features
iphoneRFE <- iphoneDF[,predictors(rfeResults)]

# add the dependent variable to iphoneRFE
iphoneRFE$iphonesentiment <- iphoneDF$iphonesentiment

# variable importance
varImp(rfeResults)
# Overall
# iphone        70.487576
# googleandroid 33.075680
# iphonedispos  28.035451
# iphonedisneg  26.687539
# samsunggalaxy 23.868780

Recoding

The goal was to make 4 levels instead of 6:

1: negative
2: somewhat negative
3: somewhat positive
4: positive

# create new dataset that will be used for recoding sentiment
iphoneRC <- iphoneDF

# recode sentiment to combine factor levels 0 & 1 and 4 & 5
iphoneRC$iphonesentiment <- recode(iphoneRC$iphonesentiment, '0' = 1, '1' = 1, '2' = 2, '3' = 3, '4' = 4, '5' = 4)
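The mapping can be verified with a cross-tabulation of original against recoded values. The sketch below does this in base R with a lookup vector instead of dplyr's recode(), on a short hypothetical vector. The final conversion to a labelled factor is an assumption about a step the report does not show.

```r
# Hypothetical ratings on the original 0-5 scale
sentiment <- c(0L, 1L, 2L, 3L, 4L, 5L, 5L, 0L)

# Lookup table: 0 and 1 collapse into 1, 4 and 5 collapse into 4
lookup <- c(`0` = 1, `1` = 1, `2` = 2, `3` = 3, `4` = 4, `5` = 4)
recoded <- unname(lookup[as.character(sentiment)])

# Each original value should map to exactly one recoded value
table(original = sentiment, recoded = recoded)

# Assumed extra step: a labelled factor, so that models treat this as classification
recodedFactor <- factor(recoded,
                        levels = 1:4,
                        labels = c("Negative", "Somewhat Negative",
                                   "Somewhat Positive", "Positive"))
table(recodedFactor)
```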

Principal Component Analysis

# data = training and testing from iphoneDF (no feature selection) 
# create object containing centered, scaled PCA components from training set
# excluded the dependent variable and set threshold to .95

preprocessParams <- preProcess(trainsetDF[,-59], method=c("center", "scale", "pca"), thresh = 0.95)
print(preprocessParams)

# use predict to apply pca parameters, create training, exclude dependent
train.pca <- predict(preprocessParams, trainsetDF[,-59])

# add the dependent to training
train.pca$iphonesentiment <- trainsetDF$iphonesentiment

# use predict to apply pca parameters, create testing, exclude dependent
test.pca <- predict(preprocessParams, testsetDF[,-59])

# add the dependent to testing
test.pca$iphonesentiment <- testsetDF$iphonesentiment

PCA needed 24 components to capture 95 percent of the variance
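The objects trainsetDF and testsetDF are not defined in the code shown here and presumably come from a train/test split of iphoneDF. The sketch below illustrates the same idea on hypothetical toy data. It uses base R's prcomp() instead of preProcess() so that the 95 percent threshold is explicit, and the 70/30 split is an assumption.

```r
# Toy data: x1 and x2 are strongly correlated, x3 and x4 are independent
set.seed(123)
n <- 200
z <- rnorm(n)
toy <- data.frame(x1 = z + rnorm(n, sd = 0.1),
                  x2 = z + rnorm(n, sd = 0.1),
                  x3 = rnorm(n),
                  x4 = rnorm(n),
                  y  = sample(1:4, n, replace = TRUE))

# Assumed 70/30 train/test split
trainIdx <- sample(seq_len(n), size = round(0.7 * n))
trainToy <- toy[trainIdx, ]
testToy  <- toy[-trainIdx, ]

# Fit centered, scaled PCA on the training predictors only
pca <- prcomp(trainToy[, 1:4], center = TRUE, scale. = TRUE)

# Smallest number of components whose cumulative variance reaches 95 percent
cumVar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
nComp  <- which(cumVar >= 0.95)[1]

# Project both sets with the training-set rotation, then re-attach the outcome
trainPCA <- as.data.frame(pca$x[, 1:nComp, drop = FALSE])
testPCA  <- as.data.frame(predict(pca, testToy[, 1:4])[, 1:nComp, drop = FALSE])
trainPCA$y <- trainToy$y
testPCA$y  <- testToy$y
```

Fitting the PCA on the training set alone and reusing its parameters for the test set keeps test information out of the pre-processing, which is also what the preProcess()/predict() pair above does.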

Modeling

Models used:

C5.0
Random forest
SVM
kknn

Datasets used:

iphoneDF: “Out of the box” dataset, no attributes removed.
iphone_relevant_columns: Only iphone or ios related columns included as predictors.
iphoneCOR: Using iphone_relevant_columns after removal of columns due to collinearity, defined as a correlation coefficient of >0.90 between 2 independent variables
iphoneNZV: After removal of attributes with (near) zero variance
iphoneRFE: After recursive feature elimination
The two best models (C5.0 and Random forest) were selected and used for further feature engineering (RC, PCA, combination PCA/RFE/RC)
iphoneRC: After recoding the dependent attribute iphonesentiment (reducing the levels from 6 to 4)
iphone.PCA: Principal component analysis, using a variance threshold of 0.95
iphoneRFE_RC: Recursive feature elimination after recoding sentiment
iphoneRC_RFE: Recoding sentiment after recursive feature elimination
iphoneRC_PCA: PCA after recoding sentiment
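The training code is not included in this report. As a hedged sketch of how one model could be fitted to one dataset with caret, the example below trains C5.0 on hypothetical toy data. The 70/30 split, the 10-fold cross-validation settings and all object names are assumptions, and the C50 package must be installed for method = "C5.0".

```r
library(caret)

# Toy data with a two-level outcome (hypothetical, not the real matrix)
set.seed(123)
n <- 300
toy <- data.frame(iphone       = rpois(n, 2),
                  iphonedispos = rpois(n, 1),
                  iphonedisneg = rpois(n, 1))
toy$iphonesentiment <- factor(ifelse(toy$iphonedispos >= toy$iphonedisneg,
                                     "Positive", "Negative"))

# Stratified split into training and testing sets (assumed 70/30)
inTraining <- createDataPartition(toy$iphonesentiment, p = 0.70, list = FALSE)
trainToy <- toy[inTraining, ]
testToy  <- toy[-inTraining, ]

# 10-fold cross-validation, repeated once (assumed settings)
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 1)

# Fit C5.0 and evaluate on the held-out test set
c50Fit   <- train(iphonesentiment ~ ., data = trainToy,
                  method = "C5.0", trControl = fitControl)
testPred <- predict(c50Fit, newdata = testToy)
metrics  <- postResample(pred = testPred, obs = testToy$iphonesentiment)
metrics
```

Swapping the method string (for example "rf", "svmLinear" or "kknn") would fit the other model types in the same way, provided the corresponding packages are installed.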

Performance metrics (test results)
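The metrics table is not reproduced in this version of the report. For reference, the two metrics used to compare models, accuracy and kappa, can be computed from a confusion matrix as in the following base R sketch. The observed and predicted labels are hypothetical.

```r
# Hypothetical observed and predicted classes for eight test instances
lv   <- c("neg", "pos")
obs  <- factor(c("neg", "neg", "pos", "pos", "pos", "neg", "pos", "neg"), levels = lv)
pred <- factor(c("neg", "pos", "pos", "pos", "neg", "neg", "pos", "neg"), levels = lv)

# Confusion matrix: rows are predictions, columns are observations
cm <- table(predicted = pred, observed = obs)

# Accuracy: share of instances on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2

# Cohen's kappa: agreement beyond chance, scaled to a maximum of 1
kappaStat <- (accuracy - expected) / (1 - expected)

c(Accuracy = accuracy, Kappa = kappaStat)
```

Kappa is useful here because the sentiment classes are unbalanced, so a model can reach a fairly high accuracy simply by favouring the largest class.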

Applying models to Large Matrix to predict iphone sentiment

Predicting iphone sentiment using C5.0

##          Negative          Positive Somewhat Negative Somewhat Positive 
##              5619              3204               584                36

Predicting iphone sentiment using Random Forest

##          Negative          Positive Somewhat Negative Somewhat Positive 
##              5580              3244               584                35
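The two sets of counts above can be compared more easily as shares. The sketch below re-enters the reported counts by hand and computes the share of pages predicted as negative or somewhat negative. `negativeShare` is a hypothetical helper.

```r
# Predicted class counts as reported above
c50Counts <- c(Negative = 5619, Positive = 3204,
               `Somewhat Negative` = 584, `Somewhat Positive` = 36)
rfCounts  <- c(Negative = 5580, Positive = 3244,
               `Somewhat Negative` = 584, `Somewhat Positive` = 35)

# Share of pages predicted as negative or somewhat negative
negativeShare <- function(counts) {
  sum(counts[c("Negative", "Somewhat Negative")]) / sum(counts)
}

# Percentages per model, rounded to one decimal
round(100 * c(C5.0 = negativeShare(c50Counts),
              RandomForest = negativeShare(rfCounts)), 1)
```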

Summary

The iPhone dataset with recoded iphonesentiment (iphoneRC) gave the highest accuracy and kappa, with C5.0 and Random forest as the two best models. Results were similar between the two models.
Further feature engineering (PCA, or a combination of PCA, recursive feature elimination and recoding) did not noticeably improve the performance metrics.
Both models classified roughly two thirds of the web pages in the Large Matrix as negative or somewhat negative (C5.0: 6,203 of 9,443; Random forest: 6,164 of 9,443), so negative sentiment toward the iPhone outweighed positive sentiment.

End of the task

# Stop the cluster once the task is complete
stopCluster(cl)

Sentiment Analysis - Predicting iphone sentiment

Y.S. Kim

4/22/2020