Dataset Information

galaxy_smallmatrix_labeled_8d.csv is the data matrix used in this task to develop models that predict the overall sentiment toward the Samsung Galaxy. It contains counts of relevant words (sentiment lexicons) for about 12,000 instances (web pages). The values in the device sentiment column (the last column in the matrix) represent the overall sentiment toward the device on a scale of 0-5. The overall sentiment value was entered manually by a team of coworkers who read each web page and rated its sentiment.

The scale is as follows:

0: very negative
1: negative
2: somewhat negative
3: somewhat positive
4: positive
5: very positive

Finally, the models are applied to the large matrix, built with AWS in the previous task, to predict sentiment.

Load libraries

library(doParallel)
library(readxl)
library(dplyr)
library(tidyverse)
library(tidyr)
library(plotly)
library(corrplot)
library(caret)
library(e1071)
library(kknn)
library(rmarkdown)

Set up parallel processing

Since we are dealing with a large dataset, parallel processing is performed to reduce computing time.

# Find out how many cores there are on my laptop
detectCores() # Result = 8

# Create cluster with desired number of cores. 
cl <- makeCluster(2)

# Register cluster
registerDoParallel(cl)

# Confirm how many cores are now assigned to R and RStudio
getDoParWorkers() # Result = 2
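The report does not show the cluster being released after use. The following is a minimal sketch of that cleanup, assuming only the base parallel package; run_on_cluster is a hypothetical helper, not part of the original analysis. With the cl object created above, the equivalent step would be a call to stopCluster(cl) once model training has finished.

```r
library(parallel)

# Hypothetical helper: apply `fun` to each input on a temporary cluster
run_on_cluster <- function(inputs, fun, n_workers = 2) {
  cl <- makeCluster(n_workers)
  # Release the worker processes even if fun() raises an error
  on.exit(stopCluster(cl), add = TRUE)
  parLapply(cl, inputs, fun)
}

squares <- run_on_cluster(1:4, function(x) x^2)
print(unlist(squares))
```

Wrapping the shutdown in on.exit() keeps stray worker processes from lingering when a long training run fails partway through.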

Import dataset

galaxyDF <- read.csv("C:/Users/Y.S. Kim/Desktop/Ubiqum/Sentiment Analysis/Dataset and csv/galaxy_smallmatrix_labeled_8d.csv")

Inspect attributes within dataset

##  [1] "iphone"          "samsunggalaxy"   "sonyxperia"      "nokialumina"    
##  [5] "htcphone"        "ios"             "googleandroid"   "iphonecampos"   
##  [9] "samsungcampos"   "sonycampos"      "nokiacampos"     "htccampos"      
## [13] "iphonecamneg"    "samsungcamneg"   "sonycamneg"      "nokiacamneg"    
## [17] "htccamneg"       "iphonecamunc"    "samsungcamunc"   "sonycamunc"     
## [21] "nokiacamunc"     "htccamunc"       "iphonedispos"    "samsungdispos"  
## [25] "sonydispos"      "nokiadispos"     "htcdispos"       "iphonedisneg"   
## [29] "samsungdisneg"   "sonydisneg"      "nokiadisneg"     "htcdisneg"      
## [33] "iphonedisunc"    "samsungdisunc"   "sonydisunc"      "nokiadisunc"    
## [37] "htcdisunc"       "iphoneperpos"    "samsungperpos"   "sonyperpos"     
## [41] "nokiaperpos"     "htcperpos"       "iphoneperneg"    "samsungperneg"  
## [45] "sonyperneg"      "nokiaperneg"     "htcperneg"       "iphoneperunc"   
## [49] "samsungperunc"   "sonyperunc"      "nokiaperunc"     "htcperunc"      
## [53] "iosperpos"       "googleperpos"    "iosperneg"       "googleperneg"   
## [57] "iosperunc"       "googleperunc"    "galaxysentiment"

Class of galaxysentiment

The attribute we want to predict:

## [1] "integer"

Check for missing data (NA)

## [1] FALSE
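The calls that produced the three outputs above are not shown in the report. A plausible sketch, using a small toy data frame (toyDF and its values are hypothetical stand-ins for galaxyDF), could look like this:

```r
# Toy stand-in for galaxyDF; the values are made up for illustration
toyDF <- data.frame(
  samsunggalaxy   = c(1L, 0L, 2L, 0L),
  googleandroid   = c(0L, 0L, 1L, 0L),
  galaxysentiment = c(5L, 0L, 4L, 5L)
)

col_names  <- names(toyDF)                   # attribute names
resp_class <- class(toyDF$galaxysentiment)   # storage class of the response
has_na     <- anyNA(toyDF)                   # TRUE if any cell is missing

print(col_names)
print(resp_class)
print(has_na)
```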

Histogram of galaxysentiment

Pre-processing (Feature engineering and feature selection)

Select only Samsung-related columns (Samsung, galaxy, google)

# Select relevant columns for galaxy
galaxy_relevant_columns <- galaxyDF %>% 
  select(starts_with("samsung"), starts_with("google"), galaxysentiment)

Checking for collinearity

cor(galaxy_relevant_columns$samsungdisunc, galaxy_relevant_columns$samsungdispos) #correlation coefficient is 0.9098321

## [1] 0.9098321

cor(galaxy_relevant_columns$samsungperneg, galaxy_relevant_columns$samsungdisneg) #correlation coefficient is 0.9394673

## [1] 0.9394673

cor(galaxy_relevant_columns$samsungperunc, galaxy_relevant_columns$samsungdisunc) #correlation coefficient is 0.9403043

## [1] 0.9403043

cor(galaxy_relevant_columns$googleperneg, galaxy_relevant_columns$googleperpos) #correlation coefficient is 0.9574098

## [1] 0.9574098

# Remove columns due to collinearity 
galaxy_nocol <- galaxy_relevant_columns %>%
  select(-c(samsungdisunc, samsungdisneg, googleperneg))
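The pairwise checks above can also be done in one pass over the full correlation matrix. The sketch below uses synthetic data (the columns a, b and c are hypothetical) and the same 0.90 cutoff; caret's findCorrelation() offers a packaged alternative.

```r
set.seed(1)
n <- 200
a <- rpois(n, 2)
toy <- data.frame(
  a = a,
  b = a + rbinom(n, 1, 0.1),  # almost a copy of a, so highly correlated
  c = rpois(n, 2)             # generated independently of a and b
)

cm <- cor(toy)
cm[lower.tri(cm, diag = TRUE)] <- NA   # keep each pair only once

# Row/column positions of pairs whose |r| exceeds the cutoff
high <- which(abs(cm) > 0.90, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
print(high_pairs)
```

For each reported pair, one member is dropped, which is how samsungdisunc, samsungdisneg and googleperneg were chosen above.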

Corrplot after removal of attributes due to collinearity

# create a new dataset
galaxyCOR <- galaxy_nocol

Removal of attributes with (near) zero variance

# Examine feature variance: nearZeroVar() with saveMetrics = TRUE returns an object containing a table including: frequency ratio, percentage unique, zero variance and near zero variance 
galaxynzvMetrics <- nearZeroVar(galaxyCOR, saveMetrics = TRUE)
galaxynzvMetrics

##                  freqRatio percentUnique zeroVar   nzv
## samsunggalaxy    14.127336    0.05395822   FALSE FALSE
## samsungcampos    93.625000    0.08479149   FALSE  TRUE
## samsungcamneg   100.132812    0.06937486   FALSE  TRUE
## samsungcamunc    74.308140    0.06937486   FALSE  TRUE
## samsungdispos    97.061069    0.13104139   FALSE  TRUE
## samsungperpos    94.200000    0.10791644   FALSE  TRUE
## samsungperneg   101.650794    0.10020812   FALSE  TRUE
## samsungperunc    86.500000    0.09249981   FALSE  TRUE
## googleandroid    61.247573    0.04624990   FALSE  TRUE
## googleperpos     98.592308    0.06937486   FALSE  TRUE
## googleperunc     96.443609    0.07708317   FALSE  TRUE
## galaxysentiment   4.579565    0.04624990   FALSE FALSE

# nearZeroVar() with saveMetrics = FALSE returns a vector of column indices
gnzv <- nearZeroVar(galaxyCOR, saveMetrics = FALSE)

##  [1]  2  3  4  5  6  7  8  9 10 11

nearZeroVar columns: 2 3 4 5 6 7 8 9 10 11

# create a new data set and remove near zero variance features
galaxyNZV <- galaxyCOR[,-gnzv]

However, we can see that this dataset contains only one independent attribute: samsunggalaxy
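To make the freqRatio and percentUnique columns in the metrics table easier to read, here is a rough base-R sketch of how they are derived. nzv_metrics is a hypothetical helper written for illustration; the cutoffs mirror the nearZeroVar() defaults (freqCut = 95/5, uniqueCut = 10).

```r
# Hypothetical helper approximating the metrics reported by nearZeroVar()
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  zero_var <- length(counts) == 1
  # Ratio of the most common value's count to the second most common
  freq_ratio <- if (zero_var) 0 else counts[[1]] / counts[[2]]
  # Distinct values as a percentage of all observations
  percent_unique <- 100 * length(counts) / length(x)
  list(
    freqRatio     = freq_ratio,
    percentUnique = percent_unique,
    nzv           = zero_var ||
      (freq_ratio > freq_cut && percent_unique <= unique_cut)
  )
}

sparse   <- c(rep(0, 990), rep(1, 8), rep(2, 2))  # dominated by zeros
balanced <- rep(0:4, 200)                         # evenly spread values

m_sparse   <- nzv_metrics(sparse)
m_balanced <- nzv_metrics(balanced)
print(m_sparse$nzv)
print(m_balanced$nzv)
```

Word-count columns in which almost every page has a count of zero behave like the sparse vector here, which is why ten of the eleven predictors are flagged.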

Recursive Feature Elimination

# Sample the data (original dataset) before using RFE
set.seed(123)
galaxySample <- galaxyDF[sample(1:nrow(galaxyDF), 1000, replace=FALSE),]

# Set up rfeControl with randomforest, repeated cross validation and no updates
ctrl <- rfeControl(functions = rfFuncs, 
                   method = "repeatedcv",
                   repeats = 5,
                   verbose = FALSE)

# Use rfe and omit the response variable (attribute sentiment) 
rfeResults_galaxy <- rfe(galaxySample[,1:58], 
                         galaxySample$galaxysentiment, 
                         sizes=(1:58), 
                         rfeControl=ctrl)

RFE reached its lowest RMSE with 18 attributes.

# create new data set with rfe recommended features
galaxyRFE <- galaxyDF[,predictors(rfeResults_galaxy)]

# add the dependent variable to galaxyRFE
galaxyRFE$galaxysentiment <- galaxyDF$galaxysentiment

# review outcome
glimpse(galaxyRFE)

# variable importance
varImp(rfeResults_galaxy)
# Overall
# iphone        71.171831
# googleandroid 37.439517
# iphonedispos  30.710355
# iphonedisneg  27.831920
# samsunggalaxy 26.544268
# iphonedisunc  24.024327

Recoding

The goal was to make 4 levels instead of 6:

1: negative
2: somewhat negative
3: somewhat positive
4: positive

# create new dataset that will be used for recoding sentiment
galaxyRC <- galaxyDF

# recode sentiment to combine factor levels 0 & 1 and 4 & 5
galaxyRC$galaxysentiment <- recode(galaxyRC$galaxysentiment, '0' = 1, '1' = 1, '2' = 2, '3' = 3, '4' = 4, '5' = 4)
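The same mapping can be checked on a small vector without dplyr. This is a sketch only: the sample values are made up, and the level labels are borrowed from the prediction output later in the report.

```r
# Made-up sample of original 0-5 ratings
sentiment <- c(0, 1, 2, 3, 4, 5, 5, 0)

# Lookup table: 0 and 1 collapse to 1, 4 and 5 collapse to 4
lookup <- c("0" = 1, "1" = 1, "2" = 2, "3" = 3, "4" = 4, "5" = 4)
recoded <- unname(lookup[as.character(sentiment)])

# Convert to a labelled factor so models treat it as a 4-class outcome
recoded_factor <- factor(
  recoded,
  levels = 1:4,
  labels = c("Negative", "Somewhat Negative", "Somewhat Positive", "Positive")
)
print(table(recoded_factor))
```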

Principal Component Analysis

# data = training and testing from galaxyDF (no feature selection) 
# create object containing centered, scaled PCA components from training set
# excluded the dependent variable and set threshold to .95

preprocessParams <- preProcess(trainsetDF[,-59], method=c("center", "scale", "pca"), thresh = 0.95)

# use predict to apply pca parameters, create training, exclude dependent
train.pca <- predict(preprocessParams, trainsetDF[,-59])

# add the dependent to training
train.pca$galaxysentiment <- trainsetDF$galaxysentiment

# use predict to apply pca parameters, create testing, exclude dependent
test.pca <- predict(preprocessParams, testsetDF[,-59])

# add the dependent to testing
test.pca$galaxysentiment <- testsetDF$galaxysentiment

PCA needed 25 components to capture 95 percent of the variance
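The component count follows from the cumulative share of variance explained. The sketch below illustrates the idea with base R's prcomp() on synthetic data (X and its columns are hypothetical); it is not the preProcess() call used above, but it applies the same 0.95 threshold.

```r
set.seed(42)
n <- 300
z1 <- rnorm(n)
z2 <- rnorm(n)

# Five columns, but x2 and x4 are near-duplicates of x1 and x3
X <- cbind(
  x1 = z1,
  x2 = z1 + rnorm(n, sd = 0.1),
  x3 = z2,
  x4 = z2 + rnorm(n, sd = 0.1),
  x5 = rnorm(n)
)

pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance captured by the first k components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest k whose cumulative share reaches the threshold
n_comp <- which(cum_var >= 0.95)[1]
print(round(cum_var, 3))
print(n_comp)
```

Because two of the five columns are nearly redundant, fewer components than columns are needed, just as 25 components stand in for the 58 predictors here.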

Modeling

Models used:

C5.0
Random forest
SVM
kknn

Datasets used:

galaxyDF: “Out of the box” dataset, no attributes removed.
galaxy_relevant_columns: Only Samsung-related columns included as predictors.
galaxyCOR: Using galaxy_relevant_columns after removal of columns due to collinearity, defined as a correlation coefficient of >0.90 between 2 independent variables
galaxyNZV: After removal of attributes with (near) zero variance
galaxyRFE: After recursive feature elimination
The two best models (C5.0 and Random forest) were selected and used for further feature engineering (RC, PCA, combination PCA/RFE/RC)
galaxyRC: After recoding the dependent attribute galaxysentiment (reducing the levels from 6 to 4)
galaxy.PCA: Principal component analysis, using a variance threshold of 0.95
galaxyRFE_RC: Recursive feature elimination after recoding sentiment
galaxyRC_RFE: Recoding sentiment after recursive feature elimination
galaxyRC_PCA: PCA after recoding sentiment

Performance metrics (test results)
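The metric values themselves are not reproduced in this section. For classification models, caret reports accuracy and kappa by default; assuming those were the metrics compared, the sketch below shows how both are computed from a confusion matrix. The actual and predicted labels are made up for illustration.

```r
# Made-up labels for six test instances
lv <- c("neg", "pos")
actual    <- factor(c("neg", "neg", "pos", "pos", "pos", "neg"), levels = lv)
predicted <- factor(c("neg", "pos", "pos", "pos", "neg", "neg"), levels = lv)

cm <- table(Predicted = predicted, Actual = actual)

# Accuracy: share of instances on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / sum(cm)^2

# Cohen's kappa: accuracy corrected for chance agreement
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(cm)
print(c(accuracy = accuracy, kappa = cohen_kappa))
```

Kappa is the more informative of the two when classes are imbalanced, since a model that always predicts the majority class can still score a high accuracy.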

Applying models to Large Matrix to predict galaxy sentiment

Predicting galaxy sentiment using C5.0

##          Negative          Positive Somewhat Negative Somewhat Positive 
##              5502              3240               578               123

Predicting galaxy sentiment using Random Forest

##          Negative          Positive Somewhat Negative Somewhat Positive 
##              5554              3272               584                33
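To compare the two sets of predictions more directly, the counts above can be converted to percentage shares. This sketch simply re-enters the reported counts by hand; the object names are arbitrary.

```r
# Predicted class counts copied from the two outputs above
c50 <- c("Negative" = 5502, "Positive" = 3240,
         "Somewhat Negative" = 578, "Somewhat Positive" = 123)
rf  <- c("Negative" = 5554, "Positive" = 3272,
         "Somewhat Negative" = 584, "Somewhat Positive" = 33)

# Both models scored the same number of pages
total_c50 <- sum(c50)
total_rf  <- sum(rf)

# Percentage of pages assigned to each class, one row per model
shares <- round(100 * rbind(C5.0 = c50 / total_c50,
                            RandomForest = rf / total_rf), 1)
print(shares)
```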

Sentimental Analysis - Predicting Samsung Galaxy sentiment

Y.S. Kim

4/23/2020

Background

Objective