Helio is working with a government health agency to create a suite of smart phone medical apps for use by aid workers in developing countries. This suite of apps will enable the aid workers to manage local health conditions by facilitating communication with medical professionals located elsewhere (one of the apps, for example, enables specialists in communicable diseases to diagnose conditions by examining images and other patient data uploaded by local aid workers). The government agency requires that the app suite be bundled with one model of smart phone. Helio is in the process of evaluating potential handset models to determine which one to bundle their software with. After completing an initial investigation, Helio has created a short list of five devices that are all capable of executing the app suite’s functions. To help Helio narrow their list down to one device, they have asked us to examine the prevalence of positive and negative attitudes toward these devices on the web.
The objective is to investigate predictive models using machine learning methods. These models will be applied to the Large Matrix file to complete the analysis of overall sentiment toward both the iPhone and the Samsung Galaxy. In this task, machine learning methods are used to predict the overall sentiment toward the iPhone.
iphone_smallmatrix_labeled_8d.csv is the data matrix used in this task to develop the models that predict overall sentiment toward the iPhone. It contains the counts of relevant words (sentiment lexicons) for about 12,000 instances (web pages). The values in the device sentiment column (the last column in the matrix) represent the overall sentiment toward the device on a scale of 0-5. The overall sentiment value was entered manually by a team of coworkers who read each web page and rated its sentiment.
The scale is as follows:
Finally, the models will be applied to the large matrix created on AWS in the previous task to predict sentiment.
library(doParallel)
library(readxl)
library(dplyr)
library(tidyverse)
library(tidyr)
library(plotly)
library(corrplot)
library(caret)
library(e1071)
library(kknn)
library(rmarkdown)
Since we are dealing with a large dataset, parallel processing is performed to reduce computing time.
# Find out how many cores there are on my laptop
detectCores() # Result = 8
# Create cluster with desired number of cores.
cl <- makeCluster(2)
# Register cluster
registerDoParallel(cl)
# Confirm how many cores are now assigned to R and RStudio
getDoParWorkers() # Result = 2
iphoneDF <- read.csv("C:/Users/Y.S. Kim/Desktop/Ubiqum/Sentiment Analysis/Dataset and csv/iphone_smallmatrix_labeled_8d.csv")
## [1] "iphone" "samsunggalaxy" "sonyxperia" "nokialumina"
## [5] "htcphone" "ios" "googleandroid" "iphonecampos"
## [9] "samsungcampos" "sonycampos" "nokiacampos" "htccampos"
## [13] "iphonecamneg" "samsungcamneg" "sonycamneg" "nokiacamneg"
## [17] "htccamneg" "iphonecamunc" "samsungcamunc" "sonycamunc"
## [21] "nokiacamunc" "htccamunc" "iphonedispos" "samsungdispos"
## [25] "sonydispos" "nokiadispos" "htcdispos" "iphonedisneg"
## [29] "samsungdisneg" "sonydisneg" "nokiadisneg" "htcdisneg"
## [33] "iphonedisunc" "samsungdisunc" "sonydisunc" "nokiadisunc"
## [37] "htcdisunc" "iphoneperpos" "samsungperpos" "sonyperpos"
## [41] "nokiaperpos" "htcperpos" "iphoneperneg" "samsungperneg"
## [45] "sonyperneg" "nokiaperneg" "htcperneg" "iphoneperunc"
## [49] "samsungperunc" "sonyperunc" "nokiaperunc" "htcperunc"
## [53] "iosperpos" "googleperpos" "iosperneg" "googleperneg"
## [57] "iosperunc" "googleperunc" "iphonesentiment"
The attribute we want to predict:
## [1] "integer"
## [1] FALSE
Select only iphone/ios columns
# Select relevant columns for iphone
iphone_relevant_columns <- iphoneDF %>%
  select(starts_with("iphone"), starts_with("ios"), iphonesentiment)
Checking for collinearity
cor(iphone_relevant_columns$iosperpos, iphone_relevant_columns$iosperneg)
## [1] 0.9323823
cor(iphone_relevant_columns$iosperpos, iphone_relevant_columns$iosperunc)
## [1] 0.9050794
cor(iphone_relevant_columns$ios, iphone_relevant_columns$iphone)
## [1] 0.9220603
# Remove columns due to collinearity
iphone_nocol <- iphone_relevant_columns %>%
  select(-c(iosperneg, iosperunc, ios))
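The pairwise checks above could also be automated. The following is a minimal sketch in base R, assuming a cutoff of 0.9 (an assumption, since the report does not state the threshold used) and a small made-up data frame `toy` standing in for `iphone_relevant_columns`.

```r
# Toy data: 'b' is almost a copy of 'a', 'c' is unrelated (hypothetical values)
set.seed(1)
a <- rnorm(200)
toy <- data.frame(a = a, b = a + rnorm(200, sd = 0.1), c = rnorm(200))

cutoff <- 0.9  # assumed threshold, not taken from the report
cor_mat <- cor(toy)

# Look only at the upper triangle so each pair is reported once
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(high_pairs)
```

Which member of each pair to drop remains a judgment call. caret's findCorrelation() offers a similar, automated suggestion.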
Corrplot after removing attributes due to collinearity
# create a new dataset
iphoneCOR <- iphone_nocol
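The plotting call itself is not shown in the report. Below is a minimal sketch of how such a plot might be produced with the corrplot package. It uses the built-in `mtcars` data as a stand-in for `iphoneCOR` (an assumption made only so the example is self-contained), and the `method` and `type` settings are illustrative choices, not the report's.

```r
# Stand-in data; in the report this would be iphoneCOR
cor_mat <- cor(mtcars[, c("mpg", "disp", "hp", "wt")])

# Draw the plot only if the corrplot package is installed
if (requireNamespace("corrplot", quietly = TRUE)) {
  corrplot::corrplot(cor_mat, method = "circle", type = "upper")
}
```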
Removal of attributes with (near) zero variance
# Examine feature variance: nearZeroVar() with saveMetrics = TRUE returns an object containing a table including: frequency ratio, percentage unique, zero variance and near zero variance
nzvMetrics <- nearZeroVar(iphoneCOR, saveMetrics = TRUE)
nzvMetrics
##                 freqRatio percentUnique zeroVar   nzv
## iphone            5.041322    0.20812457   FALSE FALSE
## iphonecampos     10.524697    0.23124952   FALSE FALSE
## iphonecamneg     19.517529    0.13104139   FALSE  TRUE
## iphonecamunc     16.764205    0.16187466   FALSE FALSE
## iphonedispos      6.792440    0.24666615   FALSE FALSE
## iphonedisneg     10.084428    0.18499961   FALSE FALSE
## iphonedisunc     11.471875    0.20812457   FALSE FALSE
## iphoneperpos      9.297834    0.19270793   FALSE FALSE
## iphoneperneg     11.054137    0.16958298   FALSE FALSE
## iphoneperunc     13.018349    0.12333308   FALSE FALSE
## iphonesentiment   3.843017    0.04624990   FALSE FALSE
## iosperpos       153.373494    0.09249981   FALSE  TRUE
# nearZeroVar() with saveMetrics = FALSE returns a vector of column indices
nzv <- nearZeroVar(iphoneCOR, saveMetrics = FALSE)
nzv
## [1] 3 12
Near-zero-variance columns: iphonecamneg and iosperpos
# create a new data set and remove near zero variance features
iphoneNZV <- iphoneCOR[,-nzv]
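To make the metrics in the table easier to read, here is a minimal base-R sketch of how freqRatio and percentUnique can be computed for one column. It assumes caret's default cutoffs (freqCut = 95/5, uniqueCut = 10). The vector `x` is hypothetical. Under those defaults, iphonecamneg is flagged because its freqRatio of 19.52 is just above 19, while iphonecamunc at 16.76 is not.

```r
# Hypothetical count column: mostly zeros with a few other values
x <- c(rep(0, 96), rep(1, 3), 2)

# Frequencies of each distinct value, largest first
counts <- sort(as.vector(table(x)), decreasing = TRUE)

# Ratio of the most common value's count to the second most common value's count
freq_ratio <- counts[1] / counts[2]

# Distinct values as a percentage of all observations
percent_unique <- 100 * length(unique(x)) / length(x)

# Assumed rule with caret's default cutoffs: freqCut = 95/5, uniqueCut = 10
is_nzv <- freq_ratio > 95 / 5 && percent_unique <= 10

freq_ratio      # 32
percent_unique  # 3
is_nzv          # TRUE
```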
Recursive Feature Elimination
# Sample the data (original dataset) before using RFE
set.seed(123)
iphoneSample <- iphoneDF[sample(1:nrow(iphoneDF), 1000, replace=FALSE),]
# Set up rfeControl with randomforest, repeated cross validation and no updates
ctrl <- rfeControl(functions = rfFuncs,
                   method = "repeatedcv",
                   repeats = 5,
                   verbose = FALSE)
# Use rfe and omit the response variable (attribute sentiment)
rfeResults <- rfe(iphoneSample[,1:58],
                  iphoneSample$iphonesentiment,
                  sizes = 1:58,
                  rfeControl = ctrl)
# create new data set with rfe recommended features
iphoneRFE <- iphoneDF[,predictors(rfeResults)]
# add the dependent variable to iphoneRFE
iphoneRFE$iphonesentiment <- iphoneDF$iphonesentiment
# variable importance
varImp(rfeResults)
# Overall
# iphone 70.487576
# googleandroid 33.075680
# iphonedispos 28.035451
# iphonedisneg 26.687539
# samsunggalaxy 23.868780
Recoding
# create new dataset that will be used for recoding sentiment
iphoneRC <- iphoneDF
# recode sentiment to combine factor levels 0 & 1 and 4 & 5
iphoneRC$iphonesentiment <- recode(iphoneRC$iphonesentiment, '0' = 1, '1' = 1, '2' = 2, '3' = 3, '4' = 4, '5' = 4)
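The effect of this recoding can be checked on a small example. The sketch below uses a hypothetical toy vector and a base-R lookup table rather than dplyr's recode(), purely to keep it dependency-free. The mapping is the one given above. The final conversion to a factor is an assumption about the next step, since caret treats a numeric response as a regression target.

```r
# Toy sentiment scores on the original 0-5 scale (hypothetical values)
sentiment <- c(0, 1, 2, 3, 4, 5, 5, 0)

# Lookup table: 0 and 1 collapse to 1, 4 and 5 collapse to 4
lookup <- c("0" = 1, "1" = 1, "2" = 2, "3" = 3, "4" = 4, "5" = 4)
recoded <- unname(lookup[as.character(sentiment)])

# Convert to a factor so that models treat the response as classes
recoded_factor <- factor(recoded)

# Cross-tabulate original against recoded values to verify the mapping
table(original = sentiment, recoded = recoded)
levels(recoded_factor)
```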
Principal Component Analysis
# data = training and testing from iphoneDF (no feature selection)
# create object containing centered, scaled PCA components from training set
# excluded the dependent variable and set threshold to .95
preprocessParams <- preProcess(trainsetDF[,-59], method=c("center", "scale", "pca"), thresh = 0.95)
print(preprocessParams)
# use predict to apply pca parameters, create training, exclude dependent
train.pca <- predict(preprocessParams, trainsetDF[,-59])
# add the dependent to training
train.pca$iphonesentiment <- trainsetDF$iphonesentiment
# use predict to apply pca parameters, create testing, exclude dependent
test.pca <- predict(preprocessParams, testsetDF[,-59])
# add the dependent to testing
test.pca$iphonesentiment <- testsetDF$iphonesentiment
PCA needed 24 components to capture 95 percent of the variance
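To illustrate what a 0.95 variance threshold means, the following sketch computes the cumulative share of variance by hand with base R's prcomp(). It uses the built-in `iris` measurements as stand-in predictors (an assumption for self-containment). It only approximates the idea behind `thresh` and is not a reproduction of caret's internals.

```r
# Built-in iris measurements as stand-in predictors
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Share of total variance explained by each component, and the running total
var_share <- pca$sdev^2 / sum(pca$sdev^2)
cum_share <- cumsum(var_share)

# Smallest number of components whose cumulative share reaches the threshold
thresh <- 0.95
n_comp <- which(cum_share >= thresh)[1]

round(cum_share, 3)
n_comp
```

The same calculation on the 58 predictors here would correspond to the 24 components reported above.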
Predicting iphone sentiment using C5.0
##          Negative          Positive Somewhat Negative Somewhat Positive
##              5619              3204               584                36
Predicting iphone sentiment using Random Forest
##          Negative          Positive Somewhat Negative Somewhat Positive
##              5580              3244               584                35
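The two sets of predicted counts are easier to compare as percentages. The sketch below copies the counts from the output above (the code that produced them is not shown in this report) and converts them to shares per model. Both sets sum to 9,443 pages.

```r
# Predicted class counts as reported above
c50 <- c(Negative = 5619, Positive = 3204,
         "Somewhat Negative" = 584, "Somewhat Positive" = 36)
rf  <- c(Negative = 5580, Positive = 3244,
         "Somewhat Negative" = 584, "Somewhat Positive" = 35)

# Percentage of pages falling in each predicted class, per model
shares <- round(100 * rbind(C5.0 = c50 / sum(c50), RF = rf / sum(rf)), 1)
shares
```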
# Stop the cluster once all tasks have finished
stopCluster(cl)