This document is a detailed walkthrough of the work done on problem statement 1 by Sunil Kumar and Ahmed Shabib of Team Evergreen during the GainSight Happy Hack on 11-12 April 2015.
Please note that all scripts and data files referred to in this document are available at the URL https://www.dropbox.com/sh/jdm3gwhd862o821/AABWRnGqPEzucOaVGE5QkiNya?dl=0 unless explicitly specified otherwise.
The work done by the team addresses the following challenges:
The team used the following open source toolkits to address the problem:
All experiments were performed on two standard laptops, one running Linux with 16 GB of RAM and one running Mac OS with 4 GB of RAM.
The first barrier for the team to cross was to derive sentiment from the unstructured text comments. We framed this as a supervised classification task. We identified an annotated dataset of hotel blog comments, available at http://www.dsic.upv.es/grupos/nle/resources/corpusPM.zip, and acquired it. After unzipping and pre-processing, we obtained a dataset of 2999 records containing text comments about hotels and their annotated sentiments. The dataset is named out.csv; the pre-processing code is written in Python and is called splitter.py.
LightSIDE was used to extract the required features from the training dataset (out.csv) and to build classification models. The following feature extraction techniques were used:
The extracted feature dataset is named 123grams_nostop_nopunct.features.xml.
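The feature file name suggests unigrams, bigrams and trigrams with stop words and punctuation removed. As an illustration only, that kind of extraction can be sketched in Python as follows. This is not LightSIDE's implementation: the stop-word list, the tokenizer and the function names are assumptions made for the example.

```python
import re

# Tiny illustrative stop-word list (assumption; LightSIDE uses its own list).
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "were", "of", "to", "in", "it"}


def tokenize(text):
    """Lowercase the text and keep only word characters, which drops punctuation."""
    return re.findall(r"\w+", text.lower())


def extract_ngrams(text, max_n=3):
    """Return the set of 1..max_n-grams built from the non-stop-word tokens."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    features = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features.add("_".join(tokens[i:i + n]))
    return features


features = extract_ngrams("The room was clean, and quiet!")
```

Each comment then becomes a bag of such n-gram features, which is the kind of representation a classifier is trained on.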
After trying multiple classifiers, we chose a Random Forest ensemble, which achieved a 10-fold cross-validation accuracy of 87%. We therefore accepted the model for use on the test dataset. The model is called weka__123grams_nostop_nopunct.model.xml.
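For readers unfamiliar with k-fold cross-validation, the procedure can be sketched in plain Python as below. This is a minimal sketch, not the LightSIDE/Weka procedure itself: `k_fold_accuracy` and `majority_class_trainer` are hypothetical names, and the majority-class trainer is only a stand-in for the Random Forest.

```python
import random


def k_fold_accuracy(rows, labels, train_fn, k=10, seed=0):
    """Return the mean held-out accuracy over k folds.

    train_fn(train_rows, train_labels) must return a predict(row) callable.
    """
    if len(rows) < k:
        raise ValueError("need at least k rows")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint index groups
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train_idx = [i for i in indices if i not in held]
        predict = train_fn([rows[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        correct = sum(predict(rows[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


def majority_class_trainer(train_rows, train_labels):
    """Hypothetical stand-in model: always predicts the most common training label."""
    most_common = max(set(train_labels), key=train_labels.count)
    return lambda row: most_common


rows = list(range(100))
labels = ["pos"] * 70 + ["neg"] * 30
baseline = k_fold_accuracy(rows, labels, majority_class_trainer)
```

Comparing a model's cross-validated accuracy against a baseline like this one is a quick check that the classifier has learned more than the class distribution.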
The JSON-formatted dataset provided by GainSight for this hackathon was pre-processed with the Python script jsonoutput2.py, which produced 625951 records with the features “Service”, “Cleanliness”, “Sleep Quality”, “Rooms”, “Location”, “Value”, “Text” and “Overall”. The full extracted dataset is available under the name outputfinal2.csv.
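A flattening step of this kind can be sketched as follows. The input shape used here (one JSON object per line, with a nested "Ratings" object and a "Content" text field) is an assumption made for illustration; the actual layout of the GainSight file and the logic of jsonoutput2.py may differ.

```python
import csv
import io
import json

COLUMNS = ["Service", "Cleanliness", "Sleep Quality", "Rooms",
           "Location", "Value", "Text", "Overall"]


def flatten_review(record):
    """Flatten one review dict into a row keyed by COLUMNS.

    Assumed (hypothetical) input shape:
    {"Ratings": {"Service": "4", ..., "Overall": "5"}, "Content": "..."}
    """
    ratings = record.get("Ratings", {})
    row = {name: ratings.get(name, "") for name in COLUMNS if name != "Text"}
    row["Text"] = record.get("Content", "")
    return row


def json_lines_to_csv(lines, out_file):
    """Write one CSV row per non-empty JSON line; return the number of rows written."""
    writer = csv.DictWriter(out_file, fieldnames=COLUMNS)
    writer.writeheader()
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        writer.writerow(flatten_review(json.loads(line)))
        count += 1
    return count


sample = ['{"Ratings": {"Service": "4", "Overall": "5"}, "Content": "Great stay"}', ""]
buffer = io.StringIO()
written = json_lines_to_csv(sample, buffer)
```

Missing ratings are written as empty strings so that incomplete reviews can be filtered or imputed later.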
The full dataset was too large to process on a standalone laptop using R, Python, LightSIDE or Rattle because of RAM constraints. Therefore, we randomly sampled 45000 records and used them as the test dataset. The test data file is called outputfile.csv. The file was processed by the model we built, and the predicted sentiment (pos / neg) was added to the same file under the column Label_prediction.
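One memory-friendly way to draw such a sample is reservoir sampling, which reads the records one at a time and never holds more than the sample in memory. The sketch below is only one possible approach; the document does not state how the 45000 records were actually drawn, and `reservoir_sample` is a hypothetical helper name.

```python
import random


def reservoir_sample(items, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length.

    If the iterable has fewer than k items, all of them are returned.
    """
    rng = random.Random(seed)
    sample = []
    for index, item in enumerate(items):
        if index < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, index)  # inclusive on both ends
            if j < k:
                sample[j] = item       # replace with probability k / (index + 1)
    return sample


picked = reservoir_sample(range(1000), 45, seed=1)
```

Because `items` can be any iterable, the same function works on a file handle or a `csv.reader`, so the full CSV never has to be loaded at once.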
Exploratory data analysis (EDA) is a key step in any data science project. We analyzed quantitative and qualitative details of the test data during the EDA step of this project. A heatmap is one way to visualize a large dataset and get an idea of the distribution of feature values. The output of the code in EDA.R, which summarizes and visualizes the test data, is shown below:
## [1] "Summary of features in the test dataset"
##   Cleanliness    Label_prediction     Location        Overall
##  Min.   :1.000   Min.   :0.0000    Min.   :1.000   Min.   :1.000
##  1st Qu.:4.000   1st Qu.:0.0000    1st Qu.:4.000   1st Qu.:3.000
##  Median :5.000   Median :0.0000    Median :5.000   Median :4.000
##  Mean   :4.151   Mean   :0.4277    Mean   :4.319   Mean   :3.903
##  3rd Qu.:5.000   3rd Qu.:1.0000    3rd Qu.:5.000   3rd Qu.:5.000
##  Max.   :5.000   Max.   :1.0000    Max.   :5.000   Max.   :5.000
##      Rooms          Service       Sleep.Quality       Value
##  Min.   :1.000   Min.   :1.000   Min.   :1.00    Min.   :1.000
##  1st Qu.:3.000   1st Qu.:3.000   1st Qu.:3.00    1st Qu.:3.000
##  Median :4.000   Median :4.000   Median :4.00    Median :4.000
##  Mean   :3.819   Mean   :4.056   Mean   :3.93    Mean   :3.946
##  3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.00    3rd Qu.:5.000
##  Max.   :5.000   Max.   :5.000   Max.   :5.00    Max.   :5.000
## [1] "Heatmap of the test dataset"
The second problem we handled was to determine the relative importance of the features with respect to the overall rating of a hotel. We handled this through linear regression: the model fits a linear function of the predictors to the overall rating, and the regression determines the coefficient assigned to each feature. The higher the coefficient of a feature, the greater its relative importance. One issue is that linear regression requires all predictors and outcomes to be numeric, which is not the case here because the field Label_prediction is nominal. The file convertfactorstonumeric.R contains the code to convert the nominal variable to numeric.
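The two steps described above, encoding the nominal label and fitting the regression, can be sketched in plain Python. This is an illustrative least-squares fit via the normal equations, not the R code used in the project. The mapping neg -> 0, pos -> 1 is an assumption, and `encode_label` and `fit_ols` are hypothetical names.

```python
def encode_label(label):
    """Map the nominal sentiment to a number (assumed mapping: neg -> 0, pos -> 1)."""
    return {"neg": 0.0, "pos": 1.0}[label]


def fit_ols(rows, targets):
    """Fit y = b0 + b1*x1 + ... by solving the normal equations (X'X) b = X'y.

    Returns [b0, b1, ...] with the intercept first.
    """
    design = [[1.0] + list(row) for row in rows]  # prepend the intercept column
    p = len(design[0])
    # Augmented matrix [X'X | X'y].
    aug = [[sum(r[i] * r[j] for r in design) for j in range(p)]
           + [sum(r[i] * y for r, y in zip(design, targets))]
           for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; no unique solution")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [value / pivot_value for value in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][p] for i in range(p)]


# Toy data: one rating column plus the encoded sentiment label.
ratings = [1, 2, 3, 4, 5, 1]
sentiments = ["neg", "pos", "neg", "pos", "neg", "pos"]
rows = [(r, encode_label(s)) for r, s in zip(ratings, sentiments)]
targets = [-0.5 + 0.25 * x1 + 0.1 * x2 for x1, x2 in rows]
coefficients = fit_ols(rows, targets)
```

On this noise-free toy data the fit recovers the intercept and both slopes used to generate the targets.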
The full R code for the regression model is available in the file Regressionplot.R. The output of the script is shown below:
##      (Intercept)      Cleanliness Label_prediction         Location
##      -0.35477131       0.12425991       0.10395281       0.08482929
##            Rooms          Service    Sleep.Quality            Value
##       0.24910248       0.25244779       0.13734977       0.20675868
The weight of each feature is visualized as a pie chart. The code for the visualization is available in piechart.R.
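To turn the regression coefficients into pie-chart slices, one simple option is to normalize them so they sum to 1. The sketch below uses the coefficients reported above and excludes the intercept. Whether piechart.R normalizes in exactly this way is an assumption, and `importance_shares` is a hypothetical helper name.

```python
# Coefficients as reported by the regression output (intercept excluded).
coefficients = {
    "Cleanliness": 0.12425991,
    "Label_prediction": 0.10395281,
    "Location": 0.08482929,
    "Rooms": 0.24910248,
    "Service": 0.25244779,
    "Sleep.Quality": 0.13734977,
    "Value": 0.20675868,
}


def importance_shares(coefs):
    """Return each coefficient's share of the total absolute weight."""
    total = sum(abs(value) for value in coefs.values())
    return {name: abs(value) / total for name, value in coefs.items()}


shares = importance_shares(coefficients)
ranked = sorted(shares, key=shares.get, reverse=True)  # most to least important
```

Note that Label_prediction ranges over 0 to 1 while the ratings range over 1 to 5, so its coefficient is not on quite the same footing as the others.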
We did further analysis to confirm our findings through the relaimpo package in R, which provides measures of relative importance for each of the predictors in the model. The R code for this can be found in the file Relativeimportance.R.
Another approach we used to confirm the dependency between variables was correlation analysis of each predictor variable against Overall, the target variable. This analysis was performed in Rattle, and the code can be seen in correlation analysis.R.
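As a reference for what this analysis measures, the Pearson correlation of each predictor with the target can be computed as in the sketch below. This is an illustrative re-implementation, not the Rattle-generated code. The function names and toy values are made up for the example.

```python
import math


def pearson(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return covariance / math.sqrt(var_x * var_y)


def correlations_with_target(columns, target):
    """Map each predictor name to its correlation with the target column."""
    return {name: pearson(values, target) for name, values in columns.items()}


# Toy values for illustration only (not taken from the hackathon data).
toy_columns = {"Service": [5, 4, 2, 1], "Location": [3, 5, 4, 3]}
toy_overall = [5, 4, 2, 1]
toy_result = correlations_with_target(toy_columns, toy_overall)
```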
The next challenge was to ascertain which “key themes” in the unstructured text result in a comment being categorized as “pos” or “neg” sentiment. To overcome this challenge, we extracted correlation statistics for each feature in the feature table 123grams_nostop_nopunct.features.xml. The extraction was done separately for negative and positive sentiments, and the extracts are available in the files themesdecidingclasstrainneg.csv and themesdecidingclasstrainpos.csv. The files were pre-processed with the R code in themesdecidingclasstrainneg.R and themesdecidingclasstrainpos.R. From these, we took the top 100 words with the highest correlation for each of the pos and neg sentiments.
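The idea of ranking features by their correlation with a class can be illustrated with the phi coefficient, which is the Pearson correlation between two binary variables (token present or absent, comment in the class or not). Which statistic LightSIDE actually exported is not stated here, so the phi coefficient and the helper names below are assumptions.

```python
import math


def phi_coefficient(n11, n10, n01, n00):
    """Phi coefficient from a 2x2 table of counts.

    n11: token present, in class      n10: token present, other class
    n01: token absent, in class       n00: token absent, other class
    """
    denominator = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return 0.0 if denominator == 0 else (n11 * n00 - n10 * n01) / denominator


def top_tokens(documents, labels, target_label, top_k=100):
    """Return the top_k (token, phi) pairs most correlated with target_label.

    documents is a list of token sets, one per comment.
    """
    vocabulary = set().union(*documents)
    scores = {}
    for token in vocabulary:
        n11 = n10 = n01 = n00 = 0
        for doc, label in zip(documents, labels):
            present, in_class = token in doc, label == target_label
            if present and in_class:
                n11 += 1
            elif present:
                n10 += 1
            elif in_class:
                n01 += 1
            else:
                n00 += 1
        scores[token] = phi_coefficient(n11, n10, n01, n00)
    # Highest correlation first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


docs = [{"dirty", "room"}, {"rude", "staff"}, {"clean", "room"}, {"friendly", "staff"}]
sentiments = ["neg", "neg", "pos", "pos"]
negative_themes = top_tokens(docs, sentiments, "neg", top_k=2)
```

Running the same function with "pos" as the target label gives the positive themes.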
The negative sentiment is determined through the words (top 100) below -
The positive sentiment is determined through the words (top 100) below -
The awesome search interface can be accessed from the URL gainhack.cloudapp.net and the code is available at https://github.com/ahmedshabib/evergreen-gainsight-hack
The R markdown file created to explain this hack work is provided for review under the name Final.Rmd
Summary & Closing Remarks
This document details the approaches adopted by Sunil & Ahmed of Team Evergreen to solve problem statement 1 given to them during the GainSight Happy Hack event. Sunil & Ahmed believe that applying generalized regression variants such as logit, probit or multinomial models would result in a better model fit and therefore a better R-squared value. This may slightly change the results shown in this document. We also tried sentiment analysis using a semantic approach based on the Harvard General Inquirer lexical database and WordNet synsets, but did not proceed further with it because the improvement over the approach explained in this document was very minor. Lemmatization and other NLP parsing techniques were not used in the current sentiment analysis work, and we believe that using them would improve the accuracy obtained on the test datasets. These approaches were not tried due to lack of time and infrastructure resource (RAM, CPU) constraints.