GainSight Happy Hack - Team EverGreen’s hack detail document

This document is a detailed walkthrough of the work done on problem statement 1 by Sunil Kumar and Ahmed Shabib of Team Evergreen during the GainSight Happy Hack on 11 - 12 April 2015.

Please note that all scripts and data files referred to in this document are available at the URL https://www.dropbox.com/sh/jdm3gwhd862o821/AABWRnGqPEzucOaVGE5QkiNya?dl=0 unless explicitly specified otherwise.

The work done by the team addresses the below mentioned challenges -

The provided data was enriched with an external public dataset. All data was acquired, pre-processed and modeled appropriately.
A classification model was built to derive one new feature (representing the sentiment in the unstructured text comments).
The “relevance of features” (weight of evidence) to the overall rating given to each hotel was determined.
The top 100 “key themes” that determine the positive and negative sentiment of unstructured text comments were identified and visualized.
A complete list of indicators representing the sentiments was identified.
Data visualizations were incorporated wherever the situation demanded.
An R Markdown file detailing the work was created and the knitr output was published on RPubs.
An interactive search interface was created that lets users search the hotel database and get various insights about the desired hotel.

The team made use of the below mentioned open source toolkit to address the problem -

Python
R
LightSIDE text mining workbench
Rattle dataminer
MongoDB

All experiments were performed on two standard laptops, one running Linux and one running Mac OS, configured with 16 GB and 4 GB of RAM respectively.

The first hurdle for the team was deriving sentiment from the unstructured text comments. We framed this as a supervised classification task. We identified an annotated dataset of hotel blog comments, available at http://www.dsic.upv.es/grupos/nle/resources/corpusPM.zip, and acquired it. After unzipping and pre-processing, we obtained a dataset of 2999 records containing text comments about hotels and their annotated sentiments. The dataset is named out.csv; the pre-processing code is written in Python and is called splitter.py.

LightSIDE was used to extract the required features from the training dataset (out.csv) and to build classification models. The following feature extraction techniques were used -

Unigrams, bigrams and trigrams are extracted from the text.
Punctuation and stopwords are excluded.
N-grams are stemmed.
The rare feature threshold is set to 5, which means any feature appearing in the text fewer than 5 times is dropped.
Feature selection is done through a chi-square attribute selection filter, and the top 3000 contributing features are selected.

The extracted feature dataset is named 123grams_nostop_nopunct.features.xml.
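The extraction steps listed above can be sketched in plain Python. This is only an illustration of the technique, not LightSIDE's implementation: the stopword list and function names are hypothetical, stemming is omitted, and the rare feature threshold is applied to the number of comments containing a feature, which is an assumption.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; LightSIDE uses its own list).
STOPWORDS = {"a", "an", "and", "the", "is", "was", "were", "of", "to", "in", "it", "this"}


def ngrams(text, max_n=3):
    """Return the set of 1-, 2- and 3-grams of a comment, without punctuation or stopwords."""
    tokens = [t for t in re.findall(r"[a-z0-9']+", text.lower()) if t not in STOPWORDS]
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.add(" ".join(tokens[i:i + n]))
    return grams


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 table (feature present/absent vs. class pos/neg)."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def select_features(docs, labels, rare_threshold=5, top_k=3000):
    """Drop features seen in fewer than rare_threshold comments, keep top_k by chi-square."""
    doc_grams = [ngrams(d) for d in docs]
    doc_freq = Counter(g for grams in doc_grams for g in grams)
    pos_freq = Counter(g for grams, y in zip(doc_grams, labels) if y == "pos" for g in grams)
    n_pos = sum(1 for y in labels if y == "pos")
    n_neg = len(labels) - n_pos
    scores = {}
    for gram, df in doc_freq.items():
        if df < rare_threshold:
            continue
        n11 = pos_freq[gram]   # feature present, class pos
        n10 = df - n11         # feature present, class neg
        n01 = n_pos - n11      # feature absent, class pos
        n00 = n_neg - n10      # feature absent, class neg
        scores[gram] = chi_square(n11, n10, n01, n00)
    # Highest score first; ties broken alphabetically for a stable order.
    ranked = sorted(scores, key=lambda g: (-scores[g], g))
    return ranked[:top_k]
```

With the defaults (rare_threshold=5, top_k=3000) the function mirrors the settings listed above; smaller values are convenient when trying it on a handful of comments.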

After trying multiple classifiers, we chose a Random Forest ensemble. It achieved a 10-fold cross-validation accuracy of 87%, so we accepted the model for use on the test dataset. The model is called weka__123grams_nostop_nopunct.model.xml.
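For readers unfamiliar with the evaluation scheme, the sketch below shows how a 10-fold cross-validation accuracy can be computed. It is not the LightSIDE / Weka procedure itself: the trainer shown is a hypothetical majority-class stand-in for the Random Forest, and the score is taken as the mean of the per-fold accuracies.

```python
import random
from collections import Counter


def k_fold_indices(n_rows, k=10, seed=0):
    """Shuffle row indices and split them into k nearly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def majority_class_trainer(train_x, train_y):
    """Hypothetical stand-in for the Random Forest: always predicts the commonest label."""
    most_common = Counter(train_y).most_common(1)[0][0]
    return lambda x: most_common


def cross_val_accuracy(xs, ys, trainer, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and average the k accuracies."""
    accuracies = []
    for fold in k_fold_indices(len(xs), k, seed):
        held_out = set(fold)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = trainer([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        correct = sum(1 for i in fold if model(xs[i]) == ys[i])
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)
```

Any function that takes training rows and labels and returns a predictor can be passed as the trainer, so a real classifier could replace the stand-in without changing the evaluation loop.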

The JSON-formatted dataset provided by GainSight for this hackathon was pre-processed with the Python script jsonoutput2.py, which produced 625951 records with the features “Service”, “Cleanliness”, “Sleep Quality”, “Rooms”, “Location”, “Value”, “Text” and “Overall”. The full extracted dataset is available under the name outputfinal2.csv.
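The flattening step can be illustrated with a short sketch. It is not the code of jsonoutput2.py: the input shape (a "Reviews" list whose entries carry a "Ratings" dictionary and a "Content" string) is an assumption made for the example, and only the eight output column names come from this document.

```python
import csv
import io
import json

COLUMNS = ["Service", "Cleanliness", "Sleep Quality", "Rooms",
           "Location", "Value", "Text", "Overall"]


def flatten_hotel(hotel):
    """Yield one flat row per review of a hotel record.

    Assumed (hypothetical) input shape:
    {"Reviews": [{"Ratings": {"Service": "4", ...}, "Content": "..."}]}
    """
    for review in hotel.get("Reviews", []):
        ratings = review.get("Ratings", {})
        # Missing ratings are written as empty cells.
        row = {name: ratings.get(name, "") for name in COLUMNS if name != "Text"}
        row["Text"] = review.get("Content", "")
        yield row


def write_rows(hotels, out_file):
    """Write all reviews of all hotels as CSV and return the number of records."""
    writer = csv.DictWriter(out_file, fieldnames=COLUMNS)
    writer.writeheader()
    count = 0
    for hotel in hotels:
        for row in flatten_hotel(hotel):
            writer.writerow(row)
            count += 1
    return count


# Small in-memory example instead of the real hackathon files.
sample = json.loads('{"Reviews": [{"Ratings": {"Service": "4", "Overall": "5.0"}, '
                    '"Content": "Great stay"}]}')
buffer = io.StringIO()
n_records = write_rows([sample], buffer)
```

Writing row by row keeps memory use flat, which matters for an extract of several hundred thousand records.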

The full dataset was too large to process on a standalone laptop in the R / Python / LightSIDE / Rattle environments because of RAM constraints. Therefore, we randomly sampled 45000 records and used them as the test dataset. The test data is called outputfile.csv. The file was processed by the trained model, and the predicted sentiment (pos / neg) was added to the same file under the column Label_prediction.
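One way to draw such a sample without loading the full file into RAM is reservoir sampling, sketched here. The document does not say which sampling method was actually used, so the technique, the function names and the UTF-8 encoding are all assumptions.

```python
import csv
import random


def reservoir_sample(rows, k, seed=None):
    """Return k rows drawn uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)          # fill the reservoir first
        else:
            j = rng.randint(0, i)       # inclusive on both ends
            if j < k:
                sample[j] = row         # replace an earlier pick
    return sample


def sample_csv(in_path, out_path, k, seed=None):
    """Stream a CSV file and write the header plus k randomly chosen records."""
    with open(in_path, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        rows = reservoir_sample(reader, k, seed)
    with open(out_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
```

A call such as sample_csv("outputfinal2.csv", "outputfile.csv", 45000) would reproduce the sizes described above while holding only 45000 rows in memory at any time.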

Exploratory data analysis (EDA) is a key step in any data science project. Quantitative and qualitative details of the test data were analyzed during the EDA step of this project. A heatmap is one way to visualize a large dataset and get an idea of the distribution of feature values. The output of the code in EDA.R, which visualizes the test data, is shown below -

## [1] "Summary of features in the test dataset"

##   Cleanliness    Label_prediction    Location        Overall     
##  Min.   :1.000   Min.   :0.0000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:0.0000   1st Qu.:4.000   1st Qu.:3.000  
##  Median :5.000   Median :0.0000   Median :5.000   Median :4.000  
##  Mean   :4.151   Mean   :0.4277   Mean   :4.319   Mean   :3.903  
##  3rd Qu.:5.000   3rd Qu.:1.0000   3rd Qu.:5.000   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :1.0000   Max.   :5.000   Max.   :5.000  
##      Rooms          Service      Sleep.Quality      Value      
##  Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:3.000   1st Qu.:3.00   1st Qu.:3.000  
##  Median :4.000   Median :4.000   Median :4.00   Median :4.000  
##  Mean   :3.819   Mean   :4.056   Mean   :3.93   Mean   :3.946  
##  3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.00   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.000

## [1] "Heatmap of the test dataset"

The second problem we handled was determining the relative importance of the features with respect to the overall rating of a hotel. We handled this through linear regression. The idea is to fit a linear function of the predictors to the overall rating, with the regression estimating a coefficient for each feature. The higher the coefficient of a feature, the greater its relative importance. One issue is that linear regression requires all predictors and outcomes to be numeric. That was not the case for our data, as the field Label_prediction is nominal. The file convertfactorstonumeric.R contains the code that converts the nominal variable to numeric.
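The two steps described above, encoding the nominal label and fitting a linear model, can be sketched as follows. This is a minimal illustration using the normal equations, not the code in convertfactorstonumeric.R or Regressionplot.R, and the coding pos -> 1, neg -> 0 is an assumption.

```python
def encode_label(label):
    """Map the nominal sentiment to a number (assumed coding: pos -> 1, neg -> 0)."""
    return {"pos": 1.0, "neg": 0.0}[label]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_ols(rows, targets):
    """Least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]     # prepend the intercept column
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, targets)) for i in range(p)]
    return solve(xtx, xty)
```

Each row passed to fit_ols would hold the numeric ratings plus encode_label(...) for the sentiment column, and the targets would be the Overall ratings.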

The full R code for the regression modelling is available in the file named Regressionplot.R; the output of the script is shown below -

##      (Intercept)      Cleanliness Label_prediction         Location 
##      -0.35477131       0.12425991       0.10395281       0.08482929 
##            Rooms          Service    Sleep.Quality            Value 
##       0.24910248       0.25244779       0.13734977       0.20675868
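To make the coefficients above concrete, the following sketch plugs them into the fitted linear equation and ranks the features by coefficient size. The coefficient values are copied from the output above; the helper names and the example hotel are hypothetical.

```python
# Coefficients copied from the regression output above.
INTERCEPT = -0.35477131
COEFFICIENTS = {
    "Cleanliness": 0.12425991,
    "Label_prediction": 0.10395281,
    "Location": 0.08482929,
    "Rooms": 0.24910248,
    "Service": 0.25244779,
    "Sleep.Quality": 0.13734977,
    "Value": 0.20675868,
}


def predict_overall(features):
    """Predicted Overall rating: intercept plus coefficient times value for each feature."""
    return INTERCEPT + sum(COEFFICIENTS[name] * value for name, value in features.items())


def rank_features():
    """Feature names ordered from the largest to the smallest coefficient."""
    return sorted(COEFFICIENTS, key=COEFFICIENTS.get, reverse=True)


# Hypothetical hotel review: top marks everywhere, sentiment coded as 1.
best_case = {name: 5.0 for name in COEFFICIENTS}
best_case["Label_prediction"] = 1.0
predicted = predict_overall(best_case)
ranking = rank_features()
```

The ranking puts Service and Rooms first and Location last, which is the ordering the relative importance discussion relies on.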

Below is the visualization showing the weight of evidence of each of the features. The code for the visualization is available in piechart.R.

We did further analysis to confirm our theory through the relaimpo package in R, which provides measures of relative importance for each of the predictors in the model. The R code for this can be found in the file Relativeimportance.R.

Yet another approach to confirm the dependency between variables is to perform correlation analysis of each predictor variable against Overall, the target variable. This analysis was performed in Rattle, and the code for it can be seen in correlation analysis.R.
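The correlation analysis described above can be sketched with a plain Pearson coefficient. This is an illustration rather than the Rattle-generated code, and it assumes the data is held as a dictionary mapping column names to lists of numbers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlations_with_target(table, target="Overall"):
    """Correlate every other column of a dict-of-lists table with the target column."""
    return {name: pearson(values, table[target])
            for name, values in table.items() if name != target}
```

Values close to 1 indicate that a predictor moves together with Overall, values near 0 indicate little linear dependency.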

The next challenge was to ascertain which “key themes” in the unstructured text result in a comment being categorized as “pos” or “neg” sentiment. To address this, we extracted correlation statistics for each feature in the feature table 123grams_nostop_nopunct.features.xml. The extraction was done separately for negative and positive sentiments. The extracts are available in the files named themesdecidingclasstrainneg.csv and themesdecidingclasstrainpos.csv. The files were pre-processed through the R code available in themesdecidingclasstrainneg.R and themesdecidingclasstrainpos.R. The top 100 words with the highest correlation were then identified for each of the pos and neg sentiments.
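As an illustration of how features can be ranked by their correlation with a sentiment class, the sketch below uses the phi coefficient, which is the Pearson correlation between two binary variables (feature present or absent, comment in the target class or not). Whether LightSIDE computes its correlation statistic in exactly this way is an assumption, and the function names are hypothetical.

```python
import math


def phi(n11, n10, n01, n00):
    """Phi coefficient of a 2x2 table: correlation between two binary variables."""
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0


def top_themes(doc_features, labels, target="pos", top_n=100):
    """Rank features by the correlation between their presence and the target sentiment."""
    n_target = sum(1 for y in labels if y == target)
    n_other = len(labels) - n_target
    present = {}   # feature -> [comments in target class, comments in other class]
    for feats, y in zip(doc_features, labels):
        for f in set(feats):
            counts = present.setdefault(f, [0, 0])
            counts[0 if y == target else 1] += 1
    scores = {}
    for f, (in_target, in_other) in present.items():
        scores[f] = phi(in_target, in_other, n_target - in_target, n_other - in_other)
    # Highest correlation first; ties broken alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]
```

Calling the function once with target="pos" and once with target="neg" gives the two separate top 100 lists described above.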

The negative sentiment is determined through the words (top 100) below -

The positive sentiment is determined through the words (top 100) below -

The awesome search interface can be accessed from the URL gainhack.cloudapp.net and the code is available at https://github.com/ahmedshabib/evergreen-gainsight-hack

The R markdown file created to explain this hack work is provided for review under the name Final.Rmd

Summary & Closing Remarks

This document details the approaches adopted by Sunil & Ahmed of Team Evergreen to solve problem statement 1 given to them during the GainSight Happy Hack event. Sunil & Ahmed believe that applying generalized linear models (logit, probit or multinomial) will result in a better model fit and therefore a better R-squared value. This may slightly change the results shown in this document. Sentiment analysis using a semantic approach was tried with the Harvard General Inquirer lexical database and WordNet synsets; however, we did not proceed further with it, as the improvement over the approach explained in this document was very minor. Lemmatization and other NLP parsing techniques were not used in the current work for sentiment analysis, and we believe using them would improve the accuracy obtained on test datasets. These approaches were not tried due to lack of time and infrastructure resource (RAM, CPU) constraints.