PhD Progress, Fall 2017
Michael Crawford
December 4th, 2017
Accomplishments To Date
- All coursework completed
- 7 published papers
  - Cross-Domain Sentiment Analysis: An Empirical Investigation - IRI
  - Integrating Multiple Data Sources to Enhance Sentiment Prediction - IEEE CIC
  - An Investigation of Ensemble Techniques for Detection of Spam Reviews - ICMLA
  - Survey of Review Spam Detection Using Machine Learning Techniques
  - Efficient Modeling of User-Entity Preference in Big Social Networks
  - Gender Prediction in Random Chat Networks Using Topological Network Structures and Masked Content
  - Reducing Feature Set Explosion to Facilitate Real-World Review Spam Detection
Research Conducted Last Semester
Research Conducted This Semester -- Dataset
- Collected additional reviews
  - 640 hotels and restaurants located in 8 additional cities
    - San Francisco, San Diego, Los Angeles, Seattle, Boston, Houston, Tampa, Detroit
  - Additional 740,000 reviews
- Dataset now contains about 1,000,000 reviews
Research Conducted This Semester -- Dataset
- Cities were specifically selected so that each is paired with a similar city
  - San Francisco – Seattle
  - San Diego – Los Angeles
  - New York – Boston
  - Chicago – Detroit
  - New Orleans – Houston
  - Orlando – Tampa
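The pairing above can be written down as a small lookup so that, for any training city, the "similar" test city and the remaining "dissimilar" test cities are easy to enumerate. This is only a sketch: the names `CITY_PAIRS`, `PARTNER`, and `evaluation_plan` are hypothetical, and treating every unpaired city as dissimilar is an assumption rather than something the slides state.

```python
# City pairs as listed above; each pair is treated as "geographically similar".
CITY_PAIRS = [
    ("San Francisco", "Seattle"),
    ("San Diego", "Los Angeles"),
    ("New York", "Boston"),
    ("Chicago", "Detroit"),
    ("New Orleans", "Houston"),
    ("Orlando", "Tampa"),
]

# Symmetric lookup: city -> its paired city.
PARTNER = {}
for first, second in CITY_PAIRS:
    PARTNER[first] = second
    PARTNER[second] = first


def evaluation_plan(train_city):
    """Return (similar_city, dissimilar_cities) for a given training city.

    Assumption: every city other than the training city and its partner
    counts as "dissimilar".
    """
    similar = PARTNER[train_city]
    dissimilar = sorted(c for c in PARTNER if c not in (train_city, similar))
    return similar, dissimilar
```

For example, `evaluation_plan("Orlando")` pairs Orlando with Tampa and leaves Chicago among the dissimilar test cities.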
Research Conducted This Semester -- Classification
- Initial attempt to train classifiers on dataset failed
- Dataset is now too large to run on a single machine
- Attempted to move research over to Apache Spark
- Initial attempt to move to FAU cluster failed as cluster was (and still is) having issues
- Have set up a few personal clusters on Amazon EC2
Research Conducted This Semester -- Classification
- Am now able to do classification on the dataset with the following initial results
| Classifier | AUC |
|------------|------|
| LR | 0.74 |
| MNB | 0.66 |
| DT | 0.46 |
| RF100 | 0.75 |
- These results are without any parameter tuning
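For reading the table: AUC can be interpreted as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, so 0.5 corresponds to random ranking and the DT score of 0.46 is no better than chance. The function below is a minimal pure-Python sketch of that definition, not the evaluation code used for the results above. The name `roc_auc` and the label encoding (1 for the positive class, 0 for the negative class) are assumptions, and the quadratic loop is for illustration only; it would not scale to a dataset of this size.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Tied scores count as half a correct pair.
    Assumption: labels are 1 (positive) or 0 (negative).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))
```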
Next Steps
- Try additional base classification algorithms (NB, MLP)
- Perform hyper-parameter tuning
- Test different ways to address the class imbalance
- Try different forms of feature selection (the feature space is huge, with more than 500,000 features)
- Train word vectors that are specific to reviews
- Integrate with deep learning
- Try XGBoost on the dataset
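One common way to address class imbalance, among the several that could be tested, is to weight each class inversely to its frequency so that the minority class contributes as much to the loss as the majority class. The sketch below illustrates that idea only. The function name `balanced_class_weights` is hypothetical, and the slides do not state the actual class ratio in the dataset.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes get weights above 1, common classes below 1, and the
    total weight summed over all examples stays equal to n_samples.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}
```

With a made-up split of 90 examples of one class and 10 of the other, the minority class receives a weight of 5.0 and the majority class about 0.56.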
Next Steps
- Determine the effect of location on classifier training
- Test training and testing classifiers in geographically similar versus dissimilar locations
- e.g., if we train a classifier on Orlando, how does it perform on Tampa versus Chicago?
- Determine the effect of business type and cross business training
- Conduct a more thorough analysis of cross-domain classification in identifying review spam
- Are different types of businesses more affected by location?
- Is one type of business best for training a global classifier?
- Determine the effect of splitting training and test sets by business instead of simply by review
- I have also collected the ratings so the same types of analyses can be conducted on sentiment
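Splitting by business, as opposed to by review, means that all reviews of a given business land on the same side of the split, so the classifier is never tested on a business it saw during training. A minimal sketch of such a split follows. The `business_id` field, the function name `split_by_business`, and the default test fraction are all assumptions made for illustration and are not taken from the actual dataset.

```python
import random


def split_by_business(reviews, test_fraction=0.2, seed=0):
    """Split reviews so that no business appears in both train and test.

    reviews: list of dicts, each with a (hypothetical) 'business_id' key.
    """
    # Sort first so the shuffle is reproducible for a given seed.
    businesses = sorted({r["business_id"] for r in reviews})
    rng = random.Random(seed)
    rng.shuffle(businesses)

    # Hold out a fraction of businesses (at least one) for testing.
    n_test = max(1, round(len(businesses) * test_fraction))
    test_ids = set(businesses[:n_test])

    train = [r for r in reviews if r["business_id"] not in test_ids]
    test = [r for r in reviews if r["business_id"] in test_ids]
    return train, test
```

The same function could split on a city field instead, which would give the location-based train and test sets described in the first items of this list.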