PhD Progress: Fall 2018
Michael Crawford
November 26th, 2018
Accomplishments To Date
- All coursework completed
- 7 published papers
- Cross-Domain Sentiment Analysis: An Empirical Investigation - IRI
- Integrating Multiple Data Sources to Enhance Sentiment Prediction - IEEE CIC
- An Investigation of Ensemble Techniques for Detection of Spam Reviews - ICMLA
- Survey of Review Spam Detection Using Machine Learning Techniques
- Efficient Modeling of User-Entity Preference in Big Social Networks
- Gender Prediction in Random Chat Networks Using Topological Network Structures and Masked Content
- Reducing Feature Set Explosion to Facilitate Real-World Review Spam Detection
Research Conducted In Fall 2017
- Collected additional reviews
- 640 hotels and restaurants located in 8 additional cities
- San Francisco, San Diego, Los Angeles, Seattle, Boston, Houston, Tampa, Detroit
- Additional 740,000 reviews
- Dataset now contains about 1,000,000 reviews
- Baseline testing on a Spark cluster
| Classifier | AUC |
| --- | --- |
| LR | 0.74 |
| MNB | 0.66 |
| DT | 0.46 |
| RF100 | 0.75 |
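For reference, the AUC values in the table above can be read as the probability that a randomly chosen spam review is scored higher than a randomly chosen legitimate one, so 0.5 corresponds to random ranking. The sketch below computes AUC that way in plain Python. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not the code or data used for the Spark baselines.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class (e.g. spam), 0 for the negative class.
    scores: classifier scores, higher meaning "more likely positive".
    Tied scores count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: two legitimate reviews (0) and two spam reviews (1).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(toy_auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```

The pairwise loop is quadratic in the number of examples, so it is only meant to show the definition; on a dataset of this size a rank-based or library implementation would be the practical choice.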
Research Conducted Spring 2018 -- Deep Learning
- Set up a research environment for testing with PyTorch
- Researched ways to use deep learning to improve results
- Trained new word embeddings using RNNs
Research Conducted Spring 2018 -- Fake Review Generation
- Used text prediction as the task to train the word embeddings
- A side effect is the ability to generate fake reviews
- Example 1
- Seed: The rooms are very nice and the staff was
- Result: very nice and helpful staff. The hotel is in a great location , right next to the convention center
- Example 2
- Seed: The rooms are very nice but
- Result: the hotel is a little bit of a walk from the main entrance. The hotel is very nice and the staff is very friendly . The rooms are clean and the beds are comfortable .
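The seed-and-continue behavior in the examples above can be sketched with a much simpler stand-in. The code below uses a toy bigram count model instead of the RNN language model, and greedy decoding instead of whatever sampling strategy produced the results above. The corpus, `train_bigram`, and `generate` are all hypothetical names and data chosen for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for each word, which words follow it in the corpus."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            follows[prev][nxt] += 1
    return follows


def generate(follows, seed, max_new_tokens=10):
    """Greedily extend the seed, one most-likely next word at a time."""
    tokens = seed.lower().split()
    for _ in range(max_new_tokens):
        candidates = follows.get(tokens[-1])
        if not candidates:
            break  # the last word was never followed by anything in training
        tokens.append(candidates.most_common(1)[0][0])
    return " ".join(tokens)


# Hypothetical toy corpus standing in for the hotel review text.
corpus = [
    "the staff was very friendly",
    "the staff was very helpful",
    "the staff was very friendly and kind",
]
follows = train_bigram(corpus)
generated = generate(follows, "the staff", max_new_tokens=3)
print(generated)  # the staff was very friendly
```

An RNN language model conditions on the whole seed through its hidden state rather than on the last word only, which is why the generated reviews above stay on topic over several sentences.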
Research Conducted this Semester
- Incorporated the encoder of the hotel text-prediction language model into an RNN classifier
- Takes the last encoding layer and replaces it with a layer to predict spam
- Reduces the error rate relative to MNB by 47%
- Increases AUC by 5.4%
- Note: I believe this can still be improved by tuning the dropout and learning rates
- Used transfer learning, starting from a language model trained on Wikipedia
- Slight increase in performance (much quicker training times)
- Still evaluating
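One reading of the encoder reuse described above is sketched below in PyTorch: the embedding and recurrent encoder of a language model are kept, and the next-word output layer is swapped for a single-logit spam head. This is a minimal sketch under assumptions. The class names, layer sizes, single-layer LSTM, dropout value, and use of the final time step are all illustrative and are not taken from the actual research code.

```python
import torch
import torch.nn as nn


class LanguageModel(nn.Module):
    """Toy RNN language model: embedding -> LSTM encoder -> next-word decoder."""

    def __init__(self, vocab_size, emb_dim=32, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens):
        hidden_states, _ = self.encoder(self.embedding(tokens))
        return self.decoder(hidden_states)  # (batch, seq_len, vocab_size)


class SpamClassifier(nn.Module):
    """Reuses the language model's embedding and encoder; adds a spam head."""

    def __init__(self, language_model, hidden_dim=64, dropout=0.3):
        super().__init__()
        # Shared modules: these carry over whatever the language model learned.
        self.embedding = language_model.embedding
        self.encoder = language_model.encoder
        self.dropout = nn.Dropout(dropout)
        # New layer that replaces the next-word decoder.
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, tokens):
        hidden_states, _ = self.encoder(self.embedding(tokens))
        last_state = hidden_states[:, -1, :]  # encoding at the final time step
        return self.head(self.dropout(last_state)).squeeze(-1)  # one logit per review


# Hypothetical sizes: a 1000-word vocabulary and a batch of 4 reviews of 12 tokens.
lm = LanguageModel(vocab_size=1000)
clf = SpamClassifier(lm)
batch = torch.randint(0, 1000, (4, 12))
logits = clf(batch)
print(logits.shape)
```

The logits would typically be trained with `nn.BCEWithLogitsLoss`. Starting `lm` from weights pretrained on a larger corpus such as Wikipedia, then fine-tuning on hotel reviews, would follow the same pattern with a different initial state.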
Next Steps
- Test different ways to address the class imbalance
- Evaluate on restaurant dataset
- Evaluate on combined dataset
- Publish findings