PhD Progress, Spring 2018

Michael Crawford
April 16th, 2018

Collected additional reviews
- 640 hotels and restaurants located in 8 additional cities
- San Francisco, San Diego, Los Angeles, Seattle, Boston, Houston, Tampa, Detroit
- Additional 740,000 reviews
- Dataset now contains about 1,000,000 reviews
Baseline testing with a Spark cluster

Used text prediction as the task to train the word embeddings
A side effect is the ability to generate fake reviews
Example 1
- Seed: The rooms are very nice and the staff was
- Result: very nice and helpful staff. The hotel is in a great location , right next to the convention center
Example 2
- Seed: The rooms are very nice but
- Result: the hotel is a little bit of a walk from the main entrance. The hotel is very nice and the staff is very friendly . The rooms are clean and the beds are comfortable .
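The seed-and-continue loop behind these examples can be sketched roughly as follows. This is not the RNN used in this work: a simple bigram count model stands in for the trained network so the example stays self-contained, and the names train_bigram, generate, and toy_corpus are hypothetical. Only the generation loop (take the seed, predict the next token, append, repeat) is meant to carry over.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each token, which tokens follow it (stand-in for the RNN)."""
    follow = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            follow[prev][nxt] += 1
    return follow

def generate(follow, seed, max_new_tokens=10):
    """Greedily extend the seed with the most frequent next token."""
    tokens = seed.lower().split()
    if not tokens:
        return ""
    for _ in range(max_new_tokens):
        candidates = follow.get(tokens[-1])
        if not candidates:
            break  # no known continuation for the last token
        tokens.append(candidates.most_common(1)[0][0])
    return " ".join(tokens)

# Toy, made-up corpus used only for illustration (not from the dataset).
toy_corpus = [
    "the staff was friendly",
    "the staff was helpful",
    "the staff was friendly and kind",
]
model = train_bigram(toy_corpus)
print(generate(model, "The staff", max_new_tokens=2))
```

An RNN language model would replace the count lookup with a forward pass over the whole prefix, and sampling from the predicted distribution instead of always taking the top token would give more varied reviews.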

Test different embedding sizes and RNN structures
Take the word embeddings from the RNN and use them for classification
- A form of transfer learning
- Try using embeddings with traditional classifiers
- Try using them in a new neural network built for classification
Test using reviews from other domains to improve language model
Study differences in the embeddings across geographical regions
- Does one region generalize better than the others?
Test different ways to handle the class imbalance
Try different forms of feature selection (the feature space is huge: >500,000)
Try XGBoost on the dataset
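Two of the planned steps above can be sketched in a few lines, under stated assumptions. For reusing the embeddings with traditional classifiers, one common option (an assumption here, not a settled design) is to average the word vectors of a review into a fixed-length feature vector. For the class imbalance, one simple option is inverse-frequency class weights, the same heuristic scikit-learn applies for class_weight="balanced". The names document_vector, balanced_class_weights, and toy_embeddings, and the "majority"/"minority" labels, are hypothetical.

```python
from collections import Counter

def document_vector(tokens, embeddings, dim):
    """Average the embeddings of known tokens; unknown tokens are skipped."""
    total = [0.0] * dim
    known = 0
    for tok in tokens:
        vec = embeddings.get(tok)
        if vec is None:
            continue
        for i, v in enumerate(vec):
            total[i] += v
        known += 1
    if known == 0:
        return total  # all-zero vector when no token is in the vocabulary
    return [t / known for t in total]

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# Toy 2-dimensional embeddings, made up for illustration only.
toy_embeddings = {"clean": [1.0, 0.0], "room": [0.0, 1.0]}
features = document_vector(["clean", "room", "xyz"], toy_embeddings, dim=2)

# Hypothetical label distribution: three majority-class reviews, one minority.
weights = balanced_class_weights(["majority", "majority", "majority", "minority"])
print(features, weights)
```

The resulting feature vectors have a fixed, small dimension regardless of vocabulary size, which also sidesteps the >500,000-feature space, and the weights can be passed to any classifier that accepts per-class weights.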