PhD Progress: Fall 2018

Michael Crawford
November 26th, 2018

Accomplishments To Date

  • All coursework completed
  • 7 published papers
    • Cross-Domain Sentiment Analysis: An Empirical Investigation - IRI
    • Integrating Multiple Data Sources to Enhance Sentiment Prediction - IEEE CIC
    • An Investigation of Ensemble Techniques for Detection of Spam Reviews - ICMLA
    • Survey of Review Spam Detection Using Machine Learning Techniques
    • Efficient Modeling of User-Entity Preference in Big Social Networks
    • Gender Prediction in Random Chat Networks Using Topological Network Structures and Masked Content
    • Reducing Feature Set Explosion to Facilitate Real-World Review Spam Detection

Research Conducted In Fall 2017

  • Collected additional reviews
    • 640 hotels and restaurants located in 8 additional cities
    • San Francisco, San Diego, Los Angeles, Seattle, Boston, Houston, Tampa, Detroit
    • Additional 740,000 reviews
    • Dataset now contains about 1,000,000 reviews
  • Baseline testing on a Spark cluster

    Classifier   AUC
    LR           0.74
    MNB          0.66
    DT           0.46
    RF100        0.75
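
For reference, AUC can be read as the probability that a randomly chosen positive (spam) review is scored above a randomly chosen negative one. The sketch below computes it that way on a handful of made-up labels and scores; it is only an illustration of the metric, not the Spark evaluation code behind the table, and `roc_auc` is a hypothetical helper name.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = spam) and classifier scores, for illustration only.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.6, 0.8]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking, so a value such as the 0.46 reported for DT is slightly below chance.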

Research Conducted Spring 2018 -- Deep Learning

  • Set up research environment for testing with PyTorch
  • Researched ways to use deep learning to improve results
  • Trained new word embeddings using RNNs

Research Conducted Spring 2018 -- Fake Review Generation

  • Used text prediction as the task to train the word embeddings
  • A side effect is the ability to generate fake reviews
  • Example 1
    • Seed: The rooms are very nice and the staff was
    • Result: very nice and helpful staff. The hotel is in a great location , right next to the convention center
  • Example 2
    • Seed: The rooms are very nice but
    • Result: the hotel is a little bit of a walk from the main entrance. The hotel is very nice and the staff is very friendly . The rooms are clean and the beds are comfortable .
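
Generating text from a seed amounts to repeatedly predicting the next token and appending it. The sketch below shows that loop with a toy bigram count model standing in for the RNN language model; the corpus and the helper names `train_bigrams` and `generate` are hypothetical and are not taken from the actual experiments.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for each token, which tokens follow it in the training text."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, seed, max_new_tokens=5):
    """Greedily extend the seed with the most frequent next token."""
    tokens = seed.lower().split()
    for _ in range(max_new_tokens):
        followers = counts.get(tokens[-1])
        if not followers:
            break  # no known continuation for the last token
        # Most frequent follower; ties broken alphabetically for determinism.
        next_token = min(followers, key=lambda t: (-followers[t], t))
        tokens.append(next_token)
    return " ".join(tokens)


# Toy corpus, for illustration only.
corpus = [
    "the staff was very friendly",
    "the staff was very helpful",
    "the staff was very friendly and kind",
]
model = train_bigrams(corpus)
text = generate(model, "The staff", max_new_tokens=3)
print(text)
```

Always taking the top prediction tends to produce repetitive text; sampling from the predicted distribution is a common alternative.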

Research Conducted this Semester

  • Incorporated the hotel text-prediction language model encoder into an RNN classifier
    • Takes the last encoding layer and replaces it with a layer to predict spam
    • Decreases error rate over MNB by 47%
    • Increases AUC by 5.4%
    • Note: I believe this can still be improved by tuning the dropout and learning rates
  • Used transfer learning, starting from a language model pretrained on Wikipedia
    • Slight increase in performance; much quicker training times
    • Still evaluating
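
The encoder reuse described above can be sketched in PyTorch roughly as follows. This is a minimal illustration under assumptions, not the actual research code: the class names `LanguageModel` and `SpamClassifier`, the LSTM encoder, and all layer sizes are hypothetical. The idea shown is that the classifier shares the language model's embedding and recurrent layers and swaps the next-word output layer for a single-logit spam layer.

```python
import torch
import torch.nn as nn


class LanguageModel(nn.Module):
    """Toy next-word prediction model (sizes are illustrative assumptions)."""

    def __init__(self, vocab_size, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)  # next-word layer

    def forward(self, tokens):
        hidden_states, _ = self.rnn(self.embedding(tokens))
        return self.decoder(hidden_states)  # (batch, seq_len, vocab_size)


class SpamClassifier(nn.Module):
    """Reuses the language model encoder; the decoder is replaced by a spam head."""

    def __init__(self, language_model, hidden_dim=64, dropout=0.3):
        super().__init__()
        self.embedding = language_model.embedding  # shared, pretrained weights
        self.rnn = language_model.rnn              # shared, pretrained weights
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(hidden_dim, 1)       # new layer: one spam logit

    def forward(self, tokens):
        hidden_states, _ = self.rnn(self.embedding(tokens))
        last_state = hidden_states[:, -1, :]       # encoding at the final token
        return self.head(self.dropout(last_state)).squeeze(-1)


torch.manual_seed(0)
lm = LanguageModel(vocab_size=100)
clf = SpamClassifier(lm)
batch = torch.randint(0, 100, (4, 12))  # 4 hypothetical reviews, 12 token ids each
logits = clf(batch)                     # one spam logit per review
print(logits.shape)
```

The dropout probability and the optimizer's learning rate are the two settings the note above refers to when it mentions further tuning.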

Next Steps

  • Test different ways to address the class imbalance
  • Evaluate on restaurant dataset
  • Evaluate on combined dataset
  • Publish findings
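
As one example of what the class-imbalance tests could cover, the sketch below computes inverse-frequency class weights, which can then be passed to a weighted loss. This is only one candidate approach and has not been evaluated here; the label counts and the helper name `inverse_frequency_weights` are hypothetical.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by total / (n_classes * class_count).

    Rare classes receive proportionally larger weights.
    """
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return {cls: total / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label distribution: 90 genuine reviews (0), 10 spam reviews (1).
labels = [0] * 90 + [1] * 10
weights = inverse_frequency_weights(labels)
print(weights)
```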