Ph.D. Progress, Fall 2017

Michael Crawford
December 4th, 2017

Accomplishments To Date

  • All Coursework completed
  • 7 Published papers
    • Cross-Domain Sentiment Analysis: An Empirical Investigation - IRI
    • Integrating Multiple Data Sources to Enhance Sentiment Prediction - IEEE CIC
    • An Investigation of Ensemble Techniques for Detection of Spam Reviews - ICMLA
    • Survey of Review Spam Detection Using Machine Learning Techniques
    • Efficient Modeling of User-Entity Preference in Big Social Networks
    • Gender Prediction in Random Chat Networks Using Topological Network Structures and Masked Content
    • Reducing Feature Set Explosion to Facilitate Real-World Review Spam Detection

Research Conducted Last Semester

  • Collected new review spam dataset

    • 320 hotels and restaurants located in 4 cities
    • Chicago, New Orleans, New York, Orlando
    • Over 260,000 reviews (40k hotel, 220k restaurant)
  • Began evaluating different methodologies for detecting spam in the new dataset

    • Basic machine learning algorithms
    • Ensemble methods
    • Random Forest
    • Voting Classifier
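
As a rough illustration of the voting idea listed above, the sketch below combines the label predictions of several base classifiers by simple majority (hard voting). This is an assumption for illustration only: the slides do not say whether hard or soft voting was used, and `majority_vote` and the sample predictions are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-classifier label lists into one label per review.

    `predictions` is a list of equal-length label lists, one per classifier.
    """
    combined = []
    for votes in zip(*predictions):
        # most_common(1) returns [(label, count)] for the most frequent label.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical outputs from three base classifiers on four reviews (1 = spam).
clf_a = [1, 0, 1, 0]
clf_b = [1, 1, 0, 0]
clf_c = [0, 1, 1, 0]
ensemble = majority_vote([clf_a, clf_b, clf_c])
```

With an odd number of binary classifiers there are no ties, so each review simply gets the label most classifiers agree on.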

Research Conducted This Semester -- Dataset

  • Collected additional reviews
    • 640 hotels and restaurants located in 8 additional cities
    • San Francisco, San Diego, Los Angeles, Seattle, Boston, Houston, Tampa, Detroit
    • Additional 740,000 reviews
    • Dataset now contains about 1,000,000 reviews

Research Conducted This Semester -- Dataset

  • Cities were specifically selected so that each is paired with another similar city
    • San Francisco – Seattle
    • San Diego – Los Angeles
    • New York – Boston
    • Chicago – Detroit
    • New Orleans – Houston
    • Orlando – Tampa
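
The pairing above lends itself to enumerating train/test city combinations for a location experiment. The sketch below is one possible way to set that up, not the method actually used: it labels each ordered (train city, test city) combination as "similar" when the two cities are paired above and "dissimilar" otherwise. The names `CITY_PAIRS`, `partner`, and `split_experiments` are hypothetical.

```python
from itertools import permutations

# City pairs exactly as listed on the slide.
CITY_PAIRS = [
    ("San Francisco", "Seattle"),
    ("San Diego", "Los Angeles"),
    ("New York", "Boston"),
    ("Chicago", "Detroit"),
    ("New Orleans", "Houston"),
    ("Orlando", "Tampa"),
]

# Map every city to its paired city, in both directions.
partner = {}
for a, b in CITY_PAIRS:
    partner[a] = b
    partner[b] = a

def split_experiments(cities):
    """Return (similar, dissimilar) lists of (train_city, test_city) tuples."""
    similar, dissimilar = [], []
    for train, test in permutations(cities, 2):
        if partner[train] == test:
            similar.append((train, test))
        else:
            dissimilar.append((train, test))
    return similar, dissimilar

similar, dissimilar = split_experiments(sorted(partner))
```

Each city contributes one similar experiment (tested on its partner) and ten dissimilar ones (tested on every other city).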

Research Conducted This Semester -- Classification

  • Initial attempt to train classifiers on dataset failed
  • Dataset is now too large to run on a single machine
  • Attempted to move research over to Apache Spark
  • Initial attempt to move to FAU cluster failed as cluster was having issues
  • Have set up a personal cluster on Amazon EC2

Research Conducted This Semester -- Classification

  • Can now run classification on the dataset, with the following initial results:

    Classifier   AUC
    LR           0.74
    MNB          0.66
    DT           0.46
    RF100        0.75
  • This is with no parameter tuning
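
For reading the AUC values above: AUC is the probability that a randomly chosen spam review is scored higher than a randomly chosen non-spam review, so 0.5 corresponds to random ranking and the DT result of 0.46 is slightly below chance. The sketch below computes AUC directly from that pairwise definition. It is a minimal illustration with made-up scores, not the evaluation code run on the cluster, and its quadratic cost makes it suitable only for small examples.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts the fraction of (positive, negative) pairs in which the positive
    example has the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up labels and scores (1 = spam) to show the three reference points.
labels = [1, 1, 0, 0]
perfect = auc(labels, [0.9, 0.8, 0.2, 0.1])   # every spam review ranked on top
inverted = auc(labels, [0.1, 0.2, 0.8, 0.9])  # every spam review ranked last
constant = auc(labels, [0.5, 0.5, 0.5, 0.5])  # classifier gives no ranking at all
```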

Next Steps

  • Try additional base algorithms
  • Perform hyper-parameter tuning
  • Train word vectors that are specific to reviews
  • Integrate with deep learning
  • Try XGBoost on the dataset
  • Determine the effect of location on classifier training
    • Test training and testing classifiers in geographically similar versus dissimilar locations
    • e.g., if we train a classifier on Orlando, how does it do on Tampa versus Chicago?
  • Train new word vectors on these reviews
  • Determine the effect of business type and cross business training
    • Conduct a more thorough analysis of cross domain classification in identifying review spam
    • Are different types of businesses more affected by location?
    • Is one type of business best for training a global classifier?
  • Determine the effect of splitting test and training sets by business instead of simply by review
  • I have also collected the ratings so the same types of analyses can be conducted on sentiment
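
One of the steps above is splitting test and training data by business rather than by review. A minimal sketch of such a split follows, assuming each review record carries a business identifier; the `business_id` field, the `split_by_business` helper, and the toy data are hypothetical. All reviews of a given business land on the same side of the split, so the classifier is never tested on a business it saw during training.

```python
import random

def split_by_business(reviews, test_fraction=0.2, seed=0):
    """Split reviews so that no business appears in both train and test."""
    # Sort first so the shuffle is reproducible for a given seed.
    businesses = sorted({r["business_id"] for r in reviews})
    rng = random.Random(seed)
    rng.shuffle(businesses)
    # Hold out at least one business for testing.
    n_test = max(1, int(len(businesses) * test_fraction))
    test_ids = set(businesses[:n_test])
    train = [r for r in reviews if r["business_id"] not in test_ids]
    test = [r for r in reviews if r["business_id"] in test_ids]
    return train, test

# Toy data: 20 reviews spread evenly over 5 hypothetical businesses.
reviews = [{"business_id": f"b{i % 5}", "text": f"review {i}"} for i in range(20)]
train, test = split_by_business(reviews)
```

Comparing results from this split against a plain per-review split would show how much a classifier benefits from having seen other reviews of the same business.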