Ph.D. Progress, Spring 2016

Michael Crawford
April 25th, 2016

Accomplishments To Date

  • All Coursework completed
  • 4 Published papers
    • Survey of Review Spam Detection Using Machine Learning Techniques
    • Efficient Modeling of User-Entity Preference in Big Social Networks
    • Gender Prediction in Random Chat Networks Using Topological Network Structures and Masked Content
    • Reducing Feature Set Explosion to Facilitate Real-World Review Spam Detection (in press)
  • R Version of FAU Datamining Tools
    • Facilities for quick and easy analysis of data mining results
    • Results in far fewer errors in the analysis
    • Keeps standard graphs and figures consistent across papers

Research Conducted This Semester

  • Python version of FAUDMUtils
  • Implement basic semi-supervised learning framework in Python
  • Expand Chatous paper to include behavioral features

Research Conducted This Semester -- FAUDMUtils

  • Python version of FAUDMUtils
    • https://pypi.python.org/pypi/faudmutils/0.1.1
    • Built as an extension of scikit-learn
    • Provides some common transformers that are missing from the standard implementation
    • Provides an easy wrapper for conducting proper experimentation
  • Implement basic semi-supervised learning framework in Python
  • Expand Chatous paper to include behavioral features

Research Conducted This Semester (Semi-Supervised Learning)

  • Python version of FAUDMUtils
  • Implement basic semi-supervised learning framework in Python
    • Framework and methodology are complete
    • Self-learning algorithm has been implemented
    • Experimented with self-learning in the area of review spam, with mixed results
  • Expand Chatous paper to include behavioral features
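
The self-learning loop mentioned above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the faudmutils implementation: the toy 1-D nearest-centroid classifier, the confidence heuristic, and the names `fit_centroids`, `predict_with_confidence`, `self_train`, `threshold`, and `max_rounds` are all assumptions made for the example. In practice the base learner would presumably be a scikit-learn estimator.

```python
from statistics import mean


def fit_centroids(xs, ys):
    """Toy base learner (assumption): one centroid per class for 1-D inputs."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label)
            for label in set(ys)}


def predict_with_confidence(centroids, x):
    """Return (label, confidence) for x using distance to each centroid."""
    dists = {label: abs(x - c) for label, c in centroids.items()}
    label = min(dists, key=dists.get)
    total = sum(dists.values())
    # Heuristic confidence: 1.0 on a centroid, 0.5 when equidistant
    # between two classes.
    confidence = 1.0 if total == 0 else 1.0 - dists[label] / total
    return label, confidence


def self_train(labeled_x, labeled_y, unlabeled_x, threshold=0.8, max_rounds=10):
    """Self-learning: repeatedly add confidently predicted instances."""
    xs, ys = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    for _ in range(max_rounds):
        centroids = fit_centroids(xs, ys)
        remaining, added = [], False
        for x in pool:
            label, confidence = predict_with_confidence(centroids, x)
            if confidence >= threshold:
                # Treat the confident prediction as if it were a true label.
                xs.append(x)
                ys.append(label)
                added = True
            else:
                remaining.append(x)
        pool = remaining
        if not added or not pool:
            break
    # Final model is refit on the original plus pseudo-labeled instances.
    return fit_centroids(xs, ys), pool


model, leftover = self_train([0.0, 10.0], ["ham", "spam"], [1.0, 9.0, 5.0])
```

Instances that never clear the threshold (here the ambiguous point 5.0) stay unlabeled. The threshold trades pseudo-label noise against how much unlabeled data gets used, which is one plausible source of mixed results.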

Research Conducted This Semester (Chatous)

  • Python version of FAUDMUtils
  • Implement basic semi-supervised learning framework in Python
  • Expand Chatous paper to include behavioral features
    • Explored various behavioral features in the Chatous chat network and their usefulness in predicting profile information (e.g., age)
    • Results are extremely promising
    • Longer presentation on results to follow

Research Currently in the Pipeline

  • Implement other semi-supervised learning algorithms using the faudmutils framework
  • Automate collection of more labeled data from Yelp
  • Investigate and compare different feature engineering techniques for review spam
  • Implement a class-balanced version of the count vectorizer
  • Find a venue to publish the Chatous results

Chatous In-Depth

  • Chatous is a Random Chat Network
  • Unique type of network in that users do not know each other prior to engaging in conversations
  • Users are placed into chat together randomly or based upon common interests
  • Can we predict profile features (e.g., gender) using:
    • Masked Content
    • Network Features
    • Behavioral Features
  • The ability to predict this will facilitate filling in missing information and spotting potentially falsified information

Chatous Challenges

  • Content is masked
    • In today's society, privacy is a real concern
    • Users do not want to expose the content of their messages
    • Can we work around the limitation of not really knowing what was said?
    • No possibility of part-of-speech or LIWC features
    • Word order is lost
    • Word counts are lost
  • Network is quite large
    • Giant Component of more than 300,000 users
    • More than 9 million chat logs

Chatous What We Did

  • From our previous research we computed several network statistics
    • Node Degree
    • PageRank
    • Clustering Coefficient
    • Betweenness Centrality
  • New behavioral statistics have been added
    • Number of chats
    • Number of chats initiated
    • Number of chats terminated
    • Length of chats
    • ….
  • Compared and contrasted the 3 types of features (masked content, network, behavior)
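
As a rough illustration of the behavioral statistics listed above, the sketch below aggregates per-user counts from chat records. The record layout (keys `users`, `initiator`, `terminator`, `length`) and the function name `behavioral_features` are assumptions for the example; the actual Chatous log format is not described here.

```python
from collections import defaultdict


def behavioral_features(chats):
    """Aggregate per-user behavioral counts from chat records.

    Each record is a dict with hypothetical keys: 'users' (the two
    participant ids), 'initiator', 'terminator', and 'length'
    (e.g. number of messages exchanged in the chat).
    """
    totals = defaultdict(
        lambda: {"chats": 0, "initiated": 0, "terminated": 0, "total_length": 0}
    )
    for chat in chats:
        for user in chat["users"]:
            row = totals[user]
            row["chats"] += 1
            row["total_length"] += chat["length"]
            if chat["initiator"] == user:
                row["initiated"] += 1
            if chat["terminator"] == user:
                row["terminated"] += 1

    # Convert running totals into one feature row per user.
    return {
        user: {
            "chats": row["chats"],
            "initiated": row["initiated"],
            "terminated": row["terminated"],
            "mean_length": row["total_length"] / row["chats"],
        }
        for user, row in totals.items()
    }


example_chats = [
    {"users": ("u1", "u2"), "initiator": "u1", "terminator": "u2", "length": 12},
    {"users": ("u1", "u3"), "initiator": "u3", "terminator": "u3", "length": 4},
]
features = behavioral_features(example_chats)
```

Each resulting row can then be joined with the network statistics and masked-content features for the same user before training.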

Chatous Results

Tukey HSD Test

features          Group  AUC    stdev
all               A      0.960  0.001
behavior-network  B      0.958  0.001
behavior-text     C      0.952  0.001
behavior          D      0.949  0.001
network-text      E      0.847  0.002
text              F      0.769  0.002
network           G      0.709  0.002

  • Behavior is by far the strongest single feature type (AUC 0.949, versus 0.769 for text and 0.709 for network)
  • Behavior+Network (0.958) outperforms Behavior+Text (0.952)
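
The seven rows of the table correspond to every non-empty combination of the three feature groups. A minimal sketch of enumerating those experimental conditions is shown below; the names `FEATURE_GROUPS`, `feature_set_combinations`, and `label_for` are illustrative and not taken from the original experiment code.

```python
from itertools import combinations

FEATURE_GROUPS = ("behavior", "network", "text")


def feature_set_combinations(groups=FEATURE_GROUPS):
    """Return every non-empty combination of feature groups."""
    combos = []
    for size in range(1, len(groups) + 1):
        combos.extend(combinations(groups, size))
    return combos


def label_for(combo, groups=FEATURE_GROUPS):
    """Build a label in the style of the results table, e.g. 'behavior-text'."""
    return "all" if len(combo) == len(groups) else "-".join(combo)


labels = [label_for(combo) for combo in feature_set_combinations()]
```

Each label would then name one model trained on the union of the listed feature groups, with the resulting AUC values compared by the Tukey HSD test.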


Chatous Results, cont.

  • What does this mean?
    • What you actually say isn't that important
    • How you behave is far more telling of gender than what you say
    • “We don't care what you say, we only care how you say it”
  • A full analysis of the behavioral features used is available upon request (omitted here due to time)
  • Still need to compare the individual behavioral features to determine which are the most telling