Ph.D. Progress, Spring 2016
Michael Crawford
April 25th, 2016
Accomplishments To Date
- All Coursework completed
- 4 published papers
  - Survey of Review Spam Detection Using Machine Learning Techniques
  - Efficient Modeling of User-Entity Preference in Big Social Networks
  - Gender Prediction in Random Chat Networks Using Topological Network Structures and Masked Content
  - Reducing Feature Set Explosion to Facilitate Real-World Review Spam Detection (in press)
- R version of FAU Datamining Tools
  - Facilities for quick and easy analysis of data mining results
  - Ensures far fewer errors in the analysis
  - Provides consistent standard graphs and figures across papers
Research Conducted This Semester
- Python version of FAUDMUtils
- Implement basic semi-supervised learning framework in python
- Expand Chatous paper to include behavioral features
Research Conducted This Semester (FAUDMUtils)
- Python version of FAUDMUtils
  - https://pypi.python.org/pypi/faudmutils/0.1.1
  - Built as an extension of scikit-learn
  - Provides some common transformers that are missing from the standard implementation
  - Provides an easy wrapper for conducting proper experimentation
- Implement basic semi-supervised learning framework in python
- Expand Chatous paper to include behavioral features
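As a rough illustration of what building on scikit-learn can look like, the sketch below defines a small custom transformer and evaluates it inside a pipeline with repeated stratified cross-validation. The class name `Log1pTransformer`, the synthetic data, and the choice of cross-validation scheme are all assumptions made for this example; none of them is taken from faudmutils itself.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline


class Log1pTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: applies log(1 + x) to non-negative features."""

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the training data.
        return self

    def transform(self, X):
        return np.log1p(np.asarray(X, dtype=float))


# Synthetic stand-in data, shifted so every feature is non-negative.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X = X - X.min(axis=0)

# Putting the transformer in the pipeline keeps preprocessing inside each fold.
pipeline = make_pipeline(Log1pTransformer(), LogisticRegression(max_iter=1000))
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="roc_auc")
print(f"mean AUC over {len(scores)} folds: {scores.mean():.3f}")
```

Because the transformer follows the `fit`/`transform` contract, it can be combined with any scikit-learn estimator or model-selection helper without extra glue code.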
Research Conducted This Semester (Semi-Supervised Learning)
- Python version of FAUDMUtils
- Implement basic semi-supervised learning framework in python
  - Framework and methodology are complete
  - Self-learning algorithm has been implemented
  - Experimented with self-learning in the area of review spam, with mixed results
- Expand Chatous paper to include behavioral features
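For readers unfamiliar with self-learning (also called self-training), a minimal sketch follows. It assumes a confidence threshold on predicted class probabilities as the selection rule; the function name `self_train`, the threshold value, and the synthetic data are illustrative choices, not the implementation described in these slides.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def self_train(clf, X_lab, y_lab, X_unlab, threshold=0.9, max_rounds=10):
    """Repeatedly add confidently predicted unlabeled rows to the training set."""
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    clf.fit(X_lab, y_lab)
    for _ in range(max_rounds):
        if len(X_unlab) == 0:
            break
        proba = clf.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold
        if not confident.any():
            break  # nothing is confident enough; stop early
        # Use the model's own predictions as pseudo-labels.
        pseudo = clf.classes_[proba.argmax(axis=1)][confident]
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, pseudo])
        X_unlab = X_unlab[~confident]
        clf.fit(X_lab, y_lab)  # retrain on labeled plus pseudo-labeled rows
    return clf, len(y_lab)


# Synthetic data: 30 labeled rows, the rest treated as unlabeled.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_lab, y_lab, X_unlab = X[:30], y[:30], X[30:]
model, n_train = self_train(LogisticRegression(max_iter=1000), X_lab, y_lab, X_unlab)
print(f"training set grew from {len(y_lab)} to {n_train} rows")
```

Pseudo-labels are never revisited in this sketch, so an early mistake can be reinforced in later rounds, which is one plausible source of mixed results with this family of methods.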
Research Conducted This Semester (Chatous)
- Python version of FAUDMUtils
- Implement basic semi-supervised learning framework in python
- Expand Chatous paper to include behavioral features
  - Explored various behavioral features in the Chatous chat network and their usefulness in predicting profile information (e.g., age)
  - Results are extremely promising
  - Longer presentation on results to follow
Research Currently in the Pipeline
- Implement other semi-supervised learning algorithms using the faudmutils framework
- Automate collection of more labeled data from Yelp
- Investigate and compare different feature engineering techniques for review spam
- Implement a class-balanced version of the count vectorizer
- Find a venue to publish the Chatous results
Chatous In-Depth
- Chatous is a random chat network
  - A unique type of network in that users do not know each other before engaging in conversation
  - Users are placed into chats together randomly or based on common interests
- Can we predict profile features (e.g., gender) using:
  - Masked content
  - Network features
  - Behavioral features
- The ability to predict these features will help fill in missing information and spot potentially falsified information
Chatous Challenges
- Content is masked
  - In today's society, privacy is a real concern
  - Users do not want to expose the content of their messages
  - Can we work around the limitation of not really knowing what was said?
  - No possibility of part-of-speech or LIWC features
    - Word order is lost
    - Word counts are lost
- Network is quite large
  - Giant component of more than 300,000 users
  - More than 9 million chat logs
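To make the masked-content limitation concrete, the sketch below shows one way a message could be reduced to a set of opaque token IDs, which discards both word order and word counts. The salted-hash scheme, the function names, and the example messages are assumptions for illustration only; the slides do not say how Chatous actually masks content.

```python
import hashlib


def mask_message(text, salt="demo-salt"):
    """Return the set of salted-hash token IDs for one message (assumed scheme)."""
    tokens = text.lower().split()
    return {hashlib.sha256((salt + t).encode("utf-8")).hexdigest()[:8] for t in tokens}


def binary_features(masked_messages):
    """Build a presence/absence matrix over all masked token IDs."""
    vocab = sorted(set().union(*masked_messages))
    rows = [[1 if tok in msg else 0 for tok in vocab] for msg in masked_messages]
    return vocab, rows


a = mask_message("hello hello how are you")
b = mask_message("you are how hello")
# Order and repetition are gone, so both messages mask to the same set.
print(a == b)

vocab, rows = binary_features([a, b, mask_message("good night")])
print(len(vocab), rows)
```

With only presence or absence of opaque IDs available, features that depend on the actual words (part-of-speech tags, LIWC categories) cannot be computed, but a binary bag-of-tokens representation still can.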
Chatous What We Did
- From our previous research we computed several network statistics
  - Node degree
  - PageRank
  - Clustering coefficient
  - Betweenness centrality
- New behavioral statistics have been added
  - Number of chats
  - Number of chats initiated
  - Number of chats terminated
  - Length of chats
  - ….
- Compared and contrasted the 3 types of features (masked content, network, behavior)
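The network and behavioral statistics listed above can be sketched on a toy chat log as follows. The log schema (initiator, partner, who terminated, number of messages), the user IDs, and the use of message count as "length of chat" are assumptions made for this example; the NetworkX calls are standard, but this is not the code used in the study.

```python
from collections import Counter

import networkx as nx

# Hypothetical chat log: (initiator, partner, terminated_by, n_messages).
chats = [
    ("u1", "u2", "u1", 12),
    ("u1", "u3", "u3", 4),
    ("u2", "u3", "u2", 30),
    ("u4", "u1", "u1", 7),
]

# Network features: users are nodes, each chat links its two participants.
G = nx.Graph()
G.add_edges_from((a, b) for a, b, _, _ in chats)
network = {
    "degree": dict(G.degree()),
    "pagerank": nx.pagerank(G),
    "clustering": nx.clustering(G),
    "betweenness": nx.betweenness_centrality(G),
}

# Behavioral features: simple per-user counts over the log.
n_chats, n_initiated, n_terminated, total_len = Counter(), Counter(), Counter(), Counter()
for initiator, partner, terminator, n_messages in chats:
    for user in (initiator, partner):
        n_chats[user] += 1
        total_len[user] += n_messages
    n_initiated[initiator] += 1
    n_terminated[terminator] += 1

for user in sorted(G.nodes):
    mean_len = total_len[user] / n_chats[user]
    print(user, network["degree"][user], n_chats[user],
          n_initiated[user], n_terminated[user], round(mean_len, 2))
```

On a graph the size of the Chatous giant component, exact betweenness centrality is expensive; `nx.betweenness_centrality` accepts a `k` argument to approximate it from a sample of nodes.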
Chatous Results
Tukey HSD Test
| Features | Group | AUC | Std. dev. |
| --- | --- | --- | --- |
| all | A | 0.960 | 0.001 |
| behavior-network | B | 0.958 | 0.001 |
| behavior-text | C | 0.952 | 0.001 |
| behavior | D | 0.949 | 0.001 |
| network-text | E | 0.847 | 0.002 |
| text | F | 0.769 | 0.002 |
| network | G | 0.709 | 0.002 |
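
Under the usual convention for Tukey HSD groupings, feature sets that share a letter are not significantly different from one another; here every row has its own letter. The sketch below shows how such a pairwise comparison can be run with SciPy. The per-run AUC values are synthetic and the names `set_a`, `set_b`, `set_c` are made up; they are not the study's data.

```python
import numpy as np
from scipy.stats import tukey_hsd

rng = np.random.default_rng(0)

# Synthetic per-run AUC values for three made-up feature sets (not the study's data).
runs = {
    "set_a": rng.normal(0.90, 0.005, size=20),
    "set_b": rng.normal(0.85, 0.005, size=20),
    "set_c": rng.normal(0.80, 0.005, size=20),
}

# tukey_hsd takes one sample per group and returns a matrix of pairwise p-values.
result = tukey_hsd(*runs.values())
names = list(runs)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        p = result.pvalue[i, j]
        print(f"{names[i]} vs {names[j]}: p={p:.3g}, separate groups={p < 0.05}")
```

Pairs whose p-value falls below the chosen significance level (0.05 in this sketch) would be assigned different group letters.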