Ph.D. Progress, Spring 2016

Michael Crawford
April 25th, 2016

All Coursework completed
4 Published papers
- Survey of Review Spam Detection Using Machine Learning Techniques
- Efficient Modeling of User-Entity Preference in Big Social Networks
- Gender Prediction in Random Chat Networks Using Topological Network Structures and Masked Content
- Reducing Feature Set Explosion to Facilitate Real-World Review Spam Detection (in press)
R Version of FAU Datamining Tools
- Facilities for quick and easy analysis of data mining results
- Ensures far fewer errors in the analysis
- Provides for consistency of standard graphs and figures across papers

Python version of FAUDMUtils
- https://pypi.python.org/pypi/faudmutils/0.1.1
- Built as an extension of scikit-learn
- Provides some common transformers that are missing from the standard implementation
- Provides an easy wrapper for conducting proper experimentation
Implement basic semi-supervised learning framework in python
Expand Chatous paper to include behavioral features

Python version of FAUDMUtils
Implement basic semi-supervised learning framework in python
- Framework and methodology are complete
- Self-learning algorithm has been implemented
- Experimented with self-learning in the area of review spam, with mixed results
Expand Chatous paper to include behavioral features
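
The self-learning step mentioned above can be sketched roughly as follows. This is a minimal illustration of the general self-training idea, not the faudmutils implementation: the `self_train` function, its `threshold` and `max_rounds` parameters, and the toy `CentroidClassifier` base learner are all hypothetical names introduced here. The loop only assumes a model with scikit-learn-style `fit`, `predict_proba`, and `classes_`.

```python
class CentroidClassifier:
    """Hypothetical stand-in base learner: nearest class mean on 1-D inputs."""

    def fit(self, X, y):
        self.classes_ = sorted(set(y))
        self.centroids_ = {
            c: sum(x for x, label in zip(X, y) if label == c) / y.count(c)
            for c in self.classes_
        }
        return self

    def predict_proba(self, X):
        # Turn inverse distances to each centroid into pseudo-probabilities.
        rows = []
        for x in X:
            weights = [1.0 / (abs(x - self.centroids_[c]) + 1e-9)
                       for c in self.classes_]
            total = sum(weights)
            rows.append([w / total for w in weights])
        return rows


def self_train(model, X_lab, y_lab, X_unlab, threshold=0.8, max_rounds=10):
    """Fit, pseudo-label confident unlabeled points, and repeat."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(max_rounds):
        if not pool:
            break
        model.fit(X_lab, y_lab)
        confident, remaining = [], []
        for x, probs in zip(pool, model.predict_proba(pool)):
            best = max(range(len(probs)), key=lambda i: probs[i])
            if probs[best] >= threshold:
                confident.append((x, model.classes_[best]))
            else:
                remaining.append(x)
        if not confident:  # nothing passed the threshold, so stop early
            break
        for x, label in confident:
            X_lab.append(x)
            y_lab.append(label)
        pool = remaining
    model.fit(X_lab, y_lab)  # final fit on original plus pseudo-labeled data
    return model, X_lab, y_lab, pool


# Toy data (made up): two labeled points and three unlabeled ones.
model, X_final, y_final, leftover = self_train(
    CentroidClassifier(), [0.0, 10.0], ["ham", "spam"], [1.0, 9.0, 5.0]
)
```

In this toy run the points near each labeled example are absorbed, while the ambiguous midpoint stays unlabeled. Raising `threshold` trades fewer pseudo-labels for less label noise, which may be one factor behind mixed results on noisy data.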

Python version of FAUDMUtils
Implement basic semi-supervised learning framework in python
Expand Chatous paper to include behavioral features
- Explored various behavioral features in the Chatous chat network and their usefulness in predicting profile information (e.g., age)
- Results are extremely promising
- Longer presentation on results to follow

Implement other semi-supervised learning algorithms using the faudmutils framework
Automate collection of more labeled data from Yelp
Investigate and compare different feature engineering techniques for review spam
Implement a class-balanced version of the count vectorizer
Find location to publish Chatous results

Chatous is a Random Chat Network
Unique type of network in that users do not know each other prior to engaging in conversations
Users are placed into chat together randomly or based upon common interests
Can we predict profile features (e.g., gender) using:
- Masked Content
- Network Features
- Behavioral Features
The ability to predict this will facilitate filling in missing information and spotting potentially falsified information

Content is masked
- In today's society, privacy is of real concern
- Users do not want to expose the content of their messages
- Can we work around the limitation of not really knowing what was said?
- No possibility of part-of-speech or LIWC features
- Word order is lost
- Word counts are lost
Network is quite large
- Giant Component of more than 300,000 users
- More than 9 million chat logs

From our previous research we computed several network statistics
- Node Degree
- PageRank
- Clustering Coefficient
- Betweenness Centrality
New behavioral statistics have been added
- Number of chats
- Number of chats initiated
- Number of chats terminated
- Length of chats
- ….
Compared and contrasted the 3 types of features (masked content, network, behavior)
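
As a rough illustration of how per-user behavioral statistics like those listed above might be derived from chat logs, here is a minimal sketch. The record layout (`users`, `initiator`, `terminator`, `length`) is an assumption made for this example, since the actual Chatous schema is not shown in these slides, and `length` is treated as a message count. Node degree is included as the number of distinct chat partners, which assumes one edge per pair of users who chatted at least once.

```python
from collections import defaultdict

# Hypothetical chat-log records; the field names and values are made up.
chats = [
    {"users": ("a", "b"), "initiator": "a", "terminator": "b", "length": 12},
    {"users": ("a", "c"), "initiator": "c", "terminator": "a", "length": 3},
    {"users": ("b", "c"), "initiator": "b", "terminator": "b", "length": 7},
]


def behavioral_features(chats):
    """Aggregate simple per-user counts from a list of chat records."""
    stats = defaultdict(lambda: {
        "chats": 0, "initiated": 0, "terminated": 0,
        "total_length": 0, "partners": set(),
    })
    for chat in chats:
        u, v = chat["users"]
        # Each chat counts once for both participants.
        for user, partner in ((u, v), (v, u)):
            s = stats[user]
            s["chats"] += 1
            s["total_length"] += chat["length"]
            s["partners"].add(partner)
        stats[chat["initiator"]]["initiated"] += 1
        stats[chat["terminator"]]["terminated"] += 1
    return {
        user: {
            "chats": s["chats"],
            "initiated": s["initiated"],
            "terminated": s["terminated"],
            "mean_length": s["total_length"] / s["chats"],
            "degree": len(s["partners"]),
        }
        for user, s in stats.items()
    }


features = behavioral_features(chats)
```

Each resulting dictionary can serve as one row of a feature matrix, alongside the network and masked-content features.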

Tukey HSD Test
features	Group	AUC	stdev
all	A	0.960	0.001
behavior-network	B	0.958	0.001
behavior-text	C	0.952	0.001
behavior	D	0.949	0.001
network-text	E	0.847	0.002
text	F	0.769	0.002
network	G	0.709	0.002
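
The seven rows above correspond to every non-empty combination of the three feature groups, with "text" standing for the masked content features. A small sketch of how such an ablation grid could be enumerated is shown below. The function name and the naming scheme are assumptions that mirror the labels in the table, not code from the original experiments.

```python
from itertools import combinations

GROUPS = ("behavior", "network", "text")


def feature_set_names(groups=GROUPS):
    """List every non-empty combination of feature groups, largest first."""
    names = []
    for size in range(len(groups), 0, -1):
        for combo in combinations(groups, size):
            # The full combination is labeled "all", as in the table.
            names.append("all" if size == len(groups) else "-".join(combo))
    return names


names = feature_set_names()
```

Each name would then select the matching feature columns before training and evaluating a model, giving one AUC distribution per row for the Tukey HSD comparison.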

What does this mean?
- What you actually say isn't that important
- How you behave is far more telling of gender than what you say
- “We don't care what you say, we only care how you say it”
A full analysis of the behavioral features used is available upon request (omitted due to time)
Still need to compare the various types of behavioral features to determine which are the most telling