Fighting against social bots: The issue of identification
Dmitrii Pianov
05/02/2018
What is a social bot?
- An algorithm that interacts with human users on social platforms.
- Can be benign or malicious.
- Malicious bots are used for a variety of purposes, including spam, manipulation of public discourse, and promotion of people and ideas.
- Most malicious social bots hide their identity to avoid detection.
Rise of social bots
- The prevalence of social networks creates an opportunity for adversarial agents to profit by flooding them with bots.
- Breakthroughs in computational power and NLP allow bot owners to evade traditional detection heuristics.
- There is the curious case of Cynk Technology: a company with no turnover or assets that rose in value to $4.5 billion overnight.
Why bother aka Research Question
- Leaving commercial issues aside, bot identification helps to avoid information manipulation and keep society healthy.
- Bots are constantly evolving, and so must our detection capabilities.
- Social bot detection is a good example of why learning, rather than fitting, is the ultimate objective of machine learning.
Previous Research
- Early research on bot detection relied mostly on domain knowledge and heuristic measures (Chu et al., 2012).
- Recent studies use traditional supervised and unsupervised techniques to classify accounts (Stukal et al., 2017; Kudugunta and Ferrara, 2018; Chavoshi, 2016).
- Both textual data and metadata are used as features.
- There is no consensus on which of the two is more important.
Roadmap
- I propose a theoretical model to show that the bot network is sustainable.
- I use a character-level Bi-GRU RNN to extract features from tweet text.
- Using a labeled dataset, I build a model to classify the origin of tweets using both textual features and metadata.
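The character-level Bi-GRU step in the roadmap can be sketched as follows. This is only a minimal illustration, not the model used in the study: the weights are random and untrained, UTF-8 bytes stand in for characters, and the embedding size (16) and the split of the 500 dimensions into 250 per direction are assumptions made for the example. All function names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(input_dim, hidden_dim, rng):
    """Random (untrained) parameters for one GRU direction."""
    return {
        # Index 0: update gate, 1: reset gate, 2: candidate state.
        "W": rng.normal(0.0, 0.1, (3, hidden_dim, input_dim)),
        "U": rng.normal(0.0, 0.1, (3, hidden_dim, hidden_dim)),
        "b": np.zeros((3, hidden_dim)),
    }

def gru_last_state(params, inputs):
    """Run a GRU over a sequence of vectors and return the final hidden state."""
    W, U, b = params["W"], params["U"], params["b"]
    h = np.zeros(U.shape[1])
    for x in inputs:
        z = sigmoid(W[0] @ x + U[0] @ h + b[0])              # update gate
        r = sigmoid(W[1] @ x + U[1] @ h + b[1])              # reset gate
        h_new = np.tanh(W[2] @ x + U[2] @ (r * h) + b[2])    # candidate state
        h = (1.0 - z) * h + z * h_new
    return h

def encode_tweet(text, emb, fwd, bwd):
    """Encode a tweet as the concatenated final states of both directions."""
    # Assumption: UTF-8 byte values (0..255) are used as character ids.
    ids = np.array(list(text.encode("utf-8")), dtype=np.intp)
    xs = emb[ids]
    h_forward = gru_last_state(fwd, xs)          # reads left to right
    h_backward = gru_last_state(bwd, xs[::-1])   # reads right to left
    return np.concatenate([h_forward, h_backward])

rng = np.random.default_rng(0)
EMB_DIM, HIDDEN = 16, 250                        # assumed sizes; 2 * 250 = 500
emb = rng.normal(0.0, 0.1, (256, EMB_DIM))       # one embedding row per byte value
fwd = init_gru(EMB_DIM, HIDDEN, rng)
bwd = init_gru(EMB_DIM, HIDDEN, rng)

vec = encode_tweet("Check out this link! #free @user", emb, fwd, bwd)
print(vec.shape)
```

In a real setting the embedding table and GRU weights would be trained, and a deep learning library would replace the hand-written loop.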
Theoretical Model Results
- Bot owners decide on creation of new bots and investment in sophistication.
- Sophistication affects the survival rate of old bots.
- No investment in sophistication occurs if the cost of creating new bots is low.
- The bot network is sustainable if and only if the benefit of a bot increases with its lifetime.
Estimation
- I use the data from Cresci et al. (2017), which contains more than 8 million tweets from a variety of human and bot accounts.
- Each entry consists of tweet text and metadata (number of mentions, replies, hashtags, etc.).
- I sample 60,000 tweets (30,000 from human users, 30,000 from one type of bot).
- The tweet text is then fed into the Bi-GRU RNN to produce a 500-dimensional vector that encodes the characters in the tweet.
- For classification, I use a cross-validated traditional SVM with the embedding vector plus metadata as the input vector.
- As a robustness check, the model is also estimated on 2 and 4 principal components (PCA).
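The classification step above might look roughly like the sketch below, written with scikit-learn. It runs on synthetic stand-in data rather than the real tweets, and the RBF kernel, the grid of `C` values, the 5-fold cross-validation, and the 80/20 split are all assumptions chosen for illustration; the source does not specify them.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT the real dataset): each row is a small
# "embedding" plus a few "metadata" counts, with a mild shift between classes.
n, emb_dim, meta_dim = 600, 20, 3
y = np.repeat([0, 1], n // 2)                        # 0 = human, 1 = bot
embeddings = rng.normal(size=(n, emb_dim)) + 0.4 * y[:, None]
metadata = rng.poisson(lam=2 + y[:, None], size=(n, meta_dim))
X = np.hstack([embeddings, metadata])                # embedding + metadata

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

def fit_svm(n_components=None):
    """Cross-validated SVM, optionally on the first n_components PCA axes."""
    steps = [StandardScaler()]
    if n_components is not None:
        steps.append(PCA(n_components=n_components))
    steps.append(SVC(kernel="rbf"))                  # kernel is an assumption
    grid = {"svc__C": [0.1, 1, 10]}                  # assumed search grid
    return GridSearchCV(make_pipeline(*steps), grid, cv=5).fit(X_train, y_train)

full = fit_svm()
full_acc = full.score(X_test, y_test)
print(classification_report(y_test, full.predict(X_test),
                            target_names=["Human", "Bot"]))

# Robustness check: the same model on 2 and 4 principal components.
pca_acc = {k: fit_svm(k).score(X_test, y_test) for k in (2, 4)}
print(pca_acc)
```

Comparing `full_acc` with the entries of `pca_acc` mirrors the robustness check: a large drop after PCA indicates that the signal is spread across many input dimensions.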
Results
| | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Bot | 0.85 | 0.88 | 0.87 | 5902 |
| Human | 0.87 | 0.85 | 0.86 | 5959 |
| avg/total | 0.86 | 0.86 | 0.86 | 11861 |
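The avg/total row can be reproduced from the per-class rows, assuming it is a support-weighted average (the layout resembles scikit-learn's classification report, but that is a guess). The short check below uses only the rounded figures from the table, so its results are approximate.

```python
# Per-class figures copied from the table above.
rows = {
    "Bot":   {"precision": 0.85, "recall": 0.88, "f1": 0.87, "support": 5902},
    "Human": {"precision": 0.87, "recall": 0.85, "f1": 0.86, "support": 5959},
}

total_support = sum(r["support"] for r in rows.values())

def weighted(metric):
    """Support-weighted average of one metric across the classes."""
    return sum(r[metric] * r["support"] for r in rows.values()) / total_support

averages = {m: round(weighted(m), 2) for m in ("precision", "recall", "f1")}
print(total_support, averages)

# Support-weighted recall is the share of all tweets classified correctly,
# i.e. the overall accuracy (approximate here, since the inputs are rounded).
implied_accuracy = weighted("recall")
print(round(implied_accuracy, 3))
```

Because support-weighted recall equals overall accuracy, the table implies an accuracy of roughly 0.86 to 0.87 on same-domain data.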
Results cont.
| | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Bot | 0.62 | 0.92 | 0.74 | 4891 |
| Human | 0.85 | 0.46 | 0.60 | 5044 |
| avg/total | 0.74 | 0.69 | 0.67 | 9915 |
Summary
- The model reaches 87% accuracy on data from the same domain.
- Reducing the input with PCA degrades accuracy to 50-55%, which suggests that no small combination of features dominates the prediction.
- Sampling bots from another domain substantially decreases the prediction rate, to 68%.