Fighting against social bots: The issue of identification

Dmitrii Pianov
05/02/2018

What is a social bot?

  • An algorithm that interacts with human users on social platforms.
  • Can be benign or malicious.
  • Malicious bots are used for a variety of purposes, including spam, manipulation of public discourse, and promotion of people and ideas.
  • Most of the malicious social bots hide their identity to avoid being detected.

Rise of social bots

  • The prevalence of social networks creates an opportunity for adversarial agents to benefit by flooding them with bots.
  • Breakthroughs in computational power and NLP allow bot owners to evade traditional detection heuristics.
  • A curious case is Cynk Technology, a company with no turnover or assets whose value rose to $4.5 billion overnight.

Why bother aka Research Question

  • Leaving commercial issues aside, bot identification helps to avoid information manipulation and keep society healthy.
  • Bots are constantly evolving, and so must our detection capabilities.
  • Social bot detection is a good example of why learning, rather than fitting, is the ultimate objective of machine learning.

Previous Research

  • Early research on bot detection relied mostly on domain knowledge and heuristic measures (Chu et al. (2012)).
  • Recent studies use traditional supervised and unsupervised techniques to classify accounts (Stukal et al. (2017), Kudugunta and Ferrara (2018), Chavoshi (2016)).
  • Both textual features and metadata are used as inputs.
  • There is no consensus on which of the two is more important.

Roadmap

  • I propose a theoretical model to show that the bot network is sustainable.
  • I use a character-level Bi-GRU RNN to extract features from tweet text.
  • Using a labeled dataset, I build a model to classify the origin of tweets using both textual features and metadata.

Theoretical Model Setup

[Figure: theoretical model setup]

Theoretical Model Results

  • Bot owners decide on the creation of new bots and on investment in sophistication.
  • Sophistication affects the survival rate of old bots.
  • No investment in sophistication occurs if the cost of creating new bots is low.
  • The bot network is sustainable if and only if the benefit from a bot increases with its lifetime.

Estimation

  • I use the data from Cresci et al. (2017), which contains more than 8 million tweets from a variety of human and bot accounts.
  • Each entry consists of the tweet text and metadata (number of mentions, replies, hashtags, etc.).
  • I sample 60,000 tweets (30,000 from human users and 30,000 from one type of bot).
  • The tweet text is then fed into a Bi-GRU RNN to produce a 500-dimensional vector that encodes the characters within the tweet.
  • For classification, I use a cross-validated traditional SVM with the embedding vector plus metadata as the input vector.
  • Models on the first 2 and 4 principal components (PCA) are also estimated as a robustness check.
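
The encoding step can be illustrated with a minimal NumPy sketch of a bidirectional GRU forward pass. The slides do not say how the encoder was sized or trained, so everything beyond "character-level Bi-GRU producing a 500-dimensional vector" is an assumption made for illustration: the byte-level vocabulary, the 32-dimensional character embedding, the split into 250 hidden units per direction, and the random (untrained) weights.

```python
import numpy as np

HIDDEN = 250   # assumption: 250 units per direction, 2 * 250 = 500 outputs
VOCAB = 256    # assumption: characters are represented as UTF-8 bytes
EMBED = 32     # assumption: size of the character embedding


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_gru(rng, input_dim, hidden_dim):
    """Random, untrained parameters for the update, reset and candidate gates."""
    scale = 1.0 / np.sqrt(hidden_dim)
    return {
        "W": rng.uniform(-scale, scale, (3, hidden_dim, input_dim)),
        "U": rng.uniform(-scale, scale, (3, hidden_dim, hidden_dim)),
        "b": np.zeros((3, hidden_dim)),
    }


def gru_final_state(inputs, p):
    """Run a GRU over a (T, input_dim) sequence and return the last hidden state."""
    h = np.zeros(p["b"].shape[1])
    for x in inputs:
        z = sigmoid(p["W"][0] @ x + p["U"][0] @ h + p["b"][0])        # update gate
        r = sigmoid(p["W"][1] @ x + p["U"][1] @ h + p["b"][1])        # reset gate
        c = np.tanh(p["W"][2] @ x + p["U"][2] @ (r * h) + p["b"][2])  # candidate state
        h = (1.0 - z) * h + z * c
    return h


def encode_tweet(text, embedding, fwd, bwd):
    """Concatenate the final states of a forward and a backward pass."""
    ids = list(text.encode("utf-8"))
    chars = embedding[ids]                      # (T, EMBED)
    return np.concatenate([
        gru_final_state(chars, fwd),            # reads the tweet left to right
        gru_final_state(chars[::-1], bwd),      # reads the tweet right to left
    ])


rng = np.random.default_rng(0)
embedding = rng.normal(0.0, 0.1, (VOCAB, EMBED))
fwd = init_gru(rng, EMBED, HIDDEN)
bwd = init_gru(rng, EMBED, HIDDEN)

vec = encode_tweet("Check out this link! #free", embedding, fwd, bwd)
print(vec.shape)
```

In practice the weights would be learned, and a deep learning framework would replace the hand-written loop; the sketch only shows how one tweet becomes one fixed-length vector that can be concatenated with the metadata.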

Results

Class      Precision  Recall  F1-score  Support
Bot             0.85    0.88      0.87     5902
Human           0.87    0.85      0.86     5959
avg/total       0.86    0.86      0.86    11861
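
A table of this shape can be produced with scikit-learn, whose `classification_report` prints the same precision, recall, F1-score and support columns. The sketch below is an assumption about the workflow, not the actual pipeline: it uses synthetic features in place of the embedding-plus-metadata vectors, an RBF kernel, a small grid over `C`, 5-fold cross-validation, and a 20% hold-out (the supports above sum to roughly a fifth of the 60,000 sampled tweets).

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for [tweet embedding | metadata]; NOT the real data.
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=10, random_state=0
)

# Assumed 80/20 split, stratified so both classes keep equal shares.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Cross-validated choice of the SVM regularisation strength C (assumed grid).
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    param_grid={"svc__C": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="f1_macro",
)
search.fit(X_train, y_train)

# Label 0 is called "Human" and label 1 "Bot" purely for illustration.
report = classification_report(
    y_test, search.predict(X_test), target_names=["Human", "Bot"]
)
print(report)
```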

Results cont.

Class      Precision  Recall  F1-score  Support
Bot             0.62    0.92      0.74     4891
Human           0.85    0.46      0.60     5044
avg/total       0.74    0.69      0.67     9915
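
Overall accuracy is not listed in the two tables, but it can be recovered approximately from the per-class recall and support: recall times support gives the number of correctly classified tweets in each class. The helper below is a hypothetical name introduced for this illustration, and because the recalls are rounded to two decimals the result is only accurate to about half a percentage point.

```python
def accuracy_from_report(rows):
    """rows: (recall, support) pairs, one per class."""
    correct = sum(recall * support for recall, support in rows)
    total = sum(support for _, support in rows)  # total = sum of class supports
    return correct / total


# (recall, support) for Bot and Human, copied from the two tables.
same_domain = accuracy_from_report([(0.88, 5902), (0.85, 5959)])
cross_domain = accuracy_from_report([(0.92, 4891), (0.46, 5044)])

print(f"same domain:  {same_domain:.3f}")
print(f"cross domain: {cross_domain:.3f}")
```

The two values come out near 0.86 and 0.69, close to the 87% and 68% quoted in the summary once rounding of the reported recalls is taken into account.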

Summary

  • The model reaches 87% accuracy within the same domain of data.
  • Reducing the features to 2 or 4 principal components drops accuracy to 50-55%, which suggests that no single combination of features dominates the prediction.
  • Sampling bots from another domain substantially decreases accuracy, to 68%.
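
The PCA robustness check can be sketched as the same classifier fitted on only the first 2 or 4 principal components. This is an assumed setup on synthetic data, so the scores it prints say nothing about the real tweets; it only shows how such a comparison could be wired up with scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for [tweet embedding | metadata]; NOT the real data.
X, y = make_classification(
    n_samples=400, n_features=30, n_informative=10, random_state=0
)

scores = {}
for k in (2, 4):
    # Standardise, project onto the first k principal components, then classify.
    model = make_pipeline(StandardScaler(), PCA(n_components=k), SVC())
    scores[k] = cross_val_score(model, X, y, cv=5).mean()

# Baseline on all features, for comparison with the reduced models.
full = make_pipeline(StandardScaler(), SVC())
scores["all"] = cross_val_score(full, X, y, cv=5).mean()

for name, acc in scores.items():
    print(f"{name}: mean CV accuracy {acc:.3f}")
```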

Thanks!