Our goal is to create a system that can reliably detect whether a Twitter account is a “bot”, meaning an account controlled by an algorithm or automated program rather than by a human. Twitter and other social media companies struggle to detect these bot accounts but want to remove them, since a communication platform is not meant to host accounts that pass themselves off as real people. We found a data set that includes many different metrics about Twitter accounts, such as number of followers, number of friends, account language, and number of favorites. Each account is also labeled as a bot or not, which gives us the opportunity to use this data to train a model that detects which Twitter accounts are bots. We see this not only as a great opportunity to help Twitter remove bot accounts, but also as good practice in training and evaluating models with the techniques we learned this past semester.
Data cleaning is always an important first step in building predictive models, as data sets are often riddled with empty cells, malformed cells, and even highly correlated columns that do not all need to be present in the set.
We remove the rows with incomplete cases because the gg_miss_var plot shows that nearly 600 cases are missing the “bot” value. Since we are creating a system that predicts whether a Twitter account is a bot, every data point used for training and testing needs a value for this label. Additionally, we remove id, id_str, screen_name, location, description, url, created_at, status, and name, because these are text columns that are essentially unique to each data point and thus unnecessary to include in the training.
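A minimal sketch of this cleaning step, assuming the raw data has been read into a data frame named twitter_data (the column names match the output below; the exact original code is not shown):

```r
library(naniar)  # for gg_miss_var
library(dplyr)

# Visualize missingness per variable (shows the ~600 missing "bot" labels)
gg_miss_var(twitter_data)

# Keep only complete cases and drop the identifier / free-text columns
twitter_data <- twitter_data %>%
  filter(complete.cases(.)) %>%
  select(-id, -id_str, -screen_name, -location, -description,
         -url, -created_at, -status, -name)
```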
## followers_count friends_count listed_count favourites_count verified
## 1 1291 0 10 0 FALSE
## 2 1 349 0 38 FALSE
## 3 1086 0 14 0 FALSE
## 4 33 0 8 0 FALSE
## 5 11 745 0 146 FALSE
## 6 1 186 0 0 FALSE
## statuses_count lang default_profile default_profile_image
## 1 78554 "en" TRUE FALSE
## 2 31 en TRUE FALSE
## 3 713 en TRUE FALSE
## 4 676 en TRUE TRUE
## 5 185 en FALSE FALSE
## 6 11 en TRUE FALSE
## has_extended_profile bot
## 1 FALSE 1
## 2 FALSE 1
## 3 FALSE 1
## 4 FALSE 1
## 5 FALSE 1
## 6 TRUE 1
This is now what the twitter_data dataframe looks like. We have 10 variables on which to form predictions, then the ‘bot’ variable that is 1 when a Twitter account is a bot and 0 when it is not.
We now check the split of data to find the base rate.
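One way this base rate might be computed (a sketch; the exact original call is not shown):

```r
# Share of accounts labeled as bots ("bot" = 1): the base rate
table(twitter_data$bot)["1"] / nrow(twitter_data)
```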
## 1
## 0.4722917
The split is 47.2%, meaning that roughly 47% of our data is for bots (“bot” = 1). This indicates a well-balanced data set, which is ideal for building a model.
Now we set the remaining categorical variables as factors and then recode all variables (besides “bot”) as numeric for analysis. We also scale all columns but “bot” for the KNN training, since KNN is distance-based and sensitive to differences in variable ranges.
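A sketch of this preprocessing, assuming the column names shown above (the exact recoding in the original code may differ):

```r
library(dplyr)

twitter_data <- twitter_data %>%
  # convert the logical / character columns to factors, then to numeric codes
  mutate(across(c(verified, lang, default_profile,
                  default_profile_image, has_extended_profile),
                ~ as.numeric(as.factor(.x)))) %>%
  # scale every predictor, leaving the "bot" label untouched
  mutate(across(-bot, ~ as.numeric(scale(.x))))
```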
Now, we check the correlation plot to see what we might want to remove from the dataframe. If two variables are highly correlated, then we don’t need both of them for training our system.
The variables listed_count and followers_count have a correlation coefficient of 0.81, while every other pair of variables has a correlation coefficient below 0.5. We therefore remove one variable from the highly correlated pair, listed_count (though we could have removed followers_count instead).
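A sketch of this check, assuming the corrplot package was used for the correlation plot (the original plotting approach may differ):

```r
library(corrplot)

# Correlation matrix of the predictors (everything except the label)
cor_mat <- cor(twitter_data[, setdiff(names(twitter_data), "bot")])
corrplot(cor_mat, method = "number")

# Drop listed_count, which correlates ~0.81 with followers_count
twitter_data$listed_count <- NULL
```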
Then we split the data into train and test sets so that 80% of our data is used for training and 20% for testing.
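A sketch of the split, assuming caret's createDataPartition (the seed is an assumption, used only for reproducibility):

```r
library(caret)

set.seed(2021)  # arbitrary seed, assumed for reproducibility
train_index <- createDataPartition(twitter_data$bot, p = 0.8, list = FALSE)

bot_data_train <- twitter_data[train_index, ]
bot_data_test  <- twitter_data[-train_index, ]

nrow(bot_data_train)
nrow(bot_data_test)
```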
## [1] 2238
## [1] 559
We confirm here that our train set has 80% of our data: bot_data_train (our training set) has 2238 rows while bot_data_test (our testing set) has 559 rows. This aligns with our goal of an 80-20 train-test split.
Now, we create an elbow plot to check which k value maximizes the accuracy of our kNN model.
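A sketch of how this elbow plot might be built, assuming the class package's knn() function; the loop details and variable names here are illustrative:

```r
library(class)
library(ggplot2)

predictor_cols <- setdiff(names(bot_data_train), "bot")

# Try a range of k values and record the test-set accuracy for each
k_values <- seq(1, 21, by = 2)
accuracies <- sapply(k_values, function(k) {
  preds <- knn(train = bot_data_train[, predictor_cols],
               test  = bot_data_test[, predictor_cols],
               cl    = as.factor(bot_data_train$bot),
               k     = k)
  mean(preds == bot_data_test$bot)
})

ggplot(data.frame(k = k_values, accuracy = accuracies),
       aes(x = k, y = accuracy)) +
  geom_line() +
  geom_point()
```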
We see that 11 is our best k value for training our model because it gives us a model with the highest accuracy.
Now we run the kNN analysis with 11 nearest neighbors and analyze the accuracy of the model.
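A sketch of this step, assuming the class and caret packages; prob = TRUE keeps the winning-class vote shares so we can compute LogLoss, the ROC curve, and adjusted thresholds later:

```r
library(class)
library(caret)

predictor_cols <- setdiff(names(bot_data_train), "bot")

knn_pred <- knn(train = bot_data_train[, predictor_cols],
                test  = bot_data_test[, predictor_cols],
                cl    = as.factor(bot_data_train$bot),
                k     = 11,
                prob  = TRUE)

confusionMatrix(data      = knn_pred,
                reference = as.factor(bot_data_test$bot),
                positive  = "1",
                dnn       = c("Prediction", "Actual"))
```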
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 244 42
## 1 35 238
##
## Accuracy : 0.8623
## 95% CI : (0.8309, 0.8897)
## No Information Rate : 0.5009
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.7245
##
## Mcnemar's Test P-Value : 0.4941
##
## Sensitivity : 0.8500
## Specificity : 0.8746
## Pos Pred Value : 0.8718
## Neg Pred Value : 0.8531
## Prevalence : 0.5009
## Detection Rate : 0.4258
## Detection Prevalence : 0.4884
## Balanced Accuracy : 0.8623
##
## 'Positive' Class : 1
##
## F1
## 0.8607595
We see from this confusion matrix that the accuracy is 86.23%, far above the base rate of about 47%. This means that about 86% of the time, our system correctly differentiates between a bot and a non-bot account. The kappa is 0.7245, which is generally considered moderately strong agreement beyond chance. The sensitivity (the proportion of true positives) is 0.85, meaning there is an 85% chance of the system correctly identifying a bot. The specificity (the proportion of true negatives) is 0.8746, meaning there is an 87.5% chance of correctly detecting that an account is not a bot. The F1 score is 0.86. Together, these indicate decently strong performance for our system. We also see in the confusion matrix that there are 35 false positives (12.5% of non-bots are mislabeled) and 42 false negatives (15% of bots are mislabeled). The test set itself is well balanced, with 279 non-bot accounts and 280 bot accounts.
## [1] 2.777404
We have a LogLoss score of 2.78 for this model. LogLoss measures how close the predicted probabilities are to the actual labels, so it heavily penalizes very confident incorrect predictions. This score is higher than we would like for our model, which means our model is confidently making some false predictions. This is something we will keep in mind!
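A sketch of how this LogLoss might be computed, assuming the MLmetrics package. Note that knn() stores the vote share of the winning class, so it first has to be converted into the probability of the positive (“bot”) class; prob_bot is an illustrative name:

```r
library(MLmetrics)

# knn() stores the winning-class vote share in the "prob" attribute
win_prob <- attr(knn_pred, "prob")

# Convert to P(bot = 1): keep the share when the prediction is 1,
# otherwise use one minus the share
prob_bot <- ifelse(knn_pred == "1", win_prob, 1 - win_prob)

LogLoss(y_pred = prob_bot,
        y_true = as.numeric(as.character(bot_data_test$bot)))
```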
## [1] 0.9302739
As you can see from this graph, our AUC is pretty good! There is significant area between our curve (the multicolored one) and the y = x line that represents random guessing. The AUC value is 0.93, which tells us that our model is pretty good at distinguishing between our two classes (whether or not an account is a bot).
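A sketch of the ROC curve and AUC calculation, assuming the ROCR package and the illustrative prob_bot vector from the LogLoss step (colorize = TRUE gives a multicolored curve like the one described above):

```r
library(ROCR)

pred_obj <- prediction(predictions = prob_bot,
                       labels = bot_data_test$bot)

# ROC curve colored by threshold, plus the y = x random-guessing line
plot(performance(pred_obj, measure = "tpr", x.measure = "fpr"),
     colorize = TRUE)
abline(a = 0, b = 1, lty = 2)

# Area under the curve
performance(pred_obj, measure = "auc")@y.values[[1]]
```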
We have slightly more false negatives than false positives. That being said, we feel the priority for this system is to be confident that an account flagged as a bot really is one. Assuming this system is used to flag and remove Twitter accounts that are bots, mislabeling a real account as a bot could mean deleting an account that holds meaning for its human owner. So, we want to minimize false positives and thus maximize specificity, which means we should increase the classification threshold.
When we adjust the threshold to 0.6, we have an accuracy of 0.8479, a kappa of 0.6959, a sensitivity of 0.7964, and a specificity of 0.8996. We have 28 cases of false positives.
When we adjust the threshold to 0.7, we have an accuracy of 0.8336, a kappa of 0.6674, a sensitivity of 0.7429, and a specificity of 0.9247. We have 21 cases of false positives.
When we adjust the threshold to 0.8, we have an accuracy of 0.8032, a kappa of 0.6066, a sensitivity of 0.6679, and a specificity of 0.9391. We have 17 cases of false positives.
When we adjust the threshold to 0.9, we have an accuracy of 0.7746, a kappa of 0.5495, a sensitivity of 0.5857, and a specificity of 0.9642. We have 10 cases of false positives.
So, if we were fully committed to reducing the number of times that our system removes a non-bot account, our threshold would be 0.9. That being said, this threshold is detrimental to our kappa and sensitivity. As a compromise, we would perhaps be comfortable with 0.7 as a threshold.
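A sketch of how a custom threshold might be applied to the kNN probabilities (again using the illustrative prob_bot vector; shown here for a 0.7 threshold):

```r
library(caret)

# Classify as a bot only when the estimated probability exceeds the threshold
threshold <- 0.7
thresh_pred <- factor(ifelse(prob_bot >= threshold, 1, 0), levels = c(0, 1))

confusionMatrix(data      = thresh_pred,
                reference = as.factor(bot_data_test$bot),
                positive  = "1",
                dnn       = c("Prediction", "Actual"))
```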
Next, we will run a random forest model on our Twitter bot data. The random forest algorithm builds many decision trees, each trained on a different sample of the data and considering a different random subset of variables at each split. The algorithm helps solve the over-fitting problem that can result from using a single decision tree. After optimizing the random forest model, we will compare our results with the results of the kNN analysis.
We will use the same train and test datasets that we used for the kNN analysis (80/20 train/test split). Next, we will determine the mtry value for the model. The general rule is to start with an mtry value equal to the square root of the number of predictors; with 9 predictors remaining after dropping listed_count, this is 3. Then, we will build an initial random forest on the training data with 1,000 trees.
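A sketch of this initial forest, assuming the randomForest package (the seed and exact arguments are assumptions):

```r
library(randomForest)

set.seed(2021)  # arbitrary seed, assumed for reproducibility
bot_rf <- randomForest(as.factor(bot) ~ .,
                       data = bot_data_train,
                       ntree = 1000,        # 1,000 trees
                       mtry = 3,            # ~ sqrt(number of predictors)
                       importance = TRUE)   # track variable importance

# Out-of-bag confusion matrix on the training data
bot_rf$confusion
```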
We can look at the confusion matrix to assess the accuracy of our model. Based on the confusion matrix, we calculate the accuracy of our model: 88.24%. The accuracy is very good, but we should look at other metrics to get a holistic view of the random forest.
## 0 1 class.error
## 0 1085 112 0.09356725
## 1 151 890 0.14505283
## [1] 0.8823903
We can use the importance function to see how much each predictor variable contributes to the accuracy of the classification. Based on our results, we observe that friends_count is the most important variable. This makes sense; we would expect bot accounts to have fewer friends than real accounts run by humans. Other important variables include followers_count and favourites_count.
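A sketch of this step, continuing with the assumed bot_rf object (scale = FALSE is an assumption so that the raw mean decrease in accuracy is reported):

```r
library(randomForest)

# Per-class and overall importance of each predictor
importance(bot_rf, scale = FALSE)

# Optional visual ranking of the predictors
varImpPlot(bot_rf)
```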
## 0 1 MeanDecreaseAccuracy
## followers_count 0.0653737407 0.092603632 0.078026532
## friends_count 0.1043678409 0.107315184 0.105738852
## favourites_count 0.0521137820 0.067339126 0.059190376
## verified 0.0393396940 0.068512014 0.052898364
## statuses_count 0.0323207141 0.018952389 0.026099298
## lang 0.0022207990 0.003259047 0.002702902
## default_profile 0.0118354083 0.018004253 0.014703744
## default_profile_image 0.0006353466 0.001545072 0.001057153
## has_extended_profile 0.0020698833 0.004409625 0.003159158
## MeanDecreaseGini
## followers_count 8.9369451
## friends_count 13.1321842
## favourites_count 8.1033390
## verified 4.1009469
## statuses_count 4.6877537
## lang 1.1749794
## default_profile 1.8698356
## default_profile_image 0.2982996
## has_extended_profile 0.8031797
Next, we will calculate the error rate for each individual tree. We will use these 1,000 error rates to visualize the overall performance of the random forest. The figure below visualizes how the error rate changes as we add more trees: the x-axis is the number of trees and the y-axis is the error rate. We include four different error rates in the plot: the error rate for bots, the error rate for non-bots, the OOB error rate, and the difference between the bot and non-bot error rates. The OOB (out of bag) error rate measures the prediction error of the random forest on the observations each tree did not see during training. We would like to minimize all four; however, we should be cautious when minimizing the difference between error rates, because a small difference does not imply small error rates.
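A sketch of how this plot might be assembled from the err.rate matrix that randomForest stores on the fitted object (data frame and column names are illustrative):

```r
library(ggplot2)
library(tidyr)

# err.rate has one row per tree: OOB error plus the per-class errors
err_df <- data.frame(
  trees   = 1:nrow(bot_rf$err.rate),
  oob     = bot_rf$err.rate[, "OOB"],
  not_bot = bot_rf$err.rate[, "0"],
  bot     = bot_rf$err.rate[, "1"]
)
err_df$diff <- abs(err_df$bot - err_df$not_bot)

err_long <- pivot_longer(err_df, cols = -trees,
                         names_to = "error_type", values_to = "error_rate")

ggplot(err_long, aes(x = trees, y = error_rate, color = error_type)) +
  geom_line()
```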
Now that we have a base random forest, we can tune the parameters to create an optimized model. In this situation, Twitter wants to correctly identify as many bots as possible so they can remove the accounts. Therefore, we would like to see the model correctly identify more bots (the positive class). However, Twitter does not want to incorrectly identify a real account as a bot and delete a human’s account, so we will focus on minimizing the false positive rate.
We would like to use the number of trees that minimizes both the bot error and the OOB error. Looking at the error rates, we will build another random forest with 205 trees using the same mtry value and training data.
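A sketch of the re-fit, under the same assumptions as the first forest (bot_rf_tuned is an illustrative name):

```r
library(randomForest)

set.seed(2021)  # same arbitrary seed as before
bot_rf_tuned <- randomForest(as.factor(bot) ~ .,
                             data = bot_data_train,
                             ntree = 205,    # chosen from the error-rate plot
                             mtry = 3,
                             importance = TRUE)

# Compare the out-of-bag confusion matrices of the two forests
bot_rf$confusion
bot_rf_tuned$confusion
```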
Below are the confusion matrices for the first and second random forest, respectively. The optimized model labels 5 fewer human accounts as bots, but it also misses 5 more bot accounts. Since we want to focus on minimizing the false positive rate, we accept this trade-off.
## 0 1 class.error
## 0 1085 112 0.09356725
## 1 151 890 0.14505283
## 0 1 class.error
## 0 1090 107 0.08939014
## 1 156 885 0.14985591
We will use the new model for prediction on the test data and evaluate the model’s performance.
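A sketch of this evaluation, assuming caret's confusionMatrix as before (bot_rf_tuned is the illustrative name for the 205-tree forest):

```r
library(caret)

rf_pred <- predict(bot_rf_tuned, newdata = bot_data_test)

confusionMatrix(data      = rf_pred,
                reference = as.factor(bot_data_test$bot),
                positive  = "1",
                dnn       = c("Prediction", "Actual"))
```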
Based on the confusion matrix below, our model does a good job at prediction. The accuracy is 88.55% and the Kappa value is 0.771. The sensitivity and specificity are 86.79% and 90.32%, respectively. Higher values are better for all of these metrics, so we are happy with these results.
##
## 0 1
## 0 252 27
## 1 37 243
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 252 37
## 1 27 243
##
## Accuracy : 0.8855
## 95% CI : (0.8562, 0.9107)
## No Information Rate : 0.5009
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.771
##
## Mcnemar's Test P-Value : 0.2606
##
## Sensitivity : 0.8679
## Specificity : 0.9032
## Pos Pred Value : 0.9000
## Neg Pred Value : 0.8720
## Precision : 0.9000
## Recall : 0.8679
## F1 : 0.8836
## Prevalence : 0.5009
## Detection Rate : 0.4347
## Detection Prevalence : 0.4830
## Balanced Accuracy : 0.8855
##
## 'Positive' Class : 1
##
The top 3 important variables are the same as in the previous random forest: friends_count, followers_count, and favourites_count.
## 0 1 MeanDecreaseAccuracy
## followers_count 0.0640830874 0.102831687 0.082107854
## friends_count 0.1063166304 0.113009728 0.109442244
## favourites_count 0.0487517279 0.064630726 0.056146324
## verified 0.0355946810 0.058656920 0.046311149
## statuses_count 0.0325051291 0.017288771 0.025433774
## lang 0.0025681730 0.003798786 0.003140661
## default_profile 0.0153996933 0.019787312 0.017444088
## default_profile_image 0.0005297281 0.002847887 0.001611573
## has_extended_profile 0.0018943634 0.003909205 0.002833305
## MeanDecreaseGini
## followers_count 8.9551361
## friends_count 13.8526406
## favourites_count 7.5984761
## verified 3.7290019
## statuses_count 4.6633603
## lang 1.2087036
## default_profile 2.0558135
## default_profile_image 0.3296123
## has_extended_profile 0.7918338
The error rate for using the optimized model for prediction is 11.45%. We want to minimize the errors, so we are pleased with the low error rate.
## [1] 11.44902
Finally, we will visualize the performance of our random forest by creating a ROC curve and calculating the area under the curve (AUC). As we can see in the plot below, the ROC curve bends close to the ideal top-left corner and the AUC is 0.95. This is very good and indicates that our model does a great job of distinguishing between the two classes.
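A sketch of this ROC/AUC calculation, again assuming ROCR; type = "prob" asks the random forest for class probabilities rather than hard labels:

```r
library(ROCR)

# Estimated probability that each test account is a bot
rf_prob <- predict(bot_rf_tuned, newdata = bot_data_test, type = "prob")[, "1"]

rf_pred_obj <- prediction(predictions = rf_prob,
                          labels = bot_data_test$bot)

plot(performance(rf_pred_obj, measure = "tpr", x.measure = "fpr"),
     colorize = TRUE)
abline(a = 0, b = 1, lty = 2)

performance(rf_pred_obj, measure = "auc")@y.values[[1]]
```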
We performed a kNN analysis and ran a random forest on our data to try to find the best model for identifying Twitter bots. Both models were largely successful when used for prediction. The evaluation metrics for the kNN analysis with k = 11 and threshold of 0.7 are:
Accuracy: 83.36%
Kappa: 0.667
Sensitivity: 74.29%
Specificity: 92.47%
All of these metrics indicate the model is doing a good job at prediction. We see very similar results when comparing with the random forest model. The evaluation metrics for the random forest model are:
Accuracy: 88.55%
Kappa: 0.771
Sensitivity: 86.79%
Specificity: 90.32%
Both models are comparable, but the tree-based model finds a more balanced optimization of metrics whereas the KNN model increases specificity the most at the expense of other metrics.
The purpose of this project was to see if we could create a model that would accurately predict whether a Twitter account was a bot account. We were able to create predictive models with kNN and random forest methods that raised the accuracy well beyond the base rate. We were also able to optimize these models, improving specific evaluation metrics. For kNN, we chose to prioritize high specificity since we believe it is most important to avoid false positives (i.e., so that we do not shut down someone’s real account). For the random forest, we also focused on decreasing false positives, but in this case we did so by finding the number of trees that minimized error (205 trees rather than the initial 1,000). We were happy with the resulting models, given that their predictive strength is far higher than random guessing, which indicates that the models are useful in achieving our goal. There is no clear winner between the kNN and random forest models, as both produce very similar evaluation metrics.
The models are limited by the scope of the data set that we used for training and testing. Our data set has not been updated in two years, which means it is probably out of date. So, applying our models to predict current Twitter bots would not be advised, especially given how quickly bots adapt. Additionally, we had fewer than 3,000 data points, which is fine for our purposes but not for large-scale use of the model (e.g., internationally). Our models should be used with these limitations in mind.
As mentioned above, these models should be retrained and tested on updated data sets with more depth and breadth of data. Still, our results are a good indication that these types of models can detect bot accounts. As we have seen in the news, being able to detect bots can help stop the spread of misinformation, which implies that these kinds of models can be really important for real-life applications, e.g., around political elections. This can extend past Twitter as well: if we found data sets that pair account information with bot labels for Instagram, Facebook, Snapchat, or other social media platforms, then we could train models for all of the major mass communication channels. We could also try different types of modeling beyond kNN and random forest, such as other tree-based learning algorithms, but this is out of scope for this project.