Group Members: Desyne Martinez & Daniel Perez Ulloa
Professor: Dr. Amir Karami
Course: Programming in R DS 7130
Project Overview
Dataset: The “Dead Internet” Theory: Reddit Bot vs. Human
Research Question: As large language models have rapidly advanced, automated bots have become increasingly common across the internet. Because online platforms are now central to daily life and information‑seeking, the rise of AI‑generated content has made it harder to distinguish genuine human posts from automated ones. Reddit, a major hub for discussion, advice, news, and entertainment, is no exception. The purpose of this study is to examine which factors predict whether a Reddit post is created by a human or a bot. Using logistic regression, we evaluate how variables such as sentiment, account characteristics, and behavioral patterns contribute to the likelihood that a post originates from an automated account.
Introduction
Generating the Dataset
# Load the libraries needed for the analysis.
library(psych)
library(readxl)
library(ggplot2)
library(MASS)
library(rstatix)
library(GGally)
library(tidyverse)
library(tidymodels)
library(moderndive)
library("scatterplot3d")
library(rgl)
library(lm.beta)
library(olsrr)
library(car)
library(report)
library(ppcor)
library(statar)
library(rsample)
library(ROSE)
library(caret)
library(dplyr)
library(pROC)
library(predtools)
library(PredictABEL)
library(aod)
library(glmtoolbox)
library(lmtest)
library(DescTools)
library(readr)
library(DT)
library(plotly)

# Set the working directory to the folder that holds the dataset.
# Here the file sits in Downloads; change the path to wherever you saved your copy.
setwd("C:/Users/Owner/Downloads/")
# "C:\Users\Owner\Downloads\reddit_dead_internet_analysis_2026.xlsx"

# Read in the dataset with read_xlsx(). Because the working directory is already set,
# only the exact file name is needed.
reddit <- read_xlsx("reddit_dead_internet_analysis_2026.xlsx")

# Inspect structure of the dataset
glimpse(reddit)
The dataset includes 500 observations of Reddit comment metadata across 11 variables:
comment_id
subreddit
account_age_days
user_karma
reply_delay_seconds
sentiment_score
avg_word_length
contains_links
is_bot_flag
bot_type_label
bot_probability
We wanted to focus on the primary factors that contribute to detecting bot accounts on Reddit while keeping our model parsimonious, so our primary variables of interest were as follows:
Dependent variable: is_bot_flag
Primary exposure variable: Sentiment score
Covariates: User Karma, Account age, and Avg word length.
Extra exploratory variable: Subreddit
is_bot_flag is a categorical variable that distinguishes between bot and human accounts. Sentiment score is a continuous variable that measures the emotional sentiment of a post on a scale from -1 (negative) to 1 (positive). User karma is a continuous variable recording the engagement points a user has earned on their Reddit account. Account age is the age of the account in days, and average word length is the mean length of the words the account uses in its posts.
To make the variables easier to read and work with, we recoded our DV is_bot_flag into a factor named botdetector with the levels HUMAN and BOT, and removed the underscores from the covariate names. We also created a new variable, age_category, from account age, which labels accounts as “New Account” (under 30 days), “Established” (30 to 364 days), or “Veteran” (365 days or more).
# Recode variables to make them easier to work with
reddit <- reddit %>%
  dplyr::select(is_bot_flag, user_karma, sentiment_score, account_age_days, avg_word_length, subreddit) %>%
  rename(
    Sentimentscore = sentiment_score,
    Userkarma = user_karma,
    Accountage = account_age_days,
    Avgwordlength = avg_word_length
  ) %>%
  mutate(
    botdetector = factor(is_bot_flag), # Let R detect the levels first
    botdetector = fct_recode(botdetector, "HUMAN" = "FALSE", "BOT" = "TRUE"),
    age_category = case_when(
      Accountage < 30 ~ "New Account",
      Accountage >= 30 & Accountage < 365 ~ "Established",
      Accountage >= 365 ~ "Veteran"
    )
  )
ls(reddit)
# INTERACTIVE TABLE (Requirement: At least one)
datatable(
  head(reddit, 100),
  options = list(pageLength = 5),
  caption = 'Table 1: Interactive view of Reddit Bot Analysis Data'
)
GROUPED SUMMARIES
We used grouped summaries to explore the data by the categorical variable subreddit, comparing average user karma, sentiment score, average word length, and account age across topics. The goal was to see how accounts in each topic (funny, technology, worldnews, politics, gaming, and pics) differ on these continuous factors, which could help in seeking out bot accounts.
# Top 5 subreddits with highest average user karma
reddit %>%
  dplyr::group_by(subreddit) %>%
  dplyr::summarize(mean_karma = mean(Userkarma, na.rm = TRUE)) %>%
  dplyr::arrange(desc(mean_karma)) %>%
  head(5)

# Top 5 subreddits with highest average word length
reddit %>%
  dplyr::group_by(subreddit) %>%
  dplyr::summarize(mean_word = mean(Avgwordlength, na.rm = TRUE)) %>%
  dplyr::arrange(desc(mean_word)) %>%
  head(5)
# ---- Filter before ggplot (global filter) ----
library(patchwork)
library(gridExtra)

# Bots within the "funny" subreddit only
p1 <- reddit |>
  filter(subreddit == "funny", botdetector == "BOT") |>
  ggplot(aes(x = Userkarma, y = Accountage)) +
  geom_smooth() +
  geom_point() +
  labs(title = "Subreddit:Funny")
p1

# Bots within the "worldnews" subreddit only
p2 <- reddit |>
  filter(subreddit == "worldnews", botdetector == "BOT") |>
  ggplot(aes(x = Userkarma, y = Accountage)) +
  geom_smooth() +
  geom_point() +
  labs(title = "Subreddit:Worldnews")

# Bots within the "technology" subreddit only
p3 <- reddit |>
  filter(subreddit == "technology", botdetector == "BOT") |>
  ggplot(aes(x = Userkarma, y = Accountage)) +
  geom_smooth() +
  geom_point() +
  labs(title = "Subreddit:Technology")
p3

# Humans within the "technology" subreddit only
reddit |>
  filter(subreddit == "technology", botdetector == "HUMAN") |>
  ggplot(aes(x = Userkarma, y = Accountage)) +
  geom_smooth() +
  geom_point() +
  labs(title = "Subreddit:Technology")

# Bots within the "pics" subreddit only
p4 <- reddit |>
  filter(subreddit == "pics", botdetector == "BOT") |>
  ggplot(aes(x = Userkarma, y = Accountage)) +
  geom_smooth() +
  geom_point() +
  labs(title = "Subreddit:Pics")

# Bots within the "gaming" subreddit only
p5 <- reddit |>
  filter(subreddit == "gaming", botdetector == "BOT") |>
  ggplot(aes(x = Userkarma, y = Accountage)) +
  geom_smooth() +
  geom_point() +
  labs(title = "Subreddit:Gaming")

# Bots within the "politics" subreddit only
p6 <- reddit |>
  filter(subreddit == "politics", botdetector == "BOT") |>
  ggplot(aes(x = Userkarma, y = Accountage)) +
  geom_smooth() +
  geom_point() +
  labs(title = "Subreddit:Politics")

gridExtra::grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 2)
The plots do not show any linear relationship between user karma and account age for bot accounts within any of the subreddit topics. The visual is still useful as evidence that no such relationship exists for these topics.
Interactive Figure 1: Account age by Userkarma Scatterplot
# INTERACTIVE FIGURES (Requirement: At least two)
# Figure 1: Interactive Scatter
int_plot1 <- ggplot(reddit, aes(x = Accountage, y = Userkarma, color = botdetector)) +
  geom_point()
ggplotly(int_plot1)
An interactive scatterplot was created to show the relationship between user karma and account age, colored by whether the account is run by a bot or a human. The blue strip toward the left of the plot (low account age) is where most of the bot accounts sit. This suggests that newly created bot accounts tend to have high user karma, possibly to increase their engagement and the traffic to their account.
Statistical transformations: bar chart counts
bar1 <- reddit %>%
  count(botdetector) %>%
  ggplot(aes(x = botdetector, y = n, fill = botdetector)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = n), vjust = -0.5, size = 5, fontface = "bold") +
  scale_fill_manual(values = c(
    "HUMAN" = "#FDB813", # yellow-orange
    "BOT" = "#E66100"    # deep orange
  )) +
  labs(
    x = "Botdetector",
    y = "Count",
    title = "Counts of Human vs Bot Accounts"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold")
  )

# Bot proportions by age category (Bar)
bar2 <- ggplot(reddit, aes(x = age_category, fill = botdetector)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c(
    "HUMAN" = "#FDB813",
    "BOT" = "#E66100"
  )) +
  labs(
    title = "Bot Proportions by Age Category",
    y = "Proportion",
    fill = "Bot Status"
  ) +
  theme_minimal(base_size = 14) +
  theme(plot.title = element_text(face = "bold"))

bar1
bar2
To see how many suspected human and bot accounts are in the dataset, we created a bar chart, which shows 282 human accounts and 218 bot accounts. We also used the new age category variable to look at the proportions of human versus bot accounts by how long the account has been open. Because each bar is filled within its age category, the chart reads as the share of bots among accounts of that age, and the “New Account” category is 100% bots!
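The difference between "share of new accounts that are bots" and "share of bots that are new accounts" is easy to mix up when reading a filled bar chart. The sketch below uses a small made-up data frame (`toy` is hypothetical, not the Reddit data) to show how the two proportions come from different margins of the same table.

```r
# Hypothetical toy data, not the Reddit dataset
toy <- data.frame(
  age_category = c("New Account", "New Account", "New Account",
                   "Established", "Established",
                   "Veteran", "Veteran", "Veteran"),
  botdetector  = c("BOT", "BOT", "BOT",
                   "BOT", "HUMAN",
                   "HUMAN", "HUMAN", "BOT")
)

tab <- table(toy$age_category, toy$botdetector)

# Row proportions: share of bots and humans within each age category.
# This is what a bar chart with position = "fill" and age_category on the x-axis shows.
within_age <- prop.table(tab, margin = 1)

# Column proportions: how bots (and humans) are spread across age categories.
within_type <- prop.table(tab, margin = 2)

print(round(within_age, 2))
print(round(within_type, 2))
```

In this toy example every new account is a bot, yet only 60% of bots are new accounts, so the two statements are not interchangeable.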
reddit |>
  count(subreddit, botdetector) |>
  group_by(subreddit) |>
  mutate(pct = n / sum(n)) |>
  ggplot(aes(x = subreddit, y = pct, fill = botdetector)) +
  geom_col(position = "fill") +
  geom_text(
    aes(label = scales::percent(pct, accuracy = 0.1)),
    position = position_fill(vjust = 0.5),
    color = "white",
    fontface = "bold",
    size = 4
  ) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = c(
    "HUMAN" = "#FDB813", # yellow-orange
    "BOT" = "#E66100"    # deep orange
  )) +
  labs(
    x = "Subreddit",
    y = "Percentage",
    fill = "Account Type",
    title = "Percentage of HUMAN vs BOT Accounts by Subreddit"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
# More "Bot" accounts tend to be found in the gaming and worldnews categories,
# while fewer of them are found in the funny category.
# We could filter the dataset to include the data points that are associated
# with the subreddit levels worldnews and gaming?
Using the subreddit variable once more, we inspected the proportion of bots to humans within each Reddit category. Bot accounts make up a larger share of the gaming and worldnews categories and a smaller share of the funny category.
# Explore is_bot_flag by user karma
library(viridis)
reddit %>%
  ggplot(aes(x = botdetector, y = Userkarma, fill = botdetector)) +
  geom_boxplot() +
  scale_fill_manual(values = c(
    "HUMAN" = "#FDB813", # warm yellow
    "BOT" = "#E66100"    # deep orange
  )) +
  labs(
    x = "Bot (Y/N)",
    y = "User Karma",
    title = "Side-by-Side User Karma"
  ) +
  theme_minimal(base_size = 14)

# Explore is_bot_flag by sentiment score
reddit %>%
  ggplot(aes(x = botdetector, y = Sentimentscore, fill = botdetector)) +
  geom_boxplot() +
  scale_fill_manual(values = c(
    "HUMAN" = "#FDB813", # warm yellow
    "BOT" = "#E66100"    # deep orange
  )) +
  xlab("Bot (Y/N)") +
  ylab("Sentiment Score") +
  ggtitle("Side by side Sentiment Score")
Although the boxplots do not show a clear difference between the two groups, this is still quite alarming given the context of AI bots running accounts on Reddit. It could indicate that AI-run accounts are able to replicate the emotional sentiment that real human account owners express, which would make it much harder to tell whether an account is run by a bot or a human.
Interactive Figure 2: Side by Side Sentiment score Boxplot
To examine how closely sentiment scores match between human and bot accounts, we developed an interactive boxplot that displays the summary statistics of sentiment for each group.
# Explore is_bot_flag by account age
reddit %>%
  ggplot(aes(x = botdetector, y = Accountage, fill = botdetector)) +
  geom_boxplot() +
  scale_fill_manual(values = c(
    "HUMAN" = "#FDB813", # warm yellow
    "BOT" = "#E66100"    # deep orange
  )) +
  xlab("Bot (Y/N)") +
  ylab("Account age (days)") +
  ggtitle("Side by side Account age in Days comparison")

# Explore is_bot_flag by average word length
reddit %>%
  ggplot(aes(x = botdetector, y = Avgwordlength, fill = botdetector)) +
  geom_boxplot() +
  scale_fill_manual(values = c(
    "HUMAN" = "#FDB813", # warm yellow
    "BOT" = "#E66100"    # deep orange
  )) +
  xlab("Bot (Y/N)") +
  ylab("Average word length") +
  ggtitle("Side by side Average word length comparison")
These boxplots do suggest a difference between groups in both account age and average word length. Bot accounts appear to be younger, whereas suspected human accounts are older, which is plausible given that bot accounts became common years after Reddit was launched. There is also a clear difference in average word length, with bot accounts tending to use longer words than human account holders.
Methods
Data Modeling
Clustering
# Load clustering package and sample dataset package
library(cluster)
library(GDAdata)
colnames(reddit)
# Select columns 2 to 5: Userkarma, Sentimentscore, Accountage, and Avgwordlength
reddit_data <- reddit[, c(2:5)]
# Standardize the data so variables with larger scales do not dominate
reddit_data_scale <- scale(reddit_data)
# Load package for estimating the number of clusters
library(NbClust)
# Estimate the best number of clusters from 2 to 10 using k-means criteria
number_cluster_estimate <- NbClust(
  reddit_data_scale,
  distance = "euclidean",
  min.nc = 2,
  max.nc = 10,
  method = "kmeans"
)
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 6 proposed 2 as the best number of clusters
* 6 proposed 3 as the best number of clusters
* 1 proposed 4 as the best number of clusters
* 1 proposed 5 as the best number of clusters
* 1 proposed 6 as the best number of clusters
* 4 proposed 7 as the best number of clusters
* 4 proposed 10 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 2
*******************************************************************
# Show the voting results for the best number of clusters
number_cluster_estimate$Best.nc
# Set seed for reproducibility
set.seed(123)
# Run PAM clustering with 3 clusters
# Note: the slides call this k-means, but this function is PAM
kmeans_reddit_data_scale_cluster <- pam(reddit_data_scale, k = 3)
# Show medoids for the clusters
kmeans_reddit_data_scale_cluster$medoids
# Plot the clustering result in two reduced dimensions
plot(kmeans_reddit_data_scale_cluster)

# Add the assigned cluster to the original reddit data
reddit_data_cluster <- reddit_data %>%
  mutate(cluster = kmeans_reddit_data_scale_cluster$clustering)
# Show the dataset with assigned clusters
reddit_data_cluster

# Compute the average of each variable by cluster
reddit_cluster_summary <- reddit_data_cluster %>%
  group_by(cluster) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))
# Show cluster summaries to compare
reddit_cluster_summary
Reducing the number of clusters from k = 5 to k = 3 improved the clustering quality slightly, increasing the average silhouette width from 0.19 to 0.20 and producing more balanced cluster sizes. However, the silhouette width remains low overall, indicating weak separation between clusters.
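The comparison of average silhouette widths across values of k can be sketched as follows. This is a minimal illustration on simulated data (`toy` is hypothetical and only stands in for the scaled Reddit variables); it relies on the `silinfo$avg.width` element that `pam()` from the cluster package returns.

```r
library(cluster)  # pam() and its silhouette information

set.seed(123)
# Hypothetical toy data: two overlapping groups measured on four variables,
# standardized in the same way as the Reddit variables
toy <- scale(rbind(
  matrix(rnorm(100 * 4, mean = 0), ncol = 4),
  matrix(rnorm(100 * 4, mean = 1), ncol = 4)
))

# Average silhouette width for each candidate number of clusters
candidate_k <- 2:6
avg_sil <- sapply(candidate_k, function(k) pam(toy, k = k)$silinfo$avg.width)
names(avg_sil) <- candidate_k
print(round(avg_sil, 3))

# The k with the largest average silhouette width
best_k <- candidate_k[which.max(avg_sil)]
print(best_k)
```

Values close to 1 indicate well-separated clusters, while values near 0 (such as the 0.19 to 0.20 reported above) indicate that many points sit between clusters.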
Although the clusters are only weakly separated, one cluster captures a majority of the bot accounts! Its profile on account age and average word length suggests that bots tend to have recently created accounts and to over-explain topics with longer words. They also tend to have a more negative tone and higher user karma, which may serve to increase their engagement value.
Logistic Regression Model
########################################
# Run logistic regression
logit1 <- glm(botdetector ~ Sentimentscore + Userkarma + Accountage + Avgwordlength,
              data = reddit, family = "binomial")
# Finally, estimate a reduced model with only Sentiment Score
logit2 <- glm(botdetector ~ Sentimentscore, data = reddit, family = "binomial")
#########
summary(logit1)
Call:
glm(formula = botdetector ~ Sentimentscore + Userkarma + Accountage +
Avgwordlength, family = "binomial", data = reddit)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.696e+01 4.659e+00 -7.934 2.12e-15 ***
Sentimentscore 4.890e-01 3.068e-01 1.593 0.1110
Userkarma -6.411e-06 1.229e-05 -0.522 0.6018
Accountage -7.084e-04 1.905e-04 -3.719 0.0002 ***
Avgwordlength 6.544e+00 8.141e-01 8.039 9.05e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 684.93 on 499 degrees of freedom
Residual deviance: 208.32 on 495 degrees of freedom
AIC: 218.32
Number of Fisher Scoring iterations: 8
# Use the template below to help with interpretations
# For every one-unit (one-day) increase in Accountage, the log odds of an account being a bot (versus a human) decrease by 0.0007.
# For every one-unit increase in Avgwordlength, the log odds of an account being a bot (versus a human) increase by 6.544.
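Our logistic regression model shows that average word length is by far the strongest predictor of bot accounts: each additional unit of average word length raises the log odds of being a bot by 6.54. Account age has a small but significant negative effect, meaning older accounts are slightly less likely to be bots. Sentiment score (p = 0.11) and user karma (p = 0.60) are not statistically significant. The drop from a null deviance of 684.93 to a residual deviance of 208.32 indicates that the predictors improve substantially on the intercept-only model. The AIC (218.32) is only meaningful relative to competing models, so on its own it does not establish how good the fit is.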
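The size of the deviance drop can be put on a formal footing with a likelihood ratio test and a pseudo R-squared. The sketch below works only from the numbers printed in the summary(logit1) output; the variable names are illustrative, and the deviance-based McFadden formula assumes ungrouped 0/1 outcomes, as is the case here.

```r
# Values copied from the summary(logit1) output
null_dev  <- 684.93
null_df   <- 499
resid_dev <- 208.32
resid_df  <- 495

# Likelihood ratio test of the full model against the intercept-only model
lr_stat <- null_dev - resid_dev   # chi-square statistic
lr_df   <- null_df - resid_df     # number of predictors added
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo R-squared (for binary data the deviance equals -2 * log-likelihood)
mcfadden_r2 <- 1 - resid_dev / null_dev

# Consistency check: AIC = residual deviance + 2 * number of coefficients (5)
aic_check <- resid_dev + 2 * 5

print(c(lr_stat = lr_stat, lr_df = lr_df, mcfadden_r2 = round(mcfadden_r2, 3)))
print(p_value)
print(aic_check)
```

With the fitted objects available, `anova(logit1, test = "Chisq")` or `lmtest::lrtest()` would give the same kind of comparison directly.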
Logistic Metrics
###################################
# Sensitivity and specificity
# Create confusion matrix (classification table)
threshold <- 0.5
predicted_values <- ifelse(predict(logit1, type = "response") > threshold, 1, 0)
actual_values <- logit1$y
conf_matrix <- table(predicted_values, actual_values)
conf_matrix
# High sensitivity: the model is very good at correctly identifying positive cases.
# Note that caret treats the first table level (0 = HUMAN) as the positive class by default.
# Estimate specificity
specificity(conf_matrix)
[1] 0.8715596
# With caret's defaults the negative class here is 1 (BOT), so this value is the share of
# actual bot accounts that the model flags as bots (about 87%), which is not a low rate.
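With bot as the class we want to detect, the model correctly flags about 87% of bot accounts (190 of the 218 bots) and correctly classifies about 91% of human accounts, so roughly 9% of humans are mistaken for bots. Because caret's sensitivity() and specificity() treat the first table level (0, human) as the positive class by default, the 91% is reported under "sensitivity" and the 87% under "specificity"; from a bot-detection point of view the labels are the other way around.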
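Which class counts as "positive" changes what sensitivity and specificity mean, so it helps to compute them by hand at least once. The sketch below uses made-up labels and predictions (`actual` and `predicted` are hypothetical, not model output) and treats BOT (1) as the positive class explicitly.

```r
# Hypothetical labels and predictions: 1 = BOT, 0 = HUMAN
actual    <- c(1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
predicted <- c(1, 1, 1, 0, 0, 0, 0, 0, 1, 0)

# Rows are predictions and columns are the true labels, as in conf_matrix
cm <- table(predicted = predicted, actual = actual)

# Cell counts with BOT ("1") as the positive class
tp <- cm["1", "1"]  # bots flagged as bots
fn <- cm["0", "1"]  # bots missed
tn <- cm["0", "0"]  # humans left alone
fp <- cm["1", "0"]  # humans mistaken for bots

sens_bot <- tp / (tp + fn)  # share of bots detected
spec_bot <- tn / (tn + fp)  # share of humans correctly cleared

print(cm)
print(c(sensitivity = sens_bot, specificity = spec_bot))
```

With caret, the same convention can be requested by passing `positive = "1"` to `sensitivity()` and `negative = "0"` to `specificity()` rather than relying on the defaults.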
###################################
# ROC curve and AUC
predicted_prob <- predict(logit1, type = "response")
roc_curve <- roc(logit1$y, predicted_prob)
plot(roc_curve)
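# Visual
library(pROC)
library(ggplot2)

# Roc1 is not defined in the code shown; it is assumed here to be the matching
# ggplot ROC curve for the full model (logit1), built the same way as Roc2.
roc_curve1 <- roc(logit1$y, predict(logit1, type = "response"))
roc_df1 <- data.frame(
  tpr = roc_curve1$sensitivities,
  fpr = 1 - roc_curve1$specificities
)
Roc1 <- ggplot(roc_df1, aes(x = fpr, y = tpr)) +
  geom_line(color = "#D55E00", linewidth = 1.3) +
  geom_abline(linetype = "dashed", color = "gray40") +
  theme_minimal() +
  labs(
    title = paste("ROC Curve for logit1 (AUC =", round(auc(roc_curve1), 3), ")"),
    x = "False Positive Rate",
    y = "True Positive Rate"
  )

# Predicted probabilities
predicted_prob2 <- predict(logit2, type = "response")
# ROC object
roc_curve2 <- roc(logit2$y, predicted_prob2)
# Convert ROC data to a data frame
roc_df2 <- data.frame(
  tpr = roc_curve2$sensitivities,
  fpr = 1 - roc_curve2$specificities
)
# ggplot ROC curve
Roc2 <- ggplot(roc_df2, aes(x = fpr, y = tpr)) +
  geom_line(color = "#D55E00", linewidth = 1.3) +
  geom_abline(linetype = "dashed", color = "gray40") +
  theme_minimal() +
  labs(
    title = paste("ROC Curve for logit2 (AUC =", round(auc(roc_curve2), 3), ")"),
    x = "False Positive Rate",
    y = "True Positive Rate"
  )

gridExtra::grid.arrange(Roc1, Roc2, nrow = 2)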
The full model (AUC = 0.9707) discriminates far better than the reduced model (AUC = 0.5336), which suggests that the additional variables account for nearly all of the predictive power. An AUC of 0.97 means that a randomly chosen bot account receives a higher predicted probability than a randomly chosen human account about 97% of the time. With sentiment score alone that figure is about 53%, barely better than chance.
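The pairwise reading of AUC can be checked directly in base R. The sketch below uses simulated scores (`score_bot` and `score_human` are hypothetical stand-ins for predicted probabilities) and computes the AUC two ways: by comparing every bot with every human, and from the Mann-Whitney statistic.

```r
set.seed(42)
# Hypothetical model scores: bots tend to score higher than humans
score_bot   <- rnorm(50, mean = 1)
score_human <- rnorm(60, mean = 0)

# Compare every bot score with every human score; ties count as one half
cmp <- outer(score_bot, score_human,
             FUN = function(b, h) (b > h) + 0.5 * (b == h))
auc_pairs <- mean(cmp)

# The same quantity from the Mann-Whitney U statistic
u_stat   <- wilcox.test(score_bot, score_human)$statistic
auc_rank <- as.numeric(u_stat) / (length(score_bot) * length(score_human))

print(c(auc_pairs = auc_pairs, auc_rank = auc_rank))
```

Both numbers agree, which is why AUC is best described as a ranking probability rather than as the share of accounts classified correctly.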
# Discussing things in "log odds" isn't very helpful, so let's exponentiate all coefficients to get odds ratios.
# With the exponentiated CIs we are interested in 1 as the reference point.
exp(cbind(OR = coef(logit1), confint(logit1)))
Overall Results
Avgwordlength is the only strong, meaningful predictor: accounts with longer average word length have dramatically higher odds of being a bot. For each 1-unit increase in average word length, the odds increase by a factor of about 695. The 95% CI is wide (165 to 4062), meaning the effect is large but imprecisely estimated.
Accountage has a small but statistically significant negative effect: older accounts have slightly lower odds of being a bot. The coefficient is per day, so it compounds over time; one additional year corresponds to an odds ratio of about exp(-0.00071 × 365) ≈ 0.77.
The model suggests that linguistic features matter far more than user metadata.
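The odds ratios quoted above can be recovered from the printed coefficient table. The sketch below copies the estimates and standard errors from the summary(logit1) output and builds Wald intervals, which is an assumption made for illustration: `confint()` on a glm uses profile likelihood, so its limits (such as 165 to 4062 for average word length) will differ somewhat from the Wald limits computed here.

```r
# Estimates and standard errors copied from summary(logit1)
est <- c(Sentimentscore = 4.890e-01, Userkarma = -6.411e-06,
         Accountage = -7.084e-04, Avgwordlength = 6.544e+00)
se  <- c(Sentimentscore = 3.068e-01, Userkarma = 1.229e-05,
         Accountage = 1.905e-04, Avgwordlength = 8.141e-01)

# Odds ratios with approximate 95% Wald intervals (not the profile-likelihood CIs)
z <- qnorm(0.975)
or_table <- cbind(
  OR    = exp(est),
  lower = exp(est - z * se),
  upper = exp(est + z * se)
)
print(signif(or_table, 4))

# Rescale the account-age effect from one day to one year (365 days)
or_age_year <- exp(est["Accountage"] * 365)
print(round(unname(or_age_year), 3))
```

Rescaling a per-day odds ratio to a longer interval is often the easier way to communicate an effect that looks negligible at the unit the variable happens to be stored in.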
Conclusion
The evidence from this analysis suggests that many bot accounts can be detected from the age of the account and the length of the words used in its posts. Although sentiment score was not a significant predictor, sentiment is closely matched between human and bot accounts, which may be alarming as AI continues to improve. The model produced in this analysis can serve as a foundation for detecting bot account activity, and future studies should explore the emotional intelligence of AI to better locate the line between humans and bots online.