DND Team Presentation code Quarto

Team DND

Group Members: Desyne Martinez & Daniel Perez Ulloa

Professor: Dr. Amir Karami

Course: Programming in R DS 7130

Project Overview

Dataset: The “Dead Internet” Theory: Reddit Bot vs. Human

Research Question: As large language models have rapidly advanced, automated bots have become increasingly common across the internet. Because online platforms are now central to daily life and information‑seeking, the rise of AI‑generated content has made it harder to distinguish genuine human posts from automated ones. Reddit, a major hub for discussion, advice, news, and entertainment, is no exception. The purpose of this study is to examine which factors predict whether a Reddit post is created by a human or a bot. Using logistic regression, we evaluate how variables such as sentiment, account characteristics, and behavioral patterns contribute to the likelihood that a post originates from an automated account.

Introduction

Generating the Dataset

#We started by reading in all the neccessary libraries we would need for our project to ensure we have every useful tool to proceed with an efficient analysis.
library(psych)
library(readxl)
library(ggplot2)
library(MASS)
library(rstatix)
library(GGally)
library(tidyverse)
library(tidymodels)
library(moderndive)
library("scatterplot3d")
library(rgl)
library(lm.beta)
library(olsrr)
library(car)
library(report)
library(ppcor)
library(statar)
library(rsample)
library(car)
library(ROSE)
library(caret)
library(dplyr)
library(pROC)
library(rsample)
library(predtools)
library(caret)
library(PredictABEL)
library(aod)
library(glmtoolbox)
library(lmtest)
library(DescTools)
library(readr)
library(DT)
library(plotly)

#setwd("C:/Users/Owner/Downloads/")
setwd("/home/dplant/Documents/Rwork")
# the setwd statement here helps us set up our work directory so it is essentially an apartment for the files we want pull out and work with.
# for me I put the datasets in my downloads on my computer hence thats why my work directory is in the downloads space. So when setting up your work directory make sure you are 
# setting it up with the file path where your datasets are staying at.
# "C:\Users\Owner\Downloads\reddit_dead_internet_analysis_2026.xlsx"

#reddit<- read_xlsx("reddit_dead_internet_analysis_2026.xlsx")
reddit <- read_csv("reddit_dead_internet_analysis_2026.csv")
#here we are reading in the dataset using the read.csv statement. Since we already created our work directory all we need to do is read in the dataset using its exact file path name 
# as it is stated in the place where you saved it.

# Inspect structure of the dataset
glimpse(reddit)

Rows: 500
Columns: 11
$ comment_id          <chr> "t1_3148938807", "t1_2516748923", "t1_833922318", …
$ subreddit           <chr> "funny", "gaming", "politics", "funny", "gaming", …
$ account_age_days    <dbl> 2264, 1654, 2442, 168, 801, 1333, 2582, 380, 2916,…
$ user_karma          <dbl> 34192, 2812, 37109, 32997, 25088, 17813, 46483, 67…
$ reply_delay_seconds <dbl> 1854, 1033, 2703, 2811, 1721, 3237, 1084, 622, 348…
$ sentiment_score     <dbl> -0.60, 0.15, 0.35, -0.74, -0.96, 0.93, 0.98, -0.42…
$ avg_word_length     <dbl> 4.62, 5.84, 5.73, 4.58, 5.78, 4.08, 5.33, 4.72, 4.…
$ contains_links      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ is_bot_flag         <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ bot_type_label      <chr> "None (Human)", "None (Human)", "None (Human)", "N…
$ bot_probability     <dbl> 0.093, 0.117, 0.115, 0.094, 0.116, 0.082, 0.107, 0…

Exploring the Data!

Variables of interest

The Dataset included 500 observations of Reddit comment metadata with 11 variables including:

comment_id
subreddit
account_age_days
user_karma
reply_delay_seconds
sentiment_score
avg_word_length
contains_links
is_bot_flag
bot_type_label
bot_probability

We wanted to put a focus on the primary factors that would contribute to detecting bot accounts on reddit while also keeping our model parsimonious, so our primary variables of interest were as follows
Dependent variable: is_bot_flag
Primary exposure variable: Sentiment score
Covariates: User Karma, Account age, and Avg word length.

Extra exploratory variable: Subreddit

Is bot flag is a categorical variable that distinguishes between bot and human accounts. Sentiment score is a continuous variable that detects the emotional sentiment of a post on a scale of -1 to 1 from negative to positive respectively. User karma is a continuous variable detailing the karmic engagement points a user would gain with a reddit account. Account age represents the longevity of the account in days and Average word length specifies the length of the words the user of the account uses primarily on posts.

To improve the comprehensibility of the variables and make them easier to work with we recoded our DV is_bot_flag to botdetector and renamed the levels as well as removing the underscores from the covariates! We also added another variable age category created from account age to categorize the time an account was created to distinguish it by “New accounts”, “Established” , and “Veteran” within each threshold.

#Recoding variables to make them easier to work with
reddit <- reddit %>%
  dplyr::select(is_bot_flag, user_karma, sentiment_score, account_age_days, avg_word_length, subreddit) %>%
  rename(
    Sentimentscore = sentiment_score,
    Userkarma = user_karma,
    Accountage = account_age_days,
    Avgwordlength = avg_word_length
  ) %>%
  mutate(
    botdetector = factor(is_bot_flag), # Let R detect the levels first
    botdetector = fct_recode(botdetector, 
                             "HUMAN" = "FALSE", 
                             "BOT" = "TRUE"),
    age_category = case_when(
      Accountage < 30 ~ "New Account",
      Accountage >= 30 & Accountage < 365 ~ "Established",
      Accountage >= 365 ~ "Veteran"
    )
  )

ls(reddit)

[1] "Accountage"     "age_category"   "Avgwordlength"  "botdetector"   
[5] "is_bot_flag"    "Sentimentscore" "subreddit"      "Userkarma"

Interactive Table

# INTERACTIVE TABLE (Requirement: At least one)
datatable(head(reddit, 100), options = list(pageLength = 5), 
          caption = 'Table 1: Interactive view of Reddit Bot Analysis Data')

GROUPED SUMMARIES

Using group summaries to explore the data looking at the categorical variable subreddit to evaluate the highest average user karma points, Sentiment Score, Average word length, and account age between each topic. we wanted to explore this to see which topics (Funny, Technology, world news, politics, gaming, and Pics) a majority of accounts on reddit would be found based on these continuous factors. This would aid in seeking out bot accounts.

# Top 5 subreddits with highest average user karma
reddit %>%
  dplyr::group_by(subreddit) %>%
  dplyr::summarize(mean_karma = mean(Userkarma, na.rm = TRUE)) %>%
  dplyr::arrange(desc(mean_karma)) %>%
  head(5)

# A tibble: 5 × 2
  subreddit  mean_karma
  <chr>           <dbl>
1 technology     27139.
2 worldnews      26282.
3 funny          26210.
4 politics       25848.
5 gaming         23585.

# Top 5 subreddits with highest average account age
reddit %>%
  dplyr::group_by(subreddit) %>%
  dplyr::summarize(mean_age = mean(Accountage, na.rm = TRUE)) %>%
  dplyr::arrange(desc(mean_age)) %>%
  head(5)

# A tibble: 5 × 2
  subreddit  mean_age
  <chr>         <dbl>
1 technology    1274.
2 gaming        1203.
3 politics      1189.
4 funny         1135.
5 pics          1129.

# Top 5 subreddits with highest average word length
reddit %>%
  dplyr::group_by(subreddit) %>%
  dplyr::summarize(mean_word = mean(Avgwordlength, na.rm = TRUE)) %>%
  dplyr::arrange(desc(mean_word)) %>%
  head(5)

# A tibble: 5 × 2
  subreddit  mean_word
  <chr>          <dbl>
1 worldnews       5.74
2 politics        5.68
3 gaming          5.66
4 pics            5.66
5 technology      5.63

# Top 5 subreddits with highest Sentiment score
reddit %>%
  dplyr::group_by(subreddit) %>%
  dplyr::summarize(mean_sentiment = mean(Sentimentscore, na.rm = TRUE)) %>%
  dplyr::arrange(desc(mean_sentiment)) %>%
  head(5)

# A tibble: 5 × 2
  subreddit  mean_sentiment
  <chr>               <dbl>
1 politics           0.0793
2 gaming             0.0745
3 worldnews          0.0249
4 technology        -0.0172
5 pics              -0.0310

Patchwork of Scatterplots

# ---- Filter before ggplot (global filter) ----
# Bots within the "Funny" subreddit only
library(patchwork)
library(gridExtra)


p1 <- reddit |>
  filter(subreddit == "funny", botdetector == "BOT") |>
  ggplot(aes(x = Userkarma, y = Accountage)) +
  geom_smooth() +
  geom_point() +
  labs(
    title = "Subreddit:Funny"
  )
p1

p1 <- reddit |>
  filter(subreddit == "funny", botdetector == "BOT") |>
  ggplot(aes(x = Userkarma, y = Accountage)) +
  geom_smooth() +
  geom_point() +
  labs(
    title = "Subreddit:Funny"
  )
# Bots within the "worldnews" subreddit only
p2<- reddit |>
  filter(subreddit == "worldnews", botdetector == "BOT") |>
  ggplot(aes(x = Userkarma, y = Accountage)) +
  geom_smooth() +
  geom_point()+
  labs(
    title = "Subreddit:Worldnews"
  )

# Bots within the "technology" subreddit only
p3<- reddit |>
  filter(subreddit == "technology", botdetector == "BOT") |>
  ggplot(aes(x = Userkarma, y = Accountage)) +
  geom_smooth() +
  geom_point()+
  labs(
    title = "Subreddit:Technology"
  )
p3

# Humans within the "technology" subreddit only
reddit |>
  filter(subreddit == "technology", botdetector == "HUMAN") |>
  ggplot(aes(x = Userkarma, y = Accountage)) +
  geom_smooth() +
  geom_point()+
  labs(
    title = "Subreddit:Technology"
  )

# Bots within the "pics" subreddit only
p4<- reddit |>
  filter(subreddit == "pics", botdetector == "BOT") |>
  ggplot(aes(x = Userkarma, y = Accountage)) +
  geom_smooth() +
  geom_point()+
  labs(
    title = "Subreddit:Pics"
  )

# Bots within the "gaming" subreddit only
p5<- reddit |>
  filter(subreddit == "gaming", botdetector == "BOT") |>
  ggplot(aes(x = Userkarma, y = Accountage)) +
  geom_smooth() +
  geom_point()+
  labs(
    title = "Subreddit:Gaming"
  )

# Bots within the "politics" subreddit only
p6<- reddit |>
  filter(subreddit == "politics", botdetector == "BOT") |>
  ggplot(aes(x = Userkarma, y = Accountage)) +
  geom_smooth() +
  geom_point()+
  labs(
    title = "Subreddit:Politics"
  )

gridExtra::grid.arrange(p1,p2,p3,p4,p5,p6,
                        nrow=2)

The plots do not show any linear relationships between User karma and account age when filtering for the subreddit topics. This can still be a good visual to show to display that there isn’t a linear relationship for these topics when it comes to bot accounts.

Interactive Figure 1: Account age by Userkarma Scatterplot

# INTERACTIVE FIGURES (Requirement: At least two)
# Figure 1: Interactive Scatter
int_plot1 <- ggplot(reddit, aes(x = Accountage, y = Userkarma, color = botdetector)) +
  geom_point()
ggplotly(int_plot1)

An interactive scatterplot was created to pinpoint the relationship between User karma and Account age by whether the account is ran by a bot or human. Alongside this Blue strip towards the left of the plot is where a majority of the data for bot accounts reside. We can easily deduce that bot accounts who just made the account tend to have high user karma points to increase their engagment and foot traffic to their account.

Statistical transformations: bar chart counts

bar1<- reddit %>% 
  count(botdetector) %>% 
  ggplot(aes(x = botdetector, y = n, fill = botdetector)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = n), 
            vjust = -0.5, 
            size = 5, 
            fontface = "bold") +
  scale_fill_manual(values = c(
    "HUMAN" = "#FDB813",  # yellow-orange
    "BOT"   = "#E66100"   # deep orange
  )) +
  labs(
    x = "Botdetector",
    y = "Count",
    title = "Counts of Human vs Bot Accounts"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold")
  )


# Subreddit Proportions  by age category (Bar)
bar2 <- ggplot(reddit, aes(x = age_category, fill = botdetector)) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c(
    "HUMAN" = "#FDB813",
    "BOT"   = "#E66100"
  )) +
  labs(
    title = "Bot Proportions by Age Category",
    y = "Proportion",
    fill = "Bot Status"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold")
  )


bar1

bar2

To also evaluate how many suspected humans and bot accounts are represented in the dataset we created a Bar chart to deliver these results which concluded that there are 282 Human accounts and 218 Bot accounts. We also utilized the new age category variable to get a further look at the proportions of human vs bot accounts based on the time the account was opened and you can note that 100% of bots are new accounts on reddit!

reddit |>
  count(subreddit, botdetector) |>
  group_by(subreddit) |>
  mutate(pct = n / sum(n)) |>
  ggplot(aes(x = subreddit, y = pct, fill = botdetector)) +
  geom_col(position = "fill") +
  geom_text(aes(label = scales::percent(pct, accuracy = 0.1)),
            position = position_fill(vjust = 0.5),
            color = "white",
            fontface = "bold",
            size = 4) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_manual(values = c(
    "HUMAN" = "#FDB813",  # yellow-orange
    "BOT"   = "#E66100"   # deep orange
  )) +
  labs(
    x = "Subreddit",
    y = "Percentage",
    fill = "Account Type",
    title = "Percentage of HUMAN vs BOT Accounts by Subreddit"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

#More "Bot" accounts tend to be found in the gaming and Worldnews categories 
# while Less of them are found in the funny category.

#We could filter the dataset to include the datapoints that are associated
# with the subreddit levels: Worldnews and gaming?

Utilizing the subreddit variable once more we wanted to inspect the proportion of bots to humans within each reddit category. More “Bot” accounts tend to be found in the gaming and Worldnews categories while Less of them are found in the funny category.

str(reddit$botdetector)

 Factor w/ 2 levels "HUMAN","BOT": 1 1 1 1 1 1 1 1 1 2 ...

#explore is_bot_flag by User Karma
library(viridis)
reddit %>% 
  ggplot(aes(x = botdetector, y = Userkarma, fill = botdetector)) +
  geom_boxplot() +
  scale_fill_manual(
    values = c(
      "HUMAN" = "#FDB813",   # warm yellow
      "BOT"   = "#E66100"    # deep orange
    )
  ) +
  labs(
    x = "Bot (Y/N)",
    y = "User Karma",
    title = "Side-by-Side User Karma"
  ) +
  theme_minimal(base_size = 14)

#Explore Is bot flag by sentiment score
reddit %>% 
  ggplot(aes(x = botdetector, y = Sentimentscore, fill = botdetector)) + 
  geom_boxplot() + 
  scale_fill_manual(
    values = c(
      "HUMAN" = "#FDB813",   # warm yellow
      "BOT"   = "#E66100"    # deep orange
    )
  ) +
  xlab("Bot (Y/N)") + 
  ylab("Sentiment Score") + 
  ggtitle("Side by side Sentiment Score")

Although the Boxplots do not show any significant difference this is still quite alarming given the context of AI bots running accounts across the social media site reddit. This could indicate that AI accounts are able to replicate the same emotional sentiment that real human account owners express which would make it much more difficult to distinguish between whether an account is run by a bot or a human.

Interactive Figure 2: Side by Side Sentiment score Boxplot

# Figure 2: Interactive Sentiment Boxplot
int_plot2 <- ggplot(reddit, aes(x = botdetector, y = Sentimentscore)) + geom_boxplot()
ggplotly(int_plot2)

Just to examine how closely related sentiment score is based upon human and bot accounts, we developed an interactive boxplot to display the statistical metrics between Humans and Bot’s sentiment.

#Explore Is bot flag by Account age
reddit %>% 
  ggplot(aes(x = botdetector, y = Accountage, fill = botdetector)) + 
  geom_boxplot() + 
  scale_fill_manual(
    values = c(
      "HUMAN" = "#FDB813",   # warm yellow
      "BOT"   = "#E66100"    # deep orange
    )
  ) +
  xlab("Bot (Y/N)") + 
  ylab("Account age (days)") + 
  ggtitle("Side by side Account age in Days comparison")

#Explore Is bot flag by Avg word length 
reddit %>% 
  ggplot(aes(x = botdetector, y = Avgwordlength, fill = botdetector)) + 
  geom_boxplot() + 
  scale_fill_manual(
    values = c(
      "HUMAN" = "#FDB813",   # warm yellow
      "BOT"   = "#E66100"    # deep orange
    )
  ) +
  xlab("Bot (Y/N)") + 
  ylab("Average word length") + 
  ggtitle("Side by side Average word length comparison")

The boxplots here seem to show signs of a significant difference when it comes to the longevity of the account on reddit and the average word length for the user of the account on reddit posts. It would seem like bot accounts have a younger account age whereas suspected human account holders have a older account age which is plausible given the rise of bot accounts years after the development of the reddit site. Also there is a clear difference in the boxplots for the average word length between human and bot accounts making it seem that bot accounts tend to use longer words then human account holders.

Methods

Data Modeling

clustering

# Load clustering package and sample dataset package
library(cluster)
library(GDAdata)
colnames(reddit)

[1] "is_bot_flag"    "Userkarma"      "Sentimentscore" "Accountage"    
[5] "Avgwordlength"  "subreddit"      "botdetector"    "age_category"

# Select columns 3 to 6: user_karma, sentimentscore, account age, and average word length 
reddit_data <- reddit[, c(2:5)]

# Standardize the data so variables with larger scales do not dominate
reddit_data_scale <- scale(reddit_data)


# Load package for estimating the number of clusters
library(NbClust)

# Estimate the best number of clusters from 2 to 10 using k-means criteria
number_cluster_estimate <- NbClust(
  reddit_data_scale,
  distance = "euclidean",
  min.nc = 2,
  max.nc = 10,
  method = "kmeans"
)

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot.

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 6 proposed 2 as the best number of clusters 
* 6 proposed 3 as the best number of clusters 
* 1 proposed 4 as the best number of clusters 
* 1 proposed 5 as the best number of clusters 
* 1 proposed 6 as the best number of clusters 
* 4 proposed 7 as the best number of clusters 
* 4 proposed 10 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  2 
 
 
*******************************************************************

# Show the voting results for the best number of clusters
number_cluster_estimate$Best.nc

                     KL       CH Hartigan     CCC    Scott     Marriot TrCovW
Number_clusters  7.0000   2.0000   3.0000  2.0000   3.0000           3      3
Value_Index     58.4583 152.5287  38.8378 -1.1008 329.8842 13202073230  60745
                 TraceW Friedman   Rubin Cindex      DB Silhouette  Duda
Number_clusters   3.000    5.000  7.0000 6.0000 10.0000    10.0000 2.000
Value_Index     115.898    1.175 -0.1294 0.3672  1.2253     0.2312 0.872
                PseudoT2  Beale Ratkowsky     Ball PtBiserial Frey McClain
Number_clusters   2.0000 2.0000    4.0000   3.0000     7.0000    1  2.0000
Value_Index      46.2364 0.3531    0.3272 339.4416     0.4644   NA  0.7526
                   Dunn Hubert SDindex Dindex    SDbw
Number_clusters 10.0000      0  7.0000      0 10.0000
Value_Index      0.0715      0  1.2154      0  0.3389

# Set seed for reproducibility
set.seed(123)

# Run PAM clustering with 5 clusters
# Note: the slides call this k-means, but this function is PAM
kmeans_reddit_data_scale_cluster <- pam(reddit_data_scale, k = 3)

# Show medoids for the clusters
kmeans_reddit_data_scale_cluster$medoids

       Userkarma Sentimentscore Accountage Avgwordlength
[1,]  0.58541528     -0.7485117  0.3324093    -0.6080182
[2,] -0.89231930      0.7657885  0.2830275    -0.3510934
[3,]  0.08906073     -0.3699367 -1.1070196     0.9000188

# Show the cluster assignment for each row
kmeans_reddit_data_scale_cluster$clustering

  [1] 1 2 1 1 1 2 2 2 2 1 2 2 3 1 2 1 3 1 2 2 2 3 1 1 3 1 1 3 1 1 3 1 3 2 3 1 3
 [38] 3 2 3 2 3 2 3 1 2 1 1 2 1 1 3 1 3 2 2 3 2 1 1 3 3 2 1 3 3 2 1 2 2 3 1 1 1
 [75] 1 3 1 3 1 1 3 2 1 2 3 2 2 3 1 3 2 2 3 1 3 3 2 3 1 2 1 3 1 2 2 3 1 1 2 2 3
[112] 1 2 3 2 1 2 2 1 3 3 1 2 1 3 2 2 3 2 3 2 3 3 2 1 3 3 3 3 1 2 2 3 2 2 1 2 3
[149] 1 3 1 3 1 3 1 3 2 1 1 3 2 3 1 2 3 2 1 1 3 2 1 2 3 1 1 2 2 3 3 1 2 1 3 1 3
[186] 3 2 2 1 2 2 1 1 1 2 3 3 3 1 1 2 3 2 1 2 1 1 3 1 1 3 1 1 2 1 3 2 1 1 1 1 2
[223] 1 1 3 3 2 3 1 3 2 2 2 1 2 3 1 2 3 1 2 2 1 1 3 3 2 2 3 2 3 1 2 1 1 1 1 2 2
[260] 2 2 2 1 3 1 3 3 2 2 1 3 3 2 3 2 1 1 1 3 2 3 1 2 3 1 2 2 2 1 2 2 2 3 2 3 3
[297] 3 3 3 3 1 1 2 2 2 2 2 3 1 3 3 1 2 1 2 2 1 3 2 1 2 3 3 1 3 3 2 2 1 2 1 1 1
[334] 1 1 2 3 2 1 2 1 1 2 1 1 2 2 2 2 3 1 3 3 3 3 3 3 3 3 2 1 1 1 3 1 2 2 1 2 1
[371] 2 2 1 3 1 3 2 2 2 3 2 3 1 2 1 3 2 3 2 1 3 3 3 2 1 3 3 2 3 3 2 3 3 1 1 1 2
[408] 1 3 2 1 3 1 3 1 1 1 1 3 2 2 2 1 1 3 3 1 1 2 2 2 1 2 1 1 1 3 3 3 3 1 2 1 1
[445] 3 3 2 3 3 3 3 3 2 2 3 2 3 1 1 3 1 3 1 3 1 3 1 3 3 1 3 2 3 2 3 2 1 2 3 2 2
[482] 3 2 1 1 2 2 2 1 2 1 3 3 3 3 3 1 2 2 2

# Plot the clustering result in two reduced dimensions
plot(kmeans_reddit_data_scale_cluster)

# Add the assigned cluster to the original food data
reddit_data_cluster <- reddit_data %>%
  mutate(cluster = kmeans_reddit_data_scale_cluster$clustering)

# Show the dataset with assigned clusters
reddit_data_cluster

# A tibble: 500 × 5
   Userkarma Sentimentscore Accountage Avgwordlength cluster
       <dbl>          <dbl>      <dbl>         <dbl>   <int>
 1     34192          -0.6        2264          4.62       1
 2      2812           0.15       1654          5.84       2
 3     37109           0.35       2442          5.73       1
 4     32997          -0.74        168          4.58       1
 5     25088          -0.96        801          5.78       1
 6     17813           0.93       1333          4.08       2
 7     46483           0.98       2582          5.33       2
 8      6760          -0.42        380          4.72       2
 9     17095           0.44       2916          4.56       2
10     43817          -0.15       2797          6.05       1
# ℹ 490 more rows

# Compute the average of each variable by cluster
reddit_cluster_summary <- reddit_data_cluster %>%
  group_by(cluster) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))

# Show cluster summaries to compare 
reddit_cluster_summary

# A tibble: 3 × 5
  cluster Userkarma Sentimentscore Accountage Avgwordlength
    <int>     <dbl>          <dbl>      <dbl>         <dbl>
1       1    33641.        -0.304       1635.          5.17
2       2    14660.         0.407       1528.          5.38
3       3    27266.        -0.0516       288.          6.44

Reducing the number of clusters from k = 5 to k = 3 improved the clustering quality slightly, increasing the average silhouette width from 0.19 to 0.20 and producing more balanced cluster sizes. However, the silhouette width remains low overall, indicating weak separation between clusters.

reddit_data_cluster[reddit_data_cluster$cluster == 6, ]

# A tibble: 0 × 5
# ℹ 5 variables: Userkarma <dbl>, Sentimentscore <dbl>, Accountage <dbl>,
#   Avgwordlength <dbl>, cluster <int>

Although there is weak separation between the clusters, cluster 6 detects a majority of bot accounts! Account age and avg word length is telling of how bots just now acquired accounts and they tend to over explain certain topics on the site, Have negative tones, and Also bots tend to have higher user karma points to increase their engagement value.

Logistic Regression Model

#######################################
#Run logistic regression 
logit1 <- glm(botdetector ~Sentimentscore + Userkarma + Accountage + Avgwordlength, data = reddit, family = "binomial")

#finally estimate a reduced model with only Sentiment Score
logit2 <- glm(botdetector ~ Sentimentscore, data = reddit, family = "binomial")

#########
summary(logit1)


Call:
glm(formula = botdetector ~ Sentimentscore + Userkarma + Accountage + 
    Avgwordlength, family = "binomial", data = reddit)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -3.696e+01  4.659e+00  -7.934 2.12e-15 ***
Sentimentscore  4.890e-01  3.068e-01   1.593   0.1110    
Userkarma      -6.411e-06  1.229e-05  -0.522   0.6018    
Accountage     -7.084e-04  1.905e-04  -3.719   0.0002 ***
Avgwordlength   6.544e+00  8.141e-01   8.039 9.05e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 684.93  on 499  degrees of freedom
Residual deviance: 208.32  on 495  degrees of freedom
AIC: 218.32

Number of Fisher Scoring iterations: 8

#Use the template below to help with intrepretations

#For every one unit change in account_age_days, the log odds of an Account being a bot (versus a human) decreases by 0.0007.

#For every one unit change in Avgwordlength, the log odds of an Account being a bot (versus a human) increases by 6.544.

Our logistic regression model shows that average word length is by far the strongest predictor of bot accounts, increasing the odds dramatically. Account age has a small but significant negative effect, meaning older accounts are slightly less likely to be bots. Sentiment score and user karma do not meaningfully contribute to prediction. The large drop in deviance and low AIC indicate an excellent model fit.

Logistic Metrics

##################################
#Sensitivity and specificity

#create confusion matrix (classification table)
threshold=0.5
predicted_values<-ifelse(predict(logit1,type="response")>threshold,1,0)
actual_values<-logit1$y
conf_matrix<-table(predicted_values,actual_values)
conf_matrix

                actual_values
predicted_values   0   1
               0 257  28
               1  25 190

# Estimate Sensitivity
sensitivity(conf_matrix)

[1] 0.9113475

#high sensisitivity, This indicates that the model is very good at correctly identifying positive cases.

#Esimate specificity
specificity(conf_matrix)

[1] 0.8715596

#low specificity, This suggests that the model has a high rate of false positives.

The model is able to capture 91% of postive cases meaning we want to detect bot accounts correctly and it correctly rejects 87% of the negative cases mistaking human accounts for bot accounts.

##################################
#ROC curve and AUC
predicted_prob<-predict(logit1,type="response")
roc_curve <- roc(logit1$y, predicted_prob)
plot(roc_curve)

#Let's get the AUC
auc(roc_curve)

Area under the curve: 0.9707

#Visual
library(ggplot2)
library(dplyr)

roc_df <- data.frame(
  tpr = roc_curve$sensitivities,
  fpr = 1 - roc_curve$specificities
)

Roc1 <- ggplot(roc_df, aes(x = fpr, y = tpr)) +
  geom_line(color = "#1f77b4", size = 1.2) +
  geom_abline(linetype = "dashed", color = "gray40") +
  theme_minimal() +
  labs(
    title = paste("ROC Curve (AUC =", round(auc(roc_curve), 3), ")"),
    x = "False Positive Rate",
    y = "True Positive Rate"
  )



#######
#comparison to reduced model
predicted_prob2<-predict(logit2,type="response")
roc_curve2 <- roc(logit2$y, predicted_prob2)
plot(roc_curve2)

auc(roc_curve2)

Area under the curve: 0.5336

#Visual
library(pROC)
library(ggplot2)

# Predicted probabilities
predicted_prob2 <- predict(logit2, type = "response")

# ROC object
roc_curve2 <- roc(logit2$y, predicted_prob2)

# Convert ROC data to a data frame
roc_df2 <- data.frame(
  tpr = roc_curve2$sensitivities,
  fpr = 1 - roc_curve2$specificities
)

# ggplot ROC curve
Roc2 <- ggplot(roc_df2, aes(x = fpr, y = tpr)) +
  geom_line(color = "#D55E00", size = 1.3) +
  geom_abline(linetype = "dashed", color = "gray40") +
  theme_minimal() +
  labs(
    title = paste("ROC Curve for logit2 (AUC =", round(auc(roc_curve2), 3), ")"),
    x = "False Positive Rate",
    y = "True Positive Rate"
  )
gridExtra::grid.arrange(Roc1,Roc2,
                        nrow=2)

The full model, with an AUC of 0.9707, appears to have better predictive accuracy than the reduced model(AUC of 0.5336). This suggests that the additional variables included in the full model contribute to its predictive power. Showcasing that the full model is able to distinguish between a Human and a Bot account 97% of the time vs the reduced model only being able to distinguish them 53% of the time.

confint(logit1)

                       2.5 %        97.5 %
(Intercept)    -4.704901e+01 -2.873218e+01
Sentimentscore -1.075470e-01  1.100833e+00
Userkarma      -3.074078e-05  1.765326e-05
Accountage     -1.093442e-03 -3.434386e-04
Avgwordlength   5.109072e+00  8.309547e+00

#but discussing things in "log odds" isn't very helpful, let's exponentiate all coefficients
#with the exponetiated CIs we are intersted in 1 as the reference point
exp(cbind(OR = coef(logit1), confint(logit1)))

                         OR        2.5 %       97.5 %
(Intercept)    8.854425e-17 3.688700e-21 3.324864e-13
Sentimentscore 1.630625e+00 8.980344e-01 3.006669e+00
Userkarma      9.999936e-01 9.999693e-01 1.000018e+00
Accountage     9.992918e-01 9.989072e-01 9.996566e-01
Avgwordlength  6.952364e+02 1.655167e+02 4.062472e+03

library(broom)
library(dplyr)
library(gt)

# 1. Get tidy coefficient table (includes p-values)
coef_tbl <- tidy(logit1) |>
  select(term, estimate, p.value)

# 2. Get exponentiated ORs + CIs
or_tbl <- exp(cbind(
  OR = coef(logit1),
  confint(logit1)
)) |>
  as.data.frame() |>
  tibble::rownames_to_column("term") |>
  rename(
    `Odds Ratio` = OR,
    `CI Lower` = `2.5 %`,
    `CI Upper` = `97.5 %`
  )

# 3. Merge ORs + p-values
final_tbl <- left_join(or_tbl, coef_tbl, by = "term") |>
  mutate(
    across(c(`Odds Ratio`, `CI Lower`, `CI Upper`), ~ round(.x, 3)),
    p.value = round(p.value, 4)
  ) |>
  rename(
    Predictor = term,
    `p-value` = p.value
  )

# 4. Create polished gt table
final_tbl |>
  gt() |>
  tab_header(
    title = "Odds Ratios with 95% CI and p-values",
    subtitle = "Exponentiated Logistic Regression Coefficients (logit1)"
  ) |>
  cols_label(
    Predictor = "Variable",
    `Odds Ratio` = "OR",
    `CI Lower` = "Lower 95% CI",
    `CI Upper` = "Upper 95% CI",
    `p-value` = "p-value"
  ) |>
  fmt_number(
    columns = c(`Odds Ratio`, `CI Lower`, `CI Upper`, `p-value`),
    decimals = 3
  ) |>
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_title(groups = "title")
  ) |>
  tab_style(
    style = cell_fill(color = "#f7f7f7"),
    locations = cells_body()
  )

Variable	OR	Lower 95% CI	Upper 95% CI	estimate	p-value
Odds Ratios with 95% CI and p-values
Exponentiated Logistic Regression Coefficients (logit1)
(Intercept)	0.000	0.000	0.000	-3.696303e+01	0.000
Sentimentscore	1.631	0.898	3.007	4.889631e-01	0.111
Userkarma	1.000	1.000	1.000	-6.410834e-06	0.602
Accountage	0.999	0.999	1.000	-7.084398e-04	0.000
Avgwordlength	695.236	165.517	4,062.472	6.544252e+00	0.000

summary(logit1)


Call:
glm(formula = botdetector ~ Sentimentscore + Userkarma + Accountage + 
    Avgwordlength, family = "binomial", data = reddit)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -3.696e+01  4.659e+00  -7.934 2.12e-15 ***
Sentimentscore  4.890e-01  3.068e-01   1.593   0.1110    
Userkarma      -6.411e-06  1.229e-05  -0.522   0.6018    
Accountage     -7.084e-04  1.905e-04  -3.719   0.0002 ***
Avgwordlength   6.544e+00  8.141e-01   8.039 9.05e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 684.93  on 499  degrees of freedom
Residual deviance: 208.32  on 495  degrees of freedom
AIC: 218.32

Number of Fisher Scoring iterations: 8

Overall Results

Avgwordlength is the only strong, meaningful predictor accounts with longer average word length have dramatically higher odds of the outcome. For each 1‑unit increase in average word length, the odds increase by a factor of ~695. CI is wide (165–4062), meaning the effect is large but uncertain. Average word length is a very strong predictor — longer average words dramatically increase the odds of the outcome.

Accountage has a tiny but statistically significant negative effect. older accounts have slightly lower odds of the outcome

The model suggests linguistic features matter far more than user metadata

Conclusion

With the evidence gathered from the analysis it seems like a majority of bot accounts can be detected by the age of their accounts as well as the length of the words used on their posts. Although not a significant outcome the sentiment score between Humans and Bot accounts are closely related to each other which in retrospective may be pretty alarming as AI continues to grow in intelligence. This model produced in this analysis can be used as a foundation for detecting bot account activity and future studies should include exploring the emotional intelligence of AI to truly discover the fine line between distinguishing between humans and bots online.