Karma and Camouflage: A Statistical Analysis of the Dead Internet Theory on Reddit

Team DND

Group Members: Desyne Martinez & Daniel Perez Ulloa

Professor: Dr. Amir Karami

Course: Programming in R DS 7130

Project Overview

Dataset: The “Dead Internet” Theory: Reddit Bot vs. Human

Concept: Given the increasing presence of automated accounts online, this study examines how account age and other behavioral indicators including sentiment score, reply delay, average word length,and bot probability to predict user karma on Reddit posts, providing insight into how bot‑like features relate to engagement.

Research Question: To what extent do account age, sentiment score, reply delay, average word length, and bot probability predict variation in user karma on Reddit posts?

Introduction

Generating the Dataset

#We started by reading in all the neccessary libraries we would need for our project to ensure we have every useful tool to proceed with an efficient analysis.
library(psych)
library(readxl)
library(ggplot2)
library(MASS)
library(rstatix)
#library(GGally)
library(tidyverse)
library(tidymodels)
library(moderndive)
library("scatterplot3d")
library(rgl)
library(lm.beta)
library(olsrr)
library(car)
library(report)
library(ppcor)
library(statar)
library(rsample)
library(car)
library(ROSE)
library(caret)
library(dplyr)
library(pROC)
library(rsample)
library(predtools)
library(caret)
library(PredictABEL)
library(aod)
library(glmtoolbox)
library(lmtest)
library(DescTools)
library(readr)
library(gt)
library(DT)
library(plotly)
library(knitr)
library(kableExtra)
library(broom)

setwd("C:/Users/Owner/Downloads/")
#setwd("/home/dplant227/Documents/Rwork")
# the setwd statement here helps us set up our work directory so it is essentially an apartment for the files we want pull out and work with.
# for me I put the datasets in my downloads on my computer hence thats why my work directory is in the downloads space. So when setting up your work directory make sure you are 
# setting it up with the file path where your datasets are staying at.
# "C:\Users\Owner\Downloads\reddit_dead_internet_analysis_2026.xlsx"

reddit<- read_xlsx("reddit_dead_internet_analysis_2026.xlsx")
#reddit <- read_csv("reddit_dead_internet_analysis_2026.csv")
#here we are reading in the dataset using the read.csv statement. Since we already created our work directory all we need to do is read in the dataset using its exact file path name 
# as it is stated in the place where you saved it.

# Inspect structure of the dataset
reddit_glimpse <- data.frame(
  Variable = names(reddit),
  Type = sapply(reddit, class),
  Preview = sapply(reddit, function(x) paste(head(x, 3), collapse = ", "))
)
reddit_glimpse %>%
  kable(
    caption = "Table: Data Structure and Variable Overview (glimpse)",
    booktabs = TRUE,
    align = "lcr",
    col.names = c("Variable Name", "Data Type", "Sample Values")
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = TRUE,
    font_size = 14
  ) %>%
  row_spec(0, bold = TRUE, background = "#F2F2F2") %>%
  column_spec(1, bold = TRUE, color = "#2C3E50") %>%
  column_spec(2, italic = TRUE, color = "#7F8C8D")
Table: Data Structure and Variable Overview (glimpse)
Variable Name Data Type Sample Values
comment_id comment_id character t1_3148938807, t1_2516748923, t1_833922318
subreddit subreddit character funny, gaming, politics
account_age_days account_age_days numeric 2264, 1654, 2442
user_karma user_karma numeric 34192, 2812, 37109
reply_delay_seconds reply_delay_seconds numeric 1854, 1033, 2703
sentiment_score sentiment_score numeric -0.6, 0.15, 0.35
avg_word_length avg_word_length numeric 4.62, 5.84, 5.73
contains_links contains_links logical FALSE, FALSE, FALSE
is_bot_flag is_bot_flag logical FALSE, FALSE, FALSE
bot_type_label bot_type_label character None (Human), None (Human), None (Human)
bot_probability bot_probability numeric 0.093, 0.117, 0.115

Exploring the Data!

Variables of interest

The dataset included 500 observations of Reddit comment metadata with 11 variables including:

  • comment_id

  • subreddit

  • account_age_days

  • user_karma

  • reply_delay_seconds

  • sentiment_score

  • avg_word_length

  • contains_links

  • is_bot_flag

  • bot_type_label

  • bot_probability

    We wanted to put a focus on the primary factors that would contribute to detecting bot accounts on Reddit while also keeping our model parsimonious, so our primary variables of interest were as follows:

  • Dependent variable: User Karma

  • Primary exposure variable: Bot prob

  • Covariates: Sentiment score, Account age, Avg word length, and Reply delay.

    User karma is a continuous variable detailing the karmic engagement points a user would gain with a reddit account. Bot Probability details the percent chance an account is ran by a bot. Sentiment score is a continuous variable that detects the emotional sentiment of a post on a scale of -1 to 1 from negative to positive respectively. Account age represents the longevity of the account in days, Average word length specifies the length of the words the user of the account uses primarily on posts, and reply delay represents the time in seconds it takes for a user to reply to a post on a Reddit account.

    To improve the comprehensibility of the variables and make them easier to work with, we renamed the variables by removing the underscores! We also added Two categorical variables derived from Accountage and Sentiment Score respectively. Age category was created from Account age to categorize the time an account was created to distinguish it by “New accounts”, “Established” , and “Veteran” within each threshold. Sentimentcat was created from Sentimentscore to categorize the emotional sentiment by “Negative”, “Neutral”, and “Positive” from a Reddit account when making posts.

    # Recoding variables to make them easier to work with
    reddit <- reddit %>%
      dplyr::select(reply_delay_seconds, user_karma, sentiment_score, account_age_days, avg_word_length, bot_probability) %>%
      rename(
        Sentimentscore = sentiment_score,
        Userkarma = user_karma,
        Accountage = account_age_days,
        Avgwordlength = avg_word_length,
        Replydelay = reply_delay_seconds,
        botprob = bot_probability
      ) %>%
      mutate(
        age_category = case_when(
          Accountage < 30 ~ "New Account",
          Accountage >= 30 & Accountage < 365 ~ "Established",
          Accountage >= 365 ~ "Veteran"
        ),
    
        # Three‑way sentiment category
        SentimentCat = case_when(
          Sentimentscore < -0.2 ~ "Negative",
          Sentimentscore >= -0.2 & Sentimentscore <= 0.2 ~ "Neutral",
          Sentimentscore > 0.2 ~ "Positive"
        )
      )
    
    all_vars <- ls(reddit)
    n <- length(all_vars)
    mid <- ceiling(n / 2)
    
    ##
    
    variables_list <- data.frame(Variables = ls(reddit))
    
    variables_list %>%
      kable(caption = "Table 2: Inventory of Variables in Reddit Dataset", booktabs = TRUE) %>%
      kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE) %>%
      row_spec(0, bold = TRUE, background = "#F2F2F2") %>%
      column_spec(1, bold = TRUE)
    Table 2: Inventory of Variables in Reddit Dataset
    Variables
    Accountage
    age_category
    Avgwordlength
    botprob
    Replydelay
    SentimentCat
    Sentimentscore
    Userkarma

Interactive Table

# INTERACTIVE TABLE (Requirement: At least one)
datatable(head(reddit, 100), options = list(pageLength = 5), 
          caption = 'Table 1: Interactive view of Reddit Bot Analysis Data')

GROUPED SUMMARIES

Using group summaries to explore the data looking at the New categorical variable age_category to evaluate the highest average user karma points, Sentiment Score, Average word length, bot probability, and reply delay between each age distinction for Reddit accounts. We wanted to explore this to see which Accounts(New Accounts, Established Accounts, and Veteran Accounts) on Reddit would have the highest engagement factors.

#Generate the grouped summary data
#This calculates the mean for each key metric across the 3 clusters
grouped_summary <- reddit %>%
  group_by(age_category) %>%
  summarise(
    Observations = n(),
    `Avg Karma` = mean(Userkarma, na.rm = TRUE),
    `Avg Account Age` = mean(Accountage, na.rm = TRUE),
    `Avg Bot Prob` = mean(botprob, na.rm = TRUE),
    `Avg Word Length` = mean(Avgwordlength, na.rm = TRUE),
    `Avg Sentiment` = mean(Sentimentscore, na.rm = TRUE)
  ) %>%
  mutate(across(where(is.numeric), ~ round(.x, 2)))

# 2. Format as a professional kable table
grouped_summary %>%
  kable(
    caption = "Table: Behavioral Metric Summaries Grouped by Cluster",
    booktabs = TRUE,
    align = "c"
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = TRUE,
    font_size = 14
  ) %>%
  row_spec(0, bold = TRUE, background = "#F2F2F2") %>%
  # Highlighting Cluster 3 as the likely "Bot-like" group
  row_spec(3, background = "#FFF4F4", bold = TRUE) %>%
  column_spec(1, bold = TRUE)
Table: Behavioral Metric Summaries Grouped by Cluster
age_category Observations Avg Karma Avg Account Age Avg Bot Prob Avg Word Length Avg Sentiment
Established 48 26669.04 156.67 0.16 5.88 -0.10
New Account 101 24505.40 14.02 0.25 6.44 0.10
Veteran 351 25267.64 1623.58 0.13 5.40 0.01

Patchwork of Scatterplots

library(patchwork)
library(gridExtra)


#EXPLORATORY DATA ANALYSIS (EDA)
# Visualizing relationship between Bot Probability and User Karma
p1 <- ggplot(reddit, aes(x = botprob, y = Userkarma)) +
  geom_point(alpha = 0.5, color = "darkblue") +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Bot Probability vs. User Karma", x = "Bot Probability", y = "Karma") +
  theme_minimal()

# Visualizing relationship between Account Age and User Karma
p2 <- ggplot(reddit, aes(x = Accountage, y = Userkarma)) +
  geom_point(alpha = 0.5, color = "darkgreen") +
  geom_smooth(method = "lm", color = "orange") +
  labs(title = "Account Age vs. User Karma", x = "Age (Days)", y = "Karma") +
  theme_minimal()

p3 <- ggplot(reddit, aes(x = Replydelay, y = Userkarma)) +
  geom_point(alpha = 0.5, color = "darkred") +
  geom_smooth(method = "lm", color = "black") +
  labs(title = "Reply Delay vs. User Karma",
       x = "Reply Delay (Seconds)",
       y = "User Karma") +
  theme_minimal()

p4 <- ggplot(reddit, aes(x = Avgwordlength, y = Userkarma)) +
  geom_point(alpha = 0.5, color = "darkcyan") +
  geom_smooth(method = "lm", color = "purple") +
  labs(title = "Average Word Length vs. User Karma",
       x = "Average Word Length",
       y = "User Karma") +
  theme_minimal()

p5 <- ggplot(reddit, aes(x = Sentimentscore, y = Userkarma)) +
  geom_point(alpha = 0.5, color = "darkorange") +
  geom_smooth(method = "lm", color = "blue") +
  labs(title = "Sentiment Score vs. User Karma",
       x = "Sentiment Score",
       y = "User Karma") +
  theme_minimal()

# Combined EDA plot (Requirement: Combined figures)
(p1 | p2) /
  (p3 | p4) /
  p5

The plots do not show any linear relationships between User karma and the covariates among the 500 Reddit accounts.

library(patchwork)
library(gridExtra)

p1 <- reddit |>
  filter(age_category == "New Account", SentimentCat == "Negative") |>
  ggplot(aes(x = Userkarma, y = botprob)) +
  geom_smooth() +
  geom_point() +
  labs(
    title = "New Account (- sentiment) "
  )


p3<- reddit |>
  filter(age_category == "Veteran", SentimentCat == "Negative") |>
  ggplot(aes(x = Userkarma, y = botprob)) +
  geom_smooth() +
  geom_point()+
  labs(
    title = "Veteran Accounts (- sentiment)"
  )




p2<- reddit |>
  filter(age_category == "Established", SentimentCat == "Negative") |>
  ggplot(aes(x = Userkarma, y = botprob)) +
  geom_smooth() +
  geom_point() +
  labs(
    title = "Established (- sentiment)"
  )



p4 <- reddit |>
  filter(age_category == "New Account", SentimentCat == "Positive") |>
  ggplot(aes(x = Userkarma, y = botprob)) +
  geom_smooth() +
  geom_point() +
  labs(
    title = "New Account (+ sentiment) "
  )


p6<- reddit |>
  filter(age_category == "Veteran", SentimentCat == "Positive") |>
  ggplot(aes(x = Userkarma, y = botprob)) +
  geom_smooth() +
  geom_point()+
  labs(
    title = "Veteran Accounts (+ sentiment)"
  )



p5<- reddit |>
  filter(age_category == "Established", SentimentCat == "Positive") |>
  ggplot(aes(x = Userkarma, y = botprob)) +
  geom_smooth() +
  geom_point() +
  labs(
    title = "Established Account (+ sentiment) "
  )



(p1 + p2 + p3) / (p4 + p5 + p6) + 
  plot_annotation(title = "Analysis of Bot Probability by Account Category and Sentiment") +
  theme(plot.title = element_text(size = 9))

We wanted to take a look at User karma by Bot probability filtering each category by the type of account and positive or negative Sentiment. The plots here do not show any linear relationships between User karma and the bot probability, but an interesting thing to point out is that the bot probability goes down as we move from New accounts to Veteran accounts and that the emotional sentiment between Veteran account holders are roughly the same whether they have Negative or positive Engagement towards Reddit posts.

Interactive Figure 1: Reply Delay by Userkarma Scatterplot

# INTERACTIVE FIGURES (Requirement: At least two)
# Figure 1: Interactive Scatter

# Interactive Scatter: Reply Delay vs Karma
int_plot1 <- ggplot(reddit, aes(x = Replydelay, y = Userkarma)) + 
  geom_point(color = "purple", alpha = 0.4) +
  labs(title = "Interactive: Reply Delay vs Karma")
ggplotly(int_plot1)

An interactive scatter plot was created to pinpoint the relationship between User karma and Reply delay. Alongside this purple strip towards the Y axis of the plot is where a majority of the data is clustered. From examining this we can easily deduce that accounts who tend to reply really quickly in roughly under 10 seconds tend to have high user karma points to increase their engagement and foot traffic to their account which also can be an indication of bot activity.

Statistical transformations: bar chart counts

library(ggplot2)
library(dplyr)

reddit %>% 
  count(age_category) %>% 
  ggplot(aes(x = age_category, y = n, fill = age_category)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = n), 
            vjust = -0.5, 
            size = 5, 
            fontface = "bold") +
  scale_fill_manual(values = c(
    "Established" = "#FDB813",  # yellow-orange
    "New Account"   = "#E66100",   # deep orange
    "Veteran" =  "deeppink"
  )) +
  labs(
    x = "Account type",
    y = "Count",
    title = "Counts of Categorized ages of reddit Accounts"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold")
  )

reddit %>% 
  count(SentimentCat) %>% 
  ggplot(aes(x = SentimentCat, y = n, fill = SentimentCat)) +
  geom_col(width = 0.7) +
  geom_text(aes(label = n), 
            vjust = -0.5, 
            size = 5, 
            fontface = "bold") +
  scale_fill_manual(values = c(
    "Negative" = "#FDB813",  # yellow-orange
    "Neutral"   = "#E66100",   # deep orange
    "Positive" =  "deeppink"
  )) +
  labs(
    x = "Sentiment type",
    y = "Count",
    title = "Counts of Categorized Sentiment of reddit Accounts"
  ) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "none",
    plot.title = element_text(face = "bold")
  )

To evaluate how many New, Established, and Veteran accounts are represented in the dataset we created a bar chart to deliver these results which concluded that there are 101 New accounts, 48 Established accounts and 351 Veteran accounts. We also evaluated the sentiment across Reddit accounts and found that 198 accounts were associated with negative engagement posts while 201 accounts were associated with positive engagement posts and the other 101 accounts were neutral.

#explore age category by User Karma
library(viridis)
reddit %>% 
  ggplot(aes(x = age_category, y = Userkarma, fill = age_category)) +
  geom_boxplot() +
  scale_fill_manual(
    values = c(
      "New Account"  = "#FDB813",   # warm yellow
      "Established"   = "#E66100",    # deep orange
      "Veteran" =  "deeppink"
    )
  ) +
  labs(
    x = "Account type",
    y = "User Karma",
    title = "Side-by-Side User Karma by Account category"
  ) +
  theme_minimal(base_size = 14)

#explore Sentiment cat by User karma 
library(viridis)
reddit %>% 
  ggplot(aes(x = SentimentCat, y = Userkarma, fill = SentimentCat)) +
  geom_boxplot() +
  scale_fill_manual(
    values = c(
      "Negative" = "#FDB813",   # warm yellow
      "Neutral"   = "#E66100",    # deep orange
      "Positive" =  "deeppink"
    )
  ) +
  labs(
    x = "Senitment",
    y = "User Karma",
    title = "Side-by-Side User Karma by Sentiment Category"
  ) +
  theme_minimal(base_size = 14)

Although the boxplots do not show any significant difference this is still quite alarming given the context of AI bots running accounts across the social media site Reddit. New accounts seem to bring in the same user karma engagement points as Veteran and Established accounts. The user karma points for both positive and negative Sentiment overlap indicating that negativity and positivity don’t matter as much when it comes to gaining user karma points. If we were to make an inference in context of the AI bots on the site it would seem like they are not much differential from the real human account owners on the site which would make it much more difficult to distinguish between whether an account is run by a bot or a human.

Interactive Figure 2: Side by Side Sentiment score Boxplot

# Figure 2: Interactive Sentiment Boxplot
library(tidyverse)
library(plotly)

int_plot2 <- reddit %>% 
  ggplot(aes(x = SentimentCat, y = Userkarma, fill = SentimentCat)) +
  geom_boxplot() +
  scale_fill_manual(
    values = c(
      "Negative" = "#FDB813",   # warm yellow
      "Neutral"  = "#E66100",   # deep orange
      "Positive" = "deeppink"
    )
  ) +
  labs(
    x = "Sentiment",
    y = "User Karma",
    title = "Side-by-Side User Karma by Sentiment Category"
  ) +
  theme_minimal(base_size = 14)

ggplotly(int_plot2)

Just to examine how closely related sentiment score is based upon human and User karma engagement points, we developed an interactive boxplot to display the statistical metrics across the different categories of sentiment.

Methods

Data Modeling

clustering

# Load clustering package and sample dataset package
library(cluster)
library(GDAdata)

#Capture the column names into a data frame
reddit_cols_df <- data.frame(Columns = colnames(reddit))

#format table to show nicely
reddit_cols_df %>%
  kable(
    caption = "Table: Variable Names in the Final Reddit Dataset",
    booktabs = TRUE,
    align = "l"
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    font_size = 14
  ) %>%
  row_spec(0, bold = TRUE, background = "#F2F2F2") %>%
  column_spec(1, bold = TRUE, color = "#2C3E50")
Table: Variable Names in the Final Reddit Dataset
Columns
Replydelay
Userkarma
Sentimentscore
Accountage
Avgwordlength
botprob
age_category
SentimentCat
# Select columns 1,3,4,5,6: Replydelay, sentimentscore, account age,average word length, and  botprob
reddit_data <- reddit[, c(1,3,4,5,6)]

# Standardize the data so variables with larger scales do not dominate
reddit_data_scale <- scale(reddit_data)


# Load package for estimating the number of clusters
library(NbClust)

# Estimate the best number of clusters from 2 to 10 using k-means criteria
number_cluster_estimate <- NbClust(
  reddit_data_scale,
  distance = "euclidean",
  min.nc = 2,
  max.nc = 10,
  method = "kmeans"
)

*** : The Hubert index is a graphical method of determining the number of clusters.
                In the plot of Hubert index, we seek a significant knee that corresponds to a 
                significant increase of the value of the measure i.e the significant peak in Hubert
                index second differences plot. 
 

*** : The D index is a graphical method of determining the number of clusters. 
                In the plot of D index, we seek a significant knee (the significant peak in Dindex
                second differences plot) that corresponds to a significant increase of the value of
                the measure. 
 
******************************************************************* 
* Among all indices:                                                
* 12 proposed 2 as the best number of clusters 
* 2 proposed 3 as the best number of clusters 
* 4 proposed 4 as the best number of clusters 
* 1 proposed 5 as the best number of clusters 
* 1 proposed 8 as the best number of clusters 
* 2 proposed 9 as the best number of clusters 
* 2 proposed 10 as the best number of clusters 

                   ***** Conclusion *****                            
 
* According to the majority rule, the best number of clusters is  2 
 
 
******************************************************************* 
# Show the voting results for the best number of clusters
best_nc_data <- as.data.frame(t(number_cluster_estimate$Best.nc)) %>%
  rename(Number_of_Clusters = 1, Value_of_Statistic = 2) %>%
  mutate(Criterion = rownames(.)) %>%
  dplyr::select(Criterion, Number_of_Clusters, Value_of_Statistic)

best_nc_data %>%
  gt() %>%
  tab_header(
    title = "Optimal Cluster Estimation",
    subtitle = "Results across multiple statistical indices"
  ) %>%
  cols_label(
    Criterion = "Statistical Index",
    Number_of_Clusters = "Proposed K",
    Value_of_Statistic = "Index Value"
  ) %>%
  fmt_number(columns = Value_of_Statistic, decimals = 4) %>%
  tab_options(table.width = pct(80))
Optimal Cluster Estimation
Results across multiple statistical indices
Statistical Index Proposed K Index Value
KL 2 12.4915
CH 2 338.4806
Hartigan 4 73.1496
CCC 10 5.6324
Scott 3 388.6374
Marriot 9 831,706,347,721.2092
TrCovW 4 37,456.7745
TraceW 4 160.8872
Friedman 9 4.5622
Rubin 4 −0.2607
Cindex 5 0.3678
DB 2 1.2027
Silhouette 2 0.3571
Duda 2 0.9291
PseudoT2 2 23.6502
Beale 2 0.2379
Ratkowsky 2 0.3992
Ball 3 312.4974
PtBiserial 2 0.6090
Frey 2 1.5613
McClain 2 0.6102
Dunn 2 0.0908
Hubert 0 0.0000
SDindex 8 1.4520
Dindex 0 0.0000
SDbw 10 0.2971
# Set seed for reproducibility
set.seed(123)

# Run PAM clustering with 5 clusters
# Note: the slides call this k-means, but this function is PAM
kmeans_reddit_data_scale_cluster <- pam(reddit_data_scale, k = 3)

# Show medoids for the clusters
medoids_table <- as.data.frame(kmeans_reddit_data_scale_cluster$medoids) %>%
  mutate(Cluster = paste("Cluster", row_number())) %>%
  dplyr::select(Cluster, everything())

medoids_table %>%
  gt() %>%
  tab_header(
    title = "Cluster Medoids Analysis",
    subtitle = "Representative values for each Reddit account behavior group"
  ) %>%
  fmt_number(columns = -Cluster, decimals = 2) %>%
  cols_label(
    Replydelay = "Reply Delay (s)",
    Sentimentscore = "Sentiment Score",
    Accountage = "Account Age (d)",
    Avgwordlength = "Avg Word Length",
    botprob = "Bot Probability"
  ) %>%
  tab_options(column_labels.font.weight = "bold")
Cluster Medoids Analysis
Representative values for each Reddit account behavior group
Cluster Reply Delay (s) Sentiment Score Account Age (d) Avg Word Length Bot Probability
Cluster 1 1.38 −0.23 0.69 −0.60 −0.72
Cluster 2 0.04 −0.18 0.05 −0.42 −0.68
Cluster 3 −0.87 −0.08 −0.49 0.99 1.00
# Show the cluster assignment for each row
cluster_assignments <- data.frame(
  Observation = 1:length(kmeans_reddit_data_scale_cluster$clustering),
  Assigned_Cluster = kmeans_reddit_data_scale_cluster$clustering
)

cluster_assignments %>%
  kable(
    caption = "Table: Individual Cluster Assignments for All Observations",
    booktabs = TRUE,
    align = "c"
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE,
    font_size = 14
  ) %>%
  scroll_box(width = "100%", height = "400px") %>%
  row_spec(0, bold = TRUE, background = "#F2F2F2") %>%
  column_spec(2, 
              bold = TRUE, 
              color = ifelse(cluster_assignments$Assigned_Cluster == 3, "#D9534F", "black"))
Table: Individual Cluster Assignments for All Observations
Observation Assigned_Cluster
1 1
2 2
3 1
4 1
5 2
6 1
7 2
8 2
9 1
10 3
11 1
12 3
13 3
14 1
15 2
16 2
17 3
18 1
19 1
20 3
21 2
22 3
23 1
24 2
25 3
26 2
27 2
28 3
29 2
30 2
31 3
32 2
33 2
34 3
35 3
36 3
37 3
38 3
39 1
40 3
41 2
42 1
43 1
44 3
45 1
46 3
47 1
48 2
49 2
50 2
51 2
52 3
53 2
54 3
55 3
56 1
57 3
58 3
59 2
60 1
61 3
62 2
63 1
64 2
65 3
66 3
67 1
68 1
69 1
70 3
71 3
72 2
73 2
74 1
75 1
76 3
77 2
78 3
79 2
80 2
81 3
82 3
83 1
84 3
85 3
86 2
87 3
88 3
89 2
90 3
91 1
92 1
93 3
94 2
95 3
96 3
97 2
98 3
99 2
100 1
101 2
102 1
103 3
104 3
105 2
106 3
107 1
108 1
109 2
110 2
111 3
112 1
113 3
114 2
115 1
116 3
117 1
118 2
119 1
120 3
121 3
122 2
123 3
124 1
125 3
126 3
127 2
128 3
129 3
130 3
131 2
132 3
133 3
134 1
135 2
136 3
137 3
138 2
139 2
140 2
141 2
142 3
143 3
144 1
145 2
146 2
147 3
148 3
149 3
150 3
151 2
152 2
153 1
154 3
155 1
156 2
157 2
158 1
159 2
160 3
161 3
162 3
163 1
164 2
165 3
166 2
167 3
168 1
169 3
170 2
171 1
172 3
173 3
174 3
175 1
176 3
177 1
178 3
179 3
180 2
181 2
182 2
183 3
184 1
185 3
186 3
187 1
188 2
189 2
190 1
191 2
192 2
193 1
194 1
195 1
196 3
197 3
198 3
199 2
200 2
201 3
202 3
203 1
204 1
205 1
206 1
207 1
208 3
209 2
210 1
211 3
212 1
213 1
214 2
215 1
216 3
217 3
218 3
219 2
220 2
221 2
222 1
223 1
224 1
225 2
226 3
227 2
228 3
229 3
230 3
231 3
232 2
233 2
234 2
235 2
236 3
237 1
238 1
239 3
240 2
241 2
242 3
243 1
244 2
245 3
246 3
247 1
248 2
249 3
250 1
251 2
252 2
253 1
254 2
255 1
256 1
257 1
258 1
259 2
260 2
261 2
262 3
263 2
264 3
265 1
266 3
267 3
268 2
269 3
270 1
271 2
272 1
273 1
274 3
275 1
276 3
277 1
278 1
279 3
280 1
281 3
282 1
283 2
284 3
285 3
286 3
287 2
288 2
289 1
290 2
291 3
292 3
293 3
294 2
295 2
296 2
297 3
298 3
299 3
300 3
301 2
302 1
303 2
304 3
305 1
306 2
307 2
308 3
309 1
310 3
311 3
312 2
313 2
314 2
315 2
316 2
317 1
318 3
319 1
320 2
321 3
322 3
323 3
324 1
325 3
326 2
327 1
328 2
329 1
330 2
331 1
332 1
333 1
334 2
335 2
336 2
337 3
338 2
339 2
340 1
341 1
342 1
343 1
344 1
345 2
346 1
347 3
348 1
349 1
350 3
351 3
352 3
353 3
354 2
355 3
356 3
357 3
358 2
359 3
360 1
361 2
362 3
363 3
364 3
365 2
366 3
367 2
368 1
369 1
370 1
371 1
372 1
373 1
374 3
375 1
376 2
377 2
378 1
379 2
380 3
381 2
382 3
383 1
384 2
385 1
386 2
387 2
388 3
389 2
390 1
391 3
392 3
393 3
394 3
395 2
396 3
397 3
398 1
399 2
400 3
401 2
402 3
403 3
404 2
405 2
406 2
407 2
408 2
409 3
410 2
411 1
412 3
413 1
414 3
415 1
416 1
417 2
418 1
419 3
420 3
421 1
422 2
423 1
424 1
425 3
426 3
427 3
428 1
429 3
430 2
431 2
432 1
433 2
434 2
435 3
436 2
437 3
438 3
439 3
440 3
441 1
442 2
443 2
444 2
445 3
446 3
447 2
448 3
449 3
450 3
451 3
452 3
453 3
454 1
455 2
456 1
457 3
458 2
459 2
460 3
461 2
462 3
463 2
464 3
465 1
466 3
467 2
468 3
469 2
470 1
471 3
472 2
473 3
474 1
475 3
476 1
477 2
478 1
479 3
480 2
481 1
482 3
483 3
484 3
485 2
486 3
487 3
488 2
489 1
490 3
491 1
492 3
493 3
494 3
495 3
496 3
497 2
498 1
499 1
500 1
# Plot the clustering result in two reduced dimensions
plot(kmeans_reddit_data_scale_cluster)

# Add the assigned cluster to the original food data
reddit_data_cluster <- reddit_data %>%
  mutate(cluster = kmeans_reddit_data_scale_cluster$clustering)

# Show the dataset with assigned clusters
reddit_data_cluster %>%
  head(10) %>%
  kable(
    caption = "Table: Preview of Reddit Data with Cluster Assignments (First 10 Rows)",
    booktabs = TRUE,
    align = "c"
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = TRUE,
    font_size = 14
  ) %>%
  row_spec(0, bold = TRUE, background = "#F2F2F2") # Highlight header
Table: Preview of Reddit Data with Cluster Assignments (First 10 Rows)
Replydelay Sentimentscore Accountage Avgwordlength botprob cluster
1854 -0.60 2264 4.62 0.093 1
1033 0.15 1654 5.84 0.117 2
2703 0.35 2442 5.73 0.115 1
2811 -0.74 168 4.58 0.094 1
1721 -0.96 801 5.78 0.116 2
3237 0.93 1333 4.08 0.082 1
1084 0.98 2582 5.33 0.107 2
622 -0.42 380 4.72 0.096 2
3486 0.44 2916 4.56 0.091 1
2 -0.15 2797 6.05 0.254 3
# Compute the average of each variable by cluster
reddit_cluster_summary <- reddit_data_cluster %>%
  group_by(cluster) %>%
  summarise(across(everything(), ~ mean(.x, na.rm = TRUE)))

# Show cluster summaries to compare

reddit_cluster_summary %>%
  gt() %>%
  tab_header(
    title = "Clustering Analysis Summary",
    subtitle = "Average values for variables across clusters"
  ) %>%
  cols_label(
    cluster = "Cluster ID",
    Replydelay = "Mean Reply Delay (sec)",
    Sentimentscore = "Mean Sentiment Score",
    Accountage = "Mean Account Age (days)",
    Avgwordlength = "Mean Avg Word Length",
    botprob = "Mean Bot Probability"
  ) %>%
  fmt_number(
    columns = everything(),
    decimals = 2
  ) %>%
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels()
  )
Clustering Analysis Summary
Average values for variables across clusters
Cluster ID Mean Reply Delay (sec) Mean Sentiment Score Mean Account Age (days) Mean Avg Word Length Mean Bot Probability
1.00 2,640.04 −0.04 1,745.56 5.02 0.10
2.00 909.08 −0.02 1,295.22 5.15 0.11
3.00 5.59 0.08 634.94 6.50 0.23
# Calculate and table the number of observations in each cluster
reddit_data_cluster %>%
  count(cluster) %>%
  gt() %>%
  tab_header(
    title = "Cluster Distribution",
    subtitle = "Number of observations assigned to each cluster"
  ) %>%
  cols_label(
    cluster = "Cluster ID",
    n = "Account Count"
  ) %>%
  tab_options(
    table.width = pct(50)
  )
Cluster Distribution
Number of observations assigned to each cluster
Cluster ID Account Count
1 139
2 162
3 199

Reducing the number of clusters from k = 5 to k = 3 improved the clustering quality slightly, increasing the average silhouette width from 0.21 to 0.23 and producing more balanced cluster sizes. However, the silhouette width remains low overall, indicating weak separation between clusters.

cluster_3_data <- reddit_data_cluster[reddit_data_cluster$cluster == 3, ] %>%
  head(10)

# Generate the formatted table
cluster_3_data %>%
  kable(
    caption = "Table: Detailed Preview of Cluster 3 Behavioral Data (Top 10 Rows)",
    booktabs = TRUE,
    align = "c",
    digits = 3
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = TRUE,
    font_size = 14
  ) %>%
  row_spec(0, bold = TRUE, background = "#F2F2F2") %>%
  # Highlighting the 'cluster' column in a distinct color to emphasize the subset
  column_spec(which(names(cluster_3_data) == "cluster"), 
              bold = TRUE, 
              color = "white", 
              background = "#D9534F")
Table: Detailed Preview of Cluster 3 Behavioral Data (Top 10 Rows)
Replydelay Sentimentscore Accountage Avgwordlength botprob cluster
2 -0.15 2797 6.05 0.254 3
2 0.98 2669 7.11 0.276 3
8 -0.88 1 5.70 0.358 3
8 -0.05 453 7.32 0.192 3
1 0.71 2631 6.43 0.329 3
9 -0.98 5 7.07 0.248 3
5 -0.46 533 6.78 0.203 3
8 -0.98 13 6.63 0.206 3
2 -0.47 1 6.60 0.465 3
3 0.65 584 5.72 0.215 3

Cluster 3 represents the bot‑like behavioral mode in the dataset. Its extremely fast reply times, younger account age, longer word length, and higher bot probability distinguish it from the human‑like clusters. Because these features are strongly linked to automated posting behavior, Cluster 3 provides the clearest insight into how AI‑generated or bot‑like activity influences user karma. This makes Cluster 3 the most meaningful cluster for understanding engagement patterns in the context of AI bots.

Linear Regression Model

#######################################

#RUN MULTIPLE LINEAR REGRESSION
# We use lm() because UserKarma is a continuous numeric variable
model_karma1 <- lm(Userkarma ~ Accountage, 
                  data = reddit)
model_karma2 <- lm(Userkarma ~ Accountage + Replydelay, 
                  data = reddit)
model_karma3 <- lm(Userkarma ~ Accountage + Replydelay + botprob, 
                  data = reddit)
model_karma4 <- lm(Userkarma ~ Accountage + Replydelay + botprob + Avgwordlength, 
                  data = reddit)
model_karma5 <- lm(Userkarma ~ Accountage + Replydelay + botprob + Avgwordlength + Sentimentscore, 
                  data = reddit)

model_karma <- lm(Userkarma ~ Accountage + Replydelay + botprob + Avgwordlength + Sentimentscore, 
                   data = reddit)


# Extract Coefficients  
coeffs <- coef(model_karma)

# Create the formula string explicitly 
formula_text <- paste0(
  "Userkarma = ", round(coeffs[1], 4), 
  " + (", round(coeffs[2], 4), " * Accountage) ",
  " + (", round(coeffs[3], 4), " * Replydelay) ",
  " + (", round(coeffs[4], 4), " * botprob) ",
  " + (", round(coeffs[5], 4), " * Avgwordlength) ",
  " + (", round(coeffs[6], 4), " * Sentimentscore)"
)

library(broom)
library(dplyr)

# Put all models into a list
model_list <- list(
  model_karma1 = model_karma1,
  model_karma2 = model_karma2,
  model_karma3 = model_karma3,
  model_karma4 = model_karma4,
  model_karma5 = model_karma5
)

# Extract model-level statistics
model_summary <- lapply(model_list, glance) %>%
  bind_rows(.id = "Model") %>%
  select(Model,
         statistic,      # F-statistic
         p.value,        # p-value for overall model
         r.squared,      # Multiple R-squared
         adj.r.squared)  # Adjusted R-squared

library(dplyr)
library(gt)

model_summary %>%
  rename(
    `F Statistic` = statistic,
    `P Value` = p.value,
    `R Squared` = r.squared,
    `Adjusted R Squared` = adj.r.squared
  ) %>%
  gt() %>%
  tab_header(
    title = md("**Model Comparison: F‑statistic, p‑value, R², Adj R²**")
  ) %>%
  fmt_number(
    columns = c(`F Statistic`, `P Value`, `R Squared`, `Adjusted R Squared`),
    decimals = 4
  ) %>%
  tab_style(
    style = cell_text(weight = "bold"),
    locations = cells_column_labels(everything())
  ) %>%
  tab_options(
    table.border.top.width = px(2),
    table.border.bottom.width = px(2),
    column_labels.border.bottom.width = px(2),
    data_row.padding = px(6)
  )
Model Comparison: F‑statistic, p‑value, R², Adj R²
Model F Statistic P Value R Squared Adjusted R Squared
model_karma1 0.3926 0.5312 0.0008 −0.0012
model_karma2 0.4336 0.6484 0.0017 −0.0023
model_karma3 0.3381 0.7978 0.0020 −0.0040
model_karma4 0.3730 0.8279 0.0030 −0.0051
model_karma5 0.5771 0.7176 0.0058 −0.0043
formula_df <- data.frame(`Model Equation` = formula_text)

# Format the regression formula table
formula_df %>%
  kable(
    caption = "Table: Finalized Linear Regression Equation",
    booktabs = TRUE,
    align = "l"
  ) %>%
  kable_styling(
    bootstrap_options = c("striped", "hover"),
    full_width = TRUE,
    font_size = 16
  ) %>%
  row_spec(0, bold = TRUE, background = "#F2F2F2") %>%
  column_spec(1, italic = TRUE, color = "#2C3E50")
Table: Finalized Linear Regression Equation
Model.Equation
Userkarma = 28387.8319 + (-0.4264 * Accountage) + (0.3915 * Replydelay) + (11242.3468 * botprob) + (-847.1188 * Avgwordlength) + (-1293.2708 * Sentimentscore)

A linear regression was implemented to determine the relationship between User karma and the covariates. We decided to build 5 models adding one variable at a time to compare the model’s fits at each step to aid in the prediction analysis. The models did not produce the best results, but the best fitting model was model_karma5 which we renamed to model_karma. Although the model is not significant and did not have the best fit along with the other 4 models it did explain that a little of the variation in user karma is explained by several of the covariates.

Overall Results

library(tidyverse)
library(broom)
library(gt)

# 1. Coefficient-level results
model_results <- tidy(model_karma, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~ round(.x, 4)))

# 2. Model-level statistics (F-stat, p-value, R², Adj R²)
model_stats <- glance(model_karma) %>%
  transmute(
    `F Statistic` = round(statistic, 4),
    `Model p-value` = round(p.value, 4),
    `R Squared` = round(r.squared, 4),
    `Adjusted R Squared` = round(adj.r.squared, 4)
  )

# 3. Build the GT table
model_results %>%
  gt() %>%
  tab_header(
    title = "Multiple Linear Regression Results",
    subtitle = "Predictors of Reddit User Karma"
  ) %>%
  cols_label(
    term = "Predictor",
    estimate = "Estimate (Beta)",
    std.error = "Std. Error",
    statistic = "t-statistic",
    p.value = "p-value",
    conf.low = "Lower 95% CI",
    conf.high = "Upper 95% CI"
  ) %>%
  # Add model summary as a second table section
  tab_source_note(
    source_note = md(
      paste0(
        "**Model Summary:**  ",
        "F = ", model_stats$`F Statistic`, " | ",
        "p = ", model_stats$`Model p-value`, " | ",
        "R² = ", model_stats$`R Squared`, " | ",
        "Adj R² = ", model_stats$`Adjusted R Squared`
      )
    )
  ) %>%
  tab_options(
    heading.background.color = "#F2F2F2",
    column_labels.font.weight = "bold",
    table.width = pct(100)
  )
Multiple Linear Regression Results
Predictors of Reddit User Karma
Predictor Estimate (Beta) Std. Error t-statistic p-value Lower 95% CI Upper 95% CI
(Intercept) 28387.8319 5608.4615 5.0616 0.0000 17368.4517 39407.2122
Accountage -0.4264 0.7371 -0.5785 0.5632 -1.8746 1.0218
Replydelay 0.3915 0.7080 0.5529 0.5806 -0.9995 1.7824
botprob 11242.3468 14408.7229 0.7802 0.4356 -17067.5913 39552.2848
Avgwordlength -847.1188 1089.3879 -0.7776 0.4372 -2987.5239 1293.2863
Sentimentscore -1293.2708 1095.9882 -1.1800 0.2386 -3446.6440 860.1025
Model Summary: F = 0.5771 | p = 0.7176 | R² = 0.0058 | Adj R² = -0.0043

Our linear regression model results shows that there are no statistically significant differences between user karma and the covariates. Within context of AI bots running across the Reddit site and creating accounts it showcases how they are able to replicate the actions of human account holders to make them more camouflaged within the site.

Conclusion

With the evidence gathered from the analysis it seems like there is not a significant difference in the relationships between User karma and the covariates. Since there are AI bot accounts alongside Human accounts our regression model did not yield a significant outcome, which would support evidence that the data reveals Humans and Bot accounts are remarkably related to each other. Looking back on the evidence in our exploratory data analysis we can see that User karma and Sentiment score are highly related to each other which insinuates that, as AI continues to grow in intelligence it already replicates human sentiment and replicates human online behavior to increase user engagement. A drive for future studies after this analysis should include exploring the emotional intelligence of AI to truly discover the fine line between distinguishing between humans and bot interactions online.

Contact Information

Thank You!

  • dmart274@students.kennesaw.edu

  • dperez39@students.kennesaw.edu