Karma and Camouflage: A Statistical Analysis of the Dead Internet Theory on Reddit
Team DND
Group Members: Desyne Martinez & Daniel Perez Ulloa
Professor: Dr. Amir Karami
Course: Programming in R DS 7130
Project Overview
Dataset:The “Dead Internet” Theory: Reddit Bot vs. Human
Concept: Given the increasing presence of automated accounts online, this study examines how account age and other behavioral indicators including sentiment score, reply delay, average word length,and bot probability to predict user karma on Reddit posts, providing insight into how bot‑like features relate to engagement.
Research Question: To what extent do account age, sentiment score, reply delay, average word length, and bot probability predict variation in user karma on Reddit posts?
Introduction
Generating the Dataset
#We started by reading in all the neccessary libraries we would need for our project to ensure we have every useful tool to proceed with an efficient analysis.library(psych)library(readxl)library(ggplot2)library(MASS)library(rstatix)#library(GGally)library(tidyverse)library(tidymodels)library(moderndive)library("scatterplot3d")library(rgl)library(lm.beta)library(olsrr)library(car)library(report)library(ppcor)library(statar)library(rsample)library(car)library(ROSE)library(caret)library(dplyr)library(pROC)library(rsample)library(predtools)library(caret)library(PredictABEL)library(aod)library(glmtoolbox)library(lmtest)library(DescTools)library(readr)library(gt)library(DT)library(plotly)library(knitr)library(kableExtra)library(broom)setwd("C:/Users/Owner/Downloads/")#setwd("/home/dplant227/Documents/Rwork")# the setwd statement here helps us set up our work directory so it is essentially an apartment for the files we want pull out and work with.# for me I put the datasets in my downloads on my computer hence thats why my work directory is in the downloads space. So when setting up your work directory make sure you are # setting it up with the file path where your datasets are staying at.# "C:\Users\Owner\Downloads\reddit_dead_internet_analysis_2026.xlsx"reddit<-read_xlsx("reddit_dead_internet_analysis_2026.xlsx")#reddit <- read_csv("reddit_dead_internet_analysis_2026.csv")#here we are reading in the dataset using the read.csv statement. Since we already created our work directory all we need to do is read in the dataset using its exact file path name # as it is stated in the place where you saved it.# Inspect structure of the datasetreddit_glimpse <-data.frame(Variable =names(reddit),Type =sapply(reddit, class),Preview =sapply(reddit, function(x) paste(head(x, 3), collapse =", ")))reddit_glimpse %>%kable(caption ="Table: Data Structure and Variable Overview (glimpse)",booktabs =TRUE,align ="lcr",col.names =c("Variable Name", "Data Type", "Sample Values") ) %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed"),full_width =TRUE,font_size =14 ) %>%row_spec(0, bold =TRUE, background ="#F2F2F2") %>%column_spec(1, bold =TRUE, color ="#2C3E50") %>%column_spec(2, italic =TRUE, color ="#7F8C8D")
Table: Data Structure and Variable Overview (glimpse)
Variable Name
Data Type
Sample Values
comment_id
comment_id
character
t1_3148938807, t1_2516748923, t1_833922318
subreddit
subreddit
character
funny, gaming, politics
account_age_days
account_age_days
numeric
2264, 1654, 2442
user_karma
user_karma
numeric
34192, 2812, 37109
reply_delay_seconds
reply_delay_seconds
numeric
1854, 1033, 2703
sentiment_score
sentiment_score
numeric
-0.6, 0.15, 0.35
avg_word_length
avg_word_length
numeric
4.62, 5.84, 5.73
contains_links
contains_links
logical
FALSE, FALSE, FALSE
is_bot_flag
is_bot_flag
logical
FALSE, FALSE, FALSE
bot_type_label
bot_type_label
character
None (Human), None (Human), None (Human)
bot_probability
bot_probability
numeric
0.093, 0.117, 0.115
Exploring the Data!
Variables of interest
The dataset included 500 observations of Reddit comment metadata with 11 variables including:
comment_id
subreddit
account_age_days
user_karma
reply_delay_seconds
sentiment_score
avg_word_length
contains_links
is_bot_flag
bot_type_label
bot_probability
We wanted to put a focus on the primary factors that would contribute to detecting bot accounts on Reddit while also keeping our model parsimonious, so our primary variables of interest were as follows:
Dependent variable: User Karma
Primary exposure variable: Bot prob
Covariates: Sentiment score, Account age, Avg word length, and Reply delay.
User karma is a continuous variable detailing the karmic engagement points a user would gain with a reddit account. Bot Probability details the percent chance an account is ran by a bot. Sentiment score is a continuous variable that detects the emotional sentiment of a post on a scale of -1 to 1 from negative to positive respectively. Account age represents the longevity of the account in days, Average word length specifies the length of the words the user of the account uses primarily on posts, and reply delay represents the time in seconds it takes for a user to reply to a post on a Reddit account.
To improve the comprehensibility of the variables and make them easier to work with, we renamed the variables by removing the underscores! We also added Two categorical variables derived from Accountage and Sentiment Score respectively. Age category was created from Account age to categorize the time an account was created to distinguish it by “New accounts”, “Established” , and “Veteran” within each threshold. Sentimentcat was created from Sentimentscore to categorize the emotional sentiment by “Negative”, “Neutral”, and “Positive” from a Reddit account when making posts.
# INTERACTIVE TABLE (Requirement: At least one)datatable(head(reddit, 100), options =list(pageLength =5), caption ='Table 1: Interactive view of Reddit Bot Analysis Data')
GROUPED SUMMARIES
Using group summaries to explore the data looking at the New categorical variable age_category to evaluate the highest average user karma points, Sentiment Score, Average word length, bot probability, and reply delay between each age distinction for Reddit accounts. We wanted to explore this to see which Accounts(New Accounts, Established Accounts, and Veteran Accounts) on Reddit would have the highest engagement factors.
#Generate the grouped summary data#This calculates the mean for each key metric across the 3 clustersgrouped_summary <- reddit %>%group_by(age_category) %>%summarise(Observations =n(),`Avg Karma`=mean(Userkarma, na.rm =TRUE),`Avg Account Age`=mean(Accountage, na.rm =TRUE),`Avg Bot Prob`=mean(botprob, na.rm =TRUE),`Avg Word Length`=mean(Avgwordlength, na.rm =TRUE),`Avg Sentiment`=mean(Sentimentscore, na.rm =TRUE) ) %>%mutate(across(where(is.numeric), ~round(.x, 2)))# 2. Format as a professional kable tablegrouped_summary %>%kable(caption ="Table: Behavioral Metric Summaries Grouped by Cluster",booktabs =TRUE,align ="c" ) %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed"),full_width =TRUE,font_size =14 ) %>%row_spec(0, bold =TRUE, background ="#F2F2F2") %>%# Highlighting Cluster 3 as the likely "Bot-like" grouprow_spec(3, background ="#FFF4F4", bold =TRUE) %>%column_spec(1, bold =TRUE)
Table: Behavioral Metric Summaries Grouped by Cluster
age_category
Observations
Avg Karma
Avg Account Age
Avg Bot Prob
Avg Word Length
Avg Sentiment
Established
48
26669.04
156.67
0.16
5.88
-0.10
New Account
101
24505.40
14.02
0.25
6.44
0.10
Veteran
351
25267.64
1623.58
0.13
5.40
0.01
Patchwork of Scatterplots
library(patchwork)library(gridExtra)#EXPLORATORY DATA ANALYSIS (EDA)# Visualizing relationship between Bot Probability and User Karmap1 <-ggplot(reddit, aes(x = botprob, y = Userkarma)) +geom_point(alpha =0.5, color ="darkblue") +geom_smooth(method ="lm", color ="red") +labs(title ="Bot Probability vs. User Karma", x ="Bot Probability", y ="Karma") +theme_minimal()# Visualizing relationship between Account Age and User Karmap2 <-ggplot(reddit, aes(x = Accountage, y = Userkarma)) +geom_point(alpha =0.5, color ="darkgreen") +geom_smooth(method ="lm", color ="orange") +labs(title ="Account Age vs. User Karma", x ="Age (Days)", y ="Karma") +theme_minimal()p3 <-ggplot(reddit, aes(x = Replydelay, y = Userkarma)) +geom_point(alpha =0.5, color ="darkred") +geom_smooth(method ="lm", color ="black") +labs(title ="Reply Delay vs. User Karma",x ="Reply Delay (Seconds)",y ="User Karma") +theme_minimal()p4 <-ggplot(reddit, aes(x = Avgwordlength, y = Userkarma)) +geom_point(alpha =0.5, color ="darkcyan") +geom_smooth(method ="lm", color ="purple") +labs(title ="Average Word Length vs. User Karma",x ="Average Word Length",y ="User Karma") +theme_minimal()p5 <-ggplot(reddit, aes(x = Sentimentscore, y = Userkarma)) +geom_point(alpha =0.5, color ="darkorange") +geom_smooth(method ="lm", color ="blue") +labs(title ="Sentiment Score vs. User Karma",x ="Sentiment Score",y ="User Karma") +theme_minimal()# Combined EDA plot (Requirement: Combined figures)(p1 | p2) / (p3 | p4) / p5
The plots do not show any linear relationships between User karma and the covariates among the 500 Reddit accounts.
We wanted to take a look at User karma by Bot probability filtering each category by the type of account and positive or negative Sentiment. The plots here do not show any linear relationships between User karma and the bot probability, but an interesting thing to point out is that the bot probability goes down as we move from New accounts to Veteran accounts and that the emotional sentiment between Veteran account holders are roughly the same whether they have Negative or positive Engagement towards Reddit posts.
Interactive Figure 1: Reply Delay by Userkarma Scatterplot
# INTERACTIVE FIGURES (Requirement: At least two)# Figure 1: Interactive Scatter# Interactive Scatter: Reply Delay vs Karmaint_plot1 <-ggplot(reddit, aes(x = Replydelay, y = Userkarma)) +geom_point(color ="purple", alpha =0.4) +labs(title ="Interactive: Reply Delay vs Karma")ggplotly(int_plot1)
An interactive scatter plot was created to pinpoint the relationship between User karma and Reply delay. Alongside this purple strip towards the Y axis of the plot is where a majority of the data is clustered. From examining this we can easily deduce that accounts who tend to reply really quickly in roughly under 10 seconds tend to have high user karma points to increase their engagement and foot traffic to their account which also can be an indication of bot activity.
Statistical transformations: bar chart counts
library(ggplot2)library(dplyr)reddit %>%count(age_category) %>%ggplot(aes(x = age_category, y = n, fill = age_category)) +geom_col(width =0.7) +geom_text(aes(label = n), vjust =-0.5, size =5, fontface ="bold") +scale_fill_manual(values =c("Established"="#FDB813", # yellow-orange"New Account"="#E66100", # deep orange"Veteran"="deeppink" )) +labs(x ="Account type",y ="Count",title ="Counts of Categorized ages of reddit Accounts" ) +theme_minimal(base_size =14) +theme(legend.position ="none",plot.title =element_text(face ="bold") )
reddit %>%count(SentimentCat) %>%ggplot(aes(x = SentimentCat, y = n, fill = SentimentCat)) +geom_col(width =0.7) +geom_text(aes(label = n), vjust =-0.5, size =5, fontface ="bold") +scale_fill_manual(values =c("Negative"="#FDB813", # yellow-orange"Neutral"="#E66100", # deep orange"Positive"="deeppink" )) +labs(x ="Sentiment type",y ="Count",title ="Counts of Categorized Sentiment of reddit Accounts" ) +theme_minimal(base_size =14) +theme(legend.position ="none",plot.title =element_text(face ="bold") )
To evaluate how many New, Established, and Veteran accounts are represented in the dataset we created a bar chart to deliver these results which concluded that there are 101 New accounts, 48 Established accounts and 351 Veteran accounts. We also evaluated the sentiment across Reddit accounts and found that 198 accounts were associated with negative engagement posts while 201 accounts were associated with positive engagement posts and the other 101 accounts were neutral.
#explore age category by User Karmalibrary(viridis)reddit %>%ggplot(aes(x = age_category, y = Userkarma, fill = age_category)) +geom_boxplot() +scale_fill_manual(values =c("New Account"="#FDB813", # warm yellow"Established"="#E66100", # deep orange"Veteran"="deeppink" ) ) +labs(x ="Account type",y ="User Karma",title ="Side-by-Side User Karma by Account category" ) +theme_minimal(base_size =14)
#explore Sentiment cat by User karma library(viridis)reddit %>%ggplot(aes(x = SentimentCat, y = Userkarma, fill = SentimentCat)) +geom_boxplot() +scale_fill_manual(values =c("Negative"="#FDB813", # warm yellow"Neutral"="#E66100", # deep orange"Positive"="deeppink" ) ) +labs(x ="Senitment",y ="User Karma",title ="Side-by-Side User Karma by Sentiment Category" ) +theme_minimal(base_size =14)
Although the boxplots do not show any significant difference this is still quite alarming given the context of AI bots running accounts across the social media site Reddit. New accounts seem to bring in the same user karma engagement points as Veteran and Established accounts. The user karma points for both positive and negative Sentiment overlap indicating that negativity and positivity don’t matter as much when it comes to gaining user karma points. If we were to make an inference in context of the AI bots on the site it would seem like they are not much differential from the real human account owners on the site which would make it much more difficult to distinguish between whether an account is run by a bot or a human.
Interactive Figure 2: Side by Side Sentiment score Boxplot
# Figure 2: Interactive Sentiment Boxplotlibrary(tidyverse)library(plotly)int_plot2 <- reddit %>%ggplot(aes(x = SentimentCat, y = Userkarma, fill = SentimentCat)) +geom_boxplot() +scale_fill_manual(values =c("Negative"="#FDB813", # warm yellow"Neutral"="#E66100", # deep orange"Positive"="deeppink" ) ) +labs(x ="Sentiment",y ="User Karma",title ="Side-by-Side User Karma by Sentiment Category" ) +theme_minimal(base_size =14)ggplotly(int_plot2)
Just to examine how closely related sentiment score is based upon human and User karma engagement points, we developed an interactive boxplot to display the statistical metrics across the different categories of sentiment.
Methods
Data Modeling
clustering
# Load clustering package and sample dataset packagelibrary(cluster)library(GDAdata)#Capture the column names into a data framereddit_cols_df <-data.frame(Columns =colnames(reddit))#format table to show nicelyreddit_cols_df %>%kable(caption ="Table: Variable Names in the Final Reddit Dataset",booktabs =TRUE,align ="l" ) %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed"),full_width =FALSE,font_size =14 ) %>%row_spec(0, bold =TRUE, background ="#F2F2F2") %>%column_spec(1, bold =TRUE, color ="#2C3E50")
Table: Variable Names in the Final Reddit Dataset
Columns
Replydelay
Userkarma
Sentimentscore
Accountage
Avgwordlength
botprob
age_category
SentimentCat
# Select columns 1,3,4,5,6: Replydelay, sentimentscore, account age,average word length, and botprobreddit_data <- reddit[, c(1,3,4,5,6)]# Standardize the data so variables with larger scales do not dominatereddit_data_scale <-scale(reddit_data)# Load package for estimating the number of clusterslibrary(NbClust)# Estimate the best number of clusters from 2 to 10 using k-means criterianumber_cluster_estimate <-NbClust( reddit_data_scale,distance ="euclidean",min.nc =2,max.nc =10,method ="kmeans")
*** : The Hubert index is a graphical method of determining the number of clusters.
In the plot of Hubert index, we seek a significant knee that corresponds to a
significant increase of the value of the measure i.e the significant peak in Hubert
index second differences plot.
*** : The D index is a graphical method of determining the number of clusters.
In the plot of D index, we seek a significant knee (the significant peak in Dindex
second differences plot) that corresponds to a significant increase of the value of
the measure.
*******************************************************************
* Among all indices:
* 12 proposed 2 as the best number of clusters
* 2 proposed 3 as the best number of clusters
* 4 proposed 4 as the best number of clusters
* 1 proposed 5 as the best number of clusters
* 1 proposed 8 as the best number of clusters
* 2 proposed 9 as the best number of clusters
* 2 proposed 10 as the best number of clusters
***** Conclusion *****
* According to the majority rule, the best number of clusters is 2
*******************************************************************
# Show the voting results for the best number of clustersbest_nc_data <-as.data.frame(t(number_cluster_estimate$Best.nc)) %>%rename(Number_of_Clusters =1, Value_of_Statistic =2) %>%mutate(Criterion =rownames(.)) %>% dplyr::select(Criterion, Number_of_Clusters, Value_of_Statistic)best_nc_data %>%gt() %>%tab_header(title ="Optimal Cluster Estimation",subtitle ="Results across multiple statistical indices" ) %>%cols_label(Criterion ="Statistical Index",Number_of_Clusters ="Proposed K",Value_of_Statistic ="Index Value" ) %>%fmt_number(columns = Value_of_Statistic, decimals =4) %>%tab_options(table.width =pct(80))
Optimal Cluster Estimation
Results across multiple statistical indices
Statistical Index
Proposed K
Index Value
KL
2
12.4915
CH
2
338.4806
Hartigan
4
73.1496
CCC
10
5.6324
Scott
3
388.6374
Marriot
9
831,706,347,721.2092
TrCovW
4
37,456.7745
TraceW
4
160.8872
Friedman
9
4.5622
Rubin
4
−0.2607
Cindex
5
0.3678
DB
2
1.2027
Silhouette
2
0.3571
Duda
2
0.9291
PseudoT2
2
23.6502
Beale
2
0.2379
Ratkowsky
2
0.3992
Ball
3
312.4974
PtBiserial
2
0.6090
Frey
2
1.5613
McClain
2
0.6102
Dunn
2
0.0908
Hubert
0
0.0000
SDindex
8
1.4520
Dindex
0
0.0000
SDbw
10
0.2971
# Set seed for reproducibilityset.seed(123)# Run PAM clustering with 5 clusters# Note: the slides call this k-means, but this function is PAMkmeans_reddit_data_scale_cluster <-pam(reddit_data_scale, k =3)# Show medoids for the clustersmedoids_table <-as.data.frame(kmeans_reddit_data_scale_cluster$medoids) %>%mutate(Cluster =paste("Cluster", row_number())) %>% dplyr::select(Cluster, everything())medoids_table %>%gt() %>%tab_header(title ="Cluster Medoids Analysis",subtitle ="Representative values for each Reddit account behavior group" ) %>%fmt_number(columns =-Cluster, decimals =2) %>%cols_label(Replydelay ="Reply Delay (s)",Sentimentscore ="Sentiment Score",Accountage ="Account Age (d)",Avgwordlength ="Avg Word Length",botprob ="Bot Probability" ) %>%tab_options(column_labels.font.weight ="bold")
Cluster Medoids Analysis
Representative values for each Reddit account behavior group
Cluster
Reply Delay (s)
Sentiment Score
Account Age (d)
Avg Word Length
Bot Probability
Cluster 1
1.38
−0.23
0.69
−0.60
−0.72
Cluster 2
0.04
−0.18
0.05
−0.42
−0.68
Cluster 3
−0.87
−0.08
−0.49
0.99
1.00
# Show the cluster assignment for each rowcluster_assignments <-data.frame(Observation =1:length(kmeans_reddit_data_scale_cluster$clustering),Assigned_Cluster = kmeans_reddit_data_scale_cluster$clustering)cluster_assignments %>%kable(caption ="Table: Individual Cluster Assignments for All Observations",booktabs =TRUE,align ="c" ) %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed"),full_width =FALSE,font_size =14 ) %>%scroll_box(width ="100%", height ="400px") %>%row_spec(0, bold =TRUE, background ="#F2F2F2") %>%column_spec(2, bold =TRUE, color =ifelse(cluster_assignments$Assigned_Cluster ==3, "#D9534F", "black"))
Table: Individual Cluster Assignments for All Observations
Observation
Assigned_Cluster
1
1
2
2
3
1
4
1
5
2
6
1
7
2
8
2
9
1
10
3
11
1
12
3
13
3
14
1
15
2
16
2
17
3
18
1
19
1
20
3
21
2
22
3
23
1
24
2
25
3
26
2
27
2
28
3
29
2
30
2
31
3
32
2
33
2
34
3
35
3
36
3
37
3
38
3
39
1
40
3
41
2
42
1
43
1
44
3
45
1
46
3
47
1
48
2
49
2
50
2
51
2
52
3
53
2
54
3
55
3
56
1
57
3
58
3
59
2
60
1
61
3
62
2
63
1
64
2
65
3
66
3
67
1
68
1
69
1
70
3
71
3
72
2
73
2
74
1
75
1
76
3
77
2
78
3
79
2
80
2
81
3
82
3
83
1
84
3
85
3
86
2
87
3
88
3
89
2
90
3
91
1
92
1
93
3
94
2
95
3
96
3
97
2
98
3
99
2
100
1
101
2
102
1
103
3
104
3
105
2
106
3
107
1
108
1
109
2
110
2
111
3
112
1
113
3
114
2
115
1
116
3
117
1
118
2
119
1
120
3
121
3
122
2
123
3
124
1
125
3
126
3
127
2
128
3
129
3
130
3
131
2
132
3
133
3
134
1
135
2
136
3
137
3
138
2
139
2
140
2
141
2
142
3
143
3
144
1
145
2
146
2
147
3
148
3
149
3
150
3
151
2
152
2
153
1
154
3
155
1
156
2
157
2
158
1
159
2
160
3
161
3
162
3
163
1
164
2
165
3
166
2
167
3
168
1
169
3
170
2
171
1
172
3
173
3
174
3
175
1
176
3
177
1
178
3
179
3
180
2
181
2
182
2
183
3
184
1
185
3
186
3
187
1
188
2
189
2
190
1
191
2
192
2
193
1
194
1
195
1
196
3
197
3
198
3
199
2
200
2
201
3
202
3
203
1
204
1
205
1
206
1
207
1
208
3
209
2
210
1
211
3
212
1
213
1
214
2
215
1
216
3
217
3
218
3
219
2
220
2
221
2
222
1
223
1
224
1
225
2
226
3
227
2
228
3
229
3
230
3
231
3
232
2
233
2
234
2
235
2
236
3
237
1
238
1
239
3
240
2
241
2
242
3
243
1
244
2
245
3
246
3
247
1
248
2
249
3
250
1
251
2
252
2
253
1
254
2
255
1
256
1
257
1
258
1
259
2
260
2
261
2
262
3
263
2
264
3
265
1
266
3
267
3
268
2
269
3
270
1
271
2
272
1
273
1
274
3
275
1
276
3
277
1
278
1
279
3
280
1
281
3
282
1
283
2
284
3
285
3
286
3
287
2
288
2
289
1
290
2
291
3
292
3
293
3
294
2
295
2
296
2
297
3
298
3
299
3
300
3
301
2
302
1
303
2
304
3
305
1
306
2
307
2
308
3
309
1
310
3
311
3
312
2
313
2
314
2
315
2
316
2
317
1
318
3
319
1
320
2
321
3
322
3
323
3
324
1
325
3
326
2
327
1
328
2
329
1
330
2
331
1
332
1
333
1
334
2
335
2
336
2
337
3
338
2
339
2
340
1
341
1
342
1
343
1
344
1
345
2
346
1
347
3
348
1
349
1
350
3
351
3
352
3
353
3
354
2
355
3
356
3
357
3
358
2
359
3
360
1
361
2
362
3
363
3
364
3
365
2
366
3
367
2
368
1
369
1
370
1
371
1
372
1
373
1
374
3
375
1
376
2
377
2
378
1
379
2
380
3
381
2
382
3
383
1
384
2
385
1
386
2
387
2
388
3
389
2
390
1
391
3
392
3
393
3
394
3
395
2
396
3
397
3
398
1
399
2
400
3
401
2
402
3
403
3
404
2
405
2
406
2
407
2
408
2
409
3
410
2
411
1
412
3
413
1
414
3
415
1
416
1
417
2
418
1
419
3
420
3
421
1
422
2
423
1
424
1
425
3
426
3
427
3
428
1
429
3
430
2
431
2
432
1
433
2
434
2
435
3
436
2
437
3
438
3
439
3
440
3
441
1
442
2
443
2
444
2
445
3
446
3
447
2
448
3
449
3
450
3
451
3
452
3
453
3
454
1
455
2
456
1
457
3
458
2
459
2
460
3
461
2
462
3
463
2
464
3
465
1
466
3
467
2
468
3
469
2
470
1
471
3
472
2
473
3
474
1
475
3
476
1
477
2
478
1
479
3
480
2
481
1
482
3
483
3
484
3
485
2
486
3
487
3
488
2
489
1
490
3
491
1
492
3
493
3
494
3
495
3
496
3
497
2
498
1
499
1
500
1
# Plot the clustering result in two reduced dimensionsplot(kmeans_reddit_data_scale_cluster)
# Add the assigned cluster to the original food datareddit_data_cluster <- reddit_data %>%mutate(cluster = kmeans_reddit_data_scale_cluster$clustering)# Show the dataset with assigned clustersreddit_data_cluster %>%head(10) %>%kable(caption ="Table: Preview of Reddit Data with Cluster Assignments (First 10 Rows)",booktabs =TRUE,align ="c" ) %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed", "responsive"),full_width =TRUE,font_size =14 ) %>%row_spec(0, bold =TRUE, background ="#F2F2F2") # Highlight header
Table: Preview of Reddit Data with Cluster Assignments (First 10 Rows)
Replydelay
Sentimentscore
Accountage
Avgwordlength
botprob
cluster
1854
-0.60
2264
4.62
0.093
1
1033
0.15
1654
5.84
0.117
2
2703
0.35
2442
5.73
0.115
1
2811
-0.74
168
4.58
0.094
1
1721
-0.96
801
5.78
0.116
2
3237
0.93
1333
4.08
0.082
1
1084
0.98
2582
5.33
0.107
2
622
-0.42
380
4.72
0.096
2
3486
0.44
2916
4.56
0.091
1
2
-0.15
2797
6.05
0.254
3
# Compute the average of each variable by clusterreddit_cluster_summary <- reddit_data_cluster %>%group_by(cluster) %>%summarise(across(everything(), ~mean(.x, na.rm =TRUE)))# Show cluster summaries to comparereddit_cluster_summary %>%gt() %>%tab_header(title ="Clustering Analysis Summary",subtitle ="Average values for variables across clusters" ) %>%cols_label(cluster ="Cluster ID",Replydelay ="Mean Reply Delay (sec)",Sentimentscore ="Mean Sentiment Score",Accountage ="Mean Account Age (days)",Avgwordlength ="Mean Avg Word Length",botprob ="Mean Bot Probability" ) %>%fmt_number(columns =everything(),decimals =2 ) %>%tab_style(style =cell_text(weight ="bold"),locations =cells_column_labels() )
Clustering Analysis Summary
Average values for variables across clusters
Cluster ID
Mean Reply Delay (sec)
Mean Sentiment Score
Mean Account Age (days)
Mean Avg Word Length
Mean Bot Probability
1.00
2,640.04
−0.04
1,745.56
5.02
0.10
2.00
909.08
−0.02
1,295.22
5.15
0.11
3.00
5.59
0.08
634.94
6.50
0.23
# Calculate and table the number of observations in each clusterreddit_data_cluster %>%count(cluster) %>%gt() %>%tab_header(title ="Cluster Distribution",subtitle ="Number of observations assigned to each cluster" ) %>%cols_label(cluster ="Cluster ID",n ="Account Count" ) %>%tab_options(table.width =pct(50) )
Cluster Distribution
Number of observations assigned to each cluster
Cluster ID
Account Count
1
139
2
162
3
199
Reducing the number of clusters from k = 5 to k = 3 improved the clustering quality slightly, increasing the average silhouette width from 0.21 to 0.23 and producing more balanced cluster sizes. However, the silhouette width remains low overall, indicating weak separation between clusters.
cluster_3_data <- reddit_data_cluster[reddit_data_cluster$cluster ==3, ] %>%head(10)# Generate the formatted tablecluster_3_data %>%kable(caption ="Table: Detailed Preview of Cluster 3 Behavioral Data (Top 10 Rows)",booktabs =TRUE,align ="c",digits =3 ) %>%kable_styling(bootstrap_options =c("striped", "hover", "condensed"),full_width =TRUE,font_size =14 ) %>%row_spec(0, bold =TRUE, background ="#F2F2F2") %>%# Highlighting the 'cluster' column in a distinct color to emphasize the subsetcolumn_spec(which(names(cluster_3_data) =="cluster"), bold =TRUE, color ="white", background ="#D9534F")
Table: Detailed Preview of Cluster 3 Behavioral Data (Top 10 Rows)
Replydelay
Sentimentscore
Accountage
Avgwordlength
botprob
cluster
2
-0.15
2797
6.05
0.254
3
2
0.98
2669
7.11
0.276
3
8
-0.88
1
5.70
0.358
3
8
-0.05
453
7.32
0.192
3
1
0.71
2631
6.43
0.329
3
9
-0.98
5
7.07
0.248
3
5
-0.46
533
6.78
0.203
3
8
-0.98
13
6.63
0.206
3
2
-0.47
1
6.60
0.465
3
3
0.65
584
5.72
0.215
3
Cluster 3 represents the bot‑like behavioral mode in the dataset. Its extremely fast reply times, younger account age, longer word length, and higher bot probability distinguish it from the human‑like clusters. Because these features are strongly linked to automated posting behavior, Cluster 3 provides the clearest insight into how AI‑generated or bot‑like activity influences user karma. This makes Cluster 3 the most meaningful cluster for understanding engagement patterns in the context of AI bots.
Linear Regression Model
########################################RUN MULTIPLE LINEAR REGRESSION# We use lm() because UserKarma is a continuous numeric variablemodel_karma1 <-lm(Userkarma ~ Accountage, data = reddit)model_karma2 <-lm(Userkarma ~ Accountage + Replydelay, data = reddit)model_karma3 <-lm(Userkarma ~ Accountage + Replydelay + botprob, data = reddit)model_karma4 <-lm(Userkarma ~ Accountage + Replydelay + botprob + Avgwordlength, data = reddit)model_karma5 <-lm(Userkarma ~ Accountage + Replydelay + botprob + Avgwordlength + Sentimentscore, data = reddit)model_karma <-lm(Userkarma ~ Accountage + Replydelay + botprob + Avgwordlength + Sentimentscore, data = reddit)# Extract Coefficients coeffs <-coef(model_karma)# Create the formula string explicitly formula_text <-paste0("Userkarma = ", round(coeffs[1], 4), " + (", round(coeffs[2], 4), " * Accountage) "," + (", round(coeffs[3], 4), " * Replydelay) "," + (", round(coeffs[4], 4), " * botprob) "," + (", round(coeffs[5], 4), " * Avgwordlength) "," + (", round(coeffs[6], 4), " * Sentimentscore)")library(broom)library(dplyr)# Put all models into a listmodel_list <-list(model_karma1 = model_karma1,model_karma2 = model_karma2,model_karma3 = model_karma3,model_karma4 = model_karma4,model_karma5 = model_karma5)# Extract model-level statisticsmodel_summary <-lapply(model_list, glance) %>%bind_rows(.id ="Model") %>%select(Model, statistic, # F-statistic p.value, # p-value for overall model r.squared, # Multiple R-squared adj.r.squared) # Adjusted R-squaredlibrary(dplyr)library(gt)model_summary %>%rename(`F Statistic`= statistic,`P Value`= p.value,`R Squared`= r.squared,`Adjusted R Squared`= adj.r.squared ) %>%gt() %>%tab_header(title =md("**Model Comparison: F‑statistic, p‑value, R², Adj R²**") ) %>%fmt_number(columns =c(`F Statistic`, `P Value`, `R Squared`, `Adjusted R Squared`),decimals =4 ) %>%tab_style(style =cell_text(weight ="bold"),locations =cells_column_labels(everything()) ) %>%tab_options(table.border.top.width =px(2),table.border.bottom.width =px(2),column_labels.border.bottom.width =px(2),data_row.padding =px(6) )
Model Comparison: F‑statistic, p‑value, R², Adj R²
Model
F Statistic
P Value
R Squared
Adjusted R Squared
model_karma1
0.3926
0.5312
0.0008
−0.0012
model_karma2
0.4336
0.6484
0.0017
−0.0023
model_karma3
0.3381
0.7978
0.0020
−0.0040
model_karma4
0.3730
0.8279
0.0030
−0.0051
model_karma5
0.5771
0.7176
0.0058
−0.0043
formula_df <-data.frame(`Model Equation`= formula_text)# Format the regression formula tableformula_df %>%kable(caption ="Table: Finalized Linear Regression Equation",booktabs =TRUE,align ="l" ) %>%kable_styling(bootstrap_options =c("striped", "hover"),full_width =TRUE,font_size =16 ) %>%row_spec(0, bold =TRUE, background ="#F2F2F2") %>%column_spec(1, italic =TRUE, color ="#2C3E50")
A linear regression was implemented to determine the relationship between User karma and the covariates. We decided to build 5 models adding one variable at a time to compare the model’s fits at each step to aid in the prediction analysis. The models did not produce the best results, but the best fitting model was model_karma5 which we renamed to model_karma. Although the model is not significant and did not have the best fit along with the other 4 models it did explain that a little of the variation in user karma is explained by several of the covariates.
Model Summary: F = 0.5771 | p = 0.7176 | R² = 0.0058 | Adj R² = -0.0043
Our linear regression model results shows that there are no statistically significant differences between user karma and the covariates. Within context of AI bots running across the Reddit site and creating accounts it showcases how they are able to replicate the actions of human account holders to make them more camouflaged within the site.
Conclusion
With the evidence gathered from the analysis it seems like there is not a significant difference in the relationships between User karma and the covariates. Since there are AI bot accounts alongside Human accounts our regression model did not yield a significant outcome, which would support evidence that the data reveals Humans and Bot accounts are remarkably related to each other. Looking back on the evidence in our exploratory data analysis we can see that User karma and Sentiment score are highly related to each other which insinuates that, as AI continues to grow in intelligence it already replicates human sentiment and replicates human online behavior to increase user engagement. A drive for future studies after this analysis should include exploring the emotional intelligence of AI to truly discover the fine line between distinguishing between humans and bot interactions online.