We began by fine-tuning a few of the standard transformer models (4_old_model_evaluation.ipynb) to get an idea of how they were performing. The best f1-scores for each model after hyperparameter tuning are shown below.
| Model | F1-Score |
| --- | --- |
| bert-base-cased | 0.864 |
| bert-base-uncased | 0.862 |
| distilbert-cased | 0.857 |
| distilbert-uncased | - |
| roberta-base | 0.865 |
Despite roberta not achieving results massively better than bert-base-cased, this model saw the least hyperparameter tuning (2 iterations) and, even at that, both iterations performed better than the best bert model (10 iterations). We decided to move forward with roberta for further testing.
Test Conditions
We had a few theories we wanted to test:
Spoiler alert, investigating these theories brought about some new theories and so here is a list of other things I ended up investigating:
Each of these contains either:
- report
- article link
- quote (report)
- announcement
so this is the relabelled gpt data, just no longer being referred to article link or not - which is maybe a little annoying for looking at accuracy at the group level?
Tests
Cross Validation of og roberta model - with og data
We have pretty much settled on roberta being our best performing model. Previous experiments (3_model_iteration) have shown that, with very little hyperparameter tuning, it is the best performing model (>86% accuracy and f1). I’m going to do a quick cross validation test to make sure that these results hold for varied testing / training sets.
[402/402 06:37, Epoch 2/2]

| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| 50 | 0.554600 | 0.340840 | 0.868323 | 0.867751 | 0.869972 | 0.868323 |
| 100 | 0.406000 | 0.324208 | 0.874534 | 0.872914 | 0.884007 | 0.874534 |
| 150 | 0.368400 | 0.311900 | 0.874534 | 0.872678 | 0.886015 | 0.874534 |
| 200 | 0.353700 | 0.312466 | 0.867081 | 0.867233 | 0.869958 | 0.867081 |
| 250 | 0.326900 | 0.304515 | 0.868323 | 0.868323 | 0.868323 | 0.868323 |
| 300 | 0.302700 | 0.288748 | 0.878261 | 0.877374 | 0.882388 | 0.878261 |
| 350 | 0.238800 | 0.289207 | 0.886957 | 0.886486 | 0.888694 | 0.886957 |
| 400 | 0.290100 | 0.285096 | 0.886957 | 0.886676 | 0.887616 | 0.886957 |

[404/404 06:37, Epoch 2/2]

| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| 50 | 0.548500 | 0.408954 | 0.829602 | 0.829485 | 0.829577 | 0.829602 |
| 100 | 0.347600 | 0.393902 | 0.840796 | 0.838168 | 0.854970 | 0.840796 |
| 150 | 0.380600 | 0.357313 | 0.854478 | 0.853830 | 0.856862 | 0.854478 |
| 200 | 0.342400 | 0.350081 | 0.839552 | 0.837915 | 0.846956 | 0.839552 |
| 250 | 0.287100 | 0.417664 | 0.838308 | 0.835956 | 0.850190 | 0.838308 |
| 300 | 0.286800 | 0.364391 | 0.858209 | 0.858190 | 0.858183 | 0.858209 |
| 350 | 0.242500 | 0.374562 | 0.847015 | 0.846036 | 0.850979 | 0.847015 |
| 400 | 0.297800 | 0.358485 | 0.842040 | 0.840819 | 0.847116 | 0.842040 |

[404/404 06:37, Epoch 2/2]

| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| 50 | 0.526500 | 0.454144 | 0.819652 | 0.813087 | 0.841876 | 0.819652 |
| 100 | 0.397800 | 0.353869 | 0.855721 | 0.854914 | 0.856552 | 0.855721 |
| 150 | 0.338800 | 0.348409 | 0.861940 | 0.862176 | 0.863014 | 0.861940 |
| 200 | 0.382600 | 0.304005 | 0.876866 | 0.876087 | 0.878353 | 0.876866 |
| 250 | 0.264500 | 0.321688 | 0.876866 | 0.875549 | 0.881046 | 0.876866 |
| 300 | 0.299600 | 0.295096 | 0.878109 | 0.877539 | 0.878851 | 0.878109 |
| 350 | 0.273100 | 0.291737 | 0.873134 | 0.873022 | 0.873008 | 0.873134 |
| 400 | 0.278700 | 0.291532 | 0.870647 | 0.870258 | 0.870759 | 0.870647 |

[404/404 06:39, Epoch 2/2]

| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| 50 | 0.549500 | 0.463595 | 0.809701 | 0.803958 | 0.821194 | 0.809701 |
| 100 | 0.397300 | 0.380256 | 0.840796 | 0.840902 | 0.841051 | 0.840796 |
| 150 | 0.391100 | 0.390431 | 0.825871 | 0.826573 | 0.830229 | 0.825871 |
| 200 | 0.329000 | 0.419239 | 0.838308 | 0.838251 | 0.838205 | 0.838308 |
| 250 | 0.346200 | 0.343753 | 0.858209 | 0.856757 | 0.860209 | 0.858209 |
| 300 | 0.282600 | 0.355700 | 0.864428 | 0.864451 | 0.864477 | 0.864428 |
| 350 | 0.256800 | 0.355219 | 0.865672 | 0.865624 | 0.865588 | 0.865672 |
| 400 | 0.233400 | 0.368634 | 0.864428 | 0.864404 | 0.864383 | 0.864428 |

[404/404 06:40, Epoch 2/2]

| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| 50 | 0.594100 | 0.380333 | 0.830846 | 0.831327 | 0.833291 | 0.830846 |
| 100 | 0.375200 | 0.426684 | 0.818408 | 0.819108 | 0.827188 | 0.818408 |
| 150 | 0.382000 | 0.360243 | 0.843284 | 0.843422 | 0.843655 | 0.843284 |
| 200 | 0.344800 | 0.332534 | 0.861940 | 0.860143 | 0.866251 | 0.861940 |
| 250 | 0.270400 | 0.353116 | 0.854478 | 0.854735 | 0.855442 | 0.854478 |
| 300 | 0.260300 | 0.360021 | 0.853234 | 0.853572 | 0.854769 | 0.853234 |
| 350 | 0.289400 | 0.357653 | 0.848259 | 0.848714 | 0.850946 | 0.848259 |
| 400 | 0.253600 | 0.338669 | 0.863184 | 0.862265 | 0.864172 | 0.863184 |
There is some variation in model result metrics across runs which would suggest that the training sample used has some effect on model accuracy. There is also a little bit of variation in the stability of the loss curves across runs. Interestingly, the most promising loss curves coincide with the two runs with the highest f1-score.
Another thing that I should investigate is what differs across training sets. For now, however, I am prioritising training with the relabelled data because, if this improves the model, we will proceed with that anyway.
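To put a rough number on the run-to-run variation noted above, here is a minimal sketch that summarises the final-step F1 of the five cross-validation runs. The values are copied from the step-400 rows of the logs; using the final step rather than the best step is an assumption made for illustration.

```python
from statistics import mean, stdev

# F1 at step 400 for each of the five cross-validation runs (from the logs above).
final_f1 = [0.886676, 0.840819, 0.870258, 0.864404, 0.862265]

f1_mean = mean(final_f1)
f1_sd = stdev(final_f1)                      # sample standard deviation (n - 1)
f1_range = max(final_f1) - min(final_f1)     # gap between best and worst run

print(f"mean F1 = {f1_mean:.4f}")
print(f"std dev = {f1_sd:.4f}")
print(f"range   = {f1_range:.4f}")
```

A spread of a few F1 points between folds is worth keeping in mind when comparing later experiments whose scores differ by less than that.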
Trial different size datasets (from here on out everything is trained with the relabelled data)
Before training, let’s see what the split of each spam classification is across samples - we would want each classification to be roughly 25%, 50%, 75% and 100% of the available labels
| classification_label | train_25 | train_50 | train_75 | overall_count | train25_prop | train50_prop | train75_prop |
| --- | --- | --- | --- | --- | --- | --- | --- |
| not_spam | 284 | 582 | 851 | 1128 | 0.251773 | 0.515957 | 0.754433 |
| spam | 84 | 161 | 242 | 312 | 0.269231 | 0.516026 | 0.775641 |
| promotion | 66 | 125 | 194 | 259 | 0.254826 | 0.482625 | 0.749035 |
| crypto | 65 | 136 | 200 | 270 | 0.240741 | 0.503704 | 0.740741 |
| slop | 59 | 116 | 186 | 258 | 0.228682 | 0.449612 | 0.720930 |
| relabelled | 51 | 97 | 145 | 206 | 0.247573 | 0.470874 | 0.703883 |
| article_link | 47 | 108 | 164 | 212 | 0.221698 | 0.509434 | 0.773585 |
| seo | 40 | 76 | 111 | 145 | 0.275862 | 0.524138 | 0.765517 |
| announcement | 26 | 50 | 73 | 91 | 0.285714 | 0.549451 | 0.802198 |
| image | 24 | 41 | 65 | 94 | 0.255319 | 0.436170 | 0.691489 |
| event | 19 | 33 | 54 | 73 | 0.260274 | 0.452055 | 0.739726 |
| bad_scraping | 19 | 34 | 62 | 80 | 0.237500 | 0.425000 | 0.775000 |
| report | 15 | 24 | 40 | 53 | 0.283019 | 0.452830 | 0.754717 |
| quote | 8 | 13 | 20 | 28 | 0.285714 | 0.464286 | 0.714286 |
| stock_ticker | 3 | 4 | 5 | 6 | 0.500000 | 0.666667 | 0.833333 |
| hateful | 2 | 7 | 10 | 11 | 0.181818 | 0.636364 | 0.909091 |
| video | 1 | 5 | 10 | 15 | 0.066667 | 0.333333 | 0.666667 |
| aiprompt | 1 | 4 | 5 | 8 | 0.125000 | 0.500000 | 0.625000 |
| low_analytical_value() | 1 | 2 | 4 | 5 | 0.200000 | 0.400000 | 0.800000 |
| petition | 0 | 1 | 2 | 5 | 0.000000 | 0.200000 | 0.400000 |
With the exception of the petition label in the 25% group, we seem to have an OK representation of each classification in each group.
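The per-label proportions in the table above can be computed along these lines. This is only a sketch: the function name `label_proportions` and the toy label lists are hypothetical and are not taken from the notebook.

```python
from collections import Counter

def label_proportions(subset_labels, full_labels):
    """For each label, the share of its full-set count that landed in the subset."""
    subset_counts = Counter(subset_labels)
    full_counts = Counter(full_labels)
    # Counter returns 0 for labels that are missing from the subset.
    return {label: subset_counts[label] / total for label, total in full_counts.items()}

# Toy data (hypothetical): a full training set and a 25% sample of it.
full = ["not_spam"] * 8 + ["spam"] * 4 + ["petition"] * 2
quarter = ["not_spam"] * 2 + ["spam"] * 1

props = label_proportions(quarter, full)
missing = [label for label, share in props.items() if share == 0]

print(props)
print("labels absent from the sample:", missing)
```

Listing the labels with a zero share makes gaps like the petition label in the 25% group easy to spot before training.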
How training size affects results
Granular look at spam vs not spam
| | model | sample | num_epochs | batch_size | learning_rate | weight_decay | eval_loss | eval_accuracy | eval_f1 | eval_precision | eval_recall | eval_runtime | eval_samples_per_second | eval_steps_per_second | epoch | name_on_hub |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | roberta | train_25 | 2 | 16 | 0.00002 | 0.1 | 0.363416 | 0.847291 | 0.847572 | 0.849867 | 0.847291 | 12.3891 | 81.927 | 5.166 | 2.0 | /roberta_train_n752 |
| 1 | roberta | train_50 | 2 | 16 | 0.00002 | 0.1 | 0.314662 | 0.874877 | 0.875072 | 0.876329 | 0.874877 | 12.2712 | 82.714 | 5.215 | 2.0 | /roberta_train_n1503 |
| 2 | roberta | train_75 | 2 | 16 | 0.00002 | 0.1 | 0.321931 | 0.876847 | 0.877039 | 0.878296 | 0.876847 | 12.2485 | 82.868 | 5.225 | 2.0 | /roberta_train_n2254 |
| 3 | roberta | train_100 | 2 | 16 | 0.00002 | 0.1 | 0.288585 | 0.885714 | 0.885831 | 0.886286 | 0.885714 | 12.2070 | 83.149 | 5.243 | 2.0 | /roberta_train_n3006 |
We can see that accuracy, f1, precision and recall all improved as training size increased, although the increase between the 50% and 75% sample sizes is less dramatic. Let’s look at it on a graph:
As we saw in the df, all metrics improve with increasing training size. Precision is consistently the highest scoring evaluation metric.
If we look at this on a spam type (article link, slop…) level, I wonder does it improve across all categories as training size increases. There are some spam types that we have a fairly low sample size for anyway, and these likely won’t see much improvement with increasing sample size as all sample sizes will be low for those groups, but maybe we have reached saturation with other groups.
group classification across training size
For the sake of showing what I’m talking about here, the above is a screenshot of an interactive html chart that should be rendered here. I wouldn’t bother trying to interpret it.. the main conclusions are summarised below:
More data does seem to largely improve the accuracy of identifying spam categories. Some areas we are struggling and may need to get more labelled data or consider other routes to accuracy:
relabelled (max ~77% accuracy)
announcements - more data is not improving metrics with a max accuracy of ~87% at a training size of only 50% of the full training data.
report - this is also encompassed in the relabelled data but again, improves with more data but maxes out ~70%.
quote - also somewhat overlapping with the relabelled data (max ~68%)
bad scraping - this does significantly improve with larger training samples; that said, we have only a max of 33 “bad scraping” posts. Should this be included at all here or is bad scraping too broad a category to give to the model? Is this something we can tackle at source?
hateful, petition, stocks - really poor classification but also really small sample sizes overall. Should hateful be included here at all or is it too broad?
We can see that the model is improving with more data, so it seems like a good argument to source more labelled data.
Let’s have a quick look to compare how the model trained on the new labelled data compares to the old model and see if we can draw any conclusions from that.
Comparing the original and relabelled model
Let’s have a look at some of the more telling metrics on both a spam label and overall model level:
| | old_data_no_spam | old_data_spam | new_data_no_spam | new_data_spam |
| --- | --- | --- | --- | --- |
| precision | 0.885366 | 0.864463 | 0.863732 | 0.905204 |
| recall | 0.815730 | 0.917544 | 0.889849 | 0.882246 |
| f1-score | 0.849123 | 0.890213 | 0.876596 | 0.893578 |
| support | 445.000000 | 570.000000 | 463.000000 | 552.000000 |

| | old_data_weighted avg | new_data_weighted avg |
| --- | --- | --- |
| precision | 0.873627 | 0.886286 |
| recall | 0.872906 | 0.885714 |
| f1-score | 0.872198 | 0.885831 |
| support | 1015.000000 | 1015.000000 |
It looks like the newly labelled data is resulting in a better model overall (every weighted average improves), but it would be interesting to see what effect the change in label on the report and article data has on a spam grouping level.
| | classification_label | precision_not_spam_og_model | precision_spam_og_model | recall_not_spam_og_model | recall_spam_og_model | f1_not_spam_og_model | f1_spam_og_model | precision_not_spam_relabelled | precision_spam_relabelled | recall_not_spam_relabelled | recall_spam_relabelled | f1_not_spam_relabelled | f1_spam_relabelled | 0_relabelled | 1_relabelled | 0_og_model | 1_og_model |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | aiprompt | 1.000000 | NaN | 1.000000 | NaN | 1.000000 | NaN | 1.000000 | NaN | 1.000000 | NaN | 1.000000 | NaN | 0.0 | 3.0 | 0.0 | 3.0 |
| 1 | announcement | 0.333333 | 0.804878 | 0.200000 | 0.891892 | 0.250000 | 0.846154 | 0.375000 | 0.956522 | 0.750000 | 0.814815 | 0.500000 | 0.880000 | 4.0 | 27.0 | 10.0 | 37.0 |
| 2 | article_link | 0.450000 | 0.785714 | 0.300000 | 0.875000 | 0.360000 | 0.827957 | 0.526316 | 1.000000 | 1.000000 | 0.876712 | 0.689655 | 0.934307 | 10.0 | 73.0 | 30.0 | 88.0 |
| 3 | bad_scraping | 0.750000 | 0.837838 | 0.333333 | 0.968750 | 0.461538 | 0.898551 | 0.411765 | 0.875000 | 0.777778 | 0.583333 | 0.538462 | 0.700000 | 9.0 | 24.0 | 9.0 | 32.0 |
| 4 | bot | 1.000000 | NaN | 1.000000 | NaN | 1.000000 | NaN | 1.000000 | NaN | 1.000000 | NaN | 1.000000 | NaN | 0.0 | 1.0 | 0.0 | 1.0 |
| 5 | crypto | 0.333333 | 1.000000 | 1.000000 | 0.976744 | 0.500000 | 0.988235 | 0.250000 | 1.000000 | 1.000000 | 0.964706 | 0.400000 | 0.982036 | 1.0 | 85.0 | 1.0 | 86.0 |
| 6 | event | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.500000 | 1.000000 | 1.000000 | 0.958333 | 0.666667 | 0.978723 | 1.0 | 24.0 | 1.0 | 24.0 |
| 7 | hateful | 0.333333 | 0.000000 | 1.000000 | 0.000000 | 0.500000 | 0.000000 | 0.333333 | 0.000000 | 1.000000 | 0.000000 | 0.500000 | 0.000000 | 2.0 | 4.0 | 2.0 | 4.0 |
| 8 | image | 0.600000 | 0.944444 | 0.750000 | 0.894737 | 0.666667 | 0.918919 | 0.666667 | 1.000000 | 1.000000 | 0.894737 | 0.800000 | 0.944444 | 4.0 | 19.0 | 4.0 | 19.0 |
| 9 | low_analytical_value() | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 | 0.0 | 1.0 | 0.0 |
| 10 | news_report() | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 11 | not_spam | 1.000000 | 0.000000 | 0.889474 | 0.000000 | 0.941504 | 0.000000 | 1.000000 | 0.000000 | 0.915567 | 0.000000 | 0.955923 | 0.000000 | 379.0 | 0.0 | 380.0 | 0.0 |
| 12 | petition | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 1.0 | 0.0 | 1.0 |
| 13 | promotion | 0.200000 | 0.970874 | 0.250000 | 0.961538 | 0.222222 | 0.966184 | 0.222222 | 0.989796 | 0.666667 | 0.932692 | 0.333333 | 0.960396 | 3.0 | 104.0 | 4.0 | 104.0 |
| 14 | quote | 0.625000 | 0.888889 | 0.833333 | 0.727273 | 0.714286 | 0.800000 | 0.545455 | 1.000000 | 1.000000 | 0.500000 | 0.705882 | 0.666667 | 6.0 | 10.0 | 6.0 | 11.0 |
| 15 | report | 0.333333 | 0.781250 | 0.125000 | 0.925926 | 0.181818 | 0.847458 | 0.500000 | 0.818182 | 0.600000 | 0.750000 | 0.545455 | 0.782609 | 5.0 | 12.0 | 8.0 | 27.0 |
| 16 | seo | 0.000000 | 0.964286 | 0.000000 | 0.931034 | 0.000000 | 0.947368 | 0.000000 | 0.980769 | 0.000000 | 0.910714 | 0.000000 | 0.944444 | 1.0 | 56.0 | 2.0 | 58.0 |
| 17 | slop | 0.200000 | 1.000000 | 1.000000 | 0.902439 | 0.333333 | 0.948718 | 0.166667 | 1.000000 | 1.000000 | 0.878049 | 0.285714 | 0.935065 | 2.0 | 82.0 | 2.0 | 82.0 |
| 18 | spam | 0.000000 | 1.000000 | 0.000000 | 0.932692 | 0.000000 | 0.965174 | 0.000000 | 1.000000 | 0.000000 | 0.923077 | 0.000000 | 0.960000 | 0.0 | 104.0 | 0.0 | 104.0 |
| 19 | stock_ticker | 1.000000 | NaN | 1.000000 | NaN | 1.000000 | NaN | 0.000000 | 1.000000 | 0.000000 | 0.500000 | 0.000000 | 0.666667 | 0.0 | 2.0 | 0.0 | 2.0 |
| 20 | stocks | 0.000000 | 1.000000 | 0.000000 | 0.500000 | 0.000000 | 0.666667 | 0.000000 | 1.000000 | 0.000000 | 0.500000 | 0.000000 | 0.666667 | 0.0 | 2.0 | 0.0 | 2.0 |
| 21 | video | 1.000000 | NaN | 1.000000 | NaN | 1.000000 | NaN | 1.000000 | NaN | 1.000000 | NaN | 1.000000 | NaN | 0.0 | 3.0 | 0.0 | 3.0 |
The table above is a bit exhaustive and hard to digest, so we can filter to the specific metrics we want to look at.
| | classification_label | precision_spam_og_model | precision_spam_relabelled | 1_og_model | 1_relabelled |
| --- | --- | --- | --- | --- | --- |
| 0 | aiprompt | NaN | NaN | 3.0 | 3.0 |
| 1 | announcement | 0.804878 | 0.956522 | 37.0 | 27.0 |
| 2 | article_link | 0.785714 | 1.000000 | 88.0 | 73.0 |
| 3 | bad_scraping | 0.837838 | 0.875000 | 32.0 | 24.0 |
| 4 | bot | NaN | NaN | 1.0 | 1.0 |
| 5 | crypto | 1.000000 | 1.000000 | 86.0 | 85.0 |
| 6 | event | 1.000000 | 1.000000 | 24.0 | 24.0 |
| 7 | hateful | 0.000000 | 0.000000 | 4.0 | 4.0 |
| 8 | image | 0.944444 | 1.000000 | 19.0 | 19.0 |
| 9 | low_analytical_value() | 0.000000 | 0.000000 | 0.0 | 0.0 |
| 10 | news_report() | 0.000000 | NaN | NaN | NaN |
| 11 | not_spam | 0.000000 | 0.000000 | 0.0 | 0.0 |
| 12 | petition | 0.000000 | 0.000000 | 1.0 | 1.0 |
| 13 | promotion | 0.970874 | 0.989796 | 104.0 | 104.0 |
| 14 | quote | 0.888889 | 1.000000 | 11.0 | 10.0 |
| 15 | report | 0.781250 | 0.818182 | 27.0 | 12.0 |
| 16 | seo | 0.964286 | 0.980769 | 58.0 | 56.0 |
| 17 | slop | 1.000000 | 1.000000 | 82.0 | 82.0 |
| 18 | spam | 1.000000 | 1.000000 | 104.0 | 104.0 |
| 19 | stock_ticker | NaN | 1.000000 | 2.0 | 2.0 |
| 20 | stocks | 1.000000 | 1.000000 | 2.0 | 2.0 |
| 21 | video | NaN | NaN | 3.0 | 3.0 |
Instances of NaN represent cases where the denominator of the precision formula (true positives + false positives) is 0, i.e. the model never labelled a post in that category as spam, so precision is undefined and is effectively 0.
Video, bot and AI prompt posts are all cases where the model never recognises a post as spam. They are all represented by 3 or fewer posts in the test set (and a similarly low sample size in the training set), so perhaps more training data for these categories would improve precision.
The metrics for spam and not_spam classification categories here represent scenarios where there are no labelled instances of the inverse case (spam or not_spam), and so precision can never be anything but 0 for the inverse, as there are no “True Positive” cases.
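The difference between a NaN precision and a precision of 0 comes down to whether anything was predicted as spam at all. A minimal sketch follows; the helper name `precision` is hypothetical, and the announcement counts (22 true positives, 1 false positive) are inferred from the relabelled recall, support and precision figures rather than read directly from the notebook.

```python
import math

def precision(true_pos, false_pos):
    """TP / (TP + FP); NaN when no post was predicted positive at all."""
    predicted_pos = true_pos + false_pos
    if predicted_pos == 0:
        return math.nan          # undefined: nothing was labelled as spam
    return true_pos / predicted_pos

never_predicted = precision(0, 0)     # e.g. a category the model never flags
all_wrong = precision(0, 3)           # predictions were made, none were correct
announcement = precision(22, 1)       # inferred counts for the relabelled model

print(never_predicted, all_wrong, round(announcement, 6))
```

Treating NaN as 0 is a reporting choice; keeping the two cases separate shows whether a category is being missed entirely or flagged incorrectly.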
The great news here is that, in terms of precision of spam classification, the relabelled data has improved classification across the board!
It is still kind of up for debate whether we would rather have a high precision or a high recall. I think I would be in favour of a slightly higher recall, depending on where the mistakes are being made, but this is a subjective goal.
If we look at the recall, the relabelled data has really reduced this metric.
Specifically looking at where we make mistakes
First let’s look at the case where spam is being classified as not spam by our model
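| classification_label | text |
| --- | --- |
| announcement | 5 |
| article_link | 9 |
| bad_scraping | 10 |
| crypto | 3 |
| event | 1 |
| hateful | 4 |
| image | 2 |
| petition | 1 |
| promotion | 7 |
| quote | 5 |
| relabelled | 2 |
| report | 3 |
| seo | 5 |
| slop | 10 |
| spam | 8 |
| stock_ticker | 1 |
| stocks | 1 |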
I should probably look at this as a percentage of overall category volume, but it looks like slop, bad_scraping, promotion and article link are the categories most likely to be labelled as not spam when they are actually spam.
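As a rough sketch of the percentage view mentioned above, the missed-spam counts can be divided by each category's spam support. The counts come from the table of spam classified as not spam and from the `1_relabelled` column earlier; pairing these two tables is an assumption, and the `relabelled` row is skipped because no matching support value is listed.

```python
# Spam posts the model labelled as not spam, per category (from the table above).
missed_spam = {
    "announcement": 5, "article_link": 9, "bad_scraping": 10, "crypto": 3,
    "event": 1, "hateful": 4, "image": 2, "petition": 1, "promotion": 7,
    "quote": 5, "relabelled": 2, "report": 3, "seo": 5, "slop": 10,
    "spam": 8, "stock_ticker": 1, "stocks": 1,
}

# Spam posts per category in the test set (the 1_relabelled column).
spam_support = {
    "announcement": 27, "article_link": 73, "bad_scraping": 24, "crypto": 85,
    "event": 24, "hateful": 4, "image": 19, "petition": 1, "promotion": 104,
    "quote": 10, "report": 12, "seo": 56, "slop": 82, "spam": 104,
    "stock_ticker": 2, "stocks": 2,
}

# Share of each category's spam posts that were missed.
miss_rate = {
    label: missed_spam[label] / spam_support[label]
    for label in missed_spam
    if label in spam_support
}
ranked = sorted(miss_rate.items(), key=lambda item: item[1], reverse=True)

for label, rate in ranked:
    print(f"{label:<15} {rate:.1%}")
```

On this view the small categories (hateful, petition, quote) and bad_scraping stand out far more than slop or promotion, whose high raw counts mostly reflect their larger volume.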
Next, let’s look at where not spam is being classified as spam. This, to me, is an interesting area because if these are all low quality posts, maybe a high recall isn’t the end of the world.
| classification_label | text |
| --- | --- |
| announcement | 1 |
| bad_scraping | 2 |
| low_analytical_value() | 1 |
| not_spam | 32 |
| promotion | 1 |
| relabelled | 13 |
| report | 2 |
| seo | 1 |
Most of the mistakes are being made on posts tagged only as not spam and on the relabelled article and report data. The relabelled data is data that we know is difficult to classify and so I am not massively worried about it. However, I want to see if there are any notable trends in the other not spam posts classified as spam.
I’ve hidden the table from here as it’s a bit long, but it looks like it is longer posts that are being misclassified as spam; we will see a few metrics pertaining to this in the next section. This leads us nicely into another experiment - does sentence length affect spam classification with our model?
How sentence length affects model classification
Let’s first look at overall word count distribution
We would expect the correctly labelled and mislabelled data to have the same distribution of word count as the original data. I’m going to perform a t-test to confirm that this is true, and I’m transforming the data first because it’s a skewed distribution:
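The comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: a log transform is assumed as the fix for skew, the word counts are made up, and the statistic is a pooled (equal-variance) two-sample t computed with the standard library only.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(sample_a, sample_b):
    """Student's two-sample t statistic (equal variances assumed) and its degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / std_err, df

# Hypothetical word counts: all posts vs. posts the model mislabelled.
all_posts = [12, 25, 40, 33, 18, 60, 150, 22, 45, 30]
mislabelled = [35, 320, 500, 42, 280, 900]

# Log-transform to reduce the right skew before comparing means.
log_all = [math.log(w) for w in all_posts]
log_mis = [math.log(w) for w in mislabelled]

t_stat, df = pooled_t_statistic(log_all, log_mis)
print(f"t = {t_stat:.3f}, df = {df}")
```

A negative statistic here means the first group has the shorter posts on average. If SciPy is available, `scipy.stats.ttest_ind(log_all, log_mis)` computes the same statistic and also returns a p-value.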
OK, it doesn’t look like there is any significant difference in the mean post length between the full dataset and the mislabelled posts, but what if we look specifically at posts that are incorrectly labelled as spam (where we have a high recall)?
TtestResult(statistic=-3.5737775575675075, pvalue=0.0003660816017693208, df=1179.0)
count 53.000000
mean 181.396226
std 219.505505
min 7.000000
25% 32.000000
50% 42.000000
75% 314.000000
max 951.000000
Name: word_count, dtype: float64
So it seems fairly certain that longer posts are more likely to be mistaken as spam - maybe we should be tokenising when we train?
Tokenising the text
I’m going to try tokenising into sentences and using our current model to classify those sentences, getting a score for each doc based on the aggregate.
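A minimal sketch of the sentence-level scoring idea is shown here. The regex splitter, the `toy_classifier` stand-in and the function names are all hypothetical; in practice the per-sentence classifier would be the fine-tuned model, and how a score of exactly 0.5 is handled is an assumption.

```python
import re

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def spam_score(text, classify_sentence):
    """Proportion of sentences in the text that the classifier flags as spam (1)."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    flagged = sum(1 for sentence in sentences if classify_sentence(sentence) == 1)
    return flagged / len(sentences)

def label_post(text, classify_sentence, threshold=0.5):
    """Label the whole post as spam when its score is above the threshold."""
    return "spam" if spam_score(text, classify_sentence) > threshold else "not_spam"

# Stand-in for the real model (hypothetical): flags an obvious sales phrase.
def toy_classifier(sentence):
    return 1 if "buy now" in sentence.lower() else 0

post = "Markets were flat today. Buy now and win big! Analysts expect more of the same."
score = spam_score(post, toy_classifier)
print(score, label_post(post, toy_classifier))
```

Keeping the threshold as a parameter makes it easy to test other cut-offs without rescoring the sentences.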
I have created a score for each document which is the proportion of sentences in that document classified as spam. Let’s set everything with a score below 0.5 as not_spam and everything above 0.5 as spam and see what the accuracy metrics look like.
| | 0 | 1 | accuracy | macro avg | weighted avg |
| --- | --- | --- | --- | --- | --- |
| precision | 0.684127 | 0.916883 | 0.772414 | 0.800505 | 0.810710 |
| recall | 0.930886 | 0.639493 | 0.772414 | 0.785189 | 0.772414 |
| f1-score | 0.788655 | 0.753469 | 0.772414 | 0.771062 | 0.769519 |
| support | 463.000000 | 552.000000 | 0.772414 | 1015.000000 | 1015.000000 |
This doesn’t look great, but let’s have a quick look at what happens if we change the threshold.
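A threshold sweep of this kind can be sketched as below. The scores and labels are made-up illustrative values, and the helper names are hypothetical; the real sweep would use the per-document sentence scores and the test labels.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive (spam = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def sweep_thresholds(scores, y_true, thresholds):
    """Metrics for each threshold; a post is spam when its score is above it."""
    results = {}
    for threshold in thresholds:
        y_pred = [1 if score > threshold else 0 for score in scores]
        results[threshold] = precision_recall_f1(y_true, y_pred)
    return results

# Hypothetical per-document spam scores and true labels (1 = spam).
scores = [0.0, 0.1, 0.0, 0.4, 0.6, 0.2, 1.0, 0.0]
y_true = [0, 1, 0, 1, 1, 0, 1, 0]

results = sweep_thresholds(scores, y_true, [0.0, 0.05, 0.25, 0.5])
best = max(results, key=lambda threshold: results[threshold][2])

for threshold, (p, r, f1) in results.items():
    print(f"threshold {threshold:<5} precision {p:.3f} recall {r:.3f} f1 {f1:.3f}")
print("best threshold by f1:", best)
```

Lower thresholds trade precision for recall, so the choice depends on which kind of mistake is more acceptable.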
Interesting that if there is any sentence at all labelled as spam, you can probably classify the entire post as spam. We can get fairly similar results doing this to analysing the post as a whole but practically speaking, this is not improving detection.
We could look at training on a sentence level but that would mean another hefty labelling task. We could also look at tokenising on a paragraph level instead of a sentence level, or considering this only for posts over a certain length. For now I am going to look specifically at the posts that were mistakenly labelled as spam in the standard model and if these could be classified any more accurately if scoring on a sentence level.
| | 0 | 1 | accuracy | macro avg | weighted avg |
| --- | --- | --- | --- | --- | --- |
| precision | 1.000000 | 0.0 | 0.647059 | 0.500000 | 1.000000 |
| recall | 0.647059 | 0.0 | 0.647059 | 0.323529 | 0.647059 |
| f1-score | 0.785714 | 0.0 | 0.647059 | 0.392857 | 0.785714 |
| support | 51.000000 | 0.0 | 0.647059 | 51.000000 | 51.000000 |
This is an improvement on the 0 accuracy obtained by the standard model - maybe we could do some sort of combo approach in the case of long posts. I really need to try this on all posts over a certain length because we won’t know which posts are mislabelled in practice.
The mean length of false positive posts is 187, so let’s look at classifying all posts over that length using the sentence level approach.
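One way to express the combined approach is a simple router that sends long posts to the sentence-level score and everything else to the whole-post classifier. This is a sketch only: the function and both stand-in callables are hypothetical, the 187-word cut-off is the mean quoted above, and the 0.05 score threshold is the value discussed in this section.

```python
def classify_with_length_routing(text, classify_post, sentence_score,
                                 length_cutoff=187, score_threshold=0.05):
    """Return 1 (spam) or 0 (not spam), routing long posts to sentence-level scoring."""
    n_words = len(text.split())
    if n_words > length_cutoff:
        # Long post: use the share of sentences flagged as spam.
        return 1 if sentence_score(text) > score_threshold else 0
    # Short post: use the standard whole-post classifier.
    return classify_post(text)

# Stand-ins for the real models (hypothetical), recording which path was used.
calls = []

def toy_post_classifier(text):
    calls.append("post")
    return 0

def toy_sentence_score(text):
    calls.append("sentence")
    return 0.2

short_label = classify_with_length_routing("just a short update",
                                           toy_post_classifier, toy_sentence_score)
long_label = classify_with_length_routing(" ".join(["word"] * 200),
                                          toy_post_classifier, toy_sentence_score)
print(short_label, long_label, calls)
```

Both cut-offs are parameters so they can be tuned on held-out data rather than fixed at the values used here.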
post length
The best f1 score and recall combination comes at a threshold of 0.05; this is only slightly better than the f1 score and recall we achieve with the standard model (table below). Technically it looks like we can get better results if we use the sentence level approach for longer posts, but it might not be enough of an improvement to make it worth it.
| | not_spam | spam | accuracy | macro avg | weighted avg |
| --- | --- | --- | --- | --- | --- |
| precision | 0.870968 | 0.842593 | 0.848921 | 0.856780 | 0.851575 |
| recall | 0.613636 | 0.957895 | 0.848921 | 0.785766 | 0.848921 |
| f1-score | 0.720000 | 0.896552 | 0.848921 | 0.808276 | 0.840665 |
| support | 44.000000 | 95.000000 | 0.848921 | 139.000000 | 139.000000 |
Trial different dataset ablations
Remove reports and articles
We have discussed at length how hard these are to label as a human, never mind as a machine. Does the addition of these posts confuse the model? And so, does removing them improve the model?
First, let’s look at training over two epochs.
[358/358 06:21, Epoch 2/2]

| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| 36 | 0.552100 | 0.398188 | 0.846875 | 0.845460 | 0.854158 | 0.846875 |
| 72 | 0.384600 | 0.412332 | 0.825000 | 0.823286 | 0.847440 | 0.825000 |
| 108 | 0.359300 | 0.324502 | 0.871875 | 0.871756 | 0.872064 | 0.871875 |
| 144 | 0.335600 | 0.338327 | 0.865625 | 0.864793 | 0.870204 | 0.865625 |
| 180 | 0.332900 | 0.323204 | 0.865625 | 0.865619 | 0.869093 | 0.865625 |
| 216 | 0.267300 | 0.317566 | 0.881250 | 0.881278 | 0.881375 | 0.881250 |
| 252 | 0.279100 | 0.333547 | 0.873958 | 0.874010 | 0.875657 | 0.873958 |
| 288 | 0.259500 | 0.321211 | 0.876042 | 0.876067 | 0.878697 | 0.876042 |
| 324 | 0.249900 | 0.300097 | 0.886458 | 0.886463 | 0.886470 | 0.886458 |

[60/60 00:10]
This doesn’t seem to be improving the model and the loss curve looks like it could converge more. Let’s try over three epochs
[537/537 09:50, Epoch 3/3]

| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| 36 | 0.578400 | 0.357010 | 0.840625 | 0.840231 | 0.841607 | 0.840625 |
| 72 | 0.438500 | 0.441831 | 0.812500 | 0.810153 | 0.839011 | 0.812500 |
| 108 | 0.375500 | 0.321627 | 0.870833 | 0.870703 | 0.871058 | 0.870833 |
| 144 | 0.348900 | 0.330973 | 0.864583 | 0.863523 | 0.870868 | 0.864583 |
| 180 | 0.333000 | 0.295100 | 0.875000 | 0.874831 | 0.875426 | 0.875000 |
| 216 | 0.271300 | 0.402192 | 0.872917 | 0.872966 | 0.874722 | 0.872917 |
| 252 | 0.343600 | 0.346152 | 0.869792 | 0.869826 | 0.872175 | 0.869792 |
| 288 | 0.284800 | 0.391758 | 0.844792 | 0.844039 | 0.858769 | 0.844792 |
| 324 | 0.252200 | 0.291041 | 0.880208 | 0.880253 | 0.882133 | 0.880208 |
| 360 | 0.230600 | 0.331076 | 0.879167 | 0.879167 | 0.882539 | 0.879167 |
| 396 | 0.211000 | 0.368974 | 0.869792 | 0.869743 | 0.874242 | 0.869792 |
| 432 | 0.195900 | 0.338604 | 0.881250 | 0.881296 | 0.883065 | 0.881250 |
| 468 | 0.208300 | 0.317228 | 0.888542 | 0.888571 | 0.888696 | 0.888542 |
| 504 | 0.183500 | 0.333172 | 0.891667 | 0.891676 | 0.891693 | 0.891667 |

[60/60 00:10]
Model metrics are looking stronger here but the model does look like it might be overfitting… I’m going to try adjusting weight decay and learning rate.
I’m using optuna to try to optimise the training parameters; the outputs below are from a study varying weight decay and learning rate to try to combat overfitting.
[358/358 05:01, Epoch 2/2]

| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| 50 | 0.495600 | 0.353252 | 0.856250 | 0.855878 | 0.857428 | 0.856250 |
| 100 | 0.364200 | 0.330188 | 0.878125 | 0.877704 | 0.880221 | 0.878125 |
| 150 | 0.340400 | 0.313633 | 0.872917 | 0.872966 | 0.873289 | 0.872917 |
| 200 | 0.297000 | 0.288736 | 0.881250 | 0.881149 | 0.881424 | 0.881250 |
| 250 | 0.241600 | 0.349093 | 0.870833 | 0.870864 | 0.873346 | 0.870833 |
| 300 | 0.242900 | 0.315722 | 0.876042 | 0.876082 | 0.878193 | 0.876042 |
| 350 | 0.202000 | 0.303394 | 0.887500 | 0.887533 | 0.887690 | 0.887500 |

[120/120 00:11]

[358/358 04:52, Epoch 2/2]

| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| 50 | 0.576000 | 0.396333 | 0.831250 | 0.831300 | 0.833437 | 0.831250 |
| 100 | 0.388800 | 0.332276 | 0.867708 | 0.867744 | 0.867871 | 0.867708 |
| 150 | 0.344800 | 0.321822 | 0.871875 | 0.871529 | 0.873265 | 0.871875 |
| 200 | 0.301300 | 0.309843 | 0.870833 | 0.870895 | 0.871871 | 0.870833 |
| 250 | 0.253100 | 0.343132 | 0.866667 | 0.866678 | 0.869704 | 0.866667 |
| 300 | 0.282500 | 0.337678 | 0.870833 | 0.870845 | 0.873884 | 0.870833 |
| 350 | 0.246800 | 0.312712 | 0.878125 | 0.878130 | 0.878138 | 0.878125 |

[120/120 00:11]

[358/358 04:52, Epoch 2/2]

| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| 50 | 0.493500 | 0.377654 | 0.855208 | 0.855136 | 0.855214 | 0.855208 |
| 100 | 0.365700 | 0.330696 | 0.872917 | 0.872891 | 0.872896 | 0.872917 |
| 150 | 0.337200 | 0.348639 | 0.840625 | 0.840196 | 0.850151 | 0.840625 |
| 200 | 0.306300 | 0.294552 | 0.879167 | 0.879195 | 0.879292 | 0.879167 |
| 250 | 0.243400 | 0.340071 | 0.869792 | 0.869808 | 0.872700 | 0.869792 |
| 300 | 0.255500 | 0.327951 | 0.878125 | 0.878094 | 0.882285 | 0.878125 |
| 350 | 0.203800 | 0.307104 | 0.883333 | 0.883379 | 0.883701 | 0.883333 |

[120/120 00:11]

[358/358 05:09, Epoch 2/2]

| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| 50 | 0.478000 | 0.356945 | 0.856250 | 0.856002 | 0.856831 | 0.856250 |
| 100 | 0.359300 | 0.321177 | 0.869792 | 0.869709 | 0.869856 | 0.869792 |
| 150 | 0.352600 | 0.309830 | 0.868750 | 0.868724 | 0.868728 | 0.868750 |
| 200 | 0.300100 | 0.283371 | 0.883333 | 0.883215 | 0.883591 | 0.883333 |
| 250 | 0.231200 | 0.337918 | 0.884375 | 0.884422 | 0.886084 | 0.884375 |
| 300 | 0.246100 | 0.331095 | 0.882292 | 0.882323 | 0.884702 | 0.882292 |
| 350 | 0.211000 | 0.302979 | 0.886458 | 0.886511 | 0.887119 | 0.886458 |

[120/120 00:11]

[358/358 05:01, Epoch 2/2]

| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| 50 | 0.475900 | 0.347312 | 0.859375 | 0.858926 | 0.861061 | 0.859375 |
| 100 | 0.379700 | 0.335153 | 0.875000 | 0.874446 | 0.877978 | 0.875000 |
| 150 | 0.336400 | 0.322408 | 0.867708 | 0.867689 | 0.871492 | 0.867708 |
| 200 | 0.299500 | 0.298249 | 0.880208 | 0.880264 | 0.880871 | 0.880208 |
| 250 | 0.227400 | 0.339285 | 0.878125 | 0.878130 | 0.881345 | 0.878125 |
| 300 | 0.245600 | 0.326882 | 0.886458 | 0.886495 | 0.888628 | 0.886458 |
| 350 | 0.215300 | 0.307947 | 0.885417 | 0.885468 | 0.886012 | 0.885417 |

[120/120 00:11]

[358/358 05:03, Epoch 2/2]

| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| 50 | 0.489200 | 0.364529 | 0.856250 | 0.854830 | 0.864523 | 0.856250 |
| 100 | 0.392000 | 0.333931 | 0.879167 | 0.878826 | 0.880715 | 0.879167 |
| 150 | 0.340500 | 0.325551 | 0.864583 | 0.864645 | 0.865186 | 0.864583 |
| 200 | 0.307900 | 0.291423 | 0.884375 | 0.884115 | 0.885484 | 0.884375 |
| 250 | 0.222000 | 0.329435 | 0.872917 | 0.872971 | 0.873396 | 0.872917 |
| 300 | 0.237600 | 0.320021 | 0.876042 | 0.876047 | 0.879254 | 0.876042 |
| 350 | 0.202000 | 0.303711 | 0.887500 | 0.887548 | 0.887973 | 0.887500 |

[120/120 00:11]

[358/358 05:00, Epoch 2/2]

| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| 50 | 0.536200 | 0.371610 | 0.844792 | 0.844842 | 0.845040 | 0.844792 |
| 100 | 0.390600 | 0.339289 | 0.865625 | 0.865431 | 0.866071 | 0.865625 |
| 150 | 0.340300 | 0.319113 | 0.860417 | 0.860458 | 0.860619 | 0.860417 |
| 200 | 0.308900 | 0.298684 | 0.880208 | 0.880203 | 0.880199 | 0.880208 |
| 250 | 0.242800 | 0.340142 | 0.869792 | 0.869826 | 0.872175 | 0.869792 |
| 300 | 0.273500 | 0.335391 | 0.875000 | 0.875029 | 0.877522 | 0.875000 |
| 350 | 0.238700 | 0.311230 | 0.880208 | 0.880232 | 0.880304 | 0.880208 |

[120/120 00:11]

[358/358 05:02, Epoch 2/2]

| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| 50 | 0.604000 | 0.445440 | 0.820833 | 0.820919 | 0.821717 | 0.820833 |
| 100 | 0.399000 | 0.336848 | 0.864583 | 0.864424 | 0.864876 | 0.864583 |
| 150 | 0.337000 | 0.325152 | 0.859375 | 0.859427 | 0.859705 | 0.859375 |
| 200 | 0.318400 | 0.300137 | 0.879167 | 0.878981 | 0.879725 | 0.879167 |
| 250 | 0.258500 | 0.342617 | 0.859375 | 0.859393 | 0.862251 | 0.859375 |
| 300 | 0.285000 | 0.329623 | 0.866667 | 0.866706 | 0.868921 | 0.866667 |
| 350 | 0.254500 | 0.307795 | 0.872917 | 0.872966 | 0.873289 | 0.872917 |

[120/120 00:11]
We are getting relatively good f1 results but how are the loss curves looking? Is the model still overfitting?
It looks like we are still overfitting the data in every iteration. We might have to look at this a bit more in the end but we can probably draw the conclusion that training the model on reports and articles is not reducing the model’s ability to predict spam.
Remove Hateful posts
I think this is probably so low volume that it’s not really affecting the model
training on 2 epochs with no hateful content in the training data:
[376/376 07:17, Epoch 2/2]

| Step | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
| --- | --- | --- | --- | --- | --- | --- |
| 37 | 0.588500 | 0.426125 | 0.824579 | 0.824819 | 0.825773 | 0.824579 |
| 74 | 0.428000 | 0.332770 | 0.866204 | 0.865103 | 0.869715 | 0.866204 |
| 111 | 0.417100 | 0.316493 | 0.868186 | 0.867404 | 0.870163 | 0.868186 |
| 148 | 0.324100 | 0.351221 | 0.859267 | 0.859528 | 0.863360 | 0.859267 |
| 185 | 0.338200 | 0.294790 | 0.875124 | 0.874405 | 0.877098 | 0.875124 |
| 222 | 0.303300 | 0.357434 | 0.859267 | 0.859507 | 0.865044 | 0.859267 |
| 259 | 0.280900 | 0.320169 | 0.880079 | 0.880298 | 0.882882 | 0.880079 |
| 296 | 0.250000 | 0.274743 | 0.890981 | 0.890749 | 0.891227 | 0.890981 |
| 333 | 0.191600 | 0.304093 | 0.886026 | 0.886171 | 0.886946 | 0.886026 |
| 370 | 0.299600 | 0.298144 | 0.884044 | 0.884191 | 0.884967 | 0.884044 |

[64/64 00:11]
This is probably one of the best f1 scores we’ve obtained for spam classification. Precision is very strong at 90% and recall is similar to other models at ~88%. That said, it’s not an improvement on the standard model.
Removing slop
if we train over 3 epochs:
I wouldn’t say this has really improved the model
if we train over 4 epochs:
4 epochs definitely overfits the data
Fine tuning a model already trained for spam identification
fine tune mrm8488/bert-tiny-finetuned-sms-spam-detection
| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| not_spam | 0.817746 | 0.736501 | 0.775000 | 463.000000 |
| spam | 0.795987 | 0.862319 | 0.827826 | 552.000000 |
| accuracy | 0.804926 | 0.804926 | 0.804926 | 0.804926 |
| macro avg | 0.806866 | 0.799410 | 0.801413 | 1015.000000 |
| weighted avg | 0.805912 | 0.804926 | 0.803729 | 1015.000000 |
These definitely aren’t an improvement on the standard model, so we can try fine-tuning a different model to see if that improves anything.
fine tune skandavivek2/spam-classifier
| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| not_spam | 0.817822 | 0.892009 | 0.853306 | 463.000000 |
| spam | 0.901961 | 0.833333 | 0.866290 | 552.000000 |
| accuracy | 0.860099 | 0.860099 | 0.860099 | 0.860099 |
| macro avg | 0.859891 | 0.862671 | 0.859798 | 1015.000000 |
| weighted avg | 0.863580 | 0.860099 | 0.860367 | 1015.000000 |
This is overfitting massively and only providing mediocre results. I think we can put this idea to bed.
freezing neural net layers
There is the argument that we should be utilising more of the training already done on the language models we fine-tune. Here I am going to investigate whether freezing every layer bar the decision head helps with training.
| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| not_spam | 0.841649 | 0.838013 | 0.839827 | 463.000000 |
| spam | 0.864621 | 0.867754 | 0.866184 | 552.000000 |
| accuracy | 0.854187 | 0.854187 | 0.854187 | 0.854187 |
| macro avg | 0.853135 | 0.852883 | 0.853006 | 1015.000000 |
| weighted avg | 0.854142 | 0.854187 | 0.854161 | 1015.000000 |
While the accuracy of this model isn’t as good as the standard model, the loss curve is quite promising. Could this perform better than higher f1 models on the validation data?
This model was trained on 2 epochs, so I am going to try training on 3 epochs to see if it helps.
| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| not_spam | 0.836559 | 0.840173 | 0.838362 | 463.000000 |
| spam | 0.865455 | 0.862319 | 0.863884 | 552.000000 |
| accuracy | 0.852217 | 0.852217 | 0.852217 | 0.852217 |
| macro avg | 0.851007 | 0.851246 | 0.851123 | 1015.000000 |
| weighted avg | 0.852274 | 0.852217 | 0.852242 | 1015.000000 |
The third epoch doesn’t appear to have improved much. I can try changing batch size and learning rate.
| | precision | recall | f1-score | support |
| --- | --- | --- | --- | --- |
| not_spam | 0.851415 | 0.779698 | 0.813980 | 463.000000 |
| spam | 0.827411 | 0.885870 | 0.855643 | 552.000000 |
| accuracy | 0.837438 | 0.837438 | 0.837438 | 0.837438 |
| macro avg | 0.839413 | 0.832784 | 0.834811 | 1015.000000 |
| weighted avg | 0.838361 | 0.837438 | 0.836638 | 1015.000000 |
This model was trained on 2 epochs again and it looks like it needed more time to converge. The larger batch size and learning rate don’t appear to have improved the prediction ability of the model.
Improving the standard model
It kind of looks like the standard model is still largely the best performing model, so we should spend some time tuning the model hyperparameters.
First I’m going to run an optuna study to see if we can optimise the hyperparams
Over 5 trials, f1 score maxes out just over 88% and none of the trial loss curves look particularly promising.
Runs 1 and 3 have the best results in terms of f1. Run 3 is completely overfit and run 1 probably isn’t perfect but it is a bit better. We will be able to judge that better when we evaluate on the validation set.
I am going to do some more manual tuning to see if I can find a better model
After a little bit of playing around with hyperparameters, we can land on a model with an f1 score of 87.3%, which is not the best f1 score we have achieved, but which has a good loss curve. It is worth noting models like this, as they might perform better in practice.
Conclusion
If we review the theories we tested:
yes we can create a better model
yes, we should definitely label more data, but the relabelled articles and reports are still struggling, with the relabelled data only reaching 77% accuracy. Relabelling has really improved precision but has not helped recall. Which do we deem more important?
Reports and articles
Quotes
bad scraping
hateful, petition, stocks
having tried removing slop, articles and reports, and hateful comments, it doesn’t look like removing any of them improves model metrics, and so it doesn’t look like any of them are confusing the model.
no, having tried on two models, this is not improving results
this doesn’t improve f1 on the test data but it does create a much nicer loss curve so maybe it would be better at predicting the eval set - I will investigate that today
Spoiler alert, investigating these theories brought about some new theories and so here is a list of other things I ended up investigating:
This does not improve the overall model. It did look like it might be helpful for classifying longer posts, but it doesn’t look like it results in a significant improvement on the standard model. That said, it is longer posts that the model is most frequently mislabelling as spam, so this is definitely something we need to address.
My working theory was that a higher recall would be OK if the mistakes the model was making were just classifying other low quality posts as spam, but it looks like we would just be removing a lot of long posts.
Some future tests we should run:
- [ ] We looked at classifying longer posts at a sentence level and didn’t really improve results, but maybe we should look specifically at longer posts that are classified as spam and pass them through a second filter.
- [ ] Maybe we need a heuristic filter for slop. This is frequently being classified as not spam and more data isn’t massively affecting results. Filters like proportion of stopwords per sentence might be good here.
- [ ] We need to pick out our best models and run them over the validation set to see which perform best.
- [ ] Unfortunately, we need to label more data.
- [ ] I still haven’t tested model accuracy if data is run through heuristic filters before modelling.
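As a starting point for the stopword idea in the list above, here is a minimal sketch of a per-sentence stopword proportion. The tiny `STOPWORDS` set and the function name are hypothetical placeholders; a real filter would need a fuller stopword list, and which direction (too few or too many stopwords) signals slop would have to be checked against the labelled data.

```python
import re

# Tiny illustrative stopword list (hypothetical); swap in a fuller list in practice.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "that", "for", "on", "with"}

def stopword_ratio(sentence):
    """Share of word tokens in the sentence that are stopwords."""
    tokens = re.findall(r"[a-z']+", sentence.lower())
    if not tokens:
        return 0.0
    return sum(1 for token in tokens if token in STOPWORDS) / len(tokens)

natural = stopword_ratio("The cat is on the mat")
shouty = stopword_ratio("BUY CRYPTO NOW!!!")
print(round(natural, 3), shouty)
```

Comparing the distribution of this ratio for slop and not_spam posts would show whether a simple cut-off separates them before adding it as a pre-model filter.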