Model Training Results

Background

We began by fine-tuning a few of the standard transformer models (4_old_model_evaluation.ipynb) to get an idea of how they were performing. The best f1-scores for each model after hyperparameter tuning are shown below.

Model F1-Score
bert-base-cased 0.864
bert-base-uncased 0.862
distilbert-cased 0.857
distilbert-uncased -
roberta-base 0.865

Although roberta did not achieve results massively better than bert-base-cased, it saw the least hyperparameter tuning (2 iterations), and even at that, both iterations performed better than the best bert model (10 iterations). We decided to move forward with roberta for further testing.

Test Conditions

We had a few theories we wanted to test:

  • Can we create a better model by training on the relabelled data?
  • Should we label more data - would more labelled data improve the model?
  • Are certain hard-to-label categories (reports and articles, quotes, bad scraping, hateful/petition/stocks, slop) confusing the model, and would removing them improve it?
  • Would fine-tuning a model already trained for spam identification improve results?
  • Does freezing every layer bar the decision head help with training?

Spoiler alert, investigating these theories brought about some new theories, so here is a list of other things I ended up investigating:

  • Does sentence length affect spam classification, and does tokenising long posts into sentences help?
  • Would we rather a high precision or a high recall, given where the mistakes are being made?

A note on the relabelled data: each relabelled post contains either a report, an article link, a quote (report) or an announcement. This is the relabelled GPT data, just no longer split by whether it is an article link or not, which may be a little annoying when looking at accuracy at the group level.

Tests

Cross-validation of the og roberta model with og data

We have pretty much settled on roberta being our best-performing model. Previous experiments (3_model_iteration) have shown that, with very little hyperparameter tuning, it is the best-performing model (>86% accuracy and f1). I’m going to do a quick cross-validation test to make sure that these results hold across varied training/testing splits.
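A minimal sketch of the loop behind these runs, assuming a dataframe `df` with `text` and binary `label` columns; the hyperparameters mirror the settings reported in the training-size table further down (2 epochs, batch size 16, learning rate 2e-5, weight decay 0.1), and the variable and output names are illustrative:

```python
import numpy as np
from datasets import Dataset
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted")
    return {"accuracy": accuracy_score(labels, preds),
            "f1": f1, "precision": precision, "recall": recall}

# five stratified folds, matching the five runs shown below
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(df["text"], df["label"])):
    train_ds = Dataset.from_pandas(df.iloc[train_idx]).map(tokenize, batched=True)
    val_ds = Dataset.from_pandas(df.iloc[val_idx]).map(tokenize, batched=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-base", num_labels=2)  # fresh weights for every fold
    args = TrainingArguments(
        output_dir=f"cv_fold_{fold}", num_train_epochs=2,
        per_device_train_batch_size=16, learning_rate=2e-5, weight_decay=0.1,
        evaluation_strategy="steps", eval_steps=50, logging_steps=50)
    Trainer(model=model, args=args, train_dataset=train_ds,
            eval_dataset=val_ds, compute_metrics=compute_metrics).train()
```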

[402/402 06:37, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.554600 0.340840 0.868323 0.867751 0.869972 0.868323
100 0.406000 0.324208 0.874534 0.872914 0.884007 0.874534
150 0.368400 0.311900 0.874534 0.872678 0.886015 0.874534
200 0.353700 0.312466 0.867081 0.867233 0.869958 0.867081
250 0.326900 0.304515 0.868323 0.868323 0.868323 0.868323
300 0.302700 0.288748 0.878261 0.877374 0.882388 0.878261
350 0.238800 0.289207 0.886957 0.886486 0.888694 0.886957
400 0.290100 0.285096 0.886957 0.886676 0.887616 0.886957

[404/404 06:37, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.548500 0.408954 0.829602 0.829485 0.829577 0.829602
100 0.347600 0.393902 0.840796 0.838168 0.854970 0.840796
150 0.380600 0.357313 0.854478 0.853830 0.856862 0.854478
200 0.342400 0.350081 0.839552 0.837915 0.846956 0.839552
250 0.287100 0.417664 0.838308 0.835956 0.850190 0.838308
300 0.286800 0.364391 0.858209 0.858190 0.858183 0.858209
350 0.242500 0.374562 0.847015 0.846036 0.850979 0.847015
400 0.297800 0.358485 0.842040 0.840819 0.847116 0.842040

[404/404 06:37, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.526500 0.454144 0.819652 0.813087 0.841876 0.819652
100 0.397800 0.353869 0.855721 0.854914 0.856552 0.855721
150 0.338800 0.348409 0.861940 0.862176 0.863014 0.861940
200 0.382600 0.304005 0.876866 0.876087 0.878353 0.876866
250 0.264500 0.321688 0.876866 0.875549 0.881046 0.876866
300 0.299600 0.295096 0.878109 0.877539 0.878851 0.878109
350 0.273100 0.291737 0.873134 0.873022 0.873008 0.873134
400 0.278700 0.291532 0.870647 0.870258 0.870759 0.870647

[404/404 06:39, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.549500 0.463595 0.809701 0.803958 0.821194 0.809701
100 0.397300 0.380256 0.840796 0.840902 0.841051 0.840796
150 0.391100 0.390431 0.825871 0.826573 0.830229 0.825871
200 0.329000 0.419239 0.838308 0.838251 0.838205 0.838308
250 0.346200 0.343753 0.858209 0.856757 0.860209 0.858209
300 0.282600 0.355700 0.864428 0.864451 0.864477 0.864428
350 0.256800 0.355219 0.865672 0.865624 0.865588 0.865672
400 0.233400 0.368634 0.864428 0.864404 0.864383 0.864428

[404/404 06:40, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.594100 0.380333 0.830846 0.831327 0.833291 0.830846
100 0.375200 0.426684 0.818408 0.819108 0.827188 0.818408
150 0.382000 0.360243 0.843284 0.843422 0.843655 0.843284
200 0.344800 0.332534 0.861940 0.860143 0.866251 0.861940
250 0.270400 0.353116 0.854478 0.854735 0.855442 0.854478
300 0.260300 0.360021 0.853234 0.853572 0.854769 0.853234
350 0.289400 0.357653 0.848259 0.848714 0.850946 0.848259
400 0.253600 0.338669 0.863184 0.862265 0.864172 0.863184

There is some variation in the metrics across runs, which suggests that the training sample used has some effect on model accuracy. There is also a little variation in the stability of the loss curves across runs. Interestingly, the most promising loss curves coincide with the two runs with the highest f1-scores.

Another thing I should investigate is what differs across training sets; for now, however, I am prioritising training with the relabelled data, as if this improves the model, we will proceed with that anyway.

Trial different size datasets (from here on out everything is trained with the relabelled data)

Before training, let’s see what the split of each spam classification is across samples - we would want each classification to be represented at roughly 25%, 50%, 75% and 100% of its available labels.
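The proportions in the table below fall out of a quick groupby; a sketch, assuming each subsample and the full training set are dataframes with a `classification_label` column (the variable names are illustrative):

```python
import pandas as pd

# label counts in each subsample and in the full training set
counts = pd.DataFrame({
    "train_25": train_25["classification_label"].value_counts(),
    "train_50": train_50["classification_label"].value_counts(),
    "train_75": train_75["classification_label"].value_counts(),
    "overall_count": train_full["classification_label"].value_counts(),
}).fillna(0)

# each subsample's share of the available labels; ideally ~0.25 / 0.50 / 0.75
for frac in ("25", "50", "75"):
    counts[f"train{frac}_prop"] = counts[f"train_{frac}"] / counts["overall_count"]

print(counts)
```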

train_25 train_50 train_75 overall_count train25_prop train50_prop train75_prop
classification_label
not_spam 284 582 851 1128 0.251773 0.515957 0.754433
spam 84 161 242 312 0.269231 0.516026 0.775641
promotion 66 125 194 259 0.254826 0.482625 0.749035
crypto 65 136 200 270 0.240741 0.503704 0.740741
slop 59 116 186 258 0.228682 0.449612 0.720930
relabelled 51 97 145 206 0.247573 0.470874 0.703883
article_link 47 108 164 212 0.221698 0.509434 0.773585
seo 40 76 111 145 0.275862 0.524138 0.765517
announcement 26 50 73 91 0.285714 0.549451 0.802198
image 24 41 65 94 0.255319 0.436170 0.691489
event 19 33 54 73 0.260274 0.452055 0.739726
bad_scraping 19 34 62 80 0.237500 0.425000 0.775000
report 15 24 40 53 0.283019 0.452830 0.754717
quote 8 13 20 28 0.285714 0.464286 0.714286
stock_ticker 3 4 5 6 0.500000 0.666667 0.833333
hateful 2 7 10 11 0.181818 0.636364 0.909091
video 1 5 10 15 0.066667 0.333333 0.666667
aiprompt 1 4 5 8 0.125000 0.500000 0.625000
low_analytical_value() 1 2 4 5 0.200000 0.400000 0.800000
petition 0 1 2 5 0.000000 0.200000 0.400000

With the exception of the petition label in the 25% group, we seem to have an OK representation of each classification in each group.

How training size affects results

Granular look at spam vs not spam

model sample num_epochs batch_size learning_rate weight_decay eval_loss eval_accuracy eval_f1 eval_precision eval_recall eval_runtime eval_samples_per_second eval_steps_per_second epoch name_on_hub
0 roberta train_25 2 16 0.00002 0.1 0.363416 0.847291 0.847572 0.849867 0.847291 12.3891 81.927 5.166 2.0 /roberta_train_n752
1 roberta train_50 2 16 0.00002 0.1 0.314662 0.874877 0.875072 0.876329 0.874877 12.2712 82.714 5.215 2.0 /roberta_train_n1503
2 roberta train_75 2 16 0.00002 0.1 0.321931 0.876847 0.877039 0.878296 0.876847 12.2485 82.868 5.225 2.0 /roberta_train_n2254
3 roberta train_100 2 16 0.00002 0.1 0.288585 0.885714 0.885831 0.886286 0.885714 12.2070 83.149 5.243 2.0 /roberta_train_n3006

We can see that all metrics improved as training size increased, though there is a less dramatic increase between the 50% and 75% sample sizes. Let’s look at it on a graph:

As we saw in the dataframe, all metrics improve with increasing training size. Precision is consistently the highest-scoring evaluation metric.

If we look at this at the spam-type level (article link, slop, …), does accuracy improve across all categories as training size increases? There are some spam types with a fairly low sample size to begin with; these likely won’t see much improvement with increasing sample size, as their counts stay low in every subsample, but maybe we have reached saturation with other groups.
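A sketch of that per-category view, assuming a test dataframe with the fine-grained `classification_label`, the binary ground-truth `label` and the model's `pred` columns (all names illustrative):

```python
# accuracy within each fine-grained spam category for one trained model;
# repeating this for each training size gives the per-group comparison below
per_group_accuracy = (
    test_df.assign(correct=test_df["pred"] == test_df["label"])
           .groupby("classification_label")["correct"]
           .mean()
           .sort_values()
)
print(per_group_accuracy)
```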

group classification across training size

For the sake of showing what I’m talking about here, the above is a screenshot of an interactive HTML chart that should be rendered here. I wouldn’t bother trying to interpret it; the main conclusions are summarised below:

More data does seem to largely improve the accuracy of identifying spam categories. Some areas we are struggling and may need to get more labelled data or consider other routes to accuracy:

  • relabelled (max ~77% accuracy)
  • announcements - more data is not improving metrics, with a max accuracy of ~87% at a training size of only 50% of the full training data.
  • report - this is also encompassed in the relabelled data; it improves with more data but maxes out at ~70%.
  • quote - also somewhat overlapping with the relabelled data (max ~68%).
  • bad scraping - this does significantly improve with larger training samples; that said, we have a max of only 33 “bad scraping” posts. Should this be included here at all, or is bad scraping too broad a category to give to the model? Is this something we can tackle at source?
  • hateful, petition, stocks - really poor classification, but also really small sample sizes overall. Should hateful be included here at all, or is it too broad?

We can see that the model is improving with more data, so it seems like a good argument to source more labelled data.

Let’s have a quick look to compare how the model trained on the new labelled data compares to the old model and see if we can draw any conclusions from that.

Comparing the original and relabelled model

Let’s have a look at some of the more telling metrics on both a spam label and overall model level:

old_data_no_spam old_data_spam new_data_no_spam new_data_spam
precision 0.885366 0.864463 0.863732 0.905204
recall 0.815730 0.917544 0.889849 0.882246
f1-score 0.849123 0.890213 0.876596 0.893578
support 445.000000 570.000000 463.000000 552.000000

old_data_weighted avg new_data_weighted avg
precision 0.873627 0.886286
recall 0.872906 0.885714
f1-score 0.872198 0.885831
support 1015.000000 1015.000000

It looks like the newly labelled data is resulting in a better model across the board, but it would be interesting to see what effect the change in label on the report and article data has at the spam-grouping level.

classification_label precision_not_spam_og_model precision_spam_og_model recall_not_spam_og_model recall_spam_og_model f1_not_spam_og_model f1_spam_og_model precision_not_spam_relabelled precision_spam_relabelled recall_not_spam_relabelled recall_spam_relabelled f1_not_spam_relabelled f1_spam_relabelled 0_relabelled 1_relabelled 0_og_model 1_og_model
0 aiprompt 1.000000 NaN 1.000000 NaN 1.000000 NaN 1.000000 NaN 1.000000 NaN 1.000000 NaN 0.0 3.0 0.0 3.0
1 announcement 0.333333 0.804878 0.200000 0.891892 0.250000 0.846154 0.375000 0.956522 0.750000 0.814815 0.500000 0.880000 4.0 27.0 10.0 37.0
2 article_link 0.450000 0.785714 0.300000 0.875000 0.360000 0.827957 0.526316 1.000000 1.000000 0.876712 0.689655 0.934307 10.0 73.0 30.0 88.0
3 bad_scraping 0.750000 0.837838 0.333333 0.968750 0.461538 0.898551 0.411765 0.875000 0.777778 0.583333 0.538462 0.700000 9.0 24.0 9.0 32.0
4 bot 1.000000 NaN 1.000000 NaN 1.000000 NaN 1.000000 NaN 1.000000 NaN 1.000000 NaN 0.0 1.0 0.0 1.0
5 crypto 0.333333 1.000000 1.000000 0.976744 0.500000 0.988235 0.250000 1.000000 1.000000 0.964706 0.400000 0.982036 1.0 85.0 1.0 86.0
6 event 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.500000 1.000000 1.000000 0.958333 0.666667 0.978723 1.0 24.0 1.0 24.0
7 hateful 0.333333 0.000000 1.000000 0.000000 0.500000 0.000000 0.333333 0.000000 1.000000 0.000000 0.500000 0.000000 2.0 4.0 2.0 4.0
8 image 0.600000 0.944444 0.750000 0.894737 0.666667 0.918919 0.666667 1.000000 1.000000 0.894737 0.800000 0.944444 4.0 19.0 4.0 19.0
9 low_analytical_value() 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0 0.0 1.0 0.0
10 news_report() 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 not_spam 1.000000 0.000000 0.889474 0.000000 0.941504 0.000000 1.000000 0.000000 0.915567 0.000000 0.955923 0.000000 379.0 0.0 380.0 0.0
12 petition 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 1.0 0.0 1.0
13 promotion 0.200000 0.970874 0.250000 0.961538 0.222222 0.966184 0.222222 0.989796 0.666667 0.932692 0.333333 0.960396 3.0 104.0 4.0 104.0
14 quote 0.625000 0.888889 0.833333 0.727273 0.714286 0.800000 0.545455 1.000000 1.000000 0.500000 0.705882 0.666667 6.0 10.0 6.0 11.0
15 report 0.333333 0.781250 0.125000 0.925926 0.181818 0.847458 0.500000 0.818182 0.600000 0.750000 0.545455 0.782609 5.0 12.0 8.0 27.0
16 seo 0.000000 0.964286 0.000000 0.931034 0.000000 0.947368 0.000000 0.980769 0.000000 0.910714 0.000000 0.944444 1.0 56.0 2.0 58.0
17 slop 0.200000 1.000000 1.000000 0.902439 0.333333 0.948718 0.166667 1.000000 1.000000 0.878049 0.285714 0.935065 2.0 82.0 2.0 82.0
18 spam 0.000000 1.000000 0.000000 0.932692 0.000000 0.965174 0.000000 1.000000 0.000000 0.923077 0.000000 0.960000 0.0 104.0 0.0 104.0
19 stock_ticker 1.000000 NaN 1.000000 NaN 1.000000 NaN 0.000000 1.000000 0.000000 0.500000 0.000000 0.666667 0.0 2.0 0.0 2.0
20 stocks 0.000000 1.000000 0.000000 0.500000 0.000000 0.666667 0.000000 1.000000 0.000000 0.500000 0.000000 0.666667 0.0 2.0 0.0 2.0
21 video 1.000000 NaN 1.000000 NaN 1.000000 NaN 1.000000 NaN 1.000000 NaN 1.000000 NaN 0.0 3.0 0.0 3.0

The table above is exhaustive and hard to digest; we can filter to the specific metrics we want to look at.

classification_label precision_spam_og_model precision_spam_relabelled 1_og_model 1_relabelled
0 aiprompt NaN NaN 3.0 3.0
1 announcement 0.804878 0.956522 37.0 27.0
2 article_link 0.785714 1.000000 88.0 73.0
3 bad_scraping 0.837838 0.875000 32.0 24.0
4 bot NaN NaN 1.0 1.0
5 crypto 1.000000 1.000000 86.0 85.0
6 event 1.000000 1.000000 24.0 24.0
7 hateful 0.000000 0.000000 4.0 4.0
8 image 0.944444 1.000000 19.0 19.0
9 low_analytical_value() 0.000000 0.000000 0.0 0.0
10 news_report() 0.000000 NaN NaN NaN
11 not_spam 0.000000 0.000000 0.0 0.0
12 petition 0.000000 0.000000 1.0 1.0
13 promotion 0.970874 0.989796 104.0 104.0
14 quote 0.888889 1.000000 11.0 10.0
15 report 0.781250 0.818182 27.0 12.0
16 seo 0.964286 0.980769 58.0 56.0
17 slop 1.000000 1.000000 82.0 82.0
18 spam 1.000000 1.000000 104.0 104.0
19 stock_ticker NaN 1.000000 2.0 2.0
20 stocks 1.000000 1.000000 2.0 2.0
21 video NaN NaN 3.0 3.0

Instances of NaN represent cases where the denominator of the precision formula is 0, i.e. the model never labelled a post in that category as spam, so there are no true-positive or false-positive cases and precision is undefined.

Video, bot and AI-prompt posts are all cases where the model never recognises a post as spam. They are all represented by three or fewer posts in the test set (and similarly low sample sizes in the training set), so perhaps more training data for these categories would improve precision.

The metrics for the spam and not_spam classification categories here represent scenarios where there are no labelled instances of the inverse case (not_spam or spam respectively), and so precision for the inverse can never be anything but 0, as there are no true-positive cases.

\[\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}\]

The great news here is that, in terms of precision of spam classification, the relabelled data has improved classification across the board!

It is still somewhat up for debate whether we would rather have a high precision or a high recall. I think I would be in favour of a slightly higher recall, depending on where the mistakes are being made, but this is a subjective goal.

If we look at the recall, the relabelled data has really reduced this metric.

Specifically looking at where we make mistakes

First, let’s look at the cases where spam is being classified as not spam by our model.
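These counts come from a simple filter on the test predictions; a sketch, assuming the same illustrative `test_df` with binary `label`/`pred` columns where 1 means spam:

```python
# posts labelled spam (1) but predicted not_spam (0), counted per category
false_negatives = test_df[(test_df["label"] == 1) & (test_df["pred"] == 0)]
print(false_negatives.groupby("classification_label")["text"].count())
```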

text
classification_label
announcement 5
article_link 9
bad_scraping 10
crypto 3
event 1
hateful 4
image 2
petition 1
promotion 7
quote 5
relabelled 2
report 3
seo 5
slop 10
spam 8
stock_ticker 1
stocks 1

I should probably look at this as a percentage of overall category volume, but it looks like slop, bad_scraping, promotion and article_link are the categories most likely to be labelled as not spam when they are actually spam.

If we look then at where not spam is being classified as spam: this, to me, is an interesting area because if these are all low-quality posts, maybe a high recall isn’t the end of the world.

text
classification_label
announcement 1
bad_scraping 2
low_analytical_value() 1
not_spam 32
promotion 1
relabelled 13
report 2
seo 1

Most of the mistakes are being made on posts tagged only as not spam and on the relabelled article and report data. The relabelled data is data that we know is difficult to classify, so I am not massively worried about it. However, I want to see if there are any notable trends in the other not-spam posts classified as spam.

I’ve hidden the table from here as it’s a bit long, but it looks like it is longer posts that are being misclassified as spam; we will see a few metrics pertaining to this in the next section. This leads us nicely into another experiment: does sentence length affect spam classification with our model?

How sentence length affects model classification

Let’s first look at overall word count distribution

[histogram: word_count distribution across the full dataset]

and then look at how that compares to the word count distribution for all mislabelled posts (both false positive and negative).

[histogram: word_count distribution for mislabelled posts]

We would expect the correctly labelled and mislabelled data to have the same distribution of word count as the original data. I’m going to perform a t-test to confirm that this is true, transforming the data first because the distribution is skewed:
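A sketch of one of these comparisons, assuming `word_count` columns on the full and mislabelled dataframes (names illustrative); the log transform pulls in the right skew before the two-sample t-test:

```python
import numpy as np
from scipy import stats

# log-transform to pull in the right skew of post length
full_wc = np.log(df["word_count"] + 1)
mislabelled_wc = np.log(mislabelled["word_count"] + 1)

# two-sample t-test on mean (log) post length
print(stats.ttest_ind(mislabelled_wc, full_wc))
```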

TtestResult(statistic=-0.3802876800689237, pvalue=0.7037961797341186, df=1256.0)
TtestResult(statistic=0.10712299765143264, pvalue=0.9147015195217277, df=2124.0)

OK, it doesn’t look like there is any significant difference in the mean post length between the full dataset and the mislabelled posts, but what if we look specifically at posts that are incorrectly labelled as spam (where we have a high recall)?

TtestResult(statistic=-3.5737775575675075, pvalue=0.0003660816017693208, df=1179.0)
count     53.000000
mean     181.396226
std      219.505505
min        7.000000
25%       32.000000
50%       42.000000
75%      314.000000
max      951.000000
Name: word_count, dtype: float64

So it seems fairly certain that longer posts are more likely to be mistaken as spam - maybe we should be tokenising when we train?

Tokenising the text

I’m going to try tokenising into sentences and using our current model to classify those sentences, getting a score for each doc based on the aggregate.

I have created a score for each document which is the proportion of sentences in that document classified as spam. Let’s set everything with a score below 0.5 as not_spam and everything above 0.5 as spam and see what the accuracy metrics look like.
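A sketch of that scoring, assuming the fine-tuned model is loadable through a transformers pipeline; the model path and the positive label name are assumptions:

```python
import nltk
from nltk.tokenize import sent_tokenize
from transformers import pipeline

nltk.download("punkt")

# hypothetical path to our fine-tuned roberta checkpoint
classifier = pipeline("text-classification", model="path/to/roberta_spam")

def spam_score(text: str) -> float:
    """Proportion of a document's sentences classified as spam."""
    sentences = sent_tokenize(text)
    if not sentences:
        return 0.0
    preds = classifier(sentences, truncation=True)
    # "LABEL_1" assumed to be the spam class; depends on the model config
    return sum(p["label"] == "LABEL_1" for p in preds) / len(sentences)

test_df["sentence_score"] = test_df["text"].apply(spam_score)
test_df["sentence_pred"] = (test_df["sentence_score"] >= 0.5).astype(int)
```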

0 1 accuracy macro avg weighted avg
precision 0.684127 0.916883 0.772414 0.800505 0.810710
recall 0.930886 0.639493 0.772414 0.785189 0.772414
f1-score 0.788655 0.753469 0.772414 0.771062 0.769519
support 463.000000 552.000000 0.772414 1015.000000 1015.000000

This doesn’t look great, but let’s have a quick look at what happens if we change the threshold.
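A sketch of the threshold sweep, continuing from the `sentence_score` column above (the threshold values are illustrative):

```python
from sklearn.metrics import precision_recall_fscore_support

for threshold in (0.05, 0.1, 0.25, 0.5, 0.75):
    preds = (test_df["sentence_score"] >= threshold).astype(int)
    p, r, f1, _ = precision_recall_fscore_support(
        test_df["label"], preds, average="weighted")
    print(f"threshold={threshold:.2f}  precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}")
```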

Interestingly, if any sentence at all is labelled as spam, you can probably classify the entire post as spam. Doing this gives fairly similar results to analysing the post as a whole, but practically speaking it is not improving detection.

We could look at training at a sentence level, but that would mean another hefty labelling task. We could also look at tokenising at a paragraph level instead of a sentence level, or considering this only for posts over a certain length. For now I am going to look specifically at the posts that were mistakenly labelled as spam by the standard model, and whether these could be classified more accurately by scoring at the sentence level.

0 1 accuracy macro avg weighted avg
precision 1.000000 0.0 0.647059 0.500000 1.000000
recall 0.647059 0.0 0.647059 0.323529 0.647059
f1-score 0.785714 0.0 0.647059 0.392857 0.785714
support 51.000000 0.0 0.647059 51.000000 51.000000

This is an improvement on the 0% accuracy obtained by the standard model - maybe we could do some sort of combo approach for long posts. I really need to try this on all posts over a certain length, because we won’t know which posts are mislabelled in practice.

The mean length of false-positive posts is 187 words; let’s look at classifying all posts over that length using the sentence-level approach.
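A sketch of that cut-off, reusing the `sentence_score` column from the tokenising sketch above; the 187-word threshold is the one discussed here and the swept values are illustrative:

```python
from sklearn.metrics import precision_recall_fscore_support

LONG_POST_WORDS = 187  # mean word count of the false-positive posts

long_posts = test_df[test_df["word_count"] > LONG_POST_WORDS]
for threshold in (0.05, 0.1, 0.25, 0.5):
    preds = (long_posts["sentence_score"] >= threshold).astype(int)
    p, r, f1, _ = precision_recall_fscore_support(
        long_posts["label"], preds, average="weighted")
    print(f"threshold={threshold:.2f}  f1={f1:.3f}  recall={r:.3f}")
```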

[chart: f1 and recall across thresholds for posts over 187 words, by post length]

The best f1 score and recall combination comes at a threshold of 0.05; this is only slightly better than the f1 score and recall we achieve with the standard model (table below). Technically it looks like we can get better results if we use the sentence-level approach for longer posts, but it might not be enough of an improvement to make it worth it.

not_spam spam accuracy macro avg weighted avg
precision 0.870968 0.842593 0.848921 0.856780 0.851575
recall 0.613636 0.957895 0.848921 0.785766 0.848921
f1-score 0.720000 0.896552 0.848921 0.808276 0.840665
support 44.000000 95.000000 0.848921 139.000000 139.000000

Trial different dataset ablations

Remove reports and articles

We have discussed at length how hard these are to label as a human, never mind as a machine. Does the addition of these posts confuse the model? And so, does removing them improve the model?
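A sketch of the ablation, assuming each training post carries its fine-grained tags as a list in a `classification_label` column (the exact filtering depends on how multi-label rows are actually stored):

```python
# drop every training post tagged as a report or an article link
ablation_labels = {"report", "article_link"}
mask = train_full["classification_label"].apply(
    lambda tags: not ablation_labels.intersection(tags))
train_no_reports = train_full[mask]
print(f"kept {len(train_no_reports)} of {len(train_full)} training posts")
```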

First, let’s look at training over two epochs.

[358/358 06:21, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
36 0.552100 0.398188 0.846875 0.845460 0.854158 0.846875
72 0.384600 0.412332 0.825000 0.823286 0.847440 0.825000
108 0.359300 0.324502 0.871875 0.871756 0.872064 0.871875
144 0.335600 0.338327 0.865625 0.864793 0.870204 0.865625
180 0.332900 0.323204 0.865625 0.865619 0.869093 0.865625
216 0.267300 0.317566 0.881250 0.881278 0.881375 0.881250
252 0.279100 0.333547 0.873958 0.874010 0.875657 0.873958
288 0.259500 0.321211 0.876042 0.876067 0.878697 0.876042
324 0.249900 0.300097 0.886458 0.886463 0.886470 0.886458


This doesn’t seem to be improving the model, and the loss curve looks like it could converge further. Let’s try over three epochs.

[537/537 09:50, Epoch 3/3]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
36 0.578400 0.357010 0.840625 0.840231 0.841607 0.840625
72 0.438500 0.441831 0.812500 0.810153 0.839011 0.812500
108 0.375500 0.321627 0.870833 0.870703 0.871058 0.870833
144 0.348900 0.330973 0.864583 0.863523 0.870868 0.864583
180 0.333000 0.295100 0.875000 0.874831 0.875426 0.875000
216 0.271300 0.402192 0.872917 0.872966 0.874722 0.872917
252 0.343600 0.346152 0.869792 0.869826 0.872175 0.869792
288 0.284800 0.391758 0.844792 0.844039 0.858769 0.844792
324 0.252200 0.291041 0.880208 0.880253 0.882133 0.880208
360 0.230600 0.331076 0.879167 0.879167 0.882539 0.879167
396 0.211000 0.368974 0.869792 0.869743 0.874242 0.869792
432 0.195900 0.338604 0.881250 0.881296 0.883065 0.881250
468 0.208300 0.317228 0.888542 0.888571 0.888696 0.888542
504 0.183500 0.333172 0.891667 0.891676 0.891693 0.891667


Model metrics are looking stronger here, but the model does look like it might be overfitting… I’m going to try adjusting weight decay and learning rate.

I’m using optuna to optimise the training hyperparameters; the outputs below are from a study varying weight decay and learning rate to try to combat overfitting.
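A sketch of the study, assuming a hypothetical `train_model` helper that wraps the Trainer setup from the cross-validation sketch and returns the evaluation f1 (the search ranges are illustrative):

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)
    weight_decay = trial.suggest_float("weight_decay", 0.01, 0.3)
    # train_model is a hypothetical helper wrapping the Trainer setup above,
    # returning the weighted f1 on the evaluation split
    return train_model(learning_rate=learning_rate, weight_decay=weight_decay)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=8)  # one trial per run shown below
print(study.best_params)
```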

[358/358 05:01, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.495600 0.353252 0.856250 0.855878 0.857428 0.856250
100 0.364200 0.330188 0.878125 0.877704 0.880221 0.878125
150 0.340400 0.313633 0.872917 0.872966 0.873289 0.872917
200 0.297000 0.288736 0.881250 0.881149 0.881424 0.881250
250 0.241600 0.349093 0.870833 0.870864 0.873346 0.870833
300 0.242900 0.315722 0.876042 0.876082 0.878193 0.876042
350 0.202000 0.303394 0.887500 0.887533 0.887690 0.887500

[358/358 04:52, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.576000 0.396333 0.831250 0.831300 0.833437 0.831250
100 0.388800 0.332276 0.867708 0.867744 0.867871 0.867708
150 0.344800 0.321822 0.871875 0.871529 0.873265 0.871875
200 0.301300 0.309843 0.870833 0.870895 0.871871 0.870833
250 0.253100 0.343132 0.866667 0.866678 0.869704 0.866667
300 0.282500 0.337678 0.870833 0.870845 0.873884 0.870833
350 0.246800 0.312712 0.878125 0.878130 0.878138 0.878125

[358/358 04:52, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.493500 0.377654 0.855208 0.855136 0.855214 0.855208
100 0.365700 0.330696 0.872917 0.872891 0.872896 0.872917
150 0.337200 0.348639 0.840625 0.840196 0.850151 0.840625
200 0.306300 0.294552 0.879167 0.879195 0.879292 0.879167
250 0.243400 0.340071 0.869792 0.869808 0.872700 0.869792
300 0.255500 0.327951 0.878125 0.878094 0.882285 0.878125
350 0.203800 0.307104 0.883333 0.883379 0.883701 0.883333

[358/358 05:09, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.478000 0.356945 0.856250 0.856002 0.856831 0.856250
100 0.359300 0.321177 0.869792 0.869709 0.869856 0.869792
150 0.352600 0.309830 0.868750 0.868724 0.868728 0.868750
200 0.300100 0.283371 0.883333 0.883215 0.883591 0.883333
250 0.231200 0.337918 0.884375 0.884422 0.886084 0.884375
300 0.246100 0.331095 0.882292 0.882323 0.884702 0.882292
350 0.211000 0.302979 0.886458 0.886511 0.887119 0.886458

[358/358 05:01, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.475900 0.347312 0.859375 0.858926 0.861061 0.859375
100 0.379700 0.335153 0.875000 0.874446 0.877978 0.875000
150 0.336400 0.322408 0.867708 0.867689 0.871492 0.867708
200 0.299500 0.298249 0.880208 0.880264 0.880871 0.880208
250 0.227400 0.339285 0.878125 0.878130 0.881345 0.878125
300 0.245600 0.326882 0.886458 0.886495 0.888628 0.886458
350 0.215300 0.307947 0.885417 0.885468 0.886012 0.885417

[358/358 05:03, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.489200 0.364529 0.856250 0.854830 0.864523 0.856250
100 0.392000 0.333931 0.879167 0.878826 0.880715 0.879167
150 0.340500 0.325551 0.864583 0.864645 0.865186 0.864583
200 0.307900 0.291423 0.884375 0.884115 0.885484 0.884375
250 0.222000 0.329435 0.872917 0.872971 0.873396 0.872917
300 0.237600 0.320021 0.876042 0.876047 0.879254 0.876042
350 0.202000 0.303711 0.887500 0.887548 0.887973 0.887500

[358/358 05:00, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.536200 0.371610 0.844792 0.844842 0.845040 0.844792
100 0.390600 0.339289 0.865625 0.865431 0.866071 0.865625
150 0.340300 0.319113 0.860417 0.860458 0.860619 0.860417
200 0.308900 0.298684 0.880208 0.880203 0.880199 0.880208
250 0.242800 0.340142 0.869792 0.869826 0.872175 0.869792
300 0.273500 0.335391 0.875000 0.875029 0.877522 0.875000
350 0.238700 0.311230 0.880208 0.880232 0.880304 0.880208

[358/358 05:02, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.604000 0.445440 0.820833 0.820919 0.821717 0.820833
100 0.399000 0.336848 0.864583 0.864424 0.864876 0.864583
150 0.337000 0.325152 0.859375 0.859427 0.859705 0.859375
200 0.318400 0.300137 0.879167 0.878981 0.879725 0.879167
250 0.258500 0.342617 0.859375 0.859393 0.862251 0.859375
300 0.285000 0.329623 0.866667 0.866706 0.868921 0.866667
350 0.254500 0.307795 0.872917 0.872966 0.873289 0.872917


We are getting relatively good f1 results but how are the loss curves looking? Is the model still overfitting?


It looks like we are still overfitting the data in every iteration. We might have to look at this a bit more in the end, but we can probably draw the conclusion that training the model on reports and articles is not reducing the model’s ability to predict spam.

Remove Hateful posts

I think this is probably so low-volume that it’s not really affecting the model.

Training on 2 epochs with no hateful content in the training data:

[376/376 07:17, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
37 0.588500 0.426125 0.824579 0.824819 0.825773 0.824579
74 0.428000 0.332770 0.866204 0.865103 0.869715 0.866204
111 0.417100 0.316493 0.868186 0.867404 0.870163 0.868186
148 0.324100 0.351221 0.859267 0.859528 0.863360 0.859267
185 0.338200 0.294790 0.875124 0.874405 0.877098 0.875124
222 0.303300 0.357434 0.859267 0.859507 0.865044 0.859267
259 0.280900 0.320169 0.880079 0.880298 0.882882 0.880079
296 0.250000 0.274743 0.890981 0.890749 0.891227 0.890981
333 0.191600 0.304093 0.886026 0.886171 0.886946 0.886026
370 0.299600 0.298144 0.884044 0.884191 0.884967 0.884044


This is probably one of the best f1 scores we’ve obtained for spam classification. Precision is very strong at 90% and recall is similar to other models at ~88%. That said, it’s not an improvement on the standard model.

Removing slop

If we train over 3 epochs:

I wouldn’t say this has really improved the model.

If we train over 4 epochs:

4 epochs definitely overfits the data.

Fine tuning a model already trained for spam identification

Fine-tune mrm8488/bert-tiny-finetuned-sms-spam-detection
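Swapping in the pretrained checkpoint is a one-line change from the roberta setup; a sketch:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "mrm8488/bert-tiny-finetuned-sms-spam-detection"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
# then fine-tune exactly as before, reusing the Trainer setup from the
# cross-validation sketch earlier in this report
```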

precision recall f1-score support
not_spam 0.817746 0.736501 0.775000 463.000000
spam 0.795987 0.862319 0.827826 552.000000
accuracy 0.804926 0.804926 0.804926 0.804926
macro avg 0.806866 0.799410 0.801413 1015.000000
weighted avg 0.805912 0.804926 0.803729 1015.000000

These definitely aren’t an improvement on the standard model; we can try training a different base model to see if that improves anything.

Fine-tune skandavivek2/spam-classifier

precision recall f1-score support
not_spam 0.817822 0.892009 0.853306 463.000000
spam 0.901961 0.833333 0.866290 552.000000
accuracy 0.860099 0.860099 0.860099 0.860099
macro avg 0.859891 0.862671 0.859798 1015.000000
weighted avg 0.863580 0.860099 0.860367 1015.000000

This is overfitting massively and only providing mediocre results; I think we can put this idea to bed.

Freezing neural net layers

There is an argument that we should be utilising more of the training already done on the language models we fine-tune. Here I am going to investigate whether freezing every layer bar the decision head helps with training.
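A sketch of the freeze, assuming the same roberta setup as before; everything except the classification head has `requires_grad` switched off:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)

# freeze the pretrained encoder; only the classification head stays trainable
for param in model.roberta.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```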

precision recall f1-score support
not_spam 0.841649 0.838013 0.839827 463.000000
spam 0.864621 0.867754 0.866184 552.000000
accuracy 0.854187 0.854187 0.854187 0.854187
macro avg 0.853135 0.852883 0.853006 1015.000000
weighted avg 0.854142 0.854187 0.854161 1015.000000

While the accuracy of this model isn’t as good as the standard model’s, the loss curve is quite promising; could this perform better than higher-f1 models on the validation data?

This model was trained on 2 epochs; I am going to try training on 3 epochs to see if it helps.

precision recall f1-score support
not_spam 0.836559 0.840173 0.838362 463.000000
spam 0.865455 0.862319 0.863884 552.000000
accuracy 0.852217 0.852217 0.852217 0.852217
macro avg 0.851007 0.851246 0.851123 1015.000000
weighted avg 0.852274 0.852217 0.852242 1015.000000

The third epoch doesn’t appear to have improved much. I can try changing the batch size and learning rate.

precision recall f1-score support
not_spam 0.851415 0.779698 0.813980 463.000000
spam 0.827411 0.885870 0.855643 552.000000
accuracy 0.837438 0.837438 0.837438 0.837438
macro avg 0.839413 0.832784 0.834811 1015.000000
weighted avg 0.838361 0.837438 0.836638 1015.000000

This model was again trained on 2 epochs, and it looks like it needed more time to converge. The larger batch size and learning rate don’t appear to have improved the predictive ability of the model.

Improving the standard model

It looks like the standard model is still largely the best-performing model, so we should spend some time tuning its hyperparameters.

First, I’m going to run an optuna study to see if we can optimise the hyperparameters.

Over 5 trials, the f1 score maxes out just over 88%, and none of the trial loss curves look particularly promising.

Runs 1 and 3 have the best results in terms of f1. Run 3 is completely overfit; run 1 probably isn’t perfect but is a bit better. We will be able to judge that better when we evaluate on the validation set.

I am going to do some more manual tuning to see if I can find a better model.

After a little playing around with hyperparameters, we can land on a model with an f1 score of 87.3%, which is not the best f1 score we have achieved, but one with a good loss curve. It is worth noting models like this, which might perform better in practice.

Conclusion

If we review the theories we tested:

    • Can we create a better model? Yes - training on the relabelled data produced a better model across the board.
    • Should we label more data? Yes, we definitely should, but the relabelled articles and reports are still struggling, with the relabelled group only reaching 77% accuracy. Relabelling has really improved precision but has not helped recall; which do we deem more important? The categories still struggling are:
        • reports and articles
        • quotes
        • bad scraping
        • hateful, petition, stocks
    • Are any categories confusing the model? Having tried removing slop, articles and reports, and hateful content, it doesn’t look like removing any of them improves model metrics, so it doesn’t look like any are confusing the model.
    • Does fine-tuning a model already trained for spam identification help? No; having tried two such models, this did not improve results.
    • Does freezing the pretrained layers help? It doesn’t improve f1 on the test data, but it does create a much nicer loss curve, so maybe it would be better at predicting the eval set - I will investigate that today.

Spoiler alert, investigating these theories brought about some new theories and so here is a list of other things I ended up investigating:

    • Tokenising posts into sentences: this does not improve the overall model. It did look like it might be helpful for classifying longer posts, but it doesn’t look like it results in a significant improvement over the standard model. That said, it is longer posts that the model most frequently mislabels as spam, so this is definitely something we need to address.
    • Precision vs recall: my working theory was that a higher recall would be OK if the mistakes the model was making were just classifying other low-quality posts as spam, but it looks like we would just be removing a lot of long posts.

Some future tests we should run:

- [ ] We looked at classifying longer posts at a sentence level and didn’t really improve results, but maybe we should look specifically at longer posts that are classified as spam and pass them through a second filter.
- [ ] Maybe we need a heuristic filter for slop. This is frequently being classified as not spam, and more data isn’t massively affecting results. Filters like the proportion of stopwords per sentence might be good here.
- [ ] We need to pick out our best models and run them over the validation set to see which perform best.
- [ ] Unfortunately, we need to label more data.
- [ ] I still haven’t tested model accuracy if data is run through heuristic filters before modelling.