Model Training Results

Background

We began by fine-tuning a few of the standard transformer models (4_old_model_evaluation.ipynb) to get an idea of how they were performing. The best f1-scores for each model after hyperparameter tuning are shown below.

Model F1-Score
bert-base-cased 0.864
bert-base-uncased 0.862
distilbert-cased 0.857
distilbert-uncased -
roberta-base 0.865

Although roberta did not achieve results massively better than bert-base-cased, it saw the least hyperparameter tuning (2 iterations), and even at that, both iterations performed better than the best bert model (10 iterations). We decided to move forward with roberta for further testing.

Test Conditions

We had a few theories we wanted to test:

  • Can we create a better model by training on the relabelled data?
  • Should we label more data - would more labelled data improve the model?
  • Are certain hard-to-label categories (reports and articles, quotes, bad scraping, hateful/petition/stocks, slop) confusing the model, and would removing them improve it?
  • Would fine-tuning a model already trained for spam identification improve results?
  • Does freezing every layer bar the decision head help with training?

Spoiler alert, investigating these theories brought about some new theories, so here is a list of other things I ended up investigating:

  • Does sentence length affect spam classification, and does tokenising long posts into sentences help?
  • Would we rather a high precision or a high recall, given where the mistakes are being made?

A note on the relabelled data: each relabelled post contains either a report, an article link, a quote (report) or an announcement. This is the relabelled GPT data, just no longer split by whether it is an article link or not, which may be a little annoying when looking at accuracy at the group level.

Tests

Cross-validation of the og roberta model with og data

We have pretty much settled on roberta being our best-performing model. Previous experiments (3_model_iteration) have shown that, with very little hyperparameter tuning, it is the best-performing model (>86% accuracy and f1). I’m going to do a quick cross-validation test to make sure that these results hold across varied training/testing splits.
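A minimal sketch of the loop behind these runs, assuming a dataframe `df` with `text` and binary `label` columns; the hyperparameters mirror the settings reported in the training-size table further down (2 epochs, batch size 16, learning rate 2e-5, weight decay 0.1), and the variable and output names are illustrative:

```python
import numpy as np
from datasets import Dataset
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted")
    return {"accuracy": accuracy_score(labels, preds),
            "f1": f1, "precision": precision, "recall": recall}

# five stratified folds, matching the five runs shown below
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(df["text"], df["label"])):
    train_ds = Dataset.from_pandas(df.iloc[train_idx]).map(tokenize, batched=True)
    val_ds = Dataset.from_pandas(df.iloc[val_idx]).map(tokenize, batched=True)
    model = AutoModelForSequenceClassification.from_pretrained(
        "roberta-base", num_labels=2)  # fresh weights for every fold
    args = TrainingArguments(
        output_dir=f"cv_fold_{fold}", num_train_epochs=2,
        per_device_train_batch_size=16, learning_rate=2e-5, weight_decay=0.1,
        evaluation_strategy="steps", eval_steps=50, logging_steps=50)
    Trainer(model=model, args=args, train_dataset=train_ds,
            eval_dataset=val_ds, compute_metrics=compute_metrics).train()
```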

[402/402 06:37, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.554600 0.340840 0.868323 0.867751 0.869972 0.868323
100 0.406000 0.324208 0.874534 0.872914 0.884007 0.874534
150 0.368400 0.311900 0.874534 0.872678 0.886015 0.874534
200 0.353700 0.312466 0.867081 0.867233 0.869958 0.867081
250 0.326900 0.304515 0.868323 0.868323 0.868323 0.868323
300 0.302700 0.288748 0.878261 0.877374 0.882388 0.878261
350 0.238800 0.289207 0.886957 0.886486 0.888694 0.886957
400 0.290100 0.285096 0.886957 0.886676 0.887616 0.886957

[404/404 06:37, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.548500 0.408954 0.829602 0.829485 0.829577 0.829602
100 0.347600 0.393902 0.840796 0.838168 0.854970 0.840796
150 0.380600 0.357313 0.854478 0.853830 0.856862 0.854478
200 0.342400 0.350081 0.839552 0.837915 0.846956 0.839552
250 0.287100 0.417664 0.838308 0.835956 0.850190 0.838308
300 0.286800 0.364391 0.858209 0.858190 0.858183 0.858209
350 0.242500 0.374562 0.847015 0.846036 0.850979 0.847015
400 0.297800 0.358485 0.842040 0.840819 0.847116 0.842040

[404/404 06:37, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.526500 0.454144 0.819652 0.813087 0.841876 0.819652
100 0.397800 0.353869 0.855721 0.854914 0.856552 0.855721
150 0.338800 0.348409 0.861940 0.862176 0.863014 0.861940
200 0.382600 0.304005 0.876866 0.876087 0.878353 0.876866
250 0.264500 0.321688 0.876866 0.875549 0.881046 0.876866
300 0.299600 0.295096 0.878109 0.877539 0.878851 0.878109
350 0.273100 0.291737 0.873134 0.873022 0.873008 0.873134
400 0.278700 0.291532 0.870647 0.870258 0.870759 0.870647

[404/404 06:39, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.549500 0.463595 0.809701 0.803958 0.821194 0.809701
100 0.397300 0.380256 0.840796 0.840902 0.841051 0.840796
150 0.391100 0.390431 0.825871 0.826573 0.830229 0.825871
200 0.329000 0.419239 0.838308 0.838251 0.838205 0.838308
250 0.346200 0.343753 0.858209 0.856757 0.860209 0.858209
300 0.282600 0.355700 0.864428 0.864451 0.864477 0.864428
350 0.256800 0.355219 0.865672 0.865624 0.865588 0.865672
400 0.233400 0.368634 0.864428 0.864404 0.864383 0.864428

[404/404 06:40, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.594100 0.380333 0.830846 0.831327 0.833291 0.830846
100 0.375200 0.426684 0.818408 0.819108 0.827188 0.818408
150 0.382000 0.360243 0.843284 0.843422 0.843655 0.843284
200 0.344800 0.332534 0.861940 0.860143 0.866251 0.861940
250 0.270400 0.353116 0.854478 0.854735 0.855442 0.854478
300 0.260300 0.360021 0.853234 0.853572 0.854769 0.853234
350 0.289400 0.357653 0.848259 0.848714 0.850946 0.848259
400 0.253600 0.338669 0.863184 0.862265 0.864172 0.863184

There is some variation in the metrics across runs, which suggests that the training sample used has some effect on model accuracy. There is also a little variation in the stability of the loss curves across runs. Interestingly, the most promising loss curves coincide with the two runs with the highest f1-scores.

Another thing I should investigate is what differs across training sets; for now, however, I am prioritising training with the relabelled data, as if this improves the model, we will proceed with that anyway.

Trial different size datasets (from here on out everything is trained with the relabelled data)

Before training, let’s see what the split of each spam classification is across samples - we would want each classification to be represented at roughly 25%, 50%, 75% and 100% of its available labels.
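The proportions in the table below fall out of a quick groupby; a sketch, assuming each subsample and the full training set are dataframes with a `classification_label` column (the variable names are illustrative):

```python
import pandas as pd

# label counts in each subsample and in the full training set
counts = pd.DataFrame({
    "train_25": train_25["classification_label"].value_counts(),
    "train_50": train_50["classification_label"].value_counts(),
    "train_75": train_75["classification_label"].value_counts(),
    "overall_count": train_full["classification_label"].value_counts(),
}).fillna(0)

# each subsample's share of the available labels; ideally ~0.25 / 0.50 / 0.75
for frac in ("25", "50", "75"):
    counts[f"train{frac}_prop"] = counts[f"train_{frac}"] / counts["overall_count"]

print(counts)
```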

train_25 train_50 train_75 overall_count train25_prop train50_prop train75_prop
classification_label
not_spam 284 582 851 1128 0.251773 0.515957 0.754433
spam 84 161 242 312 0.269231 0.516026 0.775641
promotion 66 125 194 259 0.254826 0.482625 0.749035
crypto 65 136 200 270 0.240741 0.503704 0.740741
slop 59 116 186 258 0.228682 0.449612 0.720930
relabelled 51 97 145 206 0.247573 0.470874 0.703883
article_link 47 108 164 212 0.221698 0.509434 0.773585
seo 40 76 111 145 0.275862 0.524138 0.765517
announcement 26 50 73 91 0.285714 0.549451 0.802198
image 24 41 65 94 0.255319 0.436170 0.691489
event 19 33 54 73 0.260274 0.452055 0.739726
bad_scraping 19 34 62 80 0.237500 0.425000 0.775000
report 15 24 40 53 0.283019 0.452830 0.754717
quote 8 13 20 28 0.285714 0.464286 0.714286
stock_ticker 3 4 5 6 0.500000 0.666667 0.833333
hateful 2 7 10 11 0.181818 0.636364 0.909091
video 1 5 10 15 0.066667 0.333333 0.666667
aiprompt 1 4 5 8 0.125000 0.500000 0.625000
low_analytical_value() 1 2 4 5 0.200000 0.400000 0.800000
petition 0 1 2 5 0.000000 0.200000 0.400000

With the exception of the petition label in the 25% group, we seem to have an OK representation of each classification in each group.

How training size affects results

Granular look at spam vs not spam

model sample num_epochs batch_size learning_rate weight_decay eval_loss eval_accuracy eval_f1 eval_precision eval_recall eval_runtime eval_samples_per_second eval_steps_per_second epoch name_on_hub
0 roberta train_25 2 16 0.00002 0.1 0.363416 0.847291 0.847572 0.849867 0.847291 12.3891 81.927 5.166 2.0 /roberta_train_n752
1 roberta train_50 2 16 0.00002 0.1 0.314662 0.874877 0.875072 0.876329 0.874877 12.2712 82.714 5.215 2.0 /roberta_train_n1503
2 roberta train_75 2 16 0.00002 0.1 0.321931 0.876847 0.877039 0.878296 0.876847 12.2485 82.868 5.225 2.0 /roberta_train_n2254
3 roberta train_100 2 16 0.00002 0.1 0.288585 0.885714 0.885831 0.886286 0.885714 12.2070 83.149 5.243 2.0 /roberta_train_n3006

We can see that all metrics improved as training size increased, though there is a less dramatic increase between the 50% and 75% sample sizes. Let’s look at it on a graph:

As we saw in the dataframe, all metrics improve with increasing training size. Precision is consistently the highest-scoring evaluation metric.

If we look at this at the spam-type level (article link, slop, …), does accuracy improve across all categories as training size increases? There are some spam types with a fairly low sample size to begin with; these likely won’t see much improvement with increasing sample size, as their counts stay low in every subsample, but maybe we have reached saturation with other groups.
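A sketch of that per-category view, assuming a test dataframe with the fine-grained `classification_label`, the binary ground-truth `label` and the model's `pred` columns (all names illustrative):

```python
# accuracy within each fine-grained spam category for one trained model;
# repeating this for each training size gives the per-group comparison below
per_group_accuracy = (
    test_df.assign(correct=test_df["pred"] == test_df["label"])
           .groupby("classification_label")["correct"]
           .mean()
           .sort_values()
)
print(per_group_accuracy)
```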

group classification across training size

For the sake of showing what I’m talking about here, the above is a screenshot of an interactive HTML chart that should be rendered here. I wouldn’t bother trying to interpret it; the main conclusions are summarised below:

More data does seem to largely improve the accuracy of identifying spam categories. Some areas we are struggling and may need to get more labelled data or consider other routes to accuracy:

  • relabelled (max ~77% accuracy)
  • announcements - more data is not improving metrics, with a max accuracy of ~87% at a training size of only 50% of the full training data.
  • report - this is also encompassed in the relabelled data; it improves with more data but maxes out at ~70%.
  • quote - also somewhat overlapping with the relabelled data (max ~68%).
  • bad scraping - this does significantly improve with larger training samples; that said, we have a max of only 33 “bad scraping” posts. Should this be included here at all, or is bad scraping too broad a category to give to the model? Is this something we can tackle at source?
  • hateful, petition, stocks - really poor classification, but also really small sample sizes overall. Should hateful be included here at all, or is it too broad?

We can see that the model is improving with more data, so it seems like a good argument to source more labelled data.

Let’s have a quick look to compare how the model trained on the new labelled data compares to the old model and see if we can draw any conclusions from that.

Comparing the original and relabelled model

Let’s have a look at some of the more telling metrics on both a spam label and overall model level:

old_data_no_spam old_data_spam new_data_no_spam new_data_spam
precision 0.885366 0.864463 0.863732 0.905204
recall 0.815730 0.917544 0.889849 0.882246
f1-score 0.849123 0.890213 0.876596 0.893578
support 445.000000 570.000000 463.000000 552.000000

old_data_weighted avg new_data_weighted avg
precision 0.873627 0.886286
recall 0.872906 0.885714
f1-score 0.872198 0.885831
support 1015.000000 1015.000000

It looks like the newly labelled data is resulting in a better model across the board, but it would be interesting to see what effect the change in label on the report and article data has at the spam-grouping level.

classification_label precision_not_spam_og_model precision_spam_og_model recall_not_spam_og_model recall_spam_og_model f1_not_spam_og_model f1_spam_og_model precision_not_spam_relabelled precision_spam_relabelled recall_not_spam_relabelled recall_spam_relabelled f1_not_spam_relabelled f1_spam_relabelled 0_relabelled 1_relabelled 0_og_model 1_og_model
0 aiprompt 1.000000 NaN 1.000000 NaN 1.000000 NaN 1.000000 NaN 1.000000 NaN 1.000000 NaN 0.0 3.0 0.0 3.0
1 announcement 0.333333 0.804878 0.200000 0.891892 0.250000 0.846154 0.375000 0.956522 0.750000 0.814815 0.500000 0.880000 4.0 27.0 10.0 37.0
2 article_link 0.450000 0.785714 0.300000 0.875000 0.360000 0.827957 0.526316 1.000000 1.000000 0.876712 0.689655 0.934307 10.0 73.0 30.0 88.0
3 bad_scraping 0.750000 0.837838 0.333333 0.968750 0.461538 0.898551 0.411765 0.875000 0.777778 0.583333 0.538462 0.700000 9.0 24.0 9.0 32.0
4 bot 1.000000 NaN 1.000000 NaN 1.000000 NaN 1.000000 NaN 1.000000 NaN 1.000000 NaN 0.0 1.0 0.0 1.0
5 crypto 0.333333 1.000000 1.000000 0.976744 0.500000 0.988235 0.250000 1.000000 1.000000 0.964706 0.400000 0.982036 1.0 85.0 1.0 86.0
6 event 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.500000 1.000000 1.000000 0.958333 0.666667 0.978723 1.0 24.0 1.0 24.0
7 hateful 0.333333 0.000000 1.000000 0.000000 0.500000 0.000000 0.333333 0.000000 1.000000 0.000000 0.500000 0.000000 2.0 4.0 2.0 4.0
8 image 0.600000 0.944444 0.750000 0.894737 0.666667 0.918919 0.666667 1.000000 1.000000 0.894737 0.800000 0.944444 4.0 19.0 4.0 19.0
9 low_analytical_value() 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.0 0.0 1.0 0.0
10 news_report() 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
11 not_spam 1.000000 0.000000 0.889474 0.000000 0.941504 0.000000 1.000000 0.000000 0.915567 0.000000 0.955923 0.000000 379.0 0.0 380.0 0.0
12 petition 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 1.0 0.0 1.0
13 promotion 0.200000 0.970874 0.250000 0.961538 0.222222 0.966184 0.222222 0.989796 0.666667 0.932692 0.333333 0.960396 3.0 104.0 4.0 104.0
14 quote 0.625000 0.888889 0.833333 0.727273 0.714286 0.800000 0.545455 1.000000 1.000000 0.500000 0.705882 0.666667 6.0 10.0 6.0 11.0
15 report 0.333333 0.781250 0.125000 0.925926 0.181818 0.847458 0.500000 0.818182 0.600000 0.750000 0.545455 0.782609 5.0 12.0 8.0 27.0
16 seo 0.000000 0.964286 0.000000 0.931034 0.000000 0.947368 0.000000 0.980769 0.000000 0.910714 0.000000 0.944444 1.0 56.0 2.0 58.0
17 slop 0.200000 1.000000 1.000000 0.902439 0.333333 0.948718 0.166667 1.000000 1.000000 0.878049 0.285714 0.935065 2.0 82.0 2.0 82.0
18 spam 0.000000 1.000000 0.000000 0.932692 0.000000 0.965174 0.000000 1.000000 0.000000 0.923077 0.000000 0.960000 0.0 104.0 0.0 104.0
19 stock_ticker 1.000000 NaN 1.000000 NaN 1.000000 NaN 0.000000 1.000000 0.000000 0.500000 0.000000 0.666667 0.0 2.0 0.0 2.0
20 stocks 0.000000 1.000000 0.000000 0.500000 0.000000 0.666667 0.000000 1.000000 0.000000 0.500000 0.000000 0.666667 0.0 2.0 0.0 2.0
21 video 1.000000 NaN 1.000000 NaN 1.000000 NaN 1.000000 NaN 1.000000 NaN 1.000000 NaN 0.0 3.0 0.0 3.0

The table above is exhaustive and hard to digest; we can filter to the specific metrics we want to look at.

classification_label precision_spam_og_model precision_spam_relabelled 1_og_model 1_relabelled
0 aiprompt NaN NaN 3.0 3.0
1 announcement 0.804878 0.956522 37.0 27.0
2 article_link 0.785714 1.000000 88.0 73.0
3 bad_scraping 0.837838 0.875000 32.0 24.0
4 bot NaN NaN 1.0 1.0
5 crypto 1.000000 1.000000 86.0 85.0
6 event 1.000000 1.000000 24.0 24.0
7 hateful 0.000000 0.000000 4.0 4.0
8 image 0.944444 1.000000 19.0 19.0
9 low_analytical_value() 0.000000 0.000000 0.0 0.0
10 news_report() 0.000000 NaN NaN NaN
11 not_spam 0.000000 0.000000 0.0 0.0
12 petition 0.000000 0.000000 1.0 1.0
13 promotion 0.970874 0.989796 104.0 104.0
14 quote 0.888889 1.000000 11.0 10.0
15 report 0.781250 0.818182 27.0 12.0
16 seo 0.964286 0.980769 58.0 56.0
17 slop 1.000000 1.000000 82.0 82.0
18 spam 1.000000 1.000000 104.0 104.0
19 stock_ticker NaN 1.000000 2.0 2.0
20 stocks 1.000000 1.000000 2.0 2.0
21 video NaN NaN 3.0 3.0

Instances of NaN represent cases where the denominator of the precision formula is 0, i.e. the model never labelled a post in that category as spam, so there are no true-positive or false-positive cases and precision is undefined.

Video, bot and AI-prompt posts are all cases where the model never recognises a post as spam. They are all represented by three or fewer posts in the test set (and similarly low sample sizes in the training set), so perhaps more training data for these categories would improve precision.

The metrics for the spam and not_spam classification categories here represent scenarios where there are no labelled instances of the inverse case (not_spam or spam respectively), and so precision for the inverse can never be anything but 0, as there are no true-positive cases.

\[\text{Precision} = \frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}\]

The great news here is that, in terms of precision of spam classification, the relabelled data has improved classification across the board!

It is still somewhat up for debate whether we would rather have a high precision or a high recall. I think I would be in favour of a slightly higher recall, depending on where the mistakes are being made, but this is a subjective goal.

If we look at the recall, the relabelled data has really reduced this metric.

Specifically looking at where we make mistakes

First, let’s look at the cases where spam is being classified as not spam by our model.
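These counts come from a simple filter on the test predictions; a sketch, assuming the same illustrative `test_df` with binary `label`/`pred` columns where 1 means spam:

```python
# posts labelled spam (1) but predicted not_spam (0), counted per category
false_negatives = test_df[(test_df["label"] == 1) & (test_df["pred"] == 0)]
print(false_negatives.groupby("classification_label")["text"].count())
```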

text
classification_label
announcement 5
article_link 9
bad_scraping 10
crypto 3
event 1
hateful 4
image 2
petition 1
promotion 7
quote 5
relabelled 2
report 3
seo 5
slop 10
spam 8
stock_ticker 1
stocks 1

I should probably look at this as a percentage of overall category volume, but it looks like slop, bad_scraping, promotion and article_link are the categories most likely to be labelled as not spam when they are actually spam.

If we look then at where not spam is being classified as spam: this, to me, is an interesting area because if these are all low-quality posts, maybe a high recall isn’t the end of the world.

text
classification_label
announcement 1
bad_scraping 2
low_analytical_value() 1
not_spam 32
promotion 1
relabelled 13
report 2
seo 1

Most of the mistakes are being made on posts tagged only as not spam and on the relabelled article and report data. The relabelled data is data that we know is difficult to classify, so I am not massively worried about it. However, I want to see if there are any notable trends in the other not-spam posts classified as spam.

I’ve hidden the table from here as it’s a bit long, but it looks like it is longer posts that are being misclassified as spam; we will see a few metrics pertaining to this in the next section. This leads us nicely into another experiment: does sentence length affect spam classification with our model?

How sentence length affects model classification

Let’s first look at overall word count distribution

[histogram: word_count distribution across the full dataset]

and then look at how that compares to the word count distribution for all mislabelled posts (both false positive and negative).

[histogram: word_count distribution for mislabelled posts]

We would expect the correctly labelled and mislabelled data to have the same distribution of word count as the original data. I’m going to perform a t-test to confirm that this is true, transforming the data first because the distribution is skewed:
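A sketch of one of these comparisons, assuming `word_count` columns on the full and mislabelled dataframes (names illustrative); the log transform pulls in the right skew before the two-sample t-test:

```python
import numpy as np
from scipy import stats

# log-transform to pull in the right skew of post length
full_wc = np.log(df["word_count"] + 1)
mislabelled_wc = np.log(mislabelled["word_count"] + 1)

# two-sample t-test on mean (log) post length
print(stats.ttest_ind(mislabelled_wc, full_wc))
```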

TtestResult(statistic=-0.3802876800689237, pvalue=0.7037961797341186, df=1256.0)
TtestResult(statistic=0.10712299765143264, pvalue=0.9147015195217277, df=2124.0)

OK, it doesn’t look like there is any significant difference in the mean post length between the full dataset and the mislabelled posts, but what if we look specifically at posts that are incorrectly labelled as spam (where we have a high recall)?

TtestResult(statistic=-3.5737775575675075, pvalue=0.0003660816017693208, df=1179.0)
count     53.000000
mean     181.396226
std      219.505505
min        7.000000
25%       32.000000
50%       42.000000
75%      314.000000
max      951.000000
Name: word_count, dtype: float64

So it seems fairly certain that longer posts are more likely to be mistaken as spam - maybe we should be tokenising when we train?

Tokenising the text

I’m going to try tokenising into sentences and using our current model to classify those sentences, getting a score for each doc based on the aggregate.

I have created a score for each document which is the proportion of sentences in that document classified as spam. Let’s set everything with a score below 0.5 as not_spam and everything above 0.5 as spam and see what the accuracy metrics look like.
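A sketch of that scoring, assuming the fine-tuned model is loadable through a transformers pipeline; the model path and the positive label name are assumptions:

```python
import nltk
from nltk.tokenize import sent_tokenize
from transformers import pipeline

nltk.download("punkt")

# hypothetical path to our fine-tuned roberta checkpoint
classifier = pipeline("text-classification", model="path/to/roberta_spam")

def spam_score(text: str) -> float:
    """Proportion of a document's sentences classified as spam."""
    sentences = sent_tokenize(text)
    if not sentences:
        return 0.0
    preds = classifier(sentences, truncation=True)
    # "LABEL_1" assumed to be the spam class; depends on the model config
    return sum(p["label"] == "LABEL_1" for p in preds) / len(sentences)

test_df["sentence_score"] = test_df["text"].apply(spam_score)
test_df["sentence_pred"] = (test_df["sentence_score"] >= 0.5).astype(int)
```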

0 1 accuracy macro avg weighted avg
precision 0.684127 0.916883 0.772414 0.800505 0.810710
recall 0.930886 0.639493 0.772414 0.785189 0.772414
f1-score 0.788655 0.753469 0.772414 0.771062 0.769519
support 463.000000 552.000000 0.772414 1015.000000 1015.000000

This doesn’t look great, but let’s have a quick look at what happens if we change the threshold.
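A sketch of the threshold sweep, continuing from the `sentence_score` column above (the threshold values are illustrative):

```python
from sklearn.metrics import precision_recall_fscore_support

for threshold in (0.05, 0.1, 0.25, 0.5, 0.75):
    preds = (test_df["sentence_score"] >= threshold).astype(int)
    p, r, f1, _ = precision_recall_fscore_support(
        test_df["label"], preds, average="weighted")
    print(f"threshold={threshold:.2f}  precision={p:.3f}  recall={r:.3f}  f1={f1:.3f}")
```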

Interestingly, if any sentence at all is labelled as spam, you can probably classify the entire post as spam. Doing this gives fairly similar results to analysing the post as a whole, but practically speaking it is not improving detection.

We could look at training at a sentence level, but that would mean another hefty labelling task. We could also look at tokenising at a paragraph level instead of a sentence level, or considering this only for posts over a certain length. For now I am going to look specifically at the posts that were mistakenly labelled as spam by the standard model, and whether these could be classified more accurately by scoring at the sentence level.

0 1 accuracy macro avg weighted avg
precision 1.000000 0.0 0.647059 0.500000 1.000000
recall 0.647059 0.0 0.647059 0.323529 0.647059
f1-score 0.785714 0.0 0.647059 0.392857 0.785714
support 51.000000 0.0 0.647059 51.000000 51.000000

This is an improvement on the 0% accuracy obtained by the standard model - maybe we could do some sort of combo approach for long posts. I really need to try this on all posts over a certain length, because we won’t know which posts are mislabelled in practice.

The mean length of false-positive posts is 187 words; let’s look at classifying all posts over that length using the sentence-level approach.
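A sketch of that cut-off, reusing the `sentence_score` column from the tokenising sketch above; the 187-word threshold is the one discussed here and the swept values are illustrative:

```python
from sklearn.metrics import precision_recall_fscore_support

LONG_POST_WORDS = 187  # mean word count of the false-positive posts

long_posts = test_df[test_df["word_count"] > LONG_POST_WORDS]
for threshold in (0.05, 0.1, 0.25, 0.5):
    preds = (long_posts["sentence_score"] >= threshold).astype(int)
    p, r, f1, _ = precision_recall_fscore_support(
        long_posts["label"], preds, average="weighted")
    print(f"threshold={threshold:.2f}  f1={f1:.3f}  recall={r:.3f}")
```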

[chart: f1 and recall across thresholds for posts over 187 words, by post length]

The best f1 score and recall combination comes at a threshold of 0.05; this is only slightly better than the f1 score and recall we achieve with the standard model (table below). Technically it looks like we can get better results if we use the sentence-level approach for longer posts, but it might not be enough of an improvement to make it worth it.

not_spam spam accuracy macro avg weighted avg
precision 0.870968 0.842593 0.848921 0.856780 0.851575
recall 0.613636 0.957895 0.848921 0.785766 0.848921
f1-score 0.720000 0.896552 0.848921 0.808276 0.840665
support 44.000000 95.000000 0.848921 139.000000 139.000000

Trial different dataset ablations

Remove reports and articles

We have discussed at length how hard these are to label as a human, never mind as a machine. Does the addition of these posts confuse the model? And so, does removing them improve the model?
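A sketch of the ablation, assuming each training post carries its fine-grained tags as a list in a `classification_label` column (the exact filtering depends on how multi-label rows are actually stored):

```python
# drop every training post tagged as a report or an article link
ablation_labels = {"report", "article_link"}
mask = train_full["classification_label"].apply(
    lambda tags: not ablation_labels.intersection(tags))
train_no_reports = train_full[mask]
print(f"kept {len(train_no_reports)} of {len(train_full)} training posts")
```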

First, let’s look at training over two epochs.

[358/358 06:21, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
36 0.552100 0.398188 0.846875 0.845460 0.854158 0.846875
72 0.384600 0.412332 0.825000 0.823286 0.847440 0.825000
108 0.359300 0.324502 0.871875 0.871756 0.872064 0.871875
144 0.335600 0.338327 0.865625 0.864793 0.870204 0.865625
180 0.332900 0.323204 0.865625 0.865619 0.869093 0.865625
216 0.267300 0.317566 0.881250 0.881278 0.881375 0.881250
252 0.279100 0.333547 0.873958 0.874010 0.875657 0.873958
288 0.259500 0.321211 0.876042 0.876067 0.878697 0.876042
324 0.249900 0.300097 0.886458 0.886463 0.886470 0.886458


This doesn’t seem to be improving the model, and the loss curve looks like it could converge further. Let’s try over three epochs.

[537/537 09:50, Epoch 3/3]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
36 0.578400 0.357010 0.840625 0.840231 0.841607 0.840625
72 0.438500 0.441831 0.812500 0.810153 0.839011 0.812500
108 0.375500 0.321627 0.870833 0.870703 0.871058 0.870833
144 0.348900 0.330973 0.864583 0.863523 0.870868 0.864583
180 0.333000 0.295100 0.875000 0.874831 0.875426 0.875000
216 0.271300 0.402192 0.872917 0.872966 0.874722 0.872917
252 0.343600 0.346152 0.869792 0.869826 0.872175 0.869792
288 0.284800 0.391758 0.844792 0.844039 0.858769 0.844792
324 0.252200 0.291041 0.880208 0.880253 0.882133 0.880208
360 0.230600 0.331076 0.879167 0.879167 0.882539 0.879167
396 0.211000 0.368974 0.869792 0.869743 0.874242 0.869792
432 0.195900 0.338604 0.881250 0.881296 0.883065 0.881250
468 0.208300 0.317228 0.888542 0.888571 0.888696 0.888542
504 0.183500 0.333172 0.891667 0.891676 0.891693 0.891667


Model metrics are looking stronger here, but the model does look like it might be overfitting… I’m going to try adjusting weight decay and learning rate.

I’m using optuna to optimise the training hyperparameters; the outputs below are from a study varying weight decay and learning rate to try to combat overfitting.
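A sketch of the study, assuming a hypothetical `train_model` helper that wraps the Trainer setup from the cross-validation sketch and returns the evaluation f1 (the search ranges are illustrative):

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    learning_rate = trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True)
    weight_decay = trial.suggest_float("weight_decay", 0.01, 0.3)
    # train_model is a hypothetical helper wrapping the Trainer setup above,
    # returning the weighted f1 on the evaluation split
    return train_model(learning_rate=learning_rate, weight_decay=weight_decay)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=8)  # one trial per run shown below
print(study.best_params)
```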

[358/358 05:01, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.495600 0.353252 0.856250 0.855878 0.857428 0.856250
100 0.364200 0.330188 0.878125 0.877704 0.880221 0.878125
150 0.340400 0.313633 0.872917 0.872966 0.873289 0.872917
200 0.297000 0.288736 0.881250 0.881149 0.881424 0.881250
250 0.241600 0.349093 0.870833 0.870864 0.873346 0.870833
300 0.242900 0.315722 0.876042 0.876082 0.878193 0.876042
350 0.202000 0.303394 0.887500 0.887533 0.887690 0.887500

[358/358 04:52, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.576000 0.396333 0.831250 0.831300 0.833437 0.831250
100 0.388800 0.332276 0.867708 0.867744 0.867871 0.867708
150 0.344800 0.321822 0.871875 0.871529 0.873265 0.871875
200 0.301300 0.309843 0.870833 0.870895 0.871871 0.870833
250 0.253100 0.343132 0.866667 0.866678 0.869704 0.866667
300 0.282500 0.337678 0.870833 0.870845 0.873884 0.870833
350 0.246800 0.312712 0.878125 0.878130 0.878138 0.878125

[358/358 04:52, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.493500 0.377654 0.855208 0.855136 0.855214 0.855208
100 0.365700 0.330696 0.872917 0.872891 0.872896 0.872917
150 0.337200 0.348639 0.840625 0.840196 0.850151 0.840625
200 0.306300 0.294552 0.879167 0.879195 0.879292 0.879167
250 0.243400 0.340071 0.869792 0.869808 0.872700 0.869792
300 0.255500 0.327951 0.878125 0.878094 0.882285 0.878125
350 0.203800 0.307104 0.883333 0.883379 0.883701 0.883333

[358/358 05:09, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.478000 0.356945 0.856250 0.856002 0.856831 0.856250
100 0.359300 0.321177 0.869792 0.869709 0.869856 0.869792
150 0.352600 0.309830 0.868750 0.868724 0.868728 0.868750
200 0.300100 0.283371 0.883333 0.883215 0.883591 0.883333
250 0.231200 0.337918 0.884375 0.884422 0.886084 0.884375
300 0.246100 0.331095 0.882292 0.882323 0.884702 0.882292
350 0.211000 0.302979 0.886458 0.886511 0.887119 0.886458

[358/358 05:01, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.475900 0.347312 0.859375 0.858926 0.861061 0.859375
100 0.379700 0.335153 0.875000 0.874446 0.877978 0.875000
150 0.336400 0.322408 0.867708 0.867689 0.871492 0.867708
200 0.299500 0.298249 0.880208 0.880264 0.880871 0.880208
250 0.227400 0.339285 0.878125 0.878130 0.881345 0.878125
300 0.245600 0.326882 0.886458 0.886495 0.888628 0.886458
350 0.215300 0.307947 0.885417 0.885468 0.886012 0.885417

[358/358 05:03, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.489200 0.364529 0.856250 0.854830 0.864523 0.856250
100 0.392000 0.333931 0.879167 0.878826 0.880715 0.879167
150 0.340500 0.325551 0.864583 0.864645 0.865186 0.864583
200 0.307900 0.291423 0.884375 0.884115 0.885484 0.884375
250 0.222000 0.329435 0.872917 0.872971 0.873396 0.872917
300 0.237600 0.320021 0.876042 0.876047 0.879254 0.876042
350 0.202000 0.303711 0.887500 0.887548 0.887973 0.887500

[358/358 05:00, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.536200 0.371610 0.844792 0.844842 0.845040 0.844792
100 0.390600 0.339289 0.865625 0.865431 0.866071 0.865625
150 0.340300 0.319113 0.860417 0.860458 0.860619 0.860417
200 0.308900 0.298684 0.880208 0.880203 0.880199 0.880208
250 0.242800 0.340142 0.869792 0.869826 0.872175 0.869792
300 0.273500 0.335391 0.875000 0.875029 0.877522 0.875000
350 0.238700 0.311230 0.880208 0.880232 0.880304 0.880208

[358/358 05:02, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
50 0.604000 0.445440 0.820833 0.820919 0.821717 0.820833
100 0.399000 0.336848 0.864583 0.864424 0.864876 0.864583
150 0.337000 0.325152 0.859375 0.859427 0.859705 0.859375
200 0.318400 0.300137 0.879167 0.878981 0.879725 0.879167
250 0.258500 0.342617 0.859375 0.859393 0.862251 0.859375
300 0.285000 0.329623 0.866667 0.866706 0.868921 0.866667
350 0.254500 0.307795 0.872917 0.872966 0.873289 0.872917


We are getting relatively good f1 results but how are the loss curves looking? Is the model still overfitting?


It looks like we are still overfitting the data in every iteration. We might have to look at this a bit more in the end, but we can probably draw the conclusion that training the model on reports and articles is not reducing the model’s ability to predict spam.

Remove Hateful posts

I think this is probably so low-volume that it’s not really affecting the model.

Training on 2 epochs with no hateful content in the training data:

[376/376 07:17, Epoch 2/2]
Step Training Loss Validation Loss Accuracy F1 Precision Recall
37 0.588500 0.426125 0.824579 0.824819 0.825773 0.824579
74 0.428000 0.332770 0.866204 0.865103 0.869715 0.866204
111 0.417100 0.316493 0.868186 0.867404 0.870163 0.868186
148 0.324100 0.351221 0.859267 0.859528 0.863360 0.859267
185 0.338200 0.294790 0.875124 0.874405 0.877098 0.875124
222 0.303300 0.357434 0.859267 0.859507 0.865044 0.859267
259 0.280900 0.320169 0.880079 0.880298 0.882882 0.880079
296 0.250000 0.274743 0.890981 0.890749 0.891227 0.890981
333 0.191600 0.304093 0.886026 0.886171 0.886946 0.886026
370 0.299600 0.298144 0.884044 0.884191 0.884967 0.884044


This is probably one of the best f1 scores we’ve obtained for spam classification. Precision is very strong at 90% and recall is similar to other models at ~88%. That said, it’s not an improvement on the standard model.

Removing slop

If we train over 3 epochs:

I wouldn’t say this has really improved the model.

If we train over 4 epochs:

4 epochs definitely overfits the data.

Fine tuning a model already trained for spam identification

Fine-tune mrm8488/bert-tiny-finetuned-sms-spam-detection
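Swapping in the pretrained checkpoint is a one-line change from the roberta setup; a sketch:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "mrm8488/bert-tiny-finetuned-sms-spam-detection"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
# then fine-tune exactly as before, reusing the Trainer setup from the
# cross-validation sketch earlier in this report
```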

precision recall f1-score support
not_spam 0.817746 0.736501 0.775000 463.000000
spam 0.795987 0.862319 0.827826 552.000000
accuracy 0.804926 0.804926 0.804926 0.804926
macro avg 0.806866 0.799410 0.801413 1015.000000
weighted avg 0.805912 0.804926 0.803729 1015.000000

These definitely aren’t an improvement on the standard model; we can try training a different base model to see if that improves anything.

Fine-tune skandavivek2/spam-classifier

precision recall f1-score support
not_spam 0.817822 0.892009 0.853306 463.000000
spam 0.901961 0.833333 0.866290 552.000000
accuracy 0.860099 0.860099 0.860099 0.860099
macro avg 0.859891 0.862671 0.859798 1015.000000
weighted avg 0.863580 0.860099 0.860367 1015.000000

This is overfitting massively and only providing mediocre results; I think we can put this idea to bed.

Freezing neural net layers

There is an argument that we should be utilising more of the training already done on the language models we fine-tune. Here I am going to investigate whether freezing every layer bar the decision head helps with training.
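A sketch of the freeze, assuming the same roberta setup as before; everything except the classification head has `requires_grad` switched off:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)

# freeze the pretrained encoder; only the classification head stays trainable
for param in model.roberta.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")
```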

precision recall f1-score support
not_spam 0.841649 0.838013 0.839827 463.000000
spam 0.864621 0.867754 0.866184 552.000000
accuracy 0.854187 0.854187 0.854187 0.854187
macro avg 0.853135 0.852883 0.853006 1015.000000
weighted avg 0.854142 0.854187 0.854161 1015.000000

While the accuracy of this model isn’t as good as the standard model’s, the loss curve is quite promising; could this perform better than higher-f1 models on the validation data?

This model was trained on 2 epochs; I am going to try training on 3 epochs to see if it helps.

precision recall f1-score support
not_spam 0.836559 0.840173 0.838362 463.000000
spam 0.865455 0.862319 0.863884 552.000000
accuracy 0.852217 0.852217 0.852217 0.852217
macro avg 0.851007 0.851246 0.851123 1015.000000
weighted avg 0.852274 0.852217 0.852242 1015.000000

The third epoch doesn’t appear to have improved much. I can try changing the batch size and learning rate.

precision recall f1-score support
not_spam 0.851415 0.779698 0.813980 463.000000
spam 0.827411 0.885870 0.855643 552.000000
accuracy 0.837438 0.837438 0.837438 0.837438
macro avg 0.839413 0.832784 0.834811 1015.000000
weighted avg 0.838361 0.837438 0.836638 1015.000000

This model was again trained on 2 epochs, and it looks like it needed more time to converge. The larger batch size and learning rate don’t appear to have improved the predictive ability of the model.

Improving the standard model

It looks like the standard model is still largely the best-performing model, so we should spend some time tuning its hyperparameters.

First, I’m going to run an optuna study to see if we can optimise the hyperparameters.

Over 5 trials, the f1 score maxes out just over 88%, and none of the trial loss curves look particularly promising.

Runs 1 and 3 have the best results in terms of f1. Run 3 is completely overfit; run 1 probably isn’t perfect but is a bit better. We will be able to judge that better when we evaluate on the validation set.

I am going to do some more manual tuning to see if I can find a better model.

After a little playing around with hyperparameters, we can land on a model with an f1 score of 87.3%, which is not the best f1 score we have achieved, but one with a good loss curve. It is worth noting models like this, which might perform better in practice.

Conclusion

If we review the theories we tested:

    • Can we create a better model? Yes - training on the relabelled data produced a better model across the board.
    • Should we label more data? Yes, we definitely should, but the relabelled articles and reports are still struggling, with the relabelled group only reaching 77% accuracy. Relabelling has really improved precision but has not helped recall; which do we deem more important? The categories still struggling are:
        • reports and articles
        • quotes
        • bad scraping
        • hateful, petition, stocks
    • Are any categories confusing the model? Having tried removing slop, articles and reports, and hateful content, it doesn’t look like removing any of them improves model metrics, so it doesn’t look like any are confusing the model.
    • Does fine-tuning a model already trained for spam identification help? No; having tried two such models, this did not improve results.
    • Does freezing the pretrained layers help? It doesn’t improve f1 on the test data, but it does create a much nicer loss curve, so maybe it would be better at predicting the eval set - I will investigate that today.

Spoiler alert, investigating these theories brought about some new theories and so here is a list of other things I ended up investigating:

    • Tokenising posts into sentences: this does not improve the overall model. It did look like it might be helpful for classifying longer posts, but it doesn’t look like it results in a significant improvement over the standard model. That said, it is longer posts that the model most frequently mislabels as spam, so this is definitely something we need to address.
    • Precision vs recall: my working theory was that a higher recall would be OK if the mistakes the model was making were just classifying other low-quality posts as spam, but it looks like we would just be removing a lot of long posts.

Some future tests we should run:

- [ ] We looked at classifying longer posts at a sentence level and didn’t really improve results, but maybe we should look specifically at longer posts that are classified as spam and pass them through a second filter.
- [ ] Maybe we need a heuristic filter for slop. This is frequently being classified as not spam, and more data isn’t massively affecting results. Filters like the proportion of stopwords per sentence might be good here.
- [ ] We need to pick out our best models and run them over the validation set to see which perform best.
- [ ] Unfortunately, we need to label more data.
- [ ] I still haven’t tested model accuracy if data is run through heuristic filters before modelling.