Test Win32 Build Available with Junk Mail Filter Changes

Discussion about official Mozilla Thunderbird builds
Locked
naylor83
Posts: 325
Joined: September 11th, 2003, 4:04 am
Location: Uppsala, Sweden
Contact:

Post by naylor83 »

Hello people!

I have some graphical results from parallel testing of the new Junk Mail build and TB 0.5:

<a href="http://home.student.uu.se/dana3949/tbjunkfilter.png">Cumulative Efficiency</a> - <a href="http://home.student.uu.se/dana3949/tbjunkfilter2.png">Instant Efficiency</a>

(Instant Efficiency = efficiency at each download)

I set them up with different profiles, and set them to leave mail on my server for three days, hence making it possible to download the same e-mails to both of them.

I pre-trained both builds with 100 messages: 50 spam and 50 legitimate e-mails.

I then just used them both, downloading mail at the same time so they got EXACTLY the same training. I believe this method of testing may be more realistic than just giving them x thousand e-mails to chew on in one go. (More realistic because I updated their training with every download, correcting false negatives/positives and marking non-junk as 'not junk'.)

I found some very pleasing results for the devs, even though this is with what is now said to be a broken build. The overall efficiency has stabilized at ~75% for the test build and at ~45% for TB 0.5, and so far they have seen 302 e-mails (on top of the 100 for pre-training). An average download consisted of 23 e-mails, 8 of which were spam.

As for false positives, the test build had none and TB 0.5 had just one, or 0.5%.
Last edited by naylor83 on April 5th, 2004, 1:53 am, edited 1 time in total.
mscott
Posts: 2516
Joined: April 2nd, 2003, 4:10 pm
Location: Thunderbird Research Center, CA
Contact:

Post by mscott »

awesome!

FYI, which 'test build' were you using? The automated Windows nightlies contain the new junk mail controls, including fixes for the bugs that were causing the earlier 'junk mail test build' I released to perform incorrectly.
Thunderbirds are Go!
naylor83
Posts: 325
Joined: September 11th, 2003, 4:04 am
Location: Uppsala, Sweden
Contact:

Post by naylor83 »

This was the original junk test build, I think, the one posted at the top of this thread. Maybe I should try doing the same with the latest nightlies?
<a href="http://davidnaylor.org/blog/">David Naylor: Blog</a> | <a href="http://davidnaylor.org/photography/">David Naylor: Photography</a>
mscott
Posts: 2516
Joined: April 2nd, 2003, 4:10 pm
Location: Thunderbird Research Center, CA
Contact:

Post by mscott »

That would be really cool if you could re-test with the latest nightlies. Many thanks in advance!
Thunderbirds are Go!
naylor83
Posts: 325
Joined: September 11th, 2003, 4:04 am
Location: Uppsala, Sweden
Contact:

Post by naylor83 »

ok! I'm on to it!
<a href="http://davidnaylor.org/blog/">David Naylor: Blog</a> | <a href="http://davidnaylor.org/photography/">David Naylor: Photography</a>
naylor83
Posts: 325
Joined: September 11th, 2003, 4:04 am
Location: Uppsala, Sweden
Contact:

Post by naylor83 »

I have now simulated the same test as previously, using my collected batch of e-mail and randomly dividing it into small "downloads" (in folders) with the same average size as before. I tested TB 0.5, the "junk mail build" at the top of this thread, and the 2004-04-04 nightly.

After pre-training with 50+50 (as previously), I ran the junk mail controls on one subfolder at a time, correcting the filter as I went along.

Here are the results:

<a href="http://home.student.uu.se/dana3949/junk1.png">Cumulative Efficiency</a> - <a href="http://home.student.uu.se/dana3949/junk2.png">Instant efficiency</a>

Surprisingly, the nightly I tried did not do as well as the junk mail test build. Has the default cut-off been changed since?

On the other hand, it didn't have any false positives, whereas the test build marked 1.2% of the ham as junk and TB 0.5 marked 0.8% (3 and 2 e-mails respectively).

In the linked graphs the "efficiency" is the percentage of the junk caught by the filter, and it is plotted against the training size <i>prior</i> to the filter being run on the "download" (folder) of e-mail.
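In case it helps anyone reading the graphs, here is a minimal Python-style sketch of how the two numbers can be derived from per-download counts (the function names are mine, not anything from Thunderbird):

```python
# Illustrative only: how "instant" and "cumulative" efficiency can be computed
# from (spam caught, total spam) counts for each download.

def instant_efficiency(caught, total_spam):
    """Share of one download's spam that the filter caught."""
    return caught / total_spam if total_spam else None

def cumulative_efficiency(downloads):
    """Running share of all spam caught so far, one value per download."""
    caught_so_far = spam_so_far = 0
    series = []
    for caught, total_spam in downloads:
        caught_so_far += caught
        spam_so_far += total_spam
        series.append(caught_so_far / spam_so_far if spam_so_far else None)
    return series

# Example: three downloads averaging about 8 spam messages each.
print(cumulative_efficiency([(5, 8), (6, 8), (7, 9)]))  # [0.625, 0.6875, 0.72]
```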

[Edit] Oh yes, a question I forgot: do the bugfixes in the nightlies include the updated tokenizer which looks at the headers as well as the bodies?
sasquatch
Posts: 6022
Joined: November 25th, 2003, 8:56 am

Post by sasquatch »

Just wondering what the latest status is on these two bugs:

http://bugzilla.mozilla.org/show_bug.cgi?id=227846

http://bugzilla.mozilla.org/show_bug.cgi?id=229002


Thanks (and no, they are not mine, but I was just reading about them elsewhere).
mgl
Posts: 29
Joined: August 22nd, 2003, 9:57 am

Argh.

Post by mgl »

I don't have any stats to offer, but the new Junk algorithm is, after a few weeks of use, still performing worse than the 0.5 milestone build. I am currently using the 06-Apr weekly build, but it seems neither better nor worse than its predecessors. When I moved from 0.5 to the new builds, I did the following:

-Erased training.dat
-Collected all old spam in one Junk folder.
-Erased the .msf file for my Junk folder, resetting all Junk status.
-Trained the new filter on my existing Junk corpus (>5,000 messages, covering all variations of spam), using "Run Junk Mail on Folder", and repeatedly re-marking spam that the filter didn't catch (false negatives) until the filter caught them all.

When I let the filter loose on incoming mail, it at first seemed to catch most spam, which I considered good for a "young" filter. There were (and are) a few false positives, and a greater number of false negatives, which I marked as Junk. Once in the Junk folder, I left them alone. But the filter's performance has seemed to decrease over time, and now it misses about 40-50% of new spam. Sometimes it doesn't catch any, and I end up with 9 or so spam messages in my Inbox (I just looked, and since the last time I checked, there are 6 new spam messages in my Inbox, none of which were caught).

So, despite the "improved" algorithm, I am getting far worse performance than with 0.5, which consistently caught 80-90% of incoming spam, and which seemed to improve over time. I'm not sure I ever got clear answers to the following questions, so I'll ask them again:

-Should I train the new filter on my existing Junk corpus, or just on new incoming spam? As I understand Bayesian filtering, training on a representative corpus should be beneficial to filter performance, the larger the better. However, current versions of TB don't seem to filter that well when pre-trained.

-Should I periodically re-train my existing Junk corpus? With 0.5, I'd occasionally run junk mail controls on my Junk folder, and find that the filter would <i>unmark</i> previously marked spam. So I'd re-mark them, run the controls again, and keep doing so until all messages remained marked (usually only 1-3 iterations). I tried doing this with early versions of the new filter, but noticed that training.dat was often not updating afterwards, so I stopped.

I appreciate all the work Scott is doing to improve the filter, and I'm glad it's working well for some people. But it's definitely still broken for me.
Kylotan
Posts: 478
Joined: July 21st, 2003, 4:45 am
Location: Nottingham, UK
Contact:

Post by Kylotan »

If you're pre-training, remember you don't just train on junk! The filter needs to see what is Not Junk in order to be able to tell the difference. Train on roughly the same amount of Not Junk as you do Junk. Or just train on new e-mails, which should probably be slightly more accurate but will take longer.
mgl
Posts: 29
Joined: August 22nd, 2003, 9:57 am

Post by mgl »

See, this is one of those perpetually confusing issues, made more so by the TB interface. If TB correctly classifies a ham e-mail, my natural impulse is to assume that the filter has got it right, and leave it alone. I think someone suggested that TB should have a third junk icon, for messages TB is unsure about (perhaps those that fall within a certain scoring range). At first, this would be most of them, but presumably the quantity would drop with more training.

One of the problems with the way TB's junk filter is implemented is that it's almost entirely opaque to the user. Why did TB classify one message as spam but not a very similar message? Should I continue to re-train my junk (or non-junk) folders to keep the filter ship-shape, or should I just leave them alone?

So, anyway ... I went through a bunch of large ham folders, and classified them all as Not Junk. Then I went back to my (now >6,000 message) spam corpus, all of which was marked as Junk, and ran the Junk Mail Controls on the folder. It unmarked about 2,000 messages (that is, incorrectly classified them as ham), so I moved those to an empty folder for re-training. After the first re-training run it was down to 1,820 incorrect faux-ham, then 1,217, then 1,066, and so on. After about 20 iterations of this, I'm down to a hard core of ~200 faux-ham, which stubbornly resist the Junk Mail Controls: even after I repeatedly mark them as Junk, the filter un-marks them each time.

Looking a little closer at the stubborn faux-ham, it seems that the majority of them are the kind of spam that includes a couple of hundred plain English words at the bottom of the message, like the following:

<i>truce dehumidify insensible curtain dynamo trisyllable thrust glass micrography alpheratz casein airy winston detour constantinople oxen bearish congregate industrial io blade barstow ammerman aeneas eastwood cosmopolitan feat excrete hangable earl carnegie corporeal bloodshot flex atlantes exculpate sloven dwyer aerospace catherwood coalition incomprehensible natural blaspheme eyebrow bertrand avesta gorse chamois stubby ford crawl incest ...</i> [goes on for about four times this long]

By contrast, relatively few of the correctly marked spam messages are of this kind. TB's filter catches gibberish words quite well and, interestingly, seems to have no problem with random prose excerpts placed in junk messages (say, from Hitchhiker's Guide to the Galaxy), but it seems to be bamboozled by these messages with large numbers of random, relatively obscure words.
Kylotan
Posts: 478
Joined: July 21st, 2003, 4:45 am
Location: Nottingham, UK
Contact:

Post by Kylotan »

mgl wrote:See, this is one of those perpetually confusing issues, made more so by the TB interface. If TB correctly classifies a ham e-mail, my natural impulse is to assume that the filter has got it right, and leave it alone.


Your impulse is correct in this case.

One of the problems with the way TB's junk filter is implemented is that it's almost entirely opaque to the user. Why did TB classify one message as spam but not a very similar message?


Usually the results are too complex to mean much anyway. Each word or token will have a probability, and those probabilities are combined. Then (I assume, based on other software) a statistical significance test is done to see whether the difference between the aggregate spam and not-spam scores is significant, and if so, the findings are acted upon.
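To make that concrete, the Graham-style scheme that most Bayesian mail filters of this generation use looks roughly like the sketch below (my own illustration, not Thunderbird's actual code): each token gets a probability from its training counts, only the tokens furthest from the neutral 0.5 are kept, and those are combined into a single message score.

```python
# Rough sketch of Graham-style Bayesian scoring (illustrative only; the names
# and constants are mine and may not match what Thunderbird actually does).

def token_spam_probability(spam_count, ham_count, total_spam, total_ham):
    """Estimate P(spam | token) from training counts, clamped away from 0 and 1."""
    spam_freq = spam_count / max(total_spam, 1)
    ham_freq = ham_count / max(total_ham, 1)
    p = spam_freq / (spam_freq + ham_freq) if (spam_freq + ham_freq) else 0.5
    return min(max(p, 0.01), 0.99)

def score_message(token_probs, n_interesting=15):
    """Combine the n tokens whose probabilities lie furthest from the neutral 0.5."""
    interesting = sorted(token_probs, key=lambda p: abs(p - 0.5), reverse=True)
    spam = ham = 1.0
    for p in interesting[:n_interesting]:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)  # overall P(spam) for the message
```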

Should I continue to re-train my junk (or non-junk) folders to keep the filter ship-shape, or should I just leave them alone?


Leave them alone. A message only needs to go through the system once. Retraining shouldn't make a difference anyway, providing you corrected every mistake it made. If it does make a difference, well, that signifies a bug in Thunderbird to me.

So, anyway ... I went through a bunch of large ham folders, and classified them all as Not Junk.


The problem here is that you're probably now overtraining it towards 'ham'. Roughly equal-sized sets are the best way if you want to pre-train.

After about 20 iterations of this, I'm down to a hard core of ~200 faux-ham, which stubbornly resist the Junk Mail Controls: even after I repeatedly mark them as Junk, the filter un-marks them each time.


This sounds like a bug, since other spam software based on the same algorithms doesn't have this problem. However, the large size of your training set may be introducing so much noise that the system correctly decides that the e-mails are more likely to be spam than ham, but the difference is not large enough to be statistically significant.

Looking a little closer at the stubborn faux-ham, it seems that the majority of them are the kind of spam that includes a couple of hundred plain English words at the bottom of the message


Any decent Bayesian classifier should be immune to such tricks as you won't use even a tiny proportion of these words in your ham mail. Maybe 2 or 3 of them will count as ham tokens, and these should be easily outweighed by the obvious spam tokens.
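As a toy illustration of why the padding shouldn't matter (again assuming Graham-style scoring, not Thunderbird's actual code): words never seen in training get a near-neutral probability, so even a couple of hundred of them barely move the final score.

```python
# Toy example: a few strong spam tokens vs. 200 padding words the filter has
# never seen in training (which typically default to a near-neutral ~0.4).

strong_spam = [0.99, 0.98, 0.97, 0.95]   # e.g. the actual ad words and links
word_salad = [0.4] * 200                 # "truce", "dehumidify", "curtain", ...

def score(probs, n=15):
    top = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]
    spam = ham = 1.0
    for p in top:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)

print(score(strong_spam))               # ~1.0
print(score(strong_spam + word_salad))  # still ~1.0: the padding barely registers
```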
mgl
Posts: 29
Joined: August 22nd, 2003, 9:57 am

Post by mgl »

Kylotan wrote:
mgl wrote:Should I continue to re-train my junk (or non-junk) folders to keep the filter ship-shape, or should I just leave them alone?


Leave them alone. A message only needs to go through the system once. Retraining shouldn't make a difference anyway, providing you corrected every mistake it made. If it does make a difference, well, that signifies a bug in Thunderbird to me.


Right. You'd think that, having marked a message as spam, it would forevermore be regarded as spam by TB. Not so--it seems that in the constant (and necessary) changes made to the filter with each new set of messages, some messages drop out of the spam classification. The icon will indicate that they're junk, but if you run the filter on your previously classified spam, some will be <i>un</i>-marked. This implies that any new messages of this type that arrive will also be erroneously marked as ham by the filter. That's my rationale for periodic re-training on the Junk folders--to reinforce the spamness of everything in there.


So, anyway ... I went through a bunch of large ham folders, and classified them all as Not Junk.


The problem here is that you're probably now overtraining it towards 'ham'. Roughly equal-sized sets are the best way if you want to pre-train.


Argh. I see your point, but this is frustrating. We need some idiot-proof (i.e. mgl-proof) guidelines for training this thing.

After about 20 iterations of this, I'm down to a hard core of ~200 faux-ham, which stubbornly resist the Junk Mail Controls: even after I repeatedly mark them as Junk, the filter un-marks them each time.


This sounds like a bug, since other spam software based on the same algorithms doesn't have this problem. However, the large size of your training set may be introducing so much noise that the system correctly decides that the e-mails are more likely to be spam than ham, but the difference is not large enough to be statistically significant.


Which may be fixable by changing the filtering threshold, right?

Looking a little closer at the stubborn faux-ham, it seems that the majority of them are the kind of spam that includes a couple of hundred plain English words at the bottom of the message


Any decent Bayesian classifier should be immune to such tricks as you won't use even a tiny proportion of these words in your ham mail. Maybe 2 or 3 of them will count as ham tokens, and these should be easily outweighed by the obvious spam tokens.


Right, and so it's interesting how many of these are getting past the filter. Thing is, they're actually quite clever: the vast majority of the words in the message are in these text blocks (the actual lame ad is usually only a few words or just an image), and the vast majority of <i>those</i> are not obvious spam tokens. Of the 53 words I quoted before, only "incest" (and perhaps "excrete"--yuck) is likely to raise a red flag with the filter. They also tend to greatly increase the size of training.dat (7 MB and counting for me), though I can't see how that would hamper its effectiveness.

Thanks for the help, Kylotan. I've been regretting ditching the 0.5 milestone for this version, which is much less reliable for me. But perhaps I've been too enthusiastic about training the filter, and maybe I should just start <i>again</i> with clean files and see how it goes...
Kylotan
Posts: 478
Joined: July 21st, 2003, 4:45 am
Location: Nottingham, UK
Contact:

Post by Kylotan »

mgl wrote:You'd think that, having marked a message as spam, it would forevermore be regarded as spam by TB. Not so--it seems that in the constant (and necessary) changes made to the filter with each new set of messages, some messages drop out of the spam classification.


Technically, yes, this is possible. With other spam software I use, this almost never happens. It's as if the distinguishing factors in Thunderbird are less prominent, and maybe this is because people train it with large sets of email, introducing noise into the corpus.

That's my rationale for periodic re-training on the Junk folders--to reinforce the spamness of everything in there.


I seem to remember reading that it remembers which messages you've classified before, so that no message will be classified twice. However your experiences suggest otherwise...

Argh. I see your point, but this is frustrating. We need some idiot-proof (i.e. mgl-proof) guidelines for training this thing.


One problem is that people aren't agreed on the best method. Personally I'm biased towards training on errors only, as from a mathematical standpoint that's the most accurate, assuming false positives and false negatives are of equal importance (which many people disagree with). Others are biased towards large training sets up-front which give you quick results every time, but assume that the problems of overtraining aren't significant.

Which may be fixable by changing the filtering threshold, right?


It may be best to set the threshold to 50% or whatever, so that there's no bias built into the system. I am assuming that this will probably allow you to catch 190 of those 200, at the cost of a few false positives here and there.

Thing is, they're actually quite clever: the vast majority of the words in the message are in these text blocks (the actual lame ad is usually only a few words or just an image), and the vast majority of <i>those</i> are not obvious spam tokens.


Well, in time these filters will tend to assume that such embedded HTML signifies spam. But it does get harder for the filter as the effective content gets smaller, yeah. It helps if the filter processes the headers, as most spam passes through a relatively small number of places, which will tend to have random e-mail addresses in them, whereas your ham is likely to have more frequently seen addresses.
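As a rough sketch of what tokenizing the headers could look like (assumed behaviour only; I don't know exactly which fields Thunderbird's tokenizer uses):

```python
# Illustrative only: feed selected header fields to the tokenizer alongside the
# body, prefixing each token with its header name so that, say, an address seen
# in Received: is counted separately from the same text seen in the body.
import email
import re

def header_tokens(raw_message):
    msg = email.message_from_string(raw_message)
    tokens = []
    for field in ("From", "Reply-To", "Received", "Subject"):
        for value in msg.get_all(field, []):
            tokens += [f"{field}:{t.lower()}" for t in re.findall(r"[\w.@-]+", value)]
    return tokens
```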

I've been regretting ditching the 0.5 milestone for this version, which is much less reliable for me. But perhaps I've been too enthusiastic about training the filter, and maybe I should just start <i>again</i> with clean files and see how it goes...


I had similar problems with other builds: one bad training set would ruin the performance permanently. This one is generally working OK for me - and by that I mean catching about 95% of my junk - after training on about 200 each of ham and spam and making a few corrections since then. However, I know that other software can do better (98% effectiveness is my usual score with POPFile), so I hope to see further improvements here.