Test Win32 Build Available with Junk Mail Filter Changes

Discussion about official Mozilla Thunderbird builds
Locked
djbrock
Posts: 79
Joined: May 2nd, 2003, 8:21 pm

Post by djbrock »

I'm hitting a bit better than 50% with the Adaptive filter setting at 10. It's caught 6 spam messages and missed 5 in the last day.
peaveyman
Posts: 341
Joined: June 1st, 2003, 6:24 pm

Post by peaveyman »

Since I retrained the filter, it has been running at almost 100%. I have over a hundred junk emails and it has caught all but 1 or 2 of them. Not bad, I think.
Kylotan
Posts: 478
Joined: July 21st, 2003, 4:45 am
Location: Nottingham, UK

Post by Kylotan »

Whereas I just got another 4 spams that Thunderbird totally missed and Popfile picked up. How can people get such totally different results?
Moonwolf
Posts: 531
Joined: December 7th, 2003, 2:50 pm
Location: Hertfordshire, England

Post by Moonwolf »

Yesterday it was working at around 95% for me. Then I marked a false positive as not junk. This morning it missed 95% of my spam. Still some work to do I think.
Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.6) Gecko/20050223 Firefox/1.0.1
Thunderbird 1.0 (20041206)
EMbuttons: Buttons & options for the Extension Manager. Easy Get Mail Button is here too.
Kylotan
Posts: 478
Joined: July 21st, 2003, 4:45 am
Location: Nottingham, UK

Post by Kylotan »

Is the code for this build available anywhere?
Moonwolf
Posts: 531
Joined: December 7th, 2003, 2:50 pm
Location: Hertfordshire, England

Post by Moonwolf »

Yes, it has been checked into the trunk.
Kylotan
Posts: 478
Joined: July 21st, 2003, 4:45 am
Location: Nottingham, UK

Post by Kylotan »

How would I see the file(s) in question without downloading 36MB of source code?
peaveyman
Posts: 341
Joined: June 1st, 2003, 6:24 pm

Post by peaveyman »

And now, this morning, 11 spam messages came in and it missed 4 of them. It was working fine yesterday.
Kylotan
Posts: 478
Joined: July 21st, 2003, 4:45 am
Location: Nottingham, UK

Post by Kylotan »

Just as a hint to mscott, POPFile counts HTML tags as special tokens, and invalid HTML tags as another token type. Seeing 18 instances of 'html:invalid tag' in a single email is a great spam indicator!
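
To make the idea concrete, here is a rough Python sketch of that kind of tokenizer. It is not POPFile's or Thunderbird's actual code: the KNOWN_TAGS whitelist, the regular expressions, and the 'html:tag:<name>' token format are all assumptions made up for illustration. Only the 'html:invalid tag' token name comes from what POPFile reports.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative whitelist, not POPFile's real tag list.
KNOWN_TAGS = {
    "html", "head", "body", "p", "a", "b", "i", "br",
    "img", "font", "table", "tr", "td", "div", "span",
}

# Matches opening and closing tags and captures the tag name.
TAG_RE = re.compile(r"</?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")
WORD_RE = re.compile(r"[A-Za-z']+")


def tokenize(message: str) -> Counter:
    """Count word tokens plus pseudo-tokens for valid and invalid HTML tags."""
    tokens = Counter()
    for match in TAG_RE.finditer(message):
        name = match.group(1).lower()
        if name in KNOWN_TAGS:
            tokens["html:tag:" + name] += 1
        else:
            # Made-up tags are often used to split words and dodge filters.
            tokens["html:invalid tag"] += 1
    # Replace tags with spaces so the remaining text is tokenized as words.
    text = TAG_RE.sub(" ", message)
    for word in WORD_RE.findall(text):
        tokens[word.lower()] += 1
    return tokens


sample = "<p>Buy <xq>che<zz>ap</zz> pills</p>"
print(tokenize(sample))
```

The invalid-tag count then becomes just another feature the classifier can weigh alongside ordinary words.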
Kerrick
Posts: 202
Joined: May 30th, 2003, 7:58 am

Post by Kerrick »

Scott:

I notice that in one particular mailbox I get lots and lots of spam that TB steadfastly refuses to catch. Would getting your hands on a few hundred (or thousand, it wouldn't take long) pieces of it help you at all in analyzing why the filter won't catch it? I'd be more than pleased to post it or something if it would help your analysis.
mgl
Posts: 29
Joined: August 22nd, 2003, 9:57 am

Post by mgl »

Kylotan wrote: Whereas I just got another 4 spams that Thunderbird totally missed and Popfile picked up. How can people get such totally different results?


This is a classic issue with TB's junk mail filter, as detailed in several threads in the TB forums. Some people get POPFile-quality filtering from TB with very little effort, while others (like you, and me when I had this special build installed) consistently fail to achieve even a mild level of success with it. I suspect part of the answer has to do with the filter being very sensitive to the training method used--do it even slightly wrong, and you're all messed up.

For instance, I thought that training TB on my collected Junk Mail corpus of about 4,000 messages would be sure to help it learn what spam is. After all, the corpus contains just about every permutation of "Viagra" and "mortgage" you can imagine, so what better way to teach the filter than to provide it with thousands of messages which are all unambiguously junk? Turns out it doesn't work that way, for some reason: it <b>still</b> failed to pick up even the most obvious incoming spam.

I deleted training.dat several times--I tried different strategies, marking all my good messages as Not Junk, then I tried not doing that, <i>then</i> I tried marking only false positives. I tried these things without training the new build on the large corpus, but only on incoming spam. None of it helped--the filter was still failing to pick up most of the incoming junk.

So I went back to the nightly, restored my old training.dat, and I'm back to 85-90% performance--good enough for me, but not competitive with POPFile etc. This thing is a mystery to me.
mscott
Posts: 2516
Joined: April 2nd, 2003, 4:10 pm
Location: Thunderbird Research Center, CA

Post by mscott »

btw, none of this is checked into the trunk. It's an experiment.

Also, there were some bugs in the new algorithm that someone is fixing for me.
Thunderbirds are Go!
Kylotan
Posts: 478
Joined: July 21st, 2003, 4:45 am
Location: Nottingham, UK

Post by Kylotan »

Yeah. As I understand it, a Bayesian classifier trained with 4000 junk mails should easily catch <i>every</i> junk mail you throw at it, except for utterly novel ones. The problem then would be false positives, but you'd quickly fix that by correcting the mistakes. Given the probabilistic nature of the classifier, about 10 corrected false positives should be enough to fix this.

It's also interesting that we're encouraged to train the filter on previous mails, when in fact these systems tend to work better when only trained on errors.
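
For anyone who wants to play with the train-on-error idea, here is a minimal naive Bayes sketch in Python. It is not Thunderbird's algorithm (nor the experimental one in this build): the class name, the per-message word-presence counts, the add-one smoothing, and the log-odds threshold of 0 are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Toy junk filter; an illustrative sketch, not any mail client's real code."""

    def __init__(self):
        # Number of messages of each class that contained a given word.
        self.counts = {"junk": Counter(), "good": Counter()}
        self.messages = {"junk": 0, "good": 0}

    def train(self, words, label):
        self.counts[label].update(set(words))
        self.messages[label] += 1

    def junk_log_odds(self, words):
        # Prior odds from message counts, with add-one smoothing.
        score = math.log((self.messages["junk"] + 1) / (self.messages["good"] + 1))
        for word in set(words):
            p_junk = (self.counts["junk"][word] + 1) / (self.messages["junk"] + 2)
            p_good = (self.counts["good"][word] + 1) / (self.messages["good"] + 2)
            score += math.log(p_junk / p_good)
        return score

    def classify(self, words):
        # Positive log-odds means junk is the more likely class.
        return "junk" if self.junk_log_odds(words) > 0 else "good"

    def train_on_error(self, words, true_label):
        """Update the counts only when the current prediction is wrong."""
        if self.classify(words) != true_label:
            self.train(words, true_label)
            return True
        return False


f = NaiveBayesFilter()
f.train(["viagra", "mortgage"], "junk")
f.train(["meeting", "lunch"], "good")
print(f.classify(["viagra"]))                               # junk
print(f.train_on_error(["viagra"], "junk"))                 # False: already right
print(f.train_on_error(["viagra", "newsletter"], "good"))   # True: false positive corrected
```

With train_on_error the counts only move when the filter gets something wrong, which is the opposite of bulk-training on a large existing corpus.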
Kylotan
Posts: 478
Joined: July 21st, 2003, 4:45 am
Location: Nottingham, UK

Post by Kylotan »

Any idea when the next version of this experimental algorithm will be available then? I'll provide another Thunderbird vs. POPFile comparison for everyone's enjoyment. ;)
DizzyWeb
Posts: 637
Joined: March 27th, 2003, 9:56 am

Post by DizzyWeb »

mscott wrote: btw, none of this is checked into the trunk. It's an experiment.

Also, there were some bugs in the new algorithm that someone is fixing for me.

Those bugs would be the ones where it doesn't catch anything after you mark anything as junk or not junk?
The author can never, in no way, be held responsible for any harm caused, mental or physical, by reading this post.