Test Win32 Build Available with Junk Mail Filter Changes

Discussion about official Mozilla Thunderbird builds
Locked
mscott
Posts: 2516
Joined: April 2nd, 2003, 4:10 pm
Location: Thunderbird Research Center, CA

Test Win32 Build Available with Junk Mail Filter Changes

Post by mscott »

For those of you who are curious, here is a test Win32 trunk build showing off the junk mail improvements I've been working on with some help from Miguel Vargas.

http://ftp.mozilla.org/pub/mozilla.org/ ... -win32.zip

It includes:

1) A new algorithm used to determine if a message is junk or not
2) A new tokenizer for deciding what tokens to extract from a message to feed into the junk algorithm.
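
For readers unfamiliar with how a tokenizer and a Bayesian junk algorithm usually fit together, here is a rough Python sketch of a generic token-based filter. It is not the code in this build: the regex tokenizer, the TinyBayes class, the add-one smoothing and the "junk"/"good" labels are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase runs of letters, digits and a few symbols."""
    return re.findall(r"[a-z0-9$'!-]+", text.lower())


class TinyBayes:
    """Illustrative naive Bayes filter over token presence (not Thunderbird's code)."""

    def __init__(self):
        self.counts = {"junk": Counter(), "good": Counter()}
        self.messages = {"junk": 0, "good": 0}

    def train(self, text, label):
        # Count each distinct token once per message.
        self.messages[label] += 1
        self.counts[label].update(set(tokenize(text)))

    def junk_probability(self, text):
        # Start from the (smoothed) prior odds of junk vs. good mail.
        log_odds = math.log((self.messages["junk"] + 1) / (self.messages["good"] + 1))
        for token in set(tokenize(text)):
            # Add-one smoothing keeps unseen tokens from producing zero probabilities.
            p_junk = (self.counts["junk"][token] + 1) / (self.messages["junk"] + 2)
            p_good = (self.counts["good"][token] + 1) / (self.messages["good"] + 2)
            log_odds += math.log(p_junk / p_good)
        # Clamp before converting to a probability so exp() cannot overflow.
        log_odds = max(-700.0, min(700.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


f = TinyBayes()
f.train("buy cheap pills now", "junk")
f.train("cheap pills cheap offer", "junk")
f.train("meeting agenda for monday", "good")
f.train("lunch on monday?", "good")
print(f.junk_probability("cheap pills"))
print(f.junk_probability("monday meeting"))
```

The tokenizer decides which features the classifier ever gets to see, which is why changing it (as in item 2) can matter as much as changing the scoring math.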

If you want to experiment with this build, note the following:

1) I recommend you remove training.dat from your profile and re-train the filter before judging its effectiveness.

2) For those who actually understand the math behind Bayesian junk algorithms, you can fine-tune the sensitivity of the filter by changing the following pref. Otherwise I suggest leaving it alone.

+// the probability threshold over which messages are classified as junk
+// this number is divided by 100 before it is used. The classifier can be fine tuned
+// by changing this pref. Typical values are .99, .95, .90, etc.
+pref("mail.adaptivefilters.junk_threshold", 90);

Be warned that the lower the number, the higher the false positive rate (legitimate messages incorrectly marked as junk), although a lower number also catches a higher percentage of spam.
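
To make the trade-off concrete, here is a small Python sketch of how a threshold pref like this could be applied. The function name is_junk, the strict "greater than" comparison and the sample scores are assumptions for illustration; only the pref value and the divide-by-100 step come from the snippet above.

```python
def is_junk(junk_probability: float, junk_threshold_pref: int = 90) -> bool:
    """Classify as junk when the score is over the pref value divided by 100."""
    return junk_probability > junk_threshold_pref / 100.0


# Made-up classifier scores for four messages.
scores = [0.50, 0.92, 0.97, 0.995]

for pref in (99, 95, 90):
    flagged = [s for s in scores if is_junk(s, pref)]
    print(pref, flagged)
```

With these made-up scores, dropping the pref from 99 to 90 flags three messages instead of one. If the 0.92 message were legitimate mail, that would be exactly the kind of extra false positive the warning is about.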

So far, I am seeing some fantastic results with these changes when testing against a public mail corpus of ~10,000 messages.

The old code was catching 45% of the junk on my test setup.
The new code is catching 88%.

I'm interested to hear how it holds up in the real world.

And of course there is a lot more work to do on it.

-Scott
Last edited by mscott on February 21st, 2004, 4:49 pm, edited 1 time in total.
Thunderbirds are Go!
jhirshon
Posts: 762
Joined: June 11th, 2003, 3:24 pm

Post by jhirshon »

cool - looking forward to trying this!
Amix.
Posts: 14
Joined: November 14th, 2003, 12:14 pm
Location: Ekaterinburg, Russia

Post by Amix. »

Looks great.

But it marks all the delivery delay/failure messages as junk.
It's frustrating, but probably OK given the recent virus outbreaks...
wolruf
Posts: 107
Joined: November 5th, 2002, 3:15 am
Location: France

Post by wolruf »

Is it also using SpamAssassin features such as the ones mentioned in Neil's blog? Just curious, I don't know SpamAssassin well: http://www.neilturner.me.uk/2004/Feb/15 ... score.html
For example, "URI: Contains a URL in the BIZ top-level domain"
nicklott
Posts: 25
Joined: February 4th, 2004, 1:28 am

Warning: random thought

Post by nicklott »

Would it be worthwhile to tune the junk threshold automatically d'you think? Perhaps by measuring the number of messages corrected by the user (so every corrected false positive nudges the value up and every spam missed nudges it down)?
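
A minimal Python sketch of the idea in this post, assuming a fixed step of one point per correction and hard limits of 50 and 99. The function name nudge_threshold, the step size and the limits are all hypothetical; nothing like this is known to exist in the test build.

```python
def nudge_threshold(threshold: int,
                    corrected_false_positives: int,
                    missed_spam: int,
                    step: int = 1,
                    low: int = 50,
                    high: int = 99) -> int:
    """Return a new junk threshold (0-100 scale) after user corrections."""
    # Each good message wrongly marked as junk pushes the threshold up.
    threshold += step * corrected_false_positives
    # Each spam that slipped through pushes it back down.
    threshold -= step * missed_spam
    # Keep the result inside a sane range.
    return max(low, min(high, threshold))


print(nudge_threshold(90, corrected_false_positives=2, missed_spam=0))  # 92
print(nudge_threshold(90, corrected_false_positives=0, missed_spam=3))  # 87
```

Note that a single number moves both error rates at once, so a scheme like this cannot lower false positives without letting more spam through.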
DurianCS
Posts: 767
Joined: June 5th, 2003, 6:17 am
Location: The Netherlands

Re: Warning: random thought

Post by DurianCS »

nicklott wrote:Would it be worthwhile to tune the junk threshold automatically d'you think? Perhaps by measuring the number of messages corrected by the user (so every corrected false positive nudges the value up and every spam missed nudges it down)?


Generally speaking, you should not do that. The problem is that you change the number of false negatives, and consequently the number of false positives, in the same action. I don't mind having several false negatives (spam not recognised as such), but I really don't want to have false positives (thrown away, but not being spam).
For me, e.g., the standard settings in Spambayes work very well. And if not, I could change them. Spambayes, by the way, offers a very pleasant third category, "suspects", which should contain most dubious mails.
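
A short Python sketch of what a third "suspects" category looks like with two cutoffs instead of one. The function name categorize and the cutoff values 0.20 and 0.90 are illustrative assumptions, not a statement of how Spambayes or Thunderbird is configured.

```python
def categorize(score: float,
               ham_cutoff: float = 0.20,
               spam_cutoff: float = 0.90) -> str:
    """Sort a junk score (0.0-1.0) into one of three bins."""
    if score >= spam_cutoff:
        return "junk"
    if score <= ham_cutoff:
        return "ham"
    # Everything in between goes to a folder the user reviews by hand.
    return "suspect"


for score in (0.05, 0.55, 0.97):
    print(score, categorize(score))
```

The middle bin is what lets you keep the junk cutoff strict (few false positives) without silently delivering every borderline message to the inbox.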

CS
Durian, King of Fruits
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8a6) Gecko/20041227 Firefox/1.0+ (bangbang023)
TB version 1.0 (20041225)
level
Posts: 8
Joined: June 4th, 2003, 8:53 pm

Post by level »

I got bad results. :-(

Thunderbird 0.5:
ham = 971, fp=4(0.4%)
spam=2444, fn=170(7.0%)

Test Build:
ham = 971, fp=10(1.0%)
spam=2444, fn=336(13.7%)

NOTE:
- Half of the messages are Japanese.
- First I created a new training.dat using half of these messages.
After removing the msf files (to reset junk status), I applied the junk mail controls to all of the messages.
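
For anyone checking the percentages in this post, here is a small Python sketch that recomputes them from the raw counts. The helper name rate is made up; the counts are the ones reported above.

```python
def rate(errors: int, total: int) -> float:
    """Error rate as a percentage, rounded to one decimal place."""
    return round(100.0 * errors / total, 1)


HAM, SPAM = 971, 2444

# (false positives, false negatives) per build, as reported in the post.
results = {
    "Thunderbird 0.5": (4, 170),
    "Test Build": (10, 336),
}

for build, (fp, fn) in results.items():
    print(build, "fp%:", rate(fp, HAM), "fn%:", rate(fn, SPAM))
# Thunderbird 0.5 fp%: 0.4 fn%: 7.0
# Test Build fp%: 1.0 fn%: 13.7
```

Put the other way round, the catch rate on this corpus went from about 93.0% to about 86.3%, the opposite direction from the 45% to 88% improvement reported on the public corpus.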
RickFriedman
Posts: 361
Joined: July 25th, 2003, 9:35 am

Post by RickFriedman »

Scott,

I just want to be sure I understand...

the automated, nightly builds do NOT have the new junk mail code... correct?

Rick
Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.7) Gecko/20060911 SUSE/1.5.0.7-1.5 Firefox/1.5.0.7
mscott
Posts: 2516
Joined: April 2nd, 2003, 4:10 pm
Location: Thunderbird Research Center, CA

Post by mscott »

RickFriedman wrote:Scott,

I just want to be sure I understand...

the automated, nightly builds do NOT have the new junk mail code... correct?

Rick
Just this test build.
Thunderbirds are Go!
peaveyman
Posts: 341
Joined: June 1st, 2003, 6:24 pm

Post by peaveyman »

I deleted training.dat but Thunderbird didn't recreate it, as far as I can see. Is it supposed to?
mscott
Posts: 2516
Joined: April 2nd, 2003, 4:10 pm
Location: Thunderbird Research Center, CA

Post by mscott »

It won't get created until you start training it again.
Thunderbirds are Go!
avbohemen
Posts: 292
Joined: April 25th, 2003, 6:50 am
Location: The Netherlands

Post by avbohemen »

Hmm... this is about a week too late for me. My ISP just installed their own spam filter, and it works VERY nicely. About 98% of the spam gets caught by it, and in a week's time I've had absolutely no false positives.

However, I can still pop my spam box to see if Thunderbird also marks everything in it as spam :)
John Liebson
Posts: 6920
Joined: July 29th, 2003, 1:09 pm

Post by John Liebson »

avbohemen wrote:Hmm... this is about a week too late for me. My ISP just installed their own spam filter, and it works VERY nicely. About 98% of the spam gets caught by it, and in a week's time I've had absolutely no false positives.

However, I can still pop my spam box to see if Thunderbird also marks everything in it as spam :)
My ISP uses Postini, which does catch quite a bit of spam. However, since the US government, in its usual absurd way of making things worse, enacted an antispam law that makes it easier for spammists to pollute the world, I now get more than even a short while ago.

Until this recent calamity, I never used the Thunderbird spam filter (preferring my spam to be in slices from the contents of the Spam tins...). I now find myself appreciating this feature, even though I suspect it will take a relatively long time until I'm able to adequately train the system to despamalize things for me automatically.
avbohemen
Posts: 292
Joined: April 25th, 2003, 6:50 am
Location: The Netherlands

Post by avbohemen »

John Liebson: My ISP uses Brightmail. I get about 80 spams per day now, and it is still rising. Brightmail only misses 1 or 2 of them. Very nice. Thunderbird missed about 20 each day.

mscott: I've set up a new pop3-account to access my spam box. However, I had to start training Thunderbird again. Initially, it didn't see any spam. However, my old training.dat is still present. Is this file used by the new junk mail code? Or do I have to train again for every new account?
mscott
Posts: 2516
Joined: April 2nd, 2003, 4:10 pm
Location: Thunderbird Research Center, CA

Post by mscott »

There is only one training.dat for all accounts in a profile. This test build uses the same file. In my instructions I suggest that you remove it and retrain from scratch before judging its effectiveness, but you don't have to do that.
Thunderbirds are Go!