Test Win32 Build Available with Junk Mail Filter Changes
-
- Posts: 2516
- Joined: April 2nd, 2003, 4:10 pm
- Location: Thunderbird Research Center, CA
- Contact:
For those of you who are curious, here is a test Win32 trunk build showing off the junk mail improvements I've been working on with some help from Miguel Vargas.
http://ftp.mozilla.org/pub/mozilla.org/ ... -win32.zip
It includes:
1) A new algorithm used to determine if a message is junk or not
2) A new tokenizer for deciding what tokens to extract from a message to feed into the junk algorithm.
If you want to experiment with this build, note the following:
1) I recommend you remove training.dat from your profile and re-train the filter before judging its effectiveness.
2) For those who actually understand the math behind Bayesian junk algorithms, you can fine-tune the sensitivity of the filter by changing the following pref. Otherwise I suggest leaving it alone.
+// the probability threshold over which messages are classified as junk
+// this number is divided by 100 before it is used. The classifier can be fine tuned
+// by changing this pref. Typical values are .99, .95, .90, etc.
+pref("mail.adaptivefilters.junk_threshold", 90);
Be warned that the lower the number, the higher the false positive rate (messages that are incorrectly marked as junk), although the filter also catches a higher percentage of spam.
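As a rough illustration of how the pref above is applied, here is a minimal Python sketch. It is not the Thunderbird code: the function name `is_junk`, the idea that the classifier yields a probability between 0 and 1, and the sample scores are all assumptions made for the example. Only the pref value of 90 and the divide-by-100 step come from the comment above.

```python
def is_junk(junk_probability: float, threshold_pref: int = 90) -> bool:
    """Return True when a message's junk probability exceeds the threshold.

    junk_probability: classifier score in [0.0, 1.0] (assumed representation).
    threshold_pref: integer pref value, e.g. 90, divided by 100 before use.
    """
    threshold = threshold_pref / 100.0
    return junk_probability > threshold


# Hypothetical scores for three messages, for illustration only.
scores = [0.85, 0.93, 0.995]
flagged_at_90 = [is_junk(s, 90) for s in scores]
flagged_at_99 = [is_junk(s, 99) for s in scores]
```

With the lower pref value more of the sample messages are flagged, which is the trade-off described above: more spam caught, but more room for false positives.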
So far, I am seeing some fantastic results with these changes when testing against a public mail corpus of ~10,000 messages.
The old code was catching 45% of the junk on my test setup.
The new code is catching 88%.
I'm interested to hear how it holds up in the real world.
And of course there is a lot more work to do on it.
-Scott
Last edited by mscott on February 21st, 2004, 4:49 pm, edited 1 time in total.
Thunderbirds are Go!
-
- Posts: 107
- Joined: November 5th, 2002, 3:15 am
- Location: France
- Contact:
Is it also using SpamAssassin features such as the ones mentioned in Neil's blog? Just curious, I don't know SpamAssassin much. http://www.neilturner.me.uk/2004/Feb/15 ... score.html
For example: "URI: Contains a URL in the BIZ top-level domain"
-
- Posts: 25
- Joined: February 4th, 2004, 1:28 am
Warning: random thought
Would it be worthwhile to tune the junk threshold automatically d'you think? Perhaps by measuring the number of messages corrected by the user (so every corrected false positive nudges the value up and every spam missed nudges it down)?
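To make the idea concrete, here is a minimal Python sketch of such a nudging rule. Nothing like this is confirmed to exist in the test build: the function name `nudge_threshold`, the step size, and the clamping bounds are all hypothetical choices for illustration.

```python
def nudge_threshold(threshold_pref, corrected_false_positives, missed_spam,
                    step=1, lowest=50, highest=99):
    """Adjust an integer junk threshold from user corrections.

    Each corrected false positive raises the threshold by `step`
    (the filter becomes more cautious); each missed spam lowers it by
    `step` (the filter becomes more aggressive). The result is clamped
    to [lowest, highest]. All default values are illustrative assumptions.
    """
    adjusted = (threshold_pref
                + step * corrected_false_positives
                - step * missed_spam)
    return max(lowest, min(highest, adjusted))
```

For example, starting from 90, two corrected false positives would give 92, while three missed spams would give 87.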
- DurianCS
- Posts: 767
- Joined: June 5th, 2003, 6:17 am
- Location: The Netherlands
Re: Warning: random thought
nicklott wrote: Would it be worthwhile to tune the junk threshold automatically d'you think? Perhaps by measuring the number of messages corrected by the user (so every corrected false positive nudges the value up and every spam missed nudges it down)?
Generally speaking, you should not do that. The problem is that you change the number of false negatives, and consequently the number of false positives, in the same action. I don't mind having several false negatives (spam not recognised as spam), but I really don't want to have false positives (thrown away, but not being spam).
For me, for example, the standard settings in Spambayes work very well. And if they didn't, I could change them. Spambayes, by the way, offers a very pleasant third category, "suspects", which should contain most dubious mails.
CS
Durian, King of Fruits
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8a6) Gecko/20041227 Firefox/1.0+ (bangbang023)
TB version 1.0 (20041225)
-
- Posts: 8
- Joined: June 4th, 2003, 8:53 pm
I got bad results.
Thunderbird 0.5:
ham = 971, fp=4(0.4%)
spam=2444, fn=170(7.0%)
Test Build:
ham = 971, fp=10(1.0%)
spam=2444, fn=336(13.7%)
NOTE:
- Half of the messages are Japanese.
- First I created a new training.dat using half of these messages.
After removing the msf files (to reset junk status), I applied the junk mail controls to all the messages.
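For anyone who wants to compare their own runs the same way, the percentages above can be reproduced from the raw counts. This is a small Python sketch; the function name `error_rates` and the extra "caught" figure (spam correctly flagged) are additions for illustration, while the counts are the ones reported above.

```python
def error_rates(ham, fp, spam, fn):
    """Compute false positive, false negative, and catch rates in percent.

    ham:  number of legitimate messages tested
    fp:   legitimate messages wrongly marked as junk
    spam: number of spam messages tested
    fn:   spam messages that were not marked as junk
    """
    return {
        "fp_pct": round(100.0 * fp / ham, 1),
        "fn_pct": round(100.0 * fn / spam, 1),
        "caught_pct": round(100.0 * (spam - fn) / spam, 1),
    }


thunderbird_05 = error_rates(ham=971, fp=4, spam=2444, fn=170)
test_build = error_rates(ham=971, fp=10, spam=2444, fn=336)
```

Here `thunderbird_05` and `test_build` hold the rates for the two runs, so the same function can be pointed at any other set of counts.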
- RickFriedman
- Posts: 361
- Joined: July 25th, 2003, 9:35 am
- Contact:
-
- Posts: 2516
- Joined: April 2nd, 2003, 4:10 pm
- Location: Thunderbird Research Center, CA
- Contact:
-
- Posts: 2516
- Joined: April 2nd, 2003, 4:10 pm
- Location: Thunderbird Research Center, CA
- Contact:
-
- Posts: 292
- Joined: April 25th, 2003, 6:50 am
- Location: The Netherlands
Hmm... this is about a week too late for me. My ISP just installed their own spam filter, and it works VERY nicely. About 98% of the spam gets caught by it, and in a week's time I've had absolutely no false positives.
However, I can still pop my spam box to see if Thunderbird also marks everything in it as spam.
-
- Posts: 6922
- Joined: July 29th, 2003, 1:09 pm
avbohemen wrote: Hmm... this is about a week too late for me. My ISP just installed their own spam filter, and it works VERY nicely. About 98% of the spam gets caught by it, and in a week's time I've had absolutely no false positives.
However, I can still pop my spam box to see if Thunderbird also marks everything in it as spam.
My ISP uses Postini, which does catch quite a bit of spam. However, since the US government, in its usual absurd way of making things worse, enacted an antispam law that makes it easier for spammists to pollute the world, I now get more than I did even a short while ago.
Until this recent calamity, I never used the Thunderbird spam filter (preferring my spam to be in slices from the contents of the Spam tins...). I now find myself appreciating this feature, even though I suspect it will take a relatively long time until I'm able to adequately train the system to despamalize things for me automatically.
-
- Posts: 292
- Joined: April 25th, 2003, 6:50 am
- Location: The Netherlands
John Liebson: My ISP uses Brightmail. I get about 80 spams per day now, and it is still rising. Brightmail only misses 1 or 2 of them. Very nice. Thunderbird missed about 20 each day.
mscott: I've set up a new pop3-account to access my spam box. However, I had to start training Thunderbird again. Initially, it didn't see any spam. However, my old training.dat is still present. Is this file used by the new junk mail code? Or do I have to train again for every new account?
-
- Posts: 2516
- Joined: April 2nd, 2003, 4:10 pm
- Location: Thunderbird Research Center, CA
- Contact: