Test Win32 Build Available with Junk Mail Filter Changes

Discussion about official Mozilla Thunderbird builds
Locked
Kylotan
Posts: 478
Joined: July 21st, 2003, 4:45 am
Location: Nottingham, UK

Post by Kylotan »

I18N = internationalisation. I, 18 letters, and an N.

My email is in English, and although I've managed to get the filter up to about 80-90% success, that's still not close to POPFile's 97% score on the same data. It's also pretty obvious, from my own experience and from anecdotal evidence on these forums, that it's very easy to mistrain the filter so badly that it never recovers. Since people are encouraged to pre-train it on their own mail, there's a good chance that they are throwing radically different and imbalanced training sets at it, probably resulting in overtraining in some cases.

I think that perhaps it would be useful for there to be some visible stats on the junk filter so that we can report back here with more detailed information. Maybe putting some of the stats into the message headers would help in identifying problem emails, too.
Moonwolf
Posts: 531
Joined: December 7th, 2003, 2:50 pm
Location: Hertfordshire, England

Post by Moonwolf »

The original algorithm worked for me, but was becoming slightly less efficient with every mail delivery.
The new algorithm worked well until the first time I reclassified a false positive. Since then it hasn't marked anything as junk. I tried manually marking all the missed junk for a few days with no effect at all. I average 150 emails a day (95% junk), so that should have been a lot of training.
Something just occurred to me. I deleted the old training.dat before I switched over. I wonder if the people who have the new algorithm working kept the old training.dat?
Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.6) Gecko/20050223 Firefox/1.0.1
Thunderbird 1.0 (20041206)
EMbuttons: Buttons & options for the Extension Manager. Easy Get Mail Button is here too.
eolh
Posts: 20
Joined: January 27th, 2004, 11:17 pm

Post by eolh »

mscott wrote: Yes, it would be nice if we could figure out why some folks seem to get no benefits from the new or old filters. I wonder if those of you running into problems are I18N users or something.


I have my system set up to be multilingual (English and Japanese), if that's what you mean. I'm running a US version of Windows XP Pro, though, and have it set to default to English. All of the spam messages that I've run through the filters have been in English (well, except for those occasional ones that are just complete nonsense!). I have character encoding in Thunderbird set to "Western (ISO-8859-1)".

Beyond that, I did a clean install with the first version with the new filters (the subsequent installs have been over top of this, but I have not noticed any significant difference in performance), and I deleted the training.dat file from my profile when I installed as suggested.

The old filters were nowhere near perfect, but they did catch, I'd say, around 75%. The new ones have yet to actually catch anything, despite the fact that I've had a "training set" of around 1775 spam messages come in since I installed them.

If there's any additional information I can provide to help out, let me know.
mscott
Posts: 2516
Joined: April 2nd, 2003, 4:10 pm
Location: Thunderbird Research Center, CA

Post by mscott »

I just discovered why some of you may be having such horrible results after retraining. If you delete training.dat and then retrain by classifying a bunch of messages that are already classified (i.e. msgs in your junk folder that are marked as junk) then our token accounting is getting seriously whacked out. If Viagra occurred in 200 messages that you re-classified after deleting training.dat, then we ended up thinking Viagra occurred in only one spam message and as such is not a very good indicator of spam.
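
To make the accounting problem concrete, here is a minimal Python sketch. It is not Thunderbird's actual code or formula: `token_spam_probability` is a hypothetical helper using a simplified Graham-style per-token estimate, and the ham figures (1 hit in 200 good messages) are made-up numbers for illustration. It only shows why a token recorded in 1 spam message instead of 200 stops looking like a spam indicator.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Simplified per-token spam estimate (an assumption, not Thunderbird's formula).

    spam_hits / ham_hits: number of spam / good messages containing the token.
    n_spam / n_ham: total number of spam / good messages trained on.
    """
    spam_freq = spam_hits / n_spam
    ham_freq = ham_hits / n_ham
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_freq / (spam_freq + ham_freq)


# Hypothetical corpus: 200 spam and 200 good messages; the token also
# shows up in a single good message.
correct = token_spam_probability(spam_hits=200, ham_hits=1, n_spam=200, n_ham=200)

# Same corpus, but the token was only counted once on the spam side,
# as in the re-classification case described above.
miscounted = token_spam_probability(spam_hits=1, ham_hits=1, n_spam=200, n_ham=200)

print(f"correct count:    {correct:.3f}")
print(f"miscounted token: {miscounted:.3f}")
```

With the correct count the token is a strong spam signal; with the miscount it lands at the neutral midpoint, so it contributes nothing to the classification.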

See Bug #237095 for more details.

For now, if you wish to properly retrain you should delete the .msf files for the folders that contain the messages that are already classified (junk/not junk) which you want to use for retraining.
Thunderbirds are Go!
eolh
Posts: 20
Joined: January 27th, 2004, 11:17 pm

Post by eolh »

mscott wrote: For now, if you wish to properly retrain you should delete the .msf files for the folders that contain the messages that are already classified (junk/not junk) which you want to use for retraining.


Wow. So I did that . . . First I marked all of the messages in the junk folder as "not junk", then closed Thunderbird and deleted the training.dat and junk.msf files. Then I reopened Thunderbird, selected the entire junk folder, and hit the "mark selected messages as junk" button. Then I checked my mail . . .

. . . and it caught its first spam ever. Yeah, I know one spam doesn't sound like much, but since it's the first spam out of nearly 1800 that it's caught, I'm impressed. Thank you.

Once I've tested a bit longer and had more spam through it I'll report back on how promising the results are.
ferdinand
Posts: 87
Joined: August 17th, 2003, 1:30 pm
Location: Netherlands

Post by ferdinand »

What information is stored in a .msf file? Is it generated automatically?
Kylotan
Posts: 478
Joined: July 21st, 2003, 4:45 am
Location: Nottingham, UK

Post by Kylotan »

Looks like another reason why only training on errors is the best way. :)
DizzyWeb
Posts: 637
Joined: March 27th, 2003, 9:56 am

Post by DizzyWeb »

Scott: it IS much better. But, as soon as I mark one message as junk, or as not junk, by hand, everything goes to hell. With a clean training.dat, it runs near perfectly, but as soon as it gets a false positive, it's screwed.
The author can never, in no way, be held responsible for any harm caused, mental or physical, by reading this post.
Kylotan
Posts: 478
Joined: July 21st, 2003, 4:45 am
Location: Nottingham, UK

Post by Kylotan »

Surely with a clean training.dat, it shouldn't catch anything at all?
djbrock
Posts: 79
Joined: May 2nd, 2003, 8:21 pm

Post by djbrock »

I have deleted the training.dat files several times in the past and had quit trying to re-train on old junk messages so that it only trains on the new messages. Yet my experience is still along the lines of DizzyWeb's.
Moonwolf
Posts: 531
Joined: December 7th, 2003, 2:50 pm
Location: Hertfordshire, England

Post by Moonwolf »

I have always trained on errors in new mail only, with exactly the same results as DizzyWeb. So I'm afraid there's something else at work here, Scott. It simply stops working after the first false positive is corrected.
eolh
Posts: 20
Joined: January 27th, 2004, 11:17 pm

Post by eolh »

Hm, looks like I spoke too soon. It caught that one spam that I noted above. Then it had two false positives, both of which I marked as "not junk". It hasn't caught a thing since.
coch
Posts: 94
Joined: December 20th, 2002, 12:36 pm

Post by coch »

Does the enhanced build offer some protection against spam emails containing fake HTML tags? I read a few months ago on these forums that, for the moment, it was safer not to mark those emails as junk.

If not clear, I mean that some messages are obviously junk, but are not classified as junk because they contain legit words enclosed in fake HTML tags spread all over the message (only seen when looking at the message source). Sometimes I also get these fake HTML tags in the middle of a word, thereby breaking a word that would otherwise trigger a junk status.

Is this part of the enhancements?
Thx.
Kerrick
Posts: 202
Joined: May 30th, 2003, 7:58 am

Post by Kerrick »

Would it not be possible to implement something in the filters (if they do not do so already) which would strip HTML tags out of a message before examining it for junk/not junk status?
PM for Gmail invites (50ish left)
prophecy
Posts: 91
Joined: August 21st, 2003, 8:52 am

Post by prophecy »

Now that's a good idea! Strip the HTML out, then do the junk check.
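
For anyone curious what that would look like, here is a rough Python sketch of the idea. It says nothing about how Thunderbird's filter actually tokenizes mail: `strip_tags` and `tokenize` are hypothetical helpers, the sample message is invented, and the regex is a naive tag matcher rather than a real HTML parser.

```python
import re

# Naive matcher for anything that looks like a tag: "<", any run of
# characters that are not ">", then ">". An assumption for illustration;
# it does not handle comments, scripts, or ">" inside attribute values.
TAG_RE = re.compile(r"<[^>]*>")


def strip_tags(text):
    """Remove tag-like spans so split words are rejoined."""
    return TAG_RE.sub("", text)


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


# Invented spam line: fake tags carry legit-looking words and also
# break up the word the filter would otherwise recognise.
raw = "Vi<budget>ag<meeting>ra for sale"

print(tokenize(raw))              # tokens the filter sees without stripping
print(tokenize(strip_tags(raw)))  # tokens after stripping the fake tags
```

One trade-off: deleting tags outright rejoins words that were split on purpose, but it would also glue together words separated only by a real tag such as a line break. Replacing tags with a space avoids that and brings back the split-word problem, so a real implementation would probably need to treat known and unknown tags differently.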