Test Win32 Build Available with Junk Mail Filter Changes

Discussion about official Mozilla Thunderbird builds
avbohemen
Posts: 292
Joined: April 25th, 2003, 6:50 am
Location: The Netherlands

Post by avbohemen »

I haven't had time to post this every day, but here's an update:

(day 1 was: 91 total, 73 marked spam, 18 incorrectly not marked, rating 80%)

Results after day 2:
79 junk mails, of which
66 marked as junk
13 not junk
no false positives in my regular mail
that is an effectiveness of 83%
Trained the filter with those messages

Day 3: haven't been online at all, while the spam was nicely waiting in large quantities

Day 4:
181 junk mails, of which
158 directly marked junk
23 not marked as junk
no false positives in my regular mail
which results in 87% correctness of the new spam filter.
I trained the filter again with 20 of the 23 left messages, to see what happens tomorrow.

By the way, I should mention that the other 3 of those 23 messages were viruses. Not taking those into account (which I did initially, because they were filtered by my ISP's junk filter), the rating would be a little higher, 89%.
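
For anyone who wants to check the percentages, here is a minimal sketch of the arithmetic, using only the counts reported in this post. The function and variable names are made up for illustration; "rating" is taken to mean junk caught divided by total junk received.

```python
def catch_rate(caught: int, total: int) -> float:
    """Percentage of junk messages that the filter marked as junk."""
    return 100.0 * caught / total

# (marked as junk, total junk received), as reported above
days = {
    "day 1": (73, 91),
    "day 2": (66, 79),
    "day 4": (158, 181),
    # the 3 viruses dropped from the junk total
    "day 4, viruses excluded": (158, 181 - 3),
}

rates = {label: catch_rate(caught, total) for label, (caught, total) in days.items()}
for label, rate in rates.items():
    print(f"{label}: {rate:.1f}%")
```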

I'm starting to like this one. With a lot less training, it is catching a lot more junk than the old one. It could still be better, but I haven't trained the filter very much. I will monitor it for at least another week; I think it's worth the testing. Compliments to the developers, you've done a really good job!
peaveyman
Posts: 341
Joined: June 1st, 2003, 6:24 pm

Post by peaveyman »

Well, I had to delete the training.dat file and start over with the training for the junk mail filter. I have about 75 junk emails in my junk mail folder. When I ran the junk mail controls on this folder, it marked every one of them as not junk. I then selected them all and marked them all as junk. Since then, it has been catching 100% of the junk that comes in.
eolh
Posts: 20
Joined: January 27th, 2004, 11:17 pm

Post by eolh »

wgianopoulos@yahoo.com wrote:On an unrelated to junk filtering note, I find with this build that if I install an extension, Thunderbird always crashes when I close it to restart it to activate the extension. Both Thunderbird and the extension appear to work fine though. On subsequent closes it does not crash. Is this a known issue?


I'm seeing this problem as well. On the first restart after installing an extension, the following error message pops up and the program crashes:

"The procedure entry point ??1nsGetServiceByContractID@@UAE@XZ could not be located in the dynamic link library xpcom.dll."

After that first crash, subsequent attempts to run Thunderbird work fine, and the extensions work without issue as well.

I've tried to install two extensions so far, "Get All Messages" and "Calendar", and I've seen this same error with both of them. The error did not occur in my previous install of the release build of TB 0.5.
Pussycat
Posts: 182
Joined: June 21st, 2003, 8:34 am
Location: Between The Netherlands and Germany

Post by Pussycat »

16 spams today, all nicely recognized. I'm very satisfied so far. If the accuracy stays this high, I'll dump K9.
djbrock
Posts: 79
Joined: May 2nd, 2003, 8:21 pm
Contact:

Junk Filter & address book

Post by djbrock »

Scott,

Thanks for the experimental version. I'm downloading it with anticipation. I get too much junk mail these days, too. I have a related filtering question. The WhiteList includes the Collected Addresses and I have mine set to collect only addresses that are sent. I don't like the fact that the Collected address book collects all of my addresses, even the ones in my Personal Address book. This makes it a bit tedious to go through and delete anything that might be a spam address. For example, if I try to be "removed" from a spam message, then this is collected by my filter and I have to ferret through hundreds of my regular email addresses to find the specious one. Maybe that can be addressed in some way in the future.
Moonwolf
Posts: 531
Joined: December 7th, 2003, 2:50 pm
Location: Hertfordshire, England
Contact:

Post by Moonwolf »

Don't EVER reply to a spam message to be removed from their list. All you're doing is confirming that they've found a live email address. You'll get more spam, not less.
Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.6) Gecko/20050223 Firefox/1.0.1
Thunderbird 1.0 (20041206)
EMbuttons: Buttons & options for the Extension Manager. Easy Get Mail Button is here too.
djbrock
Posts: 79
Joined: May 2nd, 2003, 8:21 pm
Contact:

Post by djbrock »

Thanks for the advice, but I already realize that they use that as a confirmation of a live address. My question, then, is if they already have me on the list and I reply with a "remove" request, the honest ones will remove me and the crooked ones will keep me on the list. Either way, I'm on the list. So what's the point in not at least asking for the request?

So far I can't tell any difference with the TB junk mail filter. I've trained it on a few hundred messages from my trash folder (both HAM and SPAM) and I just got another Spam that it doesn't recognize. I've also added the line pref("mail.adaptivefilters.junk_threshold", 99); to my user.js file and set it to 99. Should that be pref(... or shouldn't it be user_pref(.... ?
mscott
Posts: 2516
Joined: April 2nd, 2003, 4:10 pm
Location: Thunderbird Research Center, CA
Contact:

Post by mscott »

if it's in prefs.js it should be user_pref

and setting it to 99 makes it even harder to get a message classified as SPAM. A lower number like 90, 89, 88 makes it more aggressive about classifying a message as SPAM.
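
To illustrate why a lower number is more aggressive, here is a rough sketch. It assumes the filter gives each message a junk score from 0 to 100 and marks it as junk once the score reaches the threshold; that comparison rule, the function name and the example score are assumptions for illustration, not Thunderbird's actual code.

```python
def is_junk(score: float, threshold: int) -> bool:
    """Assumed rule: junk when the 0-100 score reaches the threshold."""
    return score >= threshold

score = 92.0  # hypothetical score for one borderline message

for threshold in (99, 90, 50, 10):
    verdict = "junk" if is_junk(score, threshold) else "not junk"
    print(f"threshold {threshold}: {verdict}")
```

Under this assumption the same message slips through at 99 but is caught at 90 or anything lower.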
Thunderbirds are Go!
Kylotan
Posts: 478
Joined: July 21st, 2003, 4:45 am
Location: Nottingham, UK
Contact:

Post by Kylotan »

It's early days yet, but I'm afraid this build is still pretty useless. I've marked emails containing the term 'prescription' as junk 4 or 5 times and it still doesn't get caught. Does the tokeniser fail to strip the (fake) HTML tags before it analyses the text or something? It's catching maybe 1 spam in 10 at the point where POPFile used to catch 9 out of 10.
djbrock
Posts: 79
Joined: May 2nd, 2003, 8:21 pm
Contact:

Post by djbrock »

Ah! I began to wonder if 99 was a better setting or not. It is now set at 90 but I even tried 50 and it still didn't pick up the mail I was interested in. In fact, I received 5 emails just now and 1 was junk. It didn't recognize it as such. I'll keep tabs on it over the next day or so and report back.

I just did an experiment. I started with 90 and decremented by 10s until I finally got the junk mail recognized. I had to go all the way down to 10 (ten) before it would mark it and move it. Is this normal?

As for the line of text, it says in this forum to put it in the user.js file, not the prefs.js file. In fact, it disappears when you put it in prefs.js.
djbrock
Posts: 79
Joined: May 2nd, 2003, 8:21 pm
Contact:

Post by djbrock »

What is disturbing to me now is that I just moved some mail that I had marked as junk and the filter failed to find them. In fact, it marked them even set at 10. And while I'm typing this another piece of junk mail just came in and the filter has not picked it up either. This is not looking good.
Kylotan
Posts: 478
Joined: July 21st, 2003, 4:45 am
Location: Nottingham, UK
Contact:

Post by Kylotan »

Latest news; this thing is still broken. Sorry to speak in such black and white terms but I've been carrying out a more scientific study in conjunction with POPFile, the filter I used to use with OE.

I trained Thunderbird only on errors for about a day; I had to mark 1 normal mail as not junk and about 6 or 7 junks as junk. It was still failing to pick things up after a few more junk mails came in. At this point, I decided to try POPFile again and compare the results.

Since I've been running POPFile with Thunderbird, I've received 31 emails. I've retrained both POPFile and Thunderbird after each download, so it improves pretty much with each email, or pair of emails if 2 came at once.

Out of those 31, 12 were spam. The first 3, neither POPFile nor Thunderbird caught as spam, so I marked them manually. The next 2 spams, POPFile caught but Thunderbird did not, despite Thunderbird having been trained on significantly more spam mails. Then 2 more spams that they both missed. Following that, 5 more spams, each of which POPFile spotted instantly, but Thunderbird missed.

On these statistics alone, it seems like Thunderbird is simply failing to tokenise or recognise the features of the spam. It doesn't take 100s of spam mails for training purposes if the filter spots the correct tokens. Also, I've not trained Thunderbird on 'legitimate' mail for a good reason; at this stage, I am not worried about false positives (since I get few enough positives as it is!). If this damages the algorithm in some other way, then the algorithm is probably flawed as a Bayes classification system only really needs to be trained to correct its errors anyway.

There is one last point which makes me even surer that Thunderbird has a bug in the implementation somewhere; all email that POPFile considers to be spam is given a new header of "X-Text-Classification: spam". So Thunderbird has seen this phrase pop up in at least 7 emails now - the ones that POPFile correctly identified before passing on to Thunderbird. So, despite there being a phrase that appears in more than 50% of my emails that I've subsequently told Thunderbird are spam, Thunderbird has failed to make this connection. I can't see how any properly-functioning Bayesian classifier could make this mistake.
mscott
Posts: 2516
Joined: April 2nd, 2003, 4:10 pm
Location: Thunderbird Research Center, CA
Contact:

Post by mscott »

I don't process all of the headers yet, so X-Text-Classification certainly isn't going to end up getting tokenized right now. Header processing is still a work in progress.
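
A toy sketch of why that matters: if the tokenizer only looks at the message body, a header token can never be learned, however often it shows up in spam. This is a simplified per-token estimate with made-up messages and hypothetical function names, not the code in this build.

```python
import re
from typing import Optional

def tokenize(message: str, include_headers: bool) -> set[str]:
    """Split a raw message into lowercase tokens, optionally skipping headers."""
    _headers, _sep, body = message.partition("\n\n")
    text = message if include_headers else body
    return set(re.findall(r"[a-z0-9-]+", text.lower()))

def token_spam_probability(token: str, spam: list[str], ham: list[str],
                           include_headers: bool) -> Optional[float]:
    """Fraction-based estimate of P(spam | token); None if the token was never seen."""
    spam_hits = sum(token in tokenize(m, include_headers) for m in spam)
    ham_hits = sum(token in tokenize(m, include_headers) for m in ham)
    if spam_hits + ham_hits == 0:
        return None
    spam_freq = spam_hits / len(spam)
    ham_freq = ham_hits / len(ham)
    return spam_freq / (spam_freq + ham_freq)

# Made-up training messages; the header mimics the one POPFile adds.
spam = [
    "Subject: offer\nX-Text-Classification: spam\n\ncheap pills here",
    "Subject: deal\nX-Text-Classification: spam\n\nno prescription needed",
]
ham = ["Subject: lunch\n\nsee you at noon"]

with_headers = token_spam_probability("x-text-classification", spam, ham, True)
body_only = token_spam_probability("x-text-classification", spam, ham, False)
print(with_headers, body_only)
```

With headers included the token looks like a perfect spam indicator; with body-only tokenizing the classifier has simply never seen it.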
Thunderbirds are Go!
Kylotan
Posts: 478
Joined: July 21st, 2003, 4:45 am
Location: Nottingham, UK
Contact:

Post by Kylotan »

Ok, that's fair enough. I retract my bug statement, although we both know that processing header information will be very useful.