Test Win32 Build Available with Junk Mail Filter Changes
-
- Posts: 292
- Joined: April 25th, 2003, 6:50 am
- Location: The Netherlands
I haven't had time to post this every day, but here's an update:
(day 1 was: 91 total, 73 marked spam, 18 incorrectly not marked, rating 80%)
Results after day 2:
79 junk mails, of which
66 marked as junk
13 not junk
no false positives in my regular mail
that is an effectiveness of 83%
Trained the filter with those messages
Day 3: haven't been online at all, while the spam was nicely waiting in large quantities
Day 4:
181 junk mails, of which
158 directly marked junk
23 not marked as junk
no false positives in my regular mail
which results in 87% correctness of the new spam filter.
I trained the filter again with 20 of the 23 remaining messages, to see what happens tomorrow.
By the way, I should mention that the other 3 of those 23 messages were viruses. Leaving those out of the count (I had included them at first, because my ISP's junk filter flagged them) would put the rating a little higher, at 89%.
I'm starting to like this one. With a lot less training, it is catching a lot more junk than the old one. It could still be better, but I haven't trained the filter very much. I will monitor it for at least another week; I think it's worth the testing. Compliments to the developers, you've done a really good job!
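For anyone who wants to check the percentages in the post above, here is a minimal sketch that recomputes them from the reported counts. The function name and labels are made up for illustration. Note that day 2 works out to 83.5%, which the post gives as 83%.

```python
def catch_rate(caught: int, total: int) -> float:
    """Percentage of junk messages that the filter marked as junk."""
    return 100.0 * caught / total


# (label, junk caught by the filter, junk received), counts taken from the post
days = [
    ("day 1", 73, 91),
    ("day 2", 66, 79),
    ("day 4", 158, 181),
    ("day 4, 3 viruses excluded", 158, 181 - 3),
]

rates = {label: catch_rate(caught, total) for label, caught, total in days}

for label, rate in rates.items():
    print(f"{label}: {rate:.1f}%")
```

Excluding the 3 viruses only shrinks the denominator (181 to 178), which is why the day 4 figure moves from about 87% to about 89%.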
- peaveyman
- Posts: 341
- Joined: June 1st, 2003, 6:24 pm
Well, I had to delete the training.dat file and start over with the training for the junk mail filter. I have about 75 junk emails in my junk mail folder. When I ran the junk mail controls on this folder, it marked every one of them as not junk. I then selected them all and marked them all as junk. Since then, it has been catching 100% of the junk that comes in.
-
- Posts: 20
- Joined: January 27th, 2004, 11:17 pm
wgianopoulos@yahoo.com wrote: On a note unrelated to junk filtering, I find with this build that if I install an extension, Thunderbird always crashes when I close it to restart it to activate the extension. Both Thunderbird and the extension appear to work fine though. On subsequent closes it does not crash. Is this a known issue?
I'm seeing this problem as well. On the first restart after installing an extension, the following error message pops up and the program crashes:
"The procedure entry point ??1nsGetServiceByContractID@@UAE@XZ could not be located in the dynamic link library xpcom.dll."
After that first crash, subsequent attempts to run Thunderbird work fine, and the extensions work without issue as well.
I've tried to install two extensions so far, "Get All Messages" and "Calendar", and I've seen this same error with both of them. The error did not occur in my previous install of the release build of TB 0.5.
-
- Posts: 79
- Joined: May 2nd, 2003, 8:21 pm
- Contact:
Junk Filter & address book
Scott,
Thanks for the experimental version. I'm downloading it with anticipation. I get too much junk mail these days, too. I have a related filtering question. The WhiteList includes the Collected Addresses, and I have mine set to collect only the addresses I send mail to. I don't like the fact that the Collected address book collects all of my addresses, even the ones in my Personal Address book. This makes it a bit tedious to go through and delete anything that might be a spam address. For example, if I try to be "removed" from a spam message, then this is collected by my filter and I have to ferret through hundreds of my regular email addresses to find the specious one. Maybe that can be addressed in some way in the future.
- Moonwolf
- Posts: 531
- Joined: December 7th, 2003, 2:50 pm
- Location: Hertfordshire, England
- Contact:
Don't EVER reply to a spam message to be removed from their list. All you're doing is confirming that they've found a live email address. You'll get more spam, not less.
Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.6) Gecko/20050223 Firefox/1.0.1
Thunderbird 1.0 (20041206)
EMbuttons: Buttons & options for the Extension Manager. Easy Get Mail Button is here too.
-
- Posts: 79
- Joined: May 2nd, 2003, 8:21 pm
- Contact:
Thanks for the advice, but I already realize that they use that as a confirmation of a live address. My question, then, is this: if they already have me on the list and I reply with a "remove" request, the honest ones will remove me and the crooked ones will keep me on the list. Either way, I'm on the list. So what's the point in not at least asking to be removed?
So far I can't tell any difference with the TB junk mail filter. I've trained it on a few hundred messages from my trash folder (both ham and spam) and I just got another spam that it doesn't recognize. I've also added the line pref("mail.adaptivefilters.junk_threshold", 99); to my user.js file, which sets the threshold to 99. Should that be pref(...), or should it be user_pref(...)?
-
- Posts: 2516
- Joined: April 2nd, 2003, 4:10 pm
- Location: Thunderbird Research Center, CA
- Contact:
-
- Posts: 478
- Joined: July 21st, 2003, 4:45 am
- Location: Nottingham, UK
- Contact:
It's early days yet, but I'm afraid this build is still pretty useless. I've marked emails containing the term 'prescription' as junk 4 or 5 times and it still doesn't get caught. Does the tokeniser fail to strip the (fake) HTML tags before it analyses the text or something? It's catching maybe 1 spam in 10 at the point where POPFile used to catch 9 out of 10.
-
- Posts: 79
- Joined: May 2nd, 2003, 8:21 pm
- Contact:
Ah! I began to wonder if 99 was a better setting or not. It is now set at 90 but I even tried 50 and it still didn't pick up the mail I was interested in. In fact, I received 5 emails just now and 1 was junk. It didn't recognize it as such. I'll keep tabs on it over the next day or so and report back.
I just did an experiment. I started with 90 and decreased it in steps of 10 until I finally got the junk mail recognized. I had to go all the way down to 10 (ten) before it would mark it and move it. Is this normal?
As for the line of text, it says in this forum to put it in the user.js file, not the prefs.js file. In fact, it disappears when you put it in prefs.js.
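The threshold experiment described above can be sketched as follows. This is only an illustration, and it assumes that a message is flagged as junk when its score, expressed as a percentage, is greater than or equal to the threshold; the thread does not confirm how Thunderbird actually applies mail.adaptivefilters.junk_threshold. The function names and the example score of 15 are hypothetical.

```python
from typing import Optional


def is_junk(score_percent: float, threshold: int) -> bool:
    # ASSUMPTION: a message is flagged when its score reaches the threshold.
    return score_percent >= threshold


def first_flagging_threshold(score_percent: float,
                             start: int = 90,
                             step: int = 10) -> Optional[int]:
    """Lower the threshold from `start` in steps of `step` until the
    message is flagged; return that threshold, or None if it never is."""
    threshold = start
    while threshold > 0:
        if is_junk(score_percent, threshold):
            return threshold
        threshold -= step
    return None


# A hypothetical message that the filter scores at 15% is flagged
# only once the threshold has dropped to 10.
print(first_flagging_threshold(15.0))
```

Under that assumption, having to go down to 10 would mean the filter scored the message somewhere between 10% and 20%, that is, it considered the message very unlikely to be junk.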
-
- Posts: 79
- Joined: May 2nd, 2003, 8:21 pm
- Contact:
-
- Posts: 478
- Joined: July 21st, 2003, 4:45 am
- Location: Nottingham, UK
- Contact:
Latest news: this thing is still broken. Sorry to speak in such black and white terms, but I've been carrying out a more scientific study in conjunction with POPFile, the filter I used to use with OE.
I trained Thunderbird only on errors for about a day; I had to mark 1 normal mail as not junk and about 6 or 7 junks as junk. It was still failing to pick things up after a few more junk mails came in. At this point, I decided to try POPFile again and compare the results.
Since I've been running POPFile with Thunderbird, I've received 31 emails. I've retrained both POPFile and Thunderbird after each download, so each one should improve with pretty much every email, or pair of emails if 2 came at once.
Out of those 31, 12 were spam. The first 3, neither POPFile nor Thunderbird caught as spam, so I marked them manually. The next 2 spams, POPFile caught but Thunderbird did not, despite Thunderbird having been trained on significantly more spam mails. Then 2 more spams that they both missed. Following that, 5 more spams, each of which POPFile spotted instantly, but Thunderbird missed.
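Tallying the sequence just described gives the totals for each filter. A small sketch, with the batches copied from the paragraph above (the variable names are made up for illustration):

```python
# Each entry: (number of spams, caught by POPFile?, caught by Thunderbird?)
batches = [
    (3, False, False),  # first 3: both missed
    (2, True, False),   # next 2: POPFile only
    (2, False, False),  # next 2: both missed
    (5, True, False),   # last 5: POPFile only
]

total_spam = sum(n for n, _, _ in batches)
popfile_caught = sum(n for n, pop, _ in batches if pop)
thunderbird_caught = sum(n for n, _, tb in batches if tb)

print(f"spam received:      {total_spam}")
print(f"POPFile caught:     {popfile_caught}")
print(f"Thunderbird caught: {thunderbird_caught}")
```

That is 7 of 12 for POPFile against 0 of 12 for Thunderbird over the same messages.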
On these statistics alone, it seems like Thunderbird is simply failing to tokenise or recognise the features of the spam. It doesn't take 100s of spam mails for training purposes if the filter spots the correct tokens. Also, I've not trained Thunderbird on 'legitimate' mail for a good reason; at this stage, I am not worried about false positives (since I get few enough positives as it is!). If this damages the algorithm in some other way, then the algorithm is probably flawed, as a Bayes classification system only really needs to be trained to correct its errors anyway.
There is one last point which makes me even surer that Thunderbird has a bug in the implementation somewhere: all email that POPFile considers to be spam is given a new header of "X-Text-Classification: spam". So Thunderbird has seen this phrase pop up in at least 7 emails now, the ones that POPFile correctly identified before passing them on to Thunderbird. So, despite there being a phrase that appears in more than 50% of the emails that I've subsequently told Thunderbird are spam, Thunderbird has failed to make this connection. I can't see how any properly-functioning Bayesian classifier could make this mistake.
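To put a rough number on the argument above, here is a minimal sketch of the per-token probability from Paul Graham's "A Plan for Spam". It is not Thunderbird's or POPFile's actual code. The counts are assumptions read off the post: the header token seen in 7 of 12 trained spams, and in 0 of the 1 message trained as not junk.

```python
from typing import Optional


def token_spam_probability(bad_count: int, good_count: int,
                           n_bad: int, n_good: int) -> Optional[float]:
    """Graham-style spam probability for one token, clamped to [0.01, 0.99].

    bad_count / good_count: occurrences of the token in spam / ham training mail.
    n_bad / n_good: number of spam / ham messages trained.
    """
    g = 2 * good_count  # ham counts are doubled to bias against false positives
    b = bad_count
    if g + b < 5:
        return None  # token too rare to be used as evidence
    bad_ratio = min(1.0, b / n_bad) if n_bad else 0.0
    good_ratio = min(1.0, g / n_good) if n_good else 0.0
    if bad_ratio + good_ratio == 0.0:
        return None  # no usable evidence either way
    return max(0.01, min(0.99, bad_ratio / (good_ratio + bad_ratio)))


# ASSUMED counts, taken from the post: header seen in 7 of 12 spams, 0 of 1 ham.
p = token_spam_probability(bad_count=7, good_count=0, n_bad=12, n_good=1)
print(p)
```

Under this scheme a token that shows up in 7 trained spams and in no trained ham gets the maximum spam probability, so a filter working along these lines would be expected to pick up on the header.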
-
- Posts: 2516
- Joined: April 2nd, 2003, 4:10 pm
- Location: Thunderbird Research Center, CA
- Contact:
-
- Posts: 478
- Joined: July 21st, 2003, 4:45 am
- Location: Nottingham, UK
- Contact: