Jul. 30th, 2017

lovingboth: ([default])
Years ago, I started to use POPFile to categorise my email. It's a Bayesian filter that goes between your email client and the email server: you show it some email, categorise it, and very quickly it learns to categorise future email almost entirely correctly. It fetches the mail from the server and assigns it to a 'bucket', then your email client fetches the mail from POPFile and can sort it according to the labels it's added to the email headers.
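To make the 'labels in the headers' bit concrete, here is a minimal Python sketch of the two halves: the filter stamping a bucket onto a message, and the client sorting on it. It is not POPFile's code. The header name is what I believe POPFile writes, but treat it as an assumption, and `label_message` and `folder_for` are made-up names for illustration.

```python
from email import message_from_string

# Assumed header name for this sketch; check what your filter really writes.
LABEL_HEADER = "X-Text-Classification"


def label_message(raw: str, bucket: str) -> str:
    """Filter side: stamp the chosen bucket onto the message headers."""
    msg = message_from_string(raw)
    del msg[LABEL_HEADER]  # drop any label already present (no error if absent)
    msg[LABEL_HEADER] = bucket
    return msg.as_string()


def folder_for(raw: str, default: str = "Inbox") -> str:
    """Client side: pick a folder purely from the label header."""
    msg = message_from_string(raw)
    return msg.get(LABEL_HEADER, default)


raw = "From: a@example.com\nSubject: Cheap pills\n\nBuy now\n"
labelled = label_message(raw, "spam")
print(folder_for(labelled))
```

In a real email client the second half is just an ordinary filter rule matching on that header.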

The reason I went for it rather than some others was that it allowed for multiple 'buckets' rather than simply two. So as well as a spam bucket, there was one for interesting spam - those 'I am a relative of this recently deceased, please let me give you lots of money' ones that I'd look at every so often - plus one for commercial mailings, one for mailing lists, etc etc etc.

I could also get it to work.

And it worked extremely well, with much greater than 99.5% accuracy. I remember writing about one bit of spam that did get through - POPFile wasn't sure if it was in the definitely spam bucket or the probably spam one, so left it as unclassified!
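The 'not sure, so leave it unclassified' behaviour is easy to sketch. What follows is a toy multi-bucket naive Bayes filter in Python, not POPFile's actual algorithm: the `BucketFilter` class, the tokeniser and the `margin` threshold are all assumptions made up for illustration. If the best two buckets score too close together, it refuses to pick one.

```python
import math
import re
from collections import Counter, defaultdict


class BucketFilter:
    def __init__(self, margin=1.0):
        self.words = defaultdict(Counter)  # bucket -> word counts
        self.docs = Counter()              # bucket -> number of messages seen
        self.margin = margin               # minimum log-score gap needed to commit

    @staticmethod
    def tokens(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, bucket, text):
        self.docs[bucket] += 1
        self.words[bucket].update(self.tokens(text))

    def scores(self, text):
        vocab = {w for counts in self.words.values() for w in counts}
        total_docs = sum(self.docs.values())
        result = {}
        for bucket, counts in self.words.items():
            n = sum(counts.values())
            score = math.log(self.docs[bucket] / total_docs)  # bucket prior
            for w in self.tokens(text):
                # Add-one smoothing so unseen words don't zero the score.
                score += math.log((counts[w] + 1) / (n + len(vocab)))
            result[bucket] = score
        return result

    def classify(self, text):
        ranked = sorted(self.scores(text).items(),
                        key=lambda kv: kv[1], reverse=True)
        if not ranked:
            return "unclassified"
        if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < self.margin:
            return "unclassified"  # top two buckets are too close to call
        return ranked[0][0]


f = BucketFilter()
f.train("spam", "cheap pills buy now cheap offer")
f.train("lists", "patch review for the kernel mailing list")
f.train("commercial", "your order receipt and invoice from the shop")
print(f.classify("cheap pills offer"))
print(f.classify("the"))
```

With three training messages it is obviously nowhere near 99.5% accurate, but the shape is the same: more than two buckets, and an honest 'don't know' when two of them are neck and neck.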

That's in complete contrast to the incredibly awful untrainable spam filter in Outlook 2003 that ex-work forced us to migrate to. Someone analysed it, but the 'article was removed due to legal issue' that's now there makes me suspect that Microsoft didn't like people knowing the details. Fortunately, someone archived it (PDF). I particularly hated the way that having 'linux' in the body of the email would make it more likely to be classed as spam, but it's an example of the way that a fixed rules-based system is nowhere near as useful as a trainable one.

At some point ten-ish years ago, I stopped using POPFile. For one thing, I stopped using the Eudora and The Bat! email clients, and my new choice, Mozilla Thunderbird, had a (Bayesian) spam filter which worked well enough. The other reason - because I run my own mail server - was the discovery that greylisting (saying to any untrusted source of incoming email that you're not ready to receive it) is almost as effective as Bayesian filtering, but means that 99% of the spam never gets delivered to you in the first place.*
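The greylisting logic itself is tiny. Here is a rough Python sketch of the decision a mail server makes, assuming the usual approach of remembering the (IP, sender, recipient) triplet and a five minute minimum delay. The `Greylist` class and its parameters are invented for illustration and are not taken from any particular mail server.

```python
import time


class Greylist:
    def __init__(self, min_delay=300, whitelist=()):
        self.min_delay = min_delay       # seconds a sender must wait before a retry counts
        self.whitelist = set(whitelist)  # IPs that are let straight through
        self.first_seen = {}             # (ip, sender, recipient) -> time of first attempt

    def check(self, ip, sender, recipient, now=None):
        """Return the SMTP reply to give for this delivery attempt."""
        now = time.time() if now is None else now
        if ip in self.whitelist:
            return "250 OK"
        key = (ip, sender, recipient)
        first = self.first_seen.setdefault(key, now)
        if now - first < self.min_delay:
            return "451 Try again later"  # temporary failure: real servers retry
        return "250 OK"


g = Greylist(min_delay=300, whitelist={"192.0.2.10"})
print(g.check("198.51.100.7", "a@example.com", "me@example.org", now=0))
print(g.check("198.51.100.7", "a@example.com", "me@example.org", now=600))
```

The first attempt gets a temporary failure and the retry ten minutes later is accepted. The whitelist is there for sources you trust but which don't retry properly.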

Gmail, used for assorted email addresses that I don't care greatly about, does reasonably well too with its trainable system. I get way more false positives and false negatives there than in my system, but the former in particular is because they obviously share data between accounts and lots of people clearly mark as spam stuff they get because they really did sign up for a mailing list and can't/won't unsubscribe instead.**

But it's not entirely satisfactory.

For example, I'm not the only one who likes having more than two buckets (of those that submitted stats, fewer than 30% of POPFile users only had two) and both Thunderbird and Gmail are two-bucket jobs.

Gmail also pointedly doesn't do folders, insisting you use labels and/or search. If it did POPFile-level filtering, that'd be much easier, because you could filter things without knowing their exact details. It's less of an issue for Thunderbird, which is folder based, but I'm still tempted to get POPFile going again.

And of course, email is not the only thing where this sort of filtering would be good. Usenet newsgroups would be a classic example, except that there's only one group I still sometimes read, it's low traffic, and Google Groups is good enough at getting rid of the spam.

POPFile does Usenet news, but it doesn't do RSS feeds. I'd love to be able to use it for that (looking for stuff I'm interested in on eBay and filtering the output of a news site into 'I really want to know' / 'if I'm not busy' / 'who cares' etc are my main use cases) but although there is supposed to be a way of doing it, via having something convert it into Usenet-style nntp posts, I could never get it to work properly.
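The conversion step, at least, is simple in principle. This Python sketch turns each RSS item into a news-style message that a mail/news filter could then classify. It only builds the messages; it does not serve them over NNTP, which is the hard part. The `rss_to_messages` name, the `local.rss` group and the `rss.invalid` addresses are all placeholders I've made up.

```python
import hashlib
import xml.etree.ElementTree as ET
from email.message import EmailMessage
from email.utils import formataddr


def rss_to_messages(rss_xml, group="local.rss"):
    """Turn each <item> in an RSS 2.0 document into a news-style message."""
    root = ET.fromstring(rss_xml)
    feed = " ".join(root.findtext("channel/title", default="feed").split())
    messages = []
    for item in root.iter("item"):
        title = " ".join(item.findtext("title", default="(no title)").split())
        link = item.findtext("link", default="").strip()
        body = item.findtext("description", default="").strip()
        # Derive a stable ID so the same item is not presented twice.
        digest = hashlib.sha1((link or title).encode("utf-8")).hexdigest()
        msg = EmailMessage()
        msg["From"] = formataddr((feed, "rss@rss.invalid"))
        msg["Newsgroups"] = group
        msg["Subject"] = title
        msg["Message-ID"] = f"<{digest}@rss.invalid>"
        msg.set_content(f"{body}\n\n{link}\n")
        messages.append(msg)
    return messages


sample = """<rss version="2.0"><channel><title>Example Feed</title>
<item><title>First item</title><link>http://example.com/1</link>
<description>Something interesting</description></item>
<item><title>Second item</title><link>http://example.com/2</link>
<description>Something dull</description></item>
</channel></rss>"""

for m in rss_to_messages(sample):
    print(m["Subject"])
```

Each message could then be fed through the same sort of bucket classifier as email, giving the 'really want to know' / 'who cares' split.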

It doesn't look like it ever will do them either. The last real release was in 2011, and it's only been fixes for Mac OS 'upgrades' breaking things and one change to handling SSL on Windows since then. People have been asking if development has stopped for four years, and while it officially hasn't, it probably really has.

There are / were some things that do do RSS. Some involved paying if you wanted to do anything vaguely useful, and the other is sux0r. Its domain has lapsed - never a good sign - but it is still archived. I also remember it as being horrible in terms of getting it working. I had to edit some of the code because its URL length limit was way too low, for example, but even then it was nowhere near as easy to use as POPFile.

Apart from that, I can find people wanting to do RSS, but nothing free that will actually work for me.

It would be great to have a Twitter client with Bayesian filtering too. Again, I'm not the only one who thinks so, but as well as the short length of tweets making it harder to train, the most promising looking solution a) wants desktop Chrome and b) isn't available yet.

If I did evilFB, I'd definitely want one there too. Even L would like a 'friends' feed with fewer 'repost this pointless graphic to show you caaare...' posts and more actually interesting ones.

So this sort of filtering is extremely useful, and in use, but finding things that do it in ways that I and other people want is hard.

Any suggestions?

* Real mail servers try again; spammers almost never do, because they're sending so much in such a scattergun way that they don't bother. You do need to make a couple of 'even though they don't follow the rules, allow mail from here' exceptions - where they almost always run Microsoft Exchange.

** I wonder if this is one reason why the Gmail app started offering to unsubscribe from emails...
