[Asrg] A paper/project worth considering (found it!)
Rich Kulawiec
rsk at gsp.org
Fri Jan 2 07:51:47 PST 2009
On Sun, Dec 14, 2008 at 05:30:24PM -0500, Chris Lewis wrote:
> I don't understand this. I tried to explain this phenomena before.
> Didn't you take statistics somewhere?
Given that most of my graduate work was in statistical pattern recognition,
I think that's safe to presume. ;-)
> Let's say for sake of argument, AOL's users have a 5% error rate. 5% of
> what they report via TIS isn't spam. That means, on average, 95 out of
> 100 reports are accurate and it is spam.
>
> You have a FBL. But you don't send any spam, right? You only get your
> share of the error rate, and none of the accurate ones - because you
> don't send any spam.
>
> So, from your perspective, the TIS button is 100% wrong. For _you_ it
> is. But it's NOT reflective of TIS hits against a network that sends spam.
Of course not: you're correct. But it is reasonable to presume that a
user population which has generated a 100% error rate on the FP side has
also generated a substantial error rate on the FN side. (That is, there's
no reason to think they're any more accurate in one direction than in
the other.)
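To make the arithmetic in the quoted example concrete, here is a rough
sketch in Python. The function name and the report counts are made up
for illustration; only the 95/5 split comes from the example above. It
shows why a sender who sends no spam sees a 100% wrong rate on a FBL
even when the reporting population as a whole is 95% accurate. It says
nothing about the FN side, which can't be derived from FBL reports.

```python
def wrong_report_share(spam_reports, nonspam_reports):
    """Fraction of TIS reports that are wrong, from one sender's view.

    spam_reports:    reports filed against messages that really were spam
    nonspam_reports: reports filed against messages that were not spam
    Both counts are hypothetical inputs, not measured data.
    """
    total = spam_reports + nonspam_reports
    if total == 0:
        return None  # no reports at all, so the share is undefined
    return nonspam_reports / total


# The whole population in the quoted example: 95 accurate, 5 mistaken.
print(wrong_report_share(95, 5))  # 0.05

# A sender who sends no spam receives only the mistaken reports.
print(wrong_report_share(0, 5))   # 1.0
```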
> Are you contending that Comcast's or Yahoo's FBLs are yielding correct
> TIS hits? Or do you have FBLs with them at all?
Some of the operations I manage or consult for do. I haven't completed
analysis of all those yet, so I'm quasi-reserving judgment. But so far,
out of the data I *have* analyzed: 100% FPs. It'll be months before
I'm done, I expect, because that's what it took to go through the AOL
results with a mix of automated/manual processes, and to cross-check
against logs, and so on.
> What the TIS button does is help highlight situations where the
> anti-spam filters aren't working.
Perhaps. But I think all such instances need to be passed by a clueful,
experienced human for manual review. That is, I think aggregating the
data and presenting it to a person with a note that says "there may be
a problem here" is reasonable, but automated action based on end-user
reports alone is a bad idea.
I also think a much better approach -- which allows a higher degree
of automation because it removes users from the equation -- is to
run a large number of local and remote spamtraps. After all, if spammer X
targets A local "real" users, then it seems reasonable to guess that X
will also target B local spamtraps and perhaps C remote spamtraps.
Correlation of data between all these makes it possible to identify at
least some spammers before users ever get a chance to use the TIS button.
(Yes, this is a methodology I use, and I use it based on connecting IP
address alone -- that is, I ignore everything else. Any IP address
connecting to a sufficient number of sufficiently-diverse MXs and
attempting delivery to a sufficient number of spamtraps is up to no good
and is treated as such.)
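
As an illustration only, the IP-based correlation described above might
be sketched along these lines in Python. Everything specific here is an
assumption: the log tuple layout, the function name flagged_ips, and
the thresholds are hypothetical, and "sufficiently-diverse" is reduced
to a plain count of distinct MXs, which is cruder than what a real
deployment would want.

```python
from collections import defaultdict


def flagged_ips(attempts, min_mxs=3, min_traps=5):
    """Return connecting IPs that hit enough MXs and enough spamtraps.

    attempts: iterable of (ip, mx, recipient, is_spamtrap) tuples, a
              hypothetical log format pooled from local and remote MXs.
    min_mxs, min_traps: illustrative thresholds, not recommended values.
    """
    mxs_seen = defaultdict(set)    # ip -> distinct MXs it connected to
    traps_seen = defaultdict(set)  # ip -> distinct spamtraps it tried

    for ip, mx, recipient, is_spamtrap in attempts:
        mxs_seen[ip].add(mx)
        if is_spamtrap:
            traps_seen[ip].add(recipient)

    # Only the connecting IP address matters; message content is ignored.
    return sorted(
        ip for ip in mxs_seen
        if len(mxs_seen[ip]) >= min_mxs and len(traps_seen[ip]) >= min_traps
    )


# Made-up sample data using documentation address ranges.
SAMPLE_LOG = [
    ("192.0.2.10", "mx1.example.org", "trap-a@example.org", True),
    ("192.0.2.10", "mx2.example.net", "trap-b@example.net", True),
    ("192.0.2.10", "mx1.example.org", "alice@example.org", False),
    ("198.51.100.7", "mx1.example.org", "bob@example.org", False),
    ("198.51.100.7", "mx1.example.org", "trap-a@example.org", True),
]

print(flagged_ips(SAMPLE_LOG, min_mxs=2, min_traps=2))  # ['192.0.2.10']
```

The second address touches one spamtrap on one MX and stays below both
thresholds, so a single stray delivery attempt does not get it flagged.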
---Rsk