[Asrg] A paper/project worth considering (found it!)

Rich Kulawiec rsk at gsp.org
Fri Jan 2 07:51:47 PST 2009


On Sun, Dec 14, 2008 at 05:30:24PM -0500, Chris Lewis wrote:
> I don't understand this.  I tried to explain this phenomenon before.
> Didn't you take statistics somewhere?

Given that most of my graduate work was in statistical pattern recognition,
I think that's safe to presume. ;-)

> Let's say for sake of argument, AOL's users have a 5% error rate.  5% of
> what they report via TIS isn't spam.  That means, on average, 95 out of
> 100 reports are accurate and it is spam.
> 
> You have a FBL.  But you don't send any spam, right?  You only get your
> share of the error rate, and none of the accurate ones - because you
> don't send any spam.
> 
> So, from your perspective, the TIS button is 100% wrong.  For _you_ it
> is.  But it's NOT reflective of TIS hits against a network that sends spam.

Of course not: you're correct.  But it is reasonable to presume that a
user population which has generated a 100% error rate on the FP side has
also generated a substantial error rate on the FN side.  (That is, there's
no reason to think they're any more accurate in one direction than in
the other.)
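
To make the arithmetic in the quoted example concrete, here is a minimal
sketch.  The function name, the per-message report probabilities and the
message counts are all made up for illustration; nothing here is measured
data.  It only shows why the share of wrong reports a sender sees depends
on that sender's mix of spam and non-spam.

```python
def fraction_of_reports_wrong(spam_delivered, ham_delivered,
                              p_report_spam, p_report_ham):
    """Expected share of TIS reports that are mistaken (i.e. FPs).

    Assumes (hypothetically) that each delivered message is reported
    independently: spam with probability p_report_spam, and non-spam
    ("ham") with probability p_report_ham.
    """
    accurate = spam_delivered * p_report_spam   # reports that hit real spam
    mistaken = ham_delivered * p_report_ham     # reports that hit non-spam
    total = accurate + mistaken
    if total == 0:
        return None                             # no reports, no rate
    return mistaken / total


# A sender that sends no spam: every report it receives is mistaken.
clean_sender = fraction_of_reports_wrong(0, 10_000, 0.01, 0.001)

# A spam-heavy sender with the same users: roughly 5% of reports are mistaken.
spammy_sender = fraction_of_reports_wrong(9_500, 500, 0.01, 0.01)
```

The first call yields 1.0 and the second about 0.05, even though the users
behave the same way in both cases.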

> Are you contending that Comcast's or Yahoo's FBLs are yielding correct
> TIS hits?  Or do you have FBLs with them at all?

Some of the operations I manage or consult for do.  I haven't completed
analysis of all those yet, so I'm quasi-reserving judgment.  But so far,
out of the data I *have* analyzed: 100% FPs.  It'll be months before
I'm done, I expect, because that's what it took to go through the AOL
results with a mix of automated/manual processes, and to cross-check
against logs, and so on.

> What the TIS button does is help highlight situations where the
> anti-spam filters aren't working.

Perhaps.  But I think all such instances need to be passed by a clueful,
experienced human for manual review.  That is, I think aggregating the
data and presenting it to a person with a note that says "there may be
a problem here" is reasonable, but automated action based on end-user
reports alone is a bad idea.

I also think a much better approach -- which allows a higher degree
of automation because it removes users from the equation -- is to
run a large number of local and remote spamtraps.  After all, if spammer X
targets A local "real" users, then it seems reasonable to guess that X
will also target B local spamtraps and perhaps C remote spamtraps.
Correlation of data between all these makes it possible to identify at
least some spammers before users ever get a chance to use the TIS button.
(Yes, this is a methodology I use, and I use it based on connecting IP
address alone -- that is, I ignore everything else.  Any IP address
connecting to a sufficient number of sufficiently-diverse MXs and
attempting delivery to a sufficient number of spamtraps is up to no good
and is treated as such.)
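
A rough sketch of that correlation step, for illustration only.  The record
layout, the function name and the thresholds are assumptions, not my
production setup, and "sufficiently-diverse" is reduced here to a plain
count of distinct MXs.

```python
from collections import defaultdict


def suspicious_ips(attempts, min_mxs=3, min_traps=5):
    """Return connecting IPs that look like spam sources.

    `attempts` is an iterable of (ip, mx, recipient, is_spamtrap) tuples,
    a hypothetical format merged from local and remote delivery logs.
    Only the connecting IP is used as the key; message content is ignored.
    """
    mxs_seen = defaultdict(set)     # ip -> distinct MXs it connected to
    traps_seen = defaultdict(set)   # ip -> distinct spamtraps it tried

    for ip, mx, recipient, is_spamtrap in attempts:
        mxs_seen[ip].add(mx)
        if is_spamtrap:
            traps_seen[ip].add(recipient)

    # Flag an IP only when it crosses both thresholds.
    return {
        ip for ip in mxs_seen
        if len(mxs_seen[ip]) >= min_mxs and len(traps_seen[ip]) >= min_traps
    }
```

The thresholds would need tuning per site; the defaults above are
placeholders.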

---Rsk

