derefspam.pl - Use MT-Blacklist rules to remove referral spam

Posted by Tony Buser Fri, 21 Jan 2005 06:56:28 GMT

Warning: If you use awstats, make sure you upgrade immediately! I've decided to stop using it, especially now that I've seen this nifty WordPress plugin.

It seems like I enjoy fighting blog spammers more than I enjoy posting to my blog lately. Tom Sherman linked to a post of mine about how I was trying to deal with referral spam. The suggestion was to use the MT-Blacklist file to actively filter referral spam out of your log files. I thought that was a pretty good idea too, so I wrote a little Perl script. Probably the best way to use it is to run it on your log right before your log analyzer processes that log, and then rotate the log. (I'll leave that up to you.)

  • Download derefspam.pl v.2 (01-23-2005)
  • Download my blacklist.txt (01-23-2005)
  • Download my whitelist.txt (01-23-2005)
  • Download MT-Blacklist's blacklist.txt

Update: Version .2

  • added optional whitelist file
  • added optional second blacklist file
  • added code to check only the referral field, making it about 3x faster
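
The referral-only check can be pictured roughly as follows. This is a minimal Python sketch, not the Perl in derefspam.pl. It assumes the log is in Apache combined format (the referrer is the first quoted field after the status and size) and that each blacklist line is a regular expression, optionally followed by a `#` comment. The names `referrer_of`, `load_rules` and `is_spam` are made up for illustration.

```python
import re

# Apache combined format: ... "REQUEST" STATUS SIZE "REFERRER" "USER-AGENT"
# Assumption: the log uses this layout; the real script may parse differently.
COMBINED = re.compile(r'"[^"]*" \d{3} \S+ "(?P<referrer>[^"]*)"')


def referrer_of(line):
    """Return the referrer field of a combined-format log line, or None."""
    match = COMBINED.search(line)
    return match.group("referrer") if match else None


def load_rules(lines):
    """Compile one regex per rule line, skipping blanks and '#' comments."""
    rules = []
    for raw in lines:
        rule = raw.split("#", 1)[0].strip()
        if rule:
            rules.append(re.compile(rule, re.IGNORECASE))
    return rules


def is_spam(line, rules):
    """True if any rule matches the referrer (not the whole line)."""
    referrer = referrer_of(line)
    if not referrer or referrer == "-":
        return False
    return any(rule.search(referrer) for rule in rules)
```

Matching against the short referrer string instead of the whole line means each rule scans far less text, and a spammy word that only appears in the request path or user agent does not trigger a match.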

Statistics: Completed 153709 lines in 266 seconds. (about 578 lines/second)

Usage: derefspam.pl [OPTIONS]

Take a log file, search through it and remove any lines that match lines in
the blacklist file and output it to another file.

Mandatory arguments:
  -i, --in file            path to log
  -o, --out file           path to output cleaned log
  -b, --blacklist file     path to blacklist rule file

Optional arguments:
  -s, --spam file          path to output lines that match blacklist
  -x, --myblacklist file   a second blacklist (so you can keep a second
                           blacklist that you maintain and overwrite the one
                           you download from MT-Blacklist)
  -m, --mydomain 'domain'  ignore referrals from this domain; this should
                           speed up processing by skipping common domains.
                           You can also separate multiple domains with a
                           | character and no spaces, enclosed in ' quotes
  -w, --whitelist file     path to a whitelist, same syntax as the blacklist;
                           use this instead of mydomain if you have a lot
                           of domains
  -d, --debug              print extra debug info
  -h, --help               what you're reading right now

Example:
  ./derefspam.pl -b blacklist.txt -w whitelist.txt -i juju-combined.log \
  -x myblacklist.txt -o juju-derefspam.log -s juju-refspam.log \
  -m 'juju.org|google.com' -d
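
To show how the `-m`, `-o` and `-s` options in the example could fit together, here is a rough Python sketch of the filtering loop. It is an illustration under assumptions, not the script's actual Perl. The function `split_log` is hypothetical, the log is assumed to be in Apache combined format, and the `'juju.org|google.com'` string is assumed to be a plain list of domains split on `|` (the real script may treat it as a raw regular expression).

```python
import re

# Assumption: Apache combined format, where the referrer is the first quoted
# field after the status code and response size.
REFERRER = re.compile(r'"[^"]*" \d{3} \S+ "([^"]*)"')


def split_log(lines, blacklist, mydomain=None):
    """Split log lines into (clean, spam) lists.

    blacklist: iterable of regex strings (stand-in for the -b / -x files).
    mydomain:  'a.com|b.com' style string (stand-in for -m); referrers
               matching it skip the blacklist check entirely.
    """
    rules = [re.compile(rule, re.IGNORECASE) for rule in blacklist]
    own = None
    if mydomain:
        # Escape each domain so '.' is literal, then rejoin as an alternation.
        own = re.compile(
            "|".join(re.escape(d) for d in mydomain.split("|")),
            re.IGNORECASE,
        )

    clean, spam = [], []
    for line in lines:
        match = REFERRER.search(line)
        referrer = match.group(1) if match else ""
        if own and own.search(referrer):
            clean.append(line)   # trusted referrer, no rule checks needed
        elif any(rule.search(referrer) for rule in rules):
            spam.append(line)    # would be written to the -s file
        else:
            clean.append(line)   # would be written to the -o file
    return clean, spam
```

Checking the trusted domains first is what saves time: most referrals on a typical site come from the site itself or from a search engine, so those lines never reach the much longer blacklist.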