Filtering data centres from web stats

Over the past few weeks, I’ve been noticing weird stuff popping up in my WordPress Statistics for Emacs buffer (that I wrote about here): a bunch of hits for the same page, using the identical User-Agent, in a short time period, from different IP addresses (if they’d used the same IP address, they’d already be filtered out). The User-Agent doesn’t announce that it’s a bot (wse already filters out self-declared bots), but is instead something like:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.6778.33 Safari/537.36

I finally broke down and investigated… and it turns out that those IP addresses are from Azure. So this is probably some kind of AI-scraping bot? I guess Copilot is going to become more adept at writing smart-ass articles about obscure comics or somethings.

The sensible thing would, of course, be to ignore all this, because who cares. On the other hand, it made me wonder whether it was possible to whip something up to ignore all visits from data centres when doing the stats… and somebody has helpfully made a list of all CIDRs for Azure/AWS/etc.

But I’m unable to find anything to cover all data centres — and that list is IPv4 only, which is weird. Don’t any of these services use IPv6 yet?

Anyway, after doing some typing:

(wse--data-center-ip-p "40.122.184.170")
=> "Azure"

(wse--data-center-ip-p "80.91.231.1")
=> nil

It works!!!

(Hm… I feel I’ve written a CIDR function like this before, but I can’t find it now.)
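
For readers who want the general shape of such a CIDR check, here is a minimal sketch in Python rather than Emacs Lisp. It is not the actual wse code. The names `SAMPLE_RANGES`, `build_table` and `data_center_for_ip` are hypothetical, and the ranges are placeholders (one picked so that the two addresses above give the same answers, the others taken from the reserved documentation prefixes), not the real published list.

```python
import ipaddress

# Placeholder ranges for illustration only; a real list would have
# thousands of entries per provider.
SAMPLE_RANGES = {
    "Azure": ["40.120.0.0/14"],
    # Reserved documentation prefixes, standing in for some other provider.
    "ExampleCloud": ["192.0.2.0/24", "2001:db8::/32"],
}


def build_table(ranges):
    """Parse the CIDR strings once, so lookups don't re-parse them."""
    return [
        (ipaddress.ip_network(cidr), provider)
        for provider, cidrs in ranges.items()
        for cidr in cidrs
    ]


def data_center_for_ip(ip, table):
    """Return the provider name whose range contains ip, or None."""
    addr = ipaddress.ip_address(ip)
    for network, provider in table:
        # An IPv4 address can never be inside an IPv6 network and vice versa.
        if addr.version == network.version and addr in network:
            return provider
    return None


TABLE = build_table(SAMPLE_RANGES)
```

With this table, `data_center_for_ip("40.122.184.170", TABLE)` returns `"Azure"` and `data_center_for_ip("80.91.231.1", TABLE)` returns `None`. Because `ipaddress` parses both address families, IPv6 ranges would slot into the same table if a list of them turned up.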

So how big is the problem really?

Oh, it’s 6% of page views. That’s not so bad. However, this is just AWS/Azure/GCP/Cloudflare; it doesn’t cover the rest. So my question is: does anybody have a complete data centre CIDR list? And howabouts them IPv6es?
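
A percentage like that can be worked out by tallying hits per provider. The sketch below does this in Python rather than Emacs Lisp, and it is not how wse computes its figure. The function names, the single placeholder range and the sample hit list are all made up for illustration.

```python
import ipaddress
from collections import Counter


def tally_by_provider(hit_ips, ranges):
    """Count hits per provider; hits outside every range are counted under None."""
    table = [
        (ipaddress.ip_network(cidr), provider)
        for provider, cidrs in ranges.items()
        for cidr in cidrs
    ]
    counts = Counter()
    for ip in hit_ips:
        addr = ipaddress.ip_address(ip)
        provider = next((p for net, p in table if addr in net), None)
        counts[provider] += 1
    return counts


def data_center_share(counts):
    """Fraction of all hits that matched some data centre range."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - counts[None]) / total


# Made-up sample: one hit from the placeholder range, three from elsewhere.
ranges = {"Azure": ["40.120.0.0/14"]}
hits = ["40.122.184.170", "80.91.231.1", "80.91.231.1", "80.91.231.2"]
counts = tally_by_provider(hits, ranges)
share = data_center_share(counts)
```

For this sample, `share` is `0.25`. The per-provider counts in `counts` also show which data centres account for the traffic.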
