Woke up this morning
Looked at the blog stats
It was twice what is normal
What’s up with that
It’s a new scraper of some kind — it helpfully uses the User-Agent Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36, which is… apparently… a real browser? Google gives me conflicting results.
The fascinating thing about this scraper is that it seems to be firing through a VPN service — it fetches each page three times (within a few seconds), but through different IP addresses, often from different ISPs.
It needs to have all pages in triplicate? Very bureaucratic.
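For the curious, here's a rough sketch of how one might spot that triplicate pattern in an access log. It assumes the Apache "combined" log format, and the function names, the ten-second window and the three-IP threshold are all made up for illustration; it's not what I actually ran.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Apache "combined" log format (an assumption; adjust to your LogFormat).
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict for one log line, or None if it does not parse."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    return rec


def find_triplicates(lines, window_seconds=10, min_ips=3):
    """Return (agent, path) pairs fetched from at least `min_ips`
    distinct IP addresses within `window_seconds` of each other."""
    by_key = defaultdict(list)
    for line in lines:
        rec = parse_line(line)
        if rec:
            by_key[(rec["agent"], rec["path"])].append((rec["time"], rec["ip"]))

    window = timedelta(seconds=window_seconds)
    suspicious = set()
    for key, hits in by_key.items():
        hits.sort()
        for i, (start, _) in enumerate(hits):
            # Distinct IPs seen in the window that starts at this hit.
            ips = {ip for t, ip in hits[i:] if t - start <= window}
            if len(ips) >= min_ips:
                suspicious.add(key)
                break
    return suspicious
```

Feeding it `open("access.log")` (or wherever your log lives) gives back the User-Agent/page combinations that were fetched in a burst from several addresses. A real reader reloading a page from one IP doesn't trip it, since only distinct addresses are counted.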
I had a look at the Jetpack stats for today, and they are apparently able to filter out this junk? Or perhaps the scraper people have blacklisted those JS resources, but not the JS resources I use for my stats. Or perhaps Jetpack does deeper inspection and compares TLS version with User-Agent versions and sees whether the combination looks likely, or has a list of VPN IPs, or…
I guess I could just blacklist this User-Agent so that I have more real stats again, but on the other hand, it’s a Sisyphean task: I already filter visits from known data centres, and from China, etc., and bots that announce that they’re bots. I guess the Jetpack people have people who work actively on the issue to make the stats more reasonable; it’s getting to be impossible to do stats yourself. Just like it’s impossible to allow comments without having Akismet do spam filtering.
But let’s see… *math* *math* *math*… Yes, filtering away all of these new things, it seems like the actual traffic from humans is 18%. And that’s not counting non-JS (i.e., simple old-fashioned scrapers) at all. I wonder whether I can dredge up approximate stats for that, too… Let’s see…
Grepping the Apache access log, counting only hits to what looks like actual blog pages, the bot readership is 98.5%. (It’s nice that WordPress comes with functional caching, I guess.) And only 8% of scrapers use a headless browser (i.e., one that runs JavaScript) to do the scraping.
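If you want to try a similar count yourself, something like the following sketch would do. To be clear, it's an illustration and not the grep I used: the permalink pattern, the list of bot hints and the `looks_like_bot` heuristic are all assumptions, and a User-Agent check alone will miss scrapers that pretend to be browsers.

```python
import re

# Apache "combined" log format, GET/HEAD requests only (an assumption).
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?:GET|HEAD) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

# Hypothetical permalink shape: /YYYY/MM/DD/slug/
PAGE_RE = re.compile(r"^/\d{4}/\d{2}/\d{2}/[^/]+/$")

# Hypothetical hints for bots that announce themselves.
BOT_HINTS = ("bot", "crawl", "spider", "scrape", "python-requests", "curl")

# Exact User-Agent strings you have decided to treat as scrapers.
SUSPECT_AGENTS = {
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/139.0.0.0 Safari/537.36",
}


def looks_like_bot(agent):
    """Crude heuristic: empty, hand-listed, or self-announced bot agents."""
    lowered = agent.lower()
    return (
        agent in ("", "-")
        or agent in SUSPECT_AGENTS
        or any(hint in lowered for hint in BOT_HINTS)
    )


def bot_share(lines):
    """Percentage of successful blog-page hits that look like bots."""
    total = bots = 0
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m["status"] != "200" or not PAGE_RE.match(m["path"]):
            continue  # skip unparseable lines, errors, and non-page hits
        total += 1
        bots += looks_like_bot(m["agent"])
    return 100.0 * bots / total if total else 0.0
```

Calling `bot_share(open("access.log"))` returns a percentage. Since the heuristic only catches what it's told about, the result is a lower bound on the bot share, which is a depressing thought.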
So I dunno. Just give up? It’s not that the stats are actually useful for anything, but I find them amusing to look at anyway.
I guess we can never have nice things.



