Innovations in Web Scraping

I added event descriptions to my Concerts in Oslo site a few months back. It mostly worked kinda OK, but it uses heuristics to figure out what “the text” is, so it sometimes includes less-than-useful information.

In particular, those fucking “YES I KNOW IT’S A FUCKING COOKIE!” texts that all fucking web sites slather their pages with now fucking get in the way, because those texts are often a significant portion of the text on any random page. (Fucking.)

But I added filtering for those bits, and things looked fine.

Yesterday I was told that the descriptions for all Facebook events were basically nothing but that cookie warning (in Norwegian), and that’s because the Facebook event pages now contain nothing but that text, plus some scaffolding to load the rest as JSON:

To build the Facebook event page, the browser makes about 85 HTTP calls and loads 6MB worth of data.

I contemplated reverse-engineering the calls to get the event description via the graphql calls (since Facebook has closed all access to public events via their API), but then it struck me: The browser is showing me all this data, so perhaps I could just point a headless browser towards the site, and then ask it to dump its DOM, and then I can parse that?

Which I’ve now done.

I know, it’s probably a common technique, but I’d just not considered it at all. A mental block of some kind, I guess. I’m so embarrassed. Of course, it now takes 1000x longer to scrape a Facebook event than something that just puts the event descriptions in the HTML, but whatevs. That’s what you have caches for.
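The whole approach fits in a few lines. Here’s a rough sketch (the script body, the two-second render delay, and the assumption that a `phantomjs` binary is on the path are all mine, not taken from the actual scraper): generate a tiny PhantomJS script that dumps the rendered DOM, then run it from Python and parse the output.

```python
import subprocess
import tempfile

# Minimal PhantomJS script: open the page, give the client-side
# scaffolding a moment to fill the page in, then dump the live DOM.
PHANTOM_SCRIPT = """
var page = require('webpage').create();
page.open('%s', function (status) {
    // Wait a bit so the JSON-loaded content has time to render.
    window.setTimeout(function () {
        console.log(page.content);  // the rendered DOM, not the raw HTML
        phantom.exit();
    }, 2000);
});
"""

def dump_dom(url):
    """Return the rendered DOM of `url` as an HTML string."""
    with tempfile.NamedTemporaryFile("w", suffix=".js", delete=False) as f:
        f.write(PHANTOM_SCRIPT % url)
        script_path = f.name
    # Assumes `phantomjs` is installed and on $PATH.
    result = subprocess.run(["phantomjs", script_path],
                            capture_output=True, text=True, timeout=60)
    return result.stdout
```

The output of `dump_dom` is then just HTML, so the existing scraping heuristics can chew on it as if the site had served static pages all along.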

I’m using PhantomJS, and it seems to work well (even if development has been discontinued). PhantomJS is so easy and pleasant to work with that I think I’ll try to stick with it until it disappears completely. Is there another headless browser that’s as good? All the other ones I’ve seen are more… enterprisey.

Concert Diary: Rockefeller

Dear concert diary.

Today I went to a Slowdive concert, mostly because Lost Girls were the opening act.

They were great, and so were Slowdive, but the experience was somewhat marred by the odour of the venue.

Rockefeller is the foundational concert venue in Oslo.

It’s always been somewhat whiffy: If you don’t stand in the middle of the floor, you’ll experience the olfactory delights of a beer-drenched carpet that hasn’t been cleaned in the last few decades (see picture above for why this happens: The venue shovels all “empty” beer glasses from the hardwood floor in front of the stage onto the carpeted wings before collecting the glasses).

But, dear diary, today was a brand new experience.

It’s been a few very warm weeks in Oslo, no doubt due to random weather fluctuations and not climate change at all.  But the stench that met us when we entered the venue was of a different kind than any we’ve experienced before.

Instead of the normal yeasty bouquet we’re used to, the non-hardwood parts of the Rockefeller venue smelled like a well-aged mixture of stale ale and diseased piss.

The urinal overtones of the venue were so overpowering that I almost tossed my cookies. I was only able to hold on to the contents of my stomach by standing at the front of the stage, even though I am very tall and that, sensibly, annoys all people of normal height.

If only somebody, somewhere, perhaps the owners of the Rockefeller venue, would hire somebody to clean the carpet in the back of the venue, people would get less nauseated when visiting the place.

Dear diary, one can only dream.

Oslo, July 18th, 2018.

Innovations in Music Distribution

I was at a jazz concert the other week, and I was looking at the CDs and stuff the musicians had brought to sell.

Adam Pulz Melbye had brought a shrinkwrapped bass string:

With a Bandcamp download code. (Censored above.)

I just had to buy one! Genius!

It’s weird that I haven’t seen anybody doing something along these lines before… It’s like a souvenir from the concert, but it’s also a way of selling music.

Concerts in Oslo

I maintain a site that lists concerts in Oslo.

In Facebook’s continuing war on its users, the events API was discontinued without warning a month ago. (That is, they may allow access to some apps after doing an individual review, but somehow I suspect that allowing access to a service that tries to drive foot traffic to venues that use Facebook to host their calendars won’t be one of those special apps, because Facebook never wants anybody to leave Facebook ever, I think?)

About a quarter of the venues have their event listings on Facebook only, so that’s a rather big blow against having a useful concert listing site.

So I spent an evening reimplementing Facebook event web page scraping, and while doing that I started thinking about whether I should fancify my Concerts in Oslo web site. Scraping an image and a summary from the event pages didn’t seem insurmountable… Just find the largest image and the most coherent textual part of the HTML and there you are. (You have to filter out the “COOKIES EXIST! DID YOU KNOW THAT!” texts on most pages, because they’re often the longest texts, though.)
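That heuristic is easy to sketch. Here’s a rough Python version (the cookie-word list, class names, and tag handling are simplified assumptions of mine, not the actual scraper): collect the text blocks in a page, throw away anything that smells like a cookie warning, and keep the longest survivor.

```python
from html.parser import HTMLParser

# Hypothetical filter list; the real one would include the
# Norwegian cookie-banner phrases too.
COOKIE_WORDS = ("cookie", "informasjonskapsler", "samtykke")

class TextBlockExtractor(HTMLParser):
    """Collect visible text blocks, skipping script/style contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:
            self.blocks.append(text)

def best_summary(html):
    """Longest text block that doesn't look like a cookie warning."""
    parser = TextBlockExtractor()
    parser.feed(html)
    candidates = [b for b in parser.blocks
                  if not any(w in b.lower() for w in COOKIE_WORDS)]
    return max(candidates, key=len, default="")
```

The same “biggest wins” idea works for the image: walk the `<img>` tags and keep the one with the largest declared (or downloaded) dimensions.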

What took most work was trying to determine how this data should be loaded. In total, all the extra data is about 45MB, so just having it all in that initial table doesn’t really work. And I wanted to keep the data structures the same, so that the apps would also continue to work.

I first tried displaying the event summaries on hovering, but that was insanely annoying. Then I tried expanding the table when scrolling into view, and that was even more annoying, because things would move around a lot and you’d get confused.

UX is hard!

So I settled on pre-expanding the bottom border of each table line and then putting the event info in an absolutely-positioned div relative to the line. It’s a crime against CSS! But it works!
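For the curious, the crime looks roughly like this (the class names are made up for illustration, and whether `position: relative` behaves on an actual `<tr>` varies by browser, which is presumably part of what makes it a crime): the row reserves space below itself, and the event info is absolutely positioned into that space.

```css
/* Each result row is the positioning context, and its expanded
   bottom border reserves room for the event info below it. */
tr.event {
    position: relative;
    border-bottom: 120px solid transparent;
}

/* The event summary/image sits in the reserved space,
   positioned relative to its own row. */
tr.event .info {
    position: absolute;
    top: 100%;
    left: 0;
    height: 120px;
    overflow: hidden;
}
```

Because nothing in the table itself grows or shrinks, the rows never jump around while you scroll, which was the whole point.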

And now I don’t have to do any work on the site… until Facebook changes their HTML again.