Emacs DOM Traversal

I’ve been doing a bit of web scraping with Emacs lately, and I haven’t been totally satisfied with how my dom.el library worked.

But on Friday I was fiddling around with some jQuery stuff, and I noticed how handy it was that jQuery functions that dealt with a single node (like .attr()) could be fed a list of jQuery objects.  Whenever that happened, the function would just work on the first element in the list.

Which is totally what you want when doing web scraping.  You map over some objects, but a lot of the time you want, for instance, the text from the first <a> element underneath a node.
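
Here is a rough sketch of that idea in Emacs Lisp, not the actual code from the library. It assumes a node is a list of the form (TAG ATTRIBUTES . CHILDREN), with ATTRIBUTES an alist keyed by keywords like :href. The `my-dom-first’ and `my-dom-attr’ names are made up for illustration.

```elisp
;; Hypothetical helpers; the `my-dom-' names are illustrative only.
;; Assumption: a node looks like (TAG ATTRIBUTES . CHILDREN), where
;; ATTRIBUTES is an alist such as ((:href . "/one") (:title . "One")).

(defun my-dom-first (node-or-list)
  "Return NODE-OR-LIST if it is a single node, else its first element."
  ;; A node starts with a tag symbol; a list of nodes starts with a cons.
  (if (consp (car node-or-list))
      (car node-or-list)
    node-or-list))

(defun my-dom-attr (node-or-list attr)
  "Return the value of ATTR on the first node in NODE-OR-LIST."
  (cdr (assq attr (cadr (my-dom-first node-or-list)))))

;; The same call works on one node and on a list of nodes.
(let ((a '(a ((:href . "/one") (:title . "One")) "first"))
      (b '(a ((:href . "/two") (:title . "Two")) "second")))
  (list (my-dom-attr a :href)
        (my-dom-attr (list a b) :href)))
```

With this shape, an empty result list simply yields nil, so a chain of lookups doesn't blow up when one step finds nothing.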

So I rewrote the library, and my scraping code became prettier, I think.

This is the code to scrape the concerts for the “Mir” venue before:

(defun csid-parse-mir (dom)
  (loop for elem in (dom-by-id dom "program")
        for link = (car (dom-by-name
                         (car (dom-by-class elem "programtittel")) 'a))
        collect (list (csid-parse-month-date
                       (dom-text (car (dom-by-class elem "programtid"))))
                      (dom-attr link :href)
                      (dom-attr link :title))))

And this is the code after:

(defun csid-parse-mir (dom)
  (loop for elem in (dom-by-id dom "program")
        for link = (dom-by-name (dom-by-class elem "programtittel") 'a)
        collect (list (csid-parse-month-date
                       (dom-text (dom-by-class elem "programtid")))
                      (dom-attr link :href)
                      (dom-attr link :title))))

See? All those `car’s are gone. Much more eco-friendly.  Stop climate change!

Things indent so awkwardly when you have to sprinkle short functions all over the place.

Anyway, I have a bike-shedding query.  I’m pretty satisfied with function names like `dom-by-id’ and `dom-by-class’.  They’re pretty self-explanatory.  (Although if somebody has better (i.e. shorter and clearer) names, that’d be good.)

But I dislike `dom-by-name’.  It should really be `dom-by-node-name’, but that’s way too long.  And `dom-by-name’ is just confusing, because it sounds like it searches on the name attribute, so it has to go.

But what should it be?  `dom-by-node’?  `dom-by-tag’?  `dom-by-type’?  WHAT!?!?

 
