Emacs DOM Traversal

I’ve been doing a bit of web scraping with Emacs lately, and I haven’t been totally satisfied with how my dom.el library worked.

But on Friday I was fiddling around with some jQuery stuff, and I noticed how handy it was that jQuery functions that dealt with a single node (like .attr()) could be fed a list of jQuery objects.  Whenever that happened, the function would just work on the first element in the list.

Which is totally what you want when doing web scraping.  You map over some objects, but a lot of the time you want, for instance, the text from the first <a> element underneath a node.
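To make the "work on the first element" idea concrete, here is a minimal sketch of how a single-node accessor can quietly accept a list of nodes. It assumes nodes have the shape (TAG ATTRIBUTES . CHILDREN), as produced by `libxml-parse-html-region’, and the `my-dom-*’ names are made up for illustration; this is not necessarily how dom.el does it.

```elisp
;; Sketch only.  Assumes a node looks like (TAG ATTRIBUTES . CHILDREN),
;; e.g. (a ((href . "/x")) "text").  The my-dom-* names are hypothetical.

(defun my-dom-ensure-node (node-or-list)
  "Return NODE-OR-LIST if it is a single node, else its first element."
  ;; A node starts with a tag symbol; a list of nodes starts with a cons.
  (if (consp (car node-or-list))
      (car node-or-list)
    node-or-list))

(defun my-dom-attr (node-or-list attr)
  "Return the value of ATTR on the node, or on the first node of a list."
  (cdr (assq attr (cadr (my-dom-ensure-node node-or-list)))))

(defun my-dom-text (node-or-list)
  "Join the direct string children of the node (or first node of a list)."
  (mapconcat #'identity
             (delq nil (mapcar (lambda (child) (and (stringp child) child))
                               (cddr (my-dom-ensure-node node-or-list))))
             " "))

;; Both a single node and a list of nodes work:
(let ((links '((a ((href . "/first")) "First")
               (a ((href . "/second")) "Second"))))
  (list (my-dom-attr links 'href)         ; "/first"
        (my-dom-attr (cadr links) 'href)  ; "/second"
        (my-dom-text links)))             ; "First"
```

The check costs one `consp’ per call, and an empty result list just yields nil instead of signalling an error, which is usually what you want when a page is missing an element.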

So I rewrote the library, and my scraping code became prettier, I think.

This is the code to scrape the concerts for the “Mir” venue before:

(defun csid-parse-mir (dom)
  (loop for elem in (dom-by-id dom "program")
        for link = (car (dom-by-name
                         (car (dom-by-class elem "programtittel")) 'a))
        collect (list (csid-parse-month-date
                       (dom-text (car (dom-by-class elem "programtid"))))
                      (dom-attr link :href)
                      (dom-attr link :title))))

And this is the code after:

(defun csid-parse-mir (dom)
  (loop for elem in (dom-by-id dom "program")
        for link = (dom-by-name (dom-by-class elem "programtittel") 'a)
        collect (list (csid-parse-month-date
                       (dom-text (dom-by-class elem "programtid")))
                      (dom-attr link :href)
                      (dom-attr link :title))))

See? All those `car’s are gone. Much more eco-friendly.  Stop climate change!

Things indent so awkwardly when you have to sprinkle short functions all over the place.

Anyway, I have a bike-shedding query.  I’m pretty satisfied with function names like `dom-by-id’ and `dom-by-class’.  They’re pretty self-explanatory.  (Although if somebody has better (i.e. shorter and clearer) names, that’d be good.)

But I dislike `dom-by-name’.  It should really be `dom-by-node-name’, but that’s way too long.  And `dom-by-name’ is just confusing, because it sounds like it looks up elements by their name attribute, so it has to go.

But what should it be?  `dom-by-node’?  `dom-by-tag’?  `dom-by-type’?  WHAT!?!?

 
