Campaign for Humane Sorting

Does this look familiar to you?

DSC00937

Or this?

DSC00936

Then you know the pain and suffering caused by virtually all tools that sort things “alphabetically” when those things contain numbers, and the strategies we’ve adopted to deal with these broken tools.

We’re humans.  How would a human sort “foo25.txt” versus “foo3.txt”?  We would think “hm, there’s some text, and there’s a number, and then there’s more text.  Let’s sort the texty bits first, and then sort by number, and then sort the rest of the texty bits”.

That would make sense.  But that’s not what’s been implemented, for the most part.  Instead people are forced to create bug reports with names like “bug #000000001345” and keep track of the leading zeroes and hope that they’ll never need more.  Because then everything will break down anyway.

It seems like a no-brainer to sort contiguous numerical parts as numbers.  This solves most of our problems.  But it’s interesting to consider whether this might be extended.
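Here is a minimal sketch of the contiguous-digits idea, written in Python rather than the Emacs Lisp that Dired would need.  The names `natural_key` and `_DIGITS` are made up for illustration and don’t come from any existing tool.

```python
import re

# Hypothetical helper names; nothing here comes from an existing library.
_DIGITS = re.compile(r"(\d+)")

def natural_key(name):
    """Split a name into text chunks and number chunks for use as a sort key."""
    parts = _DIGITS.split(name)
    # Splitting on a capturing group alternates text, digits, text, ...
    # so the odd positions are always the digit runs.
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

names = ["foo25.txt", "foo3.txt", "foo4.txt", "bar10.txt", "bar9.txt"]
print(sorted(names))                   # plain string comparison
print(sorted(names, key=natural_key))  # digit runs compared as numbers
```

Because text and numbers always land in the same positions of the key, two keys never end up comparing a string against a number.  Leading zeroes stop mattering, too, since “0007” and “7” both become 7.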

For instance, if you have “15.6%” and “15.16%”, you probably want the latter to be sorted before the former.  If you split them into two numerical chunks, you’d instead get the former ahead of the latter.

It’s common to write numbers with other characters in the middle to break up the long stream of digits.  In the US, you write “10,000,004” for ten million and four.  It would be nice if that sorted after “20”.  In some other countries you’d write “10.000.004” for the same number.  In India you’d write “1,00,00,004”.
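As a rough illustration of why this is awkward, here is a small Python sketch.  `parse_grouped` is a hypothetical helper, and it only works because the caller tells it which character is the grouping separator.

```python
def parse_grouped(text, group_sep=","):
    """Drop the grouping character and read what is left as an integer.

    The separator has to be supplied by the caller: the string alone
    cannot tell us whether "." groups digits or marks a decimal point.
    """
    return int(text.replace(group_sep, ""))

print(parse_grouped("10,000,004"))                  # US style
print(parse_grouped("1,00,00,004"))                 # Indian style
print(parse_grouped("10.000.004", group_sep="."))   # dot as grouping character
print(parse_grouped("20") < parse_grouped("10,000,004"))
```

All three spellings come out as the same integer, but only once somebody has decided what the separator means.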

If you have a version number like “Gnus 5.15.5”, then that’s a later version than “Gnus 5.6” and “Gnus 5.6.4”.

So we quickly run into consistency problems if we try to be too clever.  On the other hand, we can probably get the computer to sort better than most humans would.

But surely I can’t be the first person to think of this major and pressing issue.  Surely somebody has done something awesome here.
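To make the conflict concrete, here is a sketch in Python with two hypothetical key functions.  `chunk_key` treats every digit run as its own integer, while `decimal_key` reads “digits, dot, digits” as a single decimal number.  Neither is taken from an existing tool.

```python
import re

def chunk_key(s):
    """Every digit run is its own integer: "5.15.5" gives 5, 15, 5."""
    parts = re.split(r"(\d+)", s)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

def decimal_key(s):
    """Digit runs joined by a single dot are read as one decimal number."""
    parts = re.split(r"(\d+(?:\.\d+)?)", s)
    return [float(p) if i % 2 else p for i, p in enumerate(parts)]

percents = ["15.6%", "15.16%"]
versions = ["Gnus 5.15.5", "Gnus 5.6", "Gnus 5.6.4"]

print(sorted(percents, key=chunk_key))    # 15.6% first: wrong for percentages
print(sorted(percents, key=decimal_key))  # 15.16% first: right
print(sorted(versions, key=chunk_key))    # 5.6, 5.6.4, 5.15.5: right
print(sorted(versions, key=decimal_key))  # 5.15.5 first: wrong for versions
```

Each key gets one of the two lists right and the other wrong, and nothing in the strings themselves says which reading was intended.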

Let’s take a look at Explorer in Windows:

Screen Shot 2014-10-13 at 14.22.01 pm

Nope.  “foo25.txt” sorts before “foo4.txt”.  Bad Bill.  (Edit: I’m being informed that I must have clicked on something, because Windows Explorer does allegedly sort correctly.)

What about OS X:

Screen Shot 2014-10-13 at 14.17.37 pm

Wow!  Steve got it right!  It only sorts based on contiguous digits, though, and doesn’t try to do anything clever with decimals and the like.  But so pretty:

Screen Shot 2014-10-13 at 11.47.06 am

And on Linux, the version of Firefox in Debian Stale seems to be sorting on ASCII values, while the latest version sorts just like on OS X.  Fancy that.

Ok, so I wasn’t the first one to consider this.  In Unix tool land, we have “sort -g”, which compares lines by the number they start with, decimal point included, instead of comparing them as strings.  And they’ve recentlishly added “--human-numeric-sort” that, if I understand correctly, does much the same as “sort -g”, but also interprets SI units like “G” and “M”, so that you can say “du -h | sort -h” and get the result you desire instead of the useless default sorting method.

Anyway, we must adapt this to Emacs.  Dired should definitely sort more humanely, but how ambitious should we be?  Just go for contiguous digits, or try to interpret “number clusters” more freely?

I’ve made a proof of concept on GitHub (written in the most inefficient way possible) instead of just putting this into Emacs immediately.  It just does the “contiguous digit” thing at present, so that “foo25.txt” sorts after “foo4.txt”.

What do you think?  How far should Dired (and the like) go in this direction?

Films 4 Ever

DSC00933
The wall of unseen movies keeps expanding

I’ve sort of stumbled into another CDO project that has even less utility than most of the other ones.

I’ve been ripping DVD and BluRay films with makemkv before viewing, because 1) mplayer under Linux doesn’t really do BluRay, and 2) mplayer fails to play an ever increasing number of DVDs.  The joys of Digital Rights Management.

And it’s nice to take movies with me when going on holidays.

So I’ve been ripping, viewing and deleting.

But a few weeks back I started thinking that perhaps I shouldn’t delete after viewing.  I may not want to watch the films again anytime soon, but some of them I wanted to listen to the commentary tracks on (for instance, on John Walter movies), or see some of the extras on.  So I kinda have unseen movies, partially seen stuff, and totally seen stuff.

And it would be nice to list movies by genre or director or seen-ness.  It just appeals to me.

DSC00931
Colour-coded films based on “seen” status

A DVD is normally in the 6GB area (between 4 and 9GB, usually).  So if a 6TB disk existed, then that would have room for 1000 films!  And it does!

So I rewrote movie.el a bit and bought a couple of 6TB WD Green disks in USB3 enclosures.

That was not a wise choice.  When the disks spin down, something along the way freaks out when trying to write to a disk.  Either the USB3 enclosure or the Linux USB layer thinks the disk is dead and kicks it off the system.  Which means unmounting, switching the disk off and then on again, and then remounting.

So that’s a no go.  Besides, I kinda underestimated the size of BluRay films.  Normal ones are about 35GB.  Major sci-fi ones usually come in 3D and 2D versions, each taking up to 50GB, and with extras and stuff it can come to over 150GB.  A 6TB disk really doesn’t seem that big any more.

It seems like the rule is: The current generation of optical media is always too impractical to store without further compression.  It all started with CDs.  We ripped and converted to too compressed mp3s.  Then a few years later the hard disks grew and we went to lossless.  Then people started ripping DVDs, transcoding and compressing the hell out of the data.  Then disks grew and we’re now ripping lossless.  Then BluRay arrived…

Anyway…  this is a project of dubious utility, and seems totally impractical.  So abandon it?  No!  Double down!

I bought a pair of 6TB WD Red disks and RAID0’d them in a JBOD USB3 enclosure.  Yee-haa!  And then I’m going to keep an rsync mirror somewhere with another one of those.

I wrote up a tiny Emacs library to query imdb for movie details.

[larsi@stories ~/src/imdb.el]$ df -h /mdvd/
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0         11T  4.3T  6.7T  39% /mdvd

Almost done ripping the old, seen films.

I got a USB3 BluRay player to see whether it’s faster than the USB2 one, and it’s not.  But I can rip two disks in parallel now, so that helps a lot.  I can’t really use the internal one, because it doesn’t let the region code be changed, and I haven’t really bought DVDs based on region.

DSC00932

Just because something is impractical and useless doesn’t mean that it’s not worth doing.

Er.