Twiddling Youtube; or, I mean, Innovations in Machine Learning

I mean, we’ve all been annoyed when we set up our USB monitor in our hallway that displays weather data, and then we decided to show videos from Youtube that somehow relate to the music that’s playing in our apartment; we’ve dreamed of having something like the following catch our eyes when passing by on the way to the kitchen.

Oh, what a marvellous dream we all had, but then it turned out that most of the videos that vaguely matched the song titles were still videos.

So many still photo videos. So very many.

I mean, this is a common problem, right? Something we all have?

Right?

Finally I’m writing about something we all can relate to!

So only about five years after first getting annoyed by this huge problem, I sat down this weekend and implemented something.

First I thought about using the video bandwidth of the streaming video as a proxy for how much liveliness there is in a video. But that seems error-prone (somebody may be uploading still videos in very HD and with only I-frames, and I don’t know how good Youtube is at optimising that stuff), and why not go overboard if we’re dipping our feet into the water, to keep the metaphor moist.

So I thought: Play five seconds of a video, taking a screenshot every second, and then compare the snapshots with ImageMagick “compare” to get a more solid metric; then I can check whether bandwidth is a good proxy after all.

The “compare” incantation I’m using is:

compare -metric NCC "$tmp/flutter1.jpg" "$tmp/flutter2.jpg" null:

I have no idea what all the different metrics mean, but one’s perhaps as good as another when all I want to do is detect still images?
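For what it’s worth, the gist of the whole screenshot-and-compare dance is something like this (a minimal sketch, not the actual code from the repository; it grabs frames with ffmpeg instead of screenshotting a live player, and assumes the first argument is a stream URL that ffmpeg can read, for instance whatever youtube-dl -g hands back):

#!/bin/bash
# Sketch: sample one frame per second for five seconds and average the
# NCC similarity of consecutive frames.  NCC is 1.0 for identical images,
# so a high average means "still video".
url="$1"
tmp=$(mktemp -d)
ffmpeg -loglevel error -i "$url" -t 5 -vf fps=1 "$tmp/flutter%d.jpg"
total=0; n=0
for i in 1 2 3 4; do
  [ -f "$tmp/flutter$((i+1)).jpg" ] || break
  # compare prints the metric on stderr.
  m=$(compare -metric NCC "$tmp/flutter$i.jpg" "$tmp/flutter$((i+1)).jpg" null: 2>&1)
  m=${m%% *}
  total=$(echo "$total + $m" | bc -l)
  n=$((n+1))
done
[ "$n" -gt 0 ] && echo "stillness: $(echo "$total / $n" | bc -l)"
rm -rf "$tmp"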

So after hacking for a bit in Perl and Bash and making a complete mess of things (asynchronous handling of all the various error conditions and loops and stuff is hard, boo hoo, and I want to rewrite the thing in Lisp and use a state machine instead, but whatevs), I now have a result.

Behold! Below I’m playing a song by Oneohtrix Point Never, who has a ton of mad Youtube uploaders, and watch it cycle through the various hits until it finds something that’s alive.

Err… What a magnificent success! Such relevance!

Oh, shut up!

*mumble*

But let’s have a look at the data (I’m storing it using sqlite3 for convenience) and see whether videos are classified correctly.
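(The storage is nothing fancy; think something along these lines, where the table and column names are made up for illustration and the real schema in the repository may well differ:)

# Hypothetical schema -- the real layout in my code may differ.
sqlite3 videos.db 'CREATE TABLE IF NOT EXISTS stillness
  (id TEXT PRIMARY KEY, metric REAL, bitrate INTEGER);'
sqlite3 videos.db "INSERT OR REPLACE INTO stillness
  VALUES ('yt1qj-ja5yA', 0.9999, 476736);"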

I’m saying that anything “compare” gives a rating of more than 0.95 is a “still image video”. So first of all we have a buttload of videos with a metric of 0.9999, which is very still indeed.

0.9999  yAZrDkz_7aY   36170
0.9999  yCNZVvP7cAE  150241
0.9999  yai4bier1oM  128630
0.9999  yt1qj-ja5yA  476736
0.9999  yxWzoYQb5gU  244076
0.9999  z1YKfu5sD24  723392
0.9999  z28HTTtJJEE  372014
0.9999  zOirMAHQ20g  574614
0.9999  zWxiVHOJVGU   70909

But the bitrates vary from 36kbps to 723kbps, which is a wide range. So let’s look at the ones with very low metrics:

0.067   slzSNsE7CKw  1359008
0.1068  m_jA8-Gf1M0  2027565
0.1208  7PCkvCPvDXk  1702924
0.1292  zuDtACzKGRs  3969219
0.1336  VHKqn0Ld8zs  1607430
0.1603  Tgbi3E316aU  1877994
0.2153  ltNGaVp8PHI   506771
0.2192  j14r_0qotns   683650
0.2224  dhf3X6rBT-I  1715754
0.2391  WV4CQFD5eY0   416458
0.2444  NdUZI4snzk8  2073374

Very lively!

These definitely have higher mean bitrates, but a third of them have lower bitrates than the highest-bitrated (that’s a word) still videos, so my guess was right, I guess. I guess? I mean, my hypothesis has proven to be scientifically sound: Bitrates aren’t a good metric for stillness.

And finally, let’s have a peek at the videos that are around my cutoff point of 0.95 (which is a number I just pulled out of, er, the air, yeah, that’s the expression):

0.9384  t5jw3T3Jy70   802643
0.9454  5Neh0fRZBU4  1227196
0.9475  ygnn_PTPQI0  1907749
0.949   XYa2ye4GPY8    84848
0.9501  myxZM9cCtiE  1202315
0.9503  lkA9BRDWKco   297490
0.9507  mz91Z2aRJfs   203855
0.9512  IDMuu6DnXN8   358156
0.9513  bsFRMTbhOn0   198332
0.9513  v6CKHqhbos8  1686790
0.9514  3Y1yda0YfQs  1012911

Yeah, perhaps I could lower the cutoff to 0.90 or something to catch the semi-static videos, too, but then I’d also lose lively videos that just have large black areas on the screen.

Hm… and there’s also a bunch of videos that it wasn’t able to get a metric on… I wonder what’s up with those.

1  pIBEwmyIwLA   349057
1  pzSz8ks1rPA   108422
1  qmlJveN9IkI    83383
1  srBhVq3i2Zs  1651041
1  tPgf_btTFlc   111953
1  uxpDa-c-4Mc   691684
1  uyI3MBpWLuQ    45383

And some it wasn’t able to play at all?

0  3zJkTILvayA        0
0  5sR2sCIjptY        0
0  E44bbh32LTY  4774360
0  FDjJpmt-wzg        0
0  U1GDpOyCXcQ        0
0  XorPyqPYOl4

Might just be bugs from when I was testing the code, though, and those are still in the database. Well, no biggie.

You can find the code on Microsoft Github, but avert your eyes: This is bad, bad code.

Anyway, the fun thing (for me) is that the video monitor will get better over time. Since it stores these ratings in the sqlite3 database and skips all videos with high metrics, I’ll wind up with all action all the time on the monitor, and the player doesn’t have to cycle through all the still-video guesses first.
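(In other words, when picking the next thing to show, the player only has to ask the database for something that hasn’t already been written off as still; a sketch, using the same made-up table from above:)

# Skip everything we've already classified as a still-image video.
sqlite3 videos.db "SELECT id FROM stillness
  WHERE metric <= 0.95 ORDER BY RANDOM() LIMIT 1;"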

See? The machine learns, so this is definitely a machine learning breakthrough.

Innovations in Emacs Touch Interfacing

I’ve long attempted to hack some touch interfaces for laptops in non-keyboard configurations.

The sad thing is that there aren’t really any good solutions in GNU/Linux. If you want to be able to respond to more complex events like “two-finger drag”, you have to hack GTK and use Touchégg, and then it turns out that doesn’t really work on Wayland, and then most of the events disappeared from the X driver, and then…

In short, the situation is still a mess. And then my lug-around-the-apt-while-washing-TV-laptop died (ish), so I had to get a new one (a Lenovo X1 Yoga (2nd gen (which I had to buy from Australia, because nobody would sell it with the specs I wanted (i.e., LTE modem if I wanted to also take it travelling (the 3rd gen has an LTE modem that’s not supported by Linux))))):

And now, with Ubuntu 18.04, everything is even worse, and I’m not able to get any multi-finger events at all! All the touch events are just translated into mouse events! Aaaargh!

After despairing for an eternity (OK, half a day), I remembered another touch interface that I quite like: The Perfect Reader.

It’s a bit hard to tell here, but the idea is that you divide the screen into areas, and then you just tap one of the areas to have the associated action happen.

Surely even Linux can’t have fucked up something so basic: It must be possible to get that kind of access.

And it’s possible! Behold!

Er… What’s going on on the other side of the backyard?

Eeek! Kitten! Go back inside!

That’s not a safe place to play! … *phew* It sat down, and turned around and went back inside. *heart attack averted*

ANYWAY!

The idea is that there’s one action grid overlay when Emacs is in the forefront, and another when the mpv video player is.  All the events go via Emacs, though, which controls mpv via the mpv socket interface.  (And, by the way, I have to say that I’m really impressed with mpv.  It has all the commands you want it to have.  The documentation is somewhat lacking, though.)
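Controlling mpv over that socket is just a matter of writing small JSON commands to it; for instance (assuming mpv was started with --input-ipc-server=/tmp/mpv-socket; the socket path is whatever you choose):

# Toggle pause and skip ahead 30 seconds via mpv's JSON IPC socket.
echo '{ "command": ["cycle", "pause"] }' | socat - /tmp/mpv-socket
echo '{ "command": ["seek", 30, "relative"] }' | socat - /tmp/mpv-socket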

Here’s a demo:

Basically, I’m just reading the output from libinput-debug-events (which outputs everything that goes through /dev/input/event* (so you have to add your user to the input group to access them)), and then executing things based on that. libinput is allegedly the new hotness, and replaces the evdev and synaptics X drivers, and is supposed to be supported on both Wayland and Xorg, so hopefully this attempt at an interface will last a bit longer than the previous ones.
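To illustrate the idea in plain shell (this isn’t the actual Emacs code; the parsing assumes TOUCH_DOWN lines carry x/y percentage coordinates, and the exact output format of libinput-debug-events varies between versions):

#!/bin/bash
# Sketch: map touch taps to cells in a 3x3 action grid.
# Needs your user in the "input" group to read the event devices.
stdbuf -oL libinput-debug-events | while read -r line; do
  case "$line" in
    *TOUCH_DOWN*)
      # Pull out the first "xx.xx/yy.yy" pair (the percentage coordinates).
      coords=$(echo "$line" | grep -o '[0-9]*\.[0-9]*/[0-9]*\.[0-9]*' | head -1)
      [ -n "$coords" ] || continue
      x=${coords%/*}; y=${coords#*/}
      col=$(( ${x%.*} * 3 / 100 )); row=$(( ${y%.*} * 3 / 100 ))
      echo "tap in grid cell $row/$col"
      # ...and this is where we'd tell Emacs (or mpv) what to do.
      ;;
  esac
done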

I wrote the controlling thing in Emacs, of course, and you can find it on Github. I’ve been using an Emacs-based movie/TV viewer since 2004, and I’m not giving up now! So there.