While spelunking the web to look for magazines about comics to add to kwakk.info, I happened upon a site that linked to a lot of PDFs that were stored on Google Drive.
Then I noticed that these files were “protected” — this means that there’s no download button on Google Drive. So I thought “well, I won’t be adding these to the search engine, then, because the person who put them up here obviously wouldn’t want that”.
But I was also curious to see whether, you know… I could download them? Just curious! Because anything that’s available for a human eye to see can be downloaded, of course.
So I searched a bit, and I found at least a dozen web sites that had some variation on this theme: You go to the “protected” PDF, then open the Developer Console (i.e., F12), and then you paste in three screenfuls’ worth of JavaScript, and then you have the PDF.
Or rather, what you have is a file like this:
But that’s fine… easy enough to extract from that. And this isn’t the “original PDF” — it’s a series of images as rendered by the web browser, so there’s some quality loss, but whatever.
Except that it didn’t work if you had a long PDF, because the JS in question built one ginormous string containing all the image data, and JavaScript engines have a maximum string length.
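To make that limit a bit more concrete, here is a minimal sketch (not the code from any of those sites). It assumes each page arrives as its own PNG data URL, the way a canvas's toDataURL() would produce it, and decodes the pages one at a time so that no single string has to hold the whole document. The helper name dataUrlToBuffer and the fake page data are made up for illustration.

```javascript
const { constants } = require('node:buffer');

// The longest string this Node build allows; the exact value depends on the engine.
console.log('max string length:', constants.MAX_STRING_LENGTH);

// Hypothetical helper: decode one "data:image/png;base64,..." URL into bytes.
function dataUrlToBuffer(dataUrl) {
  const marker = ';base64,';
  const idx = dataUrl.indexOf(marker);
  if (!dataUrl.startsWith('data:') || idx === -1) {
    throw new Error('not a base64 data URL');
  }
  return Buffer.from(dataUrl.slice(idx + marker.length), 'base64');
}

// Stand-in for a rendered page: just the 8-byte PNG signature, not a real image.
const pngSignature = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
const fakePage = 'data:image/png;base64,' + pngSignature.toString('base64');

// Decode pages one by one instead of joining them into one huge string.
const pages = [fakePage, fakePage].map(dataUrlToBuffer);
console.log('pages decoded:', pages.length);
```

Each page stays a small, separate chunk of data this way, so the length of the PDF no longer matters.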
And besides, I thought it was yucky that you have to be all manual and stuff. So… I reached for Selenium!
I’ve used Selenium before, and I’ve used the Python interface to it. So I tried that on this laptop, and it crapped out immediately with a bunch of incomprehensible error messages. I tried searching for what they meant, and got a gazillion different answers depending on whether it was for version 103.0.001 or 103.1.004. As is usual when I encounter a Python library, I just gave up.
(What is it with Python library maintainers, anyway? Why is there zero interface stability? Is the culture in Python quarters to always be on the edge, bleeding?)
Instead I said:
npm install selenium-webdriver
And started typing away in the Node version of Selenium instead. I haven’t used Node in, what, a decade? Probably something like that, but after five minutes of scratching my head, I had it popping up a Firefox window and I was away.
The result is now on Microsoft GitHub. It uses basically the same technique as the scripts described above, but is more, er, automated.
The interface is:
node drivedown.js URL
This saves the PDF as a directory of PNG files like ~/Download/drivedown/page-001.png etc, so that you can post-process that into the format of your choice, like PDF or CBR/CBZ.
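The naming scheme can be sketched roughly like this. This is not lifted from drivedown.js: pagePath and savePage are hypothetical names, the three-digit zero padding is inferred from the page-001.png example, and the demo writes a few placeholder bytes to a temporary directory instead of real page images.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: build "page-001.png"-style paths so that the files
// sort in page order in a plain directory listing.
function pagePath(dir, pageNumber) {
  const name = `page-${String(pageNumber).padStart(3, '0')}.png`;
  return path.join(dir, name);
}

// Hypothetical helper: write one page's bytes, creating the directory if needed.
function savePage(dir, pageNumber, pngBytes) {
  fs.mkdirSync(dir, { recursive: true });
  const file = pagePath(dir, pageNumber);
  fs.writeFileSync(file, pngBytes);
  return file;
}

// Demo in a temporary directory rather than the real download directory.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'drivedown-'));
const written = savePage(demoDir, 1, Buffer.from([0x89, 0x50, 0x4e, 0x47]));
console.log(path.basename(written));
```

Zero-padded names mean that whatever tool does the post-processing picks the pages up in the right order without any extra sorting logic.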
Fun detail: This doesn’t work in headless mode, or if the screen is sleeping, because Google Drive checks whether the browser is actually viewing the document before it deigns to render anything. So if you’re downloading a lot, expect to have Firefox pop up windows on your screen a lot.
There’s tons of legitimate use cases for this, like when, er, you have a protected PDF file, and you’ve, er, forgotten your Google password. Yeah! That’s the ticket!
Or you want to print the PDF out, because you want to have it on paper, and printing is disabled in “protected” PDFs, too.