Frizeflink on retrobrewcomputers: "For January 2nd 2020 I will take [the site] offline for personal reasons and will not open it again."

So grab anything you need now.


@EtchedPixels Announcement anywhere?

Internet Archive's "Save Page Now" can be *very* trivially scripted: https://web.archive.org/save/<source-url>

curl, wget, HEAD / GET, etc.

The tricky part is the URL list to feed it.
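A minimal sketch of the scripting: the Save Page Now endpoint is https://web.archive.org/save/&lt;url&gt;, so prefixing each line of a URL list yields requests curl can submit. The two sample URLs here are illustrative placeholders, not a real crawl list.

```shell
# Build Save Page Now request URLs from a plain URL list.
# The sample URLs are hypothetical stand-ins.
printf '%s\n' \
    'http://retrobrewcomputers.org/doku.php?id=start' \
    'http://retrobrewcomputers.org/doku.php?id=boards' \
    > urls.txt

sed 's|^|https://web.archive.org/save/|' urls.txt > save-urls.txt
cat save-urls.txt

# Actual submission (commented out to keep the sketch offline);
# the sleep keeps the request rate polite:
# xargs -n1 -I{} sh -c 'curl -s -o /dev/null "{}"; sleep 5' < save-urls.txt
```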

@EtchedPixels Looks as if there are a few archives.

It's *specifically* excluded by the site's robots.txt. If you've any pull with the site maintainer, asking about lifting that would be Very Useful.

@dredmorbius I'll at least ask the question. I've got a mirror of it via wget -r, or I can mirror the mirrors?

@EtchedPixels If it's online, IA can mirror it.

And if the mirror's robots.txt doesn't disallow it, they'll also publish the mirror.

(IA largely ignore robots for archival, though it _may_ prevent _publishing_ of the archive.)

So: publish your mirror, send _those_ URLs to IA's WBM, and the archive is available.
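One way to get that URL list from a wget -r mirror is to walk the mirror tree and map each file path back onto the published mirror's base URL. A sketch, where the tiny tree stands in for a real mirror and https://example.org/mirror is a placeholder for wherever the mirror is republished:

```shell
# Hypothetical stand-in for a `wget -r` mirror tree:
mkdir -p mirror/retrobrewcomputers.org/lib
touch mirror/retrobrewcomputers.org/start.html
touch mirror/retrobrewcomputers.org/lib/boards.html

# Map local paths to the (placeholder) published mirror's URLs,
# producing a list ready to feed to Save Page Now:
find mirror/retrobrewcomputers.org -type f \
    | sed 's|^mirror/retrobrewcomputers.org|https://example.org/mirror|' \
    | sort > mirror-urls.txt
cat mirror-urls.txt
```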

@dredmorbius I will take a look - much of it is document scans, so I should probably read up on how to submit documents properly instead, so they get archived that way?

@dredmorbius I was thinking that if the visual layout reveals the structure, then you can fuzzy-match blocks of the accurate text you have against the OCR to work out what went where on the page, and thus the structure?

@EtchedPixels I am humbled by your estimation of my coding-fu ;-)

That's well beyond my abilities, I think, and probably kinda hard.

@EtchedPixels For *that*, what you may want to do is set up a "collection" on the main Archive site (not WBM -- Wayback Machine), so that you can post the original scanned docs for access.

I know that is A Thing, though I've not mucked with it. Should be mostly painless.
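For the scanned documents, a hedged sketch using the Internet Archive's official `ia` command-line tool (from the `internetarchive` Python package, configured once with `ia configure`); the item identifier and filename below are hypothetical, and `mediatype:texts` is what tells IA to treat the upload as a scanned text:

```shell
# Sketch of an `ia` upload command. Identifier and filename are
# hypothetical; the command is printed rather than run, to keep
# this sketch offline and credential-free.
item="retrobrew-sbc-manual"
cmd="ia upload $item manual.pdf --metadata=mediatype:texts"
echo "$cmd"
```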

"This URL has been excluded from the Wayback Machine."
