a little tool I built to fight linkrot and save our sources from the memory hole → https://sij.law/deepciter

been trying to archive all outlinks from macwright.com with #archivebox and results are decidedly mixed: tasks keep getting stuck in a 'pending' state with no feedback on whether anything is actually happening.
I've mirrored a relatively simple website (redsails.org; it's mostly text, with some images) for posterity via #wget. However, I also wanted to grab snapshots of any outlinks (of which there are many, as citations/references). I couldn't figure out a configuration where wget would do that out of the box without endlessly, recursively spidering the whole internet, so I ended up making a kind-of poor man's #ArchiveBox instead:
for i in $(cat others.txt) ; do
  # name each snapshot directory after the hash of its URL
  dirname=$(echo "$i" | sha256sum | cut -d' ' -f 1)
  mkdir -p "$dirname"
  wget --span-hosts --page-requisites --convert-links --backup-converted \
    --adjust-extension --tries=5 --warc-file="$dirname/$dirname" \
    --execute robots=off --wait 1 --waitretry=5 --timeout 60 \
    -o "$dirname/wget-$dirname.log" --directory-prefix="$dirname/" "$i"
done
Basically, there's a list of bookmarks^W URLs in others.txt that I grabbed from the initial mirror of the website with some #grep foo. I want to make as good a mirror/snapshot of each specific URL as I can, without spidering/mirroring endlessly all over the web. So I hash each URL and kick off a dedicated wget job for it that will span hosts, but only to make that specific URL as usable locally/offline as possible. I know from experience that this isn't perfect, but it'll be good enough for my purposes. I'm also stashing a WARC file; probably a bit overkill, but I figure it might be nice to have.
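The #grep foo itself was nothing fancy; something along these lines will pull the external hrefs out of a mirrored tree and into others.txt (treat the paths and patterns as a rough sketch rather than the exact command I ran):
grep -rhoE 'href="https?://[^"]+"' redsails.org/ \
  | sed -E 's/^href="//; s/"$//' \
  | grep -v '://redsails\.org' \
  | sort -u > others.txt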
of course, neofeudal lords are looking to #wikipedia and #internetArchive with arson in their hearts, as they always do with the great libraries
between this and the web continuing to enshittify with AI slop and a critical mass of advertising, it's probably time to start thinking about things in terms of offline-first
make local copies of resources that are important, get your personal content off of cloud providers, and archive everything you can
old phones, random flash drives, unused laptops - all of that can be put to good use as self-sovereign libraries. and if you have the financial means, seriously consider building or investing in a NAS
we have plenty of tools to make this possible:
kiwix is an offline reader for Wikipedia, Project Gutenberg, and several other online sources - there's even a method to turn a raspi into a hotspot that serves the archived content: https://kiwix.org/en/how-to-set-up-kiwix-hotspot/
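a minimal sketch of what serving a kiwix archive yourself looks like (the ZIM filename here is just an example; browse https://download.kiwix.org/zim/ for current builds):
wget https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_mini_2024-01.zim
kiwix-serve --port 8080 wikipedia_en_all_mini_2024-01.zim
# then point a browser at http://localhost:8080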
youtube-dl is a program that you can use to download content from youtube, including full channels: https://ytdl-org.github.io/youtube-dl/
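e.g. to mirror an entire channel into per-uploader folders and skip anything already grabbed (the channel URL is a placeholder):
youtube-dl \
  --download-archive downloaded.txt \
  -o '%(uploader)s/%(upload_date)s - %(title)s.%(ext)s' \
  'https://www.youtube.com/c/SomeChannel/videos'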
the Internet Archive also has a command line utility to bulk download content: https://archive.org/developers/internetarchive/cli.html
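a sketch of chaining search and bulk download with the ia tool (the collection name is just an example):
ia search 'collection:librivoxaudio' --itemlist > items.txt
ia download --itemlist=items.txt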
and take a look at #archiveBox - a self-hosted project that takes in urls and downloads relevant content while stripping out all of the extra shit you don't need: https://archivebox.io/
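basic usage is only a few commands; roughly, assuming a fresh directory and a urls.txt of links you care about:
mkdir archive && cd archive
archivebox init
archivebox add < urls.txt          # or: archivebox add 'https://example.com'
archivebox server 0.0.0.0:8000     # web UI for browsing the snapshots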
surviving and healthy bookmark-archiver tools from the last time i looked into replacing Pocket include LinkAce and ArchiveBox. after two minutes of browsing docs i don't see a clear winner, except on implementation-language vibes: ArchiveBox is Python, while LinkAce is PHP.