Personal data storage

These are my data storage notes, primarily targeting personal data backups: regular files (documents, photo and music collections, not databases) of moderate volume that are added or edited rarely, with backups managed manually.

General approach

The "3-2-1 rule" for backups suggests keeping at least 3 copies of data, on at least 2 different storage devices, with at least one copy off-site.
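The rule is simple enough to check mechanically. Below is a minimal sketch that does so for a made-up inventory; the copy names, device names, and the "onsite"/"offsite" labels are all hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Sketch: check a hypothetical inventory of copies against the 3-2-1 rule.
# Each line is: <copy name> <device> <onsite|offsite>
inventory='laptop   internal-ssd  onsite
backup1  usb-hdd       onsite
backup2  usb-flash     offsite'

check_321() {
    # Reads an inventory on stdin; prints "ok" or what is missing,
    # and returns non-zero if the rule is not satisfied.
    awk '
        NF >= 3 { copies++; devices[$2] = 1; if ($3 == "offsite") offsite++ }
        END {
            n = 0
            for (d in devices) n++
            if (copies < 3)  { print "need at least 3 copies"; bad = 1 }
            if (n < 2)       { print "need at least 2 devices"; bad = 1 }
            if (offsite < 1) { print "need an off-site copy"; bad = 1 }
            if (!bad) print "ok"
            exit bad
        }'
}

printf '%s\n' "$inventory" | check_321
```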

The regular infosec CIA triad (confidentiality, integrity, availability) is desirable and fairly straightforward to apply. We'll need encryption, so that lost or decommissioned drives won't leak personal data (i.e., crypto-shredding can be employed); integrity checking, so that we'll either read back the data that was written or detect data corruption; varied and common technologies (hardware interfaces, drivers, filesystems, file formats), so that there will be a good chance that at least some of the backups can be accessed with reasonable effort in different situations in the future.
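To illustrate the integrity-checking part at the file level (the setups below handle it at the block or filesystem level instead), here is a minimal sketch using a SHA-256 manifest. A temporary directory stands in for a real backup tree, and the file names and the SHA256SUMS manifest name are assumptions made for the example.

```shell
#!/bin/sh
# Sketch: detect silent corruption with a checksum manifest.
backup_dir=$(mktemp -d)
printf 'hello\n' > "$backup_dir/notes.txt"
printf 'world\n' > "$backup_dir/photo.raw"

# Record checksums with relative paths, so that the manifest stays valid
# wherever the tree ends up being mounted.
(cd "$backup_dir" && find . -type f ! -name SHA256SUMS -exec sha256sum {} + > SHA256SUMS)

# Succeeds only if every listed file still matches its recorded checksum.
verify() { (cd "$1" && sha256sum -c SHA256SUMS >/dev/null 2>&1); }

if verify "$backup_dir"; then intact=yes; else intact=no; fi

# Simulate corruption of one file and verify again.
printf 'corrupted\n' > "$backup_dir/photo.raw"
if verify "$backup_dir"; then detected=no; else detected=yes; fi

echo "intact before change: $intact; corruption detected: $detected"
rm -rf "$backup_dir"
```

This only detects corruption; recovering from it still requires another good copy.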

The technologies covered here are usable for both backups and working storage. I prefer to use more general tools, since they tend to be better maintained, and learning them is usually a more useful time investment than learning specialized backup systems (but for those, see Bacula and Borg).

Hardware

Reliable computer hardware is desirable to minimize errors and hardware failures: a UPS, ECC memory, and quality hardware (including storage) in general. External HDDs are cheap and handy for local backups, while USB flash drives seem more suitable for off-site backups (though less suitable for backups in general), being more robust for physical transfer. Both HDDs and flash drives provide interfaces different from those of the primary internal drives, and are easy to transfer, to plug into different machines, and to keep unplugged.

Backup operating system

I find it useful (for peace of mind, at least) to set up a bootable operating system on at least one of the backup drives, with all the software necessary to read the backups. So there's usually an EFI system partition (ESP), an unencrypted partition for /boot (GRUB2 can handle encrypted ones, but it won't make much difference), an encrypted partition for the rest of the system (to prevent possible data leaks via cache, for instance, after backups are accessed from it), and a separate encrypted partition for the backup itself.

When installing a system using an installer, on a machine with more than one disk and some existing systems present, the installer will often use a seemingly random ESP on one of the internal disks, instead of the one on the backup drive. Fixing it may involve booting via the GRUB shell after GRUB fails to find or access its config on the /boot partition, remounting /boot/efi/ (and fixing it in /etc/fstab) to point to the correct drive's ESP, and then running grub-install to install GRUB there. It may also involve removing undesirable directories from the ESP manually and adjusting boot entries with efibootmgr. Alternatively, one can opt for a more involved manual installation, setting it up properly from the start: see, for instance, "Installing Debian GNU/Linux from a Unix/Linux System" and "Full disk encryption, including /boot: Unlocking LUKS devices from GRUB".
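A rough sketch of the repair steps, to be run from the installed system once it is booted. It is not necessarily the exact sequence I used: /dev/sdX1 is a placeholder for the backup drive's ESP, the x86_64-efi target is an assumption about the machine, update-grub assumes Debian, and the run helper is made up here so that commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: point /boot/efi at the intended ESP and reinstall GRUB there.
DRY_RUN=${DRY_RUN:-1}
run() {
    # Hypothetical helper: print the command in dry-run mode, else run it.
    if [ "$DRY_RUN" = 1 ]; then printf '+ %s\n' "$*"; else "$@"; fi
}

esp=/dev/sdX1                       # placeholder: ESP on the backup drive

run umount /boot/efi                # drop the ESP picked by the installer
run mount "$esp" /boot/efi          # mount the intended one
run blkid -s UUID -o value "$esp"   # UUID to put into /etc/fstab by hand
# --removable installs GRUB at the fallback path EFI/BOOT/BOOTX64.EFI,
# which firmware looks for on removable drives.
run grub-install --target=x86_64-efi --efi-directory=/boot/efi --removable
run update-grub                     # regenerate /boot/grub/grub.cfg
run efibootmgr -v                   # review the firmware boot entries
```

Stale entries listed by efibootmgr can then be deleted with efibootmgr -b NNNN -B, where NNNN is the entry number.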

Setups

I do partitioning with fdisk, mostly because other common tools (or at least their fancy user interfaces) tend to be buggy, and/or to hide technical information, neither of which is desirable when partitioning storage devices. fdisk is nice, commonly available, and works well.

RAID 1 is nice to set up if there are spare disks, but it is usually not as critical for redundant personal backups as it is, for instance, for a production server.

As of 2021 and for Linux-based systems, some of the common software options are:

Below are notes and command cheatsheets for the setups I use.

LUKS and ext4

This is probably the most basic and widely supported setup for Linux-based systems. Only authenticated integrity checks are supported by cryptsetup (and those are experimental), so no CRC and no recovery from minor errors without RAID, apparently. CRC won't be useful for repairs on top of an encrypted partition either. Perhaps dm-integrity can be set up separately to use CRC32C, but that would complicate the setup. Or integrity checking can be skipped altogether, since it is experimental, and the initial wipe of the device that it requires can slow down the setup quite a bit (while skipping the wipe easily leads to read errors).

Initial setup:

cryptsetup luksFormat --type luks2 --integrity hmac-sha256 /dev/sdXY
# alternatively: cryptsetup luksFormat /dev/sdXY
cryptsetup open /dev/sdXY backup2
mkfs.ext4 /dev/mapper/backup2
cryptsetup close backup2
mkdir /var/lib/backup2

A typical session:

cryptsetup open /dev/sdXY backup2
mount -t ext4 /dev/mapper/backup2 /var/lib/backup2/
# synchronize backups
umount /var/lib/backup2/
cryptsetup close backup2

When done, in order to safely eject a device, run eject /dev/sdX, or possibly udisksctl power-off -b /dev/sdX.

For RAID with mdadm, see "dm-crypt + dm-integrity + dm-raid = awesome!".

ZFS

ZFS is not modular like LUKS and friends, there are license compatibility issues, and it's generally rather unusual, but it is apparently a good filesystem with all the features needed here.

Initial setup:

# Install zfsutils-linux
apt install zfsutils-linux
# Find a partition ID
ls -l /dev/disk/by-id/ | grep sda4
# Use that ID to create a single-device pool. The "mirror" keyword
# should be added to set RAID 1.
zpool create tank usb-WD_Elements_...-part4
# Create an encrypted file system.
mkdir /var/lib/backup/
zfs create -o encryption=on -o keyformat=passphrase -o mountpoint=/var/lib/backup tank/backup

ZFS comes with its own mounting and unmounting commands, and if it's to be used from different systems, the pools should be exported and imported (or just force-imported). A typical session, assuming that it's used from different systems:

# List pools available for import
zpool import
# Import the pool
zpool import tank
# Mount an encrypted file system
zfs mount -l tank/backup
# (Synchronize backups here)
# Unmount the file system (or it'll happen on export)
zfs unmount tank/backup
# Unmount the pool too (also unnecessary to do manually though)
zfs unmount tank
# Export the pool
zpool export tank
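ZFS can also verify all stored checksums on demand (a scrub). A minimal sketch, assuming the pool is imported and that the installed version has "zpool wait" (OpenZFS 2.0 or newer); the scrub_pool function name is made up for this example:

```shell
#!/bin/sh
# Sketch: verify every block's checksum in an imported pool.
scrub_pool() {
    pool=$1
    zpool scrub "$pool" || return 1
    # Block until the scrub finishes.
    zpool wait -t scrub "$pool" || return 1
    # -v also lists files affected by unrecoverable errors, if any.
    zpool status -v "$pool"
}
# Usage, while the pool from the session above is imported:
#   scrub_pool tank
```

On a single-device pool a scrub can detect corrupted data but has no second copy to repair it from; with a mirror it can.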

Other useful tools

S.M.A.R.T. monitoring and testing can be done with smartmontools, and is usually supported even by external and older USB drives.
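A minimal sketch of a routine check with smartctl (from smartmontools); the smart_check function name is made up here, and /dev/sdX is a placeholder:

```shell
#!/bin/sh
# Sketch: health summary plus a short self-test for one drive.
smart_check() {
    dev=$1
    # Overall health verdict and the attribute table.
    smartctl -H -A "$dev"
    # Start a short self-test; it runs inside the drive, in the background.
    smartctl -t short "$dev"
}
# Usage (as root):
#   smart_check /dev/sdX
# A few minutes later, read the self-test log:
#   smartctl -l selftest /dev/sdX
```

Some USB-to-SATA bridges are not detected automatically; adding -d sat to the smartctl calls may help in that case.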

I normally use just rsync --archive for the initial backup, then rsync --exclude='lost+found' --archive --verbose --checksum --dry-run --delete to compare backups and for data scrubbing, and without --dry-run afterwards, if everything looks fine.

For data erasure, dd is handy for wiping both disks and partitions (before decommissioning drives, or if there were unencrypted partitions before), e.g.:

dd status=progress if=/dev/urandom of=/dev/sdX bs=1M
dd status=progress if=/dev/urandom of=/dev/sdXY bs=1M

Public data backups

Public data may be useful to back up as well: its regular sources may be censored or blocked by a government, or simply become unavailable because of a technical issue (along with the rest of the Internet, if the issue is near the user). In that case, the focus should be on high availability, probably along with integrity, while confidentiality hardly matters (unless the data itself is outlawed). I think even unencrypted NTFS is good enough for this, and it is easily readable from any common system.

As for the data to back up (and later read) this way, Kiwix is a nice project. Its primary viewer may seem awkward for use in normal circumstances, but apparently it aims to be useful to the general public and in bad circumstances: it provides archives as packages, while the viewer (with versions for every common OS) can also serve those to others in a local network via a web browser. library.kiwix.org provides, among others, indexed archives of Project Gutenberg (about 60,000 public domain books), Wikipedia, Wikibooks, Wikiversity, Wiktionary, Wikisource, ready.gov, WikiHow, various StackExchange projects, and many smaller bits like ArchWiki, RationalWiki, Explain XKCD (contains the comics). As of 2022, those would take just 200 to 300 GB, even with images and some non-English versions added.

Other large and legal archives to consider for backing up: Wikimedia Downloads, Complete OSM Data, maybe Debian archive mirroring and other software archives, arXiv and other Open Access sources. If one gets into tape storage, Common Crawl can be considered too. And then there are copyright-infringing but much larger libraries like Library Genesis (for a trimmed down, txt-only version, see offline-os), as well as music and movies (particularly long TV series may be good for hoarding; out of nice sci-fi ones, there are Doctor Who, Star Trek, Red Dwarf, Farscape, Lexx, Firefly, Defiance, Battlestar Galactica, Babylon 5, The X-Files, First Wave; plenty more can be found on Wikipedia).

OpenStax provides good and freely available textbooks under the CC BY license, available for download in PDF.

YouTube videos may be useful to hoard as well: there are many nice ones, including educational channels, and platforms like that seem to be getting blocked quickly when a government tries to block information flows (see censorship of YouTube). At 480p most videos would be watchable and not take much space (perhaps 2 to 5 MB per minute), and one can download them with youtube-dl, e.g.:

youtube-dl --download-archive archive.txt -f 'bestvideo[height<=480]+bestaudio/best[height<=480]' 'https://www.youtube.com/c/3blue1brown/videos'

See also: some tricks to avoid throttling. I've collected some video links, including interesting YouTube channels. I think it's best to go after relatively information-dense ones (lectures, online lessons) first, possibly followed by entertainment-education, pop-sci, and documentaries.
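For planning disk space, the 2 to 5 MB per minute estimate above can be turned into a rough total. The 100-hour figure below is an arbitrary example, not a measurement of any particular channel.

```shell
#!/bin/sh
# Sketch: rough disk usage for 480p video at 2 to 5 MB per minute.
hours=100
minutes=$((hours * 60))
low_gb=$((minutes * 2 / 1000))
high_gb=$((minutes * 5 / 1000))
echo "${hours} h of video: about ${low_gb} to ${high_gb} GB"
```

So 100 hours of lectures should fit into roughly 12 to 30 GB.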

Remote backups

When backing up data to a remote (and usually less trusted) machine, it should be encrypted and verified client-side (so options like plain rsync over SSH are not suitable), but preferably still allowing for incremental backups (so tar and gpg are not suitable, either). One can still employ LUKS or ZFS though, by accessing remote block devices via iSCSI (in particular, tgt and open-iscsi seem to work smoothly on Debian), NBD, or similar protocols. To add transport encryption and more secure authentication, those can run on top of IPsec or WireGuard (though as of 2024, those are blocked in Russia between local and foreign servers), tunnels made with SSH port forwarding, TLS (e.g., with stunnel), or anything else establishing a secure channel.

A test iSCSI setup example:

# server (192.168.1.2)
apt install tgt
# create a 128 MiB backing file for testing
dd if=/dev/zero of=/tmp/iscsi.disk bs=1M count=128
# define a target (the name follows the iqn.YYYY-MM.reversed-domain:id format) and list targets
tgtadm --lld iscsi --op new --mode target --tid 1 --targetname iqn.2024-07.com.example:tmp-iscsi.disk
tgtadm --lld iscsi --op show --mode target
# attach the backing file to the target as LUN 1
tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /tmp/iscsi.disk
# create an account and list accounts (it is not bound to the target here)
tgtadm --lld iscsi --op new --mode account --user foo --password bar
tgtadm --lld iscsi --op show --mode account
# allow access by initiator address and name, undo that, then allow by address only
tgtadm --lld iscsi --op bind --mode target --tid 1 --initiator-address 192.168.1.3 --initiator-name foo
tgtadm --lld iscsi --op unbind --mode target --tid 1 --initiator-address 192.168.1.3 --initiator-name foo
tgtadm --lld iscsi --op bind --mode target --tid 1 --initiator-address 192.168.1.3

# client (192.168.1.3)
apt install open-iscsi lsscsi
# discover targets, log in, and inspect the session
iscsiadm --mode discovery --type sendtargets --portal 192.168.1.2
iscsiadm --mode node --targetname iqn.2024-07.com.example:tmp-iscsi.disk --portal 192.168.1.2 --login
iscsiadm --mode session --print=1
lsscsi
# a block device is available at this point
# log out when done
iscsiadm --mode node --targetname iqn.2024-07.com.example:tmp-iscsi.disk --portal 192.168.1.2 --logout
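The test setup above sends iSCSI traffic in the clear. As one way to get the secure channel mentioned earlier, here is a sketch of logging in through an SSH tunnel from the client. I have not tested this exact sequence: the tunnel_login function name and the "user" account are made up, the target name is passed in as an argument, and it assumes that nothing else on the client listens on port 3260.

```shell
#!/bin/sh
# Sketch, client side: reach the iSCSI target through an SSH tunnel.
tunnel_login() {
    server=$1
    target=$2
    # Forward local port 3260 to the iSCSI port on the server's loopback.
    ssh -f -N -o ExitOnForwardFailure=yes \
        -L 3260:127.0.0.1:3260 "user@$server" || return 1
    # Create the node record by hand, pointing at the tunnel's local end,
    # instead of relying on discovery.
    iscsiadm --mode node --targetname "$target" \
        --portal 127.0.0.1:3260 --op new || return 1
    iscsiadm --mode node --targetname "$target" \
        --portal 127.0.0.1:3260 --login
}
# Usage, with the target name defined on the server:
#   tunnel_login 192.168.1.2 <target name>
# On the server, connections now arrive from sshd on the same host, so the
# target has to be bound to that address:
#   tgtadm --lld iscsi --op bind --mode target --tid 1 --initiator-address 127.0.0.1
```

Once logged in, the block device can be set up with LUKS or ZFS exactly like a local one, so that the server only ever stores encrypted data.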