These are my data storage notes, targeting primarily personal data backups: regular files (documents, photo and music collections, not databases), moderate volume, added or edited rarely, backups are managed manually.
The "3-2-1 rule" for backups suggests keeping at least 3 copies of data, on at least 2 different storage devices, with at least one copy off-site.
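The rule is simple enough to check mechanically. Below is a minimal sketch, assuming a made-up inventory format (one copy per line: a device label followed by "onsite" or "offsite", with the working copy counted as one of the copies); the function name check_321 is hypothetical as well.

```shell
# check_321: read "DEVICE LOCATION" lines from stdin (hypothetical format),
# print a summary, and succeed only if the 3-2-1 rule holds.
check_321() {
    awk '
        NF >= 2 { copies++; devices[$1] = 1; if ($2 == "offsite") offsite++ }
        END {
            n = 0; for (d in devices) n++
            printf "copies=%d devices=%d offsite=%d\n", copies, n, offsite
            exit !(copies >= 3 && n >= 2 && offsite >= 1)
        }'
}

printf '%s\n' 'internal-hdd onsite' 'external-hdd onsite' 'usb-flash offsite' \
    | check_321
```

Three copies on the same device, or three devices in the same building, would make the function return a non-zero status.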
The exact requirements and methods to achieve those may depend on one's threat model: in addition to device failures, bit rot, and unauthorized access by scrapers, one may have to consider fire or flooding, burglaries and robberies, book burning campaigns and censorship with isolation, hardware seizures and imprisonment without ability to maintain the remaining backups for years, inability--or a limited ability--to acquire replacement storage devices, and even uncommon and hypothetical scenarios, such as a global high energy EMP.
Considering the information security "CIA" triad (confidentiality, integrity, availability), we need encryption, so that lost or decommissioned drives will not leak personal data (i.e., crypto-shredding can be employed); integrity checking, so that we will either read back the data that was written or detect data corruption (and preferably even repair it); varied and common technologies (hardware interfaces, drivers, filesystems, file formats), so that there will be a good chance that at least some of the backups can be accessed with reasonable effort in different situations in the future.
Most of the technologies covered here are usable for both backups and working storage. I prefer to use more general tools, since they tend to be better maintained, and learning them usually is a more useful time investment than learning specialized backup systems (but for those, see Bacula, Borg, restic, DAR). Some of the specialized systems are quite similar to actual file systems (e.g., Borg is), while apparently often lacking error correction codes and redundancy within a single repository, though they may still be suitable for the task. Fortunately, variety is preferable in this case, and one can combine those. See also: Debian Reference Manual - 10. Backup and recovery, BackupAndRecovery - Debian Wiki, Synchronization and backup programs - ArchWiki.
As for portability, judging by experimentation in 2024, Android (as on Google Pixel phones) and Windows only support single (Ex)FAT partitions on USB drives, and probably only with MBR or without a partition table; no LUKS or filesystems such as Btrfs and ext4. So I had to give up on compatibility with those.
Reliable computer hardware is desirable to minimize errors and hardware failures: a UPS, ECC memory, and quality hardware (including storage) in general.
External HDDs (or combinations of internal ones and external boxes) are inexpensive and handy for local backups: they can be kept safely disconnected most of the time, and easily plugged into virtually any computer when needed.
USB flash drives seem more suitable for off-site backups, being more robust for physical transfer. Apparently flash memory is not suited for long-term storage without power though, so it is suggested to have them powered up for at least a few hours per year, letting the controllers do maintenance. Writing onto cheap Kingston USB thumb drives (e.g., 256 GB DT Exodia) can be very slow, especially once about 2/3 of the space is used and with ext4 on top of LUKS: writing at about 200 KB/s (less than 1 GB per hour). Even if you are not in a hurry, it makes one wonder whether the device is malfunctioning, so perhaps it is better not to neglect the write speed completely, even for backup storage devices.
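To put that write speed into perspective, here is a small arithmetic sketch (the function names are made up for illustration, and decimal units are assumed: 1 GB = 1,000,000 KB). At a sustained 200 KB/s, filling a 256 GB drive would take about two weeks.

```shell
# mb_per_hour RATE_KBS: megabytes written per hour at a given rate in KB/s.
mb_per_hour() {
    echo $(( $1 * 3600 / 1000 ))
}

# fill_hours CAPACITY_GB RATE_KBS: whole hours needed to write the full capacity.
fill_hours() {
    echo $(( $1 * 1000000 / $2 / 3600 ))
}

mb_per_hour 200      # 720
fill_hours 256 200   # 355 (about 15 days)
```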
Optical discs (CD, DVD, Blu-ray) are commonly suggested for archival, though they seem less convenient for updates and for usage in general, and it is not quite clear whether the recordable ("burned" with a laser and a dye, as opposed to being stamped at a factory) CDs and DVDs are that long-lasting.
Paper backups may be useful as well, and quite reliable, particularly for texts and images. Acid-free paper should be used for those, and one may play with bookbinding then. Some use QR codes and other two-dimensional barcodes to store arbitrary data on paper. As for hardware, one would need a printer and a scanner for those, though I should investigate that better.
One may also consider keeping backup storage devices and related items in a specialized storage shelf, a Faraday cage, or a fire-resistant and/or waterproof safe.
To go further than that, including storage of physical items, one may also look into general archival- and collection-related materials, such as the Preservation Self-Assessment Program.
I find it useful (for the peace of mind, at least) to set up a bootable operating system on at least one of the backup drives, with all the necessary software to read the backups. So there usually is an EFI system partition (ESP), an unencrypted partition for /boot (GRUB2 can handle encrypted ones, but it would not make much difference), an encrypted partition for the rest of the system (to prevent possible data leaks via cache, for instance, after backups are accessed from it), and a separate encrypted partition for the backup itself.
When installing a system using an installer, on a machine with more than one disk and some existing systems present, the installer would often use a seemingly random ESP on one of the internal disks, instead of the one on the backup drive. Fixing it may involve booting via the GRUB shell after GRUB fails to find or access its config from the /boot partition, remounting /boot/efi/ (and fixing it in /etc/fstab) to point to the correct drive's ESP, and then running grub-install to install it there. It may also involve removing undesirable directories from the ESP manually, and adjusting things with efibootmgr. Or one can opt for a more involved/manual installation, setting it up properly at once: see, for instance, "Installing Debian GNU/Linux from a Unix/Linux System" and "Full disk encryption, including /boot: Unlocking LUKS devices from GRUB".
Alternatively, or additionally, one may set a personalized live system image, as described in the Debian Live Manual and similar documents for other systems.
I do partitioning with fdisk, mostly because other common tools (or at least their fancy user interfaces) tend to be buggy, and/or to hide technical information, neither of which is desirable when partitioning storage devices. fdisk is nice, commonly available, and works well. With the setups described below, it works to set up LUKS or an encrypted filesystem directly on a block device, without any partitioning, but it may also be desirable to store some public data backups on a separate partition of the same storage device, unencrypted.
RAID 1 (or possibly 5, 6) is nice to set if there are spare disks, but usually not as critical for redundant personal backups as it is, for instance, for a production server.
As of 2021 and for Linux-based systems, some of the common software options are:
sha256sum (integrity)
Those can be combined, even the ones serving the same purpose: for instance, storing file checksums would not harm even if the underlying filesystem supports those already. Likewise, it should not harm to encrypt the more important files (cryptographic keys, passwords), even while storing those on encrypted disks.
Below are notes and command cheatsheets for the setups I use.
This is probably the most basic and widely supported setup for Linux-based systems. Only authenticated integrity checks are supported by cryptsetup (and those are experimental), so no CRC and no recovery from minor errors without RAID. Perhaps dm-integrity can be set separately to use CRC32C, but that would complicate the setup. Or it can be skipped altogether, since integrity checking is experimental, and wiping can slow down the process considerably (while skipping the wiping easily leads to errors).
Initial setup:
    # Optionally, add: --type luks2 --integrity hmac-sha256
    cryptsetup luksFormat /dev/sdXY
    cryptsetup open /dev/sdXY backup2
    mkfs.ext4 /dev/mapper/backup2
    cryptsetup close backup2
    mkdir /var/lib/backup2
A typical session (CLI-based, though this is also handled by graphical file managers, such as Thunar):
    cryptsetup open /dev/sdXY backup2
    mount -t ext4 /dev/mapper/backup2 /var/lib/backup2/
    # synchronize backups
    umount /var/lib/backup2/
    cryptsetup close backup2
When done, in order to safely eject a device, run eject /dev/sdX, or possibly udisksctl power-off -b /dev/sdX.
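The open, mount, synchronize, unmount, and close steps can also be wrapped into a single function, so that the container gets closed even when the synchronization fails. This is only a sketch: the function names, the DRY_RUN switch, and the rsync invocation are my assumptions, not a fixed part of the setup.

```shell
# run: execute a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# backup_session PARTITION NAME MOUNTPOINT SOURCE_DIR
# Opens the LUKS container, mounts it, synchronizes, and always tries to
# unmount and close it again, even if the synchronization fails.
backup_session() {
    local part name mnt src status
    part=$1; name=$2; mnt=$3; src=$4; status=0
    run cryptsetup open "$part" "$name" || return 1
    if run mount -t ext4 "/dev/mapper/$name" "$mnt"; then
        run rsync --archive --exclude=lost+found "$src/" "$mnt/" || status=$?
        run umount "$mnt" || status=$?
    else
        status=1
    fi
    run cryptsetup close "$name" || status=$?
    return "$status"
}

# Preview the commands without touching any device:
DRY_RUN=1 backup_session /dev/sdXY backup2 /var/lib/backup2 "$HOME/documents"
```

Without DRY_RUN=1 the same call runs the commands for real, which requires root privileges and an actual partition in place of /dev/sdXY.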
For RAID with mdadm, see "dm-crypt + dm-integrity + dm-raid = awesome!".
ZFS is not modular like LUKS and friends, there are license compatibility issues, and it is rather unusual overall, but apparently a good filesystem containing all the features needed here.
Initial setup:
    # Install zfsutils-linux
    apt install zfsutils-linux
    # Find a partition ID
    ls -l /dev/disk/by-id/ | grep sda4
    # Use that ID to create a single-device pool. The "mirror" keyword
    # should be added to set RAID 1.
    zpool create tank usb-WD_Elements_...-part4
    # Create an encrypted file system.
    mkdir /var/lib/backup/
    # For redundancy within a dataset, add to the command below: -o copies=2
    zfs create -o encryption=on -o keyformat=passphrase -o mountpoint=/var/lib/backup tank/backup
ZFS comes with its own mounting and unmounting commands, and if it is to be used from different systems, the pools should be exported and imported (or just force-imported). A typical session, assuming that it is used from different systems:
    # List pools available for import
    zpool import
    # Import the pool
    zpool import tank
    # Mount an encrypted file system
    zfs mount -l tank/backup
    # (Synchronize backups here)
    # Unmount the file system (or it will happen on export)
    zfs unmount tank/backup
    # Unmount the pool (also unnecessary to do manually though)
    zfs unmount tank
    # Export the pool
    zpool export tank
    # And eject or udisksctl power-off -b, as mentioned above
This one is set with the DUP profile for both metadata and data, adding redundancy, and with sha256 checksums (instead of the default crc32c), to reduce chances of collisions.
Initial setup:
    # LUKS, as with ext4
    cryptsetup luksFormat /dev/sdXY
    cryptsetup open /dev/sdXY backup
    # The file system
    mkfs.btrfs --csum sha256 -m dup -d dup -L backup /dev/mapper/backup
    cryptsetup close backup
    mkdir /mnt/backup
A session:
    cryptsetup open /dev/sdXY backup
    mount -t btrfs /dev/mapper/backup /mnt/backup/
    # synchronize backups here
    umount /mnt/backup/
    cryptsetup close backup
    eject /dev/sdX
    udisksctl power-off -b /dev/sdX
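With checksums and the DUP profile in place, a scrub is what reads everything back and repairs damaged blocks from the second copy. A hedged sketch, assuming btrfs-progs is installed and the file system is mounted at /mnt/backup as in the session; the function name scrub_backup is made up:

```shell
# scrub_backup MOUNTPOINT: verify all checksums on a mounted Btrfs file system
# and print the per-device error counters afterwards.
scrub_backup() {
    if ! command -v btrfs >/dev/null 2>&1; then
        echo "btrfs-progs is not installed" >&2
        return 127
    fi
    # -B keeps the scrub in the foreground, so that the counters below
    # are printed only once it has finished.
    btrfs scrub start -B "$1" || return $?
    btrfs device stats "$1"
}

# Usage (as root, between mount and umount):
# scrub_backup /mnt/backup
```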
As mentioned above, it is important to be able to detect errors with some integrity checks, but one may also aim for single-device redundancy, for a recovery using that single device (and a better overall chance of successful data recovery), as well as calculate checksums on top of a filesystem (e.g., for ext4, which does not checksum file data on its own).
For integrity checking with basic checksums, one can use find and sha256sum or similar tools:
    # Store checksums
    mkdir checksums
    find . -type f ! -path './checksums*' -exec sha256sum {} \; \
      > checksums/sha256
    # Check them
    sha256sum --quiet --check checksums/sha256
    # Add new ones
    find . -type f -newer checksums/sha256 ! -path './checksums*' \
      -exec sha256sum {} \; >> checksums/sha256
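To see what a detected error looks like, the same commands can be tried on throwaway data. This is an illustrative sketch: the function names are made up, GNU coreutils are assumed, and it uses `-exec ... {} +` (one sha256sum call for many files) instead of `\;`, which only changes the speed.

```shell
# store_checksums DIR: record SHA-256 sums of all regular files under DIR.
store_checksums() {
    ( cd "$1" && mkdir -p checksums &&
      find . -type f ! -path './checksums*' -exec sha256sum {} + \
          > checksums/sha256 )
}

# verify_checksums DIR: succeed only if every recorded file still matches.
verify_checksums() {
    ( cd "$1" && sha256sum --quiet --check checksums/sha256 )
}

# Demonstration on throwaway data.
demo=$(mktemp -d)
echo 'hello' > "$demo/a.txt"
echo 'world' > "$demo/b.txt"
store_checksums "$demo"
verify_checksums "$demo" && echo "intact"
# Simulate silent corruption: same size, different content.
echo 'jello' > "$demo/a.txt"
verify_checksums "$demo" 2>/dev/null || echo "corruption detected"
rm -rf "$demo"
```

The second check names the damaged file and returns a non-zero status, which is all that is needed to decide which copy to restore from another backup.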
For redundant error correction codes, with the ability to repair, one may employ par2 or dvdisaster (aimed at optical discs), though those may be quite inefficient to use for collections of files that are updated. There are projects like blockyarchive (blkar), but just like specialized backup systems, they tend to require specialized tools to access the files backed up with them at all. A software RAID (1, 5, or 6) set up on different partitions of the same device is a more time-efficient way to achieve some redundancy within a storage device, though it is less space-efficient, and protects against different bit rot patterns. ZFS's "copies" parameter and Btrfs's DUP profile (for both data and metadata) do something similar, storing multiple copies of blocks within a dataset.
S.M.A.R.T. monitoring and testing can be done with smartmontools, and is usually supported even by external and older USB drives.
I normally use just rsync --archive for the initial backup, then rsync --exclude='lost+found' --archive --verbose --checksum --dry-run --delete to compare backups and for data scrubbing, and the same without --dry-run afterwards, if everything looks fine.
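The comparison step can be wrapped into a function whose output is empty when nothing differs, which is handy for scripting. A sketch under my own assumptions: the function name backup_diff is made up, and --itemize-changes replaces --verbose here, so that only actual differences are printed.

```shell
# backup_diff SOURCE_DIR BACKUP_DIR: list what a real run would change,
# comparing file contents by checksum. Prints nothing when the backup
# matches the source.
backup_diff() {
    rsync --archive --checksum --dry-run --delete --itemize-changes \
        --exclude='lost+found' "$1/" "$2/"
}

# Usage: look at the differences before synchronizing for real.
# changes=$(backup_diff ~/documents /mnt/backup/documents)
# if [ -z "$changes" ]; then echo "backup is up to date"; else echo "$changes"; fi
```

Since --checksum makes rsync read every file on both sides, this also works as a crude scrub of both copies.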
For data erasure, dd is handy for wiping both disks and partitions (before decommissioning drives, or if there were unencrypted partitions before), e.g.:
    dd status=progress if=/dev/urandom of=/dev/sdX bs=1M
    dd status=progress if=/dev/urandom of=/dev/sdXY bs=1M
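Since a typo in the device name is unrecoverable here, a small guard around dd may be worth having. This is a sketch with a made-up function name; the mount check is a simple prefix match against /proc/mounts (Linux only), so it will not notice, for instance, an opened LUKS mapping on top of the device.

```shell
# wipe_device DEVICE: overwrite DEVICE with random data, but refuse anything
# that is not a block device or that appears to be mounted.
wipe_device() {
    local dev
    dev=$1
    if [ ! -b "$dev" ]; then
        echo "refusing: $dev is not a block device" >&2
        return 1
    fi
    if grep -q "^$dev" /proc/mounts; then
        echo "refusing: $dev (or a partition on it) is mounted" >&2
        return 1
    fi
    # dd stops with a "No space left on device" error once the device is
    # full; that is expected here.
    dd status=progress if=/dev/urandom of="$dev" bs=1M
}

# Usage (as root): wipe_device /dev/sdX
```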
GnuPG is there for individual file encryption, as well as for signing. In some cases it may be useful together with tar and gzip.
Public data may be useful to back up as well: its regular sources may be censored/blocked by a government, or simply become unavailable because of a technical issue (along with the rest of the Internet if the issue is near the user). In that case, the focus should be on high availability, probably along with integrity, while confidentiality hardly matters (unless the data is outlawed). I think even unencrypted NTFS is good enough for this, and easily readable from any common system.
As for the data to back up (and later read) this way, Kiwix (with its OpenZIM archives) is a nice project. Its primary viewer may seem awkward for use in normal circumstances, but apparently it aims to be useful to the general public and in bad circumstances: it provides archives as packages, while the viewer (with versions for every common OS) can also serve those to others in a local network via a web browser. library.kiwix.org provides, among others, indexed archives of Project Gutenberg (about 60,000 public domain books), Wikipedia, Wikibooks, Wikiversity, Wiktionary, Wikisource, ready.gov, WikiHow, various StackExchange projects, Khan Academy, and many smaller bits like ArchWiki, RationalWiki, Explain XKCD (contains the comics). As of 2022, those would take just 200 to 300 GB, even with images and some non-English versions added.
Other large and legal archives to consider for backing up: Wikimedia Downloads, Complete OSM Data, maybe Debian archive mirroring and other software archives, arXiv and other Open Access sources. If one gets into tape storage, Common Crawl can be considered too. And then there are copyright-infringing but much larger libraries like Library Genesis (blocked in Russia; a trimmed down, txt-only version used to be available at offlineos.com, but apparently not anymore), the-eye.eu books (blocked in Russia), Anna's Archive, Z-library, as well as music and movies (particularly long TV series may be good for hoarding; out of nice sci-fi ones, there are Doctor Who, Star Trek, Red Dwarf, Farscape, Lexx, Firefly, Defiance, Battlestar Galactica, Babylon 5, The X-Files, First Wave; plenty more can be found in Wikipedia).
OpenStax provides good and freely available textbooks under the CC BY license, available for download in PDF. See OpenStax GitHub repositories for their CNXML sources and related tools, though in 2024 I found it tricky to build HTML out of those, and then it still was not good enough for printing. LibreTexts is supposed to be similar, though the licensing information is unclear in some cases, some links lead to HTTP 404 errors, and some of the books are quite messy (attempting to embed YouTube videos into PDFs, having every other page filled with listings of undeclared licenses, or with "welcome" messages). While its subdomains (math, phys, etc) geo-block direct requests from Russia, the books are available without proxying via commons.libretexts.org. One can also search for libre book sources on platforms like GitHub, possibly querying for TeX sources: there are occasional seemingly decent and not well-known textbooks, like Introductory Physics: Building Models to Describe Our World.
YouTube videos may be useful to hoard as well: there are many nice ones, including educational channels, and platforms like that seem to be getting blocked quickly when a government tries to block information flows (see censorship of YouTube). At 480p most videos would be watchable and not take much space (perhaps 2 to 5 MB per minute), and one can download them with youtube-dl, e.g.: youtube-dl --download-archive archive.txt -f 'bestvideo[height<=480]+bestaudio/best[height<=480]' 'https://www.youtube.com/c/3blue1brown/videos' (see also: some tricks to avoid throttling). I have collected some video links, including interesting YouTube channels. I think it is best to go after relatively information-dense ones (lectures, online lessons) first, possibly followed by entertainment-education, pop-sci, and documentaries.
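For planning storage, the 2 to 5 MB per minute estimate translates into sizes as follows. This is a rough arithmetic sketch with a made-up function name and decimal units (1 GB = 1000 MB); the 100-hour figure is just an example.

```shell
# collection_gb HOURS MB_PER_MINUTE: approximate collection size in GB.
collection_gb() {
    echo $(( $1 * 60 * $2 / 1000 ))
}

collection_gb 100 2   # 12
collection_gb 100 5   # 30
```

So a hundred hours of lectures at 480p should fit into roughly 12 to 30 GB.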
When backing up private data to a remote (and usually less trusted) machine, it should be encrypted and verified client-side (so options like plain rsync over SSH are not suitable), but preferably still allowing for incremental backups (so tar and gpg are not suitable in general, either). One can still employ LUKS or ZFS though, by accessing remote block devices via iSCSI (in particular, tgt and open-iscsi seem to work smoothly on Debian), NBD, or similar protocols, possibly on top of IPsec or WireGuard (though as of 2024, those are blocked in Russia between local and foreign machines), tunnels made with SSH port forwarding, TLS (e.g., with stunnel), or anything else establishing a secure channel, to add encryption and a more secure authentication.
A test iSCSI setup example:
    # server (192.168.1.2)
    apt install tgt
    dd if=/dev/zero of=/tmp/iscsi.disk bs=1M count=128
    tgtadm --lld iscsi --op new --mode target --tid 1 --targetname iqn:2024-07:com.example:tmp-iscsi.disk
    tgtadm --lld iscsi --op show --mode target
    tgtadm --lld iscsi --op new --mode logicalunit --tid 1 --lun 1 -b /tmp/iscsi.disk
    tgtadm --lld iscsi --op new --mode account --user foo --password bar
    tgtadm --lld iscsi --op show --mode account
    tgtadm --lld iscsi --op bind --mode target --tid 1 --initiator-address 192.168.1.3 --initiator-name foo
    tgtadm --lld iscsi --op unbind --mode target --tid 1 --initiator-address 192.168.1.3 --initiator-name foo
    tgtadm --lld iscsi --op bind --mode target --tid 1 --initiator-address 192.168.1.3
    # client (192.168.1.3)
    apt install open-iscsi lsscsi
    iscsiadm --mode discovery --type sendtargets --portal 192.168.1.2
    iscsiadm --mode node --targetname iqn:2024-07:com.example:tmp-iscsi.disk --portal 192.168.1.2 --login
    iscsiadm --mode session --print=1
    lsscsi
    # a block device is available at this point
    iscsiadm --mode node --targetname iqn:2024-07:com.example:tmp-iscsi.disk --portal 192.168.1.2 --logout
Apart from one's own (or rented) remote machines, such a setup can be used with "backup buddies", exchanging some of your local storage space for someone else's. Sneakernet-based backup buddies (that is, occasionally exchanging storage devices) are a fine and easier option for remote backup storage.
A popular option for remote backups is online services (aka "the cloud" and a few other names), with many people relying on those even in place of local backups, or any local storage (as with music and video streaming, hosted photo albums, password managers, book collections, general document storage), delegating all those worries to somebody else. It seems convenient, but decreases direct control over the data, and introduces dependencies on the service providers' continued existence and continued acceptable terms of service, on network connectivity to them, and on the ability to transfer payments. In my--possibly unrepresentative--experience, all those are unreliable, but it may still work as a redundant backup copy for some, particularly in predictable democratic countries, with a reputable service provider. Throw in the rule of law and sensible laws (or some kind of a hypothetical anarchist or communist utopia), and one may worry less about keeping some information private, as well as about aiming for long-term isolated backups of public information.
For less private data (perhaps for almost everything but cryptographic keys and passwords -- that is, explicit secrets), a good way to preserve it is by sharing with others: for instance, pictures from an event or gathering are commonly shared among all the participants, while creative works (particularly books and music) can be shared among people with similar interests or tastes. Everything work-related can be backed up on work machines. The data that is not private at all, like this very note, or other creative works of one's own under permissive licenses, is generally useful to publish, sharing it even more widely.