Recovery from backups¶
In case of emergency … break glass.
Backend¶
Don’t rush, take your time. It will take about 5 days to sync 20 TB of data, so it is not worth micro-optimizing tasks to save seconds. The rsync from storinator is expected to run at 110 MB/s while the disk can handle 130 MB/s. Our instance has 5 Gbps, so the bottleneck is probably the network between the data centers.
Prepare a new RAID array¶
In case of a real disaster, you will probably do the recovery from the real production instance. In case of a simulated disaster (i.e. testing the backups), spawn a new instance:
$ git clone git@github.com:fedora-copr/ansible-fedora-copr.git
# Follow the README.md steps for preparation
$ ./run-playbook pb-backup-recovery-01.yml
Once the instance is spawned, see the instance details for its public IPv4 address, and run a second playbook:
# The comma is needed because we don't have the IP address in our inventory
$ ansible-playbook ./pb-backup-recovery-02.yml -i 54.81.xxx.xx, -u fedora
SSH to the instance:
$ ssh fedora@54.81.xxx.xx
[fedora@ip-54-81-xxx-xx ~]$ sudo su -
[root@ip-54-81-xxx-xx ~]#
Set a root password, just in case we need to log in via EC2 Serial Console:
echo $RANDOM | md5sum | head -c 12; echo;
passwd
Save the password in Bitwarden under the Temporary backup instances vault.
Partition the disks:
for i in /dev/nvme[1-4]n1 ; do \
    (echo gpt ; echo n ; echo ; echo ; echo ; echo ; echo w ) \
    | sudo fdisk $i; done
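Optionally, double-check that each disk ended up with exactly one partition before building the array (a quick sanity check, not part of the original procedure):
lsblk /dev/nvme[1-4]n1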
Create a new RAID array:
mdadm --create /dev/md0 --level raid10 \
--name copr-backend-data --raid-disks 4 /dev/nvme[1-4]n1p1
If the RAID was successfully created, a check should be running by now:
cat /proc/mdstat
You can see the RAID details using:
mdadm --detail /dev/md0
Format and mount:
mkfs.ext4 /dev/md0 -L copr-repo
tune2fs -m0 /dev/md0
mkdir /mnt/data
mount /dev/disk/by-label/copr-repo /mnt/data/
chown copr:copr /mnt/data
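Optionally, verify that the filesystem is mounted with the expected size and label before copying anything onto it:
df -h /mnt/data
lsblk -o NAME,SIZE,FSTYPE,LABEL,MOUNTPOINT /dev/md0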
Workaround a kernel bug¶
There is a kernel bug causing I/O operations on the RAID to get stuck. Until it gets resolved, work around it by:
echo frozen > /sys/block/md0/md/sync_action
After a week or so, when all the data are copied, run:
echo idle > /sys/block/md0/md/sync_action
to allow the RAID to finally proceed with the initial sync.
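The current state can be checked at any time, e.g. to confirm the workaround is in place or that the sync has resumed:
cat /sys/block/md0/md/sync_action
cat /proc/mdstat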
SSH key shenanigans¶
The sync will take a couple of days, so we want to run it in tmux. It is more useful to have the tmux session owned by root, so run tmux before switching users:
tmux
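If the SSH connection drops later, reconnect to the instance, become root again, and reattach to the same session:
sudo su -
tmux attach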
Switch to the copr user. This way we won’t have to adjust the user and group of our data once the rsync command finishes:
su - copr
Generate a new SSH key for this temporary instance:
ssh-keygen -t rsa
Copy ~/.ssh/id_rsa.pub into /home/copr/.ssh/authorized_keys on storinator. You can SSH to it from your machine the same way you SSH to batcave:
$ ssh frostyx@storinator01.rdu-cc.fedoraproject.org
[frostyx@storinator01 ~][PROD]$ sudo su -
[root@storinator01 frostyx][PROD]# su copr
[copr@storinator01 frostyx][PROD]$ vim ~/.ssh/authorized_keys
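Back on the temporary instance, before kicking off a multi-day sync, it is worth confirming (as the copr user) that the key-based login works and accepting storinator’s host key while at it:
ssh copr@storinator01.rdu-cc.fedoraproject.org true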
Sync the data¶
Run this command from our temporary instance, not from storinator:
time until rsync -av -H --info=progress2 --rsh=ssh \
    --max-alloc=4G \
    copr@storinator01.rdu-cc.fedoraproject.org:/srv/nfs/copr-be/copr-be-copr-user/backup/.sync/var/lib/copr/public_html/ \
    /mnt/data; \
do true; done
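Optionally, watch the progress from another tmux window, e.g. by checking how much data has already landed on the new array:
watch -n 60 df -h /mnt/data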
Attach the volumes to the real instance¶
Unmount from the temporary instance:
umount /mnt/data/
mdadm --stop /dev/md0
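Before detaching anything in EC2, it does not hurt to confirm that the array is really stopped and nothing is mounted on /mnt/data anymore:
cat /proc/mdstat
findmnt /mnt/data   # should print nothing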
Go through all copr-backend-backup-test-raid-10 volumes in AWS EC2 and detach them from our temporary instance. From now on, we don’t care about the temporary instance.
On copr-backend-dev or copr-backend-prod, run:
systemctl stop copr-backend.target
Unmount, disassemble the RAID, and detach the volumes from the copr-backend-dev or copr-backend-prod instance according to https://docs.pagure.org/copr.copr/raid_on_backend.html#detaching-volume
Attach all the copr-backend-backup-test-raid-10 volumes to the copr-backend-dev or copr-backend-prod instance, and assemble the RAID according to https://docs.pagure.org/copr.copr/raid_on_backend.html#attaching-volume
Fix permissions¶
At this point, we have the correct UID and GID on our data but wrong SELinux attributes. Let’s temporarily switch SELinux to permissive mode:
setenforce 0
Everything should work as expected now:
systemctl start lighttpd.service copr-backend.target
Fix SELinux attributes:
time copr-selinux-relabel
setenforce 1
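To double-check that we are back in enforcing mode, and (if the audit tools are installed) that the relabel left no fresh denials:
getenforce
ausearch -m avc -ts recent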
Final steps¶
Delete the copr-backend-backup-test-raid-10 temporary instance.
Switch all the RAID disks from st1 to sc1.
Frontend¶
TODO
Keygen¶
TODO
DistGit¶
We don’t have any plan for DistGit recovery.