.. _backup_recovery:

Recovery from backups
=====================

In case of emergency ... break glass.


Backend
-------

Don't rush, take your time. It will take about 5 days to sync 20 TB of
data, so it is not worth micro-optimizing tasks to save seconds. The
``rsync`` from storinator is expected to run at about 110 MB/s while the
disks can handle 130 MB/s. Our instance has a 5 Gbps link, so the
bottleneck is most likely the network path between the data centers.
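
If you want to see where the bottleneck actually is while a sync is
running, the standard ``sysstat`` tools on the temporary instance are
enough (assuming the package is installed)::

    # per-device disk throughput, refreshed every 5 seconds
    iostat -xm 5

    # per-interface network throughput
    sar -n DEV 5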

Prepare a new RAID array
........................

In case of a real disaster, you will probably do the recovery directly
on the real production instance. In case of a simulated disaster
(i.e. when testing the backups), spawn a new instance::

    $ git clone git@github.com:fedora-copr/ansible-fedora-copr.git
    # Follow the README.md steps for preparation
    $ ./run-playbook pb-backup-recovery-01.yml

Once the instance is spawned, see the instance details for its public
IPv4 address, and run a second playbook::

    # The comma is needed because we don't have the IP address in our inventory
    $ ansible-playbook ./pb-backup-recovery-02.yml -i 54.81.xxx.xx, -u fedora

SSH to the instance::

    $ ssh fedora@54.81.xxx.xx
    [fedora@ip-54-81-xxx-xx ~]$ sudo su -
    [root@ip-54-81-xxx-xx ~]#

Set a root password, just in case we need to log in via EC2 Serial
Console::

    echo $RANDOM | md5sum | head -c 12; echo;
    passwd

Save the password in Bitwarden under the ``Temporary backup
instances`` vault.

Create a GPT label and a single partition spanning each disk::

    for i in /dev/nvme[1-4]n1 ; do \
        (echo g ; echo n ; echo ; echo ; echo ; echo ; echo w ) \
        | sudo fdisk $i; done
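
To double-check that each disk now has exactly one partition before
building the array, ``lsblk`` should be enough::

    lsblk /dev/nvme[1-4]n1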

Create a new RAID array::

    mdadm --create /dev/md0 --level raid10 \
        --name copr-backend-data --raid-disks 4 /dev/nvme[1-4]n1p1

If the RAID was created successfully, the initial resync should already be running::

    cat /proc/mdstat

You can see the RAID details using::

    mdadm --detail /dev/md0

Format and mount::

    mkfs.ext4 /dev/md0 -L copr-repo
    tune2fs -m0 /dev/md0
    mkdir /mnt/data
    mount /dev/disk/by-label/copr-repo /mnt/data/
    chown copr:copr /mnt/data


Work around a kernel bug
........................

There is a kernel bug that causes I/O operations on the RAID to get
stuck. Until it is resolved, work around it with::

    echo frozen > /sys/block/md0/md/sync_action

After a week or so, when all the data are copied, run::

    echo idle > /sys/block/md0/md/sync_action

to allow the RAID to finally proceed with the initial sync.
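
You can check the current state of the array at any time by reading
the same file::

    cat /sys/block/md0/md/sync_action
    cat /proc/mdstat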



SSH key shenanigans
...................

The sync will take a couple of days, so we want to run it inside
``tmux``. It is more convenient to have the ``tmux`` session owned by
root, so start it before switching users::

    tmux

Switch to the ``copr`` user. This way we won't have to fix the user and
group ownership of our data once the ``rsync`` command finishes::

    su - copr

Generate a new SSH key for this temporary instance::

    ssh-keygen -t rsa

Append the contents of ``~/.ssh/id_rsa.pub`` to
``/home/copr/.ssh/authorized_keys`` on storinator. You can SSH from your
machine the same way you SSH to batcave::

    $ ssh frostyx@storinator01.rdu-cc.fedoraproject.org
    [frostyx@storinator01 ~][PROD]$ sudo su -
    [root@storinator01 frostyx][PROD]# su copr
    [copr@storinator01 frostyx][PROD]$ vim ~/.ssh/authorized_keys
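
Before starting the multi-day sync, it is worth a quick check that the
new key is accepted (run this as the ``copr`` user on the temporary
instance)::

    ssh copr@storinator01.rdu-cc.fedoraproject.org true && echo "SSH OK"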


Sync the data
.............

Run this command from our temporary instance, not from storinator. The
``until`` loop restarts ``rsync`` automatically if it fails, e.g. when
the connection drops::

    time until rsync -av -H --info=progress2 --rsh=ssh \
        --max-alloc=4G \
        copr@storinator01.rdu-cc.fedoraproject.org:/srv/nfs/copr-be/copr-be-copr-user/backup/.sync/var/lib/copr/public_html/ \
        /mnt/data; \
        do true; done
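
If your SSH connection to the temporary instance drops, the sync keeps
running inside the ``tmux`` session; log back in as root and reattach
with::

    tmux attach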


Attach the volumes to the real instance
.......................................

Unmount the filesystem and stop the array on the temporary instance::

    umount /mnt/data/
    mdadm --stop /dev/md0

Go through all ``copr-backend-backup-test-raid-10`` volumes in AWS EC2
and detach them from our temporary instance.
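
This can be done by clicking through the web console. If you prefer
the command line, something along these lines should work, assuming
the volumes carry a ``Name`` tag with that value (the volume ID below
is a placeholder)::

    aws ec2 describe-volumes \
        --filters Name=tag:Name,Values=copr-backend-backup-test-raid-10 \
        --query "Volumes[].VolumeId"
    aws ec2 detach-volume --volume-id vol-0123456789abcdef0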

From now on, we don't care about the temporary instance.

On ``copr-backend-dev`` or ``copr-backend-prod`` run::

    systemctl stop copr-backend.target

Unmount, disassemble the RAID, and detach the volumes from the
``copr-backend-dev`` or ``copr-backend-prod`` instance according to
https://docs.pagure.org/copr.copr/raid_on_backend.html#detaching-volume

Then attach all the ``copr-backend-backup-test-raid-10`` volumes to the
``copr-backend-dev`` or ``copr-backend-prod`` instance and assemble the
RAID according to
https://docs.pagure.org/copr.copr/raid_on_backend.html#attaching-volume


Fix permissions
...............

At this point, our data has the correct UID and GID, but the SELinux
contexts are wrong. Let's temporarily switch SELinux to permissive
mode::

    setenforce 0

Everything should work as expected now::

    systemctl start lighttpd.service copr-backend.target

Fix the SELinux contexts and switch back to enforcing mode::

    time copr-selinux-relabel
    setenforce 1


Final steps
...........

- Delete the ``copr-backend-backup-test-raid-10`` temporary instance
- Switch all the RAID disks from ``st1`` to ``sc1`` (see the sketch below)
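
The volume type change can be done from the EC2 console, or with the
CLI; a sketch (the volume ID is a placeholder)::

    aws ec2 modify-volume --volume-type sc1 --volume-id vol-0123456789abcdef0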


Frontend
--------

TODO


Keygen
------

TODO


DistGit
-------

We don't have any plan for DistGit recovery.