How to Upgrade Fedora Copr Persistent VMs (Amazon AWS)

This document describes the process of upgrading persistent VM instance(s) (e.g., copr-fe-dev.aws.fedoraproject.org) to a new Fedora version by creating a completely new VM to replace the old one.

Requirements

  • Access to the team’s Amazon AWS account and proper configuration of that account according to the README.md.

  • Permissions to run playbooks on batcave01.

  • Since we do not modify the public IPs (neither v4 nor v6), no DNS modifications should be required. However, familiarize yourself with the DNS SOP in case of any issues.

  • Make sure you have /usr/bin/aws installed and that you have fedora-copr section in ~/.aws/credentials

Pre-upgrade

The goal is to complete as much pre-upgrade work as possible while focusing on minimizing the outage window and only performing essential tasks that cannot be done post-upgrade.

Avoid conducting the pre-upgrade too far in advance of the actual upgrade. Ideally, perform this phase a couple of hours or a day before.

Announce the outage

See a specific document Fedora Copr outage announcements, namely the “planned” outage state.

Check the hot-fixes

The old set of instances (especially prod) has been running for quite some time, likely accumulating several hotfixes over that period. Research the applied hotfixes and determine which of them need to be manually implemented on the N+2 boxes (if any, note them).

First, check the hot-fixed issues and PRs. Then, check the file-system modifications:

# over ssh on the _old_ box, search for weird things (ignore config changes
# and /boot)
[root@copr-be-dev ~][STG]# rpm -Va | grep -v -e /etc/ -e /boot/
...
S.5....T.    /var/www/cgi-resalloc
...
S.5....T.    /usr/lib/python3.12/site-packages/copr_backend/pulp.py
...

E.g., the /var/www/cgi-resalloc file is a weird change, but that in particular is covered in playbooks. The pulp.py change is important to note though! You may consult the dnf diff copr-backend output, find the corresponding upstream PR on GitHub, and tag the PR with hot-fixed label (if not already done).

Preparation

Ensure you have the helper playbook repository cloned locally and navigate to the clone directory.

Review the dev.yml, prod.yml, and all.yml configurations in the ./group_vars directory. Pay particular attention to the data volume IDs as these MUST match the EC2 reality.

In the following moments, you will run several playbooks on your machine. During execution, explicitly specify two Ansible variables, copr_instance (set to either dev or prod) and server_id (set to either frontend, backend, distgit, or keygen). For example:

$ opts=( -e copr_instance=dev -e server_id=keygen )
$ ansible-playbook play-vm-migration-01-new-box.yml "${opts[@]}"

Identify the AMI (golden images) you want to use for the new VM instances. Typically, upgrade to Fedora N+2 (e.g., migrating infrastructure from Fedora 37 to Fedora 39). Visit the Cloud Base Images download page, locate the Launch on public cloud platforms section for x86_64-based instances, and click the button next to Fedora Cloud 41 AWS (ensure JavaScript is enabled for this page!). Note the ami-* ID in the US East (N. Virginia) region (for example ami-0746fc234df9c1ee0). Specify this ami-* ID in group_vars/all.yml, and ensure both group_vars/{dev,prod}.yml correctly reference it.

Double-check other machine parameters such as instance types, names, tags, IP addresses, root volume sizes, etc. Usually, the pre-filled defaults suffice, but verification is recommended.

Note

Use the ec2instances.info comparator to find the cheapest available instance type that meets our needs whenever more power is required.

Note

Don’t worry about old_instance_id and new_instance_id for now. We will change them after running the first set of playbooks

Warning

The group_vars/ directory serves as the primary source of truth for the Fedora Copr instances. Update the configuration in this directory whenever you ad-hoc modify some EC2 instance parameters in the future!

Key pair named Ansible Key must be used. This allows us to initially run the playbooks from batcave01 box against the newly spawned VM. The playbooks assure that, subsequently, Fedora Copr team members can SSH using their own keys, uploaded to FAS.

Backup the Current Let’s Encrypt Certificates

We will copy and paste the certificate files used on the old set of VMs onto the new VMs. These certificates will remain in use until automatically renewed by the certbot daemon. The process begins by copying the certificate files to the batcave01 through the execution of playbooks with the -t certbot option. For instance:

$ sudo rbac-playbook -l copr-keygen.aws.fedoraproject.org groups/copr-keygen.yml -t certbot

Do this for all the instances!

Launch new instances

As simple as:

$ opts=( -e copr_instance=dev -e server_id=keygen )
$ ansible-playbook play-vm-migration-01-new-box.yml "${opts[@]}"

You’ll see an output like:

ok: [localhost] => {
    "msg": [
        "ElasticIP: not specified",
        "Instance ID: i-04ba36eb360187572",
        "Network ID: eni-048189f432f068270",
        "Unused Public IP: 100.24.62.79",
        "Private IP: 172.30.2.94"
    ]
}

Now fix the corresponding new_instance_id and new_network_id options in group_vars/{dev,prod}.yml according to the output. Also update old_instance_id and old_network_id options.

Note the Private IP addresses

Most of the communication within Copr stack happens on public interfaces via hostnames with one exception. Communication between backend and keygen is done on a private network behind a firewall through IP addresses that change when spawning a fresh instances.

So once you know the Backend’s private IP, please do a private IP change in ansible.git.

Don’t start the services after the first playbook run

Set the services_disabled: true for your instance in inventory/group_vars/copr_*_dev_aws for devel, or inventory/group_vars/copr_*_aws for production.

Pre-prepare the new VM — backend only!

Note

Running the playbook against the new copr-backend server before shutting down the old one is possible. This minimizes the outage duration with non-working DNF repositories on the backend, which is highly desirable.

However, to prevent any issues with Ansible, the following prerequisites are necessary:

  • A temporary volume attached to the new box that provides an ext4 filesystem with the copr-repo label.

  • An existing temporary hostname (having an existing DNS record) to execute the playbook against it.

The volume, DNS record, and corresponding Elastic IP for this purpose have already been prepared by the play-vm-migration-01-new-box.yml playbook mentioned above.

Note

The following inventory configuration should already be prepared for you in the “commented-out” form.

Ensure that the copr-be-dev-temp.aws.fedoraproject.org is specified in the inventory in the following groups:

copr_back_dev_aws
staging
cloud_aws

Similarly, use copr-be-temp.aws.fedoraproject.org in:

copr_back_aws
cloud_aws

For both cases, set the birthday=yes variable for the temporary hostname:

[copr_back_dev_aws]
copr-be-dev.aws.fedoraproject.org
copr-be-dev-temp.aws.fedoraproject.org birthday=yes

On Batcave, execute the playbook against the temporary hostname:

$ sudo rbac-playbook -l copr-be-dev-temp.aws.fedoraproject.org groups/copr-backend.yml
$ sudo rbac-playbook -l copr-be-temp.aws.fedoraproject.org     groups/copr-backend.yml

Once the playbook finishes successfully, remember to revert the inventory changes we did here (commenting out again).

Outage window

When initiating this section, aim for time efficiency as the services will be down and inaccessible to users.

Let users know

See Fedora Copr outage announcements again, ad “ongoning” issue.

Move IPs and Volumes to the New Instances

Warning

Prepare to follow the instructions provided during the playbook run. You’ll need to perform manual steps such as DB backups, consistency checks, etc.

Migrate the data volumes and IP addresses to the new machine. For the Backend case, a separate playbook is created. This playbook makes the results directory unavailable temporarily, affecting every Copr consumer! Ensure that that the lighttpd service is running on the new server once the playbook finishes, and that it hosts the correct results:

$ ansible-playbook play-vm-migration-02-migrate-backend-box.yml "${opts[@]}"

For the rest of the systems (Frontend, DistGit, Keygen), use:

$ ansible-playbook play-vm-migration-02-migrate-non-backend-box.yml "${opts[@]}"

Provision the new instances

In the fedora-infra ansible repository, edit the inventory/inventory file and set the birthday=yes variable for your updated host, for example:

[copr_front_dev_aws]
copr.stg.fedoraproject.org birthday=yes

This is necessary to instruct the first playbook run on batcave01 to sign the new host certificates (avoiding later manipulation with known_hosts).

On batcave01, execute the playbook to provision the instance (ignore the playbook for upgrading Copr packages). For the dev instance, refer to

https://docs.pagure.org/copr.copr/how_to_release_copr.html#upgrade-dev-machines

and for production, refer to

https://docs.pagure.org/copr.copr/how_to_release_copr.html#upgrade-production-machines

It’s possible that the playbook fails, but it typically isn’t crucial now. If provisioning at least reaches the end of the base role, revert the birthday=yes commit and proceed with the next steps.

The playbooks above have not automatically updated the systems. If you prefer to start on Fedora N+2 with up-2-date set of packages, do the dnf update now (manual step over ssh).

Get it working

Rerun the playbook from the previous section again, with dropped configuration:

services_disabled: false

It should proceed with mounting data volumes but will likely not succeed. Now, you’ll need to debug and address the issues. If necessary, modify and rerun the playbook multiple times (ensuring lighttpd running on the new backend all the time).

Note

Frontend - You’ll likely need to manually upgrade the PostgreSQL database once you migrate to the new Fedora (new PG major version). Refer to Upgrade the database.

Post-upgrade

By this point, every Copr service should be operational.

It’s a good idea to test /usr/sbin/reboot now to debug potential boot issues during the outage window, as future reboots are likely to occur at the most inconvenient times.

Rename the instance names

Remove the -new name suffix from the new instances and add a -old suffix to the old instances. This playbook should be executed only once for all the infra instances:

$ opts=( -e copr_instance=dev )  # or prod
$ ansible-playbook play-vm-migration-03-rename-instances.yml "${opts[@]}"

Terminate the old instances

Once you no longer require the old VMs, you can terminate them using the Amazon web UI. You can do this immediately after the upgrade or wait a couple of days (e.g. to keep the DB /backups for a while just in case of any problems).

The old VMs are protected against accidental termination. To disable this option, click Actions, navigate to Instance settings and then to Change termination protection.

Final steps

See a specific document Fedora Copr outage announcements, the “resolved” section.