How to upgrade persistent instances (Amazon AWS)¶
Note
This document is specific to Amazon AWS. For OpenStack, see this outdated one.
This article describes how to upgrade persistent instances (e.g. copr-fe-dev) to a new Fedora version.
Requirements¶
access to Amazon AWS
ssh access to batcave01
permissions to update aws.fedoraproject.org DNS records
Pre-upgrade¶
The goal is to do as much work as possible before the upgrade, while focusing only on the important things and not creating an overload of tasks that can be done post-upgrade.
Don’t do the pre-upgrade steps too long before the actual upgrade, ideally only a couple of hours or a day before.
Launch a new instance¶
First, log in to Amazon AWS, otherwise the following step will not work. Once you are logged in, feel free to close the page.
1. Choose AMI¶
Navigate to the Cloud Base Images download page and scroll down to the section with cloud base images for the Amazon public cloud. Use the `Click to launch` button to launch an instance from the x86_64 AMI. Select the US East (N. Virginia) region.
You will get redirected to the Amazon AWS page.
3. Application and OS Images (Amazon Machine Image)¶
Skip this section; we already chose the correct AMI from the Fedora website.
4. Instance type¶
Currently, we use the following instance types:
|          | Dev        | Production  |
|----------|------------|-------------|
| frontend | t3a.medium | t3a.xlarge  |
| backend  | t3a.medium | m5a.4xlarge |
| keygen   | t3a.small  | t3a.xlarge  |
| distgit  | t3a.medium | t3a.medium  |
| pulp     | t3a.medium | TODO        |
When more power is needed, please use the ec2instances.info comparator to find the cheapest available instance type that matches our needs.
5. Key pair (login)¶
Make sure to use the existing key pair named `Ansible Key`. This allows us to run the playbooks on the batcave01 box against the newly spawned VM.
6. Network settings¶
- Click the `Edit` button in the box heading to show more options
- Select VPC `vpc-0af***********972`
- Select `Subnet` to be `us-east-1c`
- Switch `Auto-assign IPv6 IP` to `Enable`
- Switch to `Select existing security group` and pick one of:
  - copr-frontend-sg
  - copr-backend-sg
  - copr-distgit-sg
  - copr-keygen-sg
  - copr-pulp-sg
7. Configure storage¶
- Click the `Advanced` button in the box heading to show more options
- Update the `Size (GiB)` of the root partition:
|          | Dev | Production |
|----------|-----|------------|
| frontend | 50G | 50G        |
| backend  | 20G | 100G       |
| keygen   | 10G | 20G        |
| distgit  | 20G | 80G        |
| pulp     | 20G | TODO       |
- Turn on the `Encrypted` option
- Select `KMS key` to whatever is `(default)`
8. Advanced details¶
Set `Termination protection` to `Enable`.
9. Launch instance¶
Click `Launch instance` in the right panel.
Add names for the root volumes¶
Once the instance is created, go to its details, switch to the `Storage` tab, and go through all attached volumes. Set the `Name` tag for each of them. Use the name of the instance as a prefix, e.g. `copr-keygen-dev-root`, `copr-frontend-prod-root`, etc.
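If you prefer the command line, the same Name tag can be set with the AWS CLI. This is only a sketch; the volume ID and the tag value are placeholders:
aws ec2 create-tags --region us-east-1 --resources vol-0123456789abcdef0 --tags Key=Name,Value=copr-keygen-dev-root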
Backup the current letsencrypt certificates¶
The certificate files used on the old set of VMs need to be copied onto the new set of VMs (at least initially, until they are automatically renewed by the certbot daemon). For this, we need to copy the certificate files to the batcave server first.
Copy the certificate files by running the playbooks against the current (old) copr stack (all machines). There’s the `-t certbot` ansible tag that allows you to speed up the playbook runs.
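For example, on batcave01 something along these lines should do it. The playbook paths are assumptions here; use whatever group playbooks the fedora-infra ansible repository currently provides for the Copr hosts:
sudo rbac-playbook -t certbot groups/copr-frontend.yml
sudo rbac-playbook -t certbot groups/copr-backend.yml
sudo rbac-playbook -t certbot groups/copr-keygen.yml
sudo rbac-playbook -t certbot groups/copr-dist-git.yml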
Pre-prepare the new VM¶
Note
Backend - It’s possible to run the playbook against the new copr-backend server before we actually shut down the old one. But to make sure that ansible won’t complain, we need:
- A volume attached to the new box with the label `copr-repo`. Use the already existing volume named `data-copr-be-dev-initial-playbook-run`.
- An existing complementary DNS record (`copr-be-temp` or `copr-be-dev-temp`) pointing to the non-elastic IP of the new server. See the DNS SOP.
Note the private IP addresses¶
Most of the communication within the Copr stack happens on public interfaces via hostnames, with one exception. Communication between `backend` and `keygen` is done on a private network behind a firewall, through IP addresses that change when spawning a fresh instance.
Note
Backend - After updating a `copr-backend` (or dev) instance, change the configuration in `inventory/group_vars/copr_keygen_aws` or `inventory/group_vars/copr_keygen_dev_aws` and update the iptables rules:
custom_rules: [ ... ]
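As a purely illustrative sketch, such a rule could look like the following; the private IP address and port are placeholders and must be replaced with the real backend private IP and the port keygen actually listens on:
custom_rules: [ '-A INPUT -p tcp -m tcp -s 172.30.2.123 --dport 5167 -j ACCEPT' ]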
Don’t start the services after first playbook run¶
Set `services_disabled: true` for your instance in `inventory/group_vars/copr_*_dev_aws` for devel, or `inventory/group_vars/copr_*_aws` for production.
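A minimal sketch of the change (the group_vars file name depends on which instance you are upgrading):
# e.g. inventory/group_vars/copr_back_dev_aws
services_disabled: true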
Outage window¶
Once you start this section, try to be time-efficient because the services are down and unreachable by users.
Stop the old services¶
Except for the `lighttpd.service` on the old copr-backend (still serving repositories to users) and `postgresql.service` on the old copr-frontend (we will need it to back up the database), stop all of our services.
Warning
Backend - You have to terminate existing resalloc resources. See Terminate resalloc resources.
|          | Command |
|----------|---------|
| frontend |         |
| backend  |         |
| keygen   |         |
| distgit  |         |
| pulp     |         |
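As a hedged illustration only, the stop commands are typically along these lines; the exact unit names differ per role and release, so check what is actually running with systemctl list-units --state=running:
systemctl stop copr-backend.target   # backend (example)
systemctl stop httpd                 # frontend / distgit (example)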
Stop all timers and cron jobs so they don’t collide or talk with the newly provisioned servers:
systemctl stop crond
systemctl stop '*.timer'
Warning
Backend - Do not forget to kill all `/usr/bin/prunerepo` and `/usr/bin/copr-backend-process-build` processes, e.g.:
pkill -f /usr/bin/prunerepo
pkill -f /usr/bin/copr-backend-process-build
Ideally, you should wait until all `/usr/bin/copr-backend-process-action` processes have finished.
Umount data volumes from old instances¶
Warning
Backend - Keep the backend volume mounted to the old instance. We will take care of that later.
Note
Frontend - On the new instance, it will probably be necessary to manually upgrade the database to a new PostgreSQL version. This is our last chance to Backup the database before the upgrade. Do it.
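A hedged sketch of such a backup on the old frontend; the database name and target path are illustrative, and the linked Backup the database document is authoritative:
su - postgres
pg_dump coprdb > /var/lib/pgsql/coprdb-backup-$(date +%F).sql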
Once the backup is created, stop the PostgreSQL server:
systemctl stop postgresql
It might not be clear what data volumes are mounted. You can check out `roles/copr/*/tasks/mount_fs.yml` in the ansible playbooks to see the data volumes.
Umount data volumes and make sure everything is written:
umount /the/data/directory/mount/point
sync
Optionally, you can shut down the instance (but you don’t have to):
shutdown -h now
Attach data volumes to the new instances¶
Warning
Backend - Keep the backend volume attached to the old instance. We will take care of that later.
Open the Amazon AWS web UI, select `Volumes` in the left panel, and filter them with `CoprPurpose: infrastructure` and `CoprInstance` either `devel` or `production`. Find the correct volume, select it, and `Detach Volume`.
|          | Dev                    | Production              |
|----------|------------------------|-------------------------|
| frontend | data-copr-fe-dev       | data-copr-frontend-prod |
| backend  | data-copr-be-dev       | data-copr-backend-prod  |
| keygen   | data-copr-keygen-dev   | data-copr-keygen-prod   |
| distgit  | data-copr-distgit-dev  | data-copr-distgit-prod  |
| pulp     | data-copr-pulp-dev     | TODO                    |
Once it is done, right-click the volume again, and click `Attach Volume` (it can be safely attached to a running instance).
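If you prefer the CLI over the web UI, a hedged equivalent (the volume ID, instance ID, and device name are placeholders):
aws ec2 detach-volume --volume-id vol-0123456789abcdef0
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 --instance-id i-0123456789abcdef0 --device /dev/sdf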
Flip the elastic IPs¶
Warning
Backend - Keep the backend elastic IP associated with the old instance. We will take care of that later.
Except for copr-be, flip the Elastic IPs to the new instances. This is needed to allow a successful run of the playbooks.
Open Amazon AWS, and in the left panel under `Network & Security` click `Elastic IPs`. Filter them by either `CoprInstance : devel` or `CoprInstance : production`. Select the IP for your instance, and click `Actions`, `Associate Elastic IP address` (don’t worry that it is already associated with the old instance).
- In the `Instance` field, search for your instance with the `-new` suffix
- Check the `Allow this Elastic IP address to be reassociated` option
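The same flip can also be done from the CLI; this is a sketch with placeholder allocation and instance IDs:
aws ec2 associate-address --allocation-id eipalloc-0123456789abcdef0 --instance-id i-0123456789abcdef0 --allow-reassociation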
Provision new instance from scratch¶
In the fedora-infra ansible repository, edit the `inventory/inventory` file and set the `birthday=yes` variable for your host, e.g.:
[copr_front_dev_aws]
copr.stg.fedoraproject.org birthday=yes
On batcave01, run the playbook to provision the instance (ignore the playbook for upgrading Copr packages).
Note
Backend - You need to slightly modify the calls to use -l copr-be*-temp….
To make the playbook work with the new copr-be*-temp DNS record, we have to specify the host name in TWO PLACES in the inventory inside ansible.git:
- `inventory/inventory` -- the `copr_back_aws` vs. `copr_back_dev_aws` groups
- `inventory/cloud` -- `cloud_aws`
If we don’t, the nagios monitoring breaks miserably when the playbook is run.
For the dev instance, see
https://docs.pagure.org/copr.copr/how_to_release_copr.html#upgrade-dev-machines
and for production, see
https://docs.pagure.org/copr.copr/how_to_release_copr.html#upgrade-production-machines
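For illustration only, a provisioning run on batcave01 may look like the following; the playbook name and host limit are assumptions, and the linked documents above contain the authoritative commands:
sudo rbac-playbook -l copr-keygen-dev.aws.fedoraproject.org groups/copr-keygen.yml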
It is possible that the playbook fails; that isn’t important now. If the provisioning gets at least through the `base` role, revert the commit to remove the `birthday` variable.
Dealing with backend¶
This is a backend-specific section. For other instances, skip it completely.
Note
Backend - On the new copr-be*-temp hostname, stop lighttpd etc. and umount the temporary volume. It needs to be detached in the AWS CLI, too.
Warning
Backend - You should hurry up and go through this section quickly. The storage will be down and end-users will see failed dnf update … processes in terminals.
Note
Backend - Connect to the old instance via SSH. It doesn’t have a hostname anymore, so you will need to use its public IP address.
Stop all services using the data volume, e.g.:
systemctl stop lighttpd
Safely umount the data volume.
Note
Backend - Open Amazon AWS, detach the data volume from the old backend instance, and attach it to the new one.
Note
Backend - Open Amazon AWS and finally flip the backend elastic IP address from the old instance to the new one.
Note
Backend - Re-run the playbook again, this time with the correct hostname (without `-temp`) and drop the `birthday=yes` parameter.
Get it working¶
Re-run the playbook from the previous section again, this time with the service-disabling configuration dropped:
services_disabled: false
It’s encouraged to start with the backend so the repositories are UP again. Since we have fully working DNS and elastic IPs, even the copr-backend playbook can be run with the normal `-l` argument.
It should get past mounting, but it will most likely not succeed. At this point, you need to debug and fix the issues that come up. If required, adjust the playbook and re-run it again and again (pay attention to getting lighttpd serving the repositories ASAP).
Note
Frontend - It will most likely be necessary to manually upgrade the PostgreSQL database once you have migrated to the new Fedora (new PG major version). See how to Upgrade the database.
Note
Keygen - If you upgraded keygen before backend, you need to re-run the keygen playbook once more to allow the new backend private IP address in the iptables.
Update IPv6 addresses¶
Update the `aws_ipv6_addr` for your instance in `inventory/group_vars/copr_*_dev_aws` for devel, or `inventory/group_vars/copr_*_aws` for production.
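A minimal sketch of the change; the file name and the address are illustrative, so use the IPv6 address actually assigned to the new instance:
# e.g. inventory/group_vars/copr_back_dev_aws
aws_ipv6_addr: "2600:1f18:aaaa:bbbb::cccc"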
Then run the playbooks once more with `-t ipv6_config` and reboot the instance (or figure out a better way to get the addresses working).
Fix IPv6 DNS records¶
There is no support for Elastic IPs for IPv6, so we have to update the AAAA records every time we spawn a new infrastructure machine. SSH to batcave, and set up the DNS records there according to the DNS SOP.
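For illustration, the resulting zone-file entry might look something like this; the hostname and address are placeholders, and the DNS SOP describes the actual workflow:
copr-be-dev    IN    AAAA    2600:1f18:aaaa:bbbb::cccc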
Post-upgrade¶
At this moment, every Copr service should be up and running.
Drop suffix from instance names¶
Open the Amazon AWS web UI, select `Instances` in the left panel, and filter them with `CoprPurpose: infrastructure`. Rename all instances without the `-new` suffix to end with the `-old` suffix. Then drop the `-new` suffix from the instances that have it.
Terminate the old instances¶
Once you don’t need the old VMs, you can terminate them e.g. in Amazon web UI. You can do it right after the upgrade or wait a couple of days to be sure.
The instances should be protected against accidental termination, and therefore you need to click `Actions`, go to `Instance settings`, `Change termination protection`, and disable this option.
Final steps¶
Don’t forget to announce on the fedora devel and copr devel mailing lists and also on `#fedora-buildsys` that everything should be working again.
Close the infrastructure ticket; the upgrade is done.