Multiple server addresses or names in kdcinfo files

Problem statement

When a user authenticates using Kerberos, the KDCs that are actually used are discovered by libkrb5 with the help of DNS SRV records, configured explicitly in /etc/krb5.conf, or provided by a special locator plugin.

Administrators expect that the servers they define in sssd.conf are used both for authentication through SSSD and by applications that use libkrb5, such as the Kerberos command line tools like kinit. For this reason, SSSD provides a locator plugin for libkrb5 that allows SSSD to inform libkrb5 about the servers SSSD has configured.

However, SSSD, at least in the typical use case, writes only the address of the single server it is connected to and changes the address only when the daemon reconnects to a different server. This creates a problem when the server whose address is written in the kdcinfo file becomes unreachable but no action that would provoke a fail over in SSSD (such as a user login over PAM) is executed. In that case, the kdcinfo file contains stale entries. Because the kdcinfo files are authoritative from libkrb5's point of view, libkrb5 cannot reach any KDC from that domain if the information present there is not useful.

To improve the situation, this design page proposes adding a new SSSD option that, if set, enables SSSD to write additional host names into the kdcinfo files. The plugin can then iterate over these entries, which gives libkrb5 a form of fail over across the servers configured in sssd.conf or autodiscovered by SSSD.

Use cases

A typical sequence that triggers this problem is this:
  • log in with a PAM service to a machine. This causes a KDC address to be written to the kdcinfo file
  • disable the KDC server, e.g. by enabling a restrictive firewall rule
  • call kinit on the client where the kdcinfo file was written

Overview of the solution

The Kerberos locator plugin reads the address(es) from per-realm text files written by SSSD located in the /var/lib/sss/pubconf directory. At the moment, the plugin can already read multiple entries, but currently only numerical addresses are supported.

On a high level, implementing this RFE requires several changes:
  • change the Kerberos locator plugin so that it can also consume host names in addition to numerical addresses. These host names would be resolved in the plugin itself and passed to libkrb5 with the help of a callback function libkrb5 provides to the plugin
  • add a new SSSD option that limits the number of entries that SSSD writes to the kdcinfo files. This is needed to avoid timeouts in case the network is truly unreachable. The default value of the option could perhaps differ between master and sssd-1-16: master could default to writing multiple entries, while sssd-1-16 would default the option to 0 in order not to change the behaviour of a stable branch.
  • extend the online callback which the SSSD fail over component uses to write the current server to the kdcinfo files to also write additional server host names in addition to the current server address
  • to enable writing multiple server addresses, the request to resolve a server for a service should be extended to resolve host names up to the specified limit
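
To illustrate the intended result of the changes above, the following is a hypothetical kdcinfo file after a fail over service resolution. The file name kdcinfo.EXAMPLE.COM, the address and the host names are assumptions made for this example only. The first line is the already resolved address of the server SSSD is connected to and the remaining lines are host names the plugin would resolve itself.

```text
192.0.2.10
kdc2.example.com
kdc3.example.com
```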

When it comes to resolving the servers, there are several scenarios to consider:

  • The servers can be enumerated using an option. This includes krb5_server/krb5_backup_server for the krb5 provider and ipa_server/ipa_backup_server and ad_server/ad_backup_server for the IPA and AD providers.
  • The servers can be completely autodiscovered. Typically this is done by either omitting the *_server options completely or using the _srv_ identifier. As long as the list is omitted or the _srv_ record is the first one in the list, any fail over service resolution triggers the DNS SRV lookups and resolves the whole list. Note that the _srv_ identifier is not permitted explicitly in the backup server list, but the AD provider does resolve a SRV query into the backup server list. This happens when an AD site is used: the servers from the AD site are added as ‘primary’ and the global servers form the ‘backup’ list.
  • A mix of the above. The most complex case from the point of view of this RFE is a list that starts with a host name but includes the _srv_ identifier later on, e.g. krb5_server = <hostname>, _srv_. In this case, calling the fail over resolution currently resolves only the explicitly listed host name, not the SRV query. Unless the fail over code is extended, the host names originating from the SRV query are therefore not known after the service resolution finishes.
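
The three scenarios can be sketched as the following alternative sssd.conf fragments for the krb5 provider. Only one of them would appear in a given domain section, and the host names kdc1.example.com and kdc2.example.com are placeholders invented for this example.

```ini
# (a) servers enumerated explicitly
krb5_server = kdc1.example.com
krb5_backup_server = kdc2.example.com

# (b) servers completely autodiscovered through DNS SRV records
krb5_server = _srv_

# (c) a mix: an explicit host name followed by a SRV lookup
krb5_server = kdc1.example.com, _srv_
```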

Implementation details

The interface the locator plugin uses to communicate with libkrb5 is a callback function provided by the caller (libkrb5). SSSD is supposed to pass a struct sockaddr to that callback. The Kerberos locator plugin is already capable of iterating over multiple addresses, but currently only numerical addresses are supported: the plugin converts the string representation of the address into a struct sockaddr by calling getaddrinfo(3) with the AI_NUMERICHOST flag. We should extend the locator plugin code to call getaddrinfo without that flag for entries that do not represent an address, so that the host name is resolved and its address is passed to libkrb5. This can be a first self-contained step in the implementation.
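
A minimal sketch of this first step follows. It is not the plugin's actual code: the names resolve_kdcinfo_entry, addr_cb and count_addrs are invented for the example, and the callback type is only a stand-in for the one libkrb5 passes to the plugin. The entry is first tried as a numerical address and only retried as a host name if that fails, so plain addresses never cause a DNS lookup.

```c
#define _POSIX_C_SOURCE 200112L
#include <netdb.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical stand-in for the callback libkrb5 hands to a locator plugin. */
typedef int (*addr_cb)(void *cbdata, int socktype, struct sockaddr *addr);

/*
 * Resolve one kdcinfo entry and pass every resulting address to cb.
 * Returns 0 on success or a getaddrinfo() error code.
 */
int resolve_kdcinfo_entry(const char *entry, const char *port,
                          int socktype, addr_cb cb, void *cbdata)
{
    struct addrinfo hints = { 0 };
    struct addrinfo *res = NULL;
    struct addrinfo *ai;
    int ret;

    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = socktype;
    /* First attempt: numerical address only, no name resolution. */
    hints.ai_flags = AI_NUMERICHOST | AI_NUMERICSERV;

    ret = getaddrinfo(entry, port, &hints, &res);
    if (ret == EAI_NONAME) {
        /* Not a numerical address: retry and treat the entry as a host name. */
        hints.ai_flags = AI_NUMERICSERV;
        ret = getaddrinfo(entry, port, &hints, &res);
    }
    if (ret != 0) {
        return ret;
    }

    for (ai = res; ai != NULL; ai = ai->ai_next) {
        if (cb(cbdata, ai->ai_socktype, ai->ai_addr) != 0) {
            break; /* the callback asked us to stop */
        }
    }
    freeaddrinfo(res);
    return 0;
}

/* Example callback that only counts the addresses it receives. */
int count_addrs(void *cbdata, int socktype, struct sockaddr *addr)
{
    (void)socktype;
    (void)addr;
    ++*(int *)cbdata;
    return 0;
}
```

Calling resolve_kdcinfo_entry("127.0.0.1", "88", SOCK_STREAM, count_addrs, &n) takes the numerical path, while an entry such as a host name falls through to the second getaddrinfo call, which may block on DNS.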

The kdcinfo files are written (using write_krb5info_file) either during an online callback or in a special case for IPA trust clients. The special case already does something similar to what this page is about: it looks into a subsection representing a trusted domain (e.g. [domain/ipa.test/]) and resolves all the servers in that list either by name or based on a site selection. However, this is done during the subdomain provider operation, not during a resolver callback, and all the addresses configured in the sssd.conf file are always resolved and written to the kdcinfo file.

The write_krb5info_file function receives a linked list of struct fo_server structures which contain the address, if already resolved, or at least a host name in the struct server_common member structure. Since the callback should be synchronous and not do much work on its own, it would be best if the callback was invoked with the data already provided.

There are two kinds of servers in the fail over module: primary and backup. The backup servers are supposed to be used only temporarily, and SSSD periodically tries to connect to one of the primary servers. However, from the point of view of the fail over code, adding a “backup” server still means the server is added to the same linked list, just with a flag denoting that the server is not primary. Therefore, iterating over the single list iterates over both the primary and the backup servers.

Before changing the online callbacks, it would be useful to implement and read the krb5_kdcinfo_lookahead option so that there is already an upper limit when the callbacks write the extra host names.

The next step of implementation could be extending the online callbacks that call the write_krb5info_file functions. There are several of them, ad_resolve_callback, ipa_resolve_callback and krb5_resolve_callback. The callbacks receive the current struct fo_server instance. The callbacks would then keep iterating over the linked list until either the list is exhausted or as many as krb5_kdcinfo_lookahead items are processed. The host name from the struct server_common structure would be read using fo_get_server_name and written to the array passed to write_krb5info_file.
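
The iteration described above could look roughly like the sketch below. The struct server type is a simplified, hypothetical stand-in for struct fo_server, and collect_kdcinfo_names is an invented name. The sketch assumes that the current server does not count towards the lookahead limit, so that a value of 0 writes only the current server. For brevity it stores the name of the current server, whereas the real callback would write its resolved address.

```c
#include <stddef.h>

/* Simplified, hypothetical stand-in for SSSD's struct fo_server. */
struct server {
    const char *name;       /* host name; NULL if not known yet */
    struct server *next;
};

/*
 * Fill out[] with the current server followed by at most `lookahead`
 * additional host names from the rest of the list.
 * Returns the number of entries stored in out[].
 */
size_t collect_kdcinfo_names(const struct server *current, size_t lookahead,
                             const char **out, size_t out_len)
{
    size_t n = 0;
    const struct server *s;

    if (current == NULL || out_len == 0) {
        return 0;
    }

    out[n++] = current->name;   /* the server SSSD is connected to */

    for (s = current->next;
         s != NULL && lookahead > 0 && n < out_len;
         s = s->next) {
        if (s->name == NULL) {
            continue;           /* e.g. a SRV entry that is not resolved yet */
        }
        out[n++] = s->name;
        lookahead--;
    }
    return n;
}
```

The loop stops as soon as the list is exhausted, the lookahead budget is used up, or the output array is full, whichever comes first.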

One question to consider is whether to use the fo_server instances before the current one, i.e. those that SSSD tried before and couldn’t connect to. I think it would make sense to add them to the end of the list, at least for the primary servers that do not come from a SRV query, because SSSD never reconnects to a server earlier in the list as long as a later server works. The SRV queries are different in this respect: they time out and force SSSD to resolve the whole list once a server is requested again (typically either during authentication or once the LDAP connection expires).
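
If previously tried servers are to be appended, the ordering could be produced as in the sketch below. It again uses a hypothetical, simplified struct server in place of struct fo_server, the function name collect_with_wraparound is invented, and `limit` is assumed here to be the total number of entries to write.

```c
#include <stddef.h>

/* Simplified, hypothetical stand-in for SSSD's struct fo_server. */
struct server {
    const char *name;
    struct server *next;
};

/*
 * Store up to `limit` names in out[]: first the current server and the
 * servers after it, then the servers from the head of the list that were
 * tried earlier. Returns the number of entries stored.
 */
size_t collect_with_wraparound(const struct server *head,
                               const struct server *current,
                               size_t limit, const char **out)
{
    size_t n = 0;
    const struct server *s;

    /* The current server and everything that follows it. */
    for (s = current; s != NULL && n < limit; s = s->next) {
        out[n++] = s->name;
    }
    /* The servers before the current one go to the end of the output. */
    for (s = head; s != NULL && s != current && n < limit; s = s->next) {
        out[n++] = s->name;
    }
    return n;
}
```

With a list kdc1, kdc2, kdc3 and kdc2 as the current server, this yields kdc2, kdc3, kdc1, so the server that failed earlier is only tried by libkrb5 as a last resort.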

Finally, the case where the fail over code needs to do additional lookups in order to resolve at least the number of host names requested by krb5_kdcinfo_lookahead should be addressed. The caller that initializes the fail over service (maybe with be_fo_add_service) should provide a hint with the value of the lookahead option. Then, if a request for server resolution is triggered, the fail over code would resolve a server and afterwards check whether enough fo_server entries with a valid host name in the struct server_common structure are available. If not, the request would check whether any of the fo_server structures represents a SRV query and try to resolve the query to receive more host names.
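
The check described in this paragraph can be sketched as follows. The structure and the function name srv_needed are hypothetical simplifications, and the sketch assumes that an unresolved SRV entry is represented by a list item without a host name. It only decides whether a SRV lookup is needed; the lookup itself would remain asynchronous in the fail over code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified, hypothetical stand-in for SSSD's struct fo_server. */
struct server {
    const char *name;   /* NULL while no host name is known */
    bool is_srv;        /* true if the entry represents a SRV query */
    struct server *next;
};

/*
 * Return the first unresolved SRV entry if the list holds fewer than
 * `wanted` host names, or NULL if no extra lookup is needed or possible.
 */
const struct server *srv_needed(const struct server *head, size_t wanted)
{
    size_t named = 0;
    const struct server *srv = NULL;
    const struct server *s;

    for (s = head; s != NULL; s = s->next) {
        if (s->name != NULL) {
            named++;
        } else if (s->is_srv && srv == NULL) {
            srv = s;    /* remember the first unresolved SRV entry */
        }
    }
    return named >= wanted ? NULL : srv;
}
```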

Configuration changes

A new configuration option called krb5_kdcinfo_lookahead would be added. This option would default to a small non-zero value in the master branch, perhaps 3, so that attempting to resolve the extra host names does not cause the libkrb5 operation to time out. If the patches are backported to any stable branch, the option must default to 0 (disabled).
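
As an illustration, a domain section using the proposed option might look like the fragment below. The domain name and the host names are placeholders, and the final syntax of the option may differ from this sketch.

```ini
[domain/example.com]
krb5_server = kdc1.example.com, kdc2.example.com, kdc3.example.com
# proposed option: limit on the extra entries written to the kdcinfo file
krb5_kdcinfo_lookahead = 3
```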

In the first iteration, we might want to just read a single number, but in the future, the option should be extended to accept two numbers in the total:backup notation. This would mean write up to total servers, but include up to backup servers from the backup list. This would be useful in case none of the servers from the primary list are reachable, because e.g. they all come from the same AD site, but servers outside the site are reachable. This extension would only make sense if SSSD does not resolve the host names on its own, which might be another future extension.
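
A possible way to parse both the single-number form and the total:backup notation is sketched below. The names struct lookahead and parse_lookahead are invented for this example, and treating a single number as "no backup servers" is an assumption, not something this design settles.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical container for the parsed option value. */
struct lookahead {
    unsigned long total;    /* maximum number of servers to write */
    unsigned long backup;   /* how many of them may be backup servers */
};

/*
 * Parse "N" or "N:M" into *out.
 * Returns 0 on success and -1 on malformed input.
 */
int parse_lookahead(const char *s, struct lookahead *out)
{
    char *end;
    unsigned long total;
    unsigned long backup = 0;   /* assumption: a single number means 0 */

    /* Require a leading digit so that signs and whitespace are rejected. */
    if (s == NULL || *s < '0' || *s > '9') {
        return -1;
    }
    errno = 0;
    total = strtoul(s, &end, 10);
    if (errno != 0) {
        return -1;
    }

    if (*end == ':') {
        const char *b = end + 1;
        if (*b < '0' || *b > '9') {
            return -1;
        }
        backup = strtoul(b, &end, 10);
        if (errno != 0) {
            return -1;
        }
    }
    if (*end != '\0') {
        return -1;              /* trailing garbage such as "3:1:2" */
    }

    out->total = total;
    out->backup = backup;
    return 0;
}
```

Keeping the single-number form valid means that a configuration written for the first iteration would continue to parse once the extended notation is introduced.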

It might be a good idea to add a note to the sssd-ad and sssd-ipa man pages or even the shared fail over man page include file with a pointer to how the kdcinfo files work so that the information is easy to discover for administrators.

How To Test

Plugin test
With any of the below tests or even after writing the host names to the kdcinfo files directly, make sure the first entry in the list is unreachable. Then call e.g. kinit and check that the operation succeeds.
Backwards compatibility test
Set the krb5_kdcinfo_lookahead option to 0. Define multiple servers and perform Kerberos authentication. Make sure that only the current server is written to the kdcinfo files.
Write a list of servers
Set the krb5_kdcinfo_lookahead option to a positive value. Make sure that the first entry in the kdcinfo files is an address and the other entries are host names from the configuration. This test case should be extended to make sure that only as many entries as the value of the option are written, or, if there are fewer entries in the config file, that all are written.
Fail over test
Similar to the above, except make sure the first entry in the list cannot be contacted. Then, SSSD should resolve the next entry to the address and if applicable write the rest of the list.
Backup server test
At the minimum, we should make sure that servers from the backup list are written to the kdcinfo files. If the option implements the split total:backup value, then both limits should be tested as well.
(Optional) writing a previously tried, not working server
If it is agreed during design review that servers that are not working should also be written to the kdcinfo files (see the section about not working servers), then a test case should make sure those are written to the end of the list.
SRV resolution test
Leave the server list option (e.g. krb5_server) empty. Make sure a DNS SRV query for the configured realm returns valid servers and that they are written to the kdcinfo file.
Combined SRV and server list
Set the krb5_server option to hostname, _srv_. Set the krb5_kdcinfo_lookahead option to a value greater than 1. Make sure that the host names from the DNS SRV query are also present in the kdcinfo files.
IPA client test
The test cases above should be repeated for an IPA client as well in case the IPA online callbacks are modified.
AD site test
Add an AD client to a site or set the site in the config file. Make sure that the servers from the site are written first, followed by the global servers up to the krb5_kdcinfo_lookahead value.

How To Debug

Any new code must be decorated with DEBUG messages. To debug the locator plugin changes, using KRB5_TRACE or even calling strace might be useful.

Future development

First, it might be useful to extend the resolver or fail over code to resolve the names on its own to save some potentially blocking calls in the plugin. There is already an example of resolv_hostport_list_send that can perhaps be reused.

Additionally, we have been planning for some time to include connectivity checks with cLDAP ping or just a plain connect() to make sure that servers that cannot be contacted at all are not tried. This is of course outside the scope of this work, but should be kept in mind so that nothing incompatible is implemented.