How does the memory mapped cache work

The memory mapped cache allows client applications to look up user and group information very quickly because the data is mapped into the memory of the application. As long as the data is in the cache and valid, no additional communication with other processes is needed.

This memory area is organized into four parts:

The Header

At the beginning there is a header block which is defined in src/util/mmap_cache.h as:

struct sss_mc_header {
    uint32_t b1;            /* barrier 1 */
    uint32_t major_vno;     /* major version number */
    uint32_t minor_vno;     /* minor version number */
    uint32_t status;        /* database status */
    uint32_t seed;          /* random seed used to avoid collision attacks */
    uint32_t dt_size;       /* data table size */
    uint32_t ft_size;       /* free table size */
    uint32_t ht_size;       /* hash table size */
    rel_ptr_t data_table;   /* data table pointer relative to mmap base */
    rel_ptr_t free_table;   /* free table pointer relative to mmap base */
    rel_ptr_t hash_table;   /* hash table pointer relative to mmap base */
    rel_ptr_t reserved;     /* reserved for future changes */
    uint32_t b2;            /* barrier 2 */
};

The major_vno and minor_vno are used to tell the client how the cache and the data are organized. With status SSSD can tell the client whether the cache is valid or whether, e.g., the cache has been recycled. In the latter case the client must close the current cache file and open the newly created one.

b1 and b2 are used to implement a full memory barrier with the help of GCC’s __sync_synchronize() built-in function. The memory barrier is used to make sure the header data can be updated without other processes interfering.
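As an illustration, here is a minimal sketch of the idea; the exact protocol used in SSSD is not described in this document, so the helper names and the retry loop below are purely illustrative. The writer changes b1 before an update and makes b2 equal to it again afterwards, so a reader only trusts a header snapshot taken while both barrier values agree.

/* Illustrative only: the writer updates the header between the two barrier
 * words, a reader retries until it sees a consistent pair. */
static void writer_update_header(struct sss_mc_header *h, uint32_t new_status)
{
    h->b1++;                    /* signal "update in progress" */
    __sync_synchronize();       /* full memory barrier */
    h->status = new_status;     /* ... update other header fields ... */
    __sync_synchronize();
    h->b2 = h->b1;              /* barriers equal again: header consistent */
}

static int reader_copy_header(const struct sss_mc_header *h,
                              struct sss_mc_header *copy)
{
    for (int retry = 0; retry < 1000; retry++) {
        uint32_t b1 = h->b1;
        __sync_synchronize();
        *copy = *h;             /* take a private snapshot */
        __sync_synchronize();
        if (b1 == h->b2 && copy->b1 == copy->b2) {
            return 0;           /* consistent snapshot */
        }
    }
    return -1;                  /* writer kept the header busy too long */
}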

seed is a random seed for the hash function used to create lookup hashes from the provided name or ID. It is generated randomly whenever the cache file is created.

dt_size, ft_size and ht_size are the sizes of the remaining three sections of the memory mapped cache, and data_table, free_table and hash_table are the starting points of the sections relative to the start of the header.
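A sketch of how a client could derive the three table pointers after mapping the cache file, assuming rel_ptr_t is a byte offset relative to the mmap base as the comments in the struct suggest (the helper name is illustrative):

#include <stdint.h>

/* 'base' is the address returned by mmap() for the cache file; the header
 * sits at the very beginning of the mapping. */
static void locate_tables(uint8_t *base)
{
    struct sss_mc_header *hdr = (struct sss_mc_header *)base;

    uint8_t  *data_table = base + hdr->data_table;  /* slots, dt_size bytes   */
    uint8_t  *free_table = base + hdr->free_table;  /* bit-field, ft_size bytes */
    uint32_t *hash_table = (uint32_t *)(base + hdr->hash_table);
                                                    /* array of slot numbers  */
    (void)data_table; (void)free_table; (void)hash_table;
}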

The Data Table

This is the main area for storing the user and group data. It is organized in slots with a defined size. The current slot size is 40 bytes. Since the slot size is not given in the header, the major_vno or minor_vno must be changed if the slot size changes. Each data record for an individual user or group (see below) can occupy one or more consecutive slots.

At startup the data table is initialized with 0xff (all bits set), which is treated as an invalid value by the memory cache code regardless of whether 1, 4 or 8 bytes are read.

The Free Table

The memory area is treated as a bit-field. Each bit corresponds to a slot in the data table, starting with the highest bit (0x80) of the first byte in the free table.
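The bit arithmetic this convention implies looks roughly like the following sketch (the helper names are illustrative and not taken from the SSSD sources):

#include <stdint.h>

/* bit (0x80 >> (slot % 8)) of byte (slot / 8) belongs to slot 'slot';
 * a set bit means the slot is in use. */
static int slot_is_used(const uint8_t *free_table, uint32_t slot)
{
    return (free_table[slot / 8] & (0x80 >> (slot % 8))) != 0;
}

static void slot_mark_used(uint8_t *free_table, uint32_t slot)
{
    free_table[slot / 8] |= (0x80 >> (slot % 8));
}

static void slot_mark_free(uint8_t *free_table, uint32_t slot)
{
    free_table[slot / 8] &= ~(0x80 >> (slot % 8));
}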

At startup the free table is initialized with 0x00 (no bit set) which corresponds to a completely empty data table.

The Hash Table

The hash table is treated as an array of uint32_t values. The index used to access the array is the hash value calculated from the user or group name or ID, and the value stored at that index is the slot number in the data table where the data record for the object starts.

At startup the hash table is initialized with 0xff, which indicates that there is no data available for the given hash.

Sizes of the tables

From the previous sections it should be clear that although the sizes of the data, free and hash tables are handled independently, they are connected. Of course the free table should have at least as many bits as there are slots in the data table. But having more would be a waste of space (although not as much as having fewer bits). So the number of bits and slots should match.

The size of the data table should be large enough to hold the amount of expected data, where expected does not necessarily mean e.g. all possible users, but the amount of data needed on the host running SSSD to work efficiently. It should be large enough that data in active use does not have to be overwritten constantly with other data. So let’s say there should be N data elements in the cache (currently the hard-coded default is #define SSS_MC_CACHE_ELEMENTS 50000). Depending on the slot size and the type of data, one slot might not be sufficient to hold a data element of average size, so there is a payload factor which says how many slots are typically used for a data element. See below for a discussion about the payload of the current data types. For the size of the data table we now have N times the payload size.

With the number of data elements N we can now define the size of the hash table as well. Since the data typically can be accessed either by name or by ID, we need 2 times N entries in the hash table array for the ideal case where no hash collisions occur. Since we always have to expect and handle hash collisions, it won’t help to use more memory for the hash table. On the other hand, hash collisions should be avoided where possible because they make the handling of the memory cache more complex and slower. Since using less than 2 times N entries will force hash collisions if the cache is filled with the expected number of elements, just using 2 * N * sizeof(uint32_t) for the hash table size looks like a good compromise.
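Putting the last two paragraphs together, the sizing can be sketched as follows, assuming payload is the average number of slots per data element (names and the helper are illustrative, not taken from the SSSD sources):

#include <stdint.h>

#define SSS_MC_CACHE_ELEMENTS 50000     /* hard-coded default, see above   */
#define SLOT_SIZE             40        /* current slot size, see "The Data Table" */

static void mc_table_sizes(uint32_t n_elem, uint32_t payload,
                           uint32_t *dt_size, uint32_t *ft_size,
                           uint32_t *ht_size)
{
    uint32_t n_slots = n_elem * payload;       /* 'payload' slots per element  */

    *dt_size = n_slots * SLOT_SIZE;            /* data table in bytes          */
    *ft_size = (n_slots + 7) / 8;              /* free table: one bit per slot */
    *ht_size = 2 * n_elem * sizeof(uint32_t);  /* hash table: 2*N uint32_t     */
}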

../_images/memory_cache_tables.svg

How the data is organized

Currently there are three different kinds of data elements:

passwd
For user data defined by struct passwd, see man getpwnam for details
group
For group data defined by struct group, see man getgrnam for details
initgroups
For the group memberships of a user

All three types are stored in individual cache files. This allows the payload size to be handled flexibly and avoids hash collisions between different data types accessed with the same name. Besides the type specific data, all data elements start with a common header.

The Record Header

Similar to the cache header, the record header is defined in src/util/mmap_cache.h as:

struct sss_mc_rec {
    uint32_t b1;            /* barrier 1 */
    uint32_t len;           /* total record length including record data */
    uint64_t expire;        /* record expiration time (cast to time_t) */
    rel_ptr_t next1;        /* ptr of next record rel to data_table */
                            /* next1 is related to hash1 */
    rel_ptr_t next2;        /* ptr of next record rel to data_table */
                            /* next2 is related to hash2 */
    uint32_t hash1;         /* val of first hash (usually name of record) */
    uint32_t hash2;         /* val of second hash (usually id of record) */
    uint32_t padding;       /* padding & reserved for future changes */
    uint32_t b2;            /* barrier 2 - 32 bytes mark, fits a slot */
    char data[0];
};

Similar to the cache header, b1 and b2 (the 32 in the comment is wrong, it is kept here because it is the same in the source code as well) are used for memory barriers. len is the total length of the data record, which includes the header size and the type specific data which starts at data. If the current time returned by time() is larger than the value stored in expire, the data in the memory cache should not be used anymore and SSSD’s nss responder should be called to refresh the data.
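Two small illustrative helpers following directly from this description (the names are not the ones used in the SSSD sources): how many consecutive slots a record occupies, and whether it is still valid.

#include <stdint.h>
#include <time.h>

#define SLOT_SIZE 40    /* see "The Data Table" above */

static uint32_t record_num_slots(const struct sss_mc_rec *rec)
{
    /* len covers the record header plus the type specific data */
    return (rec->len + SLOT_SIZE - 1) / SLOT_SIZE;
}

static int record_is_expired(const struct sss_mc_rec *rec)
{
    return (uint64_t)time(NULL) > rec->expire;
}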

hash1 and hash2 are the two hash values which are used to find the right starting slot of the data record in the hash table. In theory they are not needed here, but they are used for a fast and easy consistency check.

Finally, next1 and next2 are used to handle hash collisions. Both values are initialized with MC_INVALID_VAL. If a hash collision is detected, i.e. there is already a data record with the same hash stored in the cache, the next1 or next2 element of the old record is checked, depending on whether the hash collision was found with hash1 or hash2. If the related next[12] element is MC_INVALID_VAL, the slot number of the new data record is added here. If there is a different value stored in next[12], it is assumed to be the slot number of another data record with the same hash. In this case the chain is followed by reading the data record from the next slot until the next[12] element of the current data record is MC_INVALID_VAL. Then the slot number of the new data record is stored there.
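A sketch of this chain walk when a new record has to be linked in, assuming rel_ptr_t is a 32-bit offset type as in the struct definitions above and MC_INVALID_VAL matches the 0xff fill pattern (the helper name and constants are illustrative):

#include <stddef.h>
#include <stdint.h>

#define MC_INVALID_VAL 0xffffffff   /* assumed, matching the 0xff fill */
#define SLOT_SIZE      40

/* Append 'new_slot' behind the last record that already carries the
 * colliding hash 'hash'; 'first_slot' is the slot taken from the hash table. */
static void chain_append(uint8_t *data_table, uint32_t first_slot,
                         uint32_t hash, uint32_t new_slot)
{
    uint32_t slot = first_slot;

    for (;;) {
        struct sss_mc_rec *rec =
            (struct sss_mc_rec *)(data_table + (size_t)slot * SLOT_SIZE);
        /* pick the next pointer that belongs to the colliding hash */
        rel_ptr_t *next = (rec->hash1 == hash) ? &rec->next1 : &rec->next2;

        if (*next == MC_INVALID_VAL) {
            *next = new_slot;       /* end of chain: link in the new record */
            return;
        }
        slot = *next;               /* follow the chain */
    }
}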

../_images/memory_cache_hash_collision.svg

The Passwd (User) Data

The passwd/user data is defined in src/util/mmap_cache.h as:

struct sss_mc_pwd_data {
    rel_ptr_t name;         /* ptr to name string, rel. to struct base addr */
    uint32_t uid;
    uint32_t gid;
    uint32_t strs_len;      /* length of strs */
    char strs[0];           /* concatenation of all passwd strings, each
                             * string is zero terminated ordered as follows:
                             * name, passwd, gecos, dir, shell */
};

The name pointer is a shortcut to the user name in the strs data and is used to make sure that the object found so far only with the help of the hash value actually matches the input name. If the user is searched by ID, the uid value is used for this check. If there is no match, there is either a hash collision and the next entry in the chain has to be checked, or the searched entry is currently not in the memory cache and the request has to be forwarded to SSSD’s nss responder.

The strs blob is expected to contain five 0-terminated strings representing the string components of struct passwd: user name, password, gecos, home directory and user shell. Since SSSD does not add password hashes to the output, the password string will typically be “*” or whatever the pwfield option is set to.
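A sketch of how a client could unpack the strs blob into the string members of struct passwd, following the field order from the struct comment above (the helper name is illustrative):

#include <string.h>

static void unpack_pwd_strings(const struct sss_mc_pwd_data *pwd,
                               const char **name, const char **passwd,
                               const char **gecos, const char **dir,
                               const char **shell)
{
    const char *p = pwd->strs;

    *name   = p; p += strlen(p) + 1;    /* user name      */
    *passwd = p; p += strlen(p) + 1;    /* password field */
    *gecos  = p; p += strlen(p) + 1;    /* gecos          */
    *dir    = p; p += strlen(p) + 1;    /* home directory */
    *shell  = p;                        /* user shell     */
}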

../_images/memory_cache_passwd.svg

A complete passwd/user record looks like:

0000000: 0000 00f0 7100 0000  ....q...
0000008: 80f8 1f5a 0000 0000  ...Z....
0000010: ffff ffff ffff ffff  ........
0000018: 9c99 0000 2a35 0100  ....*5..
0000020: ffff ffff 0000 00f0  ........

0000028: 1000 0000 00d9 b92b  .......+
0000030: 00d9 b92b 3900 0000  ...+9...
0000038: 6164 6d69 6e40 6970  admin@ip
0000040: 6166 3236 2e64 6576  af26.dev
0000048: 656c 002a 0041 646d  el.*.Adm

0000050: 696e 6973 7472 6174  inistrat
0000058: 6f72 002f 686f 6d65  or./home
0000060: 2f61 646d 696e 002f  /admin./
0000068: 6269 6e2f 6261 7368  bin/bash
0000070: 00ff ffff ffff ffff  ........

where each block represents one slot (40 bytes).

The total length len of this record is 0x71(113) bytes. There are no hash collisions as can be seen by the 0xff in the third line.

struct sss_mc_pwd_data starts with the second block. The name string starts after 0x10(16) bytes. The uid and gid of the user are both 0x2bb9d900(733600000) and all strings together, including the terminating 0x00s, are 0x39(57) bytes long. The remainder of the last slot is filled with 0xff.

The Group Data

The group data is defined in src/util/mmap_cache.h as:

struct sss_mc_grp_data {
    rel_ptr_t name;         /* ptr to name string, rel. to struct base addr */
    uint32_t gid;
    uint32_t members;       /* number of members in strs */
    uint32_t strs_len;      /* length of strs */
    char strs[0];           /* concatenation of all group strings, each
                             * string is zero terminated ordered as follows:
                             * name, passwd, member1, member2, ... */
};

name and gid are similar to the ones in struct sss_mc_pwd_data. members is the number of members of the group, so it is expected that there are members + 2 (all members plus the group name and the group password) 0-terminated strings in the strs blob.
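Analogous to the passwd case, a sketch of walking the strs blob of a group record (illustrative helper, not from the SSSD sources): first the group name, then the password, then members member names.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static void print_group_strings(const struct sss_mc_grp_data *grp)
{
    const char *p = grp->strs;

    printf("name:   %s\n", p); p += strlen(p) + 1;
    printf("passwd: %s\n", p); p += strlen(p) + 1;

    for (uint32_t i = 0; i < grp->members; i++) {
        printf("member: %s\n", p);
        p += strlen(p) + 1;
    }
}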

../_images/memory_cache_group.svg

Here is an example for a group with 8 members:

00000000: 0000 00f0 1201 0000  ........
00000008: 2262 255a 0000 0000  "b%Z....
00000010: ffff ffff ffff ffff  ........
00000018: 87f8 0000 6184 0000  ....a...
00000020: ffff ffff 0000 00f0  ........

00000028: 1000 0000 2ad9 b92b  ....*..+
00000030: 0800 0000 da00 0000  ........
00000038: 7465 7374 5f67 726f  test_gro
00000040: 7570 4069 7061 6632  up@ipaf2
00000048: 362e 6465 7665 6c00  6.devel.

00000050: 2a00 7465 7374 2d75  *.test-u
00000058: 7365 7261 4069 7061  sera@ipa
00000060: 6632 362e 6465 7665  f26.deve
00000068: 6c00 7465 7374 2d75  l.test-u
00000070: 7365 7262 4069 7061  serb@ipa

00000078: 6632 362e 6465 7665  f26.deve
00000080: 6c00 7465 7374 2d75  l.test-u
00000088: 7365 7263 4069 7061  serc@ipa
00000090: 6632 362e 6465 7665  f26.deve
00000098: 6c00 7465 7374 2d75  l.test-u

000000a0: 7365 7264 4069 7061  serd@ipa
000000a8: 6632 362e 6465 7665  f26.deve
000000b0: 6c00 7465 7374 2d75  l.test-u
000000b8: 7365 7265 4069 7061  sere@ipa
000000c0: 6632 362e 6465 7665  f26.deve

000000c8: 6c00 7465 7374 2d75  l.test-u
000000d0: 7365 7266 4069 7061  serf@ipa
000000d8: 6632 362e 6465 7665  f26.deve
000000e0: 6c00 7465 7374 2d75  l.test-u
000000e8: 7365 7267 4069 7061  serg@ipa

000000f0: 6632 362e 6465 7665  f26.deve
000000f8: 6c00 7465 7374 2d75  l.test-u
00000100: 7365 7268 4069 7061  serh@ipa
00000108: 6632 362e 6465 7665  f26.deve
00000110: 6c00 ffff ffff ffff  l.......

The full record is 0x112(274) bytes long and occupies 7 slots. struct sss_mc_grp_data starts at the second slot, the name of the group can be found 0x10(16) bytes later, the GID is 0x2bb9d92a(733600042) and the group has 8 members. The strs blob is 0xda(218) bytes long. Following the group name and the group password (‘*’), the names of the 8 group members test-usera@ipaf26.devel, …, test-userh@ipaf26.devel can be found.

The Initgr Data

Finally the initgr data is defined in src/util/mmap_cache.h as:

struct sss_mc_initgr_data {
    rel_ptr_t unique_name;  /* ptr to unique name string, rel. to struct base addr */
    rel_ptr_t name;         /* ptr to raw name string, rel. to struct base addr */
    rel_ptr_t strs;         /* ptr to concatenation of all strings */
    uint32_t strs_len;      /* length of strs */
    uint32_t data_len;      /* all initgroups data len */
    uint32_t num_groups;    /* number of groups */
    uint32_t gids[0];       /* array of all groups
                             * string with name and unique_name is stored
                             * after gids */
};

Here we can see some differences to the previous two structs: there are two names and two different kinds of data areas. First we will have a look at the data areas. getgrouplist and similar calls return a list of GIDs of groups the user is a member of. So the first part of the data blob, starting at gids, is an array of uint32_t of size num_groups containing the GIDs of the groups the user is a member of. After the GID list the two names can be found as 0-terminated strings, first the string unique_name is pointing to and then the string for name. The length of both strings including the terminating 0 bytes is stored in strs_len, and data_len stores the length of all data, GIDs and strings, so it is data_len = strs_len + num_groups * sizeof(uint32_t).
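A sketch of this layout in code, including the consistency relation between data_len, strs_len and num_groups (the helper name is illustrative):

#include <stdint.h>

static int inspect_initgr(const struct sss_mc_initgr_data *ig)
{
    const uint32_t *gids = ig->gids;                  /* num_groups GIDs       */
    const char *base = (const char *)ig;
    const char *unique_name = base + ig->unique_name; /* canonical name        */
    const char *name = base + ig->name;               /* name used for lookup  */

    (void)gids; (void)unique_name; (void)name;

    /* data_len covers the GID array plus both strings */
    return ig->data_len == ig->strs_len + ig->num_groups * sizeof(uint32_t);
}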

The second name attribute was added to mitigate a general issue the memory mapped cache currently has with lookups by name for initgr requests. A user can only have a single UID and a group can only have a single GID. A different UID would by definition automatically mean a different user from the point of view of the Linux kernel. User and group names are basically labels for the UID or GID, respectively, and multiple names can be assigned to a single UID or GID. Traditionally there are struct passwd and struct group to map the UID or GID to a single name, and the getpwnam, getpwuid, getgrnam and getgrgid calls are used to find the ID for a name and vice versa. But there is no restriction that the names used as first argument to getpwnam or getgrnam have to be the same as the ones returned as pw_name or gr_name in the related structs. pw_name and gr_name returned in struct passwd and struct group respectively can be considered the canonical names. The names used as first argument for getpwnam and getgrnam, as long as they differ from the canonical name, can be considered alias names.

Coming back to the memory mapped cache: the user and group data in the memory cache only contain a single name, the canonical name. This means a user or group entry can only be found in the memory mapped cache if the canonical name is used to look up the entry. This is in general not an issue with many Unix/Linux based use cases. But if e.g. the users are managed in Active Directory, there might be different expectations about the name format, see e.g. MSDN: User Name Formats <https://msdn.microsoft.com/de-de/library/windows/desktop/aa380525(v=vs.85).aspx>. Besides the short logon name, the fully-qualified user principal name (UPN) or the old NT style down-level logon name can be used. What makes it even worse is that names in AD are treated case-insensitively.

To allow lookups with different input names (aliases), one might be tempted to just replace the canonical name in the memory cache record with the input name. But this would fail if e.g. the related user is deleted on the server and has to be deleted in the memory mapped cache as well: now all entries would have to be checked for whether they are somehow related to the deleted entry. To avoid this, the canonical name is added as well and its hash is written to the otherwise unused hash2 element of struct sss_mc_rec. Now a client can use a different alias name to look up an object, and after the first lookup a record with this name will be added to the cache. If there are changes to the object on the server, all instances can be found with the canonical name and handled accordingly.

../_images/memory_cache_initgr.svg

Finally, here is an example of how an initgr memory mapped cache record looks:

00000000: 0000 00f0 7300 0000  ....s...
00000008: fff6 275a 0000 0000  ..'Z....
00000010: ffff ffff ffff ffff  ........
00000018: be4d 0100 049f 0000  .M......
00000020: ffff ffff 0000 00f0  ........

00000028: 2800 0000 4000 0000  (...@...
00000030: 2800 0000 2300 0000  (...#...
00000038: 3300 0000 0400 0000  3.......
00000040: 2ad9 b92b 33d9 b92b  *..+3..+
00000048: 34d9 b92b 35d9 b92b  4..+5..+

00000050: 7465 7374 2d75 7365  test-use
00000058: 7261 4069 7061 6632  ra@ipaf2
00000060: 362e 6465 7665 6c00  6.devel.
00000068: 7465 7374 2d75 7365  test-use
00000070: 7261 00ff ffff ffff  ra......

As usual the first slot contains struct sss_mc_rec. As can be seen there are two different hash values, 0x14dbe and 0x9f04, indicating that name and unique_name are different. struct sss_mc_initgr_data starts with the second slot. The first two relative pointers give the start of unique_name and name, respectively. Given that struct sss_mc_initgr_data starts at 0x28(40), unique_name starts 0x28(40) bytes later at 0x50(80). Similarly, name starts 0x40(64) bytes after the start of the initgr data, at 0x68(104). Since the two names are the only strings used here, strs is 0x28(40) as well and the length strs_len of both strings is 0x23(35) bytes. Together with the GIDs the total data_len is 0x33(51) bytes. This is 16 bytes more than the strings alone, which is in agreement with the number of GIDs, num_groups, being 0x4(4). The GIDs of the groups are 0x2bb9d92a(733600042), 0x2bb9d933(733600051), 0x2bb9d934(733600052) and 0x2bb9d935(733600053).

How does a Client look up an Entry?

Depending on what data should be looked up the client has to open the related cache file and map it into its own memory. To learn about the structure of the cache the header must be read.

The POSIX calls getpwnam, getpwuid, getgrnam, getgrgid and getgrouplist either use a name or a POSIX ID as input. To find a matching entry in the memory mapped cache, the hash value must be calculated first. If a name is used as input, the hash is calculated from the name string including the trailing 0-byte. If the lookup is by ID, the numerical ID is converted into a decimal string which is used, including the trailing 0-byte, to calculate the hash. The seed for the hash function can be found in the header. In both cases the modulus of the 32-bit hash value and the size of the hash table array is calculated.
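A sketch of the index calculation; mc_hash() below is a placeholder (a seeded FNV-1a), since the hash function actually used by SSSD is not described here, and only the way the key, the seed and the table size are combined follows the text above.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Placeholder hash, not the one used by SSSD. */
static uint32_t mc_hash(const uint8_t *key, size_t len, uint32_t seed)
{
    uint32_t h = 2166136261u ^ seed;

    for (size_t i = 0; i < len; i++) {
        h ^= key[i];
        h *= 16777619u;
    }
    return h;
}

static uint32_t index_by_name(const char *name, uint32_t seed, uint32_t ht_entries)
{
    /* hash over the name including the trailing 0-byte */
    return mc_hash((const uint8_t *)name, strlen(name) + 1, seed) % ht_entries;
}

static uint32_t index_by_id(uint32_t id, uint32_t seed, uint32_t ht_entries)
{
    char str[12];

    /* the numerical ID is converted to a decimal string first */
    snprintf(str, sizeof(str), "%u", id);
    return mc_hash((const uint8_t *)str, strlen(str) + 1, seed) % ht_entries;
}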

This value is now taken to lookup an entry in the hash table array. If it is MC_INVALID_VAL there is no matching entry in the cache and the request must be forwarded to SSSD’s nss responder.

If the hash table entry contains another value, it is assumed to be the starting slot number of the related entry in the data table. After the entry is read, first the hashes (hash1 for name based and hash2 for ID based lookups) are compared with the input hash. If they match, the input value (name or ID) itself is compared with the related data from the entry. If it matches as well, the data from the entry is returned to the caller in the expected format. If one of the comparisons fails, the next entry (if any) with the same hash value is looked up by reading the slot number stored in next1 or next2, depending on whether the input hash matches hash1 or hash2 respectively.
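A sketch of this lookup walk for a name based lookup in the passwd cache; the helper name and constants are illustrative, and the input hash is assumed to be the value already reduced modulo the hash table size.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MC_INVALID_VAL 0xffffffff   /* assumed, matching the 0xff fill */
#define SLOT_SIZE      40

static struct sss_mc_rec *lookup_pwd_by_name(uint8_t *data_table,
                                             const uint32_t *hash_table,
                                             uint32_t hash, const char *name)
{
    uint32_t slot = hash_table[hash];

    while (slot != MC_INVALID_VAL) {
        struct sss_mc_rec *rec =
            (struct sss_mc_rec *)(data_table + (size_t)slot * SLOT_SIZE);
        struct sss_mc_pwd_data *pwd = (struct sss_mc_pwd_data *)rec->data;

        if (rec->hash1 == hash &&
            strcmp((const char *)pwd + pwd->name, name) == 0) {
            return rec;             /* hash and name both match */
        }
        /* hash collision: follow the chain entry belonging to this hash */
        slot = (rec->hash1 == hash) ? rec->next1 : rec->next2;
    }
    return NULL;                    /* not cached: ask SSSD's nss responder */
}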

If no matching entry was found the request must be forwarded to SSSD’s nss responder.

How does the NSS Responder store an Entry?

First the NSS responder calculates the hash in the same way as the client and checks if the entry already exists, if needed by following the chain of next elements. If the entry already exists in the memory cache and occupies the same number of slots as needed for the new data, the old entry is just overwritten with the new data.

When filling the memory cache, the NSS responder keeps track of the next slot following the last inserted entry. As long as not all slots are used and the number of remaining slots is larger than the number of slots needed for the new entry, the next free slot and the following ones are used to store the entry, and the next slot is remembered again.

If the cache is already full, the free table is used to search for the needed number of consecutive free slots. If none are found, the entry the remembered next slot is pointing to is invalidated, and if needed the following entries as well, to make room for the new entry.
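The search over the free table can be sketched as follows, reusing the bit convention described in "The Free Table" (helper name and the "not found" marker are illustrative):

#include <stdint.h>

#define MC_INVALID_SLOT 0xffffffff      /* illustrative "not found" marker */

/* Look for 'needed' consecutive free slots among 'n_slots' slots; returns
 * the first slot of such a run or MC_INVALID_SLOT. */
static uint32_t find_free_slots(const uint8_t *free_table, uint32_t n_slots,
                                uint32_t needed)
{
    uint32_t run = 0;

    for (uint32_t slot = 0; slot < n_slots; slot++) {
        int used = (free_table[slot / 8] & (0x80 >> (slot % 8))) != 0;

        run = used ? 0 : run + 1;
        if (run == needed) {
            return slot - needed + 1;   /* start of the free run */
        }
    }
    return MC_INVALID_SLOT;
}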

This means that this simple scheme becomes inefficient if the cache is full and more and more new entries have to be added to the cache. In the worst case the full cache is searched for empty slots every time before a new entry is added by overwriting an existing entry. Additionally, the lifetime of the cached entries is not taken into account when overwriting existing entries.

After the new slots are found, the entry is written to the memory mapped cache protected by memory barriers. Finally, the starting slot number is either written to the hash table or to the corresponding next element at the end of the chain of entries with the same hash value.