#3006 sssd file descriptor leak after legitimate fork failures
Closed: Fixed. Opened 7 years ago by lukebigum.

If the AD sssd_be gets a legitimate fork() failure, such as ENOMEM / 'Cannot allocate memory', it enters a state where it leaks file descriptors, soon runs out, and then can't make any new connections. Restarting sssd fixes the problem.

In some cases the ENOMEM is legitimate, e.g. Java has used every scrap of RAM on the box. In other cases my monitoring cannot pinpoint the cause of the fork failure.

This is sssd-1.12.4-47.el6_7.8.x86_64 (EL 6.7), but running on an EL 6.6 server (I fast-tracked sssd to see if it made any difference). I haven't tried any 1.13 builds.

I will attach a log file of sssd_be at debug level 7 (hopefully that's enough). The log contains two instances of the problem starting at:

  • Sun May 1 10:41:44
  • Thu May 5 10:36:57

After these points if you look at the sdap_ldap_connect_callback_add log lines you will see the FD number increasing.

Unfortunately I don't have any lsof output to match the log, but from memory the FD leaks were for some sort of pipe (they did not appear to be TCP connections to LDAP).

What other debugging can I provide to help diagnose this problem?


I had to sanitise that log file quite heavily, let me know if it is excluding something you need to see.

This morning I have another instance of a server that was unable to fork any child processes for a few hours:

(Fri May  6 03:08:09 2016) [sssd[be[example.com]]] [sdap_fork_child] (0x0020): fork failed [12][Cannot allocate memory].
(Fri May  6 03:08:09 2016) [sssd[be[example.com]]] [sdap_kinit_done] (0x0020): child failed (12 [Cannot allocate memory])
(Fri May  6 03:08:09 2016) [sssd[be[example.com]]] [sdap_cli_kinit_done] (0x0400): Cannot get a TGT: ret [12](Cannot allocate memory)

The server recovered after a few hours and SSSD is working, but the sssd_be process is still holding open a large number of pipes, about 200 more FDs than it usually would have:

sssd_be 12159 root  198w  FIFO                0,8      0t0 349711344 pipe
sssd_be 12159 root  199r  FIFO                0,8      0t0 349711345 pipe
sssd_be 12159 root  200w  FIFO                0,8      0t0 349711345 pipe
sssd_be 12159 root  201r  FIFO                0,8      0t0 349711347 pipe
sssd_be 12159 root  202w  FIFO                0,8      0t0 349711347 pipe
sssd_be 12159 root  203r  FIFO                0,8      0t0 349711348 pipe
sssd_be 12159 root  204w  FIFO                0,8      0t0 349711348 pipe
sssd_be 12159 root  205r  FIFO                0,8      0t0 350446786 pipe
sssd_be 12159 root  206u  IPv4          351864269      0t0       TCP 10.1.1.1:35448->dc:389 (ESTABLISHED)
sssd_be 12159 root  207w  FIFO                0,8      0t0 350446786 pipe
sssd_be 12159 root  208r  FIFO                0,8      0t0 350446787 pipe
sssd_be 12159 root  209w  FIFO                0,8      0t0 350446787 pipe
sssd_be 12159 root  210w  FIFO                0,8      0t0 350446789 pipe
sssd_be 12159 root  211r  FIFO                0,8      0t0 350446790 pipe
sssd_be 12159 root  212w  FIFO                0,8      0t0 350446790 pipe
sssd_be 12159 root  213u  IPv4          351865156      0t0       TCP 10.1.1.1:42603->dc:3268 (ESTABLISHED)
sssd_be 12159 root  214w  FIFO                0,8      0t0 350446794 pipe
sssd_be 12159 root  215r  FIFO                0,8      0t0 350446795 pipe
sssd_be 12159 root  216w  FIFO                0,8      0t0 350446795 pipe
sssd_be 12159 root  217u  IPv4          351884545      0t0       TCP 10.1.1.1:35578->dc:389 (ESTABLISHED)
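For tracking this over time, the descriptor count can be polled from /proc instead of diffing full lsof output. A minimal sketch (assumes Linux /proc and that `sssd_be` is running for the loop to print anything; run as root to read another user's fd directory):

```shell
# count_fds PID: print the number of entries in /proc/PID/fd,
# i.e. that process's count of open file descriptors.
count_fds() {
    ls "/proc/$1/fd" | wc -l
}

# Poll every sssd_be process (if any) and print "PID FD-count".
# A count that climbs steadily across polls indicates the leak.
for pid in $(pgrep -x sssd_be); do
    echo "$pid $(count_fds "$pid")"
done
```

Running this from cron every few minutes gives a simple time series of the leak without keeping lsof snapshots around.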

Could they be the stdin/stdout file handles of the ldap_child process that failed to fork?

Managed to catch an strace of a failing process, this is the chunk of the trace before the clone() failure:

write(11, "(Fri May  6 08:23:13 2016) [sssd"..., 155) = 155
umask(077)                              = 0177
open("/var/lib/sss/pubconf/.krb5info_dummy_KbctFU", O_RDWR|O_CREAT|O_EXCL, 0600) = 247
umask(0177)                             = 077
write(247, "10.1.1.1:88", 16)      = 16
fchmod(247, 0644)                       = 0
close(247)                              = 0
rename("/var/lib/sss/pubconf/.krb5info_dummy_KbctFU", "/var/lib/sss/pubconf/kdcinfo.EXAMPLE.COM") = 0
gettimeofday({1462522993, 76737}, NULL) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(11, "(Fri May  6 08:23:13 2016) [sssd"..., 124) = 124
gettimeofday({1462522993, 76994}, NULL) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(11, "(Fri May  6 08:23:13 2016) [sssd"..., 104) = 104
geteuid()                               = 0
getegid()                               = 0
pipe([247, 248])                        = 0
pipe([249, 250])                        = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fc77a3589d0) = -1 ENOMEM (Cannot allocate memory)

There's no close() call on FDs 247 to 250 in the trace, and those FDs are still held open now.
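For reference, the pattern visible in the strace can be reproduced with a minimal sketch. This is an illustration of the error path, not the actual `sdap_fork_child()` code: two pipes are opened for the child's stdin/stdout, and if `fork()` fails, the error path must close all four ends, or exactly four descriptors leak per failed attempt, as seen above (the `fail_fork` flag here is a test hook to simulate the ENOMEM failure):

```c
#include <dirent.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Count this process's open file descriptors by listing /proc/self/fd.
 * The result includes ".", ".." and the DIR's own fd, but that offset
 * is constant, so before/after comparisons still detect leaks. */
static int count_open_fds(void)
{
    int n = 0;
    DIR *d = opendir("/proc/self/fd");
    if (d == NULL) return -1;
    while (readdir(d) != NULL) n++;
    closedir(d);
    return n;
}

/* Sketch of the fixed error path: if fork() fails (simulated here via
 * fail_fork), close all four pipe ends before returning. The reported
 * bug is this same path without the close() calls. */
static int spawn_child(int fail_fork)
{
    int in_fds[2], out_fds[2];

    if (pipe(in_fds) != 0) return -1;
    if (pipe(out_fds) != 0) {
        close(in_fds[0]); close(in_fds[1]);
        return -1;
    }

    pid_t pid = fail_fork ? -1 : fork();
    if (pid == -1) {
        /* Without these four close() calls, every failed fork leaks
         * two pipes, matching the lsof output in this ticket. */
        close(in_fds[0]); close(in_fds[1]);
        close(out_fds[0]); close(out_fds[1]);
        return -1;
    }
    if (pid == 0) _exit(0);  /* child: exit immediately in this demo */

    /* Real code would keep one end of each pipe to talk to the child;
     * this demo just closes everything and reaps the child. */
    close(in_fds[0]); close(in_fds[1]);
    close(out_fds[0]); close(out_fds[1]);
    waitpid(pid, NULL, 0);
    return 0;
}
```

Comparing `count_open_fds()` before and after a simulated failure (`spawn_child(1)`) confirms the error path no longer leaks.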

I've found a few machines that are good candidates for memory pressure. I've installed the latest CentOS 6.8 preview RPMs from here:

https://copr-be.cloud.fedoraproject.org/results/jhrozek/SSSD-6.8-preview/epel-6-x86_64/00174698-sssd/

After a short time an strace of the running sssd_be process catches this fork failure:

write(11, "(Fri May  6 11:22:48 2016) [sssd[be[example.com]]] [be_ptask_execute] (0x0400): Task [AD machine account password renewal]: executing task, timeout 60 seconds\n", 157) = 157
pipe([22, 26])                          = 0
pipe([27, 28])                          = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f63c529e9d0) = -1 ENOMEM (Cannot allocate memory)
gettimeofday({1462533768, 167433}, NULL) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0
write(11, "(Fri May  6 11:22:48 2016) [sssd[be[example.com]]] [ad_machine_account_password_renewal_send] (0x0020): fork failed [12][Cannot allocate memory].\n", 144) = 144

and if I look at the open FDs, 22, 26, 27 and 28 are hanging open:

...
sssd_be 24306 root   20r   REG              253,0 10406312    267935 /var/lib/sss/mc/initgroups (deleted)
sssd_be 24306 root   21u  unix 0xffff880415a2c700      0t0 101198958 /var/lib/sss/pipes/private/sbus-dp_example.com.24306
sssd_be 24306 root   22r  FIFO                0,8      0t0 101253974 pipe
sssd_be 24306 root   24u  IPv4          101264903      0t0       TCP 10.1.1.1:45317->dc03:389 (ESTABLISHED)
sssd_be 24306 root   25u  IPv4          101267713      0t0       TCP 10.1.1.1:53785->dc03:3268 (ESTABLISHED)
sssd_be 24306 root   26w  FIFO                0,8      0t0 101253974 pipe
sssd_be 24306 root   27r  FIFO                0,8      0t0 101253975 pipe
sssd_be 24306 root   28w  FIFO                0,8      0t0 101253975 pipe
sssd_be 24306 root   29u  IPv4          101267715      0t0       TCP 10.1.1.1:45389->dc03:389 (ESTABLISHED)

It fits the same pattern as before, so it may not be fixed in these 1.13 builds either. I will continue to collect information over the weekend.

I'm sorry about the lack of response in this ticket. This seems like a legitimate bug we should fix; we've just been busy meeting deadlines for the 1.14 release.

I'll leave the ticket in the NEEDS_TRIAGE milestone for one more week and then find an assignee to work on it.

Thank you for the report, I see the bug now. If you let me know the version you are running now, I can build you a test package.

Fields changed

owner: somebody => jhrozek
status: new => assigned

Patch on list

patch: 0 => 1

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.13.5

resolution: => fixed
status: assigned => closed

Thanks for the patch. Will 1.13.5 make its way into EL 6.8, or has it missed the cut-off? Any chance I could get a bundle of RPMs to test with, like in https://copr-be.cloud.fedoraproject.org/results/jhrozek/SSSD-6.8-preview/epel-6-x86_64/?

Not 1.13.5, but I nominated this patch for the next 6.8 update. I can either build you a test package based on the existing 6.8 update (I need to do that for RHEL anyway), or you can wait until the fix is released in RHEL itself. I would appreciate testing if you have some time, though.

My email reply got lost...

Yes, quite happy to test a patched build on top of a 6.8 RPM before it gets to 6.8 Update.

Replying to [comment:14 lukebigum]:

> My email reply got lost...
>
> Yes, quite happy to test a patched build on top of a 6.8 RPM before it gets to 6.8 Update.

Sorry about the delay, here are the test builds:
https://jhrozek.fedorapeople.org/sssd-test-builds/sssd-6.8-fd-leak/

Thanks, I've updated 150 of the 450 servers that are affected, which should be a good cross-section of hosts to get feedback from. Over the weekend the open file count had started to creep up, so in a few more days I hope I'll be able to tell whether 1.13.3-22.el6_8.3 makes a difference.

Some open descriptors are to be expected (there might be a connection open per domain to each domain's DC), but the count should stabilize and not grow further. Please open a new bug if you still see leaks.

Metadata Update from @lukebigum:
- Issue assigned to jhrozek
- Issue set to the milestone: SSSD 1.13.5


SSSD is moving from Pagure to GitHub. This means that new issues and pull requests will be accepted only in SSSD's GitHub repository.

This issue has been cloned to GitHub and is available here:
- https://github.com/SSSD/sssd/issues/4047

If you want to receive further updates on the issue, please navigate to the GitHub issue and click the subscribe button.

Thank you for understanding. We apologize for any inconvenience.
