If the AD sssd_be gets a legitimate fork() failure, such as ENOMEM / 'Cannot allocate memory', it enters a state where it leaks file descriptors, soon runs out, and then can't make any new connections. Restarting sssd fixes the problem.
In some cases the ENOMEM is legitimate, e.g. Java has used every scrap of RAM on the box. In other cases my monitoring does not pinpoint the cause of the fork failure.
This is sssd-1.12.4-47.el6_7.8.x86_64, EL 6.7, but running on an EL 6.6 server (I fast-tracked sssd to see if it made any difference). I haven't tried any 1.13 builds.
I will attach a log file of sssd_be at debug level 7 (hopefully that's enough). The log contains two instances of the problem starting at:
After these points, if you look at the sdap_ldap_connect_callback_add log lines, you will see the FD number increasing.
Unfortunately I don't have any lsof output to match the log, but from memory the FD leaks were for some sort of pipe (they did not appear to be TCP connections to LDAP).
What other debugging information can I gather for you to diagnose this problem?
Attachment: sssd_be.log.gz
I had to sanitise that log file quite heavily, let me know if it is excluding something you need to see.
I have another instance this morning of a server that was unable to fork any child processes for a few hours:
(Fri May 6 03:08:09 2016) [sssd[be[example.com]]] [sdap_fork_child] (0x0020): fork failed [12][Cannot allocate memory].
(Fri May 6 03:08:09 2016) [sssd[be[example.com]]] [sdap_kinit_done] (0x0020): child failed (12 [Cannot allocate memory])
(Fri May 6 03:08:09 2016) [sssd[be[example.com]]] [sdap_cli_kinit_done] (0x0400): Cannot get a TGT: ret [12](Cannot allocate memory)
The server recovered after a few hours and SSSD is working, but the sssd_be process is still holding open a large number of pipes, about 200 more FDs than it usually would have:
sssd_be 12159 root 198w FIFO 0,8 0t0 349711344 pipe
sssd_be 12159 root 199r FIFO 0,8 0t0 349711345 pipe
sssd_be 12159 root 200w FIFO 0,8 0t0 349711345 pipe
sssd_be 12159 root 201r FIFO 0,8 0t0 349711347 pipe
sssd_be 12159 root 202w FIFO 0,8 0t0 349711347 pipe
sssd_be 12159 root 203r FIFO 0,8 0t0 349711348 pipe
sssd_be 12159 root 204w FIFO 0,8 0t0 349711348 pipe
sssd_be 12159 root 205r FIFO 0,8 0t0 350446786 pipe
sssd_be 12159 root 206u IPv4 351864269 0t0 TCP 10.1.1.1:35448->dc:389 (ESTABLISHED)
sssd_be 12159 root 207w FIFO 0,8 0t0 350446786 pipe
sssd_be 12159 root 208r FIFO 0,8 0t0 350446787 pipe
sssd_be 12159 root 209w FIFO 0,8 0t0 350446787 pipe
sssd_be 12159 root 210w FIFO 0,8 0t0 350446789 pipe
sssd_be 12159 root 211r FIFO 0,8 0t0 350446790 pipe
sssd_be 12159 root 212w FIFO 0,8 0t0 350446790 pipe
sssd_be 12159 root 213u IPv4 351865156 0t0 TCP 10.1.1.1:42603->dc:3268 (ESTABLISHED)
sssd_be 12159 root 214w FIFO 0,8 0t0 350446794 pipe
sssd_be 12159 root 215r FIFO 0,8 0t0 350446795 pipe
sssd_be 12159 root 216w FIFO 0,8 0t0 350446795 pipe
sssd_be 12159 root 217u IPv4 351884545 0t0 TCP 10.1.1.1:35578->dc:389 (ESTABLISHED)
Could they be the stdin/stdout file handles of the ldap_child process that failed to fork?
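To illustrate that suspicion, here is a minimal sketch (my own hypothetical code, not SSSD's actual source) of the usual pattern: a pair of pipes is created for the child's stdin/stdout before fork(), and if the error path returns without closing them, every failed fork leaks four descriptors:

    /* Hypothetical sketch of the suspected leak pattern; not SSSD source. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int spawn_helper_child(void)
    {
        int stdin_pipe[2], stdout_pipe[2];
        pid_t pid;

        if (pipe(stdin_pipe) == -1 || pipe(stdout_pipe) == -1) {
            return errno;
        }

        pid = fork();
        if (pid == -1) {
            int ret = errno;
            /* BUG: returning here leaks all four pipe descriptors. */
            fprintf(stderr, "fork failed [%d][%s]\n", ret, strerror(ret));
            return ret;
        }

        /* ... child exec / parent bookkeeping elided ... */
        return 0;
    }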
Managed to catch an strace of a failing process, this is the chunk of the trace before the clone() failure:
write(11, "(Fri May 6 08:23:13 2016) [sssd"..., 155) = 155 umask(077) = 0177 open("/var/lib/sss/pubconf/.krb5info_dummy_KbctFU", O_RDWR|O_CREAT|O_EXCL, 0600) = 247 umask(0177) = 077 write(247, "10.1.1.1:88", 16) = 16 fchmod(247, 0644) = 0 close(247) = 0 rename("/var/lib/sss/pubconf/.krb5info_dummy_KbctFU", "/var/lib/sss/pubconf/kdcinfo.EXAMPLE.COM") = 0 gettimeofday({1462522993, 76737}, NULL) = 0 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0 write(11, "(Fri May 6 08:23:13 2016) [sssd"..., 124) = 124 gettimeofday({1462522993, 76994}, NULL) = 0 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0 write(11, "(Fri May 6 08:23:13 2016) [sssd"..., 104) = 104 geteuid() = 0 getegid() = 0 pipe([247, 248]) = 0 pipe([249, 250]) = 0 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fc77a3589d0) = -1 ENOMEM (Cannot allocate memory)
There's no close() call on FDs 247 to 250 in the trace, and those FDs are still held open now.
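Assuming that is the pattern, I'd expect the fix to close both pipes on the fork() error path before returning. A hypothetical corrected version of the earlier sketch (the actual SSSD patch may do this differently):

    /* Hypothetical fix sketch: close every pipe fd when fork() fails, so a
     * transient ENOMEM cannot leak descriptors. Not the actual SSSD patch. */
    #include <errno.h>
    #include <sys/types.h>
    #include <unistd.h>

    static int spawn_helper_child_fixed(void)
    {
        int stdin_pipe[2], stdout_pipe[2];
        pid_t pid;
        int ret;

        if (pipe(stdin_pipe) == -1) {
            return errno;
        }
        if (pipe(stdout_pipe) == -1) {
            ret = errno;
            close(stdin_pipe[0]);
            close(stdin_pipe[1]);
            return ret;
        }

        pid = fork();
        if (pid == -1) {
            ret = errno;
            close(stdin_pipe[0]);
            close(stdin_pipe[1]);
            close(stdout_pipe[0]);
            close(stdout_pipe[1]);
            return ret;
        }

        /* ... child exec / parent bookkeeping elided ... */
        return 0;
    }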
I've found a few machines that are good candidates for reproducing the memory pressure. I've installed the latest CentOS 6.8 preview RPMs from here:
https://copr-be.cloud.fedoraproject.org/results/jhrozek/SSSD-6.8-preview/epel-6-x86_64/00174698-sssd/
After a short time an strace of the running sssd_be process catches this fork failure:
write(11, "(Fri May 6 11:22:48 2016) [sssd[be[example.com]]] [be_ptask_execute] (0x0400): Task [AD machine account password renewal]: executing task, timeout 60 seconds\n", 157) = 157 pipe([22, 26]) = 0 pipe([27, 28]) = 0 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f63c529e9d0) = -1 ENOMEM (Cannot allocate memory) gettimeofday({1462533768, 167433}, NULL) = 0 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=118, ...}) = 0 write(11, "(Fri May 6 11:22:48 2016) [sssd[be[example.com]]] [ad_machine_account_password_renewal_send] (0x0020): fork failed [12][Cannot allocate memory].\n", 144) = 144
and if I look at the open FDs, 22, 26, 27 and 28 are hanging open:
...
sssd_be 24306 root 20r REG 253,0 10406312 267935 /var/lib/sss/mc/initgroups (deleted)
sssd_be 24306 root 21u unix 0xffff880415a2c700 0t0 101198958 /var/lib/sss/pipes/private/sbus-dp_example.com.24306
sssd_be 24306 root 22r FIFO 0,8 0t0 101253974 pipe
sssd_be 24306 root 24u IPv4 101264903 0t0 TCP 10.1.1.1:45317->dc03:389 (ESTABLISHED)
sssd_be 24306 root 25u IPv4 101267713 0t0 TCP 10.1.1.1:53785->dc03:3268 (ESTABLISHED)
sssd_be 24306 root 26w FIFO 0,8 0t0 101253974 pipe
sssd_be 24306 root 27r FIFO 0,8 0t0 101253975 pipe
sssd_be 24306 root 28w FIFO 0,8 0t0 101253975 pipe
sssd_be 24306 root 29u IPv4 101267715 0t0 TCP 10.1.1.1:45389->dc03:389 (ESTABLISHED)
It fits the same pattern as before, so it may not be fixed in 1.13.2 either. I will continue to collect information over the weekend.
I'm sorry about the lack of response in this ticket. This seems like a legitimate bug we should fix. We've been busy working on meeting deadlines for the 1.14 release, though.
I'll leave the ticket in the NEEDS_TRIAGE milestone for one more week and then find an assignee to work on it.
Thank you for the report, I see the bug now. If you let me know the version you are running now, I can build you a test package.
Fields changed
owner: somebody => jhrozek
status: new => assigned
Patch on list
patch: 0 => 1
Linked to Bugzilla bug: https://bugzilla.redhat.com/show_bug.cgi?id=1340176 (Red Hat Enterprise Linux 6)
rhbz: => [https://bugzilla.redhat.com/show_bug.cgi?id=1340176 1340176]
milestone: NEEDS_TRIAGE => SSSD 1.13.5
resolution: => fixed
status: assigned => closed
Thanks for the patch. Will 1.13.5 make its way into EL 6.8, or has it missed the cut-off? Any chance I could get a bundle of RPMs to test with, like in https://copr-be.cloud.fedoraproject.org/results/jhrozek/SSSD-6.8-preview/epel-6-x86_64/?
Not 1.13.5, but I nominated this patch for the next 6.8 update. I can either build you a test package based on the existing 6.8 update (I need to do it anyway for RHEL) or you can wait until the fix is released in RHEL itself. I would appreciate testing if you have some time, though.
My email reply got lost...
Yes, quite happy to test a patched build on top of a 6.8 RPM before it gets to 6.8 Update.
Replying to [comment:14 lukebigum]:
> My email reply got lost... Yes, quite happy to test a patched build on top of a 6.8 RPM before it gets to 6.8 Update.
Sorry about the delay, here are the test builds: https://jhrozek.fedorapeople.org/sssd-test-builds/sssd-6.8-fd-leak/
Thanks, I've updated 150 of the 450 servers that are affected, which should be a good cross-section of hosts to get feedback from. Over the weekend the open file count had started to creep up, so in a few more days I hope I'll be able to tell whether 1.13.3-22.el6_8.3 makes a difference.
Some open fds are to be expected (there might be a connection open per domain to each domain's DC), but the count should stabilize and not keep growing. Please open a new bug if you still see leaks.
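If you want to track the count over time, here is a small stand-alone helper (just an illustration, not part of SSSD; it assumes a Linux /proc filesystem) that prints the number of open descriptors for a given pid:

    /* Illustrative helper, not part of SSSD: count open fds of a process
     * by listing /proc/<pid>/fd. Assumes a Linux /proc filesystem. */
    #include <dirent.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        char path[64];
        DIR *dir;
        struct dirent *ent;
        int count = 0;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }

        snprintf(path, sizeof(path), "/proc/%s/fd", argv[1]);
        dir = opendir(path);
        if (dir == NULL) {
            perror(path);
            return 1;
        }

        while ((ent = readdir(dir)) != NULL) {
            if (ent->d_name[0] != '.') {   /* skip "." and ".." */
                count++;
            }
        }
        closedir(dir);

        printf("%d\n", count);
        return 0;
    }

Run it against the sssd_be PID every few minutes; the number should level off rather than climb after each ENOMEM burst.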
Metadata Update from @lukebigum:
- Issue assigned to jhrozek
- Issue set to the milestone: SSSD 1.13.5
SSSD is moving from Pagure to GitHub. This means that new issues and pull requests will be accepted only in SSSD's GitHub repository.
This issue has been cloned to GitHub and is available here:
- https://github.com/SSSD/sssd/issues/4047
If you want to receive further updates on the issue, please navigate to the GitHub issue and click on the Subscribe button.
Thank you for understanding. We apologize for any inconvenience.