#1806 sssd_be goes to 99% CPU and causes significant login delays when client is under load
Closed: Fixed. Opened 11 years ago by jraquino.

I have a system with a reproducible problem with sssd when under load.

The sssd.log shows recurring messages stating: A service PING timed out on [domain.com]. Attempt [0]

Followed by: Killing service [expertcity.com], not responding to pings!

Following a restart of sssd, the sssd_be process spikes at 99% CPU, and a delay of 30-60 seconds can be experienced when SSHing to the machine. Subsequent logins seem fine until whichever cache is affected needs to be refreshed again, which in turn reproduces the long delay.

The system is a VM with 2 cores assigned. A load average anywhere from 4 to 12 is enough to reproduce the issue.


JR, do you happen to have some kind of debug logs from the sssd_be process? We suspect that the load might be due to the memberof plugin processing group memberships while saving nested group structures, but we need the logs to be completely sure.

Thanks!

By the way, increasing the internal "timeout" might help you avoid the monitor process killing the back-end process. The "timeout" parameter is undocumented, but it specifies the interval (in seconds) between heartbeat pings between the monitor and sssd_be. By default it is set to 10; you might want to increase it to, say, 30.
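
For reference, a minimal sketch of what that could look like in sssd.conf, assuming the domain name from the log excerpt above:

    [domain/domain.com]
    # undocumented heartbeat interval between the monitor and sssd_be,
    # in seconds (default is 10)
    timeout = 30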

Hi, did you have a chance to test the increased timeout or gather the logs?

Putting to 1.9.5 for investigation. We will clone as appropriate when/if we know the scope of the problem.

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.9.5

Today I can reproduce this problem on an 8-core physical host. I will attach the logs, but I think it may be necessary for me to host a GoToMeeting session and just give you keyboard and mouse control to perform GDB debugging...

CPU locks at 100% on sssd_be for at least 25 seconds during login, any time after an sssd restart.

Worked with sgallagh today:

Turned debugging up to 9 and found what appears to be a 20+ second ltdb operation:

(Wed Mar 6 18:45:09 2013) [sssd[be[expertcity.com]]] [ldb] (0x4000): tevent: Added timed event "ltdb_callback": 0x7d52da0
(Wed Mar 6 18:45:09 2013) [sssd[be[expertcity.com]]] [ldb] (0x4000): tevent: Added timed event "ltdb_timeout": 0x7d52ec0
(Wed Mar 6 18:45:09 2013) [sssd[be[expertcity.com]]] [ldb] (0x4000): tevent: Destroying timer event 0x7cfbc50 "ltdb_timeout"
(Wed Mar 6 18:45:09 2013) [sssd[be[expertcity.com]]] [ldb] (0x4000): tevent: Ending timer event 0x7cfbb30 "ltdb_callback"
(Wed Mar 6 18:45:27 2013) [sssd[be[expertcity.com]]] [ldb] (0x4000): tevent: Destroying timer event 0x7d52ec0 "ltdb_timeout"
(Wed Mar 6 18:45:27 2013) [sssd[be[expertcity.com]]] [ldb] (0x4000): tevent: Ending timer event 0x7d52da0 "ltdb_callback"

  • Temporarily moved /var/lib/sss/db/cache_expertcity.com.ldb to /dev/shm/ and symlinked it back (see the commands sketched below).
    The problem persisted, which seems to rule out a disk I/O bottleneck.
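
For the record, the tmpfs test amounted to roughly the following (reconstructed commands, not a transcript of the actual session):

    service sssd stop
    mv /var/lib/sss/db/cache_expertcity.com.ldb /dev/shm/
    ln -s /dev/shm/cache_expertcity.com.ldb /var/lib/sss/db/cache_expertcity.com.ldb
    service sssd start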

Signs seem to point to a memberof operation

  • 11:53 < jhrozek:#sssd> I will run a local experiment with something like callgrind maybe it would be able to detect a tight loop

Via IRC, Simo suggested: "only cachegrind comes to mind as an alternative"
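
For completeness, one way such a local callgrind experiment could be set up (a sketch, not the exact commands used; paths assume a standard install):

    # run sssd in the foreground with verbose debugging and follow the
    # forked children so that sssd_be itself gets profiled
    valgrind --tool=callgrind --trace-children=yes /usr/sbin/sssd -i -d 9
    # afterwards, summarize the per-process output files
    callgrind_annotate callgrind.out.<pid>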

Hi,

there is a 20-second gap between the time when the LDAP lookup of HBAC rules finishes and when rule processing starts. I expect that the rules are being stored in the cache during this time.

(Wed Mar  6 17:51:38 2013) [sssd[be[expertcity.com]]] [sdap_get_generic_ext_done] (0x0400): Search result: Success(0), no errmsg set
(Wed Mar  6 17:51:58 2013) [sssd[be[expertcity.com]]] [hbac_attrs_to_rule] (0x1000): Processing rule [dev-general]

How many HBAC rules do you have? How large are they?

Can you test whether setting access_provider = permit, or setting ipa_hbac_refresh to some greater value (maybe 30), helps?
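
In other words, something like the following in the domain section of sssd.conf, as two separate experiments (the domain name is taken from the logs above):

    [domain/expertcity.com]
    # experiment 1: skip HBAC evaluation entirely
    access_provider = permit

    # experiment 2: keep access_provider = ipa, but refresh the HBAC
    # rules less frequently (value is in seconds)
    #ipa_hbac_refresh = 30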

Do you still have the log from comment #7? Can you attach it, please?

I have 62 HBAC rules. I am not sure what you would qualify as 'large'.

I have hosts in development that are members of multiple HBAC rules...

access_provider = ipa
ipa_hbac_refresh = 1800

I've privately emailed you the cachegrind and callgrind logs.

review: => 0

Fields changed

milestone: SSSD 1.9.5 => NEEDS_TRIAGE

Is there a status update on this ticket? I've not heard anything in quite some time.

Andreas (who maintains Samba) was discussing the issue with other Samba developers, as the problem is most likely not in SSSD but in the underlying Samba libraries. They suggested gathering performance data for the cache libraries using perf, but that's not available on RHEL5, I'm afraid.

What's available on RHEL5 is Google's perftools; I will create a build instrumented with perftools for you and get back to you.
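
Roughly speaking, assuming the instrumented build is linked against gperftools' libprofiler, such a run would be driven along these lines (the binary path and the pprof tool name are assumptions and may differ on RHEL5):

    # point CPUPROFILE at a writable file and run sssd in the foreground
    CPUPROFILE=/tmp/sssd_be.prof /usr/sbin/sssd -i -d 9
    # reproduce the slow login, stop sssd, then summarize the profile
    pprof --text /usr/libexec/sssd/sssd_be /tmp/sssd_be.prof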

Fields changed

changelog: =>
owner: somebody => jhrozek
patch: 0 => 1
priority: major => critical
status: new => assigned

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.10.0

resolution: => fixed
status: assigned => closed

Fields changed

changelog: => N/A, just a bugfix

Metadata Update from @jraquino:
- Issue assigned to jhrozek
- Issue set to the milestone: SSSD 1.10.0

7 years ago

SSSD is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in SSSD's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/2848

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.
