#1806 sssd_be goes to 99% CPU and causes significant login delays when client is under load
Closed: Fixed. Opened 11 years ago by jraquino.

I have a system with a reproducible problem with sssd when under load.

The sssd.log shows recurring messages stating: A service PING timed out on [domain.com]. Attempt [0]

Followed by: Killing service [expertcity.com], not responding to pings!

Following a restart of sssd, the sssd_be process spikes at 99% CPU, and a delay of 30-60 seconds can be experienced when SSHing to the machine. Subsequent logins seem fine until whichever cache is affected needs to be refreshed again, which in turn reproduces the long delay.

The system is a VM with 2 cores assigned. A load average anywhere from 4 to 12 is enough to reproduce the issue.


JR, do you happen to have some kind of debug logs from the sssd_be process? We suspect that the load might be due to the memberof plugin processing group memberships while saving nested group structures, but we need the logs to be completely sure.

Thanks!

By the way, increasing the internal "timeout" might help you avoid the monitor process killing the back-end process. The "timeout" parameter is undocumented, but it specifies the interval (in seconds) between heartbeat pings between the monitor and sssd_be. By default it is set to 10; you might want to increase it to, say, 30.
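
For reference, a minimal sketch of what that could look like in sssd.conf, assuming the domain name from the log excerpt above:

    [domain/domain.com]
    # undocumented heartbeat interval between the monitor and sssd_be,
    # in seconds (default is 10)
    timeout = 30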

Hi, did you have a chance to test the increased timeout or gather the logs?

Putting to 1.9.5 for investigation. We will clone as appropriate when/if we know the scope of the problem.

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.9.5

Today I can reproduce this problem on an 8-core physical host. I will attach the logs, but I think it may be necessary for me to host a GoToMeeting session and just give you keyboard and mouse control to perform GDB debugging...

CPU locks at 100% on sssd_be for at least 25 seconds during login, any time after an sssd restart.

Worked with sgallagh today:

Turned debugging up to 9 and found what appears to be a 20+ second ltdb operation:

(Wed Mar 6 18:45:09 2013) [sssd[be[expertcity.com]]] [ldb] (0x4000): tevent: Added timed event "ltdb_callback": 0x7d52da0
(Wed Mar 6 18:45:09 2013) [sssd[be[expertcity.com]]] [ldb] (0x4000): tevent: Added timed event "ltdb_timeout": 0x7d52ec0
(Wed Mar 6 18:45:09 2013) [sssd[be[expertcity.com]]] [ldb] (0x4000): tevent: Destroying timer event 0x7cfbc50 "ltdb_timeout"
(Wed Mar 6 18:45:09 2013) [sssd[be[expertcity.com]]] [ldb] (0x4000): tevent: Ending timer event 0x7cfbb30 "ltdb_callback"
(Wed Mar 6 18:45:27 2013) [sssd[be[expertcity.com]]] [ldb] (0x4000): tevent: Destroying timer event 0x7d52ec0 "ltdb_timeout"
(Wed Mar 6 18:45:27 2013) [sssd[be[expertcity.com]]] [ldb] (0x4000): tevent: Ending timer event 0x7d52da0 "ltdb_callback"

  • Temporarily moved /var/lib/sss/db/cache_expertcity.com.ldb to /dev/shm/ and symlinked it back (see the commands sketched below).
    The problem persisted, which seems to rule out a disk I/O bottleneck.
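
For the record, the tmpfs test amounted to roughly the following (reconstructed commands, not a transcript of the actual session):

    service sssd stop
    mv /var/lib/sss/db/cache_expertcity.com.ldb /dev/shm/
    ln -s /dev/shm/cache_expertcity.com.ldb /var/lib/sss/db/cache_expertcity.com.ldb
    service sssd start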

Signs seem to point to a memberof operation

  • 11:53 < jhrozek:#sssd> I will run a local experiment with something like callgrind maybe it would be able to detect a tight loop

Via IRC, Simo suggested: "only cachegrind comes to mind as an alternative"
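
For completeness, one way such a local callgrind experiment could be set up (a sketch, not the exact commands used; paths assume a standard install):

    # run sssd in the foreground with verbose debugging and follow the
    # forked children so that sssd_be itself gets profiled
    valgrind --tool=callgrind --trace-children=yes /usr/sbin/sssd -i -d 9
    # afterwards, summarize the per-process output files
    callgrind_annotate callgrind.out.<pid>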

Hi,

there is a 20-second gap between the time when the LDAP lookup of HBAC rules finishes and when rule processing starts. I expect that the rules are being stored in the cache during this time.

(Wed Mar  6 17:51:38 2013) [sssd[be[expertcity.com]]] [sdap_get_generic_ext_done] (0x0400): Search result: Success(0), no errmsg set
(Wed Mar  6 17:51:58 2013) [sssd[be[expertcity.com]]] [hbac_attrs_to_rule] (0x1000): Processing rule [dev-general]

How many HBAC rules do you have? How large are they?

Can you test whether setting access_provider = permit, or setting ipa_hbac_refresh to some greater value (maybe 30), helps?
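
In other words, something like the following in the domain section of sssd.conf, as two separate experiments (the domain name is taken from the logs above):

    [domain/expertcity.com]
    # experiment 1: skip HBAC evaluation entirely
    access_provider = permit

    # experiment 2: keep access_provider = ipa, but refresh the HBAC
    # rules less frequently (value is in seconds)
    #ipa_hbac_refresh = 30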

Do you still have the log from comment #7? Can you attach it, please?

I have 62 HBAC rules. I am not sure what you would qualify as 'large'.

I have hosts in development that are members of multiple HBAC rules...

access_provider = ipa
ipa_hbac_refresh = 1800

I've privately emailed you the cachegrind and callgrind logs.

review: => 0

Fields changed

milestone: SSSD 1.9.5 => NEEDS_TRIAGE

Is there a status update on this ticket? I've not heard anything in quite some time.

Andreas (who maintains Samba) was discussing the issue with other Samba developers, as the problem is most likely not in SSSD but in the underlying Samba libraries. They suggested gathering performance data for the cache libraries using perf, but that's not available on RHEL5, I'm afraid.

What's available on RHEL5 is Google's perftools; I will create a build instrumented with perftools for you and get back to you.
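
Roughly speaking, assuming the instrumented build is linked against gperftools' libprofiler, such a run would be driven along these lines (the binary path and the pprof tool name are assumptions and may differ on RHEL5):

    # point CPUPROFILE at a writable file and run sssd in the foreground
    CPUPROFILE=/tmp/sssd_be.prof /usr/sbin/sssd -i -d 9
    # reproduce the slow login, stop sssd, then summarize the profile
    pprof --text /usr/libexec/sssd/sssd_be /tmp/sssd_be.prof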

Fields changed

changelog: =>
owner: somebody => jhrozek
patch: 0 => 1
priority: major => critical
status: new => assigned

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.10.0

resolution: => fixed
status: assigned => closed

Fields changed

changelog: => N/A, just a bugfix

Metadata Update from @jraquino:
- Issue assigned to jhrozek
- Issue set to the milestone: SSSD 1.10.0

7 years ago

SSSD is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in SSSD's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/2848

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.
