#1038 Failure in sssd_pam after SIGSEGV of sssd_be
Closed: Fixed. Opened 12 years ago by prefect.

After #1037 had occurred (a SIGSEGV in sssd_be), logins were no longer possible on the system, although UID resolution still worked.

Logged in /var/log/secure was:

crond[8488]: pam_sss(crond:account): Request to sssd failed. Timer expired

Restarting SSSD restored normal operation, but sssd_pam segfaulted on restart. The core file for this was not retained by abrtd.


This excerpt from syslog is interesting:

Oct  7 08:34:39 machine kernel: Process 12916(abrt-hook-ccpp) has RLIMIT_CORE set to 1

The abrt developers say that this indicates that abrt itself might have crashed. IIRC you're running 6.0; that release had a bug where if /proc/<PID>/exe disappeared before abrt was able to read it, abrt crashed. Incidentally, I saw something similar in the logs of one of our test systems (running 6.2 beta):

kernel: sssd_pam[3919]: segfault at 3032756193 ip 0000003c1ac0247d sp 00007ffff9b48528 error 4 in libtevent.so.0.9.8[3c1ac00000+9000]
abrt[4159]: Can't read /proc/3919/exe link

I'll try to reproduce the problem locally as well.

In the meantime, I've got two suggestions that might help us catch the bug should it hit you again:
1. Raise the value of /proc/sys/kernel/core_pipe_limit. If it is set to a very low value (1 or 2, perhaps) and multiple processes crash at the same time, the kernel will not wait for the core-collecting process to grab the info, so /proc/<pid>/exe will no longer be available by the time abrt gets to it, and abrt will crash.
2. Raising the core-file size ulimit to unlimited should produce the core file the "classic" way even if abrt chokes on it (a sketch follows below).
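
As an illustration of suggestion 2 only (this is not from the ticket and not SSSD code), the minimal C sketch below raises RLIMIT_CORE, the limit the kernel message above complains about, to unlimited; the shell equivalent is running "ulimit -c unlimited" before starting the process being debugged:

    /* Minimal sketch: raise the core-file size limit so the kernel can
     * write a classic core file even if abrt fails to collect one. */
    #include <sys/time.h>
    #include <sys/resource.h>
    #include <stdio.h>

    int main(void)
    {
        struct rlimit rl = {
            .rlim_cur = RLIM_INFINITY,   /* soft limit */
            .rlim_max = RLIM_INFINITY,   /* hard limit; raising it needs privileges */
        };

        if (setrlimit(RLIMIT_CORE, &rl) != 0) {
            perror("setrlimit(RLIMIT_CORE)");
            return 1;
        }

        /* From this point on, a crash in this process can produce an
         * unlimited-size core dump (still subject to core_pattern/abrt). */
        return 0;
    }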

Picking this up; I can now reproduce the problem at will.

owner: somebody => jhrozek

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.5.14
priority: major => blocker

The backtrace:

(gdb) bt full
#0  _tevent_add_timer (ev=0x1927250, mem_ctx=0x19288b0, next_event=..., handler=0x340841b790 <ltdb_callback>, private_data=0x19288b0, 
    handler_name=0x340842ae7e "ltdb_callback", location=0x340842ae67 "ldb_tdb/ldb_tdb.c:1282") at tevent.c:358
No locals.
#1  0x000000340841a163 in ltdb_handle_request (module=0x1927860, req=0x1928620) at ldb_tdb/ldb_tdb.c:1282
        ldb = 0x1927160
        ev = 0x1927250
        ac = 0x19288b0
        te = <value optimized out>
        tv = {tv_sec = 0, tv_usec = 0}
#2  0x000000340841196a in ldb_next_request (module=0x1927860, request=0x1928620) at common/ldb_modules.c:563
        ret = <value optimized out>
#3  0x000000340840808b in asq_search (module=0x1927380, req=0x1928620) at modules/asq.c:358
        ldb = 0x1927160
        base_req = <value optimized out>
        control = 0x0
        ac = <value optimized out>
        ret = 2
#4  0x000000340840bc2f in ldb_search (ldb=0x1927160, mem_ctx=0x192aef0, result=0x7fff1d8d0500, base=<value optimized out>, 
    scope=<value optimized out>, attrs=0x647aa0, exp_fmt=0x43e5a0 "(&(objectclass=user)(|(nameAlias=%s)(name=%s)))") at common/ldb.c:1349
        req = 0x1928620
        res = 0x1923cf0
        expression = 0x19249b0 "(&(objectclass=user)(|(nameAlias=kau20)(name=kau20)))"
        ap = {{gp_offset = 48, fp_offset = 48, overflow_arg_area = 0x7fff1d8d0498, reg_save_area = 0x7fff1d8d03a0}}
        ret = <value optimized out>
#5  0x00000000004212dc in sysdb_getpwnam (mem_ctx=0x19269f0, ctx=0x1927060, domain=0x1922610, name=0x1929b20 "kau20", _res=0x1926a10)
    at src/db/sysdb_search.c:64
        tmpctx = 0x192aef0
        attrs = {0x43ee84 "name", 0x43ef7d "uidNumber", 0x43ef61 "gidNumber", 0x43ef87 "gecos", 0x43ef8d "homeDirectory", 
          0x43ef9b "loginShell", 0x43ef19 "lastUpdate", 0x43ef24 "dataExpireTimestamp", 0x43ef38 "initgrExpireTimestamp", 
          0x43e6fd "objectClass", 0x0}
        base_dn = 0x1931c30
        res = 0x0
        sanitized_name = 0x1927310 "kau20"
        ret = 0
#6  0x000000000040bbef in pam_check_user_search (preq=0x19269f0) at src/responder/pam/pamsrv_cmd.c:844
        dom = 0x1922610
        cctx = 0x1928cc0
        name = 0x1929b20 "kau20"
        sysdb = 0x1927060
        cacheExpire = 26365472
        ret = 0
        __FUNCTION__ = "pam_check_user_search"
#7  0x000000000040c36f in pam_check_user_dp_callback (err_maj=3, err_min=5, err_msg=0x4424fa "Internal Error", ptr=0x19269f0)
    at src/responder/pam/pamsrv_cmd.c:955
        preq = 0x19269f0
        ret = 0
        pctx = 0x1924060
        __FUNCTION__ = "pam_check_user_dp_callback"
#8  0x0000000000436495 in sss_dp_req_destructor (ptr=0x1924e20) at src/responder/common/responder_dp.c:100
        sdp_req = 0x1924e20
        cb = 0x192b760
        next = 0x0
        key = {type = HASH_KEY_STRING, {str = 0x1931920 "3kau20@LDAP", ul = 26417440}}
        hret = 0
        __FUNCTION__ = "sss_dp_req_destructor"
#9  0x000000340a002d9e in _talloc_free_internal (ptr=0x1924e20, location=0x340a007b1d "talloc.c:1893") at talloc.c:600
        d = 0x4362da <sss_dp_req_destructor>
        tc = 0x4362da
#10 0x000000340a002c2b in _talloc_free_internal (ptr=0x19222b0, location=0x340a007b1d "talloc.c:1893") at talloc.c:631
        child = 0x1924e20
        new_parent = 0x0
        tc = 0x1924e20
#11 0x000000340a002c2b in _talloc_free_internal (ptr=0x1924060, location=0x340a007b1d "talloc.c:1893") at talloc.c:631
        child = 0x19222b0
        new_parent = 0x0
        tc = 0x19222b0
#12 0x000000340a002c2b in _talloc_free_internal (ptr=0x1921420, location=0x340a007b1d "talloc.c:1893") at talloc.c:631
        child = 0x1924060
        new_parent = 0x0
        tc = 0x1924060
#13 0x000000340a002c2b in _talloc_free_internal (ptr=0x1920320, location=0x340a007b1d "talloc.c:1893") at talloc.c:631
        child = 0x1921420
        new_parent = 0x0
        tc = 0x1921420
#14 0x000000340a001abb in _talloc_free_internal (ptr=0x1920140, location=0x340a007b1d "talloc.c:1893") at talloc.c:631
        child = 0x1920320
        new_parent = 0x0
#15 _talloc_free (ptr=0x1920140, location=0x340a007b1d "talloc.c:1893") at talloc.c:1133
        tc = 0x1920320
#16 0x0000003406035d92 in __run_exit_handlers (status=0) at exit.c:78
        atfct = <value optimized out>
        onfct = <value optimized out>
        cxafct = <value optimized out>
        f = <value optimized out>
#17 exit (status=0) at exit.c:100
No locals.
#18 0x000000000042d623 in default_quit (ev=0x1920320, se=0x19211a0, signum=15, count=1, siginfo=0x0, private_data=0x0)
    at src/util/server.c:251
        done_sigterm = 0
        __FUNCTION__ = "default_quit"
#19 0x0000003407403baa in tevent_common_check_signal (ev=0x1920320) at tevent_signal.c:353
        se = 0x19211a0
        count = 1
        sl = <value optimized out>
        next = 0x0
        counter = {count = <value optimized out>, seen = 0}
        clear_processed_siginfo = <value optimized out>
        i = <value optimized out>
#20 0x00000034074054fa in epoll_event_loop (ev=<value optimized out>, location=<value optimized out>) at tevent_standard.c:267
        ret = -1
        i = <value optimized out>
        events = {{events = 17, data = {ptr = 0x1931c30, fd = 26418224, u32 = 26418224, u64 = 26418224}}}
        timeout = <value optimized out>
#21 std_event_loop_once (ev=<value optimized out>, location=<value optimized out>) at tevent_standard.c:544
        std_ev = 0x19203e0
        tval = {tv_sec = 0, tv_usec = 999965}
#22 0x00000034074026d0 in _tevent_loop_once (ev=0x1920320, location=0x440ba7 "src/util/server.c:526") at tevent.c:490
        ret = <value optimized out>
        nesting_stack_ptr = 0x0
#23 0x000000340740273b in tevent_common_loop_wait (ev=0x1920320, location=0x440ba7 "src/util/server.c:526") at tevent.c:591
        ret = <value optimized out>
#24 0x000000000042e67f in server_loop (main_ctx=0x1921420) at src/util/server.c:526
No locals.
#25 0x0000000000408487 in main (argc=4, argv=0x7fff1d8d0d08) at src/responder/pam/pamsrv.c:230
        opt = -1
        pc = 0x191f010
        main_ctx = 0x1921420
        ret = 0
        long_options = {{longName = 0x0, shortName = 0 '\000', argInfo = 4, arg = 0x647b40, val = 0, descrip = 0x43a1e6 "Help options:", 
            argDescrip = 0x0}, {longName = 0x43a1f4 "debug-level", shortName = 100 'd', argInfo = 2, arg = 0x647c38, val = 0, 
            descrip = 0x43a18f "Debug level", argDescrip = 0x0}, {longName = 0x43a200 "debug-to-files", shortName = 102 'f', argInfo = 0, 
            arg = 0x647c3c, val = 0, descrip = 0x43a1a0 "Send the debug output to files instead of stderr", argDescrip = 0x0}, {
            longName = 0x43a20f "debug-timestamps", shortName = 0 '\000', argInfo = 2, arg = 0x647b00, val = 0, 
            descrip = 0x43a1d1 "Add debug timestamps", argDescrip = 0x0}, {longName = 0x0, shortName = 0 '\000', argInfo = 0, arg = 0x0, 
            val = 0, descrip = 0x0, argDescrip = 0x0}}
        __FUNCTION__ = "main"
(gdb) print ev
$2 = (struct tevent_context *) 0x1927250
(gdb) print *ev
$3 = {ops = 0x3400000002, fd_events = 0x2, timer_events = 0x1927bf0, immediate_events = 0x0, signal_events = 0x0, 
  additional_data = 0x1927310, pipe_fde = 0x0, pipe_fds = {49, 0}, debug_ops = {debug = 0x1926b70, context = 0x19286f0}, nesting = {
    allowed = true, level = 0, hook_fn = 0, hook_private = 0x30}}

Valgrind seems to suggest we're accessing memory that was already free()-d:

    ==7271== Invalid read of size 8
    ==7271==    at 0x340A001EFA: talloc_get_name (in /usr/lib64/libtalloc.so.2.0.1)
    ==7271==    by 0x340A001F5D: talloc_check_name (in /usr/lib64/libtalloc.so.2.0.1)
    ==7271==    by 0x34074052E4: ??? (in /usr/lib64/libtevent.so.0.9.8)
    ==7271==    by 0x34074026CF: _tevent_loop_once (in /usr/lib64/libtevent.so.0.9.8)
    ==7271==    by 0x340840A463: ldb_wait (in /usr/lib64/libldb.so.0.9.10)
    ==7271==    by 0x340840BC48: ldb_search (in /usr/lib64/libldb.so.0.9.10)
    ==7271==    by 0x42BB5C: sysdb_getpwnam (sysdb_search.c:59)
    ==7271==    by 0x4093E4: pam_check_user_search (pamsrv_cmd.c:849)
    ==7271==    by 0x40D438: pam_check_user_dp_callback (pamsrv_cmd.c:960)
    ==7271==    by 0x4154E5: sss_dp_req_destructor (responder_dp.c:100)
    ==7271==    by 0x340A002D9D: ??? (in /usr/lib64/libtalloc.so.2.0.1)
    ==7271==    by 0x340A002C2A: ??? (in /usr/lib64/libtalloc.so.2.0.1)
    ==7271==  Address 0x4c4d3f0 is 48 bytes inside a block of size 104 free'd
    ==7271==    at 0x4A0595D: free (vg_replace_malloc.c:366)
    ==7271==    by 0x340A002CA7: ??? (in /usr/lib64/libtalloc.so.2.0.1)
    ==7271==    by 0x340A002C2A: ??? (in /usr/lib64/libtalloc.so.2.0.1)
    ==7271==    by 0x340A001ABA: _talloc_free (in /usr/lib64/libtalloc.so.2.0.1)
    ==7271==    by 0x3406035D91: exit (in /lib64/libc-2.12.so)
    ==7271==    by 0x438426: default_quit (server.c:251)
    ==7271==    by 0x3407403BA9: ??? (in /usr/lib64/libtevent.so.0.9.8)
    ==7271==    by 0x34074054F9: ??? (in /usr/lib64/libtevent.so.0.9.8)
    ==7271==    by 0x34074026CF: _tevent_loop_once (in /usr/lib64/libtevent.so.0.9.8)
    ==7271==    by 0x340740273A: ??? (in /usr/lib64/libtevent.so.0.9.8)
    ==7271==    by 0x437E36: server_loop (server.c:571)
    ==7271==    by 0x4087B9: main (pamsrv.c:235)
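
Reading the two traces together: the sss_dp_req destructor fires while exit() is tearing down the responder's talloc hierarchy, and the callback it triggers ends up using a sysdb/tevent context that the same teardown has already freed. Below is a minimal sketch of that general talloc hazard; the struct names are hypothetical and this is not actual SSSD code:

    /* Minimal sketch (hypothetical names): a talloc destructor that
     * dereferences another allocation which was already freed by the
     * time the destructor runs. */
    #include <talloc.h>
    #include <stdio.h>

    struct resource {
        int fd;
    };

    struct request {
        struct resource *res;   /* borrowed pointer, not a talloc child */
    };

    static int request_destructor(struct request *req)
    {
        /* If res was freed earlier in the teardown, this read is the
         * same kind of invalid access valgrind reports above. */
        printf("resource fd: %d\n", req->res->fd);
        return 0;
    }

    int main(void)
    {
        TALLOC_CTX *top = talloc_new(NULL);
        struct resource *res = talloc_zero(top, struct resource);
        struct request *req = talloc_zero(top, struct request);

        req->res = res;
        talloc_set_destructor(req, request_destructor);

        talloc_free(res);   /* the resource goes away first ...        */
        talloc_free(top);   /* ... and the destructor then touches it. */
        return 0;
    }

Built against libtalloc and run under valgrind, this reports an invalid read inside request_destructor, analogous to the report above.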

Fields changed

owner: jhrozek => jzeleny
patch: 0 => 1
status: new => assigned

Fixed in:
- 6d0fbda (master)
- 1b8f712 (sssd-1-6)
- 453698e (sssd-1-5)

resolution: => fixed
status: assigned => closed

Fields changed

rhbz: => 0

Metadata Update from @prefect:
- Issue assigned to jzeleny
- Issue set to the milestone: SSSD 1.5.14


SSSD is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in SSSD's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/2080

If you want to receive further updates on the issue, please navigate to the github issue and click on the subscribe button.

Thank you for understanding. We apologize for any inconvenience.
