Ticket #1038 (closed defect: fixed)

Opened 3 years ago

Last modified 2 years ago

Failure in sssd_pam after SIGSEGV of sssd_be

Reported by: prefect Owned by: jzeleny
Priority: blocker Milestone: SSSD 1.5.14
Component: PAM Version: 1.6.1
Keywords: Cc:
Blocked By: Blocking:
Tests Updated: no Coverity Bug:
Patch Submitted: yes Red Hat Bugzilla: 0
Design link:
Feature Milestone:
Design review: Fedora test page:
Chosen: Candidate to push out:
Release Notes:

Description

After #1037 occurred (a SIGSEGV in sssd_be), logins were no longer possible on the system, although resolving of UIDs still worked.

Logged in /var/log/secure was:

crond[8488]: pam_sss(crond:account): Request to sssd failed. Timer expired

Restarting SSSD restored normal operation, but sssd_pam segfaulted on restart. The core file for this was not retained by abrtd.

Attachments

syslog (790 bytes) - added by prefect 3 years ago.
syslog excerpt of restart

Change History

Changed 3 years ago by prefect

syslog excerpt of restart

comment:1 Changed 3 years ago by jhrozek

This excerpt from syslog is interesting:

Oct  7 08:34:39 machine kernel: Process 12916(abrt-hook-ccpp) has RLIMIT_CORE set to 1

The abrt developers say that this indicates that abrt itself might have crashed. IIRC you're running 6.0; that release had a bug where, if /proc/<PID>/exe disappeared before abrt was able to read it, abrt crashed. Incidentally, I saw something similar in the logs of one of our test systems (running 6.2 beta):

kernel: sssd_pam[3919]: segfault at 3032756193 ip 0000003c1ac0247d sp 00007ffff9b48528 error 4 in libtevent.so.0.9.8[3c1ac00000+9000]
abrt[4159]: Can't read /proc/3919/exe link

I'll try to reproduce the problem locally as well.

In the meantime, I've got two suggestions that might help us catch the bug should it hit you again:

  1. Raise the value of /proc/sys/kernel/core_pipe_limit. If it was set to a very low value (1 or 2, perhaps) and multiple processes crashed at the same time, the kernel would not wait for the core-collecting process to grab the info, so /proc/<pid>/exe would no longer be available by the time abrt got to it, and abrt would crash.
  2. Raising the core file size ulimit to unlimited should produce the core file the "classic" way even if abrt chokes on it.
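A sketch of how the two workarounds could be applied (the sysctl write needs root, and the value 4 is an arbitrary illustrative choice, not a recommendation from this ticket):

```shell
# 1) Inspect the current pipe limit; the kernel only waits for the
#    core-collecting helper while a crashing process is within this limit.
cat /proc/sys/kernel/core_pipe_limit 2>/dev/null || true

# Raise it so /proc/<pid>/exe stays around while abrt reads it
# (requires root; 4 is just an example value):
# sysctl -w kernel.core_pipe_limit=4

# 2) Allow classic core files as a fallback in case abrt itself crashes:
ulimit -c unlimited
echo "core file size limit: $(ulimit -c)"
```

To make the ulimit change stick for SSSD, it would have to be set in the environment the daemon is started from (e.g. its init script), not just an interactive shell.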

comment:2 Changed 3 years ago by jhrozek

  • Owner changed from somebody to jhrozek

Picking up, I can reproduce this now at will.

comment:3 Changed 3 years ago by dpal

  • Priority changed from major to blocker
  • Milestone changed from NEEDS_TRIAGE to SSSD 1.5.14

comment:4 Changed 3 years ago by jhrozek

The backtrace:

(gdb) bt full
#0  _tevent_add_timer (ev=0x1927250, mem_ctx=0x19288b0, next_event=..., handler=0x340841b790 <ltdb_callback>, private_data=0x19288b0, 
    handler_name=0x340842ae7e "ltdb_callback", location=0x340842ae67 "ldb_tdb/ldb_tdb.c:1282") at tevent.c:358
No locals.
#1  0x000000340841a163 in ltdb_handle_request (module=0x1927860, req=0x1928620) at ldb_tdb/ldb_tdb.c:1282
        ldb = 0x1927160
        ev = 0x1927250
        ac = 0x19288b0
        te = <value optimized out>
        tv = {tv_sec = 0, tv_usec = 0}
#2  0x000000340841196a in ldb_next_request (module=0x1927860, request=0x1928620) at common/ldb_modules.c:563
        ret = <value optimized out>
#3  0x000000340840808b in asq_search (module=0x1927380, req=0x1928620) at modules/asq.c:358
        ldb = 0x1927160
        base_req = <value optimized out>
        control = 0x0
        ac = <value optimized out>
        ret = 2
#4  0x000000340840bc2f in ldb_search (ldb=0x1927160, mem_ctx=0x192aef0, result=0x7fff1d8d0500, base=<value optimized out>, 
    scope=<value optimized out>, attrs=0x647aa0, exp_fmt=0x43e5a0 "(&(objectclass=user)(|(nameAlias=%s)(name=%s)))") at common/ldb.c:1349
        req = 0x1928620
        res = 0x1923cf0
        expression = 0x19249b0 "(&(objectclass=user)(|(nameAlias=kau20)(name=kau20)))"
        ap = {{gp_offset = 48, fp_offset = 48, overflow_arg_area = 0x7fff1d8d0498, reg_save_area = 0x7fff1d8d03a0}}
        ret = <value optimized out>
#5  0x00000000004212dc in sysdb_getpwnam (mem_ctx=0x19269f0, ctx=0x1927060, domain=0x1922610, name=0x1929b20 "kau20", _res=0x1926a10)
    at src/db/sysdb_search.c:64
        tmpctx = 0x192aef0
        attrs = {0x43ee84 "name", 0x43ef7d "uidNumber", 0x43ef61 "gidNumber", 0x43ef87 "gecos", 0x43ef8d "homeDirectory", 
          0x43ef9b "loginShell", 0x43ef19 "lastUpdate", 0x43ef24 "dataExpireTimestamp", 0x43ef38 "initgrExpireTimestamp", 
          0x43e6fd "objectClass", 0x0}
        base_dn = 0x1931c30
        res = 0x0
        sanitized_name = 0x1927310 "kau20"
        ret = 0
#6  0x000000000040bbef in pam_check_user_search (preq=0x19269f0) at src/responder/pam/pamsrv_cmd.c:844
        dom = 0x1922610
        cctx = 0x1928cc0
        name = 0x1929b20 "kau20"
        sysdb = 0x1927060
        cacheExpire = 26365472
        ret = 0
        __FUNCTION__ = "pam_check_user_search"
#7  0x000000000040c36f in pam_check_user_dp_callback (err_maj=3, err_min=5, err_msg=0x4424fa "Internal Error", ptr=0x19269f0)
    at src/responder/pam/pamsrv_cmd.c:955
        preq = 0x19269f0
        ret = 0
        pctx = 0x1924060
        __FUNCTION__ = "pam_check_user_dp_callback"
#8  0x0000000000436495 in sss_dp_req_destructor (ptr=0x1924e20) at src/responder/common/responder_dp.c:100
        sdp_req = 0x1924e20
        cb = 0x192b760
        next = 0x0
        key = {type = HASH_KEY_STRING, {str = 0x1931920 "3kau20@LDAP", ul = 26417440}}
        hret = 0
        __FUNCTION__ = "sss_dp_req_destructor"
#9  0x000000340a002d9e in _talloc_free_internal (ptr=0x1924e20, location=0x340a007b1d "talloc.c:1893") at talloc.c:600
        d = 0x4362da <sss_dp_req_destructor>
        tc = 0x4362da
#10 0x000000340a002c2b in _talloc_free_internal (ptr=0x19222b0, location=0x340a007b1d "talloc.c:1893") at talloc.c:631
        child = 0x1924e20
        new_parent = 0x0
        tc = 0x1924e20
#11 0x000000340a002c2b in _talloc_free_internal (ptr=0x1924060, location=0x340a007b1d "talloc.c:1893") at talloc.c:631
        child = 0x19222b0
        new_parent = 0x0
        tc = 0x19222b0
#12 0x000000340a002c2b in _talloc_free_internal (ptr=0x1921420, location=0x340a007b1d "talloc.c:1893") at talloc.c:631
        child = 0x1924060
        new_parent = 0x0
        tc = 0x1924060
#13 0x000000340a002c2b in _talloc_free_internal (ptr=0x1920320, location=0x340a007b1d "talloc.c:1893") at talloc.c:631
        child = 0x1921420
        new_parent = 0x0
        tc = 0x1921420
#14 0x000000340a001abb in _talloc_free_internal (ptr=0x1920140, location=0x340a007b1d "talloc.c:1893") at talloc.c:631
        child = 0x1920320
        new_parent = 0x0
#15 _talloc_free (ptr=0x1920140, location=0x340a007b1d "talloc.c:1893") at talloc.c:1133
        tc = 0x1920320
#16 0x0000003406035d92 in __run_exit_handlers (status=0) at exit.c:78
        atfct = <value optimized out>
        onfct = <value optimized out>
        cxafct = <value optimized out>
        f = <value optimized out>
#17 exit (status=0) at exit.c:100
No locals.
#18 0x000000000042d623 in default_quit (ev=0x1920320, se=0x19211a0, signum=15, count=1, siginfo=0x0, private_data=0x0)
    at src/util/server.c:251
        done_sigterm = 0
        __FUNCTION__ = "default_quit"
#19 0x0000003407403baa in tevent_common_check_signal (ev=0x1920320) at tevent_signal.c:353
        se = 0x19211a0
        count = 1
        sl = <value optimized out>
        next = 0x0
        counter = {count = <value optimized out>, seen = 0}
        clear_processed_siginfo = <value optimized out>
        i = <value optimized out>
#20 0x00000034074054fa in epoll_event_loop (ev=<value optimized out>, location=<value optimized out>) at tevent_standard.c:267
        ret = -1
        i = <value optimized out>
        events = {{events = 17, data = {ptr = 0x1931c30, fd = 26418224, u32 = 26418224, u64 = 26418224}}}
        timeout = <value optimized out>
#21 std_event_loop_once (ev=<value optimized out>, location=<value optimized out>) at tevent_standard.c:544
        std_ev = 0x19203e0
        tval = {tv_sec = 0, tv_usec = 999965}
#22 0x00000034074026d0 in _tevent_loop_once (ev=0x1920320, location=0x440ba7 "src/util/server.c:526") at tevent.c:490
        ret = <value optimized out>
        nesting_stack_ptr = 0x0
#23 0x000000340740273b in tevent_common_loop_wait (ev=0x1920320, location=0x440ba7 "src/util/server.c:526") at tevent.c:591
        ret = <value optimized out>
#24 0x000000000042e67f in server_loop (main_ctx=0x1921420) at src/util/server.c:526
No locals.
#25 0x0000000000408487 in main (argc=4, argv=0x7fff1d8d0d08) at src/responder/pam/pamsrv.c:230
        opt = -1
        pc = 0x191f010
        main_ctx = 0x1921420
        ret = 0
        long_options = {{longName = 0x0, shortName = 0 '\000', argInfo = 4, arg = 0x647b40, val = 0, descrip = 0x43a1e6 "Help options:", 
            argDescrip = 0x0}, {longName = 0x43a1f4 "debug-level", shortName = 100 'd', argInfo = 2, arg = 0x647c38, val = 0, 
            descrip = 0x43a18f "Debug level", argDescrip = 0x0}, {longName = 0x43a200 "debug-to-files", shortName = 102 'f', argInfo = 0, 
            arg = 0x647c3c, val = 0, descrip = 0x43a1a0 "Send the debug output to files instead of stderr", argDescrip = 0x0}, {
            longName = 0x43a20f "debug-timestamps", shortName = 0 '\000', argInfo = 2, arg = 0x647b00, val = 0, 
            descrip = 0x43a1d1 "Add debug timestamps", argDescrip = 0x0}, {longName = 0x0, shortName = 0 '\000', argInfo = 0, arg = 0x0, 
            val = 0, descrip = 0x0, argDescrip = 0x0}}
        __FUNCTION__ = "main"
(gdb) print ev
$2 = (struct tevent_context *) 0x1927250
(gdb) print *ev
$3 = {ops = 0x3400000002, fd_events = 0x2, timer_events = 0x1927bf0, immediate_events = 0x0, signal_events = 0x0, 
  additional_data = 0x1927310, pipe_fde = 0x0, pipe_fds = {49, 0}, debug_ops = {debug = 0x1926b70, context = 0x19286f0}, nesting = {
    allowed = true, level = 0, hook_fn = 0, hook_private = 0x30}}

comment:6 Changed 3 years ago by jhrozek

Valgrind seems to suggest we're accessing memory that was already free()d:

    ==7271== Invalid read of size 8
    ==7271==    at 0x340A001EFA: talloc_get_name (in /usr/lib64/libtalloc.so.2.0.1)
    ==7271==    by 0x340A001F5D: talloc_check_name (in /usr/lib64/libtalloc.so.2.0.1)
    ==7271==    by 0x34074052E4: ??? (in /usr/lib64/libtevent.so.0.9.8)
    ==7271==    by 0x34074026CF: _tevent_loop_once (in /usr/lib64/libtevent.so.0.9.8)
    ==7271==    by 0x340840A463: ldb_wait (in /usr/lib64/libldb.so.0.9.10)
    ==7271==    by 0x340840BC48: ldb_search (in /usr/lib64/libldb.so.0.9.10)
    ==7271==    by 0x42BB5C: sysdb_getpwnam (sysdb_search.c:59)
    ==7271==    by 0x4093E4: pam_check_user_search (pamsrv_cmd.c:849)
    ==7271==    by 0x40D438: pam_check_user_dp_callback (pamsrv_cmd.c:960)
    ==7271==    by 0x4154E5: sss_dp_req_destructor (responder_dp.c:100)
    ==7271==    by 0x340A002D9D: ??? (in /usr/lib64/libtalloc.so.2.0.1)
    ==7271==    by 0x340A002C2A: ??? (in /usr/lib64/libtalloc.so.2.0.1)
    ==7271==  Address 0x4c4d3f0 is 48 bytes inside a block of size 104 free'd
    ==7271==    at 0x4A0595D: free (vg_replace_malloc.c:366)
    ==7271==    by 0x340A002CA7: ??? (in /usr/lib64/libtalloc.so.2.0.1)
    ==7271==    by 0x340A002C2A: ??? (in /usr/lib64/libtalloc.so.2.0.1)
    ==7271==    by 0x340A001ABA: _talloc_free (in /usr/lib64/libtalloc.so.2.0.1)
    ==7271==    by 0x3406035D91: exit (in /lib64/libc-2.12.so)
    ==7271==    by 0x438426: default_quit (server.c:251)
    ==7271==    by 0x3407403BA9: ??? (in /usr/lib64/libtevent.so.0.9.8)
    ==7271==    by 0x34074054F9: ??? (in /usr/lib64/libtevent.so.0.9.8)
    ==7271==    by 0x34074026CF: _tevent_loop_once (in /usr/lib64/libtevent.so.0.9.8)
    ==7271==    by 0x340740273A: ??? (in /usr/lib64/libtevent.so.0.9.8)
    ==7271==    by 0x437E36: server_loop (server.c:571)
    ==7271==    by 0x4087B9: main (pamsrv.c:235) 

comment:7 Changed 3 years ago by jzeleny

  • Owner changed from jhrozek to jzeleny
  • Patch Submitted set
  • Status changed from new to assigned

comment:8 Changed 3 years ago by sgallagh

  • Resolution set to fixed
  • Status changed from assigned to closed

comment:9 Changed 2 years ago by dpal

  • Red Hat Bugzilla set to 0