Learn more about these different git repos.
Other Git URLs
We're running sssd-1.12.2-28.el7.x86_64 on a CentOS 7 VPS running our mail server. The VPS performance is not particularly great, especially I/O, but cpu and network seems adequate. However, we frequently see messages like:
(Wed Jan 21 08:30:33 2015) [sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020): sdap_async_sys_connect request failed. (Wed Jan 21 08:30:33 2015) [sssd[be[default]]] [sdap_sys_connect_done] (0x0020): sdap_async_connect_call request failed. (Wed Jan 21 08:30:33 2015) [sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020): sdap_async_sys_connect request failed. (Wed Jan 21 08:30:33 2015) [sssd[be[default]]] [sdap_sys_connect_done] (0x0020): sdap_async_connect_call request failed.
However, the network connection is fine as the TCP handshake completes in ~ .05 seconds:
5238 08:30:26.958010000 VPSSERVER -> LDAP_SERVER TCP 74 45578 > ldaps [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=690248150 TSecr=0 WS=128 5239 08:30:26.958921000 VPSSERVER -> LDAP_SERVER TCP 74 45579 > ldaps [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=690248151 TSecr=0 WS=128 5 5247 08:30:26.999970000 LDAP_SERVER -> VPSSERVER TCP 74 ldaps > 45578 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 SACK_PERM=1 TSval=435752701 TSecr=690248150 WS=128 5248 08:30:27.000010000 VPSSERVER -> LDAP_SERVER TCP 66 45578 > ldaps [ACK] Seq=1 Ack=1 Win=14720 Len=0 TSval=690248192 TSecr=435752701 5249 08:30:27.000061000 LDAP_SERVER -> VPSSERVER TCP 74 ldaps > 45579 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 SACK_PERM=1 TSval=435752702 TSecr=690248151 WS=128 5250 08:30:27.000094000 VPSSERVER -> LDAP_SERVER TCP 66 45579 > ldaps [ACK] Seq=1 Ack=1 Win=14720 Len=0 TSval=690248192 TSecr=435752702
What appears to happen is that sssd just doesn't get back around to processing the connection until after the 6 second timeout passes:
start:
25384 08:30:26.957950 connect(22, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("LDAP_SERVER")}, 128) = -1 EINPROGRESS (Operation now in progress) <0.000088> 25384 08:30:26.958107 epoll_ctl(3, EPOLL_CTL_ADD, 22, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=3676516320, u64=140487761801184}}) = 0 <0.000024>
start another:
25384 08:30:26.958868 connect(23, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("LDAP_SERVER")}, 128) = -1 EINPROGRESS (Operation now in progress) <0.000085> 25384 08:30:26.959003 epoll_ctl(3, EPOLL_CTL_ADD, 23, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=3677088592, u64=140487762373456}}) = 0 <0.000012> 25384 08:30:26.959341 epoll_wait(3, {{EPOLLIN, {u32=3676419808, u64=140487761704672}}}, 1, 4994) = 1 <0.000014>
does a bunch of other stuff including reading from another socket:
25384 08:30:26.959461 poll([{fd=20, events=POLLIN|POLLPRI}], 1, 0) = 1 ([{fd=20, revents=POLLIN}]) <0.000011> 25384 08:30:26.959513 fcntl(20, F_GETFL) = 0x2 (flags O_RDWR) <0.000008> 25384 08:30:26.959555 read(20, "\27\3\1\0P", 5) = 5 <0.000021>
Finally gets back to the connection but it's later than 6 seconds and it closes it instead of simply using it:
25384 08:30:33.071529 epoll_ctl(3, EPOLL_CTL_DEL, 22, {0, {u32=0, u64=0}}) = 0 <0.000025> 25384 08:30:33.071583 gettimeofday({1421857833, 71595}, NULL) = 0 <0.000008> 25384 08:30:33.071619 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2294, ...}) = 0 <0.000567> 25384 08:30:33.072227 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2294, ...}) = 0 <0.000012> 25384 08:30:33.072276 write(10, "(Wed Jan 21 08:30:33 2015) [sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020): sdap_async_sys_connect request failed.\n", 129) = 129 <0.000283> 25384 08:30:33.072590 close(22) = 0 <0.000100>
This can then trigger a bunch of problems as sssd tries to connect to backup and determine if LDAP is working or not.
strace of sssd_be and main sssd process sssd_be.strace.gz
The very slow I/O appears to be manifested in sssd with results like:
25384 08:30:03.559990 fdatasync(14 <unfinished ...> 25384 08:30:06.736727 <... fdatasync resumed> ) = 0 <3.176716> 25384 08:30:06.737027 msync(0x7fc5cbc76000, 45172, MS_SYNC) = 0 <0.782016> 25384 08:30:07.519175 fdatasync(14 <unfinished ...> 25384 08:30:08.730207 <... fdatasync resumed> ) = 0 <1.210987> 25384 08:30:08.730291 msync(0x7fc5cbc76000, 24, MS_SYNC) = 0 <2.686098>
which may be contributing to sssd not getting around to processing the ldap connection until after the timeout. Not quite sure why sssd does so many *sync() calls, although I can guess.
I fixed formatting in the description.
description: We're running sssd-1.12.2-28.el7.x86_64 on a CentOS 7 VPS running our mail server. The VPS performance is not particularly great, especially I/O, but cpu and network seems adequate. However, we frequently see messages like:
However, the network connection is fine as the TCP handshake completes in ~ .05 seconds: 5238 08:30:26.958010000 VPSSERVER -> LDAP_SERVER TCP 74 45578 > ldaps [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=690248150 TSecr=0 WS=128 5239 08:30:26.958921000 VPSSERVER -> LDAP_SERVER TCP 74 45579 > ldaps [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=690248151 TSecr=0 WS=128 5 5247 08:30:26.999970000 LDAP_SERVER -> VPSSERVER TCP 74 ldaps > 45578 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 SACK_PERM=1 TSval=435752701 TSecr=690248150 WS=128 5248 08:30:27.000010000 VPSSERVER -> LDAP_SERVER TCP 66 45578 > ldaps [ACK] Seq=1 Ack=1 Win=14720 Len=0 TSval=690248192 TSecr=435752701 5249 08:30:27.000061000 LDAP_SERVER -> VPSSERVER TCP 74 ldaps > 45579 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 SACK_PERM=1 TSval=435752702 TSecr=690248151 WS=128 5250 08:30:27.000094000 VPSSERVER -> LDAP_SERVER TCP 66 45579 > ldaps [ACK] Seq=1 Ack=1 Win=14720 Len=0 TSval=690248192 TSecr=435752702
start: 25384 08:30:26.957950 connect(22, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("LDAP_SERVER")}, 128) = -1 EINPROGRESS (Operation now in progress) <0.000088> 25384 08:30:26.958107 epoll_ctl(3, EPOLL_CTL_ADD, 22, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=3676516320, u64=140487761801184}}) = 0 <0.000024>
start another: 25384 08:30:26.958868 connect(23, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("LDAP_SERVER")}, 128) = -1 EINPROGRESS (Operation now in progress) <0.000085> 25384 08:30:26.959003 epoll_ctl(3, EPOLL_CTL_ADD, 23, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=3677088592, u64=140487762373456}}) = 0 <0.000012> 25384 08:30:26.959341 epoll_wait(3, {{EPOLLIN, {u32=3676419808, u64=140487761704672}}}, 1, 4994) = 1 <0.000014>
does a bunch of other stuff including reading from another socket: 25384 08:30:26.959461 poll([{fd=20, events=POLLIN|POLLPRI}], 1, 0) = 1 ([{fd=20, revents=POLLIN}]) <0.000011> 25384 08:30:26.959513 fcntl(20, F_GETFL) = 0x2 (flags O_RDWR) <0.000008> 25384 08:30:26.959555 read(20, "\27\3\1\0P", 5) = 5 <0.000021>
Finally gets back to the connection but it's later than 6 seconds and it closes it instead of simply using it: 25384 08:30:33.071529 epoll_ctl(3, EPOLL_CTL_DEL, 22, {0, {u32=0, u64=0}}) = 0 <0.000025> 25384 08:30:33.071583 gettimeofday({1421857833, 71595}, NULL) = 0 <0.000008> 25384 08:30:33.071619 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2294, ...}) = 0 <0.000567> 25384 08:30:33.072227 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2294, ...}) = 0 <0.000012> 25384 08:30:33.072276 write(10, "(Wed Jan 21 08:30:33 2015) [sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020): sdap_async_sys_connect request failed.\n", 129) = 129 <0.000283> 25384 08:30:33.072590 close(22) = 0 <0.000100>
This can then trigger a bunch of problems as sssd tries to connect to backup and determine if LDAP is working or not. => We're running sssd-1.12.2-28.el7.x86_64 on a CentOS 7 VPS running our mail server. The VPS performance is not particularly great, especially I/O, but cpu and network seems adequate. However, we frequently see messages like: {{{ (Wed Jan 21 08:30:33 2015) [sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020): sdap_async_sys_connect request failed. (Wed Jan 21 08:30:33 2015) [sssd[be[default]]] [sdap_sys_connect_done] (0x0020): sdap_async_connect_call request failed. (Wed Jan 21 08:30:33 2015) [sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020): sdap_async_sys_connect request failed. (Wed Jan 21 08:30:33 2015) [sssd[be[default]]] [sdap_sys_connect_done] (0x0020): sdap_async_connect_call request failed. }}} However, the network connection is fine as the TCP handshake completes in ~ .05 seconds: {{{ 5238 08:30:26.958010000 VPSSERVER -> LDAP_SERVER TCP 74 45578 > ldaps [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=690248150 TSecr=0 WS=128 5239 08:30:26.958921000 VPSSERVER -> LDAP_SERVER TCP 74 45579 > ldaps [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=690248151 TSecr=0 WS=128 5 5247 08:30:26.999970000 LDAP_SERVER -> VPSSERVER TCP 74 ldaps > 45578 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 SACK_PERM=1 TSval=435752701 TSecr=690248150 WS=128 5248 08:30:27.000010000 VPSSERVER -> LDAP_SERVER TCP 66 45578 > ldaps [ACK] Seq=1 Ack=1 Win=14720 Len=0 TSval=690248192 TSecr=435752701 5249 08:30:27.000061000 LDAP_SERVER -> VPSSERVER TCP 74 ldaps > 45579 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 SACK_PERM=1 TSval=435752702 TSecr=690248151 WS=128 5250 08:30:27.000094000 VPSSERVER -> LDAP_SERVER TCP 66 45579 > ldaps [ACK] Seq=1 Ack=1 Win=14720 Len=0 TSval=690248192 TSecr=435752702 }}} What appears to happen is that sssd just doesn't get back around to processing the connection until after the 6 second timeout passes:
start: {{{ 25384 08:30:26.957950 connect(22, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("LDAP_SERVER")}, 128) = -1 EINPROGRESS (Operation now in progress) <0.000088> 25384 08:30:26.958107 epoll_ctl(3, EPOLL_CTL_ADD, 22, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=3676516320, u64=140487761801184}}) = 0 <0.000024> }}}
start another: {{{ 25384 08:30:26.958868 connect(23, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("LDAP_SERVER")}, 128) = -1 EINPROGRESS (Operation now in progress) <0.000085> 25384 08:30:26.959003 epoll_ctl(3, EPOLL_CTL_ADD, 23, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=3677088592, u64=140487762373456}}) = 0 <0.000012> 25384 08:30:26.959341 epoll_wait(3, {{EPOLLIN, {u32=3676419808, u64=140487761704672}}}, 1, 4994) = 1 <0.000014> }}}
does a bunch of other stuff including reading from another socket: {{{ 25384 08:30:26.959461 poll([{fd=20, events=POLLIN|POLLPRI}], 1, 0) = 1 ([{fd=20, revents=POLLIN}]) <0.000011> 25384 08:30:26.959513 fcntl(20, F_GETFL) = 0x2 (flags O_RDWR) <0.000008> 25384 08:30:26.959555 read(20, "\27\3\1\0P", 5) = 5 <0.000021> }}}
Finally gets back to the connection but it's later than 6 seconds and it closes it instead of simply using it: {{{ 25384 08:30:33.071529 epoll_ctl(3, EPOLL_CTL_DEL, 22, {0, {u32=0, u64=0}}) = 0 <0.000025> 25384 08:30:33.071583 gettimeofday({1421857833, 71595}, NULL) = 0 <0.000008> 25384 08:30:33.071619 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2294, ...}) = 0 <0.000567> 25384 08:30:33.072227 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2294, ...}) = 0 <0.000012> 25384 08:30:33.072276 write(10, "(Wed Jan 21 08:30:33 2015) [sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020): sdap_async_sys_connect request failed.\n", 129) = 129 <0.000283> 25384 08:30:33.072590 close(22) = 0 <0.000100> }}}
Fields changed
owner: somebody => lslebodn
I replied to sssd-users. https://lists.fedorahosted.org/pipermail/sssd-users/2015-January/002557.html
cc: => lslebodn@redhat.com owner: lslebodn =>
I tried reproducing this again with full debugging to no avail it the short time I had available. Things are much better though since switching to the tmpfs storage so a I may have a useful workaround.
cc: lslebodn@redhat.com => lslebodn@redhat.com, preichl@redhat.com
milestone: NEEDS_TRIAGE => SSSD 1.13 alpha
(Wed Jan 28 19:48:35 2015) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'ldap.cora.nwra.com' is 'working' (Wed Jan 28 19:48:35 2015) [sssd[be[default]]] [get_port_status] (0x1000): Port status of port 636 for server 'ldap.cora.nwra.com' is 'working' (Wed Jan 28 19:48:35 2015) [sssd[be[default]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6 seconds (Wed Jan 28 19:48:35 2015) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'ldap.cora.nwra.com' is 'working' (Wed Jan 28 19:48:35 2015) [sssd[be[default]]] [be_resolve_server_process] (0x0040): The fail over cycled through all available servers (Wed Jan 28 19:48:35 2015) [sssd[be[default]]] [be_resolve_server_done] (0x1000): Server resolution failed: 2
This part of provided log indicates that there may be a problem with handling a fail over - it results with ENOENT, but some server was on previous line marked as working.
rhbz: => todo
We will continue investigating, but after the Alpha..
milestone: SSSD 1.13 alpha => SSSD 1.13 beta sensitive: => 0
Required for downstream, but not for Beta
milestone: SSSD 1.13 beta => SSSD 1.13
I started looking into the sdap code and this needs more work than what's in scope of released branch..
milestone: SSSD 1.13.2 => SSSD 1.14 beta
Unfortunately the sdap_ops code wtill needs more work.
milestone: SSSD 1.14 beta => SSSD 1.15 beta
Metadata Update from @orion: - Issue set to the milestone: SSSD Future releases (no date set yet)
Metadata Update from @thalman: - Custom field design_review reset (from 0) - Custom field mark reset (from 0) - Custom field patch reset (from 0) - Custom field review reset (from 0) - Custom field sensitive reset (from 0) - Custom field testsupdated reset (from 0) - Issue close_status updated to: None - Issue tagged with: Canditate to close
Thank you for taking time to submit this request for SSSD. Unfortunately this issue was not given priority and the team lacks the capacity to work on it at this time.
Given that we are unable to fulfill this request I am closing the issue as wontfix.
If the issue still persist on recent SSSD you can request re-consideration of this decision by reopening this issue. Please provide additional technical details about its importance to you.
Thank you for understanding.
Metadata Update from @pbrezina: - Issue close_status updated to: wontfix - Issue status updated to: Closed (was: Open)
SSSD is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in SSSD's github repository.
This issue has been cloned to Github and is available here: - https://github.com/SSSD/sssd/issues/3604
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
subscribe
Thank you for understanding. We apologize for all inconvenience.
Login to comment on this ticket.