#2562 sssd can timeout ldap network connection incorrectly
Closed: wontfix 4 years ago by pbrezina. Opened 9 years ago by orion.

We're running sssd-1.12.2-28.el7.x86_64 on a CentOS 7 VPS that hosts our mail server. The VPS performance is not particularly great, especially I/O, but CPU and network seem adequate. However, we frequently see messages like:

(Wed Jan 21 08:30:33 2015) [sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020): sdap_async_sys_connect request failed.
(Wed Jan 21 08:30:33 2015) [sssd[be[default]]] [sdap_sys_connect_done] (0x0020): sdap_async_connect_call request failed.
(Wed Jan 21 08:30:33 2015) [sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020): sdap_async_sys_connect request failed.
(Wed Jan 21 08:30:33 2015) [sssd[be[default]]] [sdap_sys_connect_done] (0x0020): sdap_async_connect_call request failed.

However, the network connection is fine, as the TCP handshake completes in ~0.05 seconds:

5238 08:30:26.958010000 VPSSERVER -> LDAP_SERVER TCP 74 45578 > ldaps [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=690248150 TSecr=0 WS=128
5239 08:30:26.958921000 VPSSERVER -> LDAP_SERVER TCP 74 45579 > ldaps [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=690248151 TSecr=0 WS=128
5
5247 08:30:26.999970000 LDAP_SERVER -> VPSSERVER TCP 74 ldaps > 45578 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 SACK_PERM=1 TSval=435752701 TSecr=690248150 WS=128
5248 08:30:27.000010000 VPSSERVER -> LDAP_SERVER TCP 66 45578 > ldaps [ACK] Seq=1 Ack=1 Win=14720 Len=0 TSval=690248192 TSecr=435752701
5249 08:30:27.000061000 LDAP_SERVER -> VPSSERVER TCP 74 ldaps > 45579 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 SACK_PERM=1 TSval=435752702 TSecr=690248151 WS=128
5250 08:30:27.000094000 VPSSERVER -> LDAP_SERVER TCP 66 45579 > ldaps [ACK] Seq=1 Ack=1 Win=14720 Len=0 TSval=690248192 TSecr=435752702

What appears to happen is that sssd simply doesn't get back around to processing the connection until after the 6-second timeout has passed:

start:

25384 08:30:26.957950 connect(22, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("LDAP_SERVER")}, 128) = -1 EINPROGRESS (Operation now in progress) <0.000088>
25384 08:30:26.958107 epoll_ctl(3, EPOLL_CTL_ADD, 22, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=3676516320, u64=140487761801184}}) = 0 <0.000024>

start another:

25384 08:30:26.958868 connect(23, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("LDAP_SERVER")}, 128) = -1 EINPROGRESS (Operation now in progress) <0.000085>
25384 08:30:26.959003 epoll_ctl(3, EPOLL_CTL_ADD, 23, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=3677088592, u64=140487762373456}}) = 0 <0.000012>
25384 08:30:26.959341 epoll_wait(3, {{EPOLLIN, {u32=3676419808, u64=140487761704672}}}, 1, 4994) = 1 <0.000014>

It then does a bunch of other work, including reading from another socket:

25384 08:30:26.959461 poll([{fd=20, events=POLLIN|POLLPRI}], 1, 0) = 1 ([{fd=20, revents=POLLIN}]) <0.000011>
25384 08:30:26.959513 fcntl(20, F_GETFL) = 0x2 (flags O_RDWR) <0.000008>
25384 08:30:26.959555 read(20, "\27\3\1\0P", 5) = 5 <0.000021>

It finally gets back to the connection, but more than 6 seconds have elapsed, so it closes the socket instead of simply using it:

25384 08:30:33.071529 epoll_ctl(3, EPOLL_CTL_DEL, 22, {0, {u32=0, u64=0}}) = 0 <0.000025>
25384 08:30:33.071583 gettimeofday({1421857833, 71595}, NULL) = 0 <0.000008>
25384 08:30:33.071619 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2294, ...}) = 0 <0.000567>
25384 08:30:33.072227 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2294, ...}) = 0 <0.000012>
25384 08:30:33.072276 write(10, "(Wed Jan 21 08:30:33 2015) [sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020): sdap_async_sys_connect request failed.\n", 129) = 129 <0.000283>
25384 08:30:33.072590 close(22)         = 0 <0.000100>

This can then trigger a bunch of problems as sssd tries to connect to backup servers and determine whether LDAP is working or not.
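
For reference, the pattern visible in the strace can be reduced to the following minimal C sketch: a non-blocking connect() that returns EINPROGRESS, an epoll registration, and an application-level deadline. This is not SSSD's actual sdap/tevent code; the 6-second value and the helper name are assumptions used only to illustrate how a socket that became writable long ago can still be discarded if the event loop is too busy to service it before the deadline.

#include <errno.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>

#define CONNECT_TIMEOUT_SEC 6   /* assumed to mirror the 6s connect timeout seen in the logs */

static long elapsed_ms(const struct timespec *start)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (now.tv_sec - start->tv_sec) * 1000 +
           (now.tv_nsec - start->tv_nsec) / 1000000;
}

/* Hypothetical helper, not an SSSD function. */
int connect_with_deadline(const char *ip, uint16_t port)
{
    struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(port) };
    inet_pton(AF_INET, ip, &sa.sin_addr);

    int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    if (fd < 0) return -1;

    struct timespec start;
    clock_gettime(CLOCK_MONOTONIC, &start);

    /* Matches the strace: connect() on a non-blocking socket returns EINPROGRESS. */
    if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0 && errno != EINPROGRESS) {
        close(fd);
        return -1;
    }

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN | EPOLLOUT | EPOLLERR | EPOLLHUP };
    ev.data.fd = fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);

    /*
     * In the real daemon, other work runs on the same single-threaded loop
     * here. If that work blocks (for example on a multi-second fdatasync),
     * the loop only notices that the socket became writable after the
     * deadline has already passed.
     */
    struct epoll_event out;
    int n = epoll_wait(ep, &out, 1, CONNECT_TIMEOUT_SEC * 1000);

    epoll_ctl(ep, EPOLL_CTL_DEL, fd, NULL);
    close(ep);

    if (n <= 0 || elapsed_ms(&start) > CONNECT_TIMEOUT_SEC * 1000) {
        /* Deadline exceeded: tear the connection down even though the TCP
         * handshake itself may have completed within ~50 ms. */
        close(fd);
        return -1;
    }

    return fd;   /* connected in time */
}

The key point is that the deadline tracks wall-clock progress of the event loop, not the moment the kernel completed the handshake, so any long blocking call elsewhere in the loop can push a perfectly good connection past the timeout.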


Attachment: sssd_be.strace.gz (strace of sssd_be and the main sssd process)

The very slow I/O appears to manifest in sssd with results like:

25384 08:30:03.559990 fdatasync(14 <unfinished ...>
25384 08:30:06.736727 <... fdatasync resumed> ) = 0 <3.176716>
25384 08:30:06.737027 msync(0x7fc5cbc76000, 45172, MS_SYNC) = 0 <0.782016>
25384 08:30:07.519175 fdatasync(14 <unfinished ...>
25384 08:30:08.730207 <... fdatasync resumed> ) = 0 <1.210987>
25384 08:30:08.730291 msync(0x7fc5cbc76000, 24, MS_SYNC) = 0 <2.686098>

which may be contributing to sssd not getting around to processing the LDAP connection until after the timeout. I'm not quite sure why sssd does so many *sync() calls, although I can guess.
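
A quick way to quantify that suspicion is to time the sync calls directly on the filesystem that holds the sssd cache; anything in the multi-second range is enough to stall the single-threaded back end past the connect deadline. The probe below is a standalone sketch, not SSSD code; the file name and threshold are illustrative assumptions.

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define SLOW_SYNC_MS 500.0   /* arbitrary "this will hurt the event loop" mark */

static double timed_fdatasync(int fd)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fdatasync(fd);                /* blocks the whole process: sssd_be is single-threaded */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int main(void)
{
    /* Hypothetical probe file; on the real system it should live on the same
     * filesystem as the sssd cache (e.g. under /var/lib/sss/db). */
    int fd = open("sync-probe.tmp", O_WRONLY | O_CREAT, 0600);
    if (fd < 0) return 1;

    if (write(fd, "x", 1) != 1) {
        close(fd);
        return 1;
    }

    double ms = timed_fdatasync(fd);
    if (ms > SLOW_SYNC_MS)
        printf("fdatasync took %.0f ms; syncs this slow stall the event loop\n", ms);

    close(fd);
    return 0;
}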

I fixed formatting in the description.

Fields changed

owner: somebody => lslebodn

Fields changed

cc: => lslebodn@redhat.com
owner: lslebodn =>

I tried reproducing this again with full debugging, to no avail in the short time I had available. Things are much better though since switching to tmpfs storage, so I may have a useful workaround.

Fields changed

cc: lslebodn@redhat.com => lslebodn@redhat.com, preichl@redhat.com

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.13 alpha

(Wed Jan 28 19:48:35 2015) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'ldap.cora.nwra.com' is 'working'
(Wed Jan 28 19:48:35 2015) [sssd[be[default]]] [get_port_status] (0x1000): Port status of port 636 for server 'ldap.cora.nwra.com' is 'working'
(Wed Jan 28 19:48:35 2015) [sssd[be[default]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6 seconds
(Wed Jan 28 19:48:35 2015) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'ldap.cora.nwra.com' is 'working'
(Wed Jan 28 19:48:35 2015) [sssd[be[default]]] [be_resolve_server_process] (0x0040): The fail over cycled through all available servers
(Wed Jan 28 19:48:35 2015) [sssd[be[default]]] [be_resolve_server_done] (0x1000): Server resolution failed: 2

This part of the provided log indicates that there may be a problem with fail over handling: it results in ENOENT, even though a server was marked as 'working' on the preceding lines.
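
To make the log above easier to read, here is an illustrative-only model (not the actual fo_resolve_server_* implementation) of how a failover walk can end with ENOENT even though a server's status is still reported as 'working': once every server has been tried in the current cycle, the walk returns no candidate regardless of status.

#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum srv_status { SRV_WORKING, SRV_NOT_WORKING, SRV_NEUTRAL };

struct server {
    const char *name;
    enum srv_status status;
    bool tried_this_cycle;   /* set once the server was attempted in this round */
};

/* Returns 0 and sets *out on success, ENOENT when the cycle is exhausted. */
int pick_next_server(struct server *list, size_t count, struct server **out)
{
    for (size_t i = 0; i < count; i++) {
        if (list[i].tried_this_cycle)
            continue;                    /* already attempted, skipped even if 'working' */
        if (list[i].status == SRV_NOT_WORKING)
            continue;
        list[i].tried_this_cycle = true;
        *out = &list[i];
        return 0;
    }
    /* "The fail over cycled through all available servers" */
    return ENOENT;
}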

Fields changed

rhbz: => todo

We will continue investigating, but after the Alpha.

milestone: SSSD 1.13 alpha => SSSD 1.13 beta
sensitive: => 0

Required for downstream, but not for Beta

milestone: SSSD 1.13 beta => SSSD 1.13

I started looking into the sdap code and this needs more work than what's in scope for a released branch.

milestone: SSSD 1.13.2 => SSSD 1.14 beta

Unfortunately the sdap_ops code still needs more work.

milestone: SSSD 1.14 beta => SSSD 1.15 beta

Metadata Update from @orion:
- Issue set to the milestone: SSSD Future releases (no date set yet)

7 years ago

Metadata Update from @thalman:
- Custom field design_review reset (from 0)
- Custom field mark reset (from 0)
- Custom field patch reset (from 0)
- Custom field review reset (from 0)
- Custom field sensitive reset (from 0)
- Custom field testsupdated reset (from 0)
- Issue close_status updated to: None
- Issue tagged with: Canditate to close

4 years ago

Thank you for taking the time to submit this request for SSSD. Unfortunately, this issue was not given priority and the team lacks the capacity to work on it at this time.

Given that we are unable to fulfill this request, I am closing the issue as wontfix.

If the issue still persists on a recent SSSD release, you can request reconsideration of this decision by reopening this issue. Please provide additional technical details about its importance to you.

Thank you for understanding.

Metadata Update from @pbrezina:
- Issue close_status updated to: wontfix
- Issue status updated to: Closed (was: Open)

4 years ago

SSSD is moving from Pagure to GitHub. This means that new issues and pull requests will be accepted only in SSSD's GitHub repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/3604

If you want to receive further updates on the issue, please navigate to the GitHub issue and click the subscribe button.

Thank you for understanding. We apologize for any inconvenience.
