#2562 sssd can timeout ldap network connection incorrectly
Closed: wontfix 4 years ago by pbrezina. Opened 9 years ago by orion.

We're running sssd-1.12.2-28.el7.x86_64 on a CentOS 7 VPS that hosts our mail server. The VPS performance is not particularly great, especially I/O, but CPU and network seem adequate. However, we frequently see messages like:

(Wed Jan 21 08:30:33 2015) [sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020): sdap_async_sys_connect request failed.
(Wed Jan 21 08:30:33 2015) [sssd[be[default]]] [sdap_sys_connect_done] (0x0020): sdap_async_connect_call request failed.
(Wed Jan 21 08:30:33 2015) [sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020): sdap_async_sys_connect request failed.
(Wed Jan 21 08:30:33 2015) [sssd[be[default]]] [sdap_sys_connect_done] (0x0020): sdap_async_connect_call request failed.

However, the network connection is fine, as the TCP handshake completes in ~0.05 seconds:

5238 08:30:26.958010000 VPSSERVER -> LDAP_SERVER TCP 74 45578 > ldaps [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=690248150 TSecr=0 WS=128
5239 08:30:26.958921000 VPSSERVER -> LDAP_SERVER TCP 74 45579 > ldaps [SYN] Seq=0 Win=14600 Len=0 MSS=1460 SACK_PERM=1 TSval=690248151 TSecr=0 WS=128
5
5247 08:30:26.999970000 LDAP_SERVER -> VPSSERVER TCP 74 ldaps > 45578 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 SACK_PERM=1 TSval=435752701 TSecr=690248150 WS=128
5248 08:30:27.000010000 VPSSERVER -> LDAP_SERVER TCP 66 45578 > ldaps [ACK] Seq=1 Ack=1 Win=14720 Len=0 TSval=690248192 TSecr=435752701
5249 08:30:27.000061000 LDAP_SERVER -> VPSSERVER TCP 74 ldaps > 45579 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 SACK_PERM=1 TSval=435752702 TSecr=690248151 WS=128
5250 08:30:27.000094000 VPSSERVER -> LDAP_SERVER TCP 66 45579 > ldaps [ACK] Seq=1 Ack=1 Win=14720 Len=0 TSval=690248192 TSecr=435752702

What appears to happen is that sssd simply doesn't get back around to processing the connection until after the 6-second timeout has passed:

start:

25384 08:30:26.957950 connect(22, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("LDAP_SERVER")}, 128) = -1 EINPROGRESS (Operation now in progress) <0.000088>
25384 08:30:26.958107 epoll_ctl(3, EPOLL_CTL_ADD, 22, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=3676516320, u64=140487761801184}}) = 0 <0.000024>

start another:

25384 08:30:26.958868 connect(23, {sa_family=AF_INET, sin_port=htons(636), sin_addr=inet_addr("LDAP_SERVER")}, 128) = -1 EINPROGRESS (Operation now in progress) <0.000085>
25384 08:30:26.959003 epoll_ctl(3, EPOLL_CTL_ADD, 23, {EPOLLIN|EPOLLOUT|EPOLLERR|EPOLLHUP, {u32=3677088592, u64=140487762373456}}) = 0 <0.000012>
25384 08:30:26.959341 epoll_wait(3, {{EPOLLIN, {u32=3676419808, u64=140487761704672}}}, 1, 4994) = 1 <0.000014>

It then does a bunch of other work, including reading from another socket:

25384 08:30:26.959461 poll([{fd=20, events=POLLIN|POLLPRI}], 1, 0) = 1 ([{fd=20, revents=POLLIN}]) <0.000011>
25384 08:30:26.959513 fcntl(20, F_GETFL) = 0x2 (flags O_RDWR) <0.000008>
25384 08:30:26.959555 read(20, "\27\3\1\0P", 5) = 5 <0.000021>

It finally gets back to the connection, but more than 6 seconds have elapsed, so it closes the socket instead of simply using it:

25384 08:30:33.071529 epoll_ctl(3, EPOLL_CTL_DEL, 22, {0, {u32=0, u64=0}}) = 0 <0.000025>
25384 08:30:33.071583 gettimeofday({1421857833, 71595}, NULL) = 0 <0.000008>
25384 08:30:33.071619 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2294, ...}) = 0 <0.000567>
25384 08:30:33.072227 stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2294, ...}) = 0 <0.000012>
25384 08:30:33.072276 write(10, "(Wed Jan 21 08:30:33 2015) [sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020): sdap_async_sys_connect request failed.\n", 129) = 129 <0.000283>
25384 08:30:33.072590 close(22)         = 0 <0.000100>

This can then trigger a bunch of problems as sssd tries to connect to backup servers and determine whether LDAP is working or not.
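
For reference, the pattern visible in the strace can be reduced to the following minimal C sketch: a non-blocking connect() that returns EINPROGRESS, an epoll registration, and an application-level deadline. This is not SSSD's actual sdap/tevent code; the 6-second value and the helper name are assumptions used only to illustrate how a socket that became writable long ago can still be discarded if the event loop is too busy to service it before the deadline.

#include <errno.h>
#include <stdint.h>
#include <time.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/epoll.h>
#include <sys/socket.h>

#define CONNECT_TIMEOUT_SEC 6   /* assumed to mirror the 6s connect timeout seen in the logs */

static long elapsed_ms(const struct timespec *start)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    return (now.tv_sec - start->tv_sec) * 1000 +
           (now.tv_nsec - start->tv_nsec) / 1000000;
}

/* Hypothetical helper, not an SSSD function. */
int connect_with_deadline(const char *ip, uint16_t port)
{
    struct sockaddr_in sa = { .sin_family = AF_INET, .sin_port = htons(port) };
    inet_pton(AF_INET, ip, &sa.sin_addr);

    int fd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    if (fd < 0) return -1;

    struct timespec start;
    clock_gettime(CLOCK_MONOTONIC, &start);

    /* Matches the strace: connect() on a non-blocking socket returns EINPROGRESS. */
    if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0 && errno != EINPROGRESS) {
        close(fd);
        return -1;
    }

    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN | EPOLLOUT | EPOLLERR | EPOLLHUP };
    ev.data.fd = fd;
    epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);

    /*
     * In the real daemon, other work runs on the same single-threaded loop
     * here. If that work blocks (for example on a multi-second fdatasync),
     * the loop only notices that the socket became writable after the
     * deadline has already passed.
     */
    struct epoll_event out;
    int n = epoll_wait(ep, &out, 1, CONNECT_TIMEOUT_SEC * 1000);

    epoll_ctl(ep, EPOLL_CTL_DEL, fd, NULL);
    close(ep);

    if (n <= 0 || elapsed_ms(&start) > CONNECT_TIMEOUT_SEC * 1000) {
        /* Deadline exceeded: tear the connection down even though the TCP
         * handshake itself may have completed within ~50 ms. */
        close(fd);
        return -1;
    }

    return fd;   /* connected in time */
}

The key point is that the deadline tracks wall-clock progress of the event loop, not the moment the kernel completed the handshake, so any long blocking call elsewhere in the loop can push a perfectly good connection past the timeout.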


Attachment: sssd_be.strace.gz (strace of sssd_be and the main sssd process)

The very slow I/O appears to manifest in sssd with results like:

25384 08:30:03.559990 fdatasync(14 <unfinished ...>
25384 08:30:06.736727 <... fdatasync resumed> ) = 0 <3.176716>
25384 08:30:06.737027 msync(0x7fc5cbc76000, 45172, MS_SYNC) = 0 <0.782016>
25384 08:30:07.519175 fdatasync(14 <unfinished ...>
25384 08:30:08.730207 <... fdatasync resumed> ) = 0 <1.210987>
25384 08:30:08.730291 msync(0x7fc5cbc76000, 24, MS_SYNC) = 0 <2.686098>

which may be contributing to sssd not getting around to processing the LDAP connection until after the timeout. I'm not quite sure why sssd does so many *sync() calls, although I can guess.
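
A quick way to quantify that suspicion is to time the sync calls directly on the filesystem that holds the sssd cache; anything in the multi-second range is enough to stall the single-threaded back end past the connect deadline. The probe below is a standalone sketch, not SSSD code; the file name and threshold are illustrative assumptions.

#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define SLOW_SYNC_MS 500.0   /* arbitrary "this will hurt the event loop" mark */

static double timed_fdatasync(int fd)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    fdatasync(fd);                /* blocks the whole process: sssd_be is single-threaded */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) * 1000.0 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
}

int main(void)
{
    /* Hypothetical probe file; on the real system it should live on the same
     * filesystem as the sssd cache (e.g. under /var/lib/sss/db). */
    int fd = open("sync-probe.tmp", O_WRONLY | O_CREAT, 0600);
    if (fd < 0) return 1;

    if (write(fd, "x", 1) != 1) {
        close(fd);
        return 1;
    }

    double ms = timed_fdatasync(fd);
    if (ms > SLOW_SYNC_MS)
        printf("fdatasync took %.0f ms; syncs this slow stall the event loop\n", ms);

    close(fd);
    return 0;
}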

I fixed formatting in the description.

Fields changed

owner: somebody => lslebodn

Fields changed

cc: => lslebodn@redhat.com
owner: lslebodn =>

I tried reproducing this again with full debugging, to no avail in the short time I had available. Things are much better though since switching to tmpfs storage, so I may have a useful workaround.

Fields changed

cc: lslebodn@redhat.com => lslebodn@redhat.com, preichl@redhat.com

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.13 alpha

(Wed Jan 28 19:48:35 2015) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'ldap.cora.nwra.com' is 'working'
(Wed Jan 28 19:48:35 2015) [sssd[be[default]]] [get_port_status] (0x1000): Port status of port 636 for server 'ldap.cora.nwra.com' is 'working'
(Wed Jan 28 19:48:35 2015) [sssd[be[default]]] [fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6 seconds
(Wed Jan 28 19:48:35 2015) [sssd[be[default]]] [get_server_status] (0x1000): Status of server 'ldap.cora.nwra.com' is 'working'
(Wed Jan 28 19:48:35 2015) [sssd[be[default]]] [be_resolve_server_process] (0x0040): The fail over cycled through all available servers
(Wed Jan 28 19:48:35 2015) [sssd[be[default]]] [be_resolve_server_done] (0x1000): Server resolution failed: 2

This part of the provided log indicates that there may be a problem with fail over handling: it results in ENOENT, even though a server was marked as 'working' on the preceding lines.
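
To make the log above easier to read, here is an illustrative-only model (not the actual fo_resolve_server_* implementation) of how a failover walk can end with ENOENT even though a server's status is still reported as 'working': once every server has been tried in the current cycle, the walk returns no candidate regardless of status.

#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum srv_status { SRV_WORKING, SRV_NOT_WORKING, SRV_NEUTRAL };

struct server {
    const char *name;
    enum srv_status status;
    bool tried_this_cycle;   /* set once the server was attempted in this round */
};

/* Returns 0 and sets *out on success, ENOENT when the cycle is exhausted. */
int pick_next_server(struct server *list, size_t count, struct server **out)
{
    for (size_t i = 0; i < count; i++) {
        if (list[i].tried_this_cycle)
            continue;                    /* already attempted, skipped even if 'working' */
        if (list[i].status == SRV_NOT_WORKING)
            continue;
        list[i].tried_this_cycle = true;
        *out = &list[i];
        return 0;
    }
    /* "The fail over cycled through all available servers" */
    return ENOENT;
}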

Fields changed

rhbz: => todo

We will continue investigating, but after the Alpha.

milestone: SSSD 1.13 alpha => SSSD 1.13 beta
sensitive: => 0

Required for downstream, but not for Beta

milestone: SSSD 1.13 beta => SSSD 1.13

I started looking into the sdap code and this needs more work than what's in scope for a released branch.

milestone: SSSD 1.13.2 => SSSD 1.14 beta

Unfortunately the sdap_ops code still needs more work.

milestone: SSSD 1.14 beta => SSSD 1.15 beta

Metadata Update from @orion:
- Issue set to the milestone: SSSD Future releases (no date set yet)

7 years ago

Metadata Update from @thalman:
- Custom field design_review reset (from 0)
- Custom field mark reset (from 0)
- Custom field patch reset (from 0)
- Custom field review reset (from 0)
- Custom field sensitive reset (from 0)
- Custom field testsupdated reset (from 0)
- Issue close_status updated to: None
- Issue tagged with: Canditate to close

4 years ago

Thank you for taking the time to submit this request for SSSD. Unfortunately, this issue was not given priority and the team lacks the capacity to work on it at this time.

Given that we are unable to fulfill this request, I am closing the issue as wontfix.

If the issue still persists on a recent SSSD release, you can request reconsideration of this decision by reopening this issue. Please provide additional technical details about its importance to you.

Thank you for understanding.

Metadata Update from @pbrezina:
- Issue close_status updated to: wontfix
- Issue status updated to: Closed (was: Open)

4 years ago

SSSD is moving from Pagure to GitHub. This means that new issues and pull requests will be accepted only in SSSD's GitHub repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/3604

If you want to receive further updates on the issue, please navigate to the GitHub issue and click the subscribe button.

Thank you for understanding. We apologize for any inconvenience.
