#2350 Parallel make of "tests" target fails
Closed: wontfix 4 years ago by pbrezina. Opened 9 years ago by nkondras.

Running "make -j4 tests" on a Fedora 20 VM fails with a probability of about 65%.

Failures seem to be due to race conditions, such as some files built in two processes at the same time. The problem might be due to a .PHONY target used somewhere in the dependency tree.

A command with a similar effect, "make -j4 check LOG_COMPILER=true", completes reliably without failures.


A log file from the build would be helpful. Building a file in two processes at the same time should not be a problem; libtool should prevent it.

It generates a prefix for object files,
e.g.:

   gcc -o src/tools/sss_cache-tools_util.o src/tools/tools_util.c
                    ^^^^^^^^^

cc: => lslebodn@redhat.com

Attached logs of 20 consecutive runs of "make -j4 tests; make -j4 clean" on an otherwise unloaded Fedora 20 VM. Of these 13 have failed. You can find the failed ones by grepping for "Error" among them.
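
A minimal sketch of the grep step described above, assuming the attached logs sit in one directory and follow the make_tests_NN.log naming seen later in this ticket. The function names are hypothetical and used only for illustration.

```shell
# List the build logs that contain the string "Error".
# Assumption: logs are named make_tests_*.log and live in the directory
# given as the first argument (default: current directory).
failed_logs() {
    dir=${1:-.}
    grep -l 'Error' "$dir"/make_tests_*.log 2>/dev/null
}

# Print how many logs recorded a failure.
count_failed_logs() {
    failed_logs "$1" | wc -l | tr -d ' '
}
```

Calling count_failed_logs with the directory holding the 20 attached logs would be expected to match the 13 failures reported here, provided "Error" does not also appear in a successful run.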

The failures are mostly different. They reproduce with make debugging enabled ("make -d") too. I can do some such runs and attach the outputs as well.

sh-4.2$ cat -n make_tests_02.log | grep libsss_nss_idmap.so
   617  libtool: link: gcc -shared  -fPIC -DPIC  src/sss_client/idmap/.libs/sss_nss_idmap.o src/sss_client/.libs/common.o src/util/.libs/strtonum.o   -lpthread -ldl  -O2   -Wl,-soname -Wl,libsss_nss_idmap.so.0 -o .libs/libsss_nss_idmap.so.0.0.1
   619  libtool: link: (cd ".libs" && rm -f "libsss_nss_idmap.so.0" && ln -s "libsss_nss_idmap.so.0.0.1" "libsss_nss_idmap.so.0")
   620  libtool: link: (cd ".libs" && rm -f "libsss_nss_idmap.so" && ln -s "libsss_nss_idmap.so.0.0.1" "libsss_nss_idmap.so")
//libsss_nss_idmap.so was successfully linked.

  1750  libtool: link: rm -fr  .libs/libsss_nss_idmap.la .libs/libsss_nss_idmap.lai .libs/libsss_nss_idmap.so .libs/libsss_nss_idmap.so.0 .libs/libsss_nss_idmap.so.0.0.1
// libsss_nss_idmap.so was removed. I don't know why

  1751  libtool: link: gcc -shared  -fPIC -DPIC  src/sss_client/idmap/.libs/sss_nss_idmap.o src/sss_client/.libs/common.o src/util/.libs/strtonum.o   -lpthread -ldl  -O2   -Wl,-soname -Wl,libsss_nss_idmap.so.0 -o .libs/libsss_nss_idmap.so.0.0.1
//libsss_nss_idmap.so should be re-linked one more time
//but another process wants to link with this library on line 1753

  1753  libtool: link: gcc -Wall -Wshadow -Wstrict-prototypes -Wpointer-arith -Wcast-qual -Wcast-align -Wwrite-strings -Wundef -Werror-implicit-function-declaration -Winit-self -fno-strict-aliasing -std=gnu99 -g -O2 -D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE -o .libs/sss_nss_idmap-tests src/tests/cmocka/sss_nss_idmap_tests-sss_nss_idmap-tests.o  -lcmocka ./.libs/libsss_nss_idmap.so -lpthread -ldl
  1754  gcc: error: ./.libs/libsss_nss_idmap.so: No such file or directory

Are you sure there are not two different builds stepping on each other?

sssd in koji is built in parallel with 16 sub-processes
and there is no problem.

Removal could be a part of the linking rule.

Yes, I'm quite sure there is only one make running. Does koji build "tests" target as well?

Removing the symbolic links is part of the linking rule (lines 619 and 620 in comment 4). I do not see a reason why
the file ".libs/libsss_nss_idmap.so.0.0.1" needs to be removed. I tried to run "make -j4 tests" on two different machines and was not able to reproduce the problem (the file .libs/libsss_nss_idmap.so.0.0.1 was not removed after linking this library).
"make -j tests" passed as well.

koji runs command make twice: https://kojipkgs.fedoraproject.org//packages/sssd/1.12.0/2.fc21.beta1/data/logs/x86_64/build.log
1. make -j4 all docs
2. make -j4 check

Did you try Fedora 20? I've seen this happen on Fedora 20 on a Brno Lab VM, and perhaps RHEL7 there as well. "make -j4 check" works fine for me too.

Could you try running, say, ten builds in a row?

However, this is not critical for me at the moment as I can use "make -j4 check LOG_COMPILER=true" instead.

Fields changed

milestone: NEEDS_TRIAGE => SSSD 1.13 beta
rhbz: => 0

Apparently this reproduces well when building on sshfs, probably because of the slowdown of file operations. However, I made it fail once on a native virtio drive in a Fedora 20 VM today. And I've seen failures on Brno Lab VMs as well, using native drives, which probably have slower IO operations there due to contention.

I will try building sssd on a remote filesystem (sshfs) tomorrow.

This has reproduced in a Debian VM, building on sshfs as well.

It has to be an sshfs-specific problem.

I tested the build in a VM with various file systems (tmpfs, ext4, nfs) and it works for me without any problem.

sh-4.2$ file id_log.txt 
id_log.txt: ERROR: cannot open `id_log.txt' (No such file or directory)
sh-4.2$ for i in `seq 10`; do make -j4 tests && make -j4 clean && echo "$i" >> id_log.txt; done >/dev/null 2>&1

sh-4.2$ 
sh-4.2$ cat id_log.txt 
1
2
3
4
5
6
7
8
9
10
sh-4.2$ cat /etc/issue
Fedora release 20 (Heisenbug)
Kernel \r on an \m (\l)
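
The loop above only records the runs that succeeded. A possible variant that counts failed runs and keeps each run's output is sketched below. The function name, the LOG_DIR variable and the run_N.log naming are assumptions made for illustration, not part of the sssd build system.

```shell
# Run a command repeatedly and print how many runs failed.
# Usage: count_failures RUNS COMMAND [ARGS...]
# Each run's output goes to $LOG_DIR/run_N.log (hypothetical naming;
# LOG_DIR defaults to the current directory).
count_failures() {
    runs=$1
    shift
    failed=0
    i=1
    while [ "$i" -le "$runs" ]; do
        if ! "$@" > "${LOG_DIR:-.}/run_$i.log" 2>&1; then
            failed=$((failed + 1))
        fi
        i=$((i + 1))
    done
    echo "$failed"
}
```

For this ticket it could be invoked as count_failures 10 sh -c 'make -j4 tests; rc=$?; make -j4 clean; exit $rc', so that the clean step always runs but only the result of the "tests" build is counted.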


The milestone of this ticket is set to SSSD 1.13 beta. In my opinion, it is not a problem in sssd; it must be a problem with sshfs. We have other working alternatives. I would like to move this ticket to the milestone SSSD Deferred or close it as NOTABUG.

I've seen this happen on other filesystems, although much less often. I haven't noticed any problems building other projects on sshfs, either. However, as I have a workaround, it is not critical. I wouldn't say it's NOTABUG, but rather "worksforme".

I cannot see any benefit in building on sshfs, and I was not able to reproduce it with nfs.

Would you prefer to move this ticket to SSSD Deferred or close it as WORKSFORME?

Let's close it. I'll reopen it once I get a better way to reproduce it, or it becomes a problem.

It looks like you think it is a problem in sssd. There is a workaround. So, I will move this ticket to deferred
and it can be closed after a few years :-)

Feel free to send patches if you find a solution.

milestone: SSSD 1.13 beta => SSSD Deferred

Log "make -j4 tests" failing on an xfs/virtio filesystem.
rhel7_virtio_make_tests_failure.log.xz

Important part of log rhel7_virtio_make_tests_failure.log.xz:

libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I.. -Wall -Iinclude -I..
  //snip
  -MT src/monitor/libsss_util_la-monitor_sbus.lo -MD -MP -MF
   src/monitor/.deps/libsss_util_la-monitor_sbus.Tpo 
   -c ../src/monitor/monitor_sbus.c  -fPIC -DPIC -o src/monitor/
   .libs/libsss_util_la-monitor_sbus.o

/bin/sh ./libtool  --tag=CC   --mode=compile gcc -DHAVE_CONFIG_H -I.  //snip
/bin/sh ./libtool  --tag=CC   --mode=compile gcc -DHAVE_CONFIG_H -I.  //snip
mv -f src/monitor/.deps/libsss_util_la-monitor_sbus.Tpo src/monitor/.deps/libsss_util_la-monitor_sbus.Plo
mv -f src/monitor/.deps/libsss_util_la-monitor_sbus.Tpo src/monitor/.deps/libsss_util_la-monitor_sbus.Plo
mv: cannot stat ‘src/monitor/.deps/libsss_util_la-monitor_sbus.Tpo’: No such file or directory
make[2]: *** [src/monitor/libsss_util_la-monitor_sbus.lo] Error 1
make[2]: *** Waiting for unfinished jobs....

The mv command was executed twice and the second attempt failed. I am not sure why it was executed twice.

Is this issue reproducible without OOM messages in the log files?
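
One way to look for this pattern in other build logs is sketched below. It assumes the "mv -f" lines start at the beginning of a line, as in the excerpt above; the function name is hypothetical.

```shell
# Print every "mv -f ..." command that occurs more than once in a build log.
# Usage: duplicate_mv LOGFILE
duplicate_mv() {
    grep '^mv -f ' "$1" | sort | uniq -d
}
```

A repeated line is only a hint that the same dependency file was moved twice. In a recursive build the same relative path could in principle appear from different directories, so any hit still needs to be checked against the surrounding log.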

A process being killed seems very unlikely to trigger an additional execution of "mv". Still, the last log was taken on a system with 1GB of memory, which never had an OOM killer kill anything, according to journalctl.

The last log was taken 4 days ago.

Changed 4 days ago by nkondras

    Attachment rhel7_virtio_make_tests_failure.log.xz added

Log "make -j4 tests" failing on an xfs/virtio filesystem.

The memory was increased yesterday. Did you mean another log file which is not attached in this ticket?

No, the last log was taken from a local VM running on my laptop.

Replying to [comment:22 nkondras]:

No, the last log was taken from a local VM running on my laptop.

I see

Replying to [comment:20 nkondras]:

A process being killed seems very unlikely to trigger an additional execution of "mv". Still, the last log was taken on a system with 1GB of memory, which never had an OOM killer kill anything, according to journalctl.

I cannot explain the additional execution of "mv" either. It looks like another issue in automake or libtool.

Metadata Update from @nkondras:
- Issue set to the milestone: SSSD Patches welcome

7 years ago

Thank you for taking time to submit this request for SSSD. Unfortunately this issue was not given priority and the team lacks the capacity to work on it at this time.

Given that we are unable to fulfill this request I am closing the issue as wontfix.

If the issue still persists on a recent SSSD, you can request reconsideration of this decision by reopening this issue. Please provide additional technical details about its importance to you.

Thank you for understanding.

Metadata Update from @pbrezina:
- Issue close_status updated to: wontfix
- Issue status updated to: Closed (was: Open)

4 years ago

SSSD is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in SSSD's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/SSSD/sssd/issues/3392

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.
