Occasionally our LDAP instance develops a problem where various searches hang. Large searches over big chunks of the userRoot db just stop midway through. This appears to be caused by the entryrdn index becoming corrupt, since it can be fixed by regenerating the entryrdn index.
To give an example of a command that fails:
{{{
ldapsearch -x -b "dc=ourorg,dc=com" -h localhost -D "cn=directory manager" -w removed objectclass=posixAccount > /dev/shm/ldap_search_output
}}}
This hangs after reading a large number of entries. The number of entries changes each time, but I believe it may actually be failing on the same specific entry each time, just reading them in a different order.
This can be fixed by stopping ns-slapd, moving /var/lib/dirsrv/slapd-ldap-04/db/userRoot/entryrdn.db4 out of the way, and then running ./db2index -n userRoot -t entryrdn, but the problem recurs some time later, often within an hour or two.
Our system (this is a physical server):
{{{
$ cat /etc/redhat-release
CentOS release 6.5 (Final)
$ uname -r
2.6.32-431.3.1.el6.x86_64
$ rpm -q 389-ds
389-ds-1.2.2-1.el6.noarch
}}}
A stacktrace taken at the start of a long search, while the problem was present. stacktrace.1392167403.pre-index_search_start.txt
A stacktrace taken after the long search hung. stacktrace.1392167889.pre-index_search_hung.txt
A stacktrace taken at the start of the same long search after I re-indexed. stacktrace.1392168177.post-reindex_search_start.txt
A stacktrace taken after the search above completed OK. stacktrace.1392168303.post-reindex_search_finished.txt
One other note that I forgot to include in my main comment is that we're also occasionally seeing these errors when entries are deleted by our programs:
{{{
[12/Feb/2014:20:20:46 +0000] entryrdn-index - _entryrdn_delete_key: Failed to remove ou=test; has children
[12/Feb/2014:20:20:46 +0000] - database index operation failed BAD 1031, err=-1 Unknown error: -1
}}}
They may be related, but aren't directly causing this, since we sometimes see this problem without any of those errors happening, and sometimes see these errors without the problem occurring.
dbscan of a backup of the entryrdn.db4 file taken while the system was broken. entryrdn.db4.broken.dbscan.xz
A dbscan of the entryrdn.db4 while it's working. entryrdn.db4.working.dbscan.xz
Thank you for the input!
{{{
$ rpm -qa | grep 389
389-ds-base-1.2.11.15-31.el6_5.x86_64
389-ds-console-doc-1.2.6-1.el6.noarch
389-admin-1.1.35-1.el6.x86_64
389-ds-console-1.2.6-1.el6.noarch
389-admin-console-doc-1.1.8-1.el6.noarch
389-ds-1.2.2-1.el6.noarch
389-adminutil-1.1.19-1.el6.x86_64
389-console-1.1.7-1.el6.noarch
389-admin-console-1.1.8-1.el6.noarch
389-dsgw-1.1.11-1.el6.x86_64
389-ds-base-debuginfo-1.2.11.15-31.el6_5.x86_64
389-ds-base-libs-1.2.11.15-31.el6_5.x86_64
}}}
The last couple of thousand entries in the error log -- includes two re-indexes. errors
Reading the entryrdn.db4.*.dbscan files, an entry in the primary db is corrupted and it breaks the entryrdn file, as well...
Could you run the following and attach the output?
{{{
dbscan -f /path/to/db/userRoot/id2entry.db -K 276949
}}}
I'm afraid the DN looks like this:
{{{
dn: nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9,nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9,ou=nvoq,ou=projects,...
}}}
Thanks. OK, I've done that:
{{{
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/id2entry.db4 -K 276949
id 276949
    rdn: nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9,nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9
    nsUniqueId: 97563282-843411e1-9252e9c7-2a0ffae9
    objectClass: organizationalUnit
    objectClass: top
    objectClass;vucsn-52faa506000409150000: nsTombstone
    ou: nvoq
    creatorsName: cn=directory manager
    modifiersName: cn=directory manager
    createTimestamp: 20120412001529Z
    modifyTimestamp: 20120412001529Z
    parentid: 73137
    entryid: 276949
    tombstoneNumSubordinates: 1
    nsParentUniqueId: f976ee35-7f9111de-803bb14c-f78174f9
    nscpEntryDN: nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9,ou=nvoq,ou=projects,o=sublimemedia,ou=organizations,dc=cvsdude,dc=com
}}}
This is a tombstone entry if I'm reading that right, and I think they only exist on servers that are being replicated, correct? This server isn't replicating anything under dc=cvsdude,dc=com; it is configured to replicate, but we haven't created the slave yet.
In case it's relevant: the data it contains was previously on a server with replication problems. To transfer it we imported a backup, then used db2ldif to generate an ldif file before recreating the database and importing the generated ldif (I wanted to try to clear out any corruption in the backup).
> This is a tombstone entry if I'm reading that right, and I think they only exist on servers that are being replicated correct? This server isn't replicating anything under dc=cvsdude,dc=com, but is configured to replicate, but we haven't created the slave yet.

Yes, this is a tombstone entry, but the RDN is corrupted. The correct RDN for this entry is:
{{{
rdn: nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9,ou=nvoq
}}}
Because of this corruption, entryrdn gets confused and a duplicated RDN is added to the index. That caused the hang.
Unfortunately, this entry was generated quite a long time ago. We fixed a related bug in June 2013.
{{{
createTimestamp: 20120412001529Z
modifyTimestamp: 20120412001529Z
}}}
To fix this problem, we have to delete the tombstone entry ID 276949 (and its child ID 582043, which might be broken as well):
{{{
C276949
    ID: 582043; RDN: "nsuniqueid=3154a601-3d6111e3-8172e9c7-2a0ffae9,cn=svn"; NRDN: "nsuniqueid=3154a601-3d6111e3-8172e9c7-2a0ffae9,cn=svn"
C276949
    ID: 276949; RDN: "nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9,nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9"; NRDN: "nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9,nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9"
}}}
Again, unfortunately, you cannot manually delete tombstone entries using ldapdelete. I wonder why these old tombstones are kept in the database. Don't you reap tombstone entries? You could try setting some short time (e.g., 60 seconds) for nsDS5ReplicaPurgeDelay and nsDS5ReplicaTombstonePurgeInterval, and then do a modify operation to trigger the reap. (Please do it with the clean entryrdn, after reindexing entryrdn...)

https://access.redhat.com/site/documentation/en-US/Red_Hat_Directory_Server/9.0/html/Configuration_Command_and_File_Reference/Core_Server_Configuration_Reference.html#Replication_Attributes_under_cnreplica_cnsuffixName_cnmapping_tree_cnconfig-nsDS5ReplicaPurgeDelay

https://access.redhat.com/site/documentation/en-US/Red_Hat_Directory_Server/9.0/html/Configuration_Command_and_File_Reference/Core_Server_Configuration_Reference.html#Replication_Attributes_under_cnreplica_cnsuffixName_cnmapping_tree_cnconfig-nsDS5ReplicaTombstonePurgeInterval
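For reference, the two purge settings can be applied with a small ldapmodify LDIF like the one below. This is only a sketch: the replica entry DN shown is an assumption based on the standard 389-ds layout, so adjust the quoted suffix to your own mapping tree.

```ldif
dn: cn=replica,cn="dc=cvsdude,dc=com",cn=mapping tree,cn=config
changetype: modify
replace: nsDS5ReplicaPurgeDelay
nsDS5ReplicaPurgeDelay: 60
-
replace: nsDS5ReplicaTombstonePurgeInterval
nsDS5ReplicaTombstonePurgeInterval: 60
```

Feed it to something like `ldapmodify -x -D "cn=directory manager" -W` on the master.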
Once the tombstone purge is done, please run "dbscan -f /path/to/db/userRoot/id2entry.db -K 276949" once again and see if the corrupted entry was purged or not.
If it still exists, you may need to do this raw-level recovery...
{{{
1. Stop the server.
2. cd /usr/lib[64]/dirsrv/slapd-ID; ./db2ldif -r -n <your_backend_name>
3. Open the exported ldif file and delete the problematic entries from it:
   dn: nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9,ou=nvoq,ou=projects,o=sublimemedia,ou=organizations,dc=cvsdude,dc=com
   dn: nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9,nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9,ou=nvoq,ou=projects,o=sublimemedia,ou=organizations,dc=cvsdude,dc=com
   If you see any child/grandchild entries (whose dn includes ou=nvoq,ou=projects,o=sublimemedia,ou=organizations,dc=cvsdude,dc=com), they need to be deleted, too.
4. Import the edited ldif file on the master.
5. Start the server and reinitialize replicas.
}}}
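Step 3 above can be sketched in a few lines of Python, assuming the export is a standard LDIF file with blank-line-separated entries. The helper below is hypothetical (not part of the 389-ds tools), and a real script would first unfold LDIF continuation lines (lines beginning with a space) before matching on `dn:`.

```python
# Sketch of step 3: filter the problematic entries out of the exported LDIF.
# The DN substring below comes from the corrupted entry discussed above.
BAD_DN_PATTERNS = [
    "nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9,ou=nvoq",
]

def filter_ldif(text, bad_patterns):
    """Drop every LDIF entry whose dn: line contains one of the bad substrings."""
    kept = []
    for entry in text.split("\n\n"):  # LDIF entries are blank-line separated
        dn_line = next(
            (l for l in entry.splitlines() if l.lower().startswith("dn:")), ""
        )
        if any(p in dn_line for p in bad_patterns):
            continue  # corrupted entry: skip it
        kept.append(entry)
    return "\n\n".join(kept)
```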
Thank you for your help; that seems to have removed the corrupted entries:
{{{
$ dbscan -f db/userRoot/id2entry.db4 -K 276949
Can't set cursor to returned item: DB_NOTFOUND: No matching key/data pair found
$ dbscan -f db/userRoot/id2entry.db4 -K 582043
Can't set cursor to returned item: DB_NOTFOUND: No matching key/data pair found
}}}
When I was looking at the replica agreements I noticed that one of our databases (we have two on the same server) already had nsDS5ReplicaPurgeDelay explicitly set to its default value. We haven't had these sorts of problems on that database, only on the other. Is it possible that it never purges them unless you specifically set that option?
(I've specifically set it and nsDS5ReplicaTombstonePurgeInterval on both of our replicas to their "defaults" to be on the safe side now)
We're seeing similar problems again. How would we go about checking for such corruption ourselves?
(Would you prefer I take this discussion to the mailing list?)
Replying to [comment:8 tpollard]:
> We're seeing similar problems again. How would we go about checking for such corruption ourselves?
> (Would you prefer I take this discussion to the mailing list?)
Probably, you may want to dump id2entry with "dbscan -f /path/to/id2entry.db4 > id2entry.db4.out", then:
{{{
egrep -i "nsuniqueid=.*,nsuniqueid=.*" id2entry.db4.out
}}}
If you see them, I'd like you to clean them up. If not, please attach the entryrdn dump one more time.
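The same check can also be sketched over a dbscan dump in a few lines of Python. This is only an illustration of the doubled-nsuniqueid pattern the egrep above looks for, not part of the 389-ds tooling:

```python
import re

# A doubled nsuniqueid RDN ("nsuniqueid=...,nsuniqueid=...") marks a corrupted
# node in id2entry or entryrdn; this mirrors the egrep above.
DOUBLE_NSUNIQUEID = re.compile(r"nsuniqueid=[^,]+,nsuniqueid=", re.IGNORECASE)

def find_corrupted_lines(dump_lines):
    """Return the dbscan dump lines containing a doubled nsuniqueid RDN."""
    return [line for line in dump_lines if DOUBLE_NSUNIQUEID.search(line)]
```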
Thanks!
Thanks again. I tried setting the purges down to 60 seconds to clear out the old entries now, but we're still seeing the hangs, and we've got another weird issue now:
{{{
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/id2entry.db4 | egrep -i "nsuniqueid=.*,nsuniqueid=.*"
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/entryrdn.db4 | egrep -i "nsuniqueid=.*,nsuniqueid=.*"
ID: 805177; RDN: "nsuniqueid=48873a01-980f11e3-940aa954-a476eb11,nsuniqueid=48873a01-980f11e3-940aa954-a476eb11"; NRDN: "nsuniqueid=48873a01-980f11e3-940aa954-a476eb11,nsuniqueid=48873a01-980f11e3-940aa954-a476eb11"
}}}
One of my co-workers rebuilt the index just a few hours ago, so this must have started since then.
The nsds5ReplicaPurgeDelay and nsds5ReplicaTombstonePurgeInterval settings are both on 60 seconds at the moment. Could that be causing this oddness with the index?
We've also seen several segfaults over the last 24 hours; is there any chance they are related? (We're going to try to get core dumps for them if they happen again today so we can report them properly.)
Replying to [comment:10 tpollard]:
> Thanks again. I tried setting the purges down to 60 seconds to clear out the old entries now, but we're still seeing the hangs, and we've got another weird issue now:
> {{{
> [root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/id2entry.db4 | egrep -i "nsuniqueid=.*,nsuniqueid=.*"
> [root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/entryrdn.db4 | egrep -i "nsuniqueid=.*,nsuniqueid=.*"
> ID: 805177; RDN: "nsuniqueid=48873a01-980f11e3-940aa954-a476eb11,nsuniqueid=48873a01-980f11e3-940aa954-a476eb11"; NRDN: "nsuniqueid=48873a01-980f11e3-940aa954-a476eb11,nsuniqueid=48873a01-980f11e3-940aa954-a476eb11"
> }}}
> One of my co-workers rebuilt the index just a few hours ago, so this must have started since then.
Could you please attach the output from "dbscan -f db/userRoot/entryrdn.db4" one more time? Also, could you find the entry having the nsuniqueid 48873a01-980f11e3-940aa954-a476eb11? I'd like to see what it looks like.
It should not. That combination should work...
Yes, if the entryrdn is corrupted, it could easily crash the server... Yes, the stacktrace should be helpful to understand what is happening...
Replying to [comment:11 nhosoi]:
I'm generating the dbscan of the entryrdn.db4 now and I'll attach it after I finish compressing it.
Looks like it doesn't exist, not even in the entryrdn index:
{{{
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ ldapsearch -x -b 'dc=cvsdude,dc=com' nsuniqueid=48873a01-980f11e3-940aa954-a476eb11

search: 2
result: 0 Success

[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/id2entry.db4 | grep '48873a01-980f11e3-940aa954-a476eb11'
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/entryrdn.db4 | grep '48873a01-980f11e3-940aa954-a476eb11'
ID: 805177; RDN: "nsuniqueid=48873a01-980f11e3-940aa954-a476eb11,nsuniqueid=48873a01-980f11e3-940aa954-a476eb11"; NRDN: "nsuniqueid=48873a01-980f11e3-940aa954-a476eb11,nsuniqueid=48873a01-980f11e3-940aa954-a476eb11"
}}}
> > We've also seen several segfaults over the last 24 hours; is there any chance they are related? (We're going to try to get core dumps for them if they happen again today so we can report them properly)
> Yes, if the entryrdn is corrupted, it could easily crash the server... Yes, the stacktrace should be helpful to understand what is happening...
I'll see what I can do about getting a core dump (or did you want a stacktrace of it while it's running?)
Another dbscan of entryrdn entryrdn.dbscan.201402180015.xz
What does this CLI return?
{{{
dbscan -f db/userRoot/id2entry.db4 -K 805177
}}}
And ldapsearch does not return tombstone entries by default. Does this return any entries? If yes, could you find nsuniqueid=48873a01-980f11e3-940aa954-a476eb11 in the search output?
{{{
ldapsearch -x -b 'dc=cvsdude,dc=com' -D "cn=directory manager" -w your_pw "(objectclass=nstombstone)"
}}}
Replying to [comment:13 nhosoi]:
{{{
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/id2entry.db4 -K 805177
Can't set cursor to returned item: DB_NOTFOUND: No matching key/data pair found
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/entryrdn.db4 | egrep -i "nsuniqueid=.*,nsuniqueid=.*"
ID: 805177; RDN: "nsuniqueid=48873a01-980f11e3-940aa954-a476eb11,nsuniqueid=48873a01-980f11e3-940aa954-a476eb11"; NRDN: "nsuniqueid=48873a01-980f11e3-940aa954-a476eb11,nsuniqueid=48873a01-980f11e3-940aa954-a476eb11"
}}}
It does return results, but none of them match 48873a01-980f11e3-940aa954-a476eb11:
{{{
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ ldapsearch -x -b 'dc=cvsdude,dc=com' -D "cn=directory manager" -w REMOVED "(objectclass=nstombstone)"

dn: nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,dc=cvsdude,dc=com
objectClass: top
objectClass: nsTombstone
objectClass: extensibleobject
[snip]

[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ ldapsearch -x -b 'dc=cvsdude,dc=com' -D "cn=directory manager" -w REMOVED "(objectclass=nstombstone)" | grep 48873a01-980f11e3-940aa954-a476eb11
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$
}}}
Okay, now you have no tombstone entries in your primary db (id2entry.db4).
But does reindexing entryrdn still generate corrupted nodes with a double nsuniqueid? If so, that is really bad...
Also, entryrdn is referring to ID 805177, but it does not exist in id2entry.db4, which is strange. I'm wondering whether ID 805177 still existed when the corrupted entryrdn node was created. Then the tombstone entry was purged, but due to the corrupted entryrdn node, it failed to delete the node in the entryrdn index...?
Do you happen to keep a backup (hopefully)? If yes, could you run "dbscan -f /path/to/backup/db/id2entry.db4 -K 805177"? Our understanding is that the double-nsuniqueid bug used to be in the primary db, and entryrdn gets broken due to the double-nsuniqueid'ed DN. But if the double-nsuniqueid node in entryrdn could be created from a healthy tombstone entry, it is a different issue. I'd like to know whether you have the latter possibility...
Replying to [comment:15 nhosoi]:
There are tombstone entries; I just snipped them from the output I sent you.
I suspect it might have been created in the four or five hours between when my colleague reindexed the database and when I was checking it this morning.
When I reindexed the database just now, it went away:
{{{
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/entryrdn.db4 | egrep -i "nsuniqueid=.*,nsuniqueid=.*"
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$
}}}
> Also, entryrdn is referring ID: 805177, but it does not exist in id2entry.db4, which is strange. I'm wondering when the corrupted entryrdn was created, ID: 805177 still existed. Then, the tombstone entry was purged. But due to the corrupted entryrdn node, it failed to delete the node in the entryrdn index...?
> Do you happen to keep a backup (hopefully)? If yes, could you run dbscan -f /path/to/backup/db/id2entry.db4 -K 805177? Our understanding is double nsuniqueid bug used to be in the primary db. And entryrdn gets broken due to the double nsuniqueid'ed DN. But if the double nsuniqueid node in entryrdn could be created from a healthy tombstone entry, it is a different issue. I'd like to know you have a latter possibility...
Yes, we take backups every twelve hours. I checked as far back as we've got, which is nearly two weeks, and none of them contained 805177. It's possible that it only existed for less than 12 hours, though.
Hello,
Do you have any updates on this issue? If you haven't seen the double tombstone problem since you cleaned them up, can we close this ticket for now? If you see the problem again, you can reopen it.
Thank you for your help, --noriko
Hi Noriko,
The double tombstone entries keep coming back. We had to do another index rebuild this morning and there's also this:
{{{
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/entryrdn.db4 | egrep -i "nsuniqueid=.*,nsuniqueid=.*"
ID: 857936; RDN: "nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11,nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11"; NRDN: "nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11,nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11"
ID: 857936; RDN: "nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11,nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11"; NRDN: "nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11,nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11"
ID: 566309; RDN: "nsuniqueid=d5cc220a-34ef11e3-9c3ee9c7-2a0ffae9,nsuniqueid=d5cc220a-34ef11e3-9c3ee9c7-2a0ffae9"; NRDN: "nsuniqueid=d5cc220a-34ef11e3-9c3ee9c7-2a0ffae9,nsuniqueid=d5cc220a-34ef11e3-9c3ee9c7-2a0ffae9"
ID: 709711; RDN: "nsuniqueid=88d54183-7da511e3-9c7d92f7-d9a18ca5,nsuniqueid=88d54183-7da511e3-9c7d92f7-d9a18ca5"; NRDN: "nsuniqueid=88d54183-7da511e3-9c7d92f7-d9a18ca5,nsuniqueid=88d54183-7da511e3-9c7d92f7-d9a18ca5"
ID: 820687; RDN: "nsuniqueid=ed7e6d01-9ac411e3-b352a954-a476eb11,nsuniqueid=ed7e6d01-9ac411e3-b352a954-a476eb11"; NRDN: "nsuniqueid=ed7e6d01-9ac411e3-b352a954-a476eb11,nsuniqueid=ed7e6d01-9ac411e3-b352a954-a476eb11"
ID: 820687; RDN: "nsuniqueid=ed7e6d01-9ac411e3-b352a954-a476eb11,nsuniqueid=ed7e6d01-9ac411e3-b352a954-a476eb11"; NRDN: "nsuniqueid=ed7e6d01-9ac411e3-b352a954-a476eb11,nsuniqueid=ed7e6d01-9ac411e3-b352a954-a476eb11"
ID: 821901; RDN: "nsuniqueid=bf6d4d13-9b0511e3-b352a954-a476eb11,nsuniqueid=bf6d4d13-9b0511e3-b352a954-a476eb11"; NRDN: "nsuniqueid=bf6d4d13-9b0511e3-b352a954-a476eb11,nsuniqueid=bf6d4d13-9b0511e3-b352a954-a476eb11"
ID: 824198; RDN: "nsuniqueid=acf22282-9b9c11e3-b6c1a954-a476eb11,nsuniqueid=acf22282-9b9c11e3-b6c1a954-a476eb11"; NRDN: "nsuniqueid=acf22282-9b9c11e3-b6c1a954-a476eb11,nsuniqueid=acf22282-9b9c11e3-b6c1a954-a476eb11"
ID: 841628; RDN: "nsuniqueid=42911601-9fd211e3-b314a954-a476eb11,nsuniqueid=42911601-9fd211e3-b314a954-a476eb11"; NRDN: "nsuniqueid=42911601-9fd211e3-b314a954-a476eb11,nsuniqueid=42911601-9fd211e3-b314a954-a476eb11"
ID: 846125; RDN: "nsuniqueid=efc66584-a08811e3-b779a954-a476eb11,nsuniqueid=efc66584-a08811e3-b779a954-a476eb11"; NRDN: "nsuniqueid=efc66584-a08811e3-b779a954-a476eb11,nsuniqueid=efc66584-a08811e3-b779a954-a476eb11"
ID: 851726; RDN: "nsuniqueid=d150ac46-a2cc11e3-ba13a954-a476eb11,nsuniqueid=d150ac46-a2cc11e3-ba13a954-a476eb11"; NRDN: "nsuniqueid=d150ac46-a2cc11e3-ba13a954-a476eb11,nsuniqueid=d150ac46-a2cc11e3-ba13a954-a476eb11"
ID: 851726; RDN: "nsuniqueid=d150ac46-a2cc11e3-ba13a954-a476eb11,nsuniqueid=d150ac46-a2cc11e3-ba13a954-a476eb11"; NRDN: "nsuniqueid=d150ac46-a2cc11e3-ba13a954-a476eb11,nsuniqueid=d150ac46-a2cc11e3-ba13a954-a476eb11"
ID: 852972; RDN: "nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11,nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11"; NRDN: "nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11,nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11"
ID: 852972; RDN: "nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11,nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11"; NRDN: "nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11,nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11"
ID: 852972; RDN: "nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11,nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11"; NRDN: "nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11,nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11"

[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/id2entry.db4 -K 857936
id 857936
    rdn: nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11,nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11
    objectClass;vucsn-53167631000009150000: organizationalUnit
    objectClass;vucsn-53167631000009150000: top
    objectClass;vucsn-53167803000009150000: nsTombstone
    ou;vucsn-53167631000009150000;mdcsn-53167631000009150000: lab3
    creatorsName;vucsn-53167631000009150000: cn=directory manager
    modifiersName;vucsn-53167631000009150000: cn=directory manager
    createTimestamp;vucsn-53167631000009150000: 20140305005616Z
    modifyTimestamp;vucsn-53167631000009150000: 20140305005616Z
    nsUniqueId: e4fd2f01-a40011e3-9919a954-a476eb11
    parentid: 857935
    entryid: 857936
    tombstoneNumSubordinates: 1
    nsParentUniqueId: 9d76a305-a40011e3-9919a954-a476eb11
    nscpEntryDN: nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11,ou=lab3,ou=projects,o=asken360,ou=organizations,dc=cvsdude,dc=com
}}}
Also, is this normal given those entries?
{{{
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/entryrdn.db4 -K 857936
Can't set cursor to returned item: DB_NOTFOUND: No matching key/data pair found
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/entryrdn.db4 -K 852972
Can't set cursor to returned item: DB_NOTFOUND: No matching key/data pair found
}}}
We also have both nsds5ReplicaPurgeDelay and nsds5ReplicaTombstonePurgeInterval set to 60.
I'm not sure whether that's the only problem. We've seen a few hangs, segfaults, and general protection faults over the past few weeks. I'll attach a stacktrace from one of the hangs, and one of the segfaults to this ticket in a few moments.
How would we check if they are related?
Stacktrace from a recent hang stacktrace.1393817425.hang.txt
Stacktrace from the core dump of a recent segfault. stacktrace.1393982189.segfault_core_dump.txt
That does not sound good... Double tombstoned entry should not be generated. That's a bug.
Do you have any log of operations on this entry? (Could you grep for this entry in the access log /var/log/dirsrv/slapd-ID/access?)
{{{
ou=lab3,ou=projects,o=asken360,ou=organizations,dc=cvsdude,dc=com
}}}
I guess this entry was deleted at some point. I'm interested in how it happened, e.g., was it removed on multiple masters at almost the same time?
Also, it shows there is/was a child entry of this ou=lab3. Do you see the child entry in id2entry? Is it a normal entry or a tombstone (a normal tombstone or a double tombstone)?
If you could give us your typical use case and (if possible your typical entries), it'd help our debugging effort a lot. Thank you!
Replying to [comment:19 nhosoi]:
That does not sound good... Double tombstoned entry should not be generated. That's a bug. Do you have any log, for instance, operated on this entry? (Could you grep this entry in the access log /var/log/dirsrv/slapd-ID/access?) ou=lab3,ou=projects,o=asken360,ou=organizations,dc=cvsdude,dc=com
Unfortunately the access log is turned off. I'll turn it on to help with debugging.
It shouldn't have been. Our software should only write to one server at a time (we even use a single IP address that we switch from one master to another depending on which is live).
I found one child, and it looks like it's your typical tombstone entry (though of course I'm not an expert at what should be there, so I may be misreading it):
{{{
id 857937
    rdn: nsuniqueid=e4fd2f02-a40011e3-9919a954-a476eb11,cn=svn
    objectClass;vucsn-53167632000009150000: groupOfUniqueNames
    objectClass;vucsn-53167632000009150000: top
    objectClass;vucsn-531677eb000009150000: nsTombstone
    cn;vucsn-53167632000009150000;mdcsn-53167632000009150000: svn
    uniqueMember;vucsn-53167632000009150000: uid=asken360,ou=users,o=asken360,ou=organizations,dc=cvsdude,dc=com
    creatorsName;vucsn-53167632000009150000: cn=directory manager
    modifiersName;vucsn-53167632000009150000: cn=directory manager
    createTimestamp;vucsn-53167632000009150000: 20140305005617Z
    modifyTimestamp;vucsn-53167632000009150000: 20140305005617Z
    nsUniqueId: e4fd2f02-a40011e3-9919a954-a476eb11
    parentid: 857936
    entryid: 857937
    nsParentUniqueId: e4fd2f01-a40011e3-9919a954-a476eb11
    nscpEntryDN: cn=svn,ou=lab3,ou=projects,o=asken360,ou=organizations,dc=cvsdude,dc=com
}}}
The problems seem to be in our organizations and projects, not the users, so I'll go through the basics of that.
Each organisation has an o=ACCOUNT,ou=organizations,dc=cvsdude,dc=com entry.
Below that there are two sub-trees called ou=projects and ou=users. ou=users is exactly what it says on the tin, and ou=projects contains another sub-tree for each project we have for the organisation, with each project sub-tree having a service-specific group below it for permissions.
This is automatically created and deleted by a Perl daemon as needed. In the event that it needs to rebuild it I believe it will delete parts of the tree and replace them with new entries.
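To illustrate the layout described above, the expected DN set for one organisation could be built like this (a sketch only; the function and the "svn" service name are assumptions based on the entries shown elsewhere in this ticket, not our daemon's actual code, which is Perl):

```python
BASE = "dc=cvsdude,dc=com"

def org_dns(org, projects, services=("svn",)):
    """Build the expected DN set for one organisation sub-tree (sketch)."""
    org_dn = f"o={org},ou=organizations,{BASE}"
    dns = [org_dn, f"ou=projects,{org_dn}", f"ou=users,{org_dn}"]
    for project in projects:
        project_dn = f"ou={project},ou=projects,{org_dn}"
        dns.append(project_dn)
        # one service-specific permission group per project
        dns.extend(f"cn={svc},{project_dn}" for svc in services)
    return dns
```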
I'll attach an example of one of our test accounts.
ldapsearch output for the "bees" testing org test_org_ldapsearch_out.txt
Thank you!! Your comment contains lots of useful information.
Replying to [comment:20 tpollard]:
Replying to [comment:19 nhosoi]: That does not sound good... Double tombstoned entry should not be generated. That's a bug. Do you have any log, for instance, operated on this entry? (Could you grep this entry in the access log /var/log/dirsrv/slapd-ID/access?) ou=lab3,ou=projects,o=asken360,ou=organizations,dc=cvsdude,dc=com Unfortunately the access log is turned off. I'll turn it on to help with debugging.
All right. Then, on the crashed host, how about running "dbscan -f /path/to/changelog_db/your_changelog.db4 > /tmp/changelog.out" and searching for ou=lab3? The changelog file looks something like this: "ce76f717-3cef11e3-a3a9b14a-77807b55_52698b42000000010000.db4".
The following is great input! I'm going to set up a test following your tree structure. BTW, how many masters do you have? 2 or more? If more, how are they connected? In a ring or a mesh?
> The problems seem to be in our organizations and projects, not the users, so I'll go through the basics of that.
> Each organisation has a o=ACCOUNT,ou=organizations,dc=cvsdude,dc=com entry from their organisation. Below that there are two sub-trees called ou=projects and ou=users. ou=users is exactly what it says on the tin, and ou=projects contains another sub-tree for each project we have for the organisation, with each project sub-tree having a service specific group below that for permissions.
> This is automatically created and deleted by a Perl daemon as needed. In the event that it needs to rebuild it I believe it will delete parts of the tree and replace them with new entries.
I'm curious about the Perl daemon. How often do the deletion and creation occur? And could any entry under the sub-tree be a busy entry, e.g., searched/updated in the middle of the deletion by the Perl daemon?
> I'll attach an example of one of our test accounts.

Thanks!
Replying to [comment:21 nhosoi]:
OK, I've done that, and I'll attach a text file containing the four entries I found.
2 masters with a shared IP. To switch from one to the other we just change the routing for the IP. Note that the same one (ldap-04) has been the master for several weeks now.
> > This is automatically created and deleted by a Perl daemon as needed. In the event that it needs to rebuild it I believe it will delete parts of the tree and replace them with new entries.
> I'm curious about the Perl daemon. How often the deletion and creation occurs?
About 100 times an hour, but that's with a mixture of orgs. According to the daemon's logs in the past 43 hours there have been 4107 rebuilds, for 1340 different orgs.
I do note that asken360 was rebuilt 8 times during that period, all of them in a short span of 13 minutes between 2014/03/05 00:54:37 and 2014/03/05 01:07:12 (GMT). The replgen values of all the entries I found in the changelog for lab3 were during that 13-minute period as well. (See the attached file.)
Actually, reading through the Perl source code, I think it might be more intelligent than I originally thought. It looks like it generates an internal Perl data structure, compares it with what it gets from LDAP, and deletes or creates entries as needed.
I think I may have found a bug in it now too. It looks like it tries to clear out anything below the o=ORG,ou=organizations,dc=cvsdude level that isn't "supposed" to be there as far as it can see.
I note that according to the logs, at 01:04:02 it deleted or tried to delete nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11,ou=lab3,ou=projects,o=asken360,ou=organizations,dc=cvsdude,dc=com, so it appears it doesn't check whether the excess entry is a tombstone entry. That matches up with a changelog entry for the same time and the same DN.
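For what it's worth, a guard in the daemon's cleanup pass could be as simple as skipping anything whose leading RDN is nsuniqueid= (tombstone DNs start that way). This is hypothetical logic, not our daemon's actual code, and a more robust check would filter on objectClass=nsTombstone instead:

```python
def deletable(dn):
    """Return False for tombstone entries, whose DNs start with an
    nsuniqueid= RDN; only normal entries should be cleaned up."""
    first_rdn = dn.split(",", 1)[0].strip().lower()
    return not first_rdn.startswith("nsuniqueid=")
```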
Would I be correct in thinking that this might cause bad things to happen?
> And could any entries under the sub-tree could be a busy entry like searched/updated in the middle of the deletion by the Perl daemon?
Hopefully not, but our various servers may make read-only requests to find out whether a user is part of a group that's being deleted. (This would be to the same master)
Entries that match lab3 from a dbscan of the changelog dbscan_lab3.txt
Thank you for the changelog data. On my side, I'm running a test script which repeatedly adds and deletes the sub-tree ou=test,ou=projects,o=bees,ou=organizations,dc=cvsdude,dc=com and its children on each master (I set up 3-way MMR). I'm hitting an interesting problem which shows an assertion failure. (Please note assertions are only enabled in the debug build.) Hopefully, we are sharing the same problem...
> I think I may have found a bug with it now too. It looks like it tries to clear out anything below the o=ORG,ou=organizations,dc=cvsdude level that isn't "supposed" to be there as far as it can see.
> I note that according to the logs at 01:04:02 it deleted or tried to delete nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11,ou=lab3,ou=projects,o=asken360,ou=organizations,dc=cvsdude,dc=com so it appears it doesn't check if the excess entry is a tombstone entry. That matches up with a changelog entry for the same time and the same DN.
> Would I be correct in thinking that this might cause bad things to happen?
It shouldn't... A tombstone entry cannot be deleted by a delete request from an LDAP client; the attempt should be safely rejected. But there could be a bug there... Let me continue investigating it, as well. Thanks!
Hi, I have a case which looks like there's some overlap with what you're seeing, particularly the messages like:
{{{
[12/Feb/2014:20:20:46 +0000] entryrdn-index - _entryrdn_delete_key: Failed to remove ou=test; has children
[12/Feb/2014:20:20:46 +0000] - database index operation failed BAD 1031, err=-1 Unknown error: -1
}}}
I have found something which might lead to the solution - see https://fedorahosted.org/389/ticket/47764#comment:1
Fixed as a part of #47750. Marked as duplicate.
Metadata Update from @tpollard: - Issue assigned to nhosoi - Issue set to the milestone: 1.2.11.30
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/1032
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Duplicate)