#47696 Large Searches Hang - Possibly entryrdn related
Closed: wontfix. Opened 10 years ago by tpollard.

Occasionally our LDAP instance gets into a state where various searches hang. Large searches over big chunks of the userRoot database simply stop midway through. This appears to be caused by the entryrdn index becoming corrupt, since the problem can be fixed by regenerating the entryrdn index.

To give an example of a command that fails:

ldapsearch -x -b "dc=ourorg,dc=com" -h localhost -D "cn=directory manager" -w removed objectclass=posixAccount > /dev/shm/ldap_search_output

This hangs after reading a large number of entries. The number of entries changes each time, but I believe it may actually be failing on the same specific entry each time, just reading the entries in a different order.

This can be fixed by stopping ns-slapd and moving /var/lib/dirsrv/slapd-ldap-04/db/userRoot/entryrdn.db4 out of the way before running ./db2index -n userRoot -t entryrdn, but the problem then recurs some time later, often within an hour or two.
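In shell terms, the workaround looks roughly like this (a sketch; the stop/start invocations, the 64-bit install path, and the destination for the moved index file are assumptions based on a standard EL6 389-ds layout):

{{{
# stop the instance, move the suspect index aside, rebuild it, restart
service dirsrv stop ldap-04
mv /var/lib/dirsrv/slapd-ldap-04/db/userRoot/entryrdn.db4 /root/entryrdn.db4.bak
cd /usr/lib64/dirsrv/slapd-ldap-04
./db2index -n userRoot -t entryrdn
service dirsrv start ldap-04
}}}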

Our system (this is a physical server):

$ cat /etc/redhat-release 
CentOS release 6.5 (Final)
$ uname -r
2.6.32-431.3.1.el6.x86_64
$ rpm -q 389-ds
389-ds-1.2.2-1.el6.noarch

A stacktrace taken at the start of a long search, while the problem was present.
stacktrace.1392167403.pre-index_search_start.txt

A stacktrace taken at the start of the same long search after I re-indexed.
stacktrace.1392168177.post-reindex_search_start.txt

One other note that I forgot to include in my main comment is that we're also occasionally seeing these errors when entries are deleted by our programs:

{{{
[12/Feb/2014:20:20:46 +0000] entryrdn-index - _entryrdn_delete_key: Failed to remove ou=test; has children
[12/Feb/2014:20:20:46 +0000] - database index operation failed BAD 1031, err=-1 Unknown error: -1
}}}

They may be related, but they aren't the direct cause, since we sometimes see this problem without any of those errors occurring, and sometimes see these errors without the problem occurring.

dbscan of a backup of the entryrdn.db4 file taken while the system was broken.
entryrdn.db4.broken.dbscan.xz

A dbscan of the entryrdn.db4 while it's working.
entryrdn.db4.working.dbscan.xz

Thank you for the input!

$ rpm -qa | grep 389
389-ds-base-1.2.11.15-31.el6_5.x86_64
389-ds-console-doc-1.2.6-1.el6.noarch
389-admin-1.1.35-1.el6.x86_64
389-ds-console-1.2.6-1.el6.noarch
389-admin-console-doc-1.1.8-1.el6.noarch
389-ds-1.2.2-1.el6.noarch
389-adminutil-1.1.19-1.el6.x86_64
389-console-1.1.7-1.el6.noarch
389-admin-console-1.1.8-1.el6.noarch
389-dsgw-1.1.11-1.el6.x86_64
389-ds-base-debuginfo-1.2.11.15-31.el6_5.x86_64
389-ds-base-libs-1.2.11.15-31.el6_5.x86_64

The last couple of thousand entries in the error log -- includes two re-indexes.
errors

Reading the entryrdn.db4.*.dbscan files, it looks like an entry in the primary db is corrupted, and that breaks the entryrdn index as well...

Could you run dbscan -f /path/to/db/userRoot/id2entry.db -K 276949 and attach the output?
I'm afraid the DN looks like this:
dn: nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9,nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9,ou=nvoq,ou=projects,...

Thanks. OK, I've done that:

{{{
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/id2entry.db4 -K 276949
id 276949
rdn: nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9,nsuniqueid=97563282-843411
e1-9252e9c7-2a0ffae9
nsUniqueId: 97563282-843411e1-9252e9c7-2a0ffae9
objectClass: organizationalUnit
objectClass: top
objectClass;vucsn-52faa506000409150000: nsTombstone
ou: nvoq
creatorsName: cn=directory manager
modifiersName: cn=directory manager
createTimestamp: 20120412001529Z
modifyTimestamp: 20120412001529Z
parentid: 73137
entryid: 276949
tombstoneNumSubordinates: 1
nsParentUniqueId: f976ee35-7f9111de-803bb14c-f78174f9
nscpEntryDN: nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9,ou=nvoq,ou=project
s,o=sublimemedia,ou=organizations,dc=cvsdude,dc=com
}}}

This is a tombstone entry if I'm reading that right, and I think they only exist on servers that are being replicated, correct? This server isn't replicating anything under dc=cvsdude,dc=com; it is configured to replicate, but we haven't created the slave yet.

In case it's relevant: the data it contains was previously on a server that had replication problems. To transfer it, we imported a backup and then used db2ldif to generate an LDIF file before recreating the database and importing the generated LDIF (I wanted to try to clear out any corruption in the backup).

This is a tombstone entry if I'm reading that right, and I think they only exist on servers that are being replicated, correct? This server isn't replicating anything under dc=cvsdude,dc=com; it is configured to replicate, but we haven't created the slave yet.
Yes, this is a tombstone entry, but the RDN is corrupted. The correct RDN for this entry is:
rdn: nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9,ou=nvoq

Because of this corruption, entryrdn gets confused and a duplicated RDN is added to the index. That is what caused the hang.

In case it's relevant: the data it contains was previously on a server that had replication problems. To transfer it, we imported a backup and then used db2ldif to generate an LDIF file before recreating the database and importing the generated LDIF (I wanted to try to clear out any corruption in the backup).

Unfortunately, this entry was generated quite a long time ago. We fixed a related bug in June 2013.

    createTimestamp: 20120412001529Z
    modifyTimestamp: 20120412001529Z

To fix this problem, we have to delete the tombstone entry ID 276949 (and its child ID 582043, which might be broken as well):
{{{
C276949
ID: 582043; RDN: "nsuniqueid=3154a601-3d6111e3-8172e9c7-2a0ffae9,cn=svn"; NRDN: "nsuniqueid=3154a601-3d6111e3-8172e9c7-2a0ffae9,cn=svn"
C276949
ID: 276949; RDN: "nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9,nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9"; NRDN: "nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9,nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9"
}}}

Again, unfortunately, you cannot manually delete tombstone entries using ldapdelete. I wonder why these old tombstones are kept in the database; don't you reap tombstone entries? You could try setting a short time (e.g., 60 seconds) on nsDS5ReplicaPurgeDelay and nsDS5ReplicaTombstonePurgeInterval, then do a modify operation to trigger the reaping. (Please do this with a clean entryrdn, i.e., after reindexing entryrdn...)
https://access.redhat.com/site/documentation/en-US/Red_Hat_Directory_Server/9.0/html/Configuration_Command_and_File_Reference/Core_Server_Configuration_Reference.html#Replication_Attributes_under_cnreplica_cnsuffixName_cnmapping_tree_cnconfig-nsDS5ReplicaPurgeDelay
https://access.redhat.com/site/documentation/en-US/Red_Hat_Directory_Server/9.0/html/Configuration_Command_and_File_Reference/Core_Server_Configuration_Reference.html#Replication_Attributes_under_cnreplica_cnsuffixName_cnmapping_tree_cnconfig-nsDS5ReplicaTombstonePurgeInterval
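For example, something like this ldapmodify could set both attributes on the replica entry (a sketch; the exact quoting of the suffix in the replica DN and the bind credentials depend on your configuration):

{{{
ldapmodify -x -h localhost -D "cn=directory manager" -w REMOVED <<EOF
dn: cn=replica,cn="dc=cvsdude,dc=com",cn=mapping tree,cn=config
changetype: modify
replace: nsDS5ReplicaPurgeDelay
nsDS5ReplicaPurgeDelay: 60
-
replace: nsDS5ReplicaTombstonePurgeInterval
nsDS5ReplicaTombstonePurgeInterval: 60
EOF
}}}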

Once the tombstone purge is done, please run "dbscan -f /path/to/db/userRoot/id2entry.db -K 276949" once again and see if the corrupted entry was purged or not.

If it still exists, you may need to do this raw-level recovery...

{{{
1. stop the server
2. cd /usr/lib[64]/dirsrv/slapd-ID; ./db2ldif -r -n <your_backend_name>
3. Open the exported ldif file and delete the problematic entries from the ldif file.
dn: nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9,ou=nvoq,ou=projects,o=sublimemedia,ou=organizations,dc=cvsdude,dc=com
dn: nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9,nsuniqueid=97563282-843411e1-9252e9c7-2a0ffae9,ou=nvoq,ou=projects,o=sublimemedia,ou=organizations,dc=cvsdude,dc=com
If you see any child/grandchild entries (whose DN includes ou=nvoq,ou=projects,o=sublimemedia,ou=organizations,dc=cvsdude,dc=com), they need to be deleted, too.
4. Import the edited ldif file on the master (see the ldif2db sketch just after this procedure).
5. Start the server and reinitialize replicas.
}}}
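For step 4, the import can typically be done with ldif2db from the same instance directory, for example (a sketch; the backend name and the path to the edited ldif are placeholders):

{{{
cd /usr/lib[64]/dirsrv/slapd-ID
./ldif2db -n <your_backend_name> -i /path/to/edited.ldif
}}}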

Thank you for your help, that seems to have removed the corrupted entries:

{{{
$ dbscan -f db/userRoot/id2entry.db4 -K 276949
Can't set cursor to returned item: DB_NOTFOUND: No matching key/data pair found
$ dbscan -f db/userRoot/id2entry.db4 -K 582043
Can't set cursor to returned item: DB_NOTFOUND: No matching key/data pair found
}}}

When I was looking at the replica agreements I noticed that one of our databases (we have two on the same server) already had nsDS5ReplicaPurgeDelay set to its default value. We haven't had these sorts of problems on that database, only on the other. Is it possible that it never purges them unless you specifically set that option?

(I've specifically set it and nsDS5ReplicaTombstonePurgeInterval on both of our replicas to their "defaults" to be on the safe side now)

We're seeing similar problems again. How would we go about checking for such corruption ourselves?

(Would you prefer I take this discussion to the mailing list?)

Replying to [comment:8 tpollard]:

We're seeing similar problems again. How would we go about checking for such corruption ourselves?

(Would you prefer I take this discussion to the mailing list?)

You may want to dump id2entry with "dbscan -f /path/to/id2entry.db4 > id2entry.db4.out" and then search it for doubled nsuniqueid RDNs:
egrep -i "nsuniqueid=.*,nsuniqueid=.*" id2entry.db4.out

If you see them, I'd like you to clean them up. If not, please attach the entryrdn dump one more time.

Thanks!

Thanks again. I tried setting the purges down to 60 seconds to clear out the old entries now, but we're still seeing the hangs, and we've got another weird issue now:

{{{
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/id2entry.db4 | egrep -i "nsuniqueid=.*,nsuniqueid=.*"
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/entryrdn.db4 | egrep -i "nsuniqueid=.*,nsuniqueid=.*"
ID: 805177; RDN: "nsuniqueid=48873a01-980f11e3-940aa954-a476eb11,nsuniqueid=48873a01-980f11e3-940aa954-a476eb11"; NRDN: "nsuniqueid=48873a01-980f11e3-940aa954-a476eb11,nsuniqueid=48873a01-980f11e3-940aa954-a476eb11"
}}}

One of my co-workers rebuilt the index just a few hours ago, so this must have started since then.

The nsds5ReplicaPurgeDelay and nsds5ReplicaTombstonePurgeInterval settings are both on 60 seconds at the moment. Could that be causing this oddness with the index?

We've also seen several segfaults over the last 24 hours; is there any chance they are related? (We're going to try to get core dumps for them if they happen again today so we can report them properly.)

Replying to [comment:10 tpollard]:

Thanks again. I tried setting the purges down to 60 seconds to clear out the old entries now, but we're still seeing the hangs, and we've got another weird issue now:

{{{
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/id2entry.db4 | egrep -i "nsuniqueid=.*,nsuniqueid=.*"
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/entryrdn.db4 | egrep -i "nsuniqueid=.*,nsuniqueid=.*"
ID: 805177; RDN: "nsuniqueid=48873a01-980f11e3-940aa954-a476eb11,nsuniqueid=48873a01-980f11e3-940aa954-a476eb11"; NRDN: "nsuniqueid=48873a01-980f11e3-940aa954-a476eb11,nsuniqueid=48873a01-980f11e3-940aa954-a476eb11"
}}}

One of my co-workers rebuilt the index just a few hours ago, so this must have started since then.

Could you please attach the output from "dbscan -f db/userRoot/entryrdn.db4" one more time? Also, could you find the entry having the nsuniqueid 48873a01-980f11e3-940aa954-a476eb11? I'd like to see what it looks like.

The nsds5ReplicaPurgeDelay and nsds5ReplicaTombstonePurgeInterval settings are both on 60 seconds at the moment. Could that be causing this oddness with the index?

It should not be. That combination should work...

We've also seen several segfaults over the last 24 hours; is there any chance they are related? (We're going to try to get core dumps for them if they happen again today so we can report them properly.)

Yes, if the entryrdn is corrupted, it could easily crash the server... And yes, the stacktrace should be helpful for understanding what is happening...

Replying to [comment:11 nhosoi]:

Could you please attach the output from "dbscan -f db/userRoot/entryrdn.db4" one more time? Also, could you find the entry having the nsuniqueid 48873a01-980f11e3-940aa954-a476eb11? I'd like to see what it looks like.

I'm generating the dbscan of the entryrdn.db4 now and I'll attach it after I finish compressing it.

Looks like it doesn't exist anywhere except in the entryrdn index:
{{{
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ ldapsearch -x -b 'dc=cvsdude,dc=com' nsuniqueid=48873a01-980f11e3-940aa954-a476eb11
# extended LDIF
#
# LDAPv3
# base <dc=cvsdude,dc=com> with scope subtree
# filter: nsuniqueid=48873a01-980f11e3-940aa954-a476eb11
# requesting: ALL
#

# search result
search: 2
result: 0 Success

# numResponses: 1

[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/id2entry.db4 | grep '48873a01-980f11e3-940aa954-a476eb11'
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/entryrdn.db4 | grep '48873a01-980f11e3-940aa954-a476eb11'
ID: 805177; RDN: "nsuniqueid=48873a01-980f11e3-940aa954-a476eb11,nsuniqueid=48873a01-980f11e3-940aa954-a476eb11"; NRDN: "nsuniqueid=48873a01-980f11e3-940aa954-a476eb11,nsuniqueid=48873a01-980f11e3-940aa954-a476eb11"
}}}

We've also seen several segfaults over the last 24 hours; is there any chance they are related? (We're going to try to get core dumps for them if they happen again today so we can report them properly.)

Yes, if the entryrdn is corrupted, it could easily crash the server... And yes, the stacktrace should be helpful for understanding what is happening...

I'll see what I can do about getting a core dump (or did you want a stacktrace of it while it's running?)

What does this CLI return?
dbscan -f db/userRoot/id2entry.db4 -K 805177

And ldapsearch does not return tombstone entries by default. Does this return any entries? If yes, could you find nsuniqueid=48873a01-980f11e3-940aa954-a476eb11 in the search output?
ldapsearch -x -b 'dc=cvsdude,dc=com' -D "cn=directory manager" -w your_pw "(objectclass=nstombstone)"

Replying to [comment:13 nhosoi]:

What does this CLI return?
dbscan -f db/userRoot/id2entry.db4 -K 805177

{{{
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/id2entry.db4 -K 805177
Can't set cursor to returned item: DB_NOTFOUND: No matching key/data pair found
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/entryrdn.db4 | egrep -i "nsuniqueid=.*,nsuniqueid=.*"
ID: 805177; RDN: "nsuniqueid=48873a01-980f11e3-940aa954-a476eb11,nsuniqueid=48873a01-980f11e3-940aa954-a476eb11"; NRDN: "nsuniqueid=48873a01-980f11e3-940aa954-a476eb11,nsuniqueid=48873a01-980f11e3-940aa954-a476eb11"
}}}

And ldapsearch does not return tombstone entries by default. Does this return any entries? If yes, could you find nsuniqueid=48873a01-980f11e3-940aa954-a476eb11 in the search output?
ldapsearch -x -b 'dc=cvsdude,dc=com' -D "cn=directory manager" -w your_pw "(objectclass=nstombstone)"

It does return results, but none of them match 48873a01-980f11e3-940aa954-a476eb11:

{{{
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ ldapsearch -x -b 'dc=cvsdude,dc=com' -D "cn=directory manager" -w REMOVED "(objectclass=nstombstone)"
# extended LDIF
#
# LDAPv3
# base <dc=cvsdude,dc=com> with scope subtree
# filter: (objectclass=nstombstone)
# requesting: ALL
#

# ffffffff-ffffffff-ffffffff-ffffffff, cvsdude.com
dn: nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,dc=cvsdude,dc=com
objectClass: top
objectClass: nsTombstone
objectClass: extensibleobject
[snip]

# search result
search: 2
result: 0 Success

# numResponses: 9
# numEntries: 8

[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ ldapsearch -x -b 'dc=cvsdude,dc=com' -D "cn=directory manager" -w REMOVED "(objectclass=nstombstone)" | grep 48873a01-980f11e3-940aa954-a476eb11
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$
}}}

Okay, now you have no tombstone entries in your primary db (id2entry.db4).

But reindexing entryrdn still generates corrupted nodes with a double nsuniqueid? If so, that is really bad...

Also, entryrdn refers to ID: 805177, but that ID does not exist in id2entry.db4, which is strange. I'm wondering whether, at the time the corrupted entryrdn node was created, ID: 805177 still existed; then the tombstone entry was purged, but due to the corrupted entryrdn node, the node in the entryrdn index failed to be deleted...?

Do you happen to keep a backup (hopefully)? If yes, could you run dbscan -f /path/to/backup/db/id2entry.db4 -K 805177? Our understanding is that the double-nsuniqueid bug used to be in the primary db, and entryrdn gets broken by the double-nsuniqueid'ed DN. But if the double-nsuniqueid node in entryrdn could be created from a healthy tombstone entry, that is a different issue. I'd like to know whether you are hitting the latter possibility...

Replying to [comment:15 nhosoi]:

Okay, now you have no tombstone entries in your primary db (id2entry.db4).

There are tombstone entries; I just snipped them from the output I sent you.

But reindexing entryrdn still generates corrupted nodes with a double nsuniqueid? If so, that is really bad...

I suspect it might have been created in the four or five hours between when my colleague reindexed the database and when I was checking it this morning.

When I reindexed the database just now, it went away:

{{{
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/entryrdn.db4 | egrep -i "nsuniqueid=.*,nsuniqueid=.*"
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$
}}}

Also, entryrdn refers to ID: 805177, but that ID does not exist in id2entry.db4, which is strange. I'm wondering whether, at the time the corrupted entryrdn node was created, ID: 805177 still existed; then the tombstone entry was purged, but due to the corrupted entryrdn node, the node in the entryrdn index failed to be deleted...?

Do you happen to keep a backup (hopefully)? If yes, could you run dbscan -f /path/to/backup/db/id2entry.db4 -K 805177? Our understanding is that the double-nsuniqueid bug used to be in the primary db, and entryrdn gets broken by the double-nsuniqueid'ed DN. But if the double-nsuniqueid node in entryrdn could be created from a healthy tombstone entry, that is a different issue. I'd like to know whether you are hitting the latter possibility...

Yes, we take backups every twelve hours. I checked as far back as we've got, which is nearly two weeks, and none of them contained 805177. It's possible that it only existed for less than 12 hours, though.

Hello,

Do you have any updates on this issue? If you haven't seen the double tombstone problem since you cleaned them up, can we close this ticket for now? If you see the problem again, you can reopen it.

Thank you for your help,
--noriko

Hi Noriko,

The double tombstone entries keep coming back. We had to do another index rebuild this morning and there's also this:

{{{
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/entryrdn.db4 | egrep -i "nsuniqueid=.*,nsuniqueid=.*"
ID: 857936; RDN: "nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11,nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11"; NRDN: "nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11,nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11"
ID: 857936; RDN: "nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11,nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11"; NRDN: "nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11,nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11"
ID: 566309; RDN: "nsuniqueid=d5cc220a-34ef11e3-9c3ee9c7-2a0ffae9,nsuniqueid=d5cc220a-34ef11e3-9c3ee9c7-2a0ffae9"; NRDN: "nsuniqueid=d5cc220a-34ef11e3-9c3ee9c7-2a0ffae9,nsuniqueid=d5cc220a-34ef11e3-9c3ee9c7-2a0ffae9"
ID: 709711; RDN: "nsuniqueid=88d54183-7da511e3-9c7d92f7-d9a18ca5,nsuniqueid=88d54183-7da511e3-9c7d92f7-d9a18ca5"; NRDN: "nsuniqueid=88d54183-7da511e3-9c7d92f7-d9a18ca5,nsuniqueid=88d54183-7da511e3-9c7d92f7-d9a18ca5"
ID: 820687; RDN: "nsuniqueid=ed7e6d01-9ac411e3-b352a954-a476eb11,nsuniqueid=ed7e6d01-9ac411e3-b352a954-a476eb11"; NRDN: "nsuniqueid=ed7e6d01-9ac411e3-b352a954-a476eb11,nsuniqueid=ed7e6d01-9ac411e3-b352a954-a476eb11"
ID: 820687; RDN: "nsuniqueid=ed7e6d01-9ac411e3-b352a954-a476eb11,nsuniqueid=ed7e6d01-9ac411e3-b352a954-a476eb11"; NRDN: "nsuniqueid=ed7e6d01-9ac411e3-b352a954-a476eb11,nsuniqueid=ed7e6d01-9ac411e3-b352a954-a476eb11"
ID: 821901; RDN: "nsuniqueid=bf6d4d13-9b0511e3-b352a954-a476eb11,nsuniqueid=bf6d4d13-9b0511e3-b352a954-a476eb11"; NRDN: "nsuniqueid=bf6d4d13-9b0511e3-b352a954-a476eb11,nsuniqueid=bf6d4d13-9b0511e3-b352a954-a476eb11"
ID: 824198; RDN: "nsuniqueid=acf22282-9b9c11e3-b6c1a954-a476eb11,nsuniqueid=acf22282-9b9c11e3-b6c1a954-a476eb11"; NRDN: "nsuniqueid=acf22282-9b9c11e3-b6c1a954-a476eb11,nsuniqueid=acf22282-9b9c11e3-b6c1a954-a476eb11"
ID: 841628; RDN: "nsuniqueid=42911601-9fd211e3-b314a954-a476eb11,nsuniqueid=42911601-9fd211e3-b314a954-a476eb11"; NRDN: "nsuniqueid=42911601-9fd211e3-b314a954-a476eb11,nsuniqueid=42911601-9fd211e3-b314a954-a476eb11"
ID: 846125; RDN: "nsuniqueid=efc66584-a08811e3-b779a954-a476eb11,nsuniqueid=efc66584-a08811e3-b779a954-a476eb11"; NRDN: "nsuniqueid=efc66584-a08811e3-b779a954-a476eb11,nsuniqueid=efc66584-a08811e3-b779a954-a476eb11"
ID: 851726; RDN: "nsuniqueid=d150ac46-a2cc11e3-ba13a954-a476eb11,nsuniqueid=d150ac46-a2cc11e3-ba13a954-a476eb11"; NRDN: "nsuniqueid=d150ac46-a2cc11e3-ba13a954-a476eb11,nsuniqueid=d150ac46-a2cc11e3-ba13a954-a476eb11"
ID: 851726; RDN: "nsuniqueid=d150ac46-a2cc11e3-ba13a954-a476eb11,nsuniqueid=d150ac46-a2cc11e3-ba13a954-a476eb11"; NRDN: "nsuniqueid=d150ac46-a2cc11e3-ba13a954-a476eb11,nsuniqueid=d150ac46-a2cc11e3-ba13a954-a476eb11"
ID: 852972; RDN: "nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11,nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11"; NRDN: "nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11,nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11"
ID: 852972; RDN: "nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11,nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11"; NRDN: "nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11,nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11"
ID: 852972; RDN: "nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11,nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11"; NRDN: "nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11,nsuniqueid=f44d8101-a30911e3-8f89a954-a476eb11"

[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/id2entry.db4 -K 857936
id 857936
rdn: nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11,nsuniqueid=e4fd2f01-a40011
e3-9919a954-a476eb11
objectClass;vucsn-53167631000009150000: organizationalUnit
objectClass;vucsn-53167631000009150000: top
objectClass;vucsn-53167803000009150000: nsTombstone
ou;vucsn-53167631000009150000;mdcsn-53167631000009150000: lab3
creatorsName;vucsn-53167631000009150000: cn=directory manager
modifiersName;vucsn-53167631000009150000: cn=directory manager
createTimestamp;vucsn-53167631000009150000: 20140305005616Z
modifyTimestamp;vucsn-53167631000009150000: 20140305005616Z
nsUniqueId: e4fd2f01-a40011e3-9919a954-a476eb11
parentid: 857935
entryid: 857936
tombstoneNumSubordinates: 1
nsParentUniqueId: 9d76a305-a40011e3-9919a954-a476eb11
nscpEntryDN: nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11,ou=lab3,ou=project
s,o=asken360,ou=organizations,dc=cvsdude,dc=com
}}}

Also, is this normal given those entries:
{{{
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/entryrdn.db4 -K 857936
Can't set cursor to returned item: DB_NOTFOUND: No matching key/data pair found
[root@ldap-04]/var/lib/dirsrv/slapd-ldap-04$ dbscan -f db/userRoot/entryrdn.db4 -K 852972
Can't set cursor to returned item: DB_NOTFOUND: No matching key/data pair found
}}}

We also have both nsds5ReplicaPurgeDelay and nsds5ReplicaTombstonePurgeInterval set to 60.

I'm not sure whether that's the only problem. We've seen a few hangs, segfaults, and general protection faults over the past few weeks. I'll attach a stacktrace from one of the hangs and one from one of the segfaults to this ticket in a few moments.

How would we check if they are related?

Stacktrace from the core dump of a recent segfault.
stacktrace.1393982189.segfault_core_dump.txt

That does not sound good... A double-tombstoned entry should not be generated. That's a bug.

Do you have any log of operations on this entry? (Could you grep for this entry in the access log /var/log/dirsrv/slapd-ID/access?)
ou=lab3,ou=projects,o=asken360,ou=organizations,dc=cvsdude,dc=com
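Something like this, for example (a sketch; adjust the instance name in the log path to yours):

{{{
grep "ou=lab3,ou=projects,o=asken360,ou=organizations,dc=cvsdude,dc=com" /var/log/dirsrv/slapd-ID/access*
}}}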

I guess this entry was deleted at some time. I'm interested in how it happened, e.g., was it removed on multiple masters at almost the same time?

Also, it shows there is/was a child entry of this ou=lab3. Do you see the child entry in the id2entry? Is it a normal entry or a tombstone (a normal tombstone or a double tombstone)?

If you could give us your typical use case (and, if possible, your typical entries), it would help our debugging effort a lot. Thank you!

Replying to [comment:19 nhosoi]:

That does not sound good... A double-tombstoned entry should not be generated. That's a bug.

Do you have any log of operations on this entry? (Could you grep for this entry in the access log /var/log/dirsrv/slapd-ID/access?)
ou=lab3,ou=projects,o=asken360,ou=organizations,dc=cvsdude,dc=com

Unfortunately the access log is turned off. I'll turn it on to help with debugging.

I guess this entry was deleted at some time. I'm interested in how it happened, e.g., was it removed on multiple masters at almost the same time?

It shouldn't have been. Our software should only write to one server at a time (we even use a single IP address that we switch from one master to another depending on which is live).

Also, it shows there is/was a child entry of this ou=lab3. Do you see the child entry in the id2entry? Is it a normal entry or a tombstone (a normal tombstone or a double tombstone)?

I found one child, and it looks like it's your typical tombstone entry (though of course I'm not an expert at what should be there, so I may be misreading it):

{{{
id 857937
rdn: nsuniqueid=e4fd2f02-a40011e3-9919a954-a476eb11,cn=svn
objectClass;vucsn-53167632000009150000: groupOfUniqueNames
objectClass;vucsn-53167632000009150000: top
objectClass;vucsn-531677eb000009150000: nsTombstone
cn;vucsn-53167632000009150000;mdcsn-53167632000009150000: svn
uniqueMember;vucsn-53167632000009150000: uid=asken360,ou=users,o=asken360,ou=o
rganizations,dc=cvsdude,dc=com
creatorsName;vucsn-53167632000009150000: cn=directory manager
modifiersName;vucsn-53167632000009150000: cn=directory manager
createTimestamp;vucsn-53167632000009150000: 20140305005617Z
modifyTimestamp;vucsn-53167632000009150000: 20140305005617Z
nsUniqueId: e4fd2f02-a40011e3-9919a954-a476eb11
parentid: 857936
entryid: 857937
nsParentUniqueId: e4fd2f01-a40011e3-9919a954-a476eb11
nscpEntryDN: cn=svn,ou=lab3,ou=projects,o=asken360,ou=organizations,dc=cvsdude
,dc=com
}}}

If you could give us your typical use case (and, if possible, your typical entries), it would help our debugging effort a lot. Thank you!

The problems seem to be in our organizations and projects, not the users, so I'll go through the basics of that.

Each organisation has an o=ACCOUNT,ou=organizations,dc=cvsdude,dc=com entry for their organisation.

Below that there are two sub-trees called ou=projects and ou=users. ou=users is exactly what it says on the tin, and ou=projects contains another sub-tree for each project we have for the organisation, with each project sub-tree having a service-specific group below it for permissions.

This is automatically created and deleted by a Perl daemon as needed. In the event that it needs to rebuild, I believe it deletes parts of the tree and replaces them with new entries.

I'll attach an example of one of our test accounts.

ldapsearch output for the "bees" testing org
test_org_ldapsearch_out.txt

Thank you!! Your comment contains lots of useful information.

Replying to [comment:20 tpollard]:

Replying to [comment:19 nhosoi]:

That does not sound good... A double-tombstoned entry should not be generated. That's a bug.

Do you have any log of operations on this entry? (Could you grep for this entry in the access log /var/log/dirsrv/slapd-ID/access?)
ou=lab3,ou=projects,o=asken360,ou=organizations,dc=cvsdude,dc=com

Unfortunately the access log is turned off. I'll turn it on to help with debugging.

All right. Then, on the crashed host, how about running "dbscan -f /path/to/changelog_db/your_changelog.db4 > /tmp/changelog.out" and searching for ou=lab3? The changelog file name looks something like this: "ce76f717-3cef11e3-a3a9b14a-77807b55_52698b42000000010000.db4".

The following is great input! I'm going to set up an environment following your tree structure. BTW, how many masters do you have? 2 or more? If more, how are they connected? Could it be a circle shape or a mesh?

The problems seem to be in our organizations and projects, not the users, so I'll go through the basics of that.

Each organisation has an o=ACCOUNT,ou=organizations,dc=cvsdude,dc=com entry for their organisation.

Below that there are two sub-trees called ou=projects and ou=users. ou=users is exactly what it says on the tin, and ou=projects contains another sub-tree for each project we have for the organisation, with each project sub-tree having a service-specific group below it for permissions.

This is automatically created and deleted by a Perl daemon as needed. In the event that it needs to rebuild, I believe it deletes parts of the tree and replaces them with new entries.

I'm curious about the Perl daemon. How often do the deletions and creations occur? And could any entry under the sub-tree be a busy entry, i.e., searched/updated in the middle of the deletion by the Perl daemon?

I'll attach an example of one of our test accounts.
Thanks!

Replying to [comment:21 nhosoi]:

All right. Then, on the crashed host, how about running "dbscan -f /path/to/changelog_db/your_changelog.db4 > /tmp/changelog.out" and searching for ou=lab3? The changelog file name looks something like this: "ce76f717-3cef11e3-a3a9b14a-77807b55_52698b42000000010000.db4".

OK, I've done that, and I'll attach a text file containing the four entries I found.

The following is great input! I'm going to set up an environment following your tree structure. BTW, how many masters do you have? 2 or more? If more, how are they connected? Could it be a circle shape or a mesh?

2 masters with a shared IP. To switch from one to the other we just change the routing for the IP. Note that the same one (ldap-04) has been the master for several weeks now.

This is automatically created and deleted by a Perl daemon as needed. In the event that it needs to rebuild, I believe it deletes parts of the tree and replaces them with new entries.

I'm curious about the Perl daemon. How often do the deletions and creations occur?

About 100 times an hour, but that's with a mixture of orgs. According to the daemon's logs in the past 43 hours there have been 4107 rebuilds, for 1340 different orgs.

I do note that asken360 was rebuilt 8 times during that period, all of them in a short span of 13 minutes between 2014/03/05 00:54:37 and 2014/03/05 01:07:12 (GMT). The replgen of all the entries I found in the changelog for lab3 was during that 13-minute period as well (see the attached file).

Actually, reading through the Perl source code, I think it might be more intelligent than I originally thought. It looks like it generates an internal Perl data structure, compares it with what it gets from LDAP, and deletes or creates entries as needed.

I think I may have found a bug with it now too. It looks like it tries to clear out anything below the o=ORG,ou=organizations,dc=cvsdude level that isn't "supposed" to be there as far as it can see.

I note that, according to the logs, at 01:04:02 it deleted or tried to delete nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11,ou=lab3,ou=projects,o=asken360,ou=organizations,dc=cvsdude,dc=com, so it appears it doesn't check whether the excess entry is a tombstone entry. That matches up with a changelog entry for the same time and the same DN.

Would I be correct in thinking that this might cause bad things to happen?

And could any entry under the sub-tree be a busy entry, i.e., searched/updated in the middle of the deletion by the Perl daemon?

Hopefully not, but our various servers may make read-only requests to find out whether a user is part of a group that's being deleted. (This would be to the same master)

Entries that match lab3 from a dbscan of the changelog
dbscan_lab3.txt

Thank you for the changelog data. On my side, I'm running a test script which repeatedly adds and deletes the sub-tree ou=test,ou=projects,o=bees,ou=organizations,dc=cvsdude,dc=com and its children on each master (I set up a 3-way MMR). I'm hitting an interesting problem which shows an assertion failure. (Please note assertions are only enabled in the debug build.) Hopefully, we are sharing the same problem...

I think I may have found a bug with it now too. It looks like it tries to clear out anything below the o=ORG,ou=organizations,dc=cvsdude level that isn't "supposed" to be there as far as it can see.
I note that, according to the logs, at 01:04:02 it deleted or tried to delete nsuniqueid=e4fd2f01-a40011e3-9919a954-a476eb11,ou=lab3,ou=projects,o=asken360,ou=organizations,dc=cvsdude,dc=com, so it appears it doesn't check whether the excess entry is a tombstone entry. That matches up with a changelog entry for the same time and the same DN.
Would I be correct in thinking that this might cause bad things to happen?

It shouldn't... A tombstone entry is not allowed to be deleted by a delete request from LDAP clients. The attempt should be safely rejected. But there could be a bug there... Let me continue investigating it, as well.
Thanks!

Hi, I have a case which looks like it has some overlap with what you're seeing, particularly messages like:
{{{
[12/Feb/2014:20:20:46 +0000] entryrdn-index - _entryrdn_delete_key: Failed to remove ou=test; has children
[12/Feb/2014:20:20:46 +0000] - database index operation failed BAD 1031, err=-1 Unknown error: -1
}}}

I have found something which might lead to the solution - see https://fedorahosted.org/389/ticket/47764#comment:1

Fixed as a part of #47750.
Marked as duplicate.

Metadata Update from @tpollard:
- Issue assigned to nhosoi
- Issue set to the milestone: 1.2.11.30

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/1032

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Duplicate)

3 years ago
