Issue #539: mm on app4 mostly hung - fedora-infrastructure

Closed: Fixed None Opened 15 years ago by mdomsch.

MirrorManager code running on app4 and hitting db2 is failing. This has caused the mirrorlist and publiclist pages to stop refreshing on 5/10, and the MM admin web interface is unusable.

I killed and restarted start-mirrors on app4, which seems to have resolved it for now.

mdomsch commented 15 years ago

I'm going to re-open this, because it's still taking >10 minutes to render each publiclist page, when it usually takes about 10 minutes to render all the publiclist pages.

mdomsch commented 15 years ago

and now it's not completing any page renders again at all. :-(

toshio commented 15 years ago

{{{
mirrormanager=# select * from pgstattuple('host_category_dir');
-[ RECORD 1 ]------+----------
table_len | 733134848
tuple_count | 200905
tuple_len | 17617190
tuple_percent | 2.4
dead_tuple_count | 7408720
dead_tuple_len | 649852556
dead_tuple_percent | 88.64
free_space | 2734892
free_percent | 0.37
}}}

The hourly cron job that's vacuuming these tables has to go through three that are very large. And it's still chugging away at the first one after most of an hour.

I think that the db needs a vacuum full to trim off the 600MB of dead tuples. I'm shutting down the mirrormanager admin interface and doing that to deal with this. Holler if something besides the admin interface breaks because of this.
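For reference, the dead-tuple share in the output above can be recomputed from the raw byte counts. This is a minimal Python sketch, assuming the expanded psql layout shown above; the helper names `parse_pgstattuple` and `dead_fraction` are illustrative and not part of MirrorManager or the database SOP.

```python
def parse_pgstattuple(text):
    """Parse psql expanded-format pgstattuple output into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # skip the prompt, the RECORD header, and blank lines
        key, _, value = line.partition("|")
        stats[key.strip()] = float(value)
    return stats


def dead_fraction(stats):
    """Fraction of the table's on-disk bytes occupied by dead tuples."""
    return stats["dead_tuple_len"] / stats["table_len"]


# Output copied from the pgstattuple run above.
SAMPLE = """\
-[ RECORD 1 ]------+----------
table_len | 733134848
tuple_count | 200905
tuple_len | 17617190
tuple_percent | 2.4
dead_tuple_count | 7408720
dead_tuple_len | 649852556
dead_tuple_percent | 88.64
free_space | 2734892
free_percent | 0.37
"""

stats = parse_pgstattuple(SAMPLE)
print(round(100 * dead_fraction(stats), 2))  # agrees with dead_tuple_percent
```

Roughly 650 MB of the 733 MB table is dead tuples, which is why the hourly vacuum has so much to walk through.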

toshio commented 15 years ago

Alright. We're back to normal operations for now:

{{{
mirrormanager=# select * from pgstattuple('host_category_dir');
-[ RECORD 1 ]------+---------
table_len | 45891584
tuple_count | 201854
tuple_len | 17676390
tuple_percent | 38.52
dead_tuple_count | 18510
dead_tuple_len | 1523848
dead_tuple_percent | 3.32
free_space | 23678696
free_percent | 51.6
}}}

mdomsch, do you have a MirrorManager SOP somewhere? If so, this recipe should probably go in there somewhere:

 1. Turn off crond on app4 so that the mirror caching script doesn't try to pull data from the db during this time.
 2. Turn off vacuuming of mirrormanager in the cron jobs on puppet1 (configs/db/vacuum-hourly and configs/db/vacuum-daily), then push the changes to db2.
 3. Turn off mirrormanager (the admin interface) on all the app servers using supervisor.
 4. On db2, make sure all the queries that were accessing mirrormanager are shut down. If not, try to cancel them by sending them a SIGINT. (See the database SOP for details on listing current queries.)
 5. Run a full vacuum of the mirrormanager database: screen ; sudo -u postgres /usr/bin/vacuumdb -v -f -d mirrormanager
 6. Run a pgstattuple (database SOP) just to be sure things worked.
 7. Re-enable everything you turned off earlier.
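The mechanical parts of the recipe above could be captured as a dry-run plan. This is only a sketch: the `service crond` and `supervisorctl` invocations and the supervisor program name `mirrormanager` are assumptions about how these hosts are managed, and only the `vacuumdb` command line is taken from the recipe. The puppet edit, the query check on db2, and the pgstattuple verification stay manual.

```python
import shlex


def full_vacuum_plan(database="mirrormanager", supervisor_program="mirrormanager"):
    """Return the scriptable recipe steps as ordered (host, argv) pairs.

    Nothing is executed here; the plan is only printed for review.
    The crond and supervisorctl commands are assumptions, not from the SOP.
    """
    vacuum = ["sudo", "-u", "postgres", "/usr/bin/vacuumdb",
              "-v", "-f", "-d", database]
    return [
        ("app4", ["service", "crond", "stop"]),
        # Manual: disable the vacuum cron jobs in puppet and push to db2.
        ("all app servers", ["supervisorctl", "stop", supervisor_program]),
        # Manual: confirm on db2 that no queries still touch the database.
        ("db2", vacuum),
        # Manual: run pgstattuple to confirm the table shrank.
        ("all app servers", ["supervisorctl", "start", supervisor_program]),
        ("app4", ["service", "crond", "start"]),
    ]


for host, argv in full_vacuum_plan():
    print(f"{host}: {shlex.join(argv)}")
```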

Note that we're going to have to figure out what's going wrong and how we can fix this eventually. Some things to explore:
* Does the database gradually get worse or does something get out of whack and then it gets worse quickly once that happens?
* Test this by taking periodic pgstattuples to see if the cron job is able to keep the total table size relatively constant (ie: dead tuples may grow but the cron job should vacuum those and allocate them to free space to be reused.)
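The periodic check described above amounts to comparing `table_len` across samples. A rough Python sketch follows; the 10% tolerance and the function names are assumptions chosen for illustration, and the two sample values are simply the post-vacuum and bloated `table_len` figures from this ticket, ordered as if the table had grown from one to the other.

```python
def table_growth(samples):
    """Relative growth of table_len between the first and last sample."""
    first = samples[0]["table_len"]
    last = samples[-1]["table_len"]
    return (last - first) / first


def vacuum_is_keeping_up(samples, tolerance=0.10):
    """True if the table grew by no more than `tolerance` over the samples.

    The tolerance is an arbitrary illustrative threshold, not a tuned value.
    """
    return table_growth(samples) <= tolerance


# table_len values taken from the two pgstattuple runs in this ticket.
samples = [{"table_len": 45891584}, {"table_len": 733134848}]
print(vacuum_is_keeping_up(samples))
```

If the routine vacuum were reclaiming dead tuples into free space, `table_len` would stay roughly flat even while dead tuples come and go.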

Some ideas for things that may make this better at the mirrormanager level:

 1. Have mirrormanager's sync to mirrorlist do the vacuum. If we're in some sort of race where the sync and the vacuum cannot be done at the same time, serializing the two operations in the same script could help.
 2. Have mirrormanager's sync create a temp table and do its select from that. If it's significantly faster to create the temp table than to select directly from the host_category_dir, host, and directory tables, selecting from those tables into temp tables and then performing the sync's select on them may be an option to keep the vacuum and the select from both wanting to access the same table.
 3. Figure out how to stop updating so many rows in mirrormanager. Mirrormanager updates more rows than any of our other databases. This causes more dead tuples than any of our other dbs. Perhaps we can change this mode of operation and things will become faster.

Note that 1 and 2 are predicated on there being some sort of contention between the sync script's selects and the vacuums being run from cron. This has not been proven to be true so they might not help. 3 should be helpful no matter what the cause of this issue but I don't know how realistic it is or if it will cause performance problems within mirrormanager.
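Idea 1 could look something like the following Python sketch, which runs the vacuum to completion before the sync starts. The sync command path is a hypothetical placeholder, and the demo uses a recording runner instead of really invoking anything; with the default `subprocess.check_call`, a failing vacuum would raise and the sync would not run.

```python
import subprocess


def run_serialized(commands, run=subprocess.check_call):
    """Run each command to completion, in order, before starting the next."""
    for argv in commands:
        run(argv)


# vacuumdb flags follow the commands quoted elsewhere in this ticket.
VACUUM = ["/usr/bin/vacuumdb", "-d", "mirrormanager", "-t", "host_category_dir"]
# Hypothetical placeholder for the mirrorlist sync script.
SYNC = ["/path/to/mirrorlist-sync"]

# Dry run: record the commands instead of executing them.
executed = []
run_serialized([VACUUM, SYNC], run=executed.append)
print(executed)
```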

toshio commented 15 years ago

Log of the vacuum full run
mirrormanager-full.txt

toshio commented 15 years ago

Attached the log of the vacuum full run. Next time this occurs we may want to log the output of the normal vacuum before doing the vacuum full to see if we can get more info on why it's not clearing out the dead tuples. This time around, I just let the cron job invoke vacuum and then noted via pgstattuple afterwards that it didn't seem to have done anything.

Command that will produce logs but otherwise be like the cron job:
{{{
sudo -u postgres /usr/bin/vacuumdb -v -d mirrormanager -t host_category_dir
}}}

mdomsch commented 15 years ago

There are 3 places in the MM codepath that update host_category_dir tuples unnecessarily: twice in the crawler, and once in the report_mirror checkin. The patch below is queued to go in after the change freeze to eliminate these extraneous updates. This should have a dramatic impact on the growth of that table's dead rows.

diff --git a/mirrors/crawler_perhost b/mirrors/crawler_perhost
index a264424..c31b4b8 100755
--- a/mirrors/crawler_perhost
+++ b/mirrors/crawler_perhost
@@ -298,10 +298,11 @@ def sync_hcds(host, host_category_dirs):
     if hcd.directory is None:
         hcd.directory = d

-    hcd.lastCrawled=now
-    hcd.up2date=up2date
+    if hcd.up2date != up2date:
+        hcd.up2date=up2date
+        hcd.sync()
     current_hcds[hcd] = True
-    hcd.sync()
+

     # now-historical HostCategoryDirs are not up2date
     # we wait for a cascading Directory delete to delete this
@@ -312,9 +313,9 @@ def sync_hcds(host, host_category_dirs):
     try:
         thcd = current_hcds[hcd]
     except KeyError:
-        hcd.lastCrawled=datetime.utcnow()
-        hcd.up2date=False
-        hcd.sync()
+        if hcd.up2date != False:
+            hcd.up2date=False
+            hcd.sync()

 def method_pref(urls):
diff --git a/mirrors/mirrors/model.py b/mirrors/mirrors/model.py
index c28d0df..8c2e74a 100644
--- a/mirrors/mirrors/model.py
+++ b/mirrors/mirrors/model.py
@@ -241,9 +241,10 @@ class Host(SQLObject):
 if hcdir.count() > 0:
     hcdir = hcdir[0]
     # don't store files, we don't need it right now
-    hcdir.files = None
-    hcdir.up2date = True
-    hcdir.sync()
+    # hcdir.files = None
+    if hcdir.up2date != True:
+        hcdir.up2date = True
+        hcdir.sync()
     marked_up2date += 1
 else:
     if len(d) > 0:
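The pattern in the patch is "only write when the value actually changes". PostgreSQL writes a new row version for every UPDATE, even when the values are identical, and the old version becomes a dead tuple. The toy Python sketch below illustrates the effect; `FakeRow` is a hypothetical stand-in for an SQLObject row, not MirrorManager code, and it only counts how many times `sync()` would issue an UPDATE.

```python
class FakeRow:
    """Hypothetical stand-in for an SQLObject row; counts sync() calls."""

    def __init__(self, up2date):
        self.up2date = up2date
        self.updates = 0

    def sync(self):
        # In the real model this would send an UPDATE to the database.
        self.updates += 1


def mark_always(row, value):
    """Old behaviour: write the row on every pass."""
    row.up2date = value
    row.sync()


def mark_if_changed(row, value):
    """Patched behaviour: write only when the flag actually changes."""
    if row.up2date != value:
        row.up2date = value
        row.sync()


always, guarded = FakeRow(True), FakeRow(True)
for _ in range(1000):
    mark_always(always, True)
    mark_if_changed(guarded, True)

print(always.updates, guarded.updates)  # 1000 0
```

For a host whose state does not change between crawls, the guarded version issues no UPDATE at all, so it leaves no dead tuples behind.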
