#4850 site 105: splitting categories into multiple hosts to avoid the 3-hour crawl timeout
Closed: Fixed None Opened 8 years ago by bellet.

Hi!

I'm sorry to bug you again with my difficulties... Following Adrian's suggestion to split my site into separate hosts, so that the crawler no longer hits the 3-hour timeout when a host carries a lot of content (this happened 4 times in a row until this morning), I created three similar host entries, one for each category my mirror carries (Fedora Linux, Fedora Secondary, and Fedora EPEL). In each host, I recreated the category and the URLs serving it. It looked good.

After that modification, I thought it might be better to merge Fedora Linux and Fedora EPEL back together in the same host, because crawling Fedora EPEL is fast anyway and would not significantly increase the crawl duration of a host that already carries one of the two big categories (Fedora Linux or Fedora Secondary).

The re-creation of the category worked fine, but when I tried to add the "URLs serving this content" to my Fedora EPEL category, it failed with the error message "Could not add Category URL to the host". I tried various combinations without success. I'm pretty confident that the URLs provided are correct, because they are the ones I had registered before I tried to split my single host entry.

In my last attempt, I re-created an empty third site entry "fr2.rpmfind.net (3)", just for EPEL. Unfortunately, although I could create the Fedora EPEL category, I could not add URLs to it.

So maybe some traces of the fr2.rpmfind.net URLs serving Fedora EPEL content are still left in the database, and they prevent me from adding these URLs back to the EPEL category of host "fr2.rpmfind.net (3)":

{{{
http://fr2.rpmfind.net/linux/epel
rsync://fr2.rpmfind.net/linux/epel
ftp://fr2.rpmfind.net/linux/epel
}}}


I had a look at the database and, just as you suspected, the URLs were still in the database but no longer associated with a host. As each URL can only exist once, MirrorManager refused to add your EPEL URLs again. I am not sure why the URLs were not correctly deleted, so I opened a ticket to make sure this bug is not forgotten:

https://github.com/fedora-infra/mirrormanager2/issues/115

The following changes to the database were necessary to fix your problem:
{{{
mirrormanager2=> delete from host_category_url where id=8008;
DELETE 1
mirrormanager2=> delete from host_category_url where id=8010;
DELETE 1
mirrormanager2=> delete from host_category_url where id=8011;
DELETE 1
}}}
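
For anyone hitting the same error, here is a minimal sketch of the failure mode described above: a URL row that no longer belongs to any existing category still occupies its unique slot, so re-adding the same URL fails until the orphan is deleted. This is only an illustration under stated assumptions. It uses an in-memory SQLite database and a simplified, hypothetical schema (the `host_category_id` column, the `UNIQUE` constraint on `url`, and the helper names `make_db`, `add_url`, `orphaned_urls` and `delete_orphans` are mine, not taken from MirrorManager2), and the row ids are dummy values.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a simplified, hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE host_category (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE host_category_url (
            id INTEGER PRIMARY KEY,
            host_category_id INTEGER,   -- assumed link to host_category.id
            url TEXT UNIQUE             -- assumed: each URL can only exist once
        );
        """
    )
    return conn


def add_url(conn, host_category_id, url):
    """Try to register a URL; return False if the unique constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO host_category_url (host_category_id, url) VALUES (?, ?)",
            (host_category_id, url),
        )
        return True
    except sqlite3.IntegrityError:
        return False


def orphaned_urls(conn):
    """List URL rows whose category no longer exists (or was never set)."""
    return conn.execute(
        """
        SELECT u.id, u.url
        FROM host_category_url AS u
        LEFT JOIN host_category AS c ON c.id = u.host_category_id
        WHERE c.id IS NULL
        ORDER BY u.id
        """
    ).fetchall()


def delete_orphans(conn):
    """Delete orphaned URL rows and return how many were removed."""
    cur = conn.execute(
        """
        DELETE FROM host_category_url
        WHERE host_category_id IS NULL
           OR host_category_id NOT IN (SELECT id FROM host_category)
        """
    )
    return cur.rowcount


if __name__ == "__main__":
    conn = make_db()
    url = "http://fr2.rpmfind.net/linux/epel"
    # The re-created category exists, but a stale row from the old category remains.
    conn.execute("INSERT INTO host_category (id, name) VALUES (1, 'Fedora EPEL')")
    conn.execute(
        "INSERT INTO host_category_url (id, host_category_id, url) VALUES (10, 999, ?)",
        (url,),
    )
    print(add_url(conn, 1, url))   # rejected while the orphan exists
    print(orphaned_urls(conn))     # shows the stale row
    print(delete_orphans(conn))    # number of rows removed
    print(add_url(conn, 1, url))   # accepted after the cleanup
```

On the production database the cleanup was done by id, as shown in the `delete` statements above; the join-based lookup here is just one way such rows could be located first.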

I manually started a crawl of all three of your hosts.
