#4877 Provide crawler in Europe
Closed: Fixed. Opened 8 years ago by adrian.

For a while now, the MM2 crawler has supported limiting its crawls by continent. It would probably reduce crawl times if we could crawl European mirrors from a VM based in Europe.

The existing crawlers have a 20GB disk and 32GB of RAM. If a VM with 16GB of RAM could be made available, that should be enough as a start.


So, we have one site in Germany where we could do a 12GB instance. However, the machine has 8 CPUs and they are all in use. We could overcommit and give the crawler 2?

Would that be worth trying? Or is that too constrained to be of use?

No, that sounds good. Overcommitted CPUs sound like no problem. We can also try whether 10GB is enough, so that some resources are still available if required for other services.

The goal is to crawl the nearby mirrors much faster, and that should be possible with less RAM. I would start with a small number of threads and maybe a higher crawl frequency than every 12 hours. If there are no architectural conflicts I would also welcome a larger swap size, as some of the crawler process's high memory requirements seem to come from Python's rather inefficient memory management.

ok, mm-crawler03.fedoraproject.org is all set up.

Please let us know if you need anything further on it.

Thanks for mm-crawler03, but it seems like it does not work as expected. The main problem seems to be the distance to the database and that the crawler performs too many SQL operations. I have compared a "Fedora EPEL" crawl of a single mirror. The mirror is (according to tracepath) 10 hops from mm-crawler03 but 23 hops from mm-crawler01. The ping RTT from mm-crawler03 is 4.2 ms and from mm-crawler01 154.4 ms.

The total crawl time from mm-crawler01 is:

$ time sudo -u mirrormanager /usr/bin/mm2_crawler --timeout-minutes 180 --threads 31 --startid=218 --stopid=219 --debug --category "Fedora EPEL"

real 1m39.988s
user 0m33.697s
sys 0m3.876s

and from mm-crawler03 it is:

$ time sudo -u mirrormanager /usr/bin/mm2_crawler --timeout-minutes 180 --threads 31 --startid=218 --stopid=219 --debug --category "Fedora EPEL"

real 50m23.693s
user 0m24.228s
sys 0m2.861s

So the crawler which is much closer to the mirror (mm-crawler03) takes more than 25 times longer than the crawler which is further away (mm-crawler01).

Looking only at the actual crawl time of the remote host, mm-crawler03 is much faster: 2.5 seconds
INFO:crawler:Hosts(2/2):Threads(1/31):218:ftp-stud.hs-esslingen.de:About to run following rsync command: rsync --temp-dir=/tmp -r --exclude=.snapshot --exclude='*.~tmp~' --no-motd --timeout 14400 rsync://ftp-stud.hs-esslingen.de/fedora-epel/

INFO:crawler:Hosts(2/2):Threads(1/31):218:ftp-stud.hs-esslingen.de:rsync time: 0:00:02.568403

mm-crawler01: 49.7 seconds

INFO:crawler:Hosts(2/2):Threads(1/31):218:ftp-stud.hs-esslingen.de:About to run following rsync command: rsync --temp-dir=/tmp -r --exclude=.snapshot --exclude='*.~tmp~' --no-motd --timeout 14400 rsync://ftp-stud.hs-esslingen.de/fedora-epel/

INFO:crawler:Hosts(2/2):Threads(1/31):218:ftp-stud.hs-esslingen.de:rsync time: 0:00:49.755972

The log output from the crawls shows that mm-crawler01 needs 60 seconds to compare the data from the rsync crawl with the database, while mm-crawler03 needs 360 seconds for the same operation. The call which actually requires most of the time is the session.commit() at the end, which writes all the data to the database; it completes in no time on mm-crawler01 but requires a lot of time on mm-crawler03.
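For illustration, here is a rough back-of-envelope estimate of what that difference implies. It assumes the database sits close to mm-crawler01, so a database round trip costs next to nothing there but roughly the trans-Atlantic latency (comparable to the ~150 ms ping measured above) from mm-crawler03; the resulting statement count is derived, not measured.

# Rough estimate only: how many per-statement round trips would explain the
# slower commit from Europe?  The RTT below is an assumption based on the
# ping measurements in this ticket.
extra_commit_time = 360.0 - 60.0   # seconds: mm-crawler03 minus mm-crawler01
wan_rtt = 0.150                    # seconds per database round trip (assumed)
print("implied database round trips: %d" % (extra_commit_time / wan_rtt))  # ~2000

In other words, once every flushed statement pays ~150 ms of latency, even a modest number of rows is enough to dominate the whole crawl.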

In addition getting the initial list of mirrors takes a very long time on mm-crawler03.

Although the actual crawling is much faster, the overall crawl time is over 25 times longer, so crawling from mm-crawler03 does not make much sense.

ok, shall we just remove it then?

Or do you think there's any way we could cache things for it so it could still be of use?

I was thinking about a way to use it for caching. In theory it should be possible, especially for the rsync case: we start rsync and just parse its output once it has finished. So we could run all the rsyncs on mm-crawler03, but we do not have the code yet to run the rsync somewhere else.
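As a very rough sketch of what that could look like (this is not existing MirrorManager code; the ssh access, host name and rsync URL are assumptions), the listing would be produced on the European box and parsed next to the database:

import subprocess

# Assumption: the crawler host near the database can ssh to the European VM.
remote_host = "mm-crawler03.fedoraproject.org"
rsync_url = "rsync://ftp-stud.hs-esslingen.de/fedora-epel/"

cmd = [
    "ssh", remote_host,
    "rsync --temp-dir=/tmp -r --exclude=.snapshot --exclude='*.~tmp~' "
    "--no-motd --timeout 14400 " + rsync_url,
]
output = subprocess.check_output(cmd).decode("utf-8", "replace")

# Each rsync listing line looks like: "<perms> <size> <date> <time> <path>".
listing = {}
for line in output.splitlines():
    fields = line.split(None, 4)
    if len(fields) == 5:
        perms, size, date, time_, path = fields
        listing[path] = (perms, size)

print("%d entries fetched via the remote host" % len(listing))

The database comparison would then run with local latency again, while the network-heavy part stays close to the mirror.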

So right now it does not make much sense to keep mm-crawler03 and it should be deleted.

ok. I have removed the instance.

We can re-add it later if we think of a better way to do things.

There have been a few fixes which might have fixed the problem of the long crawler startup times.

I would like to try once more to have a crawler running in Europe and see how it behaves now.

This can wait until the alpha freeze is over.

mm-crawler03 lives again. ;)

Please delete mm-crawler03 again. The startup time is no longer a problem, but because the state of each directory has to be updated in the database one after another, the crawler still has to read each directory's state from the database in case it needs updating. So it is still too slow, and that will probably not get better in the near future.
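To illustrate why those sequential per-directory reads dominate from Europe (the directory count below is purely illustrative, and the round-trip times are assumptions based on the earlier ping measurements):

# Illustration only: one database round trip per directory, done sequentially.
directories = 5000  # made-up count of directories checked in one crawl
lan_rtt = 0.0005    # seconds, same data centre as the database (assumed)
wan_rtt = 0.150     # seconds from Europe (assumed)
print("near the database: %.1f s" % (directories * lan_rtt))   # ~2.5 s
print("from mm-crawler03: %.0f s" % (directories * wan_rtt))   # ~750 s

The per-directory latency scales linearly with the number of directories, which is why the faster rsync from Europe cannot make up for it.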
