fedora-infrastructure

#3792 pkgdb needs a new script to sync summary and description

Closed: Fixed None Opened 10 years ago by toshio.

The packagedb has fields for a package's summary and description. These are used in bugzilla among other places. We used to have a script to parse this information out of the package repository metadata. It also populated the appdb portion of packagedb. Since we got rid of the appdb we also got rid of that script.

We could write a new script that populates those fields from the yum repo. However, we're trying to move to a system where fewer things have to grab that data directly, since we have a lot of duplication of effort when we do that. It would make sense to use pkgwat.api to grab the information, since packages has already done that parsing.

pkgwat.api.get(pkgname) (implemented here: https://github.com/fedora-infra/pkgwat.api/issues/2 ) should be able to give us the information. We'll need to loop through all of the packages in pkgdb and retrieve the summary and description for those. Then enter them into the pkgdb.

From a very naive test, we think that doing the retrieval this way will take on the order of 5 or 6 hours. Adding the information to the db will be in addition (and will take a longer or shorter amount of time depending on whether we decide to use direct access to the pkgdb database or the json interface).

We want to run the cron job daily so once written we'll need to get together with threebean and test:

1) the cron job can complete within a day's timespan
2) the cron job doesn't hamper the normal operation of apps.fp.o/packages

If something fails there we'll have to see if packages can be enhanced (probably with some sort of bulk queryable interface) to speed things up.

CC'ing cvsadmin-members so they know that the sync script is currently broken and they may get requests to do this manually for a while.

toshio commented 10 years ago

Marking EasyFix as I have a few tasks (python-fedora otp and pkgdb2 schema migration) to work on before I'll get back to this. Feel free to ping me on IRC if you want to work on this.

I think it would be fine to do this directly against the pkgdb db. So something like:

read /etc/pkgdb.cfg for the db information.
query db for all packagenames in package table
for loop on packagenames
    pkgwat.api.get(packagename)
    get the summary and description from that
    Save those into the pkgdb package table
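
The steps above could be sketched roughly as follows. This is only an illustration under several assumptions: an in-memory sqlite3 database stands in for the pkgdb database, the `package` table layout is hypothetical, reading /etc/pkgdb.cfg is left out, and `fetch` is a stand-in for pkgwat.api.get that is assumed to return a dict with `summary` and `description` keys (or None when nothing is known).

```python
import sqlite3


def sync_package_info(conn, fetch):
    """Copy summary/description from fetch(name) into the package table.

    `fetch` is a stand-in for pkgwat.api.get; its return shape is an assumption.
    Returns the number of packages that were updated.
    """
    names = [row[0] for row in conn.execute("SELECT name FROM package ORDER BY name")]
    updated = 0
    for name in names:
        info = fetch(name)
        if not info:
            continue  # nothing to record for this package
        conn.execute(
            "UPDATE package SET summary = ?, description = ? WHERE name = ?",
            (info.get("summary"), info.get("description"), name),
        )
        updated += 1
    conn.commit()
    return updated


# Demo with a hypothetical schema and made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE package (name TEXT PRIMARY KEY, summary TEXT, description TEXT)")
conn.executemany("INSERT INTO package (name) VALUES (?)", [("guake",), ("nethack",)])

fake_data = {"guake": {"summary": "Drop-down terminal", "description": "A terminal for GNOME."}}
count = sync_package_info(conn, fake_data.get)
```

Passing the lookup in as a function keeps the database loop testable without touching the real packages web service.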

toshio commented 10 years ago

@ralph -- pingou and I came up with an alternate idea to run by you:

have the packages timed job that imports the information also import into pkgdb.
then we can look into having the packages import job use fedmsg to decide what new builds have occurred and only download the rpms to import those.

This would be a lot more efficient than the simple plan in comment:1 but I'm not sure if it remains an EasyFix to implement or if we should continue with the simple plan then iterate if we need to.
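
To make the fedmsg idea a bit more concrete, here is a rough sketch of picking out only the packages that had a new build. The message shape is an assumption for illustration (a dict with a `topic` ending in `buildsys.build.state.change` and a `msg` body carrying `name` and `new`, where `new == 1` is taken to mean a completed build); `packages_to_refresh` is a hypothetical helper, not something that exists in packages or pkgdb.

```python
def packages_to_refresh(messages):
    """Return the sorted names of packages that appear to have a completed build.

    The message layout and the meaning of new == 1 are assumptions.
    """
    names = set()
    for message in messages:
        if not message.get("topic", "").endswith("buildsys.build.state.change"):
            continue  # not a build state change, ignore it
        body = message.get("msg", {})
        if body.get("new") == 1 and body.get("name"):
            names.add(body["name"])
    return sorted(names)


# Made-up messages for illustration only.
sample = [
    {"topic": "org.fedoraproject.prod.buildsys.build.state.change",
     "msg": {"name": "guake", "new": 1}},
    {"topic": "org.fedoraproject.prod.buildsys.build.state.change",
     "msg": {"name": "nethack", "new": 0}},
    {"topic": "org.fedoraproject.prod.git.receive",
     "msg": {"name": "zsh", "new": 1}},
]
to_refresh = packages_to_refresh(sample)
```

Only the names returned here would need their rpms downloaded and their summary and description re-imported.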

arielb commented 10 years ago

Apologies, but what should happen if the package is not installed?
pkgwat.api.get(packagename) fails if the package is not found.
Or should the script only check packages that are installed?

arielb commented 10 years ago

I'll do this https://github.com/fedora-infra/fedora-packages/blob/develop/fedoracommunity/search/index.py#L164

ralph commented 10 years ago

Replying to [comment:4 arielb]:

Apologies, but what should happen if the package is not installed?
pkgwat.api.get(packagename) fails if the package is not found.
Or should the script only check packages that are installed?

Ah, pkgwat.api.get will actually query the web api behind https://apps.fedoraproject.org/packages/, so, no need for the package to be installed locally.
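
Since the lookup goes over the network, a script looping over many packages probably wants to tolerate a failed call rather than abort. A minimal sketch, assuming failures surface as Python exceptions (the exact exception types raised by pkgwat.api.get are not known here, so the `except` is deliberately broad); `lookup_with_retry` and its parameters are hypothetical, and `fetch` stands in for pkgwat.api.get.

```python
import time


def lookup_with_retry(fetch, name, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch(name), retrying on any exception; return None if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(name)
        except Exception:  # broad on purpose: the real exception types are an unknown
            if attempt < attempts:
                sleep(delay * attempt)  # wait a little longer after each failure
    return None


# Demo: a made-up lookup that fails once and then succeeds.
calls = []


def flaky(name):
    calls.append(name)
    if len(calls) < 2:
        raise IOError("temporary network error")
    return {"summary": "example"}


result = lookup_with_retry(flaky, "guake", sleep=lambda seconds: None)
```

A package that still fails after the last attempt comes back as None, so the caller can skip it and carry on with the rest.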

arielb commented 10 years ago

attachment
summary.py

arielb commented 10 years ago

Hi.
I've attached the script.
Is this what is needed?

arielb commented 10 years ago

I wanted to mention that the script sometimes failed during testing.

I've attached err.txt.

arielb commented 10 years ago

attachment
err.txt

pingou commented 10 years ago

The script looks nice but I'm not sure running it against all the packages is really optimal.

arielb commented 10 years ago

Yep, that's what I thought, so I left a commented-out filter in the script. Which filter do you recommend?

{{{
# in case it needs a filter
#where = or_(db.package.summary == None, db.package.description == None, \
#            db.package.summary == '', db.package.description == '')
#count = db.package.filter(where).order_by(db.package.name).count()
#pkgs = db.package.filter(where).order_by(db.package.name).all()
}}}

pingou commented 10 years ago

You're asking which I recommend, but this is one filter, no?

Just a couple of things: within (), {} or [], you do not need to use the \ at the end of a line.

You may want to save the query as
{{{
query = db.package.filter(
    or_(db.package.summary == None,
        db.package.summary == '',
        db.package.description == None,
        db.package.description == '')
).order_by(db.package.name)
count = query.count()
pkgs = query.all()
}}}

It allows you to do {{{print query}}} if you need to debug it at some point, but that's mostly coding style and I'm sure some would disagree with me :)
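
For anyone who wants to try both points without a pkgdb database at hand, here is a small stand-alone sketch using sqlite3 in place of the SQLAlchemy model. The table layout and rows are made up; the point is only that the condition spans several lines inside parentheses without any trailing \, and that it is defined once and then reused for both the count and the rows.

```python
import sqlite3

# The parentheses let the string span several lines without trailing backslashes.
MISSING_INFO = (
    "summary IS NULL OR summary = '' "
    "OR description IS NULL OR description = ''"
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE package (name TEXT, summary TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO package VALUES (?, ?, ?)",
    [
        ("zsh", "Shell", "A powerful shell."),
        ("guake", None, None),
        ("bash", "", "The GNU shell."),
    ],
)

# The condition is defined once and reused for the count and for the rows.
count = conn.execute("SELECT COUNT(*) FROM package WHERE " + MISSING_INFO).fetchone()[0]
names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM package WHERE " + MISSING_INFO + " ORDER BY name"
    )
]
```

Printing `MISSING_INFO` here plays the same debugging role as printing the saved query object.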

arielb commented 10 years ago

Yes, but I was not sure whether that was the right filter for the requirement. :)

I had tested it unfiltered, and after 30 minutes it had processed only 1150 rows.

time python summary.py

Thank you for the comments, I will make the changes.
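
As a rough way to turn a timing like the one above into a total runtime estimate, the rate can be extrapolated linearly. This is a sketch that assumes the rate stays constant over the whole run; the 1150 rows in 30 minutes come from the test run, while the total of 15000 packages is a made-up figure for illustration, not the real size of the package table.

```python
def estimate_hours(rows_done, minutes_elapsed, total_rows):
    """Extrapolate the total runtime in hours, assuming a constant processing rate."""
    rows_per_minute = rows_done / float(minutes_elapsed)
    return total_rows / rows_per_minute / 60


# 1150 rows in 30 minutes is the measured part; 15000 is hypothetical.
hours = estimate_hours(1150, 30, 15000)
print(round(hours, 1))
```

Plugging in the real row count would show whether an unfiltered run fits inside the daily cron window.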

arielb commented 10 years ago

attachment
summary.2.py

arielb commented 10 years ago

changes done.

arielb commented 10 years ago

changes done - summary.3.py

toshio commented 10 years ago

arielb: Ah... I meant to integrate adding the summary and description into the pkgdb into the existing script that processes the yum repo. But we can probably test summary3.py and see if the basic approach will work. Indexer comes from that script? If so, I guess we'll need to run this on the packages server?

toshio commented 10 years ago

Also -- I made one note in the summary3.py code of something we can change.

arielb commented 10 years ago

Indexer ---> I found it in "fedora-packages/fedoracommunity/search"

https://github.com/fedora-infra/fedora-packages/blob/develop/fedoracommunity/search/index.py#L89

arielb commented 10 years ago

inheriting from indexer
summary.3.py

pingou commented 10 years ago

With pkgdb2 coming along, should we look into porting this script to it?

kevin commented 10 years ago

Moving all currently open easyfix tickets to the HANDYWAVY-FUTURE milestone.

kevin commented 10 years ago

I'm clearing the assigned status on all easyfix tickets.

If you are an apprentice actively working on this ticket, feel free to reassign to yourself. Otherwise let a new apprentice have a look.

pingou commented 9 years ago

Note: in addition to summary and description we should also update the URL in pkgdb2

pingou commented 9 years ago

pkgdb2 now has the required api endpoint: https://admin.fedoraproject.org/pkgdb/api/#edit_a_package

pingou commented 9 years ago

I added a script update_package_info.py in https://github.com/fedora-infra/pkgdb2/tree/update_cron which does the basic task but we may want to optimize it.

Removing easyfix as I'll take care of this.

pingou commented 9 years ago

Up for review at: https://github.com/fedora-infra/pkgdb2/pull/78

pingou commented 9 years ago

Merged