#3792 pkgdb needs a new script to sync summary and description
Closed: Fixed None Opened 10 years ago by toshio.

The packagedb has fields for a package's summary and description. These are used in bugzilla, among other places. We used to have a script to parse this information out of the package repository metadata. It also populated the appdb portion of packagedb. Since we got rid of the appdb, we also got rid of that script.

We could write a new script that populates those fields from the yum repo. However, we're trying to move to a system where fewer things have to grab that data directly, since we duplicate a lot of effort when we do that. It would make sense to use pkgwat.api to grab the information, since packages has already done that parsing.

pkgwat.api.get(pkgname) (implemented here: https://github.com/fedora-infra/pkgwat.api/issues/2 ) should be able to give us the information. We'll need to loop through all of the packages in pkgdb and retrieve the summary and description for those. Then enter them into the pkgdb.

From a very naive test, we think that doing the retrieval this way will take on the order of 5 or 6 hours. Adding the information to the db will be in addition (and will take a longer or shorter amount of time depending on whether we decide to use direct access to the pkgdb database or the json interface).

We want to run the cron job daily so once written we'll need to get together with threebean and test:

1) the cron job can complete within a day's timespan
2) the cron job doesn't hamper the normal operation of apps.fp.o/packages

If something fails there, we'll have to see if packages can be enhanced (probably with some sort of bulk queryable interface) to speed things up.

CC'ing cvsadmin-members so they know that the sync script is currently broken and they may get requests to do this manually for a while.


Marking EasyFix as I have a few tasks (python-fedora otp and pkgdb2 schema migration) to work on before I'll get back to this. Feel free to ping me on IRC if you want to work on this.

I think it would be fine to do this directly against the pkgdb db. So something like:

  • read /etc/pkgdb.cfg for the db information.
  • query db for all packagenames in package table
  • for loop on packagenames
  • pkgwat.api.get(packagename)
  • get the summary and description from that
  • Save those into the pkgdb package table
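
The steps above could be sketched roughly as follows. This is a minimal sketch, not the script attached later in this ticket: it uses an in-memory sqlite3 database in place of the real pkgdb database, assumes a `package` table with `name`, `summary` and `description` columns, and takes the lookup function as a parameter. In production that parameter would be `pkgwat.api.get`, which is assumed here to return a mapping with `summary` and `description` keys. Reading /etc/pkgdb.cfg is left out.

```python
import sqlite3


def sync_package_info(conn, fetch):
    """Copy summary/description from fetch(name) into the package table.

    `fetch` stands in for pkgwat.api.get and is assumed to return a
    mapping with 'summary' and 'description' keys.
    Returns the number of rows updated.
    """
    names = [row[0] for row in
             conn.execute("SELECT name FROM package ORDER BY name")]
    updated = 0
    for name in names:
        try:
            info = fetch(name)
        except Exception:
            # Hypothetical policy: skip packages the lookup cannot resolve.
            continue
        conn.execute(
            "UPDATE package SET summary = ?, description = ? WHERE name = ?",
            (info.get("summary"), info.get("description"), name),
        )
        updated += 1
    conn.commit()
    return updated


def make_demo_db():
    """Build a throwaway database with the assumed schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE package "
        "(name TEXT PRIMARY KEY, summary TEXT, description TEXT)")
    conn.executemany("INSERT INTO package (name) VALUES (?)",
                     [("guake",), ("no-such-pkg",)])
    return conn
```

Passing the lookup in as an argument keeps the loop testable without hitting apps.fp.o/packages.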

@ralph -- pingou and I came up with an alternate idea to run by you:

  • have the packages timed job that imports the information also import into pkgdb.
  • then we can look into having the packages import job use fedmsg to decide what new builds have occurred and only download the rpms to import those.

This would be a lot more efficient than the simple plan in comment:1 but I'm not sure if it remains an EasyFix to implement or if we should continue with the simple plan then iterate if we need to.
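
To illustrate the second bullet, the import job could reduce a batch of fedmsg messages to the set of package names that need re-importing. This sketch makes several assumptions that would need checking against the real bus: that build messages arrive with a topic ending in `buildsys.build.state.change`, that the payload lives under `msg` with a `name` field, and that `new == 1` means the build completed. It works on plain dicts and does not call fedmsg itself.

```python
BUILD_TOPIC_SUFFIX = "buildsys.build.state.change"  # assumed topic suffix
STATE_COMPLETE = 1  # assumed value for a completed build


def packages_to_refresh(messages):
    """Return the names of packages with a completed build in `messages`."""
    names = set()
    for message in messages:
        if not message.get("topic", "").endswith(BUILD_TOPIC_SUFFIX):
            continue
        body = message.get("msg", {})
        if body.get("new") == STATE_COMPLETE and body.get("name"):
            names.add(body["name"])
    return names
```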

Apologies, but do we need to do something if the package is not installed?
pkgwat.api.get(packagename) fails if the package is not found.
Or should we only check packages that are installed?

Replying to [comment:4 arielb]:

Apologies, but do we need to do something if the package is not installed?
pkgwat.api.get(packagename) fails if the package is not found.
Or should we only check packages that are installed?

Ah, pkgwat.api.get will actually query the web api behind https://apps.fedoraproject.org/packages/, so, no need for the package to be installed locally.

hi.
attached script.
Is this what it takes?

I wanted to mention that I sometimes saw failures during testing.

attached err.txt
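
One way to cope with intermittent failures like these would be to retry each lookup a few times before giving up on that package. The sketch below is only an illustration: `fetch` stands in for `pkgwat.api.get`, and because the exception it raises is not confirmed here, the wrapper catches `Exception` broadly. The retry count and delay are arbitrary.

```python
import time


def fetch_with_retry(fetch, name, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch(name), retrying on any exception.

    Returns the result, or None if every attempt failed.
    `sleep` is injectable so the function can be tested without waiting.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fetch(name)
        except Exception:
            # Wait before the next try, but not after the final one.
            if attempt < attempts:
                sleep(delay)
    return None
```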

The script looks nice but I'm not sure running it against all the packages is really optimal.

Yep, as I thought, so I left a commented-out filter in the script. Which filter do you recommend?

{{{
# in case it needs a filter
#where = or_(db.package.summary == None, db.package.description == None, \
#            db.package.summary == '', db.package.description == '')
#count = db.package.filter(where).order_by(db.package.name).count()
#pkgs = db.package.filter(where).order_by(db.package.name).all()
}}}

You're asking which I recommend, but this is one filter, no?

Just a couple of things: within (), {} or [], you do not need to use the \ at the end of a line.

You may want to save the query as
{{{
query = db.package.filter(
    or_(db.package.summary == None,
        db.package.summary == '',
        db.package.description == None,
        db.package.description == '')
).order_by(db.package.name)
count = query.count()
pkgs = query.all()
}}}

It allows you to do {{{print query}}} if you need to debug it at some point, but that's mostly coding style and I'm sure some would disagree with me :)

Yes, but I was not sure whether that was the right filter for the requirement. :)

I had tested unfiltered and in 30 minutes had only 1150 rows.

time python summary.py
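
For a rough sense of scale, that rate can be extrapolated to a full run. The helper below is plain arithmetic; the total package count in the usage line is a made-up figure for illustration, not the real size of pkgdb.

```python
def estimated_hours(total_packages, rows_done=1150, minutes_taken=30):
    """Extrapolate a full-run duration from an observed partial run."""
    rows_per_hour = rows_done / (minutes_taken / 60.0)
    return total_packages / rows_per_hour


# Hypothetical total of 23000 packages at 1150 rows per 30 minutes.
print(estimated_hours(23000))
```

At 2300 rows per hour, a daily run would only fit within 24 hours if the total stays under 55200 packages.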

Thank you for the comment, I will make the changes.

changes done - summary.3.py

arielb: Ah... I meant for us to integrate adding the summary and description to the pkgdb into the existing script that processes the yum repo. But we can probably test summary3.py and see if the basic approach will work. Does Indexer come from that script? If so, I guess we'll need to run this on the packages server?

Also -- I made one note in the summary3.py code of something we can change.

With pkgdb2 coming along, should we look into porting this script to it?

Moving all currently open easyfix tickets to the HANDYWAVY-FUTURE milestone.

I'm clearing the assigned status on all easyfix tickets.

If you are an apprentice actively working on this ticket, feel free to reassign to yourself. Otherwise let a new apprentice have a look.

Note: in addition to summary and description we should also update the URL in pkgdb2

I added a script update_package_info.py in https://github.com/fedora-infra/pkgdb2/tree/update_cron which does the basic task but we may want to optimize it.

Removing easyfix as I'll take care of this.
