Ticket #21 (new enhancement)

Opened 4 years ago

Last modified 2 years ago

Import the Scraper

Reported by: toshio Owned by: joshkayse
Priority: major Milestone:
Component: BaseClient Version:
Keywords: Cc: mchua, ianweller, joshkayse, rrix
Blocked By: Blocking:

Description

Mel Chua has a script that pulls down data about what a Fedora Contributor has been up to lately. We want to integrate that into python-fedora.

I haven't seen the code yet but it sounds okay to me to put it into fedora/clients somehow. We'll need to work out the details. mchua now has commit rights in the python-fedora bzr repository so she can get started on this.

Change History

comment:1 Changed 4 years ago by joshkayse

  • Cc mchua, ianweller, joshkayse added; mchua ianweller removed

I would like to help with this.

comment:2 Changed 4 years ago by toshio

@joshkayse that would be great. However, the scope of this has changed somewhat. I talked with mchua and ianweller on IRC and there's no script yet. There was some IRC conversation and a blog post about what data it would be nice to be able to correlate, and python-fedora's current client libs already support retrieving most of that data. Someone needs to write the code that actually pulls the pieces of data together and correlates them (as well as a front end script that makes use of it).

Are you still interested?

comment:3 Changed 4 years ago by joshkayse

Yes, I am still interested.

I envision it as a class representing a fas account as a contributor. It would have methods to retrieve the various pieces of information that would be modeled (like packages owned, recent wiki edits, etc). These methods would in turn use the API that has already been created (and any new API that would be needed) to query the authoritative sources for the information and then return that.
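A minimal sketch of that design, for discussion only. The `Contributor` class, its method names, and the idea of passing in data sources as plain callables are all assumptions made for illustration; none of this is existing python-fedora API. In a real implementation each source would wrap the appropriate python-fedora client call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical convention: a "source" is any callable that takes a FAS
# username and returns a list of items from one authoritative service.
Source = Callable[[str], List[str]]


@dataclass
class Contributor:
    """A FAS account viewed as a contributor (illustrative sketch only)."""

    username: str
    # Maps a kind of information (e.g. "packages") to the source that
    # knows how to look it up. The keys are made up for this example.
    sources: Dict[str, Source] = field(default_factory=dict)

    def fetch(self, kind: str) -> List[str]:
        """Query the source registered for `kind` and return its result."""
        try:
            source = self.sources[kind]
        except KeyError:
            raise LookupError(f"no source registered for {kind!r}") from None
        return source(self.username)

    def packages_owned(self) -> List[str]:
        return self.fetch("packages")

    def recent_wiki_edits(self) -> List[str]:
        return self.fetch("wiki_edits")

    def summary(self) -> Dict[str, List[str]]:
        """Pull every registered piece of information together."""
        return {kind: self.fetch(kind) for kind in self.sources}
```

Keeping the sources as injected callables means the class can be exercised with stub functions before the real queries exist, and new kinds of information can be added without changing the class.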

My first step would be to replicate mchua's original fas_scraper script using that framework.

Would this be a bad implementation or is there a better way to architect it?

Thanks, -josh

comment:4 Changed 4 years ago by ianweller

Since that's exactly what I was envisioning, that sounds like a good implementation to me. ;)

Toshio and I were still wondering whether or not this fits in python-fedora. I think we came to the conclusion to go ahead and develop it inside python-fedora (under fedora.client.scraper or something like that), and if it turns out not to fit there, move it into its own package and repository.

comment:5 Changed 4 years ago by toshio

Yep. I think that's a good plan and implementation. Josh, I'll give you access to the bzr repository. Feel free to push a branch there for others to look at. If you need help learning bzr commands, I'm on irc.freenode.net in #fedora-admin as abadger1999.

comment:6 Changed 4 years ago by mchua

Josh, depending on when you'll be doing this work, I've got access to a class of 40 students at Allegheny College who will have a project assignment, one option for which is to explore Fedora data and see what interesting data correlations they can find. They'll be doing this between March 30 and May 4th, so if you're doing rapid development you may get a few testers. :) I'd be glad to introduce them to the library you make.

comment:7 Changed 4 years ago by joshkayse

  • Owner changed from toshio to joshkayse

I made my first commit to a branch called scraper in bzr.

So far it has rudimentary support for packages and wiki edits.

From what I can tell, there is no easy way to correlate planet blogs to FAS members. Is there something I'm missing?

I'd like to add the ability to retrieve recent package updates made by a contributor but I'll have to work with lmacken to get that added to bodhi. I'd also like to be able to query fedorahosted projects because I feel that contributing to a fedorahosted project is contributing to fedora also. Right now I see no clear way to do that or even a way to approach it. Thoughts?

The major thing that still needs to be done is pulling group membership from FAS.

comment:8 Changed 4 years ago by toshio

With access to the fedorapeople filesystem we could get at the raw .planet files that get pulled in. I don't think our planet software makes those available from the web yet. You might want to ask skvidal on IRC whether there is a way.

For fedorahosted projects, you should be able to look at commits to the source repositories as one source of information. There's also trac timelines:

https://fedorahosted.org/python-fedora/timeline
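As a rough illustration of using a timeline as a data source, the sketch below counts timeline events per author. It assumes the timeline has already been fetched as an RSS feed and that each item names its author in a `dc:creator` element; both assumptions should be checked against the actual Trac instance. The function name `count_events_by_author` is made up for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Dublin Core namespace; assumed to be how the feed marks item authors.
DC_CREATOR = "{http://purl.org/dc/elements/1.1/}creator"


def count_events_by_author(rss_text):
    """Return a Counter mapping author name to number of timeline items."""
    root = ET.fromstring(rss_text)
    counts = Counter()
    for item in root.iter("item"):
        creator = item.findtext(DC_CREATOR)
        if creator and creator.strip():
            counts[creator.strip()] += 1
    return counts
```

Matching the author names in such a feed to FAS usernames is a separate step that this sketch does not attempt.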

comment:9 Changed 4 years ago by rrix

  • Cc rrix added

comment:10 Changed 4 years ago by joshkayse

I'm not going to lie, I kind of stalled on this due to some other things going on. I have some raw code that I think I've pushed as a branch that pulls down some of the information but it is not necessarily pretty. I need guidelines on what sort of granularity I need to provide in my library.

comment:11 Changed 2 years ago by toshio

Since there isn't an actual Scraper to import at the moment, I'm going to close this ticket.
