#5166 RFR: Copr
Closed: Fixed 3 years ago by smooge. Opened 8 years ago by msuchy.

This is the official RFR for Copr as a fully supported infra service on copr.fedoraproject.org as a replacement of copr.fedorainfracloud.org.

I am going to answer some questions which were raised either in the RFR SOP or on the infra mailing list. If you need more, do not hesitate to ask here.

SOP?
https://infrastructure.fedoraproject.org/infra/docs/cloud.rst

What resources do you need?

We need a machine for the Copr frontend, which will be load balanced (so either two of them, or a reverse caching proxy in front of them).
This is a mid-sized machine (approx. 2 CPU, 4 GB RAM, 2 GB storage for the DB [and I would like to later use the Fedora PG server]).
And I will need 4 TB of storage for package repositories (we currently use 2 TB). This can be another machine instance or just storage attached to the frontend. This storage will be exported (via NFS?) to the backend instance running in the Fedora Cloud.

The machines above will be set up as HA.

The whole backend (be, keygen, dist-git) will be in fedorainfracloud.org and will have a community level of SLA, as there is no big issue if the builders stop building for a few hours.

So, just for the record, the backend machines will be in the Fedora Cloud and will consist of:

  • backend machine: a medium-sized machine (approx. 4 CPU, 4 GB RAM), no special requirements for data (let's say 1 GB).
  • keygen machine: can be a small machine (approx. 1 CPU, 2 GB RAM) with storage for the system + 1 GB of backed-up data for GPG keys.
  • dist-git machine: a medium-sized machine (approx. 2 CPU, 4 GB RAM) with 1 TB of storage for data.

Staging instance

We already have staging instances in the Fedora Cloud. We can keep using them.

Point of contact

Me, i.e. Miroslav Suchy (msuchy@redhat.com).
Another team member (clime) is currently in training, and I plan to introduce him to the Fedora Infrastructure team soon.
Two other team members (Jakub Kadlcik, Adam Samalik) should be able to solve some issues; however, they are not Fedora Infra members, nor apprentices. But they are pretty familiar with the Copr code base and have access to the dev machines.

Playbooks

We already have Ansible playbooks for all machines, so we will just alter the playbook for the frontend once we know what HW design and network topology we will be using.
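For illustration, a rough skeleton of what one of these plays could look like (the host group and role names below are just placeholders, not the real playbook contents):

```yaml
# Illustrative skeleton only: the host group and role names are placeholders.
- name: provision and configure the Copr frontend
  hosts: copr_front
  become: true
  roles:
    - base
    - copr/frontend
```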


This sounds like a great start, thank you for all the information!

Just curious: would it be an idea to also have the backend nodes in the infra setup, and only have the builders in the cloud?
This doesn't mean that they need to be HA (no reason to, as you explained yourself), but they would benefit from our normal update cycles etc.

Are the packages for COPR available in Fedora/RHEL/EPEL proper?

Given that I have quite some experience with COPR, I will be the sysadmin-main sponsor for this RFR.

Note: I removed your phone number. Thank you very much for the info, and we'll keep it in our private contact documentation, but I don't think you want your phone number in public records.

Replying to [comment:1 puiterwijk]:

Just curious: would it be an idea to also have the backend nodes in the infra setup, and only have the builders in the cloud?
This doesn't mean that they need to be HA (no reason to, as you explained yourself), but they would benefit from our normal update cycles etc.

I would gladly make it HA too. The fact is that right now I'm not sure what the fedora-infra standards for making a service HA are, and I would appreciate any pointers. So I would like to start with the frontend, and then we can make the backend HA too. Or even dist-git and keygen.

Are the packages for COPR available in Fedora/RHEL/EPEL proper?

Yes. In fact, it is part of our release process now:
https://fedorahosted.org/copr/wiki/HowToReleaseCopr#ReleasepackagetoFedora
Hmm, I see that copr-dist-git is not in Fedora yet. I will submit a package review today.

Given that I have quite some experience with COPR, I will be the sysadmin-main sponsor for this RFR.

Thank you.

Note: I removed your phone number. Thank you very much for the info, and we'll keep it in our private contact documentation, but I don't think you want your phone number in public records.

My phone number is available in so many places that I do not mind it being listed publicly.

SOP: I guess you meant to link to https://infrastructure.fedoraproject.org/infra/docs/copr.rst?
Yes.

Thanks for filing this. :)

Some questions:

  • I don't think there's any way to make the storage work the way you outline. The cloud network is isolated from the internal network. If the storage is in the internal network, there's no way to also export it to the cloud (aside from hacks like sshfs or something), and vice versa. So I think we would need at least the frontend and backend to be in the same network if they share storage. Or perhaps we could redo things so all the storage is only on the frontend and the backend syncs back to it when builds are done? One advantage of moving the storage internal is that perhaps we could just add it to the existing master mirrors in a subdirectory and start getting some mirrors to mirror it as well if we wanted.

  • I assume that all the instances here would be Fedora? Or is there RHEL7 support for the various rpms?

  • If we do move all the 'longer term' instances into internal infrastructure (frontend/backend/dist-git/keygen), that would leave the builders as the only thing depending on the cloud. Might it also be possible to add an option to use libvirt instances for builders? We could switch to this when we have cloud downtime (it would be slower and less ideal, but might work), or perhaps AWS would be another option, as we have a community account for some instances there. (Although that likely won't help with ppc64*.)

Thanks again for filing this.

Replying to [comment:4 kevin]:

  • I don't think there's any way to make the storage work the way you outline...
    Yes. My question: do we have such big storage available that can be in the internal network?

  • I assume that all the instances here would be Fedora? Or is there RHEL7 support for the various rpms?

All are Fedora except copr-dist-git, which is RHEL7.

  • If we do move all the 'longer term' instances into internal infrastructure (frontend/backend/dist-git/keygen), that would leave the builders as the only thing depending on the cloud. Might it also be possible to add an option to use libvirt instances for builders? We could switch to this when we have cloud downtime (it would be slower and less ideal, but might work), or perhaps AWS would be another option, as we have a community account for some instances there. (Although that likely won't help with ppc64*.)

Yes. We use a playbook to spin up new VMs, so it is just a matter of swapping two playbooks, and I can create a playbook which will spin up libvirt or AWS machines. PPC is still marginal, and I'm sure everybody can survive if the queue stops for a few days. Personally, I'm looking forward to those PPC machines getting added to OpenStack - that would be a big improvement.
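For illustration, a hedged sketch of how the spawn play could switch providers; os_server and virt are the standard Ansible modules, while the variable, image, flavor, and key names below are made up:

```yaml
# Hedged sketch: spawn a builder either in OpenStack or via local libvirt,
# selected by a "builder_provider" variable. Image, flavor, and key names
# are placeholders.
- name: spawn one Copr builder
  hosts: localhost
  gather_facts: false
  tasks:
    - name: spawn builder in OpenStack (auth comes from OS_* env vars or clouds.yaml)
      os_server:
        name: "copr-builder-{{ 100000 | random }}"
        image: Fedora-Cloud-Base
        flavor: m1.medium
        key_name: copr-builder
        state: present
      when: builder_provider == "openstack"

    - name: start a pre-defined local libvirt builder domain
      virt:
        name: copr-builder-local
        command: start
      when: builder_provider == "libvirt"
```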

Replying to [comment:5 msuchy]:

Replying to [comment:4 kevin]:

  • I don't think there's any way to make the storage work the way you outline...
    Yes. My question: do we have such big storage available that can be in the internal network?

Yeah, as long as it's OK for it to be NFS. We have some NetApp storage we can create volumes on.

But it would be internal only; it could not be seen by the cloud at all.

  • I assume that all the instances here would be Fedora? Or is there RHEL7 support for the various rpms?

All are Fedora except copr-dist-git, which is RHEL7.

ok.

  • If we do move all the 'longer term' instances into internal infrastructure (frontend/backend/dist-git/keygen), that would leave the builders as the only thing depending on the cloud. Might it also be possible to add an option to use libvirt instances for builders? We could switch to this when we have cloud downtime (it would be slower and less ideal, but might work), or perhaps AWS would be another option, as we have a community account for some instances there. (Although that likely won't help with ppc64*.)

Yes. We use a playbook to spin up new VMs, so it is just a matter of swapping two playbooks, and I can create a playbook which will spin up libvirt or AWS machines. PPC is still marginal, and I'm sure everybody can survive if the queue stops for a few days. Personally, I'm looking forward to those PPC machines getting added to OpenStack - that would be a big improvement.

Yeah, they have arrived; we are just waiting for them to be racked and networked and added into the cloud now. ;) It's possible we could even use instances on them for noarch builds if they turn out to be faster than the x86 ones, but that's down the road.

Replying to [comment:6 kevin]:

Yeah, as long as it's OK for it to be NFS. We have some NetApp storage we can create volumes on.

NFS is fine.

But it would be internal only; it could not be seen by the cloud at all.

At all? Previously you said that sshfs would be possible. So ssh from the cloud to internal (using a public IP) would be possible?

I'm still trying to figure out where to draw the line between fe, be and builders. Hmm, one of the options would be to give the builders in the Fedora Cloud public IPs. Now that the old cloud is discontinued, we should have enough IPs, shouldn't we?
Then it would be totally painless: everything but the builders could be internal, and the backend would have access to the storage.

Sorry, I shouldn't have even mentioned sshfs; it's in no way up to being used in a production service like this, IMHO.

Yeah, if we gave the builders public IPs, we could indeed put everything internal. The internal backend could then ssh to the cloud to manage the builders. From my view this is the best plan.

I suppose we could split the backend up and have an internal backend (that has access to the storage) and a backend-buildermanager in the cloud (that has limited storage). Then the backend-buildermanager talks to the builders and copies results around. That seems like it's adding a lot of complexity just to save us external IPs.

Replying to [comment:8 kevin]:

I suppose we could split the backend up and have an internal backend (that has access to the storage) and a backend-buildermanager in the cloud (that has limited storage). Then the backend-buildermanager talks to the builders and copies results around. That seems like it's adding a lot of complexity just to save us external IPs.

This is not even needed, because fed-cloud09 has the public hostname "fedorainfracloud.org" and the endpoints are publicly available. So I can ask the Fedora Cloud to spin up/terminate a machine from the internal network. I just cannot ssh there using its internal IP - I must assign those VMs a public IP. Then it will work, without the backend being in the cloud network.

Replying to [comment:9 msuchy]:

Replying to [comment:8 kevin]:
This is not even needed, because fed-cloud09 has the public hostname "fedorainfracloud.org" and the endpoints are publicly available. So I can ask the Fedora Cloud to spin up/terminate a machine from the internal network. I just cannot ssh there using its internal IP - I must assign those VMs a public IP. Then it will work, without the backend being in the cloud network.

We don't want to assign a public IP to every builder, since we have a limited number of public IP addresses.
It should be quite easy to set up a COPR jump host, though, that would be used to pipe ssh through; this could even just be controlled by an ssh configuration, which means it would be transparent to Ansible.
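For example, something along these lines in the builders' group_vars should be enough (the jump host name below is just a placeholder):

```yaml
# Hypothetical group_vars for the builder hosts: route ssh through a jump
# host so the playbooks themselves need no changes. The hostname is a
# placeholder, not a real Fedora Infra machine.
ansible_ssh_common_args: '-o ProxyCommand="ssh -W %h:%p copr-jump.example.org"'
```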

Do note that the copr-backend package is currently about to be retired from Fedora because of broken dependencies: https://fedorahosted.org/rel-eng/ticket/6416.
Please fix this and make sure that we can use the packages in Fedora to deploy this, since we do NOT want to deploy packages built on COPR inside the Fedora Infrastructure.

@clime You wanted to take over driving this forward?

I guess we need a bit more discussion as to exactly what parts of copr we want in the cloud and what parts in "normal" infrastructure. I think everyone is good with the web/frontends and keygen being in "normal" infra, but we aren't as sure on the plans for dist-git and backend, as they have large storage requirements and must talk to the cloud to manage builders. I think we all agree the builders should still be in the cloud.

Yeah, I started with getting the fedpkg-copr package into the Fedora & EPEL repos so that, in the end, all the requirements are there (at least in Fedora). I'll go step by step to achieve this.

fedpkg-copr is now in Fedora (and EPEL too). I think I found a way to get rid of this package, but that's not really important here. Also, the latest upstream of python-flask-whooshee is now packaged for f25. There is a logstash package on the backend which is not in f25; I think we can drop it, but I will do a bit more research on this one. Then there is a missing docker build for ppc64le*. We currently need that on the copr-dist-git machine :(.

Just a quick progress update, as promised to @mizdebsk:

Basically, we are very close to getting this finally sorted out.

Copr uses a PostgreSQL server, correct? Does it work with bi-directional replication?

Currently, our PostgreSQL DB servers in staging are configured with BDR, which imposes some restrictions on apps.

Copr uses a PostgreSQL server, correct? Does it work with bi-directional replication?
Currently, our PostgreSQL DB servers in staging are configured with BDR, which imposes some restrictions on apps.

At a (very) quick look, I didn't see a problem.

There are two remaining issues in the packaging:

  • python-flask-whooshee is not available for f24
  • module-build-service (alias fm-orchestrator) is installed from a COPR repo and not from Fedora (it is a new package)

module-build-service is currently available only for f26. python-flask-whooshee can be built for f24, but I wanted to wait until I got an agreement from the primary packager.

Sorry for the slow progress here.

After gaining the agreement of the main packager, I submitted https://bodhi.fedoraproject.org/updates/FEDORA-2017-f4bfbcd01e. If this gets to updates (it hopefully should), then we are only blocked by:

  • module-build-service (alias fm-orchestrator) not currently being available for f25.

All COPR packages are now in Fedora in their latest versions, and they have been deployed into production from the updates repository.

OK, so what's our status here?

  • packages are now all installed from Koji.

Next steps/Issues:

  • We need to move to our new cloud hardware (which should have HA and not be breaking things all the time). We have the hardware, but the place it is going is not yet ready. It's supposed to have power done by the end of October. Then we need networking and the initial setup, and then we can look at migrating instances over to it.
  • copr should pass a security audit. Is it ready for that?
  • we need to determine if we want to keep it as it is now (all in the cloud) or move frontends into our proxy setup. I think backend and builders should stay in the cloud.
  • Possibly other stuff I can't think of :)

Hello!

copr should pass a security audit. Is it ready for that?
I will try to make it ready and ask for review.

Thank you!

This is waiting on a security audit and our new cloud being deployed (hopefully soon).

Metadata Update from @kevin:
- Issue priority set to: Waiting on External

6 years ago

Hello, we are ready for the security audit.

Is there a list of requirements to pass the security audit?

Note that the current operation of Copr depends on the deprecated package python-novaclient-3.3.1-3.fc25 (https://koji.fedoraproject.org/koji/buildinfo?buildID=785608), which we install manually from Koji on copr-backend. I just hope that the package won't be auto-deleted from Koji while we wait for this RFR to be completed; it would be a problem if that happened. Newer python-novaclient packages do not work with the current cloud, and newer packages like python3-shade do not work either, as far as I can tell.

Is there any update/progress on the security audit? Or anything I could help with for this RFR?

@clime, are there Python bindings for the new cloud available in current Fedoras, or do you still need to depend on EOL f25 packages?

Metadata Update from @mizdebsk:
- Issue priority set to: Waiting on Assignee (was: Waiting on External)

5 years ago

Is there any update/progress on the security audit? Or anything I could help with for this RFR?
@clime, are there Python bindings for the new cloud available in current Fedoras, or do you still need to depend on EOL f25 packages?

We can use python3-shade to get the Python bindings, and hopefully the following Ansible modules will work correctly with the new cloud: https://docs.ansible.com/ansible/latest/modules/list_of_cloud_modules.html?highlight=openstack#openstack
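For example, both python3-shade and the os_* Ansible modules read a clouds.yaml along these lines (the endpoint, project, and credentials below are placeholders, not the real cloud):

```yaml
# Placeholder clouds.yaml (consumed by python3-shade and the os_* Ansible
# modules via os-client-config); all values here are illustrative only.
clouds:
  copr-cloud:
    auth:
      auth_url: https://openstack.example.org:5000/v3
      username: copr
      password: CHANGEME
      project_name: copr
      user_domain_name: Default
      project_domain_name: Default
    region_name: RegionOne
```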

I think we can actually continue here per https://docs.pagure.org/infra-docs/sysadmin-guide/sops/requestforresources.html#development-instance.

@mizdebsk offered to help with setting up a staging copr-frontend in Fedora Infra, and we will see about the rest of the machines later.

@puiterwijk, maybe it would be more suitable if @mizdebsk took over sponsorship of this ticket. What do you think?

@puiterwijk and I agreed that I will take over this RFR process.

@clime and I agreed that the current plan is to move the frontend and its database to an HA setup (two load-balanced frontends behind reverse proxies, database on db01). Other parts (backend, storage, builders, dist-git, keygen) should stay in the cloud, at least for now. The advantage of such a setup is that Copr will be able to accept requests coming from users, webhooks, fedmsg, etc. even during cloud outages.
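Roughly, the inventory layout for the HA part would look like this (the frontend hostnames below are placeholders; db01 is the existing database server mentioned above):

```yaml
# Rough inventory sketch of the planned HA layout; frontend hostnames are
# placeholders, not the real Fedora Infra host names.
all:
  children:
    copr_front_ha:
      hosts:
        copr-fe01.example.org:
        copr-fe02.example.org:
    copr_db:
      hosts:
        db01.example.org:
```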

Although the initial request indicated that the existing cloud setup could be used as staging, I believe that we need a new staging setup, as it will be quite different from the existing one. A new staging deployment is planned for Monday. Ansible playbooks and roles will be adjusted to work with the new setup, and the staging frontends will be deployed as HA.

Metadata Update from @mizdebsk:
- Issue assigned to mizdebsk (was: puiterwijk)

5 years ago

The staging frontend has been installed in an HA setup. It is available at https://copr.stg.fedoraproject.org/. At this point, backend/keygen/dist-git need to be set up in the cloud, and known problems discovered during staging testing need to be fixed.

Metadata Update from @mizdebsk:
- Issue priority set to: Waiting on Reporter (was: Waiting on Assignee)

5 years ago

I've finished the cloud staging setup. The whole staging stack is now operational.

The SOP for troubleshooting the backend is outdated and should be updated, including:

  • the euca-describe-instances command fails to run (euca-describe-instances: error: missing access key ID; please supply one with -I)
  • the /home/copr/delete-forgotten-instances.pl file does not exist on copr-be
  • the fed-cloud02.cloud.fedoraproject.org host no longer exists

The "PPC64LE Builders" section is outdated too and should probably be removed.

Can I please have a status update on the progress of this RFR?

We track the progress here: https://trello.com/c/uFiTc5pe - the progress got stuck during spring. Recently we have done public dumps.

Hey, was this covered by the AWS resources?

So, IMHO, I think we can close this ticket now.

Basically, this was the original request for resources, with the idea that it would get fully moved into Fedora infrastructure.

Over the years, I think that goal has changed, and our process has changed some.

The current goal is to just move copr to its own virthosts that are mostly managed by copr folks, and have the builders in AWS.

Then, have an agreement on what things fedora-infra can do to help out (restarting services, etc.) and what things copr developers should always do.

Does that make sense, or are there still things we should do under this ticket?

The AWS resources were "just" a replacement for the OpenStack builders, to help us evacuate OpenStack. I will recap comment #0: "... Copr as a fully supported infra service on copr.fedoraproject.org as a replacement of copr.fedorainfracloud.org." With the emphasis on "fully supported".

What does that mean, though? That we deploy and run it and you only do upstream development and don't touch the deployed application? If so, I don't think we have the manpower for that, and I am not sure that's what you would really want either.

I think we are already planning on providing resources for you to run things and helping where we can. We should try and make sure we understand who is responsible for what.

We could discuss this more at DevConf, or after it, if you like...

Metadata Update from @cverna:
- Assignee reset

3 years ago

Metadata Update from @smooge:
- Issue tagged with: backlog, high-gain, high-trouble

3 years ago

The work on reaching the Service Level Expectations of CPE and COPR is being done outside of this ticket system.

Metadata Update from @smooge:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago
