#4958 Deploy openQA
Closed: Fixed. Opened 8 years ago by adamwill.

So, we've got some machines to deploy openQA on. Let's deploy it!

I have drafted up the ansible stuff for this and would appreciate review before I try to fire it, as this is my first time playing with infra:

https://infrastructure.fedoraproject.org/cgit/ansible.git/log/?h=openqa

There are several commits - I can't rebase as I can't push --force, so just diff that branch against master to see the changes.

Note that I've tested the roles themselves locally with a couple of VMs and a simple playbook, but all the integration into infra ansible is untested.

Notes:

  • I don't know if there's some sort of non-personal default 'admin' identity we could make the admin user. For now this would make me, personally, the admin user (and I'd then grant privs to others). If there's some kind of appropriate account we can use for this, we just need to change the relevant vars in openqa and openqa-stg; it's easy.

  • The database server name is just hardcoded. If there's a better way to do that, let me know. The database deployment stuff in general is untested and copied from other services; please check it and let me know if it looks wrong.

  • There seem to be some tasks and roles that are kind of boilerplate, which almost every host/group has, so I put those in the playbooks for openqa too. If any of them shouldn't be there, let me know.

  • It might theoretically be useful to separate the 'openQA NFS server' role from the 'openQA server' role, but for now I combined them as it was kinda simpler that way: we assume the server also acts as the shared storage host. Note it would be somewhat trickier to separate these roles, as we'd have to make sure the server had write access to the share, which would require handling UIDs between the two systems, I think. With this design, we don't have to worry about any of that, and we also don't have to bother with limiting access to the share as it's read-only: there's nothing sensitive on the share, and the workers only need read access.
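
For illustration, this is roughly how the workers end up consuming the share read-only - a minimal sketch only; the server hostname is a placeholder and the paths/options are what I'd expect from a default openQA install, not a copy of the branch:

```yaml
# Hedged sketch: mount the factory share read-only on a worker host.
# "openqa-server.example.org" is a placeholder, not the real server name.
- name: mount the openQA factory share read-only from the server
  mount:
    path: /var/lib/openqa/share
    src: "openqa-server.example.org:/var/lib/openqa/share"
    fstype: nfs
    opts: ro
    state: mounted
```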

  • I set the dispatcher role up separately from the server role. For now they're kinda tied together, because the scheduler scripts still do the work of downloading the ISO (so effectively the dispatcher has to run on the same box as the server, or at least on a box with the NFS share mounted rw), but we can actually change that quite easily now; I just haven't got around to doing it, so it makes sense for this to be separate.

  • The plays won't always reduce state. For instance, if you set a box up to run 5 worker instances, then run the worker play with the instance count set to 3, you'll still have 5 workers running afterwards; it won't disable workers 4 and 5. Similarly, if you run the dispatcher play with 'openqa_triggers' set to ['rawhide', 'branched'], then run it again with openqa_triggers set to ['rawhide'], the branched timer will still be enabled. Solving this seems like it would be rather painful, and it doesn't really seem necessary.
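
To make the worker case concrete, the play is essentially doing something like this (the variable name is made up for illustration; the openqa-worker@N naming is upstream openQA's systemd template unit):

```yaml
# Hedged sketch of the worker-enablement loop; 'openqa_worker_count' is a
# hypothetical variable name, not necessarily what the branch uses.
- name: enable and start the requested number of worker instances
  service:
    name: "openqa-worker@{{ item }}"
    state: started
    enabled: yes
  with_sequence: start=1 end={{ openqa_worker_count }}
# If openqa_worker_count drops from 5 to 3, nothing here ever touches
# openqa-worker@4 or openqa-worker@5, so they stay enabled and running.
```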

  • The createhdds handling is kind of bad; the script really needs to be more sophisticated. It almost needs an internal versioning concept, so that if we change the definition of some disk and re-run the script, it just re-generates that disk. Basically, it needs to be smart enough that an ansible task can simply run it, and it will figure out how much work needs doing and do only that much. Since the script can't do that right now, I just hacked up something crappy that will at least work and not re-create all the images every time the script runs. This may have some missing dependencies at present; the VM I'm testing in doesn't have enough disk space to actually run the disk creation step and check it all works.
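
Concretely, the "something crappy" is more or less a guard like this, so the script only fires when its output doesn't already exist (script location, working directory and image name below are placeholders, not what's actually in the branch):

```yaml
# Hedged sketch: only run createhdds if the expected disk image is missing.
# The script path, chdir and image name are illustrative placeholders.
- name: create the base disk images with createhdds (skipped when they already exist)
  command: /path/to/openqa_fedora_tools/createhdds.sh
  args:
    chdir: /var/lib/openqa/share/factory/hdd
    creates: /var/lib/openqa/share/factory/hdd/disk_f23_minimal.img
```

The obvious downside is that changing a disk definition won't trigger regeneration; that's the part the script itself needs to get smart about.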

  • The secret variables that will need to be set for this to work are: openqa_apikey and openqa_apisecret, which both need to be randomly generated 16-character hexadecimal strings; openqa_dbpassword, which is just a database password we need to generate; and wikitcms_password, which is the password of a FAS user account used for submitting results (I know what this password is and can tell it to someone who can set secrets). We could use different database server accounts for openqa and openqa-stg, but right now I've set them both to use the username 'openqa'; I don't really see that they need different accounts. Ditto openqa_apikey and openqa_apisecret: we could use different values for openqa and openqa-stg, but I'm not sure we need to.
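
For whoever sets the secrets, the private vars would look something like this (all values below are obviously placeholders):

```yaml
# Placeholder values only - generate real ones before use.
openqa_apikey: "1234567890ABCDEF"      # 16 random hex characters
openqa_apisecret: "FEDCBA0987654321"   # 16 random hex characters
openqa_dbpassword: "sekritdbpassword"  # any generated database password
wikitcms_password: "sekritfaspassword" # password of the FAS account used to submit results
```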

  • I don't know how we typically handle updating infra hosts. I see there's a yum-cron task, but that's kinda old; is someone gonna add a dnf-automatic task? Is there a convention for this?

  • I made the 'dnf' tasks here use 'present' (not 'latest'), so they won't update packages, but the git tasks will update the git checkouts every time the plays get run. Right now openqa and openqa-stg are going to wind up with all the same bits; we could possibly tweak our git branching so stuff coming from git lands on openqa-stg first, but I'm not totally sure we want to bother with that right now. Similarly, I could set up multiple COPRs and have openqa and openqa-stg use different ones, so openqa-stg gets package updates from the COPR faster, but maybe that's for the future.
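
Concretely, the package tasks are just the usual shape ('present' installs but never upgrades an already-installed package; 'latest' would pull updates on every run):

```yaml
# Hedged sketch of the package install task; the package name is the one the
# COPR ships, and state: present deliberately avoids upgrading on later runs.
- name: install the openQA packages
  dnf:
    name: openqa
    state: present
```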

  • As you'll notice, several things come from git. openqa_fedora is the Fedora openQA tests; it really doesn't make sense to distribute these any other way but as a git repo - they're kinda like spin-kickstarts, packaging would just be too onerous, and we need a very quick workflow for them. openqa_fedora_tools is the Fedora layer of tools around openQA; the biggest thing there is the tool for actually scheduling openQA jobs, fedora_openqa_schedule. That could be packaged, but it's kinda hoop-jumping for fairly special-interest stuff. openQA-python-client is my small python client module for the openQA API; fedora_openqa_schedule uses it to talk to the server, and ditto, it could be packaged but I'm not sure it needs to be. I don't know what the policy is on doing stuff like these git checkouts.
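
The checkouts themselves are ordinary git tasks, roughly like this (the repo URL and destination here are illustrative of the shape, not a copy of what's in the branch):

```yaml
# Hedged sketch: check out the Fedora openQA tests. The git module moves the
# checkout to the requested ref on every run, which is the 'always updates'
# behaviour mentioned above. URL and dest are illustrative placeholders.
- name: check out the Fedora openQA tests (openqa_fedora)
  git:
    repo: https://bitbucket.org/rajcze/openqa_fedora.git
    dest: /var/lib/openqa/share/tests/fedora
    version: master
```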

  • Right now we can only deploy to Fedora 23. F22 would probably be possible - I think I'd need to push a newer Mojolicious build to the COPR - but I'm not sure that would be of interest to anyone. Deploying on EPEL 7 would obviously be nice in some senses, but I have not looked at the status of the dependencies in EPEL 7 at all yet; I strongly suspect several will be missing or old.

Uh, dunno what else I need to note, really. There's a LOT of detail here, please do ask me on IRC about anything at all.


Oh, I guess one more note: in this implementation I went with templating the client.conf and workers.ini files. There is an alternative here: we have a python script which handles editing those files, in openqa_fedora_tools - https://bitbucket.org/rajcze/openqa_fedora_tools/src/4c1e2ee42621537a51ef5d6a4f7fa5b1876a0bee/docker/webui/scripts/client-conf?at=develop . I'd need to put it in the openqa package, make it work on the proper paths (right now it operates on the paths used by the Docker setup), and also maybe fix it up to handle file permissions properly (I just realized that if you use it to create client.conf, it'll probably create it world-readable, which we don't want). We could do that, but eh, the templating really isn't that bad.
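
For reference, the templating approach is just the usual thing, something like the sketch below; the ownership and mode are what I'd want (i.e. not world-readable), not necessarily exactly what the branch has right now:

```yaml
# Hedged sketch: install client.conf from a template with restrictive
# permissions, since it contains the API key and secret.
- name: install /etc/openqa/client.conf from a template
  template:
    src: client.conf.j2
    dest: /etc/openqa/client.conf
    owner: geekotest
    group: root
    mode: "0600"
```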

Oh, yeah - the 'Have tests owned by geekotest' task always shows 'changed', I think because the git checkout task winds up setting the ownership back to root. We can't run the git checkout as the 'geekotest' user, because it's a system user and isn't allowed to run stuff. I'm not sure how much we care about avoiding spurious 'changed' statuses; if it's a big problem I can try and think of a way around it.
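
The task in question is presumably something like this (path is what I'd expect for the tests checkout, not a copy of the branch); since the git task runs as root and resets ownership, this one flips it back - and reports 'changed' - on every run:

```yaml
# Hedged sketch of the ownership task; the path is an assumption based on the
# default openQA tests location.
- name: Have tests owned by geekotest
  file:
    path: /var/lib/openqa/share/tests
    state: directory
    owner: geekotest
    recurse: yes
```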

For the record, puiterwijk pointed out that 'core' infra shouldn't depend on COPR, because COPR itself is only best-effort supported.

At present I think we're fine with openQA having only the same 'best effort' status as COPR. Nothing ultimately distro-critical depends on openQA being up all the time: nothing is currently gated on openQA tests. So if it's OK to use COPR and it just reduces the support status of the deployment, I think we're fine with that. If it's absolutely required for the packages to be copied from the COPR to the infra repo, we can do that; it just seems like a painful extra step. The COPR exists explicitly to back this deployment of openQA; it has nothing in it but openQA itself and the packages required to make it work.

I listed the external hostnames of the web UIs as openqa.fedoraproject.org and openqa-stg.fedoraproject.org, but those are really just placeholders. They could easily be openqa(.stg).qa.fedoraproject.org (or I guess open.qa.fedoraproject.org if we want to get cute?) Or, you know, whatever's appropriate.

This is now basically done; we have production and (quasi-)staging openQA instances up and running at https://openqa.fedoraproject.org and https://openqa.stg.fedoraproject.org/ . I am working through various details that need cleaning up, but the main stuff is done.
