#5886 need method for distributing urgent fixes... urgently
Closed: It's all good (6 years ago). Opened 10 years ago by mattdm.

Right now, the time it takes to do a push of F19 and F20 updates is a critical bottleneck in our ability to put out zero-day security updates. We need a solution for this.

I understand that Infrastructure is working on a Netapp fix which will give a 3x speedup. I think that's probably still too slow. We need to be able to put out urgent updates on the scale of minutes, not hours.


One idea I have is to create a separate "urgent updates" repository. Dennis suggests that mash could talk to bodhi to get the security updates. And of course any new dependencies would need to be pulled in alongside.

If it includes deltarpms at all, it would do the push in two phases -- first without, and then with (as deltarpms are the most resource-intensive part of the process).
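To make the two-phase idea concrete, here is a rough sketch, assuming plain createrepo_c calls stand in for the real mash run (paths and the rsync target are placeholders, not existing infrastructure):

{{{
#!/usr/bin/python
# Sketch only: push urgent bits first without deltas, then regenerate with them.
import subprocess

REPO = "/srv/urgent-updates/f20/x86_64"   # placeholder repo path
OLD = "/srv/updates/f20/x86_64"           # previous packages, used for deltas
TARGET = "master::urgent-updates/"        # placeholder rsync target

# Phase 1: get the fix out the door with plain metadata only.
subprocess.check_call(["createrepo_c", "--update", REPO])
subprocess.check_call(["rsync", "-a", REPO + "/", TARGET])

# Phase 2: regenerate with deltarpms once the urgent bits are already live.
subprocess.check_call(["createrepo_c", "--update", "--deltas",
                       "--oldpackagedirs", OLD, REPO])
subprocess.check_call(["rsync", "-a", REPO + "/", TARGET])
}}}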

We would also make it clear that this repository is meant to be run alongside the existing updates, and does not replace it -- there are some policy details to work out.

So, just a few things to note:

The push with the openssl update in it ran into several issues, so it wasn't a typical push. For some reason there was an httpd restart that confused bodhi, and it ended up remashing several repos. Additionally, our storage is currently very slow and we are being moved to other storage.

I don't say this to disagree with the idea of this ticket at all, just more to say that looking at that push is not looking at a 'normal' one.

Even with a separate critical repo I am not sure we are going to get down to minutes, but perhaps we can if we drop some features that people expect, to get more speed. For example, we could have the critical-updates repo not do drpms. It probably will still have to do multilib, since skipping that could leave some people unable to update.

Would the urgent repo migrate things over to the normal stable updates repo after a while? That might help to keep it small.

More thoughts to come later...

Replying to [comment:2 kevin]:

Would the urgent repo migrate things over to the normal stable updates repo after a while? That might help to keep it small.

I'm thinking that packages would go into both the urgent repo (immediately) and the normal one (with the rest of that day's push). Then they could be expired out of the urgent repo after a week (or whatever time is technically best).

Initial implementation idea:

  • add a new tag and repo, e.g. fXY-fast-updates
  • tag builds manually
  • untag builds 7 days after they are tagged (see the koji sketch below)
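A minimal sketch of that lifecycle with the koji Python API (tag name and NVR are just examples from this proposal; the real job would also need the right tagging permissions):

{{{
#!/usr/bin/python
# Sketch only: tag an urgent build into the proposed fast-updates tag and
# clear it out again later. Tag name and NVR are placeholders.
import koji

session = koji.ClientSession("https://koji.fedoraproject.org/kojihub")
session.gssapi_login()   # or krb_login()/ssl_login(), depending on koji version

TAG = "f23-fast-updates"                      # example tag from this proposal

# Manually tag the urgent build:
session.tagBuild(TAG, "bash-4.3.30-1.fc23")   # placeholder NVR

# ~7 days later, once the fix has reached the regular stable repo,
# untag everything so the repo goes back to being empty:
for build in session.listTagged(TAG):
    session.untagBuild(TAG, build["nvr"])
}}}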

In Bodhi 2: Add option to request fast updates

CVE-2014-6271 (widely-reported bash exploit with remote compromise potential) is another good example.

Till's proposal looks sane to me. Doing something more manual in the meantime would be fine too. I think it's likely to be something that would be used once or twice a year, but it's really important for our users at those times.

Once we have something that works smoothly for ultra-critical updates, we could move towards using it for all critical updates -- but first things first.

There was some confusion about what we want to do. So here is a more detailed proposal:

Initially we need a repository that is enabled by default on users' systems, so the packages can bypass updates-testing. However, since some testing might still make sense, it could also be worth creating a -testing repository that is disabled by default. This has the added benefit of allowing us to easily verify that the fast-distribution repo setup still works, simply by adding builds to the -testing repo. So the setup would then be:

  • add two new tags and repos, e.g. fXY-fast-updates and fXY-fast-updates-testing
  • sign the builds, then tag them manually
  • react to the tag events via fedmsg: copy the RPMs, run createrepo and sync the output (see the sketch right after this list)
  • untag builds 7 days after they are tagged
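A rough sketch of what that fedmsg reactor could look like (tag names, paths and the rsync target are made up for illustration; the real thing would also want the signature check discussed below):

{{{
#!/usr/bin/python
# Sketch only: listen for koji tag events and rebuild/sync a small repo.
import subprocess
import fedmsg

FAST_TAGS = ("f23-fast-updates", "f23-fast-updates-testing")
REPO_DIR = "/srv/fast-updates/f23"                            # hypothetical staging dir
SYNC_TARGET = "rsync://master.example.org/fast-updates/f23"   # placeholder

for name, endpoint, topic, msg in fedmsg.tail_messages():
    if not topic.endswith("buildsys.tag"):
        continue
    if msg["msg"].get("tag") not in FAST_TAGS:
        continue
    nvr = "%(name)s-%(version)s-%(release)s" % msg["msg"]
    # 1. copy the signed RPMs for `nvr` from koji into REPO_DIR (omitted here),
    # 2. regenerate the repodata,
    subprocess.check_call(["createrepo_c", "--update", REPO_DIR])
    # 3. push the result out to the master mirrors.
    subprocess.check_call(["rsync", "-a", "--delete", REPO_DIR + "/", SYNC_TARGET])
}}}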

Missing items:
Updateinfo from bodhi with security metadata: IMHO it is OK not to have it initially, because the setup still improves the situation. Also, e.g. for the bash update, the info was missing in Bodhi as well, so it is not critical.

I don't think we '''need''' a fast-updates-testing. In these cases, the testing is almost always done by people grabbing unpushed builds from koji.

I guess it doesn't hurt, and might make it easier for people to test ("yum --enablerepo=fast-testing install foopackage" instead of following the individual koji links). What would cause the packages to move from fast-updates-testing to fast-updates? Could that happen automatically with bodhi karma?

(We'd probably want something like 5 as a normal threshold for these, especially if it's automatic; one upside of the current long delay is that it gives time for a ''lot'' of people to test in the meantime, so by the time it's going out to mirrors we're pretty sure that the update at least isn't catastrophically wrong.)

The fedmsg / createrepo reactor should probably test for properly-signed RPMs just to be safe.
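Something like this might be enough for that check (a sketch; it relies on the Fedora keys already being imported into the host's rpm keyring, and on the usual `rpmkeys --checksig` output format):

{{{
import os
import subprocess

def all_signed(repo_dir):
    """Sketch: True only if every RPM under repo_dir has a good GPG signature
    according to `rpmkeys --checksig` (keys must be in the rpm keyring)."""
    for root, _dirs, files in os.walk(repo_dir):
        for name in files:
            if not name.endswith(".rpm"):
                continue
            path = os.path.join(root, name)
            try:
                out = subprocess.check_output(["rpmkeys", "--checksig", path])
            except subprocess.CalledProcessError:
                return False          # bad signature or unreadable package
            if b"signatures OK" not in out:
                return False          # unsigned: output only reports "digests OK"
    return True
}}}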

Replying to [comment:11 mattdm]:

I don't think we '''need''' a fast-updates-testing. In these cases, the testing is almost always done by people grabbing unpushed builds from koji.

It allows not only testing updates but also testing the fast-updates distribution system itself, because if it is rarely used, it might be broken without anyone noticing until it is needed.

I guess it doesn't hurt, and might make it easier for people to test ("yum --enablerepo=fast-testing install foopackage" instead of following the individual koji links). What would cause the packages to move from fast-updates-testing to fast-updates? Could that happen automatically with bodhi karma?

In the beginning it would be manual. But since the repo should be available within 20 minutes, testing can happen fast, and it should be OK to create the testing repo, wait a little for testers' feedback, and then create the stable repo.

The fedmsg / createrepo reactor should probably test for properly-signed RPMs just to be safe.

Yes, but this is also something that is noticed if testers use the testing repo to install the RPMs.

+1 to draft.

For MirrorManager: is it better to not use it, or to ask whether mirrors will carry the repo with a continuous, frequent (1hr?) sync period?

RE: mirrormanager:

  • If we use it, it will delay things by an hour or more while mirrormanager crawls the content and sees that the repo has changed and updates the metalink it serves.

  • it would allow us to use a metalink, which is more secure, but if the packages are signed I'm not sure it matters.

  • Without mm, we won't be able to tell which mirrors are up to date. Perhaps, however, we don't care and just point these all to the master mirrors (since it should be short times and limited content).

An idea Misc, Bochecha and I had last weekend about !MirrorManager (note: something to consider for MirrorManager2 only).

We could implement a 'panic mode' in which we single out one or more RPM(s).
When starting this 'panic mode' we would de-activate all the mirrors in mirrormanager except for a) the private ones and b) our own servers (up to date right after the push).

This way, people running yum update will all end up on our servers.

Alongside starting the 'panic mode' we would run a cron job every 30min/1hour checking each mirror we have for the presence (or absence!) of the specified RPMs. As we find them (or don't), we would re-activate the mirror in !MirrorManager (rough sketch at the end of this comment).

This way, people would all end up hitting mirrors that are up to date for these specific RPMs, and little by little the number of mirrors hit would increase, reducing the load on our servers.

Bonus: We should be able to monitor the rate of updates of our mirrors (see how quickly they sync).

After a day or two, we could stop the 'panic mode' and get !MirrorManager back into its normal state :)

That is of course about !MirrorManager itself and does not solve in anyway the time it takes to mash the updates.
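To make the cron-job part concrete, a very rough sketch (the mirror list, path layout and the re-activation call are all placeholders; MirrorManager2 would need a real API for that last part):

{{{
#!/usr/bin/python
# Sketch of the 'panic mode' re-activation crawler. Everything here is
# illustrative: the mirror list, the probed path and reactivate_mirror()
# (which would have to be a real MirrorManager2 call) are placeholders.
import requests

URGENT_RPMS = ["bash-4.3.30-1.fc23.x86_64.rpm"]          # the singled-out fix(es)
MIRRORS = ["http://mirror1.example.org/fedora",           # normally: pulled from MM
           "http://mirror2.example.org/fedora"]

def has_all_rpms(mirror):
    for rpm in URGENT_RPMS:
        url = "%s/updates/23/x86_64/%s" % (mirror, rpm)   # made-up layout
        if requests.head(url, timeout=10).status_code != 200:
            return False
    return True

for mirror in MIRRORS:
    if has_all_rpms(mirror):
        # re-enable it in MirrorManager so clients start hitting it again
        reactivate_mirror(mirror)    # hypothetical MirrorManager2 API call
}}}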

I think if this is implemented, you want it to also email the contact points for all of the mirrors to let them know that this Panic Mode has been activated and that they should trigger a force-sync.

Could emailing the list be sufficient? Or do we really want to email all the individual mirror admins?

An interesting idea, but not sure I like it...

a) as you note it doesn't help us with mash/drpm buildtime, etc

b) It would mean that our master mirrors would need to handle ALL requests from all fedora and epel users. I'm not sure offhand if they could. They could well be swamped.

The mirror managers mailing list used to be reasonably active 7-8 years ago, but in the last couple of years it has been very, very quiet. I think most mirrors are on autopilot.

I recently read how Debian does this. In addition to their normal mirrors they have one (or several?) core servers that only host urgent security fixes. This reduces the time for getting the packages out the door and into the hands of their users.

Of course, this bypasses the need for mirrors but I wonder if this could be an acceptable alternative to trying to push the bits around. Would one or two servers be able to handle the load of a single update every so often?

Of course, this bypasses the need for mirrors but I wonder if this could be an acceptable alternative to trying to push the bits around. Would one or two servers be able to handle the load of a single update every so often?

I thought this had been discussed already and was considered reasonable (I might be wildly wrong here). Ultimately it would only have to handle the package load while the content made it to the mirrors, which could be done in parallel. I.e. push to both at the same time, with priority to the mirrors, so that if the same NVR was on a mirror the client would prefer that. The only issue with that approach would be clients refreshing metadata, as you'd likely want a low TTL on the metadata cache, but with few packages that should be small.

Replying to [comment:22 pbrobinson]:

Of course, this bypasses the need for mirrors but I wonder if this could be an acceptable alternative to trying to push the bits around. Would one or two servers be able to handle the load of a single update every so often?

I thought this had been discussed already and was considered reasonable

Maybe. I've gotten lost on this topic as it's been ongoing for so long now.

ok. I think what we need is a straw-man proposal for people to tweak/poke holes in, and so I will do so:

prereqs:

  • bodhi adds fedora-urgent-NN setups. Its mash config has no drpms. Possibly its interface doesn't even show this product if there are 0 updates in it (which should be the normal state).

  • fedora-release-repos pushes out a version with new fedora-urgent-updates and fedora-urgent-updates-testing repos. They use metalinks and normally point to an empty repo.

Process:

  • Maintainer(s) follow the normal update process. Build in koji, submit update to bodhi, etc.

  • They submit a releng ticket asking for the update to be in urgent updates.

  • If approved, releng submits the update(s) to the urgent-updates product, signs them and pushes them to testing.

  • The repo is synced to an urgent-updates-testing repo and must get +3 karma to pass this point.

  • On stable karma the update(s) are pushed to the urgent-updates repo and synced out.

  • Mirrormanager is poked to update the repodata and metalink, which at first just points to the master mirrors, but adds more mirrors over time as they sync.

  • After the update goes to stable in normal updates + 1 week, the urgent-updates repo is cleared out and an empty repo is pushed out.

comments:

  • This will be faster than the current setup because it can be done independently of normal updates pushes, the repos will be very small (mashing should take very little time), there are no drpms, etc.

  • The longest times here will be mirrormanager noticing the updated repos, and the human steps like noticing the ticket, pushing the updates, testing the updates, etc.

  • We really do need mirrormanager here unless we want all users to always hit the master mirrors' empty repo (which some may see as a way to track or count them). Also, we really want a metalink, as it's much better than a baseurl.

  • We need bodhi here to have sanity checks like all rpms signed, repodata has security update info for security plugins, etc.

Issues:

  • Is a releng ticket right to ask for this? Who approves it and how?

  • Is this going to be fast enough to make it worthwhile?

  • Is there a way to reduce waiting for humans here without bypassing some important checking?

Hi. I know I'm coming in late to this discussion, but I would like to ask a question. This ticket indicates that the standard methods for pushing an update are too slow. Could someone detail the steps in that process and how long they take? I am interested to see if the standard process could be improved instead of creating a new one.

Replying to [comment:26 dgregor]:

Hi. I know I'm coming in late to this discussion, but I would like to ask a question. This ticket indicates that the standard methods for pushing an update are too slow. Could someone detail the steps in that process and how long they take? I am interested to see if the standard process could be improved instead of creating a new one.

Kevin described it with quite some details at: https://lists.fedoraproject.org/pipermail/devel/2015-March/209411.html

  • Is a releng ticket right to ask for this? Who approves it and how?
    My worry is that you have to know a) there's a security process and b) a ticket has to be created. For huge security issues, where a lot of people are involved, it's not a big deal. But for other security issues, especially for components whose maintainer creates a security erratum once in a lifetime, it could be an issue. I'm not sure if it's possible in the current Bodhi codebase (or how big the task would be), but maybe add a check of the bugs referenced in the update for security fields, and if the severity is high enough (and the embargo has already passed), show a dialog: "this is very likely a highly urgent security update, please follow ... and file a ticket" - the last part could be "click here to notify releng/security team" (see the sketch after this list). I know, I did not answer the approval thing (and also the notification part).

  • Is there a way to reduce waiting for humans here without bypassing some important checking?
    Embargo is the main delay factor. In our public infrastructure, you can only start the real job after the embargo is lifted (of course, you can prepare all the patches, test a scratch build...). But I understand private builds would be a huge amount of work, very likely not worth it. Last time the Java folks asked, the Board approved the idea, but nobody pushed for implementation after that.
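A sketch of the "detect a probably-urgent update" check mentioned above, using python-bugzilla (the keyword and severity thresholds are guesses, not existing Bodhi behaviour):

{{{
import bugzilla

def looks_urgent(bug_ids):
    """Sketch: guess whether an update referencing these bugs is likely an
    urgent security fix. Keywords/thresholds here are assumptions, not policy."""
    bzapi = bugzilla.Bugzilla("bugzilla.redhat.com")
    for bug_id in bug_ids:
        bug = bzapi.getbug(bug_id)
        if "Security" in (bug.keywords or []) and bug.severity in ("urgent", "high"):
            return True
    return False

# Bodhi could then prompt: "this looks like a highly urgent security update,
# please file a releng ticket / notify the security team".
}}}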

Replying to [comment:25 kevin]:

  • Is there a way to reduce waiting for humans here without bypassing some important checking?

An automated email going to the test@ or test-announce@ list could help with gathering testers.

This is something for which we need to ensure we have a plan in place before F23 Alpha.

Sadly, alpha has come and gone... ;(

I have updated my proposal based on discussions:

https://fedoraproject.org/wiki/Urgent_updates_policy

The only outstanding questions in my mind on this are:

a) How much bodhi work will be needed.

b) whether we should relax the requirements somewhat. For example, some updates are security related but cannot be marked as such when submitted because they are waiting for a CVE, etc.

We had a meeting to kick this issue back into gear. Here are the notes from the public pad we used:

{{{
Requirements for quick fixes, 2016-Feb-24
Attending:
dgilmore
mattdm
admiller
pfrields
dgregor

Imagine a quick fix needed for a Heartbleed/Shellshock type exploit -- q.v. https://fedorahosted.org/rel-eng/ticket/5886
Determine requirements for a solution and get the ball rolling toward implementing

Requirements:
-- Barring utter breakdown of the internet, used once or twice a year. Not used frequently (so should be designed for infrequent, unfamiliar use.)
-- Available to mere mortal users in the order of minutes rather than hours
(meaning available on the user side?) Yes — after it's through QA, it should be basically immediately available for dnf install / update blinky in GNOME
-- Seamless with normal updates processes for users (doesn't require extra steps before or after on end user systems)
-- As few human bottlenecks as pragmatically and securely possible
-- Need IPv6 connectivity to repository
-- SOP for process (once figured out) -- include communication change

How do we make these fixes available for everyone?
dgilmore: can't just mash a repo for this
dgilmore: Need a tag in koji, an extra target in Bodhi, and we need to define a repo in our -repo RPM that will usually be empty but gets critical updates out by default
admiller: Why not just have a special Bodhi marking for critical security updates?
the new improved Bodhi still requires many more hours than reasonable to get content out to users
When MirrorManager picks something up, it still has to be sync'd out -- hours or more
pfrields: Can we do this with a primary location that is not mirrored, but also adds mirrored metalink if that fails under high load?
dgilmore: Probably a location in kojipkgs
mattdm: (does this risk killing kojipkgs?) pfrields: +1, needs to be aggressively cached and not too contentious
dgilmore: Need IPv6 only connection, which we don't have in PHX2?
mattdm: what about having 3-4 high bandwidth partners participate, and sync via push rather than pull?
front end proxies are there, so having them front the repo would probably work
Need lmacken input on how this would work in Bodhi, what's the size of that effort
May need a separate product -- Bodhi only lets you do one F24 push at a time, so we might need one for e.g. F24-Security
Need to restrict access somewhat for the push side -- the builds should be open to maintainer + provenpackager as usual
Perhaps https://admin.fedoraproject.org/accounts/group/view/security-team + a few other trusted?
Not sure if we want security team folks to have to interact with Bodhi twice a year and not know the process
Cloud and other image rebuilds
mattdm: important but a different scope
admiller: the piece that's not fast is generating new two week atomic images (which are built nightly), which follows on the compose process. Generating the updates rpm-ostree is actually pretty quick so that people who already have Atomic Host installed will get updates quickly, but getting new install and cloud (AWS/qcow2) images would be slower (handful of hours vs handful of minutes).
pfrields: need a document that describes decision tree for making new deliverables when required
should include escalation points
dnf: if cache is not old enough, you have to manually refresh
What is GNOME limitation?
dgregor: Don't see a need for this process to be dovetailed with Red Hat internally, different set of needs and constraints there
AGREED: not a problem in this case; the approach is more or less dictated by the rest of the toolchain

ACTION:
Paul - Set up meeting w/ dgilmore, lmacken to discuss Bodhi effort needed here and figure out any implementation problems
might combine this with kfenzi to figure out hosting partner/b'width side
Dennis -Talk with kevin to get concrete plan on hosting (see above?)
Mattdm - Talk to Eric Christensen (sparks) about security team and other access to this process
Paul -- find out how GNOME treats aging of repo metadata, hopefully same as dnf?

empty repodata
du -hs /mnt/koji/mash/updates/f23-updates-blank/f23-updates/x86_64/repodata/*
4.0K /mnt/koji/mash/updates/f23-updates-blank/f23-updates/x86_64/repodata/401dc19bda88c82c403423fb835844d64345f7e95f5b9835888189c03834cc93-filelists.xml.gz
4.0K /mnt/koji/mash/updates/f23-updates-blank/f23-updates/x86_64/repodata/6bf9672d0862e8ef8b8ff05a2fd0208a922b1f5978e6589d87944c88259cb670-other.xml.gz
4.0K /mnt/koji/mash/updates/f23-updates-blank/f23-updates/x86_64/repodata/77a287c136f4ff47df506229b9ba67d57273aa525f06ddf41a3fef39908d61a7-other.sqlite.bz2
4.0K /mnt/koji/mash/updates/f23-updates-blank/f23-updates/x86_64/repodata/8596812757300b1d87f2682aff7d323fdeb5dd8ee28c11009e5980cb5cd4be14-primary.sqlite.bz2
4.0K /mnt/koji/mash/updates/f23-updates-blank/f23-updates/x86_64/repodata/dabe2ce5481d23de1f4f52bdcfee0f9af98316c9e0de2ce8123adeefa0dd08b9-primary.xml.gz
4.0K /mnt/koji/mash/updates/f23-updates-blank/f23-updates/x86_64/repodata/f8606d9f21d61a8bf405af7144e16f6d7cb1202becb78ba5fea7d0f1cd06a0b2-filelists.sqlite.bz2
4.0K /mnt/koji/mash/updates/f23-updates-blank/f23-updates/x86_64/repodata/prestodelta.xml.xz
4.0K /mnt/koji/mash/updates/f23-updates-blank/f23-updates/x86_64/repodata/repomd.xml

Downloads on client should be 8-12k (good news is the "du" is misleading because that's just the minimum block size adding up)
}}}
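For context, that blank repodata is essentially what you get by running createrepo over an empty directory (a sketch; the exact file list depends on the createrepo/mash options used):

{{{
import os
import subprocess

# Sketch: the "normally empty" urgent-updates repo is just repodata generated
# over a directory that contains no packages.
blank = "/tmp/f23-urgent-blank"      # placeholder path
os.makedirs(blank)
subprocess.check_call(["createrepo_c", blank])
# blank/repodata/ now holds a handful of tiny metadata files, on the order of
# the 8-12k download mentioned in the notes above.
}}}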

So, this is pretty much the conclusion that I thought Dennis and I came to, so that's good. ;)

If we use 'skip_if_unavailable=True' in the config, we could even avoid clients having to download anything when there aren't any updates in the repo.
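For what it's worth, the client-side repo definition could look roughly like this (the file name, the metalink repo= value and the one-hour metadata_expire are guesses, not a shipped config):

{{{
[fedora-urgent-updates]
name=Fedora $releasever - $basearch - Urgent Updates
# Hypothetical metalink; it would normally resolve to an (almost always empty) repo.
metalink=https://mirrors.fedoraproject.org/metalink?repo=urgent-updates-f$releasever&arch=$basearch
enabled=1
gpgcheck=1
gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-$releasever-$basearch
# Don't error out if the repo isn't there at all:
skip_if_unavailable=True
# Check more often than the default so urgent fixes show up quickly:
metadata_expire=1h
}}}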

I think we should be able to have a small repo on the master mirrors without them dying. (Of course you never know until you try). Especially if it's just for a day or so at a time.

Last I checked, gnome-software checks very infrequently (once a day? once a week?) and only once it has a transaction with no problems will it prompt you to update. Perhaps it could have a separate mode for urgent updates or something.

As I understand it, gnome-software checks once per day (or on-demand) for security issues and once per week for non-security issues. Once a day for security issues is probably okay... if it's an important enough issue, I expect people will happily hit the refresh button.

Replying to [comment:35 sgallagh]:

As I understand it, gnome-software checks once per day (or on-demand) for security issues and once per week for non-security issues. Once a day for security issues is probably okay... if it's an important enough issue, I expect people will happily hit the refresh button.

As long as it would then immediately appear once people did that, that's probably okay. (Since we're not auto-installing updates with DNF either, this is basically the same as needing to run DNF from the command line to see if an update is there.)

In the future, we could look at some sort of push notification system — maybe for security updates in general, not just critical-urgent. But, one step at a time!

dgilmore, lmacken, nirik, maxamillion and I met about this ticket earlier. We arrived at the consensus that if we can cut mash time down, we could have updates ready to push in less than an hour, possibly considerably less. There are some Koji fixes headed upstream that could help here, but we don't know how long they'll take. In the meantime though, we know what the problem is they address (file permissions that require cp versus hardlink), and lmacken is going to try those fixes this week in staging to compare performance. Both dgilmore and kevin will consult on ansible fixes, etc. to help out.
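For reference, the cp-versus-hardlink difference those Koji fixes address boils down to something like this (a generic illustration, not the actual patch):

{{{
import os
import shutil

def place_rpm(src, dst):
    """Prefer a hardlink over a copy when laying out the mash tree: linking is
    effectively instant, while copying gigabytes of RPMs is a large chunk of
    the push time. Linking only works when the filesystem and file ownership
    allow it, which is exactly what the permissions fixes are about."""
    try:
        os.link(src, dst)       # same filesystem + suitable ownership: instant
    except OSError:
        shutil.copy2(src, dst)  # fallback: full copy, much slower
}}}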

We know there is also some time needed for MirrorManager to invalidate the old cache, but there are a few things we can do there as well. We may tweak mirrormanager, in specific cases, to point everything to the master mirrors + ibiblio for an hour or two, until more mirrors have synced. This might obviate the need for a separate repo. We're keeping the separate repo on the table as an alternative. Everyone agreed we can and should have this fixed before F24 Beta.

After changing the bodhi masher to run as the apache user, we got a small speed boost, but not the huge boost that we were hoping for. This is unfortunately due to deltarpms.

I temporarily disabled deltarpm generation for f24-updates-testing, and it gave us a 75% speed boost, taking the mash time down from 40 minutes to 10. Given this delta overhead, I'm fairly confident that we could have bodhi disable deltas for pushes that contain critical updates, and get things out the door in under an hour. For example, f23-updates takes almost 3 hours to mash. If we disable deltas, I think we can get it down to ~45min.

So, do you folks think I should pursue the task of having bodhi disable deltas for pushes containing critical security updates, or should we dive into setting up some out-of-band repo for this?

Is it possible to generate deltas after mash and push has happened in an asynchronous way? If so, that seems like a pretty big win overall anyway.

If not, we might still want to discuss whether deltarpm generation is worth the significant time hit we take here.

I would rather we pursue a course of action that results in deltas not going away. We will never get the deltas back once they get temporarily disabled; the cached deltas will go away. I think the only viable option will be an out-of-band repo. The out-of-band repo generation will also be much faster, as the repo will be really small. Without major work redesigning tools and workflows, we cannot do the delta generation in an asynchronous way.

There were some patches to createrepo that allowed for parallel drpm generation, but I don't think we ever got them working right and/or enabled them.

Perhaps we could engage with createrepo/createrepo_c upstream and see what they can do there?

Metadata Update from @mattdm:
- Issue set to the milestone: Fedora 23 Alpha
- Issue tagged with: meeting, planning

7 years ago

This is a very old ticket, does anyone still want it?

Metadata Update from @mohanboddu:
- Issue close_status updated to: None

7 years ago

This is a very old ticket, does anyone still want it?

Yes. It's back-burner right now, but we'll definitely want it the next time a big-name remote exploit shows up.

Note to self (Kate): Add to priority pipeline.

Here is an example:

https://bugzilla.redhat.com/show_bug.cgi?id=1456884
https://bodhi.fedoraproject.org/updates/FEDORA-2017-54580efa82

The bugzilla ticket is automatically created:

This is an automatically created tracking bug! It was created to ensure that one or more security vulnerabilities are fixed in affected versions of fedora-all.

Can we have such automatic assignment poke people, or loosen the number of required votes needed to push the update?

@alsadi Well, there's always a balance. We wouldn't want to make things worse by pushing out a bad fix too quickly. The sudo update is an interesting example here. It's obviously a security fix, but the issue it corrects is an escalation from limited sudo privileges to full-root equivalent. In the default configuration in Fedora, we give full-root equivalent to members of the wheel group, and nothing else — in other words, unless you've got a special configuration, the issue doesn't matter.

Meanwhile, it turns out that the initial fix was incomplete — see https://bugzilla.redhat.com/show_bug.cgi?id=1459152. It isn't the case here, but it could have been that the quick fix made things worse. A bad update to sudo could even lock legitimate users out of their systems.

Yes, we need to get updates out quickly... but we also need to make sure that they're good updates. I don't think loosening the amount of required QA is the answer.

We discussed this in today's release engineering meeting.

@mattdm are you okay with rolling this into our speeding-up-compose ticket?

@kevin states that pushes are now much faster than when this was originally filed and the only other issue he can see would be to speed up the creation of DRPMs.

We'd like to close this issue, transfer data into the compose speed taiga issue, and open an upstream issue for DRPM creation.

Please advise if you agree with this course of action.

@kellin I'm okay with rolling it in, especially if it's the most likely way to get something to happen. :)

The original issue was this: there was a critical security issue affecting users, and we actually had a patch, and that patch was tested and everything, but we were two days slower than other distros in actually making that patch available because we had to wait for the process to crank through. (And that process wasn't even testing or otherwise adding value.) That's both bad for users and embarrassing for us as a project.

All update repos (epel6/7,fedora26/27)(updates/updates-testing) composed in 3 hours this morning.

Just doing f26/f27 stable updates should take at most 2 hours. We also have a script that will tell mirrormanager to invalidate all the old repodata and point everyone to us, so call it 3 hours...
I think that's much better than 2 days, and I would argue "good enough", instead of inventing a new complex process that doesn't get used or optimized.

I agree — 2-3 hours is acceptable.

(Is there a close reason somewhere between "Invalid" and "Fixed"? Like a "This is Fine" close reason?)

Metadata Update from @ausil:
- Issue close_status updated to: It's all good
- Issue status updated to: Closed (was: Open)

6 years ago

Closed as "It's all good" status
