wiki:Meetings/2008-Aug-25/irclogs
Last modified on 04/23/11 20:52:42
16:00 < fabbione> ok guys
16:00 < fabbione> time to start
16:00 < fabbione> this is our first IRC meeting, so let's try to make the best out of it
16:01 < fabbione> we have ~1 hour of time
16:01 < fabbione> agenda: http://sources.redhat.com/cluster/wiki/Meetings/2008-Aug-25
16:01 -!- richterbag_AFK [n=richterd@nitrogen.citi.umich.edu] has quit ["[BX] Reserve your copy of BitchX-1.1-final for the Nintendo Gameboy today!"]
16:01 < fabbione> http://www.redhat.com/archives/cluster-devel/2008-August/msg00029.html
16:01 < fabbione> this is how it is going to work
16:01 < fabbione> riley_dt: you are up first
16:02 -!- richterd [n=richterd@nitrogen.citi.umich.edu] has joined #linux-cluster
16:02 < riley_dt> ok
16:02 < riley_dt> ready to start?
16:02 < fabbione> go ahead
16:02 < fabbione> paste your summary
16:03 < riley_dt> Corosync 0.91 released.
16:03 < riley_dt> Corosync package up for fedora packaging review.  Some comments so far but no
16:03 < riley_dt> approval.
16:03 < riley_dt> Message service synchronization engine done.  Message service looks pretty solid
16:03 < riley_dt> but requires more testing before hitting production status in tree.
16:03 < riley_dt> Major items to be completed before 1.0 of corosync:
16:03 < riley_dt> IPC forward port
16:03 < riley_dt> OpenAIS-trunk AMF porting to corosync (nearing completion)
16:03 < riley_dt> backwards compatibility
16:03 < riley_dt> there is the status
16:03 < fabbione> riley_dt: what is the ETA for 1.0?
16:04 -!- rohara [n=rohara@65.122.67.200] has joined #linux-cluster
16:04 < riley_dt> i believe the original schedule was sep15
16:04 < riley_dt> or perhaps sep 30
16:04 < riley_dt> but sometime around then :)
16:04 < fabbione> and how does it look now compared to the TODO list?
16:04 < riley_dt> looks pretty good
16:05 < riley_dt> ipc forward port is last piece of major work left
16:05 < fabbione> ok
16:05 < riley_dt> the compatibility effort will take some time
16:05 < riley_dt> amf porting is not a significant work item
16:05 < fabbione> i found a compat bug today btw.. still investigating tho
16:05 < riley_dt> yes there will be many at this time
16:05 < fabbione> anyway i will give you more info tomorrow
16:06 < fabbione> riley_dt: i pushed some buttons to get corosync in fedora. this is sort of blocking me to release new stuff there
16:06 < fabbione> let's see what happens
16:06 < fabbione> does anybody have questions for riley_dt ?
16:07 < richterd> fabbione: is there any likelihood of corosync going into fedora 10?  sorry if that's not germane.
16:07 < fabbione> richterd: we are working to get corosync in fedora 10.
16:07 < fabbione> the package is pending review.
16:07 < riley_dt> richterd the plan is corosync in fedora 10
16:07 < richterd> thank you
16:07 < fabbione> richterd: https://bugzilla.redhat.com/show_bug.cgi?id=459281 FYI
16:08 < riley_dt> richterd i'd like to say that was further along but with the recent fedora system failures getting the package reviewed has been difficult
16:08 -!- picachu [n=picachu@host-static-89-41-72-147.moldtelecom.md] has joined #linux-cluster
16:08 < picachu> anybody here?
16:08 < richterd> i imagine, yes.  thank you both.
16:08 -!- picachu is now known as kotique
16:08 < kotique> oh great
16:08 < fabbione> richterd: you are welcome
16:09 < fabbione> kotique: http://www.redhat.com/archives/cluster-devel/2008-August/msg00029.html
16:09 < fabbione> any more questions for riley_dt ?
16:09 < kotique> So, my question from yesterday. How do I combine 3 servers with a lot of storage into 1 big FS ?
16:09 < fabbione> kotique: we are having a team meeting right now. Please read the URL above
16:09 < kotique> ok, reading
16:09 < fabbione> kotique: thanks
16:09 < fabbione> ok
16:09 < fabbione> no more questions for richterd 
16:09 < fabbione> no more questions for riley_dt 
16:10 < fabbione> next in line would be chrissie but it's bank holiday in the UK
16:10 < fabbione> dct__: you are up next...
16:10 < fabbione> DLM/group/*controld status 
16:10 < dct__> ok
16:10 -!- didar is now known as didar_
16:11 < dct__> last week was quite a bit more RHEL stuff than usual, so not much past stuff to report, but list of todo items
16:11 -!- didar_ is now known as didar
16:11 < dct__> fix fenced bypass of victims to deal with problem of starting cluster
16:11 < dct__> with uncontrolled gfs state in the kernel
16:11 < dct__> enable and test disallowed code in new daemons
16:11 < dct__> libdlm remove dlm-control mknod, wait on udev
16:11 < dct__> tool formatting of new query output (fence_tool, dlm_tool, gfs_control)
16:11 < dct__> gfs_controld remount todo
16:11 < dct__> gfs_controld withdraw todo
16:11 < dct__> dlm_controld deadlock
16:11 < dct__> libdlm test lockspace name collisions caused by sysfs truncation
16:12 < dct__> daemons should narrow the coverage of the query lock
16:12 < dct__> end
16:12 < fabbione> ok
16:12 < fabbione> does anybody have questions for dct__ ?
16:13 < fabbione> dct__: what do you mean by: "daemons should narrow the coverage of the query lock" ?
16:13 < fabbione> (just not sure i understand it)
16:13 < dct__> the new daemons each have a separate thread just for answering queries (for status, etc)
16:14 < dct__> there's a lock that protects the info they are looking at to report
16:14 < dct__> this lock should be held over the minimum necessary calls
16:14 < fabbione> ah ok.. got it.. thanks
16:14 < dct__> so queries don't block if the main process is busy
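A minimal sketch of what "narrowing the coverage of the query lock" could look like, assuming a daemon with one main loop and one query thread. The class and method names (DaemonState, update_group, query_ls) are hypothetical and are not taken from fenced, dlm_controld or gfs_controld; the point is only that the query thread copies the state it needs while holding the lock and formats the reply after releasing it.

```python
import threading


class DaemonState:
    """Hypothetical state shared by a daemon's main loop and its query thread."""

    def __init__(self):
        self._lock = threading.Lock()  # plays the role of the "query lock"
        self._groups = {}              # group name -> list of member node ids

    def update_group(self, name, members):
        # Main loop: hold the lock only while the shared table is mutated.
        with self._lock:
            self._groups[name] = list(members)

    def query_ls(self):
        # Query thread: copy what the reply needs while holding the lock ...
        with self._lock:
            snapshot = {name: list(m) for name, m in self._groups.items()}
        # ... and do the (potentially slow) formatting with the lock released,
        # so neither side waits on the other longer than the copy takes.
        lines = []
        for name in sorted(snapshot):
            members = " ".join(str(n) for n in sorted(snapshot[name]))
            lines.append(f"{name}: {members}")
        return "\n".join(lines)


state = DaemonState()
state.update_group("fence", [1, 2, 3])
state.update_group("dlm", [2, 1])
print(state.query_ls())
```

The trade-off is that a reply may describe a state that is already slightly out of date, which is usually acceptable for status output such as an "ls" query.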
16:15 < fabbione> ok thanks
16:15 < fabbione> any more questions for dct__ ??
16:15 < richterd> dct__: does the query lock affect daemon-daemon interaction, or just e.g. gfs_control-gfs_controld interaction?  i.e., does query-locking only block against other queries?
16:16 < dct__> it's mostly things like fence_tool ls, dlm_tool ls, gfs_control ls
16:16 < dct__> (old group_tool queries)
16:16 < richterd> ok, thanks i recall now; haven't read that code in a month or so.  thank you.
16:16 < dct__> there are probably a couple of daemon-to-daemon queries
16:16 < richterd> oh, okay.  that's all i needed to know for my pNFS stuff.
16:17 < fabbione> ok awesome
16:17 < fabbione> any more questions for dct__ ??
16:17 < fabbione> ok
16:17 < fabbione> seems like we are good
16:17 < fabbione> lon: you are up next
16:17 < fabbione> # LDAP automatic schema generation status 
16:18 < fabbione> is lon actually around?
16:18 -!- ssato [n=ssato@NE1051lan9.rev.em-net.ne.jp] has joined #linux-cluster
16:19 < fabbione> i guess not.. anyway there was a minimal update. Lon has been working on automatically parsing data to generate LDAP schema 
16:19 < fabbione> and that is proceeding at a decent speed.
16:19 < fabbione> once the converter works, it will be possible to add entries without manual intervention
16:19 < fabbione> so less work for the human
16:19 < fabbione> next is me
16:19 < fabbione> # Fedora Builds/Community 
16:20 < fabbione> - Cluster Summit organization: finalized hotel, started on schedule.
16:20 < fabbione> - investigating rgmanager messaging breakage in master (almost certainly caused by corosync)
16:20 < fabbione> - tested new libconfdb patch from chrissie
16:20 < fabbione> - pushed corosync/openais split into ubuntu. Debian is pending Lenny release in sept.
16:20 < fabbione> - more scandisk fixes for stable2 and master.
16:20 < fabbione> - merged askant into contrib/
16:20 < fabbione> - patch review for beekhof
16:20 < fabbione> - build system cleanup from beekhof input.
16:20 < fabbione> - no new releases.
16:20 < fabbione> - update to F9 is pending infrastructure at this point.
16:20 < fabbione> - new unstable release is pending corosync in rawhide.
16:20 < fabbione> any question for fabbione?
16:21 < fabbione> ok.. i guess not
16:21 < fabbione> next in list (still me)
16:21 < richterd> (no question, but thanks for the FYI about corosync/openAIS in ubuntu and debian -- pNFS folks will like that)
16:22 < fabbione> richterd: welcome :)
16:22 < fabbione> # Cluster Summit status 
16:22 < fabbione> for those of you that have asked for hotel, that has been sorted, and you should have received an email
16:22 -!- shame [n=mike@24-182-108-29.dhcp.ftwo.tx.charter.com] has quit [Read error: 113 (No route to host)]
16:22 < fabbione> work on details of the schedule is in progress
16:22 < fabbione> http://sources.redhat.com/cluster/wiki/ClusterSummit2008/Schedule
16:22 < fabbione> you are welcome to add any comment, suggestion etc.
16:22 < fabbione> i will send more detailed info on it tomorrow
16:23 < fabbione> especially for the usage of the small room 
16:23 < fabbione> "not allocated" will be used to reschedule stuff that cannot be done in 50 minutes
16:24 < fabbione> any question on the cluster summit organization?
16:24 < chrissie> no questions, just a vote of thanks for doing it
16:25 < kanderso> ditto
16:25 < fabbione> chrissie: thanks to you :) you have been the shoulder i was crying on in desperation ;)))
16:25 -!- lon [n=lhh@nat/redhat/x-a4fd293b0962be2f] has quit [Read error: 110 (Connection timed out)]
16:25 -!- pleemans [n=peter@dD577D009.access.telenet.be] has quit ["Ex-Chat"]
16:25 < fabbione> ok
16:25 < fabbione> let's move forward
16:25 < fabbione> kanderso: do you have any update/question for us?
16:26 < kanderso> just would like update on progress on chkpt bug and current strategy
16:26 < fabbione> riley_dt: ^^
16:28 < riley_dt> ok for checkpoint bug
16:28 -!- lon [n=lhh@nat/redhat/x-12b78cf08e49fa80] has joined #linux-cluster
16:28 < riley_dt> working on making cman ignore messages from old ring
16:28 < riley_dt> going to see if that results in faster reproduction of the issue without the revolver tool failing
16:29 < riley_dt> my hope is that will fix the revolver hangup problem when no segfault happens 
16:29 < kanderso> riley_dt: you think there is an issue with rebuilding the ring after node failure 
16:29 < riley_dt> and then the tracing used in the segfaulting process can determine for me where the problem is
16:29 < kanderso> ahh
16:29 < riley_dt> the issue with cman is that it doesn't ignore the old ring messages
16:29 < riley_dt> so here is scenario:
16:29 < riley_dt> node 1, 2 start
16:30 < riley_dt> they do some synchronization and send their quorum messages
16:30 < riley_dt> then node 3 starts
16:30 < riley_dt> node 1 and 2 never finished their synchronization of the quorum messages
16:30 < riley_dt> and now node3 starts, which forms new ring 1, 2, 3.  these nodes receive old ring information
16:30 < riley_dt> such as quorum votes
16:30 < riley_dt> this confuses cman
16:30 < chrissie> (you say)
16:30 < riley_dt> cman drops quorum then reacquires it
16:30 < riley_dt> yes this is all my guessing right now
16:31 < riley_dt> i dont have evidence this is what the issue is
16:31 < riley_dt> just a short trace
16:31 < riley_dt> i have definitely seen cman drop and reacquire quorum in a new configuration
16:31 < kanderso> why are the old messages invalid?
16:31 < chrissie> riley_dt: yes that's quite possible
16:31 < riley_dt> they were sent under a set of assumptions that are no longer true
16:32 < riley_dt> totem has a queue of outgoing messages
16:32 < riley_dt> it is possible a message can be queued in an old ring and sent in a new ring
16:32 < riley_dt> i believe that is what happens in this scenario
16:33 < riley_dt> all of the openais service engines include a ring id to reject these messages
16:33 < kanderso> hmm - when you form a new ring - do you discard all of the old messages
16:33 < riley_dt> kanderso you can't discard messages per vs (virtual synchrony); it would make things really confusing
16:33 < riley_dt> but some services should manually reject old messages
16:33 < riley_dt> the ckpt service for example will manually reject old ring messages
16:33 < riley_dt> cman should as well but does not
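A rough sketch of the ring-id check described above, replaying the scenario from the discussion: a quorum vote queued while the ring was {1, 2} is delivered after the ring {1, 2, 3} has formed. The types and names here (RingId, Message, Service, on_config_change, on_deliver) are hypothetical illustrations, not the openais or cman API; the assumption is only that each message carries the id of the ring it was queued in.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RingId:
    rep: int  # hypothetical: id of the ring's representative node
    seq: int  # hypothetical: sequence number, grows with every new ring


@dataclass
class Message:
    ring_id: RingId  # ring the sender was in when it queued the message
    sender: int
    payload: str


@dataclass
class Service:
    current_ring: RingId = None
    delivered: list = field(default_factory=list)

    def on_config_change(self, ring_id):
        # A new ring formed; remember its id for later checks.
        self.current_ring = ring_id

    def on_deliver(self, msg):
        # Reject anything queued under an older ring: it was sent under
        # assumptions (membership, votes) that may no longer hold.
        if msg.ring_id != self.current_ring:
            return False
        self.delivered.append(msg)
        return True


old_ring = RingId(rep=1, seq=4)  # nodes 1 and 2
new_ring = RingId(rep=1, seq=8)  # nodes 1, 2 and 3

svc = Service()
svc.on_config_change(old_ring)
stale_vote = Message(old_ring, sender=2, payload="quorum vote")

svc.on_config_change(new_ring)   # node 3 joins before the vote is sent
print(svc.on_deliver(stale_vote))
print(svc.on_deliver(Message(new_ring, sender=3, payload="quorum vote")))
```

Totem itself still delivers the stale message; the filtering is a decision made by each service, which matches the remark that some services reject old-ring messages and others do not.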
16:34 < kanderso> when would a service not reject old messages
16:34 < riley_dt> cpg doesn't
16:34 < riley_dt> it's a choice made by the application that uses totem
16:34 < kanderso> would this be related to our 64 node stability problems as well when 16 nodes try to join?
16:34 < riley_dt> the way most of the openais service engines are designed is to resync all state 
16:35 < riley_dt> kanderso not sure on the 16 node issue - haven't seen a trace
16:35 < riley_dt> kanderso from my understanding that is trouble forming a ring not trouble synchronizing
16:36 < kanderso> so, next step - cause cman to reject old messages and see if you can create a stable cluster - do you need hardware?
16:36 < riley_dt> hardware is good
16:36 < riley_dt> so next step is cause cman to reject old messages
16:37 < riley_dt> and see if I can get ckpt to segfault with revolver
16:37 < riley_dt> with the tracing
16:37 < riley_dt> the ckpt tracing will pretty much solve the problem once i can get a trace
16:37 < fabbione> riley_dt: please add it as Action points to the Agenda so everybody knows
16:37 < riley_dt> with my current tracing, openais won't segfault in the ckpt service
16:37 < fabbione> i need to call the 5 minutes mark.. unless you are almost done
16:37 < riley_dt> because revolver fails before that happens
16:38 < riley_dt> revolver failure is undetermined but seems to be related to this cman issue
16:38 < riley_dt> ok im wrapped
16:38 < kanderso> fabbione: that's all I had for now
16:38 < fabbione> kanderso, riley_dt: ok cool
16:39 < fabbione> we are supposed to give voice to the community now. I know a bunch of people are connected but didn't add items on the agenda
16:39 < fabbione> is there anybody that would like to comment/add ideas/etc?
16:39 < markflar> fabbione: just wanted to say thanks for organizing this meeting and also the cluster summit
16:40 < fabbione> markflar: thanks. i hope it is useful for everybody to see what is going on
16:40 < markflar> is the list of attendees finalized?
16:40 < fabbione> markflar: and thanks for waking up at 7am :)
16:40 < markflar> fabbione: np, next time i'll hopefully be less groggy  :)
16:41 < fabbione> markflar: the list that's on the wiki is pretty much final. I am waiting confirmation from jlbec and sunil to be there but i assume they have everything sorted by now
16:41 < markflar> cool
16:42 < fabbione> ok.. i guess it's about time to wrap up..
16:42 < fabbione> Any Other Business?
16:42 < fabbione> i have only one item at this point.. once corosync hits fedora, we will bump our soname to 3.0.
16:42 -!- edwardam [n=edwardam@c-98-208-67-213.hsd1.ca.comcast.net] has joined #linux-cluster
16:42 < fabbione> our library API didn't change in a long time
16:43 < fabbione> and we need to finalize it for packagers
16:43 < riley_dt> ya
16:43 < fabbione> just to make clear
16:43 < fabbione> this is only library API
16:43 < fabbione> it doesn't guarantee that the code is stable..
16:43 < fabbione> nor feature complete
16:44 < dct__> what does it mean that packagers can/will do?
16:44 < fabbione> dct__: there are different issues related to the library soname
16:44 < fabbione> depending on the distro:
16:44 < fabbione> packagers need to rebuild packages that depend on our library API/ABI
16:45 < fabbione> for example: ricci needs libcman. if we bump the soname of libcman, ricci needs to be rebuilt
16:45 < fabbione> (this is general for all distros)
16:45 < fabbione> some distros like debian/ubuntu represent the major soname in the package name
16:45 < fabbione> for example: libcman2 or libdlm2
16:45 < fabbione> those need to become libfoo3
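As an illustration of the renaming mentioned above, here is a small sketch of how a Debian-style run-time library package name can be derived from the library name and the soname major version. The helper name debian_library_package is hypothetical; the hyphen rule for library names that already end in a digit reflects the usual Debian naming convention as I understand it and does not come from this discussion.

```python
def debian_library_package(library: str, soversion: int) -> str:
    """Hypothetical helper: build a Debian-style package name such as 'libcman3'."""
    name = library.lower()
    # Assumed convention: insert a hyphen when the library name itself ends
    # in a digit, so the version stays readable (e.g. 'libfoo2-3').
    sep = "-" if name[-1].isdigit() else ""
    return f"{name}{sep}{soversion}"


for lib in ("libcman", "libdlm"):
    old = debian_library_package(lib, 2)
    new = debian_library_package(lib, 3)
    print(f"{old} -> {new}")
```

Because the package name changes with the soname, every package that depends on the old name has to be rebuilt against the new one, which is the administrative cost described in the discussion.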
16:46 < fabbione> there is administration time to get those changes in the distribution
16:46 < fabbione> this is basically it, but both fedora and ubuntu are about to freeze hard
16:46 < fabbione> debian should release in sept, so they won't get those changes (they are running stable2 atm)
16:47 < fabbione> what happens for the future is that it is easier to update a package that does not include a soname change
16:47 < fabbione> so bug fixes and such, because we have a solid base already
16:47 < riley_dt> when the abi remains the same i dont see the point of changing the so number
16:48 < fabbione> riley_dt: the problem is that so far we didn't really do a good job in tracking ABI changes. you have the same issue (if not worse) with lcrso
16:48 < fabbione> so this is going to mark a "clean" start
16:48 < fabbione> and improve this process from now on
16:49 < fabbione> we always assumed: "oh just rebuild.. it will work" and this is not the kind of method distributions can accept for future updates
16:49 < fabbione> now we are spreading our software a lot more
16:49 < fabbione> more projects depend on us
16:49 < fabbione> we need to be more disciplined about it
16:50 < fabbione> ok
16:50 < fabbione> dct__: does this answer your question?
16:50 < dct__> yep, thanks
16:50 < fabbione> ok perfect
16:50 < fabbione> anything more?
16:51 < kanderso> fabbione: are you going to post the log to cluster-devel mailing list?
16:51 < fabbione> kanderso: i was thinking about adding it to the wiki, but i can do both
16:51 < kanderso> cool
16:51 < fabbione> ok
16:51 < fabbione> perfect
16:51 < fabbione> guys, i'd like to thank everybody for experimenting with me
16:52 < fabbione> i hope it wasn't too boring
16:52 < richterd> fabbione: thank you, i thought this was quite useful.  if there are more i'll be there :)
16:52 < fabbione> and please send me any comments (negative/positive/neutral)
16:52 < fabbione> richterd: you are welcome
16:52 < fabbione> richterd: hopefully there will be more.. just need to find the right frequency
16:53 < fabbione> thanks everybody
16:53  * richterd nods.
16:53 < fabbione> have a nice evening/rest of the day
16:54 -!- richterd is now known as richterd_AFK
17:11 -!- ssato [n=ssato@NE1051lan9.rev.em-net.ne.jp] has quit ["Leaving."]
17:15 -!- StuartMI [n=stuartmi@208.114.97.108] has joined #linux-cluster
17:17 -!- aenaus [n=rgfergfa@91.140.101.252] has joined #linux-cluster
17:19 < lon> fabbione: fwiw, my todo list for rgmanager in master is:
17:19 < lon> (1) Kill vf.
17:19 < lon> (2) Kill non-s-lang-processing
17:19 < lon> and that's about it
17:19 < lon> (apart from bugs)