Hi guys
It's been a long time coming, but I finally published the non-blocking state transfer draft on the wiki: https://community.jboss.org/wiki/Non-blockingStateTransfer

Unlike my previous state transfer design document, I think I've fleshed out most of the implications. Still, there are some things I don't have a clear solution for yet. As you would expect, it's mostly around merging and delayed state transfer.

I'm looking forward to hearing your comments/advice!

Cheers
Dan

PS: Let's discuss this over the mailing list only.
Hey Dan,
Thanks very much for writing this up! Some questions:

1. What's "steady state"? Running status after state transfer has happened?

2. In L1, I don't understand what you mean by "we do need to add the old no-longer-owners as requestors for the transferred keys". Do we really need L1OnRehash? What's the use case/motivation for this option?

3. Recovery cache: just wondering, shouldn't we configure such a cache with a cache store by default? Would that help?

4. What's the data container snapshot exactly? What kind of impact would it have on memory consumption?

One last comment: there's a lot of detail in that wiki which is hard to fully understand. I think it would really help to build some diagrams explaining parts of the process, to help the community better understand your design or alternative solutions. WDYT?

Cheers,

On Mar 8, 2012, at 10:55 AM, Dan Berindei wrote:
> Hi guys
>
> It's been a long time coming, but I finally published the non-blocking
> state transfer draft on the wiki:
> https://community.jboss.org/wiki/Non-blockingStateTransfer
> [...]

--
Galder Zamarreño
Sr. Software Engineer
Infinispan, JBoss Cache
Hi Galder

First of all, thanks for reading!

On Mar 9, 2012 12:54 PM, "Galder Zamarreño" <[hidden email]> wrote:

> 1. What's "steady state"? Running status after state transfer has
> happened?

Exactly, it's when there is no state transfer in progress.

> 2. In L1, I don't understand what you mean by "we do need to add the old
> no-longer-owners as requestors for the transferred keys". Do we really
> need L1OnRehash? What's the use case/motivation for this option?

If we have a put on a new owner, it must know to invalidate the key both on the nodes that read it after the CH change and on the ones that read it before the CH change (from an old owner). It will become critical once we start staggering clustered get commands.

The L1OnRehash requirement is a bit different, because the old owner wasn't on any requestor list before the ownership change. But the solution is the same.

> 3. Recovery cache: just wondering, shouldn't we configure such a cache
> with a cache store by default? Would that help?

I think a cache store would slow things down too much, but it would solve this.

> 4. What's the data container snapshot exactly? What kind of impact would
> it have on memory consumption?

It's not really a snapshot; I just called it that to convey the idea that the (Bounded)ConcurrentHashMap iterator doesn't pick up all the changes made after the start of the iteration.

> One last comment: there's a lot of detail in that wiki which is hard to
> fully understand. I think it would really help to build some diagrams
> explaining parts of the process, to help the community better understand
> your design or alternative solutions. WDYT?

Fair point. I focused more on keeping track of the required code changes to make sure I don't miss anything, but that ended up being even denser, so I removed it from the wiki. Any particular diagram requests, to help prioritize things?

Cheers
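For context on that last point: ConcurrentHashMap iterators are only weakly consistent, so they never throw ConcurrentModificationException but may miss entries added after the iterator was created. A minimal stand-alone illustration (plain JDK code, not the actual Infinispan data container):

    import java.util.Iterator;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class SnapshotDemo {
        public static void main(String[] args) {
            Map<String, String> container = new ConcurrentHashMap<>();
            container.put("k1", "v1");

            // The iterator reflects the map roughly as of its creation.
            Iterator<Map.Entry<String, String>> it = container.entrySet().iterator();

            // A concurrent write: the iterator may or may not observe it,
            // but it never throws ConcurrentModificationException.
            container.put("k2", "v2");

            while (it.hasNext()) {
                System.out.println(it.next());
            }
        }
    }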
In reply to this post by Dan Berindei
Wow !
Does this need to be so complex ? I've spent an hour trying to understand it, and am still overwhelmed... :-)

My understanding (based on my changes in 4.2) is that state transfer moves/deletes keys based on the diff between 2 subsequent views:
- Each node checks all of the affected keys
- If a key should be stored in additional nodes, the key is pushed there
- If a key shouldn't be stored locally anymore, it is removed

IMO, there's no need to handle a merge differently from a regular view. We might end up with inconsistent state, but that's unavoidable until we have eventual consistency. Fine...

Also, why do we need to transfer ownership information ? Can't ownership be calculated purely from local information ?

I'm afraid that the complexity will increase the state space (making it hard to test all possible state transitions), lead to unnecessary messages being sent and, most importantly, might lead to blocking.

The section on locking outright scares me :-) Perhaps reducing the level of detail here - as Galder suggested - might help readers understand the basic design.

Sorry for being a bit negative, but I think state transfer is one of the most critical and important pieces of code in DIST mode, and it needs to work against large clusters (say a couple of hundred nodes) with nodes joining, leaving or crashing all the time...

I'm going to re-read the design again; maybe what I said above is just BS... :-)

On 3/8/12 11:55 AM, Dan Berindei wrote:
> Hi guys
>
> It's been a long time coming, but I finally published the non-blocking
> state transfer draft on the wiki:
> https://community.jboss.org/wiki/Non-blockingStateTransfer
> [...]

--
Bela Ban, JGroups lead (http://www.jgroups.org)
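To make the three bullets above concrete, here is a rough sketch of that diff-based rebalancing; every name in it (ConsistentHash, Address, owners, push) is a hypothetical stand-in rather than the actual 4.2 API:

    import java.util.Iterator;
    import java.util.List;
    import java.util.Map;

    // Hypothetical stand-ins for the real cluster abstractions.
    interface Address {}
    interface ConsistentHash { List<Address> owners(Object key); }

    class DiffRebalance {
        void rebalance(ConsistentHash oldCH, ConsistentHash newCH,
                       Address self, Map<Object, Object> localData) {
            Iterator<Map.Entry<Object, Object>> it = localData.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<Object, Object> e = it.next();
                List<Address> oldOwners = oldCH.owners(e.getKey());
                List<Address> newOwners = newCH.owners(e.getKey());
                // Push the key to every node that became an owner in the new view.
                if (oldOwners.contains(self)) {
                    for (Address target : newOwners) {
                        if (!oldOwners.contains(target)) {
                            push(target, e.getKey(), e.getValue());
                        }
                    }
                }
                // Drop the key locally if this node is no longer an owner.
                if (!newOwners.contains(self)) {
                    it.remove();
                }
            }
        }

        void push(Address target, Object key, Object value) {
            // An RPC carrying the key/value to the new owner would go here.
        }
    }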
I agree with Bela: this looks scary, and I can't imagine how tricky it would be to implement correctly. Could you split the problem?

Also, what happened to the super-simple ideas I'd shared in London? Is it the same? I'm assuming it's very different... I might have overlooked some important aspect, but I'd like to know why that approach was discarded. Was it looking too simple ? :P

I'm sorry I'm skimming through it; I'll have more time next week, but as Galder said as well, I'll need to draw this to understand it better. Some first-impact notes:

# Ownership information
Deciding when to start a state transfer is a different problem - move it to another page or drop it? Deciding when to have a node join - same as above?

# Cache entries
## "We need a snapshot of the iterator" <- Can we avoid it? We just start refusing to serve write commands by checking any incoming command. It's an additional interceptor which checks whether the incoming command is "appropriate" to be handled by us or needs to get back some kind of rejection code.

## Need for tombstones <- I think we can avoid that, actually. We'll need them for MVCC replace/delete operations, but for state transfer they're not needed if we decide that a write operation has to send the value to the new owners only and send an "authoritative invalidation" to all previous owners.

# Lock information
This is trivial if you stop thinking of locks as being special. A lock is a marker, and a marker is a value stored in the grid. These values are transferred like any other value, with one single differentiator: since there is always only one, they are manipulated via CAS operations and are guaranteed to be consistent without needing to be locked when changed.

# L1
Let's keep it simple initially and just flush them out as decided.

## The cleanup you mention: is that not a current bug, orthogonal to this design page? (Trying to identify more things to move out.)

# Handling merges
Could we simplify this by saying that the views are not actually a linked list but a tree?

In this document we're not attempting to solve consistent merging after split brain, right? So we only need to know how to move the state to the rightful new owner. For conflicts, let's assume there is a "ConflictResolver object" which we'll describe/implement somewhere else.

# State transfer disabled
We should think about the cases in which it makes sense to enable this option. In those cases, would people still be interested in L1 consistency and transactions? If not, this is not a problem to solve.

After getting to the end: it's not a bad document at all, but I still think it looks too scary :D

Cheers,
Sanne

On 9 March 2012 14:19, Bela Ban <[hidden email]> wrote:
> Wow !
>
> Does this need to be so complex ? I've spent an hour trying to
> understand it, and am still overwhelmed... :-)
> [...]
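Sanne's lock-as-marker idea maps naturally onto compare-and-set primitives. A toy sketch, assuming a dedicated marker map and String transaction ids (both illustrative assumptions, not how Infinispan actually stores locks):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class CasLockMarkers {
        // One marker per locked key; absence of a marker means "unlocked".
        private final ConcurrentMap<Object, String> markers = new ConcurrentHashMap<>();

        // Succeeds only for the single winner of the compare-and-set.
        public boolean tryLock(Object key, String txId) {
            return markers.putIfAbsent(key, txId) == null;
        }

        // Conditional removal: only the owning transaction may release.
        public boolean unlock(Object key, String txId) {
            return markers.remove(key, txId);
        }
    }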
In reply to this post by Bela Ban
On Mar 9, 2012 4:19 PM, "Bela Ban" <[hidden email]> wrote:

Sorry about that Bela, it is quite complex indeed.

> My understanding (based on my changes in 4.2) is that state transfer
> moves/deletes keys based on the diff between 2 subsequent views [...]

That's fine if we block all writes during state transfer, but once we start allowing writes during state transfer we need to either log all changes and send them to the new owners at the end (the approach in 4.2 without your changes) or redirect all commands to the new owners.

In addition to that, we have to either block all commands on the new owners until they receive the entire state, or forward get commands to the old owners as well. The same two options apply to lock commands.

> IMO, there's no need to handle a merge differently from a regular view [...]

I'm not trying to make merges more complicated on purpose :) I think we need to try our best to prevent data loss, even if there is a chance of inconsistency. We still see clusters in the test suite form via merges from time to time, so we can't just say that after a merge all bets are off.

The problem is that I chose to forward get commands to the old owners AND to remove the cache view rollback (which was blocking in our Lisbon design). This means we must keep a chain of cache views for which we haven't finished state transfer; with merges that chain turns into a tree, and it has to be broadcast by the coordinator to all the nodes.

> Also, why do we need to transfer ownership information ? Can't ownership
> be calculated purely from local information ?

The current ownership information can be calculated based solely on the members list. But the ownership in the previous cache view(s) cannot be computed by joiners based only on their own information, so it has to be broadcast by the coordinator.

> I'm afraid that the complexity will increase the state space [...]

I agree the increased complexity is a concern, but I'm not willing to give up on non-blocking state transfer just yet... One particularly nasty problem with the existing, blocking state transfer is that before iterating the data container we need to wait for all the pending commands to finish. So if we have high contention and a 60-second lock acquisition timeout, state transfer is almost guaranteed to take > 60 seconds.

> The section on locking outright scares me :-) Perhaps reducing the level
> of detail here [...]

I got burned pretty hard with my asymmetric clusters design, because the implementation turned out a lot more complex than the design, so this time I tried to investigate all the interactions between the different choices we're making.

> Sorry for being a bit negative, but I think state transfer is one of the
> most critical and important pieces of code in DIST mode [...]

I'd argue that the blocking state transfer we have doesn't satisfy this requirement...

> I'm going to re-read the design again [...]

Please do re-read it; I'll try to simplify it a bit by Monday based on your feedback.
In reply to this post by Sanne Grinovero-3
On Fri, Mar 9, 2012 at 6:03 PM, Sanne Grinovero <[hidden email]> wrote:
> I agree with Bela: this looks scary, and I can't imagine how tricky it
> would be to implement correctly. Could you split the problem?

I thought I just did that with the different sections :) All the state transfer components are bound together by the approach we choose for updating the CH, and I think most of the complications stem from the way we have to maintain ownership information - not just for the current cache view, but for the previous cache views as well.

> Also, what happened to the super-simple ideas I'd shared in London?
> Is it the same? I'm assuming it's very different... I might have
> overlooked some important aspect, but I'd like to know why that approach
> was discarded. Was it looking too simple ? :P

I thought it was the same thing, but in London we only had a very high-level discussion. We didn't discuss how a joiner gets the chain of CHs so that it knows which are the old owners, and we only briefly touched on how to discard old cache views. Please let me know where you think I've diverged from our discussions :)

> # Ownership information
> Deciding when to start a state transfer is a different problem - move it
> to another page or drop it? Deciding when to have a node join - same as
> above?

We do need to decide whether we want to allow CH updates without transferring the actual data or not - it has an impact on how big the "transaction log" on the new owners can get.

We also need to decide whether we allow state transfer to finish successfully even though one node has already left the cache view. I've chosen to "interrupt" the state transfer on a leave in order to avoid weird situations where DM.getLocation(key) returns an empty set, but that complicates things a bit, so we need to make it explicit here. E.g. after an error we can't just retry the same cache view, we have to account for leavers as well, so we can't say that a merge can only be the last view in the chain.

> ## "We need a snapshot of the iterator" <- Can we avoid it? We just
> start refusing to serve write commands by checking any incoming command.
> [...]

We have to iterate over the data container in order to push data, and that iterator behaves as a snapshot. So there's nothing to avoid here. I did mention that we start refusing write commands - the iteration of the data container is the reason why we need the cache view id check.

> ## Need for tombstones <- I think we can avoid that, actually. [...]

I'm not sure how this "authoritative invalidation" would work... any change to the data container may or may not be reflected in the iteration done by the state transfer task, and by the time we send the invalidation command the entry could in fact be in a JGroups message on its way to the new owner.

> # Lock information
> This is trivial if you stop thinking of locks as being special. A lock
> is a marker, and a marker is a value stored in the grid. [...]

Actually no, we only keep the lock on the primary owner, so we can't rely on the lock information being there in case of a leave. So we need to transfer transaction information instead - and the transaction information only gives a "possibly locked" status for a key.

Even if we do treat the transaction information as normal state, that doesn't mean we don't have to handle it - in fact I wrote down this idea and marked it [LATER] because I thought it would be too complicated to duplicate the data infrastructure for transaction information.

> # L1
> Let's keep it simple initially and just flush them out as decided.

Agreed, it's simpler to flush everything out - but that doesn't mean we don't have to change anything: we still need to add the old owners as requestors if L1OnRehash is enabled.

> ## The cleanup you mention: is that not a current bug, orthogonal to
> this design page? (Trying to identify more things to move out.)

I could move out the fact that it's a current bug, but it's still something we need to do during state transfer, and we have to decide what locks/latches we need to hold while doing it in order to ensure consistency.

> # Handling merges
> Could we simplify this by saying that the views are not actually a
> linked list but a tree?

As I wrote the document I wasn't quite sure that the tree approach would work, but I'm more and more convinced that it would be fine. I agree that it sounds a little more complicated, because I wrote it at first considering a list of cache views and I only started considering the generic tree approach when I wrote the merge section. But I couldn't change it as I wasn't 100% sold on the tree approach (another idea I had was to ignore any view change after a merge, so the cache view tree could only have more than one branch at the root level), so I tried to leave the option open, hoping for better suggestions from you guys.

> In this document we're not attempting to solve consistent merging after
> split brain, right? So we only need to know how to move the state to the
> rightful new owner. [...]

I don't want to solve consistent merging here, but I do want to make sure that it remains possible. For instance, if after a failed merge the information about the old partitions is lost, no "ConflictResolver object" could make that state consistent, and state will just be lost.

The long discussion about which view is newer is there because without merges we want to ensure consistency and apply received entries in their logical order (which is the ascending order of their cache view ids) - therefore we need to buffer data received for "intermediate" nodes in the cache view chain. Merges complicate this because the cache view tree has multiple leaves and we have no way of ordering them (unlike JGroups, we identify a cache view only by its id, so there's no way to compare cache views and decide one is a descendant of another) - hence the need for the intermediate flag.

> # State transfer disabled
> We should think about the cases in which it makes sense to enable this
> option. In those cases, would people still be interested in L1
> consistency and transactions? If not, this is not a problem to solve.

The main use case I have in mind is when the user doesn't care about missing data (we're not the authoritative source) but does care about staleness. Most users I've talked with (not many, I grant you that) are willing to accept stale data, but only on very short timescales. L1, with its 10-minute default lifespan, doesn't qualify - and making the L1 lifespan very short would make it useless.

> After getting to the end: it's not a bad document at all, but I still
> think it looks too scary :D

Agree with both ;-)

Cheers
Dan
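The cache view id check Dan mentions above could look roughly like the sketch below. StateTransferInProgressException is the exception named elsewhere in this thread; the interceptor shape and the ViewIdAware interface are simplified assumptions, not the real Infinispan classes:

    import java.util.function.Supplier;

    // Simplified stand-ins; not the real Infinispan interceptor API.
    interface ViewIdAware { int viewId(); }

    class StateTransferInProgressException extends RuntimeException {
        StateTransferInProgressException(String msg) { super(msg); }
    }

    class CacheViewCheckInterceptor {
        private volatile int installedViewId;

        Object visit(ViewIdAware command, Supplier<Object> nextInChain) {
            // Reject commands started in an older cache view; the
            // originator is expected to retry against the new owners.
            if (command.viewId() < installedViewId) {
                throw new StateTransferInProgressException(
                    "Command view " + command.viewId()
                    + " is older than installed view " + installedViewId);
            }
            return nextInChain.get();
        }

        void installView(int newViewId) {
            installedViewId = newViewId;
        }
    }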
In reply to this post by Dan Berindei
Sorry for the delay, been busy ...
Comments inline.

On 3/9/12 7:20 PM, Dan Berindei wrote:
> That's fine if we block all writes during state transfer, but once we
> start allowing writes during state transfer we need to either log all
> changes and send them to the new owners at the end (the approach in 4.2
> without your changes) or redirect all commands to the new owners.

OK

> In addition to that, we have to either block all commands on the new
> owners until they receive the entire state, or forward get commands to
> the old owners as well. The same two options apply to lock commands.

Why do we have to lock at all if we use queueing of requests during a state transfer ?

> I'm not trying to make merges more complicated on purpose :)

I didn't imply that; but I thought the London design was pretty simple, and I'm trying to figure out why we have such a big (maybe perceived) diff between London and what's in the document.

> The current ownership information can be calculated based solely on the
> members list. But the ownership in the previous cache view(s) cannot be
> computed by joiners based only on their own information, so it has to be
> broadcast by the coordinator.

OK

> One particularly nasty problem with the existing, blocking state
> transfer is that before iterating the data container we need to wait for
> all the pending commands to finish.

Can't we queue the state transfer requests *and* the regular requests ? When ST is done, we apply the ST requests first, then the queued regular requests.

--
Bela Ban, JGroups lead (http://www.jgroups.org)
On Thu, Mar 15, 2012 at 10:09 AM, Bela Ban <[hidden email]> wrote:
> Sorry for the delay, been busy ...

No worries Bela, I'm late with my rewrite of the NBST document as well...

> Why do we have to lock at all if we use queueing of requests during a
> state transfer ?

I'm not sure what you mean by queueing of requests...

The current approach:
* If the cache is sync, the old owners will throw a StateTransferInProgressException while a state transfer is running, and the originator will then wait for the state transfer to end before sending the command again to the new owners.
* If the cache is async, we block write commands on the target and unblock them after the state transfer has finished. I don't think we do anything to make sure all the new owners get all the commands that were targeted at the old owners, so we can lose consistency between owners.

The new design will always reject commands with an old cache view id, on any owner. If the cache view id is ok but we don't have the new CH or the lock information yet, the command will block on the target.
* If the cache is sync, we can use the same approach as before and tell the originator to retry the command on the new owners.
* If the cache is async, instead of doing nothing like before, I was thinking that the target itself could forward the commands (especially write commands) to the new owners. We didn't discuss this in London; I added it afterwards as I realized that without forwarding we'll be losing data. I didn't realize at the time, however, that the current design has the same problem.

I think I see a problem with my request forwarding plan: the requests will have a different source, so JGroups will not ensure an ordering between them and the originator's subsequent requests. This means we'll have inconsistencies anyway, so perhaps it would be better if we stuck to the current design's limitations and removed the requirement for old targets to forward commands to new ones.

> I didn't imply that; but I thought the London design was pretty simple,
> and I'm trying to figure out why we have such a big (maybe perceived)
> diff between London and what's in the document.

True, there are many complications that were not apparent during our London discussion.

> Can't we queue the state transfer requests *and* the regular requests ?
> When ST is done, we apply the ST requests first, then the queued regular
> requests.

That was basically what we did in the blocking design: the ST commands could execute during ST, but regular commands would block until the end of the ST. With async caches, that meant we would use JGroups' one queue per sender (so not a global queue, but close).

The problem was not with the regular commands that arrived after the start of the ST, but with the commands that had already started executing when ST started. This is the classic example:
1. A prepare command for Tx1 locks k1 on node A
2. A prepare command for Tx2 tries to acquire the lock on k1 on node A
3. State transfer starts up and blocks all write commands
4. The Tx1 commit command, which would unlock k1, arrives but can't run until state transfer has ended
5. The Tx2 prepare command times out on the lock acquisition after 10 seconds (by default)
6. State transfer can now proceed and push or receive data
7. The Tx1 commit can now run and unlock k1. It's too late for Tx2, however.

The solution I had in mind for the old design was to add some kind of deadlock detection to the LockManager and throw a StateTransferInProgressException when a deadlock with the state transfer is detected.

With the new design I thought it would be simpler to not acquire one big lock for the entire duration of the write command that would prevent state transfer. Instead I would acquire different locks for much shorter amounts of time, and at the beginning of each lock acquisition we would just check that the command's view id is still the correct one. I guess I may have been wrong on the "simpler" assertion :)
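A sketch of the sync-cache flow Dan describes: the old owners reject stale commands and the originator retries against the owners of the newly installed view. All method names here are illustrative, not the real Infinispan API:

    import java.util.List;

    abstract class RetryingInvoker {
        interface Address {}
        interface WriteCommand { Object getKey(); void setViewId(int viewId); }
        static class StateTransferInProgressException extends RuntimeException {}

        abstract int currentViewId();
        abstract List<Address> ownersFor(Object key, int viewId);
        abstract Object invokeRemotely(List<Address> targets, WriteCommand cmd);
        abstract void waitForViewNewerThan(int viewId) throws InterruptedException;

        Object invokeWithRetry(WriteCommand cmd) throws InterruptedException {
            while (true) {
                int viewId = currentViewId();
                cmd.setViewId(viewId); // ship the view id with the command
                try {
                    return invokeRemotely(ownersFor(cmd.getKey(), viewId), cmd);
                } catch (StateTransferInProgressException e) {
                    // Rejected by an old owner: wait for the next cache
                    // view to be installed, then retry on its owners.
                    waitForViewNewerThan(viewId);
                }
            }
        }
    }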
Hi Dan,
This statement makes me nervous. Would we still have inconsistencies in a transactional context as well?

"We didn't discuss this in London; I added it afterwards as I realized that without forwarding we'll be losing data. I didn't realize at the time, however, that the current design has the same problem.

I think I see a problem with my request forwarding plan: the requests will have a different source, so JGroups will not ensure an ordering between them and the originator's subsequent requests. This means we'll have inconsistencies anyway, so perhaps it would be better if we stuck to the current design's limitations and removed the requirement for old targets to forward commands to new ones."

Regards,
Erik
Sorry for making you nervous Erik, but if I remember correctly you're using synchronous caches, so these paragraphs should not apply to your use case.

In asynchronous caches there is always the possibility of inconsistency, because different owners can receive the transactions in a different order (and the transactions are always 1PC). So this is really nothing new; I just thought I could close one window of inconsistency during view changes, and I realized that it's not that easy.

In sync caches with async commit, the commit RPC is synchronous, but it runs on a background thread, so we are able to retry the command on the new owners.

There is one scenario open to inconsistencies for sync caches: when the originator dies before retrying the command on the new nodes. Enabling recovery will fix this, but I think forwarding from the old owners could be a more lightweight solution (at the cost of extra complexity, of course).

Cheers
Dan

On Thu, Mar 15, 2012 at 3:16 PM, Erik Salter <[hidden email]> wrote:
> Hi Dan,
>
> This statement makes me nervous. Would we still have inconsistencies in
> a transactional context as well?
> [...]
In reply to this post by Dan Berindei
On 3/15/12 11:29 AM, Dan Berindei wrote:
> That was basically what we did in the blocking design: the ST commands
> could execute during ST, but regular commands would block until the end
> of the ST. [...]
>
> The solution I had in mind for the old design was to add some kind of
> deadlock detection to the LockManager and throw a
> StateTransferInProgressException when a deadlock with the state transfer
> is detected.

OK. I don't like the old design, as ST has to wait until all pending TXs (those with locks held) have committed before we can make progress. If the lock acquisition timeout is high, we'll have to wait for a long time.

> With the new design I thought it would be simpler to not acquire one big
> lock for the entire duration of the write command that would prevent
> state transfer. Instead I would acquire different locks for much shorter
> amounts of time, and at the beginning of each lock acquisition we would
> just check that the command's view id is still the correct one.

OK. Perhaps an overview of the new design in the document is warranted. There's a section on the transfer of CacheEntries and one on locks, but I didn't see a combined discussion. Perhaps an example like the one above would be good ?

I now realize how much simpler the use of total order is here: since all updates in a cluster happen in total order, we don't need to acquire locks in one phase and release them in another. ST is then just another update, inserted at a certain place in the stream of updates.

I assume the Cloud-TM guys don't do state transfer in their prototype, or do they ? Pedro ? If not, then there needs to be an implementation of ST for TO.

Cheers,

--
Bela Ban, JGroups lead (http://www.jgroups.org)
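Bela's "ST is just another update in the stream" remark can be pictured with a toy single-consumer loop; this is purely conceptual, not JGroups SEQUENCER/TOA code:

    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;

    class TotalOrderApplier implements Runnable {
        interface Op { void apply(); }

        // deliver() is assumed to be called in the same total order on
        // every node, e.g. by a sequencer-based transport.
        private final BlockingQueue<Op> stream = new LinkedBlockingQueue<>();

        void deliver(Op op) { stream.add(op); }

        @Override
        public void run() {
            try {
                while (true) {
                    // All nodes apply identical ops in an identical order,
                    // so a "transfer state now" op marks the same cut of
                    // the update stream everywhere.
                    stream.take().apply();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }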
In reply to this post by Dan Berindei
Hi Dan,

Some notes:

Ownership information:
- I remember the discussion with Sanne about an algorithm that wouldn't require view prepare/rollback, but I think it would be very interesting to see it described in detail somewhere, as all the points you raised in the document are very closely tied to it.
- "However, it's not that clear what to do when a node leaves while the cache is in 'steady state'" --> if (numOwners-1) nodes crash, a state transfer is (most likely) wanted in order not to lose consistency in the eventuality that another node goes down. By default numOwners is 2, so in the first release can't we just assume that for every leave we'd want to immediately issue a state transfer? I think this would cover most of our users' needs and simplify the problem considerably.

Recovery:
- I agree with your point re: recovery. This shouldn't be considered high prio in the first release. The recovery info is kept in an independent cache, which allows a lot of flexibility: e.g. it can point to a shared cache store so that recovery caches on other nodes can read that info when needed.

Locking/sync:
- "The state transfer task will acquire the DCL in exclusive mode after updating the cache view id and release it after obtaining the data container snapshot. This will ensure the state transfer task can't proceed on the old owner until all write commands that knew about the old CH have finished executing." <-- does that mean that incoming writes would block for the duration of iterating the DataContainer? That shouldn't be too bad, unless a cache store is present...
- "CacheViewInterceptor [..] also needs to block in the interval between a node noticing a leave and actually receiving the new cache view from the coordinator" <-- why can't the local cache start computing the new CH independently and not wait for the coordinator?

Cheers,
Mircea

On 8 Mar 2012, at 10:55, Dan Berindei wrote:
> [...]
In reply to this post by Bela Ban
Hi guys
I've updated the document at https://community.jboss.org/wiki/Non-blockingStateTransfer

This time I've added an overview and changed "locking requirements" to "command execution during state transfer". I also assumed that we are going to use the tree of cache views. More comments inline.

On Fri, Mar 16, 2012 at 9:16 AM, Bela Ban <[hidden email]> wrote:
> OK. Perhaps an overview of the new design in the document is warranted.
> There's a section on the transfer of CacheEntries and one on locks, but
> I didn't see a combined discussion. Perhaps an example like the one
> above would be good ?

I hope I've improved on this in the new version.

> I now realize how much simpler the use of total order is here: since all
> updates in a cluster happen in total order, we don't need to acquire
> locks in one phase and release them in another. ST is then just another
> update, inserted at a certain place in the stream of updates.

Unfortunately I think that would mean a blocking state transfer, because all the other updates would have to wait for the state to be transferred.

> I assume the Cloud-TM guys don't do state transfer in their prototype,
> or do they ? Pedro ? If not, then there needs to be an implementation of
> ST for TO.

I'm curious about this as well.

Cheers
Dan
Thanks Dan,
here are some comments / questions:

"2. Each cache member receives the ownership information from the coordinator and starts rejecting all commands started in the old cache view"

How do you know a command was started in the old cache view; does this mean you're shipping a cache view ID with every request ?

"2.1. Commands with the new cache view id will be blocked until we have installed the new CH and we have received lock information from the previous owners"

Doesn't this make this design *blocking* again ? Or do you queue requests with the new view ID, return immediately and apply them when the new view ID is installed ? If the latter is the case, what do you return ? An OK (how do you know the request will apply OK) ?

"A merge can be coalesced with multiple state transfers, one running in each partition. So in the general case a coalesced state transfer contains a tree of cache view changes."

Hmm, this can make a state transfer message quite large. Are we trimming the modification list ? E.g. if we have 10 PUTs on K, 1 removal of K, and another 4 PUTs, do we just send the *last* PUT, or do we send a modification list of 15 ?

"Get commands can write to the data container as well when L1 is enabled, so we can't block just write commands."

Another downside of the L1 cache being part of the regular cache. IMO it would be much better to separate the 2, as I wrote in previous emails yesterday.

"The new owners will start receiving commands before they have received all the state
* In order to handle those commands, the new owners will have to get the values from the old owners
* We know that the owners in the later views have a newer version of the entry (if they have it at all). So we need to go back up the cache view tree and ask all the nodes on one level at the same time - if we don't get any certain answer, we go to the next level and repeat."

How does a member C know that it *will* receive any state at all ? E.g. if we had key K on A and B, and now B crashed, then A would push a copy of K to C. So when C receives a request R from another member E, but hasn't received a state transfer from A yet, how does it know whether to apply or queue R ? Does C wait until it gets an END-OF-ST message from the coordinator ?

(Skipping the rest due to exhaustion by complexity :-))

Over the last couple of days, I've exchanged a couple of emails with the Cloud-TM guys, and I'm more and more convinced that their total order solution is the simpler approach to (1) transactional updates and (2) state transfer. They don't have a solution for (2) yet, but I believe it can be done as another totally ordered transaction, applied at the correct location within the update stream. Or we could possibly use a flush: as we don't need to wait for pending TXs to complete and release their locks, this should be quite fast.

So my 5 cents:

#1 We should focus on the total order approach, and get rid of the 2PC and locking business for transactional updates
#2 Really focus on the eventual consistency approach

Thoughts ?

--
Bela Ban, JGroups lead (http://www.jgroups.org)
In reply to this post by Mircea Markus
On Tue, Mar 20, 2012 at 2:51 PM, Mircea Markus <[hidden email]> wrote:
> Hi Dan,
>
> Some notes:
>
> Ownership information:
> - I remember the discussion with Sanne about an algorithm that wouldn't
> require view prepare/rollback, but I think it would be very interesting
> to see it described in detail somewhere [...]

Yes, actually I'm trying to describe precisely that. Of course, there are some complications...

> - "However, it's not that clear what to do when a node leaves while the
> cache is in 'steady state'" --> if (numOwners-1) nodes crash, a state
> transfer is (most likely) wanted in order not to lose consistency in the
> eventuality that another node goes down. By default numOwners is 2, so
> in the first release can't we just assume that for every leave we'd want
> to immediately issue a state transfer? [...]

Not necessarily: if the user has a TopologyAwareConsistentHash and has split his cluster in two "sites" or "racks", he can bring down an entire site/rack before rebalancing, without the risk of losing any data.

> Recovery:
> - I agree with your point re: recovery. This shouldn't be considered
> high prio in the first release. The recovery info is kept in an
> independent cache, which allows a lot of flexibility: e.g. it can point
> to a shared cache store so that recovery caches on other nodes can read
> that info when needed.

I don't agree about the shared cache store, because then every cache transaction would have to write to that store. I think that would be cost-prohibitive.

> Locking/sync:
> - "The state transfer task will acquire the DCL in exclusive mode after
> updating the cache view id and release it after obtaining the data
> container snapshot. [...]" <-- does that mean that incoming writes would
> block for the duration of iterating the DataContainer? That shouldn't be
> too bad, unless a cache store is present...

OK, first of all, in the new draft I merged the data container lock and the cache view lock; I don't think making them separate actually helped with anything, and it made things harder to explain.

The idea was that incoming writes would only block while state transfer gets the iterator to the data container, which is very quick. But the state transfer will also block on all the in-progress writes to the data container - to make sure that we won't miss any writes. The algorithm is like this:

Write command
===
1. acquire read lock
2. check cache view id
3. write entry to data container
4. release read lock

State transfer
===
1. acquire write lock
2. update CH
3. update cache view id
4. release write lock

> - "CacheViewInterceptor [..] also needs to block in the interval between
> a node noticing a leave and actually receiving the new cache view from
> the coordinator" <-- why can't the local cache start computing the new
> CH independently and not wait for the coordinator?

Because the local cache doesn't listen for join and leave requests, only the coordinator does, and the cache view is computed based on the requests that the coordinator has seen (and not on the JGroups cluster joins and leaves).

Cheers
Dan
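A minimal sketch of that shared/exclusive protocol, assuming the merged lock is a single ReentrantReadWriteLock (the class shape and names are illustrative, not actual Infinispan code):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    class DataContainerLock {
        private final ReentrantReadWriteLock dcl = new ReentrantReadWriteLock();
        private final ConcurrentMap<Object, Object> container = new ConcurrentHashMap<>();
        private volatile int viewId;

        void write(int commandViewId, Object key, Object value) {
            dcl.readLock().lock();                  // 1. acquire read lock
            try {
                if (commandViewId < viewId) {       // 2. check cache view id
                    throw new IllegalStateException("stale view, retry on new owners");
                }
                container.put(key, value);          // 3. write entry to container
            } finally {
                dcl.readLock().unlock();            // 4. release read lock
            }
        }

        Iterable<Map.Entry<Object, Object>> startStateTransfer(int newViewId) {
            dcl.writeLock().lock();                 // 1. acquire write lock
            try {
                viewId = newViewId;                 // 2-3. install new CH/view id
                return container.entrySet();        // weakly consistent "snapshot"
            } finally {
                dcl.writeLock().unlock();           // 4. release write lock
            }
        }
    }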
In reply to this post by Bela Ban
On Thu, Mar 22, 2012 at 11:26 AM, Bela Ban <[hidden email]> wrote:
> "2. Each cache member receives the ownership information from the
> coordinator and starts rejecting all commands started in the old cache
> view"
>
> How do you know a command was started in the old cache view; does this
> mean you're shipping a cache view ID with every request ?

Yes, the plan is to ship the cache view id with every command. I mentioned in the "Command Execution During State Transfer" section that we set the view id in the command; I should make it clear that we also ship it to remote nodes.

> "2.1. Commands with the new cache view id will be blocked until we have
> installed the new CH and we have received lock information from the
> previous owners"
>
> Doesn't this make this design *blocking* again ? Or do you queue
> requests with the new view ID, return immediately and apply them when
> the new view ID is installed ? [...]

You're right, it will be blocking. However, the in-flight transaction data should be much smaller than the entire state, so I'm hoping that this blocking phase won't be as painful as it is now. Furthermore, commands waiting for a lock or for a sync RPC won't block state transfer anymore, so deadlocks between writes and state transfer (which now end with the write command timing out) should be impossible.

I am not planning for any queuing at this point - we could only queue if we had the same order between the writes on all the nodes, otherwise we would get inconsistencies. For async commands we will have an implicit queue in the JGroups thread pool, but we won't have anything for sync commands.

> Hmm, this can make a state transfer message quite large. Are we trimming
> the modification list ? E.g. if we have 10 PUTs on K, 1 removal of K,
> and another 4 PUTs, do we just send the *last* PUT, or do we send a
> modification list of 15 ?

Yep, my idea was to keep only a set of keys that have been modified for each cache view id. A second put to the same key wouldn't modify the set. When we transfer the entries to the new owners, we iterate the data container and only send each key once.

Before writing the key to this set (and the entry to the data container), the primary owner will synchronously invalidate the key not only on the L1 requestors, but on all the owners in the previous cache views (that are no longer owners in the latest view). This will ensure that the new owner receives only one copy of each key and won't have to buffer and sort the state received from all the previous owners.

> Another downside of the L1 cache being part of the regular cache. IMO it
> would be much better to separate the 2, as I wrote in previous emails
> yesterday.

I haven't read those yet, but I'm not sure moving the L1 cache to another container would eliminate the problem completely. It's probably not clear from the document, but if we empty the L1 on cache view changes then we have to ensure that we don't write to L1 anything coming from an old owner. Otherwise we're left with a value in L1 that none of the new owners knows about and won't invalidate on a put to that key.

> How does a member C know that it *will* receive any state at all ? E.g.
> if we had key K on A and B, and now B crashed, then A would push a copy
> of K to C. So when C receives a request R from another member E, but
> hasn't received a state transfer from A yet, how does it know whether to
> apply or queue R ? Does C wait until it gets an END-OF-ST message from
> the coordinator ?

Yep, each node will signal to the coordinator that it has finished pushing data, and when the coordinator gets the confirmation from all the nodes it will broadcast the end-of-ST message to everyone.

> Over the last couple of days, I've exchanged a couple of emails with the
> Cloud-TM guys, and I'm more and more convinced that their total order
> solution is the simpler approach to (1) transactional updates and (2)
> state transfer. [...]

Sounds intriguing, but I'm not sure how making the entire state transfer a single transaction would allow us to handle transactions while that state is being transferred. Blocking state transfer works with 2PC already. Well, at least as long as we don't have any merges...

> So my 5 cents:
>
> #1 We should focus on the total order approach, and get rid of the 2PC
> and locking business for transactional updates

The biggest problem I remember total order having is TM transactions that have other participants (as opposed to cache-only transactions). I haven't followed the TO discussion on the mailing list very closely; does that work now?

Regarding state transfer in particular, remember: non-blocking state transfer with 2PC sounded very easy as well, before we got into the details. The proof-of-concept you had running on the 4.2 branch was much simpler than what we have now, and even what we have now doesn't cover everything.

> #2 Really focus on the eventual consistency approach

Again, it's very tempting, but I fear much of what makes eventual consistency tempting is that we don't know enough of its implementation details yet. Manik has been investigating eventual consistency for a while; I wonder what he thinks...

Cheers
Dan
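A sketch of the per-view modified-key set Dan describes: N writes to the same key cost one set entry, and each key is shipped at most once (class and method names are made up for illustration):

    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    class ModificationTracker {
        private final ConcurrentMap<Integer, Set<Object>> modifiedByView =
            new ConcurrentHashMap<>();

        void onWrite(int viewId, Object key) {
            // A second put to the same key leaves the set unchanged, so
            // 10 PUTs + 1 remove + 4 PUTs on K still cost one entry.
            modifiedByView.computeIfAbsent(viewId, v -> ConcurrentHashMap.newKeySet())
                          .add(key);
        }

        Set<Object> keysToTransfer(int viewId) {
            return modifiedByView.getOrDefault(viewId, Set.of());
        }
    }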
Thanks for the feedback and clarifications Dan !
Comments inline...

> Sounds intriguing, but I'm not sure how making the entire state transfer
> a single transaction would allow us to handle transactions while that
> state is being transferred.

We're discussing 2 scenarios: one being the use of flush (but short-lived, as we don't have any locks in play), and the other being Pedro's proposal of a start-state TO-cast (shipped with a key set), followed by a state transfer. Transactions with any keys in the key set are queued and applied in order after the state transfer is done (signalled with another stop-state TO-cast). If you have anything to add, let's discuss it on the other thread.

> The biggest problem I remember total order having is TM transactions
> that have other participants (as opposed to cache-only transactions).
> I haven't followed the TO discussion on the mailing list very closely;
> does that work now?

No, I don't think that's addressed by TOM - good point in favor of having 2 (or more) approaches to partial replication and state transfer !

I think a lot of systems would still benefit from TOM though, as there are no external participants. For instance session replication: it uses an Infinispan batch to ship session updates to the session owners, and there are no other participants (as a matter of fact, this is not even a JTA transaction anyway).

> Regarding state transfer in particular, remember: non-blocking state
> transfer with 2PC sounded very easy as well, before we got into the
> details.

True :-)

> Again, it's very tempting, but I fear much of what makes eventual
> consistency tempting is that we don't know enough of its implementation
> details yet.

True again...

> Manik has been investigating eventual consistency for a while; I wonder
> what he thinks...

Yep - Manik, do you have any status update on eventual consistency ?

--
Bela Ban, JGroups lead (http://www.jgroups.org)
On 3/23/12 10:45 AM, Bela Ban wrote:
>> The biggest problem I remember total order having is TM transactions
>> that have other participants (as opposed to cache-only transactions).
>> I haven't followed the TO discussion on the mailing list very closely;
>> does that work now?
>
> No, I don't think that's addressed by TOM - good point in favor of
> having 2 (or more) approaches to partial replication and state transfer !

Actually, we already have code for dealing with scenarios in which ISPN is involved in a distributed transaction with other participants (Pedro can point it out; it should already be on GitHub). In this case the solution necessarily implies the usage of 2PC, but we can disseminate the prepare messages within ISPN using TOM.

Pro:
- deadlock freedom at the ISPN level, which can help make ISPN a well-behaved (i.e. responsive) participant in a distributed transaction even in high-contention scenarios.

Con:
- in this case we cannot determine the outcome of a transaction right away once it is TOM-delivered, as we also need to take into account the votes of external participants. Hence the need for an extra phase.