[infinispan-dev] Non-blocking state transfer (ISPN-1424)


[infinispan-dev] Non-blocking state transfer (ISPN-1424)

Dan Berindei
Hi guys

It's been a long time coming, but I finally published the non-blocking
state transfer draft on the wiki:
https://community.jboss.org/wiki/Non-blockingStateTransfer

Unlike my previous state transfer design document, I think I've
fleshed out most of the implications. Still, there are some things I
don't have a clear solution for yet. As you would expect it's mostly
around merging and delayed state transfer.

I'm looking forward to hearing your comments/advice!

Cheers
Dan

PS: Let's discuss this over the mailing list only.
_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] Non-blocking state transfer (ISPN-1424)

Galder Zamarreño
Hey Dan,

Thanks very much for writing this up! Some questions:

1. What's "steady state" ? Running status after state transfer has happened?

2. In L1, I don't understand what you mean in "we do need to add the old no-longer-owners as requestors for the transferred keys". Do we really need L1OnRehash? What's the use case/motivation for this option?

3. Recovery cache, just wondering, shouldn't we configure such cache with a cache store by default? Would that help?

4. What's the data container snapshot exactly? What kind of impact would it have on memory consumption?

One last comment: There's a lot of detail in that wiki which is hard to fully understand. I think it would really help to build some diagrams explaining some of the process to help the community understand better your design, or alternative solutions. WDYT?

Cheers,

On Mar 8, 2012, at 10:55 AM, Dan Berindei wrote:

> Hi guys
>
> It's been a long time coming, but I finally published the non-blocking
> state transfer draft on the wiki:
> https://community.jboss.org/wiki/Non-blockingStateTransfer
>
> Unlike my previous state transfer design document, I think I've
> fleshed out most of the implications. Still, there are some things I
> don't have a clear solution for yet. As you would expect it's mostly
> around merging and delayed state transfer.
>
> I'm looking forward to hearing your comments/advice!
>
> Cheers
> Dan
>
> PS: Let's discuss this over the mailing list only.
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev

--
Galder Zamarreño
Sr. Software Engineer
Infinispan, JBoss Cache



Re: [infinispan-dev] Non-blocking state transfer (ISPN-1424)

Dan Berindei

Hi Galder

First of all, thanks for reading!

On Mar 9, 2012 12:54 PM, "Galder Zamarreño" <[hidden email]> wrote:
>
> Hey Dan,
>
> Thanks very much for writing this up! Some questions:
>
> 1. What's "steady state" ? Running status after state transfer has happened?
>

Exactly, it's when there is no state transfer in progress.

> 2. In L1, I don't understand what you mean in "we do need to add the old no-longer-owners as requestors for the transferred keys". Do we really need L1OnRehash? What's the use case/motivation for this option?
>

If we have a put on a new owner, it must know to invalidate the key both on the nodes that read it after the CH change and the ones that read it before the CH change (from an old owner). It will become critical once we start staggering clustered get commands.

The L1OnRehash requirement is a bit different, because the old owner wasn't on any requestor list before the ownership change. But the solution is the same.
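A rough sketch of the requestor bookkeeping described above (a toy illustration; the class and method names are hypothetical, not Infinispan's actual L1 implementation):

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class L1Requestors {
    // Per-key set of nodes that may hold the key in L1. On a put, the new
    // owner invalidates every node in this set -- including nodes that read
    // the key from an old owner before the CH change.
    private final Map<String, Set<String>> requestors = new ConcurrentHashMap<>();

    void recordRead(String key, String node) {
        requestors.computeIfAbsent(key, k -> ConcurrentHashMap.newKeySet()).add(node);
    }

    /** On an ownership change with L1OnRehash, the old no-longer-owner
     *  keeps the entry in L1, so it is added as a requestor too. */
    void transferOwnership(String key, String oldOwner) {
        recordRead(key, oldOwner);
    }

    Set<String> nodesToInvalidate(String key) {
        return requestors.getOrDefault(key, Set.of());
    }

    public static void main(String[] args) {
        L1Requestors l1 = new L1Requestors();
        l1.recordRead("k", "nodeC");        // nodeC read k before the rehash
        l1.transferOwnership("k", "nodeA"); // nodeA used to own k
        System.out.println(l1.nodesToInvalidate("k").size()); // 2 nodes to invalidate
    }
}
```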

> 3. Recovery cache, just wondering, shouldn't we configure such cache with a cache store by default? Would that help?
>

I think a cache store would slow things too much, but it would solve this.

> 4. What's the data container snapshot exactly? What kind of impact would it have on memory consumption?
>

It's not really a snapshot, I just called it a snapshot to convey the idea that the (Bounded)ConcurrentHashMap doesn't pick up all the changes made after the start of the iteration.
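For context, the weakly consistent iteration Dan describes is standard ConcurrentHashMap behaviour - a minimal illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class WeaklyConsistentIteration {
    public static void main(String[] args) {
        ConcurrentHashMap<String, Integer> container = new ConcurrentHashMap<>();
        container.put("a", 1);
        container.put("b", 2);

        List<String> pushed = new ArrayList<>();
        for (Map.Entry<String, Integer> e : container.entrySet()) {
            pushed.add(e.getKey());
            // A write during iteration never invalidates the iterator
            // (no ConcurrentModificationException), but whether the
            // iterator observes "c" is unspecified.
            container.put("c", 3);
        }
        // Entries present when the iteration started are guaranteed
        // to be traversed; concurrent additions may or may not be.
        System.out.println(pushed.contains("a") && pushed.contains("b"));
    }
}
```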

> One last comment: There's a lot of detail in that wiki which is hard to fully understand. I think it would really help to build some diagrams explaining some of the process to help the community understand better your design, or alternative solutions. WDYT?
>

Fair point. I focused more on keeping track of required code changes to make sure I don't miss anything, but that ended up being even more dense so I removed it from the wiki.

Any particular diagram requests, to help prioritize things?

Cheers
Dan

> Cheers,
>
> On Mar 8, 2012, at 10:55 AM, Dan Berindei wrote:
>
> > Hi guys
> >
> > It's been a long time coming, but I finally published the non-blocking
> > state transfer draft on the wiki:
> > https://community.jboss.org/wiki/Non-blockingStateTransfer
> >
> > Unlike my previous state transfer design document, I think I've
> > fleshed out most of the implications. Still, there are some things I
> > don't have a clear solution for yet. As you would expect it's mostly
> > around merging and delayed state transfer.
> >
> > I'm looking forward to hearing your comments/advice!
> >
> > Cheers
> > Dan
> >
> > PS: Let's discuss this over the mailing list only.
> > _______________________________________________
> > infinispan-dev mailing list
> > [hidden email]
> > https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
> --
> Galder Zamarreño
> Sr. Software Engineer
> Infinispan, JBoss Cache
>
>
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev



Re: [infinispan-dev] Non-blocking state transfer (ISPN-1424)

Bela Ban
In reply to this post by Dan Berindei
Wow !

Does this need to be so complex ? I've spent an hour trying to understand
it, and am still overwhelmed... :-)

My understanding (based on my changes in 4.2) is that state transfer
moves/deletes keys based on the diff between 2 subsequent views:
- Each node checks all of the affected keys
- If a key should be stored in additional nodes, the key is pushed there
- If a key shouldn't be stored locally anymore, it is removed

IMO, there's no need to handle a merge differently from a regular view,
and we might end up with inconsistent state, but that's unavoidable
until we have eventual consistency. Fine...

Also, why do we need to transfer ownership information ? Can't ownership
be calculated purely on local information ?

I'm afraid that the complexity will increase the state space (hard to
test all possible state transitions), lead to unnecessary messages being
sent and most importantly, might lead to blocks.

The section on locking outright scares me :-) Perhaps reducing the level
of details here - as Galder suggested - might help to understand the
basic design.

Sorry for being a bit negative, but I think state transfer is one of the
most critical and important pieces of code in DIST mode, and this needs
to work against large clusters (say a couple of hundred nodes), with
nodes joining, leaving or crashing all the time...

I'm going to re-read the design again, maybe what I said above is just
BS ... :-)


On 3/8/12 11:55 AM, Dan Berindei wrote:

> Hi guys
>
> It's been a long time coming, but I finally published the non-blocking
> state transfer draft on the wiki:
> https://community.jboss.org/wiki/Non-blockingStateTransfer
>
> Unlike my previous state transfer design document, I think I've
> fleshed out most of the implications. Still, there are some things I
> don't have a clear solution for yet. As you would expect it's mostly
> around merging and delayed state transfer.
>
> I'm looking forward to hearing your comments/advice!
>
> Cheers
> Dan
>
> PS: Let's discuss this over the mailing list only.
>
--
Bela Ban, JGroups lead (http://www.jgroups.org)

Re: [infinispan-dev] Non-blocking state transfer (ISPN-1424)

Sanne Grinovero-3
I agree with Bela: this looks scary, can't imagine how tricky it would
be to implement it correctly. Could you split the problem?

Also, what happened with the super-simple ideas I'd shared in London?
Is it the same? I'm assuming it's very different...
I might have overlooked some important aspect, but I'd like to know why
that approach was discarded. Was it looking too simple ? :P

I'm sorry I'm skimming through it, will have more time next week but
as Galder said as well I'll need to draw this to understand it better.

Some first-impact notes:
# ownership information
  Deciding when to start a state transfer is a different problem.
Move it to another page or drop it?
  Deciding when to have a node join - same as above?

# Cache entries
##  "We need a snapshot of the iterator" < Can we avoid it? We just
start refusing to serve write commands by checking any incoming
command. It's an additional interceptor which checks the incoming
command is "appropriate" to be handled by us or needs to return some
kind of rejection code.
## Need for tombstones < I think we can avoid that actually. We'll
need it for MVCC replace/delete operations, but for state transfer
it's not needed if we decide that a Write operation has to send the
value to the new owners only and send an "authoritative invalidation"
to all previous owners.

# Lock Information
This is trivial if you stop thinking of them as being special. A lock
is a marker, and a marker is a value stored in the grid. These values
are transferred as any other value, with one single differentiator:
since there always is only one, they are manipulated via CAS
operations and are guaranteed to be consistent without the need of
being locked when changed.
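Sanne's lock-as-marker idea could be sketched roughly like this (a toy illustration with a plain concurrent map standing in for the grid; all names are hypothetical):

```java
import java.util.concurrent.ConcurrentHashMap;

public class CasLockMarker {
    // Hypothetical: lock markers stored as ordinary values, keyed by
    // the data key. Since there is only ever one marker per key, it
    // can be manipulated with CAS-style operations, with no separate
    // locking of the marker itself.
    private final ConcurrentHashMap<String, String> lockMarkers = new ConcurrentHashMap<>();

    /** Acquire: succeeds only if no marker exists yet (atomic put-if-absent). */
    boolean tryLock(String key, String txId) {
        return lockMarkers.putIfAbsent(key, txId) == null;
    }

    /** Release only if we own the marker (atomic compare-and-delete). */
    boolean unlock(String key, String txId) {
        return lockMarkers.remove(key, txId);
    }

    public static void main(String[] args) {
        CasLockMarker locks = new CasLockMarker();
        System.out.println(locks.tryLock("k", "tx1")); // true
        System.out.println(locks.tryLock("k", "tx2")); // false: tx1 holds it
        System.out.println(locks.unlock("k", "tx2"));  // false: not the owner
        System.out.println(locks.unlock("k", "tx1"));  // true
    }
}
```

Note that Dan pushes back on this idea further down the thread, since Infinispan only keeps the lock on the primary owner.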

# L1
let's keep it simple initially and just flush them out as decided.
## the cleanup you mention: is that not a current bug, orthogonal to
this design page? (trying to identify more things to move out)

# Handling merges
Could we simplify this by saying that the views are not actually a
linked list but a tree?
In this document we're not attempting to solve consistent merging of
split brain, right? So we only need to know how to move the state to
the rightful new owner. For conflicts, let's assume there is an
"ConflictResolver object" which we'll describe/implement somewhere
else.

# State transfer disabled
We should think about the cases in which this option makes sense to be
enabled. In those cases, would people still be interested in L1
consistency and transactions? If not, this is not a problem to solve.

after getting to the end, it's not a bad document at all but I still
think it looks too scary :D

Cheers,
Sanne

On 9 March 2012 14:19, Bela Ban <[hidden email]> wrote:

> Wow !
>
> Does this need to be so complex ? I've spent a hour trying to understand
> it, and am still overwhelmed... :-)
>
> My understanding (based on my changed in 4.2) is that state transfer
> moves/deletes keys based on the diff between 2 subsequent views:
> - Each node checks all of the affected keys
> - If a key should be stored in additional nodes, the key is pushed there
> - If a key shouldn't be stored locally anymore, it is removed
>
> IMO, there's no need to handle a merge differently from a regular view,
> and we might end up with inconsistent state, but that's unavoidable
> until we have eventual consistency. Fine...
>
> Also, why do we need to transfer ownership information ? Can't ownership
> be calculated purely on local information ?
>
> I'm afraid that the complexity will increase the state space (hard to
> test all possible state transitions), lead to unnecessary messages being
> sent and most importantly, might lead to blocks.
>
> The section on locking outright scares me :-) Perhaps reducing the level
> of details here - as Galder suggested - might help to understand the
> basic design.
>
> Sorry for being a bit negative, but I think state transfer is one of the
> most critical and important pieces of code in DIST mode, and this needs
> to work against large (say a couple of hundreds) clusters and nodes
> joining, leaving or crashing all the times...
>
> I'm going to re-read the design again, maybe what I said above is just
> BS ... :-)
>
>
> On 3/8/12 11:55 AM, Dan Berindei wrote:
>> Hi guys
>>
>> It's been a long time coming, but I finally published the non-blocking
>> state transfer draft on the wiki:
>> https://community.jboss.org/wiki/Non-blockingStateTransfer
>>
>> Unlike my previous state transfer design document, I think I've
>> fleshed out most of the implications. Still, there are some things I
>> don't have a clear solution for yet. As you would expect it's mostly
>> around merging and delayed state transfer.
>>
>> I'm looking forward to hearing your comments/advice!
>>
>> Cheers
>> Dan
>>
>> PS: Let's discuss this over the mailing list only.
>>
> --
> Bela Ban, JGroups lead (http://www.jgroups.org)
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] Non-blocking state transfer (ISPN-1424)

Dan Berindei
In reply to this post by Bela Ban

On Mar 9, 2012 4:19 PM, "Bela Ban" <[hidden email]> wrote:
>
> Wow !
>
> Does this need to be so complex ? I've spent a hour trying to understand
> it, and am still overwhelmed... :-)
>


Sorry about that Bela, it is quite complex indeed.


> My understanding (based on my changed in 4.2) is that state transfer
> moves/deletes keys based on the diff between 2 subsequent views:
> - Each node checks all of the affected keys
> - If a key should be stored in additional nodes, the key is pushed there
> - If a key shouldn't be stored locally anymore, it is removed
>


That's fine if we block all writes during state transfer, but once we start allowing writes during state transfer we need to log all changes and send them to the new owners at the end (the approach in 4.2 without your changes) or redirect all commands to the new owners.

In addition to that, we have to either block all commands on the new owners until they receive the entire state or to forward get commands to the old owners as well. The two options apply for lock commands as well.


> IMO, there's no need to handle a merge differently from a regular view,
> and we might end up with inconsistent state, but that's unavoidable
> until we have eventual consistency. Fine...
>


I'm not trying to make merges more complicated on purpose :)

I think we need to try our best to prevent data loss, even if there is a chance of inconsistency. We still see clusters in the test suite form via merges from time to time, so we can't just say that after a merge all bets are off.

The problem is that I chose to forward get commands to the old owners AND to remove the cache view rollback (which was blocking in our Lisbon design). This means that we must keep a chain of cache views for which we haven't finished state transfer, and with merges that chain turns into a tree + it has to be broadcasted by the coordinator to all the nodes.


> Also, why do we need to transfer ownership information ? Can't ownership
> be calculated purely on local information ?
>


The current ownership information can be calculated based solely on the members list. But the ownership in the previous cache view(s) can not be computed by joiners based only on their information, so it has to be broadcasted by the coordinator.


> I'm afraid that the complexity will increase the state space (hard to
> test all possible state transitions), lead to unnecessary messages being
> sent and most importantly, might lead to blocks.
>


I agree the increased complexity is a concern, but I'm not willing to give up on non-blocking state transfer just yet...

One particularly nasty problem with the existing, blocking, state transfer is that before iterating the data container we need to wait for all the pending commands to finish. So if we have high contention and a 60 seconds lock acquisition timeout, state transfer is almost guaranteed to take > 60 seconds.


> The section on locking outright scares me :-) Perhaps reducing the level
> of details here - as Galder suggested - might help to understand the
> basic design.
>


I got burned pretty hard with my asymmetric clusters design, because the implementation turned out a lot more complex than the design, so I tried to investigate all the interactions between the different choices we're making this time.


> Sorry for being a bit negative, but I think state transfer is one of the
> most critical and important pieces of code in DIST mode, and this needs
> to work against large (say a couple of hundreds) clusters and nodes
> joining, leaving or crashing all the times...
>


I'd argue that the blocking state transfer we have doesn't satisfy this requirement...


> I'm going to re-read the design again, maybe what I said above is just
> BS ... :-)
>


Please do re-read it, I'll try to simplify it a bit by Monday based on your feedback.


>
> On 3/8/12 11:55 AM, Dan Berindei wrote:
> > Hi guys
> >
> > It's been a long time coming, but I finally published the non-blocking
> > state transfer draft on the wiki:
> > https://community.jboss.org/wiki/Non-blockingStateTransfer
> >
> > Unlike my previous state transfer design document, I think I've
> > fleshed out most of the implications. Still, there are some things I
> > don't have a clear solution for yet. As you would expect it's mostly
> > around merging and delayed state transfer.
> >
> > I'm looking forward to hearing your comments/advice!
> >
> > Cheers
> > Dan
> >
> > PS: Let's discuss this over the mailing list only.
> >
> --
> Bela Ban, JGroups lead (http://www.jgroups.org)
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev



Re: [infinispan-dev] Non-blocking state transfer (ISPN-1424)

Dan Berindei
In reply to this post by Sanne Grinovero-3
On Fri, Mar 9, 2012 at 6:03 PM, Sanne Grinovero <[hidden email]> wrote:
> I agree with Bela: this looks scary, can't imagine how tricky it would
> be to implement it correctly. Could you split the problem?
>

I thought I just did that with the different sections :)

All the state transfer components are bound together by the approach
we choose for updating the CH, and I think most of the
complications stem from the way we have to maintain ownership
information - not just for the current cache view, but for the
previous cache views as well.


> Also, what happened with the super-simple ideas I'd shared in London?
> Is is the same? I'm assuming it's very different..
> I might have overseen some important aspect but I'd like to know why
> that approach was discarded. Was it looking too simple ? :P
>

I thought it was the same thing, but in London we only had a very
high-level discussion. We didn't discuss how a joiner gets the chain
of CHs so that it knows which are the old owners, and we only briefly
touched on how to discard old cache views. Please let me know where
you think I've diverged from our discussions :)


> I'm sorry I'm skimming through it, will have more time next week but
> as Galder said as well I'll need to draw this to understand it better.
>
> Some first-impact notes:
> # ownership information
>  Deciding when to start a state transfer, is a different problem.
> move it to another page or drop it?
>  Deciding when to have a node join - same as above?
>

We do need to decide whether we want to allow CH updates without
transferring the actual data or not - it has an impact on how big the
"transaction log" on the new owners can get.

We also need to decide whether we allow state transfer to finish
successfully even though one node has already left the cache view.
I've chosen to "interrupt" the state transfer on a leave in order to
avoid weird situations where DM.getLocation(key) returns an empty set,
but that complicates things a bit so we need to make it explicit here.
E.g. after an error we can't just retry the same cache view, we have
to account for leavers as well, so we can't say that a merge can only
be the last view in the chain.


> # Cache entries
> ##  "We need a snapshot of the iterator" < Can we avoid it? We just
> start refusing to serve write commands by checking any incoming
> command. It's an additional interceptor which checks the incoming
> command is "appropriate" to be handled by us or needs to return some
> kind of rejection code.

We have to iterate over the data container in order to push data, and
that iterator behaves as a snapshot. So there's nothing to avoid here.
I did mention that we start refusing write commands - the iteration of
the data container is the reason why we need the cache view id check.


> ## Need for tombstones < I think we can avoid that actually. We'll
> need it for MVCC replace/delete operations, but for state transfer
> it's not needed if we decide that a Write operation has to send the
> value to the new owners only and send an "authoritative invalidation"
> to all previous owners.
>

I'm not sure how this "authoritative invalidation" would work... any
change to the data container may or may not be reflected in the iteration
done by the state transfer task, and actually by the time we send the
invalidation command the entry could in fact be in a JGroups message
on its way to the new owner.


> # Lock Information
> This is trivial if you stop thinking as them as being special. A lock
> is a marker, and a marker is a value stored in the grid. These values
> are transferred as any other value, with one single differentiator:
> since there always is only one, they are manipulated via CAS
> operations and are guaranteed to be consistent without the need of
> being locked when changed.
>

Actually no, we only keep the lock on the primary owner so we can't
rely on the lock information being there in case of a leave. So we
need to transfer transaction information instead - and the transaction
information only gives a "possibly locked" status for a key.

Even if we do treat the transaction information as the normal state,
it doesn't mean that we don't have to handle it - in fact I wrote this
idea and marked it [LATER] because I thought it would be too
complicated to duplicate the data infrastructure for transaction
information.


> #L1
> let's keep it simple initially and just flush them out as decided.

Agree, it's simpler to flush everything out - but that doesn't mean we
don't have to change anything, we still need to add the old owners as
requestors if L1OnRehash is enabled.


> ## the cleanup you mention: is that not a current bug, orthogonal to
> this design page? (trying to identify more thing to move out)
>

I could move out the fact that it's a current bug, but it's still
something we need to do during state transfer and we have to decide
what locks/latches we need to hold while doing it in order to ensure
consistency.


> #Handling merges
> Could we simplify this by saying that the views are not actually a
> linked list but a tree?

As I wrote the document I wasn't quite sure that the tree approach
would work, but I'm more and more convinced that it would be fine.
I agree that it sounds a little more complicated because I wrote it at
first considering a list of cache views and I only started considering
the generic tree approach when I wrote the merge section. But I
couldn't change it as I wasn't 100% sold on the tree approach (another
idea I had was to ignore any view change after a merge, so the cache
view tree could only have more than 1 branch at the root level), so I
tried to leave the option open, hoping for better suggestions from you
guys.


> In this document we're not attempting to solve consistent merging of
> split brain, right? So we only need to know how to move the state to
> the rightful new owner. For conflicts, let's assume there is an
> "ConflictResolver object" which we'll describe/implement somewhere
> else.
>

I don't want to solve consistent merging here, but I do want to make
sure that it is possible. For instance if after a failed merge the
information about the old partitions is lost, no "ConflictResolver
object" could make that state consistent and state will be just lost.

The long discussion about which view is newer is there because even
without merges we want to ensure consistency and apply received entries
in their logical order (which happens to be the ascending order of their
cache view ids) - therefore we need to buffer data received for
"intermediate" nodes in the cache view chain. Merges complicate this
because the cache view tree has multiple leaves and we have no way of
ordering them (unlike JGroups, we identify a cache view only by its id,
so there's no way to compare cache views and decide one is a descendant
of another) - hence the need for the intermediate flag.


> #State transfer disabled
> We should think about the cases in which this option makes sense to be
> enabled. In those cases, would people still be interested in L1
> consistency and transactions? If not, this is not a problem to solve.
>

The main use case I have in mind is when the user doesn't care about
missing data (we're not the authoritative source) but he does care
about staleness. Most users I talked with (not many, I grant you that)
are willing to accept stale data but on very short timescales. L1,
with its 10-minute default lifespan, doesn't qualify - and making the
L1 lifespan very short will make it useless.


> after getting to the end, it's not a bad document at all but I still
> think it looks too scary :D
>

Agree with both ;-)

Cheers
Dan


> Cheers,
> Sanne
>
> On 9 March 2012 14:19, Bela Ban <[hidden email]> wrote:
>> Wow !
>>
>> Does this need to be so complex ? I've spent a hour trying to understand
>> it, and am still overwhelmed... :-)
>>
>> My understanding (based on my changed in 4.2) is that state transfer
>> moves/deletes keys based on the diff between 2 subsequent views:
>> - Each node checks all of the affected keys
>> - If a key should be stored in additional nodes, the key is pushed there
>> - If a key shouldn't be stored locally anymore, it is removed
>>
>> IMO, there's no need to handle a merge differently from a regular view,
>> and we might end up with inconsistent state, but that's unavoidable
>> until we have eventual consistency. Fine...
>>
>> Also, why do we need to transfer ownership information ? Can't ownership
>> be calculated purely on local information ?
>>
>> I'm afraid that the complexity will increase the state space (hard to
>> test all possible state transitions), lead to unnecessary messages being
>> sent and most importantly, might lead to blocks.
>>
>> The section on locking outright scares me :-) Perhaps reducing the level
>> of details here - as Galder suggested - might help to understand the
>> basic design.
>>
>> Sorry for being a bit negative, but I think state transfer is one of the
>> most critical and important pieces of code in DIST mode, and this needs
>> to work against large (say a couple of hundreds) clusters and nodes
>> joining, leaving or crashing all the times...
>>
>> I'm going to re-read the design again, maybe what I said above is just
>> BS ... :-)
>>
>>
>> On 3/8/12 11:55 AM, Dan Berindei wrote:
>>> Hi guys
>>>
>>> It's been a long time coming, but I finally published the non-blocking
>>> state transfer draft on the wiki:
>>> https://community.jboss.org/wiki/Non-blockingStateTransfer
>>>
>>> Unlike my previous state transfer design document, I think I've
>>> fleshed out most of the implications. Still, there are some things I
>>> don't have a clear solution for yet. As you would expect it's mostly
>>> around merging and delayed state transfer.
>>>
>>> I'm looking forward to hearing your comments/advice!
>>>
>>> Cheers
>>> Dan
>>>
>>> PS: Let's discuss this over the mailing list only.
>>>
>> --
>> Bela Ban, JGroups lead (http://www.jgroups.org)
>> _______________________________________________
>> infinispan-dev mailing list
>> [hidden email]
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev


Re: [infinispan-dev] Non-blocking state transfer (ISPN-1424)

Bela Ban
In reply to this post by Dan Berindei
Sorry for the delay, been busy ...

Comments inline.

On 3/9/12 7:20 PM, Dan Berindei wrote:
> On Mar 9, 2012 4:19 PM, "Bela Ban"<[hidden email]>  wrote:

>
>> My understanding (based on my changed in 4.2) is that state transfer
>> moves/deletes keys based on the diff between 2 subsequent views:
>> - Each node checks all of the affected keys
>> - If a key should be stored in additional nodes, the key is pushed there
>> - If a key shouldn't be stored locally anymore, it is removed
>>
>
>
> That's fine if we block all writes during state transfer, but once we start
> allowing writes during state transfer we need to log all changes and send
> them to the new owners at the end (the approach in 4.2 without your
> changes) or redirect all commands to the new owners.


OK


> In addition to that, we have to either block all commands on the new owners
> until they receive the entire state or to forward get commands to the old
> owners as well. The two options apply for lock commands as well.


Why do we have to lock at all if we use queueing of requests during a
state transfer ?


> I'm not trying to make merges more complicated on purpose :)


I didn't imply that; but I thought the London design was pretty simple,
and I'm trying to figure out why we have such a big (maybe perceived)
diff between London and what's in the document.


>> Also, why do we need to transfer ownership information ? Can't ownership
>> be calculated purely on local information ?
>>
>
>
> The current ownership information can be calculated based solely on the
> members list. But the ownership in the previous cache view(s) can not be
> computed by joiners based only on their information, so it has to be
> broadcasted by the coordinator.


OK



> One particularly nasty problem with the existing, blocking, state transfer
> is that before iterating the data container we need to wait for all the
> pending commands to finish.


Can't we queue the state transfer requests *and* the regular requests ?
When ST is done, we apply the ST requests first, then the queued regular
requests.



--
Bela Ban, JGroups lead (http://www.jgroups.org)
_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] Non-blocking state transfer (ISPN-1424)

Dan Berindei
On Thu, Mar 15, 2012 at 10:09 AM, Bela Ban <[hidden email]> wrote:
> Sorry for the delay, been busy ...
>

No worries Bela, I'm late with my rewrite of the NBST document as well...

> Comments inline.
>
> On 3/9/12 7:20 PM, Dan Berindei wrote:
>> On Mar 9, 2012 4:19 PM, "Bela Ban"<[hidden email]>  wrote:
>
>>
>>> My understanding (based on my changes in 4.2) is that state transfer
>>> moves/deletes keys based on the diff between 2 subsequent views:
>>> - Each node checks all of the affected keys
>>> - If a key should be stored in additional nodes, the key is pushed there
>>> - If a key shouldn't be stored locally anymore, it is removed
>>>
>>
>>
>> That's fine if we block all writes during state transfer, but once we start
>> allowing writes during state transfer we need to log all changes and send
>> them to the new owners at the end (the approach in 4.2 without your
>> changes) or redirect all commands to the new owners.
>
>
> OK
>
>
>> In addition to that, we have to either block all commands on the new owners
>> until they receive the entire state or to forward get commands to the old
>> owners as well. The two options apply for lock commands as well.
>
>
> Why do we have to lock at all if we use queueing of requests during a
> state transfer ?
>

I'm not sure what you mean by queuing of requests...

The current approach:
* If the cache is sync, the old owners will throw a
StateTransferInProgressException when a state transfer is running and
the originator will then wait for state transfer to end before sending
the command again to the new owners.
* If the cache is async, we block write commands on the target, and
unblock them after the state transfer finishes. I don't think we do
anything to make sure all the new owners get all the commands that
were targeted to the old owners, so we can lose consistency between
owners.

The new design will always reject commands with the old cache view id,
on any owner. If the cache view id is ok but we don't have the new CH
or the lock information yet, the command will block on the target.
* If the cache is sync, we can use the same approach as before and
tell the originator to retry the command on the new owners.
* If the cache is async, instead of doing nothing like before I was
thinking that the target itself could forward the commands (especially
write commands) to the new owners.
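The sync-cache retry behaviour described above can be pictured with a small sketch. This is purely illustrative — `SyncRetrySketch`, `invokeWithRetry` and the exception class are invented names, not Infinispan's actual API:

```java
import java.util.function.IntPredicate;
import java.util.function.IntSupplier;

// Illustrative sketch only -- class and exception names are invented here,
// not Infinispan's actual API.
class SyncRetrySketch {
    static class StateTransferInProgressException extends RuntimeException {}

    // Retries a command until a target accepts it or attempts run out;
    // each retry re-reads the latest installed cache view id, so the
    // command is re-sent to the owners in the *new* view.
    static boolean invokeWithRetry(IntSupplier latestViewId,
                                   IntPredicate sendAtViewId,
                                   int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return sendAtViewId.test(latestViewId.getAsInt());
            } catch (StateTransferInProgressException e) {
                // Target rejected the command: a state transfer is running.
                // Wait for the new cache view, then retry on the new owners.
            }
        }
        return false;
    }
}
```

The key point is that the originator, not the (possibly failed) old owner, drives the retry — which is also why the "originator dies before retrying" scenario mentioned later in the thread is the remaining hole.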

We didn't discuss this in London, I added it afterwards as I realized
that without forwarding we'll be losing data. I didn't realize at the
time, however, that the current design has the same problem.

I think I see a problem with my request forwarding plan: the requests
will have a different source, so JGroups will not ensure an ordering
between them and the originator's subsequent requests. This means
we'll have inconsistencies anyway, so perhaps it would be better if we
stuck to the current design's limitations and removed the requirement
for old targets to forward commands to new ones.

>
>> I'm not trying to make merges more complicated on purpose :)
>
>
> I didn't imply that; but I thought the London design was pretty simple,
> and I'm trying to figure out why we have such a big (maybe perceived)
> diff between London and what's in the document.
>

True, there are many complications that were not apparent during our
London discussion.

>
>>> Also, why do we need to transfer ownership information ? Can't ownership
>>> be calculated purely on local information ?
>>>
>>
>>
>> The current ownership information can be calculated based solely on the
>> members list. But the ownership in the previous cache view(s) can not be
>> computed by joiners based only on their information, so it has to be
>> broadcasted by the coordinator.
>
>
> OK
>
>
>
>> One particularly nasty problem with the existing, blocking, state transfer
>> is that before iterating the data container we need to wait for all the
>> pending commands to finish.
>
>
> Can't we queue the state transfer requests *and* the regular requests ?
> When ST is done, we apply the ST requests first, then the queued regular
> requests.
>

That was basically what we did in the blocking design: the ST commands
could execute during ST, but regular commands would block until the
end of the ST. With async caches, that meant we would use JGroups' 1
queue per sender (so not a global queue, but close).

The problem was not with the regular commands that arrived after the
start of the ST, but with the commands that had already started
executing when ST started. This is the classic example:
1. A prepare command for Tx1 locks k1 on node A
2. A prepare command for Tx2 tries to acquire lock k1 on node A
3. State transfer starts up and blocks all write commands
4. The Tx1 commit command, which will unlock k1, arrives but can't run
until state transfer has ended
5. The Tx2 prepare command times out on the lock acquisition after 10
seconds (by default)
6. State transfer can now proceed and push or receive data.
7. The Tx1 commit can now run and unlock k1. It's too late for Tx2, however.

The solution I had in mind for the old design was to add some kind of
deadlock detection to the LockManager and throw a
StateTransferInProgress when a deadlock with the state transfer is
detected.

With the new design I thought it would be simpler to not acquire a big
lock for the entire duration of the write command that would prevent
state transfer. Instead I would acquire different locks for much
shorter amounts of time, and at the beginning of each lock acquisition
we would just check that the command's view id is still the correct
one.

I guess I may have been wrong on the "simpler" assertion :)


Re: [infinispan-dev] Non-blocking state transfer (ISPN-1424)

Erik Salter-2
Hi Dan,

This statement makes me nervous.  Would we still have inconsistencies in a
transactional context as well?

"We didn't discuss this in London, I added it afterwards as I realized
that without forwarding we'll be losing data. I didn't realize at the
time, however, that the current design has the same problem.

I think I see a problem with my request forwarding plan: the requests
will have a different source, so JGroups will not ensure an ordering
between them and the originator's subsequent requests. This means
we'll have inconsistencies anyway, so perhaps it would be better if we
stuck to the current design's limitations and removed the requirement
for old targets to forward commands to new ones."

Regards,

Erik



Re: [infinispan-dev] Non-blocking state transfer (ISPN-1424)

Dan Berindei
Sorry for making you nervous Erik, but if I remember correctly you're
using synchronous caches so these paragraphs should not apply to your
use case.

In asynchronous caches there is always the possibility of
inconsistency, because different owners can receive the transactions
in a different order (and the transactions are always 1PC). So this is
really nothing new, I just thought I could close one window of
inconsistency during view changes and I realized that it's not that
easy.

In sync caches with async commit, the commit RPC is synchronous, but runs
on a background thread, so we are able to retry the command on the new
owners.

There is one scenario open to inconsistencies for sync caches: when
the originator dies before retrying the command on the new nodes.
Enabling recovery will fix this, but I think forwarding from the old
owners could be a more lightweight solution (at the cost of extra
complexity, of course).

Cheers
Dan


On Thu, Mar 15, 2012 at 3:16 PM, Erik Salter <[hidden email]> wrote:

> Hi Dan,
>
> This statement makes me nervous.  Would we still have inconsistencies in a
> transactional context as well?
>
> "We didn't discuss this in London, I added it afterwards as I realized
> that without forwarding we'll be losing data. I didn't realize at the
> time, however, that the current design has the same problem.
>
> I think I see a problem with my request forwarding plan: the requests
> will have a different source, so JGroups will not ensure an ordering
> between them and the originator's subsequent requests. This means
> we'll have inconsistencies anyway, so perhaps it would be better if we
> stuck to the current design's limitations and removed the requirement
> for old targets to forward commands to new ones."
>
> Regards,
>
> Erik
>
>


Re: [infinispan-dev] Non-blocking state transfer (ISPN-1424)

Bela Ban
In reply to this post by Dan Berindei


On 3/15/12 11:29 AM, Dan Berindei wrote:

> That was basically what we did in the blocking design: the ST commands
> could execute during ST, but regular commands would block until the
> end of the ST. With async caches, that meant we would use JGroups' 1
> queue per sender (so not a global queue, but close).
>
> The problem was not with the regular commands that arrived after the
> start of the ST, but with the commands that had already started
> executing when ST started. This is the classic example:
> 1. A prepare command for Tx1 locks k1 on node A
> 2. A prepare command for Tx2 tries to acquire lock k1 on node A
> 3. State transfer starts up and blocks all write commands
> 4. The Tx1 commit command, which will unlock k1, arrives but can't run
> until state transfer has ended
> 5. The Tx2 prepare command times out on the lock acquisition after 10
> seconds (by default)
> 6. State transfer can now proceed and push or receive data.
> 7. The Tx1 commit can now run and unlock k1. It's too late for Tx2, however.
>
> The solution I had in mind for the old design was to add some kind of
> deadlock detection to the LockManager and throw a
> StateTransferInProgress when a deadlock with the state transfer is
> detected.


OK. I don't like the old design, as ST has to wait until all pending TXs
(those with locks held) have committed before we can make progress. If
the lock acquisition timeout is high, we'll have to wait for a long time.


> With the new design I thought it would be simpler to not acquire a big
> lock for the entire duration of the write command that would prevent
> state transfer. Instead I would acquire different locks for much
> shorter amounts of time, and at the beginning of each lock acquisition
> we would just check that the command's view id is still the correct
> one.


OK. Perhaps an overview of the new design in the document is warranted.
There's a section on transfer of CacheEntries and one on locks, but I
didn't see a combined discussion. Perhaps an example like the one above
would be good ?

I now realize how much simpler the use of total order is here: since all
updates in a cluster happen in total order, we don't need to acquire
locks in 1 phase and release them in another phase. ST is then just
another update, inserted at a certain place in the stream of updates.

I assume the Cloud-TM guys don't do state transfer in their prototype,
or do they ? Pedro ? If not, then there needs to be an implementation of
ST for TO.

Cheers,

--
Bela Ban, JGroups lead (http://www.jgroups.org)

Re: [infinispan-dev] Non-blocking state transfer (ISPN-1424)

Mircea Markus
In reply to this post by Dan Berindei
Hi Dan,

Some notes:

Ownership Information:
- I remember the discussion with Sanne about an algorithm that wouldn't require view prepare/rollback, but I think it would be very interesting to see it described in detail somewhere, as all the points you raised in the document are very closely tied to that
- "However, it's not that clear what to do when a node leaves while the cache is in "steady state"" --> if (numOwners-1) nodes crash, a state transfer is (most likely) wanted in order not to lose consistency in the eventuality another node goes down. By default numOwners is 2, so in the first release, can't we just assume that for every leave we'd want to immediately issue a state transfer? I think this would cover most of our users' needs and simplify the problem considerably.

Recovery
- I agree with your point re: recovery. This shouldn't be considered high prio in the first release. The recovery info is kept in an independent cache, which allows a lot of flexibility: e.g. it can point to a shared cache store so that recovery caches on other nodes can read that info when needed.


Locking/sync..
"The state transfer task will acquire the DCL in exclusive mode after updating the cache view id and release it after obtaining the data container snapshot. This will ensure the state transfer task can't proceed on the old owner until all write commands that knew about the old CH have finished executing." <-- that means that incoming writes would block for the duration of iterating the DataContainer? That shouldn't be too bad, unless a cache store is present..

"CacheViewInterceptor [..] also needs to block in the interval between a node noticing a leave and actually receiving the new cache view from the coordinator" <-- why can't the local cache start computing the new CH independently and not wait for the coordinator..?

Cheers,
Mircea

On 8 Mar 2012, at 10:55, Dan Berindei wrote:

Hi guys

It's been a long time coming, but I finally published the non-blocking
state transfer draft on the wiki:
https://community.jboss.org/wiki/Non-blockingStateTransfer

Unlike my previous state transfer design document, I think I've
fleshed out most of the implications. Still, there are some things I
don't have a clear solution for yet. As you would expect it's mostly
around merging and delayed state transfer.

I'm looking forward to hearing your comments/advice!

Cheers
Dan

PS: Let's discuss this over the mailing list only.

Re: [infinispan-dev] Non-blocking state transfer (ISPN-1424)

Mircea Markus
In reply to this post by Dan Berindei

> 3. Recovery cache, just wondering, shouldn't we configure such cache with a cache store by default? Would that help?
>

I think a cache store would slow things too much, but it would solve this.

A cache store on the recovery cache shouldn't actually slow down things at all on the critical path. Recovery info (a fancy name for PrepareCommand) is moved into the recovery cache only when the tx originator has crashed. And this happens on an async thread (view change). Then this cache is not queried from the critical path, but only by async recovery requests.

Re: [infinispan-dev] Non-blocking state transfer (ISPN-1424)

Dan Berindei
In reply to this post by Bela Ban
Hi guys

I've updated the document at
https://community.jboss.org/wiki/Non-blockingStateTransfer
This time I've added an overview and changed "locking requirements" to
"command execution during state transfer".
I also assumed that we are going to use the tree of cache views.

More comments inline.

On Fri, Mar 16, 2012 at 9:16 AM, Bela Ban <[hidden email]> wrote:

>
>
> On 3/15/12 11:29 AM, Dan Berindei wrote:
>
>> That was basically what we did in the blocking design: the ST commands
>> could execute during ST, but regular commands would block until the
>> end of the ST. With async caches, that meant we would use JGroups' 1
>> queue per sender (so not a global queue, but close).
>>
>> The problem was not with the regular commands that arrived after the
>> start of the ST, but with the commands that had already started
>> executing when ST started. This is the classic example:
>> 1. A prepare command for Tx1 locks k1 on node A
>> 2. A prepare command for Tx2 tries to acquire lock k1 on node A
>> 3. State transfer starts up and blocks all write commands
>> 4. The Tx1 commit command, which will unlock k1, arrives but can't run
>> until state transfer has ended
>> 5. The Tx2 prepare command times out on the lock acquisition after 10
>> seconds (by default)
>> 6. State transfer can now proceed and push or receive data.
>> 7. The Tx1 commit can now run and unlock k1. It's too late for Tx2, however.
>>
>> The solution I had in mind for the old design was to add some kind of
>> deadlock detection to the LockManager and throw a
>> StateTransferInProgress when a deadlock with the state transfer is
>> detected.
>
>
> OK. I don't like the old design, as ST has to wait until all pending TXs
> (those with locks held) have committed before we can make progress. If
> the lock acquisition timeout is high, we'll have to wait for a long time.
>
>
>> With the new design I thought it would be simpler to not acquire a big
>> lock for the entire duration of the write command that would prevent
>> state transfer. Instead I would acquire different locks for much
>> shorter amounts of time, and at the beginning of each lock acquisition
>> we would just check that the command's view id is still the correct
>> one.
>
>
> OK. Perhaps an overview of the new design in the document is warranted.
> There's a section on transfer of CacheEntries and one on locks, but I
> didn't see a combined discussion. Perhaps an example like the one above
> would be good ?
>

I hope I've improved on this in the new version.

> I now realize how much simpler the use of total order is here: since all
> updates in a cluster happen in total order, we don't need to acquire
> locks in 1 phase and release them in another phase. ST is then just
> another update, inserted at a certain place in the stream of updates.
>

Unfortunately I think that would mean a blocking state transfer,
because all the other updates would have to wait on state to be
transferred.

> I assume the Cloud-TM guys don't do state transfer in their prototype,
> or do they ? Pedro ? If not, then there needs to be an implementation of
> ST for TO.
>

I'm curious about this as well.

Cheers
Dan

Re: [infinispan-dev] Non-blocking state transfer (ISPN-1424)

Bela Ban
Thanks Dan,

here are some comments / questions:

"2. Each cache member receives the ownership information from the
coordinator and starts rejecting all commands started in the old cache view"

How do you know a command was started in the old cache view;  does this
mean you're shipping a cache view ID with every request ?



"2.1. Commands with the new cache view id will be blocked until we have
installed the new CH and we have received lock information from the
previous owners"

Doesn't this make this design *blocking* again ? Or do you queue
requests with the new view-ID, return immediately and apply them when
the new view-id is installed ? If the latter is the case, what do you
return ? An OK (how do you know the request will apply OK) ?



"A merge can be coalesced with multiple state transfers, one running in
each partition. So in the general case a coalesced state transfer
contains a tree of cache view changes."

Hmm, this can make a state transfer message quite large. Are we trimming
the modification list ? E.g if we have 10 PUTs on K, 1 removal of K, and
another 4 PUTs, do we just send the *last* PUT, or do we send a
modification list of 15 ?



"Get commands can write to the data container as well when L1 is
enabled, so we can't block just write commands."

Another downside of the L1 cache being part of the regular cache. IMO it
would be much better to separate the 2, as I wrote in previous emails
yesterday.


"The new owners will start receiving commands before they have received
all the state
     * In order to handle those commands, the new owners will have to
get the values from the old owners
     * We know that the owners in the later views have a newer version
of the entry (if they have it at all). So we need to go back on the
cache views tree and ask all the nodes on one level at the same time -
if we don't get any certain answer we go to the next level and repeat."


How does a member C know that it *will* receive any state at all ? E.g.
if we had key K on A and B, and now B crashed, then A would push a copy
of K to C.

So when C receives a request R from another member E, but hasn't
received a state transfer from A yet, how does it know whether to apply
or queue R ? Does C wait until it gets an END-OF-ST message from the
coordinator ?


(skipping the rest due to exhaustion by complexity :-))


Over the last couple of days, I've exchanged a couple of emails with the
Cloud-TM guys, and I'm more and more convinced that their total order
solution is the simpler approach to (1) transactional updates and (2)
state transfer. They don't have a solution for (2) yet, but I believe
this can be done as another totally ordered transaction, applied in the
correct location within the update stream. Or, we could possibly use a
flush: as we don't need to wait for pending TXs to complete and release
their locks, this should be quite fast.


So my 5 cents:

#1 We should focus on the total order approach, and get rid of the 2PC
and locking business for transactional updates

#2 Really focus on the eventual consistency approach


Thoughts ?

--
Bela Ban, JGroups lead (http://www.jgroups.org)

Re: [infinispan-dev] Non-blocking state transfer (ISPN-1424)

Dan Berindei
In reply to this post by Mircea Markus
On Tue, Mar 20, 2012 at 2:51 PM, Mircea Markus <[hidden email]> wrote:
> Hi Dan,
>
> Some notes:
>
> Ownership Information:
> - I remember the discussion with Sanne about an algorithm that wouldn't
> require view prepare/rollback, but I think it would be very interesting to see
> it described in detail somewhere as all the points you raised in the
> document are very closely tied to that

Yes, actually I'm trying to describe precisely that. Of course, there
are some complications...

> - "However, it's not that clear what to do when a node leaves while the
> cache is in "steady state"" --> if (numOwners-1) nodes crash a state
> transfer is (most likely) wanted in order not to lose consistency in the
> eventuality another node goes down. By default numOwners is 2, so in the
> first release, can't we just assume that for every leave we'd want to
> immediately issue a state transfer? I think this would cover most of our
> user's needs and simplify the problem considerably.
>

Not necessarily: if the user has a TopologyAwareConsistentHash and has
split his cluster in two "sites" or "racks", he can bring down an
entire site/rack before rebalancing, without the risk of losing any
data.

> Recovery
> - I agree with your point re: recovery. This shouldn't be considered high
> prio in the first release. The recovery info is kept in an independent
> cache, which allows a lot of flexibility: e.g. can point to a shared cache
> store so that recovery caches on other nodes can read that info when needed.
>

I don't agree about the shared cache store, because then every cache
transaction would have to write to that store. I think that would be
cost-prohibitive.

>
> Locking/sync..
> "The state transfer task will acquire the DCL in exclusive mode after
> updating the cache view id and release it after obtaining the data container
> snapshot. This will ensure the state transfer task can't proceed on the old
> owner until all write commands that knew about the old CH have finished
> executing." <-- that means that incoming writes would block for the duration
> of iterating the DataContainer? That shouldn't be too bad, unless a cache
> store is present..
>

Ok, first of all, in the new draft I merged the data container lock and
the cache view lock; I don't think making them separate actually
helped with anything, but it made things harder to explain.

The idea was that incoming writes would only block while state
transfer gets the iterator to the data container, which is very short.
But the state transfer will also block on all the writes to the data
container - to make sure that we won't miss any writes.

The algorithm is like this:

Write command
===
1. acquire read lock
2. check cache view id
3. write entry to data container
4. release read lock

State Transfer
===
1. acquire write lock
2. update CH
3. update cache view id
4. release write lock
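The two step lists above map naturally onto a read-write lock: write commands take the shared side, the state transfer task takes the exclusive side. A minimal self-contained sketch follows — `CacheViewLock`, `executeWrite` and `installView` are invented names for illustration, not Infinispan's actual classes:

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: writers take the shared (read) side, the state
// transfer task takes the exclusive (write) side of one lock.
class CacheViewLock {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private volatile int cacheViewId;

    CacheViewLock(int initialViewId) { this.cacheViewId = initialViewId; }

    // Write command path: holds the shared lock only while touching the
    // data container, and checks the command's view id first.
    void executeWrite(int commandViewId, Runnable writeToDataContainer) {
        lock.readLock().lock();
        try {
            if (commandViewId != cacheViewId) {
                // Stale view: the originator must retry on the new owners.
                throw new IllegalStateException(
                        "StateTransferInProgress: command view " + commandViewId);
            }
            writeToDataContainer.run();
        } finally {
            lock.readLock().unlock();
        }
    }

    // State transfer path: exclusive only long enough to bump the view id
    // (and, on the old owner, grab the data container snapshot/iterator).
    void installView(int newViewId, Runnable snapshotDataContainer) {
        lock.writeLock().lock();
        try {
            cacheViewId = newViewId;
            snapshotDataContainer.run();
        } finally {
            lock.writeLock().unlock();
        }
    }

    int currentViewId() { return cacheViewId; }
}
```

Because the exclusive section is just "bump the id, grab the iterator", in-flight writes only delay state transfer by the duration of a single container write, never by a full lock-acquisition timeout.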

> "CacheViewInterceptor [..] also needs to block in the interval between a
> node noticing a leave and actually receiving the new cache view from the
> coordinator" <-- why can't the local cache start computing the new CH
> independently and not wait for the coordinator..?
>

Because the local cache doesn't listen for join and leave requests,
only the coordinator does, and the cache view is computed based on the
requests that the coordinator has seen (and not on the JGroups cluster
joins and leaves).

Cheers
Dan


Re: [infinispan-dev] Non-blocking state transfer (ISPN-1424)

Dan Berindei
In reply to this post by Bela Ban
On Thu, Mar 22, 2012 at 11:26 AM, Bela Ban <[hidden email]> wrote:

> Thanks Dan,
>
> here are some comments / questions:
>
> "2. Each cache member receives the ownership information from the
> coordinator and starts rejecting all commands started in the old cache view"
>
> How do you know a command was started in the old cache view;  does this
> mean you're shipping a cache view ID with every request ?
>

Yes, the plan is to ship the cache view id with every command - I
mentioned in the "Command Execution During State Transfer" section
that we set the view id in the command, I should make it clear that we
also ship it to remote nodes.
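One way to picture the receiver-side decision once every command carries its originator's cache view id — a sketch with invented names, not the actual interceptor code:

```java
// Hypothetical sketch of the receiver-side triage for an incoming command
// that ships its originator's cache view id.
class ViewIdCheck {
    enum Action { EXECUTE, REJECT_RETRY, BLOCK_UNTIL_READY }

    static Action onCommand(int commandViewId, int installedViewId,
                            boolean chAndLockInfoReady) {
        if (commandViewId < installedViewId)
            return Action.REJECT_RETRY;      // started in an old cache view
        if (commandViewId > installedViewId || !chAndLockInfoReady)
            return Action.BLOCK_UNTIL_READY; // wait for new CH + lock info
        return Action.EXECUTE;
    }
}
```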

>
>
> "2.1. Commands with the new cache view id will be blocked until we have
> installed the new CH and we have received lock information from the
> previous owners"
>
> Doesn't this make this design *blocking* again ? Or do you queue
> requests with the new view-ID, return immediately and apply them when
> the new view-id is installed ? If the latter is the case, what do you
> return ? An OK (how do you know the request will apply OK) ?
>

You're right, it will be blocking. However, the in-flight transaction
data should be much smaller than the entire state, so I'm hoping that
this blocking phase won't be as painful as it is now.
Furthermore, commands waiting for a lock or for a sync RPC won't
block state transfer anymore, so deadlocks between writes and state
transfer (that now end with the write command timing out) should be
impossible.

I am not planning for any queuing at this point - we could only queue
if we had the same order between the writes on all the nodes,
otherwise we would get inconsistencies.
For async commands we will have an implicit queue in the JGroups
thread pool, but we won't have anything for sync commands.

>
>
> "A merge can be coalesced with multiple state transfers, one running in
> each partition. So in the general case a coalesced state transfer
> contains a tree of cache view changes."
>
> Hmm, this can make a state transfer message quite large. Are we trimming
> the modification list ? E.g if we have 10 PUTs on K, 1 removal of K, and
> another 4 PUTs, do we just send the *last* PUT, or do we send a
> modification list of 15 ?
>

Yep, my idea was to keep only a set of keys that have been modified
for each cache view id. The second put to the same key wouldn't modify
the set.
When we transfer the entries to the new owners, we iterate the data
container and only send each key once.

Before writing the key to this set (and the entry to the data
container), the primary owner will synchronously invalidate the key
not only on the L1 requestors, but on all the owners in the previous
cache views (that are no longer owners in the latest view). This will
ensure that the new owner will only receive one copy of each key and
won't have to buffer and sort the state received from all the previous
owners.
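The per-view modified-key set could look roughly like the sketch below. The names are illustrative only; the point is that a second put to the same key is a no-op on the set, so each key is transferred at most once:

```java
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: track only *which* keys changed per cache view id;
// values are read from the data container at transfer time.
class ModifiedKeysTracker {
    private final ConcurrentMap<Integer, Set<Object>> modifiedKeysByViewId =
            new ConcurrentHashMap<>();

    void onWrite(int viewId, Object key) {
        modifiedKeysByViewId
                .computeIfAbsent(viewId, v -> ConcurrentHashMap.newKeySet())
                .add(key); // a second write to the same key is a no-op here
    }

    Set<Object> keysToTransfer(int viewId) {
        return modifiedKeysByViewId.getOrDefault(viewId, Collections.emptySet());
    }
}
```

At transfer time the owner iterates `keysToTransfer(viewId)` against the data container, sending each key's current value exactly once rather than replaying the full modification list.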

>
>
> "Get commands can write to the data container as well when L1 is
> enabled, so we can't block just write commands."
>
> Another downside of the L1 cache being part of the regular cache. IMO it
> would be much better to separate the 2, as I wrote in previous emails
> yesterday.
>

I haven't read those yet, but I'm not sure moving the L1 cache to
another container would eliminate the problem completely.
It's probably not clear from the document, but if we empty the L1 on
cache view changes then we have to ensure that we don't write to L1
anything from an old owner. Otherwise we're left with a value in L1
that none of the new owners knows about and won't invalidate on a put
to that key.

>
> "The new owners will start receiving commands before they have received
> all the state
>     * In order to handle those commands, the new owners will have to
> get the values from the old owners
>     * We know that the owners in the later views have a newer version
> of the entry (if they have it at all). So we need to go back on the
> cache views tree and ask all the nodes on one level at the same time -
> if we don't get any certain anwer we go to the next level and repeat."
>
>
> How does a member C know that it *will* receive any state at all ? E.g.
> if we had key K on A and B, and now B crashed, then A would push a copy
> of K to C.
>
> So when C receives a request R from another member E, but hasn't
> received a state transfer from A yet, how does it know whether to apply
> or queue R ? Does C wait until it gets an END-OF-ST message from the
> coordinator ?
>

Yep, each node will signal to the coordinator that it finished pushing
data and when the coordinator gets the confirmation from all the nodes
it will broadcast the end-of-ST message to everyone.
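A toy sketch of that confirmation protocol — the class and method names are invented, and the real thing would of course be RPC-based rather than in-process:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: the coordinator collects push confirmations and
// broadcasts END-OF-ST once every member has confirmed.
class StateTransferCoordinator {
    private final Set<String> pendingPushers;
    private boolean endOfStBroadcast;

    StateTransferCoordinator(Set<String> members) {
        this.pendingPushers = new HashSet<>(members);
    }

    // Called when a node signals it has finished pushing its state.
    synchronized void confirmPushCompleted(String node) {
        pendingPushers.remove(node);
        if (pendingPushers.isEmpty() && !endOfStBroadcast) {
            endOfStBroadcast = true;
            // here: broadcast the END-OF-ST message to all members
        }
    }

    synchronized boolean stateTransferComplete() { return endOfStBroadcast; }
}
```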

>
> (skipping the rest due to exhaustion by complexity :-))
>
>
> Over the last couple of days, I've exchanged a couple of emails with the
> Cloud-TM guys, and I'm more and more convinced that their total order
> solution is the simpler approach to (1) transactional updates and (2)
> state transfer. They don't have a solution for (2) yet, but I believe
> this can be done as another totally ordered transaction, applied in the
> correct location within the update stream. Or, we could possibly use a
> flush: as we don't need to wait for pending TXs to complete and release
> their locks, this should be quite fast.
>

Sounds intriguing, but I'm not sure how making the entire state
transfer a single transaction would allow us to handle transactions
while that state is being transferred.
Blocking state transfer works with 2PC already. Well, at least as long
as we don't have any merges...

>
> So my 5 cents:
>
> #1 We should focus on the total order approach, and get rid of the 2PC
> and locking business for transactional updates
>

The biggest problem I remember total order having is TM transactions
that have other participants (as opposed to cache-only transactions).
I haven't followed the TO discussion on the mailing list very closely,
does that work now?

Regarding state transfer in particular, remember, non-blocking state
transfer with 2PC sounded very easy as well, before we got into the
details.
The proof-of-concept you had running on the 4.2 branch was much
simpler than what we have now, and even what we have now doesn't cover
everything.

> #2 Really focus on the eventual consistency approach
>

Again, it's very tempting, but I fear much of what makes eventual
consistency tempting is that we don't know enough of its
implementation details yet.
Manik has been investigating eventual consistency for a while, I
wonder what he thinks...


Cheers
Dan


Re: [infinispan-dev] Non-blocking state transfer (ISPN-1424)

Bela Ban
Thanks for the feedback and clarifications Dan !

Comments inline...



>> Over the last couple of days, I've exchanged a couple of emails with the
>> Cloud-TM guys, and I'm more and more convinced that their total order
>> solution is the simpler approach to (1) transactional updates and (2)
>> state transfer. They don't have a solution for (2) yet, but I believe
>> this can be done as another totally ordered transaction, applied in the
>> correct location within the update stream. Or, we could possibly use a
>> flush: as we don't need to wait for pending TXs to complete and release
>> their locks, this should be quite fast.
>>
>
> Sounds intriguing, but I'm not sure how making the entire state
> transfer a single transaction would allow us to handle transactions
> while that state is being transferred.


We're discussing 2 scenarios: one is the use of a flush (but
short-lived, as we don't have any locks in play), and the other is
Pedro's proposal of a start-state TO-cast (shipped with a key set),
followed by a state transfer. Transactions touching any keys in the
key set are queued and applied in order after the state transfer is
done (signalled with another stop-state TO-cast).
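As a rough sketch of Pedro's proposal as described above: after a start-state message carrying the set of keys being transferred, TO-delivered transactions touching any of those keys are queued, and on stop-state they are applied in their original delivery order. All names here are hypothetical, and the single-node simulation obviously elides the actual total-order multicast.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Queue;
import java.util.Set;

class TotalOrderStateTransfer {
    private Set<String> transferringKeys = Collections.emptySet();
    private final Queue<Runnable> queued = new ArrayDeque<>();

    /** start-state TO-cast: remember which keys are being transferred. */
    synchronized void onStartState(Set<String> keys) {
        transferringKeys = keys;
    }

    /** TO-delivered transaction: apply now, or queue if it touches transferring keys. */
    synchronized void deliver(Set<String> txKeys, Runnable apply) {
        if (Collections.disjoint(txKeys, transferringKeys)) {
            apply.run();
        } else {
            queued.add(apply);   // preserved in total-order delivery order
        }
    }

    /** stop-state TO-cast: state transfer is done, drain the queue in order. */
    synchronized void onStopState() {
        transferringKeys = Collections.emptySet();
        while (!queued.isEmpty()) {
            queued.poll().run();
        }
    }
}
```

Since start-state, the transactions, and stop-state are all delivered in the same total order on every node, each node drains its queue at the same logical point in the update stream.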

If you have anything to add, let's discuss it on the other thread.



>> So my 5 cents:
>>
>> #1 We should focus on the total order approach, and get rid of the 2PC
>> and locking business for transactional updates
>>
>
> The biggest problem I remember total order having is TM transactions
> that have other participants (as opposed to cache-only transactions).
> I haven't followed the TO discussion on the mailing list very closely,
> does that work now?


No, I don't think that's addressed by TOM, good point in favor of having
2 (or more) approaches to partial replication and state transfer !

I think a lot of systems would still benefit from TOM though, as they
have no external participants. Session replication, for instance, uses
an Infinispan batch to ship session updates to the session owners, and
there are no other participants (as a matter of fact, this is not even
a JTA transaction anyway).


> Regarding state transfer in particular, remember, non-blocking state
> transfer with 2PC sounded very easy as well, before we got into the
> details.

True :-)



>> #2 Really focus on the eventual consistency approach
>>
>
> Again, it's very tempting, but I fear much of what makes eventual
> consistency tempting is that we don't know enough of its
> implementation details yet.


True again...


> Manik has been investigating eventual consistency for a while, I
> wonder what he thinks...


Yep - Manik, do you have any status update on eventual consistency ?


--
Bela Ban, JGroups lead (http://www.jgroups.org)

Re: [infinispan-dev] Non-blocking state transfer (ISPN-1424)

Paolo Romano
On 3/23/12 10:45 AM, Bela Ban wrote:
>> The biggest problem I remember total order having is TM transactions
>> that have other participants (as opposed to cache-only transactions).
>> I haven't followed the TO discussion on the mailing list very closely,
>> does that work now?
>
> No, I don't think that's addressed by TOM, good point in favor of having
> 2 (or more) approaches to partial replication and state transfer !
>
>
Actually, we already have code for dealing with scenarios in which ISPN
is involved in a distributed transaction with other participants.
(Pedro can point it out; it should already be on GitHub.)

In this case, the solution implies necessarily the usage of 2PC, but we
can disseminate the prepare messages within ISPN using TOM.

Pro:
- deadlock freedom at the ISPN level, which helps make ISPN a
well-behaved (i.e. responsive) participant in a distributed
transaction even in high-contention scenarios.

Con:
- in this case we cannot determine the outcome of a transaction right
away once it is TOM-delivered, as we also need to take into account
the votes of the external participants. Hence the need for an extra phase.
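A hedged sketch of that extra phase: the TOM-delivered prepare is only validated locally and produces ISPN's vote, and the final commit/abort decision waits for a second phase that combines it with the external participants' votes. All names are illustrative, not actual Infinispan or Cloud-TM code.

```java
import java.util.HashMap;
import java.util.Map;

class TomPrepareWithExternalVotes {
    enum Outcome { PENDING, COMMIT, ABORT }

    private final Map<String, Outcome> txOutcome = new HashMap<>();

    /** TOM-delivered prepare: validate in delivery order, but only vote. */
    boolean onTomDeliverPrepare(String txId, boolean localValidationOk) {
        txOutcome.put(txId, Outcome.PENDING);   // cannot decide yet
        return localValidationOk;               // ISPN's vote, reported to the TM
    }

    /** Extra phase: the TM combines ISPN's vote with the external votes. */
    Outcome onDecision(String txId, boolean ispnVote, boolean externalVotesOk) {
        Outcome o = (ispnVote && externalVotesOk) ? Outcome.COMMIT : Outcome.ABORT;
        txOutcome.put(txId, o);
        return o;
    }
}
```

This is exactly the con Paolo describes: TOM-delivery alone no longer decides the outcome, so the deadlock-freedom benefit is kept but one round of voting is added.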
