[infinispan-dev] transactions :: incremental locking


[infinispan-dev] transactions :: incremental locking

Mircea Markus
Hi,

This is the document describing the incremental optimistic locking Dan and myself discussed last week: http://community.jboss.org/wiki/IncrementalOptimisticLocking
Unless I'm missing something, this together with lock reordering[1] covers 100% of the possible deadlock situations in the optimistic locking scheme[2] - which is pretty awesome!

Cheers,
Mircea

[1] http://community.jboss.org/wiki/LockReorderingForAvoidingDeadlocks
[2] http://community.jboss.org/wiki/OptimisticLockingInInfinispan




Re: [infinispan-dev] transactions :: incremental locking

Manik Surtani
Good stuff, shame about the RPC count. ;)

On 5 Jul 2011, at 14:25, Mircea Markus wrote:

Hi,

This is the document describing the incremental optimistic locking Dan and myself discussed last week: http://community.jboss.org/wiki/IncrementalOptimisticLocking
Unless I'm missing something, this together with lock reordering[1] cover 100% of the possible deadlock situations in the optimistic locking scheme[2] - which is pretty awesome!

Cheers,
Mircea   



Re: [infinispan-dev] transactions :: incremental locking

Mircea Markus

On 5 Jul 2011, at 15:39, Manik Surtani wrote:

> Good stuff, shame about the RPC count .  ;)
Yeah, it's still very valid when there are deadlocks; I guess figures will tell us more precisely what the gain is.



Re: [infinispan-dev] transactions :: incremental locking

Manik Surtani

On 5 Jul 2011, at 15:47, Mircea Markus wrote:

>
> On 5 Jul 2011, at 15:39, Manik Surtani wrote:
>
>> Good stuff, shame about the RPC count .  ;)
> yeah. Still very valid when there are deadlocks, guess figures will tell us more precisely what the gain is

Yup.

--
Manik Surtani
[hidden email]
twitter.com/maniksurtani

Lead, Infinispan
http://www.infinispan.org





Re: [infinispan-dev] transactions :: incremental locking

Paolo Romano
In reply to this post by Mircea Markus
On 7/5/11 3:47 PM, Mircea Markus wrote:
> On 5 Jul 2011, at 15:39, Manik Surtani wrote:
>
>> Good stuff, shame about the RPC count .  ;)
> yeah. Still very valid when there are deadlocks, guess figures will tell us more precisely what the gain is
>
I agree with Manik's observation.

When benchmarking it, it would be interesting to compare it also with
the total order multicast approach that Pedro has been developing.

When do you plan to have these locking optimizations available?

Re: [infinispan-dev] transactions :: incremental locking

Mircea Markus

On 6 Jul 2011, at 09:00, Paolo Romano wrote:

> On 7/5/11 3:47 PM, Mircea Markus wrote:
>> On 5 Jul 2011, at 15:39, Manik Surtani wrote:
>>
>>> Good stuff, shame about the RPC count .  ;)
>> yeah. Still very valid when there are deadlocks, guess figures will tell us more precisely what the gain is
>>
> I agree with Manik's observation.
>
> When benchmarking it, it would be interesting to compare it also with
> the total order multicast approach that Pedro has been developing.
+1
>
> When do you plan to have these locking optimizations available?
We hope to have it in for 5.1, but it has a lower priority compared with the other lock-related changes.


Re: [infinispan-dev] transactions :: incremental locking

Sanne Grinovero-3
Very nice design, congratulations.
The second picture is missing a final "4" step returning the ACK to N1, right?

About the third proposal, I was thinking about a possible improvement,
not sure if it works:
consider the list defining the keys being locked as already ordered. As
far as I understood about the ordering on the hash wheel, you can now
split this list into the groups which need to be sent to each node:
k1, k2, k3, k4, k5 -> N2{k1, k2} N3{k3, k4} N4{k5}
i.e. the sequence does not need to be reordered, and we can make use of that.

Now instead of sending the full list in multicast to each involved
node, we send the list only to the first node, and wait for an ACK
from the last node involved in the sequence.
This does not reduce the number of async RPCs being executed, but it
sends fewer packets on the wire, and has the advantage of never needing
to go back to recover from a failed node or from a potential
deadlock, as all keys are always taken in proper order, not
"optimistically".

[reusing the same picture of "Hybrid incremental approach"]
assuming this list of keys, owned by these nodes: k1, k2, k3, k4, k5
-> N2{k1, k2} N3{k3, k4} N4{k5}
Now if N2 receives the list, it acquires k1, k2, in order, and sends
the full list over to N3 with a message like

N2{k1 *taken, k2 *taken} N3{k3, k4} N4{k5}

and it will also wait for an async ACK. If it receives an ACK, N2's
work is done.
If it doesn't receive one in a reasonable time, it will assume N3 is
dead: now the list of nodes changes, but the order in which these
elements need to be acquired did not. In fact, some of these keys might
be moved to N2 itself, some to N4, since the rehashing will move them to
neighbours: the keys N3 owned can be taken right away (again without
violating the order), and the rest of the list is again forwarded to the
next alive node.
The last node will return an ACK to the originating node to inform it
that the prepare phase was successful, and once it has done so it will
send an ACK back to its previous node to free its resources; that node
will propagate the ACK back, etc.
No deadlocks, no timeouts, no undo operations. What am I missing?
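To make this a bit more concrete, here is a rough sketch in plain Java of
the per-node step I have in mind; the names are made up (nothing to do
with the actual Infinispan classes) and the acquire/forward/ACK calls are
only placeholders:

import java.util.*;

// Rough sketch only, with made-up names: the ordered key list is split
// into per-owner groups and the whole request is forwarded from one
// owner to the next.
public class ChainedLockSketch {

    static final class LockChain {
        // owners in hash-wheel order -> keys each owner must lock (already ordered)
        final LinkedHashMap<String, List<String>> keysPerOwner = new LinkedHashMap<>();
        // keys already locked by the owners the request has passed through
        final Set<String> taken = new LinkedHashSet<>();
    }

    // Each owner locks its own keys in order, marks them as taken, then
    // forwards the chain to the next owner and waits for an async ACK;
    // the last owner ACKs the originator.
    static void processAt(LockChain chain, String localNode) {
        for (String key : chain.keysPerOwner.get(localNode)) {
            // acquireLocalLock(key);  // purely local, fixed order -> no deadlock
            chain.taken.add(key);
        }
        List<String> owners = new ArrayList<>(chain.keysPerOwner.keySet());
        int next = owners.indexOf(localNode) + 1;
        if (next < owners.size()) {
            // forwardTo(owners.get(next), chain);  // single unicast, then wait for ACK
        } else {
            // ackOriginator();  // last owner: prepare phase succeeded
        }
    }

    public static void main(String[] args) {
        LockChain chain = new LockChain();
        chain.keysPerOwner.put("N2", Arrays.asList("k1", "k2"));
        chain.keysPerOwner.put("N3", Arrays.asList("k3", "k4"));
        chain.keysPerOwner.put("N4", Arrays.asList("k5"));
        processAt(chain, "N2");  // N2 locks k1, k2 and would forward on to N3
        System.out.println("taken after N2: " + chain.taken);  // prints [k1, k2]
    }
}

The main() just replays the split k1..k5 -> N2{k1, k2} N3{k3, k4} N4{k5}
from the example above.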

Cheers,
Sanne



Re: [infinispan-dev] transactions :: incremental locking

Galder Zamarreno
In reply to this post by Mircea Markus
Mircea, thx for writing this up. Some comments:

It's not clear to me, at least given the examples in the wiki, how the problem of Tx1 and Tx2 with a and b explained in the introduction is fixed by the solution.

The problem exposed:
"- with some right timing, during prepare time it is possible for these two transactions to deadlock:
        • Tx1 lock acquired on "a" @ N3
        • Tx2 lock acquired on "b" @ N4
        • Tx1 cannot progress acquiring  lock on "b" @ N4, that lock is acquired by Tx2
        • Tx2 cannot acquire lock on "a" @ N3 as that lock is held by Tx1
        • Tx1 and Tx2 are waiting for each other => deadlock"

The 'alleged' solution?

"In the example above, considering the view = {N1, N2, N3, N4} and incremental lock acquisition:
        • Tx1 acquires lock on a@ N3
        • Tx2 tries to acquire lock on a@ N3. It cannot acquire and waits for Tx1 to release it
        • Tx1 acquires locks on b@ N4, completes and releases locks
        • Tx2 acquires lock on a@ N3 and completes as well"

Is the latter block supposed to show how the example in the introduction is fixed? If it is, it's not really doing so, because it's not explaining what happens to Tx2, which is trying to acquire locks on 'b'...

The diagrams and the associated explanations are confusing too. Example:

"N1 first locks a@N3 (1)" -> in the diagram, (1) represents a call from N1 to N2

"b@N4 (RPC 2)" -> That corresponds to to 5/6 arrows?


--
Galder Zamarreño
Sr. Software Engineer
Infinispan, JBoss Cache



Re: [infinispan-dev] transactions :: incremental locking

Mircea Markus
In reply to this post by Sanne Grinovero-3

On 6 Jul 2011, at 16:41, Sanne Grinovero wrote:

> Very nice design, congratulations.
> Second picture is missing a final "4" step returning the ACK to N1 right?
yes, this happens once 1a, 1b and 1c have returned to N1.

>
> Abot the third proposal I was thinking about a possible improvement,
> not sure if it works:
> consider the list defining the keys being locked, already ordered. As
> far as I understood about the ordering on the hashweel, you can now
> split this list to have the groups which need to be sent to each node:
> k1, k2, k3, k4, k5 -> N2{k1, k2} N3{k3, k4} N4{k5}
> i.e. the sequence does not need to be reordered and we can make use of that.
>
> Now instead of sending the full list in multicast to each involved
> node, we send the list only to the first node, and wait for an ACK
> from the last node involved in the sequence.
> This does not reduce the number of async rpcs being executed, but
> sends less packets on the wire, and has the advantage of never needing
> to go back to recover from a failed node or recover from a potential
> deadlock, as all keys are always taken in proper order, not
> "optimistically".
Your suggestion is a variation of the 2nd approach ("Async incremental locking"). If we approach it this way, the prepare message, which is potentially big, is sent in sequence from one node to the other.
With the current approach the prepare is sent in parallel, and the lock request forwarded N2->N3->N4 has a smaller payload.
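Roughly, the difference in payload looks like this (hypothetical message
shapes only, not the real Infinispan commands):

import java.util.*;

// sent once, in parallel (multicast) to every owner: carries the full write set
final class PrepareMsg {
    final String txId;
    final Map<String, byte[]> modifications;  // key -> new value, potentially large
    PrepareMsg(String txId, Map<String, byte[]> modifications) {
        this.txId = txId;
        this.modifications = modifications;
    }
}

// forwarded N2 -> N3 -> N4: only the keys still to be locked, no values
final class LockChainMsg {
    final String txId;
    final List<String> remainingKeys;  // much smaller than the prepare payload
    LockChainMsg(String txId, List<String> remainingKeys) {
        this.txId = txId;
        this.remainingKeys = remainingKeys;
    }
}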

>
> [reusing the same picture of "Hybrid incremental approach"]
> assuming this list of keys, owned by these nodes: k1, k2, k3, k4, k5
> -> N2{k1, k2} N3{k3, k4} N4{k5}
> Now if N2 receives the list, it acquires k1, k2, in order, and sends
> the full list over to N3 with a message like
>
> N2{k1 *taken, k2 *taken} N3{k3, k4} N4{k5}
>
> and it will also wait for an async ACK. If it receives an ACK, N2's
> work is done.
> If it doesn't receive one in a reasonable time, it will assume
This failure-detection functionality is in JGroups; I wouldn't want to have it duplicated here as well. But indeed N2 can be notified by JGroups about this.

> N3 is
> dead: now the list of nodes changes, but the order in which these
> elements need to be acquired did not. In fact, some of these might be
> moved to N2 himself, some to N4 since the rehashing will move them to
> neighbours: N3 owned keys can be taken right away (again not violating
> the order), the rest of the list is again forwarded to the next alive
> node.
> The last node will return an ACK to the originating node to inform
> that the prepare phase was successful, and when he did it will send an
> ACK back to it's previous node to free it's resources, this one will
> propagate back the ACK, etc.
> No deadlocks, no timeouts, no undo operations. what am I missing?

>
> Cheers,
> Sanne



Re: [infinispan-dev] transactions :: incremental locking

Mircea Markus
In reply to this post by Galder Zamarreno

On 8 Jul 2011, at 10:34, Galder Zamarreño wrote:

> Mircea, thx for writing this up. Some comments:
>
> It's not clear to me, at least given the examples in the wiki, how problem of Tx1 and Tx2 with a and b explained in the introduction is fixed by the solution.
>
> The problem exposed:
> "- with some right timing, during prepare time it is possible for these two transactions to deadlock:
> • Tx1 lock acquired on "a" @ N3
> • Tx2 lock acquired on "b" @ N4
> • Tx1 cannot progress acquiring  lock on "b" @ N4, that lock is acquired by Tx2
> • Tx2 cannot acquire lock on "a" @ N3 as that lock is held by Tx1
> • Tx1 and Tx2 are waiting for each other => deadlock"
>
> The 'alleged'' solution?
>
> "In the example above, considering the view = {N1, N2, N3, N4} and incremental lock acquisition:
> • Tx1 acquires lock on a@ N3
> • Tx2 tries to acquire lock on a@ N3. It cannot acquire and waits for Tx1 to release it
> • Tx1 acquires locks on b@ N4, completes and releases locks
> • Tx2 acquires lock on a@ N3 and completes as well"
>
> Is the latter block supposed to show how the example in the introduction is fixed? If it is, it's not really doing it cos it's not explaining what happens to Tx2 that is trying to acquire locks on 'b'...
Tx2 acquires the lock on b@N4 after locking a@N3; I've updated the document.
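To spell out why the fixed order removes the deadlock, a minimal
stand-alone Java illustration (dummy keys and plain JDK locks, not
Infinispan internals):

import java.util.*;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Minimal illustration: if every tx acquires its locks in one agreed
// global order, the circular wait from the introduction cannot form.
public class OrderedLockingSketch {

    static final Map<String, ReentrantLock> LOCKS = new ConcurrentHashMap<>();

    static ReentrantLock lockFor(String key) {
        return LOCKS.computeIfAbsent(key, k -> new ReentrantLock());
    }

    // Both Tx1 and Tx2 want {a@N3, b@N4}; sorting forces a-then-b for both,
    // so whichever locks 'a' first completes and the other simply waits.
    static void runTx(String txName, Collection<String> keys) {
        List<String> ordered = new ArrayList<>(keys);
        Collections.sort(ordered);                   // the agreed global order
        for (String k : ordered) lockFor(k).lock();  // incremental acquisition
        try {
            System.out.println(txName + " locked " + ordered + ", committing");
        } finally {
            for (String k : ordered) lockFor(k).unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread tx1 = new Thread(() -> runTx("Tx1", Arrays.asList("a@N3", "b@N4")));
        Thread tx2 = new Thread(() -> runTx("Tx2", Arrays.asList("b@N4", "a@N3")));
        tx1.start(); tx2.start();
        tx1.join(); tx2.join();  // always terminates; no circular wait is possible
    }
}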
>
> The diagrams and the associated explanations are confusing too. Example:
>
> "N1 first locks a@N3 (1)" -> in the diagram, (1) represents a call from N1 to N2
>
> "b@N4 (RPC 2)" -> That corresponds to to 5/6 arrows?
I've also updated this, thanks!




Re: [infinispan-dev] transactions :: incremental locking

Sanne Grinovero-3
In reply to this post by Mircea Markus
2011/7/11 Mircea Markus <[hidden email]>:

>
> On 6 Jul 2011, at 16:41, Sanne Grinovero wrote:
>
>> Very nice design, congratulations.
>> Second picture is missing a final "4" step returning the ACK to N1 right?
> yes, this happens when 1a, 1b and 1c returned to N1.
>>
>> Abot the third proposal I was thinking about a possible improvement,
>> not sure if it works:
>> consider the list defining the keys being locked, already ordered. As
>> far as I understood about the ordering on the hashweel, you can now
>> split this list to have the groups which need to be sent to each node:
>> k1, k2, k3, k4, k5 -> N2{k1, k2} N3{k3, k4} N4{k5}
>> i.e. the sequence does not need to be reordered and we can make use of that.
>>
>> Now instead of sending the full list in multicast to each involved
>> node, we send the list only to the first node, and wait for an ACK
>> from the last node involved in the sequence.
>> This does not reduce the number of async rpcs being executed, but
>> sends less packets on the wire, and has the advantage of never needing
>> to go back to recover from a failed node or recover from a potential
>> deadlock, as all keys are always taken in proper order, not
>> "optimistically".
> Your suggestion is a variation of 2nd approach("Async incremental locking"). If we approach it this way, the prepare message, which is potentially big, is sent in sequence from one node to the other.
> With current approach the prepare is send in parallel, and the lock request sent from N2->N3->N4 has a smaller payload.

Yes, it's a variation, and I wouldn't have thought of it without
reading your nice proposal. Still I'm not understanding why the
original suggestion does anything in parallel: since you suggest
the payload must be kept low, you're sending multiple different keys
over the wire from N1, so you can't use multicast, and assuming you
have a single wire these are sent in sequence too. As for the total
number of parallel communications, the original approach has the same
number of serialized "node hops" in the best scenario, but is worse in
case something needs to be rolled back; imho it gets complicated if you
have to roll back a longer path when more nodes are involved!
So my intention was to a) make it simpler and b) still optimise for
the good case, while having less chattiness in the worst case.

I agree my last proposal sends keys over to nodes which are not
interested in them, but since it's not about the values I wouldn't
expect this to be a big change in the packet size, and I'm wondering if
this "token ring" approach would not be simpler: there are fewer
packets going around at the same time and order is guaranteed, so I
wouldn't be sure of this being less efficient on the network without
some real-world test.

Cheers,
Sanne



Re: [infinispan-dev] transactions :: incremental locking

Mircea Markus

On 11 Jul 2011, at 12:27, Sanne Grinovero wrote:

> 2011/7/11 Mircea Markus <[hidden email]>:
>>
>> On 6 Jul 2011, at 16:41, Sanne Grinovero wrote:
>>
>>> Very nice design, congratulations.
>>> Second picture is missing a final "4" step returning the ACK to N1 right?
>> yes, this happens when 1a, 1b and 1c returned to N1.
>>>
>>> Abot the third proposal I was thinking about a possible improvement,
>>> not sure if it works:
>>> consider the list defining the keys being locked, already ordered. As
>>> far as I understood about the ordering on the hashweel, you can now
>>> split this list to have the groups which need to be sent to each node:
>>> k1, k2, k3, k4, k5 -> N2{k1, k2} N3{k3, k4} N4{k5}
>>> i.e. the sequence does not need to be reordered and we can make use of that.
>>>
>>> Now instead of sending the full list in multicast to each involved
>>> node, we send the list only to the first node, and wait for an ACK
>>> from the last node involved in the sequence.
>>> This does not reduce the number of async rpcs being executed, but
>>> sends less packets on the wire, and has the advantage of never needing
>>> to go back to recover from a failed node or recover from a potential
>>> deadlock, as all keys are always taken in proper order, not
>>> "optimistically".
>> Your suggestion is a variation of 2nd approach("Async incremental locking"). If we approach it this way, the prepare message, which is potentially big, is sent in sequence from one node to the other.
>> With current approach the prepare is send in parallel, and the lock request sent from N2->N3->N4 has a smaller payload.
>
> Yes it's a variation, and I wouldn't have thought of it without
> reading your nice proposal. Still I'm not understanding why the
> original suggestion is doing anything in parallel, since you suggest
> the payload must be kept low, you're sending multiple different keys
> over the wire from N1, so you can't use multicast and assuming you
> have a single wire these are sent in sequence too.
ATM the prepare is sent as a multicast, each node receiving the full key set touched by the transaction.
> About total number
> of parallel communications, the original approach has the same number
> of serialized "node hops" for the best scenario,
I'm a bit lost on what the "original approach" is :-) Is it the "Async incremental locking" or the "hybrid" one?
For the "hybrid", the number of serialized "node hops" in the best scenario is 1.
> but worse in case
> something needs to be rolled back, imho it gets complicated if you
> have to rollback a longer path when more nodes are involved!
yes, but the rollback should be seen as an exception in this situation, and the optimization is for the most common scenario, which is that lock acquisition succeeds.
> so my intention was to a) make it simpler b) while still optimising
> for the good case, having less chattiness in the worst case.
The good case, at least compared with the hybrid approach, is less optimal: hybrid has 1 serial hop, this one has numNodesInvolvedInTx hops.

>
> I agree my last proposal sends keys over to nodes which are not
> interesting to them, still since it's not about the values I wouldn't
> expect this to be a big change in the packet size, and am wondering if
> this "token ring" approach would not be simpler: there are less
> packets going around at the same time, order is guaranteed, so I
> wouldn't be sure of this being less efficient on the network without
> some real world test.
>
> Cheers,
> Sanne
>



Re: [infinispan-dev] transactions :: incremental locking

Sanne Grinovero-3
This is getting a bit confusing; sorry, I've tried to be brief but it
seems I didn't improve the message.
I'll add some more notes below as they belong to the context, but
let's talk about this on IRC.

2011/7/11 Mircea Markus <[hidden email]>:

> On 11 Jul 2011, at 12:27, Sanne Grinovero wrote:
>> 2011/7/11 Mircea Markus <[hidden email]>:
>>> On 6 Jul 2011, at 16:41, Sanne Grinovero wrote:
>>>
>>>> Very nice design, congratulations.
>>>> Second picture is missing a final "4" step returning the ACK to N1 right?
>>> yes, this happens when 1a, 1b and 1c returned to N1.
>>>>
>>>> Abot the third proposal I was thinking about a possible improvement,
>>>> not sure if it works:
>>>> consider the list defining the keys being locked, already ordered. As
>>>> far as I understood about the ordering on the hashweel, you can now
>>>> split this list to have the groups which need to be sent to each node:
>>>> k1, k2, k3, k4, k5 -> N2{k1, k2} N3{k3, k4} N4{k5}
>>>> i.e. the sequence does not need to be reordered and we can make use of that.
>>>>
>>>> Now instead of sending the full list in multicast to each involved
>>>> node, we send the list only to the first node, and wait for an ACK
>>>> from the last node involved in the sequence.
>>>> This does not reduce the number of async rpcs being executed, but
>>>> sends less packets on the wire, and has the advantage of never needing
>>>> to go back to recover from a failed node or recover from a potential
>>>> deadlock, as all keys are always taken in proper order, not
>>>> "optimistically".
>>> Your suggestion is a variation of 2nd approach("Async incremental locking"). If we approach it this way, the prepare message, which is potentially big, is sent in sequence from one node to the other.
>>> With current approach the prepare is send in parallel, and the lock request sent from N2->N3->N4 has a smaller payload.
>>
>> Yes it's a variation, and I wouldn't have thought of it without
>> reading your nice proposal. Still I'm not understanding why the
>> original suggestion is doing anything in parallel, since you suggest
>> the payload must be kept low, you're sending multiple different keys
>> over the wire from N1, so you can't use multicast and assuming you
>> have a single wire these are sent in sequence too.
> ATM the prepare is sent as a multicast, each node receiving all the key-set touched by the transaction.
>> About total number
>> of parallel communications, the original approach has the same number
>> of serialized "node hops" for the best scenario,

> I'm a bit lost on what is the "original approach" :-) Is the "Async incremental locking" or the "hybrid one"?
> For the "hybrid", the number of serialized "node hops" in the best scenario is 1.
I'm sorry, I'm referring to the "Hybrid incremental approach".
How can you have a value of 1 if the nodes have to confirm in
sequence? As the picture shows, there are messages labelled in sequence
1, 2, 3, 4, and it should need a final ACK (5) too.
The small change I'm suggesting doesn't improve this that much, it
only avoids the need for rolling back lock acquisitions: since you need
messages 1-5 in series anyway, I'm suggesting we take advantage of that,
admittedly using a larger payload.
Thinking about it now, it also means locks are held for a shorter
time: in the original approach all locks are acquired during the initial
multicast in phase 1, while with the proposal I've sent some locks are
acquired a bit later in terms of network messages; no doubt the change
is small, but this might make it possible for some transactions to make
progress quicker.

>> but worse in case
>> something needs to be rolled back, imho it gets complicated if you
>> have to rollback a longer path when more nodes are involved!
> yes, but the rollback should be seen as an exception in this situation, and the optimization is for the most common scenario which is lock acquisition succeeds.
>> so my intention was to a) make it simpler b) while still optimising
>> for the good case, having less chattiness in the worst case.
> The good case, at least compared with the hybrid approach, is less optimal: hybrid has serial 1 hop, this one has numNodesInvolvedInTx hops.

I'm failing to understand how you can have a single hop in series:
you must ACK the lock acquisition in order to achieve your goal of
deadlock prevention. There must be a misunderstanding, as I'm reading
it the opposite way; actually I think your explanation and solution on
the wiki were brilliant, but we're getting entangled on the mailing list.
Did you try expanding your scheme to more nodes? What should happen if
the first node fails in a series of a dozen? It doesn't scale with the
number of involved servers, and you're keeping some locks for the time
of multiple network hops, while these locks should actually be free
and are blocking other transactions. You might well end up in a
spinning situation, as the delays are in terms of multiple network
hops: you blocked another TX which needs to restart from scratch in
the locking fight, and you abort yourself to retry as well. That's a lot
of wasted locking, which seems unnecessary as we can acquire the locks in
order together with the ACK messages.

I really think the "token ring" approach is an improvement on "Hybrid
incremental" because you can easily prove that there's no way to get
fewer hops given your sequential+ACK requirement, and since we
have this nice order of the keys, it's pretty nice that we can take
advantage of that by sending the token around in the correct sequence.

Cheers,
Sanne

Re: [infinispan-dev] transactions :: incremental locking

Mircea Markus

On 11 Jul 2011, at 14:03, Sanne Grinovero wrote:

> This is getting a bit confusing; sorry, I've tried to be brief but it
> seems I didn't improve the message.
> I'll add some more notes below as they belong to the context, but
> let's talk about this on IRC.
better.

>
>> I'm a bit lost on what is the "original approach" :-) Is the "Async incremental locking" or the "hybrid one"?
>> For the "hybrid", the number of serialized "node hops" in the best scenario is 1.
> I'm sorry, I'm referring to "Hybrid incremental approach".
> How can you have a value of 1 if the nodes have to confirm in
> sequence? As the picture shows, there are messages labeled in sequence
> 1,2,3,4, and it should need a final ACK (5) too.
> The small change I'm suggesting doesn't improve this that much, it's
> only avoiding need for rolling back lock acquisitions: since you need
> 1-5 messages in series, I'm suggesting we take advantage of that,
> admittely using a larger payload.
> Thinking about it now, it's also making sure that locks are kept for a
> smaller time: if all locks are acquired during the initial multicast
> from message in phase 1, while with the additional proposal I've sent
> some locks are acquired a bit later in terms of network messages; no
> doubt the change is small, but this might make it possible from some
> transactions to make progress quicker.
>
>>> but worse in case
>>> something needs to be rolled back, imho it gets complicated if you
>>> have to rollback a longer path when more nodes are involved!
>> yes, but the rollback should be seen as an exception in this situation, and the optimization is for the most common scenario which is lock acquisition succeeds.
>>> so my intention was to a) make it simpler b) while still optimising
>>> for the good case, having less chattiness in the worst case.
>> The good case, at least compared with the hybrid approach, is less optimal: hybrid has serial 1 hop, this one has numNodesInvolvedInTx hops.
>
> I'm failing to understand how you can have a single 1 hop in series,
> you must ack the lock acquisition in order to achieve your goal of
> deadlock prevention;
yes, it's a sync RPC, i.e. 2 network hops.
> there must be a misunderstanding, as I'm
> experiencing the opposite interpretation; actually I think your
> explanation and solution on the wiki was brilliant but we're getting
> entangled on the mailing list.
> Did you try expanding your schema on more nodes, what should happen if
> the first node fails in a series of a dozen?
then it's not the "good" case, and the hybrid approach should be used when there is a small potential for deadlocks. If there is a high chance for deadlocks, then the async incremental locking should be used.
> It doesn't scale with the
> number of involved servers, and you're keeping some locks for the time
> of multiple network hops, while these locks should actually be free
> and are blocking other transactions.
Agreed. But again, the hybrid approach sees the deadlock as an exception rather than a normal use case.
> You might well end up in a
> spinning situation as the delays are in terms of multiple network
> hops: you blocked another TX which needs to restart from scratch in
> the locking fight, and abort yourself to retry as well: that's a lot
> of wasted locking, which seem unnecessary as we can acquire them in
> order together with the ACK messages.
yes, and that's the way the "async incremental locking" works. Async incremental locking is more of a pessimistic approach, as it expects a deadlock and it is optimized for that. OTOH the hybrid approach hopes there won't be any deadlocks...
>
> I really think the "token ring" approach is an improvement on "Hybrid
> incremental" because you can easily proof that there's no way to get
> less hops considering you sequential+ACK requirement,
I don't think they are comparable that way, as they were meant to serve different use cases.
The token ring approach you suggest is a variation of the "async incremental locking", and we also discussed that one. We preferred the current "async incremental locking" approach though, as it sends the prepare command in parallel only once, and not in sequence.
I do think we're talking about very similar things, and as you suggested, let's follow this up on the forums.

> and since we
> have this nice order of the keys, it's pretty nice that we can take
> advantage of that by sending the token around in the correct sequence.
>
> Cheers,
> Sanne