[infinispan-dev] Primary-Backup replication scheme in Infinispan

[infinispan-dev] Primary-Backup replication scheme in Infinispan

Sebastiano Peluso
Hi all,

first let us introduce ourselves, given that this is the first time we write to this mailing list.

Our names are Sebastiano Peluso and Diego Didona, and we are working at INESC-ID Lisbon in the context of the Cloud-TM project. Our work focuses on self-adaptive replication mechanisms (please refer to the previous message by Paolo on this mailing list, and to the www.cloudtm.eu website, for additional details on this project).

To date we have been working on a relatively simple primary-backup (PB) replication mechanism, which we integrated within Infinispan 4.2 and 5.0. In this kind of scheme only one node (called the primary) is allowed to process update transactions, whereas the remaining nodes only process read-only transactions. This allows write transactions to be handled very efficiently, as the primary incurs neither the cost of remote coordination schemes nor distributed deadlocks, which can hamper performance at high contention with two-phase commit schemes. With PB, in fact, the primary can serialize transactions locally and simply propagate the updates to the backups, both for fault tolerance and, of course, to let the backups process read-only transactions on fresh data. On the other hand, the primary is clearly prone to become the bottleneck, especially in large clusters and under write-intensive workloads.

Thus, this scheme does not really represent a replacement for the default 2PC protocol, but rather an alternative approach that is particularly attractive (as we illustrate below with some Radargun-based performance results) in small-scale clusters, or, in "elastic" cloud scenarios, during periods in which the workload is either read-dominated or not very intense. Since the scheme is particularly efficient in these scenarios, adopting it would allow minimizing the number of resources acquired from the cloud provider during such periods, with direct benefits in terms of cost reduction. In Cloud-TM, in fact, we aim at designing autonomic mechanisms that dynamically switch among multiple replication mechanisms depending on the current workload characteristics.

Before discussing the results of our preliminary benchmarking study, we would like to briefly overview how we integrated this replication mechanism within Infinispan. Any comment/feedback is of course highly appreciated. First of all we have defined a new command, PassiveReplicationCommand, which is a subclass of PrepareCommand. We had to define a new command because we had to design customized "visiting" methods for the interceptors. Note that our protocol only affects the commit phase of a transaction. Specifically, during the prepare phase, in the prepare method of the TransactionXaAdapter class, if Primary-Backup mode is enabled, a PassiveReplicationCommand is built by the CommandsFactory and passed to the invoke method of the invoker (both this hook and the visit methods are sketched in code right after the interceptor list below). The PassiveReplicationCommand is then visited by all the interceptors in the chain, by means of their visitPassiveReplicationCommand methods. In more detail, the non-trivial interceptors perform the following operations:

-TxInterceptor: as in the 2PC protocol, if the context did not originate locally, the chain of interceptors is invoked for each modification stored in the PassiveReplicationCommand.
-LockingInterceptor: first the next interceptor is called, then cleanupLocks is invoked with the second parameter set to true (commit the operations). This is always safe: on the primary it happens only after the acks from all the backups have been received (see the ReplicationInterceptor below); on a backup there are no concurrent conflicting writes, since these are already serialized by the locking scheme performed at the primary.
-ReplicationInterceptor: first it invokes the next interceptor; then, if the predicate shouldInvokeRemoteTxCommand(ctx) is true, it calls rpcManager.broadcastRpcCommand(command, true, false), which replicates the modifications synchronously (waiting for an explicit ack from all the backups).
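
To make this more concrete, here is a simplified sketch of the two visit methods just described. Only the calls named above (invokeNextInterceptor, cleanupLocks, shouldInvokeRemoteTxCommand, rpcManager.broadcastRpcCommand) are taken from the actual code path; the surrounding class skeletons and exact signatures are illustrative and may differ from our patch.

// In ReplicationInterceptor (sketch):
public Object visitPassiveReplicationCommand(TxInvocationContext ctx,
                                             PassiveReplicationCommand command) throws Throwable {
   // apply the modifications locally first
   Object retVal = invokeNextInterceptor(ctx, command);
   // on the primary, push the updates to all backups and wait for their acks
   // (sync = true, usePriorityQueue = false)
   if (shouldInvokeRemoteTxCommand(ctx)) {
      rpcManager.broadcastRpcCommand(command, true, false);
   }
   return retVal;
}

// In LockingInterceptor (sketch):
public Object visitPassiveReplicationCommand(TxInvocationContext ctx,
                                             PassiveReplicationCommand command) throws Throwable {
   try {
      // let the rest of the chain run first (it includes the ReplicationInterceptor)
      return invokeNextInterceptor(ctx, command);
   } finally {
      // then release the locks, committing the entries (second parameter = true);
      // safe on the primary because the backups' acks have already been received,
      // and on a backup because writes are serialized by the primary's locking
      cleanupLocks(ctx, true);
   }
}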

As for the commit phase: given that in Primary-Backup the prepare phase also acts as the commit, the commit method on a TransactionXaAdapter object simply returns in this case.
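
The corresponding hook in TransactionXaAdapter boils down to something like the following (again a simplified sketch: the isPassiveReplicationEnabled() flag, the buildPassiveReplicationCommand() factory method and the *WithTwoPhaseCommit() placeholders are illustrative names, not necessarily those in our patch):

// Sketch of the prepare/commit hooks in TransactionXaAdapter (simplified).
public int prepare(Xid xid) throws XAException {
   if (configuration.isPassiveReplicationEnabled()) {
      // one-phase "prepare that also commits": build the command and run it
      // through the interceptor chain (see the visit methods above)
      PassiveReplicationCommand command = commandsFactory.buildPassiveReplicationCommand(
         globalTransaction, transactionModifications);
      invoker.invoke(invocationContext, command);
      return XAResource.XA_OK;
   }
   return prepareWithTwoPhaseCommit(xid);      // unchanged default 2PC path
}

public void commit(Xid xid, boolean isOnePhase) throws XAException {
   if (configuration.isPassiveReplicationEnabled()) {
      return;   // the prepare phase has already applied and replicated the updates
   }
   commitWithTwoPhaseCommit(xid, isOnePhase);  // unchanged default 2PC path
}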

On the resulting extended Infinispan version, a subset of the unit/functional tests was executed and passed successfully:
- commands.CommandIdUniquenessTest
- replication.SyncReplicatedAPITest
- replication.SyncReplImplicitLockingTest
- replication.SyncReplLockingTest
- replication.SyncReplTest
- replication.SyncCacheListenerTest
- replication.ReplicationExceptionTest


We have tested this solution using a customized version of Radargun. Our customizations were first of all aimed at having each thread access data within transactions, instead of executing single put/get operations. In addition, every Stresser thread now accesses all of the keys stored by Infinispan with uniform probability, thus generating conflicts with a probability proportional to the number of concurrently active threads and inversely proportional to the total number of keys maintained.

As already hinted, our results highlight that, depending on the current workload and number of nodes in the system, it is possible to identify scenarios where the PB scheme significantly outperforms the current 2PC scheme, and vice versa. Our experiments were performed on a cluster of homogeneous 8-core (Xeon @ 2.16 GHz) nodes interconnected via Gigabit Ethernet and running a 64-bit Linux kernel, version 2.6.32-21-server. The results were obtained by running 8 parallel Stresser threads per node for 30 seconds, and letting the number of nodes vary from 2 to 10. With 2PC, each thread executes transactions consisting of 10 (get/put) operations, each with a 10% probability of being a put. With PB, the same kind of transactions is executed on the primary, while the backups execute read-only transactions composed of 10 get operations. This allows comparing the maximum throughput of update transactions provided by the two schemes, without excessively favoring PB by keeping the backups totally idle.
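
For reference, the transaction issued by each Stresser thread is essentially equivalent to the following loop (an illustrative sketch, not the actual Radargun code; writeTx is true for the threads allowed to issue puts, i.e. on every node under 2PC and only on the primary under PB):

// Illustrative sketch of one Stresser transaction -- not the actual Radargun code.
// Assumed imports: org.infinispan.Cache, javax.transaction.TransactionManager, java.util.Random.
public class TxStresserSketch {

   void executeTransaction(Cache<String, Object> cache, int numberOfKeys,
                           boolean writeTx, Random random) throws Exception {
      TransactionManager tm = cache.getAdvancedCache().getTransactionManager();
      tm.begin();
      boolean success = false;
      try {
         for (int i = 0; i < 10; i++) {                          // 10 operations per tx
            String key = "key" + random.nextInt(numberOfKeys);   // uniform key selection
            if (writeTx && random.nextDouble() < 0.1) {          // ~10% of the ops are puts
               cache.put(key, "some-payload");
            } else {
               cache.get(key);
            }
         }
         success = true;
      } finally {
         if (success) {
            tm.commit();    // under PB, the primary synchronously pushes the updates here
         } else {
            tm.rollback();  // aborts due to conflicts are counted as failed transactions
         }
      }
   }
}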

The total write-transaction throughput exhibited by the whole cluster (i.e., not the per-node throughput) is shown in the attached plots, for caches containing 1000, 10000 and 100000 keys. As already discussed, the lower the number of keys, the higher the chance of contention and the probability of aborts due to conflicts. In particular, with the 2PC scheme the percentage of failed transactions steadily increases at high contention, up to 6% (to the best of our understanding, mainly due to distributed deadlocks). With PB, instead, the number of failed transactions due to contention is always 0.

Note that we are currently assuming that backup nodes do not generate update transactions. In practice this corresponds to assuming the presence of some load-balancing scheme which directs (all and only) update transactions to the primary node, and read transactions to the backups. In the negative case (a put operation issued on a backup node), we simply throw a PassiveReplicationException at the CacheDelegate level. This is probably suboptimal/undesirable in real settings, as update transactions could (at least in principle!) be transparently rerouted to the primary node in an RPC style. Any suggestion on how to implement such a redirection facility in a transparent/non-intrusive manner would of course be highly appreciated! ;-)
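
Conceptually, the guard we currently have is no more than the following sketch (simplified: identifying the primary via the transport coordinator and the delegatePut() helper are placeholders for our actual code):

// Sketch of the guard at the CacheDelegate level (simplified, placeholder names).
public V put(K key, V value) {
   if (configuration.isPassiveReplicationEnabled()
         && !rpcManager.getTransport().isCoordinator()) {    // i.e. this node is a backup
      throw new PassiveReplicationException(
            "Update operations are only allowed on the primary node");
   }
   return delegatePut(key, value);   // unchanged write path
}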

To conclude, we are currently working on a statistical model able to predict the best-suited replication scheme given the current workload and number of machines, as well as on a mechanism to dynamically switch from one replication scheme to the other.


We'll keep you posted on our progress!

Regards,
    Sebastiano Peluso, Diego Didona

_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Attachments: PCvsPR_WRITE_TX_1000keys.png (41K), PCvsPR_WRITE_TX_10000keys.png (47K), PCvsPR_WRITE_TX_100000keys.png (50K)

Re: [infinispan-dev] Primary-Backup replication scheme in Infinispan

Mircea Markus
Hi,

On 11 Feb 2011, at 17:56, Sebastiano Peluso wrote:

> Hi all,
>
> first let us introduce ourselves given that it's the first time we write in this mailing list.
>
> Our names are Sebastiano Peluso and Diego Didona, and we are working at INESC-ID Lisbon in the context of the Cloud-TM project. Our work is framed in the context of self-adaptive replication mechanisms (please refer to previous message by Paolo on this mailing list and to the www.cloudtm.eu website for additional details on this project).
>
> Up to date we have been working on developing a relatively simple primary-backup (PB) replication mechanism, which we integrated within Infinispan 4.2 and 5.0. In this kind of scheme only one node (called the primary) is allowed to process update transactions, whereas the remaining nodes only process read-only transactions. This allows coping very efficiently with write transactions, as the primary does not have to incur in the cost of remote coordination schemes nor in distributed deadlocks, which can hamper performance at high contention with two phase commit schemes. With PB, in fact, the primary can serialize transactions locally, and simply propagate the updates to the backups for fault-tolerance as well as to allow them to process read-only transactions on fresh data of course.
Is this propagation taking place within the transaction's scope, or after the transaction has completed locally? If the latter, do you propagate changes to the other nodes asynchronously (which would leave room for consistency issues)?
> On the other hand, the primary is clearly prone to become the bottleneck especially in large clusters and write intensive workloads.
That would also be the case with "classic" 2PC as it is currently implemented.
>
> Thus, this scheme does not really represent a replacement for the default 2PC protocol, but rather an alternative approach that results particularly attractive (as we will illustrate in the following with some radargun-based performance results) in small scale clusters, or, in "elastic" cloud scenarios, in periods where the workload is either read dominated or not very intense. Being this scheme particularly efficient in these scenarios, in fact, its adoption would allow to minimize the number of resources acquired from the cloud provider in these periods, with direct benefits in terms of cost reductions. In Cloud-TM, in fact, we aim at designing autonomic mechanisms that would dynamically switch among multiple replication mechanisms depending on the current workload characteristics.
>
> Before discussing the results of our preliminary benchmarking study, we would like to briefly overview how we integrated this replication mechanism within Infinispan. Any comment/feedback is clearly highly appreciated. First of all we have defined a new command, namely PassiveReplicationCommand, that is a subclass of PrepareCommand. We had to define a new command because we had to design customized "visiting" methods for the interceptors. Note that our protocol only affects the commit phase of a transaction, specifically, during the prepare phase, in the prepare method of the TransactionXaAdapter class, if the Primary Backup mode is enabled, then a PassiveReplicationCommand is built by the CommandsFactory and it is passed to the invoke method of the invoker. The PassiveReplicationCommand is then visited by the all the interceptors in the chain, by means of the visitPassiveReplicationCommand methods. We describe more in detail the operations performed by the non-trivial interceptors:

>
> -TxInterceptor: like in the 2PC protocol, if the context is not originated locally, then for each modification stored in the PassiveReplicationCommand the chain of interceptors is invoked.
> -LockingInterceptor: first the next interceptor is called, then the cleanupLocks is performed with the second parameter set to true (commit the operations). This operation is always safe: on the primary it is called only after that the acks from all the slaves are received  (see the ReplicationInterceptor below); on the slave there are no concurrent conflicting writes since these are already locally serialized by the locking scheme performed at the primary.
> -ReplicationInterceptor: first it invokes the next iterceptor; then if the predicate shouldInvokeRemoteTxCommand(ctx) is true, then the method rpcManager.broadcastRpcCommand(command, true, false) is performed, that replicates the modifications in a synchronous mode (waiting for an explicit ack from all the backups).
>
> As for the commit phase:
> -Given that in the Primary Backup the prepare phase works as a commit too, the commit method on a TransactionXaAdapter object in this case simply returns.
>
> On the resulting extended Infinispan version a subset of unit/functional tests were executed and successfully passed:
> - commands.CommandIdUniquenessTest
> - replication.SyncReplicatedAPITest
> - replication.SyncReplImplicitLockingTest
> - replication.SyncReplLockingTest
> - replication.SyncReplTest
> - replication.SyncCacheListenerTest
> - replication.ReplicationExceptionTest
>
>
> We have tested this solution using a customized version of Radargun. Our customizations were first of all aimed at having each thread accessing data within transactions, instead of executing single put/get operations. In addition, now every Stresser thread accesses with uniform probability all of the keys stored by Infinispan, thus generating conflicts with a probability proportional to the number of concurrently active threads and inversely proportional to the total number of keys maintained.
This looks good to me. I would expect the whole test suite to pass. Is this code accessible somewhere (both the ISPN and Radargun code)? I'd love to give it a look.
>
> As already hinted, our results highlight that, depending on the current workload/number of nodes in the system, it is possible to identify scenarios where the PB scheme significantly outperforms the current 2PC scheme, and vice versa. Our experiments were performed in a cluster of homogeneous 8-core (Xeon@2.16GhZ) nodes interconnected via a Gigabit Ethernet and running a Linux Kernel 64 bit version 2.6.32-21-server. The results were obtained by running for 30 seconds 8 parallel Stresser threads per nodes, and letting the number of node vary from 2 to 10. In 2PC, each thread executes transactions which consist of 10 (get/put) operations, with a 10% of probability of generating a put operation. With PB, the same kind of transactions are executed on the primary, but the backups execute read-only transactions composed of 10 get operations. This allows to compare the maximum throughput of update transactions provided by the two compared schemes, without excessively favoring PB by keeping the backups totally idle.
>
> The total write transactions' throughput exhibited by the cluster (i.e. not the throughput per node) is shown in the attached plots, relevant to caches containing 1000, 10000 and 100000 keys. As already discussed, the lower the number of keys, the higher the chance of contention and the probability of aborts due to conflicts. In particular with the 2PC scheme the number of failed transactions steadily increases at high contention up to 6% (to the best of our understanding in particular due to distributed deadlocks). With PB, instead the number of failed txs due to contention is always 0.
>
> Note that, currently, we are assuming that backup nodes do not generate update transactions. In practice this corresponds to assuming the presence of some load balancing scheme which directs (all and only) update transactions to the primary node, and read transactions to the backup. In the negative case (a put operation is generated on a backup node), we simply throw a PassiveReplicationException at the CacheDelegate level. This is probably suboptimal/undesirable in real settings, as update transactions may be transparently rerouted (at least in principle!) to the primary node in a RPC-style. Any suggestion on how to implement such a redirection facility in a transparent/non-intrusive manner would be highly appreciated of course! ;-)
>
> To conclude, we are currently working on a statistical model that is able to predict the best suited replication scheme given the current workload/number of machines, as well as on a mechanism to dynamically switch from one replication scheme to the other.
>
>
> We'll keep you posted on our progresses!

Thank you!
>
> Regards,
>   Sebastiano Peluso, Diego Didona



Re: [infinispan-dev] [Cloudtm-discussion] Primary-Backup replication scheme in Infinispan

Mark Little
Hi

On 11 Feb 2011, at 17:56, Sebastiano Peluso wrote:

Hi all,

first let us introduce ourselves given that it's the first time we write in this mailing list.

Our names are Sebastiano Peluso and Diego Didona, and we are working at INESC-ID Lisbon in the context of the Cloud-TM project. Our work is framed in the context of self-adaptive replication mechanisms (please refer to previous message by Paolo on this mailing list and to the www.cloudtm.eu website for additional details on this project).

Up to date we have been working on developing a relatively simple primary-backup (PB) replication mechanism, which we integrated within Infinispan 4.2 and 5.0. In this kind of scheme only one node (called the primary) is allowed to process update transactions, whereas the remaining nodes only process read-only transactions. This allows coping very efficiently with write transactions, as the primary does not have to incur in the cost of remote coordination schemes nor in distributed deadlocks, which can hamper performance at high contention with two phase commit schemes. With PB, in fact, the primary can serialize transactions locally, and simply propagate the updates to the backups for fault-tolerance as well as to allow them to process read-only transactions on fresh data of course. On the other hand, the primary is clearly prone to become the bottleneck especially in large clusters and write intensive workloads.

So when you say "serialize transactions locally" does this mean you don't do distributed locking? Or is "locally" not referring simply to the primary?


Thus, this scheme does not really represent a replacement for the default 2PC protocol, but rather an alternative approach that results particularly attractive (as we will illustrate in the following with some radargun-based performance results) in small scale clusters, or, in "elastic" cloud scenarios, in periods where the workload is either read dominated or not very intense.

I'm not sure what you mean by a "replacement" for 2PC? 2PC is for atomicity (consensus). It's the A in ACID. It's not the C, I or D.

Being this scheme particularly efficient in these scenarios, in fact, its adoption would allow to minimize the number of resources acquired from the cloud provider in these periods, with direct benefits in terms of cost reductions. In Cloud-TM, in fact, we aim at designing autonomic mechanisms that would dynamically switch among multiple replication mechanisms depending on the current workload characteristics.

Replication and transactions aren't mutually inconsistent things. They can be merged quite well :-)


The reason I mention these is because I don't quite understand what your goal is/was with regards to transactions and replication.


Before discussing the results of our preliminary benchmarking study, we would like to briefly overview how we integrated this replication mechanism within Infinispan. Any comment/feedback is clearly highly appreciated. First of all we have defined a new command, namely PassiveReplicationCommand, that is a subclass of PrepareCommand. We had to define a new command because we had to design customized "visiting" methods for the interceptors. Note that our protocol only affects the commit phase of a transaction, specifically, during the prepare phase, in the prepare method of the TransactionXaAdapter class, if the Primary Backup mode is enabled, then a PassiveReplicationCommand is built by the CommandsFactory and it is passed to the invoke method of the invoker. The PassiveReplicationCommand is then visited by the all the interceptors in the chain, by means of the visitPassiveReplicationCommand methods. We describe more in detail the operations performed by the non-trivial interceptors:

-TxInterceptor: like in the 2PC protocol, if the context is not originated locally, then for each modification stored in the PassiveReplicationCommand the chain of interceptors is invoked.
-LockingInterceptor: first the next interceptor is called, then the cleanupLocks is performed with the second parameter set to true (commit the operations). This operation is always safe: on the primary it is called only after that the acks from all the slaves are received  (see the ReplicationInterceptor below); on the slave there are no concurrent conflicting writes since these are already locally serialized by the locking scheme performed at the primary.
-ReplicationInterceptor: first it invokes the next iterceptor; then if the predicate shouldInvokeRemoteTxCommand(ctx) is true, then the method rpcManager.broadcastRpcCommand(command, true, false) is performed, that replicates the modifications in a synchronous mode (waiting for an explicit ack from all the backups).

As for the commit phase:
-Given that in the Primary Backup the prepare phase works as a commit too, the commit method on a TransactionXaAdapter object in this case simply returns.

On the resulting extended Infinispan version a subset of unit/functional tests were executed and successfully passed:
- commands.CommandIdUniquenessTest
- replication.SyncReplicatedAPITest
- replication.SyncReplImplicitLockingTest
- replication.SyncReplLockingTest
- replication.SyncReplTest
- replication.SyncCacheListenerTest
- replication.ReplicationExceptionTest


We have tested this solution using a customized version of Radargun. Our customizations were first of all aimed at having each thread accessing data within transactions, instead of executing single put/get operations. In addition, now every Stresser thread accesses with uniform probability all of the keys stored by Infinispan, thus generating conflicts with a probability proportional to the number of concurrently active threads and inversely proportional to the total number of keys maintained.

As already hinted, our results highlight that, depending on the current workload/number of nodes in the system, it is possible to identify scenarios where the PB scheme significantly outperforms the current 2PC scheme, and vice versa.

I'm obviously missing something here, because it still seems like you are attempting to compare incomparable things. It's not like comparing apples and oranges (at least they are both fruit), but more like apples and broccoli ;-)

Our experiments were performed in a cluster of homogeneous 8-core (Xeon@2.16GhZ) nodes interconnected via a Gigabit Ethernet and running a Linux Kernel 64 bit version 2.6.32-21-server. The results were obtained by running for 30 seconds 8 parallel Stresser threads per nodes, and letting the number of node vary from 2 to 10. In 2PC, each thread executes transactions which consist of 10 (get/put) operations, with a 10% of probability of generating a put operation. With PB, the same kind of transactions are executed on the primary, but the backups execute read-only transactions composed of 10 get operations. This allows to compare the maximum throughput of update transactions provided by the two compared schemes, without excessively favoring PB by keeping the backups totally idle.

The total write transactions' throughput exhibited by the cluster (i.e. not the throughput per node) is shown in the attached plots, relevant to caches containing 1000, 10000 and 100000 keys. As already discussed, the lower the number of keys, the higher the chance of contention and the probability of aborts due to conflicts. In particular with the 2PC scheme the number of failed transactions steadily increases at high contention up to 6% (to the best of our understanding in particular due to distributed deadlocks). With PB, instead the number of failed txs due to contention is always 0.

Maybe you don't mean 2PC, but pessimistic locking? I'm still not too sure. But I am pretty sure it's not 2PC you mean.


Note that, currently, we are assuming that backup nodes do not generate update transactions. In practice this corresponds to assuming the presence of some load balancing scheme which directs (all and only) update transactions to the primary node, and read transactions to the backup.

Can I say it again? Serialisability?

In the negative case (a put operation is generated on a backup node), we simply throw a PassiveReplicationException at the CacheDelegate level. This is probably suboptimal/undesirable in real settings, as update transactions may be transparently rerouted (at least in principle!) to the primary node in a RPC-style. Any suggestion on how to implement such a redirection facility in a transparent/non-intrusive manner would be highly appreciated of course! ;-)

To conclude, we are currently working on a statistical model that is able to predict the best suited replication scheme given the current workload/number of machines, as well as on a mechanism to dynamically switch from one replication scheme to the other.


We'll keep you posted on our progresses!

Regards,
  Sebastiano Peluso, Diego Didona

---
Mark Little

JBoss, by Red Hat
Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SI4 1TE, United Kingdom.
Registered in UK and Wales under Company Registration No. 3798903 Directors: Michael Cunningham (USA), Charlie Peters (USA), Matt Parsons (USA) and Brendan Lane (Ireland).






Re: [infinispan-dev] [Cloudtm-discussion] Primary-Backup replication scheme in Infinispan

Paolo Romano
Hi Mark,

let me try to clarify the scope and goals of the work Sebastiano and Diego are doing, as I think there may have been some misunderstanding here.

First, an important premise, which I think was not clear in their previous message: we are here considering the *full replication mode* of Infinispan, in which every node maintains a copy of all data items. This implies that there is no need to fetch data from remote nodes during transaction execution. Also, Infinispan was configured NOT to use eager locking. In other words, during a transaction's execution Infinispan acquires locks only locally.

What is being compared is a single-master replication protocol (the classic primary-backup scheme) versus the 2PC-based multi-master replication protocol (the one currently implemented by Infinispan).

The two protocols behave identically during transaction execution; what differs is how the commit phase is handled.

In single-master mode, since no remote conflict can occur and local conflicts are managed by the local concurrency control, data updates can simply be pushed to the backups. More precisely (this should also answer one of Mircea's questions), DURING the commit phase the primary synchronously propagates the updates to the backups, waits for their acks, and only then returns from the commit request to the application.

In multi-master mode, 2PC is used to materialize conflicts among transactions executed at different nodes and to ensure agreement on each transaction's outcome. Thus 2PC here serves to enforce both atomicity and isolation.
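
Schematically, and ignoring failure handling, the difference at commit time between the two schemes is the following (every helper name below is a placeholder, not an Infinispan API):

// Very schematic view of commit time (failure handling omitted; all helpers are placeholders).

// Single-master (primary-backup): only the primary runs update transactions.
void commitPrimaryBackup(TxState tx) {
   applyLocally(tx);                 // local locks are already held
   pushUpdatesToAllBackups(tx);      // synchronous propagation
   waitForAcksFromAllBackups();
   releaseLocksAndReturn();          // only now does commit return to the application
}

// Multi-master (current 2PC-based scheme): any node may run update transactions.
void commitTwoPhase(TxState tx) {
   sendPrepareToAllNodes(tx);        // remote lock acquisition materializes conflicts
   waitForVotes();                   // may abort on conflict / distributed deadlock
   sendCommitToAllNodes(tx);
   releaseLocksAndReturn();
}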

The plots attached to Diego and Sebastiano's mail compare the performance of two replication strategies that provide the same degree of consistency/failure resiliency... at least assuming perfect failure detection; I'd prefer to avoid delving, on this mailing list, into an (actually interesting) discussion of the non-blocking guarantees of the two protocols in partially or eventually synchronous environments. Thus I disagree when you say that they are not comparable.

IMO, the only arguable unfairness in the comparison is that the multi-master scheme allows update transactions to be processed at every node, whereas the single-master scheme must rely either on 1) a load-balancing scheme that redirects all update transactions to the master, or 2) a redirection scheme that forwards to the master any update transaction "originated" at a backup. The testbed considered in the current evaluation "emulates" option 1, which does favour the PB scheme. On the other hand, we are currently studying a non-intrusive way to implement a transparent redirection scheme (option 2) in Infinispan, so suggestions are very welcome! ;-)

I hope this post clarified your perplexities!! :-)

Cheers,

    Paolo




Re: [infinispan-dev] [Cloudtm-discussion] Primary-Backup replication scheme in Infinispan

Paolo Romano
On 2/17/11 9:02 PM, Mark Little wrote:
Hi Paolo,

Sent from my iPhone

On 17 Feb 2011, at 19:51, Paolo Romano <[hidden email]> wrote:

Hi Mark,

let me try to clarify the scope and goals of the work Sebastiano and Diego are doing, as I think there may have been some misunderstanding here.

First an important premise, which I think was not clear in their previous message. We are here considering the *full replication mode* of Infinispan, in which every node maintains all copies of the same data items. This implies that there is no need to fetch data from remote nodes during transaction execution. Also, Infinispan was configured NOT TO use eager locking. In other words, during transaction's execution Infinispan acquires locks only locally.

Ok that was certainly not clear.


What it is being compared is a single-master replication protocol (the classic primary-backup scheme) vs the 2PC based multi-master replication protocol (which is the one currently implemented by Infinispan).

Since two-phase commit is associated a lot with transaction systems ;) probably more so than with replication, I think we need to be more careful in future about the implicit assumptions carried into any discussion :) It should help avoid confusion.

I see your point. But I guess that one of the main reasons why several details of Infinispan's 2PC-based replication algorithm were omitted from the mail was to avoid making it so long as to discourage reading it ;-)



The two protocols behave identically during transactions' execution. What differs between the two protocols is the handling of the commit phase.

In the single master mode, as no remote conflict can occur and local conflicts are managed by local concurrency control, data updates can be simply pushed to the backups. More precisely (this should answer also a question of Mircea), the primary DURING the commit phase, in a synchronous fashion, propagates the updates to the backups, waits for their ack, and only then returns from the commit request to the application.

In the multi-master mode, 2PC is used to materialize conflicts among transactions executed at different nodes, and to ensure agreement on the transaction's outcome. Thus 2PC serves  to enforce both atomicity and isolation.

Not quite :)

2PC is a consensus protocol, pure and simple. I think what you mean is that some of the participants do additional work during the two phases, as happens in a TM where locks may be propagated, promoted or released. But the reason I mention this is that we as an industry (and academia) have confused the world by not being clear about what 2PC brings. You'd be surprised by the number of people who believe 2PC is responsible for all of the ACID semantics. I think we need to be very clear on such distinctions.

... true, it's the lock acquisition taking place during the voting phase of 2PC that ensures isolation, not 2PC per se. I called this approach a "2PC-based multi-master replication protocol" when I introduced it in my mail... 2PC here was just a shorthand.



The plots attached in the mail from Diego and Sebastiano are comparing the performance of two replication strategies that provide the same degree of consistency/failure resiliency...at least assuming perfect failure detection,

And we both know there is no such thing, unless you want to get into the world of quantum mechanics :-)

.... still, there are many real systems that are actually designed assuming perfect failure detection... possibly relying on some dedicated hardware mechanism (which encapsulates the perfect-failure-detection assumption) to enforce accuracy by shutting down falsely suspected nodes, e.g. http://en.wikipedia.org/wiki/STONITH.


I'd prefer to avoid delving into a (actually interesting) discussion of non-blocking guarantees of the two protocols in partially or eventually synchronous environments in this mailing list.

I'd be interested in continuing off list then.

well, trying to make a very long story (probably too) short, one may say that:

- in an asynchronous system (even if augmented with eventually perfect failure detection), classic 2PC blocks upon coordinator crashes. Pragmatically, this forces heuristic decisions that may lead to atomicity violations. Using more expensive commit protocols, such as Paxos Commit (http://research.microsoft.com/apps/pubs/default.aspx?id=64636), one can enforce atomicity also in eventually synchronous systems (tolerating f faults out of 2f+1 replicas).

- With primary-backup (PB), in an asynchronous system the issue is the split-brain syndrome (this wiki definition is not the best ever, but it's the only one I could rapidly find: http://en.wikipedia.org/wiki/Split-brain_(Computing) ). Unsurprisingly, also for PB it is possible to design (more expensive) variants that work in partially synchronous systems. An example is Vertical Paxos (research.microsoft.com/pubs/80907/podc09v6.pdf). Or, assuming virtual synchrony (http://en.wikipedia.org/wiki/Virtual_synchrony), one can propagate the primary's updates via uniform reliable broadcast (also called safe delivery in virtual-synchrony jargon, http://www.cs.huji.ac.il/labs/transis/lab-projects/guide/chap3.html#safe). Or, mapping the solution to classic Paxos, one can have the role of the primary coincide with that of the leader in the Multi-Paxos protocol (http://en.wikipedia.org/wiki/Paxos_algorithm#Multi-Paxos).

Note that, in an asynchronous system, the PB implementation shown in the plots might violate consistency in case of false failure suspicions of the primary, just like the current Infinispan 2PC-based replication scheme might in case of false failure suspicions of the coordinator. So, in terms of the required synchrony assumptions, the two protocols are fairly comparable.



Thus I disagree when you say that they are not comparable.

Well hopefully you now see why I still disagree. If the term 2PC is being used here to mean the atomicity/consensus protocol only, then they are not comparable. If it's a shorthand for more than that then we should use a different notation.

Considering that 2PC was a shorthand for "2PC-based multi-master replication scheme"... I hope we now agree that the performance of the two protocols is comparable ;-)




IMO, the only arguable unfairness in the comparison is that the multi-master scheme allows processing update transactions at every node, whereas the single-master scheme should  rely either on 1) a load balancing scheme to ensure redirecting all the update transactions to the master, or 2) rely on a redirection scheme to redirect towards the master any update transactions "originated" by the backups. The testbed considered in the current evaluation is "emulating" option  1, which does favour the PB scheme. On the other hand, we are currently studying a non-intrusive way to implement a transparent redirection scheme (option 2) in Infinispan, so suggestions are very welcome! ;-)

I hope this post clarified your perplexities!! :-)

Yes and no :)


Cheers,

    Paolo



On 2/17/11 3:53 PM, Mark Little wrote:
Hi

On 11 Feb 2011, at 17:56, Sebastiano Peluso wrote:

Hi all,

first let us introduce ourselves given that it's the first time we write in this mailing list.

Our names are Sebastiano Peluso and Diego Didona, and we are working at INESC-ID Lisbon in the context of the Cloud-TM project. Our work is framed in the context of self-adaptive replication mechanisms (please refer to previous message by Paolo on this mailing list and to the www.cloudtm.eu website for additional details on this project).

Up to date we have been working on developing a relatively simple primary-backup (PB) replication mechanism, which we integrated within Infinispan 4.2 and 5.0. In this kind of scheme only one node (called the primary) is allowed to process update transactions, whereas the remaining nodes only process read-only transactions. This allows coping very efficiently with write transactions, as the primary does not have to incur in the cost of remote coordination schemes nor in distributed deadlocks, which can hamper performance at high contention with two phase commit schemes. With PB, in fact, the primary can serialize transactions locally, and simply propagate the updates to the backups for fault-tolerance as well as to allow them to process read-only transactions on fresh data of course. On the other hand, the primary is clearly prone to become the bottleneck especially in large clusters and write intensive workloads.

So when you say "serialize transactions locally" does this mean you don't do distributed locking? Or is "locally" not referring simply to the primary?


Thus, this scheme does not really represent a replacement for the default 2PC protocol, but rather an alternative approach that results particularly attractive (as we will illustrate in the following with some radargun-based performance results) in small scale clusters, or, in "elastic" cloud scenarios, in periods where the workload is either read dominated or not very intense.

I'm not sure what you mean by a "replacement" for 2PC? 2PC is for atomicity (consensus). It's the A in ACID. It's not the C, I or D.

Being this scheme particularly efficient in these scenarios, in fact, its adoption would allow to minimize the number of resources acquired from the cloud provider in these periods, with direct benefits in terms of cost reductions. In Cloud-TM, in fact, we aim at designing autonomic mechanisms that would dynamically switch among multiple replication mechanisms depending on the current workload characteristics.

Replication and transactions aren't mutually inconsistent things. They can be merged quite well :-)


The reason I mention these is because I don't quite understand what your goal is/was with regards to transactions and replication.


Before discussing the results of our preliminary benchmarking study, we would like to briefly overview how we integrated this replication mechanism within Infinispan. Any comment/feedback is clearly highly appreciated. First of all we have defined a new command, namely PassiveReplicationCommand, that is a subclass of PrepareCommand. We had to define a new command because we had to design customized "visiting" methods for the interceptors. Note that our protocol only affects the commit phase of a transaction, specifically, during the prepare phase, in the prepare method of the TransactionXaAdapter class, if the Primary Backup mode is enabled, then a PassiveReplicationCommand is built by the CommandsFactory and it is passed to the invoke method of the invoker. The PassiveReplicationCommand is then visited by the all the interceptors in the chain, by means of the visitPassiveReplicationCommand methods. We describe more in detail the operations performed by the non-trivial interceptors:

-TxInterceptor: like in the 2PC protocol, if the context is not originated locally, then for each modification stored in the PassiveReplicationCommand the chain of interceptors is invoked.
-LockingInterceptor: first the next interceptor is called, then the cleanupLocks is performed with the second parameter set to true (commit the operations). This operation is always safe: on the primary it is called only after that the acks from all the slaves are received  (see the ReplicationInterceptor below); on the slave there are no concurrent conflicting writes since these are already locally serialized by the locking scheme performed at the primary.
-ReplicationInterceptor: first it invokes the next iterceptor; then if the predicate shouldInvokeRemoteTxCommand(ctx) is true, then the method rpcManager.broadcastRpcCommand(command, true, false) is performed, that replicates the modifications in a synchronous mode (waiting for an explicit ack from all the backups).

As for the commit phase:
-Given that in the Primary Backup the prepare phase works as a commit too, the commit method on a TransactionXaAdapter object in this case simply returns.

On the resulting extended Infinispan version a subset of unit/functional tests were executed and successfully passed:
- commands.CommandIdUniquenessTest
- replication.SyncReplicatedAPITest
- replication.SyncReplImplicitLockingTest
- replication.SyncReplLockingTest
- replication.SyncReplTest
- replication.SyncCacheListenerTest
- replication.ReplicationExceptionTest


We have tested this solution using a customized version of Radargun. Our customizations were first of all aimed at having each thread accessing data within transactions, instead of executing single put/get operations. In addition, now every Stresser thread accesses with uniform probability all of the keys stored by Infinispan, thus generating conflicts with a probability proportional to the number of concurrently active threads and inversely proportional to the total number of keys maintained.

As already hinted, our results highlight that, depending on the current workload/number of nodes in the system, it is possible to identify scenarios where the PB scheme significantly outperforms the current 2PC scheme, and vice versa.

I'm obviously missing something here, because it still seems like you are attempting to compare incomparable things. It's not like comparing apples and oranges (at least they are both fruit), but more like apples and broccoli ;-)

Our experiments were performed in a cluster of homogeneous 8-core ([hidden email]) nodes interconnected via a Gigabit Ethernet and running a Linux Kernel 64 bit version 2.6.32-21-server. The results were obtained by running for 30 seconds 8 parallel Stresser threads per nodes, and letting the number of node vary from 2 to 10. In 2PC, each thread executes transactions which consist of 10 (get/put) operations, with a 10% of probability of generating a put operation. With PB, the same kind of transactions are executed on the primary, but the backups execute read-only transactions composed of 10 get operations. This allows to compare the maximum throughput of update transactions provided by the two compared schemes, without excessively favoring PB by keeping the backups totally idle.

The total write transactions' throughput exhibited by the cluster (i.e. not the throughput per node) is shown in the attached plots, relevant to caches containing 1000, 10000 and 100000 keys. As already discussed, the lower the number of keys, the higher the chance of contention and the probability of aborts due to conflicts. In particular with the 2PC scheme the number of failed transactions steadily increases at high contention up to 6% (to the best of our understanding in particular due to distributed deadlocks). With PB, instead the number of failed txs due to contention is always 0.

Maybe you don't mean 2PC, but pessimistic locking? I'm still not too sure. But I am pretty sure it's not 2PC you mean.


Note that, currently, we are assuming that backup nodes do not generate update transactions. In practice this corresponds to assuming the presence of some load balancing scheme which directs (all and only) update transactions to the primary node, and read transactions to the backup.

Can I say it again? Serialisability?

In the negative case (a put operation is generated on a backup node), we simply throw a PassiveReplicationException at the CacheDelegate level. This is probably suboptimal/undesirable in real settings, as update transactions may be transparently rerouted (at least in principle!) to the primary node in a RPC-style. Any suggestion on how to implement such a redirection facility in a transparent/non-intrusive manner would be highly appreciated of course! ;-)

To conclude, we are currently working on a statistical model that is able to predict the best suited replication scheme given the current workload/number of machines, as well as on a mechanism to dynamically switch from one replication scheme to the other.
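
Purely as an illustration of the kind of decision such a mechanism has to make (the thresholds below are invented placeholders, not the output of the statistical model):

public final class ReplicationSchemeChooser {

    public enum Scheme { PRIMARY_BACKUP, TWO_PHASE_COMMIT }

    // Invented cut-offs for the sake of the example; a real policy would be
    // driven by the workload model, not by hand-picked constants.
    private static final double MAX_WRITE_RATIO_FOR_PB = 0.10;
    private static final int MAX_CLUSTER_SIZE_FOR_PB = 6;

    public static Scheme choose(double writeTxRatio, int clusterSize) {
        // PB pays off while a single primary can absorb the whole update load;
        // otherwise fall back to the multi-master 2PC-based scheme.
        if (writeTxRatio <= MAX_WRITE_RATIO_FOR_PB || clusterSize <= MAX_CLUSTER_SIZE_FOR_PB) {
            return Scheme.PRIMARY_BACKUP;
        }
        return Scheme.TWO_PHASE_COMMIT;
    }
}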


We'll keep you posted on our progresses!

Regards,
  Sebastiano Peluso, Diego Didona
<PCvsPR_WRITE_TX_1000keys.png><PCvsPR_WRITE_TX_10000keys.png><PCvsPR_WRITE_TX_100000keys.png>

---
Mark Little

JBoss, by Red Hat






-- 

Paolo Romano, PhD
Senior Researcher
INESC-ID
Rua Alves Redol, 9
1000-059, Lisbon Portugal
Tel. + 351 21 3100300
Fax  + 351 21 3145843
Webpage http://www.gsd.inesc-id.pt/~romanop

_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev
Reply | Threaded
Open this post in threaded view
|

Re: [infinispan-dev] [Cloudtm-discussion] Primary-Backup replication scheme in Infinispan

Mark Little
Interesting discussion, but you either need to use failure suspectors, which are different to failure detectors, or quantum mechanics. I suppose the only other choice is omniscience, but I don't know how to do that ;-)

Mark.


On 18 Feb 2011, at 00:59, Paolo Romano wrote:

And we both know there is no such thing, unless you want to get into the world of quantum mechanics :-)

....still there are many real systems that are actually designed assuming perfect failure detection... possibly relying on some dedicated hardware mechanism (which encapsulates the perfect failure detection assumption) to enforce accuracy by forcing falsely suspected nodes to shut down, e.g. http://en.wikipedia.org/wiki/STONITH.

---
Mark Little

JBoss, by Red Hat





_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev
Reply | Threaded
Open this post in threaded view
|

Re: [infinispan-dev] [Cloudtm-discussion] Primary-Backup replication scheme in Infinispan

Mark Little
In reply to this post by Paolo Romano
Yeah, I know all this ... recall my PhD was on transactions and replication :-)

You can tolerate partitions if you add weighted voting schemes to the group mechanism so that only a primary in the majority partition can do any work.
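
A minimal sketch of that guard, assuming one vote per node rather than a truly weighted scheme (our illustration, not existing Infinispan code): the primary only processes updates while its view contains a strict majority of the configured members.

import java.util.List;

public final class MajorityGuard {

    private final int configuredClusterSize;

    public MajorityGuard(int configuredClusterSize) {
        this.configuredClusterSize = configuredClusterSize;
    }

    // True only if the visible members form a strict majority, so at most
    // one partition can ever host a working primary.
    public boolean mayActAsPrimary(List<String> currentViewMembers) {
        return currentViewMembers.size() > configuredClusterSize / 2;
    }
}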

Mark.


On 18 Feb 2011, at 00:59, Paolo Romano wrote:

well, trying to make a very long story (probably too) short one may say that:

- in an asynchronous system (even if augmented with eventually perfect failure detection), classic 2PC blocks upon coordinator crashes. Pragmatically, this forces heuristic decisions that may lead to atomicity violations. Using more expensive commit protocols, such as Paxos Commit (http://research.microsoft.com/apps/pubs/default.aspx?id=64636), one can enforce atomicity also in eventually synchronous systems (tolerating f faults out of 2f+1 replicas).

- With primary-backup (PB), in an asynchronous system the issue is the split-brain syndrome (this wiki definition is not the best ever, but it's the only one I could rapidly find: http://en.wikipedia.org/wiki/Split-brain_(Computing) ). Unsurprisingly, for PB it is also possible to design (more expensive) variants working in partially synchronous systems. An example is Vertical Paxos (research.microsoft.com/pubs/80907/podc09v6.pdf). Or, assuming virtual synchrony (http://en.wikipedia.org/wiki/Virtual_synchrony), by propagating the primary's updates via uniform reliable broadcast (also called safe delivery in virtual-synchrony jargon, http://www.cs.huji.ac.il/labs/transis/lab-projects/guide/chap3.html#safe). Or, mapping the solution to classic Paxos, by having the role of the primary coincide with that of the leader in the Multi-Paxos protocol (http://en.wikipedia.org/wiki/Paxos_algorithm#Multi-Paxos).

Note that, in an asynchronous system, the PB implementation shown in the plots might violate consistency in case of false failure suspicions of the primary, just like Infinispan's current 2PC-based replication scheme might do in case of false failure suspicions of the coordinator. So, in terms of required synchrony assumptions, the two considered protocols are fairly comparable.

---
Mark Little

JBoss, by Red Hat





_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev
Reply | Threaded
Open this post in threaded view
|

Re: [infinispan-dev] [Cloudtm-discussion] Primary-Backup replication scheme in Infinispan

Mark Little
In reply to this post by Paolo Romano
+1


On 18 Feb 2011, at 00:59, Paolo Romano wrote:

>> Well hopefully you now see why I still disagree. If the term 2PC is being used here to mean the atomicity/consensus protocol only, then they are not comparable. If it's a shorthand for more than that then we should use a different notation.
>
> Considering that 2PC was a shorthand for "2PC-based multi-master replication scheme"... I hope we agree now that the performance of the two protocols is comparable ;-)
>

---
Mark Little
[hidden email]

JBoss, by Red Hat





_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev
Reply | Threaded
Open this post in threaded view
|

Re: [infinispan-dev] Primary-Backup replication scheme in Infinispan

Sebastiano Peluso
In reply to this post by Mircea Markus
Hi Mircea Markus,
    we are sorry for this late reply, but we spent some time "cleaning" the code :-).

> Is this propagation taking place in the transaction's scope, or after the transaction is completed locally? If the latter, do you propagate changes to other nodes asynchronously (this would leave room for consistency issues)?
This propagation takes place in the transaction's scope, just like the prepare phase in Infinispan's current replication scheme. The update broadcast is synchronous, so we cannot have consistency issues (for now we assume a failure-free environment).
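
To make "synchronous, in the transaction's scope" concrete, here is a minimal sketch of the replication step, based on the ReplicationInterceptor description from the first mail; only the rpcManager.broadcastRpcCommand(command, true, false) call is taken from that description, the surrounding class is a simplified placeholder:

import org.infinispan.commands.ReplicableCommand;
import org.infinispan.remoting.rpc.RpcManager;

public class PassiveReplicationPropagationSketch {

    private final RpcManager rpcManager;

    public PassiveReplicationPropagationSketch(RpcManager rpcManager) {
        this.rpcManager = rpcManager;
    }

    // Invoked on the primary while the transaction is still active: the call
    // blocks until every backup has acknowledged the command, so the local
    // commit that follows cannot overtake the propagation of the updates.
    public void propagateSynchronously(ReplicableCommand passiveReplicationCommand) {
        // sync = true (wait for acks from all backups), usePriorityQueue = false
        rpcManager.broadcastRpcCommand(passiveReplicationCommand, true, false);
    }
}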
> This looks good to me. I would expect the whole test suite to pass. Is this code accessible somewhere (both ISPN and Radargun code)? I'd love to give it a look.
You can download the two implementations at:
     http://www.gsd.inesc-id.pt/~peluso/infinispan/infinispan.tar.bz2
     http://www.gsd.inesc-id.pt/~didona/Radargun/radargun.tar.bz2

We cloned the Infinispan code from GitHub two weeks ago (creating a local branch) and we would like to fork the current snapshot on GitHub, in order to push our changes to the resulting fork. How can we make the aforementioned fork? Do we only need an account there?

> Thank you!
Thank you too!!!

Regards,
    Sebastiano Peluso, Diego Didona
_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev
Reply | Threaded
Open this post in threaded view
|

Re: [infinispan-dev] Primary-Backup replication scheme in Infinispan

Mircea Markus
Thanks - I'll look at this and give you feedback.
Re: GitHub, you only need an account and to fork from our repo. More on it here: http://community.jboss.org/wiki/InfinispanandGitHub
GitHub is a great tool for code review as well.

Cheers,
Mircea 

On 18 Feb 2011, at 13:30, Sebastiano Peluso wrote:

<snip>


_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev
Reply | Threaded
Open this post in threaded view
|

Re: [infinispan-dev] Primary-Backup replication scheme in Infinispan

Manik Surtani
In reply to this post by Sebastiano Peluso
Sorry I haven't responded on this thread till now; it somehow slipped my attention.  P/B is interesting, I have a few thoughts:

* How do you pick which is the primary?  First owner on a list of owners for a key obtained via consistent hash?  What if this owner is not the peer on which your transaction is running?  Do you eagerly acquire locks on the single owner?

* How does this deal with transactions spanning > 1 key, mapped to > 1 "primary" node?

* I presume you introduce a window of inconsistency between (a) the point where a tx commits on the primary node and (b) the point where the primary node pushes changes to its backups.  Is the latter an async or batched process?

* Any thoughts on why - as per your graphs - P/B performance degrades as cluster size increases?

Cheers
Manik

On 11 Feb 2011, at 17:56, Sebastiano Peluso wrote:

> <snip>

--
Manik Surtani
[hidden email]
twitter.com/maniksurtani

Lead, Infinispan
http://www.infinispan.org




_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev
Reply | Threaded
Open this post in threaded view
|

Re: [infinispan-dev] Primary-Backup replication scheme in Infinispan

Manik Surtani
In reply to this post by Sebastiano Peluso

On 18 Feb 2011, at 13:30, Sebastiano Peluso wrote:

We cloned the Infinispan code from GitHub two weeks ago (creating a local branch) and we would like to fork the current snapshot on GitHub, in order to push our changes to the resulting fork. How can we make the aforementioned fork? Do we only need an account there?

Yes, just create an account on GitHub (free) and fork away.  Create a topic branch in your fork and work off that - it makes it easier for folks to review/compare changes, and for you to sync up your forks and branches with other changes that may be happening upstream.  I've detailed the process here:


Cheers
Manik



_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev
Reply | Threaded
Open this post in threaded view
|

Re: [infinispan-dev] [Cloudtm-discussion] [SPAM] Re: Primary-Backup replication scheme in Infinispan

Manik Surtani
In reply to this post by Paolo Romano
Paolo,


On 17 Feb 2011, at 19:51, Paolo Romano wrote:

> First an important premise, which I think was not clear in their previous message. We are here considering the *full replication mode* of Infinispan, in which every node maintains all copies of the same data items. This implies that there is no need to fetch data from remote nodes during transaction execution. Also, Infinispan was configured NOT TO use eager locking. In other words, during transaction's execution Infinispan acquires locks only locally.

So are you suggesting that this scheme maintains a single, global master node for the entire cluster, for *all* keys?  Doesn't this become a bottleneck, and how do you deal with the master node failing?

Cheers
Manik

--
Manik Surtani
[hidden email]
twitter.com/maniksurtani

Lead, Infinispan
http://www.infinispan.org




_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev
Reply | Threaded
Open this post in threaded view
|

Re: [infinispan-dev] [Cloudtm-discussion] [SPAM] Re: Primary-Backup replication scheme in Infinispan

Paolo Romano
On 3/9/11 11:41 AM, Manik Surtani wrote:
> Paolo,
>
>
> On 17 Feb 2011, at 19:51, Paolo Romano wrote:
>
>> First an important premise, which I think was not clear in their previous message. We are here considering the *full replication mode* of Infinispan, in which every node maintains all copies of the same data items. This implies that there is no need to fetch data from remote nodes during transaction execution. Also, Infinispan was configured NOT TO use eager locking. In other words, during transaction's execution Infinispan acquires locks only locally.
> So are you suggesting that this scheme maintains a single, global master node for the entire cluster, for *all* keys?  Doesn't this become a bottleneck, and how do you deal with the master node failing?
Hi Manik,

of course the primary (or master) can become a bottleneck if the number
of update transactions is very large. If the % of write transactions is
very high, however, then we have to distinguish two cases: low vs high
contention.

At high contention, in fact, the 2PC-based replication scheme used by Infinispan (2PC from now on, for the sake of brevity ;-) ) falls prey to deadlocks and starts thrashing. This is the reason why 2PC's performance is so poor in the plot attached to Diego and Sebastiano's mail for the case of 1000 keys. With primary-backup, since concurrency is regulated locally at the primary, and much more efficiently, the overall performance is much better.

Clearly 2PC *can* scale much better if there's no contention and a write
intensive workload, as it can use the horsepower of more nodes to
process writes...  but this does depend on the workload.

One of the results we would like to achieve with Cloud-TM is designing
mechanisms that adaptively commute between multiple replication schemes
depending on the scale of the platform and its current workload. That's
just a first step in this direction!

Concerning failures of the master, this is not an issue. In fact, it waits synchronously for the replies of the backups. Thus, if it fails, it will be purged from the current view, and as a new one is delivered we can elect a new primary. The only glitch that may occur is in a more complicated failure scenario:
The primary sends an update, say "u", to the backups {B1,...,Bn}.
The primary crashes while doing so.
B1 receives u, while the others B2,...,Bn do not.
B1 runs some read-only transaction that sees u (and money is dispensed, missiles are fired and other horrible things happen due to that).
B1 crashes.
A new view is delivered which excludes the former primary and B1.
B2 (for instance) is elected the new primary, but "u" is lost.

This can be avoided by having the backups acknowledge to each other the reception of the updates before committing them (more formally, we should have the primary disseminate updates via Uniform Reliable Broadcast [1])... At the moment we're not doing this, mainly for the sake of simplicity, but we expect the results to be very similar.
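
For concreteness, a minimal sketch of that ack-before-apply idea (a simplification of Uniform Reliable Broadcast, assuming a static view and reliable point-to-point channels; the Transport and UpdateApplier interfaces are placeholders of ours, not JGroups API):

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class UniformDeliverySketch {

    public interface Transport {               // placeholder, not JGroups
        void send(String destination, String updateId, byte[] payload);
    }
    public interface UpdateApplier {
        void apply(String updateId, byte[] payload);
    }

    private final String self;
    private final List<String> viewMembers;
    private final Transport transport;
    private final UpdateApplier applier;

    private final Map<String, byte[]> pending = new HashMap<>();
    private final Map<String, Set<String>> receivedBy = new HashMap<>();
    private final Set<String> delivered = new HashSet<>();

    public UniformDeliverySketch(String self, List<String> viewMembers,
                                 Transport transport, UpdateApplier applier) {
        this.self = self;
        this.viewMembers = viewMembers;
        this.transport = transport;
        this.applier = applier;
    }

    // Called whenever an update (or a relay of it) arrives from another member.
    public void onReceive(String from, String updateId, byte[] payload) {
        if (delivered.contains(updateId)) {
            return;
        }
        Set<String> holders = receivedBy.computeIfAbsent(updateId, k -> new HashSet<>());
        holders.add(from);
        if (pending.putIfAbsent(updateId, payload) == null) {
            holders.add(self);
            // Relay once to everybody else; the relay doubles as an ack.
            for (String member : viewMembers) {
                if (!member.equals(self) && !member.equals(from)) {
                    transport.send(member, updateId, payload);
                }
            }
        }
        // Apply only once the whole view is known to hold the update, which
        // rules out the B1 scenario above (B1 exposing an update that is lost).
        if (holders.containsAll(viewMembers)) {
            delivered.add(updateId);
            applier.apply(updateId, pending.remove(updateId));
        }
    }
}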

Cheers,

     Paolo

[1] Uniform Reliable Multicast in a Virtually Synchronous Environment,
citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.5421

> Cheers
> Manik
>
> --
> Manik Surtani
> [hidden email]
> twitter.com/maniksurtani
>
> Lead, Infinispan
> http://www.infinispan.org
>
>
>
>
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev


--

Paolo Romano, PhD
Senior Researcher
INESC-ID
Rua Alves Redol, 9
1000-059, Lisbon Portugal
Tel. + 351 21 3100300
Fax  + 351 21 3145843
Webpage http://www.gsd.inesc-id.pt/~romanop

_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev
Reply | Threaded
Open this post in threaded view
|

Re: [infinispan-dev] [Cloudtm-discussion] [SPAM] Re: Primary-Backup replication scheme in Infinispan

Manik Surtani

On 9 Mar 2011, at 18:58, Paolo Romano wrote:

> On 3/9/11 11:41 AM, Manik Surtani wrote:
>> Paolo,
>>
>>
>> On 17 Feb 2011, at 19:51, Paolo Romano wrote:
>>
>>> First an important premise, which I think was not clear in their previous message. We are here considering the *full replication mode* of Infinispan, in which every node maintains all copies of the same data items. This implies that there is no need to fetch data from remote nodes during transaction execution. Also, Infinispan was configured NOT TO use eager locking. In other words, during transaction's execution Infinispan acquires locks only locally.
>> So are you suggesting that this scheme maintains a single, global master node for the entire cluster, for *all* keys?  Doesn't this become a bottleneck, and how do you deal with the master node failing?
> Hi Manik,
>
> of course the primary (or master) can become a bottleneck if the number of update transactions is very large. If the % of write transactions is very high, however, then we have to distinguish two cases: low vs high contention.
>
> At high contention, in fact, the 2PC-based replication scheme used by Infinispan (2PC from now on, for the sake of brevity ;-) ) falls prey to deadlocks and starts thrashing. This is the reason why 2PC's performance is so poor in the plot attached to Diego and Sebastiano's mail for the case of 1000 keys. With primary-backup, since concurrency is regulated locally at the primary, and much more efficiently, the overall performance is much better.

This is even with deadlock detection enabled?

Cheers
Manik

--
Manik Surtani
[hidden email]
twitter.com/maniksurtani

Lead, Infinispan
http://www.infinispan.org




_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev
Reply | Threaded
Open this post in threaded view
|

Re: [infinispan-dev] [Cloudtm-discussion] [SPAM] Re: Primary-Backup replication scheme in Infinispan

Paolo Romano
On 3/10/11 11:44 AM, Manik Surtani wrote:
> On 9 Mar 2011, at 18:58, Paolo Romano wrote:
>
> <snip>
>> <snip>
> This is even with deadlock detection enabled?

We have not experimented with enabling deadlock detection yet. We'll do it
and let you know!

Cheers,

     Paolo
_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev
Reply | Threaded
Open this post in threaded view
|

Re: [infinispan-dev] [Cloudtm-discussion] [SPAM] Re: Primary-Backup replication scheme in Infinispan

Manik Surtani

On 10 Mar 2011, at 18:41, Paolo Romano wrote:

> On 3/10/11 11:44 AM, Manik Surtani wrote:
>> On 9 Mar 2011, at 18:58, Paolo Romano wrote:
>>
>> <snip>
>>> <snip>
>> This is even with deadlock detection enabled?
>
> We have not experimented to enable deadlock detection yet. We'll do it
> and let you know!

Ok, it gets pretty heavily used in the community so there must be some good in it.  :-)

--
Manik Surtani
[hidden email]
twitter.com/maniksurtani

Lead, Infinispan
http://www.infinispan.org




_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev
Reply | Threaded
Open this post in threaded view
|

Re: [infinispan-dev] Primary-Backup replication scheme in Infinispan

Sebastiano Peluso
In reply to this post by Manik Surtani
Hi Manik,

On 3/9/11 11:30 AM, Manik Surtani wrote:
> Sorry I haven't responded on this thread till now, somehow slipped my attention.  P/B is interesting, I have a few thoughts:
>
> * How do you pick which is the primary?
         The primary coincides with the coordinator of the JGroups view
(EmbeddedCacheManager.isCoordinator()).
> First owner on a list of owners for a key obtained via consistent hash?  What if this owner is not the peer on which your transaction is running?  Do you eagerly acquire locks on the single owner?
>
> * How does this deal with transactions spanning > 1 key, mapped to > 1 "primary" node?
     As Paolo already said, we use the full replication mode, in which every node maintains all copies of the same data items, and we don't adopt an eager locking strategy.
> * I presume you introduce a window of inconsistency between (a) the point where a tx commits on the primary node and (b) the point where the primary node pushes changes to its backups.  Is the latter an async or batched process?
     Why? Before a tx finalizes the commit on the primary, all the
"modifications" are synchronously replicated to the other nodes (the
backups). This means that the primary waits for acknowledgments from all
backups and then it can apply the updates to the data container and
release the locks relevant to the written items.
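
In other words, the primary-side commit path boils down to the following ordering (a pseudo-sketch with placeholder interfaces, not the actual interceptor code):

public class PrimaryCommitOrderSketch {

    public interface Backups {          // placeholder abstraction
        void broadcastAndWaitForAllAcks(Object updateBatch);
    }
    public interface LocalStore {       // placeholder abstraction
        void applyUpdates(Object updateBatch);
        void releaseWriteLocks(Object updateBatch);
    }

    private final Backups backups;
    private final LocalStore store;

    public PrimaryCommitOrderSketch(Backups backups, LocalStore store) {
        this.backups = backups;
        this.store = store;
    }

    // Order matters: (1) synchronous replication, (2) local apply, (3) lock release.
    public void commitOnPrimary(Object updateBatch) {
        backups.broadcastAndWaitForAllAcks(updateBatch); // blocks until all backups ack
        store.applyUpdates(updateBatch);                 // only now make writes visible
        store.releaseWriteLocks(updateBatch);            // finally let conflicting txs in
    }
}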
> * Any thoughts on why - as per your graphs - P/B performance degrades as cluster size increases?
     The first attached graph highlights that this behavior is influenced by the time spent committing a transaction executed on the primary node (a write transaction). In fact, as cluster size increases, the average commit duration increases while the average execution duration (after-commit) remains more or less uniform. Given that we are using UDP as the transport layer in the JGroups stack, we suppose that the aforementioned behavior is caused by an increase in network collisions on the acknowledgments from the backups.

To verify this assumption, we ran a test with TCP rather than UDP using Primary-Backup only. The results are reported in the second attached graph, highlighting that, as the number of nodes grows, TCP achieves remarkably higher performance than UDP. This is somewhat surprising, given that the overhead to disseminate the updates on the primary should be lower with UDP than with TCP.
The only explanation we could come up with so far is the increase of collisions on the return of the ACKs from the backups, as we already mentioned.
What do you think? Do you know whether JGroups gathers any statistics about UDP collisions? That would of course allow us to confirm/reject our theory.
As we were curious, we also ran a comparative test with 2PC (with deadlock detection and 1000 keys) with UDP vs TCP (also attached). The performance here does not really change... we imagine that this depends on the fact that, in this case, since we are at high contention, the nodes are less synchronized in the return of the vote messages to the coordinator (due to the high probability of encountering lock contention), and that this may reduce the number of collisions.

The results were obtained by running the same experiments described in
the first mail (in the case of 1000 keys).

Cheers
    Sebastiano

> Cheers
> Manik
>
> On 11 Feb 2011, at 17:56, Sebastiano Peluso wrote:
>
>> Hi all,
>>
>> first let us introduce ourselves given that it's the first time we write in this mailing list.
>>
>> Our names are Sebastiano Peluso and Diego Didona, and we are working at INESC-ID Lisbon in the context of the Cloud-TM project. Our work is framed in the context of self-adaptive replication mechanisms (please refer to previous message by Paolo on this mailing list and to the www.cloudtm.eu website for additional details on this project).
>>
>> Up to date we have been working on developing a relatively simple primary-backup (PB) replication mechanism, which we integrated within Infinispan 4.2 and 5.0. In this kind of scheme only one node (called the primary) is allowed to process update transactions, whereas the remaining nodes only process read-only transactions. This allows coping very efficiently with write transactions, as the primary does not have to incur in the cost of remote coordination schemes nor in distributed deadlocks, which can hamper performance at high contention with two phase commit schemes. With PB, in fact, the primary can serialize transactions locally, and simply propagate the updates to the backups for fault-tolerance as well as to allow them to process read-only transactions on fresh data of course. On the other hand, the primary is clearly prone to become the bottleneck especially in large clusters and write intensive workloads.
>>
>> Thus, this scheme does not really represent a replacement for the default 2PC protocol, but rather an alternative approach that results particularly attractive (as we will illustrate in the following with some radargun-based performance results) in small scale clusters, or, in "elastic" cloud scenarios, in periods where the workload is either read dominated or not very intense. Being this scheme particularly efficient in these scenarios, in fact, its adoption would allow to minimize the number of resources acquired from the cloud provider in these periods, with direct benefits in terms of cost reductions. In Cloud-TM, in fact, we aim at designing autonomic mechanisms that would dynamically switch among multiple replication mechanisms depending on the current workload characteristics.
>>
>> Before discussing the results of our preliminary benchmarking study, we would like to briefly overview how we integrated this replication mechanism within Infinispan. Any comment/feedback is clearly highly appreciated. First of all we have defined a new command, namely PassiveReplicationCommand, that is a subclass of PrepareCommand. We had to define a new command because we had to design customized "visiting" methods for the interceptors. Note that our protocol only affects the commit phase of a transaction, specifically, during the prepare phase, in the prepare method of the TransactionXaAdapter class, if the Primary Backup mode is enabled, then a PassiveReplicationCommand is built by the CommandsFactory and it is passed to the invoke method of the invoker. The PassiveReplicationCommand is then visited by the all the interceptors in the chain, by means of the visitPassiveReplicationCommand methods. We describe more in detail the operations performed by the non-trivial int!
>   erceptors:
>> -TxInterceptor: like in the 2PC protocol, if the context is not originated locally, then for each modification stored in the PassiveReplicationCommand the chain of interceptors is invoked.
>> -LockingInterceptor: first the next interceptor is called, then the cleanupLocks is performed with the second parameter set to true (commit the operations). This operation is always safe: on the primary it is called only after that the acks from all the slaves are received  (see the ReplicationInterceptor below); on the slave there are no concurrent conflicting writes since these are already locally serialized by the locking scheme performed at the primary.
>> -ReplicationInterceptor: first it invokes the next iterceptor; then if the predicate shouldInvokeRemoteTxCommand(ctx) is true, then the method rpcManager.broadcastRpcCommand(command, true, false) is performed, that replicates the modifications in a synchronous mode (waiting for an explicit ack from all the backups).
>>
>> As for the commit phase:
>> -Given that in the Primary Backup the prepare phase works as a commit too, the commit method on a TransactionXaAdapter object in this case simply returns.
>>
>> On the resulting extended Infinispan version a subset of unit/functional tests were executed and successfully passed:
>> - commands.CommandIdUniquenessTest
>> - replication.SyncReplicatedAPITest
>> - replication.SyncReplImplicitLockingTest
>> - replication.SyncReplLockingTest
>> - replication.SyncReplTest
>> - replication.SyncCacheListenerTest
>> - replication.ReplicationExceptionTest
>>
>>
>> We have tested this solution using a customized version of Radargun. Our customizations were first of all aimed at having each thread accessing data within transactions, instead of executing single put/get operations. In addition, now every Stresser thread accesses with uniform probability all of the keys stored by Infinispan, thus generating conflicts with a probability proportional to the number of concurrently active threads and inversely proportional to the total number of keys maintained.
>>
>> As already hinted, our results highlight that, depending on the current workload/number of nodes in the system, it is possible to identify scenarios where the PB scheme significantly outperforms the current 2PC scheme, and vice versa. Our experiments were performed in a cluster of homogeneous 8-core (Xeon@2.16GhZ) nodes interconnected via a Gigabit Ethernet and running a Linux Kernel 64 bit version 2.6.32-21-server. The results were obtained by running for 30 seconds 8 parallel Stresser threads per nodes, and letting the number of node vary from 2 to 10. In 2PC, each thread executes transactions which consist of 10 (get/put) operations, with a 10% of probability of generating a put operation. With PB, the same kind of transactions are executed on the primary, but the backups execute read-only transactions composed of 10 get operations. This allows to compare the maximum throughput of update transactions provided by the two compared schemes, without excessively favoring PB !
>   by keeping the backups totally idle.
>> The total write transactions' throughput exhibited by the cluster (i.e. not the throughput per node) is shown in the attached plots, relevant to caches containing 1000, 10000 and 100000 keys. As already discussed, the lower the number of keys, the higher the chance of contention and the probability of aborts due to conflicts. In particular with the 2PC scheme the number of failed transactions steadily increases at high contention up to 6% (to the best of our understanding in particular due to distributed deadlocks). With PB, instead the number of failed txs due to contention is always 0.
>>
>> Note that, currently, we are assuming that backup nodes do not generate update transactions. In practice this corresponds to assuming the presence of some load balancing scheme which directs (all and only) update transactions to the primary node, and read transactions to the backup. In the negative case (a put operation is generated on a backup node), we simply throw a PassiveReplicationException at the CacheDelegate level. This is probably suboptimal/undesirable in real settings, as update transactions may be transparently rerouted (at least in principle!) to the primary node in a RPC-style. Any suggestion on how to implement such a redirection facility in a transparent/non-intrusive manner would be highly appreciated of course! ;-)
>>
>> To conclude, we are currently working on a statistical model that is able to predict the best suited replication scheme given the current workload/number of machines, as well as on a mechanism to dynamically switch from one replication scheme to the other.
>>
>>
>> We'll keep you posted on our progresses!
>>
>> Regards,
>>    Sebastiano Peluso, Diego Didona
>> <PCvsPR_WRITE_TX_1000keys.png><PCvsPR_WRITE_TX_10000keys.png><PCvsPR_WRITE_TX_100000keys.png>_______________________________________________
>> infinispan-dev mailing list
>> [hidden email]
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
> --
> Manik Surtani
> [hidden email]
> twitter.com/maniksurtani
>
> Lead, Infinispan
> http://www.infinispan.org
>
>
>
>
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev

--
Sebastiano Peluso
GSD  INESC-ID


_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

AvgCommitDuration_vs_AvgExecDuration_PassiveReplication.png (118K) Download Attachment
PassiveReplicationUDPvsTCP_1000keys.png (47K) Download Attachment
TwoPhaseCommitUDPvsTCP_1000keys.png (50K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [infinispan-dev] [Cloudtm-discussion] [SPAM] Re: Primary-Backup replication scheme in Infinispan

Sebastiano Peluso
In reply to this post by Manik Surtani
Hi Manik,

we have executed the same experiments described in the first mail, introducing test cases for 2-Phase Commit with deadlock detection (spinDuration="500").
Indeed, with deadlock detection on, at high contention there is a significant speed-up... but Primary-Backup still delivers the best performance, especially at high contention (1000 keys) independently of the number of nodes (at least up to 10), and at medium contention (10000 keys) up to 6 nodes. The results are in the attached graphs (1000 keys, 10000 keys, 100000 keys).
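
For reference, deadlock detection is a per-cache switch in the Infinispan configuration; as far as we recall the 4.x/5.0 XML schema, enabling it with the spin duration used here looks roughly like the snippet below (quoted from memory, so please double-check the element placement against your Infinispan version):

<namedCache name="benchmarkCache">
   <deadlockDetection enabled="true" spinDuration="500"/>
</namedCache>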

Cheers
    Sebastiano

On 3/11/11 12:28 PM, Manik Surtani wrote:

> On 10 Mar 2011, at 18:41, Paolo Romano wrote:
>
>> On 3/10/11 11:44 AM, Manik Surtani wrote:
>>> On 9 Mar 2011, at 18:58, Paolo Romano wrote:
>>>
>>> <snip>
>>>> <snip>
>>> This is even with deadlock detection enabled?
>> We have not experimented to enable deadlock detection yet. We'll do it
>> and let you know!
> Ok, it gets pretty heavily used in the community so there must be some good in it.  :-)
>
> --
> Manik Surtani
> [hidden email]
> twitter.com/maniksurtani
>
> Lead, Infinispan
> http://www.infinispan.org
>
>
>
>
> _______________________________________________
> Cloudtm-discussion mailing list
> [hidden email]
> https://lists.sourceforge.net/lists/listinfo/cloudtm-discussion

--
Sebastiano Peluso
GSD  INESC-ID


_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

TwoPhaseCommitVsPassiveReplication_1000keys.png (48K) Download Attachment
TwoPhaseCommitVsPassiveReplication_10000keys.png (53K) Download Attachment
TwoPhaseCommitVsPassiveReplication_100000keys.png (58K) Download Attachment