[infinispan-dev] Data center replication

[infinispan-dev] Data center replication

Erik Salter

Hi all,

 

Bela was kind enough to have a discussion with me last week regarding my data center replication requirements. 

 

At a high level, I have 3 independent data centers (sites A, B, C).  The latency between data centers is high (~200ms round trip), so initially I was thinking about using a backing store (like Cassandra) to handle the replication between data centers.  Each center would have its own individual grid to manage “local” resources.  So when a local TX is committed successfully, it is replicated to the stores in the other data centers.  That way, on a data center failure, requests can be directed to the other data centers, which load the data from the backing store.

 

The largest drawback:  certain distributed applications require highly serialized access to resources in the grid, which means lots of explicit locking of keys in a single transaction.  If a request is directed to, say, Data Center B because of an intermittent failure of Data Center A, there is a possibility that a stale copy of the resource is still resident in that grid.  It naturally follows that there will have to be application logic for the grid in each data center to know which resources it “owns”.  And once the backing store gets an update from another data center, it will need to aggressively evict non-owned resources from the grid.

 

I (and the customer) would like to use a single data grid across multiple data centers.  Bela detailed a candidate solution based on JGroups RELAY.

 

- When doing a 2PC, Infinispan broadcasts the PREPARE to all nodes (in A, B and C). *However*, it only expects responses from *local* nodes (in this case nodes in data center A).  Infinispan knows its own siteId and can extract the siteId from every address, so it can take the current view (say A1, A2, A3 ... A10, B1-B10, C1-C10) and remove non-local nodes, arriving at a sanitized list A1-A10. This means it expects responses to its PREPARE message only from A1-A10. When it receives responses from non-local nodes, it simply discards them (a rough sketch of this filtering follows below).

- On rollback, a ROLLBACK(TX) message is broadcast to the entire virtual cluster (A, B and C)

- On commit, a COMMIT(TX) is broadcast to the entire virtual cluster (A, B and C).
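
Roughly, the response filtering might look like this (just a sketch; the class and
method names below are invented, and site IDs are modeled as the leading letter of
an address such as "A3"):

import java.util.ArrayList;
import java.util.List;

public class SiteFilter {

    // Hypothetical: extract the site ID from an address such as "A3" -> "A".
    static String siteIdOf(String address) {
        return address.substring(0, 1);
    }

    // Keep only the members of the virtual view that share our site ID.
    // PREPARE responses from any other member are simply dropped.
    static List<String> localMembers(List<String> virtualView, String mySiteId) {
        List<String> local = new ArrayList<String>();
        for (String member : virtualView) {
            if (siteIdOf(member).equals(mySiteId)) {
                local.add(member);
            }
        }
        return local;
    }
}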

 

The downside here is that the 2PC won't be atomic, in the sense that it is only atomic for A, but not for B or C. A PREPARE might fail on a node in B or C, but the 2PC won't get rolled back as long as all nodes in A sent back a successful PREPARE-OK response. This is the same, though, in the current solution.

 

Comments?  Thoughts?

 

Erik Salter

[hidden email]

Software Architect

BNI Video

Cell: (404) 317-0693





Re: [infinispan-dev] Data center replication

Bela Ban


On 7/25/11 6:06 PM, Erik Salter wrote:

> [...]
>
> Comments?  Thoughts?



Let's look at a use case (btw, this is all in Infinispan's DIST mode,
with numOwners=3 and TopologyAwareConsistentHash):

- The local clusters are {A,B,C} and {D,E,F}.

- RELAY connected them together into a virtual cluster {A,B,C,D,E,F}.

- Infinispan (currently) only knows about the virtual cluster; it has no
knowledge about the local clusters

- A TX is started
- Now we have a PUT1(K,V)
- TACH decides that K,V needs to be stored on B, C and F

- Now we have a PUT2(K2,V2)
- TACH decides that K2,V2 needs to be stored on A,B and D
- The TX commits

- I'm not sure how the 2PC happens with mode=DIST, but I assume the
participant set of the 2PC is the set of hosts touched by modifications:
A,B,C,D,F
- So we have all 3 hosts in the local cluster and 2 hosts in the remote
cluster
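
If that assumption is right, the participant set could be computed roughly like
this (a sketch only; the ConsistentHash interface below is a stand-in for
illustration, not Infinispan's actual API):

import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class ParticipantSetSketch {

    // Stand-in for a consistent hash that maps a key to its numOwners owners.
    interface ConsistentHash {
        List<String> locate(Object key, int numOwners);
    }

    // 2PC participant set = union of the owners of every key the TX modified.
    static Set<String> participants(ConsistentHash hash, Set<Object> modifiedKeys,
                                    int numOwners) {
        Set<String> targets = new LinkedHashSet<String>();
        for (Object key : modifiedKeys) {
            targets.addAll(hash.locate(key, numOwners)); // e.g. K -> {B,C,F}, K2 -> {A,B,D}
        }
        return targets; // union: {A,B,C,D,F}
    }
}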

#1 The PREPARE phase now sends a PREPARE to A,B,C,D,F
#2 If successful --> COMMIT to A,B,C,D,F
#3 If not, ROLLBACK to the same members

If this is the way 2PC over DIST is done today, I suggest the following
new TX interceptor (RelayTransactionInterceptor ?):
- RTI knows about sites (perhaps TACH injects siteId into it ?)
- The modified steps #1-3 above would now be:

#1 The PREPARE message is sent to all hosts touched by modifications,
but we only wait for acks from hosts from the local cluster (same siteId).
      - So we send PREPARE to A,B,C,D,F, but return as soon as we have
valid (or invalid) responses from A,B,C
      - We discard PREPARE acks from non-local members, such as D or F
      - D and F may or may not be able to apply the changes !

#2 If successful --> COMMIT to A,B,C, and an *asynchronous* COMMIT to D
and F

#3 If not --> ROLLBACK to A,B,C, and an *asynchronous* ROLLBACK to D and F
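
In rough Java-ish form, the modified flow would be something like the sketch
below (all names are invented for illustration; this is not an actual Infinispan
or JGroups API):

import java.util.ArrayList;
import java.util.List;

class RelayTxFlowSketch {

    // Hypothetical RPC abstraction.
    interface Rpc {
        boolean sendSyncAndWait(List<String> targets, String command); // true if all targets ack OK
        void sendAsync(List<String> targets, String command);          // fire-and-forget
    }

    static String siteIdOf(String address) {
        return address.substring(0, 1); // "A" from "A3", etc.
    }

    static void twoPhaseCommit(List<String> participants, String mySiteId, Rpc rpc) {
        List<String> local = new ArrayList<String>();
        List<String> remote = new ArrayList<String>();
        for (String p : participants) {
            (siteIdOf(p).equals(mySiteId) ? local : remote).add(p);
        }

        // #1: PREPARE goes to every participant, but we only block on local acks.
        rpc.sendAsync(remote, "PREPARE");
        boolean localOk = rpc.sendSyncAndWait(local, "PREPARE");

        if (localOk) {
            rpc.sendSyncAndWait(local, "COMMIT");   // #2: synchronous within the local site
            rpc.sendAsync(remote, "COMMIT");        //     best-effort to remote sites
        } else {
            rpc.sendSyncAndWait(local, "ROLLBACK"); // #3: same split for rollback
            rpc.sendAsync(remote, "ROLLBACK");
        }
    }
}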


Of course, the downside to this approach is that updates to the remote
cluster may or may not get applied. If this is critical, perhaps a
scheme that does a periodic retry of the updates could be used ?
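
One possible shape for such a retry scheme (purely hypothetical; failed remote
updates are queued and re-sent periodically until they succeed):

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class RemoteUpdateRetrier {

    private final Queue<Runnable> pending = new ConcurrentLinkedQueue<Runnable>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    RemoteUpdateRetrier(long retryIntervalSeconds) {
        // Periodically re-send whatever is still pending.
        scheduler.scheduleWithFixedDelay(new Runnable() {
            public void run() {
                drain();
            }
        }, retryIntervalSeconds, retryIntervalSeconds, TimeUnit.SECONDS);
    }

    // Called whenever an async COMMIT/ROLLBACK to a remote site fails.
    void enqueue(Runnable remoteUpdate) {
        pending.add(remoteUpdate);
    }

    private void drain() {
        int toTry = pending.size(); // snapshot, so re-queued failures wait for the next pass
        for (int i = 0; i < toTry; i++) {
            Runnable update = pending.poll();
            if (update == null) {
                break;
            }
            try {
                update.run();            // retry the remote update
            } catch (RuntimeException e) {
                pending.add(update);     // still failing -> keep it for the next pass
            }
        }
    }
}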

Note that if we don't use TXs, then the updates to the remote cluster(s)
would have to be sent *asynchronously* (even if sync mode is
configured), too.

The changes to Infinispan would be:
- A new TX interceptor which knows about siteIds
- A new DistributionInterceptor which performs async updates to remote
clusters, possibly with a retry mechanism.

WDYT ?


--
Bela Ban
Lead JGroups (http://www.jgroups.org)
JBoss / Red Hat

Re: [infinispan-dev] Data center replication

Manik Surtani
Doesn't this leave things wide open for inconsistency in a number of ways:

1) PREPAREs fail or go missing on remote clusters
2) COMMITs fail or go missing on remote clusters
3) PREPARE and COMMIT get reordered on remote clusters
4) And you still have that window where cluster 1 fails and writes haven't propagated to cluster 2.  Same problem with Erik's Cassandra-based solution.

Is this level of inconsistency acceptable?

On 26 Jul 2011, at 12:21, Bela Ban wrote:

> [...]

--
Manik Surtani
[hidden email]
twitter.com/maniksurtani

Lead, Infinispan
http://www.infinispan.org





Re: [infinispan-dev] Data center replication

Bela Ban


On 7/27/11 6:01 PM, Manik Surtani wrote:
> Doesn't this leave things wide open for inconsistency in a number of ways:
>
> 1) PREPAREs fail or go missing on remote clusters
> 2) COMMITs fail or go missing on remote clusters
> 3) PREPARE and COMMIT gets reordered on remote clusters


Yes. I described this in my previous email.

This is no better or worse than Erik's current design with Cassandra.
AFAIU, Erik is willing to make this tradeoff, as he cannot have
transactions spanning data centers; the main argument is the 200ms
latency, which would slow things down...


> 4) And you still have that window where cluster 1 fails and writes haven't propagated to cluster 2.  Same problem with Erik's Cassandra-based solution.
>
> Is this level of inconsistency acceptable?
> [...]

--
Bela Ban
Lead JGroups (http://www.jgroups.org)
JBoss / Red Hat