[infinispan-dev] [ISPN-78] Alternative interface for writing large objects


[infinispan-dev] [ISPN-78] Alternative interface for writing large objects

Olaf Bergner
I've started working on ISPN-78 - Large Object Support - closely
following Manik's design document
http://community.jboss.org/wiki/LargeObjectSupport. As a starting point
I'm currently trying to implement

OutputStream writeToKey(K key),

which, for the time being, I chose to declare on AdvancedCache rather
than on Cache proper.

While thinking about the implications, I stumbled upon a few questions
which may well be owing to my lack of knowledge about Infinispan's inner
workings.

1. OutputStream writeToKey(K key) implies that the interaction between
user code and Infinispan happens in the OutputStream returned. Contrary
to existing methods, there would be no well defined *single* point where
control passes from user code to Infinispan. Instead, a user would write
a few bytes, passing control to Infinispan. Infinispan would buffer
those bytes and return control to the user until a preconfigured chunk
size is reached, whereupon Infinispan would probably issue a more or
less standard request to store that chunk on some node. Rinse and repeat.
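The buffering described above can be sketched as a minimal stand-alone stream. A plain List stands in for the cache, and the class and method names are purely illustrative; a real implementation would issue a put per chunk instead:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

/**
 * Illustrative sketch only: buffers user writes and emits a chunk
 * whenever chunkSize bytes have accumulated. The List stands in for
 * the cache; each flushChunk() would be one put in Infinispan.
 */
class ChunkingOutputStream extends OutputStream {
    private final int chunkSize;
    private final List<byte[]> chunkStore; // stand-in for the cache
    private ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    ChunkingOutputStream(int chunkSize, List<byte[]> chunkStore) {
        this.chunkSize = chunkSize;
        this.chunkStore = chunkStore;
    }

    @Override
    public void write(int b) throws IOException {
        buffer.write(b);
        if (buffer.size() >= chunkSize) {
            flushChunk(); // "rinse and repeat"
        }
    }

    private void flushChunk() {
        chunkStore.add(buffer.toByteArray()); // store one chunk
        buffer = new ByteArrayOutputStream();
    }

    @Override
    public void close() throws IOException {
        if (buffer.size() > 0) {
            flushChunk(); // final, possibly partial, chunk
        }
    }
}
```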

This is certainly doable but leaves me wondering where that proposed
ChunkingInterceptor might come into play. It is my current understanding
that interceptors, well, intercept commands, and in this scenario there
could not be such a PutLargeObjectCommand as, as I said, there is no
single point where control passes from user code to Infinispan. Instead,
chunking would have to be done directly within the CacheOutputStream
returned, and the only commands involved would be the more or less
standard PutKeyValueCommands mentioned above. In this scenario there
wouldn't be any PutLargeObjectCommand to encapsulate the whole process.

Does this make sense? If so, I'd prefer to have

void writeToKey(K key, InputStream largeObject)

instead. Thus, after calling this method, control would be handed over to
Infinispan until that LargeObject is stored, and we could indeed have
some PutLargeObjectCommand to encapsulate the whole process.
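A minimal sketch of this pull variant, with a Map standing in for the cache and the (key + "#" + chunkIndex) naming purely illustrative:

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Map;

/**
 * Illustrative sketch only: control passes to the cache once, which
 * drains the InputStream in chunkSize blocks. The Map stands in for
 * Infinispan; a real implementation would put each chunk via a command.
 */
class LargeObjectWriter {
    /** Returns the number of chunks written (candidate metadata). */
    static int writeToKey(String key, InputStream in, int chunkSize,
                          Map<String, byte[]> cache) throws IOException {
        byte[] buf = new byte[chunkSize];
        int chunkIndex = 0;
        int read;
        while ((read = readUpTo(in, buf)) > 0) {
            cache.put(key + "#" + chunkIndex, Arrays.copyOf(buf, read));
            chunkIndex++;
        }
        return chunkIndex;
    }

    // Reads up to buf.length bytes, looping until EOF or the buffer is full.
    private static int readUpTo(InputStream in, byte[] buf) throws IOException {
        int total = 0;
        int n;
        while (total < buf.length
                && (n = in.read(buf, total, buf.length - total)) != -1) {
            total += n;
        }
        return total;
    }
}
```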

2. For the mapping Key -> LargeObjectMetadata I would intuitively choose
to use a dedicated "system" cache. Is this the Infinispan way of doing
things? If so, where can I find some code to use as a template? If not,
what would be an alternative approach that is in keeping with
Infinispan's architecture?

3. The design suggests to use a fresh UUID as the key for each new
chunk. While this in all likelihood gives us a unique new key for each
chunk I currently fail to see how that guarantees that this key maps to
a node that is different from all the nodes already used to store chunks
of the same Large Object. But then again I know next to nothing about
Infinispan's consistent hashing algorithm.

4. Finally, the problem regarding eager locking and transactions
mentioned in Manik's comment seems rather ... hairy. If we indeed forego
transactions readers of a key just being written shouldn't be affected
provided we write the LargeObjectMetadata object only after all chunks
have been written. But what about writers?

I hope this makes sense. Otherwise, don't hesitate to point out where I
went wrong and I will happily retreat to the drawing board.

Cheers,
Olaf


_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] [ISPN-78] Alternative interface for writing large objects

Elias Ross
On Tue, Mar 29, 2011 at 3:49 AM, Olaf Bergner <[hidden email]> wrote:

> I've started working on ISPN-78 - Large Object Support - closely
> following Manik's design document
> http://community.jboss.org/wiki/LargeObjectSupport. As a starting point
> I'm currently trying to implement
>
> OutputStream writeToKey(K key),

As it's a common use case that data is being streamed from disk (or a
socket) anyway, and there's no good "pipe" in the Java SDK, your API
change is an improvement.

What you want to express is:

Cache<K, V> c;
c.write(key, new FileInputStream("/var/db/bigfile.txt"));

But being handed an OutputStream is good in these cases:
1. If an exception is thrown from an InputStream (disk error), the
exception doesn't have to come through Infinispan. (I suggest the API
support IOException.)
2. A user can better compose the output. For example, if you want to
add, say, a header to a file being read from disk, it's much easier to
do a series of write operations, like os.write(header), os.write(data).
Still, I wouldn't recommend that.
3. If you want to append new data.

I think it'd be BEST if you could support both models. I would add:

interface Cache<K, V> {
  /**
   * Returns a new or existing LargeObject object for the following key.
   * @throws ClassCastException if the key exists and is not a LargeObject.
   */
  LargeObject largeObject(K key);
}

Use:

Cache<K, LargeObject> c;
c.largeObject(key).append(new FileInputStream(...));
- or -
c.largeObject(key);
/// some time passes ///
OutputStream os = c.largeObject(key).getAppendStream();
os.write("more data now".getBytes());
os.close(); // flushes data to Cache

public abstract class LargeObject {
  transient final Cache cache;
  transient final Object key;
  int chunks;
  final int chunkSize;
  long totalSize;

  /** Constructor intended only for Cache itself. But should allow
      subclassing for tests. */
  protected LargeObject(Cache cache, Object key, int chunkSize) {
    this.cache = cache;
    this.key = key;
    this.chunkSize = chunkSize;
  }

  /** Data is written to Cache and not entirely stored until the
      stream is closed or flushed. */
  public abstract OutputStream getAppendStream();

  /** Data is read until EOF, then the stream is closed. */
  public abstract void append(InputStream is);

  /** Should support "seek", "skip" and "available" methods. */
  public abstract InputStream getInput();

  public abstract long getTotalSize();

  public abstract void truncate(long length);

  protected abstract void remove();
}

> This is certainly doable but leaves me wondering where that proposed
> ChunkingInterceptor might come into play.

I would think ideally you don't need to create any new commands. Fewer
protocol messages are better.

You do need to deal with the case of "remove": Ultimately, you will
need to call LargeObject.remove().

> 3. The design suggests to use a fresh UUID as the key for each new
> chunk. While this in all likelihood gives us a unique new key for each
> chunk I currently fail to see how that guarantees that this key maps to
> a node that is different from all the nodes already used to store chunks
> of the same Large Object. But then again I know next to nothing about
> Infinispan's constant hashing algorithm.

I wouldn't use UUID. I'd just store (K, #) where # is the chunk.
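A sketch of such a composite (K, #) chunk key. The class name is illustrative; the point is a deterministic equals/hashCode, so chunk keys can be recomputed from the object key and an index instead of being tracked as UUIDs:

```java
/**
 * Illustrative sketch of a (K, #) chunk key: derived deterministically
 * from the large object's key and a chunk index, so no UUID bookkeeping
 * is needed to find or delete chunks.
 */
final class ChunkKey {
    private final Object largeObjectKey;
    private final int chunkIndex;

    ChunkKey(Object largeObjectKey, int chunkIndex) {
        this.largeObjectKey = largeObjectKey;
        this.chunkIndex = chunkIndex;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof ChunkKey)) return false;
        ChunkKey other = (ChunkKey) o;
        return chunkIndex == other.chunkIndex
            && largeObjectKey.equals(other.largeObjectKey);
    }

    @Override
    public int hashCode() {
        // Combine both parts so distinct chunks hash differently.
        return 31 * largeObjectKey.hashCode() + chunkIndex;
    }
}
```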

>
> 4. Finally, the problem regarding eager locking and transactions
> mentioned in Manik's comment seems rather ... hairy. If we indeed forego
> transactions readers of a key just being written shouldn't be affected
> provided we write the LargeObjectMetadata object only after all chunks
> have been written. But what about writers?

I would think a use case for this API would be streaming audio or
video, maybe something like access logs even?

In which case, you would want to read while you're writing. So,
locking shouldn't be imposed. I would say, rely on the transaction
manager to keep a consistent view. If transactions aren't being used,
then the user might see some unexpected behavior. The API could
compensate for that.

Re: [infinispan-dev] [ISPN-78] Alternative interface for writing large objects

Olaf Bergner
On 30.03.11 02:32, Elias Ross wrote:
> I think it'd be BEST if you could support both models. I would add:
>
> interface Cache {
>    /**
>     * Returns a new or existing LargeObject object for the following key.
>     * @throws ClassCastException if the key exists and is not a LargeObject.
>     */
>    LargeObject largeObject(K key);
> }
OK, I'll keep that on my todo list, yet for the time being I've opted to
start with implementing void writeToKey(K key, InputStream largeObject).
>> This is certainly doable but leaves me wondering where that proposed
>> ChunkingInterceptor might come into play.
> I would think ideally you don't need to create any new commands. Less
> protocol messages is better.
It is my understanding that PutKeyValueCommand will *always* attempt to
read the current value stored under the given key first. I'm not sure we
want this in our situation, where that current value may be several GB
in size. Anyway, it should be easy to refactor if reusing
PutKeyValueCommand proves viable.
>> 3. The design suggests to use a fresh UUID as the key for each new
>> chunk. While this in all likelihood gives us a unique new key for each
>> chunk I currently fail to see how that guarantees that this key maps to
>> a node that is different from all the nodes already used to store chunks
>> of the same Large Object. But then again I know next to nothing about
>> Infinispan's constant hashing algorithm.
> I wouldn't use UUID. I'd just store (K, #) where # is the chunk.
>
Since this is important and might reveal a fundamental misunderstanding
on my part, I need to sort this out before moving on. These are my
assumptions, please point out any errors:

1. We want to partition a large object into chunks since, by definition,
a large object is too big to be stored in a single node in the cluster.
It follows that it is paramount that no two chunks be stored in the same
node, correct?

2. Consistent hashing guarantees that any given key maps to *some* node in
the cluster. There is no way, however, for a key's creator to know to
which node exactly its key maps. In other words, there is no inverse
to the hash function, correct?

3. The current design mandates that for storing each chunk the existing
put(key, value) be reused, correct?

It follows that we have no way whatsoever of generating a set of keys
that guarantees that no two keys are mapped to the same node. In the
pathological case, *all* keys map to the same node, correct?
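The concern in the last point can be illustrated with a toy hash-to-node mapping (node = |hash| % numNodes is a deliberate simplification of consistent hashing, used here only to show that nothing forces distinct chunks onto distinct nodes):

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Toy illustration only: each derived chunk key hashes to *some* node,
 * but the key's creator does not choose which one. With more chunks
 * than nodes, some node necessarily holds several chunks (pigeonhole),
 * and in the pathological case all hashes could coincide.
 */
class NodeMapping {
    static int nodeFor(Object key, int numNodes) {
        return Math.floorMod(key.hashCode(), numNodes);
    }

    /** Distinct nodes used by the chunks of one large object. */
    static Set<Integer> nodesUsed(String largeObjectKey, int chunks, int numNodes) {
        Set<Integer> nodes = new HashSet<>();
        for (int i = 0; i < chunks; i++) {
            nodes.add(nodeFor(largeObjectKey + "#" + i, numNodes));
        }
        return nodes;
    }
}
```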
>> I would think a use case for this API would be streaming audio or
>> video, maybe something like access logs even?
>>
>> In which case, you would want to read while you're writing. So,
>> locking shouldn't be imposed. I would say, rely on the transaction
>> manager to keep a consistent view. If transactions aren't being used,
>> then the user might see some unexpected behavior. The API could
>> compensate for that.
>>
If I understand you correctly you propose two alternatives:

1. Use transactions, thus delegating all consistency requirements to the
transaction manager.

2. Don't use transactions and change the API so that readers may be told
that a large object they are interested in is currently being written.

Further, to support streaming use cases you propose that it should be
possible to read a large object while it is being written.

Is that correct?

Hmm, I need to think about this. If I understand Manik's comment and the
tx subsystem correctly each transaction holds its *entire* associated
state in memory. Thus, if we are to write all chunks of a given large
object within the scope of a single transaction we will blow up the
originator node's heap. Correct?

So many questions ...

Cheers,
Olaf


Re: [infinispan-dev] [ISPN-78] Alternative interface for writing large objects

Galder Zamarreno
In reply to this post by Olaf Bergner
Hi Olaf,

See below for comments:

On Mar 29, 2011, at 12:49 PM, Olaf Bergner wrote:

> I've started working on ISPN-78 - Large Object Support - closely
> following Manik's design document
> http://community.jboss.org/wiki/LargeObjectSupport. As a starting point
> I'm currently trying to implement
>
> OutputStream writeToKey(K key),
>
> which, for the time being, I chose to declare on AdvancedCache rather
> that on Cache proper.
>
> While thinking about the implications, I stumbled upon a few questions
> which may well be owing to my lack of knowledge about Infinispan's inner
> workings.
>
> 1. OutputStream writeToKey(K key) implies that the interaction between
> user code and Infinispan happens in the OutputStream returned. Contrary
> to existing methods, there would be no well defined *single* point where
> control passes from user code to Infinispan. Instead, a user would write
> a few bytes, passing control to Infinispan. Infinispan would buffer
> those bytes and return control to the user until a preconfigured chunk
> size is reached, whereupon Infinispan would probably issue a more or
> less standard request to store that chunk on some node. Rinse and repeat.
>
> This is certainly doable but leaves me wondering where that proposed
> ChunkingInterceptor might come into play. It is my current understanding
> that interceptors, well, intercept commands, and in this scenario there
> could not be such a PutLargeObjectCommand as, as I said, there is no
> single point where control passes from user code to Infinispan. Instead,
> chunking would have to be done directly within the CacheOutputStream
> returned, and the only commands involved would be the more or less
> standard PutKeyValueCommands mentioned above. In this scenario there
> wouldn't be any PutLargeObjectCommand to encapsulate the whole process.

Hmmmm, the initial step in writeToKey() is to create a map entry for the metadata, so the internal writeToKey() could indeed create a PutLargeObjectMetadataCommand and pass that down the interceptor stack. Or, more simply, a ChunkingInterceptor could implement visitPutKeyValue...() and keep an eye out for a transactional call that puts a LargeObjectMetadata; at that point, the interceptor could return a new specialised OutputStream, etc. The first suggestion would be more useful if you expect other normal cache commands such as get, etc. to deal with large-object-related cache calls in a different way, but I don't think that's the case here, since all the interaction would be via the Output/Input stream.


>
> Does this make sense? If so, I'd prefer to have
>
> void writeToKey(K key, InputStream largeObject)
>
> instead. Thus, after calling this method control would be handed over to
> Infinispan until that LargeObject is stored, and we could have indeed
> have some PutLargeObject command to encapsulate the whole process.
>
> 2. For the mapping Key -> LargeObjectMetadata I would intuitively choose
> to use a dedicated "system" cache. Is this the Infinispan way of doing
> things? If so, where can I find some code to use as a template? If not,
> what would be an alternative approach that is in keeping with
> Infinispan's architecture?

Yeah, this information would be stored in an internal cache. There are several examples of such caches, such as the topology cache for Hot Rod servers: when the server is started, it creates a configuration for this type of cache (i.e. REPL_SYNC, etc.) and the cache is named in a particular way, etc.

>
> 3. The design suggests to use a fresh UUID as the key for each new
> chunk. While this in all likelihood gives us a unique new key for each
> chunk I currently fail to see how that guarantees that this key maps to
> a node that is different from all the nodes already used to store chunks
> of the same Large Object. But then again I know next to nothing about
> Infinispan's constant hashing algorithm.

I think there's a service that will generate a key mapped to a particular node, so that might be a better option here to avoid all chunks going to the same node. I think Mircea might be able to help further with this.

>
> 4. Finally, the problem regarding eager locking and transactions
> mentioned in Manik's comment seems rather ... hairy. If we indeed forego
> transactions readers of a key just being written shouldn't be affected
> provided we write the LargeObjectMetadata object only after all chunks
> have been written. But what about writers?

Hmmmmm, I don't understand your question.

>
> I hope this makes sense. Otherwise, don't hesitate to point out where I
> went wrong and I will happily retreat to the drawing board.
>
> Cheers,
> Olaf
>
>

--
Galder Zamarreño
Sr. Software Engineer
Infinispan, JBoss Cache



Re: [infinispan-dev] [ISPN-78] Alternative interface for writing large objects

Galder Zamarreno

On Apr 4, 2011, at 10:09 AM, Galder Zamarreño wrote:

> Hi Olaf,
>
> See below for comments:
>
> On Mar 29, 2011, at 12:49 PM, Olaf Bergner wrote:
>
>> </snip>
>>
>> 3. The design suggests to use a fresh UUID as the key for each new
>> chunk. While this in all likelihood gives us a unique new key for each
>> chunk I currently fail to see how that guarantees that this key maps to
>> a node that is different from all the nodes already used to store chunks
>> of the same Large Object. But then again I know next to nothing about
>> Infinispan's constant hashing algorithm.
>
> I think there's a service that will generate a key mapped to particular node, so that might be a better option here to avoid all chunks going to the same node. I think Mircea might be able to help further with this.

Actually, it's not that simple. It needs to be adaptive, but that might be going into the territory of virtual nodes and the sizing of virtual nodes. The key thing when choosing the nodes to store the chunks is that there should be enough memory in the node where each chunk lands. IOW, if a 5GB DVD is being chunked into 100MB pieces, it would not make sense to send chunks to a node that does not have the memory to fit them.



Re: [infinispan-dev] [ISPN-78] Alternative interface for writing large objects

Galder Zamarreno
In reply to this post by Olaf Bergner

On Mar 31, 2011, at 7:46 AM, Olaf Bergner wrote:

> On 30.03.11 02:32, Elias Ross wrote:
>> I think it'd be BEST if you could support both models. I would add:
>>
>> interface Cache {
>>   /**
>>    * Returns a new or existing LargeObject object for the following key.
>>    * @throws ClassCastException if the key exists and is not a LargeObject.
>>    */
>>   LargeObject largeObject(K key);
>> }
> OK, I'll keep that on my todo list, yet for the time being I'v opted to
> start with implementing void writeToKey(K key, InputStream largeObject).
>>> This is certainly doable but leaves me wondering where that proposed
>>> ChunkingInterceptor might come into play.
>> I would think ideally you don't need to create any new commands. Less
>> protocol messages is better.
> It is my understanding that PutKeyValueCommand will *always* attempt to
> read the current value stored under the given key first. I'm not sure if
> we want this in our situation where that current value may be several GB
> in size. Anyway, it should be easy to refactor if reusing
> PutKeyValueCommand should prove viable.

The only reason it reads the previous value is to return it as part of the contract of "V put(K, V)" - but that can be skipped.

>>> 3. The design suggests to use a fresh UUID as the key for each new
>>> chunk. While this in all likelihood gives us a unique new key for each
>>> chunk I currently fail to see how that guarantees that this key maps to
>>> a node that is different from all the nodes already used to store chunks
>>> of the same Large Object. But then again I know next to nothing about
>>> Infinispan's constant hashing algorithm.
>> I wouldn't use UUID. I'd just store (K, #) where # is the chunk.
>>
> Since this is important and might reveal a fundamental misunderstanding
> on my part, I need to sort this out before moving on. These are my
> assumptions, please point out any errors:
>
> 1. We want to partition a large object into chunks since, by definition,
> a large object is too big to be stored in a single node in the cluster.
> It follows that it is paramount that no two chunks be stored in the same
> node, correct?

No. The idea is that the whole object should not end up being stored in a single JVM, but nothing should stop you from storing two chunks of the same object in the same node.

What we somehow need to avoid is chunks ending up in nodes that do not have enough memory to store them, and that could complicate things.

>
> 2. Constant hashing guarantees that any given key maps to *some* node in
> the cluster. There is no way, however, such a key's creator could know
> to what node exactly its key maps. In other words, there is no inverse
> to the hash function, correct?

I vaguely remember something about a consistent hash algorithm that, given a node where to store data, would generate a key mapping to it (Mircea, did you create this?). This could work in conjunction with my previous point, assuming a node knows what memory is available on other nodes, but this would require some thinking.


>
> 3. The current design mandates that for storing each chunk the existing
> put(key, value) be reused, correct?
>
> It follows that we have no way whatsoever of generating a set of keys
> that guarantees that no two keys are mapped to the same node. In the
> pathological case, *all* keys map to the same node, correct?

See my previous point.


>>> I would think a use case for this API would be streaming audio or
>>> video, maybe something like access logs even?
>>>
>>> In which case, you would want to read while you're writing. So,
>>> locking shouldn't be imposed. I would say, rely on the transaction
>>> manager to keep a consistent view. If transactions aren't being used,
>>> then the user might see some unexpected behavior. The API could
>>> compensate for that.
>>>
> If I understand you correctly you propose two alternatives:
>
> 1. Use transactions, thus delegating all consistency requirements to the
> transaction manager.
>
> 2. Don't use transactions and change the API so that readers may be told
> that a large object they are interested in is currently being written.
>
> Further, to support streaming use cases you propose that it should be
> possible to read a large object while it is being written.
>
> Is that correct?
>
> Hmm, I need to think about this. If I understand Manik's comment and the
> tx subsystem correctly each transaction holds its *entire* associated
> state in memory. Thus, if we are to write all chunks of a given large
> object within the scope of a single transaction we will blow up the
> originator node's heap. Correct?

Hmmmm, maybe what's needed here is a mix of the two. You want the metadata information to be transactional, so that when you start writing and chunking an object and keep updating the metadata object, those updates are transactionally protected and no one can read the metadata in the meantime. However, the actual chunk writing in the cache could be non-transactional, so that chunks do not pile up in the transaction context.
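The ordering both sides converge on (write chunks first, publish metadata last) can be sketched with plain maps standing in for the two caches; this shows only the visibility ordering, not transactions, and all names are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative sketch only: chunks are written first, and the metadata
 * entry is published only after the last chunk, so a reader that goes
 * through the metadata never observes a half-written object.
 */
class PublishLast {
    final Map<String, byte[]> chunkCache = new ConcurrentHashMap<>();
    final Map<String, Integer> metadataCache = new ConcurrentHashMap<>();

    void write(String key, byte[][] chunks) {
        for (int i = 0; i < chunks.length; i++) {
            // Chunks are stored, but the object is not yet readable
            // via metadata.
            chunkCache.put(key + "#" + i, chunks[i]);
        }
        metadataCache.put(key, chunks.length); // publication point
    }

    /** Returns null until the object has been fully written. */
    Integer chunkCount(String key) {
        return metadataCache.get(key);
    }
}
```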

>
> So many questions ...
>
> Cheers,
> Olaf
>


Re: [infinispan-dev] [ISPN-78] Alternative interface for writing large objects

Sanne Grinovero
I don't think you should make it too complex by looking at available
memory. You have the same issue when storing many different keys in
Infinispan in any mode, but we never worry about this, relying instead
on the spreading quality of the hash function. Of course, the
available total heap size must be able to store all values, plus the
replicas, plus some extra percentage due to the hash function not being
perfect; in effect, you can always define some spill-over to
CacheLoaders.

The fact that some nodes will have less memory available will be
solved by the virtual nodes patch, if you refer to bigger vs. smaller
machines in the same cluster.

If you make sure the file is split into "many" chunks, they will be
randomly distributed, and that should be good enough for this purpose;
the definition of "many" can be a configuration option or a method
parameter during store.

There's something similar happening in the Lucene Directory code,
these are some issues I had to consider:

1) Make sure you store a metadata object with the configuration
details used, like the number and size of chunks, so that if the chunk
size is configurable and the cluster is restarted with a different
configuration, you are still able to retrieve the correct stream.

2) There might be concurrency issues while one thread/node is
streaming it, and another one is deleting or replacing it. Infinispan
provides you with consistency at a key level, but as you're dealing
with multiple keys, you might get a view composed of chunks from
different transactions.

You'll have to think about how to solve 2). I guess you could store a
version number in the metadata object mentioned in 1) and have all
modified keys contain the version they refer to. Garbage collection
would be tricky, as at some point you want to delete chunks no longer
referred to by any node, including chunks owned by nodes that crashed
without explicitly releasing anything.
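Sanne's point 1) above, storing the write-time configuration in a metadata object, might look like the following sketch (field names are illustrative, and serialization concerns are glossed over):

```java
import java.io.Serializable;

/**
 * Illustrative sketch only: records the chunk size and count used when
 * the object was written, so a reader can reassemble the stream even
 * if the cluster's default chunk size has since been reconfigured.
 */
final class LargeObjectMetadata implements Serializable {
    final Object key;
    final int chunkSize;   // chunk size in effect at write time
    final int numChunks;
    final long totalSize;

    LargeObjectMetadata(Object key, int chunkSize, int numChunks, long totalSize) {
        this.key = key;
        this.chunkSize = chunkSize;
        this.numChunks = numChunks;
        this.totalSize = totalSize;
    }

    /** Size of chunk i: all chunks are chunkSize except possibly the last. */
    int chunkLength(int i) {
        if (i < numChunks - 1) return chunkSize;
        return (int) (totalSize - (long) (numChunks - 1) * chunkSize);
    }
}
```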

Sanne


2011/4/4 Galder Zamarreño <[hidden email]>:

>
> On Mar 31, 2011, at 7:46 AM, Olaf Bergner wrote:
>
>> On 30.03.11 02:32, Elias Ross wrote:
>>> I think it'd be BEST if you could support both models. I would add:
>>>
>>> interface Cache {
>>>   /**
>>>    * Returns a new or existing LargeObject object for the following key.
>>>    * @throws ClassCastException if the key exists and is not a LargeObject.
>>>    */
>>>   LargeObject largeObject(K key);
>>> }
>> OK, I'll keep that on my todo list, yet for the time being I'v opted to
>> start with implementing void writeToKey(K key, InputStream largeObject).
>>>> This is certainly doable but leaves me wondering where that proposed
>>>> ChunkingInterceptor might come into play.
>>> I would think ideally you don't need to create any new commands. Less
>>> protocol messages is better.
>> It is my understanding that PutKeyValueCommand will *always* attempt to
>> read the current value stored under the given key first. I'm not sure if
>> we want this in our situation where that current value may be several GB
>> in size. Anyway, it should be easy to refactor if reusing
>> PutKeyValueCommand should prove viable.
>
> The only reason it reads the previous value is to return it as part of contract of "V put(K, V)" - but that can be skipped.
>
>>>> 3. The design suggests to use a fresh UUID as the key for each new
>>>> chunk. While this in all likelihood gives us a unique new key for each
>>>> chunk I currently fail to see how that guarantees that this key maps to
>>>> a node that is different from all the nodes already used to store chunks
>>>> of the same Large Object. But then again I know next to nothing about
>>>> Infinispan's constant hashing algorithm.
>>> I wouldn't use UUID. I'd just store (K, #) where # is the chunk.
>>>
>> Since this is important and might reveal a fundamental misunderstanding
>> on my part, I need to sort this out before moving on. These are my
>> assumptions, please point out any errors:
>>
>> 1. We want to partition a large object into chunks since, by definition,
>> a large object is too big to be stored in a single node in the cluster.
>> It follows that it is paramount that no two chunks be stored in the same
>> node, correct?
>
> No. The idea is that the whole object should not end up being stored in a single JVM, but nothing should stop you from storing two chunks of the same object in the same node.
>
> What we somehow need to avoid is chunks ending up in nodes that do not have enough memory to store them, and that could complicate things.
>
>>
>> 2. Constant hashing guarantees that any given key maps to *some* node in
>> the cluster. There is no way, however, such a key's creator could know
>> to what node exactly its key maps. In other words, there is no inverse
>> to the hash function, correct?
>
> I vaguely remember something about a consistent hash algorithm that given a node where to store data, it would generate a key for it (Mircea, did you create this?). This could work in conjunction with my previous point assuming that a node would know what the available memory in other nodes is, but this would require some thinking.
>
>
>>
>> 3. The current design mandates that for storing each chunk the existing
>> put(key, value) be reused, correct?
>>
>> It follows that we have no way whatsoever of generating a set of keys
>> that guarantees that no two keys are mapped to the same node. In the
>> pathological case, *all* keys map to the same node, correct?
>
> See my previous point.
>
>
>>>> I would think a use case for this API would be streaming audio or
>>>> video, maybe something like access logs even?
>>>>
>>>> In which case, you would want to read while you're writing. So,
>>>> locking shouldn't be imposed. I would say, rely on the transaction
>>>> manager to keep a consistent view. If transactions aren't being used,
>>>> then the user might see some unexpected behavior. The API could
>>>> compensate for that.
>>>>
>> If I understand you correctly you propose two alternatives:
>>
>> 1. Use transactions, thus delegating all consistency requirements to the
>> transaction manager.
>>
>> 2. Don't use transactions and change the API so that readers may be told
>> that a large object they are interested in is currently being written.
>>
>> Further, to support streaming use cases you propose that it should be
>> possible to read a large object while it is being written.
>>
>> Is that correct?
>>
>> Hmm, I need to think about this. If I understand Manik's comment and the
>> tx subsystem correctly each transaction holds its *entire* associated
>> state in memory. Thus, if we are to write all chunks of a given large
>> object within the scope of a single transaction we will blow up the
>> originator node's heap. Correct?
>
> Hmmmm, maybe what's needed here is a mix of the two. You want metadata information to be transactional, so when you start writing and chunking an object and you keep updating the metadata object, this is transactionally protected, so no one can read the metadata in the mean time, however, the actual chunk writing in the cache could be non-transactional to make chunks do not pile up in the transaction context.
>
>>
>> So many questions ...
>>
>> Cheers,
>> Olaf
>>

_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] [ISPN-78] Alternative interface for writing large objects

Olaf Bergner
In reply to this post by Galder Zamarreno
Hi Galder,

thanks for your input. See my comments below:

-------- Original Message --------
> Date: Mon, 4 Apr 2011 10:09:39 +0200
> From: "Galder Zamarreño" <[hidden email]>

>
> Hmmmm, the initial step in writeToKey() is to create an map entry for the
> metadata, so the internal writeToKey() could indeed create a
> PutLargeObjectMetadataCommand and pass that down the interceptor stack, or more simply
> have a ChunkingInterceptor that implements visitPutKeyValue...() that would
> keep an eye for a transaction call that puts a LargeObjectMetadata, and at
> that point, the interceptor could return a new specialised
> outputstream...etc. The first suggestion would be more useful if you expect other normal
> cache commands such as get...etc to deal with large object related cache
> calls in a different way, but I don't think that's the case here since all the
> interaction would be via the Output/Input stream.

Just to make sure that we are on the same page, the proposed call sequence would be:

1. User calls OutputStream writeToKey(K key)
2. writeToKey creates a PutKeyValueCommand, marking it as pertaining to a large object.
3. writeToKey passes that command down the interceptor chain.
4. A LargeObjectChunkingInterceptor or maybe LargeObjectMetadataInterceptor processes that command, recognizing it as pertaining to a large object and thus storing a mapping from large object key to an initially empty large object metadata instance.
5. That interceptor returns a specialized output stream.
6. User writes bytes to that output stream until chunk limit is reached.
7. Output stream calls cache.put(key, byte[]) or alternatively creates PutKeyValueCommand itself, passing it down the interceptor chain.
8. LargeObjectChunking/MetadataInterceptor recognizes that it is dealing with a large object and that there already exists a mapping for that key in the metadata cache.
9. Interceptor generates new chunk key.
10. Interceptor replaces key with new chunk key and calls the next interceptor.
11. Interceptor restores original key (if only for consistency reasons).
12. Repeat 6 - 11 until user closes output stream.

Makes sense. Thanks.
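Steps 6-11 of the sequence above amount to an output stream that buffers bytes and flushes a chunk whenever the configured chunk size is reached. A minimal, self-contained sketch of that buffering logic (the `ChunkingOutputStream` name, the `key + "#" + index` chunk-key scheme, and the store callback are illustrative assumptions, not Infinispan API):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

// Buffers user writes and stores one chunk per chunkSize bytes under a derived key.
// The store callback stands in for cache.put(chunkKey, bytes).
public class ChunkingOutputStream extends OutputStream {
    private final String largeObjectKey;
    private final int chunkSize;
    private final BiConsumer<String, byte[]> store;
    private final List<String> chunkKeys = new ArrayList<>();
    private ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    public ChunkingOutputStream(String key, int chunkSize, BiConsumer<String, byte[]> store) {
        this.largeObjectKey = key;
        this.chunkSize = chunkSize;
        this.store = store;
    }

    @Override
    public void write(int b) throws IOException {
        buffer.write(b);
        if (buffer.size() >= chunkSize) flushChunk();
    }

    private void flushChunk() {
        String chunkKey = largeObjectKey + "#" + chunkKeys.size(); // step 9: new chunk key
        store.accept(chunkKey, buffer.toByteArray());              // step 10: put the chunk
        chunkKeys.add(chunkKey);
        buffer = new ByteArrayOutputStream();
    }

    @Override
    public void close() throws IOException {
        if (buffer.size() > 0) flushChunk(); // flush final partial chunk
        // step 12: here the metadata entry (chunk keys, total size) would be finalized
    }

    public List<String> chunkKeys() { return chunkKeys; }
}
```

Note that control passes to the chunking code only on write() and close(), which matches the observation that there is no single point where control passes from user code to Infinispan.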


> Yeah, this information would be stored in an internal cache. There're
> several examples of such caches such as the topology cache for Hot Rod servers.
> When the server is started, it creates a configuration for this type of
> cache (i.e. REPL_SYNC....) and then it's named in a particular way...etc.

Found some example code in the meantime. It's obviously not rocket science.

> > 3. The design suggests to use a fresh UUID as the key for each new
> > chunk. While this in all likelihood gives us a unique new key for each
> > chunk I currently fail to see how that guarantees that this key maps to
> > a node that is different from all the nodes already used to store chunks
> > of the same Large Object. But then again I know next to nothing about
> > Infinispan's constant hashing algorithm.
>
> I think there's a service that will generate a key mapped to particular
> node, so that might be a better option here to avoid all chunks going to the
> same node. I think Mircea might be able to help further with this.

Implemented a workaround in the meantime. Would be fab if such a service existed. Haven't found it yet.

> >
> > 4. Finally, the problem regarding eager locking and transactions
> > mentioned in Manik's comment seems rather ... hairy. If we indeed forego
> > transactions readers of a key just being written shouldn't be affected
> > provided we write the LargeObjectMetadata object only after all chunks
> > have been written. But what about writers?
>
> Hmmmmm, I don't understand your question.

Well, the problem with transactions in this context seems to be that a transaction *always* holds its entire associated state in memory. In our situation this would potentially blow up the heap.

Reading might not be a problem since in my original design a reader would not find a mapping for a key as long as the writer has not finished writing. It would assume that the large object it is looking for does not exist. In the updated design suggested above I'm currently thinking about marking the metadata as incomplete. A reader could then either block until the writer is finished or be told that the large object is still being written.
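A hedged sketch of such a metadata entry (the names are illustrative, not taken from the design document): the writer publishes it before streaming chunks and flips the flag only after the last chunk is stored, letting a reader distinguish "does not exist" from "still being written".

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative metadata entry for a large object, kept in a dedicated metadata cache.
public class LargeObjectMetadata implements Serializable {
    private final List<String> chunkKeys = new ArrayList<>();
    private long totalBytes;
    private boolean complete; // false while the writer is still streaming chunks

    public void addChunk(String chunkKey, int chunkBytes) {
        chunkKeys.add(chunkKey);
        totalBytes += chunkBytes;
    }

    public void markComplete() { complete = true; }

    public boolean isComplete() { return complete; }
    public long totalBytes() { return totalBytes; }
    public List<String> chunkKeys() { return Collections.unmodifiableList(chunkKeys); }
}
```

A reader finding `isComplete() == false` could then block, poll, or fail fast, depending on what the API promises.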

I haven't thought deeply about the implications for concurrent writes, though. Is it possible to lock keys outside of a transactional context? If so, this might be a solution for reading and writing.

Cheers,
Olaf
 

> >
> > _______________________________________________
> > infinispan-dev mailing list
> > [hidden email]
> > https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
> --
> Galder Zamarreño
> Sr. Software Engineer
> Infinispan, JBoss Cache
>
>
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev

--
Recommend GMX DSL to your friends and acquaintances and we will
reward you with up to 50 euros! https://freundschaftswerbung.gmx.de
_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] [ISPN-78] Alternative interface for writing large objects

Olaf Bergner
In reply to this post by Galder Zamarreno
Hi Galder,

-------- Original Message --------
> Date: Mon, 4 Apr 2011 11:01:21 +0200
> From: "Galder Zamarreño" <[hidden email]>
> Actually, it's not that simple, it needs to be adaptive but it might be
> going into the territory of virtual nodes and sizing of virtual nodes. The
> key thing of choosing the nodes to store the chunks is that there should be
> enough memory in the node where it lands. IOW, if a 5GB dvd is being chunked
> into 100MB pieces, it would not make sense sending chunks to a node that
> does not have memory to fit that.

Is there a JMX interface for querying a node's available memory? Would it otherwise make sense to add one?
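For the local JVM this already exists via the standard platform MXBeans (remotely reachable under the `java.lang:type=Memory` object name); whether Infinispan should re-expose it per node is a separate question. A small sketch of the standard query:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapProbe {
    // Estimates heap headroom in bytes via the platform MemoryMXBean.
    // Returns -1 if the JVM reports no defined maximum heap size.
    public static long availableHeapBytes() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        long max = heap.getMax(); // -1 if undefined
        return max < 0 ? -1 : max - heap.getUsed();
    }
}
```

Note this is only an upper bound on headroom; garbage not yet collected makes the usable figure fuzzy, which is one reason sizing decisions based on it are tricky.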

Cheers,
Olaf

> >>
> >>
> >>
> >> _______________________________________________
> >> infinispan-dev mailing list
> >> [hidden email]
> >> https://lists.jboss.org/mailman/listinfo/infinispan-dev
> >
> > --
> > Galder Zamarreño
> > Sr. Software Engineer
> > Infinispan, JBoss Cache
> >
> >
> > _______________________________________________
> > infinispan-dev mailing list
> > [hidden email]
> > https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
> --
> Galder Zamarreño
> Sr. Software Engineer
> Infinispan, JBoss Cache
>
>
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev

--
GMX DSL double flat rate from 19.99 euros/month! Now with a
free mobile flat rate! http://portal.gmx.net/de/go/dsl
_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] [ISPN-78] Alternative interface for writing large objects

Olaf Bergner
In reply to this post by Galder Zamarreno
Hi Galder,

-------- Original Message --------
> Date: Mon, 4 Apr 2011 11:29:06 +0200
> From: "Galder Zamarreño" <[hidden email]>
> > in size. Anyway, it should be easy to refactor if reusing
> > PutKeyValueCommand should prove viable.
>
> The only reason it reads the previous value is to return it as part of
> contract of "V put(K, V)" - but that can be skipped.

Meanwhile, I replaced my custom WriteLargeObjectCommand with a customized PutKeyValueCommand sporting a new putLargeObject flag.

> > Since this is important and might reveal a fundamental misunderstanding
> > on my part, I need to sort this out before moving on. These are my
> > assumptions, please point out any errors:
> >
> > 1. We want to partition a large object into chunks since, by definition,
> > a large object is too big to be stored in a single node in the cluster.
> > It follows that it is paramount that no two chunks be stored in the same
> > node, correct?
>
> No. The idea is that the whole object should not end up being stored in a
> single JVM, but nothing should stop you from storing two chunks of the same
> object in the same node.

Ah, this takes away some complexity. Good.

> What we somehow need to avoid is chunks ending up in nodes that do not
> have enough memory to store them, and that could complicate things.

Definitely. What about replication, for instance? Does Infinispan use the replication mechanism suggested by Dynamo, i.e. walking the consistent hash ring in clockwise direction until the desired number of replicas is reached (if I recall correctly)? I'm afraid this might fail in our case.

Plus, I fear rehashing would have to be aware of whether it is dealing with relocating a large object chunk or a "regular" value.


> Hmmmm, maybe what's needed here is a mix of the two. You want the metadata
> information to be transactional: when you start writing and chunking an
> object and keep updating the metadata object, this is transactionally
> protected, so no one can read the metadata in the meantime. However, the
> actual chunk writing in the cache could be non-transactional so that
> chunks do not pile up in the transaction context.

Sounds reasonable. Something definitely worth thinking about.

Cheers,
Olaf

>
> >
> > So many questions ...
> >
> > Cheers,
> > Olaf
> >
> > _______________________________________________
> > infinispan-dev mailing list
> > [hidden email]
> > https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
> --
> Galder Zamarreño
> Sr. Software Engineer
> Infinispan, JBoss Cache
>
>
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev

_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] [ISPN-78] Alternative interface for writing large objects

Dan Berindei
In reply to this post by Olaf Bergner
>
>> > 3. The design suggests to use a fresh UUID as the key for each new
>> > chunk. While this in all likelihood gives us a unique new key for each
>> > chunk I currently fail to see how that guarantees that this key maps to
>> > a node that is different from all the nodes already used to store chunks
>> > of the same Large Object. But then again I know next to nothing about
>> > Infinispan's constant hashing algorithm.
>>
>> I think there's a service that will generate a key mapped to particular
>> node, so that might be a better option here to avoid all chunks going to the
>> same node. I think Mircea might be able to help further with this.
>
> Implemented a workaround in the meantime. Would be fab if such a service existed. Haven't found it yet.
>

http://community.jboss.org/wiki/Keyaffinityservice ?
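For what it's worth, a toy illustration of the affinity idea behind the linked service: keep generating candidate keys until one hashes to the target node. This is plain Java against a stand-in hash function, not the Infinispan `KeyAffinityService` API.

```java
import java.util.List;
import java.util.Random;

// Toy version of key affinity: generate random keys until one maps to the
// desired node under the (stand-in) cluster hash function.
public class ToyKeyAffinity {
    private final List<String> nodes;
    private final Random random = new Random();

    public ToyKeyAffinity(List<String> nodes) { this.nodes = nodes; }

    // Stand-in for the consistent hash: key -> owning node.
    private String ownerOf(String key) {
        return nodes.get(Math.floorMod(key.hashCode(), nodes.size()));
    }

    // Loops until a key affine to the given node is found; with n nodes this
    // takes ~n attempts on average (and never terminates for an unknown node).
    public String keyForNode(String node) {
        while (true) {
            String candidate = "chunk-" + random.nextLong();
            if (ownerOf(candidate).equals(node)) return candidate;
        }
    }
}
```

If memory serves, the real service does this generation on a background executor and buffers keys per node, so callers don't pay the generation loop on the request path.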

Cheers
Dan
_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] [ISPN-78] Alternative interface for writing large objects

Sanne Grinovero
In reply to this post by Olaf Bergner
<cut />

> I haven't thought deeply about the implications for concurrent writes, though. Is it possible to lock keys outside of a transactional context? If so, this might be a solution for reading and writing.

No, it's not possible to lock anything outside of a transaction; the most
obvious reason is that there's no unlock() method: all acquired locks
are implicitly released at transaction commit.

Cheers,
Sanne

>
> Cheers,
> Olaf
>
_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] [ISPN-78] Alternative interface for writing large objects

Galder Zamarreno
In reply to this post by Olaf Bergner

On Apr 4, 2011, at 11:50 AM, Olaf Bergner wrote:

> Hi Galder,
>
> -------- Original Message --------
>> Date: Mon, 4 Apr 2011 11:01:21 +0200
>> From: "Galder Zamarreño" <[hidden email]>
>> Actually, it's not that simple, it needs to be adaptive but it might be
>> going into the territory of virtual nodes and sizing of virtual nodes. The
>> key thing of choosing the nodes to store the chunks is that there should be
>> enough memory in the node where it lands. IOW, if a 5GB dvd is being chunked
>> into 100MB pieces, it would not make sense sending chunks to a node that
>> does not have memory to fit that.
>
> Is there a JMX interface for querying a node's available memory? Would it otherwise make sense to add one?

Actually, you should follow the advice of Sanne, who's worked on a similar type of problem for the Infinispan-based Lucene directory impl, and not worry about it.

>
> Cheers,
> Olaf
>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> infinispan-dev mailing list
>>>> [hidden email]
>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>
>>> --
>>> Galder Zamarreño
>>> Sr. Software Engineer
>>> Infinispan, JBoss Cache
>>>
>>>
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> [hidden email]
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>
>> --
>> Galder Zamarreño
>> Sr. Software Engineer
>> Infinispan, JBoss Cache
>>
>>
>> _______________________________________________
>> infinispan-dev mailing list
>> [hidden email]
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev

--
Galder Zamarreño
Sr. Software Engineer
Infinispan, JBoss Cache


_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] [ISPN-78] Alternative interface for writing large objects

Olaf Bergner
On 05.04.11 10:27, Galder Zamarreño wrote:

> On Apr 4, 2011, at 11:50 AM, Olaf Bergner wrote:
>
>> Hi Galder,
>>
>> -------- Original Message --------
>>> Date: Mon, 4 Apr 2011 11:01:21 +0200
>>> From: "Galder Zamarreño"<[hidden email]>
>>> Actually, it's not that simple, it needs to be adaptive but it might be
>>> going into the territory of virtual nodes and sizing of virtual nodes. The
>>> key thing of choosing the nodes to store the chunks is that there should be
>>> enough memory in the node where it lands. IOW, if a 5GB dvd is being chunked
>>> into 100MB pieces, it would not make sense sending chunks to a node that
>>> does not have memory to fit that.
>> Is there a JMX interface for querying a node's available memory? Would it otherwise make sense to add one?
> Actually, you should follow the advice of Sanne, who's worked on a similar type of problem for the Infinispan-based Lucene directory impl, and not worry about it.
That's the conclusion I arrived at. Just removed any code aiming at
guaranteeing that no two chunks be stored on the same node.

Cheers,
Olaf

P.S.: Unfortunately I won't make it to your presentation in Berlin. I
just bought the ticket, but my doctor strongly advised me against it.
For once, I will try to be a well-behaved patient. Sad, though.

>> Cheers,
>> Olaf
>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> infinispan-dev mailing list
>>>>> [hidden email]
>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>> --
>>>> Galder Zamarreño
>>>> Sr. Software Engineer
>>>> Infinispan, JBoss Cache
>>>>
>>>>
>>>> _______________________________________________
>>>> infinispan-dev mailing list
>>>> [hidden email]
>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>> --
>>> Galder Zamarreño
>>> Sr. Software Engineer
>>> Infinispan, JBoss Cache
>>>
>>>
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> [hidden email]
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>> _______________________________________________
>> infinispan-dev mailing list
>> [hidden email]
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
> --
> Galder Zamarreño
> Sr. Software Engineer
> Infinispan, JBoss Cache
>
>
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>

_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] [ISPN-78] Alternative interface for writing large objects

Manik Surtani
In reply to this post by Sanne Grinovero
+1, I don't think we should over-complicate by mandating that chunks are on different nodes.  Let the distribution code handle this.  If, at a later date, we see that the system frequently fails due to too many chunks on certain nodes, we can revisit.  But that would be an implementation detail.

Personally, I think with virtual nodes, we should have a high chance of distribution taking care of this.

On 4 Apr 2011, at 10:43, Sanne Grinovero wrote:

> I don't think you should make it too complex by looking at available
> memory, you have the same issue when storing many different keys in
> Infinispan in any mode, but we never worry about this, relying instead
> on the spreading quality of the hash function, and of course the
> available total heap size must be able to store all values, plus the
> replicas, plus some extra % due to the hashing function not being
> perfect; In effect you can always define some spill-over to
> CacheLoaders.
>
> The fact that some nodes will have less memory available will be
> solved by the virtual nodes patch, if you refer to bigger vs. smaller
> machines in the same cluster.
>
> If you make sure the file is split in "many" chunks, they will be
> randomly distributed and that should be good enough for this purpose,
> wherein the definition of "many" can be a configuration option, or a
> method parameter during store.
>
> There's something similar happening in the Lucene Directory code,
> these are some issues I had to consider:
>
> 1) make sure you store a metadata object with the used configuration
> details, like the number and size of chunks, so that in case the chunk
> size is configurable, if the cluster is restarted with a different
> configuration you are still able to retrieve the correct stream.
>
> 2) There might be concurrency issues while one thread/node is
> streaming it, and another one is deleting or replacing it. Infinispan
> provides you with consistency at a key level, but as you're dealing
> with multiple keys, you might get a view composed of chunks from
> different transactions.
>
> You'll have to think about how to solve 2), I guess you could store a
> version number in the metadata object mentioned in 1) and have all
> modified keys contain the version they refer to. garbage collection
> would be tricky, as at some point you want to delete chunks no longer
> referred to by any node, including those who crashed without
> explicitly releasing anything.
>
> Sanne
>
>
> 2011/4/4 Galder Zamarreño <[hidden email]>:
>>
>> On Mar 31, 2011, at 7:46 AM, Olaf Bergner wrote:
>>
>>> Am 30.03.11 02:32, schrieb Elias Ross:
>>>> I think it'd be BEST if you could support both models. I would add:
>>>>
>>>> interface Cache {
>>>>   /**
>>>>    * Returns a new or existing LargeObject object for the following key.
>>>>    * @throws ClassCastException if the key exists and is not a LargeObject.
>>>>    */
>>>>   LargeObject largeObject(K key);
>>>> }
>>> OK, I'll keep that on my todo list, yet for the time being I'v opted to
>>> start with implementing void writeToKey(K key, InputStream largeObject).
>>>>> This is certainly doable but leaves me wondering where that proposed
>>>>> ChunkingInterceptor might come into play.
>>>> I would think ideally you don't need to create any new commands. Less
>>>> protocol messages is better.
>>> It is my understanding that PutKeyValueCommand will *always* attempt to
>>> read the current value stored under the given key first. I'm not sure if
>>> we want this in our situation where that current value may be several GB
>>> in size. Anyway, it should be easy to refactor if reusing
>>> PutKeyValueCommand should prove viable.
>>
>> The only reason it reads the previous value is to return it as part of contract of "V put(K, V)" - but that can be skipped.
>>
>>>>> 3. The design suggests to use a fresh UUID as the key for each new
>>>>> chunk. While this in all likelihood gives us a unique new key for each
>>>>> chunk I currently fail to see how that guarantees that this key maps to
>>>>> a node that is different from all the nodes already used to store chunks
>>>>> of the same Large Object. But then again I know next to nothing about
>>>>> Infinispan's constant hashing algorithm.
>>>> I wouldn't use UUID. I'd just store (K, #) where # is the chunk.
>>>>
>>> Since this is important and might reveal a fundamental misunderstanding
>>> on my part, I need to sort this out before moving on. These are my
>>> assumptions, please point out any errors:
>>>
>>> 1. We want to partition a large object into chunks since, by definition,
>>> a large object is too big to be stored in a single node in the cluster.
>>> It follows that it is paramount that no two chunks be stored in the same
>>> node, correct?
>>
>> No. The idea is that the whole object should not end up being stored in a single JVM, but nothing should stop you from storing two chunks of the same object in the same node.
>>
>> What we somehow need to avoid is chunks ending up in nodes that do not have enough memory to store them, and that could complicate things.
>>
>>>
>>> 2. Constant hashing guarantees that any given key maps to *some* node in
>>> the cluster. There is no way, however, such a key's creator could know
>>> to what node exactly its key maps. In other words, there is no inverse
>>> to the hash function, correct?
>>
>> I vaguely remember something about a consistent hash algorithm that given a node where to store data, it would generate a key for it (Mircea, did you create this?). This could work in conjunction with my previous point assuming that a node would know what the available memory in other nodes is, but this would require some thinking.
>>
>>
>>>
>>> 3. The current design mandates that for storing each chunk the existing
>>> put(key, value) be reused, correct?
>>>
>>> It follows that we have no way whatsoever of generating a set of keys
>>> that guarantees that no two keys are mapped to the same node. In the
>>> pathological case, *all* keys map to the same node, correct?
>>
>> See my previous point.
>>
>>
>>>>> I would think a use case for this API would be streaming audio or
>>>>> video, maybe something like access logs even?
>>>>>
>>>>> In which case, you would want to read while you're writing. So,
>>>>> locking shouldn't be imposed. I would say, rely on the transaction
>>>>> manager to keep a consistent view. If transactions aren't being used,
>>>>> then the user might see some unexpected behavior. The API could
>>>>> compensate for that.
>>>>>
>>> If I understand you correctly you propose two alternatives:
>>>
>>> 1. Use transactions, thus delegating all consistency requirements to the
>>> transaction manager.
>>>
>>> 2. Don't use transactions and change the API so that readers may be told
>>> that a large object they are interested in is currently being written.
>>>
>>> Further, to support streaming use cases you propose that it should be
>>> possible to read a large object while it is being written.
>>>
>>> Is that correct?
>>>
>>> Hmm, I need to think about this. If I understand Manik's comment and the
>>> tx subsystem correctly each transaction holds its *entire* associated
>>> state in memory. Thus, if we are to write all chunks of a given large
>>> object within the scope of a single transaction we will blow up the
>>> originator node's heap. Correct?
>>
>> Hmmmm, maybe what's needed here is a mix of the two. You want the metadata information to be transactional: when you start writing and chunking an object and keep updating the metadata object, this is transactionally protected, so no one can read the metadata in the meantime. However, the actual chunk writing in the cache could be non-transactional so that chunks do not pile up in the transaction context.
>>
>>>
>>> So many questions ...
>>>
>>> Cheers,
>>> Olaf
>>>
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> [hidden email]
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>
>> --
>> Galder Zamarreño
>> Sr. Software Engineer
>> Infinispan, JBoss Cache
>>
>>
>> _______________________________________________
>> infinispan-dev mailing list
>> [hidden email]
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>
>
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev

--
Manik Surtani
[hidden email]
twitter.com/maniksurtani

Lead, Infinispan
http://www.infinispan.org




_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] [ISPN-78] Alternative interface for writing large objects

Manik Surtani
In reply to this post by Olaf Bergner

On 4 Apr 2011, at 11:01, Olaf Bergner wrote:

>>
>> What we somehow need to avoid is chunks ending up in nodes that do not
>> have enough memory to store them, and that could complicate things.
>
> Definitely. What about replication, for instance? Does Infinispan use the replication mechanism suggested by Dynamo, i.e. walking the consistent hash ring in clockwise direction until the desired number of replicas is reached (if I recall correctly)? I'm afraid this might fail in our case.

Yes, this is how Infinispan's distribution works.  Just for clarification, we refer to this as distribution rather than replication.  

On the other hand, when we speak of replication, we mean copies are replicated to ALL other nodes in the cluster.  I.e., each node is a replica of its neighbour, and all nodes are treated equal.  Replication, as such, has no need for a consistent hash wheel.

Why do you feel this may fail?

> Plus, I fear rehashing would have to be aware of whether it is dealing with relocating a large object chunk or a "regular" value.

Again, why is this the case?

Cheers
Manik

--
Manik Surtani
[hidden email]
twitter.com/maniksurtani

Lead, Infinispan
http://www.infinispan.org




_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] [ISPN-78] Alternative interface for writing large objects

Olaf Bergner
On 15.04.11 16:54, Manik Surtani wrote:

> On 4 Apr 2011, at 11:01, Olaf Bergner wrote:
>
>>> What we somehow need to avoid is chunks ending up in nodes that do not
>>> have enough memory to store them, and that could complicate things.
>> Definitely. What about replication, for instance? Does Infinispan use the replication mechanism suggested by Dynamo, i.e. walking the consistent hash ring in clockwise direction until the desired number of replicas is reached (if I recall correctly)? I'm afraid this might fail in our case.
> Yes, this is how Infinispan's distribution works.  Just for clarification, we refer to this as distribution rather than replication.
>
> On the other hand, when we speak of replication, we mean copies are replicated to ALL other nodes in the cluster.  I.e., each node is a replica of its neighbour, and all nodes are treated equal.  Replication, as such, has no need for a consistent hash wheel.
>
> Why do you feel this may fail?
This fear rested on the assumption that no two chunks may be stored on
the same node. I had code in place to enforce this, yet it didn't take
distribution into account.

Meanwhile, following Sanne's suggestions, I removed a great deal of my
earlier code's complexity and now rely on Infinispan's consistent hashing
algorithm to evenly distribute chunks across the cluster.
>> Plus, I fear rehashing would have to be aware of whether it is dealing with relocating a large object chunk or a "regular" value.
See above: if a node leaves the cluster, rehashing might relocate one of
large object A's chunks to a node that already stores another chunk
belonging to A. Again, this argument is obsolete by now.

Cheers,
Olaf

> Again, why is this the case?
>
> Cheers
> Manik
>
> --
> Manik Surtani
> [hidden email]
> twitter.com/maniksurtani
>
> Lead, Infinispan
> http://www.infinispan.org
>
>
>
>
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>

_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] [ISPN-78] Alternative interface for writing large objects

Manik Surtani

On 15 Apr 2011, at 17:43, Olaf Bergner wrote:

>>>
>>> Plus, I fear rehashing would have to be aware of whether it is dealing with relocating a large object chunk or a "regular" value.
> See above: if a node leaves the cluster rehashing might relocate large
> object's A 's chunk to a node that already has another chunk belonging
> to A stored. Again, this argument is obsolete by now.

Ok.

--
Manik Surtani
[hidden email]
twitter.com/maniksurtani

Lead, Infinispan
http://www.infinispan.org




_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev