[infinispan-dev] chunking ability on the JDBC cacheloader


[infinispan-dev] chunking ability on the JDBC cacheloader

Sanne Grinovero
As mentioned on the user forum [1], people setting up a JDBC
cacheloader need to be able to define the size of the columns to be
used. The Lucene Directory has a feature to autonomously chunk segment
contents at a configurable byte size, and so has the GridFS; still,
there are other metadata objects which Lucene currently doesn't chunk
because they are "fairly small" (but of undefined and possibly growing
size), and more generally anybody using the JDBC cacheloader faces the
same problem: what column size do I need to use?

While in most cases the maximum size can be estimated, that is still
not good enough: when the estimate is wrong the byte array might get
truncated, so I think the CacheLoader should take care of this.

What would you think of:
 - adding a max_chunk_size option to JdbcStringBasedCacheStoreConfig
and JdbcBinaryCacheStore
 - having them store values bigger than max_chunk_size across multiple
rows (see the sketch below)
 - accepting that this will need transactions, which are currently not
used by the cacheloaders

It looks to me like only the JDBC cacheloader has these issues, as the
other stores I'm aware of are more "blob oriented". Would it be worth
building this abstraction at a higher level instead of in the JDBC
cacheloader?
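
Just to make the idea concrete, here's a rough sketch of what such a
chunked write could look like with plain JDBC. This is not Infinispan
code: the ISPN_BUCKET table, its KEY_BYTES / CHUNK_ID / TIMESTAMP /
DATA_CHUNK columns and the store() signature are made up for
illustration.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    public class ChunkedValueWriter {

        private static final String DELETE_SQL =
            "DELETE FROM ISPN_BUCKET WHERE KEY_BYTES = ?";
        private static final String INSERT_SQL =
            "INSERT INTO ISPN_BUCKET (KEY_BYTES, CHUNK_ID, TIMESTAMP, DATA_CHUNK) VALUES (?, ?, ?, ?)";

        public void store(Connection con, byte[] key, byte[] value,
                          long timestamp, int maxChunkSize) throws SQLException {
            boolean oldAutoCommit = con.getAutoCommit();
            con.setAutoCommit(false); // all chunks must become visible atomically
            try {
                // we can't tell which old chunks are stale, so drop them all first
                try (PreparedStatement del = con.prepareStatement(DELETE_SQL)) {
                    del.setBytes(1, key);
                    del.executeUpdate();
                }
                // write the value back as rows of at most maxChunkSize bytes each
                try (PreparedStatement ins = con.prepareStatement(INSERT_SQL)) {
                    int chunkId = 0;
                    for (int offset = 0; offset < value.length; offset += maxChunkSize) {
                        int len = Math.min(maxChunkSize, value.length - offset);
                        byte[] chunk = new byte[len];
                        System.arraycopy(value, offset, chunk, 0, len);
                        ins.setBytes(1, key);
                        ins.setInt(2, chunkId++);
                        ins.setLong(3, timestamp);
                        ins.setBytes(4, chunk);
                        ins.addBatch();
                    }
                    ins.executeBatch();
                }
                con.commit();
            } catch (SQLException e) {
                con.rollback();
                throw e;
            } finally {
                con.setAutoCommit(oldAutoCommit);
            }
        }
    }

A load would then have to select all rows for the key and reassemble
them in CHUNK_ID order, and the delete/insert pair above is exactly why
the transaction is needed.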

Cheers,
Sanne

[1] - http://community.jboss.org/thread/166760
_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] chunking ability on the JDBC cacheloader

"이희승 (Trustin Lee)"
On 05/20/2011 03:06 AM, Sanne Grinovero wrote:

> As mentioned on the user forum [1], people setting up a JDBC
> cacheloader need to be able to define the size of columns to be used.
> The Lucene Directory has a feature to autonomously chunk the segment
> contents at a configurable specified byte number, and so has the
> GridFS; still there are other metadata objects which Lucene currently
> doesn't chunk as it's "fairly small" (but undefined and possibly
> growing), and in a more general sense anybody using the JDBC
> cacheloader would face the same problem: what's the dimension I need
> to use ?
>
> While in most cases the maximum size can be estimated, this is still
> not good enough, as when you're wrong the byte array might get
> truncated, so I think the CacheLoader should take care of this.
>
> what would you think of:
>  - adding a max_chunk_size option to JdbcStringBasedCacheStoreConfig
> and JdbcBinaryCacheStore
>  - have them store in multiple rows the values which would be bigger
> than max_chunk_size
>  - this will need transactions, which are currently not being used by
> the cacheloaders
>
> It looks like to me that only the JDBC cacheloader has these issues,
> as the other stores I'm aware of are more "blob oriented". Could it be
> worth to build this abstraction in an higher level instead of in the
> JDBC cacheloader?

I'm not sure I understand the idea correctly.  Do you mean an entry
should span more than one row if the entry's value is larger than the
maximum column capacity?  I guess it's not about keys, right?

Sounds like a good idea for ISPN-701 because it will surely result in
a different schema.

> [1] - http://community.jboss.org/thread/166760

--
Trustin Lee, http://gleamynode.net/
_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] chunking ability on the JDBC cacheloader

Manik Surtani
In reply to this post by Sanne Grinovero
Is spanning rows the only real solution?  As you say, it would mandate using transactions to keep multiple rows coherent, and I'm not sure everyone would want to enable transactions for this.

On 19 May 2011, at 19:06, Sanne Grinovero wrote:

> As mentioned on the user forum [1], people setting up a JDBC
> cacheloader need to be able to define the size of columns to be used.
> The Lucene Directory has a feature to autonomously chunk the segment
> contents at a configurable specified byte number, and so has the
> GridFS; still there are other metadata objects which Lucene currently
> doesn't chunk as it's "fairly small" (but undefined and possibly
> growing), and in a more general sense anybody using the JDBC
> cacheloader would face the same problem: what's the dimension I need
> to use ?
>
> While in most cases the maximum size can be estimated, this is still
> not good enough, as when you're wrong the byte array might get
> truncated, so I think the CacheLoader should take care of this.
>
> what would you think of:
> - adding a max_chunk_size option to JdbcStringBasedCacheStoreConfig
> and JdbcBinaryCacheStore
> - have them store in multiple rows the values which would be bigger
> than max_chunk_size
> - this will need transactions, which are currently not being used by
> the cacheloaders
>
> It looks like to me that only the JDBC cacheloader has these issues,
> as the other stores I'm aware of are more "blob oriented". Could it be
> worth to build this abstraction in an higher level instead of in the
> JDBC cacheloader?
>
> Cheers,
> Sanne
>
> [1] - http://community.jboss.org/thread/166760
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev

--
Manik Surtani
[hidden email]
twitter.com/maniksurtani

Lead, Infinispan
http://www.infinispan.org




_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] chunking ability on the JDBC cacheloader

Sanne Grinovero
In reply to this post by "이희승 (Trustin Lee)"
2011/5/20 "이희승 (Trustin Lee)" <[hidden email]>:

> On 05/20/2011 03:06 AM, Sanne Grinovero wrote:
>> As mentioned on the user forum [1], people setting up a JDBC
>> cacheloader need to be able to define the size of columns to be used.
>> The Lucene Directory has a feature to autonomously chunk the segment
>> contents at a configurable specified byte number, and so has the
>> GridFS; still there are other metadata objects which Lucene currently
>> doesn't chunk as it's "fairly small" (but undefined and possibly
>> growing), and in a more general sense anybody using the JDBC
>> cacheloader would face the same problem: what's the dimension I need
>> to use ?
>>
>> While in most cases the maximum size can be estimated, this is still
>> not good enough, as when you're wrong the byte array might get
>> truncated, so I think the CacheLoader should take care of this.
>>
>> what would you think of:
>>  - adding a max_chunk_size option to JdbcStringBasedCacheStoreConfig
>> and JdbcBinaryCacheStore
>>  - have them store in multiple rows the values which would be bigger
>> than max_chunk_size
>>  - this will need transactions, which are currently not being used by
>> the cacheloaders
>>
>> It looks like to me that only the JDBC cacheloader has these issues,
>> as the other stores I'm aware of are more "blob oriented". Could it be
>> worth to build this abstraction in an higher level instead of in the
>> JDBC cacheloader?
>
> I'm not sure if I understand the idea correctly.  Do you mean an entry
> should span to more than one row if the entry's value is larger than the
> maximum column capacity?  I guess it's not about keys, right?

Yes, having it on the keys would be ideal (just as a safety feature),
but I don't think we could map that.
The table schema proposal would look like:

[serialized key] [chunk id] [timestamp] [blob or part of it]

instead of

[serialized key] [timestamp] [blob]
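
Just for illustration, that layout could translate into DDL along these
lines, reusing the made-up ISPN_BUCKET names from the sketch in my
first mail; this is not the schema Infinispan actually creates:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.sql.Statement;

    class ChunkedSchema {
        // Rough sketch of the (key, chunk id, timestamp, chunk) layout proposed above.
        static void createChunkedTable(Connection connection) throws SQLException {
            try (Statement st = connection.createStatement()) {
                st.executeUpdate(
                    "CREATE TABLE ISPN_BUCKET (" +
                    "  KEY_BYTES   VARBINARY(255) NOT NULL," +  // serialized key, bounded size
                    "  CHUNK_ID    INT            NOT NULL," +  // position of this fragment
                    "  TIMESTAMP   BIGINT         NOT NULL," +  // expiry / modification time
                    "  DATA_CHUNK  BLOB           NOT NULL," +  // at most max_chunk_size bytes
                    "  PRIMARY KEY (KEY_BYTES, CHUNK_ID))");
            }
        }
    }

The composite primary key (KEY_BYTES, CHUNK_ID) is what replaces the
single key column of the current layout.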

>
> Sounds like a good idea for ISPN-701 because it will surely result in
> different schema.
>
>> [1] - http://community.jboss.org/thread/166760
>
> --
> Trustin Lee, http://gleamynode.net/
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>

_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] chunking ability on the JDBC cacheloader

Sanne Grinovero
In reply to this post by Manik Surtani
2011/5/20 Manik Surtani <[hidden email]>:
> Is spanning rows the only real solution?  As you say it would mandate using transactions to keep multiple rows coherent, and 'm not sure if everyone would want to enable transactions for this.

I think that using multiple rows for variable-sized data is a good
approach, unless the database natively supports blobs of very variable size.

I'm not suggesting that we should heavily fragment each value; quite
the opposite: as today, people should define a reasonable estimate to
keep the value in one column, but the JDBC cacheloader should be able
to handle the case in which the estimate is wrong and the blob can't
be stored completely.

As Trustin said, ISPN-701 could be related, so that the exact
implementation depends on database capabilities.

>
> On 19 May 2011, at 19:06, Sanne Grinovero wrote:
>
>> As mentioned on the user forum [1], people setting up a JDBC
>> cacheloader need to be able to define the size of columns to be used.
>> The Lucene Directory has a feature to autonomously chunk the segment
>> contents at a configurable specified byte number, and so has the
>> GridFS; still there are other metadata objects which Lucene currently
>> doesn't chunk as it's "fairly small" (but undefined and possibly
>> growing), and in a more general sense anybody using the JDBC
>> cacheloader would face the same problem: what's the dimension I need
>> to use ?
>>
>> While in most cases the maximum size can be estimated, this is still
>> not good enough, as when you're wrong the byte array might get
>> truncated, so I think the CacheLoader should take care of this.
>>
>> what would you think of:
>> - adding a max_chunk_size option to JdbcStringBasedCacheStoreConfig
>> and JdbcBinaryCacheStore
>> - have them store in multiple rows the values which would be bigger
>> than max_chunk_size
>> - this will need transactions, which are currently not being used by
>> the cacheloaders
>>
>> It looks like to me that only the JDBC cacheloader has these issues,
>> as the other stores I'm aware of are more "blob oriented". Could it be
>> worth to build this abstraction in an higher level instead of in the
>> JDBC cacheloader?
>>
>> Cheers,
>> Sanne
>>
>> [1] - http://community.jboss.org/thread/166760
>> _______________________________________________
>> infinispan-dev mailing list
>> [hidden email]
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
> --
> Manik Surtani
> [hidden email]
> twitter.com/maniksurtani
>
> Lead, Infinispan
> http://www.infinispan.org
>
>
>
>
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>

_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] chunking ability on the JDBC cacheloader

"이희승 (Trustin Lee)"
On 05/20/2011 05:01 PM, Sanne Grinovero wrote:

> 2011/5/20 Manik Surtani<[hidden email]>:
>> Is spanning rows the only real solution?  As you say it would mandate using transactions to keep multiple rows coherent, and 'm not sure if everyone would want to enable transactions for this.
>
> I think that using multiple rows for variable sized data is a good
> approach, unless the database supports very variable blob sizes.
>
> I'm not suggesting that we should heavily fragment each value, more
> the opposite I think that as right now people should define a
> reasonable estimate to keep it in one column, but the JDBC cacheloader
> should be able to handle the case in which the estimate is wrong and
> the blob can't be stored completely.
>
> As Trustin said, ISPN-701 could be related so that exact
> implementation depends on database capabilities.

Yeah, let's post the relevant ideas to the JIRA page for easier tracking.

>>
>> On 19 May 2011, at 19:06, Sanne Grinovero wrote:
>>
>>> As mentioned on the user forum [1], people setting up a JDBC
>>> cacheloader need to be able to define the size of columns to be used.
>>> The Lucene Directory has a feature to autonomously chunk the segment
>>> contents at a configurable specified byte number, and so has the
>>> GridFS; still there are other metadata objects which Lucene currently
>>> doesn't chunk as it's "fairly small" (but undefined and possibly
>>> growing), and in a more general sense anybody using the JDBC
>>> cacheloader would face the same problem: what's the dimension I need
>>> to use ?
>>>
>>> While in most cases the maximum size can be estimated, this is still
>>> not good enough, as when you're wrong the byte array might get
>>> truncated, so I think the CacheLoader should take care of this.
>>>
>>> what would you think of:
>>> - adding a max_chunk_size option to JdbcStringBasedCacheStoreConfig
>>> and JdbcBinaryCacheStore
>>> - have them store in multiple rows the values which would be bigger
>>> than max_chunk_size
>>> - this will need transactions, which are currently not being used by
>>> the cacheloaders
>>>
>>> It looks like to me that only the JDBC cacheloader has these issues,
>>> as the other stores I'm aware of are more "blob oriented". Could it be
>>> worth to build this abstraction in an higher level instead of in the
>>> JDBC cacheloader?
>>>
>>> Cheers,
>>> Sanne
>>>
>>> [1] - http://community.jboss.org/thread/166760
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> [hidden email]
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>
>> --
>> Manik Surtani
>> [hidden email]
>> twitter.com/maniksurtani
>>
>> Lead, Infinispan
>> http://www.infinispan.org
>>
>>
>>
>>
>> _______________________________________________
>> infinispan-dev mailing list
>> [hidden email]
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>
>
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev


--
Trustin Lee, http://gleamynode.net/
_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] chunking ability on the JDBC cacheloader

"이희승 (Trustin Lee)"
In reply to this post by Manik Surtani
On 05/20/2011 03:54 PM, Manik Surtani wrote:
> Is spanning rows the only real solution?  As you say it would mandate using transactions to keep multiple rows coherent, and 'm not sure if everyone would want to enable transactions for this.

There are more hidden overheads.  To update a value, the cache store
must determine how many chunks already exist in the store and
selectively delete and update them.  To simplify aggressively, we could
delete all chunks and insert new ones.  Either way comes at the cost of
considerable overhead.
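
To illustrate, the "selective" variant would look roughly like this
with plain JDBC (a sketch only, reusing the made-up ISPN_BUCKET layout
from Sanne's sketch; it would also have to run inside a transaction):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.List;

    class SelectiveChunkUpdate {
        static void update(Connection con, byte[] key, List<byte[]> newChunks, long timestamp)
                throws SQLException {
            int existing;
            try (PreparedStatement count = con.prepareStatement(
                    "SELECT COUNT(*) FROM ISPN_BUCKET WHERE KEY_BYTES = ?")) {
                count.setBytes(1, key);
                try (ResultSet rs = count.executeQuery()) {
                    rs.next();
                    existing = rs.getInt(1);
                }
            }
            try (PreparedStatement upd = con.prepareStatement(
                     "UPDATE ISPN_BUCKET SET TIMESTAMP = ?, DATA_CHUNK = ? WHERE KEY_BYTES = ? AND CHUNK_ID = ?");
                 PreparedStatement ins = con.prepareStatement(
                     "INSERT INTO ISPN_BUCKET (KEY_BYTES, CHUNK_ID, TIMESTAMP, DATA_CHUNK) VALUES (?, ?, ?, ?)");
                 PreparedStatement del = con.prepareStatement(
                     "DELETE FROM ISPN_BUCKET WHERE KEY_BYTES = ? AND CHUNK_ID >= ?")) {
                for (int i = 0; i < newChunks.size(); i++) {
                    if (i < existing) {
                        // this chunk position already exists: overwrite it in place
                        upd.setLong(1, timestamp);
                        upd.setBytes(2, newChunks.get(i));
                        upd.setBytes(3, key);
                        upd.setInt(4, i);
                        upd.executeUpdate();
                    } else {
                        // the value grew: append a new chunk row
                        ins.setBytes(1, key);
                        ins.setInt(2, i);
                        ins.setLong(3, timestamp);
                        ins.setBytes(4, newChunks.get(i));
                        ins.executeUpdate();
                    }
                }
                // the value shrank: drop any surplus chunk rows beyond the new length
                del.setBytes(1, key);
                del.setInt(2, newChunks.size());
                del.executeUpdate();
            }
        }
    }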

Even MySQL supports a BLOB of up to 4 GiB, so I think it's better to
update the schema?

> On 19 May 2011, at 19:06, Sanne Grinovero wrote:
>
>> As mentioned on the user forum [1], people setting up a JDBC
>> cacheloader need to be able to define the size of columns to be used.
>> The Lucene Directory has a feature to autonomously chunk the segment
>> contents at a configurable specified byte number, and so has the
>> GridFS; still there are other metadata objects which Lucene currently
>> doesn't chunk as it's "fairly small" (but undefined and possibly
>> growing), and in a more general sense anybody using the JDBC
>> cacheloader would face the same problem: what's the dimension I need
>> to use ?
>>
>> While in most cases the maximum size can be estimated, this is still
>> not good enough, as when you're wrong the byte array might get
>> truncated, so I think the CacheLoader should take care of this.
>>
>> what would you think of:
>> - adding a max_chunk_size option to JdbcStringBasedCacheStoreConfig
>> and JdbcBinaryCacheStore
>> - have them store in multiple rows the values which would be bigger
>> than max_chunk_size
>> - this will need transactions, which are currently not being used by
>> the cacheloaders
>>
>> It looks like to me that only the JDBC cacheloader has these issues,
>> as the other stores I'm aware of are more "blob oriented". Could it be
>> worth to build this abstraction in an higher level instead of in the
>> JDBC cacheloader?
>>
>> Cheers,
>> Sanne
>>
>> [1] - http://community.jboss.org/thread/166760
>> _______________________________________________
>> infinispan-dev mailing list
>> [hidden email]
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
> --
> Manik Surtani
> [hidden email]
> twitter.com/maniksurtani
>
> Lead, Infinispan
> http://www.infinispan.org
>
>
>
>
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev


--
Trustin Lee, http://gleamynode.net/
_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] chunking ability on the JDBC cacheloader

Dan Berindei
On Mon, May 23, 2011 at 7:04 AM, "이희승 (Trustin Lee)" <[hidden email]> wrote:

> On 05/20/2011 03:54 PM, Manik Surtani wrote:
>> Is spanning rows the only real solution?  As you say it would mandate using transactions to keep multiple rows coherent, and 'm not sure if everyone would want to enable transactions for this.
>
> There are more hidden overheads.  To update a value, the cache store
> must determine how many chunks already exists in the cache store and
> selectively delete and update them.  To simply aggressively, we could
> delete all chunks and insert new chunks.  Both at the cost of great
> overhead.
>
> Even MySQL supports a blog up to 4GiB, so I think it's better update the
> schema?
>

+1

BLOBs are only stored in external storage if the actual data can't fit
in a normal table row, so the only penalty of using a LONGBLOB
compared to a VARBINARY(255) is 3 extra bytes for the length.

If the user really wants to use a data type with a smaller maximum
length, we can just report an error when the data column size is too
small. We will need to check the length and throw an exception
ourselves, though, since with MySQL we can't be sure that it is
configured to raise errors when a value is truncated.
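
A minimal sketch of that check, assuming the store knows the configured
data column size (the dataColumnSize field below is a made-up
configuration property):

    class ValueSizeCheck {
        private final int dataColumnSize; // assumed to come from the store's configuration

        ValueSizeCheck(int dataColumnSize) {
            this.dataColumnSize = dataColumnSize;
        }

        // Fail fast before the INSERT rather than relying on the database,
        // since MySQL may only warn (and truncate) instead of raising an error.
        void checkValueFits(byte[] serializedValue) {
            if (serializedValue.length > dataColumnSize) {
                throw new IllegalStateException("Serialized value is " + serializedValue.length
                    + " bytes but the configured data column only holds " + dataColumnSize + " bytes");
            }
        }
    }

A real implementation would presumably throw the store's own exception
type instead of IllegalStateException.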

Cheers
Dan


>> On 19 May 2011, at 19:06, Sanne Grinovero wrote:
>>
>>> As mentioned on the user forum [1], people setting up a JDBC
>>> cacheloader need to be able to define the size of columns to be used.
>>> The Lucene Directory has a feature to autonomously chunk the segment
>>> contents at a configurable specified byte number, and so has the
>>> GridFS; still there are other metadata objects which Lucene currently
>>> doesn't chunk as it's "fairly small" (but undefined and possibly
>>> growing), and in a more general sense anybody using the JDBC
>>> cacheloader would face the same problem: what's the dimension I need
>>> to use ?
>>>
>>> While in most cases the maximum size can be estimated, this is still
>>> not good enough, as when you're wrong the byte array might get
>>> truncated, so I think the CacheLoader should take care of this.
>>>
>>> what would you think of:
>>> - adding a max_chunk_size option to JdbcStringBasedCacheStoreConfig
>>> and JdbcBinaryCacheStore
>>> - have them store in multiple rows the values which would be bigger
>>> than max_chunk_size
>>> - this will need transactions, which are currently not being used by
>>> the cacheloaders
>>>
>>> It looks like to me that only the JDBC cacheloader has these issues,
>>> as the other stores I'm aware of are more "blob oriented". Could it be
>>> worth to build this abstraction in an higher level instead of in the
>>> JDBC cacheloader?
>>>
>>> Cheers,
>>> Sanne
>>>
>>> [1] - http://community.jboss.org/thread/166760
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> [hidden email]
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>
>> --
>> Manik Surtani
>> [hidden email]
>> twitter.com/maniksurtani
>>
>> Lead, Infinispan
>> http://www.infinispan.org
>>
>>
>>
>>
>> _______________________________________________
>> infinispan-dev mailing list
>> [hidden email]
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
>
> --
> Trustin Lee, http://gleamynode.net/
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>

_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] chunking ability on the JDBC cacheloader

Sanne Grinovero
2011/5/23 Dan Berindei <[hidden email]>:
> On Mon, May 23, 2011 at 7:04 AM, "이희승 (Trustin Lee)" <[hidden email]> wrote:
>> On 05/20/2011 03:54 PM, Manik Surtani wrote:
>>> Is spanning rows the only real solution?  As you say it would mandate using transactions to keep multiple rows coherent, and 'm not sure if everyone would want to enable transactions for this.
>>
>> There are more hidden overheads.  To update a value, the cache store
>> must determine how many chunks already exists in the cache store and
>> selectively delete and update them.  To simply aggressively, we could
>> delete all chunks and insert new chunks.  Both at the cost of great
>> overhead.

I see no alternative to deleting all values for each key, as we don't
know which part of the byte array is dirty.
Which overhead are you referring to? We would still store the same
amount of data, split or not, but yes, multiple statements might
require clever batching.

>>
>> Even MySQL supports a blog up to 4GiB, so I think it's better update the
>> schema?

Do you mean by enlarging the column size only, or by adding the chunk_id?
I'm just asking; your and Dan's feedback has already persuaded me that
my initial idea of providing chunking should be avoided.

>
> +1
>
> BLOBs are only stored in external storage if the actual data can't fit
> in a normal table row, so the only penalty in using a LONGBLOB
> compared to a VARBINARY(255) is 3 extra bytes for the length.
>
> If the user really wants to use a data type with a smaller max length,
> we can just report an error when the data column size is too small. We
> will need to check the length and throw an exception ourselves though,
> with MySQL we can't be sure that it is configured to raise errors when
> a value is truncated.

+1
It might be better to just check that the maximum size of stored values
fits in "something"; I'm not sure we can guess the proper size from
database metadata: not only is the maximum column size involved, but
MySQL (to keep it as the reference example, though this might apply to
others too) also has a default maximum packet size for connections
which is not very big; when using it with Infinispan I always had to
reconfigure the database server.
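
For example, on MySQL that packet limit can at least be read at startup
like this (MySQL-specific; the method name is made up):

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    class MySqlLimits {
        // Returns the server-side max_allowed_packet, which caps the size of any
        // single statement (and therefore of any value we can write in one row).
        static long readMaxAllowedPacket(Connection con) throws SQLException {
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT @@global.max_allowed_packet")) {
                return rs.next() ? rs.getLong(1) : -1L;
            }
        }
    }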

Also, as BLOBs make very poor primary keys, people might want to use a
limited and well-known byte size for their keys.

So, shall we just add a method that checks a user-defined threshold has
not been surpassed, validating both key and value against separately
configurable sizes? Should an exception be raised in that case?
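
Something like the following is what I have in mind, building on the
kind of length check Dan mentioned; maxKeySize and maxValueSize are
hypothetical configuration properties:

    class SizeThresholdCheck {
        private final int maxKeySize;   // hypothetical, user-configured
        private final int maxValueSize; // hypothetical, user-configured

        SizeThresholdCheck(int maxKeySize, int maxValueSize) {
            this.maxKeySize = maxKeySize;
            this.maxValueSize = maxValueSize;
        }

        // Checks key and value against separately configured limits before storing.
        void assertWithinLimits(byte[] serializedKey, byte[] serializedValue) {
            if (serializedKey.length > maxKeySize) {
                throw new IllegalArgumentException("Key of " + serializedKey.length
                    + " bytes exceeds the configured maximum of " + maxKeySize);
            }
            if (serializedValue.length > maxValueSize) {
                throw new IllegalArgumentException("Value of " + serializedValue.length
                    + " bytes exceeds the configured maximum of " + maxValueSize);
            }
        }
    }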

Cheers,
Sanne

>
> Cheers
> Dan
>
>
>>> On 19 May 2011, at 19:06, Sanne Grinovero wrote:
>>>
>>>> As mentioned on the user forum [1], people setting up a JDBC
>>>> cacheloader need to be able to define the size of columns to be used.
>>>> The Lucene Directory has a feature to autonomously chunk the segment
>>>> contents at a configurable specified byte number, and so has the
>>>> GridFS; still there are other metadata objects which Lucene currently
>>>> doesn't chunk as it's "fairly small" (but undefined and possibly
>>>> growing), and in a more general sense anybody using the JDBC
>>>> cacheloader would face the same problem: what's the dimension I need
>>>> to use ?
>>>>
>>>> While in most cases the maximum size can be estimated, this is still
>>>> not good enough, as when you're wrong the byte array might get
>>>> truncated, so I think the CacheLoader should take care of this.
>>>>
>>>> what would you think of:
>>>> - adding a max_chunk_size option to JdbcStringBasedCacheStoreConfig
>>>> and JdbcBinaryCacheStore
>>>> - have them store in multiple rows the values which would be bigger
>>>> than max_chunk_size
>>>> - this will need transactions, which are currently not being used by
>>>> the cacheloaders
>>>>
>>>> It looks like to me that only the JDBC cacheloader has these issues,
>>>> as the other stores I'm aware of are more "blob oriented". Could it be
>>>> worth to build this abstraction in an higher level instead of in the
>>>> JDBC cacheloader?
>>>>
>>>> Cheers,
>>>> Sanne
>>>>
>>>> [1] - http://community.jboss.org/thread/166760
>>>> _______________________________________________
>>>> infinispan-dev mailing list
>>>> [hidden email]
>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>
>>> --
>>> Manik Surtani
>>> [hidden email]
>>> twitter.com/maniksurtani
>>>
>>> Lead, Infinispan
>>> http://www.infinispan.org
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> [hidden email]
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>
>>
>> --
>> Trustin Lee, http://gleamynode.net/
>> _______________________________________________
>> infinispan-dev mailing list
>> [hidden email]
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>
>
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev

_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] chunking ability on the JDBC cacheloader

"이희승 (Trustin Lee)"
On 05/23/2011 07:40 PM, Sanne Grinovero wrote:

> 2011/5/23 Dan Berindei<[hidden email]>:
>> On Mon, May 23, 2011 at 7:04 AM, "이희승 (Trustin Lee)"<[hidden email]>  wrote:
>>> On 05/20/2011 03:54 PM, Manik Surtani wrote:
>>>> Is spanning rows the only real solution?  As you say it would mandate using transactions to keep multiple rows coherent, and 'm not sure if everyone would want to enable transactions for this.
>>>
>>> There are more hidden overheads.  To update a value, the cache store
>>> must determine how many chunks already exists in the cache store and
>>> selectively delete and update them.  To simply aggressively, we could
>>> delete all chunks and insert new chunks.  Both at the cost of great
>>> overhead.
>
> I see no alternative to delete all values for each key, as we don't
> know which part of the byte array is dirty;
> At which overhead are you referring? We would still store the same
> amount of data, slit or not split, but yes multiple statements might
> require clever batching.
>
>>>
>>> Even MySQL supports a blog up to 4GiB, so I think it's better update the
>>> schema?
>
> You mean by accommodating the column size only, or adding the chunk_id ?
> I'm just asking, but all of yours and Dan's feedback have already
> persuaded me that my initial idea of providing chunking should be
> avoided.

I mean the user updating the column type in the schema.

>>
>> +1
>>
>> BLOBs are only stored in external storage if the actual data can't fit
>> in a normal table row, so the only penalty in using a LONGBLOB
>> compared to a VARBINARY(255) is 3 extra bytes for the length.
>>
>> If the user really wants to use a data type with a smaller max length,
>> we can just report an error when the data column size is too small. We
>> will need to check the length and throw an exception ourselves though,
>> with MySQL we can't be sure that it is configured to raise errors when
>> a value is truncated.
>
> +1
> it might be better to just check for the maximum size of stored values
> to fit in "something"; I'm not sure if we can guess the proper size
> from database metadata: not only the column maximum size is involved,
> but MySQL (to keep it as reference example, but might apply to others)
> also has a default maximum packet size for the connections which is
> not very big, when using it with Infinispan I always had to
> reconfigure the database server.
>
> Also as BLOBs are very poor as primary key, people might want to use a
> limited and well known byte size for their keys.
>
> So, shall we just add a method to check to not have surpassed a user
> defined threshold, checking for both key and value but on different
> configurable sizes? Should an exception be raised in that case?

An exception will be raised by the JDBC driver if the key doesn't fit
into the key column, so we could simply wrap it?
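
Roughly like this (a sketch only; a real store would use its own
exception type rather than RuntimeException):

    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    class WrappingExample {
        static void executeWrapped(PreparedStatement ps) {
            try {
                ps.executeUpdate();
            } catch (SQLException e) {
                // Wrap the driver-specific failure (e.g. a key that doesn't fit the
                // key column) in a store-level exception so callers see a single API.
                throw new RuntimeException("Could not persist entry in the JDBC cache store", e);
            }
        }
    }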

>
> Cheers,
> Sanne
>
>>
>> Cheers
>> Dan
>>
>>
>>>> On 19 May 2011, at 19:06, Sanne Grinovero wrote:
>>>>
>>>>> As mentioned on the user forum [1], people setting up a JDBC
>>>>> cacheloader need to be able to define the size of columns to be used.
>>>>> The Lucene Directory has a feature to autonomously chunk the segment
>>>>> contents at a configurable specified byte number, and so has the
>>>>> GridFS; still there are other metadata objects which Lucene currently
>>>>> doesn't chunk as it's "fairly small" (but undefined and possibly
>>>>> growing), and in a more general sense anybody using the JDBC
>>>>> cacheloader would face the same problem: what's the dimension I need
>>>>> to use ?
>>>>>
>>>>> While in most cases the maximum size can be estimated, this is still
>>>>> not good enough, as when you're wrong the byte array might get
>>>>> truncated, so I think the CacheLoader should take care of this.
>>>>>
>>>>> what would you think of:
>>>>> - adding a max_chunk_size option to JdbcStringBasedCacheStoreConfig
>>>>> and JdbcBinaryCacheStore
>>>>> - have them store in multiple rows the values which would be bigger
>>>>> than max_chunk_size
>>>>> - this will need transactions, which are currently not being used by
>>>>> the cacheloaders
>>>>>
>>>>> It looks like to me that only the JDBC cacheloader has these issues,
>>>>> as the other stores I'm aware of are more "blob oriented". Could it be
>>>>> worth to build this abstraction in an higher level instead of in the
>>>>> JDBC cacheloader?
>>>>>
>>>>> Cheers,
>>>>> Sanne
>>>>>
>>>>> [1] - http://community.jboss.org/thread/166760
>>>>> _______________________________________________
>>>>> infinispan-dev mailing list
>>>>> [hidden email]
>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>
>>>> --
>>>> Manik Surtani
>>>> [hidden email]
>>>> twitter.com/maniksurtani
>>>>
>>>> Lead, Infinispan
>>>> http://www.infinispan.org
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> infinispan-dev mailing list
>>>> [hidden email]
>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>
>>>
>>> --
>>> Trustin Lee, http://gleamynode.net/
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> [hidden email]
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>
>>
>> _______________________________________________
>> infinispan-dev mailing list
>> [hidden email]
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev


--
Trustin Lee, http://gleamynode.net/
_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] chunking ability on the JDBC cacheloader

Sanne Grinovero
2011/5/23 "이희승 (Trustin Lee)" <[hidden email]>:

> On 05/23/2011 07:40 PM, Sanne Grinovero wrote:
>> 2011/5/23 Dan Berindei<[hidden email]>:
>>> On Mon, May 23, 2011 at 7:04 AM, "이희승 (Trustin Lee)"<[hidden email]>  wrote:
>>>> On 05/20/2011 03:54 PM, Manik Surtani wrote:
>>>>> Is spanning rows the only real solution?  As you say it would mandate using transactions to keep multiple rows coherent, and 'm not sure if everyone would want to enable transactions for this.
>>>>
>>>> There are more hidden overheads.  To update a value, the cache store
>>>> must determine how many chunks already exists in the cache store and
>>>> selectively delete and update them.  To simply aggressively, we could
>>>> delete all chunks and insert new chunks.  Both at the cost of great
>>>> overhead.
>>
>> I see no alternative to delete all values for each key, as we don't
>> know which part of the byte array is dirty;
>> At which overhead are you referring? We would still store the same
>> amount of data, slit or not split, but yes multiple statements might
>> require clever batching.
>>
>>>>
>>>> Even MySQL supports a blog up to 4GiB, so I think it's better update the
>>>> schema?
>>
>> You mean by accommodating the column size only, or adding the chunk_id ?
>> I'm just asking, but all of yours and Dan's feedback have already
>> persuaded me that my initial idea of providing chunking should be
>> avoided.
>
> I mean user's updating the column type of the schema.
>
>>>
>>> +1
>>>
>>> BLOBs are only stored in external storage if the actual data can't fit
>>> in a normal table row, so the only penalty in using a LONGBLOB
>>> compared to a VARBINARY(255) is 3 extra bytes for the length.
>>>
>>> If the user really wants to use a data type with a smaller max length,
>>> we can just report an error when the data column size is too small. We
>>> will need to check the length and throw an exception ourselves though,
>>> with MySQL we can't be sure that it is configured to raise errors when
>>> a value is truncated.
>>
>> +1
>> it might be better to just check for the maximum size of stored values
>> to fit in "something"; I'm not sure if we can guess the proper size
>> from database metadata: not only the column maximum size is involved,
>> but MySQL (to keep it as reference example, but might apply to others)
>> also has a default maximum packet size for the connections which is
>> not very big, when using it with Infinispan I always had to
>> reconfigure the database server.
>>
>> Also as BLOBs are very poor as primary key, people might want to use a
>> limited and well known byte size for their keys.
>>
>> So, shall we just add a method to check to not have surpassed a user
>> defined threshold, checking for both key and value but on different
>> configurable sizes? Should an exception be raised in that case?
>
> Exception will be raised by JDBC driver if key doesn't fit into the key
> column, so we could simply wrap it?

If that always happens, then I wouldn't wrap it: entering the business
of wrapping driver-specific exceptions is very tricky ;)
I was more concerned about the fact that some databases might not raise
any exception? Not sure if that's the case, and possibly not our
problem.

Sanne

>
>>
>> Cheers,
>> Sanne
>>
>>>
>>> Cheers
>>> Dan
>>>
>>>
>>>>> On 19 May 2011, at 19:06, Sanne Grinovero wrote:
>>>>>
>>>>>> As mentioned on the user forum [1], people setting up a JDBC
>>>>>> cacheloader need to be able to define the size of columns to be used.
>>>>>> The Lucene Directory has a feature to autonomously chunk the segment
>>>>>> contents at a configurable specified byte number, and so has the
>>>>>> GridFS; still there are other metadata objects which Lucene currently
>>>>>> doesn't chunk as it's "fairly small" (but undefined and possibly
>>>>>> growing), and in a more general sense anybody using the JDBC
>>>>>> cacheloader would face the same problem: what's the dimension I need
>>>>>> to use ?
>>>>>>
>>>>>> While in most cases the maximum size can be estimated, this is still
>>>>>> not good enough, as when you're wrong the byte array might get
>>>>>> truncated, so I think the CacheLoader should take care of this.
>>>>>>
>>>>>> what would you think of:
>>>>>> - adding a max_chunk_size option to JdbcStringBasedCacheStoreConfig
>>>>>> and JdbcBinaryCacheStore
>>>>>> - have them store in multiple rows the values which would be bigger
>>>>>> than max_chunk_size
>>>>>> - this will need transactions, which are currently not being used by
>>>>>> the cacheloaders
>>>>>>
>>>>>> It looks like to me that only the JDBC cacheloader has these issues,
>>>>>> as the other stores I'm aware of are more "blob oriented". Could it be
>>>>>> worth to build this abstraction in an higher level instead of in the
>>>>>> JDBC cacheloader?
>>>>>>
>>>>>> Cheers,
>>>>>> Sanne
>>>>>>
>>>>>> [1] - http://community.jboss.org/thread/166760
>>>>>> _______________________________________________
>>>>>> infinispan-dev mailing list
>>>>>> [hidden email]
>>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>>
>>>>> --
>>>>> Manik Surtani
>>>>> [hidden email]
>>>>> twitter.com/maniksurtani
>>>>>
>>>>> Lead, Infinispan
>>>>> http://www.infinispan.org
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> infinispan-dev mailing list
>>>>> [hidden email]
>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>
>>>>
>>>> --
>>>> Trustin Lee, http://gleamynode.net/
>>>> _______________________________________________
>>>> infinispan-dev mailing list
>>>> [hidden email]
>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>
>>>
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> [hidden email]
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>
>> _______________________________________________
>> infinispan-dev mailing list
>> [hidden email]
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
>
> --
> Trustin Lee, http://gleamynode.net/
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev

_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] chunking ability on the JDBC cacheloader

Dan Berindei
On Mon, May 23, 2011 at 8:05 PM, Sanne Grinovero
<[hidden email]> wrote:

> 2011/5/23 "이희승 (Trustin Lee)" <[hidden email]>:
>> On 05/23/2011 07:40 PM, Sanne Grinovero wrote:
>>> 2011/5/23 Dan Berindei<[hidden email]>:
>>>> On Mon, May 23, 2011 at 7:04 AM, "이희승 (Trustin Lee)"<[hidden email]>  wrote:
>>>>> On 05/20/2011 03:54 PM, Manik Surtani wrote:
>>>>>> Is spanning rows the only real solution?  As you say it would mandate using transactions to keep multiple rows coherent, and 'm not sure if everyone would want to enable transactions for this.
>>>>>
>>>>> There are more hidden overheads.  To update a value, the cache store
>>>>> must determine how many chunks already exists in the cache store and
>>>>> selectively delete and update them.  To simply aggressively, we could
>>>>> delete all chunks and insert new chunks.  Both at the cost of great
>>>>> overhead.
>>>
>>> I see no alternative to delete all values for each key, as we don't
>>> know which part of the byte array is dirty;
>>> At which overhead are you referring? We would still store the same
>>> amount of data, slit or not split, but yes multiple statements might
>>> require clever batching.
>>>
>>>>>
>>>>> Even MySQL supports a blog up to 4GiB, so I think it's better update the
>>>>> schema?
>>>
>>> You mean by accommodating the column size only, or adding the chunk_id ?
>>> I'm just asking, but all of yours and Dan's feedback have already
>>> persuaded me that my initial idea of providing chunking should be
>>> avoided.
>>
>> I mean user's updating the column type of the schema.
>>
>>>>
>>>> +1
>>>>
>>>> BLOBs are only stored in external storage if the actual data can't fit
>>>> in a normal table row, so the only penalty in using a LONGBLOB
>>>> compared to a VARBINARY(255) is 3 extra bytes for the length.
>>>>
>>>> If the user really wants to use a data type with a smaller max length,
>>>> we can just report an error when the data column size is too small. We
>>>> will need to check the length and throw an exception ourselves though,
>>>> with MySQL we can't be sure that it is configured to raise errors when
>>>> a value is truncated.
>>>
>>> +1
>>> it might be better to just check for the maximum size of stored values
>>> to fit in "something"; I'm not sure if we can guess the proper size
>>> from database metadata: not only the column maximum size is involved,
>>> but MySQL (to keep it as reference example, but might apply to others)
>>> also has a default maximum packet size for the connections which is
>>> not very big, when using it with Infinispan I always had to
>>> reconfigure the database server.
>>>
>>> Also as BLOBs are very poor as primary key, people might want to use a
>>> limited and well known byte size for their keys.
>>>
>>> So, shall we just add a method to check to not have surpassed a user
>>> defined threshold, checking for both key and value but on different
>>> configurable sizes? Should an exception be raised in that case?
>>
>> Exception will be raised by JDBC driver if key doesn't fit into the key
>> column, so we could simply wrap it?
>
> If that always happens, the I wouldn't wrap it. entering the business
> of wrapping driver specific exceptions is very tricky ;)
> I was more concerned about the fact that some database might not raise
> any exception ? Not sure if that's the case, and possibly not our
> problem.
>

By default MySQL only gives a warning if a value is truncated. We
could throw an exception every time we get a warning from the DB, but
by then the wrong value has already been inserted, and if the key was
truncated we don't even have enough information to delete it.
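
With plain JDBC that detection would look roughly like this; note that,
as said, the (possibly truncated) row is already in the table by the
time the warning shows up:

    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.SQLWarning;

    class TruncationWarningCheck {
        static void executeAndFailOnWarning(PreparedStatement ps) throws SQLException {
            ps.executeUpdate();
            SQLWarning warning = ps.getWarnings();   // MySQL reports truncation here by default
            if (warning != null) {
                // The (possibly truncated) row has already been written at this point.
                throw new SQLException("Database reported a warning while storing the entry", warning);
            }
        }
    }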

A better option, to avoid checking ourselves, may be for the MySQL
implementation to verify on startup that STRICT_ALL_TABLES
(http://dev.mysql.com/doc/refman/5.5/en/server-sql-mode.html#sqlmode_strict_all_tables)
is enabled, using SELECT @@SESSION.sql_mode, and refuse to start if it's
not. There is another STRICT_TRANS_TABLES mode, but I don't know how to
find out whether a table is transactional or not...
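
A sketch of that startup check (MySQL-specific, and the method name is
made up):

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    class MySqlStrictModeCheck {
        // Refuse to start against a MySQL server that silently truncates values.
        static void assertStrictMode(Connection con) throws SQLException {
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery("SELECT @@SESSION.sql_mode")) {
                String mode = rs.next() ? rs.getString(1) : "";
                if (!mode.contains("STRICT_ALL_TABLES")) {
                    throw new IllegalStateException("MySQL sql_mode is '" + mode
                        + "'; enable STRICT_ALL_TABLES so truncation raises errors");
                }
            }
        }
    }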

Cheers
Dan


> Sanne
>
>>
>>>
>>> Cheers,
>>> Sanne
>>>
>>>>
>>>> Cheers
>>>> Dan
>>>>
>>>>
>>>>>> On 19 May 2011, at 19:06, Sanne Grinovero wrote:
>>>>>>
>>>>>>> As mentioned on the user forum [1], people setting up a JDBC
>>>>>>> cacheloader need to be able to define the size of columns to be used.
>>>>>>> The Lucene Directory has a feature to autonomously chunk the segment
>>>>>>> contents at a configurable specified byte number, and so has the
>>>>>>> GridFS; still there are other metadata objects which Lucene currently
>>>>>>> doesn't chunk as it's "fairly small" (but undefined and possibly
>>>>>>> growing), and in a more general sense anybody using the JDBC
>>>>>>> cacheloader would face the same problem: what's the dimension I need
>>>>>>> to use ?
>>>>>>>
>>>>>>> While in most cases the maximum size can be estimated, this is still
>>>>>>> not good enough, as when you're wrong the byte array might get
>>>>>>> truncated, so I think the CacheLoader should take care of this.
>>>>>>>
>>>>>>> what would you think of:
>>>>>>> - adding a max_chunk_size option to JdbcStringBasedCacheStoreConfig
>>>>>>> and JdbcBinaryCacheStore
>>>>>>> - have them store in multiple rows the values which would be bigger
>>>>>>> than max_chunk_size
>>>>>>> - this will need transactions, which are currently not being used by
>>>>>>> the cacheloaders
>>>>>>>
>>>>>>> It looks like to me that only the JDBC cacheloader has these issues,
>>>>>>> as the other stores I'm aware of are more "blob oriented". Could it be
>>>>>>> worth to build this abstraction in an higher level instead of in the
>>>>>>> JDBC cacheloader?
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Sanne
>>>>>>>
>>>>>>> [1] - http://community.jboss.org/thread/166760
>>>>>>> _______________________________________________
>>>>>>> infinispan-dev mailing list
>>>>>>> [hidden email]
>>>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>>>
>>>>>> --
>>>>>> Manik Surtani
>>>>>> [hidden email]
>>>>>> twitter.com/maniksurtani
>>>>>>
>>>>>> Lead, Infinispan
>>>>>> http://www.infinispan.org
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> infinispan-dev mailing list
>>>>>> [hidden email]
>>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>>
>>>>>
>>>>> --
>>>>> Trustin Lee, http://gleamynode.net/
>>>>> _______________________________________________
>>>>> infinispan-dev mailing list
>>>>> [hidden email]
>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>>
>>>>
>>>> _______________________________________________
>>>> infinispan-dev mailing list
>>>> [hidden email]
>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> [hidden email]
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>
>>
>> --
>> Trustin Lee, http://gleamynode.net/
>> _______________________________________________
>> infinispan-dev mailing list
>> [hidden email]
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev

_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] chunking ability on the JDBC cacheloader

Sanne Grinovero
2011/5/24 Dan Berindei <[hidden email]>:

> On Mon, May 23, 2011 at 8:05 PM, Sanne Grinovero
> <[hidden email]> wrote:
>> 2011/5/23 "이희승 (Trustin Lee)" <[hidden email]>:
>>> On 05/23/2011 07:40 PM, Sanne Grinovero wrote:
>>>> 2011/5/23 Dan Berindei<[hidden email]>:
>>>>> On Mon, May 23, 2011 at 7:04 AM, "이희승 (Trustin Lee)"<[hidden email]>  wrote:
>>>>>> On 05/20/2011 03:54 PM, Manik Surtani wrote:
>>>>>>> Is spanning rows the only real solution?  As you say it would mandate using transactions to keep multiple rows coherent, and 'm not sure if everyone would want to enable transactions for this.
>>>>>>
>>>>>> There are more hidden overheads.  To update a value, the cache store
>>>>>> must determine how many chunks already exists in the cache store and
>>>>>> selectively delete and update them.  To simply aggressively, we could
>>>>>> delete all chunks and insert new chunks.  Both at the cost of great
>>>>>> overhead.
>>>>
>>>> I see no alternative to delete all values for each key, as we don't
>>>> know which part of the byte array is dirty;
>>>> At which overhead are you referring? We would still store the same
>>>> amount of data, slit or not split, but yes multiple statements might
>>>> require clever batching.
>>>>
>>>>>>
>>>>>> Even MySQL supports a blog up to 4GiB, so I think it's better update the
>>>>>> schema?
>>>>
>>>> You mean by accommodating the column size only, or adding the chunk_id ?
>>>> I'm just asking, but all of yours and Dan's feedback have already
>>>> persuaded me that my initial idea of providing chunking should be
>>>> avoided.
>>>
>>> I mean user's updating the column type of the schema.
>>>
>>>>>
>>>>> +1
>>>>>
>>>>> BLOBs are only stored in external storage if the actual data can't fit
>>>>> in a normal table row, so the only penalty in using a LONGBLOB
>>>>> compared to a VARBINARY(255) is 3 extra bytes for the length.
>>>>>
>>>>> If the user really wants to use a data type with a smaller max length,
>>>>> we can just report an error when the data column size is too small. We
>>>>> will need to check the length and throw an exception ourselves though,
>>>>> with MySQL we can't be sure that it is configured to raise errors when
>>>>> a value is truncated.
>>>>
>>>> +1
>>>> it might be better to just check for the maximum size of stored values
>>>> to fit in "something"; I'm not sure if we can guess the proper size
>>>> from database metadata: not only the column maximum size is involved,
>>>> but MySQL (to keep it as reference example, but might apply to others)
>>>> also has a default maximum packet size for the connections which is
>>>> not very big, when using it with Infinispan I always had to
>>>> reconfigure the database server.
>>>>
>>>> Also as BLOBs are very poor as primary key, people might want to use a
>>>> limited and well known byte size for their keys.
>>>>
>>>> So, shall we just add a method to check to not have surpassed a user
>>>> defined threshold, checking for both key and value but on different
>>>> configurable sizes? Should an exception be raised in that case?
>>>
>>> Exception will be raised by JDBC driver if key doesn't fit into the key
>>> column, so we could simply wrap it?
>>
>> If that always happens, the I wouldn't wrap it. entering the business
>> of wrapping driver specific exceptions is very tricky ;)
>> I was more concerned about the fact that some database might not raise
>> any exception ? Not sure if that's the case, and possibly not our
>> problem.
>>
>
> By default MySQL only gives a warning if the value is truncated. We
> could throw an exception every time we got a warning from the DB, but
> the wrong value has already been inserted in the DB and if the key was
> truncated then we don't even have enough information to delete it.

Yes, that rang a bell; it was MySQL then, indeed.

>
> A better option to avoid checking ourselves may be to check on startup
> if STRICT_ALL_TABLES
> (http://dev.mysql.com/doc/refman/5.5/en/server-sql-mode.html#sqlmode_strict_all_tables)
> is enabled with SELECT @@SESSION.sql_mode in the MySQL implementation
> and refuse to start if it's not. There is another STRICT_TRANS_TABLES
> mode, but I don't know how to find out if a table is transactional or
> not...

You can check the transactional capabilities of a table by running SHOW
CREATE TABLE and looking at which engine it's using.

Still, I don't think we should prevent people from shooting themselves
in the foot; I'm not going to raise an exception if I don't detect that
they have a proper backup policy either, nor if they're using a JDBC
driver with known bugs.

I think that what people need from us is a way to understand the size
of what they're going to store;
logging it as we did for ISPN-1125 is a first step, and maybe it's enough?
Maybe it would be useful to collect maximum sizes and print them
regularly in the logs too, or expose them through MBeans?
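
A sketch of the MBean idea; the StoredSizeStats names and the
ObjectName are made up, but the JMX registration itself is just the
standard platform API:

    // StoredSizeStatsMBean.java: the JMX management interface (name is invented).
    public interface StoredSizeStatsMBean {
        long getMaxValueSize();
    }

    // StoredSizeStats.java: tracks the largest stored value and exposes it over JMX.
    import java.lang.management.ManagementFactory;
    import java.util.concurrent.atomic.AtomicLong;
    import javax.management.ObjectName;

    public class StoredSizeStats implements StoredSizeStatsMBean {
        private final AtomicLong maxValueSize = new AtomicLong();

        // Called by the store on every write to remember the largest value seen.
        public void recordValueSize(long size) {
            maxValueSize.accumulateAndGet(size, Math::max);
        }

        @Override
        public long getMaxValueSize() {
            return maxValueSize.get();
        }

        // Expose the statistic so operators can size their columns accordingly.
        public void register() throws Exception {
            ManagementFactory.getPlatformMBeanServer().registerMBean(
                this, new ObjectName("org.example:type=JdbcCacheStoreSizeStats"));
        }
    }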

Cheers,
Sanne


>
> Cheers
> Dan
>
>
>> Sanne
>>
>>>
>>>>
>>>> Cheers,
>>>> Sanne
>>>>
>>>>>
>>>>> Cheers
>>>>> Dan
>>>>>
>>>>>
>>>>>>> On 19 May 2011, at 19:06, Sanne Grinovero wrote:
>>>>>>>
>>>>>>>> As mentioned on the user forum [1], people setting up a JDBC
>>>>>>>> cacheloader need to be able to define the size of columns to be used.
>>>>>>>> The Lucene Directory has a feature to autonomously chunk the segment
>>>>>>>> contents at a configurable specified byte number, and so has the
>>>>>>>> GridFS; still there are other metadata objects which Lucene currently
>>>>>>>> doesn't chunk as it's "fairly small" (but undefined and possibly
>>>>>>>> growing), and in a more general sense anybody using the JDBC
>>>>>>>> cacheloader would face the same problem: what's the dimension I need
>>>>>>>> to use ?
>>>>>>>>
>>>>>>>> While in most cases the maximum size can be estimated, this is still
>>>>>>>> not good enough, as when you're wrong the byte array might get
>>>>>>>> truncated, so I think the CacheLoader should take care of this.
>>>>>>>>
>>>>>>>> what would you think of:
>>>>>>>> - adding a max_chunk_size option to JdbcStringBasedCacheStoreConfig
>>>>>>>> and JdbcBinaryCacheStore
>>>>>>>> - have them store in multiple rows the values which would be bigger
>>>>>>>> than max_chunk_size
>>>>>>>> - this will need transactions, which are currently not being used by
>>>>>>>> the cacheloaders
>>>>>>>>
>>>>>>>> It looks like to me that only the JDBC cacheloader has these issues,
>>>>>>>> as the other stores I'm aware of are more "blob oriented". Could it be
>>>>>>>> worth to build this abstraction in an higher level instead of in the
>>>>>>>> JDBC cacheloader?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Sanne
>>>>>>>>
>>>>>>>> [1] - http://community.jboss.org/thread/166760
>>>>>>>> _______________________________________________
>>>>>>>> infinispan-dev mailing list
>>>>>>>> [hidden email]
>>>>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>>>>
>>>>>>> --
>>>>>>> Manik Surtani
>>>>>>> [hidden email]
>>>>>>> twitter.com/maniksurtani
>>>>>>>
>>>>>>> Lead, Infinispan
>>>>>>> http://www.infinispan.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> infinispan-dev mailing list
>>>>>>> [hidden email]
>>>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Trustin Lee, http://gleamynode.net/
>>>>>> _______________________________________________
>>>>>> infinispan-dev mailing list
>>>>>> [hidden email]
>>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> infinispan-dev mailing list
>>>>> [hidden email]
>>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>>
>>>> _______________________________________________
>>>> infinispan-dev mailing list
>>>> [hidden email]
>>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>>
>>>
>>> --
>>> Trustin Lee, http://gleamynode.net/
>>> _______________________________________________
>>> infinispan-dev mailing list
>>> [hidden email]
>>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>>
>> _______________________________________________
>> infinispan-dev mailing list
>> [hidden email]
>> https://lists.jboss.org/mailman/listinfo/infinispan-dev
>
> _______________________________________________
> infinispan-dev mailing list
> [hidden email]
> https://lists.jboss.org/mailman/listinfo/infinispan-dev

_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] chunking ability on the JDBC cacheloader

Dan Berindei
On Tue, May 24, 2011 at 11:42 AM, Sanne Grinovero
<[hidden email]> wrote:

> 2011/5/24 Dan Berindei <[hidden email]>:
>> On Mon, May 23, 2011 at 8:05 PM, Sanne Grinovero
>> <[hidden email]> wrote:
>>> 2011/5/23 "이희승 (Trustin Lee)" <[hidden email]>:
>>>> On 05/23/2011 07:40 PM, Sanne Grinovero wrote:
>>>>> 2011/5/23 Dan Berindei<[hidden email]>:
>>>>>> On Mon, May 23, 2011 at 7:04 AM, "이희승 (Trustin Lee)"<[hidden email]>  wrote:
>>>>>>> On 05/20/2011 03:54 PM, Manik Surtani wrote:
>>>>>>>> Is spanning rows the only real solution?  As you say it would mandate using transactions to keep multiple rows coherent, and 'm not sure if everyone would want to enable transactions for this.
>>>>>>>
>>>>>>> There are more hidden overheads.  To update a value, the cache store
>>>>>>> must determine how many chunks already exists in the cache store and
>>>>>>> selectively delete and update them.  To simply aggressively, we could
>>>>>>> delete all chunks and insert new chunks.  Both at the cost of great
>>>>>>> overhead.
>>>>>
>>>>> I see no alternative to delete all values for each key, as we don't
>>>>> know which part of the byte array is dirty;
>>>>> At which overhead are you referring? We would still store the same
>>>>> amount of data, slit or not split, but yes multiple statements might
>>>>> require clever batching.
>>>>>
>>>>>>>
>>>>>>> Even MySQL supports a blog up to 4GiB, so I think it's better update the
>>>>>>> schema?
>>>>>
>>>>> You mean by accommodating the column size only, or adding the chunk_id ?
>>>>> I'm just asking, but all of yours and Dan's feedback have already
>>>>> persuaded me that my initial idea of providing chunking should be
>>>>> avoided.
>>>>
>>>> I mean the user updating the column type in the schema.
>>>>
>>>>>>
>>>>>> +1
>>>>>>
>>>>>> BLOBs are only stored in external storage if the actual data can't fit
>>>>>> in a normal table row, so the only penalty in using a LONGBLOB
>>>>>> compared to a VARBINARY(255) is 3 extra bytes for the length.
>>>>>>
>>>>>> If the user really wants to use a data type with a smaller max length,
>>>>>> we can just report an error when the data column size is too small. We
>>>>>> will need to check the length and throw an exception ourselves though,
>>>>>> since with MySQL we can't be sure that it is configured to raise errors
>>>>>> when a value is truncated.
>>>>>
>>>>> +1
>>>>> it might be better to just check that the maximum size of stored values
>>>>> fits in "something"; I'm not sure we can guess the proper size
>>>>> from database metadata: not only is the column maximum size involved,
>>>>> but MySQL (keeping it as the reference example, though it might apply
>>>>> to others) also has a default maximum packet size for connections which
>>>>> is not very big; when using it with Infinispan I always had to
>>>>> reconfigure the database server.
>>>>>
>>>>> Also, as BLOBs make very poor primary keys, people might want to use a
>>>>> limited and well-known byte size for their keys.
>>>>>
>>>>> So, shall we just add a method that checks a user-defined threshold has
>>>>> not been surpassed, checking both key and value but against different
>>>>> configurable sizes? Should an exception be raised in that case?
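
To make the question concrete, the check itself could be as small as the
sketch below; maxKeySizeBytes and maxValueSizeBytes are hypothetical
configuration options, and whether IllegalArgumentException is the right
thing to throw is exactly the open question:

// Sketch of the proposed check; both limits would be new, user-configurable
// options that do not exist today.
class SizeThresholdCheck {

   private final int maxKeySizeBytes;
   private final int maxValueSizeBytes;

   SizeThresholdCheck(int maxKeySizeBytes, int maxValueSizeBytes) {
      this.maxKeySizeBytes = maxKeySizeBytes;
      this.maxValueSizeBytes = maxValueSizeBytes;
   }

   void assertFits(byte[] serializedKey, byte[] serializedValue) {
      if (serializedKey.length > maxKeySizeBytes) {
         throw new IllegalArgumentException("Serialized key is " + serializedKey.length
               + " bytes, configured limit is " + maxKeySizeBytes);
      }
      if (serializedValue.length > maxValueSizeBytes) {
         throw new IllegalArgumentException("Serialized value is " + serializedValue.length
               + " bytes, configured limit is " + maxValueSizeBytes);
      }
   }
}
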
>>>>
>>>> An exception will be raised by the JDBC driver if the key doesn't fit into
>>>> the key column, so we could simply wrap it?
>>>
>>> If that always happens, then I wouldn't wrap it. Entering the business
>>> of wrapping driver-specific exceptions is very tricky ;)
>>> I was more concerned about the fact that some database might not raise
>>> any exception? Not sure if that's the case, and possibly not our
>>> problem.
>>>
>>
>> By default MySQL only gives a warning if the value is truncated. We
>> could throw an exception every time we get a warning from the DB, but
>> the wrong value has already been inserted, and if the key was
>> truncated then we don't even have enough information to delete it.
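
For what it's worth, the warning is at least visible through JDBC, so a check
along these lines (a sketch only) could fail loudly after the fact, even
though it cannot undo the already-truncated row:

import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.SQLWarning;

// Sketch: turn JDBC warnings into errors after an insert/update. This detects
// a truncated value but cannot restore the data that was lost.
class TruncationWarningCheck {

   static void failOnWarning(PreparedStatement statement) throws SQLException {
      SQLWarning warning = statement.getWarnings();
      if (warning != null) {
         throw new SQLException("Statement completed with a warning, the value may"
               + " have been truncated: " + warning.getMessage());
      }
   }
}
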
>
> Yes, that rang a bell; it was MySQL then, indeed.
>
>>
>> A better option, to avoid checking ourselves, may be to check on startup
>> whether STRICT_ALL_TABLES
>> (http://dev.mysql.com/doc/refman/5.5/en/server-sql-mode.html#sqlmode_strict_all_tables)
>> is enabled, using SELECT @@SESSION.sql_mode in the MySQL implementation,
>> and refuse to start if it's not. There is another STRICT_TRANS_TABLES
>> mode, but I don't know how to find out whether a table is transactional
>> or not...
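
A minimal sketch of that startup check, assuming we already know we're
talking to MySQL:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch of the proposed MySQL-specific startup check: refuse to start unless
// a strict sql_mode is enabled, so truncation becomes an error rather than a
// silent warning.
class MySqlStrictModeCheck {

   static void requireStrictMode(Connection conn) throws SQLException {
      Statement st = conn.createStatement();
      try {
         ResultSet rs = st.executeQuery("SELECT @@SESSION.sql_mode");
         String mode = rs.next() ? rs.getString(1) : "";
         if (!mode.contains("STRICT_ALL_TABLES") && !mode.contains("STRICT_TRANS_TABLES")) {
            throw new SQLException("MySQL sql_mode is '" + mode
                  + "'; enable STRICT_ALL_TABLES so truncated values raise errors");
         }
      } finally {
         st.close();
      }
   }
}
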
>
> You can check the transactional capabilities of a table by running SHOW
> CREATE TABLE and checking which engine it's using.
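
Sketched out, and assuming the table name comes from trusted configuration,
that check might look like this:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch of the engine check via SHOW CREATE TABLE (MySQL-specific). The table
// name is assumed to come from the store configuration, i.e. to be trusted.
class MySqlEngineCheck {

   static boolean usesTransactionalEngine(Connection conn, String tableName)
         throws SQLException {
      Statement st = conn.createStatement();
      try {
         ResultSet rs = st.executeQuery("SHOW CREATE TABLE " + tableName);
         // The second column of the result holds the full CREATE TABLE
         // statement, including the ENGINE=... clause.
         return rs.next() && rs.getString(2).toUpperCase().contains("ENGINE=INNODB");
      } finally {
         st.close();
      }
   }
}
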
>
> Still, I don't think we should prevent people from shooting themselves
> in the foot; I'm not going to raise another exception
> if I don't detect a proper backup policy either, nor if
> they're using a JDBC driver with known bugs.
>
> I think what people need from us is a way to understand the size
> of what they're going to store;
> logging it as we did for ISPN-1125 is a first step, and maybe it's enough?
> Maybe it would be useful to collect the maximum sizes as well, and print
> them regularly in the logs, or expose them through MBeans?
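
A sketch of that last idea, with invented names, following the standard MBean
pattern so the two values would show up in any JMX console:

import java.util.concurrent.atomic.AtomicInteger;

// Sketch: track the largest serialized key and value seen so far, so the
// numbers can be logged periodically or exposed over JMX. All names here are
// invented for the example.
public class StoreSizeStats implements StoreSizeStatsMBean {

   private final AtomicInteger maxKeyBytes = new AtomicInteger();
   private final AtomicInteger maxValueBytes = new AtomicInteger();

   public void record(int keyBytes, int valueBytes) {
      updateMax(maxKeyBytes, keyBytes);
      updateMax(maxValueBytes, valueBytes);
   }

   private static void updateMax(AtomicInteger max, int candidate) {
      int current = max.get();
      while (candidate > current && !max.compareAndSet(current, candidate)) {
         current = max.get(); // lost a race with another writer, re-read and retry
      }
   }

   public int getMaxKeyBytes() {
      return maxKeyBytes.get();
   }

   public int getMaxValueBytes() {
      return maxValueBytes.get();
   }
}

// Standard MBean naming convention so a JMX console picks the attributes up.
interface StoreSizeStatsMBean {
   int getMaxKeyBytes();
   int getMaxValueBytes();
}
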
>

Ok, this probably won't be such a big problem with the redesigned JDBC
cache store, but with the current design you can't put a new value
without first reading the old value, and you can't read the old value
because the data has been truncated, so the only way to get out of
this mess is to delete everything from that table at a SQL prompt.

If we can make it easier for the user to recover from a truncation
then sure, just throw an exception on get; we don't want to handle
every possible configuration problem in our code.

Dan


> Cheers,
> Sanne
>

_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev
Reply | Threaded
Open this post in threaded view
|

Re: [infinispan-dev] chunking ability on the JDBC cacheloader

Manik Surtani
In reply to this post by Sanne Grinovero

On 23 May 2011, at 18:05, Sanne Grinovero wrote:

> I was more concerned about the fact that some database might not raise
> any exception? Not sure if that's the case, and possibly not our
> problem.

Yes, not sure how we can detect this.

--
Manik Surtani
[hidden email]
twitter.com/maniksurtani

Lead, Infinispan
http://www.infinispan.org



_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev
Reply | Threaded
Open this post in threaded view
|

Re: [infinispan-dev] chunking ability on the JDBC cacheloader

Manik Surtani
In reply to this post by Dan Berindei

On 24 May 2011, at 12:18, Dan Berindei wrote:

> Ok, this probably won't be such a big problem with the redesigned JDBC
> cache store, but with the current design you can't put a new value
> without first reading the old value, and you can't read the old value
> because the data has been truncated, so the only way to get out of
> this mess is to delete everything from that table at a SQL prompt.
>
> If we can make it easier for the user to recover from a truncation
> then sure, just throw an exception on get; we don't want to handle
> every possible configuration problem in our code.

Once we have ISPN-701, we could potentially add some checks for specific vendors, but this would need to be optional/configurable.
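
Purely as an illustration of the shape such an optional hook could take once
ISPN-701 tells us which vendor we're talking to (the name is invented, nothing
like it exists yet):

import java.sql.Connection;
import java.sql.SQLException;

// Illustrative only: one possible shape for optional, per-vendor sanity checks
// that are only wired in when the user enables them.
interface VendorSanityCheck {

   // Throws if the database is configured in a way known to silently
   // truncate or otherwise corrupt stored values.
   void validate(Connection connection) throws SQLException;
}
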

--
Manik Surtani
[hidden email]
twitter.com/maniksurtani

Lead, Infinispan
http://www.infinispan.org




_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev