Re: [infinispan-dev] Distributed tasks - specifying task input


Manik Surtani
I'm gonna cc infinispan-dev as well, so others can pitch in.

But yeah, good start. I think your dissection of the use cases makes sense, except that I think (a) and (c) are essentially the same; only that in the case of (c) the task ignores any data input.  This would raise the question of why would anybody do this, and what purpose does it solve. :-)  So yeah, we could support it as a part of (a), but I am less than convinced of its value.  

And as for data locality, perhaps this would work: don't define the necessary keys in the Task, but instead provide these when submitting the task. E.g.,

CacheManager.submitTask(task); // task goes everywhere, and receives an input Iterator of all local entries on each node.
CacheManager.submitTask(task, K...); // task goes to certain nodes that contain some or all of K.  Receives an input Iterator of all local entries on each node.

WDYT?

Cheers
Manik


On 15 Dec 2010, at 15:46, Vladimir Blagojevic wrote:

> Manik,
>
> Will write more towards the end if I remember something I forgot, but here is my initial feedback. I think the two of us should focus on the steps to finalize the API. I am taking Trustin's proposal as a starting point because I really like it; it is very simple but powerful. So the first thing that I want to review, something I think is missing, is that Trustin's proposal covers only one case of task input and consequently does not elaborate on how tasks are mapped to execution nodes.
>
> In order to be as clear as possible, let's go one step back. I can think of three distinct cases of input:
>
> a) all data of type T available on cluster is input for a task (covered by Trustin's proposal)
> b) subset of data of type T is an input for a task
> c) no data from caches used as input for a task
>
> For case a) we migrate/assign task units to all cache nodes; each migrated task fetches local data, avoiding double inclusion of input data, i.e. if input data D of type T is fetched on node N, we do not also fetch D's replica on node M
>
> For case b) we have to involve the CH and find out where to migrate task units so that the data fetched is local once the task units are moved to nodes N (N being a subset of the entire cluster)
>
> What to do for case c), if anything? Do we want to support this?
>
> I think Trustin made an unintentional mistake in the example below by adding task units to cm rather than to the task. I would clearly delineate the distributed task as a whole from the task units which comprise it.
>
> So instead of:
>
> // Now assemble the tasks into a larger task.
> CacheManager cm = ...;
> CoordinatedTask task = cm.newCoordinatedTask("finalOutcomeCacheName");
> cm.addLocalTask(new GridFileQuery("my_gridfs"));          // Fetch
> cm.addLocalTask(new GridFileSizeCounter());               // Map
> cm.addGlobalTask(new IntegerSummarizer("my_gridfs_size")); // Reduce
>
> we have:
>
> // Now assemble the tasks into a larger task.
> CacheManager cm = ...;
> DistributedTask task = cm.newDistributedTask("finalOutcomeCacheName",...);
> task.addTaskUnit(new GridFileQuery("my_gridfs"));          // Fetch
> task.addTaskUnit(new GridFileSizeCounter());               // Map
> task.addTaskUnit(DistributedTask.GLOBAL, new IntegerSummarizer("my_gridfs_size")); // Reduce
>
>
> So the grid file task example from wiki page falls under use case a) - we fetch all available data of type GridFile.Metadata available on the cluster. There is no need to map tasks across the cluster or consult CH - we migrate tasks to every node on the cluster and feed node local data into task units pipeline.
>
> For case b) we need to somehow indicate to DistributedTask what the input is, so the task and its units can be mapped across the cluster using the CH. My suggestion is that the CacheManager#newDistributedTask factory method has a few forms: the first, covering use case a), takes no additional parameters; another factory method, similar to my original proposal, specifies the task input in the form Map<String, Set<Object>> getRelatedKeys()
>
> So maybe we can define an interface:
>
> public interface DistributedTaskInput {
>    public Map<String, Set<Object>> getKeys();
> }
>
> and the factory method becomes:
>
> public class CacheManager {
>    ...
>    public DistributedTask newDistributedTask(String name, DistributedTaskInput input);
>    ...
> }
>
> Now we can distinguish case b) from case a). For case b), when the task is executed we first have to ask the CH where to migrate/map these tasks; everything else should be the same as in use case a), at least when it comes to the plumbing underneath. Of course, the task implementer is completely shielded from this. Also, there is no need for the task implementer to do the equivalent of GridFileQuery for case b). Since the input is already explicitly specified using newDistributedTask, we need to somehow set up EntryStreamProcessorContext with the local input already set and pass it to the pipeline.
>
> WDYT,
> Vladimir

--
Manik Surtani
[hidden email]
twitter.com/maniksurtani

Lead, Infinispan
http://www.infinispan.org




_______________________________________________
infinispan-dev mailing list
[hidden email]
https://lists.jboss.org/mailman/listinfo/infinispan-dev

Re: [infinispan-dev] Distributed tasks - specifying task input

Vladimir Blagojevic
On 10-12-15 3:03 PM, Manik Surtani wrote:
I'm gonna cc infinispan-dev as well, so others can pitch in.

But yeah, good start. I think your dissection of the use cases makes sense, except that I think (a) and (c) are essentially the same; only that in the case of (c) the task ignores any data input.  This would raise the question of why would anybody do this, and what purpose does it solve. :-)  So yeah, we could support it as a part of (a), but I am less than convinced of its value.  

Ok, no problem. We just have to somehow bootstrap EntryStreamProcessorContext with no cache access for case c).

And as for data locality, perhaps this would work: don't define the necessary keys in the Task, but instead provide these when submitting the task. E.g.,

CacheManager.submitTask(task); // task goes everywhere, and receives an input Iterator of all local entries on each node.
CacheManager.submitTask(task, K...); // task goes to certain nodes that contain some or all of K.  Receives an input Iterator of all local entries on each node.

WDYT?

Yes, I agree, except I'd rather not submit the task using the CacheManager API. I'd rather just obtain a DistributedTask through the CacheManager and use task#execute. That way we do not pollute the CacheManager API.


CacheManager cm = ...;
DistributedTask task = cm.newDistributedTask(...);
task.execute();
task.execute(K...);
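As a rough sketch of how those two execute() forms could decide where the task goes (the class name and the owner lookup below are illustrative stand-ins, not the real API or the real consistent hash):

```java
import java.util.*;

public class DistributedTaskSketch {
    private final Set<String> allNodes;
    private final Map<Object, String> ownerByKey; // stand-in for consistent hash lookups

    public DistributedTaskSketch(Set<String> allNodes, Map<Object, String> ownerByKey) {
        this.allNodes = allNodes;
        this.ownerByKey = ownerByKey;
    }

    // execute(): the task goes everywhere
    public Set<String> targetNodes() {
        return allNodes;
    }

    // execute(K...): the task goes only to the nodes owning some of the keys
    public Set<String> targetNodes(Object... keys) {
        Set<String> targets = new HashSet<>();
        for (Object k : keys)
            targets.add(ownerByKey.get(k));
        return targets;
    }
}
```

With keys supplied, only the owning nodes are targeted; without keys, the task is broadcast to the whole cluster.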


Ok, if we agree on this what else remains to be resolved? What did you have in mind for further discussion?

Cheers,
Vladimir






Re: [infinispan-dev] Distributed tasks - specifying task input

Manik Surtani

On 15 Dec 2010, at 18:43, Vladimir Blagojevic wrote:

On 10-12-15 3:03 PM, Manik Surtani wrote:
I'm gonna cc infinispan-dev as well, so others can pitch in.

But yeah, good start. I think your dissection of the use cases makes sense, except that I think (a) and (c) are essentially the same; only that in the case of (c) the task ignores any data input.  This would raise the question of why would anybody do this, and what purpose does it solve. :-)  So yeah, we could support it as a part of (a), but I am less than convinced of its value.  

Ok, no problem. We just have to somehow bootstrap EntryStreamProcessorContext with no cache access for case c).

Well, I reckon we provide the ref to the cache.  If the impl doesn't make use of it, too bad.


And as for data locality, perhaps this would work: don't define the necessary keys in the Task, but instead provide these when submitting the task. E.g.,

CacheManager.submitTask(task); // task goes everywhere, and receives an input Iterator of all local entries on each node.
CacheManager.submitTask(task, K...); // task goes to certain nodes that contain some or all of K.  Receives an input Iterator of all local entries on each node.

WDYT?

Yes, I agree, except I'd rather not submit the task using the CacheManager API. I'd rather just obtain a DistributedTask through the CacheManager and use task#execute. That way we do not pollute the CacheManager API.


CacheManager cm = ...;
DistributedTask task = cm.newDistributedTask(...);
task.execute();
task.execute(K...);

+1.

Ok, if we agree on this what else remains to be resolved? What did you have in mind for further discussion?

Just to finalise the API.  Would this be specific to a cache or a cache manager?  E.g., should it be Cache.newDistributedTask()?

Cheers
Manik
--



Re: [infinispan-dev] Distributed tasks - specifying task input

Vladimir Blagojevic
On 10-12-15 4:00 PM, Manik Surtani wrote:

On 15 Dec 2010, at 18:43, Vladimir Blagojevic wrote:

On 10-12-15 3:03 PM, Manik Surtani wrote:
I'm gonna cc infinispan-dev as well, so others can pitch in.

But yeah, good start. I think your dissection of the use cases makes sense, except that I think (a) and (c) are essentially the same; only that in the case of (c) the task ignores any data input.  This would raise the question of why would anybody do this, and what purpose does it solve. :-)  So yeah, we could support it as a part of (a), but I am less than convinced of its value.  

Ok, no problem. We just have to somehow bootstrap EntryStreamProcessorContext with no cache access for case c).

Well, I reckon we provide the ref to the cache.  If the impl doesn't make use of it, too bad.

Agreed.

Just to finalise the API.  Would this be specific to a cache or a cache manager?  E.g., should it be Cache.newDistributedTask()?

I think it makes more sense on CacheManager because we might have a task operating on data of a few caches, not just one. So to be safe and not sorry I'd go with CacheManager. newDistributedTask can have parameters to specify which cache(s) to use.

WDYT?

Cheers




Re: [infinispan-dev] Distributed tasks - specifying task input

Manik Surtani

On 15 Dec 2010, at 22:46, Vladimir Blagojevic wrote:

On 10-12-15 4:00 PM, Manik Surtani wrote:

On 15 Dec 2010, at 18:43, Vladimir Blagojevic wrote:

On 10-12-15 3:03 PM, Manik Surtani wrote:
I'm gonna cc infinispan-dev as well, so others can pitch in.

But yeah, good start. I think your dissection of the use cases makes sense, except that I think (a) and (c) are essentially the same; only that in the case of (c) the task ignores any data input.  This would raise the question of why would anybody do this, and what purpose does it solve. :-)  So yeah, we could support it as a part of (a), but I am less than convinced of its value.  

Ok, no problem. We just have to somehow bootstrap EntryStreamProcessorContext with no cache access for case c).

Well, I reckon we provide the ref to the cache.  If the impl doesn't make use of it, too bad.

Agreed.

Just to finalise the API.  Would this be specific to a cache or a cache manager?  E.g., should it be Cache.newDistributedTask()?

I think it makes more sense on CacheManager because we might have a task operating on data of a few caches, not just one. So to be safe and not sorry I'd go with CacheManager. newDistributedTask can have parameters to specify which cache(s) to use.

So you'd do CacheManager.newDistributedTask(task, Map<String, K> cacheNamesAndKeys)?




Re: [infinispan-dev] Distributed tasks - specifying task input

Vladimir Blagojevic
On 10-12-16 7:25 AM, Manik Surtani wrote:
>
> I think it makes more sense on CacheManager because we might have a
> task operating on data of a few caches, not just one. So to be safe
> and not sorry I'd go with CacheManager. newDistributedTask can have
> parameters to specify which cache(s) to use.
>
> So you'd do CacheManager.newDistributedTask(task, Map<String, K>
> cacheNamesAndKeys)?
>

No, I'd shift these parameters to DistributedTask#execute unless we absolutely need them at task creation. Let's keep the factory method for DistributedTask as simple as possible. I'd rather have a single factory method and develop and grow the DistributedTask API independently of CacheManager than have multiple overloaded factory methods for DistributedTask.



Re: [infinispan-dev] Distributed tasks - specifying task input

Manik Surtani

On 16 Dec 2010, at 11:34, Vladimir Blagojevic wrote:

On 10-12-16 7:25 AM, Manik Surtani wrote:

I think it makes more sense on CacheManager because we might have a task operating on data of a few caches, not just one. So to be safe and not sorry I'd go with CacheManager. newDistributedTask can have parameters to specify which cache(s) to use.

So you'd do CacheManager.newDistributedTask(task, Map<String, K> cacheNamesAndKeys)?


No, I'd shift these parameters to DistributedTask#execute unless we absolutely need them at task creation. Let's keep the factory method for DistributedTask as simple as possible. I'd rather have a single factory method and develop and grow the DistributedTask API independently of CacheManager than have multiple overloaded factory methods for DistributedTask.

Hmm.  Maybe it is better to not involve an API on the CacheManager at all.  Following JSR166y [1], we could do:

DistributedForkJoinPool p = DistributedForkJoinPool.newPool(cache); // I still think it should be on a per-cache basis

DistributedTask<MyResultType, K, V> dt = new DistributedTask<MyResultType, K, V>() {
    
    public void map(Map.Entry<K, V> entry, Map<K, V> context) {
        // select the entries you are interested in.  Transform if needed and store in context
    }

    public MyResultType reduce(Map<Address, Map<K, V>> contexts) {
        // aggregate from context and return value.
    };

};

MyResultType result = p.invoke(dt, key1, key2, key3); // keys are optional.

What I see happening is:

* dt is broadcast to all nodes that hold any of {key1, key2, key3}.  If keys are not provided, broadcast to all.
* dt.map() is called on each node, for each key specified (if it exists on the local node).
* Contexts are sent back to the calling node and are passed to dt.reduce()
* Result of dt.reduce() passed to the caller of p.invoke()
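The flow above can be simulated in a single JVM roughly like this (a toy sketch: String stands in for Address, and the map/reduce bodies are made up purely for illustration):

```java
import java.util.*;

public class MapReduceSketch {
    // "map": called per local entry; selects/transforms data into the context
    static void map(Map.Entry<String, Integer> e, Map<String, Integer> context) {
        if (e.getValue() > 0) context.put(e.getKey(), e.getValue());
    }

    // "reduce": aggregates the per-node contexts on the calling node
    static int reduce(Map<String, Map<String, Integer>> contexts) {
        return contexts.values().stream()
                .flatMap(m -> m.values().stream())
                .mapToInt(Integer::intValue).sum();
    }

    public static int run(Map<String, Map<String, Integer>> dataPerNode) {
        Map<String, Map<String, Integer>> contexts = new HashMap<>();
        dataPerNode.forEach((node, localData) -> {            // dt broadcast to each node
            Map<String, Integer> ctx = new HashMap<>();
            localData.entrySet().forEach(e -> map(e, ctx));   // map() per local entry
            contexts.put(node, ctx);                          // context shipped back to caller
        });
        return reduce(contexts);                              // reduce on the calling node
    }
}
```

Here each node keeps only the positive values, and the caller sums what comes back.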

What do you think? 


--



Re: [infinispan-dev] Distributed tasks - specifying task input

Manik Surtani

On 17 Dec 2010, at 11:41, Vladimir Blagojevic wrote:

Even better this way. I would like to hear more about your reasoning behind using DT on a per-cache basis. Yes, it would be a simpler and easier API for the users, but we do not cover use cases where distributed task execution needs access to more than one cache during its execution....

I was just wondering whether such a use case exists or whether we're just inventing stuff.  :-)  It would lead to a much more cumbersome API since you'd need to provide a map of cache names to keys, etc.


On 10-12-16 9:07 AM, Manik Surtani wrote:

Hmm.  Maybe it is better to not involve an API on the CacheManager at all.  Following JSR166y [1], we could do:

DistributedForkJoinPool p = DistributedForkJoinPool.newPool(cache); // I still think it should be on a per-cache basis

DistributedTask<MyResultType, K, V> dt = new DistributedTask<MyResultType, K, V>() {
    
    public void map(Map.Entry<K, V> entry, Map<K, V> context) {
        // select the entries you are interested in.  Transform if needed and store in context
    }

    public MyResultType reduce(Map<Address, Map<K, V>> contexts) {
        // aggregate from context and return value.
    };

};

MyResultType result = p.invoke(dt, key1, key2, key3); // keys are optional.

What I see happening is:

* dt is broadcast to all nodes that hold any of {key1, key2, key3}.  If keys are not provided, broadcast to all.
* dt.map() is called on each node, for each key specified (if it exists on the local node).
* Contexts are sent back to the calling node and are passed to dt.reduce()
* Result of dt.reduce() passed to the caller of p.invoke()

What do you think? 


--






Re: [infinispan-dev] Distributed tasks - specifying task input

Sanne Grinovero

I would like to try it in combination with the Lucene Directory, which uses two or three different caches per index.

Even better this way. I would like to hear more about your reasoning behind using DT on a per-cache basis. Yes, it would be a simpler and easier API for the users, but we do not cover use cases where distributed task execution needs access to more than one cache during its execution....

I was just wondering whether such a use case exists or whether we're just inventing stuff.  :-)  It would lead to a much more cumbersome API since you'd need to provide a map of cache names to keys, etc.






Re: [infinispan-dev] Distributed tasks - specifying task input

Vladimir Blagojevic
On 10-12-17 8:42 AM, Manik Surtani wrote:
>
> I was just wondering whether such a use case exists or whether we're
> just inventing stuff.  :-)  It would lead to a much more cumbersome
> API since you'd need to provide a map of cache names to keys, etc.

Ok, agreed. Unless there is a way to solve this nicely, let's go with cache-based DT. I'll think more about this over the weekend.

Also, why would you call DT#map for each key specified? I am thinking it should be called once when the task is migrated to the execution node; the context should be filled with entries on that cache node and passed to map. So call map once per node with all keys already in the context.

I'd also try to use our own interfaces here rather than pure Map so we
can evolve them if we need to - DistributedTaskContext or whatever.....


Re: [infinispan-dev] Distributed tasks - specifying task input

Manik Surtani

On 17 Dec 2010, at 11:56, Vladimir Blagojevic wrote:

> On 10-12-17 8:42 AM, Manik Surtani wrote:
>>
>> I was just wondering whether such a use case exists or whether we're
>> just inventing stuff.  :-)  It would lead to a much more cumbersome
>> API since you'd need to provide a map of cache names to keys, etc.
>
> Ok, agreed. Unless there is a way to solve this nicely, let's go with cache-based DT. I'll think more about this over the weekend.

Hmm, just saw Sanne's message.  Ok, so if we were to support multiple caches... how could this look?

interface DistributedTask<K, V, R> {

   Collection<String> cacheNames();

   Collection<K> keys(String cache);

   void map(String cacheName, Map.Entry<K, V> e, DistributedTaskContext<K, V, R> ctx);

   R reduce(Map<Address, DistributedTaskContext<K, V, R>> contexts);

}

interface DistributedTaskContext<K, V, R> {
   
   String getCacheName();
 
   // ... map-like methods on K, V
}

DistributedForkJoinPool p = cacheManager.newDistributedForkJoinPool();

result = p.invoke(new DistributedTask() { ... });

and let the DistributedTask impl provide the names of caches and keys as needed?  Some semantics - a null collection of cache names = all caches, a null collection of keys = all keys?  Or is that a bit shit?  :-)
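To see the shape in action, here is a toy single-JVM walk-through of that interface, including the proposed null conventions (the types below are stand-ins for the sketched DistributedTask/DistributedTaskContext, with String playing the role of Address; none of this is real Infinispan API):

```java
import java.util.*;

public class MultiCacheSketch {
    public interface TaskContext<K, V> { Map<K, V> data(); }

    public interface Task<K, V, R> {
        Collection<String> cacheNames();                 // null => all caches
        Collection<K> keys(String cache);                // null => all keys
        void map(String cacheName, Map.Entry<K, V> e, TaskContext<K, V> ctx);
        R reduce(Map<String, TaskContext<K, V>> contexts);
    }

    // Drive the task against in-memory "caches", all hosted on one pretend node.
    public static <K, V, R> R invoke(Map<String, Map<K, V>> caches, Task<K, V, R> t) {
        Map<K, V> data = new HashMap<>();
        TaskContext<K, V> ctx = () -> data;
        Collection<String> names = t.cacheNames() == null ? caches.keySet() : t.cacheNames();
        for (String name : names) {
            Collection<K> keys = t.keys(name);
            for (Map.Entry<K, V> e : caches.get(name).entrySet())
                if (keys == null || keys.contains(e.getKey()))
                    t.map(name, e, ctx);                 // map() per matching local entry
        }
        return t.reduce(Map.of("local-node", ctx));      // reduce over per-node contexts
    }
}
```

A task returning null from both cacheNames() and keys() then visits every entry of every cache, which is the semantics being proposed.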

Cheers
Manik

>
> Also, why would you call DT#map for each key specified? I am thinking it
> should be called once when task is migrated to execution node, context
> should be filled with entries on that cache node and passed to map. So
> call map once per node with all keys already in context.

Just saves the implementation having to iterate over the collection.  We do the iterating and callback each time.  :-)  Minor point.

> I'd also try to use our own interfaces here rather than pure Map so we
> can evolve them if we need to - DistributedTaskContext or whatever.....

+1.

--
Manik Surtani
[hidden email]
twitter.com/maniksurtani

Lead, Infinispan
http://www.infinispan.org





Re: [infinispan-dev] Distributed tasks - specifying task input

Sanne Grinovero
About:

void map(String cacheName, Map.Entry<K, V> e, DistributedTaskContext<K, V, R> ctx);

To satisfy a Lucene query it should be able to read at least multiple caches at the same time.

It would still do map/reduce on a single cache; it's just that I need to be able to invoke getCache() during both the map and reduce operations. You might think that would defeat the purpose, but the situation is that one cache is distributed and the other replicated, so I won't do remote get operations, as we obviously use the keys of the distributed one.

So the API seems fine - I hope, as I don't have a prototype - if there is some means to have a reference to the EmbeddedCacheManager.

Wouldn't it be more useful to have a reference to the Cache object instead of its name?

Finally, I would consider it very useful to be able to declare usage of keys which don't exist yet; I hope that's not a problem? In the case of Lucene I might know that I'm going to write and rewrite a file called "A" several times, so I send the operation to the node which is going to host A, though this key A might exist already or be created by the task during the process.

Sanne



Re: [infinispan-dev] Distributed tasks - specifying task input

Manik Surtani

On 17 Dec 2010, at 14:58, Sanne Grinovero wrote:

> About:
>
> void map(String cacheName, Map.Entry<K, V> e, DistributedTaskContext<K, V, R> ctx);
>
> To satisfy a Lucene query it should be able to read at least multiple
> caches at the same time.
>
> It would still do map/reduce on a single cache; it's just that I need to be able
> to invoke getCache() during both the map and reduce operations. You might
> think that would defeat the purpose, but the situation is that one cache is
> distributed and the other replicated, so I won't do remote get operations,
> as we obviously use the keys of the distributed one.
>
> So the API seems fine - I hope, as I don't have a prototype - if there
> is some means to have a reference to the EmbeddedCacheManager.
>
> Wouldn't it be more useful to have a reference to the Cache object
> instead of its name?

Makes sense.  Perhaps we could pass in a Set<Cache> of all caches declared as required in cacheNames().

> Finally, I might consider it very useful to be able to declare usage
> on keys which don't exist yet, hope that's not a problem?
> So in case of Lucene I might know that I'm going to write and rewrite
> several times a file called "A",
> so I send the operation to the node which is going to host A, though
> this key A might exist, or be created
> by the task during the process.

Should be fine, since the CH works on non-existent keys just as well.  But then it would mean that we can't call map() for each entry if the entry doesn't yet exist.  Perhaps then, to satisfy this case, Vladimir's suggestion of calling map() once per node makes more sense.  E.g.,

        void map(Set<Cache<K, V>> caches, DistributedTaskContext<K, V, R> ctx);

Cheers
Manik

--
Manik Surtani
[hidden email]
twitter.com/maniksurtani

Lead, Infinispan
http://www.infinispan.org





Re: [infinispan-dev] Distributed tasks - specifying task input

Vladimir Blagojevic
On 10-12-17 12:10 PM, Manik Surtani wrote:

>> Finally, I might consider it very useful to be able to declare usage
>> on keys which don't exist yet, hope that's not a problem?
>> So in case of Lucene I might know that I'm going to write and rewrite
>> several times a file called "A",
>> so I send the operation to the node which is going to host A, though
>> this key A might exist, or be created
>> by the task during the process.
> Should be fine, since the CH works on non-existent keys just as well.  But then it would mean that we can't call map() for each entry if the entry doesn't as yet exist.  Perhaps then to satisfy this case, Vladimir's suggestion of calling map() once per node makes more sense.  E.g.,
>
> void map(Set<Cache<K, V>>  caches, DistributedTaskContext<K, V, R>  ctx);
>
>

Cool. Except I would possibly "stuff" Set<Cache<K,V>> into the DistributedTaskContext, since the caches may be needed for the reduce call as well. In order to use caches in the reduce phase, users might be tempted to "save" a reference to these caches in a field of the DistributedTask implementation. That would create problems as we migrate instances of DT across JVMs. By exposing Set<Cache<K,V>> through the context, we better signal to users the general life cycle and the proper use of these references.
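As a sketch of that life cycle (all hypothetical types, with Set<Map<K, V>> standing in for Set<Cache<K, V>>): the task is Serializable and holds no cache references of its own; the caches arrive only through the context supplied on the execution node, so migrating the task across JVMs stays safe:

```java
import java.io.*;
import java.util.*;

public class ContextSketch {
    // Stand-in for a DistributedTaskContext carrying Set<Cache<K, V>>.
    public interface TaskContext<K, V> {
        Set<Map<K, V>> caches();
    }

    // The task carries only its own serializable state, never a cache reference.
    public static class CountTask implements Serializable {
        final String label;
        public CountTask(String label) { this.label = label; }

        public int reduce(TaskContext<String, Integer> ctx) {
            // caches are always fetched from the context, not from a stored field
            return ctx.caches().stream().mapToInt(Map::size).sum();
        }
    }

    // Round-trip the task through serialization, as a migration across JVMs would.
    public static CountTask migrate(CountTask t) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new ObjectOutputStream(bos).writeObject(t);
            return (CountTask) new ObjectInputStream(
                    new ByteArrayInputStream(bos.toByteArray())).readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new RuntimeException(e);
        }
    }
}
```

If the task instead cached a cache reference in a field, serialization would fail or the deserialized copy would point at a stale object, which is exactly the problem described above.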

Cheers

Re: [infinispan-dev] Distributed tasks - specifying task input

Vladimir Blagojevic
On 10-12-17 8:56 AM, Vladimir Blagojevic wrote:
> Also, why would you call DT#map for each key specified? I am thinking it
> should be called once when task is migrated to execution node, context
> should be filled with entries on that cache node and passed to map. So
> call map once per node with all keys already in context.
I might have caused confusion with the statements above! I believe Sanne and Manik thought of map as an explicit instruction for where to migrate execution tasks?