ParallelContext
objref pc
pc = new ParallelContext()
pc = new ParallelContext(nhost)
The simplest form of parallelization of a loop, from the user's point of view, is
func f() { // a function with no context that *CHANGES*
    return $1*$1 // except its argument
}
objref pc
pc = new ParallelContext()
pc.runworker() // master returns immediately, workers in
               // infinite loop
s = 0
if (pc.nhost == 0) { // use the serial form
    for i=1, 20 {
        s += f(i)
    }
}else{ // use the bulletin board form
    for i=1, 20 { // scatter processes
        pc.submit("f", i) // any context needed by f had better be
    }                     // the same on all hosts
    while (pc.working) { // gather results
        s += pc.retval // the return value for the executed function
    }
}
print s
pc.done // wait for workers to finish printing
Several things need to be highlighted:
If a given task submits other tasks, only those child tasks will be gathered by the working loop for that given task. At this time the system groups tasks according to the parent task and the pc instance is not used. See submit for further discussion of this limitation. The safe strategy is always to use the idiom:
for i = 1, n { pc.submit(...) }  // scatter a set of tasks
while (pc.working) { ... }       // gather them all
Earlier submitted tasks tend to complete before later submitted tasks, even if they submit tasks themselves. That is, a submitted task has the same general priority as its parent task, and the specific priority of tasks with the same parent is the submission order. A free cpu always works on the next unexecuted task with the highest priority.
Each task manages a separate group of submissions whose results are returned only to that task. Therefore you can submit tasks which themselves submit tasks.
The pc.working call checks to see if a result is ready. If so, it returns the unique system-generated task id (a positive integer), and the return value of the task function is accessed via the pc.retval function. The arguments to the function executed by the submit call are also available. If all submissions have been computed and all results have been returned, pc.working returns 0. If results are pending, working executes tasks from ANY ParallelContext until a result is ready. This last feature keeps cpus busy but places stringent requirements on how the user may change global context without introducing bugs. See the discussion in working.
ParallelContext.working may not return results in the order of submission.
Hoc code subsequent to pc.runworker() is executed only by the master since that call returns immediately if the process is the master and otherwise starts an infinite loop on each worker which requests and executes submit tasks from ANY ParallelContext instance. This is the standard way to seed the bulletin board with submissions. Note that workers may also execute tasks that themselves cause submissions. If subsidiary tasks call pc.runworker, the call returns immediately. Otherwise the task it is working on would never complete! The pc.runworker() function is also called for each worker after all hoc files are read in and executed.
The basic organization of a simulation is:

//setup which is exactly the same on every machine.
// ie declaration of all functions, procedures, setup of neurons
pc.runworker() // to start the execute loop if this machine is a worker
// the master scatters tasks onto the bulletin board and gathers results
pc.done()

Issues having to do with context can become quite complex. Context transfer from one machine to another should be as small as possible. Don't fall into the trap of a context transfer which takes longer than the computation itself. Remember, you can do thousands of c statements in the time it takes to transfer a few doubles. Also, with a single cpu, it is often the case that statements can be moved out of an innermost loop, but can't be in a parallel computation. E.g.

for i = 1, 20 {
    forall gnabar_hh = g[i]
    for j = 1, 5 {
        stim.amp = s[j]
        run()
    }
}

i.e. we only need to set gnabar_hh 20 times. But the first pass at parallelization would look like:

for i = 1, 20 {
    for j = 1, 5 {
        sprint(cmd, "forall gnabar_hh = g[%d]  stim.amp = s[%d]  run()\n", i, j)
        pc.submit(cmd)
    }
}
while (pc.working) { }

and not only do we take the hit of repeated evaluation of gnabar_hh but the statement must be interpreted each time. A run must be quite lengthy to amortize this overhead.
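A possible alternative, sketched here under assumptions, is to submit a function rather than an interpreted statement, passing the varying values as copied arguments. This does not avoid setting gnabar_hh in every task, but it avoids re-interpreting a command string for each run and removes the need for the g and s arrays to exist on the workers. The func prun below is hypothetical, and stim and a meaningful return value are assumed to be part of the common setup on every host:

func prun() { // $1 = gnabar value, $2 = stimulus amplitude (hypothetical helper)
    forall gnabar_hh = $1
    stim.amp = $2 // assumes stim exists identically on every host
    run()
    return 0 // return whatever scalar measure of the run is of interest
}
for i = 1, 20 {
    for j = 1, 5 {
        pc.submit("prun", g[i], s[j]) // values travel as copied arguments
    }
}
while (pc.working) { } // gather (and discard) the results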
The PVM (parallel virtual machine) should be set up so that it allows execution on all hosts of the csh script $NEURONHOME/bin/bbsworker.sh. The simulation hoc files should be available on each machine with the same relative path with respect to the user's $HOME directory. For example, I start my 3 machine pvm with the command

pvm hineshf

where hineshf is a text file with the contents:

hines ep=$HOME/nrn/bin
spine ep=$HOME/nrn/bin
pinky ep=$HOME/nrn/bin

Again, the purpose of the ep=$HOME/nrn/bin tokens is to specify the path to find bbsworker.sh.

A simulation is started by moving to the proper working directory (which should be a descendant of your $HOME directory) and launching neuron as in

special init.hoc

The exact same hoc files should exist in the same relative locations on all host machines.
ParallelContext.nhost
n = pc.nhost()
if (pc.nhost == 0) {
    for i=1, 20 {
        print i, sin(i)
    }
}else{
    for i=1, 20 {
        pc.submit(i, "sin", i)
    }
    while (pc.working) {
        print pc.userid, pc.retval
    }
}
ParallelContext.submit
pc.submit("statement\n")
pc.submit("function_name", arg1, ...)
pc.submit(object, "function_name", arg1, ...)
pc.submit(userid, ..as above..)
If the first argument to submit is a non-negative integer, then args are not saved, and when the id for this task is returned by working, that non-negative integer can be retrieved with userid.
If there is no explicit userid, then the args (after the function name) are saved locally and can be unpacked when the corresponding working call returns. A local userid (unique only for this ParallelContext) is generated and returned by the submit call and is also retrieved with userid when the corresponding working call returns. This is very useful in associating a particular parameter vector with its return value and avoids the necessity of explicitly saving them or posting them. If they are not needed and you do not wish to pay the overhead of storage, supply an explicit userid. Unpacking args must be done in the same order and have the same type as the args of the "function_name". They do not have to be unpacked. Saving args is time efficient since it does not imply extra communication with the server.
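For instance, a parameter Vector passed to submit without an explicit userid can be recovered, together with the function's return value, when the corresponding working call returns. A minimal sketch, assuming a user-defined func cost that takes a Vector and returns a scalar, and that pc.runworker() has already been called:

objref pvec
for i = 1, 10 {
    pvec = new Vector(3)
    pvec.fill(i) // i-th parameter set (illustrative values)
    pc.submit("cost", pvec) // no explicit userid, so the args are saved
}
while ((id = pc.working) != 0) {
    err = pc.retval // return value of cost() for this task
    pvec = pc.upkvec() // unpack the saved Vector argument of the same task
    print id, pvec.x[0], err
}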
The argument form causes function_name(copyofarg1, ...) to execute on some indeterminate host in the PVM. Args must be scalars, strings, or Vectors. Note that they are *COPIES* so that even string and Vector arguments are call by value and not call by reference. (This is different from the normal semantics of a direct function call). In this case efficiency was chosen at the expense of pedantic consistency since it is expected that in most cases the user does not need the return copy. In the event more than a single scalar return value is required use post within the function_name body with a key equal to the id of the task. For example:
func function_name() { local id
    id = hoc_ac_      // the task id
    $o1.reverse()
    pc.post(id, $o1)  // post the Vector result keyed by the task id
    return 0          // submitted functions return a scalar (see retval)
}
...
while( (id = pc.working) != 0) {
    pc.take(id)       // retrieve the result posted under this task id
    pc.upkvec.printf
}

The object form executes the function_name(copyofarg1, ...) in the context of the object. IT MUST BE THE CASE that the string

print object

identifies the "same" object on the host executing the function as on the host that submitted the task. This is guaranteed only if all hosts, when they start up, execute the same code that creates these objects. If you start creating these objects after the worker code diverges from the master (after pc.runworker) you really have to know what you are doing and the bugs will be VERY subtle.
A task should gather the results of all the tasks it submits before scattering other tasks, even if scattering with different ParallelContext instances. This is because results are grouped by parent task id instead of (parent task id, pc instance). Thus the following idiom needs extra user defined info to distinguish between pc1 and pc2 task results:
for i=1,10 pc1.submit(...)
for i=1,10 pc2.submit(...)
for i=1,10 { pc1.working() ... }
for i=1,10 { pc2.working() ... }

since pc1.working may get a result from a pc2 submission. If this behavior is at all inconvenient, I will change the semantics so that pc1 results only are gathered by pc1.working calls and by no others.
Searching for the proper object context (pc.submit(object, ...)) on the host executing the submitted task is linear in the number of objects of that type.
ParallelContext.working
id = pc.working()
While there are pending submissions and results are not ready, pending submissions from any ParallelContext from any host are calculated. Note that returns of completed submissions are not necessarily in the order that they were made by pc.submit.
while ((id = pc.working) > 0) { // gather results of previous pc.submit calls
    print id, pc.retval
}
Note that if the submission did not have an explicit userid then all the arguments of the executed function may be unpacked.
It is essential to emphasize that when a task calls pc.working, while it is waiting for a result, it may execute any number of other tasks and unless care is taken to understand the meaning of "task context" and guarantee that context after the working call is the same as the context before the working call, SUBTLE ERRORS WILL HAPPEN more or less frequently and indeterminately. For example consider the following:
func f() {
    ... write some values to some global variables ...
    pc.submit("g", ...) // when g is executed on another host it will not in general
                        // see the same global variable values you set above.
    pc.working()        // get back result of execution of g(...)
    // now the global variables may be different than what you
    // set above. And not because g changes them but perhaps
    // because the host executing this task started executing
    // another task that called f which then wrote DIFFERENT values
    // to these global variables.
}

I only know one way around this problem. Perhaps there are other and better ways.
func f() { local id
    id = hoc_ac_
    ... write some values to some global variables ...
    pc.post(id, the, global, variables)
    pc.submit("g", ...)
    pc.working()
    pc.take(id) // unpack the info back into the global variables
    ...
}
ParallelContext.retval
scalar = pc.retval()
ParallelContext.userid
scalar = pc.userid()
See submit with regard to retrieving the original arguments of the submit call corresponding to the working return.
Can be useful in organizing results according to an index defined during submission.
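For example, an explicit userid can serve directly as an index into a results Vector. A minimal sketch (the use of sin as the task function is only illustrative):

objref results
results = new Vector(20)
for i = 0, 19 {
    pc.submit(i, "sin", i*0.1) // explicit userid i, so args are not saved
}
while (pc.working) {
    results.x[pc.userid] = pc.retval
}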
ParallelContext.runworker
pc.runworker()
The basic style is that the master and each host execute the same code up to the pc.runworker call, and that code sets up all the context that is required to be identical on all hosts so that any host can run any task whenever the host requests something to do. The latter takes place in the runworker loop and when a task is waiting for a result in a working call. Many parallel processing bugs are due to inconsistent context among hosts and those bugs can be VERY subtle. Tasks should not change the context required by other tasks without extreme caution. The only way I know how to do this safely is to store and retrieve a copy of the authoritative context on the bulletin board. See working for further discussion in this regard.
The runworker method is called automatically for each worker after all files have been read in and executed --- i.e. if the user never calls it explicitly from hoc. Otherwise the workers would exit, since the standard input is at the end of file for workers. This is useful in those cases where the only distinction between master and workers is the code executed from the gui or console.
ParallelContext.done
pc.done()
ParallelContext.context
pc.context("statement\n")
pc.context("function_name", arg1, ...])
pc.context(object, "function_name", arg1, ...)
pc.context(userid, ..as above..)
There is no return in the sense that working does not return when one of these tasks completes.
This method was introduced with the following protocol in mind
proc save_context() { // executed on master
    sprint(tstr, "%s", this)
    pc.look_take(tstr) // remove previous context if it exists
    // master packs a possibly complicated context from within
    // an object whose counterpart exists on all workers
    pc.post(tstr)
    pc.context(this, "restore_context", tstr) // all workers do this
}
proc restore_context() {
    pc.look($s1) // don't remove! Others need it as well.
    // worker unpacks possibly complicated context
}
This method was introduced in an attempt to get a parallel multiple run fitter which worked in an interactive gui setting. As such it increases safety but is not bulletproof since there is no guarantee that the user doesn't change a global variable that is not part of the fitter. It is also difficult to write safe code that invariably makes all the relevant worker context identical to the master. An example of a common bug is to remove a parameter from the parameter list and then call save_context(). Sure enough, the multiple run fitters on all the workers will no longer use that parameter, but the global variables that depend on the parameter may be different on different hosts and they will now stay different! One fix is to call save_context() before the removal of the parameter from the list and save_context() after its removal. But the inefficiency is upsetting. We need a better automatic mirroring method.
ParallelContext.post
pc.post(key)
pc.post(key, ...)
Later unpacking of the message body must be done in the same order as this posting sequence.
ParallelContext.take
pc.take(key)
pc.take(key, ...)
Scalar arguments following the key must be pointers, e.g. &x. Unpacked Vectors will be resized to the correct size of the vector item of the message.
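A minimal sketch of a matching post/take pair (the key "params" and the values are purely illustrative); scalars are unpacked through pointers and the Vector is resized as described above:

objref v1, v2
strdef s
v1 = new Vector(5)
v1.fill(2)
pc.post("params", 1.5, "label", v1) // scalar, string, Vector, in that order
// ... later, possibly in a different task ...
x = 0
v2 = new Vector()
pc.take("params", &x, s, v2) // unpack in the same order and with the same types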
ParallelContext.look
boolean = pc.look(key)
boolean = pc.look(key, ...)
ParallelContext.look_take
boolean = pc.look_take(key, ...)
Note that a look followed by a take is *NOT* equivalent to look_take. It can easily occur that another task might take the message between the look and take and the latter will then block until some other process posts a message with the same key.
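A small sketch of the difference (the key "flag" is illustrative): look reads the message and leaves it posted, while look_take removes it atomically.

pc.post("flag", 3.14)
x = 0
if (pc.look("flag", &x)) { // the message stays on the bulletin board
    print "look: ", x
}
if (pc.look_take("flag", &x)) { // the message is removed atomically
    print "look_take: ", x
}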
ParallelContext.pack
pc.pack(...)
ParallelContext.unpack
pc.unpack(...)
Scalar arguments must be pointers, e.g. &soma.gnabar_hh(.3). Items must be unpacked in the same order and with the same types as they were packed.
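A minimal sketch of building a message incrementally with pack and retrieving it with unpack (the key "bundle" and the values are illustrative): pack appends items to the send buffer, post with only a key posts that buffer, and the receiver can then take the key and unpack in the packing order.

objref vsend, vrecv
strdef s
vsend = new Vector(10)
vsend.indgen() // 0, 1, ..., 9
pc.pack(6.02e23)
pc.pack("a label")
pc.pack(vsend)
pc.post("bundle") // post the previously packed buffer under this key
// ... in the task that retrieves it ...
x = 0
vrecv = new Vector()
pc.take("bundle")
pc.unpack(&x, s, vrecv) // unpack in the packing order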
ParallelContext.upkscalar
x = pc.upkscalar()
ParallelContext.upkstr
str = pc.upkstr(str)
ParallelContext.upkvec
vec = pc.upkvec()
vec = pc.upkvec(vecsrc)
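As an alternative to unpack, the upk* calls retrieve items one at a time in the order they were packed. A minimal sketch, continuing the illustrative "bundle" message from the pack example above:

objref v
strdef s
pc.take("bundle")
x = pc.upkscalar() // first packed item
pc.upkstr(s) // second packed item
v = pc.upkvec() // returns a new Vector sized to the message item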
Implementation Notes
The master NEURON process contains the server for the bulletin board system. Communication between normal hoc code executing on the master NEURON process and the server is direct with no overhead except packing and unpacking messages and manipulating the send and receive buffers with pvm commands. The reason I put the server into the master process is twofold. 1) While the master is number crunching, client messages are still promptly dealt with. I noticed that when neuron was cpu bound, a separate server process did not respond to requests for about a tenth of a second. 2) No context switching between master process and server. If pvm is not running, a local implementation of the server is used which has even less overhead than pvm packing and unpacking.
Client (worker) processes communicate with the bulletin board server (on the master machine) with the pvm commands pvm_send and pvm_recv. The master process is notified of asynchronous events via the SIGPOLL signal. Unfortunately this notification often comes early, since a pvm message often consists of several of these asynchronous events, and my experience so far is that (pvm_probe(-1,-1) > 0) is not always true even after the last of this burst of signals. Also, SIGPOLL is not available except under UNIX. However, SIGPOLL is only useful on the master process and should not affect performance with regard to whether a client is working under Win95, NT, or Linux. So even with SIGPOLL there must be software polling on the server, and this takes place on the next execute() call in the interpreter. (An execute call takes place when the body of every for loop, if statement, or function/procedure call is executed.) In the absence of a SIGPOLL signal this software polling takes place every POLLDELAY=20 executions. Of course this is too seldom in the case of fadvance calls with a very large model, and too often in the case of for i=1,100000 x+=i. Things are generally ok if the message at the end of a run says that the amount of time spent waiting for something to do is small compared to the amount of time spent doing things. Perhaps a timer would help.
The bulletin board server consists of several lists implemented with the STL (Standard Template Library), which makes for reasonably fast lookup of keys, i.e. searching is proportional not to the size of the list but to the log of the list size.
Posts go into the message list ordered by key (string order). They stay there until taken with look_take or take. Submissions go into a work list ordered by id and a todo list of id's by priority. When a host requests something to do, the highest priority (first in the list) id is taken off the todo list. When done, the id goes onto a results list ordered by parent id. When working is called and a results list has an id with the right parent id, the id is removed from the results list and the (id, message) pair is removed from the work list.
If args are saved (no explicit userid in the submit call), they are stored locally and become the active buffer on the corresponding working return. The saving is in an STL map associated with the userid. The data itself is not copied, but neither is it released until the next usage of the receive buffer after the working call returns.