ParallelContext
objref pc
pc = new ParallelContext()
pc = new ParallelContext(nhost)
The simplest form of parallelization of a loop, from the user's point of view, is
func f() { // a function with no context that *CHANGES*
    return $1*$1 // except its argument
}
objref pc
pc = new ParallelContext()
pc.runworker() // master returns immediately, workers in
               // infinite loop
s = 0
if (pc.nhost == 0) { // use the serial form
    for i=1, 20 {
        s += f(i)
    }
}else{ // use the bulletin board form
    for i=1, 20 { // scatter processes
        pc.submit("f", i) // any context needed by f had better be
    }                     // the same on all hosts
    while (pc.working) { // gather results
        s += pc.retval // the return value for the executed function
    }
}
print s
pc.done // wait for workers to finish printing
Several things need to be highlighted:
If a given task submits other tasks, only those child tasks will be gathered by the working loop for that given task. At this time the system groups tasks according to the parent task and the pc instance is not used. See submit for further discussion of this limitation. The safe strategy is always to use the idiom:
for i = 1, n { pc.submit(...) }  // scatter a set of tasks
while (pc.working) { ... }       // gather them all
Earlier submitted tasks tend to complete before later submitted tasks, even if they submit tasks themselves. That is, a submitted task has the same general priority as its parent task, and the specific priority of tasks with the same parent is the submission order. A free cpu always works on the next unexecuted task with the highest priority.
Each task manages a separate group of submissions whose results are returned only to that task. Therefore you can submit tasks which themselves submit tasks.
The pc.working call checks to see if a result is ready. If so, it returns the unique system-generated task id (a positive integer), and the return value of the task function is accessed via the pc.retval function. The arguments to the function executed by the submit call are also available. If all submissions have been computed and all results have been returned, pc.working returns 0. If results are pending, working executes tasks from ANY ParallelContext until a result is ready. This last feature keeps cpus busy but places stringent requirements on how the user may change global context without introducing bugs. See the discussion in working.
ParallelContext.working may not return results in the order of submission.
Hoc code subsequent to pc.runworker() is executed only by the master since that call returns immediately if the process is the master and otherwise starts an infinite loop on each worker which requests and executes submit tasks from ANY ParallelContext instance. This is the standard way to seed the bulletin board with submissions. Note that workers may also execute tasks that themselves cause submissions. If subsidiary tasks call pc.runworker, the call returns immediately. Otherwise the task it is working on would never complete! The pc.runworker() function is also called for each worker after all hoc files are read in and executed.
The basic organization of a simulation is:

//setup which is exactly the same on every machine.
// ie declaration of all functions, procedures, setup of neurons
pc.runworker() // to start the execute loop if this machine is a worker
// the master scatters tasks onto the bulletin board and gathers results
pc.done()

Issues having to do with context can become quite complex. Context transfer from one machine to another should be as small as possible. Don't fall into the trap of a context transfer which takes longer than the computation itself. Remember, you can do thousands of c statements in the time it takes to transfer a few doubles. Also, with a single cpu, it is often the case that statements can be moved out of an innermost loop, but can't be in a parallel computation. E.g.

for i = 1, 20 {
    forall gnabar_hh = g[i]
    for j = 1, 5 {
        stim.amp = s[j]
        run()
    }
}

i.e. we only need to set gnabar_hh 20 times. But the first pass at parallelization would look like:

for i = 1, 20 {
    for j = 1, 5 {
        sprint(cmd, "forall gnabar_hh = g[%d]  stim.amp = s[%d]  run()\n", i, j)
        pc.submit(cmd)
    }
}
while (pc.working) { }

and not only do we take the hit of repeated evaluation of gnabar_hh but the statement must be interpreted each time. A run must be quite lengthy to amortize this overhead.
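A possible alternative, sketched here under assumptions, is to submit a function rather than an interpreted statement, passing the varying values as copied arguments. This does not avoid setting gnabar_hh in every task, but it avoids re-interpreting a command string for each run and removes the need for the g and s arrays to exist on the workers. The func prun below is hypothetical, and stim and a meaningful return value are assumed to be part of the common setup on every host:

func prun() { // $1 = gnabar value, $2 = stimulus amplitude (hypothetical helper)
    forall gnabar_hh = $1
    stim.amp = $2 // assumes stim exists identically on every host
    run()
    return 0 // return whatever scalar measure of the run is of interest
}
for i = 1, 20 {
    for j = 1, 5 {
        pc.submit("prun", g[i], s[j]) // values travel as copied arguments
    }
}
while (pc.working) { } // gather (and discard) the results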
The PVM (parallel virtual machine) should be set up so that it allows execution on all hosts of the csh script $NEURONHOME/bin/bbsworker.sh. The simulation hoc files should be available on each machine with the same relative path with respect to the user's $HOME directory. For example, I start my 3 machine pvm with the command

pvm hineshf

where hineshf is a text file with the contents:

hines ep=$HOME/nrn/bin
spine ep=$HOME/nrn/bin
pinky ep=$HOME/nrn/bin

Again, the purpose of the ep=$HOME/nrn/bin tokens is to specify the path to find bbsworker.sh.

A simulation is started by moving to the proper working directory (which should be a descendant of your $HOME directory) and launching neuron as in

special init.hoc

The exact same hoc files should exist in the same relative locations on all host machines.
ParallelContext.nhost
n = pc.nhost()
if (pc.nhost == 0) {
    for i=1, 20 {
        print i, sin(i)
    }
}else{
    for i=1, 20 {
        pc.submit(i, "sin", i)
    }
    while (pc.working) {
        print pc.userid, pc.retval
    }
}
ParallelContext.submit
pc.submit("statement\n")
pc.submit("function_name", arg1, ...)
pc.submit(object, "function_name", arg1, ...)
pc.submit(userid, ..as above..)
If the first argument to submit is a non-negative integer, then args are not saved, and when the id for this task is returned by working, that non-negative integer can be retrieved with userid.
If there is no explicit userid, then the args (after the function name) are saved locally and can be unpacked when the corresponding working call returns. A local userid (unique only for this ParallelContext) is generated and returned by the submit call and is also retrieved with userid when the corresponding working call returns. This is very useful in associating a particular parameter vector with its return value and avoids the necessity of explicitly saving them or posting them. If they are not needed and you do not wish to pay the overhead of storage, supply an explicit userid. Unpacking args must be done in the same order and have the same type as the args of the "function_name". They do not have to be unpacked. Saving args is time efficient since it does not imply extra communication with the server.
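For instance, a parameter Vector passed to submit without an explicit userid can be recovered, together with the function's return value, when the corresponding working call returns. A minimal sketch, assuming a user-defined func cost that takes a Vector and returns a scalar, and that pc.runworker() has already been called:

objref pvec
for i = 1, 10 {
    pvec = new Vector(3)
    pvec.fill(i) // i-th parameter set (illustrative values)
    pc.submit("cost", pvec) // no explicit userid, so the args are saved
}
while ((id = pc.working) != 0) {
    err = pc.retval // return value of cost() for this task
    pvec = pc.upkvec() // unpack the saved Vector argument of the same task
    print id, pvec.x[0], err
}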
The argument form causes function_name(copyofarg1, ...) to execute on some indeterminate host in the PVM. Args must be scalars, strings, or Vectors. Note that they are *COPIES* so that even string and Vector arguments are call by value and not call by reference. (This is different from the normal semantics of a direct function call). In this case efficiency was chosen at the expense of pedantic consistency since it is expected that in most cases the user does not need the return copy. In the event more than a single scalar return value is required use post within the function_name body with a key equal to the id of the task. For example:
func function_name() { local id
    id = hoc_ac_      // the task id
    $o1.reverse()
    pc.post(id, $o1)  // post the Vector result keyed by the task id
    return 0          // submitted functions return a scalar (see retval)
}
...
while( (id = pc.working) != 0) {
    pc.take(id)       // retrieve the result posted under this task id
    pc.upkvec.printf
}

The object form executes the function_name(copyofarg1, ...) in the context of the object. IT MUST BE THE CASE that the string

print object

identifies the "same" object on the host executing the function as on the host that submitted the task. This is guaranteed only if all hosts, when they start up, execute the same code that creates these objects. If you start creating these objects after the worker code diverges from the master (after pc.runworker) you really have to know what you are doing and the bugs will be VERY subtle.
A task should gather the results of all the tasks it submits before scattering other tasks, even if scattering with different ParallelContext instances. This is because results are grouped by parent task id instead of (parent task id, pc instance). Thus the following idiom needs extra user defined info to distinguish between pc1 and pc2 task results:
for i=1,10 pc1.submit(...)
for i=1,10 pc2.submit(...)
for i=1,10 { pc1.working() ... }
for i=1,10 { pc2.working() ... }

since pc1.working may get a result from a pc2 submission. If this behavior is at all inconvenient, I will change the semantics so that pc1 results only are gathered by pc1.working calls and by no others.
Searching for the proper object context (pc.submit(object, ...)) on the host executing the submitted task is linear in the number of objects of that type.
ParallelContext.working
id = pc.working()
While there are pending submissions and results are not ready, pending submissions from any ParallelContext from any host are calculated. Note that returns of completed submissions are not necessarily in the order that they were made by pc.submit.
while ((id = pc.working) > 0) { // gather results of previous pc.submit calls
    print id, pc.retval
}
Note that if the submission did not have an explicit userid then all the arguments of the executed function may be unpacked.
It is essential to emphasize that when a task calls pc.working, while it is waiting for a result, it may execute any number of other tasks and unless care is taken to understand the meaning of "task context" and guarantee that context after the working call is the same as the context before the working call, SUBTLE ERRORS WILL HAPPEN more or less frequently and indeterminately. For example consider the following:
func f() {
    ... write some values to some global variables ...
    pc.submit("g", ...) // when g is executed on another host it will not in general
                        // see the same global variable values you set above.
    pc.working()        // get back result of execution of g(...)
    // now the global variables may be different than what you
    // set above. And not because g changes them but perhaps
    // because the host executing this task started executing
    // another task that called f which then wrote DIFFERENT values
    // to these global variables.
}

I only know one way around this problem. Perhaps there are other and better ways.
func f() { local id
    id = hoc_ac_
    ... write some values to some global variables ...
    pc.post(id, the, global, variables)
    pc.submit("g", ...)
    pc.working()
    pc.take(id) // unpack the info back into the global variables
    ...
}
ParallelContext.retval
scalar = pc.retval()
ParallelContext.userid
scalar = pc.userid()
See submit with regard to retrieving the original arguments of the submit call corresponding to the working return.
Can be useful in organizing results according to an index defined during submission.
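For example, an explicit userid can serve directly as an index into a results Vector. A minimal sketch (the use of sin as the task function is only illustrative):

objref results
results = new Vector(20)
for i = 0, 19 {
    pc.submit(i, "sin", i*0.1) // explicit userid i, so args are not saved
}
while (pc.working) {
    results.x[pc.userid] = pc.retval
}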
ParallelContext.runworker
pc.runworker()
The basic style is that the master and each host execute the same code up to the pc.runworker call, and that code sets up all the context that is required to be identical on all hosts so that any host can run any task whenever the host requests something to do. The latter takes place in the runworker loop and when a task is waiting for a result in a working call. Many parallel processing bugs are due to inconsistent context among hosts and those bugs can be VERY subtle. Tasks should not change the context required by other tasks without extreme caution. The only way I know how to do this safely is to store and retrieve a copy of the authoritative context on the bulletin board. See working for further discussion in this regard.
The runworker method is called automatically for each worker after all files have been read in and executed --- i.e. if the user never calls it explicitly from hoc. Otherwise the workers would exit, since the standard input is at the end of file for workers. This is useful in those cases where the only distinction between master and workers is the code executed from the gui or console.
ParallelContext.done
pc.done()
ParallelContext.context
pc.context("statement\n")
pc.context("function_name", arg1, ...])
pc.context(object, "function_name", arg1, ...)
pc.context(userid, ..as above..)
There is no return in the sense that working does not return when one of these tasks completes.
This method was introduced with the following protocol in mind
proc save_context() { // executed on master
    sprint(tstr, "%s", this)
    pc.look_take(tstr) // remove previous context if it exists
    // master packs a possibly complicated context from within
    // an object whose counterpart exists on all workers
    pc.post(tstr)
    pc.context(this, "restore_context", tstr) // all workers do this
}
proc restore_context() {
    pc.look($s1) // don't remove! Others need it as well.
    // worker unpacks possibly complicated context
}
This method was introduced in an attempt to get a parallel multiple run fitter which worked in an interactive gui setting. As such it increases safety but is not bulletproof since there is no guarantee that the user doesn't change a global variable that is not part of the fitter. It is also difficult to write safe code that invariably makes all the relevant worker context identical to the master. An example of a common bug is to remove a parameter from the parameter list and then call save_context(). Sure enough, the multiple run fitters on all the workers will no longer use that parameter, but the global variables that depend on the parameter may be different on different hosts and they will now stay different! One fix is to call save_context() before the removal of the parameter from the list and save_context() after its removal. But the inefficiency is upsetting. We need a better automatic mirroring method.
ParallelContext.post
pc.post(key)
pc.post(key, ...)
Later unpacking of the message body must be done in the same order as this posting sequence.
ParallelContext.take
pc.take(key)
pc.take(key, ...)
Scalar arguments following the key must be pointers, e.g. &x. Unpacked Vectors will be resized to the correct size of the vector item of the message.
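A minimal sketch of a matching post/take pair (the key "params" and the values are purely illustrative); scalars are unpacked through pointers and the Vector is resized as described above:

objref v1, v2
strdef s
v1 = new Vector(5)
v1.fill(2)
pc.post("params", 1.5, "label", v1) // scalar, string, Vector, in that order
// ... later, possibly in a different task ...
x = 0
v2 = new Vector()
pc.take("params", &x, s, v2) // unpack in the same order and with the same types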
ParallelContext.look
boolean = pc.look(key)
boolean = pc.look(key, ...)
ParallelContext.look_take
boolean = pc.look_take(key, ...)
Note that a look followed by a take is *NOT* equivalent to look_take. It can easily occur that another task might take the message between the look and take and the latter will then block until some other process posts a message with the same key.
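A small sketch of the difference (the key "flag" is illustrative): look reads the message and leaves it posted, while look_take removes it atomically.

pc.post("flag", 3.14)
x = 0
if (pc.look("flag", &x)) { // the message stays on the bulletin board
    print "look: ", x
}
if (pc.look_take("flag", &x)) { // the message is removed atomically
    print "look_take: ", x
}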
ParallelContext.pack
pc.pack(...)
ParallelContext.unpack
pc.unpack(...)
Scalar arguments must be pointers, e.g. &soma.gnabar_hh(.3). Items must be unpacked in the same order and with the same types as they were packed.
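A minimal sketch of building a message incrementally with pack and retrieving it with unpack (the key "bundle" and the values are illustrative): pack appends items to the send buffer, post with only a key posts that buffer, and the receiver can then take the key and unpack in the packing order.

objref vsend, vrecv
strdef s
vsend = new Vector(10)
vsend.indgen() // 0, 1, ..., 9
pc.pack(6.02e23)
pc.pack("a label")
pc.pack(vsend)
pc.post("bundle") // post the previously packed buffer under this key
// ... in the task that retrieves it ...
x = 0
vrecv = new Vector()
pc.take("bundle")
pc.unpack(&x, s, vrecv) // unpack in the packing order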
ParallelContext.upkscalar
x = pc.upkscalar()
ParallelContext.upkstr
str = pc.upkstr(str)
ParallelContext.upkvec
vec = pc.upkvec()
vec = pc.upkvec(vecsrc)
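As an alternative to unpack, the upk* calls retrieve items one at a time in the order they were packed. A minimal sketch, continuing the illustrative "bundle" message from the pack example above:

objref v
strdef s
pc.take("bundle")
x = pc.upkscalar() // first packed item
pc.upkstr(s) // second packed item
v = pc.upkvec() // returns a new Vector sized to the message item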
Implementation Notes
The master NEURON process contains the server for the bulletin board system. Communication between normal hoc code executing on the master NEURON process and the server is direct with no overhead except packing and unpacking messages and manipulating the send and receive buffers with pvm commands. The reason I put the server into the master process is twofold. 1) While the master is number crunching, client messages are still promptly dealt with. I noticed that when neuron was cpu bound, a separate server process did not respond to requests for about a tenth of a second. 2) No context switching between master process and server. If pvm is not running, a local implementation of the server is used which has even less overhead than pvm packing and unpacking.
Client (worker) processes communicate with the bulletin board server (on the master machine) with the pvm commands pvm_send and pvm_recv. The master process is notified of asynchronous events via the SIGPOLL signal. Unfortunately this notification often comes early, since a pvm message often consists of several of these asynchronous events, and my experience so far is that (pvm_probe(-1,-1) > 0) is not always true even after the last of this burst of signals. Also, SIGPOLL is not available except under UNIX. However, SIGPOLL is only useful on the master process and should not affect performance with regard to whether a client is working under Win95, NT, or Linux. So even with SIGPOLL there must be software polling on the server, and this takes place on the next execute() call in the interpreter. (An execute call takes place when the body of every for loop, if statement, or function/procedure call is executed.) In the absence of a SIGPOLL signal this software polling takes place every POLLDELAY=20 executions. Of course this is too seldom in the case of fadvance calls with a very large model, and too often in the case of for i=1,100000 x+=i. Things are generally ok if the message at the end of a run says that the amount of time spent waiting for something to do is small compared to the amount of time spent doing things. Perhaps a timer would help.
The bulletin board server consists of several lists implemented with the STL (Standard Template Library), which makes for reasonably fast lookup of keys, i.e. searching is proportional not to the size of the list but to the log of the list size.
Posts go into the message list ordered by key (string order). They stay there until taken with look_take or take. Submissions go into a work list ordered by id and a todo list of id's by priority. When a host requests something to do, the highest priority (first in the list) id is taken off the todo list. When done, the id goes onto a results list ordered by parent id. When working is called and a results list has an id with the right parent id, the id is removed from the results list and the (id, message) pair is removed from the work list.
If args are saved (no explicit userid in the submit call), they are stored locally and become the active buffer on the corresponding working return. The saving is in an STL map associated with the userid. The data itself is not copied, but neither is it released until the next usage of the receive buffer after the working call returns.