Cluster parallel

All Python-based solutions listed on https://wiki.python.org/moin/ParallelProcessing are investigated.

Projects without a GitHub link are filtered out. These are usually repos that are too old or not well maintained, without a proper issue-tracking mechanism.

Projects with too few GitHub stars are also filtered out.

The rest shows that "ray" and "celery" are the two most prominent candidates, followed by "dask" and "deap".

lsf

To begin with, IBM LSF is tried first since it is the recommended tool. However, its functions do not meet my requirements: LSF fundamentally lacks a programmable interface.

ray

"ray" is deployed successfully on the campus hpc, but problems are encounted that cannot be found and solved. Trials are also made to deploy "ray" on newer systems. The results are as following.

  • zju-i1c311-ws0-u: successful
  • zju-i19b118-ws0-u: successful
  • zju-i19b118-ws1-u: successful
  • zju-i1b305-ws0-u: successful deployment, error when running, probably caused by WSL
  • zju-i1c410-ws0-c: successful deployment, error when running, probably caused by CentOS 7
  • zju-i19b118-mgt01 with 40 nodes: successful deployment, but unstable at runtime, sometimes succeeding and sometimes erroring; unusable.
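
For reference, a minimal sketch of how the deployed cluster is exercised, assuming 'ray start' has already been run on the head and worker nodes; the workload here is a placeholder:

    import ray

    # Connect to the cluster started beforehand with `ray start`.
    ray.init(address="auto")

    @ray.remote
    def square(x):
        return x * x

    # Tasks are scheduled across whichever nodes joined the cluster.
    print(ray.get([square.remote(i) for i in range(100)]))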

n2n is used to avoid firewall port restrictions.

Although deployment and running are successful on the first 3 nodes, ray shows load-balancing problems, which are yet to be solved.

However, the ray dashboard works great.

diy

It is possible to write one's own parallel framework. In fact, a tool 'bexec' has been made to allow parallel execution of commands (e.g. ray management commands) on the HPC, using Python multiprocessing and ssh.
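
A minimal sketch of how such a tool can be put together, assuming password-less ssh to the hosts; the host list is hypothetical and the real 'bexec' may differ in detail:

    import subprocess
    import sys
    from multiprocessing import Pool

    HOSTS = ["node01", "node02", "node03"]  # hypothetical host list

    def run_on_host(args):
        # One ssh subprocess per host, driven by a multiprocessing pool.
        host, command = args
        proc = subprocess.run(["ssh", host, command],
                              capture_output=True, text=True)
        return host, proc.returncode, proc.stdout

    if __name__ == "__main__":
        command = " ".join(sys.argv[1:])
        with Pool(len(HOSTS)) as pool:
            results = pool.map(run_on_host, [(h, command) for h in HOSTS])
        for host, code, out in results:
            print(f"[{host}] exit={code}\n{out}")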

To DIY such a framework, there are problems to be solved and requirements to be met, as follows:

  • remote connection
    • paramiko for reusable ssh connections: too buggy, too slow; I cannot make it work
    • a detailed comparison of alternatives can be found at https://parallel-ssh.readthedocs.io/en/stable/alternatives.html
    • parallel-ssh has similar problems, which are solved by deploying ed25519 keys (the same applies to paramiko); it turns out 8192-bit RSA keys are not well supported
  • resource balance
    • monitoring of remaining resource with accuracy and efficiency
  • fault tolerance
    • node problem
    • worker problem
    • process problem
  • avoid management node bottleneck
    • async- and threading-based task scheduling does not run on multiple cores (see the sketch after this list)
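
A minimal sketch of the scheduling point above, assuming a CPU-bound workload: the threaded pool is serialized by the GIL and stays on one core, while the process pool actually spreads across cores.

    import time
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    def busy(n):
        # CPU-bound work; threads contend on the GIL, processes do not.
        total = 0
        for i in range(n):
            total += i * i
        return total

    if __name__ == "__main__":
        work = [10_000_000] * 4
        for pool_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
            start = time.perf_counter()
            with pool_cls(max_workers=4) as pool:
                list(pool.map(busy, work))
            print(pool_cls.__name__, time.perf_counter() - start)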

Although the cluster can run, its performance is bottlenecked by the task scheduler. With both paramiko and parallel-ssh, I cannot make multiprocessing work because of 'cannot pickle' errors, as in the sketch below.
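
A minimal reproduction of that failure, with a placeholder host: a connected client holds sockets and thread locks, which the pickle module refuses, so it cannot be handed to a multiprocessing worker.

    import pickle
    import paramiko

    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    client.connect("node01")  # placeholder host

    # Sending the client to another process means pickling it first;
    # live connections cannot be pickled, so this raises TypeError.
    pickle.dumps(client)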

The framework, however, can still be used in the future, in cases where the task scheduler is not a bottleneck.

Better still, parallel-ssh proves to be a good parallel command tool, which is what it's designed to be. Scripts are written to manage ray and celery instances using parallel-ssh.
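
A sketch of what these scripts boil down to, assuming parallel-ssh 2.x and the ed25519 key mentioned above; the hosts and the exact ray command are placeholders:

    from pssh.clients import ParallelSSHClient

    hosts = ["node01", "node02", "node03"]  # placeholder host list
    client = ParallelSSHClient(hosts, pkey="~/.ssh/id_ed25519")

    # e.g. restart ray workers on every node in parallel
    output = client.run_command("ray stop; ray start --address=head:6379")
    for host_out in output:
        for line in host_out.stdout:
            print(f"[{host_out.host}] {line}")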

celery

It works.

celery + rabbitmq + redis works.
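
A minimal sketch of the setup, with placeholder broker and backend URLs; RabbitMQ passes the messages and Redis stores the results:

    # tasks.py -- hosts and credentials are placeholders
    from celery import Celery

    app = Celery(
        "tasks",
        broker="amqp://user:password@head-node:5672//",
        backend="redis://head-node:6379/0",
    )

    @app.task
    def compute(x, y):
        return x + y

Workers on each node are then started with something like 'celery -A tasks worker --concurrency=32', and a result comes back through 'compute.delay(1, 2).get()'.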

All 5 workstations managed by me work smoothly, including 3 running Ubuntu 22.04, 1 Ubuntu 20.04, and 1 CentOS 7.9. This gives 128+64+32+24+32=280 physical cores, or 560 threads.

Of all the 45 HPC nodes, all CentOS 7.4, 3 nodes (mgt01, mgt02, fat01) are management nodes and 3 nodes (node28, node31, gpu08) are not physically available, making a total of 39 nodes available to use. This gives 1904 physical cores, or 2072 threads.

One of the workstations is chosen to be the head node, where tasks are assigned, messages are passed with RabbitMQ, and results are collected by Redis. Excluding the head node's own 64 cores (128 threads), this gives a total of 280+1904-64=2120 physical cores, or 560+2072-128=2504 threads.

All tasks are set to a nice value of 19, i.e. the lowest priority, to keep a low profile and minimize influence on other users; even so, they are still competitive enough against other users' tasks. The head node would show delays in processing other nodes' celery heartbeats should it be filled with tasks too.
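
One way to apply that priority from within celery itself, as a sketch; the signal-based approach here is an assumption, and launching workers under 'nice -n 19' achieves the same:

    import os
    from celery.signals import worker_process_init

    @worker_process_init.connect
    def lower_priority(**kwargs):
        # Raise niceness to 19 (lowest priority) in every worker process.
        os.nice(19)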

1904 cores cost ¥190.4 per hour, which for a campus public facility is very arrogant. However, an inspection of the billing system shows that it counts only LSF records, so all these calculations are done for free! This is clearly an exploit, but a rule is a rule, and the price is obviously set improperly. In fact, members of the administration board get an unclear 'discount' on using the cluster, which could be the actual reasonable price. Now that the billing system is both arrogant and stupid, I see no point in making it smart while it remains arrogant.

The actual exploit lies in the fact that the cluster nodes allow direct ssh access rather than access only through LSF.