Argo Overview, Table of available queues/partitions
Argo overview
Argo is the ICTP HPC cluster, comprising 153 hosts/nodes, with a total count of 2588 CPUs, nearly 10 TB of memory, 40 Gbps+ Infiniband interconnects, a 1 Gbps Ethernet network, and several hundred TB of dedicated NFS storage.
The available worker/compute nodes are organised in queues (partitions).
There are three further special cluster nodes: a master node that controls job execution, and the login nodes argo-login1 and argo-login2, where users log in, submit jobs, and compile code.
Jobs can be submitted from argo (which points to argo-login2) or from argo-login1:
ssh argo.ictp.it
or
ssh argo-login1.ictp.it
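As an illustration, a minimal batch job could be prepared and submitted from a login node along these lines (the script name, job name, and resource values below are placeholders, not Argo defaults):

$ cat myjob.sh
#!/bin/bash
#SBATCH --job-name=hello        # hypothetical job name
#SBATCH --ntasks=1              # a single task is enough for this example
#SBATCH --time=00:10:00         # well within any partition time limit
echo "Running on $(hostname)"
$ sbatch myjob.sh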
List of available queues/partitions
Queue information can be listed with the sinfo command:
$ sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
cmsp up 1-00:00:00 35/5/0/40 node[01-16,161-184]
esp up 1-00:00:00 24/11/1/36 node[61-96]
esp1 up 1-00:00:00 16/11/1/28 node[101-128]
long* up 1-00:00:00 19/18/1/38 node[21-32,131-156]
gpu up 1-00:00:00 0/2/0/2 gpu[01-02]
serial up 7-00:00:00 0/2/0/2 serial[01-02]
testing up 6:00:00 0/2/0/2 testing[01-02]
westmere up 6:00:00 0/0/1/1 westmere01
nehalem up 6:00:00 0/2/0/2 nehalem[01-02]
esp_guest up 1-00:00:00 0/2/0/2 nehalem[01-02]
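To restrict the listing to a single queue, sinfo accepts a partition selector (a standard Slurm option); for example:

$ sinfo -p long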
The principal queue for all users is the long queue, with 38 nodes and a 1-day time limit. It is the default queue if none is specified in the job script.
Dedicated queues cmsp, esp, esp1, esp_guest and gpu are available to specific Argo users, upon authorization.
The testing queue is small, comprising two nodes, with a short time limit of 6 hours.
The serial queue is meant specifically for serial jobs and has a much longer time limit of 7 days. Two nodes are in the serial queue.
In general, nodes are NOT shared among jobs in any queue. The exceptions are the serial and gpu queues, where nodes are shared.
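A single-core job for the serial queue could be sketched as follows (the program name and requested time are placeholders):

#!/bin/bash
#SBATCH --partition=serial      # serial queue: single-task jobs only
#SBATCH --ntasks=1              # exactly one task/core
#SBATCH --time=2-00:00:00       # e.g. 2 days, within the 7-day limit
./my_serial_program             # placeholder for a serial executable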
Node features
Overall, Argo is a heterogeneous cluster, with nodes belonging to various generations of Intel CPU microarchitectures. Most nodes are of the sandybridge and ivybridge microarchitectures, followed by broadwell. For historical reasons we have kept several nodes of older microarchitectures, such as nehalem and westmere, so you can still run code on them for testing and comparison. Memory size also varies.
For each node, its microarchitecture, memory size, and other features are listed in the sinfo output:
$ sinfo -N -o "%.20N %.15C %.15P %.40b"
NODELIST CPUS(A/I/O/T) PARTITION ACTIVE_FEATURES
node01 20/0/0/20 cmsp omnipart,128gb,broadwell-ep,e5-2640v4
node02 20/0/0/20 cmsp omnipart,128gb,broadwell-ep,e5-2640v4
...
node21 0/12/0/12 long infiniband,32gb,sandybridge-ep,e5-2620
node22 0/12/0/12 long infiniband,32gb,sandybridge-ep,e5-2620
...
node61 0/12/0/12 esp infiniband,32gb,sandybridge-ep,e5-2620
node62 0/12/0/12 esp infiniband,32gb,sandybridge-ep,e5-2620
...
node101 20/0/0/20 esp1 infiniband,64gb,ivybridge-ep,e5-2680v2
node102 20/0/0/20 esp1 infiniband,64gb,ivybridge-ep,e5-2680v2
...
node131 20/0/0/20 long* infiniband,64gb,ivybridge-ep,e5-2680v2
node132 20/0/0/20 long* infiniband,64gb,ivybridge-ep,e5-2680v2
...
node139 0/16/0/16 long* infiniband,32gb,sandybridge-ep,e5-2650
node140 16/0/0/16 long* infiniband,32gb,sandybridge-ep,e5-2650
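The feature strings shown above can be used to steer a job onto a particular node type through Slurm constraints; for example, a sketch requesting 64 GB ivybridge nodes (feature names copied from the sinfo output, the script name is a placeholder):

$ sbatch --constraint="ivybridge-ep&64gb" myjob.sh

or, inside the job script:

#SBATCH --constraint=ivybridge-ep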
Within each queue, the nodes are homogeneous in terms of all of their features. The only exception is the "long" queue.
All nodes are networked together with 1 Gbps Ethernet links spanning multiple switches.
Access to storage also goes through the Gigabit Ethernet network.
Nodes within each queue are additionally connected by a low-latency fabric for MPI communication, based on Infiniband (IB) or Omni-Path technology.
The two IB switches operate at 40 Gbps (QDR), while the Omni-Path switch supports 100 Gbps.
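A multi-node MPI job that benefits from this low-latency fabric could be sketched as follows (the executable name and the node/task counts are illustrative only):

#!/bin/bash
#SBATCH --partition=long        # default queue, Infiniband-connected
#SBATCH --nodes=4               # up to 10 nodes are allowed for parallel jobs
#SBATCH --ntasks-per-node=12    # fits even the 12-core sandybridge nodes
#SBATCH --time=12:00:00         # within the 1-day limit
srun ./my_mpi_program           # placeholder MPI executable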
Table of available queues and nodes
The table below summarises the queues, nodes, and their characteristics.
Queue/Partition | Access Policy | Notes
---|---|---
long | All users | Allows allocations of a maximum of 10 nodes for running parallel jobs.
testing | All users |
serial | All users | ONLY for single-core (single-task) jobs; parallel or MPI jobs will NOT work. Up to a maximum of 7 running independent serial jobs are allowed. Resources are over-subscribed (nodes are shared among jobs and users).
nehalem | All users |
westmere | All users |
Other dedicated queues | |
cmsp | Special authorization needed | Omni-Path connectivity is provided.
esp | Special authorization needed |
esp1 | Special authorization needed |
gpu | Special authorization needed | Several GPU accelerators are available, of the types Nvidia Tesla K40 and Nvidia Tesla P100. Resources are over-subscribed (nodes are shared among jobs).
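For the gpu queue, a job would normally include an explicit GPU request; a hedged sketch follows (the GRES name configured on Argo is an assumption and may differ, the program name is a placeholder):

#!/bin/bash
#SBATCH --partition=gpu         # requires special authorization
#SBATCH --gres=gpu:1            # request one GPU (GRES name is an assumption)
#SBATCH --ntasks=1
#SBATCH --time=06:00:00         # within the 1-day limit
./my_gpu_program                # placeholder GPU executable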
Table with technical details
Queue/Partition | Max walltime (h) | Node range | Micro-architecture | Cores | RAM per core (GB) | Total nodes/cores | RAM per node (GB)
---|---|---|---|---|---|---|---
long | 24:00 | node[139-148] | Sandybridge | 16 (8x2) | 2 | 10 / 160 | 32
long | 24:00 | node[131-138],[149-156] | Ivybridge | 20 (10x2) | 3.2 | 16 / 320 | 64
long | 24:00 | node[21-32] | Sandybridge | 12 (6x2) | 2.7 | 12 / 144 | 32
testing | 6:00 | testing[01-02] | Nehalem | 8 (4x2) | 1.5 | 1 / 8 | 12
nehalem | 6:00 | nehalem[01-02] | Nehalem | 8 (4x2) | 3 | 2 / 16 | 24
westmere | 6:00 | westmere01 | Westmere | 12 (6x2) | 2 | 1 / 12 | 24
cmsp | 24:00 | node[01-16] | Broadwell | 20 (10x2) | 6.4 | 16 / 320 | 128
cmsp | 24:00 | node[161-184] | Broadwell | 20 (10x2) | 9.4 | 24 / 480 | 188
esp | 24:00 | node[61-96] | Sandybridge | 12 (6x2) | 2.7 | 36 / 432 | 32
esp1 | 24:00 | node[101-128] | Ivybridge | 20 (10x2) | 3.2 | 28 / 560 | 64
gpu | 24:00 | gpu01 | Broadwell + 2x GP100 | 20 (10x2) + GPUs | 6.4 | 1 / 120 | 128
gpu | 24:00 | gpu02 | Sandybridge + 2x K40c | 16 (8x2) + GPUs | 2 | 1 / 16 | 32