name: inverse
class: center, inverse, middle

# APC Computing resources

Martin Souchal and Paul Zakharov, 2021-2024
---
layout: false
.left-column[
### - Overview
]
.right-column[
What computing resources are available at APC?

* Your computer
* Computing resources (HTC/HPC resources)
* Storage (home folder, backup folders, group folder, high-performance folders)
* Security (VPN, SSH)

Today we will focus on computing resources. You can find more information about the other resources on the intranet:
https://apc.u-paris.fr/APC_CS/fr/intranet/catalogue-de-services-informatique
]

---
layout: false
.left-column[
### - Overview
### - Architecture
]
.right-column[
The computing platform accessible through the job scheduler is composed of Linux computing servers. It covers three main use cases:

* The HTC platform (High-Throughput Computing) is suited to running most traditional HEP single- or multi-core applications.
* The HPC platform (High-Performance Computing) is designed for parallel computing. It is a set of servers interconnected with InfiniBand, which enables efficient inter-server communication through MPI libraries.
* The GPU platform is a group of servers equipped with graphics cards, to accommodate vector-computation applications.
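As an illustration, the three use cases map to different resource requests at submission time. A minimal sketch with standard Slurm options; the partition names (`htc`, `hpc`, `gpu`) and script names are hypothetical placeholders, not actual APC configuration:

```shell
# HTC: a single-core job (partition name "htc" is hypothetical)
sbatch --partition=htc --ntasks=1 --mem=2G my_job.sh

# HPC: an MPI job spanning 4 nodes x 40 tasks over InfiniBand
sbatch --partition=hpc --nodes=4 --ntasks-per-node=40 my_mpi_job.sh

# GPU: one task with 2 GPUs attached (generic resource request)
sbatch --partition=gpu --ntasks=1 --gres=gpu:2 my_gpu_job.sh
```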
]

---
.left-column[
### - Overview
### - Architecture
### - Batch queue system
]
.right-column[
* The platforms support a broad set of workloads. In general, anything that can be executed from a command line can be submitted, and it will be executed as a job under the control of the scheduler. Examples of suitable workloads are:
  * Shell scripts
  * Executable binaries
  * Interactive sessions
* Workloads may have requirements which need to be satisfied for them to function as desired. These can include:
  * Dependencies on a certain operating system
  * Requirements for available memory, disk space or CPU
  * Availability of software licenses used by an application embedded in the workload
  * A minimum of free disk space on a specific system
* These requirements are stated when submitting a workload, and instruct the scheduler how to handle the corresponding job.
* These workloads are what we call **batch jobs**.
]

---
.left-column[
### - Overview
### - Architecture
### - Batch queue system
]
.right-column[
### What is a computing cluster?

* Hardware: configured to work together
* Special software that provides two roles: the **Resource Manager** and the **Job Scheduler**
* **Resource Manager**
  * Allocates resources within a cluster
  * Launches and otherwise manages jobs
* **Job Scheduler**
  * When there is more work than resources, the job scheduler manages **queue(s)** of work
  * Supports complex scheduling algorithms (priority)
  * Supports resource limits and QoS: by queue, user, group, etc.

Sometimes these are two pieces of software (e.g. Torque + Maui), sometimes just one (Slurm).

A **queue** (aka **partition**) is a set of global configuration properties, generally linked to some hardware profile. Jobs in a queue are subject to a wide range of limits (e.g. runtime (real, wall-clock, time), CPU time, memory size, etc.).
]

---
.left-column[
### - Overview
### - Architecture
### - Batch queue system
]
.right-column[
## What is wall-clock time? CPU time?

* Wall-clock time, or wall time, is the actual time taken from the start of a computer program to its end. It is the difference between the time at which a task finishes and the time at which it started.
* Wall time is thus different from CPU time, which measures only the time during which the processor is actively working on a given task.
* The difference between the two can arise from architecture- and run-time-dependent factors, e.g. programmed delays or waiting for system resources to become available.
]

---
.left-column[
### - Overview
### - Architecture
### - Batch queue system
]
.right-column[
### What is a Resource Request?

When you submit a job, you can (and should) request specific resources.

*E.g.: the job needs a host with at least 512 MB of main memory.*

These requestable resources are predefined for the cluster (they are also called *complexes*). In addition to hardware resources, a request can refer to:

* a hostname
* a queue name
* a specific storage constraint: *irods, mysql, ...*
* licensed software: *idl, matlab, ...*
* ...
]

---
.left-column[
### - Overview
### - Architecture
### - Batch queue system
]
.right-column[
### What is Priority?
When your job is submitted, the scheduler assigns a priority to it.

* Priority determines the rank order in which waiting jobs are scheduled.
* It refers to the POSIX priority which is either implicitly or explicitly assigned to a job.
* The POSIX priority is simply a number in the range between -1023 (lowest priority) and 1024 (highest priority).

*The overall priority calculation can be a complex multi-factor scheme, but in practice you can think of it simply: the more you ask for, the lower your job's priority.*
]

---
.left-column[
### - Overview
### - Architecture
### - Batch queue system
### - Dante
]
.right-column[
DANTE is the computing cluster of APC (in partnership with IPGP).

* Hosted in the IPGP data center, but the APC part is administered by the APC IT team
* Scheduler: Slurm
* There are 3 different queues to submit your jobs. Here are their characteristics:

| QUEUE | TIMELIMIT | CPU | RAM per node | NODELIST |
| ------ | ------ | ------ | ------ | ------ |
| quiet (default) | infinite | 600 | 96 GB | node[05-08,13-23] |
| bigmem | infinite | 320 | 192 GB | node[01-04,09-12] |
| debug | 30:00 | 160 | 96 GB | node[01-02,05-06] |
]

---
.left-column[
### - Overview
### - Architecture
### - Batch queue system
### - Dante
]
.right-column[
* 130 TB of fast storage (BeeGFS); network: 2x10 Gb/s between APC and IPGP
* CPU: Intel Xeon Gold 6230 2.1 GHz, 20C/40T with DDR4-2933
* Some of the available software:
  * Python, Fortran, C++
  * Matlab
  * OpenMPI
  * Singularity
* There are 23 nodes available, distributed as follows:

| Node | CPU | RAM | Queue |
| ------ | ------ | ------ | ------ |
| apcdante | 32 | 32 GB | N/A |
| node01 | 40 | 192 GB | bigmem, debug |
| node02 | 40 | 192 GB | bigmem, debug |
| node03 | 40 | 192 GB | bigmem |
| node04 | 40 | 192 GB | bigmem |
| node05 | 40 | 96 GB | quiet, debug |
| node06 | 40 | 96 GB | quiet, debug |
| node07 | 40 | 96 GB | quiet |
| node08 | 40 | 96 GB | quiet |
| node09 | 40 | 192 GB | bigmem |
| node10 | 40 | 192 GB | bigmem |
| node11 | 40 | 192 GB | bigmem |
| node12 | 40 | 192 GB | bigmem |
| node13 | 40 | 96 GB | quiet |
| node14 | 40 | 96 GB | quiet |
| node15 | 40 | 96 GB | quiet |
| node16 | 40 | 96 GB | quiet |
| node17 | 32 | 96 GB | quiet |
| node18 | 32 | 96 GB | quiet |
| node19 | 32 | 96 GB | quiet |
| node20 | 32 | 96 GB | quiet |
| node21 | 32 | 96 GB | quiet |
| node22 | 32 | 96 GB | quiet |
| node23 | 32 | 96 GB | quiet |
| Total | 864 | 2976 GB | |
]

---
.left-column[
### - Overview
### - Architecture
### - Batch queue system
### - Dante
### - Job submission HowTo
]
.right-column[
1) Check the execution of your job

* Launch it on an interactive worker (debug queue)
* Check the output
* Adjust the resources needed for the job

```bash
# note: all srun options must come before the command (bash -i)
srun --partition=debug --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i
```

2) Run the job

* Write a submission script
* Ask for resources (attributes, complexes)

```
$ sbatch run.sh
```
]

---
.left-column[
### - Overview
### - Architecture
### - Batch queue system
### - Dante
### - Job submission HowTo
]
.right-column[
3) Job monitoring

* When you use the `squeue` command, you will see that the queue is divided into three parts: **Active** jobs, **Idle** jobs, and **Blocked** jobs.
  * **Active** jobs are the jobs that are running at the moment.
  * **Idle** jobs are next in line to start running, when the needed resources become available. By default, each user can have only one job in the Idle jobs queue.
  * **Blocked** jobs are all other jobs. Their state can be *Idle*, *Hold*, or *Deferred*.
* `scontrol show jobid -dd <jobid>` lists detailed information for a job (useful for troubleshooting).
* `sacct -j <jobid> --format=JobID,JobName,MaxRSS,Elapsed` gives you statistics on completed jobs by job ID. Once your job has completed, you can get additional information that was not available during the run: run time, memory used, etc.
* Keep an eye on your mail: you may receive alerts in case of problems.
* Cluster usage: https://si-apc.pages.in2p3.fr/dante-cluster/monitoring/
]

---
.left-column[
### - Overview
### - Architecture
### - Batch queue system
### - Dante
### - Job submission HowTo
]
.right-column[
## My jobs are still queued: why?

* Check the submission syntax: missing parallel environment or resource specification
* Check that the requested resources match a batch queue
* Check that you are allowed in the requested project
* Check that you are allowed to submit to the requested queue
* Check the resource quota sets (RQS) of the requested complexes
* Check the group activity and the fair share
* Check the available slots for the requested project
* Check the status of the whole farm (it may simply be full)
* Check whether there is currently an incident or an upgrade on the scheduler

None of the above?! Don't panic! Create a ticket on [OSticket](https://supportapc.in2p3.fr/) with Help Topic "Cluster Support (Cluster)".
]

---
.left-column[
### - Overview
### - Architecture
### - Batch queue system
### - Dante
### - Job submission HowTo
### - Best practice
]
.right-column[
* Read the documentation: https://si-apc.pages.in2p3.fr/dante-cluster/
* DANTE uses the Slurm scheduler. You can find the Slurm documentation on their website: https://slurm.schedmd.com/documentation.html
* To keep the cluster usable for everyone, please respect our recommendations:
  * For a long (> 24h) parallel job: no more than 80 cores and 60 GB of memory.
  * No limit for short (< 5h) parallel jobs.
  * Nevertheless, for jobs that take 8 to 12 hours, it is preferable to launch them at the end of the day so that they run overnight.
  * Each user of a queue should not take more than 40% of the available resources in that queue.
* These rules ensure that a maximum of users can work on the cluster; they can, however, be relaxed depending on the system load (memory, CPU, etc.).
* If you have an exceptional request, please contact us via [OSticket](https://supportapc.in2p3.fr/).
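As a starting point, here is a sketch of what a `run.sh` submission script might look like while staying within the recommendations above. The queue name (`quiet`) is the documented DANTE default; the job name, log pattern and program are hypothetical placeholders:

```shell
#!/bin/bash
#SBATCH --job-name=my_analysis   # placeholder job name
#SBATCH --partition=quiet        # default DANTE queue
#SBATCH --ntasks=40              # within the 80-core limit for long jobs
#SBATCH --mem=48G                # within the 60 GB limit for long jobs
#SBATCH --time=48:00:00          # requested wall-clock limit
#SBATCH --output=%x-%j.out       # log file: <job-name>-<job-id>.out

# the program itself is a placeholder
srun ./my_program input.dat
```

Submit it with `sbatch run.sh`; Slurm reads the `#SBATCH` comment lines as if they were command-line options.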
]

---
.left-column[
### - Overview
### - Architecture
### - Batch queue system
### - Dante
### - Job submission HowTo
### - Best practice
### - External resources
]
.right-column[
# CC IN2P3

* Centre de Calcul de l'IN2P3 / CNRS
* Located in Villeurbanne since 1986
* Missions
  * Mass storage and computing infrastructure
  * Network and connectivity
  * Common and collaborative services (electronic mail, electronic document management, software versioning, project management, etc.)
* Staff
  * 84 people (engineers, technicians, administrative staff and researchers)
]

---
.left-column[
### - Overview
### - Architecture
### - Batch queue system
### - Dante
### - Job submission HowTo
### - Best practice
### - External resources
]
.right-column[
# CC IN2P3

* An HTC computer cluster
  * more than 18 000 cores, over 90 TB of memory, 360 000 HS06
  * network connectivity: 1 Gbps per server
* SIMBA, an HPC cluster
  * 512 cores, 2048 GB of memory
  * network connectivity: 1 Gbps per server
* NALA, HPC and GPGPU
  * 320 cores, 2.4 TB of memory
  * 64 Nvidia GPUs
  * network connectivity: 1 Gbps per server
* Documentation: https://doc.cc.in2p3.fr/fr/Computing/computing-introduction.html
]

---
.left-column[
### - Overview
### - Architecture
### - Batch queue system
### - Dante
### - Job submission HowTo
### - Best practice
### - External resources
]
.right-column[
# IPGP DANTE
* Open to APC members; request access: http://webpublix.ipgp.fr/rech/scp/Acces.php
]

---
name: inverse
class: center, inverse, middle

# Questions

All the information and documentation:
https://si-apc.pages.in2p3.fr/dante-cluster/

Support: https://supportapc.in2p3.fr