Slurm

1. fep login

Log in to the FEP cluster over SSH:

$ ssh <username>@fep.grid.pub.ro

Replace <username> with the ID you use to log in to Moodle (something like mihai.dumitru2201).

Most likely, the login process will hang and you will see a message like:

Please login at https://login.upb.ro/auth/realms/UPB/device. Then input code YXAH-KGRL

Follow the instructions. This can get quite annoying, so you should set up public key authentication instead; see the instructions here for how to generate a key pair and how to add the public key to the remote server.
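If you don't have those instructions at hand, a minimal sketch looks like this (the key file name is just an example; adapt it to your liking):

# On your local machine: generate a key pair (press Enter to accept the defaults)
$ ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519_fep

# Copy the public key to the cluster (you will have to authenticate one last time)
$ ssh-copy-id -i ~/.ssh/id_ed25519_fep.pub <username>@fep.grid.pub.ro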

You can add the following to your ~/.ssh/config file to make the process easier (from then on, login using ssh fep):

Host fep
    Hostname fep.grid.pub.ro
    User <username>
    ServerAliveInterval 10
    IdentityFile ~/.ssh/<private_key_used_on_fep>
    IdentitiesOnly yes

2. Submitting a job

Create and run the following info.sh script, to get some information about the machine you are running on:

#!/bin/bash

# Print the hostname
printf "Hostname: %s\n\n" "$(hostname)"

# Print the number of installed processors
printf "Number of processors: %s\n\n" "$(nproc --all)"

# Print the amount of memory available
printf "Total memory: %s\n\n" "$(free -h | grep Mem | awk '{print $2}')"

# Print the amount of disk space available
printf "Total disk space (for /): %s\n\n" "$(df -h / | grep / | awk '{print $2}')"

# Print information about the GPUs
if command -v nvidia-smi &> /dev/null
then
    printf "GPUs available:\n"
    nvidia-smi --query-gpu=gpu_name --format=csv,noheader
else
    printf "No GPUs available\n"
fi
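
Make the script executable, then run it directly on the login node:

$ chmod +x info.sh
$ ./info.sh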

You are seeing information about the front-end processor (FEP) itself, which is not meant for running workloads, but rather serves as a gateway to the cluster.

Now let's run it on a compute node:

$ srun --partition=ucsx --gres gpu:2 ./info.sh

You should now see the details of the compute node you are running on.

How much GPU memory is available in total?
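
As a hint, nvidia-smi can report the memory of each GPU directly; one way to check (assuming the same partition and GPU count as above):

$ srun --partition=ucsx --gres gpu:2 nvidia-smi --query-gpu=name,memory.total --format=csv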

3. Getting more information about the cluster

To see a list of all the partitions available on the cluster, run:

$ sinfo -o "%10P %30N %10c %10m %20G "

Alternatively, you can query only a specific partition:

$ sinfo -p <partition_name> -o "%30N %10c %10m %20G "

Read more about the sinfo command here (or by running man sinfo!).

Find out how many CPUs are idle, in total, over all nodes in all partitions.
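
As a starting point, sinfo's %C format specifier prints CPU counts per partition in the allocated/idle/other/total form:

$ sinfo -o "%10P %C"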

4. Running jobs

You can simply launch an interactive shell on a compute node by running:

$ srun --partition=ucsx --gres gpu:2 --pty bash

Alternatively, you can replace --pty bash with the path to a script you want to run. However, srun is a blocking command, so you will have to wait for the job to be accepted and then completed before you regain control. Because this is a shared environment, the resources you request might not be available at the moment.

So you might want to use the sbatch command instead, to submit a job to the queue; when resources become available, the job is automatically taken from the queue and executed:

$ sbatch --partition=ucsx --gres gpu:2 --time=1:00:00 --wrap="bash info.sh"
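
Instead of passing everything on the command line, you can also put the options in a batch script with #SBATCH directives and submit that file; a minimal sketch (the file name below is just an example):

#!/bin/bash
#SBATCH --partition=ucsx
#SBATCH --gres=gpu:2
#SBATCH --time=1:00:00
#SBATCH --job-name=info

bash info.sh

Save it as, say, info.sbatch and submit it with sbatch info.sbatch.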

Read more about srun here and about sbatch here.

What does --gres gpu:2 mean?

Each job is identified by a specific "job ID", which is printed to stdout after submission. At any point, you can check the status of your job by running:

$ squeue -j <job-id> -o "%10i %10P %15j %15u %10T %M"

This shows its ID, the partition it is running on, the job name, the user who submitted it, the state of the job, and the amount of time it has been running for (if it is running).

Read more about the squeue command here (or by running man squeue!).

In particular, running it without the -j <job-id> argument will show you all the jobs running on the cluster.

For each job submitted with sbatch, a file named slurm-<job-id>.out is created in the directory the job was submitted from; this is where you will find the job's output.

Submit a job to the queue that takes more than 30 seconds to run and then prints a "hello" message; cancel it before it finishes.
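
To cancel a job, Slurm provides the scancel command, which takes the job ID:

$ scancel <job-id>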

5. Running a chat LLM

We will now load a model on the GPU and run a chatbot using the transformers library.

Please install Miniconda on fep by following the tutorial here.

First, we need some dependencies. Because this is a shared environment (so we can't install packages globally) and because we'll be using cutting-edge libraries that change quickly, we will create virtual environments, in which we can install whatever we want. If you later move on to another application that needs different, conflicting versions of libraries, you can simply create another environment and install its packages there.

For managing the virtual environments, we will use conda.

Here are the basics of creating a new environment, activating it, and installing packages:

# Create a new environment
$ conda create -n llmchat python=3.11

# Activate the environment
# (You should see the name of the environment in parentheses before the prompt)
$ conda activate llmchat

# Install the necessary packages for our LLM chat
$ pip install torch transformers accelerate
$ conda install -c conda-forge tmux

# Deactivate the environment (when done with it)
$ conda deactivate

Read more about conda here.

Now let's create a script, chat.py, using Hugging Face's transformers library, which takes care of downloading, verifying, and loading the model on the GPU, etc. Don't worry about the details for now; the script simply creates an endless prompt-reply interactive session between you and the model.

#!/usr/bin/env python3
import transformers
import torch

# The model is downloaded from the Hugging Face Hub on first use
model_id = "meta-llama/Llama-3.2-1B-Instruct"

# Build a text-generation pipeline; device_map="auto" places the model on the
# available GPU(s) and bfloat16 keeps the memory footprint small
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

if __name__ == '__main__':
    while True:
        user = input("Prompt: ")
        messages = [
            {"role": "user", "content": user},
        ]

        outputs = pipeline(
            messages,
            max_new_tokens=2048,
        )

        print(outputs[0]["generated_text"][-1]["content"])

Run the script on a ucsx node and play around with it.

There are multiple ways to do this. You can either launch it directly (after making it executable with chmod +x chat.py):

$ srun --partition=ucsx --gres gpu:2 ./chat.py 

Or launch an interactive shell and run it from there:

$ srun --partition=ucsx --gres gpu:2 --pty bash
$ ./chat.py

When you get to the compute node, you need to manually activate the environment created previously. This mode also allows you to run tmux, so that you can safely detach while keeping your session alive, and you can run multiple interactive shells at once within the same job. See here for an intro to tmux.

Use tmux to run watch -n0.5 nvidia-smi in one pane, then launch chat.py from another and see how the GPU is used (note that whenever a new pane is opened, you need to activate the environment again).
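
If you have not used tmux before, here is a quick sketch of the basics (key bindings assume the default Ctrl-b prefix):

# Start a new session named "llm" (the name is arbitrary)
$ tmux new -s llm

# Inside tmux:
#   Ctrl-b %   split the current window into two panes
#   Ctrl-b o   jump to the other pane
#   Ctrl-b d   detach, leaving the session running

# Re-attach to the session later
$ tmux attach -t llm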

Write a non-interactive script that takes two arguments:

  • a file containing a "system" message with special instructions for how the LLM should reply
  • a file containing a prompt

The LLM's answer is printed to stdout. Submit this script to the queue.
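
Once your script is ready, submission could look something like this (reply.py, system.txt and prompt.txt are just placeholder names; make sure the llmchat environment is active when you submit, so that the job inherits it):

$ sbatch --partition=ucsx --gres gpu:2 --wrap="python reply.py system.txt prompt.txt"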

IMPORTANT: The model you are using is automatically downloaded to ~/.cache/huggingface/hub/ the first time you run the script (and is loaded from there, without downloading, on subsequent runs). Even though the model is quite small, it still takes up about 2.4 GB, so we ask you to clean up your ~/.cache/huggingface/hub/ directory after you complete the lab.
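
One way to do this is to check the size of the cache and then remove it:

$ du -sh ~/.cache/huggingface/hub/
$ rm -rf ~/.cache/huggingface/hub/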