Slurm
1. fep login
Log in to the FEP cluster over ssh:
$ ssh <username>@fep.grid.pub.ro
Replace <username> with the ID you use to log in to Moodle (something like mihai.dumitru2201).
Most likely, the login process will hang and you will see a message like:
Please login at https://login.upb.ro/auth/realms/UPB/device. Then input code YXAH-KGRL
Follow the instructions. This can get quite annoying, so you should enable public key authentication. Follow the instructions here to see how to generate a key pair and how to add the public key to the remote server.
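For reference, a minimal key setup (assuming OpenSSH on your local machine; the key file name is just an illustration) looks like this:
# On your local machine: generate a key pair
$ ssh-keygen -t ed25519 -f ~/.ssh/fep_key
# Copy the public key to the remote server (you will be asked to authenticate one last time)
$ ssh-copy-id -i ~/.ssh/fep_key.pub <username>@fep.grid.pub.ro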
You can add the following to your ~/.ssh/config file to make the process easier (from then on, log in using ssh fep):
Host fep
Hostname fep.grid.pub.ro
User <username>
ServerAliveInterval 10
IdentityFile ~/.ssh/<private_key_used_on_fep>
IdentitiesOnly yes
2. Submitting a job
Create and run the following info.sh script to get some information about the machine you are running on:
#!/bin/bash
# Print the hostname
printf "Hostname: %s\n\n" $(hostname)
# Print the number of installed processors
printf "Number of processors: %s\n\n" $(nproc --all)
# Print the amount of memory available
printf "Total memory: %s\n\n" $(free -h | grep Mem | awk '{print $2}')
# Print the amount of disk space available
printf "Total disk space (for /): %s\n\n" $(df -h / | grep / | awk '{print $2}')
# Print information about the GPUs
if command -v nvidia-smi &> /dev/null
then
    printf "GPUs available:\n"
    nvidia-smi --query-gpu=gpu_name --format=csv,noheader
else
    printf "No GPUs available\n"
fi
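Assuming you saved it as info.sh, make it executable and run it on the front-end first:
$ chmod +x info.sh
$ ./info.sh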
You are seeing information about the front-end processor itself, which is not designed for workloads, but rather serves as a gateway to the cluster.
Now let's run this on a compute node; run:
srun --partition=ucsx --gres gpu:2 info.sh
You should now see the details of the compute node you are running on.
How much GPU memory is available in total?
3. Getting more information about the cluster
To see a list of all the partitions available on the cluster, run:
$ sinfo -o "%10P %30N %10c %10m %20G "
Alternatively, you can inquire solely about a specific partition:
$ sinfo -p <partition_name> -o "%30N %10c %10m %20G "
Read more about the sinfo command here (or by running man sinfo!).
Find out how many CPUs are idle, in total, over all nodes in all partitions.
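As a hint, sinfo also has a %C format specifier, which prints CPU counts as allocated/idle/other/total for each output line. A rough sketch of one possible approach (note that nodes belonging to several partitions will be counted once per partition):
$ sinfo -h -o "%C" | awk -F/ '{idle += $2} END {print idle}'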
4. Running jobs
You can simply launch an interactive shell on a compute node by running:
srun --partition=ucsx --gres gpu:2 --pty bash
Alternatively, you can replace --pty bash with the path to a script you want to run.
However, srun is a blocking command, so you will have to wait for the job to be accepted, then completed, before you regain control.
Because this is a shared environment, it might happen that the resources you
desire are not available at the moment.
So you might want to use the sbatch command to submit a job to the queue; when resources become available, the job will automatically be taken from the queue and executed:
sbatch --partition=ucsx --gres gpu:2 --time=1:00:00 --wrap="bash info.sh"
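Instead of --wrap, you can also put the resource requests in a small batch script using #SBATCH directives and submit that with sbatch. A sketch (the file name and requested resources are just an example):
#!/bin/bash
#SBATCH --partition=ucsx
#SBATCH --gres=gpu:2
#SBATCH --time=1:00:00
#SBATCH --job-name=info

# Everything below runs on the allocated compute node
bash info.sh
Save it as, say, info.slurm and submit it with sbatch info.slurm.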
Read more about srun here and about sbatch here.
What does --gres gpu:2 mean?
Each job is identified by a specific "job ID", which is printed to stdout after submission. At any point, you can check the status of your job by running:
squeue -j <job-id> -o "%10i %10P %15j %15u %10T %M"
This shows its ID, the partition it is running on, the job name, the user who submitted it, the state of the job, and the amount of time it has been running for (if it is running).
Read more about the squeue command here or from its man page (man squeue).
In particular, running it without the -j <job-id> argument will show you all the jobs running on the cluster.
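To list only your own jobs, you can filter by user:
$ squeue -u <username>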
For each job submitted with sbatch, a file named slurm-<job-id>.out is created in the directory from which the job was submitted; there you will find the output of the job.
Submit a job to the queue that takes more than 30 seconds to run, then prints a "hello" message; cancel it before it finishes.
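Cancelling is done with the scancel command, which takes the job ID printed by sbatch. A sketch of the workflow (the sleep/echo one-liner is just an illustration):
$ sbatch --partition=ucsx --wrap="sleep 60; echo hello"
$ scancel <job-id>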
5. Running a chat LLM
We will now load a model on the GPU and run a chatbot using the transformers library.
Please install Miniconda on fep by following the tutorial here.
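If you cannot access the tutorial, the installation usually boils down to downloading and running the Linux installer from Anaconda's repository (the URL below is the generic Linux x86_64 one; double-check it against the tutorial):
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
# Follow the prompts, then restart your shell (or source ~/.bashrc) so that conda is on your PATH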
First, we need some dependencies. Because this is a shared environment (so we can't install packages globally) and because we'll be using cutting-edge libraries that change rapidly, we will create virtual environments, in which we can install whatever we want. If you later move on to another application that needs different, conflicting versions of some libraries, you can simply create another environment and install them there.
For managing the virtual environments, we will use conda.
Here are the basics of creating a new environment, activating it, and installing packages:
# Create a new environment
$ conda create -n llmchat python=3.11
# Activate the environment
# (You should see the name of the environment in parentheses before the prompt)
$ conda activate llmchat
# Install the necessary packages for our LLM chat
$ pip install torch transformers accelerate
$ conda install tmux
# Deactivate the environment (when done with it)
$ conda deactivate
Read more about conda here.
Now let's create a script, chat.py, using Hugging Face's transformers library, which takes care of downloading the model, verifying it, loading it on the GPU, etc.
Don't worry about the details for now; the script simply creates an endless prompt-reply interactive session between you and the model.
#!/usr/bin/env python3
import transformers
import torch

model_id = "meta-llama/Llama-3.2-1B-Instruct"

# Download the model (on the first run only) and load it on the GPU
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

if __name__ == '__main__':
    # Endless prompt-reply loop; stop it with Ctrl-C
    while True:
        user = input("Prompt: ")
        messages = [
            {"role": "user", "content": user},
        ]
        outputs = pipeline(
            messages,
            max_new_tokens=2048,
        )
        print(outputs[0]["generated_text"][-1]["content"])
Run the script on a ucsx node and play around with it.
There are multiple ways to do this. You can either launch it directly:
$ srun --partition=ucsx --gres gpu:2 ./chat.py
Or launch an interactive shell and run it from there:
$ srun --partition=ucsx --gres gpu:2 --pty bash
$ ./chat.py
When you get to the compute node, you need to manually activate the environment created previously.
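For example, inside the interactive shell on the compute node:
$ conda activate llmchat
$ chmod +x chat.py   # only needed once
$ ./chat.py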
This mode allows you to run tmux, so that you can safely detach while keeping your session alive, and you can run multiple interactive shells at once on the same job.
See here for an intro to tmux.
Use tmux to run watch -n0.5 nvidia-smi in one pane, then launch chat.py from another and see how the GPU is used (note that whenever a new pane is opened, you need to activate the environment again).
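A possible tmux workflow for this exercise (key bindings assume the default Ctrl-b prefix):
$ tmux new -s lab              # start a new session named "lab"
$ conda activate llmchat
$ watch -n0.5 nvidia-smi
# Ctrl-b %   split the window into a second pane
# Ctrl-b o   switch between panes
$ conda activate llmchat       # activate the environment again in the new pane
$ ./chat.py
# Ctrl-b d   detach from the session (everything keeps running)
$ tmux attach -t lab           # re-attach later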
Write a non-interactive script that takes two arguments:
- a file containing a "system" message with special instructions for how the LLM should reply
- a file containing a prompt
The LLM's answer should be printed to stdout. Submit this script to the queue.
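One possible structure for such a script, building on chat.py above (the name answer.py, the use of sys.argv, and the file names are just one choice; the system message is passed as an extra entry with role "system"):
#!/usr/bin/env python3
import sys

import transformers
import torch

model_id = "meta-llama/Llama-3.2-1B-Instruct"

if __name__ == '__main__':
    # Read the system message and the prompt from the files given as arguments
    with open(sys.argv[1]) as f:
        system = f.read()
    with open(sys.argv[2]) as f:
        prompt = f.read()

    # Download (if needed) and load the model on the GPU
    pipeline = transformers.pipeline(
        "text-generation",
        model=model_id,
        model_kwargs={"torch_dtype": torch.bfloat16},
        device_map="auto",
    )

    # A single exchange: system instructions followed by the user prompt
    messages = [
        {"role": "system", "content": system},
        {"role": "user", "content": prompt},
    ]
    outputs = pipeline(messages, max_new_tokens=2048)
    print(outputs[0]["generated_text"][-1]["content"])
With the llmchat environment activated, you can then submit it with something like sbatch --partition=ucsx --gres gpu:2 --wrap="python answer.py system.txt prompt.txt" (file names are illustrative).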
IMPORTANT: The model you are using is automatically downloaded to ~/.cache/huggingface/hub/ the first time you run the script (and is loaded from there, without downloading, on subsequent runs). Even though the model is quite small, it still adds up to 2.4 GB, so we ask you to clean up your ~/.cache/huggingface/hub/ directory after you complete the lab.