PyTorch and MONAI
This is a simple guide to using PyTorch via the provided Apptainer container on JADE. This will cover basic script setup with SLURM and invoking a PyTorch script. A more involved example is given using the MONAI biomedical deep learning framework.
The following instructions use the container /apps/common/containers/apptainer/AMD/PyTorch/2.3.0/PyTorch-2.3.0-AMD.sif
and demonstrate the data and scratch directories. The version of PyTorch in this container is specific to the AMD hardware on
JADE and so shouldn’t be updated. Other libraries can, however, be installed at runtime to customise the environment.
This expects that you’re familiar with:
- How to log into JADE
- SLURM and batch scripts
- Python and PyTorch deep learning
- bash usage
PyTorch
As demonstrated elsewhere, PyTorch can be invoked in an interactive job using the container and a Python script. First,
write the following into your home directory as the file pytorch_check.py:
import torch
print("Version:", torch.__version__)
print("Devices:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print("Device", i, ":", torch.cuda.get_device_properties(i))
This can then be run through an interactive job:
$ CONTAINER=/apps/common/containers/apptainer/AMD/PyTorch/2.3.0/PyTorch-2.3.0-AMD.sif
$ srun -p short --gres=gpu:2 --pty apptainer run --rocm $CONTAINER python pytorch_check.py
srun: GPU gres requested, checking settings/requirements...
Version: 2.3.0a0+gitd2f9472
Devices: 2
Device 0 : _CudaDeviceProperties(name='AMD Instinct MI300X', major=9, minor=4, gcnArchName='gfx942:sramecc+:xnack-', total_memory=196592MB, multi_processor_count=304)
Device 1 : _CudaDeviceProperties(name='AMD Instinct MI300X', major=9, minor=4, gcnArchName='gfx942:sramecc+:xnack-', total_memory=196592MB, multi_processor_count=304)
Batch scripts are the correct way to submit long-running jobs to JADE. The equivalent script, saved to
pytorch_check.sh, performs the same operation using the same Apptainer image:
#!/bin/bash
# set partition
#SBATCH --partition=short
# set the number of nodes
#SBATCH --nodes=1
# set number of CPUs
#SBATCH --cpus-per-task=8
# set max wallclock time
#SBATCH --time=00:05:00
# set name of job
#SBATCH --job-name=pytorchtest
# set number of GPUs
#SBATCH --gres=gpu:2
# container to run commands in
CONTAINER=/apps/common/containers/apptainer/AMD/PyTorch/2.3.0/PyTorch-2.3.0-AMD.sif
# runs the script pytorch_check.py in the container
apptainer run --rocm "$CONTAINER" python pytorch_check.py
Once submitted with sbatch pytorch_check.sh, a log file will eventually appear with the same output as above.
Apptainer by default makes the user’s home directory available in the running container but other directories need
to be bound to locations in the container on the command line. These would normally include the data and scratch
directories, which will be discussed in the next section.
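The binding described above can be sketched in Python as a helper that assembles the apptainer command line. This is only an illustration under stated assumptions: build_apptainer_command is a hypothetical name, and the paths in the demo are placeholders. The --rocm and --bind flags are the ones used in the job scripts in this guide.

```python
import shlex


def build_apptainer_command(container, command, binds=None, rocm=True):
    """Assemble an `apptainer run` argument list (hypothetical helper).

    `binds` maps host directories to the paths they should appear at
    inside the container. The home directory needs no entry because
    Apptainer makes it available by default.
    """
    args = ["apptainer", "run"]
    if rocm:
        args.append("--rocm")
    for host_path, container_path in (binds or {}).items():
        # Each pair becomes one `--bind host:container` argument.
        args += ["--bind", f"{host_path}:{container_path}"]
    args.append(container)
    args += list(command)
    return args


if __name__ == "__main__":
    # Placeholder paths, for illustration only.
    demo = build_apptainer_command(
        "image.sif",
        ["python", "pytorch_check.py"],
        binds={"/host/data": "/data", "/host/scratch": "/scratch"},
    )
    print(shlex.join(demo))
```

The returned list could be passed to subprocess.run, although in a batch script writing the command directly, as in the examples here, is usually simpler.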
MONAI
The Medical Open Network for Artificial Intelligence (MONAI, https://monai.io) is a PyTorch-based framework for medical imaging developed in collaboration with King’s College London, Nvidia, and many other consortium partners.
This tutorial will go through downloading the test dataset, using the data and scratch directories effectively, and
training a simple image classifier (the script below uses a DenseNet-121 network). The data directory is given by the
$DATA environment variable set in your login session and corresponds to the location /data/$PROJECT/$USER. The /scratch
partition is a fast file system used for loading data efficiently, but it is only available within running jobs, so the
script for a batch job must move data there before it’s used. These concepts will be demonstrated here with job scripts using MONAI.
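As a rough illustration of the distinction above, a script can inspect its environment to see where the data directory is and whether it is running inside a job (and so whether scratch is usable). This is a minimal sketch: describe_storage is a hypothetical helper, and it assumes the DATA variable is visible to the script. SLURM_JOB_ID and its older alias SLURM_JOBID are set by SLURM inside running jobs.

```python
import os
from pathlib import Path


def describe_storage(env=None):
    """Report the data directory and whether a SLURM job is active."""
    env = os.environ if env is None else env
    data_dir = env.get("DATA")
    # SLURM sets SLURM_JOB_ID (and the alias SLURM_JOBID) only inside jobs.
    job_id = env.get("SLURM_JOB_ID") or env.get("SLURM_JOBID")
    return {
        "data_dir": Path(data_dir) if data_dir else None,
        "in_job": job_id is not None,
        "job_id": job_id,
    }


if __name__ == "__main__":
    print(describe_storage())
```

On a login node the "in_job" entry would be False, which is a reminder that anything relying on /scratch belongs in the batch script rather than in an interactive login session.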
To use MONAI in the PyTorch container, it must first be installed along with its dependencies. This is done with pip before
anything is run, which installs packages into the ~/.local/lib/python3.10/site-packages directory in your home
directory, because Apptainer makes your home directory available in the container by default (unlike Docker).
It’s important to be aware that these installed components are not ephemeral: they persist after the job finishes and
so change your environment.
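To see what has accumulated in that persistent location, the standard library can list the packages found in the user site directory. The following is a minimal sketch, with list_user_site_packages as a hypothetical helper name; it relies only on site.getusersitepackages() and importlib.metadata.distributions().

```python
import site
from importlib import metadata


def list_user_site_packages(user_site=None):
    """Return sorted (name, version) pairs for packages in the user site."""
    user_site = site.getusersitepackages() if user_site is None else user_site
    found = set()
    # Restrict the search to the user site instead of all of sys.path.
    for dist in metadata.distributions(path=[str(user_site)]):
        name = dist.metadata["Name"]
        if name:
            found.add((name, dist.version))
    return sorted(found)


if __name__ == "__main__":
    print("User site:", site.getusersitepackages())
    for name, version in list_user_site_packages():
        print(f"{name}=={version}")
```

Run inside the container, this should show whatever pip has placed under ~/.local, which is useful when deciding what to uninstall to return to the container's original environment.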
The previous example batch script simply ran python within the container with the script file in your home directory.
To install MONAI and then use it, we would need to run a bash script within the container. Instead,
the “here document” feature of bash can be used in the submission script to pass these commands directly. In the following,
the commands between the two _EOF_ markers are run within the container, which saves having to create another file.
Because the _EOF_ delimiter is unquoted, any $VARIABLE in these commands is expanded by the submission script’s shell
before the container receives them:
#!/bin/bash
#SBATCH --partition=short
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --time=00:05:00
#SBATCH --job-name=monaitest
#SBATCH --gres=gpu:1
CONTAINER=/apps/common/containers/apptainer/AMD/PyTorch/2.3.0/PyTorch-2.3.0-AMD.sif
# runs commands in the container until _EOF_ is reached, put your actual script commands here
apptainer run --rocm "$CONTAINER" << _EOF_
# installs locally into ~/.local since the running container is read-only
pip install monai[ignite,nibabel,scipy]==1.4.0
# run your script here instead of this example
python -m monai.config.deviceconfig
_EOF_
Submitting this should produce an output log with MONAI’s configuration information. The key step here is
installing MONAI with a selection of dependencies using monai[ignite,nibabel,scipy] while avoiding installing new
versions of PyTorch or replacing anything that depends on it. This PyTorch is specifically compiled for the system’s
hardware so shouldn’t be replaced, but other packages compatible with PyTorch 2.3.0 can be installed.
The version of MONAI is also pinned here at 1.4.0 since later versions will drop support for PyTorch 2.3.0. Later
versions of MONAI can still be installed alongside this unsupported PyTorch version, but make sure pip doesn’t replace the
pre-compiled version in the container by including torch==2.3.0a0+gitd2f9472 in the command.
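One way to confirm that pip has left the container's PyTorch alone is to compare the installed version against the expected one after installation. This is a sketch only: check_pin is a hypothetical helper, and it assumes the package metadata reports the same version string that torch.__version__ printed earlier. If the two differ in your container, compare against torch.__version__ directly instead.

```python
from importlib import metadata


def check_pin(package, expected_version):
    """Return (matches, installed_version) for an installed package.

    installed_version is None when the package is not installed.
    """
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return (False, None)
    return (installed == expected_version, installed)


if __name__ == "__main__":
    # Version string taken from the example output earlier in this guide.
    ok, installed = check_pin("torch", "2.3.0a0+gitd2f9472")
    print("torch unchanged:", ok, "(installed:", installed, ")")
```

Running this after the pip install step in the job script gives an early warning if a dependency has pulled in a different PyTorch build.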
Training Script
Next, download the MedNIST dataset to your data directory with the following:
wget https://github.com/Project-MONAI/MONAI-extra-test-data/releases/download/0.8.1/MedNIST.tar.gz -O $DATA/MedNIST.tar.gz
This will place the file in your $DATA directory provided to you by the JADE environment. Your home directory quota
is very small so this storage space is necessary for large datasets. The file system is however not very fast so the
tarball will be left compressed for now and will instead be unpacked in the running job into the scratch location.
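In the script below, MONAI's MedNISTDataset handles the unpacking itself. For datasets you manage by hand, the same idea can be sketched with the standard tarfile module. The helper name unpack_to_scratch and its arguments are hypothetical, and the archive is assumed to be a trusted .tar.gz file.

```python
import tarfile
from pathlib import Path


def unpack_to_scratch(archive, scratch_dir):
    """Extract a .tar.gz archive into scratch_dir and list its top level."""
    scratch_dir = Path(scratch_dir)
    scratch_dir.mkdir(parents=True, exist_ok=True)
    # Use the safer "data" extraction filter where this Python provides it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(scratch_dir, **kwargs)
    return sorted(p.name for p in scratch_dir.iterdir())
```

Extracting straight from the slow file system into scratch means the many small image files are only ever read from the fast location.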
The script below is derived from https://github.com/Project-MONAI/tutorials/blob/main/2d_classification/monai_101.ipynb
so refer to that tutorial for details of what is being done here. This should be saved to the file monai_101.py:
import logging
import numpy as np
import os
from pathlib import Path
import sys
import tempfile
import torch
from monai.apps import MedNISTDataset
from monai.config import print_config
from monai.data import DataLoader
from monai.engines import SupervisedTrainer
from monai.handlers import StatsHandler
from monai.inferers import SimpleInferer
from monai.networks import eval_mode
from monai.networks.nets import densenet121
from monai.transforms import LoadImageD, EnsureChannelFirstD, ScaleIntensityD, Compose
print_config()
directory = os.environ.get("MONAI_DATA_DIRECTORY")
if directory is not None:
    os.makedirs(directory, exist_ok=True)
root_dir = tempfile.mkdtemp() if directory is None else directory
print(root_dir)
transform = Compose(
    [
        LoadImageD(keys="image", image_only=True),
        EnsureChannelFirstD(keys="image"),
        ScaleIntensityD(keys="image"),
    ]
)
dataset = MedNISTDataset(
    root_dir=root_dir, transform=transform, section="training", download=True, progress=False
)
max_epochs = 5
model = densenet121(spatial_dims=2, in_channels=1, out_channels=6).to("cuda:0")
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
trainer = SupervisedTrainer(
    device=torch.device("cuda:0"),
    max_epochs=max_epochs,
    train_data_loader=DataLoader(dataset, batch_size=512, shuffle=True, num_workers=4),
    network=model,
    optimizer=torch.optim.Adam(model.parameters(), lr=1e-5),
    loss_function=torch.nn.CrossEntropyLoss(),
    inferer=SimpleInferer(),
    train_handlers=StatsHandler(),
)
trainer.run()
dataset_dir = Path(root_dir, "MedNIST")
class_names = sorted(f"{x.name}" for x in dataset_dir.iterdir() if x.is_dir())
testdata = MedNISTDataset(
    root_dir=root_dir, transform=transform, section="test", download=False, progress=False
)
max_items_to_print = 10
with eval_mode(model):
    for item in DataLoader(testdata, batch_size=1, num_workers=0):
        prob = np.array(model(item["image"].to("cuda:0")).detach().to("cpu"))[0]
        pred = class_names[prob.argmax()]
        gt = item["class_name"][0]
        print(f"Class prediction is {pred}. Ground-truth: {gt}")
        max_items_to_print -= 1
        if max_items_to_print == 0:
            break
We can now create monai_101.sh and submit it:
#!/bin/bash
#SBATCH --partition=short
#SBATCH --nodes=1
#SBATCH --cpus-per-task=8
#SBATCH --time=01:00:00
#SBATCH --job-name=monaitest
#SBATCH --gres=gpu:1
# don't exit on error to ensure cleanup is done
set +e
CONTAINER=/apps/common/containers/apptainer/AMD/PyTorch/2.3.0/PyTorch-2.3.0-AMD.sif
# location of a scratch directory to use for temporary data
SCRATCH=/scratch/slurm-$SLURM_JOBID
# make a scratch temporary location
mkdir -p "$SCRATCH"
apptainer run --rocm --bind "$DATA:/data" --bind "$SCRATCH:/scratch" "$CONTAINER" << _EOF_
pip install monai[ignite,tqdm,pillow]==1.4.0
# copy pre-downloaded data to the fast scratch location bound in the container to /scratch
cp /data/MedNIST.tar.gz /scratch
# set location used in monai_101.py for finding data
export MONAI_DATA_DIRECTORY=/scratch
python monai_101.py
_EOF_
# delete temporary data
rm -rf "$SCRATCH"
This job installs MONAI and its dependencies if they are not already present, then runs the Python script within the container.
The above script can be used as a template for other PyTorch training jobs by changing the commands between the _EOF_ markers
to run other things in the container with access to both the data and scratch locations. For MONAI or other Python training
runs, replacing monai_101.py as the executed script file may be sufficient for simple setups.
Note that your $DATA directory is mounted (or bound) within the running container at /data; this is done with the
--bind "$DATA:/data" argument in the apptainer command line. Similarly, the scratch directory that was created
for the job is bound to /scratch. The typical practice is to load data from /data when needed but to rely on the
scratch location for fast disk access for intermediate results or caching. Here the tarball is copied from /data
into /scratch with the expectation that the script will unpack it there and load files much faster this way. As your
code and workflow will differ from what’s illustrated here, it’s up to you to determine how to handle data and where
to place it for performance.
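If you prefer to handle staging from Python rather than in the job script, the copy-then-clean-up pattern used above can be sketched as a context manager. This is an illustration only: staged_scratch is a hypothetical helper, and the paths you pass would depend on how /data and /scratch are bound in your own job.

```python
import shutil
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def staged_scratch(source_file, scratch_dir):
    """Copy source_file into scratch_dir, then remove scratch_dir on exit."""
    scratch_dir = Path(scratch_dir)
    scratch_dir.mkdir(parents=True, exist_ok=True)
    try:
        staged = Path(shutil.copy2(source_file, scratch_dir))
        yield staged
    finally:
        # Runs even if the training code raises, like the final rm -rf
        # in the job script.
        shutil.rmtree(scratch_dir, ignore_errors=True)
```

A training script could then wrap its work in a block such as `with staged_scratch("/data/MedNIST.tar.gz", "/scratch/work") as archive:`, where both paths are placeholders. Note that this removes the whole scratch directory on exit, so any results worth keeping must be copied back to the data directory before the block ends.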