.. _pytorch:

PyTorch and MONAI
=================

This is a simple guide to using PyTorch via the provided Apptainer container on JADE. This will cover basic script 
setup with SLURM and invoking a PyTorch script. A more involved example is given using the MONAI biomedical deep
learning framework.

The following instructions will use the container ``/apps/common/containers/apptainer/AMD/PyTorch/2.3.0/PyTorch-2.3.0-AMD.sif``
and demonstrate data and scratch directories. The version of PyTorch in this container is specific to the AMD hardware on
JADE and so shouldn't be updated, however other libraries can be installed at runtime to customise the environment.

This expects that you're familiar with:
  * How to log into JADE
  * SLURM and batch scripts
  * Python and PyTorch deep learning
  * bash usage

PyTorch
-------

As demonstrated elsewhere, PyTorch can be invoked in an interactive job using the container and a Python script. First,
write the following into your home directory as the file ``pytorch_check.py``: 

.. code-block:: python

    import torch

    print("Version:", torch.__version__)
    print("Devices:", torch.cuda.device_count())

    for i in range(torch.cuda.device_count()):
        print("Device", i, ":", torch.cuda.get_device_properties(i))
      
This can then be run through an interactive job: 

.. code-block:: bash

    $ CONTAINER=/apps/common/containers/apptainer/AMD/PyTorch/2.3.0/PyTorch-2.3.0-AMD.sif
    $ srun -p short --gres=gpu:2 --pty apptainer run --rocm $CONTAINER python pytorch_check.py
    srun: GPU gres requested, checking settings/requirements...
    Version: 2.3.0a0+gitd2f9472
    Devices: 2
    Device 0 : _CudaDeviceProperties(name='AMD Instinct MI300X', major=9, minor=4, gcnArchName='gfx942:sramecc+:xnack-', total_memory=196592MB, multi_processor_count=304)
    Device 1 : _CudaDeviceProperties(name='AMD Instinct MI300X', major=9, minor=4, gcnArchName='gfx942:sramecc+:xnack-', total_memory=196592MB, multi_processor_count=304)

Submitting batch scripts is the correct method for submitting long running jobs to JADE. The equivalent script, saved to
``pytorch_check.sh`` will perform the same operation using the same Apptainer image:

.. code-block:: bash

    #!/bin/bash

    # set partition
    #SBATCH --partition=short

    # set the number of nodes
    #SBATCH --nodes=1

    # set number of CPUs
    #SBATCH --cpus-per-task=8

    # set max wallclock time
    #SBATCH --time=00:05:00

    # set name of job
    #SBATCH --job-name=pytorchtest

    # set number of GPUs
    #SBATCH --gres=gpu:2

    # container to run commands in
    CONTAINER=/apps/common/containers/apptainer/AMD/PyTorch/2.3.0/PyTorch-2.3.0-AMD.sif

    # runs the script pytorch_check.py in the container
    apptainer run --rocm "$CONTAINER" python pytorch_check.py

Once submitted with ``sbatch pytorch_check.sh``, a log file will eventually appear with the same output as above. 
Apptainer by default makes the user's home directory available in the running container but other directories need
to be bound to locations in the container on the command line. These would normally include the data and scratch
directories, which will be discussed in the next section.


MONAI
-----

The Medical Open Network for Artificial Intelligence (MONAI, https://monai.io) is a PyTorch-based framework for medical 
imaging developed in collaboration with King's College London, Nvidia, and many other consortium partners. 

This tutorial will go through downloading the test dataset, using the DATA and scratch directories effectively, and 
training a simple MLP classifier. The DATA directory is pointed to by an environment variable present in your login
session but corresponds to the location ``/data/$PROJECT/$USER``. The ``/scratch`` partition is a fast file system used for
loading data efficiently but is only available within running jobs, so the script for a batch job must move data there
before it's used. These concepts will be demonstrated here with job scripts using MONAI.

To use MONAI in the PyTorch container, it first must be installed along with dependencies. This is done with ``pip`` before
anything is run, which has the effect of installing packages into your home ``~/.local/lib/python3.10/site-packages``
directory as a result of how Apptainer makes your home directory available by default (unlike Docker).
It's important to be aware that this places some installed components in a location that isn't ephemeral and so changes 
your environment.

The previous example batch script simply ran ``python`` within the container with the script file in your home directory.
To install MONAI and then use it we would need to run a bash script within the container which does this, but instead
the "here document" feature of bash can be used in the submission script to run these commands. In the following, any
commands between ``_EOF_`` will be run within the container and so saves having to create another file:

.. code-block:: bash

    #!/bin/bash

    #SBATCH --partition=short
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=8
    #SBATCH --time=00:05:00
    #SBATCH --job-name=monaitest
    #SBATCH --gres=gpu:1

    CONTAINER=/apps/common/containers/apptainer/AMD/PyTorch/2.3.0/PyTorch-2.3.0-AMD.sif

    # runs commands in the container until _EOF_ is reached, put your actual script commands here
    apptainer run --rocm "$CONTAINER" << _EOF_
      # installs locally into ~/.local since the running container is read-only
      pip install monai[ignite,nibabel,scipy]==1.4.0
      # run your script here instead of this example
      python -m monai.config.deviceconfig
    _EOF_

Submitting this should produce a output log with MONAI's configuration information. The key thing being done here is
installing MONAI with a selection of dependencies with ``monai[ignite,nibabel,scipy]``, but avoiding installing new 
versions of PyTorch or replacing anything depending on it. This PyTorch is specifically compiled for the system's
hardward so shouldn't be replaced, but other things compatible with PyTorch 2.3.0 can be installed. 

The version of MONAI is also pinned here at 1.4.0 since later versions will drop support for PyTorch 2.3.0. Later
versions of MONAI can be installed with this unsupported version, however ensure ``pip`` doesn't replace the 
pre-compiled version in the container by including ``torch==2.3.0a0+gitd2f9472`` with the command.

Training Script
---------------

Next, download the MedNIST dataset to your data directory with the following:

.. code-block:: bash

    wget https://github.com/Project-MONAI/MONAI-extra-test-data/releases/download/0.8.1/MedNIST.tar.gz -O $DATA/MedNIST.tar.gz

This will place the file in your ``$DATA`` directory provided to you by the JADE environment. Your home directory quota 
is very small so this storage space is necessary for large datasets. The file system is however not very fast so the 
tarball will be left compressed for now and will instead be unpacked in the  running job into the scratch location.

The script below is derived from https://github.com/Project-MONAI/tutorials/blob/main/2d_classification/monai_101.ipynb
so refer to that tutorial on what exactly is being done here. This should be saved to the file ``monai_101.py``:

.. code-block:: python

    import logging
    import numpy as np
    import os
    from pathlib import Path
    import sys
    import tempfile
    import torch

    from monai.apps import MedNISTDataset
    from monai.config import print_config
    from monai.data import DataLoader
    from monai.engines import SupervisedTrainer
    from monai.handlers import StatsHandler
    from monai.inferers import SimpleInferer
    from monai.networks import eval_mode
    from monai.networks.nets import densenet121
    from monai.transforms import LoadImageD, EnsureChannelFirstD, ScaleIntensityD, Compose

    print_config()

    directory = os.environ.get("MONAI_DATA_DIRECTORY")
    if directory is not None:
        os.makedirs(directory, exist_ok=True)
    root_dir = tempfile.mkdtemp() if directory is None else directory
    print(root_dir)

    transform = Compose(
        [
            LoadImageD(keys="image", image_only=True),
            EnsureChannelFirstD(keys="image"),
            ScaleIntensityD(keys="image"),
        ]
    )

    dataset = MedNISTDataset(
        root_dir=root_dir, transform=transform, section="training", download=True, progress=False
    )

    max_epochs = 5
    model = densenet121(spatial_dims=2, in_channels=1, out_channels=6).to("cuda:0")

    logging.basicConfig(stream=sys.stdout, level=logging.INFO)
    trainer = SupervisedTrainer(
        device=torch.device("cuda:0"),
        max_epochs=max_epochs,
        train_data_loader=DataLoader(dataset, batch_size=512, shuffle=True, num_workers=4),
        network=model,
        optimizer=torch.optim.Adam(model.parameters(), lr=1e-5),
        loss_function=torch.nn.CrossEntropyLoss(),
        inferer=SimpleInferer(),
        train_handlers=StatsHandler(),
    )

    trainer.run()

    dataset_dir = Path(root_dir, "MedNIST")
    class_names = sorted(f"{x.name}" for x in dataset_dir.iterdir() if x.is_dir())
    testdata = MedNISTDataset(
        root_dir=root_dir, transform=transform, section="test", download=False, progress=False
    )

    max_items_to_print = 10
    with eval_mode(model):
        for item in DataLoader(testdata, batch_size=1, num_workers=0):
            prob = np.array(model(item["image"].to("cuda:0")).detach().to("cpu"))[0]
            pred = class_names[prob.argmax()]
            gt = item["class_name"][0]
            print(f"Class prediction is {pred}. Ground-truth: {gt}")
            max_items_to_print -= 1
            if max_items_to_print == 0:
                break

We can now create ``monai_101.sh`` and submit it:

.. code-block:: bash

    #!/bin/bash

    #SBATCH --partition=short
    #SBATCH --nodes=1
    #SBATCH --cpus-per-task=8
    #SBATCH --time=01:00:00
    #SBATCH --job-name=monaitest
    #SBATCH --gres=gpu:1

    # don't exit on error to ensure cleanup is done
    set +e

    CONTAINER=/apps/common/containers/apptainer/AMD/PyTorch/2.3.0/PyTorch-2.3.0-AMD.sif
    # location of a scratch directory to use for temporary data
    SCRATCH=/scratch/slurm-$SLURM_JOBID
    # make a scratch temporary location
    mkdir -p "$SCRATCH"

    apptainer run --rocm --bind "$DATA:/data" --bind "$SCRATCH:/scratch" "$CONTAINER" << _EOF_
      pip install monai[ignite,tqdm,pillow]==1.4.0

      # copy pre-downloaded data to the fast scratch location bound in the container to /scratch
      cp /data/MedNIST.tar.gz /scratch
      # set location used in monai_101.py for finding data
      export MONAI_DATA_DIRECTORY=/scratch

      python monai_101.py
    _EOF_

    # delete temporary data
    rm -rf $SCRATCH


This job will install MONAI if not present along with dependencies and then run the Python script within the container.
The above script can be used as a template for other PyTorch training jobs by changing the contents between ``_EOF_`` 
to run other things in the container with access to both the data and scratch locations. For MONAI or other Python training
runs, replacing ``monai_101.py`` as the executed script file may be sufficient for simple set ups. 

Note that your ``$DATA`` directory is mounted (or bound) within the running container at ``/data``, this is done with the
``--bind "$DATA:/data"`` argument in the ``apptainer`` command line. Similarly, the scratch directory that was created 
for the job is bound to ``/scratch``. The typical practice is to load data from ``/data`` when needed but to rely on the
scratch location for fast disk access for intermediate results or caching. Here the tarball is copied from ``/data``
into ``/scratch`` with the expectation that the script will unpack it there and load files much faster this way. As your
code and workflow will differ from what's illustrated here, it's up to you to determine how to handle data and where
to place it for performance.