This is a short step-by-step account of how to fine-tune Hugging Face's DistilBERT on SQuAD, a closed-domain question answering task, on Azure with a free trial account. Many aspects should be analogous when fine-tuning other transformer models, for other tasks, on other cloud compute services.
EDIT: Massive caveat. These tools and libraries are being changed and updated blisteringly fast. Although, at the time of writing, these steps lead to functioning output, it is unlikely time will be kind.
Context
Hugging Face have kindly integrated numerous BERT-like models behind an (almost) coherent interface. This means you can (almost) exchange different models at will.
One model from members of Hugging Face is DistilBERT. The best thing about DistilBERT is that it isn't as ludicrously huge as other models. And it still performs reasonably.
SQuAD is a dataset and associated benchmark for closed-domain question answering, set by Stanford.
Azure is Microsoft’s cloud platform.
Azure setup
Sign up. No card details necessary unlike competitors. Just an email address. A new account comes with $200 free credit.
Create a VM with GPUs
From the main dashboard: find Machine learning and open the Azure ML lab. Find 'Compute' in the left-hand menu. Here select the 'Training Cluster' tab, and click ' + ' to add a new training cluster.
A form slides in from the right.
- Name: (suggestion) ‘tc-distilbert-fine-tune-squad’
- Size: On the dropdown select GPU, and select the size (suggestion) STANDARD_NC12.
- Leave priority as is.
- Minimum number of nodes: select 1
- Maximum number of nodes: select 1
- In the advanced settings, toggle open port 22 for SSH access.
- As the user name, add (suggestion) 'jo'.
- Add your public SSH key (password authentication is an alternative, so do one or the other). If you need to generate a key pair, see the snippet after this list.
- Click create.
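If you don't already have a key pair, you can generate one on your local machine first. A minimal sketch (the comment string is just a placeholder; use whatever identifies you):
## Generate a 4096-bit RSA key pair, accepting the default path
ssh-keygen -t rsa -b 4096 -C "jo@laptop"
## Print the public half; this is what gets pasted into the Azure form
cat ~/.ssh/id_rsa.pub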
A request is put in to have your resources allocated to you. Meanwhile a row appears in the Training Cluster table, with a loading status. This may take a few minutes to be ready. Click on the row to see the dashboard for the cluster.
There is a tab entitled ‘nodes’. When the node is up, a public IP (123.456.789.0) and a port number (50000) will be listed. Open a terminal and ssh in
ssh -p 50000 jo@123.456.789.0
Troubleshooting:
A VM with GPU access is not available on a trial account via the VM option on the Azure dashboard. Setting up a VM this way is subject to a 4-core limit, and the smallest VM with GPU access has 6 cores.
I repeatedly found Azure failing to provision me a machine, without comment. Trial and error led me to instead set the minimum number of nodes to 0, and edit this to 1 after my request had been provisioned.
If I tried sshing into the machine shortly after the node details were advertised, I was often asked for a password, despite none being set. Waiting a minute made this issue disappear.
At the time of writing, the VM runs Ubuntu 16.04, has Python 3.5, and has two NVIDIA Tesla K80s.
Setting up the VM
In the terminal, sshed into the machine, you can switch from the default shell to bash (equipped with more user-friendly tools like tab completion and history).
bash
Install Python 3.7, pip, PyTorch, transformers, and their dependencies.
(We assume that vim, git, and tmux are already installed.)
Download the SQuAD v1.1 data and put it in the directory ~/squad_data/.
One way to do all of this is to copy and paste the snippet below into a script setup.sh and run it with
bash setup.sh
## SETUP
## Install python3.7
sudo apt install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.7 python3.7-dev
## Get pip
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python3.7 get-pip.py --user
## install torch from pip (might need adapting for hardware spec!)
pip install --user torch
## Install transformers from the local clone (as per their instructions)
git clone https://github.com/huggingface/transformers.git
pip install --user --editable ./transformers
pip install --user -r transformers/examples/requirements.txt
# Get squad data
mkdir squad_data
curl https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json -o squad_data/train-v1.1.json
curl https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json -o squad_data/dev-v1.1.json
# OPTIONAL: tree, GPU status programs.
sudo apt install tree
sudo apt install cmake libncurses5-dev libncursesw5-dev
git clone https://github.com/Syllo/nvtop.git
mkdir -p nvtop/build && cd nvtop/build
cmake .. -DNVML_RETRIEVE_HEADER_ONLINE=True
sudo make
sudo make install
pip install --user gpustat
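Before moving on, it's worth a quick sanity check that PyTorch can see the GPUs. A minimal check, assuming the NVIDIA drivers come preinstalled on the node and that the script above completed cleanly:
## Should list the two Tesla K80s
nvidia-smi
## Should print True and 2
python3.7 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"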
Configuring SQuAD training
There are a lot of flags and options involved when running the SQuAD training scripts. I personally found it much easier to work with a Python script that simply prints the required command than to edit the bash command directly. It feels a bit hacky, but it's simple and works.
To follow this method, add the following content to a file make_cmd.py in the home directory.
import os, subprocess
# Types: 'bert' 'xlnet' 'xlm' 'distilbert' 'albert'
= "distilbert"
MODEL_TYPE = "distilbert-??"
MODEL_NAME
= "python3.7"
PYTHON
= os.getcwd()
PWD = os.path.join(PWD, "squad_data")
SQUAD_DIR = os.path.join(SQUAD_DIR, "train-v1.1.json")
TRAIN_PATH = os.path.join(SQUAD_DIR, "dev-v1.1.json")
DEV_PATH
= os.path.join(PWD, "output")
OUTPUT_DIR
= os.path.join(PWD, "transformers")
TRANSFORMERS = os.path.join(TRANSFORMERS, "examples/run_squad.py")
RUN_SQUAD
= [
cmd
PYTHON, RUN_SQUAD,"--model_type", MODEL_TYPE,
"--model_name_or_path", MODEL_NAME,
"--do_train",
"--do_eval",
# "--do_lower_case",
"--train_file", TRAIN_PATH,
"--predict_file", DEV_PATH,
# "--evaluate_during_training",
"--per_gpu_train_batch_size", "12",
"--learning_rate", "3e-5",
"--num_train_epochs", "2.0",
"--max_seq_length", "384",
"--doc_stride", "128",
"--output_dir", OUTPUT_DIR,
]
print(" ".join(cmd))
Running this with
python3.7 make_cmd.py
will output the command string, which can be copied to the clipboard.
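Alternatively, the printed command can be fed straight back to the shell, skipping the copy-paste. A small sketch, assuming make_cmd.py sits in the current directory:
## Run whatever command make_cmd.py prints
bash -c "$(python3.7 make_cmd.py)"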
Training
Running in tmux gives you the option to detach from the process, so that it won’t be terminated if your connection dies.
The same functionality is available from screen. Neither provides a seamless experience.
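Still, a tmux workflow for this run is only a couple of commands. For example (the session name is just a suggestion):
## Start a named session
tmux new -s squad
## Inside it, run the training command produced by make_cmd.py
bash -c "$(python3.7 make_cmd.py)"
## Detach with Ctrl-b then d; reattach later with
tmux attach -t squad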
As the old adage goes
If your model is running, it still counts as work
It would be smart to automate this bit, but I did not do it often enough, or repetitively enough, to deem that worthwhile. And I got a kick from watching the output and nvtop.
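If you want the same view, a second tmux window (Ctrl-b then c) running either of the optional tools from setup.sh does the job:
nvtop
gpustat -i   ## refreshes a one-line GPU summary on an interval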
Clean up
scp/sftp the trained model from the remote to the local machine. Weirdly, I often found this step the most painful. (Very slow, occasionally fails.)
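A sketch of that copy, reusing the illustrative user, IP, and port from earlier, and assuming the model landed in ~/output on the VM (OUTPUT_DIR in make_cmd.py, run from the home directory); the local directory name is made up:
## Note that scp takes a capital -P for the port
scp -P 50000 -r jo@123.456.789.0:~/output ./distilbert-squad-fine-tuned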
Put it somewhere sensible, with clear instructions on how the model came into existence.
Oh. And shut down the remote machine, so it stops eating the free credit.