This is a short step-by-step account of how to fine-tune Hugging Face's DistilBERT on SQuAD, a closed-domain question answering task, on Azure with a free trial account. Many aspects should be analogous when fine-tuning other transformer models, for other tasks, on other cloud compute services.
EDIT: Massive caveat. These tools and libraries are being changed and updated blisteringly fast. Although, at the time of writing, these steps lead to functioning output, it is unlikely time will be kind.
Context
Hugging Face have kindly integrated numerous BERT-like models behind an (almost) coherent interface. This means you can (almost) exchange different models at will.
One model from members of Hugging Face is DistilBERT. The best thing about DistilBERT is that it isn't as ludicrously huge as other models. And it still performs reasonably.
SQuAD is a dataset and associated benchmark for closed-domain question answering, set by Stanford.
Azure is Microsoft’s cloud platform.
Azure setup
Sign up. No card details necessary unlike competitors. Just an email address. A new account comes with $200 free credit.
Create a VM with GPUs
From the main dashboard: find Machine learning and open the Azure ML lab. Find 'Compute' in the left-hand menu. Here select the 'Training Cluster' tab, and click ' + ' to add a new training cluster.
A form slides in from the right.
- Name: (suggestion) ‘tc-distilbert-fine-tune-squad’
- Size: On the dropdown select GPU, and select the size (suggestion) STANDARD_NC12.
- Leave priority as is.
- Minimum number of nodes: select 1
- Maximum number of nodes: select 1
- In the advanced settings, toggle open port 22 for SSH access.
- As the user name, add (suggestion) 'jo'.
- Add your public SSH key (password authentication is an alternative, so do one or the other). If you need to generate a key pair, see the snippet after this list.
- Click create.
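If you don't already have a key pair, you can generate one on your local machine first. A minimal sketch (the comment string is just a placeholder; use whatever identifies you):
## Generate a 4096-bit RSA key pair, accepting the default path
ssh-keygen -t rsa -b 4096 -C "jo@laptop"
## Print the public half; this is what gets pasted into the Azure form
cat ~/.ssh/id_rsa.pub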
A request is put in to have your resources allocated to you. Meanwhile a row appears in the Training Cluster table, with a loading status. This may take a few minutes to be ready. Click on the row to see the dashboard for the cluster.
There is a tab entitled ‘nodes’. When the node is up, a public IP (123.456.789.0) and a port number (50000) will be listed. Open a terminal and ssh in
ssh -p 50000 jo@123.456.789.0
Troubleshooting:
A VM with GPU access is not available on a trial account via the VM option on the Azure dashboard. Setting up a VM this way is subject to a 4-core limit, and the smallest VM with GPU access has 6 cores.
I repeatedly found Azure failing to provision me a machine, without comment. Trial and error led me to instead set the minimum number of nodes to 0, and edit this to 1 after my request had been provisioned.
If I tried sshing into the machine shortly after the node details were advertised, I was often asked for a password, despite none being set. Waiting a minute made this issue disappear.
At the time of writing, the VM runs Ubuntu 16.04, has Python 3.5, and has two NVIDIA Tesla K80s.
Setting up the VM
In the terminal, sshed into the machine, you can switch from the default shell to bash (equipped with more user-friendly tools like tab completion and history).
bash
Install Python 3.7, pip, PyTorch, transformers, and their dependencies.
(We assume that vim, git, and tmux are already installed.)
Download the SQuAD v1.1 data and put it in the directory ~/squad_data/.
One way to do all of this is to copy and paste the snippet below into a script setup.sh and run it with
bash setup.sh
## SETUP
## Install python3.7
sudo apt install software-properties-common
sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt update
sudo apt install python3.7 python3.7-dev
## Get pip
curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
python3.7 get-pip.py --user
## install torch from pip (might need adapting for hardware spec!)
pip install --user torch
## Install transformers from the local clone (as per their instructions)
git clone https://github.com/huggingface/transformers.git
pip install --user --editable ./transformers
pip install --user -r transformers/examples/requirements.txt
# Get squad data
mkdir squad_data
curl https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json -o squad_data/train-v1.1.json
curl https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json -o squad_data/dev-v1.1.json
# OPTIONAL: tree, GPU status programs.
sudo apt install tree
sudo apt install cmake libncurses5-dev libncursesw5-dev
git clone https://github.com/Syllo/nvtop.git
mkdir -p nvtop/build && cd nvtop/build
cmake .. -DNVML_RETRIEVE_HEADER_ONLINE=True
sudo make
sudo make install
pip install --user gpustat
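Before moving on, it's worth a quick sanity check that PyTorch can see the GPUs. A minimal check, assuming the NVIDIA drivers come preinstalled on the node and that the script above completed cleanly:
## Should list the two Tesla K80s
nvidia-smi
## Should print True and 2
python3.7 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"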
Configuring SQuAD training
There are a lot of flags and options involved when running the SQuAD training scripts. I personally found it much easier to work with a Python script that simply prints the required command than to edit the bash command directly. It feels a bit hacky, but it's simple and works.
To follow this method, add the following content to a file make_cmd.py in the home directory.
import os, subprocess
# Types: 'bert' 'xlnet' 'xlm' 'distilbert' 'albert'
= "distilbert"
MODEL_TYPE = "distilbert-??"
MODEL_NAME
= "python3.7"
PYTHON
= os.getcwd()
PWD = os.path.join(PWD, "squad_data")
SQUAD_DIR = os.path.join(SQUAD_DIR, "train-v1.1.json")
TRAIN_PATH = os.path.join(SQUAD_DIR, "dev-v1.1.json")
DEV_PATH
= os.path.join(PWD, "output")
OUTPUT_DIR
= os.path.join(PWD, "transformers")
TRANSFORMERS = os.path.join(TRANSFORMERS, "examples/run_squad.py")
RUN_SQUAD
= [
cmd
PYTHON, RUN_SQUAD,"--model_type", MODEL_TYPE,
"--model_name_or_path", MODEL_NAME,
"--do_train",
"--do_eval",
# "--do_lower_case",
"--train_file", TRAIN_PATH,
"--predict_file", DEV_PATH,
# "--evaluate_during_training",
"--per_gpu_train_batch_size", "12",
"--learning_rate", "3e-5",
"--num_train_epochs", "2.0",
"--max_seq_length", "384",
"--doc_stride", "128",
"--output_dir", OUTPUT_DIR,
]
print(" ".join(cmd))
Running this with
python3.7 make_cmd.py
will output the command string, which can be copied to the clipboard.
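Alternatively, the printed command can be fed straight back to the shell, skipping the copy-paste. A small sketch, assuming make_cmd.py sits in the current directory:
## Run whatever command make_cmd.py prints
bash -c "$(python3.7 make_cmd.py)"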
Training
Running in tmux gives you the option to detach from the process, so that it won’t be terminated if your connection dies.
The same functionality is available from screen. Neither provides a seamless experience.
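Still, a tmux workflow for this run is only a couple of commands. For example (the session name is just a suggestion):
## Start a named session
tmux new -s squad
## Inside it, run the training command produced by make_cmd.py
bash -c "$(python3.7 make_cmd.py)"
## Detach with Ctrl-b then d; reattach later with
tmux attach -t squad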
As the old adage goes
If your model is running, it still counts as work
It would be smart to automate this bit, but I did not do it often enough, or repetitively enough, to deem that worthwhile. And I got a kick from watching the output and nvtop.
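If you want the same view, a second tmux window (Ctrl-b then c) running either of the optional tools from setup.sh does the job:
nvtop
gpustat -i   ## refreshes a one-line GPU summary on an interval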
Clean up
scp/sftp the trained model from the remote to the local machine. Weirdly, I often found this step the most painful. (Very slow, occasionally fails.)
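A sketch of that copy, reusing the illustrative user, IP, and port from earlier, and assuming the model landed in ~/output on the VM (OUTPUT_DIR in make_cmd.py, run from the home directory); the local directory name is made up:
## Note that scp takes a capital -P for the port
scp -P 50000 -r jo@123.456.789.0:~/output ./distilbert-squad-fine-tuned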
Put it somewhere sensible, with clear instructions on how the model came into existence.
Oh. And shut down the remote machine, so it stops eating the free credit.