(Try to) Fine-tune XLM for SQuAD

Posted on 2019-12-03 by waalge

Aim

To fine-tune XLM for closed domain question answering.

EDIT : Spoiler - I failed.

Context

The quality and quantity of publicly available English resources such as SQuAD are one reason that NLP capabilities in English have not been matched in other languages. Without the data to train on, or even to benchmark progress with, transposing cutting-edge models into languages other than English introduces new problems.

One approach is to try to make a model that is somewhat language agnostic: able to represent, say, German and English simultaneously. Then, using datasets available in English, train the model to perform question answering and hope that the capability bootstraps back to the other tongue. The unreasonable effectiveness of multilingual methods has been demonstrated by mBERT and the XLM models.

XLM models fine-tuned on SQuAD do not seem to be available in the transformers library, so let's try to make one.

Plan

  1. Set up a VM with GPUs on Azure.
  2. Install all the necessary resources to run training.
  3. Fetch the model from the remote to my laptop.
  4. Ask it some questions in German.

Setting up a VM

There is an option on Azure to just set up a VM. However, at the time of writing, there does not seem to be GPU access via a plain VM on a free account; I'm not sure if I'm just missing something. The advertised deep-learning VMs, such as this one, suggest I should get an NC-series machine with the HDD disk type.

EDIT : As with almost everything in this field the link is now a 404.

Additional VMs are available via the Azure machine learning studio. In the studio, go to

compute > Training Cluster

Choose a name like train-xlm. My region is insistent on being ‘northeurope’. Under VM size, select GPUs. I’m gonna select the middle option

                | vCPUs | GPUs | RAM    | Disk   |
Standard_NC12   | 12    | 2    | 112 GB | 680 GB |

The first time, I chose low priority, but it took too long to get going. Maybe I was being impatient - the setup time is not insignificant. I set the node min and max to 1. In the advanced settings, enable port 22. Think of an admin name you won't immediately forget (I did), and add your id_rsa.pub to the SSH keys.

Create!

Under compute details > Nodes, I can see the public IP and port. I almost forgot my hilarious admin name at this point… That would have been annoying.

ssh bob@52.155.166.124 -p 50000

Do the following (a quick sanity check is sketched after the list):

  1. Install Python 3.7
  2. Install pip
  3. pip install the modules torch, transformers, and tensorboardX
  4. git clone the transformers repo
  5. Get the SQuAD v1.1 datasets (train-v1.1.json and dev-v1.1.json) into a squad_data directory
  6. mkdir debug
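
A quick sanity check before kicking anything off (a sketch; the file name is mine):

# check_setup.py : confirm the installs worked and that PyTorch can see both GPUs.
import torch
import transformers

print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(" ", torch.cuda.get_device_name(i))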

I used the following script, make_cmd.py, just to write out the training command.

import os

# Types: 'bert' 'xlnet' 'xlm' 'distilbert' 'albert' 
MODEL_TYPE = "xlm"
MODEL_NAME = "xlm-mlm-ende-1024"

PYTHON = "python3.7"

PWD = os.getcwd()
SQUAD_DIR = os.path.join(PWD, "squad_data")
TRAIN_PATH = os.path.join(SQUAD_DIR, "train-v1.1.json")
DEV_PATH = os.path.join(SQUAD_DIR,"dev-v1.1.json")

DEBUG_DIR = os.path.join(PWD, "debug")

TRANSFORMERS = "transformers"
RUN_SQUAD = os.path.join(TRANSFORMERS, "examples/run_squad.py") 

cmd = [
        PYTHON, RUN_SQUAD,
        "--model_type", MODEL_TYPE,
        "--model_name_or_path", MODEL_NAME,
        "--do_train",
        "--do_eval",
        "--do_lower_case",
        "--train_file", TRAIN_PATH,
        "--predict_file", DEV_PATH,
        "--per_gpu_train_batch_size", "12",
        "--learning_rate", "3e-5",
        "--num_train_epochs", "2.0",
        "--max_seq_length", "384",
        "--doc_stride", "128",
        "--output_dir", DEBUG_DIR,
        ]

print(" ".join(cmd))

Copy the output of python3.7 make_cmd.py, open tmux, then paste and run the command there. Doing it in tmux means I can detach from the process and leave it running.
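
For reference, the printed command comes out along these lines (the train, dev, and output paths are absolute in practice; <pwd> below stands in for the working directory):

python3.7 transformers/examples/run_squad.py --model_type xlm --model_name_or_path xlm-mlm-ende-1024 --do_train --do_eval --do_lower_case --train_file <pwd>/squad_data/train-v1.1.json --predict_file <pwd>/squad_data/dev-v1.1.json --per_gpu_train_batch_size 12 --learning_rate 3e-5 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir <pwd>/debug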

Inspecting + troubleshooting

After detaching, I had a look at the output of nvidia-smi. The GPUs were sitting idle, and this persisted. In fact, only one of the 12 CPUs was busy. I have previously fine-tuned DistilBERT, and I didn't recall this being an issue. I tried a number of things, including starting over and attempting to install the latest version of CUDA that PyTorch targets (10.1).

It seems that all that was actually happening was that a preprocessing step only wants to run on a single core - and this takes about an hour. I'm not sure whether I just didn't clock that in my previous run, or whether this step was faster with DistilBERT.

In future consider running this preprocessing step locally to save time on the GPU machine. The preprocessing saves a cached version into the squad data directory.
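
If you do want to run it locally, the single-core step is (roughly) the featurization that ships with transformers, something like the sketch below. Whether run_squad.py will pick up a cache produced this way depends on its own naming scheme, so treat this as illustrative of the work rather than a drop-in replacement:

from transformers import XLMTokenizer
from transformers.data.processors.squad import (
    SquadV1Processor,
    squad_convert_examples_to_features,
)

# Tokenizer for the model being fine-tuned.
tokenizer = XLMTokenizer.from_pretrained("xlm-mlm-ende-1024")

# Read the raw SQuAD json into example objects.
processor = SquadV1Processor()
examples = processor.get_train_examples("squad_data", filename="train-v1.1.json")

# The slow, single-core part: tokenize and window every example.
features = squad_convert_examples_to_features(
    examples=examples,
    tokenizer=tokenizer,
    max_seq_length=384,   # matches the flags passed to run_squad.py
    doc_stride=128,
    max_query_length=64,  # run_squad.py's default
    is_training=True,
)
print(len(features), "features")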

Eventually, I just let it run, but not before I lost the morning to it. After this step was done, the GPUs kicked in.

The first line of the output does state that PyTorch can see the GPUs, but it disappears off the screen pretty fast. Easy to miss.

The code then ran without issue for, perhaps, just over two hours. It completed both training epochs and cached the dev dataset. At that point it hit an error.

I then retrieved the trained model and associated files. Again, it's super fiddly and slow to pull down only the PyTorch model and the necessary files. This time I used scp, but not without issues.

Not such a model answer

Returning to the evaluation step, I posted the issue here. After trying to patch over this, as explained in the suggestion, I ran into another error (linked above).

Reading through the run_squad.py file, it seems that XLM and XLNet have been bolted onto a script originally written exclusively for the original BERT-type models. The comments are not too enlightening. For example (at the time of writing), line 275:

if args.model_type in ['xlnet', 'xlm']:
    # XLNet uses a more complex post-processing procedure 

No comment on why XLM gets the same treatment.

I also cannot find an explanation of why XLM has a class XLMForQuestionAnsweringSimple which doesn’t appear in the docs, only in the source code. According to the docs, the output of XLMForQuestionAnswering is determined by the config, which makes it harder to track down what the numbers it spits out are supposed to mean. The output of the model made no sense to me.
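
For what it's worth, here is the kind of poking around I did. The checkpoint below is just the base model, since all I'm after is the shape of the outputs, and with the version of transformers I was using the forward pass returns a plain tuple:

import torch
from transformers import XLMTokenizer, XLMForQuestionAnswering

tokenizer = XLMTokenizer.from_pretrained("xlm-mlm-ende-1024")
model = XLMForQuestionAnswering.from_pretrained("xlm-mlm-ende-1024")  # QA head untrained; shapes are all we care about
model.eval()

inputs = tokenizer.encode_plus(
    "Who wrote Faust?",
    "Faust is a play by Johann Wolfgang von Goethe.",
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model(**inputs)

# Not the two (1, T) tensors a BERT-style model gives back.
for i, t in enumerate(outputs):
    print(i, tuple(t.shape))

If I read modeling_xlm.py correctly, those five tensors are top-k start and end log-probabilities with their indices, plus an answerability logit, with the k's set by start_n_top and end_n_top in the config - but that is my reading of the source, not something the docs spell out, and presumably also why XLM shares XLNet's post-processing.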

Leaving the transformers module, I had a look to see whether the XLM repo has a run_squad.py-style script of its own, adapted for XLM. They certainly do CDQA on their shiny new dataset, MLQA. If they have such a script, it's not obvious from the repo.

Without a better understanding, this was the end of the line. I plugged the model into my API. As mentioned, the XLM model's output is distinct from that of BERT-based models, and my code errored. My code expects the output to be two tensors of size (1, T), where T is the number of tokens: one each for the probability that the corresponding token is the start or the end token respectively. But none of the outputs of the model matched this form.

The simple Q&A class did spit out this recognizable form, but the answers on my tiny test were garbage.
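
For completeness, "plugging it into the API" boils down to a decode step like the sketch below, here pointed at the simple class and the output directory the fine-tuned model was saved to (the German example is made up):

import torch
from transformers import XLMTokenizer, XLMForQuestionAnsweringSimple

# "./output" holds the fine-tuned pytorch_model.bin, config.json, and tokenizer files.
tokenizer = XLMTokenizer.from_pretrained("./output")
model = XLMForQuestionAnsweringSimple.from_pretrained("./output")
model.eval()

question = "Wer schrieb Faust?"
context = "Faust ist eine Tragödie von Johann Wolfgang von Goethe."
inputs = tokenizer.encode_plus(question, context, return_tensors="pt")

with torch.no_grad():
    start_logits, end_logits = model(**inputs)[:2]

# Two (1, T) tensors, the form my code expects: take the most likely start
# and end positions and decode the tokens between them.
start = int(torch.argmax(start_logits, dim=1))
end = int(torch.argmax(end_logits, dim=1))
print(tokenizer.decode(inputs["input_ids"][0][start:end + 1]))

My guess - and it is only a guess - is that the QA head trained by the non-simple class doesn't map onto the simple head, so the simple head comes up effectively untrained, which would go some way to explaining the garbage.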

Next?

Perhaps I could switch all occurrences over to the simple model? Or I could instead try fine-tuning mBERT, since at least it's a BERT model.

On attempt one I ran out of space with too many checkpoints (one saved every 50 steps, the script's default). I bumped the save interval up to 500.
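
In make_cmd.py terms, that change is roughly the following; --save_steps is run_squad.py's checkpoint interval, and its default is 50:

cmd += [
        "--save_steps", "500",  # save a checkpoint every 500 steps instead of every 50
        ]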

Starting again from scratch and training for 3 epochs gave a very minor improvement over the 2 epochs trained on 2 GPUs.

12/23/2019 20:40:06 - INFO - transformers.configuration_utils -   loading configuration file ./output/config.json
12/23/2019 20:40:06 - INFO - transformers.configuration_utils -   Model config {
  "asm": false,
  "attention_dropout": 0.1,
  "bos_index": 0,
  "bos_token_id": 0,
  "causal": false,
  "do_sample": false,
  "dropout": 0.1,
  "emb_dim": 1024,
  "embed_init_std": 0.02209708691207961,
  "end_n_top": 5,
  "eos_index": 1,
  "eos_token_ids": 0,
  "finetuning_task": null,
  "gelu_activation": true,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "id2lang": {
    "0": "de",
    "1": "en"
  },
  "init_std": 0.02,
  "is_decoder": false,
  "is_encoder": true,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "lang2id": {
    "de": 0,
    "en": 1
  },
  "lang_id": 0,
  "layer_norm_eps": 1e-12,
  "length_penalty": 1.0,
  "mask_index": 5,
  "mask_token_id": 0,
  "max_length": 20,
  "max_position_embeddings": 512,
  "max_vocab": -1,
  "min_count": 0,
  "n_heads": 8,
  "n_langs": 2,
  "n_layers": 6,
  "num_beams": 1,
  "num_labels": 2,
  "num_return_sequences": 1,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pad_index": 2,
  "pad_token_id": 0,
  "pruned_heads": {},
  "repetition_penalty": 1.0,
  "same_enc_dec": true,
  "share_inout_emb": true,
  "sinusoidal_embeddings": false,
  "start_n_top": 5,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "first",
  "summary_use_proj": true,
  "temperature": 1.0,
  "top_k": 50,
  "top_p": 1.0,
  "torchscript": false,
  "unk_index": 3,
  "use_bfloat16": false,
  "use_lang_emb": true,
  "vocab_size": 64699
}

12/23/2019 20:40:06 - INFO - transformers.modeling_utils -   loading weights file ./output/pytorch_model.bin
12/23/2019 20:40:09 - INFO - __main__ -   Creating features from dataset file at .
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 48/48 [00:03<00:00, 12.03it/s]
convert squad examples to features: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10570/10570 [03:23<00:00, 51.96it/s]
add example index and unique id: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10570/10570 [00:00<00:00, 756652.67it/s]
12/23/2019 20:43:38 - INFO - __main__ -   Saving features into cached file ./cached_dev_xlm-mlm-ende-1024_384
12/23/2019 20:43:50 - INFO - __main__ -   ***** Running evaluation  *****
12/23/2019 20:43:50 - INFO - __main__ -     Num examples = 10918
12/23/2019 20:43:50 - INFO - __main__ -     Batch size = 16
Evaluating: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 683/683 [05:18<00:00,  2.15it/s]
12/23/2019 20:49:08 - INFO - __main__ -     Evaluation done in total 318.349047 secs (0.029158 sec per example)
12/23/2019 20:49:08 - INFO - transformers.data.metrics.squad_metrics -   Writing predictions to: ./output/predictions_.json
12/23/2019 20:49:29 - INFO - __main__ -   Results: {'exact': 57.71996215704825, 'f1': 68.31493292742532, 'total': 10570, 'HasAns_exact': 57.71996215704825, 'HasAns_f1': 68.31493292742532, 'HasAns_total': 10570, 'best_exact': 57.71996215704825, 'best_exact_thresh': 0.0, 'best_f1': 68.31493292742532, 'best_f1_thresh': 0.0}

This was run on 3 epochs rather than 2. I may try some of the other parameters before I run out of credit.