RobBERT

A Dutch RoBERTa-based Language Model

Pieter Delobelle, Thomas Winters and Bettina Berendt

Pre-trained language models have been dominating the field of natural language processing in recent years, and have led to significant performance gains for various complex natural language tasks. One of the most prominent pre-trained language models is BERT. Although the multilingual version of BERT performs well on many tasks, recent studies showed that BERT models trained on a single language significantly outperform the multilingual results.
For this reason, we present a Dutch model based on RoBERTa, which we call RobBERT. We show that RobBERT improves on state-of-the-art results in Dutch-specific language tasks.

Posted on 2020-01-20

The advent of neural networks in natural language processing (NLP) has significantly improved state-of-the-art results within the field. While recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) initially dominated the field, recent models started incorporating attention mechanisms and then later dropped the recurrent part and just kept the attention mechanisms in so-called transformer models. This latter type of model caused a new revolution in NLP and led to popular language models like GPT-2, alongside contextual models such as the LSTM-based ELMo. BERT improved over previous transformer models and recurrent networks by allowing the system to learn from input text in a bidirectional way, rather than only from left-to-right or the other way around. This model was later re-implemented, critically evaluated and improved in the RoBERTa model.

These large-scale transformer models provide the advantage of being able to solve NLP tasks by having a common, expensive pre-training phase, followed by a smaller fine-tuning phase. The pre-training happens in an unsupervised way by providing large corpora of text in the desired language. The second phase only needs a relatively small annotated data set for fine-tuning to outperform previous popular approaches in one of a large number of possible language tasks.

While language models are usually trained on English data, some multilingual models also exist. These are usually trained on a large quantity of text in different languages. For example, Multilingual-BERT is trained on a collection of corpora in 104 different languages and generalizes language components well across languages. However, models trained on data from one specific language usually improve the performance over multilingual models for this particular language. Training a RoBERTa model on a Dutch dataset thus has a lot of potential for increasing performance for many downstream Dutch NLP tasks.

Get started with our models

We release our pretrained models for both Hugging Face's transformers and Facebook's Fairseq. For some downstream tasks, we also have models in either or both formats.

Awesome that you're using 🤗 Transformers. You can import our models directly using:

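# Load the tokenizer and a RoBERTa sequence-classification model from the Hugging Face hub
from transformers import RobertaTokenizer, RobertaForSequenceClassification
tokenizer = RobertaTokenizer.from_pretrained("pdelobelle/robBERT-base")
# The classification head on top of the base model still needs finetuning on your own task
model = RobertaForSequenceClassification.from_pretrained("pdelobelle/robBERT-base")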

The pretrained model can be either the base model, which was trained on language modeling (LM), or one of the finetuned models that we provide.

Model | Task | Accuracy (%) | F1 (%)
pdelobelle/robBERT-base | LM | - | -
pdelobelle/robBERT-dutch-books | DBRD | 94.4 | 94.4

Great that you're using Fairseq! We pre-trained our model with this library. You can import this pre-trained version of RobBERT by downloading it directly.

Model | Download | Task | Accuracy (%) | F1 (%)
RobBERT-base | 1.4 GB | LM | - | -
RobBERT-diedat | 1.4 GB | Die vs dat | 98.4 | 98.1

A note on copyright

We're releasing all our models under MIT.

Intro to RobBERT, BERT and RoBERTa

[Figure: Transformer models. Encoder (input: “Ik zie een giraf in mijn tuin.”), self-attention matrices from the last encoder, decoder (previous output: “I see a giraffe in”; output: “I see a giraffe in my”).]
Illustration of transformer models and how the encoder and decoder stacks are part of it.

In NLP, encoder-decoder models have been used for some time. These models, often called sequence-to-sequence or seq2seq, are good at various sequence-based tasks: translation, token labeling, named entity recognition (NER), etc. Historically, these seq2seq models were usually LSTMs or other recurrent networks. A major improvement in these networks was the attention mechanism, which allows the encoder to pass more than one feature vector to the decoder. (For those coming from computer vision, this looks a bit like the skip connections in U-Net.)

The by now famous transformer model was based solely on this attention mechanism. It features 2 stacks: (i) an encoder stack that uses multiple layers of self-attention and (ii) a decoder stack with attention layers that connect back to the encoder outputs.

Encoders

So this attention-based encoder generates a transformed representation of the input sequence at once.

[Figure: Encoder. Steps: tokenizing the input, getting embeddings for every token, calculating self-attention for all layers, output argmax. Input: “Ik zie een <mask> in mijn tuin.”; output: “Ik zie een boom in mijn tuin”.]
Illustration of the encoder in RobBERT used for language modeling, by predicting the most likely word.

We could also interpret this probabilistically: we have a language model that assigns a probability to every candidate word for the masked position. For example:

P(\text{``giraf"} \mid \text{``Ik zie een <mask> in mijn tuin."})<0.0001

Or a more probable word:

P(\text{``boom"} \mid \text{``Ik zie een <mask> in mijn tuin."})=0.1498

In fact, we can even query the most likely results. For this sentence, RobBERT gives us:

[('Ik zie een lamp in mijn tuin.', 0.39584335684776306, ' lamp'),
 ('Ik zie een boom in mijn tuin.', 0.1497979462146759, ' boom'),
 ('Ik zie een camera in mijn tuin.', 0.089895099401474, ' camera'),
 ('Ik zie een ster in mijn tuin.', 0.046020057052373886, ' ster'),
 ('Ik zie een stip in mijn tuin.', 0.009481011889874935, ' stip'),
 ('Ik zie een man in mijn tuin.', 0.009198895655572414, ' man'),
 ('Ik zie een slang in mijn tuin.', 0.009129301644861698, ' slang'),
 ('Ik zie een stem in mijn tuin.', 0.007939961738884449, ' stem'),
 ('Ik zie een bos in mijn tuin.', 0.007785357069224119, ' bos'),
 ('Ik zie een storm in mijn tuin.', 0.0077188946306705475, ' storm')]

Ok, I'll be honest. I thought a tree (een boom) would be the most likely, I didn't even think of a lamp (een lamp). But I guess it makes sense anyway. And giraffes are not even in the top 10k suggestions, so that's disappointing.
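A ranked list like the one above can be read as a softmax over the vocabulary followed by a top-k selection. Here is a minimal sketch in plain Python, assuming made-up scores for a tiny four-word vocabulary. The names `softmax`, `top_k` and `toy_logits` are only for illustration, and the numbers are not RobBERT's actual outputs.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits.values())  # subtract the max for numerical stability
    exps = {word: math.exp(score - highest) for word, score in logits.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}

def top_k(probs, k):
    """Return the k most probable (word, probability) pairs, best first."""
    return sorted(probs.items(), key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical scores for the <mask> position in "Ik zie een <mask> in mijn tuin."
toy_logits = {"lamp": 4.0, "boom": 3.0, "camera": 2.5, "giraf": -5.0}
toy_probs = softmax(toy_logits)

for word, prob in top_k(toy_probs, 3):
    print(f"Ik zie een {word} in mijn tuin. {prob:.4f}")
```

A real model does the same thing over its full vocabulary, which is why a word can be a valid Dutch noun and still end up with a probability close to zero.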

Decoders

So that’s the encoder side. For some language tasks, it is enough (NER, POS tagging, etc.). But for others, like translation, we need the decoder as well. Translation could thus be formulated as a task P(A) P(B\mid A) with a decoder that depends on the outputs of the encoder or language model. Practically, this looks a bit like this:

P(\text{``Ik zie een giraf in mijn tuin"})\newline \cdot P(\text{``I see a giraffe in my garden"}\mid \text{``Ik zie een giraf in mijn tuin"})

To actually get to this outcome, the attention mechanism implicitly uses marginal probabilities over all tokens in the encoder, which are then also used by the decoder. It’s also possible to use only the decoder, which is what GPT-2 does: it generates an output sequence from the preceding text alone, without an encoder. But of course, GPT-2 doesn’t translate sentences, as it misses that part.

P(\text{``I see a giraffe"} \mid \text{``I see a"})=0.01

P(\text{``I see a giraffe in"} \mid \text{``I see a giraffe"})=0.6
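These step-by-step probabilities combine by the chain rule: the probability of a longer continuation is the product of the probabilities of each generated word. A small sketch, assuming a hypothetical lookup table `NEXT_WORD_PROB` that holds only the two example values above (a real decoder would compute these with a neural network):

```python
# Hypothetical next-word probabilities, copied from the two example values above.
NEXT_WORD_PROB = {
    (("I", "see", "a"), "giraffe"): 0.01,
    (("I", "see", "a", "giraffe"), "in"): 0.6,
}

def continuation_probability(prefix, continuation):
    """Multiply the per-step probabilities of generating `continuation` after `prefix`."""
    prob = 1.0
    context = tuple(prefix)
    for word in continuation:
        prob *= NEXT_WORD_PROB[(context, word)]
        context = context + (word,)  # the generated word becomes part of the context
    return prob

# Probability of continuing "I see a" with "giraffe in": 0.01 * 0.6
p_giraffe_in = continuation_probability(["I", "see", "a"], ["giraffe", "in"])
print(p_giraffe_in)
```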

But since RobBERT is only an encoder stack, we won’t dive deeper into this. From now on, we will only describe an encoder we happen to have lying around (hint: it’s RobBERT).

Great, but how do you input a sentence?

Word embeddings like word2vec were quite popular in seq2seq models, especially since they happened to encode some semantic and grammatical meaning, so they were a quick way to boost performance. But these embeddings had two drawbacks: (i) if a word is not in the vocabulary, it has no vector (word2vec trained on Google News had 3 million words, so usually it was fine) and (ii) word embeddings give the same vector regardless of context. The go-to example is “stick”, which is both a verb and a noun.

This was addressed by ELMo, which could generate contextualized embeddings. But BERT and RobBERT have a similar trick up their sleeves. If we look back at the probabilistic interpretation, we see that the input of the language model is not just a bag-of-words or TF-IDF input, but the actual text with one word masked. In other words, the input is the whole sentence, or even multiple sentences.

To deal with all these words, we need something to convert words into an embedding. All transformer-based models, including RobBERT, use a tokenizer. This tokenizer splits the input into words and, if a word is not in its vocabulary, it will split it into subwords. BERT uses WordPiece and RobBERT uses a byte-level BPE tokenizer. These tokenizers have the benefit that all input sentences can be represented, even if some words are missing. Yay, no more out-of-vocabulary (OOV) issues!
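To make the subword idea concrete, here is a minimal sketch of how a BPE tokenizer applies its learned merges to a single word. It is simplified in two ways that are assumptions of this example: it works on characters instead of bytes, and the merge list `TOY_MERGES` is invented for illustration and is not RobBERT's vocabulary.

```python
def bpe_encode(word, merges):
    """Split `word` into subwords by repeatedly applying the highest-priority merge."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}  # lower rank = learned earlier
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda pair: ranks.get(pair, float("inf")))
        if best not in ranks:
            break  # no known merge applies anymore
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merges, in the order a tokenizer might have learned them.
TOY_MERGES = [("g", "i"), ("r", "a"), ("gi", "ra"), ("gira", "f"),
              ("t", "u"), ("i", "n"), ("tu", "in")]

print(bpe_encode("giraf", TOY_MERGES))    # a word the merges fully cover
print(bpe_encode("giraffe", TOY_MERGES))  # an unseen word falls back to smaller pieces
```

The second call shows why out-of-vocabulary words stop being a problem: a word without its own token is still represented, just by more and smaller pieces.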

For all these tokens, we take a vector (similar to word embeddings) from our embeddings matrix. Hold up, wasn’t that problematic for word2vec? Yes, but transformer-based models do something else before the attention heads: they add a positional encoding to the embeddings. In the original transformer, this encoding is a fixed pattern of sines and cosines, while BERT-style models such as RoBERTa learn their position embeddings during training. These altered embeddings then get fed into the first layer of attention heads, which is where the context of the other tokens gets mixed in.
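As an illustration, here is a sketch of the sinusoidal positional encoding from the original transformer paper, where the encoding is added element-wise to the token embeddings. This is the fixed variant; BERT-style models learn their position embeddings instead, so treat it as an example of the idea rather than RobBERT's exact implementation. The helper names and the toy embedding are assumptions of this example.

```python
import math

def positional_encoding(position, dim):
    """Sinusoidal encoding for one position: sine on even indices, cosine on odd ones."""
    encoding = []
    for i in range(dim):
        # Both members of a (sin, cos) pair share the frequency of the even index.
        angle = position / (10000 ** ((i - i % 2) / dim))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding

def add_positions(embeddings):
    """Add the positional encoding of each position to the embedding at that position."""
    dim = len(embeddings[0])
    return [
        [value + pos_value for value, pos_value in zip(vector, positional_encoding(pos, dim))]
        for pos, vector in enumerate(embeddings)
    ]

# The same (made-up) token embedding at positions 0 and 1 ends up as two different vectors.
toy_embedding = [0.1, 0.2, 0.3, 0.4]
with_positions = add_positions([toy_embedding, toy_embedding])
print(with_positions[0])
print(with_positions[1])
```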

Self-attention mechanism

So we have multiple layers (12 in the case of RobBERT) that each have multiple attention heads (also 12). Each head calculates scaled dot-product attention based on Query (Q), Key (K) and Value (V) vectors, where these vectors are calculated from three respective weight matrices that are learned during training.

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V

After calculating the self-attention for each head, the outputs of all heads are concatenated and then fed into a linear layer.
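The attention formula can be sketched for a single head in plain Python. The matrices are lists of row vectors, and the small `Q`, `K` and `V` values in the usage example are made up. A real implementation would use a tensor library and compute all heads in parallel.

```python
import math

def softmax(scores):
    """Normalize a list of scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return outputs

# Two identical keys get equal weight, so the output is the average of the two values.
toy_output = attention(Q=[[1.0, 0.0]], K=[[1.0, 0.0], [1.0, 0.0]], V=[[1.0, 2.0], [3.0, 4.0]])
print(toy_output)
```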

If you want to learn more about the attention mechanism specifically, there are some good resources, like this post from Towards Data Science and Jay Alammar's The Illustrated Transformer!

Pre-training and finetuning

As it happens, language models are quite expensive to train. We used a high-performance computing cluster with ~80 Nvidia P100s for several days. This is because we train it on a large dataset (39 GB of text) with the word-masking objective, which means we randomly replace some words with a <mask> token (or with another word or the same word, but those are less likely). After a few epochs, we have something that resembles the probabilistic language model we described earlier.
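A sketch of what such a masking step could look like. The default rates are the ones published for BERT (15% of the tokens are selected; of those, 80% become <mask>, 10% become a random token and 10% stay unchanged). The text above does not state the exact rates used for RobBERT, so these defaults, the function name `mask_tokens` and the toy vocabulary are assumptions of this example.

```python
import random

MASK = "<mask>"

def mask_tokens(tokens, vocab, rng, select_prob=0.15, mask_prob=0.8, random_prob=0.1):
    """Return (corrupted tokens, labels); labels are None where nothing has to be predicted."""
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() >= select_prob:
            corrupted.append(token)  # not selected: keep the token, no prediction target
            labels.append(None)
            continue
        labels.append(token)  # selected: the model must predict the original token
        draw = rng.random()
        if draw < mask_prob:
            corrupted.append(MASK)
        elif draw < mask_prob + random_prob:
            corrupted.append(rng.choice(vocab))  # replace with a random word
        else:
            corrupted.append(token)  # keep the same word
    return corrupted, labels

toy_tokens = ["Ik", "zie", "een", "boom", "in", "mijn", "tuin", "."]
toy_vocab = ["lamp", "boom", "camera", "ster"]
corrupted, labels = mask_tokens(toy_tokens, toy_vocab, random.Random(0))
print(corrupted)
print(labels)
```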

But this language model can do more than just fill in some <mask> tokens! The part that predicts the masked word is one of the so-called heads, the one we use to pre-train our language model. After training, we can easily take it off (perhaps calling them heads was a bit insensitive?) and replace it with another one. This step is then called finetuning. We keep all the pre-trained weights of the model as a starting point and add a newly initialized head that we train on the data for the task we want. And since most weights come from the trained base model, we only need a fraction of the data, so it goes a lot faster as well!

Custom heads: sentence and token classification

So to train our model, we used a language modeling head. But as it turns out, we could also add other heads that do other things. These heads can be trained on a lot less data and are faster to train, so it's really easy to train your own on a custom task.

[Figure: Sentence and token level classification. Left: the encoder gives an output per token for the tokenized input “<s> Ik zie een giraf . </s>” (0 N VERB DET N PUNCT 0). Right: for the same input, the output (“Positive”) is taken only from the first token.]
Token-level classification (left) and sentence-level classification (right) for an example input.

Of course, not all tasks are the same. Roughly, there are two kinds of tasks: (i) sentence-level prediction and (ii) token-level prediction. As the names imply, the difference is that one predicts something on a sentence or document level, while the other makes a prediction for each token.
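The difference between the two kinds of heads can be sketched as follows, assuming the encoder has already produced one vector per token. The 2-dimensional vectors, the weights and the label names are all made up for this example. A sentence-level head only looks at the vector of the first token (<s>), while a token-level head applies the same linear layer to every token.

```python
def linear(vector, weights, bias):
    """One linear layer: a score per class for a single input vector."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]

def argmax(scores):
    """Index of the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])

def sentence_head(token_vectors, weights, bias, labels):
    """Sentence-level prediction: classify using only the first token (<s>)."""
    return labels[argmax(linear(token_vectors[0], weights, bias))]

def token_head(token_vectors, weights, bias, labels):
    """Token-level prediction: classify every token separately."""
    return [labels[argmax(linear(vector, weights, bias))] for vector in token_vectors]

# Hypothetical encoder outputs for the four tokens "<s> Ik zie </s>".
toy_outputs = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.6]]
toy_weights = [[1.0, 0.0], [0.0, 1.0]]
toy_bias = [0.0, 0.0]

print(sentence_head(toy_outputs, toy_weights, toy_bias, ["Positive", "Negative"]))
print(token_head(toy_outputs, toy_weights, toy_bias, ["A", "B"]))
```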

If you’re interested in the custom heads we have trained and their performance, see Downstream tasks.

Wrapping it up and why monolingual models matter

So we discussed language models—and their probabilistic interpretation—and how they are related to transformer models. With a lot of self-supervised training on large corpora, these models are relatively good for these tasks. Multilingual models also perform very well, but they do mix a lot of languages (over 100 for Google's mBERT).

As a language model, a multilingual model will then have to deal with very different linguistic properties. For the tokenizer, this means either a lot more tokens, or fewer tokens that represent actual words in one language. Despite these drawbacks, multilingual models might leverage some features from related languages, which could increase performance.

This is also what we observed: multilingual models do perform well on a variety of tasks, especially if sufficient training data is available. If this is not the case, monolingual models like RobBERT have a slight edge.

Downstream tasks

Anaphora resolution with die and dat

We evaluated RobBERT's performance on a task that is specific to Dutch, namely disambiguating "die" and "dat" (= "that" in English). In Dutch, depending on the sentence, both terms can be either demonstrative or relative pronouns; in addition they can also be used in a subordinating conjunction, i.e. to introduce a clause. The use of either of these words depends on the gender of the word it refers to. Distinguishing these words is a task introduced by Allein et al. (2020), who presented multiple models trained on the Europarl and SoNaR corpora. Their results ranged from an accuracy of 75.03% on Europarl to 84.56% on SoNaR.

For this task, we use the Dutch version of the Europarl corpus, which we split into 1.3M utterances for training, 319k for validation, and 399k for testing. We then process every sentence by checking if it contains "die" or "dat", and if so, add a training example for every occurrence of these words in the sentence, where a single occurrence is masked. For the test set, for example, this resulted in about 289k masked sentences. We then test two different approaches for solving this task on this dataset. The first approach is making the BERT models use their MLM task to guess which word should be filled in at this spot, and checking whether it has more confidence in "die" or in "dat" (by checking the first 2,048 guesses at most, as this seemed sufficiently large). This allows us to compare the zero-shot BERT models, i.e. without any fine-tuning after pre-training. The second approach uses the same data, but creates two sentences by filling in the mask with both "die" and "dat", joining them with the <sep> token and making the model predict which of the two sentences is correct.
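The data preparation described above can be sketched as follows. The function names, the regular expression and the exact way the two filled-in sentences are joined are assumptions of this example, not the original preprocessing code.

```python
import re

def make_masked_examples(sentence, mask="<mask>"):
    """Create one (masked sentence, correct word) pair per occurrence of "die" or "dat"."""
    examples = []
    for match in re.finditer(r"\b(die|dat)\b", sentence, flags=re.IGNORECASE):
        masked = sentence[:match.start()] + mask + sentence[match.end():]
        examples.append((masked, match.group(1).lower()))
    return examples

def make_sentence_pair(masked, mask="<mask>", sep="<sep>"):
    """Second approach: fill the mask with both words and join the two sentences."""
    return f"{masked.replace(mask, 'die', 1)} {sep} {masked.replace(mask, 'dat', 1)}"

toy_sentence = "Het boek dat ik lees is van die man."
toy_examples = make_masked_examples(toy_sentence)
for masked, answer in toy_examples:
    print(masked, "->", answer)
print(make_sentence_pair(toy_examples[0][0]))
```

For the zero-shot approach, each masked sentence would be passed to the language model, and the prediction is whichever of "die" and "dat" ranks higher among its guesses.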

High-level sentiment analysis

We also compare RobBERT's performance with other BERT models and state-of-the-art systems in sentiment analysis, to show how it performs on classification tasks. We replicated the high-level sentiment analysis task used to evaluate BERTje to be able to compare our methods. This task uses a dataset called Dutch Book Reviews Dataset (DBRD), in which book reviews scraped from hebban.nl are labeled as positive or negative. Although the dataset contains 118,516 reviews, only 22,252 of these reviews are actually labeled as positive or negative.

We trained our model for 2000 iterations with a batch size of 128 and a warm-up of 500 iterations, reaching a learning rate of 10⁻⁵. We found that our model performed better when trained on the last part of the book reviews than on the first part. This is likely due to this part containing concluding remarks summarizing the overall sentiment.
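A sketch of these two training details. It assumes a linear warm-up that stays at the peak learning rate afterwards (the text above does not say what happens after the warm-up) and a hypothetical maximum input length of 512 tokens. The function names are also just for illustration.

```python
def learning_rate(step, peak=1e-5, warmup_steps=500):
    """Linear warm-up to `peak`, then constant (the constant part is an assumption)."""
    if step < warmup_steps:
        return peak * step / warmup_steps
    return peak

def keep_last_tokens(tokens, max_length=512):
    """Truncate a long review by keeping its last `max_length` tokens."""
    return tokens[-max_length:]

# Learning rate at a few of the 2000 training iterations.
for step in (0, 250, 500, 1999):
    print(step, learning_rate(step))

# A hypothetical 600-token review keeps tokens 88..599, i.e. its concluding part.
toy_review = list(range(600))
print(keep_last_tokens(toy_review)[0])
```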

Examples on sentiment analysis

We've trained and evaluated our model on the Dutch Book Reviews Dataset (DBRD), which is a collection of labeled positive, negative and neutral reviews. This allowed us to do sentiment analysis on these reviews. We show the first examples from our test set here.

Finally, a note on the name and the logo

We named our model RobBERT, or to be more precise: our model named itself RobBERT. When we used word masking in a sentence where it had to introduce itself, it picked RobBERT as the most likely name. In a serendipitous way, this also highlighted the link to RoBERTa, so that name was perfect!

The word rob also means seal in Dutch, hence our logo is a seal being dressed up as Bert. Special thanks to Thomas Winters for the logo!