Implementing BERT for Question and Answer
Table of Contents
- Overview
- BERT Summary
- BERT Tokenization
- BERT Embeddings
- Understand BERT Outputs
- Embedding layer in BERT
- BERT for QnA
- Testing model performance
Code Reference
To access the relevant code for any section of this blog, please refer to my GitHub repo, a Python notebook where the code is explained in detail.
Live Implementation of BERT for Q&A
A video of the final model in action is attached to motivate readers to learn this amazing state-of-the-art NLP technique.
Overview
BERT (Bidirectional Encoder Representations from Transformers) is a state-of-the-art NLP technique released by Google AI Research in October 2018. The best thing about BERT is that it is an open-source transfer-learning model for almost all varieties of NLP problems. One of its most common use cases is question answering, where the model takes a passage and a question and extracts the answer to that question from the passage. In this blog post I will implement the BERT model for the question-and-answer task.
Unlike other NLP tasks such as sentiment analysis, part-of-speech tagging or named entity recognition, the question-and-answer application does not require us to fine-tune BERT ourselves, as publicly available checkpoints already fine-tuned for this task give significantly good results out of the box.
This blog is a continuation of Jay Alammar's blog, where he explains the architecture and inner workings of the BERT model amazingly well. I will try to fill in the missing dots that are not covered in that blog, and I will also share some problems that I faced while implementing this model for the question-and-answer task.
BERT Summary
If you have gone through that blog, then you must have some idea about query, key, value, self-attention, attention heads and positional encoding. Now let's go directly to the problems that you might face when you start implementing BERT models.
In this blog we will implement BERT for the Q&A task. We know that mathematical models do not accept words or sentences, so first we have to convert our sentences into numbers that the model can process. The sequence of steps is tokenization, token ids and embedding vectors.
BERT Tokenization
Tokenization is the first step, where we convert a sentence into a list of tokens such that each token can be mapped to the embedding matrix, which returns a 768- or 1024-dimensional vector for BERT-base and BERT-large respectively. Now what is this embedding matrix? I will show you this matrix when we load our BERT-large model for Q&A; for now let's first understand how the BERT tokenizer works and how it differs from other embedding techniques.
You can simply print tokenizer.vocab to see all 30,522 tokens and their corresponding token ids. There are some special tokens which the BERT tokenizer uses for special purposes (a short sketch after the list below shows how to inspect them).
- [UNK] : token_id = 100, used for unknown or misspelled words.
- [CLS] : token_id = 101, the default starting token for every input. The last hidden vector corresponding to [CLS] is used for classification. We will talk about it later in this blog.
- [SEP] : token_id = 102, used as a separation token between two sentences.
- [MASK] : token_id = 103, used for masking. During masked language modeling pre-training, a fraction of the input tokens is replaced with [MASK] and the model is trained to predict the original tokens.
- [PAD] : token_id = 0, used to pad the sentences in a batch to equal length.
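Here is a minimal sketch (using the HuggingFace transformers library, which the rest of this blog assumes is installed) that prints the vocabulary size and looks up the ids of the special tokens listed above:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

print(len(tokenizer.vocab))  # 30522

# Map each special token to its id in the vocabulary.
for token in ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]']:
    print(token, tokenizer.convert_tokens_to_ids(token))
# [PAD] 0, [UNK] 100, [CLS] 101, [SEP] 102, [MASK] 103
```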
The vocabulary size is 30,522 for the English language, and of course this cannot cover all English words. So how does the BERT tokenizer deal with words which are not in the vocabulary? Let's see.
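Continuing with the tokenizer loaded above, a quick sketch of what happens with an out-of-vocabulary word:

```python
tokens = tokenizer.tokenize('Nitesh')
print(tokens)                                   # ['ni', '##tes', '##h']
print(tokenizer.convert_tokens_to_ids(tokens))  # ids of these sub-word pieces
```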
Of course the word 'Nitesh', which is my name, is not in the vocabulary, as I am not a famous personality. One thing to notice is that when a word is not in the vocabulary, the tokenizer breaks the word down into subwords, as in this case: 'Nitesh' → 'ni' + '##tes' + '##h'.
What are those # tags?
The BERT tokenizer prefixes a token with two # characters (##) to indicate that the token is not an actual word but a continuation of the previous non-## token. Let's see one more example.
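Again a one-line sketch using the same tokenizer:

```python
print(tokenizer.tokenize('Dewangan'))  # split into three sub-word pieces (the exact pieces depend on the vocabulary)
```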
Here also the word 'Dewangan', which is my surname, is broken down into three tokens. You might be thinking that breaking unknown words down into smaller tokens could actually work, because the model generates a separate contextual hidden vector for each token. I will come back to this point when I load the model. So far we have understood BERT tokens and token ids. Now let's understand BERT embeddings.
BERT Embeddings
We have a number of word-embedding techniques such as bag of words, TF-IDF and word2vec. Of those, word2vec is an advanced technique which captures the semantic meaning of words. We will compare word2vec embeddings with BERT embeddings and understand how ingeniously BERT generates word embeddings by taking the surrounding context into consideration.
Although word2vec embeddings capture the semantic meaning of words, they do not take the context into account. Let's understand this with an example. The sentence
‘‘After stealing money from the bank vault, the bank robber was seen fishing on the Mississippi river bank.’’
uses the word 'bank' three times. The first and second 'bank' refer to a "financial establishment that uses money deposited by customers for investment", while the third 'bank' refers to a river edge. Word2vec gives the same embedding vector for all three occurrences of 'bank', whereas BERT uses the context words and generates a different embedding vector for each occurrence of the same word 'bank'.
To know more about BERT Embedding you can visit this blog.
Our actual task is Q&A, but I first want to show that the embeddings for all three occurrences of 'bank' are different. For that I will use bert-base-uncased, just for this comparison; you don't have to do this. The code below shows the tokens and their corresponding token ids.
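A minimal sketch of that step (reusing bert-base-uncased only for this comparison):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

text = ("After stealing money from the bank vault, the bank robber "
        "was seen fishing on the Mississippi river bank.")

# encode() places [CLS] at the start and [SEP] at the end for us.
input_ids = tokenizer.encode(text)
tokens = tokenizer.convert_ids_to_tokens(input_ids)

for token, token_id in zip(tokens, input_ids):
    print(f'{token:<12} {token_id}')
```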
The best thing about the huggingface library is that you don't have to do all the ground-level work like placing the [CLS] token at the beginning, the [SEP] token between sentences and at the end, handling punctuation, etc. All this dirty work is done by its built-in functions and modules; you just have to call them with the proper inputs, that's it.
Now let's load bert-base-uncased and its tokenizer (just for this embedding-vector comparison) and compare the embedding vectors generated for each "bank". I will also explain how to interpret the outputs of the BERT model in this section, so that we won't have any confusion when we dive into the BERT Q&A section.
Remember that the BERT model is pre-trained on a large dataset using masked language modeling and next-sentence prediction. So by default the model expects two sentences as input, starting with a [CLS] token and separated by a [SEP] token, even though tasks like classification and sentiment analysis do not feed two sentences as input. For batched tasks like classification we also feed an attention mask that tells the model which tokens come from the sentences and which are just padding tokens added to match lengths within a batch. Here we just want to see the embedding generated for each "bank", and we have only one sentence, so we only need to create segment ids (token_type_ids) marking every token as belonging to a single segment.
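Continuing the snippet above, the segment ids for a single sentence are simply:

```python
# One sentence only, so every token gets the same segment id (0 by convention).
segment_ids = [0] * len(input_ids)
```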
Now that we have token ids and segment ids, let's feed them as input to our BERT model and see the embeddings it generates.
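A sketch of roughly what that looks like, assuming a recent (v4+) transformers release where the outputs are exposed as named fields:

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states=True)
model.eval()  # evaluation mode: disables dropout

with torch.no_grad():
    outputs = model(torch.tensor([input_ids]),
                    token_type_ids=torch.tensor([segment_ids]))

last_hidden_state = outputs.last_hidden_state  # shape (1, 22, 768)
pooler_output = outputs.pooler_output          # shape (1, 768)
hidden_states = outputs.hidden_states          # tuple of 13 tensors, each (1, 22, 768)
```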
The word "bank" occurs at token indices 6, 10 and 19. First we will check whether the embeddings are different, and then we will examine whether they hold contextual information by computing cosine similarity. Although each embedding vector is 768-dimensional, I will print only the first five values for comparison.
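A small sketch of pulling those three vectors out of the last hidden state:

```python
# Find every occurrence of 'bank' and grab its contextual vector.
bank_indices = [i for i, t in enumerate(tokens) if t == 'bank']   # [6, 10, 19]
bank_vault, bank_robber, river_bank = (last_hidden_state[0, i] for i in bank_indices)

print('bank vault :', bank_vault[:5])
print('bank robber:', bank_robber[:5])
print('river bank :', river_bank[:5])
```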
We can see that the BERT model generates a different embedding vector for each occurrence of the same word "bank", and these embeddings become more meaningful when we look at the cosine similarity between them.
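A sketch of the comparison using PyTorch's cosine similarity:

```python
from torch.nn.functional import cosine_similarity

# The two "financial" banks should be closer to each other than either is to the river bank.
print(cosine_similarity(bank_vault, bank_robber, dim=0))   # relatively high
print(cosine_similarity(bank_vault, river_bank, dim=0))    # relatively low
```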
We can see that while generating embedding vectors for tokens, the BERT model retains contextual information: the cosine similarity between the first and second "bank" is high, whereas it is low between the first and third "bank".
Now let's understand the outputs generated by the model. Since we enabled output_hidden_states = True, we get three outputs.
Understand BERT Outputs
BERT-base has 12 BERT layers, and each BERT layer produces an embedding for every token. We get 13 hidden states because the model adds one more embedding layer (the input embeddings) at the very beginning.
- last_hidden_state : The first output we get from the model and, as the name suggests, the output of the last layer. Its size is (number of batches, number of tokens in each batch, hidden size). In our case we have one batch with 22 tokens, and for BERT-base the hidden size is 768, so the size of the last hidden state is (1, 22, 768), as can be seen in the code above.
- pooler_output : The creators of BERT noted that the pooler output, which corresponds to the last hidden state of the [CLS] token (passed through an extra dense layer with tanh), has no direct interpretability on its own, but it is a good choice as the input to a classifier that is then fine-tuned on a separate dataset. The main reason lies in the way the model was pre-trained. We must not forget that we are using BERT as transfer learning, already pre-trained on large amounts of data. The [CLS] token is always used as the starting token during pre-training with masked language modeling and next-sentence prediction; that is why the pooler output, while not interpretable by itself, captures a summary of the whole input.
- hidden_states : One good reason to love this model is its hidden states. BERT is not limited to using the pooler output to fine-tune a classifier; one can also take advantage of its hidden states. There is no hard-and-fast rule about which hidden state to use as an embedding, but depending on the dataset and the task one can try combinations such as the second-to-last layer, the sum of the last four layers, or the concatenation of the last four layers.
As a cross-check, the first output (the last hidden state) and the last element of the third output (all hidden states) should be identical.
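A two-line sanity check, continuing from the snippet above:

```python
print(torch.equal(last_hidden_state, hidden_states[-1]))  # True
print(len(hidden_states))                                  # 13 = embedding layer + 12 BERT layers
```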
So far we have covered BERT tokens, token ids, BERT embeddings, and the model outputs including the last hidden state, the pooler output and the hidden states, and we have also seen that BERT embeddings capture contextual information.
Now let's start our main task, which is Q&A. First we will load the model.
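A minimal sketch of the loading step. I use the publicly available BERT-large checkpoint that has already been fine-tuned on SQuAD with whole-word masking; the notebook may use a different checkpoint name:

```python
import torch
from transformers import BertForQuestionAnswering, BertTokenizer

model = BertForQuestionAnswering.from_pretrained(
    'bert-large-uncased-whole-word-masking-finetuned-squad')
tokenizer = BertTokenizer.from_pretrained(
    'bert-large-uncased-whole-word-masking-finetuned-squad')

print(model)  # shows the BertEmbeddings layer, 24 BERT layers and the final qa_outputs layer
```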
Embedding Layer In BERT
After loading the BERT-large model for Q&A, printing the model shows all of its layers and sub-layers. BERT-large has 24 BERT layers, one BertEmbeddings layer at the beginning and a qa_outputs layer at the end.
In this blog I will talk about the sub-layers of the embedding layer and not the BERT layers themselves, as I am assuming that readers already know about self-attention, attention heads, query, key, value, layer norm and the final feed-forward output layer. Now let's understand the sub-layers inside BertEmbeddings.
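To see just that block, you can print the embedding sub-module of the model loaded above:

```python
print(model.bert.embeddings)
# Expect to see, roughly:
#   (word_embeddings):       Embedding(30522, 1024, ...)
#   (position_embeddings):   Embedding(512, 1024)
#   (token_type_embeddings): Embedding(2, 1024)
# plus LayerNorm and Dropout sub-layers.
```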
word_embeddings : At the beginning of the embedding layer there is a sub-layer called word_embeddings, which holds a matrix of size (30522, 1024). This matrix can be thought of as a dictionary where the keys are token ids and the values are 1024-dimensional (for BERT-large) vectors. Note that this matrix is trainable, and while fine-tuning, the model will update (by back-propagation) only the vectors corresponding to tokens that are present in the training data.
Remember that this embedding matrix is just a lookup table which returns the same vector for the same token; it is the last hidden layer's output where we can expect different embeddings for the same word, capturing contextual information.
position_embeddings : BERT-base and BERT-large both accept sequences of at most 512 tokens, so before feeding inputs to the model ensure that the token length is less than or equal to 512. Unlike an LSTM or RNN, here we feed the entire sequence to the model in one go, so the model has to do something extra to capture the sequential information (which word comes before and which comes after). The model adds a positional embedding to each token to do exactly that. Unlike the fixed sinusoidal encodings of the original Transformer, BERT's positional embeddings are themselves a trainable Embedding(512, 1024) matrix learned during pre-training. So the input to this layer is the output of the word_embeddings layer, a 1024-dimensional vector for each input token, and position_embeddings adds another 1024-dimensional vector to every token to encode its position in the sequence.
token_type_embeddings : Remember that the BERT model is pre-trained on a large dataset using masked language modeling and next-sentence prediction, so by default the model expects two sentences as input separated by a [SEP] token, even though tasks like classification and sentiment analysis feed only one. (The attention mask, which marks real tokens versus padding tokens added to match lengths within a batch, is a separate input and should not be confused with segment ids.) In a task like QnA, where we feed two sequences, the question and the passage, we must provide additional segment ids: 0's for the first segment and 1's for the second segment. We will see this in detail when we pass an input to our Q&A model.
So what is the token_type_embeddings : Embedding(2, 1024) that we see when we print the model? We feed segment ids along with token ids, but the token embeddings are 1024-dimensional vectors. The token_type_embeddings matrix simply holds two learned 1024-dimensional vectors: the first row is looked up and added to every token with segment id 0, and the second row to every token with segment id 1.
Before talking about the last linear layer, let's take our first example to test the BERT-large question-answering model. BertTokenizer provides a built-in method called encode that takes a question and passage pair and performs three tasks: first it tokenizes the entire input into tokens from the vocabulary; then it places a [CLS] token at the start, a [SEP] token at the end of the first sequence and another [SEP] after the end of the last sequence; finally it converts all the tokens into token ids.
BERT for QnA
One can visualize the tokens and their corresponding token ids using the code below.
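A sketch of that step, using the passage quoted later in this post and an illustrative question of my own (the actual question in the notebook may differ):

```python
question = "What does BERT use to learn contextual relations between words?"
passage = ("BERT makes use of Transformer, an attention mechanism that learns contextual "
           "relations between words (or sub-words) in a text. In its vanilla form, Transformer "
           "includes two separate mechanisms - an encoder that reads the text input and a "
           "decoder that produces a prediction for the task. Since BERT's goal is to generate "
           "a language model, only the encoder mechanism is necessary. The detailed workings "
           "of Transformer are described in a paper by Google.")

# encode() handles [CLS]/[SEP] placement and the conversion to ids for the pair.
input_ids = tokenizer.encode(question, passage)
tokens = tokenizer.convert_ids_to_tokens(input_ids)

for token, token_id in zip(tokens, input_ids):
    print(f'{token:<15} {token_id}')
```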
Now we have to generate segment ids to tell the model which token ids belong to the first sequence (the question) and which belong to the second sequence (the passage).
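A sketch: everything up to and including the first [SEP] belongs to the question, the rest to the passage:

```python
sep_index = input_ids.index(tokenizer.sep_token_id)
segment_ids = [0] * (sep_index + 1) + [1] * (len(input_ids) - sep_index - 1)

assert len(segment_ids) == len(input_ids)
```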
Now we have processed our input (question and passage) into the form that our BERT model expects. Let's convert the processed input into tensors and feed them into the model.
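A sketch of the forward pass, again assuming a recent transformers release with named output fields:

```python
with torch.no_grad():
    outputs = model(torch.tensor([input_ids]),
                    token_type_ids=torch.tensor([segment_ids]))

# One score per input token for the start of the answer span, and one for the end.
start_scores = outputs.start_logits
end_scores = outputs.end_logits
print(start_scores.shape, end_scores.shape)  # each (1, number_of_input_tokens)
```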
Note: Previously, when we used BERT-base for the embedding comparison of the word "bank", we enabled output_hidden_states = True and got three outputs for an input (last_hidden_state, pooler_output and hidden_states), but here we get two output tensors, each with one value per input token (106 values each in my run).
To understand these outputs, let's now talk about the last linear layer of our model.
You will see this qa_outputs section at the end when you print the BERT-large for Q&A model. The last linear layer takes a 1024-dimensional vector as input and generates two outputs. After carefully reading the code provided by the huggingface library, I learned that this last linear layer takes the last hidden state of every input token and, for each token, produces two scores. The value of the first score indicates the chance of the corresponding token being the start index of the final answer: a large positive value means a high chance that the token is the start index, and a large negative value means a low chance. The same idea applies to the second score, except that it relates to the end index of the final answer.
After this, everything is very simple: you just have to find the index of the largest value in the first output and take its corresponding token as the start of the answer, and do the same with the second output to get the end of the final answer.
Instead of indexing into the outputs by position we can also use the named start_logits (index 0) and end_logits (index 1). We then remove the ## tags and combine the sub-word tokens back into words.
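A sketch of the span extraction and the ##-merging described above:

```python
# Most likely start and end positions of the answer.
answer_start = torch.argmax(start_scores).item()
answer_end = torch.argmax(end_scores).item()

# Rebuild the answer string, gluing '##' sub-word pieces back onto the previous token.
answer = tokens[answer_start]
for i in range(answer_start + 1, answer_end + 1):
    if tokens[i].startswith('##'):
        answer += tokens[i][2:]
    else:
        answer += ' ' + tokens[i]

print(answer)
```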
passage : “BERT makes use of Transformer, an attention mechanism that learns contextual relations between words (or sub-words) in a text. In its vanilla form, Transformer includes two separate mechanisms — an encoder that reads the text input and a decoder that produces a prediction for the task. Since BERT’s goal is to generate a language model, only the encoder mechanism is necessary. The detailed workings of Transformer are described in a paper by Google.’’
The passage above was used to generate this answer.
The example provided by huggingface for question answering doesn't cover the situation where the start index is greater than the end index. I didn't encounter this situation myself, but if it occurs, then I think that instead of the largest-value index we could fall back to the second-largest-value index, depending on the situation.
Now let's make a helper function that takes a question and a passage as input and returns the final answer, and then test the model on other questions.
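A sketch of such a helper, putting the steps above together (the function name and the sample question are my own; the notebook's version may differ):

```python
def answer_question(question, passage):
    """Return the answer span that the SQuAD-fine-tuned BERT model extracts from `passage`."""
    input_ids = tokenizer.encode(question, passage)
    tokens = tokenizer.convert_ids_to_tokens(input_ids)

    # Segment ids: question (and its [SEP]) -> 0, passage -> 1.
    sep_index = input_ids.index(tokenizer.sep_token_id)
    segment_ids = [0] * (sep_index + 1) + [1] * (len(input_ids) - sep_index - 1)

    with torch.no_grad():
        outputs = model(torch.tensor([input_ids]),
                        token_type_ids=torch.tensor([segment_ids]))

    start = torch.argmax(outputs.start_logits).item()
    end = torch.argmax(outputs.end_logits).item()

    # Stitch sub-word pieces back together.
    answer = tokens[start]
    for i in range(start + 1, end + 1):
        answer += tokens[i][2:] if tokens[i].startswith('##') else ' ' + tokens[i]
    return answer

print(answer_question("Why is only the encoder mechanism necessary?", passage))
```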
Testing model performance
Now you must be appreciating the brilliant idea of the creators of the BERT model and how ingeniously they use the attention mechanism: by adding just one more linear layer, the model is able to predict the start token and the end token of the final answer for a particular question. That means the model understands the meaning of the question very well and is able to find the answer in the given passage.
References
- https://arxiv.org/abs/1706.03762
- http://jalammar.github.io/illustrated-bert/
- https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/#23-segment-id
- https://huggingface.co/transformers/model_doc/bert.html#bertmodel
- https://www.youtube.com/watch?v=l8ZYCvgGu0o&t=4s
Contact
Mail : niteshkumardew11@gmail.com
Github : https://github.com/Nitesh0406/BERT-QnA.git
Linkedin : www.linkedin.com/in/nitesh-kumar-550819169