BERT — a walkthrough
An interesting state-of-the-art NLP model.
BERT stands for Bidirectional Encoder Representations from Transformers.
BERT builds on a series of clever ideas: Semi-supervised Sequence Learning (by Andrew Dai and Quoc Le), ELMo (by Matthew Peters and researchers from AI2 and UW CSE), ULMFiT (by fast.ai founder Jeremy Howard and Sebastian Ruder), the OpenAI Transformer (by OpenAI researchers Radford, Narasimhan, Salimans, and Sutskever), and the Transformer itself (Vaswani et al.). It is one of the developments that have brought a huge change to the field of NLP, alongside the free sharing of models and resources.
Unlike earlier single-direction language models, BERT reads text bidirectionally, which gives it a wider view of language context and flow and noticeably better prediction accuracy. The technique that makes this possible is known as Masked Language Modeling (Masked LM, or MLM).
Basics
In NLP, researchers have taken two basic approaches to make the best of the available text and capture language context and flow:
1. Transfer learning:
- A neural network model is first pre-trained on a known task.
- The trained network is then fine-tuned as the basis of a new model, which can be used for a wide variety of NLP applications.
2. Feature-based training:
- A pre-trained neural network produces word embeddings, which are then used as features in NLP models.
Architecture
BERT is based on the Transformer architecture.
The Transformer learns contextual relations between words (or sub-words) in a text.
It includes two mechanisms:
- an Encoder, which reads the text input, and
- a Decoder, which produces a prediction for the task.
BERT, however, uses only the Encoder stack of the Transformer architecture.
BERT was released in two sizes, BERTbase and BERTlarge (a code sketch of both configurations follows this list).
The base model:
- Is used to measure the performance of the architecture in comparison with other architectures.
- 12 layers (transformer blocks), 12 attention heads & 110 million parameters.
The large model:
- Produces the state-of-the-art results reported in the research papers.
- 24 layers (transformer blocks), 16 attention heads & 340 million parameters.
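For a concrete view of the two sizes, here is a minimal sketch of the corresponding hyperparameters, expressed with the Hugging Face transformers library (an illustrative assumption; that library is not used elsewhere in this article, and the hands-on section below relies on a different toolchain):

# A rough sketch of the two published configurations; the hidden sizes (768 and
# 1024) are the values from the BERT paper, not something trained here.
from transformers import BertConfig

bert_base = BertConfig(
    num_hidden_layers=12,    # 12 transformer blocks
    num_attention_heads=12,  # 12 attention heads
    hidden_size=768,         # ~110 million parameters in total
)

bert_large = BertConfig(
    num_hidden_layers=24,    # 24 transformer blocks
    num_attention_heads=16,  # 16 attention heads
    hidden_size=1024,
    intermediate_size=4096,  # ~340 million parameters in total
)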
Behind the scenes
A number of processing steps happen in order to make the model work; the related concepts are described below for a better understanding.
INPUT: a sequence of tokens, which are embedded into vectors and then processed by the neural network.
Text Preprocessing
A particular set of rules has been defined to represent the input text for the model.
Every input embedding is a combination of 3 embeddings (a toy sketch of how they are summed follows this list):
1. Position Embeddings:
- Express the position of each word in the sentence.
- Help overcome a limitation of the Transformer, which on its own cannot capture the “sequence” or “order” of the input.
2. Segment Embeddings:
- BERT can also take sentence pairs as input for certain tasks.
- Learning a unique embedding for the first and for the second sentence helps the model distinguish between them.
3. Token Embeddings:
- Embeddings learned for each specific token from the token vocabulary.
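As a toy illustration of how the three embeddings combine, the sketch below sums random lookup tables with NumPy; in BERT itself all three tables are learned, and the vocabulary, sequence, and sizes here are made up:

import numpy as np

# Toy sizes for illustration only; BERTbase uses a ~30K-token vocabulary
# and 768-dimensional embeddings.
vocab_size, max_positions, hidden = 100, 16, 8

token_table    = np.random.randn(vocab_size, hidden)     # one row per vocabulary token
segment_table  = np.random.randn(2, hidden)              # sentence A vs. sentence B
position_table = np.random.randn(max_positions, hidden)  # one row per position

token_ids   = np.array([5, 17, 42, 3])   # hypothetical token ids for a 4-token input
segment_ids = np.array([0, 0, 1, 1])     # first two tokens from sentence A, rest from B
positions   = np.arange(len(token_ids))  # 0, 1, 2, 3

# Each input embedding is the element-wise sum of the three lookups.
input_embeddings = (token_table[token_ids]
                    + segment_table[segment_ids]
                    + position_table[positions])
print(input_embeddings.shape)  # (4, 8)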
BERT undergoes pretraining on two NLP tasks:
1. Masked LM (Masked Language Modeling)
Before word sequences are fed into BERT, 15% of the tokens in each sequence are replaced with a [MASK] token. The task of the model is then to predict the original word at each masked position, based on the context provided by the non-masked words. This is what gives Masked LM (MLM) its name.
The process of predicting the masked words involves:
- Adding a classification layer on top of the encoder output.
- Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
- Calculating the probability of each word in the vocabulary (via softmax).
The BERT loss function considers only the predictions at the masked positions and ignores the predictions for the non-masked words. This forces the model to rely on the surrounding context, which in turn increases its accuracy (a toy sketch of the procedure follows).
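The toy sketch below walks through those steps with random numbers standing in for the encoder: it masks roughly 15% of a token sequence, projects the encoder output onto the vocabulary dimension, applies softmax, and computes the loss only at the masked positions. Everything here is made up for illustration and is not BERT's actual training code:

import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden, seq_len = 50, 8, 20

tokens = rng.integers(0, vocab_size, size=seq_len)    # a toy 20-token sequence
MASK_ID = vocab_size                                  # pretend [MASK] is an extra id

# choose ~15% of the positions (3 of 20) and replace them with [MASK]
mask_positions = np.zeros(seq_len, dtype=bool)
mask_positions[rng.choice(seq_len, size=3, replace=False)] = True
masked_tokens = np.where(mask_positions, MASK_ID, tokens)
print("masked input:", masked_tokens)

# stand-ins for the encoder output and the embedding matrix
encoder_output = rng.standard_normal((seq_len, hidden))
embedding_matrix = rng.standard_normal((vocab_size, hidden))

# classification layer: project onto the vocabulary dimension, then softmax
logits = encoder_output @ embedding_matrix.T          # shape (seq_len, vocab_size)
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

# cross-entropy over the masked positions only; non-masked predictions are ignored
loss = -np.log(probs[mask_positions, tokens[mask_positions]]).mean()
print("masked-LM loss on the toy example:", loss)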
2. Next Sentence Prediction
Masked LM teaches the model the relationships between words. In addition, BERT is trained on the task of Next Sentence Prediction for tasks that require an understanding of the relationship between sentences.
Given two sentences, A and B: is B the actual sentence that follows A in the corpus, or just a random sentence with no relation to it?
We humans can surely do this, but how does BERT do it?
BERT receives pairs of sentences as input and learns to predict whether the second sentence in the pair is the subsequent sentence in the original document.
Because the model has to distinguish between the two sentences during training, the input is processed slightly differently than for Masked LM:
- Since this is a binary classification task, the training data can be easily generated from any corpus by splitting it into sentence pairs (a toy pair-generation sketch follows this list).
- The entire input sequence goes through the Transformer model.
- The output of the [CLS] token (added at the start of the input sequence) is transformed into a 2x1-shaped vector using a simple classification layer.
- The probability of the binary isNextSequence label is calculated, telling us whether the second sentence really belongs after the first.
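Here is a toy sketch of how such sentence-pair training data can be generated from a corpus; the corpus and sampling are deliberately simplified, and a real pipeline would also add the [CLS] and [SEP] tokens and the segment embeddings described earlier:

import random

corpus = [
    "BERT is a bidirectional encoder.",
    "It is pre-trained on large text corpora.",
    "Fine-tuning adapts it to downstream tasks.",
    "The weather was pleasant yesterday.",
]

def make_nsp_pairs(sentences, n_pairs=4, seed=42):
    """Build (sentence A, sentence B, is_next) examples: roughly half actual
    next sentences, half randomly sampled ones."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        i = rng.randrange(len(sentences) - 1)
        if rng.random() < 0.5:
            pairs.append((sentences[i], sentences[i + 1], 1))       # isNextSequence = 1
        else:
            # a randomly chosen sentence (it may occasionally be the true next one,
            # which a real pipeline would avoid)
            pairs.append((sentences[i], rng.choice(sentences), 0))  # isNextSequence = 0
    return pairs

for a, b, label in make_nsp_pairs(corpus):
    print(label, "|", a, "->", b)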
Applications
- Classification tasks such as sentiment analysis are done similarly to Next Sentence classification, by adding a classification layer on top of the Transformer output for the [CLS] token.
- In Question Answering tasks, the software receives a question about a text sequence and is required to mark the answer within the sequence. Using BERT, a Q&A model can be trained by learning two extra vectors that mark the beginning and the end of the answer (a rough sketch of this span-scoring step follows this list).
- In Named Entity Recognition (NER), the software receives a text sequence and is required to mark the various types of entities (Person, Organization, Date, etc) that appear in the text. Using BERT, a NER model can be trained by feeding the output vector of each token into a classification layer that predicts the NER label.
- Classifying hate speech on Twitter: in the hands-on section below, we will identify tweets that contain hate speech, i.e. tweets with a racist or sexist sentiment associated with them.
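For the Question Answering case mentioned above, here is a rough NumPy sketch of the span-scoring idea: each token's output vector is scored against a learned "start" vector and a learned "end" vector, and the highest-scoring positions mark the answer. All values below are random placeholders, purely for illustration:

import numpy as np

rng = np.random.default_rng(1)
seq_len, hidden = 12, 8                                  # toy sequence length and hidden size

token_outputs = rng.standard_normal((seq_len, hidden))   # stand-in for BERT's per-token output
start_vector = rng.standard_normal(hidden)               # learned "start of answer" vector
end_vector = rng.standard_normal(hidden)                 # learned "end of answer" vector

start_scores = token_outputs @ start_vector              # how strongly each token starts the answer
end_scores = token_outputs @ end_vector                  # how strongly each token ends the answer

start = int(np.argmax(start_scores))
end = int(np.argmax(end_scores[start:])) + start         # the end must not precede the start
print("predicted answer span:", (start, end))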
BERT Hands-on
Now we will use BERT for an NLP task of our own. Working with BERT can be quite time consuming because of its bidirectional approach to training on the data set, so the process might even take a few days, but what is achieved is a lot better.
Twitter has for some time had trouble tackling instances of hate speech and violence, which makes this a fitting use case.
STEPS:
- Tokenizing
- Cleaning text
- Splitting into training and validation set.
- Training the classification model
- Checking its accuracy.
Reading the tweets:
import pandas as pd
import numpy as np

# load training data
train = pd.read_csv('BERT_proj/train_E6oV3lV.csv', encoding='iso-8859-1')
train.shape
Cleaning the text
import re

# clean text from noise
def clean_text(text):
    # filter to allow only alphabets (and apostrophes)
    text = re.sub(r"[^a-zA-Z']", ' ', text)
    # remove non-ASCII (Unicode) characters
    text = re.sub(r'[^\x00-\x7F]+', '', text)
    # convert to lowercase to maintain consistency
    text = text.lower()
    return text

train['clean_text'] = train.tweet.apply(clean_text)
Splitting into training and validation set
from sklearn.model_selection import train_test_split

# split into training and validation sets
X_tr, X_val, y_tr, y_val = train_test_split(train.clean_text, train.label,
                                             test_size=0.25, random_state=42)
print('X_tr shape:', X_tr.shape)
Embeddings for training and validation sets
from bert_serving.client import BertClient

# make a connection with the BERT server using its IP address
bc = BertClient(ip="49.36.145.191")
# get the embedding for train and val sets
X_tr_bert = bc.encode(X_tr.tolist())
X_val_bert = bc.encode(X_val.tolist())
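Note: this step assumes a BERT model is already being served from that IP address via the bert-as-service package; the server is launched separately (for example with bert-serving-start -model_dir <path-to-a-pretrained-BERT-checkpoint> -num_worker=1, where the path is a placeholder for your own downloaded model), and the IP shown is the original author's, so replace it with the address of your own server.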
Training the classification model
from sklearn.linear_model import LogisticRegression

# LR model
model_bert = LogisticRegression()
# train
model_bert = model_bert.fit(X_tr_bert, y_tr)
# predict
pred_bert = model_bert.predict(X_val_bert)
Checking the classification accuracy
from sklearn.metrics import accuracy_score

print(accuracy_score(y_val, pred_bert))
Conclusion
BERT makes NLP approachable and allows fast fine-tuning, which opens the door to a wide range of practical applications.
Some closing observations on the model:
- Model size matters, even at a huge scale. For example, BERTlarge, with 340 million parameters, does better on small-scale tasks than BERTbase, which uses the same architecture but only 110 million parameters.
- With enough training data, more training steps lead to higher accuracy. For example, on one task, BERTbase accuracy improves by 1% when trained on 1 million steps compared to 500K steps at the same batch size.
- The bidirectional approach (MLM) converges more slowly than left-to-right approaches (because only a percentage of the words is predicted in each batch), but it outperforms them after a small number of pre-training steps.
Sources:
- http://jalammar.github.io/illustrated-bert/
- BERT_demystified (Analytics Vidhya)