BERT: a walkthrough

An interesting state-of-the-art NLP model.

Guneshwar Singh
7 min read · Aug 6, 2022

BERT stands for Bidirectional Encoder Representations from Transformers.

BERT builds upon several clever ideas: Semi-supervised Sequence Learning (by Andrew Dai and Quoc Le), ELMo (by Matthew Peters and researchers from AI2 and UW CSE), ULMFiT (by fast.ai founder Jeremy Howard and Sebastian Ruder), the OpenAI Transformer (by OpenAI researchers Radford, Narasimhan, Salimans, and Sutskever), and the Transformer (Vaswani et al.). It is one of the developments that have brought a huge change to the field of NLP, the free sharing of pre-trained models and code being another.

Unlike earlier models, BERT aims at a deeper analysis of text. It achieves this by being bi-directional, which gives it a wider view of language context and flow than any single-direction language model, and this translates into markedly better prediction accuracy. The training technique that makes this possible is known as Masked LM (MLM).

BASICS

To make the best use of a given text, capturing language context and flow, NLP researchers have taken two basic approaches:

  1. Transfer learning:
  • It involves pre-training a neural network model on a known task first.
  • Then fine-tuning, using the trained neural network as the basis of a new model that can be used for a wide variety of NLP applications.

  2. Feature-based training:
  • A pre-trained neural network produces word embeddings, which are then used as features in NLP models (a minimal sketch of both approaches follows this list).
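
To make the distinction concrete, here is a minimal sketch of both approaches. It assumes the Hugging Face transformers and torch packages are installed and uses the bert-base-uncased checkpoint purely as an example (the hands-on section later in this article uses bert-as-service instead).

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("BERT produces contextual embeddings.", return_tensors="pt")

# Feature-based: freeze BERT and use its output vectors as fixed features
# for a separate downstream model.
with torch.no_grad():
    features = model(**inputs).last_hidden_state   # (1, seq_len, 768)

# Fine-tuning: keep BERT's weights trainable (the default) and update them
# together with a small task-specific head; the training loop is omitted here.
for param in model.parameters():
    param.requires_grad = True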

ARCHITECTURE

Based on the Transformer Architecture.

The Transformer architecture learns contextual relations between words (or sub-words) in a text.

It includes two mechanisms:

  • Encoder — which reads the text input
  • Decoder — produces a prediction for the task.

In the case of BERT, however, only the encoder stack of the Transformer architecture is used.
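
A quick way to see this, as a sketch assuming the Hugging Face transformers package (bert-base-uncased is just an example checkpoint): the loaded model exposes an encoder stack and no decoder.

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased")
print(type(model.encoder).__name__)   # BertEncoder -- there is no decoder module
print(len(model.encoder.layer))       # 12 transformer blocks in the base model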

The two BERTs

BERT was released in two sizes: BERTbase and BERTlarge.

The base model:

  • Serves as a baseline for comparing the architecture's performance against other comparable architectures.
  • 12 layers (transformer blocks), 12 attention heads & 110 million parameters.

The large model:

  • Produces the state-of-the-art results reported in the research papers.
  • 24 layers (transformer blocks), 16 attention heads & 340 million parameters (a short configuration check follows this list).
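
The numbers above can be verified directly from the published configurations. A small check, assuming the Hugging Face transformers package and the bert-base-uncased / bert-large-uncased checkpoints:

from transformers import BertConfig

base = BertConfig.from_pretrained("bert-base-uncased")
large = BertConfig.from_pretrained("bert-large-uncased")

# layers, attention heads and hidden size for each published configuration
print(base.num_hidden_layers, base.num_attention_heads, base.hidden_size)     # 12, 12, 768
print(large.num_hidden_layers, large.num_attention_heads, large.hidden_size)  # 24, 16, 1024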

BEHIND THE SCENES

A number of steps happen behind the scenes to make the model work; the related concepts are explained below for better understanding.

INPUT: a sequence of tokens, which are embedded into vectors and then processed by the neural network.

Text Preprocessing

A particular set of rules has been defined to represent input text for the model.

Every input embedding is a combination of 3 embeddings:

  1. Position Embedding:
  • Helps express the position of words in a sentence.
  • Helps overcome the limitation of the transformer, which is not able to capture the “sequence” and “order” of information.

2. Segment Embeddings:

  • BERT can also take sentence pairs as input for certain tasks.
  • A unique learned embedding for the first and the second sentence helps the model distinguish between them.

3. Token Embeddings:

  • Embeddings learned for each specific token from the token vocabulary (how the three embeddings are combined is sketched below).
BERT Embeddings
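
To see the three embeddings in code, here is a short sketch (assuming transformers and torch; the sentence pair is made up) that looks up each embedding and adds them the way BERT's embedding layer does, before LayerNorm and dropout:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

enc = tokenizer("How are you?", "I am fine.", return_tensors="pt")

tok = model.embeddings.word_embeddings(enc["input_ids"])             # token embeddings
seg = model.embeddings.token_type_embeddings(enc["token_type_ids"])  # segment embeddings
pos = model.embeddings.position_embeddings(
    torch.arange(enc["input_ids"].size(1)).unsqueeze(0))             # position embeddings

combined = tok + seg + pos   # the three are summed element-wise
print(combined.shape)        # (1, sequence_length, 768)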

Pre-training tasks

BERT undergoes pretraining on two NLP tasks:

  1. Masked LM (Masked Language Modeling)

Before feeding word sequences into BERT, 15% of the words in each sequence are replaced with a [MASK] token. The model's task is then to predict the original words at the masked positions (a quick fill-in-the-blank demo follows the steps below).

This masking is what makes BERT a Masked Language Model (MLM).

The process of prediction of the missing word includes:

  • Adding a classification layer on top of the encoder output.
  • Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
  • Calculating the probability of each word in the vocabulary.
Masked Language Modelling
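
Here is the promised fill-in-the-blank demo: a short sketch using the Hugging Face fill-mask pipeline (an assumption on top of this article's setup; any BERT checkpoint with an MLM head works, and the sentence is made up).

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# the pipeline returns the most probable vocabulary words for the [MASK] slot
for pred in fill_mask("The goal of BERT is to [MASK] the missing words."):
    print(pred["token_str"], round(pred["score"], 3))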

2. Next Sentence Prediction

MLMs learn to understand the relationships between words. Additionally, BERT is also trained on the task of Next Sentence Prediction for tasks that require an understanding of the relationship between sentences.

Given two sentences, A and B: is B the actual sentence that follows A in the corpus, or just a random sentence with no relation to it?

BERT — the word and sentence predictor.

Q: A human can surely do this, but how does BERT do it?

BERT receives pairs of sentences as input and learns to predict if the second sentence in the pair is the subsequent sentence in the original document.

To help the model distinguish between the two sentences during training, the input is processed in a slightly different way:

  1. Since it is a binary classification task, the data can be easily generated from any corpus by splitting it into sentence pairs.
  2. The entire input sequence goes through the Transformer model.
  3. The output of the [CLS] token (added at the start of the input) is transformed into a 2x1 shaped vector, using a simple classification layer.
  4. The probability of IsNextSequence (a binary label) is calculated to tell whether sentence B is supposed to follow sentence A (a minimal inference sketch follows these steps).
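
The inference sketch referenced above, using the pre-trained next-sentence head shipped with the bert-base-uncased checkpoint (assuming transformers and torch; the sentence pair is made up):

import torch
from transformers import BertForNextSentencePrediction, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")

enc = tokenizer("He went to the shop.", "He bought some milk.", return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits          # shape (1, 2)

# index 0 = "B follows A", index 1 = "B is a random sentence"
print(torch.softmax(logits, dim=-1))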

APPLICATIONS

  1. Classification tasks such as sentiment analysis are done similarly to Next Sentence classification, by adding a classification layer on top of the Transformer output for the [CLS] token.
  2. In Question Answering tasks, the software receives a question regarding a text sequence and is required to mark the answer in the sequence. Using BERT, a Q&A model can be trained by learning two extra vectors that mark the beginning and the end of the answer.
  3. In Named Entity Recognition (NER), the software receives a text sequence and is required to mark the various types of entities (Person, Organization, Date, etc.) that appear in the text. Using BERT, a NER model can be trained by feeding the output vector of each token into a classification layer that predicts the NER label.
  4. Classifying hate speech on Twitter: identifying tweets that have a racist or sexist sentiment associated with them (this is the hands-on example below; a quick taste of the first three tasks follows this list).
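
As a quick taste of the first three task types, here is a hedged sketch using off-the-shelf Hugging Face pipelines (the default pipeline checkpoints are an assumption and are not necessarily BERT models; the example inputs are made up):

from transformers import pipeline

# 1. classification / sentiment analysis
print(pipeline("sentiment-analysis")("I really enjoyed this walkthrough."))

# 2. question answering over a short context
print(pipeline("question-answering")(
    question="What does BERT stand for?",
    context="BERT stands for Bidirectional Encoder Representations from Transformers."))

# 3. named entity recognition with grouped entities
print(pipeline("ner", aggregation_strategy="simple")("Sundar Pichai works at Google in California."))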

BERT Hands-on

Now we will use BERT for an NLP task of our own. Using BERT can be really time-consuming because of its bi-directional approach to training on a dataset, so the process may even take a few days, but what is achieved is a lot better.

STEPS:

  1. Tokenizing
  2. Cleaning text
  3. Splitting into training and validation sets.
  4. Training the classification model
  5. Checking its accuracy.

Twitter has for some time had trouble tackling instances of hate speech and violence.

BERT pipeline

Reading the tweets:

import pandas as pd
import numpy as np

# load training data
train = pd.read_csv('BERT_proj/train_E6oV3lV.csv', encoding='iso-8859-1')
train.shape

Cleaning the text:

import re

# clean text from noise
def clean_text(text):
    # filter to allow only alphabets and apostrophes
    text = re.sub(r"[^a-zA-Z']", ' ', text)

    # remove non-ASCII (Unicode) characters
    text = re.sub(r'[^\x00-\x7F]+', '', text)

    # convert to lowercase to maintain consistency
    text = text.lower()

    return text

train['clean_text'] = train.tweet.apply(clean_text)

Splitting into training and validation sets

from sklearn.model_selection import train_test_split

# split into training and validation sets
X_tr, X_val, y_tr, y_val = train_test_split(train.clean_text, train.label, test_size=0.25, random_state=42)

print('X_tr shape:', X_tr.shape)

Embeddings for training and validation sets:

from bert_serving.client import BertClient

# make a connection with the BERT server using its IP address
bc = BertClient(ip='49.36.145.191')
# get the embedding for train and val sets
X_tr_bert = bc.encode(X_tr.tolist())
X_val_bert = bc.encode(X_val.tolist())
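
The snippet above assumes a bert-as-service server is already running at that IP. If one is not available, a rough stand-in (my assumption, not part of the original tutorial) is to mean-pool BERT's last hidden states locally with the Hugging Face transformers package:

import numpy as np
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode(texts, batch_size=32):
    # returns one mean-pooled 768-d vector per text, similar in spirit to bc.encode
    vectors = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], padding=True,
                              truncation=True, max_length=128, return_tensors="pt")
            out = bert(**batch).last_hidden_state           # (batch, seq_len, 768)
            mask = batch["attention_mask"].unsqueeze(-1)    # zero out padding tokens
            vectors.append(((out * mask).sum(1) / mask.sum(1)).numpy())
    return np.vstack(vectors)

# drop-in replacement for the bc.encode calls above:
# X_tr_bert = encode(X_tr.tolist())
# X_val_bert = encode(X_val.tolist())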

Training the classification model

from sklearn.linear_model import LogisticRegression

# LR model
model_bert = LogisticRegression()
# train
model_bert = model_bert.fit(X_tr_bert, y_tr)
# predict
pred_bert = model_bert.predict(X_val_bert)

Checking the classification accuracy

from sklearn.metrics import accuracy_score

print(accuracy_score(y_val, pred_bert))

Conclusion

BERT makes state-of-the-art NLP approachable and allows fast fine-tuning, which opens the door to a wide range of practical applications.

Some takeaways from analysis of the model:

  • Model size matters, even at a huge scale. e.g.: BERTlarge, with 340 million parameters, does better on small-scale tasks than BERTbase, which uses the same architecture with only 110 million parameters.
  • With enough training data, more training steps lead to higher accuracy. e.g.: on one task, BERTbase accuracy improves by 1% when trained on 1 million steps compared to 500k steps with the same batch size.
  • The bi-directional approach (MLM) converges more slowly than left-to-right approaches (because only a percentage of the words is predicted in each batch), but it outperforms them after a small number of pre-training steps.
Accuracy vs. pre-training steps
