End-to-end Machine Learning — Machine Learning Model

Nicola Massarenti
Mar 31, 2021

This article is the third in a series of articles on how to build an end-to-end machine learning pipeline for a chatbot AI that relies on a user-defined knowledge base. You can read about the description of the use case and the other articles in the series here.

In this article, you will read about the machine learning model, based on BERT, and how to execute a training job on the Google AI Platform.

You can explore the code at my GitHub repository.

The framework

The panorama of frameworks and packages available to design and build machine learning models is extremely rich: TensorFlow, PyTorch, scikit-learn, Caffe, Spark ML, Ludwig, etc.

The framework I used to build the machine learning model, which is designed to learn multilingual F.A.Q.s with user-defined special keywords, is TensorFlow, one of the most widely used frameworks. It makes it easy to define custom neural networks, to use standard estimators and, in addition, to set up distributed training.

Transformer and BERT

BERT [1] stands for Bidirectional Encoder Representations from Transformers: the model that in 2018 took the scientific and professional natural language processing (NLP) community by storm. The key technical innovation is that BERT applies bidirectional training of the Transformer [2], another key architecture published the year before, to language modelling. In contrast, previous methodologies looked at a text sequence either from left to right or with a combination of left-to-right and right-to-left training. BERT shows that bidirectional training of Transformers allows the model to grasp a deeper sense of language context and flow than single-direction language models. Its evaluation produced new state-of-the-art results on eleven natural language processing tasks.

But let's start with the basics: the Transformer. It's a model that uses neither Convolutional Neural Networks [3] (CNNs) nor Recurrent Neural Networks [3] (RNNs) in its structure, but only attention mechanisms. The main reason for passing over these well-known models is that RNNs are not easily parallelizable, while CNNs require many layers to capture long-term dependencies in the sequential structure of the data, and do not always succeed in the attempt. Hence the NLP community needed a new autoregressive model able to produce outputs that depend not only on the input but also on previous outputs, and this is where the attention mechanism came in to help.

Let’s consider the basic attention mechanism in Figure 1:

Figure 1: basic attention mechanism

The blocks denoted by A are the input sequence and, in general, can be connected to each other if you are using an RNN; the blocks denoted by B are analogous and represent the output sequence. Here you can immediately notice the autoregressive behavior of this architecture: the second input A requires the previous output B. The equations that describe the mechanism are:

$$e_{iu} = a(s_{i-1}, h_u), \qquad \alpha_{iu} = \frac{\exp(e_{iu})}{\sum_{u'} \exp(e_{iu'})}, \qquad c_i = \sum_{u} \alpha_{iu} h_u$$

where $i$ stands for the $i$-th decoder timestep and $u$ for the $u$-th encoder timestep. The latter two equations describe the softmax layer computing the attention probabilities $\alpha_{iu}$ and the weighted sum producing the context vector $c_i$, whereas the first is a bit more complex: it states that the relationship between the decoder state $s_{i-1}$ and the encoder state $h_u$ can be expressed either as a dot product (dot-product attention) or as the output of a small neural network (additive attention).

In the paper “Attention is all you need” [2], which introduced the Transformer, you’ll probably see the diagram shown in Figure 2.

Figure 2: attention mechanism

Well… it's not rocket science, but it's certainly not intuitive. If you identify the queries with the decoder states and the keys and values with the encoder states, we return to the situation illustrated in Figure 1. Indeed, in the paper the (scaled) dot-product attention is expressed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

where $Q$, $K$ and $V$ stand for Query, Key and Value respectively, and $d_k$ is the dimension of the queries and keys.

This is a naïve approach to the attention mechanism because we are not considering it in combination with any other deep learning structure, but nonetheless the output is generated in an autoregressive manner. In addition, when the queries, keys and values come from the same sequence we call this the self-attention layer.
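To make the notation concrete, here is a minimal TensorFlow sketch of (scaled) dot-product self-attention; the function name and the toy shapes are mine, not taken from the repository:

```python
import tensorflow as tf

def scaled_dot_product_attention(query, key, value):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = tf.cast(tf.shape(key)[-1], tf.float32)
    # Similarity scores between every query and every key.
    scores = tf.matmul(query, key, transpose_b=True) / tf.math.sqrt(d_k)
    # Softmax turns the scores into attention weights (probabilities).
    weights = tf.nn.softmax(scores, axis=-1)
    # The output is the weighted sum of the values.
    return tf.matmul(weights, value)

# Toy usage: one sequence of 4 tokens with model dimension 8.
x = tf.random.normal((1, 4, 8))
out = scaled_dot_product_attention(x, x, x)   # self-attention: Q = K = V
print(out.shape)                              # (1, 4, 8)
```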

The full architecture of the Transformer is depicted in Figure 3. The main difference from other models is that the Transformer is entirely based on attention mechanisms and point-wise fully connected layers, for both the encoder and the decoder. Both modules are therefore computationally cheaper than competing architectures that reach similar test scores.

Figure 3: Transformer architecture

The Transformer is composed of 4 main components: the Input-Output, the Positional Encoding, the Encoder and the Decoder.

The Input-Output component is made up of the sentences encoded with byte-pair encoding [4]: the input of the Transformer is the sequence split into sub-words, called tokens, and the same holds for the output, with the only difference that at training time the output tokens act as labels. With this procedure each token is mapped to an embedding vector of size 512, so the whole sentence becomes a matrix of real numbers.

Positional Encoding, instead, was originally introduced in [5] and is a way of injecting order information into sequential data when neither recurrence nor convolution is available, due to the absence of RNNs and CNNs.
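As an illustration, here is a minimal NumPy sketch of the sinusoidal positional encoding used in the Transformer paper (the authors also experimented with learned position embeddings, which is what BERT later adopts):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(...)."""
    positions = np.arange(max_len)[:, np.newaxis]             # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                  # (1, d_model)
    angles = positions / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles[:, 0::2] = np.sin(angles[:, 0::2])                 # even dimensions
    angles[:, 1::2] = np.cos(angles[:, 1::2])                 # odd dimensions
    return angles                                             # (max_len, d_model)

pe = sinusoidal_positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```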

The Encoder structure is shown in Figure 4.

Figure 4: the Transformer encoder

The encoder is a stack of N=6 identical layers, each composed of two sub-layers: a multi-head attention mechanism and a fully-connected feed-forward net, with residual connections [6] around both stages, each followed by a layer normalization [7] step. The output of each encoder sub-layer can be expressed as

$$\mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))$$

where every layer in the encoder produces outputs of dimension 512 and only the output of the last layer is sent to the decoder.

The encoder multi-head attention mechanism is a generalization of the basic attention mechanism analyzed before. Previously, attention was computed in a single representation subspace of fixed dimension, but what if we could jointly use information from different representation subspaces at different positions? Well, multi-head attention does exactly this. You can think of it as dot-product attention computed by blocks, with each head attending to a different learned projection of the input. As can be seen in Figure 5, after the individual dot-products are computed, the h different representations are concatenated and projected again. In the paper “Attention is all you need” the authors took h=8, with each scaled dot-product attention head having a projection size of 64 elements.

Figure 5: multi-head attention
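In TensorFlow this layer is available off the shelf; a short sketch with the paper's settings (h=8 heads, 64-dimensional projections) could look like this:

```python
import tensorflow as tf

# 8 heads, each projecting queries and keys to 64 dimensions, as in the paper.
mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=64)

x = tf.random.normal((1, 10, 512))   # (batch, sequence length, d_model)
out = mha(query=x, value=x, key=x)   # self-attention over the same sequence
print(out.shape)                     # (1, 10, 512)
```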

Residual connections [6] are used to address a well-known problem of deep learning models: the more layers they have, the harder it is to train them effectively. To deal with this issue, the idea is to feed the deeper layers an untouched copy of the data coming from the previous layers.

Figure 6: residual connections

Layer normalization, instead, has the objective of speeding up the training of the neural network by reducing the time needed for convergence. Differently from batch normalization, here the normalization statistics are computed within each single example, across its features, before the nonlinearity takes place.
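Both ingredients fit in a couple of lines of TensorFlow; the following sketch (my own, not the repository's code) shows the LayerNorm(x + Sublayer(x)) pattern used around every Transformer sub-layer:

```python
import tensorflow as tf

def residual_block(x, sublayer):
    """Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))."""
    return tf.keras.layers.LayerNormalization(axis=-1)(x + sublayer(x))

ffn = tf.keras.layers.Dense(512, activation="relu")   # a stand-in sub-layer
x = tf.random.normal((2, 10, 512))
y = residual_block(x, ffn)   # the untouched input x is added back to the sub-layer output
print(y.shape)               # (2, 10, 512)
```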

The Decoder resembles the Encoder: it's made up of a stack of 6 layers but, in contrast to the Encoder layers that had 2 sub-layers, each Decoder layer has three sub-layers: a masked multi-head attention, a multi-head attention over the encoder output and a fully-connected feed-forward net (besides the usual residual connections and layer normalization).

Figure 7: the Transformer decoder

The masked multi-head attention is responsible for the masked causal attention, that is, the mechanism that hides the features belonging to future states of the sequence. Thanks to this strategy the decoder is both autoregressive and causal, meaning that the generation of the current state is not conditioned in any way by future states.

Now that we have understood how the Transformer works, it's time to go deeper into BERT. BERT's goal is to generate a language model: for this reason, compared to the full Transformer architecture explained earlier, BERT only uses the encoder. The Transformer encoder is indeed bidirectional: it reads the entire input sequence at once, rather than sequentially left-to-right or right-to-left, allowing the model to learn the context of a word from all of its surroundings. This architecture allows BERT to be trained on the Masked Language Model (MLM) and Next Sentence Prediction (NSP) tasks.

The authors presented two BERT models: denoting with L the number of layers (i.e. Transformer blocks), with H the hidden size and A the number of self-attention heads, BERT BASE had (L=12, H=768, A=12) with a total number of 110M parameters while BERT LARGE had (L=24, H=1024, A=16), resulting in 340M parameters.

In order to make BERT suitable for a variety of downstream tasks, the authors structured the input so that it can unambiguously represent both a single sentence and a pair of sentences (e.g. a question and an answer). As explained previously, the Transformer requires the input sentence to be represented as tokens. In particular, BERT requires the sentence to be split into tokens, with the special token [CLS] at the beginning of the sequence and the special token [SEP] at the end. Sentence pairs are differentiated in two ways: first, the special token [SEP] is added between the two sentences; second, a learned embedding indicating whether it belongs to sentence A or sentence B is added to every token.

Figure 8: BERT inputs
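A purely illustrative sketch of the resulting layout for a hypothetical sentence pair (the tokens here are whole words for readability; in practice BERT uses sub-word tokens):

```python
# Hypothetical tokenization of a sentence pair, just to show the layout.
tokens_a = ["how", "do", "i", "reset", "my", "password"]
tokens_b = ["click", "on", "forgot", "password"]

tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
# Segment ids: 0 for sentence A (including [CLS] and its [SEP]), 1 for sentence B.
segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

print(tokens)
print(segment_ids)   # [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```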

The Masked Language Model task requires some of the tokens from the input to be randomly masked during training, the objective being for the model to predict the original vocabulary id of the masked word based only on its context. Unlike left-to-right language model pre-training, the MLM objective lets the representation consider both the left and the right context, enabling the pre-training of a deep bidirectional Transformer.

To train BERT MLM the authors masked 15% of the words in each sequence, replacing them with the token [MASK]. The prediction is obtained by adding a final softmax (classification) layer on top of the Transformer encoder and computing the probabilities over the vocabulary.
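A simplified sketch of the masking step (the paper actually replaces only 80% of the selected tokens with [MASK], keeping the rest random or unchanged; that refinement is omitted here):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    """Randomly mask ~15% of the tokens and keep the originals as labels."""
    masked, labels = [], []
    for token in tokens:
        if token not in ("[CLS]", "[SEP]") and random.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)    # the model must recover this word
        else:
            masked.append(token)
            labels.append(None)     # no prediction required here
    return masked, labels

masked, labels = mask_tokens(["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"])
```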

Training for the Next Sentence Prediction task is fundamental for downstream tasks such as Question Answering (QA) and Natural Language Inference (NLI). Specifically, when choosing the sentences A and B for each pre-training example, 50% of the time B is actually the sentence that follows A (labeled as isNext) and 50% of the time B is a sentence randomly chosen from the corpus (labeled as notNext). As shown in Figure 9, the output C, corresponding to the [CLS] token, is used for next sentence prediction.

Figure 9: BERT pre-training and fine tuning

Figure 9 also shows the pre-training and the fine-tuning procedures. While pre-training follows the standard procedure and involves the BookCorpus (800M words) and the English Wikipedia (2.5B words), fine-tuning is straightforward thanks to the self-attention mechanism in the Transformer. In contrast with the methodologies preceding BERT, which encoded text pairs independently before applying bidirectional cross attention, BERT unifies the two stages, encoding a concatenated text pair with self-attention, which includes bidirectional cross attention between the two sentences. Hence, for each task, all that is necessary is to plug in the task-specific inputs and outputs and fine-tune all the parameters end-to-end. The fine-tuning procedure, compared to pre-training, is inexpensive.

The model

The aim of the model is to interpret sentences written in multiple languages, such as Italian, English, French, German and Dutch, and to return the result of a classification prediction. Specifically, the model must handle sentences that may contain business-specific keywords: words whose very specific meaning in the context of the business is (probably) not known to the model, or words that were never seen during BERT training and hence are not present in the model vocabulary.

For these reasons, since retraining the whole BERT model is impractical, a system must be implemented to drive the model towards the correct classification, especially when the sentence contains a keyword. Code snippet A shows the model summary.

Code snippet A: model summary

As you can see, transfer learning is applied to BERT: its weights are frozen (non-trainable), so the pre-trained encoder is used out of the box. On top of BERT there are two additional trainable layers that generate a vector of 70 features, which is concatenated with the input features encoding the sentence with respect to the keywords. The resulting vector is finally used for the classification by means of a softmax layer.

Package structure

The package structure is depicted in Code snippet 1: it is organised with the objective of separating, as much as possible, the application logic for data transformation from the model definition and implementation. Specifically, the file configurations.py is responsible for setting the program configurations (loading them from config.yml), while the module connectors defines the client connector to Google Firestore and the locations for Google Cloud Storage. The datasetHandler defines the object responsible for managing the data flow from the format received from the database to the one best suited for the application, and for defining the training, validation and test sets. The module model creates the machine learning model and contains the preprocessor, the object in charge of preprocessing the input examples into the format required by BERT. Finally, the file training.py is responsible for orchestrating the flow of data, executing both the training and the testing of model performance.

Code snippet 1: package structure

The dataset handler

The dataset handler has the objective of managing the transformations of the data format and of creating the training, validation and test sets. Code snippet 2 shows a portion of the dataset handler's function:

Code snippet 2: portion of dataset handler function that organizes data

The first step is about data management: the keyword data format is changed from the one obtained by the component connecting to Firestore, a dictionary with several unused keys, to a plain and simple list of all the keywords defined in the database.

Supervised training requires a dataset composed of pairs of examples and labels: while training, the model performs the predictions and updates its parameters with respect to the prediction errors. For this reason, DatasetHandler manages data conversion and associates each training example with the corresponding label.

The keywords, instead, are business-specific words whose meanings are probably unknown to the model, or the words could simply be missing from the model vocabulary. It is important to handle this detail properly for two reasons. Firstly, if we don't manage unknown keywords, when the string is preprocessed the unknown words are replaced with the special token ‘[UNK]’. Secondly, suppose the keyword is known to BERT, but during its training the model learned a meaning different from the one intended by the business. In such cases we need to make the model understand that the keywords are linked to a subset of training examples. Hence, the inputs are enriched with features with exactly that objective: for each training example a categorical tensor is generated that maps whether the example contains each keyword. This vector will be used in the trailing neural network that performs the classification, as shown in the sketch below.
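A minimal sketch of how such a keyword feature vector could be built (the keyword list and the matching rule are placeholders; the repository's implementation may differ):

```python
import numpy as np

def keyword_features(sentence, keywords):
    """Multi-hot vector marking which business keywords appear in the sentence."""
    sentence = sentence.lower()
    return np.array([1.0 if kw.lower() in sentence else 0.0 for kw in keywords],
                    dtype=np.float32)

keywords = ["warranty", "invoice", "firmware"]   # hypothetical keyword list
print(keyword_features("Where can I download the firmware?", keywords))
# -> [0. 0. 1.]
```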

Now, we still have to split the dataset and create the training, validation and test sets: Code snippet 3 depicts these steps.

Code snippet 3: portion of the dataset handler's function that splits the data into training, validation and test sets
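Since the snippet is not reproduced here, this is one possible way to perform such a split, using scikit-learn's train_test_split with a stratified 70/15/15 partition (the proportions are an assumption):

```python
from sklearn.model_selection import train_test_split

# Toy data standing in for the preprocessed examples and their labels.
examples = [f"question {i}" for i in range(100)]
labels = [i % 4 for i in range(100)]             # 4 hypothetical F.A.Q. classes

# 70% training, then the remaining 30% split evenly into validation and test.
x_train, x_tmp, y_train, y_tmp = train_test_split(
    examples, labels, test_size=0.30, stratify=labels, random_state=42)
x_val, x_test, y_val, y_test = train_test_split(
    x_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)

print(len(x_train), len(x_val), len(x_test))     # 70 15 15
```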

The model implementation

As described in “The model” section, BERT requires the corpus to be preprocessed. Thanks to TensorFlow Hub, it's possible to download the preprocessing model and easily execute the preprocessing.

Code snippet 4: the preprocessor
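For reference, loading and calling a multilingual BERT preprocessing model from TensorFlow Hub looks roughly like this; the exact handle is an assumption and the repository may pin a different one:

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: registers the ops needed by the preprocessing model

preprocessor = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_multi_cased_preprocess/3")

encoder_inputs = preprocessor(tf.constant(["Dove posso scaricare la fattura?"]))
print(sorted(encoder_inputs.keys()))  # ['input_mask', 'input_type_ids', 'input_word_ids']
```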

The implementation of the model is fairly simple. First, we have to download the pre-trained model:

Code snippet 5: download of BERT pre-trained model
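A sketch of what the download might look like via TensorFlow Hub, using a multilingual BERT BASE encoder (L=12, H=768, A=12); the handle and version are assumptions:

```python
import tensorflow_hub as hub

bert_encoder = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_multi_cased_L-12_H-768_A-12/4",
    trainable=False)  # transfer learning: the BERT weights stay frozen
```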

Then the custom deep learning model is defined by specifying the inputs and using the Keras functional API to define the model structure. The first block is the BERT layer, from which only the pooled output is taken: a vector of dimension 768, which is fed into a series of 2 dense layers with batch normalization. The output of the second dense layer is then concatenated with the input features mapping the keywords in the training examples. Finally, the classification is obtained by means of a softmax activation function.

Code snippet 6: definition and model build
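The actual snippet is best read in the repository; as a hedged approximation of the structure described above (layer sizes, input names and the number of classes are placeholders), the Keras functional API definition could look like this:

```python
import tensorflow as tf

NUM_CLASSES = 10    # hypothetical number of F.A.Q. classes
NUM_KEYWORDS = 70   # hypothetical size of the keyword feature vector

def build_model(preprocessor, bert_encoder):
    # Raw text input plus the categorical keyword features.
    text_input = tf.keras.Input(shape=(), dtype=tf.string, name="sentence")
    keyword_input = tf.keras.Input(shape=(NUM_KEYWORDS,), name="keywords")

    # BERT preprocessing + frozen encoder; only the 768-dim pooled output is used.
    encoder_inputs = preprocessor(text_input)
    pooled = bert_encoder(encoder_inputs)["pooled_output"]

    # Two dense layers with batch normalization on top of the pooled output.
    x = tf.keras.layers.Dense(256, activation="relu")(pooled)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dense(70, activation="relu")(x)
    x = tf.keras.layers.BatchNormalization()(x)

    # Concatenate with the keyword features and classify with a softmax layer.
    x = tf.keras.layers.Concatenate()([x, keyword_input])
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

    return tf.keras.Model(inputs=[text_input, keyword_input], outputs=outputs)

model = build_model(preprocessor, bert_encoder)  # objects from the previous snippets
model.summary()
```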

The train function, responsible for training the model, is shown in Code snippet 7.

Code snippet 7: definition of the train function
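A compact sketch of such a function, assuming tf.data datasets and placeholder hyperparameters:

```python
import tensorflow as tf

def train(model, train_ds, val_ds, epochs=10, learning_rate=1e-3):
    """Compile and fit the model; optimizer, loss and epochs are placeholders."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate),
        loss="sparse_categorical_crossentropy",  # integer class labels
        metrics=["accuracy"],
    )
    return model.fit(train_ds, validation_data=val_ds, epochs=epochs)
```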

Putting it all together

Now that we have defined all the steps and the logic, we need to put everything together by orchestrating the objects and the functions. After loading the configurations defined in a YAML file, the model object, the preprocessor and the dataset handler are created:

Code snippet 8: model and dataset handler creation

The dataset handler then transforms the data retrieved from the database, preprocesses the corpus and creates the training, validation and test sets.

Code snippet 9: dataset creation

Finally, the model is built, trained and its performance is tested.

Code snippet 10: model build, training and testing
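Tying the previous sketches together, the orchestration boils down to a few calls (the dataset names are assumptions for the objects produced by the dataset handler):

```python
# train_ds, val_ds and test_ds are assumed to be the tf.data datasets
# produced by the dataset handler in the previous steps.
model = build_model(preprocessor, bert_encoder)
history = train(model, train_ds, val_ds, epochs=10)

test_loss, test_accuracy = model.evaluate(test_ds)
print(f"Test accuracy: {test_accuracy:.3f}")
```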

The deployment

The model training is executed on Google AI Platform Training, a service specifically designed and optimized for training machine learning models. As depicted in the Dockerfile shown in Code snippet 11, the package is wrapped in a docker container.

Code snippet 11: Dockerfile
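Only as an indicative example, a training Dockerfile for such a package could look like the following; the base image, file names and entry point are assumptions, not the repository's exact content:

```dockerfile
# Minimal training container sketch.
FROM python:3.8-slim

WORKDIR /app

# Install the dependencies first to leverage Docker layer caching.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the training package (configurations.py, config.yml, training.py, ...).
COPY . .

# Run the training entry point when the container starts.
ENTRYPOINT ["python", "training.py"]
```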

Code snippet 12 is responsible for building the docker container and pushing it to the Google Container Registry, a repository of docker containers associated with the project.

Code snippet 12: build and push script
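A sketch of such a script; the project id and image name are placeholders:

```bash
#!/bin/bash
# Build the training image and push it to Google Container Registry.
PROJECT_ID="my-gcp-project"                        # placeholder
IMAGE_URI="gcr.io/${PROJECT_ID}/faq-trainer:latest"

docker build -t "${IMAGE_URI}" .
docker push "${IMAGE_URI}"
```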

Once the image is pushed to the Google Container Registry, we are ready to submit the training job to AI Platform Training. The submission script is shown in Code snippet 13:

Code snippet 13: submit training script
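For reference, submitting a custom-container training job is done with gcloud; the job name, region and image URI below are placeholders:

```bash
#!/bin/bash
# Submit the custom-container training job to AI Platform Training.
IMAGE_URI="gcr.io/my-gcp-project/faq-trainer:latest"   # placeholder
JOB_NAME="faq_training_$(date +%Y%m%d_%H%M%S)"

gcloud ai-platform jobs submit training "${JOB_NAME}" \
  --region europe-west1 \
  --master-image-uri "${IMAGE_URI}"
```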

Conclusion

We have seen how to design and implement a deep learning model. We have also examined how to build a container for the training application, how to push it to Google Container Registry, and how to execute training on the Google AI Platform.

If you are interested in the description of the use case and the other articles in this series, click here.

References

[1] Devlin, Jacob, et al. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).

[2] Vaswani, Ashish, et al. “Attention is all you need.” arXiv preprint arXiv:1706.03762 (2017).

[3] Goodfellow, Ian, Yoshua Bengio, and Aaron Courville. Deep Learning. Cambridge: MIT Press, 2016.

[4] Britz, Denny, et al. “Massive exploration of neural machine translation architectures.” arXiv preprint arXiv:1703.03906 (2017).

[5] Gehring, Jonas, et al. “Convolutional sequence to sequence learning.” International Conference on Machine Learning. PMLR, 2017.

[6] He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.

[7] Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. “Layer normalization.” arXiv preprint arXiv:1607.06450 (2016).

[Transformer images] https://ricardokleinklein.github.io/2017/11/16/Attention-is-all-you-need.html
