# Review — Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

## Designed a New Hidden Unit for **Statistical Machine Translation (SMT) Using **RNN Encoder-Decoder

In this paper, **Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation**, (RNN Encoder-Decoder), by Universit´e de Montr´eal, Jacobs University, and Universit´e du Maine, is reviewed. This is also a paper by Prof. Bengio. In this paper:

**RNN Encoder-Decoder**is proposed that consists of**two recurrent neural networks (RNN)**, which are**jointly trained**.**One RNN encodes a sequence of symbols into a fixed-length vector representation**, and**the other decodes the representation into another sequence of symbols**.**A new hidden unit**that**adaptively remembers and forgets**are also proposed instead of using LSTM unit.

This is a paper in **2014 EMNLP **with over **15000 citations**. (Sik-Ho Tsang @ Medium) (I write this story to reinforce my NLP knowledge, if you are familiar with it, please skip to save time, lol.)

# Outline

**Recurrent Neural Networks (RNN)****RNN Encoder-Decoder****Hidden Unit that Adaptively Remembers and Forgets****Statistical Machine Translation (SMT)****English/French Translation Task Results**

# 1. Recurrent Neural Networks (RNN)

- A recurrent neural network (RNN) is a neural network that consists of a
**hidden state**and an*h***optional output**which operates on*y***a variable-length sequence**. At each*x*= (*x*1, …,*xT*)**time step**, the*t***hidden state**is updated by Eq. (1):*h*<*t*> of the RNN

- where
*f***non-linear activation function**.*f*may be as simple as an element-wise logistic**sigmoid**function and as complex as a**long short-term memory (LSTM)**unit. - An RNN can learn a probability distribution over a sequence by being trained to predict the next symbol in a sequence. In that case, the output at each timestep t is the conditional distribution
*p*(*xt*|*xt*-1, …,*x*1). - For example,
**a multinomial distribution (1-of-**Eq. (2):*K*coding) can be output using a softmax activation function

- where
*wj*are the rows of a weight matrix*W*. - By combining these probabilities, we can compute the probability of the sequence
*x*using Eq. (3):

# 2. RNN Encoder-Decoder

## 2.1. Probabilistic View

- As mentioned, a novel neural network architecture is proposed that learns to encode a variable-length sequence into a fixed-length vector representation and to decode a given fixed-length vector representation back into a variable-length sequence.
- From a probabilistic perspective, it
**learns the conditional distribution over a variable-length sequence conditioned on yet another variable-length sequence,***p*(*y*1, …,*yT’*|*x*1, …,*xT*). - The input and output sequence lengths
.*T*and*T*’ may differ

## 2.2. Encoder

- The encoder is an RNN that reads each symbol of an input sequence
*x*sequentially. As it reads each symbol, the hidden state of the RNN changes according to Eq. (1). - After reading the end of the sequence (marked by an end-of-sequence symbol),
**the hidden state of the RNN is a summary**.*c*of the whole input sequence

## 2.3. Decoder

- The decoder of the proposed model is another RNN which is trained to generate the output sequence by predicting the next symbol
*yt*given the hidden state*h*<*t*>. - Both
*yt*and*h*<*t*> are also conditioned on*yt*-1 and on the summary*c*of the input sequence. Hence,**the hidden state of the decoder at time**is computed by Eq. (4):*t*

- and similarly,
**the conditional distribution of the next symbol**is Eq. (5):

- where
*f*and*g*are the activation functions.

## 2.4. Training

- The two components of the proposed RNN Encoder-Decoder are jointly trained to maximize the conditional log-likelihood Eq. (6):

- where
*θ*is the set of the model parameters and each (*xn*,*yn*) is an (input sequence, output sequence) pair from the training set.

## 2.5. Testing

- Once trained, the model can be used in two ways.

- One way is to use the model to generate a target sequence given an input sequence.
- On the other hand, the model can be used to score a given pair of input and output sequences, where the score is simply a probability
*pθ*(*y*|*x*).

# 3. Hidden Unit that Adaptively Remembers and Forgets

## 3.1. New Hidden Unit

- A new hidden unit, which is
**much simpler than LSTM**, is proposed. - First,
**the reset gate***rj*

- where
*σ*is the logistic sigmoid function, and [.]*j*denotes the*j*-th element of a vector.*x*and*ht*-1 are the input and the previous hidden state, respectively.*Wr*and*Ur*are weight matrices. - Similarly,
**the update gate**is computed by Eq. (8):*zj*

**The actual activation of the proposed unit**is then computed by Eq. (9):*hj*

- where (Eq. (10)):

When

the reset gate is close to 0, the hidden state is forced toignore the previous hidden stateandreset with the current input only.This effectively

allows the hidden state to drop any information that is found to be irrelevant later in the future, thus, allowing a more compact representation.

- As each hidden unit has separate reset and update gates,
**each hidden unit will learn to capture dependencies over different time scales**.

# 4. Statistical Machine Translation (SMT)

- The goal of the system (decoder, specifically) is to find a translation
*f*given a source sentence*e*, which maximizes (Eq. (11)):

- where the
**first term**at the right hand side is called**translation model**and the**latter language model**. - In practice, however,
**most SMT systems model log**with additional features and corresponding weights (Eq. (12)):*p*(*f*|*e*) as a loglinear model

- where
*fn*and*wn*are the*n*-th feature and weight, respectively. is a*Z*(*e*)**normalization constant**that does not depend on the weights. The weights are often**optimized to maximize the BLEU score on a development set.**

# 5. English/French Translation Task Results

## 5.1. Data & Datasets

**WMT’14 English/French Translation**is used for evaluation.- Large amounts of resources are available to build an English/French SMT system in the framework of the WMT’14 translation task.
- The bilingual corpora include
**Europarl (61M words), news commentary (5.5M), UN (421M)**, and**two crawled corpora of 90M and 780M words respectively.**The**last two**corpora are quite**noisy**. - To train the
**French**language model, about**712M words of crawled newspaper material**is available in addition to the target side of the bitexts. **A subset of 418M words**is selected out of more than 2G words for**language modeling**and**a subset of 348M**out of 850M words is selected for training the RNN Encoder–Decoder.- The test set
**newstest2012 and 2013**is used for**data selection**and**weight tuning with MERT**, and**newstest2014**as our**test set**. **Each set has more than 70 thousand words**and a single reference translation.- The source and target vocabulary are
**limited to the most frequent 15,000 words for both English and French**. This covers**approximately 93% of the dataset**. All the**out-of-vocabulary words**were mapped to a**special token ([UNK])**.

## 5.2. BLEU Scores

(For more details about BLEU: Please feel free to watch: C5W3L06 Bleu Score (Optional))

**5.2.1. Baseline**

- The baseline phrase-based SMT system was built using Moses with default settings. This system achieves a
**BLEU score of 30.64**and**33.3**on the**development**and**test sets**, respectively.

## 5.2.2. **RNN**

- The proposed RNN Encoder-Decoder uses
**1000 hidden units**with the**proposed gates**at the encoder and at the decoder. **The activation function for ~h**in Eq. (10) is a**hyperbolic tangent**.- The computation
**from the hidden state in the decoder to the output**is with a single intermediate layer having**500 Maxout units**each pooling 2 inputs. - At each update,
**64 randomly selected phrase pairs**are used. The model was trained for approximately**three days**.

Finally, it outperforms baseline.

## 5.2.3. **CSLM**

- The CSLM model is trained on
**7-grams**from the target corpus.**Each input word**was projected into the**embedding space R512**, and they were**concatenated to form a 3072-dimensional vector**. - The concatenated vector was
**fed through two rectified layers (of size 1536 and 1024)**(Glorot et al., 2011). The**output**layer was a simple**softmax**layer.

## 5.2.4. **CSLM+RNN**

- As expected
**, adding features**computed by neural networks**consistently improves the performance**over the baseline performance.

The

best performancewas achieved when usingboth CSLM and the phrase scores from the RNN Encoder-Decoder. This suggests that the contributions of the CSLM and the RNN Encoder-Decoder are not too correlated.

## 5.2.5. CSLM+RNN+WP

- Furthermore,
**Word Penalization (WP)**is used which is to**penalize the number of words that are unknown**to the neural networks (i.e. words which are not in the shortlist), by simply adding the number of unknown words as an additional feature the loglinear model in (Eq. (12)). - But WP cannot achieve better performance on the test set, but only on the development set.

## 5.3. Qualitative Results

- The source phrases that are
**long (more than 3 words per source phrase)**and**frequent**are selected.

The choices of the target phrases by the RNN Encoder-Decoder are

closer to actual or literal translations.And the RNN Encoder-Decoder prefers

shorter phrasesin general.

- 50 samples are generated and the top-five phrases are shown accordingly to their scores.

RNN Encoder–Decoder is able to propose

well-formed target phrases without looking at the actual phrase table.Importantly, the generated phrases

do not overlap completely with the target phrasesfrom the phrase table.

## 5.4. Word and Phrase Representations

- Since the proposed RNN Encoder-Decoder also
**projects to and maps back from a sequence of words into a continuous space vector**,**semantically meaningful embeddings is observed**with the proposed model as well. - The representation
*c*in this case is a**1000-dimensional vector**. The**projection**was done by the recently proposed**Barnes-Hut-SNE**and is shown above. - It is clearly seen that
**semantically similar words are clustered with each other.**

- RNN Encoder-Decoder captures both semantic and syntactic structures of the phrases.
- For instance, in the
**bottom-left plot**, most of the phrases are about the duration of time, while**those phrases that are syntactically similar are clustered together.** - The
**bottom-right plot**shows the**cluster of phrases that are semantically similar (countries or regions).** - On the other hand, the
**top-right plot**shows the phrases that are**syntactically similar**.

I believe this is one of the beginner papers for NLP.

## Reference

[2014 EMNLP] [RNN Encoder-Decoder]

Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation

## Natural Language Processing (NLP)

**2014** [Seq2Seq] [RNN Encoder-Decoder]