Neural Machine Translation with External Memory


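Introduction

PaddlePaddle provides a rich set of computational units that let you build a wide variety of deep learning models in a modular way to solve different application problems. Here we provide a range of neural network models for common machine learning tasks, for you to study and use. This week's posts are as follows: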

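3.12: [Named Entity Recognition] Training an end-to-end sequence labeling model

3.13: [Sequence-to-Sequence Learning] Neural machine translation without attention

3.14: [Sequence-to-Sequence Learning] Improving translation quality with Scheduled Sampling

3.15: [Sequence-to-Sequence Learning] Neural machine translation with external memory

3.16: [Sequence-to-Sequence Learning] Generating classical Chinese poetry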

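Sequence-to-sequence learning maps between two or more variable-length sequences and has a wide range of applications, including machine translation, intelligent dialogue and question answering, generation of advertising copy, auto-encoding (for example, encoding financial profiles), and judging the semantic relatedness of multiple text strings.

For sequence-to-sequence learning, we first take machine translation as the example task and provide several improved models for you to study and use: a sequence-to-sequence mapping model without attention, which is the foundation of all sequence-to-sequence models; Scheduled Sampling, which reduces error accumulation when RNN models are used for generation; and neural machine translation with external memory, which strengthens the memory capacity of the network so that it can handle complex sequence-to-sequence tasks. Beyond machine translation, we also provide a model based on a deep LSTM network that generates classical Chinese poetry, as an example of same-language generation.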

[Sequence-to-Sequence Learning]

Neural Machine Translation with External Memory

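The neural machine translation (NMT) model with external memory is an important extension of the NMT model. It introduces a differentiable memory network as an additional memory unit, which expands the capacity or bandwidth of the translation model's internal working memory, helps with temporary storage and retrieval of information during tasks such as translation, and improves model performance.

Models of this kind are not limited to translation. They apply broadly to other tasks that need large-capacity dynamic memory, such as machine reading comprehension and question answering, multi-turn dialogue, and long text generation. Since memory is an important part of cognition, it can also be used to strengthen many other machine learning models.

The external memory mechanism used in this article mainly refers to the Neural Turing Machine [1] approach (described in detail later). Note that the Neural Turing Machine is only one of several attempts to model memory with neural networks. Memory mechanisms have been studied for a long time, and in recent years a series of valuable works have emerged in the deep learning context, such as Memory Networks and Differentiable Neural Computers (DNC). This article discusses and implements only the Neural Turing Machine mechanism.

The implementation here mainly follows paper [2], and assumes the reader has carefully read and understood the machine translation chapter of PaddlePaddle Book.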

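|1. Model Overview

A. Introduction to Memory Mechanisms

Memory is one of the key components of cognition. It gives cognition coherence over time and makes complex cognition possible (such as reasoning and planning, as opposed to static perception). A flexible memory mechanism is one of the key capabilities a machine needs in order to imitate human intelligence.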

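Static Memory

Every machine learning model natively has some static memory capacity, whether it is a parametric model (the model parameters are the memory) or a non-parametric model (the samples are the memory), and whether it is a traditional SVM (the support vectors are the memory) or a neural network (the connection weights are the memory). However, "memory" in this sense is almost entirely static: once training ends, the memory is frozen. At inference time the model is static and identical across inputs, and has no additional ability to remember information across time steps.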

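Dynamic Memory 1 --- Hidden State Vectors in RNNs

In sequence cognition problems (such as natural language processing and sequential decision making), the processing at each time step depends on information from other time steps, so we often need to maintain a persistent information path across time steps. Recurrent neural networks (RNNs) with a hidden state vector h (or the cell state c in an LSTM) have exactly this kind of "dynamic memory". At every time step the model can read "memories" of past time steps from h or c, and can keep adding new information to update the memory. At inference time, different samples have entirely different memory contents (h or c), which is what makes the memory "dynamic".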
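The contrast between static and dynamic memory can be illustrated with a minimal sketch. It assumes a vanilla RNN update h_t = tanh(W_hh h_{t-1} + W_xh x_t), written in plain Python; the function names and the toy weights are illustrative and are not part of the PaddlePaddle implementation. The weights (static memory) are shared by both inputs, while the final hidden state (dynamic memory) differs per input sequence.

```python
import math


def rnn_step(h_prev, x, w_hh, w_xh):
    """One vanilla RNN step: h_t = tanh(W_hh * h_prev + W_xh * x)."""
    size = len(h_prev)
    h_new = []
    for i in range(size):
        total = sum(w_hh[i][j] * h_prev[j] for j in range(size))
        total += sum(w_xh[i][k] * x[k] for k in range(len(x)))
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(sequence, w_hh, w_xh):
    """Run the RNN over a sequence, starting from a zero hidden state."""
    h = [0.0] * len(w_hh)
    for x in sequence:
        h = rnn_step(h, x, w_hh, w_xh)
    return h


# Toy weights (hypothetical values): fixed after "training", i.e. static memory.
w_hh = [[0.5, 0.0], [0.0, 0.5]]
w_xh = [[1.0], [-1.0]]

# Two different input sequences leave two different memories in h.
h_a = run_rnn([[1.0], [0.0], [0.0]], w_hh, w_xh)
h_b = run_rnn([[0.0], [0.0], [1.0]], w_hh, w_xh)
print(h_a, h_b)
```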

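The intuitive description of the LSTM cell state c given above is imprecise in several ways. For example, from an optimization point of view, the cell state c (or the linear leaky structure in a GRU) is introduced so that the spectrum of the Jacobian of the single-step gradient is closer to that of the identity matrix, which alleviates the decay of long-range gradients and makes optimization easier. This does not prevent us from understanding it intuitively as adding a "linear path" that makes the "memory channel" smoother: as shown in Figure 1 (taken from http://colah.github.io/posts/2015-08-Understanding-LSTMs/), the cell state vector c in an LSTM can be viewed as such a "linear memory channel" for persisting information.

Figure 1. The cell state vector in an LSTM as a "memory channel"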

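Dynamic Memory 2 --- The Attention Mechanism in Seq2Seq

The single vector h or c described in the previous section has limited information bandwidth. In sequence-to-sequence generation models, this bottleneck shows up most clearly when information is passed from the encoder to the decoder: relying on a single fixed-length state vector to encode an entire variable-length source sentence risks losing a considerable amount of information.

The attention mechanism proposed in [3] addresses this difficulty. During decoding, the decoder no longer relies only on a single sentence-level encoding vector from the encoder. Instead it relies on the memory held in a group of vectors, where each vector is the encoding of one token produced by the encoder (for example, h_t). A set of learnable attention weights dynamically allocates attention, and information is read out as a linear weighted sum and used to generate the symbol at each time step (see the machine translation chapter of PaddlePaddle Book). The distribution of attention weights can be viewed as content-based addressing (see the description of addressing in the Neural Turing Machine paper [1]): the read strength at each position of the source sentence is determined by the content at that position, which acts as a "soft alignment" with the source sentence.

Compared with the single state vector of the previous section, this group of vectors carries more, and more precise, information. It can be regarded as an unbounded external memory module that effectively widens the memory bandwidth. "Unbounded" means that the number of vectors in the group is not fixed but varies with the number of tokens in the source sentence, without a limit. When encoding of the source sentence finishes, this external memory is initialized with the state vectors of the tokens, and it is then read throughout the decoding process.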
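Reading from this unbounded memory can be sketched as follows. This is a simplified illustration in plain Python, not the PaddlePaddle code: it assumes dot-product scoring between the decoder query and each encoder vector, whereas [3] scores with a small feedforward network. The names attention_read and softmax are chosen here for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_read(memory, query):
    """Content-based read: weight each memory slot by its match to the query.

    memory: list of n vectors (one per source token), n may vary per sentence.
    query:  decoder-side vector with the same dimension as each slot.
    Assumption: dot-product scoring stands in for the learned scorer of [3].
    """
    scores = [sum(q * v for q, v in zip(query, slot)) for slot in memory]
    weights = softmax(scores)  # attention weights (soft alignment)
    dim = len(memory[0])
    context = [sum(w * slot[d] for w, slot in zip(weights, memory))
               for d in range(dim)]  # linear weighted read
    return weights, context


memory = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # three source tokens
weights, context = attention_read(memory, [3.0, 0.0])
print(weights, context)
```

The number of slots equals the source length, so a longer source sentence simply yields a longer weight vector; nothing in the read operation fixes the memory size.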

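Dynamic Memory 3 --- Neural Turing Machines

The Turing machine and the von Neumann architecture are the prototypes of computer architecture. The arithmetic unit (for example, algebraic computation), the controller (for example, logical branch control), and the memory together form the core operating mechanism of modern computers. Neural Turing Machines [1] attempt to use neural networks to simulate a differentiable Turing machine (that is, one that can be learned by gradient descent) in order to achieve more complex intelligence. Most ordinary machine learning models ignore explicit dynamic storage, and the Neural Turing Machine aims to make up for this potential deficiency.

Figure 2. Cartoon of the structure of a Turing machine

The storage mechanism of a Turing machine is often pictured as reading from and writing to a tape. The read head and the write head read information from, or write information to, the tape. The movement of the tape and the actions and contents of the read and write heads are governed by the controller (see Figure 2, taken from http://www.worldofcomputing.net/theory/turing-machine.html). The tape is usually treated as having finite length in practice (the idealized Turing machine assumes an unbounded tape).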

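The Neural Turing Machine simulates the tape with a matrix M ∈ R^(n×m), where n is the number of memory vectors (also called memory slots) and m is the length of each memory vector. A feedforward or recurrent neural network simulates the controller and decides how strongly each memory slot is read or written in the current operation, that is, the addressing:

Content-based addressing: the addressing weight depends on the content of the memory slot and on the actual content of the current read or write.

Location-based addressing: the addressing weight depends on the addressing weight of the previous addressing operation (for example, a shift).

Hybrid addressing: a mixture of the above addressing modes (for example, linear interpolation).

(See paper [1] for details.)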
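The content-based and interpolation steps can be sketched as below, following the formulation in the NTM paper [1]: content weights are a softmax over key-slot cosine similarities scaled by a sharpness factor beta, and a gate g blends them with the previous step's weights. This is a plain-Python illustration with hypothetical function names; the actual ExternalMemory class computes the key, beta-like terms and gate with learned layers, which are replaced here by fixed inputs.

```python
import math


def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def content_addressing(memory, key, beta=1.0):
    """w_c(i) = softmax_i(beta * cos(key, M[i])), as in the NTM paper."""
    scores = [beta * cosine(key, slot) for slot in memory]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def interpolate(w_content, w_prev, gate):
    """w_g = gate * w_content + (1 - gate) * w_prev, with gate in [0, 1]."""
    return [gate * wc + (1.0 - gate) * wp
            for wc, wp in zip(w_content, w_prev)]


memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # n = 3 slots, m = 2
w_prev = [0.0, 0.0, 1.0]                       # last step focused on slot 2
w_content = content_addressing(memory, key=[1.0, 0.1], beta=5.0)
w_final = interpolate(w_content, w_prev, gate=0.5)
print(w_content, w_final)
```

With gate = 1 the head follows content alone, and with gate = 0 it keeps the previous focus. Convolutional shift and sharpening, the remaining location-based steps of the full NTM, are not shown.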

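Compared with the attention mechanism of the previous section, the Neural Turing Machine has a number of similarities and differences. The similarities include:

Both use external storage in the form of a matrix (or a group of vectors).

Both use differentiable addressing.

The differences are:

The Neural Turing Machine both reads and writes, so it is a memory in the true sense. The attention mechanism initializes its storage once encoding is complete (a simple cache, not a differentiable write operation) and only reads from it, never writes, during decoding.

The Neural Turing Machine uses content-based addressing combined with location-based addressing, which makes tasks that need "contiguous addressing", such as sequence copying, easier. The attention mechanism considers only content-based addressing, in order to achieve soft alignment.

The Neural Turing Machine uses bounded storage, whereas the attention mechanism uses unbounded storage.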
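The read/write distinction can be made concrete with a small sketch of the erase-and-add write from the NTM paper [1], M_t(i) = M_{t-1}(i) * (1 - w(i) * e) + w(i) * a, next to the weighted read. The code is plain Python with illustrative names; in the real model the erase vector e, the add vector a, and the weights w come from learned controller layers.

```python
def ntm_read(memory, w):
    """Weighted read: r = sum_i w[i] * M[i]."""
    dim = len(memory[0])
    return [sum(w[i] * memory[i][d] for i in range(len(memory)))
            for d in range(dim)]


def ntm_write(memory, w, erase, add):
    """Erase-then-add write from the NTM paper.

    Each slot i becomes M[i] * (1 - w[i] * erase) + w[i] * add, element-wise.
    The number of slots never changes, so the memory stays bounded.
    """
    return [[m * (1.0 - w_i * e) + w_i * a
             for m, e, a in zip(slot, erase, add)]
            for w_i, slot in zip(w, memory)]


memory = [[1.0, 2.0], [3.0, 4.0]]
w = [1.0, 0.0]  # a sharp write focused entirely on slot 0
new_memory = ntm_write(memory, w, erase=[1.0, 1.0], add=[5.0, 6.0])
read_vec = ntm_read(new_memory, w)
print(new_memory, read_vec)
```

Attention corresponds to using only ntm_read on a memory that was filled once by the encoder, while the Neural Turing Machine also calls a write such as ntm_write at every step.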

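Combining the Three Kinds of Memory to Strengthen the NMT Model

Attention is already standard in ordinary sequence-to-sequence models. However, the external storage in the attention mechanism is used only to hold encoder information. Inside the decoder, the information path still depends on the single RNN state vector h or c. It is therefore natural to use the external storage mechanism of the Neural Turing Machine to supplement the single-vector information path inside the decoder.

We therefore combine the three dynamic memory mechanisms described above: the original RNN state vector and the attention mechanism are kept, and a bounded external memory based on a simplified Neural Turing Machine is introduced to supplement the decoder's single state vector. The overall implementation follows paper [2].

One further question is why we do not simply enlarge the dimension of h or c to widen the information bandwidth.

First, increasing the dimension of h or c costs O(n^2) in storage and computation (because of the state-to-state transition matrix), where n is the state dimension. Expanding memory with a Neural Turing Machine costs only O(n), where n is the number of memory slots, because addressing works in units of memory slots and the parameter structure of the controller depends only on m (the size of a memory slot).

Second, reading and writing a single state vector uses one read/write strength only, so it is essentially global. The Neural Turing Machine mechanism is local: reads and writes essentially touch only some of the memory slots (the distribution of addressing weights is sharp, with large weights on only a few slots). This locality makes memory access cleaner and reduces interference.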
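The scaling argument can be checked with simple arithmetic. The sketch below counts only the state-to-state transition matrix of a plain RNN and the raw cells of the external memory; gate matrices (a constant factor for GRU or LSTM) and the controller's own parameters are ignored, and the sizes are example values rather than measurements of the actual model.

```python
def state_transition_params(hidden_size):
    """Entries in the hidden_size x hidden_size state-to-state matrix."""
    return hidden_size * hidden_size


def memory_cells(num_slots, slot_size):
    """Cells in an external memory of num_slots vectors of length slot_size."""
    return num_slots * slot_size


# Doubling the state dimension quadruples the transition matrix ...
small, large = state_transition_params(1024), state_transition_params(2048)
print(small, large, large // small)

# ... while doubling the number of slots only doubles the memory,
# and the controller size (tied to slot_size) stays the same.
few, many = memory_cells(8, 1024), memory_cells(16, 1024)
print(few, many, many // few)
```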

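B. Model Network Structure

The overall network structure adds a simplified Neural Turing Machine [1] external memory module on top of the sequence-to-sequence structure with attention (that is, RNNsearch [3]).

The encoder uses a standard bidirectional GRU structure (not stacked), which is not described further here.

The decoder uses essentially the same structure as in paper [2].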

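|2. Algorithm Implementation

The algorithm is implemented in the following files:

external_memory.py: implements the simplified Neural Turing Machine in the ExternalMemory class, which exposes initialization, read, and write functions.

model.py: model configuration functions, including the bidirectional GRU encoder (bidirectional_gru_encoder), the decoder enhanced with external memory (memory_enhanced_decoder), and the sequence-to-sequence model enhanced with external memory (memory_enhanced_seq2seq).

data_utils.py: helper functions for data processing.

train.py: model training.

infer.py: translation of a few sample sentences (model inference).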

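The ExternalMemory Class

The ExternalMemory class implements a general-purpose, simplified Neural Turing Machine. Compared with the full Neural Turing Machine, this class implements only content-based addressing (Content Addressing, Interpolation) and does not include location-based addressing (Convolutional Shift, Sharpening). Readers can extend it into a complete Neural Turing Machine themselves.

The class is structured as follows: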

class ExternalMemory(object):
    """External neural memory class.

    A simplified Neural Turing Machine (NTM) with only content-based
    addressing (including content addressing and interpolation, but excluding
    convolutional shift and sharpening). It serves as an external differentiable
    memory bank, with differentiable write/read head controllers to store
    and read information dynamically as needed. Simple feedforward networks are
    used as the write/read head controllers.

    For more details, please refer to
    Neural Turing Machines (arXiv:1410.5401).
    """

    def __init__(self,
                 name,
                 mem_slot_size,
                 boot_layer,
                 initial_weight,
                 readonly=False,
                 enable_interpolation=True):
        """Initialization.

        :param name: Memory name.
        :type name: basestring
        :param mem_slot_size: Size of memory slot/vector.
        :type mem_slot_size: int
        :param boot_layer: Boot layer for initializing the external memory. The
                           sequence layer has sequence length indicating the number
                           of memory slots, and size as memory slot size.
        :type boot_layer: LayerOutput
        :param initial_weight: Initializer for addressing weights.
        :type initial_weight: LayerOutput
        :param readonly: If true, the memory is read-only, and write function cannot
                         be called. Default is false.
        :type readonly: bool
        :param enable_interpolation: If set true, the read/write addressing weights
                                     will be interpolated with the weights in the
                                     last step, with the affine coefficients being
                                     a learnable gate function.
        :type enable_interpolation: bool
        """
        pass

    def _content_addressing(self, key_vector):
        """Get write/read head's addressing weights via content-based addressing.
        """
        pass

    def _interpolation(self, head_name, key_vector, addressing_weight):
        """Interpolate between previous and current addressing weights.
        """
        pass

    def _get_addressing_weight(self, head_name, key_vector):
        """Get final addressing weights for read/write heads, including content
        addressing and interpolation.
        """
        pass

    def write(self, write_key):
        """Write onto the external memory.
        It cannot be called if "readonly" set True.

        :param write_key: Key vector for write heads to generate writing
                          content and addressing signals.
        :type write_key: LayerOutput
        """
        pass

    def read(self, read_key):
        """Read from the external memory.

        :param read_key: Key vector for read head to generate addressing
                         signals.
        :type read_key: LayerOutput
        :return: Content (vector) read from external memory.
        :rtype: LayerOutput
        """
        pass

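The private methods are:

_content_addressing: computes the addressing weights for read and write operations via content-based addressing.

_interpolation: updates the current addressing weights by interpolation (a linear weighting of the current addressing weights and those of the previous time step).

_get_addressing_weight: calls the two addressing operations above to obtain the final addressing weights for reading from and writing to the memory.

The public interface is:

__init__: initializes a class instance.

Input parameter name: name of the external memory unit. Different instances with the same name share the same external memory unit.

Input parameter mem_slot_size: dimension of a single memory slot (vector).

Input parameter boot_layer: layer used to initialize the memory slots. It must be a sequence, and its sequence length gives the number of memory slots.

Input parameter initial_weight: used to initialize the addressing weights.

Input parameter readonly: whether to enable read-only mode (for example, in read-only mode the instance can be used as an attention mechanism). In read-only mode the write method cannot be called.

Input parameter enable_interpolation: whether to allow interpolation addressing (for example, interpolation must be turned off when the instance is used as an attention mechanism).

write: write operation.

Input parameter write_key: the output of some layer, whose information is used for the write head's addressing and for generating the content actually written.

read: read operation.

Input parameter read_key: the output of some layer, whose information is used for the read head's addressing.

Returns: the information read out (which can be used directly as input to other layers).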

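Some key points of the implementation:

- The addressing logic of the ExternalMemory class is implemented by the two private methods _content_addressing and _interpolation. Read and write operations are implemented by the read and write functions, which include the addressing operations above. Read addressing and write addressing are performed independently, unlike [2], where the two share the same addressing weights. The purpose is to make the class more general.

- For simplicity, the controller is not factored into a dedicated module but is spread across the addressing and read/write functions. The controller mainly covers the addressing operations and the generation of the add/erase vectors during a write. The addressing operations are implemented by the private methods _content_addressing and _interpolation, and the add/erase vectors are generated in the write method. All of these use simple feedforward networks to simulate the controller. Readers may try to separate the controller logic into its own module, and may also try a recurrent neural network as the controller.

- The ExternalMemory class has a read-only mode, and its interpolation addressing can be switched off. The main purpose is to make it easy to implement the traditional attention mechanism equivalently with this class.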

memory_enhanced_seq2seq and Related Functions

Three main functions are involved:

def bidirectional_gru_encoder(input, size, word_vec_dim):
    """Bidirectional GRU encoder.

    :param size: Hidden cell number in encoder rnn.
    :type size: int
    :param word_vec_dim: Word embedding size.
    :type word_vec_dim: int
    :return: Tuple of 1. concatenated forward and backward hidden sequence.
             2. last state of backward rnn.
    :rtype: tuple of LayerOutput
    """
    pass


def memory_enhanced_decoder(input, target, initial_state, source_context, size,
                            word_vec_dim, dict_size, is_generating, beam_size):
    """GRU sequence decoder enhanced with external memory.

    The "external memory" refers to two types of memories.
    - Unbounded memory: i.e. attention mechanism in Seq2Seq.
    - Bounded memory: i.e. external memory in NTM.
    Both types of external memories can be implemented with
    ExternalMemory class, and are both exploited in this enhanced RNN decoder.

    The vanilla RNN/LSTM/GRU also has a narrow memory mechanism, namely the
    hidden state vector (or cell state in LSTM) carrying information through
    a span of sequence time, which is a successful design enriching the model
    with the capability to "remember" things in the long run. However, such a
    vector state is somewhat limited to a very narrow memory bandwidth. External
    memory introduced here could easily increase the memory capacity with linear
    complexity cost (rather than quadratic for vector state).

    This enhanced decoder expands its "memory passage" through two
    ExternalMemory objects:
    - Bounded memory for handling long-term information exchange within decoder
      itself. A direct expansion of traditional "vector" state.
    - Unbounded memory for handling source language's token-wise information.
      Exactly the attention mechanism over Seq2Seq.

    Notice that we take the attention mechanism as a particular form of external
    memory, with read-only memory bank initialized with encoder states, and a
    read head with content-based addressing (attention). From this view point,
    we arrive at a better understanding of attention mechanism itself and other
    external memory, and a concise and unified implementation for them.

    For more details about external memory, please refer to
    Neural Turing Machines (arXiv:1410.5401).

    For more details about this memory-enhanced decoder, please
    refer to the paper "Memory-enhanced Decoder for Neural Machine
    Translation". This implementation is highly
    correlated to this paper, but with minor differences (e.g. put "write"
    before "read" to bypass a potential bug in V2 APIs; see the
    related issue).
    """
    pass


def memory_enhanced_seq2seq(encoder_input, decoder_input, decoder_target,
                            hidden_size, word_vec_dim, dict_size, is_generating,
                            beam_size):
    """Seq2Seq Model enhanced with external memory.

    The "external memory" refers to two types of memories.
    - Unbounded memory: i.e. attention mechanism in Seq2Seq.
    - Bounded memory: i.e. external memory in NTM.
    Both types of external memories can be implemented with
    ExternalMemory class, and are both exploited in this Seq2Seq model.

    :param encoder_input: Encoder input.
    :type encoder_input: LayerOutput
    :param decoder_input: Decoder input.
    :type decoder_input: LayerOutput
    :param decoder_target: Decoder target.
    :type decoder_target: LayerOutput
    :param hidden_size: Hidden cell number, both in encoder and decoder rnn.
    :type hidden_size: int
    :param word_vec_dim: Word embedding size.
    :type word_vec_dim: int
    :param dict_size: Vocabulary size.
    :type dict_size: int
    :param is_generating: Whether for beam search inferencing (True) or
                          for training (False).
    :type is_generating: bool
    :param beam_size: Beam search width.
    :type beam_size: int
    :return: Cost layer if is_generating=False; Beam search layer if
             is_generating=True.
    :rtype: LayerOutput
    """
    pass

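The bidirectional_gru_encoder function implements a bidirectional single-layer GRU (Gated Recurrent Unit) encoder. It returns two results: the sequence of token-level encoding vectors (forward and backward concatenated), and the sentence-level encoding vector of the whole source sentence (backward only). The former initializes the memory matrix of the decoder's attention mechanism, and the latter initializes the decoder's state vector.

The memory_enhanced_decoder function implements a GRU decoder enhanced with external memory. It uses the same ExternalMemory class to implement two kinds of external memory modules:

Unbounded external memory: the traditional attention mechanism. It uses ExternalMemory with read-only mode on and interpolation addressing off, and uses the encoder's first output to initialize the storage matrix in ExternalMemory (boot_layer). The number of memory slots in this storage is therefore dynamic and depends on the number of tokens seen by the encoder.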

unbounded_memory = ExternalMemory(
    name="unbounded_memory",
    mem_slot_size=size * 2,
    boot_layer=unbounded_memory_init,
    initial_weight=unbounded_memory_weight_init,
    readonly=True,
    enable_interpolation=False)

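Bounded external memory: it uses ExternalMemory with read-only mode off and interpolation addressing on. The encoder's first output is mean-pooled, expanded to the specified sequence length, and perturbed with random noise (applied consistently in training and inference), and the result initializes the storage matrix in ExternalMemory (boot_layer). The number of memory slots in this storage is therefore fixed. In the code: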

bounded_memory = ExternalMemory(
    name="bounded_memory",
    mem_slot_size=size,
    boot_layer=bounded_memory_init,
    initial_weight=bounded_memory_weight_init,
    readonly=False,
    enable_interpolation=True)
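The initialization of the bounded memory described above (mean pooling, expansion to a fixed number of slots, then random perturbation) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the PaddlePaddle layer code: init_bounded_memory is a hypothetical helper, Gaussian noise is assumed for the perturbation, and any projection between the encoder output size and the bounded slot size is left out.

```python
import random


def init_bounded_memory(encoder_states, num_slots, perturb_stddev, seed=0):
    """Build a fixed-size memory from a variable-length encoder output.

    encoder_states: list of token-level vectors (length varies per sentence).
    num_slots:      fixed number of memory slots (cf. --memory_slot_num).
    perturb_stddev: noise scale (cf. --memory_perturb_stddev); Gaussian noise
                    is an assumption made for this sketch.
    """
    dim = len(encoder_states[0])
    # Mean pooling over the source positions.
    mean = [sum(state[d] for state in encoder_states) / len(encoder_states)
            for d in range(dim)]
    # Expand to num_slots copies and perturb each copy so the slots differ.
    rng = random.Random(seed)
    return [[v + rng.gauss(0.0, perturb_stddev) for v in mean]
            for _ in range(num_slots)]


short_source = [[1.0, 3.0], [3.0, 5.0]]
long_source = short_source + [[2.0, 4.0]] * 5
print(len(init_bounded_memory(short_source, 4, 0.1)))
print(len(init_bounded_memory(long_source, 4, 0.1)))
```

Both calls produce four slots regardless of source length, which is what makes this memory bounded, in contrast to the attention memory whose size follows the source sentence.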

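The memory_enhanced_seq2seq function defines the whole sequence-to-sequence model with external memory and is the main entry point of the model definition. It first calls bidirectional_gru_encoder to encode the source language and then decodes with memory_enhanced_decoder.

In addition, this implementation moves the write operation of ExternalMemory ahead of read in order to work around a potential limitation in topology connections (see the related issue). The two orderings are essentially equivalent.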

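|3. Quick Start

A. Custom Data

Data enters the training process through a reader() iterator function that takes no arguments, so we need to construct one reader() for the training data and one for the test data. A reader() function uses yield to act as an iterator (so it can be iterated with for instance in reader()), for example: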

def reader():
    for instance in data_list:
        yield instance

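Each sample returned by yield must be a triple containing the encoder input token list (the source language sequence, converted to IDs), the decoder input token list (the target language sequence, converted to IDs and shifted right by one position), and the decoder output token list (the target language sequence, converted to IDs).

Users need to tokenize the text themselves and build a dictionary to convert tokens to IDs.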
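A minimal sketch of such a reader is shown below. It assumes whitespace tokenization and the special tokens "<s>", "<e>", and "<unk>" for start, end, and unknown words; build_dict and make_reader are hypothetical helpers written for this illustration and are not part of data_utils.py or the PaddlePaddle API. The source sequence is left without start/end markers here for brevity.

```python
def build_dict(sentences, specials=("<s>", "<e>", "<unk>")):
    """Map each token to an integer ID; special tokens get the first IDs."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for sentence in sentences:
        for tok in sentence.split():  # assumption: whitespace tokenization
            vocab.setdefault(tok, len(vocab))
    return vocab


def make_reader(pairs, src_dict, trg_dict):
    """Return a no-argument reader() that yields training triples."""

    def reader():
        for src, trg in pairs:
            src_ids = [src_dict.get(t, src_dict["<unk>"]) for t in src.split()]
            trg_ids = [trg_dict.get(t, trg_dict["<unk>"]) for t in trg.split()]
            yield (
                src_ids,                    # encoder input
                [trg_dict["<s>"]] + trg_ids,  # decoder input (shifted right)
                trg_ids + [trg_dict["<e>"]],  # decoder target
            )

    return reader


pairs = [("hello world", "bonjour le monde")]
src_dict = build_dict(s for s, _ in pairs)
trg_dict = build_dict(t for _, t in pairs)
reader = make_reader(pairs, src_dict, trg_dict)
for instance in reader():
    print(instance)
```

The decoder input and the decoder target have the same length, and the input is the target delayed by one step, so at each position the decoder is trained to predict the next token.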

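The two reader creator functions paddle.dataset.wmt14.train and paddle.dataset.wmt14.test return the corresponding reader() functions when called, for use by paddle.trainer.SGD.train. To use other data, refer to paddle.dataset.wmt14 (https://github.com/PaddlePaddle/Paddle/blob/develop/python/paddle/v2/dataset/wmt14.py) to construct the corresponding data creators, and replace paddle.dataset.wmt14.train and paddle.dataset.wmt14.test with the names of your own functions.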

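B. Training

On the command line, enter:

python train.py

or customize some of the parameters, for example:

python train.py \
    --dict_size 30000 \
    --word_vec_dim 512 \
    --hidden_size 1024 \
    --memory_slot_num 8 \
    --use_gpu False \
    --trainer_count 1 \
    --num_passes 100 \
    --batch_size 128 \
    --memory_perturb_stddev 0.1

This runs the training script, and the trained model is saved periodically to the local directory ./checkpoints. For the meaning of each parameter, run:

python train.py --help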

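C. Decoding

On the command line, enter:

python infer.py

optionally with some parameters customized. This runs the decoding script and produces sample translations. For the meaning of each parameter, run:

python infer.py --help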

[References]

[1] Alex Graves, Greg Wayne, Ivo Danihelka. Neural Turing Machines. arXiv preprint arXiv:1410.5401, 2014.

[2] Mingxuan Wang, Zhengdong Lu, Hang Li, Qun Liu. Memory-enhanced Decoder for Neural Machine Translation. In Proceedings of EMNLP, 2016, pages 278–286.

[3] Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. arXiv preprint arXiv:1409.0473, 2014.