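Column | Natural Language Processing, Part One: Text Classifiers

News 06-29

Jiqizhixin (機器之心) Column

Author: 想飛的石頭

Text classification is probably the most common application of natural language processing: automatic article categorization, automatic email categorization, spam detection, user sentiment classification, and so on. This article covers how to build a text classifier from two angles, traditional methods and deep learning.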


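Text classification methods

The traditional pipeline is to hand-design features, extract them from the raw documents, and then train a chosen classifier such as LR or SVM to classify the articles. Classic feature extraction methods include the frequency method, tf-idf, mutual information, and N-Gram.

Since the rise of deep learning, many people have also started using classic models such as CNN and LSTM for feature extraction. This article briefly describes each method and reports some experiments on a text classification task.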

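Traditional text classification methods

This section describes four feature extraction methods: the frequency method, tf-idf, mutual information, and N-Gram.

Frequency method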

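The frequency method is as simple as its name suggests. It records the word-count distribution of each article and feeds that distribution into a machine learning model to train a classifier. One point worth noting: when counting, it is reasonable to assume that low-frequency words have little influence on classification. We can therefore pick a threshold and drop every word whose count falls below it, which shrinks the feature space.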
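The counting and thresholding described above can be sketched in plain Python. This is a minimal illustration, not the author's code: the function names `build_vocabulary` and `count_vector`, the toy documents, and the threshold `min_count=2` are all assumptions made for the example.

```python
from collections import Counter


def build_vocabulary(tokenized_docs, min_count=2):
    """Keep only the words whose total count across the corpus reaches min_count."""
    totals = Counter(token for doc in tokenized_docs for token in doc)
    kept = sorted(word for word, count in totals.items() if count >= min_count)
    return {word: index for index, word in enumerate(kept)}


def count_vector(doc, vocabulary):
    """Turn one tokenized document into a dense vector of word counts."""
    vector = [0] * len(vocabulary)
    for token in doc:
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector


# Toy corpus (hypothetical); real input would be segmented news articles.
docs = [
    ["stock", "market", "rises", "stock"],
    ["team", "wins", "match"],
    ["market", "falls", "team", "loses"],
]
vocabulary = build_vocabulary(docs, min_count=2)
vectors = [count_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

Words that occur only once are dropped, so each article becomes a three-dimensional count vector instead of an eight-dimensional one.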

TF-IDF

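TF-IDF goes one step further than the frequency method. The number of times a word occurs in an article (TF, term frequency) reflects the article's character to some extent. TF-IDF adds the inverse document frequency (IDF): if a word occurs many times in one document but in relatively few documents across the whole corpus, we consider it more discriminative. TF-IDF combines both factors.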
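As a worked example, the sketch below computes a textbook variant of the weight, tf(w, d) * log(N / df(w)), in plain Python. The function name `tf_idf` and the toy documents are assumptions; library implementations such as scikit-learn's use smoothed and normalized variants, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each word once per document
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Toy corpus (hypothetical): "the" appears in every document, "goal" in one.
docs = [["goal", "match", "the"], ["stock", "market", "the"]]
weights = tf_idf(docs)
print(weights[0])
```

A word that appears in every document gets weight zero, while a word confined to one document keeps a positive weight.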

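Mutual information method

The mutual information method is also statistical: it measures how strongly the occurrence of a word in a document is associated with the document's category, that is, their mutual information.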
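One common way to make this concrete is to treat "the word is present in the document" as a binary variable and compute its mutual information with the class label. The sketch below does that in plain Python; the function name `mutual_information` and the toy data are assumptions, and other formulations (for example pointwise mutual information per class) are also used in practice.

```python
import math
from collections import Counter


def mutual_information(tokenized_docs, labels, word):
    """Mutual information (in bits) between word presence and the class label."""
    n = len(tokenized_docs)
    joint = Counter((word in doc, label) for doc, label in zip(tokenized_docs, labels))
    presence = Counter(word in doc for doc in tokenized_docs)
    classes = Counter(labels)
    mi = 0.0
    for (present, label), count in joint.items():
        p_joint = count / n
        p_independent = (presence[present] / n) * (classes[label] / n)
        mi += p_joint * math.log2(p_joint / p_independent)
    return mi


# Toy corpus (hypothetical): "goal" only occurs in sports, "the" in both classes.
docs = [["goal", "the"], ["goal", "team"], ["stock", "the"], ["stock", "bank"]]
labels = ["sports", "sports", "finance", "finance"]
mi_goal = mutual_information(docs, labels, "goal")
mi_the = mutual_information(docs, labels, "the")
print(mi_goal, mi_the)
```

Words can then be ranked by this score and only the highest-scoring ones kept as features.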

N-Gram

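The N-Gram method slides a window of size N over the token sequence of an article to form groups. It then counts these groups, drops those with low frequency, uses the remaining groups as the feature space, and passes the features to a classifier.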
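The windowing and filtering steps can be sketched as follows. The helper names `ngrams` and `frequent_ngrams`, the toy documents, and the threshold are assumptions for illustration only.

```python
from collections import Counter


def ngrams(tokens, n):
    """Slide a window of size n over the tokens and return each group as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_ngrams(tokenized_docs, n=2, min_count=2):
    """Count all n-grams in the corpus and keep those seen at least min_count times."""
    counts = Counter(gram for doc in tokenized_docs for gram in ngrams(doc, n))
    return {gram: count for gram, count in counts.items() if count >= min_count}


# Toy corpus (hypothetical).
docs = [["new", "york", "times"], ["new", "york", "city"]]
bigrams = ngrams(["a", "b", "c", "d"], 2)
kept = frequent_ngrams(docs, n=2, min_count=2)
print(bigrams)
print(kept)
```

The surviving n-grams play the same role as the surviving words in the frequency method: each one becomes a dimension of the feature vector.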

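Deep learning methods

CNN-based text classification

The most basic CNN-based approach is the sentiment analysis example that ships with Keras: a Conv1D layer scans the article with a given window size, followed by a max-pooling layer. Stacking a few of these yields a feature representation, and a fully connected (FC) layer on top produces the final classification output.
The best-known CNN-based method is probably the EMNLP 2014 paper Convolutional Neural Networks for Sentence Classification, which runs CNN filters of several different sizes, applies max pooling to each, and concatenates the results.

Paper link: http://www.aclweb.org/anthology/D14-1181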
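To show what "filters of several sizes, max pooling, then concatenation" computes, here is a dependency-free sketch of the forward pass for one sentence. It is an illustration under simplifying assumptions, not the paper's or the author's implementation: biases are omitted, the weights are random, the names `conv_max_over_time` and `sentence_features` are made up for the example, and the sentence is assumed to be at least as long as the largest window.

```python
import random


def conv_max_over_time(embedded, filt):
    """Slide one filter over the sequence and keep its largest ReLU response."""
    window = len(filt)
    responses = []
    for start in range(len(embedded) - window + 1):
        total = 0.0
        for offset in range(window):
            total += sum(w * x for w, x in zip(filt[offset], embedded[start + offset]))
        responses.append(max(total, 0.0))  # ReLU
    return max(responses)  # max-over-time pooling


def sentence_features(embedded, filters_by_window):
    """Concatenate the pooled responses of every filter of every window size."""
    features = []
    for window in sorted(filters_by_window):
        for filt in filters_by_window[window]:
            features.append(conv_max_over_time(embedded, filt))
    return features


# Hypothetical sizes: 7 words, 4-dimensional embeddings, 2 filters per window size.
random.seed(0)
embedding_dim, sentence_len = 4, 7
embedded = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
            for _ in range(sentence_len)]
filters_by_window = {
    window: [[[random.uniform(-1, 1) for _ in range(embedding_dim)]
              for _ in range(window)] for _ in range(2)]
    for window in (3, 4, 5)
}
features = sentence_features(embedded, filters_by_window)
print(len(features))
```

Each filter contributes exactly one number regardless of sentence length, so the concatenated vector has a fixed size (here 3 window sizes times 2 filters) that a dense softmax layer can consume.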


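These CNN methods model relationships at different scales through different window sizes, but they clearly lose much of the wider context. The paper Recurrent Convolutional Neural Networks for Text Classification addresses this: when building the vector for each word, it adds information from the text to its left and to its right. The representation of each word is as follows:

[Figure: word representation formula; image not preserved]

The overall architecture is as follows:

[Figure: overall architecture; image not preserved]

Take the sentence "A sunset stroll along the South Bank affords an array of stunning vantage points". The representation of stroll consists of c_l(stroll), pre_word2vec(stroll), and c_r(stroll), where c_l(stroll) encodes the semantics of "A sunset" and c_r(stroll) encodes "along the South Bank affords an array of stunning vantage points". Every word is handled this way, which avoids the loss of context in the plain CNN approach.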
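For reference, the word representation described above can be written out as follows. This is a reconstruction from my recollection of the RCNN paper, not a copy of the missing figure, so the notation should be checked against the paper itself. Here e(w_i) is the word embedding (pre_word2vec in the text above), f is a non-linear activation, and the W matrices and b^{(2)} are learned parameters.

```latex
\begin{aligned}
c_l(w_i) &= f\left(W^{(l)}\, c_l(w_{i-1}) + W^{(sl)}\, e(w_{i-1})\right) \\
c_r(w_i) &= f\left(W^{(r)}\, c_r(w_{i+1}) + W^{(sr)}\, e(w_{i+1})\right) \\
x_i &= \left[\, c_l(w_i);\; e(w_i);\; c_r(w_i) \,\right] \\
y_i^{(2)} &= \tanh\left(W^{(2)} x_i + b^{(2)}\right) \\
y^{(3)} &= \max_{i}\; y_i^{(2)}
\end{aligned}
```

The left context c_l is computed by a forward recurrence and the right context c_r by a backward one, so x_i for stroll sees the whole sentence; the element-wise max over positions then produces a fixed-length vector for the classifier.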

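LSTM-based methods

Like the first CNN approach, the brute-force version simply adds an LSTM after the embedding and feeds its output into an FC layer for classification. I see the LSTM as another kind of feature extractor, one that leans toward modeling sequential features;
On top of the brute-force approach, the paper A C-LSTM Neural Network for Text Classification does not feed the embedding output directly into the LSTM. It first passes it through a CNN to obtain a sequence of higher-level features and then feeds that sequence into the LSTM; the paper reports that this improves the final classification accuracy.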

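Code in practice

Corpus and task

The training corpus comes from news articles in roughly 31 news categories. Some categories have very few articles, so I kept the 20 categories with the most articles. Each article runs from a few hundred to a few thousand characters. The task is to train a classifier that assigns each article to its category:

[Figure: image not preserved]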

Bow

Bow processes the corpus to obtain the token set:

import traceback

def __get_all_tokens(self):
    """ get all tokens of the corpus """
    with open(self.data_path.replace("all.csv", "all_token.csv"), "w") as fwrite, \
            open(self.data_path, "r") as fread:
        i = 0
        for line in fread:
            try:
                # separators are kept as they appear in the source;
                # the original file may have used tabs and newlines
                line_list = line.strip().split(" ")
                label = line_list[0]
                self.labels.append(label)
                text = line_list[1]
                # segment the article into tokens
                text_tokens = self.cut_doc_obj.run(text)
                self.corpus.append(" ".join(text_tokens))
                self.dictionary.add_documents([text_tokens])
                fwrite.write(label + " " + "".join(text_tokens) + " ")
                i += 1
            except Exception:
                # report the failing line and stop reading
                print(traceback.format_exc())
                print("=====>Read Done<======")
                break
    self.token_len = len(self.dictionary)
    print("all token len " + str(self.token_len))
    self.num_data = i

The token set is then filtered by a frequency threshold, and each article is vectorized:

import pickle

def __filter_tokens(self, threshold_num=10):
    # drop tokens that appear in fewer than threshold_num documents
    small_freq_ids = [
        tokenid for tokenid, docfreq in self.dictionary.dfs.items()
        if docfreq < threshold_num
    ]
    self.dictionary.filter_tokens(small_freq_ids)
    self.dictionary.compactify()

def vec(self):
    """ vec: get a vec representation of bow """
    self.__get_all_tokens()
    print("before filter, the tokens len: {0}".format(len(self.dictionary)))
    self.__filter_tokens()
    print("After filter, the tokens len: {0}".format(len(self.dictionary)))
    self.bow = []
    for file_token in self.corpus:
        # corpus entries are space-joined strings; doc2bow expects a list of tokens
        file_bow = self.dictionary.doc2bow(file_token.split(" "))
        self.bow.append(file_bow)
    # write the bow vec into a file
    with open(self.data_path.replace("all.csv", "bow_vec.pl"), "wb") as bow_vec_file:
        pickle.dump(self.bow, bow_vec_file)
    with open(self.data_path.replace("all.csv", "bow_label.pl"), "wb") as bow_label_file:
        pickle.dump(self.labels, bow_label_file)
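For readers unfamiliar with gensim, the sketch below mimics in plain Python what the dictionary calls above do conceptually: count in how many documents each token appears, drop the rare ones, and turn a document into sorted (token id, count) pairs. It is not gensim's implementation; the function names and toy data are assumptions.

```python
from collections import Counter


def document_frequencies(tokenized_docs):
    """Number of documents each token appears in (like Dictionary.dfs, keyed by token)."""
    dfs = Counter()
    for doc in tokenized_docs:
        dfs.update(set(doc))
    return dfs


def build_token2id(tokenized_docs, threshold_num):
    """Assign consecutive ids to tokens whose document frequency reaches the threshold."""
    dfs = document_frequencies(tokenized_docs)
    kept = sorted(token for token, df in dfs.items() if df >= threshold_num)
    return {token: index for index, token in enumerate(kept)}


def doc2bow(tokens, token2id):
    """Sparse bag-of-words: sorted (token id, count) pairs for known tokens only."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


# Toy corpus (hypothetical).
docs = [["a", "b", "a"], ["a", "c"], ["b", "a"]]
token2id = build_token2id(docs, threshold_num=2)
bows = [doc2bow(doc, token2id) for doc in docs]
print(token2id)
print(bows)
```

Note that the output keeps counts but not word order, which is fine for a bag-of-words classifier.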

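This yields a bow vector for every article. I ran this part on my laptop, and running it directly used too much memory, because each article's representation over the token set is extremely sparse. We can instead convert it to a CSR representation before training. The code that converts to CSR and saves the intermediate results: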

import pickle
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split

def to_csr(self):
    with open(self.data_path.replace("all.csv", "bow_vec.pl"), "rb") as f:
        self.bow = pickle.load(f)
    with open(self.data_path.replace("all.csv", "bow_label.pl"), "rb") as f:
        self.labels = pickle.load(f)
    # collect (row, column, value) triples for the non-zero entries only
    data = []
    rows = []
    cols = []
    line_count = 0
    for line in self.bow:
        for elem in line:
            rows.append(line_count)
            cols.append(elem[0])
            data.append(elem[1])
        line_count += 1
    print("dictionary shape ({0},{1})".format(line_count, len(self.dictionary)))
    bow_sparse_matrix = csr_matrix(
        (data, (rows, cols)), shape=(line_count, len(self.dictionary)))
    print("bow_sparse matrix shape: ")
    print(bow_sparse_matrix.shape)
    # hold out 20% of the articles as the test set
    self.train_set, self.test_set, self.train_tag, self.test_tag = train_test_split(
        bow_sparse_matrix, self.labels, test_size=0.2)
    print("train set shape: ")
    print(self.train_set.shape)
    # save the split so that training can run as a separate step
    outputs = [
        ("bow_train_set.pl", self.train_set),
        ("bow_train_tag.pl", self.train_tag),
        ("bow_test_set.pl", self.test_set),
        ("bow_test_tag.pl", self.test_tag),
    ]
    for name, obj in outputs:
        with open(self.data_path.replace("all.csv", name), "wb") as f:
            pickle.dump(obj, f)
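To see why CSR saves memory, the sketch below builds the three CSR arrays by hand from bow vectors. It is a plain-Python illustration of the storage layout, not a replacement for scipy's `csr_matrix`; the function names and toy data are assumptions.

```python
def bow_to_csr(bow_docs):
    """Build the CSR arrays (data, indices, indptr) from per-document (id, count) pairs."""
    data, indices, indptr = [], [], [0]
    for doc in bow_docs:
        for token_id, count in doc:
            indices.append(token_id)  # column of this non-zero entry
            data.append(count)        # its value
        indptr.append(len(indices))   # where the next row starts
    return data, indices, indptr


def csr_row_to_dense(data, indices, indptr, row, n_cols):
    """Expand one row back to a dense list, to show that nothing was lost."""
    dense = [0] * n_cols
    for k in range(indptr[row], indptr[row + 1]):
        dense[indices[k]] = data[k]
    return dense


# Toy bow vectors (hypothetical): three documents over a 4-token vocabulary.
bow_docs = [[(0, 2), (3, 1)], [], [(1, 4)]]
data, indices, indptr = bow_to_csr(bow_docs)
first_row = csr_row_to_dense(data, indices, indptr, row=0, n_cols=4)
print(data, indices, indptr)
print(first_row)
```

Only the non-zero counts are stored, so memory grows with the number of distinct tokens per article rather than with the vocabulary size.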

Finally, the training code:

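import joblib  # in older scikit-learn versions: from sklearn.externals import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

def train(self):
    print("Begin to train the model")
    lr_model = LogisticRegression()
    lr_model.fit(self.train_set, self.train_tag)
    print("End now, and evaluate the model with the test dataset")
    # print("mean accuracy: {0}".format(lr_model.score(self.test_set, self.test_tag)))
    y_pred = lr_model.predict(self.test_set)
    # per-class precision, recall and F1, followed by the confusion matrix
    print(classification_report(self.test_tag, y_pred))
    print(confusion_matrix(self.test_tag, y_pred))
    print("save the trained model to bow_lr_model.pl")
    joblib.dump(lr_model, self.data_path.replace("all.csv", "bow_lr_model.pl"))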

TF-IDF

TF-IDF works much like Bow; only the vectorization step uses tf-idf instead:

import pickle
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

def vec(self):
    """ vec: get a tf-idf vec representation of the corpus """
    self.__get_all_tokens()
    print("before filter, the tokens len: {0}".format(len(self.dictionary)))
    # min_df as a float is a proportion of documents
    vectorizer = CountVectorizer(min_df=1e-5)
    transformer = TfidfTransformer()
    # sparse matrix
    self.tfidf = transformer.fit_transform(vectorizer.fit_transform(self.corpus))
    # get_feature_names() in scikit-learn versions before 1.0
    words = vectorizer.get_feature_names_out()
    print("word len: {0}".format(len(words)))
    # print(self.tfidf[0])
    print("tfidf shape ({0},{1})".format(self.tfidf.shape[0], self.tfidf.shape[1]))
    # write the tfidf vec into a file
    with open(self.data_path.replace("all.csv", "tfidf_vec.pl"), "wb") as tfidf_vec_file:
        pickle.dump(self.tfidf, tfidf_vec_file)
    with open(self.data_path.replace("all.csv", "tfidf_label.pl"), "wb") as tfidf_label_file:
        pickle.dump(self.labels, tfidf_label_file)

Both methods work well, reaching 98+% accuracy.

CNN

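Corpus processing is much the same as for the traditional methods: after word segmentation, use a pretrained word2vec. I hit a pitfall here. I was overconfident in my own segmentation, and the model would not converge. I asked a PhD on our team, who said the likely cause was that many tokens in my segmented sequences were absent from the pretrained word2vec, and I had simply been discarding them. So I added a dictionary to the segmenter, and I randomly initialized the vectors of words missing from the pretrained word2vec. After that the model converged. Lesson learned!

The code that loads the word2vec model and builds the CNN is below (with some batch normalization and dropout added):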

import numpy as np
from keras.layers import (Input, Embedding, Convolution1D, MaxPooling1D,
                          BatchNormalization, Activation, Flatten, Dense, Dropout)
from keras.models import Model
from keras.optimizers import Adam

# Keras 1.2.2 API, as used by the author. EMBEDDING_DIM and
# MAX_SEQUENCE_LENGTH are constants defined elsewhere in the script.

def gen_embedding_matrix(self, load4file=True):
    """ gen_embedding_matrix: generate the embedding matrix """
    if load4file:
        self.__get_all_tokens_v2()
    else:
        self.__get_all_tokens()
    print("before filter, the tokens len: {0}".format(len(self.dictionary)))
    self.__filter_tokens()
    print("after filter, the tokens len: {0}".format(len(self.dictionary)))
    self.sequence = []
    for file_token in self.corpus:
        # NOTE: doc2bow returns sorted, de-duplicated (id, count) pairs,
        # so this id sequence does not keep the original word order
        temp_sequence = [x for x, y in self.dictionary.doc2bow(file_token)]
        print(temp_sequence)
        self.sequence.append(temp_sequence)
    self.corpus_size = len(self.dictionary.token2id)
    self.embedding_matrix = np.zeros((self.corpus_size, EMBEDDING_DIM))
    print("corpus size: {0}".format(len(self.dictionary.token2id)))
    for key, v in self.dictionary.token2id.items():
        key_vec = self.w2vec.get(key)
        if key_vec is not None:
            self.embedding_matrix[v] = key_vec
        else:
            # word missing from the pretrained word2vec: random init in [-0.5, 0.5)
            self.embedding_matrix[v] = np.random.rand(EMBEDDING_DIM) - 0.5
    print("embedding_matrix len {0}".format(len(self.embedding_matrix)))

def __build_network(self):
    # frozen embedding layer initialized from the matrix built above
    embedding_layer = Embedding(
        self.corpus_size,
        EMBEDDING_DIM,
        weights=[self.embedding_matrix],
        input_length=MAX_SEQUENCE_LENGTH,
        trainable=False)
    # train a 1D convnet with global maxpooling
    sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH, ), dtype="int32")
    embedded_sequences = embedding_layer(sequence_input)
    # three Conv1D -> BatchNorm -> ReLU -> MaxPool stages
    x = Convolution1D(128, 5)(embedded_sequences)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = MaxPooling1D(5)(x)
    x = Convolution1D(128, 5)(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = MaxPooling1D(5)(x)
    print("before 256", x.get_shape())
    x = Convolution1D(128, 5)(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = MaxPooling1D(15)(x)
    x = Flatten()(x)
    # dense head with batch normalization and dropout
    x = Dense(128)(x)
    x = BatchNormalization()(x)
    x = Activation("relu")(x)
    x = Dropout(0.5)(x)
    print(x.get_shape())
    preds = Dense(self.class_num, activation="softmax")(x)
    print(preds.get_shape())
    adam = Adam(lr=0.0001)
    self.model = Model(sequence_input, preds)
    self.model.compile(
        loss="categorical_crossentropy", optimizer=adam, metrics=["acc"])

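Another network architecture, from the paper by the Korean author (presumably the EMNLP 2014 paper cited above), is built as follows: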

def __build_network(self):

LSTM

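Because our task is to classify whole articles, the sequences are very long, and feeding them straight into an LSTM runs out of memory. So after the article sequence I added two Conv1D+MaxPool1D layers to extract a shorter, lower-dimensional representation, and then fed that into the LSTM. The network code: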

from keras.layers import Input, Embedding, Convolution1D, MaxPooling1D, LSTM, Dense
from keras.models import Model
from keras.optimizers import RMSprop

def __build_network(self):
    # frozen embedding layer initialized from the pretrained matrix
    embedding_layer = Embedding(
        self.corpus_size,
        EMBEDDING_DIM,
        weights=[self.embedding_matrix],
        input_length=MAX_SEQUENCE_LENGTH,
        trainable=False)
    sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH, ), dtype="int32")
    embedded_sequences = embedding_layer(sequence_input)
    # two Conv1D + MaxPooling1D stages shorten the sequence before the LSTM
    x = Convolution1D(
        self.num_filters, 5, activation="relu")(embedded_sequences)
    x = MaxPooling1D(5)(x)
    x = Convolution1D(self.num_filters, 5, activation="relu")(x)
    x = MaxPooling1D(5)(x)
    # dropout_W / dropout_U are the Keras 1.x argument names
    x = LSTM(64, dropout_W=0.2, dropout_U=0.2)(x)
    preds = Dense(self.class_num, activation="softmax")(x)
    print(preds.get_shape())
    rmsprop = RMSprop(lr=0.01)
    self.model = Model(sequence_input, preds)
    self.model.compile(
        loss="categorical_crossentropy",
        optimizer=rmsprop,
        metrics=["acc"])
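The effect of the two Conv1D+MaxPooling1D stages on sequence length can be checked with simple arithmetic. The sketch below assumes stride-1 convolutions without padding and non-overlapping pooling, which I believe are the Keras defaults for these layers. The input length of 1000 is a hypothetical value, since the article does not state MAX_SEQUENCE_LENGTH, and the helper names are made up for the example.

```python
def conv1d_length(length, kernel_size):
    """Output length of a stride-1 convolution without padding ('valid')."""
    return length - kernel_size + 1


def maxpool1d_length(length, pool_size):
    """Output length of non-overlapping max pooling (stride equals pool size)."""
    return length // pool_size


def trace_lengths(length, layers):
    """Return the sequence length after each layer, starting with the input length."""
    lengths = [length]
    for kind, size in layers:
        if kind == "conv":
            length = conv1d_length(length, size)
        else:
            length = maxpool1d_length(length, size)
        lengths.append(length)
    return lengths


# The front end of the network above: Conv1D(5), MaxPool(5), Conv1D(5), MaxPool(5).
front_end = [("conv", 5), ("pool", 5), ("conv", 5), ("pool", 5)]
lengths = trace_lengths(1000, front_end)  # 1000 is an assumed input length
print(lengths)
```

Under these assumptions the LSTM sees a sequence roughly 25 times shorter than the raw article, which is what makes training fit in memory.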

CNN results:

[Figure: CNN results; image not preserved]

C-LSTM results:

[Figure: C-LSTM results; image not preserved]

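Because the deep learning part of these experiments ran on company resources, I did not really apply any tuning tricks to improve performance. All network configurations and parameters in the code here are for reference only; going deeper would take more time spent on parameter optimization.

PS: I ran into what looks like a bug in Keras 1.2.2. With the TensorBoard callback and histogram_freq=1, GPU memory usage rises sharply, and the 24 GB on an M40 is not enough. It feels like a bug to me, but since this is 1.2.2 rather than 2.0, it may already be fixed in 2.0.

All the code is on GitHub: tensorflow-101/nlp/text_classifier/scripts

Summary and outlook

In this article's experiments, the deep learning methods show no advantage over the traditional ones. Possible reasons include: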

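The pretrained Word2vec model does not cover many of the words segmented from the news, and the missing proportion is fairly high. A more accurate Word2vec trained on online news text should improve results considerably;
Tricks that help training converge, and different optimizers, could be tried to see whether accuracy improves;
The network parameters have not been tuned in any depth so far.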

UPDATE

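Long-text classification

CNN 3 Split(3, 4, 5) model 0.97+, CNN+LSTM 0.94+. These come close to Bow and TF-IDF (0.98+ and 0.99+). I believe more tuning tricks are available, and I am confident the deep models can beat them on this task.

Short-text classification, predicting the news category from the headline:

CNN 3 Split(1, 2, 3) model 0.92+, LSTM 0.94+, while Bow and TF-IDF only reach 0.80+. On short text, deep-learning-based NLP is far ahead overall. LSTM also seems more effective than CNN on short text: even the more complex 3 Split CNN does not match the LSTM.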

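Tuning notes

When using a DL Embedding layer, do not casually drop words that are missing from Word2vec; randomly initialize them instead. If possible, count the missing words, and if the count is large, investigate why;
Segmentation quality affects model performance to some degree, but using different segmentation tools affects it even more. Make sure the pretrained word2vec and the later training data use the same segmenter; this gained at least 0.07+ on my task;
A general-purpose word2vec model built on a large corpus may work well. When you want higher accuracy and have enough data, consider training your own word2vec; it is very effective;
Once all of the above is in place and you want a bit more, set the Embedding layer to trainable. There is a reasonable explanation: word2vec weights come from an unsupervised task based on word co-occurrence, so updating them on the task itself is often more effective;
Having a GPU is great, especially for this task. So fast.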
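The first tip can be sketched without any deep learning library: build the embedding matrix, randomly initialize the rows of missing words, and report how many words were missing. The function name `build_embedding_matrix`, the toy vocabulary, and the 2-dimensional vectors are assumptions for illustration; the random range [-0.5, 0.5) follows the author's code above.

```python
import random


def build_embedding_matrix(token2id, pretrained, dim, seed=0):
    """Fill one row per token; words missing from `pretrained` get random vectors."""
    rng = random.Random(seed)
    matrix = [[0.0] * dim for _ in range(len(token2id))]
    missing = []
    for token, index in token2id.items():
        vector = pretrained.get(token)
        if vector is not None:
            matrix[index] = list(vector)
        else:
            # random init in [-0.5, 0.5) instead of dropping the word
            matrix[index] = [rng.random() - 0.5 for _ in range(dim)]
            missing.append(token)
    return matrix, missing


# Toy vocabulary and pretrained vectors (hypothetical).
token2id = {"market": 0, "stock": 1, "blockchain": 2}
pretrained = {"market": [0.1, 0.2], "stock": [0.3, 0.4]}
matrix, missing = build_embedding_matrix(token2id, pretrained, dim=2)
oov_rate = len(missing) / len(token2id)
print(missing, oov_rate)
```

A high out-of-vocabulary rate is a signal to check whether the segmenter used for the corpus matches the one used to train the word vectors.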
