[Paper Reproduction Series] 4. FastText
fastText originates from the 2016 paper "Bag of Tricks for Efficient Text Classification". This article introduces the concepts, principles, and usage of fastText.
This article comes from a featured project in the AI Studio community.
fastText
Paper | Bag of Tricks for Efficient Text Classification
Link | https://arxiv.53yu.com/abs/1607.01759
Authors | Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov
Published | 2016
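I. Overview
Although deep neural networks perform very well on natural language processing tasks, they often stack dozens of layers and can carry over a hundred million parameters, which makes them slow and dependent on substantial compute, and this limits where they can be deployed. In contrast, fastText is a simple and efficient text classifier open-sourced by Facebook that uses a shallow neural network to provide both word2vec-style word representations and text classification. Its accuracy is often close to that of deep networks, while it uses far fewer resources and can be around a hundred times faster, so it can be regarded as an efficient, industrial-grade solution.
For word representations, fastText builds on the skip-gram model with negative sampling and goes one step further: it treats each center word as a set of subwords and learns a vector for each subword.
A simple way to understand the scheme: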
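Take the word "like" as an example. With subwords of 2 characters, the subwords are "<l", "li", "ik", "ke", and "e>", plus the special subword "<like>". The symbols "<" and ">" mark the word boundaries, so that subwords acting as prefixes or suffixes can be told apart from the others. They also distinguish a subword such as "her" (found inside a longer word) from the whole word "<her>". Given a word $w$, the usual practice is to take the union $\mathcal{G}_w$ of all its subwords of length 3 to 6 and its special subword. Let $\boldsymbol{z}_g$ be the vector of any subword $g$ in the dictionary. Then the loss function of the skip-gram model with negative sampling,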
$$
-\log P\left(w_o \mid w_c\right)=-\log \frac{1}{1+\exp \left(-\boldsymbol{u}_o^T \boldsymbol{v}_c\right)}-\sum_{k=1, w_k \sim P(w)}^K \log \frac{1}{1+\exp \left(\boldsymbol{u}_{i_k}^T \boldsymbol{v}_c\right)}
$$
can be rewritten directly as
$$
\begin{aligned}
& -\log P\left(w_o \mid w_c\right)= \\
& -\log \frac{1}{1+\exp \left(-\boldsymbol{u}_o^T \sum_{g \in \mathcal{G}_{w_c}} \boldsymbol{z}_g\right)}-\sum_{k=1, w_k \sim P(w)}^K \log \frac{1}{1+\exp \left(\boldsymbol{u}_{i_k}^T \sum_{g \in \mathcal{G}_{w_c}} \boldsymbol{z}_g\right)}
\end{aligned}
$$
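As can be seen, the original center-word vector $\boldsymbol{v}_c$ is replaced by the sum of the center word's subword vectors. This differs from whole-word methods such as word2vec and GloVe, and it also means that a new word outside the dictionary can be represented by the sum of the corresponding fastText subword vectors.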
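The substitution above can be checked numerically. The following is a minimal sketch in plain Python, not the fastText implementation: the function names (`subword_sum`, `sgns_subword_loss`) and the toy vectors are assumptions made for illustration. It evaluates the second loss for one context vector, a list of negative-sample vectors, and the subword vectors of the center word.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def log_sigmoid(x):
    """log(1 / (1 + exp(-x))), written to stay stable for large |x|."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def subword_sum(subword_vectors):
    """Sum of z_g over all subwords g of the center word."""
    dim = len(subword_vectors[0])
    return [sum(v[d] for v in subword_vectors) for d in range(dim)]

def sgns_subword_loss(u_o, u_negatives, subword_vectors):
    """Negative-sampling skip-gram loss with v_c replaced by the subword sum."""
    center = subword_sum(subword_vectors)
    # Positive pair: -log sigmoid(u_o^T sum_g z_g)
    loss = -log_sigmoid(dot(u_o, center))
    # Negative samples: -log sigmoid(-u_k^T sum_g z_g)
    for u_k in u_negatives:
        loss -= log_sigmoid(-dot(u_k, center))
    return loss

# Toy usage with made-up 2-dimensional vectors (hypothetical values)
z = [[0.1, 0.2], [0.3, -0.1]]
print(sgns_subword_loss([0.5, 0.5], [[-0.2, 0.1]], z))
```

Because the gradient of this loss flows into every $\boldsymbol{z}_g$, words that share subwords also share parameters.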
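II. Why fastText Was Proposed
Limitations of Word2Vec
Word2Vec cannot handle words that never appeared during training (out of vocabulary, OOV).
Example: "tensor" and "flow" are both in the Word2Vec vocabulary, but "tensorflow" never appeared --> OOV error.
It cannot exploit morphology, that is, the relationship between words that share the same root.
Words with the same root (eat), such as eaten, eating, and eats, may rarely occur together, so in practice each one tends to be trained as an unrelated, unique token. Semantically similar words therefore cannot share parameters.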
Shared radical "eat": eat, eat+s, eat+en, eat+er, eat+ing
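fastText was proposed to address these problems.
III. The fastText Approach
Intuition: borrow the idea of n-grams and take character-level information into account.
Use the information inside each word to improve the semantic representations learned by Word2Vec.
How subwords are generated
Step 1: Given a word, add < and > at its beginning and end to mark the start and the end.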
eating → <eating>
Step 2: Choose the n-gram sliding-window size, n = 3 here (other values are possible), and slide the window over the word.
<eating> (a 3-character window slides from "<ea" to the final "ng>")
Step 3: The n-gram lists obtained for n = 3, 4, 5, 6
| Word | Length (n) | Character n-grams |
| --- | --- | --- |
| eating | 3 | <ea, eat, ati, tin, ing, ng> |
| eating | 4 | <eat, eati, atin, ting, ing> |
| eating | 5 | <eati, eatin, ating, ting> |
| eating | 6 | <eatin, eating, ating> |
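The three steps can be sketched in a few lines of Python. This is an illustrative helper rather than fastText's own code: the names `char_ngrams` and `subword_set` are assumptions, and the defaults of 3 and 6 simply follow the range used in the table.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of `word` after adding the < and > markers."""
    token = "<" + word + ">"                      # Step 1: mark start and end
    grams = []
    for n in range(min_n, max_n + 1):             # Step 3: every window size
        for i in range(len(token) - n + 1):       # Step 2: slide the window
            grams.append(token[i:i + n])
    return grams

def subword_set(word, min_n=3, max_n=6):
    """All n-grams plus the special whole-word subword <word>."""
    return set(char_ngrams(word, min_n, max_n)) | {"<" + word + ">"}

print(char_ngrams("eating", 3, 3))
# ['<ea', 'eat', 'ati', 'tin', 'ing', 'ng>']
print(char_ngrams("eating", 6, 6))
# ['<eatin', 'eating', 'ating>']
```

Note that the 6-gram "eating" (no markers) and the special subword "<eating>" are different entries, which is exactly what the boundary symbols are for.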
Problem: this produces a very large number of unique n-grams.
Hash
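Because there are far more n-grams than words, storing every n-gram explicitly is impractical. fastText instead uses hash buckets: all n-grams are mapped into `buckets` buckets, and n-grams that land in the same bucket share one embedding vector, as the accompanying figure illustrates.
In the figure, W_in denotes the full embedding matrix: the first V rows are word embeddings and the following `buckets` rows are n-gram embeddings. Each n-gram is passed through the hash function and mapped to a position between 0 and buckets-1, which gives its embedding vector. Hashing keeps lookups at O(1) while bounding memory use to O(buckets * dim). The potential drawback is hash collisions: different n-grams may end up sharing the same embedding. If the number of buckets is large enough, the effect is small.
The implementation is otherwise the skip-gram model with negative sampling; the only difference is the subword treatment of the center word.
Strengths and weaknesses
fastText matters most for morphologically rich languages such as Arabic, German, and Russian. German, for instance, has many compound words: table tennis is "Tischtennis". Through subwords, fastText can express the relatedness of two words such as "Tischtennis" and "Tennis", and the improvement is especially pronounced when the training corpus is small.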
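The bucket lookup can be sketched as follows. The text above does not say which hash function is used, so the 32-bit FNV-1a hash here is an assumption chosen for illustration, and `ngram_row` is a hypothetical helper; only the layout (V word rows followed by `buckets` n-gram rows) comes from the description.

```python
def fnv1a_32(text):
    """32-bit FNV-1a hash of the UTF-8 bytes of `text` (illustrative choice)."""
    h = 2166136261                        # FNV offset basis
    for byte in text.encode("utf-8"):
        h ^= byte
        h = (h * 16777619) & 0xFFFFFFFF   # FNV prime, kept to 32 bits
    return h

def ngram_row(ngram, vocab_size, buckets):
    """Row of the n-gram in the embedding matrix: after the V word rows."""
    return vocab_size + fnv1a_32(ngram) % buckets

# Toy usage: V = 1000 words, 50 buckets (hypothetical sizes)
V, BUCKETS = 1000, 50
for g in ["<ea", "eat", "ati", "tin", "ing", "ng>"]:
    print(g, ngram_row(g, V, BUCKETS))
```

With only 50 buckets, collisions are likely even for this handful of n-grams; raising `buckets` trades memory for fewer shared embeddings.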
| Language | word2vec-skipgram | word2vec-cbow | fasttext |
| --- | --- | --- | --- |
| Czech | 52.8 | 55.0 | **77.8** |
| German | 44.5 | 45.0 | **56.4** |
| English | 70.1 | 69.9 | **74.9** |
| Italian | 51.5 | 51.8 | **62.7** |
Compared with Word2Vec, however, fastText performs somewhat worse on semantic analogy tasks, although the gap narrows as the corpus grows.
| Language | word2vec-skipgram | word2vec-cbow | fasttext |
| --- | --- | --- | --- |
| Czech | 25.7 | 27.6 | 27.5 |
| German | 66.5 | 66.8 | 62.3 |
| English | 78.5 | 78.2 | 77.8 |
| Italian | 52.3 | 54.7 | 52.3 |
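- fastText is about 1.5 times slower than plain skip-gram because of the added n-gram overhead.
- On word similarity tasks, using n-gram subword information performs better than the CBOW and skip-gram baselines.
In addition, when fastText meets a new word, it can look up the vectors of that word's subwords learned from the training set and sum them to obtain a vector for the new word.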
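A minimal sketch of this out-of-vocabulary lookup is shown below. The subword table is a made-up dictionary with three entries, and `oov_vector` is a hypothetical helper; subwords that were never seen are simply skipped here, which is an assumption of this sketch (with hash buckets, every n-gram would map to some row).

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of `word` with < and > boundary markers."""
    token = "<" + word + ">"
    return [token[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(token) - n + 1)]

def oov_vector(word, subword_vectors, dim, min_n=3, max_n=6):
    """Sum the known subword vectors of a word that is not in the vocabulary."""
    total = [0.0] * dim
    for g in char_ngrams(word, min_n, max_n):
        vec = subword_vectors.get(g)
        if vec is None:          # subword never seen in training: skip it
            continue
        total = [t + v for t, v in zip(total, vec)]
    return total

# Toy subword table (hypothetical values, 2 dimensions)
table = {"<te": [1.0, 0.0], "ten": [0.0, 2.0], "ow>": [0.5, 0.5]}
print(oov_vector("tensorflow", table, dim=2, min_n=3, max_n=3))
# [1.5, 2.5]
```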
IV. Practice
1. Using the official CLI tool
Word representations · fastText
2. Using gensim's FastText
FastText Model — gensim (radimrehurek.com)
3. A simple experiment with PaddlePaddle
!unzip /home/aistudio/data/data198438/ag_news.zip
Archive: /home/aistudio/data/data198438/ag_news.zip
extracting: ag_news/classes.txt
inflating: ag_news/readme.txt
inflating: ag_news/test.csv
inflating: ag_news/train.csv
from collections import Counter
from sklearn.model_selection import train_test_split
import pandas as pd
data = pd.read_csv(r"./ag_news/train.csv", header=None, encoding="utf-8")
train, val = train_test_split(data, test_size=0.2, shuffle=True)
# Only the news description is used. The validation split doubles as the test set and takes no part in training.
train_x = list(train.iloc[:, 2])
train_y = list(train.iloc[:, 0])
val_x = list(val.iloc[:, 2])
val_y = list(val.iloc[:, 0])
# News description
print(train_x[1])
# Class label
print(train_y[1])
Earlier this month, the US space program lost one of its most colorful pioneers - Leroy Gordon quot;Gordo quot; Cooper Jr., one of the Mercury 7 astronauts, whose exploits and foibles
4
# Data cleaning
# Goal: keep only the letters and spaces of each sentence
def clear(data):
    for i, s in enumerate(data):
        # Keep a character only if it is a letter or a space
        data[i] = "".join(char for char in s if char.isalpha() or char == " ")

clear(train_x)
clear(val_x)
# Concatenate all sentences to build the vocabulary
document = ""
for s in train_x:
    document += " " + str(s).lower()
# Split on spaces and keep only purely alphabetic tokens
clear_d = []
for word in document.split(" "):
    if word.isalpha():
        clear_d.append(word)
# Count word frequencies with Counter
freq = dict(Counter(clear_d))
idx2word = ["<pad>"] + list(freq.keys()) + ["<unk>"]
word2idx = {w: idx for idx, w in enumerate(idx2word)}
vocab_size = len(idx2word)
del freq
def corpus_encode(corpus):
    # Tokenize and encode every sentence in place
    for i in range(len(corpus)):
        # Split the sentence on spaces
        s1 = corpus[i]
        sentences = corpus[i].lower().split(" ")
        # Encoded form of the current sentence
        c = []
        for word in sentences:
            # Only alphabetic tokens are encoded
            if word.isalpha():
                # Use the vocabulary index; words outside the vocabulary get the <unk> index
                c.append(word2idx.get(word, word2idx["<unk>"]))
        # if len(c) == 0:
        #     print("original sentence", s1)
        #     print("tokenized sentence", sentences)
        # Replace the original sentence with its encoding
        corpus[i] = c

print("Original sentences")
print(val_x[0])
print(train_x[0])
corpus_encode(val_x)
corpus_encode(train_x)
print("Encoded sentences")
print(val_x[0])
print(train_x[0])
Original sentences
The United States accused the UN agriculture body on Friday of mismanaging the locust crisis afflicting vast swathes of West Africa
Crude oil futures rose to within a quarter of a barrel Monday after a gas leak shut down a North Sea oil production platform and as traders weighed concerns about heating oil supplies against the fact that temperatures have been mild in recent weeks
Encoded sentences
[30, 159, 160, 2524, 30, 1066, 479, 1587, 77, 180, 9, 42802, 30, 14986, 3034, 34044, 815, 42804, 9, 812, 2751]
[1, 2, 3, 4, 5, 6, 7, 8, 9, 7, 10, 11, 12, 7, 13, 14, 15, 16, 7, 17, 18, 2, 19, 20, 21, 22, 23, 24, 25, 26, 27, 2, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
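# Set of all n-gram features seen in training; each feature is the string form of a list of N word indices
Features = set()
def word_N_gram(s, N=2, train=False):
    features = []
    for i in range(len(s) - N + 1):
        f = str(s[i : i + N])
        if train:
            Features.add(f)  # During training, record which features exist
        # s is an encoded sentence, so each feature is a word-level n-gram:
        # for N=2, a pair of adjacent word indices (not a character n-gram such as li, ik, ke)
        features.append(f)
    return features

# Build the n-gram features of every sentence
train_n_gram = []
for s in train_x:
    # These bigram features are later combined with the word embeddings inside the model
    train_n_gram.append(word_N_gram(s, train=True))
val_n_gram = []
for s in val_x:
    val_n_gram.append(word_N_gram(s))
len(train_x) == len(train_n_gram)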
True
# Build an index table for the n-gram features
idx2ngram = ["[<pad>]"] + list(Features) + ["[<unk>]"]
ngram2idx = {w: c for c, w in enumerate(idx2ngram)}
ngram_size = len(idx2ngram)
# Encode the features
def encode_gram(n_gram):
    for i, s in enumerate(n_gram):
        feature = []
        for word in s:
            feature.append(ngram2idx.get(word, ngram2idx["[<unk>]"]))
        n_gram[i] = feature

encode_gram(train_n_gram)
encode_gram(val_n_gram)
# Each list element is now the index of one n-gram feature (here, a word bigram)
# train_n_gram[0]
# val_n_gram[0]
import paddle
from paddle.io import DataLoader
from paddle.io import Dataset
# Record the length of every sentence; it is needed later to take the mean
lensTrianWord = list(map(len, train_x))
lensVarWord = paddle.to_tensor(list(map(len, val_x)))
lensTrianNGram = list(map(len, train_n_gram))
lenValNGram = paddle.to_tensor(list(map(len, val_n_gram)))
# Pad every sentence with 0 (the pad index) up to max_len so that all sequences share one length
def padding(train_x, max_len):
    for i in range(len(train_x)):
        train_x[i].extend([0] * (max_len - len(train_x[i])))

# 173 and 172 are the fixed lengths used in this run for the word and n-gram sequences
padding(train_x, 173)
padding(train_n_gram, 172)
padding(val_x, 173)
padding(val_n_gram, 172)
train_x = list(map(lambda x: paddle.to_tensor(x), train_x))
val_x = list(map(lambda x: paddle.to_tensor(x), val_x))
train_n_gram = list(map(lambda x: paddle.to_tensor(x), train_n_gram))
val_n_gram = list(map(lambda x: paddle.to_tensor(x), val_n_gram))
# The original labels are 1 2 3 4
# CrossEntropyLoss() expects labels 0 1 2 3, hence the -1
# train_y is fed to the data loader, so its elements are stored as tensors
train_y = list(map(lambda x: paddle.to_tensor(x) - 1, train_y))
val_y = list(map(lambda x: x - 1, val_y))
val_y = paddle.to_tensor(val_y)
lensVarWord[0]
Tensor(shape=[1], dtype=int64, place=Place(gpu:0), stop_gradient=True,
[21])
class mydata(Dataset):
    def __init__(self, train_x, train_Ngram, train_y, lensX, lensNGram):
        super(mydata, self).__init__()
        self.train_x = train_x
        self.train_Ngram = train_Ngram
        self.train_y = train_y
        self.lensX = lensX
        self.lensNGram = lensNGram

    def __len__(self):
        return len(self.train_x)

    def __getitem__(self, idx):
        return self.train_x[idx], self.train_Ngram[idx], self.lensX[idx], self.lensNGram[idx], self.train_y[idx]

batch_size = 20
data = mydata(train_x, train_n_gram, train_y, lensTrianWord, lensTrianNGram)
dataloader = DataLoader(data, batch_size=batch_size, shuffle=True, num_workers=0)
for x, ngram, lenx, lenNgram, y in dataloader:
    print(x.shape, ngram.shape, lenx.shape, lenNgram.shape, y.shape)
    break
[20, 173] [20, 172] [20] [20] [20, 1]
import paddle.nn as nn
from paddle.nn import Layer
# Embedding dimension
d_model = 50
# Number of classes
class_num = 4
ngram_size = len(ngram2idx)
class cbow(Layer):
    def __init__(self, vocab_size, ngram_size, d_model, class_num) -> None:
        super(cbow, self).__init__()
        self.vocab_size = vocab_size
        self.d_model = d_model
        self.class_num = class_num
        # nn.Embedding(num_embeddings, embedding_dim, padding_idx=None)
        # num_embeddings (int): size of the embedding dictionary
        # padding_idx: index of the pad token; its embedding is not updated during training
        self.embed1 = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.embed2 = nn.Embedding(ngram_size, d_model, padding_idx=0)
        self.linear = nn.Linear(d_model, class_num)

    def forward(self, x, ngram, lensX, lenNgram):
        # x[batch, maxlen1], ngram[batch, maxlen2], lensX[batch], lenNgram[batch]
        # The embedding layer appends a d_model axis after the batch and sequence axes
        x = self.embed1(x)          # x[batch, maxlen1, d_model]
        ngram = self.embed2(ngram)  # ngram[batch, maxlen2, d_model]
        # Sum the embeddings over the sequence axis
        x = paddle.sum(x, axis=1)          # x[batch, d_model]
        ngram = paddle.sum(ngram, axis=1)  # ngram[batch, d_model]
        x = x + ngram
        lens = lensX + lenNgram
        lens = lens.unsqueeze(1)  # lens[batch, 1]
        # Divide by the true (unpadded) length to get the mean
        x /= lens
        output = self.linear(x)
        return output

model = cbow(vocab_size, ngram_size, d_model, class_num)
model(x, ngram, lenx, lenNgram).shape
[20, 4]
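The pooling step in `forward` can be traced by hand on tiny inputs. The sketch below is plain Python with made-up embedding tables, not PaddlePaddle code, and `mean_pool` is a hypothetical name. It assumes, as in the model above, that row 0 of each table is the all-zero padding row, so padded positions add nothing to the sum and the division uses the true lengths only.

```python
def mean_pool(word_ids, ngram_ids, word_table, ngram_table, n_words, n_ngrams):
    """Average word and n-gram embeddings of one padded sentence."""
    dim = len(word_table[0])
    total = [0.0] * dim
    for idx in word_ids:                     # padded ids hit the zero row 0
        total = [t + v for t, v in zip(total, word_table[idx])]
    for idx in ngram_ids:
        total = [t + v for t, v in zip(total, ngram_table[idx])]
    count = n_words + n_ngrams               # true lengths, padding excluded
    return [t / count for t in total]

# Toy tables: row 0 is the padding row (hypothetical values)
word_table = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0]]
ngram_table = [[0.0, 0.0], [2.0, 2.0]]
# A sentence with 2 real words and 1 real bigram, padded with zeros
pooled = mean_pool([1, 2, 0, 0], [1, 0, 0], word_table, ngram_table, 2, 1)
print(pooled)
```

A sentence with no real words and no n-grams would give a count of 0 here, so an empty sentence needs separate handling before pooling.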
from paddle.optimizer import Adagrad
epochs = 2
lr = 0.01
optimize = Adagrad(parameters=model.parameters(), learning_rate=lr)
lossCul = nn.CrossEntropyLoss()
for epoch in range(epochs):
    # Running total of the loss
    allloss = 0
    for step, (x, ngram, lenx, lenNgram, y) in enumerate(dataloader):
        output = model(x, ngram, lenx, lenNgram)
        # Cross-entropy loss
        loss = lossCul(output, y)
        optimize.clear_grad()
        loss.backward()
        optimize.step()
        # Accumulate a Python float so the running total does not hold on to tensors
        allloss += float(loss)
        if (step + 1) % 500 == 0:
            print("epochs:", epoch + 1, " iter:", step + 1, " loss:", allloss / (step + 1))
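V. Summary
Pros and cons of the fastText model
Pros:
- Very fast, with reasonably good accuracy.
- Open-source implementations are available, so it is quick to get started.
Cons:
- The model structure is simple, so it is no longer the best-performing model.
- Because it relies on the bag-of-words idea, the semantic information it captures is limited.
Paper summary
Key points:
- Deep-learning text classifiers are accurate but relatively slow.
- Machine-learning methods based on linear classifiers are reasonably accurate and fast, but they require tedious feature engineering.
- The fastText model was proposed against this background.
Innovations:
- A new text classification model, fastText, is proposed.
- It uses tricks that make text classification both faster and more accurate: hierarchical softmax and n-gram features.
- It achieves fast and accurate results on two tasks: text classification and tag prediction.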
References:
This article is a repost.
Original project link