My PaddleNLP Study Notes

I. Loading Pre-trained Word Embeddings with PaddleNLP

A word embedding represents each word as a vector of real numbers. A "good" embedding captures the semantic similarity between words. Word embeddings have been shown to improve the performance of NLP tasks such as syntactic parsing and sentiment analysis.



Figure 1: Word embedding illustration

PaddleNLP ships with a number of public pre-trained embeddings, which you can load through the paddlenlp.embeddings.TokenEmbedding interface to improve training results. This tutorial first introduces the initialization of paddlenlp.embeddings.TokenEmbedding and the text representations it produces, then demonstrates the training gains it brings with a text classification example.

II. Installing PaddleNLP

The AI Studio platform will install PaddleNLP by default in the future; until then, install it with the following command.

!pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple
from functools import partial

import paddle
import paddle.nn as nn
import paddlenlp
from paddlenlp.datasets import load_dataset
from paddlenlp.embeddings import TokenEmbedding
from paddlenlp.data import Pad, Stack, Tuple, Vocab

import data

III. Using PaddleNLP

1. TokenEmbedding Initialization

  • embedding_name
    The name of the pre-trained embedding to load, passed to TokenEmbedding. Defaults to w2v.baidu_encyclopedia.target.word-word.dim300.
  • unknown_token
    The unknown token. Defaults to [UNK].
  • unknown_token_vector
    The vector of the unknown token. By default it is drawn from a zero-mean normal distribution with the same dimensionality as the embedding.
  • extended_vocab_path
    Path to an extended vocabulary file, one word per line. If an extended vocabulary is supplied, trainable is forced to True.
  • trainable
    Whether the embedding layer is trainable. True means the embedding parameters can be updated, False means they are frozen. Defaults to True.
# Initialize TokenEmbedding; the pre-trained embedding is downloaded automatically on first use and then loaded
token_embedding = TokenEmbedding(embedding_name="w2v.baidu_encyclopedia.target.word-word.dim300")

# Inspect the token_embedding details
print(token_embedding)
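
The remaining constructor arguments can be combined as needed. Below is a minimal sketch, assuming a hypothetical extended vocabulary file my_vocab.txt (one word per line) exists locally:

import numpy as np

# A sketch only: my_vocab.txt is a hypothetical extended vocabulary file (one word per line).
custom_embedding = TokenEmbedding(
    embedding_name="w2v.baidu_encyclopedia.target.word-word.dim300",
    unknown_token="[UNK]",
    unknown_token_vector=np.random.normal(scale=0.02, size=300).tolist(),  # 300 matches dim300
    extended_vocab_path="./my_vocab.txt",  # supplying this forces trainable=True (see the source analysis below)
    trainable=True)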

2. TokenEmbedding.search

Gets the word vectors of the specified words.

test_token_embedding = token_embedding.search("中国")
print(test_token_embedding)
print(test_token_embedding.shape)

3. TokenEmbedding.cosine_sim

Computes the cosine similarity between word vectors. Semantically related words have a higher cosine similarity, which shows that the pre-trained embedding space represents meaning well. (The snippet below also uses TokenEmbedding.dot, the plain dot product, for comparison.)

score1 = token_embedding.cosine_sim("女孩", "女人")
score2 = token_embedding.cosine_sim("女孩", "书籍")
score3 = token_embedding.cosine_sim("女孩", "男孩")
score4 = token_embedding.cosine_sim("中国", "美国")
score5 = token_embedding.dot("中国", "美国")
print('score1:',score1)
print('score2:',score2)
print('score3:',score3)
print('score4:',score4)
print('score5:',score5)

4. Projecting Word Vectors into a Low-dimensional Space

The High Dimensional component of VisualDL, the deep learning visualization tool, can visualize the embedding results for intuitive analysis. The steps are as follows:

  1. AI Studio currently ships VisualDL 2.1, so upgrade to 2.2 to use the latest dimensionality reduction feature

pip install --upgrade visualdl

  2. Create a LogWriter and record the word vectors
  3. Click the Visualization tab in the left panel, select 'hidi' as the log file, and start the VisualDL visualization
!pip install --upgrade visualdl
# Get the first 1000 words in the vocabulary
labels = token_embedding.vocab.to_tokens(list(range(0,1000)))
test_token_embedding = token_embedding.search(labels)

# Use VisualDL's LogWriter to record the embeddings
from visualdl import LogWriter

with LogWriter(logdir='./hidi') as writer:
    # add_embeddings takes a list of row vectors here, so unpack the numpy matrix
    writer.add_embeddings(tag='test', mat=[i for i in test_token_embedding], metadata=labels)

Launch VisualDL to view the dimensionality-reduced word vectors.

Steps:

  • 1. Switch to the "Visualization" tab and point it at the visualization log
  • 2. Select 'hidi' as the log file
  • 3. Click "Start VisualDL", then "Open VisualDL", and choose "High Dimensional" to see the first 1000 vocabulary words mapped into 3-D space with UMAP:

As the plot shows, semantically related words (such as numbers and chapter names) cluster together in the embedding space, which indicates that the pre-trained word vectors represent text well.

Besides embeddings, VisualDL can also visualize scalars, images, audio, and more, which makes training and hyperparameter tuning more efficient. For more features and details, see the VisualDL documentation.

IV. Text Classification Task

The following text classification example demonstrates the training improvement brought by paddlenlp.embeddings.TokenEmbedding.

1. Data Preparation

1.1 Download the vocabulary file dict.txt, used to build the word-to-id mapping.

data.py (the helper module imported above as import data):

import numpy as np
import paddle
from paddlenlp.data import JiebaTokenizer, Pad, Stack, Tuple, Vocab
import jieba
tokenizer = jieba


def set_tokenizer(vocab):
    global tokenizer
    if vocab is not None:
        tokenizer = JiebaTokenizer(vocab=vocab)


def load_vocab(vocab_file):
    """Loads a vocabulary file into a dictionary."""
    vocab = {}
    with open(vocab_file, "r", encoding="utf-8") as reader:
        tokens = reader.readlines()
    for index, token in enumerate(tokens):
        token = token.rstrip("\n").split("\t")[0]
        vocab[token] = index
    return vocab


def convert_tokens_to_ids(tokens, vocab):
    """ Converts a token id (or a sequence of id) in a token string
        (or a sequence of tokens), using the vocabulary.
    """

    ids = []
    unk_id = vocab.get('[UNK]', None)
    for token in tokens:
        wid = vocab.get(token, unk_id)
        if wid is not None:
            ids.append(wid)
    return ids


def convert_example(example, vocab, unk_token_id=1, is_test=False):
    """
    Builds model inputs from a sequence for sequence classification tasks. 
    It uses `jieba.cut` to tokenize the text.
    Args:
        example(obj:`list[str]`): List of input data, containing text and label if it has a label.
        vocab(obj:`dict`): The vocabulary.
        unk_token_id(obj:`int`, defaults to 1): The unknown token id.
        is_test(obj:`False`, defaults to `False`): Whether the example contains label or not.
    Returns:
        input_ids(obj:`list[int]`): The list of token ids.
        valid_length(obj:`int`): The input sequence valid length.
        label(obj:`numpy.array`, data type of int64, optional): The input label if not is_test.
    """

    input_ids = []
    for token in tokenizer.cut(example['text']):
        token_id = vocab.get(token, unk_token_id)
        input_ids.append(token_id)
    valid_length = len(input_ids)

    if not is_test:
        label = np.array(example['label'], dtype="int64")
        return input_ids, valid_length, label
    else:
        return input_ids, valid_length


def pad_texts_to_max_seq_len(texts, max_seq_len, pad_token_id=0):
    """
    Pads each text to max_seq_len if it is shorter than max_seq_len;
    otherwise truncates it to max_seq_len.
    Args:
        texts(obj:`list`): A list of texts, each containing a sequence of word ids.
        max_seq_len(obj:`int`): Max sequence length.
        pad_token_id(obj:`int`, optional, defaults to 0): The pad token index.
    """
    for index, text in enumerate(texts):
        seq_len = len(text)
        if seq_len < max_seq_len:
            padded_tokens = [pad_token_id for _ in range(max_seq_len - seq_len)]
            new_text = text + padded_tokens
            texts[index] = new_text
        elif seq_len > max_seq_len:
            new_text = text[:max_seq_len]
            texts[index] = new_text

def preprocess_prediction_data(data, vocab):
    """
    It process the prediction data as the format used as training.
    Args:
        data (obj:`List[str]`): The prediction data whose each element is  a tokenized text.
    Returns:
        examples (obj:`List(Example)`): The processed data whose each element is a Example (numedtuple) object.
            A Example object contains `text`(word_ids) and `se_len`(sequence length).
    """
    examples = []
    for text in data:
        tokens = " ".join(tokenizer.cut(text)).split(' ')
        ids = convert_tokens_to_ids(tokens, vocab)
        examples.append([ids, len(ids)])
    return examples

def create_dataloader(dataset,
                      trans_fn=None,
                      mode='train',
                      batch_size=1,
                      use_gpu=False,
                      pad_token_id=0):
    """
    Creates a dataloader.
    Args:
        dataset(obj:`paddle.io.Dataset`): Dataset instance.
        mode(obj:`str`, optional, defaults to obj:`train`): If mode is 'train', it will shuffle the dataset randomly.
        batch_size(obj:`int`, optional, defaults to 1): The sample number of a mini-batch.
        use_gpu(obj:`bool`, optional, defaults to obj:`False`): Whether to use gpu to run.
        pad_token_id(obj:`int`, optional, defaults to 0): The pad token index.
    Returns:
        dataloader(obj:`paddle.io.DataLoader`): The dataloader which generates batches.
    """
    if trans_fn:
        dataset = dataset.map(trans_fn, lazy=True)

    shuffle = True if mode == 'train' else False
    sampler = paddle.io.BatchSampler(
        dataset=dataset, batch_size=batch_size, shuffle=shuffle)
    batchify_fn = lambda samples, fn=Tuple(
        Pad(axis=0, pad_val=pad_token_id),  # input_ids
        Stack(dtype="int64"),  # seq len
        Stack(dtype="int64")  # label
    ): [data for data in fn(samples)]

    dataloader = paddle.io.DataLoader(
        dataset,
        batch_sampler=sampler,
        return_list=True,
        collate_fn=batchify_fn)
    return dataloader

!wget https://paddlenlp.bj.bcebos.com/data/dict.txt

1.2 Load the vocabulary and the dataset. The dataset is ChnSentiCorp, a public Chinese sentiment analysis dataset built into PaddleNLP that can be loaded in one line.

# Loads vocab.
vocab_path='./dict.txt'  
vocab = data.load_vocab(vocab_path)
if '[PAD]' not in vocab:
    vocab['[PAD]'] = len(vocab)
# Loads dataset.
train_ds, dev_ds, test_ds = load_dataset("chnsenticorp", splits=["train", "dev", "test"])
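
A quick sanity check (a sketch, not part of the original notebook): each example is a dict with 'text' and 'label' fields, which convert_example in data.py relies on.

print(len(vocab))    # vocabulary size, including the appended [PAD]
print(train_ds[0])   # one raw example, e.g. {'text': ..., 'label': ...}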

1.3 Create the DataLoader objects needed for training and prediction.

# Reads data and generates mini-batches.
trans_fn = partial(
    data.convert_example,
    vocab=vocab,
    unk_token_id=vocab['[UNK]'],
    is_test=False)
train_loader = data.create_dataloader(
    train_ds,
    trans_fn=trans_fn,
    batch_size=64,
    mode='train',
    pad_token_id=vocab['[PAD]'])
dev_loader = data.create_dataloader(
    dev_ds,
    trans_fn=trans_fn,
    batch_size=64,
    mode='validation',
    pad_token_id=vocab['[PAD]'])
test_loader = data.create_dataloader(
    test_ds,
    trans_fn=trans_fn,
    batch_size=64,
    mode='test',
    pad_token_id=vocab['[PAD]'])
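
Each mini-batch yielded by these loaders is an (input_ids, seq_len, label) triple produced by the Tuple(Pad, Stack, Stack) collate function; a quick peek (a sketch, not part of the original notebook):

# Peek at one training batch: padded token ids, valid lengths, and labels.
for batch_input_ids, batch_seq_lens, batch_labels in train_loader:
    print(batch_input_ids.shape)  # [batch_size, max_seq_len_in_batch]
    print(batch_seq_lens.shape)   # [batch_size]
    print(batch_labels.shape)     # [batch_size]
    break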

2. Model Construction

Use BoWEncoder to build a bag-of-words (BoW) model for the text classification task. BoWEncoder takes all the vectors of the input sequence and outputs their simple sum as the representation of the sequence's semantics.

class BoWModel(nn.Layer):
    def __init__(self,
                 vocab_size,
                 num_classes,
                 vocab_path,
                 emb_dim=300,
                 hidden_size=128,
                 fc_hidden_size=96,
                 use_token_embedding=True):
        super().__init__()
        if use_token_embedding:
            self.embedder = TokenEmbedding(
                "w2v.baidu_encyclopedia.target.word-word.dim300", extended_vocab_path=vocab_path)
            emb_dim = self.embedder.embedding_dim
        else:
            padding_idx = vocab_size - 1
            self.embedder = nn.Embedding(
                vocab_size, emb_dim, padding_idx=padding_idx)
        self.bow_encoder = paddlenlp.seq2vec.BoWEncoder(emb_dim)
        self.fc1 = nn.Linear(self.bow_encoder.get_output_dim(), hidden_size)
        self.fc2 = nn.Linear(hidden_size, fc_hidden_size)
        self.dropout = nn.Dropout(p=0.3, axis=1)
        self.output_layer = nn.Linear(fc_hidden_size, num_classes)

    def forward(self, text, seq_len=None):
        # Shape: (batch_size, num_tokens, embedding_dim)
        embedded_text = self.embedder(text)

        # Shape: (batch_size, embedding_dim)
        summed = self.bow_encoder(embedded_text)
        summed = self.dropout(summed)
        encoded_text = paddle.tanh(summed)

        # Shape: (batch_size, hidden_size)
        fc1_out = paddle.tanh(self.fc1(encoded_text))
        # Shape: (batch_size, fc_hidden_size)
        fc2_out = paddle.tanh(self.fc2(fc1_out))
        # Shape: (batch_size, num_classes)
        logits = self.output_layer(fc2_out)
        return logits
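
BoWEncoder itself is nothing more than a sum over the token dimension; a minimal sketch to illustrate (not part of the original notebook):

# Each output element is the sum of that embedding dimension over all tokens.
demo_encoder = paddlenlp.seq2vec.BoWEncoder(emb_dim=4)
demo_input = paddle.ones([2, 3, 4])   # (batch_size=2, num_tokens=3, emb_dim=4)
demo_output = demo_encoder(demo_input)
print(demo_output.shape)              # [2, 4]
print(demo_output.numpy()[0])         # [3. 3. 3. 3.], i.e. the sum over the 3 tokens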

2.1 Get the Number of Class Labels

num_classes = len(train_ds.label_list)

2.2 Define the Model

Define a BoW model that uses the pre-trained TokenEmbedding. Setting the hyperparameter use_token_embedding=False instead defines a BoW model with a randomly initialized embedding.

use_token_embedding=True
learning_rate=5e-4

model = BoWModel(
    vocab_size=len(vocab),
    num_classes=num_classes,
    vocab_path=vocab_path,
    use_token_embedding=use_token_embedding)
if use_token_embedding:
    vocab = model.embedder.vocab
    data.set_tokenizer(vocab)
    vocab = vocab.token_to_idx
else:
    v = Vocab.from_dict(vocab, unk_token="[UNK]", pad_token="[PAD]")
    data.set_tokenizer(v)

model = paddle.Model(model)

3. Model Configuration

Call model.prepare() to configure the model, such as the loss function and the optimizer.

optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=learning_rate)

# Defines loss and metric.
criterion = paddle.nn.CrossEntropyLoss()
metric = paddle.metric.Accuracy()

model.prepare(optimizer, criterion, metric)

4. Model Training

4.1 Monitor the Training Process and Results with VisualDL

With Paddle 2.0's high-level APIs such as model.prepare() and model.fit(), the training code becomes very concise. The paddle.Model() interface also provides Callback classes; here the VisualDL callback is used to evaluate the training process and results visually.

# Set the VisualDL log directory
log_dir = './use_normal_embedding'
if use_token_embedding:
    log_dir = './use_token_embedding'
callback = paddle.callbacks.VisualDL(log_dir=log_dir)

4.2 Train the Model in One Call with model.fit()

model.fit(train_loader,
    dev_loader,
    epochs=20,
    save_dir='./checkpoints',
    callbacks=callback)
Epoch 20/20
step  10/150 - loss: 0.2043 - acc: 0.9187 - 63ms/step
step  20/150 - loss: 0.0673 - acc: 0.9375 - 56ms/step
step  30/150 - loss: 0.0824 - acc: 0.9510 - 54ms/step
step  40/150 - loss: 0.1066 - acc: 0.9496 - 53ms/step
step  50/150 - loss: 0.1181 - acc: 0.9531 - 52ms/step
step  60/150 - loss: 0.0444 - acc: 0.9560 - 52ms/step
step  70/150 - loss: 0.0328 - acc: 0.9567 - 51ms/step
step  80/150 - loss: 0.2206 - acc: 0.9574 - 51ms/step
step  90/150 - loss: 0.1328 - acc: 0.9580 - 51ms/step
step 100/150 - loss: 0.0814 - acc: 0.9563 - 51ms/step
step 110/150 - loss: 0.0616 - acc: 0.9572 - 51ms/step
step 120/150 - loss: 0.1595 - acc: 0.9573 - 51ms/step
step 130/150 - loss: 0.0690 - acc: 0.9556 - 51ms/step
step 140/150 - loss: 0.1511 - acc: 0.9561 - 51ms/step
step 150/150 - loss: 0.0406 - acc: 0.9563 - 50ms/step
save checkpoint at /home/aistudio/checkpoints/19
Eval begin...
step 10/19 - loss: 0.5991 - acc: 0.8922 - 56ms/step
step 19/19 - loss: 0.5061 - acc: 0.8867 - 37ms/step
Eval samples: 1200
save checkpoint at /home/aistudio/checkpoints/final

5. Training Results

Launch VisualDL as described above, selecting 'use_token_embedding' and 'use_normal_embedding' as the log files. The visualized training curves are as follows:

In the figure, the green curves are the run that uses paddlenlp.embeddings.TokenEmbedding, and the blue curves are the run whose embedding does not load a pre-trained model. With paddlenlp.embeddings.TokenEmbedding, the validation accuracy trends upward and converges at about 0.90, while the loss trends downward and stays relatively stable after converging, so the model is less prone to overfitting. Without it, the validation accuracy trends downward and converges at about 0.88. This example shows that using paddlenlp.embeddings.TokenEmbedding improves training results.

V. Tokenization

from paddlenlp.data import JiebaTokenizer
tokenizer = JiebaTokenizer(vocab=token_embedding.vocab)
words = tokenizer.cut("中国人民")
print(words) # ['中国人', '民']

tokens = tokenizer.encode("中国人民")
print(tokens) # [12530, 1334]
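
encode() is presumably just cut() followed by a vocabulary lookup; the sketch below (using paddlenlp.data.Vocab's to_indices method) should reproduce its output:

# Look the cut tokens up in the embedding's vocab; expected to match tokenizer.encode("中国人民").
ids = token_embedding.vocab.to_indices(tokenizer.cut("中国人民"))
print(ids)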

VI. Source Analysis of embeddings constant.py

1. The 62 Embedding Names

The file contains EMBEDDING_NAME_LIST, group after group of names, and quite a lot of them. Let's count them first.

from enum import Enum
import os.path as osp

URL_ROOT = "https://paddlenlp.bj.bcebos.com"
EMBEDDING_URL_ROOT = URL_ROOT + "/models/embeddings"

PAD_TOKEN = '[PAD]'
UNK_TOKEN = '[UNK]'

EMBEDDING_NAME_LIST = [
    # Word2Vec
    # baidu_encyclopedia
    "w2v.baidu_encyclopedia.target.word-word.dim300",
    "w2v.baidu_encyclopedia.target.word-character.char1-1.dim300",
    "w2v.baidu_encyclopedia.target.word-character.char1-2.dim300",
    "w2v.baidu_encyclopedia.target.word-character.char1-4.dim300",
    "w2v.baidu_encyclopedia.target.word-ngram.1-2.dim300",
    "w2v.baidu_encyclopedia.target.word-ngram.1-3.dim300",
    "w2v.baidu_encyclopedia.target.word-ngram.2-2.dim300",
    "w2v.baidu_encyclopedia.target.word-wordLR.dim300",
    "w2v.baidu_encyclopedia.target.word-wordPosition.dim300",
    "w2v.baidu_encyclopedia.target.bigram-char.dim300",
    "w2v.baidu_encyclopedia.context.word-word.dim300",
    "w2v.baidu_encyclopedia.context.word-character.char1-1.dim300",
    "w2v.baidu_encyclopedia.context.word-character.char1-2.dim300",
    "w2v.baidu_encyclopedia.context.word-character.char1-4.dim300",
    "w2v.baidu_encyclopedia.context.word-ngram.1-2.dim300",
    "w2v.baidu_encyclopedia.context.word-ngram.1-3.dim300",
    "w2v.baidu_encyclopedia.context.word-ngram.2-2.dim300",
    "w2v.baidu_encyclopedia.context.word-wordLR.dim300",
    "w2v.baidu_encyclopedia.context.word-wordPosition.dim300",
    # wikipedia
    "w2v.wiki.target.bigram-char.dim300",
    "w2v.wiki.target.word-char.dim300",
    "w2v.wiki.target.word-word.dim300",
    "w2v.wiki.target.word-bigram.dim300",
    # people_daily
    "w2v.people_daily.target.bigram-char.dim300",
    "w2v.people_daily.target.word-char.dim300",
    "w2v.people_daily.target.word-word.dim300",
    "w2v.people_daily.target.word-bigram.dim300",
    # weibo
    "w2v.weibo.target.bigram-char.dim300",
    "w2v.weibo.target.word-char.dim300",
    "w2v.weibo.target.word-word.dim300",
    "w2v.weibo.target.word-bigram.dim300",
    # sogou
    "w2v.sogou.target.bigram-char.dim300",
    "w2v.sogou.target.word-char.dim300",
    "w2v.sogou.target.word-word.dim300",
    "w2v.sogou.target.word-bigram.dim300",
    # zhihu
    "w2v.zhihu.target.bigram-char.dim300",
    "w2v.zhihu.target.word-char.dim300",
    "w2v.zhihu.target.word-word.dim300",
    "w2v.zhihu.target.word-bigram.dim300",
    # financial
    "w2v.financial.target.bigram-char.dim300",
    "w2v.financial.target.word-char.dim300",
    "w2v.financial.target.word-word.dim300",
    "w2v.financial.target.word-bigram.dim300",
    # literature
    "w2v.literature.target.bigram-char.dim300",
    "w2v.literature.target.word-char.dim300",
    "w2v.literature.target.word-word.dim300",
    "w2v.literature.target.word-bigram.dim300",
    # siku
    "w2v.sikuquanshu.target.word-word.dim300",
    "w2v.sikuquanshu.target.word-bigram.dim300",
    # Mix-large
    "w2v.mixed-large.target.word-char.dim300",
    "w2v.mixed-large.target.word-word.dim300",
    # GOOGLE NEWS
    "w2v.google_news.target.word-word.dim300.en",
    # GloVe
    "glove.wiki2014-gigaword.target.word-word.dim50.en",
    "glove.wiki2014-gigaword.target.word-word.dim100.en",
    "glove.wiki2014-gigaword.target.word-word.dim200.en",
    "glove.wiki2014-gigaword.target.word-word.dim300.en",
    "glove.twitter.target.word-word.dim25.en",
    "glove.twitter.target.word-word.dim50.en",
    "glove.twitter.target.word-word.dim100.en",
    "glove.twitter.target.word-word.dim200.en",
    # FastText
    "fasttext.wiki-news.target.word-word.dim300.en",
    "fasttext.crawl.target.word-word.dim300.en"
]

from paddlenlp.embeddings import constant

print(len(constant.EMBEDDING_NAME_LIST))
for item in constant.EMBEDDING_NAME_LIST:
    print(item)

2. token_embedding.py

from enum import Enum
import os
import os.path as osp
import numpy as np
import logging

import paddle
import paddle.nn as nn
from paddle.utils.download import get_path_from_url
from paddlenlp.utils.env import _get_sub_home, MODEL_HOME
from paddlenlp.utils.log import logger
from paddlenlp.data import Vocab, get_idx_from_word
from .constant import EMBEDDING_URL_ROOT, PAD_TOKEN, UNK_TOKEN,\
                      EMBEDDING_NAME_LIST

EMBEDDING_HOME = _get_sub_home('embeddings', parent_home=MODEL_HOME)

__all__ = ['list_embedding_name', 'TokenEmbedding']


def list_embedding_name():
    """
    Lists all names of pretrained embedding models paddlenlp provides.
    """
    return list(EMBEDDING_NAME_LIST)


class TokenEmbedding(nn.Embedding):
    """
    A `TokenEmbedding` can load pre-trained embedding model which paddlenlp provides by
    specifying embedding name. Furthermore, a `TokenEmbedding` can load extended vocabulary
    by specifying extended_vocab_path.

    Args:
        embedding_name (`str`, optional):
            The pre-trained embedding model name. Use `paddlenlp.embeddings.list_embedding_name()` to
            list the names of all embedding models that we provide. 
            Defaults to `w2v.baidu_encyclopedia.target.word-word.dim300`.
        unknown_token (`str`, optional):
            Specifies unknown token.
            Defaults to `[UNK]`.
        unknown_token_vector (`list`, optional):
            To initialize the vector of unknown token. If it's none, use normal distribution to
            initialize the vector of unknown token.
            Defaults to `None`.
        extended_vocab_path (`str`, optional):
            The file path of extended vocabulary.
            Defaults to `None`.
        trainable (`bool`, optional):
            Whether the weight of embedding can be trained.
            Defaults to True.
        keep_extended_vocab_only (`bool`, optional):
            Whether to keep the extended vocabulary only, will be effective only if provides extended_vocab_path.
            Defaults to False.
    """

    def __init__(self,
                 embedding_name=EMBEDDING_NAME_LIST[0],
                 unknown_token=UNK_TOKEN,
                 unknown_token_vector=None,
                 extended_vocab_path=None,
                 trainable=True,
                 keep_extended_vocab_only=False):
        vector_path = osp.join(EMBEDDING_HOME, embedding_name + ".npz")
        if not osp.exists(vector_path):
            # download
            url = EMBEDDING_URL_ROOT + "/" + embedding_name + ".tar.gz"
            get_path_from_url(url, EMBEDDING_HOME)

        logger.info("Loading token embedding...")
        vector_np = np.load(vector_path)
        self.embedding_dim = vector_np['embedding'].shape[1]
        self.unknown_token = unknown_token
        if unknown_token_vector is not None:
            unk_vector = np.array(unknown_token_vector).astype(
                paddle.get_default_dtype())
        else:
            unk_vector = np.random.normal(
                scale=0.02,
                size=self.embedding_dim).astype(paddle.get_default_dtype())
        pad_vector = np.array(
            [0] * self.embedding_dim).astype(paddle.get_default_dtype())
        if extended_vocab_path is not None:
            embedding_table = self._extend_vocab(extended_vocab_path, vector_np,
                                                 pad_vector, unk_vector,
                                                 keep_extended_vocab_only)
            trainable = True
        else:
            embedding_table = self._init_without_extend_vocab(
                vector_np, pad_vector, unk_vector)

        self.vocab = Vocab.from_dict(
            self._word_to_idx, unk_token=unknown_token, pad_token=PAD_TOKEN)
        self.num_embeddings = embedding_table.shape[0]
        # import embedding
        super(TokenEmbedding, self).__init__(
            self.num_embeddings,
            self.embedding_dim,
            padding_idx=self._word_to_idx[PAD_TOKEN])
        self.weight.set_value(embedding_table)
        self.set_trainable(trainable)
        logger.info("Finish loading embedding vector.")
        s = "Token Embedding info:\
             \nUnknown index: {}\
             \nUnknown token: {}\
             \nPadding index: {}\
             \nPadding token: {}\
             \nShape :{}".format(
            self._word_to_idx[self.unknown_token], self.unknown_token,
            self._word_to_idx[PAD_TOKEN], PAD_TOKEN, self.weight.shape)
        logger.info(s)

    def _init_without_extend_vocab(self, vector_np, pad_vector, unk_vector):
        """
        Constructs index to word list, word to index dict and embedding weight.
        """
        self._idx_to_word = list(vector_np['vocab'])
        self._idx_to_word.append(self.unknown_token)
        self._idx_to_word.append(PAD_TOKEN)
        self._word_to_idx = self._construct_word_to_idx(self._idx_to_word)
        # insert unk, pad embedding
        embedding_table = np.append(
            vector_np['embedding'], [unk_vector, pad_vector], axis=0)

        return embedding_table

    def _read_vocab_list_from_file(self, extended_vocab_path):
        # load new vocab table from file
        vocab_list = []
        with open(extended_vocab_path, "r", encoding="utf-8") as f:
            for line in f.readlines():
                vocab = line.rstrip("\n").split("\t")[0]
                vocab_list.append(vocab)
        return vocab_list

    def _extend_vocab(self, extended_vocab_path, vector_np, pad_vector,
                      unk_vector, keep_extended_vocab_only):
        """
        Constructs index to word list, word to index dict and embedding weight using
        extended vocab.
        """
        logger.info("Start extending vocab.")
        extend_vocab_list = self._read_vocab_list_from_file(extended_vocab_path)
        extend_vocab_set = set(extend_vocab_list)
        # update idx_to_word
        self._idx_to_word = extend_vocab_list
        self._word_to_idx = self._construct_word_to_idx(self._idx_to_word)

        # use Xavier initialization for the embedding
        xavier_scale = np.sqrt(
            6.0 / float(len(self._idx_to_word) + self.embedding_dim))
        embedding_table = np.random.uniform(
            low=-1.0 * xavier_scale,
            high=xavier_scale,
            size=(len(self._idx_to_word),
                  self.embedding_dim)).astype(paddle.get_default_dtype())

        pretrained_idx_to_word = list(vector_np['vocab'])
        pretrained_word_to_idx = self._construct_word_to_idx(
            pretrained_idx_to_word)
        pretrained_embedding_table = np.array(vector_np['embedding'])

        pretrained_vocab_set = set(pretrained_idx_to_word)
        extend_vocab_set = set(self._idx_to_word)
        vocab_intersection = pretrained_vocab_set & extend_vocab_set
        vocab_subtraction = pretrained_vocab_set - extend_vocab_set

        # assignment from pretrained_vocab_embedding to extend_vocab_embedding
        pretrained_vocab_intersect_index = [
            pretrained_word_to_idx[word] for word in vocab_intersection
        ]
        pretrained_vocab_subtract_index = [
            pretrained_word_to_idx[word] for word in vocab_subtraction
        ]
        extend_vocab_intersect_index = [
            self._word_to_idx[word] for word in vocab_intersection
        ]
        embedding_table[
            extend_vocab_intersect_index] = pretrained_embedding_table[
                pretrained_vocab_intersect_index]
        if not keep_extended_vocab_only:
            for idx in pretrained_vocab_subtract_index:
                word = pretrained_idx_to_word[idx]
                self._idx_to_word.append(word)
                self._word_to_idx[word] = len(self._idx_to_word) - 1

            embedding_table = np.append(
                embedding_table,
                pretrained_embedding_table[pretrained_vocab_subtract_index],
                axis=0)

        if self.unknown_token not in extend_vocab_set:
            self._idx_to_word.append(self.unknown_token)
            self._word_to_idx[self.unknown_token] = len(self._idx_to_word) - 1
            embedding_table = np.append(embedding_table, [unk_vector], axis=0)
        else:
            unk_idx = self._word_to_idx[self.unknown_token]
            embedding_table[unk_idx] = unk_vector

        if PAD_TOKEN not in extend_vocab_set:
            self._idx_to_word.append(PAD_TOKEN)
            self._word_to_idx[PAD_TOKEN] = len(self._idx_to_word) - 1
            embedding_table = np.append(embedding_table, [pad_vector], axis=0)
        else:
            embedding_table[self._word_to_idx[PAD_TOKEN]] = pad_vector

        logger.info("Finish extending vocab.")
        return embedding_table

    def set_trainable(self, trainable):
        """
        Whether or not to set the weights of token embedding to be trainable.

        Args:
            trainable (`bool`):
                The weights can be trained if trainable is set to True, or the weights are fixed if trainable is False.

        """
        self.weight.stop_gradient = not trainable

    def search(self, words):
        """
        Gets the vectors of specifying words.

        Args:
            words (`list` or `str` or `int`): The words which need to be searched.

        Returns:
            `numpy.array`: The vectors of specifying words.

        """
        idx_list = self.get_idx_list_from_words(words)
        idx_tensor = paddle.to_tensor(idx_list)
        return self(idx_tensor).numpy()

    def get_idx_from_word(self, word):
        """
        Gets the index of specifying word by searching word_to_idx dict. 

        Args:
            word (`list` or `str` or `int`): The input token word which we want to get the token index converted from.

        Returns:
            `int`: The index of specifying word.

        """
        return get_idx_from_word(word, self.vocab.token_to_idx,
                                 self.unknown_token)

    def get_idx_list_from_words(self, words):
        """
        Gets the index list of specifying words by searching word_to_idx dict.

        Args:
            words (`list` or `str` or `int`): The input token words which we want to get the token indices converted from.

        Returns:
            `list`: The indexes list of specifying words.

        """
        if isinstance(words, str):
            idx_list = [self.get_idx_from_word(words)]
        elif isinstance(words, int):
            idx_list = [words]
        elif isinstance(words, list) or isinstance(words, tuple):
            idx_list = [
                self.get_idx_from_word(word) if isinstance(word, str) else word
                for word in words
            ]
        else:
            raise TypeError
        return idx_list

    def _dot_np(self, array_a, array_b):
        return np.sum(array_a * array_b)

    def _calc_word(self, word_a, word_b, calc_kernel):
        embeddings = self.search([word_a, word_b])
        embedding_a = embeddings[0]
        embedding_b = embeddings[1]
        return calc_kernel(embedding_a, embedding_b)

    def dot(self, word_a, word_b):
        """
        Calculates the dot product of 2 words. Dot product or scalar product is an
        algebraic operation that takes two equal-length sequences of numbers (usually
        coordinate vectors), and returns a single number.

        Args:
            word_a (`str`): The first word string.
            word_b (`str`): The second word string.

        Returns:
            `Float`: The dot product of 2 words.

        """
        dot = self._dot_np
        return self._calc_word(word_a, word_b, lambda x, y: dot(x, y))

    def cosine_sim(self, word_a, word_b):
        """
        Calculates the cosine similarity of 2 word vectors. Cosine similarity is the
        cosine of the angle between two n-dimensional vectors in an n-dimensional space.

        Args:
            word_a (`str`): The first word string.
            word_b (`str`): The second word string.

        Returns:
            `Float`: The cosine similarity of 2 words.

        """
        dot = self._dot_np
        return self._calc_word(
            word_a, word_b,
            lambda x, y: dot(x, y) / (np.sqrt(dot(x, x)) * np.sqrt(dot(y, y))))

    def _construct_word_to_idx(self, idx_to_word):
        """
        Constructs word to index dict.

        Args:
            idx_to_word ('list'):

        Returns:
            `Dict`: The word to index dict constructed by idx_to_word.

        """
        word_to_idx = {}
        for i, word in enumerate(idx_to_word):
            word_to_idx[word] = i
        return word_to_idx

    def __repr__(self):
        """
        Returns:
            `Str`: The token embedding information.

        """
        info = "Object   type: {}\
             \nUnknown index: {}\
             \nUnknown token: {}\
             \nPadding index: {}\
             \nPadding token: {}\
             \n{}".format(
            super(TokenEmbedding, self).__repr__(),
            self._word_to_idx[self.unknown_token], self.unknown_token,
            self._word_to_idx[PAD_TOKEN], PAD_TOKEN, self.weight)
        return info

3. TokenEmbedding Methods

3.1 list_embedding_name

3.2 TokenEmbedding Instance Methods

3.2.1 set_trainable: sets whether the embedding weights are trainable
3.2.2 search: gets the vectors of the specified words
3.2.3 get_idx_from_word: gets the index of a word by looking it up in the word_to_idx dict
3.2.4 get_idx_list_from_words: gets the index list of the specified words by looking them up in the word_to_idx dict
3.2.5 dot: computes the dot product of two words; the dot product is an algebraic operation that takes two equal-length sequences of numbers (usually coordinate vectors) and returns a single number
3.2.6 cosine_sim: computes the cosine similarity of two word vectors, i.e. the cosine of the angle between two n-dimensional vectors in n-dimensional space

(A usage sketch for these methods follows the snippet below.)
from paddlenlp.embeddings import token_embedding

# list_embedding_name() returns the names of all pre-trained embeddings PaddleNLP provides
embedding_list = token_embedding.list_embedding_name()
for item in embedding_list:
    print(item)
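
The instance methods listed in 3.2 can be exercised directly on a TokenEmbedding instance; a brief sketch (the import above rebinds the name token_embedding to the module, so a fresh instance is created here with the TokenEmbedding class imported at the top of the notebook):

emb = TokenEmbedding("w2v.baidu_encyclopedia.target.word-word.dim300")
print(emb.search("中国").shape)                      # word vector, shape (1, 300)
print(emb.get_idx_from_word("中国"))                 # vocab index of one word
print(emb.get_idx_list_from_words(["中国", "美国"]))   # vocab indices of several words
print(emb.dot("中国", "美国"))                        # dot product of two word vectors
print(emb.cosine_sim("中国", "美国"))                 # cosine similarity of two word vectors
emb.set_trainable(False)                             # freeze the embedding weights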

VII. More Pre-trained Embeddings in PaddleNLP

PaddleNLP provides 61 ready-to-load pre-trained word embeddings, trained on multi-domain Chinese and English corpora, covering several classic embedding models (word2vec, GloVe, fastText) and a range of dimensions and corpus sizes. See the PaddleNLP Embedding API for details.
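
For example, one of the English GloVe embeddings listed in constant.py above can be loaded in exactly the same way (a sketch; the model is downloaded on first use):

# Load an English GloVe embedding and compare two English words.
glove_embedding = TokenEmbedding(
    embedding_name="glove.wiki2014-gigaword.target.word-word.dim300.en")
print(glove_embedding.cosine_sim("king", "queen"))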
