WSDM Cup 2023 Challenge: A Baseline for the Search Pre-training Track

Baidu, together with MSU, is hosting the 2023 challenge on unbiased web ranking and pre-training for web search at WSDM, a top academic conference. Recently, self-supervised learning based on pre-trained models has achieved remarkable success on NLP tasks. This challenge aims to explore training pre-trained models for search scenarios on large-scale search log data and to advance the state of the art.

This track is sponsored by Baidu PaddlePaddle. Built on Baidu's years of deep learning research and business applications, PaddlePaddle integrates core training and inference frameworks, a foundational model library, end-to-end development kits, and a rich set of tool components. It is China's first self-developed, feature-rich, open-source industrial-grade deep learning platform. PaddlePaddle helps developers quickly turn AI ideas into innovative AI applications and serves as a foundational platform for intelligent upgrades across a growing number of industries.

Task Description

  • Pre-train a large-scale model on the Large Scale Web Search Session Data, then fine-tune it on human-annotated samples to obtain a search ranking model.

  • Because the track is sponsored by PaddlePaddle, prizes are awarded only to submissions implemented in PaddlePaddle.

  • Reproducible code must be provided; otherwise the award will be revoked.

  • The evaluation metric for this task is DCG@10 (a minimal reference sketch follows this list).
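
For concreteness, here is a minimal sketch of DCG@10, assuming the common formulation with gain rel_i and a log2(i + 1) position discount; the official evaluation script may differ in details such as the gain function.

import numpy as np

def dcg_at_k(labels, k=10):
    """DCG@k over a ranked list of graded relevance labels."""
    labels = np.asarray(labels, dtype=np.float64)[:k]
    # positions are 1-indexed: sum_i rel_i / log2(i + 1)
    return float(np.sum(labels / np.log2(np.arange(2, labels.size + 2))))

print(dcg_at_k([4, 2, 0, 1]))  # scores a ranking of four graded documents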

For a more detailed data analysis and baseline description, see the paper A Large Scale Search Dataset for Unbiased Learning to Rank.

Example Code

The following provides a quick-start walkthrough.

Step 1: Download the example code from GitHub.

# clear the dir
!rm -rf ./*
# clone demo code
!git clone https://github.com/ChuXiaokai/WSDMCUP_BaiduPLM_Paddle.git
Cloning into 'WSDMCUP_BaiduPLM_Paddle'...
remote: Enumerating objects: 72, done.
remote: Counting objects: 100% (72/72), done.
remote: Compressing objects: 100% (68/68), done.
remote: Total 72 (delta 32), reused 3 (delta 0), pack-reused 0
Unpacking objects: 100% (72/72), done.
Checking connectivity... done.

Step 2: Download and prepare the pre-training and evaluation datasets

Note: the example code trains on a small subset of the data. For better results, download and train on the full, larger dataset.

A detailed description of the dataset is available on the Baidu Search Science team homepage.

# Note that this is a small subset of the whole dataset, used so the demo code runs quickly.
# Download train dataset
!wget https://searchscience.baidu.com/baidu_ultr/train_click/part-00000.gz -O part-00000.gz 
!wget https://searchscience.baidu.com/baidu_ultr/train_click/part-00001.gz -O part-00001.gz 

# prepare the training data (mkdir -p also creates the click/annotate dirs used below)
!mkdir -p ./WSDMCUP_BaiduPLM_Paddle/data/train_data/ ./WSDMCUP_BaiduPLM_Paddle/data/click_data/ ./WSDMCUP_BaiduPLM_Paddle/data/annotate_data/
!mv part-00001.gz ./WSDMCUP_BaiduPLM_Paddle/data/train_data/
# prepare the click test data; using a larger click dataset is suggested
# (the "gzip: stdout: Broken pipe" message from zcat | head below is harmless)
!zcat ./part-00000.gz|head -1000 > ./WSDMCUP_BaiduPLM_Paddle/data/click_data/test_click.data && gzip ./WSDMCUP_BaiduPLM_Paddle/data/click_data/test_click.data
!rm part-00000.gz

# download the annotated data and split it into fine-tuning and validation sets
!wget https://searchscience.baidu.com/baidu_ultr/labeled_dataset/test_data.txt -O test_data.txt 
!tail -10000 test_data.txt > valid_data.txt
!head -387572 test_data.txt > finetune.txt 
!mv valid_data.txt ./WSDMCUP_BaiduPLM_Paddle/data/annotate_data/
!mv finetune.txt ./WSDMCUP_BaiduPLM_Paddle/data/annotate_data/
--2022-10-20 11:40:28--  https://searchscience.baidu.com/baidu_ultr/train_click/part-00000.gz
Resolving searchscience.baidu.com (searchscience.baidu.com)... 180.101.49.82
Connecting to searchscience.baidu.com (searchscience.baidu.com)|180.101.49.82|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 622792845 (594M) [application/octet-stream]
Saving to: ‘part-00000.gz’

part-00000.gz        93%[=================>  ] 555.84M  12.0MB/s    in 60s     

2022-10-20 11:41:29 (9.21 MB/s) - Connection closed at byte 582836592. Retrying.

--2022-10-20 11:41:30--  (try: 2)  https://searchscience.baidu.com/baidu_ultr/train_click/part-00000.gz
Connecting to searchscience.baidu.com (searchscience.baidu.com)|180.101.49.82|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 622792845 (594M), 39956253 (38M) remaining [application/octet-stream]
Saving to: ‘part-00000.gz’

part-00000.gz       100%[++++++++++++++++++=>] 593.94M  7.18MB/s    in 5.9s    

2022-10-20 11:41:36 (6.50 MB/s) - ‘part-00000.gz’ saved [622792845/622792845]

--2022-10-20 11:41:36--  https://searchscience.baidu.com/baidu_ultr/train_click/part-00001.gz
Resolving searchscience.baidu.com (searchscience.baidu.com)... 180.101.49.82
Connecting to searchscience.baidu.com (searchscience.baidu.com)|180.101.49.82|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 622116985 (593M) [application/octet-stream]
Saving to: ‘part-00001.gz’

part-00001.gz         6%[>                   ]  36.16M   361KB/s    in 62s     

2022-10-20 11:42:39 (594 KB/s) - Connection closed at byte 37915712. Retrying.

--2022-10-20 11:42:40--  (try: 2)  https://searchscience.baidu.com/baidu_ultr/train_click/part-00001.gz
Connecting to searchscience.baidu.com (searchscience.baidu.com)|180.101.49.82|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 622116985 (593M), 584201273 (557M) remaining [application/octet-stream]
Saving to: ‘part-00001.gz’

part-00001.gz        93%[+================>  ] 552.07M   652KB/s    in 65s     

2022-10-20 11:43:45 (7.93 MB/s) - Connection closed at byte 578892336. Retrying.

--2022-10-20 11:43:47--  (try: 3)  https://searchscience.baidu.com/baidu_ultr/train_click/part-00001.gz
Connecting to searchscience.baidu.com (searchscience.baidu.com)|180.101.49.82|:443... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 622116985 (593M), 43224649 (41M) remaining [application/octet-stream]
Saving to: ‘part-00001.gz’

part-00001.gz       100%[++++++++++++++++++=>] 593.30M  4.83MB/s    in 17s     

2022-10-20 11:44:04 (2.49 MB/s) - ‘part-00001.gz’ saved [622116985/622116985]


gzip: stdout: Broken pipe
--2022-10-20 11:44:06--  https://searchscience.baidu.com/baidu_ultr/labeled_dataset/test_data.txt
Resolving searchscience.baidu.com (searchscience.baidu.com)... 180.101.49.82
Connecting to searchscience.baidu.com (searchscience.baidu.com)|180.101.49.82|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 273078456 (260M) [text/plain]
Saving to: ‘test_data.txt’

test_data.txt       100%[===================>] 260.43M  11.7MB/s    in 56s     

2022-10-20 11:45:01 (4.69 MB/s) - ‘test_data.txt’ saved [273078456/273078456]
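
To sanity-check the download, you can peek at the first record of the click log. This is just a convenience snippet, assuming the records are tab-separated as described on the dataset homepage:

import gzip

# Print the first few tab-separated fields of the first click-log record.
with gzip.open('./WSDMCUP_BaiduPLM_Paddle/data/train_data/part-00001.gz', 'rt') as f:
    first_record = f.readline().rstrip('\n')
print(first_record.split('\t')[:5])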

Step 3: Download a prepared pre-trained model (the warm-up model was obtained by training on two tasks, MLM + CTR prediction; if you have a better approach, other warm-up models can be used; this is only provided as a quick-start reference)

The warm-up models can be downloaded via the links in the GitHub repository: https://github.com/ChuXiaokai/WSDMCUP_BaiduPLM_Paddle.git

12-layer, 12-head, 768-dim model: https://drive.google.com/file/d/1dSSA3RotV4CNG5WMdAQ4ZulipjXSPwD1/view?usp=sharing

6-layer, 12-head, 768-dim model: https://drive.google.com/file/d/1tcQ-_VkKbV4ESE2Wu9FgAx9wVvO6Fc7w/view?usp=sharing

3-layer, 12-head, 768-dim model: https://drive.google.com/file/d/1tcQ-_VkKbV4ESE2Wu9FgAx9wVvO6Fc7w/view?usp=sharing

Because of upload and model-download restrictions, we recommend downloading a model yourself and importing it into your environment, then pointing --init_parameters at the model's path.

# Download the three-layer model to initialize from (uncomment to run)
# !pip install gdown
# !gdown https://drive.google.com/uc?id=1tcQ-_VkKbV4ESE2Wu9FgAx9wVvO6Fc7w
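
After downloading, a quick way to confirm the checkpoint is readable is to load it with paddle.load. The filename below is hypothetical; substitute whatever path gdown saved to:

import paddle

# Hypothetical path: replace with the actual downloaded checkpoint file.
state_dict = paddle.load('./model_3layer.pdparams')
print(len(state_dict), 'tensors in the warm-up checkpoint')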

Step 4: Large-scale pre-training for search

Train a large-scale search pre-training model.

The example code uses MLM + CTR prediction as the pre-training tasks; a sketch of this joint objective is given below.

Note: point --init_parameters to the path of the corresponding warm-up model.
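
Conceptually, the joint objective combines a masked-language-model loss over the masked tokens with a binary CTR loss on the click signal. The sketch below is a minimal illustration under these assumptions, not the repo's exact code (the unweighted sum and the ignore index of -100 are assumptions):

import paddle
import paddle.nn.functional as F

def joint_pretrain_loss(mlm_logits, mlm_labels, ctr_logits, clicks):
    # mlm_logits: [batch, seq_len, vocab]; mlm_labels: [batch, seq_len],
    # with unmasked positions set to the ignore index so only masked
    # positions contribute to the MLM loss
    mlm_loss = F.cross_entropy(
        mlm_logits.reshape([-1, mlm_logits.shape[-1]]),
        mlm_labels.reshape([-1]),
        ignore_index=-100)
    # ctr_logits, clicks: [batch]; clicks are 0/1 floats
    ctr_loss = F.binary_cross_entropy_with_logits(ctr_logits, clicks)
    return mlm_loss + ctr_loss  # unweighted sum as a simple default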

# pre-train the model
# (--init_parameters can load a warm-up checkpoint; trained checkpoints are saved under work/save_model/*)
!cd WSDMCUP_BaiduPLM_Paddle && python -u pretrain.py --init_parameters '' --emb_dim 768 --nlayer 3 --nhead 12 --dropout 0.1 --buffer_size 20 --eval_batch_size 20 --valid_click_path ./data/click_data/test_click.data.gz --save_step 50 --eval_step 50 --n_queries_for_each_gpu 8 --num_candidates 10 --valid_annotate_path ./data/annotate_data/valid_data.txt 
Namespace(_CLS_=0, _MASK_=3, _PAD_=2, _SEP_=1, buffer_size=20, dropout=0.1, emb_dim=768, eval_batch_size=20, eval_step=50, evaluate_model_path='', exp_settings={'method_name': 'NavieAlgorithm', 'n_gpus': 2, 'init_parameters': '', 'lr': 2e-07, 'max_candidate_num': 10, 'selection_bias_cutoff': 10, 'feature_size': 768, 'train_input_hparams': '', 'learning_algorithm_hparams': ''}, finetune_epoch=20, init_parameters='', log_interval=10, lr=2e-07, max_seq_len=128, max_steps=100000, method_name='NavieAlgorithm', n_gpus=2, n_queries_for_each_gpu=8, nhead=12, nlayers=3, ntokens=22000, num_candidates=10, result_path='result.csv', save_step=50, seed=0, test_annotate_path='./data/wsdm_round_1/wsdm_test_1.txt', train_batch_size=160, train_datadir='./data/train_data/', valid_annotate_path='./data/annotate_data/valid_data.txt', valid_click_path='./data/click_data/test_click.data.gz', warmup_steps=4000, weight_decay=0.01)
load annotated data from  ./data/annotate_data/valid_data.txt
load logged click data from  ./data/click_data/test_click.data.gz
Transformer is used for pretrain
W1020 11:53:41.626910  2014 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W1020 11:53:41.631050  2014 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
load file part-00001.gz
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/math_op_patch.py:278: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.int32, but right dtype is paddle.float32, the right dtype will convert to paddle.int32
  format(lhs_dtype, rhs_dtype, lhs_dtype))
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/math_op_patch.py:278: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float32, but right dtype is paddle.bool, the right dtype will convert to paddle.float32
  format(lhs_dtype, rhs_dtype, lhs_dtype))
    0th step | lr 5.000e-11 | ms/batch 2466.13 | ctr 0.16288 | mlm 1.00683
0th step valid annotate | dcg@10: all 7.23544 | high 6.92377 | mid 7.50521 | low 7.96299 | pnr 1.01120
0th step valid click |dcg@3 0.22920 | dcg@5 0.26729 | dcg@10 0.43082 | pnr 0.92550
   10th step | lr 5.500e-10 | ms/batch 1179.41 | ctr 1.44906 | mlm 10.04882
   20th step | lr 1.050e-09 | ms/batch 545.47 | ctr 1.51707 | mlm 10.03446
   30th step | lr 1.550e-09 | ms/batch 537.22 | ctr 1.60664 | mlm 10.03044
   40th step | lr 2.050e-09 | ms/batch 559.94 | ctr 1.57694 | mlm 10.03254
   50th step | lr 2.550e-09 | ms/batch 560.08 | ctr 1.54710 | mlm 10.03014
50th step valid annotate | dcg@10: all 7.95067 | high 9.10882 | mid 8.12825 | low 7.64055 | pnr 0.95670
50th step valid click |dcg@3 0.18295 | dcg@5 0.25418 | dcg@10 0.42581 | pnr 0.87187
   60th step | lr 3.050e-09 | ms/batch 1210.35 | ctr 1.62457 | mlm 10.02694
   70th step | lr 3.550e-09 | ms/batch 541.79 | ctr 1.51931 | mlm 10.02788
   80th step | lr 4.050e-09 | ms/batch 557.38 | ctr 1.57417 | mlm 10.02641
   90th step | lr 4.550e-09 | ms/batch 549.59 | ctr 1.59447 | mlm 10.02970
  100th step | lr 5.050e-09 | ms/batch 541.10 | ctr 1.51983 | mlm 10.02984
100th step valid annotate | dcg@10: all 9.62659 | high 12.78715 | mid 7.13612 | low 9.80036 | pnr 1.19054
100th step valid click |dcg@3 0.18322 | dcg@5 0.27769 | dcg@10 0.42824 | pnr 0.95349
^C
Traceback (most recent call last):
  File "pretrain.py", line 101, in <module>
    total_ctr_loss += ctr_loss.item()
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/varbase_patch_methods.py", line 572, in item
    return self._getitem_from_offset(*args).item()
KeyboardInterrupt

Step 5: Fine-tuning for search

In this example, we collapse the human annotations into two levels (0,1 -> 0; 2,3,4 -> 1) and train with binary classification, as in the snippet below.

Note: point --init_parameters to the path of the model trained in the previous step.
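
The label mapping itself is a one-liner; a minimal sketch (the function name is ours, not the repo's):

def binarize_label(label: int) -> int:
    # collapse 5-level relevance annotations: {0, 1} -> 0, {2, 3, 4} -> 1
    return 0 if label <= 1 else 1

assert [binarize_label(l) for l in range(5)] == [0, 0, 1, 1, 1]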

!cd WSDMCUP_BaiduPLM_Paddle && python -u finetune.py --emb_dim 768 --nlayer 3 --nhead 12 --dropout 0.1 --buffer_size 20 --eval_batch_size 20 --save_step 50 --eval_step 50 --init_parameters '' --n_queries_for_each_gpu 8 --num_candidates 10 --valid_annotate_path ./data/annotate_data/finetune.txt --test_annotate_path ./data/annotate_data/valid_data.txt
Namespace(_CLS_=0, _MASK_=3, _PAD_=2, _SEP_=1, buffer_size=20, dropout=0.1, emb_dim=768, eval_batch_size=20, eval_step=50, evaluate_model_path='', exp_settings={'method_name': 'NavieAlgorithm', 'n_gpus': 2, 'init_parameters': '', 'lr': 2e-07, 'max_candidate_num': 10, 'selection_bias_cutoff': 10, 'feature_size': 768, 'train_input_hparams': '', 'learning_algorithm_hparams': ''}, finetune_epoch=20, init_parameters='', log_interval=10, lr=2e-07, max_seq_len=128, max_steps=100000, method_name='NavieAlgorithm', n_gpus=2, n_queries_for_each_gpu=8, nhead=12, nlayers=3, ntokens=22000, num_candidates=10, result_path='result.csv', save_step=50, seed=0, test_annotate_path='./data/annotate_data/valid_data.txt', train_batch_size=160, train_datadir='./data/train_data/', valid_annotate_path='./data/annotate_data/finetune.txt', valid_click_path='./data/click_data/part-00000.gz', warmup_steps=4000, weight_decay=0.01)
Transformer is used for finetune
W1020 11:55:31.831223  6306 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W1020 11:55:31.835103  6306 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
load annotated data from  ./data/annotate_data/finetune.txt
load annotated data from  ./data/annotate_data/valid_data.txt
    0th step | loss 0.715395
0th step valid annotate | dcg@10: all 7.994964 | high 9.773284 | mid 7.546137 | low 7.964992 | pnr 1.048048
   10th step | loss 0.804500
   20th step | loss 1.135197
   30th step | loss 1.005948
   40th step | loss 0.751979
   50th step | loss 1.025093
50th step valid annotate | dcg@10: all 8.628210 | high 6.103388 | mid 7.387629 | low 11.145684 | pnr 0.985732
   60th step | loss 0.691976
   70th step | loss 0.591397
   80th step | loss 0.956084
   90th step | loss 0.768199
  100th step | loss 0.968086
100th step valid annotate | dcg@10: all 7.370319 | high 10.159169 | mid 6.728385 | low 6.626826 | pnr 1.003588
  110th step | loss 1.075818
  120th step | loss 1.156515
  130th step | loss 0.963196
  140th step | loss 0.667635
  150th step | loss 0.894794
150th step valid annotate | dcg@10: all 6.344257 | high 6.183422 | mid 4.236632 | low 7.419970 | pnr 0.982316
  160th step | loss 0.838718
  170th step | loss 0.690243
  180th step | loss 0.946279
  190th step | loss 0.817485
  200th step | loss 0.946013
200th step valid annotate | dcg@10: all 7.376082 | high 7.133713 | mid 5.465949 | low 8.357487 | pnr 1.001154
  210th step | loss 0.826875
  220th step | loss 0.699750
  230th step | loss 0.928465
  240th step | loss 1.042144
  250th step | loss 0.789787
250th step valid annotate | dcg@10: all 8.553375 | high 11.054966 | mid 6.965511 | low 8.023679 | pnr 1.069588
  260th step | loss 0.968102
  270th step | loss 0.836006
  280th step | loss 0.747215
  290th step | loss 0.964489
  300th step | loss 0.894813
300th step valid annotate | dcg@10: all 6.733287 | high 9.517683 | mid 6.143183 | low 6.117587 | pnr 1.011633
  310th step | loss 0.812461
  320th step | loss 0.948829
  330th step | loss 0.894507
  340th step | loss 0.955558
  350th step | loss 0.840505
350th step valid annotate | dcg@10: all 7.062942 | high 7.528291 | mid 5.413143 | low 8.228528 | pnr 1.058824
  360th step | loss 1.129776
  370th step | loss 0.809914
  380th step | loss 0.950315
  390th step | loss 0.863141
  400th step | loss 0.788924
400th step valid annotate | dcg@10: all 8.575130 | high 11.893890 | mid 9.243525 | low 7.100964 | pnr 0.991805
  410th step | loss 0.997355
  420th step | loss 0.733651
  430th step | loss 0.613862
  440th step | loss 0.835121
  450th step | loss 0.665618
450th step valid annotate | dcg@10: all 5.841508 | high 9.478191 | mid 3.526082 | low 5.734523 | pnr 1.011556
  460th step | loss 0.767018
  470th step | loss 1.121727
  480th step | loss 0.843226
  490th step | loss 0.748975
  500th step | loss 0.898522
500th step valid annotate | dcg@10: all 6.229550 | high 6.570770 | mid 5.476731 | low 6.971382 | pnr 1.064337
  510th step | loss 1.024264
  520th step | loss 0.813276
  530th step | loss 0.872401
  540th step | loss 0.845863
  550th step | loss 0.690517
550th step valid annotate | dcg@10: all 7.085219 | high 11.351965 | mid 5.084504 | low 6.447138 | pnr 1.021756
  560th step | loss 0.785674
  570th step | loss 0.644466
  580th step | loss 0.825297
  590th step | loss 0.926044
  600th step | loss 0.927760
^C
Traceback (most recent call last):
  File "finetune.py", line 96, in <module>
    src_padding_mask=src_padding_mask, 
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/aistudio/WSDMCUP_BaiduPLM_Paddle/Transformer4Ranking/model.py", line 127, in forward
    output = self.transformer_encoder(x, padding_mask).transpose([1,2,0])
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/nn/layer/transformer.py", line 697, in forward
    output = mod(output, src_mask=src_mask)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/nn/layer/transformer.py", line 581, in forward
    src = self.self_attn(src, src, src, src_mask)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/nn/layer/transformer.py", line 423, in forward
    out = tensor.reshape(x=out, shape=[0, 0, out.shape[2] * out.shape[3]])
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/tensor/manipulation.py", line 2139, in reshape
    return paddle.fluid.layers.reshape(x=x, shape=shape, name=name)
  File "/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/layers/nn.py", line 6373, in reshape
    out, _ = _C_ops.reshape2(x, None, 'shape', shape)
KeyboardInterrupt

Step 6: Prepare the submission

Submission page: https://aistudio.baidu.com/aistudio/competition/detail/537/0/submit-result?previewCode=aa051a5d-aa48-4be0-83af-4937933a80c2

(a) Download the evaluation dataset.

(b) Load the trained model and produce the model scores.

(c) Find result.csv in the directory, compress it into result.csv.zip, and submit (a minimal compression snippet follows).
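
For step (c), something like the following works, assuming submit.py writes result.csv to the repository root (result_path='result.csv' in the printed Namespace suggests this):

import zipfile

# Compress result.csv into result.csv.zip for upload.
with zipfile.ZipFile('./WSDMCUP_BaiduPLM_Paddle/result.csv.zip', 'w',
                     compression=zipfile.ZIP_DEFLATED) as zf:
    zf.write('./WSDMCUP_BaiduPLM_Paddle/result.csv', arcname='result.csv')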

# Download the evaluation set from the AIStudio
!wget "https://bj.bcebos.com/v1/ai-studio-online/2cd7df56206d4e50be0f3a58e53a0236d7eab0cd9ad94f9d82afddba361aa479?responseContentDisposition=attachment%3B%20filename%3Dwsdm_test_1.txt&authorization=bce-auth-v1%2F5cfe9a5e1454405eb2a975c43eace6ec%2F2022-09-14T15%3A30%3A16Z%2F-1%2F%2Feae14185e8787d435e4720811fd707447593c1e6a7b24903daf99684a6391c95" -O wsdm_test_1.txt
!mv wsdm_test_1.txt ./WSDMCUP_BaiduPLM_Paddle/data/
!cd WSDMCUP_BaiduPLM_Paddle && python -u submit.py --emb_dim 768 --nlayer 3 --nhead 12 --init_parameters '' --test_annotate_path "data/wsdm_test_1.txt"  --eval_batch_size 200
# Find the result.csv and submit compressed result.csv.zip to the platform.
--2022-10-20 11:59:42--  https://bj.bcebos.com/v1/ai-studio-online/2cd7df56206d4e50be0f3a58e53a0236d7eab0cd9ad94f9d82afddba361aa479?responseContentDisposition=attachment%3B%20filename%3Dwsdm_test_1.txt&authorization=bce-auth-v1%2F5cfe9a5e1454405eb2a975c43eace6ec%2F2022-09-14T15%3A30%3A16Z%2F-1%2F%2Feae14185e8787d435e4720811fd707447593c1e6a7b24903daf99684a6391c95
Resolving bj.bcebos.com (bj.bcebos.com)... 220.181.33.44, 220.181.33.43, 2409:8c04:1001:1002:0:ff:b001:368a
Connecting to bj.bcebos.com (bj.bcebos.com)|220.181.33.44|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 43754463 (42M) [application/octet-stream]
Saving to: ‘wsdm_test_1.txt’

wsdm_test_1.txt     100%[===================>]  41.73M  6.76MB/s    in 6.2s    

2022-10-20 11:59:48 (6.71 MB/s) - ‘wsdm_test_1.txt’ saved [43754463/43754463]

Namespace(_CLS_=0, _MASK_=3, _PAD_=2, _SEP_=1, buffer_size=500000, dropout=0.1, emb_dim=768, eval_batch_size=200, eval_step=500, evaluate_model_path='', exp_settings={'method_name': 'NavieAlgorithm', 'n_gpus': 2, 'init_parameters': '', 'lr': 2e-07, 'max_candidate_num': 30, 'selection_bias_cutoff': 30, 'feature_size': 768, 'train_input_hparams': '', 'learning_algorithm_hparams': ''}, finetune_epoch=20, init_parameters='', log_interval=10, lr=2e-07, max_seq_len=128, max_steps=100000, method_name='NavieAlgorithm', n_gpus=2, n_queries_for_each_gpu=5, nhead=12, nlayers=3, ntokens=22000, num_candidates=30, result_path='result.csv', save_step=5000, seed=0, test_annotate_path='data/wsdm_test_1.txt', train_batch_size=300, train_datadir='./data/train_data/', valid_annotate_path='./data/annotate_data/val_data.txt', valid_click_path='./data/click_data/part-00000.gz', warmup_steps=4000, weight_decay=0.01)
Transformer is used for finetune
W1020 11:59:50.662153 20444 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W1020 11:59:50.666157 20444 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.
load annotated data from  data/wsdm_test_1.txt

Summary and Discussion

Within roughly five minutes of submitting, you should see a result like the following:

[Figure: screenshot of the submission result]

Discussion:
In the pre-training stage, self-supervised learning and few-shot learning are promising directions to explore. In the fine-tuning stage, pairwise and listwise training losses, as well as SetRank-based approaches, can improve the model. In addition, the information in the unigram vocabulary may also help (a pairwise-loss sketch is given below).
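
As one concrete example of the pairwise direction, a RankNet-style loss compares the scores of a more-relevant and a less-relevant document under the same query. A minimal sketch (names and shapes are our assumptions):

import paddle
import paddle.nn.functional as F

def pairwise_ranknet_loss(pos_scores, neg_scores):
    # pos_scores / neg_scores: [n_pairs] model scores for the more-relevant
    # and less-relevant document of each pair; push pos above neg.
    return F.binary_cross_entropy_with_logits(
        pos_scores - neg_scores, paddle.ones_like(pos_scores))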
