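PaddleNLP: Can AI Solve LeetCode Algorithm Problems Too?

An attempt to generate Python solutions to LeetCode-style algorithm problems by loading pretrained weights into the GPT2 model from the PaddleNLP toolkit.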

AI Studio · Published 2021-11-07 10:58:10

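Introduction

As large-scale pretrained language models have emerged and kept improving, the quality of text generation has risen with them.
Couplets, poems, essays: text generation tasks of every kind keep appearing.
Code is a special kind of text too, so can it be generated automatically as well?
The answer is yes, and others have already explored it.
This small project uses a pretrained GPT2 model to generate solution code for algorithm problems.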

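Project Description

The dataset and pretrained model come from the open-source project hendrycks/apps.
On the Paddle framework, PaddleNLP is used to load the converted pretrained model and run text generation.
This project is only a small experiment and still has many problems.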

Dataset

It contains 5,000 Python algorithm problems together with their solutions and the input and output values used for testing.
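As a rough illustration of what one problem looks like on disk, the sketch below reads a problem directory. The file names `question.txt` and `input_output.json` and the optional `fn_name` key are the ones used later in this post; the `inputs` and `outputs` keys and the helper name `load_problem` are assumptions made for this example, and a temporary stand-in directory is used so that it runs without the dataset.

```python
import json
import os
import tempfile


def load_problem(prob_path):
    """Read the question text and the test-case spec from one problem directory."""
    with open(os.path.join(prob_path, "question.txt"), encoding="utf-8") as f:
        question = f.read()
    with open(os.path.join(prob_path, "input_output.json"), encoding="utf-8") as f:
        io_spec = json.load(f)
    return question, io_spec


# Build a tiny stand-in problem directory (hypothetical content, not from APPS).
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "question.txt"), "w", encoding="utf-8") as f:
    f.write("Read an integer n and print n + 1.\n")
with open(os.path.join(demo_dir, "input_output.json"), "w", encoding="utf-8") as f:
    # "inputs"/"outputs" are assumed key names; "fn_name" is absent here,
    # which this post treats as a standard-input problem.
    json.dump({"inputs": ["4\n"], "outputs": ["5\n"]}, f)

question, io_spec = load_problem(demo_dir)
print(question)
print("call-based" if io_spec.get("fn_name") else "standard input")
```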

Get the Code

Clone the PaddleNLP repository

!git clone https://github.com.cnpmjs.org/PaddlePaddle/PaddleNLP

Install Dependencies

Before running anything, install the required dependency

!pip install regex

Unpack the Pretrained Model and Dataset

%cd ~
!mkdir dataset
!unzip -q data/data99247/1.5B.zip -d pretrained_models
!tar -xf data/data98973/APPS.tar.gz -C dataset

/home/aistudio

Load the Model and Tokenizer

Load the text generation model and the tokenizer that encodes its input and decodes its output

%cd ~/PaddleNLP/
from paddlenlp.transformers import GPTTokenizer, GPTLMHeadModel

# Tokenizer
tokenizer = GPTTokenizer.from_pretrained('gpt2-en')

# Set up model
model = GPTLMHeadModel.from_pretrained('../pretrained_models/1.5B')

/home/aistudio/PaddleNLP


[2021-07-14 23:01:41,824] [    INFO] - Found /home/aistudio/.paddlenlp/models/gpt2-en/gpt-en-vocab.json
[2021-07-14 23:01:41,828] [    INFO] - Found /home/aistudio/.paddlenlp/models/gpt2-en/gpt-en-merges.txt

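Prepare the Input

Build the prompt text for a problem; it is converted into the token ID sequence the model accepts in the next step

import json

def generate_prompt(test_case_path, prompt_path, tokenizer):
    # The prompt starts with the problem statement
    _input = "\nQUESTION:\n"
    with open(prompt_path, "r", encoding='UTF-8') as f:
        data = f.readlines()
        data = "".join(data)
    _input += data

    # A problem without "fn_name" reads from standard input;
    # otherwise the solution is a function that gets called
    with open(test_case_path, "r") as f:
        data = json.load(f)
    if not data.get("fn_name"):
        _input += "\nUse Standard Input format"
    else:
        _input += "\nUse Call-Based format"

    _input += "\nANSWER:\n"

    return _input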
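To see what the two prompt variants look like, here is a minimal sketch that mirrors `generate_prompt` but takes the question text and the parsed JSON directly instead of file paths. The name `build_prompt`, the toy questions, and the function name `add_one` are made up for this illustration.

```python
def build_prompt(question, io_spec):
    """Assemble the prompt in the same layout as generate_prompt."""
    prompt = "\nQUESTION:\n" + question
    if not io_spec.get("fn_name"):
        prompt += "\nUse Standard Input format"
    else:
        prompt += "\nUse Call-Based format"
    prompt += "\nANSWER:\n"
    return prompt


# One toy problem of each kind (hypothetical content).
stdin_prompt = build_prompt("Print n + 1.", {"inputs": ["4\n"], "outputs": ["5\n"]})
call_prompt = build_prompt("Return n + 1.", {"fn_name": "add_one"})
print(repr(stdin_prompt))
print(repr(call_prompt))
```

The model is expected to continue the text after `ANSWER:`, so everything before that marker is context and everything after it is treated as the solution.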

Generate Code

Next, generate code from the input problem

%cd ~/PaddleNLP/
import os
import json
import paddle

# Pick one problem from the test split and build its prompt
prob_path = '../dataset/APPS/test/0002'
test_case_path = os.path.join(prob_path, "input_output.json")
prompt_path = os.path.join(prob_path, "question.txt")
input_texts = generate_prompt(test_case_path, prompt_path, tokenizer)
print("Problem:\n", input_texts)

# Encode the prompt and add a batch dimension
input_ids = paddle.to_tensor(tokenizer.encode(input_texts)['input_ids'], dtype='int64').unsqueeze(0)
# Beam search with 5 beams; the length budget is 1024 minus the prompt length
output_ids = model.generate(
    input_ids=input_ids,
    max_length=1024-input_ids.shape[1],
    min_length=0,
    decode_strategy='beam_search',
    temperature=1.0,
    top_k=0,
    top_p=1.0,
    num_beams=5,
    length_penalty=1.0,
    early_stopping=False,
    bos_token_id=None,
    eos_token_id=50256,
    pad_token_id=None,
    num_return_sequences=1,
    use_cache=True
)

# Decode the returned sequence and strip the end-of-text marker
code = tokenizer.convert_ids_to_string(output_ids[0][0].numpy().tolist()).replace('<|endoftext|>', '')
print("Code:\n", code)

/home/aistudio/PaddleNLP
Problem:
 
QUESTION:
Apart from having lots of holidays throughout the year, residents of Berland also have whole lucky years. Year is considered lucky if it has no more than 1 non-zero digit in its number. So years 100, 40000, 5 are lucky and 12, 3001 and 12345 are not.

You are given current year in Berland. Your task is to find how long will residents of Berland wait till the next lucky year.


-----Input-----

The first line contains integer number n (1 ≤ n ≤ 10^9) — current year in Berland.


-----Output-----

Output amount of years from the current year to the next lucky one.


-----Examples-----
Input
4

Output
1

Input
201

Output
99

Input
4000

Output
1000



-----Note-----

In the first example next lucky year is 5. In the second one — 300. In the third — 5000.
Use Standard Input format
ANSWER:

Code:
 def main():
	n = int(input())
	while True:
		if n % 10 == 0:
			print(n)
		else:
			n = n // 10 + 1
			if n % 10 != 0:
				print(n)
			else:
				break
	
def __starting_point():
	main()

__starting_point()
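The generation cell removes the `<|endoftext|>` marker from the decoded string. An alternative, sketched here under the assumption that the end-of-text ID is 50256 (the `eos_token_id` passed to `model.generate`), is to cut the ID list at the first end-of-text token before decoding, so that nothing after it reaches the output. The helper name and the token IDs below are made up for illustration.

```python
def truncate_at_eos(token_ids, eos_token_id=50256):
    """Return the tokens before the first end-of-text ID, or all of them if none."""
    if eos_token_id in token_ids:
        return token_ids[:token_ids.index(eos_token_id)]
    return list(token_ids)


# Made-up IDs followed by end-of-text padding.
generated = [4299, 1388, 33529, 50256, 50256, 50256]
print(truncate_at_eos(generated))
```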

Write to a File

The generated code looks fairly plausible
It can also be written to a text file

with open('../test.py', 'w') as f:
    f.write(code)
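Before saving, it can be worth checking that the generated text at least parses as Python. The sketch below uses the standard library `ast` module; the helper name `is_valid_python` and the two sample strings are made up for this example. A successful parse only means the syntax is valid, not that the solution is correct.

```python
import ast


def is_valid_python(source):
    """Return True if the text parses as Python source, False on a syntax error."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


# One well-formed and one malformed sample (hypothetical model outputs).
good = "def main():\n\tn = int(input())\n\tprint(n)\n"
bad = "def main(:\n\tprint(n)\n"
print(is_valid_python(good), is_valid_python(bad))
```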

Try Running the Code

Because the code reads from standard input, it cannot be run in the Notebook for now
It has to be run from a terminal instead

$ python test.py
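The terminal step can also be scripted. The following is a minimal sketch, not part of the original project, that runs a script with `subprocess.run`, feeds it a test input on standard input, and gives up after a timeout, which matters because generated code may loop forever. The helper name `run_solution` and the stand-in script are assumptions for this example.

```python
import os
import subprocess
import sys
import tempfile


def run_solution(script_path, stdin_text, timeout=5):
    """Run a script with the given stdin; return its stdout, or None on timeout."""
    try:
        result = subprocess.run(
            [sys.executable, script_path],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout


# Stand-in for a generated solution: reads an integer and prints n + 1.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("n = int(input())\nprint(n + 1)\n")
    demo_script = f.name

output = run_solution(demo_script, "4\n")
print(output.strip() == "5")
os.remove(demo_script)
```

Comparing the returned text with the expected output for each test input gives a simple pass or fail signal for a generated solution.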

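Known Issues

The upstream project actually provides two pretrained models, but only the smaller one has been converted so far (the larger one is too big, and it uses GPT-Neo, which PaddleNLP does not currently support)
This pretrained model performs reasonably well only on some of the data
Overall it only produces code that looks plausible; in practice most of the solutions are incorrect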
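The lucky-year sample shown earlier illustrates this: for the input 201 the generated code first prints 21 instead of 99 and then never leaves its loop. For comparison, here is a short hand-written reference, not produced by the model and not taken from the dataset, checked against the three examples in the problem statement. It relies on the observation that the next lucky year keeps the number of digits and raises the leading digit by one, rolling over to the next power of ten when that digit is 9.

```python
def years_until_lucky(n):
    """Years from n to the next year with at most one non-zero digit."""
    digits = str(n)
    # Raise the leading digit by one and zero out the rest;
    # a leading 9 rolls over to the next power of ten.
    next_lucky = (int(digits[0]) + 1) * 10 ** (len(digits) - 1)
    return next_lucky - n


# The three examples from the problem statement: (input, expected output).
for year, expected in [(4, 1), (201, 99), (4000, 1000)]:
    print(year, years_until_lucky(year), expected)
```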

To Do

Next, I plan to use this dataset to build a code generation model of my own

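Summary

That wraps up today's experiment. Code generation is feasible, but for now good results are hard to achieve
After all, writing code is a hard task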
