[AI Writes Code] Generating Code with the CodeGen Model

Generate or complete code with the CodeGen code generation model

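1. Project Overview

In recent years artificial intelligence has advanced rapidly, and AI tools for developers keep appearing. GitHub Copilot in particular shows the potential of AI-written code.

PaddleNLP has also recently released code generation models, which can be run with a single Taskflow call.

Let's explore the fun of automatic code generation!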

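2. Getting Started with AI Code Writing

Because CodeGen has not yet been released in the PaddleNLP pip package, we need to pull the dev code ourselves and install the latest development version of PaddleNLP.

Clone the latest PaddleNLP repository, then install PaddleNLP and the required packages from inside it. A compressed archive of the repository has been prepared here, so you can simply unzip it and use it.

Preparation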

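# Unpack the prepared archive of the PaddleNLP repository.
!unzip PaddleNLP
# Install the released package, then remove it again; its dependencies stay installed.
!pip install --upgrade paddlenlp -i https://mirror.baidu.com/pypi/simple
!pip uninstall -y paddlenlp
# Install the development version from the unpacked source tree.
%cd PaddleNLP
!python setup.py install
%cd ../
# Install the regex package as well.
!pip install regex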

Restart the kernel
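
After the kernel restarts, it can help to confirm which PaddleNLP build is actually importable. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of PaddleNLP.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Hypothetical check to run in a fresh cell after restarting the kernel.
version = installed_version("paddlenlp")
print("paddlenlp:", version if version else "not installed")
```

If this prints "not installed", the `setup.py install` step probably did not complete in the environment the notebook kernel is using.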


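CodeGen Parameters

  • batch_size: batch size; adjust it to your hardware. Default: 1.
  • max_length: maximum length of the generated code. Default: 128.
  • min_length: minimum length of the generated code. Default: 0.
  • decode_strategy: decoding strategy; supports greedy_search, beam_search, and sampling. Default: sampling.
  • temperature: the temperature decoding parameter. Default: 0.6.
  • top_k: the top_k decoding parameter. Default: 5.
  • top_p: the top_p decoding parameter. Default: 1.0.
  • num_beams: beam size for beam_search decoding. Default: 4.
  • length_penalty: length control value for decoding. Default: 1.0.
  • repetition_penalty: repetition penalty for decoding. Default: 1.1.
  • output_scores: whether to output decoding scores. Default: False.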
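
To give a feel for what temperature, top_k, and top_p do during sampling, here is a toy sketch in plain Python. It is an illustration under simplifying assumptions, not PaddleNLP's implementation: the function `filter_distribution` and the `toy_logits` values are made up, and a real decoder works on tensors and may apply the filters in a different order.

```python
import math
from typing import Dict


def filter_distribution(logits: Dict[str, float], temperature: float = 0.6,
                        top_k: int = 5, top_p: float = 1.0) -> Dict[str, float]:
    """Return the renormalized token probabilities left after filtering."""
    # Temperature rescales logits: values below 1 sharpen the distribution.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    # Softmax, subtracting the maximum for numerical stability.
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    probs = sorted(((tok, e / total) for tok, e in exps.items()),
                   key=lambda item: item[1], reverse=True)
    # top_k keeps only the k most likely tokens.
    probs = probs[:top_k]
    # top_p keeps the shortest prefix whose cumulative probability reaches p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


# Made-up logits for four candidate next tokens.
toy_logits = {"print": 2.0, "return": 1.0, "pass": 0.5, "import": -1.0}
print(filter_distribution(toy_logits, temperature=0.6, top_k=2))
```

Lowering temperature or top_k concentrates probability on the most likely tokens, which tends to make output more repeatable; raising them allows more varied completions.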

Pretrained Models Supported by CodeGen

"code_generation": {
    "models": {
        "Salesforce/codegen-350M-mono": {
            "task_class": CodeGenerationTask,
            "task_flag": 'code_generation-Salesforce/codegen-350M-mono',
            "task_priority_path": "Salesforce/codegen-350M-mono",
        },
        "Salesforce/codegen-2B-mono": {
            "task_class": CodeGenerationTask,
            "task_flag": 'code_generation-Salesforce/codegen-2B-mono',
            "task_priority_path": "Salesforce/codegen-2B-mono",
        },
        "Salesforce/codegen-6B-mono": {
            "task_class": CodeGenerationTask,
            "task_flag": 'code_generation-Salesforce/codegen-6B-mono',
            "task_priority_path": "Salesforce/codegen-6B-mono",
        },
        "Salesforce/codegen-350M-nl": {
            "task_class": CodeGenerationTask,
            "task_flag": 'code_generation-Salesforce/codegen-350M-nl',
            "task_priority_path": "Salesforce/codegen-350M-nl",
        },
        "Salesforce/codegen-2B-nl": {
            "task_class": CodeGenerationTask,
            "task_flag": 'code_generation-Salesforce/codegen-2B-nl',
            "task_priority_path": "Salesforce/codegen-2B-nl",
        },
        "Salesforce/codegen-6B-nl": {
            "task_class": CodeGenerationTask,
            "task_flag": 'code_generation-Salesforce/codegen-6B-nl',
            "task_priority_path": "Salesforce/codegen-6B-nl",
        },
        "Salesforce/codegen-350M-multi": {
            "task_class": CodeGenerationTask,
            "task_flag": 'code_generation-Salesforce/codegen-350M-multi',
            "task_priority_path": "Salesforce/codegen-350M-multi",
        },
        "Salesforce/codegen-2B-multi": {
            "task_class": CodeGenerationTask,
            "task_flag": 'code_generation-Salesforce/codegen-2B-multi',
            "task_priority_path": "Salesforce/codegen-2B-multi",
        },
        "Salesforce/codegen-6B-multi": {
            "task_class": CodeGenerationTask,
            "task_flag": 'code_generation-Salesforce/codegen-6B-multi',
            "task_priority_path": "Salesforce/codegen-6B-multi",
        },
    },
    "default": {
        "model": "Salesforce/codegen-350M-mono",
    }
}
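
The nine names in the listing above follow one pattern: a size (350M, 2B, 6B) combined with a variant (mono, nl, multi). A small sketch of that pattern, useful for checking a name before loading a large model. The helpers `codegen_model_names` and `resolve_model` are hypothetical and not part of PaddleNLP.

```python
from itertools import product

SIZES = ("350M", "2B", "6B")
VARIANTS = ("mono", "nl", "multi")
DEFAULT_MODEL = "Salesforce/codegen-350M-mono"


def codegen_model_names():
    """Build the nine model names as variant x size combinations."""
    return [f"Salesforce/codegen-{size}-{variant}"
            for variant, size in product(VARIANTS, SIZES)]


def resolve_model(name=None):
    """Return `name` if it is listed, or the default when name is None."""
    if name is None:
        return DEFAULT_MODEL
    if name not in codegen_model_names():
        raise ValueError(f"Unsupported CodeGen model: {name}")
    return name


print(resolve_model())
print(resolve_model("Salesforce/codegen-2B-multi"))
```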

CodeGen supports single and batch prediction

>>> from paddlenlp import Taskflow
>>> codegen = Taskflow("code_generation")
# Single input
>>> codegen("def hello_world():")
['\n    print("Hello World")']
# Batch input
>>> codegen(["Get the length of array", "def hello_world():"])
['\n    n = len(a)\n\n    #', '\n    print("Hello World!")']
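
In both calls above the result is a list with one completion per prompt. When wrapping a generator in your own code, one way to get the same convention is to normalize a single string into a one-element batch. The sketch below assumes a stand-in `fake_generate` instead of a real model, and `run_batch` is a hypothetical helper, not how Taskflow is implemented internally.

```python
from typing import Callable, List, Union


def run_batch(generate: Callable[[List[str]], List[str]],
              prompts: Union[str, List[str]]) -> List[str]:
    """Wrap a single prompt in a list so both input styles share one path."""
    batch = [prompts] if isinstance(prompts, str) else list(prompts)
    outputs = generate(batch)
    if len(outputs) != len(batch):
        raise RuntimeError("expected one completion per prompt")
    return outputs


def fake_generate(batch: List[str]) -> List[str]:
    """Stand-in for a real model: returns a fixed placeholder completion."""
    return ["\n    pass" for _ in batch]


print(run_batch(fake_generate, "def hello_world():"))
print(run_batch(fake_generate, ["def a():", "def b():"]))
```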

Using Different Pretrained Models with Taskflow

Let's look at Taskflow's parameter documentation:

class Taskflow(object):
    """
    The Taskflow is the end2end interface that could convert the raw text to model result, and decode the model result to task result. The main functions as follows:
        1) Convert the raw text to task result.
        2) Convert the model to the inference model.
        3) Offer the usage and help message.
    Args:
        task (str): The task name for the Taskflow, and get the task class from the name.
        model (str, optional): The model name in the task, if set None, will use the default model. 
        mode (str, optional): Select the mode of the task, only used in the tasks of word_segmentation and ner.
            If set None, will use the default mode.
        device_id (int, optional): The device id for the gpu, xpu and other devices, the default value is 0.
        kwargs (dict, optional): Additional keyword arguments passed along to the specific task. 
    """

    def __init__(self, task, model=None, mode=None, device_id=0, **kwargs):

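So when instantiating Taskflow, we only need to add the pretrained model name, followed by any other parameters as keyword arguments:

codegen = Taskflow("code_generation", "model_name", other_param=value)

For example: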

codegen = Taskflow("code_generation", "Salesforce/codegen-6B-mono", min_length=1024)
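
When trying several configurations, it can be convenient to collect the generation settings in a dictionary and pass them through `**kwargs`, as the `Taskflow` signature quoted above allows. The following is a sketch under that assumption. `build_generation_kwargs` and `make_codegen` are hypothetical helper names, and only `decode_strategy` is validated here, against the three strategies listed in the parameter section.

```python
ALLOWED_STRATEGIES = ("greedy_search", "beam_search", "sampling")


def build_generation_kwargs(decode_strategy="sampling", **extra):
    """Check the decoding strategy and collect settings into one dict."""
    if decode_strategy not in ALLOWED_STRATEGIES:
        raise ValueError(f"unknown decode_strategy: {decode_strategy!r}")
    return {"decode_strategy": decode_strategy, **extra}


def make_codegen(model_name, **generation_kwargs):
    """Create the Taskflow; paddlenlp is imported only when this is called."""
    from paddlenlp import Taskflow
    return Taskflow("code_generation", model=model_name, **generation_kwargs)


settings = build_generation_kwargs("beam_search", num_beams=4, max_length=64)
print(settings)
# With PaddleNLP installed, the settings could then be passed along:
# codegen = make_codegen("Salesforce/codegen-350M-mono", **settings)
```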

Let's Start Writing Code with AI

from paddlenlp import Taskflow
codegen = Taskflow("code_generation", min_length=256)

Let's start with something simple and generate a function that adds two numbers. The result looks great!

c1 = "Calculate the sum of two numbers"
p1 = codegen([c1])
print(c1)
for code in p1:
    print(code)
Calculate the sum of two numbers
.
    """

    def add(self, num1: int, num2: int) -> int:
        return num1 + num2
import os
from pathlib import Path
import sys
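
The printed result above ends with unrelated `import` lines after the function body. One simple way to drop such trailing text is to cut the completion at the first unindented line that follows its first line. This is a heuristic sketch only; `trim_completion` is a hypothetical helper, and the rule will not suit every prompt.

```python
def trim_completion(completion: str) -> str:
    """Cut a completion at the first unindented line after its first line."""
    lines = completion.split("\n")
    kept = lines[:1]  # the first line continues the prompt, so always keep it
    for line in lines[1:]:
        # A non-empty line starting in column 0 marks new top-level code.
        if line and not line[0].isspace():
            break
        kept.append(line)
    return "\n".join(kept).rstrip()


# The completion printed above, written out as a string.
raw = ('.\n    """\n\n    def add(self, num1: int, num2: int) -> int:\n'
       '        return num1 + num2\nimport os\nfrom pathlib import Path\nimport sys')
print(trim_completion(raw))
```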

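Next, let's raise the difficulty and try code continuation.

Taking LeetCode problem 1 as the example, we write part of the code and let the AI complete the rest.

Two Sum

Given an array of integers nums and an integer target, find the two integers in the array that add up to target and return their indices.

You may assume that each input has exactly one solution, and you may not use the same element twice in the answer.

You can return the answer in any order.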

c2 = "def twoSum(nums, target):\n    hashmap={}\n    for ind,num in enumerate(nums):\n        hashmap[num] = ind\n    for i,num in enumerate(nums):"
p2 = codegen([c2])
print(c2)
for code in p2:
    print(code)
def twoSum(nums, target):
    hashmap={}
    for ind,num in enumerate(nums):
        hashmap[num] = ind
    for i,num in enumerate(nums):

    if hashmap.get(target-num)!= None: return [i,hashmap.get(target-num)]
    else: return []

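Submitting the result for testing shows that the ordinary test cases pass. The one failure arises because the model does not know that the same element cannot be used twice, so the overall continuation logic otherwise looks sound.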
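
For comparison, here is a hand-corrected version of the same approach (written by hand, not model output). It keeps the hash map from the prompt and adds the check that the two indices differ, so an element is never paired with itself.

```python
from typing import List


def two_sum(nums: List[int], target: int) -> List[int]:
    """Return indices of two distinct elements that add up to target."""
    index_of = {}
    for ind, num in enumerate(nums):
        index_of[num] = ind  # later duplicates overwrite earlier indices
    for i, num in enumerate(nums):
        j = index_of.get(target - num)
        # Require j != i so the same element is never used twice.
        if j is not None and j != i:
            return [i, j]
    return []


print(two_sum([2, 7, 11, 15], 9))
print(two_sum([3, 2, 4], 6))
print(two_sum([3, 3], 6))
```

The second call is the kind of input that exposes the reuse problem: without the `j != i` check, index 0 would be paired with itself because 3 + 3 equals the target.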


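Now let CodeGen write code from a specification on its own

This again uses a LeetCode problem as the test. The first step is to pass in the problem description and the example cases.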

from paddlenlp import Taskflow
codegen = Taskflow("code_generation")
c3 = """
        # In a deck of cards, each card has an integer written on it.
        # Return true if and only if you can choose X >= 2 such that it is possible to split the entire deck into 1 or more groups of cards, where:
        # Each group has exactly X cards.
        # All the cards in each group have the same integer.
        #  
        # Example 1:
        # Input: deck = [1,2,3,4,4,3,2,1]
        # Output: true
        # Explanation: Possible partition [1,1],[2,2],[3,3],[4,4].
        # Example 2:
        # Input: deck = [1,1,1,2,2,2,3,3]
        # Output: false
        # Explanation: No possible partition.
        #  
        # Constraints:
        # 1 <= deck.length <= 104
        # 0 <= deck[i] < 104

        def hasGroupsSizeX(self, deck: List[int]) -> bool:
    """


p3 = codegen(c3)
print(c3)
for code in p3:
    print(code)
        # In a deck of cards, each card has an integer written on it.
        # Return true if and only if you can choose X >= 2 such that it is possible to split the entire deck into 1 or more groups of cards, where:
        # Each group has exactly X cards.
        # All the cards in each group have the same integer.
        #  
        # Example 1:
        # Input: deck = [1,2,3,4,4,3,2,1]
        # Output: true
        # Explanation: Possible partition [1,1],[2,2],[3,3],[4,4].
        # Example 2:
        # Input: deck = [1,1,1,2,2,2,3,3]
        # Output: false
        # Explanation: No possible partition.
        #  
        # Constraints:
        # 1 <= deck.length <= 104
        # 0 <= deck[i] < 104

        def hasGroupsSizeX(self, deck: List[int]) -> bool:
    

        if len(deck) == 1: return True

        for i in range(len(deck)-1): 
            curr_sum = self.getSum(deck[i],deck[-1]-1) + deck[i+1] - 1 # curr_sum = sum(deck[0:i])+sum(deck[-1:i+1])
            #

Next, append the output above to the input and generate the second half of the code.
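
This feed-the-output-back step can be written as a small loop. The sketch below uses a stand-in `fake_generate` in place of the model so that it runs without PaddleNLP; `extend_prompt` is a hypothetical helper. It simply appends each completion, whereas in this article the intermediate output is tidied by hand before being reused.

```python
from typing import Callable, List


def extend_prompt(generate: Callable[[str], List[str]], prompt: str,
                  rounds: int = 2) -> str:
    """Append each completion to the prompt and generate again."""
    text = prompt
    for _ in range(rounds):
        completion = generate(text)[0]
        if not completion.strip():
            break  # nothing new was produced, so stop early
        text += completion
    return text


def fake_generate(text: str) -> List[str]:
    """Stand-in for the model: emits one numbered placeholder line."""
    return [f"\n    step_{text.count('step_') + 1}()"]


print(extend_prompt(fake_generate, "def solve():", rounds=3))
```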

c4 = """
        # In a deck of cards, each card has an integer written on it.
        # Return true if and only if you can choose X >= 2 such that it is possible to split the entire deck into 1 or more groups of cards, where:
        # Each group has exactly X cards.
        # All the cards in each group have the same integer.
        #  
        # Example 1:
        # Input: deck = [1,2,3,4,4,3,2,1]
        # Output: true
        # Explanation: Possible partition [1,1],[2,2],[3,3],[4,4].
        # Example 2:
        # Input: deck = [1,1,1,2,2,2,3,3]
        # Output: false
        # Explanation: No possible partition.
        #  
        # Constraints:
        # 1 <= deck.length <= 104
        # 0 <= deck[i] < 104

        def hasGroupsSizeX(self, deck: List[int]) -> bool:
            if len(deck) == 1: return True

            for i in range(len(deck)-1): 
                curr_sum = self.getSum(deck[i],deck[-1]-1) + deck[i+1] - 1
    """


p4 = codegen(c4)
print(c4)
for code in p4:
    print(code)
        # In a deck of cards, each card has an integer written on it.
        # Return true if and only if you can choose X >= 2 such that it is possible to split the entire deck into 1 or more groups of cards, where:
        # Each group has exactly X cards.
        # All the cards in each group have the same integer.
        #  
        # Example 1:
        # Input: deck = [1,2,3,4,4,3,2,1]
        # Output: true
        # Explanation: Possible partition [1,1],[2,2],[3,3],[4,4].
        # Example 2:
        # Input: deck = [1,1,1,2,2,2,3,3]
        # Output: false
        # Explanation: No possible partition.
        #  
        # Constraints:
        # 1 <= deck.length <= 104
        # 0 <= deck[i] < 104

        def hasGroupsSizeX(self, deck: List[int]) -> bool:
            if len(deck) == 1: return True

            for i in range(len(deck)-1): 
                curr_sum = self.getSum(deck[i],deck[-1]-1) + deck[i+1] - 1
    

        return False

    """
    def hasGroupsSizeX(self, deck: List[int]) -> bool:
        sums = []
        for d in deck: sums.append(d)
        for j in range(len(sums)):
            for k in range(j+1,len(sums)):
                if sums[j]+sums[k] > sums[0]:
                    continue
                elif sums[j]+sums[k]==sums[0]: return False
        return Trueimport os
from typing import

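Finally, submitting the result for testing shows that it also passes some of the test cases.

So the generated code can still be of practical use!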
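
When judging a generated candidate, it helps to have a known-good reference to compare against. Below is a hand-written sketch for this problem (not model output): a valid group size X must divide the count of every distinct card, so the greatest common divisor of the counts must be at least 2. The name `has_groups_size_x` is chosen for this sketch.

```python
from collections import Counter
from functools import reduce
from math import gcd
from typing import List


def has_groups_size_x(deck: List[int]) -> bool:
    """True if the deck splits into equal groups of size X >= 2."""
    counts = Counter(deck).values()
    # A valid X must divide every count, so the gcd of the counts must be >= 2.
    return reduce(gcd, counts) >= 2


# The two examples quoted in the prompt above.
EXAMPLES = [
    ([1, 2, 3, 4, 4, 3, 2, 1], True),
    ([1, 1, 1, 2, 2, 2, 3, 3], False),
]

for deck, expected in EXAMPLES:
    print(deck, has_groups_size_x(deck) == expected)
```

Running a generated function over the same `EXAMPLES` list gives a quick local check before submitting it anywhere.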


Finally, let CodeGen try writing itself

We pass in part of the CodeGen model's own code and comments and see what it writes!

c5 = """
    class CodeGenerationTask(Task):
    '''
    The text generation model to predict the code. 
    Args:
        task(string): The name of task.
        model(string): The model name in the task.
        kwargs (dict, optional): Additional keyword arguments passed along to the specific task. 
    '''
        def __init__(self, task, model, **kwargs):
    """

p5 = codegen([c5])
print(c5)
for code in p5:
    print(code)
    class CodeGenerationTask(Task):
        '''
        The text generation model to predict the code. 
        Args:
            task(string): The name of task.
            model(string): The model name in the task.
            kwargs (dict, optional): Additional keyword arguments passed along to the specific task. 
        '''

    def __init__(self, task, model, **kwargs):
    

        super().__init__('CodeGeneration', task, model, kwargs)

    @staticmethod
    def get_default_config():
        return {
            'task': 'code_generation',
           'model': 'bert-base-uncased',
            'kwargs': None,
        }

    #

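3. Summary

CodeGen's ability to generate code purely from a description is still somewhat weak, but its continuation ability is quite good. CodeGen is therefore better suited as a tool that works alongside programmers: by repeatedly adjusting the input and fine-tuning, you can eventually arrive at the desired code.

We can also use and develop it more as a code completion or code suggestion tool, for example by building it into a plugin for VS Code or JetBrains IDEs.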
