
Background

With the growth of the internet, digital marketing has become one of the main channels through which businesses acquire potential customers and users discover the goods and services they want. Conversion-rate (CVR) estimation improves the precision of ad targeting in digital marketing and reduces the noise irrelevant ads create for users, making for a better information-service experience. This task provides a baseline model to help participants get started quickly and then work on improving model performance.

PS: this is also my first attempt at the data processing and demo for a recommender-system task. If you spot any problems, please leave a comment, or add me on WeChat (vx, same nickname) to chat.

The competition provides N days of sampled ad logs (already anonymized):

  • Training set: the first N-1 days of data, for training/tuning models
  • Test set: day N's data, not public

Task Description

This task provides a large volume of real data from Baidu's ad-serving scenario. Participants are expected to use machine learning to build a conversion-rate model that, given user and ad information, estimates the conversion probability, i.e. pcvr = p(cv=1 | user, ad).

Data Description

Data format
Each line is one ad click with 5 fields in total, separated by \t:

Field descriptions

  • log_id: sample id
  • t1: conversion type 1
  • t2: conversion type 2
  • t3: conversion type 3
  • features: user and ad information
Sample patterns:
(1) logid \t 1 \t - \t - \t features (conversion type 1, positive sample)
(2) logid \t - \t 1 \t - \t features (conversion type 2, positive sample)
(3) logid \t - \t - \t 1 \t features (conversion type 3, positive sample)
(4) logid \t 0 \t - \t - \t features (conversion type 1, negative sample)
(5) logid \t - \t 0 \t - \t features (conversion type 2, negative sample)
(6) logid \t - \t - \t 0 \t features (conversion type 3, negative sample)
example:
3601136181 - - 0 704:1 227:2 4630318:3 4630319:4 5:5 6:6 319:7 8:8 75:9 1715101:10 63328:10 412248:10 263656:10 4630320:10 150154:10 412250:10 63336:10 465:11 21:13 22:14 29442:15 24:16 630457:17 4630321:17 776254:17 4630322:17 325042:17 4630323:17 238029:17 325041:17 1690121:17 4630324:17 31148:18 3462866:18 174535:18 174536:18 675808:18 675809:18 131969:18 36517:18 4259795:18 616316:18 2085787:19 30093:20 31406:20 31407:20 63351:20 46:21 4630325:22 4630326:22 4630327:22 4630328:22 4285695:22 4630329:22 4630330:22 4630331:22 4630332:22 4630333:22 2431996:22 4630334:22 1037304:22 4630335:22 1984706:22 4630336:22 2645081:22 816199:22 4630337:22 4630338:22 2085788:23 3161:24 3742:25 150:26 601:26 182:26 62:26 184:26 185:26 345:26 155:26 156:26 1258:26 158:26 70:26
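
To make the format concrete, below is a minimal parsing sketch (parse_line is an illustrative helper of mine, not part of the baseline code). Note that feature pairs are written feasign:slot, i.e. the slot number comes second:

# illustrative: parse one log line into (log_id, labels, slot -> feasigns)
def parse_line(line):
    # fields: log_id \t t1 \t t2 \t t3 \t feature string
    log_id, t1, t2, t3, feat_str = line.rstrip("\n").split("\t")
    slots = {}
    for pair in feat_str.split(" "):
        feasign, slot = pair.split(":")  # e.g. "704:1" puts id 704 into slot 1
        slots.setdefault(int(slot), []).append(int(feasign))
    return int(log_id), (t1, t2, t3), slots

Parsed this way, the example line above is a negative sample of conversion type 3 (t3 = 0), and several slots (e.g. 10, 17, 18, 20, 22, 26) hold multiple ids.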

Feature Types

The feature field contains 26 distinct features; their information categories are as follows:

Slot groups

  • slots 1~13: basic user information
  • slots 14~16: context (scene) information
  • slots 17~24: ad information
  • slots 25~26: multi-valued user information

Note: the competition has a preliminary round and a final round. The preliminary test set is released with this project; the final-round dataset will be released when the final round begins.

All samples are ordered by time, and user and ad information has been anonymized into ids.

Using combined (cross) features and multi-objective modeling helps improve model quality; a sketch of one simple cross feature follows.
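
As an illustration only (none of this is in the official baseline, and the bucket count is an arbitrary assumption), a cross feature can be built by deterministically hashing a pair of slot values into a new id space:

# illustrative cross feature: hash a pair of feature ids into a new id space
import hashlib

def cross_feature(feasign_a, feasign_b, num_buckets=1000003):
    # md5 keeps the mapping stable across runs, unlike Python's built-in hash()
    key = f"{feasign_a}_{feasign_b}".encode()
    return int(hashlib.md5(key).hexdigest(), 16) % num_buckets

For multi-objective modeling, one option is to keep t1/t2/t3 as three separate labels (e.g. three output heads) instead of collapsing them into one binary label as the baseline reader below does.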

The final ranking is by overall AUC over the samples of all conversion types.

Scoring Criteria

Only submissions with 0.90 <= pcoc <= 1.10 enter the leaderboard. The primary ranking metric is AUC; when AUCs tie, entries are ordered by abs(pcoc - 1), ascending. Metric definitions:

auc (Area under curve): the area under the ROC curve; the closer to 1, the better.

pcoc (predicted cv over cv): predicted conversion rate / actual conversion rate; the closer to 1, the better.
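
Both metrics are straightforward to compute for local validation. A minimal sketch, assuming scikit-learn is available:

# illustrative metric computation for local validation
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(labels, preds):
    # auc: area under the ROC curve, closer to 1 is better
    auc = roc_auc_score(labels, preds)
    # pcoc: mean predicted CVR over observed CVR, closer to 1 is better
    pcoc = float(np.mean(preds)) / float(np.mean(labels))
    return auc, pcoc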

Approach

If you can't stand the rambling, skip straight to step 3 and launch the whole training run with one line!

  1. Data processing: parse the data (I simply and crudely concatenate all features into one vector, averaging multi-valued ones)
  2. Network: a DeepFM-style deep classifier (hand-written, not the official implementation; if anything looks off, discuss in the comments)
  3. Training
  4. Validation

1 Data Processing

# takes 45-50 s; only needs to run once
!unzip /home/aistudio/data/data213105/train_test.zip -d ./

Each record is turned into a simple, crude one-dimensional vector.

The exact logic is in:

queuedataset_reader.py
dataset.py

# queuedataset_reader.py

import paddle.distributed.fleet as fleet

class Reader(fleet.MultiSlotDataGenerator):
    def __init__(self):
        padding = 0
        # no dense features in this task; every slot is a sparse id slot
        sparse_slots = "log_key click 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26"
        self.sparse_slots = sparse_slots.strip().split(" ")
        self.slots = self.sparse_slots
        self.slot2index = {}
        self.visit = {}
        for i in range(len(self.slots)):
            self.slot2index[self.slots[i]] = i
            self.visit[self.slots[i]] = False
        self.padding = padding

    def line_process(self, line):
        # fields: log_id \t t1 \t t2 \t t3 \t features
        items = line.strip("\n").split("\t")
        log_key = int(items[0])
        conv1 = items[1]
        conv2 = items[2]
        conv3 = items[3]
        # collapse the three conversion types into one binary label
        conv = 1 if (conv1 == "1" or conv2 == "1" or conv3 == "1") else 0
        output = [(i, []) for i in self.slots]
        feasigns = items[4].split(" ")
        for i in feasigns:
            # pairs are "feasign:slot", e.g. "704:1" puts id 704 into slot 1
            slot_feasign = i.split(":")
            slot = slot_feasign[1]
            if slot not in self.slots:
                continue
            if slot in self.sparse_slots:
                feasign = int(slot_feasign[0])
            else:
                feasign = float(slot_feasign[0])
            output[self.slot2index[slot]][1].append(feasign)
            self.visit[slot] = True
        output[0][1].append(log_key)
        self.visit['log_key'] = True
        output[1][1].append(conv)
        self.visit['click'] = True
        for i in self.visit:
            slot = i
            if not self.visit[slot]:
                output[self.slot2index[i]][1].extend([self.padding])
            else:
                self.visit[slot] = False
                
        # flatten to one 1-D vector; multi-valued slots are collapsed by averaging their ids
        new_output = []
        for i in range(len(output)):
            if len(output[i][1]) > 1:
                new_output.append(int(sum(output[i][1]) / len(output[i][1])))
            else:
                new_output.append(output[i][1][0])

        return new_output
        
    def generate_sample(self, line):
        r"Dataset Generator"

        def reader():
            # yields one flat sample: [log_key, label, slot_1, ..., slot_26]
            sample = self.line_process(line)
            yield sample

        return reader
    
if __name__ == '__main__':
    reader = Reader()
    with open('train_data_small/file_01.txt') as f:
        for line in f:
            print(reader.line_process(line))
            break
# dataset.py
import collections
import paddle
import queuedataset_reader as reader
import random
from tqdm import tqdm

class MyDataset(paddle.io.Dataset):
    def __init__(self, file_paths, train_ratio=0.8):
        super(MyDataset, self).__init__()
        self.file_paths = file_paths
        
        self.reader = reader.Reader()
        self.total_samples = 0
        self.train_data = []
        self.val_data = []
        self.train_ratio = train_ratio
        self._load_data()

    def _load_data(self):
        for file_path in tqdm(self.file_paths, desc='load file and data', ncols=100):
            samples = self._load_file(file_path)
            self.total_samples += len(samples)
            self.train_data.extend(samples)

        self._split_dataset()

    def _load_file(self, file_path):
        samples = []
        with open(file_path) as f:
            for line in f:
                sample = self.reader.line_process(line)
                feature = sample[2:]
                label = sample[1]
                samples.append((feature, label))
        return samples
    
    
    def _split_dataset(self):
        num_train = int(self.total_samples * self.train_ratio)
        random.shuffle(self.train_data)
        
        # balance the classes so the train and val splits are not skewed
        class_samples = collections.defaultdict(list)
        for feature, label in self.train_data:
            class_samples[label].append((feature, label))
        num_classes = len(class_samples)
        num_samples_per_class = num_train // num_classes
        train_samples = []
        val_samples = []
        for class_label, samples in class_samples.items():
            num_samples = len(samples)
            num_train_samples = min(num_samples_per_class, num_samples)
            train_samples.extend(random.sample(samples, num_train_samples))
            val_samples.extend(samples[num_train_samples:])
        random.shuffle(train_samples)
        random.shuffle(val_samples)
        # note: val is capped so train + val together contain at most num_train samples
        self.train_data, self.val_data = train_samples, val_samples[:num_train - len(train_samples)]

    def __getitem__(self, idx):
        if idx < len(self.train_data):
            sample = self.train_data[idx]
        else:
            sample = self.val_data[idx - len(self.train_data)]
        return sample[0], sample[1]

    def __len__(self):
        # after the balanced split, train + val can hold fewer than the raw sample count
        return len(self.train_data) + len(self.val_data)


if __name__ == '__main__':
    
    file_paths = ['train_data/file_01.txt', 'train_data/file_02.txt']
    dataset = MyDataset(file_paths, train_ratio=0.8)
    print(len(dataset.train_data))
    print(len(dataset.val_data))
    train_loader = paddle.io.DataLoader(dataset.train_data, batch_size=32, shuffle=True, num_workers=0)
    val_loader = paddle.io.DataLoader(dataset.val_data, batch_size=32, shuffle=False, num_workers=0)

    for epoch in range(1):
        for batch_id, data in enumerate(train_loader()):
            x = data[0]
            y = data[1]
            # turn each [32] column into [32, 1], then concat into [32, 26]
            for i in range(len(x)):
                x[i] = paddle.unsqueeze(x[i], axis=1)
            x = paddle.concat(x, axis=1)
            
            print(x.shape)
            # reshape labels from [32] to [32, 1]
            y = paddle.unsqueeze(y, axis=1)
            print(y.shape)
            
            print(x)
            print(y)
            
            break

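One caveat on the demo above: depending on your Paddle version, paddle.io.DataLoader may require a paddle.io.Dataset subclass rather than a plain Python list. If the list form is rejected, a minimal wrapper (my own addition, not part of the original files) does the job:

# illustrative wrapper so DataLoader accepts a plain list of (feature, label) tuples
import paddle

class ListDataset(paddle.io.Dataset):
    def __init__(self, samples):
        super().__init__()
        self.samples = samples

    def __getitem__(self, idx):
        return self.samples[idx]

    def __len__(self):
        return len(self.samples)

# usage: paddle.io.DataLoader(ListDataset(dataset.train_data), batch_size=32, shuffle=True)
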
2 Model

This is just a demo that runs end to end, so I won't dig into the model itself; getting it running is the goal.

No claims about quality here either; this is a baseline meant to pull you through the whole project quickly.

Model tuning is up to you. There's prize money on the line, after all!

The model is in:

deepfm.py

# deepfm.py

import math
import paddle
import paddle.nn as nn


class FM(nn.Layer):
    def __init__(self, sparse_feature_number, sparse_feature_dim):
        super(FM, self).__init__()
        self.sparse_feature_number = sparse_feature_number
        self.sparse_feature_dim = sparse_feature_dim
        self.init_value_ = 0.1
        self.embedding = paddle.nn.Embedding(
            self.sparse_feature_number,
            self.sparse_feature_dim,
            sparse=True,
            weight_attr=paddle.ParamAttr(
                name="SparseFmFeatFactors",
                initializer=paddle.nn.initializer.TruncatedNormal(
                    mean=0.0,
                    std=self.init_value_ /
                    math.sqrt(float(self.sparse_feature_dim)))))
        
        self.bias = paddle.create_parameter(
            shape=[1],
            dtype='float32',
            default_initializer=paddle.nn.initializer.TruncatedNormal(
                mean=0.0,
                std=self.init_value_ /
                math.sqrt(float(self.sparse_feature_dim))))

    def forward(self, sparse_inputs):
        # emb: [batch, num_field, dim]
        emb = self.embedding(sparse_inputs.astype('int64'))
        # second-order FM term via the identity
        # sum_{i<j} <v_i, v_j> = 0.5 * (||sum_i v_i||^2 - sum_i ||v_i||^2)
        square_of_sum = paddle.square(paddle.sum(emb, axis=1))
        sum_of_square = paddle.sum(paddle.square(emb), axis=1)
        fm = 0.5 * paddle.sum(square_of_sum - sum_of_square, axis=1, keepdim=True)
        # this simplified FM keeps only the pairwise term plus a global bias;
        # the usual first-order linear term is omitted
        fm = fm + self.bias
        return fm

class DNN(nn.Layer):
    def __init__(self, num_field, layer_sizes, sparse_feature_number, sparse_feature_dim):
        super(DNN, self).__init__()
        self.num_field = num_field
        self.layer_sizes = layer_sizes
        self.sparse_feature_dim = sparse_feature_dim
        self.sparse_feature_number = sparse_feature_number
        self.init_value_ = 0.1
        
        self.embedding = paddle.nn.Embedding(
            self.sparse_feature_number,
            self.sparse_feature_dim,
            sparse=True,
            weight_attr=paddle.ParamAttr(
                name="SparseDnnFeatFactors",
                initializer=paddle.nn.initializer.TruncatedNormal(
                    mean=0.0,
                    std=self.init_value_ /
                    math.sqrt(float(self.sparse_feature_dim)))))

        sizes = [self.num_field * self.sparse_feature_dim] + self.layer_sizes
        acts = ["relu" for _ in range(len(self.layer_sizes))] + [None]
        self._mlp_layers = []
        for i in range(len(layer_sizes)):
            linear = paddle.nn.Linear(
                in_features=sizes[i],
                out_features=sizes[i + 1],
                weight_attr=paddle.ParamAttr(
                    initializer=paddle.nn.initializer.Normal(
                        std=1.0 / math.sqrt(sizes[i]))))
            self.add_sublayer('linear_%d' % i, linear)
            self._mlp_layers.append(linear)
            if acts[i] == 'relu':
                act = paddle.nn.ReLU()
                self.add_sublayer('act_%d' % i, act)
                self._mlp_layers.append(act)

    def forward(self, sparse_inputs):
        emb = self.embedding(sparse_inputs.astype('int64'))
        # flatten the per-field embeddings into one vector for the MLP
        y_dnn = paddle.reshape(emb, shape=[-1, self.num_field * self.sparse_feature_dim])
        for n_layer in self._mlp_layers:
            y_dnn = n_layer(y_dnn)
        return y_dnn

class DeepFM(nn.Layer):
    def __init__(self, num_field, layer_sizes, sparse_feature_number=88000000, sparse_feature_dim=8):
        super(DeepFM, self).__init__()
        self.num_field = num_field
        self.layer_sizes = layer_sizes
        self.sparse_feature_number = sparse_feature_number
        self.sparse_feature_dim = sparse_feature_dim

        self.fm = FM(self.sparse_feature_number, self.sparse_feature_dim)
        self.dnn = DNN(self.num_field, self.layer_sizes, self.sparse_feature_number, self.sparse_feature_dim)

        self.fc = paddle.nn.Linear(
            in_features=1 + self.layer_sizes[-1],
            out_features=1,
            weight_attr=paddle.ParamAttr(
                initializer=paddle.nn.initializer.Normal(std=1.0 / math.sqrt(1 + self.layer_sizes[-1]))))
        self.add_sublayer('fc', self.fc)

    def forward(self, sparse_inputs):
        
        fm_out = self.fm(sparse_inputs)
        dnn_out = self.dnn(sparse_inputs)
        concat_out = paddle.concat([fm_out, dnn_out], axis=1)
        y = self.fc(concat_out)
        
        pred = paddle.nn.functional.sigmoid(y)
        
        return pred
    
    
    
if __name__ == '__main__':
    
    model = DeepFM(26, [512, 256, 128, 32], sparse_feature_dim=9)
    
    paddle.summary(model, (100, 26))
    
    # data = [[159, 259, 4630199, 4630200, 163, 164, 165, 650, 167, 79129, 3135, 0, 652361, 169, 714906, 24, 0, 1520746, 1000438, 17249, 298, 0, 1000441, 3251, 1686, 181]]
    # data = paddle.to_tensor(data)
    
    # print(model(data))
   

3 Training

The training script trains with the DeepFM model defined above; feel free to improve it or swap in something else.

(A Tesla A100 32G instance is recommended.) Otherwise loading the data may run out of memory. That is fixable, but since the platform offers a 32G option, there is no need to agonize over it.

train.py

# one training run takes a long time; adjust the epoch count to your situation
!python train.py
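
train.py itself is not reproduced in this article. As a reference, here is a minimal sketch of what such a training loop could look like, built on the MyDataset and DeepFM classes above; the file list, epoch count, learning rate, and batch size are all placeholder assumptions:

# train.py -- illustrative sketch, not the original competition file
import paddle
import paddle.nn.functional as F

from dataset import MyDataset
from deepfm import DeepFM

def train(file_paths, epochs=1, lr=1e-3, batch_size=512):
    dataset = MyDataset(file_paths, train_ratio=0.8)
    # pass the raw list as in the dataset.py demo (wrap it in a Dataset subclass
    # if your Paddle version requires one; see the note after dataset.py)
    train_loader = paddle.io.DataLoader(dataset.train_data, batch_size=batch_size, shuffle=True)

    model = DeepFM(26, [512, 256, 128, 32])
    # lazy_mode pairs well with the sparse=True embeddings in the model
    optimizer = paddle.optimizer.Adam(learning_rate=lr, parameters=model.parameters(), lazy_mode=True)

    model.train()
    for epoch in range(epochs):
        for batch_id, (x, y) in enumerate(train_loader):
            # stack the 26 per-slot columns into a [batch, 26] tensor
            x = paddle.concat([paddle.unsqueeze(col, axis=1) for col in x], axis=1)
            y = paddle.unsqueeze(y, axis=1).astype('float32')
            pred = model(x)
            loss = F.binary_cross_entropy(pred, y)
            loss.backward()
            optimizer.step()
            optimizer.clear_grad()
            if batch_id % 100 == 0:
                print(f"epoch {epoch} step {batch_id} loss {float(loss):.4f}")

if __name__ == '__main__':
    train(['train_data/file_01.txt'])

Validation (step 4 of the approach) follows the same pattern: run model.eval() over dataset.val_data and feed the collected predictions and labels to the evaluate sketch from the scoring section.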

Project Summary

The project uses a fairly crude data-processing scheme and a hand-rolled, modified DeepFM network for training.

There is plenty of room for improvement; some suggestions:

  • 1. Optimize data loading ⭐️
  • 2. Optimize data processing ⭐️
  • 3. Optimize feature handling ⭐️⭐️
  • 4. Train with a more advanced model ⭐️⭐️⭐️
  • 5. Tune the optimizer and hyperparameters ⭐️

This article is a repost.
Original project link
