[PaddlePaddle Learning Competition: Personal Loan Default Prediction] September 1st-Place Solution

Built on the baseline with modifications; reaches an AUC score of 0.86+.


AI Studio · Published 2022-10-25 19:16:32

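PaddlePaddle Learning Competition: Personal Loan Default Prediction

Task Description

Use an existing batch of credit data, drawn from a customer group that differs slightly from the target group, to help build the risk-control model for the target business. The two datasets share many fields but very few users. Participants are expected to use transfer learning to capture the relationship between users' basic information and default behavior across the different businesses, and so predict user defaults for the new business.

Note: this learning competition requires participants to build an end-to-end deep learning model with PaddlePaddle 2.0 or later, so purely classical machine learning models such as tree models do not qualify.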

Data Description

The training set contains 10,000 records with the following columns:

'loan_id', 'user_id', 'total_loan', 'year_of_loan', 'interest',
'monthly_payment', 'class', 'employer_type', 'industry', 'work_year',
'house_exist', 'censor_status', 'issue_date', 'use', 'post_code',
'region', 'debt_loan_ratio', 'del_in_18month', 'scoring_low',
'scoring_high', 'known_outstanding_loan', 'known_dero',
'pub_dero_bankrup', 'recircle_b', 'recircle_u', 'initial_list_status',
'app_type', 'earlies_credit_mon', 'title', 'policy_code', 'f0', 'f1',
'f2', 'f3', 'f4', 'early_return', 'early_return_amount',
'early_return_amount_3mon', 'isDefault'

'isDefault' is the classification target. Of the other columns, nearly one third are categorical and nearly two thirds are numeric; a few carry time information.

Only 1,683 of the 10,000 records are positive samples (defaults); the remaining 8,317 are negative, so the classes are imbalanced.
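
To make the imbalance concrete, the short sketch below computes the positive rate and the inverse-frequency ("balanced") class weights from the counts quoted above. The balanced-weight formula n / (k * n_c) is a common heuristic and an assumption here; it is not what this solution uses (the solution hand-picks 0.2 for negatives and 1.2 for positives), so it only serves as a point of comparison.

```python
# Class counts quoted in the data description
n_pos, n_neg = 1683, 8317
n_total = n_pos + n_neg

pos_rate = n_pos / n_total          # share of defaults
imbalance_ratio = n_neg / n_pos     # negatives per positive

# Inverse-frequency weights (a common heuristic, not the author's choice):
# weight_c = n_total / (n_classes * n_c)
w_neg = n_total / (2 * n_neg)
w_pos = n_total / (2 * n_pos)

# Ratio of the hand-picked weights used in this solution
author_ratio = 1.2 / 0.2

print(f"positive rate: {pos_rate:.4f}")
print(f"negatives per positive: {imbalance_ratio:.2f}")
print(f"balanced weights: neg={w_neg:.3f}, pos={w_pos:.3f}")
print(f"hand-picked weight ratio (pos/neg): {author_ratio:.1f}")
```

The hand-picked ratio of 6 is somewhat larger than the roughly 4.9 negatives per positive, so it pushes the loss slightly further toward the default class than a purely frequency-based weighting would.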

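Solution Overview

This solution mainly follows the officially recommended featured project: https://aistudio.baidu.com/aistudio/projectdetail/3555696

To handle class imbalance, the loss function is weighted: negative samples get weight 0.2 and positive samples get weight 1.2.
Continuous numeric variables are standardized (subtract the mean, divide by the standard deviation).
Categorical variables are re-encoded inside the network (an extra fully connected layer stands in for an embedding).
A few variables get simple transformations and feature engineering.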
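
The class-weighted loss in the first point can be sketched in plain Python as follows. This is a minimal illustration, not the Paddle implementation: the function name `weighted_cross_entropy` and the toy predictions are hypothetical, it takes probabilities rather than logits, and it assumes the weighted loss is normalized by the sum of the per-sample weights.

```python
import math

def weighted_cross_entropy(probs, labels, class_weights):
    """Weighted mean of -log p[label], normalized by the sum of the sample weights."""
    total, weight_sum = 0.0, 0.0
    for p, y in zip(probs, labels):
        w = class_weights[y]           # weight of this sample's true class
        total += -w * math.log(p[y])   # weighted negative log-likelihood
        weight_sum += w
    return total / weight_sum

# Toy predictions (hypothetical): two non-defaults, two defaults
probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
labels = [0, 0, 1, 1]

unweighted = weighted_cross_entropy(probs, labels, [1.0, 1.0])
weighted = weighted_cross_entropy(probs, labels, [0.2, 1.2])
print(f"unweighted: {unweighted:.4f}, weighted: {weighted:.4f}")
```

With weights 0.2 and 1.2, the poorly predicted default in the last row dominates the loss, which is the intended effect on an imbalanced dataset.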

Code

# Load the data
import pandas as pd
import re

train_df=pd.read_csv('data/data130186/train_public.csv')
test_df=pd.read_csv('data/data130187/test_public.csv')

# Fix the random seeds
import random
import numpy as np
import paddle

seed = 1024
random.seed(seed)
np.random.seed(seed)
paddle.seed(seed)

<paddle.fluid.core_avx.Generator at 0x7fdc2798c5f0>

# Check which columns contain missing values
train_df.isnull().any()

loan_id                     False
user_id                     False
total_loan                  False
year_of_loan                False
interest                    False
monthly_payment             False
class                       False
employer_type               False
industry                    False
work_year                    True
house_exist                 False
censor_status               False
issue_date                  False
use                         False
post_code                   False
region                      False
debt_loan_ratio             False
del_in_18month              False
scoring_low                 False
scoring_high                False
known_outstanding_loan      False
known_dero                  False
pub_dero_bankrup             True
recircle_b                  False
recircle_u                  False
initial_list_status         False
app_type                    False
earlies_credit_mon          False
title                       False
policy_code                 False
f0                           True
f1                           True
f2                           True
f3                           True
f4                           True
early_return                False
early_return_amount         False
early_return_amount_3mon    False
isDefault                   False
dtype: bool

# Construct a new feature: loan amount times interest rate
train_df['risk'] = train_df['total_loan'] * train_df['interest']
test_df['risk'] = test_df['total_loan'] * test_df['interest']

# Encode class (letters 'A' to 'G') as integers 0 to 6
class_dic = {chr(ord('A') + i): i for i in range(7)}
train_df['class'] = train_df['class'].apply(lambda x: class_dic[x])
test_df['class'] = test_df['class'].apply(lambda x: class_dic[x])

# Rebuild region as a binary flag: 1 for regions whose default rate in the training set exceeds 0.2, else 0
region_des = train_df.groupby(['region'])['isDefault'].describe()
default_regions = region_des[region_des['mean'] > 0.2].index.tolist()
train_df['region'] = train_df['region'].apply(lambda x: 1 if x in default_regions else 0)
test_df['region'] = test_df['region'].apply(lambda x: 1 if x in default_regions else 0)

# Encode work_year as a number
work_year_dict = {
    '< 1 year': 0,
    '1 year': 1,
    '2 years': 2,
    '3 years': 3,
    '4 years': 4,
    '5 years': 5,
    '6 years': 6,
    '7 years': 7,
    '8 years': 8,
    '9 years': 9,
    '10+ years': 10,
}

train_df['work_year'] = train_df['work_year'].map(work_year_dict)
test_df['work_year'] = test_df['work_year'].map(work_year_dict)

# Process earlies_credit_mon: extract the month name and the numeric part

month_dict = {'Dec': 12,
         'Apr': 4,
         'Oct': 10,
         'Jun': 6,
         'May': 5,
         'Jan': 1,
         'Nov': 11,
         'Jul': 7,
         'Aug': 8,
         'Feb': 2,
         'Sep': 9,
         'Mar': 3}

train_df['early_mon'] = train_df['earlies_credit_mon'].apply(lambda x: month_dict[re.findall('[a-zA-Z]+', x)[0]])
test_df['early_mon'] = test_df['earlies_credit_mon'].apply(lambda x: month_dict[re.findall('[a-zA-Z]+', x)[0]])

# int() replaces eval(): it is safer and also accepts digit strings with leading zeros
train_df['early_day'] = train_df['earlies_credit_mon'].apply(lambda x: int(re.findall('[0-9]+', x)[0]))
test_df['early_day'] = test_df['earlies_credit_mon'].apply(lambda x: int(re.findall('[0-9]+', x)[0]))

# Build one interaction feature from the f-series
train_df['f0xf3'] = train_df['f0'] * train_df['f3']
test_df['f0xf3'] = test_df['f0'] * test_df['f3']

# Fill missing values with the column mean

train_df['pub_dero_bankrup'] = train_df['pub_dero_bankrup'].fillna(train_df['pub_dero_bankrup'].mean())
test_df['pub_dero_bankrup'] = test_df['pub_dero_bankrup'].fillna(test_df['pub_dero_bankrup'].mean())

'''
The following feature constructions did not help:

# Rebuild the f-series features as pairwise products
f_ls = ['f0', 'f1', 'f2', 'f3', 'f4']

for i in f_ls:
    for j in f_ls[f_ls.index(i)+1:]:
        train_df['%sx%s'%(i,j)] = train_df[i] * train_df[j]
        test_df['%sx%s'%(i,j)] = test_df[i] * test_df[j]

# Rebuild use as a binary flag (default rate above 0.2 or not)
use_des = train_df.groupby(['use'])['isDefault'].describe()
default_uses = use_des[use_des['mean'] > 0.2].index.tolist()
train_df['use'] = train_df['use'].apply(lambda x: 1 if x in default_uses else 0)
test_df['use'] = test_df['use'].apply(lambda x: 1 if x in default_uses else 0)

# Rebuild post_code as a binary flag (default rate above 0.2 or not)
post_code_des = train_df.groupby(['post_code'])['isDefault'].describe()
default_post_codes = post_code_des[post_code_des['mean'] > 0.2].index.tolist()
train_df['post_code'] = train_df['post_code'].apply(lambda x: 1 if x in default_post_codes else 0)
test_df['post_code'] = test_df['post_code'].apply(lambda x: 1 if x in default_post_codes else 0)
'''

train_df.to_csv('train_df.csv', index=False)
test_df.to_csv('test_df.csv', index=False)
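
Two of the transformations in this cell can be sketched without pandas: flagging groups by their default rate in the training data, and splitting a month string into its month name and numeric part. The helper names, the toy data, and the example string 'Dec-2001' are hypothetical; the real format of earlies_credit_mon is not shown in this post, so the parser only assumes that the value contains one alphabetic run and one digit run.

```python
import re
from collections import defaultdict

MONTHS = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
          'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}

def high_default_groups(groups, labels, threshold=0.2):
    """Return the group keys whose mean label (default rate) exceeds the threshold."""
    sums, counts = defaultdict(int), defaultdict(int)
    for g, y in zip(groups, labels):
        sums[g] += y
        counts[g] += 1
    return {g for g in counts if sums[g] / counts[g] > threshold}

def parse_month_and_number(text):
    """Extract (month number, integer) from a string such as 'Dec-2001'."""
    month = MONTHS[re.findall(r'[a-zA-Z]+', text)[0]]
    number = int(re.findall(r'[0-9]+', text)[0])  # int() also handles leading zeros
    return month, number

# Toy training data (hypothetical): region ids and default labels
regions = [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]
defaults = [0, 0, 0, 1, 0, 0, 0, 0, 1, 0]

flagged = high_default_groups(regions, defaults)
region_flag = [1 if r in flagged else 0 for r in regions]
print(sorted(flagged), region_flag)
print(parse_month_and_number('Dec-2001'), parse_month_and_number('Mar-08'))
```

As in the cell above, the set of flagged groups is derived from the training labels only and then applied unchanged to the test set.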

train_df['isDefault'].value_counts()

Counter({0: 8317, 1: 1683})

# Per-class summary statistics of the numeric features
pos = train_df[train_df['isDefault']==1].describe()
pos_des = pd.DataFrame(pos.values.T, index=pos.columns, columns=pos.index)  # transpose
pos_des = pos_des['mean']

neg = train_df[train_df['isDefault']==0].describe()
neg_des = pd.DataFrame(neg.values.T, index=neg.columns, columns=neg.index)
neg_des = neg_des['mean']

# Screen numeric features: keep those whose class means differ by at least 1%
# relative to the negative-class mean
des = pd.DataFrame({'pos': pos_des, 'neg': neg_des})
des['diff'] = (des['pos'] - des['neg'])/des['neg']
des = des.sort_values(by=['diff'])
# After sorting, the last row is isDefault (diff = inf); [:-1] drops it
num_item = des[abs(des['diff'])>=0.01].index.tolist()[:-1]

# Variables treated as categorical
un_num_item = ['early_mon', 'work_year', 'region', 'employer_type','industry', 'house_exist', 'censor_status',
        'use',
        'initial_list_status','app_type',
        'policy_code']

# Remove the categorical variables from the numeric feature list
for i in un_num_item:
    if i in num_item:
        num_item.remove(i)

len(num_item)
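
The screening rule above can be checked by hand on a few rows. The sketch below recomputes the relative difference for four features, using class means copied from the `des` table printed in this post, and applies the same 1% threshold; the helper name `relative_diff` is hypothetical.

```python
# (mean among defaults, mean among non-defaults), copied from the des table
class_means = {
    'interest': (15.32975, 12.79642),
    'f2': (8.537608, 8.453426),
    'policy_code': (1.0, 1.0),
    'early_return': (0.1663696, 1.518095),
}

def relative_diff(pos_mean, neg_mean):
    """Relative difference of the class means, measured against the negative class."""
    return (pos_mean - neg_mean) / neg_mean

diffs = {name: relative_diff(p, n) for name, (p, n) in class_means.items()}
selected = [name for name, d in diffs.items() if abs(d) >= 0.01]

for name, d in diffs.items():
    print(f"{name:>14}: {d:+.6f}")
print("kept:", selected)
```

Here f2 (about +0.00996) and policy_code (0) fall below the threshold and are dropped, while interest and early_return are kept. Note that the ratio is undefined when the negative-class mean is 0, which is why isDefault shows up as inf in the table.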

Building the dataset

import paddle.vision.transforms as T
from PIL import Image

class MyDateset(paddle.io.Dataset):
    # csv_dir: path of the data to read
    # standard_csv_dir: path of the file used to compute the means and standard deviations for normalization
    def __init__(self,csv_dir,standard_csv_dir='train_df.csv',mode = 'train'):
        super(MyDateset, self).__init__()

        # Read the data
        self.df = pd.read_csv(csv_dir)
        
        # Compute the mean and standard deviation of each numeric column
        # (numeric_only=True skips string columns, which newer pandas versions no longer drop silently)
        st_df = pd.read_csv(standard_csv_dir)
        self.mean_df = st_df.mean(numeric_only=True)
        self.std_df = st_df.std(numeric_only=True)

        # Numeric variables and categorical variables; all other columns are unused
        self.num_item = num_item
        self.un_num_item = un_num_item

        # Build a lookup table that maps each categorical value (number or string) to an index
        # Note: the order of list(set(...)) for strings can differ between Python processes
        un_num_item_list = {}
        for item in self.un_num_item:
            un_num_item_list[item]=list(set(st_df[item].values))
        self.un_num_item_list = un_num_item_list

        self.mode = mode

    def __getitem__(self, index):
        data=[]

        # Standardize; a missing value is treated as a raw 0 before standardization
        for item in self.num_item:
            if np.isnan(self.df[item][index]):
                data.append((0-self.mean_df[item])/self.std_df[item])
            else:
                data.append((self.df[item][index]-self.mean_df[item])/self.std_df[item])
        
        emb_data = []

        # Map each categorical value to its index; unseen values map to -1
        for item in self.un_num_item:
            try:
                if self.df[item][index] not in self.un_num_item_list[item]:
                    emb_data.append(-1)
                else:
                    emb_data.append(self.un_num_item_list[item].index(self.df[item][index]))
            except Exception:
                emb_data.append(-1)

        data = paddle.to_tensor(data).astype('float32')
        emb_data = paddle.to_tensor(emb_data).astype('float32')

        # In train mode return the label; otherwise return the loan_id,
        # so that each prediction can be matched to its record
        if self.mode == 'train':
            label = self.df['isDefault'][index]
        else:
            label = self.df['loan_id'][index]

        label = np.array(label).astype('int64')
        return data,emb_data,label

    def __len__(self):
        return len(self.df)

dataset=MyDateset('train_df.csv')
[data,emb_data,label] = dataset[0]
print(data.shape)
print(emb_data.shape)
print(label)

print('\n')
print(num_item)

[26]
[11]
0


['early_return_amount_3mon', 'early_return_amount', 'early_return', 'f1', 'early_day', 'title', 'recircle_b', 'scoring_low', 'scoring_high', 'f4', 'monthly_payment', 'post_code', 'total_loan', 'f3', 'recircle_u', 'year_of_loan', 'known_outstanding_loan', 'del_in_18month', 'f0', 'debt_loan_ratio', 'interest', 'f0xf3', 'risk', 'pub_dero_bankrup', 'known_dero', 'class']
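
The normalization in `__getitem__` treats a missing value as a raw 0 and then standardizes it, which is not the same as imputing the column mean. The sketch below shows the difference on made-up statistics; the function name `zscore` and the numbers are hypothetical.

```python
import math

def zscore(value, mean, std):
    """Standardize a value; a missing value is replaced by a raw 0 first,
    mirroring the behavior of MyDateset.__getitem__."""
    if value is None or math.isnan(value):
        value = 0.0
    return (value - mean) / std

mean, std = 8.0, 2.0  # hypothetical column statistics

observed = zscore(10.0, mean, std)          # one standard deviation above the mean
missing = zscore(float('nan'), mean, std)   # raw 0, i.e. four standard deviations below
print(observed, missing)
```

For a column whose mean is far from 0, a missing entry therefore ends up as an extreme standardized value. Imputing the mean instead would map it to 0 after standardization; whether that helps on this dataset is untested here.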

# Inspect the per-class feature means
des

	pos	neg	diff
early_return_amount_3mon	2.803791e+01	3.973950e+02	-0.929446
early_return_amount	2.054415e+02	2.572250e+03	-0.920132
early_return	1.663696e-01	1.518095e+00	-0.890409
f1	1.303781e-03	1.445846e-03	-0.098258
early_day	4.226144e+01	4.620753e+01	-0.085399
title	1.687819e+03	1.832562e+03	-0.078984
use	1.645870e+00	1.786221e+00	-0.078574
recircle_b	1.550428e+04	1.675956e+04	-0.074900
work_year	5.919503e+00	6.091975e+00	-0.028311
scoring_low	6.548115e+02	6.659983e+02	-0.016797
scoring_high	7.638845e+02	7.765860e+02	-0.016356
user_id	2.251468e+05	2.252223e+05	-0.000335
loan_id	1.025147e+06	1.025222e+06	-0.000074
policy_code	1.000000e+00	1.000000e+00	0.000000
early_mon	6.733214e+00	6.732596e+00	0.000092
f2	8.537608e+00	8.453426e+00	0.009958
f4	8.179408e+00	8.081345e+00	0.012134
monthly_payment	4.421506e+02	4.359102e+02	0.014316
post_code	2.607600e+02	2.568633e+02	0.015170
total_loan	1.471829e+04	1.433815e+04	0.026513
initial_list_status	4.236482e-01	4.121678e-01	0.027854
f3	1.506535e+01	1.457538e+01	0.033616
recircle_u	5.563608e+01	5.321519e+01	0.045493
year_of_loan	3.721331e+00	3.430684e+00	0.084720
app_type	2.139037e-02	1.971865e-02	0.084779
known_outstanding_loan	1.248544e+01	1.147493e+01	0.088063
del_in_18month	3.428402e-01	3.052783e-01	0.123041
f0	6.338471e+00	5.557107e+00	0.140606
censor_status	1.142008e+00	9.888181e-01	0.154923
house_exist	6.922163e-01	5.960082e-01	0.161421
debt_loan_ratio	2.008696e+01	1.701519e+01	0.180531
interest	1.532975e+01	1.279642e+01	0.197971
f0xf3	1.163089e+02	9.522652e+01	0.221392
risk	2.344960e+05	1.886230e+05	0.243199
pub_dero_bankrup	1.759590e-01	1.315179e-01	0.337910
known_dero	2.953060e-01	2.124564e-01	0.389960
class	2.352941e+00	1.624985e+00	0.447977
region	1.556744e-01	1.055669e-01	0.474651
isDefault	1.000000e+00	0.000000e+00	inf

Building the network

For the categorical variables, two fully connected layers generate the embedding.

class MyNet(paddle.nn.Layer):
    def __init__(self):
        super(MyNet,self).__init__()
        self.fc = paddle.nn.Linear(in_features=len(num_item), out_features=512)
        self.emb1 = paddle.nn.Linear(in_features=len(un_num_item),out_features=2048)
        self.emb2 = paddle.nn.Linear(in_features=2048,out_features=512)
        self.emb3 = paddle.nn.Linear(in_features=1024,out_features=512)
        self.dropout = paddle.nn.Dropout(0.1)
        self.relu = paddle.nn.ReLU()
        self.out = paddle.nn.Linear(in_features=512,out_features=2)

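    def forward(self,data,emb_data):
        # Numeric branch: [batch, len(num_item)] -> [batch, 512]
        x = self.fc(data)
        x = self.dropout(x)

        # Categorical branch: [batch, len(un_num_item)] -> 2048 -> 512
        emb = self.emb1(emb_data)
        emb = self.emb2(emb)

        # Concatenate both branches: [batch, 1024] -> 512 -> 2
        x = paddle.concat([x,emb],axis=-1)
        x = self.relu(x)
        x = self.emb3(x)
        x = self.out(x)
        
        # Return class probabilities; column 1 is the predicted default probability
        x = paddle.nn.functional.softmax(x)
        return x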
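
In MyNet the category indices are cast to float32 and passed through a Linear layer, which approximates an embedding but is not a lookup table. The plain-Python sketch below contrasts the two for a single categorical variable; all function names, weights, and table values are hypothetical. A true lookup (for example via `paddle.nn.Embedding`) is equivalent to a Linear layer applied to a one-hot vector, whereas a Linear layer applied to the raw index places all categories on one straight line.

```python
def linear_on_index(index, weight, bias):
    """A Linear layer with one input feature: output_j = index * weight_j + bias_j."""
    return [index * w + b for w, b in zip(weight, bias)]

def embedding_lookup(index, table):
    """An embedding layer: return the row of the table for this category."""
    return table[index]

def one_hot_linear(index, table):
    """A bias-free Linear layer applied to a one-hot vector; equals the lookup."""
    one_hot = [1.0 if i == index else 0.0 for i in range(len(table))]
    dim = len(table[0])
    return [sum(one_hot[i] * table[i][j] for i in range(len(table))) for j in range(dim)]

weight, bias = [0.5, -1.0], [0.0, 2.0]
outs = [linear_on_index(i, weight, bias) for i in range(3)]
# Consecutive categories are separated by the same step: they lie on a line
step_01 = [b - a for a, b in zip(outs[0], outs[1])]
step_12 = [b - a for a, b in zip(outs[1], outs[2])]

table = [[0.1, 0.2], [0.3, -0.4], [0.9, 0.0]]
print(outs, step_01, step_12)
print(embedding_lookup(1, table), one_hot_linear(1, table))
```

The linear variant imposes an ordering on the categories, which is reasonable for ordinal fields such as work_year but arbitrary for fields such as industry. The solution as published uses the linear variant, and the unknown-value code -1 would need its own row if a lookup table were used instead.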

Training

# Build the data loader; you can try a larger batch size, in which case the learning rate should be increased accordingly
train_dataset=MyDateset('train_df.csv')
batch_size = 2048
train_dataloader = paddle.io.DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    drop_last=False)

# Build the model
model = MyNet()

# model_dict = paddle.load('model.pdparams')
# model.set_dict(model_dict)

model.train()

max_epoch=17
opt = paddle.optimizer.SGD(learning_rate=0.1, parameters=model.parameters())

# Training loop
now_step=0
for epoch in range(max_epoch):
    for step, data in enumerate(train_dataloader):
        now_step+=1

        data,emb_data, label = data
        pre = model(data,emb_data)
        # Class-weighted cross-entropy: weight 0.2 for class 0 (no default), 1.2 for class 1 (default)
        # Note: pre is already a softmax output, and cross_entropy applies softmax again by default (use_softmax=True)
        loss = paddle.nn.functional.cross_entropy(pre,label,weight=paddle.to_tensor([0.2,1.2]),reduction='mean')
        # loss = paddle.nn.functional.square_error_cost(pre,label.reshape([-1,1]).astype('float32'))
        # loss = paddle.mean(loss)
        loss.backward()
        opt.step()
        opt.clear_gradients()
        if now_step%1==0:
            print("epoch: {}, batch: {}, loss is: {}".format(epoch, step, loss.mean().numpy()))

# Save the model to model.pdparams
paddle.save(model.state_dict(), 'model.pdparams')

/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/math_op_patch.py:253: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float32, but right dtype is paddle.bool, the right dtype will convert to paddle.float32
  format(lhs_dtype, rhs_dtype, lhs_dtype))


epoch: 0, batch: 0, loss is: [0.72932494]
epoch: 0, batch: 1, loss is: [0.72809494]
epoch: 0, batch: 2, loss is: [0.67359775]
epoch: 0, batch: 3, loss is: [0.68661183]
epoch: 0, batch: 4, loss is: [0.75169533]
epoch: 1, batch: 0, loss is: [0.7372106]
epoch: 1, batch: 1, loss is: [0.6646998]
epoch: 1, batch: 2, loss is: [0.652986]
epoch: 1, batch: 3, loss is: [0.6788878]
epoch: 1, batch: 4, loss is: [0.6282849]
epoch: 2, batch: 0, loss is: [0.6557647]
epoch: 2, batch: 1, loss is: [0.6336642]
epoch: 2, batch: 2, loss is: [0.63801914]
epoch: 2, batch: 3, loss is: [0.6108661]
epoch: 2, batch: 4, loss is: [0.60893005]
epoch: 3, batch: 0, loss is: [0.5925074]
epoch: 3, batch: 1, loss is: [0.5883472]
epoch: 3, batch: 2, loss is: [0.58365494]
epoch: 3, batch: 3, loss is: [0.59151757]
epoch: 3, batch: 4, loss is: [0.57692784]
epoch: 4, batch: 0, loss is: [0.58431715]
epoch: 4, batch: 1, loss is: [0.56982553]
epoch: 4, batch: 2, loss is: [0.5645007]
epoch: 4, batch: 3, loss is: [0.57651067]
epoch: 4, batch: 4, loss is: [0.56192523]
epoch: 5, batch: 0, loss is: [0.5654892]
epoch: 5, batch: 1, loss is: [0.56473136]
epoch: 5, batch: 2, loss is: [0.5650625]
epoch: 5, batch: 3, loss is: [0.5576684]
epoch: 5, batch: 4, loss is: [0.5584635]
epoch: 6, batch: 0, loss is: [0.5546932]
epoch: 6, batch: 1, loss is: [0.55603355]
epoch: 6, batch: 2, loss is: [0.5493836]
epoch: 6, batch: 3, loss is: [0.5532869]
epoch: 6, batch: 4, loss is: [0.55402416]
epoch: 7, batch: 0, loss is: [0.5511932]
epoch: 7, batch: 1, loss is: [0.5557276]
epoch: 7, batch: 2, loss is: [0.55160713]
epoch: 7, batch: 3, loss is: [0.53901863]
epoch: 7, batch: 4, loss is: [0.5432972]
epoch: 8, batch: 0, loss is: [0.53537405]
epoch: 8, batch: 1, loss is: [0.52480084]
epoch: 8, batch: 2, loss is: [0.54991305]
epoch: 8, batch: 3, loss is: [0.5552637]
epoch: 8, batch: 4, loss is: [0.5439552]
epoch: 9, batch: 0, loss is: [0.53486353]
epoch: 9, batch: 1, loss is: [0.54527813]
epoch: 9, batch: 2, loss is: [0.5333212]
epoch: 9, batch: 3, loss is: [0.5366559]
epoch: 9, batch: 4, loss is: [0.53380215]
epoch: 10, batch: 0, loss is: [0.5307571]
epoch: 10, batch: 1, loss is: [0.5307473]
epoch: 10, batch: 2, loss is: [0.5341663]
epoch: 10, batch: 3, loss is: [0.53598076]
epoch: 10, batch: 4, loss is: [0.533747]
epoch: 11, batch: 0, loss is: [0.5351857]
epoch: 11, batch: 1, loss is: [0.5374662]
epoch: 11, batch: 2, loss is: [0.53001803]
epoch: 11, batch: 3, loss is: [0.5292998]
epoch: 11, batch: 4, loss is: [0.520213]
epoch: 12, batch: 0, loss is: [0.5229189]
epoch: 12, batch: 1, loss is: [0.53077817]
epoch: 12, batch: 2, loss is: [0.5234481]
epoch: 12, batch: 3, loss is: [0.52863]
epoch: 12, batch: 4, loss is: [0.5292127]
epoch: 13, batch: 0, loss is: [0.53497964]
epoch: 13, batch: 1, loss is: [0.5193575]
epoch: 13, batch: 2, loss is: [0.52314794]
epoch: 13, batch: 3, loss is: [0.5276453]
epoch: 13, batch: 4, loss is: [0.51771814]
epoch: 14, batch: 0, loss is: [0.50922066]
epoch: 14, batch: 1, loss is: [0.5228251]
epoch: 14, batch: 2, loss is: [0.5262174]
epoch: 14, batch: 3, loss is: [0.52728355]
epoch: 14, batch: 4, loss is: [0.5314889]
epoch: 15, batch: 0, loss is: [0.5179457]
epoch: 15, batch: 1, loss is: [0.53258806]
epoch: 15, batch: 2, loss is: [0.51814485]
epoch: 15, batch: 3, loss is: [0.5144637]
epoch: 15, batch: 4, loss is: [0.5230428]
epoch: 16, batch: 0, loss is: [0.528043]
epoch: 16, batch: 1, loss is: [0.5009919]
epoch: 16, batch: 2, loss is: [0.5170122]
epoch: 16, batch: 3, loss is: [0.5321321]
epoch: 16, batch: 4, loss is: [0.5218459]
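
The log above starts near 0.73 and flattens out around 0.52. One possible reason the values stay in this narrow band: MyNet.forward already returns softmax probabilities, and the training cell calls `paddle.nn.functional.cross_entropy` without `use_softmax`, which, assuming the Paddle 2.x default of `use_softmax=True`, applies softmax a second time. The sketch below computes the resulting per-sample loss bounds for two classes; the function name is hypothetical and the analysis is an interpretation, not something stated by the author.

```python
import math

def ce_after_second_softmax(p_true):
    """Per-sample cross-entropy when softmax is applied again to the
    probability pair (p_true, 1 - p_true)."""
    e_true = math.exp(p_true)
    e_other = math.exp(1.0 - p_true)
    return -math.log(e_true / (e_true + e_other))

best = ce_after_second_softmax(1.0)    # fully confident and correct
chance = ce_after_second_softmax(0.5)  # uninformative prediction
worst = ce_after_second_softmax(0.0)   # fully confident and wrong

# Batches per epoch for 10,000 samples at batch size 2048
steps_per_epoch = math.ceil(10000 / 2048)

print(f"best={best:.4f}, chance={chance:.4f}, worst={worst:.4f}")
print(f"steps per epoch: {steps_per_epoch}")
```

Under this assumption each sample's loss lies between about 0.313 and 1.313, so the weighted mean cannot fall below 0.313 however well the model fits. This does not affect the ranking of the predicted probabilities, which is what AUC measures. The five batches per epoch match the batch indices 0 to 4 in the log.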

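Prediction

This cell loads the saved model that scored 0.86893; to test your own model, replace the model path accordingly.

Finally, submit the generated result.csv.

# Load the model and build the data loader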
model = MyNet()

model_dict = paddle.load('model_best.pdparams')
# model_dict = paddle.load('model.pdparams')

model.set_dict(model_dict)

model.eval()

test_dataset=MyDateset('test_df.csv',mode = 'test')

test_dataloader = paddle.io.DataLoader(
    test_dataset,
    batch_size=1,
    shuffle=False,
    drop_last=False)

# Save the results to result.csv; the predicted default probability (column 1) is written, not the hard class
result = []
for step, data in enumerate(test_dataloader):
    data ,emb_data, loan_id = data
    pre = model(data,emb_data)
    result.append([loan_id.numpy()[0], pre[:,1].numpy()[0]])
    # result.append([loan_id.numpy()[0], np.argmax(pre.numpy())])

pd.DataFrame(result,columns=['id','isDefault']).to_csv('result.csv',index=False)
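
The submission stores the predicted probability of default rather than the argmax class (the argmax line is commented out). Since the score reported in this post is an AUC, which depends only on how the samples are ranked, a small sketch shows why that matters. The function `roc_auc` and the toy labels and scores are hypothetical; the pairwise definition used here is quadratic in the number of samples and meant for illustration only.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the positive
    is scored higher; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy example (hypothetical labels and predicted default probabilities)
labels = [0, 0, 1, 1, 0, 1]
probs = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

auc_probs = roc_auc(labels, probs)
hard = [1 if p >= 0.5 else 0 for p in probs]  # thresholded predictions
auc_hard = roc_auc(labels, hard)
print(f"AUC with probabilities: {auc_probs:.4f}, with hard labels: {auc_hard:.4f}")
```

Thresholding collapses many samples into ties and discards ranking information, so in this toy case the AUC drops from 8/9 to 7.5/9.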