PaddlePaddle Learning Competition: Personal Loan Default Prediction

Task Description

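The task is to use an existing batch of credit data, drawn from a customer group that differs slightly from the target group, to help build a risk-control model for the target business. The two datasets share a large number of fields but very few users. Participants are encouraged to use transfer learning to capture the relationship between users' basic information and default behavior across the different businesses, and so predict user defaults in the new business.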

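Note: this learning competition requires participants to build an end-to-end deep learning model with PaddlePaddle 2.0 or later, so classical machine learning models such as tree models do not meet the requirements.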

Data Description

The training set contains 10000 records with the following attributes:

'loan_id', 'user_id', 'total_loan', 'year_of_loan', 'interest',
'monthly_payment', 'class', 'employer_type', 'industry', 'work_year',
'house_exist', 'censor_status', 'issue_date', 'use', 'post_code',
'region', 'debt_loan_ratio', 'del_in_18month', 'scoring_low',
'scoring_high', 'known_outstanding_loan', 'known_dero',
'pub_dero_bankrup', 'recircle_b', 'recircle_u', 'initial_list_status',
'app_type', 'earlies_credit_mon', 'title', 'policy_code', 'f0', 'f1',
'f2', 'f3', 'f4', 'early_return', 'early_return_amount',
'early_return_amount_3mon', 'isDefault'

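Here 'isDefault' is the target label. Nearly one third of the variables are categorical and nearly two thirds are numeric; the few remaining ones carry time information and similar content.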

Only 1683 of the 10000 records are positive samples (about 16.8%); the rest are negative, so the classes are imbalanced.

Approach

This solution mainly follows a featured project recommended by the organizers: https://aistudio.baidu.com/aistudio/projectdetail/3555696

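  1. To address the class imbalance, weight the loss function: negative samples get a weight of 0.2 and positive samples a weight of 1.2.
  2. Standardize continuous numeric variables using their mean and standard deviation.
  3. Re-encode categorical variables inside the network (extra fully connected layers stand in for an embedding).
  4. Apply some simple transformations and feature engineering to a few variables.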
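The class weighting in step 1 can be illustrated with a small, framework-free sketch. The function name `weighted_cross_entropy` and the toy probabilities are hypothetical, and dividing by the sum of the sample weights is an assumed normalization for this sketch, not a statement about what Paddle does internally.

```python
import math

# Class weights from step 1: index 0 = non-default, index 1 = default.
CLASS_WEIGHTS = [0.2, 1.2]


def weighted_cross_entropy(probs, labels, weights=CLASS_WEIGHTS):
    """Weighted mean of -log p(true class).

    probs  : list of [p_class0, p_class1] rows, each summing to 1
    labels : list of 0/1 integer labels
    The weighted losses are divided by the sum of the sample weights
    (an assumed normalization for this sketch).
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for p, y in zip(probs, labels):
        w = weights[y]
        weighted_sum += -w * math.log(p[y])
        weight_total += w
    return weighted_sum / weight_total


# Total weight carried by each class in the 10000-row training set.
n_pos, n_neg = 1683, 10000 - 1683
neg_mass = n_neg * CLASS_WEIGHTS[0]
pos_mass = n_pos * CLASS_WEIGHTS[1]
print(round(neg_mass, 1), round(pos_mass, 1))

# Toy predictions (hypothetical): one negative and one positive sample.
probs = [[0.9, 0.1], [0.6, 0.4]]
labels = [0, 1]
loss = weighted_cross_entropy(probs, labels)
print(loss)
```

With these weights the 1683 positives carry slightly more total weight than the 8317 negatives, so the minority class is no longer drowned out in the loss.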

Code

# Load the training and test data
import pandas as pd
import re

train_df=pd.read_csv('data/data130186/train_public.csv')
test_df=pd.read_csv('data/data130187/test_public.csv')

# Fix the random seeds
import random
import numpy as np
import paddle

seed = 1024
random.seed(seed)
np.random.seed(seed)
paddle.seed(seed)
<paddle.fluid.core_avx.Generator at 0x7fdc2798c5f0>
# Check which columns contain missing values
train_df.isnull().any()
loan_id                     False
user_id                     False
total_loan                  False
year_of_loan                False
interest                    False
monthly_payment             False
class                       False
employer_type               False
industry                    False
work_year                    True
house_exist                 False
censor_status               False
issue_date                  False
use                         False
post_code                   False
region                      False
debt_loan_ratio             False
del_in_18month              False
scoring_low                 False
scoring_high                False
known_outstanding_loan      False
known_dero                  False
pub_dero_bankrup             True
recircle_b                  False
recircle_u                  False
initial_list_status         False
app_type                    False
earlies_credit_mon          False
title                       False
policy_code                 False
f0                           True
f1                           True
f2                           True
f3                           True
f4                           True
early_return                False
early_return_amount         False
early_return_amount_3mon    False
isDefault                   False
dtype: bool
# Construct a new feature
train_df['risk'] = train_df['total_loan'] * train_df['interest']
test_df['risk'] = test_df['total_loan'] * test_df['interest']

# Map the class letters A-G to the integers 0-6
class_dic = dict(zip([chr(y) for y in range(ord('A'), ord('G') + 1)], range(7)))
train_df['class'] = train_df['class'].apply(lambda x: class_dic[x])
test_df['class'] = test_df['class'].apply(lambda x: class_dic[x])

# Rebuild region as a binary feature: 1 if the region's default rate exceeds 0.2, else 0
region_des = train_df.groupby(['region'])['isDefault'].describe()
default_regions = region_des[region_des['mean'] > 0.2].index.tolist()
train_df['region'] = train_df['region'].apply(lambda x: 1 if x in default_regions else 0)
test_df['region'] = test_df['region'].apply(lambda x: 1 if x in default_regions else 0)

# Convert work_year to numbers
work_year_dict = {
    '< 1 year': 0,
    '1 year': 1,
    '2 years': 2,
    '3 years': 3,
    '4 years': 4,
    '5 years': 5,
    '6 years': 6,
    '7 years': 7,
    '8 years': 8,
    '9 years': 9,
    '10+ years': 10,
}

train_df['work_year'] = train_df['work_year'].map(work_year_dict)
test_df['work_year'] = test_df['work_year'].map(work_year_dict)

# Process earlies_credit_mon: extract the month name and the numeric part

month_dict = {'Dec': 12,
         'Apr': 4,
         'Oct': 10,
         'Jun': 6,
         'May': 5,
         'Jan': 1,
         'Nov': 11,
         'Jul': 7,
         'Aug': 8,
         'Feb': 2,
         'Sep': 9,
         'Mar': 3}

train_df['early_mon'] = train_df['earlies_credit_mon'].apply(lambda x: month_dict[re.findall('[a-zA-Z]+', x)[0]])
test_df['early_mon'] = test_df['earlies_credit_mon'].apply(lambda x: month_dict[re.findall('[a-zA-Z]+', x)[0]])

train_df['early_day'] = train_df['earlies_credit_mon'].apply(lambda x: int(re.findall('[0-9]+', x)[0]))
test_df['early_day'] = test_df['earlies_credit_mon'].apply(lambda x: int(re.findall('[0-9]+', x)[0]))

# Construct one interaction feature from the f-series
train_df['f0xf3'] = train_df['f0'] * train_df['f3']
test_df['f0xf3'] = test_df['f0'] * test_df['f3']

# Fill missing values with the column mean

train_df['pub_dero_bankrup'] = train_df['pub_dero_bankrup'].fillna(train_df['pub_dero_bankrup'].mean())
test_df['pub_dero_bankrup'] = test_df['pub_dero_bankrup'].fillna(test_df['pub_dero_bankrup'].mean())

'''
The following feature constructions did not help:

# Pairwise interaction features from the f-series
f_ls = ['f0', 'f1', 'f2', 'f3', 'f4']

for i in f_ls:
    for j in f_ls[f_ls.index(i)+1:]:
        train_df['%sx%s'%(i,j)] = train_df[i] * train_df[j]
        test_df['%sx%s'%(i,j)] = test_df[i] * test_df[j]

# Rebuild use as a binary feature split by default rate
use_des = train_df.groupby(['use'])['isDefault'].describe()
default_uses = use_des[use_des['mean'] > 0.2].index.tolist()
train_df['use'] = train_df['use'].apply(lambda x: 1 if x in default_uses else 0)
test_df['use'] = test_df['use'].apply(lambda x: 1 if x in default_uses else 0)

# Rebuild post_code as a binary feature split by default rate
post_code_des = train_df.groupby(['post_code'])['isDefault'].describe()
default_post_codes = post_code_des[post_code_des['mean'] > 0.2].index.tolist()
train_df['post_code'] = train_df['post_code'].apply(lambda x: 1 if x in default_post_codes else 0)
test_df['post_code'] = test_df['post_code'].apply(lambda x: 1 if x in default_post_codes else 0)
'''

train_df.to_csv('train_df.csv', index=False)
test_df.to_csv('test_df.csv', index=False)
train_df['isDefault'].value_counts()
Counter({0: 8317, 1: 1683})
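The region step in the cell above groups rows by category, computes the mean default rate per group, and keeps only a binary flag for groups whose rate exceeds 0.2. A minimal sketch of that idea in plain Python follows; the toy rows and the helper names `high_risk_categories` and `encode` are hypothetical and not part of the original notebook.

```python
from collections import defaultdict

# Hypothetical toy rows: (region_code, is_default). Not taken from the dataset.
rows = [
    (1, 0), (1, 0), (1, 1), (1, 0), (1, 0),   # region 1: default rate 0.2
    (2, 1), (2, 0), (2, 1),                   # region 2: default rate 2/3
    (3, 0), (3, 0),                           # region 3: default rate 0.0
]


def high_risk_categories(rows, threshold=0.2):
    """Return the set of categories whose mean label exceeds the threshold."""
    counts = defaultdict(int)
    positives = defaultdict(int)
    for category, label in rows:
        counts[category] += 1
        positives[category] += label
    return {c for c in counts if positives[c] / counts[c] > threshold}


risky = high_risk_categories(rows)


def encode(category, risky=risky):
    # Categories never seen in the training rows fall into group 0.
    return 1 if category in risky else 0


encoded = [encode(c) for c in (1, 2, 3, 99)]
print(risky, encoded)
```

Because the comparison is strict, a group sitting exactly at the threshold (region 1 here) stays in group 0. The statistics come from the training labels only, and the same set is then applied to the test rows.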
# Per-class summary statistics of the numeric features
pos = train_df[train_df['isDefault']==1].describe()
pos_des = pd.DataFrame(pos.values.T, index=pos.columns, columns=pos.index)  # transpose
pos_des = pos_des['mean']

neg = train_df[train_df['isDefault']==0].describe()
neg_des = pd.DataFrame(neg.values.T, index=neg.columns, columns=neg.index)
neg_des = neg_des['mean']

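# Screen numeric features by the relative gap between the class means
des = pd.DataFrame({'pos': pos_des, 'neg': neg_des})
des['diff'] = (des['pos'] - des['neg'])/des['neg']
des = des.sort_values(by=['diff'])
# Keep features with |diff| >= 0.01; after sorting, the last entry is isDefault itself (diff = inf), so drop it
num_item = des[abs(des['diff'])>=0.01].index.tolist()[:-1]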

un_num_item = ['early_mon', 'work_year', 'region', 'employer_type','industry', 'house_exist', 'censor_status',
        'use',
        'initial_list_status','app_type',
        'policy_code']

# Remove the categorical features from the numeric feature list
for i in un_num_item:
    if i in num_item:
        num_item.remove(i)
len(num_item)
26

Building the Dataset

import paddle.vision.transforms as T
from PIL import Image

class MyDateset(paddle.io.Dataset):
    # csv_dir: path of the data to load; standard_csv_dir: path of the file used to compute the mean and standard deviation for normalization
    def __init__(self,csv_dir,standard_csv_dir='train_df.csv',mode = 'train'):
        super(MyDateset, self).__init__()

        # Load the data
        self.df = pd.read_csv(csv_dir)
        
        # Compute the mean and standard deviation of each numeric column
        st_df = pd.read_csv(standard_csv_dir)
        self.mean_df = st_df.mean(numeric_only=True)
        self.std_df = st_df.std(numeric_only=True)

        # Specify which variables are numeric and which are categorical; all others are unused
        self.num_item = num_item
        self.un_num_item = un_num_item

        # Build a lookup table that maps each categorical value (number or string) to an index
        un_num_item_list = {}
        for item in self.un_num_item:
            un_num_item_list[item]=list(set(st_df[item].values))
        self.un_num_item_list = un_num_item_list

        self.mode = mode

    def __getitem__(self, index):
        data=[]

        # Standardize; a missing value is treated as a raw value of 0 before standardization
        for item in self.num_item:
            if np.isnan(self.df[item][index]):
                data.append((0-self.mean_df[item])/self.std_df[item])
            else:
                data.append((self.df[item][index]-self.mean_df[item])/self.std_df[item])
        
        emb_data = []

        # Map each categorical value to its index; unseen values map to -1
        for item in self.un_num_item:
            try:
                if self.df[item][index] not in self.un_num_item_list[item]:
                    emb_data.append(-1)
                else:
                    emb_data.append(self.un_num_item_list[item].index(self.df[item][index]))
            except:
                emb_data.append(-1)

        data = paddle.to_tensor(data).astype('float32')
        emb_data = paddle.to_tensor(emb_data).astype('float32')

        # In train mode return the label; otherwise return the loan_id so each prediction can be matched to its record
        if self.mode == 'train':
            label = self.df['isDefault'][index]
        else:
            label = self.df['loan_id'][index]

        label = np.array(label).astype('int64')
        return data,emb_data,label

    def __len__(self):
        return len(self.df)
dataset=MyDateset('train_df.csv')
[data,emb_data,label] = dataset[0]
print(data.shape)
print(emb_data.shape)
print(label)

print('\n')
print(num_item)
[26]
[11]
0


['early_return_amount_3mon', 'early_return_amount', 'early_return', 'f1', 'early_day', 'title', 'recircle_b', 'scoring_low', 'scoring_high', 'f4', 'monthly_payment', 'post_code', 'total_loan', 'f3', 'recircle_u', 'year_of_loan', 'known_outstanding_loan', 'del_in_18month', 'f0', 'debt_loan_ratio', 'interest', 'f0xf3', 'risk', 'pub_dero_bankrup', 'known_dero', 'class']
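The two per-sample transformations inside `__getitem__` can be isolated as small functions. This is a sketch in plain Python with hypothetical statistics and vocabulary; the helper names `zscore` and `category_index` do not appear in the original code.

```python
import math


def zscore(value, mean, std):
    """Standardize one value; a missing value (None or NaN) is treated as raw 0."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        value = 0.0
    return (value - mean) / std


def category_index(value, vocabulary):
    """Position of value in the training vocabulary, or -1 if it was never seen."""
    try:
        return vocabulary.index(value)
    except ValueError:
        return -1


# Hypothetical statistics and vocabulary, for illustration only.
mean, std = 10.0, 2.0
vocab = ["A", "B", "C"]

print(zscore(14.0, mean, std))          # (14 - 10) / 2
print(zscore(float("nan"), mean, std))  # (0 - 10) / 2
print(category_index("B", vocab), category_index("Z", vocab))
```

One design point to keep in mind: the notebook builds each vocabulary with `list(set(...))`, whose order for strings is not guaranteed to be the same in a different Python process. Sorting the vocabulary would make the indices reproducible; that is a suggestion, not what the original code does.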
# Inspect the per-class feature means
des
                                   pos           neg       diff
early_return_amount_3mon  2.803791e+01  3.973950e+02  -0.929446
early_return_amount       2.054415e+02  2.572250e+03  -0.920132
early_return              1.663696e-01  1.518095e+00  -0.890409
f1                        1.303781e-03  1.445846e-03  -0.098258
early_day                 4.226144e+01  4.620753e+01  -0.085399
title                     1.687819e+03  1.832562e+03  -0.078984
use                       1.645870e+00  1.786221e+00  -0.078574
recircle_b                1.550428e+04  1.675956e+04  -0.074900
work_year                 5.919503e+00  6.091975e+00  -0.028311
scoring_low               6.548115e+02  6.659983e+02  -0.016797
scoring_high              7.638845e+02  7.765860e+02  -0.016356
user_id                   2.251468e+05  2.252223e+05  -0.000335
loan_id                   1.025147e+06  1.025222e+06  -0.000074
policy_code               1.000000e+00  1.000000e+00   0.000000
early_mon                 6.733214e+00  6.732596e+00   0.000092
f2                        8.537608e+00  8.453426e+00   0.009958
f4                        8.179408e+00  8.081345e+00   0.012134
monthly_payment           4.421506e+02  4.359102e+02   0.014316
post_code                 2.607600e+02  2.568633e+02   0.015170
total_loan                1.471829e+04  1.433815e+04   0.026513
initial_list_status       4.236482e-01  4.121678e-01   0.027854
f3                        1.506535e+01  1.457538e+01   0.033616
recircle_u                5.563608e+01  5.321519e+01   0.045493
year_of_loan              3.721331e+00  3.430684e+00   0.084720
app_type                  2.139037e-02  1.971865e-02   0.084779
known_outstanding_loan    1.248544e+01  1.147493e+01   0.088063
del_in_18month            3.428402e-01  3.052783e-01   0.123041
f0                        6.338471e+00  5.557107e+00   0.140606
censor_status             1.142008e+00  9.888181e-01   0.154923
house_exist               6.922163e-01  5.960082e-01   0.161421
debt_loan_ratio           2.008696e+01  1.701519e+01   0.180531
interest                  1.532975e+01  1.279642e+01   0.197971
f0xf3                     1.163089e+02  9.522652e+01   0.221392
risk                      2.344960e+05  1.886230e+05   0.243199
pub_dero_bankrup          1.759590e-01  1.315179e-01   0.337910
known_dero                2.953060e-01  2.124564e-01   0.389960
class                     2.352941e+00  1.624985e+00   0.447977
region                    1.556744e-01  1.055669e-01   0.474651
isDefault                 1.000000e+00  0.000000e+00        inf
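The diff column in the table above is the relative gap between the positive-class mean and the negative-class mean, and features are kept when its absolute value reaches 0.01. The sketch below recomputes that quantity for three rows copied from the table; the function name `relative_gap` is illustrative only.

```python
def relative_gap(pos_mean, neg_mean):
    """(pos - neg) / neg, the quantity stored in the 'diff' column."""
    return (pos_mean - neg_mean) / neg_mean


# Class means copied from three rows of the table above: (pos, neg).
class_means = {
    "interest": (15.32975, 12.79642),
    "f2": (8.537608, 8.453426),
    "early_mon": (6.733214, 6.732596),
}

gaps = {name: relative_gap(p, n) for name, (p, n) in class_means.items()}

# Keep features whose absolute relative gap reaches the 0.01 cut-off.
selected = [name for name, g in gaps.items() if abs(g) >= 0.01]
print(gaps, selected)
```

Of the three, only interest passes the cut-off, which matches the printed num_item list: f2 falls just short and early_mon is essentially identical across the two classes.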

Building the Network

Two fully connected layers generate the embedding for the categorical variables.

class MyNet(paddle.nn.Layer):
    def __init__(self):
        super(MyNet,self).__init__()
        self.fc = paddle.nn.Linear(in_features=len(num_item), out_features=512)
        self.emb1 = paddle.nn.Linear(in_features=len(un_num_item),out_features=2048)
        self.emb2 = paddle.nn.Linear(in_features=2048,out_features=512)
        self.emb3 = paddle.nn.Linear(in_features=1024,out_features=512)
        self.dropout = paddle.nn.Dropout(0.1)
        self.relu = paddle.nn.ReLU()
        self.out = paddle.nn.Linear(in_features=512,out_features=2)

    def forward(self,data,emb_data):
        x = self.fc(data)
        x = self.dropout(x)

        emb = self.emb1(emb_data)
        emb = self.emb2(emb)

        x = paddle.concat([x,emb],axis=-1)
        x = self.relu(x)
        x = self.emb3(x)
        x = self.out(x)
        
        x = paddle.nn.functional.softmax(x)
        return x
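To get a feel for the size of this network, the parameter count of each fully connected layer can be worked out by hand. The sketch below assumes every Linear layer carries a bias vector (the usual default) and uses the 26 numeric and 11 categorical inputs reported earlier; `linear_params` is a hypothetical helper.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


# Layer sizes read from MyNet, with 26 numeric and 11 categorical inputs.
layers = {
    "fc": (26, 512),
    "emb1": (11, 2048),
    "emb2": (2048, 512),
    "emb3": (1024, 512),   # input is the concatenation of fc (512) and emb2 (512)
    "out": (512, 2),
}

counts = {name: linear_params(i, o) for name, (i, o) in layers.items()}
total = sum(counts.values())
print(counts, total)
```

Roughly two thirds of the parameters sit in emb2, so the network has about 1.6 million parameters for 10000 training rows.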

Training

# Build the data loader. A larger batch size can be tried, but the learning rate should then be increased accordingly
train_dataset=MyDateset('train_df.csv')
batch_size = 2048
train_dataloader = paddle.io.DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    drop_last=False)
# Build the model
model = MyNet()

# model_dict = paddle.load('model.pdparams')
# model.set_dict(model_dict)

model.train()

max_epoch=17
opt = paddle.optimizer.SGD(learning_rate=0.1, parameters=model.parameters())

# Train
now_step=0
for epoch in range(max_epoch):
    for step, data in enumerate(train_dataloader):
        now_step+=1

        data,emb_data, label = data
        pre = model(data,emb_data)
        loss = paddle.nn.functional.cross_entropy(pre,label,weight=paddle.to_tensor([0.2,1.2]),reduction='mean')
        # loss = paddle.nn.functional.square_error_cost(pre,label.reshape([-1,1]).astype('float32'))
        # loss = paddle.mean(loss)
        loss.backward()
        opt.step()
        opt.clear_gradients()
        if now_step%1==0:
            print("epoch: {}, batch: {}, loss is: {}".format(epoch, step, loss.mean().numpy()))

# Save the model to model.pdparams
paddle.save(model.state_dict(), 'model.pdparams')
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/math_op_patch.py:253: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float32, but right dtype is paddle.bool, the right dtype will convert to paddle.float32
  format(lhs_dtype, rhs_dtype, lhs_dtype))


epoch: 0, batch: 0, loss is: [0.72932494]
epoch: 0, batch: 1, loss is: [0.72809494]
epoch: 0, batch: 2, loss is: [0.67359775]
epoch: 0, batch: 3, loss is: [0.68661183]
epoch: 0, batch: 4, loss is: [0.75169533]
epoch: 1, batch: 0, loss is: [0.7372106]
epoch: 1, batch: 1, loss is: [0.6646998]
epoch: 1, batch: 2, loss is: [0.652986]
epoch: 1, batch: 3, loss is: [0.6788878]
epoch: 1, batch: 4, loss is: [0.6282849]
epoch: 2, batch: 0, loss is: [0.6557647]
epoch: 2, batch: 1, loss is: [0.6336642]
epoch: 2, batch: 2, loss is: [0.63801914]
epoch: 2, batch: 3, loss is: [0.6108661]
epoch: 2, batch: 4, loss is: [0.60893005]
epoch: 3, batch: 0, loss is: [0.5925074]
epoch: 3, batch: 1, loss is: [0.5883472]
epoch: 3, batch: 2, loss is: [0.58365494]
epoch: 3, batch: 3, loss is: [0.59151757]
epoch: 3, batch: 4, loss is: [0.57692784]
epoch: 4, batch: 0, loss is: [0.58431715]
epoch: 4, batch: 1, loss is: [0.56982553]
epoch: 4, batch: 2, loss is: [0.5645007]
epoch: 4, batch: 3, loss is: [0.57651067]
epoch: 4, batch: 4, loss is: [0.56192523]
epoch: 5, batch: 0, loss is: [0.5654892]
epoch: 5, batch: 1, loss is: [0.56473136]
epoch: 5, batch: 2, loss is: [0.5650625]
epoch: 5, batch: 3, loss is: [0.5576684]
epoch: 5, batch: 4, loss is: [0.5584635]
epoch: 6, batch: 0, loss is: [0.5546932]
epoch: 6, batch: 1, loss is: [0.55603355]
epoch: 6, batch: 2, loss is: [0.5493836]
epoch: 6, batch: 3, loss is: [0.5532869]
epoch: 6, batch: 4, loss is: [0.55402416]
epoch: 7, batch: 0, loss is: [0.5511932]
epoch: 7, batch: 1, loss is: [0.5557276]
epoch: 7, batch: 2, loss is: [0.55160713]
epoch: 7, batch: 3, loss is: [0.53901863]
epoch: 7, batch: 4, loss is: [0.5432972]
epoch: 8, batch: 0, loss is: [0.53537405]
epoch: 8, batch: 1, loss is: [0.52480084]
epoch: 8, batch: 2, loss is: [0.54991305]
epoch: 8, batch: 3, loss is: [0.5552637]
epoch: 8, batch: 4, loss is: [0.5439552]
epoch: 9, batch: 0, loss is: [0.53486353]
epoch: 9, batch: 1, loss is: [0.54527813]
epoch: 9, batch: 2, loss is: [0.5333212]
epoch: 9, batch: 3, loss is: [0.5366559]
epoch: 9, batch: 4, loss is: [0.53380215]
epoch: 10, batch: 0, loss is: [0.5307571]
epoch: 10, batch: 1, loss is: [0.5307473]
epoch: 10, batch: 2, loss is: [0.5341663]
epoch: 10, batch: 3, loss is: [0.53598076]
epoch: 10, batch: 4, loss is: [0.533747]
epoch: 11, batch: 0, loss is: [0.5351857]
epoch: 11, batch: 1, loss is: [0.5374662]
epoch: 11, batch: 2, loss is: [0.53001803]
epoch: 11, batch: 3, loss is: [0.5292998]
epoch: 11, batch: 4, loss is: [0.520213]
epoch: 12, batch: 0, loss is: [0.5229189]
epoch: 12, batch: 1, loss is: [0.53077817]
epoch: 12, batch: 2, loss is: [0.5234481]
epoch: 12, batch: 3, loss is: [0.52863]
epoch: 12, batch: 4, loss is: [0.5292127]
epoch: 13, batch: 0, loss is: [0.53497964]
epoch: 13, batch: 1, loss is: [0.5193575]
epoch: 13, batch: 2, loss is: [0.52314794]
epoch: 13, batch: 3, loss is: [0.5276453]
epoch: 13, batch: 4, loss is: [0.51771814]
epoch: 14, batch: 0, loss is: [0.50922066]
epoch: 14, batch: 1, loss is: [0.5228251]
epoch: 14, batch: 2, loss is: [0.5262174]
epoch: 14, batch: 3, loss is: [0.52728355]
epoch: 14, batch: 4, loss is: [0.5314889]
epoch: 15, batch: 0, loss is: [0.5179457]
epoch: 15, batch: 1, loss is: [0.53258806]
epoch: 15, batch: 2, loss is: [0.51814485]
epoch: 15, batch: 3, loss is: [0.5144637]
epoch: 15, batch: 4, loss is: [0.5230428]
epoch: 16, batch: 0, loss is: [0.528043]
epoch: 16, batch: 1, loss is: [0.5009919]
epoch: 16, batch: 2, loss is: [0.5170122]
epoch: 16, batch: 3, loss is: [0.5321321]
epoch: 16, batch: 4, loss is: [0.5218459]
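One detail of the training setup is worth checking numerically. MyNet.forward already returns softmax probabilities, and paddle.nn.functional.cross_entropy, going by its documented default (use_softmax=True), applies softmax to its input again. Under that assumption the values fed to the logarithm are confined to a narrow band, which would be consistent with the logged loss starting near 0.7 and leveling off around 0.5. The sketch below uses plain Python (no Paddle calls) and ignores the class weights, since a weighted mean of per-sample losses stays within the per-sample bounds.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Best case: the network is fully confident and correct for a label-1 sample.
loss_floor = -math.log(softmax([0.0, 1.0])[1])

# Undecided network: equal probabilities for both classes.
loss_uniform = -math.log(softmax([0.5, 0.5])[1])

# Worst case: fully confident and wrong.
loss_ceiling = -math.log(softmax([1.0, 0.0])[1])

print(loss_floor, loss_uniform, loss_ceiling)
```

If the assumption holds, the per-sample loss can never drop below about 0.313, however good the predictions are. Returning raw logits from forward during training, and applying softmax only at prediction time, would remove that floor; this is a possible change to try, not something the original notebook does.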

Prediction

The code below loads a saved model that scored 0.86893. To test your own model, replace the model path accordingly.

Finally, submit the generated result.csv.

# Load the model and build the data loader
model = MyNet()

model_dict = paddle.load('model_best.pdparams')
# model_dict = paddle.load('model.pdparams')

model.set_dict(model_dict)

model.eval()

test_dataset=MyDateset('test_df.csv',mode = 'test')

test_dataloader = paddle.io.DataLoader(
    test_dataset,
    batch_size=1,
    shuffle=False,
    drop_last=False)
# Save the results to result.csv
result = []
for step, data in enumerate(test_dataloader):
    data ,emb_data, loan_id = data
    pre = model(data,emb_data)
    result.append([loan_id.numpy()[0], pre[:,1].numpy()[0]])
    # result.append([loan_id.numpy()[0], np.argmax(pre.numpy())])

pd.DataFrame(result,columns=['id','isDefault']).to_csv('result.csv',index=None)

Directions for Improvement

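With so little data, neural networks cannot show their full strength, so traditional machine learning models tend to have the edge. A chart from Andrew Ng's Coursera course makes this point (figure not included here).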

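Since this competition only allows deep learning models, it would be good to see stronger deep models emerge. Cross-validation and model ensembling are possible ways to push the score further.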
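As a starting point for cross-validation on imbalanced labels, a stratified split keeps the share of positives roughly equal in every fold. The following is a minimal sketch in plain Python; the function `stratified_kfold` is hypothetical, and the toy label list only reproduces the class counts of the training set (1683 positives, 8317 negatives), not its actual row order.

```python
import random


def stratified_kfold(labels, n_splits=5, seed=1024):
    """Split sample indices into n_splits folds with similar class ratios."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the indices of this class round-robin across the folds.
        for position, sample_index in enumerate(idx):
            folds[position % n_splits].append(sample_index)
    return folds


# Toy labels with the same class counts as the training set.
labels = [1] * 1683 + [0] * 8317
folds = stratified_kfold(labels)

for k, val_idx in enumerate(folds):
    val_set = set(val_idx)
    train_idx = [i for i in range(len(labels)) if i not in val_set]
    n_pos = sum(labels[i] for i in val_idx)
    print(k, len(train_idx), len(val_idx), n_pos)
```

Each fold can serve once as the validation set while a model is trained on the other four; averaging the predicted probabilities of the five resulting models is one simple form of ensembling.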


This article is a repost of the original project.
