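[PaddlePaddle Learning Competition: Personal Loan Default Prediction] 1st-place solution for September
Built on the baseline with modifications; AUC score above 0.86.
PaddlePaddle Learning Competition: Personal Loan Default Prediction
Competition overview
The task is to build a risk-control model for a target business with the help of a second batch of credit data whose customer group differs slightly from the target one. The two datasets share many fields but very few users. Participants are encouraged to use transfer learning to capture the relationship between users' basic information and default behavior across businesses, and so predict defaults for users of the new business.
Note: this competition requires an end-to-end deep learning model built with PaddlePaddle 2.0 or later, so pure machine learning models such as tree models are not allowed.
Data description
The training set contains 10,000 records with the following columns: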
'loan_id', 'user_id', 'total_loan', 'year_of_loan', 'interest',
'monthly_payment', 'class', 'employer_type', 'industry', 'work_year',
'house_exist', 'censor_status', 'issue_date', 'use', 'post_code',
'region', 'debt_loan_ratio', 'del_in_18month', 'scoring_low',
'scoring_high', 'known_outstanding_loan', 'known_dero',
'pub_dero_bankrup', 'recircle_b', 'recircle_u', 'initial_list_status',
'app_type', 'earlies_credit_mon', 'title', 'policy_code', 'f0', 'f1',
'f2', 'f3', 'f4', 'early_return', 'early_return_amount',
'early_return_amount_3mon', 'isDefault'
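'isDefault' is the classification target. Roughly one third of the features are categorical and roughly two thirds are numeric; the rest carry information such as dates.
Only 1,683 of the 10,000 samples are positive and the rest are negative, so the classes are imbalanced.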
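To make the imbalance concrete, the short calculation below works from the two counts given above. It is only arithmetic on those counts; the variable names are illustrative and do not appear in the original code.

```python
# Class balance of the training set, using the counts reported in the text.
n_total = 10000
n_pos = 1683
n_neg = n_total - n_pos

pos_rate = n_pos / n_total               # share of defaulting loans
neg_to_pos = n_neg / n_pos               # negatives per positive sample
all_negative_accuracy = n_neg / n_total  # accuracy of always predicting "no default"

print(f"negatives: {n_neg}")
print(f"positive rate: {pos_rate:.4f}")
print(f"negatives per positive: {neg_to_pos:.2f}")
print(f"accuracy of an all-negative predictor: {all_negative_accuracy:.4f}")
```

A predictor that never flags a default already reaches about 83% accuracy, which is one reason a ranking metric such as AUC is more informative here than accuracy.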
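Approach
This solution is mainly based on the officially recommended featured project: https://aistudio.baidu.com/aistudio/projectdetail/3555696
- Class imbalance: the loss function is re-weighted, with weight 0.2 for negative samples and 1.2 for positive samples.
- Continuous numeric variables are standardized (subtract the mean, divide by the standard deviation).
- Categorical variables are re-encoded inside the network (an extra fully connected layer stands in for an embedding).
- A few variables get simple transformations and feature engineering.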
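The effect of the class weights can be sketched in plain Python. The function below is a hypothetical helper, not part of the original code, and it assumes the weighted mean is normalised by the total weight of the batch, which is one common convention; the probabilities are made-up values for illustration.

```python
import math

def weighted_cross_entropy(probs, labels, class_weights):
    """Weighted mean of -log p(true class), normalised by the total weight."""
    weighted_sum = 0.0
    weight_total = 0.0
    for p, y in zip(probs, labels):
        w = class_weights[y]
        weighted_sum += -w * math.log(p[y])
        weight_total += w
    return weighted_sum / weight_total

# Made-up predictions: (p(no default), p(default)) for three negatives and one positive.
probs = [(0.9, 0.1), (0.8, 0.2), (0.7, 0.3), (0.6, 0.4)]
labels = [0, 0, 0, 1]

plain = weighted_cross_entropy(probs, labels, (1.0, 1.0))
weighted = weighted_cross_entropy(probs, labels, (0.2, 1.2))
print(f"unweighted loss: {plain:.4f}")
print(f"weighted loss (0.2 / 1.2): {weighted:.4f}")
```

With weights 0.2 and 1.2, the poorly predicted positive sample dominates the loss, so gradient updates push harder on the minority class.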
Code
# Load the data
import pandas as pd
import re
train_df=pd.read_csv('data/data130186/train_public.csv')
test_df=pd.read_csv('data/data130187/test_public.csv')
# Fix the random seeds
import random
import numpy as np
import paddle
seed = 1024
random.seed(seed)
np.random.seed(seed)
paddle.seed(seed)
<paddle.fluid.core_avx.Generator at 0x7fdc2798c5f0>
# Check which columns contain missing values
train_df.isnull().any()
loan_id False
user_id False
total_loan False
year_of_loan False
interest False
monthly_payment False
class False
employer_type False
industry False
work_year True
house_exist False
censor_status False
issue_date False
use False
post_code False
region False
debt_loan_ratio False
del_in_18month False
scoring_low False
scoring_high False
known_outstanding_loan False
known_dero False
pub_dero_bankrup True
recircle_b False
recircle_u False
initial_list_status False
app_type False
earlies_credit_mon False
title False
policy_code False
f0 True
f1 True
f2 True
f3 True
f4 True
early_return False
early_return_amount False
early_return_amount_3mon False
isDefault False
dtype: bool
# Construct a new feature
train_df['risk'] = train_df['total_loan'] * train_df['interest']
test_df['risk'] = test_df['total_loan'] * test_df['interest']
# Encode class as an integer (A -> 0, ..., G -> 6)
class_dic = dict(zip([chr(y) for y in range(ord('A'), ord('G') + 1)], range(0, 7)))
train_df['class'] = train_df['class'].apply(lambda x: class_dic[x])
test_df['class'] = test_df['class'].apply(lambda x: class_dic[x])
# Rebuild region as a binary flag: 1 if the region's default rate in the training set exceeds 0.2, else 0
region_des = train_df.groupby(['region'])['isDefault'].describe()
default_regions = region_des[region_des['mean'] > 0.2].index.tolist()
train_df['region'] = train_df['region'].apply(lambda x: 1 if x in default_regions else 0)
test_df['region'] = test_df['region'].apply(lambda x: 1 if x in default_regions else 0)
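The region step amounts to computing a default rate per group on the training labels and flagging groups above a threshold. A minimal standard-library sketch of that idea follows; `high_default_groups` and the toy data are hypothetical and only mirror the 0.2 threshold used above.

```python
from collections import defaultdict

def high_default_groups(groups, labels, threshold=0.2):
    """Return the set of group keys whose mean label is strictly above threshold."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for g, y in zip(groups, labels):
        totals[g] += 1
        positives[g] += y
    return {g for g in totals if positives[g] / totals[g] > threshold}

# Toy data: region 1 defaults at 2/4, region 2 at 1/5, region 3 at 0/1.
regions = [1, 1, 1, 1, 2, 2, 2, 2, 2, 3]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

flagged = high_default_groups(regions, labels)
encoded = [1 if r in flagged else 0 for r in regions]
print(flagged)
print(encoded)
```

The flagged set is derived from training labels only and then applied to the test set. Groups with few samples give noisy rates, so a minimum group size may be worth considering.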
# Convert work_year to a number
work_year_dict = {
    '< 1 year': 0,
    '1 year': 1,
    '2 years': 2,
    '3 years': 3,
    '4 years': 4,
    '5 years': 5,
    '6 years': 6,
    '7 years': 7,
    '8 years': 8,
    '9 years': 9,
    '10+ years': 10,
}
train_df['work_year'] = train_df['work_year'].map(work_year_dict)
test_df['work_year'] = test_df['work_year'].map(work_year_dict)
# Process earlies_credit_mon: split it into a month number and its numeric part
month_dict = {'Dec': 12,
              'Apr': 4,
              'Oct': 10,
              'Jun': 6,
              'May': 5,
              'Jan': 1,
              'Nov': 11,
              'Jul': 7,
              'Aug': 8,
              'Feb': 2,
              'Sep': 9,
              'Mar': 3}
train_df['early_mon'] = train_df['earlies_credit_mon'].apply(lambda x: month_dict[re.findall('[a-zA-Z]+', x)[0]])
test_df['early_mon'] = test_df['earlies_credit_mon'].apply(lambda x: month_dict[re.findall('[a-zA-Z]+', x)[0]])
# int() instead of eval(): safer, and it accepts digit strings with leading zeros
train_df['early_day'] = train_df['earlies_credit_mon'].apply(lambda x: int(re.findall('[0-9]+', x)[0]))
test_df['early_day'] = test_df['earlies_credit_mon'].apply(lambda x: int(re.findall('[0-9]+', x)[0]))
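The parsing of `earlies_credit_mon` can be tried out in isolation. The sketch below assumes the field looks like a month abbreviation plus some digits; the sample strings are hypothetical, since the raw values are not shown here. It uses `int()` for the numeric part, which accepts leading zeros such as `'01'`, whereas `eval('01')` is a syntax error in Python 3.

```python
import re

# Month abbreviation -> month number.
MONTHS = {name: i for i, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

def parse_month_and_number(text):
    """Return (month number, first run of digits as int) from a string like 'Dec-01'."""
    month = MONTHS[re.findall(r"[a-zA-Z]+", text)[0]]
    number = int(re.findall(r"[0-9]+", text)[0])
    return month, number

for sample in ["Dec-01", "Mar-84", "7-Jul"]:  # hypothetical sample values
    print(sample, parse_month_and_number(sample))
```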
# Construct one new feature from the f-series
train_df['f0xf3'] = train_df['f0'] * train_df['f3']
test_df['f0xf3'] = test_df['f0'] * test_df['f3']
# Fill missing values with the column mean
train_df['pub_dero_bankrup'] = train_df['pub_dero_bankrup'].fillna(train_df['pub_dero_bankrup'].mean())
test_df['pub_dero_bankrup'] = test_df['pub_dero_bankrup'].fillna(test_df['pub_dero_bankrup'].mean())
'''
Feature constructions that did not help:
# Pairwise products of the f-series features
f_ls = ['f0', 'f1', 'f2', 'f3', 'f4']
for i in f_ls:
    for j in f_ls[f_ls.index(i)+1:]:
        train_df['%sx%s'%(i,j)] = train_df[i] * train_df[j]
        test_df['%sx%s'%(i,j)] = test_df[i] * test_df[j]
# Rebuild use as a binary flag
use_des = train_df.groupby(['use'])['isDefault'].describe()
default_uses = use_des[use_des['mean'] > 0.2].index.tolist()
train_df['use'] = train_df['use'].apply(lambda x: 1 if x in default_uses else 0)
test_df['use'] = test_df['use'].apply(lambda x: 1 if x in default_uses else 0)
# Rebuild post_code as a binary flag
post_code_des = train_df.groupby(['post_code'])['isDefault'].describe()
default_post_codes = post_code_des[post_code_des['mean'] > 0.2].index.tolist()
train_df['post_code'] = train_df['post_code'].apply(lambda x: 1 if x in default_post_codes else 0)
test_df['post_code'] = test_df['post_code'].apply(lambda x: 1 if x in default_post_codes else 0)
'''
train_df.to_csv('train_df.csv', index=False)
test_df.to_csv('test_df.csv', index=False)
train_df['isDefault'].value_counts()
Counter({0: 8317, 1: 1683})
# Per-class means of the numeric features
pos = train_df[train_df['isDefault']==1].describe()
pos_des = pd.DataFrame(pos.values.T, index=pos.columns, columns=pos.index) # transpose
pos_des = pos_des['mean']
neg = train_df[train_df['isDefault']==0].describe()
neg_des = pd.DataFrame(neg.values.T, index=neg.columns, columns=neg.index)
neg_des = neg_des['mean']
# Numeric feature selection: keep features whose class means differ by at least 1% (relative to the negative class)
des = pd.DataFrame({'pos': pos_des, 'neg': neg_des})
des['diff'] = (des['pos'] - des['neg'])/des['neg']
des = des.sort_values(by=['diff'])
# [:-1] drops isDefault, which sorts last because its diff is inf
num_item = des[abs(des['diff'])>=0.01].index.tolist()[:-1]
un_num_item = ['early_mon', 'work_year', 'region', 'employer_type','industry', 'house_exist', 'censor_status',
               'use',
               'initial_list_status','app_type',
               'policy_code']
# Categorical features are handled separately, so remove them from the numeric list
for i in un_num_item:
    if i in num_item:
        num_item.remove(i)
len(num_item)
26
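The selection rule can be checked by hand on a few features. The sketch below recomputes the relative difference of class means, `(pos - neg) / neg`, for three rows of the `des` table and applies the same 1% cutoff; the helper name is illustrative.

```python
def relative_mean_diff(pos_mean, neg_mean):
    """Relative gap between the positive-class and negative-class means."""
    return (pos_mean - neg_mean) / neg_mean

# (positive-class mean, negative-class mean), copied from the des table.
class_means = {
    "interest": (15.32975, 12.79642),
    "early_mon": (6.733214, 6.732596),
    "early_return": (0.1663696, 1.518095),
}

diffs = {name: relative_mean_diff(p, n) for name, (p, n) in class_means.items()}
selected = [name for name, d in diffs.items() if abs(d) >= 0.01]

for name, d in diffs.items():
    print(f"{name}: {d:+.6f}")
print("kept:", selected)
```

Comparing means is a coarse filter: a feature whose distribution differs between classes while its mean stays the same would be dropped.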
Building the dataset
class MyDateset(paddle.io.Dataset):
    # csv_dir: path of the data to read
    # standard_csv_dir: path of the file used to compute the mean and standard deviation for normalization
    def __init__(self,csv_dir,standard_csv_dir='train_df.csv',mode = 'train'):
        super(MyDateset, self).__init__()
        # Read the data
        self.df = pd.read_csv(csv_dir)
        # Per-column mean and standard deviation (numeric columns only; string columns cannot be averaged)
        st_df = pd.read_csv(standard_csv_dir)
        self.mean_df = st_df.mean(numeric_only=True)
        self.std_df = st_df.std(numeric_only=True)
        # Numeric variables and categorical variables; every other column is unused
        self.num_item = num_item
        self.un_num_item = un_num_item
        # Build a lookup table that maps each categorical value (number or string) to an index
        un_num_item_list = {}
        for item in self.un_num_item:
            un_num_item_list[item]=list(set(st_df[item].values))
        self.un_num_item_list = un_num_item_list
        self.mode = mode
    def __getitem__(self, index):
        data=[]
        # Standardize; a missing value is treated as a raw value of 0 before standardizing
        for item in self.num_item:
            if np.isnan(self.df[item][index]):
                data.append((0-self.mean_df[item])/self.std_df[item])
            else:
                data.append((self.df[item][index]-self.mean_df[item])/self.std_df[item])
        emb_data = []
        # Map each categorical value to its index; unseen values become -1
        for item in self.un_num_item:
            try:
                if self.df[item][index] not in self.un_num_item_list[item]:
                    emb_data.append(-1)
                else:
                    emb_data.append(self.un_num_item_list[item].index(self.df[item][index]))
            except Exception:
                emb_data.append(-1)
        data = paddle.to_tensor(data).astype('float32')
        emb_data = paddle.to_tensor(emb_data).astype('float32')
        # In train mode return the label; otherwise return loan_id so each prediction can be matched to its record
        if self.mode == 'train':
            label = self.df['isDefault'][index]
        else:
            label = self.df['loan_id'][index]
        label = np.array(label).astype('int64')
        return data,emb_data,label
    def __len__(self):
        return len(self.df)
dataset=MyDateset('train_df.csv')
[data,emb_data,label] = dataset[0]
print(data.shape)
print(emb_data.shape)
print(label)
print('\n')
print(num_item)
[26]
[11]
0
['early_return_amount_3mon', 'early_return_amount', 'early_return', 'f1', 'early_day', 'title', 'recircle_b', 'scoring_low', 'scoring_high', 'f4', 'monthly_payment', 'post_code', 'total_loan', 'f3', 'recircle_u', 'year_of_loan', 'known_outstanding_loan', 'del_in_18month', 'f0', 'debt_loan_ratio', 'interest', 'f0xf3', 'risk', 'pub_dero_bankrup', 'known_dero', 'class']
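The two per-value transforms inside `__getitem__` can be isolated as small functions. This is a sketch under assumptions rather than the author's code: `standardize` and `category_index` are hypothetical names, and the vocabulary is sorted here so that indices are reproducible between runs, whereas the dataset class above builds it with `list(set(...))`.

```python
import math

def standardize(value, mean, std):
    """Z-score a value; a missing value is treated as a raw 0 first."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        value = 0.0
    return (value - mean) / std

def category_index(value, vocabulary):
    """Index of value in the vocabulary, or -1 if it was never seen."""
    try:
        return vocabulary.index(value)
    except ValueError:
        return -1

# Hypothetical category values, sorted for a stable index order.
vocabulary = sorted({"private", "government", "startup"})

print(standardize(12.0, mean=10.0, std=2.0))
print(standardize(float("nan"), mean=10.0, std=2.0))
print(category_index("private", vocabulary))
print(category_index("unknown", vocabulary))
```

A missing value ends up at `(0 - mean) / std`, not at 0, so it can land far from the centre for features with a large mean. Also, the iteration order of a set of strings can change between Python processes, so an index built with `list(set(...))` in one session may not match another session unless the order is fixed.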
# Inspect the per-class means of each variable
des
feature | pos | neg | diff |
---|---|---|---|
early_return_amount_3mon | 2.803791e+01 | 3.973950e+02 | -0.929446 |
early_return_amount | 2.054415e+02 | 2.572250e+03 | -0.920132 |
early_return | 1.663696e-01 | 1.518095e+00 | -0.890409 |
f1 | 1.303781e-03 | 1.445846e-03 | -0.098258 |
early_day | 4.226144e+01 | 4.620753e+01 | -0.085399 |
title | 1.687819e+03 | 1.832562e+03 | -0.078984 |
use | 1.645870e+00 | 1.786221e+00 | -0.078574 |
recircle_b | 1.550428e+04 | 1.675956e+04 | -0.074900 |
work_year | 5.919503e+00 | 6.091975e+00 | -0.028311 |
scoring_low | 6.548115e+02 | 6.659983e+02 | -0.016797 |
scoring_high | 7.638845e+02 | 7.765860e+02 | -0.016356 |
user_id | 2.251468e+05 | 2.252223e+05 | -0.000335 |
loan_id | 1.025147e+06 | 1.025222e+06 | -0.000074 |
policy_code | 1.000000e+00 | 1.000000e+00 | 0.000000 |
early_mon | 6.733214e+00 | 6.732596e+00 | 0.000092 |
f2 | 8.537608e+00 | 8.453426e+00 | 0.009958 |
f4 | 8.179408e+00 | 8.081345e+00 | 0.012134 |
monthly_payment | 4.421506e+02 | 4.359102e+02 | 0.014316 |
post_code | 2.607600e+02 | 2.568633e+02 | 0.015170 |
total_loan | 1.471829e+04 | 1.433815e+04 | 0.026513 |
initial_list_status | 4.236482e-01 | 4.121678e-01 | 0.027854 |
f3 | 1.506535e+01 | 1.457538e+01 | 0.033616 |
recircle_u | 5.563608e+01 | 5.321519e+01 | 0.045493 |
year_of_loan | 3.721331e+00 | 3.430684e+00 | 0.084720 |
app_type | 2.139037e-02 | 1.971865e-02 | 0.084779 |
known_outstanding_loan | 1.248544e+01 | 1.147493e+01 | 0.088063 |
del_in_18month | 3.428402e-01 | 3.052783e-01 | 0.123041 |
f0 | 6.338471e+00 | 5.557107e+00 | 0.140606 |
censor_status | 1.142008e+00 | 9.888181e-01 | 0.154923 |
house_exist | 6.922163e-01 | 5.960082e-01 | 0.161421 |
debt_loan_ratio | 2.008696e+01 | 1.701519e+01 | 0.180531 |
interest | 1.532975e+01 | 1.279642e+01 | 0.197971 |
f0xf3 | 1.163089e+02 | 9.522652e+01 | 0.221392 |
risk | 2.344960e+05 | 1.886230e+05 | 0.243199 |
pub_dero_bankrup | 1.759590e-01 | 1.315179e-01 | 0.337910 |
known_dero | 2.953060e-01 | 2.124564e-01 | 0.389960 |
class | 2.352941e+00 | 1.624985e+00 | 0.447977 |
region | 1.556744e-01 | 1.055669e-01 | 0.474651 |
isDefault | 1.000000e+00 | 0.000000e+00 | inf |
Building the network
Two fully connected layers generate the embedding for the categorical variables.
class MyNet(paddle.nn.Layer):
    def __init__(self):
        super(MyNet,self).__init__()
        # Branch for the numeric features
        self.fc = paddle.nn.Linear(in_features=len(num_item), out_features=512)
        # Branch for the categorical features (stands in for an embedding)
        self.emb1 = paddle.nn.Linear(in_features=len(un_num_item),out_features=2048)
        self.emb2 = paddle.nn.Linear(in_features=2048,out_features=512)
        # Layers applied to the concatenated 512 + 512 features
        self.emb3 = paddle.nn.Linear(in_features=1024,out_features=512)
        self.dropout = paddle.nn.Dropout(0.1)
        self.relu = paddle.nn.ReLU()
        self.out = paddle.nn.Linear(in_features=512,out_features=2)
    def forward(self,data,emb_data):
        x = self.fc(data)
        x = self.dropout(x)
        emb = self.emb1(emb_data)
        emb = self.emb2(emb)
        x = paddle.concat([x,emb],axis=-1)
        x = self.relu(x)
        x = self.emb3(x)
        x = self.out(x)
        # Returns class probabilities. Note that paddle's cross_entropy applies softmax again by default (use_softmax=True).
        x = paddle.nn.functional.softmax(x)
        return x
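For a sense of the model's size, the parameter count of each `Linear` layer can be worked out by hand as `in_features * out_features + out_features`. The sketch below assumes 26 numeric and 11 categorical inputs, as printed earlier; the helper name is illustrative.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

# (in_features, out_features) for each Linear layer of MyNet.
layers = {
    "fc": (26, 512),
    "emb1": (11, 2048),
    "emb2": (2048, 512),
    "emb3": (1024, 512),
    "out": (512, 2),
}

counts = {name: linear_params(i, o) for name, (i, o) in layers.items()}
total = sum(counts.values())

for name, n in counts.items():
    print(f"{name}: {n}")
print(f"total: {total}")
```

Most of the roughly 1.6 million parameters sit in `emb2` and `emb3`, which is large relative to 10,000 training rows. Since there is no activation between `emb1` and `emb2`, that pair is mathematically equivalent to a single linear map from 11 to 512 features.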
Training
# Build the data loader. You can try a larger batch size; if you do, the learning rate should be increased accordingly
train_dataset=MyDateset('train_df.csv')
batch_size = 2048
train_dataloader = paddle.io.DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True,
    drop_last=False)
# Build the model
model = MyNet()
# model_dict = paddle.load('model.pdparams')
# model.set_dict(model_dict)
model.train()
max_epoch=17
opt = paddle.optimizer.SGD(learning_rate=0.1, parameters=model.parameters())
# Train
now_step=0
for epoch in range(max_epoch):
    for step, data in enumerate(train_dataloader):
        now_step+=1
        data,emb_data, label = data
        pre = model(data,emb_data)
        # Class weights: 0.2 for negative samples, 1.2 for positive samples
        loss = paddle.nn.functional.cross_entropy(pre,label,weight=paddle.to_tensor([0.2,1.2]),reduction='mean')
        # loss = paddle.nn.functional.square_error_cost(pre,label.reshape([-1,1]).astype('float32'))
        # loss = paddle.mean(loss)
        loss.backward()
        opt.step()
        opt.clear_gradients()
        if now_step%1==0:
            print("epoch: {}, batch: {}, loss is: {}".format(epoch, step, loss.mean().numpy()))
# Save the model to model.pdparams
paddle.save(model.state_dict(), 'model.pdparams')
/opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages/paddle/fluid/dygraph/math_op_patch.py:253: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float32, but right dtype is paddle.bool, the right dtype will convert to paddle.float32
format(lhs_dtype, rhs_dtype, lhs_dtype))
epoch: 0, batch: 0, loss is: [0.72932494]
epoch: 0, batch: 1, loss is: [0.72809494]
epoch: 0, batch: 2, loss is: [0.67359775]
epoch: 0, batch: 3, loss is: [0.68661183]
epoch: 0, batch: 4, loss is: [0.75169533]
epoch: 1, batch: 0, loss is: [0.7372106]
epoch: 1, batch: 1, loss is: [0.6646998]
epoch: 1, batch: 2, loss is: [0.652986]
epoch: 1, batch: 3, loss is: [0.6788878]
epoch: 1, batch: 4, loss is: [0.6282849]
epoch: 2, batch: 0, loss is: [0.6557647]
epoch: 2, batch: 1, loss is: [0.6336642]
epoch: 2, batch: 2, loss is: [0.63801914]
epoch: 2, batch: 3, loss is: [0.6108661]
epoch: 2, batch: 4, loss is: [0.60893005]
epoch: 3, batch: 0, loss is: [0.5925074]
epoch: 3, batch: 1, loss is: [0.5883472]
epoch: 3, batch: 2, loss is: [0.58365494]
epoch: 3, batch: 3, loss is: [0.59151757]
epoch: 3, batch: 4, loss is: [0.57692784]
epoch: 4, batch: 0, loss is: [0.58431715]
epoch: 4, batch: 1, loss is: [0.56982553]
epoch: 4, batch: 2, loss is: [0.5645007]
epoch: 4, batch: 3, loss is: [0.57651067]
epoch: 4, batch: 4, loss is: [0.56192523]
epoch: 5, batch: 0, loss is: [0.5654892]
epoch: 5, batch: 1, loss is: [0.56473136]
epoch: 5, batch: 2, loss is: [0.5650625]
epoch: 5, batch: 3, loss is: [0.5576684]
epoch: 5, batch: 4, loss is: [0.5584635]
epoch: 6, batch: 0, loss is: [0.5546932]
epoch: 6, batch: 1, loss is: [0.55603355]
epoch: 6, batch: 2, loss is: [0.5493836]
epoch: 6, batch: 3, loss is: [0.5532869]
epoch: 6, batch: 4, loss is: [0.55402416]
epoch: 7, batch: 0, loss is: [0.5511932]
epoch: 7, batch: 1, loss is: [0.5557276]
epoch: 7, batch: 2, loss is: [0.55160713]
epoch: 7, batch: 3, loss is: [0.53901863]
epoch: 7, batch: 4, loss is: [0.5432972]
epoch: 8, batch: 0, loss is: [0.53537405]
epoch: 8, batch: 1, loss is: [0.52480084]
epoch: 8, batch: 2, loss is: [0.54991305]
epoch: 8, batch: 3, loss is: [0.5552637]
epoch: 8, batch: 4, loss is: [0.5439552]
epoch: 9, batch: 0, loss is: [0.53486353]
epoch: 9, batch: 1, loss is: [0.54527813]
epoch: 9, batch: 2, loss is: [0.5333212]
epoch: 9, batch: 3, loss is: [0.5366559]
epoch: 9, batch: 4, loss is: [0.53380215]
epoch: 10, batch: 0, loss is: [0.5307571]
epoch: 10, batch: 1, loss is: [0.5307473]
epoch: 10, batch: 2, loss is: [0.5341663]
epoch: 10, batch: 3, loss is: [0.53598076]
epoch: 10, batch: 4, loss is: [0.533747]
epoch: 11, batch: 0, loss is: [0.5351857]
epoch: 11, batch: 1, loss is: [0.5374662]
epoch: 11, batch: 2, loss is: [0.53001803]
epoch: 11, batch: 3, loss is: [0.5292998]
epoch: 11, batch: 4, loss is: [0.520213]
epoch: 12, batch: 0, loss is: [0.5229189]
epoch: 12, batch: 1, loss is: [0.53077817]
epoch: 12, batch: 2, loss is: [0.5234481]
epoch: 12, batch: 3, loss is: [0.52863]
epoch: 12, batch: 4, loss is: [0.5292127]
epoch: 13, batch: 0, loss is: [0.53497964]
epoch: 13, batch: 1, loss is: [0.5193575]
epoch: 13, batch: 2, loss is: [0.52314794]
epoch: 13, batch: 3, loss is: [0.5276453]
epoch: 13, batch: 4, loss is: [0.51771814]
epoch: 14, batch: 0, loss is: [0.50922066]
epoch: 14, batch: 1, loss is: [0.5228251]
epoch: 14, batch: 2, loss is: [0.5262174]
epoch: 14, batch: 3, loss is: [0.52728355]
epoch: 14, batch: 4, loss is: [0.5314889]
epoch: 15, batch: 0, loss is: [0.5179457]
epoch: 15, batch: 1, loss is: [0.53258806]
epoch: 15, batch: 2, loss is: [0.51814485]
epoch: 15, batch: 3, loss is: [0.5144637]
epoch: 15, batch: 4, loss is: [0.5230428]
epoch: 16, batch: 0, loss is: [0.528043]
epoch: 16, batch: 1, loss is: [0.5009919]
epoch: 16, batch: 2, loss is: [0.5170122]
epoch: 16, batch: 3, loss is: [0.5321321]
epoch: 16, batch: 4, loss is: [0.5218459]
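The shape of the log follows directly from the loader settings: 10,000 samples at a batch size of 2048 with `drop_last=False` give five batches per epoch (batch 0 to 4), the last one smaller. A quick check of that arithmetic:

```python
import math

n_samples = 10000
batch_size = 2048
max_epoch = 17

# drop_last=False keeps the final, smaller batch.
batches_per_epoch = math.ceil(n_samples / batch_size)
last_batch_size = n_samples - (batches_per_epoch - 1) * batch_size
total_steps = batches_per_epoch * max_epoch

print(f"batches per epoch: {batches_per_epoch}")
print(f"size of the last batch: {last_batch_size}")
print(f"optimizer steps in total: {total_steps}")
```

The whole run therefore consists of only 85 SGD updates, which helps explain why the loss is still drifting down slowly at the end of the log.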
Prediction
This loads the saved model that scored 0.86893; to test your own model, replace the model path accordingly.
Then submit the generated result.csv.
# Load the model and build the data loader
model = MyNet()
model_dict = paddle.load('model_best.pdparams')
# model_dict = paddle.load('model.pdparams')
model.set_dict(model_dict)
model.eval()
test_dataset=MyDateset('test_df.csv',mode = 'test')
test_dataloader = paddle.io.DataLoader(
    test_dataset,
    batch_size=1,
    shuffle=False,
    drop_last=False)
# Write the results to result.csv
result = []
for step, data in enumerate(test_dataloader):
    data ,emb_data, loan_id = data
    pre = model(data,emb_data)
    # Use the predicted probability of the positive class as the score
    result.append([loan_id.numpy()[0], pre[:,1].numpy()[0]])
    # result.append([loan_id.numpy()[0], np.argmax(pre.numpy())])
pd.DataFrame(result,columns=['id','isDefault']).to_csv('result.csv',index=None)
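Since the leaderboard metric is AUC, it can be useful to estimate it locally on a held-out slice of the training data before submitting. The function below is a minimal sketch of the pairwise definition of AUC, not the competition's scoring code; it is quadratic in the number of samples, which is fine for a small validation set.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy validation labels and predicted default probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(labels, scores))
```

Because AUC depends only on the ordering of the scores, submitting the positive-class probability is appropriate, while hard 0/1 labels from `argmax` would discard most of the ranking information.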
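Directions for improvement
With this little data a neural network cannot show its full strength, so traditional machine learning models tend to have the edge. A figure from Andrew Ng's open course on Coursera illustrates this point (figure not reproduced here).
Since this competition only allows deep learning models, better deep models are still welcome; cross-validation and model ensembling are worth trying to push the score higher.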
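One way to combine cross-validation and ensembling is to train one model per stratified fold and average their test predictions. The sketch below shows only the index splitting and the averaging step; both helpers are hypothetical, and the per-fold training loop is left out.

```python
import random

def stratified_folds(labels, n_folds=5, seed=1024):
    """Split sample indices into folds that keep the class ratio roughly equal."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin across folds.
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds

def average_predictions(prediction_lists):
    """Element-wise mean of several models' predictions for the same samples."""
    return [sum(col) / len(col) for col in zip(*prediction_lists)]

# Toy labels: 10 positives and 40 negatives.
labels = [1] * 10 + [0] * 40
folds = stratified_folds(labels)
print([sum(labels[i] for i in fold) for fold in folds])

# Toy predictions from two models for the same two samples.
print(average_predictions([[0.2, 0.8], [0.4, 0.6]]))
```

Each fold serves once as the validation set while the remaining folds are used for training, so every model is scored on data it has not seen.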
This article is a repost.
Original project link