Resources

⭐ ⭐ ⭐ A small Star in support is much appreciated! ⭐ ⭐ ⭐

Open source is not easy; thank you for your support!

1. Competition Introduction

The CCF Big Data & Computing Intelligence Contest (CCF BDCI) was founded by the China Computer Federation in 2013. Guided by the National Natural Science Foundation of China, it is a large-scale challenge covering algorithms, applications, and systems in big data and artificial intelligence. The contest gathers requirements from key industries and application domains, is oriented toward frontier technologies and real industry problems, aims to promote industry development and upgrading, and, in a crowdsourced fashion, pools the wisdom of academia, industry, and practitioners at home and abroad, discovering and cultivating a large number of high-quality data talents for society.
  The contest has been held eight times so far, attracting more than 120,000 participants from over 1,500 universities, 1,800 enterprises and institutions, and 80+ research institutes worldwide. It has become one of the most influential events in China's big data and AI field and the leading comprehensive big data competition brand in China.
  The ninth edition in 2021, themed "data drives innovation, competition pools intelligence", is based in Yuhang and open to the world, running from September to December. It focuses on solving real pain points and hard problems from government and enterprise scenarios, inviting outstanding teams worldwide to develop and exploit data resources and broadly soliciting IT application solutions.

1.1 Competition Task

The competition page is at https://www.datafountain.cn/competitions/530

The task provides an existing batch of credit data from a customer group slightly different from the target one, to assist in building the risk-control model for the target business. The two datasets share many fields but very few common users. The goal is to use transfer learning to capture the relationship between users' basic information and default behavior across the two businesses, and thereby predict user defaults in the new business.

2. Data Preprocessing

import warnings
import pandas as pd
# warnings.filterwarnings('ignore')
import numpy as np
import lightgbm
from sklearn import metrics
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddle.io import Dataset
!unzip -o data/data111908/个贷违约预测.zip -d data/
Archive:  data/data111908/个贷违约预测.zip
  inflating: data/submit_example.csv  
  inflating: data/__MACOSX/._submit_example.csv  
  inflating: data/test_public.csv    
  inflating: data/__MACOSX/._test_public.csv  
  inflating: data/__MACOSX/._train_dataset  
  inflating: data/train_dataset/train_public.csv  
  inflating: data/__MACOSX/train_dataset/._train_public.csv  
  inflating: data/train_dataset/train_internet.csv  
  inflating: data/__MACOSX/train_dataset/._train_internet.csv  
train_bank = pd.read_csv('./data/train_dataset/train_public.csv')
train_internet = pd.read_csv('./data/train_dataset/train_internet.csv')
train_bank.head()
(Output: the first five rows of train_bank. Columns include loan_id, user_id, total_loan, year_of_loan, interest, monthly_payment, class, employer_type, industry, work_year, ..., policy_code, f0 to f4, early_return, early_return_amount, early_return_amount_3mon, isDefault.)

5 rows × 39 columns

test = pd.read_csv('./data/test_public.csv')
# train_internet.rename(columns={'is_default':'isDefault'},inplace=True)
train_bank.rename(columns={'isDefault':'is_default'},inplace=True)
# columns shared by both datasets (order follows train_bank)
common_cols = [col for col in train_bank.columns if col in train_internet.columns]

print(train_bank.columns)
print(train_internet.columns)
print(len(common_cols))
Index(['loan_id', 'user_id', 'total_loan', 'year_of_loan', 'interest',
       'monthly_payment', 'class', 'employer_type', 'industry', 'work_year',
       'house_exist', 'censor_status', 'issue_date', 'use', 'post_code',
       'region', 'debt_loan_ratio', 'del_in_18month', 'scoring_low',
       'scoring_high', 'known_outstanding_loan', 'known_dero',
       'pub_dero_bankrup', 'recircle_b', 'recircle_u', 'initial_list_status',
       'app_type', 'earlies_credit_mon', 'title', 'policy_code', 'f0', 'f1',
       'f2', 'f3', 'f4', 'early_return', 'early_return_amount',
       'early_return_amount_3mon', 'is_default'],
      dtype='object')
Index(['loan_id', 'user_id', 'total_loan', 'year_of_loan', 'interest',
       'monthly_payment', 'class', 'sub_class', 'work_type', 'employer_type',
       'industry', 'work_year', 'house_exist', 'house_loan_status',
       'censor_status', 'marriage', 'offsprings', 'issue_date', 'use',
       'post_code', 'region', 'debt_loan_ratio', 'del_in_18month',
       'scoring_low', 'scoring_high', 'pub_dero_bankrup', 'early_return',
       'early_return_amount', 'early_return_amount_3mon', 'recircle_b',
       'recircle_u', 'initial_list_status', 'earlies_credit_mon', 'title',
       'policy_code', 'f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'is_default'],
      dtype='object')
36
print(len(train_bank.columns))
print(len(train_internet.columns))
39
42

train_bank_left = list(set(train_bank.columns) - set(common_cols))
train_internet_left = list(set(train_internet.columns) - set(common_cols))

train_bank_left
['known_dero', 'known_outstanding_loan', 'app_type']
train_internet_left
['marriage', 'work_type', 'sub_class', 'f5', 'house_loan_status', 'offsprings']
# use .copy() so later column assignments do not trigger SettingWithCopyWarning
train1_data = train_internet[common_cols].copy()
train2_data = train_bank[common_cols].copy()
test_data = test[common_cols[:-1]].copy()
import datetime

# Date column: issue_date; convert to pandas datetime
train1_data['issue_date'] = pd.to_datetime(train1_data['issue_date'])
# Extract multi-scale features
train1_data['issue_date_y'] = train1_data['issue_date'].dt.year
train1_data['issue_date_m'] = train1_data['issue_date'].dt.month
# Time difference from a reference date, in days
base_time = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
train1_data['issue_date_diff'] = (train1_data['issue_date'] - base_time).dt.days
train1_data[['issue_date', 'issue_date_y', 'issue_date_m', 'issue_date_diff']]
train1_data.drop('issue_date', axis = 1, inplace = True)
# Date column: issue_date; convert to pandas datetime
train2_data['issue_date'] = pd.to_datetime(train2_data['issue_date'])
# Extract multi-scale features
train2_data['issue_date_y'] = train2_data['issue_date'].dt.year
train2_data['issue_date_m'] = train2_data['issue_date'].dt.month
# Time difference from the reference date, in days
base_time = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
train2_data['issue_date_diff'] = (train2_data['issue_date'] - base_time).dt.days
train2_data[['issue_date', 'issue_date_y', 'issue_date_m', 'issue_date_diff']]
train2_data.drop('issue_date', axis = 1, inplace = True)
train2_data.head()
(Output: the first five rows of train2_data, now with issue_date replaced by issue_date_y, issue_date_m, and issue_date_diff.)

5 rows × 38 columns

employer_type = train1_data['employer_type'].value_counts().index
industry = train1_data['industry'].value_counts().index
emp_type_dict = dict(zip(employer_type, [0,1,2,3,4,5]))
industry_dict = dict(zip(industry, [i for i in range(15)]))
train1_data['work_year'].fillna('10+ years', inplace=True)
train2_data['work_year'].fillna('10+ years', inplace=True)

work_year_map = {'10+ years': 10, '2 years': 2, '< 1 year': 0, '3 years': 3, '1 year': 1,
     '5 years': 5, '4 years': 4, '6 years': 6, '8 years': 8, '7 years': 7, '9 years': 9}
train1_data['work_year']  = train1_data['work_year'].map(work_year_map)
train2_data['work_year']  = train2_data['work_year'].map(work_year_map)

train1_data['class'] = train1_data['class'].map({'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6})
train2_data['class'] = train2_data['class'].map({'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6})

train1_data['employer_type'] = train1_data['employer_type'].map(emp_type_dict)
train2_data['employer_type'] = train2_data['employer_type'].map(emp_type_dict)

train1_data['industry'] = train1_data['industry'].map(industry_dict)
train2_data['industry'] = train2_data['industry'].map(industry_dict)
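One caveat with the encodings above: emp_type_dict and industry_dict are built from train1_data's value_counts ordering alone, so a category that appears only in the bank or test split would map to NaN. A small sketch of building the mapping from the union of all splits instead (the toy series here stand in for the real employer_type columns):

```python
import pandas as pd

# Toy stand-ins for the employer_type column of two different splits
s_train1 = pd.Series(['政府机构', '世界五百强', '政府机构'])
s_train2 = pd.Series(['世界五百强', '民营企业'])

# Build one mapping from the union of all splits so every category is covered
categories = pd.concat([s_train1, s_train2]).value_counts().index
mapping = {cat: i for i, cat in enumerate(categories)}

# No category falls outside the mapping, so .map() produces no NaN
print(s_train2.map(mapping).isna().sum())  # → 0
```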
# Date column: issue_date; convert to pandas datetime
test_data['issue_date'] = pd.to_datetime(test_data['issue_date'])
# Extract multi-scale features
test_data['issue_date_y'] = test_data['issue_date'].dt.year
test_data['issue_date_m'] = test_data['issue_date'].dt.month
# Time difference from the reference date, in days
base_time = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
test_data['issue_date_diff'] = (test_data['issue_date'] - base_time).dt.days
test_data[['issue_date', 'issue_date_y', 'issue_date_m', 'issue_date_diff']]
test_data.drop('issue_date', axis = 1, inplace = True)
test_data['work_year'].fillna('10+ years', inplace=True)

work_year_map = {'10+ years': 10, '2 years': 2, '< 1 year': 0, '3 years': 3, '1 year': 1,
     '5 years': 5, '4 years': 4, '6 years': 6, '8 years': 8, '7 years': 7, '9 years': 9}
test_data['work_year']  = test_data['work_year'].map(work_year_map)
test_data['class'] = test_data['class'].map({'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6})
test_data['employer_type'] = test_data['employer_type'].map(emp_type_dict)
test_data['industry'] = test_data['industry'].map(industry_dict)
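The issue_date transforms above are written out three times, once each for train1_data, train2_data, and test_data; they could be collected into a single helper. A sketch (the name add_date_features is mine, not from the notebook):

```python
import datetime
import pandas as pd

BASE_TIME = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')

def add_date_features(df):
    """Replace issue_date with year, month, and day-offset features."""
    df = df.copy()  # avoid mutating the caller's frame
    df['issue_date'] = pd.to_datetime(df['issue_date'])
    df['issue_date_y'] = df['issue_date'].dt.year
    df['issue_date_m'] = df['issue_date'].dt.month
    df['issue_date_diff'] = (df['issue_date'] - BASE_TIME).dt.days
    return df.drop('issue_date', axis=1)

# demo on two dates taken from the head() output above
demo = add_date_features(pd.DataFrame({'issue_date': ['2016-10-01', '2013-06-01']}))
print(demo)
```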
 train1_data['is_default']
0         1
1         0
2         0
3         0
4         0
         ..
749995    0
749996    0
749997    0
749998    0
749999    0
Name: is_default, Length: 750000, dtype: int64

3. Traditional Machine Learning: the LightGBM Model

X_train1 = train1_data.drop(['is_default','earlies_credit_mon','loan_id','user_id'], axis = 1, inplace = False)
y_train1 = train1_data['is_default']

X_train2 = train2_data.drop(['is_default','earlies_credit_mon','loan_id','user_id'], axis = 1, inplace = False)
y_train2 = train2_data['is_default']

X_train = pd.concat([X_train1, X_train2])
y_train = pd.concat([y_train1, y_train2])

X_test = test_data.drop(['earlies_credit_mon','loan_id','user_id'], axis = 1, inplace = False)

# Train a LightGBM regressor on the concatenated internet + bank data
clf_ex = lightgbm.LGBMRegressor(n_estimators = 200)
clf_ex.fit(X = X_train, y = y_train)
clf_ex.booster_.save_model('LGBMmode.txt')
pred = clf_ex.predict(X_test)
print(pred[:5])
# threshold the raw regression scores into 0/1 labels
pred_label = [0 if item < 0.5 else 1 for item in pred]
[-0.01169305  0.02755907 -0.00791106  0.06182464 -0.01506804]
# submission
submission = pd.DataFrame({'id':test['loan_id'], 'isDefault':pred_label})
submission.to_csv('submission.csv', index = None)

4. Deep Neural Network

# Feature standardization statistics
mean_px = X_train.mean().astype(np.float32)
std_px = X_train.std().astype(np.float32)

# Fill missing values before standardizing
X_train.fillna(0, inplace = True)
X_test.fillna(0, inplace = True)

# Standardize, then convert to numpy arrays for the network
# (the small epsilon guards against zero-variance columns)
X_train_NN = ((X_train - mean_px) / (std_px + 1e-6)).values.astype('float32')
y_train_NN = y_train.values.astype('int64')

X_test_NN = ((X_test - mean_px) / (std_px + 1e-6)).values.astype('float32')
print(X_train_NN[0].shape)
print(y_train_NN[0].shape)
# print(y_train_NN)
print(type(y_train_NN))
(34,)
()
<class 'numpy.ndarray'>
# Dataset wrapping the feature matrix and labels
class CreditDataset(Dataset):
    def __init__(self, X, Y):
        self.inputs = X
        self.labels = Y

    def __getitem__(self, idx):
        features = self.inputs[idx]
        label = self.labels[idx]
        return features, label

    def __len__(self):
        return self.inputs.shape[0]

trainDataset=CreditDataset(X_train_NN,y_train_NN)
for i in range(len(trainDataset)):
    print(trainDataset[i])
    break
# print(trainDataset[759570])

(array([1.200e+04, 5.000e+00, 1.153e+01, 2.641e+02, 1.000e+00, 0.000e+00,
       1.000e+01, 1.000e+01, 0.000e+00, 2.000e+00, 0.000e+00, 8.140e+02,
       4.000e+00, 5.070e+00, 1.000e+00, 6.700e+02, 6.740e+02, 1.000e+00,
       3.855e+03, 2.310e+01, 0.000e+00, 0.000e+00, 1.000e+00, 1.000e+00,
       0.000e+00, 8.000e+00, 1.700e+01, 8.000e+00, 0.000e+00, 0.000e+00,
       0.000e+00, 2.015e+03, 6.000e+00, 2.922e+03], dtype=float32), 1)
(array([7.0909092e+03, 3.0000000e+00, 1.8990000e+01, 2.1991000e+02,
       3.0000000e+00, 2.0000000e+00, 0.0000000e+00, 2.0000000e+00,
       1.0000000e+00, 2.0000000e+00, 0.0000000e+00, 4.4600000e+02,
       3.6000000e+01, 1.9420000e+01, 0.0000000e+00, 5.6454547e+02,
       6.5863635e+02, 0.0000000e+00, 5.0757690e+03, 3.8230770e+01,
       1.0000000e+00, 0.0000000e+00, 1.0000000e+00, 5.0000000e+00,
       0.0000000e+00, 1.0000000e+00, 7.0000000e+00, 6.0000000e+00,
       2.0000000e+00, 9.8000000e+01, 9.8000002e+00, 2.0170000e+03,
       3.0000000e+00, 3.5610000e+03], dtype=float32), 0)
print(X_train_NN.shape)
print(y_train_NN.shape)
print(type(X_train_NN))
(760000, 34)
(760000,)
<class 'numpy.ndarray'>
IMAGE_SIZE = 34  # number of input features
CLASS_NUM = 2    # binary classification: default / no default
class SimpleNet(nn.Layer):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.nn1=nn.Linear(IMAGE_SIZE,256)
        self.norm1=nn.LayerNorm(256)
        self.dropout=nn.Dropout(0.3)
        self.nn2=nn.Linear(256,64)
        self.norm2=nn.LayerNorm(64)
        self.dropout1=nn.Dropout(0.3)
        self.fc = nn.Linear(64, CLASS_NUM)

    def forward(self, inputs):
        x=self.nn1(inputs)
        x=self.norm1(x)
        x=self.dropout(x)

        x=self.nn2(x)
        x=self.norm2(x)
        x=self.dropout1(x)
        return self.fc(x)
model=paddle.Model(SimpleNet())
lr=0.001
criterion = paddle.nn.CrossEntropyLoss()
metric = paddle.metric.Accuracy()
optimizer = paddle.optimizer.Adam(
        parameters=model.parameters(), learning_rate=lr)
model.prepare(optimizer, criterion, metric)
# Starts training and evaluating.
epochs=3
save_dir='checkpoint'
model.fit(trainDataset,
        epochs=epochs,
        batch_size=1024,
        save_dir=save_dir,
        log_freq=100)
The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/3
step 100/743 - loss: 0.4971 - acc: 0.7757 - 13ms/step
step 200/743 - loss: 0.4908 - acc: 0.7881 - 12ms/step
step 300/743 - loss: 0.5055 - acc: 0.7918 - 11ms/step
step 400/743 - loss: 0.4924 - acc: 0.7944 - 11ms/step
step 500/743 - loss: 0.4959 - acc: 0.7959 - 11ms/step
step 600/743 - loss: 0.4893 - acc: 0.7968 - 11ms/step
step 700/743 - loss: 0.4762 - acc: 0.7972 - 11ms/step
step 743/743 - loss: 0.4367 - acc: 0.7975 - 11ms/step
save checkpoint at /home/aistudio/checkpoint/0
Epoch 2/3
step 100/743 - loss: 0.5235 - acc: 0.7993 - 12ms/step
step 200/743 - loss: 0.4688 - acc: 0.8005 - 11ms/step
step 300/743 - loss: 0.4640 - acc: 0.8006 - 11ms/step
step 400/743 - loss: 0.4847 - acc: 0.8004 - 11ms/step
step 500/743 - loss: 0.4715 - acc: 0.8002 - 12ms/step
step 600/743 - loss: 0.4957 - acc: 0.8003 - 12ms/step
step 700/743 - loss: 0.4752 - acc: 0.8004 - 11ms/step
step 743/743 - loss: 0.4061 - acc: 0.8005 - 11ms/step
save checkpoint at /home/aistudio/checkpoint/1
Epoch 3/3
step 100/743 - loss: 0.4701 - acc: 0.7978 - 13ms/step
step 200/743 - loss: 0.4428 - acc: 0.8001 - 11ms/step
step 300/743 - loss: 0.4475 - acc: 0.8008 - 12ms/step
step 400/743 - loss: 0.4562 - acc: 0.8010 - 12ms/step
step 500/743 - loss: 0.4465 - acc: 0.8012 - 11ms/step
step 600/743 - loss: 0.4536 - acc: 0.8011 - 11ms/step
step 700/743 - loss: 0.4485 - acc: 0.8010 - 11ms/step
step 743/743 - loss: 0.4768 - acc: 0.8011 - 11ms/step
save checkpoint at /home/aistudio/checkpoint/2
save checkpoint at /home/aistudio/checkpoint/final
# dummy labels so the test set fits the CreditDataset interface
y_test_NN = np.ones(X_test_NN.shape[0])
testDataset = CreditDataset(X_test_NN, y_test_NN)
result = model.predict(testDataset, batch_size=len(testDataset))
Predict begin...
step 1/1 [==============================] - 18ms/step
Predict samples: 5000
# print(result[:5])
print(result[0][0])
idx = np.argmax(np.array(result[0][0]),axis=1)
pred_NN = idx.tolist()
print(pred_NN[:10])
[[ 3.533115  -3.2123508]
 [ 3.7103195 -2.7369869]
 [ 3.7547517 -2.5327196]
 ...
 [ 3.585775  -3.1076064]
 [-0.7036233 -3.4341366]
 [ 3.670831  -2.8589182]]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
submission = pd.DataFrame({'id':test['loan_id'], 'isDefault':pred_NN})
submission.to_csv('submission_nn.csv', index = None)
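The argmax above collapses the network's logits into hard 0/1 labels. If probability scores are preferred, a softmax over the two logits recovers them; a small numpy sketch using logit values shaped like the printed output:

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax along the last axis
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# two rows shaped like the (N, CLASS_NUM) logits printed above
logits = np.array([[3.53, -3.21],
                   [-0.70, -3.43]])
proba_default = softmax(logits)[:, 1]  # probability of class 1 (default)
print(proba_default)
```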

5. Ideas for Improvement

1. This competition leans toward data mining, so a good feature goes a long way; spend time analyzing the data itself.

2. Tune the model's hyperparameters: batch size, learning rate, and so on.

3. Apply common tricks: ensembling, k-fold cross-validation, and the like.

6. More from PaddleEdu

1. PaddleEdu's one-stop deep-learning online encyclopedia, awesome-DeepLearning, has more in store; stay tuned for:

  • Introductory deep learning course
  • Deep learning: 100 questions
  • Featured courses
  • Industry practice

If you run into any problem while using PaddleEdu, feel free to open an issue at awesome-DeepLearning; for more deep learning material, see the PaddlePaddle deep learning platform.

Remember to give it a Star ⭐ ~

2. PaddlePaddle PaddleEdu QQ group

More than 2,000 learners are already in the QQ group; you are welcome to scan the code and join.


