[Paddle Competition] Personal Loan Default Prediction Baseline+ 0.607
Resources
⭐ ⭐ ⭐ A small Star would be greatly appreciated! ⭐ ⭐ ⭐
Open source is not easy; thank you for your support~

- For more transformer models in CV and NLP (BERT, ERNIE, ViT, DeiT, Swin Transformer, etc.) and other deep learning material, see: awesome-DeepLearning
- For material on the Paddle framework, see: the PaddlePaddle deep learning platform
1. Competition overview
The CCF Big Data & Computing Intelligence Contest (CCF BDCI) was founded by the China Computer Federation in 2013. Guided by the National Natural Science Foundation of China, it is a large-scale challenge event for algorithms, applications, and systems in big data and artificial intelligence. The contest collects problem statements from key industries and application domains, is oriented toward cutting-edge technology and real industry problems, aims to promote industry development and upgrading, and, through crowdsourcing and collective intelligence, brings together expertise from academia, industry, and research at home and abroad, discovering and cultivating a large number of high-quality data talents for society.
The contest has now been held successfully eight times, attracting over 120,000 participants from more than 1,500 universities, 1,800 enterprises and public institutions, and 80+ research institutes worldwide. It has become one of the most influential events in China's big data and AI field and the leading comprehensive big data competition brand in China.
The ninth contest in 2021, themed "data-driven innovation, competition-fostered intelligence", is based in Yuhang and open to the world, running from September to December. It focuses on solving real pain points from government and enterprise scenarios, inviting outstanding teams worldwide to develop and utilize data resources and broadly soliciting information technology application solutions.
1.1 Task
The competition page is https://www.datafountain.cn/competitions/530
The task provides a second batch of credit data whose population differs slightly from the target customer group, to assist in building the risk-control model for the target business. The two datasets share a large number of fields but very few common users. Participants are expected to use transfer learning to capture the relationship between users' basic information and default behavior across the two businesses, and thereby predict user defaults for the new business.
2. Data preprocessing
import warnings
import pandas as pd
# warnings.filterwarnings('ignore')
import numpy as np
import lightgbm
from sklearn import metrics
import paddle
import paddle.nn as nn
import paddle.nn.functional as F
from paddle.io import Dataset
!unzip -o data/data111908/个贷违约预测.zip -d data/
Archive: data/data111908/个贷违约预测.zip
inflating: data/submit_example.csv
inflating: data/__MACOSX/._submit_example.csv
inflating: data/test_public.csv
inflating: data/__MACOSX/._test_public.csv
inflating: data/__MACOSX/._train_dataset
inflating: data/train_dataset/train_public.csv
inflating: data/__MACOSX/train_dataset/._train_public.csv
inflating: data/train_dataset/train_internet.csv
inflating: data/__MACOSX/train_dataset/._train_internet.csv
train_bank = pd.read_csv('./data/train_dataset/train_public.csv')
train_internet = pd.read_csv('./data/train_dataset/train_internet.csv')
train_bank.head()
 | loan_id | user_id | total_loan | year_of_loan | interest | monthly_payment | class | employer_type | industry | work_year | ... | policy_code | f0 | f1 | f2 | f3 | f4 | early_return | early_return_amount | early_return_amount_3mon | isDefault
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1040418 | 240418 | 31818.18182 | 3 | 11.466 | 1174.91 | C | 政府机构 | 金融业 | 3 years | ... | 1 | 1.0 | 0.0 | 4.0 | 5.0 | 4.0 | 3 | 9927 | 0.0 | 0 |
1 | 1025197 | 225197 | 28000.00000 | 5 | 16.841 | 670.69 | C | 政府机构 | 金融业 | 10+ years | ... | 1 | 7.0 | 0.0 | 4.0 | 45.0 | 22.0 | 0 | 0 | 0.0 | 0 |
2 | 1009360 | 209360 | 17272.72727 | 3 | 8.900 | 603.32 | A | 政府机构 | 公共服务、社会组织 | 10+ years | ... | 1 | 6.0 | 0.0 | 6.0 | 28.0 | 19.0 | 0 | 0 | 0.0 | 0 |
3 | 1039708 | 239708 | 20000.00000 | 3 | 4.788 | 602.30 | A | 世界五百强 | 文化和体育业 | 6 years | ... | 1 | 5.0 | 0.0 | 10.0 | 15.0 | 9.0 | 0 | 0 | 0.0 | 0 |
4 | 1027483 | 227483 | 15272.72727 | 3 | 12.790 | 470.31 | C | 政府机构 | 信息传输、软件和信息技术服务业 | < 1 year | ... | 1 | 10.0 | 0.0 | 6.0 | 15.0 | 4.0 | 0 | 0 | 0.0 | 0 |
5 rows × 39 columns
test = pd.read_csv('./data/test_public.csv')
# train_internet.rename(columns={'is_default':'isDefault'},inplace=True)
train_bank.rename(columns={'isDefault':'is_default'},inplace=True)
# Columns shared by both training sets
common_cols = [col for col in train_bank.columns if col in train_internet.columns]
print(train_bank.columns)
print(train_internet.columns)
print(len(common_cols))
Index(['loan_id', 'user_id', 'total_loan', 'year_of_loan', 'interest',
'monthly_payment', 'class', 'employer_type', 'industry', 'work_year',
'house_exist', 'censor_status', 'issue_date', 'use', 'post_code',
'region', 'debt_loan_ratio', 'del_in_18month', 'scoring_low',
'scoring_high', 'known_outstanding_loan', 'known_dero',
'pub_dero_bankrup', 'recircle_b', 'recircle_u', 'initial_list_status',
'app_type', 'earlies_credit_mon', 'title', 'policy_code', 'f0', 'f1',
'f2', 'f3', 'f4', 'early_return', 'early_return_amount',
'early_return_amount_3mon', 'is_default'],
dtype='object')
Index(['loan_id', 'user_id', 'total_loan', 'year_of_loan', 'interest',
'monthly_payment', 'class', 'sub_class', 'work_type', 'employer_type',
'industry', 'work_year', 'house_exist', 'house_loan_status',
'censor_status', 'marriage', 'offsprings', 'issue_date', 'use',
'post_code', 'region', 'debt_loan_ratio', 'del_in_18month',
'scoring_low', 'scoring_high', 'pub_dero_bankrup', 'early_return',
'early_return_amount', 'early_return_amount_3mon', 'recircle_b',
'recircle_u', 'initial_list_status', 'earlies_credit_mon', 'title',
'policy_code', 'f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'is_default'],
dtype='object')
36
print(len(train_bank.columns))
print(len(train_internet.columns))
39
42
train_bank_left = list(set(list(train_bank.columns)) - set(common_cols))
train_internet_left = list(set(list(train_internet.columns)) - set(common_cols))
train_bank_left
['known_dero', 'known_outstanding_loan', 'app_type']
train_internet_left
['marriage', 'work_type', 'sub_class', 'f5', 'house_loan_status', 'offsprings']
# .copy() so later column assignments do not trigger SettingWithCopyWarning
train1_data = train_internet[common_cols].copy()
train2_data = train_bank[common_cols].copy()
test_data = test[common_cols[:-1]].copy()
import datetime
# Date column: issue_date
# Convert to pandas datetime
train1_data['issue_date'] = pd.to_datetime(train1_data['issue_date'])
# Extract multi-scale features
train1_data['issue_date_y'] = train1_data['issue_date'].dt.year
train1_data['issue_date_m'] = train1_data['issue_date'].dt.month
# Extract the time difference
# Set the reference date
base_time = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
# Convert to days
train1_data['issue_date_diff'] = train1_data['issue_date'].apply(lambda x: x - base_time).dt.days
train1_data[['issue_date', 'issue_date_y', 'issue_date_m', 'issue_date_diff']]
train1_data.drop('issue_date', axis = 1, inplace = True)
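The same issue_date processing is applied to train1_data, train2_data, and test_data. A small helper (hypothetical name `add_issue_date_features`) would keep the three call sites consistent; a sketch:

```python
import datetime
import pandas as pd

def add_issue_date_features(df, base='2007-06-01'):
    """Derive year/month and day-offset features from issue_date, then drop it."""
    df = df.copy()  # work on a copy to avoid SettingWithCopyWarning on slices
    df['issue_date'] = pd.to_datetime(df['issue_date'])
    base_time = datetime.datetime.strptime(base, '%Y-%m-%d')
    df['issue_date_y'] = df['issue_date'].dt.year
    df['issue_date_m'] = df['issue_date'].dt.month
    df['issue_date_diff'] = (df['issue_date'] - base_time).dt.days
    return df.drop('issue_date', axis=1)

# Two dates taken from the table shown earlier
demo = pd.DataFrame({'issue_date': ['2016-10-01', '2013-06-01']})
out = add_issue_date_features(demo)
print(out)  # issue_date_diff: 3410 and 2192 days, matching the table above
```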
# Date column: issue_date
# Convert to pandas datetime
train2_data['issue_date'] = pd.to_datetime(train2_data['issue_date'])
# Extract multi-scale features
train2_data['issue_date_y'] = train2_data['issue_date'].dt.year
train2_data['issue_date_m'] = train2_data['issue_date'].dt.month
# Extract the time difference
# Set the reference date
base_time = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
# Convert to days
train2_data['issue_date_diff'] = train2_data['issue_date'].apply(lambda x: x - base_time).dt.days
train2_data[['issue_date', 'issue_date_y', 'issue_date_m', 'issue_date_diff']]
train2_data.drop('issue_date', axis = 1, inplace = True)
train2_data.head()
 | loan_id | user_id | total_loan | year_of_loan | interest | monthly_payment | class | employer_type | industry | work_year | ... | f2 | f3 | f4 | early_return | early_return_amount | early_return_amount_3mon | is_default | issue_date_y | issue_date_m | issue_date_diff
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1040418 | 240418 | 31818.18182 | 3 | 11.466 | 1174.91 | C | 政府机构 | 金融业 | 3 years | ... | 4.0 | 5.0 | 4.0 | 3 | 9927 | 0.0 | 0 | 2016 | 10 | 3410 |
1 | 1025197 | 225197 | 28000.00000 | 5 | 16.841 | 670.69 | C | 政府机构 | 金融业 | 10+ years | ... | 4.0 | 45.0 | 22.0 | 0 | 0 | 0.0 | 0 | 2013 | 6 | 2192 |
2 | 1009360 | 209360 | 17272.72727 | 3 | 8.900 | 603.32 | A | 政府机构 | 公共服务、社会组织 | 10+ years | ... | 6.0 | 28.0 | 19.0 | 0 | 0 | 0.0 | 0 | 2014 | 1 | 2406 |
3 | 1039708 | 239708 | 20000.00000 | 3 | 4.788 | 602.30 | A | 世界五百强 | 文化和体育业 | 6 years | ... | 10.0 | 15.0 | 9.0 | 0 | 0 | 0.0 | 0 | 2015 | 7 | 2952 |
4 | 1027483 | 227483 | 15272.72727 | 3 | 12.790 | 470.31 | C | 政府机构 | 信息传输、软件和信息技术服务业 | < 1 year | ... | 6.0 | 15.0 | 4.0 | 0 | 0 | 0.0 | 0 | 2016 | 7 | 3318 |
5 rows × 38 columns
employer_type = train1_data['employer_type'].value_counts().index
industry = train1_data['industry'].value_counts().index
emp_type_dict = dict(zip(employer_type, [0,1,2,3,4,5]))
industry_dict = dict(zip(industry, [i for i in range(15)]))
train1_data['work_year'].fillna('10+ years', inplace=True)
train2_data['work_year'].fillna('10+ years', inplace=True)
work_year_map = {'10+ years': 10, '2 years': 2, '< 1 year': 0, '3 years': 3, '1 year': 1,
'5 years': 5, '4 years': 4, '6 years': 6, '8 years': 8, '7 years': 7, '9 years': 9}
train1_data['work_year'] = train1_data['work_year'].map(work_year_map)
train2_data['work_year'] = train2_data['work_year'].map(work_year_map)
train1_data['class'] = train1_data['class'].map({'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6})
train2_data['class'] = train2_data['class'].map({'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6})
train1_data['employer_type'] = train1_data['employer_type'].map(emp_type_dict)
train2_data['employer_type'] = train2_data['employer_type'].map(emp_type_dict)
train1_data['industry'] = train1_data['industry'].map(industry_dict)
train2_data['industry'] = train2_data['industry'].map(industry_dict)
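One caveat with mappings built from one split's value_counts: a category that appears only in another split maps to NaN. A minimal sketch with hypothetical values, filling unseen categories with a sentinel:

```python
import pandas as pd

# Hypothetical values: '幼儿园' does not appear in the mapping
s = pd.Series(['政府机构', '世界五百强', '政府机构', '幼儿园'])
emp_type_dict = {'政府机构': 0, '世界五百强': 1}
encoded = s.map(emp_type_dict)                # unmapped categories become NaN
encoded = encoded.fillna(-1).astype('int64')  # sentinel for unseen categories
print(encoded.tolist())  # [0, 1, 0, -1]
```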
# Date column: issue_date
# Convert to pandas datetime
test_data['issue_date'] = pd.to_datetime(test_data['issue_date'])
# Extract multi-scale features
test_data['issue_date_y'] = test_data['issue_date'].dt.year
test_data['issue_date_m'] = test_data['issue_date'].dt.month
# Extract the time difference
# Set the reference date
base_time = datetime.datetime.strptime('2007-06-01', '%Y-%m-%d')
# Convert to days
test_data['issue_date_diff'] = test_data['issue_date'].apply(lambda x: x - base_time).dt.days
test_data[['issue_date', 'issue_date_y', 'issue_date_m', 'issue_date_diff']]
test_data.drop('issue_date', axis = 1, inplace = True)
test_data['work_year'].fillna('10+ years', inplace=True)
work_year_map = {'10+ years': 10, '2 years': 2, '< 1 year': 0, '3 years': 3, '1 year': 1,
'5 years': 5, '4 years': 4, '6 years': 6, '8 years': 8, '7 years': 7, '9 years': 9}
test_data['work_year'] = test_data['work_year'].map(work_year_map)
test_data['class'] = test_data['class'].map({'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4, 'F': 5, 'G': 6})
test_data['employer_type'] = test_data['employer_type'].map(emp_type_dict)
test_data['industry'] = test_data['industry'].map(industry_dict)
train1_data['is_default']
0 1
1 0
2 0
3 0
4 0
..
749995 0
749996 0
749997 0
749998 0
749999 0
Name: is_default, Length: 750000, dtype: int64
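The label series above is dominated by zeros, so it is worth quantifying the class balance before choosing a decision threshold. A toy sketch:

```python
import pandas as pd

# Toy stand-in for train1_data['is_default']
labels = pd.Series([1, 0, 0, 0, 0, 0, 1, 0])
counts = labels.value_counts(normalize=True)  # fraction of each class
print(counts)
```

With a heavy imbalance, a fixed 0.5 threshold on a regressor's scores (as used below) tends to predict almost no defaults.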
3. Classic machine learning: LightGBM
X_train1 = train1_data.drop(['is_default','earlies_credit_mon','loan_id','user_id'], axis = 1, inplace = False)
y_train1 = train1_data['is_default']
X_train2 = train2_data.drop(['is_default','earlies_credit_mon','loan_id','user_id'], axis = 1, inplace = False)
y_train2 = train2_data['is_default']
X_train = pd.concat([X_train1, X_train2])
y_train = pd.concat([y_train1, y_train2])
X_test = test_data.drop(['earlies_credit_mon','loan_id','user_id'], axis = 1, inplace = False)
# Fit LightGBM on the combined internet + bank training data
# (a regressor is used here, so its raw scores are thresholded at 0.5 below)
clf_ex = lightgbm.LGBMRegressor(n_estimators = 200)
clf_ex.fit(X = X_train, y = y_train)
clf_ex.booster_.save_model('LGBMmode.txt')
pred = clf_ex.predict(X_test)
print(pred[:5])
pred_label = [0 if (item < 0.5) else 1 for item in pred]
[-0.01169305 0.02755907 -0.00791106 0.06182464 -0.01506804]
# submission
submission = pd.DataFrame({'id':test['loan_id'], 'isDefault':pred_label})
submission.to_csv('submission.csv', index = None)
4. Deep neural network
# Standardization statistics from the training features
mean_px = X_train.mean().astype(np.float32)
std_px = X_train.std().astype(np.float32)
def standardize(x):
    return (x - mean_px) / std_px
# Fill missing values
X_train.fillna(0, inplace = True)
X_test.fillna(0, inplace = True)
X_train_NN = standardize(X_train)
X_test_NN = standardize(X_test)
X_train_NN = X_train_NN.values.astype('float32')  # standardized feature matrix
y_train_NN = y_train.values.astype('int64')
X_test_NN = X_test_NN.values.astype('float32')
print(X_train_NN[0].shape)
print(y_train_NN[0].shape)
# print(y_train_NN)
print(type(y_train_NN))
(34,)
()
<class 'numpy.ndarray'>
# Dataset wrapping the feature matrix and labels
class CreditDataset(Dataset):
    def __init__(self, X, Y):
        self.inputs = X
        self.labels = Y

    def __getitem__(self, idx):
        features = self.inputs[idx]
        label = self.labels[idx]
        return features, label

    def __len__(self):
        return self.inputs.shape[0]
trainDataset=CreditDataset(X_train_NN,y_train_NN)
for i in range(len(trainDataset)):
    print(trainDataset[i])
    break
# print(trainDataset[759570])
(array([1.200e+04, 5.000e+00, 1.153e+01, 2.641e+02, 1.000e+00, 0.000e+00,
1.000e+01, 1.000e+01, 0.000e+00, 2.000e+00, 0.000e+00, 8.140e+02,
4.000e+00, 5.070e+00, 1.000e+00, 6.700e+02, 6.740e+02, 1.000e+00,
3.855e+03, 2.310e+01, 0.000e+00, 0.000e+00, 1.000e+00, 1.000e+00,
0.000e+00, 8.000e+00, 1.700e+01, 8.000e+00, 0.000e+00, 0.000e+00,
0.000e+00, 2.015e+03, 6.000e+00, 2.922e+03], dtype=float32), 1)
(array([7.0909092e+03, 3.0000000e+00, 1.8990000e+01, 2.1991000e+02,
3.0000000e+00, 2.0000000e+00, 0.0000000e+00, 2.0000000e+00,
1.0000000e+00, 2.0000000e+00, 0.0000000e+00, 4.4600000e+02,
3.6000000e+01, 1.9420000e+01, 0.0000000e+00, 5.6454547e+02,
6.5863635e+02, 0.0000000e+00, 5.0757690e+03, 3.8230770e+01,
1.0000000e+00, 0.0000000e+00, 1.0000000e+00, 5.0000000e+00,
0.0000000e+00, 1.0000000e+00, 7.0000000e+00, 6.0000000e+00,
2.0000000e+00, 9.8000000e+01, 9.8000002e+00, 2.0170000e+03,
3.0000000e+00, 3.5610000e+03], dtype=float32), 0)
print(X_train_NN.shape)
print(y_train_NN.shape)
print(type(X_train_NN))
(760000, 34)
(760000,)
<class 'numpy.ndarray'>
FEATURE_DIM = 34  # number of input features
CLASS_NUM = 2

class SimpleNet(nn.Layer):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.nn1 = nn.Linear(FEATURE_DIM, 256)
        self.norm1 = nn.LayerNorm(256)
        self.dropout = nn.Dropout(0.3)
        self.nn2 = nn.Linear(256, 64)
        self.norm2 = nn.LayerNorm(64)
        self.dropout1 = nn.Dropout(0.3)
        self.fc = nn.Linear(64, CLASS_NUM)

    def forward(self, inputs):
        # Note: there is no nonlinear activation between the linear layers;
        # adding e.g. F.relu after each LayerNorm is a natural thing to try.
        x = self.nn1(inputs)
        x = self.norm1(x)
        x = self.dropout(x)
        x = self.nn2(x)
        x = self.norm2(x)
        x = self.dropout1(x)
        return self.fc(x)
model = paddle.Model(SimpleNet())
lr = 0.001
criterion = paddle.nn.CrossEntropyLoss()
metric = paddle.metric.Accuracy()
optimizer = paddle.optimizer.Adam(parameters=model.parameters(), learning_rate=lr)
model.prepare(optimizer, criterion, metric)
# Start training and evaluating
epochs = 3
save_dir = 'checkpoint'
model.fit(trainDataset,
          epochs=epochs,
          batch_size=1024,
          save_dir=save_dir,
          log_freq=100)
The loss value printed in the log is the current step, and the metric is the average value of previous steps.
Epoch 1/3
step 100/743 - loss: 0.4971 - acc: 0.7757 - 13ms/step
step 200/743 - loss: 0.4908 - acc: 0.7881 - 12ms/step
step 300/743 - loss: 0.5055 - acc: 0.7918 - 11ms/step
step 400/743 - loss: 0.4924 - acc: 0.7944 - 11ms/step
step 500/743 - loss: 0.4959 - acc: 0.7959 - 11ms/step
step 600/743 - loss: 0.4893 - acc: 0.7968 - 11ms/step
step 700/743 - loss: 0.4762 - acc: 0.7972 - 11ms/step
step 743/743 - loss: 0.4367 - acc: 0.7975 - 11ms/step
save checkpoint at /home/aistudio/checkpoint/0
Epoch 2/3
step 100/743 - loss: 0.5235 - acc: 0.7993 - 12ms/step
step 200/743 - loss: 0.4688 - acc: 0.8005 - 11ms/step
step 300/743 - loss: 0.4640 - acc: 0.8006 - 11ms/step
step 400/743 - loss: 0.4847 - acc: 0.8004 - 11ms/step
step 500/743 - loss: 0.4715 - acc: 0.8002 - 12ms/step
step 600/743 - loss: 0.4957 - acc: 0.8003 - 12ms/step
step 700/743 - loss: 0.4752 - acc: 0.8004 - 11ms/step
step 743/743 - loss: 0.4061 - acc: 0.8005 - 11ms/step
save checkpoint at /home/aistudio/checkpoint/1
Epoch 3/3
step 100/743 - loss: 0.4701 - acc: 0.7978 - 13ms/step
step 200/743 - loss: 0.4428 - acc: 0.8001 - 11ms/step
step 300/743 - loss: 0.4475 - acc: 0.8008 - 12ms/step
step 400/743 - loss: 0.4562 - acc: 0.8010 - 12ms/step
step 500/743 - loss: 0.4465 - acc: 0.8012 - 11ms/step
step 600/743 - loss: 0.4536 - acc: 0.8011 - 11ms/step
step 700/743 - loss: 0.4485 - acc: 0.8010 - 11ms/step
step 743/743 - loss: 0.4768 - acc: 0.8011 - 11ms/step
save checkpoint at /home/aistudio/checkpoint/2
save checkpoint at /home/aistudio/checkpoint/final
# Dummy labels: CreditDataset returns (features, label) pairs, so predict needs placeholders
y_test_NN = np.ones(X_test_NN.shape[0])
testDataset = CreditDataset(X_test_NN, y_test_NN)
result = model.predict(testDataset, batch_size=len(testDataset))
Predict begin...
step 1/1 [==============================] - 18ms/step
Predict samples: 5000
# print(result[:5])
print(result[0][0])
idx = np.argmax(np.array(result[0][0]),axis=1)
pred_NN = idx.tolist()
print(pred_NN[:10])
[[ 3.533115 -3.2123508]
[ 3.7103195 -2.7369869]
[ 3.7547517 -2.5327196]
...
[ 3.585775 -3.1076064]
[-0.7036233 -3.4341366]
[ 3.670831 -2.8589182]]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
submission = pd.DataFrame({'id':test['loan_id'], 'isDefault':pred_NN})
submission.to_csv('submission_nn.csv', index = None)
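The argmax above produces hard 0/1 labels; if the evaluation rewards ranked probabilities (as AUC-style metrics do), submitting the softmax probability of class 1 is usually preferable. A sketch on logits in the format printed above (values copied from that output):

```python
import numpy as np

# Sample logits from result[0][0]; columns are [not default, default]
logits = np.array([[3.533115, -3.2123508],
                   [-0.7036233, -3.4341366]])
# Numerically stable softmax over the class axis
z = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
p_default = probs[:, 1]  # probability of isDefault == 1
print(p_default)
```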
5. Ideas for improvement
1. This competition leans toward data mining, so a good feature goes a long way; spend time analyzing the data itself.
2. Tune the model hyperparameters: batch size, learning rate, and so on.
3. Apply common tricks: ensembling, k-fold cross-validation, and so on.
6. More PaddleEdu content
1. The PaddleEdu one-stop deep learning online encyclopedia awesome-DeepLearning has more to offer; stay tuned:
- Introductory deep learning course

- Deep learning Q&A

- Featured courses

- Industry practice

If you run into any problems while using PaddleEdu, feel free to open an issue at awesome-DeepLearning; for more deep learning material, see the PaddlePaddle deep learning platform.
Remember to leave a Star ⭐~
2. PaddleEdu technical discussion group (QQ)
The QQ group already has 2000+ learners studying together; scan the code to join.