Bank Customer Churn Prediction
Analyzing a bank's customer data to uncover the indicators that most strongly influence churn helps the bank design targeted retention plans and raise its retention rate. This project identifies the most important retention-related features and compares the results against BML's AutoML mode.
I. Bank Customer Churn Prediction
The 13th cohort of the 3-day AI Advanced Practical Camp tackled bank customer churn prediction with BML's one-click training, prediction, and deployment, and it was impressively fast. Rather than be beaten by BML, I decided to try building my own model for comparison.
1. Dataset Overview
Background
As we know, acquiring new customers is much harder than retaining existing ones. For a bank, understanding what drives a customer's decision to leave is extremely valuable: preventing churn lets the bank run loyalty programs and retention campaigns that keep as many customers as possible.
Data Description
- RowNumber: the record (row) number; has no effect on the output.
- CustomerId: a random identifier; has no effect on whether a customer leaves.
- Surname: a customer's surname has no bearing on their decision to leave the bank.
- CreditScore: can affect churn, since customers with higher credit scores are less likely to leave.
- Geography: a customer's location may affect their decision to leave the bank.
- Gender: it is interesting to explore whether gender plays a role in churn.
- Age: certainly relevant, since older customers are less likely to leave their bank than younger ones.
- Tenure: the number of years the customer has been with the bank. Longer-tenured customers are usually more loyal and less likely to leave.
- Balance: also a good churn indicator, since people with higher account balances are less likely to leave than those with lower balances.
- NumOfProducts: the number of products the customer has purchased through the bank.
- HasCrCard: whether the customer holds a credit card. This column matters too, since credit-card holders are less likely to leave.
- IsActiveMember: active customers are less likely to leave the bank.
- EstimatedSalary: people with lower salaries are more likely to leave than those with higher salaries.
- Exited: whether the customer has left the bank (the prediction target).
2. Loading the Dataset
import numpy as np
import warnings
warnings.simplefilter('ignore')
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('whitegrid')
data = pd.read_csv('data/data107968/churn.csv')
print(data.shape)
data.head()
(10000, 14)
 | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
3. Dropping Uninformative Columns
The row number, customer ID, and surname carry no information about whether a customer will churn, so we drop them.
# The list holds the column names to drop; axis=1 means drop by column
data=data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
data.head()
 | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited
---|---|---|---|---|---|---|---|---|---|---
0 | 619 | France | Female | 42 | 2 | 0.00 | 1 | 1 | 1 | 101348.88 | 1 |
1 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
2 | 502 | France | Female | 42 | 8 | 159660.80 | 3 | 1 | 0 | 113931.57 | 1 |
3 | 699 | France | Female | 39 | 1 | 0.00 | 2 | 0 | 0 | 93826.63 | 0 |
4 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.10 | 0 |
4. Data Analysis
The analysis below is adapted from 孤飞's project: https://aistudio.baidu.com/aistudio/projectdetail/2801890?channelType=0&channel=0. As the chart below shows, the gap between churned and retained customers is substantial: the classes are clearly imbalanced.
# Unique Geography values
data.Geography.unique()
array(['France', 'Spain', 'Germany'], dtype=object)
# Unique Gender values
data.Gender.unique()
array(['Female', 'Male'], dtype=object)
# Do any columns contain null values?
for column in data.columns:
    print(column, data[column].isnull().any())
CreditScore False
Geography False
Gender False
Age False
Tenure False
Balance False
NumOfProducts False
HasCrCard False
IsActiveMember False
EstimatedSalary False
Exited False
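A more compact equivalent of the column-by-column loop above:
# One-line null check: count missing values per column
print(data.isnull().sum())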
plt.figure(figsize=(16, 6))
sns.countplot(x=data.Exited, palette='Blues_r')  # bar chart of churned vs. retained counts
(Bar chart: retained customers (Exited=0) far outnumber churned ones (Exited=1).)
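To put a number on the gap the chart shows, the churn rate can be computed directly; per the value_counts() output later in this notebook (6347 retained vs. 1653 churned in the training split), it comes out to roughly 20%:
# Exited is 0/1, so its mean is the churn rate
print(data['Exited'].value_counts())
print('Churn rate: {:.2%}'.format(data['Exited'].mean()))  # about 20% on this dataset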
# Train/validation split (train_test_split is not imported above, so import it here)
from sklearn.model_selection import train_test_split
train_dataset, eval_dataset = train_test_split(data, test_size=0.2, random_state=1024)
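Given the imbalance, a stratified split would keep the churn ratio identical across the two subsets; a minimal alternative sketch (the rest of the notebook keeps the plain split above):
# Alternative: stratify on the label so train and eval share the same churn ratio
train_dataset, eval_dataset = train_test_split(
    data, test_size=0.2, random_state=1024, stratify=data['Exited'])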
5. Importing Libraries
!pip install catboost
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier
from sklearn import metrics
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
6. Data Loading and Encoding
print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CreditScore 10000 non-null int64
1 Geography 10000 non-null object
2 Gender 10000 non-null object
3 Age 10000 non-null int64
4 Tenure 10000 non-null int64
5 Balance 10000 non-null float64
6 NumOfProducts 10000 non-null int64
7 HasCrCard 10000 non-null int64
8 IsActiveMember 10000 non-null int64
9 EstimatedSalary 10000 non-null float64
10 Exited 10000 non-null int64
dtypes: float64(2), int64(7), object(2)
memory usage: 859.5+ KB
None
# Fit the label encoders on the full data (train + eval) so both splits share one mapping
data = pd.concat([train_dataset, eval_dataset], axis=0)
cate_cols = ['Geography', 'Gender']
for col in cate_cols:
    lb = LabelEncoder()
    lb.fit(data[col])  # fit on all rows, then transform each split separately
    train_dataset[col] = lb.transform(train_dataset[col])
    eval_dataset[col] = lb.transform(eval_dataset[col])
no_feas = ['Exited']
features = [col for col in train_dataset.columns if col not in no_feas]
X_train = train_dataset[features]
X_test = eval_dataset[features]
y_train = train_dataset['Exited'].astype(int)
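Note that LabelEncoder maps Geography to arbitrary integers (France=0, Germany=1, Spain=2), which tree models tolerate well; a one-hot alternative that avoids the artificial ordering, as a sketch (not used below):
# Alternative: one-hot encode the categoricals instead of label encoding
data_ohe = pd.get_dummies(data, columns=['Geography', 'Gender'], drop_first=True)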
def train_model_classification(X, X_test, y, params, num_classes=2,
                               folds=None, model_type='lgb',
                               eval_metric='logloss', columns=None,
                               plot_feature_importance=False,
                               model=None, verbose=10000,
                               early_stopping_rounds=200,
                               splits=None, n_folds=3):
    """
    Train a classification model with cross-validation.
    Returns a dict with oof predictions, test predictions, scores and,
    if requested, feature importances.
    :params: X - training data, pd.DataFrame
    :params: X_test - test data, pd.DataFrame
    :params: y - target
    :params: folds - folds to split data
    :params: model_type - model type ('lgb', 'xgb', 'sklearn' or 'cat')
    :params: eval_metric - evaluation metric
    :params: columns - feature columns
    :params: plot_feature_importance - whether to plot feature importance
    :params: model - sklearn model, works only for the 'sklearn' model type
    """
    start_time = time.time()
    global y_pred_valid, y_pred
    columns = X.columns if columns is None else columns
    X_test = X_test[columns]
    splits = folds.split(X, y) if splits is None else splits
    n_splits = folds.n_splits if splits is None else n_folds

    # set up scoring parameters
    metrics_dict = {
        'logloss': {
            'lgb_metric_name': 'logloss',
            'xgb_metric_name': 'logloss',
            'catboost_metric_name': 'Logloss',
            'sklearn_scoring_function': metrics.log_loss
        },
        'lb_score_method': {
            'sklearn_scoring_f1': metrics.f1_score,              # online evaluation metric
            'sklearn_scoring_accuracy': metrics.accuracy_score,  # online evaluation metric
            'sklearn_scoring_auc': metrics.roc_auc_score
        },
    }

    result_dict = {}
    # out-of-fold predictions on train data
    oof = np.zeros(shape=(len(X), num_classes))
    # averaged predictions on test data
    prediction = np.zeros(shape=(len(X_test), num_classes))
    # lists of per-fold scores
    acc_scores = []
    scores = []
    # feature importance
    feature_importance = pd.DataFrame()

    # split and train on folds
    for fold_n, (train_index, valid_index) in enumerate(splits):
        if verbose:
            print(f'Fold {fold_n + 1} started at {time.ctime()}')
        if type(X) == np.ndarray:
            X_train, X_valid = X[train_index], X[valid_index]
            y_train, y_valid = y[train_index], y[valid_index]
        else:
            X_train, X_valid = X[columns].iloc[train_index], X[columns].iloc[valid_index]
            y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]

        if model_type == 'lgb':
            model = lgb.LGBMClassifier(**params)
            model.fit(X_train, y_train,
                      eval_set=[(X_train, y_train), (X_valid, y_valid)],
                      eval_metric=metrics_dict[eval_metric]['lgb_metric_name'],
                      verbose=verbose,
                      early_stopping_rounds=early_stopping_rounds)
            y_pred_valid = model.predict_proba(X_valid)
            y_pred = model.predict_proba(X_test, num_iteration=model.best_iteration_)

        if model_type == 'xgb':
            model = xgb.XGBClassifier(**params)
            model.fit(X_train, y_train,
                      eval_set=[(X_train, y_train), (X_valid, y_valid)],
                      eval_metric=metrics_dict[eval_metric]['xgb_metric_name'],
                      verbose=bool(verbose),  # xgb expects a bool here
                      early_stopping_rounds=early_stopping_rounds)
            y_pred_valid = model.predict_proba(X_valid)
            y_pred = model.predict_proba(X_test, ntree_limit=model.best_ntree_limit)

        if model_type == 'sklearn':
            model = model
            model.fit(X_train, y_train)
            y_pred_valid = model.predict_proba(X_valid)
            score = metrics_dict[eval_metric]['sklearn_scoring_function'](y_valid, y_pred_valid)
            print(f'Fold {fold_n}. {eval_metric}: {score:.4f}.')
            y_pred = model.predict_proba(X_test)

        if model_type == 'cat':
            model = CatBoostClassifier(iterations=20000,
                                       eval_metric=metrics_dict[eval_metric]['catboost_metric_name'],
                                       **params,
                                       loss_function=metrics_dict[eval_metric]['catboost_metric_name'])
            model.fit(X_train, y_train, eval_set=(X_valid, y_valid), cat_features=[],
                      use_best_model=True, verbose=False)
            y_pred_valid = model.predict_proba(X_valid)
            y_pred = model.predict_proba(X_test)

        oof[valid_index] = y_pred_valid
        # per-fold evaluation metrics
        acc_scores.append(
            metrics_dict['lb_score_method']['sklearn_scoring_accuracy'](y_valid, np.argmax(y_pred_valid, axis=1)))
        scores.append(
            metrics_dict['lb_score_method']['sklearn_scoring_auc'](y_valid, y_pred_valid[:, 1]))
        print(acc_scores)
        print(scores)
        prediction += y_pred

        if model_type == 'lgb' and plot_feature_importance:
            # collect per-fold feature importance
            fold_importance = pd.DataFrame()
            fold_importance["feature"] = columns
            fold_importance["importance"] = model.feature_importances_
            fold_importance["fold"] = fold_n + 1
            feature_importance = pd.concat([feature_importance, fold_importance], axis=0)

        if model_type == 'xgb' and plot_feature_importance:
            # collect per-fold feature importance
            fold_importance = pd.DataFrame()
            fold_importance["feature"] = columns
            fold_importance["importance"] = model.feature_importances_
            fold_importance["fold"] = fold_n + 1
            feature_importance = pd.concat([feature_importance, fold_importance], axis=0)

    prediction /= n_splits
    print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores), np.std(scores)))

    result_dict['oof'] = oof
    result_dict['prediction'] = prediction
    result_dict['acc_scores'] = acc_scores
    result_dict['scores'] = scores

    if model_type == 'lgb' or model_type == 'xgb':
        if plot_feature_importance:
            feature_importance["importance"] /= n_splits
            cols = feature_importance[["feature", "importance"]].groupby("feature").mean().sort_values(
                by="importance", ascending=False)[:50].index
            best_features = feature_importance.loc[feature_importance.feature.isin(cols)]
            plt.figure(figsize=(16, 12))
            sns.barplot(x="importance", y="feature",
                        data=best_features.sort_values(by="importance", ascending=False))
            plt.title('LGB Features (avg over folds)')
            plt.show()
            result_dict['feature_importance'] = feature_importance

    end_time = time.time()
    print("train_model_classification cost time:{}".format(end_time - start_time))
    return result_dict
7. LightGBM Model
- GitHub: https://github.com/microsoft/LightGBM
- Documentation: https://lightgbm.readthedocs.io/en/latest/
- Core parameters: https://lightgbm.readthedocs.io/en/latest/Parameters.html
lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'n_estimators': 100000,
    'learning_rate': 0.1,
    'random_state': 2948,
    'bagging_freq': 8,
    'bagging_fraction': 0.80718,
    'feature_fraction': 0.38691,  # 0.3
    'feature_fraction_seed': 11,
    'max_depth': 9,
    'min_data_in_leaf': 40,
    'min_child_weight': 0.18654,
    'min_split_gain': 0.35079,
    'min_sum_hessian_in_leaf': 1.11347,
    'num_leaves': 29,
    'num_threads': 6,
    'lambda_l1': 0.55831,
    'lambda_l2': 1.67906,
    'cat_smooth': 10.4,
    'subsample': 0.7,          # ignored: bagging_fraction takes precedence (see warnings below)
    'colsample_bytree': 0.7,   # ignored: feature_fraction takes precedence
    # 'n_jobs': -1,
    'metric': 'auc'
}
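Several keys above are sklearn-style aliases of native LightGBM parameters, which is why the training log below prints a run of "will be ignored" warnings. A deduplicated sketch that keeps behavior identical:
# Drop the aliases LightGBM reports as ignored (the native names take precedence)
lgb_params_clean = {k: v for k, v in lgb_params.items()
                    if k not in ('subsample', 'colsample_bytree', 'min_child_weight')}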
n_fold = 5
num_classes = 2
print("num_classes: {}".format(num_classes))
folds = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=1314)
X = train_dataset[features]
print(y_train.value_counts())
X_test = eval_dataset[features]
result_dict_lgb = train_model_classification(X=X,
                                             X_test=X_test,
                                             y=y_train,
                                             params=lgb_params,
                                             num_classes=num_classes,
                                             folds=folds,
                                             model_type='lgb',
                                             eval_metric='logloss',
                                             plot_feature_importance=True,
                                             verbose=200,
                                             early_stopping_rounds=200,
                                             n_folds=n_fold)
acc_score = np.mean(result_dict_lgb['acc_scores'])
score = np.mean(result_dict_lgb['scores'])
print(score)
num_classes: 2
0 6347
1 1653
Name: Exited, dtype: int64
Fold 1 started at Sun Jan 23 23:34:02 2022
[LightGBM] [Warning] feature_fraction is set=0.38691, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.38691
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=6, n_jobs=-1 will be ignored. Current value: num_threads=6
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.230063 training's auc: 0.956131 valid_1's binary_logloss: 0.377107 valid_1's auc: 0.826328
Early stopping, best iteration is:
[87] training's binary_logloss: 0.276623 training's auc: 0.932371 valid_1's binary_logloss: 0.371376 valid_1's auc: 0.830094
[0.850625]
[0.8300943483819361]
Fold 2 started at Sun Jan 23 23:34:03 2022
(same LightGBM alias warnings as Fold 1)
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.233911 training's auc: 0.954853 valid_1's binary_logloss: 0.366604 valid_1's auc: 0.844229
Early stopping, best iteration is:
[100] training's binary_logloss: 0.271444 training's auc: 0.934158 valid_1's binary_logloss: 0.359972 valid_1's auc: 0.850745
[0.850625, 0.846875]
[0.8300943483819361, 0.8507448117912859]
Fold 3 started at Sun Jan 23 23:34:03 2022
(same LightGBM alias warnings as Fold 1)
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.238286 training's auc: 0.95085 valid_1's binary_logloss: 0.355037 valid_1's auc: 0.849645
Early stopping, best iteration is:
[98] training's binary_logloss: 0.276559 training's auc: 0.93025 valid_1's binary_logloss: 0.353108 valid_1's auc: 0.856094
[0.850625, 0.846875, 0.85375]
[0.8300943483819361, 0.8507448117912859, 0.8560943150516975]
Fold 4 started at Sun Jan 23 23:34:04 2022
(same LightGBM alias warnings as Fold 1)
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.239597 training's auc: 0.95257 valid_1's binary_logloss: 0.318056 valid_1's auc: 0.882052
Early stopping, best iteration is:
[106] training's binary_logloss: 0.275256 training's auc: 0.931964 valid_1's binary_logloss: 0.31983 valid_1's auc: 0.883624
[0.850625, 0.846875, 0.85375, 0.873125]
[0.8300943483819361, 0.8507448117912859, 0.8560943150516975, 0.8836244333094726]
Fold 5 started at Sun Jan 23 23:34:04 2022
(same LightGBM alias warnings as Fold 1)
Training until validation scores don't improve for 200 rounds
[200] training's binary_logloss: 0.236783 training's auc: 0.952846 valid_1's binary_logloss: 0.350703 valid_1's auc: 0.851403
Early stopping, best iteration is:
[119] training's binary_logloss: 0.265393 training's auc: 0.937635 valid_1's binary_logloss: 0.347245 valid_1's auc: 0.853543
[0.850625, 0.846875, 0.85375, 0.873125, 0.859375]
[0.8300943483819361, 0.8507448117912859, 0.8560943150516975, 0.8836244333094726, 0.8535433070866143]
CV mean score: 0.8548, std: 0.0171.
train_model_classification cost time:2.9498422145843506
0.8548202431242012
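As a sanity check on the per-fold AUCs, the overall out-of-fold AUC can be computed from the returned dict; a minimal sketch:
# AUC over all out-of-fold predictions (rows of `oof` are aligned with y_train)
oof_auc = roc_auc_score(y_train, result_dict_lgb['oof'][:, 1])
print('OOF AUC: {:.4f}'.format(oof_auc))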
8. Comparison with BML
BML results
Custom model results
Fold accuracies: [0.850625, 0.846875, 0.85375, 0.873125, 0.859375]
Fold AUCs: [0.8300943483819361, 0.8507448117912859, 0.8560943150516975, 0.8836244333094726, 0.8535433070866143]
CV mean score: 0.8548, std: 0.0171.
Comparison
BML's score is slightly higher. One likely reason is that the custom model did nothing to address the class imbalance.
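A natural follow-up, assuming the imbalance is indeed the main gap: reweight the positive class via LightGBM's built-in options (a hypothetical sketch, not run here):
# Counter the roughly 4:1 imbalance by reweighting the positive (churned) class
lgb_params_balanced = dict(lgb_params, scale_pos_weight=6347 / 1653)
# or equivalently: dict(lgb_params, is_unbalance=True)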