I. Bank Customer Churn Prediction

The 13th cohort of the 3-day AI advanced hands-on camp tackled bank customer churn prediction using BML's one-click training, prediction, and deployment. It was impressively fast; it looks like we hand-rollers are about to be beaten by BML, so I decided to try the same task myself.

1. Dataset overview

Background

As we all know, acquiring new customers is much harder than retaining existing ones.

For a bank, understanding what drives a customer's decision to leave is extremely valuable.

Preventing churn lets the bank design loyalty programs and retention campaigns to keep as many customers as possible.

Data description (each field's hypothesized effect on churn can be spot-checked in code; see the sketch after this list)

  • RowNumber - the record (row) number; has no effect on the output.
  • CustomerId - a random identifier; has no bearing on whether a customer leaves the bank.
  • Surname - a customer's surname has no influence on their decision to leave the bank.
  • CreditScore - may affect churn, since customers with higher credit scores are less likely to leave the bank.
  • Geography - a customer's location may influence their decision to leave the bank.
  • Gender - it is interesting to explore whether gender plays a role in which customers leave.
  • Age - certainly relevant, since older customers are less likely to leave the bank than younger ones.
  • Tenure - the number of years the customer has been with the bank. Typically, longer-tenured customers are more loyal and less likely to leave.
  • Balance - also a good churn indicator, since people with higher account balances are less likely to leave than those with lower balances.
  • NumOfProducts - the number of products the customer has purchased through the bank.
  • HasCrCard - whether the customer holds a credit card. Also relevant, since credit-card holders are less likely to leave the bank.
  • IsActiveMember - active customers are less likely to leave the bank.
  • EstimatedSalary - people with lower salaries are more likely to leave the bank than those with higher salaries.
  • Exited - whether the customer has left the bank (the prediction target).
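Once the CSV is loaded (next section), each of these per-field hypotheses can be spot-checked directly: the mean of the 0/1 Exited flag within a group is that group's churn rate. A minimal sketch, assuming the data DataFrame and column names shown below:

# Hedged sketch: churn rate by group (assumes `data` from section 2)
print(data.groupby('Geography')['Exited'].mean())
print(data.groupby('HasCrCard')['Exited'].mean())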

2. Loading the dataset

import numpy as np
import warnings
warnings.simplefilter('ignore')
import pandas as pd

pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

sns.set_style('whitegrid')
data = pd.read_csv('data/data107968/churn.csv')
print(data.shape)
data.head()
(10000, 14)
   RowNumber  CustomerId   Surname  CreditScore Geography  Gender  Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited
0          1    15634602  Hargrave          619    France  Female   42       2       0.00              1          1               1        101348.88       1
1          2    15647311      Hill          608     Spain  Female   41       1   83807.86              1          0               1        112542.58       0
2          3    15619304      Onio          502    France  Female   42       8  159660.80              3          1               0        113931.57       1
3          4    15701354      Boni          699    France  Female   39       1       0.00              2          0               0         93826.63       0
4          5    15737888  Mitchell          850     Spain  Female   43       2  125510.82              1          1               1         79084.10       0

3. Dropping meaningless columns

Columns such as the row number, customer id, and surname carry no signal about whether a customer churns, so we drop them.

# the list in brackets names the columns to drop; axis=1 means drop columns
data=data.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
data.head()
   CreditScore Geography  Gender  Age  Tenure    Balance  NumOfProducts  HasCrCard  IsActiveMember  EstimatedSalary  Exited
0          619    France  Female   42       2       0.00              1          1               1        101348.88       1
1          608     Spain  Female   41       1   83807.86              1          0               1        112542.58       0
2          502    France  Female   42       8  159660.80              3          1               0        113931.57       1
3          699    France  Female   39       1       0.00              2          0               0         93826.63       0
4          850     Spain  Female   43       2  125510.82              1          1               1         79084.10       0

4. Data analysis

This part is adapted from fellow student 孤飞's project (https://aistudio.baidu.com/aistudio/projectdetail/2801890?channelType=0&channel=0); my own EDA is too weak, so I did not write it from scratch. As the chart below shows, the gap between churned and retained customers is quite large.
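Before plotting, the imbalance is easy to quantify; a minimal sketch using the Exited column loaded above:

# Hedged sketch: absolute and relative counts of churned (1) vs. retained (0)
print(data['Exited'].value_counts())
print(data['Exited'].value_counts(normalize=True))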

# unique values of Geography
data.Geography.unique()
array(['France', 'Spain', 'Germany'], dtype=object)
# unique values of Gender
data.Gender.unique()
array(['Female', 'Male'], dtype=object)
# do any columns contain null values?
for column in data.columns:
    print(column, data[column].isnull().any())
CreditScore False
Geography False
Gender False
Age False
Tenure False
Balance False
NumOfProducts False
HasCrCard False
IsActiveMember False
EstimatedSalary False
Exited False
plt.figure(figsize=(16, 6))
sns.countplot(x=data.Exited, palette='Blues_r')  # bar chart of churned vs. retained counts
<matplotlib.axes._subplots.AxesSubplot at 0x7fed58c05790>

[Figure: count plot of Exited (churned vs. retained), output_11_1.png]

# train/eval split (train_test_split is not imported elsewhere, so import it here)
from sklearn.model_selection import train_test_split

train_dataset, eval_dataset = train_test_split(data, test_size=0.2, random_state=1024)
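Given the imbalance just observed, a stratified split would keep the churn ratio identical in both sets; a minimal alternative sketch (not what was run here):

# Hedged sketch: the same split, but stratified on the target column
train_dataset, eval_dataset = train_test_split(
    data, test_size=0.2, random_state=1024, stratify=data['Exited'])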

5. Importing libraries

!pip install catboost
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostClassifier

from sklearn import metrics
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder

6. Data loading and encoding

print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      10000 non-null  int64  
 1   Geography        10000 non-null  object 
 2   Gender           10000 non-null  object 
 3   Age              10000 non-null  int64  
 4   Tenure           10000 non-null  int64  
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  int64  
 7   HasCrCard        10000 non-null  int64  
 8   IsActiveMember   10000 non-null  int64  
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  int64  
dtypes: float64(2), int64(7), object(2)
memory usage: 859.5+ KB
None
# concatenate train_dataset and eval_dataset so the encoders see every category
data = pd.concat([train_dataset, eval_dataset], axis=0)
cate_cols = ['Geography', 'Gender']
for col in cate_cols:
    lb = LabelEncoder()
    lb.fit(data[col])  # fit only; the original `data[col] = lb.fit(...)` stored the encoder object, not codes
    train_dataset[col] = lb.transform(train_dataset[col])
    eval_dataset[col] = lb.transform(eval_dataset[col])
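After the loop, lb is the encoder fitted on the last column (Gender); its classes_ attribute lists the original labels in encoded order. A quick hedged check:

# Hedged sketch: LabelEncoder assigns codes in sorted label order, so Gender
# maps Female -> 0, Male -> 1, and Geography maps France -> 0, Germany -> 1,
# Spain -> 2 (the unique values were shown in section 4)
print(lb.classes_)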

no_feas = ['Exited']
features = [col for col in train_dataset.columns if col not in no_feas]
X_train = train_dataset[features]
X_test = eval_dataset[features]

y_train = train_dataset['Exited'].astype(int)
def train_model_classification(X, X_test, y, params, num_classes=2,
                               folds=None, model_type='lgb',
                               eval_metric='logloss', columns=None,
                               plot_feature_importance=False,
                               model=None, verbose=10000,
                               early_stopping_rounds=200,
                               splits=None, n_folds=3):
    """
    Cross-validated training for classification models.
    Returns a dict with: oof predictions, test predictions, scores and,
    if requested, feature importances.
    :params: X - training data, pd.DataFrame
    :params: X_test - test data, pd.DataFrame
    :params: y - target
    :params: folds - folds to split data
    :params: model_type - model type ('lgb', 'xgb', 'sklearn' or 'cat')
    :params: eval_metric - evaluation metric
    :params: columns - feature columns
    :params: plot_feature_importance - whether to plot feature importance
    :params: model - sklearn model, works only for the 'sklearn' model type
    """
    start_time = time.time()
    global y_pred_valid, y_pred

    columns = X.columns if columns is None else columns
    X_test = X_test[columns]
    # compute n_splits before overwriting `splits`, otherwise the
    # `splits is None` check below would always be False
    n_splits = folds.n_splits if splits is None else n_folds
    splits = folds.split(X, y) if splits is None else splits

    # to set up scoring parameters
    metrics_dict = {
        'logloss': {
            'lgb_metric_name': 'logloss',
            'xgb_metric_name': 'logloss',
            'catboost_metric_name': 'Logloss',
            'sklearn_scoring_function': metrics.log_loss
        },
        'lb_score_method': {
            'sklearn_scoring_f1': metrics.f1_score,  # leaderboard metric
            'sklearn_scoring_accuracy': metrics.accuracy_score,  # leaderboard metric
            'sklearn_scoring_auc': metrics.roc_auc_score
        },
    }
    result_dict = {}

    # out-of-fold predictions on train data
    oof = np.zeros(shape=(len(X), num_classes))
    # averaged predictions on train data
    prediction = np.zeros(shape=(len(X_test), num_classes))
    # list of scores on folds
    acc_scores = []
    scores = []
    # feature importance
    feature_importance = pd.DataFrame()

    # split and train on folds
    for fold_n, (train_index, valid_index) in enumerate(splits):
        if verbose:
            print(f'Fold {fold_n + 1} started at {time.ctime()}')
        if type(X) == np.ndarray:
            X_train, X_valid = X[train_index], X[valid_index]
            y_train, y_valid = y[train_index], y[valid_index]
        else:
            X_train, X_valid = X[columns].iloc[train_index], X[columns].iloc[valid_index]
            y_train, y_valid = y.iloc[train_index], y.iloc[valid_index]

        if model_type == 'lgb':
            model = lgb.LGBMClassifier(**params)
            model.fit(X_train, y_train,
                      eval_set=[(X_train, y_train), (X_valid, y_valid)],
                      eval_metric=metrics_dict[eval_metric]['lgb_metric_name'],
                      verbose=verbose,
                      early_stopping_rounds=early_stopping_rounds)

            y_pred_valid = model.predict_proba(X_valid)
            y_pred = model.predict_proba(X_test, num_iteration=model.best_iteration_)

        if model_type == 'xgb':
            model = xgb.XGBClassifier(**params)
            model.fit(X_train, y_train,
                      eval_set=[(X_train, y_train), (X_valid, y_valid)],
                      eval_metric=metrics_dict[eval_metric]['xgb_metric_name'],
                      verbose=bool(verbose),  # xgb verbose bool
                      early_stopping_rounds=early_stopping_rounds)
            y_pred_valid = model.predict_proba(X_valid)
            y_pred = model.predict_proba(X_test, ntree_limit=model.best_ntree_limit)
        if model_type == 'sklearn':
            # `model` is the estimator object passed in by the caller
            model.fit(X_train, y_train)
            y_pred_valid = model.predict_proba(X_valid)
            score = metrics_dict[eval_metric]['sklearn_scoring_function'](y_valid, y_pred_valid)
            print(f'Fold {fold_n}. {eval_metric}: {score:.4f}.')
            y_pred = model.predict_proba(X_test)

        if model_type == 'cat':
            model = CatBoostClassifier(iterations=20000, eval_metric=metrics_dict[eval_metric]['catboost_metric_name'],
                                       **params,
                                       loss_function=metrics_dict[eval_metric]['catboost_metric_name'])
            model.fit(X_train, y_train, eval_set=(X_valid, y_valid), cat_features=[], use_best_model=True,
                      verbose=False)

            y_pred_valid = model.predict_proba(X_valid)
            y_pred = model.predict_proba(X_test)

        oof[valid_index] = y_pred_valid
        # evaluation metrics
        acc_scores.append(
            metrics_dict['lb_score_method']['sklearn_scoring_accuracy'](y_valid, np.argmax(y_pred_valid, axis=1)))
        scores.append(
            metrics_dict['lb_score_method']['sklearn_scoring_auc'](y_valid, y_pred_valid[:,1]))
        print(acc_scores)
        print(scores)
        prediction += y_pred

        if model_type == 'lgb' and plot_feature_importance:
            # feature importance
            fold_importance = pd.DataFrame()
            fold_importance["feature"] = columns
            fold_importance["importance"] = model.feature_importances_
            fold_importance["fold"] = fold_n + 1
            feature_importance = pd.concat([feature_importance, fold_importance], axis=0)

        if model_type == 'xgb' and plot_feature_importance:
            # feature importance
            fold_importance = pd.DataFrame()
            fold_importance["feature"] = columns
            fold_importance["importance"] = model.feature_importances_
            fold_importance["fold"] = fold_n + 1
            feature_importance = pd.concat([feature_importance, fold_importance], axis=0)
    prediction /= n_splits
    print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores), np.std(scores)))

    result_dict['oof'] = oof
    result_dict['prediction'] = prediction
    result_dict['acc_scores'] = acc_scores
    result_dict['scores'] = scores


    if model_type == 'lgb' or model_type == 'xgb':
        if plot_feature_importance:
            feature_importance["importance"] /= n_splits
            cols = feature_importance[["feature", "importance"]].groupby("feature").mean().sort_values(
                by="importance", ascending=False)[:50].index

            best_features = feature_importance.loc[feature_importance.feature.isin(cols)]

            plt.figure(figsize=(16, 12))
            sns.barplot(x="importance", y="feature", data=best_features.sort_values(by="importance", ascending=False))
            plt.title('LGB Features (avg over folds)')
            plt.show()
            result_dict['feature_importance'] = feature_importance
    end_time = time.time()

    print("train_model_classification cost time:{}".format(end_time - start_time))
    return result_dict

7. LightGBM model

  • GitHub: https://github.com/microsoft/LightGBM
  • Documentation: https://lightgbm.readthedocs.io/en/latest/
  • Core parameters: https://lightgbm.readthedocs.io/en/latest/Parameters.html
lgb_params = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'n_estimators': 100000,
    'learning_rate': 0.1,
    'random_state': 2948,
    'bagging_freq': 8,
    'bagging_fraction': 0.80718,
    'feature_fraction': 0.38691,  # 0.3
    'feature_fraction_seed': 11,
    'max_depth': 9,
    'min_data_in_leaf': 40,
    'min_child_weight': 0.18654,
    "min_split_gain": 0.35079,
    'min_sum_hessian_in_leaf': 1.11347,
    'num_leaves': 29,
    'num_threads': 6,
    "lambda_l1": 0.55831,
    'lambda_l2': 1.67906,
    'cat_smooth': 10.4,
    'subsample': 0.7,
    'colsample_bytree': 0.7,
    # 'n_jobs': -1,
    'metric': 'auc'
}
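Note that several keys in this dict are aliases of one another: bagging_fraction shadows subsample, feature_fraction shadows colsample_bytree, and min_sum_hessian_in_leaf shadows min_child_weight. LightGBM keeps the non-alias value and warns, which is exactly what the per-fold warnings in the training log below report. A small hedged sketch that would silence them by dropping the shadowed aliases:

# Hedged sketch: remove the redundant alias keys (behavior is unchanged,
# since LightGBM already ignores them in favor of the native names)
for alias in ('subsample', 'colsample_bytree', 'min_child_weight'):
    lgb_params.pop(alias, None)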

n_fold = 5
num_classes = 2
print("分类个数num_classes:{}".format(num_classes))
folds = StratifiedKFold(n_splits=n_fold, random_state=1314, shuffle=True)
X = train_dataset[features]
print(y_train.value_counts())
X_test = eval_dataset[features]

result_dict_lgb = train_model_classification(X=X,
                                             X_test=X_test,
                                             y=y_train,
                                             params=lgb_params,
                                             num_classes=num_classes,
                                             folds=folds,
                                             model_type='lgb',
                                             eval_metric='logloss',
                                             plot_feature_importance=True,
                                             verbose=200,
                                             early_stopping_rounds=200,
                                             n_folds=n_fold
                                             )

acc_score = np.mean(result_dict_lgb['acc_scores'])
score = np.mean(result_dict_lgb['scores'])
print(score)
分类个数num_classes:2
0    6347
1    1653
Name: Exited, dtype: int64
Fold 1 started at Sun Jan 23 23:34:02 2022
[LightGBM] [Warning] feature_fraction is set=0.38691, colsample_bytree=0.7 will be ignored. Current value: feature_fraction=0.38691
[LightGBM] [Warning] lambda_l1 is set=0.55831, reg_alpha=0.0 will be ignored. Current value: lambda_l1=0.55831
[LightGBM] [Warning] num_threads is set=6, n_jobs=-1 will be ignored. Current value: num_threads=6
[LightGBM] [Warning] bagging_fraction is set=0.80718, subsample=0.7 will be ignored. Current value: bagging_fraction=0.80718
[LightGBM] [Warning] min_sum_hessian_in_leaf is set=1.11347, min_child_weight=0.18654 will be ignored. Current value: min_sum_hessian_in_leaf=1.11347
[LightGBM] [Warning] min_data_in_leaf is set=40, min_child_samples=20 will be ignored. Current value: min_data_in_leaf=40
[LightGBM] [Warning] bagging_freq is set=8, subsample_freq=0 will be ignored. Current value: bagging_freq=8
[LightGBM] [Warning] lambda_l2 is set=1.67906, reg_lambda=0.0 will be ignored. Current value: lambda_l2=1.67906
Training until validation scores don't improve for 200 rounds
[200]	training's binary_logloss: 0.230063	training's auc: 0.956131	valid_1's binary_logloss: 0.377107	valid_1's auc: 0.826328
Early stopping, best iteration is:
[87]	training's binary_logloss: 0.276623	training's auc: 0.932371	valid_1's binary_logloss: 0.371376	valid_1's auc: 0.830094
[0.850625]
[0.8300943483819361]
Fold 2 started at Sun Jan 23 23:34:03 2022
Training until validation scores don't improve for 200 rounds
[200]	training's binary_logloss: 0.233911	training's auc: 0.954853	valid_1's binary_logloss: 0.366604	valid_1's auc: 0.844229
Early stopping, best iteration is:
[100]	training's binary_logloss: 0.271444	training's auc: 0.934158	valid_1's binary_logloss: 0.359972	valid_1's auc: 0.850745
[0.850625, 0.846875]
[0.8300943483819361, 0.8507448117912859]
Fold 3 started at Sun Jan 23 23:34:03 2022
Training until validation scores don't improve for 200 rounds
[200]	training's binary_logloss: 0.238286	training's auc: 0.95085	valid_1's binary_logloss: 0.355037	valid_1's auc: 0.849645
Early stopping, best iteration is:
[98]	training's binary_logloss: 0.276559	training's auc: 0.93025	valid_1's binary_logloss: 0.353108	valid_1's auc: 0.856094
[0.850625, 0.846875, 0.85375]
[0.8300943483819361, 0.8507448117912859, 0.8560943150516975]
Fold 4 started at Sun Jan 23 23:34:04 2022
Training until validation scores don't improve for 200 rounds
[200]	training's binary_logloss: 0.239597	training's auc: 0.95257	valid_1's binary_logloss: 0.318056	valid_1's auc: 0.882052
Early stopping, best iteration is:
[106]	training's binary_logloss: 0.275256	training's auc: 0.931964	valid_1's binary_logloss: 0.31983	valid_1's auc: 0.883624
[0.850625, 0.846875, 0.85375, 0.873125]
[0.8300943483819361, 0.8507448117912859, 0.8560943150516975, 0.8836244333094726]
Fold 5 started at Sun Jan 23 23:34:04 2022
Training until validation scores don't improve for 200 rounds
[200]	training's binary_logloss: 0.236783	training's auc: 0.952846	valid_1's binary_logloss: 0.350703	valid_1's auc: 0.851403
Early stopping, best iteration is:
[119]	training's binary_logloss: 0.265393	training's auc: 0.937635	valid_1's binary_logloss: 0.347245	valid_1's auc: 0.853543
[0.850625, 0.846875, 0.85375, 0.873125, 0.859375]
[0.8300943483819361, 0.8507448117912859, 0.8560943150516975, 0.8836244333094726, 0.8535433070866143]
CV mean score: 0.8548, std: 0.0171.

[Figure: LGB feature importances averaged over folds, output_21_1.png]

train_model_classification cost time:2.9498422145843506
0.8548202431242012
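The returned dict also makes an overall sanity check easy; a minimal sketch, assuming result_dict_lgb from the run above (not part of the original output):

# Hedged sketch: overall out-of-fold AUC on the training folds, plus
# hard labels for the held-out eval set from the fold-averaged probabilities
oof_auc = roc_auc_score(y_train, result_dict_lgb['oof'][:, 1])
print('OOF AUC:', oof_auc)
eval_pred = np.argmax(result_dict_lgb['prediction'], axis=1)
print('Eval accuracy:', metrics.accuracy_score(eval_dataset['Exited'].astype(int), eval_pred))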

8. Comparison with BML

BML results

Self-built model results
Training until validation scores don't improve for 200 rounds
[200]	training's binary_logloss: 0.236783	training's auc: 0.952846	valid_1's binary_logloss: 0.350703	valid_1's auc: 0.851403
Early stopping, best iteration is:
[119]	training's binary_logloss: 0.265393	training's auc: 0.937635	valid_1's binary_logloss: 0.347245	valid_1's auc: 0.853543
[0.850625, 0.846875, 0.85375, 0.873125, 0.859375]
[0.8300943483819361, 0.8507448117912859, 0.8560943150516975, 0.8836244333094726, 0.8535433070866143]
CV mean score: 0.8548, std: 0.0171.

Comparison

BML scores slightly higher; a likely reason is that the self-built model did not apply any class rebalancing.
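A natural follow-up would be LightGBM's built-in handling of imbalanced labels; a hedged sketch of a hypothetical next run, not something done here:

# Hedged sketch: let LightGBM up-weight the minority (churn) class
lgb_params['is_unbalance'] = True
# or, instead (the two options are mutually exclusive), set an explicit ratio:
# lgb_params['scale_pos_weight'] = 6347 / 1653  # neg/pos counts from y_train.value_counts()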
