Project Overview

Techniques: data mining, feature engineering, machine learning, deep learning

Competition: Digital China Innovation Contest - Consumer Profile Intelligent Credit Scoring

Competition link

Background: co-written with my friend Liu Xin, co-author of our third-place solution in the 58.com AI Algorithm Contest

Solution link

Competition Background

As the construction of the social credit system deepens, credit standardization is developing rapidly and related standards are being released one after another. A multi-level standard system, covering credit services, credit data collection and services, credit repair, city credit, industry credit and more, is waiting to be issued, and the social credit standard system is expected to advance quickly. Credit service institutions across industries are deeply involved in advertising, government affairs, finance, shared bicycles, tourism, major investment projects, education, environmental protection, and the broader credit system. Building a social credit system is a systems-engineering effort; telecom operators, as indispensable enterprises in society, likewise need to build their own credit-scoring systems and help upgrade the credit system of society as a whole. The state also encourages data exchange between third-party credit service institutions and the government, to strengthen the core competitiveness of public credit information centers.

Traditional credit scores are measured from only a few dimensions, such as a customer's spending power, and can hardly reflect a customer's credit comprehensively, objectively, and in a timely manner. As a telecom operator, China Mobile holds massive, broad, high-quality, and highly timely data; how to score customers intelligently on top of this rich big data is the problem China Mobile and Newland Group are currently tackling. An operator-side intelligent credit-scoring system would not only improve the social credit system but also create rich value inside China Mobile, including better service quality for GoTone customers, credit-based control of overdue balances, and tiered business discounts by credit level. Through this modeling competition, the organizers hope to collect excellent model systems that accurately estimate users' credit scores.

Data Description

The data covers several aspects of each user: identity, spending power, social connections, location traces, and app-usage preferences. The fields are described below:

  1. User ID — numeric, unique
  2. Whether real-name verification passed — 1 yes / 0 no
  3. User age — numeric
  4. Whether a college-student customer — 1 yes / 0 no
  5. Whether a blacklisted customer — 1 yes / 0 no
  6. Whether an "unhealthy" 4G customer — 1 yes / 0 no
  7. User network age (months) — numeric
  8. Months since the user's last payment — numeric
  9. Amount of the last payment (yuan) — numeric
  10. Average monthly bill over the last 6 months (yuan) — numeric
  11. Total bill for the current month (yuan) — numeric
  12. Account balance for the current month (yuan) — numeric
  13. Whether the user currently has overdue fees — 1 yes / 0 no
  14. Fee sensitivity — level 1 indicates the highest sensitivity. The levels are derived from an extreme-value calculation and sub-indicator weights: users are sorted by an intermediate score in descending order, then the top 5% are assigned level 1, the next 15% level 2, the next 15% level 3, the next 25% level 4, and the last 40% level 5.
  15. Number of people in the current month's call circle — numeric
  16. Whether a frequent mall visitor — 1 yes / 0 no
  17. Average monthly mall appearances over the last three months — numeric
  18. Whether the user visited Fuzhou Cangshan Wanda this month — 1 yes / 0 no
  19. Whether the user visited the Fuzhou Sam's Club this month — 1 yes / 0 no
  20. Whether the user watched a movie this month — 1 yes / 0 no
  21. Whether the user visited a tourist attraction this month — 1 yes / 0 no
  22. Whether the user spent at a stadium or gym this month — 1 yes / 0 no
  23. Number of shopping-app uses this month — numeric
  24. Number of logistics/express-app uses this month — numeric
  25. Total number of finance-app uses this month — numeric
  26. Number of video-app uses this month — numeric
  27. Number of flight-app uses this month — numeric
  28. Number of train-app uses this month — numeric
  29. Number of travel-news-app uses this month — numeric
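The quantile rule behind field 14 can be sketched in a few lines of pandas. This is only an illustration of the published 5%/15%/15%/25%/40% cut-offs, not the operator's actual pipeline — the underlying intermediate score is not released, so a hypothetical `scores` series stands in for it:

```python
import numpy as np
import pandas as pd

# Hypothetical intermediate sensitivity scores (the real ones are not published).
scores = pd.Series(np.arange(100, 0, -1), dtype=float)

# Rank users by score in descending order as a fraction of the population,
# then cut at the stated cumulative shares: 5% / 15% / 15% / 25% / 40%.
pct_rank = scores.rank(ascending=False, pct=True)
bounds = [0.0, 0.05, 0.20, 0.35, 0.60, 1.0]
levels = pd.cut(pct_rank, bins=bounds, labels=[1, 2, 3, 4, 5]).astype(int)
```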

Evaluation

The competition metric is MAE.

Mean absolute error measures how close the model's predictions are to the ground truth. It is computed as:

$$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|pred_{i}-y_{i}\right|$$

where $pred_{i}$ is the predicted value and $y_{i}$ the true value. The smaller the $MAE$, the closer the predictions are to the truth. The final score is:

$$Score=\frac{1}{1+MAE}$$

The closer the final score is to 1, the better.
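The metric is easy to compute directly; here is a minimal helper (the function name is mine, not from any official kit):

```python
import numpy as np

def competition_score(pred, y):
    """Return (MAE, Score) exactly as defined above."""
    mae = float(np.mean(np.abs(np.asarray(pred, dtype=float) - np.asarray(y, dtype=float))))
    return mae, 1.0 / (1.0 + mae)

# Toy example: two predictions, each off by 5 points.
mae, score = competition_score([620, 640], [625, 635])  # MAE = 5.0, Score = 1/6
```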

Overall Exploration

As with any data competition, the first thing to do after getting the data is to inspect and analyze it, building a general picture of the task and the data. Let's start with an overall exploration.

""" 导入基本库 """
import tqdm
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
plt.style.use("bmh")
plt.rc('font', family='SimHei', size=13)
pd.set_option('display.max_columns',1000)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth',1000)

From the data list we know the competition provides one training archive and one test archive. After unzipping, we concatenate them directly so that all subsequent transformations are applied consistently to both.

train_data = pd.read_csv('data/data113102/train_dataset.csv')
test_data = pd.read_csv('data/data113102/test_dataset.csv')
columns = ['user_id', 'real_name', 'age', 'whether_college_students',
           'whether_blacklist_customer', 'whether_4G_unhealthy_customers',
           'user_network_age', 'last_payment_long', 'last_payment_amount',
           'average_consumption_value', 'all_fee', 'balance', 'whether_payment_owed',
           'call_sensitivity', 'number_people_circle', 'whether_often_shopping',
           'average_number_appearance', 'whether_visited_Wanda',
           'whether_visited_member_store', 'whether_watch_movie',
           'whether_attraction', 'whether_stadium_consumption',
           'shopping_app_usage', 'express_app_usage', 'financial_app_usage',
           'video_app_usage', 'aircraft_app_usage', 'train_app_usage',
           'tourism_app_usage', 'label']
train_data.columns = columns
test_data.columns = columns[:-1]
df_data = pd.concat([train_data, test_data], ignore_index=True)

df_data.head()

(output: the first five rows of `df_data` — 30 columns wide, omitted here for brevity)
""" 数据属性 """

df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 30 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   user_id                         100000 non-null  object 
 1   real_name                       100000 non-null  int64  
 2   age                             100000 non-null  int64  
 3   whether_college_students        100000 non-null  int64  
 4   whether_blacklist_customer      100000 non-null  int64  
 5   whether_4G_unhealthy_customers  100000 non-null  int64  
 6   user_network_age                100000 non-null  int64  
 7   last_payment_long               100000 non-null  int64  
 8   last_payment_amount             100000 non-null  float64
 9   average_consumption_value       100000 non-null  float64
 10  all_fee                         100000 non-null  float64
 11  balance                         100000 non-null  int64  
 12  whether_payment_owed            100000 non-null  int64  
 13  call_sensitivity                100000 non-null  int64  
 14  number_people_circle            100000 non-null  int64  
 15  whether_often_shopping          100000 non-null  int64  
 16  average_number_appearance       100000 non-null  int64  
 17  whether_visited_Wanda           100000 non-null  int64  
 18  whether_visited_member_store    100000 non-null  int64  
 19  whether_watch_movie             100000 non-null  int64  
 20  whether_attraction              100000 non-null  int64  
 21  whether_stadium_consumption     100000 non-null  int64  
 22  shopping_app_usage              100000 non-null  int64  
 23  express_app_usage               100000 non-null  int64  
 24  financial_app_usage             100000 non-null  int64  
 25  video_app_usage                 100000 non-null  int64  
 26  aircraft_app_usage              100000 non-null  int64  
 27  train_app_usage                 100000 non-null  int64  
 28  tourism_app_usage               100000 non-null  int64  
 29  label                           50000 non-null   float64
dtypes: float64(4), int64(25), object(1)
memory usage: 22.9+ MB
print("共有数据集:", df_data.shape[0])
print("共有测试集:", test_data.shape[0])
print("共有训练集:", train_data.shape[0])
共有数据集: 100000
共有测试集: 50000
共有训练集: 50000
""" number of distinct values per feature """

for i,name in enumerate(df_data.columns):
    name_sum = df_data[name].value_counts().shape[0] 
    print("{}、{}      The number of types of features is:{}".format(i + 1, name, name_sum))
1、user_id      The number of types of features is:100000
2、real_name      The number of types of features is:2
3、age      The number of types of features is:88
4、whether_college_students      The number of types of features is:2
5、whether_blacklist_customer      The number of types of features is:2
6、whether_4G_unhealthy_customers      The number of types of features is:2
7、user_network_age      The number of types of features is:283
8、last_payment_long      The number of types of features is:2
9、last_payment_amount      The number of types of features is:532
10、average_consumption_value      The number of types of features is:22520
11、all_fee      The number of types of features is:16597
12、balance      The number of types of features is:316
13、whether_payment_owed      The number of types of features is:2
14、call_sensitivity      The number of types of features is:6
15、number_people_circle      The number of types of features is:554
16、whether_often_shopping      The number of types of features is:2
17、average_number_appearance      The number of types of features is:93
18、whether_visited_Wanda      The number of types of features is:2
19、whether_visited_member_store      The number of types of features is:2
20、whether_watch_movie      The number of types of features is:2
21、whether_attraction      The number of types of features is:2
22、whether_stadium_consumption      The number of types of features is:2
23、shopping_app_usage      The number of types of features is:8382
24、express_app_usage      The number of types of features is:239
25、financial_app_usage      The number of types of features is:7232
26、video_app_usage      The number of types of features is:16067
27、aircraft_app_usage      The number of types of features is:209
28、train_app_usage      The number of types of features is:180
29、tourism_app_usage      The number of types of features is:934
30、label      The number of types of features is:278
# """ 数据统计 """

df_data.describe()

(output: `df_data.describe()` summary over all 100,000 rows — 29 numeric columns, omitted here for brevity)
# """ 观察训练/测试集数据同分布状况 """

df_data[df_data['label'].isnull()].describe()
(output: `describe()` of the 50,000 test rows, where label is NaN — omitted for brevity)
df_data[df_data['label'].notnull()].describe()
(output: `describe()` of the 50,000 training rows — omitted for brevity)

**Conclusion 1:** the data is in good shape: every feature is numeric, so we can build a model directly and validate an offline score. However, most features show long tails — for example, the max of the travel-news-app usage count is 87,681, far from its mean of about 19 — so we will analyze these features separately later.

**Conclusion 2:** long-tail values appear in both the training and test sets, so they are not necessarily anomalies; whether to treat them as such must be decided by offline validation.
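One cheap treatment worth validating offline is quantile capping rather than row deletion, since the test set shows the same tails and test rows cannot be dropped from a submission. A sketch on a toy column (the 0.99 threshold is an assumption to be tuned, not a value from the solution):

```python
import pandas as pd

# Toy long-tailed column; real columns such as tourism_app_usage would be
# treated the same way, with the cap chosen by offline validation.
col = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])

cap = col.quantile(0.99)       # cap at a high quantile instead of dropping rows
clipped = col.clip(upper=cap)  # non-tail values are left untouched
```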

Feature Exploration

Next we analyze how each feature relates to the credit score and carry out the corresponding feature exploration.

# """ 拖尾/顺序特征分析 """

f, ax = plt.subplots(figsize=(20, 6))
sns.scatterplot(data=df_data, x='number_people_circle', y='label', color='k', ax=ax)

plt.show()

(Figure: scatter plot of number_people_circle vs. label)

name_list = ['shopping_app_usage', 'express_app_usage', 'financial_app_usage','video_app_usage', 'aircraft_app_usage', 'train_app_usage',
           'tourism_app_usage', 'last_payment_amount', 'average_consumption_value', 'all_fee']

f, ax = plt.subplots(3, 4, figsize=(20, 20))

for i,name in enumerate(name_list):     
    sns.scatterplot(data=df_data, x=name, y='label', color='b', ax=ax[i // 4][i % 4])

plt.show()

(Figure: scatter plots of the app-usage and fee features vs. label)

f, ax = plt.subplots(1, 3, figsize=(20, 6))

sns.kdeplot(data=df_data['aircraft_app_usage'], color='r', shade=True, ax=ax[0])
sns.kdeplot(data=df_data['train_app_usage'], color='c', shade=True, ax=ax[1])
sns.kdeplot(data=df_data['tourism_app_usage'], color='b', shade=True, ax=ax[2])

plt.show()

(Figure: density plots of aircraft_app_usage, train_app_usage, and tourism_app_usage)

**Conclusion:** the scatter plots confirm the long-tail distributions noted earlier. But long-tail values are not necessarily useless — just as a missing value can itself carry meaning — so any treatment of them should later be validated offline against the model.

# """ 离散特征分析 """

f, ax = plt.subplots(1, 2, figsize=(20, 6))
sns.boxplot(data=df_data, x='last_payment_long', y='label', ax=ax[0])
sns.boxplot(data=df_data, x='whether_payment_owed', y='label', ax=ax[1])

plt.show()

(Figure: box plots of label grouped by last_payment_long and whether_payment_owed)

name_list = ['whether_college_students','whether_blacklist_customer', 'whether_4G_unhealthy_customers','whether_often_shopping',
           'whether_visited_Wanda','whether_visited_member_store', 'whether_watch_movie',
           'whether_attraction', 'whether_stadium_consumption','whether_payment_owed']
             
f, ax = plt.subplots(2, 5, figsize=(20, 12))

for i,name in enumerate(name_list):
    sns.boxplot(data=df_data, x=name, y='label', ax=ax[i // 5][i % 5])

plt.show()

(Figure: box plots of label grouped by each binary feature)

f, ax = plt.subplots(figsize=(10, 6))

sns.boxplot(data=df_data, x='call_sensitivity', y='label', ax=ax)

plt.show()

(Figure: box plot of label grouped by call_sensitivity)

Data preprocessing covers a lot of ground, including feature engineering, and is the most labor-intensive part. For clarity, here is a rough list of the methods used in this stage:

  1. Data cleaning: missing values, outliers, consistency;

  2. Feature encoding: one-hot and label encoding;

  3. Feature binning: equal-frequency, equal-width, clustering, etc.;

  4. Derived variables: interpretable and suitable as model inputs;

  5. Feature selection: variance-based, chi-squared, regularization, etc.
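As a concrete illustration of items 2 and 3, equal-frequency binning followed by one-hot encoding can be sketched with pandas. The column and bin count here are hypothetical choices for demonstration:

```python
import numpy as np
import pandas as pd

fees = pd.Series(np.arange(1.0, 101.0), name='all_fee')  # toy fee values

# Equal-frequency binning: each of the 4 bins holds ~25% of the samples.
fee_bin = pd.qcut(fees, q=4, labels=False)

# One-hot encode the bin id; keeping fee_bin itself would be label encoding.
fee_onehot = pd.get_dummies(fee_bin, prefix='all_fee_bin')
```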

# df_data = df_data[df_data['number_people_circle'] <= 1750].reset_index(drop=True)  # optional long-tail filter, validated offline
df_data['fee_ratio'] = df_data['all_fee'] / (df_data['balance'] + 1)
df_data['fee_diff'] = df_data['all_fee'] - df_data['average_consumption_value']
df_data['5month_all_fee'] = df_data['average_consumption_value'] * 6 - df_data['all_fee']
df_data['fee_tend'] = df_data['last_payment_amount'] / (df_data['average_consumption_value'] + 1)
df_data['is_bazaar'] = (df_data['whether_visited_Wanda'] + df_data['whether_visited_member_store']).map(lambda x: 1 if x > 0 else 0)
df_data['count_sum'] = df_data['shopping_app_usage'] + df_data['express_app_usage'] + df_data['financial_app_usage'] + df_data['video_app_usage'] + df_data['aircraft_app_usage'] + df_data['train_app_usage'] + df_data['tourism_app_usage']
df_data['user_network_age_month'] = df_data['user_network_age'].apply(lambda x: x % 12)
def get_count(df, column, feature):
    df['idx'] = range(len(df))
    temp = df.groupby(column)['user_id'].agg([(feature, 'count')]).reset_index()
    df = df.merge(temp)
    df = df.sort_values('idx').drop('idx', axis=1).reset_index(drop=True)
    return df
for i in ['last_payment_amount', 'all_fee', 'average_consumption_value', 'fee_diff', ['all_fee', 'average_consumption_value']]:
    df_data = get_count(df_data, i, 'cnt_{}'.format(i))
from sklearn.model_selection import train_test_split
feature_col = [tmp_col for tmp_col in df_data.columns if tmp_col not in ['user_id', 'label']]
print(len(feature_col))

st_model = StandardScaler()
for i in feature_col:
    df_data[i] = st_model.fit_transform(df_data[[i]].values)
df_data['label'] = st_model.fit_transform(df_data[['label']].values)

train_data = df_data[:train_data.shape[0]]
test_data = df_data[train_data.shape[0]:]
X_train, X_vaild, y_train, y_vaild = train_test_split(train_data[feature_col],train_data['label'],test_size = 0.2,random_state = 0)
print(X_train.shape, X_vaild.shape)
40
(40000, 40) (10000, 40)


# for i in feature_col:
#     X_train[i] = st_model.fit_transform(X_train[[i]].values)
#     X_vaild[i] = st_model.fit_transform(X_vaild[[i]].values)
#     test_data[i] = st_model.fit_transform(test_data[[i]].values)

Model

import paddle
import paddle.fluid as fluid
class WB_Dataset(paddle.io.Dataset):
    def __init__(self, data, feature_cols, lab=None, dev=False):
        super().__init__()
        self.feature_cols = feature_cols
        self.lens = data.shape[0]

        self.features = data[feature_cols].values.astype(np.float32)
        self.lab = lab
        self.dev = dev
        if not dev:
            self.labels = lab.values.astype(np.float32)  # regression target, keep as float

    def __getitem__(self, idx):
        features_input = list(self.features[[idx]])
        if self.dev == False:
            label_input = self.labels[[idx]]
            return features_input, label_input
        else:
            return features_input

    def __len__(self):
        return self.lens
train_dataset = WB_Dataset(X_train, feature_col, y_train, False)
valid_dataset = WB_Dataset(X_vaild, feature_col, y_vaild, False)
test_dataset = WB_Dataset(test_data[feature_col], feature_col, None, True)
import paddle.nn as nn
class Mlp(nn.Layer):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.ReLU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        # self.bn = nn.functional.batch_norm(in_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        # x = self.bn(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        return x

# define the LSTM network
import paddle.fluid as fluid
class MyLSTMModel(fluid.dygraph.Layer):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.ReLU, drop=0.):
        super(MyLSTMModel,self).__init__()
        self.rnn = paddle.nn.LSTM(in_features, hidden_features, 2, dropout=drop)
        self.flatten = paddle.nn.Flatten()
        self.fc1=fluid.dygraph.Linear(120*14,120)
        self.fc2=fluid.dygraph.Linear(120,out_features)

        
    def forward(self, input):        # forward defines the network's runtime execution logic
        '''forward pass'''
        # print('input',input.shape)
        out, (h, c)=self.rnn(input)
        out =self.flatten(out)
        out=self.fc1(out)
        out=self.fc2(out)
        return out
def train_model(train_dataset, valid_dataset, model, optimizer, verbose=100, epochs=5, batch_size=64, shuffle=True):

    with paddle.fluid.dygraph.guard(paddle.CUDAPlace(0)):
        train_loader = paddle.io.DataLoader(train_dataset, batch_size=batch_size, shuffle=shuffle, places=paddle.CUDAPlace(0))
        valid_loader = paddle.io.DataLoader(valid_dataset, batch_size=batch_size, shuffle=False, places=paddle.CUDAPlace(0))
        test_loader = paddle.io.DataLoader(test_dataset, batch_size=batch_size, shuffle=False, places=paddle.CUDAPlace(0))  # keep test order so predictions align with submission rows

        print('start training ... ')

        loss_list = []

        for epoch in range(epochs):
            for t_batch_id, t_data in enumerate(train_loader()):

                label_data = paddle.to_tensor(t_data[1],dtype='float32')
                feature=paddle.to_tensor(t_data[0][0], dtype='float32')
                predicts = model(feature)

                loss = nn.functional.mse_loss(predicts, label_data, reduction='mean')
                loss.backward()
                if t_batch_id % (verbose) == 0:
                    print("epoch: {}, batch_id: {}, loss : {}".format(epoch, t_batch_id, math.sqrt(loss.numpy())))  
                    loss_list.append(loss.numpy())
                optimizer.step()
                # clear gradients
                optimizer.clear_grad()

        predict_list = []
        # after training, run inference on the test set
        for v_batch_id, v_data in enumerate(test_loader()):

            feature = paddle.to_tensor(v_data[0][0], dtype='float32')
            predicts = model(feature)
            predict_list.extend(predicts)

        paddle.save(model.state_dict(), "./model/model.pdparams")
        paddle.save(optimizer.state_dict(), "./model/adam.pdopt")

        return predict_list, loss_list

# model hyper-parameters
feature_number = 40
hidden_features = 20
out_features = 1

# training hyper-parameters
epochs = 2
batchsize=512
learning_rate=1e-5
verbose=10
model=Mlp(feature_number, hidden_features, out_features)
optim = paddle.optimizer.Adam(learning_rate=learning_rate, parameters=model.parameters())  # Adam optimizer

result, loss_list= train_model(train_dataset, valid_dataset,model,optimizer=optim,verbose=verbose,epochs=epochs,batch_size=batchsize,shuffle=True)
start training ... 
epoch: 0, batch_id: 0, loss : 1.0221488935549963
epoch: 0, batch_id: 10, loss : 1.0319473914036887
epoch: 0, batch_id: 20, loss : 1.1616030832364013
epoch: 0, batch_id: 30, loss : 0.9790661875256286
epoch: 0, batch_id: 40, loss : 1.0165814517740945
epoch: 0, batch_id: 50, loss : 0.9623618385930143
epoch: 0, batch_id: 60, loss : 1.050502584157359
epoch: 0, batch_id: 70, loss : 1.0234238718304822
epoch: 1, batch_id: 0, loss : 1.0975484557523207
epoch: 1, batch_id: 10, loss : 1.086625683856127
epoch: 1, batch_id: 20, loss : 0.9919022578127885
epoch: 1, batch_id: 30, loss : 1.0404108806648333
epoch: 1, batch_id: 40, loss : 1.093106216609532
epoch: 1, batch_id: 50, loss : 1.018006920849071
epoch: 1, batch_id: 60, loss : 1.0005267663184374
epoch: 1, batch_id: 70, loss : 1.1276594622696936
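Because the label column was standardized before training (`df_data['label'] = st_model.fit_transform(...)` above), the network's outputs live on the standardized scale and must be mapped back before submission. A minimal sketch, with a toy scaler standing in for the one fit on the label:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the scaler that was fit on the label column (st_model above).
scaler = StandardScaler().fit(np.array([[422.0], [618.0], [719.0]]))

preds_std = np.array([[0.0], [1.0]])               # toy standardized model outputs
preds_score = scaler.inverse_transform(preds_std)  # back to the credit-score scale
```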

Post-competition Summary

Later I teamed up with a group of excellent teammates. We designed a business-driven model for ensembling and finished Top 1 on leaderboard A and Top 1 on our first leaderboard-B submission, taking the championship among 2,522 teams from home and abroad.

In practice, the time you invest in a competition should usually be allocated as: feature engineering > model ensembling > model choice > parameter tuning, or: model ensembling > feature engineering > model choice > parameter tuning.

This article covered my experience and journey in the China Mobile consumer-profile competition, reproducing the path from getting started to winning. You can practice and learn with this competition yourself; much of this knowledge is only truly understood through practice. Competition results are useful, but what matters more is what you learn along the way. Doing well requires a large time investment, and often that investment will not pay off immediately — don't be discouraged; believe in yourself and keep putting in the effort and practice.

About the Author

Zheng Yuxuan, M.S. in Computer Science from East China Normal University, researches multimodal recommender systems and natural language processing; he has published one CCF-A paper and won prizes in many top domestic and international algorithm competitions.
His results include champion of the 2019 Alibaba Tianchi Global Data Intelligence Competition, the 2019 Digital China Innovation Contest, the 2019 National College Big Data Application Innovation Competition, and the 2019 Shanghai College Student Computer Contest.
He won the 2021 DeeCamp AI camp track championship and the overall championship with the team project GeneBERT, the first publicly released large-scale multimodal gene pre-training model in China, which was highly praised by Zhang Ya-Qin, Chen Weiying, and other reviewers during the final presentation.