Project Overview

Techniques: data mining, feature engineering, machine learning, deep learning

Competition: Digital China Innovation Contest - Consumer Profile Intelligent Credit Scoring

Competition link

Background: co-written with my friend Liu Xin, co-author of our third-place solution in the 58.com AI Algorithm Contest

Solution link

Competition Background

As the construction of the social credit system deepens, credit standardization is developing rapidly and related standards are being released one after another. A multi-level standard system, covering credit services, credit data collection and services, credit repair, city credit, industry credit and more, is waiting to be issued, and the social credit standard system is expected to advance quickly. Credit service institutions across industries are deeply involved in advertising, government affairs, finance, shared bicycles, tourism, major investment projects, education, environmental protection, and the broader credit system. Building a social credit system is a systems-engineering effort; telecom operators, as indispensable enterprises in society, likewise need to build their own credit-scoring systems and help upgrade the credit system of society as a whole. The state also encourages data exchange between third-party credit service institutions and the government, to strengthen the core competitiveness of public credit information centers.

Traditional credit scores are measured from only a few dimensions, such as a customer's spending power, and can hardly reflect a customer's credit comprehensively, objectively, and in a timely manner. As a telecom operator, China Mobile holds massive, broad, high-quality, and highly timely data; how to score customers intelligently on top of this rich big data is the problem China Mobile and Newland Group are currently tackling. An operator-side intelligent credit-scoring system would not only improve the social credit system but also create rich value inside China Mobile, including better service quality for GoTone customers, credit-based control of overdue balances, and tiered business discounts by credit level. Through this modeling competition, the organizers hope to collect excellent model systems that accurately estimate users' credit scores.

Data Description

The data covers several aspects of each user: identity, spending power, social connections, location traces, and app-usage preferences. The fields are described below:

  1. User ID — numeric, unique
  2. Whether real-name verification passed — 1 yes / 0 no
  3. User age — numeric
  4. Whether a college-student customer — 1 yes / 0 no
  5. Whether a blacklisted customer — 1 yes / 0 no
  6. Whether an "unhealthy" 4G customer — 1 yes / 0 no
  7. User network age (months) — numeric
  8. Months since the user's last payment — numeric
  9. Amount of the last payment (yuan) — numeric
  10. Average monthly bill over the last 6 months (yuan) — numeric
  11. Total bill for the current month (yuan) — numeric
  12. Account balance for the current month (yuan) — numeric
  13. Whether the user currently has overdue fees — 1 yes / 0 no
  14. Fee sensitivity — level 1 indicates the highest sensitivity. The levels are derived from an extreme-value calculation and sub-indicator weights: users are sorted by an intermediate score in descending order, then the top 5% are assigned level 1, the next 15% level 2, the next 15% level 3, the next 25% level 4, and the last 40% level 5.
  15. Number of people in the current month's call circle — numeric
  16. Whether a frequent mall visitor — 1 yes / 0 no
  17. Average monthly mall appearances over the last three months — numeric
  18. Whether the user visited Fuzhou Cangshan Wanda this month — 1 yes / 0 no
  19. Whether the user visited the Fuzhou Sam's Club this month — 1 yes / 0 no
  20. Whether the user watched a movie this month — 1 yes / 0 no
  21. Whether the user visited a tourist attraction this month — 1 yes / 0 no
  22. Whether the user spent at a stadium or gym this month — 1 yes / 0 no
  23. Number of shopping-app uses this month — numeric
  24. Number of logistics/express-app uses this month — numeric
  25. Total number of finance-app uses this month — numeric
  26. Number of video-app uses this month — numeric
  27. Number of flight-app uses this month — numeric
  28. Number of train-app uses this month — numeric
  29. Number of travel-news-app uses this month — numeric
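The quantile rule behind field 14 can be sketched in a few lines of pandas. This is only an illustration of the published 5%/15%/15%/25%/40% cut-offs, not the operator's actual pipeline — the underlying intermediate score is not released, so a hypothetical `scores` series stands in for it:

```python
import numpy as np
import pandas as pd

# Hypothetical intermediate sensitivity scores (the real ones are not published).
scores = pd.Series(np.arange(100, 0, -1), dtype=float)

# Rank users by score in descending order as a fraction of the population,
# then cut at the stated cumulative shares: 5% / 15% / 15% / 25% / 40%.
pct_rank = scores.rank(ascending=False, pct=True)
bounds = [0.0, 0.05, 0.20, 0.35, 0.60, 1.0]
levels = pd.cut(pct_rank, bins=bounds, labels=[1, 2, 3, 4, 5]).astype(int)
```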

Evaluation

The competition metric is MAE.

Mean absolute error measures how close the model's predictions are to the ground truth. It is computed as:

$$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|pred_{i}-y_{i}\right|$$

where $pred_{i}$ is the predicted value and $y_{i}$ the true value. The smaller the $MAE$, the closer the predictions are to the truth. The final score is:

$$Score=\frac{1}{1+MAE}$$

The closer the final score is to 1, the better.
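The metric is easy to compute directly; here is a minimal helper (the function name is mine, not from any official kit):

```python
import numpy as np

def competition_score(pred, y):
    """Return (MAE, Score) exactly as defined above."""
    mae = float(np.mean(np.abs(np.asarray(pred, dtype=float) - np.asarray(y, dtype=float))))
    return mae, 1.0 / (1.0 + mae)

# Toy example: two predictions, each off by 5 points.
mae, score = competition_score([620, 640], [625, 635])  # MAE = 5.0, Score = 1/6
```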

Overall Exploration

As with any data competition, the first thing to do after getting the data is to inspect and analyze it, building a general picture of the task and the data. Let's start with an overall exploration.

""" 导入基本库 """
import tqdm
import math
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
plt.style.use("bmh")
plt.rc('font', family='SimHei', size=13)
pd.set_option('display.max_columns',1000)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth',1000)

From the data list we know the competition provides one training archive and one test archive. After unzipping, we concatenate them directly so that all subsequent transformations are applied consistently to both.

train_data = pd.read_csv('data/data113102/train_dataset.csv')
test_data = pd.read_csv('data/data113102/test_dataset.csv')
columns = ['user_id', 'real_name', 'age', 'whether_college_students',
           'whether_blacklist_customer', 'whether_4G_unhealthy_customers',
           'user_network_age', 'last_payment_long', 'last_payment_amount',
           'average_consumption_value', 'all_fee', 'balance', 'whether_payment_owed',
           'call_sensitivity', 'number_people_circle', 'whether_often_shopping',
           'average_number_appearance', 'whether_visited_Wanda',
           'whether_visited_member_store', 'whether_watch_movie',
           'whether_attraction', 'whether_stadium_consumption',
           'shopping_app_usage', 'express_app_usage', 'financial_app_usage',
           'video_app_usage', 'aircraft_app_usage', 'train_app_usage',
           'tourism_app_usage', 'label']
train_data.columns = columns
test_data.columns = columns[:-1]
df_data = pd.concat([train_data, test_data], ignore_index=True)

df_data.head()

(output: the first five rows of `df_data` — 30 columns wide, omitted here for brevity)
""" 数据属性 """

df_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 30 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   user_id                         100000 non-null  object 
 1   real_name                       100000 non-null  int64  
 2   age                             100000 non-null  int64  
 3   whether_college_students        100000 non-null  int64  
 4   whether_blacklist_customer      100000 non-null  int64  
 5   whether_4G_unhealthy_customers  100000 non-null  int64  
 6   user_network_age                100000 non-null  int64  
 7   last_payment_long               100000 non-null  int64  
 8   last_payment_amount             100000 non-null  float64
 9   average_consumption_value       100000 non-null  float64
 10  all_fee                         100000 non-null  float64
 11  balance                         100000 non-null  int64  
 12  whether_payment_owed            100000 non-null  int64  
 13  call_sensitivity                100000 non-null  int64  
 14  number_people_circle            100000 non-null  int64  
 15  whether_often_shopping          100000 non-null  int64  
 16  average_number_appearance       100000 non-null  int64  
 17  whether_visited_Wanda           100000 non-null  int64  
 18  whether_visited_member_store    100000 non-null  int64  
 19  whether_watch_movie             100000 non-null  int64  
 20  whether_attraction              100000 non-null  int64  
 21  whether_stadium_consumption     100000 non-null  int64  
 22  shopping_app_usage              100000 non-null  int64  
 23  express_app_usage               100000 non-null  int64  
 24  financial_app_usage             100000 non-null  int64  
 25  video_app_usage                 100000 non-null  int64  
 26  aircraft_app_usage              100000 non-null  int64  
 27  train_app_usage                 100000 non-null  int64  
 28  tourism_app_usage               100000 non-null  int64  
 29  label                           50000 non-null   float64
dtypes: float64(4), int64(25), object(1)
memory usage: 22.9+ MB
print("共有数据集:", df_data.shape[0])
print("共有测试集:", test_data.shape[0])
print("共有训练集:", train_data.shape[0])
共有数据集: 100000
共有测试集: 50000
共有训练集: 50000
""" number of distinct values per feature """

for i,name in enumerate(df_data.columns):
    name_sum = df_data[name].value_counts().shape[0] 
    print("{}、{}      The number of types of features is:{}".format(i + 1, name, name_sum))
1、user_id      The number of types of features is:100000
2、real_name      The number of types of features is:2
3、age      The number of types of features is:88
4、whether_college_students      The number of types of features is:2
5、whether_blacklist_customer      The number of types of features is:2
6、whether_4G_unhealthy_customers      The number of types of features is:2
7、user_network_age      The number of types of features is:283
8、last_payment_long      The number of types of features is:2
9、last_payment_amount      The number of types of features is:532
10、average_consumption_value      The number of types of features is:22520
11、all_fee      The number of types of features is:16597
12、balance      The number of types of features is:316
13、whether_payment_owed      The number of types of features is:2
14、call_sensitivity      The number of types of features is:6
15、number_people_circle      The number of types of features is:554
16、whether_often_shopping      The number of types of features is:2
17、average_number_appearance      The number of types of features is:93
18、whether_visited_Wanda      The number of types of features is:2
19、whether_visited_member_store      The number of types of features is:2
20、whether_watch_movie      The number of types of features is:2
21、whether_attraction      The number of types of features is:2
22、whether_stadium_consumption      The number of types of features is:2
23、shopping_app_usage      The number of types of features is:8382
24、express_app_usage      The number of types of features is:239
25、financial_app_usage      The number of types of features is:7232
26、video_app_usage      The number of types of features is:16067
27、aircraft_app_usage      The number of types of features is:209
28、train_app_usage      The number of types of features is:180
29、tourism_app_usage      The number of types of features is:934
30、label      The number of types of features is:278
# """ 数据统计 """

df_data.describe()

(output: `df_data.describe()` summary over all 100,000 rows — 29 numeric columns, omitted here for brevity)
# """ 观察训练/测试集数据同分布状况 """

df_data[df_data['label'].isnull()].describe()
(output: `describe()` of the 50,000 test rows, where label is NaN — omitted for brevity)
df_data[df_data['label'].notnull()].describe()
(output: `describe()` of the 50,000 training rows — omitted for brevity)

**Conclusion 1:** the data is in good shape: every feature is numeric, so we can build a model directly and validate an offline score. However, most features show long tails — for example, the max of the travel-news-app usage count is 87,681, far from its mean of about 19 — so we will analyze these features separately later.

**Conclusion 2:** long-tail values appear in both the training and test sets, so they are not necessarily anomalies; whether to treat them as such must be decided by offline validation.
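One cheap treatment worth validating offline is quantile capping rather than row deletion, since the test set shows the same tails and test rows cannot be dropped from a submission. A sketch on a toy column (the 0.99 threshold is an assumption to be tuned, not a value from the solution):

```python
import pandas as pd

# Toy long-tailed column; real columns such as tourism_app_usage would be
# treated the same way, with the cap chosen by offline validation.
col = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])

cap = col.quantile(0.99)       # cap at a high quantile instead of dropping rows
clipped = col.clip(upper=cap)  # non-tail values are left untouched
```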

Feature Exploration

Next we analyze how each feature relates to the credit score and carry out the corresponding feature exploration.

# """ 拖尾/顺序特征分析 """

f, ax = plt.subplots(figsize=(20, 6))
sns.scatterplot(data=df_data, x='number_people_circle', y='label', color='k', ax=ax)

plt.show()

(Figure: scatter plot of number_people_circle vs. label)

name_list = ['shopping_app_usage', 'express_app_usage', 'financial_app_usage','video_app_usage', 'aircraft_app_usage', 'train_app_usage',
           'tourism_app_usage', 'last_payment_amount', 'average_consumption_value', 'all_fee']

f, ax = plt.subplots(3, 4, figsize=(20, 20))

for i,name in enumerate(name_list):     
    sns.scatterplot(data=df_data, x=name, y='label', color='b', ax=ax[i // 4][i % 4])

plt.show()

(Figure: scatter plots of the app-usage and fee features vs. label)

f, ax = plt.subplots(1, 3, figsize=(20, 6))

sns.kdeplot(data=df_data['aircraft_app_usage'], color='r', shade=True, ax=ax[0])
sns.kdeplot(data=df_data['train_app_usage'], color='c', shade=True, ax=ax[1])
sns.kdeplot(data=df_data['tourism_app_usage'], color='b', shade=True, ax=ax[2])

plt.show()

(Figure: density plots of aircraft_app_usage, train_app_usage, and tourism_app_usage)

**Conclusion:** the scatter plots confirm the long-tail distributions noted earlier. But long-tail values are not necessarily useless — just as a missing value can itself carry meaning — so any treatment of them should later be validated offline against the model.

# """ 离散特征分析 """

f, ax = plt.subplots(1, 2, figsize=(20, 6))
sns.boxplot(data=df_data, x='last_payment_long', y='label', ax=ax[0])
sns.boxplot(data=df_data, x='whether_payment_owed', y='label', ax=ax[1])

plt.show()

(Figure: box plots of label grouped by last_payment_long and whether_payment_owed)

name_list = ['whether_college_students','whether_blacklist_customer', 'whether_4G_unhealthy_customers','whether_often_shopping',
           'whether_visited_Wanda','whether_visited_member_store', 'whether_watch_movie',
           'whether_attraction', 'whether_stadium_consumption','whether_payment_owed']
             
f, ax = plt.subplots(2, 5, figsize=(20, 12))

for i,name in enumerate(name_list):
    sns.boxplot(data=df_data, x=name, y='label', ax=ax[i // 5][i % 5])

plt.show()

(Figure: box plots of label grouped by each binary feature)

f, ax = plt.subplots(figsize=(10, 6))

sns.boxplot(data=df_data, x='call_sensitivity', y='label', ax=ax)

plt.show()

(Figure: box plot of label grouped by call_sensitivity)

Data preprocessing covers a lot of ground, including feature engineering, and is the most labor-intensive part. For clarity, here is a rough list of the methods used in this stage:

  1. Data cleaning: missing values, outliers, consistency;

  2. Feature encoding: one-hot and label encoding;

  3. Feature binning: equal-frequency, equal-width, clustering, etc.;

  4. Derived variables: interpretable and suitable as model inputs;

  5. Feature selection: variance-based, chi-squared, regularization, etc.
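As a concrete illustration of items 2 and 3, equal-frequency binning followed by one-hot encoding can be sketched with pandas. The column and bin count here are hypothetical choices for demonstration:

```python
import numpy as np
import pandas as pd

fees = pd.Series(np.arange(1.0, 101.0), name='all_fee')  # toy fee values

# Equal-frequency binning: each of the 4 bins holds ~25% of the samples.
fee_bin = pd.qcut(fees, q=4, labels=False)

# One-hot encode the bin id; keeping fee_bin itself would be label encoding.
fee_onehot = pd.get_dummies(fee_bin, prefix='all_fee_bin')
```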

# df_data = df_data[df_data['number_people_circle'] <= 1750].reset_index(drop=True)  # optional long-tail filter, validated offline
df_data['fee_ratio'] = df_data['all_fee'] / (df_data['balance'] + 1)
df_data['fee_diff'] = df_data['all_fee'] - df_data['average_consumption_value']
df_data['5month_all_fee'] = df_data['average_consumption_value'] * 6 - df_data['all_fee']
df_data['fee_tend'] = df_data['last_payment_amount'] / (df_data['average_consumption_value'] + 1)
df_data['is_bazaar'] = (df_data['whether_visited_Wanda'] + df_data['whether_visited_member_store']).map(lambda x: 1 if x > 0 else 0)
df_data['count_sum'] = df_data['shopping_app_usage'] + df_data['express_app_usage'] + df_data['financial_app_usage'] + df_data['video_app_usage'] + df_data['aircraft_app_usage'] + df_data['train_app_usage'] + df_data['tourism_app_usage']
df_data['user_network_age_month'] = df_data['user_network_age'].apply(lambda x: x % 12)
def get_count(df, column, feature):
    df['idx'] = range(len(df))
    temp = df.groupby(column)['user_id'].agg([(feature, 'count')]).reset_index()
    df = df.merge(temp)
    df = df.sort_values('idx').drop('idx', axis=1).reset_index(drop=True)
    return df
for i in ['last_payment_amount', 'all_fee', 'average_consumption_value', 'fee_diff', ['all_fee', 'average_consumption_value']]:
    df_data = get_count(df_data, i, 'cnt_{}'.format(i))
from sklearn.model_selection import train_test_split
feature_col = [tmp_col for tmp_col in df_data.columns if tmp_col not in ['user_id', 'label']]
print(len(feature_col))

st_model = StandardScaler()
for i in feature_col:
    df_data[i] = st_model.fit_transform(df_data[[i]].values)
df_data['label'] = st_model.fit_transform(df_data[['label']].values)

train_data = df_data[:train_data.shape[0]]
test_data = df_data[train_data.shape[0]:]
X_train, X_vaild, y_train, y_vaild = train_test_split(train_data[feature_col],train_data['label'],test_size = 0.2,random_state = 0)
print(X_train.shape, X_vaild.shape)
40
(40000, 40) (10000, 40)


# for i in feature_col:
#     X_train[i] = st_model.fit_transform(X_train[[i]].values)
#     X_vaild[i] = st_model.fit_transform(X_vaild[[i]].values)
#     test_data[i] = st_model.fit_transform(test_data[[i]].values)

Model

import paddle
import paddle.fluid as fluid
class WB_Dataset(paddle.io.Dataset):
    def __init__(self, data, feature_cols, lab=None, dev=False):
        super().__init__()
        self.feature_cols = feature_cols
        self.lens = data.shape[0]

        self.features = data[feature_cols].values.astype(np.float32)
        self.lab = lab
        self.dev = dev
        if not dev:
            self.labels = lab.values.astype(np.float32)  # regression target, keep as float

    def __getitem__(self, idx):
        features_input = list(self.features[[idx]])
        if self.dev == False:
            label_input = self.labels[[idx]]
            return features_input, label_input
        else:
            return features_input

    def __len__(self):
        return self.lens
train_dataset = WB_Dataset(X_train, feature_col, y_train, False)
valid_dataset = WB_Dataset(X_vaild, feature_col, y_vaild, False)
test_dataset = WB_Dataset(test_data[feature_col], feature_col, None, True)
import paddle.nn as nn
class Mlp(nn.Layer):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.ReLU, drop=0.):
        super().__init__()
        out_features = out_features or in_features
        hidden_features = hidden_features or in_features
        self.fc1 = nn.Linear(in_features, hidden_features)
        # self.bn = nn.functional.batch_norm(in_features)
        self.act = act_layer()
        self.fc2 = nn.Linear(hidden_features, out_features)
        self.drop = nn.Dropout(drop)

    def forward(self, x):
        x = self.fc1(x)
        # x = self.bn(x)
        x = self.act(x)
        x = self.drop(x)
        x = self.fc2(x)
        return x

# define the LSTM network
import paddle.fluid as fluid
class MyLSTMModel(fluid.dygraph.Layer):
    def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.ReLU, drop=0.):
        super(MyLSTMModel,self).__init__()
        self.rnn = paddle.nn.LSTM(in_features, hidden_features, 2, dropout=drop)
        self.flatten = paddle.nn.Flatten()
        self.fc1=fluid.dygraph.Linear(120*14,120)
        self.fc2=fluid.dygraph.Linear(120,out_features)

        
    def forward(self, input):        # forward defines the network's runtime execution logic
        '''forward pass'''
        # print('input',input.shape)
        out, (h, c)=self.rnn(input)
        out =self.flatten(out)
        out=self.fc1(out)
        out=self.fc2(out)
        return out
def train_model(train_dataset, valid_dataset, model, optimizer, verbose=100, epochs=5, batch_size=64, shuffle=True):

    with paddle.fluid.dygraph.guard(paddle.CUDAPlace(0)):
        train_loader = paddle.io.DataLoader(train_dataset, batch_size=batch_size, shuffle=shuffle, places=paddle.CUDAPlace(0))
        valid_loader = paddle.io.DataLoader(valid_dataset, batch_size=batch_size, shuffle=False, places=paddle.CUDAPlace(0))
        test_loader = paddle.io.DataLoader(test_dataset, batch_size=batch_size, shuffle=False, places=paddle.CUDAPlace(0))  # keep test order so predictions align with submission rows

        print('start training ... ')

        loss_list = []

        for epoch in range(epochs):
            for t_batch_id, t_data in enumerate(train_loader()):

                label_data = paddle.to_tensor(t_data[1],dtype='float32')
                feature=paddle.to_tensor(t_data[0][0], dtype='float32')
                predicts = model(feature)

                loss = nn.functional.mse_loss(predicts, label_data, reduction='mean')
                loss.backward()
                if t_batch_id % (verbose) == 0:
                    print("epoch: {}, batch_id: {}, loss : {}".format(epoch, t_batch_id, math.sqrt(loss.numpy())))  
                    loss_list.append(loss.numpy())
                optimizer.step()
                # clear gradients
                optimizer.clear_grad()

        predict_list = []
        # after training, run inference on the test set
        for v_batch_id, v_data in enumerate(test_loader()):

            feature = paddle.to_tensor(v_data[0][0], dtype='float32')
            predicts = model(feature)
            predict_list.extend(predicts)

        paddle.save(model.state_dict(), "./model/model.pdparams")
        paddle.save(optimizer.state_dict(), "./model/adam.pdopt")

        return predict_list, loss_list

# model hyper-parameters
feature_number = 40
hidden_features = 20
out_features = 1

# training hyper-parameters
epochs = 2
batchsize=512
learning_rate=1e-5
verbose=10
model=Mlp(feature_number, hidden_features, out_features)
optim = paddle.optimizer.Adam(learning_rate=learning_rate, parameters=model.parameters())  # Adam optimizer

result, loss_list= train_model(train_dataset, valid_dataset,model,optimizer=optim,verbose=verbose,epochs=epochs,batch_size=batchsize,shuffle=True)
start training ... 
epoch: 0, batch_id: 0, loss : 1.0221488935549963
epoch: 0, batch_id: 10, loss : 1.0319473914036887
epoch: 0, batch_id: 20, loss : 1.1616030832364013
epoch: 0, batch_id: 30, loss : 0.9790661875256286
epoch: 0, batch_id: 40, loss : 1.0165814517740945
epoch: 0, batch_id: 50, loss : 0.9623618385930143
epoch: 0, batch_id: 60, loss : 1.050502584157359
epoch: 0, batch_id: 70, loss : 1.0234238718304822
epoch: 1, batch_id: 0, loss : 1.0975484557523207
epoch: 1, batch_id: 10, loss : 1.086625683856127
epoch: 1, batch_id: 20, loss : 0.9919022578127885
epoch: 1, batch_id: 30, loss : 1.0404108806648333
epoch: 1, batch_id: 40, loss : 1.093106216609532
epoch: 1, batch_id: 50, loss : 1.018006920849071
epoch: 1, batch_id: 60, loss : 1.0005267663184374
epoch: 1, batch_id: 70, loss : 1.1276594622696936
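Because the label column was standardized before training (`df_data['label'] = st_model.fit_transform(...)` above), the network's outputs live on the standardized scale and must be mapped back before submission. A minimal sketch, with a toy scaler standing in for the one fit on the label:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the scaler that was fit on the label column (st_model above).
scaler = StandardScaler().fit(np.array([[422.0], [618.0], [719.0]]))

preds_std = np.array([[0.0], [1.0]])               # toy standardized model outputs
preds_score = scaler.inverse_transform(preds_std)  # back to the credit-score scale
```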

Post-competition Summary

Later I teamed up with a group of excellent teammates. We designed a business-driven model for ensembling and finished Top 1 on leaderboard A and Top 1 on our first leaderboard-B submission, taking the championship among 2,522 teams from home and abroad.

In practice, the time you invest in a competition should usually be allocated as: feature engineering > model ensembling > model choice > parameter tuning, or: model ensembling > feature engineering > model choice > parameter tuning.

This article covered my experience and journey in the China Mobile consumer-profile competition, reproducing the path from getting started to winning. You can practice and learn with this competition yourself; much of this knowledge is only truly understood through practice. Competition results are useful, but what matters more is what you learn along the way. Doing well requires a large time investment, and often that investment will not pay off immediately — don't be discouraged; believe in yourself and keep putting in the effort and practice.

About the Author

Zheng Yuxuan, M.S. in Computer Science from East China Normal University, researches multimodal recommender systems and natural language processing; he has published one CCF-A paper and won prizes in many top domestic and international algorithm competitions.
His results include champion of the 2019 Alibaba Tianchi Global Data Intelligence Competition, the 2019 Digital China Innovation Contest, the 2019 National College Big Data Application Innovation Competition, and the 2019 Shanghai College Student Computer Contest.
He won the 2021 DeeCamp AI camp track championship and the overall championship with the team project GeneBERT, the first publicly released large-scale multimodal gene pre-training model in China, which was highly praised by Zhang Ya-Qin, Chen Weiying, and other reviewers during the final presentation.