League of Legends Master Prediction Competition: Solution Sharing (Model Fusion)

This post implements cross-validation with sklearn and fuses three weak classifiers with a Paddle network model.

AI Studio · Published 2023-02-12 23:09:25

This article comes from a featured project in the AI Studio community.

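Competition Introduction

Real-time battle games are a hot topic in AI research. Their complexity, partial observability, and the way a match changes dynamically in real time make the research difficult. The outcome can be predicted during champion selection, or modeled from live data while the match is in progress. So, while a League of Legends match is under way, can a player know their own chances of winning?

Competition Task

The competition data consists of real-time game data from League of Legends players, recording each player's in-match statistics (such as kills and physical damage dealt). Participants are expected to uncover the patterns in the dataset and predict whether the player wins or loses the match.

The competition training set looks as follows:

Training set: 180,000 records;
Test set: 20,000 records;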

import pandas as pd
import numpy as np

train = pd.read_csv('train.csv.zip')

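Each row of the dataset holds one player's data for one game. The fields are:

id: player record id
win: whether the player won (the label variable)
kills: number of kills
deaths: number of deaths
assists: number of assists
largestkillingspree: largest killing spree (a game term: a streak of three or more champion kills without dying in between)
largestmultikill: largest multikill (a game term: several kills within a short time)
longesttimespentliving: longest time spent alive
doublekills: number of double kills
triplekills: number of triple kills
quadrakills: number of quadra kills
pentakills: number of penta kills
totdmgdealt: total damage dealt
magicdmgdealt: magic damage dealt
physicaldmgdealt: physical damage dealt
truedmgdealt: true damage dealt
largestcrit: largest critical strike
totdmgtochamp: total damage dealt to enemy champions
magicdmgtochamp: magic damage dealt to enemy champions
physdmgtochamp: physical damage dealt to enemy champions
truedmgtochamp: true damage dealt to enemy champions
totheal: total healing
totunitshealed: total number of units healed
dmgtoturrets: damage dealt to turrets
timecc: crowd-control time
totdmgtaken: total damage taken
magicdmgtaken: magic damage taken
physdmgtaken: physical damage taken
truedmgtaken: true damage taken
wardsplaced: number of wards placed
wardskilled: number of wards destroyed
firstblood: whether the player got first blood
In the test set the label field win is empty and must be predicted by the participants.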

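Evaluation Rules

Data Description

Participants must submit a win/loss prediction for every row of the test set, in the following format: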

win
0
1
1
0

Evaluation Metric

The competition is scored by accuracy; the higher the value, the better the prediction. Reference evaluation code:

from sklearn.metrics import accuracy_score
y_pred = [0, 2, 1, 3]
y_true = [0, 1, 2, 3]
accuracy_score(y_true, y_pred)
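Accuracy is simply the fraction of predictions that equal the true label. The sketch below restates it in plain Python so the metric can be checked by hand; the helper name `accuracy` and the sample labels are illustrative assumptions, not part of the competition kit.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Hypothetical binary win/loss labels: 3 of the 4 predictions match
score = accuracy([0, 1, 1, 0], [0, 1, 0, 0])
print(score)
```

For the reference snippet above, two of the four predictions match, so the same computation gives 0.5.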

1) Load the data

#!pip install numpy==1.19
#!pip install -U scikit-learn numpy

import sklearn

import pandas as pd
import paddle
import numpy as np
%pylab inline
import seaborn as sns

train_df_raw = pd.read_csv('data/data137276/train.csv.zip')
test_df_raw = pd.read_csv('data/data137276/test.csv.zip')

train_df = train_df_raw.drop(['id', 'timecc'], axis=1)
test_df = test_df_raw.drop(['id', 'timecc'], axis=1)

train_df_raw

train_df

	win	kills	deaths	assists	largestkillingspree	largestmultikill	longesttimespentliving	doublekills	triplekills	quadrakills	...	totheal	totunitshealed	dmgtoturrets	totdmgtaken	magicdmgtaken	physdmgtaken	truedmgtaken	wardsplaced	wardskilled	firstblood
0	0	1	5	2	0	1	569	0	0	0	...	849	2	0	7819	2178	5239	401	4	1	0
1	0	5	8	7	3	1	880	0	0	0	...	642	4	303	24637	5607	17635	1394	10	0	0
2	1	1	6	16	0	1	593	0	0	0	...	2326	3	329	18749	3651	14834	263	7	1	0
3	0	1	2	0	0	1	381	0	0	0	...	1555	1	0	12134	1739	10318	76	8	1	0
4	0	4	11	25	0	1	455	0	0	0	...	6630	8	0	27891	14068	12749	1073	34	2	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
179995	1	1	6	12	0	1	362	0	0	0	...	3559	3	5751	14786	2374	12309	102	12	1	0
179996	1	7	3	4	5	1	574	0	0	0	...	2529	2	8907	11019	3933	6533	552	7	2	0
179997	1	9	0	9	9	1	0	0	0	0	...	11494	4	6627	14279	3661	10617	0	7	2	1
179998	1	14	1	5	10	2	980	3	0	0	...	6555	1	1943	19165	4818	14110	236	6	0	0
179999	1	4	4	2	2	1	559	0	0	0	...	608	1	1590	10992	7681	3065	246	7	1	0

180000 rows × 30 columns

# View the labels
train_df['win']

# View the column names
train_df.columns

train_df.info()

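2) Exploratory data analysis (EDA)

2.1 Missing values and outliers

# Missing values: isnull() returns a boolean DataFrame
print(type(train_df.isnull()))
train_df.isnull()

# Number of missing values per column
train_df.isnull().sum()

# Fraction of missing values per column
train_df.isnull().mean(axis=0)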

train_df['win'].value_counts().plot(kind='bar')

sns.distplot(train_df['kills'])

sns.distplot(train_df['deaths'])

sns.boxplot(y='kills', x='win', data=train_df)

plt.scatter(train_df['kills'], train_df['deaths'])
plt.xlabel('kills')
plt.ylabel('deaths')

# Scale every feature by its training-set maximum; the test set reuses the same factor
for col in train_df.columns[1:]:
    col_max = train_df[col].max()
    train_df[col] /= col_max
    test_df[col] /= col_max
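When features are scaled by a maximum, it matters which maximum is used. A model only sees consistent inputs if the training and test sets are divided by the same factor. The sketch below illustrates this on two small made-up frames (`toy_train` and `toy_test` are hypothetical stand-ins for `train_df` and `test_df`).

```python
import pandas as pd

# Toy frames standing in for train_df / test_df (hypothetical values)
toy_train = pd.DataFrame({"kills": [0, 5, 10], "deaths": [2, 4, 8]})
toy_test = pd.DataFrame({"kills": [5, 20], "deaths": [4, 4]})

# Per-column maxima learned from the training data only
train_max = toy_train.max()

# Divide both frames by the same factors
train_scaled = toy_train / train_max
test_scaled = toy_test / train_max

# The same raw value (kills = 5) maps to the same scaled value in both frames
print(train_scaled.loc[1, "kills"], test_scaled.loc[0, "kills"])

# Dividing the test frame by its own maximum maps kills = 5 to a different value
own_scaled = toy_test / toy_test.max()
print(own_scaled.loc[0, "kills"])
```

Test values can exceed 1 under this scheme (here kills = 20 becomes 2.0), which is expected when the test set contains values larger than anything seen in training.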

3) Dataset

from sklearn.model_selection import train_test_split 
from sklearn.model_selection import KFold,cross_validate

# Separate the features from the label
x=train_df.drop(['win'], axis=1)
y=train_df.win

	kills	deaths	assists	largestkillingspree	largestmultikill	longesttimespentliving	doublekills	triplekills	quadrakills	pentakills	...	totheal	totunitshealed	dmgtoturrets	totdmgtaken	magicdmgtaken	physdmgtaken	truedmgtaken	wardsplaced	wardskilled	firstblood
0	1	5	2	0	1	569	0	0	0	0	...	849	2	0	7819	2178	5239	401	4	1	0
1	5	8	7	3	1	880	0	0	0	0	...	642	4	303	24637	5607	17635	1394	10	0	0
2	1	6	16	0	1	593	0	0	0	0	...	2326	3	329	18749	3651	14834	263	7	1	0
3	1	2	0	0	1	381	0	0	0	0	...	1555	1	0	12134	1739	10318	76	8	1	0
4	4	11	25	0	1	455	0	0	0	0	...	6630	8	0	27891	14068	12749	1073	34	2	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
179995	1	6	12	0	1	362	0	0	0	0	...	3559	3	5751	14786	2374	12309	102	12	1	0
179996	7	3	4	5	1	574	0	0	0	0	...	2529	2	8907	11019	3933	6533	552	7	2	0
179997	9	0	9	9	1	0	0	0	0	0	...	11494	4	6627	14279	3661	10617	0	7	2	1
179998	14	1	5	10	2	980	3	0	0	0	...	6555	1	1943	19165	4818	14110	236	6	0	0
179999	4	4	2	2	1	559	0	0	0	0	...	608	1	1590	10992	7681	3065	246	7	1	0

180000 rows × 29 columns

0         0
1         0
2         1
3         0
4         0
         ..
179995    1
179996    1
179997    1
179998    1
179999    1
Name: win, Length: 180000, dtype: int64

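print('Feature matrix shape: {}'.format(x.shape))
print('Label shape: {}'.format(y.shape))
print('Label classes: {}'.format(np.unique(y)))
print('Test feature matrix shape: {}'.format(test_df.shape))

Feature matrix shape: (180000, 29)
Label shape: (180000,)
Label classes: [0 1]
Test feature matrix shape: (20000, 29)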

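# Split the dataset; the held-out part is used for a second round of validation
Xtrain,Xtest,Ytrain,Ytest=train_test_split(x,y,test_size=0.2,random_state=1412)

# "Validation" here refers to the held-out validation split, not the competition test set.
print('Training feature matrix shape: {}'.format(Xtrain.shape))
print('Training label shape: {}'.format(Ytrain.shape))
print('Validation feature matrix shape: {}'.format(Xtest.shape))
print('Validation label shape: {}'.format(Ytest.shape))

Training feature matrix shape: (144000, 29)
Training label shape: (144000,)
Validation feature matrix shape: (36000, 29)
Validation label shape: (36000,)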

def individual_estimators(estimators):
    """Cross-validate each (name, estimator) pair on the training split and score it on the hold-out split."""
    train_score=[]
    cv_mean=[]
    test_score=[]

    for name, estimator in estimators:
        # 5-fold cross-validation on the training split
        cv=KFold(n_splits=5,shuffle=True,random_state=1412)
        results=cross_validate(estimator,Xtrain,Ytrain,
                               cv=cv,
                               scoring="accuracy",
                               n_jobs=8,
                               return_train_score=True,
                               verbose=False)
        # Refit on the whole training split, then score on the hold-out split
        test=estimator.fit(Xtrain,Ytrain).score(Xtest,Ytest)
        train_score.append(results["train_score"].mean())
        cv_mean.append(results["test_score"].mean())
        test_score.append(test)
    for i in range(len(estimators)):
        print("-------------------------------------------")
        print(
            estimators[i]
            ,"\n train_score_mean:{}".format(train_score[i])
            ,"\n cv_mean:{}".format(cv_mean[i])
            ,"\n test_score:{}".format(test_score[i])
            ,"\n")

def fusion_estimators(estimators):
    """Cross-validate the global VotingClassifier clf, then fit it and score it on the hold-out split.

    The estimators argument is not used; the function always works on clf.
    """
    cv=KFold(n_splits=5,shuffle=True,random_state=1412)
    results=cross_validate(clf,Xtrain,Ytrain,
                           cv=cv,
                           scoring="accuracy",
                           n_jobs=-1,
                           return_train_score=True,
                           verbose=False)
    test=clf.fit(Xtrain,Ytrain).score(Xtest,Ytest)
    print("++++++++++++++++++++++++++++++++++++++++++++++")
    print(
        "\n train_score_mean:{}".format(results["train_score"].mean())
        ,"\n cv_mean:{}".format(results["test_score"].mean())
        ,"\n test_score:{}".format(test)
        )
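Both helper functions rely on `cross_validate` returning per-fold train and test scores. The sketch below shows that return value on a small synthetic dataset; the data from `make_classification` and the names `X_demo` and `y_demo` are assumptions for illustration, not the competition data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Same fold layout as in the helpers: 5 shuffled folds with a fixed seed
cv = KFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    cv=cv,
    scoring="accuracy",
    return_train_score=True,
)

# One entry per fold for each key
print(sorted(results.keys()))
print("train mean:", results["train_score"].mean())
print("cv mean:", results["test_score"].mean())
```

A training mean that sits well above the cross-validation mean is the usual sign of overfitting, which is why the helpers print both next to the hold-out score.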

4) Models

from sklearn.neighbors import KNeighborsClassifier as KNNC
from sklearn.tree import DecisionTreeClassifier as DTR

from sklearn.ensemble import RandomForestClassifier as RFC
from sklearn.ensemble import GradientBoostingClassifier as GBC
from sklearn.linear_model import LogisticRegression as LogiR
from sklearn.ensemble import VotingClassifier

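4.a Why is model fusion better than an ensemble algorithm?

Although no single weak classifier is strong, each one represents its own hypothesis space. Real-world data come from a complex multivariate random system, and a single hypothesis space often cannot approximate it well. Model fusion is a simple, brute-force remedy that combines several of them. Fusion does not always improve the result, but it does in most cases.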
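A back-of-the-envelope calculation shows why combining models tends to help. It rests on an idealized assumption that real models only approximate: the classifiers make their errors independently. It also uses a hard majority vote, not the soft voting used later in this post. The helper name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Probability that more than half of n independent voters are correct.

    Each voter is correct with probability p; n is assumed to be odd.
    """
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))


# Three independent classifiers, each 70% accurate
print(round(majority_vote_accuracy(0.7, 3), 3))
```

With three such voters the majority is correct with probability 3 × 0.7² × 0.3 + 0.7³ = 0.784, higher than any single voter. When the errors are correlated the gain shrinks, which is consistent with fusion helping often but not always.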

4.1 Weak classifiers and ensembling

clf1=LogiR(max_iter=3000,random_state=1412,n_jobs=8)
clf2=RFC(n_estimators=100,random_state=1412,n_jobs=8)
clf3=GBC(n_estimators=100,random_state=1412)

estimators=[("Logistic Regression",clf1),("RandomForest",clf2),("GBDT",clf3)]
clf=VotingClassifier(estimators,voting="soft")
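With `voting="soft"` and no weights, `VotingClassifier` averages the class probabilities of its fitted base models. The sketch below checks this on synthetic data; the dataset, the two-model ensemble, and the names `voter` and `manual` are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

voter = VotingClassifier(
    [
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=20, random_state=0)),
    ],
    voting="soft",
)
voter.fit(X_demo, y_demo)

# Average the class probabilities of the fitted base models by hand
manual = np.mean([est.predict_proba(X_demo) for est in voter.estimators_], axis=0)

# The hand-computed average matches the ensemble's own probabilities
print(np.allclose(manual, voter.predict_proba(X_demo)))
```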

4.1.1 Evaluate each weak classifier individually

individual_estimators(estimators)

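4.1.2 Evaluate the fused model

# fusion_estimators evaluates and fits the global VotingClassifier clf; its argument is not used
fusion_estimators(estimators)

# Class probabilities on the competition test set, blended with the network output later
test_predict_sklearn=clf.predict_proba(test_df)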

print(test_predict_sklearn.shape)
print(test_predict_sklearn)

(20000, 2)
[[0.87535621 0.12464379]
 [0.77675525 0.22324475]
 [0.16242339 0.83757661]
 ...
 [0.94152587 0.05847413]
 [0.90214731 0.09785269]
 [0.10380786 0.89619214]]

4.2 Neural network model

class MyModel(paddle.nn.Layer):
    def __init__(self):
        # Initialize the parent Layer
        super(MyModel, self).__init__()
        self.fc1 = paddle.nn.Linear(in_features=29, out_features=30)
        # Batch normalization over the 30 outputs of fc1
        self.hidden1 = paddle.nn.BatchNorm1D(30)
        self.fc2 = paddle.nn.Linear(in_features=30, out_features=8)
        self.relu2 = paddle.nn.LeakyReLU()
        self.fc3 = paddle.nn.Linear(in_features=8, out_features=6)
        self.relu3 = paddle.nn.Sigmoid()
        self.fc4 = paddle.nn.Linear(in_features=6, out_features=4)
        self.fc5 = paddle.nn.Linear(in_features=4, out_features=2)
        self.softmax = paddle.nn.Softmax()

    # Forward pass of the network
    def forward(self, inputs):
        x = self.fc1(inputs)
        # fc1 is followed by batch normalization, with no activation in between
        x = self.hidden1(x)
        x = self.fc2(x)
        x = self.relu2(x)
        x = self.fc3(x)
        x = self.relu3(x)

        x = self.fc4(x)
        x = self.fc5(x)
        # Return class probabilities for the two classes (loss, win)
        x = self.softmax(x)
        return x

model = MyModel()
model.train()
opt = paddle.optimizer.SGD(learning_rate=0.01, parameters=model.parameters())

EPOCH_NUM = 10    # number of passes over the training data
BATCH_SIZE = 100  # number of records per mini-batch
training_data = train_df.iloc[:-1000,].values.astype(np.float32)
val_data = train_df.iloc[-1000:, ].values.astype(np.float32)

# Outer loop over epochs
for epoch_id in range(EPOCH_NUM):
    # Shuffle the training rows at the start of every epoch
    np.random.shuffle(training_data)

    # Split the training data into mini-batches of BATCH_SIZE records
    mini_batches = [training_data[k:k+BATCH_SIZE] for k in range(0, len(training_data), BATCH_SIZE)]

    # Inner loop over mini-batches
    for iter_id, mini_batch in enumerate(mini_batches):
        x_data = mini_batch[:, 1:]                   # features of the current batch
        y_label = mini_batch[:, 0].astype(np.int64)  # win labels (0 or 1) of the current batch

        # One-hot encode the labels: 0 -> [1, 0], 1 -> [0, 1]
        label = np.eye(2, dtype=np.float32)[y_label]

        # Convert the NumPy arrays to Paddle dynamic-graph tensors
        features = paddle.to_tensor(x_data)
        label = paddle.to_tensor(label, dtype='float32')

        # Forward pass
        predicts = model(features)
        # Loss. MyModel.forward already ends in a softmax, so this call applies softmax to probabilities a second time.
        loss = paddle.nn.functional.softmax_with_cross_entropy(predicts, label, soft_label=True)
        avg_loss = paddle.mean(loss)

        # Backward pass: compute the gradient of every parameter
        avg_loss.backward()

        # Update the parameters by one step at the configured learning rate
        opt.step()
        # Clear the gradients before the next iteration
        opt.clear_grad()
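The one-hot encoding of the win label used in the training loop can be written as a single NumPy indexing operation. Below is a minimal sketch on a made-up batch of four labels; the array `labels` is hypothetical.

```python
import numpy as np

# Win labels as they appear in a float32 mini-batch (hypothetical values)
labels = np.array([0.0, 1.0, 1.0, 0.0], dtype=np.float32)

# Row i of the 2x2 identity matrix is the one-hot vector for class i
one_hot = np.eye(2, dtype=np.float32)[labels.astype(np.int64)]
print(one_hot)
```

Each row has a 1 in the column of its class, which is the soft-label format passed to the loss with `soft_label=True`.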

model.eval()
test_data = paddle.to_tensor(test_df.values.astype(np.float32))
test_predict_dl = model(test_data)

test_predict_dl

Tensor(shape=[20000, 2], dtype=float32, place=CPUPlace, stop_gradient=False,
       [[0.31092143, 0.68907863],
        [0.89762008, 0.10237990],
        [0.00382155, 0.99617851],
        ...,
        [0.97896796, 0.02103199],
        [0.98377025, 0.01622973],
        [0.00828540, 0.99171454]])

test_predict_sklearn

array([[0.87535621, 0.12464379],
       [0.77675525, 0.22324475],
       [0.16242339, 0.83757661],
       ...,
       [0.94152587, 0.05847413],
       [0.90214731, 0.09785269],
       [0.10380786, 0.89619214]])

# Fusion weights: 1/4 for the Paddle network, 3/4 for the sklearn voting ensemble
test_predict_ = 1/4 * test_predict_dl.numpy() + 3/4 * test_predict_sklearn

# Pick the class with the larger fused probability (ties go to class 0)
test_predict = np.argmax(test_predict_, axis=1)

test_predict

array([0, 0, 1, ..., 0, 0, 1])
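The final fusion step is a weighted average of two probability arrays followed by an argmax. The sketch below reproduces it on three made-up samples so the arithmetic can be followed by hand; the helper `blend` and the arrays `dl` and `sk` are hypothetical, with the default weight of 1/4 mirroring the ratio used above.

```python
import numpy as np


def blend(proba_a, proba_b, weight_a=0.25):
    """Weighted average of two (n_samples, n_classes) probability arrays.

    Returns the blended probabilities and the most probable class per row.
    """
    blended = weight_a * np.asarray(proba_a) + (1.0 - weight_a) * np.asarray(proba_b)
    return blended, blended.argmax(axis=1)


# Hypothetical probabilities for three samples from two models
dl = np.array([[0.3, 0.7], [0.9, 0.1], [0.6, 0.4]])
sk = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]])

blended, labels = blend(dl, sk)
print(blended)
print(labels)
```

Because the two weights sum to 1, every blended row still sums to 1. In the first and third samples the two models disagree, and the more heavily weighted array decides the outcome.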

pd.DataFrame({'win':
              test_predict
             }).to_csv('submission.csv', index=None)

!zip submission.zip submission.csv

  adding: submission.csv (deflated 94%)
