I. Web Attack Detection and Classification with LightGBM

Competition page: https://www.datafountain.cn/competitions/596


1. Background

A business platform captures on average more than 200 million web attacks per month, covering common types such as injection attacks and code execution. Traditional threat detection matches rules against known attack signatures, so it cannot detect unknown vulnerabilities or attack techniques. Quickly and accurately identifying unknown attacks and classifying them correctly is therefore critical for improving web attack detection. Using machine learning and deep learning to identify and classify attack payloads has become a promising approach to this problem and helps advance the research and application of AI in threat detection and analysis.

2. Task

Teams analyze the provided training set and build AI models through feature engineering, machine learning, and deep learning to classify each sample correctly and quickly, iterating to improve precision and recall. Once the models are stable, an unlabeled test set is used to evaluate each team's classifier, and models are judged by accuracy.

3. Data Description

The training set covers 6 different labels, about 35,000 rows in total. The main fields are:
● ID: sample ID
● label: attack type ID
● others: HTTP protocol content

The submission follows the test set format with one extra field, predict, which holds the model's prediction; its values have the same meaning as the label field of the training set. The submission file must be named predict.csv.

Test set format: contains id, method, user_agent, url, refer, and body.

(Figure: test set sample)

As shown in the figure below, a predict field is added, and the sample with id=117 is predicted as class 0. Note that the submitted id values must not be changed. A minimal sketch of assembling such a file follows the figure.
(Figure: submission sample with the added predict field)
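A minimal sketch of assembling such a file with pandas (the ids and predicted classes below are hypothetical):

```python
import pandas as pd

# hypothetical predictions; in practice `predict` comes from the trained model
sub = pd.DataFrame({
    "id": [117, 118, 119],   # ids must stay exactly as given in the test set
    "predict": [0, 1, 2],    # same label space as the training label field
})
sub.to_csv("predict.csv", index=False)
```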

4. Credits

This project is adapted from 致Great's write-up at https://discussion.datafountain.cn/articles/detail/2381 as a learning exercise; many thanks to the original author.

II. Data Processing

1. Unzipping

Be sure to specify the Chinese character encoding (GBK) when unzipping, otherwise the extracted file names will be garbled. Common unzip options:

  • -c: extract files to stdout, applying appropriate character conversion;
  • -f: freshen existing files only;
  • -l: list the files contained in the archive;
  • -p: like -c, extract files to stdout, but without any conversion;
  • -t: test whether the archive is intact;
  • -u: like -f, but also extract files from the archive that do not yet exist in the directory;
  • -v: be verbose;
  • -z: display only the archive comment;
  • -a: apply necessary character conversion to text files;
  • -b: do not convert characters in text files;
  • -C: match file names in the archive case-sensitively;
  • -j: junk paths, i.e. ignore the directory structure stored in the archive;
  • -L: convert all extracted file names to lowercase;
  • -M: pipe the output through the more pager;
  • -n: never overwrite existing files when extracting;
  • -o: overwrite existing files without prompting;
  • -P<password>: supply the password for encrypted zip archives;
  • -q: quiet mode, print no messages;
  • -s: convert spaces in file names to underscores;
  • -V: retain VMS file version numbers;
  • -X: restore the original UID/GID of files when extracting;
  • -d<dir>: extract files into the given directory;
  • -x<files>: exclude the given files in the archive from processing;
  • -Z: unzip -Z is equivalent to running zipinfo;
  • -O: specify the character encoding for archives created on DOS, Windows, and OS/2;
  • -I: specify the character encoding for archives created on UNIX and other systems.
```python
!unzip -qoa -O GBK data/data168450/train.zip
!unzip -qoa -O GBK data/data168450/test-726699.zip
```

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
```

2. Data Loading

Read the per-class CSV files and stack them row-wise (axis=0):

```python
train = pd.concat([
    pd.read_csv('SQL注入.csv'),
    pd.read_csv('XSS跨站脚本.csv'),
    pd.read_csv('命令执行.csv'),
    pd.read_csv('白.csv'),
    pd.read_csv('目录遍历.csv'),
    pd.read_csv('远程代码执行.csv'),
], axis=0).reset_index(drop=True)
test = pd.read_csv('test.csv')
train.head()
```
|   | id | method | user_agent | url | refer | body | lable |
|---|----|--------|------------|-----|-------|------|-------|
| 0 | 17902 | GET | Dalvik/2.1.0 (Linux; U; Android 11; SM-G9860 B... | /livemsg?ad_type=WL_WK&ty=web&pu=0&openudid=ed... | NaN | GET /livemsg?ad_type=WL_WK&ty=web&pu=0&openudi... | 1 |
| 1 | 190 | GET | Dalvik/2.1.0 (Linux; U; Android 11; Mi 10 Buil... | /livemsg?ad_type=WL_WK&ty=web&pu=0&openudid=d6... | NaN | GET /livemsg?ad_type=WL_WK&ty=web&pu=0&openudi... | 1 |
| 2 | 8799 | GET | Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... | /(select%20extractvalue(xmltype('%3c%3fxml%20v... | NaN | GET /(select%20extractvalue(xmltype('%3c%3fxml... | 1 |
| 3 | 8788 | GET | Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... | /ftp/quarantine/?(select%20load_file('%5c%5c%5... | NaN | GET /ftp/quarantine/?(select%20load_file('%5c%... | 1 |
| 4 | 16030 | GET | Dalvik/2.1.0 (Linux; U; Android 9; MI 9 SE MIU... | /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&adap... | NaN | GET /livemsg?ad_type=WL_WK&oadid=&ty=web&pu=0&... | 1 |
```python
test.head()
```
|   | id | method | user_agent | url | refer | body |
|---|----|--------|------------|-----|-------|------|
| 0 | 0 | GET | Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... | /demo/aisec/upload.php?act='%7C%7C(select+1+fr... | http://demo.aisec.cn/demo/aisec/upload.php?t=0... | GET /demo/aisec/upload.php?act='%7C%7C(select+... |
| 1 | 1 | GET | Dalvik/2.1.0 (Linux; U; Android 11; M2102J2SC ... | /livemsg?ad_type=WL_WK&ty=web&pu=1&openudid=5f... | NaN | GET /livemsg?ad_type=WL_WK&ty=web&pu=1&openudi... |
| 2 | 2 | GET | Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/2... | /create_user/?username=%3Cscript%3Ealert(docum... | NaN | NaN |
| 3 | 3 | GET | NaN | /mmsns/WeDwicXmkOl4kjKsBycicI0H3q41r6syFFvu46h... | NaN | NaN |
| 4 | 4 | PUT | Mozilla/5.0 (Windows NT 10.0; rv:78.0) Gecko/2... | /naizau.jsp/ | NaN | GET /login HTTP/1.1 Host: 111.160.211.18:8088 ... |
```python
train.to_csv("train_all.csv")
test.to_csv("test_all.csv")
```

3. Label Distribution

```python
print(train['lable'].value_counts())
train['lable'].value_counts().plot(kind='bar')
```

```
1    14038
2     9939
0     6489
3     1397
4      697
5      659
Name: lable, dtype: int64
```

(Figure: bar chart of the label distribution)

4. Merging Train and Test

reset_index rebuilds the index. After operations such as concat, a DataFrame's index can contain duplicates, and on a large dataset it is easy to miss that selecting a single row by label can then silently return several rows sharing the same index.

  • drop=True discards the old index and replaces it with a fresh one.
  • drop=False keeps the old index as a regular column and adds the fresh index alongside it.

The only difference is whether the old index is kept, as the toy example below shows.
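A tiny illustration on toy data (the frame below is made up for demonstration):

```python
import pandas as pd

# concat of two frames produces duplicate index labels: [0, 1, 0]
df = pd.concat([pd.DataFrame({'x': [1, 2]}), pd.DataFrame({'x': [3]})], axis=0)
print(df.index.tolist())                            # [0, 1, 0]
print(df.reset_index(drop=True).index.tolist())     # [0, 1, 2]: old index discarded
print(df.reset_index(drop=False).columns.tolist())  # ['index', 'x']: old index kept as a column
```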

```python
# merge train and test, and reset the index
data = pd.concat([train, test], axis=0).reset_index(drop=True)
```

5. Missing-Value Handling

It is safer to fill every missing value in the whole frame at once rather than treating individual columns, so no column is missed. A toy comparison follows the code below.

```python
data = data.fillna('NAN')
```
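For illustration, frame-wide versus per-column filling (the frame is made up for demonstration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, 'x']})
print(toy.fillna('NAN'))       # one call fills NaN in every column
print(toy['a'].fillna('NAN'))  # per-column fill touches only column 'a'
```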

6. Feature Extraction

6.1 About LightGBM

Note that lightgbm must be upgraded to 3.3.2; the older 3.1 release does not provide log_evaluation.
Docs: https://lightgbm.readthedocs.io/en/latest/
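A quick sanity check that the callback-style API used later is available (assuming lightgbm>=3.3 is installed):

```python
# log_evaluation was added in lightgbm 3.3; this import fails on 3.1
from lightgbm import early_stopping, log_evaluation

callbacks = [
    early_stopping(stopping_rounds=100),  # stop once the validation metric stalls for 100 rounds
    log_evaluation(period=100),           # log the validation metric every 100 rounds
]
print(callbacks)
```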

6.2 About user_agents

user_agents is a Python library that provides a simple way to identify/detect devices by parsing (browser/HTTP) user agent strings. It relies on the excellent ua-parser for the actual parsing of the raw user agent string.

```python
!pip install -U -q lightgbm
!pip install -q user_agents
!pip list | grep lightgbm
```

```
lightgbm                       3.3.2
```

```python
import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from lightgbm import early_stopping
from lightgbm import log_evaluation
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
from tqdm import tqdm
from user_agents import parse
```

6.3 Parsing User Agents

A usage example:


```python
from user_agents import parse

user_string = "Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36"
user_agent = parse(user_string)

# browser properties of the user agent
print(user_agent.browser)
print(user_agent.browser.family)
print(user_agent.browser.version)
print(user_agent.browser.version_string)

# operating system properties of the user agent
print(user_agent.os)
print(user_agent.os.family)
print(user_agent.os.version)
print(user_agent.os.version_string)

# device properties of the user agent
print(user_agent.device)
print(user_agent.device.family)
print(user_agent.device.brand)
print(user_agent.device.model)

# a pretty string version
print(str(user_agent))
```
```python
ua_cols = ['browser_family', 'os_family', 'device_family', 'device_brand', 'device_model']

def get_ua(row):
    # parse the raw user agent string of one sample
    user_agent = parse(row['user_agent'])

    browser_family = str(user_agent.browser.family)
    os_family = str(user_agent.os.family)
    device_family = str(user_agent.device.family)
    device_brand = str(user_agent.device.brand)
    device_model = str(user_agent.device.model)
    return browser_family, os_family, device_family, device_brand, device_model

# result_type="expand" spreads the returned tuple into the five new columns
data[ua_cols] = data.apply(get_ua, axis=1, result_type="expand")
```

6.4 Text Vectorization with TfidfVectorizer

TfidfVectorizer converts raw text into a tf-idf feature matrix, laying the groundwork for applications such as text similarity, topic models (e.g. LSI), and search ranking.

Typical usage:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

document = ["I have a pen.",
            "I have an apple."]
tfidf_model = TfidfVectorizer().fit(document)

# the tf-idf matrix, in sparse representation
sparse_result = tfidf_model.transform(document)
print(sparse_result)
# in string 0, the word with vocabulary index 3 has tf-idf 0.8148
# (0, 3)	0.814802474667
# (0, 2)	0.579738671538
# (1, 2)	0.449436416524
# (1, 1)	0.631667201738
# (1, 0)	0.631667201738

# convert to a plain dense matrix
print(sparse_result.todense())
# [[ 0.          0.          0.57973867  0.81480247]
#  [ 0.6316672   0.6316672   0.44943642  0.        ]]

# mapping from words to column indices
print(tfidf_model.vocabulary_)
# {'have': 2, 'pen': 3, 'an': 0, 'apple': 1}
```
```python
texts = data['user_agent'].values.tolist()
n_components = 16
# character n-grams (2-5) are robust against the obfuscation common in attack payloads
tf = TfidfVectorizer(min_df=3, max_df=0.5, analyzer='char_wb', ngram_range=(2, 5))
X = tf.fit_transform(texts)
# compress the sparse tf-idf matrix into 16 dense SVD components
svd = TruncatedSVD(n_components=n_components,
                   random_state=42)
X_svd = svd.fit_transform(X)
df_tfidf = pd.DataFrame(X_svd)
df_tfidf.columns = [f'user_agent_name_tfidf_{i}' for i in range(n_components)]
data = pd.concat([data, df_tfidf], axis=1)
cate_cols = ['method', 'user_agent', 'url', 'refer', 'body'] + ua_cols
data.head()
```
(Truncated output: 5 rows × 28 columns; the original fields plus the five parsed UA columns browser_family, os_family, device_family, device_brand, device_model, and the 16 user_agent_name_tfidf_* SVD features.)
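One way to sanity-check the choice of n_components=16 is the variance the SVD retains; a quick check, assuming the fitted svd from the code above:

```python
# fraction of the tf-idf variance kept by the 16 SVD components;
# a very low value would suggest increasing n_components
print(svd.explained_variance_ratio_.sum())
```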

III. Model Training and Prediction

1. LabelEncoder

LabelEncoder maps discrete values to integers from 0 to n-1, where n is the number of distinct values taken by the feature.

  • Tip: LabelEncoder().fit_transform raises "argument must be a string or number" on unsupported input;
  • handle missing values before running LabelEncoder.

There are two main steps, as the sketch below shows:

  • first build the encoder: fit takes the data to be encoded and builds the value-to-integer mapping internally;
  • then apply the encoder to the data to be converted with transform.
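A minimal sketch of the two steps on toy values:

```python
from sklearn.preprocessing import LabelEncoder

lbl = LabelEncoder()
lbl.fit(['GET', 'POST', 'GET', 'PUT'])        # step 1: build the value -> integer mapping
print(lbl.classes_)                           # ['GET' 'POST' 'PUT']
print(lbl.transform(['GET', 'POST', 'PUT']))  # step 2: encode -> [0 1 2]
```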
```python
for col in tqdm(cate_cols):
    lbl = LabelEncoder()
    lbl.fit(data[col])
    data[col] = lbl.transform(data[col])
```

```
100%|██████████| 10/10 [00:00<00:00, 29.83it/s]
```

```python
# split the merged frame back into train and test
train, test = data[:len(train)], data[len(train):]
```

2. Model Training

```python
def lgb_model(train, target, test, k):
    feats = [f for f in train.columns if f not in ['lable', 'url', 'url_count']]

    print('Current num of features:', len(feats))

    oof_probs = np.zeros((train.shape[0], 6))  # out-of-fold class probabilities
    output_preds = 0                           # averaged test-set predictions
    offline_score = []                         # per-fold validation multi_error
    feature_importance_df = pd.DataFrame()
    parameters = {
        'learning_rate': 0.03,
        'boosting_type': 'gbdt',
        'objective': 'multiclass',
        'metric': 'multi_error',
        'num_class': 6,
        'num_leaves': 31,
        'feature_fraction': 0.6,
        'bagging_fraction': 0.8,
        'min_data_in_leaf': 15,
        'verbose': -1,
        'nthread': 4,
        'max_depth': 7
    }

    seeds = [2020]
    for seed in seeds:
        # stratified folds preserve the ratios of the imbalanced labels
        folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
        for i, (train_index, test_index) in enumerate(folds.split(train, target)):
            train_y, test_y = target.iloc[train_index], target.iloc[test_index]
            train_X, test_X = train[feats].iloc[train_index, :], train[feats].iloc[test_index, :]

            dtrain = lgb.Dataset(train_X,
                                 label=train_y)
            dval = lgb.Dataset(test_X,
                               label=test_y)
            lgb_model = lgb.train(
                parameters,
                dtrain,
                num_boost_round=20000,
                valid_sets=[dval],
                callbacks=[early_stopping(100), log_evaluation(100)],
            )
            oof_probs[test_index] = lgb_model.predict(test_X[feats], num_iteration=lgb_model.best_iteration) / len(
                seeds)
            offline_score.append(lgb_model.best_score['valid_0']['multi_error'])
            output_preds += lgb_model.predict(test[feats],
                                              num_iteration=lgb_model.best_iteration) / folds.n_splits / len(seeds)
            print(offline_score)
            # feature importance for this fold
            fold_importance_df = pd.DataFrame()
            fold_importance_df["feature"] = feats
            fold_importance_df["importance"] = lgb_model.feature_importance(importance_type='gain')
            fold_importance_df["fold"] = i + 1
            feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    # note: despite the 'AUC' label, these statistics are over multi_error
    print('OOF-MEAN-AUC:%.6f, OOF-STD-AUC:%.6f' % (np.mean(offline_score), np.std(offline_score)))
    print('feature importance:')
    print(feature_importance_df.groupby(['feature'])['importance'].mean().sort_values(ascending=False).head(50))

    return output_preds, oof_probs, np.mean(offline_score), feature_importance_df
```

```python
# features to exclude: id and the label obviously must go,
# along with the url columns not used by the model
feature_names = list(
    filter(
        lambda x: x not in ['id', 'lable', 'url', 'url_count'],
        train.columns))
feature_names
```
```
['method',
 'user_agent',
 'refer',
 'body',
 'browser_family',
 'os_family',
 'device_family',
 'device_brand',
 'device_model',
 'user_agent_name_tfidf_0',
 'user_agent_name_tfidf_1',
 'user_agent_name_tfidf_2',
 'user_agent_name_tfidf_3',
 'user_agent_name_tfidf_4',
 'user_agent_name_tfidf_5',
 'user_agent_name_tfidf_6',
 'user_agent_name_tfidf_7',
 'user_agent_name_tfidf_8',
 'user_agent_name_tfidf_9',
 'user_agent_name_tfidf_10',
 'user_agent_name_tfidf_11',
 'user_agent_name_tfidf_12',
 'user_agent_name_tfidf_13',
 'user_agent_name_tfidf_14',
 'user_agent_name_tfidf_15']
```
```python
train['lable'].unique()
```

```
array([1, 5, 4, 0, 2, 3])
```

```python
# cast the label to int
train['lable'] = train['lable'].apply(lambda i: int(i))
train['lable'].unique()
```

```
array([1, 5, 4, 0, 2, 3])
```

```python
print('Start training')
lgb_preds, lgb_oof, lgb_score, feature_importance_df = lgb_model(train=train[feature_names],
                                                                 target=train['lable'],
                                                                 test=test[feature_names], k=5)
```
```
Start training
Current num of features: 25
Training until validation scores don't improve for 100 rounds
[100]	valid_0's multi_error: 0.106713
[200]	valid_0's multi_error: 0.102197
[300]	valid_0's multi_error: 0.0996388
[400]	valid_0's multi_error: 0.0979831
[500]	valid_0's multi_error: 0.0963275
[600]	valid_0's multi_error: 0.0945214
[700]	valid_0's multi_error: 0.0943709
Early stopping, best iteration is:
[665]	valid_0's multi_error: 0.0942203
[0.09422034918723661]
Training until validation scores don't improve for 100 rounds
[100]	valid_0's multi_error: 0.107164
[200]	valid_0's multi_error: 0.104756
[300]	valid_0's multi_error: 0.102498
[400]	valid_0's multi_error: 0.0990367
[500]	valid_0's multi_error: 0.10009
Early stopping, best iteration is:
[410]	valid_0's multi_error: 0.0987357
[0.09422034918723661, 0.09873570138470801]
Training until validation scores don't improve for 100 rounds
[100]	valid_0's multi_error: 0.105057
[200]	valid_0's multi_error: 0.100993
[300]	valid_0's multi_error: 0.0975316
[400]	valid_0's multi_error: 0.0957255
Early stopping, best iteration is:
[388]	valid_0's multi_error: 0.0954244
[0.09422034918723661, 0.09873570138470801, 0.09542444310656231]
Training until validation scores don't improve for 100 rounds
[100]	valid_0's multi_error: 0.110777
[200]	valid_0's multi_error: 0.104907
[300]	valid_0's multi_error: 0.1028
[400]	valid_0's multi_error: 0.101445
[500]	valid_0's multi_error: 0.100542
[600]	valid_0's multi_error: 0.101746
Early stopping, best iteration is:
[518]	valid_0's multi_error: 0.0997893
[0.09422034918723661, 0.09873570138470801, 0.09542444310656231, 0.099789283564118]
Training until validation scores don't improve for 100 rounds
[100]	valid_0's multi_error: 0.0999548
[200]	valid_0's multi_error: 0.0948367
[300]	valid_0's multi_error: 0.0928797
[400]	valid_0's multi_error: 0.0924281
[500]	valid_0's multi_error: 0.0913744
Early stopping, best iteration is:
[496]	valid_0's multi_error: 0.0912238
[0.09422034918723661, 0.09873570138470801, 0.09542444310656231, 0.099789283564118, 0.09122384464850218]
OOF-MEAN-AUC:0.095879, OOF-STD-AUC:0.003101
feature importance:
feature
body                        241348.874669
browser_family              136926.749047
refer                        91150.440689
user_agent_name_tfidf_2      57131.292926
user_agent_name_tfidf_6      54907.824831
user_agent_name_tfidf_4      54479.337116
user_agent                   41806.222855
user_agent_name_tfidf_1      37807.530563
user_agent_name_tfidf_5      37042.289503
user_agent_name_tfidf_0      35681.990519
user_agent_name_tfidf_7      31312.516073
method                       22488.125961
user_agent_name_tfidf_12     20017.676556
user_agent_name_tfidf_15     19832.515956
user_agent_name_tfidf_14     17805.419354
user_agent_name_tfidf_10     17644.252264
user_agent_name_tfidf_3      17013.031725
user_agent_name_tfidf_9      16673.376316
user_agent_name_tfidf_13     16357.774028
user_agent_name_tfidf_8      15505.476155
user_agent_name_tfidf_11     11311.683459
os_family                     3066.650211
device_model                  2139.934527
device_family                 1887.317088
device_brand                  1406.674794
Name: importance, dtype: float64
```

IV. Prediction and Submission

1. Prediction

```python
# load the sample submission file
sub = pd.read_csv('data/data168450/submit_example.csv')
# take the label with the highest predicted probability
sub['predict'] = np.argmax(lgb_preds, axis=1)
# distribution of the predictions
sub['predict'].value_counts()
```

```
1    1281
2     923
0     840
3     588
5     298
4      70
Name: predict, dtype: int64
```
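For clarity, np.argmax picks, per row, the column index of the largest class probability (the matrix below is a made-up example):

```python
import numpy as np

# each row: predicted probabilities over the 6 classes for one sample
probs = np.array([[0.1, 0.7, 0.05, 0.05, 0.05, 0.05],
                  [0.6, 0.1, 0.10, 0.10, 0.05, 0.05]])
print(np.argmax(probs, axis=1))  # [1 0] -> predicted class per sample
```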
```python
# save the submission
sub.to_csv('sub.csv', index=None)
# offline accuracy of the out-of-fold predictions
accuracy_score(train['lable'], np.argmax(lgb_oof, axis=1))
```

```
0.9041211354947469
```

2. Submission

(Figure: screenshot of the submitted result)

3. Takeaways

  • For encoding discrete string features I used to write mappings by hand, which was slow; this time I learned the LabelEncoder approach;
  • For vectorizing text content, I learned the TfidfVectorizer approach.
