UEBA-Based Analysis of Anomalous User Internet Behavior

AI Studio · Published 2022-12-15 12:49:13

1. Introduction

1.1 Project Overview ✅

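As enterprises become increasingly digitized, more and more of them recognize data as an asset. Production, operations, and management activities such as products and services, marketing support, business operations, risk control, and information disclosure involve large amounts of trade secrets, work secrets, and private information about employees and customers.
Most enterprises have issued management policies and codes of conduct for protecting sensitive data, yet anomalous operations that lead to sensitive-data leaks still occur. The Securonix 2020 Insider Threat Report states that 60% of insider cybersecurity and data-leak incidents are related to anomalous operations by enterprise users.
To protect sensitive enterprise data effectively, enforce safe-operation guidelines, and prevent data-leak incidents caused by anomalous operations, analyzing and identifying anomalous user behavior has become one of the key and most difficult technical problems.

This project uses Paddle to build neural networks that predict an anomaly score for user Internet behavior. It covers the complete workflow: data processing and analysis, data preprocessing, model construction, training, prediction, and evaluation. Three different network architectures are tried.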

1.2 Dataset Overview ✅

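The training dataset is stored in a CSV file. It is fairly large, with 528690 records; the ret field is the label to be predicted.

The column headers and their meanings are listed in the table below:

Field	Description
id	Log record ID
account	User account (anonymized)
group	Department the user belongs to
IP	Terminal IP
url	URL visited by the terminal
port	Application port used by the terminal
vlan	ID of the virtual LAN the terminal belongs to
switchIP	IP of the switch the terminal is connected to
time	Time at which the access occurred
ret	Anomaly score of the behavior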

2. Data Processing

2.1 Importing the Dataset and Dependencies

! unzip -o /home/aistudio/data/data147183/train_data.zip

Archive:  /home/aistudio/data/data147183/train_data.zip
  inflating: train_data.csv

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

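2.1.1 Reading the Data

Reading the file with the default settings produces garbled Chinese text or raises an error, so the encoding must be specified: encoding='gbk'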

train_data = pd.read_csv("train_data.csv", encoding='gbk')
train_data.head()

	id	account	group	IP	url	port	vlan	switchIP	time	ret
0	1	xiaojiawei@qq.com	人事行政中心	192.168.1.50	http://123.6.4.41	15788	700	129.30.06.37	2021/6/16 7:56	0.1149
1	2	xiaojiawei@qq.com	人事行政中心	192.168.31.46	http://104.192.108.154	12665	700	152.91.89.45	2021/6/28 7:58	0.1801
2	3	xiaojiawei@qq.com	人事行政中心	192.168.2.3	http://42.236.37.80	25551	700	129.30.06.37	2021/6/1 6:37	0.3690
3	4	xiaojiawei@qq.com	人事行政中心	192.168.145.30	http://uf-api6-client.ksord.com	18274	700	162.6.8.29	2021/5/5 8:18	0.1532
4	5	xiaojiawei@qq.com	人事行政中心	192.168.178.92	http://101.199.128.170	34572	700	162.0.1.84	2021/6/20 6:52	0.1449
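Why the encoding has to be given explicitly can be reproduced with the standard library alone. The sketch below is an illustration rather than part of the original notebook: `decode_with_fallback` is a hypothetical helper, and the sample bytes are simply the department name from the data, encoded as GBK.

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "gbk")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            # This encoding cannot represent the byte sequence; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")

# Bytes as they would be stored in a GBK-encoded CSV file.
raw = "人事行政中心".encode("gbk")
text, used = decode_with_fallback(raw)
print(text, used)
```

The UTF-8 attempt fails on these bytes, which is the same failure mode that a default `pd.read_csv` call runs into on this file.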

# Inspect the column index
train_data.columns

Index(['id', 'account', 'group', 'IP', 'url', 'port', 'vlan', 'switchIP',
       'time', 'ret'],
      dtype='object')

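2.1.2 Constructing New Fields

The time field is read from the CSV as text. After converting it to datetime64 with pd.to_datetime, it can be split into year, month, day, weekday, and hour fields, which are candidate training features.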

train_data['time'] = pd.to_datetime(train_data['time'])
train_data['hour'] = train_data['time'].dt.hour
train_data['weekday'] = train_data['time'].dt.weekday
train_data['year'] = train_data['time'].dt.year
train_data['month'] = train_data['time'].dt.month
train_data['day'] = train_data['time'].dt.day

train_data.head()

	id	account	group	IP	url	port	vlan	switchIP	time	ret	hour	weekday	year	month	day
0	1	xiaojiawei@qq.com	人事行政中心	192.168.1.50	http://123.6.4.41	15788	700	129.30.06.37	2021-06-16 07:56:00	0.1149	7	2	2021	6	16
1	2	xiaojiawei@qq.com	人事行政中心	192.168.31.46	http://104.192.108.154	12665	700	152.91.89.45	2021-06-28 07:58:00	0.1801	7	0	2021	6	28
2	3	xiaojiawei@qq.com	人事行政中心	192.168.2.3	http://42.236.37.80	25551	700	129.30.06.37	2021-06-01 06:37:00	0.3690	6	1	2021	6	1
3	4	xiaojiawei@qq.com	人事行政中心	192.168.145.30	http://uf-api6-client.ksord.com	18274	700	162.6.8.29	2021-05-05 08:18:00	0.1532	8	2	2021	5	5
4	5	xiaojiawei@qq.com	人事行政中心	192.168.178.92	http://101.199.128.170	34572	700	162.0.1.84	2021-06-20 06:52:00	0.1449	6	6	2021	6	20
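As a cross-check of the derived fields, the same decomposition can be done with the standard library. This is a sketch, not the notebook's code: `time_features` is a hypothetical helper, and the format string assumes timestamps shaped like `2021/6/16 7:56`, as in the raw CSV. Note that the weekday convention is Monday = 0, matching `dt.weekday` in pandas.

```python
from datetime import datetime

def time_features(raw: str) -> dict:
    """Parse a 'YYYY/M/D H:MM' timestamp and return the derived fields."""
    ts = datetime.strptime(raw, "%Y/%m/%d %H:%M")
    return {
        "hour": ts.hour,
        "weekday": ts.weekday(),  # Monday = 0, ..., Sunday = 6
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
    }

# First record of the dataset, as shown in the raw table.
features = time_features("2021/6/16 7:56")
print(features)
```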

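2.1.3 Inspecting Data Types

Inspect the data type of each field in the dataset. Fields of type object need to be encoded as numeric features.

# Inspect the data type of each field in the training set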
train_data.dtypes

id                   int64
account             object
group               object
IP                  object
url                 object
port                 int64
vlan                 int64
switchIP            object
time        datetime64[ns]
ret                float64
hour                 int64
weekday              int64
year                 int64
month                int64
day                  int64
dtype: object

2.1.4 Inspecting Descriptive Statistics

# Show basic descriptive statistics for each numeric field
train_data.describe()

	id	port	vlan	ret	hour	weekday	year	month	day
count	528690.000000	528690.000000	528690.000000	528690.000000	528690.000000	528690.000000	528690.0	528690.000000	528690.000000
mean	264345.500000	25241.669929	1152.486902	0.257757	6.334313	2.993121	2021.0	5.500458	14.999383
std	152619.801246	8543.017235	438.932786	0.219326	4.258042	1.997915	0.0	0.500828	8.369525
min	1.000000	11122.000000	700.000000	0.000100	0.000000	0.000000	2021.0	4.000000	1.000000
25%	132173.250000	16865.000000	900.000000	0.078500	3.000000	1.000000	2021.0	5.000000	8.000000
50%	264345.500000	25278.000000	1000.000000	0.208000	6.000000	3.000000	2021.0	6.000000	15.000000
75%	396517.750000	33474.000000	1200.000000	0.377400	9.000000	5.000000	2021.0	6.000000	22.000000
max	528690.000000	38867.000000	2000.000000	1.000000	23.000000	6.000000	2021.0	6.000000	31.000000

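2.2 Duplicate Detection

Use the following call to check whether the dataset contains duplicate rows; a return value of False means there are none.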

train_data.duplicated().any()

False

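2.3 Missing-Value Detection

Use the following call to check whether the dataset contains missing values (NaN). A value of True for a column means all of its entries are non-null, i.e. the column has no missing values.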

pd.notnull(train_data).all()
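The duplicate and missing-value checks are easier to read on a tiny frame where the answer is known in advance. The sketch below uses made-up values (the `toy` frame is not taken from the dataset) and counts missing entries per column with `isnull().sum()`, which is a common complement to the all-non-null check.

```python
import numpy as np
import pandas as pd

# Toy frame with one duplicated row (rows 1 and 2) and one missing value.
toy = pd.DataFrame({
    "account": ["a", "b", "b", "c"],
    "port": [80.0, 443.0, 443.0, np.nan],
})

# True if any row repeats an earlier row in full.
has_duplicates = bool(toy.duplicated().any())
# Number of NaN entries in each column.
missing_per_column = toy.isnull().sum()

print(has_duplicates)
print(missing_per_column)
```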

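# Compute pairwise correlations between the numeric fields only
corr = train_data.select_dtypes(include='number').corr()
# Plot the correlations as a heatmap
plt.figure(figsize=(15,15),dpi=100)
sns.heatmap(corr, square=True, linewidths=0.1, annot=True)

<matplotlib.axes._subplots.AxesSubplot at 0x7f53861e2250>

[Figure: heatmap of pairwise correlations between the numeric fields]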

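2.4 Data Preprocessing

2.4.1 Encoding

During data processing, discrete numbers or text sometimes need to be converted to numeric codes.
Here we encode the five object-typed feature fields 'account', 'group', 'IP', 'url', and 'switchIP'.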

from sklearn.preprocessing import LabelEncoder

for feat in ['account', 'group', 'IP', 'url', 'switchIP']:
    labelencoder = LabelEncoder()
    train_data[feat] = labelencoder.fit_transform(train_data[feat])

# Inspect the dataset after encoding
train_data.head()

	id	account	group	IP	url	port	vlan	switchIP	time	ret	hour	weekday	year	month	day
0	1	113	1	18	216	15788	700	44	2021-06-16 07:56:00	0.1149	7	2	2021	6	16
1	2	113	1	101	157	12665	700	91	2021-06-28 07:58:00	0.1801	7	0	2021	6	28
2	3	113	1	81	373	25551	700	44	2021-06-01 06:37:00	0.3690	6	1	2021	6	1
3	4	113	1	39	1135	18274	700	102	2021-05-05 08:18:00	0.1532	8	2	2021	5	5
4	5	113	1	77	57	34572	700	92	2021-06-20 06:52:00	0.1449	6	6	2021	6	20
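What label encoding does can be sketched in plain Python: each distinct value is replaced by its rank among the sorted distinct values, which mirrors how `LabelEncoder` assigns codes. The helper `label_encode` and the example URLs are hypothetical and only serve the illustration.

```python
def label_encode(values):
    """Map each distinct value to its index among the sorted distinct values."""
    classes = sorted(set(values))
    mapping = {value: index for index, value in enumerate(classes)}
    return [mapping[v] for v in values], classes

urls = ["http://b.example", "http://a.example", "http://b.example", "http://c.example"]
codes, classes = label_encode(urls)
print(codes)
print(classes)
```

Because the codes follow sort order, they carry no meaning of their own, yet a network treats them as numeric magnitudes. This is worth keeping in mind when interpreting the model.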

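2.4.2 Splitting into Training and Test Sets

The fields 'account', 'group', 'IP', 'url', 'port', 'vlan', 'switchIP', 'hour', 'weekday', 'year', 'month', and 'day' are selected as training features.
The 'ret' field is the target value (label) to be predicted.
The dataset is split into training set : test set = 0.75 : 0.25.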

data_X = train_data.loc[train_data.index[:], ['account', 'group', 'IP', 'url', 'port', 'vlan', 'switchIP', 'hour', 'weekday', 'year', 'month', 'day']]
data_Y = train_data['ret']

from sklearn.model_selection import train_test_split
# Split the dataset
x_train, x_test, y_train, y_test = train_test_split(data_X, data_Y, test_size=0.25, random_state=6)
print("Training-set features:\n", x_train, x_train.shape)
print("Test-set labels:\n", y_test, y_test.shape)

print("The length of original data X is:", data_X.shape[0])
print("The length of train Data is:", x_train.shape[0])
print("The length of test Data is:", x_test.shape[0])
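The split sizes that these print statements should report can be worked out by hand. The arithmetic below is a sketch that assumes the test count is rounded up and the training set receives the remainder, which is how `train_test_split` is understood to behave when only `test_size` is given as a fraction.

```python
import math

n_samples = 528690   # number of records stated in the dataset description
test_size = 0.25

# Assumption: test count is rounded up, training set gets the rest.
n_test = math.ceil(n_samples * test_size)
n_train = n_samples - n_test

print(n_train, n_test)
```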

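2.4.3 Reshaping the Label Array

The training features form a two-dimensional array, whereas the training labels form a one-dimensional array. The labels must be reshaped into a two-dimensional array before being fed to the neural network so that the dimensions are consistent; otherwise an error is raised.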

x_train = np.array(x_train, dtype='float32')
y_train = np.array(y_train, dtype='float32')

y_train

array([0.1724, 0.0298, 0.0739, ..., 0.3177, 0.0929, 0.1663], dtype=float32)

y_train = y_train.reshape(-1,1)
y_train

x_test = np.array(x_test, dtype='float32')
y_test = np.array(y_test, dtype='float32')

y_test = y_test.reshape(-1,1)
y_test
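The effect of `reshape(-1, 1)` is easiest to see on a few values. In this small sketch the three numbers are the first labels printed above; `-1` lets NumPy infer the number of rows, and `1` fixes a single column.

```python
import numpy as np

labels = np.array([0.1724, 0.0298, 0.0739], dtype="float32")
# Turn the flat vector into a column: one row per sample, one label per row.
column = labels.reshape(-1, 1)

print(labels.shape, column.shape)
```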

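2.4.4 Normalization (Standardization)

We standardize the feature values of the training and test sets. The scaler is fitted on the training set only, and the same fitted scaler is then applied to the test set.

# Standardization
from sklearn.preprocessing import StandardScaler


# 1. Instantiate a transformer
transfer = StandardScaler()
# 2. Fit on the training features, then apply the same transformation to the test features
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)

# y_train = transfer.fit_transform(y_train)
# y_test = transfer.fit_transform(y_test)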
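Standardization rescales each feature to $z = (x - \mu) / \sigma$. The NumPy sketch below is an illustration under two assumptions: the statistics are estimated on the training split and reused for the test split, and a constant column (such as year, which is 2021 throughout) is divided by 1 instead of 0. The helper names and toy values are hypothetical.

```python
import numpy as np

def fit_standardizer(x):
    """Return per-column mean and standard deviation estimated from x."""
    mean = x.mean(axis=0)
    std = x.std(axis=0)        # population standard deviation (ddof=0)
    std[std == 0.0] = 1.0      # constant column: avoid division by zero
    return mean, std

def apply_standardizer(x, mean, std):
    """Standardize x with previously fitted statistics."""
    return (x - mean) / std

train = np.array([[1.0, 2021.0], [3.0, 2021.0], [5.0, 2021.0]])
test = np.array([[7.0, 2021.0]])

mean, std = fit_standardizer(train)
train_z = apply_standardizer(train, mean, std)
test_z = apply_standardizer(test, mean, std)

print(train_z)
print(test_z)
```

The constant column maps to 0 everywhere, which is consistent with the zero column visible in the standardized `x_train` array.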

2.5 Setting the Random Seed

import random
import paddle

seed = 666
# Set the random seeds so that results are reproducible
def set_seed(seed):
    np.random.seed(seed)
    random.seed(seed)
    paddle.seed(seed)

set_seed(seed)

3. Building the Model

The networks are built with PaddlePaddle. The activation function can be chosen from:

$\tanh(x) = 2\sigma(2x) - 1$
$\mathrm{ReLU}(x) = \max(0, x)$
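Both activation formulas can be checked numerically with the standard library. The sketch below defines the sigmoid $\sigma(x) = 1 / (1 + e^{-x})$ and compares the identity for tanh against `math.tanh`; the function names are chosen for this illustration only.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x: float) -> float:
    # tanh(x) = 2 * sigmoid(2x) - 1
    return 2.0 * sigmoid(2.0 * x) - 1.0

def relu(x: float) -> float:
    # ReLU(x) = max(0, x)
    return max(0.0, x)

for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, tanh_via_sigmoid(x), math.tanh(x), relu(x))
```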

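The number of hidden layers can be 1, 2, or 3.

Details of the Paddle model-building APIs can be found in the PaddlePaddle API documentation; the main building blocks are under paddle.nn.

After some experimentation we built the three neural networks below. Only one of them needs to be run before training.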

3.1 Neural Network 1

Architecture tried: $12 \times 64 \times 32 \times 16 \times 8 \times 1$

'account', 'group', 'IP', 'url', 'port', 'vlan', 'switchIP', 'hour', 'weekday', 'year', 'month', 'day': 12 features in total as input
'ret': the label, produced as the regression output
Activation function: Tanh
4 hidden layers in total

import paddle
import paddle.nn as nn

# Define the network (dynamic graph)
class Classification(paddle.nn.Layer):
    def __init__(self):
        super(Classification, self).__init__()
        self.drop = paddle.nn.Dropout(p=0.1)
        self.fc1 = paddle.nn.Linear(12, 64)
        self.fc2 = paddle.nn.Linear(64, 32)
        self.fc3 = paddle.nn.Linear(32, 16)
        self.fc4 = paddle.nn.Linear(16, 8)
        self.fc5 = paddle.nn.Linear(8, 1)
        self.Tanh = nn.Tanh()
    
    # Forward pass of the network
    def forward(self, inputs):
        x = self.Tanh(self.fc1(inputs))
        x = self.drop(x)
        x = self.Tanh(self.fc2(x))
        x = self.drop(x)
        x = self.Tanh(self.fc3(x))
        x = self.drop(x)
        x = self.Tanh(self.fc4(x))
        x = self.drop(x)
        pred = self.fc5(x)
        return pred

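3.2 Neural Network 2

One input layer, one hidden layer, and one output layer: three layers in total.
Activation function: Tanh
Network structure: $12 \times 8 \times 1$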

import paddle
import paddle.nn as nn

# Define the network (dynamic graph)
class Classification(paddle.nn.Layer):
    def __init__(self):
        super(Classification, self).__init__()
        self.drop = paddle.nn.Dropout(p=0.1)
        self.fc1 = paddle.nn.Linear(12, 8)
        self.fc2 = paddle.nn.Linear(8, 1)
        self.Tanh = nn.Tanh()
    
    # Forward pass of the network
    def forward(self, inputs):
        x = self.Tanh(self.fc1(inputs))
        x = self.drop(x)
        pred = self.fc2(x)
        return pred

3.3 Neural Network 3

A fully connected network with one input layer and one output layer.
Activation function: ReLU
Network structure: $12 \times 1$

import paddle
import paddle.nn as nn

# Define the network (dynamic graph)
class Classification(paddle.nn.Layer):
    def __init__(self):
        super(Classification, self).__init__()
        self.fc1 = paddle.nn.Linear(12, 1)
        self.ReLU = nn.ReLU()
    
    # Forward pass of the network
    def forward(self, inputs):
        pred = self.ReLU(self.fc1(inputs))
        return pred
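To compare the capacity of the three architectures, the number of trainable parameters can be counted directly from the layer widths: a fully connected layer from n_in to n_out units has n_in × n_out weights plus n_out biases. The helper below is a small sketch written for this comparison, and the dictionary keys are just labels.

```python
def count_parameters(layer_sizes):
    """Total weights and biases of a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

architectures = {
    "network 1": [12, 64, 32, 16, 8, 1],
    "network 2": [12, 8, 1],
    "network 3": [12, 1],
}

counts = {name: count_parameters(sizes) for name, sizes in architectures.items()}
print(counts)
```

Dropout and activation layers add no parameters, so these totals cover each network completely.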

4. Model Training

4.1 Defining the Visualization Function

# Define draw_train_process, which plots how the loss changes during training
train_nums = []
train_costs = []
def draw_train_process(iters,train_costs):
    title="training cost"
    plt.title(title, fontsize=24)
    plt.xlabel("iter", fontsize=14)
    plt.ylabel("cost", fontsize=14)
    plt.plot(iters, train_costs,color='red',label='training cost') 
    plt.grid()
    plt.show()

4.2 Defining the Loss Function

import paddle
import paddle.nn.functional as F

class kl_loss(paddle.nn.Layer):
    def __init__(self):
       super(kl_loss, self).__init__()

    def forward(self, p, q, label):
        # Average the MSE of the two forward passes against the label
        ce_loss = 0.5 * (F.mse_loss(p, label=label) + F.mse_loss(q, label=label))
        kl_loss = self.compute_kl_loss(p, q)

        # carefully choose hyper-parameters
        loss = ce_loss + 0.3 * kl_loss 

        return loss

    def compute_kl_loss(self, p, q):
        
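        # Note: softmax is taken over the last axis. With a single output unit per sample
        # it is constantly 1, so both KL terms evaluate to 0 for the networks defined above.
        p_loss = F.kl_div(F.log_softmax(p, axis=-1), F.softmax(q, axis=-1), reduction='none')
        q_loss = F.kl_div(F.log_softmax(q, axis=-1), F.softmax(p, axis=-1), reduction='none')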

        # You can choose whether to use function "sum" and "mean" depending on your task
        p_loss = p_loss.sum()
        q_loss = q_loss.sum()

        loss = (p_loss + q_loss) / 2

        return loss
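The structure of this loss (a regression term plus a symmetric KL term between two forward passes) can be examined outside Paddle. The following is a NumPy re-implementation written for illustration only, not Paddle's code; the function names are hypothetical. It also shows what the KL term does when each sample has a single output value, as in the networks above.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def symmetric_kl(p_out, q_out):
    """Average of KL(Q||P) and KL(P||Q) between the softmax distributions, summed over entries."""
    log_p, log_q = log_softmax(p_out), log_softmax(q_out)
    p, q = np.exp(log_p), np.exp(log_q)
    kl_qp = (q * (log_q - log_p)).sum()
    kl_pq = (p * (log_p - log_q)).sum()
    return float((kl_qp + kl_pq) / 2)

# Two outputs per sample: the two passes disagree, so the KL term is positive.
multi = symmetric_kl(np.array([[2.0, 0.0]]), np.array([[0.0, 2.0]]))

# One output per sample: softmax over a single value is always 1, so the KL term vanishes.
single = symmetric_kl(np.array([[0.3], [0.7]]), np.array([[0.1], [0.9]]))

print(multi, single)
```

Under this reading, with one output unit the loss reduces to its MSE part, and the KL weight of 0.3 has no effect.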

x_train

array([[-1.6798261 ,  0.9158834 , -0.5996304 , ...,  0.        ,
         0.99791497, -0.95602554],
       [ 0.15875107, -1.6193113 ,  0.16653812, ...,  0.        ,
        -0.9987269 ,  0.95592046],
       [ 1.2848797 ,  1.4229224 , -0.7528641 , ...,  0.        ,
        -0.9987269 , -1.434012  ],
       ...,
       [-1.2891285 , -0.0981945 , -1.1614873 , ...,  0.        ,
        -0.9987269 ,  0.47793394],
       [-1.5649151 ,  0.9158834 ,  1.0604014 , ...,  0.        ,
        -0.9987269 , -0.47803903],
       [ 0.11278664,  1.4229224 , -1.2125653 , ...,  0.        ,
        -0.9987269 , -0.23904581]], dtype=float32)

4.3 Model Training

The training settings are as follows:

Epochs: 5
batch_size: 64
Loss function: kl_loss
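These settings determine how many optimizer steps are taken and the T_max value that the training code passes to its cosine-annealing schedule. The sketch below assumes a training split of 396517 rows (75% of 528690) and uses the standard closed-form cosine-annealing formula, which is assumed to match the scheduler's behaviour up to T_max. `cosine_lr` is a hypothetical helper.

```python
import math

n_train = 396517      # assumption: 75% of the 528690 records
batch_size = 64
epochs = 5
base_lr = 0.001

# The last batch of each epoch is smaller, so round up.
batches_per_epoch = math.ceil(n_train / batch_size)
# Same expression as the T_max argument in the training code.
t_max = int(n_train / batch_size * epochs)

def cosine_lr(step, base_lr=base_lr, t_max=t_max, eta_min=0.0):
    """Closed-form cosine annealing from base_lr down to eta_min over t_max steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * step / t_max))

print(batches_per_epoch, t_max)
print(cosine_lr(0), cosine_lr(t_max // 2), cosine_lr(t_max))
```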

import paddle.nn.functional as F
y_preds = []
labels_list = []
BATCH_SIZE = 64
train_data = x_train
train_data_y = y_train
test_data = x_test
test_data_y = y_test
compute_kl_loss = kl_loss()
CET_loss = paddle.nn.CrossEntropyLoss()

def train(model):
    print('start training ... ')
    # Switch the model to training mode (enables dropout)
    model.train()
    EPOCH_NUM = 5
    train_num = 0
    scheduler = paddle.optimizer.lr.CosineAnnealingDecay(learning_rate=0.001, T_max=int(train_data.shape[0]/BATCH_SIZE*EPOCH_NUM), verbose=False)
    optimizer = paddle.optimizer.Adam(learning_rate=scheduler, parameters=model.parameters())
    for epoch_id in range(EPOCH_NUM):
        # At the start of each epoch, shuffle features and labels with the same permutation
        # so that every sample stays paired with its own label
        perm = np.random.permutation(len(train_data))
        shuffled_x = train_data[perm]
        shuffled_y = train_data_y[perm]
        # Split the training data into mini-batches of BATCH_SIZE (64) samples each
        mini_batches = [np.append(shuffled_x[k: k+BATCH_SIZE], shuffled_y[k: k+BATCH_SIZE], axis = 1) for k in range(0, len(shuffled_x), BATCH_SIZE)]
        for batch_id, data in enumerate(mini_batches):
            features_np = np.array(data[:, :12], np.float32)
            labels_np = np.array(data[:, -1:], np.float32)

            features = paddle.to_tensor(features_np)
            labels = paddle.to_tensor(labels_np)
            # Forward pass: two passes through the same model (they differ only when the model uses dropout)
            # y_pred = model(features)
            y_pred1 = model(features)
            y_pred2 = model(features)
            cost = compute_kl_loss(y_pred1, y_pred2, label=labels)
            # cost = CET_loss(y_pred, labels)
            # cost = F.mse_loss(y_pred, label=labels)
            train_cost = cost.item()
            # Backward pass
            cost.backward()
            # Update the parameters to minimize the loss
            optimizer.step()
            # Advance the cosine-annealing learning-rate schedule by one step
            scheduler.step()
            # Clear the gradients
            optimizer.clear_grad()
            if batch_id % 500 == 0 and epoch_id % 1 == 0:
                print("Pass:%d,Cost:%0.5f"%(epoch_id, train_cost))

            train_num = train_num + BATCH_SIZE
            train_nums.append(train_num)
            train_costs.append(train_cost)

model = Classification()
train(model)
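Shuffling is only safe when features and labels are permuted together; otherwise samples end up paired with the wrong labels. The generator below is a minimal NumPy sketch of paired shuffling and batching. `iterate_minibatches` and the toy arrays are hypothetical and independent of the training code above.

```python
import numpy as np

def iterate_minibatches(features, labels, batch_size, rng):
    """Yield (features, labels) batches drawn with one shared random permutation."""
    perm = rng.permutation(len(features))
    for start in range(0, len(features), batch_size):
        idx = perm[start:start + batch_size]
        yield features[idx], labels[idx]

rng = np.random.default_rng(666)
features = np.arange(10, dtype="float32").reshape(-1, 1)
labels = features * 2          # each label is tied to its feature row

batches = list(iterate_minibatches(features, labels, batch_size=4, rng=rng))
for x, y in batches:
    print(x.ravel(), y.ravel())
```

Every printed batch keeps the relation label = 2 × feature, which confirms that the pairing survives the shuffle.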

4.4 Visualizing the Training Process

We trained the third neural network defined above; its training process is visualized below.

import matplotlib
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

draw_train_process(train_nums, train_costs)

[Figure: training cost curve (x-axis: iter, y-axis: cost)]

5. Model Prediction

The test set is large, so you can predict a single sample or a subset of samples.

train_data = x_train
train_data_y = y_train
test_data = x_test
test_data_y = y_test

def predict(model):
    print('start evaluating ... ')
    # Switch the model to evaluation mode (disables dropout)
    model.eval()
    outputs = []
    mini_batches = [np.append(test_data[k: k+BATCH_SIZE], test_data_y[k: k+BATCH_SIZE], axis = 1) for k in range(0, len(test_data), BATCH_SIZE)]
    for data in mini_batches:
        features_np = np.array(data[:, :12], np.float32)
        features = paddle.to_tensor(features_np)
        pred = model(features)
        # Regression: keep the predicted scores themselves;
        # argmax over a single output column would always return index 0
        outputs.extend(pred.numpy().reshape(-1))
    return outputs

outputs = predict(model)

start evaluating ...

# Inspect the first 30 samples of the test set
test_data[0:30]

5.1 Predicting a Single Sample

infer_feature = test_data[0]
infer_feature = infer_feature.reshape(1, 12)
infer_feature = paddle.to_tensor(np.array(infer_feature, dtype='float32'))
result = model(infer_feature)
print(result)

Tensor(shape=[1, 1], dtype=float32, place=CUDAPlace(0), stop_gradient=False,
       [[0.]])

result.item()

0.0

5.2 Predicting Multiple Samples

predict_result = []
for infer_feature in test_data[0:30]:
    infer_feature = infer_feature.reshape(1, 12)
    infer_feature = paddle.to_tensor(np.array(infer_feature, dtype='float32'))
    result = model(infer_feature)
    # Store the prediction as a plain Python float so the list converts cleanly to a NumPy array
    predict_result.append(result.item())
print(predict_result)

# Predicted samples
predict_result = np.array(predict_result)
predict_result_new = predict_result.reshape(-1,1)

# Ground-truth samples from the test set
test_data_y_new = test_data_y[0:30]
test_data_y_new

6. Computing Metrics

6.1 Mean Squared Error (MSE)

from sklearn.metrics import mean_squared_error
print(mean_squared_error(test_data_y,outputs))

6.2 Mean Absolute Error (MAE)

from sklearn.metrics import mean_absolute_error
print(mean_absolute_error(test_data_y,outputs))
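For reference, the two metrics are $\mathrm{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$ and $\mathrm{MAE} = \frac{1}{n}\sum_i |y_i - \hat{y}_i|$. The plain-Python sketch below computes both on made-up values; the functions `mse` and `mae` are illustrative helpers, not the scikit-learn implementations.

```python
def mse(y_true, y_pred):
    """Mean of squared differences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of absolute differences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up scores for illustration only.
y_true = [0.1, 0.2, 0.4]
y_pred = [0.2, 0.2, 0.1]

print(mse(y_true, y_pred), mae(y_true, y_pred))
```

MSE penalizes the single large error (0.3) much more heavily than MAE does, which is why the two metrics are usually reported together.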

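7. Summary and Outlook ⭐️

This project builds neural networks to perform regression prediction of anomalous user Internet behavior for UEBA.
Because the dataset has a limited set of features, enriching the features further is worth considering.
In future work, other network models, such as convolutional networks, could be applied to this regression task and their performance analyzed.
Different loss functions could also be tried to optimize the model further.

Thanks once again to the project mentor, Zhang Hongli, for the guidance.

The project members are from the PaddlePaddle Pilot Team of the University of Science and Technology Beijing.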


This article is a repost.
Original project link
