Malware has proliferated rapidly in recent years, so dealing with it has become an urgent problem. Machine-learning approaches to malware classification are already fairly mature, but they place high demands on feature engineering, so here I take a deep-learning approach instead. Conveniently, the Canadian Institute for Cybersecurity (CIC) has published extracted feature files for recent Android malware; building on that dataset, I put together an entry-level malware classification project.

0 Background

In an earlier attempt I treated the data as images for classification, and the results were poor; plain machine-learning methods already achieve fairly good results. So here we abandon the image idea and return to ordinary classification. Since each row of values belongs to one class, we can treat the task like text classification. That gradually shifts the approach toward NLP, where the usual Transformer recipes would apply directly. In the spirit of taking one step at a time, though, this project does not involve a Transformer; we first build a fully connected network.

1 Dataset

See the CIC release for details:

https://www.unb.ca/cic/datasets/maldroid-2020.html

Here I use the 470-feature version of the data (471 columns including the label); the CSV is already mounted in this project's datasets.

# Copy the data file into the working directory
!cp data/data157200/syscallsbinders_frequency_5_Cat.csv work/

2 Data Processing

2.1 A First Look at the Data

Since the data is a CSV file, I read it with pandas, which is quick and convenient.

The header shows that the columns are frequency counts of various permissions, system calls, and binder calls.

The last column of each row is the class label for that sample.

import pandas as pd
# Read the whole file by name
data_471 = pd.read_csv("work/syscallsbinders_frequency_5_Cat.csv")
data_471.head(2)     # show the first rows; head() defaults to 5
(head(2) output truncated: the 471 columns run from ACCESS_PERSONAL_INFO___, ALTER_PHONE_STATE___, ANTI_DEBUG____, ... through write, writev, and the final Class label)

2 rows × 471 columns

2.2 Filtering Useless Columns

The idea here is: for each column, if the sum of its values is 0, then that column carries no information at all.

In other words, this is also a form of feature dimensionality reduction.

The earlier image-based approach had to pad the data up to an N×N+1 layout; that is no longer needed here.

I only remove the columns that are entirely zero.

(As it turns out, there are no all-zero columns QAQ)

import pandas as pd
# Read the whole file by name
data_471 = pd.read_csv("work/syscallsbinders_frequency_5_Cat.csv")
df = pd.DataFrame(data_471)
len(df.columns)
# Collect the columns whose sum is at or below the threshold
threshold = 0
de = []
# Also track the maximum value anywhere in the sample data
de_max = 0
for i in range(0, 470):  # the 470 feature columns; the final Class column is skipped
    name = df.columns.values[i]
    if df[name].sum() <= threshold:  # sum <= 0 means all-zero for non-negative counts
        de.append(name)
    if df[name].max() >= de_max:
        de_max = df[name].max()

# Drop the collected columns
for j in de:
    df.drop(j, axis=1, inplace=True)

len_new = len(df.columns)

len_new
471
de_max
3697410
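As an aside, the same all-zero-column filter can be written as one vectorized pandas expression; a minimal sketch, equivalent to the loop above for non-negative count data (the Class column always survives, since every label is in 1-5):

# Keep every column whose sum across all rows is nonzero
df_filtered = df.loc[:, df.sum(axis=0) != 0]
print(df_filtered.shape)

The maximum value recorded above as de_max could later be used for min-max scaling if we decide to normalize the raw frequency counts.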
# Write the new data out to CSV. Nothing changed, but saving a copy is a good backup habit
df.to_csv("new_date_"+str(len_new)+".csv", index=False, sep=',')
import numpy as np
import pandas as pd

mydata = pd.read_csv("new_date_"+str(len_new)+".csv")
row_list = mydata.values.tolist()

# Show one sample
print(row_list[0])
[1, 0, 0, 3, 0, 14, 2, 0, 3, 0, 3, 0, 0, 0, 4, 0, 9, 0, 7, 2, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 4, 0, 0, 0, 3, 0, 14, 6, 0, 0, 11, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 3, 21, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 0, 0, 109, 14, 24, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 3, 0, 0, 3, 0, 0, 0, 0, 0, 20, 0, 0, 0, 0, 2, 0, 14, 0, 2, 0, 0, 170, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 18, 0, 23, 0, 0, 0, 6, 6, 33, 26, 25, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 70, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 5, 61, 3, 0, 0, 61, 0, 133, 0, 0, 0, 22, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 28, 0, 0, 0, 0, 2, 0, 0, 0, 0, 4, 96, 12, 0, 0, 0, 0, 0, 0, 0, 0, 29, 0, 0, 3, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 14, 1, 1, 1, 0, 9, 0, 0, 0, 0, 1, 17, 31, 0, 7, 0, 0, 0, 0, 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 37, 10, 1]
len(row_list)
11598
# The last element of a row is its class label (1-5); shown here for one row
row_list[0][-1]
1
import numpy as np

# Class distribution: the last column of every row is the label
cls_ = np.asarray(mydata.iloc[:, -1])
print('Labels:', np.unique(cls_))
Labels: [1 2 3 4 5]
_records = {
    i: sum([1 for j in cls_ if j==i]) for i in range(1,6)
}
print('dist:', _records)
dist: {1: 1253, 2: 2100, 3: 3904, 4: 2546, 5: 1793}
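For reference, pandas can produce the same distribution in one line; a small sketch, assuming the label column is named Class as the header suggests:

print(mydata['Class'].value_counts().sort_index())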

2.3 Keeping the Split Class-Balanced

This prevents the shuffled train/eval split from ending up with an imbalanced class distribution.

import random
random.shuffle(row_list)
# Sanity check that the shuffle worked: the raw data begins with all class-1 rows,
# so as long as the output below is not all 1s, the list really was shuffled
for i in range(4):
    print(row_list[i][-1])
3
4
3
5
# Split the data
train_rate = 0.8

train_list = []
eval_list = []

_cls_records = {i: 0 for i in cls_}

for data in row_list:
    _data_str, _label_int = data[:-1], data[-1:]
    label = int(_label_int[0])
    # Keep each class at the same train/eval ratio
    if _cls_records[label] < int(_records[label]*train_rate):
        train_list.append([_data_str, _label_int])
    else:
        eval_list.append([_data_str, _label_int])
    _cls_records[label] += 1

print(len(train_list))
print(len(eval_list))
print(len(train_list)+len(eval_list))
9275
2323
11598
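If scikit-learn is available in the environment, the same stratified 80/20 split is a one-liner; a hedged sketch, not the code used above:

from sklearn.model_selection import train_test_split

features = [row[:-1] for row in row_list]
labels = [row[-1] for row in row_list]

# stratify=labels keeps the 1-5 class proportions identical in both splits
X_train, X_eval, y_train, y_eval = train_test_split(
    features, labels, train_size=0.8, stratify=labels, random_state=42)
print(len(X_train), len(X_eval))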
# Unpack the data samples
train_data = []
train_label = []
for data_label in train_list:
    data,label = data_label[0],data_label[1]
    train_data.append(data)
    train_label.append(label)

val_data = []
val_label = []
for data_label in eval_list:
    data,label = data_label[0],data_label[1]
    val_data.append(data)
    val_label.append(label)
print("Train Dataset Length:", len(train_data))
print("Eval Dataset Length:", len(val_data))
print("Example Sample:", train_data[0], '------', train_label[0])
Train Dataset Length: 9275
Eval Dataset Length: 2323
Example Sample: [2, 0, 0, 3, 0, 25, 4, 0, 9, 0, 2, 0, 0, 2, 2, 0, 14, 0, 7, 4, 0, 0, 0, 4, 0, 8, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 25, 5254, 0, 0, 18, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 70, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 446, 25, 27, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 7, 11, 59, 0, 0, 1, 0, 0, 0, 0, 0, 69, 5, 0, 0, 0, 2, 0, 31, 0, 0, 0, 0, 440, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 10, 10, 10, 0, 0, 0, 36, 0, 30, 0, 0, 0, 0, 3, 177, 10, 40, 0, 0, 0, 0, 1, 0, 3, 0, 0, 0, 0, 0, 0, 91, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 36, 10, 1034, 3, 0, 0, 1029, 0, 285, 0, 0, 0, 14, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 40, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 1048, 9, 0, 0, 0, 0, 0, 0, 0, 0, 5311, 0, 0, 0, 2, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 26, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 26, 21, 0, 2, 8, 0, 0, 0, 0, 0, 19, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 102, 30] ------ [3]

With the data prepared, we can move on to writing the dataset class and building the model network.

3 Writing the Dataset Class

import numpy as np
from paddle.io import Dataset

class MyDataset(Dataset):
    def __init__(self, mode='train'):
        # Hold references to the prepared splits
        self.training_data, self.training_label, self.test_data, self.test_label = train_data, train_label, val_data, val_label
        if mode == 'train':
            self.num_samples = len(train_data)
        else:
            self.num_samples = len(val_data)
        self.mode = mode

    def __getitem__(self, idx):
        if self.mode == 'train':
            data = self.training_data[idx]
            label = self.training_label[idx]
        else:
            data = self.test_data[idx]
            label = self.test_label[idx]

        data = np.array(data).astype('float32')

        return data, np.array(label, dtype='int64')

    def __len__(self):
        # Total number of samples
        return self.num_samples
# Data provider for training
train_dataset = MyDataset(mode='train')
# Data provider for evaluation
eval_dataset = MyDataset(mode='val')

# Check the sizes of the train and eval sets
print('train size:', train_dataset.__len__())
print('eval size:', eval_dataset.__len__())

# Check the shape of one sample and its label
for data, label in train_dataset:
    print(np.array(data).shape)
    print(np.array(label).shape)
    break
train size: 9275
eval size: 2323
(470,)
(1,)

4 Building the Fully Connected Network

4.1 Model Definition

import paddle
import paddle.nn as nn

# MLP module
class Mlp(nn.Layer):
    # self refers to the class instance itself
    def __init__(self):
        # Initialize the parent class
        super(Mlp, self).__init__()

        self.fc1 = paddle.nn.Linear(in_features=470, out_features=1024)
        self.bn1 = paddle.nn.BatchNorm1D(1024)
        self.relu1 = paddle.nn.ReLU()
        self.fc2 = paddle.nn.Linear(in_features=1024, out_features=512)
        self.bn2 = paddle.nn.BatchNorm1D(512)
        self.relu2 = paddle.nn.ReLU()
        self.fc3 = paddle.nn.Linear(in_features=512, out_features=64)
        self.bn3 = paddle.nn.BatchNorm1D(64)
        self.relu3 = paddle.nn.ReLU()
        # Note: the head outputs a single value, so the network regresses the
        # class index (1-5) rather than producing 5-way logits
        self.fc4 = paddle.nn.Linear(in_features=64, out_features=1)

    # Forward pass of the network
    def forward(self, inputs):
        x = self.fc1(inputs)
        x = self.bn1(x)
        x = self.relu1(x)
        x = self.fc2(x)
        x = self.bn2(x)
        x = self.relu2(x)
        x = self.fc3(x)
        x = self.bn3(x)
        x = self.relu3(x)
        x = self.fc4(x)

        return x
# Quick test
mlp = Mlp()

# Show the model structure
paddle.summary(mlp, (256, 470))
---------------------------------------------------------------------------
 Layer (type)       Input Shape          Output Shape         Param #    
===========================================================================
   Linear-77        [[256, 470]]         [256, 1024]          482,304    
BatchNorm1D-58     [[256, 1024]]         [256, 1024]           4,096     
    ReLU-58        [[256, 1024]]         [256, 1024]             0       
   Linear-78       [[256, 1024]]          [256, 512]          524,800    
BatchNorm1D-59      [[256, 512]]          [256, 512]           2,048     
    ReLU-59         [[256, 512]]          [256, 512]             0       
   Linear-79        [[256, 512]]          [256, 64]           32,832     
BatchNorm1D-60      [[256, 64]]           [256, 64]             256      
    ReLU-60         [[256, 64]]           [256, 64]              0       
   Linear-80        [[256, 64]]            [256, 1]             65       
===========================================================================
Total params: 1,046,401
Trainable params: 1,040,001
Non-trainable params: 6,400
---------------------------------------------------------------------------
Input size (MB): 0.46
Forward/backward pass size (MB): 9.38
Params size (MB): 3.99
Estimated Total Size (MB): 13.83
---------------------------------------------------------------------------

{'total_params': 1046401, 'trainable_params': 1040001}

4.2 Model Training

import paddle
from paddle.io import DataLoader

BATCH_SIZE = 256  # batch size

# Data loader for training
train_dataloader = DataLoader(
        dataset = train_dataset,
        batch_size = BATCH_SIZE,
        shuffle = True,
        drop_last = True
        )
# Data loader for evaluation
# Note: drop_last=True also discards the final partial batch here, so up to
# BATCH_SIZE-1 eval samples are skipped; drop_last=False would score them all
eval_dataloader = DataLoader(
        dataset = eval_dataset,
        batch_size = BATCH_SIZE,
        shuffle = False,
        drop_last = True
        )
import os

EPOCH_NUM = 1500   # number of epochs
save_dir = 'output_model/'

# Instantiate the model
model = Mlp()

learning_rate = 5e-3
opt = paddle.optimizer.SGD(learning_rate=learning_rate, parameters=model.parameters())

def train_mlp():
    train_loss = []
    val_loss = []
    # Outer loop over epochs
    for epoch_id in range(EPOCH_NUM):
        # Switch to training mode
        model.train()
        for iter_id, mini_batch in enumerate(train_dataloader()):
            # Clear gradients before the next backward pass
            opt.clear_grad()

            x = np.array(mini_batch[0])
            y = np.array(mini_batch[1])
            # Convert numpy data into Paddle dynamic-graph tensors
            features = paddle.to_tensor(x, dtype='float32')
            y = paddle.to_tensor(y, dtype='float32')
            # Forward pass
            predicts = model(features)
            # L1 loss between the regressed value and the class index
            loss = paddle.nn.functional.l1_loss(predicts, label=y)
            avg_loss = paddle.mean(loss)
            train_loss.append(avg_loss.numpy())
            # Backward pass: compute gradients for every layer
            avg_loss.backward()
            # Update parameters, one step at the configured learning rate
            opt.step()

        # Switch to evaluation mode
        model.eval()
        for iter_id, mini_batch in enumerate(eval_dataloader()):

            x = np.array(mini_batch[0])
            y = np.array(mini_batch[1])

            features = paddle.to_tensor(x, dtype='float32')
            y = paddle.to_tensor(y, dtype='float32')

            predicts = model(features)

            loss = paddle.nn.functional.l1_loss(predicts, label=y)
            avg_loss = paddle.mean(loss)
            val_loss.append(avg_loss.numpy())

        # Note: train_loss/val_loss accumulate across epochs, so these means are
        # running averages over the whole history, not per-epoch averages
        print(f'Epoch {epoch_id}, train loss is  {np.mean(train_loss)}, val loss is {np.mean(val_loss)}')
        if (epoch_id % 5 == 0):
            if not os.path.isdir(save_dir):
                os.makedirs(save_dir)
            save_path = os.path.join(save_dir, '{0}'.format(epoch_id))
            paddle.save(model.state_dict(), save_path + '.pdparams')
            paddle.save(opt.state_dict(), save_path + '.pdopt')

    return train_loss, val_loss
tloss, vloss = train_mlp()
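Since tloss holds one value per iteration accumulated over all epochs, per-epoch curves can be recovered by chunking; a small sketch, relying on drop_last=True giving every epoch the same number of iterations:

iters_per_epoch = len(train_dataset) // BATCH_SIZE  # fixed thanks to drop_last=True
tloss_per_epoch = np.array(tloss).reshape(-1, iters_per_epoch).mean(axis=1)
print(tloss_per_epoch[:5])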

4.3 Visualizing the Training Log

import matplotlib.pyplot as plt
%matplotlib inline

x = [i for i in range(len(tloss))]
y = [i for i in range(len(vloss))]

def draw(x,tloss,label):
    plt.title(label,fontsize = 24)
    plt.xlabel('iter_id',fontsize = 20)
    plt.ylabel('loss',fontsize = 20)
    plt.plot(x,tloss,color = 'green',label = label)
    plt.legend()
    plt.grid()
    plt.show()

draw(x,tloss,'train loss')

draw(y,vloss,'val loss')

[Figure: train loss curve]

[Figure: val loss curve]

5 Saving the Model and a Quick Check

5.1 Saving the Model

paddle.save(model.state_dict(), 'out/final.pdparams')  # save the model

5.2 Validating the Model

The loss curves above are already a decent validation of the result; here we run one concrete sample through the model to see the effect.

print(val_data[10])
print(val_label[10][0])
[0, 0, 0, 3, 0, 9, 2, 0, 3, 0, 3, 0, 0, 0, 3, 0, 8, 0, 6, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 9, 6, 0, 0, 6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 20, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 3, 0, 0, 109, 9, 19, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 31, 0, 0, 1, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 2, 0, 8, 0, 1, 0, 0, 1184, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 12, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 27, 0, 30, 0, 0, 0, 0, 3, 23, 9, 46, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 103, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 4, 47, 3, 0, 0, 35, 0, 183, 0, 0, 0, 5, 11, 0, 0, 0, 0, 0, 0, 0, 0, 0, 25, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 91, 1, 0, 0, 0, 0, 0, 0, 0, 0, 24, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 6, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 10, 1, 0, 1, 0, 0, 0, 0, 0, 0, 29, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 65, 3]
2
label_txt = ['', 'Adware', 'Banking', 'SMS malware', 'Riskware', 'Benign']

model = Mlp()

para_dict = paddle.load('out/final.pdparams')
model.load_dict(para_dict)

model.eval()
x = paddle.reshape(paddle.to_tensor(val_data[10], dtype='float32'), [-1, 470])
predicts = model(x)

# The network regresses the class index, so round to the nearest integer class
pre = paddle.round(predicts)

print("Actual label:", label_txt[val_label[10][0]])
print("Predicted label:", label_txt[int(pre[0])])
Actual label: Banking
Predicted label: Banking
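A single sample is not much evidence, so here is a hedged sketch that scores the whole validation set under the same regress-and-round scheme (roundings outside 1-5 simply count as errors):

# Push all validation samples through the network in one batch
x_val = paddle.to_tensor(np.array(val_data, dtype='float32'))
pred_val = paddle.round(model(x_val)).numpy().reshape(-1)
true_val = np.array(val_label).reshape(-1)

print('validation accuracy:', (pred_val == true_val).mean())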

Project Summary

This project tackled the malware classification problem with a fully connected network.

Of course, I will keep improving this project, because the current implementation has an obvious weak point: the class is predicted by regressing a single value and rounding it, much like rank-prediction tasks, for example the recent battle-royale placement prediction on the platform, which is also handled as regression.

Strictly speaking, that is not quite proper for classification; a future version will switch to word-vector inputs and bring NLP in properly. A sketch of the classification-style change follows below.
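For reference, a hedged sketch of what the classification variant would look like: only the output head and the loss change (5-way logits with cross-entropy instead of one regressed value). This is not the code used above, just the likely shape of the next version:

import paddle
import paddle.nn as nn
import paddle.nn.functional as F

class MlpClassifier(nn.Layer):
    def __init__(self, num_classes=5):
        super(MlpClassifier, self).__init__()
        self.fc1 = nn.Linear(470, 1024)
        self.bn1 = nn.BatchNorm1D(1024)
        self.fc2 = nn.Linear(1024, 512)
        self.bn2 = nn.BatchNorm1D(512)
        self.fc3 = nn.Linear(512, 64)
        self.bn3 = nn.BatchNorm1D(64)
        # 5-way logits instead of a single regressed class index
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):
        x = F.relu(self.bn1(self.fc1(x)))
        x = F.relu(self.bn2(self.fc2(x)))
        x = F.relu(self.bn3(self.fc3(x)))
        return self.head(x)

# Training would then use cross-entropy with 0-based labels (shift labels 1-5 down by one):
# loss = F.cross_entropy(logits, labels - 1)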

Personal Note

Same handle across the web:

iterhui

I have reached the top level on AI Studio and lit up 10 badges; come follow me!

https://aistudio.baidu.com/aistudio/personalcenter/thirdview/643467


Reposted from: https://aistudio.baidu.com/aistudio/projectdetail/4416008?contributionType=1
