
1. A brief introduction to search-by-image

The formal name for "search by image" is "similar image search engine", also known as a "reverse image search engine".

The earliest image search engines retrieved images by text keywords, so they were essentially still text search engines.

In 1992, T. Kato proposed the concept of Content-Based Image Retrieval (CBIR), which builds an index from image features such as color and shape to enable retrieval — what we usually call "search by image". Based on this concept, IBM developed the first commercial CBIR system, QBIC (Query By Image Content): the user supplies a sketch or an image, and the system retrieves similar images.

As deep learning based on convolutional neural networks spread through computer vision, researchers found that neural networks extract image features very well, greatly improving the accuracy of image representations. Many mainstream image search engines have since adopted deep learning to improve retrieval accuracy.

Today, general-purpose search engines such as Baidu and Sogou all offer similar-image retrieval. E-commerce platforms such as Taobao and JD have likewise used their huge product-image libraries to build vertical image retrieval features, serving shoppers whose needs are hard to put into words. Although image retrieval is already widely used in daily life, the technology is not yet fully mature; many problems remain, and there is still plenty of room for improvement.

In addition, higher-level applications such as object tracking and localization rely on image similarity measurement as theoretical support. Image similarity remains an active research area.

2. The AutoEncoder

An AutoEncoder (AE) is an unsupervised learning algorithm used mainly for dimensionality reduction and feature extraction.

Hinton, one of the three giants of deep learning, published "Reducing the Dimensionality of Data with Neural Networks" in Science in 2006, proposing the Deep Auto-Encoder architecture, which works as follows:

The original input X passes through several DNN layers and is encoded into a Code. Its dimensionality is usually much lower than the input's, so it serves as a feature extractor.

The Code then passes through several DNN layers and is decoded into the output X'.

The Encoder and Decoder are trained so that X' matches X as closely as possible.
The layer parameters of the Encoder and Decoder can be made transposes of each other, halving the parameter count and helping prevent overfitting.
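In symbols (a minimal formalization; the notation is mine, not from the paper), training minimizes the reconstruction error, optionally with tied weights:

$$\min_{\theta_{enc},\,\theta_{dec}}\ \frac{1}{N}\sum_{i=1}^{N}\left\|x_i - f_{dec}\!\left(f_{enc}(x_i)\right)\right\|^2,\qquad W_{dec}^{(l)}=\left(W_{enc}^{(l)}\right)^{\top}\ \text{(optional tying)}$$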

In practice, the Encoder and Decoder architectures can be chosen freely, as long as the two are trained jointly. The idea is similar in spirit to PCA; the difference is that PCA uses a linear model rather than a multi-layer DNN, so both its feature extraction and its sample reconstruction are weaker.

Experiments show that a Deep Auto-Encoder reconstructs samples markedly better than linear PCA, mainly because its multi-layer DNN Encoder and Decoder have far greater encoding and decoding capacity than a linear model.
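For context, PCA is exactly the linear special case of this setup — a one-layer autoencoder with tied, orthonormal weights (a standard result, stated here only for comparison):

$$\min_{W\in\mathbb{R}^{d\times k},\ W^{\top}W=I}\ \frac{1}{N}\sum_{i=1}^{N}\left\|x_i - WW^{\top}x_i\right\|^2$$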

Below we use the Paddle framework to build a CNN-based AutoEncoder and verify that it can extract image features and support an image-recommendation application based on feature similarity.

3. Dataset

This project uses images gathered from the internet by category search, split into two directories: database and query.
After building the training pipeline and training the model on the images in database, we pick images from the query directory as search input, retrieve the images most similar to the input from database, and visualize the results to check the model.

The images are mounted under the data directory; unzip them to the target location and they are ready to use.

3.1 Unzip and inspect the dataset

!unzip -q /home/aistudio/data/data218271/image_CBIR.zip -d /home/aistudio/work
!tree -L 2 /home/aistudio/work
/home/aistudio/work
└── image_CBIR
    ├── database
    └── query

3 directories, 0 files
# Import utilities
import os
import random
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import cv2
%matplotlib inline
# Randomly pick a few images to inspect
database_img_root = '/home/aistudio/work/image_CBIR/database'
query_img_root = '/home/aistudio/work/image_CBIR/query'
img_list = os.listdir(database_img_root)
img_select = np.random.choice(img_list, 10)

plt.figure(figsize=(15,20))
for index, img_name in enumerate(img_select):
    img_path = os.path.join(database_img_root, img_name)
    img_data = cv2.imread(img_path)
    plt.subplot(1, 10, index+1)
    plt.imshow(img_data[:, :, ::-1])  # BGR -> RGB for display
    plt.title(img_name, fontsize=6)
    plt.axis("off");

[Figure: 10 randomly selected database images]

3.2 Building the training dataset

For a custom dataset, Paddle's Dataset class makes construction quick and convenient. Generally, only the following steps are needed:

  • Step 1: subclass paddle.io.Dataset
  • Step 2: implement the __init__ function to initialize the dataset and map samples and labels into a list
  • Step 3: implement the __getitem__ function, which defines how to fetch data for a given index and returns a single sample (the data and its label)
  • Step 4: implement the __len__ function, which returns the total number of samples

The four steps below implement this project's Dataset.

# Write all image names into label.txt, used to pull data during training
root_dir = "/home/aistudio/work/image_CBIR/database"
img_list = os.listdir(root_dir)

# Open once in write mode so re-running the cell does not append duplicates
with open("/home/aistudio/work/image_CBIR/label.txt", mode='w', encoding="utf-8") as f:
    for img_ in img_list:
        f.write(img_ + "\t" + img_ + "\n")
from paddle.io import Dataset
img_dir = "/home/aistudio/work/image_CBIR/database"
label_path = "/home/aistudio/work/image_CBIR/label.txt"

class MyDataset(Dataset):
    # Implement __init__: initialize the dataset and map samples and labels into a list
    def __init__(self, data_dir, label_path, transform=None):
        super().__init__()
        self.data_list = []
        with open(label_path, encoding='utf-8') as f:
            for line in f.readlines():
                image_path, label = line.strip().split('\t')
                image_path = os.path.join(data_dir, image_path)
                self.data_list.append([image_path, label])
        # Store the preprocessing pipeline as an attribute of the dataset
        self.transform = transform

    # Implement __getitem__: define how to fetch a single sample (data, label) for a given index
    def __getitem__(self, index):
        # Fetch one image according to the index
        image_path, label = self.data_list[index]
        image = cv2.imread(image_path)

        # Paddle trains in float32 by default, so convert the image to float32
        image = image.astype('float32')
        # Apply the preprocessing pipeline
        if self.transform is not None:
            image = self.transform(image)
        # In this project the decoder must reconstruct the original image from the
        # extracted features, so the label is the (transformed) input image itself
        label = image

        # Return the image and its label
        return image, label

    # Implement __len__: return the total number of samples
    def __len__(self):
        return len(self.data_list)
# Image preprocessing: typically normalize, resize, crop, etc.
from paddle.vision.transforms import Compose, Resize, Normalize, Transpose

# Normalize to [-1, 1], resize, and transpose HWC -> CHW before feeding the network
normalize = Normalize(mean=[127.5], std=[127.5], data_format='HWC')
resize = Resize(size=(512, 512))
transpose = Transpose()
transform = Compose([normalize, resize, transpose])

# Instantiate the dataset
train_custom_dataset = MyDataset(img_dir, label_path, transform)
print('train_custom_dataset images: ', len(train_custom_dataset))
train_custom_dataset images:  65
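As a quick sanity check (illustrative, not part of the original project), a single transformed sample should come out as a CHW float32 array with values within [-1, 1]:

# Illustrative check on one transformed sample
sample, label = train_custom_dataset[0]
print(sample.shape, sample.dtype)                # (3, 512, 512) float32
print(float(sample.min()), float(sample.max()))  # within [-1, 1]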

4. Building the AutoEncoder model with Paddle

4.1 Building the network

In deep learning, a neural network learns the underlying patterns in the input samples and outputs predictions. A key part of building a model is defining the various network layers and composing them.

In Paddle, networks are usually composed from paddle.nn.Layer classes. In general, building a network takes three steps:

  • Create a class that inherits from paddle.nn.Layer
  • Define the network layers used by the model in the class constructor __init__
  • Perform the forward computation with the defined layers in the forward function

Paddle ships with a rich set of built-in network layers, exposed as classes: when building a model, you add them directly as members of your subclass, set the necessary parameters, and define the forward function. Backpropagation and parameter saving are handled automatically by the framework.

The AutoEncoder network needs the following layers:

  • Conv2D: 2-D convolution layer; used in this project's Encoder to convolve the input image and extract features
  • MaxPool2D: max-pooling layer, mainly used to shrink the feature maps
  • Conv2DTranspose: 2-D transposed convolution layer; used in this project's Decoder to upsample the image

Other commonly used layers include ReLU, Linear, etc.; see the documentation for details.

import paddle
from paddle import nn
import paddle.nn.functional as F
from paddle.nn import Conv2D
from paddle.nn import MaxPool2D
from paddle.nn import BatchNorm
from paddle.nn import UpsamplingBilinear2D
from paddle.nn import Conv2DTranspose
# Create a class that inherits from paddle.nn.Layer
class myAutoEncoder(nn.Layer):
    def __init__(self):
        super().__init__()
        # Conv layers with kernel_size=3; max-pooling with kernel_size=2; stride and padding as annotated
        self.conv1 = paddle.nn.Conv2D(in_channels=3,out_channels=64,kernel_size=3,stride=1,padding='SAME')    
        self.conv2 = paddle.nn.Conv2D(in_channels=64,out_channels=64,kernel_size=3,stride=1,padding='SAME')
        self.max_pool1 = paddle.nn.MaxPool2D(kernel_size=2, stride=2,padding='SAME')

        self.conv3 = paddle.nn.Conv2D(in_channels=64,out_channels=128,kernel_size=3,stride=1,padding='SAME')
        self.conv4 = paddle.nn.Conv2D(in_channels=128,out_channels=128,kernel_size=3,stride=1,padding='SAME')
        self.max_pool2 = paddle.nn.MaxPool2D(kernel_size=2, stride=2,padding='SAME')

        self.conv5 = paddle.nn.Conv2D(in_channels=128,out_channels=256,kernel_size=3,stride=1,padding='SAME')
        self.conv6 = paddle.nn.Conv2D(in_channels=256,out_channels=256,kernel_size=3,stride=1,padding='SAME')
        self.conv7 = paddle.nn.Conv2D(in_channels=256,out_channels=256,kernel_size=3,stride=1,padding='SAME')
        self.max_pool3 = paddle.nn.MaxPool2D(kernel_size=2, stride=2,padding='SAME')

        self.act1, self.act2, self.act3,self.act4, self.act5, self.act6,self.act7 = nn.ReLU(),nn.ReLU(),nn.ReLU(),nn.ReLU(),nn.ReLU(),nn.ReLU(),nn.ReLU()

        # Keep the Encoder as a separate module for later feature extraction and similarity computation
        self.encoder = nn.Sequential(
            self.conv1,self.act1,self.conv2,self.act2,self.max_pool1, 
            self.conv3,self.act3,self.conv4,self.act4,self.max_pool2,
            self.conv5,self.act5,self.conv6,self.act6,self.conv7,self.act7,self.max_pool3,
        )


        self.conv2Dtrans1 = Conv2DTranspose(256,128,(3,3),stride=2,padding='SAME')
        self.conv2Dtrans2 = Conv2DTranspose(128,64,(3,3),stride=2,padding='SAME')
        self.conv2Dtrans3 = Conv2DTranspose(64,3,(3,3),stride=2,padding='SAME')

        self.decoder = nn.Sequential(
            self.conv2Dtrans1,
            self.conv2Dtrans2,
            self.conv2Dtrans3,
        )

    # Used for image feature extraction
    def predict(self, inputs):
        return self.encoder(inputs)

    # The forward pass chains the encoder and decoder
    def forward(self, inputs):
        x = self.encoder(inputs)
        x = self.decoder(x)
        
        return x
myNet1 = myAutoEncoder()
paddle.summary(myNet1, (1,3,448,448))
W0517 10:40:55.996636   176 gpu_resources.cc:61] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 11.2
W0517 10:40:56.002281   176 gpu_resources.cc:91] device: 0, cuDNN Version: 8.2.


-----------------------------------------------------------------------------
  Layer (type)        Input Shape          Output Shape         Param #    
=============================================================================
    Conv2D-1       [[1, 3, 448, 448]]   [1, 64, 448, 448]        1,792     
     ReLU-1       [[1, 64, 448, 448]]   [1, 64, 448, 448]          0       
    Conv2D-2      [[1, 64, 448, 448]]   [1, 64, 448, 448]       36,928     
     ReLU-2       [[1, 64, 448, 448]]   [1, 64, 448, 448]          0       
   MaxPool2D-1    [[1, 64, 448, 448]]   [1, 64, 224, 224]          0       
    Conv2D-3      [[1, 64, 224, 224]]   [1, 128, 224, 224]      73,856     
     ReLU-3       [[1, 128, 224, 224]]  [1, 128, 224, 224]         0       
    Conv2D-4      [[1, 128, 224, 224]]  [1, 128, 224, 224]      147,584    
     ReLU-4       [[1, 128, 224, 224]]  [1, 128, 224, 224]         0       
   MaxPool2D-2    [[1, 128, 224, 224]]  [1, 128, 112, 112]         0       
    Conv2D-5      [[1, 128, 112, 112]]  [1, 256, 112, 112]      295,168    
     ReLU-5       [[1, 256, 112, 112]]  [1, 256, 112, 112]         0       
    Conv2D-6      [[1, 256, 112, 112]]  [1, 256, 112, 112]      590,080    
     ReLU-6       [[1, 256, 112, 112]]  [1, 256, 112, 112]         0       
    Conv2D-7      [[1, 256, 112, 112]]  [1, 256, 112, 112]      590,080    
     ReLU-7       [[1, 256, 112, 112]]  [1, 256, 112, 112]         0       
   MaxPool2D-3    [[1, 256, 112, 112]]   [1, 256, 56, 56]          0       
Conv2DTranspose-1  [[1, 256, 56, 56]]   [1, 128, 112, 112]      295,040    
Conv2DTranspose-2 [[1, 128, 112, 112]]  [1, 64, 224, 224]       73,792     
Conv2DTranspose-3 [[1, 64, 224, 224]]    [1, 3, 448, 448]        1,731     
=============================================================================
Total params: 2,106,051
Trainable params: 2,106,051
Non-trainable params: 0
-----------------------------------------------------------------------------
Input size (MB): 2.30
Forward/backward pass size (MB): 819.22
Params size (MB): 8.03
Estimated Total Size (MB): 829.55
-----------------------------------------------------------------------------

{'total_params': 2106051, 'trainable_params': 2106051}
  • When Conv2DTranspose is used for upsampling, the stride determines the upsampling factor, and padding="SAME" keeps the spatial size otherwise unchanged
  • The network summary above shows that the Encoder turns a 3*448*448 image into a 256*56*56 feature map through convolution and pooling, i.e. an 8x downsampling
  • The Decoder's job is to restore the original image: three transposed convolutions bring it back to its original size of 3*448*448
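As a minimal check of the first point (reusing the same layer configuration as above), a stride-2 transposed convolution with padding='SAME' doubles the spatial size:

import paddle
from paddle.nn import Conv2DTranspose

# stride=2 with padding='SAME' doubles the spatial dimensions: 56 -> 112
x = paddle.randn([1, 256, 56, 56])
up = Conv2DTranspose(256, 128, (3, 3), stride=2, padding='SAME')
print(up(x).shape)  # [1, 128, 112, 112]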

4.2 Defining the optimizer and loss function, and training

# Hyperparameters
epoch_num = 200
batch_size = 13
learning_rate = 0.0001
print('Training started...')
# Instantiate the model
model = myAutoEncoder()
# Switch the model to training mode
model.train()
# Set up the optimizer with the learning rate and hand it the model parameters
opt = paddle.optimizer.Adam(learning_rate=learning_rate, parameters=model.parameters())
# Set up the loss function
mae_loss = paddle.nn.loss.L1Loss(reduction='mean')

# Set up the data reader
data_reader = paddle.io.DataLoader(train_custom_dataset,
                    places=[paddle.CPUPlace()],
                    batch_size=batch_size,
                    shuffle=True,
                    drop_last=True,
                    num_workers=0)

history_loss = []
iter_epoch = []

for epoch in range(epoch_num):
    for batch_id, data in enumerate(data_reader()):
        x = data[0]
        out = model(x)
        avg_loss = mae_loss(out, x)
        opt.clear_grad()
        avg_loss.backward()
        opt.step()
        if epoch % 5 == 0 and batch_id % 4 == 0:  # log every 5 epochs
            print(f"epoch: {epoch}, batch: {batch_id}, loss: {avg_loss}")
    iter_epoch.append(epoch)
    history_loss.append(avg_loss.numpy()[0])
Training started...
epoch: 0, batch: 0, loss: Tensor(shape=[1], dtype=float32, place=Place(gpu:0), stop_gradient=False,
       [0.54580641])
epoch: 0, batch: 4, loss: Tensor(shape=[1], dtype=float32, place=Place(gpu:0), stop_gradient=False,
       [0.51162779])
epoch: 5, batch: 0, loss: Tensor(shape=[1], dtype=float32, place=Place(gpu:0), stop_gradient=False,
       [0.27193627])
epoch: 5, batch: 4, loss: Tensor(shape=[1], dtype=float32, place=Place(gpu:0), stop_gradient=False,
       [0.23164687])
... (log lines for epochs 10 through 185 omitted) ...
epoch: 190, batch: 0, loss: Tensor(shape=[1], dtype=float32, place=Place(gpu:0), stop_gradient=False,
       [0.06369766])
epoch: 190, batch: 4, loss: Tensor(shape=[1], dtype=float32, place=Place(gpu:0), stop_gradient=False,
       [0.06765705])
epoch: 195, batch: 0, loss: Tensor(shape=[1], dtype=float32, place=Place(gpu:0), stop_gradient=False,
       [0.05700744])
epoch: 195, batch: 4, loss: Tensor(shape=[1], dtype=float32, place=Place(gpu:0), stop_gradient=False,
       [0.05723191])

4.3 Visualizing the training results

Visualize the training process and the trained model's reconstructions.

# Plot the loss curve
plt.plot(iter_epoch, history_loss, label='MAE loss')
plt.legend()
plt.xlabel('epoch')
plt.ylabel('MAE Loss')
plt.show()

[Figure: MAE training loss vs. epoch]

After 200 epochs, the loss has converged to around 0.05.

# imgs_train: original database images, in OpenCV (BGR) format
# imgs_query: query images, in OpenCV (BGR) format

def show_trainimg(index):
    img_raw = imgs_train[index]
    img_decoder = E_train_result_img[index]

    plt.figure(figsize=(10, 8))
    plt.subplot(121)
    plt.imshow(img_raw[:, :, ::-1])
    plt.title("raw image")

    plt.subplot(122)
    plt.imshow(img_decoder[:, :, ::-1])
    plt.title("decoder image")

5. Using KNN on feature vectors for similarity-based recommendation

import sklearn
from sklearn.neighbors import NearestNeighbors

During training, the data was fed to the network in batches. To compute image-feature similarity, we instead stack all images into one large tensor and feed it through the network for feature extraction.

(Alternatively, a Paddle DataLoader could likewise be used to assemble and feed the data, as in training.)

5.1 Extracting and vectorizing image features

# Read all images
train_img_root = "/home/aistudio/work/image_CBIR/database"
query_img_root = "/home/aistudio/work/image_CBIR/query"
train_img_list = os.listdir(train_img_root)
query_img_list = os.listdir(query_img_root)

def read_img(filePath):
    return cv2.imread(filePath)

# Read all images of the given format in a directory
def read_imgs_dir(dirPath):
    path_list = [os.path.join(dirPath, filename) for filename in os.listdir(dirPath) if filename.endswith(".jpg")]
    imgs = [read_img(path) for path in path_list]
    print(f"Read {len(imgs)} images!")
    return imgs

print("Reading train images from '{}'...".format(train_img_root))
imgs_train = read_imgs_dir(train_img_root)

print("Reading query images from '{}'...".format(query_img_root))
imgs_query = read_imgs_dir(query_img_root)
Reading train images from '/home/aistudio/work/image_CBIR/database'...
Read 65 images!
Reading query images from '/home/aistudio/work/image_CBIR/query'...
Read 5 images!
# Dataset transformer
def apply_transformer(imgs, transformer):
    imgs_transform = [transformer(img) for img in imgs]
    return imgs_transform

# Normalize to [-1, 1]
def normalize_img(img):
    return (img - 127.5) / 127.5

# Resize all images to a common size using bilinear interpolation
def resize_img(img, shape_resized):
    img_resized = cv2.resize(img, shape_resized, interpolation=cv2.INTER_LINEAR)
    assert img_resized.shape[:2] == shape_resized, f"img_resized = {img_resized.shape}"
    return img_resized

# Flatten image: vectorize a feature map, C-style (row-major)
def flatten_img(img):
    return img.flatten("C")

class ImageTransformer(object):
    def __init__(self, shape_resize):
        self.shape_resize = shape_resize

    def __call__(self, img):
        img_transformed = resize_img(img, self.shape_resize)
        img_transformed = normalize_img(img_transformed)
        return img_transformed

img_resize = (448, 448)
transformer = ImageTransformer(img_resize)
imgs_train_transformed = apply_transformer(imgs_train, transformer)
print(f"After transform train image shape is {imgs_train_transformed[0].shape}")

imgs_query_transformed = apply_transformer(imgs_query, transformer)
print(f"After transform query image shape is {imgs_query_transformed[0].shape}")
After transform train image shape is (448, 448, 3)
After transform query image shape is (448, 448, 3)
X_train = paddle.to_tensor(np.array(imgs_train_transformed), dtype="float32")
X_query = paddle.to_tensor(np.array(imgs_query_transformed),dtype="float32")
X_train.shape, X_query.shape
([65, 448, 448, 3], [5, 448, 448, 3])
# Convert images to NCHW format, as Paddle expects
X_train = paddle.transpose(X_train, [0,3,1,2])
X_query = paddle.transpose(X_query, [0,3,1,2])
print(" After Transpose X_train.shape = {}".format(X_train.shape))
print(" After Transpose X_query.shape = {}".format(X_query.shape))
 After Transpose X_train.shape = [65, 3, 448, 448]
 After Transpose X_query.shape = [5, 3, 448, 448]
# Use the trained model's predict method to vectorize the images.
# Vectorize the query images as well, for computing similarity (distance)
# against the database images.

print("Inferencing embeddings using pre-trained model...")

output_shape_model = (256, 56, 56)  # matches a 448*448 input
feature_train = model.predict(X_train)
E_train_flatten = feature_train.reshape((-1, np.prod(output_shape_model)))

feature_query = model.predict(X_query)
E_query_flatten = feature_query.reshape((-1, np.prod(output_shape_model)))
print(" -> feature_train.shape = {}".format(feature_train.shape))
print(" -> E_train_flatten.shape = {}".format(E_train_flatten.shape))

print(" -> feature_query.shape = {}".format(feature_query.shape))
print(" -> E_query_flatten.shape = {}".format(E_query_flatten.shape))
Inferencing embeddings using pre-trained model...
 -> feature_train.shape = [65, 256, 56, 56]
 -> E_train_flatten.shape = [65, 802816]
 -> feature_query.shape = [5, 256, 56, 56]
 -> E_query_flatten.shape = [5, 802816]

As the results above show, the point is to use the network to extract image features and flatten them into one-dimensional vectors for similarity computation.
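The KNN model below uses cosine distance. As an illustrative sketch (not from the original), this is what metric="cosine" computes between two flattened embeddings:

import numpy as np

# Cosine distance, as used by NearestNeighbors(metric="cosine"):
# d(a, b) = 1 - (a . b) / (||a|| * ||b||)
def cosine_distance(a, b):
    a = np.asarray(a, dtype="float32").ravel()
    b = np.asarray(b, dtype="float32").ravel()
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))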

print("Fitting k-nearest-neighbour model on training images...")
knn = NearestNeighbors(n_neighbors=5, metric="cosine")
knn.fit(E_train_flatten)
Fitting k-nearest-neighbour model on training images...
NearestNeighbors(metric='cosine')

5.2 Visualizing the search results

# The query image
query_index = 2
plt.imshow(imgs_query[query_index][:,:,::-1])
<matplotlib.image.AxesImage at 0x7f22468ef4d0>

[Figure: the query image]

# Search results
res = knn.kneighbors([E_query_flatten[query_index]]) 
res
(array([[0.13137984, 0.15384907, 0.17141944, 0.17830563, 0.18301469]],
       dtype=float32),
 array([[10, 15, 33,  7, 11]]))

The search result is a tuple:

  • tuple[0]: the top-k cosine distances
  • tuple[1]: the indices of the matched images in the training set; with these indices we can fetch and visualize the original images
imgs_retrieval = [imgs_train[idx] for idx in res[1].flatten()]
plt.figure(figsize=(15, 8))
for i, img in enumerate(imgs_retrieval):
    plt.subplot(1, 5, i+1)
    img_idx = str(res[1].flatten()[i])
    plt.imshow(img[:, :, ::-1])
    plt.title(f"img_{img_idx}", fontsize=15)

[Figure: top-5 retrieval results for the query image]

Comparing the query image with the retrieved images, the results are indeed very similar to the input, which confirms that the method works.

5.3 Batch search and display

# Plot image
def plot_img(img, value_range=(0, 255)):
    plt.imshow(img, vmin=value_range[0], vmax=value_range[1])
    plt.xlabel("xpixels")
    plt.ylabel("ypixels")
    plt.tight_layout()
    plt.show()

# Plots images in 2 rows: top row is input img as query, bottom row is the search result
def plot_query_retrieval(img_query, imgs_retrieval):
    n_retrieval = len(imgs_retrieval)
    fig = plt.figure(figsize=(2*n_retrieval, 4))
    fig.suptitle("Search Result (k={})".format(n_retrieval), fontsize=13)

    # Plot query image
    ax = plt.subplot(2, n_retrieval, 0 + 1)
    plt.imshow(img_query[:,:,::-1])
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    for axis in ['top', 'bottom', 'left', 'right']:
        ax.spines[axis].set_linewidth(3)  # increase border thickness
        ax.spines[axis].set_color('black')  # set to black
    ax.set_title("Input: query image",  fontsize=8)  # set subplot title

    # Plot retrieval images
    for i, img in enumerate(imgs_retrieval):
        ax = plt.subplot(2, n_retrieval, n_retrieval + i + 1)
        plt.imshow(img[:,:,::-1])
        ax.get_xaxis().set_visible(False)
        ax.get_yaxis().set_visible(False)
        for axis in ['top', 'bottom', 'left', 'right']:
            ax.spines[axis].set_linewidth(2)  # set border thickness
            ax.spines[axis].set_color('black')  # set to black
        ax.set_title(f" Retrieval result, rank {i+1}", fontsize=8)  # set subplot title

    plt.show()
# Run search-based recommendation for every query image
print("Performing image retrieval on query images...")
print("type of E_query_flatten is: ", type(E_query_flatten))
for i, emb_flatten in enumerate(E_query_flatten):
    _, indices = knn.kneighbors([emb_flatten])  # find k nearest train neighbours
    img_query = imgs_query[i]  # query image
    imgs_retrieval = [imgs_train[idx] for idx in indices.flatten()]  # retrieval images
    plot_query_retrieval(img_query, imgs_retrieval)
Performing image retrieval on query images...
type of E_query_flatten is:  <class 'paddle.Tensor'>


[Figures: retrieval results for each of the five query images]

The search works well for some categories, while for others there is room for improvement, which could be addressed through deeper networks and better loss design.

6. Summary

This project presented a content-based image search and recommendation approach that uses a convolutional neural network to extract image features, implemented as a simple network built with the PaddlePaddle framework.

Experiments confirm that the method works, but there is still room for improvement:

  1. Network design: this project mimics VGG with a seven-convolution-layer network. In practice, a deeper network could be used, or one of the image classification networks already implemented in Paddle could be called directly with its classification head removed (see the sketch after this list).

  2. Loss: this project bluntly uses MAE to measure the pixel-wise difference between the original image and the decoded reconstruction, which is sensitive to the same foreground object appearing in different regions of an image. A better-designed loss could avoid this.
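A minimal sketch of the first suggestion, assuming the ResNet50 from paddle.vision and its fc attribute (hypothetical here, not part of the original project):

import paddle
from paddle.vision.models import resnet50

# Pretrained backbone with the classification head replaced by Identity,
# so it outputs a global feature vector instead of class logits.
backbone = resnet50(pretrained=True)
backbone.fc = paddle.nn.Identity()
backbone.eval()

x = paddle.randn([1, 3, 224, 224])
feat = backbone(x)
print(feat.shape)  # [1, 2048], usable for cosine-distance retrieval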

About the author:

  • wolfmax老狼, PPDE, member of AICA cohort 6
  • Image algorithm engineer at a semiconductor CIM software integrator, mainly working on image detection and segmentation algorithms
  • Diamond level on AI Studio with 7 badges; follow me: https://aistudio.baidu.com/aistudio/personalcenter/thirdview/801106
