Introduction to the Image Matching Challenge, with a SuperGlue vs. SIFT feature-matching accuracy comparison
This project introduces the data, the task, and the evaluation protocol of the Image Matching Challenge 2022, then runs experiments with SuperGlue and SIFT and compares their accuracy: SuperGlue 45.75% vs. SIFT 8.49%.
Note: AI Studio has relatively little content on image feature matching, even though it matters a great deal for computer vision. This project therefore walks through the Image Matching Challenge 2022, including its background, data, and accuracy evaluation, and runs experiments with both classical SIFT matching and deep-learning-based SuperGlue matching, validating and comparing accuracy on a small part of the training set, since the test set does not provide ground truth.
I. Project Background
Reference: https://www.kaggle.com/competitions/image-matching-challenge-2022/overview/description
- For most of us, the most convenient camera is the phone we carry everywhere. Standing in front of an iconic landmark such as the Trevi Fountain in Rome, we can take a photo and share it with friends. But a photo is still two-dimensional: it only captures the viewpoint from the spot where it was taken. And of course, plenty of other people have photographed that fountain, as shown below.
![](https://ai-studio-static-online.cdn.bcebos.com/fdb9ed64fd6d4539aab6e3b3b1289fd6eca11f9622b0430a9eea80062cef310c)
- In my field, surveying and mapping, there is a saying that "everyone is a sensor". What if, with machine learning, we could use the vast amount of unstructured imagery freely available on the internet to recover 3D structure and better capture the richness of the world? It would greatly reduce costs and speed up how quickly data can be updated.
![](https://ai-studio-static-online.cdn.bcebos.com/7863f090916d47f0bf6d7b2098be87aa0c707d00a0454e0d8017331ea106825e)
- The process of reconstructing 3D objects and buildings from images is called Structure from Motion (SfM). Such imagery is usually captured by skilled operators under controlled conditions to ensure uniform, high-quality data. Building 3D models from internet photos is much harder, given the wide variety of viewpoints, lighting and weather conditions, occlusions by people and vehicles, and even the filters users apply.
- The first part of the SfM problem is identifying which parts of two images show the same points of the same scene, for example the corner of a window that appears in both photos. This is usually done with local features: reliable key locations that can be recognized across different views. A local feature carries a short descriptor vector that captures the appearance around the interest point. By comparing these descriptors, likely correspondences can be established between pixel coordinates across two or more images. This feature-based "image registration" makes it possible to recover the 3D position of a point by triangulation.
- The Image Matching Challenge asks us to solve exactly this first part: registering two images taken from different viewpoints.
- Once feature points have been extracted from the two photos and matched, anyone who has studied computer vision topics such as stereo vision knows how to compute the relative pose of one image with respect to the other from the corresponding points. Submissions are evaluated by the mean Average Accuracy (mAA) of the estimated relative poses. A pose consists of a rotation matrix R and a translation vector t; the rotation error is measured in degrees (°) and the translation error in meters (m). Given a pair of thresholds, a pose is classified as accurate if it satisfies both. IMC 2022 does this over 10 pairs of thresholds, one pair at a time (e.g. 1° and 20 cm at the finest level, 10° and 5 m at the coarsest), as sketched right below.
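The following is a minimal illustrative sketch of this thresholding scheme. The threshold values match the ones used by the evaluation code later in this notebook; the pose_accuracy helper is hypothetical, not the official scorer.

```python
import numpy as np

# 10 (rotation, translation) threshold pairs: 1-10 degrees and 0.2-5 m
# (geometrically spaced), the same grids used in the mAA computation below.
thresholds_q = np.linspace(1, 10, 10)    # degrees
thresholds_t = np.geomspace(0.2, 5, 10)  # meters

def pose_accuracy(err_q_deg, err_t_m):
    '''Fraction of threshold pairs for which a single pose estimate
    satisfies BOTH the rotation and the translation threshold.'''
    hits = [(err_q_deg < th_q) and (err_t_m < th_t)
            for th_q, th_t in zip(thresholds_q, thresholds_t)]
    return sum(hits) / len(hits)

# Example: a pose off by 0.35 degrees and 2.83 m only passes the two
# coarsest threshold pairs, so its accuracy is 0.2.
print(pose_accuracy(0.35, 2.83))
```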
II. Dataset Overview
- The dataset consists of pairs of images from different cities.
# Unzip the dataset
!unzip -qo data/data151348/image-matching-challenge-2022.zip -d ./work/image-matching-challenge-2022/
# Inspect the file structure
!tree -L 3 ./work/image-matching-challenge-2022
File overview
- train/*/calibration.csv (a parsing sketch is given just below this list)
  - image_id: the image filename
  - camera_intrinsics: the camera intrinsic matrix K, a 3×3 matrix flattened into a 1D vector in row-major order
  - rotation_matrix: the rotation matrix R, a 3×3 matrix flattened into a 1D vector in row-major order
  - translation_vector: the translation vector T
- train/*/pair_covisibility.csv: the estimated covisibility (degree of overlap) for every image pair in the scene
- train/scaling_factors.csv: the poses for each scene were reconstructed via SfM and are only accurate up to a scale factor. This file contains the per-scene scale factors that can be used to convert depths to meters.
- train/*/images/: a batch of images taken around the same location
Note: since the test set is not used in this project, it is not described further; see https://www.kaggle.com/competitions/image-matching-challenge-2022/data for details.
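As a quick illustration of the row-major flattening described above, here is a minimal sketch with made-up numbers (not taken from the dataset) of how one flattened calibration field maps back to a 3×3 matrix; the LoadCalibration helper defined later does the same thing for real rows of calibration.csv.

```python
import numpy as np

# A hypothetical camera_intrinsics field: 9 space-separated values,
# i.e. the 3x3 matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]] in row-major order.
camera_intrinsics = '1000.0 0.0 512.0 0.0 1000.0 384.0 0.0 0.0 1.0'
K = np.array([float(v) for v in camera_intrinsics.split(' ')]).reshape([3, 3])
# K now has fx, fy on the diagonal and (cx, cy) in the last column.
print(K)
```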
III. Loading and Exploring the Data
- This section loads the training data and visualizes part of it, largely following https://www.kaggle.com/code/eduardtrulls/imc2022-training-data?scriptVersionId=92062607
!pip install --upgrade opencv-python==4.5.2.52
import os
import numpy as np
import cv2
import csv
from glob import glob
import matplotlib.pyplot as plt
from collections import namedtuple
from copy import deepcopy
from tqdm import tqdm
import random
random.seed(1314)
# Check the OpenCV version; if the assertion below fails, restart the kernel and rerun this cell
assert cv2.__version__ > '4.5', 'Please use OpenCV 4.5 or later.'
- Definitions of the key helper functions
# Key helper functions
# A named tuple holding the camera intrinsics K and the extrinsics R and T
Gt = namedtuple('Gt', ['K', 'R', 'T'])
# A small epsilon to avoid division by zero
eps = 1e-15
def ReadCovisibilityData(filename):
covisibility_dict = {}
with open(filename) as f:
reader = csv.reader(f, delimiter=',')
for i, row in enumerate(reader):
# Skip header.
if i == 0:
continue
covisibility_dict[row[0]] = float(row[1])
return covisibility_dict
def NormalizeKeypoints(keypoints, K):
    '''Convert pixel keypoint coordinates into normalized camera coordinates using the intrinsics K.'''
C_x = K[0, 2]
C_y = K[1, 2]
f_x = K[0, 0]
f_y = K[1, 1]
keypoints = (keypoints - np.array([[C_x, C_y]])) / np.array([[f_x, f_y]])
return keypoints
def ComputeEssentialMatrix(F, K1, K2, kp1, kp2):
    '''Compute the essential matrix from the fundamental matrix, given the calibration matrices. Note that participants must estimate the fundamental matrix without relying on the provided intrinsics.'''
    # Older OpenCV versions returned multiple fundamental matrix candidates; this project uses the version described at:
# https://opencv.org/evaluating-opencvs-new-ransacs
assert F.shape[0] == 3, 'Malformed F?'
    # Use OpenCV's recoverPose to compute the relative pose R, t between the two photos:
# https://docs.opencv.org/4.5.4/d9/d0c/group__calib3d.html#gadb7d2dfcc184c1d2f496d8639f4371c0
E = np.matmul(np.matmul(K2.T, F), K1).astype(np.float64)
kp1n = NormalizeKeypoints(kp1, K1)
kp2n = NormalizeKeypoints(kp2, K2)
num_inliers, R, T, mask = cv2.recoverPose(E, kp1n, kp2n)
return E, R, T
def ArrayFromCvKps(kps):
    '''Convert OpenCV keypoints to a numpy array.'''
return np.array([kp.pt for kp in kps])
def QuaternionFromMatrix(matrix):
    '''Convert a rotation matrix to a quaternion.'''
M = np.array(matrix, dtype=np.float64, copy=False)[:4, :4]
m00 = M[0, 0]
m01 = M[0, 1]
m02 = M[0, 2]
m10 = M[1, 0]
m11 = M[1, 1]
m12 = M[1, 2]
m20 = M[2, 0]
m21 = M[2, 1]
m22 = M[2, 2]
K = np.array([[m00 - m11 - m22, 0.0, 0.0, 0.0],
[m01 + m10, m11 - m00 - m22, 0.0, 0.0],
[m02 + m20, m12 + m21, m22 - m00 - m11, 0.0],
[m21 - m12, m02 - m20, m10 - m01, m00 + m11 + m22]])
K /= 3.0
    # The quaternion is the eigenvector corresponding to the largest eigenvalue of K
w, V = np.linalg.eigh(K)
q = V[[3, 0, 1, 2], np.argmax(w)]
if q[0] < 0:
np.negative(q, q)
return q
def ExtractSiftFeatures(image, detector, num_features):
    '''Compute SIFT features for an image.'''
gray = cv2.cvtColor(image, cv2.COLOR_RGB2GRAY)
kp, desc = detector.detectAndCompute(gray, None)
return kp[:num_features], desc[:num_features]
def ComputeErrorForOneExample(q_gt, T_gt, q, T, scale):
    '''Compute the error between the estimated and ground-truth relative poses.
    Returns the rotation and translation errors; ComputeMaa combines them over different thresholds to compute the mean Average Accuracy.'''
q_gt_norm = q_gt / (np.linalg.norm(q_gt) + eps)
q_norm = q / (np.linalg.norm(q) + eps)
loss_q = np.maximum(eps, (1.0 - np.sum(q_norm * q_gt_norm)**2))
err_q = np.arccos(1 - 2 * loss_q)
# Apply the scaling factor for this scene.
T_gt_scaled = T_gt * scale
T_scaled = T * np.linalg.norm(T_gt) * scale / (np.linalg.norm(T) + eps)
err_t = min(np.linalg.norm(T_gt_scaled - T_scaled), np.linalg.norm(T_gt_scaled + T_scaled))
return err_q * 180 / np.pi, err_t
def ComputeMaa(err_q, err_t, thresholds_q, thresholds_t):
    '''Compute the mean Average Accuracy (mAA) for one scene over different thresholds.'''
assert len(err_q) == len(err_t)
acc, acc_q, acc_t = [], [], []
for th_q, th_t in zip(thresholds_q, thresholds_t):
acc += [(np.bitwise_and(np.array(err_q) < th_q, np.array(err_t) < th_t)).sum() / len(err_q)]
acc_q += [(np.array(err_q) < th_q).sum() / len(err_q)]
acc_t += [(np.array(err_t) < th_t).sum() / len(err_t)]
return np.mean(acc), np.array(acc), np.array(acc_q), np.array(acc_t)
def BuildCompositeImage(im1, im2, axis=1, margin=0, background=1):
    '''Stack two images of possibly different sizes into one composite image.'''
if background != 0 and background != 1:
background = 1
if axis != 0 and axis != 1:
        raise RuntimeError('Axis must be 0 (vertical) or 1 (horizontal)')
h1, w1, _ = im1.shape
h2, w2, _ = im2.shape
if axis == 1:
composite = np.zeros((max(h1, h2), w1 + w2 + margin, 3), dtype=np.uint8) + 255 * background
if h1 > h2:
voff1, voff2 = 0, (h1 - h2) // 2
else:
voff1, voff2 = (h2 - h1) // 2, 0
hoff1, hoff2 = 0, w1 + margin
else:
composite = np.zeros((h1 + h2 + margin, max(w1, w2), 3), dtype=np.uint8) + 255 * background
if w1 > w2:
hoff1, hoff2 = 0, (w1 - w2) // 2
else:
hoff1, hoff2 = (w2 - w1) // 2, 0
voff1, voff2 = 0, h1 + margin
composite[voff1:voff1 + h1, hoff1:hoff1 + w1, :] = im1
composite[voff2:voff2 + h2, hoff2:hoff2 + w2, :] = im2
return (composite, (voff1, voff2), (hoff1, hoff2))
def DrawMatches(im1, im2, kp1, kp2, matches, axis=1, margin=0, background=0, linewidth=2):
    '''Draw keypoints and matches on a composite image.'''
composite, v_offset, h_offset = BuildCompositeImage(im1, im2, axis, margin, background)
# Draw all keypoints.
for coord_a, coord_b in zip(kp1, kp2):
composite = cv2.drawMarker(composite, (int(coord_a[0] + h_offset[0]), int(coord_a[1] + v_offset[0])), color=(255, 0, 0), markerType=cv2.MARKER_CROSS, markerSize=5, thickness=1)
composite = cv2.drawMarker(composite, (int(coord_b[0] + h_offset[1]), int(coord_b[1] + v_offset[1])), color=(255, 0, 0), markerType=cv2.MARKER_CROSS, markerSize=5, thickness=1)
# Draw matches, and highlight keypoints used in matches.
for idx_a, idx_b in matches:
composite = cv2.drawMarker(composite, (int(kp1[idx_a, 0] + h_offset[0]), int(kp1[idx_a, 1] + v_offset[0])), color=(0, 0, 255), markerType=cv2.MARKER_CROSS, markerSize=12, thickness=1)
composite = cv2.drawMarker(composite, (int(kp2[idx_b, 0] + h_offset[1]), int(kp2[idx_b, 1] + v_offset[1])), color=(0, 0, 255), markerType=cv2.MARKER_CROSS, markerSize=12, thickness=1)
composite = cv2.line(composite,
tuple([int(kp1[idx_a][0] + h_offset[0]),
int(kp1[idx_a][1] + v_offset[0])]),
tuple([int(kp2[idx_b][0] + h_offset[1]),
int(kp2[idx_b][1] + v_offset[1])]), color=(0, 0, 255), thickness=1)
return composite
def LoadCalibration(filename):
    '''Load the ground-truth calibration data from a .csv file.'''
calib_dict = {}
with open(filename, 'r') as f:
reader = csv.reader(f, delimiter=',')
for i, row in enumerate(reader):
# Skip header.
if i == 0:
continue
camera_id = row[0]
K = np.array([float(v) for v in row[1].split(' ')]).reshape([3, 3])
R = np.array([float(v) for v in row[2].split(' ')]).reshape([3, 3])
T = np.array([float(v) for v in row[3].split(' ')])
calib_dict[camera_id] = Gt(K=K, R=R, T=T)
return calib_dict
# See how many scenes there are in the training data
src = './work/image-matching-challenge-2022/train'
val_scenes = []
for f in os.scandir(src):
if f.is_dir():
cur_scene = os.path.split(f)[-1]
        print(f'Found scene "{cur_scene}" at {f.path}')
val_scenes += [cur_scene]
Found scene "brandenburg_gate"" at ./work/image-matching-challenge-2022/train/brandenburg_gate
Found scene "buckingham_palace"" at ./work/image-matching-challenge-2022/train/buckingham_palace
Found scene "sacre_coeur"" at ./work/image-matching-challenge-2022/train/sacre_coeur
Found scene "sagrada_familia"" at ./work/image-matching-challenge-2022/train/sagrada_familia
Found scene "notre_dame_front_facade"" at ./work/image-matching-challenge-2022/train/notre_dame_front_facade
Found scene "colosseum_exterior"" at ./work/image-matching-challenge-2022/train/colosseum_exterior
Found scene "st_pauls_cathedral"" at ./work/image-matching-challenge-2022/train/st_pauls_cathedral
Found scene "taj_mahal"" at ./work/image-matching-challenge-2022/train/taj_mahal
Found scene "lincoln_memorial_statue"" at ./work/image-matching-challenge-2022/train/lincoln_memorial_statue
Found scene "piazza_san_marco"" at ./work/image-matching-challenge-2022/train/piazza_san_marco
Found scene "temple_nara_japan"" at ./work/image-matching-challenge-2022/train/temple_nara_japan
Found scene "st_peters_square"" at ./work/image-matching-challenge-2022/train/st_peters_square
Found scene "pantheon_exterior"" at ./work/image-matching-challenge-2022/train/pantheon_exterior
Found scene "grand_place_brussels"" at ./work/image-matching-challenge-2022/train/grand_place_brussels
Found scene "trevi_fountain"" at ./work/image-matching-challenge-2022/train/trevi_fountain
Found scene "british_museum"" at ./work/image-matching-challenge-2022/train/british_museum
# Pick one of the scenes
scene = 'piazza_san_marco'
images_dict = {}
for filename in glob(f'{src}/{scene}/images/*.jpg'):
cur_id = os.path.basename(os.path.splitext(filename)[0])
    # OpenCV loads images in BGR order; convert to RGB
images_dict[cur_id] = cv2.cvtColor(cv2.imread(filename), cv2.COLOR_BGR2RGB)
print(f'Loaded {len(images_dict)} images.')
num_rows = 6
num_cols = 4
f, axes = plt.subplots(num_rows, num_cols, figsize=(20, 20), constrained_layout=True)
for i, key in enumerate(images_dict):
if i >= num_rows * num_cols:
break
cur_ax = axes[i % num_rows, i // num_rows]
cur_ax.imshow(images_dict[key])
cur_ax.set_title(key)
cur_ax.axis('off')
# Two photos of the same scene do not necessarily have any overlap.
# The dataset provides an estimated covisibility value that quantifies how much of the scene two photos share.
# The challenge requires using only image pairs with a covisibility above 0.1.
covisibility_dict = ReadCovisibilityData(f'{src}/{scene}/pair_covisibility.csv')
# Look at high-covisibility and low-covisibility image pairs, showing one pair of each
easy_subset = [k for k, v in covisibility_dict.items() if v >= 0.7]
difficult_subset = [k for k, v in covisibility_dict.items() if v >= 0.1 and v < 0.2]
for i, subset in enumerate([easy_subset, difficult_subset]):
    print(f'Pairs from {"an easy" if i == 0 else "a difficult"} subset')
for index, pair in enumerate(subset[:4]):
# A pair string is simply two concatenated image IDs, separated with a hyphen.
image_id_1, image_id_2 = pair.split('-')
f, axes = plt.subplots(1, 2, figsize=(15, 10), constrained_layout=True)
axes[0].imshow(images_dict[image_id_1])
axes[0].set_title(image_id_1)
axes[1].imshow(images_dict[image_id_2])
axes[1].set_title(image_id_2)
for ax in axes:
ax.axis('off')
plt.show()
if index==0:
break
print()
print()
fig = plt.figure(figsize=(15, 10), constrained_layout=True)
plt.title('Covisibility histogram')
plt.hist(list(covisibility_dict.values()), bins=10, range=[0, 1])
plt.show()
Pairs from an easy subset
Pairs from a difficult subset
IV. SIFT Experiment
- The goal of the competition is to recover the relative pose between two photos, i.e. the rotation matrix R and the translation vector T
- This project first uses the classical SIFT algorithm to extract and match feature points and then computes the relative pose of each image pair; the two relations the code relies on are summarized right below
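For reference, these are the two relations used throughout the code below; they correspond to ComputeEssentialMatrix and to the ground-truth relative-pose computation further down.

```latex
% Essential matrix from the estimated fundamental matrix and the intrinsics
E = K_2^{\top} F K_1
% Relative pose of camera 2 with respect to camera 1,
% given their absolute poses (R_1, T_1) and (R_2, T_2)
\Delta R = R_2 R_1^{\top}, \qquad \Delta T = T_2 - \Delta R \, T_1
```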
1. SIFT matching, pose estimation, and error against the ground truth
# Extract SIFT feature points
num_features = 5000
# Lower the detection thresholds so that enough feature points are extracted
sift_detector = cv2.SIFT_create(num_features, contrastThreshold=-10000, edgeThreshold=-10000)
keys = list(images_dict.keys())
keypoints, descriptors = ExtractSiftFeatures(images_dict[keys[0]], sift_detector, num_features)
print(f'Computed {len(keypoints)} features.')
# Each local feature consists of a keypoint (x, y, possibly with scale and orientation) and a descriptor vector (128-D for SIFT)
image_with_keypoints = cv2.drawKeypoints(images_dict[keys[0]], keypoints, outImage=None, flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
fig = plt.figure(figsize=(15, 15))
plt.imshow(image_with_keypoints)
plt.axis('off')
plt.show()
Computed 5000 features.
- After extracting feature points, correspondences between the two images are found by brute-force descriptor matching (an alternative ratio-test sketch is shown after the matching cell below)
- Take one pair from the **easy subset (covisibility above 0.7)** for this experiment
pair = easy_subset[0]
image_id_1, image_id_2 = pair.split('-')
keypoints_1, descriptors_1 = ExtractSiftFeatures(images_dict[image_id_1], sift_detector, 2000)
keypoints_2, descriptors_2 = ExtractSiftFeatures(images_dict[image_id_2], sift_detector, 2000)
# Compute distances between the descriptors of one image and those of the other; sufficiently close descriptor pairs become the matched point pairs
bf = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
# Match
cv_matches = bf.match(descriptors_1, descriptors_2)
# Convert the keypoints and matches to numpy arrays
cur_kp_1 = ArrayFromCvKps(keypoints_1)
cur_kp_2 = ArrayFromCvKps(keypoints_2)
matches = np.array([[m.queryIdx, m.trainIdx] for m in cv_matches])
# Draw the matching result
im_matches = DrawMatches(images_dict[image_id_1], images_dict[image_id_2], cur_kp_1, cur_kp_2, matches)
fig = plt.figure(figsize=(25, 25))
plt.title('Matches before RANSAC')
plt.imshow(im_matches)
plt.axis('off')
plt.show()
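As a side note, the cell above uses a cross-checked brute-force matcher; a common alternative (not used in this project's experiments) is Lowe's ratio test, sketched below under the assumption that descriptors_1 and descriptors_2 from the cell above are still in scope.

```python
# Alternative to the cross-checked matcher above: Lowe's ratio test.
bf_knn = cv2.BFMatcher(cv2.NORM_L2)
knn_matches = bf_knn.knnMatch(descriptors_1, descriptors_2, k=2)

ratio = 0.75  # a typical value; tune per dataset
good = [m for m, n in knn_matches if m.distance < ratio * n.distance]
matches_ratio = np.array([[m.queryIdx, m.trainIdx] for m in good])
print(f'{len(matches_ratio)} matches kept by the ratio test')
```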
# Remove mismatched points with the classical RANSAC algorithm. References:
# * https://docs.opencv.org/4.5.4/d9/d0c/group__calib3d.html#ga59b0d57f46f8677fb5904294a23d404a
# * https://opencv.org/evaluating-opencvs-new-ransacs
# OpenCV gives us the Fundamental matrix after RANSAC, and a mask over the input matches. The solution is clearly much cleaner, even though it may still contain outliers.
# This F is the prediction you'll submit to the contest.
F, inlier_mask = cv2.findFundamentalMat(cur_kp_1[matches[:, 0]], cur_kp_2[matches[:, 1]], cv2.USAC_MAGSAC, ransacReprojThreshold=0.25, confidence=0.99999, maxIters=10000)
matches_after_ransac = np.array([match for match, is_inlier in zip(matches, inlier_mask) if is_inlier])
im_inliers = DrawMatches(images_dict[image_id_1], images_dict[image_id_2], cur_kp_1, cur_kp_2, matches_after_ransac)
fig = plt.figure(figsize=(25, 25))
plt.title('Matches after RANSAC')
plt.imshow(im_inliers)
plt.axis('off')
plt.show()
- Load the ground truth and check the error of the estimated pose against it
calib_dict = LoadCalibration(f'{src}/{scene}/calibration.csv')
print(f'Loaded ground truth data for {len(calib_dict)} images')
print()
# Note: the scenes were reconstructed with Structure-from-Motion (http://colmap.github.io) from unstructured image collections and are not in "real-world" units (e.g. meters or feet).
# A scale factor was computed for each scene to correct this; it is required to compute metric depth correctly.
scaling_dict = {}
with open(f'{src}/scaling_factors.csv') as f:
reader = csv.reader(f, delimiter=',')
for i, row in enumerate(reader):
# Skip header.
if i == 0:
continue
scaling_dict[row[0]] = float(row[1])
print(f'Scaling factors: {scaling_dict}')
print()
# With the scale recovered, the error can be computed. Decompose the estimated fundamental matrix F to obtain the essential matrix, the rotation matrix, and the translation vector.
inlier_kp_1 = ArrayFromCvKps([kp for i, kp in enumerate(keypoints_1) if i in matches_after_ransac[:, 0]])
inlier_kp_2 = ArrayFromCvKps([kp for i, kp in enumerate(keypoints_2) if i in matches_after_ransac[:, 1]])
E, R, T = ComputeEssentialMatrix(F, calib_dict[image_id_1].K, calib_dict[image_id_2].K, inlier_kp_1, inlier_kp_2)
q = QuaternionFromMatrix(R)
T = T.flatten()
# Get the ground-truth relative pose of this image pair
R1_gt, T1_gt = calib_dict[image_id_1].R, calib_dict[image_id_1].T.reshape((3, 1))
R2_gt, T2_gt = calib_dict[image_id_2].R, calib_dict[image_id_2].T.reshape((3, 1))
dR_gt = np.dot(R2_gt, R1_gt.T)
dT_gt = (T2_gt - np.dot(dR_gt, T1_gt)).flatten()
q_gt = QuaternionFromMatrix(dR_gt)
q_gt = q_gt / (np.linalg.norm(q_gt) + eps)
# Given ground truth and prediction, compute the error for the example above.
err_q, err_t = ComputeErrorForOneExample(q_gt, dT_gt, q, T, scaling_dict[scene])
print(f'Pair "{pair}", rotation_error={err_q:.02f} (deg), translation_error={err_t:.02f} (m)', flush=True)
Loaded ground truth data for 68 images
Scaling factors: {'british_museum': 2.517, 'brandenburg_gate': 7.38, 'buckingham_palace': 18.75, 'colosseum_exterior': 36.99, 'grand_place_brussels': 10.26, 'lincoln_memorial_statue': 1.85, 'notre_dame_front_facade': 1.36, 'pantheon_exterior': 5.41, 'piazza_san_marco': 7.92, 'sacre_coeur': 20.27, 'sagrada_familia': 4.2, 'st_pauls_cathedral': 7.01, 'st_peters_square': 21.48, 'taj_mahal': 20.76, 'temple_nara_japan': 7.79, 'trevi_fountain': 3.67}
Pair "65581481_3524160597-65217461_5742419027, rotation_error=0.35 (deg), translation_error=2.83 (m)
- For this easy image pair, the pose estimated with SIFT deviates from the ground truth by 0.35° in rotation and 2.83 m in translation
2. Evaluating the mAA of SIFT on 50 image pairs per scene
- Using SIFT, estimate the relative pose for image pairs in every scene of the training set, compute each scene's average accuracy from the errors, then average over scenes to obtain the mean Average Accuracy (written out as a formula right after this list)
- To speed up the experiment, only 50 image pairs (randomly sampled) per scene are evaluated
- To keep the notebook under 10 MB, images and per-pair results are not displayed; to see them, set show_images and verbose to True in the code below
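Written out, the per-scene metric that ComputeMaa evaluates below is (with N evaluated pairs, per-pair rotation and translation errors e_q and e_t, and the 10 threshold pairs):

```latex
\mathrm{mAA}_{\text{scene}}
  = \frac{1}{10} \sum_{k=1}^{10} \frac{1}{N} \sum_{i=1}^{N}
    \mathbf{1}\!\left[ e_q^{\,i} < \theta_q^{\,k} \;\wedge\; e_t^{\,i} < \theta_t^{\,k} \right],
\qquad
\mathrm{mAA}_{\text{dataset}} = \frac{1}{|\text{scenes}|} \sum_{\text{scenes}} \mathrm{mAA}_{\text{scene}}
```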
show_images = False
num_show_images = 1
max_pairs_per_scene = 50
verbose = False
# We use two different sets of thresholds over rotation and translation. Do not change this -- these are the values used by the scoring back-end.
thresholds_q = np.linspace(1, 10, 10)
thresholds_t = np.geomspace(0.2, 5, 10)
# Save the per-sample errors and the accumulated metric to dictionaries, for later inspection.
errors = {scene: {} for scene in scaling_dict.keys()}
mAA = {scene: {} for scene in scaling_dict.keys()}
# Instantiate the matcher.
bf = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True)
for scene in tqdm(scaling_dict.keys()):
# Load all pairs, find those with a co-visibility over 0.1, and subsample them.
covisibility_dict = ReadCovisibilityData(f'{src}/{scene}/pair_covisibility.csv')
pairs = [pair for pair, covis in covisibility_dict.items() if covis >= 0.1]
# print(f'-- Processing scene "{scene}": found {len(pairs)} pairs (will keep {min(len(pairs), max_pairs_per_scene)})', flush=True)
# Subsample the pairs. Note that they are roughly sorted by difficulty (easy ones first), so we shuffle them beforehand: results would be misleading otherwise.
random.shuffle(pairs)
pairs = pairs[:max_pairs_per_scene]
# Extract the images in these pairs (we don't need to load images we will not use).
ids = []
for pair in pairs:
cur_ids = pair.split('-')
assert cur_ids[0] > cur_ids[1]
ids += cur_ids
ids = list(set(ids))
# Load ground truth data.
calib_dict = LoadCalibration(f'{src}/{scene}/calibration.csv')
# Load images and extract SIFT features.
images_dict = {}
kp_dict = {}
desc_dict = {}
# print('Extracting features...')
for id in ids:
images_dict[id] = cv2.cvtColor(cv2.imread(f'{src}/{scene}/images/{id}.jpg'), cv2.COLOR_BGR2RGB)
kp_dict[id], desc_dict[id] = ExtractSiftFeatures(images_dict[id], sift_detector, 2000)
# print()
# print(f'Extracted features for {len(kp_dict)} images (avg: {np.mean([len(v) for v in desc_dict.values()])})')
# Process the pairs.
max_err_acc_q_new = []
max_err_acc_t_new = []
for counter, pair in enumerate(pairs):
id1, id2 = pair.split('-')
# Compute matches by brute force.
cv_matches = bf.match(desc_dict[id1], desc_dict[id2])
matches = np.array([[m.queryIdx, m.trainIdx] for m in cv_matches])
cur_kp_1 = ArrayFromCvKps([kp_dict[id1][m[0]] for m in matches])
cur_kp_2 = ArrayFromCvKps([kp_dict[id2][m[1]] for m in matches])
# Filter matches with RANSAC.
F, inlier_mask = cv2.findFundamentalMat(cur_kp_1, cur_kp_2, cv2.USAC_MAGSAC, 0.25, 0.99999, 10000)
inlier_mask = inlier_mask.astype(bool).flatten()
matches_after_ransac = np.array([match for match, is_inlier in zip(matches, inlier_mask) if is_inlier])
inlier_kp_1 = ArrayFromCvKps([kp_dict[id1][m[0]] for m in matches_after_ransac])
inlier_kp_2 = ArrayFromCvKps([kp_dict[id2][m[1]] for m in matches_after_ransac])
# Compute the essential matrix.
E, R, T = ComputeEssentialMatrix(F, calib_dict[id1].K, calib_dict[id2].K, inlier_kp_1, inlier_kp_2)
q = QuaternionFromMatrix(R)
T = T.flatten()
# Get the relative rotation and translation between these two cameras, given their R and T in the global reference frame.
R1_gt, T1_gt = calib_dict[id1].R, calib_dict[id1].T.reshape((3, 1))
R2_gt, T2_gt = calib_dict[id2].R, calib_dict[id2].T.reshape((3, 1))
dR_gt = np.dot(R2_gt, R1_gt.T)
dT_gt = (T2_gt - np.dot(dR_gt, T1_gt)).flatten()
q_gt = QuaternionFromMatrix(dR_gt)
q_gt = q_gt / (np.linalg.norm(q_gt) + eps)
# Compute the error for this example.
err_q, err_t = ComputeErrorForOneExample(q_gt, dT_gt, q, T, scaling_dict[scene])
errors[scene][pair] = [err_q, err_t]
# Plot the resulting matches and the pose error.
if verbose or (show_images and counter < num_show_images):
print(f'{pair}, err_q={(err_q):.02f} (deg), err_t={(err_t):.02f} (m)', flush=True)
if show_images and counter < num_show_images:
im_inliers = DrawMatches(images_dict[id1], images_dict[id2], ArrayFromCvKps(kp_dict[id1]), ArrayFromCvKps(kp_dict[id2]), matches_after_ransac)
fig = plt.figure(figsize=(25, 25))
plt.title(f'Inliers, "{pair}"')
plt.imshow(im_inliers)
plt.axis('off')
plt.show()
print()
# Histogram the errors over this scene.
mAA[scene] = ComputeMaa([v[0] for v in errors[scene].values()], [v[1] for v in errors[scene].values()], thresholds_q, thresholds_t)
# print()
# print(f'Mean average Accuracy on "{scene}": {mAA[scene][0]:.05f}')
# print()
print()
print('------- SUMMARY -------')
print()
for scene in scaling_dict.keys():
print(f'-- Mean average Accuracy on "{scene}": {mAA[scene][0]:.05f}')
print()
print(f'Mean average Accuracy on dataset: {np.mean([mAA[scene][0] for scene in mAA]):.05f}')
100%|██████████| 16/16 [06:39<00:00, 25.89s/it]
------- SUMMARY -------
-- Mean average Accuracy on "british_museum": 0.07400
-- Mean average Accuracy on "brandenburg_gate": 0.08800
-- Mean average Accuracy on "buckingham_palace": 0.00000
-- Mean average Accuracy on "colosseum_exterior": 0.04000
-- Mean average Accuracy on "grand_place_brussels": 0.04400
-- Mean average Accuracy on "lincoln_memorial_statue": 0.26600
-- Mean average Accuracy on "notre_dame_front_facade": 0.15400
-- Mean average Accuracy on "pantheon_exterior": 0.04200
-- Mean average Accuracy on "piazza_san_marco": 0.05800
-- Mean average Accuracy on "sacre_coeur": 0.02200
-- Mean average Accuracy on "sagrada_familia": 0.27000
-- Mean average Accuracy on "st_pauls_cathedral": 0.04800
-- Mean average Accuracy on "st_peters_square": 0.01400
-- Mean average Accuracy on "taj_mahal": 0.02600
-- Mean average Accuracy on "temple_nara_japan": 0.08600
-- Mean average Accuracy on "trevi_fountain": 0.12600
Mean average Accuracy on dataset: 0.08488
- Result: estimating poses with SIFT on 50 image pairs per training-set scene yields a mean Average Accuracy of 8.488%
V. SuperPoint + SuperGlue Experiment
- This part of the project uses SuperPoint to extract feature points and SuperGlue to match them, then computes the relative pose of each image pair from the matches (a short sketch of how the matcher's output is interpreted follows this list)
- Note: the Paddle ports of the SuperPoint and SuperGlue weights come from geoyee's project 深度学习图像匹配网络 (deep-learning image matching networks); this project simply reuses them, so see that project for the details
- Neither SuperPoint nor SuperGlue is trained on the training set here; the pretrained models already perform well. The first-place entry on the Kaggle leaderboard also did not train on the training set, but ensembled several pretrained models
- Since SuperGlue is roughly 30× slower than SIFT, it is recommended to run only the matching / relative-pose / error part of this section; computing the mean Average Accuracy (mAA) on the same data as Section IV takes more than 10 hours
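Before running the model, here is a minimal, self-contained sketch (with made-up numbers) of how the output fields used below are interpreted: matches0[i] is the index of the keypoint in the second image matched to keypoint i of the first image, or -1 if it is unmatched.

```python
import numpy as np

# Hypothetical matcher output for 3 keypoints in image 0 and 2 in image 1.
kpts0 = np.array([[10.0, 20.0], [30.0, 40.0], [50.0, 60.0]])
kpts1 = np.array([[11.0, 21.0], [52.0, 61.0]])
matches0 = np.array([0, -1, 1])      # keypoint 1 of image 0 is unmatched
scores0 = np.array([0.9, 0.0, 0.7])  # matching confidence per keypoint

valid = matches0 > -1
mkpts0 = kpts0[valid]            # matched keypoints in image 0
mkpts1 = kpts1[matches0[valid]]  # their counterparts in image 1
mconf = scores0[valid]           # confidence of each retained match
print(mkpts0, mkpts1, mconf, sep='\n')
```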
1. SuperPoint feature extraction + SuperGlue matching, relative pose estimation and error computation
scene = 'piazza_san_marco'
images_dict = {}
images_path_dict = {}
for filename in glob(f'{src}/{scene}/images/*.jpg'):
cur_id = os.path.basename(os.path.splitext(filename)[0])
    # Store the path of each image
images_path_dict[cur_id] = filename
    # OpenCV loads images in BGR order; convert to RGB
images_dict[cur_id] = cv2.cvtColor(cv2.imread(filename), cv2.COLOR_BGR2RGB)
print(f'Loaded {len(images_dict)} images.')
covisibility_dict = ReadCovisibilityData(f'{src}/{scene}/pair_covisibility.csv')
# Get the high-covisibility pairs
easy_subset = [k for k, v in covisibility_dict.items() if v >= 0.7]
Loaded 68 images.
- Extract feature points with SuperPoint and match them with SuperGlue
from models import Matching, read_image
pair = easy_subset[0]
image_id_1, image_id_2 = pair.split('-')
im0_path = images_path_dict[image_id_1]
im1_path = images_path_dict[image_id_2]
model = Matching()  # Instantiate the matching model
im0, imt0 = read_image(im0_path)
im1, imt1 = read_image(im1_path)
data = {"image0": imt0, "image1": imt1}
pred = model(data)
pred = {k: v[0].detach().cpu().numpy() for k, v in pred.items()}
W0623 16:58:44.927675 259 gpu_context.cc:278] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W0623 16:58:44.932485 259 gpu_context.cc:306] device: 0, cuDNN Version: 7.6.
Loaded SuperPoint model.
Loaded SuperGlue model.
kpts1, kpts2 = pred["keypoints0"], pred["keypoints1"]
matches, conf = pred["matches0"], pred["matching_scores0"]
valid = matches > -1
mkpts1 = kpts1[valid]
mkpts2 = kpts2[matches[valid]]
mconf = conf[valid]
if len(mkpts1) > 8:
F, inlier_mask = cv2.findFundamentalMat(mkpts1, mkpts2, cv2.USAC_MAGSAC, ransacReprojThreshold=0.25, confidence=0.99999, maxIters=10000)
    # Keep only the keypoints that survive RANSAC
inlier_kp_1 = np.array([kpt for kpt, is_inlier in zip(mkpts1, inlier_mask) if is_inlier])
inlier_kp_2 = np.array([kpt for kpt, is_inlier in zip(mkpts2, inlier_mask) if is_inlier])
- Visualize the SuperPoint + SuperGlue matches after removing mismatches with RANSAC
from models import visualize as vis
import matplotlib.cm as cm
color = cm.jet(conf[valid])
out = vis.matching_plot_simple(images_dict[image_id_1], images_dict[image_id_2], mkpts1, mkpts2, inlier_kp_1, inlier_kp_2, color)
out = cv2.cvtColor(out, cv2.COLOR_BGR2RGB)
plt.figure(figsize=(15, 15))
plt.imshow(out)
plt.axis("off")
plt.show()
calib_dict = LoadCalibration(f'{src}/{scene}/calibration.csv')
print(f'Loaded ground truth data for {len(calib_dict)} images')
print()
# Note: the scenes were reconstructed with Structure-from-Motion (http://colmap.github.io) from unstructured image collections and are not in "real-world" units (e.g. meters or feet).
# A scale factor was computed for each scene to correct this; it is required to compute metric depth correctly.
scaling_dict = {}
with open(f'{src}/scaling_factors.csv') as f:
reader = csv.reader(f, delimiter=',')
for i, row in enumerate(reader):
# Skip header.
if i == 0:
continue
scaling_dict[row[0]] = float(row[1])
print(f'Scaling factors: {scaling_dict}')
print()
# With the scale recovered, the error can be computed. Decompose the estimated fundamental matrix F to obtain the essential matrix, the rotation matrix, and the translation vector.
E, R, T = ComputeEssentialMatrix(F, calib_dict[image_id_1].K, calib_dict[image_id_2].K, inlier_kp_1, inlier_kp_2)
q = QuaternionFromMatrix(R)
T = T.flatten()
# Get the ground-truth relative pose of this image pair
R1_gt, T1_gt = calib_dict[image_id_1].R, calib_dict[image_id_1].T.reshape((3, 1))
R2_gt, T2_gt = calib_dict[image_id_2].R, calib_dict[image_id_2].T.reshape((3, 1))
dR_gt = np.dot(R2_gt, R1_gt.T)
dT_gt = (T2_gt - np.dot(dR_gt, T1_gt)).flatten()
q_gt = QuaternionFromMatrix(dR_gt)
q_gt = q_gt / (np.linalg.norm(q_gt) + eps)
# Given ground truth and prediction, compute the error for the example above.
err_q, err_t = ComputeErrorForOneExample(q_gt, dT_gt, q, T, scaling_dict[scene])
print(f'Pair "{pair}", rotation_error={err_q:.02f} (deg), translation_error={err_t:.02f} (m)', flush=True)
Loaded ground truth data for 68 images
Scaling factors: {'british_museum': 2.517, 'brandenburg_gate': 7.38, 'buckingham_palace': 18.75, 'colosseum_exterior': 36.99, 'grand_place_brussels': 10.26, 'lincoln_memorial_statue': 1.85, 'notre_dame_front_facade': 1.36, 'pantheon_exterior': 5.41, 'piazza_san_marco': 7.92, 'sacre_coeur': 20.27, 'sagrada_familia': 4.2, 'st_pauls_cathedral': 7.01, 'st_peters_square': 21.48, 'taj_mahal': 20.76, 'temple_nara_japan': 7.79, 'trevi_fountain': 3.67}
Pair "65581481_3524160597-65217461_5742419027, rotation_error=0.77 (deg), translation_error=0.34 (m)
- For the same easy image pair, the pose estimated with SuperPoint + SuperGlue deviates by 0.77° in rotation and 0.34 m in translation
2. Evaluating SuperPoint + SuperGlue on 50 image pairs per scene and computing the mean Average Accuracy (mAA)
show_images = False
num_show_images = 1
max_pairs_per_scene = 50
verbose = False
# We use two different sets of thresholds over rotation and translation. Do not change this -- these are the values used by the scoring back-end.
thresholds_q = np.linspace(1, 10, 10)
thresholds_t = np.geomspace(0.2, 5, 10)
# Save the per-sample errors and the accumulated metric to dictionaries, for later inspection.
errors = {scene: {} for scene in scaling_dict.keys()}
mAA = {scene: {} for scene in scaling_dict.keys()}
for scene in tqdm(scaling_dict.keys()):
# Load all pairs, find those with a co-visibility over 0.1, and subsample them.
covisibility_dict = ReadCovisibilityData(f'{src}/{scene}/pair_covisibility.csv')
pairs = [pair for pair, covis in covisibility_dict.items() if covis >= 0.1]
# print(f'-- Processing scene "{scene}": found {len(pairs)} pairs (will keep {min(len(pairs), max_pairs_per_scene)})', flush=True)
# Subsample the pairs. Note that they are roughly sorted by difficulty (easy ones first), so we shuffle them beforehand: results would be misleading otherwise.
random.shuffle(pairs)
pairs = pairs[:max_pairs_per_scene]
# Extract the images in these pairs (we don't need to load images we will not use).
ids = []
for pair in pairs:
cur_ids = pair.split('-')
assert cur_ids[0] > cur_ids[1]
ids += cur_ids
ids = list(set(ids))
# Load ground truth data.
calib_dict = LoadCalibration(f'{src}/{scene}/calibration.csv')
images_dict = {}
images_path_dict = {}
kp_dict = {}
desc_dict = {}
# print('Extracting features...')
for id in ids:
images_dict[id] = cv2.cvtColor(cv2.imread(f'{src}/{scene}/images/{id}.jpg'), cv2.COLOR_BGR2RGB)
images_path_dict[id] = f'{src}/{scene}/images/{id}.jpg'
# print()
# print(f'Extracted features for {len(kp_dict)} images (avg: {np.mean([len(v) for v in desc_dict.values()])})')
# Process the pairs.
max_err_acc_q_new = []
max_err_acc_t_new = []
for counter, pair in enumerate(pairs):
id1, id2 = pair.split('-')
im0_path = images_path_dict[id1]
im1_path = images_path_dict[id2]
        # model = Matching()  # The model was already instantiated in the previous section
im0, imt0 = read_image(im0_path)
im1, imt1 = read_image(im1_path)
data = {"image0": imt0, "image1": imt1}
pred = model(data)
pred = {k: v[0].detach().cpu().numpy() for k, v in pred.items()}
kpts1, kpts2 = pred["keypoints0"], pred["keypoints1"]
matches, conf = pred["matches0"], pred["matching_scores0"]
valid = matches > -1
mkpts1 = kpts1[valid]
mkpts2 = kpts2[matches[valid]]
mconf = conf[valid]
# Filter matches with RANSAC.
if len(mkpts1) > 8:
F, inlier_mask = cv2.findFundamentalMat(mkpts1, mkpts2, cv2.USAC_MAGSAC, 0.25, 0.99999, 10000)
else:
continue
inlier_kp_1 = np.array([kpt for kpt, is_inlier in zip(mkpts1, inlier_mask) if is_inlier])
inlier_kp_2 = np.array([kpt for kpt, is_inlier in zip(mkpts2, inlier_mask) if is_inlier])
# Compute the essential matrix.
E, R, T = ComputeEssentialMatrix(F, calib_dict[id1].K, calib_dict[id2].K, inlier_kp_1, inlier_kp_2)
q = QuaternionFromMatrix(R)
T = T.flatten()
# Get the relative rotation and translation between these two cameras, given their R and T in the global reference frame.
R1_gt, T1_gt = calib_dict[id1].R, calib_dict[id1].T.reshape((3, 1))
R2_gt, T2_gt = calib_dict[id2].R, calib_dict[id2].T.reshape((3, 1))
dR_gt = np.dot(R2_gt, R1_gt.T)
dT_gt = (T2_gt - np.dot(dR_gt, T1_gt)).flatten()
q_gt = QuaternionFromMatrix(dR_gt)
q_gt = q_gt / (np.linalg.norm(q_gt) + eps)
# Compute the error for this example.
err_q, err_t = ComputeErrorForOneExample(q_gt, dT_gt, q, T, scaling_dict[scene])
errors[scene][pair] = [err_q, err_t]
# Histogram the errors over this scene.
mAA[scene] = ComputeMaa([v[0] for v in errors[scene].values()], [v[1] for v in errors[scene].values()], thresholds_q, thresholds_t)
# print()
# print(f'Mean average Accuracy on "{scene}": {mAA[scene][0]:.05f}')
# print()
print()
print('------- SUMMARY -------')
print()
for scene in scaling_dict.keys():
print(f'-- Mean average Accuracy on "{scene}": {mAA[scene][0]:.05f}')
print()
print(f'Mean average Accuracy on dataset: {np.mean([mAA[scene][0] for scene in mAA]):.05f}')
100%|██████████| 16/16 [8:02:09<00:00, 1754.38s/it]
------- SUMMARY -------
-- Mean average Accuracy on "british_museum": 0.54800
-- Mean average Accuracy on "brandenburg_gate": 0.60600
-- Mean average Accuracy on "buckingham_palace": 0.24000
-- Mean average Accuracy on "colosseum_exterior": 0.13800
-- Mean average Accuracy on "grand_place_brussels": 0.25200
-- Mean average Accuracy on "lincoln_memorial_statue": 0.62400
-- Mean average Accuracy on "notre_dame_front_facade": 0.70400
-- Mean average Accuracy on "pantheon_exterior": 0.43200
-- Mean average Accuracy on "piazza_san_marco": 0.26400
-- Mean average Accuracy on "sacre_coeur": 0.68600
-- Mean average Accuracy on "sagrada_familia": 0.54600
-- Mean average Accuracy on "st_pauls_cathedral": 0.37200
-- Mean average Accuracy on "st_peters_square": 0.24000
-- Mean average Accuracy on "taj_mahal": 0.56200
-- Mean average Accuracy on "temple_nara_japan": 0.51800
-- Mean average Accuracy on "trevi_fountain": 0.58723
Mean average Accuracy on dataset: 0.45745
- The deep-learning approach (SuperPoint + SuperGlue) reaches an mAA of 45.745%, far higher than the 8.488% of the SIFT algorithm
VI. Summary
- Comparing the deep-learning matching pipeline (SuperPoint + SuperGlue) against the SIFT algorithm:
- On the same image pair, the pose estimated with SIFT is off by 0.35° and 2.83 m, while SuperPoint + SuperGlue is off by 0.77° and 0.34 m; the translation error is far smaller than with classical SIFT
- On 50 image pairs per scene, SIFT reaches a mean Average Accuracy of 8.488%, while SuperPoint + SuperGlue reaches 45.745%, far surpassing the classical algorithm
- However, the deep-learning method is much slower than classical SIFT, conservatively 30× or more (a minimal timing sketch is given below)
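If you want to quantify that speed gap yourself, a rough timing sketch along the following lines could be used; it assumes the objects defined earlier in this notebook (sift_detector, bf, ExtractSiftFeatures, images_dict, images_path_dict, model, read_image, and the pair image_id_1 / image_id_2 from Section V.1) are still in scope.

```python
import time

def time_it(fn, repeats=3):
    '''Return the average wall-clock time of fn() over a few runs.'''
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats

def sift_pair():
    # SIFT extraction + brute-force matching for one image pair
    _, desc1 = ExtractSiftFeatures(images_dict[image_id_1], sift_detector, 2000)
    _, desc2 = ExtractSiftFeatures(images_dict[image_id_2], sift_detector, 2000)
    bf.match(desc1, desc2)

def superglue_pair():
    # SuperPoint + SuperGlue matching for the same image pair
    _, imt0 = read_image(images_path_dict[image_id_1])
    _, imt1 = read_image(images_path_dict[image_id_2])
    model({"image0": imt0, "image1": imt1})

print(f'SIFT:      {time_it(sift_pair):.2f} s per pair')
print(f'SuperGlue: {time_it(superglue_pair):.2f} s per pair')
```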
This is a repost on AI Studio; original project: https://aistudio.baidu.com/aistudio/projectdetail/4196840
Original author: KKKloveqbh