CVPR-2020

The CVPR 2020 papers are out. These are short notes taken while reading; the categories follow amusi's list.

Categories

Image classification

Video classification

Detection

Object detection

D2Det: Towards High Quality Object Detection and Instance Segmentation

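This paper proposes D2Det, a two-stage detector. "High quality" refers to two things: more precise localization and more accurate classification.

  • Localization: for each object proposal, a dense local regression, implemented as a fully convolutional network, predicts multiple dense box offsets.

    • Compared with traditional methods: because it regresses position-sensitive real-valued offsets, the number of keypoints in a fixed region is not limited. (Faster R-CNN regresses a single global offset through several fully connected layers; Grid R-CNN searches for keypoints, i.e. the boundary points of the object, within a fixed region.)

      Traditional method: given a candidate object proposal and a target ground-truth box, the offset is:

      This paper: two points denote the top-left and bottom-right corners of the ground-truth box, and the dense box offsets are predicted by the local feature in the left, top, right, and bottom directions, respectively. Offset:

      Even so, this dense regression is still not accurate enough, and some locations have to be ignored.

    • A binary overlap prediction (shown in the figure) reduces the influence of the background. "Binary" means that each location is classified as background or object region; it is trained with a binary cross-entropy loss.

    • Goal: find a tight bounding box surrounding an object.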

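  • Classification: obtain discriminative features

    • A discriminative RoI pooling scheme samples different sub-regions of a proposal and performs an adaptive weighted pooling to obtain discriminative features. A light-weight (equal) offset predictor comes first; adaptive weighting then gives more weight to the discriminative sampling points.
    • RoIAlign: obtains features from k × k sub-regions and passes these features through three fully connected layers.
    • The offset predictor only requires a smaller RoIAlign followed by the fully connected layers (light-weight due to the smaller input vector).
    • For the original sampling points, a convolution W first predicts the weight values, which gives the weighted RoI feature vector; the operator involved is the Hadamard product.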
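The dense local regression described above can be sketched in NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the normalisation of the offsets by the proposal width and height, and the averaging of the per-location boxes over locations that the binary overlap prediction marks as object are all assumptions made for the example.

```python
import numpy as np

def subregion_centres(proposal, k):
    """Centres of the k x k sub-regions of a proposal (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = proposal
    xs = x1 + (np.arange(k) + 0.5) * (x2 - x1) / k
    ys = y1 + (np.arange(k) + 0.5) * (y2 - y1) / k
    return np.meshgrid(xs, ys)  # two arrays of shape (k, k)

def encode_dense_offsets(proposal, gt_box, k):
    """Targets: distance from every sub-region centre to the four sides of the
    ground-truth box, normalised by the proposal size (assumed normalisation)."""
    x1, y1, x2, y2 = proposal
    w, h = x2 - x1, y2 - y1
    gx, gy = subregion_centres(proposal, k)
    left = (gx - gt_box[0]) / w
    top = (gy - gt_box[1]) / h
    right = (gt_box[2] - gx) / w
    bottom = (gt_box[3] - gy) / h
    return np.stack([left, top, right, bottom], axis=-1)  # (k, k, 4)

def decode_dense_offsets(proposal, offsets, overlap_prob, thresh=0.5):
    """Turn dense (l, t, r, b) offsets into one box by averaging the boxes
    voted by locations whose overlap probability exceeds the threshold."""
    x1, y1, x2, y2 = proposal
    w, h = x2 - x1, y2 - y1
    gx, gy = subregion_centres(proposal, offsets.shape[0])
    boxes = np.stack([gx - offsets[..., 0] * w, gy - offsets[..., 1] * h,
                      gx + offsets[..., 2] * w, gy + offsets[..., 3] * h], axis=-1)
    keep = overlap_prob > thresh
    if not keep.any():  # nothing classified as object: fall back to the proposal
        return np.asarray(proposal, dtype=float)
    return boxes[keep].mean(axis=0)

proposal = (0.0, 0.0, 10.0, 10.0)
gt_box = (2.0, 1.0, 9.0, 8.0)
targets = encode_dense_offsets(proposal, gt_box, k=7)
decoded = decode_dense_offsets(proposal, targets, np.ones((7, 7)))
fallback = decode_dense_offsets(proposal, targets, np.zeros((7, 7)))
```

With perfect offsets every location votes for the same box, so the average recovers the ground-truth box; with noisy offsets the average over many locations is what makes the dense formulation more robust than a single global offset.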

Results:

On MS COCO test-dev, D2Det reaches 45.4 AP (ResNet101) and 50.1 AP with multi-scale; on instance segmentation it obtains a mask AP of 40.2.

Harmonizing Transferability and Discriminability for Adapting Object Detectors

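The paper reconciles and balances the conflict between feature transferability and discriminability in conventional unsupervised adversarial domain adaptation. It regularizes adversarial domain learning by calibrating the transferability and discriminability of features at different levels, which yields fine-grained cross-domain feature alignment, and proposes the Hierarchical Transferability Calibration Network (HTCN).

Unsupervised domain adaptation (UDA) has two main families of methods: reducing the discrepancy between statistical distributions, and adversarially learning domain-invariant features. As I understand it, the problem the authors address is that, while learning features that are invariant across the two domains, the discriminability between instances in an image can interfere and turn into negative transfer.

The network has three main modules:

  • Local feature masks: they calibrate the local transferability to provide semantic guidance for the following discriminative pattern alignment. Having observed that some local regions of an image are more descriptive and dominant than others, the authors compute local feature masks in both domains from shallow features to approximately guide semantic consistency, which further enhances local discriminability. After alignment, the mask can be seen as an attention-like module that captures transferable regions in an unsupervised way.

    A pixel-wise local domain discriminator generates the feature masks over the height and width of the feature map. The pixel-wise adversarial training loss is defined as:

    The mask is defined from the output of this discriminator and the uncertainty of that output. After the weights are redistributed in this way through interpolation, the local features become:

    The loss produced when this part is fed into the next discriminator is therefore:

  • Importance Weighted Adversarial Training with input Interpolation (IWAT-I): strengthens the global discriminability (calibrates the global transferability) by re-weighting the interpolated image-level features. (Motivation: not all samples are equally transferable, especially after the interpolation operation. Without interpolation the model becomes source-biased.)

    Weights are reassigned according to cross-domain similarity: the higher the similarity, the more important the sample is for learning.

    The output is the same as in the previous module, and each output has an associated uncertainty.

    The weight of each image is defined from this uncertainty (if the uncertainty is high, i.e. the sample is hard to tell apart, it deserves more attention during learning). This gives the re-weighted input, which is then fed into the discriminator, and the adversarial loss is:

  • Context-aware Instance-Level Alignment (CILA) module: enhances the local discriminability (calibrates the local transferability) by capturing the underlying complementary effect between the instance-level feature and the global context information for the instance-level feature alignment. (The context vector is aggregated from the lower layers, which are relatively invariant, i.e. transferable, across domains.)

    Directly concatenating the context features and the instance-level features treats them as independent and ignores that they can complement each other, so a non-linear fusion strategy is proposed: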

    $$
    f_{fus}=\frac{1}{\sqrt{d}}(R_1f_c)\odot(R_2f_{ins}),\qquad f_c=[f_c^1, f_c^2, f_c^3]
    $$

    $$
    L_{ins}= -\frac{1}{N_s}\sum^{N_s}_{i=1}\sum_{i,j}\log(D_{ins}(f_{fus}^{i,j})_s)
    -\frac{1}{N_t}\sum^{N_t}_{i=1}\sum_{i,j}\log(1-D_{ins}(f_{fus}^{i,j})_t)
    $$
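A minimal NumPy sketch of two ingredients described above: an uncertainty-based importance weight and the non-linear fusion $f_{fus}$. The fusion follows the formula in these notes, with $R_1$ and $R_2$ taken as random projection matrices. The uncertainty is assumed here to be the binary entropy of the discriminator output, and the weight `1 + entropy` is an assumption made for illustration; the function names are hypothetical.

```python
import numpy as np

def binary_entropy(d, eps=1e-12):
    """Entropy of a domain discriminator output d in (0, 1)."""
    d = np.clip(d, eps, 1.0 - eps)
    return -d * np.log(d) - (1.0 - d) * np.log(1.0 - d)

def importance_weight(d):
    """Assumed weighting: samples the discriminator cannot tell apart
    (high entropy) receive a larger weight."""
    return 1.0 + binary_entropy(d)

def fuse(f_c, f_ins, R1, R2):
    """f_fus = (R1 f_c) * (R2 f_ins) / sqrt(d), with * the Hadamard product."""
    d = R1.shape[0]
    return (R1 @ f_c) * (R2 @ f_ins) / np.sqrt(d)

rng = np.random.default_rng(0)
# context vector f_c = [f_c^1, f_c^2, f_c^3] aggregated from three lower layers
f_c = np.concatenate([rng.normal(size=8) for _ in range(3)])
f_ins = rng.normal(size=16)                # instance-level feature
R1 = rng.normal(size=(32, f_c.size))       # random projections to a shared
R2 = rng.normal(size=(32, f_ins.size))     # d = 32 dimensional space
f_fus = fuse(f_c, f_ins, R1, R2)
```

Unlike concatenation, the element-wise product makes every fused dimension depend on both the context and the instance feature, which is the complementary effect the module is after.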

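The total training loss is

In the upper bound on the target-domain error, the three modules reduce the distance between the source and target domains, and the multi-level transfer learning of features reduces the constant term.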

3D object detection

Learning Depth-Guided Convolutions for Monocular 3D Object Detection

Depth-guided Dynamic-Depthwise-Dilated local convolutional network (LCN)

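The rough idea of this paper is to first obtain a depth representation from a monocular RGB image, then convert the depth representation into a pseudo-LiDAR representation, and finally apply a 3D point-cloud object detection method.

The pseudo-LiDAR part puzzles me a little. If the LiDAR-like information is derived from the image, why does using pseudo-LiDAR work better than using the depth estimate directly? Is the LiDAR conversion itself better, or does converting depth into LiDAR form bring in more information? I think my understanding of neural networks is still lacking: it is not true that a network can roughly learn anything given a pile of data and an architecture. There are two function mappings here, and solving them separately is easier and faster to learn. It is like a learning problem I struggled with for a long time recently, where the data had to be transformed to another coordinate system before being fed to the network. Changing the coordinate system is only a matrix multiplication, yet the network learns it poorly, possibly because the corresponding axes undergo a further transformation. The paper says that good performance is possible even when the depth map is not very accurate. My impression is that the depth map acts as guidance, a bridge from 2D to 3D, rather than being converted directly, and a generative network then obtains higher-level information.

Overall, RGB-based methods lose spatial information, so it is taken from the depth map to imitate LiDAR information (LiDAR information is presumably a fusion of depth information, at a higher level). Spatial information alone loses semantics, so the RGB information is kept through feature extraction and the two are combined, which gives more complete 3D information.

  • To implement the depth-wise part, the feature network is followed by a shift pooling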

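    def _get_p(self, offset, dtype):
        # offset has shape (b, 2N, h, w): an (x, y) offset for each of the N kernel points
        N, h, w = offset.size(1) // 2, offset.size(2), offset.size(3)
        # kernel-relative positions, shape (1, 2N, 1, 1)
        p_n = self._get_p_n(N, dtype)
        # base grid positions, shape (1, 2N, h, w)
        p_0 = self._get_p_0(h, w, N, dtype)
        # final sampling positions: base grid + kernel position + learned offset
        p = p_0 + p_n + offset
        return p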
<img src="CVPR-2020/Ding_2.png" style="zoom:50%;" />
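  • To make the operation depth-wise, so that different kernels can use different functions, an adaptive dilation rate is learned for each filter, i.e. an adaptive function is learned so that receptive fields of different sizes can be obtained. The function consists of three layers (d denotes our maximum dilation rate):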

    weight = self.adaptive_layers(x).reshape(-1, 512, 1, 3)

    weight = self.adaptive_softmax(weight)

    • an AdaptiveMaxPool2d layer with the output size of d × d and channel number c;
    • a convolutional layer with a kernel size of d × d and channel number d × c
    • a reshape and softmax layer to generate d weights with a sum of 1 for each filter.

    The new feature map can then be written as:

    x = dynamic_local_filtering(x, depth, dilated=1) * weight[:, :, :, 0:1] \
        + dynamic_local_filtering(x, depth, dilated=2) * weight[:, :, :, 1:2] \
        + dynamic_local_filtering(x, depth, dilated=3) * weight[:, :, :, 2:3]

This addresses the **scale insensitivity and meaningless local structure** of 2D convolution, while also **making good use of the RGB information**.
  • 2D-3D detection head

    • overall loss contains a classification loss (standard cross-entropy (CE) loss), a 2D regression loss, a 3D regression loss and a 2D-3D corner loss (SmoothL1 regression losses). ( and denote the classification score of target class and the focusing parameter, respectively)
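The adaptive dilation weighting described above (softmax weights that sum to 1 for each filter, used to mix the outputs of the different dilation rates) can be sketched in NumPy as follows. The function names and tensor layout are assumptions for illustration; the per-dilation branch outputs stand in for the results of `dynamic_local_filtering`, which is not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def mix_dilation_branches(branches, logits):
    """branches: list of d arrays of shape (B, C, H, W), one per dilation rate.
    logits:   array of shape (B, C, d), the raw scores for each filter.
    Returns the per-filter weighted sum of the branches, shape (B, C, H, W)."""
    weight = softmax(logits, axis=-1)          # (B, C, d), sums to 1 per filter
    stacked = np.stack(branches, axis=-1)      # (B, C, H, W, d)
    return (stacked * weight[:, :, None, None, :]).sum(axis=-1)

# three branches with constant values 1, 2 and 3 (dilation rates 1, 2, 3)
branches = [np.full((1, 2, 3, 3), v) for v in (1.0, 2.0, 3.0)]
uniform = mix_dilation_branches(branches, np.zeros((1, 2, 3)))
peaked = mix_dilation_branches(branches, np.array([[[50.0, 0.0, 0.0],
                                                    [0.0, 0.0, 50.0]]]))
```

With equal scores the result is the plain average of the branches; with a strongly peaked score each filter effectively picks a single dilation rate, so different channels can end up with different receptive fields.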

End-to-End Pseudo-LiDAR for Image-Based 3D Object Detection

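Judging from the GitHub repository, this paper builds on earlier work, which is worth reading later.

The main contribution is the Change of Representation (CoR) modules, which put pseudo-LiDAR (PL) and 3D depth estimation in one framework so that training is end-to-end. (The motivation for end-to-end training is twofold: LiDAR-based 3D detectors depend heavily on point accuracy and do poorly on sparse faraway points; and about 90% of the points in KITTI images are background, so keeping the depth estimator fixed for the detector hurts.) In terms of performance, it is state of the art among PL methods and even among image-based methods, which means it does better than running 3D detection directly on the depth estimate.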


Two streams in terms of point cloud processing:

  • directly operating on the unordered point clouds in 3D, mostly by applying PointNet and/or applying 3D convolution over neighbors;
  • operating on quantized 3D/4D tensor data, which are generated from discretizing the locations of point clouds into some fixed grids.

CoR focuses on two aspects: subsampling and quantization (soft quantization is used here to overcome the inherent non-differentiability).

  • quantization: the 3D information is discretized into fixed bins, and only the occupation or densities are recorded in the resulting tensor (?).

    • 2D and 3D convolutions can be directly applied to extract features from the tensor
    • makes the back-propagation difficult

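    With the fixed centre of each bin given, the resulting tensor is defined as:

    Intuitively, if a point p falls inside bin m the entry is 1, otherwise it is 0. Back-propagation is hard, however. Take the partial derivative of the loss with respect to T: if it is positive, T should decrease, i.e. fewer points should fall in m; otherwise more points should fall in m. How to bring about that change is the problem.

    The forward pass is therefore changed: the paper uses soft quantization, with an RBF (radial basis function) providing the weights.

    Using the set of points that fall in bin m and the neighbouring bins of m, the gradient can reach the points directly during back-propagation, and any error can also be shared with the surrounding bins, which is more effective.

    The figure below makes this more intuitive. The real point cloud is in the first panel; it is first discretized into fixed bins, which is hard quantization. Soft quantization moves wrongly predicted points towards the true values by pushing and pulling. If a bin should not contain a point, the point is pushed to the neighbouring bins; if a bin lacks a point, one is pulled in from a neighbouring bin. A positive gradient means push; a negative gradient means pull.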

  • subsampling
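The difference between hard and soft quantization discussed above can be sketched on a small 2D grid. This is a simplified illustration, not the paper's exact definition: every point contributes to every bin through an RBF of its distance to the bin centre, whereas the paper restricts the contribution to a bin and its neighbours. The function names, the grid layout and the value of `sigma` are assumptions.

```python
import numpy as np

def hard_quantize(points, grid_min, cell, shape):
    """Occupancy tensor: 1 for every bin that contains at least one point."""
    idx = np.floor((points - grid_min) / cell).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(shape)), axis=1)
    T = np.zeros(shape)
    T[tuple(idx[inside].T)] = 1.0
    return T

def soft_quantize(points, grid_min, cell, shape, sigma=0.5):
    """Soft occupancy: mean RBF response of all points at each bin centre.
    The value changes smoothly when a point moves, so gradients can flow."""
    axes = [grid_min[d] + (np.arange(shape[d]) + 0.5) * cell for d in range(2)]
    cx, cy = np.meshgrid(*axes, indexing="ij")
    centres = np.stack([cx, cy], axis=-1)                      # (H, W, 2)
    d2 = ((centres[:, :, None, :] - points[None, None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma ** 2).mean(axis=-1)              # (H, W)

grid_min = np.array([0.0, 0.0])
p_before = np.array([[0.6, 0.6]])
p_after = np.array([[0.9, 0.6]])     # same bin, but closer to the next bin

hard_before = hard_quantize(p_before, grid_min, 1.0, (3, 3))
hard_after = hard_quantize(p_after, grid_min, 1.0, (3, 3))
soft_before = soft_quantize(p_before, grid_min, 1.0, (3, 3))
soft_after = soft_quantize(p_after, grid_min, 1.0, (3, 3))
```

Moving the point inside its bin leaves the hard tensor unchanged, so the detector loss gives the depth network no signal; the soft tensor changes continuously, with the neighbouring bin's value rising as the point approaches it.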

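Network structure:

  • First train the depth estimation network, the SDN model
  • Freeze the depth estimation network and train the 3D object detector; two LiDAR-based methods are used: PIXOR (voxel-based, with quantization) and PointRCNN (P-RCNN, point-cloud-based)
  • Finally, train the two jointly with balanced loss weights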
    def train_one_epoch(depth_model, model, model_fn, optimizer, batch):
        # put both networks in training mode
        model.train()
        depth_model.net.train()
        optimizer.zero_grad()
        depth_model.optimizer.zero_grad()
        # forward pass: the depth network returns its loss and the pseudo-LiDAR point cloud
        depth_loss, point_cloud = depth_model.train(batch)
        loss, tb_dict, disp_dict = model_fn(model, point_cloud)
        disp_dict['depth_loss'] = depth_loss.item()
        # joint loss: detection loss scaled by 0.01 plus the depth loss
        loss = loss * 0.01 + depth_loss
        # backward pass and parameter update for both networks
        loss.backward()
        optimizer.step()
        depth_model.optimizer.step()

loss:

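Predecessors:

  1. Pseudo-LiDAR from Visual Depth Estimation: Bridging the Gap in 3D Object Detection for Autonomous Driving (CVPR 2019)

    The paper says that earlier image-based methods mainly worked from the front camera view, with poor results. It finds that a bird's-eye view obtained from stereo cameras markedly improves detection range and accuracy, and argues that the representation of the data, rather than its quality, accounts for most of the accuracy gap. It raises detection accuracy for objects within 30 m from 22% to 74%.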


    The error of stereo-based 3D depth estimation grows quadratically with the depth of an object, whereas for Time-of-Flight (ToF) approaches, such as LiDAR, this relationship is approximately linear.

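    To this end, we propose a two-step approach by first estimating the dense pixel depth from stereo (or even monocular) imagery and then back-projecting pixels into a 3D point cloud. By viewing this representation as pseudo-LiDAR signal, we can then apply any existing LiDAR-based 3D object detection algorithm. In other words, the depth model and the 3D object detection model, which the paper above trains jointly, are separate here: two steps.


    [Note that the arrow here is one-directional: the two networks do not affect each other.]

    • Depth estimation (the paper uses the pyramid stereo matching network, PSMNet)

      Given a pair of cameras with a horizontal offset (i.e., baseline) b, the network predicts a disparity map Y. It treats the left image as the reference and records in Y the horizontal disparity to the right image for each pixel. In other words, the two cameras are separated horizontally by the baseline b, and Y stores, for each pixel of the left image, its horizontal shift in the right image. With the horizontal focal length $f_U$ of the left camera known, the depth map is:

      $$
      D(u,v)=\frac{f_U\times b}{Y(u,v)}
      $$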

    • Pseudo-LiDAR generation

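      Derive the 3D location (x, y, z) of each pixel (u, v) from its depth:

      $$
      z=D(u,v),\qquad x=\frac{(u-c_U)\times z}{f_U},\qquad y=\frac{(v-c_V)\times z}{f_V}
      $$

      where $(c_U, c_V)$ is the pixel location corresponding to the camera center and $f_V$ is the vertical focal length.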

      Since real LiDAR signals only reside in a certain range of heights, we disregard pseudo-LiDAR points beyond that range. As most objects of interest (e.g., cars and pedestrians) do not exceed this height range there is little information loss.

      """
      code adapted from https://zhuanlan.zhihu.com/p/91479831
      """
      from PIL import Image
      import numpy as np
      import pptk
      import cv2

      # camera intrinsics
      fu = 2301.3147  # horizontal focal length
      fv = 2301.3147  # vertical focal length
      cu = 1489.8536  # principal point, horizontal pixel coordinate
      cv = 479.1750   # principal point, vertical pixel coordinate

      disparity_map = np.array(Image.open('171206_034625454_Camera_5.png'), dtype=np.float64)
      disparity_map = disparity_map / 200  # divide by 200 to recover the original values
      disparity_map[disparity_map == 0] = 1e-3  # avoid division by zero
      depth_map = fu / disparity_map  # fu is the horizontal focal length, baseline b == 1
      cv2.imwrite('depth_map.jpg', depth_map)

      # slow version: v indexes rows (vertical), u indexes columns (horizontal)
      rows, cols = depth_map.shape
      point_cloud = np.empty((rows, cols, 3))
      for v in range(rows):
          for u in range(cols):
              z = depth_map[v, u]
              point_cloud[v, u] = ((u - cu) * z / fu, (v - cv) * z / fv, z)

      # fast version: the same back-projection, vectorised
      u, v = np.meshgrid(np.arange(cols), np.arange(rows))
      x = (u - cu) * depth_map / fu
      y = (v - cv) * depth_map / fv
      point_cloud = np.stack([x, y, depth_map], axis=-1).reshape(-1, 3)

      # visualise
      viewer = pptk.viewer(point_cloud)  # the pptk library can visualise point clouds
    • 3D object detection (the paper uses AVOD and frustum PointNet)

      This example goes to show how some operations the convolutional network might perform could border on the absurd.


    Future work:

    • higher resolution stereo images
    • in this paper we did not focus on real-time image processing and the classification of all objects in one image takes on the order of 1s.
    • it is likely that future work could improve the state-of-the-art in 3D object detection through sensor fusion of LiDAR and pseudo-LiDAR.
  2. Pseudo-LiDAR++: Accurate Depth for 3D Object Detection in Autonomous Driving (ICLR 2020). An improvement on the paper above: it modifies the depth estimation network and uses a 4-beam LiDAR (which is cheaper) for correction.

    more aligned with accurate depth estimation of faraway objects.

    Two main contributions:

    1. Identify the disparity estimation as a main source of error for stereo-based systems and propose a novel approach to learn depth directly end-to-end instead of through disparity estimates.
    2. Advocate that one should not use expensive LiDAR sensors to learn the local structure and depth of objects. Instead one can use commodity stereo cameras for the former and a cheap sparse LiDAR to correct the systematic bias in the resulting depth estimates.

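    There are really two directions. The first is to modify the depth estimation network, because experiments show that, for a given disparity error, the resulting depth error grows quadratically with depth. The formula shows this as well:

    So the depth error, rather than the disparity error, is used directly as the loss function; the 3D convolutions are likewise applied on depth.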
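    A short derivation of the quadratic relation, assuming the standard stereo model with depth $D$, disparity $Y$, horizontal focal length $f_U$ and baseline $b$ (the symbols are chosen here for illustration):

    $$
    D=\frac{f_U\, b}{Y}
    \;\Rightarrow\;
    \frac{\partial D}{\partial Y}=-\frac{f_U\, b}{Y^2}=-\frac{D^2}{f_U\, b}
    \;\Rightarrow\;
    |\Delta D|\approx\frac{D^2}{f_U\, b}\,|\Delta Y|
    $$

    The same disparity error $\Delta Y$ therefore produces a depth error that scales with $D^2$: an object twice as far away gets roughly four times the depth error.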

    the depth cost volume , in which will encode features describing how likely the depth of pixel is .

    $$
    W = \arg\min_W \|Z - WZ\|^2_2 \quad \text{s.t. } W\mathbf{1} = \mathbf{1} \text{ and } W_{ij} = 0 \text{ if } j \notin N_i
    $$

    The two steps described in the main paper can be easily turned into two (sparse) linear systems and then solved by using Lagrange multipliers. For the first step, we solve a problem that is slightly modified from that described in the main paper (for more accurate reconstruction). For the second step, we use the Conjugate Gradient (CG) to iteratively solve the sparse linear system.
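The first step above, finding weights that reconstruct each point from its neighbours under the constraint that the weights sum to 1, has a well-known closed form through a Lagrange multiplier (the same construction as in locally linear embedding). The sketch below handles a single point with dense NumPy operations; it is an illustration under that assumption, not the sparse solver the paper uses, and the function name and the regularisation term are hypothetical.

```python
import numpy as np

def reconstruction_weights(z_i, neighbours, reg=1e-3):
    """Weights w minimising ||z_i - sum_j w_j * neighbours[j]||^2 with sum(w) = 1.

    z_i:        (dim,) point to reconstruct
    neighbours: (k, dim) its k nearest neighbours
    """
    k = len(neighbours)
    diff = neighbours - z_i                      # (k, dim)
    G = diff @ diff.T                            # local Gram matrix, (k, k)
    G = G + reg * np.trace(G) / k * np.eye(k)    # regularise: G can be singular
    w = np.linalg.solve(G, np.ones(k))           # Lagrange solution up to scale
    return w / w.sum()                           # enforce the sum-to-one constraint

neighbours = np.array([[0.0, 0.0, 0.0],
                       [1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])
z_i = neighbours.mean(axis=0)
w = reconstruction_weights(z_i, neighbours)
reconstructed = w @ neighbours
```

Stacking these weights row by row gives the sparse matrix W of the formula; the second step then keeps W fixed and solves for corrected depths that agree with the sparse LiDAR measurements.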

MonoPair: Monocular 3D Object Detection Using Pairwise Spatial Relationships

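The method exploits the spatial constraints between a partially occluded object and its neighbouring objects. Inspired by CenterNet, objects are treated as points; the proposed detector predicts, with uncertainty, the 3D distance between an object and its neighbours, and the predictions are then jointly optimized by nonlinear least squares. For runtime efficiency, the one-stage uncertainty-aware prediction structure (learned in an unsupervised manner) and the post-optimization module are integrated.

Three modules: 2D box detection, 3D box detection with uncertainty, and 3D pair distance with uncertainty; 11 branches (the blue parts). Aleatoric uncertainty makes the loss more robust to noisy input in a regression task.

  • 2D detection

    • heatmap (W x H x C): predicts the object location
    • offset (W x H x 2): predicts the offset vector from the located keypoint to the bounding box center. (The keypoint is a local maximum of the heatmap; adding the offset gives the bbox center, i.e. this is the offset between the 2D center and the keypoint.) L1 loss
    • dimension (W x H x 2): the size of the bounding box. L1 loss

    Object Location Error:

  • 3D detection

    • depth (W x H x 1): z in panel (c) of the figure, the depth from the car to the camera. Because regressing depth directly is hard, the inverse depth is used, obtained by an inverse sigmoid transformation. L1 loss

    • offset (W x H x 2): shown in panel (b) of the figure, the offset between the 3D center and the keypoint. (The resulting center can be converted with the camera intrinsic matrix K to the object center coordinates in the world coordinate system.) L1 loss

    • dimension (W x H x 3): regresses w, h, l. L1 loss

    • depth (W x H x 1):

    • offset (W x H x 1):

    • rotation (W x H x 8): the object's local orientation (yaw) rather than the global orientation in the camera coordinate system, i.e. the relative rotation of the object to the camera viewing angle (arctan(positions[1]/position[0])). The orientation is represented using eight scalars, and the orientation branch is trained by the MultiBin loss. (Eight? The angles of the 8 corners? Or 3 values each, then expressed with x and z?)

  • Pair constraint

    For a pair of objects, one midpoint is defined on the line joining them in 3D, and a corresponding midpoint is defined on the line joining them on the feature map. (The latter is not the projection of the former.)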

    N effective objects, M pair constraints, The proposed spatial constraint optimization is formulated as a nonlinear least square problem as

    e is the Pairwise Constraint Error vector, .

    W is the weight matrix for different errors. W is a diagonal matrix with dimension . The weight of the error is higher when the uncertainty is lower, which means we have more confidence in the predicted output.

    For each vertex , there are three variables , which are the projected center of the 3D bounding box on the feature map and the depth .

    • distance (W x H x 3): The 3D absolute distances along the view point direction are taken as the regression target, which is the distance branch of the pair constraint output.

      For training, can be easily collected through the groundtruth 3D object centers from the training data as:

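      $$
      R(\gamma_{ij})=\left[\begin{matrix}\cos(\gamma_{ij})&0&-\sin(\gamma_{ij})\\0&1&0\\\sin(\gamma_{ij})&0&\cos(\gamma_{ij})\end{matrix}\right]
      $$

      $k_{ij}^w$: the reason is that, in the camera coordinate system, the distance between a pair changes when it is viewed from different angles, as shown in the figure:

    • distance (W x H x 1):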
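A small sketch of how the pair distance target could be computed from two ground-truth 3D centres, using the rotation $R(\gamma_{ij})$ given above. Two points are assumptions made for this illustration: the viewing angle $\gamma_{ij}$ is taken as `atan2(x, z)` of the pair midpoint in camera coordinates, and the target is the element-wise absolute value of the rotated difference vector. The function names are hypothetical.

```python
import numpy as np

def view_rotation(gamma):
    """Rotation about the camera y axis, as in R(gamma_ij)."""
    c, s = np.cos(gamma), np.sin(gamma)
    return np.array([[c, 0.0, -s],
                     [0.0, 1.0, 0.0],
                     [s, 0.0, c]])

def pair_distance_target(t_i, t_j):
    """Distance between two 3D centres expressed in the local frame whose
    z axis points from the camera towards the midpoint of the pair."""
    t_i, t_j = np.asarray(t_i, dtype=float), np.asarray(t_j, dtype=float)
    mid = (t_i + t_j) / 2.0
    gamma = np.arctan2(mid[0], mid[2])        # assumed viewing angle of the midpoint
    k_w = t_i - t_j                           # difference in camera coordinates
    k_v = np.abs(view_rotation(gamma) @ k_w)  # absolute distance in the view frame
    return k_v, gamma

t_i, t_j = (2.0, 1.0, 10.0), (4.0, 1.0, 12.0)
k_v, gamma = pair_distance_target(t_i, t_j)
mid_in_view = view_rotation(gamma) @ ((np.array(t_i) + np.array(t_j)) / 2.0)

# a pair centred on the optical axis needs no rotation at all
k_axis, gamma_axis = pair_distance_target((-1.0, 0.0, 10.0), (1.0, 0.0, 10.0))
```

Rotating into the view frame makes the target depend only on how the pair looks from the camera's line of sight, so the same physical arrangement yields the same target wherever it appears in the image.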

Uncertainty:

Following the heteroscedastic aleatoric uncertainty setup, we represent a regression task with L1 loss as
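A common form of this loss, obtained from the negative log-likelihood of a Laplace distribution, is $L = \frac{\sqrt{2}}{\sigma}\,|y - \hat{y}| + \log\sigma$, where the network predicts both the value $\hat{y}$ and the uncertainty $\sigma$. The sketch below assumes that form and predicts $\log\sigma$ for numerical stability; the function name is hypothetical.

```python
import numpy as np

def laplace_uncertainty_l1(y_true, y_pred, log_sigma):
    """L1 regression loss with a predicted (heteroscedastic) uncertainty.

    A large sigma down-weights the residual but is penalised by log(sigma),
    so the network cannot simply declare everything uncertain."""
    sigma = np.exp(log_sigma)
    return np.sqrt(2.0) * np.abs(y_true - y_pred) / sigma + log_sigma

residual = 2.0
best_log_sigma = np.log(np.sqrt(2.0) * residual)  # minimiser for a fixed residual
loss_best = laplace_uncertainty_l1(residual, 0.0, best_log_sigma)
loss_low = laplace_uncertainty_l1(residual, 0.0, best_log_sigma - 0.5)
loss_high = laplace_uncertainty_l1(residual, 0.0, best_log_sigma + 0.5)
```

For a fixed residual the loss is minimised at $\sigma = \sqrt{2}\,|y - \hat{y}|$, which is why the learned uncertainty ends up tracking the size of the error.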

$$
u_{post}=|d-d^{\rightleftharpoons}|
$$

$$
\mu (d)=\frac 1 N \sum_{i=1}^N d_i\\
u_{drop/boot/snap}=\sigma^2(d)=\frac 1 N \sum_{i=1}^N (d_i-\mu(d))^2
$$

$$
\lambda_t=\frac{\lambda_0}{2}\cdot\left(\cos \left(\frac{\pi\cdot mod(t-1, \lceil\frac T C\rceil)}{\lceil\frac T C\rceil}\right)+1\right)
$$
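The three formulas above translate directly into a few lines of NumPy. In this sketch the depth maps are plain arrays, $d^{\rightleftharpoons}$ is taken to be the prediction on the horizontally flipped image, flipped back before comparison, and the $N$ predictions come from dropout samples, bootstrapped models or snapshots. The function names are hypothetical.

```python
import math
import numpy as np

def post_uncertainty(depth, depth_of_flipped_input):
    """u_post: difference between the prediction and the re-flipped
    prediction obtained from the horizontally flipped image."""
    return np.abs(depth - depth_of_flipped_input[:, ::-1])

def empirical_uncertainty(depths):
    """Mean and variance over N predictions of shape (N, H, W)."""
    depths = np.asarray(depths, dtype=float)
    mean = depths.mean(axis=0)
    variance = ((depths - mean) ** 2).mean(axis=0)
    return mean, variance

def snapshot_lr(t, total_steps, cycles, lr0):
    """Cyclic cosine learning rate lambda_t with C cycles over T steps."""
    period = math.ceil(total_steps / cycles)
    return lr0 / 2.0 * (math.cos(math.pi * ((t - 1) % period) / period) + 1.0)

rng = np.random.default_rng(0)
depths = rng.random((5, 4, 6)) + 1.0          # five predictions of a 4 x 6 depth map
mean, variance = empirical_uncertainty(depths)
u_post = post_uncertainty(depths[0], depths[0][:, ::-1])
lr_start = snapshot_lr(1, total_steps=100, cycles=5, lr0=0.1)
lr_restart = snapshot_lr(21, total_steps=100, cycles=5, lr0=0.1)
lr_mid = snapshot_lr(11, total_steps=100, cycles=5, lr0=0.1)
```

The learning rate restarts at $\lambda_0$ at the beginning of every cycle; the model saved at the end of each cycle is one snapshot of the ensemble.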
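  • Learning a model that predicts the uncertainty

    • Learned Reprojection (Repr): the reprojection relation is hard to use on unlabeled data, but it can be reconstructed (imitating the case where it is known).

    • Log-Likelihood Maximization (Log): estimates the mean and variance of a distribution; with an L1 loss the distribution is modeled as Laplacian, with an L2 loss as Gaussian. It accounts for the uncertainty of depth and pose at the same time.

      w: network weights; the log term prevents the trivial solution by discouraging infinite predictions for every pixel.

      Another paper defines the loss as:

    • Self-Teaching (Self): decouples the depth and pose uncertainties. A network T is first trained in a self-supervised way with the conventional reprojection error as its loss; a network S is then trained with the log-likelihood of the depth values as its loss.

  • Bayesian estimation (combines empirical estimation with predictive uncertainty models): marginalizes over all possible w instead of making a single point estimate

  • Combination: e.g. taking the best of the predictive and the empirical methods (Boot+Self)

Experiments:

  • Baseline model: Monodepth2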
  • Depth metrics
    • absolute relative error (Abs Rel)
    • root mean square error (RMSE)
    • the amount of inliers (δ < 1.25)
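  • Uncertainty metrics: all pixels are sorted by uncertainty in descending order, and the curve obtained by removing the most uncertain pixels is compared with the curve obtained by removing the pixels with the largest error;
    • AUSE is the Area Under the Sparsification Error; lower is better. It is computed by subtracting the oracle (ideal) curve from the uncertainty curve, and mainly measures how well the uncertainty tracks the actual error
    • AURG is the Area Under the Random Gain; higher is better. The larger the gap, the more effective the uncertainty estimate
  • Monocular results:
    • Depth accuracy: most methods lower it or leave it unchanged, but the Self method improves depth
    • Uncertainty: the empirical methods beat Post, the predictive models do better still, and the Boot+Self combination is best

What impresses me is how many experiments the authors ran, including many combinations. They organize the whole family of uncertainty estimation methods, evaluate the effect of uncertainty estimation on depth estimation, and conclude that uncertainty estimation improves depth estimation. It feels like they found a fresh angle for a paper.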
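The two uncertainty metrics can be sketched with a generic per-pixel error. This is an illustration under assumptions, not the evaluation code of the paper: the number of sparsification steps, the use of the mean error as the random baseline, and the function names are all choices made for the example.

```python
import numpy as np

def sparsification_curve(errors, ranking, steps=20):
    """Mean error of the pixels that remain after removing the given
    fraction of pixels with the highest ranking value."""
    order = np.argsort(-ranking)              # highest ranking value first
    sorted_err = errors[order]
    n = len(errors)
    fractions = np.arange(steps) / steps      # 0, 1/steps, ..., (steps-1)/steps
    curve = np.array([sorted_err[int(f * n):].mean() for f in fractions])
    return fractions, curve

def area(x, y):
    """Trapezoidal area under the curve y(x)."""
    return float(np.sum((y[1:] + y[:-1]) / 2.0 * np.diff(x)))

def ause_aurg(errors, uncertainty, steps=20):
    fr, unc_curve = sparsification_curve(errors, uncertainty, steps)
    _, oracle_curve = sparsification_curve(errors, errors, steps)
    random_curve = np.full(steps, errors.mean())   # random removal keeps the mean
    ause = area(fr, unc_curve - oracle_curve)      # lower is better
    aurg = area(fr, random_curve - unc_curve)      # higher is better
    return ause, aurg

rng = np.random.default_rng(0)
errors = rng.random(1000)
ause_perfect, aurg_perfect = ause_aurg(errors, errors)    # uncertainty = error
ause_worst, aurg_worst = ause_aurg(errors, -errors)       # inverted ranking
```

An uncertainty that ranks pixels exactly like the true error reproduces the oracle curve, so its AUSE is zero and its AURG is as large as it can be; an inverted ranking is worse than removing pixels at random, which shows up as a negative AURG.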

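6D object pose estimation

Hand pose estimation

Saliency detection

Optimization

Denoising

Deblurring

Dehazing

Feature point detection and description

Visual question answering (VQA)

Video question answering (VideoQA)

Vision-and-language navigation

Video compression

Video frame interpolation

Style transfer

Trajectory prediction

Motion prediction

Optical flow estimation

Image retrieval

Virtual try-on

HDR

Adversarial examples

3D reconstruction

Depth completion

Semantic scene completion

Image/video captioning

Wireframe parsing

Datasets

Other

Please credit the source when reposting. Thank you.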
