3D Object Detection

This is my blog.

Obsessing over why my 3D boxes are such a mess (there are just so, so many of them), especially all the redundant ones, while my senior's results look so normal, so clean.

Papers

MonoGRNet: A Geometric Reasoning Network for Monocular 3D Object Localization——2019

AAAI 2019 oral

Significance: why study 3D object detection, and why care about depth? In an image, conventional object localization or detection estimates a 2D bounding box that encloses the visible part of the object on the image plane. Such detections, however, provide no geometric understanding of the scene in the real 3D world, which limits their value for many applications. In other words, I know there is an object in some direction, but I do not know how far away it is or how large it really is, and even that direction may be inaccurate. 3D object detection is therefore necessary, and 3D detection with depth is of great importance for autonomous driving.

Summary: the paper decouples the network into four progressive subtasks: 2D object detection, instance-level depth estimation (IDE), 3D location estimation, and local corner regression. Guided by the detected 2D bounding box, the network first estimates the depth and the 2D projection of the 3D box center to obtain the global 3D location, and then regresses the individual corner coordinates in the local context. The final 3D bounding box is optimized end to end in the global context based on the estimated 3D location and the local corners.

Key points:

  • The 3D localization problem is decoupled into several progressive subtasks: 2D object detection, instance-level depth estimation (IDE), 3D location estimation (the 2D center has two coordinates, the 3D center has three), and local corner regression (eight corners in total). Each subtask learns from the monocular RGB image and, through geometric reasoning, localizes the object's amodal 3D bounding box (ABBox-3D) both on the observed 2D projection plane and along the unobserved depth dimension, i.e., the object's 3D position is determined from 2D imagery alone.

  • Instance depth estimation (IDE) is proposed: the depth of the target is predicted with (sparse) supervision at the 3D bounding-box center, without depending on the object's scale or its 2D position. Earlier approaches are pixel-to-pixel, and since background dominates most of an image, a pixel-level loss averaged over the whole map is actually inaccurate for the objects of interest. The IDE module exploits the large receptive field of deep feature maps to capture a coarse instance depth, and then jointly refines the IDE with higher-resolution early features (the deep layers extract coarse global information, which is corrected by the detailed, high-resolution local information from shallow layers, making the estimate more precise). This resembles CenterNet: regression anchored at the center point.

  • Global 3D location: a formula converts between the 2D and 3D center points (the 2D projection of the 3D center is not the same as the 2D box center). To recover both the horizontal and vertical position, the 2D projection of the 3D center is predicted first; combined with the IDE output, the projected center is then stretched back into real 3D space to obtain the final 3D object location (see the back-projection sketch after this list).

    Here the focal lengths along the X and Y axes and the coordinates of the principal point (the origin of the image coordinate system) come from the camera intrinsics.

    With them, the 2D projection can be lifted back into 3D.

    Similar to the IDE module, early features are used to regress a refinement of the projected center, from which the 3D position of the center is obtained.

  • Local corner regression: high-resolution information is used to regress the corner points of the local 3D box. As shown in Figure 2(c), the transformation from local coordinates to camera coordinates involves a rotation and a translation, which gives the global corner coordinates.

  • A unified network structure, trained end to end and optimized with a joint geometric loss function that minimizes the discrepancy of the 3D bounding box in the overall context.

    • 2D detection (softmax cross-entropy [CE] loss + masked L1 distance loss):

      Here Pr is the confidence indicating whether grid cell g belongs to any object; it is set to 1 if the distance from cell g to the nearest object b is below a threshold.

    • Instance depth estimation (L1 loss):

      The coarse term encourages the model to already be close to the ground truth at the coarse estimation stage.

    • 3D localization loss (L1 loss):

      Likewise, the coarse term encourages the model to already be close to the ground truth when learning the projected center.

    • Local corner loss (L1 loss):

    • Joint 3D loss:
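The projection formulas themselves did not survive in this post, so here is the standard pinhole back-projection that the global-3D-location step relies on, written in my own notation (the paper additionally refines the projected center with early features): given the predicted 2D projection $(u_c, v_c)$ of the 3D center and the instance depth $Z_c$,

$$
X_c = \frac{(u_c - c_x)\,Z_c}{f_x}, \qquad Y_c = \frac{(v_c - c_y)\,Z_c}{f_y}
$$

where $f_x$, $f_y$ are the focal lengths and $(c_x, c_y)$ is the principal point.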

Results: on the KITTI dataset, the network outperforms state-of-the-art monocular methods in 3D object localization while having the shortest inference time.

3D Bounding Box Estimation Using Deep Learning and Geometry——2017

The distinguishing points of this paper are the proposed geometric constraints and the idea of regressing the dimensions as a stable property.

In contrast to current techniques that only regress the 3D orientation of an object, our method first regresses relatively stable 3D object properties using a deep convolutional neural network and then combines these estimates with geometric constraints provided by a 2D object bounding box to produce a complete 3D bounding box. Earlier methods mainly cared about the object's orientation, whereas this paper adds some stable object properties: choose to regress the box dimensions D rather than translation T because the variance of the dimension estimate is typically smaller (e.g. cars tend to be roughly the same size).

It also exploits the geometric constraints between the 2D bbox and the 3D bbox to pin down the 9 DOF (three for translation, three for rotation, and three for box dimensions).

The overall structure uses the MultiBin idea: the paper proposes a MultiBin architecture for orientation estimation. The orientation angle is first discretized into n overlapping bins. For each bin, the CNN estimates both a confidence probability that the output angle lies inside the bin and the residual rotation correction that needs to be applied to the orientation of that bin's center ray in order to obtain the output angle. The residual rotation is represented by two numbers, the sine and the cosine of the angle; valid cosine and sine values are obtained by applying an L2 normalization layer on top of a 2-dimensional input. This results in 3 outputs for each bin.

```python
import numpy as np

def generate_bins(bins):
    # evenly spaced bin centers covering [0, 2*pi)
    angle_bins = np.zeros(bins)
    interval = 2 * np.pi / bins
    for i in range(1, bins):
        angle_bins[i] = i * interval
    angle_bins += interval / 2  # center of the bin
    return angle_bins
```
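A minimal sketch of how the local orientation could be decoded from the MultiBin outputs at inference time (the helper below is mine, not from the paper's code, and it assumes the residual is stored as [cos, sin] per bin, matching the loss code further down): pick the bin with the highest confidence and add the recovered residual angle to that bin's center.

```python
import numpy as np

def decode_multibin(angle_bins, bin_conf, bin_residual):
    # angle_bins:   (bins,)   bin centers from generate_bins
    # bin_conf:     (bins,)   confidence per bin
    # bin_residual: (bins, 2) regressed [cos(delta), sin(delta)] per bin
    best = int(np.argmax(bin_conf))
    cos_d, sin_d = bin_residual[best]
    delta = np.arctan2(sin_d, cos_d)
    return (angle_bins[best] + delta) % (2 * np.pi)  # wrap to [0, 2*pi)
```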
  • The first network output estimates the 3D object orientation using a novel hybrid discrete-continuous loss, which significantly outperforms the L2 loss.
  • The second output regresses the 3D object dimensions, which have relatively little variance compared to alternatives and can often be predicted for many object types. These estimates, combined with the geometric constraints (use the fact that the perspective projection of a 3D bounding box should fit tightly within its 2D detection window) on translation imposed by the 2D bounding box, enable us to recover a stable and accurate 3D object pose.
```python
import torch.nn as nn
import torch.nn.functional as F

class Model(nn.Module):
    # ...
    def forward(self, x):
        x = self.features(x)                              # 512 x 7 x 7
        x = x.view(-1, 512 * 7 * 7)
        orientation = self.orientation(x)
        orientation = orientation.view(-1, self.bins, 2)  # sin/cos residual per bin
        orientation = F.normalize(orientation, dim=2)     # project onto the unit circle
        confidence = self.confidence(x)
        dimension = self.dimension(x)
        return orientation, confidence, dimension
```

Loss for the MultiBin orientation is thus:
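The equation itself did not survive here; as I recall it from the paper (so treat the exact form as a hedged reconstruction), the MultiBin orientation loss combines a bin-confidence term with a localization term over the bins covering the ground-truth angle:

$$
L_{\theta} = L_{conf} + w \times L_{loc}, \qquad
L_{loc} = -\frac{1}{n_{\theta^*}} \sum_{i} \cos\!\left(\theta^* - c_i - \Delta\theta_i\right)
$$

where $c_i$ is the center angle of bin $i$ and $\Delta\theta_i$ the regressed residual. The snippets below are three re-implementations of the localization term.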

```python
import torch
import tensorflow as tf

# Variant 1: averages cos(theta_diff) over the bins covering the ground truth.
def OrientationLoss(orient, angleDiff, confGT):
    # orient    = [cos(delta), sin(delta)] per bin, shape = [batch, bins, 2]
    # angleDiff = GT - bin center,                  shape = [batch, bins]
    # confGT    = 1 for bins covering the GT angle, else 0
    batch, bins, _ = orient.size()
    cos_diff = torch.cos(angleDiff)
    sin_diff = torch.sin(angleDiff)
    cos_ori = orient[:, :, 0]
    sin_ori = orient[:, :, 1]
    mask1 = (confGT != 0)   # bins that cover the ground truth
    mask2 = (confGT == 0)
    count = torch.sum(mask1, dim=1)
    # cos(a - b) = cos(a)cos(b) + sin(a)sin(b)
    tmp = cos_diff * cos_ori + sin_diff * sin_ori
    tmp[mask2] = 0
    total = torch.sum(tmp, dim=1)
    count = count.type(torch.FloatTensor).cuda()
    total = total / count
    return -torch.sum(total) / batch

# Variant 2: only the bin with the highest GT confidence contributes.
def OrientationLoss(orient_batch, orientGT_batch, confGT_batch):
    batch_size = orient_batch.size()[0]
    indexes = torch.max(confGT_batch, dim=1)[1]
    # extract just the important bin
    orientGT_batch = orientGT_batch[torch.arange(batch_size), indexes]
    orient_batch = orient_batch[torch.arange(batch_size), indexes]
    theta_diff = torch.atan2(orientGT_batch[:, 1], orientGT_batch[:, 0])
    estimated_theta_diff = torch.atan2(orient_batch[:, 1], orient_batch[:, 0])
    return -1 * torch.cos(theta_diff - estimated_theta_diff).mean()

# Variant 3 (TensorFlow): dot product between GT and predicted (cos, sin) vectors.
def orientation_loss(y_true, y_pred):
    # Find the number of anchors with a valid GT orientation
    anchors = tf.reduce_sum(tf.square(y_true), axis=2)
    anchors = tf.greater(anchors, tf.constant(0.5))
    anchors = tf.reduce_sum(tf.cast(anchors, tf.float32), 1)
    # Define the loss
    loss = (y_true[:, :, 0] * y_pred[:, :, 0] + y_true[:, :, 1] * y_pred[:, :, 1])
    loss = tf.reduce_sum((2 - 2 * tf.reduce_mean(loss, axis=0))) / anchors
    return tf.reduce_mean(loss)
```

$$
L_{dims} = \sum \left(D^{*} - \bar{D} - \delta\right)^2
$$

$$
x_{min} = \left( K \left[\begin{matrix} R & T \end{matrix}\right] \left[\begin{matrix} d_x/2 \\ -d_y/2 \\ d_z/2 \\ 1 \end{matrix}\right] \right)_x
$$

$$
L = \alpha \times L_{dims} + L_{\theta}
$$
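A rough sketch of how the tight-fit constraint can be turned into a linear solve for the translation T, under the simplifying assumption that we already know which 3D corner touches each side of the 2D box (the paper instead enumerates the possible correspondences); the function name, corner ordering, and inputs here are my own, not the paper's code:

```python
import numpy as np

def solve_translation(K, R, dims, box2d, corner_idx):
    # K:     (3, 3) camera intrinsics
    # R:     (3, 3) object-to-camera rotation built from the regressed orientation
    # dims:  (3,)   regressed box dimensions [dx, dy, dz]
    # box2d: (4,)   [xmin, ymin, xmax, ymax] of the 2D detection
    # corner_idx: which 3D corner touches each of the four 2D sides (assumed known)
    dx, dy, dz = np.asarray(dims) / 2.0
    corners = np.array([[sx * dx, sy * dy, sz * dz]
                        for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)])
    xmin, ymin, xmax, ymax = box2d
    rows, sides = [0, 1, 0, 1], [xmin, ymin, xmax, ymax]
    A, b = [], []
    for row, side, idx in zip(rows, sides, corner_idx):
        # tight-fit constraint: the projection of corner idx lies on this box side,
        # i.e. (K[row] - side * K[2]) . (R @ X + T) = 0, which is linear in T
        a = K[row] - side * K[2]
        A.append(a)
        b.append(-a @ (R @ corners[idx]))
    T, *_ = np.linalg.lstsq(np.stack(A), np.array(b), rcond=None)
    return T  # (3,) translation of the box center in the camera frame
```

Four equations in three unknowns, so a least-squares solve gives a stable estimate of T.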

Monocular 3D Object Detection for Autonomous Driving (Mono3D)——2016

Candidate 3D boxes are scored with an energy that combines several potentials:

$$
E(x, y) = w^T_{c,sem}\phi_{c,sem}(x, y) + w^T_{c,inst}\phi_{c,inst}(x, y) + w^T_{c,cont}\phi_{c,cont}(x, y) + w^T_{c,loc}\phi_{c,loc}(x, y) + w^T_{c,shape}\phi_{c,shape}(x, y)
$$

The first feature computes the fraction of pixels inside the candidate box region $\Omega(y)$ that the semantic segmentation assigns to the object class:

$$
\phi_{c,seg}(x, y)=\frac{\sum_{i\in\Omega(y)}S_c(i)}{|\Omega(y)|}
$$

The second feature computes the **fraction of pixels** that belong to **classes** other than the object class:

$$
\phi_{c,non\text{-}seg,c'}(x, y)=\frac{\sum_{i\in\Omega(y)}S_{c'}(i)}{|\Omega(y)|}
$$
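A tiny sketch (helper name and array layout are my assumptions, not the paper's code) of how these two segmentation features could be computed from a per-class score map and a candidate 2D box:

```python
import numpy as np

def segmentation_features(S, box, c):
    # S:   (C, H, W) per-class segmentation scores in [0, 1]
    # box: (xmin, ymin, xmax, ymax) candidate 2D box in pixel coordinates
    # c:   index of the object class
    xmin, ymin, xmax, ymax = box
    region = S[:, ymin:ymax, xmin:xmax]                   # scores inside Omega(y)
    area = region.shape[1] * region.shape[2]
    phi_seg = region[c].sum() / area                      # fraction of class-c pixels
    others = [k for k in range(S.shape[0]) if k != c]
    phi_non_seg = region[others].sum(axis=(1, 2)) / area  # one feature per other class
    return phi_seg, phi_non_seg
```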
  • instance semantic

    Instance segmentation is performed only for cars.

  • shape

    Two kinds of grids are laid out inside the 2D candidate box: one grid map contains a single cell, the other contains K×K cells, and the number of contour pixels falling into each cell is counted.

  • context

    The region covering the bottom 1/3 of the 2D bounding box height is used as the context region, exploiting the prior that there must be ground beneath a car.

  • location

    A location prior is learned with kernel density estimation (KDE), with a fixed 3D position bandwidth of 4 m and a 2D image-position bandwidth of two pixels.

    Weight each loss equally, and define the category loss as cross entropy, the orientation loss as a smooth l1, and the bounding box offset loss as a smooth l1 loss over the 4 coordinates that parameterize the 2D bounding box.

  • A CNN then re-scores and classifies the top-scoring boxes, yielding the category, refined location, and orientation of each candidate

  • A final set of object proposals is obtained after non-maximum suppression

RTM3D: Real-time Monocular 3D Detection from Object Keypoints for Autonomous Driving——2020

This paper mainly turns 3D detection into a keypoint-prediction problem: it first predicts 9 keypoints of the object (8 corners + 1 center point) and then uses the geometric relations (9 keypoints give 18 constraints) to recover the dimension, location, and orientation.

The whole network consists of three parts: backbone, keypoint feature pyramid, and detection head. Overall it is one-stage.

  • backbone: ResNet-18 and DLA-34

  • keypoint feature pyramid: keypoints in the image do not differ in size, so keypoint detection is not well suited to the Feature Pyramid Network (FPN). Instead, a Keypoint Feature Pyramid Network (KFPN) is proposed to detect scale-invariant keypoints in the point-wise space (see the sketch right after the detection-head list).

    • Given F feature maps at different scales, first resize each scale f back to the size of the maximal scale, which yields the resized feature maps
    • generate a soft weight by a softmax operation to denote the importance of each scale
    • the scale-space score map S is obtained by a linear weighted sum
  • detection head

    • three fundamental components
      • Inspired by CenterNet, a keypoint is taken as the main-center for connecting all features. The main-center heatmap has C channels, where C is the number of object categories
      • heatmap of nine perspective points projected by vertexes and center of 3D bounding box
      • For keypoint association within one object, a local offset from the main-center is regressed as an indication
    • six optional components
      • The center offset and vertex offsets are the discretization errors of each keypoint in the heatmaps (how is this different from the local offset? one is the offset from the main-center point, while this one is seen from the heatmap side?)
      • The dimensions of a 3D object have a smaller variance, which makes them easy to predict.
      • The rotation of an object is parametrized only by the orientation (yaw). A Multi-Bin based method is used to regress the local orientation, generating an orientation feature map with two bins: the 2 bins contribute 4 residual (cos, sin) values, and the confidences contribute another 4, for 8 channels in total
      • regress the depth of the 3D box center.

    To recover the 2D bbox:

    • main-center, center offset, wh

    To recover the 3D corners:

    • Vertexes, vertexes offset, vertexes coordinate

    These are not used when drawing the boxes, so where exactly do they come in as constraints?

    • orientation, dimension, depth
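A minimal PyTorch-style sketch of the KFPN soft-weighted fusion described above (tensor names, shapes, and the assumption that the first entry is the maximal scale are mine): each scale's score map is resized to the largest resolution, a per-pixel softmax across scales produces soft weights, and the weighted sum gives the scale-space score map.

```python
import torch
import torch.nn.functional as F

def kfpn_fuse(score_maps):
    # score_maps: list of tensors [B, C, H_f, W_f] at different scales,
    #             with score_maps[0] assumed to be the maximal scale
    target = tuple(score_maps[0].shape[-2:])
    resized = [F.interpolate(s, size=target, mode='bilinear', align_corners=False)
               for s in score_maps]
    stacked = torch.stack(resized, dim=0)     # [F, B, C, H, W]
    weights = torch.softmax(stacked, dim=0)   # soft weight per scale, per pixel
    return (weights * stacked).sum(dim=0)     # linear weighted sum -> [B, C, H, W]
```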

Loss:

  • Training of all keypoint heatmaps (focal loss)

    The focal loss addresses the imbalance between positive and negative samples (a hedged reconstruction of the formula follows this list).

    K is the number of keypoint channels: K = C for the main center and K = 9 for the vertices. N is the number of main centers or vertices in an image, and the two hyper-parameters reduce the loss weight of negative and of easy positive samples. The ground-truth heatmap is defined by a Gaussian kernel centered at the ground-truth keypoint. For the kernel bandwidth, the maximal and minimal 2D box areas in the training data are found, two hyper-parameters are set from them, and the bandwidth for an object of size A is then defined accordingly.

  • Regression of dimension and distance (residual term)

  • Offsets of the main center and vertices (L1)

  • Coordinates of the vertices (L1)

  • All combined (multi-task)
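The formula itself was lost in the export; the penalty-reduced focal loss here follows the CenterNet style, which (as I recall it, so treat the exact form as a reconstruction) reads

$$
L_{kp} = -\frac{1}{N}\sum_{k}\sum_{xy}
\begin{cases}
\left(1-\hat{p}_{kxy}\right)^{\alpha}\log\left(\hat{p}_{kxy}\right) & \text{if } p_{kxy}=1\\[4pt]
\left(1-p_{kxy}\right)^{\beta}\,\hat{p}_{kxy}^{\alpha}\log\left(1-\hat{p}_{kxy}\right) & \text{otherwise}
\end{cases}
$$

where $\hat{p}$ is the predicted heatmap score and $p$ the Gaussian-smoothed ground truth.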

Our goal is to estimate the 3D bounding box, whose projections of center and 3D vertexes on the image space best fit the corresponding 2D keypoint. We formulate it and other prior errors as a nonlinear least squares optimization problem:

The data term is weighted by the covariance matrix of the keypoint projection error.

Some of the error terms are computed in SE(3) space (special Euclidean 3-space).
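As a hedged paraphrase of that formulation (notation mine; the weights come from the covariance above), the nine detected keypoints drive a reprojection term while the regressed dimension, orientation, and depth enter as priors:

$$
\min_{R,\,T,\,D}\; \sum_{i=1}^{9} \big\| \hat{kp}_i - \mathrm{proj}(K, R, T, D)_i \big\|^2_{\Sigma_i} \;+\; \lambda_D\, e_{dim} \;+\; \lambda_R\, e_{rot} \;+\; \lambda_Z\, e_{depth}
$$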

Evaluation:

  • Average precision for 3D intersection-over-union (AP 3D)
  • Average precision for Birds Eye View (AP BEV )
  • Average Orientation Similarity (AOS) if 2D bounding box available.

In an unofficial re-implementation, a sigmoid is applied to the depth output, presumably for smoothness.

One question: after predicting all this information, not all of it is used when reconstructing the boxes on the image, so why does it need to be predicted at all?

IDA-3D: Instance-Depth-Aware 3D Object Detection from Stereo Vision for Autonomous Driving

SMOKE: Single-Stage Monocular 3D Object Detection via Keypoint Estimation——2020

Previously: in the case of monocular vision, successful methods have been mainly based on two ingredients: (i) a network generating 2D region proposals, (ii) an R-CNN structure predicting the 3D object pose by utilizing the acquired regions of interest.

This paper: predicts a 3D bounding box for each detected object by combining a single keypoint estimate with regressed 3D variables. As a second contribution, we propose a multi-step disentangling approach for constructing the 3D bounding box, which significantly improves both training convergence and detection accuracy. In contrast to previous 3D detection techniques, our method does not require complicated pre/post-processing, extra data, and a refinement stage. In short, it is a one-stage monocular 3D detection method comprising two modules: keypoint prediction and 3D-variable regression.


  • Problem: given a single RGB image, find for each present object its category label C and its 3D bounding box B, where the latter is parameterized by 7 variables (h, w, l, x, y, z, θ); (x, y, z) are the coordinates (in meters) of the object center in the camera coordinate frame.

  • Backbone: DLA-34 since it can aggregate information across different layers.

    • all the hierarchical aggregation connections are replaced by a Deformable Convolution Network (DCN).

    • The output feature map is downsampled by a factor of 4 with respect to the original image.

    • All BatchNorm (BN) operations are replaced with GroupNorm (GN), since GN is less sensitive to batch size and more robust to training noise

      • BN normalizes along the batch dimension and therefore depends on the batch: too small a batch size degrades its performance, while too large a batch may not fit in GPU memory; a batch of around 32 per GPU is usually considered most suitable. Its statistics are also not used consistently across phases: they differ between training and testing, since during training the mean and variance are pre-computed on the training set with a moving average, while at test time these pre-computed values are used directly instead of being re-estimated. When the training and test distributions differ, the pre-computed statistics no longer represent the test data, creating an inconsistency between training, validation, and testing.

      • GN: it likewise addresses the Internal Covariate Shift problem, using the per-group mean and variance along the channel dimension

```python
import tensorflow as tf

def GroupNorm(x, gamma, beta, G, eps=1e-5):
    # x: input features with shape [N, C, H, W]
    # gamma, beta: scale and offset, with shape [1, C, 1, 1]
    # G: number of groups for GN
    N, C, H, W = x.shape
    x = tf.reshape(x, [N, G, C // G, H, W])
    mean, var = tf.nn.moments(x, [2, 3, 4], keepdims=True)
    x = (x - mean) / tf.sqrt(var + eps)
    x = tf.reshape(x, [N, C, H, W])
    return x * gamma + beta
```

      BatchNorm: normalizes along the batch dimension, computing the statistics over N×H×W for each channel
      LayerNorm: normalizes along the channel dimension, computing the statistics over C×H×W for each sample
      InstanceNorm: normalizes within a single channel, computing the statistics over H×W
      GroupNorm: splits the channels into groups and normalizes within each group, computing the statistics over (C/G)×H×W

      GN works better than LN because it is less constrained: LN assumes that all channels of a layer share one mean and variance, while IN loses the ability to exploit dependencies between channels.

  • 3D Detection Network

    • Keypoint Branch: the key point is defined as the projected 3D center of the object on the image plane.

      $$
      \left[\begin{matrix} z\cdot x_c \\ z\cdot y_c \\ z \end{matrix}\right] = K_{3\times 3}\left[\begin{matrix} x \\ y \\ z \end{matrix}\right]
      $$

      where $(x_c, y_c)$ is the projected 3D center on the image plane. The downsampled location on the feature map is computed and distributed using a Gaussian kernel.

    • Regression Branch: the 3D information is encoded as an 8-tuple [δz, δxc, δyc, δh, δw, δl, sin α, cos α]: δxc and δyc are the discretization offsets due to downsampling, and δh, δw, δl denote the residual dimensions. All variables to be learned are encoded in residual representation to reduce the learning interval and ease the training task. A similar operation F converts the projected 3D points to a 3D bounding box B.

      • For each object, its depth z can be recovered from the regressed offset with predefined scale and shift parameters (see the decoding sketch at the end of this section).

        With the depth, the location of each object in the camera frame can then be expressed by back-projecting the keypoint through the camera intrinsics.

        For the dimensions, a pre-calculated category-wise average dimension (computed in advance over the whole dataset) serves as a base, and the residual representation recovers the real dimensions.

        For the angle, the observation angle is regressed instead of the yaw rotation of each object [from "3D bounding box estimation using deep learning and geometry"]. The observation angle is further defined with respect to the object head, instead of the commonly used observation angle value, by simply adding a constant offset.

        The finally predicted yaw follows from the observation angle and the object location.

        Bounding box: the eight corners are then assembled from the location, dimensions, and yaw.

  • Loss

    • Keypoint Classification Loss

      A penalty-reduced focal loss is applied point-wise, comparing the predicted score at each heatmap location with the ground-truth value assigned to that point by the Gaussian kernel.

      The ground-truth heatmap is defined first,

      and simplifying to a single class, the classification loss follows.

      N is the number of objects in the image (i.e., the number of keypoints); the extra term corresponds to the penalty reduction for points around the ground-truth location. Viewed as a whole, the loss pushes the score at the center point to be as large as possible and the scores at points far from it to be as small as possible.

    • Regression Loss (l1 loss is observed to perform better than Smooth l1 loss)

      A channel-wise activation is added to the regressed dimension and orientation parameters at each feature-map location to preserve consistency. The activation functions for the dimension and the orientation are chosen to be the sigmoid function and ℓ2 normalization, respectively:

      Here o stands for the corresponding network output. The 3D bounding box regression loss is the l1 distance between the predicted and ground-truth boxes, and λ is a scaling factor.

    • The final loss function (the regression is split into three groups: orientation, dimension, and location)


Implementation details:

  • Data processing: boxes whose 3D center lies outside the image are removed

  • Data augmentation: random horizontal flip, random scale (9 steps from 0.6 to 1.4) and shift (5 steps from -0.2 to 0.2). Note that the scale and shift augmentations are only used for heatmap classification, since the 3D information becomes inconsistent under such augmentation. In other words, the augmentation only serves the heatmap classification and is not applied to the 3D branch; judging from the code, a flag is added so that after prediction the transform is inverted to recover the 3D information of the un-augmented image

  • Hyper-parameters: in the backbone, the group number for GroupNorm is set to 32; for layers with fewer than 32 channels it is set to 16

  • Training: the original image resolution is used and padded to 1280 × 384. The learning rate is decayed by a factor of 10 at epochs 25 and 40

  • Testing: the top 100 detected 3D projected points are kept and filtered with a threshold of 0.25. No data augmentation and no NMS are used in the test procedure
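Since the decoding formulas themselves did not survive in this post, here is my reconstruction of how the 3D box is recovered from the regressed residuals (notation mine; treat the exact forms as hedged): the depth comes from a predefined shift $\mu_z$ and scale $\sigma_z$, the center from back-projection through the intrinsics, the dimensions from the category-wise averages, and the yaw from the observation angle.

$$
z = \mu_z + \delta_z\,\sigma_z, \qquad
\left[\begin{matrix} x \\ y \\ z \end{matrix}\right] =
K^{-1}\left[\begin{matrix} z\,(x_c + \delta_{x_c}) \\ z\,(y_c + \delta_{y_c}) \\ z \end{matrix}\right]
$$

$$
[h, w, l] = \left[\bar{h}\,e^{\delta_h},\; \bar{w}\,e^{\delta_w},\; \bar{l}\,e^{\delta_l}\right], \qquad
\theta = \alpha_z + \arctan\!\frac{x}{z}
$$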

something

Predicting the 3D corners is about richer perception: for an autonomous vehicle it means knowing what is located in which direction and at what distance from the ego car. (So does 2D detection actually not contribute that much in the end?)

Approaches I have seen so far:

  1. Use a depth map plus prediction to obtain the object's position
  2. Directly regress the individual parameters, possibly combined with 2D proposals, depth methods, MultiBin, offsets, etc.
  3. Use left and right cameras (stereo) and exploit the geometric constraints between them
  4. Use LiDAR scans (not read yet; I have not touched the LiDAR side):
    • use a 2D bird's-eye view and then apply a 2D detection network
    • represent point clouds in a voxel grid and then leverage 2D/3D CNNs to generate proposals
  5. Predict a LiDAR-style map

Predicted quantities (not limited to):

  1. yaw
  2. dimension
  3. center
  4. depth

Stereo vs LiDAR vs monocular:

  • Monocular and stereo both arguably stem from a biological analogy: if humans can do it, a machine, after enough fiddling, should be able to do the task too
  • Monocular and stereo both use cameras, which are cheap overall and easy to replace, making them more suitable for wide deployment
  • Stereo cameras need calibration, and cameras mounted on a car are not stable, so they have to be re-adjusted from time to time
  • A camera's field of view is limited, whereas LiDAR covers essentially 360 degrees
  • The difficulty of computing from LiDAR is very different from computing a depth map (or doing 3D detection directly) from monocular or stereo images
  • In rain and similar conditions, detection performance is poor for all of them

Anyway, when neither the advantages nor the drawbacks can be avoided in the end, just go hybrid (which is probably what most systems do nowadays).

Current questions:

  1. The predictions are tied to a single input size and the same intrinsic matrix K; intuitively, the same image at a different scale would be judged to have a different size. For a company this means making sure all cameras are of the same model (which seems feasible)
  2. The supervision also needs to cover a wide enough range; otherwise objects outside that range are missed or detected incorrectly

Please credit the source when reposting. Thanks.

Wish: may I be your little sun

Off to buy candy!