EditSprings: AM-Softmax Paper Translation and Annotation Summary

Paper Polishing | 2019/05/31 09:12:56 | 496 views

AM-Softmax: Additive Margin Softmax for Face Verification

Abstract. In this paper, we propose a conceptually simple and geometrically interpretable objective function, i.e. additive margin Softmax (AM-Softmax), for deep face verification. In general, the face verification task can be viewed as a metric learning problem, so learning large-margin face features whose intra-class variation is small and inter-class difference is large is of great importance in order to achieve good performance. Recently, Large-margin Softmax [10] and Angular Softmax [9] have been proposed to incorporate the angular margin in a multiplicative manner. In this work, we introduce a novel additive angular margin for the softmax loss, which is intuitively appealing and more interpretable than the existing works. We also emphasize and discuss the importance of feature normalization in the paper. Most importantly, our experiments on LFW and MegaFace show that our additive margin softmax loss consistently performs better than the current state-of-the-art methods using the same network architecture and training dataset. Our code has also been made available.¹

Notes: 1. A new loss function, the AM-Softmax loss, is proposed to further enlarge inter-class differences and reduce intra-class variation. 2. The advantages of feature normalization are discussed. 3. The code is open source.

1. Introduction

Face verification is widely used for identity authentication in enormous areas such as finance, military, public security and so on. Nowadays, most face verification models are built upon deep convolutional neural networks and supervised by classification loss functions [18, 20, 19, 9], metric learning loss functions [16], or both [17, 13]. Metric learning loss functions such as the contrastive loss [17] or the triplet loss [16] usually require carefully designed sample mining strategies, and the final performance is very sensitive to these strategies, so increasingly more researchers shift their attention to building deep face verification models based on improved classification loss functions [20, 19, 9].

Current prevailing classification loss functions for deep face recognition are mostly based on the widely-used softmax loss. The softmax loss is typically good at optimizing the inter-class difference (i.e., separating different classes), but not good at reducing the intra-class variation (i.e., making features of the same class compact). To address this, lots of new loss functions have been proposed to minimize the intra-class variation. [20] proposed to add a regularization term to penalize the feature-to-center distances. In [19, 12, 15], researchers proposed to use a scale parameter to control the 'temperature' [2] of the softmax loss, producing higher gradients for the well-separated samples to further shrink the intra-class variance. In [9, 10], the authors introduced a conceptually appealing angular margin to push the classification boundary closer to the weight vector of each class. [9] also provided theoretical guidance for training a deep model for metric learning tasks using classification loss functions. [6, 12, 15] also improved the softmax loss by incorporating different kinds of margins.

In this work, we propose a novel and more interpretable way to import the angular margin into the softmax loss. We formulate an additive margin via $\cos\theta - m$, which is simpler than [9] and yields better performance. From Equation (3), we can see that $m$ is multiplied with the target angle $\theta_{y_i}$ in [9], so this type of margin is incorporated in a multiplicative manner. Since our margin is a scalar subtracted from $\cos\theta$, we call our loss function Additive Margin Softmax (AM-Softmax).

Experiments on the LFW BLUFR protocol [7] and MegaFace [5] show that our loss function, with the same network architecture, achieves better results than the current state-of-the-art approaches.

Notes: 1. Most face verification models are supervised by classification loss functions, metric learning loss functions, or both; metric learning losses such as the contrastive loss and the triplet loss are strongly affected by the sample mining strategy. 2. The softmax loss is good at optimizing inter-class differences but poor at reducing intra-class variation, which is why many improved variants have appeared. 3. The paper explains how AM-Softmax differs from SphereFace (A-Softmax), and it achieves the best results on the LFW BLUFR protocol [7] and MegaFace [5].

2. Preliminaries

To better understand the proposed AM-Softmax loss, we will first give a brief review of the original softmax loss and the A-Softmax loss [9]. The formulation of the original softmax loss is given by

$$\mathcal{L}_S = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{e^{W_{y_i}^T f_i}}{\sum_{j=1}^{c} e^{W_j^T f_i}},$$

where $f$ is the input of the last fully connected layer ($f_i$ denotes the $i$-th sample), and $W_j$ is the $j$-th column of the last fully connected layer's weight matrix. $W_{y_i}^T f_i$ is also called the target logit [14] of the $i$-th sample.
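To make the formulation concrete, here is a minimal NumPy sketch of the original softmax loss and its target logit. The function and variable names are our own illustration, not the paper's code.

```python
import numpy as np

def softmax_loss(F, W, y):
    """Original softmax loss (no bias term).

    F: (n, d) inputs to the last fully connected layer, one row per sample.
    W: (d, c) weight matrix of the last fully connected layer.
    y: (n,) integer ground-truth labels.
    """
    logits = F @ W                                       # (n, c); logits[i, y[i]] is the target logit
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize the exponentials
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()
```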

In the A-Softmax loss, the authors proposed to normalize the weight vectors (making $\|W_i\|$ equal to 1) and to generalize the target logit from $\|f_i\|\cos(\theta_{y_i})$ to $\|f_i\|\psi(\theta_{y_i})$, where $\psi(\theta)$ is usually a piece-wise function defined as

$$\psi(\theta) = \frac{(-1)^k\cos(m\theta) - 2k + \lambda\cos(\theta)}{1+\lambda},\qquad \theta\in\left[\frac{k\pi}{m},\frac{(k+1)\pi}{m}\right],$$

where m is usually an integer larger than 1 and λ is a hyper-parameter that controls how hard the classification boundary should be pushed. During training, λ is annealed from 1,000 down to a small value to make the angular space of each class become more and more compact. In their experiments, they set the minimum value of λ to 5 and m = 4, which is approximately equivalent to m = (Figure 2).
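As a sanity check of this definition, the small NumPy sketch below evaluates the piece-wise ψ(θ); it is our own illustration of the formula above, not code from either paper.

```python
import numpy as np

def a_softmax_psi(theta, m=4, lam=5.0):
    """Piece-wise psi(theta) of A-Softmax, with the annealing weight lam.

    theta: angle(s) in [0, pi]; m: integer angular margin; lam: lambda.
    """
    theta = np.asarray(theta, dtype=float)
    k = np.floor(theta * m / np.pi)            # index of the segment containing theta
    phi = (-1.0) ** k * np.cos(m * theta) - 2.0 * k
    return (phi + lam * np.cos(theta)) / (1.0 + lam)

print(a_softmax_psi([0.0, 0.5, 1.0]))          # monotonically decreasing on [0, pi]
```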

Notes: 1. Mathematical formulations of the original softmax loss and the A-Softmax loss (SphereFace). 2. Practical values for the corresponding hyper-parameters of A-Softmax (SphereFace). 3. λ = 5, m = 4.

3. Additive Margin Softmax

In this section, we will first describe the definition of the proposed loss function. Then we will discuss the intuition behind and the interpretation of the loss function.

3.1. Definition

[10] defines a general function $\psi(\theta)$ to introduce the large-margin property. Motivated by that, we further propose a specific $\psi(\theta)$ that introduces an additive margin to the softmax loss function. The formulation is given by

$$\psi(\theta) = \cos\theta - m.$$

Compared to the $\psi(\theta)$ defined in L-Softmax [10] and A-Softmax [9] (Equation (3)), our definition is more simple and intuitive. During implementation, the input after normalizing both the feature and the weight is actually $x = \cos\theta_{y_i} = \frac{W_{y_i}^T f_i}{\|W_{y_i}\|\,\|f_i\|}$, so in the forward propagation we only need to compute

$$\Psi(x) = x - m.$$

In this margin scheme, we don't need to calculate the gradient for back-propagation, because $\Psi'(x) = 1$. It is much easier to implement compared with SphereFace [9].

Since we use the cosine as the similarity to compare two face features, we follow [19, 11, 12] to apply both feature normalization and weight normalization to the inner-product layer in order to build a cosine layer. Then we scale the cosine values using a hyper-parameter $s$ as suggested in [19, 11, 12]. Finally, the loss function becomes

$$\mathcal{L}_{AMS} = -\frac{1}{n}\sum_{i=1}^{n}\log\frac{e^{s(\cos\theta_{y_i}-m)}}{e^{s(\cos\theta_{y_i}-m)} + \sum_{j=1, j\neq y_i}^{c} e^{s\cos\theta_j}}.$$

In this paper, we assume that the norms of both $W_i$ and $f$ are normalized to 1 if not specified. In [19], the authors propose to let the scaling factor $s$ be learned through back-propagation. However, after the margin is introduced into the loss function, we find that $s$ will not increase, and the network converges very slowly, if we let $s$ be learned. Thus, we fix $s$ to a large enough value, e.g. 30, to accelerate and stabilize the optimization.
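Putting the pieces together, here is a minimal NumPy sketch of the AM-Softmax forward pass under the equations above. The function name and the default margin value are our own illustrative choices.

```python
import numpy as np

def am_softmax_loss(F, W, y, s=30.0, m=0.35):
    """AM-Softmax loss, forward pass only (Psi'(x) = 1, so no extra backward code).

    F: (n, d) features; W: (d, c) class weights; y: (n,) labels.
    s: fixed scale; m: additive cosine margin (the default here is illustrative).
    """
    F = F / np.linalg.norm(F, axis=1, keepdims=True)   # feature normalization
    W = W / np.linalg.norm(W, axis=0, keepdims=True)   # weight normalization
    cos = F @ W                                        # (n, c) cosine similarities
    cos[np.arange(len(y)), y] -= m                     # additive margin on the target logit
    logits = s * cos
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(y)), y].mean()
```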

As described in Section 2, [10, 9] propose to use an annealing strategy for the hyper-parameter λ to avoid network divergence. However, to set the annealing curve of λ, lots of extra parameters are introduced, which are more or less confusing for starters. Although properly tuning those hyper-parameters for λ could lead to impressive results, the hyper-parameters are still quite difficult to tune. With our margin scheme, we find that we no longer need the help of the annealing strategy. The network can converge flexibly even if we fix the hyper-parameter m from scratch. Compared to SphereFace [9], our additive margin scheme is friendlier to those who are not familiar with the effects of the hyper-parameters. Another recently proposed additive margin is described in [6]. Our AM-Softmax differs from [6] in the sense that our feature and weight are normalized to a predefined constant s. The normalization is the key to the angular margin property: without it, the margin m does not necessarily lead to a large angular margin.

Notes: 1. The definition of the proposed AM-Softmax loss shows that it is easier to implement than SphereFace. 2. Features and weights are normalized, and the scaling factor is not trained as a hyper-parameter but fixed to a constant such as 30.

3.2. Discussion

Geometric Interpretation. Our additive margin scheme has a clear geometric interpretation on the hypersphere manifold. In Figure 3, we draw a schematic diagram to show the decision boundaries of both the conventional softmax loss and our AM-Softmax. For example, in Figure 3 the features are of 2 dimensions. After normalization, the features lie on a circle, and the decision boundary of the traditional softmax loss is denoted as the vector $P_0$. In this case, we have $W_1^T P_0 = W_2^T P_0$ at the decision boundary $P_0$.

For our AM-Softmax, the boundary becomes a marginal region instead of a single vector. At the new boundary $P_1$ for class 1, we have $W_1^T P_1 - m = W_2^T P_1$, which gives $m = (W_1 - W_2)^T P_1 = \cos(\theta_{W_1,P_1}) - \cos(\theta_{W_2,P_1})$. If we further assume that all the classes have the same intra-class variance and that the boundary for class 2 is at $P_2$, we can get $\cos(\theta_{W_2,P_1}) = \cos(\theta_{W_1,P_2})$ (Figure 3). Thus, $m = \cos(\theta_{W_1,P_1}) - \cos(\theta_{W_1,P_2})$, which is the difference between the cosine scores for class 1 on the two sides of the margin region.

Notes: 1. A rigorous mathematical derivation.

3.2.2 Angular Margin or Cosine Margin

In SphereFace [9], the margin m is multiplied with θ, so the angular margin is incorporated into the loss in a multiplicative way. In our proposed loss function, the margin is enforced by subtracting m from cos θ, so our margin is incorporated into the loss in an additive way, which is one of the most significant differences from [9]. It is also worth mentioning that, beyond the difference in how the margin is enforced, the two margin formulations also differ in their base values: one is θ and the other is cos θ. Although the cosine margin usually has a one-to-one mapping to the angular margin, there will still be some difference when optimizing them, due to the non-linearity induced by the cosine function.

Whether we should use the cosine margin depends on which similarity measurement (or distance) the final loss function is optimizing. Obviously, our modified softmax loss function is optimizing the cosine similarity, not the angle. This may not be a problem if we are using the conventional softmax loss, because the decision boundaries are the same in the two forms ($\cos\theta_1 = \cos\theta_2 \Rightarrow \theta_1 = \theta_2$). However, when we are trying to push the boundary, we will face the problem that these two similarities (distances) have different densities. Cosine values are denser when the angles are near 0 or π. If we want to optimize the angle, an arccos operation may be required after the value of the inner product $W^T f$ is obtained. It will potentially be more computationally expensive.

In general, the angular margin is conceptually better than the cosine margin, but considering the computational cost, the cosine margin is more appealing in the sense that it can achieve the same goal with less effort.
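To make the density argument concrete, the short NumPy check below (our own illustration, not from the paper) shows how much the cosine changes over a fixed angular step at different angles:

```python
import numpy as np

# A fixed angular step produces very different cosine gaps depending on
# where the angle sits, because d(cos theta)/d theta = -sin(theta).
step = 0.1
for theta in (0.05, np.pi / 4, np.pi / 2):
    gap = np.cos(theta) - np.cos(theta + step)
    print(f"theta = {theta:.2f} rad -> cosine gap over {step} rad: {gap:.4f}")
# Near 0 the gap is tiny (cosine values are dense there), so a fixed cosine
# margin m corresponds to a much larger angular margin near 0 than near pi/2.
```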

Notes: 1. The A-Softmax loss and the AM-Softmax loss differ not only in the form of the loss function but also in their theoretical basis: one optimizes an angular distance, the other a cosine distance. 2. Whether to optimize the angular distance or the cosine distance depends on which measurement the final verification step uses. 3. Optimizing the angular distance is conceptually better, but considering the computational cost the cosine distance is preferred.

3.2.3 Feature Normalization

In the SphereFace model [9], the authors added weight normalization on top of Large-Margin Softmax [10], leaving the feature still not normalized. Our loss function, following [19, 12, 15], applies feature normalization and uses a global scale factor s to replace the sample-dependent feature norm in SphereFace [9]. One question arises: when should we add the feature normalization?

Our answer is that it depends on the image quality. In Figure 1 of [15], we can see that the feature norm is highly correlated with the quality of the image. Note that back-propagation has the property that, for the normalized feature $\tilde{f} = f/\|f\|$,

$$\frac{\partial \tilde{f}}{\partial f} = \frac{1}{\|f\|}\left(I - \tilde{f}\tilde{f}^{T}\right),$$

so the gradient flowing back through the normalization is inversely proportional to the feature norm $\|f\|$.

Thus, after normalization, features with small norms will get much bigger gradients compared with features that have big norms (Figure 5). By back-propagation, the network will pay more attention to the low-quality face images, which usually have small norms. The effect is very similar to hard sample mining [16, 8]. The advantages of feature normalization are also revealed in [11]. In conclusion, feature normalization is most suitable for tasks whose image quality is very low.

From Figure 5 we can see that the gradient norm may be extremely big when the feature norm is very small. This potentially increases the risk of gradient explosion, even though we may not come across many samples with very small feature norms. Perhaps some re-weighting strategy whose feature-gradient norm curve lies between the two curves in Figure 5 could work better. This is an interesting topic to be studied in the future.
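The inverse scaling can be checked numerically. The sketch below (our own illustration; the helper names are not from the paper) applies the normalization Jacobian to a fixed upstream gradient at several feature norms:

```python
import numpy as np

def grad_through_normalization(f, upstream):
    """Gradient of L = upstream . (f / ||f||) with respect to f."""
    n = np.linalg.norm(f)
    y = f / n
    jac = (np.eye(len(f)) - np.outer(y, y)) / n   # d(f/||f||)/df
    return jac.T @ upstream

rng = np.random.default_rng(0)
direction = rng.normal(size=8)
direction /= np.linalg.norm(direction)
upstream = rng.normal(size=8)
for norm in (0.1, 1.0, 10.0):
    g = grad_through_normalization(norm * direction, upstream)
    print(f"feature norm {norm:5.1f} -> gradient norm {np.linalg.norm(g):.4f}")
# The printed gradient norms scale as 1/||f||: small-norm (typically
# low-quality) samples receive much larger gradients after normalization.
```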

Notes: 1. Feature normalization is adopted in this paper. 2. When is feature normalization appropriate? When image quality is low, since it increases the network's attention to low-quality images.

3.2.4 Feature Distribution Visualization

To better understand the effect of our loss function, we designed a toy experiment to visualize the feature distributions trained by several loss functions. We used Fashion MNIST [21] (10 classes) to train several 7-layer CNN models that output 3-dimensional features. These networks are supervised by different loss functions. After we obtain the 3-dimensional features, we normalize and plot them on a hypersphere (ball) in 3-dimensional space (Figure 4).

From the visualization, we can empirically show that our AM-Softmax performs similarly to the best SphereFace [9] (A-Softmax) model when we set s = 10, m = . Moreover, our loss function can further shrink the intra-class variance by setting a larger m. Compared to A-Softmax [9], the AM-Softmax loss also converges more easily with a proper scaling factor s. The visualized 3D features demonstrate well that AM-Softmax can bring the large-margin property to the features without tuning too many hyper-parameters.
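For reference, the plotting step of such a toy experiment might look like the sketch below; the stand-in random features and all names are our own assumptions, since the paper does not include code here.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_on_sphere(features, labels):
    """Normalize 3-D features and scatter them on the unit sphere."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    ax = plt.figure().add_subplot(projection="3d")
    ax.scatter(f[:, 0], f[:, 1], f[:, 2], c=labels, s=2, cmap="tab10")
    plt.show()

# A real run would pass the 3-D embeddings of the 7-layer CNN; random
# stand-ins are used here only to make the sketch self-contained.
rng = np.random.default_rng(0)
plot_on_sphere(rng.normal(size=(1000, 3)), rng.integers(0, 10, size=1000))
```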

Notes: 1. The same 7-layer CNN is trained on Fashion-MNIST with different losses to compare their behavior (the output is a normalized 3-dimensional feature vector). 2. With s = 10 and a suitable m, the AM-Softmax model performs similarly to the best SphereFace (A-Softmax) model.

4. Experiment

In this section, we will first describe the experimental settings. Then we will discuss the overlapping problem of the modern in-the-wild face datasets. Finally, we will compare the performance of our loss function with several previous state-of-the-art loss functions.

4.1. Implementation Details

Our loss function is implemented using the Caffe framework [4]. We follow all the experimental settings from [9], including the image resolution, the preprocessing method, and the network structure. Specifically, we use MTCNN [24] to detect faces and facial landmarks in images. The faces are then aligned according to the detected landmarks. The aligned face images are of size 112 × 96 and are normalized by subtracting 128 and dividing by 128. Our network structure follows [9]: a modified 20-layer ResNet [1] adapted to face recognition.

All the networks are trained from scratch. We set the weight decay parameter to 5e-4. The batch size is 256, and the learning rate begins with and is divided by 10 at the 16K, 24K, and 28K iterations. The training finishes at 30K iterations. During training, we only use image mirroring to augment the dataset.

In the testing phase, we feed both the frontal face images and the mirrored face images and extract the features from the output of the first inner-product layer. The two features are then summed together as the representation of the face image. When comparing two face images, the cosine similarity is utilized as the measurement.
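A sketch of this test-time procedure is below; `extract_features` stands in for the network's first inner-product layer output, and it and the other names are assumptions rather than the paper's code.

```python
import numpy as np

def face_representation(image, extract_features):
    """Sum the features of an aligned face image and its horizontal mirror."""
    feat = extract_features(image)
    feat_mirror = extract_features(image[:, ::-1])  # flip along the width axis
    return feat + feat_mirror

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Usage: given two aligned 112x96 face images and a trained network `net`,
# score = cosine_similarity(face_representation(img1, net),
#                           face_representation(img2, net))
```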

Notes: 1. The Caffe framework is used. 2. All experimental settings from SphereFace [9] are followed, including image resolution, preprocessing, and network structure (a 20-layer ResNet [1]). 3. Concrete reference values are given for the network structure and the training hyper-parameters.

4.2. Dataset Overlap Removal

The dataset we use for training is CASIA-WebFace [22], which contains 494,414 training images from 10,575 identities. To perform open-set evaluations, we carefully remove the overlapped identities between the training dataset (CASIA-WebFace [22]) and the testing datasets (LFW [3] and MegaFace [5]). Finally, we find 17 overlapped identities between CASIA-WebFace and LFW, and 42 overlapped identities between CASIA-WebFace and MegaFace set 1. Note that there are only 80 identities in MegaFace set 1, i.e. over half of the identities are already in the training dataset. The effect of overlap removal is remarkable for MegaFace (Table ). To be rigorous, all the experiments in this paper are based on the cleaned dataset. We have made our overlap-checking code publicly available² to encourage researchers to clean their training datasets before experiments. In our paper, we re-train some of the previous loss functions on the cleaned dataset as baselines for comparison. Note that we make our experiments fair by using the same network architecture and training dataset for every compared method.
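A minimal sketch of such an overlap check; the identity lists and the name normalization are our own simplification, and the released code may work differently.

```python
def overlapped_identities(train_names, test_names):
    """Return identity names that appear in both the training and test sets."""
    def norm(name):
        return name.strip().lower().replace(" ", "_")
    return {norm(n) for n in train_names} & {norm(n) for n in test_names}

# Usage with identity lists loaded from the two datasets:
# overlap = overlapped_identities(casia_webface_names, lfw_names)
# print(len(overlap))  # the paper reports 17 for CASIA-WebFace vs. LFW
```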

Notes: 1. The training set is CASIA-WebFace [22]; the test sets are LFW [3] and MegaFace [5]. 2. Identities overlapping between the training and test sets are removed to make the evaluation truly open-set; for MegaFace the removal changes the measured performance noticeably. 3. The overlap-checking code has been open-sourced.

4.3. Effect of the Hyper-parameter m

There are two hyper-parameters in our proposed loss function: one is the scale s and the other is the margin m. The scale s has already been discussed sufficiently in several previous works [19, 12, 15]. In this paper, we directly fix it to 30 and do not discuss its effect further.

The main hyper-parameter in our loss function is the margin m. In Table 4, we list the performance of our proposed AM-Softmax loss function as m varies from 5 to . From the table we can see that from m = 5 to , the performance improves significantly, and the performance becomes the best when m = 5 to m = .

We also provide the results for the loss function without feature normalization (noted as w/o FN) and with the scale s. As we explained before, feature normalization performs better on low-quality images like those in MegaFace [5], while using the original feature norm performs better on high-quality images like those in LFW [3].

In Figure 6, we draw both the CMC curves, to evaluate the performance of identification, and the ROC curves, to evaluate the performance of verification. From this figure, we can see that our loss function performs much better than the other loss functions when the rank or the false positive rate is very low.

Notes: 1. The loss has two hyper-parameters, but s was studied closely in earlier work and is simply fixed to 30 here; the main hyper-parameter is m, for which reference values are given. 2. The experiments confirm that feature normalization brings little benefit on datasets with high image quality. 3. The rank-related metric at the end was not checked in detail and is not discussed further here.

5. Conclusion and Future Work

In this paper, we propose to impose an additive margin strategy on the target logit of the softmax loss, with the features and weights normalized. Our loss function is built upon the previous margin schemes [9, 10], but it is more simple and interpretable. Comprehensive experiments show that our loss function performs better than A-Softmax [9] on LFW BLUFR [7] and MegaFace [5].

There is still a lot of potential in the research of large-margin strategies. There could be more creative ways of specifying the function ψ(θ) other than multiplication and addition. In our AM-Softmax loss, the margin is a manually tuned global hyper-parameter. How to determine the margin automatically, and how to incorporate class-specific or sample-specific margins, remain open questions and are worth studying.

Notes: 1. A new loss is defined on the basis of previous work. 2. Features and weights are normalized. 3. Feasible directions for future research are pointed out.
