Batch Normalization


"Batch normalization is one of the most exciting recent innovations in optimizing deep neural networks." (Ian Goodfellow et al., *Deep Learning*)

Note: although this post could serve as reading notes for the paper *Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift*, the focus here is on the derivation and implementation of BN, with an emphasis on engineering practice.

        Batch normalization was proposed by Google in the 2015 paper *Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift*. I will not pile on more praise for BN here; as the folk saying goes, "BN just works", and on image classification it really does work well. Let's keep our attention on BN itself. To describe it in a fairly plain and detailed way, this post covers BN from the following angles:

  1. The purpose and idea of BN
  2. The BN formulas
  3. The advantages of BN
  4. The backpropagation derivation of BN
  5. The implementation of BN in code

1. The purpose and idea of BN

        The main purpose of BN is to make deep networks easier to train. Deep neural networks are hard to train because the distribution of each layer's inputs changes during training as the parameters of the preceding layers change, which forces us to use a small learning rate and careful parameter initialization, and that makes training slow. This phenomenon, "the distribution of each layer's inputs changes during training as the previous layers' parameters change", is called Internal Covariate Shift in the paper; in statistical machine learning, Covariate Shift refers to a mismatch between the training distribution and the target distribution, which leads to poor generalization of the trained model. The idea of BN is therefore to apply a standard normalization to the input of every hidden layer, so that the input of the activation function, $Z$, has zero mean and unit variance. This is essentially the same as normalizing the features during preprocessing: we normalize features so that the cost function converges faster and training speeds up, and BN shares exactly that motivation. What I personally consider the stroke of genius in BN, however, is that after the normalization it adds a linear function $\widetilde{Z}^{(i)} = \gamma\cdot Z_{norm}^{(i)} + \beta$, where the parameter $\gamma$ acts as a scale and $\beta$ acts as a shift; both are learned during training. The reason is that normalizing $Z$ alone would change the representational capacity of each layer: with a sigmoid activation, for example, normalizing $Z$ confines the input to the roughly linear part of the sigmoid. To make this concrete, consider the example below: the figure on the left shows the distribution of the original dataset, and the figure on the right shows the dataset after normalization.
Note: for ease of display only two feature dimensions are drawn; the horizontal axis is one feature and the vertical axis is the other.

From the figure above we can see that after normalization each feature dimension follows a standard normal distribution (note the axis scales of the two plots). As a consequence, if the activation function is sigmoid, its input is confined to the roughly linear part of the sigmoid, as shown in the figure below. We all know that neural networks excel precisely at nonlinear problems, so doing this would amount to crippling them.

Therefore, to deal with this problem, the Google researchers introduced $\widetilde{Z}^{(i)} = \gamma\cdot Z_{norm}^{(i)} + \beta$, which restores the representational capacity of each layer of the original network.
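In fact, as the paper points out, the scale and shift can undo the normalization entirely: if the network learns $\gamma = \sqrt{\delta^{2}+\epsilon}$ and $\beta = \mu$ (the standard deviation and mean used in the normalization, defined in the next section), then

$$\widetilde{Z}^{(i)} = \sqrt{\delta^{2}+\epsilon}\cdot\frac{Z^{(i)}-\mu}{\sqrt{\delta^{2}+\epsilon}} + \mu = Z^{(i)},$$

so the original activations are recovered exactly. In other words, BN lets the network choose how much normalization it actually wants.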

2. The BN formulas
        Here I keep the notation of my earlier posts, i.e. $Z = W \cdot X + b$. Suppose the mini-batch size is $m$; then for a single hidden unit (activation unit) we have the following four equations (the same ones implemented in batchnorm_forward below):
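$$\mu = \frac{1}{m}\sum_{i=1}^{m} Z^{(i)}, \qquad \delta^{2} = \frac{1}{m}\sum_{i=1}^{m}\left(Z^{(i)}-\mu\right)^{2}$$

$$Z_{norm}^{(i)} = \frac{Z^{(i)}-\mu}{\sqrt{\delta^{2}+\epsilon}}, \qquad \widetilde{Z}^{(i)} = \gamma\cdot Z_{norm}^{(i)} + \beta$$

Here $\mu$ and $\delta^{2}$ are the mean and variance of this unit's pre-activations over the mini-batch, and $\epsilon$ is a small constant that keeps the denominator away from zero.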

The formulas really are that simple; there is not much more to say.

3. The advantages of BN
Here are the advantages of BN, taken straight from the BN paper, with a few remarks of my own:

  1. The learning rate can be set relatively large. Because a larger learning rate makes the algorithm converge in fewer steps, training is certainly faster; and even with the same learning rate as before, training still speeds up thanks to the normalization.
  2. The BN paper says BN has some ability to prevent overfitting, so regularization techniques such as dropout and L2 regularization could be dropped. However, Andrew Ng says in his course that BN only has a limited regularization effect and cannot fully replace dropout and the like, and my own experiments agree with him.
  3. Local response normalization is no longer needed. I am not very familiar with local response normalization, so I will not say more about it.

To sum up: BN really is useful, but it is not a cure-all; it works well, but it should not be applied blindly.

4. The backpropagation derivation of BN
        Now we come to the main event, the backpropagation derivation of BN, which is of course the core of the whole process, since backprop is how the parameters are learned. Because BN introduces two extra parameters $\gamma$ and $\beta$, training now involves four parameters: $\gamma$, $\beta$, $W$ and $b$. In fact, since BN subtracts the mini-batch mean, any bias $b$ added to $Z$ is cancelled out (its role is taken over by $\beta$), so $b$ no longer needs to be learned. Let's first derive the gradients of $\gamma$ and $\beta$:
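Since $\widetilde{Z}^{(i)} = \gamma\cdot Z_{norm}^{(i)} + \beta$, differentiating and summing over the mini-batch gives

$$\frac{\partial J}{\partial \gamma} = \sum_{i=1}^{m}\frac{\partial J}{\partial \widetilde{Z}^{(i)}}\cdot Z_{norm}^{(i)}, \qquad \frac{\partial J}{\partial \beta} = \sum_{i=1}^{m}\frac{\partial J}{\partial \widetilde{Z}^{(i)}}$$

(these are dgamma and dbeta in batchnorm_backward below).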

To explain: the summation appears because, for a given activation unit, there are $m$ samples in the mini-batch, and each of them contributes to the gradient.
Next we compute the gradient of $W$. We only need the derivative of the cost function $J$ with respect to $Z^{(i)}$, because $Z^{(i)} = W \cdot X + b$, so once we have the gradient of $Z^{(i)}$ the gradient of $W$ follows easily. To keep the derivation clear at a glance, we can first lay out the dependency tree: $J$ depends on $\widetilde{Z}^{(i)}$, which depends on $Z_{norm}^{(i)}$ (together with $\gamma$ and $\beta$); $Z_{norm}^{(i)}$ depends on $Z^{(i)}$, $\mu$ and $\delta^{2}$; and $\mu$ and $\delta^{2}$ in turn depend on $Z^{(i)}$.

Therefore, by the chain rule:
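$$\frac{\partial J}{\partial Z^{(i)}} = \frac{\partial J}{\partial Z_{norm}^{(i)}}\cdot\frac{\partial Z_{norm}^{(i)}}{\partial Z^{(i)}} + \frac{\partial J}{\partial \delta^{2}}\cdot\frac{\partial \delta^{2}}{\partial Z^{(i)}} + \frac{\partial J}{\partial \mu}\cdot\frac{\partial \mu}{\partial Z^{(i)}}$$

The three terms correspond to the three paths in the dependency tree above: through $Z_{norm}^{(i)}$ directly, through the variance $\delta^{2}$, and through the mean $\mu$.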

Now let's work out each term:
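$$\frac{\partial J}{\partial Z_{norm}^{(i)}} = \frac{\partial J}{\partial \widetilde{Z}^{(i)}}\cdot\gamma, \qquad \frac{\partial Z_{norm}^{(i)}}{\partial Z^{(i)}} = \frac{1}{\sqrt{\delta^{2}+\epsilon}}$$

$$\frac{\partial J}{\partial \delta^{2}} = -\frac{1}{2}\sum_{j=1}^{m}\frac{\partial J}{\partial Z_{norm}^{(j)}}\cdot\left(Z^{(j)}-\mu\right)\cdot\left(\delta^{2}+\epsilon\right)^{-3/2}, \qquad \frac{\partial \delta^{2}}{\partial Z^{(i)}} = \frac{2\left(Z^{(i)}-\mu\right)}{m}$$

$$\frac{\partial J}{\partial \mu} = -\frac{1}{\sqrt{\delta^{2}+\epsilon}}\sum_{j=1}^{m}\frac{\partial J}{\partial Z_{norm}^{(j)}}, \qquad \frac{\partial \mu}{\partial Z^{(i)}} = \frac{1}{m}$$

(in $\partial J/\partial \mu$ the extra term that passes through $\delta^{2}$ vanishes, because $\sum_{j}\left(Z^{(j)}-\mu\right) = 0$).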

Therefore, we have:
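Substituting the terms above and simplifying with $Z^{(j)}-\mu = Z_{norm}^{(j)}\sqrt{\delta^{2}+\epsilon}$:

$$\frac{\partial J}{\partial Z^{(i)}} = \frac{\gamma}{m\sqrt{\delta^{2}+\epsilon}}\left(m\,\frac{\partial J}{\partial \widetilde{Z}^{(i)}} - \sum_{j=1}^{m}\frac{\partial J}{\partial \widetilde{Z}^{(j)}} - Z_{norm}^{(i)}\sum_{j=1}^{m}\frac{\partial J}{\partial \widetilde{Z}^{(j)}}\cdot Z_{norm}^{(j)}\right)$$

This compact form is what batchnorm_backward computes in the code below.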

Regarding this derivation, almost all the materials I have seen online stop at the boxed result above, presumably because they follow the brief result given in the BN paper. But because there is the extra step $\widetilde{Z}^{(i)} = \gamma\cdot Z_{norm}^{(i)} + \beta$, the result should be written in terms of the derivative of $J$ with respect to $\widetilde{Z}^{(i)}$, since what gets passed back in backprop, after differentiating the activation function, is $\frac{\partial J}{\partial \widetilde{Z}^{(i)}}$.
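Concretely, if $A^{(i)} = g(\widetilde{Z}^{(i)})$ is the output of the activation function $g$, then

$$\frac{\partial J}{\partial \widetilde{Z}^{(i)}} = \frac{\partial J}{\partial A^{(i)}}\cdot g'(\widetilde{Z}^{(i)}),$$

which is exactly what relu_backward computes before batchnorm_backward is called in the implementation.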

5. The implementation of BN in code
        Next let's implement the algorithm ourselves. The forward pass of BN is very simple: we only need to add a BN layer to the original network; the main work is implementing the backward pass. There is one point that is absolutely crucial in the implementation:

Namely: during training, $\mu$ and $\delta^{2}$ are easy to compute from each batch of samples, but at test time there are often only a few samples, or even a single one, so $\mu$ and $\delta^{2}$ cannot be estimated from the test data. The solution is to use an exponentially weighted average (also called an exponential moving average) during training (for a detailed introduction see my post on optimization methods in deep learning: momentum, Nesterov Momentum, AdaGrad, Adadelta, RMSprop, Adam) to keep updating a running mean and variance. Once training is finished, these running statistics (moving_mean, moving_variance) are applied directly at test time as the mean and variance of the test samples (note that moving_mean and moving_variance are used only at test time). They are updated as follows:
$$moving\_mean = 0.9\cdot moving\_mean + 0.1\cdot current\_mean$$
$$moving\_variance = 0.9\cdot moving\_variance + 0.1\cdot current\_variance$$

Looking at the derivatives we worked out, the variables to save during the forward pass are $\frac{1}{\sqrt{\delta^{2}+\epsilon}}$ (the code stores $\sqrt{\delta^{2}+\epsilon}$ as sqrt_var), $Z_{norm}^{(i)}$, $\gamma$ and $\widetilde{Z}^{(i)}$.
Below is the implementation: we initialize the BN layer's parameters, implement the forward and backward passes of the BN layer, and when updating the parameters we also update $\gamma$ and $\beta$. The complete code is on GitHub: batch_normalization.py

# implement the batch normalization
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split


#initialize parameters(w,b)
def initialize_parameters(layer_dims):
	"""
	:param layer_dims: list, the number of units (dimension) in each layer
			gamma -- scale vector of shape (size of current layer, 1)
			beta -- offset vector of shape (size of current layer, 1)
	:return: parameters: dictionary storing W1,W2,...,WL, b1,...,bL, gamma1,...,beta1,...
			 bn_param: dictionary storing moving_mean, moving_var
	"""
	np.random.seed(3)
	L = len(layer_dims)#the number of layers in the network
	parameters = {}
	# initialize the exponential weight average
	bn_param = {}
	for l in range(1,L):
		parameters["W" + str(l)] = np.random.randn(layer_dims[l],layer_dims[l-1])*0.01
		parameters["b" + str(l)] = np.zeros((layer_dims[l],1))
		parameters["gamma" + str(l)] = np.ones((layer_dims[l],1))
		parameters["beta" + str(l)] = np.zeros((layer_dims[l],1))
		bn_param["moving_mean" + str(l)] = np.zeros((layer_dims[l], 1))
		bn_param["moving_var" + str(l)] = np.zeros((layer_dims[l], 1))

	return parameters, bn_param

def relu_forward(Z):
	"""
	:param Z: Output of the linear layer
	:return:
	A: output of activation
	"""
	A = np.maximum(0,Z)
	return A

#implement the activation function(ReLU and sigmoid)
def sigmoid_forward(Z):
	"""
	:param Z: Output of the linear layer
	:return:
	"""
	A = 1 / (1 + np.exp(-Z))
	return A

def linear_forward(X, W, b):
	z = np.dot(W, X) + b
	return z

def batchnorm_forward(z, gamma, beta, epsilon = 1e-12):
	"""
	:param z: the input of activation (z = np.dot(W,A_pre) + b)
	:param epsilon: a small constant that keeps the denominator away from zero
	:return: z_out, mu, var, z_norm, sqrt_var
	"""
	mu = np.mean(z, axis=1, keepdims=True)  # axis=1: mean over the mini-batch for each unit
	var = np.var(z, axis=1, keepdims=True)
	sqrt_var = np.sqrt(var + epsilon)
	z_norm = (z - mu) / sqrt_var
	z_out = np.multiply(gamma,z_norm) + beta  # element-wise multiplication (scale and shift)
	return z_out, mu, var, z_norm, sqrt_var


def forward_propagation(X, parameters, bn_param, decay = 0.9):
	"""
	X -- input dataset, of shape (input size, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "gamma1","beta1",W2", "b2","gamma2","beta2",...,"WL", "bL","gammaL","betaL"
                    W -- weight matrix of shape (size of current layer, size of previous layer)
                    b -- bias vector of shape (size of current layer,1)
                    gamma -- scale vector of shape (size of current layer ,1)
                    beta -- offset vector of shape (size of current layer ,1)
                    decay -- the parameter of exponential weight average
                    moving_mean = decay * moving_mean + (1 - decay) * current_mean
                    moving_var = decay * moving_var + (1 - decay) * current_var
                    the moving_mean and moving_var are used for test
    :return:
	AL: the output of the last Layer(y_predict)
	caches: list, every element is a tuple:(A, W,b,gamma,sqrt_var,z_out,Z_norm)
	"""
	L = len(parameters) // 4  # number of layer
	A = X
	caches = []
	# calculate from 1 to L-1 layer
	for l in range(1,L):
		W = parameters["W" + str(l)]
		b = parameters["b" + str(l)]
		gamma = parameters["gamma" + str(l)]
		beta = parameters["beta" + str(l)]
		z = linear_forward(A, W, b)
		z_out, mu, var, z_norm, sqrt_var = batchnorm_forward(z, gamma, beta) #batch normalization
		caches.append((A, W, b, gamma, sqrt_var, z_out, z_norm)) # group the pre-activation variables; after the activation, the output can be treated as the next layer's input
		A = relu_forward(z_out) #relu activation function
		#exponential weight average for test
		bn_param["moving_mean" + str(l)] = decay * bn_param["moving_mean" + str(l)] + (1 - decay) * mu
		bn_param["moving_var" + str(l)] = decay * bn_param["moving_var" + str(l)] + (1 - decay) * var
	# calculate Lth layer(last layer)
	WL = parameters["W" + str(L)]
	bL = parameters["b" + str(L)]
	zL = linear_forward(A, WL, bL)
	caches.append((A, WL, bL, None, None, None, None))
	AL = sigmoid_forward(zL)
	return AL, caches, bn_param

#calculate cost function
def compute_cost(AL,Y):
	"""
	:param AL: activations of the last layer, i.e. the predictions, shape: (1, number of examples)
	:param Y: ground-truth labels, shape: (1, number of examples)
	:return:
	"""
	m = Y.shape[1]
	cost = 1. / m * np.nansum(np.multiply(-np.log(AL), Y) +
	                          np.multiply(-np.log(1 - AL), 1 - Y))
	# remove single-dimensional entries from the shape, e.g. turn [[[2]]] into 2
	cost = np.squeeze(cost)
	return cost

#derivation of relu
def relu_backward(dA, Z):
	"""
	:param Z: the input of activation function
	:return:
	"""
	dout = np.multiply(dA, np.int64(Z > 0))
	return dout

def batchnorm_backward(dout, cache):
	"""
	:param dout: Upstream derivatives
	:param cache:
	:return:
	"""
	_, _, _, gamma, sqrt_var, _, Z_norm = cache
	m = dout.shape[1]
	dgamma = np.sum(dout*Z_norm, axis=1, keepdims=True)  # * on ndarrays is element-wise multiplication
	dbeta = np.sum(dout, axis=1, keepdims=True)
	# divide by sqrt_var (= sqrt(var + epsilon)), per the dZ formula in the derivation above
	dy = 1./m * gamma / sqrt_var * (m * dout - np.sum(dout, axis=1, keepdims=True) - Z_norm*np.sum(dout*Z_norm, axis=1, keepdims=True))
	return dgamma, dbeta, dy

def linear_backward(dZ, cache):
	"""
	:param dZ: Upstream derivative, of shape (size of current layer, m)
	:param cache: tuple from forward propagation; only A (the input of this layer) and W are used here
	:return:
	"""
	A, W, _, _, _, _, _ = cache
	dW = np.dot(dZ, A.T)
	db = np.sum(dZ, axis=1, keepdims=True)
	da = np.dot(W.T, dZ)
	return da, dW, db

def backward_propagation(AL, Y, caches):
	"""
	Implement the backward propagation for the whole network.
	Arguments:
	Y -- true "label" vector, of shape (1, number of examples)
	caches -- caches output from forward_propagation(); each element is (A, W, b, gamma, sqrt_var, z_out, z_norm)

	Returns:
	gradients -- A dictionary with the gradients dW, db, dgamma, dbeta
	"""
	m = Y.shape[1]
	L = len(caches)-1
	# print("L:   " + str(L))
	#calculate the Lth layer gradients
	dz = 1./m * (AL - Y)
	da, dWL, dbL = linear_backward(dz, caches[L])
	gradients = {"dW"+str(L+1): dWL, "db"+str(L+1): dbL}
	#calculate from L-1 to 1 layer gradients
	for l in reversed(range(0,L)): # L-1, L-2, ..., 0
		#relu_backward->batchnorm_backward->linear backward
		A, w, b, gamma, sqrt_var, z_out, z_norm = caches[l]
		#relu backward
		dout = relu_backward(da,z_out)
		#batch normalization
		dgamma, dbeta, dz = batchnorm_backward(dout,caches[l])
		# print("===============dz" + str(l+1) + "===================")
		# print(dz.shape)
		#linear backward
		da, dW, db = linear_backward(dz,caches[l])
		# print("===============dw"+ str(l+1) +"=============")
		# print(dW.shape)
		#gradient
		gradients["dW" + str(l+1)] = dW
		gradients["db" + str(l+1)] = db
		gradients["dgamma" + str(l+1)] = dgamma
		gradients["dbeta" + str(l+1)] = dbeta
	return gradients

def update_parameters(parameters, grads, learning_rate):
	"""
	:param parameters: dictionary, W, b
	:param grads: dW,db,dgamma,dbeta
	:param learning_rate: alpha
	:return:
	"""
	L = len(parameters) // 4
	for l in range(L):
		parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * grads["dW" + str(l+1)]
		parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * grads["db" + str(l+1)]
		if l < L-1:
			parameters["gamma" + str(l + 1)] = parameters["gamma" + str(l + 1)] - learning_rate * grads["dgamma" + str(l + 1)]
			parameters["beta" + str(l + 1)] = parameters["beta" + str(l + 1)] - learning_rate * grads["dbeta" + str(l + 1)]
	return parameters

def L_layer_model(X, Y, layer_dims, learning_rate, num_iterations):
	"""
	:param X:
	:param Y:
	:param layer_dims: list containing the input size and each layer size
	:param learning_rate:
	:param num_iterations:
	:return:
	parameters:final parameters:(W,b,gamma,beta)
	bn_param: moving_mean, moving_var
	"""
	costs = []
	# initialize parameters
	parameters, bn_param = initialize_parameters(layer_dims)
	for i in range(0, num_iterations):
		#foward propagation
		AL,caches,bn_param = forward_propagation(X, parameters,bn_param)
		# calculate the cost
		cost = compute_cost(AL, Y)
		if i % 1000 == 0:
			print("Cost after iteration {}: {}".format(i, cost))
			costs.append(cost)
		#backward propagation
		grads = backward_propagation(AL, Y, caches)
		#update parameters
		parameters = update_parameters(parameters, grads, learning_rate)
	print('length of cost')
	print(len(costs))
	plt.clf()
	plt.plot(costs)
	plt.xlabel("iterations (thousands)")  # x-axis label
	plt.ylabel("cost")  # y-axis label
	plt.show()
	return parameters,bn_param

#fp for test
def forward_propagation_for_test(X, parameters, bn_param, epsilon = 1e-12):
	"""
	X -- input dataset, of shape (input size, number of examples)
    parameters -- python dictionary containing your parameters "W1", "b1", "gamma1","beta1",W2", "b2","gamma2","beta2",...,"WL", "bL","gammaL","betaL"
                    W -- weight matrix of shape (size of current layer, size of previous layer)
                    b -- bias vector of shape (size of current layer,1)
                    gamma -- scale vector of shape (size of current layer ,1)
                    beta -- offset vector of shape (size of current layer ,1)
                    decay -- the parameter of exponential weight average
                    moving_mean = decay * moving_mean + (1 - decay) * current_mean
                    moving_var = decay * moving_var + (1 - decay) * current_var
                    the moving_mean and moving_var are used for test
    :return:
	AL: the output of the last Layer(y_predict)
	caches: list, every element is a tuple:(A, W,b,gamma,sqrt_var,z,Z_norm)
	"""
	L = len(parameters) // 4  # number of layer
	A = X
	# calculate from 1 to L-1 layer
	for l in range(1,L):
		W = parameters["W" + str(l)]
		b = parameters["b" + str(l)]
		gamma = parameters["gamma" + str(l)]
		beta = parameters["beta" + str(l)]
		z = linear_forward(A, W, b)
		#batch normalization
		# exponential weight average
		moving_mean = bn_param["moving_mean" + str(l)]
		moving_var = bn_param["moving_var" + str(l)]
		sqrt_var = np.sqrt(moving_var + epsilon)
		z_norm = (z - moving_mean) / sqrt_var
		z_out = np.multiply(gamma, z_norm) + beta  # element-wise multiplication (scale and shift)
		#relu forward
		A = relu_forward(z_out) #relu activation function

	# calculate Lth layer(last layer)
	WL = parameters["W" + str(L)]
	bL = parameters["b" + str(L)]
	zL = linear_forward(A, WL, bL)
	AL = sigmoid_forward(zL)
	return AL
	
#predict function
def predict(X_test, y_test, parameters, bn_param):
	"""
	:param X_test: test features
	:param y_test: test labels
	:param parameters: trained parameters (W, b, gamma, beta)
	:param bn_param: moving_mean and moving_var collected during training
	:return: accuracy on the test set
	"""
	m = y_test.shape[1]
	Y_prediction = np.zeros((1, m))
	prob = forward_propagation_for_test(X_test, parameters, bn_param)
	for i in range(prob.shape[1]):
		# Convert probabilities A[0,i] to actual predictions p[0,i]
		if prob[0, i] > 0.5:
			Y_prediction[0, i] = 1
		else:
			Y_prediction[0, i] = 0
	accuracy = 1- np.mean(np.abs(Y_prediction - y_test))
	return accuracy

#DNN model
def DNN(X_train, y_train, X_test, y_test, layer_dims, learning_rate= 0.01, num_iterations=30000):
	parameters, bn_param = L_layer_model(X_train, y_train, layer_dims, learning_rate, num_iterations)
	accuracy = predict(X_test,y_test,parameters,bn_param)
	return accuracy


if __name__ == "__main__":
	X_data, y_data = load_breast_cancer(return_X_y=True)
	X_train, X_test,y_train,y_test = train_test_split(X_data, y_data, train_size=0.8,test_size=0.2,random_state=28)
	X_train = X_train.T
	y_train = y_train.reshape(y_train.shape[0], -1).T
	X_test = X_test.T
	y_test = y_test.reshape(y_test.shape[0], -1).T
	accuracy = DNN(X_train,y_train,X_test,y_test,[X_train.shape[0],10,5,1])
	print(accuracy)

I tested this on the breast_cancer dataset from scikit-learn; the test accuracy was 96.49%, and the cost curve during training is shown below: