ML(5) NN:Learning
Neural network:
cost function \[h_\Theta(x) \in \Bbb{R}^K \qquad (h_\Theta(x))_i = i^{\text{th}} \text{ output}\]
\[J(\Theta) = -\frac 1m\left[\sum^m_{i=1}\sum^K_{k=1}y^{(i)}_k\log\left(h_\Theta(x^{(i)})\right)_k+(1-y^{(i)}_k)\log\left(1-(h_\Theta(x^{(i)}))_k\right) \right]+\frac{\lambda}{2m}\sum^{L-1}_{l=1}\sum^{s_l}_{i=1}\sum^{s_{l+1}}_{j=1}(\Theta^{(l)}_{ji})^2\]
1. Randomly initialize weights
2. Implement forward propagation to get \(h_{\Theta}(x^{(i)})\) for any \(x^{(i)}\)
3. Implement code to compute the cost function \(J(\Theta)\) (see the sketch after this list)
4. Implement backpropagation to compute the partial derivatives \[\frac{\partial}{\partial \Theta^{(l)}_{jk}}J(\Theta)\]
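A minimal Octave sketch of steps 2–3 for a single hidden layer (the names X, Y, Theta1, Theta2, lambda and the layer sizes are assumptions, not fixed by these notes; Y holds the labels as one-hot rows):

% Forward propagation and regularized cost, vectorized over all m examples
g = @(z) 1 ./ (1 + exp(-z));      % sigmoid activation
m = size(X, 1);

a1 = [ones(m,1) X];               % add bias units a_0^(1)
z2 = a1 * Theta1';
a2 = [ones(m,1) g(z2)];           % add bias units a_0^(2)
z3 = a2 * Theta2';
h  = g(z3);                       % m x K matrix of h_Theta(x^(i))

% Unregularized cost: sum over examples i and output units k
J = -(1/m) * sum(sum( Y .* log(h) + (1 - Y) .* log(1 - h) ));

% Regularization: skip the bias columns (the j = 0 weights)
J = J + (lambda/(2*m)) * ( sum(sum(Theta1(:,2:end).^2)) ...
                         + sum(sum(Theta2(:,2:end).^2)) );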
- Training set \[\left\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\right\}\]
Set \(\Delta^{(l)}_{ij}=0\) (for all \(l,i,j\))
- For i = 1 to m
- Set \(a^{(1)} = x^{(i)}\)
- Perform forward propagation to compute \(a^{(l)}\) for \(l = 2,3,\ldots,L\)
\(a^{(1)} = x,\quad z^{(2)} = \Theta^{(1)}a^{(1)},\quad a^{(2)} = g(z^{(2)})\ (\text{add } a^{(2)}_0),\ \ldots,\ a^{(L)} = h_\Theta(x) = g(z^{(L)})\)
- Using \(y^{(i)}\), compute \(\delta^{(L)}=a^{(L)}-y^{(i)}\)
- Compute \(\delta^{(L-1)},\delta^{(L-2)},\ldots,\delta^{(2)}\)
\(\delta^{(4)}_j=a^{(4)}_j-y_j\) \(\quad\delta^{(3)}=(\Theta^{(3)})^T\delta^{(4)}.*g^{\prime}(z^{(3)})\) \(\quad\delta^{(2)}=(\Theta^{(2)})^T\delta^{(3)}.*g^{\prime}(z^{(2)})\)
\[ \Delta ^{(l)}_{ij}:= \Delta ^{(l)}_{ij}+a^{(l)}_j \delta ^{(l+1)}_i\]
\[\frac{\partial}{\partial \Theta^{(l)}_{ij}}J(\Theta)=D^{(l)}_{ij}=\frac1m\Delta^{(l)}_{ij} \qquad \text{for } j=0\]
\[\frac{\partial}{\partial \Theta^{(l)}_{ij}}J(\Theta)=D^{(l)}_{ij}=\frac1m\Delta^{(l)}_{ij}+\frac{\lambda}m\Theta^{(l)}_{ij} \qquad \text{for } j\geq1\]
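A loop-based Octave sketch of the algorithm above for a network with one hidden layer (again, the names Theta1, Theta2, X, Y, lambda and the sizes are assumptions; for the sigmoid, \(g^{\prime}(z) = g(z).*(1-g(z))\)):

g      = @(z) 1 ./ (1 + exp(-z));
gprime = @(z) g(z) .* (1 - g(z));   % sigmoid gradient g'(z)
m      = size(X, 1);

Delta1 = zeros(size(Theta1));       % accumulators Delta^(l)
Delta2 = zeros(size(Theta2));

for i = 1:m
  % Forward propagation for example i
  a1 = [1; X(i,:)'];                % set a^(1) = x^(i), add bias
  z2 = Theta1 * a1;
  a2 = [1; g(z2)];                  % add bias a_0^(2)
  z3 = Theta2 * a2;
  a3 = g(z3);                       % = h_Theta(x^(i))

  % Backpropagate the error terms
  d3 = a3 - Y(i,:)';                % delta^(L) = a^(L) - y^(i)
  d2 = (Theta2' * d3) .* [1; gprime(z2)];
  d2 = d2(2:end);                   % drop the bias component

  % Delta^(l) := Delta^(l) + delta^(l+1) * (a^(l))'
  Delta1 = Delta1 + d2 * a1';
  Delta2 = Delta2 + d3 * a2';
end

% Partial derivatives D^(l); no regularization for the bias column (j = 0)
D1 = Delta1 / m;   D1(:,2:end) = D1(:,2:end) + (lambda/m) * Theta1(:,2:end);
D2 = Delta2 / m;   D2(:,2:end) = D2(:,2:end) + (lambda/m) * Theta2(:,2:end);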
- For i = 1 to m, perform forward propagation and backpropagation using example \((x^{(i)}, y^{(i)})\)
- Use gradient checking to compare \(\frac{\partial}{\partial \Theta^{(l)}_{jk}}J(\Theta)\) computed using backpropagation vs. using a numerical estimate of the gradient of \(J(\Theta)\); then disable the gradient checking code
- Parameter vector \(\theta\)
- \(\theta \in \Bbb{R}^n\) (e.g. \(\theta\) is the "unrolled" version of \(\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}\))
- \(\theta = \theta_1, \theta_2, \ldots, \theta_n\)
- \(\frac{\partial}{\partial\theta_2}J(\theta) \approx \frac{J(\theta_1,\theta_2+\varepsilon,\ldots,\theta_n)-J(\theta_1,\theta_2-\varepsilon,\ldots,\theta_n)}{2\varepsilon}\)
% Two-sided numerical estimate of each partial derivative of J at theta
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + EPSILON;
  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - EPSILON;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*EPSILON);
end;
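One common way to do the comparison (DVec is an assumed name for the unrolled backpropagation gradient, e.g. the D1, D2 from the sketch above) is a relative-difference check; disable the checking code afterwards because it is very slow:

DVec = [D1(:); D2(:)];                             % unrolled backprop gradient (assumed names)
relDiff = norm(gradApprox(:) - DVec) / norm(gradApprox(:) + DVec);   % (:) forces a column vector
fprintf('relative difference = %g\n', relDiff);    % should be very small, e.g. < 1e-9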
- Use gradient descent or an advanced optimization method with backpropagation to minimize \(J(\Theta)\) as a function of the parameters \(\Theta\) (see the sketch at the end of this section)
- Initialize each \(\Theta^{(l)}_{ij}\) to a random value in \(\left[-\varepsilon, \varepsilon \right]\)
- E.g.
Theta1 = rand(10,11)*(2*INIT_EPSILON) - INIT_EPSILON;   % uniform random values in [-INIT_EPSILON, INIT_EPSILON]
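To finish, a hedged sketch of unrolling the randomly initialized matrices and handing them to an advanced optimizer (the second matrix Theta2 with its 1 x 11 size, and the cost function nnCostFunction, are assumptions; with 'GradObj' on, Octave's fminunc expects the handle to return both the cost and the unrolled gradient):

Theta2 = rand(1,11)*(2*INIT_EPSILON) - INIT_EPSILON;   % assumed second weight matrix
initialTheta = [Theta1(:); Theta2(:)];                 % unroll into a single parameter vector

options = optimset('GradObj', 'on', 'MaxIter', 100);
costFun = @(t) nnCostFunction(t, X, y, lambda);        % hypothetical: returns [J, gradient]
[optTheta, finalCost] = fminunc(costFun, initialTheta, options);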