ML (5) NN: Learning

Neural network:

Cost function: for a network with \(K\) output units, \[h_Θ(x) \in \Bbb{R}^K \quad (h_Θ(x))_i = i^{\text{th}} \text{ output} \]

\[J(Θ) = -\frac 1m\left[\sum^m_{i=1}\sum^K_{k=1}y^{(i)}_k\log\left((h_Θ(x^{(i)}))_k\right)+(1-y^{(i)}_k)\log\left(1-(h_Θ(x^{(i)}))_k\right)\right]+\frac{\lambda}{2m}\sum^{L-1}_{l=1}\sum^{s_l}_{i=1}\sum^{s_{l+1}}_{j=1}(Θ^{(l)}_{ji})^2\]
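Note that the inner sum over \(k\) is just the logistic regression cost applied to each of the \(K\) output units, and that the regularization term sums over all weights except the bias weights \(Θ^{(l)}_{j0}\).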

Training a neural network:

  1. Randomly initialize weights

  2. Implement forward propagation to get \(h_{Θ}(x^{(i)})\) for any \(x^{(i)}\)

  3. Implement code to compute the cost function \(J(Θ)\) (a minimal Octave sketch of steps 2 and 3 follows)
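    • A minimal Octave sketch of steps 2 and 3 for a 3-layer network; the names Theta1, Theta2, X, Y, lambda and the use of the sigmoid activation \(g\) are assumptions for illustration:

      % Forward propagation + regularized cost for a 3-layer network (sketch)
      sigmoid = @(z) 1 ./ (1 + exp(-z));   % g(z)
      m  = size(X, 1);                     % X: m x n inputs, Y: m x K labels
      a1 = [ones(m, 1) X];                 % add bias units
      z2 = a1 * Theta1';
      a2 = [ones(m, 1) sigmoid(z2)];       % add bias units
      z3 = a2 * Theta2';
      h  = sigmoid(z3);                    % h_Theta(x), one row per example
      J  = (-1/m) * sum(sum(Y .* log(h) + (1 - Y) .* log(1 - h)));
      J  = J + (lambda/(2*m)) * (sum(sum(Theta1(:, 2:end).^2)) + sum(sum(Theta2(:, 2:end).^2)));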

  4. Implement backpropagation to compute the partial derivatives \[\frac{\partial}{\partial Θ^{(l)}_{jk}}J(Θ)\] (an Octave sketch follows the sub-steps below)

    • Training set \[\left\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\right\}\]
    • Set \(\Delta^{(l)}_{ij}=0\) (for all \(l,i,j\))

    • For i = 1 to m
      • Set \(a^{(1)} = x^{(i)}\)
      • Perform forward propagation to compute \(a^{(l)}\) for \(l = 2,3,\ldots,L\)

      \(a^{(1)} = x\), \(z^{(2)} = Θ^{(1)}a^{(1)}\), \(a^{(2)} = g(z^{(2)})\) (add \(a^{(2)}_0\))

      • Using \(y^{(i)}\), compute \(\delta^{(L)}=a^{(L)}-y^{(i)}\)
      • Compute \(\delta^{(L-1)},\delta^{(L-2)},\ldots,\delta^{(2)}\)

      \(\delta^{(4)}_j=a^{(4)}_j-y_j\), \(\delta^{(3)}=(Θ^{(3)})^T\delta^{(4)}.*g^{\prime}(z^{(3)})\), \(\delta^{(2)}=(Θ^{(2)})^T\delta^{(3)}.*g^{\prime}(z^{(2)})\)

      • \[ \Delta ^{(l)}_{ij}:= \Delta ^{(l)}_{ij}+a^{(l)}_j \delta ^{(l+1)}_i\]

    • After the loop: \[\frac{\partial}{\partial \Theta^{(l)}_{ij}}J(\Theta)=D^{(l)}_{ij}=\frac1m\Delta^{(l)}_{ij} \qquad \text{for } j=0\]

    • \[\frac{\partial}{\partial \Theta^{(l)}_{ij}}J(\Theta)=D^{(l)}_{ij}=\frac1m\Delta^{(l)}_{ij}+\frac{\lambda}m\Theta^{(l)}_{ij} \qquad \text{for } j\geq1\]
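    • A minimal Octave sketch of the backpropagation procedure above for a 3-layer network; the names Theta1, Theta2, X, Y, lambda and the sigmoid activation are assumptions for illustration (so g'(z) is computed as sigmoid(z).*(1-sigmoid(z))):

      % Backpropagation (sketch): accumulate Delta over the training set, then form D
      sigmoid = @(z) 1 ./ (1 + exp(-z));
      m = size(X, 1);                          % one example per row of X, one label vector per row of Y
      Delta1 = zeros(size(Theta1));
      Delta2 = zeros(size(Theta2));
      for i = 1:m,
        a1 = [1; X(i, :)'];                    % a^(1) with bias unit
        z2 = Theta1 * a1;
        a2 = [1; sigmoid(z2)];                 % a^(2) with bias unit
        a3 = sigmoid(Theta2 * a2);             % a^(3) = h_Theta(x^(i))
        d3 = a3 - Y(i, :)';                    % delta^(3)
        d2 = Theta2' * d3;
        d2 = d2(2:end) .* sigmoid(z2) .* (1 - sigmoid(z2));   % delta^(2), bias entry dropped
        Delta1 = Delta1 + d2 * a1';
        Delta2 = Delta2 + d3 * a2';
      end;
      D1 = Delta1 / m;  D2 = Delta2 / m;
      D1(:, 2:end) = D1(:, 2:end) + (lambda/m) * Theta1(:, 2:end);   % regularize j >= 1 only
      D2(:, 2:end) = D2(:, 2:end) + (lambda/m) * Theta2(:, 2:end);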

  5. Use gradient checking to compare \(\frac{\partial}{\partial Θ^{(l)}_{jk}}J(Θ)\) computed using backpropagation vs. using a numerical estimate of the gradient of \(J(Θ)\)
    • Parameter vector \(\theta\)
      • \(\theta \in \Bbb{R}^n\) (e.g. \(\theta\) is the "unrolled" version of \(Θ^{(1)}, Θ^{(2)}, Θ^{(3)}\))
      • \(\theta = \theta_1, \theta_2, \ldots, \theta_n\)
      • \(\frac{\partial}{\partial\theta_i}J(\theta) \approx \frac{J(\theta_1,\ldots,\theta_i+\varepsilon,\ldots,\theta_n)-J(\theta_1,\ldots,\theta_i-\varepsilon,\ldots,\theta_n)}{2\varepsilon}\)
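      • In Octave, assuming theta is the unrolled parameter vector (e.g. theta = [Theta1(:); Theta2(:)]) and J(theta) computes the cost from it, the estimate above can be computed element by element: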
        % Numerically estimate each component of the gradient of J at theta
        for i = 1:n,
          thetaPlus = theta;
          thetaPlus(i) = thetaPlus(i) + EPSILON;
          thetaMinus = theta;
          thetaMinus(i) = thetaMinus(i) - EPSILON;
          gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*EPSILON);
        end;
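      • Check that gradApprox agrees (to several decimal places) with the unrolled \(D^{(l)}\) values from backpropagation, then turn the gradient checking code off before training, since it is much slower than backpropagation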
  6. Use gradient descent or an advanced optimization method with backpropagation to minimize \(J(Θ)\) as a function of the parameters \(Θ\)
    • Initialize each element of \(\Theta\) to a random value in \(\left[-\varepsilon, \varepsilon \right]\)
    • E.g.
      Theta1 = rand(10,11)*(2*INIT_EPSILON) - INIT_EPSILON;
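    • Each weight matrix is initialized the same way so that symmetry between hidden units is broken; e.g., for a hypothetical second layer mapping the 10 hidden units (plus bias) to a single output:
      Theta2 = rand(1,11)*(2*INIT_EPSILON) - INIT_EPSILON;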