ML (5) NN: Learning

Neural network:

Cost function: for a network with \(K\) output units, \[h_Θ(x) \in \Bbb{R}^K \quad (h_Θ(x))_i = i^{\text{th}} \text{ output} \]

\[J(Θ) = -\frac 1m\left[\sum^m_{i=1}\sum^K_{k=1}y^{(i)}_k\log\left((h_Θ(x^{(i)}))_k\right)+(1-y^{(i)}_k)\log\left(1-(h_Θ(x^{(i)}))_k\right)\right]+\frac{\lambda}{2m}\sum^{L-1}_{l=1}\sum^{s_l}_{i=1}\sum^{s_{l+1}}_{j=1}(Θ^{(l)}_{ji})^2\]
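Note that the inner sum over \(k\) is just the logistic regression cost applied to each of the \(K\) output units, and that the regularization term sums over all weights except the bias weights \(Θ^{(l)}_{j0}\).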

Training a neural network:

  1. Randomly initialize weights

  2. Implement forward propagation to get \(h_{Θ}(x^{(i)})\) for any \(x^{(i)}\)

  3. Implement code to compute the cost function \(J(Θ)\) (a minimal Octave sketch of steps 2 and 3 follows)
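    • A minimal Octave sketch of steps 2 and 3 for a 3-layer network; the names Theta1, Theta2, X, Y, lambda and the use of the sigmoid activation \(g\) are assumptions for illustration:

      % Forward propagation + regularized cost for a 3-layer network (sketch)
      sigmoid = @(z) 1 ./ (1 + exp(-z));   % g(z)
      m  = size(X, 1);                     % X: m x n inputs, Y: m x K labels
      a1 = [ones(m, 1) X];                 % add bias units
      z2 = a1 * Theta1';
      a2 = [ones(m, 1) sigmoid(z2)];       % add bias units
      z3 = a2 * Theta2';
      h  = sigmoid(z3);                    % h_Theta(x), one row per example
      J  = (-1/m) * sum(sum(Y .* log(h) + (1 - Y) .* log(1 - h)));
      J  = J + (lambda/(2*m)) * (sum(sum(Theta1(:, 2:end).^2)) + sum(sum(Theta2(:, 2:end).^2)));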

  4. Implement backpropagation to compute the partial derivatives \[\frac{\partial}{\partial Θ^{(l)}_{jk}}J(Θ)\] (an Octave sketch follows the sub-steps below)

    • Training set \[\left\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\right\}\]
    • Set \(\Delta^{(l)}_{ij}=0\) (for all \(l,i,j\))

    • For i = 1 to m
      • Set \(a^{(1)} = x^{(i)}\)
      • Perform forward propagation to compute \(a^{(l)}\) for \(l = 2,3,\ldots,L\)

      \(a^{(1)} = x\), \(z^{(2)} = Θ^{(1)}a^{(1)}\), \(a^{(2)} = g(z^{(2)})\) (add \(a^{(2)}_0\))

      • Using \(y^{(i)}\), compute \(\delta^{(L)}=a^{(L)}-y^{(i)}\)
      • Compute \(\delta^{(L-1)},\delta^{(L-2)},\ldots,\delta^{(2)}\)

      \(\delta^{(4)}_j=a^{(4)}_j-y_j\), \(\delta^{(3)}=(Θ^{(3)})^T\delta^{(4)}.*g^{\prime}(z^{(3)})\), \(\delta^{(2)}=(Θ^{(2)})^T\delta^{(3)}.*g^{\prime}(z^{(2)})\)

      • \[ \Delta ^{(l)}_{ij}:= \Delta ^{(l)}_{ij}+a^{(l)}_j \delta ^{(l+1)}_i\]

    • After the loop: \[\frac{\partial}{\partial \Theta^{(l)}_{ij}}J(\Theta)=D^{(l)}_{ij}=\frac1m\Delta^{(l)}_{ij} \qquad \text{for } j=0\]

    • \[\frac{\partial}{\partial \Theta^{(l)}_{ij}}J(\Theta)=D^{(l)}_{ij}=\frac1m\Delta^{(l)}_{ij}+\frac{\lambda}m\Theta^{(l)}_{ij} \qquad \text{for } j\geq1\]
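    • A minimal Octave sketch of the backpropagation procedure above for a 3-layer network; the names Theta1, Theta2, X, Y, lambda and the sigmoid activation are assumptions for illustration (so g'(z) is computed as sigmoid(z).*(1-sigmoid(z))):

      % Backpropagation (sketch): accumulate Delta over the training set, then form D
      sigmoid = @(z) 1 ./ (1 + exp(-z));
      m = size(X, 1);                          % one example per row of X, one label vector per row of Y
      Delta1 = zeros(size(Theta1));
      Delta2 = zeros(size(Theta2));
      for i = 1:m,
        a1 = [1; X(i, :)'];                    % a^(1) with bias unit
        z2 = Theta1 * a1;
        a2 = [1; sigmoid(z2)];                 % a^(2) with bias unit
        a3 = sigmoid(Theta2 * a2);             % a^(3) = h_Theta(x^(i))
        d3 = a3 - Y(i, :)';                    % delta^(3)
        d2 = Theta2' * d3;
        d2 = d2(2:end) .* sigmoid(z2) .* (1 - sigmoid(z2));   % delta^(2), bias entry dropped
        Delta1 = Delta1 + d2 * a1';
        Delta2 = Delta2 + d3 * a2';
      end;
      D1 = Delta1 / m;  D2 = Delta2 / m;
      D1(:, 2:end) = D1(:, 2:end) + (lambda/m) * Theta1(:, 2:end);   % regularize j >= 1 only
      D2(:, 2:end) = D2(:, 2:end) + (lambda/m) * Theta2(:, 2:end);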

  5. Use gradient checking to compare \(\frac{\partial}{\partial Θ^{(l)}_{jk}}J(Θ)\) computed using backpropagation vs. using a numerical estimate of the gradient of \(J(Θ)\)
    • Parameter vector \(\theta\)
      • \(\theta \in \Bbb{R}^n\) (e.g. \(\theta\) is the "unrolled" version of \(Θ^{(1)}, Θ^{(2)}, Θ^{(3)}\))
      • \(\theta = \theta_1, \theta_2, \ldots, \theta_n\)
      • \(\frac{\partial}{\partial\theta_i}J(\theta) \approx \frac{J(\theta_1,\ldots,\theta_i+\varepsilon,\ldots,\theta_n)-J(\theta_1,\ldots,\theta_i-\varepsilon,\ldots,\theta_n)}{2\varepsilon}\)
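      • In Octave, assuming theta is the unrolled parameter vector (e.g. theta = [Theta1(:); Theta2(:)]) and J(theta) computes the cost from it, the estimate above can be computed element by element: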
        % Numerically estimate each component of the gradient of J at theta
        for i = 1:n,
          thetaPlus = theta;
          thetaPlus(i) = thetaPlus(i) + EPSILON;
          thetaMinus = theta;
          thetaMinus(i) = thetaMinus(i) - EPSILON;
          gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*EPSILON);
        end;
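      • Check that gradApprox agrees (to several decimal places) with the unrolled \(D^{(l)}\) values from backpropagation, then turn the gradient checking code off before training, since it is much slower than backpropagation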
  6. Use gradient descent or an advanced optimization method with backpropagation to minimize \(J(Θ)\) as a function of the parameters \(Θ\)
    • Initialize each element of \(\Theta\) to a random value in \(\left[-\varepsilon, \varepsilon \right]\)
    • E.g.
      Theta1 = rand(10,11)*(2*INIT_EPSILON) - INIT_EPSILON;
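    • Each weight matrix is initialized the same way so that symmetry between hidden units is broken; e.g., for a hypothetical second layer mapping the 10 hidden units (plus bias) to a single output:
      Theta2 = rand(1,11)*(2*INIT_EPSILON) - INIT_EPSILON;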