Deep learning

💡
Neural network with N hidden layers = deep learning

Function with unknown parameters → define a loss from training data → optimization

$y = f_{\theta}(x)$ → $L(\theta)$ → $\theta^* = \arg\min_\theta L(\theta)$

Regression
The function outputs a scalar

Model → $y = b + wx_1$ (linear model)

Model → $y = b + \sum_j w_j x_j$ (more features from domain knowledge)

Classification
The function outputs the correct choice from given options (classes)
  • Softmax

    $y' = \mathrm{softmax}(y)$ (normalize $y$ so that $0 < y_i' < 1$)

    $y_i' = \dfrac{\exp(y_i)}{\sum\limits_j \exp(y_j)}$

    $\sum\limits_i y_i' = 1$

  • Mean Square Error

    $e = \sum\limits_i (\hat y_i - y_i')^2$

  • Cross-entropy

    $e = -\sum\limits_i \hat y_i \ln(y_i')$

    Best suited for classification

Minimizing cross-entropy == maximizing likelihood

$\mathrm{softmax}$ with 2 inputs == $\mathrm{sigmoid}$
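A minimal numpy sketch of softmax plus cross-entropy (the function names and example values are my own, not from the lecture):

```python
import numpy as np

def softmax(y):
    """Normalize logits y so each entry is in (0, 1) and they sum to 1."""
    e = np.exp(y - y.max())  # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(y_hat, y_prime):
    """e = -sum_i y_hat_i * ln(y'_i), with y_hat a one-hot label."""
    return -np.sum(y_hat * np.log(y_prime))

y = np.array([2.0, 1.0, 0.1])      # raw network outputs
y_hat = np.array([1.0, 0.0, 0.0])  # one-hot label
print(cross_entropy(y_hat, softmax(y)))
```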

Loss
A function of the parameters: how good a set of values is.

$\hat y$ → label

$e_i = |y_i - \hat y_i|$ → Mean Absolute Error

$e_i = (y_i - \hat y_i)^2$ → Mean Square Error

$L(b, w) = \dfrac{1}{N}\sum_i e_i$ → loss (the smaller the better)
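A quick sketch of both error measures (names and values are my own):

```python
import numpy as np

def mae(y, y_hat):
    """Mean Absolute Error: (1/N) * sum |y_i - y_hat_i|."""
    return np.mean(np.abs(y - y_hat))

def mse(y, y_hat):
    """Mean Square Error: (1/N) * sum (y_i - y_hat_i)^2."""
    return np.mean((y - y_hat) ** 2)

y = np.array([1.1, 2.0, 2.9])      # predictions
y_hat = np.array([1.0, 2.0, 3.0])  # labels
print(mae(y, y_hat), mse(y, y_hat))
```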

Optimization
  • Gradient Descent

    $\theta^* = \arg\min_\theta L(\theta)$

    $\eta$ → learning rate (hyperparameter)

    $\eta \dfrac{\partial L}{\partial \theta}$ → gradient descent step (only finds local minima)

    $\theta^1 = \theta^0 - \eta \left.\dfrac{\partial L}{\partial \theta}\right|_{\theta=\theta^0}$ (move in the reverse direction of the gradient)

    $g = \nabla L(\theta^0)$

    If the gradient is negative, increase $\theta$.

    $\theta^1 = \theta^0 - \eta g$

    Hyperparameter: a value we set ourselves.

    1 epoch = $N / \text{batch size}$ updates ($N$ = number of training examples)
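A one-parameter gradient descent sketch; the quadratic loss is a toy example of my own:

```python
# Minimize L(theta) = (theta - 3)^2 by gradient descent.
eta = 0.1    # learning rate (hyperparameter)
theta = 0.0  # theta^0: initial value

for step in range(50):
    g = 2 * (theta - 3)      # dL/dtheta at the current theta
    theta = theta - eta * g  # theta^{t+1} = theta^t - eta * g

print(theta)  # converges to ~3.0, the minimum
```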

Critical points
Approximate the loss function near $\theta = \theta'$ with a Taylor series:

$L(\theta) \approx L(\theta') + (\theta - \theta')^T g + \dfrac{1}{2} (\theta - \theta')^T H (\theta - \theta')$

Let $v = \theta - \theta'$: $L(\theta) \approx L(\theta') + v^T g + \dfrac{1}{2} v^T H v$

  • $g$ → gradient vector

    $g = \nabla L(\theta')$, $g_i = \dfrac{\partial L(\theta')}{\partial \theta_i}$

    If $g = 0$ → at a critical point

  • $H$ → Hessian matrix

    $H_{ij} = \dfrac{\partial^2}{\partial \theta_i \partial \theta_j} L(\theta')$

    Tells the properties of the critical point & gives an update direction

  • Local minima

    $v^T H v > 0$, for all $v$

    $H$ is positive definite → all eigenvalues are positive

  • Local maxima

    $v^T H v < 0$, for all $v$

    $H$ is negative definite → all eigenvalues are negative

  • Saddle point

    otherwise ($H$ is indefinite: some eigenvalues positive, some negative)

    $u^T H u = u^T (\lambda u) = \lambda \|u\|^2$, where $u$ is an eigenvector of $H$ with eigenvalue $\lambda$

    $\theta - \theta' = u$ → $\theta = \theta' + u$

    Update direction: for an eigenvalue $\lambda_i < 0$, move along its eigenvector $u_i$ (then $u_i^T H u_i < 0$, so the loss decreases)

💡
Eigenvalue $\lambda$: $\det(A - \lambda I) = 0$
Eigenvector $u_i$: $(A - \lambda_i I) u_i = 0$
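A small numpy sketch that classifies a critical point from the Hessian's eigenvalues; the matrix is a made-up example:

```python
import numpy as np

# Hypothetical Hessian at a critical point (g = 0).
H = np.array([[2.0, 0.0],
              [0.0, -1.0]])

eigvals, eigvecs = np.linalg.eigh(H)  # eigh: for symmetric matrices

if np.all(eigvals > 0):
    print("local minimum (H positive definite)")
elif np.all(eigvals < 0):
    print("local maximum (H negative definite)")
else:
    print("saddle point")
    u = eigvecs[:, np.argmin(eigvals)]  # eigenvector with lambda < 0
    print("update direction u:", u)     # move to theta = theta' + u
```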

Batch (a hyperparameter)
1 epoch = see all batches once → shuffle after each epoch (then divide into batches again)
  • Batch size = N (full batch)
    • long time for 1 update
    • Faster for 1 epoch

  • Batch size = 1
    • Noisy, more updates
    • Faster for 1 update
    • The noise is better for both training & testing

With parallel computing (GPU), a large batch is not much slower per update.
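A sketch of the epoch/batch loop described above (the data and the commented-out update are placeholders):

```python
import numpy as np

N, batch_size = 1000, 32
data = np.random.randn(N, 16)  # toy training set

for epoch in range(10):
    idx = np.random.permutation(N)         # shuffle after each epoch
    for start in range(0, N, batch_size):  # 1 epoch = see all batches once
        batch = data[idx[start:start + batch_size]]
        # update(theta, batch)             # one update per batch
```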

Momentum
Mimic real world physics

Movement ($m$): combines the movement of the last step with the gradient at present; $m^0 = 0$

$m^1 = \lambda m^0 - \eta g^0$ (movement) → a weighted sum of all past gradients

$\theta^1 = \theta^0 + m^1$ (move to)
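A minimal momentum sketch on the same toy quadratic loss as before:

```python
# Momentum on L(theta) = (theta - 3)^2.
eta, lam = 0.1, 0.9  # learning rate eta and momentum coefficient lambda
theta, m = 0.0, 0.0  # theta^0 and m^0 = 0

for step in range(100):
    g = 2 * (theta - 3)    # gradient at present
    m = lam * m - eta * g  # m^{t+1} = lambda * m^t - eta * g^t
    theta = theta + m      # theta^{t+1} = theta^t + m^{t+1}

print(theta)  # ~3.0
```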

Adaptive learning rate

$\theta_i^{t+1} = \theta_i^t - \dfrac{\eta}{\sigma_i^t} g_i^t$ (parameter- and time-dependent)

  • Root mean square

    $\sigma_i^t = \sqrt{\dfrac{1}{t+1}\sum\limits_{k=0}^{t} (g_i^k)^2}$

    If $g$ is big → $\sigma$ is big → the effective learning rate $\eta/\sigma_i^t$ decreases

  • RMSProp

    $\sigma_i^t = \sqrt{\alpha (\sigma_i^{t-1})^2 + (1 - \alpha)(g_i^t)^2}$

    $0 < \alpha < 1$ (decides the importance of the previous $\sigma$ versus the new gradient)

    Small $\alpha$ → fast reaction to new $g$

  • Scheduling

    Accumulated small gradients make $\sigma$ small, so the step $\eta/\sigma$ can suddenly burst

    • Learning rate decay
    • Warm up

Adam
RMSProp + Momentum

$\theta_i^{t+1} = \theta_i^t - \dfrac{\eta^t}{\sigma_i^t} m_i^t$

$\eta^t$ → scheduled $\eta$

$m$ → previous direction of $g$ (momentum)

$\sigma$ → previous magnitude of $g$ (RMSProp)
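A sketch of an Adam-style update combining the two pieces above. I use the common exponentially weighted form for $m$ and omit the bias-correction terms of the full Adam paper:

```python
import numpy as np

eta, lam, alpha, eps = 0.01, 0.9, 0.99, 1e-8
theta = np.zeros(2)
m = np.zeros(2)      # momentum: direction of past gradients
sigma = np.zeros(2)  # RMSProp: magnitude of past gradients

def grad(theta):
    """Gradient of a toy quadratic loss with minimum at (3, -1)."""
    return 2 * (theta - np.array([3.0, -1.0]))

for t in range(2000):
    g = grad(theta)
    m = lam * m + (1 - lam) * g                             # momentum
    sigma = np.sqrt(alpha * sigma**2 + (1 - alpha) * g**2)  # RMSProp
    theta = theta - eta / (sigma + eps) * m                 # update

print(theta)  # close to [3, -1]
```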

Batch Normalization
Smooths the error surface

$w_i + \Delta w_i$ → $L + \Delta L$

a large $x_i$ has a greater effect

  • Feature normalization
    • $i$ → dimension
    • $\mu_i$ → mean
    • $\sigma_i$ → standard deviation

      $\tilde x_i^r = \dfrac{x_i^r - \mu_i}{\sigma_i}$ (all dimensions have mean 0 and variance 1)

    If the desired output ≠ 0 → add network parameters

    $\hat x_i = \gamma \odot \tilde x_i + \beta$

    • $\odot$ → element-wise multiplication
    • $\gamma$ → initially a vector of ones (until a good error surface is found)
    • $\beta$ → initially a vector of zeros (until a good error surface is found)
    • Testing stage

      Moving average of training

      $\tilde x_i^r = \dfrac{x_i^r - \bar\mu_i}{\bar\sigma_i}$

      $\bar\mu = p\,\bar\mu + (1 - p)\,\mu^t$
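A minimal numpy sketch of feature normalization with the learnable scale/shift and the test-time moving average (all names and shapes are my own):

```python
import numpy as np

x = np.random.randn(32, 4) * 5 + 2  # a batch: 32 examples, 4 dimensions
gamma = np.ones(4)                  # scale, initially a vector of ones
beta = np.zeros(4)                  # shift, initially a vector of zeros

# Training stage: normalize each dimension to mean 0, variance 1.
mu, sigma = x.mean(axis=0), x.std(axis=0)
x_tilde = (x - mu) / sigma
x_hat = gamma * x_tilde + beta      # element-wise scale and shift

# Keep moving averages for the testing stage.
p = 0.99
mu_bar, sigma_bar = np.zeros(4), np.ones(4)
mu_bar = p * mu_bar + (1 - p) * mu
sigma_bar = p * sigma_bar + (1 - p) * sigma

# Testing stage: use the moving averages, not batch statistics.
x_test = np.random.randn(1, 4)
x_test_tilde = (x_test - mu_bar) / sigma_bar
```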

Models
  • Linear
  • Piecewise linear: a sum of sigmoid (activation) functions {neurons}

$y_n = c \cdot \mathrm{sigmoid}(b + wx_n) = c\,\dfrac{1}{1 + e^{-(b + wx_n)}}$

$y_n = b + \sum_i c_i \cdot \mathrm{sigmoid}(b_i + w_i x_n)$

  • $y_n = b + \sum\limits_{i} c_i \cdot \mathrm{sigmoid}(b_i + \sum\limits_{j} w_{ij} x_j)$
    • $i$ → index over the piecewise sigmoid functions
    • $j$ → index over the features (domain knowledge)
Linear Algebra

1 layer

Feature: $x$

Unknown parameters ($\theta$): $W$, $\mathbf{b}$, $c^T$, $b$

Rectified Linear Unit(ReLU)

$y_n = b + \sum\limits_{i} c_i \cdot \mathrm{sigmoid}(b_i + \sum\limits_{j} w_{ij} x_j)$

$y_n = b + \sum\limits_{2i} c_i \cdot \max(0,\, b_i + \sum\limits_{j} w_{ij} x_j)$ → ReLU (better; two ReLUs compose one hard sigmoid)
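A sketch of this one-layer ReLU model in numpy (the shapes and random values are my own choices):

```python
import numpy as np

def relu_model(x, b0, c, b, W):
    """y = b0 + sum_i c_i * max(0, b_i + sum_j W_ij * x_j)."""
    return b0 + c @ np.maximum(0.0, b + W @ x)

rng = np.random.default_rng(0)
x = rng.standard_normal(3)       # features x_j
W = rng.standard_normal((8, 3))  # weights w_ij (8 ReLU pieces)
b = rng.standard_normal(8)       # biases b_i
c = rng.standard_normal(8)       # coefficients c_i
b0 = 0.5                         # global bias b
print(relu_model(x, b0, c, b, W))
```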

Backpropagation
An efficient way to compute the gradients for gradient descent
  • Chain Rule

    $y = g(x)$, $z = h(y)$

    $\dfrac{dz}{dx} = \dfrac{dz}{dy} \dfrac{dy}{dx}$

    $\Delta x \rightarrow \Delta y \rightarrow \Delta z$

    $x = g(s)$, $y = h(s)$, $z = k(x, y)$

    $\dfrac{dz}{ds} = \dfrac{\partial z}{\partial x} \dfrac{dx}{ds} + \dfrac{\partial z}{\partial y} \dfrac{dy}{ds}$

    $\Delta s \rightarrow \Delta x,\ \Delta y \rightarrow \Delta z$ ($s$ affects $z$ through both $x$ and $y$)

$C^n$ → distance between $y^n$ & $\hat y^n$

$L(\theta) = \sum\limits_{n=1}^N C^n(\theta)$

$\dfrac{\partial L(\theta)}{\partial w} = \sum\limits_{n=1}^N \dfrac{\partial C^n(\theta)}{\partial w}$

$\dfrac{\partial C}{\partial w} = \dfrac{\partial z}{\partial w} \dfrac{\partial C}{\partial z}$

  • Forward pass

    $\dfrac{\partial z}{\partial w} = x$ (the input)

    $z = x_1 w_1 + x_2 w_2 + b$

  • Backward pass

    $\dfrac{\partial C}{\partial z} = \dfrac{\partial a}{\partial z} \dfrac{\partial C}{\partial a}$

    $a = \sigma(z)$ → activation function

    $\dfrac{\partial C}{\partial a} = \dfrac{\partial z'}{\partial a} \dfrac{\partial C}{\partial z'} + \dfrac{\partial z''}{\partial a} \dfrac{\partial C}{\partial z''}$

    $\dfrac{\partial z'}{\partial a} = w$

    $\dfrac{\partial C}{\partial z} = \sigma'(z) \left[ w_3 \dfrac{\partial C}{\partial z'} + w_4 \dfrac{\partial C}{\partial z''} \right]$

    $\sigma'(z)$ is a constant, because $z$ is already fixed by the forward pass

    $\dfrac{\partial C}{\partial z'} = \dfrac{\partial y_1}{\partial z'} \dfrac{\partial C}{\partial y_1}$, $\dfrac{\partial C}{\partial z''} = \dfrac{\partial y_2}{\partial z''} \dfrac{\partial C}{\partial y_2}$
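A hand-rolled sketch of one forward and backward pass through a single sigmoid neuron, following the chain rule above (the whole setup is a toy example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 2 inputs -> 1 sigmoid neuron -> output y, cost C = (y - y_hat)^2.
x = np.array([1.0, 2.0])
w = np.array([0.5, -0.3])
b = 0.1
y_hat = 1.0

# Forward pass: compute and store z, so dz/dw = x is known.
z = w @ x + b  # z = x1*w1 + x2*w2 + b
a = sigmoid(z)
y = a          # one layer: the activation is the output
C = (y - y_hat) ** 2

# Backward pass: dC/dw = (dz/dw) * (da/dz) * (dC/da).
dC_da = 2 * (y - y_hat)  # dC/dy, and y = a
da_dz = a * (1 - a)      # sigma'(z), a constant once z is fixed
dC_dz = da_dz * dC_da
dC_dw = x * dC_dz        # dz/dw = x (the input)
print(dC_dw)
```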

Improve training
Check the loss on the training data first, then on the testing data.

Model bias

$y = b + wx_1$
  • More features: increase domain knowledge → $y = b + \sum_j w_j x_j$
  • More layers: deep learning → $y_n = b + \sum\limits_{i} c_i \cdot \mathrm{sigmoid}(b_i + \sum\limits_{j} w_{ij} x_j)$

Bad optimization

Big training data loss
  • Gain insight from shallow network optimization: a deeper network can at least mimic a shallower one, so if the deeper network has a larger training loss, the optimization is not good enough
💡
Overfitting:

Small training data loss + Big testing data loss
  • Increase training data
  • Data augmentation: generate new data from existing data
  • Constrained model: based on our interpretation of the problem

Mismatch

The distributions of the training & testing data are different