This is my blog.




Lesson 1: CNN for Visual Recognition

  1. what is computer vision

    It is really the study of visual data. And it includes physics, biology, psychology, computer science, mathematics, engineering.

  2. The history of the computer vision.

    biological vision(540 millions): The first animal has eyes, and Big Bang happened.

    Cameras vision(1600): “Block world”—recognize them and reconstruct what these shapes are.

    Stages of Visual Representation(1970):

    • Input image
    • Primal sketch: likes zero crossings, blobs, edges, bars, ends, virtual lines, groups, curves boundaries.
    • 2 1/2-D Sketch: likes local surface orientation and discontinuities in depth and in surface orientation
    • 3-D model representation: organized in terms of surface and volumetric primitives.

    Maybe we should take image segmentation first, then take image recognition.

    SIFT feature:

    • Identify critical features on the object
    • Match the features to a similar object

    Image net:

    • Recognize all the object in the world.
    • Solve the overfitting oroblem in machine learning.

    CNN: deep learning(focus on).

  3. Outline of CS231n

    Focus on image classification.
    Transistors and pixels used in training are important.
    Human don’t only have the ability to recognize objects, so there are many things we can do.

Lesson 2: Image Classification pipeline

  1. Need

    input and redetermined labels.

  2. Challenges

    illumination(eg. light), deformation(eg. position), occlusion(eg. In grass), intraclass variation(eg. cats are different).

  3. Classifier

    def classify_image(image):
    # some magic here
    return class_label
  4. Method

    like the supervised learning: by dataset to predict the new examples

    have two functions: training function and predict function

  5. Data-Driven approach

    1. Collect a dataset of images and labels
    2. Use Machine Learning to train a classifier
    3. Evaluate the classifier on new images
    def train(images, labels):
    # Machine learning!
    return model
    def predict(model, test_images):
    # use model to predict labels
    return test_labels
  6. Neighbor Classifier

    import numpy as np
    class NearestNeighbor:
    def __init__(self):
    def train(self, X, y)
    X is N*D where each row is an example.
    Y is 1-dimension of size N
    # the nearest neighbor classifier simply remembers all the training data
    self.Xtr = X
    self.ytr = y
    def predict(self, X):
    X is N*D where each row is an example we wish to predict label for
    num_test = X.shape[0]
    # lets make sure that the output type matches the input type
    Ypred = np.zeros(num_test, dtype = self.ytr.dtype)
    # loop over all test rows
    for i in xrange(num_test):
    # find the nearset training image to the i'th test image
    # using the L1 distance (sum of absolute value of differences)
    distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)
    min_index = np.argmin(distances) # get the index with smallest distance
    Ypred[i] = self.ytr[min_index] # predict the label of the nrearest example
    retrun Ypred

    Train: O(1) Predict: O(N)

    This is bad, we want to be fast at prediction.


    • Very slow at test time(on images, there are too pixels.)
    • Distance metrics on pixels are not informative.

    So, K-Nearest Neighbor on images never used.

    K-Nearest Neighbor: instead of coping label form nearest neighbor, take majority vote from K closet points.(curse of dimensionality)

  7. Compare approach

    L1 distance: Manhattan distance(if the input has its mean, depen coordinate)

    L2 distance: Euclidean distance

  8. Hyperparameters

    choices about the algorithm that we set rather than learn, such as K(K-Nearest Neighbor), distance metric.

  9. Dataset

    Such as, CIFAR10(32*32*3).

    Split the dataset to train, validation and test data.

    • Only whole, always works perfectly on training data
    • Split to train and test data, then no idea how algotihm will perform on new data(may it’s too bad, because the train data and test data are so different).

    Also we can use cross-validation: split data into folds, try each fold as validation and average results(but it doesn’t use in practice, because it cots too much).

  10. Linear Classification

    $f(X,W)=WX+b$: (W is parameters or weights, and make it more efficient than K-Nearest neighbor. And b is a bias.)Each of row is a temple probility of the classBut it only allow to learn one template per category, that may be caused that a horse has two head.And maybe the region of same kinds is not in linear, maybe a circle.

    Each line of W is a classifier of a classification category. If you change the number of one of the lines, you will see that the classifier’s corresponding line in space starts to rotate in different directions. The bias B allows the linear shift of the classifier. If there is no bias, all the classifier’s lines have to cross the origin point.

Lesson 3: Loss functions and Optimization

  1. Loss function

    Tells how good current classifier is.

  2. Multiclass SVM loss

    The min is zero, and the max is infinity. The scale isn’t important, because we care about the max, not the value.

    def L_i_vectorized(x, y, W):
    scores = W.dot(x)
    margins = np.maximum(0, scores - scores[y] + 1)
    margins[y] = 0
    loss_i = np.sum(margins)
    return loss_i

    We want to make the loss be zero, and the W is not unique.

    Data loss: model predictions should match training data.

    Regularization: model should be “simple”, so it works on test data(Occam’s Razor: the simplest is the best).

  3. Multinomial logistic regression

    softmax function:

    Want to maximize the log likelihood, or to minimize the negative log likelihood of the correct class.

    The min is zero, the max is infinity.

  4. Regularization

    L2 regularization(perfer W2 than W1):

    L1 regularization(perfer W1 than W2):

    Elastic net(L1+L2):

    Max norm regularization:


    Fancier: Batch normalization, stochastic depth.

  5. Optimization(Find the W to minimize the loss)

    • Random search(bad)

    • Follow the slope

      In 1-dimension, the derivative of a function:

      Also can:

      In multiple dimensions, the gradient is the vector of (partial derivatives) along each dimension.

      The slope in any direction is the dot product of the direction with the gradient. The direction of steepest descent is the negative gradient.

  6. Gradient

    • Numerical gradient: approximate, slow, easy to write
    • Analytic gradient: exact, fast, error-prone
    • In practice: always use analytic gradient, but check implementation with numerical gradient(gradient check).

    Gradient descent:

    • Stochastic Gradient Descent(SGD)

      Because full sum expensive when N is large! So we approximate sum using minibatch of examples 32/64/128 common.

      # Minibatch Gradient Descent
      while True:
      data_batch = sample_training_data(data, 256) # sample 256 examples
      weights_grad = evaluate_gradient(loss_fun, data_batch, weights)
      weights += -step_size * weights_grad # perform parameter update
  7. Image features vs ConvNets

Lecture 4: Introduction to Neural Networks

  1. Backpropagation

    Recursive application of the chain rule along a compuatational graph to compute the gradients of all

    Chain rule:

    Patterns in backward flow

    • add gate: gradient distributor
    • max gate: gradient router
    • mul gate: gradient switcher
  2. forward() and backward() API

  3. Neural networks

    example of 2-layer Neural Network:

    import numpy as np
    from numpy.random import randn
    N, D_in, H, D_out = 64, 1000, 100, 10
    x, y - randn(N, D_in), randn(N, D_out)
    w1, w2 = randn(D_in, H), randn(H, D_out)
    for t in range(2000):
    h = 1 / (1 + np.exp(-x.dot(w1)))
    y_pred = h.dot(w2)
    loss = np.square(y_pred - y).sum()
    print(t, loss)
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T)
    grad_w1 = x.T.dot(grad_h * h * (1 - h))
    w1 -= 1e-4 * grad_w1
    w2 -= 1e-4 * grad_w2
  4. Activation function

    • Relu

    • sigmoid

    • tanh

    • Leaky Relu

    • Maxout

    • ELU

  5. Nural networks: Architectures

    Hidden layer, Fully-connected layer

Lecture 5: Convolutional Neural Networks for Visual Recognition

  1. ConvNets

    Used in classification, retrieval, detection, segmentation, self-driving cars, face recognition, image captioning, etc..

    Typical architectures look like:


    Where N is usually up to ~5, M is large, $0\leq K\leq2$.

    But ResNet/GoogleNet challenge this paradigm.

  2. Filters

    Filters alawys extend the full depth of the input volume. The last size should be the same.

  3. Stride

    Stride x means you skip x-1 pixels.

    Output size: (N-F)/stride+1

    F: Filter size

    The size should be integer. So, the more common is add zero pad to the border(also mirror, extend, etc.).

    Shrinking too fast is not good, doesn’t work well.

  4. Summary

    Accepts a volume of size $W_1\times H_1 \times D_1$

    Requires four hyperparameters:

    • Number of filters K (powers of 2, e.g. 32,64,128,512)
    • their spatial extent F
    • the strde S
    • the amount of zero padding P

    Common settings:

    • F = 3, S = 1, P = 1
    • F = 5, S = 1, P = 2
    • F = 5, S = 2, P = ?(whatever fits)
    • F = 1, S = 1, P = 0

    Produces a volume of size $W_2\times H_2\times D_2$ where:

    • $W_2 = (W_1 - F + 2P)/S+1$
    • $H_2=(H_1-F+2P)/S+1$ (i.e. width and height are computed equally by symmetry)
    • $D_2=K$

    With parameter sharing, it introduces $F\cdot F\cdot D_1$ weights per filter, for a total of $(F\cdot F\cdot D_1)\cdot K$ weights and $K$ biases.

    In the output volume, the d-th depth slice (of size $W_2\times H_2$) is the result of performing a valid convolution of the $d$-th filter over the input volume with a stride of $S$, and then offset by $d$-th bias.

    eg. Input volume: 32*32*310 5*5 filters with stride 1, pad 2

    Output volume size:(32+2*2-5)/1+1=32 spatially, so it is 32*32*10.

    (The fist 2 means the left and right or the top and bottom.)

    And the number of parameters in this layer:

    Each filter has 5*5*3+1=76 params (+1 for bias).

    => So 76*10 = 760

  5. Pooling layer

    • Just downsampling
    • Makes the representations smaller and more manageable
    • operates over each activation mao independently

    Eg. if you use themax pooling with 2*2 filters and stride 2, then 2*2=4 numbers to choose the maximum one, and then skip 2 numbers, then loop. That means, if your input is 4*4, then the output is 2*2.

    Not use the zero padding, because it just use the downsampling.

    Common settings:

    F = 2, S = 2

    F = 3, S = 2

  6. Fully Connected Layer(FC layer)

    • Contains neurons taht connect to the entire input volume, as in ordinary Neural Networks.

Lesson 6: Training Neural Networks I

  • Activation functions

    • sigmoid $\in(0,1)$, firing rate.

      • Saturated neurons “kill” the gradients
      • Sigmoid outputs are not zero-centered, and will make the gradient update always in the same direction
    • tanh $\in(-1,1)$

      • Zero centered(nice)
      • still kills gradients when saturated
    • Relu (the gradient in zero always says 0.)

      • Does not saturate(in +region)
      • Very computationally efficient
      • Converges much faster than sigmoid/tanh in practice
      • Actually more biologically plausible than sigmoid
      • Not zero-centered ouput
      • an annoyance: when $x<0$
      • Always like to initialize ReLu neurons with slightly positive biases to avoid bad initialization.
    • Leaky Relu

      • Does not saturate
      • Computationally efficient
      • Converges much faster than sigmoid/tanh in practice
      • will not “die”
    • Parametric Rectufier(PReLu)

      backprop into $\alpha$ parameter

    • ELU

      • All benefits of Relu
      • Closer to zero mean outputs
      • Negative saturation regime compared with Leaky ReLu adds some robustness to noise
      • Computation requires exp()
    • Maxout

      • Does not have the basic form of dot product -> nonlinearity
      • Generalizes ReLu and Leaky ReLu
      • Linear Regime! Does not saturate! Does not die!
      • Doubles the number of parameters/neuron

    So in practice:

    • Use ReLu. Be careful with your learning rates
    • Try out Leaky ReLu/ Maxout/ ELU
    • Try out tanh but don’t expect much
    • Don’t use sigmoid
  • Data preprocessing

    In image, we don’t normalized and other complicated method so much, because we have position, scales and so on.

    • zero-centered data

    • Normalized data: if we don’t use it, it will be more sensitisensitive.

    • Decorrelated data

      data has diagonal covariance matrix

    • Whitened data

      covariance matrix is the identity matrix

    So in practice:

    • Subtract the mean image, eg. AlexNet
    • Subtract per-channel mean, eg. VGGNet
    • Not common to normalize variance, to do PCA or whitening

    And the mean solve the first layer problem in sigmoid, but in deeper layer, it will be worser.

  • Weight initialization

    • if $W = 0$, then all the neurons do the same things, they don’t learn anything.

    • First idea: small random numbers (gaussian with zero mean and $1e-2$ standard deviation)[use tanh in this.]

      Works okay for small networks, but problems with deeper networks. All actications become zero!

    • when we make 0.01 replaced 1.0, almost all neurons completely saturated, eith -1 and 1. Gradients will be zero.

    • Xavier initialization: One kind of good rule of thumb that you can use:

      W = np.random.randn(fan_in,fan_out)/np.sqrt(fan_in)

      mathematical derivation assumes linear activations.

      but when using the ReLU nonlinearity it breaks.

    • So we use the W = np.random.randn(fan_in,fan_out)/np.sqrt(fan_in)/2

      note additional /2

  • Batch normalization

    To make each dimension unit gaussian, apply:

    So we should first compute the empirical mean and variance independently for each dimension. Then normalized it.

    Usually inserted after Fully Connected or Convolutional layers and before nonlinearity.

    After normalize, then allow the network to squash the range if it wants to:

    The network can learn:

    Batch Normalization

    Input: Values of x over a mini-batch: $\Beta = \{x_{1\dots m}\}$; Parameters to be learned: $\gamma, \beta$

    Output: $\{y_i=BN_{\gamma,\beta}(x_i)\}$

    The advantages:

    • Improves gradient flow through the network
    • Allows higher learning rates
    • Reduces the strong dependence on initialization
    • Acts as a form of regularization in a funny way, and slightly reduces the need for dropout, maybe
    • The mean/std are not computed based on the batch. Instead, a single fixed empirical mean of activations during training is used.
  • Babysitting the learning process

    • Loss not going down: learning rate too low
    • Loss exploding: learning rate too high
  • Hyperparameter optimization

    • Cross-validation strategy:
      • First stage: only a few epochs to get rough idea of what params work
      • Second stage: longer running time, finer search
    • Random search vs. Grid search
    • Hyperparameters to play with:
      • network architecture
      • learning rate, its decay schedule, update type
      • regularization(L2/Dropout strength)
    • Big gap = overfitting; no gap = increase model capacity.

Lesson 7: Training Neural Networks II

  1. Fancier optimization

    • Loss function tells how good the algorithm is.

    • Problem 1: Lower in one direction(will get zigzagging), and quickly in another. And it will more usual in high-dimension space

    • Problem 2: local minima or saddle point(more common in high dimension, one decrease, other increase). The gradient is zero.

    • SGD(normal)

    • SGD+Momentum: can solve the last problems

      vx = 0
      while True:
      dx = compute_gradient(x)
      vx = rho * vx + dx
      x += learning_rate * vx

      $\rho$ gives “friction”, typically $\rho=0.9$ or $0.99$

    • SGD+Nesterov Momentum:

      Annoying, usually we want update in terms of $x_t,\triangle f(x_t)$

      Change of variables $\tilde x_t=x_t+\rho v_t$ and rearrange:

      dx = compute_gradient(x)
      old_v = v
      v = rho * v - learning_rate * dx
      x += -rho * old_v + (1 + rho) * v
    • AdaGrad

      Added element-wise scaling of the gradient based on historical sum of squares in each dimension. If the grad_squared are big, then it will lower; if the grad_squared are small, then it will bigger. Then can solve the problem of different dimension.As the time going by, the Adagrad will be smaller.

      grad_squared = 0
      while True:
      dx = compute_gradient(x)
      grad_squared += dx * dx
      x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
    • RMSProp:

      grad_squared = 0
      while True:
      dx = compute_gradient(x)
      grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
      x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
    • Adam(use them both)

      first_moment = 0
      second_moment = 0
      while True:
      dx = compute_gradient(x)
      first_moment = beta1 * first_moment + (1 - beta1) * dx # Momentum
      second_moment = beta2 * second_moment + (1 - beta2) * dx * dx # AdaGrad/RMSProp
      x -= learning_rate * first_moment / (np.sqrt(second_squared) + 1e-7)

      full form

      first_moment = 0
      second_moment = 0
      for t in range(num_iterations):
      dx = compute_gradient(x)
      first_moment = beta1 * first_moment + (1 - beta1) * dx # Momentum
      second_moment = beta2 * second_moment + (1 - beta2) * dx * dx # AdaGrad/RMSProp
      first_unbias = first_moment / (1- beta1 ** t)
      second_unbias = second_moment / (1 - beta2 ** t)
      x -= learning_rate * first_moment / (np.sqrt(second_squared) + 1e-7)

      $\text{bet1} = 0.9,\ \text{beta2} = 0.999,\ \text{learning_rate}=1e-3\ or\ 5e-4$

      Learning rate decay over time!

      • exponential decay

      • 1/t decay

    • First-Order Optimization

      • Use gradient form linear approximation
      • Step to minimize the approximation
    • Second-Order Optimization

      • Use gradient and Hessian($O(N^2)$) to form quadratic approximation

      • Step to the minima of the approximation

      • Second-order Taylor expansion:

        Newton parameter update:

      • And will take $O(N^3)$, $N=\text{(Tens or Hundreds of) Millions}$

      • Will use Quasi-Netton methods(BGFS most popular)

      • L-BFGS(Limited memory BFGS): Does not form/store the full inverse Hessian.

        • Usually works vey well in full batch, deterministic mode
        • Does not transfer very well to mini-batch setting
    • In practice

      • Adam is a good default choice in most cases
      • If you can afford to do full batch updates then try out L-BFGS(and don’t forget to disable all sources of noise)
  2. Regularization

    In common use:

    • L2 regularization

    • L1 regularization

    • Elastic net (L1+L2)

    • Dropout

      In each forward pass, randomly set some neurons to zero probability of dropping is a hyper parameter; 0.5 is common. More common in fully-connected layers. But sometimes see this in convolutional layers, as well.

      At test time, multiply by dropout probability.

    • Add random noise

    • DropConnect

    • Data Augmentation

    • Batch Normalization

    • Fractional Max pooling

    • Stochastic Depthz

    • Summary

      • Training: add some kind of randomness
      • Testing: Average out randomness (sometimes approximate)
  3. Transfer learning

    | | Very similar dataset | Very different dataset |
    | —————————- | ————————————————— | —————————————————————————————— |
    | Very little data | Use Linear Classifier on top layer | You’re in trouble… Try linear classifier from differnet stages |
    | quite a lot of data | Finetune a few layers | Finetune a larger number of layers |

    Don’t need too much data.

Lesson 8: Deep learning Software

  1. CPU vs GPU

    Deep learning prefer NVIDIA than AMD. GPU can parally do same things.

    CPU: Fewer cores, but each core is much faster and much more capable; great at sequential tasks

    GPU: More cores, but each core is much slower and “dumber”; great for parallel tasks. Eg. matrix multiplication.

    Programming GPUs, can use:

    • CUDA(NVIDIA only), write C-like code, have higher-level APIs: cuFFT, cuDNN, etc.
    • OpenCL: similar to CUDA, but runs on anything, usually slower
    • Udacity: just use existing libraries

    GPU is very fast, but if your data is in disk or something like, the reading is slow. There are three solutions:

    • Read all data into RAM
    • Use SSD instead of HDD
    • Use multiple CPU threads to prefetch data
  2. Deep learning framework

    The reason of using framework:

    • Easily build big computational graphs
    • Easily compute gradients in computational graphs
    • Run it all efficiently on GPU(wrap cuDNN, cuBLAS, etc.)

    Numpy can’t run on GPU and have to compute our own gradient.


    It is similar with Theano.

    tf.gradients(loss, [w1, w2]) you can use it to compute.

    tf.device('/cpu:0') can change the cpu and gpu

    with tf.Session() as sess: how to run the graph and feed it.

    sess.run() run it.

    tf.train.GradientDescentOptimizer(1e-5) set the optimizer.

    optimizer.minimize(loss) use optimizer to compute gradients and update weights. And you can use other optimizer like RMSProp.

    tf.losses.mean_squared_error(y_pred, y) use predefined common losses.

    tf.relu as parameter

    First define computational graph, then run the graph many times. Crate placeholders for input x, weights w1 and w2, and targets y.

    Problem: copying weights between CPU/GPU each step. The idea is that we change w1 and w2 from placeholder to Variable.

    Problem: loss is not going down. The solution is that we should explicitly tell TensorFlow to perform those update operations. If copy the new w1 and w2 replace the old one. So we use the dummy node.

    High-level wrappers:

    • Keras
    • TFLearn
    • TensorLayer
    • Ty.layers
    • TF-Slim
    • tf.contrib.learn

    PyTorch: Three Levels of abstraction.

    Pytorch Tensors are just like Numpy array. In Tensorflow, we set up explicit graph, and then run it many times. It called dynadynamic computational graph. But in PyTorch, we built a new graph every time we do a forward pass. Torch only have two modules. It doesn’t habe ReLu, no autograd, but more stable, lots of existing code.

    dtype = torch.cuda.FloatTensor


    torch.autograd.Variable(torch.randn(N, D).cuda(), requires_grad=True)

    torch.nn.xx, like torch.nn.Linear(D_in, H), torch.nn.ReLu(), torch.nn.MSELoss(size_average=False), etc.

    torch.utils.data.DataLoader() wraps a dataset and provides minibatching, shuffling, multithreading for you.

    pretrained models: torch vision.models.xx like torch vision.models.alexnet(pretrained=True), torch vision.models.vgg16(pretrained=True), torch vision.models.resnet101(pretrained=True), etc.

    And package Visdom that lets you visualize some of these loss statistics, it is similar with Tensorboard. But Tensorboard will show you the computational graph.


    • Core written in C++
    • Good for training or fine-tuning feedforward classification models.
    • Often no need to write code
    • Still popular for deploying models
    • Step:
      • Convert data(HDF5)
      • Define Network(prototxt)
      • Define Solver(prototxt)
      • Train
    • have model zoo
    • have python interface
    • Now have caffe2, that shouldn’t do with prototxt.


    • Optimization: framework can optimize the graph for you before it runs.
    • Serialization: Once graph is built, can serialize it and run it without the code that built the graph.
    • Conditional: use ontrol flow operator.
    • Loops: a little bit uglier.


    • Optimization: Equivalent graph with fused operations.
    • Serialization: Graph building and execution are intertwined, so always need to keep code around.
    • Conditional: more simple
    • Loops: super easy.
    • Applications:
      • Recurrent networks
      • Recursive networks
      • Modular networks


    • TensorFlow is a safe bet for most projects. Not perfect but has huge community, wide usage. Maybe pair with high-level wrapper. Use TensorFlow for one graph over many machines.
    • PyTorch is best for research, however stillness, there can be rough patches.
    • Consider Caffe, Caffe2, or TensorFlow for production deployment.
    • Consider TensorFlow or Caffe2 for mobile.

Lesson 9: CNN Architectures

  1. AlexNet was the first large scale convolutional neural network that was able to do well on the ImageNet classification.

    If input is: 227x227x3 images, then:

    First layer(CONV1): 96 11x11 filters applied at stride 4, so the output volume size is (227-11)/4+1=55, [55x55x96], and the total number of parameters in this layer is [11x11x3]x96=35K

    Second layer(POOL1): 3x3 filters applied at stride 2, so the output volume is (55-3)/2+1=27, [272796], and the total number of parameters in this layer is 0! Because the parameters is the weight that we need to learn, the pooling layer we always have ReLu. Pooling layer don’t need to learn parameters.

    Fully (simplified) AlexNet architecture:

    [227x227x3] INPUT

    [55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0

    [27x27x96] MAX POOL1: 3x3 filters at stride 2

    [27x27x96] NORM1: Normalization layer

    [27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2

    [13x13x256] MAX POOL2: 3x3 filters at stride 2

    [13x13x256] NORM2: Normalization layer

    [13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1

    [13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1

    [13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1

    [6x6x256] MAX POOL3: 3x3 filters at stride 2

    [4096] FC6: 4096 neurons

    [4096] FC7: 4096 neurons

    [1000] FC8: 1000 neurons(class scores)


    • First use of ReLU
    • Used Norm layers(not common anymore)
    • Heavy data augmentation
    • Dropout 0.5
    • Batch size 128
    • SGD Momentum 0.9
    • Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
    • L2 weight decay 5e-4
    • 7 CNN ensemble: 18.2% -> 15.4%
  2. VGGNet has small filters, deeper networks.

    8 layers(AlexNet) -> 16-19 layers(VGG16Net)

    Only 3x3 CONV stride 1, pad 1 and 2x2 MAX POOL stride 2

    11.7% top 5 error in ILSVRC’13(ZFNet) -> 7.3% top 5 error in ILSVRC’14

    Stack of three 3x3 conv (stride 1) layers has same effective receptive field as one 7x7 conv layer. But deeper, more non-linearities and fewer parameters ($3*(3^3C^2)\ vs.\ 7^2C^2$ for C channels per layer).

    Most memory is in early CONV, and most params are in late FC.


    • ILSVRC’14(a competition of image classification in 2014) 2nd in classification, 1st in localization
    • Similar training procedure as Krizhevsky 2012
    • No Local Response Normalisation(LRN)
    • Use VGG16 or VGG19 (VGG19 only slightly better, more memory)
    • Use ensembles for best results
    • FC7 features generalize well to other tasks
  3. GoogleNet

    Deeper networks, with computational efficiency


    • 22 layers
    • Efficient “Inception” module

    “Inception module”: design a good local network topology (network within a network) and then stack these modules on top of each other.

    Naive Inception module: Apply parallel filter operations on the input from previous layer:

    • Multiple receptive field sizes for convolution(1x1, 3x3, 5x5)
    • Pooling operation(3x3)

    Concatenate all filter outputs together depth-wise. But the computational complexity is bigger.

    Use pad to keep 28x28.

    • No FC layers network
    • Only 5 million parameters! 12x less than AlexNet
    • ILSVRC’14 classification winner (6.7% top 5 error)
  4. ResNet

    Very deep networks using residual connections


    • 152-layer model for ImageNet
    • ILSVRC’15 classification winner (3.57% top 5 error)
    • Swept all classification and detection competitions in ILSVRC’15 and COCO’15!

    Full ResNet architecture:

    • Stack residual blocks
    • Every residual block has two 3x3 conv layers
    • Periodically, double # of filters and downsampling spatially using stride 2(/2 in each dimension)
    • Additional conv layer at the begining
    • No FC layers at the end (only FC 1000 to output).

    Training ResNet in practice:

    • Batch Normalization after every CONV layer
    • Xavier/2 initialization from He et al.
    • SGD + Momentum (0.9)
    • Learning rate: 0.1, divided by 10 when 10 when validation error plateaus
    • Mini-batch size 256
    • Weight decay of 1e-5
    • No dropout used

    The deep layer don’t work well both in test and training. It not dued to overfitting.

    Hypothesis: the problem is an optimization problem, deeper models are harder to optimize.

    The deeper model should be able to perform at least as well as the shallower model.

    Solution: Use network layers to fit a residual mapping instead of directly trying to fit a desired underlying mapping. In $F(x)+x$, the x is identity mapping, and F(x) is residual mapping.

    For deeper networks, use “bottleneck” layer to improve efficiency(similar to GoogleNet).

    Can able to train very deep networks without degrading, and deeper networks now achieve lowing training error as expected.

  5. Summary:

    • GoogleNet was the most efficient.
    • AlexNet: smaller compute, still memory heavy(much parameters), lower accuracy
    • ResNet: moderate efficiency depending on model, highest accuracy.

    • VGG, GoogleNet, ResNet all in wide use, available in model zoos
    • ResNet current best default
    • Trend towards extremely deep networks
    • Sigificant research centers around design of layer / skip connections and improving gradient flow
    • Even more recent trend towards examining necessity of depth vs. width and residual connections.
  6. Other architecture:

    • Network in Network (NiN)

      Uses MLP. computer more abstract features for local patches. Philosophical inspiration of GoogleNet.

    • Identity Mappings in Deep Residual Networks

      Improved ResNet block design from creators of ResNet, creats a more direct path for propagating information

    • Wide Residual Networks

      Argues that residuals are the important factor, not depth, user wider residual blocks, increasing width instead of depth more computationally efficient(parallelizable).

    • ResNeXt

      Also from creators of ResNet, increases width of residual block through multiple parallel pathways, parallel pathways similar in spirit to inception module.

    • Deep Networks with Stochastic Depth

      Reduce vanishing gradients and training time through short networks during training, randomly drop a subset of layers during each training pass, bypass with identity funciton, use full deep network at test time.

    • FractalNet

      Argues that key is transitioning effectively from shallow to deep and residual representations are nit necessary, fractal architecture with both shallow and deep paths to output.

    • DenseNet

      Dense blocks where each layer is connected to every other layer in feedforward fashion, encourages feature reuse.

    • SqueezeNet

      Fire modules consisting of squeeze layer with 1x1 filters deeding an expand layer with 1x1 and 3x3 filters, can compress to 510x smaller than AlexNet.

Lesson 10: Recurrent Neural Networks

  1. Input -> output

    • one to one

      Vanilla Neural Networks

    • one to many

      Image Captioning: image inverts to the sequence of words

      Visual Question Answering: RNN with attention

    • many to one

      Sentiment Classification: sequence of words invert to sentiment

    • many to many(have intermedias)

      Machine Translation: seq of words invert to seq of words

      You can see the many to many as many-to-one + one-to-many

    • Many to many(one to one)

      eg. Video classification on frame level

  2. RNN: Recurrent Neural Network

    We can process a sequence of vectors x by applying a recurrence formula at every time step

    Use old state to update the new state, and $x_t$ was the input in t step.

    Attention: we should use the same function and the same set of parameters at every time step of computation.

    The underlying neural network establishes a weighted connection only between layers. And the biggest difference between RNNs is the establishment of a weighted connection between neurons.

  3. BPTT: Back-propagation through time

    The training method for RNN, and the essential is BP. Forward through entire sequence to compute loss, then backward through entire sequence to compute gradient. But because of using sigmoid and tanh, in gradient prod, gradient vanishing will happen. As we known, updating the shared parameters of more layers based on the gradient of the finite layer will definitely cause problems.

    The main methods to solve the gradient disappearance are:

    1. Selecting a better activation function, such as ReLU, but a constant 1 gradient exponential explosion can be resolved by setting a threshold.
    2. Change the propagation structure, such as LSTM.
  4. Truncated Backpropagation through time

    Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps.

    min-char-rnn.py that implements it.

  5. LSTM: Long short Term Memory

    The reason for the disappearance of the RNN gradient is only short-term memory. LSTM combines short-term memory with long-term memory through gate control, and the repetitive modules are not just a single neural network. Added a c cell that represents long-term memory. Run directly on the chain with only a small amount of linear interaction. And using the door. Door is a way to let the information selection pass. Sigmoid selects the update content, and the tanh function creates update candidates.

    f: Forget gate, whether to erase cell, use sigmoid

    i: input gate, whether to write to cell, use sigmoid

    g: gate gate, how much to write to cell, use tanh

    o: output gate, how much to reveal cell, use sigmoid

    Uninterrupted gradient flow! And is similar to ResNet! In between: highway networks.

  6. Vanilla RNN

  7. Other RNN variants

    • GRU
  8. Summary

    • RNNs allow a lot of flexibility in architecture design.
    • Vanilla RNNs are simple but don’t work very well.
    • Common to use LSTM or GRU: their additive interactions improve gradient flow
    • Backward flow of gradients in RNN can explode or vanish. Exploding is controlled with gradient clipping. Vanishing is controlled with additive interactions(LSTM).
    • Better/simpler architectures are a hot topic of current research
    • Better understanding (both theoretical and empirical) is needed.

Lesson 11: Detection and Segmentation

Many computer vision tasks.

  1. Semantic Segmentation(No objects, just pixels)

    Output decision of a category for every pixel in that image. And label it.

    • Sliding window (Not use this)

      But very inefficient! Not reusing shared features between overlapping patches.

    • Fully convolutional

      Get training data is very expansive, beacause we need give every pixels a category. Maybe we can draw a contour for the image, roughly.

      And if the input image is very big, then the computational is also too large. But we can design network as a bunch of convolutional layers, with downsampling and upsampling inseide the network. Then the network can be very deep, but the computational could be more efficient.


      • uppooling
      • Transpose convolution: we can express convolution in terms of a matrix multiplication. So transpose convolution multiplies by the transpose of the same matrix.
  2. Classification + Localization(Single Object)

    Draw a bounding for the object in the image. We can treat localization as a regression problem.

    We have two loss: class scores -> softmax loss and box coordinates -> L2 loss

  3. Object Detection(Multiple Object)

    Core problem in computer vision.

    For the input image, every time one of those categories appears in the image, we draw a box around it and predict the category of that box.

    This is different from classification plus localization. Because there might be a varying number of outputs for every input image.

    And we don’t know how many object in image, so this is pretty challenge.

    • Sliding window

      Apply a CNN to many different crops pf the image, CNN classifies each crop as object or background.

      How to choose the crop is very important.

    • Region proposals Network (RPN)

      • Find “blobby” image regions that are likely to contain objects
      • Relatively fast to run

    Base Network:

    • VGG16
    • ResNet-101/ResNet
    • Inception V2/Inception V3/Inception
    • MobileNet

    Object Detection architecture:

    • Faster R-CNN
    • R-FCN
    • SSD

    Image size: Region Proposals


    • Faster R-CNN is slower but more accurate
    • SSD is much faster but not as accurate

    Aside: object detection + captioning = dense captioning

  4. Instance Segmentation(Multiple Object, Semantic + Object detection)

    Full problem. We want to predict the locations and identities of objects in image similar to object detection, but rather than just predicting a bounding box for each of those objects, instead we want to predict a whole segmentation mask for each of those objects and predict which pixels in the input image corresponds to each object instance.

    Mask R-CNN: very good results!

Lesson 12: Visualizing and Understanding

From the feature visualization of the network layer to the feature matching and feature inversion based on the gradient lifting method, texture synthesis, image generation and style transfer are derived.

  1. Visualizing

    • Visualize the filter

      • First layer: there are angles, lines, colors and so on. The visual result of the first convolutional layer is similar, looking for directed edges (such as light and dark lines)
      • Last layer: dimensionality reduction. Visualize the space last layer feature vectors by reducing dimensionality of vectors from 4096 to 2 dimensions. You can use Principle component analysis (PCA), simply. Also, more complexly, t-SNE. Similar to the results of Nearest Neighbors, objects of similar character will come together.
    • Visualizing activations

      • The weighting of the visual middle layer is not so strong, but the activation map of the visual middle layer is interpretable in some cases.

        The bigger one in feature map is object, others are background and something like this.

      • Also, you can select activating patches that maximizes the activation of different features and different neurons.

      • Add mask to illustrate it. This is Occlusion Experiments.

        The purpose of the occlusion experiment is to figure out which part of the input image is causing the neural network to make a classification decision. If we block a part of the image and lead to the sharp change in the value of the network, then the input image portion of this occlusion may play a very important role in the classification decision.

      • Saliency Maps

        Compute gradient of (unnormalized) class score with respect to image pixels, take absolute value and max over RGB channels to tell which pixels matter for classification. It is different with last one. The occlusion is about part, and saliency maps is about pixels.

        You can use grabcut with saliency maps, and subdivide the objects in the image. But this method has some weakness.

    • Visualizing CNN features: Gradient Ascent

      • (Guided) backprop: Find the part of an image that neuron responds to.

      • Gradient ascent: Generate a synthetic image that maximally activates a neuron.

        $f(I)$: Neuron value, $R(I)$: Natural image regularizer.

        1. Initialize image to zeros


        1. Forward image to compute current scores
        2. Backprop to get gradient of neuron value with respect to image pixels
        3. Make a small update to the image.

        Which part of the input image affects the score of the neuron, the weight of the trained convolutional neural network is corrected by the gradient ascending method, and the gradient is raised on the pixels of the image to synthesize the image to try and maximize some intermediate neuron and class scores.

        In the process of performing a gradient rise, we are no longer optimized, the weight values in the neural network remain the same. Instead, we try to change the pixels of some images to maximize the value of this neuron or the fractional value of this class. In addition, we need some regular items $\lambda|I|^2_2$ to prevent the images we generate from overfitting the characteristics of a particular network.
        Generating an image requires two properties:

        1. Maximize the activation of some fractional values or values of neurons;
        2. We hope that this generated image will look natural, that is, the image we want to generate has statistics in the natural image. The image forced by the regular item looks like a natural image.

        Always optimize in FC6 latent space instead of pixel space.

  2. Intermediate features via (guided) backprop

    Instead of using the fractional values of the class, some neurons in the middle layer of the neural network are selected to see which parts of the image affect the scores of the neurons in the neural network. Only positive gradients are propagated, and only the positive effects of the entire neural network are tracked.

    Compute gradient of neuron value with respect to image pixels.

    Images come out nicer if you only backprop positive gradients through each ReLU(guided backprop).

  3. Fooling images / Adversarial examples

    For example, we want to let the network think the elephant is a kola. Then the network modift the image and add gaussian noisy in it. But We can’t see the difference bewteen in these two images. The reseaon, we will see in lecture 13.

    1. Start from an arbitrary image
    2. Pick an arbitrary class
    3. Modify the image to maximize the class
    4. Repeat until network is fooled
  4. DeepDream: Amplify existing features.

    Pick different layers, you can get different results.

    Choose an image and a layer in CNN; repeat:

    1. Forward: compute activations at choose layer.
    2. Set gradient of chosen layer equal to its activation
    3. Backward: compute gradient on image
    4. Update image

    Some tricks:

    • Calculate the dithered image before the gradient, that is, instead of running the same image as the original image through the neural network, move the image by two pixels and wrap the other two pixels. This is a regularization method to make the image more smooth;
    • Normalization using the L1 of the gradient;
    • Sometimes modify some pixel value.
  5. Feature inversion

    given a CNN feature vector for an image, find a new image that:

    • Matches the given feature vector
    • looks natural (image prior regularization)

    And $\Phi(X)$ features of new image, $\Phi_0$ given feature vector, $\mathcal R_{V^\beta}$ is total variation regularizer (encourages spatial smoothness).

  6. Gram Matrix

    It discards all spatial information in the feature volume because we average the feature vectors for each point in the image, it just captures the second-order co-occurrence statistics between features, which is ultimately a good texture description symbol. And the calculation efficiency of the Gram matrix is very high.

  7. Neural style transfer: Texture Synthesis based on gram matrix + feature inversion of feature matching.

    Also can multiple style images. But style transfer requires many forward / backward passes through VGG; very slow! The solution is that train another neural network to perform style transfer for us!

    1. Train a feedforward network for each style
    2. Use pretrained CNN to compute same losses as before
    3. After training, stylize images using a single forward pass

    You can use the same network for multiple styles using conditional instance normalization: learn separate scale and shift parameters per style.

  8. Summary

    Many methods for understanding CNN representations.

    • Activations
      • Nearest neighbors
      • Dimensionality
      • reduction
      • maximal patches
      • Occlusion
    • Gradients
      • Saliency maps
      • Class visualization
      • fooling images
      • Feature inversion
    • Fun
      • DeepDream
      • Style Transfer

Lesson 13: Generative Models

Supervised learning and some problem about it.

  1. Supervised vs unsupervised learning

    Supervised learning just learn a function to map x -> y

    Unsupervised learning has just data, no labels! Traing data is cheap. It just understands structure of visual world, learn some underlying hidden structure of the data.

  2. Generative Models

    Given training data, generate new samples from same distribution.

    Addresses density estimation, a core problem in unsupervised learning.

    Generate model: infinite samples ==> probability density model = generate model ==> prediction
    The generation method is based on the data learning joint probability distribution P(X, Y), and then the conditional probability distribution P(Y|X)=P(X, Y)/P(X) is obtained as a prediction model. The reason why such a method becomes a generation method is because the model indicates the generation relationship of the output Y by a given input X.

    • Why?

      • Realistic samples for artwork, super-resolution, colorization, etc.
      • Generative models of time-series data can be used for simulation and planning (reinforcement learning applications)
      • Training generative models can also enable inference of latent representations that can be useful as general features.
    • Taxonomy

      Also you can combine them to use it.

  3. PixelRNN/CNN

    Fully visible belief network.

    Use chain rule to decompose likelihood of an image x into product of 1-d distributions:

    Then maximize likelihood of training data. We can express complex distribution over pixel values using a neural network.

    • PixelRNN

      Generate image pixels starting from corner. Collect a lot of pictures, then use these pictures to start training the picture generation model, predict the next pixel according to the previous pixel, and generate a picture by giving a (or several) initial pixels after the training. PixelRNN not only works, but the graphs produced by PixelRNN in the various generation methods are the clearest.

      There are some tips on the image: If the three values of RGB are not much different, the resulting color is gray and not bright enough. You can cluster many colors into several classes and do 1-of-N encoding to “continuous”. The pixels become discrete.

    • PixelCNN

      Still generate image pixels starting from corner. But using a CNN over context region.

      Training faster then PixelRNN because of parallel.

      But generation must still proceed sequentially, so it’s still slow.

    And both them have following properties:


    • can explicitly compute likelihood p(x)
    • Explicit likelihood of training data gives good evaluation metric
    • Good samples


    • Inefficient sequential generation => slow

    Improving PixelCNN performance:

    • Gated convolutional layers
    • Short-cut connections
    • Discretized logistic loss
    • Multi-scale
    • Training tricks
    • Etc.
  4. Variational Autoencoder

    Use output and input to do the same unconstrained learning. After the training is completed, take out the decoder in the auto-encoder, and generate a vector as the code input to the decoder to get an image, but the effect is usually not good. VAE appeared.

    VAE is adding noise to the encoder output.

    VAEs define intractable density function with latent z:

    Cannot optimize directly, derive and optimize lower bound on likelihood instead.

    And the loss function is L2, $|x-\hat x|^2$, don’t use labels. This is different.

    Encoder can be used to initialize a supervised model.

    x is an image, z is latent factors used to generate x: attributes, orientation, etc. choose prior p(z) to be simple. Eg.g. Gaussian. But conditional p(x|z) is complex, because it represents with neural network.

    The problem is intractable. It is intractable to compute p(x|z) for every z.

    The solution is we define additional encoder network $q_\Phi(z|x)$ that approximates $q_\Phi(x|z)$. It derive a lower bound on the data likelihood.

    The data likelihood can be simplified as follows:

    The first part: Decoder network gives $p_{\theta}(x|z)$, can compute estimate of this term through sampling.

    The second part is KL term that has nice closed-form.

    The last part: $p_{\theta}(z|x)$ intractable, can’t compute this KL term. But we know KL divergence represents distance and is always greater than or equal to zero by definition.

    So first two part make up to tractable lower bound. Then we maximize lower bound.

    The results is good, but it still has a bit of a blurry aspect to them.


    • Principled approach to generative models
    • Allows inference of $q(z|x)$, can be useful feature representation for other tasks


    • Maximizes lower bound of likelihood: okay, but not as good evaluation as PixelRNN/PixelCNN
    • Samples blurrier and lower quality compared to state-of-the-art (GANs).

    Active areas of research:

    • More flexible approximations, e.g. richer approximate posterior instead of diagonal Gaussian
    • Incorporating structure in latent variables
  5. GAN: Generative Adversarial Networks

    The last two cannot optimize directly, derive and optimize lower bound on likelihood instead. So what if we give up on explicit modeling density, and just want ability to sample?

    One of the biggest problems in VAE is that it can’t really be out of nothing, because its loss is MSE, the image generated by the constraint must be similar to the existing image pixel level, and can not produce a “conceptual” similarity out of thin air.
    GAN solves this problem very well. Its generator G has never seen a real picture. It can only learn from the gradient provided by the discriminator D. This is not a simple way to memorize an existing picture.

    GAN: don’t work with any explicit density function. Instead, take game-theoretic approach: learn to generate from training distribution through 2-player game.

    Problem is that we wnat to sample from complex, high-dimensional training. No direct way to do this. But we can sample from a simple distribution, learn transformation to training distribution. And the complex transformation we can use a neural network to represent it.

    GAN is a Two-player game:

    • Generator network: try to fool the discriminator by generating real-looking images
    • Discriminator network: try to distinguish between real and fake images.

    Train jointly in minimax game, and the objective function is:

    The first one is Discriminator output for real data x, and the last one is discriminator output for generated fake data G(z).

    So, the discriminator ($\theta_d$) wants to maximize objective such that D(x) is close to 1 (real) and D(G(z)) is close to 0 (fake). And take gradient ascent.

    Generator($\theta_g$) wants to minimize objective such that D(G(z)) is close to1 (discriminator is fooled into the thinking generated G(z) is real). And take gradient deascent.


    Change the min log (1-D(G(z))) to max log(D(G(z))), because in the original loss design, the better the discriminator, the more serious the generator gradient disappears, the harder to train. However, the improved model will also have a problem of mode collapse, which will be detailed in the future WGAN.
    Note that our loss is BCELoss, the cross entropy loss, and later LSGAN will use MSELoss, ie the square difference loss.


    • Beautiful, state-of-the-art samples!


    • Tricker / more unstable to train
    • Can’t solve inference queries such as p(x), p(z|x)

    Active areas of research:

    • Better loss functions, more stable training (Wasserstein Gan, LSGAN, many others)
    • Conditional GANs, GANs for all kinds of applications.
  6. Interpretable Vector math

    eg. smiling woman - natural woman + natural man, you can get smiling man.

Lesson 14: Deep Reinforcement learning

  1. Reinforcement learning

    Problems involving an agent interacting with an environment, which provides numeric reward.

    The Environment give agent a state $S_t$, then agent make an Action $a_t$ to environment. After that, Environment reward $r_t$ to agent, also gets the next state $S_{t+1}$.

    Goal: Learn how to take actions in order to maximize reward.

  2. Markov Decision Processes

    It is Mathematical formulation of the RL problem.

    Markov property: Current state completely characterises the state of the world.

    Defined by: ($\mathcal{S,A,R},\mathbb{P},\gamma$)

    $\mathcal {S}$: set of possible states.

    $\mathcal {A}$: set of possible actions.

    $\mathcal {R}$: distribution of reward given (state, action) pair.

    $\mathbb {P}$: transition probability i.e. distribution over next state given (state, action) pair.

    $\gamma$: discount factor, it assigns weights to recent rewards as well as forward rewards.

    Then the process:

    At time step $t = 0$, environment samples initial state $s_0\sim p(s_0)$

    Then, for $t = 0$ until done:

    ​ Agent selects action $a_t$

    ​ Environment samples reward $r_t\sim R(\cdot|s_t,a_t)$

    ​ Environment samples next state $s_{t+1}\sim P(\cdot|s_t,a_t)$

    ​ Agent receives reward $r_t$ and next state $s_{t+1}$

    A policy $\pi$ is a function from S to A that specifies waht action to take in each state.

    Objective: find policy $\pi^$ that *maximizes cumulative discounted reward: $\sum_{t>0} \gamma^tr_t$

    So, it like in grid world how to go we can get the star more quickly. We can maximize the expected sum of rewards!


  3. Value function

    Following a policy produces sample trajectories (or paths) $s_0,a_0,r_0,s_1,a_1,r_1,\cdots$

    The value function at state s, is the expected cumulative reward from folloing the policy from state s (how good is a state):

  4. Q-value function

    The Q-value function at state s and action a, is the expected cumulative reward from taking action a in state s and then following the policy (how good is a state-action pair):

    And we want to find the optimal Q-value function $Q^$ that is the maximum the last value. And $Q^$ satisfies the Bellman equation (Also Dynamic Programming Equation):

    If the optimal state-action values for the next time-step $Q^(s,a)$ are know, then the optimal strategy is to take action that maximizes the expected value of $r+\gamma Q^(s’,a’)$. Obviously, the optimal policy $\pi^$ corresponds to taking the best action in any state as specified by $Q^$.

    But this is not scalable that it must compute Q(s, a) for every state-action pair.

  5. Q-Learning

    As we all known, if the computation is too complex, we can use neural network. Q-learning is the solution for the optimal policy.

    Q-learning: Use a function approximator to estimate the action-value function.

    If the function approximator is a deep neural network, then it is deep Q-learning.

    So the loss function is $L_i(\theta_i)=\mathbb E_{s,a\sim\rho(\cdot)}[(y_i-Q(s,a;\theta_i))^2]$ where $y_i=Q^*(s,a)$.

    And the backward is gradient update.

    But learning from batches of consecutive samples is problematic:

    • Samples are correlated => inefficient learning
    • Current Q-network parameters determines next training samples -> can lead to bad feedback loops.

    Address these problems using experience replay, update replay memory table of transitions ($s_t,a_t,r_t,s_{t+1}$) and train Q-network on random mini batches of transitions from replay memory.

    But it does not always work but when it works, usually more sample-efficient. Challenge: exploration. And zero guarantees since you are approximating Bellman equation with complicated function approximator.

  6. Policy Gradients

    The Q-learning function can be very complicated!

    Define a class of parametrized policies: $\prod=\{\pi_\theta,\theta\in\R^m\}$.

    For each policy, define its value:

    So, find the optimal policy $\theta^*=arg\ max\ J(\theta)$. Gradient ascent on policy parameters.

    Where $r(\tau)$ is the reward of a trajectory $\tau=(s_0,a_0,r_0,s_1,\cdots)$

    Differentiate this:

    Intractable! Gradient of an expectation is problematic when p depends on $\theta$.

    But we can use a nice trick:

    Then, it is:

    And we have:


    And when differentiating:

    So, it doesn’t depend on transition probabilities!

    Therefore when sampling a trajectory $\tau$, we can estimate $J(\theta)$ with

    So, if $r(\tau)$ is high, push up the probabilities of the actions seen; if $r(\tau)$ is low, push down the probabilities of the actions seen. But this also suffers from high variance because credit assignment is really hard. So it requires a lot of samples. Challenge: sample-efficiency.

    We can use discount factor $\gamma$ to ignore delayed effects.

    Choose a simple baseline: constant moving average of rewards experienced so far from all trajectories.

    Also you can choose a better baseline: want to push up the probability of an action from a state, if this action was better than the expected value of what we should get from that state.

    Converges to a local minima of $J(\theta)$, often good enough.

  7. Actor-critic algorithm

    we can combine policy gradients and Q-learning by training both an actor (the policy) and a critic (the Q-function).

  8. RAM: Recurrent Attention model

    When we look at the complicated images, we will focus one part. So like this, we will see some parts and then combine to the whole image. In that case, can save computing resources and helps with image classification because you can ignore the mess and a bunch of irrelevant information in the image. So the action is decide which part to observe in next time.

Lesson 15: Efficient Methods and Hardware for Deep Learning

The problems of model compression and optimization are explained from two aspects of algorithm and hardware, so as to reduce the volume of the deep learning model, reduce the number of parameters, reduce the amount of calculation, and accelerate the calculation.

  1. Challenge

    • Model size
    • (training) Speed
    • Energy efficiency, and it mostly in storage access.
  2. Hardware

    • General Purpose
      • CPU: latency oriented, like an elephant
      • GPU: throughput oriented, like a group of small ants
    • Specialized hardware
      • FPGA: programmable logic, so it’s cheaper for you to try new ideas and do prototype, but it’s less efficient.
      • ASIC: fixed logic, eg. Google TPU.

    And the number representation of different cores are different, like int8 is S1+M7, FP32 is S1+E8+M23 and so on. Because of energy for one operation, we prefer use 8 or 16 bits rather than 32 bits.

  3. Algorithm for efficient inference

    1. Pruning nerual network.

      We want less parameters, the accuracy are the same. And because not all parameter are useful, so this is feasible. Also changed the weight distribution, drop it, small it.

    2. Weight sharing

      We want each parameter has fewer digits, so the model will become smaller when the parameters are multiplied together. Not all the number should be the exact number. The idea is that regroup the ownership and use a cluster centroid to represent these similar numbers, eg. kmeans.

    3. Quantization

      Just use single number to represent. Also we can combine it with pruning. Huffman coding is used to represent infrequently occurring weights with more digits, and fewer digits to represent frequently occurring weights.

      Train with float, quantizing the weight and activation, and also fine-tune and convergence in float format.

    4. SqueezeNet

      Instead of compressing an existing model, build a streamlined model directly. We will have squeeze layer that share the same neurons.

    5. Low Rank Approximation

      You can break complicated network to two simple network.

    6. Binary / Ternary Net

      During inference, only use two or three parameters.

    7. Winograd Transformation

      It is about how do we implement convolution layer.

      Winograd Transformation is a equivalent method to Direct convolution. It transforms data to reduce math intensity.

  4. Hardware for efficient inference

    A common goal: minimize memory access.

    Rooflines in CPU, GPU, TPU, you will see TPU is more high. The so far below Rooflines is because low latency requirement that can’t batch more. And we can use less memory footprint to compress the model. And the challenge is the hardware that can infer on compressed model.

    EIE: the first DNN accelerator for sparse, compressed model by the Song Han.

  5. Algorithm for efficient training

    1. Paralleization

      Including data paralleization and model parallelization.


      • Run multiple examples in parallel.
      • Limited by batch size.


      • Split model over multiple processors
      • By layer
      • Conv layers by map region
      • Fully connected layers by output activation.
    2. Mexed precision with FP16 and FP32

    3. Model Distillation

      Teacher model to teach student model. And we should use soft label, like dog 80%, cat 20%, instead of hard label, like dog. We can use softmax.

    4. DSD: Dense-Sparse-Dense training

      A better regularization. Like learn the trunk first, then add leaves to learn all the tree.

  6. Hardware for efficient training

    New volta: tensor core.

Lesson 16: Adversarial Examples and Adversarial Training

What are the Adversarial Examples, why they happen, how they destroy machine learning systems, how to defend them, and how to use them to improve machine learning performance, even without Adversarial Examples.

  1. Adversarial examples

    Example: panda + 0.07 * noise = gibbon. But In our eyes, it doesn’t change, in convolutional network it change much. And at first, 30% pandas, but after change, 99.9% gibbon.

    We can fool not just neural nets, but also linear models (logistic regression, softmax regression, SVMs), decision trees, nearest neighbors.

    And the reseaon, we guess:

    • Adversarial examples from overfitting (No!!!)

      If it is over-fitting, then this picture is misclassified more or less bad luck, and is unique. If we fit a somewhat different model, we expect to randomly make another mistake on the training set. error. But this is not the case. We find that different models are easy to make mistakes on the same confrontation sample and will assign the same class to them. In addition, if we study the difference between the original sample and the antagonist sample, we have a direction in the input space, we can add a same offset vector to any other sample without noise, this method can always get a counter sample. So we realized that this is a systemic effect, not a random effect.

    • Adversarial examples from excessive linearity

    • Modern deep nets are very piecewise linear

    • Small inter-class distances

    • High-dimensional linear model

    • Linear models of Imagenet

    Like clever Hans learn the wrong cue.

    Training on adversarial samples is a good method of regularization. The beast is Train = Adv, and the test is Clean.

  2. Draw maps of Random cross-sections.

    Different models use the same data to generate almost the same weight! The middle of the cell corresponds to the original sample without modification. The left and right represent the direction of the FGSM attack, and the upper and lower sides represent the random direction perpendicular to FGSM. White pixels indicate correct classification and different colors indicate different categories. We can see that the left half of the cell is basically divided into pairs, the right half is a different color, and the boundary between the left and the right is almost linear. This means that FGSM recognizes a direction, and if we have a large internal product in this direction, we can get a confrontation sample. We find that each real sample is close to a linear decision boundary, and as you cross the boundary into the confrontation subspace, all other nearby points are against the sample. This means that as long as you find the right direction, you don’t have to find the specific spatial coordinates. You just need to find a direction, a direction that can form a large inner product with the gradient direction, and then you can deceive the network model by moving a little along this direction.

    In general, a sample is correct (error) from the beginning, and adding noise does not improve the situation. Noise can also make the model classification wrong, especially for large noise. In addition, some cells are initially classified incorrectly, but noise can make it a correct classification. And Adversarial examples are not noise.

  3. The fast gradient sign method(FGSM)

    When FGSM is used to attack a NN without any special defense, a 99% attack success rate is obtained, which means that the assumption that the model is too linear is valid.

    Maximize $J(\tilde x, \theta)$, subject to

    Generative modeling is not sufficient to solve the problem.

  4. Estimating the subspace dimensionality

    The larger the subspace dimension, the more likely the subspaces of the two models overlap. If you have two different models and have a large confrontational subspace, it is very likely that the confrontation sample will be migrated to another model. If the confrontation space is small, it is more difficult to migrate the samples unless the system effect makes the two models have exactly the same subspace, because the subspace is random.

  5. Wrong alomost everywhere

    Almost all samples are misclassified, and the model performs well only on a very narrow manifold that is distributed near our training data. We hope that the noise energy is not divided into any class, that is, the value obtained by each sigmoid should be less than 0.5, and 0.5 is not a small threshold. For such a defective input, it is reasonable to change the sigmoid threshold to less than 0.01. But we will find that the Gaussian distribution of noise with a large norm is added to the model. These sigmoid always tend to be divided into a certain class.

  6. Adversarial examples for RL

  7. Adversarial training

    With the exception of DNN, other models may not be well served through adversarial training.

    • Linear models: SVM/linear regression cannot learn a step function, so adversarial training is less useful, very similar to weight decay.
    • K-NN: adversarial training is prone to overfitting
    • Takeway: neural nets can actually become more secure than other models. Adversarially trained neural nets have the best empirical success rate on adversarial examples of any machine learning model.

    Do an adversarial disturb that aims to change the decision. We want to make the network think this is a truck, or something similar. Not something you thought might be before. The category distribution that can be trained to judge it should be the same as before, that is, it still feels like a bird or an airplane. This method is called virtual adversarial training. Using virtual confrontation training, we can achieve semi-supervised learning, that is, using both labeled and unlabeled data for training, which improves the accuracy of the model.

  8. Failed defenses

    • Generative pretraining
    • Adding noise at test time
    • Adding noise at train time
    • Removing perturbation with an auto encoder
    • Confidence-reducing perturbation at test time
    • Ensembles
    • Error correcting codes
    • Multiple glimpses
    • Weight decay
    • Various non-linear units
    • Double backprop
    • Dropout

    Because machine learning algorithms are generalized. The objective function we hope to learn is independent of the data we train, and it does not matter which training sample we choose. If we want to generalize the test set to the training set, and hope that different training sets can get similar results. This means that they learn about similar functions, so they will be weak against similar confrontation samples.

    Dawn Song of UC Berkeley et al. found that if several different models were integrated and then searched against a sample with gradient descent, each model in the integration could be deceived. When you deliberately lead to the migration phenomenon, it can really construct a very powerful migration.

    The shallow RBF network is well protected against disturbances. The current problem is that deep RBF networks are difficult to train, and it is possible to successfully train to solve the problem of against the sample. Because this model is highly nonlinear and has a wide, flat area, it is difficult for an enemy to increase the loss function by making small changes to the model’s input.

  9. Universal approximator theorem

    The general approximation theorem tells us that no matter what shape our classification function wants, a neural network large enough can always be expressed. Whether we can train the neural network to get this function is still a problem that everyone can’t solve now, but we can at least give a right shape.

  10. Other problem

    There are still many problems when using neural networks as the optimization process. For example, we want to build a car that is very fast. We imagine a neural network accepting the blueprint of the car as input and predicting how fast the car can drive. We can Optimize the input of the neural network and find the blueprint of the fastest car that is predicted to run, so that we can build a super fast car. But unfortunately, we don’t have a blueprint for a very fast car now. We construct the confrontation sample so that the model thinks the car is very fast. If we can solve the problem against the sample, we can solve the model-based optimization problem.

  11. Summary

    • attacking is easy
    • Defending is difficult
    • Adversarial training provides regularization and semi-supervised learning
    • The out-of-domain input problem is a bottleneck for model-based optimization generally.
  12. gradio

    pip install gradio

    It allows you to:

    • rapidly create visual interfaces on top of your model
    • share them with others without dealing with hosting
    • get feedback








愿 我是你的小太阳