Neural Architecture Search

1. Search Space

  • chain-structured neural networks: a sequence of n layers, where each layer receives its input from the previous layer

  • multi-branch networks: incorporate modern design elements known from hand-crafted architectures, such as skip connections

  • cell-based search: search for cells or blocks; the final architecture is then built by stacking these cells in a predefined manner (alternatively, the macro-structure itself can also be searched)

    • Learning Transferable Architectures for Scalable Image Recognition (Zoph et al., 2018 CVPR)

      • Uses a controller RNN to predict the rest of the structure of the convolutional cell, given two initial hidden states
      • The controller's predictions for each cell are grouped into B blocks; each block has 5 prediction steps made by 5 distinct softmax classifiers, corresponding to the discrete choices of the elements of a block

      • The controller RNN was trained using Proximal Policy Optimization (PPO); a minimal sampling sketch follows this list.
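As an illustration of the controller described above, here is a minimal PyTorch sketch, assuming a fixed set of two candidate inputs per block (in the paper, later blocks can also select the outputs of earlier blocks) and illustrative sizes; all class and parameter names are hypothetical, not from the paper's code.

```python
import torch
import torch.nn as nn

class ControllerRNN(nn.Module):
    """Sketch of a NASNet-style controller: for each of B blocks it makes
    5 sequential predictions (2 input states, 2 ops, 1 combine method),
    each made by its own softmax classifier over that decision's choices."""

    def __init__(self, num_blocks=5, num_inputs=2, num_ops=13, num_combine=2, hidden=100):
        super().__init__()
        self.num_blocks, self.hidden = num_blocks, hidden
        self.cell = nn.LSTMCell(hidden, hidden)
        sizes = (num_inputs, num_inputs, num_ops, num_ops, num_combine)
        # One distinct classifier (and choice embedding) per prediction step.
        self.heads = nn.ModuleList(nn.Linear(hidden, n) for n in sizes)
        self.embeds = nn.ModuleList(nn.Embedding(n, hidden) for n in sizes)
        self.start = nn.Parameter(torch.zeros(1, hidden))

    def sample(self):
        h = c = torch.zeros(1, self.hidden)
        x, arch, log_probs = self.start, [], []
        for _ in range(self.num_blocks):
            block = []
            for head, embed in zip(self.heads, self.embeds):
                h, c = self.cell(x, (h, c))
                dist = torch.distributions.Categorical(logits=head(h))
                choice = dist.sample()
                log_probs.append(dist.log_prob(choice))
                block.append(choice.item())
                x = embed(choice)  # feed the sampled choice back in
            arch.append(tuple(block))
        # The summed log-probability is what a policy-gradient update would use.
        return arch, torch.stack(log_probs).sum()

controller = ControllerRNN()
cell_spec, logp = controller.sample()
print(cell_spec)  # e.g. [(0, 1, 5, 3, 0), ...] -- one 5-tuple per block
```

In the actual method, the sampled cell is trained, its validation accuracy serves as the reward, and the controller is updated with PPO; the sketch stops at sampling.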

2. Search Strategy

  • Random search

  • Bayesian optimization

  • Evolutionary methods

  • Reinforcement learning

  • Gradient-based methods

    • DARTS: Differentiable Architecture Search (Liu et al., 2019 ICLR)

      • To make the search space continuous, we can relax the categorical choice of a particular operation to a softmax over all possible operations:

        $$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})}\, o(x)$$

      • This implies a bilevel optimization problem, with the architecture $\alpha$ as the upper-level variable and the network weights $w$ as the lower-level variable:

        $$\min_{\alpha} \; \mathcal{L}_{val}\big(w^*(\alpha), \alpha\big) \quad \text{s.t.} \quad w^*(\alpha) = \operatorname*{arg\,min}_{w} \; \mathcal{L}_{train}(w, \alpha)$$

      • Consider a simple approximation scheme that replaces $w^*(\alpha)$ with a single virtual gradient step from the current weights $w$:

        $$\nabla_{\alpha}\mathcal{L}_{val}\big(w^*(\alpha), \alpha\big) \approx \nabla_{\alpha}\mathcal{L}_{val}\big(w - \xi \nabla_{w}\mathcal{L}_{train}(w, \alpha),\, \alpha\big)$$

        where $\xi$ is the learning rate of the virtual step.

      • Applying the chain rule to the approximate architecture gradient yields

        $$\nabla_{\alpha}\mathcal{L}_{val}(w', \alpha) - \xi\, \nabla^2_{\alpha, w}\mathcal{L}_{train}(w, \alpha)\, \nabla_{w'}\mathcal{L}_{val}(w', \alpha), \qquad w' = w - \xi \nabla_{w}\mathcal{L}_{train}(w, \alpha),$$

        and the Hessian-vector product can be approximated using the finite difference method:

        $$\nabla^2_{\alpha, w}\mathcal{L}_{train}(w, \alpha)\, \nabla_{w'}\mathcal{L}_{val}(w', \alpha) \approx \frac{\nabla_{\alpha}\mathcal{L}_{train}(w^{+}, \alpha) - \nabla_{\alpha}\mathcal{L}_{train}(w^{-}, \alpha)}{2\epsilon}, \qquad w^{\pm} = w \pm \epsilon\, \nabla_{w'}\mathcal{L}_{val}(w', \alpha)$$

        Setting $\xi = 0$ gives the cheaper first-order variant, sketched in code after this list.
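Below is a minimal, self-contained PyTorch sketch of the continuous relaxation and the first-order ($\xi = 0$) alternating update. The operation set, tensor sizes, and the regression loss are placeholders for illustration, not the paper's image-classification setup, and all names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Candidate operations for one edge; the real DARTS space uses e.g.
# separable convs, dilated convs, pooling, identity, and zero.
OPS = {
    "identity": lambda C: nn.Identity(),
    "conv3x3":  lambda C: nn.Conv2d(C, C, 3, padding=1, bias=False),
    "avgpool":  lambda C: nn.AvgPool2d(3, stride=1, padding=1),
}

class MixedOp(nn.Module):
    """Continuous relaxation of one edge: the output is a softmax-weighted
    sum over all candidate operations (the mixed operation above)."""

    def __init__(self, C):
        super().__init__()
        self.ops = nn.ModuleList(op(C) for op in OPS.values())
        self.alpha = nn.Parameter(1e-3 * torch.randn(len(OPS)))  # architecture params

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))

edge = MixedOp(C=16)
arch_params = [edge.alpha]
weight_params = [p for n, p in edge.named_parameters() if "alpha" not in n]
w_opt = torch.optim.SGD(weight_params, lr=0.025, momentum=0.9)
a_opt = torch.optim.Adam(arch_params, lr=3e-4)

def loss_fn(batch):
    x, y = batch
    return F.mse_loss(edge(x), y)  # placeholder task and loss

# First-order DARTS (xi = 0): alternate the two gradient steps. The
# second-order variant would instead differentiate through one virtual
# weight step, using the finite-difference Hessian product above.
for step in range(100):
    train_batch = (torch.randn(4, 16, 8, 8), torch.randn(4, 16, 8, 8))
    val_batch = (torch.randn(4, 16, 8, 8), torch.randn(4, 16, 8, 8))

    a_opt.zero_grad()
    loss_fn(val_batch).backward()    # dL_val/dalpha; only alpha is stepped
    a_opt.step()

    w_opt.zero_grad()                # discard weight grads from the val pass
    loss_fn(train_batch).backward()  # dL_train/dw; only w is stepped
    w_opt.step()
```

The key point is the alternation: the architecture parameters $\alpha$ are updated on the validation loss while the weights are updated on the training loss, approximating the bilevel problem without solving the inner optimization to convergence.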

3. Performance Estimation Strategy

  • Lower-fidelity estimates: training time is reduced by training for fewer epochs, on a subset of the data, or with downscaled models or downscaled data.
  • Learning curve extrapolation: training time is reduced because final performance can be extrapolated after just a few epochs of training.
  • Weight inheritance: instead of training models from scratch, they are warm-started by inheriting the weights of, e.g., a parent model.
  • One-shot models / weight sharing: only the one-shot model needs to be trained; its weights are then shared across different architectures, which are just subgraphs of the one-shot model (see the sketch after this list).
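To make the weight-sharing idea concrete, here is a minimal PyTorch sketch of a one-shot model from which candidate architectures are drawn as paths (one operation per layer); the operation set, sizes, and names are illustrative assumptions, not taken from any specific paper's code.

```python
import random
import torch
import torch.nn as nn

class OneShotLayer(nn.Module):
    """One layer of a one-shot model: it holds weights for every candidate
    op, but a forward pass runs only the op named in the sampled path."""

    def __init__(self, C):
        super().__init__()
        self.ops = nn.ModuleDict({
            "conv3x3":  nn.Conv2d(C, C, 3, padding=1),
            "conv5x5":  nn.Conv2d(C, C, 5, padding=2),
            "identity": nn.Identity(),
        })

    def forward(self, x, choice):
        return self.ops[choice](x)

class OneShotModel(nn.Module):
    def __init__(self, C=16, depth=4):
        super().__init__()
        self.layers = nn.ModuleList(OneShotLayer(C) for _ in range(depth))

    def forward(self, x, path):
        # A candidate architecture is just a path: one op name per layer.
        for layer, choice in zip(self.layers, path):
            x = layer(x, choice)
        return x

model = OneShotModel()
candidates = list(model.layers[0].ops.keys())

def random_path(depth=4):
    return [random.choice(candidates) for _ in range(depth)]

# After the one-shot model has been trained (e.g. with uniformly sampled
# paths), candidate subgraphs are ranked using the shared weights,
# without any per-architecture training.
x = torch.randn(2, 16, 8, 8)
for arch in (random_path() for _ in range(3)):
    with torch.no_grad():
        out = model(x, arch)
    print(arch, out.shape)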
