Publications
2025
- PokeeResearch: Effective Deep Research via Reinforcement Learning from AI Feedback and Robust Reasoning Scaffold. Preprint 2025.
  We present PokeeResearch-7B, a 7-billion-parameter research agent trained using Reinforcement Learning from AI Feedback (RLAIF) with LLM-based reward signals focused on factual accuracy, citation faithfulness, and instruction adherence. Our approach incorporates a chain-of-thought reasoning framework to improve robustness and handle tool failures. The model achieves state-of-the-art performance among 7B-scale deep research agents across ten benchmarks.
2023
- Cross-Modal Fine-Tuning: Align then Refine. ICML 2023.
  We propose ORCA, a general cross-modal fine-tuning framework that extends the applicability of a single large-scale pretrained model to diverse modalities. ORCA adapts to a target task via an align-then-refine workflow: given the target input, ORCA first learns an embedding network that aligns the embedded feature distribution with the pretraining modality. The pretrained model is then fine-tuned on the embedded data to exploit the knowledge shared across modalities. Through extensive experiments, ORCA obtains state-of-the-art results on 3 benchmarks containing over 60 datasets from 12 modalities.
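The align-then-refine workflow above can be illustrated with a short sketch. This is not the released ORCA code: the networks are toy stand-ins, and a simple RBF-kernel maximum mean discrepancy is assumed as the alignment objective in place of the distance the paper actually optimizes.

```python
# Hypothetical align-then-refine sketch (not the official ORCA implementation).
# Stage 1 fits an embedder so target-modality features match features from the
# pretraining modality (MMD is an assumed stand-in for the paper's alignment metric);
# stage 2 fine-tunes the whole stack on the target task.
import torch
import torch.nn as nn

def rbf_mmd(x, y, sigma=1.0):
    """Maximum mean discrepancy between two feature batches with an RBF kernel."""
    def k(a, b):
        d = torch.cdist(a, b) ** 2
        return torch.exp(-d / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

embedder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 128))  # target -> shared space
backbone = nn.Sequential(nn.Linear(128, 128), nn.ReLU())                      # stands in for a pretrained model
head = nn.Linear(128, 10)                                                     # new task head

# Stage 1: align the embedded target distribution with pretraining-modality features.
align_opt = torch.optim.Adam(embedder.parameters(), lr=1e-3)
for _ in range(100):
    target_x = torch.randn(64, 32)         # toy target-modality batch
    pretrain_feats = torch.randn(64, 128)  # toy features from the pretraining modality
    loss = rbf_mmd(embedder(target_x), pretrain_feats)
    align_opt.zero_grad()
    loss.backward()
    align_opt.step()

# Stage 2: refine -- fine-tune embedder, backbone, and head on the target task.
params = list(embedder.parameters()) + list(backbone.parameters()) + list(head.parameters())
tune_opt = torch.optim.Adam(params, lr=1e-4)
for _ in range(100):
    target_x, target_y = torch.randn(64, 32), torch.randint(0, 10, (64,))
    loss = nn.functional.cross_entropy(head(backbone(embedder(target_x))), target_y)
    tune_opt.zero_grad()
    loss.backward()
    tune_opt.step()
```

The two-stage structure is the point of the sketch: the embedder is fit first so that the pretrained backbone sees roughly in-distribution features before any of its weights are updated.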
2021
- Geometry-Aware Gradient Algorithms for Neural Architecture Search. ICLR 2021 (Spotlight).
  We argue for the study of single-level empirical risk minimization to understand NAS with weight-sharing, reducing the design of NAS methods to devising optimizers and regularizers. Invoking the theory of mirror descent, we present a geometry-aware framework that exploits the underlying structure of this optimization to return sparse architectural parameters, leading to simple yet novel algorithms that enjoy fast convergence guarantees and achieve state-of-the-art accuracy on the latest NAS benchmarks. We achieve near-oracle-optimal performance on CIFAR-10 and CIFAR-100. A minimal sketch of the resulting multiplicative update appears after this list.
- Rethinking Neural Operations for Diverse Tasks. NeurIPS 2021.
  We revisit the problem of designing effective neural operations for diverse tasks. We find that standard convolutions are not the best choice for many tasks, and propose XD-operations, a family of operations that can be efficiently searched over to find the best operation for a given task. We demonstrate the effectiveness of XD-operations on a diverse set of tasks spanning 1D, 2D, and 3D data.
- Federated Hyperparameter Tuning: Challenges, Baselines, and Connections to Weight-Sharing. NeurIPS 2021.
  We investigate hyperparameter tuning in the federated learning setting, where the goal is to find hyperparameters that perform well across heterogeneous clients. We identify key challenges, propose FedEx as a practical baseline, and establish connections to weight-sharing methods from neural architecture search.
- On Data Efficiency of Meta-learning. AISTATS 2021.
  We study the data efficiency of modern meta-learning algorithms. Using techniques from algorithmic stability, we derive bounds on the transfer risk that indicate how much supervision is needed for each method. We propose active meta-learning, which incorporates active data selection into learning-to-learn, leading to better performance in the limited supervision regime.
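For the geometry-aware NAS entry above (ICLR 2021), the key computational idea can be sketched in a few lines: mirror descent with an entropic mirror map over the simplex of operation weights yields a multiplicative (exponentiated-gradient) update that concentrates mass on a few operations. The operation set and the "gradient" below are hypothetical placeholders, not the paper's weight-sharing supernetwork.

```python
# Minimal sketch of an exponentiated-gradient (entropic mirror descent) update on
# architecture mixture weights, the kind of geometry-aware step advocated above.
# The gradient is a toy stand-in for what a shared-weights supernetwork would return.
import numpy as np

ops = ["conv3x3", "conv5x5", "skip", "zero"]          # hypothetical operation choices
theta = np.full(len(ops), 1.0 / len(ops))             # mixture weights on the simplex
lr = 0.5

def toy_grad(theta):
    # Assumption for illustration: op 0 lowers the loss, op 3 raises it.
    return np.array([-1.0, 0.1, 0.2, 1.0]) + 0.05 * theta

for step in range(50):
    g = toy_grad(theta)
    theta = theta * np.exp(-lr * g)   # multiplicative (mirror descent) step
    theta /= theta.sum()              # re-normalize onto the simplex

print({op: round(w, 3) for op, w in zip(ops, theta)})
# Unlike additive SGD on softmax logits, the multiplicative update concentrates mass
# quickly, which is one way to read the sparse architectural parameters described above.
```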
2020
- A System for Massively Parallel Hyperparameter Tuning. MLSys 2020.
  We introduce ASHA, a simple and robust hyperparameter optimization algorithm which exploits parallelism and aggressive early-stopping to tackle large-scale hyperparameter optimization problems. Our extensive empirical results show that ASHA outperforms existing state-of-the-art methods; scales linearly with the number of workers; and is suitable for massive parallelism, converging to a high-quality configuration in half the time taken by Vizier (Google's internal hyperparameter optimization service) in an experiment with 500 workers.
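A minimal, single-process simulation of the promotion rule behind ASHA is sketched below. It is an illustrative reading of asynchronous successive halving, not the distributed implementation described in the paper, and the objective is a toy function.

```python
# Single-process sketch of ASHA-style promotion: when a worker frees up, promote the
# best not-yet-promoted configuration from the highest rung whose top 1/eta is ready;
# otherwise start a fresh random configuration at the lowest rung.
import random

random.seed(0)
ETA, MIN_R, MAX_RUNG = 3, 1, 4          # halving rate, base resource, number of rungs
rungs = [[] for _ in range(MAX_RUNG)]   # rungs[k] holds (loss, config) results
promoted = [set() for _ in range(MAX_RUNG)]

def loss(config, resource):
    # Toy objective: the config's intrinsic quality plus noise that shrinks with resource.
    return config + random.gauss(0, 1.0 / resource)

def get_job():
    for k in reversed(range(MAX_RUNG - 1)):            # check the highest rungs first
        ranked = sorted(rungs[k])
        top = ranked[: len(ranked) // ETA]
        for _, cfg in top:
            if cfg not in promoted[k]:
                promoted[k].add(cfg)
                return cfg, k + 1                      # promote to the next rung
    return random.random(), 0                          # otherwise start a new config

for _ in range(200):                                   # each iteration = one free worker
    cfg, rung = get_job()
    rungs[rung].append((loss(cfg, MIN_R * ETA ** rung), cfg))

best = min(rungs[-1]) if rungs[-1] else min(rungs[0])
print("best (loss, config) at the top rung:", best)
```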
2019
- Random Search and Reproducibility for Neural Architecture Search. UAI 2019.
  We propose new NAS baselines, building on two observations: NAS is a specialized hyperparameter optimization problem, and random search is a competitive baseline for hyperparameter optimization. Our results show that random search with early-stopping performs at least as well as ENAS on PTB and CIFAR-10. We also explore reproducibility issues of published NAS results and provide all information needed to exactly reproduce our results.
2018
- Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization. JMLR 2018.
  Performance of machine learning algorithms depends critically on identifying a good set of hyperparameters. We formulate hyperparameter optimization as a pure-exploration non-stochastic infinite-armed bandit problem where a predefined resource like iterations, data samples, or features is allocated to randomly sampled configurations. We introduce Hyperband, a novel algorithm for this framework that provides over an order-of-magnitude speedup over Bayesian optimization methods on a variety of deep-learning and kernel-based learning problems.
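The resource-allocation scheme described above can be written down compactly. The sketch below follows the standard Hyperband recipe of looping over brackets that trade off the number of configurations against the resource each receives, running successive halving inside every bracket; the evaluation function is a toy stand-in for training a configuration with a given budget.

```python
# Compact Hyperband sketch with a toy objective (in practice the resource would be
# training epochs, data samples, or features, as described above).
import math
import random

random.seed(0)
R, ETA = 81, 3                                           # max resource per config, halving rate
s_max = int(math.log(R) / math.log(ETA) + 1e-9)          # floor(log_eta(R))

def evaluate(config, resource):
    # Toy loss: the config's intrinsic quality plus noise that shrinks with resource.
    return config + random.gauss(0, 1.0 / resource)

best = (float("inf"), None)
for s in reversed(range(s_max + 1)):                     # one bracket per value of s
    n = int(math.ceil((s_max + 1) * ETA ** s / (s + 1)))  # initial number of configs
    r = R * ETA ** (-s)                                   # initial resource per config
    configs = [random.random() for _ in range(n)]
    for i in range(s + 1):                               # successive halving in the bracket
        n_i = int(n * ETA ** (-i))
        r_i = r * ETA ** i
        losses = sorted((evaluate(c, r_i), c) for c in configs)
        best = min(best, losses[0])
        configs = [c for _, c in losses[: max(1, n_i // ETA)]]  # keep the top 1/eta

print("best (loss, config):", best)
```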
2017
- Hyperband: Bandit-Based Configuration Evaluation for Hyperparameter Optimization. ICLR 2017.
  We present Hyperband, a novel algorithm for hyperparameter optimization that is simple, flexible, and theoretically sound. Hyperband is a principled early-stopping method that adaptively allocates a predefined resource to randomly sampled configurations, providing over an order of magnitude speedups on neural network and kernel-based learning problems.