# When will gradient methods converge to max‐margin classifier under ReLU models?

@article{Xu2018WhenWG, title={When will gradient methods converge to max‐margin classifier under ReLU models?}, author={Tengyu Xu and Yi Zhou and Kaiyi Ji and Yingbin Liang}, journal={arXiv: Learning}, year={2018} }

We study the implicit bias of gradient descent methods in solving a binary classification problem over a linearly separable dataset. The classifier is described by a nonlinear ReLU model and the objective function is the exponential loss. We first characterize the landscape of the loss function and show that there can exist spurious asymptotic local minima besides asymptotic global minima. We then show that gradient descent (GD) can converge to either a global or a local max-margin…
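
As a rough illustration of the setup described above, the sketch below runs plain gradient descent on a single-neuron ReLU model with the exponential loss over a toy linearly separable dataset; the single-neuron model, the synthetic data, and the linear-margin proxy at the end are assumptions made only for illustration, not the paper's exact construction.

```python
import numpy as np

# Minimal sketch (illustrative, not the paper's exact setup): gradient descent
# on a single-neuron ReLU model with the exponential loss, over toy linearly
# separable data with labels in {-1, +1}.

rng = np.random.default_rng(0)
n, d = 50, 2
w_sep = np.array([1.0, 1.0])                  # a separating direction for the toy data
X = rng.normal(size=(n, d))
y = np.sign(X @ w_sep)

def loss_and_grad(w):
    z = X @ w
    relu = np.maximum(z, 0.0)                 # ReLU output
    losses = np.exp(-y * relu)                # exponential loss per sample
    act = (z > 0).astype(float)               # ReLU (sub)gradient indicator
    # Samples with z <= 0 contribute no gradient (dead ReLU); this is related
    # to the spurious asymptotic local minima mentioned in the abstract.
    grad = -(losses * y * act) @ X / n
    return losses.mean(), grad

w = 0.1 * rng.normal(size=d)
lr = 0.1
for _ in range(20000):
    loss, grad = loss_and_grad(w)
    w -= lr * grad

# The loss decays while ||w|| grows, so the object of interest is the
# direction w / ||w||; the paper asks which max-margin direction it reaches.
direction = w / np.linalg.norm(w)
proxy_margin = np.min(y * (X @ direction))    # linear-margin proxy, illustration only
print(f"final loss {loss:.3e}, direction {direction}, linear-margin proxy {proxy_margin:.3f}")
```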

#### 14 Citations

Gradient Descent Maximizes the Margin of Homogeneous Neural Networks

- Computer Science, Mathematics
- ICLR
- 2020

The implicit regularization of the gradient descent algorithm in homogeneous neural networks, including fully-connected and convolutional neural networks with ReLU or LeakyReLU activations, is studied, and it is proved that both the normalized margin and its smoothed version converge to the objective value at a KKT point of the associated margin-maximization problem (a formulation is sketched below).
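
For context, the optimization problem referenced in this summary is margin maximization for a homogeneous network; a standard way to state it (the notation here is illustrative, following that line of work) is:

```latex
% Margin maximization for an L-homogeneous network f(\theta; x) on data
% (x_i, y_i), i = 1, ..., n (illustrative notation):
\min_{\theta} \ \tfrac{1}{2}\,\|\theta\|_2^2
\quad \text{subject to} \quad y_i\, f(\theta; x_i) \ \ge\ 1, \qquad i = 1, \dots, n.
% The normalized margin \min_i y_i f(\theta; x_i) / \|\theta\|_2^L is the
% quantity whose convergence to the objective value at a KKT point is
% asserted in the summary above.
```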

Convergence of Gradient Descent on Separable Data

- Mathematics, Computer Science
- AISTATS
- 2019

It is proved that the convergence rate could be improved to $\log (t) /\sqrt{t}$ by using aggressive step sizes that compensate for the rapidly vanishing gradients, which might be useful for deep networks.
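
As a toy illustration of step sizes that compensate for vanishing gradients on separable data, the sketch below scales the step by the inverse of the current loss; this loss-normalized rule and the synthetic data are assumptions for illustration only, and the cited paper's exact schedule may differ.

```python
import numpy as np
from scipy.special import expit  # numerically stable sigmoid

# On separable data the logistic loss and its gradient decay exponentially
# fast, so a fixed step size improves the margin very slowly.  Scaling the
# step by the inverse of the current loss is one way to compensate.

rng = np.random.default_rng(0)
n, d = 100, 3
w_sep = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_sep)                        # linearly separable labels

def logistic_loss_grad(w):
    margins = y * (X @ w)
    loss = np.logaddexp(0.0, -margins).mean() # mean logistic loss
    grad = -(y * expit(-margins)) @ X / n
    return loss, grad

w = np.zeros(d)
for _ in range(5000):
    loss, grad = logistic_loss_grad(w)
    eta = 0.1 / max(loss, 1e-12)              # "aggressive" loss-normalized step size
    w -= eta * grad

direction = w / np.linalg.norm(w)
print("margin of the learned direction:", np.min(y * (X @ direction)))
```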

Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate

- Mathematics, Computer Science
- AISTATS
- 2019

It is proved that SGD converges to zero loss, even with a fixed (non-vanishing) learning rate, in the special case of homogeneous linear classifiers with smooth monotone loss functions optimized on linearly separable data.

Learning ReLU Networks on Linearly Separable Data: Algorithm, Optimality, and Generalization

- Mathematics, Computer Science
- IEEE Transactions on Signal Processing
- 2019

This paper presents a novel stochastic gradient descent (SGD) algorithm, which can provably train any single-hidden-layer ReLU network to attain global optimality, despite the presence of infinitely many bad local minima, maxima, and saddle points in general.

The Implicit Regularization for Adaptive Optimization Algorithms on Homogeneous Neural Networks

Despite their overwhelming capacity to overfit, deep neural networks trained by specific optimization algorithms tend to generalize well to unseen data. Recently, researchers explained it by…

Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models

- Mathematics, Computer Science
- ICML
- 2019

The limit of loss minimization with a diverging norm constraint is studied, related to the limit of a "margin path", and the resulting solution is characterized, showing convergence to a "lexicographic max-margin solution" in both homogeneous and non-homogeneous models.

Dynamics and Neural Collapse in Deep Classifiers trained with the Square Loss

- 2021

Recent results suggest that square loss performs on par with cross-entropy loss in classification tasks for deep networks. While the theoretical understanding of training deep networks with the…

Understanding Estimation and Generalization Error of Generative Adversarial Networks

- Computer Science
- IEEE Transactions on Information Theory
- 2021

An upper bound as well as a minimax lower bound on the estimation error for training GANs are developed; these bounds justify the generalization ability of GAN training via stochastic gradient methods (SGM) after multiple passes over the data and reflect the interplay between the discriminator and the generator.

Sampling Bias in Deep Active Classification: An Empirical Study

- Computer Science
- EMNLP/IJCNLP
- 2019

This work demonstrates that active set selection using the posterior entropy of deep models like FastText.zip (FTZ) is robust to sampling biases and to various algorithmic choices (query sizes and strategies), unlike what traditional literature suggests, and proposes a simple baseline for deep active text classification that outperforms the state of the art.

#### References

Showing 1-10 of 23 references.

The Implicit Bias of Gradient Descent on Separable Data

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2018

We examine gradient descent on unregularized logistic regression problems, with homogeneous linear predictors on linearly separable datasets. We show the predictor converges to the direction of the…

Convergence of Gradient Descent on Separable Data

- Mathematics, Computer Science
- AISTATS
- 2019

It is proved that the convergence rate could be improved to $\log (t) /\sqrt{t}$ by using aggressive step sizes that compensate for the rapidly vanishing gradients, which might be useful for deep networks.

Stochastic Gradient Descent on Separable Data: Exact Convergence with a Fixed Learning Rate

- Mathematics, Computer Science
- AISTATS
- 2019

It is proved that SGD converges to zero loss, even with a fixed (non-vanishing) learning rate, in the special case of homogeneous linear classifiers with smooth monotone loss functions optimized on linearly separable data.

Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2014

After $N$ iterations, with a constant step-size proportional to $1/(R^2\sqrt{N})$, where $N$ is the number of observations and $R$ is the maximum norm of the observations, the convergence rate is always of order $O(1/\sqrt{N})$ and improves to $O(R^2/(\mu N))$, which shows that averaged stochastic gradient descent is adaptive to the unknown local strong convexity of the objective function.
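
A minimal sketch of the averaged SGD scheme this summary describes, using the constant step size $\gamma = 1/(R^2\sqrt{N})$ quoted above; the synthetic data and the logistic-regression setup are assumptions made only for illustration.

```python
import numpy as np

# Sketch of single-pass averaged SGD for logistic regression with the constant
# step size gamma = 1 / (R^2 * sqrt(N)) quoted above (toy data, illustrative).

rng = np.random.default_rng(1)
N, d = 10_000, 5
X = rng.normal(size=(N, d))
w_star = rng.normal(size=d)
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ w_star))).astype(float)  # labels in {0, 1}

R = np.max(np.linalg.norm(X, axis=1))         # maximum norm of the observations
gamma = 1.0 / (R**2 * np.sqrt(N))             # constant step size from the cited result

w = np.zeros(d)
w_avg = np.zeros(d)
for t in range(N):                            # one observation per iteration
    x_t, y_t = X[t], y[t]
    p = 1.0 / (1.0 + np.exp(-x_t @ w))        # sigmoid prediction
    w -= gamma * (p - y_t) * x_t              # stochastic gradient of the logistic loss
    w_avg += (w - w_avg) / (t + 1)            # running Polyak-Ruppert average

print("averaged iterate:", w_avg)
```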

SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data

- Computer Science, Mathematics
- ICLR
- 2018

This work proves convergence rates of SGD to a global minimum and provides generalization guarantees for this global minimum that are independent of the network size, and shows that SGD can avoid overfitting despite the high capacity of the model.

Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)

- Computer Science, Mathematics
- NIPS
- 2013

We consider the stochastic approximation problem where a convex function has to be minimized, given only the knowledge of unbiased estimates of its gradients at certain points, a framework which…

Risk and parameter convergence of logistic regression

- Computer Science, Mathematics
- ArXiv
- 2018

Gradient descent iterates on logistic regression are shown to converge along a ray; the ray does not pass through the origin in general, and its offset is the bounded global optimum of the risk over the remaining data, with gradient descent recovering this offset at a rate $\mathcal{O}((\ln t)^2 / \sqrt{t})$.

Stochastic Convex Optimization

- Mathematics, Computer Science
- COLT
- 2009

Stochastic convex optimization is studied, and it is shown that strong convexity and regularization are the key ingredients, while uniform convergence is only a sufficient, but not necessary, condition for meaningful non-trivial learnability.

Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data

- Computer Science, Mathematics
- NeurIPS
- 2018

It is proved that SGD learns a network with a small generalization error, even though the network has enough capacity to fit arbitrary labels, when the data comes from mixtures of well-separated distributions.

On the Learning Dynamics of Deep Neural Networks

- Computer Science, Mathematics
- ArXiv
- 2018

This work studies the case of binary classification and proves various properties of learning in such networks under strong assumptions such as linear separability of the data, and confirms empirical observations by proving that the classification error also follows a sigmoidal shape in nonlinear architectures.