This talk revolves around Polyak's momentum gradient descent method, also known simply as 'momentum'. Its stochastic variant, stochastic gradient descent (SGD) with momentum, is one of the most widely used optimization methods in deep learning. Throughout the talk we will study a number of important properties of this versatile method, and see how this understanding can be used to engineer better deep learning systems.
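As background for the talk, a minimal sketch of Polyak's momentum (heavy-ball) update may be helpful. The iteration adds a fraction of the previous displacement to the plain gradient step, x_{t+1} = x_t - α ∇f(x_t) + β (x_t - x_{t-1}). The function names, step size, and test problem below are illustrative choices, not taken from the talk:

```python
import numpy as np

def polyak_momentum_gd(grad, x0, lr=0.05, beta=0.8, steps=200):
    """Heavy-ball iteration: x_{t+1} = x_t - lr*grad(x_t) + beta*(x_t - x_{t-1})."""
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()  # start with zero initial displacement
    for _ in range(steps):
        x_next = x - lr * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Illustrative test problem: minimize the quadratic f(x) = 0.5 * x^T A x,
# whose unique minimizer is the origin.
A = np.diag([1.0, 10.0])
x_star = polyak_momentum_gd(lambda x: A @ x, x0=[1.0, 1.0])
print(np.linalg.norm(x_star))  # small: the iterates converge to 0
```

On ill-conditioned quadratics like this one, the momentum term damps the oscillations of plain gradient descent along the steep direction while accelerating progress along the flat one, which is the intuition behind its widespread use.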