In Andrew Ng's machine learning course, he introduces linear regression and logistic regression, and shows how to fit the model parameters using gradient descent and Newton's method.

I know gradient descent can be useful in some applications of machine learning (e.g., backpropagation), but in the more general case, is there any reason why you wouldn't solve for the parameters in closed form, i.e., by taking the derivative of the cost function and solving with calculus?

In general, what is the advantage of using an iterative algorithm such as gradient descent over a closed-form solution, when one is available?


Unless the closed form solution is extremely expensive to compute, it is generally the way to go when it is available. However:

1. For most nonlinear regression problems there is no closed form solution.

2. Even in linear regression (one of the few cases where a closed form solution is available), it may be impractical to use the formula. The following example shows one way in which this can happen.

For linear regression on a model of the form $y=X\beta$, where $X$ is a matrix with full column rank, the least squares solution,

$\hat{\beta} = \arg \min_{\beta} \| X \beta - y \|_{2}$

is given by

$\hat{\beta}=(X^{T}X)^{-1}X^{T}y$
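As a concrete illustration, here is a minimal NumPy sketch of the closed-form solution (the data here are made up; noiseless $y$ is used so the recovered coefficients are easy to check):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))       # design matrix with full column rank
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true                   # noiseless response for easy checking

# Closed-form solution: solve (X^T X) beta = X^T y
# (equivalent to beta_hat = (X^T X)^{-1} X^T y, without forming the inverse)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# In practice, np.linalg.lstsq is preferred for numerical stability
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_hat)
print(beta_lstsq)
```

Note that even the "closed form" is normally evaluated with a linear solve or a QR-based routine like `lstsq`, not by explicitly inverting $X^{T}X$.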

Now, imagine that $X$ is a very large but sparse matrix. For example, $X$ might have 100,000 columns and 1,000,000 rows, with only 0.001% of its entries nonzero. There are specialized data structures that store only the nonzero entries of such sparse matrices.

Also imagine that we're unlucky and $X^{T}X$ is a fairly dense matrix with a much higher percentage of nonzero entries. Storing the dense 100,000 by 100,000 matrix $X^{T}X$ would then require $1 \times 10^{10}$ floating point numbers (at 8 bytes per number, this comes to 80 gigabytes), which would be impractical to store on anything but a supercomputer. Furthermore, the inverse of this matrix (or, more commonly, a Cholesky factor) would also tend to be mostly nonzero.

However, there are iterative methods for solving the least squares problem that require no more storage than $X$, $y$, and $\hat{\beta}$ and never explicitly form the matrix product $X^{T}X$.
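A sketch of this approach, using SciPy's `lsqr` (one such iterative least squares solver, which only needs matrix-vector products with $X$ and $X^{T}$); the problem sizes here are far smaller than in the example above, purely for illustration:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(1)

# A small stand-in for the huge sparse X described in the text
X = sp.random(10_000, 500, density=1e-3, format="csr", random_state=1)
beta_true = rng.normal(size=500)
y = X @ beta_true                   # consistent right-hand side

# LSQR only ever applies X and X^T to vectors; X^T X is never formed,
# so storage stays proportional to nnz(X) plus a few length-n vectors.
result = lsqr(X, y, atol=1e-12, btol=1e-12)
beta_hat = result[0]

print(np.linalg.norm(X @ beta_hat - y))   # residual norm, near zero
```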

In this situation, using an iterative method is much more computationally efficient than using the closed form solution to the least squares problem.

This example might seem absurdly large. However, large sparse least squares problems of this size are routinely solved by iterative methods on desktop computers in seismic tomography research.


**UPDATE**

For linear regression, it's a one-step procedure, so iteration of any kind is not needed.

For logistic regression, the Newton-Raphson iterative approach uses the second partial derivatives of the objective function w.r.t. each coefficient, as well as the first partial derivatives, so it converges much faster than gradient descent, which only uses the first partial derivatives.
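To illustrate, here is a minimal Newton-Raphson (iteratively reweighted least squares) sketch for logistic regression on synthetic data; the data and variable names are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson.

    Each step solves (X^T W X) delta = X^T (y - p), where
    W = diag(p_i (1 - p_i)) comes from the second partial derivatives
    and X^T (y - p) collects the first partial derivatives.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        W = p * (1 - p)                   # second-derivative weights
        grad = X.T @ (y - p)              # gradient of the log-likelihood
        H = X.T @ (X * W[:, None])        # negative Hessian, X^T W X
        beta = beta + np.linalg.solve(H, grad)
    return beta

# Toy data generated from a known coefficient vector
rng = np.random.default_rng(2)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
beta_true = np.array([0.5, 2.0, -1.0])
y = (rng.random(500) < sigmoid(X @ beta_true)).astype(float)

beta_hat = logistic_newton(X, y)
print(beta_hat)
```

Because each step uses curvature information, Newton-Raphson typically converges in a handful of iterations here, whereas plain gradient descent on the same problem would need many more steps and a tuned step size.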

**OP**

There have been several posts on machine learning (ML) and regression. ML is not needed for solving ordinary least squares (OLS), since it involves a one-step matrix sandwiching operation for solving a system of linear equations -- i.e., $\boldsymbol{\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$. The fact that everything is linear means that only a one-step operation is needed to solve for the coefficients. Logistic regression is based on maximizing the likelihood function $L=\prod_i p_i^{y_i}(1-p_i)^{1-y_i}$, which can be solved using Newton-Raphson, other ML gradient ascent methods, or metaheuristics (hill climbing, genetic algorithms, swarm intelligence, ant colony optimization, etc.).

Regarding parsimony, use of ML for OLS would be wasteful because iterative learning is inefficient for solving OLS.

Now, back to your real question on derivative-based vs. ML approaches to solving gradient-based problems. Specifically, for logistic regression, the derivative-based Newton-Raphson approach is commonly used. Newton-Raphson requires that you know the objective function and its partial derivatives w.r.t. each parameter (the function must be continuous and differentiable). ML is mostly used when the objective function is too complex ("gnarly") and you don't know the derivatives. For example, an artificial neural network (ANN) can be used to solve either a function approximation problem or a supervised classification problem when the function is not known. In this case, the ANN is the function.

Don't make the mistake of using ML methods to solve a logistic regression problem just because you can. For logistic regression, Newton-Raphson is extremely fast and is the appropriate technique for solving the problem. ML is commonly used when you don't know what the function is. (By the way, ANNs are from the field of computational intelligence, not ML.)