Machine Learning is one of today's most influential technologies. With it, we enable computers to make predictions without being manually programmed.
It is important to remember that what we call machine learning is the intersection of applied statistics and computer science. I've met colleagues who are interested in the technology but still view it as a "magic box" into which you put data and out of which you get a functioning computer program. The truth is, the machine learning algorithms you hear about aren't all that complex. Moreover, understanding the underlying processes is essential to getting a feel for what is and isn't possible with the technology, not to mention how to use it correctly.
Most courses can be boiled down to a few key conceptual components. The aim of this post is to summarize the content of Andrew Ng's famous Machine Learning Coursera course as succinctly as possible, giving the reader a glimpse under the hood of these machine learning algorithms once and for all.
Linear Regression
Given a collection of data points consisting of tuples \((x, y)\), a classic problem in statistics is to choose a line of best fit to the data. In a machine learning context, we call the data points the training set, and fitting the line is the process of training the machine learning algorithm.
-> A training set is a set \(D\) where elements of the set are vectors of \(n\) dimensions. We say that \(D\) contains \(m\) samples.
To use one of Andrew Ng's analogies, given a dataset of past sales from a certain realtor, we could, for instance, create a simple algorithm that when given the size of a house, predicts its market value. This will become more obvious from the graphic below.
Of the class of problems about fitting lines, the most fundamental problem is to fit a linear function to the data. Later, we will expand this further to incorporate non-linear functions.
-> Define \(\theta\) to be some row vector interpreted to be a model we are fitting to our dataset.
In our case, the models are the set of all possible linear functions. Since we know that all such functions can be written as \(h(x) = mx + b\), we can let \(\theta = \begin{bmatrix}b & m \end{bmatrix}\). Here the letter h stands for "hypothesis". In order to apply this model to the independent variable \(x\), it is now a matter of matrix multiplication. $$\begin{align} h(x) &= \theta x \\ &= \begin{bmatrix}b & m \end{bmatrix} \begin{bmatrix} 1 \\ x \end{bmatrix} \\ &= mx + b \end{align}$$
That extra one above the \(x\) is called the bias unit and is necessary so that our predictions can have a vertical offset. Our end goal is to choose \(\theta\) such that, roughly, \(\forall (x, y) \in D,\ h(x) \approx y\).
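As a quick sanity check, the hypothesis with its bias unit can be sketched in plain Python (the function name and values here are illustrative, not from the course):

```python
# Hypothetical sketch: evaluating h(x) = theta · [1, x] with an explicit bias unit.
def h(theta, x):
    """Apply a linear hypothesis theta = [b, m] to a scalar input x.

    The 1 prepended to x is the bias unit, so the dot product
    works out to m*x + b.
    """
    features = [1.0, x]
    return sum(t * f for t, f in zip(theta, features))

theta = [2.0, 3.0]    # b = 2, m = 3, i.e. h(x) = 3x + 2
print(h(theta, 4.0))  # 3*4 + 2 = 14.0
```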
We're going out of our way to define things in terms of linear algebra. This is intentional. Later on, we'll see that even non-linear terms are going to be incorporated through the framework of linear algebra.
Cost Functions & Gradient Descent
Now that we have a way to talk about the training set and hypotheses, the meat of the algorithm will be determining which one is the most suitable. For this we will use the method of least squares and gradient descent.
-> For an arbitrary sample \((x, y) \in D\), define the cost as $$J((x, y), \theta) = (\theta x - y)^2 = (h(x) - y)^2$$
-> For an arbitrary hypothesis \(\theta\), define the cost as $$J(\theta) = \frac{1}{m}\sum\limits_{i=1}^m J(D_i, \theta)$$ where \(D_i\) denotes the i-th training example. Recall that \(m\) is the size of the training set, so in this sense we're finding the average cost over all the samples.
You may notice that the function \(J\) has two meanings: one for an arbitrary sample and one for an arbitrary hypothesis. The hypothesis cost is formed by averaging the costs of all the samples under that hypothesis. Which \(J\) we are using should be clear from context.
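The two meanings of \(J\) translate directly into code. Here is a hedged sketch (function names and the dataset are my own, chosen for illustration):

```python
# Illustrative sketch of the two meanings of J: per-sample cost and
# average cost over the training set D.
def h(theta, x):
    return theta[0] + theta[1] * x           # h(x) = b + m*x

def sample_cost(theta, x, y):
    return (h(theta, x) - y) ** 2            # J((x, y), theta)

def hypothesis_cost(theta, D):
    m = len(D)                               # m samples in D
    return sum(sample_cost(theta, x, y) for x, y in D) / m

D = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]     # lies exactly on y = 2x + 1
print(hypothesis_cost([1.0, 2.0], D))        # perfect fit -> 0.0
```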
Let us draw a 3D graph consisting of the vector space of all hypotheses \(\theta\) along the bottom plane and their associated costs \(J(\theta)\) along the vertical axis. We want to find the point \(\theta\) that minimizes this cost.
When \(\theta\) has two components, this graph takes the shape of a paraboloid. This is good, because the vertex is the only minimum on the whole graph. Some cost functions have many local minima, and finding the global minimum can be quite a challenge.
Proving the paraboloid shape isn't too difficult. Recall that a parabola is an equation of the form \(y = ax^2 + bx + c\), upward-facing when \(a > 0\). A paraboloid is just this with an extra dimension: \(y = ax_1^2 + bx_2^2 + cx_1x_2 + dx_1 + ex_2 + f\) where \(a, b > 0\). Looking at \(J(\theta)\) in the single-feature case, where the variables are the components \(\theta_0\) and \(\theta_1\), we find$$\begin{align}J(\theta) &= \frac{1}{m}\sum\limits_{i=1}^m (\theta x^{(i)} - y^{(i)})^2 \\ &= \frac{1}{m}\sum\limits_{i=1}^m (\theta_0 + \theta_1 x^{(i)} - y^{(i)})^2 \\ &= \frac{1}{m}\sum\limits_{i=1}^m (a^{(i)}\theta_0^2 + b^{(i)}\theta_1^2 + c^{(i)}\theta_0\theta_1 + d^{(i)}\theta_0 + e^{(i)}\theta_1 + f^{(i)})\end{align}$$So really \(J\) is a sum of many paraboloids, which is still a paraboloid.
The gradient descent algorithm is just like sliding down a hill. We begin by choosing some point as an initial guess. We will call this guess \(\theta_0\) and initialize it at the origin: \(\theta_0 = \begin{bmatrix} 0 & 0 \end{bmatrix}\). Recall that here the origin is the linear function with slope and intercept both zero.
We then compute the gradient of the cost function \(J\). Recall from vector calculus that the gradient, \(\nabla f\), is defined as $$\nabla f = \frac{\partial f}{\partial x_1}\hat{x_1} + \frac{\partial f}{\partial x_2}\hat{x_2} + \cdots + \frac{\partial f}{\partial x_n}\hat{x_n}$$where \(\hat{x_i}\) are orthogonal unit vectors. Intuitively, we think of the gradient as a multidimensional derivative. It is a vector field whose arrows point uphill, with magnitudes equal to the steepness of the slope.
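To build intuition for this definition, each partial derivative can be approximated numerically with a finite difference. A minimal sketch (the test function and step size below are toy choices of mine):

```python
# Finite-difference sketch of the gradient: each component approximates
# the partial derivative along one axis.
def numerical_gradient(f, point, eps=1e-6):
    grad = []
    for i in range(len(point)):
        bumped = list(point)
        bumped[i] += eps                            # nudge one coordinate
        grad.append((f(bumped) - f(point)) / eps)   # df/dx_i
    return grad

f = lambda p: p[0] ** 2 + 3 * p[1]        # f(x1, x2) = x1^2 + 3*x2
print(numerical_gradient(f, [2.0, 0.0]))  # approximately [4.0, 3.0]
```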
The gradient descent algorithm now proceeds as follows.
\(\theta = \theta_0\)
repeat {
    \(\theta = \theta - \alpha \nabla J(\theta)\)
}
The variable \(\alpha\) controls how quickly we slide down the slope. If set high, we approach the minimum in very few iterations but risk jumping back and forth over it. If set low, we converge slowly but settle much closer to the true minimum.
Implementing gradient descent for our least-squares cost function is now quite easy. Our main task is to derive \(\nabla J(\theta)\). $$\begin{align} \nabla J(\theta) &= \frac{\partial J}{\partial \theta_1}\hat{\theta_1} + \frac{\partial J}{\partial \theta_2}\hat{\theta_2} + \cdots + \frac{\partial J}{\partial \theta_n}\hat{\theta_n} \\ \\ \frac{\partial J}{\partial \theta_i} &= \frac{\partial}{\partial \theta_i}\left(\frac{1}{m}\sum\limits_{j=1}^m (\theta x^{(j)} - y^{(j)})^2\right) \\ &= \frac{1}{m}\sum\limits_{j=1}^m \frac{\partial}{\partial \theta_i}(\theta x^{(j)} - y^{(j)})^2 \\ &= \frac{2}{m}\sum\limits_{j=1}^m (\theta x^{(j)} - y^{(j)})\frac{\partial}{\partial \theta_i}(\theta x^{(j)}) \\ &= \frac{2}{m}\sum\limits_{j=1}^m (\theta x^{(j)} - y^{(j)})\,x_i^{(j)} \\ \\ \nabla J(\theta) &= \frac{2}{m} \sum\limits_{i=1}^n \left(\sum\limits_{j=1}^m (\theta x^{(j)} - y^{(j)})\,x_i^{(j)}\right)\hat{\theta_i} \end{align}$$
It's remarkable that this can be found in closed form. This function, \(\nabla J(\theta)\), tells us how the fairly complicated cost function changes with respect to \(\theta\) in any number of dimensions. Note that we're not differentiating with respect to the attributes of any of the samples; the components of \(\theta\) are the parameters of the line itself. Computing this function means we always have a sure way of knowing which direction is "downhill".
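Putting the pieces together, here is a hedged sketch of gradient descent for least-squares linear regression with a single feature, using the gradient derived above, \(\frac{2}{m}\sum_j (\theta x^{(j)} - y^{(j)})x_i^{(j)}\). The data and hyperparameters are made up for illustration:

```python
# Sketch: gradient descent on least-squares linear regression,
# theta = [b, slope], initialized at the origin as in the text.
def fit(D, alpha=0.1, iterations=2000):
    m = len(D)
    b, slope = 0.0, 0.0
    for _ in range(iterations):
        # dJ/db uses the bias unit's feature value, which is always 1
        grad_b = (2 / m) * sum((b + slope * x - y) * 1 for x, y in D)
        grad_m = (2 / m) * sum((b + slope * x - y) * x for x, y in D)
        b, slope = b - alpha * grad_b, slope - alpha * grad_m
    return b, slope

D = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]   # lies exactly on y = 2x + 1
b, slope = fit(D)
print(round(b, 3), round(slope, 3))        # near 1.0 and 2.0
```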
Examples
We have defined everything above in the multivariate case, which can be hard to picture. Let's begin with the single-variable example.
Now let's extend this to two variables. Instead of just floor size, let's include a dimension for how close it is to major metropolitan areas. We say we are adding an additional feature to our training set.
Instead of one slope, we will have two slopes. The first, \(\theta_1\), controls the price sensitivity to the size of the house, and the second, \(\theta_2\), controls the price sensitivity to its location. \(\theta_0\) is always used for the vertical offset at the origin. When graphed, we have two input dimensions and one output dimension. Instead of a straight line, we get its two-dimensional analogue: a flat plane.
In the two-dimensional case, we also see how linear algebra is beneficial. Our previous definition of the hypothesis expands all by itself.
$$\begin{align} h(x) &= \theta x \\ &= \begin{bmatrix}b & m_1 & m_2 \end{bmatrix} \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix} \\ &= m_1x_1 + m_2x_2 + b \end{align}$$In the next post we will investigate training non-linear components for each feature. This will allow us to blend certain features together in a unique way. We will also face the problem of overfitting and a simple solution known as regularization.
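In code, the same expansion happens with no structural changes. A sketch (the parameter values are made-up numbers, not real housing coefficients):

```python
# The hypothesis code extends to two features "all by itself":
# theta = [b, m1, m2] applied to [1, x1, x2].
def h(theta, x):
    features = [1.0] + list(x)   # prepend the bias unit
    return sum(t * f for t, f in zip(theta, features))

theta = [50.0, 0.5, -2.0]        # b, m1 (size), m2 (distance): toy values
print(h(theta, [1000.0, 10.0]))  # 0.5*1000 - 2*10 + 50 = 530.0
```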