import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
np.random.seed(1000)
What is machine learning? Historically, it has been defined as, “the field of study that gives computers the ability to learn without being explicitly programmed.” More formally (and informatively), a good definition of machine learning is
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
That’s kind of hard to follow. To make it easier to digest, I’m going to explain linear regression, a massively common technique in statistics, and show how it can be considered a machine learning algorithm.
Let’s say you have two continuous variables \(x\) and \(y\) and you want to test whether or not they are correlated with one another. Let \(x\) be the salary of an individual and \(y\) be that individual’s self-reported happiness. In real life (just in case you weren’t yet convinced this was relevant to real life), if you wanted to test whether there was a correlative relationship between these two variables, you would sample a bunch of individuals, learn their salaries, and interrogate them about how happy they are. From this you would get a bunch of salaries
\[\underline{x} = \{x^{(0)}, x^{(1)}, \dots , x^{(m-1)}\}\]
and a bunch of happinesses
\[\underline{y} = \{y^{(0)}, y^{(1)}, \dots , y^{(m-1)}\}.\]
Look at all those happinesses. Then you would plot the variables against one another. I’m going to make some fake-ass data to illustrate whatever it is I’m talking about. Salary will be expressed in USD and happiness will be expressed on a \(0-100\) scale, where \(0\) is last picked in gym class and \(100\) is apotheosis.
# number of people you bother
m = 10
# Pareto-distributed is more correct but this isn't economics class
X = np.random.exponential(3e4,size=m)
# sophisticated metrics for happiness
buying_whatever = 100*np.sqrt(X/np.max(X))
induced_cynicism = -50*(1 - (X/np.max(X))**(1/5.))
intrinsic_nihilism = -20*np.random.rand(m)
human_unpredictability = np.random.normal(1,0.2,size=m)
# the equation for happiness
Y = (buying_whatever + induced_cynicism + intrinsic_nihilism) * human_unpredictability
# fixing people who are too happy
Y[Y>100] = 100 - 1
# removing individuals where happiness < 0 (suicide)
X = X[Y>0]
Y = Y[Y>0]
# recalculating sample size after accounting for suicide
m = len(X)
plt.scatter(X,Y)
plt.xlabel("salary [$]"); plt.ylabel("happiness")
plt.show()
[figure: scatter plot of salary [$] vs. happiness]
There is an observed correlation between money made and happiness garnered. Therefore money = happiness, Gandhi was right. We can fit this with a straight line using a method called, you guessed it, linear regression. The premise is to find the line that minimizes the mean squared error between the line and the data points. The line is defined by the equation
\[h(x;\underline{\theta}) = \theta_0 + \theta_1 x\]
and if we want to minimize the mean squared error of this function with respect to the data, this is equivalent to minimizing the following cost function:
\[J(\underline{\theta}) = \frac{1}{2 m} \sum\big[\text{fit} - \text{data}\big]^2\] \[J(\underline{\theta}) = \frac{1}{2 m} \sum_{i=0}^{m-1}\big[h(x^{(i)}) - y^{(i)}\big]^2\]
Below I illustrate this concept.
def fit(x,theta0,theta1):
    return theta0 + theta1*x

# trying different parameter sets
thetas = [(50,4e-4),(5,1e-3)]
fig, axes = plt.subplots(1, 2, figsize=(12,4), sharey=True)
for i,theta in enumerate(thetas):
    # calculate the hypothesis line for given theta
    Y_fit = fit(X,theta[0],theta[1])
    # plot everything
    minmax = np.sort(np.vstack((Y,Y_fit)).T, axis=1)
    axes[i].scatter(X,Y)
    axes[i].plot(X,Y_fit,c='k',alpha=0.7)
    axes[i].vlines(x=X,ymin=minmax[:,0],ymax=minmax[:,1],alpha=0.5,color='r',linestyle='--')
plt.tight_layout()
plt.show()
[figure: two candidate lines over the salary vs. happiness scatter, with residuals drawn as dashed vertical lines]
In an attempt to model this data, on the left I’ve drawn a really crappy line–I’m sure you’ll agree. We know it’s bad because the distances between the data and the line are large. On the right I draw a different line, which looks better and does a better job of minimizing the distances to the data. But rather than continuing to play pin the tail on the donkey like a jackass, I’m going to employ a method that iteratively converges to the line that minimizes the mean squared error. Mathematically, our goal is to minimize the cost function:
\[\displaystyle{\min_{\theta_0, \theta_1} J(\theta_0,\theta_1)} = \displaystyle{\min_{\theta_0, \theta_1} \frac{1}{2 m} \sum_{i=0}^{m-1}\big[(\theta_0 + \theta_1 x^{(i)}) - y^{(i)}\big]^2 }\]
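To make the cost concrete before minimizing it, here’s a quick sketch that evaluates \(J\) for the two hand-picked parameter sets from the plot above. It reuses the fit function and the thetas list defined earlier; calc_J is just a throwaway helper for this check, and the exact numbers depend on the random data, but the second parameter set should come out noticeably cheaper.

# evaluate the cost J(theta) for the two candidate lines above
def calc_J(X, Y, theta0, theta1):
    # mean squared residual, divided by 2, as in the equation above
    residuals = fit(X, theta0, theta1) - Y
    return np.sum(residuals**2) / (2 * len(X))

for theta0, theta1 in thetas:
    print("theta =", (theta0, theta1), "-> J =", calc_J(X, Y, theta0, theta1))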
The method I’m going to use is called gradient descent. As you may have learned from calculus, if you have some function \(f(x,y)\) and you calculate its gradient, \(\nabla f(x,y)\), the result is a vector in \(x\) and \(y\) that points in the direction of greatest ascent of the function \(f(x,y)\). If I’m losing you, instead pay attention to this analogy: Pretend you’re athletic–I know, I know–and you’re hiking with the goal of getting to the top of a hill. Let \(x\) be your longitude, \(y\) be your latitude, and let \(f(x,y)\) be the height of the hill at your current position. The gradient of \(f(x,y)\), \(\nabla f(x,y)\), points in the direction of steepest ascent. This means if you want to get to the top of the hill taking as few steps as possible, your hiking algorithm should be to calculate \(\nabla f(x,y)\), take a step in that direction, calculate \(\nabla f(x,y)\) at your new position, take a step in that direction, etc. etc.
So now you’re at the top and you want to get down with as few steps as possible because hiking sucks. Since \(\nabla f(x,y)\) took you in the direction of steepest ascent, \(-\nabla f(x,y)\) is guaranteed to take you in the direction of steepest descent. The algorithm for hiking down the hill in as few steps as possible is the same algorithm for finding the line that minimizes the mean squared distance between the line and the data. The variables map in the following way:
longitude \(\rightarrow \theta_0\)
latitude \(\rightarrow \theta_1\)
height \(\rightarrow J(\theta_0,\theta_1)\)
So instead of existing on the surface of the earth, you now persist in a plane of existence where distances with respect to longitude and latitude are defined by \(\theta_0\) and \(\theta_1\) and distances with respect to altitude are characterized by a function \(J(\underline{\theta})\). Hiking doesn’t exist, you weren’t picked last in gym class, and your sentience is doubtful, although strictly speaking indeterminable. In fact, your whole existence is merely an emergent property of someone’s fake data generated in a completely disjoint realm of reality. Nonetheless, you have been programmed to find the minimum of this function and you employ the same algorithm that the dude on the hill did. You update your coordinates iteratively:
\[\underline{\theta} := \underline{\theta} - \alpha \nabla J(\theta_0,\theta_1)\]
Computing the gradient and expressing the algorithm term by term,
\[\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=0}^{m-1}\big[(\theta_0 + \theta_1 x^{(i)}) - y^{(i)}\big] \]
\[\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=0}^{m-1}\big[(\theta_0 + \theta_1 x^{(i)}) - y^{(i)}\big] x^{(i)}\]
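For completeness, these updates are just the components of \(\nabla J\); differentiating the cost function with respect to each parameter via the chain rule gives

\[\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=0}^{m-1}\big[(\theta_0 + \theta_1 x^{(i)}) - y^{(i)}\big]\,\frac{\partial}{\partial \theta_j}\big(\theta_0 + \theta_1 x^{(i)}\big),\]

where the last factor is \(1\) for \(\theta_0\) and \(x^{(i)}\) for \(\theta_1\). The \(\frac{1}{2}\) in \(J\) is there precisely so it cancels the \(2\) that the chain rule pulls down.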
The only new variable I’ve introduced is \(\alpha\), which is the step length (for the case of the hiker \(\alpha\) is literally his stride length projected onto the \(xy\)-plane) you take during each iteration. I am gonna implement this now.
def iterate(Xm,y,theta,T,alpha,conv=None):
    """
    Performs multi-variate linear regression.
    Fits n-variate data x to
        h(x;theta) = theta0 + theta1*x1 + ... + thetan*xn
    by using the method of gradient descent to find the theta
    that minimizes the mean squared error cost function
        J(theta) = 1/(2m) * sum_i( [h(x^(i);theta) - y^(i)]^2 )
    where m is the number of data points and x^(i) is the ith
    data point.
    (https://en.wikipedia.org/wiki/Linear_regression)

    INPUTS
    ------
    Xm : numpy array
        Xm is a matrix of explanatory variables with
        shape = (a,b), where a is the number of observations
        and b is the number of explanatory variables. In the
        case of a 2D linear regression, the matrix should
        look like either

            var1 var2
            [ 9    9 ]
            [ 8    2 ]
            [ 3    2 ]
            [...  ...]

        or

            var0 var1 var2
            [ 1    9    9 ]
            [ 1    8    2 ]
            [ 1    3    2 ]
            [...  ...  ...]

        In the second case the first column is the
        pseudo explanatory variable for the parameter theta0
        and is necessarily all 1's. If the user provides
        Xm like the first type, this column will be
        automatically added. If you pass Xm with the column
        already added, it has to be column 0.
    y : numpy array
        y is the dependent variable and should have a length
        equal to the number of Xm rows.
    theta : numpy array
        theta is the initial guess for your parameter estimates.
        If your data is n-dimensional, theta should be of
        length n+1 and ordered. For example, if n=2,
        theta = [theta0,theta1,theta2]. I'm being
        redundant at this point, but just to drive home the
        point, the fit function would then be
        y = theta0 + theta1*x1 + theta2*x2.
    T : int
        The maximum number of iterations performed. If
        convergence is not reached by T iterations, the params
        are returned anyway.
    alpha : float
        The step size of the gradient descent. alpha should be
        large enough so that theta traverses the parameter
        landscape sufficiently, but small enough so that the
        cost function J decreases with each iteration.
    conv : float, default = None
        This value defines convergence of the cost function.
        At each iteration t, if cost[t-1]-cost[t] < conv,
        the fit is said to have converged. It is situation
        specific and if not provided, all T iterations
        will be carried out.

    RETURNS
    -------
    cost_history : numpy array
        1D numpy array of length iters, where iters is the
        number of iterations performed. Each cost_history[t]
        is the cost at iteration t.
    theta_history : numpy array
        numpy array of shape (iters,n+1), where iters is the
        number of iterations performed and n is the dimension
        of the data. Each theta_history[t,:] is theta at
        iteration t.
    """
    # append the pseudo-column of 1's if not already present
    if (Xm.ndim == 1):
        pseudo = np.ones(np.shape(Xm)[0])
        Xm = np.vstack((pseudo,Xm.T)).T
    elif not np.all(Xm[:,0] == 1):
        pseudo = np.ones(np.shape(Xm)[0])
        Xm = np.vstack((pseudo,Xm.T)).T
    else:
        pass
    # feature scaling with mean normalization so J is not highly
    # elliptical and is centered about the 0 vector; the means and
    # ranges are stored so theta can be converted back once the fit
    # converges
    ranges = np.zeros(np.shape(Xm)[1]-1)
    means = np.zeros(np.shape(Xm)[1]-1)
    rangey = np.max(y)-np.min(y)
    meany = np.mean(y)
    y = (y-meany)/rangey
    for col in range(len(ranges)):
        ranges[col] = np.max(Xm[:,col+1])-np.min(Xm[:,col+1])
        means[col] = np.mean(Xm[:,col+1])
        Xm[:,col+1] = (Xm[:,col+1]-means[col]) / ranges[col]
    # track the cost and parameters
    cost_history = np.zeros(T)
    theta_history = np.zeros((T,len(theta)))
    # iterate
    for t in range(0,T):
        cost_history[t] = calc_cost(Xm,y,theta)
        theta_history[t,:] = theta
        theta = take_step(Xm,y,theta,alpha)
        # stop if the cost has converged
        if conv is not None:
            if (t > 0) and (cost_history[t-1]-cost_history[t] < conv):
                theta_history[t,:] = theta
                cost_history[t] = calc_cost(Xm,y,theta)
                theta_history = theta_history[cost_history!=0,:]
                cost_history = cost_history[cost_history!=0]
                break
    # reverse the effects of feature scaling and mean normalization
    nasty = np.sum(theta_history[:,1:]*(means/ranges),axis=1)
    theta_history[:,0] = (theta_history[:,0]-nasty)*rangey + meany
    theta_history[:,1:] = rangey*theta_history[:,1:]/ranges
    return cost_history, theta_history
def take_step(Xm,Y,theta,alpha):
    # number of observations
    m = len(Y)
    # hypothesis
    h = np.dot(Xm,theta)
    # gradient of the cost function
    grad = 1./m * np.dot(Xm.T,(h-Y))
    # each step is just -grad * alpha
    step = -grad * alpha
    # return the new theta
    return theta + step

def calc_cost(Xm,y,theta):
    # number of observations
    m = len(y)
    # hypothesis
    h = np.dot(Xm,theta)
    # mean of the squared residuals, divided by 2
    cost = 1./(2*m) * np.sum((h-y)**2)
    return cost
##################################################
# initial guess for theta
theta = np.array([0.5,0.0])
# alpha is the step size
alpha = 1
# max number of iterations
T = 1000
# stop once cost[t-1]-cost[t] falls below this value
conv = 1e-6
cost, theta = iterate(X,Y,theta,T,alpha,conv)
Let’s quickly plot some of the results.
iterations = np.arange(len(cost))
fig, axes = plt.subplots(2, 2, figsize=(12,8))
# iteration vs cost
axes[0,0].plot(iterations,cost)
axes[0,0].set_title("Cost")
# iteration vs theta0
axes[0,1].plot(iterations,theta[:,0])
axes[0,1].set_title(r"$\theta_0$");axes[0,1].set_xlabel("iterations")
# iteration vs theta1
axes[1,0].plot(iterations,theta[:,1])
axes[1,0].set_title(r"$\theta_1$");axes[1,0].set_xlabel("iterations")
# X vs Y with line of best fit
Y_fit = fit(X,theta[-1,0],theta[-1,1])
minmax = np.sort(np.vstack((Y,Y_fit)).T, axis=1)
axes[1,1].scatter(X,Y)
axes[1,1].plot(X,Y_fit,c='k',alpha=0.7)
axes[1,1].vlines(x=X,ymin=minmax[:,0],ymax=minmax[:,1],alpha=0.5,color='r',linestyle='--')
axes[1,1].set_xlabel("salary [$]"); axes[1,1].set_ylabel("happiness")
plt.show()
[figure: 2x2 panel of cost, theta0, and theta1 vs. iteration, plus the data with the fitted line and residuals]
The cost function continually decreases, which means we correctly programmed our hiker to descend the cost-function hill. Since the rate of descent levels off, we know we converged to the minimum of the function, which can also be seen in the convergence of \(\theta_0\) and \(\theta_1\). Indeed, constructing the line of best fit with these parameters yields a line that fits the data well, which, as you will remember, was the whole point.
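If you don’t trust the hiker, you can sanity-check him against numpy’s built-in least-squares fit, which solves the same minimization in closed form; the two should land on (roughly) the same line.

# compare gradient descent against numpy's closed-form least-squares fit
slope, intercept = np.polyfit(X, Y, 1)
print("np.polyfit:        intercept =", intercept, " slope =", slope)
print("gradient descent:  intercept =", theta[-1,0], " slope =", theta[-1,1])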
So what is machine learning about this? Let’s look back to the definition of machine learning:
“A computer program C is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Let’s frame this in terms of the hiker.
C \(\rightarrow\) a hiker
E \(\rightarrow\) moving his legs (updating \(\underline{\theta}\))
T \(\rightarrow\) descending the hill (minimizing the cost function)
P \(\rightarrow\) his current altitude (value of cost function at \(\underline{\theta}\))
Then,
“A hiker, whose goal is to descend a hill, is said to learn from moving his legs if his performance at descending the hill, as measured by his current altitude, improves as he moves his legs.”
Linear regression is therefore a machine learning method, which is funny, because machine learning is new jargon, but linear regression is ancient. I’m bored of this now, so this is the end of the tutorial. Bye.
By the way, the above linear regression function generalizes to an n-variate linear regression. Feel free to use it.
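For instance, a hypothetical two-variable fit might look something like the sketch below; hours_slept is made-up data purely to show the calling convention, so don’t read anything into the result.

# made-up second explanatory variable, just to demonstrate the n-variate call
hours_slept = np.random.normal(7, 1, size=m)
# one row per observation, one column per explanatory variable
Xm = np.vstack((X, hours_slept)).T
# initial guess needs n+1 entries: [theta0, theta1, theta2]
theta_init = np.array([0.5, 0.0, 0.0])
cost2, theta2 = iterate(Xm, Y, theta_init, T=1000, alpha=1, conv=1e-6)
# the last row holds the fitted [theta0, theta1, theta2]
print(theta2[-1,:])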