import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
np.random.seed(1000)
What is machine learning? Historically, it has been defined as, “the field of study that gives computers the ability to learn without being explicitly programmed.” More formally (and informatively), a good definition of machine learning is
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
That’s kind of hard to follow. To make it easier to digest, I’m going to explain linear regression, a massively common technique in statistics, and show how it can be considered a machine learning algorithm.
Let’s say you have two continuous variables \(x\) and \(y\) and you want to test whether or not they are correlated with one another. Let \(x\) be the salary of an individual and \(y\) be that individual’s self-reported happiness. In real life (just in case you weren’t yet convinced this was relevant to real life), if you wanted to test whether there was a correlative relationship between these two variables, you would sample a bunch of individuals, learn their salaries, and interrogate them about how happy they are. From this you would get a bunch of salaries
\[\underline{x} = \{x^{(0)}, x^{(1)}, \dots , x^{(m-1)}\}\]
and a bunch of happinesses
\[\underline{y} = \{y^{(0)}, y^{(1)}, \dots , y^{(m-1)}\}.\]
Look at all those happinesses. Then you would plot the variables against one another. I’m going to make some fake-ass data to illustrate whatever it is I’m talking about. Salary will be expressed in USD and happiness will be expressed on a \(0-100\) scale, where \(0\) is last picked in gym class and \(100\) is apotheosis.
# number of people you bother
m = 10
# Pareto-distributed is more correct but this isn't economics class
X = np.random.exponential(3e4,size=m)
# sophisticated metrics for happiness
buying_whatever = 100*np.sqrt(X/np.max(X))
induced_cynicism = -50*(1 - (X/np.max(X))**(1/5.))
intrinsic_nihilism = -20*np.random.rand(m)
human_unpredictability = np.random.normal(1,0.2,size=m)
# the equation for happiness
Y = (buying_whatever + induced_cynicism + intrinsic_nihilism) * human_unpredictability
# fixing people who are too happy
Y[Y>100] = 100 - 1
# removing individuals where happiness < 0 (suicide)
X = X[Y>0]
Y = Y[Y>0]
# recalculating sample size after accounting for suicide
m = len(X)
plt.scatter(X,Y)
plt.xlabel("salary [$]"); plt.ylabel("happiness")
plt.show()
[figure: scatter plot of salary [$] vs. happiness]
There is an observed correlation between money made and happiness garnered. Therefore money = happiness, Gandhi was right. We can fit this with a straight line using a method called, you guessed it, linear regression. The premise is to find the line that minimizes the mean squared error between the line and the data points. The line is defined by the equation
\[h(x;\underline{\theta}) = \theta_0 + \theta_1 x\]
and if we want to minimize the mean squared error of this function with respect to the data, this is equivalent to minimizing the following cost function:
\[J(\underline{\theta}) = \frac{1}{2 m} \sum\big[\text{fit} - \text{data}\big]^2\] \[J(\underline{\theta}) = \frac{1}{2 m} \sum_{i=0}^{m-1}\big[h(x^{(i)}) - y^{(i)}\big]^2\]
Below I illustrate this concept.
def fit(x,theta0,theta1):
    return theta0 + theta1*x

# trying different parameter sets
thetas = [(50,4e-4),(5,1e-3)]
fig, axes = plt.subplots(1, 2, figsize=(12,4), sharey=True)
for i,theta in enumerate(thetas):
    # calculate the hypothesis line for given theta
    Y_fit = fit(X,theta[0],theta[1])
    # plot everything
    minmax = np.sort(np.vstack((Y,Y_fit)).T, axis=1)
    axes[i].scatter(X,Y)
    axes[i].plot(X,Y_fit,c='k',alpha=0.7)
    axes[i].vlines(x=X,ymin=minmax[:,0],ymax=minmax[:,1],alpha=0.5,color='r',linestyle='--')
plt.tight_layout()
plt.show()
[figure: two candidate lines over the salary vs. happiness scatter, with residuals drawn as dashed vertical lines]
In an attempt to model this data, on the left I’ve drawn a really crappy line–I’m sure you’ll agree. We know it’s bad because the distances between the data and the line are large. On the right I draw a different line, which looks better and does a better job of minimizing the distances to the data. But rather than continuing to play pin the tail on the donkey like a jackass, I’m going to employ a method that iteratively converges to the line that minimizes the mean squared error. Mathematically, our goal is to minimize the cost function:
\[\displaystyle{\min_{\theta_0, \theta_1} J(\theta_0,\theta_1)} = \displaystyle{\min_{\theta_0, \theta_1} \frac{1}{2 m} \sum_{i=0}^{m-1}\big[(\theta_0 + \theta_1 x^{(i)}) - y^{(i)}\big]^2 }\]
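To make the cost concrete before minimizing it, here’s a quick sketch that evaluates \(J\) for the two hand-picked parameter sets from the plot above. It reuses the fit function and the thetas list defined earlier; calc_J is just a throwaway helper for this check, and the exact numbers depend on the random data, but the second parameter set should come out noticeably cheaper.

# evaluate the cost J(theta) for the two candidate lines above
def calc_J(X, Y, theta0, theta1):
    # mean squared residual, divided by 2, as in the equation above
    residuals = fit(X, theta0, theta1) - Y
    return np.sum(residuals**2) / (2 * len(X))

for theta0, theta1 in thetas:
    print("theta =", (theta0, theta1), "-> J =", calc_J(X, Y, theta0, theta1))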
The method I’m going to use is called gradient descent. As you may have learned from calculus, if you have some function \(f(x,y)\) and you calculate its gradient, \(\nabla f(x,y)\), the result is a vector in \(x\) and \(y\) that points in the direction of greatest ascent of the function \(f(x,y)\). If I’m losing you, instead pay attention to this analogy: Pretend you’re athletic–I know, I know–and you’re hiking with the goal of getting to the top of a hill. Let \(x\) be your longitude, \(y\) be your latitude, and let \(f(x,y)\) be the height of the hill at your current position. The gradient of \(f(x,y)\), \(\nabla f(x,y)\), points in the direction of steepest ascent. This means if you want to get to the top of the hill taking as few steps as possible, your hiking algorithm should be to calculate \(\nabla f(x,y)\), take a step in that direction, calculate \(\nabla f(x,y)\) at your new position, take a step in that direction, etc. etc.
So now you’re at the top and you want to get down with as few steps as possible because hiking sucks. Since \(\nabla f(x,y)\) took you in the direction of steepest ascent, \(-\nabla f(x,y)\) is guaranteed to take you in the direction of steepest descent. The algorithm for hiking down the hill in as few steps as possible is the same algorithm for finding the line that minimizes the mean squared distance between the line and the data. The variables map in the following way:
longitude \(\rightarrow \theta_0\)
latitude \(\rightarrow \theta_1\)
height \(\rightarrow J(\theta_0,\theta_1)\)
So instead of existing on the surface of the earth, you now persist in a plane of existence where distances with respect to longitude and latitude are defined by \(\theta_0\) and \(\theta_1\) and distances with respect to altitude are characterized by a function \(J(\underline{\theta})\). Hiking doesn’t exist, you weren’t picked last in gym class, and your sentience is doubtful, although strictly speaking indeterminable. In fact, your whole existence is merely an emergent property of someone’s fake data generated in a completely disjoint realm of reality. Nonetheless, you have been programmed to find the minimum of this function and you employ the same algorithm that the dude on the hill did. You update your coordinates iteratively:
\[\underline{\theta} := \underline{\theta} - \alpha \nabla J(\theta_0,\theta_1)\]
Computing the gradient and expressing the algorithm term by term,
\[\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=0}^{m-1}\big[(\theta_0 + \theta_1 x^{(i)}) - y^{(i)}\big] \]
\[\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=0}^{m-1}\big[(\theta_0 + \theta_1 x^{(i)}) - y^{(i)}\big] x^{(i)}\]
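For completeness, these updates are just the components of \(\nabla J\); differentiating the cost function with respect to each parameter via the chain rule gives

\[\frac{\partial J}{\partial \theta_j} = \frac{1}{m} \sum_{i=0}^{m-1}\big[(\theta_0 + \theta_1 x^{(i)}) - y^{(i)}\big]\,\frac{\partial}{\partial \theta_j}\big(\theta_0 + \theta_1 x^{(i)}\big),\]

where the last factor is \(1\) for \(\theta_0\) and \(x^{(i)}\) for \(\theta_1\). The \(\frac{1}{2}\) in \(J\) is there precisely so it cancels the \(2\) that the chain rule pulls down.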
The only new variable I’ve introduced is \(\alpha\), which is the step length (for the case of the hiker \(\alpha\) is literally his stride length projected onto the \(xy\)-plane) you take during each iteration. I am gonna implement this now.
def iterate(Xm,y,theta,T,alpha,conv=None):
    """
    Performs multi-variate linear regression.
    Fits n-variate data x to
        h(x;theta) = theta0 + theta1*x1 + ... + thetan*xn
    by using the method of gradient descent to find the theta
    that minimizes the mean squared error cost function
        J(theta) = 1/(2m) * sum_i( [h(x^(i);theta) - y^(i)]^2 )
    where m is the number of data points and x^(i) is the ith
    data point.
    (https://en.wikipedia.org/wiki/Linear_regression)

    INPUTS
    ------
    Xm : numpy array
        Xm is a matrix of explanatory variables with
        shape = (a,b), where a is the number of observations
        and b is the number of explanatory variables. In the
        case of a 2D linear regression, the matrix should
        look like either

            var1 var2
            [ 9    9 ]
            [ 8    2 ]
            [ 3    2 ]
            [...  ...]

        or

            var0 var1 var2
            [ 1    9    9 ]
            [ 1    8    2 ]
            [ 1    3    2 ]
            [...  ...  ...]

        In the second case the first column is the
        pseudo explanatory variable for the parameter theta0
        and is necessarily all 1's. If the user provides
        Xm like the first type, this column will be
        automatically added. If you pass Xm with the column
        already added, it has to be column 0.
    y : numpy array
        y is the dependent variable and should have a length
        equal to the number of Xm rows.
    theta : numpy array
        theta is the initial guess for your parameter estimates.
        If your data is n-dimensional, theta should be of
        length n+1 and ordered. For example, if n=2,
        theta = [theta0,theta1,theta2]. I'm being
        redundant at this point, but just to drive home the
        point, the fit function would then be
        y = theta0 + theta1*x1 + theta2*x2.
    T : int
        The maximum number of iterations performed. If
        convergence is not reached by T iterations, the params
        are returned anyway.
    alpha : float
        The step size of the gradient descent. alpha should be
        large enough so that theta traverses the parameter
        landscape sufficiently, but small enough so that the
        cost function J decreases with each iteration.
    conv : float, default = None
        This value defines convergence of the cost function.
        At each iteration t, if cost[t-1]-cost[t] < conv,
        the fit is said to have converged. It is situation
        specific and if not provided, all T iterations
        will be carried out.

    RETURNS
    -------
    cost_history : numpy array
        1D numpy array of length iters, where iters is the
        number of iterations performed. Each cost_history[t]
        is the cost at iteration t.
    theta_history : numpy array
        numpy array of shape (iters,n+1), where iters is the
        number of iterations performed and n is the dimension
        of the data. Each theta_history[t,:] is theta at
        iteration t.
    """
    # append the pseudo-column of 1's if not already present
    if (Xm.ndim == 1):
        pseudo = np.ones(np.shape(Xm)[0])
        Xm = np.vstack((pseudo,Xm.T)).T
    elif not np.all(Xm[:,0] == 1):
        pseudo = np.ones(np.shape(Xm)[0])
        Xm = np.vstack((pseudo,Xm.T)).T
    else:
        pass
    # feature scaling with mean normalization so J is not highly
    # elliptical and is centered about the 0 vector; the means and
    # ranges are stored so theta can be converted back once the fit
    # converges
    ranges = np.zeros(np.shape(Xm)[1]-1)
    means = np.zeros(np.shape(Xm)[1]-1)
    rangey = np.max(y)-np.min(y)
    meany = np.mean(y)
    y = (y-meany)/rangey
    for col in range(len(ranges)):
        ranges[col] = np.max(Xm[:,col+1])-np.min(Xm[:,col+1])
        means[col] = np.mean(Xm[:,col+1])
        Xm[:,col+1] = (Xm[:,col+1]-means[col]) / ranges[col]
    # track the cost and parameters
    cost_history = np.zeros(T)
    theta_history = np.zeros((T,len(theta)))
    # iterate
    for t in range(0,T):
        cost_history[t] = calc_cost(Xm,y,theta)
        theta_history[t,:] = theta
        theta = take_step(Xm,y,theta,alpha)
        # stop if the cost has converged
        if conv is not None:
            if (t > 0) and (cost_history[t-1]-cost_history[t] < conv):
                theta_history[t,:] = theta
                cost_history[t] = calc_cost(Xm,y,theta)
                theta_history = theta_history[cost_history!=0,:]
                cost_history = cost_history[cost_history!=0]
                break
    # reverse the effects of feature scaling and mean normalization
    nasty = np.sum(theta_history[:,1:]*(means/ranges),axis=1)
    theta_history[:,0] = (theta_history[:,0]-nasty)*rangey + meany
    theta_history[:,1:] = rangey*theta_history[:,1:]/ranges
    return cost_history, theta_history
def take_step(Xm,Y,theta,alpha):
    # number of observations
    m = len(Y)
    # hypothesis
    h = np.dot(Xm,theta)
    # gradient of the cost function
    grad = 1./m * np.dot(Xm.T,(h-Y))
    # each step is just -grad * alpha
    step = -grad * alpha
    # return the new theta
    return theta + step

def calc_cost(Xm,y,theta):
    # number of observations
    m = len(y)
    # hypothesis
    h = np.dot(Xm,theta)
    # mean of the squared residuals, divided by 2
    cost = 1./(2*m) * np.sum((h-y)**2)
    return cost
##################################################
# initial guess for theta
theta = np.array([0.5,0.0])
# alpha is the step size
alpha = 1
# max number of iterations
T = 1000
# stop once cost[t-1]-cost[t] falls below this value
conv = 1e-6
cost, theta = iterate(X,Y,theta,T,alpha,conv)
Let’s quickly plot some of the results.
iterations = np.arange(len(cost))
fig, axes = plt.subplots(2, 2, figsize=(12,8))
# iteration vs cost
axes[0,0].plot(iterations,cost)
axes[0,0].set_title("Cost")
# iteration vs theta0
axes[0,1].plot(iterations,theta[:,0])
axes[0,1].set_title(r"$\theta_0$");axes[0,1].set_xlabel("iterations")
# iteration vs theta1
axes[1,0].plot(iterations,theta[:,1])
axes[1,0].set_title(r"$\theta_1$");axes[1,0].set_xlabel("iterations")
# X vs Y with line of best fit
Y_fit = fit(X,theta[-1,0],theta[-1,1])
minmax = np.sort(np.vstack((Y,Y_fit)).T, axis=1)
axes[1,1].scatter(X,Y)
axes[1,1].plot(X,Y_fit,c='k',alpha=0.7)
axes[1,1].vlines(x=X,ymin=minmax[:,0],ymax=minmax[:,1],alpha=0.5,color='r',linestyle='--')
axes[1,1].set_xlabel("salary [$]"); axes[1,1].set_ylabel("happiness")
plt.show()
[figure: 2x2 panel of cost, theta0, and theta1 vs. iteration, plus the data with the fitted line and residuals]
The cost function continually decreases, which means we correctly programmed our hiker to descend the cost-function hill. Since the rate of descent levels off, we know we converged to the minimum of the function, which can also be seen in the convergence of \(\theta_0\) and \(\theta_1\). Indeed, constructing the line of best fit with these parameters yields a line that fits the data well, which, as you will remember, was the whole point.
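If you don’t trust the hiker, you can sanity-check him against numpy’s built-in least-squares fit, which solves the same minimization in closed form; the two should land on (roughly) the same line.

# compare gradient descent against numpy's closed-form least-squares fit
slope, intercept = np.polyfit(X, Y, 1)
print("np.polyfit:        intercept =", intercept, " slope =", slope)
print("gradient descent:  intercept =", theta[-1,0], " slope =", theta[-1,1])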
So what is machine learning about this? Let’s look back to the definition of machine learning:
“A computer program C is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”
Let’s frame this in terms of the hiker.
C \(\rightarrow\) a hiker
E \(\rightarrow\) moving his legs (updating \(\underline{\theta}\))
T \(\rightarrow\) descending the hill (minimizing the cost function)
P \(\rightarrow\) his current altitude (value of cost function at \(\underline{\theta}\))
Then,
“A hiker, whose goal is to descend a hill, is said to learn from moving his legs if his performance at descending the hill, as measured by his current altitude, improves as he moves his legs.”
Linear regression is therefore a machine learning method, which is funny, because machine learning is new jargon, but linear regression is ancient. I’m bored of this now, so this is the end of the tutorial. Bye.
By the way, the above linear regression function generalizes to an n-variate linear regression. Feel free to use it.
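For instance, a hypothetical two-variable fit might look something like the sketch below; hours_slept is made-up data purely to show the calling convention, so don’t read anything into the result.

# made-up second explanatory variable, just to demonstrate the n-variate call
hours_slept = np.random.normal(7, 1, size=m)
# one row per observation, one column per explanatory variable
Xm = np.vstack((X, hours_slept)).T
# initial guess needs n+1 entries: [theta0, theta1, theta2]
theta_init = np.array([0.5, 0.0, 0.0])
cost2, theta2 = iterate(Xm, Y, theta_init, T=1000, alpha=1, conv=1e-6)
# the last row holds the fitted [theta0, theta1, theta2]
print(theta2[-1,:])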