Ziwei Ma
6/15/2021
Learning, like intelligence, covers such a broad range of processes that it is difficult to define precisely.
Very broadly, we say that a machine learns whenever it changes its structure, program, or data (based on its inputs or in response to external information) in such a manner that its expected future performance improves.
Machine Learning usually refers to the changes in systems that perform tasks associated with artificial intelligence (AI).
Some reasons why machine learning is important:
• Some tasks cannot be defined well except by example; that is, we might be able to specify input/output pairs but not a concise relationship between inputs and desired outputs.
• It is possible that hidden among large piles of data are important relationships and correlations. Machine learning methods can often be used to extract these relationships (data mining).
• Human designers often produce machines that do not work as well as desired in the environments in which they are used.
• The amount of knowledge available about certain tasks might be too large for explicit encoding by humans.
• Environments change over time.
• New knowledge about tasks is constantly being discovered by humans.
Machine learning draws on ideas from several disciplines:
• Statistics
• Brain Models
• Adaptive Control Theory
• Psychological Models
• Artificial Intelligence
• Evolutionary Models
Among the kinds of structures a machine can learn are:
• Functions
• Logic programs and rule sets
• Finite-state machines
• Grammars
• Problem-solving systems
Supervised learning: the model is trained on labeled examples, learning the relationship between features and labels so it can predict labels for new examples.
Unsupervised learning: the model is given unlabeled examples and must find structure in the features on its own (for example, clusters).
Label and Feature
A label is the thing we are predicting: the \(y\) variable. A feature is an input variable: the \(x\) variable (a project may use a single feature or many features).
Example
An example is a particular instance of data, x. We break examples into two categories:
• labeled examples, which contain both feature(s) and the label, and are used to train the model;
• unlabeled examples, which contain feature(s) but no label, and are what the trained model makes predictions on.
Model
A model defines the relationship between features and label. Let’s highlight two phases of a model’s life:
Training means creating or learning the model. That is, you show the model labeled examples and enable the model to gradually learn the relationships between features and label.
Inference means applying the trained model to unlabeled examples. That is, you use the trained model to make useful predictions (y’).
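As a rough illustrative sketch of these two phases (not part of the original notes), the code below uses scikit-learn's LinearRegression and reuses a few rows from the housing table further down as toy data; both choices are assumptions made only for demonstration:

```python
# Minimal sketch of the training and inference phases.
# scikit-learn and the toy values are illustrative assumptions.
from sklearn.linear_model import LinearRegression

# Training: labeled examples (features X, labels y).
X_train = [[15, 5612, 1283], [19, 7650, 1901], [17, 720, 174]]
y_train = [66900, 80100, 85700]

model = LinearRegression()
model.fit(X_train, y_train)      # the model learns weights from labeled examples

# Inference: apply the trained model to unlabeled examples.
X_new = [[42, 1686, 361], [34, 1226, 180]]
y_pred = model.predict(X_new)    # predictions y' for previously unseen examples
print(y_pred)
```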
Regression vs. classification
A regression model predicts continuous values. For example, regression models make predictions that answer questions like the following:
• What is the value of a house in California?
• What is the probability that a user will click on this ad?
A classification model predicts discrete values. For example, classification models make predictions that answer questions like the following:
• Is a given email message spam or not spam?
• Is this an image of a dog, a cat, or a hamster?
For example, the following table shows 5 labeled examples from a data set containing information about housing prices in California:
| housingMedianAge (feature) | totalRooms (feature) | totalBedrooms (feature) | medianHouseValue (label) |
|---|---|---|---|
| 15 | 5612 | 1283 | 66900 |
| 19 | 7650 | 1901 | 80100 |
| 17 | 720 | 174 | 85700 |
| 14 | 1501 | 337 | 73400 |
| 20 | 1454 | 326 | 65500 |
After training, the model is used for inference on unlabeled examples such as the following 3, which contain the features but no medianHouseValue label:

| housingMedianAge (feature) | totalRooms (feature) | totalBedrooms (feature) |
|---|---|---|
| 42 | 1686 | 361 |
| 34 | 1226 | 180 |
| 33 | 1077 | 271 |
Background
It has long been known that crickets (an insect species) chirp more frequently on hotter days than on cooler days. For decades, professional and amateur scientists have cataloged data on chirps-per-minute and temperature. As a birthday gift, your Aunt Ruth gives you her cricket database and asks you to learn a model to predict this relationship. Using this data, you want to explore this relationship.
Using the equation for a line, you could write down this relationship as follows: \[y=mx+b\] where:
- \(y\) is the temperature in Celsius—the value we’re trying to predict.
- \(m\) is the slope of the line.
- \(x\) is the number of chirps per minute—the value of our input feature.
- \(b\) is the y-intercept.
By convention in machine learning, you’ll write the equation for a model slightly differently: \[y'=b+w_1 x_1\] where:
- \(y'\) is the temperature in Celsius—the value we’re trying to predict.
- \(w_1\) is the weight of feature 1. Weight is the same concept as the “slope” \(m\) in the traditional equation of a line.
- \(x_1\) is a feature (a known input).
- \(b\) is the bias (the y-intercept), sometimes referred to as \(w_0\).
Although this model uses only one feature, a more sophisticated model might rely on multiple features, each having a separate weight (\(w_1\), \(w_2\), etc.). For example, a model that relies on three features might look as follows: \[y'=b+w_1 x_1+w_2 x_2+w_3 x_3\]
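A minimal Python sketch of this linear model follows; NumPy, the function name, and the example weight and bias values are my own illustrative assumptions:

```python
import numpy as np

def predict(x, w, b):
    """Linear model y' = b + w_1*x_1 + ... + w_n*x_n."""
    return b + np.dot(w, x)

# One-feature cricket example: chirps per minute -> temperature in Celsius.
w = np.array([0.3])   # weight w_1 (an assumed value)
b = 10.0              # bias (an assumed value)
print(predict(np.array([40.0]), w, b))   # predicted temperature for 40 chirps/min
```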
The linear regression models we’ll examine here use a loss function called squared loss (also known as \(L_2\) loss).
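For reference, the squared loss for a single example is the square of the difference between the label and the prediction: \[L_2 = (y - y')^2\] Averaging this loss over all \(N\) examples in a data set \(D\) gives the mean squared error (MSE): \[MSE = \frac{1}{N}\sum_{(x,y)\in D}\left(y - y'\right)^2\]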
Remark: Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances.
In ML, the most commonly used methods for reducing loss proceed iteratively.
Iterative learning might remind you of the “Hot and Cold” kid’s game for finding a hidden object like a thimble. In this game, the “hidden object” is the best possible model. You’ll start with a wild guess (“the value of \(w_1\) is 0”) and wait for the system to tell you what the loss is. Then, you’ll try another guess (“the value of \(w_1\) is 0.5”) and see what the loss is. Aah, you’re getting warmer. Actually, if you play this game right, you’ll usually be getting warmer. The real trick to the game is trying to find the best possible model as efficiently as possible.
Suppose we had the time and the computing resources to calculate the loss for all possible values of \(w_1\). For the kind of regression problems we’ve been examining, the resulting plot of loss vs. \(w_1\) will always be convex. In other words, the plot will always be bowl-shaped, kind of like this:
Convex problems have only one minimum; that is, only one place where the slope is exactly 0. That minimum is where the loss function converges.
Calculating the loss function for every conceivable value of \(w_1\) over the entire data set would be an inefficient way of finding the convergence point. Let’s examine a better mechanism—very popular in machine learning—called gradient descent.
Note: a gradient is a vector, so it has both a direction and a magnitude.
Note: The gradient always points in the direction of steepest increase in the loss function. The gradient descent algorithm takes a step in the direction of the negative gradient in order to reduce loss as quickly as possible.
As noted, the gradient vector has both a direction and a magnitude. Gradient descent algorithms multiply the gradient by a scalar known as the learning rate (also sometimes called step size) to determine the next point. For example, if the gradient magnitude is 2.5 and the learning rate is 0.01, then the gradient descent algorithm will pick the next point 0.025 away from the previous point.
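To make the procedure concrete, here is a minimal gradient descent sketch for the one-feature squared-loss model above; the synthetic data, starting point, learning rate, and number of steps are all assumptions made for the demo:

```python
import numpy as np

# Synthetic toy data generated from y = 2x + 1 (an assumption for this demo only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

w1, b = 0.0, 0.0        # a wild starting guess for the weight and bias
learning_rate = 0.1     # the step-size hyperparameter (an assumed value)

for step in range(1000):
    y_pred = b + w1 * x                 # the model's predictions y'
    error = y_pred - y
    grad_w1 = 2 * np.mean(error * x)    # dMSE/dw1
    grad_b = 2 * np.mean(error)         # dMSE/db
    w1 -= learning_rate * grad_w1       # step along the negative gradient
    b -= learning_rate * grad_b

print(w1, b)   # should approach 2.0 and 1.0
```

If the assumed learning_rate is set too large here, the updates overshoot and the loss diverges; too small and convergence is needlessly slow, which is exactly the tuning problem described next.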
Hyperparameters are the knobs that programmers tweak in machine learning algorithms. Most machine learning programmers spend a fair amount of time tuning the learning rate.
When the loss function is NOT convex (as for complicated models such as neural networks), gradient descent depends strongly on the initial values of the weights and may converge only to a local minimum.
High loss on the training data: the model fits the data poorly (underfitting).
Low loss on the training data, but still a bad model because it predicts new data poorly (overfitting).
In modern times, we’ve formalized Ockham’s razor into the fields of statistical learning theory and computational learning theory. These fields have developed generalization bounds, a statistical description of a model’s ability to generalize to new data based on factors such as:
• the complexity of the model
• the model’s performance on training data
A machine learning model aims to make good predictions on new, previously unseen data. But if you are building a model from your data set, how would you get the previously unseen data? Well, one way is to divide your data set into two subsets:
• a training set, the subset used to train the model;
• a test set, the subset used to evaluate the trained model.
You could imagine slicing the single data set as follows:
Make sure that your test set meets the following two conditions:
• it is large enough to yield statistically meaningful results;
• it is representative of the data set as a whole (don’t pick a test set with different characteristics than the training set).
This partitioning enables you to train on one set of examples and then to test the model against a different set of examples. With two partitions, the workflow could look as follows: train the model on the training set, evaluate it on the test set, tweak the model according to the results on the test set, and repeat.
You can greatly reduce your chances of overfitting by partitioning the data set into three subsets: a training set, a validation set, and a test set. You use the validation set to evaluate and tune the model, and reserve the test set for a final check after the model has “passed” the validation set.
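One common way to produce such a split in practice is to split twice; the 60/20/20 proportions and the use of scikit-learn’s train_test_split are illustrative assumptions, not something prescribed by these notes:

```python
# Sketch: split a data set into train / validation / test subsets.
from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(1000).reshape(-1, 1)   # toy features
y = np.arange(1000)                  # toy labels

# First carve off the test set, then split the remainder into train and validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))   # 600 200 200
```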
Many machine learning models must represent the features as real-numbered vectors since the feature values must be multiplied by the model weights.
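As one illustration of turning a non-numeric feature into a real-valued vector, a categorical string feature can be one-hot encoded; the vocabulary of street names and the helper function below are invented purely for this example:

```python
# Sketch: one-hot encoding a categorical feature so it can be multiplied by weights.
# The vocabulary and example value are made up for illustration.
vocab = ["Charleston Road", "Shorebird Way", "Rengstorff Avenue"]

def one_hot(value, vocab):
    """Map a categorical value to a real-valued vector with a 1.0 in its slot."""
    vec = [0.0] * len(vocab)
    if value in vocab:
        vec[vocab.index(value)] = 1.0
    return vec

print(one_hot("Shorebird Way", vocab))   # [0.0, 1.0, 0.0]
```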