1 Single Class Classification Algorithms

1.1 Examples of Single Class

“Spam / Not Spam”

“Tumor / Not Tumor”

We are predicting variable \(y\) where \(y \in {0,1}\) and \({0,1}\) are representing those two possible classifications. \(0\) = negative class, \(1\) = positive class.

Plot example (from video):

ggplot2::qplot(seq_along(c(0,0,0,0,1,1,1,1)), c(0,0,0,0,1,1,1,1), size = I(3.0))

2 Logistic Regression

Whereas the linear regression algorithms (\(h_θ(x)\)) returned values either >1 or <0, the logistic regression looks like this: \(0 \leq h_θ(x) \leq 1\) (between 0 and 1). Also this is considered a classification algorithm, even though it’s called regression.

2.1 How does it work

\[h_θ(x) = g(θ^Tx)\]

g(z) is…

\[g(z) = \frac{1}{1+e^{-z}}\]

or (all together)

\[h_θ(x) = \frac{1}{1+e^{-(θ^Tx)}}\]

Apparently that fancy g function is also known as the sigmoid function, also known as the logistic function.

A general sigmoid/logistic function demo:

ggplot2::qplot(seq(-10, 10, by = 0.2), 1/(1+exp(-seq(-10, 10, by = 0.2))))

#formula at x of -10
1/(1+exp(-(-10)))
## [1] 4.539787e-05
#formula at x of 10
1/(1+exp(-(10)))
## [1] 0.9999546

Basically this dawg keeps everything within 0 and 1, to the point of infinity(or, asymptoting) (in terms of x) to either 0 or 1

2.2 What the output of this thing means

So what does the output of \(h_θ(x) = \frac{1}{1+e^{θ^Tx}}\) tell us? In the case of our single class classification, it’s the probability that \(y = 1\) (positive) based on the input of \(x\).

Example from video:

\[x = \begin{bmatrix} x_0 \\ x_1 \\ \end{bmatrix} = \begin{bmatrix} 1 \\ tumorSize \\ \end{bmatrix}\]

\[h_θ(x) = 0.7\]

Means that given whatever the value \(tumorSize\) ended up being in the patient’s \(x\) entry, there’s a 70% chance the tumor is malignant given that particular algorithm.

And to put it in probability notation:

\[h_θ(x) = P(y=1 | x; θ)\]

“Probability of \(y = 1\) given \(x\), parameterized by theta”

And some given logical facts, based on how probability works:

\[P(y=1 | x; θ) + P(y=0 | x; θ) = 1\]
and

\[P(y=0 | x; θ) = 1-P(y=1 | x; θ)\]

2.3 Decision Boundary

Because the output of the function is a value between 0 and 1, a threshold for the classification needs to be made. In the videos, the \(x = 0, y = 0.5\) marker is the dividing point between a negative and positive classification. So, \(h_θ(x) \geq 0.5\) is \(y=1\) and \(h_θ(x) < 0.5 is y=0\).

2.3.1 Drawing the Boundary

A kind of shitty simulation:

ggplot2::qplot(c(sample(seq(0, 3, 0.02), 20), sample(seq(2.5, 5, 0.02), 20)), c(sample(seq(0, 2.3, 0.02), 20), sample(seq(2.5, 5, 0.02), 20))) + ggplot2::geom_abline(intercept = 5, slope = -1)

This would be a plot of the data and boundary line given the following theta values:

\[θ = \begin{bmatrix} -5 \\ 1 \\ 1 \end{bmatrix}\]

So with this we can rewrite the above line as \(-5 + x_1 + x_2 \geq 0\) or \(x_1+x_2 = 5\). Our gradient descent/whatever calculation is going to give us those three theta variables, I think.

This does two things:

  1. It means the line is straight, since that’s a straight line formula. Since this is a basic classification algorithm between two classes, that’s what we want.
  2. We predict \(y=1\) if \(x_1 + x_2 \geq 5\) and \(y=0\) if the opposite.

So, our theta values are what draws the decision boundary. It is the outcome of our hypothesis function, \(h_θ(x)\). We want to optimize this.

3 Cost Function

Slightly different from the linear regression cost function, since that wouldn’t produce a convex feature.

\[Cost(h_θ(x),y) = -log(h_θ(x)) if y=1\] \[Cost(h_θ(x),y) = -log(1-h_θ(x)) if y=0\]

All together: \[h_θ(x) = \frac{1}{1+e^{-(θ^Tx)}}\] \[Cost(h_θ(x),y) = -log(\frac{1}{1+e^{-(θ^Tx)}}) if y=1\] \[Cost(h_θ(x),y) = -log(1-(\frac{1}{1+e^{-(θ^Tx)}})) if y=0\]

3.1 Some Intuition

So, the point of a cost function is to predict the cost. The IF statement in the function is there because they return the opposite/complementary calculation.

#example 1: one step in the function
costFunctionStep1 <- function(x, y) {
  if (y == 1) {
    -log(x)
  } else {
    -log(1-(x))
  }
}
costFunctionStep1(x = 0.99, y = 1)
## [1] 0.01005034
costFunctionStep1(x = 0.01, y = 0)
## [1] 0.01005034
costFunctionStep1(x = 0, y = 1)
## [1] Inf
costFunctionStep1(x = 1, y = 0)
## [1] Inf
#plotting out y = 1 and a sequence from 0.01 to 0.99 (simulating the h_x output)
ggplot2::qplot(seq(0.01, 0.99, 0.005), costFunctionStep1(seq(0.01, 0.99, 0.005), y = 1))

#plotting out y = 0 and a sequence from 0.01 to 0.99 (simulating the h_x output)
ggplot2::qplot(seq(0.01, 0.99, 0.005), costFunctionStep1(seq(0.01, 0.99, 0.005), y = 0))

With these output demos, we can see that we’ll get a convex shape given our function’s possible outputs.

Likewise we can see the actual cost step is working – if \(y=1\) and \(x=0.9883\) or whatever, the cost is very low because it’s close to \(y\). If you say it’s 1 when it’s 0, and the reverse, then the function returns infinity. IE: you’re very wrong.

3.2 Simplified Cost Function

A tite way of doing this without an if statement would be this:

costFunctionStep1 <- function(x, y) {-y*log(x)-(1-y)*log(1-x)}
costFunctionStep1(x = 0.99, y = 1)
## [1] 0.01005034
costFunctionStep1(x = 0.01, y = 0)
## [1] 0.01005034
costFunctionStep1(x = 0, y = 1)
## [1] Inf
costFunctionStep1(x = 1, y = 0)
## [1] Inf

4 Logistic Regression Cost Function (Nearly Everything)

Big ole formula:

\[J(θ) = -\frac{1}{m} \lbrack \sum_{i=1}^my^{(i)}log(\frac{1}{1+e^{-(θ^Tx)}})+(1-y^{(i)})log(1-(\frac{1}{1+e^{-(θ^Tx)}})) \rbrack\]

“Principle Maximum Likelihood Estimation” -> A way in statistics to find parameters \(θ\) for different models.

4.1 Next/Final Steps

Objective: Find the parameters \(θ\) that minimize the output of the cost function \(J(θ)\).

End Product: A classification algorithm as supplied by the \(\theta\) values. Can predict new values of \(x\).

5 Gradient Descent

Some calculus magic was done in the videos to get this formula, ie the partial derivative. Will learn this.

\[\theta_j := \theta_j - \alpha \sum_{i=1}^m ((\frac{1}{1+e^{-(θ^Tx)}})-y^{(i)})x_j^{(i)}\]

This is the structure of our gradient descent algorithm for logistic classification.

5.1 Alternative Algorithms

The video talks about these as being other possibilites to find the \(min \theta\):

  1. Conjugate Gradient
  2. BFGS
  3. L-BFGS

With them the \(\alpha\) is automatically adjusted and it’s usually more efficient. They’re also more complex.

**Doc on doing this in R: https://cran.r-project.org/web/views/Optimization.html**

5.1.1 Example of these in R

Two parameters in \(\theta\): \(\theta_1, \theta_2\)

\(J(\theta) = (\theta_1-5)^2+(\theta_2-5)^2\)

Deriv 1: \(2(\theta_1-5)\)

Deriv 2: \(2(\theta_2-5)\)

costFunctionA <- function(theta) {
  jVal = (theta[1]-5)^2+(theta[2]-5)^2
#  gradient = c(0,0)
#  gradient[1] = 2*(theta[1]-5)
#  gradient[2] = 2*(theta[2]-5)
}
theta = c(0,0)
optim(theta, costFunctionA, gr = "BFGS", control = list(maxit = 5000))
## $par
## [1] 5.000411 4.999785
## 
## $value
## [1] 2.153122e-07
## 
## $counts
## function gradient 
##       87       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

But somehow I made this work by commenting out the last three lines in the function? No idea.

Come back to this

6 Determining the Fit

Video: Basic points on the optimal complexity of an algorithm and how well the prediction fits itself to the training data, VS how well it would operate on new variables.

6.1 Addressing Overfitting

  1. Reduce the number of features used in the prediction algorithm.
    1. Manually pick whatever features you want to keep/throw out
    2. Have an algorithm pick the best model.
  2. Regularization
    1. You can keep all of the features, but reduce the magnitude or values(?) of the \(\theta\) parameters.
    2. This method works well if you have a lot of features predicting \(y\).

6.2 Regularization

Takes the large order polynomial that might make a model overfit the data and makes the last \(\theta\) values less significant through squaring/multiplication.

This works on the logic that smaller theta values mean a simpler and more generalized model for the data.

“Housing”: Through a modification of the cost function (adding another calculation to it) we can regularize all features/parameters. This can be added to a formula that’s a high-order polynomial to smooth it out. Example of formula:

\[J(\theta)=\frac{1}{2m} \sum_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda \sum_{i=1}^n \theta_j^2\]

\(\lambda\) = regularization parameter. “Controls the tradeoff between two different goals.” * Setting this to a super huge value (10^10) would make the algorithm underfit the data. A 10^10 value would almost be a flat horizontal line.

Also, \(\theta_0\) is left out of the regularization/not penalized.

6.3 Regularized Linear Regression (GD)

\[h_θ(x^{(i)}) =\] whatever it was in the previous week

\[\theta_0 := \theta_0 - \alpha \frac1m \sum_{i=1}^m (h_θ(x^{(i)})-y^{(i)})x_0^{(i)}\]

\[\theta_j := \theta_j(1-\alpha \frac{\lambda}{m}) - \alpha \sum_{i=1}^m (h_θ(x^{(i)})-y^{(i)})x_j^{(i)}\]

\[(j=1,2,3,...,n)\]

6.4 Regularized Logistic Regression (GD)

Very similar… We’re still leaving \(\theta_0\) alone.

\[\theta_0 := \theta_0 - \alpha \frac1m \sum_{i=1}^m ((\frac{1}{1+e^{-(θ^Tx)}})-y^{(i)})x_0^{(i)}\]

\[\theta_j := \theta_j - \alpha \lbrack \frac1m \sum_{i=1}^m ((\frac{1}{1+e^{-(θ^Tx)}})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m} \theta_j \rbrack\]

\[(j=1,2,3,...,n)\]

7 Assignment