“Spam / Not Spam”
“Tumor / Not Tumor”
We are predicting variable \(y\) where \(y \in {0,1}\) and \({0,1}\) are representing those two possible classifications. \(0\) = negative class, \(1\) = positive class.
Plot example (from video):
ggplot2::qplot(seq_along(c(0,0,0,0,1,1,1,1)), c(0,0,0,0,1,1,1,1), size = I(3.0))
Whereas the linear regression algorithms (\(h_θ(x)\)) returned values either >1 or <0, the logistic regression looks like this: \(0 \leq h_θ(x) \leq 1\) (between 0 and 1). Also this is considered a classification algorithm, even though it’s called regression.
\[h_θ(x) = g(θ^Tx)\]
g(z) is…
\[g(z) = \frac{1}{1+e^{-z}}\]
or (all together)
\[h_θ(x) = \frac{1}{1+e^{-(θ^Tx)}}\]
Apparently that fancy g function is also known as the sigmoid function, also known as the logistic function.
A general sigmoid/logistic function demo:
ggplot2::qplot(seq(-10, 10, by = 0.2), 1/(1+exp(-seq(-10, 10, by = 0.2))))
#formula at x of -10
1/(1+exp(-(-10)))
## [1] 4.539787e-05
#formula at x of 10
1/(1+exp(-(10)))
## [1] 0.9999546
Basically this dawg keeps everything within 0 and 1, to the point of infinity(or, asymptoting) (in terms of x) to either 0 or 1
So what does the output of \(h_θ(x) = \frac{1}{1+e^{θ^Tx}}\) tell us? In the case of our single class classification, it’s the probability that \(y = 1\) (positive) based on the input of \(x\).
Example from video:
\[x = \begin{bmatrix} x_0 \\ x_1 \\ \end{bmatrix} = \begin{bmatrix} 1 \\ tumorSize \\ \end{bmatrix}\]
\[h_θ(x) = 0.7\]
Means that given whatever the value \(tumorSize\) ended up being in the patient’s \(x\) entry, there’s a 70% chance the tumor is malignant given that particular algorithm.
And to put it in probability notation:
\[h_θ(x) = P(y=1 | x; θ)\]
And some given logical facts, based on how probability works:
\[P(y=1 | x; θ) + P(y=0 | x; θ) = 1\]\[P(y=0 | x; θ) = 1-P(y=1 | x; θ)\]
Because the output of the function is a value between 0 and 1, a threshold for the classification needs to be made. In the videos, the \(x = 0, y = 0.5\) marker is the dividing point between a negative and positive classification. So, \(h_θ(x) \geq 0.5\) is \(y=1\) and \(h_θ(x) < 0.5 is y=0\).
A kind of shitty simulation:
ggplot2::qplot(c(sample(seq(0, 3, 0.02), 20), sample(seq(2.5, 5, 0.02), 20)), c(sample(seq(0, 2.3, 0.02), 20), sample(seq(2.5, 5, 0.02), 20))) + ggplot2::geom_abline(intercept = 5, slope = -1)
This would be a plot of the data and boundary line given the following theta values:
\[θ = \begin{bmatrix} -5 \\ 1 \\ 1 \end{bmatrix}\]
So with this we can rewrite the above line as \(-5 + x_1 + x_2 \geq 0\) or \(x_1+x_2 = 5\). Our gradient descent/whatever calculation is going to give us those three theta variables, I think.
This does two things:
So, our theta values are what draws the decision boundary. It is the outcome of our hypothesis function, \(h_θ(x)\). We want to optimize this.
Slightly different from the linear regression cost function, since that wouldn’t produce a convex feature.
\[Cost(h_θ(x),y) = -log(h_θ(x)) if y=1\] \[Cost(h_θ(x),y) = -log(1-h_θ(x)) if y=0\]
All together: \[h_θ(x) = \frac{1}{1+e^{-(θ^Tx)}}\] \[Cost(h_θ(x),y) = -log(\frac{1}{1+e^{-(θ^Tx)}}) if y=1\] \[Cost(h_θ(x),y) = -log(1-(\frac{1}{1+e^{-(θ^Tx)}})) if y=0\]
So, the point of a cost function is to predict the cost. The IF statement in the function is there because they return the opposite/complementary calculation.
#example 1: one step in the function
costFunctionStep1 <- function(x, y) {
if (y == 1) {
-log(x)
} else {
-log(1-(x))
}
}
costFunctionStep1(x = 0.99, y = 1)
## [1] 0.01005034
costFunctionStep1(x = 0.01, y = 0)
## [1] 0.01005034
costFunctionStep1(x = 0, y = 1)
## [1] Inf
costFunctionStep1(x = 1, y = 0)
## [1] Inf
#plotting out y = 1 and a sequence from 0.01 to 0.99 (simulating the h_x output)
ggplot2::qplot(seq(0.01, 0.99, 0.005), costFunctionStep1(seq(0.01, 0.99, 0.005), y = 1))
#plotting out y = 0 and a sequence from 0.01 to 0.99 (simulating the h_x output)
ggplot2::qplot(seq(0.01, 0.99, 0.005), costFunctionStep1(seq(0.01, 0.99, 0.005), y = 0))
With these output demos, we can see that we’ll get a convex shape given our function’s possible outputs.
Likewise we can see the actual cost step is working – if \(y=1\) and \(x=0.9883\) or whatever, the cost is very low because it’s close to \(y\). If you say it’s 1 when it’s 0, and the reverse, then the function returns infinity. IE: you’re very wrong.
A tite way of doing this without an if statement would be this:
costFunctionStep1 <- function(x, y) {-y*log(x)-(1-y)*log(1-x)}
costFunctionStep1(x = 0.99, y = 1)
## [1] 0.01005034
costFunctionStep1(x = 0.01, y = 0)
## [1] 0.01005034
costFunctionStep1(x = 0, y = 1)
## [1] Inf
costFunctionStep1(x = 1, y = 0)
## [1] Inf
Big ole formula:
\[J(θ) = -\frac{1}{m} \lbrack \sum_{i=1}^my^{(i)}log(\frac{1}{1+e^{-(θ^Tx)}})+(1-y^{(i)})log(1-(\frac{1}{1+e^{-(θ^Tx)}})) \rbrack\]
“Principle Maximum Likelihood Estimation” -> A way in statistics to find parameters \(θ\) for different models.
Objective: Find the parameters \(θ\) that minimize the output of the cost function \(J(θ)\).
End Product: A classification algorithm as supplied by the \(\theta\) values. Can predict new values of \(x\).
Some calculus magic was done in the videos to get this formula, ie the partial derivative. Will learn this.
\[\theta_j := \theta_j - \alpha \sum_{i=1}^m ((\frac{1}{1+e^{-(θ^Tx)}})-y^{(i)})x_j^{(i)}\]
This is the structure of our gradient descent algorithm for logistic classification.
The video talks about these as being other possibilites to find the \(min \theta\):
With them the \(\alpha\) is automatically adjusted and it’s usually more efficient. They’re also more complex.
**Doc on doing this in R: https://cran.r-project.org/web/views/Optimization.html**
Two parameters in \(\theta\): \(\theta_1, \theta_2\)
\(J(\theta) = (\theta_1-5)^2+(\theta_2-5)^2\)
Deriv 1: \(2(\theta_1-5)\)
Deriv 2: \(2(\theta_2-5)\)
costFunctionA <- function(theta) {
jVal = (theta[1]-5)^2+(theta[2]-5)^2
# gradient = c(0,0)
# gradient[1] = 2*(theta[1]-5)
# gradient[2] = 2*(theta[2]-5)
}
theta = c(0,0)
optim(theta, costFunctionA, gr = "BFGS", control = list(maxit = 5000))
## $par
## [1] 5.000411 4.999785
##
## $value
## [1] 2.153122e-07
##
## $counts
## function gradient
## 87 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
But somehow I made this work by commenting out the last three lines in the function? No idea.
Come back to this
Video: Basic points on the optimal complexity of an algorithm and how well the prediction fits itself to the training data, VS how well it would operate on new variables.
Takes the large order polynomial that might make a model overfit the data and makes the last \(\theta\) values less significant through squaring/multiplication.
This works on the logic that smaller theta values mean a simpler and more generalized model for the data.
“Housing”: Through a modification of the cost function (adding another calculation to it) we can regularize all features/parameters. This can be added to a formula that’s a high-order polynomial to smooth it out. Example of formula:
\[J(\theta)=\frac{1}{2m} \sum_{i=1}^m (h_{\theta}(x^{(i)})-y^{(i)})^2+\lambda \sum_{i=1}^n \theta_j^2\]
\(\lambda\) = regularization parameter. “Controls the tradeoff between two different goals.” * Setting this to a super huge value (10^10) would make the algorithm underfit the data. A 10^10 value would almost be a flat horizontal line.
Also, \(\theta_0\) is left out of the regularization/not penalized.
\[h_θ(x^{(i)}) =\] whatever it was in the previous week
\[\theta_0 := \theta_0 - \alpha \frac1m \sum_{i=1}^m (h_θ(x^{(i)})-y^{(i)})x_0^{(i)}\]
\[\theta_j := \theta_j(1-\alpha \frac{\lambda}{m}) - \alpha \sum_{i=1}^m (h_θ(x^{(i)})-y^{(i)})x_j^{(i)}\]
\[(j=1,2,3,...,n)\]
Very similar… We’re still leaving \(\theta_0\) alone.
\[\theta_0 := \theta_0 - \alpha \frac1m \sum_{i=1}^m ((\frac{1}{1+e^{-(θ^Tx)}})-y^{(i)})x_0^{(i)}\]
\[\theta_j := \theta_j - \alpha \lbrack \frac1m \sum_{i=1}^m ((\frac{1}{1+e^{-(θ^Tx)}})-y^{(i)})x_j^{(i)} + \frac{\lambda}{m} \theta_j \rbrack\]
\[(j=1,2,3,...,n)\]