# Load the packages used in this chapter
suppressMessages(library(readr))
suppressMessages(library(plotly))
suppressMessages(library(DT))
This chapter introduces logistic regression as a single-layer network. Whereas the preceding chapters considered a continuous numerical variable as the output, this chapter looks at a categorical outcome.
Machine learning problems that model categorical outcomes are called classification problems, as opposed to the regression problems of the preceding chapters. The categorical target variable is expressed as natural numbers, \(\mathbb{N} = \left\{0,1,2,3, \ldots \right\}\). When there are only two elements in the sample space of the target variable, such as yes and no, has disease and does not have disease, benign and malignant, or fraudulent transaction and legitimate transaction, these are coded as \(0\) and \(1\). Note that it is possible to create models that predict more than two outcomes.
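As a short illustration (using a hypothetical vector of outcomes, not the data analyzed later in this chapter), such a dichotomous variable can be recoded as \(0\) and \(1\) in R.

# Hypothetical vector of dichotomous outcomes
outcome <- c("No", "Yes", "Yes", "No", "Yes")

# Recode as 0 (No) and 1 (Yes)
y_coded <- ifelse(outcome == "Yes", 1L, 0L)
y_coded

## [1] 0 1 1 0 1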
Logistic regression is a well-known statistical tool for classification problems. It can be expressed as a single-layer neural network (of sorts).
Logistic regression, in its simple form, is still a linear model, similar to those in the preceding chapters. Therefore, it still requires solutions for the parameters, \(\beta_0, \beta_1, \ldots\), such that a cost function is minimized.
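For logistic regression this cost function is the negative log-likelihood, also known as binary cross-entropy. A minimal sketch of that cost as an R function is shown below (the function and argument names are illustrative, not part of any package).

# Binary cross-entropy (negative log-likelihood) for true labels y
# (each 0 or 1) and predicted probabilities y_hat
cross_entropy <- function(y, y_hat) {
  -mean(y * log(y_hat) + (1 - y) * log(1 - y_hat))
}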
Considering only dichotomous (binary) target variables, the problem lies in expressing the predicted value, \(\hat{y}\), in terms of the sample space of the target variable, i.e. \(0\) and \(1\). The solution is the sigmoid function, shown in equation (1).
\[\sigma \left( z \right) = \frac{1}{1+e^{-z}} \tag{1}\]
A graph of the sigmoid function is shown below. Note that irrespective of the input value, \(z\), the output will always be between \(0\) and \(1\).
# Compute the sigmoid function over a range of input values
x <- seq(-5, 5, 0.01)
y <- 1 / (1 + exp(-x))

# Plot the sigmoid curve
p <- plot_ly(x = x,
             y = y,
             name = "Sigmoid function",
             type = "scatter",
             mode = "lines") %>%
  layout(title = "Sigmoid function")
p
Considering then a classification problem with four feature variables and a binary target variable, \(z\) can be expressed as shown in equation (2).
\[z \left( \beta_0 , \beta_1 , \beta_2 , \beta_3 , \beta_4 \right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 \tag{2} \]
Inserting equation (2) into the sigmoid function yields equation (3).
\[\sigma \left( z \right) = \frac{1}{1+e^{- \left( \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 \right)}} \tag{3}\]
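Equation (3) translates directly into R code. The sketch below assumes a vector x of four feature values and a vector beta of five parameter values (both names are illustrative).

# Sigmoid of an input value z, as in equation (1)
sigmoid <- function(z) {
  1 / (1 + exp(-z))
}

# Predicted probability for one observation, as in equation (3)
predict_prob <- function(beta, x) {
  z <- beta[1] + sum(beta[-1] * x)  # beta_0 + beta_1 * x_1 + ... + beta_4 * x_4
  sigmoid(z)
}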
The sigmoid function is inserted at the end of the single-layer network, acting on the weighted sum of the input values.
In the example below, a CSV file is imported, containing four feature variables and a binary target variable.
# Import the data set
df <- read_csv("LogisticRegression.csv")
## Parsed with column specification:
## cols(
##   x1 = col_double(),
##   x2 = col_double(),
##   x3 = col_double(),
##   x4 = col_double(),
##   y = col_integer()
## )
# Display the data as an interactive table
datatable(df)
The glm() function in R can be used to fit a logistic regression model.
# Fit a logistic regression model of y on all four feature variables
model <- glm(y ~ .,
             family = binomial(link = "logit"),
             data = df)
summary(model)
##
## Call:
## glm(formula = y ~ ., family = binomial(link = "logit"), data = df)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -2.05684  -0.04493   0.00123   0.02374   2.28103
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.84361    5.75109  -2.407  0.01608 *
## x1           -0.66990    0.21833  -3.068  0.00215 **
## x2            0.10644    0.05027   2.117  0.03422 *
## x3            2.65496    0.64941   4.088 4.35e-05 ***
## x4            0.12111    0.04225   2.867  0.00415 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 203.415  on 149  degrees of freedom
## Residual deviance:  37.006  on 145  degrees of freedom
## AIC: 47.006
##
## Number of Fisher Scoring iterations: 9
The summary shows the estimated parameter values (listed in the Estimate column). The (Intercept) value is for \(\beta_0\) and the values for \(\beta_1 \ldots \beta_4\) follow. These values, together with the feature values of the first row of data above, can be plugged into equation (3), resulting in a predicted value of 0.6197766.
# Manual computation of the predicted probability for the first row
1 / (1 + exp(-(-13.84361 + (-0.6699 * 15.5) + 0.10644 * 110 + 2.65496 * 2.5 + 0.1211 * 52.6)))
## [1] 0.6197766
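The same probability can be computed without typing the coefficients by hand. In the sketch below, coef() extracts the estimates and predict() returns the predicted probability for the first row of the data (assuming, as above, that the first row holds the feature values used in the manual calculation).

# Extract the estimated parameter values
coef(model)

# Predicted probability for the first row of the data
predict(model, newdata = df[1, ], type = "response")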
In effect, this expresses the predicted value as a probability between \(0\) and \(1\). A value of \(0.5\) can be chosen as an arbitrary cut-off: for a predicted value at or above \(0.5\) the model will predict \(1\), and for a value below \(0.5\) it will predict \(0\).
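Applying this cut-off in R takes only a few lines. The sketch below computes the predicted probabilities for all of the observations and tabulates the resulting class predictions against the actual values (a simple confusion matrix).

# Predicted probabilities for all observations in the data
prob <- predict(model, type = "response")

# Apply the 0.5 cut-off to obtain class predictions
pred <- ifelse(prob >= 0.5, 1, 0)

# Confusion matrix of predicted versus actual classes
table(predicted = pred, actual = df$y)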
This chapter introduced the last piece of the puzzle required before moving on to proper neural networks. The sigmoid function is one of several functions that can be used in a deep neural network. These are collectively known as activation functions.