Student Information: - Jonah Berson 0799906
We explored Linear Regression and the Logistic (Logit) Model. As we move into Neural Networks, it is crucial to understand how these models are related and why neural networks require specific mathematical functions (such as the sigmoid function) to succeed.
At their core, individual neurons in a neural network are mathematically identical to the models you already know.
If a single neuron is just logistic regression, why build a “network” with hidden layers?
Hidden layers allow the model to learn intermediate, complex representations of the data by routing signals through interconnected nodes.
Below is an illustration of a standard, albeit small, neural network. It takes two inputs, passes them through a hidden layer containing two nodes, and aggregates the result into a single output prediction.
# Generates a standard 2x2x1 neural network diagram
library(ggplot2)  # assumed dependency; harmless if already loaded earlier in the document

nodes_net <- data.frame(
  x = c(1, 1, 2, 2, 3),
  y = c(0.5, -0.5, 0.5, -0.5, 0),
  label = c("Input\n(X1)", "Input\n(X2)", "Hidden\n(H1)", "Hidden\n(H2)", "Output\n(Y)")
)

edges_net <- data.frame(
  x = c(1, 1, 1, 1, 2, 2),
  y = c(0.5, 0.5, -0.5, -0.5, 0.5, -0.5),
  xend = c(2, 2, 2, 2, 3, 3),
  yend = c(0.5, -0.5, 0.5, -0.5, 0, 0)
)

ggplot() +
  geom_segment(data = edges_net, aes(x = x, y = y, xend = xend, yend = yend),
               color = "gray60", size = 0.8) +
  geom_point(data = nodes_net, aes(x = x, y = y, color = factor(x)), size = 24) +
  geom_text(data = nodes_net, aes(x = x, y = y, label = label), size = 3.5, fontface = "bold", color = "black") +
  theme_void() +
  theme(legend.position = "none") +
  scale_color_manual(values = c("1" = "#AEC6CF", "2" = "#B39EB5", "3" = "#FFB347")) +
  xlim(0.5, 3.5) + ylim(-1, 1) +
  labs(title = "Standard Interconnected Architecture (2 Inputs, 2 Hidden, 1 Output)")
However, if we do not use a non-linear activation function (like the sigmoid) at the hidden nodes, this entire network collapses mathematically back into a single linear regression model, rendering the hidden layer entirely useless.
The Mathematical Proof:
To prove this collapse algebraically, let’s isolate a single pathway
through the network—simplifying the architecture to just 1 input, 1
hidden node, and 1 output node.
# Generates a simplified 1-to-1 architecture diagram matching the mathematical proof
nodes_h <- data.frame(
  x = c(1, 2, 3),
  y = c(0, 0, 0),
  label = c("Input\n(X)", "Hidden Layer\n(H)", "Output\n(Y)")
)

edges_h <- data.frame(
  x = c(1, 2),
  y = c(0, 0),
  xend = c(2, 3),
  yend = c(0, 0),
  weight = c("W1", "W2")
)

ggplot() +
  geom_segment(data = edges_h, aes(x = x, y = y, xend = xend, yend = yend),
               arrow = arrow(length = unit(0.4, "cm")), color = "gray60", size = 1.2) +
  geom_text(data = edges_h, aes(x = (x + xend)/2, y = 0.15, label = weight), fontface = "italic", size = 5) +
  geom_point(data = nodes_h, aes(x = x, y = y, color = factor(x)), size = 28) +
  geom_text(data = nodes_h, aes(x = x, y = y, label = label), size = 3.5, fontface = "bold", color = "black") +
  theme_void() +
  theme(legend.position = "none") +
  scale_color_manual(values = c("1" = "#AEC6CF", "2" = "#B39EB5", "3" = "#FFB347")) +
  xlim(0.5, 3.5) + ylim(-0.5, 0.5) +
  labs(title = "Simplified Architecture: The Linear Collapse Proof")
Imagine this simplified 2-layer network does not use activation functions. The hidden node then computes \(H = W_1 \cdot X + b_1\) and the output node computes \(Y = W_2 \cdot H + b_2\). Substituting the hidden-layer equation into the output equation gives:
\[Y = W_2 \cdot (W_1 \cdot X + b_1) + b_2\]
\[Y = (W_2 \cdot W_1) \cdot X + (W_2 \cdot b_1 + b_2)\]
Let \(W_{new} = W_2 \cdot W_1\) and \(b_{new} = W_2 \cdot b_1 + b_2\).
The equation simplifies to: \[Y = W_{new} \cdot X + b_{new}\]
No matter how many nodes or layers you stack, purely linear transformations algebraically combine into a single linear function. Non-linear activation functions are strictly necessary to prevent this collapse.
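As a quick numeric illustration of this collapse (the weights and biases below are arbitrary values chosen purely for demonstration), we can verify in R that two stacked linear layers produce exactly the same predictions as the single collapsed linear function:
# Arbitrary example parameters (illustrative only)
W1 <- 0.8;  b1 <- -0.3
W2 <- -1.5; b2 <- 0.6
x_demo <- seq(-3, 3, length.out = 5)

# Two stacked linear layers, no activation function
h_demo <- W1 * x_demo + b1
y_stacked <- W2 * h_demo + b2

# Single collapsed linear function
W_new <- W2 * W1
b_new <- W2 * b1 + b2
y_collapsed <- W_new * x_demo + b_new

all.equal(y_stacked, y_collapsed)  # TRUE: the stack is just one linear model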
The mathematical formulation of the sigmoid function is:
\[\sigma(x) = \frac{1}{1 + e^{-x}}\]
Write an R function named sigmoid that takes a numeric vector x as input and returns the sigmoid transformation.
sigmoid <- function(x) {
  # Squash any real-valued input into the (0, 1) interval
  1 / (1 + exp(-x))
}
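A quick sanity check (not part of the prompt, simply verifying the definition above) confirms the expected behaviour at zero and at the extremes:
# sigmoid(0) should be exactly 0.5; large-magnitude inputs approach 0 and 1
sigmoid(0)
sigmoid(c(-10, 10))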
Generate a sequence of 200 numbers ranging from -10 to 10. Apply your sigmoid function and plot the result using ggplot2 to observe the characteristic S-curve.
x_values <- seq(-10, 10, length.out = 200)
y_values <- sigmoid(x_values)
plot_data <- data.frame(x = x_values, y = y_values)
ggplot(plot_data, aes(x = x, y = y)) +
  geom_line(color = "#EA4335", size = 1.2) +
  labs(title = "The Sigmoid Activation Function", x = "Z (Linear Output)", y = "Probability (A)") +
  theme_minimal() +
  geom_hline(yintercept = c(0, 0.5, 1), linetype = "dashed", color = "gray")
Question 1 (The Vanishing Gradient Problem): Based on your plot, what happens to the slope (gradient) of the curve as \(x\) becomes very large (e.g., 10) or very small (e.g., -10)? Why is this detrimental to training deep neural networks, and how do modern architectures overcome it?
As \(x\) reaches extreme positive or negative values, the curve becomes nearly flat, meaning the gradient approaches zero. Deep networks learn by passing gradients backward through the layers (backpropagation) using the chain rule. If the local gradients are near zero, multiplying them across many layers causes the overall gradient to shrink exponentially until it effectively “vanishes.” When gradients vanish, the early layers stop updating their weights and the model fails to learn. Modern architectures mitigate this by using the ReLU (Rectified Linear Unit) activation function, \(f(x) = \max(0, x)\), which has a constant gradient of 1 for all positive inputs and therefore does not saturate in that region.
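To see this saturation numerically, we can evaluate the sigmoid’s derivative, \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\), at a few points; this small check simply reuses the sigmoid function defined above:
# Gradient of the sigmoid: sigma'(x) = sigma(x) * (1 - sigma(x))
sigmoid_grad <- function(x) sigmoid(x) * (1 - sigmoid(x))
sigmoid_grad(c(-10, 0, 10))
# Roughly 4.5e-05, 0.25, 4.5e-05: the gradient is largest at 0 and essentially
# zero at the extremes, which is exactly the vanishing-gradient behaviour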
To understand why we need neural networks, we must look at data that a simple Logit model cannot handle. Run the code below to generate a “bullseye” dataset—a cluster of points surrounded by a ring of different points.
# Generate Concentric Circles (Bullseye Dataset)
set.seed(42)
n <- 200
# Inner circle (Class 1)
r_in <- runif(n/2, 0, 1.5)
theta_in <- runif(n/2, 0, 2*pi)
# Outer ring (Class 0)
r_out <- runif(n/2, 2.5, 4)
theta_out <- runif(n/2, 0, 2*pi)
dataset <- data.frame(
  X1 = c(r_in * cos(theta_in), r_out * cos(theta_out)),
  X2 = c(r_in * sin(theta_in), r_out * sin(theta_out)),
  Y = c(rep(1, n/2), rep(0, n/2))
)

ggplot(dataset, aes(x = X1, y = X2, color = as.factor(Y))) +
  geom_point(size = 2) +
  labs(title = "Complex Non-Linear Dataset (Bullseye)", color = "Class (Y)") +
  theme_minimal()
Fit a standard Logistic Regression model predicting Y using X1 and X2. Calculate its accuracy.
# Solution: Fit Logit Model
logit_model <- glm(Y ~ X1 + X2, data = dataset, family = binomial)
# Calculate Accuracy
logit_probs <- predict(logit_model, type = "response")
logit_preds <- ifelse(logit_probs >= 0.5, 1, 0)
logit_accuracy <- mean(logit_preds == dataset$Y)
cat("Logistic Regression Accuracy: ", logit_accuracy * 100, "%\n")
## Logistic Regression Accuracy: 55.5 %
We have generated a background grid for you. Use your logit_model to predict the class for the entire grid, and plot the decision boundary.
# Create a grid across the feature space
grid <- expand.grid(X1 = seq(-4.5, 4.5, length.out = 100),
                    X2 = seq(-4.5, 4.5, length.out = 100))

# Solution: Predict on grid and plot
grid$Logit_Prob <- predict(logit_model, newdata = grid, type = "response")
grid$Logit_Pred <- ifelse(grid$Logit_Prob >= 0.5, "1", "0")

ggplot() +
  geom_tile(data = grid, aes(x = X1, y = X2, fill = Logit_Pred), alpha = 0.3) +
  geom_point(data = dataset, aes(x = X1, y = X2, color = as.factor(Y))) +
  labs(title = "Logistic Regression Decision Boundary", fill = "Predicted Area", color = "Actual Class") +
  theme_minimal()
Question 2: Look at the accuracy and the graph. Why did the Logistic Regression model fail so badly here?
The accuracy is essentially a coin toss (about 55%). The logit model passes a linear combination of the inputs through the sigmoid, so its decision boundary in the feature space is a single straight line. Because the actual data is circular (Class 1 is entirely surrounded by Class 0), no straight line can possibly separate the two classes.
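As an optional, illustrative follow-up (not asked for by the question), one way to see this concretely is to hand the logit model a non-linear feature, the squared radius \(X_1^2 + X_2^2\), which is exactly the kind of intermediate representation a hidden layer could learn on its own. Because the classes are perfectly separable in that feature, glm may warn about fitted probabilities of 0 or 1, but accuracy should be essentially 100%:
# Illustrative fix: add a hand-crafted squared-radius feature.
# A neural network's hidden layer learns this kind of transformation automatically.
dataset_radial <- transform(dataset, R2 = X1^2 + X2^2)
radial_model <- glm(Y ~ R2, data = dataset_radial, family = binomial)

radial_preds <- ifelse(predict(radial_model, type = "response") >= 0.5, 1, 0)
cat("Radial-feature Logit Accuracy: ", mean(radial_preds == dataset_radial$Y) * 100, "%\n")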