Student Information: - Jonah Berson 0799906

Part I: Bridging Models & Theoretical Background

We explored Linear Regression and the Logistic (Logit) Model. As we move into Neural Networks, it is crucial to understand how these models are related and why neural networks require specific non-linear functions (such as the sigmoid function) to succeed.

1. Comparing the Three Models

At their core, individual neurons in a neural network are mathematically identical to the models you already know.

  • Linear Regression: Maps inputs to a continuous output.
    \[Y = (W \cdot X) + b\]
  • Logistic Regression: Takes that same linear combination and passes it through a sigmoid function (\(\sigma\)) to “squash” the output between 0 and 1, creating a probability. \[Y = \sigma((W \cdot X) + b)\]
  • A Single Artificial Neuron: Mathematically identical to Logistic Regression! It calculates a linear combination (\(Z\)) and applies an activation function (\(A\)), as sketched in the short example after this list.
    \[Z = (W \cdot X) + b\]
    \[A = \sigma(Z)\]
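
To make the equivalence concrete, here is a minimal sketch of a single neuron's forward pass in R. The weights, bias, and inputs are hypothetical values chosen purely for illustration.

# Minimal sketch of one neuron's forward pass (illustrative values, not fitted)
W <- c(0.5, -0.3)   # hypothetical weights
b <- 0.1            # hypothetical bias
X <- c(2, 1)        # hypothetical inputs

Z <- sum(W * X) + b       # linear combination, exactly as in linear regression
A <- 1 / (1 + exp(-Z))    # sigmoid activation, exactly as in logistic regression
Z; A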

2. The Necessity of Activation Functions (Why Layers Collapse)

If a single neuron is just logistic regression, why build a “network” with hidden layers?

Hidden layers allow the model to learn intermediate, complex representations of the data by routing signals through interconnected nodes.

Below is an illustration of a standard, albeit small, neural network. It takes two inputs, passes them through a hidden layer containing two nodes, and aggregates the result into a single output prediction.

# Generates a standard 2x2x1 neural network diagram
library(ggplot2)  # needed for all plots in this document
nodes_net <- data.frame(
  x = c(1, 1, 2, 2, 3),
  y = c(0.5, -0.5, 0.5, -0.5, 0),
  label = c("Input\n(X1)", "Input\n(X2)", "Hidden\n(H1)", "Hidden\n(H2)", "Output\n(Y)")
)
edges_net <- data.frame(
  x = c(1, 1, 1, 1, 2, 2),
  y = c(0.5, 0.5, -0.5, -0.5, 0.5, -0.5),
  xend = c(2, 2, 2, 2, 3, 3),
  yend = c(0.5, -0.5, 0.5, -0.5, 0, 0)
)

ggplot() +
  geom_segment(data = edges_net, aes(x = x, y = y, xend = xend, yend = yend), 
               color = "gray60", size = 0.8) +
  geom_point(data = nodes_net, aes(x = x, y = y, color = factor(x)), size = 24) +
  geom_text(data = nodes_net, aes(x = x, y = y, label = label), size = 3.5, fontface = "bold", color = "black") +
  theme_void() +
  theme(legend.position = "none") +
  scale_color_manual(values = c("1" = "#AEC6CF", "2" = "#B39EB5", "3" = "#FFB347")) +
  xlim(0.5, 3.5) + ylim(-1, 1) +
  labs(title = "Standard Interconnected Architecture (2 Inputs, 2 Hidden, 1 Output)")

However, if we do not use a non-linear activation function (like the sigmoid) at the hidden nodes, this entire network collapses mathematically back into a single linear regression model, rendering the hidden layer entirely useless.

The Mathematical Proof:
To prove this collapse algebraically, let’s isolate a single pathway through the network—simplifying the architecture to just 1 input, 1 hidden node, and 1 output node.

# Generates a simplified 1-to-1 architecture diagram matching the mathematical proof
nodes_h <- data.frame(
  x = c(1, 2, 3),
  y = c(0, 0, 0),
  label = c("Input\n(X)", "Hidden Layer\n(H)", "Output\n(Y)")
)
edges_h <- data.frame(
  x = c(1, 2),
  y = c(0, 0),
  xend = c(2, 3),
  yend = c(0, 0),
  weight = c("W1", "W2")
)

ggplot() +
  geom_segment(data = edges_h, aes(x = x, y = y, xend = xend, yend = yend), 
               arrow = arrow(length = unit(0.4, "cm")), color = "gray60", size = 1.2) +
  geom_text(data = edges_h, aes(x = (x + xend)/2, y = 0.15, label = weight), fontface = "italic", size = 5) +
  geom_point(data = nodes_h, aes(x = x, y = y, color = factor(x)), size = 28) +
  geom_text(data = nodes_h, aes(x = x, y = y, label = label), size = 3.5, fontface = "bold", color = "black") +
  theme_void() +
  theme(legend.position = "none") +
  scale_color_manual(values = c("1" = "#AEC6CF", "2" = "#B39EB5", "3" = "#FFB347")) +
  xlim(0.5, 3.5) + ylim(-0.5, 0.5) +
  labs(title = "Simplified Architecture: The Linear Collapse Proof")

Imagine that this simplified 2-layer network uses no activation functions at all.

  • Hidden Layer (H): \(H = W_1 \cdot X + b_1\)
  • Output Layer (Y): \(Y = W_2 \cdot H + b_2\)

Substitute the hidden layer equation into the output equation:

\[Y = W_2 \cdot (W_1 \cdot X + b_1) + b_2\]
\[Y = (W_2 \cdot W_1) \cdot X + (W_2 \cdot b_1 + b_2)\]
Let \(W_{new} = W_2 \cdot W_1\) and \(b_{new} = W_2 \cdot b_1 + b_2\).

The equation simplifies to: \[Y = W_{new} \cdot X + b_{new}\]

No matter how many nodes or layers you stack, purely linear transformations combine algebraically into a single linear function. Non-linear activation functions are strictly necessary to prevent this collapse.
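
This result is easy to confirm numerically. The sketch below uses arbitrary, hypothetical weights (not part of the assignment) and shows that the two-step linear network and the collapsed single linear model produce identical predictions.

# Collapse check with arbitrary (hypothetical) weights
W1 <- 0.8; b1 <- 0.3
W2 <- -1.5; b2 <- 0.7
x <- seq(-3, 3, length.out = 5)

two_layer <- W2 * (W1 * x + b1) + b2          # network without activations
one_layer <- (W2 * W1) * x + (W2 * b1 + b2)   # collapsed single linear model
all.equal(two_layer, one_layer)               # TRUE: identical predictions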

Part II: Implementation

The mathematical formulation of the sigmoid function is:
\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

Task 1: Mathematical Implementation and Visualization

Task 1.1:

Write an R function named sigmoid that takes a numeric vector x as input and returns the sigmoid transformation.

# Sigmoid: maps any real-valued input to the interval (0, 1)
sigmoid <- function(x) {
  1 / (1 + exp(-x))
}
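
A quick optional sanity check (not required by the task) confirms the expected behaviour at a few reference points.

sigmoid(0)             # 0.5, by symmetry of the curve
sigmoid(c(-10, 10))    # very close to 0 and 1 at the extremes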

Task 1.2:

Generate a sequence of 200 numbers ranging from -10 to 10. Apply your sigmoid function and plot the result using ggplot2 to observe the characteristic S-curve.

x_values <- seq(-10, 10, length.out = 200)
y_values <- sigmoid(x_values)
plot_data <- data.frame(x = x_values, y = y_values)

ggplot(plot_data, aes(x = x, y = y)) +
  geom_line(color = "#EA4335", size = 1.2) +
  labs(title = "The Sigmoid Activation Function", x = "Z (Linear Output)", y = "Probability (A)") +
  theme_minimal() +
  geom_hline(yintercept = c(0, 0.5, 1), linetype = "dashed", color = "gray")

Question 1 (The Vanishing Gradient Problem): Based on your plot, what happens to the slope (gradient) of the curve as \(x\) becomes very large (e.g., 10) or very small (e.g., -10)? Why is this detrimental to training deep neural networks, and how do modern architectures overcome it?

As \(x\) reaches extreme positive or negative values, the curve becomes almost flat, meaning the gradient approaches zero. Deep networks learn by passing these gradients backward through the layers (backpropagation) using the chain rule. If the local gradients are near zero, multiplying them across many layers causes the overall gradient to shrink exponentially until it effectively “vanishes.” When gradients vanish, the earlier layers stop updating their weights, and the model fails to learn. Modern architectures mitigate this primarily by using the ReLU (Rectified Linear Unit) activation function, \(f(x) = \max(0, x)\), which keeps a constant gradient of 1 for all positive inputs and therefore does not saturate the way the sigmoid does.
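
The saturation is easy to verify numerically. Using the identity \(\sigma'(x) = \sigma(x)\,(1 - \sigma(x))\), this short optional check (not part of the task) shows how small the gradient becomes at the extremes.

# Sigmoid derivative: sigma'(x) = sigma(x) * (1 - sigma(x))
sigmoid_grad <- function(x) sigmoid(x) * (1 - sigmoid(x))

sigmoid_grad(0)     # 0.25, the maximum possible slope
sigmoid_grad(10)    # ~4.5e-05, essentially zero
sigmoid_grad(-10)   # ~4.5e-05, essentially zero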

Task 2: When Linear Models Fail (The Logit Model)

To understand why we need neural networks, we must look at data that a simple Logit model cannot handle. Run the code below to generate a “bullseye” dataset—a cluster of points surrounded by a ring of different points.

# Generate Concentric Circles (Bullseye Dataset)
set.seed(42)
n <- 200
# Inner circle (Class 1)
r_in <- runif(n/2, 0, 1.5)
theta_in <- runif(n/2, 0, 2*pi)
# Outer ring (Class 0)
r_out <- runif(n/2, 2.5, 4)
theta_out <- runif(n/2, 0, 2*pi)

dataset <- data.frame(
  X1 = c(r_in * cos(theta_in), r_out * cos(theta_out)),
  X2 = c(r_in * sin(theta_in), r_out * sin(theta_out)),
  Y = c(rep(1, n/2), rep(0, n/2))
)

ggplot(dataset, aes(x = X1, y = X2, color = as.factor(Y))) +
  geom_point(size = 2) +
  labs(title = "Complex Non-Linear Dataset (Bullseye)", color = "Class (Y)") +
  theme_minimal()

Task 2.1:

Fit a standard Logistic Regression model predicting Y using X1 and X2. Calculate its accuracy.

# Solution: Fit Logit Model
logit_model <- glm(Y ~ X1 + X2, data = dataset, family = binomial)

# Calculate Accuracy
logit_probs <- predict(logit_model, type = "response")
logit_preds <- ifelse(logit_probs >= 0.5, 1, 0)
logit_accuracy <- mean(logit_preds == dataset$Y)
cat("Logistic Regression Accuracy: ", logit_accuracy * 100, "%\n")
## Logistic Regression Accuracy:  55.5 %

Task 2.2:

We have generated a background grid for you. Use your logit_model to predict the class for the entire grid, and plot the decision boundary.

# Create a grid across the feature space
grid <- expand.grid(X1 = seq(-4.5, 4.5, length.out = 100), 
                    X2 = seq(-4.5, 4.5, length.out = 100))

# Solution: Predict on grid and plot
grid$Logit_Prob <- predict(logit_model, newdata = grid, type = "response")
grid$Logit_Pred <- ifelse(grid$Logit_Prob >= 0.5, "1", "0")

ggplot() +
  geom_tile(data = grid, aes(x = X1, y = X2, fill = Logit_Pred), alpha = 0.3) +
  geom_point(data = dataset, aes(x = X1, y = X2, color = as.factor(Y))) +
  labs(title = "Logistic Regression Decision Boundary", fill = "Predicted Area", color = "Actual Class") +
  theme_minimal()

Question 2: Look at the accuracy and the graph. Why did the Logistic Regression model fail so badly here?

The accuracy is barely above 50% (little better than a coin toss). The logit model maps a linear combination of the inputs directly to an output. Geometrically, this means its decision boundary is always a single straight line. Because the data are circular (Class 1 is entirely surrounded by Class 0), no straight line can possibly separate the classes.
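
To visualise this limitation directly, the fitted coefficients can be converted into the single straight line where the predicted probability equals 0.5. This is an optional illustration, not part of the task; it simply re-plots the data with that line overlaid.

# The 0.5-probability boundary of the logit model is the line where
# b0 + b1*X1 + b2*X2 = 0, i.e. X2 = -(b0 + b1*X1) / b2
co <- coef(logit_model)
boundary_slope     <- -co["X1"] / co["X2"]
boundary_intercept <- -co["(Intercept)"] / co["X2"]

ggplot(dataset, aes(x = X1, y = X2, color = as.factor(Y))) +
  geom_point(size = 2) +
  geom_abline(slope = boundary_slope, intercept = boundary_intercept,
              linetype = "dashed", color = "black") +
  labs(title = "The Only Boundary a Logit Model Can Draw", color = "Class (Y)") +
  theme_minimal()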

Task 3: The Power of Hidden Layers (Neural Networks)

Now, we will solve the same problem using a Neural Network. Instead of mapping inputs directly to the output, we will insert a hidden layer containing 4 neurons (nodes).

Before writing the code, review the architecture you are about to build. Every input connects to every hidden node, and every hidden node connects to the final output.

# Generates the 2-input, 4-hidden, 1-output architecture diagram for Task 3
nodes_nn4 <- data.frame(
  x = c(1, 1, 2, 2, 2, 2, 3),
  y = c(1, -1, 1.5, 0.5, -0.5, -1.5, 0),
  label = c("X1", "X2", "H1", "H2", "H3", "H4", "Y")
)

# Create combinations of edges from Input to Hidden, and Hidden to Output
edges_in_to_h <- expand.grid(x = 1, y = c(1, -1), xend = 2, yend = c(1.5, 0.5, -0.5, -1.5))
edges_h_to_out <- data.frame(x = 2, y = c(1.5, 0.5, -0.5, -1.5), xend = 3, yend = 0)
edges_nn4 <- rbind(edges_in_to_h, edges_h_to_out)

ggplot() +
  geom_segment(data = edges_nn4, aes(x = x, y = y, xend = xend, yend = yend), color = "gray70") +
  geom_point(data = nodes_nn4, aes(x = x, y = y, color = factor(x)), size = 18) +
  geom_text(data = nodes_nn4, aes(x = x, y = y, label = label), size = 4, fontface = "bold", color = "black") +
  theme_void() +
  theme(legend.position = "none") +
  scale_color_manual(values = c("1" = "#AEC6CF", "2" = "#B39EB5", "3" = "#FFB347")) +
  xlim(0.5, 3.5) + ylim(-2, 2) +
  labs(title = "Task 3 Architecture: 2 Inputs, 4 Hidden Nodes, 1 Output")

Task 3.1:

Fit a neural network using the nnet package. Use the argument size = 4 to instruct the model to create the 4 hidden neurons pictured above. Calculate its accuracy, predict on the grid, and plot the new decision boundary.

# Solution: Fit Neural Network
library(nnet)  # provides the single-hidden-layer nnet() function
set.seed(123) # For reproducibility of the hidden layer weights
# size = 4 creates a hidden layer with 4 neurons
nn_model <- nnet(as.factor(Y) ~ X1 + X2, data = dataset, size = 4, trace = FALSE)

# Calculate Accuracy
nn_preds <- predict(nn_model, type = "class")
nn_accuracy <- mean(nn_preds == dataset$Y)
cat("Neural Network Accuracy: ", nn_accuracy * 100, "%\n")
## Neural Network Accuracy:  100 %
# Predict on grid and plot
grid$NN_Pred <- predict(nn_model, newdata = grid, type = "class")

ggplot() +
  geom_tile(data = grid, aes(x = X1, y = X2, fill = NN_Pred), alpha = 0.3) +
  geom_point(data = dataset, aes(x = X1, y = X2, color = as.factor(Y))) +
  labs(title = "Neural Network Decision Boundary (4 Hidden Nodes)", fill = "Predicted Area", color = "Actual Class") +
  theme_minimal()

Question 3: Compare the Logistic Regression boundary to the Neural Network boundary. What mathematical component of the Neural Network allowed it to draw this complex shape?

The Neural Network achieved essentially perfect accuracy by drawing a boundary that curves around the inner circle. It was able to do this because of the hidden layer combined with non-linear activation functions. Each of the 4 hidden nodes draws its own linear boundary, and the non-linear activation lets the output layer combine those straight lines into a closed, roughly circular region that encloses the inner class.
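
Those four linear pieces can be inspected directly from the fitted weights. The sketch below is optional and assumes nnet's weight ordering (for each unit, the bias first and then its input weights); under that assumption it recovers the straight line implied by each hidden node's 0.5 activation contour.

# Each hidden node computes sigmoid(b + w1*X1 + w2*X2); its 0.5 contour is the
# straight line b + w1*X1 + w2*X2 = 0. Assuming nnet stores weights unit by unit
# (bias, then inputs), the first 12 weights belong to the 4 hidden nodes.
summary(nn_model)   # prints weights with labels such as b->h1, i1->h1, i2->h1

w_hidden <- matrix(nn_model$wts[1:12], nrow = 3)   # columns = h1..h4; rows = b, w1, w2
hidden_lines <- data.frame(
  node      = paste0("H", 1:4),
  intercept = -w_hidden[1, ] / w_hidden[3, ],
  slope     = -w_hidden[2, ] / w_hidden[3, ]
)
hidden_lines   # four straight lines that the output layer bends into the circle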

End