Question 2

Consider the softmax function in (10.13) (see also (4.13) on page 141) for modeling multinomial probabilities.

Equation (10.13):

\[ f_m(X) = \Pr(Y = m \mid X) = \frac{e^{Z_m}}{\sum_{\ell=0}^{9} e^{Z_\ell}}, \]

Equation (4.13):

\[ \Pr(Y = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_{k1}x_1 + \cdots + \beta_{kp}x_p}}{\sum\limits_{l=1}^{K} e^{\beta_{l0} + \beta_{l1}x_1 + \cdots + \beta_{lp}x_p}}. \]

Part A

In (10.13), show that if we add a constant \(c\) to each of the \(Z_\ell\), then the probability is unchanged.

The softmax function is:

\[ f_m(X) = \frac{e^{Z_m}}{\sum_{\ell=0}^{9} e^{Z_\ell}} \]

Suppose we add a constant \(c\) to every \(Z_\ell\). Then:

\[ f_m(X) = \frac{e^{Z_m + c}}{\sum_{\ell=0}^{9} e^{Z_\ell + c}} = \frac{e^c e^{Z_m}}{e^c \sum_{\ell=0}^{9} e^{Z_\ell}} = \frac{e^{Z_m}}{\sum_{\ell=0}^{9} e^{Z_\ell}} \]

The constant \(c\) cancels, so the softmax output is invariant to translation.
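As a quick numerical check, here is a minimal NumPy sketch; the scores and the shift \(c\) below are arbitrary made-up values:

```python
import numpy as np

def softmax(z):
    # Subtracting the max is itself an application of this shift invariance,
    # used here only for numerical stability.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, -1.0, 0.5, 3.0])   # arbitrary scores Z_0, ..., Z_3
c = 10.0                              # arbitrary constant added to every score

print(np.allclose(softmax(z), softmax(z + c)))  # True: probabilities unchanged
```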

Part B

In (4.13), show that if we add constants \(c_j\), \(j = 0, 1, \ldots, p\), to each of the corresponding coefficients for each of the classes, then the predictions at any new point \(x\) are unchanged. This shows that the softmax function is over-parametrized. However, regularization and SGD typically constrain the solutions so that this is not a problem.

The multinomial logistic regression model is:

\[ \Pr(Y = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_{k1}x_1 + \cdots + \beta_{kp}x_p}}{\sum_{l=1}^{K} e^{\beta_{l0} + \beta_{l1}x_1 + \cdots + \beta_{lp}x_p}} \]

Suppose we add constants \(c_j\), \(j = 0, 1, \ldots, p\), to the corresponding coefficients of every class \(k\) (writing \(x_0 = 1\) for the intercept). Then:

\[ \beta'_{kj} = \beta_{kj} + c_j \Rightarrow \sum_{j=0}^{p} \beta'_{kj} x_j = \sum_{j=0}^{p} \beta_{kj} x_j + \sum_{j=0}^{p} c_j x_j \]

The second term, \(\sum_{j=0}^{p} c_j x_j\), does not depend on the class \(k\), so it plays the role of the constant \(c\) in Part A and cancels between the numerator and denominator of the softmax; the predicted probabilities at any new point \(x\) are therefore unchanged.

Conclusion: The model is overparameterized, but SGD and regularization constrain it.
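A small NumPy sketch of the same cancellation for (4.13); the coefficient matrix, the shift vector \(c_0, \ldots, c_p\), and the test point are all made up, and \(x_0 = 1\) carries the intercept:

```python
import numpy as np

rng = np.random.default_rng(0)
K, p = 3, 4                              # number of classes and predictors (made up)
B = rng.normal(size=(K, p + 1))          # row k holds beta_{k0}, beta_{k1}, ..., beta_{kp}
c = rng.normal(size=p + 1)               # constants c_0, ..., c_p added to every class
x = np.concatenate(([1.0], rng.normal(size=p)))   # x_0 = 1 carries the intercept

def class_probs(B, x):
    z = B @ x
    e = np.exp(z - z.max())
    return e / e.sum()

# Adding c_j to every class's j-th coefficient leaves the predictions unchanged.
print(np.allclose(class_probs(B, x), class_probs(B + c, x)))  # True
```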

Question 3

Show that the negative multinomial log-likelihood (10.14) is equivalent to the negative log of the likelihood expression (4.5) when there are \(M = 2\) classes.

Equation (10.14):

\[ - \sum_{i=1}^{n} \sum_{m=0}^{9} y_{im} \log(f_m(x_i)), \]

Equation (4.5):

\[ \ell(\beta_0, \beta_1) = \prod_{i : y_i = 1} p(x_i) \prod_{i' : y_{i'} = 0} \left(1 - p(x_{i'})\right). \]

The negative multinomial log-likelihood is:

\[ - \sum_{i=1}^{n} \sum_{m=0}^{9} y_{im} \log(f_m(x_i)) \]

When \(M = 2\), write \(f_1(x_i) = p(x_i)\) and \(f_0(x_i) = 1 - p(x_i)\), and encode the response as \(y_{i1} = y_i\) and \(y_{i0} = 1 - y_i\) with \(y_i \in \{0, 1\}\). The double sum in (10.14) then collapses to:

\[ - \sum_{i=1}^{n} \left[ y_i \log(p(x_i)) + (1 - y_i) \log(1 - p(x_i)) \right] \]

This is exactly the negative log of the likelihood:

\[ \ell(\beta_0, \beta_1) = \prod_{i : y_i = 1} p(x_i) \prod_{i' : y_{i'} = 0} \left(1 - p(x_{i'})\right) \]

Hence (10.14) reduces to the negative log of (4.5) in the binary case, as required.
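A numerical sketch of this equivalence, using made-up labels and fitted probabilities; it evaluates (10.14) with \(M = 2\) and the negative log of (4.5) and checks that they agree:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=8)           # made-up binary labels y_i
p = rng.uniform(0.05, 0.95, size=8)      # made-up fitted probabilities p(x_i)

# Negative multinomial log-likelihood (10.14) with M = 2:
# the one-hot responses are (y_{i0}, y_{i1}) = (1 - y_i, y_i)
# and the class probabilities are (f_0, f_1) = (1 - p, p).
Y = np.column_stack([1 - y, y])
F = np.column_stack([1 - p, p])
nll_multinomial = -np.sum(Y * np.log(F))

# Negative log of the likelihood in (4.5).
likelihood = np.prod(p[y == 1]) * np.prod(1 - p[y == 0])
nll_binary = -np.log(likelihood)

print(np.allclose(nll_multinomial, nll_binary))  # True
```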

Question 4

Consider a CNN that takes in 32 × 32 grayscale images and has a single convolution layer with three 5 × 5 convolution filters (without boundary padding).

Part A

Draw a sketch of the input and first hidden layer similar to Figure 10.8:

FIGURE 10.8. Architecture of a deep CNN for the CIFAR100 classification task. Convolution layers are interspersed with 2 × 2 max-pool layers, which reduce the size by a factor of 2 in both dimensions.

The sketch for this exercise has two layers: the input layer is a single 32 × 32 grayscale image (32 × 32 × 1), and the first hidden layer consists of three 28 × 28 feature maps, one per 5 × 5 filter, since a valid (unpadded) convolution gives 32 − 5 + 1 = 28 positions in each dimension.

Part B

How many parameters are in this model?

Each 5 × 5 filter has \(5 \times 5 = 25\) weights plus one bias, i.e. 26 parameters per filter; with three filters, the model has \(3 \times 26 = 78\) parameters.
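As a quick check of this count, a minimal sketch assuming PyTorch is available; the layer below mirrors the architecture described in the question:

```python
import torch
import torch.nn as nn

# One grayscale input channel, three 5x5 filters, no boundary padding.
conv = nn.Conv2d(in_channels=1, out_channels=3, kernel_size=5)

# Each filter: 5 * 5 * 1 = 25 weights plus 1 bias; three filters give 3 * 26 = 78.
print(sum(p.numel() for p in conv.parameters()))   # 78

# The hidden layer is three 28 x 28 feature maps, since 32 - 5 + 1 = 28.
print(conv(torch.zeros(1, 1, 32, 32)).shape)        # torch.Size([1, 3, 28, 28])
```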

Part C

Explain how this model can be thought of as an ordinary feed-forward neural network with the individual pixels as inputs, and with constraints on the weights in the hidden units. What are the constraints?

This CNN model can be thought of as an ordinary feedforward neural network (FFNN) in which:

  • Each individual pixel in the 32×32 input image (i.e., 1024 total) is treated as a separate input neuron.

  • Each convolutional filter produces an output at multiple spatial locations by applying the same set of weights (i.e., the same 5×5 kernel) across the input.

  • The output of each filter is like a set of hidden neurons, each connected to a small local region (5×5 patch) of the input.

Constraints:

Local connectivity

  • Each hidden neuron is connected to only a small patch (5×5) of the input, not the full image.

  • This contrasts with a standard FFNN where each hidden neuron is connected to every input neuron.

Weight sharing

  • All neurons that belong to the same output feature map share the same 5×5 weights and bias.

  • In an FFNN, each connection typically has its own unique weight, but CNNs reuse weights to detect patterns anywhere in the input (see the sketch after this list).
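To make these two constraints concrete, here is a minimal NumPy sketch (on a made-up 8 × 8 image with a 3 × 3 filter, to keep the matrices small; the 32 × 32 case with 5 × 5 filters works identically). It builds the fully connected weight matrix implied by a convolution and checks that it reproduces the convolutional output:

```python
import numpy as np

rng = np.random.default_rng(2)
H = W = 8                       # small made-up image size
k = 3                           # small made-up kernel size
image = rng.normal(size=(H, W))
kernel = rng.normal(size=(k, k))
bias = rng.normal()

out = H - k + 1                 # output size of a valid (no-padding) convolution

# Convolutional view: each hidden unit sees only one k x k patch.
conv_view = np.array([[np.sum(image[i:i + k, j:j + k] * kernel) + bias
                       for j in range(out)] for i in range(out)])

# Feed-forward view: one weight row per hidden unit over all H * W pixels,
# mostly zero (local connectivity) and repeating the same kernel values
# in every row (weight sharing).
Wmat = np.zeros((out * out, H * W))
for i in range(out):
    for j in range(out):
        row = np.zeros((H, W))
        row[i:i + k, j:j + k] = kernel      # shared weights placed on the local patch
        Wmat[i * out + j] = row.ravel()

ffnn_view = (Wmat @ image.ravel() + bias).reshape(out, out)
print(np.allclose(conv_view, ffnn_view))    # True: same hidden-layer activations
```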

Part D

If there were no constraints, then how many weights would there be in the ordinary feed-forward neural network in (c)?

From Part (c), the equivalent feed-forward network has one hidden unit per filter output location, i.e. \(3 \times 28 \times 28 = 2352\) hidden units. Without the local-connectivity and weight-sharing constraints, each hidden unit would need its own weight for every one of the \(32 \times 32 = 1024\) input pixels, plus a bias:

\[ 2352 \text{ hidden units} \times (1024 + 1) \text{ parameters per unit} = 2{,}410{,}800. \]

So the unconstrained fully connected feed-forward network would have 2,410,800 parameters, compared with 78 in the convolutional model.
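A quick check of both counts, again assuming PyTorch is available; the fully connected layer below stands in for the unconstrained network from Part (c):

```python
import torch.nn as nn

# Unconstrained version: each of the 3 * 28 * 28 = 2352 hidden units gets its own
# weight for every one of the 32 * 32 = 1024 input pixels, plus a bias.
dense = nn.Linear(in_features=32 * 32, out_features=3 * 28 * 28)
print(sum(p.numel() for p in dense.parameters()))   # 2410800

# Constrained (convolutional) version from Part B, for comparison.
conv = nn.Conv2d(in_channels=1, out_channels=3, kernel_size=5)
print(sum(p.numel() for p in conv.parameters()))    # 78
```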