Question 2

Consider the softmax function in (10.13) (see also (4.13) on page 141) for modeling multinomial probabilities.

Equation (10.13):

\[ f_m(X) = \Pr(Y = m \mid X) = \frac{e^{Z_m}}{\sum_{\ell=0}^{9} e^{Z_\ell}}, \]

Equation (4.13):

\[ \Pr(Y = k \mid X = x) = \frac{e^{\beta_{k0} + \beta_{k1}x_1 + \cdots + \beta_{kp}x_p}}{\sum\limits_{l=1}^{K} e^{\beta_{l0} + \beta_{l1}x_1 + \cdots + \beta_{lp}x_p}}. \]

Part A

In (10.13), show that if we add a constant \( c \) to each of the \( Z_\ell \), then the probability is unchanged.
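One way to see this (a sketch, using the notation of (10.13)): adding \( c \) to every \( Z_\ell \) multiplies the numerator and denominator by the common factor \( e^{c} \), which cancels:

\[ \frac{e^{Z_m + c}}{\sum_{\ell=0}^{9} e^{Z_\ell + c}} = \frac{e^{c}\, e^{Z_m}}{e^{c} \sum_{\ell=0}^{9} e^{Z_\ell}} = \frac{e^{Z_m}}{\sum_{\ell=0}^{9} e^{Z_\ell}} = f_m(X). \]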

Part B

In (4.13), show that if we add constants \( c_j \), \( j = 0, 1, \ldots, p \), to each of the corresponding coefficients for each of the classes, then the predictions at any new point x are unchanged. This shows that the softmax function is over-parametrized. However, regularization and SGD typically constrain the solutions so that this is not a problem.
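A sketch of the argument: writing \( z_k = \beta_{k0} + \sum_{j=1}^{p} \beta_{kj} x_j \) for the linear predictor of class \( k \) in (4.13), the shifted coefficients give

\[ (\beta_{k0} + c_0) + \sum_{j=1}^{p} (\beta_{kj} + c_j)\, x_j = z_k + \Big( c_0 + \sum_{j=1}^{p} c_j x_j \Big), \]

so every class's linear predictor is shifted by the same constant \( c = c_0 + \sum_{j} c_j x_j \), and by Part A the predicted probabilities at \( x \) are unchanged.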

Question 3

Show that the negative multinomial log-likelihood (10.14) is equivalent to the negative log of the likelihood expression (4.5) when there are \( M = 2 \) classes.

Equation (10.14):

\[ - \sum_{i=1}^{n} \sum_{m=0}^{9} y_{im} \log(f_m(x_i)), \]

Equation (4.5):

\[ \ell(\beta_0, \beta_1) = \prod_{i : y_i = 1} p(x_i) \prod_{i' : y_{i'} = 0} \left(1 - p(x_{i'})\right). \]
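A sketch of the correspondence: with \( M = 2 \) classes, write \( y_{i1} = y_i \), \( y_{i0} = 1 - y_i \), and \( f_1(x_i) = p(x_i) \), so that \( f_0(x_i) = 1 - p(x_i) \). Then (10.14), with the inner sum over \( m \in \{0, 1\} \) instead of the ten digit classes, becomes

\[ -\sum_{i=1}^{n} \big[\, y_i \log p(x_i) + (1 - y_i) \log\big(1 - p(x_i)\big) \,\big] = -\log \ell(\beta_0, \beta_1), \]

which is exactly the negative log of (4.5), since the log turns the two products over \( \{i : y_i = 1\} \) and \( \{i' : y_{i'} = 0\} \) into the two sums above.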

Question 4

Consider a CNN that takes in 32 × 32 grayscale images and has a single convolution layer with three 5 × 5 convolution filters (without boundary padding).

Part A

Draw a sketch of the input and first hidden layer similar to Figure 10.8:

FIGURE 10.8. Architecture of a deep CNN for the CIFAR100 classification task. Convolution layers are interspersed with 2 × 2 max-pool layers, which reduce the size by a factor of 2 in both dimensions.

Part B

How many parameters are in this model?
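As a quick sanity check on the count (a sketch; it assumes one bias term per filter, the usual convention for CNN layers):

```python
# Parameter count for a conv layer with three 5x5 filters on a
# single-channel (grayscale) input, one bias per filter assumed.
filters = 3
filter_h, filter_w = 5, 5
in_channels = 1

weights_per_filter = filter_h * filter_w * in_channels  # 25
params = filters * (weights_per_filter + 1)             # 26 per filter
print(params)  # 78
```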

Part C

Explain how this model can be thought of as an ordinary feed-forward neural network with the individual pixels as inputs, and with constraints on the weights in the hidden units. What are the constraints?
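The sketch below illustrates the idea under the assumptions above (32 × 32 input, one 5 × 5 filter, no padding): unrolling the convolution into a dense weight matrix makes both constraints visible. Each hidden unit connects to only 25 pixels (a sparsity constraint), and every hidden unit reuses the same 25 filter values (a weight-sharing constraint). The filter values here are random placeholders.

```python
import numpy as np

H = W = 32
k = 5
out_h = out_w = H - k + 1  # 28 with no padding

rng = np.random.default_rng(0)
filt = rng.normal(size=(k, k))  # placeholder filter values

# Build the equivalent dense-layer weight matrix: one row per hidden unit.
dense = np.zeros((out_h * out_w, H * W))
for i in range(out_h):
    for j in range(out_w):
        patch = np.zeros((H, W))
        patch[i:i + k, j:j + k] = filt  # same 25 weights at every location
        dense[i * out_w + j] = patch.ravel()

print(dense.shape)                          # (784, 1024)
print(np.unique((dense != 0).sum(axis=1)))  # [25]: each unit sees 25 pixels
```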

Part D

If there were no constraints, then how many weights would there be in the ordinary feed-forward neural network in (c)?
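A sketch of the unconstrained count, under the same setup as above (3 feature maps of 28 × 28 hidden units, 32 × 32 input pixels):

```python
hidden_units = 3 * 28 * 28  # 2352 hidden units
inputs = 32 * 32            # 1024 input pixels

# Fully connected, with no weight sharing or sparsity:
weights = hidden_units * inputs
print(weights)  # 2408448 (plus 2352 biases, if those are counted)
```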