Assignment 1

Q1 Facts)

Suppose we are predicting whether an individual experiences Success or Failure (1 or 0, respectively) based on whether they are in the treatment or control group (x = 1 or -1, respectively).

\[ y =1(\beta x - \epsilon \ge 0) \]

\[ y= \begin{cases}1, & \text{if } \beta x - \varepsilon \ge 0 \\ 0, & \text{otherwise.}\end{cases} \]

\[ x \in \{1,-1\} \ (\text{treatment or control}) \]

\[ y \in \{1,0\} \ (\text{Success or Failure}) \]

\[ \text{Prob}(x=1)=p \implies \text{Prob}(x=-1)=1-p \]

\[ \beta>0 \]

\[ \epsilon \perp x \]

\[ F(\varepsilon) = \frac{1}{1 + e^{-\varepsilon}} \]

Q1a)

\[ \text{Show that the log odds ratio } \log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} \text{ is linear in } \beta, \text{ and therefore the true model for } y \text{ is a logistic regression.} \]

Notice :

\[ \beta x - \epsilon \ge 0 \implies \beta x \ge \epsilon \]

So when our signal \(\beta x\) is greater than our noise \(\epsilon\), our model will indicate \(1\) and state : “Success”.

Further, notice :

\[ \text{Prob}(y=1|X=x) = \text{Prob}(\beta x - \epsilon \ge 0) = \text{Prob}(\beta x \ge \epsilon) = \text{Prob}(\epsilon \le \beta x) \]

and recall :

\[ F(\epsilon_o) = \text{Prob}(\epsilon \le \epsilon_o) \]

Therefore,

\[ \epsilon_o = \beta x \]

\[ \text{Prob}(\epsilon \le \epsilon_o) = \text{Prob}(\epsilon \le \beta x) = F(\beta x) \]

\[ \text{Prob}(y=1|X=x)=\frac{1}{1 + e^{-\beta x}} \]

and we know :

\[ \text{Prob}(y=0|X=x) = 1 - \text{Prob}(y=1|X=x) = 1 - \frac{1}{1 + e^{-\beta x}} = \frac{e^{-\beta x}}{1 + e^{-\beta x}} \]

So, notice :

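\[ \frac{\text{Prob}(y=1|X=x)}{\text{Prob}(y=0|X=x)} = \frac{\frac{1}{1 + e^{-\beta x}}}{1 - \frac{1}{1 + e^{-\beta x}}} = \frac{\frac{1}{1 + e^{-\beta x}}}{\frac{e^{-\beta x}}{1 + e^{-\beta x}}} = \frac{1}{e^{-\beta x}} = e^{\beta x} \]

Taking logs :

\[ \log \frac{\text{Prob}(y=1|X=x)}{\text{Prob}(y=0|X=x)} = \beta x \]

The log odds are linear in \(\beta\) (and in \(x\)), which is exactly the form of a logistic regression with no intercept and slope \(\beta\).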
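A quick numerical sanity check of the log odds, as a minimal sketch: it assumes the logistic CDF \(F\) from Q1 Facts and uses an illustrative value \(\beta = 0.7\) that is not part of the assignment. The function names are hypothetical helpers. The log of the odds ratio above should come out equal to \(\beta x\) for both \(x = 1\) and \(x = -1\).

```python
import math

def logistic_cdf(t):
    """CDF of the standard logistic distribution, F(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))

def log_odds(beta, x):
    """log[ P(y=1|x) / P(y=0|x) ] with P(y=1|x) = F(beta * x)."""
    p1 = logistic_cdf(beta * x)
    return math.log(p1 / (1.0 - p1))

beta = 0.7  # illustrative value only, not given in the assignment
for x in (1, -1):
    # The two printed numbers should agree up to floating-point error.
    print(x, log_odds(beta, x), beta * x)
```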

Q1b)

\[ \text{Suppose you have access to } \beta \text{ and therefore can use the true logistic classifier} \]

\[ f(x) = 1 \left[ \log \frac{P(y = 1 \mid x)}{P(y = 0 \mid x)} \ge 0 \right] \]

\[ \text{Calculate its 0–1 risk:} \]

\[ R_{01}(f) = \mathbb{E} \left[ 1(y \neq f(x)) \right] \]

In Data Science we typically estimate \(\beta\) with \(\hat{\beta}\); here, however, we are lucky and are told the true population parameter. Knowing the population parameter \(\beta\) lets us look at the lowest possible misclassification rate.

We measure the misclassification rate with something called Risk :

\[ R(f) = \mathbb{E}_{X,Y}[L(f(X), Y)] \]

Where

\[ L(f(X), Y) = \mathbf{1}[Y \neq f(X)] \]

In other words, the Loss function measures how wrong our model is on a single observation. When our model misclassifies an observation, we assign it a loss of 1, and otherwise a loss of 0. So if our model makes 2 mistakes, its total loss is 2.

We define Risk as our expected loss: the average amount of error the model makes in the long run, across all possible data points drawn from the true (unknown) distribution of \((X,Y)\).
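To make the loss and risk definitions concrete, here is a minimal sketch of the sample analogue of the risk: the average 0–1 loss over a handful of observations. The labels and predictions below are made up purely for illustration, and the function names are hypothetical.

```python
def zero_one_loss(y_true, y_pred):
    """0-1 loss for one observation: 1 if misclassified, else 0."""
    return 1 if y_true != y_pred else 0

def empirical_risk(ys, preds):
    """Average 0-1 loss over a sample (the sample analogue of R_01)."""
    return sum(zero_one_loss(y, p) for y, p in zip(ys, preds)) / len(ys)

# Made-up labels and predictions, for illustration only.
ys = [1, 0, 1, 1, 0]
preds = [1, 1, 1, 0, 0]

total_loss = sum(zero_one_loss(y, p) for y, p in zip(ys, preds))
risk = empirical_risk(ys, preds)
print(total_loss, risk)  # 2 mistakes out of 5 observations
```

The total loss counts mistakes, while the (empirical) risk divides by the number of observations, so it is a misclassification rate between 0 and 1.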

So :

\[ R_{01}(f) = \mathbb{E} \left[ 1(y \neq f(x)) \right] \]

\[ R_{01}(f) = \mathbb{E}_{XY}\left[ 1(y \neq f(x)) \right] = \]

\[ \mathbb{E}_X \!\left[\, \mathbb{E}_Y \!\left[\, \mathbf{1}(Y \neq f(X)) \,\middle|\, X \,\right] \right] = \]

\[ \mathbb{E}_X [\text{Prob}(Y \ne f(X) \mid X)] \]

In other words, the true classification error rate is the average probability of misclassification across all possible input values.

So we can compute each type of error separately: the case where our model \(f(X)\) predicts Success ( \(1\) ) when the outcome is actually Failure ( \(0\) ), and vice versa.

Suppose :
We have an observation ( \(x\) ) which is part of the treatment group ( \(1\) ). By Q1a the log odds equal \(\beta x\), so :

\[ x=1 \implies \log \frac{P(y = 1 \mid x = 1)}{P(y = 0 \mid x = 1)} = \beta \]

And recall that we define \(\beta>0\), so the log odds are positive and the indicator inside \(f\) switches on.

In other words :

\[ \beta > 0 \implies f(1) = 1[\beta \ge 0] = 1 \]

Although our classifier predicts success for everyone in the treatment group, the true outcome \(Y\) still depends on the random noise term \(\varepsilon\). That means some individuals in the treatment group will still fail by chance.

Formally, recall the data-generating model:

\[ Y = 1(\beta X - \varepsilon \ge 0), \quad \varepsilon \sim \text{Logistic}(0, 1) \]

Therefore, the probability of a success given \(X = x\) is:

\[ P(Y = 1 \mid X = x) = P(\varepsilon \le \beta x) = F(\beta x) = \frac{1}{1 + e^{-\beta x}}. \]

Case 1: Treatment group (\(x = 1\))

For the treatment group, the classifier predicts success:

\[ f(1) = 1. \]

A misclassification occurs when the true outcome is a failure, i.e. when \(Y = 0\):

\[ P(Y \neq f(1) \mid X = 1) = P(Y = 0 \mid X = 1) = 1 - P(Y = 1 \mid X = 1). \]

Using the logistic CDF:

\[ P(Y = 0 \mid X = 1) = 1 - F(\beta) = 1 - \frac{1}{1 + e^{-\beta}} = \frac{1}{1 + e^{\beta}} \]

Case 2: Control group (\(x = -1\))

For the control group, the classifier predicts failure:

\[ f(-1) = 0. \]

A misclassification occurs when the true outcome is a success, i.e. when \(Y = 1\):

\[ P(Y \neq f(-1) \mid X = -1) = P(Y = 1 \mid X = -1) = F(-\beta) = \frac{1}{1 + e^{\beta}}. \]

Compute the Expected 0–1 Risk :

We know that \(X\) is a binary variable taking values \(\{-1, 1\}\) with probabilities:

\[ X \in \{-1, 1\}, \quad P(X = 1) = p, \quad P(X = -1) = 1 - p. \]

Therefore, the expectation over \(X\) becomes a weighted sum:

\[ \mathbb{E}_X [P(Y \neq f(X) \mid X)] = \sum_{x \in \{-1, 1\}} P(Y \neq f(x) \mid X = x) \, P(X=x) \]

\[ = P(X = 1) \cdot P(Y \neq f(1) \mid X = 1) + P(X = -1) \cdot P(Y \neq f(-1) \mid X = -1). \]

From our earlier derivations, we know that:

\[ P(Y \neq f(1) \mid X = 1) = \frac{1}{1 + e^{\beta}}, \quad P(Y \neq f(-1) \mid X = -1) = \frac{1}{1 + e^{\beta}}. \]

Substituting these values, we get:

\[ R_{01}(f) = p \cdot \frac{1}{1 + e^{\beta}} + (1 - p) \cdot \frac{1}{1 + e^{\beta}}. \]

\[ \boxed{R_{01}(f) = \frac{1}{1 + e^{\beta}}} \]

This shows that the 0–1 risk does not depend on the group proportion \(p\), only on the model parameter \(\beta\). As \(\beta\) increases, the classifier makes fewer mistakes.
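The closed form can be checked by simulation. The following is a minimal Monte Carlo sketch, assuming the data-generating model stated above; the values of \(\beta\), \(p\), the sample size, and the seed are illustrative choices, and the function names are hypothetical. Logistic noise is drawn by inverting the CDF, \(\varepsilon = \log\frac{u}{1-u}\) for uniform \(u\).

```python
import math
import random

def exact_risk(beta):
    """Closed-form 0-1 risk of the true classifier: 1 / (1 + e^beta)."""
    return 1.0 / (1.0 + math.exp(beta))

def simulate_risk(beta, p, n, seed=0):
    """Monte Carlo estimate of the 0-1 risk under Y = 1(beta*X - eps >= 0)."""
    rng = random.Random(seed)
    mistakes = 0
    for _ in range(n):
        x = 1 if rng.random() < p else -1        # P(X = 1) = p
        u = rng.random()
        while u == 0.0:                           # avoid log(0)
            u = rng.random()
        eps = math.log(u / (1.0 - u))             # standard logistic draw
        y = 1 if beta * x - eps >= 0 else 0       # true outcome
        f_x = 1 if beta * x >= 0 else 0           # true classifier: 1[log odds >= 0]
        mistakes += (y != f_x)
    return mistakes / n

beta = 1.0  # illustrative value only
for p in (0.2, 0.5, 0.8):
    # The estimate should be close to the exact risk for every p.
    print(p, simulate_risk(beta, p, n=100_000), exact_risk(beta))
```

If the derivation holds, the simulated rate should stay near \(\frac{1}{1+e^{\beta}}\) regardless of the value of \(p\).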

Q1c)

Suppose now you use a dubious classifier that just assigns labels randomly (without even looking at \(x\)).

Show that this classifier is never better than the true classifier in terms of the \(0 - 1\) risk.

Sol)
If we use a classifier that ignores \(X\) and predicts labels completely at random, then we can flip a coin to pick success or failure :

\[ P(\hat{Y} = 1 \mid X) = P(\hat{Y} = 0 \mid X) = \frac{1}{2}. \]

For each \(x\), the conditional probability of misclassification is:

\[ P(Y \neq \hat{Y} \mid X = x) = P(\hat{Y} = 1, Y = 0 \mid X = x) + P(\hat{Y} = 0, Y = 1 \mid X = x). \]

Since \(\hat{Y}\) is independent of both \(X\) and \(Y\), we can write:

\[ P(Y \neq \hat{Y} \mid X = x) = \frac{1}{2} P(Y = 0 \mid X = x) + \frac{1}{2} P(Y = 1 \mid X = x) \]

and,

\[ 1= P(Y = 0 \mid X = x) + P(Y = 1 \mid X = x) \]

Therefore :

\[ \frac{1}{2}\left(P(Y = 0 \mid X = x) + P(Y = 1 \mid X = x)\right)=\frac{1}{2} \cdot 1=\frac{1}{2} \]

Thus, no matter what value \(x\) takes, this random classifier yields a constant misclassification probability of \(0.5\).

Therefore, its 0–1 risk is:

\[ R_{01}(\text{random}) = \mathbb{E}_X [P(Y \neq \hat{Y} \mid X)] = \frac{1}{2}. \]

In contrast, the Bayes optimal (true logistic) classifier achieves:

\[ R_{01}(f) = \frac{1}{1 + e^{\beta}} \]

Since \(\beta > 0\), we know that:

\[ \frac{1}{1 + e^{\beta}} < \frac{1}{2}. \]

This means that the Bayes optimal (true logistic) classifier has a lower 0–1 risk than the random classifier.

Intuitively, the random classifier flips a coin and ignores any information from \(X\), resulting in a constant 50% chance of being wrong. The true logistic classifier, however, uses the relationship between \(X\) and \(Y\) through \(\beta\) to make informed predictions, thus achieving a smaller probability of error.

Hence, the random classifier can never outperform the true classifier in terms of the 0–1 risk:

\[ R_{01}(\text{random}) = \frac{1}{2} \ge R_{01}(f) = \frac{1}{1 + e^{\beta}}. \]

Equality would only occur in the degenerate case when \(\beta = 0\), i.e., when \(Y\) is completely independent of \(X\) (no signal in the data), and both classifiers perform equally poorly.
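The comparison can also be tabulated numerically. The sketch below evaluates both risks on a small grid of illustrative \(\beta\) values. As an assumption that goes beyond the fair coin used in the solution, `random_risk` also accepts a bias `q = P(Ŷ = 1)`; its conditional error is a weighted average of the errors of "always predict 1" and "always predict 0", so it should never fall below the true classifier's risk. All function and parameter names are hypothetical.

```python
import math

def bayes_risk(beta):
    """0-1 risk of the true logistic classifier: 1 / (1 + e^beta)."""
    return 1.0 / (1.0 + math.exp(beta))

def random_risk(beta, p, q=0.5):
    """0-1 risk of a coin-flip classifier with P(Y_hat = 1) = q.

    q = 0.5 is the fair coin from the solution; other q values are an
    assumed generalization, not part of the original question.
    """
    success_treat = 1.0 / (1.0 + math.exp(-beta))  # P(Y=1 | X=1)  = F(beta)
    success_ctrl = 1.0 / (1.0 + math.exp(beta))    # P(Y=1 | X=-1) = F(-beta)
    err_treat = q * (1 - success_treat) + (1 - q) * success_treat
    err_ctrl = q * (1 - success_ctrl) + (1 - q) * success_ctrl
    return p * err_treat + (1 - p) * err_ctrl

for beta in (0.0, 0.1, 0.5, 1.0, 3.0):
    # Fair-coin risk stays at 0.5; the true classifier's risk shrinks with beta.
    print(beta, random_risk(beta, p=0.3), bayes_risk(beta))
```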

Q2 Facts)

There are 4 measures we care about :
Accuracy :

\[ \frac{TP+TN}{TP+TN+FP+FN} \]

Precision :

\[ \frac{TP}{TP + FP} \]

Recall :

\[ \frac{TP}{TP + FN} \]

F1 Score :

\[ 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision}+\text{Recall}} \]
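The four measures can be computed directly from confusion-matrix counts. This is a minimal sketch: the counts are made up for illustration, the function name is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention rather than something stated above.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumed convention: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up counts, for illustration only.
m = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in m.items():
    print(name, round(value, 4))
```

With these counts, accuracy is 85/100, precision is 40/45, recall is 40/50, and F1 works out to 16/19.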