3: Logistic Regression
Categorical Data
Linear to Logistic
The Sigmoid
Log Loss
Minimizing Log Loss
Multiclass
Demonstration of Logistic Regression
Demonstration of Logistic Regression, Part I
Demonstration of Logistic Regression, Part II
Case Study 2
Case Study 2 Introduction
Case Study 2, Part I
Case Study 2, Part II
Case Study 2 Assignment
- categorical data
Hi and welcome to this video on categorical data, where we’ll be
talking about categorical targets and categorical inputs. So categorical
data is when we have data that’s based on classes. Good examples of this
are things like colors–
is something red, green, or blue?–
or logic–
is something true or false? Other examples are things like objects.
Is it a car? Is it a house?
Is it a truck? So this data, when we see it initially–
red, green, and blue–
it’s non-numeric. But all our algorithms actually take numeric
inputs. So how are we going to handle this data, which can be both an
input and possibly a target, since we could be predicting whether
something is going to be a red object, a green object, or a blue object?
It can go on both sides of the equation, both the inputs and the targets.
So how do we make this data into numeric data? The standard method is
what we’re going to call one-hot encoding. In essence, what we’re going
to do is create new indicator variables, and for those indicator
variables, each value of the feature (in this case, each color) will be
a column name. So we’re going to have a set of features that basically
ask, is it red, is it green, is it blue, or is it pink? This is one-hot
encoding, and each row becomes essentially a vector, where the vector
has one position for each of the different values.
So vectors can be sometimes a little bit intimidating. But what we
need to think of is we are just simply creating new variables that say,
is it red, yes or no? So as we see in the example here, the first two
lines are red. So we see a 1 in the red column.
They’re not green, not blue, and not pink, so all of those features
are 0. We can see, as we go down, each time the color appears, there’s a
1 in the appropriate column. So we see this as a vector. So the color
red becomes a vector 1, 0, 0, 0, and then blue becomes the vector 0, 0,
1, 0.
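The indicator columns described above can be sketched in a few lines of Python. This is a minimal illustration, not a library routine: the helper name `one_hot` and the column order in `COLORS` are assumptions chosen to match the red, green, blue, pink order used in the lecture.

```python
def one_hot(value, categories):
    """Return a list with a 1 in the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]

# Column order is an assumption; it follows the order used in the lecture.
COLORS = ["red", "green", "blue", "pink"]

rows = ["red", "red", "blue", "green"]   # made-up sample column
for color in rows:
    print(color, one_hot(color, COLORS))
```

Each encoded row contains exactly one 1, which is the property the next paragraph asks you to check.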
And this allows us to transform these categories or features into
numeric data. And most importantly, not only does it allow us to change
features into data, it allows us to use them as targets as well. So we
may not be used to seeing targets as vectors, but towards the end of our
video series, we’ll see how this plays out. And we can actually predict
multiple categories based on this same concept of is it red, is it green,
is it blue, is it pink?
Now, an important thing to note is only one column should have the 1
in it. All the other columns should be 0. If you do happen to find
rows that have multiple columns that are 1, make sure that that’s what you
really intend. In this case, red, green, blue, and pink are all
exclusive. But perhaps, if you wanted to break it into primary colors
and say, oh, I have red and blue and that will indicate purple, well,
then, you would have a vector 1, 0, 1. That’s extremely uncommon.
Usually, for each unique value of the feature, we add a column. And
only one of the columns will have a 1 indicated, and the rest of the
columns will be 0. But that is not a hard and fast rule. It’s more of a
guideline.
So categorical data is closely related to discrete data, but they are
not the same. And a good example of this is the number of rooms in a
house. Is it categorical? The answer is actually no.
Even though we don’t have like 3 and 1/2 rooms in a house or 3.21
rooms, what we have here is discrete data. So discrete data, it’s
important to still treat it as numeric because, in this case, the value
of 3 actually indicates something that is greater than the value of 2.
There are three rooms versus two rooms.
Now, if you have different classes–
now I’d say I have class 1 or class 2 or, in this example, category 1
or category 2–
is it really that category 2 represents twice the value of category
1? So that’s how you can tell the difference between discrete data and
categorical data. Categorical data needs to be one-hot encoded. Discrete
data, while you can one-hot encode it, can be left in its original
integer form.
So remember, when you’re looking at your possibly categorical data,
make sure that it’s not actually discrete data. Here are a couple of
examples of discrete data. We talked about the number of rooms in a
house. Another is the number of cars a family owns. You don’t own 2 and
1/2 cars or 4.1 cars. How many children do you have? The average family
has 2.4 children. That’s not an actual number of children.
The number of children is discrete. Number of cylinders in a car
engine. Cars don’t come with a partial cylinder. They have two, four,
sometimes five for the really odd ones, but the cylinders in a car
engine is discrete.
So remember, do not one-hot encode discrete data. Allow those values
to inform your model. It’s possible you can do a data transform, but the
best way to handle this discrete data is to input it directly. So once
you’ve actually one-hot encoded your categorical features, you can use
them in nearly any model.
So say we’re trying to figure out the price of a car, and we’re
using a linear regression model, which we learned about in one of the
earlier weeks, and one of the features is the car color. As we look at
the color of the car, we may have black cars and silver cars and gray
cars. What we’re going to do is transform that single column of car
colors into multiple columns, where each column is a color, and there
will be a 1 to indicate, for a row, if that particular car is that
color. So I own a black car. So underneath the Black column, there
would be a 1.
And let’s say our other car colors are red and white. Under the Red
and White columns, there would be 0’s. So now I can put this into a
linear model. I can put it into any model.
Our categories have been transformed into numbers that our model can
actually understand.
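As a rough sketch of the car example, the snippet below keeps a discrete feature (number of cylinders) as a plain integer and one-hot encodes the color before feeding both into a linear model. The function names, the slopes, and the intercept are all made up for illustration; nothing here comes from a fitted model.

```python
CAR_COLORS = ["black", "red", "white"]

def encode_car(cylinders, color):
    """Build one numeric feature row: [cylinders, is_black, is_red, is_white]."""
    indicators = [1 if color == c else 0 for c in CAR_COLORS]
    return [cylinders] + indicators

def predict_price(features, slopes, intercept):
    """Plain linear model: intercept + sum of slope * feature."""
    return intercept + sum(m * x for m, x in zip(slopes, features))

# Hypothetical slopes and intercept, purely for illustration.
slopes = [2000.0, 500.0, 1500.0, 0.0]
intercept = 10000.0

row = encode_car(cylinders=4, color="black")
print(row)                                     # [4, 1, 0, 0]
print(predict_price(row, slopes, intercept))   # 18500.0
```

The cylinder count contributes in proportion to its value, while the color only switches one of the indicator slopes on.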
- linear to logistic
Welcome to our video that’s going to take us from
linear regression to logistic regression. So we’ve encountered
categorical data now, and the question becomes, what happens if the
target or our prediction is actually one-hot encoded? So the idea being,
let’s try and predict if something is true or something is false. And we
saw, in the past, that we one-hot encode these, and that can transform
our categorical data into numeric data.
But, now, what happens if our target is now this category? We’ve
one-hot encoded the target, but, remember, linear regression doesn’t
predict a zero or a one. Linear regression produces a continuous output.
So even if we manage to get that output between zero and one, we would
have a value that could be anywhere from zero to one, and, quite
possibly, could go over or under.
So the question becomes, how can we use what we’ve learned in linear
regression to tackle these problems where the output, or the
prediction, is categorical? So let’s look at what happens with a
continuous output versus a categorical output. We see on the left-hand
side that the output varies with x. So that’s a continuous output. The
discrete output means that y can only take on specific values.
In this case, it would be roughly negative one or one. And it doesn’t
have to jump exactly at zero. It can take on different discrete values
for different ranges of x. It doesn’t have to be x greater than zero, x
less than zero.
It could be something like x greater than negative one or x less than
negative one. But we see that y only takes on discrete values. So
we talked about discrete numbers, but this is now a discrete target.
And, in particular, we’re going to look at binary targets where we’ve
got two values.
One of which will be a value zero, and one of which will be a value
one, which will correspond to our one-hot encoded targets. So to do
that, we’re going to come back to linear regression. And, hopefully,
these mathematical equations look familiar to you, but it’s our old y
equals mx plus b. So, of course, each column, x–
in this case x1, x2, et cetera–
has an associated slope, m1, m2. And a reminder that the intercept is
hiding as m zero. So if you’re wondering, why is this the sum of mi xi,
x0 is actually one, and m0, of course, will then be the intercept. And
this is the compact notation I mentioned a few videos ago.
So what we’re going to do is, we’re going to take a variable
transform. And this variable transform is going to take us to logistic
regression. Now, it seems like this is a strange type of variable
transform. But, as we’re going to see, this actually forces our output
to now essentially be discrete.
We’re going to model a discrete output, and that’s what this function
is giving us. And this function will lead us to logistic regression.
Now, one of the things you’ll notice is I add this sigma at the bottom,
sigma of mi xi. And that simply means that the sigma function, or what
we’re going to call the sigmoid function, takes as its input m times x,
or rather the sum of those products. Typically, we’ll drop the
summation. It’s implied because we know from linear regression that we
have this original top equation, the m1 x1 plus m2 x2, et cetera, et
cetera. Because we’re so familiar with that, for brevity we typically
drop the summation sign. But it’s still there. So I’ve dropped it here
to keep the notation short, and you can see that y is now the sigmoid of
m times x rather than m times x directly.
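Written out, the two models discussed in this video look as follows. The upper limit n (the number of input columns) is a label introduced here for readability; the exact form of the sigmoid is given in the next video.

```latex
% Linear regression in compact notation, with x_0 = 1 so that m_0 is the intercept
y = \sum_{i=0}^{n} m_i x_i = m_0 + m_1 x_1 + m_2 x_2 + \dots + m_n x_n

% Logistic regression: the same sum, passed through the sigmoid function
y = \sigma\!\left( \sum_{i=0}^{n} m_i x_i \right)
\quad \text{usually abbreviated as} \quad
y = \sigma(m x)
```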
- the sigmoid
Hi, and welcome to the video about the sigmoid function, which is one
of the key functions in transforming from a linear regression to a
logistic regression. So in the last video, we talked a little bit about
this variable transform. Now, we can start to see what this variable
transform actually does to our data. So this sigmoid function takes the
input x, and it essentially squeezes the output so that it’s between 0
and 1.
And it gives us the S-shaped curve that we see on the left. However,
as the slopes m get larger and larger, we see that this sigmoid
function becomes much
more like a step function, modeling the behavior that we saw for
discrete output. And this shows us that we can actually get an output
that is essentially forced to be either 0 or 1. Now, we’ll still get
outputs that are between 0 and 1. So some people may say, well, although
we’ve squeezed the tails to 0 and 1, what about the point in the
middle?
And what we’ll look at, at these points in the middle, is these are
going to represent probabilities that we belong to class 1 or class 0.
So what we’re looking at now is we will find a threshold or a cutoff in
our sigmoid function. And we’ll say anything above this value is class
1. And anything below it is class 0.
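To see the squeezing and the step-like behavior numerically, here is a small sketch that evaluates the sigmoid of m times x for a few slopes. The slope values and sample points are arbitrary choices for illustration, and the two-branch form of `sigmoid` is only there to avoid overflow for very negative inputs.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

xs = (-1.0, -0.1, 0.0, 0.1, 1.0)   # arbitrary sample inputs
for slope in (1, 5, 50):           # arbitrary slopes m
    outputs = [round(sigmoid(slope * x), 3) for x in xs]
    print(f"m = {slope:>2}: {outputs}")
```

With a small slope the outputs change gradually around 0.5; with a large slope they sit very close to 0 or 1 on either side of x = 0, which is the step-function behavior described above.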
The default, of course, would be at y equals 0.5, the idea being
essentially rounding. So if the output is 0.51 and above, it belongs to
class 1. And if it is 0.50 and below, it belongs to class 0, like our
normal rounding would be.
However, this is not required. You can make your thresholds wherever
you want. And this is important as we look forward. And we maybe want to
be cautious about certain assignments.
So what if an assignment to class 1 means we’ve detected fraud? Well,
we might want to have a very low threshold, because we want to detect
any possible chance of fraud. So perhaps our y output value is, say,
0.2. That would mean that there’s about a 20% chance that there’s fraud
going on. So maybe we should check that. On the other hand, if you put
the threshold really high, at something like 0.99, you would be saying,
I only flag fraud when I’m extremely certain that it is going on. But
the point is you have the choice to adjust that threshold.
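The effect of moving the cutoff can be sketched as below. The `classify` helper, the example scores, and the particular thresholds 0.15 and 0.98 are assumptions made for illustration; they are not recommended values.

```python
def classify(probability, threshold=0.5):
    """Return class 1 when the probability is above the threshold, else class 0."""
    return 1 if probability > threshold else 0

scores = [0.05, 0.20, 0.51, 0.99]   # made-up sigmoid outputs

print([classify(p) for p in scores])                  # default cutoff of 0.5
print([classify(p, threshold=0.15) for p in scores])  # cautious fraud screen
print([classify(p, threshold=0.98) for p in scores])  # only near-certain cases
```

The same four scores produce three different sets of class assignments, which is why the threshold deserves a deliberate choice.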
So by default, if you don’t instruct your algorithm otherwise, it will
make the cutoff at 0.5. One of the most common errors I see students
make is that they don’t question where that threshold should be. Now,
there
are a number of ways that we can determine that threshold. It’s not
entirely a judgment call.
But be aware that you can make the determination where should your
assignment be for class 1 and class 0. So let’s take a look at the
properties of the sigmoid function. We can see that we have the output y
is 1 over 1 plus an exponential. And that exponential, again, the
implied sum, we have here mx.
But it’s actually a sum over all the m’s and all the x’s. So, again, m0
times x0, x0 being 1, m0 being the intercept, x1 m1, our first column
and first slope. But the idea is if that sum total is extremely large
and it’s positive, we’ll have 1 over 1 plus e to the negative large
number. e to a negative large number is, in essence, a very, very small
number because it’s 1 over e to that power.
So the denominator becomes 1 plus an extremely small number. That gives
us 1 over 1, which means our output is 1. Now suppose m times x itself
is negative. Remember, we have a negative sign in front of m times x in
the exponent. Then we have e to a positive number, because the two
negative signs cancel out. Now, I’ve got an exponential to a very large
power.
Well, anytime I take something to a large power, that number gets
very, very large. So I have 1 over 1 plus a large number. 1 plus a large
number is essentially that large number. And 1 over a large number is 0
roughly.
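The two limiting cases just described can be written compactly. Here mx stands for the full sum of the slopes times the inputs, as in the rest of this video.

```latex
\sigma(mx) = \frac{1}{1 + e^{-mx}}

% mx large and positive: e^{-mx} is tiny, so the output approaches 1
\lim_{mx \to +\infty} \frac{1}{1 + e^{-mx}} = \frac{1}{1 + 0} = 1

% mx large and negative: e^{-mx} is huge, so the output approaches 0
\lim_{mx \to -\infty} \frac{1}{1 + e^{-mx}} = 0
```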
Now, again, I like to talk in physics terms because of my physics
background. But the idea is that we are extremely close to 0 with 1 over
a large number. And that’s the behavior we actually want. We want if the
output is large to be, yes, this is class 1.
And that’s how we’re going to model it. You may ask yourself, why did
we say if mx is large, we want it to be 1, and if mx is negative, we
want it to be 0? Well, that’s the idea of us taking advantage of this
sigmoid function. We will frame the problem such that we want m times x
to be large and positive for class 1, and large and negative for
class 0.
This simulates the output of the step function. And as an added
bonus, although many people are intimidated by the mathematics of
exponentials, exponentials actually have some very friendly properties
that we’re going to take advantage of. And they’ll help us with doing
some advanced mathematics in the next couple of steps. So we’ll use this
sigmoid function to give us an advantage.
And, hopefully, we’ll see that in the next couple of videos. And it
won’t be so obscure why did we choose this particular function. At its
face value, we’re looking at, of course, outputs that are 0 and 1. And
this function starts to simulate that.
And then we’re going to take advantage of that exponential property
as we go forward.
- log loss
Hi, and welcome to our video on logarithmic loss. So, in
linear regression, what we find is that we typically use the mean
squared error or the mean absolute error. And the idea is, we are
measuring the distance between our prediction and the target. And
because distance, of course, is Pythagorean theorem based, we typically
use the square.
However, because we’re also one dimensional with prediction and
target, it’s also possible to use the mean absolute error. And all we’re
doing is just measuring how far away from the actual value is our guess.
And so our loss is actually the sum of the distances from the prediction
at each value of x to the target y. And that’s why we use the vertical
distance rather than the shortest distance to the line.
The idea is that the input data is given and fixed for us. And so we
adjust the slope of the line to provide a prediction. And the distance
from the prediction at input x to the target y is then our vertical
distance. So our objective is to get our straight line as close to the
target points as possible, minimizing all of those distances.
That’s what we’re doing when we’re using both mean squared error and
mean absolute error. However, when we get to logarithmic loss and
categorical data, we’ve got a problem. Here, our output or our target,
our guess, really has only two values. Those values tend to be zero and
one.
So what we need is a loss function that represents our distance to
this target: when we guess, is it target zero or is it target one, how
far away from the target is that guess? We’re actually going to have two
cases, because we have the case where our target is zero and the case
where our target is one. If the target is one, the error is just going
to be minus y times the log of p. And p is just the output of our
sigmoid function.
As I’ve mentioned, if we’ve been dealing with sigmoid functions, we know
that the output of the sigmoid function represents a probability that
that input belongs to a certain class. So p is really just the output,
the sigmoid of m times x. It’s that 1 over 1 plus the exponential. So
what we’re doing is looking at the distance between our target and our
guess or our output.
And we put y in front of it because we’re going to combine these two
cases. Now, in this case, the target y is one, so our error is really
just minus log p. Now, here, I’m using the traditional notation used in
data science. In this case, the log is not log base 10, it’s actually
the natural log.
That will become important in a few moments. Now, when the target is
zero, we see we have a one minus y. In other words, the error is minus
(one minus zero) times the log of one minus the probability. This is
basically an exclusion.
So, just like before, we said we had a probability to belong to class
one. Well, we’ve got a class one and a class zero, and p answers the
question, what’s the probability I belong to class one? But the classes
are mutually exclusive. So that means that the probability I belong to
class zero is one minus the probability of belonging to class one.
It’s simply finding the probability that I belong to the other class.
These two classes are coupled, and so we see the one minus p. So,
again, p is our prediction or output of our sigmoid. So if we put these
two terms together, we get a two-term loss function.
But you see that one of those terms is always zero. If y is one, the
right-hand term becomes one minus one or zero, so that term contributes
nothing to the loss. On the other hand, if y is zero, the left-hand term
becomes zero. And we have, once again, minus log of one minus p.
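Putting the two cases together, the loss for a single example can be written as one expression. The symbol \ell is introduced here only as a name for the per-example loss; log denotes the natural log, as noted above.

```latex
\ell(y, p) = -\,y \log p \;-\; (1 - y) \log (1 - p),
\qquad p = \sigma(mx)

% The target y selects which term survives:
y = 1: \quad \ell = -\log p
\qquad\qquad
y = 0: \quad \ell = -\log (1 - p)
```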
And that’s the idea of, what’s my probability of belonging to class
zero? So both terms are symmetric. What we’re seeing is we are just
taking advantage of our classes being coupled to select which term we’re
going to use. So, here, what we’re doing is, we are penalizing
predictions that are farther away.
So if I’m trying to predict zero and my output is really close to one,
minus the log of one minus the probability is going to be a high
contribution to the loss. On the other hand, if my target is one, then
the output should be close to one. And that means my probability should
be close to one. The farther and farther I get away from one (and,
remember, the only way to get away from one is to move towards zero),
the more that minus log p term increases the value of my error. So the
closer I am to predicting zero, the more I contribute to the loss. In
other words, my error is larger. Sometimes, it helps us to
take a look at this. And let’s take a look at that visually now.
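Before looking at the curves, the penalties can also be checked numerically. This is a minimal sketch: the function name `log_loss_single` and the small clipping constant `eps` (which only guards against taking the log of exactly 0) are assumptions for illustration.

```python
import math

def log_loss_single(y, p, eps=1e-15):
    """Log loss for one example with target y (0 or 1) and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)   # clip so log(0) never occurs
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

for p in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(f"p = {p:<4}  loss if y=1: {log_loss_single(1, p):.3f}  "
          f"loss if y=0: {log_loss_single(0, p):.3f}")
```

Confident correct predictions contribute almost nothing, while confident wrong predictions contribute a large amount, and the two columns mirror each other.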
So, here, we can see what log loss looks like, and we have the
individual terms. We can see the blue line, which is the loss for y
equals one. You’ll see, if we’re predicting close to one, our loss is
very, very small. But if we’re predicting close to zero and our target
is one, we contribute a great amount to the loss. And the idea is, when
we’re trying to minimize this loss equation, what we’re going to try
and do is force our predictions to be correct. In other words, the
model should start outputting predictions that are closer to one
because they will contribute less to the loss.
Likewise, when we’ve got a target that’s zero, the farther away from
zero you are, the more you contribute to the loss. Now, remember, our
sigmoid is going to squash our outputs to be between zero and one. So
being farther away from zero means being closer to one. If we add
these two terms together, we see that it forms a convex curve with a
well-defined minimum. And this is what we always want for any of our
loss minimizations. We want a well-defined minimum. Well, we see that
this logarithmic loss has that well-defined minimum, and it’s a
function of these two terms.
So by using these two terms with the values of y equals one and y
equals zero, we formed this loss function with a well-defined minimum.
That gives us an advantage because a well-defined minimum is easy for
us to find.
- minimizing log loss
Hi, and welcome to the video on minimizing log loss. So, once we have
a log loss equation, our question becomes, how do we minimize that? And
of course, minimizing our log loss, means that our model is producing
good predictions. So a small loss, or a minimal loss, gives us what we
want–
a good, well-fit model. We’ve discussed the loss function. We have a
prediction function. So if we take the sigmoid prediction function and
our logarithmic loss, let’s try and combine them in a similar fashion to
linear regression.
And keep in mind that although we see the value p, p is really a
function of x. Or rather than saying function of x, p contains the
data, x, and is actually a function of the slopes times x. So we’ve
got a little bit of math here. But it’s important for us to understand
how this works, and how it’s similar to linear regression.
So traditionally, we denote the loss with the value J. And here, what
I’m doing is summing up all the loss terms for each prediction. So
again, if the target is 1, we’re going to use the left-hand term. If
the target is 0, we’re going to use the right-hand term.
And as we sum all of those up, we will get the final loss. And just a
quick reminder that I have p in the loss equation. But remember that p
is actually the sigmoid that contains our data and our slopes.
And remember, that’s the same as linear regression, that sum of mi
xi. So, we’ve got things that are fixed. Our target, y, is fixed. It’s
part of our data. Our x values are fixed. That’s our data too. So to
reach a minimum value of J, the only thing that can change is the
slope.
And because we have this function that has a well-defined minimum, if
we take the slope of the function J, what we can do is we can find the
minimum. Because what we’ll do by looking at the slope is we’ll find,
which way does the direction of the minimum lie? Does it lie to our
left? Does it lie to our right in each dimension?
And so by taking the partial derivative of J with respect to the
slope, we will get an update rule for the value of the slope. Now
conveniently, the update rule for logistic regression has exactly the
same form, down to the variables, as the one for linear regression. In
other words, I have this sigmoid function. And I have an update rule.
And when I take the actual derivative of the loss, it turns out that
the form of the update is the same; the only difference is that the
prediction now comes from the sigmoid. I’ve included a link here so
that you can go through the details of the derivation yourself. But the
key takeaway from all this mathematics is our update is the same. And if
our update is the same, that means that we can use all the algorithms
from linear regression to update our values of the slope for logistic
regression.
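As an illustration of that shared update, here is a minimal batch gradient descent sketch for logistic regression. It assumes the standard form of the update, slope minus learning rate times the average of (prediction minus target) times input; the tiny data set, the learning rate of 0.5, and the 500 iterations are made-up values, not settings from the course.

```python
import math

def sigmoid(z):
    """Logistic function, written to avoid overflow for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict(slopes, row):
    """p = sigmoid(sum of m_i * x_i); row[0] is the constant x0 = 1."""
    return sigmoid(sum(m * x for m, x in zip(slopes, row)))

def gradient_step(slopes, rows, targets, learning_rate):
    """One batch gradient-descent update of every slope."""
    n = len(rows)
    errors = [predict(slopes, row) - y for row, y in zip(rows, targets)]
    return [
        m - learning_rate * sum(e * row[j] for e, row in zip(errors, rows)) / n
        for j, m in enumerate(slopes)
    ]

# Tiny made-up data set: x0 = 1 (intercept column) and one feature x1.
rows = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
targets = [0, 0, 1, 1]

slopes = [0.0, 0.0]
for _ in range(500):
    slopes = gradient_step(slopes, rows, targets, learning_rate=0.5)

print([round(predict(slopes, row), 3) for row in rows])
```

Replacing `predict` with the plain sum of m times x would turn the same loop into gradient descent for linear regression, which is the point of the update rules matching.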
So now we’ve extended our ability to model problems from linear
problems to categorical problems, simply by using an appropriate loss
function. So now, hopefully it starts to become a little bit clearer why
we use this particular exponential function. It was actually chosen so
that those update rules come out the same. And again, just a quick
reminder, although I’ve got log, L-O-G, these are actually natural
logs.
And that can be important if you’re taking those derivatives, because
the natural log derivative is actually 1 over the value. So the
derivative of natural log of p is 1 over p. That plays an important role
in this update rule coming out the same.
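For readers who want the short version of that derivation, the chain rule steps are sketched below. The example index k and the symbol z_k for the weighted sum are notation introduced here; the squared-error comparison assumes the usual factor of one half in front of the sum.

```latex
\begin{aligned}
J &= -\sum_{k} \Big[\, y_k \log p_k + (1 - y_k) \log (1 - p_k) \Big],
\qquad p_k = \sigma(z_k), \quad z_k = \sum_i m_i x_{k,i} \\[4pt]
\frac{\partial J}{\partial p_k} &= -\frac{y_k}{p_k} + \frac{1 - y_k}{1 - p_k},
\qquad
\frac{d p_k}{d z_k} = p_k (1 - p_k),
\qquad
\frac{\partial z_k}{\partial m_j} = x_{k,j} \\[4pt]
\frac{\partial J}{\partial m_j}
&= \sum_k \Big[ -y_k (1 - p_k) + (1 - y_k)\, p_k \Big] x_{k,j}
 = \sum_k (p_k - y_k)\, x_{k,j}
\end{aligned}
```

The 1 over p from the natural log cancels against the p(1 - p) from the sigmoid, leaving (prediction minus target) times input, the same form that squared error gives for linear regression.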
- multiclass
Hi, and welcome to the video on multiclass classification. So a
question may have come to your mind. We’ve been studying how to do
binary classification. And our sigmoid function gives us a binary
output, class 1 or class 0. That’s why you may come across the log loss
described as binary cross-entropy loss. But as we extend to more and
more complex problems,
not every problem is a simple example of two classes. We could have
three classes, 10 classes, 1,000 classes, or even a million classes.
So how can we get from our binary classification, the output of the
sigmoid being 0 or 1, to multiple classes? And a simple example of this
is the iris data set that’s a classic toy data set. There are three
species of flower within the iris. And the question would become, how
would I do a logistic regression on that?
Because a sigmoid output will only be 0 to 1, but I’ve somehow got
three classes. And even if I did a one-hot encoding, my target is no
longer a single 1 or 0. There are actually three columns that have ones
and zeros.
So how would I adapt my logistic regression to deal with these multiple
classes? So the first method that we can use is something that’s called
one-versus-all or one-versus-the-rest.
So what we’re going to do is for each class, we will train a
discriminator or a model. So if I have three classes that correspond to
the colors red, blue, and green, my classifications would be red/not
red, green/not green, blue/not blue. So I’ve got a classifier for each
class. And what we’re going to do is we’ll find which class gets the
highest output value.
What happens is you’ll look at the sigmoid, and although we often
talk about the sigmoid outputting a 0 and 1, in reality, we get some
value between 0 and 1. And so to pick the winning class, what we will do
is find out which output is the largest. Now we frame these problems so
that when we do the red/not red classifier, red is class 1, and not red
is class 0. That way the “not” value is always class 0, and the value
itself is class 1.
So we will come out with three values. And what we will do is we will
look at those three values, and they may all be very large and very
positive. The key is, which one is the largest? So if we look and say
red has a score of 0.97
and green has a score of 0.22 and blue has a score of 0.50, well,
because red has the highest score, that’s the winning model. So we’re
playing our models off against each other.
It’s not really a competition, but we’re looking at each of the
output scores and finding which one gives us the max score. And
oftentimes, we’ll see a function called argmax, which looks at the
outputs of all three classifiers, finds which one is the maximum, and
returns that class rather than the value itself. That’s the
one-versus-rest or one-versus-all method. The downside, of course, is
you have to keep a model for each of your classes.
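The argmax step for one-versus-rest can be sketched in a couple of lines. The helper name `argmax_class` is an assumption; the scores are the ones quoted in the lecture and stand in for the class 1 outputs of three separately trained classifiers.

```python
def argmax_class(scores):
    """Return the class whose one-versus-rest score is largest."""
    return max(scores, key=scores.get)

# Class 1 ("is this color") outputs from the red, green, and blue classifiers.
scores = {"red": 0.97, "green": 0.22, "blue": 0.50}
print(argmax_class(scores))   # red
```

Note that the scores do not need to sum to 1, since each comes from an independent classifier; only their ordering matters.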
So this isn’t so bad if we’ve got three classes. But if we had a lot
of classes (100, or 1,000, or even 100,000 or a million classes), that
could be quite time consuming. So that is one of the drawbacks of
logistic regression. It’s not inherently multiclass, but we can adapt
it. But as the number of classes increases, it becomes more and more
difficult for us to find a solution that’s easy for us to implement.
It’s hard to manage millions of models. It’s not so bad to manage
three models. So here is another example of some outputs for red,
blue, and green. We look at the red score, 0.87, the blue score, 0.22,
and the green score, 0.45. We would look at those and form a vector of
the positive outputs. And when I say the positive outputs, I mean the
class 1 outputs.
So here I have 0.87, 0.22, and 0.45. So with 0.87 being the largest
value corresponding to the red value, once again, we’re going to
classify that as red.
So one of the issues is that the data each classifier sees is not
evenly distributed. What our classifiers are going to see is a large
number of negative examples. And because classifiers learn based on
probability, they may start to favor the “not” class. Let’s say our data
itself is evenly distributed, with 1/3 red, 1/3 blue, and 1/3 green.
Well, when I go to the red/not red classifier, 1/3 of the data is red,
but 2/3 of the data is not red.
And this can have the effect of biasing our models towards the 0
class probability. So keep that in mind, especially as the number of
classes increases. This would get worse if I had 10 classes, 10 colors.
Then 10% of the data would be red and 90% of the data would be not
red.
So as a basic guess, to get 90% accuracy you could simply predict that
100% of the data is not red. That’s an example of a biased model.
Now in reality, the classifier will hopefully pick up which input
variables contribute to the class red. But be aware that as you get more
and more classes, your classifiers will start to get biased toward the
negative samples.
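The 90% figure can be reproduced with a short sketch. The ten color names and the 100 examples per color are made up; the "classifier" here is deliberately trivial and never looks at any input features.

```python
def always_negative_accuracy(labels, positive_class):
    """Accuracy of a 'classifier' that answers 'not positive_class' every time."""
    correct = sum(1 for label in labels if label != positive_class)
    return correct / len(labels)

# Made-up balanced data: 10 colors, 100 examples of each.
colors = ["red", "orange", "yellow", "green", "blue",
          "indigo", "violet", "pink", "brown", "gray"]
labels = [color for color in colors for _ in range(100)]

print(always_negative_accuracy(labels, "red"))   # 0.9
```

A red/not red model with 90% accuracy may therefore have learned nothing at all, which is why accuracy alone is a weak check for one-versus-rest classifiers.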
And there are a number of things that we can do to take care of that
bias, but you have to be aware of it and address it as you get to
larger and larger multiclass problems. Now, there’s another way to
handle multiclass, called one-versus-one, and it’s more of a
head-to-head battle of classifiers. Again, we ask which one’s the max,
which one’s the best, in a head-to-head battle. But it’s setting up a
comparison for every pair of classes. The problem is this can get out of
hand quite quickly, as the number of classifiers starts to expand quite
quickly.
So here, it’s going to be even worse as we go to many classes. What
we’re going to do is compare things like, is it red or is it green? Is
it green or is it blue? As we go through that sequence, we can start to
see there are factorials involved: the number of pairs is the number of
classes times one less than that, divided by two. That’s why we see the
number of classifiers expanding quite a bit more than we had for
one-versus-all. With one-versus-all, if we had 10 classes, we would have
10 classifiers. With one-versus-one, 10 classes means 45 classifiers.
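The growth in the number of models can be checked with a short sketch. The helper name `one_vs_one_pairs` is an assumption; it simply lists every unordered pair of classes, which is one head-to-head classifier per pair.

```python
from itertools import combinations

def one_vs_one_pairs(classes):
    """List every head-to-head pairing needed for one-versus-one."""
    return list(combinations(classes, 2))

print(one_vs_one_pairs(["red", "green", "blue"]))
# [('red', 'green'), ('red', 'blue'), ('green', 'blue')]

for k in (3, 4, 10, 100, 1000):
    pairs = k * (k - 1) // 2   # "k choose 2"
    print(f"{k} classes: {k} one-versus-rest models, {pairs} one-versus-one models")
```

One-versus-rest grows linearly with the number of classes, while one-versus-one grows roughly with its square.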
But this also allows direct comparisons between the different
classes. And the idea here is the class that wins the most votes is the
assigned class. So for the example of red and green and blue, we would
have a red/green and a red/blue. So red has two opportunities to
win.
Well, just like that, if you’ve got the red/green, you’ve also got
green and blue. So green has two chances to win, and so does blue.
That’s why you see as the number of classes expands, the more
head-to-heads we tend to get. And this can lead to some crazy solutions
where some of the classes aren’t super popular, but it’s very difficult
for our classifiers to distinguish between different parts.
So one-versus-one, although it is used for smaller numbers of classes,
is typically not used as we go to larger and larger numbers of classes.
But one-versus-one is out there, and in many of our algorithms it is
implemented behind the scenes, so you don’t have to write this from
scratch. You can say, I want to do a one-versus-one classifier. Here are
my
inputs, here are my targets, and the algorithm will set up all these
classifiers for you.
So here I’ve got a four-value example of Texas, Iowa, California, and
Florida. Now it turns out I need six head-to-head classifiers. So
because SMU is in Texas, I purposely set this up so that Texas would
win. And we don’t actually know what these values are, but this is just
an example.
And so I’ve got the Texas-Florida classifier. And you can see that
Texas score is the largest value there. So Texas is, quote unquote, “the
winner” of that classifier. And we have the Texas versus Iowa.
And you can see once again, Texas comes out on top. Well then we’ve
got the Texas versus California. And we see once again, Texas is
triumphant. All of our students in Texas are celebrating right at this
point.
Now we go to the California versus Iowa. Before we were looking at
Texas, all the Texas head-to-heads. Now let’s look at the California
head-to-heads. So California has already gone head-to-head with
Texas.
Now it needs to go head-to-head with Iowa and Florida. And you can
see it comes out on the losing end versus Iowa. So Iowa wins that
head-to-head battle. And California comes out on the bottom against
Florida.
So Florida has won one. So our score now is Texas 3, Iowa 1, Florida
1. So what we’ve got left is the Florida-Iowa battle. And of course, the
classifier for this result says Iowa comes out on top.
So our final tally is Texas got three votes, Iowa got two votes,
Florida got one vote, and California has zero. And we call these votes,
but we’re just tallying the winner of each classifier. So we can sum
these up and we’ll get these results. But what can happen is a state, or
in this case, a class, can actually lose a head-to-head and still win
the overall vote,
which can seem strange. So remember, because this is a summary of
outputs at each head-to-head, you can get different results. So one of
the other problems is you can end up with ties, where everybody ends up
in a tie. It’s rare, but it can happen.
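The tally from the four-state walkthrough can be sketched in a few lines. The winners below are the ones read off in the example; the function names `tally_votes` and `leaders` are invented for this sketch and are not part of any library.

```python
from collections import Counter

# Winner of each head-to-head, as read off in the walkthrough.
head_to_head_winners = {
    ("Texas", "Florida"): "Texas",
    ("Texas", "Iowa"): "Texas",
    ("Texas", "California"): "Texas",
    ("California", "Iowa"): "Iowa",
    ("California", "Florida"): "Florida",
    ("Florida", "Iowa"): "Iowa",
}

def tally_votes(winners, classes):
    """Count one vote per classifier won; classes with no wins stay at 0."""
    votes = Counter({name: 0 for name in classes})
    votes.update(winners.values())
    return votes

def leaders(votes):
    """Return every class sharing the top vote count (more than one means a tie)."""
    top = max(votes.values())
    return sorted(name for name, count in votes.items() if count == top)

states = ["Texas", "Iowa", "California", "Florida"]
votes = tally_votes(head_to_head_winners, states)
print(dict(votes))
print(leaders(votes))
```

If `leaders` returns more than one name, the vote count alone cannot pick a class, which is the tie situation described here.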
So here we’ve got two votes each for three states across the six
classifiers. So we have a tie.
There’s no way to break that tie statistically in how this is set up. So
we have two methods to attempt multiclass.
Neither one is perfect. Remember that the one-versus-one can have
these ties. And the one-versus-rest or the one-versus-all can suffer
from unbalanced classifiers. Again, we’ve got that idea of class/not the
class.
And as the number of classes increases, we have this bias towards the
negative class, just because of the number of samples that the classes
are going to see. So both of these are implemented what I like to call
“under the hood.” We’re not going to set these up individually. What
we’ll do is we’ll go to Sklearn and we’ll say, I would like a
one-versus-one model.
Here’s my data,
run. Or, I would like a one-versus-rest, a one-versus-all, multiclass
method. Here’s my data,
set it up. So it takes place behind the scenes, and it’s
well established how these are run. And so you get your final outputs.
So once again, it allows us to focus on what’s going on with the problem
rather than implementing the software solution.
The software solution is there. It’s for us to use the appropriate
software correctly, but it’s already been optimized so we can get the
best performance and less worrying about the specifics of the coding and
more about the problem itself.
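A minimal sketch of what “here’s my data, run” might look like, assuming Sklearn’s `OneVsOneClassifier` and `OneVsRestClassifier` wrappers around a logistic regression. The ten-class digits data set and the raised `max_iter` are choices made for this illustration, not something from the lecture.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Ten-class toy data set (digits 0-9), chosen here only for illustration.
X, y = load_digits(return_X_y=True)

# max_iter is raised from the default to limit convergence warnings.
base_model = LogisticRegression(max_iter=2000)

ovo = OneVsOneClassifier(base_model).fit(X, y)    # one classifier per pair of classes
ovr = OneVsRestClassifier(base_model).fit(X, y)   # one classifier per class

# The fitted sub-models are kept in estimators_.
print(len(ovo.estimators_), len(ovr.estimators_))
```

The wrappers build and fit every underlying classifier, so the only decision left to the user is which strategy to ask for.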
- Demonstration of Logistic Regression, Part I
Welcome to our demonstration of logistic regression. So to get
started, we’ll do our traditional imports. And now what we want to do is
we want to find a binary classification set. To practice, Sklearn has a
number of data sets.
You can always scroll through them. And I’ve chosen one to start with
that gets us started with our binary classification. So we’re going to
be taking a look at the breast cancer data set. Let’s go ahead and do
our first tab.
As always, let’s take a look at our data. So you can see this is not
actually a data frame. What we’ve got here is actually a dictionary,
where we’ve got our data and we’ve got our targets. And then we’ve got
all the information we need.
We just have to put it into a data frame. That’s not too hard. And
this is the traditional way that data sets are stored in Sklearn. So if
we’re not familiar with dictionaries, we’re about to get a short
lesson.
So you can see this is our actual array of data. So let’s start to
create a data frame. And I’ll just call it cancer data frame. And for
brevity, this is the data that’s going to go in.
So we’ve got our indexes. Now you’ll notice we don’t have any column
names. But the column names were actually there in the raw data. Let’s
take a quick look.
So we’ll have to remember to add our targets here. But here’s the
description of all the different names. So let’s see if we can pull up
those descriptions. So these are all the descriptions.
And let’s see if we can pull out the names for these really quickly.
So this is going to take us a little bit. So I may cut the video at this
point while I come up with a quick way to put these together. We can see
that it’s not exactly easy to pull these out and would take a little bit
of formatting.
It’s not impossible. But for the meantime, I’m just going to move on
without having my actual names. We know this data set is well formatted
because it’s a toy data set as well as you can read the summary on
Sklearn. So we’ll just continue.
But we need to not forget that we actually have to add the target
as well. So let’s go ahead and add that to our data frame. So let’s look
at our final data. Oops.
This is why I tab complete quite a bit. So we can see we’ve got our
values. And if we wanted to look up what those values are, we could.
Then we see we’ve got targets of 0’s and 1’s.
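The demo moves on without column names, but the object returned by `load_breast_cancer` also carries a `feature_names` entry, so one way to build the data frame could look like the sketch below. The variable names `cancer` and `cancer_df` are chosen for this example.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()           # dictionary-like object, not a data frame
print(cancer.keys())                    # data, target, feature_names, DESCR, ...

# feature_names holds one label per column of the data array.
cancer_df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
cancer_df["target"] = cancer.target     # add the 0/1 targets as the last column

print(cancer_df.shape)
print(cancer_df["target"].unique())
```

Keeping the names makes it easier to tie coefficients back to specific measurements later on.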
So we’ve got our data. It’s binary classified. What we need to do is
let’s start our basic logistic regression. Now, you’ll notice at this
point, I have not scaled the data.
But we can still get fairly good results without scaling the data.
And by not scaling the data initially, we can perhaps compare and
contrast the importance of scaling the data. Fortunately, doing our
logistic regression is actually going to be fairly quick. Now, this is
the first time we’ve encountered our Sklearn.
You can notice that there are multiple versions of logistic
regression. Now we always like to do cross-validation. But you can use
regular logistic regression and a cross-validation from other parts of
Sklearn. The Logistic Regression CV, because it’s such good practice,
has it built in.
At this point, let’s go ahead and take a look at our Logistic
Regression CV. So I’ve imported this class. And we’re going to use this
as our model. Now, a little bit of object-oriented programming, Logistic
Regression CV is a class.
So we’ve just imported this class. And now what we need to do is
we’re going to create a model. So our model is an instance of the
Logistic Regression CV class. Now, we can accept the default parameters
by just using empty parentheses.
If you’re interested in all the options, you may want to take a look
at the Sklearn documentation. So we may get some warnings as we start
this. But this is our base logistic regression cross-validation. Let’s
go ahead and instantiate it.
So now we’ve got a model, now it’s time for us to fit our model. So
to fit our model, we need both targets and data. But that was in the
data frame. We’ve already got that.
So I’ve got my data frame. But remember, the data frame contains both
the targets and the data. So we need a way to split this up. I can do
that with pandas location.
Or I can actually split these up. And either way, we will get,
essentially, our x data and our y data. So what I can do for the y data,
we know that that’s the target. Now, this may look a little crazy.
But what I’m doing is I’m creating an object where I drop the target
column. And then here are my targets–
the actual target column. You may be afraid that when I drop the
target, I’m going to drop it from the data frame completely. However, in
this case, because I’m not assigning the result to anything, this just
returns an object. It has no name.
So this is essentially our x data, while this is our y data. We’ll
probably get a warning because we didn’t specify how many folds in our
cross-validation. But let’s go ahead and see if this works. Well, we
got “not found in axis.”
So I probably misspelled this. And let’s see if we can figure
out–
hmm, this is interesting. So this is one of my most common mistakes.
I forget to tell it if I’m dropping a column or a row. OK, so we’ve got
some warnings here.
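The row-versus-column mix-up is easy to reproduce on a tiny stand-in data frame. The column names below are made up; the point is only that `drop` looks along the rows unless told otherwise, and that it returns a new object rather than changing the original.

```python
import pandas as pd

# Tiny stand-in for the data frame in the demo; the column names are made up.
df = pd.DataFrame({"feature_a": [1.0, 2.0, 3.0],
                   "feature_b": [0.5, 0.1, 0.9],
                   "target": [0, 1, 0]})

try:
    df.drop("target")             # default axis=0 looks for a ROW labelled "target"
except KeyError as err:
    print("KeyError:", err)

X = df.drop("target", axis=1)     # axis=1 (or columns="target") drops the column
y = df["target"]

print(list(X.columns))
print(list(df.columns))           # df itself still has the target column
```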
And these are our results. And while these are warnings, they’re not
of too much concern to us. It’s telling us we’ve reached our
iteration limit. In other words, it iterated,
but it may not have converged on a solution. So we finally got it
done. And we’ve got our results. So we don’t have anything printed.
How do we see our results? Well, they’re actually stored as part of
the model. Now, you’ll notice I actually put–
I did that quite quickly. Let me stop and go back. Because they’re
stored as part of the model, this model is an object and it has
attributes with it. Those attributes can be accessed through various
methods.
Again, we’re into a little bit of object-oriented programming. But
the key thing is we can access those attributes. When I hit the period
here, this means I want one of the attributes or methods of my model.
What I can do is now I’m going to hit Tab.
And this will bring up a list of attributes. So this will tell us all
the different things we want to know. First of all, the C, that’s going
to be our regularization. We may or may not have talked about
regularization at this point.
But an important reminder that regularization is key to any model.
One of the most important things is, of course, our coefficients or our
slopes. So let’s take a look at what our coefficients are. This is going
to tell us all the coefficients or slopes for each column.
Now, at this point, our columns are unnamed. But we can actually tie
these back because these are in order. This is the coefficient for the
first, or zeroth, column, the second column, and so on and so forth.
Obviously, the larger the coefficient in absolute value, the more
important the column is.
But if you remember, we didn’t normalize that data. So the range of
the data can have an impact on these coefficients. So the importance is
somewhat hidden. And we’ll see that when we normalize our data in just a
moment.
But we’ve got our coefficients out. Let’s see if there’s anything
else. Let’s take a look. Is the intercept part of these coefficients or
not?
One of the quick ways we can look at that is let’s take a look at our
data frame. And remember, we’ve got an extra column in there. That’s the
target. So we see that there are 31 columns.
Let’s take a look at how many coefficients there are. So you’ll
notice there are 30 coefficients. Now, this shape was 31. But don’t
forget we’ve got the targets in there.
So we’ve got a coefficient for each column, which means the
y-intercept is not actually included in these coefficients. It’s always
good to check to see whether they consider the intercept independent of
the coefficients. Because many times, we group it in. And because data
science is a conglomerate and brings in methods from multiple sciences,
you can never be sure.
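The same check can be written as a short script. This is a sketch that loads the features directly rather than reusing the demo’s data frame; `cv=5` is an assumption made here, and because the data is left unscaled, convergence warnings are to be expected.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegressionCV

X, y = load_breast_cancer(return_X_y=True)   # feature columns only, binary target

model = LogisticRegressionCV(cv=5)   # unscaled data, so convergence warnings are likely
model.fit(X, y)

print(X.shape)                  # (n_samples, n_features)
print(model.coef_.shape)        # one slope per feature column
print(model.intercept_.shape)   # the intercept is stored separately
```

Comparing the second dimension of `coef_` with the number of feature columns shows directly whether the intercept has been folded in.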
So it’s always good to check whether or not your assumptions are
correct. I would have initially assumed that the intercept was part of
these coefficients. But in this case, it’s not. Let’s take a look at
some of the other things that are involved here.
So as we saw, we’ve got the different values. We see here is the
intercept. And we’ve got all the other parameters. What we didn’t
do–
my apologies for accidentally clicking on that–
was we didn’t know how many cross-validations we did. So let’s see if
we can find that out now. And it’s actually not telling us. So we can’t
understand at this point what was the default cross-validation.
This is sometimes when we might have to go to the documentation. But
let’s see if we can possibly change that because we know this is a value
in our logistic regression. If we remember our fit method or possibly
our instantiation, we can possibly change the value for CV. So let’s go
back.
And I’m going to create a new model. And, of course, it’s an instance
of the Logistic Regression CV. And let’s see if we can figure out how to
change the number of folds. Now, we’re not seeing anything.
I press Tab to come up. And so it’s not showing us. One of the things
I can do is–
I know that typically one of the things is n folds. Now, I’ve got n
jobs. And at this point, because I don’t use logistic regression a
lot, I may have to go back and refresh my memory with the
documentation. So let’s go and do that.
And let’s take a look at the Sklearn documentation. It’s a little bit
hard to remember all the things. So don’t be ashamed if you can’t
remember these things off the top of your head. So let’s create a new
tab.
And I’m going to go and open up the Sklearn documentation.
- Demonstration of Logistic Regression, Part II
Here we are at the SK Learn documentation. What we’re going to do is
we’re going to look for that logistic regression CV. Now there are a
number of ways to get there. We’ll just type that in.
And we see logistic regression CV pop up. Now we’ve got our
documentation. And we can see right here how to define the different
important parameters. And we see that we may have caught ourselves a
little bit because we see the default is CV equals none.
And so we have this CV. And we look down here. It’s the
cross-validation generator. And the default being none.
So we started off assuming we were doing cross-validation, but
because we didn’t pass an integer, we didn’t know how many folds we
were getting. In current versions of SK Learn, none falls back to a
default five-fold split rather than skipping cross-validation. So to
control the number of folds, we’ll actually assign an integer. This is
why it’s helpful to always bring up the documentation. It’s fairly
clear here in SK Learn.
And it’s one of the benefits of this package to go through. So while
we’re here, we can take a look at all the different options just to
remind ourselves that we’re doing things the correct way. So we start
with our cross-validation. And as we browse through this, we see that
this would be our regularization, our Cs, as I may have mentioned
before.
That may not be obvious. I’ve been here before so I know that C
represents the regularization. We’ll see our penalties down here. We’ll
talk about that frequently throughout the course.
We’ll talk about our scoring. So we can have a report on the
different scoring. And right now it’s not telling us any scoring. We saw
the stopping criteria.
Now you may ask yourself, what do you mean stopping criteria? As the
different solvers run, they stop when the loss stops decreasing by
more than a certain amount. And so we don’t typically use
this, but that’s what our warnings were about, if you noticed
and were reading along. We have various options for class weights and
efficiencies.
These are more computation. So our main thing is let’s go back and do
our cross-validation. So I’m going to click back. And let’s do a
five-fold cross-validation.
So we’ve got this new model. And let’s go ahead and fit it. Now I’m
going to go and scroll up. And I’m just going to copy what I used
before, because I know it works.
So we’re seeing we’re still getting the number-of-iterations warning.
And we see we’re still not getting results. Let’s see if we can add in that
scoring metric. So we’re just going to rewrite, even though this is our
new model.
And let’s see what the accuracy is. So once again, we’re just going
to use the same fit command. And we haven’t had anything come out. So is
that stored somewhere within our model?
So let’s take a look. Aha. We’ve got scores in here. So what we’re
seeing on scores is not what we expect.
We saw for scoring we expected actually five values out. So this is
something that’s not our actual scores of our cross-validation. So when
we actually do the score, this is asking us for our x and y values. What
we’re finding is that the CV method does not actually return to us the
scores for our cross-folds.
So what we can do is we can actually move back to a regular
logistic regression and use the model selection module to do
cross-validation.
However, at this point, let’s see what our score is just for our basic x
and y. I’m just going to quickly pull these. And you can see we’ve got
97% accuracy.
However, the astute among you will notice this is data that was used
to fit the model. So this is a biased estimate. And what we’re really
interested in is our unbiased estimate. It doesn’t appear that the
logistic regression CV is immediately forthcoming with those unbiased
estimates.
I’m going to take one more quick look before we move on to a
different method. From this perspective, I’m going to move on to the
cross-fold validation in the model selection to see if we can get a
better estimate of our scoring. To do that, I will import the
cross-validation method. And we’re going to switch to regular logistic
regression.
What I’ve done here is I’ve imported a method called cross_val_score.
This will give us the scores across a cross-validation for any model. So
this is not unique to logistic regression like the logistic regression
CV was. So let’s go ahead and use this cross_val_score with our log reg
model, which is a simple logistic model.
So what we’ll need to do is we’ll need to understand how to do the
cross_val_score. For that, let’s take a quick look at SK Learn. We see
here cross_val_score takes an estimator. That’s going to be our log reg
model, our x and y data.
It’s going to take our scoring. So that will be what metric we want
to use. And then it will use a default five-fold cross-validation. Now
you’ll notice here it says none.
However, if we read in the details, none uses the default five-fold
cross-validation. This is probably the origin of why I myself tend to
use five-fold cross-validation so that we know if we call this with the
default parameters, we’re getting a five-fold cross validation. So let’s
go back and do that. Now we need our estimator.
We need our x data and our y data. We remember that we don’t need to
specify a CV because the default is 5. Let’s add our accuracy in. This
line has gotten a little long, but you can see I’ve got my
cross-validation score.
I put in what model I’m going to use. I have my x data. I have my y
or target data. And I have the scoring I want to use.
Let’s go ahead and run it. And you can see here we’ve got a slightly
lower value. This is important. It’s not bad.
We didn’t do anything incorrectly. What these represent are the
unbiased estimates. Before our 97% was a biased estimate. We used the
entire data to fit that particular model.
However, when we ran the data through, that data had been used to
build the model. In other words, the model had seen the data ahead of
time, resulting in biased results. Here, of course, we split our data.
And so when data was sent through for scoring, that data was not used to
build the model.
And of course, the reason we have five scores with five different
values is we split the data into five parts. And each time that part of
the data was held out as a test set and the other parts were used to
train the model and fit it. The test set was then run through without
the model having seen it before and thus produces an unbiased
prediction. So we see that while these predictions are a lower score,
they’re actually a more accurate–
with no relation to our scoring using accuracy, they’re a more
accurate representation of how our model will perform on unknown data.
This is a good example of how logistic regression and cross-validation
can be used together to get that estimate of performance on an unknown
data set. At this point, we’re going to go ahead and conclude our demo
on logistic regression.
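The comparison between the biased and unbiased estimates can be sketched as follows. The variable names and the raised `max_iter` (used to avoid convergence warnings on the unscaled data) are assumptions for this example, and the exact scores will depend on the Sklearn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# max_iter is raised from the default so the solver can converge on unscaled data.
log_reg = LogisticRegression(max_iter=10000)

# Unbiased estimate: each fold is scored by a model that never saw it.
cv_scores = cross_val_score(log_reg, X, y, scoring="accuracy", cv=5)

# Biased estimate: the model is scored on the same data used to fit it.
train_score = log_reg.fit(X, y).score(X, y)

print("fold accuracies:  ", cv_scores)
print("mean CV accuracy: ", cv_scores.mean())
print("training accuracy:", train_score)
```

The five fold accuracies correspond to the five held-out parts of the data, while the single training accuracy is the optimistic number from scoring on data the model has already seen.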
- Case Study 2 Introduction
Welcome to our case study on diabetes.
This week’s case study is one that has some possibly controversial data.
I want you to carefully follow the discussion and develop your own
thoughts about the material being discussed. Some people may want to
avoid this type of conversation.
However, as data scientists, we are often confronted with things
that have ethical concerns. So as you watch this video, continue to
monitor how our protagonist asks about the problem. And then also watch
how they raise their concerns. See if you have the same or different
concerns.
Feel free to submit those as part of your participation. Don’t
forget, important details about the problem are included in the video.
Once again, we are simulating a real world problem and a presentation
from a data science professional to a, this time, slightly more seasoned
data scientist. Again, we will pause to have you take a look at the
data.
And you can submit further questions towards the end. Those questions
can then be uploaded as part of your participation grade.
- Case Study 2, Part I
All right. So this week we have a problem coming to us from the
medical community. We’re looking specifically at a diabetes study. And
the problem is hospital readmission.
Now we don’t want people in hospitals. We want them to be well. And
we certainly don’t want them to be readmitted. This comes at a huge cost
to the patient in terms of bills, lost wages, strain on their family and
whatnot.
So our goal is no readmission. So for this study, what we’re trying
to do is we want to try to predict readmission of the patient within 30
days of initial hospitalization. So take a look at that data and we’ll
come back and take any questions you have.
- Case Study 2, Part II
So you’ve had a chance to look at the data. Do you have any questions
about it?

I actually do have a couple of questions about this particular
assignment. And the first one is, well, pretty important to me.
It’s regarding the race category. And I don’t understand its
significance. It seems actually kind of like sorting patients with this
criteria could be seen as racially biased or something. I mean, simply
talking about identifying who was getting readmitted to the
hospital. So I really don’t understand why race matters.

OK. Absolutely. And thank you for bringing that up because I
absolutely agree with you. Normally, this is not something that we
would take into consideration. Sorting by race can bring in a lot of
ethical considerations. In this case, we’re talking about the medical
community. We’re talking about patients.
And we do know that diabetes affects different demographics
differently. So race actually could very well be a factor in this. Now,
that being said, I will leave it to you. As long as we can chart the
trends accurately, I’m not as concerned about how we get there.

OK. Got it. Thank you. So my next question is all about all of these
question marks that are in the data set. What’s going on with that?

Yeah. The study took place over 10 years.
There’s something like 130 hospitals that they were pulling data
from. There are a lot of people entering data into this study. And so
there are holes. So we’re just going to have to make the best
recommendation we can based on the data that we have.

OK. Understood. I’ll get working right away. Thank you.
You will submit your assignment to this page.
Please refer to your course syllabus for additional details about
this assignment.
When you save your assignment, the file name should be “First
Name_Last Name_Assignment Name.”
Your case study is to build classifiers using logistic regression to
predict ALL 3 CATEGORIES of hospital re-admittance. There is missing
data that must be imputed.
Once again, discuss the five most important variables of your best
model as part of your submission.
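The question marks mentioned in the discussion have to be turned into proper missing values before anything can be imputed. The sketch below shows one simple possibility, filling each column with its most common value; the column names and category labels are hypothetical stand-ins, and most-frequent imputation is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the case-study data; names and values are made up.
raw = pd.DataFrame({
    "race": ["Caucasian", "?", "AfricanAmerican", "Caucasian"],
    "payer_code": ["MC", "?", "?", "MC"],
    "readmitted": ["NO", "<30", ">30", "NO"],
})

cleaned = raw.replace("?", np.nan)     # mark the question marks as missing
missing_counts = cleaned.isna().sum()  # how many holes per column
print(missing_counts)

# Simple imputation: fill each column with its most frequent value.
imputed = cleaned.copy()
for col in imputed.columns:
    most_common = imputed[col].mode().iloc[0]
    imputed[col] = imputed[col].fillna(most_common)

print(imputed)
```

Counting the missing values per column first helps decide whether a column is worth imputing at all or is mostly holes.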
What is Discrete Data?
Discrete data refers to a type of quantitative data
that consists of distinct, separate values. These values are countable
and cannot take on every possible value within a range. In other words,
discrete data is made up of whole numbers or specific categories, and
there are no intermediate values between two data points.
Key Characteristics of Discrete Data
- Countable: Discrete data represents items that can
be counted (e.g., the number of students in a class).
- Finite or Infinite: It can have a finite set of
values (e.g., dice rolls: 1, 2, 3, 4, 5, 6) or an infinite set (e.g.,
the number of times a coin is flipped until heads appears).
- No Fractions or Decimals: Discrete data cannot take
fractional or decimal values (e.g., you can’t have 2.5 students).
- Categorical or Numerical: While discrete data is
often numerical, it can also represent categories (e.g., types of cars:
sedan, SUV, truck).
Examples of Discrete Data
- Numerical Examples:
  - Number of children in a family (e.g., 0, 1, 2, 3).
  - Number of cars in a parking lot.
  - Number of goals scored in a soccer match.
- Categorical Examples:
  - Shoe sizes (e.g., 7, 8, 9, 10).
  - Types of fruits in a basket (e.g., apples, oranges, bananas).
Discrete vs. Continuous Data
| Aspect | Discrete Data | Continuous Data |
| --- | --- | --- |
| Definition | Countable, distinct values | Measurable, infinite values within a range |
| Examples | Number of students, dice rolls | Height, weight, temperature |
| Values | Whole numbers or categories | Can include fractions and decimals |
| Graph Representation | Bar charts, pie charts | Histograms, line graphs |
How to Analyze Discrete Data
- Visualization:
  - Use bar charts or pie charts to represent discrete data visually.
  - Example: A bar chart showing the number of students in each grade.
- Statistical Measures:
  - Calculate measures like mean, median, mode, and range.
  - Example: The average number of cars sold by a dealership per day.
- Probability:
  - Discrete data is often used in probability distributions, such as
    the binomial distribution or Poisson distribution.
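The statistical measures listed above can be computed with Python’s standard library. The daily sales counts below are invented purely for illustration.

```python
import statistics

# Hypothetical counts of cars sold per day over two weeks.
cars_sold = [3, 5, 2, 5, 4, 0, 1, 5, 3, 2, 4, 5, 1, 2]

print("mean:  ", statistics.mean(cars_sold))
print("median:", statistics.median(cars_sold))
print("mode:  ", statistics.mode(cars_sold))
print("range: ", max(cars_sold) - min(cars_sold))
```

Note that a summary such as the mean does not have to be a whole number even when every observation is.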
Applications of Discrete Data
- Business: Tracking the number of products sold or
customer complaints.
- Education: Counting the number of students in
different classes.
- Healthcare: Recording the number of patients
visiting a clinic daily.
- Sports: Counting goals, points, or wins in a
game.
Conclusion
Discrete data is a fundamental concept in statistics and data
analysis. It represents countable, distinct values and is often used in
scenarios where whole numbers or specific categories are involved.
Understanding discrete data is essential for analyzing and interpreting
real-world phenomena effectively.
Summary: Transitioning from Linear Regression to Logistic
Regression
This explanation focuses on the transition from linear
regression to logistic regression,
particularly when dealing with categorical data and
binary classification tasks.
- Problem with Linear Regression for Categorical Targets:
  - Linear regression predicts continuous outputs, which are not
    suitable for categorical targets (e.g., predicting 0 or 1 for binary
    classification).
  - Even if the targets are restricted to 0 and 1, linear regression
    can still produce values outside this range, making it unreliable for
    classification tasks.
- Discrete vs. Continuous Outputs:
  - Continuous outputs vary smoothly with input values (e.g., linear
    regression).
  - Discrete outputs, like binary targets (0 or 1), only take specific
    values and are better suited for classification tasks.
- Binary Classification:
  - The target variable is binary (e.g., true/false, 0/1).
  - One-hot encoding is used to transform categorical data into numeric
    data, but linear regression cannot directly handle this for
    classification.
- Introducing Logistic Regression:
  - Logistic regression applies a variable transformation to convert
    the continuous output of linear regression into a probability.
  - This transformation is achieved using the sigmoid function (denoted
    as σ), which maps any input to a value between 0 and 1.
- Sigmoid Function:
  - The sigmoid function takes the linear regression equation
    (y = mx + b) as input and outputs a probability value between 0
    and 1.
  - This ensures the output is suitable for binary classification
    tasks.
- Compact Notation:
  - Logistic regression builds on the familiar linear regression
    equation (y = m1x1 + m2x2 + ... + b), but applies the sigmoid
    function to the result.
  - The summation sign is often implied for brevity, as the structure of
    the equation remains similar to linear regression.
Key Takeaway:
Logistic regression extends linear regression by introducing a
sigmoid function to model binary categorical
outputs. This transformation ensures the output is constrained
between 0 and 1, making it suitable for classification tasks where the
target is discrete.
This slide introduces logistic regression by building on the familiar
linear regression equation and applying a transformation to make it
suitable for classification tasks, particularly binary classification.
1. Linear Regression Equation
The original linear regression equation is:
\[
y = m_1x_1 + m_2x_2 + \dots = \sum m_ix_i
\]
- Explanation:
  - \(m_i\): Coefficients (slopes) for each feature \(x_i\).
  - \(x_i\): Input features (independent variables).
  - \(y\): Output (dependent variable), which is continuous in linear
    regression.
2. Problem with Linear Regression for
Classification
- Linear regression produces continuous outputs, which are not ideal
for classification tasks where the target is categorical (e.g., 0 or 1
for binary classification).
- To address this, we need to transform the output into a
probability that lies between 0 and 1.
3. Logistic Regression Equation
The logistic regression equation introduces the sigmoid function to
transform the output:
\[
y = \frac{1}{1 + e^{-\sum m_ix_i}}
\]
- Explanation:
  - The sigmoid function maps any real number (from \(-\infty\) to
    \(+\infty\)) to a value between 0 and 1.
  - This makes the output interpretable as a probability for binary
    classification.
4. Compact Notation
The equation can be written more compactly as:
\[
y = \sigma(\sum m_ix_i)
\]
- \(\sigma\): The sigmoid function, defined as:
\[
\sigma(z) = \frac{1}{1 + e^{-z}}
\]
- The summation sign (\(\sum\)) is often dropped for clarity, as it
  is implied that the input is the weighted sum of features (\(m_ix_i\)).
Key Takeaways
- Linear regression is extended to logistic
regression by applying the sigmoid function to the output.
- The sigmoid function ensures the output is between 0 and 1, making
it suitable for binary classification tasks.
- Logistic regression predicts the probability of a
binary outcome (e.g., 0 or 1), where the decision boundary is typically
set at 0.5.
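As a minimal sketch of the step from the linear equation to the logistic one, the snippet below computes the weighted sum and passes it through the sigmoid. The slopes `m` and inputs `x` are made-up values for illustration only, not values from the lecture.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical slopes and inputs, chosen only for illustration
m = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 2.0, 0.5])

linear_output = np.dot(m, x)               # sum of m_i * x_i, can be any real number
probability = sigmoid(linear_output)       # squashed into (0, 1)
predicted_class = int(probability >= 0.5)  # default 0.5 decision boundary

print(linear_output, probability, predicted_class)
```

Here the weighted sum is negative, so the probability falls below 0.5 and the example is assigned to Class 0.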
Summary and Expansion: Understanding the Sigmoid Function in
Logistic Regression
The sigmoid function is a key mathematical tool in transforming
linear regression into logistic regression. It maps continuous input
values into a range between 0 and 1, making it suitable for binary
classification tasks. Here’s a breakdown and deeper explanation of its
properties and applications:
1. What Does the Sigmoid Function Do?
- Purpose: The sigmoid function transforms the output
of a linear equation into a probability between 0 and 1.
- Equation: \[
y = \frac{1}{1 + e^{-\sum m_ix_i}}
\]
- \(m_i\): Coefficients (slopes) for
each input feature \(x_i\).
- \(\sum m_ix_i\): The linear
combination of features and coefficients.
- \(e\): Euler's number (approximately 2.718), the base of the natural logarithm.
- Behavior:
- For large positive values of \(\sum
m_ix_i\), the output approaches 1.
- For large negative values of \(\sum
m_ix_i\), the output approaches 0.
- For values near 0, the output is approximately 0.5.
This transformation creates an S-shaped curve
(sigmoid curve), which is ideal for modeling probabilities.
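The behavior described above can be checked numerically. This is a small sketch; the particular `z_values` are arbitrary choices, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Arbitrary inputs, symmetric around 0
z_values = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
outputs = sigmoid(z_values)

for z, y in zip(z_values, outputs):
    print(f"z = {z:6.1f} -> sigmoid(z) = {y:.5f}")
```

The outputs also show the symmetry of the S-shaped curve: the values at \(z\) and \(-z\) add up to 1.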
2. Relationship to Binary Classification
- Binary Outputs: Logistic regression predicts
probabilities for two classes:
- Class 1: Probability close to 1.
- Class 0: Probability close to 0.
- Thresholding:
- A default threshold of \(y = 0.5\)
is often used:
- \(y \geq 0.5\): Assign to Class
1.
- \(y < 0.5\): Assign to Class
0.
- However, the threshold can be adjusted based on the problem (e.g.,
fraud detection, medical diagnosis) to balance sensitivity and
specificity.
3. Properties of the Sigmoid Function
- Output Behavior:
- When \(\sum m_ix_i\) is very large
and positive:
- \(e^{-\text{large positive number}} \to
0\), so \(y \to 1\).
- When \(\sum m_ix_i\) is very large
and negative:
- \(e^{-\text{large negative number}} \to
\infty\), so \(y \to 0\).
- Smooth Transition:
- The sigmoid function smoothly transitions from 0 to 1, avoiding
abrupt changes like a step function.
- This smoothness allows the function to output probabilities, which
can then be thresholded for classification.
4. Adjusting the Threshold
- Default Threshold: \(y =
0.5\).
- Custom Thresholds:
- Lower thresholds (e.g., \(y =
0.2\)) increase sensitivity, useful for detecting rare events
like fraud.
- Higher thresholds (e.g., \(y =
0.9\)) increase specificity, useful for high-confidence
decisions.
- Importance of Thresholds:
- Choosing the right threshold is critical for balancing false
positives and false negatives, depending on the application’s
requirements.
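To make the trade-off concrete, here is a hedged sketch that applies three thresholds to the same predicted probabilities. The labels, probabilities, and the helper `sensitivity_specificity` are all hypothetical and exist only for this illustration.

```python
import numpy as np

# Hypothetical true labels and predicted probabilities, for illustration only
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
p_pred = np.array([0.95, 0.60, 0.30, 0.40, 0.25, 0.10, 0.05, 0.70])

def sensitivity_specificity(y_true, p_pred, threshold):
    """Classify with the given threshold and return (sensitivity, specificity)."""
    y_hat = (p_pred >= threshold).astype(int)
    tp = np.sum((y_hat == 1) & (y_true == 1))
    fn = np.sum((y_hat == 0) & (y_true == 1))
    tn = np.sum((y_hat == 0) & (y_true == 0))
    fp = np.sum((y_hat == 1) & (y_true == 0))
    return tp / (tp + fn), tn / (tn + fp)

results = {t: sensitivity_specificity(y_true, p_pred, t) for t in (0.2, 0.5, 0.9)}
for t, (sens, spec) in results.items():
    print(f"threshold={t}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

On this toy data, lowering the threshold catches every positive at the cost of more false positives, while raising it removes the false positives but misses most of the positives.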
5. Why the Sigmoid Function?
- Simulates a Step Function:
- At extreme values, the sigmoid function behaves like a step
function, producing outputs close to 0 or 1.
- Mathematical Simplicity:
- Despite involving an exponential, the sigmoid has a simple derivative, \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\), which keeps gradient calculations cheap.
- Probabilistic Interpretation:
- The output of the sigmoid function can be interpreted as the
probability of belonging to Class 1.
6. Practical Considerations
- Framing the Problem:
- Logistic regression is framed such that positive values of \(\sum m_ix_i\) correspond to Class 1, and
negative values correspond to Class 0.
- Applications:
- Logistic regression is widely used in applications like fraud
detection, medical diagnosis, and spam classification.
Conclusion
The sigmoid function is the cornerstone of logistic regression,
enabling the transformation of continuous linear outputs into
probabilities suitable for binary classification. Its smooth, S-shaped
curve and ability to simulate step-like behavior make it ideal for
modeling discrete outcomes. By adjusting thresholds and leveraging its
mathematical properties, the sigmoid function provides flexibility and
precision in a variety of real-world applications.
Summary of Logarithmic Loss and Its Derivation
Logarithmic Loss (Log Loss) is a critical loss function used in
classification problems, particularly for binary classification tasks.
It evaluates the performance of a classification model by penalizing
predictions that deviate from the true target probabilities. Here’s a
breakdown of the key ideas:
- Loss Functions in Regression:
- For regression, we typically use Mean Squared Error
(MSE) or Mean Absolute Error (MAE) to minimize
the vertical distance between predictions (\(\hat{y}\)) and true targets (\(y\)).
- These measure error using continuous distances between predicted and
actual values.
- Challenges in Classification:
- Classification involves discrete labels, such as \(y = 0\) or \(y =
1\).
- Predictions are probabilistic, e.g., \(p(y=1 | x)\), derived from models like
logistic regression with a sigmoid activation
function.
- Log Loss Function:
- The sigmoid function outputs probabilities (\(p\)) between 0 and 1: \[
p = \sigma(z) = \frac{1}{1 + e^{-z}}
\]
- The log loss function measures how close these probabilities are to
the true labels using the natural logarithm (\(\ln\)).
- Mathematical Derivation:
- Log loss for \(y=1\): \[
\text{Loss} = -\ln(p)
\]
- Log loss for \(y=0\): \[
\text{Loss} = -\ln(1-p)
\]
- Combined, for general \(y\)
(binary): \[
\text{Log Loss} = - \big( y \ln(p) + (1-y) \ln(1-p) \big)
\]
- Here, \(p\) is the model’s
predicted probability for \(y=1\), and
\(1-p\) for \(y=0\).
- Intuition:
- Penalizing incorrect predictions: Logarithms
amplify penalties for predictions that are far from the true label.
- A prediction of \(p=0.99\) for
\(y=1\) incurs a small loss, but \(p=0.01\) incurs a large loss.
- For logistic regression, the log loss is convex in the model weights, so any local minimum is also a global minimum.
- Visualization:
- The log loss curves for \(y=1\) and
\(y=0\) are mirror images, reflecting
their symmetry.
- The total loss function forms a convex shape, simplifying
optimization.
Mathematical Notation in Code
Python Code Example
Below is a Python implementation of the log loss function for binary
classification.
import numpy as np

def sigmoid(z):
    """Compute the sigmoid of z."""
    return 1 / (1 + np.exp(-z))

def log_loss(y_true, y_pred):
    """
    Compute the log loss for binary classification.
    Parameters:
    - y_true: True labels (0 or 1).
    - y_pred: Predicted probabilities (between 0 and 1).
    Returns:
    - Average log loss.
    """
    epsilon = 1e-15  # Avoid log(0) by clipping predictions
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return loss

# Example usage
y_true = np.array([1, 0, 1, 0])  # True labels
y_pred = np.array([0.9, 0.1, 0.8, 0.2])  # Predicted probabilities
print("Log Loss:", log_loss(y_true, y_pred))
Visualization in Python
The following Python code visualizes log loss for \(y=1\) and \(y=0\).
import matplotlib.pyplot as plt
# Probabilities from 0 to 1
p = np.linspace(0.01, 0.99, 100)
# Log loss for y=1 and y=0
loss_y1 = -np.log(p)
loss_y0 = -np.log(1 - p)
# Plot
plt.figure(figsize=(8, 6))
plt.plot(p, loss_y1, label='Loss (y=1)', color='blue')
plt.plot(p, loss_y0, label='Loss (y=0)', color='orange')
plt.title("Logarithmic Loss for Binary Classification")
plt.xlabel("Predicted Probability (p)")
plt.ylabel("Log Loss")
plt.legend()
plt.grid()
plt.show()
Summary of Key Insights
- Log Loss combines terms for \(y=1\) and \(y=0\) into a single, symmetric
function.
- It penalizes incorrect predictions more as they deviate from the
true class probability.
- The convexity of log loss ensures optimization is feasible with
methods like gradient descent.
This foundation makes log loss a standard for classification tasks
and a crucial metric for training and evaluating models like logistic
regression and neural networks.
Summary and Expansion: Logarithmic Loss (Log Loss) in Logistic
Regression
Logarithmic loss, or log loss, is a key loss function used in
logistic regression to evaluate the performance of a model when
predicting probabilities for binary classification
tasks. It measures how far the predicted probabilities are from
the actual target values (0 or 1).
1. Why Not Use Mean Squared Error (MSE)?
- Linear Regression: In linear regression, we use MSE
or Mean Absolute Error (MAE) to measure the distance between predictions
and targets. These methods work well for continuous outputs.
- Categorical Data: For binary classification, the
target values are either 0 or 1. Using MSE is not ideal: combined with
the sigmoid it gives a loss that is not convex in the weights, and it
penalizes confident wrong predictions only mildly (the squared error never exceeds 1).
- Need for Log Loss: Log loss is designed
specifically for classification tasks, penalizing incorrect predictions
more effectively by using probabilities.
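As a rough illustration of the difference in penalties, the sketch below scores three hypothetical predictions for a true label of 1 with both squared error and log loss. The probabilities are made up for the comparison.

```python
import numpy as np

y = 1  # true label
probabilities = [0.99, 0.5, 0.01]  # hypothetical predicted probabilities of y = 1

squared_errors = [(y - p) ** 2 for p in probabilities]
log_losses = [-np.log(p) for p in probabilities]

for p, se, ll in zip(probabilities, squared_errors, log_losses):
    print(f"p = {p:4.2f}: squared error = {se:.4f}, log loss = {ll:.4f}")
```

The squared error can never exceed 1, whereas the log loss keeps growing as the predicted probability of the true class approaches 0.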
2. Logarithmic Loss Function
The log loss function combines two cases: when the target \(y = 1\) and when \(y = 0\). It is defined as:
\[
\text{Log Loss} = - \left[ y \cdot \log(p) + (1 - y) \cdot \log(1 - p)
\right]
\]
- Key Components:
- \(p\): Predicted probability
(output of the sigmoid function).
- \(y\): Actual target value (0 or
1).
- \(\log\): Natural logarithm (base
\(e\)).
- Explanation:
- When \(y = 1\): The first term
\(-y \cdot \log(p)\) is active, and the
second term becomes zero.
- When \(y = 0\): The second term
\(-(1 - y) \cdot \log(1 - p)\) is
active, and the first term becomes zero.
- This ensures that only the relevant term contributes to the loss
based on the actual target.
3. Intuition Behind Log Loss
- Penalty for Incorrect Predictions:
- If the predicted probability \(p\)
is close to the actual target \(y\),
the loss is small.
- If \(p\) is far from \(y\), the loss is large.
- Behavior:
- For \(y = 1\), if \(p\) is close to 1, \(-\log(p)\) is small. If \(p\) is close to 0, \(-\log(p)\) becomes very large.
- For \(y = 0\), if \(p\) is close to 0, \(-\log(1 - p)\) is small. If \(p\) is close to 1, \(-\log(1 - p)\) becomes very large.
This penalization encourages the model to predict probabilities
closer to the true target.
4. Why Use Log Loss?
- Probabilistic Predictions:
- Logistic regression outputs probabilities using the sigmoid
function. Log loss evaluates these probabilities rather than discrete
predictions.
- Coupled Classes:
- Since binary classification involves two mutually exclusive classes
(0 and 1), log loss accounts for both cases simultaneously.
- Convexity:
- Log loss forms a convex curve with a well-defined
minimum, which is ideal for optimization algorithms like gradient
descent.
5. Visualization of Log Loss
- Loss for \(y =
1\):
- As \(p \to 1\), the loss approaches
0 (correct prediction).
- As \(p \to 0\), the loss becomes
very large (incorrect prediction).
- Loss for \(y =
0\):
- As \(p \to 0\), the loss approaches
0 (correct prediction).
- As \(p \to 1\), the loss becomes
very large (incorrect prediction).
- Combined Loss:
- The total log loss curve is convex, ensuring a single global
minimum.
6. Practical Applications
- Thresholding:
- Logistic regression predicts probabilities. A threshold (e.g., 0.5)
is used to classify probabilities into binary outcomes.
- Log loss itself does not depend on the threshold: it scores the predicted
probabilities directly, and the threshold is applied only afterward to turn them into class labels.
- Model Training:
- Minimizing log loss during training ensures the model outputs
probabilities that align closely with the true labels.
7. Advantages of Log Loss
- Handles Probabilities: Unlike MSE, it evaluates
probabilistic predictions effectively.
- Penalizes Confident Wrong Predictions: Predictions
far from the true target are penalized more heavily.
- Optimizable: The convex nature of log loss ensures
that optimization algorithms can find the global minimum.
Conclusion
Log loss is a critical tool for evaluating and training logistic
regression models. By penalizing incorrect predictions based on
probabilities, it ensures the model outputs probabilities that are both
accurate and reliable. Its mathematical properties and intuitive
behavior make it ideal for binary classification tasks.
Minimizing Log Loss: Summary and Detailed
Expansion
Conceptual Overview
Minimizing logarithmic loss (log loss) is a core
step in training logistic regression models. A smaller log loss
indicates that the model’s predicted probabilities closely align with
the true class labels, resulting in better classification
performance.
Key Components
- Log Loss Function:
- The log loss \(J\) for binary
classification is: \[
J(\mathbf{w}) = -\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log(p_i) + (1 -
y_i) \log(1 - p_i) \Big]
\] where:
- \(N\) = number of data points,
- \(y_i\) = true label (\(0\) or \(1\)),
- \(p_i = \sigma(z_i) = \frac{1}{1 +
e^{-z_i}}\), the predicted probability for \(y=1\),
- \(z_i = \mathbf{w}^T
\mathbf{x}_i\), the linear combination of weights (\(\mathbf{w}\)) and features (\(\mathbf{x}_i\)).
- Prediction Function (Sigmoid):
- The sigmoid activation squashes the linear combination of inputs
into a probability: \[
\sigma(z) = \frac{1}{1 + e^{-z}}
\]
- Ensures \(0 < p < 1\),
suitable for probability estimation.
- Optimization Goal:
- Minimize \(J(\mathbf{w})\) by
adjusting the weights \(\mathbf{w}\),
which control the decision boundary of the classifier.
Mathematics of Minimizing Log Loss
To minimize \(J(\mathbf{w})\), we apply gradient descent:
- Gradient of Log Loss w.r.t. Weights:
- The partial derivative of \(J\) with respect to \(\mathbf{w}\) is: \[
\frac{\partial J}{\partial \mathbf{w}} = \frac{1}{N} \sum_{i=1}^{N}
(\sigma(z_i) - y_i) \mathbf{x}_i
\]
- Intuition:
- \(\sigma(z_i) - y_i\): The prediction error.
- \(\mathbf{x}_i\): Feature vector influences weight updates.
- Gradient Descent Update Rule:
- Update weights iteratively: \[
\mathbf{w} \leftarrow \mathbf{w} - \eta \frac{\partial J}{\partial
\mathbf{w}}
\]
- \(\eta\): Learning rate controls
step size.
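One way to gain confidence in the gradient formula above is to compare it with a finite-difference estimate. The following is a sketch under stated assumptions: the data are random, and the helper names (`analytic_gradient`, `numerical_gradient`) are invented for this check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w, X, y):
    """Average binary log loss J(w)."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_gradient(w, X, y):
    """Gradient from the formula: (1/N) * sum((sigma(z_i) - y_i) * x_i)."""
    return X.T @ (sigmoid(X @ w) - y) / len(y)

def numerical_gradient(w, X, y, h=1e-6):
    """Central finite differences, one weight at a time."""
    grad = np.zeros_like(w)
    for j in range(len(w)):
        step = np.zeros_like(w)
        step[j] = h
        grad[j] = (log_loss(w + step, X, y) - log_loss(w - step, X, y)) / (2 * h)
    return grad

# Random data and weights, used only to exercise the two functions
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = (rng.random(50) < 0.5).astype(float)
w = rng.normal(size=3)

max_difference = np.max(np.abs(analytic_gradient(w, X, y) - numerical_gradient(w, X, y)))
print("Largest difference between the two gradients:", max_difference)
```

If the formula is implemented correctly, the two gradients should agree to many decimal places.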
Coding Representation
Here’s the Python implementation:
1. Sigmoid Function
import numpy as np

def sigmoid(z):
    """Compute sigmoid function."""
    return 1 / (1 + np.exp(-z))
2. Log Loss Function
def log_loss(y_true, y_pred):
    """
    Compute binary log loss.
    Parameters:
    - y_true: True binary labels (array-like).
    - y_pred: Predicted probabilities (array-like).
    Returns:
    - Average log loss.
    """
    epsilon = 1e-15  # To avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
3. Logistic Regression Using Gradient Descent
def logistic_regression(X, y, lr=0.01, epochs=1000):
    """
    Train logistic regression using gradient descent.
    Parameters:
    - X: Feature matrix (N x d).
    - y: Labels, a 1-D array of length N (an N x 1 column would broadcast incorrectly).
    - lr: Learning rate.
    - epochs: Number of iterations.
    Returns:
    - weights: Trained weights.
    - losses: Log loss per epoch.
    """
    N, d = X.shape
    weights = np.zeros(d)  # Initialize weights
    losses = []
    for epoch in range(epochs):
        # Linear combination
        z = np.dot(X, weights)
        # Predicted probabilities
        preds = sigmoid(z)
        # Compute log loss
        loss = log_loss(y, preds)
        losses.append(loss)
        # Gradient calculation
        gradient = np.dot(X.T, (preds - y)) / N
        # Weight update
        weights -= lr * gradient
    return weights, losses
4. Visualization of Log Loss Minimization
import matplotlib.pyplot as plt
# Generate synthetic data
np.random.seed(42)
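features = np.random.rand(100, 2)
y = (features[:, 0] + features[:, 1] > 1).astype(int)
# Prepend a column of ones so the first weight acts as the intercept;
# without it the decision boundary is forced through the origin
X = np.column_stack([np.ones(len(features)), features])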
# Train logistic regression
weights, losses = logistic_regression(X, y, lr=0.1, epochs=200)
# Plot log loss
plt.plot(losses)
plt.title("Log Loss Minimization")
plt.xlabel("Epochs")
plt.ylabel("Log Loss")
plt.grid()
plt.show()
Key Insights
- Convexity of Log Loss:
- For logistic regression, log loss is convex in the weights, so any local minimum is a global minimum.
- With a suitably small learning rate, gradient descent converges toward that minimum.
- Weight Updates and Error:
- Each weight update moves the weights in the direction opposite to the
gradient, which lowers the loss when the learning rate is small enough.
- Relationship to Linear Regression:
- Logistic regression extends linear regression by squashing
predictions into probabilities using the sigmoid function.
- The gradient update rule is structurally similar, differing mainly
in the sigmoid nonlinearity.
This mathematical and coding breakdown highlights the intuitive and
practical aspects of minimizing log loss, making logistic regression a
robust tool for binary classification tasks.
Summary and Expansion: Minimizing Log Loss in Logistic
Regression
This explanation focuses on how to minimize the log loss
function in logistic regression, which is essential for
training the model to make accurate predictions. The process involves
leveraging optimization techniques, particularly gradient descent, to
adjust the model’s parameters (slopes) and achieve a well-fit model.
1. Objective of Minimizing Log Loss
- Goal: Minimizing log loss ensures that the model
produces predictions that are as close as possible to the actual target
values.
- Small Loss = Good Model: A smaller log loss
indicates a well-trained model with accurate predictions.
2. Log Loss Function Recap
- The log loss function (\(J\)) is
defined as: \[
J = - \frac{1}{n} \sum_{i=1}^n \left[ y_i \cdot \log(p_i) + (1 - y_i)
\cdot \log(1 - p_i) \right]
\]
- \(y_i\): Actual target (0 or
1).
- \(p_i\): Predicted probability
(output of the sigmoid function).
- \(n\): Number of data points.
- Key Insight:
- \(p\) is not just a number; it is
the output of the sigmoid function: \[
p = \sigma(m \cdot x) = \frac{1}{1 + e^{-m \cdot x}}
\]
- The sigmoid function depends on the input data (\(x\)) and the model parameters (\(m\), the slopes).
3. Minimizing Log Loss
- To minimize log loss, the process involves:
- Fixing the Data:
- The target values (\(y\)) and input
data (\(x\)) are fixed and cannot be
changed.
- Adjusting the Slopes:
- The slopes (\(m\)) are the only
variables that can be updated to minimize \(J\).
4. Gradient Descent for Optimization
- Gradient Descent:
- Gradient descent is used to find the minimum of the log loss
function.
- The partial derivative of \(J\)
with respect to each slope (\(m_i\))
determines the direction and magnitude of the update.
- The update rule is: \[
m_i = m_i - \alpha \cdot \frac{\partial J}{\partial m_i}
\]
- \(\alpha\): Learning rate (controls
the step size of the update).
- \(\frac{\partial J}{\partial
m_i}\): Partial derivative of \(J\) with respect to \(m_i\).
- Key Insight:
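- The derivative of the log loss function with the sigmoid function is
mathematically structured such that the update rule has the same form as
that of linear regression: each update is proportional to (prediction minus
target) times the input. The only difference is that the prediction is the
sigmoid of the linear output. This is due to the choice of the sigmoid and log functions.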
5. Why the Sigmoid and Log Functions?
- The sigmoid and log functions were specifically chosen because:
- The log loss function is convex, ensuring a single
global minimum.
- The derivative of the natural log (\(\log_e\)) is simple: \[
\frac{d}{dp} \log(p) = \frac{1}{p}
\]
- This simplicity ensures that the update rules for logistic
regression align with those for linear regression, allowing the same
optimization algorithms to be reused.
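The simplification can be traced with the chain rule. The following is a sketch of the derivation, using \(j\) to index the slopes (features) and \(i\) to index data points so the two indices do not collide; it also uses the standard identity \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\).

```latex
% Single data point, with p = \sigma(z) and z = \sum_j m_j x_j
J = -\left[ y \log(p) + (1 - y) \log(1 - p) \right]

\frac{\partial J}{\partial p} = -\frac{y}{p} + \frac{1 - y}{1 - p} = \frac{p - y}{p(1 - p)}

\frac{dp}{dz} = \sigma(z)\left(1 - \sigma(z)\right) = p(1 - p),
\qquad
\frac{\partial z}{\partial m_j} = x_j

\frac{\partial J}{\partial m_j}
  = \frac{\partial J}{\partial p} \cdot \frac{dp}{dz} \cdot \frac{\partial z}{\partial m_j}
  = (p - y)\, x_j

% Averaged over n data points
\frac{\partial J}{\partial m_j} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)\, x_{ij}
```

The \(p(1 - p)\) factors cancel, leaving (prediction minus target) times the input, which matches the gradient of the squared-error loss in linear regression up to a constant factor.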
6. Practical Implications
- Unified Optimization:
- The fact that logistic regression shares the same form of update rule as
linear regression simplifies implementation.
- Algorithms like stochastic gradient descent (SGD) or batch gradient
descent can be directly applied.
- Extending Linear Regression:
- By applying the sigmoid to the linear output and changing the loss
function (from mean squared error to log loss), logistic regression
extends linear regression to handle categorical problems.
7. Key Takeaways
- Minimizing Log Loss:
- Achieved by adjusting the slopes (\(m\)) using gradient descent.
- The convexity of the log loss function ensures an easy-to-find
global minimum.
- Reusability:
- The derivative of the log loss function aligns with linear
regression’s update rules, making optimization straightforward.
- Why the Sigmoid Function?:
- The sigmoid function ensures outputs are probabilities (between 0
and 1).
- Its exponential structure simplifies derivatives, which is crucial
for efficient optimization.
Conclusion
Minimizing log loss in logistic regression is a straightforward
extension of linear regression’s optimization process. By using the
sigmoid function and log loss, the model can handle categorical data
effectively while maintaining the same optimization framework. This
mathematical design ensures both simplicity and efficiency in training
logistic regression models.
Multiclass Classification: Overview and
Explanation
Multiclass classification extends binary classification techniques to
problems where there are more than two classes. For instance, instead of
determining if an input belongs to “cat” or “dog,” we might classify
between “cat,” “dog,” and “rabbit.”
Key Approaches to Multiclass Classification
1. One-vs-Rest (OvR) Method
- Concept: Train separate binary classifiers for each
class.
- For \(K\) classes, we train \(K\) models:
- \(\text{Class}_i\) vs. “not-\(\text{Class}_i\)”
- Example:
- Classes: Red, Green, Blue.
- Models: Red-vs-not-Red, Green-vs-not-Green, Blue-vs-not-Blue.
- Prediction: Use the argmax
function to choose the class with the highest probability: \[
\hat{y} = \text{argmax}_k \; \text{sigmoid}(\mathbf{w}_k^T \mathbf{x})
\]
- Pros:
- Simple and easy to implement.
- Works well for a small number of classes.
- Cons:
- Each binary problem is imbalanced (one class against all the others),
which can bias the classifiers towards the negative class.
- Requires one model per class, so training cost grows linearly with \(K\).
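A minimal sketch of the One-vs-Rest prediction step follows. The weight vectors in `W` and the input `x` are hypothetical numbers; in practice each row would come from a separately trained binary classifier.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

classes = ["Red", "Green", "Blue"]

# Hypothetical weight vectors w_k, one row per binary classifier
W = np.array([
    [ 2.0, -1.0],   # Red-vs-not-Red
    [-1.0,  2.0],   # Green-vs-not-Green
    [-1.0, -1.0],   # Blue-vs-not-Blue
])

x = np.array([0.5, 1.5])   # a single input with two features
scores = sigmoid(W @ x)    # one probability per classifier
predicted = classes[int(np.argmax(scores))]

print(dict(zip(classes, scores.round(3))), "->", predicted)
```

Because the probabilities come from independent models, they do not have to sum to 1; only their ranking matters for the argmax.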
2. One-vs-One (OvO) Method
- Concept: Train a binary classifier for each pair of
classes.
- For \(K\) classes, train \(\binom{K}{2} = \frac{K(K-1)}{2}\) models.
- Each model distinguishes between two classes, e.g., \(\text{Class}_i\) vs. \(\text{Class}_j\).
- Example:
- Classes: Red, Green, Blue.
- Models: Red-vs-Green, Red-vs-Blue, Green-vs-Blue.
- Prediction: Use majority voting:
- Each classifier “votes” for one class.
- The class with the most votes wins.
- Pros:
- Handles imbalanced classes better than OvR.
- Often performs well for small \(K\).
- Cons:
- Computationally expensive for large \(K\) due to the \(O(K^2)\) number of models.
- Potential for ties or paradoxical results in voting.
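The voting step of One-vs-One can be sketched as below. The pairwise probabilities are invented for illustration; each would normally be the output of a binary classifier trained on just that pair of classes.

```python
from collections import Counter
from itertools import combinations

classes = ["Red", "Green", "Blue"]

# Hypothetical outputs of the pairwise classifiers for one input:
# each value is the probability that the FIRST class of the pair is correct.
pairwise_prob_first = {
    ("Red", "Green"): 0.30,
    ("Red", "Blue"): 0.80,
    ("Green", "Blue"): 0.90,
}

votes = Counter()
for first, second in combinations(classes, 2):
    winner = first if pairwise_prob_first[(first, second)] >= 0.5 else second
    votes[winner] += 1

predicted = votes.most_common(1)[0][0]
print(dict(votes), "->", predicted)
```

This sketch does not handle ties; a real implementation would need an explicit tie-breaking rule, for example summing the pairwise probabilities.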
Mathematical Representation
Log Loss for Multiclass Problems
For a dataset with \(K\) classes:
- Softmax Activation:
- Generalizes the sigmoid function for multiclass: \[
p(y=k|\mathbf{x}) = \frac{e^{\mathbf{w}_k^T
\mathbf{x}}}{\sum_{j=1}^K e^{\mathbf{w}_j^T \mathbf{x}}}
\]
- Outputs a probability distribution over \(K\) classes.
- Cross-Entropy Loss:
- Measures the difference between predicted and true probability
distributions: \[
J(\mathbf{W}) = -\frac{1}{N} \sum_{i=1}^N \sum_{k=1}^K y_{i,k}
\log(p(y=k|\mathbf{x}_i))
\]
- \(\mathbf{W} = [\mathbf{w}_1,
\mathbf{w}_2, \ldots, \mathbf{w}_K]\): weight matrix for \(K\) classes.
- \(y_{i,k}\): One-hot encoded true
label for \(\mathbf{x}_i\).
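To see in what sense softmax and cross-entropy generalize the binary case, the sketch below checks numerically that, with \(K = 2\), softmax gives the same probability as a sigmoid applied to the difference of the two scores, and that the cross-entropy equals the binary log loss. The scores are arbitrary illustrative values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=-1, keepdims=True)

# Hypothetical scores w_k^T x for two classes (class 0, class 1)
scores = np.array([0.4, 1.9])
p_softmax = softmax(scores)
p_sigmoid = sigmoid(scores[1] - scores[0])  # sigmoid of the score difference

# With a one-hot label, cross-entropy keeps only -log of the true class probability
y_one_hot = np.array([0.0, 1.0])
cross_entropy = -np.sum(y_one_hot * np.log(p_softmax))
binary_log_loss = -np.log(p_sigmoid)

print(p_softmax[1], p_sigmoid, cross_entropy, binary_log_loss)
```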
Python Implementation
Softmax Function
import numpy as np

def softmax(z):
    """
    Compute the softmax of each row of z.
    Parameters:
    - z: 2D array of shape (N, K), where N is the number of samples and K is the number of classes.
    Returns:
    - Softmax probabilities for each class.
    """
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))  # Stabilize for numerical safety
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)
Cross-Entropy Loss
def cross_entropy_loss(y_true, y_pred):
    """
    Compute the cross-entropy loss for multiclass classification.
    Parameters:
    - y_true: One-hot encoded true labels of shape (N, K).
    - y_pred: Predicted probabilities of shape (N, K).
    Returns:
    - Average cross-entropy loss.
    """
    epsilon = 1e-15  # Avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
Multiclass Logistic Regression Training
def train_multiclass_logistic_regression(X, y, lr=0.01, epochs=1000):
    """
    Train a multiclass logistic regression model using gradient descent.
    Parameters:
    - X: Feature matrix (N x d).
    - y: One-hot encoded labels (N x K).
    - lr: Learning rate.
    - epochs: Number of iterations.
    Returns:
    - W: Trained weight matrix (d x K).
    - losses: List of cross-entropy losses per epoch.
    """
    N, d = X.shape
    K = y.shape[1]
    W = np.zeros((d, K))  # Initialize weights
    losses = []
    for epoch in range(epochs):
        # Linear combination
        z = np.dot(X, W)
        # Softmax probabilities
        probs = softmax(z)
        # Compute loss
        loss = cross_entropy_loss(y, probs)
        losses.append(loss)
        # Gradient calculation
        gradient = np.dot(X.T, (probs - y)) / N
        # Weight update
        W -= lr * gradient
    return W, losses
Visualization of Loss Minimization
import matplotlib.pyplot as plt
# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 3)
y_raw = np.argmax(X, axis=1) # Labels (0, 1, 2) tied to the features; purely random labels give the model nothing to learn
y_one_hot = np.eye(3)[y_raw] # One-hot encode labels
# Train logistic regression
W, losses = train_multiclass_logistic_regression(X, y_one_hot, lr=0.1, epochs=300)
# Plot loss
plt.plot(losses)
plt.title("Cross-Entropy Loss Minimization")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.grid()
plt.show()
Key Insights
- Softmax for Multiclass:
- Softmax ensures the sum of predicted probabilities equals 1,
suitable for multiclass problems.
- One-vs-Rest:
- Simple to implement; needs \(K\) models, one per class.
- One-vs-One:
- Computationally expensive but useful for small \(K\).
- Cross-Entropy Loss:
- The loss function generalizes binary cross-entropy for multiclass
problems.
- Optimization:
- Gradient descent works well due to the convex nature of the
cross-entropy loss.
This mathematical and coding explanation provides a comprehensive
understanding of multiclass classification using logistic
regression.
---
title: "QTW 7333 Module 3 - Taking it Further"
output: html_notebook
---


 3: Logistic Regression

Categorical Data

Linear to Logistic

The Sigmoid

Log Loss

Minimizing Log Loss

Multiclass
Demonstration of Logistic Regression

Demonstration of Logistic Regression, Part I

Demonstration of Logistic Regression, Part II
Case Study 2

Case Study 2 Introduction

Case Study 2, Part I

Case Study 2, Part II
Assignment
Case Study 2 Assignment



1. categorical data

Hi and welcome to this video on categorical data, where we'll be talking about categorical targets and categorical inputs. So categorical data is when we have data that's based on classes. Good examples of this are things like colors (is something red, green, or blue?) or logic (is something true or false?). Other examples are things like objects. Is it a car? Is it a house? Is it a truck?

So this data, when we see it initially (red, green, and blue), it's non-numeric. But all our algorithms actually take numeric inputs. So how are we going to handle this for both inputs and possibly targets, since we could be predicting whether something is going to be a red object, a green object, or a blue object? It can go on both sides of the equation, both the inputs and the targets.

So how do we make this data into numeric data? The standard method is what we're going to call one-hot encoding. So in essence, what we're going to do is create new indicator variables, and for those indicator variables, the object type (or, in this case, color) will be the column name. So we're going to have this feature. Basically, is it red, is it green, is it blue, or is it pink? This is one-hot encoding and becomes essentially a vector, where the vector represents all the different values.

So vectors can be sometimes a little bit intimidating. But what we need to think of is we are just simply creating new variables that say, is it red, yes or no? So as we see in the example here, the first two lines are red. So we see a 1 in the red column.

They're not green, not blue, and not pink, so all of those features are 0. We can see, as we go down, each time the color appears, there's a 1 in the appropriate column. So we see this as a vector. So the color red becomes a vector 1, 0, 0, 0, and then blue becomes the vector 0, 0, 1, 0.
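As a small sketch of what that table looks like in code, the snippet below one-hot encodes a short list of colors with NumPy. The example rows in `colors` are made up; the column order (red, green, blue, pink) follows the order used above.

```python
import numpy as np

categories = ["red", "green", "blue", "pink"]     # one column per unique value
colors = ["red", "red", "blue", "green", "pink"]  # hypothetical example rows

# Row k of the identity matrix is the one-hot vector for category k
index = {name: k for k, name in enumerate(categories)}
one_hot = np.eye(len(categories), dtype=int)[[index[c] for c in colors]]

print(one_hot)
```

Each row contains exactly one 1, in the column that matches the color for that row.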

And this allows us to transform these categories or features into numeric data. And most importantly, not only does it allow us to change features into data, it allows us to use them as targets as well. So we may not be used to seeing targets as vectors, but towards the end of our video series, we'll see how this plays out. And we can actually predict multiple categories based on this same concept of is it red, is it green, is it blue, is it pink?

Now, an important thing to note is only one column should have the 1 in it. All the other columns should be 0. If you do happen to find features that have multiple columns that are 1, make sure that that's what you really intend. In this case, red, green, blue, and pink are all exclusive. But perhaps, if you wanted to break it into primary colors and say, oh, I have red and blue and that will indicate purple, well, then, you would have a vector 1, 0, 1. That's extremely uncommon.

Usually, for each unique value of the feature, we add a column. And only one of the columns will have a 1 indicated, and the rest of the columns will be 0. But that is not a hard and fast rule. It's more of a guideline.

So categorical data is closely related to discrete data, but they are not the same. And a good example of this is the number of rooms in a house. Is it categorical? The answer is actually no.

Even though we don't have like 3 and 1/2 rooms in a house or 3.21 rooms, what we have here is discrete data. So discrete data, it's important to still treat it as numeric because, in this case, the value of 3 actually indicates something that is greater than the value of 2. There are three rooms versus two rooms.

Now, if you have different classes (say I have class 1 or class 2 or, in this example, category 1 or category 2), is it really that category 2 represents twice the value of category 1? So that's how you can tell the difference between discrete data and categorical data. Categorical data needs to be one-hot encoded. Discrete data, while you can one-hot encode it, can be left in its original integer form.

So remember that when you're looking at your possibly categorical data to ensure that it's not accidentally discrete data. So a couple of examples of discrete data. We talked about the number of rooms in a house or the number of cars a family owns. You don't own 2 and 1/2 cars or 4.1 cars. How many children do you have? The average family has 2.4 children. That's not an actual number of children.

The number of children is discrete. Number of cylinders in a car engine. Cars don't come with a partial cylinder. They have two, four, sometimes five for the really odd ones, but the cylinders in a car engine is discrete.

So remember, do not one-hot encode discrete data. Allow those values to inform your model. It's possible you can do a data transform, but the best way to handle this discrete data is to input it directly. So once you've actually one-hot encoded your categorical features, you can use this in nearly any model.

So if we're trying to figure out the price of our car, and we're using a linear regression model, which we learned about in one of the earlier weeks, and one of the features is the car color. So as we look at the color of the car, we may have black cars and silver cars and gray cars, what we're going to do is we're going to transform that single column of car colors to multiple columns, where each column is a color, and there will be a 1 to indicate, for a row, if that particular car is that color. So I own a black car. So underneath the Black column, there would be a 1.

And let's say our other car colors are red and white. Under the Red and White columns, there would be 0's. So now I can put this into a linear model. I can put it into any model.

Our categories have been transformed into numbers that our model can actually understand.
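Putting the two ideas together, here is a hedged sketch of how a feature matrix for the car example might be assembled: the discrete column (number of cylinders) is kept as a number, while the color is one-hot encoded. The four cars are invented for illustration.

```python
import numpy as np

# Hypothetical cars: (number of cylinders, color)
cars = [(4, "black"), (6, "red"), (4, "white"), (8, "black")]
color_columns = ["black", "red", "white"]

# Discrete feature: keep it as a number so that 6 > 4 still means something
cylinders = np.array([c for c, _ in cars]).reshape(-1, 1)

# Categorical feature: one indicator column per color
color_one_hot = np.array(
    [[int(color == name) for name in color_columns] for _, color in cars]
)

X = np.hstack([cylinders, color_one_hot])
print(X)
```

The resulting matrix is entirely numeric, so it can be passed to a linear model or any other model that expects numeric inputs.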


2. linear to logistic
Welcome to our video that's going to take us from linear regression to logistic regression. So we've encountered categorical data now, and the question becomes, what happens if the target or our prediction is actually one-hot encoded? So the idea being, let's try and predict if something is true or something is false. And we saw, in the past, that we one-hot encode these, and that can transform our categorical data into numeric data.

But, now, what happens if our target is now this category? We've one-hot encoded the target, but, remember, linear regression doesn't predict a zero or a one. Linear regression produces a continuous output. So even if we manage to get that output between zero and one, we would have a value that could be anywhere from zero to one, and, quite possibly, could go over or under.

So the question becomes, how can we use what we've learned in linear regression to tackle these problems where the output, or the prediction, is categorical? So when we compare a continuous output with a categorical output, we will see on the left-hand side that the output varies with x. So there's a continuous output. The discrete output means that y can only take on specific values.

In this case, it would be roughly negative one or one. And it doesn't have to jump exactly at zero. In this case, it can take on different discrete values for different ranges of x. It doesn't have to be x greater than 0, x less than zero.

It could be something like x greater than negative one or x less than negative one. But we see that y only takes on discrete values. So we talked about discrete numbers, but this is now a discrete target. And, in particular, we're going to look at binary targets where we've got two values.

One of which will be the value zero, and one of which will be the value one, which will correspond to our one-hot encoded targets. So to do that, we're going to come back to linear regression. And, hopefully, these mathematical equations look familiar to you, but it's our old y equals mx plus b. So, of course, each column x (in this case x1, x2, et cetera) has an associated slope, m1, m2. And a reminder that the intercept is hiding as m0. So if you're wondering why this is the sum of mi xi, x0 is actually one, and m0, of course, will then be the intercept. And this is the compact notation I mentioned a few videos ago.

So what we're going to do is, we're going to take a variable transform. And this variable transform is going to take us to logistic regression. Now, it seems like this is a strange type of variable transform. But, as we're going to see, this actually forces our output to now essentially be discrete.

We're going to model a discrete output, and that's what this function is giving us. And this function will lead us to logistic regression. Now, one of the things you'll notice is I add this sigma at the bottom, sigma of mi xi. And that simply means that the sigma function, or what we're going to call the sigmoid function, takes as its input m times x, the sum of that. And, typically, we'll drop the summation. It's implied because we know from linear regression that we see this original top equation, the m1 x1 plus m2 x2, et cetera, et cetera. Because we're so familiar with that, for brevity we typically drop the summation sign. But it's still there. And so I've dropped it here for clarity, but you can see that y is now the sigmoid of m times x rather than m times x directly.
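
To make the notation concrete, here is a small sketch in plain Python. The slope values and inputs are made up, and the helper names `linear_score` and `sigmoid` are just labels for this example. It shows the intercept hiding as m0 with x0 equal to 1, and then the same sum passed through the sigmoid.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def linear_score(m, x):
    """Sum of m_i * x_i, with x_0 = 1 prepended so m[0] acts as the intercept."""
    x_with_bias = [1.0] + list(x)
    return sum(mi * xi for mi, xi in zip(m, x_with_bias))

# Hypothetical slopes: intercept m0 = -1.0, then m1 = 2.0 and m2 = 0.5.
m = [-1.0, 2.0, 0.5]
x = [1.5, -2.0]

z = linear_score(m, x)   # linear regression output: -1 + 3 - 1 = 1.0
y = sigmoid(z)           # logistic regression output, strictly between 0 and 1
print(z, y)
```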

3. the sigmoid

Hi, and welcome to the video about the sigmoid function, which is one of the key functions in transforming from a linear regression to a logistic regression. So in the last video, we talked a little bit about this variable transform. Now, we can start to see what this variable transform actually does to our data. So this sigmoid function takes the input x, and it essentially squeezes the output so that it's between 0 and 1.

And it gives us the S-shaped curve that we see on the left. However, as the slope (and when I say slope, this would be the slopes m) gets larger and larger, we see that this sigmoid function becomes much more like a step function, modeling the behavior that we saw for discrete output. And this shows us that we can actually get an output that is essentially forced to be either 0 or 1. Now, we'll still get outputs that are between 0 and 1. So some people may say, well, although we've squeezed the tails to 0 and 1, what about the points in the middle?

And what we'll look at, at these points in the middle, is these are going to represent probabilities that we belong to class 1 or class 0. So what we're looking at now is we will find a threshold or a cutoff in our sigmoid function. And we'll say anything above this value is class 1. And anything below it is class 0.

The default, of course, would be at y equals 0.5, the idea being essentially rounding. So if it's 0.51 and above, it belongs to class 1. And if it's 0.50 and below, it belongs to class 0, like our normal rounding would be.

However, this is not required. You can make your thresholds wherever you want. And this is important as we look forward. And we maybe want to be cautious about certain assignments.

So what if class 1 means fraud has been detected? Well, we might want to have a very low threshold. Like we want to detect any possible chance of fraud. So perhaps our y output value is, say, 0.2, so that would mean that there's about a 20% chance that there's fraud going on. So maybe we should check that. On the other hand, if you put the threshold really high, you would say, ah, I'm extremely certain that fraud is going on at something like 0.99. But the point is you have the choice to adjust that threshold.
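
Here is a small sketch of what moving the threshold does. The probabilities are made-up sigmoid outputs for five transactions, and `classify` is a hypothetical helper, not a library function. It treats anything strictly above the cutoff as class 1, matching the rounding rule described above.

```python
def classify(probabilities, threshold=0.5):
    """Assign class 1 when the probability is above the threshold, else class 0."""
    return [1 if p > threshold else 0 for p in probabilities]

# Hypothetical sigmoid outputs for five transactions.
probs = [0.05, 0.25, 0.51, 0.80, 0.995]

default_labels = classify(probs)          # cutoff at 0.5
cautious_labels = classify(probs, 0.2)    # flag anything above a 20% chance of fraud
strict_labels = classify(probs, 0.99)     # flag only when nearly certain

print(default_labels)
print(cautious_labels)
print(strict_labels)
```

The same five probabilities give three different sets of class assignments, which is why the cutoff deserves a deliberate choice.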

So by default, if you don't instruct your algorithm otherwise, it will make the cutoff at 0.5. One of the most common errors I see students make is that they don't question where that threshold should be. Now, there are a number of ways that we can determine that threshold. It's not entirely a judgment call.

But be aware that you can make the determination of where your assignment should be for class 1 and class 0. So let's take a look at the properties of the sigmoid function. We can see that the output y is 1 over 1 plus an exponential. And in that exponential, again with the implied sum, we have here minus mx.

But it's actually a sum over all the m's and all the x's. So, again, m0 times x0, x0 being 1, m0 being the intercept, x1 m1, our first column and first slope. But the idea is if that sum total is extremely large and it's positive, we'll have 1 over 1 plus e to the negative large number. e to a negative large number is, in essence, a very, very small number because it's 1 over e to that power.

So the denominator becomes 1 plus an extremely small number. That becomes 1 over 1, which means our output is 1. Now suppose m times x itself is large and negative. Remember, we have a negative sign in front of m times x. Then we have e to a positive number because the two negative signs cancel out. Now, I've got an exponential to a very large power.

Well, anytime I take something to a large power, that number gets very, very large. So I have 1 over 1 plus a large number. 1 plus a large number is essentially that large number. And 1 over a large number is 0 roughly.

Now, again, I like to talk in physics terms because of my physics background. But the idea is that we are extremely close to 0 with 1 over a large number. And that's the behavior we actually want. When the sum is large and positive, we want the output to say, yes, this is class 1.

And that's how we're going to model it. You may ask yourself, why did we say if mx is large, we want it to be 1, and if mx is negative, we want it to be 0? Well, that's the idea of us taking advantage of this sigmoid function. So we will frame the problem such that we want mx to be large and positive for class 1, and negative for class 0.

This simulates the output of the step function. And as an added bonus, although many people are intimidated by the mathematics of exponentials, exponentials actually have some very friendly properties that we're going to take advantage of. And they'll help us with doing some advanced mathematics in the next couple of steps. So we'll use this sigmoid function to give us an advantage.

And, hopefully, we'll see that in the next couple of videos. And it won't be so obscure why did we choose this particular function. At its face value, we're looking at, of course, outputs that are 0 and 1. And this function starts to simulate that.

And then we're going to take advantage of that exponential property as we go forward.
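
The limiting behavior described in this video can be checked numerically. This is a minimal sketch in plain Python, and the input values are arbitrary. Note that this simple form of `sigmoid` would overflow for extremely negative inputs, which is a limitation of the sketch and not something covered in the video.

```python
import math

def sigmoid(z):
    """1 over 1 plus e to the minus z."""
    return 1.0 / (1.0 + math.exp(-z))

# Large positive input: e^(-z) is tiny, so the output is essentially 1.
print(sigmoid(20.0))
# Large negative input: e^(-z) is huge, so the output is essentially 0.
print(sigmoid(-20.0))
# Zero input sits exactly on the default 0.5 threshold.
print(sigmoid(0.0))

# A larger slope m makes the curve steeper around x = 0, closer to a step function.
x = 0.5
for m in (1, 10, 100):
    print(m, sigmoid(m * x))
```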


4. log loss
Hi, and welcome to our video on logarithmic loss. So, in linear regression, what we find is that we typically use the mean squared error or the mean absolute error. And the idea is, we are measuring the distance between our prediction and the target. And because distance, of course, is Pythagorean theorem based, we typically use the square.

However, because we're also one dimensional with prediction and target, it's also possible to use the mean absolute error. And all we're doing is just measuring how far away our guess is from the actual value. And so our loss is actually the sum of the distances from the prediction at each value of x to the target y. And that's why we use the vertical distance rather than the shortest distance.

The idea is that our input data is given and fixed for us. And so we adjust the slope of the line to provide a prediction. And the distance from the prediction at input x to the target y is then our vertical distance. So our objective is to get our straight line as close to the target points as possible, minimizing all of those distances.

That's what we're doing when we're using both mean squared error and mean absolute error. However, when we get to logarithmic loss and categorical data, we've got a problem. Here, our output or our target, our guess, really has only two values. Those values tend to be zero and one.

So what we need is a loss function that represents our distance to this target: when we guess, is it target zero or is it target one, how far away from the target is that guess? So we're actually going to have two cases because we have the case where our target is zero and the case where our target is one. So if the target is one, the error is just going to be minus y log of p. And p is just the output of our sigmoid function.

As I've mentioned, if we've been dealing with sigmoid functions, we know that the output of the sigmoid function represents a probability that that input belongs to a certain class. So p is really just the output, or the sigmoid, of m times x. It's that one over one plus the exponential. So what we're doing is, we're looking at the distance between our target and our guess or our output.

And we put y in front of it because we're going to combine these two cases. Now, in this case, the target y is one, so our error is really just minus log p. Now, here, I'm using the traditional notation used in data science. In this case, the log is not log base 10, it's actually the natural log.

That will become important in a few moments. Now, when the target is zero, we see we have a one minus y. In other words, one minus zero times the log of one minus the probability, again with a minus sign in front. This is basically an exclusion.

So, just like before, we said we had a probability to belong to class one. Well, we've got a class one and a class zero, and we can ask, what's the probability I belong to class one? But they're mutually exclusive. So that means that the probability I belong to class zero is one minus the probability of belonging to class one.

It's simply finding the probability that I belong to the other class. But these two classes are coupled, and so we see the one minus p. So, again, p is our prediction or output of our sigmoid. So if we put these two terms together, we get a two-term loss function.

But you see that one of those terms is always zero. If y is one, the right-hand term becomes one minus one or zero, so that term contributes nothing to the loss. On the other hand, if y is zero, the left-hand term becomes zero. And we have, once again, minus log of one minus p.

And that's the idea of, what's my probability of belonging to class zero? So both terms are symmetric. What we're seeing is we are just taking advantage of our classes being coupled to select which term we're going to use. So, here, what we're doing is, we are penalizing predictions that are farther away.

So if I'm trying to predict zero and my output is really close to one, minus log of one minus the probability is going to be a high contribution to the loss. On the other hand, if my target is one, then the output should be close to one. And that means my probability should be close to one. Now consider getting farther and farther away from one, and, remember, the only way to get away from one is to move towards zero. By using that minus log p term, I'm actually increasing the value of my error. So the closer I am to predicting zero, the more I contribute to the loss. In other words, my error is larger. So, sometimes, it helps us to take a look at this. And let's take a look at that visually now.

So, here, we can see what log loss looks like, and we have the individual terms. We can see the blue line, which is the loss for y equals one. You'll see, if we're predicting close to one, our loss is very, very small. But if we're predicting close to zero and our target is one, we contribute a great amount to the loss. And the idea is, when we're trying to minimize this loss equation, what we're going to try and do is force our predictions to be correct. In other words, the model should start outputting predictions that are closer to one because they will contribute less to the loss.

Likewise, when we've got a target that's zero, the farther away from zero you are, the more you contribute to the loss. Now, remember, our sigmoid is going to squash our outputs to be between zero and one. So being farther away from zero means being closer to one. If we add these two terms together, we see that it forms a convex curve with a well-defined minimum. And this is what we always wanted for any of our loss minimizations. We want a well-defined minimum. Well, we see that this logarithmic loss has that well-defined minimum, and it's a function of these two terms.

So by using these two terms with the values of y equals one and y equals zero, we formed this loss function with a well-defined minimum. That gives us an advantage because a well-defined minimum is easy for us to find.
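
The two-term loss from this video can be written out in a few lines. This is a sketch for a single example, with made-up probabilities, and `log_loss_single` is a name chosen for illustration. It uses the natural log, as noted above, and it assumes p is strictly between 0 and 1.

```python
import math

def log_loss_single(y, p):
    """Log loss for one example: -[y*ln(p) + (1 - y)*ln(1 - p)], natural log."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Target is 1: a confident correct prediction costs little, a confident wrong one costs a lot.
good = log_loss_single(1, 0.95)
bad = log_loss_single(1, 0.05)

# Target is 0: the penalty grows as p moves toward 1.
good0 = log_loss_single(0, 0.05)
bad0 = log_loss_single(0, 0.95)

print(good, bad)
print(good0, bad0)
```

Only one of the two terms is active for any given target, and the two cases mirror each other.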



5. minimizing log loss

Hi, and welcome to the video on minimizing log loss. So, once we have a log loss equation, our question becomes, how do we minimize that? And of course, minimizing our log loss means that our model is producing good predictions. So a small loss, or a minimal loss, gives us what we want: a good, well-fit model. We've discussed the loss function. We have a prediction function. So if we take the sigmoid prediction function and our logarithmic loss, let's try and combine them in a similar fashion to linear regression.

And keep in mind that although we see the value p, p is really a function of x. Or, more precisely, p contains the data, x, and is actually a function of the slopes times x. So we've got a little bit of math here. But it's important for us to understand how this works, and how it's similar to linear regression.

So traditionally, we denote the loss with the value J. And here, what I'm doing is I'm summing up all the loss terms for each prediction. So again, if the target is 1, we're going to use the left-hand term. If the target is 0, we're going to use the right-hand term. And as we sum all of those up, we will get the final loss. And just a quick reminder that I have p in the loss equation. But remember that p is actually the sigmoid that contains our data and our slopes.

And remember, that's the same as linear regression, that sum of mi xi. So, we've got things that are fixed. Our target, y, is fixed. Our x values are fixed. That's our data. So to make a minimum value of J, the only thing that can change is the slope.

And because we have this function that has a well-defined minimum, if we take the derivative of the function J, what we can do is find the minimum. Because what we'll do by looking at that derivative is find which way the minimum lies. Does it lie to our left? Does it lie to our right in each dimension?

And so by taking the partial derivative of J with respect to the slope, we will get an update rule for the value of the slope. Now conveniently, the update rule for logistic regression is exactly the same, down to the variables, as the one for linear regression. In other words, I have this sigmoid function. And I have an update rule.

And when I actually take the derivative, it turns out that the update rule is the same. Now, it's a nicety to see, so I've included a link here so that you can go through the details of the derivation yourself. But the key takeaway from all this mathematics is our update is the same. And if our update is the same, that means that we can use all the algorithms from linear regression to update our values of the slope for logistic regression.

So now we've extended our ability to model problems from linear problems to categorical problems, simply by using an appropriate loss function. So now, hopefully it starts to become a little bit clearer why we use this particular exponential function. It was actually chosen so that those update rules come out the same. And again, just a quick reminder, although I've got log, L-O-G, these are actually natural logs.

And that can be important if you're taking those derivatives, because the natural log derivative is actually 1 over the value. So the derivative of natural log of p is 1 over p. That plays an important role in this update rule coming out the same.
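
As a rough sketch of what that update looks like in practice, here is plain gradient descent on the mean log loss. The gradient for each slope is the average of (p minus y) times x, which has the same form as the linear regression update with p standing in for the prediction. The toy data, the learning rate of 0.5, and the 200 iterations are all assumptions picked for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(m, row):
    """Probability of class 1: sigmoid of the sum of m_i * x_i."""
    return sigmoid(sum(mi * xi for mi, xi in zip(m, row)))

def mean_log_loss(m, X, y):
    total = 0.0
    for row, target in zip(X, y):
        p = predict(m, row)
        total += -(target * math.log(p) + (1 - target) * math.log(1 - p))
    return total / len(y)

def gradient_step(m, X, y, learning_rate):
    """One update: m_j <- m_j - learning_rate * mean((p - y) * x_j)."""
    n = len(y)
    grad = [0.0] * len(m)
    for row, target in zip(X, y):
        error = predict(m, row) - target
        for j, xj in enumerate(row):
            grad[j] += error * xj / n
    return [mj - learning_rate * gj for mj, gj in zip(m, grad)]

# Hypothetical toy data; the first column is x0 = 1 for the intercept.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 0, 1, 1]

m = [0.0, 0.0]
loss_before = mean_log_loss(m, X, y)
for _ in range(200):
    m = gradient_step(m, X, y, learning_rate=0.5)
loss_after = mean_log_loss(m, X, y)

print(loss_before, loss_after, m)
```

Only the slopes change from step to step. The data and the targets stay fixed, exactly as described above.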



6. multiclass

Hi, and welcome to the video on multiclass classification. So a question may come to mind. We've been studying how to do binary classifications. And our sigmoid function gives us a binary output, class 1 or class 0. That's why you may come across the log loss under the name binary cross entropy loss. But as we extend to more and more complex problems, not every problem is a simple example of two classes. We could have three classes, 10 classes, 1,000 classes, or even a million classes.

So how can we get from our binary classification, the output of the sigmoid being 0 or 1, to multiple classes? And a simple example of this is the iris data set that's a classic toy data set. There are three species of flower within the iris. And the question would become, how would I do a logistic regression on that?

Because a sigmoid output will only be 0 to 1, but I've somehow got three classes. And even if I did a one-hot encoding, my target is no longer 1 or 0, there's actually three classes that have ones and zeros. So how would I adapt my logistic regression to deal with these multiple classes? So the first method that we can use is something that's called one-versus-all or one-versus-the-rest.

So what we're going to do is for each class, we will train a discriminator or a model. So if I have three classes that correspond to the colors red, blue, and green, my classifications would be red/not red, green/not green, blue/not blue. So I've got a classifier for each class. And what we're going to do is we'll find which class gets the highest output value.

What happens is you'll look at the sigmoid, and although we often talk about the sigmoid outputting a 0 or a 1, in reality, we get some value between 0 and 1. And so to pick the winning class, what we will do is find out which output is the largest. Now we frame these problems so that when we do the red/not red, red is class 1, and not red is class 0. That way the not value is always class 0, and the value is class 1.

So we will come out with three values. And what we will do is we will look at those three values, and they may all be fairly large. The key is, which one is the largest? So if we look and say red has a score of 0.97 and green has a score of 0.22 and blue has a score of 0.50, well, because red has the highest score, that's the winning model. So we're trying to play our models off against each other.

It's not really a competition, but we're looking at each of the output scores and finding which one gives us the max score. And oftentimes, we'll see a function called argmax, which looks at all three output values, finds which one is the maximum, and returns that class. That's the one-versus-rest or one-versus-all method. The downside, of course, is you have to keep a model for each of your classes.

So this isn't so bad if we've got three classes. But if we had a lot of classes (100, or 1,000, or even 100,000 or a million classes), that could be quite time consuming. So that is one of the drawbacks of logistic regression. It's not inherently multiclass, but we can adapt it. But as the number of classes increases, it becomes more and more difficult for us to find a solution that's easy for us to implement.

It's hard to manage millions of models. It's not so bad to manage three models. So here is an example of some decent outputs of the red, blue, and green classifiers. So we look at the red score, 0.87, the blue score, 0.22, the green score, 0.45. So we would look at those and form a vector of the positive outputs. And when I say the positive outputs, I mean the class 1 outputs. So here I have 0.87, 0.22, and 0.45. So with 0.87 being the largest value, corresponding to red, once again, we're going to classify that as red.
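
The argmax step for that example is short enough to write out. This sketch uses a plain dictionary and Python's built-in `max`, with the three scores taken from the example above.

```python
# One-versus-rest scores from the example above: each is the class-1 output
# of that color's "color / not color" classifier.
scores = {"red": 0.87, "blue": 0.22, "green": 0.45}

# argmax: return the class whose score is largest.
winner = max(scores, key=scores.get)
print(winner)
```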

So one of the issues is that the data each classifier sees is not evenly distributed: what our classifiers are going to see is a large number of negative examples. And because classifiers learn based on probability, they may start to favor the not example. So let's say our data is evenly distributed, with 1/3 red, 1/3 blue, and 1/3 green. Well, when I go to the red/not red classifier, 1/3 of the data is red, but 2/3 of the data is not red.

And this can have the effect of biasing our models towards the 0 class probability. So keep that in mind, especially as the number of classes increases. This would get worse if I had 10 classes, 10 colors. Then 10% of the data would be red and 90% of the data would be not red.

So as a basic guess to get 90% accuracy, you could actually predict 100% of the data being not red. That's an example of biasing the data. Now in reality, the classifier will hopefully pick up which input variables contribute to the class red. But be aware that as you get more and more classes, your classifiers will start to get biased for negative samples.
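
That 90% figure is easy to reproduce. Here is a small sketch with a made-up, perfectly balanced 10-color data set (the color names and the 100 examples per color are assumptions) and a "model" that always answers not red.

```python
# Hypothetical balanced 10-color data set: 100 examples of each color.
colors = ["red", "orange", "yellow", "green", "blue",
          "purple", "pink", "brown", "black", "white"]
labels = [c for c in colors for _ in range(100)]

# Targets for the "red / not red" classifier: 1 for red, 0 otherwise.
is_red = [1 if c == "red" else 0 for c in labels]

# A useless model that always answers "not red".
always_not_red = [0] * len(is_red)

accuracy = sum(p == t for p, t in zip(always_not_red, is_red)) / len(is_red)
reds_found = sum(p == 1 and t == 1 for p, t in zip(always_not_red, is_red))
print(accuracy, reds_found)
```

The accuracy looks respectable, yet not a single red example is found, which is why accuracy alone can hide this bias.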

And there are a number of things that we can do to take care of that bias, but you have to be aware of it and address it as you get to larger and larger multiclass problems. Now, there's another way to handle this multiclass, called one-versus-one, and it's more of a head-to-head battle of classifiers. And again, we use this idea of which one's the max, which one's the best: a head-to-head battle. But it's setting up a comparison. So we're going to do a comparison for every pair of classes. The problem is this can get out of hand quite quickly, as the number of classifiers starts to expand quite quickly.

So here, things are going to be even worse as we go to multiple classes. And what we're going to do is compare things like, is it red or is it green? Is it green or is it blue? As we go through that sequence, obviously we can start to see there's factorials involved. And that's why we see the number of classifiers expanding quite a bit more than we had for the one-versus-all, where if we had 10 classes, we would have 10 classifiers. With one-versus-one, 10 classes need one classifier for each of the 45 pairs.

But this also allows direct comparisons between the different classes. And the idea here is the class that wins the most votes is the assigned class. So for the example of red and green and blue, we would have a red/green and a red/blue. So red has two opportunities to win.

Well, just like that, if you've got the red/green, you've also got green and blue. So green has two chances to win, and so does blue. That's why you see, as the number of classes expands, the more head-to-heads we tend to get. And this can lead to some crazy solutions where some of the classes aren't super popular, but it's very difficult for our classifiers to distinguish between different parts.

So one-versus-one, although it is used for smaller numbers of classes, is typically not used as we go to larger and larger numbers of classes. But the one-versus-one is out there, and in many of our algorithms it's implemented behind the scenes, so you don't have to write this from scratch. You can say, I want to do a one-versus-one classifier. Here are my inputs, here are my targets, and the algorithm will set up all these classifiers for you.

So here I've got a four-value example of Texas, Iowa, California, and Florida. Now it turns out I need six head-to-head classifiers. So because SMU is in Texas, I purposely set this up so that Texas would win. And we don't actually know what these values are, but this is just an example.

And so I've got the Texas-Florida classifier. And you can see that Texas score is the largest value there. So Texas is, quote unquote, "the winner" of that classifier. And we have the Texas versus Iowa.

And you can see once again, Texas comes out on top. Well then we've got the Texas versus California. And we see once again, Texas is triumphant. All of our students in Texas are celebrating right at this point.

Now we go to the California versus Iowa. Before we were looking at Texas, all the Texas head-to-heads. Now let's look at the California head-to-heads. So California has already gone head-to-head with Texas.

Now it needs to go head-to-head with Iowa and Florida. And you can see it comes out on the losing end versus Iowa. So Iowa wins that head-to-head battle. And California comes out on the bottom against Florida.

So Florida has won one. So our score now is Texas 3, Iowa 1, Florida 1. So what we've got left is the Florida-Iowa battle. And of course, the classifier for this result says Iowa comes out on top.

So our final tally is Texas got three votes, Iowa got two votes, Florida got one vote, and California has zero. And we call these votes, but we're just tallying the winner of each classifier. So we can sum these up and we'll get these results. But what can happen is a state, or in this case, a class, can actually lose a head-to-head and still win the overall classification, which can seem strange. So remember, because this is a tally of outputs at each head-to-head, you can get different results. So one of the other problems is you can end up with ties, where everybody ends up in a tie. It's rare, but it can happen.
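
The vote tally for the four-state example can be sketched with the standard library. The winners below are the ones narrated above. The tie at the end is a hypothetical scenario, made up for this example, in which Texas, Iowa, and Florida each beat one another in a cycle and all beat California.

```python
from collections import Counter
from itertools import combinations

states = ["Texas", "Iowa", "California", "Florida"]

# One head-to-head classifier per unordered pair: 4 * 3 / 2 = 6.
pairs = list(combinations(states, 2))

# Winners of each head-to-head, as narrated in the example above.
winners = {
    ("Texas", "Florida"): "Texas",
    ("Texas", "Iowa"): "Texas",
    ("Texas", "California"): "Texas",
    ("California", "Iowa"): "Iowa",
    ("California", "Florida"): "Florida",
    ("Florida", "Iowa"): "Iowa",
}

# Start every state at zero votes, then count one vote per head-to-head win.
votes = Counter({state: 0 for state in states})
votes.update(winners.values())
assigned = max(votes, key=votes.get)
print(votes, assigned)

# Hypothetical tie: three states finish with two votes each.
tied = Counter(["Texas", "Iowa", "Florida", "Texas", "Iowa", "Florida"])
top = max(tied.values())
tied_states = sorted(s for s, v in tied.items() if v == top)
print(tied_states)
```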

So here we've got two votes for each of three states, from six classifiers. So we have a tie. There's no way to break that tie statistically in how this is set up. So we have two methods to attempt multiclass.

Neither one is perfect. Remember that the one-versus-one can have these ties. And the one-versus-rest or the one-versus-all can suffer from unbalanced classifiers. Again, we've got that idea of class/not the class.

And as the number of classes increases, we have this bias towards the negative class, just because of the number of samples that the classifiers are going to see. So both of these are implemented what I like to call "under the hood." We're not going to set these up individually. What we'll do is we'll go to sklearn and we'll say, I would like a one-versus-one model. Here's my data, run. Or I'd like a one-versus-rest, a one-versus-all multiclass method. Here's my data, set it up. So it takes place behind the scenes, and it's well established how these are run. And so you get your final outputs. So once again, it allows us to focus on what's going on with the problem rather than implementing the software solution.

The software solution is there. It's for us to use the appropriate software correctly, but it's already been optimized so we can get the best performance and worry less about the specifics of the coding and more about the problem itself.
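
As one possible way to ask sklearn for each strategy, here is a sketch using the wrapper classes in sklearn.multiclass on the iris data mentioned earlier. The choice of `max_iter=1000` is an assumption made to avoid convergence warnings, and fitting and scoring on the same data is only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)   # three species of iris

# Each wrapper trains the binary classifiers behind the scenes.
# max_iter=1000 is an assumption chosen to avoid convergence warnings.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(len(ovr.estimators_))   # one model per class
print(len(ovo.estimators_))   # one model per pair of classes
print(ovr.predict(X[:3]), ovo.predict(X[:3]))
```

With three classes, both strategies happen to need three models. The counts only diverge as the number of classes grows.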


7. demonstration of logistic regression, part 1

Welcome to our demonstration of logistic regression. So to get started, we'll do our traditional imports. And now what we want to do is we want to find a binary classification set. To practice, Sklearn has a number of data sets.

You can always scroll through them. And I've chosen one to start with that gets us started with our binary classification. So we're going to be taking a look at the breast cancer data set. Let's go ahead and do our first tab.

As always, let's take a look at our data. So you can see this is not actually a data frame. What we've got here is actually a dictionary, where we've got our data and we've got our targets. And then we've got all the information we need.

We just have to put it into a data frame. That's not too hard. And this is the traditional way that data sets are stored in Sklearn. So if we're not familiar with dictionaries, we're about to get a short lesson.

So you can see this is our actual array of data. So let's start to create a data frame. And I'll just call it cancer data frame. And for brevity, this is the data that's going to go in.

So we've got our indexes. Now you'll notice we don't have any column names. But the column names were actually there in the raw data. Let's take a quick look.

So we'll have to remember to add our targets here. But here's the description of all the different names. So let's see if we can pull up those descriptions. So these are all the descriptions.

And let's see if we can pull out the names for these really quickly. So this is going to take us a little bit. So I may cut the video at this point while I come up with a quick way to put these together. We can see that it's not exactly easy to pull these out and would take a little bit of formatting.

It's not impossible. But for the meantime, I'm just going to move on without having my actual names. We know this data set is well formatted because it's a toy data set, and you can read the summary on Sklearn. So we'll just continue.

But we need to remember that we actually have to add the target as well. So let's go ahead and add that to our data frame. So let's look at our final data. Oops.

This is why I tab complete quite a bit. So we can see we've got our values. And if we wanted to look up what those values are, we could. Then we see we've got targets of 0's and 1's.
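
The steps so far can be sketched as follows. The variable names `raw` and `cancer_df` are assumptions, since the actual notebook is not shown. The walkthrough carries on without column names, but the loaded object also has a `feature_names` entry, which this sketch uses as an optional convenience.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

raw = load_breast_cancer()          # dictionary-like object, not a data frame
print(raw.keys())                   # includes 'data', 'target', 'feature_names', 'DESCR'

# Build the data frame from the raw array, then add the target column.
cancer_df = pd.DataFrame(raw["data"], columns=raw["feature_names"])
cancer_df["target"] = raw["target"]

print(cancer_df.shape)
print(cancer_df["target"].unique())
```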

So we've got our data. It's binary classified. What we need to do is let's start our basic logistic regression. Now, you'll notice at this point, I have not scaled the data.

But we can still get fairly good results without scaling the data. And by not scaling the data initially, we can perhaps compare and contrast the importance of scaling the data. Fortunately, doing our logistic regression is actually going to be fairly quick. Now, this is the first time we've encountered our Sklearn.

You can notice that there are multiple versions of logistic regression. Now we always like to do cross-validation. But you can use regular logistic regression and a cross-validation from other parts of Sklearn. The Logistic Regression CV, because it's such good practice, has it built in.

At this point, let's go ahead and take a look at our Logistic Regression CV. So I've imported this class. And we're going to use this as our model. Now, a little bit of object-oriented programming, Logistic Regression CV is a class.

So we've just imported this class. And now what we need to do is we're going to create a model. So our model is an instance of the Logistic Regression CV class. Now, we can accept the default parameters by just using empty parentheses.

If you're interested in all the options, you may want to take a look at the Sklearn documentation. So we may get some warnings as we start this. But this is our base logistic regression cross-validation. Let's go ahead and instantiate it.

So now we've got a model, now it's time for us to fit our model. So to fit our model, we need both targets and data. But that was in the data frame. We've already got that.

So I've got my data frame. But remember, the data frame contains both the targets and the data. So we need a way to split this up. I can do that with pandas location.

Or I can actually split these up. And either way, we will get, essentially, our x data and our y data. So what I can do for the y data, we know that that's the target. Now, this may look a little crazy.

But what I'm doing is I'm creating an object where I drop the target column. And then here are my targets--

the actual target column. You may be afraid that when I drop the target, I'm going to drop it from the data frame completely. However, in this case, because I'm not assigning the result to anything, this just returns an object. It has no name.

So this is essentially our x data, while this is our y data. We'll probably get a warning because we didn't specify how many folds in our cross-validation. But let's go ahead and see if this works. Well, we found not in the axis.

So I probably misspelled this. And let's see if we can figure out--

hmm, this is interesting. So this is one of my most common mistakes. I forget to tell it if I'm dropping a column or a row. OK, so we've got some warnings here.
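
The drop step and the "not found in axis" mistake can be reproduced on a tiny stand-in data frame. This is only a sketch: the three-row frame and the names `X`, `y`, and `drop_failed` are made up for illustration and are not the demo's data.

```python
import pandas as pd

# Tiny stand-in data frame: two feature columns plus a target column.
df = pd.DataFrame({0: [1.0, 2.0, 3.0], 1: [4.0, 5.0, 6.0], "target": [0, 1, 0]})

# Without axis=1 (or columns=...), pandas looks for a ROW labelled "target"
# and raises a KeyError because no such row exists.
try:
    df.drop("target")
    drop_failed = False
except KeyError:
    drop_failed = True

# drop returns a new object; df itself is unchanged unless we reassign it.
X = df.drop("target", axis=1)
y = df["target"]

print(drop_failed)
print(list(df.columns))
print(list(X.columns))
```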

And these are our results. And while these are warnings, they're not of too much concern to us. It's telling us we've reached our iteration limit. In other words, it kept iterating until it hit that limit.

But it may not have fully converged on a solution. So it's finally done. And we've got our results. So we don't have anything printed.

How do we see our results? Well, they're actually stored as part of the model. Now, you'll notice I actually put--

I did that quite quickly. Let me stop and go back. Because they're stored as part of the model, this model is an object and it has attributes with it. Those attributes can be accessed through various methods.

Again, we're into a little bit of object-oriented programming. But the key thing is we can access those attributes. When I hit the period here, this means I want one of the attributes or methods of my model. What I can do is now I'm going to hit Tab.

And this will bring up a list of attributes. So this will tell us all the different things we want to know. First of all, the C, that's going to be our regularization. We may or may not have talked about regularization at this point.

But an important reminder that regularization is key to any model. One of the most important things is, of course, our coefficients or our slopes. So let's take a look at what our coefficients are. This is going to tell us all the coefficients or slopes for each column.

Now, at this point, our columns are unnamed. But we can actually tie these back because these are in order. This is the coefficient for the first, or zeroth, column, the second column, and so on and so forth. Obviously, the larger the coefficient in absolute value, the more important the column is.

But if you remember, we didn't normalize that data. So the range of the data can have an impact on these coefficients. So the importance is somewhat hidden. And we'll see that when we normalize our data in just a moment.
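
One way to make coefficients comparable is to standardize the features before fitting. The sketch below uses synthetic data from `make_classification` as a stand-in (it is not the demo's data set), inflates one feature's range on purpose, and fits inside a pipeline. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 5 features, 3 of them informative.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

# Put the first feature on a much larger scale, like a change of units.
X[:, 0] *= 1000.0

# Standardize each feature to zero mean and unit variance, then fit.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X, y)

# Coefficients now refer to standardized features, so their absolute
# values can be compared with each other.
coefs = scaled_model.named_steps["logisticregression"].coef_[0]
ranking = np.argsort(-np.abs(coefs))  # feature indices, largest first

print(coefs)
print(ranking)
```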

But we've got our coefficients out. Let's see if there's anything else. Let's take a look. Is the intercept part of these coefficients or not?

One of the quick ways we can look at that is let's take a look at our data frame. And remember, we've got an extra column in there. That's the target. So we see that there are 31 columns.

Let's take a look at how many coefficients there are. So you'll notice there are 30 coefficients. Now, this shape was 31. But don't forget we've got the targets in there.

So we've got a coefficient for each column, which means the y-intercept is not actually included in these coefficients. It's always good to check to see whether they consider the intercept independent of the coefficients. Because many times, we group it in. And because data science is a conglomerate that brings in methods from multiple sciences, you can never be sure.

So it's always good to check whether or not your assumptions are correct. I would have initially assumed that the intercept was part of these coefficients. But in this case, it's not. Let's take a look at some of the other things that are involved here.
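
The same check can be written out in code. The sketch below uses a synthetic 30-feature stand-in from `make_classification` rather than the demo's data, and passes `cv=5` and `max_iter=1000` explicitly; those settings are choices made for this sketch, not the demo's defaults.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

# Synthetic stand-in with 30 features, like the demo's data frame
# once the target column is removed.
X, y = make_classification(n_samples=300, n_features=30, random_state=0)

model = LogisticRegressionCV(cv=5, max_iter=1000)
model.fit(X, y)

# One coefficient per feature column; the intercept is stored separately.
print(model.coef_.shape)
print(model.intercept_.shape)
```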

So as we saw, we've got the different values. We see here is the intercept. And we've got all the other parameters. What we didn't do--

my apologies for accidentally clicking on that--

was check how many cross-validation folds we used. So let's see if we can find that out now. And it's actually not telling us. So we can't tell at this point what the default cross-validation was.

This is sometimes when we might have to go to the documentation. But let's see if we can possibly change that because we know this is a value in our logistic regression. If we remember our fit method or possibly our instantiation, we can possibly change the value for CV. So let's go back.

And I'm going to create a new model. And, of course, it's an instance of the Logistic Regression CV. And let's see if we can figure out how to change the number of folds. Now, we're not seeing anything.

I press Tab to come up. And so it's not showing us. One of the things I can do is--

I know that typically one of the things is n folds. Now, I've got n jobs. And at this point, because I don't use logistic regression a lot, I may have to go back and refresh my memory with the documentation. So let's go and do that.

And let's take a look at the Sklearn documentation. It's a little bit hard to remember all the things. So don't be ashamed if you can't remember these things off the top of your head. So let's create a new tab.

And I'm going to go and open up the Sklearn documentation.


8.  Demonstration of Logistic Regression, Part II

Here we are at the SK Learn documentation. What we're going to do is we're going to look for that logistic regression CV. Now there are a number of ways to get there. We'll just type that in.

And we see logistic regression CV pop up. Now we've got our documentation. And we can see right here how to define the different important parameters. And we see that we may have caught ourselves a little bit because we see the default is CV equals none.

And so we have this CV. And we look down here. It's the cross-validation generator. And the default being none.

So even though we started off assuming we were doing cross-validation, we never actually chose the number of folds ourselves. With CV left at none, Sklearn falls back to its own default splitting rather than an integer we picked. So to properly control the cross-validation, we'll actually have to assign an integer. This is why it's helpful to always bring up the documentation. It's fairly clear here in SK Learn.

And it's one of the benefits of this package to go through. So while we're here, we can take a look at all the different options just to remind ourselves that we're doing things the correct way. So we start with our cross-validation. And as we browse through this, we see that this would be our regularization, our Cs, as I may have mentioned before.

That may not be obvious. I've been here before so I know that C represents the regularization. We'll see our penalties down here. We'll talk about that frequently throughout the course.

We'll talk about our scoring. So we can have a report on the different scoring. And right now it's not telling us any scoring. We saw the stopping criteria.

Now you may ask yourself, what do you mean stopping criteria? As the different solvers are used, they stop when the loss stops decreasing by a certain amount. And so we don't typically use this, but that's what our warnings were about, if you noticed while reading along. We have various options for class weights and efficiencies.

These are more computation. So our main thing is let's go back and do our cross-validation. So I'm going to click back. And let's do a five-fold cross-validation.
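
Setting the number of folds can be sketched as follows. `get_params()` reports the constructor arguments an estimator was created with, so it is a quick way to confirm what was set. The scikit-learn documentation describes `cv=None` as falling back to a default stratified k-fold split, with the number of folds depending on the installed version, so passing an integer makes the choice explicit. The variable names are illustrative.

```python
from sklearn.linear_model import LogisticRegressionCV

# Default instance: cv is left as None.
default_model = LogisticRegressionCV()

# Explicitly request five folds.
model_5fold = LogisticRegressionCV(cv=5)

print(default_model.get_params()["cv"])
print(model_5fold.get_params()["cv"])
```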

So we've got this new model. And let's go ahead and fit it. Now I'm going to go and scroll up. And I'm just going to copy what I used before, because I know it works.

So we're still seeing the warning about the number of iterations. But we're still not getting results. Let's see if we can add in that scoring metric. So we're just going to rewrite, even though this is our new model.

And let's see what the accuracy is. So once again, we're just going to use the same fit command. And we haven't had anything come out. So is that stored somewhere within our model?

So let's take a look. Aha. We've got scores in here. So what we're seeing on scores is not what we expect.

We saw for scoring we expected actually five values out. So this is something that's not our actual scores of our cross-validation. So when we actually do the score, this is asking us for our x and y values. What we're finding is that the CV method does not actually return to us the scores for our cross-folds.

So what we can do is move back to a regular logistic regression and use the model selection module to do cross-validation. However, at this point, let's see what our score is just for our basic x and y. I'm just going to quickly pull these. And you can see we've got 97% accuracy.

However, the astute among you will notice this is data that was used to fit the model. So this is a biased estimate. And what we're really interested in is our unbiased estimate. It doesn't appear that the logistic regression CV is immediately forthcoming with those unbiased estimates.
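
The difference between a biased and an unbiased estimate can be seen by scoring the same model on the rows it was fit on and on rows it never saw. This is a sketch on synthetic stand-in data using a single `train_test_split`, a simpler alternative to the cross-validation used later in the demo. All names and settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the demo's data set).
X, y = make_classification(n_samples=400, n_features=30, random_state=0)

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # biased: same rows used to fit
test_acc = model.score(X_test, y_test)     # estimate on unseen rows

print(train_acc, test_acc)
```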

I'm going to take one more quick look before we move on to a different method. From this perspective, I'm going to move on to the cross-fold validation in the model selection to see if we can get a better estimate of our scoring. To do that, I will import the cross-validation method. And we're going to switch to regular logistic regression.

What I've done here is I've imported a method called cross_val_score. This will give us the scores across a cross-validation for any model. So this is not unique to logistic regression like the logistic regression CV was. So let's go ahead and use this cross_val_score with our log reg model, which is a simple logistic model.

So what we'll need to do is we'll need to understand how to do the cross_val_score. For that, let's take a quick look at SK Learn. We see here cross_val_score takes an estimator. That's going to be our log reg model, our x and y data.

It's going to take our scoring. So that will be what metric we want to use. And then it will use a default five-fold cross-validation. Now you'll notice here it says none.

However, if we read in the details, none uses the default five-fold cross-validation. This is probably the origin of why I myself tend to use five-fold cross-validation so that we know if we call this with the default parameters, we're getting a five-fold cross validation. So let's go back and do that. Now we need our estimator.

We need our x data and our y data. We remember that we don't need to specify a CV because the default is 5. Let's add our accuracy in. This line has gotten a little long, but you can see I've got my cross-validation score.

I put in what model I'm going to use. I have my x data. I have my y or target data. And I have the scoring I want to use.
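
The call described here has roughly the following shape. This sketch uses synthetic stand-in data, and it passes `cv=5` explicitly so that the number of folds does not depend on the installed scikit-learn version. `log_reg` and `scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the x data and y (target) data.
X, y = make_classification(n_samples=300, n_features=30, random_state=0)

# A plain logistic regression, not the CV variant.
log_reg = LogisticRegression(max_iter=1000)

# Estimator, x data, y data, and the scoring metric; one score per fold.
scores = cross_val_score(log_reg, X, y, scoring="accuracy", cv=5)

print(scores)
print(scores.mean())
```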

Let's go ahead and run it. And you can see here we've got a slightly lower value. This is important. It's not bad.

We didn't do anything incorrectly. What these represent are the unbiased estimates. Before our 97% was a biased estimate. We used the entire data to fit that particular model.

However, when we ran the data through, that data had been used to build the model. In other words, the model had seen the data ahead of time, resulting in biased results. Here, of course, we split our data. And so when data was sent through for scoring, that data was not used to build the model.

And of course, the reason we have five scores with five different values is we split the data into five parts. And each time that part of the data was held out as a test set and the other parts were used to train the model and fit it. The test set was then run through without the model having seen it before and thus produces an unbiased prediction. So we see that while these predictions are a lower score, they're actually a more accurate--

with no relation to our scoring using accuracy, they're a more accurate representation of how our model will perform on unknown data. This is a good example of how logistic regression and cross-validation can be used together to get that estimate of performance on an unknown data set. At this point, we're going to go ahead and conclude our demo on logistic regression.
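
To see why there are five scores, it can help to write the splitting out by hand. The sketch below only builds the row indices for each fold on a hypothetical 20-row data set. A real run would usually shuffle or stratify the rows first and would fit and score a model inside the loop.

```python
import numpy as np

n_samples = 20  # hypothetical number of rows
indices = np.arange(n_samples)

# Split the row indices into five parts of equal size.
folds = np.array_split(indices, 5)

train_sizes = []
for k, test_idx in enumerate(folds):
    # Every row that is not in the held-out part is used for training.
    train_idx = np.setdiff1d(indices, test_idx)
    train_sizes.append(len(train_idx))
    # A model would be fit on train_idx and scored on test_idx here.
    print(k, test_idx, len(train_idx))
```

Each row lands in exactly one held-out part, so every row is scored once by a model that did not see it during fitting.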


9. Case Study 2 Introduction
Welcome to our case study on diabetes. This week's case study is one that has some possibly controversial data. I want you to carefully follow the discussion and develop your own thoughts about the material being discussed. Some people may want to avoid this type of conversation.

However, as data scientists, we are often confronted with things that have ethical concerns. So as you watch this video, continue to monitor how our protagonist asks about the problem. And then also watch how they raise their concerns. See if you have the same or different concerns.

Feel free to submit those as part of your participation. Don't forget, important details about the problem are included in the video. Once again, we are simulating a real world problem and a presentation from a data science professional to a, this time, slightly more seasoned data scientist. Again, we will pause to have you take a look at the data.

And you can submit further questions towards the end. Those questions can then be uploaded as part of your participation grade.

10. Case Study 2, Part I

All right. So this week we have a problem coming to us from the medical community. We're looking specifically at a diabetes study. And the problem is hospital readmission.

Now we don't want people in hospitals. We want them to be well. And we certainly don't want them to be readmitted. This comes at a huge cost to the patient in terms of bills, lost wages, strain on their family and whatnot.

So our goal is no readmission. So for this study, what we're trying to do is we want to try to predict readmission of the patient within 30 days of initial hospitalization. So take a look at that data and we'll come back and take any questions you have.

11. Case Study 2, Part II
 
So you've had a chance to look at the data. Do you have any questions about it?

I actually do have a couple of questions about this particular assignment. And the first one is, well, pretty important to me. It's regarding the race category. And I don't understand its significance. It seems actually kind of like sorting patients with this criteria could be seen as racially biased or something. I mean, simply talking about identifying who was getting readmitted to the hospital. So I really don't understand why race matters.

OK. Absolutely. And thank you for bringing that up because I absolutely agree with you. Normally, this is not something that we would take into consideration. Sorting by race can bring in a lot of ethical considerations. In this case, we're talking about the medical community. We're talking about patients. And we do know that diabetes affects different demographics differently. So race actually could very well be a factor in this. Now, that being said, I will leave it to you. As long as we can chart the trends accurately, I'm not as concerned about how we get there.

OK. Got it. Thank you. So my next question is all about all of these question marks that are in the data set. What's going on with that?

Yeah. The study took place over 10 years. There's something like 130 hospitals that they were pulling data from. There are a lot of people entering data into this study. And so there are holes. So we're just going to have to make the best recommendation we can based on the data that we have.

OK. Understood. I'll get working right away. Thank you.

You will submit your assignment to this page.

Please refer to your course syllabus for additional details about this assignment.

When you save your assignment, the file name should be "First Name_Last Name_Assignment Name."

Your case study is to build classifiers using logistic regression to predict ALL 3 CATEGORIES of hospital re-admittance. There is missing data that must be imputed.

Once again, discuss the top 5 most important variables of your best model as part of your submission.
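
As a starting point for the missing-data step, the question marks can be converted to proper missing values and then imputed. The sketch below is not the required method. The miniature data frame, its column names, and its category labels are all hypothetical, and filling with the most frequent value is only one simple imputation choice among several.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the case-study data; names and labels are made up.
df = pd.DataFrame({
    "race": ["A", "?", "B", "A", "?"],
    "num_visits": [1, 3, 2, 5, 4],
    "readmitted": ["no", "within_30", "after_30", "no", "within_30"],
})

# Turn the "?" placeholders into real missing values so pandas can count them.
df = df.replace("?", np.nan)
missing_counts = df.isna().sum()

# One simple choice: fill a categorical column with its most frequent value.
df["race"] = df["race"].fillna(df["race"].mode()[0])

print(missing_counts)
print(df)
```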

### **What is Discrete Data?**

**Discrete data** refers to a type of quantitative data that consists of distinct, separate values. These values are countable and cannot take on every possible value within a range. In other words, discrete data is made up of whole numbers or specific categories, and there are no intermediate values between two data points.

---

### **Key Characteristics of Discrete Data**
1. **Countable**: Discrete data represents items that can be counted (e.g., the number of students in a class).
2. **Finite or Infinite**: It can have a finite set of values (e.g., dice rolls: 1, 2, 3, 4, 5, 6) or an infinite set (e.g., the number of times a coin is flipped until heads appears).
3. **No Fractions or Decimals**: Discrete data cannot take fractional or decimal values (e.g., you can't have 2.5 students).
4. **Categorical or Numerical**: While discrete data is often numerical, it can also represent categories (e.g., types of cars: sedan, SUV, truck).

---

### **Examples of Discrete Data**
1. **Numerical Examples**:
   - Number of children in a family (e.g., 0, 1, 2, 3).
   - Number of cars in a parking lot.
   - Number of goals scored in a soccer match.

2. **Categorical Examples**:
   - Shoe sizes (e.g., 7, 8, 9, 10).
   - Types of fruits in a basket (e.g., apples, oranges, bananas).

---

### **Discrete vs. Continuous Data**
| **Feature**            | **Discrete Data**                          | **Continuous Data**                       |
|-------------------------|--------------------------------------------|-------------------------------------------|
| **Definition**          | Countable, distinct values                | Measurable, infinite values within a range|
| **Examples**            | Number of students, dice rolls            | Height, weight, temperature               |
| **Values**              | Whole numbers or categories               | Can include fractions and decimals        |
| **Graph Representation**| Bar charts, pie charts                    | Histograms, line graphs                   |

---

### **How to Analyze Discrete Data**
1. **Visualization**:
   - Use **bar charts** or **pie charts** to represent discrete data visually.
   - Example: A bar chart showing the number of students in each grade.

2. **Statistical Measures**:
   - Calculate measures like **mean**, **median**, **mode**, and **range**.
   - Example: The average number of cars sold by a dealership per day.

3. **Probability**:
   - Discrete data is often used in probability distributions, such as the **binomial distribution** or **Poisson distribution**.
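
The statistical measures listed above can be computed with the standard library alone. This is a small sketch on a made-up list of goals scored per match; the numbers are illustrative only.

```python
from collections import Counter
from statistics import mean, median, mode

# Hypothetical counts: goals scored in eight matches.
goals = [0, 1, 1, 2, 3, 1, 0, 2]

counts = Counter(goals)                # frequency of each value (bar chart input)
avg = mean(goals)                      # arithmetic mean
med = median(goals)                    # middle value of the sorted counts
most_common = mode(goals)              # most frequent value
value_range = max(goals) - min(goals)  # spread between extremes

print(counts)
print(avg, med, most_common, value_range)
```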

---

### **Applications of Discrete Data**
- **Business**: Tracking the number of products sold or customer complaints.
- **Education**: Counting the number of students in different classes.
- **Healthcare**: Recording the number of patients visiting a clinic daily.
- **Sports**: Counting goals, points, or wins in a game.

---

### **Conclusion**
Discrete data is a fundamental concept in statistics and data analysis. It represents countable, distinct values and is often used in scenarios where whole numbers or specific categories are involved. Understanding discrete data is essential for analyzing and interpreting real-world phenomena effectively.

### Summary: Transitioning from Linear Regression to Logistic Regression

This explanation focuses on the transition from **linear regression** to **logistic regression**, particularly when dealing with **categorical data** and binary classification tasks.

1. **Problem with Linear Regression for Categorical Targets**:
   - Linear regression predicts continuous outputs, which are not suitable for categorical targets (e.g., predicting 0 or 1 for binary classification).
   - Even though the targets are only 0 and 1, linear regression can still produce predictions outside this range, making it unreliable for classification tasks.

2. **Discrete vs. Continuous Outputs**:
   - Continuous outputs vary smoothly with input values (e.g., linear regression).
   - Discrete outputs, like binary targets (0 or 1), only take specific values and are better suited for classification tasks.

3. **Binary Classification**:
   - The target variable is binary (e.g., true/false, 0/1).
   - One-hot encoding is used to transform categorical data into numeric data, but linear regression cannot directly handle this for classification.

4. **Introducing Logistic Regression**:
   - Logistic regression applies a **variable transformation** to convert the continuous output of linear regression into a probability between 0 and 1, which can then be thresholded into a discrete class.
   - This transformation is achieved using the **sigmoid function** (denoted as σ), which maps any input to a value between 0 and 1.

5. **Sigmoid Function**:
   - The sigmoid function takes the linear regression equation (`y = mx + b`) as input and outputs a probability value between 0 and 1.
   - This ensures the output is suitable for binary classification tasks.

6. **Compact Notation**:
   - Logistic regression builds on the familiar linear regression equation (`y = m1x1 + m2x2 + ... + b`), but applies the sigmoid function to the result.
   - The summation sign is often implied for brevity, as the structure of the equation remains similar to linear regression.

### Key Takeaway:
Logistic regression extends linear regression by introducing a **sigmoid function** to model **binary categorical outputs**. This transformation ensures the output is constrained between 0 and 1, making it suitable for classification tasks where the target is discrete.
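
The out-of-range problem is easy to reproduce. The sketch below fits an ordinary straight line to a made-up binary target with `np.polyfit`; the ten data points are illustrative only.

```python
import numpy as np

# Made-up data: one feature, binary target that switches from 0 to 1.
x = np.arange(10, dtype=float)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

# Ordinary least-squares line y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, 1)
linear_pred = slope * x + intercept

# The fitted line dips below 0 at the low end and rises above 1 at the
# high end, so its outputs cannot be read as probabilities.
print(linear_pred.min(), linear_pred.max())
```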

This slide introduces **logistic regression** by building on the familiar **linear regression equation** and applying a transformation to make it suitable for classification tasks, particularly binary classification.

---

### **1. Linear Regression Equation**
The original linear regression equation is:
\[
y = m_1x_1 + m_2x_2 + \dots = \sum m_ix_i
\]
- **Explanation**:
  - \(m_i\): Coefficients (slopes) for each feature \(x_i\).
  - \(x_i\): Input features (independent variables).
  - \(y\): Output (dependent variable), which is continuous in linear regression.

---

### **2. Problem with Linear Regression for Classification**
- Linear regression produces continuous outputs, which are not ideal for classification tasks where the target is categorical (e.g., 0 or 1 for binary classification).
- To address this, we need to transform the output into a **probability** that lies between 0 and 1.

---

### **3. Logistic Regression Equation**
The logistic regression equation introduces the **sigmoid function** to transform the output:
\[
y = \frac{1}{1 + e^{-\sum m_ix_i}}
\]
- **Explanation**:
  - The sigmoid function maps any real number (from \(-\infty\) to \(+\infty\)) to a value between 0 and 1.
  - This makes the output interpretable as a **probability** for binary classification.

---

### **4. Compact Notation**
The equation can be written more compactly as:
\[
y = \sigma(\sum m_ix_i)
\]
- **\(\sigma\)**: The sigmoid function, defined as:
  \[
  \sigma(z) = \frac{1}{1 + e^{-z}}
  \]
- The summation sign (\(\sum\)) is often dropped for clarity, as it is implied that the input is the weighted sum of features (\(m_ix_i\)).

---

### **Key Takeaways**
- **Linear regression** is extended to **logistic regression** by applying the sigmoid function to the output.
- The sigmoid function ensures the output is between 0 and 1, making it suitable for **binary classification** tasks.
- Logistic regression predicts the **probability** of a binary outcome (e.g., 0 or 1), where the decision boundary is typically set at 0.5.
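
As a minimal sketch of the equation \(y = \sigma(\sum m_ix_i)\) in code, the example below applies the sigmoid to a weighted sum and thresholds the result at 0.5. The coefficient values and feature rows are hypothetical, chosen only to show the arithmetic.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients m_i, one per feature.
m = np.array([0.8, -0.5, 1.2])

# Three hypothetical samples, one row each.
X = np.array([
    [1.0, 2.0, 0.5],
    [0.0, 3.0, -1.0],
    [0.0, 0.0, 0.0],
])

z = X @ m                          # weighted sum for each row
prob = sigmoid(z)                  # probability of Class 1
label = (prob >= 0.5).astype(int)  # decision boundary at 0.5

print(z)
print(prob)
print(label)
```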


### **Summary and Expansion: Understanding the Sigmoid Function in Logistic Regression**

The sigmoid function is a key mathematical tool in transforming linear regression into logistic regression. It maps continuous input values into a range between 0 and 1, making it suitable for binary classification tasks. Here's a breakdown and deeper explanation of its properties and applications:

---

### **1. What Does the Sigmoid Function Do?**
- **Purpose**: The sigmoid function transforms the output of a linear equation into a probability between 0 and 1.
- **Equation**: 
  \[
  y = \frac{1}{1 + e^{-\sum m_ix_i}}
  \]
  - \(m_i\): Coefficients (slopes) for each input feature \(x_i\).
  - \(\sum m_ix_i\): The linear combination of features and coefficients.
  - \(e\): Euler's number, the base of the natural logarithm (approximately 2.718).

- **Behavior**:
  - For large positive values of \(\sum m_ix_i\), the output approaches 1.
  - For large negative values of \(\sum m_ix_i\), the output approaches 0.
  - For values near 0, the output is approximately 0.5.

This transformation creates an **S-shaped curve** (sigmoid curve), which is ideal for modeling probabilities.

---

### **2. Relationship to Binary Classification**
- **Binary Outputs**: Logistic regression predicts probabilities for two classes:
  - Class 1: Probability close to 1.
  - Class 0: Probability close to 0.
- **Thresholding**:
  - A default threshold of \(y = 0.5\) is often used:
    - \(y \geq 0.5\): Assign to Class 1.
    - \(y < 0.5\): Assign to Class 0.
  - However, the threshold can be adjusted based on the problem (e.g., fraud detection, medical diagnosis) to balance sensitivity and specificity.

---

### **3. Properties of the Sigmoid Function**
- **Output Behavior**:
  - When \(\sum m_ix_i\) is very large and positive:
    - \(e^{-\text{large positive number}} \to 0\), so \(y \to 1\).
  - When \(\sum m_ix_i\) is very large and negative:
    - \(e^{-\text{large negative number}} \to \infty\), so \(y \to 0\).
- **Smooth Transition**:
  - The sigmoid function smoothly transitions from 0 to 1, avoiding abrupt changes like a step function.
  - This smoothness allows the function to output probabilities, which can then be thresholded for classification.

---

### **4. Adjusting the Threshold**
- **Default Threshold**: \(y = 0.5\).
- **Custom Thresholds**:
  - Lower thresholds (e.g., \(y = 0.2\)) increase sensitivity, useful for detecting rare events like fraud.
  - Higher thresholds (e.g., \(y = 0.9\)) increase specificity, useful for high-confidence decisions.
- **Importance of Thresholds**:
  - Choosing the right threshold is critical for balancing false positives and false negatives, depending on the application's requirements.
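
The effect of moving the threshold can be sketched with a handful of made-up predicted probabilities. The values below are illustrative only; in practice they would come from a fitted model.

```python
import numpy as np

# Hypothetical predicted probabilities for six samples.
prob = np.array([0.05, 0.15, 0.30, 0.55, 0.70, 0.95])

# Count how many samples are assigned to Class 1 at each threshold.
n_positive = {}
for threshold in (0.2, 0.5, 0.9):
    predicted = (prob >= threshold).astype(int)
    n_positive[threshold] = int(predicted.sum())
    print(threshold, predicted)

print(n_positive)
```

Lowering the threshold flags more samples as Class 1, and raising it flags fewer, which is the trade-off between false positives and false negatives described above.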

---

### **5. Why the Sigmoid Function?**
- **Simulates a Step Function**:
  - At extreme values, the sigmoid function behaves like a step function, producing outputs close to 0 or 1.
- **Mathematical Simplicity**:
  - Despite involving exponentials, the sigmoid function has a simple derivative, \(\sigma'(z) = \sigma(z)\,(1 - \sigma(z))\), which keeps gradient calculations cheap.
- **Probabilistic Interpretation**:
  - The output of the sigmoid function can be interpreted as the probability of belonging to Class 1.
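
The mathematical-simplicity point can be made concrete. As a short worked derivation (standard calculus, not taken from the slides), differentiating the sigmoid gives an expression in terms of the sigmoid itself:

\[
\sigma'(z) = \frac{d}{dz}\left(1 + e^{-z}\right)^{-1} = \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}} = \frac{1}{1 + e^{-z}} \cdot \frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)
\]

So once \(\sigma(z)\) has been computed, its slope costs one extra multiplication. The slope is largest at \(z = 0\), where \(\sigma'(0) = 0.25\).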

---

### **6. Practical Considerations**
- **Framing the Problem**:
  - Logistic regression is framed such that positive values of \(\sum m_ix_i\) correspond to Class 1, and negative values correspond to Class 0.
- **Applications**:
  - Logistic regression is widely used in applications like fraud detection, medical diagnosis, and spam classification.

---

### **Conclusion**
The sigmoid function is the cornerstone of logistic regression, enabling the transformation of continuous linear outputs into probabilities suitable for binary classification. Its smooth, S-shaped curve and ability to simulate step-like behavior make it ideal for modeling discrete outcomes. By adjusting thresholds and leveraging its mathematical properties, the sigmoid function provides flexibility and precision in a variety of real-world applications.

### **Summary of Logarithmic Loss and Its Derivation**

Logarithmic Loss (Log Loss) is a critical loss function used in classification problems, particularly for binary classification tasks. It evaluates the performance of a classification model by penalizing predictions that deviate from the true target probabilities. Here’s a breakdown of the key ideas:

1. **Loss Functions in Regression**:
   - For regression, we typically use **Mean Squared Error (MSE)** or **Mean Absolute Error (MAE)** to minimize the vertical distance between predictions (\( \hat{y} \)) and true targets (\( y \)).
   - These measure error using continuous distances between predicted and actual values.

2. **Challenges in Classification**:
   - Classification involves discrete labels, such as \( y = 0 \) or \( y = 1 \).
   - Predictions are probabilistic, e.g., \( p(y=1 | x) \), derived from models like logistic regression with a **sigmoid activation function**.

3. **Log Loss Function**:
   - The sigmoid function outputs probabilities (\( p \)) between 0 and 1:
     \[
     p = \sigma(z) = \frac{1}{1 + e^{-z}}
     \]
   - The log loss function measures how close these probabilities are to the true labels using the natural logarithm (\( \ln \)).

4. **Mathematical Derivation**:
   - Log loss for \( y=1 \):
     \[
     \text{Loss} = -\ln(p)
     \]
   - Log loss for \( y=0 \):
     \[
     \text{Loss} = -\ln(1-p)
     \]
   - Combined, for general \( y \) (binary):
     \[
     \text{Log Loss} = - \big( y \ln(p) + (1-y) \ln(1-p) \big)
     \]
   - Here, \( p \) is the model's predicted probability for \( y=1 \), and \( 1-p \) for \( y=0 \).

5. **Intuition**:
   - **Penalizing incorrect predictions**: Logarithms amplify penalties for predictions that are far from the true label.
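   - A prediction of \( p=0.99 \) for \( y=1 \) incurs a small loss (\( -\ln 0.99 \approx 0.01 \)), but \( p=0.01 \) incurs a large loss (\( -\ln 0.01 \approx 4.61 \)).
   - For logistic regression, the log loss is **convex** in the model weights, so any local minimum is also a global minimum.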

6. **Visualization**:
   - The log loss curves for \( y=1 \) and \( y=0 \) are mirror images of each other about \( p = 0.5 \).
   - The total loss function forms a convex shape, simplifying optimization.

---

### **Mathematical Notation in Code**

#### **Python Code Example**
Below is a Python implementation of the log loss function for binary classification.

```python
import numpy as np

def sigmoid(z):
    """Compute the sigmoid of z."""
    return 1 / (1 + np.exp(-z))

def log_loss(y_true, y_pred):
    """
    Compute the log loss for binary classification.
    
    Parameters:
    - y_true: True labels (0 or 1).
    - y_pred: Predicted probabilities (between 0 and 1).
    
    Returns:
    - Average log loss.
    """
    epsilon = 1e-15  # Avoid log(0) by clipping predictions
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return loss

# Example usage
y_true = np.array([1, 0, 1, 0])  # True labels
y_pred = np.array([0.9, 0.1, 0.8, 0.2])  # Predicted probabilities

print("Log Loss:", log_loss(y_true, y_pred))
```

---

### **Visualization in Python**

The following Python code visualizes log loss for \( y=1 \) and \( y=0 \).

```python
import numpy as np
import matplotlib.pyplot as plt

# Probabilities from 0.01 to 0.99 (the endpoints 0 and 1 are excluded to avoid log(0))
p = np.linspace(0.01, 0.99, 100)

# Log loss for y=1 and y=0
loss_y1 = -np.log(p)
loss_y0 = -np.log(1 - p)

# Plot
plt.figure(figsize=(8, 6))
plt.plot(p, loss_y1, label='Loss (y=1)', color='blue')
plt.plot(p, loss_y0, label='Loss (y=0)', color='orange')
plt.title("Logarithmic Loss for Binary Classification")
plt.xlabel("Predicted Probability (p)")
plt.ylabel("Log Loss")
plt.legend()
plt.grid()
plt.show()
```

---

### **Summary of Key Insights**
- **Log Loss** combines terms for \( y=1 \) and \( y=0 \) into a single, symmetric function.
- It penalizes incorrect predictions more as they deviate from the true class probability.
- The convexity of log loss ensures optimization is feasible with methods like gradient descent.

This foundation makes log loss a standard for classification tasks and a crucial metric for training and evaluating models like logistic regression and neural networks.

### Summary and Expansion: Logarithmic Loss (Log Loss) in Logistic Regression

Logarithmic loss, or log loss, is a key loss function used in logistic regression to evaluate the performance of a model when predicting probabilities for **binary classification tasks**. It measures how far the predicted probabilities are from the actual target values (0 or 1).

---

### **1. Why Not Use Mean Squared Error (MSE)?**
- **Linear Regression**: In linear regression, we use MSE or Mean Absolute Error (MAE) to measure the distance between predictions and targets. These methods work well for continuous outputs.
- **Categorical Data**: For binary classification, the target values are either 0 or 1. Using MSE is not ideal: combined with the sigmoid it gives an objective that is not convex in the weights, and it caps the penalty for a confidently wrong prediction at 1.
- **Need for Log Loss**: Log loss is designed specifically for classification tasks, penalizing incorrect predictions more effectively by using probabilities.
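
A small numeric comparison makes the difference in penalties concrete. This is an illustrative sketch with made-up probabilities for a single sample whose true label is 1; the variable names are assumptions.

```python
import numpy as np

y = 1.0                                # true label
p = np.array([0.9, 0.5, 0.1, 0.001])   # increasingly wrong predictions

squared_error = (y - p) ** 2           # bounded above by 1
log_penalty = -np.log(p)               # grows without bound as p -> 0

for pi, se, lp in zip(p, squared_error, log_penalty):
    print(f"p={pi:<6} squared error={se:.3f}  log loss={lp:.3f}")
```

The squared error flattens out just below 1, whereas the log loss keeps growing as the prediction becomes more confidently wrong.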

---

### **2. Logarithmic Loss Function**
The log loss function combines two cases: when the target \(y = 1\) and when \(y = 0\). It is defined as:

\[
\text{Log Loss} = - \left[ y \cdot \log(p) + (1 - y) \cdot \log(1 - p) \right]
\]

- **Key Components**:
  - \(p\): Predicted probability (output of the sigmoid function).
  - \(y\): Actual target value (0 or 1).
  - \(\log\): Natural logarithm (base \(e\)).

- **Explanation**:
  - When \(y = 1\): The first term \(-y \cdot \log(p)\) is active, and the second term becomes zero.
  - When \(y = 0\): The second term \(-(1 - y) \cdot \log(1 - p)\) is active, and the first term becomes zero.
  - This ensures that only the relevant term contributes to the loss based on the actual target.

---

### **3. Intuition Behind Log Loss**
- **Penalty for Incorrect Predictions**:
  - If the predicted probability \(p\) is close to the actual target \(y\), the loss is small.
  - If \(p\) is far from \(y\), the loss is large.
- **Behavior**:
  - For \(y = 1\), if \(p\) is close to 1, \(-\log(p)\) is small. If \(p\) is close to 0, \(-\log(p)\) becomes very large.
  - For \(y = 0\), if \(p\) is close to 0, \(-\log(1 - p)\) is small. If \(p\) is close to 1, \(-\log(1 - p)\) becomes very large.

This penalization encourages the model to predict probabilities closer to the true target.
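
The four cases described above can be evaluated directly. The function name `sample_log_loss` and the probabilities used are illustrative assumptions for this sketch.

```python
import math

def sample_log_loss(y, p):
    """Log loss for one sample with label y (0 or 1) and predicted probability p."""
    # Only one of the two terms is non-zero, depending on y.
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

for y, p in [(1, 0.9), (1, 0.1), (0, 0.1), (0, 0.9)]:
    print(f"y={y}, p={p}: loss={sample_log_loss(y, p):.3f}")
```

The two "good" predictions receive the same small loss and the two "bad" predictions the same large one, which matches the mirror-image behavior of the two terms.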

---

### **4. Why Use Log Loss?**
- **Probabilistic Predictions**:
  - Logistic regression outputs probabilities using the sigmoid function. Log loss evaluates these probabilities rather than discrete predictions.
- **Coupled Classes**:
  - Since binary classification involves two mutually exclusive classes (0 and 1), log loss accounts for both cases simultaneously.
- **Convexity**:
  - For logistic regression, log loss is **convex** in the weights, so gradient descent cannot get stuck in a spurious local minimum.

---

### **5. Visualization of Log Loss**
- **Loss for \(y = 1\)**:
  - As \(p \to 1\), the loss approaches 0 (correct prediction).
  - As \(p \to 0\), the loss becomes very large (incorrect prediction).
- **Loss for \(y = 0\)**:
  - As \(p \to 0\), the loss approaches 0 (correct prediction).
  - As \(p \to 1\), the loss becomes very large (incorrect prediction).
- **Combined Loss**:
  - The total log loss is convex in the model weights, so any minimum that is found is a global one.

---

### **6. Practical Applications**
- **Thresholding**:
  - Logistic regression predicts probabilities. A threshold (e.g., 0.5) is used to classify probabilities into binary outcomes.
  - Log loss does not depend on the threshold: it scores the predicted probabilities directly, so two models that produce identical thresholded labels can still have different losses.
- **Model Training**:
  - Minimizing log loss during training ensures the model outputs probabilities that align closely with the true labels.
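
To illustrate how thresholding and log loss relate, the sketch below compares two hypothetical sets of predictions that produce the same hard labels at a 0.5 threshold. All numbers and names are made up for illustration.

```python
import numpy as np

def log_loss(y_true, p):
    """Average binary log loss, with clipping to avoid log(0)."""
    p = np.clip(p, 1e-15, 1 - 1e-15)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 0])
confident = np.array([0.95, 0.05, 0.90, 0.10])  # hypothetical model A
hesitant = np.array([0.60, 0.40, 0.55, 0.45])   # hypothetical model B

threshold = 0.5
labels_confident = (confident >= threshold).astype(int)
labels_hesitant = (hesitant >= threshold).astype(int)

print(labels_confident, labels_hesitant)  # identical hard labels
print(log_loss(y_true, confident), log_loss(y_true, hesitant))
```

Both models classify every sample correctly, yet the hesitant one has a noticeably higher log loss because its probabilities sit closer to 0.5.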

---

### **7. Advantages of Log Loss**
- **Handles Probabilities**: Unlike MSE, it evaluates probabilistic predictions effectively.
- **Penalizes Confident Wrong Predictions**: Predictions far from the true target are penalized more heavily.
- **Optimizable**: The convex nature of log loss ensures that optimization algorithms can find the global minimum.

---

### **Conclusion**
Log loss is a critical tool for evaluating and training logistic regression models. By penalizing incorrect predictions based on probabilities, it ensures the model outputs probabilities that are both accurate and reliable. Its mathematical properties and intuitive behavior make it ideal for binary classification tasks.

### **Minimizing Log Loss: Summary and Detailed Expansion**

---

### **Conceptual Overview**

Minimizing **logarithmic loss (log loss)** is a core step in training logistic regression models. A smaller log loss indicates that the model's predicted probabilities closely align with the true class labels, resulting in better classification performance.

#### **Key Components**
1. **Log Loss Function:**
   - The log loss \( J \) for binary classification is:
     \[
     J(\mathbf{w}) = -\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \Big]
     \]
     where:
     - \( N \) = number of data points,
     - \( y_i \) = true label (\( 0 \) or \( 1 \)),
     - \( p_i = \sigma(z_i) = \frac{1}{1 + e^{-z_i}} \), the predicted probability for \( y=1 \),
     - \( z_i = \mathbf{w}^T \mathbf{x}_i \), the linear combination of weights (\( \mathbf{w} \)) and features (\( \mathbf{x}_i \)).

2. **Prediction Function (Sigmoid):**
   - The sigmoid activation squashes the linear combination of inputs into a probability:
     \[
     \sigma(z) = \frac{1}{1 + e^{-z}}
     \]
     - Ensures \( 0 < p < 1 \), suitable for probability estimation.

3. **Optimization Goal:**
   - Minimize \( J(\mathbf{w}) \) by adjusting the weights \( \mathbf{w} \), which control the decision boundary of the classifier.

---

### **Mathematics of Minimizing Log Loss**

To minimize \( J(\mathbf{w}) \), we apply **gradient descent**:
1. **Gradient of Log Loss w.r.t Weights:**
   - The partial derivative of \( J \) with respect to \( \mathbf{w} \) is:
     \[
     \frac{\partial J}{\partial \mathbf{w}} = \frac{1}{N} \sum_{i=1}^{N} (\sigma(z_i) - y_i) \mathbf{x}_i
     \]
     - Intuition:
       - \( \sigma(z_i) - y_i \): The prediction error.
       - \( \mathbf{x}_i \): Feature vector influences weight updates.

2. **Gradient Descent Update Rule:**
   - Update weights iteratively:
     \[
     \mathbf{w} \leftarrow \mathbf{w} - \eta \frac{\partial J}{\partial \mathbf{w}}
     \]
     - \( \eta \): Learning rate controls step size.
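
One way to gain confidence in the gradient formula is to compare it against a finite-difference approximation. The sketch below does this on random data; the function names and the data are assumptions made for the check, not part of the original material.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, X, y):
    """Average binary log loss J(w)."""
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def analytic_gradient(w, X, y):
    """(1/N) * sum_i (sigma(z_i) - y_i) * x_i, as in the formula above."""
    return X.T @ (sigmoid(X @ w) - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)

# Central finite difference, one coordinate of w at a time.
h = 1e-6
numeric_gradient = np.zeros_like(w)
for j in range(len(w)):
    step = np.zeros_like(w)
    step[j] = h
    numeric_gradient[j] = (loss(w + step, X, y) - loss(w - step, X, y)) / (2 * h)

print(np.allclose(analytic_gradient(w, X, y), numeric_gradient, atol=1e-6))
```

Agreement between the two gradients suggests the closed-form expression is implemented correctly.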

---

### **Coding Representation**

Here’s the Python implementation:

#### **1. Sigmoid Function**
```python
import numpy as np

def sigmoid(z):
    """Compute sigmoid function."""
    return 1 / (1 + np.exp(-z))
```

---

#### **2. Log Loss Function**
```python
def log_loss(y_true, y_pred):
    """
    Compute binary log loss.
    
    Parameters:
    - y_true: True binary labels (array-like).
    - y_pred: Predicted probabilities (array-like).
    
    Returns:
    - Average log loss.
    """
    epsilon = 1e-15  # To avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```

---

#### **3. Logistic Regression Using Gradient Descent**
```python
def logistic_regression(X, y, lr=0.01, epochs=1000):
    """
    Train logistic regression using gradient descent.
    
    Parameters:
    - X: Feature matrix (N x d); include a column of ones if an intercept is needed.
    - y: Binary labels, shape (N,).
    - lr: Learning rate.
    - epochs: Number of iterations.
    
    Returns:
    - weights: Trained weights.
    - losses: Log loss per epoch.
    """
    N, d = X.shape
    weights = np.zeros(d)  # Initialize weights
    losses = []

    for epoch in range(epochs):
        # Linear combination
        z = np.dot(X, weights)
        # Predicted probabilities
        preds = sigmoid(z)
        # Compute log loss
        loss = log_loss(y, preds)
        losses.append(loss)
        # Gradient calculation
        gradient = np.dot(X.T, (preds - y)) / N
        # Weight update
        weights -= lr * gradient

    return weights, losses
```

---

#### **4. Visualization of Log Loss Minimization**
```python
import matplotlib.pyplot as plt

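# Generate synthetic data
np.random.seed(42)
features = np.random.rand(100, 2)
y = (features[:, 0] + features[:, 1] > 1).astype(int)
# Prepend a column of ones so the model can learn an intercept;
# the boundary x0 + x1 = 1 does not pass through the origin.
X = np.column_stack([np.ones(len(features)), features])

# Train logistic regression
weights, losses = logistic_regression(X, y, lr=0.1, epochs=200)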

# Plot log loss
plt.plot(losses)
plt.title("Log Loss Minimization")
plt.xlabel("Epochs")
plt.ylabel("Log Loss")
plt.grid()
plt.show()
```

---

### **Key Insights**
1. **Convexity of Log Loss**:
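   - Log loss is convex in the weights, so any local minimum is a global minimum.
   - With a sufficiently small learning rate, gradient descent approaches that minimum. If the classes are perfectly separable, the loss keeps decreasing toward 0 as the weights grow, and no finite minimizer exists.

2. **Weight Updates and Error**:
   - Each update moves the weights in the direction opposite to the gradient, which lowers the loss when the learning rate is small enough.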

3. **Relationship to Linear Regression**:
   - Logistic regression extends linear regression by squashing predictions into probabilities using the sigmoid function.
   - The gradient update rule is structurally similar, differing mainly in the sigmoid nonlinearity.

---

This mathematical and coding breakdown highlights the intuitive and practical aspects of minimizing log loss, making logistic regression a robust tool for binary classification tasks.

### Summary and Expansion: Minimizing Log Loss in Logistic Regression

This explanation focuses on how to minimize the **log loss function** in logistic regression, which is essential for training the model to make accurate predictions. The process involves leveraging optimization techniques, particularly gradient descent, to adjust the model's parameters (slopes) and achieve a well-fit model.

---

### **1. Objective of Minimizing Log Loss**
- **Goal**: Minimizing log loss ensures that the model produces predictions that are as close as possible to the actual target values.
- **Small Loss = Good Model**: A smaller log loss indicates a well-trained model with accurate predictions.

---

### **2. Log Loss Function Recap**
- The log loss function (\(J\)) is defined as:
  \[
  J = - \frac{1}{n} \sum_{i=1}^n \left[ y_i \cdot \log(p_i) + (1 - y_i) \cdot \log(1 - p_i) \right]
  \]
  - \(y_i\): Actual target (0 or 1).
  - \(p_i\): Predicted probability (output of the sigmoid function).
  - \(n\): Number of data points.

- **Key Insight**:
  - \(p\) is not just a number; it is the **output of the sigmoid function**:
    \[
    p = \sigma(m \cdot x) = \frac{1}{1 + e^{-m \cdot x}}
    \]
  - The sigmoid function depends on the input data (\(x\)) and the model parameters (\(m\), the slopes).

---

### **3. Minimizing Log Loss**
- To minimize log loss, the process involves:
  1. **Fixing the Data**:
     - The target values (\(y\)) and input data (\(x\)) are fixed and cannot be changed.
  2. **Adjusting the Slopes**:
     - The slopes (\(m\)) are the only variables that can be updated to minimize \(J\).

---

### **4. Gradient Descent for Optimization**
- **Gradient Descent**:
  - Gradient descent is used to find the minimum of the log loss function.
  - The partial derivative of \(J\) with respect to each slope (\(m_i\)) determines the direction and magnitude of the update.
  - The update rule is:
    \[
    m_i = m_i - \alpha \cdot \frac{\partial J}{\partial m_i}
    \]
    - \(\alpha\): Learning rate (controls the step size of the update).
    - \(\frac{\partial J}{\partial m_i}\): Partial derivative of \(J\) with respect to \(m_i\).

- **Key Insight**:
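  - The derivative of the log loss function with the sigmoid function is mathematically structured such that the update rule is **identical in form** to that of linear regression: each step is proportional to \((\text{prediction} - y)\,x\), with the prediction given by \(\sigma(m \cdot x)\) instead of \(m \cdot x\). This is due to the choice of the sigmoid and log functions.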
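
The shared structure of the two update rules can be sketched with one step function that takes the prediction function as an argument. This is an illustration under stated assumptions: the name `gradient_step`, the toy data, and the use of the \(\tfrac{1}{2}\)-scaled mean squared error for linear regression (so the factor of 2 drops out) are all choices made for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(m, X, y, predict, alpha):
    """One batch step: m <- m - alpha * mean((prediction - y) * x)."""
    error = predict(X @ m) - y
    return m - alpha * (X.T @ error) / len(y)

# Toy data; the first column of ones acts as an intercept term.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
m0 = np.zeros(2)

# Linear regression: the prediction is the raw linear output.
m_linear = gradient_step(m0, X, y, predict=lambda z: z, alpha=0.1)
# Logistic regression: the prediction is the sigmoid of the linear output.
m_logistic = gradient_step(m0, X, y, predict=sigmoid, alpha=0.1)

print(m_linear, m_logistic)
```

The code path is the same in both cases; only the function that turns \(m \cdot x\) into a prediction differs, so the resulting numbers differ too.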

---

### **5. Why the Sigmoid and Log Functions?**
- The sigmoid and log functions were specifically chosen because:
  - The log loss function is **convex** in the slopes, so any local minimum is also the global minimum.
  - The derivative of the natural log (\(\log_e\)) is simple:
    \[
    \frac{d}{dp} \log(p) = \frac{1}{p}
    \]
  - Combined with the sigmoid's derivative \(p(1 - p)\), the \(1/p\) and \(1/(1 - p)\) factors cancel, leaving a gradient of the form \((p - y)\,x\). This is the same form as in linear regression, so the same optimization algorithms can be reused.
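
The chain rule makes the link between the log derivative and the linear-regression-style update explicit. The following is a standard single-sample sketch using the notation above, with \(z = m \cdot x\) and \(p = \sigma(z)\); it is included for illustration rather than taken from the lecture.

\[
\frac{\partial J}{\partial p} = -\frac{y}{p} + \frac{1 - y}{1 - p},
\qquad
\frac{\partial p}{\partial z} = p\,(1 - p),
\qquad
\frac{\partial z}{\partial m_j} = x_j
\]

\[
\frac{\partial J}{\partial m_j}
= \left( -\frac{y}{p} + \frac{1 - y}{1 - p} \right) p\,(1 - p)\, x_j
= \big( -y\,(1 - p) + (1 - y)\,p \big)\, x_j
= (p - y)\, x_j
\]

Averaging over all \(n\) samples gives \(\frac{\partial J}{\partial m_j} = \frac{1}{n} \sum_{i=1}^{n} (p_i - y_i)\, x_{i,j}\), where \(x_{i,j}\) denotes feature \(j\) of sample \(i\).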

---

### **6. Practical Implications**
- **Unified Optimization**:
  - The fact that logistic regression shares the same form of update rule as linear regression simplifies implementation.
  - Algorithms like stochastic gradient descent (SGD) or batch gradient descent can be directly applied.
- **Extending Linear Regression**:
  - By passing the linear output through the sigmoid and changing the loss function (from mean squared error to log loss), logistic regression extends linear regression to handle categorical problems.

---

### **7. Key Takeaways**
- **Minimizing Log Loss**:
  - Achieved by adjusting the slopes (\(m\)) using gradient descent.
  - The convexity of the log loss function ensures an easy-to-find global minimum.
- **Reusability**:
  - The derivative of the log loss function aligns with linear regression's update rules, making optimization straightforward.
- **Why the Sigmoid Function?**:
  - The sigmoid function ensures outputs are probabilities (between 0 and 1).
  - Its exponential structure simplifies derivatives, which is crucial for efficient optimization.

---

### **Conclusion**
Minimizing log loss in logistic regression is a straightforward extension of linear regression's optimization process. By using the sigmoid function and log loss, the model can handle categorical data effectively while maintaining the same optimization framework. This mathematical design ensures both simplicity and efficiency in training logistic regression models.

### **Multiclass Classification: Overview and Explanation**

Multiclass classification extends binary classification techniques to problems where there are more than two classes. For instance, instead of determining if an input belongs to "cat" or "dog," we might classify between "cat," "dog," and "rabbit."

---

### **Key Approaches to Multiclass Classification**

#### **1. One-vs-Rest (OvR) Method**
- **Concept**: Train separate binary classifiers for each class.
  - For \( K \) classes, we train \( K \) models:
    - \( \text{Class}_i \) vs. "not-\( \text{Class}_i \)"
  - Example:
    - Classes: Red, Green, Blue.
    - Models: Red-vs-not-Red, Green-vs-not-Green, Blue-vs-not-Blue.
- **Prediction**: Use the **argmax** function to choose the class with the highest probability:
  \[
  \hat{y} = \text{argmax}_k \; \text{sigmoid}(\mathbf{w}_k^T \mathbf{x})
  \]
- **Pros**:
  - Simple and easy to implement.
  - Works well for a small number of classes.
- **Cons**:
  - Each binary problem is imbalanced (one class against all the others), which can bias the classifiers toward the negative class.
  - Requires training \( K \) separate models, so cost grows with the number of classes.
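
A compact sketch of One-vs-Rest built from plain binary logistic regression is shown below. The helper names (`train_binary`, `train_one_vs_rest`, `predict_one_vs_rest`), the hyperparameters, and the three-cluster toy data are assumptions made for illustration, not the course's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, lr=0.1, epochs=2000):
    """Plain batch gradient descent on binary log loss."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        w -= lr * (X.T @ (sigmoid(X @ w) - y)) / len(y)
    return w

def train_one_vs_rest(X, labels, n_classes):
    """Train one 'class k vs. not class k' model per class; returns a (K, d) array."""
    return np.array([train_binary(X, (labels == k).astype(float))
                     for k in range(n_classes)])

def predict_one_vs_rest(W, X):
    """Pick the class whose binary model reports the highest probability."""
    return np.argmax(sigmoid(X @ W.T), axis=1)

# Toy data: three well-separated clusters, with a leading column of ones as intercept.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
labels = np.repeat(np.arange(3), 30)
points = centers[labels] + rng.normal(scale=0.5, size=(90, 2))
X = np.column_stack([np.ones(len(points)), points])

W = train_one_vs_rest(X, labels, n_classes=3)
accuracy = np.mean(predict_one_vs_rest(W, X) == labels)
print(f"training accuracy: {accuracy:.2f}")
```

Each row of `W` is an independent binary model, and the argmax over their sigmoid outputs implements the prediction rule given above.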

---

#### **2. One-vs-One (OvO) Method**
- **Concept**: Train a binary classifier for each pair of classes.
  - For \( K \) classes, train \( \binom{K}{2} = \frac{K(K-1)}{2} \) models.
    - Each model distinguishes between two classes, e.g., \( \text{Class}_i \) vs. \( \text{Class}_j \).
  - Example:
    - Classes: Red, Green, Blue.
    - Models: Red-vs-Green, Red-vs-Blue, Green-vs-Blue.
- **Prediction**: Use majority voting:
  - Each classifier "votes" for one class.
  - The class with the most votes wins.
- **Pros**:
  - Handles imbalanced classes better than OvR.
  - Often performs well for small \( K \).
- **Cons**:
  - Computationally expensive for large \( K \) due to the \( O(K^2) \) number of models.
  - Potential for ties or paradoxical results in voting.
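
The pairwise training and voting scheme can be sketched in the same style. Function names, hyperparameters, the toy data, and the tie-breaking rule (lowest class index wins) are assumptions made for this illustration.

```python
from itertools import combinations
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, lr=0.1, epochs=2000):
    """Plain batch gradient descent on binary log loss."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        w -= lr * (X.T @ (sigmoid(X @ w) - y)) / len(y)
    return w

def train_one_vs_one(X, labels, n_classes):
    """Train one model per pair (i, j), using only samples from those two classes."""
    models = {}
    for i, j in combinations(range(n_classes), 2):
        mask = (labels == i) | (labels == j)
        models[(i, j)] = train_binary(X[mask], (labels[mask] == j).astype(float))
    return models

def predict_one_vs_one(models, X, n_classes):
    """Each pairwise model casts one vote per sample; ties go to the lowest index."""
    votes = np.zeros((X.shape[0], n_classes), dtype=int)
    rows = np.arange(X.shape[0])
    for (i, j), w in models.items():
        winner = np.where(sigmoid(X @ w) > 0.5, j, i)
        votes[rows, winner] += 1
    return np.argmax(votes, axis=1)

# Toy data: three well-separated clusters, with a leading column of ones as intercept.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
labels = np.repeat(np.arange(3), 30)
points = centers[labels] + rng.normal(scale=0.5, size=(90, 2))
X = np.column_stack([np.ones(len(points)), points])

models = train_one_vs_one(X, labels, n_classes=3)
accuracy = np.mean(predict_one_vs_one(models, X, 3) == labels)
print(len(models), f"training accuracy: {accuracy:.2f}")
```

With three classes this trains \( \binom{3}{2} = 3 \) models, one per pair, and each one only ever sees the samples from its two classes.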

---

### **Mathematical Representation**

#### **Log Loss for Multiclass Problems**
For a dataset with \( K \) classes:
1. **Softmax Activation**:
   - Generalizes the sigmoid function for multiclass:
     \[
     p(y=k|\mathbf{x}) = \frac{e^{\mathbf{w}_k^T \mathbf{x}}}{\sum_{j=1}^K e^{\mathbf{w}_j^T \mathbf{x}}}
     \]
   - Outputs a probability distribution over \( K \) classes.

2. **Cross-Entropy Loss**:
   - Measures the difference between predicted and true probability distributions:
     \[
     J(\mathbf{W}) = -\frac{1}{N} \sum_{i=1}^N \sum_{k=1}^K y_{i,k} \log(p(y=k|\mathbf{x}_i))
     \]
   - \( \mathbf{W} = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_K] \): weight matrix for \( K \) classes.
   - \( y_{i,k} \): One-hot encoded true label for \( \mathbf{x}_i \).
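
To see in what sense softmax generalizes the sigmoid, the sketch below evaluates a two-class softmax with one class score fixed at 0 and compares it with the sigmoid of the other score. The sample scores are arbitrary values chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(scores):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    shifted = scores - np.max(scores, axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores, axis=1, keepdims=True)

z = np.array([-2.0, 0.0, 3.0])
# Two-class scores: class 0 is fixed at 0, class 1 has score z.
scores = np.column_stack([np.zeros_like(z), z])
probs = softmax(scores)

print(probs[:, 1])
print(sigmoid(z))
```

The class-1 column of the softmax output matches the sigmoid, since \( e^{z} / (e^{0} + e^{z}) = 1 / (1 + e^{-z}) \).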

---

### **Python Implementation**

#### **Softmax Function**
```python
import numpy as np

def softmax(z):
    """
    Compute the row-wise softmax of z.
    
    Parameters:
    - z: 2D array of shape (N, K), where N is the number of samples and K is the number of classes.
    
    Returns:
    - Softmax probabilities for each class.
    """
    exp_z = np.exp(z - np.max(z, axis=1, keepdims=True))  # Stabilize for numerical safety
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)
```

---

#### **Cross-Entropy Loss**
```python
def cross_entropy_loss(y_true, y_pred):
    """
    Compute the cross-entropy loss for multiclass classification.
    
    Parameters:
    - y_true: One-hot encoded true labels of shape (N, K).
    - y_pred: Predicted probabilities of shape (N, K).
    
    Returns:
    - Average cross-entropy loss.
    """
    epsilon = 1e-15  # Avoid log(0)
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
```

---

#### **Multiclass Logistic Regression Training**
```python
def train_multiclass_logistic_regression(X, y, lr=0.01, epochs=1000):
    """
    Train a multiclass logistic regression model using gradient descent.
    
    Parameters:
    - X: Feature matrix (N x d).
    - y: One-hot encoded labels (N x K).
    - lr: Learning rate.
    - epochs: Number of iterations.
    
    Returns:
    - W: Trained weight matrix (d x K).
    - losses: List of cross-entropy losses per epoch.
    """
    N, d = X.shape
    K = y.shape[1]
    W = np.zeros((d, K))  # Initialize weights
    losses = []

    for epoch in range(epochs):
        # Linear combination
        z = np.dot(X, W)
        # Softmax probabilities
        probs = softmax(z)
        # Compute loss
        loss = cross_entropy_loss(y, probs)
        losses.append(loss)
        # Gradient calculation
        gradient = np.dot(X.T, (probs - y)) / N
        # Weight update
        W -= lr * gradient

    return W, losses
```
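
Once a weight matrix of shape (d x K) is available, class predictions follow from an argmax over the class scores. Because softmax preserves the ordering of the scores, taking the argmax of the raw scores gives the same class as taking it over the probabilities. The function name `predict_classes` and the example weights below are hypothetical, chosen only to illustrate the step.

```python
import numpy as np

def predict_classes(X, W):
    """Return the index of the highest-scoring class for each row of X."""
    scores = X @ W                    # shape (N, K)
    return np.argmax(scores, axis=1)

# Hypothetical weights for d = 2 features and K = 3 classes.
W_example = np.array([[1.0, -1.0,  0.0],
                      [0.0,  1.0, -1.0]])
X_example = np.array([[ 2.0,  0.5],
                      [ 0.0,  3.0],
                      [-1.0, -2.0]])

predicted = predict_classes(X_example, W_example)
print(predicted)
```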

---

#### **Visualization of Loss Minimization**
```python
import matplotlib.pyplot as plt

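# Generate synthetic data
np.random.seed(42)
X = np.random.rand(100, 3)
# Label each sample by its largest feature so the classes are learnable;
# labels drawn independently of X would leave the loss near log(3).
y_raw = np.argmax(X, axis=1)
y_one_hot = np.eye(3)[y_raw]  # One-hot encode labels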

# Train logistic regression
W, losses = train_multiclass_logistic_regression(X, y_one_hot, lr=0.1, epochs=300)

# Plot loss
plt.plot(losses)
plt.title("Cross-Entropy Loss Minimization")
plt.xlabel("Epochs")
plt.ylabel("Loss")
plt.grid()
plt.show()
```

---

### **Key Insights**
1. **Softmax for Multiclass**:
   - Softmax ensures the sum of predicted probabilities equals 1, suitable for multiclass problems.
2. **One-vs-Rest**:
   - Simple to implement; needs one model per class, and each binary problem can be imbalanced.
3. **One-vs-One**:
   - Computationally expensive but useful for small \( K \).
4. **Cross-Entropy Loss**:
   - The loss function generalizes binary cross-entropy for multiclass problems.
5. **Optimization**:
   - Gradient descent works well due to the convex nature of the cross-entropy loss.

This mathematical and coding explanation provides a comprehensive understanding of multiclass classification using logistic regression.