setwd("~/Desktop/R Materials/mih140/Lecture 19 - Logistic Regression I")
kickers = read.table("Kickers.txt", sep = "\t", header = T, quote = "", allowEscapes = T)
Motivation: We want to learn a relationship between features and a binary (0 or 1) response variable. Problem: linear regression gives wrong answers for this problem!
Linear Regression: \(Y = b_0 + b_1 X_1 + ... + b_k X_k\) <- Inappropriate for predicting a binary response.
Idea: interpret the output of the linear regression as the probability that the observation is in class 1 (rather than class 0). Now we are learning a continuous quantity, but linear regression returns numbers in \((-\infty, \infty)\), not \([0,1]\) as we would expect for a probability. That gives us something like: \(Pr(Y = 1) = b_0 + b_1 X_1 + \ldots + b_k X_k\) <- Still a problem: probabilities are bounded between \([0,1]\), whereas the linear combination can spit out numbers in \((-\infty, \infty)\).
Fix: use a link function to transform the output of linear regression into a number in \([0,1]\), so that it can be interpreted as a probability. In particular, we want some function \(g(\cdot)\) that is increasing and for which \(g: (-\infty, \infty) \to [0,1]\).
We will then use \(Pr(Y = 1) = g(b_0 + b_1 X_1 + ... + b_k X_k)\) as our model!
We will use the logistic function (the inverse of the logit link): \(g(t) = e^t/(1+e^t)\); note \(g:(-\infty, \infty) \to [0,1]\).
\(\pi = Pr(Y = 1) = g(b_0 + b_1 X_1 + ... + b_k X_k)\) = \(e^{b_0 + b_1 X_1 + ... + b_k X_k}/(1+ e^{b_0 + b_1 X_1 + ... + b_k X_k})\)
Remember the above equation. Just like in linear regression, all interpretation of logistic regression comes from understanding the above function.
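As a quick sanity check, you can verify in R that this function squashes any real number into a value strictly between 0 and 1; base R's plogis() computes exactly \(e^t/(1+e^t)\):

t = c(-10, -1, 0, 1, 10)
exp(t)/(1 + exp(t)) # manual logistic transform of each value
plogis(t) # built-in equivalent; every output lies strictly between 0 and 1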
# To do logistic regression we use the glm() function. The param family = "binomial" tells R to use the logit function as the link function.
glm1 = glm(data = kickers, Win ~ FG.Percentage, family = "binomial")
# Analyze the logistic regressions using Summary just like linear regression!
summary(glm1)
##
## Call:
## glm(formula = Win ~ FG.Percentage, family = "binomial", data = kickers)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.350 -1.350 1.014 1.014 1.697
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.169454 0.356486 -3.281 0.00104 **
## FG.Percentage 0.015656 0.004001 3.913 9.1e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 521.94 on 377 degrees of freedom
## Residual deviance: 505.08 on 376 degrees of freedom
## AIC: 509.08
##
## Number of Fisher Scoring iterations: 4
#Coefficients:
# Estimate Std. Error z value Pr(>|z|)
#(Intercept) -1.169454 0.356486 -3.281 0.00104 **
#FG.Percentage 0.015656 0.004001 3.913 9.1e-05 ***
# What is the logistic model we've learned?
# Pr(Win = 1) = e^(-1.169454 + FG.Percentage*0.015656)/(1 + e^(-1.169454 + FG.Percentage*0.015656))
# What's the interpretation of the intercept?
# Pr(Win = 1 | FG.Percentage = 0) = e^(-1.169454)/(1 + e^(-1.169454))
exp(-1.169454)/(1 + exp(-1.169454)) # Chance to win given 0 FG.Percentage
## [1] 0.2369537
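# More generally, the fitted probability at any FG.Percentage can be computed straight
# from the coefficients; plogis() is the logistic transform e^t/(1 + e^t), and 80 below
# is just an arbitrary example value.
b = coef(glm1)
plogis(b[1] + b[2]*80) # Pr(Win = 1) when FG.Percentage = 80
predict.glm(glm1, data.frame(FG.Percentage = 80), type = "response") # same value via predict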
# To simulate prediction we will split the data into a training set, from which we will learn the model, and a testing set to which we will apply the model.
kickers_shuff = kickers[sample(nrow(kickers)), ] # shuffle rows of kickers
training_set = kickers_shuff[1:20, ]
test_set = kickers_shuff[21:30, ]
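# Note: sample() gives a different shuffle every run, so the training/test sets
# (and all the numbers below) change from run to run. To make the split
# reproducible, set a seed before shuffling, e.g.:
#   set.seed(140) # any fixed seed works; 140 is arbitrary
#   kickers_shuff = kickers[sample(nrow(kickers)), ]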
glm2 = glm(data = training_set, Win ~ FGs.Attempted + FGs.Made, family = "binomial") # Can do logistic regression with multiple features just like in linear regression!
# Confidence Intervals
confint(glm2) # confidence intervals around the coeff. of your model are just like in linear regression
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -3.135818 1.066416
## FGs.Attempted -1.348572 2.922014
## FGs.Made -1.864736 2.111570
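# The coefficients (and these intervals) are on the log-odds scale. Exponentiating
# them gives odds ratios: the multiplicative change in the odds of winning per
# one-unit increase in a feature, which is often easier to interpret.
exp(coef(glm2)) # odds ratios
exp(confint(glm2)) # confidence intervals on the odds-ratio scale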
# Logistic regression has a standard interpretation for prediction; to use it that way, you must make sure the observations you're predicting on are distinct from the observations used to train the model. Let's do this next.
# Applying your model to the new observations:
test_set$Win = predict.glm(glm2, test_set, type = "response") # note: this overwrites the true Win labels in test_set with the model's fitted probabilities
test_set$Win # These are the predicted probabilities under the model!
## [1] 0.4633018 0.4633018 0.6622677 0.5977009 0.7188666 0.8846113 0.7188666
## [8] 0.7188666 0.4633018 0.6622677
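# To turn these probabilities into hard 0/1 predictions, one common (though not
# the only) choice is a 0.5 cutoff:
predicted_class = ifelse(test_set$Win > 0.5, 1, 0) # 1 = predicted win
predicted_class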
# In the run shown, test_set[1, ] had FGs.Attempted = 2 and FGs.Made = 1.
# Plugging those values into the fitted model (with b0, b1, b2 = coef(glm2)):
# Pr(Win = 1 | FGs.Attempted = 2, FGs.Made = 1) = e^(b0 + 2*b1 + b2)/(1 + e^(b0 + 2*b1 + b2))
# Because the training split is random, the coefficients (and this value) will differ from run to run.
test_set$Win[1] # Should match the hand computation above (up to rounding); see the sketch below
## [1] 0.4633018
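# The same number computed by hand from the fitted coefficients:
b2 = coef(glm2)
plogis(b2["(Intercept)"] + b2["FGs.Attempted"]*test_set$FGs.Attempted[1] + b2["FGs.Made"]*test_set$FGs.Made[1]) # same as test_set$Win[1]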
# To plot the logistic curve, we will need to be a little cleverer than we were for linear regression.
# We will generate a set of x coordinates by sweeping through the range.
# Then we will get the y values by fitting the model on our generated x coordinates.
new_dat = data.frame(FG.Percentage = seq(min(kickers$FG.Percentage)-100, max(kickers$FG.Percentage)+100, 1)) # makes x coordinates
# Using the x values in new_dat, we can now apply our first model to those points
new_dat$Win = predict.glm(glm1, new_dat, type = "response") # makes y coordinates
# Now to plot the curve we call:
plot(Win ~ FG.Percentage, xlim = c(-100, 200), data = kickers) # Plots the points from the dataset
lines(Win ~ FG.Percentage, data = new_dat, lwd = 2) # Sweeps out a line through our generated set of points
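# The same curve can also be drawn without building new_dat, by passing the fitted
# logistic formula directly to curve(); overlaid dashed here to show the two coincide:
curve(plogis(coef(glm1)[1] + coef(glm1)[2]*x), from = -100, to = 200, add = TRUE, lty = 2, lwd = 2)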