What is a prediction? Why do we need to predict?
For example, think about Amazon: it constantly makes predictions to optimize its workflow.
Can we evaluate forecasts?
NOTICE!
In this lesson we’ll learn how to evaluate a predictive model’s performance.
Brush up on the kinds of variables you know:
- Quantitative (continuous, discrete)
- Qualitative (ordinal, nominal)
Let’s find some examples of these variables in the Hitters data set:
library(ISLR)
summary(Hitters)
## AtBat Hits HmRun Runs
## Min. : 16.0 Min. : 1 Min. : 0.00 Min. : 0.00
## 1st Qu.:255.2 1st Qu.: 64 1st Qu.: 4.00 1st Qu.: 30.25
## Median :379.5 Median : 96 Median : 8.00 Median : 48.00
## Mean :380.9 Mean :101 Mean :10.77 Mean : 50.91
## 3rd Qu.:512.0 3rd Qu.:137 3rd Qu.:16.00 3rd Qu.: 69.00
## Max. :687.0 Max. :238 Max. :40.00 Max. :130.00
##
## RBI Walks Years CAtBat
## Min. : 0.00 Min. : 0.00 Min. : 1.000 Min. : 19.0
## 1st Qu.: 28.00 1st Qu.: 22.00 1st Qu.: 4.000 1st Qu.: 816.8
## Median : 44.00 Median : 35.00 Median : 6.000 Median : 1928.0
## Mean : 48.03 Mean : 38.74 Mean : 7.444 Mean : 2648.7
## 3rd Qu.: 64.75 3rd Qu.: 53.00 3rd Qu.:11.000 3rd Qu.: 3924.2
## Max. :121.00 Max. :105.00 Max. :24.000 Max. :14053.0
##
## CHits CHmRun CRuns CRBI
## Min. : 4.0 Min. : 0.00 Min. : 1.0 Min. : 0.00
## 1st Qu.: 209.0 1st Qu.: 14.00 1st Qu.: 100.2 1st Qu.: 88.75
## Median : 508.0 Median : 37.50 Median : 247.0 Median : 220.50
## Mean : 717.6 Mean : 69.49 Mean : 358.8 Mean : 330.12
## 3rd Qu.:1059.2 3rd Qu.: 90.00 3rd Qu.: 526.2 3rd Qu.: 426.25
## Max. :4256.0 Max. :548.00 Max. :2165.0 Max. :1659.00
##
## CWalks League Division PutOuts Assists
## Min. : 0.00 A:175 E:157 Min. : 0.0 Min. : 0.0
## 1st Qu.: 67.25 N:147 W:165 1st Qu.: 109.2 1st Qu.: 7.0
## Median : 170.50 Median : 212.0 Median : 39.5
## Mean : 260.24 Mean : 288.9 Mean :106.9
## 3rd Qu.: 339.25 3rd Qu.: 325.0 3rd Qu.:166.0
## Max. :1566.00 Max. :1378.0 Max. :492.0
##
## Errors Salary NewLeague
## Min. : 0.00 Min. : 67.5 A:176
## 1st Qu.: 3.00 1st Qu.: 190.0 N:146
## Median : 6.00 Median : 425.0
## Mean : 8.04 Mean : 535.9
## 3rd Qu.:11.00 3rd Qu.: 750.0
## Max. :32.00 Max. :2460.0
## NA's :59
- You may need to look up the meaning of the variables (run ?Hitters to read the data set documentation).
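To connect the variable types listed earlier with the data, one option is to ask R for the class of each column. This is a minimal sketch, assuming the ISLR package is installed; the names col_class, quantitative and qualitative are introduced here only for illustration.

```r
library(ISLR)  # provides the Hitters data set

# Class of every column: "integer"/"numeric" columns are quantitative,
# "factor" columns are qualitative
col_class <- sapply(Hitters, class)

quantitative <- names(col_class)[col_class %in% c("integer", "numeric")]
qualitative  <- names(col_class)[col_class == "factor"]

print(quantitative)
print(qualitative)
```

The class only separates numbers from factors. Whether a quantitative variable is better treated as continuous or discrete, and whether a factor is ordinal or nominal, still has to be judged from the meaning of the variable.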
Supervised learning
The key idea
Supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs. Problems of this nature occur in fields as diverse as business, medicine, astrophysics, and public policy.
Do you remember how to do it? There are a lot of techniques. Let’s start with linear regression models.
\[ y_{i}=\beta_{0}+\beta_{1}x_{i}+u_{i} \]
Where:
- \(y_{i}\) is the output variable: the variable we want to learn how to predict.
- \(x_{i}\) is the input variable: a variable that helps to predict \(y_{i}\).
- \(i=1,\dots,N\) indexes the individuals in the data.
- \(u_{i}\) is a random error term.
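A small simulation may make the roles of \(\beta_{0}\), \(\beta_{1}\) and \(u_{i}\) more concrete. The sketch below generates data from the model with made-up values \(\beta_{0}=2\) and \(\beta_{1}=3\) (assumptions chosen only for illustration) and then checks that lm() recovers numbers close to them.

```r
set.seed(1)                        # reproducible random draws

N <- 200
x <- runif(N, min = 0, max = 10)   # input variable
u <- rnorm(N, mean = 0, sd = 1)    # random error term
y <- 2 + 3 * x + u                 # output: beta_0 = 2, beta_1 = 3 (made-up values)

fit <- lm(y ~ x)                   # estimate beta_0 and beta_1 from the simulated data
print(coef(fit))                   # estimates should be close to 2 and 3
```

The estimates do not equal 2 and 3 exactly because of the error term; with more observations they tend to get closer.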
Exercise
How do you interpret the meaning of \(\beta_0\) and \(\beta_1\)?
Let’s use the baseball data set (Hitters). We want to predict the Salary of a player conditional on their Hits.
attach(Hitters)  # make the columns (Hits, Salary, ...) available by name
# Scatterplot of Salary against Hits
plot(Hits, Salary, main="Scatterplot Example",
     xlab="Hits", ylab="Salary", pch=19)
\[ Salary_{i}=\beta_{0}+\beta_{1}Hits_{i}+u_{i} \]
mod1<-lm(Salary~Hits,data=Hitters)
summary(mod1)
##
## Call:
## lm(formula = Salary ~ Hits, data = Hitters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -893.99 -245.63 -59.08 181.12 2059.90
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.0488 64.9822 0.970 0.333
## Hits 4.3854 0.5561 7.886 8.53e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 406.2 on 261 degrees of freedom
## (59 observations deleted due to missingness)
## Multiple R-squared: 0.1924, Adjusted R-squared: 0.1893
## F-statistic: 62.19 on 1 and 261 DF, p-value: 8.531e-14
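Once the coefficients are estimated, the model can be used for what we wanted in the first place: prediction. The sketch below, which assumes the ISLR package is installed, predicts the salary of a hypothetical player with 100 hits, first by plugging into the fitted equation and then with predict(). The player and the name new_player are made up for illustration.

```r
library(ISLR)

mod1 <- lm(Salary ~ Hits, data = Hitters)

# By hand: intercept + slope * hits
by_hand <- coef(mod1)[["(Intercept)"]] + coef(mod1)[["Hits"]] * 100

# Same prediction with predict(); newdata must contain a column named Hits
new_player <- data.frame(Hits = 100)   # hypothetical player
pred <- predict(mod1, newdata = new_player)

print(c(by_hand = by_hand, predict = unname(pred)))
```

With the coefficients printed above this is 63.0488 + 4.3854 × 100, about 501.6, in the units of Salary (thousands of dollars according to the data set documentation).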
Exercise: explain, carefully, the meaning of all the items you can recognize in the output.
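As a hint for the exercise, two items of the output can be recomputed from the residuals. This is a minimal sketch, assuming the ISLR package is installed; the names rss, tss, rse and r2 are chosen here for illustration.

```r
library(ISLR)

mod1 <- lm(Salary ~ Hits, data = Hitters)

res <- residuals(mod1)                  # one residual per player used in the fit
rss <- sum(res^2)                       # residual sum of squares

# Residual standard error: sqrt(RSS / residual degrees of freedom)
rse <- sqrt(rss / df.residual(mod1))

# R-squared: share of the variation in Salary explained by Hits
salary_used <- model.frame(mod1)$Salary # salaries of the players actually used
tss <- sum((salary_used - mean(salary_used))^2)
r2 <- 1 - rss / tss

print(c(rse = rse, r2 = r2))            # compare with the summary(mod1) output
```

Players with a missing Salary are dropped before fitting, which is why the residual degrees of freedom are 261 rather than 320.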
Recall: this is “supervised learning” because we have one output and at least one input.
Now suppose we want to predict the League of a player conditional on their Hits. The variable League is qualitative (nominal) with two levels:
A factor with levels A and N indicating player’s league at the end of 1986
Example
Try to plot the league versus the hits. What can you observe?
# Recode the league as 1 (A) or 0 (N); as.factor() stores the codes as a factor, not as numbers
League_Factor <- as.factor(ifelse(Hitters$League == "A", 1, 0))
Hitters <- data.frame(Hitters, League_Factor)  # add the new column to the data frame
attach(Hitters)
## The following object is masked _by_ .GlobalEnv:
##
## League_Factor
## The following objects are masked from Hitters (pos = 3):
##
## Assists, AtBat, CAtBat, CHits, CHmRun, CRBI, CRuns, CWalks,
## Division, Errors, Hits, HmRun, League, NewLeague, PutOuts, RBI,
## Runs, Salary, Walks, Years
plot(Hitters$Hits, Hitters$League_Factor, main="Scatterplot Example",
xlab="Hits", ylab="League", pch=19)
boxplot(Hitters$Hits~Hitters$League_Factor)
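A numerical summary can complement the plots. The sketch below, assuming the ISLR package is installed, counts the players in each league and computes the median number of hits per league; hits_by_league is a name introduced here for illustration.

```r
library(ISLR)

# How many players are in each league
league_counts <- table(Hitters$League)
print(league_counts)

# Median number of hits within each league
hits_by_league <- tapply(Hitters$Hits, Hitters$League, median)
print(hits_by_league)
```

If the two medians are close and the boxes overlap heavily, Hits alone carries little information about the league.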
When the output variable is binary, we need to introduce a new function, the “logistic” function:
\[ y=\frac{1}{1+e^{-x}} \]
Exercise: Generate a sequence of values for \(x\) (let’s say \(x\in[-10,10]\)) and evaluate the following logistic function on it.
\[ y=\frac{1}{1+e^{-0.5x}} \]
Plot it and try to explain what you get.
x<-seq(-10,10,0.5)
y<-1/(1+exp(-0.5*x))
plot(x,y)
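The properties that make this curve useful for a binary output can also be checked numerically. In the sketch below, logistic() and its argument rate are hypothetical names introduced for illustration; plogis() is the logistic distribution function that ships with base R.

```r
# Logistic function with an adjustable steepness
logistic <- function(x, rate = 1) 1 / (1 + exp(-rate * x))

x <- seq(-10, 10, 0.5)
y <- logistic(x, rate = 0.5)

print(range(y))                            # stays strictly between 0 and 1
print(logistic(0, rate = 0.5))             # exactly 0.5 at x = 0
print(all(diff(y) > 0))                    # increasing in x
print(isTRUE(all.equal(y, plogis(0.5 * x))))  # agrees with the built-in version
```

Because the values never leave the interval (0, 1), they can be read as probabilities, which a straight line cannot guarantee.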
In our case, the model we want to estimate is written in this way:
\[ P(League_{i}=1 \mid Hits_{i})=\frac{1}{1+e^{-(\beta_{0}+\beta_{1}Hits_{i})}} \]
# Fit a logistic regression; data = Hitters tells glm() where to find League and Hits
model <- glm(League ~ Hits, data = Hitters, family = binomial)
summary(model)
##
## Call:
## glm(formula = League ~ Hits, family = binomial, data = Hitters)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2903 -1.1035 -0.9818 1.2149 1.4686
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.270376 0.269686 1.003 0.316
## Hits -0.004422 0.002449 -1.805 0.071 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 443.95 on 321 degrees of freedom
## Residual deviance: 440.64 on 320 degrees of freedom
## AIC: 444.64
##
## Number of Fisher Scoring iterations: 4
Now, think about the interpretation of the parameters: is it the same as in the linear regression model?
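One way to approach this question is to rewrite the logistic model in terms of odds. Writing \(p_{i}\) for the probability of the outcome coded as 1 (a symbol introduced here for convenience), the model implies:

\[ \frac{p_{i}}{1-p_{i}}=e^{\beta_{0}+\beta_{1}Hits_{i}} \quad\Longleftrightarrow\quad \log\left(\frac{p_{i}}{1-p_{i}}\right)=\beta_{0}+\beta_{1}Hits_{i} \]

So \(\beta_{1}\) is the change in the log-odds for one additional hit, not the change in the probability itself. Equivalently, one additional hit multiplies the odds by \(e^{\beta_{1}}\). With the estimate printed above:

\[ e^{\hat{\beta}_{1}}=e^{-0.004422}\approx 0.9956 \]

That is, each additional hit lowers the estimated odds by roughly 0.44%. The effect on the probability depends on where on the curve the player starts.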
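This model is useful to forecast a PROBABILITY. The output of the model, as you have drawn before, is a value between 0 and 1, so it transforms the number of hits of a player into the probability of being in the league coded as 1. Be careful with the coding: when the response is a factor, glm() takes the first level as 0 and the remaining level as 1. League has levels A and N, so the model fitted above gives the probability of league N; to obtain the probability of league A (coded as 1 in League_Factor), use League_Factor as the response.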
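The probability forecast can be obtained with predict() and type = "response". The sketch below assumes the ISLR package is installed and uses a hypothetical player with 100 hits. It also shows which league the fitted probability refers to: with a factor response, glm() models the second level, here N, so a second fit with the 0/1 recoding is added to obtain the probability of league A.

```r
library(ISLR)

print(levels(Hitters$League))  # "A" "N": first level is treated as 0, second as 1

model <- glm(League ~ Hits, data = Hitters, family = binomial)

new_player <- data.frame(Hits = 100)   # hypothetical player
p_N <- predict(model, newdata = new_player, type = "response")

# Same number by hand, plugging the estimates into the logistic function
b <- coef(model)
p_by_hand <- 1 / (1 + exp(-(b[["(Intercept)"]] + b[["Hits"]] * 100)))

# Recode so that league A is the outcome coded as 1, then refit
League_Factor <- as.factor(ifelse(Hitters$League == "A", 1, 0))
model_A <- glm(League_Factor ~ Hits, data = Hitters, family = binomial)
p_A <- predict(model_A, newdata = new_player, type = "response")

print(c(p_N = unname(p_N), by_hand = p_by_hand, p_A = unname(p_A)))
```

With the coefficients printed above, p_N comes out near 0.457, and p_A is its complement, so the two probabilities add up to 1.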