What is a prediction? Why do we need to predict?
For example, think about Amazon: the company is constantly making predictions to optimize its workflow!
Can we evaluate forecasts?
NOTICE!
In this lesson we’ll learn how to evaluate a predictive model’s performance.
Brush up on the kinds of variables you know:
- Quantitative (continuous, discrete)
- Qualitative (ordinal, nominal)
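As a quick refresher, the four kinds of variables listed above can each be stored in a different R column type. The sketch below uses a tiny made-up data frame (the name `players` and all of its values are hypothetical, chosen only for illustration).

```r
# Toy data frame with made-up values: one column per kind of variable.
players <- data.frame(
  salary = c(475.0, 480.0, 500.0, 91.5),        # quantitative, continuous
  hits   = c(81L, 130L, 141L, 87L),             # quantitative, discrete
  league = factor(c("N", "A", "N", "N")),       # qualitative, nominal
  level  = factor(c("low", "high", "high", "low"),
                  levels = c("low", "high"),
                  ordered = TRUE)               # qualitative, ordinal
)

# Report the first R class used to store each column.
classes <- sapply(players, function(col) class(col)[1])
print(classes)
```

Only the ordered factor supports comparisons such as `players$level[1] < players$level[2]`, which is what distinguishes an ordinal variable from a nominal one.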
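Let’s find some examples of these variables in the Hitters data set:
library(ISLR)      # provides the Hitters baseball data set
summary(Hitters)   # quantiles for numeric columns, counts for factor columns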
## AtBat Hits HmRun Runs
## Min. : 16.0 Min. : 1 Min. : 0.00 Min. : 0.00
## 1st Qu.:255.2 1st Qu.: 64 1st Qu.: 4.00 1st Qu.: 30.25
## Median :379.5 Median : 96 Median : 8.00 Median : 48.00
## Mean :380.9 Mean :101 Mean :10.77 Mean : 50.91
## 3rd Qu.:512.0 3rd Qu.:137 3rd Qu.:16.00 3rd Qu.: 69.00
## Max. :687.0 Max. :238 Max. :40.00 Max. :130.00
##
## RBI Walks Years CAtBat
## Min. : 0.00 Min. : 0.00 Min. : 1.000 Min. : 19.0
## 1st Qu.: 28.00 1st Qu.: 22.00 1st Qu.: 4.000 1st Qu.: 816.8
## Median : 44.00 Median : 35.00 Median : 6.000 Median : 1928.0
## Mean : 48.03 Mean : 38.74 Mean : 7.444 Mean : 2648.7
## 3rd Qu.: 64.75 3rd Qu.: 53.00 3rd Qu.:11.000 3rd Qu.: 3924.2
## Max. :121.00 Max. :105.00 Max. :24.000 Max. :14053.0
##
## CHits CHmRun CRuns CRBI
## Min. : 4.0 Min. : 0.00 Min. : 1.0 Min. : 0.00
## 1st Qu.: 209.0 1st Qu.: 14.00 1st Qu.: 100.2 1st Qu.: 88.75
## Median : 508.0 Median : 37.50 Median : 247.0 Median : 220.50
## Mean : 717.6 Mean : 69.49 Mean : 358.8 Mean : 330.12
## 3rd Qu.:1059.2 3rd Qu.: 90.00 3rd Qu.: 526.2 3rd Qu.: 426.25
## Max. :4256.0 Max. :548.00 Max. :2165.0 Max. :1659.00
##
## CWalks League Division PutOuts Assists
## Min. : 0.00 A:175 E:157 Min. : 0.0 Min. : 0.0
## 1st Qu.: 67.25 N:147 W:165 1st Qu.: 109.2 1st Qu.: 7.0
## Median : 170.50 Median : 212.0 Median : 39.5
## Mean : 260.24 Mean : 288.9 Mean :106.9
## 3rd Qu.: 339.25 3rd Qu.: 325.0 3rd Qu.:166.0
## Max. :1566.00 Max. :1378.0 Max. :492.0
##
## Errors Salary NewLeague
## Min. : 0.00 Min. : 67.5 A:176
## 1st Qu.: 3.00 1st Qu.: 190.0 N:146
## Median : 6.00 Median : 425.0
## Mean : 8.04 Mean : 535.9
## 3rd Qu.:11.00 3rd Qu.: 750.0
## Max. :32.00 Max. :2460.0
## NA's :59
- You may need to look up the meaning of each variable (try ?Hitters).
Supervised learning
The key idea
Supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs. Problems of this nature occur in fields as diverse as business, medicine, astrophysics, and public policy.
Do you remember how to do it? There are a lot of techniques. Let’s start with linear regression models.
\[ y_{i}=\beta_{0}+\beta_{1}x_{i}+u_{i} \]
Where:
- \(y_{i}\) is the output variable: the variable we want to learn how to predict.
- \(x_{i}\) is the input variable: a variable that helps to predict \(y_{i}\).
- \(i=1,\dots,N\) indexes the individuals in the sample.
- \(u_{i}\) is a random error term.
Exercise
How do you interpret the meaning of \(\beta_0\) and \(\beta_1\)?
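One way to explore this question is to simulate data from the model with known parameters and check what `lm()` recovers. The sketch below is only an illustration: the values `beta0 = 2`, `beta1 = 0.5`, the sample size and the distributions of `x` and `u` are all assumptions made for the example.

```r
set.seed(1)                       # make the simulation reproducible

# Hypothetical "true" parameters, chosen only for illustration.
beta0 <- 2
beta1 <- 0.5
N     <- 500

x <- runif(N, min = 0, max = 10)  # input variable x_i
u <- rnorm(N, mean = 0, sd = 1)   # random error term u_i
y <- beta0 + beta1 * x + u        # output variable y_i

fit <- lm(y ~ x)                  # ordinary least squares estimates
print(coef(fit))                  # should be close to beta0 and beta1

# Compare the predictions at x = 3 and x = 4 (one unit apart).
pred <- predict(fit, newdata = data.frame(x = c(3, 4)))
print(diff(pred))                 # equals the estimated slope
```

The predicted value at `x = 0` is the estimated intercept, and the difference between two predictions one unit apart is the estimated slope.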
Let’s use the baseball data set (Hitters). We want to predict a player’s Salary conditional on their Hits.
# Scatterplot of Salary against Hits (the formula interface avoids attach())
plot(Salary ~ Hits, data = Hitters, main = "Scatterplot Example",
     xlab = "Hits", ylab = "Salary", pch = 19)
\[ Salary_{i}=\beta_{0}+\beta_{1}Hits_{i}+u_{i} \]
# Fit the simple linear regression of Salary on Hits by ordinary least squares
mod1 <- lm(Salary ~ Hits, data = Hitters)
summary(mod1)
##
## Call:
## lm(formula = Salary ~ Hits, data = Hitters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -893.99 -245.63 -59.08 181.12 2059.90
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.0488 64.9822 0.970 0.333
## Hits 4.3854 0.5561 7.886 8.53e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 406.2 on 261 degrees of freedom
## (59 observations deleted due to missingness)
## Multiple R-squared: 0.1924, Adjusted R-squared: 0.1893
## F-statistic: 62.19 on 1 and 261 DF, p-value: 8.531e-14
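The estimated coefficients in the output above can be plugged into the regression equation to produce a prediction by hand. The sketch below does this for a player with 100 hits (an arbitrary example value). According to the ISLR documentation, Salary is measured in thousands of dollars.

```r
# Coefficient estimates copied from the summary(mod1) output above.
b0 <- 63.0488   # (Intercept)
b1 <- 4.3854    # Hits

# Predicted salary for a player with 100 hits (arbitrary example value).
hits <- 100
predicted_salary <- b0 + b1 * hits
print(predicted_salary)   # 501.5888
```

With the fitted model at hand, `predict(mod1, newdata = data.frame(Hits = 100))` should give essentially the same number, up to the rounding of the printed coefficients.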
Exercise: explain, carefully, the meaning of all the items you can recognize in the output.
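As a possible aid for this exercise, most items printed by `summary()` can also be extracted as numbers. The sketch below uses simulated stand-in data (the data frame `toy` and its values are hypothetical, not the real Hitters data), so its numbers will not match the output above.

```r
set.seed(42)

# Simulated stand-in data (not the real Hitters values).
n      <- 200
hits   <- round(runif(n, min = 1, max = 240))
salary <- 60 + 4.4 * hits + rnorm(n, sd = 400)
toy    <- data.frame(hits = hits, salary = salary)

fit <- lm(salary ~ hits, data = toy)
s   <- summary(fit)

coef_table <- coef(s)          # columns: Estimate, Std. Error, t value, Pr(>|t|)
r_squared  <- s$r.squared      # "Multiple R-squared"
adj_r2     <- s$adj.r.squared  # "Adjusted R-squared"
sigma_hat  <- s$sigma          # "Residual standard error"
resid_df   <- s$df[2]          # residual degrees of freedom: n minus 2 coefficients

print(coef_table)
print(c(r_squared = r_squared, adj_r2 = adj_r2, sigma = sigma_hat, df = resid_df))

# The t value is the estimate divided by its standard error.
t_by_hand <- coef_table["hits", "Estimate"] / coef_table["hits", "Std. Error"]
print(t_by_hand)
```

The same relationships hold in the Hitters output: 4.3854 / 0.5561 is approximately 7.886, and the 261 degrees of freedom are the 263 players with an observed Salary (322 minus 59 missing) minus the 2 estimated coefficients.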
Two important cheat sheets
- How to interpret a regression output
- How to deal with p-value
Recall: this is “supervised learning” because we have an output and at least one input.
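Because the output is observed, predictions can be compared directly with the actual values. The sketch below shows one common way to do this: hold out part of the data and compute the root mean squared error (RMSE) on it. This is an assumed approach for illustration, not necessarily the one this lesson uses, and the data are simulated (all names and values are hypothetical).

```r
set.seed(123)

# Simulated stand-in data (hypothetical values, not the Hitters data).
n   <- 300
x   <- runif(n, min = 0, max = 200)
y   <- 50 + 4 * x + rnorm(n, sd = 100)
dat <- data.frame(x = x, y = y)

# Split the rows at random: 70% to fit the model, 30% to evaluate it.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train     <- dat[train_idx, ]
holdout   <- dat[-train_idx, ]

fit  <- lm(y ~ x, data = train)
pred <- predict(fit, newdata = holdout)

# Root mean squared error of the predictions on the held-out rows.
rmse <- sqrt(mean((holdout$y - pred)^2))

# Benchmark: always predict the training mean of y, ignoring x.
rmse_baseline <- sqrt(mean((holdout$y - mean(train$y))^2))

print(c(model = rmse, baseline = rmse_baseline))
```

A model that uses an informative input should have a lower hold-out RMSE than the benchmark that ignores the input.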