Week #1

First day 31/01/2023

What is a prediction? Why do we need to predict?

For example,

Think about Amazon: it is always making predictions to optimize its workflow!

Can we evaluate forecasts?

NOTICE!

In this lesson we’ll learn how to evaluate a predictive model’s performance.

To make forecasts we need VARIABLES

Brush up on the kind of variables you know:

  • Quantitative (continuous, discrete)
  • Qualitative (ordinal, nominal)
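As a quick refresher, here is a minimal sketch of how these four kinds of variables can be stored in R. The data frame `players` and its values are made up for illustration; they are not part of the course data.

```r
# Hypothetical toy data: one column per kind of variable
players <- data.frame(
  salary = c(475.0, 480.5, 500.0),            # quantitative, continuous
  hits   = c(81L, 130L, 141L),                # quantitative, discrete
  league = factor(c("N", "A", "N")),          # qualitative, nominal
  level  = factor(c("low", "high", "mid"),
                  levels = c("low", "mid", "high"),
                  ordered = TRUE)             # qualitative, ordinal
)

# First class of each column: how R stores each kind of variable
types <- sapply(players, function(col) class(col)[1])
print(types)
```

Note that R does not know by itself whether an integer column is a count or a code for categories, so the stored class is only a hint: the analyst still has to decide what kind of variable it is.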

Let’s find some examples of these variables

library(ISLR)
summary(Hitters)
##      AtBat            Hits         HmRun            Runs       
##  Min.   : 16.0   Min.   :  1   Min.   : 0.00   Min.   :  0.00  
##  1st Qu.:255.2   1st Qu.: 64   1st Qu.: 4.00   1st Qu.: 30.25  
##  Median :379.5   Median : 96   Median : 8.00   Median : 48.00  
##  Mean   :380.9   Mean   :101   Mean   :10.77   Mean   : 50.91  
##  3rd Qu.:512.0   3rd Qu.:137   3rd Qu.:16.00   3rd Qu.: 69.00  
##  Max.   :687.0   Max.   :238   Max.   :40.00   Max.   :130.00  
##                                                                
##       RBI             Walks            Years            CAtBat       
##  Min.   :  0.00   Min.   :  0.00   Min.   : 1.000   Min.   :   19.0  
##  1st Qu.: 28.00   1st Qu.: 22.00   1st Qu.: 4.000   1st Qu.:  816.8  
##  Median : 44.00   Median : 35.00   Median : 6.000   Median : 1928.0  
##  Mean   : 48.03   Mean   : 38.74   Mean   : 7.444   Mean   : 2648.7  
##  3rd Qu.: 64.75   3rd Qu.: 53.00   3rd Qu.:11.000   3rd Qu.: 3924.2  
##  Max.   :121.00   Max.   :105.00   Max.   :24.000   Max.   :14053.0  
##                                                                      
##      CHits            CHmRun           CRuns             CRBI        
##  Min.   :   4.0   Min.   :  0.00   Min.   :   1.0   Min.   :   0.00  
##  1st Qu.: 209.0   1st Qu.: 14.00   1st Qu.: 100.2   1st Qu.:  88.75  
##  Median : 508.0   Median : 37.50   Median : 247.0   Median : 220.50  
##  Mean   : 717.6   Mean   : 69.49   Mean   : 358.8   Mean   : 330.12  
##  3rd Qu.:1059.2   3rd Qu.: 90.00   3rd Qu.: 526.2   3rd Qu.: 426.25  
##  Max.   :4256.0   Max.   :548.00   Max.   :2165.0   Max.   :1659.00  
##                                                                      
##      CWalks        League  Division    PutOuts          Assists     
##  Min.   :   0.00   A:175   E:157    Min.   :   0.0   Min.   :  0.0  
##  1st Qu.:  67.25   N:147   W:165    1st Qu.: 109.2   1st Qu.:  7.0  
##  Median : 170.50                    Median : 212.0   Median : 39.5  
##  Mean   : 260.24                    Mean   : 288.9   Mean   :106.9  
##  3rd Qu.: 339.25                    3rd Qu.: 325.0   3rd Qu.:166.0  
##  Max.   :1566.00                    Max.   :1378.0   Max.   :492.0  
##                                                                     
##      Errors          Salary       NewLeague
##  Min.   : 0.00   Min.   :  67.5   A:176    
##  1st Qu.: 3.00   1st Qu.: 190.0   N:146    
##  Median : 6.00   Median : 425.0            
##  Mean   : 8.04   Mean   : 535.9            
##  3rd Qu.:11.00   3rd Qu.: 750.0            
##  Max.   :32.00   Max.   :2460.0            
##                  NA's   :59
  • You may need to look up the meaning of the variables: the help page ?Hitters describes each one

Lesson of the day: we must understand the type of variables we have. Otherwise, our next steps will be full of mistakes!

Second day 02/02/2023

Supervised learning

The key idea

Supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs. Problems of this nature occur in fields as diverse as business, medicine, astrophysics, and public policy.

When the output is a continuous variable

Do you remember how to do it? There are a lot of techniques. Let’s start with linear regression models.

\[ y_{i}=\beta_{0}+\beta_{1}x_{i}+u_{i} \]

Where:

  • \(y_{i}\) is the output variable: the variable we want to learn how to predict.
  • \(x_{i}\) is the input variable: a variable that helps us predict \(y_{i}\).
  • \(i=1,...,N\) indexes the individuals in our data.
  • \(u_{i}\) is a random error term.
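To see what each term of the model does, here is a minimal simulation sketch. The values of `n`, `beta0`, `beta1` and the error standard deviation are arbitrary choices made for illustration, not taken from any data set. We generate data from the equation above and check that `lm()` approximately recovers the parameters.

```r
set.seed(1)                        # make the random draws reproducible

n     <- 500                       # number of individuals (assumed)
beta0 <- 2                         # true intercept (assumed)
beta1 <- 0.5                       # true slope (assumed)

x <- runif(n, min = 0, max = 10)   # input variable
u <- rnorm(n, mean = 0, sd = 1)    # random error term
y <- beta0 + beta1 * x + u         # output built from the model equation

fit <- lm(y ~ x)                   # estimate beta0 and beta1 by least squares
est <- coef(fit)
print(est)
```

The estimates should be close to, but not exactly equal to, the true values: the difference comes from the error term \(u_{i}\).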


Exercise

How do you interpret the meaning of \(\beta_0\) and \(\beta_1\)?


Let’s use the Baseball data set (Hitters). We want to predict the Salary of a player conditional on their Hits.

attach(Hitters)
plot(Hits, Salary, main="Scatterplot Example",
   xlab="Hits", ylab="Salary", pch=19)

\[ Salary_{i}=\beta_{0}+\beta_{1}Hits_{i}+u_{i} \]

mod1<-lm(Salary~Hits,data=Hitters)
summary(mod1)
## 
## Call:
## lm(formula = Salary ~ Hits, data = Hitters)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -893.99 -245.63  -59.08  181.12 2059.90 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  63.0488    64.9822   0.970    0.333    
## Hits          4.3854     0.5561   7.886 8.53e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 406.2 on 261 degrees of freedom
##   (59 observations deleted due to missingness)
## Multiple R-squared:  0.1924, Adjusted R-squared:  0.1893 
## F-statistic: 62.19 on 1 and 261 DF,  p-value: 8.531e-14

Exercise: explain, carefully, the meaning of all the items you can recognize in the output.
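Once the model is estimated, it can be used to forecast. The sketch below plugs the coefficients printed in the output above into the regression equation by hand; the function name `predict_salary` is introduced here only for illustration.

```r
# Coefficients copied from the summary(mod1) output above
b0 <- 63.0488   # estimated intercept
b1 <- 4.3854    # estimated slope for Hits

# Fitted regression line: predicted Salary for a given number of Hits
predict_salary <- function(hits) b0 + b1 * hits

pred_100 <- predict_salary(100)
print(pred_100)
```

For a player with 100 hits the predicted salary is about 501.6. According to the data set documentation, Salary is recorded in thousands of dollars. With `mod1` still in the session, `predict(mod1, newdata = data.frame(Hits = 100))` should give the same value up to rounding of the printed coefficients.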


Recall: this is “supervised learning” because we have one output and at least one input.

When the output is a binary variable

Suppose we want to predict the League of a player conditional on their hits. The variable League is qualitative with only two categories (binary):

A factor with levels A and N indicating player’s league at the end of 1986


Example

Try to plot the league versus the hits. What can you observe?

League_Factor <- as.factor(ifelse(Hitters$League == "A", 1, 0))  # recode as a 0/1 factor: 1 = league A, 0 = league N
Hitters <- data.frame(Hitters, League_Factor)  # add the new column to our data set


attach(Hitters)
## The following object is masked _by_ .GlobalEnv:
## 
##     League_Factor
## The following objects are masked from Hitters (pos = 3):
## 
##     Assists, AtBat, CAtBat, CHits, CHmRun, CRBI, CRuns, CWalks,
##     Division, Errors, Hits, HmRun, League, NewLeague, PutOuts, RBI,
##     Runs, Salary, Walks, Years
plot(Hitters$Hits, Hitters$League_Factor, main="Scatterplot Example",
   xlab="Hits", ylab="League", pch=19)

boxplot(Hitters$Hits~Hitters$League_Factor)


When the output variable is binary, we need to introduce a new function, called the “logistic” function:

\[ y=\frac{1}{1+e^{-x}} \]


Exercise: Generate a sequence of values for \(x\) (let’s say \(x\in[-10,10]\)) and evaluate the following logistic function at each value.

\[ y=\frac{1}{1+e^{-0.5x}} \]

Plot it and try to explain what you get.

x<-seq(-10,10,0.5)
y<-1/(1+exp(-0.5*x))
plot(x,y)
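To help explain the plot, the following sketch checks three properties of the curve numerically. The helper name `logistic` is introduced here for illustration; `plogis()` is base R's built-in logistic function.

```r
# Logistic function with an adjustable slope (0.5 as in the exercise)
logistic <- function(x, slope = 0.5) 1 / (1 + exp(-slope * x))

x <- seq(-10, 10, 0.5)
y <- logistic(x)

print(range(y))                               # stays strictly between 0 and 1
print(logistic(0))                            # crosses one half at x = 0
print(all(diff(y) > 0))                       # always increasing
print(isTRUE(all.equal(y, plogis(0.5 * x))))  # agrees with base R's plogis()
```

Because the curve is S-shaped and bounded between 0 and 1, it can turn any real number into something that can be read as a probability.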


In our case, the model we want to estimate is written in this way:

\[ \Pr(League_{i}=1)=\frac{1}{1+e^{-(\beta_{0}+\beta_{1}Hits_{i})}} \]

attach(Hitters)
## The following object is masked _by_ .GlobalEnv:
## 
##     League_Factor
## The following objects are masked from Hitters (pos = 3):
## 
##     Assists, AtBat, CAtBat, CHits, CHmRun, CRBI, CRuns, CWalks,
##     Division, Errors, Hits, HmRun, League, League_Factor, NewLeague,
##     PutOuts, RBI, Runs, Salary, Walks, Years
## The following objects are masked from Hitters (pos = 4):
## 
##     Assists, AtBat, CAtBat, CHits, CHmRun, CRBI, CRuns, CWalks,
##     Division, Errors, Hits, HmRun, League, NewLeague, PutOuts, RBI,
##     Runs, Salary, Walks, Years
options(warn=-1)
model <- glm( League ~Hits, data = Hitters, family = binomial)
summary(model)
## 
## Call:
## glm(formula = League ~ Hits, family = binomial, data = Hitters)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.2903  -1.1035  -0.9818   1.2149   1.4686  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.270376   0.269686   1.003    0.316  
## Hits        -0.004422   0.002449  -1.805    0.071 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 443.95  on 321  degrees of freedom
## Residual deviance: 440.64  on 320  degrees of freedom
## AIC: 444.64
## 
## Number of Fisher Scoring iterations: 4

Now, think about the interpretation of the parameters: can they be read in the same way as in the linear regression model?
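One way to approach this question, as a sketch: write \(p_{i}\) for the probability given by the logistic model and solve for the linear part.

\[ p_{i}=\frac{1}{1+e^{-(\beta_{0}+\beta_{1}Hits_{i})}} \quad\Longrightarrow\quad \frac{p_{i}}{1-p_{i}}=e^{\beta_{0}+\beta_{1}Hits_{i}} \quad\Longrightarrow\quad \log\left(\frac{p_{i}}{1-p_{i}}\right)=\beta_{0}+\beta_{1}Hits_{i} \]

So \(\beta_{1}\) is not the change in the probability for one more hit, as it would be in a linear model. It is the change in the log-odds, and \(e^{\beta_{1}}\) is the factor by which the odds \(p_{i}/(1-p_{i})\) are multiplied for each additional hit.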

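This model will be useful to forecast a PROBABILITY. The output of the model, like the curve you drew before, is a value between 0 and 1, so it transforms the number of hits of a player into the probability of belonging to one of the leagues. Be careful about which one: with a factor response, glm() treats the first level (A) as failure and the other level as success, so the model fitted above with League as the response gives the probability of being in league N. To obtain the probability of being in the league labelled with value 1 (which is league A), fit the model with League_Factor as the response instead.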
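As a minimal sketch of such a forecast, the code below plugs the coefficients printed in the output above into the logistic function. The hit counts in `hits` are arbitrary illustrative values. Since the model was fitted with League (levels A and N) as the response, the assumption here is R's convention that the first level, A, counts as failure, so the probabilities refer to league N.

```r
# Coefficients copied from the summary(model) output above
b0 <- 0.270376    # estimated intercept
b1 <- -0.004422   # estimated coefficient for Hits

hits <- c(50, 100, 200)        # illustrative numbers of hits (assumed)
p <- plogis(b0 + b1 * hits)    # plogis(z) computes 1 / (1 + exp(-z))
print(round(p, 3))

# Multiplicative change in the odds for one additional hit
odds_ratio <- exp(b1)
print(odds_ratio)
```

For a player with 100 hits the predicted probability is roughly 0.46, and it decreases as the number of hits grows because the estimated coefficient is negative. With `model` in the session, `predict(model, newdata = data.frame(Hits = hits), type = "response")` should return the same probabilities up to rounding of the printed coefficients.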

Lesson of the day: we have to choose a model that matches the kind of variables we have!