Week #1

First day 31/01/2023

What is a prediction? Why do we need to predict?

For example, think about Amazon: the company is always making predictions to optimize its workflow!

Can we evaluate forecasts?

NOTICE!

In this lesson we’ll learn how to evaluate a predictive model’s performance.

To make forecasts we need VARIABLES

Brush up on the kinds of variables you know:

  • Quantitative (continuous, discrete)
  • Qualitative (ordinal, nominal)
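As a rough illustration of how these four kinds of variables can be stored in R, here is a minimal sketch. The data frame `players` and all of its values are made up for this example; they are not taken from any real data set.

```r
# Hypothetical toy data: one column per variable type (all values are made up)
players <- data.frame(
  salary = c(475.0, 480.5, 91.5),            # quantitative, continuous
  hits   = c(81L, 130L, 141L),               # quantitative, discrete (counts)
  level  = factor(c("low", "high", "mid"),
                  levels = c("low", "mid", "high"),
                  ordered = TRUE),           # qualitative, ordinal
  league = factor(c("N", "A", "N"))          # qualitative, nominal
)

# First class of each column: R stores the four kinds differently
var_classes <- sapply(players, function(col) class(col)[1])
print(var_classes)

# Ordered factors can be compared with < and >; nominal factors cannot
print(players$level < "high")
```

Storing an ordinal variable as an ordered factor keeps the ranking of its categories, which a plain (nominal) factor would lose.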

Let’s find some examples of these variables

library(ISLR)
summary(Hitters)
##      AtBat            Hits         HmRun            Runs       
##  Min.   : 16.0   Min.   :  1   Min.   : 0.00   Min.   :  0.00  
##  1st Qu.:255.2   1st Qu.: 64   1st Qu.: 4.00   1st Qu.: 30.25  
##  Median :379.5   Median : 96   Median : 8.00   Median : 48.00  
##  Mean   :380.9   Mean   :101   Mean   :10.77   Mean   : 50.91  
##  3rd Qu.:512.0   3rd Qu.:137   3rd Qu.:16.00   3rd Qu.: 69.00  
##  Max.   :687.0   Max.   :238   Max.   :40.00   Max.   :130.00  
##                                                                
##       RBI             Walks            Years            CAtBat       
##  Min.   :  0.00   Min.   :  0.00   Min.   : 1.000   Min.   :   19.0  
##  1st Qu.: 28.00   1st Qu.: 22.00   1st Qu.: 4.000   1st Qu.:  816.8  
##  Median : 44.00   Median : 35.00   Median : 6.000   Median : 1928.0  
##  Mean   : 48.03   Mean   : 38.74   Mean   : 7.444   Mean   : 2648.7  
##  3rd Qu.: 64.75   3rd Qu.: 53.00   3rd Qu.:11.000   3rd Qu.: 3924.2  
##  Max.   :121.00   Max.   :105.00   Max.   :24.000   Max.   :14053.0  
##                                                                      
##      CHits            CHmRun           CRuns             CRBI        
##  Min.   :   4.0   Min.   :  0.00   Min.   :   1.0   Min.   :   0.00  
##  1st Qu.: 209.0   1st Qu.: 14.00   1st Qu.: 100.2   1st Qu.:  88.75  
##  Median : 508.0   Median : 37.50   Median : 247.0   Median : 220.50  
##  Mean   : 717.6   Mean   : 69.49   Mean   : 358.8   Mean   : 330.12  
##  3rd Qu.:1059.2   3rd Qu.: 90.00   3rd Qu.: 526.2   3rd Qu.: 426.25  
##  Max.   :4256.0   Max.   :548.00   Max.   :2165.0   Max.   :1659.00  
##                                                                      
##      CWalks        League  Division    PutOuts          Assists     
##  Min.   :   0.00   A:175   E:157    Min.   :   0.0   Min.   :  0.0  
##  1st Qu.:  67.25   N:147   W:165    1st Qu.: 109.2   1st Qu.:  7.0  
##  Median : 170.50                    Median : 212.0   Median : 39.5  
##  Mean   : 260.24                    Mean   : 288.9   Mean   :106.9  
##  3rd Qu.: 339.25                    3rd Qu.: 325.0   3rd Qu.:166.0  
##  Max.   :1566.00                    Max.   :1378.0   Max.   :492.0  
##                                                                     
##      Errors          Salary       NewLeague
##  Min.   : 0.00   Min.   :  67.5   A:176    
##  1st Qu.: 3.00   1st Qu.: 190.0   N:146    
##  Median : 6.00   Median : 425.0            
##  Mean   : 8.04   Mean   : 535.9            
##  3rd Qu.:11.00   3rd Qu.: 750.0            
##  Max.   :32.00   Max.   :2460.0            
##                  NA's   :59
  • To classify each variable, you may first need to know what it means (in R, ?Hitters opens the data set’s documentation).

Lesson of the day: we must understand the types of variables we have. Otherwise, our next steps will be full of mistakes!
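One possible way to check the variable types before modelling is sketched below. It assumes the ISLR package is installed; the object names `col_classes` and `na_counts` are introduced here only for illustration.

```r
library(ISLR)  # same package the notes load above

# Storage class of every column in Hitters
col_classes <- sapply(Hitters, class)
print(table(col_classes))

# Names of the qualitative (factor) columns
print(names(Hitters)[col_classes == "factor"])

# Missing values per column; only columns that contain NAs are shown
na_counts <- colSums(is.na(Hitters))
print(na_counts[na_counts > 0])
```

Columns stored as factors are the qualitative ones in the summary above (League, Division, NewLeague), and the NA count flags the missing Salary values.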

Second day 02/02/2023

Supervised learning

The key idea

Supervised statistical learning involves building a statistical model for predicting, or estimating, an output based on one or more inputs. Problems of this nature occur in fields as diverse as business, medicine, astrophysics, and public policy.

When the output is a continuous variable

Do you remember how to do it? There are a lot of techniques. Let’s start with linear regression models.

\[ y_{i}=\beta_{0}+\beta_{1}x_{i}+u_{i} \]

Where:

  • \(y_{i}\) is the output variable: the variable we want to learn how to predict.
  • \(x_{i}\) is the input variable: a variable that helps to predict \(y_{i}\).
  • \(i=1,\dots,N\) indexes the individuals in the data.
  • \(u_{i}\) is a random error term.
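To see the model at work, the following sketch simulates data from it and then estimates the coefficients with lm(). The sample size, the "true" values 2 and 3, the error standard deviation, and the seed are all arbitrary assumptions chosen for illustration.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only

n     <- 500                       # hypothetical sample size
beta0 <- 2                         # assumed "true" intercept
beta1 <- 3                         # assumed "true" slope

x <- runif(n, min = 0, max = 10)   # input variable
u <- rnorm(n, mean = 0, sd = 1)    # random error term
y <- beta0 + beta1 * x + u         # output generated by the model

fit <- lm(y ~ x)                   # estimate beta0 and beta1 by least squares
print(coef(fit))                   # estimates should be close to 2 and 3
```

Because the data are generated by the model itself, the estimates differ from 2 and 3 only through the random error term.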


Exercise

How do you interpret the meaning of \(\beta_0\) and \(\beta_1\)?


Let’s use the Baseball data set (Hitters). We want to predict the Salary of a player conditional on their Hits.

# Formula interface with data= avoids attach(), which can mask other objects
plot(Salary ~ Hits, data=Hitters, main="Scatterplot Example",
   xlab="Hits", ylab="Salary", pch=19)

\[ Salary_{i}=\beta_{0}+\beta_{1}Hits_{i}+u_{i} \]

# Fit the simple linear regression of Salary on Hits by least squares
mod1 <- lm(Salary ~ Hits, data = Hitters)
summary(mod1)
## 
## Call:
## lm(formula = Salary ~ Hits, data = Hitters)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -893.99 -245.63  -59.08  181.12 2059.90 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  63.0488    64.9822   0.970    0.333    
## Hits          4.3854     0.5561   7.886 8.53e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 406.2 on 261 degrees of freedom
##   (59 observations deleted due to missingness)
## Multiple R-squared:  0.1924, Adjusted R-squared:  0.1893 
## F-statistic: 62.19 on 1 and 261 DF,  p-value: 8.531e-14
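Several numbers in this output can be reproduced by hand, which may help when reading it. The sketch below copies the estimates from the output as plain numbers; the player with 100 hits is hypothetical. If, as the ISLR documentation of Hitters states, Salary is measured in thousands of dollars, the fitted value is in thousands of dollars too.

```r
# Values copied from the regression output above
b0      <- 63.0488   # (Intercept) estimate
b1      <- 4.3854    # Hits estimate
se_b1   <- 0.5561    # standard error of the Hits estimate
n_total <- 322       # rows in Hitters (175 + 147 in the League column)
n_na    <- 59        # rows dropped because Salary is missing

# Fitted salary for a hypothetical player with 100 hits
salary_hat <- b0 + b1 * 100
print(salary_hat)

# t value = estimate / standard error
t_hits <- b1 / se_b1
print(round(t_hits, 3))

# Residual degrees of freedom = observations used - estimated coefficients
df_resid <- (n_total - n_na) - 2
print(df_resid)

# With a single input, the F-statistic equals the squared t value of the slope
print(round(t_hits^2, 2))
```

The same fitted value could be obtained from the model object with predict(mod1, newdata = data.frame(Hits = 100)), up to the rounding of the printed coefficients.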

Exercise: explain, carefully, the meaning of all the items you can recognize in the output.


Two important cheat sheets

  • How to interpret a regression output

  • How to deal with p-values

Recall: this is “supervised learning” because we observe one output and at least one input.