Step Functions

#1.- INTRODUCCIÓN

La regresión polinómica explicada anteriormente persigue generar una única función global que describa el comportamiento de la variable dependiente Y en todo el rango del predictor X. La estrategia del método step functions consiste en dividir el rango del predictor X en varios subintervalos y ajustar una constante distinta para cada uno.

Supóngase que se crean K puntos de corte c1, c2,…, ck en el rango del predictor X generando K+1 intervalos. Para cada uno de estos intervalos se crea una variable dummy C0(X), C1(X),…, CK(X) . El valor de estas variables será 1 si X está dentro del intervalo asociado con la variable y 0 de lo contrario. Dado que cualquier valor de X va a estar comprendido en uno de los K+1 intervalos y solo en uno, únicamente una de las variables dummy tendrá el valor de 1 y las demás serán cero.

##2.- DATOS

Empleando el mismo set de datos Wage que en el ejercicio anterior, se pretende crear un modelo formado por 4 step functions* que permita predecir el salario de un trabajador en función de la edad que tiene.*

En R, los modelos step functions se obtienen mediante una combinación de las funciones lm() y cut(). Dado un vector numérico como argumento, la función cut() establece n puntos de corte y devuelve una variable cualitativa que indica el subintervalo al que pertenece cada observación.

library(ISLR)

## Warning: package 'ISLR' was built under R version 4.2.3

data(Wage)

str(Wage)

## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...

head(Wage)

##        year age           maritl     race       education             region
## 231655 2006  18 1. Never Married 1. White    1. < HS Grad 2. Middle Atlantic
## 86582  2004  24 1. Never Married 1. White 4. College Grad 2. Middle Atlantic
## 161300 2003  45       2. Married 1. White 3. Some College 2. Middle Atlantic
## 155159 2003  43       2. Married 3. Asian 4. College Grad 2. Middle Atlantic
## 11443  2005  50      4. Divorced 1. White      2. HS Grad 2. Middle Atlantic
## 376662 2008  54       2. Married 1. White 4. College Grad 2. Middle Atlantic
##              jobclass         health health_ins  logwage      wage
## 231655  1. Industrial      1. <=Good      2. No 4.318063  75.04315
## 86582  2. Information 2. >=Very Good      2. No 4.255273  70.47602
## 161300  1. Industrial      1. <=Good     1. Yes 4.875061 130.98218
## 155159 2. Information 2. >=Very Good     1. Yes 5.041393 154.68529
## 11443  2. Information      1. <=Good     1. Yes 4.318063  75.04315
## 376662 2. Information 2. >=Very Good     1. Yes 4.845098 127.11574

summary(Wage)

##       year           age                     maritl           race     
##  Min.   :2003   Min.   :18.00   1. Never Married: 648   1. White:2480  
##  1st Qu.:2004   1st Qu.:33.75   2. Married      :2074   2. Black: 293  
##  Median :2006   Median :42.00   3. Widowed      :  19   3. Asian: 190  
##  Mean   :2006   Mean   :42.41   4. Divorced     : 204   4. Other:  37  
##  3rd Qu.:2008   3rd Qu.:51.00   5. Separated    :  55                  
##  Max.   :2009   Max.   :80.00                                          
##                                                                        
##               education                     region               jobclass   
##  1. < HS Grad      :268   2. Middle Atlantic   :3000   1. Industrial :1544  
##  2. HS Grad        :971   1. New England       :   0   2. Information:1456  
##  3. Some College   :650   3. East North Central:   0                        
##  4. College Grad   :685   4. West North Central:   0                        
##  5. Advanced Degree:426   5. South Atlantic    :   0                        
##                           6. East South Central:   0                        
##                           (Other)              :   0                        
##             health      health_ins      logwage           wage       
##  1. <=Good     : 858   1. Yes:2083   Min.   :3.000   Min.   : 20.09  
##  2. >=Very Good:2142   2. No : 917   1st Qu.:4.447   1st Qu.: 85.38  
##                                      Median :4.653   Median :104.92  
##                                      Mean   :4.654   Mean   :111.70  
##                                      3rd Qu.:4.857   3rd Qu.:128.68  
##                                      Max.   :5.763   Max.   :318.34  
##

table(cut(Wage$wage, 4))

## 
## (19.8,94.6]  (94.6,169]   (169,244]   (244,319] 
##        1120        1677         124          79

#2.- MODELO

modelo_step_fun <- lm(wage ~cut(age, 4), data = Wage)
summary(modelo_step_fun)

## 
## Call:
## lm(formula = wage ~ cut(age, 4), data = Wage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -98.126 -24.803  -6.177  16.493 200.519 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              94.158      1.476  63.790   <2e-16 ***
## cut(age, 4)(33.5,49]     24.053      1.829  13.148   <2e-16 ***
## cut(age, 4)(49,64.5]     23.665      2.068  11.443   <2e-16 ***
## cut(age, 4)(64.5,80.1]    7.641      4.987   1.532    0.126    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.42 on 2996 degrees of freedom
## Multiple R-squared:  0.0625, Adjusted R-squared:  0.06156 
## F-statistic: 66.58 on 3 and 2996 DF,  p-value: < 2.2e-16

modelo_lineal <- lm(wage ~age, data = Wage)
summary(modelo_lineal)

## 
## Call:
## lm(formula = wage ~ age, data = Wage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -100.265  -25.115   -6.063   16.601  205.748 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 81.70474    2.84624   28.71   <2e-16 ***
## age          0.70728    0.06475   10.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40.93 on 2998 degrees of freedom
## Multiple R-squared:  0.03827,    Adjusted R-squared:  0.03795 
## F-statistic: 119.3 on 1 and 2998 DF,  p-value: < 2.2e-16

library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

stargazer(modelo_lineal, modelo_step_fun, type = "text")

## 
## =========================================================================
##                                       Dependent variable:                
##                        --------------------------------------------------
##                                               wage                       
##                                   (1)                      (2)           
## -------------------------------------------------------------------------
## age                            0.707***                                  
##                                 (0.065)                                  
##                                                                          
## cut(age, 4)(33.5,49]                                    24.053***        
##                                                          (1.829)         
##                                                                          
## cut(age, 4)(49,64.5]                                    23.665***        
##                                                          (2.068)         
##                                                                          
## cut(age, 4)(64.5,80.1]                                    7.641          
##                                                          (4.987)         
##                                                                          
## Constant                       81.705***                94.158***        
##                                 (2.846)                  (1.476)         
##                                                                          
## -------------------------------------------------------------------------
## Observations                     3,000                    3,000          
## R2                               0.038                    0.062          
## Adjusted R2                      0.038                    0.062          
## Residual Std. Error       40.929 (df = 2998)        40.424 (df = 2996)   
## F Statistic            119.312*** (df = 1; 2998) 66.575*** (df = 3; 2996)
## =========================================================================
## Note:                                         *p<0.1; **p<0.05; ***p<0.01

La función cut() ha establecido los puntos de corte en los valores age = 33.5, 49, 64.5. En el summary del modelo obtenido no aparece el grupo age < 33.5, lo que significa que este es el grupo de referencia. El intercept (94.158) se interpreta como el salario medio de los trabajadores con age < 33.5 y el coeficiente de regresión estimado de cada grupo como el incremento promedio de salario respecto al grupo de referencia.

Representación gráfica

# INTERPOLACIÓN DE PUNTOS DENTRO DEL RANGO DEL PREDICTOR
# -----------------------------------------------------------

limites <- range(Wage$age)
nuevos_puntos <- seq(from = limites[1], to = limites[2], by = 1)
nuevos_puntos <- data.frame(age = nuevos_puntos)

# PREDICCIÓN DE LA VARIABLE RESPUESTA Y DEL ERROR ESTÁNDAR
# ----------------------------------------

predicciones <- predict(modelo_step_fun, newdata = nuevos_puntos, se.fit = TRUE,
                        level = 0.95)

# CÁLCULO DEL INTERVALO DE CONFIANZA SUPERIOR E INFERIOR
# ---------------------------------------------------------

intervalo_conf <- data.frame(
                  inferior = predicciones$fit - 1.96*predicciones$se.fit,
                  superior = predicciones$fit + 1.96*predicciones$se.fit)
attach(Wage)

# GRÁFICO
# ---------------------------------------------------------------------

plot(x = age, y = wage, pch = 20, col = "darkgrey")
title("Modelo Piecewise Constant con 4 step functions")
lines(x = nuevos_puntos$age, predicciones$fit, col = "red", lwd = 2)
lines(x = nuevos_puntos$age, intervalo_conf$inferior, col = "blue",
      lwd = 2, lty = 2)
lines(x = nuevos_puntos$age, intervalo_conf$superior, col = "blue", 
      lwd = 2, lty = 2)

Step Functions

2023-10-13