CLASE 12. Regresión de Poisson con Patsy en Python

Autor/a

Gerson Rivera

Fecha de publicación

30 julio 2024

El paquete statmodels y patsy de Python tiene un soporte excelente para realizar la regresión de Poisson.

Usemos el conjunto de datos de recuentos de ciclistas del puente de Brooklyn. Puede recoger el conjunto de datos desde nyc_bb_bicyclist_counts.csv

Nuestro objetivo es construir un modelo de regresión de Poisson para los conteos de ciclistas observados y usaremos el modelo entrenado para predecir los recuentos diarios de ciclistas en el puente de Brooklyn que el modelo no ha visto durante el entrenamiento.

import pandas as pd
from patsy import dmatrices
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
df = pd.read_csv('nyc_bb_bicyclist_counts.csv',header=0, infer_datetime_format=True, parse_dates=[0], index_col=[0])
df.head()
HIGH_T LOW_T PRECIP BB_COUNT
Date
2017-04-01 46.0 37.0 0.00 606
2017-04-02 62.1 41.0 0.00 2021
2017-04-03 63.0 50.0 0.03 2470
2017-04-04 51.1 46.0 1.18 723
2017-04-05 63.0 46.0 0.00 2807
ds = df.index.to_series()
df['MONTH'] = ds.dt.month
df['DAY_OF_WEEK'] = ds.dt.dayofweek
df['DAY'] = ds.dt.day

df.head()
HIGH_T LOW_T PRECIP BB_COUNT MONTH DAY_OF_WEEK DAY
Date
2017-04-01 46.0 37.0 0.00 606 4 5 1
2017-04-02 62.1 41.0 0.00 2021 4 6 2
2017-04-03 63.0 50.0 0.03 2470 4 0 3
2017-04-04 51.1 46.0 1.18 723 4 1 4
2017-04-05 63.0 46.0 0.00 2807 4 2 5

No usaremos la variable Date como regresor ya que contiene un valor de fecha absoluto, pero no necesitamos hacer nada especial para eliminar Date, ya que ya se consume como índice del DataFrame de pandas. Por lo tanto, no estará disponible para nosotros en la matriz X. Crearemos los conjuntos de datos de entrenamiento y prueba.

mask = np.random.rand(len(df)) < 0.8
df_train = df[mask]
df_test = df[~mask]
print('Training data set length='+str(len(df_train)))
print('Testing data set length='+str(len(df_test)))
Training data set length=177
Testing data set length=37
df_train.head()
HIGH_T LOW_T PRECIP BB_COUNT MONTH DAY_OF_WEEK DAY
Date
2017-04-01 46.0 37.0 0.00 606 4 5 1
2017-04-02 62.1 41.0 0.00 2021 4 6 2
2017-04-03 63.0 50.0 0.03 2470 4 0 3
2017-04-04 51.1 46.0 1.18 723 4 1 4
2017-04-05 63.0 46.0 0.00 2807 4 2 5

Configure la expresión de regresión en notación patsy. Le estamos diciendo a patsy que BB_COUNT es nuestra variable dependiente y depende de las variables de regresión:
DAY, DAY_OF_WEEK, MONTH, HIGH_T, LOW_T y PRECIP

expr = """BB_COUNT ~ DAY  + DAY_OF_WEEK + MONTH + HIGH_T + LOW_T + PRECIP"""

Configure las matrices X e Y para los conjuntos de datos de entrenamiento y prueba. patsy lo hace realmente simple.

y_train, X_train = dmatrices(expr, df_train, return_type='dataframe')
y_test, X_test = dmatrices(expr, df_test, return_type='dataframe')
X_train.head()
Intercept DAY DAY_OF_WEEK MONTH HIGH_T LOW_T PRECIP
Date
2017-04-01 1.0 1.0 5.0 4.0 46.0 37.0 0.00
2017-04-02 1.0 2.0 6.0 4.0 62.1 41.0 0.00
2017-04-03 1.0 3.0 0.0 4.0 63.0 50.0 0.03
2017-04-04 1.0 4.0 1.0 4.0 51.1 46.0 1.18
2017-04-05 1.0 5.0 2.0 4.0 63.0 46.0 0.00
X_test.head(10)
Intercept DAY DAY_OF_WEEK MONTH HIGH_T LOW_T PRECIP
Date
2017-04-06 1.0 6.0 3.0 4.0 48.9 41.0 0.73
2017-04-10 1.0 10.0 0.0 4.0 73.9 55.0 0.00
2017-04-18 1.0 18.0 1.0 4.0 66.0 50.0 0.00
2017-05-01 1.0 1.0 0.0 5.0 72.0 50.0 0.00
2017-05-03 1.0 3.0 2.0 5.0 64.9 57.9 0.00
2017-05-04 1.0 4.0 3.0 5.0 63.0 50.0 0.00
2017-05-19 1.0 19.0 4.0 5.0 90.0 75.9 0.00
2017-05-23 1.0 23.0 1.0 5.0 68.0 57.9 0.00
2017-05-24 1.0 24.0 2.0 5.0 66.9 57.0 0.04
2017-06-02 1.0 2.0 4.0 6.0 73.9 60.1 0.01
y_test.head()
BB_COUNT
Date
2017-04-06 461.0
2017-04-10 3324.0
2017-04-18 3415.0
2017-05-01 3084.0
2017-05-03 3342.0
y_train.head(5)
BB_COUNT
Date
2017-04-01 606.0
2017-04-02 2021.0
2017-04-03 2470.0
2017-04-04 723.0
2017-04-05 2807.0
y_test.head()
BB_COUNT
Date
2017-04-06 461.0
2017-04-10 3324.0
2017-04-18 3415.0
2017-05-01 3084.0
2017-05-03 3342.0

Con la clase GLM de statsmodels, entrene el modelo de regresión de Poisson en el conjunto de datos de entrenamiento.

poisson_training_results = sm.GLM(y_train, X_train, family=sm.families.Poisson()).fit()
print(poisson_training_results.summary())
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:               BB_COUNT   No. Observations:                  177
Model:                            GLM   Df Residuals:                      170
Model Family:                 Poisson   Df Model:                            6
Link Function:                    Log   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -12735.
Date:                Tue, 30 Jul 2024   Deviance:                       23764.
Time:                        16:43:59   Pearson chi2:                 2.40e+04
No. Iterations:                     6   Pseudo R-squ. (CS):              1.000
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       6.9735      0.012    570.147      0.000       6.950       6.997
DAY             0.0011      0.000      6.616      0.000       0.001       0.001
DAY_OF_WEEK    -0.0252      0.001    -33.867      0.000      -0.027      -0.024
MONTH           0.0169      0.001     22.053      0.000       0.015       0.018
HIGH_T          0.0248      0.000     75.859      0.000       0.024       0.025
LOW_T          -0.0150      0.000    -41.555      0.000      -0.016      -0.014
PRECIP         -0.7962      0.008   -103.607      0.000      -0.811      -0.781
===============================================================================
poisson_training_results.params
Intercept      6.973522
DAY            0.001112
DAY_OF_WEEK   -0.025199
MONTH          0.016871
HIGH_T         0.024785
LOW_T         -0.015029
PRECIP        -0.796171
dtype: float64
poisson_training_results.deviance
np.float64(23764.309105306325)
poisson_training_results.df_resid
np.int64(170)
c_sdisp= poisson_training_results.pearson_chi2 / poisson_training_results.df_resid
print(c_sdisp)
141.40859500647588
X_train.head()
Intercept DAY DAY_OF_WEEK MONTH HIGH_T LOW_T PRECIP
Date
2017-04-01 1.0 1.0 5.0 4.0 46.0 37.0 0.00
2017-04-02 1.0 2.0 6.0 4.0 62.1 41.0 0.00
2017-04-03 1.0 3.0 0.0 4.0 63.0 50.0 0.03
2017-04-04 1.0 4.0 1.0 4.0 51.1 46.0 1.18
2017-04-05 1.0 5.0 2.0 4.0 63.0 46.0 0.00
y_train.head()
BB_COUNT
Date
2017-04-01 606.0
2017-04-02 2021.0
2017-04-03 2470.0
2017-04-04 723.0
2017-04-05 2807.0
newdata={
    'Intercept':[1.0],
    'DAY': [1.0],
    'DAY_OF_WEEK':[5.0],
    'MONTH': [4.0],
    'HIGH_T': [46.0],
    'LOW_T': [37.0],
    'PRECIP': [0.00]
}
df_new = pd.DataFrame(newdata,index=['J'])
X_train.head(1)
Intercept DAY DAY_OF_WEEK MONTH HIGH_T LOW_T PRECIP
Date
2017-04-01 1.0 1.0 5.0 4.0 46.0 37.0 0.0
y_train.head(1)
BB_COUNT
Date
2017-04-01 606.0
df_new
Intercept DAY DAY_OF_WEEK MONTH HIGH_T LOW_T PRECIP
J 1.0 1.0 5.0 4.0 46.0 37.0 0.0
poisson_training_results.predict(df_new)
J    1808.377659
dtype: float64
y_test.head()
BB_COUNT
Date
2017-04-06 461.0
2017-04-10 3324.0
2017-04-18 3415.0
2017-05-01 3084.0
2017-05-03 3342.0
poisson_training_results.predict(X_test)
Date
2017-04-06    1082.149415
2017-04-10    3156.403397
2017-04-18    2752.398699
2017-05-01    3268.537216
2017-05-03    2319.769875
2017-05-04    2432.731680
2017-05-19    3191.415807
2017-05-23    2626.758459
2017-05-24    2449.977472
2017-06-02    2688.196180
2017-06-04    2076.679207
2017-06-10    3040.852693
2017-06-12    3672.621592
2017-06-17     784.382339
2017-06-21    2975.130714
2017-06-26    3126.010484
2017-07-05    3237.373403
2017-07-12    3153.170841
2017-07-18    3309.555669
2017-07-22    2034.497825
2017-07-24    1472.932405
2017-07-29    2584.118708
2017-08-02    3234.780578
2017-08-05    2009.551892
2017-08-10    3166.042869
2017-08-13    2795.072958
2017-09-02    1679.425064
2017-09-08    2648.484298
2017-09-13    3290.148412
2017-09-18    2741.421111
2017-09-26    3371.116866
2017-09-30    2389.631388
2017-10-02    3454.752729
2017-10-03    2977.103274
2017-10-09    2308.562467
2017-10-15    2422.150179
2017-10-16    2591.971807
dtype: float64
poisson_predictions = poisson_training_results.get_prediction(X_test)
#.summary_frame() returns a pandas DataFrame
predictions_summary_frame = poisson_predictions.summary_frame()
print(predictions_summary_frame)
                   mean    mean_se  mean_ci_lower  mean_ci_upper
Date                                                            
2017-04-06  1082.149415   7.112518    1068.298541    1096.179870
2017-04-10  3156.403397  13.586171    3129.886999    3183.144442
2017-04-18  2752.398699  10.244614    2732.392686    2772.551192
2017-05-01  3268.537216  17.092217    3235.208177    3302.209610
2017-05-03  2319.769875   9.241832    2301.726752    2337.954437
2017-05-04  2432.731680   8.403523    2416.316708    2449.258166
2017-05-19  3191.415807  11.273137    3169.397173    3213.587411
2017-05-23  2626.758459   8.170092    2610.794082    2642.820454
2017-05-24  2449.977472   7.059047    2436.180986    2463.852090
2017-06-02  2688.196180   8.246532    2672.081767    2704.407773
2017-06-04  2076.679207   8.358975    2060.360372    2093.127294
2017-06-10  3040.852693   8.734372    3023.781736    3058.020025
2017-06-12  3672.621592  14.708192    3643.906907    3701.562553
2017-06-17   784.382339   8.061819     768.739550     800.343437
2017-06-21  2975.130714   8.420793    2958.671957    2991.681030
2017-06-26  3126.010484  10.418818    3105.656529    3146.497837
2017-07-05  3237.373403   9.760682    3218.299230    3256.560624
2017-07-12  3153.170841  10.190991    3133.259995    3173.208213
2017-07-18  3309.555669  10.223006    3289.579476    3329.653169
2017-07-22  2034.497825  10.795382    2013.448908    2055.766791
2017-07-24  1472.932405   8.590874    1456.190477    1489.866817
2017-07-29  2584.118708   9.145875    2566.255153    2602.106611
2017-08-02  3234.780578  11.045184    3213.204691    3256.501342
2017-08-05  2009.551892   8.219963    1993.505469    2025.727478
2017-08-10  3166.042869   6.979718    3152.392385    3179.752462
2017-08-13  2795.072958   9.210491    2777.078899    2813.183610
2017-09-02  1679.425064   8.913744    1662.045004    1696.986868
2017-09-08  2648.484298   7.635536    2633.561123    2663.492036
2017-09-13  3290.148412   8.576121    3273.382387    3307.000311
2017-09-18  2741.421111  11.356396    2719.253099    2763.769841
2017-09-26  3371.116866  10.167992    3351.246758    3391.104786
2017-09-30  2389.631388   9.840150    2370.422668    2408.995765
2017-10-02  3454.752729  18.079450    3419.498766    3490.370149
2017-10-03  2977.103274  11.863911    2953.941008    3000.447158
2017-10-09  2308.562467  10.691463    2287.702401    2329.612742
2017-10-15  2422.150179  10.404623    2401.843098    2442.628952
2017-10-16  2591.971807  11.515636    2569.499559    2614.640592
predicted_counts=predictions_summary_frame['mean']
actual_counts = y_test['BB_COUNT']
fig = plt.figure(figsize=(12,6))
fig.suptitle('Predicted versus actual bicyclist counts on the Brooklyn bridge')
predicted, = plt.plot(X_test.index, predicted_counts, 'go-', label='Predicted counts')
actual, = plt.plot(X_test.index, actual_counts, 'ro-', label='Actual counts')
plt.legend(handles=[predicted, actual])
plt.show()

plt.clf()
fig = plt.figure()
fig.suptitle('Scatter plot of Actual versus Predicted counts')
plt.scatter(x=predicted_counts, y=actual_counts, marker='.')
plt.xlabel('Predicted counts')
plt.ylabel('Actual counts')
plt.show()
<Figure size 672x480 with 0 Axes>

Los valores informados de desviación y chi-cuadrado de Pearson son muy grandes. Un buen ajuste es virtualmente imposible dados estos valores. Para hacer una determinación cuantitativa de la bondad del ajuste en algún nivel de confianza, digamos 95\% (p = 0.05), buscamos el valor en la tabla \chi^2 para p = 0.05 y grados de libertad de residuos = 163 (Residuos DF = No. Observaciones menos modelo DF). Comparamos este valor de chi-cuadrado con la estadística observada, en este caso, la desviación o el valor de chi-cuadrado de Pearson informado en GLMResults. Encontramos que en p = 0.05 y Residuales DF = 163, el valor de chi-cuadrado de la tabla estándar es 193.791, que es mucho más pequeño que el estadístico informado de 23030 y 23300. Por lo tanto, según esta prueba, el modelo de regresión de Poisson , a pesar de demostrar un ajuste visual “aceptable” para el conjunto de datos de prueba, se ha ajustado bastante mal a los datos de entrenamiento.

#Build Famoye's Restricted Generalized Poison regression model, know as GP-2
gen_poisson_gp2 = sm.GeneralizedPoisson(y_train, X_train, p=2)
#Fit the model
gen_poisson_gp2_results = gen_poisson_gp2.fit()
Optimization terminated successfully.
         Current function value: 8.205209
         Iterations: 24
         Function evaluations: 34
         Gradient evaluations: 34
#print the results
print(gen_poisson_gp2_results.summary())
                    GeneralizedPoisson Regression Results                     
==============================================================================
Dep. Variable:               BB_COUNT   No. Observations:                  177
Model:             GeneralizedPoisson   Df Residuals:                      170
Method:                           MLE   Df Model:                            6
Date:                Tue, 30 Jul 2024   Pseudo R-squ.:                 0.04604
Time:                        16:43:59   Log-Likelihood:                -1452.3
converged:                       True   LL-Null:                       -1522.4
Covariance Type:            nonrobust   LLR p-value:                 9.187e-28
===============================================================================
                  coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       6.5322      0.195     33.417      0.000       6.149       6.915
DAY            -0.0025      0.003     -0.839      0.401      -0.008       0.003
DAY_OF_WEEK    -0.0204      0.014     -1.505      0.132      -0.047       0.006
MONTH           0.0084      0.013      0.640      0.522      -0.017       0.034
HIGH_T          0.0406      0.006      6.743      0.000       0.029       0.052
LOW_T          -0.0254      0.007     -3.706      0.000      -0.039      -0.012
PRECIP         -0.6048      0.037    -16.424      0.000      -0.677      -0.533
alpha           0.0065      0.000     16.366      0.000       0.006       0.007
===============================================================================