CLASE 11. Regresión de Poisson con statsmodels en Python

Autor/a

Gerson Rivera

Fecha de publicación

30 julio 2024

#!pip install pyreadstat
import pandas as pd
import numpy as np
import statsmodels.formula as sm
import pyreadstat
import statsmodels.api as sm
from statsmodels.formula.api import glm
data=pd.read_spss('poisson data2.sav')
data.head()
salary manager genderid worksatisf stress numb.absent
0 12.0 yes identified as male 21.0 3.0 0.0
1 7.0 no identified as female 14.0 4.0 0.0
2 10.0 yes identified as female 6.0 6.0 0.0
3 13.0 yes identified as male 19.0 5.0 1.0
4 8.0 no identified as male 10.0 7.0 1.0
data.dtypes
salary          float64
manager        category
genderid       category
worksatisf      float64
stress          float64
numb.absent     float64
dtype: object
data.columns
Index(['salary', 'manager', 'genderid', 'worksatisf', 'stress', 'numb.absent'], dtype='object')
data.head()
salary manager genderid worksatisf stress numb.absent
0 12.0 yes identified as male 21.0 3.0 0.0
1 7.0 no identified as female 14.0 4.0 0.0
2 10.0 yes identified as female 6.0 6.0 0.0
3 13.0 yes identified as male 19.0 5.0 1.0
4 8.0 no identified as male 10.0 7.0 1.0
dat2 = data.rename(columns={'numb.absent':'numabs'})
dat2.columns
Index(['salary', 'manager', 'genderid', 'worksatisf', 'stress', 'numabs'], dtype='object')
dat2.head()
salary manager genderid worksatisf stress numabs
0 12.0 yes identified as male 21.0 3.0 0.0
1 7.0 no identified as female 14.0 4.0 0.0
2 10.0 yes identified as female 6.0 6.0 0.0
3 13.0 yes identified as male 19.0 5.0 1.0
4 8.0 no identified as male 10.0 7.0 1.0
dat2.shape
(50, 6)
model = glm('numabs~salary+ manager+genderid+worksatisf+stress',
            data = dat2,family = sm.families.Poisson()).fit()
model.summary()
Generalized Linear Model Regression Results
Dep. Variable: numabs No. Observations: 50
Model: GLM Df Residuals: 44
Model Family: Poisson Df Model: 5
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -80.568
Date: Tue, 30 Jul 2024 Deviance: 16.800
Time: 16:37:14 Pearson chi2: 13.1
No. Iterations: 5 Pseudo R-squ. (CS): 0.6609
Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept 1.6127 0.604 2.671 0.008 0.429 2.796
manager[T.yes] -0.1469 0.174 -0.846 0.397 -0.487 0.193
genderid[T.identified as male] -0.1364 0.159 -0.857 0.391 -0.448 0.175
salary -0.0749 0.036 -2.110 0.035 -0.144 -0.005
worksatisf -0.0615 0.032 -1.908 0.056 -0.125 0.002
stress 0.0606 0.025 2.420 0.016 0.012 0.110
model.params #Parámetros con el modelo Transformado (Log odds)
Intercept                         1.612667
manager[T.yes]                   -0.146893
genderid[T.identified as male]   -0.136422
salary                           -0.074906
worksatisf                       -0.061479
stress                            0.060609
dtype: float64

numabs=Intercept*(1.612667)+manager[T.yes]*( -0.146893)+genderid[T.identified as male]*(-0.136422)+salary*(-0.074906)+worksatisf*(-0.061479)+stress*(0.060609)

model.pearson_chi2
np.float64(13.05685865233813)
np.exp(model.params) # Parámetros con el modelo original (Exponencial)
Intercept                         5.016171
manager[T.yes]                    0.863387
genderid[T.identified as male]    0.872475
salary                            0.927831
worksatisf                        0.940373
stress                            1.062483
dtype: float64
NOTA.
  • Si los valores son mayores a 1, entonces el parámetro influye en el modelo de forma positiva

  • Si los valores son menores a 1, entonces el parámetro influye en el modelo de forma negativa
print(model.conf_int())
                                       0         1
Intercept                       0.429321  2.796012
manager[T.yes]                 -0.487037  0.193251
genderid[T.identified as male] -0.448255  0.175411
salary                         -0.144491 -0.005321
worksatisf                     -0.124638  0.001679
stress                          0.011517  0.109700
print(np.exp(model.conf_int()))
                                       0          1
Intercept                       1.536215  16.379200
manager[T.yes]                  0.614445   1.213188
genderid[T.identified as male]  0.638742   1.191736
salary                          0.865463   0.994693
worksatisf                      0.882817   1.001681
stress                          1.011583   1.115944
NOTA.
  • Si estamos con los parámetros en el modelo transfromado, entonces CERO no debe pertenecer al intervalo de confianza para que sea significativo.

  • Si estamos con los parámetros en el modelo original, entonces UNO no debe pertenecer al intervalo de confianza para que sea significativo.
dat2['numabs'].mean()
np.float64(3.54)
dat2['numabs'].var()
np.float64(4.661632653061225)
dat2['numabs'].var()/dat2['numabs'].mean()
np.float64(1.316845382220685)
ratio = model.pearson_chi2 / model.df_resid
print(ratio)
0.2967467875531393
NOTA.
  • Si el ratio>1.5 entonces ejecutar el modelo con la Regresión Binomial Negativa (o aplicar una aproximación Cuasi Poisson)

  • Si el ratio<1.5 entonces ejecutar el modelo con la Regresión Poisson
model.fittedvalues
0     0.507249
1     1.600078
2     2.037030
3     0.600806
4     1.986685
5     1.200863
6     1.375063
7     0.850054
8     1.951869
9     1.322531
10    1.813072
11    2.229306
12    1.999764
13    2.186267
14    1.899083
15    1.932434
16    2.120757
17    2.068343
18    2.400528
19    2.923211
20    3.223814
21    2.285715
22    1.839248
23    2.404711
24    2.788593
25    2.845757
26    3.918782
27    3.947136
28    4.832340
29    6.174411
30    4.450721
31    3.130210
32    4.567712
33    4.230692
34    4.941874
35    5.430919
36    4.734390
37    6.213674
38    4.450558
39    5.488084
40    4.498953
41    4.985205
42    6.554499
43    6.388118
44    4.867482
45    5.652196
46    4.870044
47    5.497849
48    9.910157
49    6.871164
dtype: float64
dat2.head()
salary manager genderid worksatisf stress numabs
0 12.0 yes identified as male 21.0 3.0 0.0
1 7.0 no identified as female 14.0 4.0 0.0
2 10.0 yes identified as female 6.0 6.0 0.0
3 13.0 yes identified as male 19.0 5.0 1.0
4 8.0 no identified as male 10.0 7.0 1.0
nueva_data={'salary':12,
            'manager':'yes',
            'genderid':'identified as male',
            'worksatisf':21,
            'stress':3

}
nueva_data=pd.DataFrame(nueva_data,index=['Gerson'])
nueva_data
salary manager genderid worksatisf stress
Gerson 12 yes identified as male 21 3
model.predict(nueva_data)
Gerson    0.507249
dtype: float64

Prueba del Modelo

nueva_data2={'salary':9,
            'manager':'yes',
            'genderid':'identified as male',
            'worksatisf':15,
            'stress':21
}
nueva_data2=pd.DataFrame(nueva_data2,index=['A'])
nueva_data2
salary manager genderid worksatisf stress
A 9 yes identified as male 15 21
model.predict(nueva_data2)
A    2.734078
dtype: float64