import pandas as pd
import numpy as np
import statsmodels.formula as sm
import pyreadstat
import statsmodels.api as sm
from statsmodels.formula.api import glm
data= pd.read_spss('poisson data2.sav' )
data.head()
0
12.0
yes
identified as male
21.0
3.0
0.0
1
7.0
no
identified as female
14.0
4.0
0.0
2
10.0
yes
identified as female
6.0
6.0
0.0
3
13.0
yes
identified as male
19.0
5.0
1.0
4
8.0
no
identified as male
10.0
7.0
1.0
salary float64
manager category
genderid category
worksatisf float64
stress float64
numb.absent float64
dtype: object
Index(['salary', 'manager', 'genderid', 'worksatisf', 'stress', 'numb.absent'], dtype='object')
0
12.0
yes
identified as male
21.0
3.0
0.0
1
7.0
no
identified as female
14.0
4.0
0.0
2
10.0
yes
identified as female
6.0
6.0
0.0
3
13.0
yes
identified as male
19.0
5.0
1.0
4
8.0
no
identified as male
10.0
7.0
1.0
dat2 = data.rename(columns= {'numb.absent' :'numabs' })
dat2.columns
Index(['salary', 'manager', 'genderid', 'worksatisf', 'stress', 'numabs'], dtype='object')
0
12.0
yes
identified as male
21.0
3.0
0.0
1
7.0
no
identified as female
14.0
4.0
0.0
2
10.0
yes
identified as female
6.0
6.0
0.0
3
13.0
yes
identified as male
19.0
5.0
1.0
4
8.0
no
identified as male
10.0
7.0
1.0
model = glm('numabs~salary+ manager+genderid+worksatisf+stress' ,
data = dat2,family = sm.families.Poisson()).fit()
model.summary()
Generalized Linear Model Regression Results
Dep. Variable:
numabs
No. Observations:
50
Model:
GLM
Df Residuals:
44
Model Family:
Poisson
Df Model:
5
Link Function:
Log
Scale:
1.0000
Method:
IRLS
Log-Likelihood:
-80.568
Date:
Tue, 30 Jul 2024
Deviance:
16.800
Time:
16:37:14
Pearson chi2:
13.1
No. Iterations:
5
Pseudo R-squ. (CS):
0.6609
Covariance Type:
nonrobust
coef
std err
z
P>|z|
[0.025
0.975]
Intercept
1.6127
0.604
2.671
0.008
0.429
2.796
manager[T.yes]
-0.1469
0.174
-0.846
0.397
-0.487
0.193
genderid[T.identified as male]
-0.1364
0.159
-0.857
0.391
-0.448
0.175
salary
-0.0749
0.036
-2.110
0.035
-0.144
-0.005
worksatisf
-0.0615
0.032
-1.908
0.056
-0.125
0.002
stress
0.0606
0.025
2.420
0.016
0.012
0.110
model.params #Parámetros con el modelo Transformado (Log odds)
Intercept 1.612667
manager[T.yes] -0.146893
genderid[T.identified as male] -0.136422
salary -0.074906
worksatisf -0.061479
stress 0.060609
dtype: float64
numabs=Intercept*(1.612667)+manager[T.yes]*( -0.146893)+genderid[T.identified as male]*(-0.136422)+salary*(-0.074906)+worksatisf*(-0.061479)+stress*(0.060609)
np.float64(13.05685865233813)
np.exp(model.params) # Parámetros con el modelo original (Exponencial)
Intercept 5.016171
manager[T.yes] 0.863387
genderid[T.identified as male] 0.872475
salary 0.927831
worksatisf 0.940373
stress 1.062483
dtype: float64
Si los valores son mayores a 1, entonces el parámetro influye en el modelo de forma positiva
Si los valores son menores a 1, entonces el parámetro influye en el modelo de forma negativa
0 1
Intercept 0.429321 2.796012
manager[T.yes] -0.487037 0.193251
genderid[T.identified as male] -0.448255 0.175411
salary -0.144491 -0.005321
worksatisf -0.124638 0.001679
stress 0.011517 0.109700
print (np.exp(model.conf_int()))
0 1
Intercept 1.536215 16.379200
manager[T.yes] 0.614445 1.213188
genderid[T.identified as male] 0.638742 1.191736
salary 0.865463 0.994693
worksatisf 0.882817 1.001681
stress 1.011583 1.115944
Si estamos con los parámetros en el modelo transfromado, entonces CERO no debe pertenecer al intervalo de confianza para que sea significativo.
Si estamos con los parámetros en el modelo original, entonces UNO no debe pertenecer al intervalo de confianza para que sea significativo.
np.float64(4.661632653061225)
SOBREDISPERSIÓN. Criterio es utilizado cuando la varianza excede de forma significativa a la media de la variable respuesta.
El cociente de \dfrac{var}{mean} si es mayor a 1.5 entonces existe sobredispersión.
dat2['numabs' ].var()/ dat2['numabs' ].mean()
np.float64(1.316845382220685)
ratio = model.pearson_chi2 / model.df_resid
print (ratio)
Si el ratio>1.5 entonces ejecutar el modelo con la Regresión Binomial Negativa (o aplicar una aproximación Cuasi Poisson)
Si el ratio<1.5 entonces ejecutar el modelo con la Regresión Poisson
0 0.507249
1 1.600078
2 2.037030
3 0.600806
4 1.986685
5 1.200863
6 1.375063
7 0.850054
8 1.951869
9 1.322531
10 1.813072
11 2.229306
12 1.999764
13 2.186267
14 1.899083
15 1.932434
16 2.120757
17 2.068343
18 2.400528
19 2.923211
20 3.223814
21 2.285715
22 1.839248
23 2.404711
24 2.788593
25 2.845757
26 3.918782
27 3.947136
28 4.832340
29 6.174411
30 4.450721
31 3.130210
32 4.567712
33 4.230692
34 4.941874
35 5.430919
36 4.734390
37 6.213674
38 4.450558
39 5.488084
40 4.498953
41 4.985205
42 6.554499
43 6.388118
44 4.867482
45 5.652196
46 4.870044
47 5.497849
48 9.910157
49 6.871164
dtype: float64
0
12.0
yes
identified as male
21.0
3.0
0.0
1
7.0
no
identified as female
14.0
4.0
0.0
2
10.0
yes
identified as female
6.0
6.0
0.0
3
13.0
yes
identified as male
19.0
5.0
1.0
4
8.0
no
identified as male
10.0
7.0
1.0
nueva_data= {'salary' :12 ,
'manager' :'yes' ,
'genderid' :'identified as male' ,
'worksatisf' :21 ,
'stress' :3
}
nueva_data= pd.DataFrame(nueva_data,index= ['Gerson' ])
nueva_data
Gerson
12
yes
identified as male
21
3
model.predict(nueva_data)
Gerson 0.507249
dtype: float64
Prueba del Modelo
nueva_data2= {'salary' :9 ,
'manager' :'yes' ,
'genderid' :'identified as male' ,
'worksatisf' :15 ,
'stress' :21
}
nueva_data2= pd.DataFrame(nueva_data2,index= ['A' ])
nueva_data2
A
9
yes
identified as male
15
21
model.predict(nueva_data2)
A 2.734078
dtype: float64