CLASS 14. Poisson, Negative Binomial, and Gamma Regression in Python

Author

Gerson Rivera

Publication date

August 7, 2024

A Bit of Gamma Regression

import statsmodels.api as sm

data=sm.datasets.scotland.load_pandas()

data.endog
0     60.3
1     52.3
2     53.4
3     57.0
4     68.7
5     48.8
6     65.5
7     70.5
8     59.1
9     62.7
10    51.6
11    62.0
12    68.4
13    69.2
14    64.7
15    75.0
16    62.1
17    67.2
18    67.7
19    52.7
20    65.7
21    72.2
22    47.4
23    51.3
24    63.6
25    50.7
26    51.6
27    56.2
28    67.6
29    58.9
30    74.7
31    67.3
Name: YES, dtype: float64
data.exog
COUTAX UNEMPF MOR ACT GDP AGE COUTAX_FEMALEUNEMP
0 712.0 21.0 105.0 82.4 13566.0 12.3 14952.0
1 643.0 26.5 97.0 80.2 13566.0 15.3 17039.5
2 679.0 28.3 113.0 86.3 9611.0 13.9 19215.7
3 801.0 27.1 109.0 80.4 9483.0 13.6 21707.1
4 753.0 22.0 115.0 64.7 9265.0 14.6 16566.0
5 714.0 24.3 107.0 79.0 9555.0 13.8 17350.2
6 920.0 21.2 118.0 72.2 9611.0 13.3 19504.0
7 779.0 20.5 114.0 75.2 9483.0 14.5 15969.5
8 771.0 23.2 102.0 81.1 9483.0 14.2 17887.2
9 724.0 20.5 112.0 80.3 12656.0 13.7 14842.0
10 682.0 23.8 96.0 83.0 9483.0 14.6 16231.6
11 837.0 22.1 111.0 74.5 12656.0 11.6 18497.7
12 599.0 19.9 117.0 83.8 8298.0 15.1 11920.1
13 680.0 21.5 121.0 77.6 9265.0 13.7 14620.0
14 747.0 22.5 109.0 77.9 8314.0 14.4 16807.5
15 982.0 19.4 137.0 65.3 9483.0 13.3 19050.8
16 719.0 25.9 109.0 80.9 8298.0 14.9 18622.1
17 831.0 18.5 138.0 80.2 9483.0 14.6 15373.5
18 858.0 19.4 119.0 84.8 12656.0 14.3 16645.2
19 652.0 27.2 108.0 86.4 13566.0 14.6 17734.4
20 718.0 23.7 115.0 73.5 9483.0 15.0 17016.6
21 787.0 20.8 126.0 74.7 9483.0 14.9 16369.6
22 515.0 26.8 106.0 87.8 8298.0 15.3 13802.0
23 732.0 23.0 103.0 86.6 9611.0 13.8 16836.0
24 783.0 20.5 125.0 78.5 9483.0 14.1 16051.5
25 612.0 23.7 100.0 80.6 9033.0 13.3 14504.4
26 486.0 23.2 117.0 84.8 8298.0 15.9 11275.2
27 765.0 23.6 105.0 79.2 9483.0 13.7 18054.0
28 793.0 21.7 125.0 78.4 9483.0 14.5 17208.1
29 776.0 23.0 110.0 77.2 9265.0 13.6 17848.0
30 978.0 19.3 130.0 71.5 9483.0 15.3 18875.4
31 792.0 21.2 126.0 82.2 12656.0 15.1 16790.4
data.endog_name
'YES'
data.exog_name
['COUTAX', 'UNEMPF', 'MOR', 'ACT', 'GDP', 'AGE', 'COUTAX_FEMALEUNEMP']

Load modules and data

import statsmodels.api as sm

data = sm.datasets.scotland.load_pandas()

data.exog = sm.add_constant(data.exog)

data

data.data.to_csv('data.csv')

data.data
YES COUTAX UNEMPF MOR ACT GDP AGE COUTAX_FEMALEUNEMP
0 60.3 712.0 21.0 105.0 82.4 13566.0 12.3 14952.0
1 52.3 643.0 26.5 97.0 80.2 13566.0 15.3 17039.5
2 53.4 679.0 28.3 113.0 86.3 9611.0 13.9 19215.7
3 57.0 801.0 27.1 109.0 80.4 9483.0 13.6 21707.1
4 68.7 753.0 22.0 115.0 64.7 9265.0 14.6 16566.0
5 48.8 714.0 24.3 107.0 79.0 9555.0 13.8 17350.2
6 65.5 920.0 21.2 118.0 72.2 9611.0 13.3 19504.0
7 70.5 779.0 20.5 114.0 75.2 9483.0 14.5 15969.5
8 59.1 771.0 23.2 102.0 81.1 9483.0 14.2 17887.2
9 62.7 724.0 20.5 112.0 80.3 12656.0 13.7 14842.0
10 51.6 682.0 23.8 96.0 83.0 9483.0 14.6 16231.6
11 62.0 837.0 22.1 111.0 74.5 12656.0 11.6 18497.7
12 68.4 599.0 19.9 117.0 83.8 8298.0 15.1 11920.1
13 69.2 680.0 21.5 121.0 77.6 9265.0 13.7 14620.0
14 64.7 747.0 22.5 109.0 77.9 8314.0 14.4 16807.5
15 75.0 982.0 19.4 137.0 65.3 9483.0 13.3 19050.8
16 62.1 719.0 25.9 109.0 80.9 8298.0 14.9 18622.1
17 67.2 831.0 18.5 138.0 80.2 9483.0 14.6 15373.5
18 67.7 858.0 19.4 119.0 84.8 12656.0 14.3 16645.2
19 52.7 652.0 27.2 108.0 86.4 13566.0 14.6 17734.4
20 65.7 718.0 23.7 115.0 73.5 9483.0 15.0 17016.6
21 72.2 787.0 20.8 126.0 74.7 9483.0 14.9 16369.6
22 47.4 515.0 26.8 106.0 87.8 8298.0 15.3 13802.0
23 51.3 732.0 23.0 103.0 86.6 9611.0 13.8 16836.0
24 63.6 783.0 20.5 125.0 78.5 9483.0 14.1 16051.5
25 50.7 612.0 23.7 100.0 80.6 9033.0 13.3 14504.4
26 51.6 486.0 23.2 117.0 84.8 8298.0 15.9 11275.2
27 56.2 765.0 23.6 105.0 79.2 9483.0 13.7 18054.0
28 67.6 793.0 21.7 125.0 78.4 9483.0 14.5 17208.1
29 58.9 776.0 23.0 110.0 77.2 9265.0 13.6 17848.0
30 74.7 978.0 19.3 130.0 71.5 9483.0 15.3 18875.4
31 67.3 792.0 21.2 126.0 82.2 12656.0 15.1 16790.4
dataend=data.data.iloc[:,0]
dataend
0     60.3
1     52.3
2     53.4
3     57.0
4     68.7
5     48.8
6     65.5
7     70.5
8     59.1
9     62.7
10    51.6
11    62.0
12    68.4
13    69.2
14    64.7
15    75.0
16    62.1
17    67.2
18    67.7
19    52.7
20    65.7
21    72.2
22    47.4
23    51.3
24    63.6
25    50.7
26    51.6
27    56.2
28    67.6
29    58.9
30    74.7
31    67.3
Name: YES, dtype: float64
data1=data.exog.loc[:,['COUTAX','UNEMPF','MOR','ACT','GDP']]
data1.columns=['COUTAX','UNEMPF','MOR','ACT','GDP']

data1.columns
Index(['COUTAX', 'UNEMPF', 'MOR', 'ACT', 'GDP'], dtype='object')

Gamma Regression in Python

import pandas as pd

# The export step earlier wrote 'data.csv'; this assumes that file was saved as 'data Gamma.csv'
dat=pd.read_csv('data Gamma.csv')

dat.head()
Unnamed: 0 YES COUTAX UNEMPF MOR ACT GDP AGE COUTAX_FEMALEUNEMP
0 0 60.3 712.0 21.0 105.0 82.4 13566.0 12.3 14952.0
1 1 52.3 643.0 26.5 97.0 80.2 13566.0 15.3 17039.5
2 2 53.4 679.0 28.3 113.0 86.3 9611.0 13.9 19215.7
3 3 57.0 801.0 27.1 109.0 80.4 9483.0 13.6 21707.1
4 4 68.7 753.0 22.0 115.0 64.7 9265.0 14.6 16566.0
dat.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Unnamed: 0          32 non-null     int64  
 1   YES                 32 non-null     float64
 2   COUTAX              32 non-null     float64
 3   UNEMPF              32 non-null     float64
 4   MOR                 32 non-null     float64
 5   ACT                 32 non-null     float64
 6   GDP                 32 non-null     float64
 7   AGE                 32 non-null     float64
 8   COUTAX_FEMALEUNEMP  32 non-null     float64
dtypes: float64(8), int64(1)
memory usage: 2.4 KB
dat.dtypes
Unnamed: 0              int64
YES                   float64
COUTAX                float64
UNEMPF                float64
MOR                   float64
ACT                   float64
GDP                   float64
AGE                   float64
COUTAX_FEMALEUNEMP    float64
dtype: object
dat_endog=dat['YES']
dat_endog
0     60.3
1     52.3
2     53.4
3     57.0
4     68.7
5     48.8
6     65.5
7     70.5
8     59.1
9     62.7
10    51.6
11    62.0
12    68.4
13    69.2
14    64.7
15    75.0
16    62.1
17    67.2
18    67.7
19    52.7
20    65.7
21    72.2
22    47.4
23    51.3
24    63.6
25    50.7
26    51.6
27    56.2
28    67.6
29    58.9
30    74.7
31    67.3
Name: YES, dtype: float64
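# 'Unnamed: 0' is the row index written by to_csv, not a predictor;
# it is still present in dat_exog and should be dropped before fitting a model
dat_exog=dat.drop(columns='YES')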

dat_exog
Unnamed: 0 COUTAX UNEMPF MOR ACT GDP AGE COUTAX_FEMALEUNEMP
0 0 712.0 21.0 105.0 82.4 13566.0 12.3 14952.0
1 1 643.0 26.5 97.0 80.2 13566.0 15.3 17039.5
2 2 679.0 28.3 113.0 86.3 9611.0 13.9 19215.7
3 3 801.0 27.1 109.0 80.4 9483.0 13.6 21707.1
4 4 753.0 22.0 115.0 64.7 9265.0 14.6 16566.0
5 5 714.0 24.3 107.0 79.0 9555.0 13.8 17350.2
6 6 920.0 21.2 118.0 72.2 9611.0 13.3 19504.0
7 7 779.0 20.5 114.0 75.2 9483.0 14.5 15969.5
8 8 771.0 23.2 102.0 81.1 9483.0 14.2 17887.2
9 9 724.0 20.5 112.0 80.3 12656.0 13.7 14842.0
10 10 682.0 23.8 96.0 83.0 9483.0 14.6 16231.6
11 11 837.0 22.1 111.0 74.5 12656.0 11.6 18497.7
12 12 599.0 19.9 117.0 83.8 8298.0 15.1 11920.1
13 13 680.0 21.5 121.0 77.6 9265.0 13.7 14620.0
14 14 747.0 22.5 109.0 77.9 8314.0 14.4 16807.5
15 15 982.0 19.4 137.0 65.3 9483.0 13.3 19050.8
16 16 719.0 25.9 109.0 80.9 8298.0 14.9 18622.1
17 17 831.0 18.5 138.0 80.2 9483.0 14.6 15373.5
18 18 858.0 19.4 119.0 84.8 12656.0 14.3 16645.2
19 19 652.0 27.2 108.0 86.4 13566.0 14.6 17734.4
20 20 718.0 23.7 115.0 73.5 9483.0 15.0 17016.6
21 21 787.0 20.8 126.0 74.7 9483.0 14.9 16369.6
22 22 515.0 26.8 106.0 87.8 8298.0 15.3 13802.0
23 23 732.0 23.0 103.0 86.6 9611.0 13.8 16836.0
24 24 783.0 20.5 125.0 78.5 9483.0 14.1 16051.5
25 25 612.0 23.7 100.0 80.6 9033.0 13.3 14504.4
26 26 486.0 23.2 117.0 84.8 8298.0 15.9 11275.2
27 27 765.0 23.6 105.0 79.2 9483.0 13.7 18054.0
28 28 793.0 21.7 125.0 78.4 9483.0 14.5 17208.1
29 29 776.0 23.0 110.0 77.2 9265.0 13.6 17848.0
30 30 978.0 19.3 130.0 71.5 9483.0 15.3 18875.4
31 31 792.0 21.2 126.0 82.2 12656.0 15.1 16790.4
dat_exog=sm.add_constant(dat_exog)
dat_exog.head(10)
const Unnamed: 0 COUTAX UNEMPF MOR ACT GDP AGE COUTAX_FEMALEUNEMP
0 1.0 0 712.0 21.0 105.0 82.4 13566.0 12.3 14952.0
1 1.0 1 643.0 26.5 97.0 80.2 13566.0 15.3 17039.5
2 1.0 2 679.0 28.3 113.0 86.3 9611.0 13.9 19215.7
3 1.0 3 801.0 27.1 109.0 80.4 9483.0 13.6 21707.1
4 1.0 4 753.0 22.0 115.0 64.7 9265.0 14.6 16566.0
5 1.0 5 714.0 24.3 107.0 79.0 9555.0 13.8 17350.2
6 1.0 6 920.0 21.2 118.0 72.2 9611.0 13.3 19504.0
7 1.0 7 779.0 20.5 114.0 75.2 9483.0 14.5 15969.5
8 1.0 8 771.0 23.2 102.0 81.1 9483.0 14.2 17887.2
9 1.0 9 724.0 20.5 112.0 80.3 12656.0 13.7 14842.0
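The 'Unnamed: 0' column in dat_exog is the row index that to_csv wrote to the file, and it would enter a model as if it were a predictor. The sketch below uses a small made-up frame and an in-memory buffer (both assumptions, standing in for the real CSV file) to show where the column comes from and two ways to avoid it.

```python
import io
import pandas as pd

def roundtrip(frame, write_kwargs=None, read_kwargs=None):
    """Write a frame to an in-memory CSV and read it back."""
    buf = io.StringIO()
    frame.to_csv(buf, **(write_kwargs or {}))
    buf.seek(0)
    return pd.read_csv(buf, **(read_kwargs or {}))

# Tiny illustrative frame (first three rows of two columns).
toy = pd.DataFrame({"YES": [60.3, 52.3, 53.4],
                    "COUTAX": [712.0, 643.0, 679.0]})

# Default behaviour: the index is saved and comes back as 'Unnamed: 0'.
with_index = roundtrip(toy)
print(list(with_index.columns))

# Option 1: do not write the index at all.
clean_a = roundtrip(toy, write_kwargs={"index": False})

# Option 2: keep the file as is, but read the first column as the index.
clean_b = roundtrip(toy, read_kwargs={"index_col": 0})

print(list(clean_a.columns), list(clean_b.columns))
```

For a file that already exists, dat.drop(columns=['YES', 'Unnamed: 0']) has the same effect on the predictor matrix.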

Poisson Regression in Python

#!pip install pyreadstat
import pandas as pd
import numpy as np
import statsmodels.api as sm
import pyreadstat
df=pd.read_spss('poisson data2.sav')
df.head()
   salary  manager              genderid  worksatisf  stress  numb.absent
0    12.0      yes    identified as male        21.0     3.0          0.0
1     7.0       no  identified as female        14.0     4.0          0.0
2    10.0      yes  identified as female         6.0     6.0          0.0
3    13.0      yes    identified as male        19.0     5.0          1.0
4     8.0       no    identified as male        10.0     7.0          1.0
df2=df.rename(columns={'numb.absent':'numabsent'})

df2.dtypes
salary         float64
manager       category
genderid      category
worksatisf     float64
stress         float64
numabsent      float64
dtype: object
f = """numabsent ~ salary+C(manager)+C(genderid)+
       worksatisf+stress"""

from patsy import dmatrices

respuesta, predictores = dmatrices(f, df2, return_type='dataframe')

respuesta.head()
numabsent
0 0.0
1 0.0
2 0.0
3 1.0
4 1.0
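The formula passed to dmatrices builds both matrices in one step: it adds an intercept and expands each C(...) term into dummy columns against a reference level. A minimal sketch on a made-up four-row frame (the values are illustrative, not taken from the SPSS file) shows what the predictor matrix looks like.

```python
import pandas as pd
from patsy import dmatrices

# Hypothetical toy data with one numeric and one categorical predictor.
toy = pd.DataFrame({
    "numabsent": [0.0, 1.0, 2.0, 0.0],
    "salary": [12.0, 7.0, 10.0, 13.0],
    "manager": pd.Categorical(["yes", "no", "yes", "no"]),
})

# Left of '~' is the response, right of it are the predictors.
y, X = dmatrices("numabsent ~ salary + C(manager)", toy, return_type="dataframe")

print(X.columns.tolist())
print(X)
```

The level "no" becomes the reference category, so the column C(manager)[T.yes] is 1 for managers and 0 otherwise. This is the same naming that appears in the model summaries.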
pois_results = sm.GLM(respuesta, predictores,
                      family=sm.families.Poisson()).fit()

pois_results.aic
np.float64(173.13598143045402)
pois_results.summary()
Generalized Linear Model Regression Results

Dep. Variable:     numabsent           No. Observations:     50
Model:             GLM                 Df Residuals:         44
Model Family:      Poisson             Df Model:             5
Link Function:     Log                 Scale:                1.0000
Method:            IRLS                Log-Likelihood:       -80.568
Date:              Wed, 07 Aug 2024    Deviance:             16.800
Time:              18:17:03            Pearson chi2:         13.1
No. Iterations:    5                   Pseudo R-squ. (CS):   0.6609
Covariance Type:   nonrobust

                                       coef  std err        z    P>|z|   [0.025   0.975]
Intercept                            1.6127    0.604    2.671    0.008    0.429    2.796
C(manager)[T.yes]                   -0.1469    0.174   -0.846    0.397   -0.487    0.193
C(genderid)[T.identified as male]   -0.1364    0.159   -0.857    0.391   -0.448    0.175
salary                              -0.0749    0.036   -2.110    0.035   -0.144   -0.005
worksatisf                          -0.0615    0.032   -1.908    0.056   -0.125    0.002
stress                               0.0606    0.025    2.420    0.016    0.012    0.110
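Because the model uses a log link, the coefficients are on the log scale of the expected number of absences. Exponentiating them gives rate ratios, which are easier to read. The sketch below copies the coefficients from the summary and converts them; the dictionary is only a convenience for this example.

```python
import math

# Coefficients copied from the Poisson summary (log scale).
coefs = {
    "C(manager)[T.yes]": -0.1469,
    "C(genderid)[T.identified as male]": -0.1364,
    "salary": -0.0749,
    "worksatisf": -0.0615,
    "stress": 0.0606,
}

# exp(coef) is the multiplicative change in the expected count
# for a one-unit increase in the predictor, holding the others fixed.
irr = {name: math.exp(b) for name, b in coefs.items()}

for name, ratio in irr.items():
    print(f"{name}: rate ratio = {ratio:.3f} ({(ratio - 1) * 100:+.1f}% per unit)")
```

For example, one extra unit of stress multiplies the expected number of absences by about 1.06, while one extra unit of salary multiplies it by about 0.93. With the fitted model in memory, np.exp(pois_results.params) gives the same quantities without copying numbers by hand.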
# alpha is the NB2 dispersion parameter; GLM treats it as fixed, it is not estimated here
nb_results = sm.GLM(respuesta, predictores,
                    family=sm.families.NegativeBinomial(alpha=0.20213936671179472)).fit()

nb_results.summary()
Generalized Linear Model Regression Results

Dep. Variable:     numabsent           No. Observations:     50
Model:             GLM                 Df Residuals:         44
Model Family:      NegativeBinomial    Df Model:             5
Link Function:     Log                 Scale:                1.0000
Method:            IRLS                Log-Likelihood:       -91.011
Date:              Wed, 07 Aug 2024    Deviance:             11.620
Time:              18:17:03            Pearson chi2:         8.10
No. Iterations:    6                   Pseudo R-squ. (CS):   0.4826
Covariance Type:   nonrobust

                                       coef  std err        z    P>|z|   [0.025   0.975]
Intercept                            1.6125    0.790    2.041    0.041    0.064    3.161
C(manager)[T.yes]                   -0.1229    0.232   -0.530    0.596   -0.577    0.332
C(genderid)[T.identified as male]   -0.1213    0.215   -0.564    0.573   -0.543    0.300
salary                              -0.0840    0.046   -1.819    0.069   -0.174    0.006
worksatisf                          -0.0604    0.042   -1.437    0.151   -0.143    0.022
stress                               0.0622    0.033    1.888    0.059   -0.002    0.127
nb_results.aic
np.float64(194.0226456547587)
nb_results.pearson_chi2/nb_results.df_resid
np.float64(0.18414903796550222)
nb_results.df_resid
np.int64(44)
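The two fits can be compared using only the numbers printed in their summaries. The sketch below recomputes the AIC as 2k - 2·logL, with k = 6 estimated coefficients, and the Pearson dispersion ratio (Pearson chi2 divided by residual degrees of freedom) for both models. Values are copied from the summaries, so small rounding differences from the stored results are expected.

```python
# Values copied from the Poisson and negative binomial summaries.
n_params = 6          # intercept plus five slopes
df_resid = 44

fits = {
    "Poisson": {"llf": -80.568, "pearson_chi2": 13.1},
    "NegativeBinomial": {"llf": -91.011, "pearson_chi2": 8.10},
}

for name, fit in fits.items():
    # AIC = 2k - 2 * log-likelihood; lower is better.
    fit["aic"] = 2 * n_params - 2 * fit["llf"]
    # A ratio well above 1 would indicate overdispersion.
    fit["dispersion"] = fit["pearson_chi2"] / df_resid
    print(f"{name}: AIC = {fit['aic']:.3f}, "
          f"Pearson chi2 / df = {fit['dispersion']:.3f}")
```

Both dispersion ratios are well below 1, so these data show no sign of overdispersion, and the Poisson model has the lower AIC. On this evidence the negative binomial model does not improve on the Poisson fit.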