Abstract
Class exercise using Longley data with R and Python - source code from Kuan-Pin Lin (http://web.pdx.edu/~crkl/ceR/).This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
License: CC BY-SA 4.0
Sugestão de citação: FIGUEIREDO, Adriano Marcos Rodrigues. Econometria com R e Python: exemplo com Longley Data. Campo Grande-MS,Brasil: RStudio/Rpubs, 2021. Disponível em http://rpubs.com/amrofi/r_python_longley.
Usarei os dados de Longley. Usarei a rotina de http://web.pdx.edu/~crkl/ceR/, baseada em Kuan-Pin Lin e WISER-Club do site <Econometric Analysis by Examples>.
O código foi executado em RMarkdown por meio do RStudio. O RStudio é bastante amigável para executar chunks em R e em Python, além de muitas outras linguagens. O usuário poderá verificar exatamente o código baixando no link <code> no alto da página. Basicamente a instrução do chunk é alterada de {r} para {python} no início do chunk e o RStudio automaticamente entenderá a necessidade de chamar o pacote reticulate
para integrar as linguagens.
Atentar que se existir a necessidade de instalar pacotes e módulos (libraries) de R e Python, o usuário deve realizar previamente, ou aparecerão mensagens de erro na execução. A sugestão é, se estiver em RStudio nativo, executar pelo Terminal do RStudio. Se estiver pelo Anaconda, instalar no environment em uso.
data <- read.table("http://web.pdx.edu/~crkl/ceR/data/longley.txt", header = T, nrows = 16)
PGNP <- data[, 2]
GNP <- data[, 3]/1000
POPU <- data[, 6]/1000
EM <- data[, 7]/1000
x = cbind(PGNP, GNP, POPU, EM)
x
PGNP GNP POPU EM
[1,] 83.0 234.289 107.608 60.323
[2,] 88.5 259.426 108.632 61.122
[3,] 88.2 258.054 109.773 60.171
[4,] 89.5 284.599 110.929 61.187
[5,] 96.2 328.975 112.075 63.221
[6,] 98.1 346.999 113.270 63.639
[7,] 99.0 365.385 115.094 64.989
[8,] 100.0 363.112 116.219 63.761
[9,] 101.2 397.469 117.388 66.019
[10,] 104.6 419.180 118.734 67.857
[11,] 108.4 442.769 120.445 68.169
[12,] 110.8 444.546 121.950 66.513
[13,] 112.6 482.704 123.366 68.655
[14,] 114.2 502.601 125.368 69.564
[15,] 115.7 518.173 127.852 69.331
[16,] 116.9 554.894 130.081 70.551
rm(list = ls())
# install package when first use
# pip install pandas
import numpy as np
import pandas as pd
# data=pd.read_table('http://web.pdx.edu/~crkl/ceR/data/longley.txt',nrows=16)
data=pd.read_csv('http://web.pdx.edu/~crkl/ceR/data/longley.txt',sep='\s+',nrows=16)
data.head()
YEAR PGNP GNP UEM AF POP EM
0 1947 83.0 234289 2356 1590 107608 60323
1 1948 88.5 259426 2325 1456 108632 61122
2 1949 88.2 258054 3682 1616 109773 60171
3 1950 89.5 284599 3351 1650 110929 61187
4 1951 96.2 328975 2099 3099 112075 63221
data.columns
Index(['YEAR', 'PGNP', 'GNP', 'UEM', 'AF', 'POP', 'EM'], dtype='object')
data.values
array([[1.94700e+03, 8.30000e+01, 2.34289e+05, 2.35600e+03, 1.59000e+03,
1.07608e+05, 6.03230e+04],
[1.94800e+03, 8.85000e+01, 2.59426e+05, 2.32500e+03, 1.45600e+03,
1.08632e+05, 6.11220e+04],
[1.94900e+03, 8.82000e+01, 2.58054e+05, 3.68200e+03, 1.61600e+03,
1.09773e+05, 6.01710e+04],
[1.95000e+03, 8.95000e+01, 2.84599e+05, 3.35100e+03, 1.65000e+03,
1.10929e+05, 6.11870e+04],
[1.95100e+03, 9.62000e+01, 3.28975e+05, 2.09900e+03, 3.09900e+03,
1.12075e+05, 6.32210e+04],
[1.95200e+03, 9.81000e+01, 3.46999e+05, 1.93200e+03, 3.59400e+03,
1.13270e+05, 6.36390e+04],
[1.95300e+03, 9.90000e+01, 3.65385e+05, 1.87000e+03, 3.54700e+03,
1.15094e+05, 6.49890e+04],
[1.95400e+03, 1.00000e+02, 3.63112e+05, 3.57800e+03, 3.35000e+03,
1.16219e+05, 6.37610e+04],
[1.95500e+03, 1.01200e+02, 3.97469e+05, 2.90400e+03, 3.04800e+03,
1.17388e+05, 6.60190e+04],
[1.95600e+03, 1.04600e+02, 4.19180e+05, 2.82200e+03, 2.85700e+03,
1.18734e+05, 6.78570e+04],
[1.95700e+03, 1.08400e+02, 4.42769e+05, 2.93600e+03, 2.79800e+03,
1.20445e+05, 6.81690e+04],
[1.95800e+03, 1.10800e+02, 4.44546e+05, 4.68100e+03, 2.63700e+03,
1.21950e+05, 6.65130e+04],
[1.95900e+03, 1.12600e+02, 4.82704e+05, 3.81300e+03, 2.55200e+03,
1.23366e+05, 6.86550e+04],
[1.96000e+03, 1.14200e+02, 5.02601e+05, 3.93100e+03, 2.51400e+03,
1.25368e+05, 6.95640e+04],
[1.96100e+03, 1.15700e+02, 5.18173e+05, 4.80600e+03, 2.57200e+03,
1.27852e+05, 6.93310e+04],
[1.96200e+03, 1.16900e+02, 5.54894e+05, 4.00700e+03, 2.82700e+03,
1.30081e+05, 7.05510e+04]])
PGNP=data['PGNP']
GNP=data['GNP']/1000
POPU=data['POP']/1000
EM=data['EM']/1000
x=np.array([PGNP,GNP,POPU,EM])
x.T # x.transpose()
array([[ 83. , 234.289, 107.608, 60.323],
[ 88.5 , 259.426, 108.632, 61.122],
[ 88.2 , 258.054, 109.773, 60.171],
[ 89.5 , 284.599, 110.929, 61.187],
[ 96.2 , 328.975, 112.075, 63.221],
[ 98.1 , 346.999, 113.27 , 63.639],
[ 99. , 365.385, 115.094, 64.989],
[100. , 363.112, 116.219, 63.761],
[101.2 , 397.469, 117.388, 66.019],
[104.6 , 419.18 , 118.734, 67.857],
[108.4 , 442.769, 120.445, 68.169],
[110.8 , 444.546, 121.95 , 66.513],
[112.6 , 482.704, 123.366, 68.655],
[114.2 , 502.601, 125.368, 69.564],
[115.7 , 518.173, 127.852, 69.331],
[116.9 , 554.894, 130.081, 70.551]])
# Example 5.1 Condition Number and Correlation Matrix
data <- read.table("http://web.pdx.edu/~crkl/ceR/data/longley.txt", header = T, nrows = 16)
year <- data$YEAR
pgnp <- data$PGNP
gnp <- data$GNP
af <- data$AF
em <- data$EM
results <- lm(em ~ year + pgnp + gnp + af)
summary(results)
Call:
lm(formula = em ~ year + pgnp + gnp + af)
Residuals:
Min 1Q Median 3Q Max
-905.8 -342.7 -107.6 216.8 1437.7
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.169e+06 8.359e+05 1.399 0.18949
year -5.765e+02 4.335e+02 -1.330 0.21049
pgnp -1.977e+01 1.389e+02 -0.142 0.88940
gnp 6.439e-02 1.995e-02 3.227 0.00805 **
af -1.015e-02 3.086e-01 -0.033 0.97436
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 667.3 on 11 degrees of freedom
Multiple R-squared: 0.9735, Adjusted R-squared: 0.9639
F-statistic: 101.1 on 4 and 11 DF, p-value: 1.346e-08
# Correlation Matrix
y <- cbind(year, pgnp, gnp, af, em)
y1 <- cor(y)
y1
year pgnp gnp af em
year 1.0000000 0.9911492 0.9952735 0.4172451 0.9713295
pgnp 0.9911492 1.0000000 0.9915892 0.4647442 0.9708985
gnp 0.9952735 0.9915892 1.0000000 0.4464368 0.9835516
af 0.4172451 0.4647442 0.4464368 1.0000000 0.4573074
em 0.9713295 0.9708985 0.9835516 0.4573074 1.0000000
# Condition Number
x <- cbind(1, year, pgnp, gnp, af)
x1 <- t(x) %*% x
cm <- eigen(x1)$val
cn <- sqrt(max(cm)/min(cm))
cn
[1] 2001853658
kappa(x) # 2-norm condition number
[1] 1399032738
rm(list = ls())
# Example 5.1
# Condition Number and Correlation Matrix
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
data=pd.read_csv('http://web.pdx.edu/~crkl/ceR/data/longley.txt',sep='\s+',nrows=16)
model = smf.ols(formula='EM ~ YEAR + PGNP + GNP + AF',data = data).fit()
print(model.summary())
#condition number
OLS Regression Results
==============================================================================
Dep. Variable: EM R-squared: 0.974
Model: OLS Adj. R-squared: 0.964
Method: Least Squares F-statistic: 101.1
Date: Sun, 01 Aug 2021 Prob (F-statistic): 1.35e-08
Time: 18:40:26 Log-Likelihood: -123.76
No. Observations: 16 AIC: 257.5
Df Residuals: 11 BIC: 261.4
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1.169e+06 8.36e+05 1.399 0.189 -6.71e+05 3.01e+06
YEAR -576.4643 433.487 -1.330 0.210 -1530.564 377.635
PGNP -19.7681 138.893 -0.142 0.889 -325.469 285.933
GNP 0.0644 0.020 3.227 0.008 0.020 0.108
AF -0.0101 0.309 -0.033 0.974 -0.689 0.669
==============================================================================
Omnibus: 5.574 Durbin-Watson: 1.019
Prob(Omnibus): 0.062 Jarque-Bera (JB): 2.903
Skew: 0.956 Prob(JB): 0.234
Kurtosis: 3.836 Cond. No. 2.00e+09
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2e+09. This might indicate that there are
strong multicollinearity or other numerical problems.
C:\Users\amrof\AppData\Local\R-MINI~1\envs\R-RETI~1\lib\site-packages\scipy\stats\stats.py:1603: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=16
warnings.warn("kurtosistest only valid for n>=20 ... continuing "
model.condition_number
# cm=model.eigenvals
# np.sqrt(max(cm)/min(cm))
#correlation matrix
2001533396.5440526
data[['EM', 'YEAR', 'PGNP', 'GNP', 'AF']].corr()
# alternatively, using design-matrix to compute the condition number
# pip install patsy
EM YEAR PGNP GNP AF
EM 1.000000 0.971329 0.970899 0.983552 0.457307
YEAR 0.971329 1.000000 0.991149 0.995273 0.417245
PGNP 0.970899 0.991149 1.000000 0.991589 0.464744
GNP 0.983552 0.995273 0.991589 1.000000 0.446437
AF 0.457307 0.417245 0.464744 0.446437 1.000000
import patsy as pt
y, X = pt.dmatrices('EM ~ YEAR + PGNP + GNP + AF',data = data)
np.linalg.cond(X)
2001533396.5440536
Referências de pacotes utilizados - lista todos os pacotes que foram abertos em algum procedimento
Lin, Kuan-Pin; WISER-Club. Computational Econometrics: GAUSS Programming for Econometricians and Financial Analysts, ETEXT Textbook Publishing, 2001.