This work is a gentle introduction to the use of a user-friendly (open) software tool for Bayesian Model Averaging (BMA). We focus on the technical details of every algorithm implemented in our app, which is available at this link. We include some simulation exercises that can be performed directly in our app, and we explain how to read the results produced by the algorithms. Finally, we provide a short tutorial on how to use the product, although its usage is fairly intuitive.
Many sophisticated procedures developed in academia solve problems that standard statistical methodologies face, yet they are rarely adopted in practice. For instance, most people know how to run a simple regression in some software (e.g., Excel), but not everyone knows exactly what is going on behind an OLS (Ordinary Least Squares) estimation.
BMA addresses the problem of ignoring model uncertainty in regression analysis, but it is not commonly used outside academia. The reason is not that practitioners dislike going beyond a standard analysis; rather, they usually want simple software for their regressions. This project aims to solve at least that problem: the final software should be as simple as possible, so that a BMA procedure can be performed with just a few clicks.
Regression analysis is a workhorse tool for econometricians; however, its standard application ignores model uncertainty. In particular, econometricians select a set of regressors based on theory or on statistical criteria such as \(R^2\) or information criteria to specify their models. But if there are \(k\) possible regressors, there are \(2^k\) possible models, which quickly becomes unmanageable. Moreover, a researcher who does not take model uncertainty into account will simply choose the model with the best fit (e.g., the best \(R^2\)). Choosing a single model would not be a problem if every model gave the same answer, but application 4 in Kass and Raftery (1995) shows an example where it does matter.
One approach to this problem is BMA (Madigan and Raftery 1994). It takes model uncertainty into account and produces an averaged model weighted by the posterior model probabilities. The authors also show that this procedure leads to better average predictive performance than relying on any single model.
Hoeting et al. (1999) give a complete review of the BMA methodology and also survey different software options for performing it. These options are powerful, but the user needs programming skills to use them; to the best of our knowledge there is no software or package that combines every option in a user-friendly interface.
Bayesian Model Averaging (BMA) is a Bayesian approach that accounts for model uncertainty among a set of models \(\mathcal{M}\). Let \(\mathcal{M}=\left\{ M_1, M_2,...,M_{2^k}\right \}\) be the set of models under consideration, where each model \(M_j\) depends on a set of parameters \(\alpha_j\), \(j = 1,...,2^k\), and \(k\) is the number of possible regressors. Observe that \(k\) regressors imply \(2^k\) possible models when only regressor uncertainty is taken into account. Following the MC\(^3\) literature, the posterior model probability (PMP) for model \(M_j\) is defined as:
\[\begin{equation} \label{pmp} P(M_j \mid y)=\frac{P(y \mid M_j)\pi(M_j)}{\sum_{i=1}^{2^k}P(y \mid M_i)\pi(M_i)} \hspace{5mm} \forall j=1,2,...,2^k \end{equation}\]
where \[\begin{equation} \label{margLike} P(y \mid M_j)=\int_{\mathcal{R}^{dim(\alpha_j)}} P(y\mid \alpha_j, \hspace{2mm} M_j)\pi(\alpha_j \mid M_j ) d \alpha_{j} \hspace{5mm} \forall j=1,2,...,2^k \end{equation}\]
is the marginal likelihood of the model \(M_j\), \(\pi(\alpha_j \mid M_j )\) is the prior of parameters under \(M_j\), \(P(y \mid \alpha_j, \hspace{2mm} M_j)\) is the likelihood and \(\pi(M_j)\) is the prior probability of model \(M_j\).
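To fix ideas, the following R sketch applies \(\ref{pmp}\) to a handful of hypothetical (log) marginal likelihoods; all the numbers are made up, and the computation is done on the log scale for numerical stability.
log_ml <- c(-210.4, -208.9, -215.2, -209.5)      # hypothetical log P(y | M_j) for 4 models
prior  <- rep(1/4, 4)                            # uniform prior over models
log_post <- log_ml + log(prior)                  # log numerator of the PMP
pmp <- exp(log_post - max(log_post))             # subtract the max to avoid underflow
pmp <- pmp/sum(pmp)                              # normalize, as in the denominator of the PMP
round(pmp, 3)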
From this framework everything looks straightforward except for that integral; the rest of this work deals with how to compute it.
So far the methodology leads to a model that is a combination of different generalized linear models, but a user may want to work with a single model; the question is then which regressors lead to the best one. Intuitively one might include the variables that appear in the best model (in terms of PMP), but as Barbieri and Berger (2004) show, the best model for prediction is the median probability model.
The median probability model is the one which includes every variable which has posterior inclusion probability (\(PIP\)) higher than 0.5. The \(PIP\) for variable \(i\) is defined as
\[\begin{equation*} PIP_i=\sum_{j=1}^{2^k}P(M_j \mid y)*I_{i,j} \end{equation*}\] where \[\begin{equation*} I_{i,j}= \left\{ \begin{array}{lcc} 1& if & x_{i}\in M_j \\ \\ 0 & if & x_{i}\not \in M_j \end{array} \right. \end{equation*}\]From the definition of the posterior odds implied by \(\ref{pmp}\) we can define the Bayes factor (BF) and the prior odds (PO) as: \[\begin{align*} BF_{2,1} &=\frac{P(y\mid M_2 )}{P(y \mid M_1 )} \\ PO_{2,1} &= \frac{\pi(M_2 )}{\pi( M_1 )} \end{align*}\]
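As a small illustration, the R sketch below computes the \(PIP\) of each regressor and the Bayes factor of two models from a hypothetical set of posterior model probabilities and an inclusion-indicator matrix; the numbers are made up.
pmp <- c(0.50, 0.30, 0.15, 0.05)         # hypothetical PMPs of 4 models
Ind <- rbind(c(1, 1, 0),                 # model 1 includes x1 and x2
             c(1, 0, 0),                 # model 2 includes x1 only
             c(1, 1, 1),                 # model 3 includes x1, x2 and x3
             c(0, 0, 1))                 # model 4 includes x3 only
PIP <- as.vector(t(Ind) %*% pmp)         # PIP_i = sum_j P(M_j | y) * I_ij
names(PIP) <- c("x1", "x2", "x3")
round(PIP, 2)                            # x1 and x2 exceed 0.5: the median probability model
BF21 <- pmp[2]/pmp[1]                    # Bayes factor of M2 vs M1 under equal prior odds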
MC\(^3\) is a Bayesian methodology that uses a stochastic search to compare different models by their posterior model probability. The original proposal first considered an exhaustive procedure that is impractical because it requires information on all \(2^k\) possible models; with the large datasets available today, that would require an enormous amount of computation time. The authors therefore proposed a Markov chain Monte Carlo approach that approximates the exact solution, which leads to almost the same answer without evaluating all \(2^k\) models.
In this subsection (and the following ones) we focus on how to compute the integral in \(\ref{margLike}\), since it is the core of our work.
For this model we have the following likelihood function:
\[\begin{equation} \mathscr{L}({\bf{y}}|\beta, \sigma^2, {\bf{X}}) = (2\pi\sigma^2)^{-\frac{n}{2}} \exp \left\{-\frac{1}{2\sigma^2} (y - X\beta)'(y - X\beta) \right\} \label{like} \end{equation}\] The conjugate priors for the parameters are: \[\begin{align} \beta|\sigma^2 & \sim \mathcal{N}(\beta_0, \sigma^2 B_0) \label{eq:2} \\ \sigma^2 & \sim \mathcal{I}\mathcal{G}(\alpha_0/2, \delta_0/2) \label{eq:3} \end{align}\] First consider the following result: \[\begin{align} \nonumber &(y-X\beta)^{'}(y-X\beta)+(\beta-\beta_o)^{'}B_o^{-1}(\beta-\beta_o)\\ \label{ecbas} &=\beta^{'}(X^{'}X+B_o^{-1})\beta-2\beta^{'}(X^{'}y+B_o^{-1}\beta_o)+y^{'}y+\beta_o^{'}B_o^{-1}\beta_o\\ \nonumber &=\beta^{'}B^{-1}\beta-2\beta^{'}B^{-1}\beta^{*}+\beta^{*'}B^{-1}\beta^{*}-\beta^{*'}B^{-1}\beta^{*}+y^{'}y+\beta_o^{'}B_o^{-1}\beta_o\\ \nonumber &=(\beta-\beta^{*})^{'}B^{-1}(\beta-\beta^{*})-\beta^{*'}B^{-1}\beta^{*}+y^{'}y+\beta_o^{'}B_o^{-1}\beta_o \\ \label{AL} &=(\beta-\beta^{*})^{'}B^{-1}(\beta-\beta^{*})+M \end{align}\] where \[\begin{align} \nonumber B^{-1}=&X^{'}X+B_o^{-1}\\ \nonumber B=&(X^{'}X+B_o^{-1})^{-1}\\ \nonumber \beta^{*}=&B(X^{'}y+B_o^{-1}\beta_o)\\ \nonumber =&(X^{'}X+B_o^{-1})^{-1}(X^{'}y+B_o^{-1}\beta_o) \\ M=&y^{'}y+\beta_o^{'}B_o^{-1}\beta_o-\beta^{*'}B^{-1}\beta^{*} \end{align}\] Taking into account (\(\ref{AL}\)), (\(\ref{like}\)), (\(\ref{eq:2}\)) and (\(\ref{eq:3}\)) we calculate the marginal as: \[\begin{align*} P(y)=&\int_{0}^{\infty}\int_{\beta}\pi (\beta \mid \sigma^2,\hspace{1mm}B_o,\hspace{1mm}\beta_o )\pi(\sigma^2\mid \alpha_0/2, \delta_0/2)\mathscr{L}({\bf{y}}|\beta, \sigma^2, {\bf{X}})\, d\beta\, d\sigma^2\\ =&\int_{0}^{\infty} \pi(\sigma^2\mid \alpha_0/2, \delta_0/2) \frac{1}{(2\pi\sigma^2)^{(p+n)/2}\mid B_o\mid^{1/2}}\int_{\beta}\exp\left\{-\frac{1}{2\sigma^2}\left((\beta-\beta^{*})^{'}B^{-1}(\beta-\beta^{*})+M\right)\right\} d\beta\, d\sigma^2 \\ =&\int_{0}^{\infty} \pi(\sigma^2\mid \alpha_0/2, \delta_0/2) \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left\{-\frac{1}{2\sigma^2}M \right\} \frac{1}{(2\pi\sigma^2)^{p/2}\mid B_o\mid^{1/2}}\int_{\beta}\exp\left\{-\frac{1}{2\sigma^2}(\beta-\beta^*)^{'}B^{-1}(\beta-\beta^*)\right\}d\beta\, d\sigma^2 \\ =&\int_{0}^{\infty} \pi(\sigma^2\mid \alpha_0/2, \delta_0/2) \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left\{-\frac{1}{2\sigma^2}M \right\} \frac{\mid B\mid^{1/2}}{\mid B_o\mid^{1/2}} d\sigma^2 \end{align*}\] The last step completes the pdf of a \(p\)-dimensional normal distribution with mean \(\beta^*\) and covariance matrix \(\sigma^2B\).
Replacing \(\ref{eq:3}\) we obtain: \[\begin{align} \nonumber P(y)=&\int_{0}^{\infty} \frac{(\alpha_0/2)^{\delta_0/2}}{\Gamma(\delta_0/2)(\sigma^2)^{\delta_0/2+1}}\exp\left\{ -\frac{\alpha_0}{2\sigma^2}\right\} \frac{1}{(2\pi\sigma^2)^{n/2}}\exp\left\{-\frac{1}{2\sigma^2}M \right\} \frac{\mid B\mid^{1/2}}{\mid B_o\mid^{1/2}} d\sigma^2\\ \nonumber =&\frac{\mid B\mid^{1/2} (\alpha_0/2)^{\delta_0/2}}{(2\pi)^{n/2} \mid B_o\mid^{1/2} \Gamma(\delta_0/2)}\int_{0}^{\infty} (\sigma^2)^{-(n+\delta_0)/2-1} \exp\left\{-\frac{1}{2\sigma^2}(M+\alpha_0) \right\} d\sigma^2 \\ \nonumber & \text{making} \hspace{2mm} u=\frac{M+\alpha_0}{2\sigma^2}\\ \nonumber =&\frac{\mid B\mid^{1/2} (\alpha_0/2)^{\delta_0/2} 2^{(n+\delta_0)/2}}{(2\pi)^{n/2}\mid B_o \mid^{1/2} \Gamma(\delta_0/2) (M+\alpha_0)^{(n+\delta_0)/2}}\int_{0}^{\infty}u^{(n+\delta_0)/2-1}e^{-u}du\\ \nonumber =&\frac{\mid B\mid^{1/2} (\alpha_0/2)^{\delta_0/2} 2^{\delta_0/2}}{\pi^{n/2}\mid B_o \mid^{1/2} \Gamma(\delta_0/2) (M+\alpha_0)^{(n+\delta_0)/2}}\Gamma\left(\frac{n+\delta_0}{2}\right)\\ \nonumber & \text{making} \hspace{2mm} M+\alpha_0=\alpha_0\left(1+\frac{(\delta_0/2)M}{\delta_0(\alpha_0/2)}\right)\\ \label{margA} =&\frac{\mid B\mid^{1/2} \Gamma\left(\frac{n+\delta_0}{2}\right) (\delta_0/2)^{n/2}}{\pi^{n/2} \mid B_o\mid^{1/2}(\alpha_0/2)^{n/2}\delta_0^{n/2}\Gamma(\delta_0/2)\left(1+\frac{(\delta_0/2)M}{\delta_0(\alpha_0/2)}\right)^{(n+\delta_0)/2}} \end{align}\]Applying some extra algebra to \(M\) we obtain
\[\begin{align} \nonumber M=&y^{'}y-\beta^{*'}B^{-1}\beta^*+\beta_0^{'}B_0^{-1}\beta_0\\ \nonumber =&y^{'}y+\beta_0^{'}B_o^{-1}\beta_0-(X^{'}y+B_o^{-1}\beta_0)^{'}B(X^{'}y+B_o^{-1}\beta_0)\\ \nonumber =&y^{'}y+\beta_0^{'}B_o^{-1}\beta_0-y^{'}XBX^{'}y-2y^{'}XBB_0^{-1}\beta_0- \beta_0^{'}B_o^{-1}BB_o^{-1}\beta_0 \\ \label{ecprin} =&y^{'}(I-XBX^{'})y-2y^{'}XBB_0^{-1}\beta_0+\beta_0^{'}(B_o^{-1}-B_o^{-1}BB_o^{-1})\beta_0 \end{align}\]Define \(\Sigma_e^{-1}=I-XBX^{'}\) and consider the following lemma (Woodbury 1950):
\[\begin{equation*} (A+BDC)^{-1}=A^{-1}-A^{-1}B(D^{-1}+CA^{-1}B)^{-1}CA^{-1} \end{equation*}\]Taking into account the latter and the definition of \(B\)
\[\begin{align} \nonumber B_o^{-1}-B_o^{-1}BB_o^{-1}&=B_o^{-1}-B_o^{-1}(B_o^{-1}+X^{'}X)^{-1}B_o^{-1}\\ \nonumber & \text{making} \hspace{2mm} A=B_o, \hspace{2mm}B=C=I \hspace{2mm} \text{and} \hspace{2mm} D^{-1}=X^{'}X\\ \nonumber &=(B_o+(X^{'}X)^{-1})^{-1}\\ \nonumber & \text{making} \hspace{2mm} A^{-1}=X^{'}X, \hspace{2mm}B=C=I \hspace{2mm} \text{and} \hspace{2mm} D=B_o\\ \nonumber &=(X^{'}X)-(X^{'}X)(B_o^{-1}+(X^{'}X))^{-1}X^{'}X\\ \nonumber &=X^{'}(I-X(B_o^{-1}+(X^{'}X))^{-1}X^{'})X\\ \nonumber &=X^{'}(I-XBX^{'})X\\ \label{aux1} &=X^{'}\Sigma_e^{-1}X \end{align}\] \[\begin{align} \nonumber &BB^{-1}=I=B(X^{'}X+B_o^{-1})&\\ \nonumber &BB_o^{-1}=I-BX^{'}X& &\text{pre-multiplying by} \hspace{2mm} X\\ \nonumber &XBB_o^{-1}=X-XBX^{'}X=(I-XBX^{'})X&&\\ \label{aux2} &XBB_o^{-1}=\Sigma_e^{-1}X&& \end{align}\] Replacing (\(\ref{aux1}\)) and (\(\ref{aux2}\)) in (\(\ref{ecprin}\)) we obtain: \[\begin{align} \nonumber M=&y^{'}\Sigma_e^{-1}y-2y^{'}\Sigma_e^{-1}X\beta_0+\beta_0^{'}X^{'}\Sigma_e^{-1}X\beta_0\\ \label{M} =&(y-X\beta_0)^{'}\Sigma_e^{-1}(y-X\beta_0) \end{align}\]From the inversion lemma we also obtain (Woodbury 1950)
\[\begin{align} \nonumber \Sigma_e^{-1}=&I-XBX^{'}\\ \nonumber =&I-X(B_0^{-1}+X^{'}X)^{-1}X^{'}\\ \nonumber &\text{making} \hspace{2mm} A=I, \hspace{2mm}B=X=C^{'} \hspace{2mm} \text{and} \hspace{2mm} D=B_0\\ \Sigma_e=&I+XB_0X^{'}=(I-XBX^{'})^{-1} \end{align}\]Consider the following determinant lemma:
\[\begin{equation} \mid A+BDC \mid= \mid A\mid \mid D \mid \mid D^{-1} + CA^{-1}B\mid \end{equation}\]Let \(A=I\), \(B=X\), \(D=B_o\), \(C=X^{'}\), then we have
\[\begin{align} \nonumber \mid \Sigma_e\mid=&\mid I + XB_oX^{'}\mid\\ \nonumber =&\mid B_o\mid \mid B_o^{-1}+X^{'}X\mid\\ \nonumber =&\mid B_o\mid \mid B^{-1}\mid\\ \nonumber =&\frac{\mid B_o\mid}{ \mid B\mid}\\ \label{det} \frac{1}{\mid \Sigma_e\mid^{1/2} }=&\left(\frac{\mid B\mid}{\mid B_o\mid }\right)^{1/2} \end{align}\]Finally, consider \(\Sigma_M=\frac{\alpha_0\Sigma_e}{\delta_0}\), use (\(\ref{M}\)) and (\(\ref{det}\)), and replace in (\(\ref{margA}\)) to obtain
\[\begin{align} \nonumber P(Y)=&\int_{0}^{\infty}\int_{\beta}\pi (\beta \mid \sigma^2,\hspace{1mm}B_o,\hspace{1mm}\beta_o )\pi(\sigma^2\mid \alpha_0/2,\delta_0/2)\mathscr{L}({\bf{y}}|\beta, \sigma^2, {\bf{X}})d\sigma^2 d\beta=\\ \nonumber &=\frac{\Gamma(\frac{n+\delta_0}{2})}{\pi^{n/2}\Gamma(\delta_0/2)\delta_0^{n/2}\mid \Sigma_M \mid^{1/2}\left[ 1+\frac{1}{\delta_0} (y-X\beta_0)^{'}\Sigma_M^{-1}(y-X\beta_0)\right]^{(n+\delta_0)/2}}\\ \label{resFin} &=T\left[y\mid X\beta_0, \frac{\alpha_0(I+XB_0X^{'})}{\delta_0}, \delta_0\right] \end{align}\]where \(T[t\mid m, S, f]\) is the pdf of a multivariate t distribution with location parameter \(m\), scale matrix \(S\) and \(f\) degrees of freedom evaluated at \(t\).
From (\(\ref{resFin}\)) we can summarize all the previous algebra in the following result: when \(\sigma^2 \sim \mathcal{I} \mathcal{G}(\alpha_0/2,\delta_0/2)\), \(\beta\sim \mathcal{N}(\beta_0, \sigma^2 B_0)\) and \(y\sim \mathcal{N}(X\beta, \sigma^2 I)\), \[\begin{equation} \label{resUtilMarg} P(y)=\int_{0}^{\infty}\int_{\beta} \mathcal{I} \mathcal{G}(\alpha_0/2,\delta_0/2) \mathcal{N}(\beta_0, \sigma^2 B_0) \mathcal{N}(X\beta, \sigma^2 I) d\sigma^2 d\beta = T\left(y,X\beta_0, \frac{\alpha_0(I+XB_0X^{'})}{\delta_0}, \delta_0\right) \end{equation}\]Expression \(\ref{resUtilMarg}\) is thus the closed form of \(\ref{margLike}\) for this type of model.
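The closed form in \(\ref{resUtilMarg}\) can be evaluated directly with the multivariate t density from the mvtnorm package. The sketch below uses illustrative hyperparameters (\(\beta_0=0\), \(B_0=I\), \(\alpha_0=\delta_0=1\)) and simulated data; it is only meant to show how the formula is used in practice.
library(mvtnorm)
set.seed(1)
n <- 50; p <- 3
X <- matrix(rnorm(n*p), ncol = p)
y <- X %*% c(1, -1, 0.5) + rnorm(n)
beta0  <- rep(0, p)                      # prior mean of beta
B0     <- diag(p)                        # prior scale matrix of beta
alpha0 <- 1; delta0 <- 1                 # inverse gamma hyperparameters
S <- alpha0*(diag(n) + X %*% B0 %*% t(X))/delta0     # scale matrix of the multivariate t
logML <- dmvt(as.vector(y), delta = as.vector(X %*% beta0), sigma = S, df = delta0, log = TRUE)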
For the instrumental variable model in (\(\ref{et1}\)) and (\(\ref{et2}\)) we have
\[\begin{align} \label{sigmaEndo} \begin{bmatrix} \epsilon_i \\ \eta_i \end{bmatrix}\sim & \mathcal{N} (0, \Sigma)= \mathcal{N} \left(0, \begin{bmatrix} \sigma_{11}& \sigma_{12} \\ \sigma_{21} & \sigma_{22} \end{bmatrix}\right) \hspace{4mm} \text{with} \hspace{1mm}\sigma_{12}=\sigma_{21}\neq 0 \\ \rho= & [\beta \gamma']'\\ \lambda= & [\delta \tau']' \end{align}\]Let us consider a normal prior for \(\rho\) with mean \(0\) and covariance matrix equal to the identity matrix. To calculate the conditional posterior distribution (CPD) of \(\rho\) we can follow these steps:
Conditioned on \(\eta\) \[\begin{align*} \epsilon \mid \eta=\epsilon^{*}& \sim \mathcal{N} \left(\frac{\sigma_{21}\eta_i}{\sigma_{22}}, \sigma_{11}-\frac{\sigma_{12}^{2}}{\sigma_{22}}\right)\\ &\sim\frac{\sigma_{21}\eta_i}{\sigma_{22}}+\mathcal{N} (0, \xi)\\ \xi&=\sigma_{11}-\frac{\sigma_{12}^{2}}{\sigma_{22}} \end{align*}\]Then we can write (\(\ref{et2}\)) as
\[\begin{equation*} Y-\frac{\sigma_{21}\eta_i}{\sigma_{22}}=\hat{Y}=V\rho+\epsilon^{*} \end{equation*}\]Then our CPD would be:
\[\begin{align} \nonumber P(\rho \mid \delta, \Sigma, D)&=P(\rho \mid \hat{Y}, V) \\ \nonumber &\propto p(\hat{Y}\mid \rho, V) \pi(\rho)\\ \label{propN} &\propto \exp(-A/2)\\ \nonumber A&=(\hat{Y}-V\rho)'(\xi I)^{-1}(\hat{Y}-V\rho)+(\rho-0)'I^{-1}(\rho-0)\\ \nonumber &= \frac{1}{\xi}(\hat{Y}'\hat{Y}-2\rho'V'\hat{Y}+\rho'V'V\rho)+\rho'\rho\\ \nonumber &= \hat{Y}'\hat{Y}\xi^{-1}+\rho' (V'V\xi^{-1}+I)\rho - 2 \xi^{-1}\rho'V'\hat{Y}\\ \nonumber &= \hat{Y}'\hat{Y}\xi^{-1}+\rho' \Omega_1\rho - 2 \xi^{-1}\rho' \Omega_1 \Omega_1^{-1}V'\hat{Y}\\ \nonumber &= \hat{Y}'\hat{Y}\xi^{-1}+\rho' \Omega_1\rho - 2 \rho'\Omega_1 \hat{\rho}\\ \nonumber &= \hat{Y}'\hat{Y}\xi^{-1}+\rho' \Omega_1\rho - 2 \rho'\Omega_1 \hat{\rho} +\hat{\rho}'\Omega_1\hat{\rho}-\hat{\rho}'\Omega_1\hat{\rho} \\ &=(\rho-\hat{\rho})'\Omega_1(\rho-\hat{\rho})+\hat{Y}'\hat{Y}\xi^{-1}-\hat{\rho}'\Omega_1\hat{\rho} \label{complN} \end{align}\] From (\(\ref{propN}\)) and (\(\ref{complN}\)) it follows that \[\begin{equation} \label{CPDS2} P(\rho \mid \delta, \Sigma, D)\propto \mathcal{N} (\hat{\rho}, \Omega_1^{-1}) \end{equation}\]where \(\Omega_1=V'V\xi^{-1}+I\) and \(\hat{\rho}=\xi^{-1}\Omega_1^{-1}V'\hat{Y}\).
For the case of \(\lambda\), consider a normal prior with mean \(0\) and covariance matrix equal to the identity matrix; the CPD of \(\lambda\) can be calculated as follows:
Here our quantity of interest is \(p(\lambda \mid \rho, \Sigma, D)\). Since we condition on \(\rho\), we can plug \(\ref{et1}\) into \(\ref{et2}\) and obtain
\[\begin{align} \nonumber Y=&Z\delta\beta+W\tau\beta + W\gamma +\beta\eta+\epsilon\\ \nonumber \frac{Y-W\gamma}{\beta}=&Z\delta + W\tau+ \eta +\frac{\epsilon}{\beta}\\ \nonumber Y^{*}=&Z\delta + W\tau+ \upsilon\\ \nonumber \upsilon =&\eta +\frac{\epsilon}{\beta}\\ \nonumber E(\upsilon_i)=&0 \\ \nonumber V(\upsilon_i)=&\sigma_{22}+\beta^{-2}\sigma_{11}+2\beta^{-1}\sigma_{21}=\alpha\\ \upsilon_i\sim& \mathcal{N}(0,\alpha) \end{align}\]
We can now write a new system with the stacked (doubled) observations:
\[\begin{equation} \begin{bmatrix} Y^{*} \\ X \end{bmatrix}= \begin{bmatrix} Z \\ Z \end{bmatrix}\delta+ \begin{bmatrix} W \\ W \end{bmatrix}\tau + \begin{bmatrix} \upsilon \\ \eta \end{bmatrix} \end{equation}\]
Clearly our observations are not completely independent, since \(cov(\eta_i,\upsilon_i)\neq 0\):
\[\begin{align*} cov(\eta_i,\upsilon_i)&=E(\eta_i\upsilon_i)\\ &=E(\eta_i(\eta_i+\epsilon_i\beta^{-1}))\\ &=E(\eta_i^{2})+\beta^{-1}E(\eta_i\epsilon_i)\\ &=\sigma_{22}+\beta^{-1}\sigma_{21} \end{align*}\] Then the error in the new system has the following properties: \[\begin{equation} \begin{bmatrix} \upsilon_i \\ \eta_i \end{bmatrix}\sim \mathcal{N} \left(0, \begin{bmatrix} \alpha& \sigma_{22}+\beta^{-1}\sigma_{21} \\ \sigma_{22}+\beta^{-1}\sigma_{21} &\sigma_{22} \end{bmatrix} \right) = \mathcal{N} \left(0, \Theta \right) \end{equation}\]Again, since we condition on \(\Sigma\) and \(\rho\), we can use the Cholesky decomposition of \(\Theta\), \(\Phi \Phi'=\Theta\), to set up a new system with independent errors and unit covariance matrix as follows:
\[\begin{align*} [ Y^{u} X^{u}]=&[ Y^{*} X](\Phi')^{-1}\\ [ W^{uu}_k W^{ub}_k]=&[ W_k W_k](\Phi')^{-1}\\ [ Z^{uu}_k Z^{ub}_k]=&[ Z_k Z_k](\Phi')^{-1}\\ [ \upsilon_i^u \eta_i^u]=&[ \upsilon_i \eta_i](\Phi')^{-1}\\ \end{align*}\] Our new system is \[\begin{equation*} YX^u=ZW^u\lambda+\eta^* \end{equation*}\]where
\[\begin{align*} YX^u=& \begin{bmatrix} Y^u \\ X^u \end{bmatrix} \\ ZW^u=& \begin{bmatrix} Z^{uu}& W^{uu}\\ Z^{ub}& W^{ub} \end{bmatrix}\\ \eta^*=& \begin{bmatrix} \upsilon^u \\ \eta^u \end{bmatrix} \end{align*}\] After steps analogous to those in the CPD of \(\rho\) we have \[\begin{equation} \label{CPDS1} p(\lambda \mid \rho, \Sigma, D)\sim \mathcal{N} (\hat{\lambda},\Omega_2^{-1})=\mathcal{N} (\Omega_2^{-1}ZW^{u'}YX^u,(I+ZW^{u'}(ZW^u))^{-1}) \end{equation}\]For \(\Sigma\) the prior is an inverse Wishart, which is conjugate for the Gaussian family; taking \(\pi(\Sigma)\sim \mathcal{I}\mathcal{W}(I,3)\) we obtain the CPD:
\[ p(\Sigma\mid \rho,\lambda,D)\sim \mathcal{I}\mathcal{W}(I+[\epsilon \hspace{1mm} \eta]'[\epsilon \hspace{1mm} \eta],3+n) \]
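To make the conditionals concrete, here is a minimal R sketch of one Gibbs update of \(\rho\) and \(\Sigma\) given the current values of the remaining quantities. The helper names are our own and this is not the ivbma implementation; the \(\lambda\) update follows the same pattern after the whitening step described above.
library(mvtnorm)   # rmvnorm

draw_rho <- function(Y, X, W, eta, Sigma) {
  V    <- cbind(X, W)                                # second-stage regressors
  xi   <- Sigma[1, 1] - Sigma[1, 2]^2/Sigma[2, 2]    # Var(eps | eta)
  Yhat <- Y - Sigma[2, 1]*eta/Sigma[2, 2]            # Y minus E(eps | eta)
  Omega1  <- t(V) %*% V/xi + diag(ncol(V))           # posterior precision of rho
  rho_hat <- as.vector(solve(Omega1, t(V) %*% Yhat/xi))   # posterior mean of rho
  as.vector(rmvnorm(1, mean = rho_hat, sigma = solve(Omega1)))
}

draw_Sigma <- function(eps, eta) {
  E <- cbind(eps, eta)
  S <- diag(2) + t(E) %*% E                          # scale: I + [eps eta]'[eps eta]
  # an IW(S, 3 + n) draw is the inverse of a Wishart(3 + n, S^{-1}) draw
  solve(rWishart(1, 3 + nrow(E), solve(S))[, , 1])
}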
If we want to take model uncertainty into account we need some measure of how good our models are. For this we use Conditional Bayes Factors (\(CBF\)) (Dickey and Gunel 1978) and include them in the estimation algorithm (MC\(^3\)). Considering the second-stage model and the first-stage model above, we obtain the following \(CBF\)s:
\[\begin{align} \label{CBF2} CBF^{snd}&=\frac{\int p(D \mid M^{snd}_{curr}, \Sigma, \lambda ) \pi(\rho \mid M^{snd}_{curr}) d\rho} {\int p(D \mid M^{snd}_{new}, \Sigma, \lambda ) \pi(\rho \mid M^{snd}_{new}) d\rho}\\ \nonumber &\\ \label{CBF1} CBF^{fst}&=\frac{\int p(D \mid M^{fst}_{curr}, \Sigma, \rho ) \pi(\lambda\mid M^{fst}_{curr}) d\lambda} {\int p(D \mid M^{fst}_{new}, \Sigma, \rho ) \pi(\lambda \mid M^{fst}_{new}) d\lambda} \end{align}\]where \(D\) stands for the data of the problem, and \(M^{snd}\) and \(M^{fst}\) denote models for the second stage and the first stage respectively.
If we have a model which satisfies:
\[\begin{align*} y =& X\beta + error\\ p(y \mid \phi,X,\sigma )\sim& \mathcal{N}(X\phi,\sigma I)\\ \pi(\phi)\sim& \mathcal{N}(0, I)\\ \end{align*}\]
then
\[\begin{align*} p(y \mid \sigma )=&(2\pi\sigma)^{-n/2}\mid \Omega \mid^{-1/2}\exp\left(-\frac{1}{2\sigma}y'y\right) \exp\left(\frac{1}{2}\mu'\Omega\mu\right) \\ \Omega=&X'X/\sigma+I\\ \mu=&(X'X/\sigma+I)^{-1}X'y/\sigma \end{align*}\]
This result applies directly to the integrals in (\(\ref{CBF2}\)) and (\(\ref{CBF1}\)).
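The expression above (including the normal likelihood's normalizing constant) can be transcribed directly into a small R helper; the function name is ours and this is only a sketch of the quantity that enters the CBFs, with sigma standing for the relevant conditional error variance.
log_marg_cond <- function(y, X, sigma) {
  n <- length(y); p <- ncol(X)
  Omega <- t(X) %*% X/sigma + diag(p)
  mu    <- solve(Omega, t(X) %*% y/sigma)
  as.numeric(-n/2*log(2*pi*sigma)                        # likelihood normalizing constant
             - 0.5*determinant(Omega, logarithm = TRUE)$modulus
             - sum(y^2)/(2*sigma)
             + 0.5*t(mu) %*% Omega %*% mu)
}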
So far we have derived every quantity and distribution necessary to run an MC\(^3\) algorithm; we use the algorithm in Karl and Lenkoski (2012), to which the reader is referred for the details.
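Schematically, and leaving aside the details of Karl and Lenkoski's implementation, a single model move for the second stage can be sketched as follows: propose toggling the inclusion of one randomly chosen regressor and accept the move with probability \(\min(1, r)\), where \(r\) is the ratio of the proposed model's conditional marginal likelihood to the current one's (the reciprocal of \(\ref{CBF2}\) when the model prior is uniform). The helper below reuses log_marg_cond from the previous sketch and is illustrative only.
propose_move <- function(incl, y, Xfull, sigma) {
  j <- sample(length(incl), 1)                     # pick one regressor at random
  incl_new <- incl
  incl_new[j] <- !incl_new[j]                      # toggle its inclusion
  if (!any(incl_new)) return(incl)                 # keep at least one regressor, for simplicity
  log_r <- log_marg_cond(y, Xfull[, incl_new, drop = FALSE], sigma) -
           log_marg_cond(y, Xfull[, incl, drop = FALSE], sigma)
  if (log(runif(1)) < log_r) incl_new else incl    # Metropolis acceptance
}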
Clearly the tricky part when considering model uncertainty is the non-obvious way to calculate the integral in \(\ref{margLike}\). Here we consider the Bayesian Information Criterion (BIC) approximation.
For a given model we need to compute the integral in \(\ref{margLike}\), which we can rewrite as:
\[\begin{align*} P(y)=&\int_{\mathcal{R}^{dim(\alpha)}} P(y\mid \alpha) \pi(\alpha) d\alpha\\ g(\alpha)=& - \frac{ln( P(y\mid \alpha) \pi(\alpha) )}{n}\\ P(y)=&\int_{\mathcal{R}^{dim(\alpha)}} exp(-n \cdot g(\alpha)) d\alpha \end{align*}\]
If \(n\) is sufficiently large (\(n\to \infty\)) we can apply the Laplace method (Tierney and Kadane 1986; Hoeting et al. 1999): we assume that \(g\) is smooth, attains a unique minimum at \(\hat{\alpha}\) (approximately the MLE), and has a positive definite Hessian \(K\) at that point.
Under those assumptions we get the following result:
\[\begin{align*} P(y)=&\left( \frac{2\pi}{n}\right)^{p/2}\mid K\mid^{-1/2} \exp(-n \cdot g(\hat{\alpha}_{MLE}))\\ ln\left(P(y) \right)=& \frac{p}{2}ln(2\pi)- \frac{p}{2} ln(n) - \frac{1}{2}ln(\mid K\mid) + ln\left(P(y\mid \hat{\alpha}_{MLE}) \right)+ ln\left(\pi(\hat{\alpha}_{MLE}) \right) \end{align*}\]where \(p=dim(\alpha)\) and \(K\) is the Hessian matrix of \(g\) evaluated at \(\hat{\alpha}_{MLE}\). Moreover, since \(\frac{p}{2}ln(2\pi)\) and \(ln\left(\pi(\hat{\alpha}_{MLE}) \right)\) are constants and \(\mid K \mid\) is bounded by a finite constant, we have:
\[\begin{align*} ln\left(P(y) \right)=& - \frac{p}{2} ln(n) + ln\left(P(y\mid \hat{\alpha}_{MLE}) \right)+ O(1)&\\ ln\left(P(y) \right)\approx& - \frac{p}{2} ln(n) + ln\left(P(y\mid \hat{\alpha}_{MLE}) \right)&= \frac{-2ln\left(P(y\mid \hat{\alpha}_{MLE}) \right) +p \cdot ln(n) }{-2}\\ ln\left(P(y) \right)\approx &-\frac{BIC}{2}& \end{align*}\]In what follows we present some simulation exercises to show how the algorithms used in the app work, and we explain how to read the resulting tables, which are much like the ones shown in our app.
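Before turning to the exercises, here is a small sketch of how BIC values translate into approximate posterior model probabilities under a uniform model prior (the BIC numbers are hypothetical):
bic <- c(312.4, 310.1, 318.7)              # hypothetical BICs of 3 competing models
log_ml_approx <- -bic/2                    # ln P(y | M_j) is approximately -BIC_j/2
pmp <- exp(log_ml_approx - max(log_ml_approx))
pmp <- pmp/sum(pmp)                        # approximate PMPs under a uniform model prior
round(pmp, 3)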
We also report the estimated covariance matrix of the model specification in (\(\ref{sigmaEndo}\)), which goes in a third table.
The data used here is available at this link.
library(BMA)
library(mvtnorm) #multivariate normal and T
library(ivbma)
library(knitr)
n=100                                        # sample size
k=7                                          # number of candidate regressors
x=matrix(rnorm(n*k),ncol = k)                # simulated design matrix
realbeta=matrix(c(0,1,-1,0,5,-10,0),ncol=1)  # true coefficients (zeros = irrelevant regressors)
colnames(x)=c("Tr1","G1","G2","Tr2","G3","G4","T3")
y=x%*%realbeta+rnorm(n)                      # Gaussian response
#write.csv(cbind(y,x),"~/Google Drive/9/PI3/YXgaus1.csv")
ayuda=MC3.REG(y, x, num.its=1000,rep(TRUE,dim(x)[2]), outliers = FALSE ) # MC3 search over models
pmp=ayuda$post.prob      # posterior model probabilities of the visited models
which=ayuda$variables    # inclusion indicators of each visited model
nModels=ayuda$n.models
nreg=ncol(x)
betaModels=matrix(rep(0,nreg*nModels),nModels,nreg)
BMAbeta=rep(0,nreg)      # BMA point estimates
BMAvar=rep(0,nreg)       # BMA variances
PIP=BMAbeta              # posterior inclusion probabilities
for (i in 1:nModels){
included=which[i,]                       # regressors included in model i
xred=as.matrix(x[,included])
model=lm(y ~ 0+xred)                     # OLS fit of model i (no intercept)
beta=model$coefficients
betaModels[i,included]=beta
BMAbeta[included]=BMAbeta[included]+beta*pmp[i]   # accumulate the BMA point estimate
PIP[included]=PIP[included]+pmp[i]                # accumulate the posterior inclusion probability
### BMA variance via the law of total variance: accumulate pmp*(Var + beta^2), subtract E(beta)^2 below
esd=coef(summary(model))[, "Std. Error"]
esd=esd^2
suma=esd+beta^2
BMAvar[included]=BMAvar[included]+suma*pmp[i]
###
}
BMAvar=BMAvar-BMAbeta^2##E(x^2)-E(x)^2
BMAsd=(BMAvar^0.5)
table=cbind(PIP,BMAbeta,BMAsd,realbeta)
colnames(table)=c("PIP","Est Value","Sd","real")
rownames(table)=colnames(x)
table1=kable(table,digits = 3)
| | PIP | Est Value | Sd | real |
|---|---|---|---|---|
| Tr1 | 0.016 | -0.001 | 0.017 | 0 |
| G1 | 1.000 | 1.103 | 0.106 | 1 |
| G2 | 1.000 | -0.962 | 0.094 | -1 |
| Tr2 | 0.018 | 0.002 | 0.021 | 0 |
| G3 | 1.000 | 5.089 | 0.088 | 5 |
| G4 | 1.000 | -10.041 | 0.111 | -10 |
| T3 | 0.014 | -0.001 | 0.014 | 0 |
We can see that the algorithm correctly identifies the variables that belong to the true model: the variables whose real coefficient is 0 get a low PIP, while the variables with non-zero real coefficients get a PIP close to 1.
We can also see that the estimated values are really close to the real ones.
To replicate this exercise you can use the file YXgaus1.csv.
hasta=dim(x)[2]                          # number of candidate regressors
aux <- bicreg(x=x, y=y, strict = FALSE, OR = 50,maxCol = (hasta+1))  # BMA via the BIC approximation
a=summary(aux)
aux2=matrix(as.numeric(cbind(a[1:(ncol(x)+1),1:3],c(0,realbeta))),ncol=4)
colnames(aux2)=c(colnames(a)[1:3],"real")
rownames(aux2)=rownames(a)[1:(ncol(x)+1)]
table2=kable(aux2,digits = 3)
| | p!=0 | EV | SD | real |
|---|---|---|---|---|
| Intercept | 100.0 | 0.028 | 0.104 | 0 |
| Tr1 | 11.9 | -0.010 | 0.046 | 0 |
| G1 | 100.0 | 1.096 | 0.112 | 1 |
| G2 | 100.0 | -0.962 | 0.095 | -1 |
| Tr2 | 13.1 | 0.015 | 0.056 | 0 |
| G3 | 100.0 | 5.092 | 0.089 | 5 |
| G4 | 100.0 | -10.039 | 0.113 | -10 |
| T3 | 8.8 | -0.007 | 0.037 | 0 |
Again the algorithm identifies the variables that belong to the true model: variables whose real coefficient is 0 get a low inclusion probability (the p!=0 column, in percent), while the others get inclusion probabilities close to 100.
We can also see that the estimated values are really close to the real ones.
Even though we are using the same data, the results differ slightly between the two approaches. The MC3 algorithm identifies the variables that belong to the true model somewhat more sharply (compare the PIP columns of both results).
To replicate this exercise you can use the file YXgaus1.csv.
library(mvtnorm) #multivariate normal and T
library(ivbma)
library(knitr)
n=150                                        # sample size
nz=6                                         # number of instruments
Z=rmvnorm(n, rep(0,nz),diag(nz))             # instruments
nw=8                                         # number of exogenous regressors
W=rmvnorm(n, rep(0,nw),diag(nw))             # exogenous regressors
colnames(Z)=c("z1","z2","z3","z4","z5","z6")
colnames(W)=c("w1","w2","w3","w4","w5","w6","w7","w8")
Sigma=matrix(c(1,0.4,0.4,1),nrow=2)          # correlated errors induce endogeneity
errores=rmvnorm(n, rep(0,2),Sigma)
realrho=matrix(c(0,1,-1,0,5,-10,0,0),ncol=1)       # true second-stage coefficients of W
reallambda=matrix(c(1,-1,2,0,0,0,0,1,-1,0,5,-10,0,0),ncol=1)  # true first-stage coefficients of Z and W
x=cbind(Z,W)%*%reallambda+errores[,2]        # first stage: endogenous regressor
y=1.5*x+W%*%realrho+errores[,1]              # second stage: outcome
realrho=rbind(1.5,realrho)                   # prepend the true coefficient of x
aux <- ivbma(y, x, Z, W, s = 1000)   # IV-BMA with 1000 sampler iterations
PIP2=aux$M.bar                       # first-stage inclusion probabilities (paired with lambda.bar below)
exp2=aux$lambda.bar                  # first-stage posterior means
tabla2 = PIP2[,1]
cols=""
for (i in 1 :ncol(PIP2)){
cols=c(cols,paste("EV",i), paste("PIP",i))
tabla2=cbind(tabla2,exp2[,i],PIP2[,i])
}
tabla2=tabla2[,-1]
tabla2=cbind(tabla2,reallambda)
colnames(tabla2)=c(cols[-1],"real")
rownames(tabla2)=c(colnames(Z),colnames(W))
PIP=aux$L.bar                        # second-stage inclusion probabilities (paired with rho.bar below)
exp=aux$rho.bar                      # second-stage posterior means
tabla1=cbind(exp,PIP,realrho)
colnames(tabla1) = c("EV","PIP","real")
rownames(tabla1) = c("x",colnames(W))
table3=kable(tabla1,digits = 3)
table4=kable(tabla2,digits = 3)
sigma=aux$Sigma.bar
colnames(sigma)=c("e1","e2")
sigma0=sigma
sigma=kable(sigma)
sigma0=kable(sigma0,digits = 3,format = 'latex')
Second Stage
| | EV | PIP | real |
|---|---|---|---|
| x | 1.634 | 1.000 | 1.5 |
| w1 | 0.006 | 0.120 | 0.0 |
| w2 | 0.839 | 1.000 | 1.0 |
| w3 | -1.000 | 0.987 | -1.0 |
| w4 | -0.009 | 0.113 | 0.0 |
| w5 | 4.188 | 1.000 | 5.0 |
| w6 | -8.422 | 1.000 | -10.0 |
| w7 | 0.074 | 0.453 | 0.0 |
| w8 | 0.000 | 0.063 | 0.0 |
First Stage
| | EV 1 | PIP 1 | real |
|---|---|---|---|
| z1 | 1.105 | 1.000 | 1 |
| z2 | -1.015 | 0.969 | -1 |
| z3 | 2.040 | 1.000 | 2 |
| z4 | 0.009 | 0.129 | 0 |
| z5 | -0.016 | 0.197 | 0 |
| z6 | -0.008 | 0.092 | 0 |
| w1 | 0.000 | 0.107 | 0 |
| w2 | 0.975 | 1.000 | 1 |
| w3 | -1.265 | 1.000 | -1 |
| w4 | 0.001 | 0.021 | 0 |
| w5 | 4.772 | 1.000 | 5 |
| w6 | -9.849 | 1.000 | -10 |
| w7 | 0.050 | 0.339 | 0 |
| w8 | -0.005 | 0.082 | 0 |
\(\hat{\Sigma}\)
| e1 | e2 |
|---|---|
| 0.8815328 | 0.2765433 |
| 0.2765433 | 0.9570985 |
To replicate this exercise you can use the files YXWgaus2.csv and Zgaus2.csv.
Here we see that for both stages the algorithm estimates the real values well, and that the regressors belonging to the true model are identified (in both stages) by a PIP close to 1.
n=100
k=7
x=matrix(rnorm(n*k),ncol = k)
realbeta=matrix(c(0,1,-1,0,5,-10,0),ncol=1)*0.1   # smaller coefficients for the logit scale
colnames(x)=c("Tr1","G1","G2","Tr2","G3","G4","T3")
y=x%*%realbeta                        # linear predictor
y = 1/(1+exp(-y))                     # pass through an inverse-logit function to get probabilities
y = rbinom(n,1,y)                     # binary response
hasta=dim(x)[2]
aux <- bic.glm(x=x, y=y, strict = FALSE, OR = 50,maxCol = (hasta+1), glm.family=binomial())
a=summary(aux)
aux2=matrix(as.numeric(cbind(a[1:(ncol(x)+1),1:3],c(0,realbeta))),ncol=4)
colnames(aux2)=c(colnames(a)[1:3],"real")
rownames(aux2)=rownames(a)[1:(ncol(x)+1)]
table5=kable(aux2,digits = 3)
| | p!=0 | EV | SD | real |
|---|---|---|---|---|
| Intercept | 100.0 | -0.204 | 0.232 | 0.0 |
| Tr1 | 6.7 | 0.007 | 0.062 | 0.0 |
| G1 | 10.8 | 0.026 | 0.112 | 0.1 |
| G2 | 10.0 | 0.020 | 0.094 | -0.1 |
| Tr2 | 6.2 | 0.004 | 0.053 | 0.0 |
| G3 | 97.8 | 0.846 | 0.303 | 0.5 |
| G4 | 100.0 | -1.216 | 0.310 | -1.0 |
| T3 | 6.0 | -0.002 | 0.061 | 0.0 |
We can see that the algorithm identifies only the most important variables (the ones with the largest real coefficients).
To replicate this exercise you can use the file YXbin.csv.
n=100
k=7
x=matrix(rnorm(n*k),ncol = k)
realbeta=matrix(c(0,1,-1,0,5,-10,0),ncol=1)*0.1   # smaller coefficients for the log scale
colnames(x)=c("Tr1","G1","G2","Tr2","G3","G4","T3")
y=x%*%realbeta                    # linear predictor
y = exp(y)                        # log link: Poisson means
y=rpois(n=n, lambda=y)            # count response
hasta=dim(x)[2]
aux <- bic.glm(x=x, y=y, strict = FALSE, OR = 50,maxCol = (hasta+1), glm.family=poisson())
a=summary(aux)
aux2=matrix(as.numeric(cbind(a[1:(ncol(x)+1),1:3],c(0,realbeta))),ncol=4)
colnames(aux2)=c(colnames(a)[1:3],"real")
rownames(aux2)=rownames(a)[1:(ncol(x)+1)]
aux2p=aux2
table6=kable(aux2,digits = 3)
Once again the algorithm identifies the variables that belong to the true model: the variables whose real coefficient is 0 get a low inclusion probability, while the others get inclusion probabilities close to 100.
We can also see that the estimated values are close to the real ones.
To replicate this exercise you can use the file YXpos.csv.
set.seed(999)
N <- 100
x <- runif(N, -1, 1)
a <- 0.5                                 # true intercept
a0 <- 0.5
b <- 1.2                                 # true slope of G1
y_true <- exp(a + b * x)                 # log link: Gamma means
x=cbind(x,runif(N, -1, 1))               # add an irrelevant regressor
shape <- 10
y <- rgamma(N, rate = shape / y_true, shape = shape)   # Gamma response with mean y_true
colnames(x)=c("G1","T1")
y=rgamma(n=n, rate = shape / y, shape = shape)
hasta=dim(x)[2]
aux <- bic.glm(x=x, y=y, strict = FALSE, OR = 50,maxCol = (hasta+1), glm.family=Gamma(link = "log"))
a=summary(aux)
aux2=matrix(as.numeric(cbind(a[1:(ncol(x)+1),1:3],c(a0,b,0))),ncol=4)
colnames(aux2)=c(colnames(a)[1:3],"real")
rownames(aux2)=rownames(a)[1:(ncol(x)+1)]
aux2p=aux2
table7=kable(aux2,digits = 3)
VER=kable(aux2,digits = 3,format = "latex")
write.table(toString(VER),"ExtraEjemplo4.txt")
| | p!=0 | EV | SD | real |
|---|---|---|---|---|
| Intercept | 100.0 | 0.540 | 0.042 | 0.5 |
| G1 | 100.0 | 1.215 | 0.074 | 1.2 |
| T1 | 9.1 | -0.001 | 0.021 | 0.0 |
Again the algorithm identifies the variables that belong to the true model: the irrelevant regressor T1 gets a low inclusion probability, while the intercept and G1 get inclusion probabilities close to 100.
We can also see that the estimated values are close to the real ones.
To replicate this exercise you can use the file YXgam.csv.
Using this software is simple: once your data file is ready, you are a few clicks away from your Bayesian model average. Go to the video tutorial.
This work shows the technical details behind Bayesian Model Averaging and also provides the user with a simple GUI that handles the whole complexity of BMA with no need for programming. The software seeks to make these kinds of models part of the econometrician's daily toolbox.
Further work should include real applications, more functions for the software (such as prediction with the BMA model), and possibly new options such as non-local priors (Johnson and Rossell 2012).
Barbieri, Maria Maddalena, and James O Berger. 2004. “Optimal Predictive Model Selection.” Annals of Statistics. JSTOR, 870–97.
Dickey, James M, and E Gunel. 1978. “Bayes Factors from Mixed Probabilities.” Journal of the Royal Statistical Society. Series B (Methodological). JSTOR, 43–46.
Hoeting, Jennifer A, David Madigan, Adrian E Raftery, and Chris T Volinsky. 1999. “Bayesian Model Averaging: A Tutorial.” Statistical Science. JSTOR, 382–401.
Johnson, Valen E, and David Rossell. 2012. “Bayesian Model Selection in High-Dimensional Settings.” Journal of the American Statistical Association 107 (498). Taylor & Francis: 649–60.
Karl, Anna, and Alex Lenkoski. 2012. “Instrumental Variable Bayesian Model Averaging via Conditional Bayes Factors.” ArXiv Preprint ArXiv:1202.5846.
Kass, Robert E, and Adrian E Raftery. 1995. “Bayes Factors.” Journal of the American Statistical Association 90 (430). Taylor & Francis Group: 773–95.
Madigan, David, and Adrian E Raftery. 1994. “Model Selection and Accounting for Model Uncertainty in Graphical Models Using Occam’s Window.” Journal of the American Statistical Association 89 (428). Taylor & Francis: 1535–46.
Tierney, Luke, and Joseph B Kadane. 1986. “Accurate Approximations for Posterior Moments and Marginal Densities.” Journal of the American Statistical Association 81 (393). Taylor & Francis Group: 82–86.
Woodbury, Max A. 1950. “Inverting Modified Matrices.” Memorandum Report 42. Statistical Research Group, Princeton University: 106.