Determining conditional independence relationships through undirected graphical models is a key component in the statistical analysis of complex obervational data in a wide variety of disciplines. In many situations one seeks to estimate the underlying graphical model of a dataset that includes variables of different domains.
In the following, we use the R package mgm to estimate a Mixed Graphical Model on a data set consisting of questionnaire responses of individuals diagnosed with Autism Spectrum Disorder. This dataset includes variables of different domains, such as age (continuous), type of housing (categorical) and number of treatments (count).
The dataset consists of responses of 3521 individuals diagnosed with Autism Spectrum Disorder (ASD) to a questionnaire including 28 variables of domains continuous, count and categorical and is automatically loaded with the mgm package.
library(mgm)
library(qgraph)
head(autism_data_large$data)
## Gender IQ Age diagnosis Openness about Diagnosis Success selfrating
## 1 1 6.00 -0.9605781 1 2.21
## 2 2 6.00 -0.5156103 1 6.11
## 3 1 5.00 -0.7063108 2 5.62
## 4 1 6.00 -0.4520435 1 8.00
## 5 1 5.00 -0.6474479 1 4.03
## 6 1 4.49 -0.6427440 1 4.18
## Well being Integration in Society No of family members with autism
## 1 2 1 4
## 2 4 2 0
## 3 4 2 1
## 4 4 1 1
## 5 4 1 2
## 6 2 1 0
## No of Comorbidities No of Physical Problems No of Treatments
## 1 1 2 5
## 2 1 0 0
## 3 0 0 0
## 4 0 1 4
## 5 1 0 4
## 6 1 0 2
## No of Medications No of Care Units Type of Housing
## 1 0 1 1
## 2 1 2 1
## 3 0 0 1
## 4 0 1 1
## 5 2 1 1
## 6 0 5 1
## No of unfinished Educations Type of work Workinghours No of Interests
## 1 1 1 0 1
## 2 0 1 0 0
## 3 0 1 0 2
## 4 1 3 10 4
## 5 1 1 0 1
## 6 0 1 0 0
## No of Social Contacts Good Characteristics due to Autism
## 1 3 4
## 2 4 3
## 3 4 2
## 4 5 10
## 5 3 2
## 6 2 3
## Satisfaction: Given advice Satisfaction: Treatment
## 1 2 3.00
## 2 2 2.00
## 3 1 4.00
## 4 0 3.00
## 5 2 1.00
## 6 0 1.75
## Satisfaction: Medication Satisfaction: Care Satisfaction: Education
## 1 2.4000 2.120000 2.200
## 2 3.6270 1.815000 3.330
## 3 3.8126 1.656667 3.198
## 4 3.3300 1.785833 3.200
## 5 3.5000 2.000000 2.000
## 6 1.5000 1.998333 2.500
## Satisfaction: Work Satisfaction: Social Contacts Age
## 1 1.1700 2 -0.5285760
## 2 3.8600 3 -0.7209461
## 3 3.8000 2 -0.5926994
## 4 2.5924 2 0.3050276
## 5 3.2900 3 -0.7850694
## 6 3.7903 1 -1.1056862
mgm allows to estimate k-order MGMs. Here we are interested in fitting a pairwise MGM, and we therefore choose k = 2. In order to get a sparse graph, we use L1-penalized regression, which minimizes the negative log likelihood together with the L1 norm of the parameter vector. This penality is weighted by a parameter λ, which can be selected either using cross validation (lambdaSel = “CV”) or an information criterion, such as the Extended Bayesian Information Criterion (EBIC) (lambdaSel = “EBIC”). Here, we choose to use the EBIC with a hyper parameter of γ=0.25.
fit_ADS <- mgm(data = autism_data_large$data,
type = autism_data_large$type,
level = autism_data_large$level,
k = 2,
lambdaSel = 'EBIC',
lambdaGam = 0.25)
##
|
| | 0%
|
|-- | 4%
|
|----- | 7%
|
|------- | 11%
|
|--------- | 14%
|
|------------ | 18%
|
|-------------- | 21%
|
|---------------- | 25%
|
|------------------- | 29%
|
|--------------------- | 32%
|
|----------------------- | 36%
|
|-------------------------- | 39%
|
|---------------------------- | 43%
|
|------------------------------ | 46%
|
|-------------------------------- | 50%
|
|----------------------------------- | 54%
|
|------------------------------------- | 57%
|
|--------------------------------------- | 61%
|
|------------------------------------------ | 64%
|
|-------------------------------------------- | 68%
|
|---------------------------------------------- | 71%
|
|------------------------------------------------- | 75%
|
|--------------------------------------------------- | 79%
|
|----------------------------------------------------- | 82%
|
|-------------------------------------------------------- | 86%
|
|---------------------------------------------------------- | 89%
|
|------------------------------------------------------------ | 93%
|
|--------------------------------------------------------------- | 96%
|
|-----------------------------------------------------------------| 100%
## Note that the sign of parameter estimates is stored separately; see ?mgm
The fit function returns all estimated parameters and a weighted adjacency matrix. Here we use the qgraph package to visualize the weighted adjacency matrix. The separately provide the edge color for each edge, which indicates the sign of the edge-parmeter, if defined. We also provide a grouping of the variables and associated colors, both of which are contained in the data list autism_data_large.
qgraph(fit_ADS$pairwise$wadj,
layout = 'spring', repulsion = 1.3,
edge.color = fit_ADS$pairwise$edgecolor,
nodeNames = autism_data_large$colnames,
color = autism_data_large$groups_color,
groups = autism_data_large$groups_list,
legend.mode="style2", legend.cex=.4,
vsize = 3.5, esize = 15)
The layout is created using the Fruchterman-Reingold algorithm, which places nodes such that all the edges are of more or less equal length and there are as few crossing edges as possible. Green edges indicate positive relationships, red edges indicate negative relationships and grey edges indicate relationships involving categorical variables for which no sign is defined. The width of the edges is proportional to the absolute value of the edge-parameter. The node color maps to the different domains Demographics, Psychological, Social Environment and Medical.
We observe, for instance, a strong positive relationship between age and age of diagnosis, which makes sense because the two variables are logically connected (one cannot be diagnosed before being born).The negative relationship between number of unfinished educations and satisfaction at work seems plausible, too. Well-being is strongly connected in the graph, with the strongest connections to satisfaction with social contacts and integration in society. These three variables are categorical variables with 5, 3 and 3 categories, respectively.
In order to investigate the exact nature of the interaction, one needs to look up all parameters in fit_ADS\(rawfactor\)indicator and fit_ADS\(rawfactor\)weights.
# head fit_ADS$rawfactor$indicator
# head(fit_ADS$rawfactor$weights)