class: center, middle, inverse, title-slide

.title[
# Categorical Analysis and Clustering
]
.subtitle[
## Case Study: Expatriated U.S. Veterans
]
.author[
### Angelo Saporito
]
.date[
### 2024-02-12
]

---
class:inverse4, top
<h1 align="center"> Overview</h1>

.pull-left[
- Agenda and Contents
  - Gain a sense of familiarity with the data by analyzing its structure and nuances
  - Analyze the research questions presented using various similarity and dissimilarity measures
    - Chi-Squared test and `\(\phi\)`-coefficient
  - Use multiple unsupervised machine learning algorithms to cluster the data:
    - K-modes clustering for categorical data
    - Multiple Correspondence Analysis (MCA)
    - Analysis of dimensions and components
    - Individual clustering
]

.pull-right[
- We are researching various characteristics of U.S. veterans living abroad. Some research questions include:
  - Is there a relationship between why an expatriated veteran moved abroad and why they might repatriate?
  - Is there a relationship between why a veteran might repatriate and where/how they access medical care? And their employment status?
  - Is there a relationship between why a veteran expatriated and their first year of active service?
  - Is there a relationship between where the veteran expatriated to (location) and their age/first year of active service?
]

---
class: inverse1
<h1 align="center"> Data Exploration</h1>
- This is a subset of our data. Our full dataset contains 160 observations and 39 variables covering demographics, socioeconomic status, reasons for expatriation and (possible) repatriation, medical access, VA benefits, etc.
- We will explore the relationships within a specific subset of these variables in the next sections.

---
class: inverse1
<h2 align="center"> Research Question and Methods of Analysis</h2>

#### Relationship between *expatriation* and *repatriation*?

- We are interested in the first research question: is there a relationship between why a veteran moved abroad *vis-à-vis* why they may repatriate to the U.S.?
- If there is a relationship between specific variables, there may be a story to be told, e.g., if expatriating due to financial burden is associated with repatriating due to medical care.

#### We will test the relationship using two different methods:

.pull-left[
1. Chi-Squared test: a statistical test used to determine whether there is a significant association between two categorical variables.
   - `\(\chi^2 = \sum_i \frac{(O_i-E_i)^2}{E_i}\)`, where `\(O_i\)` is the observed count and `\(E_i\)` is the expected count in cell `\(i\)` of the contingency table.
]

.pull-right[
2. `\(\phi\)`-coefficient: a measure of association between two binary variables in a contingency table.
   - `\(\phi = \frac{AD-BC}{\sqrt{(A+B)(C+D)(A+C)(B+D)}}\)`, where `\(A, B, C, D\)` are the frequencies of the four possible outcomes in a 2×2 contingency table.
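
As a small worked example of the two formulas on this slide, the sketch below computes `\(\chi^2\)` and `\(\phi\)` for a single 2×2 table. It is written in Python purely for illustration and is not the code that produced our tables; the counts and the function names `chi_squared_2x2` and `phi_coefficient` are hypothetical.

```python
from math import sqrt

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    observed = [a, b, c, d]
    # Expected count of a cell = (row total * column total) / n.
    expected = [
        (a + b) * (a + c) / n,
        (a + b) * (b + d) / n,
        (c + d) * (a + c) / n,
        (c + d) * (b + d) / n,
    ]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def phi_coefficient(a, b, c, d):
    """Phi coefficient for the same 2x2 table (all margins must be non-zero)."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts, not taken from the survey data.
a, b, c, d = 30, 10, 20, 40
n = a + b + c + d
chi2 = chi_squared_2x2(a, b, c, d)
phi = phi_coefficient(a, b, c, d)
print(round(chi2, 3), round(phi, 3))
```

For a 2×2 table the two quantities are linked by `\(\phi^2 = \chi^2 / n\)`, so `\(\phi\)` can be read as the chi-squared statistic rescaled by sample size, with a sign added.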
] --- class: inverse1 <h2 align="center"> Test Output and Analysis: Chi-Squared Test</h2> .pull-center[ <table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Retus01 </th> <th style="text-align:right;"> Retus02 </th> <th style="text-align:right;"> Retus03 </th> <th style="text-align:right;"> Retus04 </th> <th style="text-align:right;"> Retus05 </th> <th style="text-align:right;"> Retus06 </th> <th style="text-align:right;"> Retus07 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Whyab01 </td> <td style="text-align:right;"> 0.7082941 </td> <td style="text-align:right;"> 0.1079346 </td> <td style="text-align:right;"> 0.2751144 </td> <td style="text-align:right;"> 0.9511464 </td> <td style="text-align:right;"> 0.8292729 </td> <td style="text-align:right;"> 0.0601336 </td> <td style="text-align:right;"> 0.0341690 </td> </tr> <tr> <td style="text-align:left;"> Whyab02 </td> <td style="text-align:right;"> 0.0263885 </td> <td style="text-align:right;"> 0.9070845 </td> <td style="text-align:right;"> 0.8616097 </td> <td style="text-align:right;"> 0.0832410 </td> <td style="text-align:right;"> 0.7565657 </td> <td style="text-align:right;"> 0.0933546 </td> <td style="text-align:right;"> 0.4928439 </td> </tr> <tr> <td style="text-align:left;"> Whyab03 </td> <td style="text-align:right;"> 0.6915149 </td> <td style="text-align:right;"> 1.0000000 </td> <td style="text-align:right;"> 0.9137980 </td> <td style="text-align:right;"> 0.4119961 </td> <td style="text-align:right;"> 1.0000000 </td> <td style="text-align:right;"> 0.1726708 </td> <td style="text-align:right;"> 0.4778591 </td> </tr> <tr> <td style="text-align:left;"> Whyab04 </td> <td style="text-align:right;"> 0.4001912 </td> <td style="text-align:right;"> 0.3091081 </td> <td style="text-align:right;"> 0.7041108 </td> <td style="text-align:right;"> 1.0000000 </td> <td 
style="text-align:right;"> 1.0000000 </td> <td style="text-align:right;"> 0.8776752 </td> <td style="text-align:right;"> 0.1631868 </td> </tr> <tr> <td style="text-align:left;"> Whyab05 </td> <td style="text-align:right;"> 0.5758281 </td> <td style="text-align:right;"> 0.5539428 </td> <td style="text-align:right;"> 0.7402693 </td> <td style="text-align:right;"> 0.5766460 </td> <td style="text-align:right;"> 0.1689887 </td> <td style="text-align:right;"> 0.0198734 </td> <td style="text-align:right;"> 0.6019206 </td> </tr> <tr> <td style="text-align:left;"> Whyab06 </td> <td style="text-align:right;"> 1.0000000 </td> <td style="text-align:right;"> 1.0000000 </td> <td style="text-align:right;"> 0.4439387 </td> <td style="text-align:right;"> 0.5763638 </td> <td style="text-align:right;"> 1.0000000 </td> <td style="text-align:right;"> 0.9381726 </td> <td style="text-align:right;"> 0.4948278 </td> </tr> </tbody> </table> ]

.pull-left[
#### Interpretation of Significant Results

Three pairs have table values below 0.05, taken here as the significance threshold:

- Whyab02 ("I have closer personal ties") and Retus01 ("Family ties"), 0.026: This result suggests that individuals who originally expatriated from the US because of closer personal ties abroad are more likely to consider returning because of family ties.
]

.pull-right[
- Whyab01 ("I have closer business ties") and Retus07 ("Nothing"), 0.034: This result suggests that individuals who initially left the US for business-related reasons might not find compelling reasons to return, indicating that their current situation is satisfactory enough to deter any desire to return to the US.
- Whyab05 ("To avoid the politics of the US") and Retus06 ("Patriotism"), 0.020: This finding suggests that some individuals who left the US to avoid its politics might still consider returning out of a sense of patriotism.
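
The screening step above can be sketched as follows, assuming the table entries are p-values of pairwise chi-squared tests on 2×2 tables (the slide does not label them, so this is an assumption). With one degree of freedom the upper-tail probability has the closed form `erfc(sqrt(x / 2))`. The function name `chi2_pvalue_2x2`, the `yates` switch, and the counts are hypothetical; whether a continuity correction was used for our table is not confirmed.

```python
from math import erfc, sqrt

def chi2_pvalue_2x2(a, b, c, d, yates=False):
    """p-value of the chi-squared independence test on [[a, b], [c, d]].

    yates=True applies the continuity correction (an optional variant,
    not confirmed to be what produced the table on this slide).
    """
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            diff = abs(observed[i][j] - expected)
            if yates:
                # Shrink each deviation by 0.5, but never below zero.
                diff = max(0.0, diff - 0.5)
            stat += diff ** 2 / expected
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    return erfc(sqrt(stat / 2))

# Hypothetical counts, not taken from the survey data.
p_assoc = chi2_pvalue_2x2(30, 10, 20, 40)              # strongly associated
p_indep = chi2_pvalue_2x2(10, 10, 20, 20)              # exactly independent
p_yates = chi2_pvalue_2x2(30, 10, 20, 40, yates=True)  # corrected variant
print(p_assoc < 0.05, p_indep)
```

Running such a function over every (Whyab, Retus) pair and keeping the pairs below 0.05 would reproduce the kind of shortlist interpreted here.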
] --- class: inverse1 <h2 align="center"> Test Output and Analysis: `\(\phi\)`-coefficient</h2> .pull-center[ <table class="table" style="font-size: 16px; width: auto !important; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> Retus01 </th> <th style="text-align:right;"> Retus02 </th> <th style="text-align:right;"> Retus03 </th> <th style="text-align:right;"> Retus04 </th> <th style="text-align:right;"> Retus05 </th> <th style="text-align:right;"> Retus06 </th> <th style="text-align:right;"> Retus07 </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Whyab01 </td> <td style="text-align:right;"> 0.0659839 </td> <td style="text-align:right;"> 0.1887060 </td> <td style="text-align:right;"> 0.2013191 </td> <td style="text-align:right;"> 0.0435924 </td> <td style="text-align:right;"> 0.0852369 </td> <td style="text-align:right;"> 0.1952357 </td> <td style="text-align:right;"> 0.2036533 </td> </tr> <tr> <td style="text-align:left;"> Whyab02 </td> <td style="text-align:right;"> 0.1929868 </td> <td style="text-align:right;"> 0.0387546 </td> <td style="text-align:right;"> 0.0689082 </td> <td style="text-align:right;"> 0.1555094 </td> <td style="text-align:right;"> 0.0571834 </td> <td style="text-align:right;"> 0.1549963 </td> <td style="text-align:right;"> 0.0715660 </td> </tr> <tr> <td style="text-align:left;"> Whyab03 </td> <td style="text-align:right;"> 0.0494244 </td> <td style="text-align:right;"> 0.0095496 </td> <td style="text-align:right;"> 0.0656102 </td> <td style="text-align:right;"> 0.0840742 </td> <td style="text-align:right;"> 0.0185997 </td> <td style="text-align:right;"> 0.1309301 </td> <td style="text-align:right;"> 0.0740661 </td> </tr> <tr> <td style="text-align:left;"> Whyab04 </td> <td style="text-align:right;"> 0.0823445 </td> <td style="text-align:right;"> 0.1072113 </td> <td style="text-align:right;"> 0.0800641 </td> <td style="text-align:right;"> 0.0000000 </td> <td 
style="text-align:right;"> 0.0237289 </td> <td style="text-align:right;"> 0.0324486 </td> <td style="text-align:right;"> 0.1259882 </td> </tr> <tr> <td style="text-align:left;"> Whyab05 </td> <td style="text-align:right;"> 0.0580538 </td> <td style="text-align:right;"> 0.0701862 </td> <td style="text-align:right;"> 0.0698857 </td> <td style="text-align:right;"> 0.0588490 </td> <td style="text-align:right;"> 0.1346301 </td> <td style="text-align:right;"> 0.2018044 </td> <td style="text-align:right;"> 0.0549857 </td> </tr> <tr> <td style="text-align:left;"> Whyab06 </td> <td style="text-align:right;"> 0.0079803 </td> <td style="text-align:right;"> 0.0189103 </td> <td style="text-align:right;"> 0.1008713 </td> <td style="text-align:right;"> 0.0577600 </td> <td style="text-align:right;"> 0.0119583 </td> <td style="text-align:right;"> 0.0224847 </td> <td style="text-align:right;"> 0.0666667 </td> </tr> </tbody> </table> ]

.pull-left[
#### Phi-coefficient Interpretation

The phi coefficient, also known as the phi correlation coefficient, measures the association between two binary variables. It ranges from -1 to 1, where:

- 1: the variables are perfectly associated in the same direction.
- 0: there is no association between the variables.
- -1: the variables are perfectly associated in opposite directions.
]

.pull-right[
All three pairs flagged by the Chi-Squared test show only a weak association:

1. `\(\phi \approx 0.19\)` between expatriating due to personal ties (Whyab02) and repatriating due to family ties (Retus01).
2. `\(\phi \approx 0.20\)` between expatriating due to business ties (Whyab01) and having no reason to repatriate (Retus07).
3. `\(\phi \approx 0.20\)` between expatriating to avoid U.S. politics (Whyab05) and repatriating due to patriotism (Retus06).
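
To make the -1, 0, 1 scale concrete, here is a minimal Python sketch that computes `\(\phi\)` directly from two 0/1 response vectors. The vectors and the helper name `phi_from_binary` are hypothetical and chosen so that the three reference cases appear exactly.

```python
from math import sqrt

def phi_from_binary(x, y):
    """Phi coefficient of two equal-length 0/1 sequences.

    Each variable must take both values at least once,
    otherwise a margin is zero and phi is undefined.
    """
    pairs = list(zip(x, y))
    a = pairs.count((1, 1))  # both reasons selected
    b = pairs.count((1, 0))  # only the first selected
    c = pairs.count((0, 1))  # only the second selected
    d = pairs.count((0, 0))  # neither selected
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

x = [1, 1, 0, 0]
same = phi_from_binary(x, [1, 1, 0, 0])       # identical answers
opposite = phi_from_binary(x, [0, 0, 1, 1])   # mirrored answers
unrelated = phi_from_binary(x, [1, 0, 1, 0])  # no pattern
print(same, opposite, unrelated)
```

Values near 0.2, like ours, sit much closer to the "no pattern" case than to either extreme.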
]

---
class: inverse1
<h2 align="center">K-Modes Clustering</h2>

.pull-left[
<img src="ClusterPresentation_files/figure-html/unnamed-chunk-5-1.png" width="100%" />

- Unlike k-means, which computes each cluster center as the mean of numeric attributes, k-modes uses the mode (most frequent value) of each categorical attribute.
- It assigns each data point to the cluster whose mode is most similar to it, with the goal of minimizing the dissimilarity within clusters.
]

.pull-right[
Let's denote:

- `\(X=\{x_1, x_2,...,x_n\}\)` as the dataset, where each `\(x_i\)` is a categorical data point with `\(m\)` attributes.
- `\(C=\{c_1,c_2,...,c_k\}\)` as the set of cluster modes, where `\(c_j\)` holds the most frequent category of each attribute in cluster `\(j\)`.

The k-modes algorithm:

1. Initialize `\(k\)` cluster modes `\(c_1, c_2,..., c_k\)`, for example by selecting `\(k\)` distinct data points.
2. Assign each `\(x_i\)` to the nearest cluster mode `\(c_j\)` under the dissimilarity measure (typically simple matching: the number of attributes on which `\(x_i\)` and `\(c_j\)` differ).
3. Update each cluster mode `\(c_j\)` by taking, attribute by attribute, the mode of the data points assigned to cluster `\(j\)`.
4. Repeat steps 2 and 3 until the assignments no longer change or another convergence criterion is met.
]

---
class: inverse1
<h2 align="center">Multiple Correspondence Analysis (MCA)</h2>

.pull-left[
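
The cos2 measure discussed on this slide can be sketched in a few lines of Python, assuming we already have one category's coordinates on every MCA dimension. The coordinates and the helper name `cos2` are hypothetical; an MCA package would normally report these values directly.

```python
def cos2(coords):
    """Squared cosines of one category, given its coordinates on all dimensions.

    cos2 on dimension k = coord_k^2 / sum of squared coordinates,
    so the values lie in [0, 1] and sum to 1 across all dimensions.
    """
    total = sum(c * c for c in coords)
    return [c * c / total for c in coords]

# Hypothetical coordinates of one category on three dimensions.
values = cos2([0.8, 0.1, -0.3])
print([round(v, 3) for v in values])
```

A category whose cos2 is concentrated on the plotted dimensions is well represented in the plot; one whose cos2 is spread over later dimensions is not.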
]

.pull-right[
- MCA takes a dataset of categorical variables and represents it in a lower-dimensional space.
- MCA provides insight into the relationships between categories of different variables and identifies patterns of association between categories.
- cos2 (squared cosine) measures the quality of representation of a category on the dimensions obtained from MCA (this is our *coloring* or *gradient*).
- cos2 values range between 0 and 1, where 0 indicates that the category is poorly represented on the dimension, and 1 indicates perfect representation.
- Higher cos2 values therefore indicate that the category is well represented on the dimension.
]

---
class: inverse1
<h2 align="center">Components: K-Modes and MCA</h2>

.pull-left[
<img src="ClusterPresentation_files/figure-html/unnamed-chunk-8-1.png" width="80%" height="80%" />
]

.pull-right[
#### Explanation of Components

- Unlike K-Modes, which assigns observations to clusters, MCA constructs a set of synthetic variables, called dimensions or components, which capture the underlying structure of the categorical data.
- Components in MCA are constructed based on the association between categories of different variables. They are linear combinations of the indicator-coded categories of the original variables and represent patterns of association in the data.
- In our case, Component 1 explains 8.9% of the variability in the whole dataset and Component 2 explains 4.6%, for 13.5% together.
]

---
class: inverse1
<h2 align="center">References</h2>

- [Husson, F., Josse, J., & Mazet, J. (n.d.). MCA: Multiple Correspondence Analysis (MCA). RDocumentation. https://www.rdocumentation.org/packages/FactoMineR/versions/2.9/topics/MCA](https://www.rdocumentation.org/packages/FactoMineR/versions/2.9/topics/MCA)
- [Huang, Z. (1997) A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining. In KDD: Techniques and Applications (H. Lu, H. Motoda and H.
Luu, Eds.), pp. 21-34, World Scientific, Singapore.](https://vladestivill-castro.net/teaching/kdd.d/readings.d/huang97fast.pdf) - [Kassambara. (2017, September 24). MCA - multiple correspondence analysis in R: Essentials. STHDA. http://sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/114-mca-multiple-correspondence-analysis-in-r-essentials](http://sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/114-mca-multiple-correspondence-analysis-in-r-essentials) - [Neumann, C., & Szepannek, G. K-modes clustering. RDocumentation. https://search.r-project.org/CRAN/refmans/klaR/html/kmodes.html](https://search.r-project.org/CRAN/refmans/klaR/html/kmodes.html)
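
---
class: inverse1
<h2 align="center">Appendix: K-Modes Sketch</h2>

The four k-modes steps from the clustering slide can be sketched in plain Python as below. This is a minimal illustration, not the implementation used for our plots: it assumes simple matching as the dissimilarity, takes the first `k` distinct records as initial modes (one simple choice among several), and runs on hypothetical toy records. The names `matching_distance` and `k_modes` are made up for this sketch.

```python
from collections import Counter

def matching_distance(x, mode):
    """Simple matching dissimilarity: number of attributes that differ."""
    return sum(1 for xi, mi in zip(x, mode) if xi != mi)

def k_modes(data, k, max_iter=100):
    """Cluster tuples of categorical values; returns (labels, modes)."""
    # Step 1: use the first k distinct records as initial modes
    # (an assumption of this sketch; random distinct records are also common).
    modes = []
    for row in data:
        if row not in modes:
            modes.append(row)
        if len(modes) == k:
            break
    if len(modes) < k:
        raise ValueError("need at least k distinct records")

    labels = None
    for _ in range(max_iter):
        # Step 2: assign each record to its nearest mode.
        new_labels = [
            min(range(k), key=lambda j: matching_distance(row, modes[j]))
            for row in data
        ]
        # Step 4: stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each mode attribute by attribute.
        for j in range(k):
            members = [row for row, lab in zip(data, labels) if lab == j]
            if members:  # keep the old mode if a cluster is empty
                modes[j] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*members)
                )
    return labels, modes

# Hypothetical toy records, not survey data.
data = [
    ("yes", "no", "abroad"),
    ("yes", "no", "abroad"),
    ("yes", "yes", "abroad"),
    ("no", "yes", "home"),
    ("no", "yes", "home"),
    ("no", "no", "home"),
]
labels, modes = k_modes(data, k=2)
print(labels)
print(modes)
```

Because the result depends on the initial modes, practical runs are usually repeated from several starting points and compared by total within-cluster dissimilarity.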