class: center, middle, inverse, title-slide

.title[
# STAT5003 Group 8-1 Presentation
]
.subtitle[
## Non-compliance behaviours of personal insolvency
]
.author[
### Shirui Wang, Yang Chang, Zihang Fu, Yusen Shi, Yu Xiang
]
.institute[
### RStudio, PBC
]
.date[
### 2025/10/29 (updated: 2025-10-27)
]

---

# Introduction

--

### Research question

--

- Predicting non-compliance behaviours in Australian personal insolvency cases.
- Goal: predict the type of non-compliance using early-stage data.
- Target: Non.Compliance.Type (7 categories).
- Dataset: AFSA 2007–2018, 356k records.
- Supports AFSA’s move to data-driven regulation.

--

### Why It Matters

--

- Enables proactive, data-driven regulation.
- Focuses AFSA’s effort on high-risk cases.
- Improves fairness and public trust.
- Saves cost and time, and reduces misconduct risk.
- Addresses a real national challenge.

---
class: top

<div style="margin-top: -0.4em; text-align:center; font-family:'Yanone Kaffeesatz', Times, serif; font-size: 2.5em;">
Process Data
</div>

--

<div style="font-family:'Yanone Kaffeesatz', Times, serif; font-size:1.8em; margin-bottom: 0.7em;">
1. Data Splitting
</div>

--

<div style="font-size: 1em; font-family:'Times New Roman', Times, serif;">
Split the dataset into 80% training and 20% testing using stratified sampling to maintain the class distribution.
</div>

``` r
library(rsample)  # provides initial_split()
split <- initial_split(data, prop = 0.8, strata = 'Non.Compliance.Type')
```

--

<div style="font-family:'Yanone Kaffeesatz', Times, serif; font-size:1.8em; margin-bottom: 0.4em;">
2. Handling missing values
</div>

--

<div style="font-size: 1em; font-family:'Times New Roman', Times, serif;">
2.1 Remove records with missing target labels.<br>
2.2 Impute missing categorical values with "Unknown" or "Inspection Error".<br>
2.3 Replace missing numeric values with 0.<br>
2.4 Recode imprisonment terms in the outcome column as 0.<br>
2.5 Apply the same imputation to both the training and test sets to avoid type errors.
</div>

``` r
## Debtor.Occupation.Code..ANZSCO.: 14.2%
## Debtor.Occupation.Name..ANZSCO.: 14.2%
## Type.of.Party: 0.13%
## Result.of.Non.Compliance: 4.54%
## Outcome.of.Non.Compliance: 27.2%
## Non.Compliance.Conviction.Result: 76.03%
```

---
class: top

<div style="margin-top: -0.4em; text-align:center; font-family:'Yanone Kaffeesatz', Times, serif; font-size: 2.5em;">
Process Data
</div>

--

<div style="font-family:'Yanone Kaffeesatz', Times, serif; font-size:1.8em; margin-bottom: 0.7em;">
3. Unified format
</div>

--

<div style="font-size: 1em; font-family:'Times New Roman', Times, serif; margin-bottom: 1.5em;">
Convert the “Outcome of Non-Compliance” column from text to numeric by removing symbols (e.g., “$”, commas) and applying as.numeric().
</div>

--

<div style="font-family:'Yanone Kaffeesatz', Times, serif; font-size:1.8em;">
4. Fixing errors
</div>

--

``` r
## Number of SA3.Code.of.Debtor found with errors: 29
## [1] 10 2 1
## Number of GCCSA.Code.of.Debtor found with errors: 29
## [1] "10" "2" "1"
## Number of Sex.of.Debtor found with errors: 4
## [1] "Not Stated"
## Number of Family.Situation found with errors: 165
## [1] "Not Stated"
## Number of Number.of.Instances found with errors: 16
## [1] 13 19 14 16 11
## Number of Outcome.of.Non.Compliance found with errors: 3
## [1] "5e+05" "3e+05" "1e+05"
```

<div style="font-size: 1em; font-family:'Times New Roman', Times, serif;">
Error rules were applied column by column to detect irregular records and ensure data quality. Most flagged values (e.g., “Not Stated”) were special cases, so they were left uncorrected and handled in the outlier analysis.
</div>

---
class: top

<div style="margin-top: -0.4em; text-align:center; font-family:'Yanone Kaffeesatz', Times, serif; font-size: 2.5em;">
Process Data
</div>

--

<div style="font-family:'Yanone Kaffeesatz', Times, serif; font-size:1.8em; margin-bottom: 0.7em;">
5.
Fixing outliers
</div>

--

<div style="display: flex; align-items: center;">
<div style="flex: 2; padding-right: 20px;">
<img src="data:image/png;base64,#photoes/Process Data/Figure1.jpg" width="200%" style="display: block; margin: auto auto auto 0;" />
</div>
<div style="flex: 1; font-size: 1em; font-family:'Times New Roman', Times, serif;">
Outliers were detected with the IQR method in the only two continuous variables; the few outliers in the former were removed, while extreme values in the latter were retained as important risk signals.
</div>
</div>

``` r
| Unique.ID | Year | SA3                    | State | Sex  | Occupation                              | Non.Compliance.Type | Instances | Outcome |
|-----------|------|------------------------|-------|------|-----------------------------------------|---------------------|-----------|---------|
| 3222793   | 2012 | Gungahlin              | ACT   | Male | Protective Service Workers              | Offence Referral    | 0         | 0       |
| 3182580   | 2016 | Gippsland - South West | VIC   | Male | Sales Representatives                   | Offence Referral    | 0         | 0       |
| 3116440   | 2011 | Tuggeranong            | ACT   | Male | Unknown                                 | Offence Referral    | 0         | 0       |
| 2646169   | 2010 | Belconnen              | ACT   | Male | Mobile Plant Operators                  | Offence Referral    | 0         | 0       |
| 3641920   | 2013 | Belconnen              | ACT   | Male | Legal, Social and Welfare Professionals | Info Request        | 1         | 0       |
```

---

# EDA

<div style="display:flex; justify-content:space-between; align-items:flex-start;">

<!-- Left column: figures -->
<div style="width:55%;">
<img src="data:image/png;base64,#photoes/EDA图像/Figure1.png" width="130%" /><br><br><br><img src="data:image/png;base64,#photoes/EDA图像/Figure2.png" width="130%" />
</div>

<!-- Right column: observations -->
<div style="width:48%; text-align:left; margin-top:-14px;">

Observation:
- Offence Referral (58.4%)
- Objection to Discharge (22.9%)
- Each remaining category accounts for less than 10%

<br><br><br>

Observation:
- Cases rose sharply between 2007 and 2009
- The overall trend is declining

</div>
</div>

---

# EDA

<div style="display:flex; justify-content:space-between; align-items:flex-start;">
<!-- Left column: figures -->
<div style="width:55%;">
<img src="data:image/png;base64,#photoes/EDA图像/Figure3.png" width="110%" /><img src="data:image/png;base64,#photoes/EDA图像/Figure4.png" width="110%" />
</div>

<!-- Right column: observations -->
<div style="width:48%; text-align:left; margin-top:-14px; font-size: 16px;">

Observation:
- Concentrated in NSW, VIC, QLD
- Predominantly Male
- Mostly Couple with Dependants or Single without Dependants
- Majority are non-business related

<br><br><br><br><br>

Observation:
- Type.of.Party is a highly predictive feature
- Official Trustees: high Offence Referral
- Debtors: prone to Complaint
- Practitioners: higher Inspection Error

</div>
</div>

---

# Model Selection

--

### Tree-based methods

--

- Make almost no assumptions about the data distribution or linear relationships
- Automatically learn nonlinearities and higher-order interactions
- Natively support K-class splits and probability outputs
- Handle mixed feature types and missing values well, and scale effectively

--

#### Decision Tree

--

- Great base learners for ensembles

--

#### Random Forest

--

- Reduces variance and improves stability through bagging: it combines multiple trees trained on bootstrap samples with random feature selection

--

#### CatBoost

--

- Ordered boosting and symmetric trees make it more resistant to overfitting and often deliver higher accuracy

---

# Model Selection

--

### Linear Model & Nonparametric Models

--

#### SVM

- Given the encoded categorical variables and complex class structure in our dataset, SVM is well-suited for improving class separation in high-dimensional space.

#### KNN

- Its simplicity and flexibility allow it to capture similarities across geographic, demographic, and financial features.
- With feature standardization and distance-weighted voting, KNN can also alleviate class imbalance and enhance minority-class recognition.

Neither model assumes a linear relationship in the data.

---

# Model Details

--

#### Best hyper-parameters of all models

--

<img src="data:image/png;base64,#photoes/best-parameters-2.png" width="85%" style="display: block; margin: auto;" />

.pull-left[
Pipeline

- Ran 5-fold CV
- Tuned hyperparameters via grid search within each training fold
- Reported Macro-F1 on validation folds
]

.pull-right[
Key findings

- Tuning CatBoost typically favors stronger L2 regularization on leaf values, lower learning rates, and shallower trees, all of which help prevent overfitting.
]

---

# Evaluation and Limitation

--

#### Evaluation

--

- Precision, recall, and F1 measure, respectively, the correctness of positive predictions, sensitivity, and the balance between the two.
- Macro-averaging gives each class equal weight, which prevents majority classes from dominating and provides a fairer assessment.

--

#### Performance results of all models

--

<img src="data:image/png;base64,#photoes/performance-results.png" width="75%" style="display: block; margin: auto;" />

--

#### Limitation

--

- KNN, Decision Tree, SVM, and CatBoost required substantial tuning time, around one hour per model.
- Despite applying class-balancing techniques, KNN, Decision Tree, and SVM still performed poorly on the minority classes.

---

# Conclusion

- CatBoost achieves the top macro score by capturing non-linear interactions and handling categorical features, enabling early triage and prioritisation at case commencement.
- Random Forest ranks second in performance: bagging reduces variance and resists noise, providing a reliable baseline for early guidance and resource allocation.
- SVM finds large-margin boundaries when the data is separable, but it is less interpretable and costly to tune, which makes it most useful as a structural separability check.
- KNN provides nearest-case explanations and handles irregular boundaries, but it is sensitive to scaling; it is best suited to communication and supporting evidence.
- Decision tree uses axis-aligned splits with limited expressiveness; boundaries are brittle and minority recall is low, making stable type-level decisions difficult.
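
---

# Appendix: Macro-F1 sketch

The models above are ranked by macro-averaged F1. As a minimal sketch of how that metric behaves, the base-R function below computes F1 per class and takes the unweighted mean. The function name `macro_f1` and the toy labels are hypothetical and are not taken from the project code or the AFSA data.

``` r
# Macro-F1: compute F1 separately for each class, then average with equal weight.
# Assumption: a class with no true positives is scored as F1 = 0.
macro_f1 <- function(truth, pred) {
  truth <- as.character(truth)
  pred  <- as.character(pred)
  classes <- union(unique(truth), unique(pred))
  f1 <- sapply(classes, function(k) {
    tp <- sum(pred == k & truth == k)  # correctly predicted as k
    fp <- sum(pred == k & truth != k)  # predicted k, actually another class
    fn <- sum(pred != k & truth == k)  # actually k, predicted another class
    if (tp == 0) return(0)
    precision <- tp / (tp + fp)
    recall    <- tp / (tp + fn)
    2 * precision * recall / (precision + recall)
  })
  mean(f1)
}

# Toy labels (hypothetical): class "A" is the majority class.
truth <- c("A", "A", "A", "A", "B", "B", "C")
pred  <- c("A", "A", "A", "B", "B", "A", "C")

macro_f1(truth, pred)  # per-class F1 is 0.75 (A), 0.50 (B), 1.00 (C); mean 0.75
mean(truth == pred)    # plain accuracy, 5 of 7 correct
```

Because every class carries the same weight, the weaker score on the small class "B" lowers the macro average as much as a weak score on the majority class would.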