EDA

Categorical Variables

Loan Purpose

Mosaic (Marital Status X Credit Risk X Gender)

Feature Selection

Continuous Variables

PCA

χ-squared

Conclusion

Supervised Learning Results

Post Mortem

Insights & Further Considerations

  • The majority of customers in this data set are male, single, own a house, and do ‘skilled’ work.
  • The data is not representative of female customers.
  • Loan Purpose, Gender, Marital Status, and Housing had significant relationships with Credit Risk. The relationship between Job & Credit Risk was not significant.
  • PC1 through PC5 explain 88.8% of the variance. However, removing ‘Years’ (which loads primarily on PC6) reduced the accuracy of every model by 1.0–2.5%
  • The most accurate model was RF(n = 850)*
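
The significance claims above come from χ-squared tests of independence between each categorical variable and Credit Risk. As a rough illustration of what such a test computes, here is a minimal sketch in Python (not the R code behind this dashboard). The function names and the 2×2 counts are hypothetical and are not taken from the data set; the p-value shortcut shown is only valid for one degree of freedom.

```python
import math


def chi_squared_statistic(table):
    """Return (statistic, degrees_of_freedom) for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


def p_value_one_dof(statistic):
    """Upper-tail p-value of a chi-squared statistic with exactly 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


# Hypothetical counts (NOT from the data set):
# rows = two groups of customers, columns = (bad risk, good risk).
example_table = [
    [30, 70],
    [50, 50],
]

statistic, dof = chi_squared_statistic(example_table)
p_value = p_value_one_dof(statistic)
print(f"chi-squared = {statistic:.3f}, dof = {dof}, p = {p_value:.4f}")
```

A p-value below the chosen threshold (commonly 0.05) is what "significant relationship" means in the bullets above. Tables larger than 2×2 have more degrees of freedom and need the general χ-squared distribution, which R's built-in test handles.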

What went well

  • I was able to start learning the basics of building interactive plots (plotly) and dashboards (flexdashboard) in R, and sharing R files
  • The R community & the open-source nature of the language were incredibly helpful

What didn’t

  • Needed to spend more time on the ‘80%’ of data analysis (data cleaning and preparation)
  • Naming scheme interfered with dashboard render
  • R provided an all-in-one package, but rendering the overall dashboard was difficult. Other intricacies, such as libraries & functions, were also problematic
  • Building Marimekko/Mosaic Plots in plotly & ggplot
  • Program stability

Future plans

  1. Define a clearer goal
  2. Feature engineering: dummy coding/one-hot encoding categorical variables, creating categories for Age, Checking, & Savings, and combining
  3. Fix current code
  4. Test the model with additional data
  5. Create dashboard slides that work together
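
The feature-engineering step in the plan above could look roughly like the following. This is a minimal Python sketch rather than R code from this project. The helper names, the housing values, and the age bin edges and labels are all assumptions made for illustration; the real cut points would need to be chosen from the data.

```python
import bisect


def one_hot(values, drop_first=False):
    """Encode a list of category labels as a list of 0/1 indicator dicts.

    With drop_first=True the first category (alphabetically) becomes the
    reference level and gets no column, which is dummy coding.
    """
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


def bin_age(age, edges=(25, 40, 60),
            labels=("under_25", "25_39", "40_59", "60_plus")):
    """Map a numeric age to a category label. Edges and labels are hypothetical."""
    # bisect_right counts how many edges are <= age, which is the bin index.
    return labels[bisect.bisect_right(edges, age)]


# Hypothetical values, NOT taken from the data set.
housing = ["own", "rent", "own", "free"]
ages = [22, 25, 47, 63]

one_hot_rows = one_hot(housing)                 # one column per category
dummy_rows = one_hot(housing, drop_first=True)  # reference level "free" dropped
age_groups = [bin_age(a) for a in ages]

for h, row, group in zip(housing, one_hot_rows, age_groups):
    print(h, row, group)
```

Dummy coding (dropping one reference level) avoids perfectly collinear columns in linear models, while tree-based models such as random forests generally work with either form.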