Application of LIME and SHAP to Credit Scoring Models

Authors

Ruoqi Li (rl3489)

Yibing Liu (yl5714)

Published

May 5, 2025

1 Introduction

This project builds interpretable predictive models for credit risk assessment using the German Credit dataset from the UCI Machine Learning Repository. The goal is to classify loan applicants as “Good” or “Bad” credit risks while handling class imbalance and keeping the models interpretable enough to support transparent decision-making.

Specifically, we conduct the following key steps:

1.1 Data Preparation

Loading the dataset and assigning descriptive column names.
Converting the target variable (CreditRisk) into a binary factor for clear interpretation.
Transforming categorical variables into factors with simplified level labels to enhance modeling efficiency.
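
The target recoding step can be illustrated with a minimal Python sketch. It assumes the raw UCI coding of the target (1 for good, 2 for bad); the Duration column and the sample rows are hypothetical and serve only as illustration.

```python
# Raw UCI coding of the target: 1 = good credit risk, 2 = bad credit risk.
RISK_LABELS = {1: "Good", 2: "Bad"}


def recode_target(rows, column="CreditRisk"):
    """Return copies of the rows with the numeric target replaced by a label."""
    recoded = []
    for row in rows:
        new_row = dict(row)  # copy so the raw data stays untouched
        new_row[column] = RISK_LABELS[row[column]]
        recoded.append(new_row)
    return recoded


# Hypothetical rows, for illustration only.
sample = [
    {"Duration": 6, "CreditRisk": 1},
    {"Duration": 48, "CreditRisk": 2},
]
recoded_sample = recode_target(sample)
```

Keeping the raw rows unchanged makes it easy to check the recoded labels against the original codes.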

1.2 Data Splitting and Validation

Creating stratified training (70%) and testing (30%) datasets so that each subset preserves the class proportions of the original data.
Ensuring feature consistency by removing variables with insufficient variability.
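
As a rough illustration of the stratified 70/30 split, the sketch below shuffles the row indices within each class and cuts each class at the same fraction. The function name, the seed, and the use of plain Python are assumptions made for this example. The labels mirror the 700 good and 300 bad cases in the German Credit data.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.7, seed=42):
    """Return (train_idx, test_idx) with every class cut at train_fraction."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # randomize order within the class
        cut = round(len(indices) * train_fraction)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


labels = ["Good"] * 700 + ["Bad"] * 300
train_idx, test_idx = stratified_split(labels)
```

Because the cut is made per class, the share of “Bad” cases is the same in the training set, the testing set, and the full data.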

1.3 Modeling Approaches

Logistic Regression: Applying class-weighting to manage imbalance, assessing multicollinearity via VIF, and evaluating feature importance.
Decision Tree: Training and pruning a CART model for simplicity and ease of interpretation.
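
The class-weighting idea can be sketched as follows. The exact weighting scheme is not stated in this section, so the inverse-frequency ("balanced") formula below is an assumption: each class receives the weight n_samples / (n_classes * class_count), which gives the minority class proportionally more influence during fitting.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Class counts of a 70% stratified training set (490 good, 210 bad).
weights = balanced_class_weights(["Good"] * 490 + ["Bad"] * 210)
```

Under this scheme both classes carry the same total weight, so the fitted model is no longer pulled toward the majority class by sample count alone.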

1.4 Performance Evaluation

Evaluating models using metrics including Accuracy, Sensitivity, Specificity, and ROC-AUC.
Conducting comparative ROC analyses to select the best-performing model.
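
The metrics named above can be computed from first principles, as in the sketch below. Treating “Bad” as the positive class and using a 0.5 cut-off are assumptions of this example, and the five scores are made up. ROC-AUC is computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half.

```python
def confusion_metrics(actual, predicted, positive="Bad"):
    """Accuracy, sensitivity, and specificity from paired class labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def roc_auc(actual, scores, positive="Bad"):
    """Share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for a, s in zip(actual, scores) if a == positive]
    neg = [s for a, s in zip(actual, scores) if a != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up predicted probabilities of the "Bad" class.
actual = ["Bad", "Bad", "Good", "Good", "Good"]
scores = [0.9, 0.4, 0.6, 0.2, 0.1]
predicted = ["Bad" if s >= 0.5 else "Good" for s in scores]

metrics = confusion_metrics(actual, predicted)
auc = roc_auc(actual, scores)
```

Accuracy, sensitivity, and specificity depend on the chosen cut-off, whereas ROC-AUC summarizes ranking quality across all cut-offs, which is why it suits model comparison.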

1.5 Interpretability Analysis

Employing SHAP-based global and local explanations (DALEX) to understand logistic regression predictions.
Utilizing LIME to provide intuitive local explanations for both logistic regression and decision tree predictions.
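
To give some intuition for a local SHAP explanation of a logistic regression, the sketch below uses a simplification: if the features are assumed independent, the SHAP value of feature j on the log-odds scale reduces to coefficient_j * (x_j - mean_j). The coefficients, feature names, and applicant values are hypothetical, and DALEX's own estimates are computed differently and need not match these numbers.

```python
def linear_shap(coefficients, feature_means, instance):
    """Per-feature contributions beta_j * (x_j - mean_j) on the log-odds scale."""
    return {
        name: coefficients[name] * (instance[name] - feature_means[name])
        for name in coefficients
    }


def log_odds(intercept, coefficients, row):
    """Linear predictor of a logistic regression."""
    return intercept + sum(coefficients[name] * row[name] for name in coefficients)


# Hypothetical model and applicant, for illustration only.
intercept = -1.0
coefficients = {"Duration": 0.03, "Age": -0.02}
feature_means = {"Duration": 20.0, "Age": 35.0}
applicant = {"Duration": 48.0, "Age": 25.0}

contributions = linear_shap(coefficients, feature_means, applicant)
baseline = log_odds(intercept, coefficients, feature_means)
prediction = log_odds(intercept, coefficients, applicant)
```

The contributions sum to the difference between the applicant's log-odds and the baseline at the feature means. This additivity is the property that lets a local explanation be read as a breakdown of one prediction.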

1.6 Goal

Through these analyses, the project aims to deliver models that are both accurate and transparent, supporting reliable decision-making in credit risk management.