2  Data

2.1 Data Loading and Preprocessing

  • Assign descriptive column names matching the project proposal’s features.

  • Convert the target ‘CreditRisk’ to a two‐level factor (“Bad” vs “Good”), focusing on the “Bad” class as the event of interest.

  • Identify categorical columns and convert them into unordered factors with simplified level labels (L1, L2, …) to prepare for modeling.

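library(knitr)       # report rendering
library(kableExtra)  # table styling
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)
library(caret)       # data partitioning and model training
library(rpart)       # decision trees
library(rpart.plot)  # decision tree plots
library(DALEX)       # model explanation
library(lime)        # local explanations
library(ggplot2)     # plotting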
# Load the German credit dataset from the UCI repository
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data"
data <- read.table(url, header = FALSE, stringsAsFactors = FALSE, strip.white = TRUE)

# Assign descriptive column names
col_names <- c(
  "Status", "Duration", "CreditHistory", "Purpose", "CreditAmount", "Savings",
  "EmploymentDuration", "InstallmentRate", "PersonalStatus", "OtherDebtors",
  "ResidenceDuration", "Property", "Age", "OtherInstallmentPlans", "Housing",
  "ExistingCreditsCount", "Job", "Dependents", "Telephone", "ForeignWorker", "CreditRisk"
)
names(data) <- col_names

# Convert CreditRisk to a factor with levels Bad (2) and Good (1)
data$CreditRisk <- factor(data$CreditRisk, 
                         levels = c(2, 1), 
                         labels = c("Bad", "Good"))  # 'Bad' is the class of interest

# Specify which columns are categorical
categorical_cols <- c(
  "Status", "CreditHistory", "Purpose", "Savings", "EmploymentDuration",
  "InstallmentRate", "PersonalStatus", "OtherDebtors", "ResidenceDuration",
  "Property", "OtherInstallmentPlans", "Housing", "ExistingCreditsCount",
  "Job", "Telephone", "ForeignWorker"
)

# Convert these columns to factors with simplified level names
data[categorical_cols] <- lapply(data[categorical_cols], function(x) {
  x <- as.factor(x)
  levels(x) <- paste0("L", seq_along(levels(x)))  # Simplify factor levels to L1, L2, …
  return(x)
})
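Relabeling the levels to L1, L2, … discards the original attribute codes, so it can help to record the mapping before overwriting them. The following is a minimal sketch on a toy vector, not part of the pipeline above; `purpose` and `level_map` are hypothetical names, and the values are a handful of the dataset's "A.." style codes chosen for illustration.

```r
# Toy stand-in for one categorical column (hypothetical sample values)
purpose <- c("A43", "A40", "A410", "A41", "A43", "A42")

f <- as.factor(purpose)  # levels are the sorted unique codes

# Record original code -> simplified label before overwriting the levels
level_map <- data.frame(
  Original   = levels(f),
  Simplified = paste0("L", seq_along(levels(f))),
  stringsAsFactors = FALSE
)

levels(f) <- level_map$Simplified  # same relabeling as in the preprocessing step

print(level_map)
print(as.character(f))
```

Because the levels are sorted as character strings, A410 lands between A41 and A42, so the L-numbers do not necessarily follow the numeric order of the codes. Keeping a table like `level_map` per column makes it possible to translate model output back to the original codes.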

2.1.1 Summary of data after preprocessing
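The summary output itself is not reproduced in this section. As a rough sketch of how such a summary could be produced, the example below tabulates each column's type and number of factor levels, then the class balance of the target. The frame `toy` and the table `col_summary` are hypothetical stand-ins for the preprocessed `data`, with made-up values.

```r
# Hypothetical toy frame standing in for the preprocessed `data`
toy <- data.frame(
  Status     = factor(c("L1", "L2", "L1", "L4")),
  Duration   = c(6L, 48L, 12L, 42L),
  CreditRisk = factor(c("Good", "Bad", "Good", "Good"), levels = c("Bad", "Good"))
)

# One row per column: its type and, for factors, the number of levels
col_summary <- data.frame(
  Column  = names(toy),
  Type    = vapply(toy, function(x) class(x)[1], character(1)),
  NLevels = vapply(toy, function(x) if (is.factor(x)) nlevels(x) else NA_integer_,
                   integer(1)),
  row.names = NULL
)
print(col_summary)

# Class balance of the target
class_balance <- prop.table(table(toy$CreditRisk))
print(class_balance)
```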

2.2 Data Splitting

Split data into 70% training and 30% testing to evaluate model generalization.

  • Use stratified sampling to preserve class ratios.

  • Remove any factor columns in training with only one level to avoid zero‐variance predictors.

  • Ensure test set uses the same predictor columns as training.

set.seed(2023)
train_idx <- createDataPartition(data$CreditRisk, 
                                p = 0.7, 
                                list = FALSE,
                                times = 1)
train_data <- data[train_idx, ]
test_data <- data[-train_idx, ]

# Flag columns to keep: every non-factor, plus factors with more than one observed level
has_multiple_levels <- function(df) {
  vapply(df, function(x) {
    !is.factor(x) || length(unique(x)) > 1
  }, logical(1))
}

valid_cols <- has_multiple_levels(train_data)
train_data <- train_data[, valid_cols, drop = FALSE]
test_data <- test_data[, colnames(train_data), drop = FALSE]  # Keep feature set consistent

# Validate class distribution
cat("Training set distribution:\n")
prop.table(table(train_data$CreditRisk)) 
cat("\nTest set distribution:\n")
prop.table(table(test_data$CreditRisk))

saveRDS(train_data, "train_data.rds")
saveRDS(test_data,  "test_data.rds")
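To illustrate what stratified sampling means here, the sketch below draws 70% of the rows within each class using base R only. It is an illustration of the idea, not caret's implementation of `createDataPartition`; the helper `stratified_idx` is hypothetical, and the 300 Bad / 700 Good target vector is an assumed stand-in for `data$CreditRisk`.

```r
set.seed(2023)

# Assumed stand-in for the target: 300 "Bad" and 700 "Good" records
y <- factor(rep(c("Bad", "Good"), times = c(300, 700)), levels = c("Bad", "Good"))

# Sample a fraction p of the row indices within each class, then combine
stratified_idx <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_idx(y, p = 0.7)

# Both subsets keep the class ratio of the full vector
print(prop.table(table(y[train_idx])))
print(prop.table(table(y[-train_idx])))
```

Sampling within each class is what keeps the share of "Bad" cases the same in both subsets; a plain random split would only match it on average.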

2.3 Description

2.3.1 Origin & File

File         Description
german.data  Original dataset provided by Prof. Hofmann, containing categorical (symbolic) and numerical attributes for credit-risk prediction.

2.3.2 Instances

Total Records  Description
1,000          Each record is one loan applicant's profile

2.3.3 Predictor Variables

No Feature
1 Status of existing checking account
2 Duration in months
3 Credit history
4 Purpose of the loan
5 Credit amount
6 Savings account / bonds
7 Present employment duration
8 Installment rate (% of disposable income)
9 Personal status and sex
10 Other debtors / guarantors
11 Present residence duration
12 Property ownership
13 Age in years
14 Other installment plans
15 Housing type
16 Number of existing credits at this bank
17 Job category (ordered)
18 Number of dependents
19 Telephone availability (yes/no)
20 Foreign worker status (yes/no)
21 CreditRisk (target variable, not a predictor: Good vs. Bad)
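The table can be cross-checked against the preprocessing code in Section 2.1. The sketch below repeats the `col_names` and `categorical_cols` vectors from that section and derives the columns left as numeric; `target` and `numeric_cols` are names introduced here for illustration.

```r
# Column names and categorical columns, copied from Section 2.1
col_names <- c(
  "Status", "Duration", "CreditHistory", "Purpose", "CreditAmount", "Savings",
  "EmploymentDuration", "InstallmentRate", "PersonalStatus", "OtherDebtors",
  "ResidenceDuration", "Property", "Age", "OtherInstallmentPlans", "Housing",
  "ExistingCreditsCount", "Job", "Dependents", "Telephone", "ForeignWorker", "CreditRisk"
)
categorical_cols <- c(
  "Status", "CreditHistory", "Purpose", "Savings", "EmploymentDuration",
  "InstallmentRate", "PersonalStatus", "OtherDebtors", "ResidenceDuration",
  "Property", "OtherInstallmentPlans", "Housing", "ExistingCreditsCount",
  "Job", "Telephone", "ForeignWorker"
)
target <- "CreditRisk"  # illustrative name for the outcome column

# Every categorical column must be one of the named columns
stopifnot(all(categorical_cols %in% col_names))

# Whatever is neither categorical nor the target stays numeric
numeric_cols <- setdiff(col_names, c(categorical_cols, target))
print(numeric_cols)
print(length(col_names) - 1)  # number of predictors
```

With the vectors as written, the preprocessing treats 16 of the 20 predictors as factors and leaves Duration, CreditAmount, Age, and Dependents numeric.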