Jiannina Pinto - Data Science Gal

Image by Freepik

Introduction

In finance, making informed decisions is critical for both investors and lending institutions. Predictive models can provide valuable insights into risk, helping investors allocate capital more effectively. This analysis focuses on assessing the risk of loan defaults using historical LendingClub data, highlighting patterns that may indicate higher likelihoods of repayment issues.

Project Statement

The primary goal of the LendingClub project was to develop a predictive model to estimate the likelihood of loan defaults. By analyzing borrower and loan characteristics, the project aims to provide actionable insights to investors, helping them mitigate risk and maximize returns in peer-to-peer lending.

Data Collection

The dataset used for this analysis was sourced from Kaggle - Loan Data. It includes information on borrowers’ credit histories, income, loan details, and repayment status.

Here is the data dictionary:

Data Preprocessing

Before building the predictive models, the dataset underwent several preprocessing steps to ensure quality and usability:

Missing Value Handling: Missing values in key fields such as credit score, income, and loan purpose were either imputed with appropriate estimates or removed, depending on the extent of missingness.

Feature Engineering: New features were created to capture risk patterns better, including:

log_annual_inc: logarithm of annual income to normalize skewed distributions
multiple_hard_inq: number of recent hard credit inquiries
credit_policy: whether the borrower meets LendingClub’s credit policy criteria

Encoding Categorical Variables: Loan purpose, credit score category, and other categorical variables were encoded into numerical formats suitable for modeling.

Handling Class Imbalance: Since the dataset was dominated by fully paid loans, oversampling and class weighting were applied to improve the models’ ability to identify high-risk loans.

Train-Test Split: The processed data was divided into training and testing sets to evaluate model performance objectively.

This preprocessing ensured that the models could effectively learn patterns from the data and produce reliable predictions for loan default risk.
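
The steps above can be sketched in a few lines of pandas and scikit-learn. This is only an illustration on synthetic data, not the project's actual pipeline: the column names (annual_inc, purpose, not_fully_paid), the median imputation, and the 75/25 split are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data; names and values are illustrative only.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "annual_inc": rng.lognormal(mean=11, sigma=0.5, size=n),
    "purpose": rng.choice(["debt_consolidation", "credit_card", "educational"], size=n),
    "not_fully_paid": (rng.random(n) < 0.2).astype(int),
})
df.loc[df.index[::25], "annual_inc"] = np.nan  # inject a few missing incomes

# Missing value handling: impute income with the median (one possible choice).
df["annual_inc"] = df["annual_inc"].fillna(df["annual_inc"].median())

# Feature engineering: log-transform the right-skewed income column.
df["log_annual_inc"] = np.log(df["annual_inc"])

# Encoding: one-hot encode the categorical loan purpose.
X = pd.get_dummies(df.drop(columns=["not_fully_paid", "annual_inc"]), columns=["purpose"])
y = df["not_fully_paid"]

# Stratified split keeps the default rate similar in the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

Stratifying on the target matters here because defaults are the minority class; a plain random split could leave the test set with too few of them to evaluate.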

Model Selection and Evaluation

Three models were evaluated:

Logistic Regression
Random Forest
XGBoost

Initially, Logistic Regression and Random Forest were trained and evaluated on the imbalanced dataset, where fully paid loans (class 0) far outnumbered defaults (class 1). While these models achieved good overall accuracy, they struggled to correctly identify defaults, highlighting the challenge of class imbalance. This issue underscored the importance of using techniques such as oversampling, class weighting, or model tuning to improve predictive performance for the minority class.
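
To make the idea of class weighting concrete, here is a small sketch of how scikit-learn derives "balanced" weights. The 900/100 class split is a made-up example, not the actual distribution in the LendingClub data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 900 fully paid loans (0) and 100 defaults (1).
y = np.array([0] * 900 + [1] * 100)

# "balanced" weights follow n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
class_weight = dict(zip([0, 1], weights))
print(class_weight)  # class 0 gets about 0.556, class 1 gets 5.0
```

With these weights, each misclassified default costs the model nine times as much as a misclassified fully paid loan, which pushes it to pay attention to the minority class. Passing class_weight="balanced" to estimators such as LogisticRegression or RandomForestClassifier applies the same formula internally.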

Image: Distribution of the target feature "not fully paid"

Next, we applied cross-validation on balanced datasets, using both oversampling and undersampling techniques. This adjustment significantly improved classification performance for the minority class. The classification reports for the models showed better recall and F1-scores for predicting defaults, allowing us to more effectively identify high-risk loans. These are the classification reports for the models:
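
For readers who want to try balanced cross-validation themselves, the sketch below shows one way to do it. It is an assumption-laden example rather than the project's code: it uses synthetic data, plain random oversampling written in NumPy (the helper random_oversample is hypothetical), and Logistic Regression as the model. The key point is that resampling happens inside each fold, on the training part only, so the validation data keeps its original imbalance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def random_oversample(X, y, rng):
    """Duplicate minority-class rows (with replacement) until classes are equal in size."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    parts = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        extra = rng.choice(idx, size=target - count, replace=True)
        parts.append(np.concatenate([idx, extra]))
    idx_all = rng.permutation(np.concatenate(parts))
    return X[idx_all], y[idx_all]

# Synthetic imbalanced data standing in for the loan dataset.
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.85, 0.15], random_state=0)
rng = np.random.default_rng(0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, val_idx in cv.split(X, y):
    # Balance the training fold only; the validation fold stays untouched.
    X_bal, y_bal = random_oversample(X[train_idx], y[train_idx], rng)
    model = LogisticRegression(max_iter=1000).fit(X_bal, y_bal)
    scores.append(f1_score(y[val_idx], model.predict(X[val_idx])))

print("Minority-class F1 per fold:", np.round(scores, 2))
```

Oversampling before splitting would leak copies of the same rows into both training and validation data and inflate the scores.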

Model Evaluation and Performance

The models were evaluated using the F1-score, which balances precision and recall, making it particularly suitable for imbalanced classification tasks such as predicting loan defaults.
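
A short worked example shows why accuracy alone is misleading here and what the F1-score adds. The confusion counts below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 for the positive class from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical test set: 900 fully paid loans and 100 defaults.
# Model A predicts "fully paid" for everyone: 90% accuracy, but it finds no defaults.
majority = precision_recall_f1(tp=0, fp=0, fn=100)

# Model B flags 150 loans and catches 40 defaults: 83% accuracy, but a usable F1.
flagging = precision_recall_f1(tp=40, fp=110, fn=60)

print("Model A:", majority)
print("Model B:", tuple(round(v, 3) for v in flagging))
```

Model A wins on accuracy but has an F1-score of 0 for defaults, while Model B trades some accuracy for the ability to identify risky loans at all.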

For the balanced datasets:

Balanced Logistic Regression (LR) achieved an F1-score of 0.34 for the minority class.

Balanced Random Forest (RF) achieved an F1-score of 0.20 for the minority class.

While these scores showed improvement over the original imbalanced models, they highlighted that accurately identifying defaulters remained a challenge. As expected, overall accuracy decreased when focusing on the minority class, with the Balanced LR model at approximately 64% and the Balanced RF model at 79%.

To further improve performance, we trained and evaluated an XGBoost model on the dataset, which produced the following results:

The Balanced XGBoost model achieved higher accuracy than the Balanced LR model and slightly lower accuracy than the Balanced RF model. It also achieved the best F1-scores for both classes.

Hyperparameter Tuning

To further improve model performance, we conducted hyperparameter tuning to identify the optimal settings for our models. Using RandomizedSearchCV, we determined that the XGBoost model achieved the best performance among all evaluated models.
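
A randomized search of this kind could look like the sketch below. To keep the example dependent on scikit-learn only, it uses GradientBoostingClassifier as a stand-in for XGBoost; the data is synthetic, and the parameter ranges, n_iter, and cv values are assumptions rather than the settings used in the project.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic imbalanced data standing in for the loan dataset.
X, y = make_classification(n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Illustrative search space. xgboost.XGBClassifier accepts the same four
# parameter names, so it could be swapped in for the stand-in model below.
param_distributions = {
    "n_estimators": randint(50, 200),
    "max_depth": randint(2, 6),
    "learning_rate": uniform(0.01, 0.29),  # uniform(loc, scale) samples from [0.01, 0.30]
    "subsample": uniform(0.6, 0.4),        # samples from [0.6, 1.0]
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # number of random combinations to try
    scoring="f1",      # optimize F1 for the positive (default) class
    cv=3,
    random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```

Because only n_iter combinations are sampled, the cost stays fixed no matter how large the search space is, which is the main reason to prefer it over an exhaustive grid when compute is limited.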

Image: Classification report for the tuned XGBoost model

Despite these efforts, challenges remained. Computational constraints and time limitations prevented us from exhaustively exploring all possible hyperparameter combinations. In future work, more efficient optimization methods, such as Bayesian optimization, could be employed to further improve model performance.

A summary table was created to present the evaluation results of the tuned models. Metrics included overall accuracy, macro-average F1-score, weighted-average F1-score, as well as precision, recall, and F1-score for each class, providing a clear comparison of model performance.
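
The difference between the macro-average and weighted-average F1-scores is easy to miss, so here is a small illustration. All numbers are hypothetical and are not taken from the summary table.

```python
def macro_f1(f1_by_class):
    """Unweighted mean: every class counts equally, however rare it is."""
    return sum(f1_by_class.values()) / len(f1_by_class)

def weighted_f1(f1_by_class, support):
    """Mean weighted by the number of true samples (support) in each class."""
    total = sum(support.values())
    return sum(f1_by_class[c] * support[c] for c in f1_by_class) / total

# Hypothetical per-class F1-scores and class sizes.
f1_by_class = {0: 0.85, 1: 0.30}
support = {0: 840, 1: 160}

print("macro:", round(macro_f1(f1_by_class), 3))
print("weighted:", round(weighted_f1(f1_by_class, support), 3))
```

On imbalanced data the weighted average is pulled toward the majority class, so the macro average is usually the more honest summary of how well defaults are being detected.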

Image: Summary metrics table for tuned models

For predicting borrowers likely to default (class 1), the Tuned XGBoost model achieved a precision of 0.27, recall of 0.37, and an F1-score of 0.32. These metrics indicated that XGBoost outperformed the other models in identifying high-risk loans. The model also demonstrated higher recall and F1-score for fully paid loans (class 0), making it the most balanced option overall. While performance for class 1 remains modest, this reflects the inherent difficulty of accurately predicting loan defaults and the challenge of achieving high accuracy for both classes simultaneously.

Results and Insights

The predictive models provided actionable insights into loan default risk within the LendingClub dataset. We used permutation importances to identify the most influential features, highlighting the key drivers of loan default. Partial Dependence Plots (PDP) were then used to examine the relationship between the interest rate and the predicted outcome. Finally, a Shapley plot offered a detailed view of how individual features contributed to the model’s prediction for a single applicant.

Image: Permutation importance for the best performer model

Permutation importance provided insights into the relative impact of each feature on the model’s predictions. This analysis helped identify the key drivers of loan default risk. For example, the features credit_policy and multiple_hard_inq were among the most influential, indicating that borrowers’ adherence to credit policies and the number of recent hard credit inquiries significantly affected the model’s predictive performance.
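
Permutation importance works by shuffling one feature at a time and measuring how much the model's score drops. The sketch below uses scikit-learn's permutation_importance on synthetic data with a Random Forest as a stand-in model; the feature names are placeholders, and the choice of F1 as the scoring metric is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# With shuffle=False, the first three columns are informative and the rest are noise.
X, y = make_classification(
    n_samples=800, n_features=6, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # placeholder names
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle each feature 10 times on held-out data and record the drop in F1.
result = permutation_importance(
    model, X_test, y_test, scoring="f1", n_repeats=10, random_state=0
)

ranking = np.argsort(result.importances_mean)[::-1]
for idx in ranking:
    print(f"{feature_names[idx]}: "
          f"{result.importances_mean[idx]:.3f} +/- {result.importances_std[idx]:.3f}")
```

Computing the importances on held-out data rather than the training set shows which features the model relies on to generalize, not just which ones it memorized.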

Image: PDP Isolate plot for the interest rate feature

The Partial Dependence Plot (PDP) for the int_rate feature illustrated how changes in interest rates influenced the model’s predictions. As the interest rate increased, the predicted probability of default also increased, but the effect plateaued at approximately 0.25. Beyond this point, further increases in the interest rate had little impact on the predicted outcome, indicating that other factors likely played a stronger role in determining default risk. This suggests that while interest rates are important, their influence on loan default probability is limited when considered in isolation.
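
The idea behind a one-feature partial dependence curve is simple enough to write by hand: force the feature to a fixed value for every row, average the model's predicted probabilities, and repeat over a grid of values. The sketch below does this on synthetic data with a stand-in model; partial_dependence_1d is a hypothetical helper, not the plotting library used for the figure above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

def partial_dependence_1d(model, X, feature_idx, grid):
    """Average predicted P(class 1) when one feature is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # overwrite the feature for every row
        averages.append(model.predict_proba(X_mod)[:, 1].mean())
    return np.array(averages)

# Synthetic data and a stand-in model; column 0 plays the role of int_rate.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Evaluate between the 5th and 95th percentiles to avoid sparse extremes.
grid = np.linspace(np.percentile(X[:, 0], 5), np.percentile(X[:, 0], 95), num=10)
pd_curve = partial_dependence_1d(model, X, feature_idx=0, grid=grid)

for value, avg in zip(grid, pd_curve):
    print(f"feature value {value:6.2f} -> mean predicted probability {avg:.3f}")
```

A flat stretch in the resulting curve is what a plateau looks like: changing the feature further no longer moves the average prediction.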

Image: PDP Interact plot for the credit policy and interest rate features

In the interaction plot, higher interest rates were generally associated with a higher predicted likelihood of default among borrowers who failed to meet the underwriting criteria. This aligns with the understanding that higher borrowing costs can make it more difficult for borrowers to meet their loan obligations. However, for borrowers who satisfied the credit underwriting criteria (credit_policy_1), the predicted probability did not show a clear trend with interest rate changes, suggesting that other factors had a stronger influence on overall creditworthiness.

Image: Shapley values for a sample row

This SHAP plot provided valuable insights into how the model was making predictions for a new loan applicant.

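Base value = 0.64: In a SHAP plot, the base value is the model’s average prediction before this applicant’s features are taken into account. Here, that starting point corresponds to a 64% chance of trouble paying back the loan, and the arrows below show how each feature pushes the prediction up or down from it.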

Red arrows (Risk-Increasing Factors):

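credit_score_cat_Fair/Poor = 0: This feature pushed the prediction toward default. Note that a value of 0 on a one-hot encoded column normally means the person is not in the "Fair/Poor" category, so the contribution reflects the absence of that label rather than a low credit score. In general, lower credit scores are associated with higher default risks.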
multiple_hard_inq = 1.891: This person has multiple hard inquiries on their credit report, which also increases the risk of loan default. Multiple hard inquiries might indicate that they have applied for several loans or credit accounts recently, potentially straining their financial situation.
log_annual_inc = 2.427: The logarithm of this person's annual income pushed the prediction toward default, meaning that for this applicant a higher income was associated with a higher predicted risk. This might seem counterintuitive, but the model is based on patterns in the data it was trained on, which do not always reflect real-world causality. The effect could also be related to the applicant's financial management and budgeting habits.
purpose_debt_consolidation = 1: This person's loan purpose is debt consolidation, which, in their case, increases the risk of loan default. Debt consolidation loans are often taken to pay off existing debts, and borrowers with higher levels of existing debt might be at a higher risk of default.
delinq_2yrs = 3.296: This person has a history of delinquencies in the past two years, which significantly increases the risk of loan default. Previous delinquencies suggest a pattern of late or missed payments, indicating potential financial instability.

Blue arrows (Risk-Decreasing Factors):

int_rate = 1.006: The interest rate on this person's loan pushed the prediction toward a lower risk of default. In general, a lower interest rate makes a loan more affordable and manageable for the borrower.
credit_policy = -2.03: This feature also pushed the prediction toward a lower risk of default. The non-integer values in this plot suggest the features were standardized; if so, a negative value on a 0/1 flag corresponds to an original value of 0, meaning the borrower does not meet the credit policy criteria, so this contribution should be read with caution. In general, meeting the lender's credit policy indicates a higher likelihood of being a reliable borrower.
purpose_credit_card = 0: The loan was not taken out to pay off a credit card, and this feature had a roughly neutral effect on the prediction.
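
To see why the red and blue arrows in a SHAP plot add up to the final prediction, it helps to compute exact Shapley values for a tiny example. The sketch below is a simplification: toy_risk is a made-up scoring function (not the trained model), and missing features are replaced by a single baseline row, whereas the shap library averages over a background dataset.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values by enumerating every subset of features."""
    n = len(x)

    def value(subset):
        # Features in the subset take the applicant's value, the rest the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for combo in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                s = set(combo)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def toy_risk(z):
    """Hypothetical risk score with an interaction term, for illustration only."""
    inquiries, delinquencies, meets_policy = z
    return (0.2 + 0.05 * inquiries + 0.1 * delinquencies
            - 0.15 * meets_policy + 0.02 * inquiries * delinquencies)

applicant = [2, 1, 1]   # 2 hard inquiries, 1 delinquency, meets credit policy
baseline = [0, 0, 0]    # reference row that defines the base value

phi = shapley_values(toy_risk, applicant, baseline)
base_value = toy_risk(baseline)
print("base value:", round(base_value, 3))
print("contributions:", [round(p, 3) for p in phi])
print("base + contributions:", round(base_value + sum(phi), 3))
print("model output:", round(toy_risk(applicant), 3))
```

The first two contributions are positive (red arrows) and the third is negative (blue arrow), and together with the base value they sum exactly to the model output for this applicant.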

For those interested in the technical implementation, the full codebase and details are available in the GitHub repository linked below. The project demonstrates how predictive modeling can provide actionable insights to support investor decision-making and manage loan default risk effectively.

GitHub Repository

Conclusion

The LendingClub project demonstrates the potential of predictive modeling to assess loan default risk. While XGBoost performed best among the models tested, its performance on defaults remains modest, and there is still room for improvement, especially in handling imbalanced datasets and further refining hyperparameters. Future work will focus on enhancing model accuracy and uncovering deeper insights to support data-driven investment decisions, helping investors manage risk and maximize returns.

Thanks for your interest in this analysis. You can explore the GitHub repository for full implementation details.