LendingClub: Empowering Investor Decisions - Exploring Predictive Models for Loan Default Risk
- Author: Jiannina Pinto
Introduction
In the dynamic seas of finance, making well-informed decisions is crucial for investors and lending institutions alike. As the world becomes increasingly data-driven, predictive modeling emerges as a powerful tool to assess risks and steer toward financial success. Now, our journey takes us deep into the captivating domain of data science, where we unlock the potential of AI algorithms and cutting-edge techniques to confront the challenge of predicting loan default risk head-on. Set sail with us as we navigate this exciting adventure, exploring the uncharted waters of data-driven finance and empowering investors with the tools for success.
Project Statement
The main goal of the LendingClub Project is to develop a predictive model for loan default risk, equipping investors with valuable insights, and enabling them to make informed decisions. By analyzing historical data from LendingClub borrowers and loan characteristics, we seek to reveal patterns that can predict potential defaults. Our purpose throughout this work is to empower investors, safeguard their interests, and maximize returns in the dynamic world of peer-to-peer lending.
Data Collection
The dataset used for this analysis was sourced from Kaggle - Loan Data. It contains essential information about borrowers' credit history, income, loan details, and the status of loan repayments. Here is the data dictionary:
Data Preprocessing
The quest for accurate loan default risk prediction set us on a journey of data preprocessing, where we pursued a reliable foundation for our model. Our voyage began with a thorough data-cleaning process. We were well aware of the dangerous outliers that can lurk in the data seas, so we addressed them carefully to protect the integrity and reliability of the data. But our voyage did not end there. We then turned to the art of feature engineering, crafting new variables to help us uncover the hidden relationships that hold the key to accurate predictions.
We decided to say goodbye to redundant features, which are variables that no longer served a purpose, streamlining our dataset and decluttering the waters for our model to sail smoothly. Categorical variables were carefully encoded using OneHotEncoder, while feature scaling (using StandardScaler) ensured a level playing field for all the variables, enabling fair comparisons.
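To make the two transformations concrete, here is a minimal pure-Python sketch of what one-hot encoding and standardization compute. It is an illustration only, not the scikit-learn code used in the project; the helper names and the toy values are hypothetical.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Turn a categorical column into one 0/1 indicator column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def standardize(values):
    """Rescale a numeric column to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical toy rows: the column names echo the article, the values are made up.
purpose = ["debt_consolidation", "credit_card", "debt_consolidation", "educational"]
int_rate = [0.10, 0.12, 0.14, 0.16]

categories, encoded = one_hot(purpose)
scaled = standardize(int_rate)

print(categories)   # one indicator column per distinct purpose
print(encoded[0])   # indicators for the first row
print(scaled)       # values expressed in standard deviations from the mean
```

After scaling, a value such as 1.0 means "one standard deviation above the column average", which is worth keeping in mind when reading feature values in later plots.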
In this stage, we also identified an uneven distribution in the classes (0 and 1) for our target variable, not_fully_paid, which required correction at a later stage.
Image: Distribution of the target feature "not fully paid"
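An uneven class split like this is the reason accuracy alone can mislead. The sketch below uses made-up label counts (they are not the project's actual figures) and a dummy "model" that always predicts class 0, to show how a high accuracy can coexist with zero recall on the minority class.

```python
# Hypothetical label counts for illustration only; not the article's data.
y_true = [0] * 84 + [1] * 16

# A dummy "model" that always predicts the majority class (fully paid).
y_pred = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_class_1 = true_positives / sum(t == 1 for t in y_true)

print(f"accuracy = {accuracy:.2f}")
print(f"recall for class 1 = {recall_class_1:.2f}")
```

The dummy model never flags a single defaulter, yet its accuracy simply equals the share of the majority class.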
Machine Learning Models
It was time to set our predictive models on a course to conquer the challenging task of loan default risk prediction. The models chosen for our journey were:
- Logistic Regression
- Random Forest
- XGBoost
Initially, the Logistic Regression and the Random Forest models were trained and evaluated using the imbalanced dataset, where the majority class (class 0) was overrepresented compared to the minority class (class 1). Despite achieving good accuracy scores, we observed that the models struggled to classify the minority class effectively. This highlighted the challenge of imbalanced data and the need for addressing class imbalance to improve the F1-score. Now, let's examine the classification reports to gain deeper insights into the performance of our models.
Next, we used cross-validation on balanced datasets in which we applied oversampling and undersampling techniques. This helped us obtain significant improvements in classification performance for the minority class. These are the classification reports for the models:
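As a rough illustration of the oversampling idea, the sketch below duplicates randomly chosen minority rows until the classes are the same size. The function name and the toy data are hypothetical, and this is not the resampling code used in the project. A common precaution, assumed here, is to resample only the training portion of each cross-validation fold so that duplicated rows never leak into the data used for scoring.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target_size = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        # Sample with replacement from this class to fill the gap (no-op for the largest class).
        extra = [rng.choice(pool) for _ in range(target_size - n)]
        out_rows.extend(extra)
        out_labels.extend([cls] * len(extra))
    return out_rows, out_labels


# Hypothetical training fold: 8 fully paid loans (0) and 2 not fully paid (1).
train_rows = [[i] for i in range(10)]
train_labels = [0] * 8 + [1] * 2

bal_rows, bal_labels = random_oversample(train_rows, train_labels)
balanced_counts = Counter(bal_labels)
print(balanced_counts)
```

Undersampling is the mirror image: instead of duplicating minority rows, it drops randomly chosen majority rows until the classes match.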
Model Evaluation and Performance
The models were evaluated using F1-score, which is a combined metric that balances both precision and recall, making it a useful evaluation metric for imbalanced classification tasks like predicting loan defaults.
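For reference, the F1-score is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts; the counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2 * precision * recall / (precision + recall), with zero-division guards."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for the minority class (not taken from the article):
# 30 defaulters caught, 70 false alarms, 50 defaulters missed.
score = f1_from_counts(tp=30, fp=70, fn=50)
print(f"precision = {30 / 100:.3f}, recall = {30 / 80:.3f}, F1 = {score:.3f}")
```

Because it is a harmonic mean, F1 stays low whenever either precision or recall is low, which is why it is stricter than accuracy on the minority class.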
The F1-score for the minority class was 0.34 for the Balanced LR model and 0.20 for the Balanced RF model. Although there was an improvement, compared to the imbalanced models, the F1 scores reminded us that capturing defaulters required further refinement. Additionally, the accuracy of the models decreased compared to the imbalanced models, which is expected when focusing on improving performance for the minority class. The accuracy was around 64% for the Balanced LR model and 79% for the Balanced RF model.
Determined to raise our sails to greater heights, we trained and evaluated an XGBoost model on our dataset, leading to the following results:
We can see that the Balanced XGBoost model achieved higher accuracy than the Balanced LR model and slightly lower accuracy than the Balanced RF model. In terms of the F1-score for both classes, the Balanced XGBoost model achieved better performance.
Hyperparameter Tuning
To further enhance our model's accuracy, we embarked on the task of fine-tuning, seeking the best combinations of hyperparameters for our models. Our use of RandomizedSearchCV allowed us to determine that the XGBoost model was the best performer.
Yet, we must acknowledge the challenges of this endeavor. The computational expense and time constraints limited us from fully exploring all hyperparameter combinations. As we planned our path, we realized that we might need to use more efficient optimization methods, such as Bayesian optimization.
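The trade-off described here can be sketched in a few lines: a randomized search scores only a fixed number of sampled combinations instead of the whole grid. Everything below is hypothetical, including the search space and the scoring function, which is a stand-in for a cross-validated score rather than a real model evaluation. For simplicity this sketch samples with replacement.

```python
import math
import random

# Hypothetical search space; the names resemble common gradient-boosting options.
param_space = {
    "max_depth": [3, 4, 5, 6, 8],
    "learning_rate": [0.01, 0.05, 0.1, 0.2],
    "n_estimators": [100, 200, 400, 800],
    "subsample": [0.6, 0.8, 1.0],
}


def toy_score(params):
    """Stand-in for a cross-validated F1-score; NOT a real model evaluation."""
    return -abs(params["max_depth"] - 5) - 10 * abs(params["learning_rate"] - 0.1)


def random_search(space, score_fn, n_iter, seed=0):
    """Score n_iter randomly sampled combinations and keep the best one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        candidate = {name: rng.choice(options) for name, options in space.items()}
        score = score_fn(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score


grid_size = math.prod(len(options) for options in param_space.values())
best_params, best_score = random_search(param_space, toy_score, n_iter=20)

print(f"full grid: {grid_size} combinations, evaluated: 20")
print(best_params, best_score)
```

The search is cheap because it ignores most of the grid, which is also why it can miss the best combination; methods such as Bayesian optimization try to choose the next candidate more deliberately.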
In a sea of performance metrics, a summary table emerged, showcasing the evaluation results of our tuned models. Accuracy, macro average F1-score, weighted average F1-score, precision, recall, and F1-score for each class danced upon the waves.
Image: Summary metrics table for tuned models
For identifying borrowers more likely to default (class 1), the Tuned XGBoost model achieved a precision of 0.27, recall of 0.37, and F1-score of 0.32. As we analyzed these metrics, we found the Tuned XGBoost model to be the best of the three options, and it was voted our guiding star. Its higher recall and F1-score for class 0 (fully paid loans) set it apart from the other models. While the precision, recall, and F1-score for class 1 (not fully paid loans) remained relatively low, it is worth noting that predicting loan defaults is a demanding task, and achieving strong performance for both classes simultaneously can be challenging.
Results and Insights
The outcomes of our predictive models are not just numbers; they hold valuable insights into loan default risk within the LendingClub system. We employed permutation importances to identify the most influential features, shedding light on the key drivers of loan default. Additionally, Partial Dependence Plots (PDP) revealed the relationships between the interest rate feature and the predicted outcome. Complementing these insights, a Shapley plot provided a comprehensive understanding of individual feature impacts on loan default risk within the context of LendingClub.
Image: Permutation importance for the best performer model
Permutation importance values offer useful insights into the relative importance of different features and their impact on the model's predictive power, helping to identify key drivers of loan default risk. For instance, the importances for "credit_policy" and "multiple_hard_inq" indicated that these variables were important for the model's predictive performance.
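The idea behind permutation importance can be sketched without any library: shuffle one feature column at a time and measure how much the score drops. The toy model and data below are hypothetical and are not the project's XGBoost model; the sketch uses accuracy as the score for simplicity.

```python
import random


def accuracy(model, X, y):
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Average drop in accuracy when each column is shuffled, one column at a time."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this column replaced by its shuffled values.
            X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical classifier that relies on feature 0 and ignores feature 1."""
    return 1 if row[0] > 0.5 else 0


data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [toy_model(row) for row in X]

importances = permutation_importance(toy_model, X, y)
print(importances)
```

Shuffling the feature the model depends on hurts the score, while shuffling the ignored feature changes nothing, so its importance is zero.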
Image: PDP Isolate plot for the interest rate feature
The PDP plot helped us understand how changes in the "int_rate" feature affected the model's predictions. It showed the average response as well as the variation in predictions based on different values of the "int_rate" feature.
If the interest rate increased, the predicted outcome (centered) also increased. Notice that it reached a point where it became flat at around 0.25, suggesting that further increases in the interest rate had little to no effect on the predicted outcome. The relationship between the interest rate and the predicted outcome became less significant.
This behavior could be interpreted as an indication that beyond a certain interest rate, other factors or variables might have had a stronger influence on the predicted outcome. It suggests that there is a limit to how much the outcome could be affected by changes in the interest rate alone.
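A partial dependence curve of this kind can be computed by hand: fix the feature at each grid value for every row, then average the model's predictions. The sketch below uses a hypothetical model whose response saturates and made-up rows, so the numbers have no connection to the plot above. Centering on the first grid point is one common convention, assumed here.

```python
def partial_dependence(model, X, col, grid):
    """Average prediction when column `col` is forced to each grid value in turn."""
    curve = []
    for value in grid:
        preds = [model(row[:col] + [value] + row[col + 1:]) for row in X]
        curve.append(sum(preds) / len(preds))
    return curve


def toy_model(row):
    """Hypothetical risk score: rises with feature 0, then stops rising above 1.0."""
    x, other = row
    return 0.3 + 0.1 * min(x, 1.0) + 0.05 * other


# Made-up rows: [scaled feature of interest, some other feature].
X = [[-0.5, 0.0], [0.2, 1.0], [0.7, 0.0], [1.5, 1.0]]
grid = [-1.0, 0.0, 1.0, 2.0]

curve = partial_dependence(toy_model, X, col=0, grid=grid)
centered = [v - curve[0] for v in curve]  # shift so the curve starts at zero
print(centered)
```

The centered curve climbs and then flattens once the feature passes the saturation point, which is the same shape described for the interest rate.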
Image: PDP Interact plot for the credit policy and interest rate features
It seems like higher interest rates were associated with a higher likelihood of customers not meeting the underwriting criteria and being classified as potential loan defaulters. This aligns with the general understanding that higher interest rates could lead to higher borrowing costs and potentially make it more challenging for borrowers to meet their loan obligations. However, the probability of meeting the credit underwriting criteria ('credit_policy_1') did not show a clear trend with changing interest rates, indicating that other factors may have played a more significant role in determining creditworthiness.
Image: Shapley values for a sample row
This SHAP plot provided valuable insights into how the model was making predictions for a new loan applicant.
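Base value = 0.64: In a SHAP plot, the base value is the model's average output before any of this applicant's features are taken into account. Read as a probability, it corresponds to a baseline 64% chance of trouble paying back the loan, which the factors below then push up or down for this person.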
Red arrows (Risk-Increasing Factors):
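- credit_score_cat_Fair/Poor = 0: This indicator for the "Fair/Poor" credit score category pushes the prediction toward default. Lower credit scores are generally associated with higher default risks. Note, however, that in a one-hot encoding a value of 0 normally means the applicant is not in this category, so this contribution should be read with care.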
- multiple_hard_inq = 1.891: This person has multiple hard inquiries on their credit report, which also increases the risk of loan default. Multiple hard inquiries might indicate that they have applied for several loans or credit accounts recently, potentially straining their financial situation.
- log_annual_inc = 2.427: The logarithm of the annual income for this person had a positive impact on the prediction, indicating that higher income is associated with a higher risk of loan default. This might seem counterintuitive, but it's important to remember that the model is based on patterns in the data it was trained on, which might not always reflect real-world causality. However, it could also potentially be related to their financial management and budgeting habits.
- purpose_debt_consolidation = 1: This person's loan purpose is debt consolidation, which, in their case, increases the risk of loan default. Debt consolidation loans are often taken to pay off existing debts, and borrowers with higher levels of existing debt might be at a higher risk of default.
- delinq_2yrs = 3.296: This person has a history of delinquencies in the past two years, which significantly increases the risk of loan default. Previous delinquencies suggest a pattern of late or missed payments, indicating potential financial instability.
Blue arrows (Risk-Decreasing Factors):
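- int_rate = 1.006: The interest rate on this person's loan was associated with a decreased risk of default. Since the features were scaled with StandardScaler, 1.006 is presumably a standardized value, roughly one standard deviation above the average rate, so the usual explanation that a lower interest rate makes the loan more affordable and manageable does not directly apply to this row.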
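- credit_policy = -2.03: The credit policy feature decreases the predicted risk of loan default for this person. Meeting the credit policy suggests that the borrower meets specific criteria set by the lender, indicating a higher likelihood of being a reliable borrower. Note, however, that if this 0/1 flag was standardized along with the other features, a negative value such as -2.03 corresponds to a raw value of 0, so whether this applicant actually meets the criteria should be confirmed against the unscaled data.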
- purpose_credit_card = 0: This person's loan purpose is not a credit card loan, which appears to have a nearly neutral effect on the prediction.
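The way the arrows add up can be illustrated with exact Shapley values on a tiny model. This is a simplified sketch under stated assumptions: the model and feature values are hypothetical, a single reference row stands in for the background data, and it is not the SHAP library's algorithm. It shows the key property used when reading the plot, namely that the base value plus all feature contributions equals the prediction for the row.

```python
from itertools import permutations


def shapley_values(model, x, background):
    """Exact Shapley values for one row; 'absent' features take the background row's values."""
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(background)
        prev = model(current)
        for i in order:
            current[i] = x[i]          # reveal feature i
            new = model(current)
            phi[i] += new - prev       # marginal contribution in this ordering
            prev = new
    return [p / len(orderings) for p in phi]


def toy_model(row):
    """Hypothetical risk score with one interaction term between features a and c."""
    a, b, c = row
    return 0.5 + 0.2 * a - 0.1 * b + 0.05 * a * c


background = [0.0, 0.0, 0.0]  # single reference row (a simplification)
x = [1.0, 2.0, 1.0]           # the row being explained

base_value = toy_model(background)
phi = shapley_values(toy_model, x, background)
prediction = toy_model(x)

print(f"base value = {base_value}")
print(f"contributions = {phi}")
print(f"base + sum(contributions) = {base_value + sum(phi)}, prediction = {prediction}")
```

Positive contributions play the role of the red arrows and negative ones the blue arrows; the interaction term is split evenly between the two features involved.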
As we reach the end of this fascinating voyage, feel free to explore the full codebase and implementation details in the GitHub repository linked below. Experience the capabilities of machine learning and predictive modeling to empower investors' decisions that would lead them to financial success.
GitHub Repository
Conclusion
The LendingClub Project undertakes an exciting journey to predict loan default risk, armed with data science and predictive modeling. While our current models have shown promise, we acknowledge that there is still ample room for improvement. The XGBoost model shines as the best performer, yet it falls short of perfection. We are determined to navigate the challenges of imbalanced datasets and hyperparameter tuning to refine the models further. As we set sail for new horizons, we aspire to uncover deeper insights and leverage advanced techniques that would allow us to empower investors to make data-driven decisions and navigate the course to financial prosperity by safeguarding their interests and maximizing returns.
Thank you for joining us on this voyage of exploring predictive modeling and loan default risk. Your engagement and support are strongly appreciated.