Bandersnatch: Unleashing the Power of Machine Learning and AI in Monster Classification
Author: Jiannina Pinto
Introduction
Welcome to the Bandersnatch Monster Project, where the power of machine learning and AI is utilized to automate the classification of monsters based on their attributes. Join me on an exciting journey as we delve into the fascinating realm of this project, analyze the data, explore the algorithms used, and reveal the amazing results achieved.
Project Statement
The goal of the Bandersnatch project is to automate the classification of monsters based on their unique attributes, providing users with valuable insights into their ranks. With a diverse range of monsters, understanding their traits and characteristics can be a challenging task. This project tackles this challenge by leveraging cutting-edge technology and advanced machine-learning algorithms to create an efficient and accurate classification system.
Data Collection
The dataset used for the monster rarity classification problem was sourced from MongoDB and consisted of monsters randomly generated with the MonsterLab library. During the preprocessing phase, irrelevant columns such as _id, Name, and Damage were excluded from the dataset to focus on attributes relevant for modeling. The dataset includes the features Type, Level, Health, Energy, and Sanity, along with the target variable Rarity. It contains 1,500 observations, with rarity classes ranging from Rank 0 to Rank 5. Here is a random sample of the monster data, showcasing a diverse range of monster attributes.
Image: Random sample of the monster dataset
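The column-dropping step can be sketched with pandas. The records below are hypothetical stand-ins for documents fetched from the MongoDB collection (the real data comes from the MonsterLab generator), so the specific names and values are illustrative only:

```python
import pandas as pd

# Hypothetical stand-ins for documents pulled from MongoDB;
# the real project generates these records with the MonsterLab library.
records = [
    {"_id": "a1", "Name": "Night Wraith", "Type": "Undead", "Level": 7,
     "Health": 42.3, "Energy": 40.1, "Sanity": 41.7, "Damage": "7d6+2",
     "Rarity": "Rank 3"},
    {"_id": "b2", "Name": "Pit Fiend", "Type": "Demonic", "Level": 2,
     "Health": 11.9, "Energy": 12.4, "Sanity": 12.0, "Damage": "2d4+1",
     "Rarity": "Rank 0"},
]
df = pd.DataFrame(records)

# Drop columns that carry no predictive signal for the rarity model.
df = df.drop(columns=["_id", "Name", "Damage"])
print(sorted(df.columns))
```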
Data Preprocessing
The initial Exploratory Data Analysis (EDA) revealed an imbalance in the "Rarity" classes, with some classes overrepresented and others underrepresented. To address this challenge, several preprocessing techniques were applied: feature scaling of numerical variables, one-hot encoding of categorical variables, and a ColumnTransformer for streamlined preprocessing. Additionally, oversampling (SMOTE, the Synthetic Minority Over-sampling Technique) and undersampling (RandomUnderSampler) were employed to balance the rarity classes, ensuring fair representation for all classes.
Image: Distribution of the target feature "Rarity"
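A minimal sketch of the ColumnTransformer setup, assuming scikit-learn and the feature names above; the tiny frame here is illustrative, not the real 1,500-row dataset:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny illustrative frame with the same feature names as the monster data.
df = pd.DataFrame({
    "Type":   ["Demonic", "Elemental", "Undead", "Demonic"],
    "Level":  [1, 5, 10, 3],
    "Health": [12.0, 40.5, 88.0, 20.0],
    "Energy": [11.5, 39.0, 90.0, 22.0],
    "Sanity": [10.0, 41.0, 87.5, 21.0],
})

numeric = ["Level", "Health", "Energy", "Sanity"]
categorical = ["Type"]

# Scale numeric columns and one-hot encode the categorical column in one step.
preprocessor = ColumnTransformer([
    ("num", StandardScaler(), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

X = preprocessor.fit_transform(df)
print(X.shape)  # 4 rows; 4 scaled numerics + 3 one-hot columns = 7
```

SMOTE and RandomUnderSampler (from the separate imbalanced-learn package) would then be applied to the transformed training data.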
Machine Learning Models
To tackle the monster classification task, three powerful machine learning algorithms were utilized: Random Forest, Extreme Gradient Boosting (XGBoost), and Support Vector Machines (SVM).
- Random Forest was chosen for its ability to handle high-dimensional data and potentially capture non-linear relationships between monster attributes and their ranks.
- XGBoost was included as it has demonstrated state-of-the-art performance in several machine-learning competitions and is suitable for classification tasks.
- SVM was chosen for its ability to handle complex classification tasks and non-linear relationships using kernel functions. Also, because of its potential to generalize well to unseen data.
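A sketch of how the three models might be instantiated with scikit-learn. The XGBoost line is commented out because it requires the separate xgboost package, and the hyperparameters shown are illustrative defaults, not the project's tuned values:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
# from xgboost import XGBClassifier  # requires the xgboost package

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    # probability=True makes predict_proba available, which ROC AUC OvO needs.
    "SVM": SVC(kernel="rbf", probability=True, random_state=42),
    # "XGBoost": XGBClassifier(eval_metric="mlogloss", random_state=42),
}
for name, model in models.items():
    print(name, type(model).__name__)
```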
Initially, all the models were trained and evaluated on the imbalanced dataset, reflecting the real-world scenario in which some rarity classes are overrepresented relative to others. Despite achieving good roc_auc_ovo scores, the models struggled to classify the minority classes effectively, especially Rank 4 and Rank 5. This highlighted the challenge of imbalanced data and the need to address class imbalance to improve overall model performance. Now, let's examine the classification reports and the ROC AUC OvO scores to gain deeper insight into the performance of our models.
- Imbalanced Random Forest
ROC AUC OvO Imbalanced RF: 0.9587445566198092
- Imbalanced Extreme Gradient Boosting
ROC AUC OvO Imbalanced XGB: 0.9509173960536714
- Imbalanced Support Vector Machines
ROC AUC OvO Imbalanced SVM: 0.9800088622224434
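Scores like these come from scikit-learn's roc_auc_score with multi_class="ovo". A minimal sketch on synthetic multiclass data (the score printed here belongs to the toy data, not the monster models):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic 6-class problem standing in for the six rarity ranks.
X, y = make_classification(n_samples=600, n_features=8, n_informative=6,
                           n_classes=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_train, y_train)
proba = clf.predict_proba(X_test)

# One-vs-one ROC AUC averages the AUC over every pair of classes.
score = roc_auc_score(y_test, proba, multi_class="ovo")
print(round(score, 3))
```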
To overcome the imbalanced-data challenge, cross-validation was performed using StratifiedKFold: the data was split into multiple folds, each preserving the same class distribution as the original dataset. The models were then trained on balanced datasets produced by the oversampling and undersampling techniques. This resulted in significant improvements in classification performance, particularly for the minority classes, as the balanced training data allowed the models to learn from a more representative sample of every class. Let's take a look at the classification reports and the ROC AUC OvO scores.
- Balanced Random Forest
ROC AUC OvO Balanced RF: 0.9602214013749324
- Balanced Extreme Gradient Boosting
ROC AUC OvO Balanced XGB: 0.961452933038248
- Balanced Support Vector Machines
ROC AUC OvO Balanced SVM: 0.9734961009815477
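The stratified splitting described above can be sketched as follows. The resampling step (SMOTE or undersampling inside each fold, via imbalanced-learn) is omitted to keep the example dependency-free:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Deliberately imbalanced 3-class toy problem (60% / 30% / 10%).
X, y = make_classification(n_samples=300, n_classes=3, n_informative=4,
                           weights=[0.6, 0.3, 0.1], random_state=42)

# Each fold preserves (approximately) the original class proportions.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    _, counts = np.unique(y[test_idx], return_counts=True)
    print(f"fold {fold}: test class counts = {counts.tolist()}")
```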
The roc_auc_ovo metric showed improvements, indicating that the models became more skilled at distinguishing between the different ranks of monsters. This balanced approach not only improved the accuracy and performance of the models but also ensured fair and reliable classification across all rarity classes.
Model Evaluation and Performance
The performance of the models was evaluated using several metrics, including ROC AUC OvO, overall accuracy, and F1-score. The SVM model demonstrated the highest roc_auc_ovo score (about 0.97), indicating its ability to classify monsters accurately. It also achieved the highest F1-scores and high precision and recall across most classes, indicating a well-balanced performance.
Hyperparameter Tuning
To optimize the models, hyperparameter tuning was performed using techniques like Randomized Search. This involved systematically searching through different combinations of hyperparameters to identify the best configuration for each algorithm. The Tuned SVM model stood out with the highest ROC AUC OvO score and accuracy, demonstrating superior precision, recall, and F1-score values compared to the other models.
ROC AUC OvO Tuned SVM: 0.993272225898018
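A sketch of a Randomized Search over SVM hyperparameters with scikit-learn's RandomizedSearchCV. The search space and toy data below are illustrative assumptions; the project's actual grid may differ:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Small synthetic multiclass dataset so the search runs quickly.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5,
                           random_state=42)

# Illustrative search space, not the project's exact grid.
param_distributions = {
    "C": loguniform(1e-2, 1e2),
    "gamma": loguniform(1e-4, 1e0),
    "kernel": ["rbf", "poly"],
}
search = RandomizedSearchCV(
    SVC(probability=True, random_state=42),
    param_distributions,
    n_iter=10,               # sample 10 random configurations
    scoring="roc_auc_ovo",   # the same metric used throughout the project
    cv=3,
    random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```

Randomized Search samples a fixed number of configurations rather than exhaustively enumerating a grid, which keeps tuning affordable when several hyperparameters interact.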
A summary table displays the evaluation metrics of the three tuned models:
Results and Insights
The best-performing model, SVM, was further evaluated on the test set to verify its effectiveness in real-world scenarios. It achieved an overall accuracy of 96% and a ROC AUC OvO score of 0.9845, indicating its strong ability to classify monsters accurately.
The classification report for the SVM model on the test set showed excellent precision, recall, and F1-scores across all classes, highlighting its well-balanced performance and its accurate prediction of both the majority and minority classes.
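Per-class precision, recall, and F1 figures like these come from scikit-learn's classification_report. A sketch on toy rank labels (not the project's actual predictions):

```python
from sklearn.metrics import classification_report

# Toy true/predicted rank labels, standing in for the SVM's test-set output.
y_true = ["Rank 0", "Rank 0", "Rank 1", "Rank 1", "Rank 2", "Rank 2"]
y_pred = ["Rank 0", "Rank 0", "Rank 1", "Rank 2", "Rank 2", "Rank 2"]

# output_dict=True returns the per-class metrics as a nested dictionary.
report = classification_report(y_true, y_pred, output_dict=True)
print(report["Rank 0"]["f1-score"])  # Rank 0 is classified perfectly here
```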
To provide further insights, feature importances were examined, revealing the significant contribution of certain attributes in determining monster ranks. These insights can assist in understanding the key factors that influence a monster's rarity.
The results highlight the significant role of the level attribute, followed by energy, health, and sanity in determining monster ranks. A visual representation of the feature importances can be found in the accompanying image.
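An RBF-kernel SVM does not expose feature_importances_ directly; one common way to obtain importances for such a model, sketched here on toy data, is scikit-learn's permutation_importance (the ranking printed is for the toy data, not the attribute ranking shown in the image):

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.svm import SVC

# Toy 4-feature problem standing in for Level, Health, Energy, Sanity.
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=42)
clf = SVC(probability=True, random_state=42).fit(X, y)

# Shuffle each feature in turn and measure the resulting drop in score.
result = permutation_importance(clf, X, y, n_repeats=5, random_state=42)
for i, imp in enumerate(result.importances_mean):
    print(f"feature {i}: importance = {imp:.3f}")
```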
Deployment and Future Work
The SVM model emerged as a robust choice for the classification task in the Bandersnatch Monster Project. The model was deployed in a web application built with Flask, allowing users to interactively classify and explore monsters and make informed decisions based on the ranks assigned to each one. Additionally, a visualization tool was developed with the Altair library; it serves as a playground for exploring correlations between features, providing an interactive and engaging experience. Here is an example of the correlation visualization between "Health" and "Energy," grouped by "Rarity".
Image: Correlation between Health and Energy using Altair
Future improvements could involve incorporating additional features such as Damage, expanding the dataset, or integrating other advanced techniques such as natural language processing for text-based monster descriptions.
Feel free to explore the full codebase and implementation details in the GitHub repository linked below. Witness the power of machine learning and AI in unraveling the mysteries of the Bandersnatch monsters!
Conclusion
The Bandersnatch Monster Project showcases the potential of machine learning and AI to automate complex tasks like monster classification. By leveraging powerful algorithms, careful data preprocessing, and hyperparameter tuning, the project achieved strong results in classifying monsters by their attributes. The Flask web application and the Altair visualization tool make the project interactive and user-friendly.
Thank you for joining me on this exciting journey through the world of monsters and the power of data science and machine learning. I hope you found this article insightful and informative. By exploring the depths of monster classification, we have uncovered fascinating insights and embarked on a captivating adventure. Your interest and engagement are greatly appreciated.