Cash Score: Alternative Credit Scoring

Using Banking Transaction Data for Fairer Credit Assessment

By: Jevan Chahal, Hillary Chang, Kurumi Kaneko, and Kevin Wong.

Mentors: Brian Duke (Prism Data, Inc.), Kyle Nero (Prism Data, Inc.), and Berk Ustun (Prism Data, Inc.).


Abstract

The process of determining a consumer's creditworthiness is crucial within the context of bank data, given the ethical and regulatory constraints surrounding its use. Despite the vast quantity of available data, only a limited number of features are explicitly useful for machine learning applications, raising the question of how best to assess a customer's financial reliability. Our methodology involves refining bank data into meaningful categories using Natural Language Processing, estimating individual income from transaction data alone, and evaluating creditworthiness with both accuracy and efficiency.

Introduction

Determining a customer’s creditworthiness is essential for banks, influencing decisions about loans, credit cards, and other financial services. While traditional credit scores are widely used, they often fail to capture the full financial picture, especially for individuals with little to no credit history. Our project aims to bridge this gap by leveraging transaction data to assess financial behavior rather than relying solely on past credit activity. By analyzing spending habits, income consistency, and transaction types, we seek to create a scoring model that provides a more accurate and equitable reflection of financial reliability.

As part of our project, we have implemented a data pipeline to support this model, including a categorization system that classifies transaction memos from vendors with high accuracy and low latency, ensuring that entries like "Amazon.com" are correctly categorized under "General Merchandise." This categorization plays a key role in our broader goal of developing the Cash Score, a numerical value ranging from 1 to 999 that predicts a consumer’s likelihood of defaulting on debt or credit. By equipping financial institutions with this tool, we aim to enhance risk assessment while ensuring that credit access is extended to those with strong banking histories, making lending practices more inclusive and effective.

Methods

Our methodology consists of feature engineering from transaction data, model training using various machine learning algorithms, and performance evaluation based on key metrics. The figure below shows our overall methodology:

Overall Methodology Diagram
Figure: Overview of our methodology.

Categorizing Transactions Based on Memos

Feature Creation and Selection

Token Augmentation

To prepare the transaction data for further analysis and modeling, we enhanced the memo field by appending tokens derived from transaction characteristics. This augmentation adds context about transaction amounts and dates, helping the model learn more meaningful patterns; a short code sketch follows the transformation steps below.

Pre-token augmentation transaction data
Figure: This is the transaction data cleaned but not token augmented yet. The memo field contains raw text descriptions without any added tokens.

The following steps outline the transformation process:

  • Whole Dollar Amount Identification: We added a token <W_D> to the memo field for transactions with whole dollar amounts. This token helps identify transactions that might involve cash withdrawals or ATM transactions, as these are typically in round amounts (e.g., $20, $50).
  • Day of Transaction: For each transaction's posting date, we generated a day token in the format <D_day>, where day represents the specific day of the month. This token helps analyze patterns such as end-of-month or beginning-of-month spending habits.
  • Month of Transaction: Similarly, we generated a month token in the format <M_month> (where month is the numerical month) to capture seasonal or monthly spending trends.
  • Token Augmentation Process: For each row in the dataset, these tokens were concatenated to the memo field using the following transformation:

    memo = memo + whole_dollar_amount_token + day_token + month_token
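
As a concrete illustration, the following pandas sketch applies this augmentation; the column names (memo, amount, posted_date) are assumptions, and the actual pipeline may differ in detail.

    import pandas as pd

    def augment_memo(row):
        """Append <W_D>, <D_day>, and <M_month> tokens to a memo."""
        tokens = []
        # <W_D> flags whole-dollar amounts (e.g., $20.00), which often
        # correspond to ATM or cash transactions.
        if float(row["amount"]) == int(row["amount"]):
            tokens.append("<W_D>")
        # Day-of-month and month tokens from the posting date.
        date = pd.to_datetime(row["posted_date"])
        tokens.append(f"<D_{date.day}>")
        tokens.append(f"<M_{date.month}>")
        return " ".join([row["memo"]] + tokens)

    # df: transaction DataFrame with memo, amount, and posted_date columns.
    df["memo"] = df.apply(augment_memo, axis=1)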

Post-token augmentation transaction data
Figure: Transaction data after token augmentation. The memo field now includes additional tokens: <D_day> for the day of the transaction, <M_month> for the month, and <W_D> for whole dollar amounts.

Cleaning the Memo Fields

To prepare the dataset for modeling, we began by reviewing a sample of unique transaction memos from each category. This review surfaced common patterns and inconsistencies, helping us determine the scope and focus of our cleaning tasks. We then applied a series of cleaning steps to standardize the text: transforming all text to lowercase, removing extraneous punctuation and symbols, and stripping out dates, state abbreviations, placeholder values (such as runs of X’s), and recurring boilerplate that did not contribute to transaction categorization (e.g., “POS withdrawal”).

Uncleaned Memo Dataframe
Figure: This table shows the memo dataframe before cleaning. Variations in text format, inconsistent capitalization, and extraneous characters can be observed, which motivated the cleaning process to improve uniformity in transaction descriptions.
Cleaned Memo Dataframe
Figure: This table represents the cleaned memo dataframe, where we applied text preprocessing to ensure consistency across transaction descriptions. By converting to lowercase, removing punctuation, and standardizing certain tokens, we prepared the data for more accurate feature extraction and analysis.

Specifically, we applied the following transformations to standardize the data and improve its interpretability (a code sketch follows the list):

  • Date Removal: We used regular expressions (RegEx) to locate and remove dates across entries in the memo column, as they did not contribute to the model's predictive goals and added unnecessary complexity. Placeholder patterns such as "XXXX" were also removed.
  • Pattern Recognition: Specific patterns and keywords were identified as indicators of certain transaction categories. For example, "TST" was reliably associated with "Food and Beverages," while "APPLE.COM/BILL" was linked to "General Merchandise." These patterns were flagged to automate future classifications and reduce manual intervention.
  • Selective Character Retention: To preserve potentially valuable information, certain characters such as dots ('.') were retained to keep URLs or email addresses intact within the memo field, which could provide clues to transaction categories.
  • Transaction Labels: We identified recurring phrases such as "POS Withdrawal," location-specific markers (e.g., “CA 10/27” for state and date), and labels indicating recurring payments. These were removed as they did not contribute to the model’s predictive goals and added unnecessary complexity.
  • Text Normalization: All text was converted to lowercase to reduce variability due to case differences.
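
One way to implement these steps is a single regex-based cleaning pass, sketched below; the exact patterns used in our pipeline may differ.

    import re

    def clean_memo(memo: str) -> str:
        """Illustrative memo-cleaning pass (patterns are examples)."""
        text = memo.lower()  # text normalization
        # Remove dates such as 10/27, 10/27/2021, or 2021-10-27.
        text = re.sub(r"\b\d{1,2}/\d{1,2}(?:/\d{2,4})?\b", " ", text)
        text = re.sub(r"\b\d{4}-\d{2}-\d{2}\b", " ", text)
        # Drop masked placeholders such as "xxxx1234".
        text = re.sub(r"\bx{2,}\d*\b", " ", text)
        # Strip recurring labels that carry no category signal.
        text = re.sub(r"\bpos withdrawal\b", " ", text)
        # Remove punctuation, but keep dots and slashes so URLs like
        # "apple.com/bill" stay intact.
        text = re.sub(r"[^a-z0-9./ ]", " ", text)
        return re.sub(r"\s+", " ", text).strip()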

Models and Results

TF-IDF Vectorization

We transformed the cleaned memo text into numerical features using a TF-IDF vectorizer, allowing the model to analyze term-frequency patterns. We used the following settings (see the sketch after the list):

  • Max Features = 5000: Limited to the top 5,000 most important terms.
  • Max DF = 0.95: Ignored terms appearing in more than 95% of documents.
  • Min DF = 5: Ignored terms appearing in fewer than 5 documents.
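
In scikit-learn, this configuration corresponds to the sketch below; train_memos and test_memos are assumed to hold the cleaned, augmented memo strings.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # train_memos / test_memos: cleaned memo strings (assumed available).
    vectorizer = TfidfVectorizer(max_features=5000, max_df=0.95, min_df=5)
    X_train = vectorizer.fit_transform(train_memos)  # learn vocabulary on train
    X_test = vectorizer.transform(test_memos)        # reuse the same vocabulary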

Logistic Regression with TF-IDF Vectorization

To classify transaction categories based on the memo text field, we first used a Logistic Regression model with TF-IDF vectorization. This method converts raw text data into numerical features, allowing the model to utilize term frequency patterns for classification.

For classification, we configured the Logistic Regression model with:

  • solver='saga': Efficient for large datasets and supports L2 regularization.
  • max_iter=200: Ensured convergence.
  • n_jobs=-1: Utilized all available CPU cores for parallel training.
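
A minimal training sketch with this configuration, reusing the TF-IDF matrices from above (y_train and y_test are the category labels):

    from sklearn.linear_model import LogisticRegression

    # Remaining hyperparameters are scikit-learn defaults,
    # including L2 regularization.
    clf = LogisticRegression(solver="saga", max_iter=200, n_jobs=-1)
    clf.fit(X_train, y_train)
    print(f"Test accuracy: {clf.score(X_test, y_test):.4f}")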

Accuracy: The model achieved 96.15% accuracy on the test set.

Confusion Matrix

Logistic Regression Confusion Matrix
Figure: Confusion Matrix for Logistic Regression Model.

Random Forest with TF-IDF Vectorization

To test whether added complexity would improve classification performance, we next implemented a Random Forest model. Despite its greater complexity, its accuracy was substantially lower than Logistic Regression's (84.28% vs. 96.15%).

For classification, we configured the Random Forest model with:

  • n_estimators=100: Used 100 decision trees.
  • max_depth=60: Restricted tree depth to prevent overfitting.
  • n_jobs=-1: Utilized all available CPU cores for parallel training.
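
The corresponding scikit-learn sketch, using the same TF-IDF features and labels as before:

    from sklearn.ensemble import RandomForestClassifier

    # 100 trees, depth capped at 60 to limit overfitting.
    rf = RandomForestClassifier(n_estimators=100, max_depth=60, n_jobs=-1)
    rf.fit(X_train, y_train)
    print(f"Test accuracy: {rf.score(X_test, y_test):.4f}")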

Accuracy: The Random Forest model achieved 84.28% accuracy on the test set.

Confusion Matrix

Random Forest Confusion Matrix
Figure: Confusion Matrix for Random Forest Model.

FastText for Text Classification

FastText is a lightweight and efficient text classification model developed by Facebook AI. We trained this model to categorize transaction memos with high speed and accuracy.

Model Training

The FastText model was trained using the following hyperparameters:

  • Epochs: 25
  • Learning Rate: 1.0
  • Word N-Grams: 2
  • Embedding Dimension: 50
  • Bucket Size: 200,000
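
With the fasttext Python package, this training run looks roughly like the sketch below; it assumes the memos were exported to train.txt and test.txt in fastText's "__label__<category> <memo>" format.

    import fasttext

    model = fasttext.train_supervised(
        input="train.txt",  # one "__label__<category> <memo>" line per row
        epoch=25,
        lr=1.0,
        wordNgrams=2,
        dim=50,
        bucket=200_000,
    )
    # Returns (number of samples, precision@1, recall@1).
    n, precision, recall = model.test("test.txt")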

Accuracy: The FastText model achieved 98.9% accuracy, the highest of the four approaches we evaluated.

Classification Report

The classification report below presents precision, recall, and F1-scores for different transaction categories.

FastText Classification Report
Figure: Classification Report for FastText Model.

Transformer Model (DistilBERT)

We implemented a Transformer-based model using distilbert-base-uncased to leverage deep learning for text classification. This model was trained to classify transaction memos with contextual embeddings.

Training Configuration

The model was configured with the following hyperparameters:

  • Model Type: DistilBERT
  • Maximum Length: 128 tokens
  • Batch Size: 16
  • Learning Rate: 2e-5
  • Optimizer: AdamW
  • Scheduler: Linear schedule with a warmup ratio of 0.1
  • Gradient Clipping: Maximum value of 1.0
  • Epochs: 3
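
A condensed Hugging Face sketch of this setup is shown below; the dataloader and the train/eval loop are omitted, and memos and num_training_steps are assumed placeholders.

    from torch.optim import AdamW
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, get_linear_schedule_with_warmup)

    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "distilbert-base-uncased", num_labels=9)  # 9 transaction categories

    # Memos are truncated/padded to 128 tokens.
    batch = tokenizer(memos, truncation=True, max_length=128,
                      padding="max_length", return_tensors="pt")

    optimizer = AdamW(model.parameters(), lr=2e-5)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * num_training_steps),  # warmup ratio 0.1
        num_training_steps=num_training_steps)
    # Inside the training loop, gradients are clipped before each step:
    # torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)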

Model Performance

The Transformer model was evaluated on 9 transaction categories. After one epoch, it achieved:

  • Training Loss: 0.1361
  • Validation Loss: 0.0603
  • Validation Accuracy: 98.56%
  • Evaluation Time: 2 minutes and 43 seconds

While the Transformer model achieved high accuracy, it exhibited higher latency compared to simpler models, requiring longer processing times for both training and evaluation.

Predicting Cash Score

Methodology

Exploratory Data Analysis

We conducted an exploratory data analysis (EDA) to understand patterns in consumer spending, overdrafts, and transaction distributions. This helped us identify meaningful features for predicting credit risk.

  • Transaction Patterns: Identified differences in transaction behaviors between delinquent and non-delinquent consumers, such as frequency and types of purchases.
  • Seasonal Trends & Payday Effects: Examined how consumer spending fluctuates with seasonal trends, including holiday expenses and end-of-month payday spikes.
  • Income Estimation: Estimated consumer income based on recurring transactions such as payroll deposits, rent, and utility payments.
  • Impact of Fees & Overdrafts: Analyzed the effects of account fees, buy-now-pay-later (BNPL) transactions, and overdraft occurrences on overall financial stability.
Delinquency Percentage vs Credit Score
Figure: Delinquency Percentage vs Credit Score.

Comparing Delinquent and Non-Delinquent Consumers

We analyzed balance trends over time to observe how cash flow differs between individuals who are delinquent and those who are not. Our key findings highlight financial stability differences based on account balance trends.

  • Delinquent Consumers: Frequently experience negative balances, indicating financial instability and a higher risk of missing payments.
  • Non-Delinquent Consumers: Maintain relatively stable balances with fewer occurrences of overdrafts, suggesting better financial health.
  • Periodic Fluctuations: Balance trends exhibit cyclical patterns influenced by income deposits and spending habits.
Balance Trends for Delinquent Consumers
Figure: Balance trends over time for five randomly selected delinquent consumers. The plot illustrates fluctuations and frequent occurrences of negative balances, highlighting financial instability.
Balance for a Single Delinquent Consumer
Figure: Balance over time for a single delinquent consumer. This consumer frequently experiences negative balances, indicating financial distress and an increased risk of missing payments.

Feature Engineering

We engineered multiple features relevant to the prediction of delinquency, focusing on balance trends, transaction behaviors, and account types. These features help in assessing financial stability and predicting default risk.

  • Balance Features: Negative balance ratio, balance trends, payday effects.
  • Transaction-Based Features: Credit vs. debit transaction volume, category-based spending breakdown.
  • Temporal Features: Spending frequency over time, accounting for longevity effects.
  • Account Types: Features based on the types of accounts a consumer has.
  • Overdraft Frequency: The number of times a consumer overdrafted in the past 6 months.
  • Spending Volatility: Standard deviation of monthly expenditures.
  • Recurring Payments: Identifying transactions like rent, utilities, and subscriptions.
Spending Balance Ratio
Figure: Spending balance ratio feature created to measure how much consumers spend relative to their balance. This helps assess financial stability and risk of delinquency.
Standardized Credit and Balance
Figure: Feature engineering step where credit and balance were standardized to allow for easier model interpretability and comparisons across different financial profiles.
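
The sketch below derives a few of these features with pandas; the column names (consumer_id, balance, amount, category, date) are assumptions about the transaction table, and the real pipeline computes many more aggregates.

    import pandas as pd

    def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
        """Per-consumer delinquency features (illustrative subset)."""
        df["date"] = pd.to_datetime(df["date"])
        feats = pd.DataFrame()
        # Negative balance ratio: share of transactions posted while
        # the running balance was below zero.
        feats["negative_balance_ratio"] = (
            df.groupby("consumer_id")["balance"].apply(lambda b: (b < 0).mean()))
        # Overdraft frequency over the trailing six months.
        recent = df[df["date"] >= df["date"].max() - pd.DateOffset(months=6)]
        feats["overdraft_count_6m"] = (
            recent[recent["category"] == "OVERDRAFT"]
            .groupby("consumer_id").size()
            .reindex(feats.index, fill_value=0))
        # Spending volatility: standard deviation of monthly debit totals.
        debits = df[df["amount"] < 0]
        monthly = debits.groupby(
            ["consumer_id", debits["date"].dt.to_period("M")])["amount"].sum()
        feats["spending_volatility"] = monthly.groupby(level=0).std()
        return feats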

Feature Importance

We ranked features by their impact on the model's delinquency predictions using SHAP values:

Feature Importance using Shap Values
Figure: Feature Importance using SHAP Values.

The following are the top 10 most important features influencing delinquency prediction, as identified by SHAP values:

  • account_type_SAVINGS: Indicates whether a user has a savings account. A savings account is often associated with financial stability, reducing the risk of delinquency.
  • balance: Represents the consumer's total account balance. Lower balances correlate with higher delinquency risk, while higher balances suggest financial security.
  • ACCOUNT_FEES_median: Measures the median account fees incurred. Frequent fees may indicate financial strain or poor account management.
  • LOAN_std (Loan Standard Deviation): Represents the variability in loan-related transactions. High fluctuations may indicate inconsistent repayments and financial instability.
  • OVERDRAFT_count: Tracks the number of overdraft occurrences. A higher count strongly correlates with financial distress and increased risk of delinquency.
  • REFUND_skewness: Measures the asymmetry in refund transactions. Unusual refund patterns may suggest erratic cash flow or financial instability.
  • ACCOUNT_FEES_count: The total number of account fees charged. Higher counts may indicate insufficient funds or excessive financial penalties.
  • INVESTMENT_INCOME_std (Investment Income Standard Deviation): Measures fluctuations in investment income. High variation could indicate unreliable financial inflows.
  • SELF_TRANSFER_iqr (Interquartile Range of Self-Transfers): Measures variability in self-transfers between accounts. A high IQR may indicate inconsistent cash flow management.
  • ATM_CASH_median: The median amount of cash withdrawn from ATMs. Frequent and high cash withdrawals may suggest reliance on liquid cash, which could relate to financial instability.

These features provide critical insights into financial behavior and help the model better assess delinquency risk.
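
These rankings can be produced with the shap package along the lines of the sketch below, assuming model is a fitted tree-based classifier and X is the feature matrix.

    import shap

    explainer = shap.TreeExplainer(model)  # fast path for tree ensembles
    shap_values = explainer.shap_values(X)
    # Summary plot of the ten most influential features.
    shap.summary_plot(shap_values, X, max_display=10)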

Models and Results

Model Training

We trained multiple machine learning models to predict delinquency risk, focusing on maximizing performance while maintaining interpretability and efficiency.

  • Feature Selection: Used only the top 75 most important features to improve model interpretability and predictive power.
  • Scoring Exclusions: Consumers with fewer than two transactions were excluded to ensure meaningful behavioral patterns.

To address the class imbalance issue (delinquent consumers make up only 8.4% of the dataset), we applied:

  • SMOTE & SMOTEENN: SMOTE synthetically oversamples the minority class, while SMOTEENN pairs that oversampling with edited nearest-neighbor cleaning (see the sketch after this list).
  • Feature Normalization: Standardization of key numerical variables to improve model performance.
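
A resampling sketch with imbalanced-learn, assuming X_train and y_train hold the engineered features and delinquency labels:

    from imblearn.combine import SMOTEENN
    from imblearn.over_sampling import SMOTE
    from sklearn.preprocessing import StandardScaler

    # Standardize features, then rebalance the 8.4% minority class.
    X_scaled = StandardScaler().fit_transform(X_train)
    X_res, y_res = SMOTE(random_state=42).fit_resample(X_scaled, y_train)
    # SMOTEENN additionally removes ambiguous samples with
    # edited nearest neighbors after oversampling.
    X_alt, y_alt = SMOTEENN(random_state=42).fit_resample(X_scaled, y_train)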

To optimize the ROC-AUC score, we trained five primary models:

  • XGBoost: High-performance gradient boosting model with extensive hyperparameter tuning.
  • LightGBM: Fast and efficient for large datasets, especially with categorical features.
  • CatBoost: Optimized for categorical data, reducing the need for preprocessing.
  • Balanced Random Forest: Addresses class imbalance by appropriately weighting samples.
  • Weighted Ensemble: Combines top models for an optimal balance of accuracy, interpretability, and efficiency.
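
A minimal sketch of the weighted-ensemble idea, blending predicted probabilities from two of the boosted models; the 0.5/0.5 weights are illustrative (the actual weights were tuned), and the resampled data from the previous step is assumed.

    from lightgbm import LGBMClassifier
    from sklearn.metrics import roc_auc_score
    from xgboost import XGBClassifier

    xgb = XGBClassifier(eval_metric="auc").fit(X_res, y_res)
    lgbm = LGBMClassifier().fit(X_res, y_res)

    # Weighted average of predicted delinquency probabilities.
    proba = (0.5 * xgb.predict_proba(X_test)[:, 1]
             + 0.5 * lgbm.predict_proba(X_test)[:, 1])
    print(f"Ensemble ROC-AUC: {roc_auc_score(y_test, proba):.4f}")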

Model Evaluation

We evaluated model performance using several key classification metrics:

  • ROC-AUC: Measures the model’s ability to differentiate between delinquent and non-delinquent users.
  • Precision and Recall: Precision measures the fraction of predicted delinquents who are actually delinquent, while recall captures the proportion of actual delinquents that were correctly identified.
  • F1-Score: Provides a balance between precision and recall, ensuring both false positives and false negatives are considered.
  • Training Time: Evaluates how long each model takes to train, an essential factor for scalability.
  • Prediction Time: Measures how efficiently models generate predictions for new consumers.
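
Most of these metrics come straight from scikit-learn, as in the sketch below for any fitted classifier clf:

    import time
    from sklearn.metrics import classification_report, roc_auc_score

    start = time.perf_counter()
    y_pred = clf.predict(X_test)
    predict_time = time.perf_counter() - start  # prediction latency

    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"ROC-AUC: {auc:.4f} | predict time: {predict_time:.6f}s")
    print(classification_report(y_test, y_pred))  # precision, recall, F1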

Model Performance

Logistic Regression Confusion Matrix
Figure: Logistic Regression Confusion Matrix.
Random Forest Confusion Matrix
Figure: Random Forest Confusion Matrix.
LightGBM Confusion Matrix
Figure: LightGBM Confusion Matrix.
Balanced RF Confusion Matrix
Figure: Balanced Random Forest Confusion Matrix.
XGBoost Confusion Matrix
Figure: XGBoost Confusion Matrix.
CatBoost Confusion Matrix
Figure: CatBoost Confusion Matrix.
RUSBoost Confusion Matrix
Figure: RUSBoost Confusion Matrix.
| Model | ROC-AUC | Accuracy | Precision | Recall | F1-Score | Train Time (s) | Predict Time (s) |
|---|---|---|---|---|---|---|---|
| LightGBM | 0.8221 | 0.9031 | 0.8739 | 0.9031 | 0.8814 | 2.5050 | 0.000019 |
| Weighted Ensemble | 0.8301 | 0.9006 | 0.8728 | 0.9006 | 0.8813 | 0.0010 | 0.000001 |
| XGBoost | 0.8232 | 0.8962 | 0.8677 | 0.8962 | 0.8778 | 2.0606 | 0.000006 |
| CatBoost | 0.8212 | 0.8892 | 0.8707 | 0.8892 | 0.8785 | 3.0342 | 0.000004 |
| Balanced RF | 0.8144 | 0.8982 | 0.8703 | 0.8982 | 0.8797 | 26.2355 | 0.000064 |
| HistGB | 0.823 | 0.913 | 0.892 | 0.913 | 0.901 | 3.196 | 0.000021 |
| RUSBoost | 0.802 | 0.829 | 0.903 | 0.829 | 0.859 | 6.954 | 0.000012 |
| Random Forest | 0.800 | 0.915 | 0.893 | 0.915 | 0.902 | 14.364 | 0.000088 |
| Logistic Regression | 0.777 | 0.743 | 0.911 | 0.743 | 0.803 | 0.348 | 0.000003 |

Conclusion

Our analysis highlights the effectiveness of different machine learning models in predicting delinquency risk. Among them, the Weighted Ensemble demonstrated the highest ROC-AUC score (0.8301), making it the most effective model for our use case.

AUC-ROC Scores
Figure: ROC-AUC curve comparison for all models.
  • Best Model: Weighted ensemble achieved the best overall performance.
  • Cash Scores outperform Credit Scores: Our analysis shows that transaction-based Cash Scores provide better delinquency predictions than traditional Credit Scores.
Heatmap of Delinquency Rates by Cash Score and Credit Score
Figure: Heatmap showing delinquency rates across Cash Scores and Credit Scores, highlighting the effectiveness of our model.

Discussion

Our project aims to create a fairer credit assessment system while maintaining accuracy and transparency. The Cash Score model reduces reliance on traditional credit history and promotes financial inclusivity. By leveraging transactional data, we provide a more equitable method of assessing credit risk, particularly for those with limited or no credit history.

Our model demonstrates the potential for alternative credit scoring methods but faces challenges such as data bias and class imbalance. Future work will focus on refining the fairness and interpretability of the Cash Score while ensuring regulatory compliance.

Ensuring Compliance with the Equal Credit Opportunity Act (ECOA)

To ensure our model is fair and unbiased, we removed features that could introduce discrimination, in line with the Equal Credit Opportunity Act (ECOA). ECOA prohibits creditors from basing lending decisions on race, gender, age, marital status, reliance on public assistance, and other protected characteristics. Since some spending and income categories could act as indirect indicators of these attributes, we excluded them from our feature set to avoid potential bias.

Categories Removed for ECOA Compliance

After reviewing our data, we removed the features that could disproportionately impact protected groups, namely the spending and income categories that could serve as proxies for protected characteristics.

By removing these features, we made sure our Cash Score model stays focused on financial behavior rather than personal demographics. Our goal is to create a fair and transparent way to assess credit risk without reinforcing systemic biases.

Future Work

Moving forward, we plan to enhance the Cash Score model by refining its fairness and interpretability, strengthening regulatory compliance, and continuing to address data bias and class imbalance.
