Abstract
Determining a customer's creditworthiness from bank data is crucial, and the use of that data is subject to ethical and regulatory constraints. Despite the vast quantity of available data, only a limited number of features are directly useful for machine learning applications, raising the question of how best to assess a customer's financial reliability. Our methodology refines bank data into meaningful categories using Natural Language Processing, estimates individual income from transaction data alone, and evaluates creditworthiness with both accuracy and efficiency.
Introduction
Determining a customer’s creditworthiness is essential for banks, influencing decisions about loans, credit cards, and other financial services. While traditional credit scores are widely used, they often fail to capture the full financial picture, especially for individuals with little to no credit history. Our project aims to bridge this gap by using transaction data to assess financial behavior rather than relying solely on past credit activity. By analyzing spending habits, income consistency, and transaction types, we seek to create a scoring model that provides a more accurate and equitable reflection of financial reliability. As part of our project, we have implemented a data pipeline to support this model, including a categorization system that classifies transaction memos from vendors with high accuracy and low latency, so that entries like "Amazon.com" are correctly categorized under "General Merchandise." This categorization plays a key role in our broader goal of developing the Cash Score, a numerical value ranging from 1 to 999 that predicts a consumer’s likelihood of defaulting on debt or credit. By equipping financial institutions with this tool, we aim to improve risk assessment while extending credit access to those with strong banking histories, making lending practices more inclusive and effective.
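To make the 1 to 999 range concrete, the following is a minimal sketch of how a predicted default probability could be scaled onto that range. The function name, the linear scaling, and the convention that a higher score means lower predicted risk are all assumptions made for illustration; the source does not specify the mapping used by the Cash Score.

```python
def probability_to_cash_score(p_default: float, low: int = 1, high: int = 999) -> int:
    """Map a predicted default probability onto an integer score in [low, high].

    Assumptions (not taken from the project): the mapping is linear, and a
    higher score corresponds to a lower predicted probability of default.
    """
    if not 0.0 <= p_default <= 1.0:
        raise ValueError("p_default must lie in [0, 1]")
    # p_default = 0 maps to `high`, p_default = 1 maps to `low`.
    return low + round((1.0 - p_default) * (high - low))
```

Under these assumptions, a consumer with a predicted default probability of 0 receives the top of the range and one with a probability of 1 receives the bottom.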
Methods
Our methodology consists of feature engineering from transaction data, model training using various machine learning algorithms, and performance evaluation based on key metrics. The figure below shows our overall methodology:

Categorizing Transactions Based on Memos
Feature Creation and Selection
Models and Results
Predicting Cash Score
Methodology
Models and Results
Discussion
Our project aims to create a fairer credit assessment system while maintaining accuracy and transparency. The Cash Score model reduces reliance on traditional credit history and promotes financial inclusivity. By leveraging transactional data, we provide a more equitable method of assessing credit risk, particularly for those with limited or no credit history.
Our model demonstrates the potential for alternative credit scoring methods but faces challenges such as data bias and class imbalance. Future work will focus on refining the fairness and interpretability of the Cash Score while ensuring regulatory compliance.
Ensuring Compliance with the Equal Credit Opportunity Act (ECOA)
To keep our model fair and unbiased, we removed features that could introduce discrimination, in line with the Equal Credit Opportunity Act (ECOA). ECOA prohibits creditors from making lending decisions based on race, sex, age, marital status, and receipt of public assistance, among other protected characteristics. Since some spending and income categories could act as indirect indicators (proxies) of these attributes, we excluded them from our feature set to avoid potential bias.
Categories Removed for ECOA Compliance
After reviewing our data, we decided to remove certain features that could disproportionately impact protected groups:
- Unemployment Benefits: Could unfairly penalize people experiencing temporary job loss.
- Education: Spending on tuition or student loans could indirectly reveal age or socioeconomic status.
- Home Improvement: Spending in this category signals homeownership status, which could create an unintended bias against renters.
- Healthcare/Medical Expenses: High medical spending might reflect age or disability, which are protected attributes.
- Child Dependence: Child-related expenses could be linked to marital status or family structure.
- Pension & Retirement Benefits: Income from pensions or Social Security might unfairly disadvantage older consumers.
By removing these features, we made sure our Cash Score model stays focused on financial behavior rather than personal demographics. Our goal is to create a fair and transparent way to assess credit risk without reinforcing systemic biases.
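The exclusion step described above can be sketched as a filter applied before feature generation. This is only an illustrative sketch: the record layout (a dict with a "category" key), the function names, and the exact label strings are assumptions, with the labels copied from the list above rather than from the actual dataset.

```python
import re

# Labels copied from the list above; the real dataset's category labels
# may be spelled differently (assumption).
ECOA_EXCLUDED_CATEGORIES = frozenset({
    "unemployment benefits",
    "education",
    "home improvement",
    "healthcare/medical expenses",
    "child dependence",
    "pension & retirement benefits",
})


def _normalize(label: str) -> str:
    """Lowercase a label and collapse whitespace so comparisons are robust."""
    return re.sub(r"\s+", " ", label.strip().lower())


def drop_ecoa_categories(transactions):
    """Return only the transactions whose category is not on the excluded list."""
    return [
        t for t in transactions
        if _normalize(t["category"]) not in ECOA_EXCLUDED_CATEGORIES
    ]
```

Filtering at the transaction level, before any aggregation, means no downstream feature can be derived from the excluded categories.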
Future Work
Moving forward, we plan to enhance the Cash Score model with the following improvements:
- Integrate Q1 Project: Leverage our categorization model to create a category column, enabling additional feature generation for the Cash Score model. The model categorizes transactions based on the memos column in our Q1 dataset.
- Expand Dataset: Train and test on the full-size dataset to improve model generalizability and robustness.
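As a rough illustration of the planned integration, the sketch below shows one way a category column could feed additional features: each consumer's share of spending per category. It assumes the categorization model has already filled in a "category" field for every transaction. The field names ("consumer_id", "category", "amount") and the choice of spend share as the feature are hypothetical, not taken from the project.

```python
from collections import defaultdict


def category_spend_shares(transactions):
    """Return {consumer_id: {category: share of that consumer's total spend}}.

    Each transaction is assumed to be a dict with the hypothetical keys
    "consumer_id", "category", and "amount".
    """
    totals = defaultdict(lambda: defaultdict(float))
    for t in transactions:
        # Use the absolute value so debits recorded as negatives still count as spend.
        totals[t["consumer_id"]][t["category"]] += abs(t["amount"])

    features = {}
    for consumer_id, by_category in totals.items():
        total = sum(by_category.values())
        features[consumer_id] = {
            category: (amount / total if total else 0.0)
            for category, amount in by_category.items()
        }
    return features
```

Shares rather than raw totals keep the features comparable across consumers with very different overall spending levels; raw totals or transaction counts could be added in the same pass.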