Traditional credit risk assessment methods rely heavily on historical credit scores and limited financial data. However, this system serves only those with an existing credit history. Immigrants, younger individuals without established credit, and elderly people who have not made recent purchases may be financially responsible yet receive low credit scores simply because they lack a credit history. Without a high credit score, these groups are unable to access loans, credit cards, and other financial products.
⚠️ Essential Question: Can we predict whether an individual will pay back their debts, based on their banking transaction history, using natural language processing and predictive models?
This capstone explores how natural language processing can categorize banking transaction memos and predict an individual's credit risk through cashflow underwriting. The project applies techniques such as text embeddings and models such as BERT to classify transaction categories and predict credit scores accurately and efficiently. These results are then combined with income data to estimate whether a consumer would become delinquent, producing an accurate and reliable credit risk assessment model.
💡 Cashflow Underwriting (n.) A method of evaluating a borrower's creditworthiness without their credit score, by analyzing their actual income and spending.
💡 Delinquent (adj.) Late or past due on a required payment; when describing a consumer, one who has consistently failed to make minimum payments.
As previously stated, traditional credit scores are determined solely by financial statement history, which focuses on assets, liabilities, and net worth. This approach can be unfair to many populations, such as low-income individuals, new graduates, contracted workers, and immigrants, who may not have extensive credit records despite being financially responsible. Prism Data addresses this limitation by analyzing more than just traditional financial statements, incorporating income and transaction data to gain a fuller picture of financial behavior. Through its pioneering use of cashflow underwriting, Prism Data augments credit score accuracy and expands credit access to consumers who have limited or no traditional credit history, promoting a more inclusive and accurate assessment of creditworthiness.
We worked with ucsd-inflows.pqt and ucsd-outflows.pqt, which detail the banking transaction memos of money entering and exiting a consumer's bank account, respectively.
We were introduced to new datasets providing information at the consumer, account, and transaction levels. We used an 80/20 train-test split across 15K consumers. Unlike previous datasets, these were stripped of banking transaction memos for privacy protection; we were directly provided with purchase categories.
⚠️ Important: In a real setting, we would need to build a model to estimate the category from the banking transaction memos. Using the newly predicted categories, we would build a model predicting whether a consumer would be delinquent, before scaling their respective delinquency probabilities into an understandable credit risk value. Additionally, the data was provided through Prism Data's internal sources and is not open-source.
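The source does not show how the 80/20 split was performed; since labels live at the consumer level, a reasonable sketch (an assumption, using toy data in place of the internal 15K-consumer dataset) splits by consumer ID so no consumer's transactions leak across both sets:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(42)
# Toy stand-in for the real data: each row belongs to one consumer.
consumer_ids = rng.integers(0, 100, size=1000)   # hypothetical consumer IDs
X = rng.normal(size=(1000, 5))                   # hypothetical features
y = rng.integers(0, 2, size=1000)                # hypothetical DQ_TARGET

# Split at the consumer level (80/20) so no consumer appears in both sets,
# which would otherwise leak information between train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=consumer_ids))
print(len(train_idx), len(test_idx))
```

Splitting on raw rows instead of consumer groups would let the same consumer's spending patterns appear on both sides of the split, inflating test metrics.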
Consumer-level data:
prism_consumer_id,
evaluation_date,
credit_score,
DQ_TARGET (delinquent: 1, non-delinquent: 0)
Account-level data:
prism_consumer_id,
prism_account_id,
account_type (e.g., savings, checking),
balance_date,
balance
Transaction-level data:
prism_consumer_id,
prism_transaction_id,
category,
amount,
credit_or_debit,
posted_date
💡 The category column from trxndf uses numerical values mapped to category names via cat_map.
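The real tables are Prism Data internal and not open-source, so the sketch below uses hypothetical stand-ins for trxndf and cat_map (column names for cat_map are assumptions) to show the code-to-name join:

```python
import pandas as pd

# Hypothetical stand-ins; the real trxndf and cat_map are not public.
trxndf = pd.DataFrame({
    "prism_consumer_id": [1, 1, 2],
    "category": [0, 2, 1],          # numeric category codes
    "amount": [12.50, 80.00, 9.99],
    "credit_or_debit": ["debit", "debit", "debit"],
})
cat_map = pd.DataFrame({
    "category_id": [0, 1, 2],       # assumed column names for illustration
    "category_name": ["FOOD_AND_BEVERAGES", "GROCERIES", "GENERAL_MERCHANDISE"],
})

# Replace numeric codes with readable category names via a left join.
trxndf = trxndf.merge(
    cat_map, left_on="category", right_on="category_id", how="left"
).drop(columns=["category_id"])
print(trxndf["category_name"].tolist())
```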
Our strongest performing model, DistilBERT, successfully categorizes transactions into education, food and beverages, general merchandise, and other significant categories.
| Model | Accuracy | Macro F1 | Weighted F1 |
|---|---|---|---|
| Logistic Regression | 0.9300 | 0.87 | 0.93 |
| Random Forest | 0.9128 | 0.84 | 0.91 |
| Sentence Encoder + Logistic Regression | 0.8631 | 0.81 | 0.86 |
| Sentence Encoder + XGBoost | 0.8916 | 0.77 | 0.89 |
| FinBERT | 0.8061 | 0.54 | 0.81 |
| DistilBERT | 0.9552 | 0.92 | 0.96 |

💡 Accuracy is the percentage of transactions correctly classified out of all transactions. Macro F1 is the unweighted mean of the per-class F1 scores, useful for imbalanced datasets when each class should be treated equally. Weighted F1 weights each class's F1 score by its support relative to the total support, useful for imbalanced datasets when accounting for class frequency.
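The difference between macro and weighted F1 matters here because the transaction categories are imbalanced. A tiny contrived example shows how a model that mostly predicts the majority class gets a flattering weighted F1 but a poor macro F1:

```python
from sklearn.metrics import f1_score

# Contrived imbalanced labels: class "a" dominates (90 of 100 samples).
y_true = ["a"] * 90 + ["b"] * 10
# A model that nearly always predicts "a": all of "a" correct,
# but only 1 of the 10 "b" samples is caught.
y_pred = ["a"] * 90 + ["a"] * 9 + ["b"]

macro = f1_score(y_true, y_pred, average="macro")        # treats both classes equally
weighted = f1_score(y_true, y_pred, average="weighted")  # weights by class support
print(round(macro, 3), round(weighted, 3))
```

Here the weighted score stays high because the majority class dominates it, while the macro score exposes the weak minority-class performance, which is exactly the gap discussed for the models above.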
| Model | AUC | Precision | Recall |
|---|---|---|---|
| Logistic Regression | 0.76 | 0.209 | 0.572 |
| Random Forest | 0.79 | 0.281 | 0.422 |
| XGBoost | 0.80 | 0.357 | 0.398 |
| LightGBM | 0.80 | 0.521 | 0.295 |

💡 AUC (area under the ROC curve) measures how well a binary classification model can distinguish between classes. Precision asks: of the consumers predicted delinquent, how many actually are? High precision minimizes false positives. Recall asks: of all delinquent consumers, how many were correctly identified? High recall minimizes false negatives.
We applied LightGBM to create a binary classification model that predicted whether consumers would be delinquent. Although XGBoost achieved an equally high AUC, the LightGBM model had a faster inference time.
We combined multiple NLP techniques to predict credit risk from transaction text.
- Banking transaction descriptions from Prism Data
- Applying natural language processing to transaction memos
- Assessing credit risk through delinquency binary classification
We tested six models in total. Here are three of the most notable.
The Logistic Regression model, trained on TF-IDF features with 99.80% sparsity, achieved an overall accuracy of 93.00% and a weighted F1 score of 0.93 in 12.16 minutes. The model demonstrated strong performance on well-represented categories, achieving F1 scores above 0.90 for six of the nine categories.
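The project's exact feature pipeline is not shown in the source; a minimal sketch of the TF-IDF plus Logistic Regression approach, using hypothetical memos in place of the internal Prism Data text, might look like:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical memos and labels standing in for the real (non-public) data;
# the actual model trains on far more memos across nine categories.
memos = ["STARBUCKS COFFEE #1234", "WALMART SUPERCENTER",
         "UNIV TUITION PAYMENT", "MCDONALDS 99",
         "TARGET STORE 55", "STATE UNIVERSITY FEES"]
labels = ["FOOD_AND_BEVERAGES", "GENERAL_MERCHANDISE", "EDUCATION",
          "FOOD_AND_BEVERAGES", "GENERAL_MERCHANDISE", "EDUCATION"]

# TF-IDF turns each memo into a sparse bag-of-ngrams vector,
# which the linear classifier then separates by category.
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
clf.fit(memos, labels)
print(clf.predict(["STARBUCKS COFFEE #1234"]))
```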
Logistic Regression performs strongly on majority classes but struggles on minority classes, a pattern that holds for Random Forest as well (Figure 1). Generally, Logistic Regression performs better than Random Forest across most of the categories (Figure 2).
The model did well at identifying Overdraft transactions and high-frequency categories, like FOOD_AND_BEVERAGES and GENERAL_MERCHANDISE, but struggled with EDUCATION transactions. The 0.06 gap between weighted F1 (0.93) and macro F1 (0.87) shows moderate sensitivity to class imbalance, suggesting the model's aggregate performance is inflated by strong results on majority classes.
This model uses a transformer-based sentence encoder to generate dense 768-dimensional embeddings for each memo, followed by a logistic regression classifier. The full pipeline took 7.67 minutes: 1.742 s of setup, 2.86 min of embedding generation, 4.81 min of training, and under 0.5 s of inference.
Comparing F1 across categories shows that the sentence-encoder logistic regression improves on the classical models and reduces imbalance-related variance, but does not reach DistilBERT-level performance (Figure 3). Precision-recall trade-offs remain relatively stable across categories, with minority-class recall outperforming the TF-IDF models (Figure 4). Categories with lower sample support show more performance variability, which contextualizes the remaining gaps (Figure 5). This method forms a strong intermediate model: heavily improved semantic understanding over TF-IDF, yet lighter and more stable than transformer fine-tuning.
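The two-stage shape of this approach (embed each memo into a dense vector, then fit a linear classifier on the vectors) can be sketched as below. The actual 768-dimensional transformer encoder is too heavy to run inline, so the `embed` function here is a deterministic hash-based placeholder, not the encoder used in the project:

```python
import numpy as np
from zlib import crc32
from sklearn.linear_model import LogisticRegression

def embed(memos, dim=768):
    """Placeholder for a transformer sentence encoder: maps each memo to a
    dense dim-length vector. Tokens are hashed into buckets purely to keep
    this example self-contained; real embeddings capture semantics."""
    out = np.zeros((len(memos), dim))
    for i, memo in enumerate(memos):
        for tok in memo.lower().split():
            out[i, crc32(tok.encode()) % dim] += 1.0
    return out

# Hypothetical memos and labels (real data is internal to Prism Data).
memos = ["STARBUCKS COFFEE", "WALMART STORE", "UNIV TUITION",
         "COFFEE SHOP", "WALMART SUPERCENTER", "TUITION FEES"]
labels = ["FOOD", "MERCH", "EDU", "FOOD", "MERCH", "EDU"]

# Stage 2: a linear classifier on top of the dense vectors.
clf = LogisticRegression(max_iter=1000).fit(embed(memos), labels)
print(clf.predict(embed(["STARBUCKS COFFEE"]))[0])
```

The design trade-off this illustrates: the encoder is frozen (no fine-tuning), so only the cheap linear head is trained, which is why this sits between TF-IDF and full DistilBERT fine-tuning in both cost and accuracy.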
DistilBERT achieved the highest performance across all models, with 97.92% accuracy and both weighted and macro F1 scores around 0.98 on training data.
Training dynamics over three epochs show efficient learning: training loss decreased from 0.1178 to 0.0635, while validation loss decreased from 0.0950 to 0.0730. Validation metrics tracked training closely, showing no signs of overfitting. Accuracy improved from 96.97% to 97.92%, and F1 increased from 0.9697 to 0.9792 by the final epoch.
On the validation set, DistilBERT maintained almost perfect performance across most categories (Figure 8). High-frequency categories such as FOOD_AND_BEVERAGES (F1: 0.952), GENERAL_MERCHANDISE (F1: 0.958), and GROCERIES (F1: 0.960) were classified accurately. Even minority classes achieved strong results. The model's weighted and macro F1 scores (0.955 and 0.921, respectively) suggest robustness to class imbalance.
On the test set, DistilBERT generalized well (Figure 9). Overall accuracy was 95.52%, with a weighted F1 of 0.955 and a macro F1 of 0.921. Performance on high-support categories remained excellent: FOOD_AND_BEVERAGES (F1: 0.952), GENERAL_MERCHANDISE (F1: 0.958), and GROCERIES (F1: 0.960). Importantly, DistilBERT also substantially outperformed all other models on minority classes.
Figure 9: Confusion matrix of DistilBERT's performance on the test set.
DistilBERT not only surpasses classical models in overall metrics but also handles minority classes with high reliability. Its combination of efficiency, accuracy, and balanced performance makes it particularly effective for transaction classification tasks with imbalanced datasets.
We tested four models in all: logistic regression, random forest, XGBoost, and LightGBM.
For Logistic Regression, we configured the model with a maximum of 1,000 iterations, balanced class weights, and a random state of 42. After excluding accounts without transactions or with fewer than 30 days of observable history, restricting the model to accounts with at least one credit and one debit transaction, and identifying the top 50 most predictive features, our refined model achieved an AUC of 0.757.
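The engineered account features are internal to the project, so the sketch below uses synthetic data. The configuration described above (1,000 max iterations, balanced class weights, random state 42, top-50 feature selection) could be wired up as follows; using SelectKBest with an ANOVA F-test for "top 50 most predictive features" is an assumption, since the source does not name the selection method:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the engineered account features (real data is internal),
# with roughly the ~8.6% delinquency rate mentioned later in the report.
X, y = make_classification(n_samples=2000, n_features=120, n_informative=20,
                           weights=[0.914, 0.086], random_state=42)

model = make_pipeline(
    SelectKBest(score_func=f_classif, k=50),  # keep the 50 strongest features
    LogisticRegression(max_iter=1000, class_weight="balanced", random_state=42),
)
model.fit(X, y)
print(round(model.score(X, y), 3))
```

`class_weight="balanced"` reweights the rare delinquent class so the optimizer does not simply learn to predict the majority (non-delinquent) label.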
Applying the same filters (excluding accounts without transactions or with fewer than 30 days of observable history, and requiring at least one credit and one debit transaction), we identified the top 50 most predictive features for the Random Forest model. The refined model achieved a test AUC of 0.798, test accuracy of 0.916, test precision of 0.586, and test recall of 0.102.
For XGBoost, we used 600 estimators with a shallow max depth of 4, a slow learning rate of 0.01, and moderate subsampling. After applying the same account filters (excluding accounts without transactions or with under 30 days of observable history, and requiring at least one credit and one debit), we achieved a test AUC of 0.80, tied with LightGBM for the highest AUC among the models tested.
After optimizing the threshold, the recall was about 0.376. The model's precision was strong, suggesting high confidence when flagging someone as delinquent, but it still missed most of the actual delinquents.
Since the train AUC was 0.934 compared to the test AUC of 0.801, it is reasonable to infer that some overfitting occurred despite implementing regularization parameters, including reg_lambda=0.5, gamma=0, and min_child_weight=3.
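The report does not specify how the decision threshold was optimized; one common approach (an assumption here, shown on synthetic scores) is to sweep the precision-recall curve and pick the threshold that maximizes F1:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)
# Synthetic model scores: delinquents (y=1) tend to score higher.
y_true = np.concatenate([np.zeros(900, dtype=int), np.ones(100, dtype=int)])
scores = np.concatenate([rng.normal(0.3, 0.15, 900), rng.normal(0.6, 0.15, 100)])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
# precision/recall have one more entry than thresholds; drop the final
# (threshold-less) point so F1 values align with candidate thresholds.
f1 = 2 * precision * recall / (precision + recall + 1e-12)
best = int(np.argmax(f1[:-1]))
best_threshold = thresholds[best]
print(best_threshold, precision[best], recall[best])
```

Moving the threshold trades precision against recall, which is exactly the trade-off visible between the XGBoost and LightGBM rows in the table above.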
For LightGBM, we used 900 estimators with a shallow max depth of 3, a slow learning rate of 0.01, and a conservative leaf count (15). To prevent overfitting on the minority class, we specified min_child_samples=80, and we used class_weight="balanced" to handle the ~8.6% delinquency rate in the training data. We achieved a test AUC of 0.80 with an improved precision-recall balance compared to XGBoost: using the optimized threshold, we obtained a precision of 0.521 and a recall of 0.291. The gap between train AUC (0.903) and test AUC (0.793) is also smaller than XGBoost's, suggesting less overfitting. Using 5-fold cross-validation, we obtained a mean CV AUC of 0.798 (std 0.011), which demonstrates that the model generalizes consistently rather than performing well only once.
A report by the Consumer Financial Protection Bureau estimates that about 32 million Americans are either "credit invisible" or unscorable. Prism Data enhances traditional credit scoring by applying natural language processing to banking transaction memo data to extract insights into consumer spending behavior and improve credit risk estimates. This approach can expand credit access for underserved populations, including immigrants, young adults, recent graduates, and gig workers, while providing lenders with more accurate assessments of borrower risk. By incorporating alternative data, the model has the potential to promote financial inclusion and support fairer lending practices.
Deploying our model requires careful ethical and operational planning. Continuous monitoring is necessary to ensure consistent performance over time and across different consumer populations, particularly as spending behaviors and economic conditions change. The system should include safeguards to flag uncertain or ambiguous predictions and require human review before final decisions. Bias across vulnerable populations, including cash-based workers, immigrants, ESL users, recent graduates, and the elderly, must be monitored and mitigated. Transparency, human oversight, and regular evaluation are essential to ensure that the model supports responsible lending decisions without causing unintended harm.
The project is designed as a decision-support tool to supplement, not replace, traditional credit scoring. It focuses on analyzing transaction data to provide a fuller view of consumer financial behavior but does not make automated lending decisions. Regulatory compliance, real-time deployment, and long-term monitoring are outside the scope. The model aims to improve inclusivity and accuracy for populations underrepresented in traditional credit scoring and provides insights to human reviewers to support better-informed decisions.
The model relies on transaction data, often captured as snapshots, which offer only a partial representation of a consumer's financial history. Missing, incomplete, or ambiguous memos, high-volume repetitive transactions, seasonal shifts such as tax season or holidays, and unseen transaction types can reduce accuracy. There is a risk of misclassifying consumers as delinquent or non-delinquent. False positives may unfairly restrict access to credit, while false negatives may expose lenders to financial loss. Performance also depends on the populations and transaction patterns represented in the training dataset and may not generalize well to unseen data.
Deployment is appropriate only when ethical, technical, and operational safeguards are in place. Guardrails such as confidence thresholds, human review, and fallback mechanisms must be established to manage uncertainty and prevent overconfident errors. Edge cases must be thoroughly tested, and bias across vulnerable populations including cash-based workers, immigrants, ESL users, recent graduates, and the elderly must be monitored. The system is limited to providing support to human decision-makers rather than replacing them, ensuring careful interpretation of predictions and minimizing potential harm.
Interested in our research? Let's connect!