<aside>
💡
the why…
With the growing demand for Digital Lending and Mobile Device Financing in LMICs, where credit identity data for scoring is often scarce, we must explore alternative credit assessment methods. It has become essential to innovate and strategically collect relevant data, while building robust fraud and risk management models.
As a product manager, I should be able to track portfolio quality, risk, and fraud effectively. These efforts will help us stay ahead of potential portfolio shocks and address macroeconomic and individual behavioural patterns that could lead to revenue leakage.
</aside>
the what…
Credit companies are turning to ML, specifically anomaly detection (AD) and social network analysis (SNA), to detect and control fraud at scale. How do we estimate the probability of malicious non-performance or non-payment?
↪️ Anomaly detection is an ML approach used to identify data points or patterns that deviate from the norm. Unlike traditional classification models, which rely on predefined labels (e.g., fraud vs non-fraud), anomaly detection excels in identifying unknown or novel fraudulent activities by flagging outliers in data.
Some of the key techniques within this classification include (in order of popularity):
- Statistical Models: Z-scores/distribution-based methods to detect deviations.
- Clustering Algorithms: Identifying outliers relative to clusters (popular implementations are based on DBSCAN, Density-Based Spatial Clustering of Applications with Noise; unlike k-means, which is limited to spherical shapes, it can detect more complex cluster shapes). These are particularly useful in segmenting users into variable risk strata.
- Time Series Analysis: Detecting abnormal patterns in temporal data (e.g. repayment, dealership acquisition, back- and front-office behaviour over time). This is particularly useful for segmented customers: we can detect repayment behaviour deviating from the expected norm and flag it as a risk for fraud or default.
- Benford’s Law: This is especially useful in validating customer-reported data such as income; expenses on rent, utilities, and food; number of dependents; and other financial obligations. The law states that in many naturally occurring datasets the leading digits are distributed in a specific, non-uniform way: the digit 1 appears as the leading digit about 30% of the time, while higher digits appear progressively less often, with 9 the least frequent at under 5%.
- Geo Hashing: This is particularly useful for detecting location-based anomalies: sudden spikes or drops in activity within a geohash; users (or similarly attributed users) frequently operating across distant geohashes within short time frames; and transactions originating from unexpected geohashes for a given user.
- Deep Learning: Autoencoders and recurrent neural networks for high-dimensional or sequential data (data-intensive algorithms).
- Autoencoders learn a compressed representation of the input data and reconstruct it; anomalies are flagged when the reconstruction error is high.
- LSTMs (a type of RNN) can be used to predict expected behaviour, with large deviations from predictions flagged as potential fraud. RNNs are efficient at handling long-term dependencies in sequential data (a client profile can be treated as a sequence of time-dependent events such as instalment records, enrolment dates, agent interactions, app utilisation, etc.).
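As a minimal sketch of the statistical approach above, a z-score check on a customer cohort's repayment amounts might look like the following (the field and threshold are illustrative, not from the source):

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose z-score against the cohort exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical cohort of repayment amounts; one payment is wildly out of line.
# Note: with small samples the maximum attainable z-score is capped
# (~(n-1)/sqrt(n)), so a lower threshold is used here.
repayments = [100, 105, 98, 102, 101, 99, 100, 1000]
print(zscore_outliers(repayments, threshold=2.0))  # → [1000]
```

In production, the mean and standard deviation would be computed per segment (see the clustering step above) rather than over the whole book.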
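The Benford's Law check above can be sketched as a comparison of observed leading-digit shares against the expected distribution P(d) = log10(1 + 1/d); the input here is assumed to be positive integers such as customer-reported monthly income:

```python
import math
from collections import Counter

def benford_check(values):
    """Compare observed leading-digit shares with Benford's expected shares.

    Returns a dict: digit -> (observed share, expected share), both rounded
    to 3 decimal places. Assumes positive integer inputs.
    """
    counts = Counter(int(str(v)[0]) for v in values if v > 0)
    n = sum(counts.values())
    report = {}
    for d in range(1, 10):
        observed = counts[d] / n          # Counter returns 0 for absent digits
        expected = math.log10(1 + 1 / d)  # Benford's expected share for digit d
        report[d] = (round(observed, 3), round(expected, 3))
    return report
```

A large gap between observed and expected shares (e.g. measured with a chi-squared test over these pairs) would flag a batch of self-reported figures as suspect.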
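The geohashing idea above (distant geohashes within short time frames) reduces to an "impossible travel" check once geohashes are decoded to their centroid coordinates; this sketch assumes that decoding has already happened and uses illustrative timestamps in hours:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points."""
    r = 6371.0  # mean Earth radius, km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def impossible_travel(events, max_kmh=900):
    """Flag consecutive events implying travel faster than max_kmh.

    `events` is a time-ordered list of (hours, lat, lon), where lat/lon
    would be the decoded geohash centroid for that activity event.
    """
    flags = []
    for (t1, la1, lo1), (t2, la2, lo2) in zip(events, events[1:]):
        dt = t2 - t1
        dist = haversine_km(la1, lo1, la2, lo2)
        if dt <= 0 or dist / dt > max_kmh:
            flags.append((t1, t2, round(dist, 1)))
    return flags
```

A flagged pair does not prove fraud on its own, but it is a strong candidate for the risk-scoring step described later.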
Practical applications include pinpointing credit applications from high-risk geographic/dealership/front- and back-officer zones and suspicious repayment behaviours. Further applications include device theft, collusion, and payment evasion, as described below:
- Loan Origination Fraud: Customers/sales agents may use fake identities (identity is a function of NIN, card numbers, demographics, and KYC/affordability survey answers)
- Mismatched demographic data, such as low affordability or digital literacy paired with an application for a high-tier smartphone
- Outliers in application velocity, such as multiple loans linked to the same geographic location within a short period
- Based on Benford’s Law, find customers who may have submitted false critical KYC data (predictors of their affordability or intent to pay); these are customers whose data falls outside the natural distribution
- Repayment Anomalies (a great asset for collections teams prioritising whom to engage)
- Large one-off payments followed by default; large in this case is relative based on the customer cohort
- Payment patterns inconsistent with historical trends in the same cohort
- MDF device misuse and resale
- Unusual geo-location data, e.g. devices activating in unexpected regions
- Under-utilisation post-acquisition, suggesting a high probability of default
- SIM switching post-acquisition, suggesting a high probability of fraud
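The application-velocity outlier above can be sketched with a simple sliding-window count per location; the field names, window, and limit here are assumptions for illustration:

```python
from collections import defaultdict

def velocity_flags(applications, window_hours=24, max_per_location=3):
    """Flag locations with more than max_per_location applications
    inside any window of window_hours.

    `applications` is a list of (timestamp_hours, location_id), e.g. the
    timestamp and geohash/dealership of each loan application.
    """
    by_loc = defaultdict(list)
    for ts, loc in applications:
        by_loc[loc].append(ts)
    flagged = set()
    for loc, times in by_loc.items():
        times.sort()
        start = 0
        for end in range(len(times)):
            # Shrink the window until it spans at most window_hours.
            while times[end] - times[start] > window_hours:
                start += 1
            if end - start + 1 > max_per_location:
                flagged.add(loc)
                break
    return flagged
```

The same windowed count works for other velocity signals (applications per sales agent, per device model, per payment number).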
↪️ Social Network Analysis combined with anomaly detection provides a robust fraud control strategy. SNA focuses on identifying relationships and patterns within a network, making it invaluable in detecting collusion and organised fraud rings.
SNA uncovers hidden connections between entities (front officers, customers, next of kin, alternate phone numbers provided, phone numbers used to pay). Fraud rings leave traces in shared contact details and transaction identifiers.
(Connected) clusters can be analysed as mini-portfolios in order to flag high-, mid-, and low-risk clusters, thereafter driving proactive mechanisms to curb fraud such as:
- Detecting collusion by identifying high risk or poorly performing clusters
- Tracing fraud rings when a customer/sales agent is flagged for fraud or delinquency + write-off
- Mapping connections between devices financed in the same time frame
- Preventing fake customer onboarding (users operating multiple accounts)
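A minimal version of the SNA idea above is to link accounts that share any identifier and take connected components as candidate rings; this sketch uses plain Python rather than a graph library, and the record shapes are assumptions:

```python
from collections import defaultdict, deque

def shared_identifier_clusters(records):
    """Group accounts that share any identifier (alternate phone number,
    payment number, next-of-kin contact, etc.).

    `records` maps account_id -> set of identifier strings; returns a list
    of account clusters (connected components). Unexpectedly large
    clusters are candidate fraud rings.
    """
    # Index accounts by identifier, then build an adjacency map.
    by_ident = defaultdict(set)
    for acct, idents in records.items():
        for ident in idents:
            by_ident[ident].add(acct)
    adj = defaultdict(set)
    for accts in by_ident.values():
        for a in accts:
            adj[a] |= accts - {a}
    # Breadth-first search over the adjacency map for connected components.
    seen, clusters = set(), []
    for acct in records:
        if acct in seen:
            continue
        comp, queue = set(), deque([acct])
        while queue:
            a = queue.popleft()
            if a in comp:
                continue
            comp.add(a)
            queue.extend(adj[a] - comp)
        seen |= comp
        clusters.append(comp)
    return clusters
```

At scale, a graph database or a library with community-detection algorithms would replace this, but the flagging logic is the same: shared identifiers become edges, and clusters become mini-portfolios.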
the how…
How exactly do we get started? How do we build (from scratch) an ML Powered Fraud Detection Pipeline?
- Data Instrumentation: Implement mechanisms within the product or system to collect, measure, and track data. This ensures visibility into how the system is used and how it performs, laying the foundation for effective fraud detection.
- Data collection & Preprocessing: Structure raw data by normalising and cleaning it to ensure consistency and quality. Proper preprocessing eliminates noise and prepares the data for reliable analysis and model training.
- Evaluate Off-the-Shelf Models: Identify quick-win solutions by testing the performance of readily available models:
- Unsupervised Models: Train these on historical data to detect abnormal patterns. Deploy them in real-time environments without decision-making authority to assess delivery feasibility and efficiency.
- Graph-Based ML Models: Use these to identify suspicious clusters in historical data, then deploy them in real time to flag suspicious entries.
- Benford’s Law: Compute leading-digit frequencies on historical data to find the % of customers whose submitted numerical data falls outside the expected thresholds. Alignment with the thresholds suggests that using this data to make lending decisions will yield a high-quality portfolio.
- Risk Scoring: Combine anomaly scores and SNA-derived metrics into a unified risk score with defined strata. Deploy the risk model to classify clients in real time to gauge model generalisability.
- Gradual Release: Gradually allow models to take on scoring and independently control acceptance rates for new customers, from 5% up to 90%, reserving a 10% control group to evaluate whether the risk controls are effective in terms of default rates, known-fraud % of customers, on-time/self-service repayments, and collection rates compared with this control group.
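The risk-scoring step above could start as simply as a weighted blend of the two signals; the weights and strata cut-offs below are placeholders that would be tuned against the control group, not values from the source:

```python
def unified_risk_score(anomaly_score, cluster_default_rate,
                       w_anomaly=0.6, w_cluster=0.4):
    """Blend a normalised anomaly score (0..1) with the default rate
    (0..1) of the client's SNA cluster into one score and a risk stratum.

    Weights and thresholds are illustrative; in practice they would be
    calibrated against observed defaults in the 10% control group.
    """
    score = w_anomaly * anomaly_score + w_cluster * cluster_default_rate
    if score >= 0.7:
        return score, "high"
    if score >= 0.4:
        return score, "mid"
    return score, "low"

print(unified_risk_score(0.9, 0.8))  # a highly anomalous client in a bad cluster
```

As the models mature, this linear blend would typically give way to a learned meta-model over the same inputs, but keeping the first version this transparent makes the gradual-release comparison against the control group easy to interpret.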