Case study · 2025

RFM Customer Segmentation

Machine learning pipeline that segments customers by purchasing behaviour, turning raw transactional data into actionable retention and marketing strategy.

Overview

RFM Customer Segmentation is an end-to-end machine learning project that transforms raw transactional records into clearly defined customer segments. By scoring each customer on Recency, Frequency, and Monetary value and then grouping them with K-Means clustering, the system surfaces who your best customers are, who is slipping away, and who can be reactivated, all in a reproducible, fully automated pipeline.

Problem

Businesses rely on broad demographics instead of actual purchasing behaviour, leading to generic campaigns with low conversion rates
Manual customer classification does not scale: analysts spend days categorising thousands of records with inconsistent results
Without quantified customer value, marketing budgets are spread evenly across high-value and low-value segments, wasting spend
At-risk customers who have stopped buying are invisible until it is too late to re-engage them cost-effectively
One-size-fits-all retention strategies fail to address the different motivations of champions, loyal regulars, and lapsed buyers

Methodology

The pipeline runs in five sequential stages: clean, engineer, normalise, cluster, and label. Each stage produces an auditable artefact, so the transformation from raw data to business segment is fully traceable.

Data cleaning: removed duplicate transactions, corrected negative quantities and prices from return entries, and resolved missing Customer IDs, reducing noise by ~18% before any feature engineering began
RFM feature engineering: computed Recency (days since last purchase from the snapshot date), Frequency (distinct invoice count per customer), and Monetary (total net spend) for every unique customer in the dataset
Score normalisation: applied a log transformation to the Monetary and Frequency distributions to reduce right skew, then standardised all three features with StandardScaler before clustering so that high-spend customers do not dominate the distance metric
Cluster selection: ran K-Means for k=2 through k=10, plotted Within-Cluster Sum of Squares (WCSS) against k for the Elbow Method, and confirmed the choice with the Silhouette Score, landing on k=4 with a score of 0.52
Segment labelling: profiled each cluster by its median RFM values and mapped the clusters to Champions, Loyal Customers, At-Risk, and Lost/Inactive, so the labels are grounded in actual spending patterns rather than arbitrary names
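
The feature-engineering, normalisation, and cluster-selection steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual notebook code: the column names (customer_id, invoice_id, invoice_date, quantity, unit_price), the use of log1p with clipping at zero, and the n_init and seed values are all assumptions made for the example. It expects a transaction table that has already been cleaned.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def build_rfm(tx: pd.DataFrame, snapshot_date: pd.Timestamp) -> pd.DataFrame:
    """Aggregate cleaned transactions into one R/F/M row per customer."""
    # Column names here are hypothetical; adapt them to the real dataset.
    tx = tx.assign(line_total=tx["quantity"] * tx["unit_price"])
    grouped = tx.groupby("customer_id")
    return pd.DataFrame({
        # Days between the snapshot date and the customer's last purchase
        "recency": (snapshot_date - grouped["invoice_date"].max()).dt.days,
        # Number of distinct invoices
        "frequency": grouped["invoice_id"].nunique(),
        # Total net spend
        "monetary": grouped["line_total"].sum(),
    })


def scale_rfm(rfm: pd.DataFrame) -> np.ndarray:
    """Log-transform the skewed features, then standardise all three."""
    features = rfm[["recency", "frequency", "monetary"]].astype(float)
    # Assumption: log1p is used so zeros are valid, and negative net
    # spend is clipped to 0 before the transform.
    features["frequency"] = np.log1p(features["frequency"])
    features["monetary"] = np.log1p(features["monetary"].clip(lower=0))
    return StandardScaler().fit_transform(features)


def scan_k(X, k_values, seed: int = 42) -> pd.DataFrame:
    """Fit K-Means for each k and record WCSS and Silhouette Score."""
    rows = []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        rows.append({
            "k": k,
            "wcss": model.inertia_,  # within-cluster sum of squares
            "silhouette": silhouette_score(X, model.labels_),
        })
    return pd.DataFrame(rows).set_index("k")
```

Under these assumptions, scan_k(scale_rfm(rfm), range(2, 11)) would produce the table behind the elbow plot, with the silhouette column used as the cross-check on the chosen k.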

Results & Impact

Segmented 4,300+ unique customers into 4 behaviourally distinct groups with a Silhouette Score of 0.52, indicating reasonably well-separated clusters
Champions segment (top ~15% of customers) accounts for approximately 61% of total revenue, confirming a strong Pareto effect and justifying a VIP retention programme
At-Risk segment contains 820+ customers whose last purchase was 90–180 days ago: a directly actionable re-engagement list estimated to represent $38K in recoverable annual revenue
Lost/Inactive cluster surfaced 540+ customers silent for 180+ days, enabling the business to suppress them from expensive outbound campaigns and reduce wasted ad spend
Automated the full segmentation pipeline end-to-end, replacing a manual analyst workflow that previously took 2–3 days per cycle with a repeatable notebook run that completes in under 5 minutes
Cluster stability confirmed across 5 independent random seeds with fewer than 3% customer label changes, demonstrating robustness of the chosen k and preprocessing steps
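
One way a seed-stability check like the one above could be computed is sketched below. The case study does not say how label changes were counted, so this is an assumption: K-Means cluster ids are arbitrary, so each run's ids are first matched to the reference run with the Hungarian algorithm, and only then are disagreements counted. The function names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix


def label_change_rate(reference, candidate) -> float:
    """Fraction of points whose cluster differs after matching cluster ids."""
    overlap = confusion_matrix(reference, candidate)
    # Hungarian matching: choose the id pairing that maximises agreement
    # (negated because linear_sum_assignment minimises cost).
    rows, cols = linear_sum_assignment(-overlap)
    agreed = overlap[rows, cols].sum()
    return 1.0 - agreed / len(reference)


def seed_stability(X, k: int, seeds) -> list:
    """Compare each seeded run against the first seed's labelling."""
    runs = [
        KMeans(n_clusters=k, n_init=10, random_state=s).fit_predict(X)
        for s in seeds
    ]
    return [label_change_rate(runs[0], run) for run in runs[1:]]
```

With five seeds this returns four change rates, one per comparison against the first run; the "fewer than 3%" figure would correspond to every value staying below 0.03.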

Technical Highlights

Log + StandardScaler normalisation pipeline prevents high-spend outliers from collapsing cluster separation in Euclidean space
Elbow Method combined with Silhouette Score validation ensures k is chosen on statistical evidence, not intuition
Snapshot-date parameterisation means Recency scores are reproducible and comparable across analysis runs without touching raw data
Modular notebook structure: each stage (clean → engineer → cluster → visualise) is isolated so individual steps can be re-run after data refreshes without full pipeline re-execution
Visualisation layer uses Seaborn pair plots and cluster scatter plots with centroid markers to make segment boundaries interpretable to non-technical stakeholders
RFM composite scoring grid overlaid on cluster results provides a dual-lens view: statistical clusters validated against human-readable RFM quintile bands
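
The quintile bands behind an RFM scoring grid like the one mentioned above can be sketched with pandas. This is an illustrative version only: the column names, the 1-5 scale with 5 as the best band, and the tie-breaking via rank(method="first") are assumptions, not details confirmed by the project.

```python
import pandas as pd


def quintile_scores(rfm: pd.DataFrame) -> pd.DataFrame:
    """Score each feature 1-5 by quintile; 5 is always the best band."""

    def band(series: pd.Series, ascending: bool) -> pd.Series:
        # Ranking first breaks ties, so qcut always gets 5 distinct bin edges
        # even when many customers share the same frequency.
        ranked = series.rank(method="first", ascending=ascending)
        return pd.qcut(ranked, 5, labels=[1, 2, 3, 4, 5]).astype(int)

    scores = pd.DataFrame({
        # Low recency (recent purchase) is good, so rank it in reverse
        "r_score": band(rfm["recency"], ascending=False),
        "f_score": band(rfm["frequency"], ascending=True),
        "m_score": band(rfm["monetary"], ascending=True),
    })
    # Composite cell such as "555" (best) or "111" (worst)
    scores["rfm_cell"] = (
        scores["r_score"].astype(str)
        + scores["f_score"].astype(str)
        + scores["m_score"].astype(str)
    )
    return scores
```

A cross-tabulation such as pd.crosstab(cluster_labels, scores["r_score"]), where cluster_labels is the K-Means output, would then show how each statistical cluster spreads across the human-readable bands.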

Business Insights

Segmentation outputs were translated directly into four differentiated marketing playbooks:

Champions: early access to new products, referral incentives, and loyalty rewards; the focus is retention and lifetime value expansion
Loyal Customers: upsell and cross-sell campaigns timed to their purchase cadence, plus personalised product recommendations based on category history
At-Risk: win-back sequences triggered at the 90-day recency threshold, offering time-limited discounts or free shipping to re-establish the purchase habit
Lost/Inactive: suppressed from standard campaigns to protect deliverability; re-engagement only through low-cost channels such as low-frequency email with high-value incentives

Roadmap

Real-time scoring: stream new transactions through the RFM pipeline via a lightweight API so segment labels update daily instead of on a scheduled batch cycle
Dynamic k selection: automate cluster count re-evaluation each cycle using Calinski-Harabasz and Davies-Bouldin indices alongside Silhouette Score to detect structural shifts in purchasing behaviour
CLV integration: augment RFM scores with predicted Customer Lifetime Value from a BG/NBD or Pareto/NBD model for more precise segment prioritisation
Dashboard: build an interactive Streamlit or Dash front-end exposing segment distributions, centroid drift over time, and per-segment revenue contribution
A/B testing framework: instrument marketing playbooks to feed campaign response rates back into the segmentation model as an additional behavioural signal
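
As a rough sketch of the dynamic k selection item above, the three indices can be computed per candidate k with scikit-learn and combined. How the indices would be combined is not specified in the roadmap; the average-rank vote used here, along with the function names and the n_init and seed values, is an assumption for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)


def evaluate_k(X, k_values, seed: int = 42) -> pd.DataFrame:
    """Compute the three cluster-validity indices for each candidate k."""
    rows = []
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        rows.append({
            "k": k,
            "silhouette": silhouette_score(X, labels),
            "calinski_harabasz": calinski_harabasz_score(X, labels),
            "davies_bouldin": davies_bouldin_score(X, labels),
        })
    return pd.DataFrame(rows).set_index("k")


def pick_k(metrics: pd.DataFrame) -> int:
    """Pick the k with the best average rank across the three indices."""
    ranks = pd.DataFrame({
        # Higher is better for Silhouette and Calinski-Harabasz
        "silhouette": metrics["silhouette"].rank(ascending=False),
        "calinski_harabasz": metrics["calinski_harabasz"].rank(ascending=False),
        # Lower is better for Davies-Bouldin
        "davies_bouldin": metrics["davies_bouldin"].rank(ascending=True),
    })
    return int(ranks.mean(axis=1).idxmin())
```

Comparing pick_k(evaluate_k(X, range(2, 11))) with the previous cycle's k would be one simple signal that the structure of purchasing behaviour has shifted.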

Tech stack

Python · Pandas · NumPy · Scikit-Learn · Matplotlib · Seaborn · Jupyter Notebook · K-Means Clustering · RFM Analysis