In data science and machine learning, SQL isn’t just a helpful tool—it’s a foundational skill. Whether you’re building recommendation systems, preparing large datasets, or analyzing user behavior, mastering intermediate SQL concepts is key to unlocking complex data workflows.
This guide focuses on practical SQL examples used in real-world machine learning pipelines, especially highlighting how to use JOINs, aggregation functions, subqueries, and CTEs (Common Table Expressions) for advanced data preparation.
Why Intermediate SQL Skills Matter for ML Projects
As machine learning moves from theory to deployment, SQL remains the backbone for:
- Extracting and transforming user behavior logs
- Segmenting customers for personalized recommendations
- Detecting anomalies in real-time logs
- Preparing training data from enterprise-scale databases
Let’s dive into an example-driven walkthrough of essential SQL patterns used in production.

(Source: Wikipedia)
Relational Analysis with JOINs
INNER JOIN – Merging Related Tables
SELECT A.column, B.columnFROM tableA AS AINNER JOIN tableB AS BON A.key = B.key;
Example: Combine customer profiles with order history to calculate total spend:
SELECT c.customer_id, c.name, SUM(o.total_price) AS total_spentFROM customers AS cINNER JOIN orders AS o ON c.customer_id = o.customer_idGROUP BY c.customer_id, c.nameORDER BY total_spent DESC;
Use Case: Feature engineering for high-value customer prediction.
LEFT JOIN – Retain All Rows from Left Table
SELECT A.column, B.columnFROM tableA AS ALEFT JOIN tableB AS BON A.key = B.key;
Example: Include all customers, even those with no orders:
SELECT c.customer_id, c.name, o.order_id, o.total_priceFROM customers cLEFT JOIN orders o ON c.customer_id = o.customer_id;
Use Case: Prepare input data for churn prediction or cold-start analysis.
SELF JOIN – Join a Table to Itself
SELECT A.column, B.columnFROM table AS AJOIN table AS BON A.related_column = B.related_column;
Example: Analyze mentor-mentee relationships within an employee table:
SELECT e1.name AS mentor, e2.name AS menteeFROM employees e1JOIN employees e2 ON e1.employee_id = e2.mentor_id;
Use Case: Internal hierarchy mapping or social graph analysis.
Aggregation with GROUP BY and HAVING
SQL’s aggregation functions are essential for feature generation and statistical summaries.
Common Aggregate Functions
| Function | Description |
|---|---|
COUNT() | Row count |
SUM() | Total value |
AVG() | Average value |
MAX() | Maximum value |
MIN() | Minimum value |
Example: Average spend per customer
SELECT customer_id, AVG(total_price) AS avg_spentFROM ordersGROUP BY customer_id;
Filtering Aggregates with HAVING
SELECT column, AGG_FUNCTION(column)FROM tableGROUP BY columnHAVING AGG_FUNCTION(column) condition;
Example: Customers who spent over $500
SELECT customer_id, SUM(total_price) AS total_spentFROM ordersGROUP BY customer_idHAVING total_spent > 500;
Use Case: Customer segmentation for targeted offers or retention models.
Advanced Filtering with Subqueries and CTEs
Subqueries – Nested Queries for Derived Filters
SELECT columnFROM (
SELECT column FROM table WHERE condition
) AS subquery;
Example: Find the customer who spent the most
SELECT customer_id, total_spent
FROM (
SELECT customer_id, SUM(total_price) AS total_spent
FROM orders
GROUP BY customer_id
) AS t
WHERE total_spent = (SELECT MAX(total_spent) FROM (
SELECT customer_id, SUM(total_price) AS total_spent
FROM orders
GROUP BY customer_id
) AS totals);
CTE (Common Table Expressions) – Cleaner, Reusable Logic
WITH cte_name AS (
SELECT column, AGG_FUNCTION(column)
FROM table
GROUP BY column
)
SELECT * FROM cte_name WHERE condition;
Example: Get customers with over $500 spent in the last 6 months
WITH recent_orders AS (
SELECT customer_id, SUM(total_price) AS total_spent
FROM orders
WHERE order_date >= DATE_SUB(NOW(), INTERVAL 6 MONTH)
GROUP BY customer_id
)
SELECT c.customer_id, c.name, r.total_spent
FROM customers c
JOIN recent_orders r ON c.customer_id = r.customer_id
WHERE r.total_spent > 500;
Use Case: Generate filtered customer segments for AI-based personalization engines.
Practical Machine Learning Scenario
Goal: Identify customers likely to repurchase based on recent behavior
Steps:
- Filter customers with purchases in the last 6 months
- Aggregate total spend per customer
- Join with profile data
- Filter by a spending threshold (e.g., $500)
Result: Clean, aggregated customer dataset for supervised ML tasks like churn prediction or LTV modeling.
- JOINs help connect relational data for enriched feature sets
- Aggregation functions and HAVING allow behavioral segmentation
- Subqueries and CTEs simplify complex logic and improve readability
- SQL remains a powerful preprocessing tool for AI and machine learning workflows
Mastering these intermediate SQL techniques will empower you to transform raw data into ML-ready insights and contribute to scalable AI pipelines in real-world projects.
