In data science and machine learning, SQL isn’t just a helpful tool—it’s a foundational skill. Whether you’re building recommendation systems, preparing large datasets, or analyzing user behavior, mastering intermediate SQL concepts is key to unlocking complex data workflows.
This guide focuses on practical SQL examples used in real-world machine learning pipelines, especially highlighting how to use JOINs, aggregation functions, subqueries, and CTEs (Common Table Expressions) for advanced data preparation.
Why Intermediate SQL Skills Matter for ML Projects
As machine learning moves from theory to deployment, SQL remains the backbone for:
- Extracting and transforming user behavior logs
- Segmenting customers for personalized recommendations
- Detecting anomalies in real-time logs
- Preparing training data from enterprise-scale databases
Let’s dive into an example-driven walkthrough of essential SQL patterns used in production.

(Source: Wikipedia)
Relational Analysis with JOINs
INNER JOIN – Merging Related Tables
SELECT A.column, B.column
FROM tableA AS A
INNER JOIN tableB AS B
ON A.key = B.key;
Example: Combine customer profiles with order history to calculate total spend:
SELECT c.customer_id, c.name, SUM(o.total_price) AS total_spent
FROM customers AS c
INNER JOIN orders AS o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name
ORDER BY total_spent DESC;
Use Case: Feature engineering for high-value customer prediction.
LEFT JOIN – Retain All Rows from Left Table
SELECT A.column, B.column
FROM tableA AS A
LEFT JOIN tableB AS B
ON A.key = B.key;
Example: Include all customers, even those with no orders:
SELECT c.customer_id, c.name, o.order_id, o.total_price
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id;
Use Case: Prepare input data for churn prediction or cold-start analysis.
SELF JOIN – Join a Table to Itself
SELECT A.column, B.column
FROM table AS A
JOIN table AS B
ON A.related_column = B.related_column;
Example: Analyze mentor-mentee relationships within an employee table:
SELECT e1.name AS mentor, e2.name AS mentee
FROM employees e1
JOIN employees e2 ON e1.employee_id = e2.mentor_id;
Use Case: Internal hierarchy mapping or social graph analysis.
Aggregation with GROUP BY and HAVING
SQL’s aggregation functions are essential for feature generation and statistical summaries.
Common Aggregate Functions
Function | Description |
---|---|
COUNT() | Row count |
SUM() | Total value |
AVG() | Average value |
MAX() | Maximum value |
MIN() | Minimum value |
Example: Average spend per customer
SELECT customer_id, AVG(total_price) AS avg_spent
FROM orders
GROUP BY customer_id;
Filtering Aggregates with HAVING
SELECT column, AGG_FUNCTION(column)
FROM table
GROUP BY column
HAVING AGG_FUNCTION(column) condition;
Example: Customers who spent over $500
SELECT customer_id, SUM(total_price) AS total_spent
FROM orders
GROUP BY customer_id
HAVING total_spent > 500;
Use Case: Customer segmentation for targeted offers or retention models.
Advanced Filtering with Subqueries and CTEs
Subqueries – Nested Queries for Derived Filters
SELECT column
FROM (
SELECT column FROM table WHERE condition
) AS subquery;
Example: Find the customer who spent the most
SELECT customer_id, total_spent
FROM (
SELECT customer_id, SUM(total_price) AS total_spent
FROM orders
GROUP BY customer_id
) AS t
WHERE total_spent = (SELECT MAX(total_spent) FROM (
SELECT customer_id, SUM(total_price) AS total_spent
FROM orders
GROUP BY customer_id
) AS totals);
CTE (Common Table Expressions) – Cleaner, Reusable Logic
WITH cte_name AS (
SELECT column, AGG_FUNCTION(column)
FROM table
GROUP BY column
)
SELECT * FROM cte_name WHERE condition;
Example: Get customers with over $500 spent in the last 6 months
WITH recent_orders AS (
SELECT customer_id, SUM(total_price) AS total_spent
FROM orders
WHERE order_date >= DATE_SUB(NOW(), INTERVAL 6 MONTH)
GROUP BY customer_id
)
SELECT c.customer_id, c.name, r.total_spent
FROM customers c
JOIN recent_orders r ON c.customer_id = r.customer_id
WHERE r.total_spent > 500;
Use Case: Generate filtered customer segments for AI-based personalization engines.
Practical Machine Learning Scenario
Goal: Identify customers likely to repurchase based on recent behavior
Steps:
- Filter customers with purchases in the last 6 months
- Aggregate total spend per customer
- Join with profile data
- Filter by a spending threshold (e.g., $500)
Result: Clean, aggregated customer dataset for supervised ML tasks like churn prediction or LTV modeling.
- JOINs help connect relational data for enriched feature sets
- Aggregation functions and HAVING allow behavioral segmentation
- Subqueries and CTEs simplify complex logic and improve readability
- SQL remains a powerful preprocessing tool for AI and machine learning workflows
Mastering these intermediate SQL techniques will empower you to transform raw data into ML-ready insights and contribute to scalable AI pipelines in real-world projects.