Atomicity jpg Mastering Intermediate SQL for Data Anaylsis: JOINs, Aggregations, and Subqueries Data Science, Programming

Mastering Intermediate SQL for Data Anaylsis: JOINs, Aggregations, and Subqueries

Posted by

In data science and machine learning, SQL isn’t just a helpful tool—it’s a foundational skill. Whether you’re building recommendation systems, preparing large datasets, or analyzing user behavior, mastering intermediate SQL concepts is key to unlocking complex data workflows.

This guide focuses on practical SQL examples used in real-world machine learning pipelines, especially highlighting how to use JOINs, aggregation functions, subqueries, and CTEs (Common Table Expressions) for advanced data preparation.


Why Intermediate SQL Skills Matter for ML Projects

As machine learning moves from theory to deployment, SQL remains the backbone for:

  • Extracting and transforming user behavior logs
  • Segmenting customers for personalized recommendations
  • Detecting anomalies in real-time logs
  • Preparing training data from enterprise-scale databases

Let’s dive into an example-driven walkthrough of essential SQL patterns used in production.

deeplink beginners guide to sql for data analysis and machine learning beginners guide to sql for data analysis and machine learning Mastering Intermediate SQL for Data Anaylsis: JOINs, Aggregations, and Subqueries Data Science, Programming
A chart showing several of the SQL language elements comprising a single statement
(Source: Wikipedia)

Relational Analysis with JOINs

INNER JOIN – Merging Related Tables

SELECT A.column, B.column
FROM tableA AS A
INNER JOIN tableB AS B
ON A.key = B.key;

Example: Combine customer profiles with order history to calculate total spend:

SELECT c.customer_id, c.name, SUM(o.total_price) AS total_spent
FROM customers AS c
INNER JOIN orders AS o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name
ORDER BY total_spent DESC;

Use Case: Feature engineering for high-value customer prediction.


LEFT JOIN – Retain All Rows from Left Table

SELECT A.column, B.column
FROM tableA AS A
LEFT JOIN tableB AS B
ON A.key = B.key;

Example: Include all customers, even those with no orders:

SELECT c.customer_id, c.name, o.order_id, o.total_price
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id;

Use Case: Prepare input data for churn prediction or cold-start analysis.


SELF JOIN – Join a Table to Itself

SELECT A.column, B.column
FROM table AS A
JOIN table AS B
ON A.related_column = B.related_column;

Example: Analyze mentor-mentee relationships within an employee table:

SELECT e1.name AS mentor, e2.name AS mentee
FROM employees e1
JOIN employees e2 ON e1.employee_id = e2.mentor_id;

Use Case: Internal hierarchy mapping or social graph analysis.


Aggregation with GROUP BY and HAVING

SQL’s aggregation functions are essential for feature generation and statistical summaries.

Common Aggregate Functions

FunctionDescription
COUNT()Row count
SUM()Total value
AVG()Average value
MAX()Maximum value
MIN()Minimum value

Example: Average spend per customer

SELECT customer_id, AVG(total_price) AS avg_spent
FROM orders
GROUP BY customer_id;

Filtering Aggregates with HAVING

SELECT column, AGG_FUNCTION(column)
FROM table
GROUP BY column
HAVING AGG_FUNCTION(column) condition;

Example: Customers who spent over $500

SELECT customer_id, SUM(total_price) AS total_spent
FROM orders
GROUP BY customer_id
HAVING total_spent > 500;

Use Case: Customer segmentation for targeted offers or retention models.


Advanced Filtering with Subqueries and CTEs

Subqueries – Nested Queries for Derived Filters

SELECT column
FROM (
SELECT column FROM table WHERE condition
) AS subquery;

Example: Find the customer who spent the most

SELECT customer_id, total_spent
FROM (
SELECT customer_id, SUM(total_price) AS total_spent
FROM orders
GROUP BY customer_id
) AS t
WHERE total_spent = (SELECT MAX(total_spent) FROM (
SELECT customer_id, SUM(total_price) AS total_spent
FROM orders
GROUP BY customer_id
) AS totals);

CTE (Common Table Expressions) – Cleaner, Reusable Logic

WITH cte_name AS (
SELECT column, AGG_FUNCTION(column)
FROM table
GROUP BY column
)
SELECT * FROM cte_name WHERE condition;

Example: Get customers with over $500 spent in the last 6 months

WITH recent_orders AS (
SELECT customer_id, SUM(total_price) AS total_spent
FROM orders
WHERE order_date >= DATE_SUB(NOW(), INTERVAL 6 MONTH)
GROUP BY customer_id
)
SELECT c.customer_id, c.name, r.total_spent
FROM customers c
JOIN recent_orders r ON c.customer_id = r.customer_id
WHERE r.total_spent > 500;

Use Case: Generate filtered customer segments for AI-based personalization engines.


Practical Machine Learning Scenario

Goal: Identify customers likely to repurchase based on recent behavior

Steps:

  1. Filter customers with purchases in the last 6 months
  2. Aggregate total spend per customer
  3. Join with profile data
  4. Filter by a spending threshold (e.g., $500)

Result: Clean, aggregated customer dataset for supervised ML tasks like churn prediction or LTV modeling.


  • JOINs help connect relational data for enriched feature sets
  • Aggregation functions and HAVING allow behavioral segmentation
  • Subqueries and CTEs simplify complex logic and improve readability
  • SQL remains a powerful preprocessing tool for AI and machine learning workflows

Mastering these intermediate SQL techniques will empower you to transform raw data into ML-ready insights and contribute to scalable AI pipelines in real-world projects.

Leave a Reply

Your email address will not be published. Required fields are marked *