Mastering Intermediate SQL for Data Anaylsis

In data science and machine learning, SQL isn’t just a helpful tool—it’s a foundational skill. Whether you’re building recommendation systems, preparing large datasets, or analyzing user behavior, mastering intermediate SQL concepts is key to unlocking complex data workflows.

This guide focuses on practical SQL examples used in real-world machine learning pipelines, especially highlighting how to use JOINs, aggregation functions, subqueries, and CTEs (Common Table Expressions) for advanced data preparation.

Beginner’s Guide to SQL for Data Analysis and Machine Learning

Table of Contents　

Why Intermediate SQL Skills Matter for ML Projects

As machine learning moves from theory to deployment, SQL remains the backbone for:

Extracting and transforming user behavior logs
Segmenting customers for personalized recommendations
Detecting anomalies in real-time logs
Preparing training data from enterprise-scale databases

Let’s dive into an example-driven walkthrough of essential SQL patterns used in production.

deeplink beginners guide to sql for data analysis and machine learning beginners guide to sql for data analysis and machine learning Mastering Intermediate SQL for Data Anaylsis: JOINs, Aggregations, and Subqueries Data Science, Programming — A chart showing several of the SQL language elements comprising a single statement
(Source: Wikipedia)

Relational Analysis with JOINs

INNER JOIN – Merging Related Tables

SELECT A.column, B.column
FROM tableA AS A
INNER JOIN tableB AS B
ON A.key = B.key;

Example: Combine customer profiles with order history to calculate total spend:

SELECT c.customer_id, c.name, SUM(o.total_price) AS total_spent
FROM customers AS c
INNER JOIN orders AS o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name
ORDER BY total_spent DESC;

Use Case: Feature engineering for high-value customer prediction.

LEFT JOIN – Retain All Rows from Left Table

SELECT A.column, B.column
FROM tableA AS A
LEFT JOIN tableB AS B
ON A.key = B.key;

Example: Include all customers, even those with no orders:

SELECT c.customer_id, c.name, o.order_id, o.total_price
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id;

Use Case: Prepare input data for churn prediction or cold-start analysis.

SELF JOIN – Join a Table to Itself

SELECT A.column, B.column
FROM table AS A
JOIN table AS B
ON A.related_column = B.related_column;

Example: Analyze mentor-mentee relationships within an employee table:

SELECT e1.name AS mentor, e2.name AS mentee
FROM employees e1
JOIN employees e2 ON e1.employee_id = e2.mentor_id;

Use Case: Internal hierarchy mapping or social graph analysis.

Aggregation with GROUP BY and HAVING

SQL’s aggregation functions are essential for feature generation and statistical summaries.

Common Aggregate Functions

Function	Description
`COUNT()`	Row count
`SUM()`	Total value
`AVG()`	Average value
`MAX()`	Maximum value
`MIN()`	Minimum value

Example: Average spend per customer

SELECT customer_id, AVG(total_price) AS avg_spent
FROM orders
GROUP BY customer_id;

Filtering Aggregates with HAVING

SELECT column, AGG_FUNCTION(column)
FROM table
GROUP BY column
HAVING AGG_FUNCTION(column) condition;

Example: Customers who spent over $500

SELECT customer_id, SUM(total_price) AS total_spent
FROM orders
GROUP BY customer_id
HAVING total_spent > 500;

Use Case: Customer segmentation for targeted offers or retention models.

Advanced Filtering with Subqueries and CTEs

Subqueries – Nested Queries for Derived Filters

SELECT column
FROM (
    SELECT column FROM table WHERE condition
) AS subquery;

Example: Find the customer who spent the most

SELECT customer_id, total_spent
FROM (
    SELECT customer_id, SUM(total_price) AS total_spent
    FROM orders
    GROUP BY customer_id
) AS t
WHERE total_spent = (SELECT MAX(total_spent) FROM (
    SELECT customer_id, SUM(total_price) AS total_spent
    FROM orders
    GROUP BY customer_id
) AS totals);

CTE (Common Table Expressions) – Cleaner, Reusable Logic

WITH cte_name AS (
    SELECT column, AGG_FUNCTION(column)
    FROM table
    GROUP BY column
)
SELECT * FROM cte_name WHERE condition;

Example: Get customers with over $500 spent in the last 6 months

WITH recent_orders AS (
    SELECT customer_id, SUM(total_price) AS total_spent
    FROM orders
    WHERE order_date >= DATE_SUB(NOW(), INTERVAL 6 MONTH)
    GROUP BY customer_id
)
SELECT c.customer_id, c.name, r.total_spent
FROM customers c
JOIN recent_orders r ON c.customer_id = r.customer_id
WHERE r.total_spent > 500;

Use Case: Generate filtered customer segments for AI-based personalization engines.

Practical Machine Learning Scenario

Goal: Identify customers likely to repurchase based on recent behavior

Steps:

Filter customers with purchases in the last 6 months
Aggregate total spend per customer
Join with profile data
Filter by a spending threshold (e.g., $500)

Result: Clean, aggregated customer dataset for supervised ML tasks like churn prediction or LTV modeling.

JOINs help connect relational data for enriched feature sets
Aggregation functions and HAVING allow behavioral segmentation
Subqueries and CTEs simplify complex logic and improve readability
SQL remains a powerful preprocessing tool for AI and machine learning workflows

Mastering these intermediate SQL techniques will empower you to transform raw data into ML-ready insights and contribute to scalable AI pipelines in real-world projects.