Advantage Developer Zone

 
 
 

Correlated Subqueries

Friday, January 08, 2010

Subqueries are a very powerful and useful feature of the SQL standard. Subqueries can be categorized as either correlated or uncorrelated queries. A correlated subquery is one that is dependent on the outer query to be processed. These types of subqueries can be very inefficient and should be avoided.

In this article we will demonstrate several correlated subqueries and discuss methods for rewriting the query to avoid using them. It is important to note that you may not always be able to entirely remove a correlated subquery.

What is a Correlated Subquery

A correlated subquery, also referred to as a repeating subquery, is a subquery that uses values from the outer query in its WHERE clause. The correlated subquery will be evaluated once for each row returned by the outer (main) query. Although this query might be efficient when used on small tables the query will get more and more inefficient as the size of the tables increase.

The easiest way to determine if your subquery is a correlated subquery is to run the subquery by itself. If it does not run as a standalone query then it is dependent on the outer query and thus a correlated subquery.

Common Correlated Subqueries

It is common temptation to use a subquery as a returned field in a SELECT statement. It is very simple to do and the query is fairly straight forward. For example if our customer table stores a companyid the company name can be retrieved from the company table using the id stored in the customer table.

SELECT c.LastName, c.FirstName,
(SELECT CompanyName FROM Company WHERE ID = c.CompanyID) AS CompanyName
FROM Customer c

The Advantage SQL Engine is optimized to handle this type of correlated subquery and you may not have performance issues. It is still best practice to avoid the correlated subquery when possible. For instance this query can be easily rewritten to use a join.

SELECT c.LastName, c.FirstName, cm.CompanyName
FROM Customer c LEFT OUTER JOIN Company cm
ON c.CompanyID = cm.CompanyID

Another common instance of a correlated subquery can be found in the where clause. For example:

SELECT e.LastName, e.FirstName, e.Salary
FROM Employee e WHERE e.Salary < (SELECT AVG(Salary)
FROM Employee WHERE Department = e.Department GROUP BY Department)

This query returns the names of all the employees who are getting paid less than the average salary for their department. The subquery depends on the outer query to identify each employee’s department.

Rewriting this query is a bit more complicated because we are calculating the average based on the department. We can create a view which calculates the average salary and join it with the employee table to get the same results.

SELECT e.LastName, e.FirstName, e.Salary
FROM Employee e INNER JOIN AverageSalary a ON e.Department = a.Department
WHERE e.Salary < a.AverageSalary
Taking a Closer Look

Let’s take a closer look at removing a correlated subquery and optimizing the new query. Since a correlated subquery has a subquery with an outer reference (i.e. a reference to the main query) the inner query will be evaluated once for each value in the outer query. This may not cause a problem with smaller tables but performance will suffer as the tables get larger. Consider the following query:

SELECT * FROM SalesReps WHERE Empl_Num IN
( SELECT Empl_Num FROM OrdersTaken
WHERE MonthName( Order_Date ) = 'February'
GROUP BY Empl_Num
HAVING Sum( Amount ) < SalesReps.Quota )

Below is the SQL Execution plan for the query, notice that the subquery will be run 13535 times.

As with the first example this query can be modified to remove the outer reference. This is accomplished by rewriting the subquery to include a join.

SELECT * FROM SalesReps WHERE Empl_Num IN
( SELECT o.Empl_Num FROM OrdersTaken o INNER JOIN SalesReps s ON o.Empl_Num = s.Empl_Num
WHERE MonthName( o.Order_Date ) = 'February' GROUP BY o.Empl_Num, s.Quota
HAVING Sum( o.Amount ) < s.Quota )

In order to optimize the join condition I ensured that indexes were created on Empl_Num in both the OrdersTaken and SalesReps tables. With these indexes in place the server can use an index scans instead of a table scan when joining the tables. You’ll also notice the subquery will only have to be executed 30 times in order to process the query. Although this new query has a much more complicated execution plan it is more efficient than the first query.

When running these two queries on my machine against the same database the execution times were: Query 1 – 1 sec 938 ms; Query 2 – 0 sec 16 ms. As you can see there is a significant performance improvement when the outer reference is removed.

Summary

Subqueries are a powerful feature of SQL. However, subqueries that contain an outer reference can be very inefficient. In many instances these queries can be rewritten to remove the outer reference which can improve performance. It is worthwhile to review the SQL Execution plan to help identify potential inefficiencies.