One of the things we see as consultants are consistent patterns of behavior by T-SQL developers that cause big performance problems. I call these anti-patterns. Today I want to specifically address SQL User Defined Functions (UDFs).
SQL Server 2000 introduced UDFs. While they can be useful in certain situations, an all too prevalent anti-pattern has emerged.
SQL User Defined Functions: Identifying The Problem
To illustrate this, let’s take a simple example from the AdventureWorks sample database. If you don’t have this database, you can download and install it. This is a good thing to have on your personal development or lab SQL Server instance, as many blogs, books, and articles use it for examples.
In our example, we are going to generate a simple report for the AdventureWorks Human Resources department using the following tables.
Our report will consist of the employee’s name, the department they work in, their job title, and start date. We have a mock-up of the report that the end user would like to see:
Taking a look at the schema above, we realize that the EmployeeDepartmentHistory table has most of this information. So we start there.
SELECT BusinessEntityID AS EmployeeID,
DepartmentID,
StartDate
FROM HumanResources.EmployeeDepartmentHistory
This gets us the EmployeeID and DepartmentID, but not the actual names. As developers, we are big into encapsulating code as much as possible to avoid repetitive code and the manageability issues that come with it. We know that UDFs in SQL Server can help with this so we create the following function to get the department name:
CREATE FUNCTION dbo.GetDepartment
(
@DepartmentID INT
)
RETURNS VARCHAR(40)
AS
BEGIN
DECLARE @ret VARCHAR(40)
SET @ret =
(
SELECT [Name]
FROM HumanResources.Department
WHERE DepartmentID = @DepartmentID
)
RETURN @ret
END
Now we can execute the same query but call the new function to return the department name.
SELECT
BusinessEntityID AS EmployeeID,
dbo. GetDepartment(DepartmentID) AS DepartmentName,
StartDate
FROM HumanResources. EmployeeDepartmentHistory
That worked great so we do the same thing for EmployeeName and JobTitle.
CREATE FUNCTION dbo. GetEmployeeName(@EmployeeID int)
RETURNS VARCHAR(40)
AS
BEGIN
DECLARE @ret VARCHAR(40), @ContactID INT
--get contactid
SELECT @ContactID = BusinessEntityID FROM HumanResources. Employee WHERE BusinessEntityID = @EmployeeID
--get name
SET @ret = ( SELECT coalesce(FirstName,'') + ' ' + coalesce(MiddleName,'') + ' ' + coalesce(LastName,'')
FROM Person. Person
WHERE BusinessEntityID = @ContactID
)
RETURN @ret
END
CREATE FUNCTION dbo.GetJobTitle
(
@EmployeeID INT
)
RETURNS VARCHAR(40)
AS
BEGIN
DECLARE @ret VARCHAR(40)
SET @ret =
(
SELECT [JobTitle]
FROM HumanResources.Employee
WHERE BusinessEntityID = @EmployeeID
)
RETURN @ret
END
Executing the query again using the three functions we’ve created, we get the data we are looking for. We have encapsulated the code into function for re-use in other places, and we avoided those pesky joins. The resulting query is clean and simple. Best of all, this is a single query so we’re doing a set operation like our DBA has told us we should do.
SELECT
dbo. GetEmployeeName(BusinessEntityID) as EmployeeName,
dbo. GetDepartment(DepartmentID) AS DepartmentName,
dbo. GetJobTitle(BusinessEntityID),
StartDate
FROM HumanResources. EmployeeDepartmentHistory
So what’s the problem?
The problem is that we have unwittingly turned our single query into a loop by introducing these functions. Because the functions are non-deterministic, they have to be called for each row in our result set.
To see this, let’s fire up our old friend SQL Server Profiler, which shows us the actual executing statements and the accompanying resources used. We’ll use the following trace definition to keep things as simple and clean as possible:
Let’s execute our query and see what we get.
SELECT
dbo. GetEmployeeName(BusinessEntityID) as EmployeeName,
dbo. GetDepartment(DepartmentID) AS DepartmentName,
dbo. GetJobTitle(BusinessEntityID),
StartDate
FROM HumanResources. EmployeeDepartmentHistory
Wait a minute. Our single-statement query just generated 2076 individual SQL Statements, generating a total of 2,672 logical reads. If we look through the profiler results carefully, we can see that each of the three functions was called once for each of the 296 rows our query returned. So what looks like a simple set operation is, in fact, a loop.
We go to our DBA for help and she explains that we need to get rid of the functions and use joins instead to make it into a true set operation. So we refactor the query as follows:
SELECT
COALESCE(c. FirstName,'') + ' ' + coalesce(c. MiddleName,'') + ' ' + coalesce(c. LastName,'') AS EmployeeName,
d. name AS DepartmentName,
e. JobTitle,
StartDate
FROM HumanResources. EmployeeDepartmentHistory eh
INNER JOIN HumanResources. Department d ON eh. DepartmentID = d. DepartmentID
INNER JOIN HumanResources. Employee e ON e. BusinessEntityID = eh. BusinessEntityID
INNER JOIN Person. Person c ON c. BusinessEntityID = e. BusinessEntityID
Now if we execute the query and look at the profiler trace, we get just a single statement for a total of 1032 logical reads. Our query has become much more complicated and we don’t have the benefit of code encapsulation, but we’ve got the logical reads to less than half. And that was for only 296 rows.
Conclusion
This is a simple example, but hopefully you can see the impact in a busy system with lots more data. I have seen this in production environments where something like GetCustomerName() is being called tens of millions of times daily. This is a great example of an anti-pattern, where SQL user defined functions look like a great idea but the performance impact can be devastating.