June 14, 2005 – Jarrett House North

I lost my copy of the classic programmer’s cookbook Refactoring several years ago, alas, but its philosophy of carefully replacing smelly code with clean code to improve the performance and maintainability of a software program has stayed with me long after I ceased being an active programmer. One regret I had about the book was that it primarily addressed refactoring for object-oriented languages (all the examples were in Java). But one of the biggest opportunities for cleaning up code is in non-object-oriented languages such as Transact-SQL. Recently I had an opportunity to clean up some legacy stored procedure code that I was adapting for an integration project, and one particular refactoring struck me as especially useful.

A common mistake when writing stored procedures is to overuse cursors. The cursor provides a way to operate on one row of data at a time, which is logical to a programmer used to thinking about working with arrays by looping through them. But SQL is fundamentally a language that is about set operations, and you can realize tremendous gains if you can stop operating on one row at a time and instead operate on a bunch of them at once. Microsoft provided a way to do that in SQL Server 2000 by introducing the table datatype.

Table variables help us by giving us an entire rowset, held in a local variable, that acts much like a table in the database. An article by Alex Grinberg on SQL Server Central illustrates some of the applications of the table variable. I used one of the ideas to replace a cursor that was being used to import data into a table in our system using a counter table.

In fact, the counter table probably caused the use of the cursor in the first place, because a developer had helpfully written a stored procedure to get the next ID value from the table (a common construct for database structures that need to be cross-platform and therefore can’t use vendor-specific features like identity columns). Such a procedure is a good idea in a client application because it enforces a consistent method for creating IDs for new records, but because it hands out only one ID at a time it leads inexorably to row-at-a-time inserts and other abuses of SQL.

So the first step in converting this insert cursor into a sensible insert was to create a new procedure that allowed me to get a bunch of IDs out at once. Where the original procedure took the name of the table whose counter was being incremented as input and returned the new ID as output, this one also took the number of keys to reserve and returned the starting and ending key values of the reserved block. Thus even for a very large number of rows, I was only performing a single transaction to get a block of keys to work with.
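The procedure itself isn't reproduced in this post, so the following is only a minimal sketch of what a block-reserving counter procedure might look like. The counter table name (counters), its columns (table_name, next_id), and the parameter names are all assumptions made for illustration; only the procedure name p_increment_counter_multi comes from the sample call later in the post.

```sql
-- Hypothetical counter table, one row per table that needs generated keys:
--   CREATE TABLE counters (table_name varchar(128) PRIMARY KEY, next_id int NOT NULL)

CREATE PROCEDURE p_increment_counter_multi
    @table_name varchar(128),        -- table whose counter is being advanced
    @num_keys   int,                 -- how many keys to reserve
    @start_id   int OUTPUT,          -- first key in the reserved block
    @end_id     int = NULL OUTPUT    -- last key in the reserved block (optional)
AS
BEGIN
    SET NOCOUNT ON

    -- Capture the old counter value and advance the counter in one UPDATE,
    -- so concurrent callers cannot be handed overlapping blocks.
    -- "@var = column" in a SET list reads the pre-update column value.
    UPDATE counters
    SET @start_id = next_id,
        next_id   = next_id + @num_keys
    WHERE table_name = @table_name

    -- If no counter row matched, @start_id (and so @end_id) stays NULL.
    SET @end_id = @start_id + @num_keys - 1
END
```

Doing the read and the increment in a single UPDATE is the design choice that matters here: a separate SELECT followed by an UPDATE would need explicit locking to be safe under concurrency.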

Now the cursor replacement. The original logic of the stored procedure declared a select cursor against the staging table containing the data to be imported, then opened the cursor, got a new id, and inserted the contents of the cursor plus the ID into the destination table. So for each row of data to be imported we performed two transactions, an update on the counter table and an insert to the destination table. For 500 rows, this was taking about 25 seconds—not huge, but definitely a place where there could be a big improvement. Here are the steps I used to replace the insert cursor:

1. Declare a table variable, @tbl, with the same columns as the cursor, plus an ID column defined with the IDENTITY property.
2. Populate @tbl with an INSERT INTO…SELECT FROM statement that draws data from the staging table. Now each row in @tbl has the values from the staging table plus a temporary ID, ranging from 1 to the number of rows. (We didn’t give the ID field a custom seed, so it starts at 1; more on that in a second.)
3. Count the number of rows in @tbl (select count(*) from @tbl).
4. Call the updated counter procedure with the number of rows in @tbl, getting back the first ID in the range (call it @startid).
5. Finally, insert into the destination table, selecting from @tbl and adding @startid to the identity column in @tbl. Since the IDENTITY column started at 1, we want to subtract 1 from the sum for each row as well, or else we’ll exceed the range of keys reserved.

Here’s some sample code:

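-- Table variable mirroring the cursor's columns, plus a temporary identity key
DECLARE @tbl TABLE (col1 int, col2 int, col3 int, temp_id int IDENTITY (1,1))

-- Pull all the staging rows in one statement; temp_id is numbered 1..N
INSERT INTO @tbl (col1, col2, col3)
SELECT stagecol1, stagecol2, stagecol3 FROM staging_table WHERE ...

DECLARE @startid int
DECLARE @total int

-- Number of keys we need to reserve
SELECT @total = COUNT (*) FROM @tbl

-- Reserve a block of @total keys; @startid receives the first key in the block
EXEC p_increment_counter_multi 'dest_table', @total, @startid OUTPUT

-- temp_id starts at 1, so subtract 1 to stay inside the reserved block
INSERT INTO dest_table ( dest_id, destcol1, destcol2, destcol3 )
SELECT temp_id + @startid - 1, col1, col2, col3 FROM @tbl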

So instead of two transactions for each row in the staging table, we end up with four transactions total, one of which is a select to populate the table variable and one of which just gets the total number of rows. The performance benefits can be substantial: in my particular procedure, for 500 source rows, the time to execute the procedure went from 25 seconds to 6 seconds.
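For contrast, the row-at-a-time pattern being replaced would look roughly like the sketch below. The original legacy procedure isn't shown in this post, so this is a reconstruction from the description above: p_increment_counter is a hypothetical name for the single-ID counter procedure, its signature is assumed, and the staging filter is left out.

```sql
-- Sketch of the cursor-based import (reconstructed, not the original code).
DECLARE @c1 int, @c2 int, @c3 int, @newid int

DECLARE stage_cur CURSOR FOR
    SELECT stagecol1, stagecol2, stagecol3 FROM staging_table

OPEN stage_cur
FETCH NEXT FROM stage_cur INTO @c1, @c2, @c3

WHILE @@FETCH_STATUS = 0
BEGIN
    -- One counter update per staging row (hypothetical single-ID procedure)...
    EXEC p_increment_counter 'dest_table', @newid OUTPUT

    -- ...and one single-row insert per staging row.
    INSERT INTO dest_table (dest_id, destcol1, destcol2, destcol3)
    VALUES (@newid, @c1, @c2, @c3)

    FETCH NEXT FROM stage_cur INTO @c1, @c2, @c3
END

CLOSE stage_cur
DEALLOCATE stage_cur
```

With 500 staging rows this loop issues 500 counter updates and 500 inserts, which is the work the set-based version collapses into one counter call and one INSERT…SELECT.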

Of course, you could have done the same thing using a temporary table, though it appears that table variables are slightly more efficient in memory. But the basic principle is the same: an application-level counter column is no reason to insert one row at a time.
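As a rough illustration of the temporary-table variant, the same steps might look like this. The temp table name #stage is made up for the example, and the staging filter is again omitted; everything else follows the table-variable sample.

```sql
-- Same technique with a temporary table instead of a table variable.
CREATE TABLE #stage (col1 int, col2 int, col3 int, temp_id int IDENTITY (1,1))

INSERT INTO #stage (col1, col2, col3)
SELECT stagecol1, stagecol2, stagecol3 FROM staging_table

DECLARE @startid int, @total int
SELECT @total = COUNT (*) FROM #stage

-- Reserve the whole block of keys in one call
EXEC p_increment_counter_multi 'dest_table', @total, @startid OUTPUT

INSERT INTO dest_table ( dest_id, destcol1, destcol2, destcol3 )
SELECT temp_id + @startid - 1, col1, col2, col3 FROM #stage

-- Unlike a table variable, a temp table should be dropped explicitly
-- (or it lingers until the creating procedure or session ends).
DROP TABLE #stage
```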

The same principle, with a twist, can be applied to update cursors as well; more on that in a while.