Sometimes we need to reduce the granularity of date ranges by combining serveral contiguous rows into one. While this is a common classic SQL problem, I was unable to find an elegant solution that also performs well, and came up with my own. This article explains the problem and outlines the solution using a Common Table Expression (CTE) with recursion in SQL Server 2005.
Let’s assume that a person’s work history is broken down into several contiguous records, based on a variety of employment attributes. One of these attributes is whether the period is salaried or not. There are other attributes, such as work status etc. that are also of interest and that cannot be lost.
The granularity of the data is a result of all combined changes in all attributes that are being tracked. In other words, whenever there is a change in one of the columns of our table, a new row is needed. This can be impractical for certain applications. For instance, if a report is to show the periods for which the person was receiving pay, all other attributes should be ignored, and contiguous records must be combined based on the paid flag, resulting in something like the following (smaller) set:
But how do you do that?
GROUP BY does not work, obviously, because it is not possible to group by a column and at the same time apply a chronological order. For the same reason, RANK() and PARTITION cannot be used.
Clearly, what is needed is information about when the value in the observed column (Paid, in this case)changes. This can be done through a variety of ways, but they all lack a certain elegance. For instance, this source suggests the use of two views that yield the start and end dates of each range. The resulting SQL is fairly convoluted.
My initial approach was a self-join that provided a calculated column (called ‘first’ in the example below). This column contained the value ‘1’ for evey row where a new group of contiguous pay periods begins. A correlated subquery was used to add up all the 1 values from prior rows up to the current one, yielding a new column that can be grouped by.
The resulting query below works, but is has some flaws.
Most notably, the correlated subquery that sums up the ‘first’ column gets evaluated for every row, and it is expensive. And I still thought the SQL was too complex for what it actually did.
The solution was obvious – a way to iterate through each of the contiguous date ranges while ‘memorizing’ the start date. A trivial solution would be a cursor loop, but that is hardly an option for online applications that need to perform well.
While thinking about this approach it occured to me that a recursion would do exactly that, join to all of the rows in a contiguous date range and then use the start date from the first and the end date from the last row. In SQL Server, this can be done with a Common Table Expression (CTE):
A pretty cool use of recursion, isn’t it? It works and it is fast, but it has some limitations. First and foremost, the SQL Server limits the recursion depth to 100. If you have more then a few rows that need to be eliminated this way, recursion is not the way to go. Also, the example relies on the fact that there are no gaps in the dates, or else the joins will not work reliably. And finally, if there is data for multiple employees, an employee identifier would have to be included in the sets.