Sometimes we need to reduce the granularity of date ranges by combining serveral contiguous rows into one. While this is a common classic SQL problem, I was unable to find an elegant solution that also performs well, and came up with my own. This article explains the problem and outlines the solution using a Common Table Expression (CTE) with recursion in SQL Server 2005.
Let’s assume that a person’s work history is broken down into several contiguous records, based on a variety of employment attributes. One of these attributes is whether the period is salaried or not. There are other attributes, such as work status etc. that are also of interest and that cannot be lost.
CREATE TABLE Work ( from_date DATETIME, to_date DATETIME, work_status NVARCHAR(10), paid BIT ); INSERT INTO Work VALUES ('1/1/1980', '12/31/1980', 'active', 1); INSERT INTO Work VALUES ('1/1/1981', '12/31/1981', 'on leave', 1); INSERT INTO Work VALUES ('1/1/1982', '12/31/1982', 'active', 1); INSERT INTO Work VALUES ('1/1/1983', '12/31/1983', 'on leave', 0); INSERT INTO Work VALUES ('1/1/1984', '12/31/1984', 'on leave', 0); INSERT INTO Work VALUES ('1/1/1985', '12/31/1985', 'on leave', 0); INSERT INTO Work VALUES ('1/1/1986', '12/31/1986', 'on leave', 0); INSERT INTO Work VALUES ('1/1/1987', '12/31/1987', 'on leave', 0); INSERT INTO Work VALUES ('1/1/1988', '12/31/1988', 'sick', 1); INSERT INTO Work VALUES ('1/1/1989', '12/31/1989', 'active', 1); INSERT INTO Work VALUES ('1/1/1990', '12/31/1990', 'on leave', 0); INSERT INTO Work VALUES ('1/1/1991', '12/31/1991', 'active', 1); INSERT INTO Work VALUES ('1/1/1992', '12/31/1992', 'active', 1); INSERT INTO Work VALUES ('1/1/1993', '12/31/1993', 'vacation', 1); INSERT INTO Work VALUES ('1/1/1994', '12/31/1994', 'on leave', 0); INSERT INTO Work VALUES ('1/1/1995', '12/31/1995', 'sick', 0);
The granularity of the data is a result of all combined changes in all attributes that are being tracked. In other words, whenever there is a change in one of the columns of our table, a new row is needed. This can be impractical for certain applications. For instance, if a report is to show the periods for which the person was receiving pay, all other attributes should be ignored, and contiguous records must be combined based on the paid flag, resulting in something like the following (smaller) set:
start end paid ---------- ---------- ----- 01/01/1980 12/31/1982 1 01/01/1983 12/31/1987 0 01/01/1988 12/31/1989 1 01/01/1990 12/31/1990 0 01/01/1991 12/31/1993 1 01/01/1994 12/31/1995 0
But how do you do that?
GROUP BY does not work, obviously, because it is not possible to group by a column and at the same time apply a chronological order. For the same reason, RANK() and PARTITION cannot be used.
Clearly, what is needed is information about when the value in the observed column (Paid, in this case)changes. This can be done through a variety of ways, but they all lack a certain elegance. For instance, this source suggests the use of two views that yield the start and end dates of each range. The resulting SQL is fairly convoluted.
My initial approach was a self-join that provided a calculated column (called ‘first’ in the example below). This column contained the value ‘1’ for evey row where a new group of contiguous pay periods begins. A correlated subquery was used to add up all the 1 values from prior rows up to the current one, yielding a new column that can be grouped by.
The resulting query below works, but is has some flaws.
WITH Grouped (from_date, to_date, paid, first) AS ( SELECT from_date, to_date, paid, isnull ( ( SELECT CASE WHEN paid <> w.paid THEN 1 ELSE 0 END FROM Work WHERE from_date = ( SELECT max(from_date) FROM Work WHERE from_date < w.from_date ) ), 1 ) AS first FROM Work w ) SELECT min(from_date) AS from_date, max(to_date) AS to_date, paid FROM ( SELECT from_date, to_date, paid, isnull ( (SELECT sum(first) FROM grouped WHERE from_date > g.from_date), 0 ) AS part FROM grouped g ) p GROUP BY p.part, p.paid ORDER BY from_date
Most notably, the correlated subquery that sums up the ‘first’ column gets evaluated for every row, and it is expensive. And I still thought the SQL was too complex for what it actually did.
The solution was obvious – a way to iterate through each of the contiguous date ranges while ‘memorizing’ the start date. A trivial solution would be a cursor loop, but that is hardly an option for online applications that need to perform well.
While thinking about this approach it occured to me that a recursion would do exactly that, join to all of the rows in a contiguous date range and then use the start date from the first and the end date from the last row. In SQL Server, this can be done with a Common Table Expression (CTE):
WITH Recursion (from_date, d, to_date, paid, Level) AS ( -- Anchor member definition SELECT this.from_date, this.from_date, this.to_date, this.paid, 0 AS Level FROM Work AS this LEFT JOIN Work AS last ON last.to_date + 1 = this.from_date WHERE last.paid <> this.paid or last.paid is null UNION ALL -- Recursive member definition SELECT this.from_date, last.d, this.to_date, this.paid, Level + 1 FROM Work AS this inner JOIN Recursion AS last ON last.to_date + 1 = this.from_date WHERE last.paid = this.paid and Level<99 ) -- Statement that executes the CTE SELECT d as start, max(to_date) as [end], paid FROM Recursion group by d, paid order by d GO
A pretty cool use of recursion, isn’t it? It works and it is fast, but it has some limitations. First and foremost, the SQL Server limits the recursion depth to 100. If you have more then a few rows that need to be eliminated this way, recursion is not the way to go. Also, the example relies on the fact that there are no gaps in the dates, or else the joins will not work reliably. And finally, if there is data for multiple employees, an employee identifier would have to be included in the sets.