Site icon Donner's Daily Dose of Drama

Combining contiguous date ranges in a SQL query – using CTE recursion

Sometimes we need to reduce the granularity of date ranges by combining serveral contiguous rows into one. While this is a common classic SQL problem, I was unable to find an elegant solution that also performs well, and came up with my own. This article explains the problem and outlines the solution using a Common Table Expression (CTE) with recursion in SQL Server 2005.

Let’s assume that a person’s work history is broken down into several contiguous records, based on a variety of employment attributes. One of these attributes is whether the period is salaried or not. There are other attributes, such as work status etc. that are also of interest and that cannot be lost.

CREATE TABLE Work
(
      from_date DATETIME,
      to_date DATETIME,
      work_status NVARCHAR(10),
      paid  BIT
);

INSERT INTO Work VALUES ('1/1/1980', '12/31/1980', 'active',   1);
INSERT INTO Work VALUES ('1/1/1981', '12/31/1981', 'on leave', 1);
INSERT INTO Work VALUES ('1/1/1982', '12/31/1982', 'active',   1);
INSERT INTO Work VALUES ('1/1/1983', '12/31/1983', 'on leave', 0);
INSERT INTO Work VALUES ('1/1/1984', '12/31/1984', 'on leave', 0);
INSERT INTO Work VALUES ('1/1/1985', '12/31/1985', 'on leave', 0);
INSERT INTO Work VALUES ('1/1/1986', '12/31/1986', 'on leave', 0);
INSERT INTO Work VALUES ('1/1/1987', '12/31/1987', 'on leave', 0);
INSERT INTO Work VALUES ('1/1/1988', '12/31/1988', 'sick',     1);
INSERT INTO Work VALUES ('1/1/1989', '12/31/1989', 'active',   1);
INSERT INTO Work VALUES ('1/1/1990', '12/31/1990', 'on leave', 0);
INSERT INTO Work VALUES ('1/1/1991', '12/31/1991', 'active',   1);
INSERT INTO Work VALUES ('1/1/1992', '12/31/1992', 'active',   1);
INSERT INTO Work VALUES ('1/1/1993', '12/31/1993', 'vacation', 1);
INSERT INTO Work VALUES ('1/1/1994', '12/31/1994', 'on leave', 0);
INSERT INTO Work VALUES ('1/1/1995', '12/31/1995', 'sick',     0);

The granularity of the data is a result of all combined changes in all attributes that are being tracked. In other words, whenever there is a change in one of the columns of our table, a new row is needed. This can be impractical for certain applications. For instance, if a report is to show the periods for which the person was receiving pay, all other attributes should be ignored, and contiguous records must be combined based on the paid flag, resulting in something like the following (smaller) set:

start      end        paid
---------- ---------- -----
01/01/1980 12/31/1982 1
01/01/1983 12/31/1987 0
01/01/1988 12/31/1989 1
01/01/1990 12/31/1990 0
01/01/1991 12/31/1993 1
01/01/1994 12/31/1995 0

But how do you do that?
GROUP BY does not work, obviously, because it is not possible to group by a column and at the same time apply a chronological order. For the same reason, RANK() and PARTITION cannot be used.

Clearly, what is needed is information about when the value in the observed column (Paid, in this case)changes. This can be done through a variety of ways, but they all lack a certain elegance. For instance, this source suggests the use of two views that yield the start and end dates of each range. The resulting SQL is fairly convoluted.

My initial approach was a self-join that provided a calculated column (called ‘first’ in the example below). This column contained the value ‘1’ for evey row where a new group of contiguous pay periods begins. A correlated subquery was used to add up all the 1 values from prior rows up to the current one, yielding a new column that can be grouped by.
The resulting query below works, but is has some flaws.

WITH Grouped (from_date, to_date, paid, first)
AS
(
    SELECT
        from_date,
        to_date,
        paid,
        isnull
        (
            (
                SELECT
                    CASE WHEN paid <> w.paid THEN 1
                    ELSE 0 END
                FROM Work
                WHERE from_date =
                (
                    SELECT max(from_date)
                    FROM Work
                    WHERE from_date < w.from_date
                )
            ), 1
        ) AS first
        FROM Work w
)
SELECT
    min(from_date) AS from_date,
    max(to_date) AS to_date,
	paid
FROM
(
    SELECT
        from_date,
        to_date,
        paid,
        isnull
        (
            (SELECT sum(first) FROM grouped
            WHERE from_date > g.from_date), 0
        ) AS part
    FROM grouped g
) p
GROUP BY p.part, p.paid
ORDER BY from_date

Most notably, the correlated subquery that sums up the ‘first’ column gets evaluated for every row, and it is expensive. And I still thought the SQL was too complex for what it actually did.
The solution was obvious – a way to iterate through each of the contiguous date ranges while ‘memorizing’ the start date. A trivial solution would be a cursor loop, but that is hardly an option for online applications that need to perform well.
While thinking about this approach it occured to me that a recursion would do exactly that, join to all of the rows in a contiguous date range and then use the start date from the first and the end date from the last row. In SQL Server, this can be done with a Common Table Expression (CTE):

WITH Recursion (from_date, d, to_date, paid, Level)
AS
(
-- Anchor member definition
    SELECT this.from_date, this.from_date, this.to_date, this.paid, 0 AS Level
    FROM Work AS this
    LEFT JOIN Work AS last
        ON last.to_date + 1 = this.from_date
    WHERE last.paid <> this.paid or last.paid is null
    UNION ALL
-- Recursive member definition
    SELECT this.from_date, last.d, this.to_date, this.paid, Level + 1
    FROM Work AS this
    inner JOIN Recursion AS last
        ON last.to_date + 1 = this.from_date
    WHERE last.paid = this.paid
	and Level<99
)
-- Statement that executes the CTE
SELECT d as start, max(to_date) as [end], paid
FROM Recursion
group by d, paid
order by d
GO

A pretty cool use of recursion, isn’t it? It works and it is fast, but it has some limitations. First and foremost, the SQL Server limits the recursion depth to 100. If you have more then a few rows that need to be eliminated this way, recursion is not the way to go. Also, the example relies on the fact that there are no gaps in the dates, or else the joins will not work reliably. And finally, if there is data for multiple employees, an employee identifier would have to be included in the sets.

Exit mobile version