Combining contiguous date ranges in a SQL query – using CTE recursion Christian Donner, November 6, 2008April 12, 2009 Sometimes we need to reduce the granularity of date ranges by combining serveral contiguous rows into one. While this is a common classic SQL problem, I was unable to find an elegant solution that also performs well, and came up with my own. This article explains the problem and outlines the solution using a Common Table Expression (CTE) with recursion in SQL Server 2005. Let’s assume that a person’s work history is broken down into several contiguous records, based on a variety of employment attributes. One of these attributes is whether the period is salaried or not. There are other attributes, such as work status etc. that are also of interest and that cannot be lost. CREATE TABLE Work ( from_date DATETIME, to_date DATETIME, work_status NVARCHAR(10), paid BIT ); INSERT INTO Work VALUES ('1/1/1980', '12/31/1980', 'active', 1); INSERT INTO Work VALUES ('1/1/1981', '12/31/1981', 'on leave', 1); INSERT INTO Work VALUES ('1/1/1982', '12/31/1982', 'active', 1); INSERT INTO Work VALUES ('1/1/1983', '12/31/1983', 'on leave', 0); INSERT INTO Work VALUES ('1/1/1984', '12/31/1984', 'on leave', 0); INSERT INTO Work VALUES ('1/1/1985', '12/31/1985', 'on leave', 0); INSERT INTO Work VALUES ('1/1/1986', '12/31/1986', 'on leave', 0); INSERT INTO Work VALUES ('1/1/1987', '12/31/1987', 'on leave', 0); INSERT INTO Work VALUES ('1/1/1988', '12/31/1988', 'sick', 1); INSERT INTO Work VALUES ('1/1/1989', '12/31/1989', 'active', 1); INSERT INTO Work VALUES ('1/1/1990', '12/31/1990', 'on leave', 0); INSERT INTO Work VALUES ('1/1/1991', '12/31/1991', 'active', 1); INSERT INTO Work VALUES ('1/1/1992', '12/31/1992', 'active', 1); INSERT INTO Work VALUES ('1/1/1993', '12/31/1993', 'vacation', 1); INSERT INTO Work VALUES ('1/1/1994', '12/31/1994', 'on leave', 0); INSERT INTO Work VALUES ('1/1/1995', '12/31/1995', 'sick', 0); The granularity of the data is a result of all combined changes in all attributes that are being tracked. In other words, whenever there is a change in one of the columns of our table, a new row is needed. This can be impractical for certain applications. For instance, if a report is to show the periods for which the person was receiving pay, all other attributes should be ignored, and contiguous records must be combined based on the paid flag, resulting in something like the following (smaller) set: start end paid ---------- ---------- ----- 01/01/1980 12/31/1982 1 01/01/1983 12/31/1987 0 01/01/1988 12/31/1989 1 01/01/1990 12/31/1990 0 01/01/1991 12/31/1993 1 01/01/1994 12/31/1995 0 But how do you do that? GROUP BY does not work, obviously, because it is not possible to group by a column and at the same time apply a chronological order. For the same reason, RANK() and PARTITION cannot be used. Clearly, what is needed is information about when the value in the observed column (Paid, in this case)changes. This can be done through a variety of ways, but they all lack a certain elegance. For instance, this source suggests the use of two views that yield the start and end dates of each range. The resulting SQL is fairly convoluted. My initial approach was a self-join that provided a calculated column (called ‘first’ in the example below). This column contained the value ‘1’ for evey row where a new group of contiguous pay periods begins. A correlated subquery was used to add up all the 1 values from prior rows up to the current one, yielding a new column that can be grouped by. The resulting query below works, but is has some flaws. WITH Grouped (from_date, to_date, paid, first) AS ( SELECT from_date, to_date, paid, isnull ( ( SELECT CASE WHEN paid <> w.paid THEN 1 ELSE 0 END FROM Work WHERE from_date = ( SELECT max(from_date) FROM Work WHERE from_date < w.from_date ) ), 1 ) AS first FROM Work w ) SELECT min(from_date) AS from_date, max(to_date) AS to_date, paid FROM ( SELECT from_date, to_date, paid, isnull ( (SELECT sum(first) FROM grouped WHERE from_date > g.from_date), 0 ) AS part FROM grouped g ) p GROUP BY p.part, p.paid ORDER BY from_date Most notably, the correlated subquery that sums up the ‘first’ column gets evaluated for every row, and it is expensive. And I still thought the SQL was too complex for what it actually did. The solution was obvious – a way to iterate through each of the contiguous date ranges while ‘memorizing’ the start date. A trivial solution would be a cursor loop, but that is hardly an option for online applications that need to perform well. While thinking about this approach it occured to me that a recursion would do exactly that, join to all of the rows in a contiguous date range and then use the start date from the first and the end date from the last row. In SQL Server, this can be done with a Common Table Expression (CTE): WITH Recursion (from_date, d, to_date, paid, Level) AS ( -- Anchor member definition SELECT this.from_date, this.from_date, this.to_date, this.paid, 0 AS Level FROM Work AS this LEFT JOIN Work AS last ON last.to_date + 1 = this.from_date WHERE last.paid <> this.paid or last.paid is null UNION ALL -- Recursive member definition SELECT this.from_date, last.d, this.to_date, this.paid, Level + 1 FROM Work AS this inner JOIN Recursion AS last ON last.to_date + 1 = this.from_date WHERE last.paid = this.paid and Level<99 ) -- Statement that executes the CTE SELECT d as start, max(to_date) as [end], paid FROM Recursion group by d, paid order by d GO A pretty cool use of recursion, isn’t it? It works and it is fast, but it has some limitations. First and foremost, the SQL Server limits the recursion depth to 100. If you have more then a few rows that need to be eliminated this way, recursion is not the way to go. Also, the example relies on the fact that there are no gaps in the dates, or else the joins will not work reliably. And finally, if there is data for multiple employees, an employee identifier would have to be included in the sets. Related Posts:SUTAB Scam?TyreWiz not working after battery changeThe Great Cat Litter Poop OffAmazon threatens customer of 26 yearsEnphase Envoy Local Access SQL Server SQL
I like your solution, and am about to massage it for a slightly different purpose. Thanks for posting! You are probably aware of this by now, but just in case: while SQL Server 2005 and 2008 limit recursion depth to 100 *by default*, you can provide the hint MAXRECURSION to specify the maximum number of recursions allowed, between 0 and 32767. NB: 0 means “infinite” — use with caution!
Hi Christian, I tried using your solution but it is EXTREMELY slow! I have contiguous date ranges for my employees but I have 450 000 rows in my table. This query took 4.5 hours to complete!!! Is this normal and is there a way to speed it up? Thanks
Hi Christian, Thanks for replying, I actually have no indices on any of the columns. Would I need at least an index on StartDate column? I really want to use your recursive solution as it works out perfectly, but I can’t justify the 4.5 hours! 🙂
Also, here is my query: I had to include an employee id because I have multiple employees. I hope what I have done is correct. ;WITH Recursion (EmpID, StartDate, EndDate, BSB, CC, Level) AS ( — Anchor member definition SELECT this.EmpID, this.StartDate, this.EndDate, this.BSB, this.CC, 0 AS Level FROM TestEmployees AS this LEFT OUTER JOIN dbo.TestEmployees AS last ON last.EndDate + 1 = this.StartDate WHERE (last.EmpID = this.EmpID AND last.BSB + last.CC this.BSB + this.CC) or last.BSB + last.CC is null UNION ALL — Recursive member definition SELECT last.EmpID, last.StartDate, this.EndDate, this.BSB, this.CC, Level + 1 FROM TestEmployees AS this inner JOIN Recursion AS last ON last.EndDate + 1 = this.StartDate WHERE last.EmpID = this.EmpID AND last.BSB + last.CC = this.BSB + this.CC and Level<99 ) — Statement that executes the CTE SELECT EmpID, StartDate as start, MAX(EndDate) as [end], BSB,CC FROM Recursion group by StartDate, EmpID, BSB,CC order by EmpID, StartDate I've created a clustered index on EmpID and StartDate but the query has been executing for 1hr 45 min!
Start with a composite index on the group by columns: StartDate, EmpID, BSB, CC Keep the index on EmpId and Startdate.
Hmmm, I created the indices as you suggested, but it looks like it has no effect on the slowness of the recursion. Also, I’m now just calling the recursive function with: SELECT EmpID, StartDate as start, EndDate as [end], BSB,CC, Level FROM Recursion without any grouping and ordering just to see if there is any improvement. I noticed that it straight away spits out the first 40 000 to 45000 records quite quickly and then it slows down to a crawl, spitting out only a handful of rows every couple of minutes. And this is just for Level 0. 10 minutes into the query and it is still processing Level 0 records. Is there anything else I could try? Or does recursion only work for small to medium sets of data? I’m curious to know, how many records do you have in your data set and what is the performance like?
Chris, You’re probably too busy, but would you mind executing this query if I provided you with the data set I’m working with? It’s about 5 mb csv file. It’s just that I’m very keen to see why it’s so slow. Cheers.
Tom, I would love to, but right now I can’t spare the time and I don’t even know if I would be of any help. I used in recursion in larger databases, but I can’t give you any specific metrics. It’s been years.
Aaah silly me problem solved! The culprit was in the WHERE clause of my query. I needed to put the “last.EmpID = this.EmpID” check into the ON clause instead. Looking back at the query and the penny dropped – Of course my original query was taking so long, the left outer join itself produces about 50 million rows!!! Anyway, thanks Chris for your solution again and I should put my thinking cap on first before just blatanly copying code! 🙂 Cheers!