Solving the 'Events in Progress' Problem in DAX
Reposted from Chris Webb's blog with the author's permission.
One problem I’ve spent a lot of time thinking about over the last couple of years is what I call the ‘events in progress’ problem. The problem is this: if you have a fact table where each row represents an event, and there are two date columns representing the start and end dates of the event, on any given date how many events were actually in progress (ie how many rows are there where your given date is between the start and the end date)? It’s trivial to solve in SQL, but in SSAS and MDX there’s no truly satisfying solution. Richard Tkatchuk wrote up the best solution for SSAS here, but since it still involves a fair amount of hacking of the data I’ve never felt it was particularly elegant; I’ve always felt like there should be a pure MDX solution, but all of the approaches I’ve tried have been unusably slow. As a result, I was curious to see whether there would be an elegant way of solving this problem in PowerPivot and DAX…
It’s quite easy to use the AdventureWorks DW database to illustrate the problem. The following SQL query returns the number of orders that had been placed but had not yet shipped on any given date:
select DimDate.DateKey
, COUNT(*) as OrdersInProgress
from DimDate,
FactInternetSales
where
DimDate.DateKey>=factinternetsales.OrderDateKey
and
DimDate.DateKey <= FactInternetSales.ShipDateKey
group by DimDate.DateKey
order by Dimdate.DateKey asc
The first thing to do in PowerPivot is, obviously, to import the DimDate and FactInternetSales tables. However the next step is less than obvious: you must not create a relationship between the two tables. Even though PowerPivot suggests you should, you won’t get this solution to work if you do – and as a result, in my opinion, it would be great if this message could be turned off (vote here if you agree).
Next we need to create a new calculated measure in the FactInternetSales table, and we need to come up with a DAX expression for it. Even though the logic is fairly simple, I struggled at first because the Calculate statement doesn’t allow you to use the Min or Max functions directly in one of its filter parameters. Then, though, I realised that the same limitation doesn’t apply to the Filter function and I came up with the following, which worked:
=COUNTROWS(
Filter(
Filter('FactInternetSales', 'FactInternetSales'[OrderDateKey]<=MIN('DimDate'[DateKey])), 'FactInternetSales'[ShipDateKey]>=MAX('DimDate'[DateKey])))
Unfortunately, although it was faster than any of my MDX approaches in SSAS, it was still not really fast enough. So I went back to my friends at Microsoft, and Marius Dumitru (again) came up with the following which is really fast:
=Calculate(
CountRows(FactInternetSales),
Filter(Values(FactInternetSales[OrderDateKey]), Min(DimDate[DateKey])>=[OrderDateKey]),
Filter(Values(FactInternetSales[ShipDateKey]), Max(DimDate[DateKey])<=[ShipDateKey]))
The next problem is that this only works when something is explicitly selected from DateKey. When you put Calendar Year on rows or columns though, for example, it returns nothing. I’m currently working on getting this fixed, but that will be another blog post!
Anyway couple of questions need to be asked at this point. First of all, how/why does this work? Well Marco has just posted a great explanation of the Calculate statement here that I recommend you to read and reread; it covers the theory behind this expression and a lot more.
Secondly why is this second expression so much faster than the first? Here’s what Marius told me: in the first expression the filtering is done on each individual row in the fact table. In the second expression the filters act on the distinct values from both OrderDateKey and ShipDateKey, and since the number of distinct values is smaller than the overall number of rows this is evaluated very quickly; the results of these two filters can be then used in a further, very efficient request to the PowePivot/Vertipaq engine. In Marius’ own words, the takeaway is that “it’s always a best practice to avoid filtering an entire table if the filter condition involves only some of the table columns (i.e. when no key/unique columns are used in the condition expression). The way to go is to only filter the columns involved in the condition (using Filter over All, Values or Distinct applied to the desired columns), and use those filter results as parameters to Calculate to carry out the rest of the computation”. Note that this isn’t a matter of block computation vs cell-by-cell because all calculations in DAX are evaluated in block mode.
So there you go, the first DAX optimisation tip on this blog!
Chris has been working with Microsoft BI tools since he started using beta 3 of OLAP Services back in the late 90s. Since then he has worked with Analysis Services in a number of roles (including three years spent with Microsoft Consulting Services) and he is now an independent consultant specialising in complex MDX, Analysis Services cube design and Analysis Services query performance problems. His company website can be found at http://www.crossjoin.co.uk and his blog can be found at http://cwebbbi.spaces.live.com/ . |