SUMMARIZE AND SUMMARIZECOLUMNS DAX function examples. Discussion on DAX MEASURE usage vs static alternatives.
What
Often there is a need to (distinct) count or sum values based on multiple filtered tables over a selected variable like a product type. An example could be a KPI like the customer count of a company (per product) when different products have differences in the counting logic or data tables.
In this post I present two function patterns to handle most of this type of situations. In the theory section is discussed alternatives and additional approaches and their differences. How section considers whether DAX is the best solution in the first place.
Code
Example 1
The basic function pattern is DAX CALCULATETABLE with SUMMARIZE.
Measure KPI customer count = VAR bikes_customers = --alltime CALCULATETABLE(SUMMARIZE ( Customers; Customers[CustomerKey]); FILTER(dimProductCategory;dimProductCategory[EnglishProductCategoryName]="Bikes")) VAR accessories_customers = --latest year, in the example 2004 CALCULATETABLE(SUMMARIZE ( Customers; Customers[CustomerKey]); FILTER(dimProductCategory;dimProductCategory[EnglishProductCategoryName]="Accessories"); FILTER('Calendar';'Calendar'[Year]=2004)) --CALCULATETABLE is needed to filter based on two tables VAR other_customers = --last 2 quarters, in this example Q3 and Q4 CALCULATETABLE(SUMMARIZE ( Customers; Customers[CustomerKey] ); FILTER(dimProductCategory;dimProductCategory[EnglishProductCategoryName]<>"Bikes" && dimProductCategory[EnglishProductCategoryName]<>"Accessories"); FILTER('Calendar';'Calendar'[Year]=2004);FILTER('Calendar'; 'Calendar'[Quarter]="Q3" || 'Calendar'[Quarter]="Q4")) VAR all_customer_keys = DISTINCT(UNION ( bikes_customers; accessories_customers; other_customers)) --above union of all customer keys, then distinct as e.g. bikes and accessories --can have in 2004 same customer and we want to count it only 1 time --here would be possible to use EXCEPT or INTERSECT functions --too instead of UNION in another type of requirement setting VAR Result = COUNTROWS ( all_customer_keys ) RETURN Result
Example 2
The basic function pattern is DAX ADDCOLUMNS with SUMMARIZE.
Measure KPI customer count sales 2004 over 1000 eur = --VALUES VAR sales_per_customer_table_expression_with_addcolumns = ADDCOLUMNS ( SUMMARIZE(Customers; Customers[CustomerKey]); "Sales Amount sum";SUMX(RELATEDTABLE(Sales);Sales[SalesAmount]); "sales_events";SUMX(RELATEDTABLE(Sales);1) --same as COUNTX but better performance ) VAR table_expression_filter = FILTER(sales_per_customer_table_expression_with_addcolumns;[Sales Amount sum]>1000) --([Measure KPI customer count] taken from example 1 VAR count_of_customers = CALCULATE([Measure KPI customer count]; table_expression_filter;FILTER('Calendar';'Calendar'[Year]=2004)) RETURN count_of_customers
Theory
Preferred DAX function patterns
DAX SUMMARIZECOLUMNS function should be preferred over SUMMARIZE due to its better performance. There are some differences like SUMMARIZECOLUMNS not having a row context like SUMMARIZE. This makes the use of SUMMARIZECOLUMNS not possible at all in the case of the code example 1, and in the code example 2 in the case of showing data in a categorical graph or a table. Example 2 would work with only the end result of the measure used as a single figure without any row evaluation.
If you do not calculate anything with SUMMARIZE (= just selecting a list of dimensional values for example) there should not be any performance difference to SUMMARIZECOLUMNS. You can read more about SUMMARIZECOLUMNS vs SUMMARIZE in this post by SQLBI: https://www.sqlbi.com/articles/introducing-summarizecolumns/. The same post and linked articles to it point out that one should not calculate values directly with SUMMARIZE especially due to its handling of filters in the measures used inside the SUMMARIZE calculations. It is safer to always use ADDCOLUMNS + SUMMARIZE structure. It is also used in the example 2 code.
SUMMARIZE as a variable in a measure cannot be used as a base table
If one builds a table expression in a measure with a variable, like with SUMMARIZE in the code examples of this post, one cannot use this table expression variable like a physical base table later in the measure. E.g. in the example 2 code, it would NOT be possible to call the function below in the end, it will lead to an error “table cannot be used because a base table is expected”:
CALCULATE(SUM(sales_per_customer_table_expression_with_addcolums[sales_events])
Three ways to create the same table with DAX
Here are three ways to create the same table with dax, with preference on alternatives table_sales_per_customer_table_with_summarizecolumns and table_sales_per_customer_table_with_addcolumns:
table_sales_per_customer_table_with_summarizecolumns = SUMMARIZECOLUMNS( Customers[CustomerKey]; "Sales Amount sum";SUM(Sales[SalesAmount]); "sales_events";COUNTROWS(Sales) ) --VALUES can be replaced with SUMMARIZE (or DISTINCT) table_sales_per_customer_table_with_addcolumns = ADDCOLUMNS ( VALUES(Customers[CustomerKey]); "Sales Amount sum";SUMX(RELATEDTABLE(Sales);Sales[SalesAmount]); "sales_events";SUMX(RELATEDTABLE(Sales);1) ) --not recommended if any more complex measures or filters in measures table_sales_per_customer_table_with_summarize = SUMMARIZE( Sales;Customers[CustomerKey]; "Sales Amount sum";SUM(Sales[SalesAmount]); "sales_events";COUNTROWS(Sales) )
SUMMARIZE function alternatives with a single column
In the code examples SUMMARIZE is used. With a single column, it would be possible to use DISTINCT or VALUES too. By the VALUES function documentation: “In most scenarios, when the argument is a column name, the results of the VALUES function are identical to those of the DISTINCT function. Both functions remove duplicates and return a list of the possible values in the specified column. However, the VALUES function can also return a blank value.”
SUMMARIZE with a single column will give the same result with the same performance as DISTINCT, so no blanks.
RELATEDTABLE function needed when iterating rows over a table relationship
As nicely formulated by SQLBI in this post about row context and filter context (must read if you have not done already): “A row context does not propagate through relationships. If you have a row context in a table, you can iterate the rows of a table on the many side of a relationship using RELATEDTABLE, and you can access the row of a parent table using RELATED.” This is exactly what is done with SUMX and RELATEDTABLE in the Example 2 Code.
My earlier post Power BI DAX How to Calculate in Row Level with Multiple Tables introduces SUMX and how it works in detail.
How
It makes sense to build complicated measures in chains by making a measure, checking it works, and referencing to it with another measure. Furthermore, it is possible to use DAX variables VAR. It is easier not to try to perfect everything with one measure monster. It is possible to add new tables in DAX with SUMMARIZECOLUMNS to have the data more visible when building and testing new measures.
It is good to look after the measure calculation performance and user experience. Consider when it might be smarter to use Power Query or SQL and make a certain data edit in the data load instead of the real time DAX MEASURE calculations loading for every user separately. Furthermore, typically SQL or Power Query is relatively static and easy syntax to write compared to DAX which allows more dynamic end results and is likely to be harder to maintain or reverse engineer in the future.
Typically, calculations and data editing should be done before DAX MEASURE calculation layer if some of the following apply, the more there is data the more likely DAX MEASURE is not a good solution:
- The calculation is not dynamic, typically considering the most granular level of the data
- The calculation formula is complex and would take a lot of memory & time with DAX
- DAX MEASURE can also be used only in the final calculation logic, for example with dividing it is possible to calculate the divided figure and the divider separately without DAX MEASURE and do the DIVIDE with DAX MEASURE.
- Calculations are more like documented and tested processes with multiple steps and joins. Here even DAX CALCULATED COLUMNS and tables start to be tricky in the sense of documentation and joining of the data compared to SQL.
- Static & transparent end results with documentation & testing have a high importance. For example this could be the case with selected Activity Based Costing calculations to drive the profitability and pricing of a company. In addition, here even the DAX CALCULATED COLUMNS can be challenging compared to SQL in the sense of documentation, testing and relationships.
My earlier post Power BI DAX When to Use Measure VS Calculated Column VS Other Tools includes also a theory section discussing why the rules above apply.
Sample Power BI file
Example measures are created under the customer table in the example file. In the file are also the example tables created with DAX, introduced in the theory section (named table_sales_per_customer…):
https://drive.google.com/file/d/1rY8Azr5jljRNHTbFfqS4o7yjvWTFlN_q/view?usp=sharing
Sample data source: https://www.learndax.com/power-bi-sample-data-for-beginners-to-download/