Friday, February 24, 2012

Crossjoining large dimensions which are close to 1-to-1

We have two dimensions with about 2 million members each. The fact table that joins those two dimensions is pretty sparse because those two dimensions are highly related to each other. (They're not related enough to model as one dimension, though, trust me.) So when you run the following, the count it returns is around 2.3 million:

with member [Measures].[test] as
    Exists(
        [Dimension1].[Attrib1].[Attrib1].Members
        * [Dimension2].[Attrib2].[Attrib2].Members
        , , "My Fact Table"
    ).Count
select [Measures].[test] on 0
from [My Cube]

The problem is that statement takes 5 minutes to execute and causes SSAS to consume 3GB of memory.

Any thoughts on improving this?

Could you describe the set you are attempting to generate with the [test] expression? Your use of EXISTS, though valid, is a little unusual.

Also, could you explain what you mean by the fact table being "sparse"?


Thanks,

Bryan

|||

Bryan-

Yeah, a little background might help. We're trying to use the UDM in the true sense of the term... not just for summary-level reports which take advantage of high degrees of aggregation, but also for more detail-level reports. We're running into some problems using it for semi-detail-level reports because of the issue mentioned in this thread.

While car rental isn't remotely the industry I'm working in, it will serve to illustrate the concept. Let's say we have a Driver dimension containing 3 million people and a VIN dimension containing 2 million cars. We take a daily snapshot of who has rented what car. On any given day, there's roughly a 1-to-1 correspondence between drivers and VINs. But a particular car is rented by different drivers over time. So when I say the fact table is "sparse" I mean that it doesn't contain 6,000,000,000,000 rows per day (which would be 3 million drivers x 2 million cars).

One obvious answer to this question is to use the drillthrough MDX statement. We're looking into rearchitecting the app to do that, but there are some challenges... one being that a couple of other dimensions in that fact table explode out the rows by 5-10x per VIN, whereas in my query I only want to see one row per VIN, which is only a small degree of summarization. A sketch of the kind of drillthrough statement we're considering follows.
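This is only a sketch: the date coordinate, the MAXROWS value, and the column list are made up from the car-rental example above, not from our real cube.

DRILLTHROUGH MAXROWS 1000000
-- the inner SELECT identifies the cell to drill through; here, one day's snapshot
SELECT ([Date].[Date].&[20120224]) ON 0
FROM [My Cube]
-- RETURN picks the columns, using the [$Dimension].[Attribute] drillthrough syntax
RETURN [$Driver].[Driver], [$VIN].[VIN]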

So what I'm looking for is to see if anybody else has experience crossjoining large dimensions and making it work. Are there functions I should look into? Just looking for suggestions to try.

|||

Is there any difference in performance with this query? (It's probably the same but I have to ask.)

Also, do you have an aggregation built on [Attrib1] + [Attrib2] ?

Code Snippet

with member [Measures].[Combo Count] as
    COUNT(
        EXISTS(
            [Dimension1].[Attrib1].[Attrib1].Members,
            [Dimension2].[Attrib2].[Attrib2].Members,
            "My Fact Table"
        )
    )
select
    [Measures].[Combo Count] on 0
from [My Cube]

|||

It wasn't clear from my original question, but I actually want to select those two attributes on rows:

select {} on 0,
    Exists(
        [Dimension1].[Attrib1].[Attrib1].Members
        * [Dimension2].[Attrib2].[Attrib2].Members
        , , "My Fact Table"
    ) on 1
from [My Cube]

So I need both attribute sets in the first parameter of Exists.

|||

Understood. Try running that sample anyway and let's see how it performs. It will perform a similar evaluation across the two attributes. If it performs well, then we can start to move forward and pinpoint exactly where the performance problem is introduced. That should give us a clue as to how we might address it.

Also, did you have that aggregation set up and populated in your cube?

Thanks,
Bryan

|||

I am not sure how trying Bryan's suggestion will help here, since it returns a different result.

I am puzzled, however, why Exists with a measure group takes 5 minutes and 3 GB on only 2.3 million records, but drillthrough works fast. After all, both drillthrough and Exists with a measure group go through pretty much the same codepath.

Have you played with DISCOVER_MEMORY to see where these 3 GBs go? My math goes like this: since you run your query on attribute hierarchies, there should be no more than 8 bytes per member, i.e. 16 bytes per tuple maximum. Therefore, 2.3 million * 16 bytes = about 36 MB, which is about 100 times less than what you observed.
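For example, a minimal sketch, assuming your build exposes the memory rowset as the DISCOVER_MEMORYUSAGE DMV (on older builds, send the equivalent XMLA Discover request instead):

-- lists the server's memory holders, to see where the 3 GBs are going
SELECT * FROM $SYSTEM.DISCOVER_MEMORYUSAGE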

|||

To explain my approach....

The way EXISTS is being used in the original query strikes me as unusual. We normally define two sets and evaluate across a measure group (if needed). I want to see how the engine evaluates [Attrib1] vs. [Attrib2] using his syntax vs. a more traditional syntax. This may lead nowhere, but we have seen situations where variations in syntax, though producing the same result, take widely different paths. Though this doesn't return the same result, it's a starting point.

The following statement may also provide a test closer to where Gregg is wanting to go. Bottom line, I'm trying to see if we can find a mechanism that evaluates Attrib 1 against Attrib 2 and returns quicker.

SELECT
    [Measures].[My Measure] on 0,
    NON EMPTY
        [Dim 1].[Attrib 1].[Attrib 1].Members
        * [Dim 2].[Attrib 2].[Attrib 2].Members on 1
FROM [Adventure Works]


Regarding the count, I'm wanting to remove the cost of returning a large cellset from our evaluation.

Good luck,
B.

|||

Bryan, I don't think NON EMPTY is the way to go. I don't need to retrieve a measure value, just the valid combinations. Performance using the NonEmpty function was just about the same as Exists, interestingly enough.
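For the record, the NonEmpty variant I tried looked roughly like this; it's a sketch, with [Measures].[My Measure] standing in for whichever measure from that measure group you filter on:

select {} on 0,
    -- NonEmpty keeps only the tuples that have data for the given measure
    NonEmpty(
        [Dimension1].[Attrib1].[Attrib1].Members
        * [Dimension2].[Attrib2].[Attrib2].Members,
        { [Measures].[My Measure] }
    ) on 1
from [My Cube]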

As for your comment about aggs, I dismissed it initially, thinking it wasn't relevant, for two reasons. I have since disproven both with some tests:

1. I initially thought that those dimensions were way too high-cardinality to be involved in aggs. That's a sensible assumption, but I decided to test it. I built an agg on Date, Dim1.Attrib1, and Dim2.Attrib2. That agg turned out to be about 7% of the size of the fact.data file. Now that I think about it, that makes sense: using that granularity of agg shrinks the distinct rowcount by 5x, and excluding 6 of 9 dimension keys shrinks the size of a row by 3x. So (100/5)/3 = 6.6%. The takeaway is that you shouldn't dismiss an agg as too big out of hand.

2. I had just assumed that Exists-with-a-measure-group couldn't run off an agg. Turns out it can!!! (Was this what you were hinting at, Bryan?) I used Filemon to determine that it was hitting only the agg file on disk, not the fact.data file. In one test, the Exists query ran 2x faster and produced about 10x less IO. I'll add this to the list of topics to blog about when I get time :-)

Mosha, you're omniscient as always. I'm having trouble reproducing the high memory usage with the Exists function today. The memory usage is just about in line with your guesstimate. It spends most of its time within Serialize Results, as you would expect when there are large sets on the axes. It's still slower than I would hope, but your statements give me the impression that there's not really a better way to write the MDX, so that pretty much answers my question and closes off this thread. As mentioned above, I think looking into partitioning and aggs is the way to go instead of changing the MDX.

I think I must have been a bit tired Thursday when I started this thread, because it appears I got memory consumption backwards between Exists-with-a-measure-group and drillthrough. It's drillthrough that uses such a huge amount of memory: it actually consumes over 5 GB. The high memory usage is due to setting maxrows to 10,000,000. I'm still surprised at how much memory it uses, but oh well, I guess I won't look at using drillthrough because it doesn't perform as well as I had anticipated. I reported this as a bug, mostly because Filemon showed the drillthrough command hitting each segment in the fact.data file twice: https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=282134

Thanks y'all.
