Populated Numeric Cells vs Memory for Input Data

gtonkin
MVP
Posts: 1259
Joined: Thu May 06, 2010 3:03 pm
OLAP Product: TM1
Version: Latest and greatest
Excel Version: Office 365 64-bit
Location: JHB, South Africa

Populated Numeric Cells vs Memory for Input Data

Post by gtonkin »

I know this is becoming a bit of a dark art, but I would like to rehash some of the old posts - like this one and this one. There are others, but the common theme is that there is generally no resolution or repeatable process for optimizing.

So the issue I am having right now is the following: I have a model with lots of data across 8 cubes (all different sources and uses, i.e. no overlap).
Cube MB
14.8 GB on disk - 834,673,540 populated numeric cells. Memory used for input data is 42,823,840,768 bytes (roughly 42 GB), i.e. roughly 51 bytes per cell.
There are 7 dimensions, ordered based on the expected size and sparsity of the data: dimensions with few elements, used frequently, at the top, going towards the largest dimension just before the measures dimension. The third-last dimension has about 2,000 N-level and 1,100 C-level elements. The second-last dimension has about 25,000 N-level and 100 C-level elements. The measures dimension has 8 elements, 5 of them string - 1 rule-derived, the others not actually populated at this stage.

Cube SD
8 GB on disk - 543,177,497 populated numeric cells. Memory used for input data is 13,035,897,856 bytes (roughly 13 GB), i.e. roughly 24 bytes per cell.
The structure is similar to Cube MB but with two fewer dimensions: the Source (2 elements) and Type (50 elements) dimensions are not present. The measures dimension has only 3 elements - NO strings.
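For what it's worth, the bytes-per-cell figures above are just memory for input data divided by populated numeric cells - a quick back-of-the-envelope check in plain Python, nothing TM1-specific:

# Back-of-the-envelope check of the bytes-per-cell figures quoted above.
cubes = {
    "MB": {"populated_cells": 834_673_540, "input_memory_bytes": 42_823_840_768},
    "SD": {"populated_cells": 543_177_497, "input_memory_bytes": 13_035_897_856},
}

for name, c in cubes.items():
    print(f"Cube {name}: {c['input_memory_bytes'] / c['populated_cells']:.1f} bytes per populated cell")

# Prints roughly 51.3 for Cube MB and 24.0 for Cube SD.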

So I guess the question really is: how do I go about reducing the memory usage per cell in Cube MB from 51 bytes to a more reasonable value, which in turn will reduce overall memory consumption?

- Managed to re-order dimensions manually and achieved a massive 0.0003% reduction! Moving most of the dimensions around resulted in zero change or an increase. System optimization took about an hour and yielded a 0.000000% reduction.
- Removed the string measures and restarted the server - no change.

Where to from here?
BR, George.

Learn something new: MDX Views
tomok
MVP
Posts: 2836
Joined: Tue Feb 16, 2010 2:39 pm
OLAP Product: TM1, Palo
Version: Beginning of time thru 10.2
Excel Version: 2003-2007-2010-2013
Location: Atlanta, GA

Re: Populated Numeric Cells vs Memory for Input Data

Post by tomok »

gtonkin wrote: So I guess the question really is: how do I go about reducing the memory usage per cell in Cube MB from 51 bytes to a more reasonable value, which in turn will reduce overall memory consumption?
What is a reasonable value in your opinion, considering that string data is inherently more memory intensive than numerical data? Try removing the string measures from the cube and then check the memory footprint. That should tell you how much having strings is hurting you.
Tom O'Kelley - Manager Finance Systems
American Tower
http://www.onlinecourtreservations.com/

Re: Populated Numeric Cells vs Memory for Input Data

Post by gtonkin »

Thanks Tom - I did remove the string measures, which contained no data - no savings on memory.
In terms of what is reasonable, I would guess somewhere around the 30 bytes per cell mark, based on other deployments.
In the past, the way to bring down the usage was to re-order the dimensions. My next plan is to swap the two big ones, restart, and see whether the reported values in the stats cube agree with the percentages in the re-ordering window. I have my suspicions at the moment.
BR, George.

Learn something new: MDX Views
lotsaram
MVP
Posts: 3702
Joined: Fri Mar 13, 2009 11:14 am
OLAP Product: TableManager1
Version: PA 2.0.x
Excel Version: Office 365
Location: Switzerland

Re: Populated Numeric Cells vs Memory for Input Data

Post by lotsaram »

In my experience, density is more important than size.
Rough-as-guts method to determine density:
- Create a 1D view of the cube per dimension, with all of that dimension's N elements on rows and "Total" on every other dimension; slice it out to Excel (with zero suppression on) and count the number of rows.
- Density for that dimension is then populated N elements / total N elements.

Order the cube dense to sparse; smaller dimensions will normally be naturally denser anyway. See what improvement you get. If there are now no string measures, there is absolutely no reason to leave the measures dimension last. Keeping it last is a limitation of the system-suggested reordering, which means you can pretty much always beat it manually.
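If it helps, here is a rough sketch of that calculation in plain Python - the dimension names and counts below are just placeholders for whatever your 1D views give you:

# Rank dimensions dense -> sparse from the 1D view counts.
# populated_n = rows in the zero-suppressed 1D view, total_n = N-level elements in the dimension.
# All names and numbers below are placeholders.
counts = {
    "Source":  {"populated_n": 2,     "total_n": 2},
    "Store":   {"populated_n": 19000, "total_n": 25000},
    "Product": {"populated_n": 1800,  "total_n": 2000},
}

density = {dim: c["populated_n"] / c["total_n"] for dim, c in counts.items()}

# Densest first; handle the measures dimension separately (last only if it holds strings).
for dim in sorted(density, key=density.get, reverse=True):
    print(f"{dim}: {density[dim]:.0%}")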

I think IBM's cube sizing benchmark is either 22 or 24 bytes per numeric cell on x64, and this is almost always achievable. If you get below 22 bytes/cell you are doing very, very well. Over 50 is too much - I'm pretty sure you will be able to save half of that - but 50 isn't terrible. I have audited systems where it was over 150 bytes/cell and was then brought down to under 30.

After doing a reorder, the numbers in StatsByCube get funky and can't be trusted. From memory, I think total memory used is still reliable, but memory for input data stays as it was and memory for feeders shows as negative to balance it out (or vice versa) - either way, it's weird. You need to save data, unload and reload the cube, and restart perfmon to get reliable stats for reordered cubes.
Please place all requests for help in a public thread. I will not answer PMs requesting assistance.
Marcus Dietrich IBM
Posts: 7
Joined: Mon Dec 19, 2016 11:56 pm
OLAP Product: PA TM1
Version: 2.0
Excel Version: 2013

Re: Populated Numeric Cells vs Memory for Input Data

Post by Marcus Dietrich IBM »

Hi
An approach I have found quite useful for reducing cube sizes is to convert N-level calculations into C-level calculations where possible, which lets me save the feeders. For example, if I have the KPI ['Rev per FTE'] = ['Revenue'] / ['FTE'], I can save the feeder by making 'Rev per FTE' a C-level element with 'FTE' as a child; the rule then does not require a feeder.

Not sure if this works for you.

Re: Populated Numeric Cells vs Memory for Input Data

Post by gtonkin »

Thanks Marcus. The cubes contain populated data at this stage - I have removed the rules to ensure I focus on the main issue, i.e. data points vs memory for those points.

I do however use the technique you describe in other areas/models.
BR, George.

Learn something new: MDX Views

Re: Populated Numeric Cells vs Memory for Input Data

Post by gtonkin »

lotsaram wrote: In my experience, density is more important than size.
Rough-as-guts method to determine density:
- Create a 1D view of the cube per dimension, with all of that dimension's N elements on rows and "Total" on every other dimension; slice it out to Excel (with zero suppression on) and count the number of rows.
- Density for that dimension is then populated N elements / total N elements.

Order the cube dense to sparse; smaller dimensions will normally be naturally denser anyway. See what improvement you get. If there are now no string measures, there is absolutely no reason to leave the measures dimension last. ...
Thanks Lotsa - will go through your suggestion and post back once I have something useful to report.
BR, George.

Learn something new: MDX Views

Re: Populated Numeric Cells vs Memory for Input Data

Post by gtonkin »

Ran the density exercise per Lotsa and have the following:
Cube MB                     All      C Levels   N Levels   Density   Order
Marketing Budget Source     2        0          2          100%      1
Scenario                    3        0          3          67%       4
Period Date                 1176     40         1136       66%       5
Spend Type                  76       25         51         73%       3
Marketing Store             25954    631        25323      76%       2
Marketing Product           3120     1141       1979       91%       6
Marketing Budget Measures   3        0          3          100%      -
I re-ordered the dimensions based on the Order column in the table above, keeping measures last as I will need to include the strings again at some point.
I saved data, restarted the instance and the performance monitor, and checked memory usage again.
Memory used for input data: 42,823,840,768, versus 42,823,840,768 previously - no change!

The only other thing I can think of is reducing the number of dimensions to see how this impacts the cube size. The rationale is that each cell requires an additional pointer for each additional dimension added, and each pointer would consume some number of bytes in the cube - more dimensions means more pointers means more storage required. I will try to test this theory further.
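Purely as an illustration of that theory (the per-dimension cost below is an assumed figure, not something documented), the potential saving would look something like this:

# Hypothetical: savings from dropping dimensions IF each extra dimension really
# costs a fixed number of bytes per populated cell. The 4 bytes is an assumption.
ASSUMED_BYTES_PER_DIM_INDEX = 4
POPULATED_CELLS = 834_673_540  # Cube MB

for dims_removed in (1, 2):
    saving_gb = dims_removed * ASSUMED_BYTES_PER_DIM_INDEX * POPULATED_CELLS / 1024**3
    print(f"Dropping {dims_removed} dimension(s): ~{saving_gb:.1f} GB, if the theory holds")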

Any other suggestions in the meantime are welcome.
BR, George.

Learn something new: MDX Views