Tries & parallel loads using tm1runti

lotsaram
MVP
Posts: 3704
Joined: Fri Mar 13, 2009 11:14 am
OLAP Product: TableManager1
Version: PA 2.0.x
Excel Version: Office 365
Location: Switzerland

Tries & parallel loads using tm1runti

Post by lotsaram »

I have a question about how the "Tries" used to store data in TM1 cubes work in practice, and how that plays into managing parallel loads. Warning: it's long!

So here goes. A proposal was raised today for an alternate way of optimizing parallel fact loads to a very large cube (multiple hundred GB). Here's a summary of the current setup vs. the new proposal.

Current parallelization regimen:
TM1 segments the SQL query with a WHERE clause based on the "region" dimension. There are a few hundred regions with data, and we cycle through the queries, managing a queue of 24 cores in parallel, until all of the several hundred queries have been loaded. Daily we load a few million records and the overall process takes about 10 minutes on the 24 cores. Each Trie from each TI that gets committed is for a single region, so all are unique and non-overlapping. The dimension order is pretty well optimized, so each Trie is small, garbage memory is minimized and each commit phase is pretty quick. But ... the downside of this approach is that not all the queries are the same size. There are quite a few regions with no data or very few records, so we waste some resources authenticating with tm1runti and executing a query for just a few records. On the other hand, there are 2 or 3 regions which have much more data than average, so the processes running those queries take a few minutes longer and are always the last to complete.
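For illustration, a minimal sketch (Python) of how this per-region fan-out can be driven: one tm1runti call per region, pushed through a pool capped at 24 workers. The server, process and parameter names are hypothetical, and the tm1runti flags are from memory, so check them against your version's documentation.

Code: Select all

# Sketch of the current regimen: one tm1runti call per region, max 24 at a time.
# Names and credentials below are placeholders, not the real setup.
import subprocess
from concurrent.futures import ThreadPoolExecutor

REGIONS = ["R001", "R002", "R003"]  # in reality, a few hundred regions

def load_region(region):
    # Each call authenticates, runs the TI with a region-scoped SQL WHERE,
    # builds its own non-overlapping trie and commits it.
    cmd = [
        "tm1runti",
        "-server", "SalesModel",          # hypothetical server name
        "-adminhost", "tm1admin",         # hypothetical admin host
        "-user", "loader", "-pwd", "***",
        "-process", "fact.load.region",   # hypothetical TI process
        "pRegion=" + region,
    ]
    return region, subprocess.run(cmd).returncode

with ThreadPoolExecutor(max_workers=24) as pool:   # the 24-core queue
    for region, rc in pool.map(load_region, REGIONS):
        if rc != 0:
            print("load failed for", region)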

So the current approach isn't perfect as the fact segments aren't balanced. The process allows for a significantly faster end-to-end fact load than single threaded, in the order of 10x to 15x faster, but it is a long way from a perfectly linear 24x from the scale-up of cores. This is due to burning CPU on wasted logins, small queries, and the phase at the end which is only parallel across 2 or 3 cores due to the larger regions. We live with this as i) this strategy was easy to implement, ii) it is still an order of magnitude better than single threaded, and iii) it's a daily load and the weighting of regions could change over time, so grouping would be difficult. It would be a lot of work to do a pre-scan of the fact table and have an algorithm group the regions into roughly equal-sized groups rather than single regions, especially if this needed to be done for each load.

Proposed parallelization regimen:
There is a new proposal to compartmentalize the fact load to TM1 by segmenting the query based not on region but on the row ID of the fact table. This would allow for 24 equal-sized queries, so all parallel fact loads would start and finish at the same time. This is also simple to implement, even simpler than the current approach. There won't be any wasted CPU due to lots of small queries and unnecessary tm1runti login/logout, and all loads should run for the same amount of time. Sounds great. But ... the queries will not be segmented according to any TM1 dimension; they will be randomly distributed across region (and every other dimension). My question: is this a downside, and if so how much of a downside? (Clearly I suspect it is or I wouldn't ask the question, but I have an open mind.)
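To make the proposal concrete, a small sketch of how the 24 row-ID slices might be generated (table and column names are hypothetical):

Code: Select all

# Proposed regimen: 24 equal slices of the fact table keyed on a numeric row ID,
# with no relationship to any TM1 dimension. Illustrative only.
def row_id_slices(min_id, max_id, workers=24):
    span = max_id - min_id + 1
    step = -(-span // workers)  # ceiling division so every row is covered
    for i in range(workers):
        lo = min_id + i * step
        hi = min(lo + step - 1, max_id)
        yield f"SELECT * FROM fact_sales WHERE row_id BETWEEN {lo} AND {hi}"

for sql in row_id_slices(1, 4_000_000):
    print(sql)   # one query per tm1runti thread, all roughly the same size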

Under the new proposal the Tries built by each TI process will overlap. An individual ZeroOut isn't possible, but this doesn't matter as it can be done single threaded before the parallel phase starts. Each Trie for the new proposal will certainly be larger than the single-region Tries, so I am pretty certain that each commit will take longer. What I am unsure of is any other effects of the overlapping ...

With the overlapping Tries all trying to commit at the same time, is it possible to end up with incorrect data in the cube due to "last commit wins"? (Clearly not a good or desired result.) Or will the lock manager recognize that the commits overlap and cause waits at this point once the contention is recognized? Note this would effectively serialize the commits and negate a significant portion of the time gain from loading the facts in parallel. That would not be as bad as incorrect numbers in the cube, but still not good. (My hunch is this is the most likely outcome.) Worst case, might the lock manager get tangled up with all the overlapping commits and end up in a deadlock with rollbacks and thrashing? (Yep, worst case; the server at that point is pretty ugly.)

According to IBM it is a best practice to segment fact records into non-overlapping "chunks" based on unique combinations of one or more TM1 dimensions, and this is certainly what I have always done. The question gets to WHY this is a "best practice" (I know many are allergic to this term). Clearly we could do the experiment and see whether the new proposal is better than the current approach. But this would take a lot of time & effort, and time & effort cost money. If someone knows the answer already it will save time, effort & money.

Question is, can anyone outside of IBM's core server engineering team definitively answer the question? I hope someone can answer. In any case I hope the question was interesting. I don't ask questions that often.
Don't be shy!
Please place all requests for help in a public thread. I will not answer PMs requesting assistance.
BrianL
MVP
Posts: 264
Joined: Mon Nov 03, 2014 8:23 pm
OLAP Product: TM1
Version: 9.5.2 10.1 10.2 PA2
Excel Version: 2016

Re: Tries & parallel loads using tm1runti

Post by BrianL »

I'll answer to the best of my knowledge from both my own experience and things I've been told by IBMers.

First, about the lock manager and contention. As you are already aware (and taking advantage of), PI removes much of the lock contention from cell data writes. TM1 will allow multiple TI processes to write to the exact same cell simultaneously and without (apparent) conflict. The tricky bit is that PI removes the pre-commit IX lock contention BUT still performs commit-time serialization. This generally goes unnoticed because commit is usually pretty quick as well as significantly shorter than the rest of the processing TM1 does. You can see this simply by watching for tm1top to show a thread in a Commit state. It doesn't happen too often in my experience, and when it does it's usually a small percentage of the overall reported time.

If you have transaction logging enabled (and I suspect for these data loads you do not) TM1 obviously has to serialize on the file I/O, but even with transaction logging disabled I haven't seen a change in commit synchronization behavior.

Now, the only time I've ever seen this synchronization matter was once... in exactly the situation you are describing. With several TI processes doing multi-hundred-GB data loads and all hitting commit at the same time, I've seen commit become a bottleneck. All the TI processes end up in Commit (according to tm1top), but they all exit one at a time. You might be able to estimate the impact by logging tm1top output at a short interval and executing the longest data load you have (or expect to have). Time how long you see tm1top report a 'Commit' state; for a worst-case scenario each of your processes will serialize for that amount of time. A one-minute commit time getting hit by 24 processes at once could take 24 minutes to fully complete!
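As a rough sketch of how that estimate could be scripted, assuming the tm1top log has already been parsed into timestamp/thread/state tuples (the exact log layout varies by version, so this is illustrative only):

Code: Select all

# Measure how long each thread sits in the 'Commit' state, from tm1top
# snapshots logged at a short interval. Tuples below are made-up examples;
# producing them from the raw tm1top log file is left to the reader.
from datetime import datetime
from collections import defaultdict

snapshots = [                                   # (timestamp, thread_id, state)
    (datetime(2020, 1, 1, 2, 0, 0), 1812, "Run"),
    (datetime(2020, 1, 1, 2, 0, 5), 1812, "Commit"),
    (datetime(2020, 1, 1, 2, 1, 10), 1812, "Commit"),
    (datetime(2020, 1, 1, 2, 1, 15), 1812, "Idle"),
]

commit_start = {}
commit_time = defaultdict(float)
for ts, thread, state in snapshots:
    if state == "Commit" and thread not in commit_start:
        commit_start[thread] = ts                         # entered Commit
    elif state != "Commit" and thread in commit_start:
        commit_time[thread] += (ts - commit_start.pop(thread)).total_seconds()

for thread, secs in commit_time.items():
    # Worst case: 24 parallel loads serialize, so total commit wait ~ 24 * secs
    print(f"thread {thread}: ~{secs:.0f}s in Commit; 24 threads -> ~{24*secs:.0f}s serialized")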

Now for the data integrity question. As I'm sure you know, TM1 has used a last-writer-wins strategy for many years. There's nothing to prevent two people from writing to the exact same cell location at (nearly) the same time. Both will succeed and both will show in the transaction log, but the first write that commits is quickly (and transparently) overwritten by the next. This still applies to large data loads: if you have overlapping writes, only one will still exist when all processing is complete. As long as you avoid this scenario I wouldn't expect any data inconsistencies. While I can't seem to dig up any official documentation, I seem to recall hearing that TM1 follows the ACID database properties. If so, the Isolation property would dictate that you won't lose data.

Unfortunately, I don't have any direct experience to share on your main question about the overall efficiency of segmenting your data loads based on TM1 dimensions versus random values.
lotsaram
MVP
Posts: 3704
Joined: Fri Mar 13, 2009 11:14 am
OLAP Product: TableManager1
Version: PA 2.0.x
Excel Version: Office 365
Location: Switzerland

Re: Tries & parallel loads using tm1runti

Post by lotsaram »

Brian thanks very much for your answer.

I hadn't considered that if transaction logging is on, this would also effectively serialize the commits due to contention on tm1s.log. That's a very good point.

I guess the whole point of parallel interaction is that there is no check or IX lock on cell updates anymore, so cell for cell it will always be last commit wins, and there shouldn't be any waits or locks due to this. Good. But I think I still have a fundamental question that goes to how tries work and how they are built.

In this case the set of updated cells will be unique for each of the 24 threads, but at what level is the commit made: cell-by-cell or whole trie? I'm pretty sure whole trie, so what does a trie contain? I know that branches are added to a trie as required. As a trie is built, does it take a shadow copy of the existing cube data as it builds a new branch, so that the shadow trie contains a complete data set of that area of the cube, including the unchanged cells as well as the changed cells? Or does the trie contain only the directly changed cells and nothing else? This is critical, because while the tries are being built the base model of the cube will be identical, as no commits have yet happened.

So in a situation of trie branch overlap but no cell update overlap, can you still have conflicting updates at a cell level? Can cells updated with values in one commit simply disappear or revert to their old values on the next?
Please place all requests for help in a public thread. I will not answer PMs requesting assistance.
qml
MVP
Posts: 1097
Joined: Mon Feb 01, 2010 1:01 pm
OLAP Product: TM1 / Planning Analytics
Version: 2.0.9 and all previous
Excel Version: 2007 - 2016
Location: London, UK, Europe

Re: Tries & parallel loads using tm1runti

Post by qml »

I've done quite a bit of work on solutions similar to what you are describing - possibly even larger in scale (up to 60 parallel threads, each one loading tens of millions of records). As Brian points out, commits are definitely serialised, so they can become a bottleneck. This is true even if no transaction logging is switched on (and yes, the transaction log is in itself a resource that can cause bottlenecks - e.g. CAM logons are serialised because CAM authentication always requires a lock on the transaction log, even if logging is turned off - fun fact). It is true that commits are usually quick enough that these things are hard to notice and do not matter much. In the case of some of my projects (especially one) the scale was large enough (an average commit taking a minute) that these things became quite evident. Also, TM1Top/Operations Console shows things in a misleading way - it is possible for multiple TI threads to display in a Commit state at the same time, but in reality they form a queue with all but one in a Wait state - something that is not shown on screen.

For the reason stated above I found that the optimal setup (i.e. minimising the total end-to-end load time) is when the data batches are not identical in size, so that TI instances finish at slightly staggered times, keeping the commit queue as short as possible. Another technique could be to have them identically sized but stagger the starts. I have also found (with all the 10.1.1+ versions, anyway) that the last TI instance to finish in a batch of instances, regardless of how much data it was loading, spends much longer in the Commit state than all the other ones (I can't remember exactly, but it was something like 3 or 4 times longer). This is apparently (as vaguely explained by IBM engineers) due to some memory cleanup that the last TI does, even though superficially you'd think each TI instance is independent and a separate transaction - turns out that's not entirely the case!

So no matter what you do, you will always end up with a situation where you are not utilising all the cores for the entire duration of the load - there will always be these few procrastinator threads at the end and the last one will even take extra time to finish.

As to your worries about data completeness - I can't say for sure, but I don't think that a commit would be as simple as attaching one trie to another. Merging tries is almost certainly a more complicated operation and I'm quite sure you will end up with no data losses. At least I have never experienced anything like that. Mind you, I have always made sure that there are no data overlaps between batches. Based on empirical data I would guess that any 'shadow' trie owned by a TI instance only represents the cells changed in that transaction. But if different TI instances attempt to change the same cells then yes, of course the last one will win.

Edit: as to your specific situation, I think the best thing to do would be to keep the split by region, but based on longer-term data statistics manually clump some regions together, so that all resulting datasets are of the same order of magnitude (e.g. a query for 20 small regions can return roughly as many records as a query for a single large region). That relatively simple change, combined with the natural distribution of dataset sizes (after the change), where the largest one might be, say, 2 times larger than the smallest one, should get you much closer to the optimum you are looking for. If you're feeling really adventurous you can write an algorithm to dynamically clump regions in queries based on any given dataset in the relational DB - but that's probably overkill, especially as your load times are quite decent already.
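A hedged sketch of what that clumping could look like, driven off historical record counts per region (the counts and group count below are made up): sort the regions largest-first and always drop the next one into the currently lightest bundle.

Code: Select all

# Greedy bundling of regions into roughly equal-weight groups.
# Each bundle then becomes one IN (...) clause / one tm1runti call.
import heapq

region_counts = {"NA-EAST": 900_000, "EMEA": 750_000, "APAC": 400_000,
                 "NA-WEST": 350_000, "LATAM": 120_000, "OTHER": 15_000}

def bundle(counts, groups=24):
    heap = [(0, i, []) for i in range(groups)]      # (total rows, id, regions)
    heapq.heapify(heap)
    for region, rows in sorted(counts.items(), key=lambda kv: -kv[1]):
        total, i, members = heapq.heappop(heap)     # lightest group so far
        members.append(region)
        heapq.heappush(heap, (total + rows, i, members))
    return [(total, members) for total, _, members in heap if members]

for total, members in bundle(region_counts, groups=3):
    print(total, members)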
Kamil Arendt
Steve Rowe
Site Admin
Posts: 2464
Joined: Wed May 14, 2008 4:25 pm
OLAP Product: TM1
Version: TM1 v6,v7,v8,v9,v10,v11+PAW
Excel Version: Nearly all of them

Re: Tries & parallel loads using tm1runti

Post by Steve Rowe »

Great thread!
I don't have much to add.
If a Trie does not have a single unique element defining it, does that make it more costly for the engine to handle? No idea, but it sounds logical that it might.

Thinking about the clumping of queries to try and ensure that commits don't happen at the same time.
Staggering the start times seems to me that you are just creating a queue at the beginning of the process rather than the end?
If you clump (need another word!) the queries into the same size does that make a commit clash more likely?
Maybe this would work: if we assume that there is no overhead to non-unique tries, then you could have one thread processing 1,000 rows at a time and another 5,010, and you would only get a commit clash when the row counts were multiples of each other.

I would have thought you are well into the space of marginal returns though.

Have you considered pushing the data to flat files and then reading the flat files into TM1? This would save all the ODBC handshakes, especially for empty regions. Not sure if this would be end to end faster but I'd expect the TM1 bit to be. You wouldn't need to wait until all the files were produced as you could trigger the TM1 load once a few of the flat files were produced.

What's the driver for this conversation? Is it just a theoretical desire to use the threads in the most efficient manner? Just wondering, as most of what you are doing is non-locking from the users' PoV?

Cheers,
Technical Director
www.infocat.co.uk
qml
MVP
Posts: 1097
Joined: Mon Feb 01, 2010 1:01 pm
OLAP Product: TM1 / Planning Analytics
Version: 2.0.9 and all previous
Excel Version: 2007 - 2016
Location: London, UK, Europe

Re: Tries & parallel loads using tm1runti

Post by qml »

Steve Rowe wrote:Staggering the start times seems to me that you are just creating a queue at the beginning of the process rather than the end?
True, but as I said, there needs to be a queue somewhere anyway. The goal is to have a situation where, for as long as possible, N-1 threads are processing something and 1 thread is committing. If there is more than 1 thread in the commit queue then the waiting threads are not processing anything and CPU cores are wasted on waiting.
Steve Rowe wrote:If you clump (need another word!) the queries into the same size does that make a commit clash more likely?
How about 'bundle' instead of 'clump'? The goal in my proposal is for the queries not to return identically sized datasets, but instead sets that are spread out in size in such a way that it takes as long for one thread to commit as it takes to process the average difference in recordset sizes. This difference will naturally vary depending on the specific circumstances and needs to be found via trial and error. In the scenarios I worked with, the ideal ratio between the largest and smallest dataset was 2:1, with the other datasets spread out as evenly as possible. It doesn't have to be exact, but it's something to aim for. Things obviously get more complicated when you have more batches to process than threads - in which case it might be hard to control how the slots are allocated, as it would typically be on a first-come-first-served basis. In that case my overall suggestion would still hold true, because once you stagger commits in the first round, they are likely to stay staggered for the rest of the load - provided that batches are similarly (but not identically) sized.
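As a toy illustration of that 2:1 spread (purely illustrative numbers), one could generate per-batch row targets that grow linearly from the smallest batch up to roughly double its size:

Code: Select all

# Per-batch record targets spread linearly from 1x to ~2x the smallest batch,
# so that commits arrive staggered rather than all at once.
def staggered_targets(total_rows, threads=24, ratio=2.0):
    weights = [1.0 + (ratio - 1.0) * i / (threads - 1) for i in range(threads)]
    scale = total_rows / sum(weights)
    return [round(w * scale) for w in weights]

targets = staggered_targets(3_000_000, threads=24)
print(min(targets), max(targets), sum(targets))   # ~2:1 spread, totalling ~3M rows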
Steve Rowe wrote:Have you considered pushing the data to flat files and then reading the flat files into TM1? This would save all the ODBC handshakes, especially for empty regions. Not sure if this would be end to end faster but I'd expect the TM1 bit to be. You wouldn't need to wait until all the files were produced as you could trigger the TM1 load once a few of the flat files were produced.
Lotsa's circumstances might be different, but in my case the ODBC method proved as fast as it could be. Generating huge flat files would take a long time and was a no-no anyway for security reasons. I was able to configure the relational DB and the ODBC connection to provide almost optimal throughput, i.e. serve data as quickly as TI can consume it (I monitored network and CPU usage). We optimised the $%!* out of our Oracle DB and it also helped that it was running on a powerful Exadata server. I also had to change the default ODBC settings quite drastically, because the default Oracle client buffer size of 64 KiB was not nearly enough - I increased it to 32 MiB (!) per connection, which boosted the observed throughput by an order of magnitude. I tested dozens of different settings to arrive at the optimal values for our queries. In the end, under perfect conditions when all the threads were busy, I achieved throughputs of 300k records per second, which I think is quite an achievement. It translates to about 5k records per second in a single thread - which is very good for a TI process with hundreds of lines of code, and it would not have been any better loading from flat files. A very simple TI process with a local flat file as a source can maybe process about 15k records per second, give or take.
Kamil Arendt
Steve Rowe
Site Admin
Posts: 2464
Joined: Wed May 14, 2008 4:25 pm
OLAP Product: TM1
Version: TM1 v6,v7,v8,v9,v10,v11+PAW
Excel Version: Nearly all of them

Re: Tries & parallel loads using tm1runti

Post by Steve Rowe »

Sounds like you were having fun Kamil!
Technical Director
www.infocat.co.uk
BrianL
MVP
Posts: 264
Joined: Mon Nov 03, 2014 8:23 pm
OLAP Product: TM1
Version: 9.5.2 10.1 10.2 PA2
Excel Version: 2016

Re: Tries & parallel loads using tm1runti

Post by BrianL »

lotsaram wrote:But I think I still have a fundamental question that goes to how tries work and how they are built.

In this case the set of updated cells will be unique for each of the 24 threads, but at what level is the commit made: cell-by-cell or whole trie? I'm pretty sure whole trie, so what does a trie contain? I know that branches are added to a trie as required. As a trie is built, does it take a shadow copy of the existing cube data as it builds a new branch, so that the shadow trie contains a complete data set of that area of the cube, including the unchanged cells as well as the changed cells? Or does the trie contain only the directly changed cells and nothing else? This is critical, because while the tries are being built the base model of the cube will be identical, as no commits have yet happened.

So in a situation of trie branch overlap but no cell update overlap, can you still have conflicting updates at a cell level? Can cells updated with values in one commit simply disappear or revert to their old values on the next?
Here's the way I've had PI explained to me. The first moment a thread asks to look at cube data it effectively sees a snapshot of the latest data committed in the cube. During the entire lifetime of the operation it will use this same snapshot; any new writes by other threads will not be visible. To keep uncommitted cell writes private to the thread performing them, an overlay is created that contains only the changed data. At commit time a new "version" of cube data is created. This version re-uses all the unchanged data. The data overlay is then merged into the new version, again modifying as little as possible. The merge is just that: a merge. It doesn't blindly overwrite anything existing, but combines the old and the new so no data is lost.
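To illustrate the idea (this is a toy model of the behaviour described above, not how the engine is actually implemented): each thread keeps an overlay of only its own changed cells, and commit produces a new version that carries over everything it didn't touch.

Code: Select all

# Toy model of snapshot + per-thread overlay + serialized commit merge.
committed = {("US", "Jan"): 100, ("DE", "Jan"): 50}     # current cube version

overlay_a = {("US", "Jan"): 120}                        # thread A's changed cells only
overlay_b = {("FR", "Jan"): 70}                         # thread B's changed cells only

def commit(current, overlay):
    new_version = dict(current)      # unchanged cells are carried over (re-used)
    new_version.update(overlay)      # only cells the thread actually wrote are replaced
    return new_version

# Commits are serialized; overlapping cell addresses would be last-commit-wins,
# but non-overlapping overlays never lose each other's data.
committed = commit(committed, overlay_a)
committed = commit(committed, overlay_b)
print(committed)   # {('US','Jan'): 120, ('DE','Jan'): 50, ('FR','Jan'): 70}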
lotsaram
MVP
Posts: 3704
Joined: Fri Mar 13, 2009 11:14 am
OLAP Product: TableManager1
Version: PA 2.0.x
Excel Version: Office 365
Location: Switzerland

Re: Tries & parallel loads using tm1runti

Post by lotsaram »

To anyone who has followed this I thought I should update with information back from IBM.

1/
The shadow trie structures contain ONLY UPDATED CELLS. So there is no risk of data loss or data corruption. Obviously, for any cells with the same address it's last commit wins. But as long as each set of cells is unique, the data in the cube at the conclusion of all commits will be correct. However, if the trie branches overlap then the commits will take longer and require more memory.

2/
The optimal way to segment parallel loads is into non-overlapping segments of the 1st dimension in the cube (the reordered indexing, not the original). This will result in the smallest tries, the smallest memory footprint, and quicker writes and commits.

3/
Commits to a cube are serialized. So it is inefficient to have all segments equal-sized and starting at the same time, as this will just result in a queue at the end. It is most efficient to have more segments of differing sizes and work through the processing queue to manage peak CPU. This will stagger the commits and give the best end-to-end time.
Please place all requests for help in a public thread. I will not answer PMs requesting assistance.
Steve Rowe
Site Admin
Posts: 2464
Joined: Wed May 14, 2008 4:25 pm
OLAP Product: TM1
Version: TM1 v6,v7,v8,v9,v10,v11+PAW
Excel Version: Nearly all of them

Re: Tries & parallel loads using tm1runti

Post by Steve Rowe »

Thanks for sharing Lotsaram, much appreciated.
Cheers,
Technical Director
www.infocat.co.uk
qml
MVP
Posts: 1097
Joined: Mon Feb 01, 2010 1:01 pm
OLAP Product: TM1 / Planning Analytics
Version: 2.0.9 and all previous
Excel Version: 2007 - 2016
Location: London, UK, Europe

Re: Tries & parallel loads using tm1runti

Post by qml »

lotsaram wrote:The optimal way to segment parallel loads is into non-overlapping segments of the 1st dimension in the cube (the reordered indexing, not the original). This will result in the smallest tries, the smallest memory footprint, and quicker writes and commits.
This sounds interesting and is definitely something I'd like to understand in more detail. Have you actually tested this in a rigorous way? I would be interested to know how big the impact on commit time is when the segmenting dimension is the first vs. a middle vs. the last one in the cube. The first dimension in most cubes is usually very small, often having just one leaf element, and in my experience the most natural candidate for the segmenting dimension is one somewhere close to the end of the cube, like organizational structure or date.

I'm not that fussed about the memory footprint of the temporary tries, because with parallel loading there is going to be quite a lot of garbage memory generated anyway and one needs to have the appropriate hardware for that. However, being able to measurably improve write and commit times by ordering the segmenting dimension right would be great.
Kamil Arendt