Researching a chore failure issue on a very large process, I came across this: https://www-304.ibm.com/support/docview ... wg21459718
The advice relates to multiple TI loaders. We don't have that, but our batch process does a server restart at the end to free up feeder memory (our server only has 400 GB of RAM) and, naturally, we have a SaveDataAll just before the restart. (Version is CX 9.0 = TM1 9.4.something.)
There's also this in the ref manual:
There is a brief window during the commit operation where the locks are released and another user or TurboIntegrator process could delete objects the original chore was using. When the original chore attempts to reacquire the locks on those objects, the objects will not be available and the chore will cease processing. In this case, an error similar to the following is written to the Tm1s.log file:
844 WARN 2008-04-01 16:40:09,734 TM1.Server TM1ServerImpl::FileSave could not reacquire lock on object with index 0x200002ca
I'm scratching my head trying to understand what IBM are advising. If we didn't do a SaveDataAll before a restart we would lose all the numbers. If we turned logging on we'd have huge log files and the process would be slower.
And... we've been running the batch process with a restart since last December, and have only just started encountering these problems.
My plan is to mod the overnight SaveDataAll to test the presence of an 'in progress' cell we already use, and bail out if it finds it populated.
But I'd like to hear what others do, both without and with 9.5.2 PI.
David Usherwood wrote:My plan is to mod the overnight SaveDataAll to test the presence of an 'in progress' cell we already use, and bail out if it finds it populated.
'In progress' cell as in a distinct cell in some application-specific custom system cube?
Which must be written by a first TI process and read by a second TI process?
By which you serialize these two TI processes, as the second one cannot read the distinct cell as long as the first one has a write lock on its cube?
If you intend to avoid serializing TI processes and chores, did you consider using file semaphores - in short, empty (text) files with distinct file names - to control the execution of TI processes and chores?
For instance, the first TI process of your first load chore named "lc1" uses the TI fct FileExists(file) to check if the empty text file "lc1.txt" exists. If so, it uses the TI fct ExecuteCommand to execute a batch script to delete that file. The last TI process of your first load chore uses the TI fct ExecuteCommand to execute another batch script to create the empty text file "lc1.txt". Your second load chore named "lc2" does the same with the empty text file "lc2.txt". And so on.
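A minimal TI sketch of that pattern might look like the following; the directory, file name and exact shell commands are just placeholder assumptions, not anything prescribed:

# Prolog of the first process in load chore lc1: consume the semaphore if present
sSemaphore = 'D:\TM1Data\lc1.txt';
IF ( FileExists ( sSemaphore ) = 1 );
   # TI has no delete function, so shell out to remove the file
   ExecuteCommand ( 'cmd /c del "' | sSemaphore | '"', 1 );
ENDIF;

# Epilog of the last process in load chore lc1: recreate the semaphore to signal completion
ExecuteCommand ( 'cmd /c echo done> "' | sSemaphore | '"', 1 );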
You have another chore running at convenient intervals, for instance every 30 minutes, using the TI fct FileExists to check whether all file semaphores like "lc1.txt", "lc2.txt", etc. exist, indicating that all of your load chores have finished. If that condition is met, it calls the TI fct SaveDataAll to perform the SaveDataAll operation.
To avoid lock conflicts with other (user) threads, you should execute SaveDataAll in bulk load mode: call the TI fct EnableBulkLoadMode before SaveDataAll and the TI fct DisableBulkLoadMode after it.
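Putting the pieces together, the save process in that checking chore could be as small as this sketch (file paths are assumed, and only two load chores are shown):

# Prolog of the SaveDataAll process, scheduled e.g. every 30 minutes
IF ( FileExists ( 'D:\TM1Data\lc1.txt' ) = 0 % FileExists ( 'D:\TM1Data\lc2.txt' ) = 0 );
   ProcessQuit;   # at least one load chore has not finished yet
ENDIF;
EnableBulkLoadMode();
SaveDataAll;
DisableBulkLoadMode();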
Interesting idea.
We have in the meantime made and tested the change. When the SaveDataAll kicks off while the batch process is running it accesses the flag cell without issue and duly bails out. It may be relevant that the flag is set at the start of the batch process and the rest of the process takes some time to run.
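For reference, the guard itself is only a few lines in the Prolog of the modified SaveDataAll process; the cube and element names below are illustrative rather than our actual objects:

# Bail out if the batch process has marked itself as running
IF ( CellGetS ( 'SysControl', 'BatchStatus', 'Value' ) @<> '' );
   ProcessQuit;
ENDIF;
SaveDataAll;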
On the 'empty file' - why not just use AsciiOutput?
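Something along the lines of the following (path assumed) would write the semaphore file directly from TI, with no batch script needed for the 'create' half:

AsciiOutput ( 'D:\TM1Data\lc1.txt', 'done' );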
It's probably worth mentioning that there are no users interacting with the server while all this is going on - it's a very batch-orientated system, not because I wanted it that way, but because the model is simply too large to run up live on our 400 GB server.
Thanks also for reminding me about BulkLoadMode - would be interesting to see if this speeded up the batch process.
I like the idea of a file semaphore as a way of overcoming issues with traditionally managing queuing now that we have parallel interaction.
David - what version is the system running? I have noticed a marked improvement now with 9.5.1 and 9.5.2 over 9.4.1 during SaveDataAll in how locks are obtained on objects as they get saved and released individually as the save of each cube completes (as opposed to the old way where the lock was obtained on all objects en masse and released only after the very last cube was saved.)
I can't see that BulkLoadMode would do anything for the speed of the save; it would only put the server in single-thread mode and block any connection attempts during the process (which might help if there were any contention issues). You have to tread very carefully with bulk load mode though, as if anything causes a process to quit before it is released, the server finds itself in a terminal state because no new connections are accepted to turn it off.
Version (in original post) is CX90 aka TM1 9.4.whatever.
I hear what you say about bulk load - as mentioned, the only contention is this overnight SaveDataAll
But we are also experimenting, back at the ranch, running the model under 9.5.2 PI. So far results have been... less than stellar. IBM are interested in seeing the model to find out why.
I have had problems with bulk load mode in the past (around June). When it was enabled it was causing chores to crash. That said, I was enabling it at the beginning of the chore, not just for the save data.
David Usherwood wrote:But we are also experimenting, back at the ranch, running the model under 9.5.2 PI. So far results have been... less than stellar. IBM are interested in seeing the model to find out why.
Doesn't the stopping of a service do a SaveDataAll by default anyway? If this is correct, do you need to do a separate SaveDataAll before a restart?
I have just stopped a temp service (9.5.2) I was running, and the following events were logged in the server log:
8460 [] INFO 2011-11-23 07:15:36.587 TM1.Server Closing...
8460 [] INFO 2011-11-23 07:15:36.587 TM1.Server Saving...
8460 [] INFO 2011-11-23 07:15:36.587 TM1.Server The server is coming down...
And it has created a timestamped tm1s???.log file, which suggests it has done a SaveDataAll by default.
lotsaram wrote:I can't see that BulkLoadMode would do anything for the speed of the save; it would only put the server in single-thread mode and block any connection attempts during the process (which might help if there were any contention issues). You have to tread very carefully with bulk load mode though, as if anything causes a process to quit before it is released, the server finds itself in a terminal state because no new connections are accepted to turn it off.
TM1 Version 9.5.2 introduced two new TM1 server configuration parameters to manage server wait times: "NetRecvBlockingWaitLimitSeconds" and "NetRecvMaxClientIOWaitWithinAPIsSeconds".
Problem(Abstract)
In the default configuration, the TM1 server can wait for a long time for input, which can result in long-held threads and other problems.
Symptom
Server hangs waiting for input.
Cause
The server waits for a long time without raising a socket error when using functions such as EnableBulkLoadMode.
Resolving the problem
NetRecvBlockingWaitLimitSeconds and NetRecvMaxClientIOWaitWithinAPIsSeconds are new configuration parameters that can be used to manage the wait time of the server.
NetRecvMaxClientIOWaitWithinAPIsSeconds is the maximum time, in seconds, that a client is allowed to spend doing I/O within the interval between the arrival of the first packet of data for a set of APIs and the sending of the response. Using this parameter requires the client to handle I/O in a reasonably timely fashion after initiating API requests. It is designed to protect against connections that go dead without raising a socket error, and against other possibilities such as a hung client. The default value is 0, which means no time limit. NetRecvMaxClientIOWaitWithinAPIsSeconds=30 is a reasonable setting for this parameter.
NetRecvBlockingWaitLimitSeconds changes the maximum time the server waits for a client to send the next request from one long wait into a series of repeated shorter waits, giving the server the opportunity to cancel or pause the thread if needed. When set to zero (the default), the single long wait behaviour is used. For this parameter, NetRecvBlockingWaitLimitSeconds=60 is typical.
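For reference, both are ordinary tm1s.cfg entries, so a sketch of the relevant lines using the values suggested above would simply be:

NetRecvBlockingWaitLimitSeconds=60
NetRecvMaxClientIOWaitWithinAPIsSeconds=30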
Doesn't the stopping of a service do a SaveDataAll by default anyway? If this is correct, do you need to do a separate SaveDataAll before a restart?
I have just stopped a temp service (9.5.2) I was running, and the following events were logged in the server log:
8460 [] INFO 2011-11-23 07:15:36.587 TM1.Server Closing...
8460 [] INFO 2011-11-23 07:15:36.587 TM1.Server Saving...
8460 [] INFO 2011-11-23 07:15:36.587 TM1.Server The server is coming down...
And it has created a timestamped tm1s???.log file, which suggests it has done a SaveDataAll by default.
When you run your TM1 server process as a Windows service and then stop that service, the Windows Service Control Manager (SCM) allows it only a limited amount of time, by default 30 seconds, to perform a proper service shutdown.
That may not be enough time to clean up the multithreaded TM1 server process and to successfully execute a SaveDataAll operation writing all new information from memory to disk.
Depending on the size of your TM1 server transaction log file tm1s.log, the memory footprint of your TM1 server process and the number of users working with it, stopping the TM1 server process may take minutes or hours.
If that happens, the SaveDataAll operation performed by a TM1 server shutdown may fail, leaving you with a corrupt TM1 server transaction log file tm1s.log; see for instance the manual TM1 9.5.2 Operation Guide:
Troubleshooting: Recovering from a Corrupt Transaction Log File
In some cases, an unexpected or incomplete shutdown of the TM1® server, due to a server crash or power outage, can cause the transaction log file to become corrupt. If this happens, the server will not be able to restart.
Thus, depending on your TM1 application, you may have to separate the SaveDataAll operation from the TM1 server shutdown by executing a TI process calling the TI fct SaveDataAll some time before the TM1 server shutdown.
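In practice that separation can be as simple as a single-process chore scheduled well before the service is stopped; the optional timestamp file below is only an assumption, to let the shutdown job confirm the save finished:

# Pre-shutdown save, run as its own chore
SaveDataAll;
AsciiOutput ( 'D:\TM1Data\savedone.txt', TimSt ( Now, '\Y-\m-\d \h:\i' ) );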
Absolutely, everything moby91 just said. It's always safer to call a SaveDataAll prior to forcing a shutdown; for any large data set there may not be enough time, and you don't want to risk an incomplete save.
After a quiet period this issue has re-emerged with a vengeance, with approx. one batch process in 10 failing to complete.
To remind eager listeners, we have a complex model which cannot be run up completely in our little 400 GB server (that's not a typo), running at present on CX 9.0.
Accordingly, I wrote a batch process which copies chunks of input data from an unfed area to a fed area, firing the feeders, and then freezes the calculations into an unruled static cube. After this stage completes, the fed area is cleared, a SaveDataAll is executed, and a batch file is called which restarts the instance (thus freeing up the feeder memory), ready for the next run.
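Stripped to a skeleton, the tail end of each run amounts to the sketch below; the cube name and batch file are placeholders rather than the real object names:

# Epilog of the final process in the batch chore
CubeClearData ( 'FedWorkArea' );                          # empty the fed staging area
SaveDataAll;                                              # commit everything to disk
ExecuteCommand ( 'D:\TM1Data\restart_instance.bat', 0 );  # fire the restart script without waiting, freeing feeder memory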
The volume of calculations is very large indeed, and the overall run times are around 60 hours (for one model) and 120 hours (for another).
We've been running this without issues since I wrote it over the Christmas holiday period last year. In November, we hit an issue where a batch run reached the SaveDataAll and went no further - which led me to the post at the beginning of this thread. This did not recur for a while, but over the last two weeks it has suddenly been occurring far more often, as above.
We've been back to IBM about this. After some toing and froing, their explicit (but to me not consistent) recommendations are:
a) To avoid lock contention in a long-running process, turn Performance Monitor off, and ensure that other chores are inactive (we have the standard CX overnight SaveDataAll in the system).
b) Server restart performs a SaveDataAll. Turn cube logging on and use the transaction log.
c) SaveDataAll must _never_ be included within a TI process containing other commands. If you must run SaveDataAll, keep it in a chore on its own and use a file semaphore to control when it runs.
Now, I have seen PerfMon threads start, fail to get a lock and stop, and the same with the overnight SaveDataAll. My belief is that they fail silently and don't affect the long-running chore - certainly the lockups are happening several hours later than the overnight SaveDataAll. We have disabled the overnight SaveDataAll, but to stop PerfMon running we'd need to edit the CFG file, because of the restarts.
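If we do go down that road, it should only mean one line in tm1s.cfg (assuming the standard parameter) so the monitor no longer starts automatically each time the instance comes back up:

PerformanceMonitorOn=F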
I myself cannot see why, in a system where no other processing is taking place, a SaveDataAll cannot be placed at the end of a batch process. I'm running some tests to count up how many data values are being frozen, because I don't think IBM have a picture of just how much data would need to be written to, and then committed from, the log file if we used cube logging. I'm thinking in the tens, maybe hundreds, of millions of values.
If I have a suspicion, it is that the static cubes to which we write are very, very large and may have become too big for the SaveDataAll to complete consistently, or that possibly there is some corruption in the .cub files - these are 2.5 GB for one and 5 GB for the other. In PerfMon, the memory footprints (input values only) are 8.5 GB and 18.6 GB respectively.
I'd appreciate comments from fellow forumers on IBM's response, and on what sensibly can be done to deal with this issue.
I don't have an answer for you, but a comment on the cube logging. I did a test a few years ago on some very large data loads, with cube logging turned off and on, and found enabling cube logging increased the process run times to somewhere between 2 and 3 times as long. That can be a show stopper when you have really large data loads. I believe IBM is recommending the cube logging option because the "re-process" feature in the server startup is extremely fast, faster than any process I've ever run. I'm guessing they were able to configure it to run so fast because that's all the server needs to focus its resources on, instead of worrying about client interaction. Still, it would mean you are actually running everything twice. That's hardly a good option.
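For completeness, one way to narrow that cost is the per-cube switch, the CubeSetLogChanges TI function, leaving server logging on but suspending it just for the big target cube; a minimal sketch (the cube name is illustrative):

# Prolog of the load process
CubeSetLogChanges ( 'StaticResults', 0 );   # suspend transaction logging for this cube
# ... heavy load happens in the data tab ...
# Epilog
CubeSetLogChanges ( 'StaticResults', 1 );   # re-enable logging once the load is committed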
I have to say I've never seen anything like a model with 60-120 hour batched calculation runs; you're probably the only one who has. But in terms of the end result of pre-calculated data cubes of 2.5 and 5 GB, that is pretty run-of-the-mill and I can't imagine that causing any issues in terms of being able to write to disk.