Threads disappear without a trace from message log
Posted: Thu Dec 08, 2016 9:46 am
I have a rather curious phenomenon. We have a model that loads a lot of data on a lot of threads via tm1runti. Sometimes a thread just "disappears": you see it appear in tm1top with function ProcessExecuteEx, and there is a corresponding entry in tm1server.log showing that the thread logged in and called a TI process, but when the thread drops out of tm1top there is no matching process-complete or logout line in the message log, and no process error file in the log directory either. It is as if the thread vanished without a trace.
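For context, the loads are kicked off along the lines of the sketch below. All the names, paths, credentials and the parameter are placeholders, not our real setup, and the exact tm1runti switches may differ slightly in your version, but the shape is the same: each call becomes its own ProcessExecuteEx thread on the server.

[code]
import subprocess
from concurrent.futures import ThreadPoolExecutor

TM1RUNTI = r"C:\Program Files\ibm\cognos\tm1_64\bin64\tm1runti.exe"  # illustrative path

def run_load(chunk_id):
    # Each invocation shows up on the server as a separate ProcessExecuteEx thread.
    cmd = [
        TM1RUNTI,
        "-server", "MyTM1Server",       # placeholder server name
        "-adminhost", "tm1adminhost",   # placeholder admin host
        "-user", "loaduser",            # placeholder credentials
        "-pswd", "secret",
        "-process", "Load.Fact.Data",   # placeholder TI process name
        "pChunk=%d" % chunk_id,         # hypothetical parameter splitting the source data
    ]
    return subprocess.run(cmd, capture_output=True, text=True)

# The external scheduler effectively does this: fire the load in parallel chunks.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(run_load, range(1, 33)))
[/code]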
So why is this even an issue?
There is a very large volume of data to process in a very limited timeframe (many, many millions of records within a window of a few minutes), so an optimized load via tm1runti is critical. The load is triggered externally and managed end to end, since there are several pre-processing steps in the source system and DWH layers before the various steps within TM1. The scheduling tool watches tm1server.log (among other things) to coordinate the end-to-end data flow and trigger the next step once the preceding job has finished. If a process it is waiting on just "disappears", the data flow stalls and the next step is never triggered, and manual intervention is needed to complete the processing. The data flow needs to be completely automated. That, in a nutshell, is why this is an issue, and a critical one.
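To make the failure mode concrete, after each tm1runti call the scheduler effectively does something like the following. The "finished" pattern is an assumption about what the completion entry looks like in tm1server.log (check your own log for the exact wording), and the timeout is just a sketch of where we currently end up having to intervene manually.

[code]
import re
import time

LOG_PATH = r"\\tm1server\logs\tm1server.log"   # illustrative path
# Assumed shape of a completion entry; adjust to whatever your tm1server.log actually writes.
FINISHED = re.compile(r'Process "Load\.Fact\.Data".*finished', re.IGNORECASE)

def wait_for_completion(timeout_s=900, poll_s=10):
    """Poll tm1server.log until the completion line appears, or give up."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        with open(LOG_PATH, encoding="utf-8", errors="replace") as f:
            if any(FINISHED.search(line) for line in f):
                return True
        time.sleep(poll_s)
    # This is the stall: the thread is gone from tm1top, but the completion
    # line never shows up, so the next job is never triggered.
    return False

if not wait_for_completion():
    raise RuntimeError("Load.Fact.Data never logged completion - manual intervention needed")
[/code]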
It seems random which processes/threads are affected, though it does seem to happen to some more often than others. My unfounded assumption is that it is a bug or deficiency in tm1server.log: the timestamp only has millisecond granularity, so if more than one thread wants to write a message in the same millisecond it is first come, first served, and the second and subsequent lines with the same millisecond timestamp simply don't get logged. Has anyone else observed this? Can anyone confirm it?
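One way I'm trying to test the theory is to count how many entries in the log share the exact same millisecond timestamp, roughly as sketched below. The regex assumes the usual yyyy-mm-dd hh:mm:ss.mmm timestamp format in tm1server.log. If the "one line per millisecond" idea were right, you would expect never to find two lines with the same timestamp; if such collisions do show up elsewhere in the log, that would argue against it.

[code]
import re
from collections import Counter

TS = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}")  # assumed timestamp format

with open(r"\\tm1server\logs\tm1server.log", encoding="utf-8", errors="replace") as f:
    stamps = Counter(m.group(0) for line in f for m in [TS.search(line)] if m)

# Timestamps carried by more than one log line, i.e. same-millisecond collisions.
collisions = {ts: n for ts, n in stamps.items() if n > 1}
print(len(collisions), "timestamps are shared by more than one log line")
[/code]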