Server Lockup

hbell
Posts: 61
Joined: Wed Feb 25, 2009 6:15 pm
Version: 9.1 SP3
Excel Version: 11.8

Server Lockup

Post by hbell »

Looking for any hints on "usual suspects" for a server locking up. We are running 64-bit 9.1 SP3 with Windows. 32GB of RAM and 4 dual core AMD processors. We are accessing via Citrix. We have set ReceiveProgressResponse = 20.

The server periodically goes into a complete lockup. There is no obvious common catalyst that we can detect - though it is fair to point out that it is a busy server with 3 or 4 administrators often working simultaneously (different cubes) making metadata changes and running processes.

We are not allowed direct access to the box but remote monitoring software shows memory at only 10% or so usage. Processor is more or less idle. So no indication that the server is under any kind of stress, but nobody can do anything with it. We are not even able to stop the service. We have to get our IT group to crash it and restart. The message logs do not show anything unusual - though on the last occasion we had an "SSL Write Error" (as opposed to the very common - and apparently harmless "SSL Read Error").

Not sure where else we could look for clues. I saw an old post on the forum suggesting that the ReceiveProgressResponse might not work with Citrix. Not sure if there is any update on that?

thanks ............hugh
David Usherwood
Site Admin
Posts: 1458
Joined: Wed May 28, 2008 9:09 am

Re: Server Lockup

Post by David Usherwood »

I can suggest one. Roll back to 90SP3U9 or similar, do a stress test, see what you find. I'm utterly unconvinced by the new locking model in 9.1 and 9.4 being any use unless you have lots of simultaneous data writers _and_ you (re) architect your model so they dump their data into unruled microcubes.
George Regateiro
MVP
Posts: 326
Joined: Fri May 16, 2008 3:35 pm
OLAP Product: TM1
Version: 10.1.1
Excel Version: 2007 SP3
Location: Tampa FL USA

Re: Server Lockup

Post by George Regateiro »

We went through something similar on 9.1 SP2 U3. In that release the locking model still had some issues and there were times that processes would crash the server. In our case since we did NOT use Citrix they had us remove the ReceiveProgressResponse setting.

From Support
I also pounded out ReceiveProgressResponseTimeoutSecs=20 since you don't use Citrix, as that is the primary usefulness of that parameter. Instead of that we are finding that using ProgressMessage=F improves performance and it is being set as the default in the newer releases. It does eliminate the ability to cancel a view but it usually improves performance enough that there is no need to cancel it.
These steps seemed to help slightly, but what we found to be the biggest problem was interaction between processes. It turned out that there was a pattern: when two processes were run at about the same time and were both dynamically trying to delete and rebuild the same temp subset, the server would crash. Since you mentioned that there are 4 of you running processes, you might want to see if any of the processes are trying to manipulate the same objects.
Martin Ryan
Site Admin
Posts: 1989
Joined: Sat May 10, 2008 9:08 am
OLAP Product: TM1
Version: 10.1
Excel Version: 2010
Location: Wellington, New Zealand

Re: Server Lockup

Post by Martin Ryan »

We've had some problems with processes locking things and tripping over each other. The work around suggested to us was to create a "zConcurrency" cube, which each process would write to in the prolog. This would lock that cube and chain it to that process until it released it at the end of the epilog. This means that Process B - that also has a write to the zConcurrency cube in its prolog - has to wait for process A to release its lock.

This effectively disables the multi thread ability, but then a stable server is just slightly more important.
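The effect of the workaround can be sketched outside TM1. The following is only a Python analogy, not TurboIntegrator code: a `threading.Lock` stands in for the lock a process is said to take by writing to the "zConcurrency" cube, and the names `z_concurrency`, `run_process` and `run_concurrently` are made up for illustration.

```python
import threading
import time

# Hypothetical stand-in for the lock a process takes by writing to the
# "zConcurrency" cube in its prolog. This is an analogy, not TM1 code.
z_concurrency = threading.Lock()
events = []  # ordered record of (process_name, phase)


def run_process(name, work_seconds=0.05):
    # "Prolog": block here until no other process holds the lock.
    with z_concurrency:
        events.append((name, "start"))
        time.sleep(work_seconds)  # stands in for the real data/metadata work
        events.append((name, "end"))
    # Leaving the with-block models the release at the end of the epilog.


def run_concurrently(names):
    # Start every process at about the same time and wait for all of them.
    events.clear()
    threads = [threading.Thread(target=run_process, args=(n,)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(events)


if __name__ == "__main__":
    for name, phase in run_concurrently(["Process A", "Process B"]):
        print(name, phase)
```

Whichever process gets the lock first runs to completion before the other starts, so the log never shows two processes interleaved. That is the trade-off described above: stability at the cost of running one process at a time.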

Martin
Please do not send technical questions via private message or email. Post them in the forum where you'll probably get a faster reply, and everyone can benefit from the answers.
paulsimon
MVP
Posts: 808
Joined: Sat Sep 03, 2011 11:10 pm
OLAP Product: TM1
Version: PA 2.0.5
Excel Version: 2016

Re: Server Lockup

Post by paulsimon »

Hugh

We were also on 64 bit 9.1.3. We had problems with the Server intermittently losing the ability to resolve Aliases to Elements, in Excel. As the problem was server side, we upgraded our Server to 9.1.4 but our Client Software is still on 9.1.3. That appears to have cured that particular problem.

We also have intermittent problems with the server locking up. The concurrency control that Martin suggested will certainly help as clashes between two chores running at the same time can cause crashes or lock ups. We have only had problems where the two chores are accessing the same cubes or cubes that share dimensions which the chores try to update. For the most part we avoid these by scheduling.

However, I am not convinced that concurrent processes are the only cause of the problem. Even when only one Chore is running, we have still had server lock ups. The lock seems to be caused by someone creating an object like a View or Subset, or even doing something like Spreading.

When I look at TM1 Top it seems as though the Chore that is running is just running forever, and the task locking the object is in a Wait state. It appears that the Chore is in a busy loop waiting to get at the object that the user is trying to save, and the user task is in a Wait state, waiting on the Chore - known in IT as a deadly embrace. We find that we cannot kill the chore or the user task in TM1 Top. Stopping the Service doesn't stop it. No one can log in. We have to get IT to terminate the TM1SD.EXE before we can start the server.
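A "deadly embrace" of this kind is a cycle in who-waits-for-whom. As a rough illustration only (this is not something TM1 Top does, and the task names are invented), a small Python sketch that finds such a cycle from a hand-recorded list of waits might look like this:

```python
def find_wait_cycle(waits_on):
    """Return the tasks forming a wait cycle, or None if there is none.

    waits_on maps each task name to the task it is waiting for
    (None if it is not waiting on anything).
    """
    for start in waits_on:
        seen = []
        task = start
        # Follow the chain of waits until it ends or revisits a task.
        while task is not None and task not in seen:
            seen.append(task)
            task = waits_on.get(task)
        if task is not None:
            # The chain came back to a task already visited: a cycle.
            return seen[seen.index(task):]
    return None


if __name__ == "__main__":
    # Hypothetical snapshot: the chore waits on the object the user holds,
    # while the user's save waits on the chore.
    print(find_wait_cycle({"chore": "user save", "user save": "chore"}))
```

If the chain of waits ever leads back to a task already seen, neither side can proceed without outside intervention, which matches the symptom of having to terminate the server process.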

We only have two developers on the server, and usually only one. I can certainly say that the instability tends to coincide with development work. However, one of the main benefits of TM1 is the ability to, for example, change dimension hierarchy without having to bring down the server for batch recalcs etc as you would have to with systems like SQL Server Analysis Services and Essbase. If we are confined to doing this out of hours when no one is using the server, because of the risk of a change crashing the server then TM1 loses one of its selling points.

I recently had a case where someone deleted an element from a dimension that was referenced in a rule. There were no complaints from TM1 until the regular update ran which updates the dimension to add new elements. As the dimension was saved a TM1ProcessError file popped up to say that it could not save the dimension because the rule was invalid (I would have expected it to have saved the dim anyway and just invalidate the rules). However, immediately after that, the server also crashed.

We have been working with Cognos for some time to try to investigate the cause of this. We have added in debugging switches. Unfortunately this renders the TM1Server.log useless for normal use as it gets clogged with debug messages. We have changed our Dr Watson settings to try to capture the crash. Unfortunately, so far IT haven't successfully captured a dump from a crash. Apparently they need this together with the debug messages in the server log to investigate the crash. Unfortunately, it appears that in the lock up situation, no dump is produced anyway.

We are considering upgrading to 9.4. It does appear that there have been some improvements to the locking mechanism in that.

The idea of migrating back to 9.0 does not appeal as this only had the old server level locking mechanism, and users want to be able to continue reporting from our reporting cube, while users around the business enter forecasts in to our forecasting cube.

I joined shortly after the upgrade to 9.1, but from what I can gather the locking issue in 9.0 made forecasting rather difficult for both the forecasters and the reporters.

If you do get any answers from Cognos, please let me know.

I was surprised that Cognos did not mention commenting out the:

ReceiveProgressResponseTimeoutSecs=30

As we have a mix of direct client and Terminal Services users, which I guess would be similar to Citrix. I think that the issue is more the slow network rather than the use of screen remoting itself.

We already have the:

ProgressMessage=F

From what I understand, this relates to an older method of progress messages, and should be turned off in 9.1.3 and after. In theory this is done as part of the installation, but most people will copy the tm1s.cfg from their old version so they may not get this. From my reading of the manual, this does not mean that you cannot cancel a view, as the ReceiveProgressResponse method is an alternative to this. Apparently in 9.4 even this is replaced by ClientMessagePortNumber.
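Because a tm1s.cfg copied from an older version may not carry these settings, it can be worth checking the file mechanically. Below is a minimal Python sketch under some assumptions: the file uses `name=value` lines, `#` marks a commented-out line, and the sample contents (including the server name) are invented. The function names are hypothetical.

```python
def read_cfg_params(text):
    """Return {lower-cased name: value} for the active name=value lines."""
    params = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, section headers such as [TM1S], and '#' comments.
        if not line or line.startswith("#") or line.startswith("["):
            continue
        if "=" in line:
            name, value = line.split("=", 1)
            params[name.strip().lower()] = value.strip()
    return params


def review(text):
    """List the two settings discussed in this thread that deserve a look."""
    params = read_cfg_params(text)
    notes = []
    if params.get("progressmessage", "").upper() != "F":
        notes.append("ProgressMessage is not set to F")
    timeout = params.get("receiveprogressresponsetimeoutsecs")
    if timeout is not None:
        notes.append("ReceiveProgressResponseTimeoutSecs=" + timeout)
    return notes


# Invented sample contents, for illustration only.
SAMPLE = """[TM1S]
ServerName=planning
ReceiveProgressResponseTimeoutSecs=30
#ProgressMessage=F
"""

if __name__ == "__main__":
    for note in review(SAMPLE):
        print(note)
```

The script only reports what is in the file. Whether either setting should be changed depends on the TM1 release and on advice from support.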

By the way, do you get ViewArrayOutOfDateWithDimension errors leading to crashes in the client? This seems to happen when we update a dimension while someone has a View open on the cube.

Regards


Paul Simon
Alan Kirk
Site Admin
Posts: 6645
Joined: Sun May 11, 2008 2:30 am
OLAP Product: TM1
Version: PA2.0.9.18 Classic NO PAW!
Excel Version: 2013 and Office 365
Location: Sydney, Australia

Re: Server Lockup

Post by Alan Kirk »

PaulSimon wrote:We already have the:

ProgressMessage=F

From what I understand, this relates to an older method of progress messages, and should be turned off in 9.1.3 and after. In theory this is done as part of the installation, but most people will copy the tm1s.cfg from their old version so they may not get this.
Anyone on older releases needs to be aware that the ProgressMessage parameter may not even exist for them/us.

According to the 9.1 SP3 release notes:
ProgressMessage Server Configuration Parameter

The ProgressMessage server configuration parameter is now set to F in the Tm1s.cgf file for 9.1 SP2 (sic) TM1 servers. This is a change from previous versions, where the ProgressMessage parameter was not included in the standard Tm1s.cgf file created during installation.

The ProgressMessage server configuration parameter is described in the TM1 Operations Guide.
Oh yeah? Not in the original release 9.1 Operations Guide (print date 02/2007), which I have a copy of in front of me, it ain't.

I can't find any reference to it in any of the Operations (or other) .pdf manuals prior to 9.4's; not even the original one relating to 9.1 as noted above, though it's in the on-line version of Help that I have installed with 9.1 SP4.

I tried this parameter in an 8.2.12 session, and it did nothing. That's unfortunate, because often, not always but often enough, when that accursed dialog box appears it doesn't really mean "Processing, click this button to cancel". It actually means "Your client session is now screwed. I suggest that you save and close all of your documents in Excel because the only choice you've now got is to go to Task manager and kill this process." :evil:
"To them, equipment failure is terrifying. To me, it’s 'Tuesday.' "
-----------
Before posting, please check the documentation, the FAQ, the Search function and FOR THE LOVE OF GLUB the Request Guidelines.
hbell
Posts: 61
Joined: Wed Feb 25, 2009 6:15 pm
Version: 9.1 SP3
Excel Version: 11.8

Re: Server Lockup

Post by hbell »

Thanks everyone for the responses. The general impression I get is one of "commiseration" rather than any likely solutions. :(

I had spotted the postings about clashing Chores. We don't really make a lot of use of those. So I'm confident that is not the issue here. I'm also reasonably confident that it would not be a clash of processes accessing the same metadata as we are pretty scrupulous about everyone using their own and not sharing.

There have certainly been "user error factors" in the run-up to one or two of the lockup instances. The last one, for example, occurred after someone had accidentally put in a heroic overfeed. CPU usage flat-lined at 12.5% (one of the 8 cores flat out) for 15 minutes saving the rule. However, when the rule finished saving, everyone was still locked up for a further hour before the plug was pulled.

We have another production server where we do very little metadata management as it is largely a read-only reporting environment. We have yet to see any lockups on that box. That lends credence to the view that this is somehow related to metadata changes. It would just be nice to know what behaviours to avoid. I agree with Paul that TM1's robust attitude to changes on the fly has been one of its key points of differentiation. Sad for IBM and us if it is gradually falling back to parity with the rest of the field.

hugh
lotsaram
MVP
Posts: 3698
Joined: Fri Mar 13, 2009 11:14 am
OLAP Product: TableManager1
Version: PA 2.0.x
Excel Version: Office 365
Location: Switzerland

Re: Server Lockup

Post by lotsaram »

Martin Ryan wrote:We've had some problems with processes locking things and tripping over each other. The work around suggested to us was to create a "zConcurrency" cube, which each process would write to in the prolog. This would lock that cube and chain it to that process until it released it at the end of the epilog. This means that Process B - that also has a write to the zConcurrency cube in its prolog - has to wait for process A to release its lock.
This method works for stopping "process clash" crashes in 9.1.3 (I think the issue is fixed in later releases). Beware that you can't write just any value, you have to write RAND. If you try to write the same value that is already in the cell then no data change is registered and no lock is applied.
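The reason for writing a random value can be shown with a toy model. This Python sketch is not TM1 internals; it simply encodes the behaviour described in the post as an assumption (an unchanged value registers no data change, so no lock is taken), and the class and process names are invented.

```python
import random


class ToyCell:
    """Toy model of the behaviour described in the post; not TM1 internals."""

    def __init__(self, value=0.0):
        self.value = value
        self.locked_by = None

    def write(self, process, value):
        # Assumption taken from the post: writing the value that is already
        # in the cell registers no data change, so no lock is applied.
        if value == self.value:
            return False
        self.value = value
        self.locked_by = process
        return True


if __name__ == "__main__":
    cell = ToyCell(1.0)
    print(cell.write("Process A", 1.0))              # same value, no lock
    print(cell.write("Process A", random.random()))  # new value, lock taken
```

`random.random()` returns a value below 1.0, so in this example the second write always differs from the stored 1.0. In general a random value makes it very unlikely that a process writes what is already in the cell.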
cshields
Posts: 1
Joined: Sun Sep 06, 2009 4:39 pm
OLAP Product: TM1
Version: 9.4.1
Excel Version: 2007

Re: Server Lockup

Post by cshields »

Hi all, don't know if this is still an issue for anyone and i hope i'm not being redundant but if the service locks and you can only kill it through TaskManager I am guessing that your server is using DHCP.
The reason you're locked is because the tm1admsd has lost sight of the tm1sd - and since the heartbeat function doesn't really seem to work, once lost it'll never return.
Make sure that your server is using static IP only. (ipconfig from the server's cmd prompt will tell you if DHCP is enabled)
If you're being hosted and they tell you you have static but it still happens, if you do an ipconfig /all and see anything about "leases" listed, you're on dynamic and need to change.
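If someone with access to the box can send back the `ipconfig /all` text, the check described above can be scripted. The Python sketch below is one possible approach. The function name is hypothetical and the sample output is invented (the lease dates in particular); real output varies by Windows version and language.

```python
import re


def dhcp_report(ipconfig_text):
    """Summarise DHCP hints found in the text of `ipconfig /all`."""
    enabled = []
    lease_lines = 0
    for line in ipconfig_text.splitlines():
        match = re.match(r"\s*DHCP Enabled[ .]*:\s*(\w+)", line, re.IGNORECASE)
        if match:
            enabled.append(match.group(1).lower() == "yes")
        elif re.match(r"\s*Lease (Obtained|Expires)", line, re.IGNORECASE):
            lease_lines += 1
    # Any adapter with DHCP enabled, or any lease line, points to a dynamic address.
    return {"dhcp_enabled": any(enabled), "lease_lines": lease_lines}


# Invented sample output, for illustration only.
SAMPLE_DYNAMIC = """
   DHCP Enabled. . . . . . . . . . . : Yes
   Lease Obtained. . . . . . . . . . : Monday, 7 September 2009 08:00:00
   Lease Expires . . . . . . . . . . : Tuesday, 8 September 2009 08:00:00
"""

if __name__ == "__main__":
    print(dhcp_report(SAMPLE_DYNAMIC))
```

On the server itself the text could be captured with `subprocess.run(["ipconfig", "/all"], capture_output=True, text=True).stdout` and passed to the same function.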
again, hope i'm not answering something that's been taken care of months ago.
hbell
Posts: 61
Joined: Wed Feb 25, 2009 6:15 pm
Version: 9.1 SP3
Excel Version: 11.8

Re: Server Lockup

Post by hbell »

... thanks for that interesting post. The problem is definitely still alive and kicking - so all suggestions welcome. Must confess it is Greek to me ... but I will run it past our IT folks and see. I rather anticipate that, if it proves to be the case, it will turn out to be a matter of "policy" that we cannot change (it usually is :roll: ). But worth a try ...

I think I gathered from what you said that we are talking about the IP address of the server? If I don't have access to the box itself, how do I run your IPConfig test?

thanks ......hugh
hbell
Posts: 61
Joined: Wed Feb 25, 2009 6:15 pm
Version: 9.1 SP3
Excel Version: 11.8

Re: Server Lockup

Post by hbell »

cshields

.. I've looked at our Dev environment (where I'm able to go into the Desktop and run an IPConfig). I'm getting a return which (among other things) shows me "DHCP Enabled ................: No". Is that (as it sounds) conclusive proof that DHCP is not our problem?

thanks ............hugh
paulsimon
MVP
Posts: 808
Joined: Sat Sep 03, 2011 11:10 pm
OLAP Product: TM1
Version: PA 2.0.5
Excel Version: 2016

Re: Server Lockup

Post by paulsimon »

Hugh

We are on almost the same version as you - we are on 9.1.4.

A few months ago, we had a lot of issues with server lockup and crashes.

I put concurrency control on just about every chore (I just add a process that writes a Random value to a cube at the start of the chore).

That seems to have helped.

We are now a week in to our forecasting cycle and we have had no lockups so far despite the fact that the forecasting allows users to run processes for submissions, and we are regularly updating security and sometimes updating dimensions. There are also hourly extracts from the Input Cube to the Main Cube. There are also quite a lot of rule derived values.

So far the server has been remarkably well behaved.

The only thing we had close to lock up today was when a user constructed a view that ran out of memory. No one could do anything until I spotted the fact that everyone else was waiting in TM1 Top and he had the only running process. I asked him to click OK to acknowledge the message about view array out of memory and then everything sprang back in to life.

I think we still have the Response Time out setting in our CFG. It doesn't seem to make much difference. By the way half our users use Terminal Services which I would guess is similar to Citrix.

The Terminal Services solution is working well for us. There have been no complaints about performance.

One other thing that can cause lock ups is de-activating a chore while it is running, so be careful before doing that, particularly if you have a frequent refresh of actuals or something like that.

Regards


Paul Simon
hbell
Posts: 61
Joined: Wed Feb 25, 2009 6:15 pm
Version: 9.1 SP3
Excel Version: 11.8

Re: Server Lockup

Post by hbell »

Paul

... thanks for the tip. We really don't use Chores much. Does the same apply to TI Processes themselves? Does your concurrency control refuse permission to start a process until the previous one has "checked out"?

thanks ........hugh
Martin Ryan
Site Admin
Posts: 1989
Joined: Sat May 10, 2008 9:08 am
OLAP Product: TM1
Version: 10.1
Excel Version: 2010
Location: Wellington, New Zealand

Re: Server Lockup

Post by Martin Ryan »

We use the same method for the same reason and have the same success.

The TI process writes to a 'zConcurrency' cube, which then causes that cube to be locked to that TI process until the TI process finishes and releases its lock. As a result, no other process can kick off until that one finishes. Works very well, and yes, not just required for chores. Manual running can cause the same issues without the use of this workaround.

Cheers,
Martin
paulsimon
MVP
Posts: 808
Joined: Sat Sep 03, 2011 11:10 pm
OLAP Product: TM1
Version: PA 2.0.5
Excel Version: 2016

Re: Server Lockup

Post by paulsimon »

Hugh

As Martin says, this works just as well if you include the concurrency control code in the prolog of a TI Process.

The only reason that we went for the Chore option, is that, when we eventually migrate to 9.4 which does a better job of handling concurrency, we can just remove this concurrency control process from the chore, or just leave it there and change the code to a do nothing. In that way we can easily drop the concurrency control in one go, rather than having to edit it out of every process. However, I guess that you could get the same effect by putting a flag in an info cube, and a bit in the prolog saying IF(CellGetS('Info Cube',..,'Concurrency Control') @= 'Y'); ... Put random value in to Concurrency Control Cube.

In practice the concurrency control approach we have used works very well. However, at the time that we implemented it we were not sure that it would cope with every eventuality, so putting it all in to one process that we re-used in multiple chores meant that we only needed to make changes in just one place if it didn't work.

The only downside of using Chores is that you have to specify any parameters in the Chore and you don't get prompted at run time. That is why we have a few processes with parameters where we do put the concurrency control code in the prolog.

Regards

Paul Simon
highlnder8
Posts: 20
Joined: Tue Aug 04, 2009 6:14 pm
OLAP Product: TM1
Version: 9.4.1
Excel Version: Excel 2007

Re: Server Lockup

Post by highlnder8 »

We are having a similar, but possibly slightly different issue here. We are on 9.4 HF3, running on a 64-bit server with four dual core Intel Processors and 64Gig of RAM.

Whenever someone runs a chore the whole system bogs down and nobody can do much of anything until the process is completed. CPU utilization rarely tops 18% (of total CPU processing power available) and RAM rarely tops 40GB. So I'm trying to figure out why the slow down. I'm not hopeful that IBM has an answer to the question. I'd understand our users having a hard time if CPU utilization was at 80% for all 8 processors, but this is ridiculous.

We are enabling Multithreading (which was missing from our TM1s.cfg file) to speed up our loads after shutdown/restart. However, this doesn't seem to affect our TI Procedures.

And it's not an issue of conflicting jobs - we only have 6 TI processes that are run on separate days of the month.

Thanks!