
Preventing data loss in case of a catastrophic failure

Posted: Mon Oct 19, 2015 3:48 pm
by Jeroen_Eynikel
Hi,

We are trying to change our setup to deal with a catastrophic failure on one of our servers, and we are running into a lot of unexpected difficulties working something out with the storage and infrastructure team.

Our basic goal is simply to ensure that if there is a catastrophic failure on the production server, we can bring up a backup server running the database in a minimal amount of time (a few hours at most) but with absolutely no data loss. Avoiding data loss is much more important than whether it takes us 5 minutes or a few hours to recover everything.

Fine, so the first thing we thought of doing was storing the data on the SAN instead of on the local hard drives, since theoretically at least the SAN is secure. The big issue we have there is that apparently the SAN storage associated with this server is exclusively locked to it. So yes, if that server goes down we would not lose any data, but it seems we would have to wait for whatever issue happened to that server itself to be fixed, which could easily take a few days if we are really unlucky.

So now we have some sort of experimental setup in which a second server can actually see the SAN drives associated with the first one. The idea is that we would disable these drives on the second server until there is an actual need to access the data on them, then reboot that second server and they would magically appear. Keeping them enabled on both servers is a no-no, as apparently there is some sort of exclusive signature involved. In any case, we can tell it does not work as-is, because depending on which server I am looking from I see different contents for the same SAN storage. (IT explained this to me as the SAN being locked by server 1, so server 2 currently does not see the contents.)

The idea is that if server 1 goes down, server 2 can be rebooted and then we would see the contents.

Now I have several issues:

* First off, I do not feel particularly confident in this setup, as I am not so sure whether any locks put by server 1 on the SAN would be properly released in the case of a server crash. I can imagine we would reboot server 2 and still see an empty disk, for instance.

* I cannot imagine there is no better solution than this. How is this done at other sites? Just keep in mind that (a) we want to deal with catastrophic failures, (b) we need to be absolutely sure not to lose any data (so the daily backups are not good enough), and (c) a schedule where we would run a batch every few minutes to copy and zip the data directory may be unworkable as well, mostly because I think the SaveData operations might cause too many performance hiccups.

Anyone have any suggestions? Keep in mind that I am close to illiterate about hardware and storage concepts, so please use basic words. :P

Re: Preventing data loss in case of a catastrophic failure

Posted: Mon Oct 19, 2015 4:50 pm
by jim wood
We were recently at Chase. As you can imagine, after Sept 11 they are very aware of DR. The setup they had was made up of a Prod server, a DR server, a Prod SAN and a DR SAN. On Prod, a data save happens every hour. The production service is then copied to the DR SAN every hour, 15 minutes after the data save. If the Prod server goes down but the Prod SAN is still up, the DR server is started pointing to the Prod SAN. If both the Prod server and the Prod SAN go down, the DR server is brought up pointing to the DR SAN. Whenever the DR server is started because Prod has gone down, they re-point the DNS alias to the DR server. They had a large service, and the whole switch took less than an hour. They tested it every month out of hours to make sure all of their team knew how to restart everything.
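
For illustration only (this is not Chase's actual script; rsync and the paths here are my own placeholder assumptions), the copy step could be as simple as a scheduled job that mirrors the data directory to the DR SAN once the hourly data save has finished:

    # Hypothetical cron entry: at 15 minutes past every hour, mirror the
    # production TM1 data directory to the DR SAN (paths are placeholders).
    # --delete keeps the DR copy an exact mirror of production.
    15 * * * * rsync -a --delete /san/prod/tm1data/ /san/dr/tm1data/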

I hope that helps,

Jim.

Re: Preventing data loss in case of a catastrophic failure

Posted: Mon Oct 19, 2015 4:53 pm
by jim wood
Oh, and make sure you have logging switched on for your vital cubes. If you are able to recover the Prod SAN, you can then load the changes made since the last save into production once it is back up. Obviously you'll need to write a process that can spool through a log file. Also, you need to make sure the logging directory is separate from the data directory.
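
For example, a minimal tm1s.cfg sketch with the logs kept on a different drive from the data (the paths are placeholders for your own setup; per-cube logging itself is switched on via the LOGGING property in the }CubeProperties cube):

    [TM1S]
    # Placeholder paths - the point is that the logs live on separate storage
    DataBaseDirectory=D:\TM1\Prod\Data
    LoggingDirectory=E:\TM1\Prod\Logs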

Jim.

Re: Preventing data loss in case of a catastrophic failure

Posted: Mon Oct 19, 2015 8:54 pm
by David Usherwood
I haven't looked at this for a while, but I could see some mileage in using rsync (https://en.wikipedia.org/wiki/Rsync) to mirror the TM1 server data and log folders to somewhere else. Rsync intelligently compares the contents of folders and transfers only the changes. It would need to kick in after a SaveDataAll, and wouldn't (I think) transfer the open TM1S.LOG, but TM1 doesn't lock anything else so it should work well. You'd need to start the server up after a crash, but that's envisaged anyway.
Might give it another go when I have some time.
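
A minimal sketch of the kind of thing I mean, assuming a Linux-style setup; the paths and the standby hostname are placeholders:

    # Mirror the TM1 data and log folders to a standby machine after a
    # SaveDataAll; skip the transaction log the running server holds open.
    rsync -av --delete --exclude 'tm1s.log' /tm1/prod/data/ standby:/tm1/dr/data/
    rsync -av --delete --exclude 'tm1s.log' /tm1/prod/logs/ standby:/tm1/dr/logs/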