
Problem description
Since upgrading from 9.0 to 9.5.2. we have found our servers crashing
while replicating data at what seems to be random times. An issue i
stumbled upon during testing appears to effect more than just what i
discovered at the time, but due to it's nature is impossible for me to
replicate in a testing environment.
Out of a month of log files, about 4 of them had a corrupted 1st data
record. Instead of beginning "", like all the other lines, it read ",
(one quote instead of two). As replication uses log files to determine
what data to copy, i believe the way it parses the logs is unable to
cope with the malformed log entry and causes the server to crash.
During testing i found issues with the transaction log viewer where
data i knew was logged did not appear in the viewer. I eventually
traced it to the malformed logs, manually corrected them using notepad
and the viewer could then pick them up. Time was tight and as this did
not crash a server and was a minor issue to a few admins it was not
logged. Now it appears the same issue also hurts replication (was
unable to test replication due to lack of servers & time) which makes
it a much more serious issue.
Business impact
Without manually checking every single log file there is no way to know
or predict if a log is OK or not. We only find out when a replication
is run and a server crashes.