RMS II
RMS Server "Watchdog"
(rev. 30-Jul-04)
Introduction
RMS Server 9.9 introduces a new capability called "Watchdog".
Watchdog continuously watches over the RMS Server and checks
that it is still running and still making forward progress. If
the Mac crashes, or if the Server stops making forward progress, then
Watchdog automatically reboots the Mac. With the Server set as a
Startup Item, the reboot will automatically resurrect the dead or
stalled Server without requiring any human intervention.
With Watchdog, the Server is now "fault tolerant" in that it has a
chance to automatically recover from software bugs and glitches, be
they in the Server itself, or the MacOS, or a third party
component.
Watchdog consists of the following pieces:
- Server 9
- the standard Macintosh debugger "Macsbug", along with its
"Debugger Prefs" file
- custom code built into the RMS Server that tracks and monitors
Server execution
- Server X
- custom code built into the RMS Server that tracks and monitors
Server execution (the code is packaged into a separate application
called "RMS ServerWatchdog"; this application in turn is contained
within the RMS Server application itself and is automatically
launched by RMS Server).
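The idea of code that "tracks and monitors Server execution" can be pictured as a simple progress timer. The following is only an illustrative sketch in Python, not the actual RMS Server code: the class name, the method names, and the reboot callback are all hypothetical, and the timeout is left as a parameter.

```python
import time


class ProgressWatchdog:
    """Tracks the last time the monitored program reported forward progress.

    Hypothetical illustration only; not the RMS Server implementation.
    """

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock
        self._last_progress = clock()

    def note_progress(self):
        """Called by the monitored program each time it completes a unit of work."""
        self._last_progress = self._clock()

    def is_stalled(self):
        """Return True once no progress has been reported for the full timeout."""
        return self._clock() - self._last_progress >= self.timeout_seconds


def check_and_recover(watchdog, reboot):
    """Call the reboot callback if the watchdog considers the program stalled.

    Returns True if a reboot was requested, False otherwise.
    """
    if watchdog.is_stalled():
        reboot()
        return True
    return False
```

In this sketch the monitored program calls `note_progress()` as it works, and a separate periodic check calls `check_and_recover()`. Passing the clock in as a parameter makes the timeout logic easy to exercise without waiting in real time.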
The Different Types of Server Operation Failure
There are three main types of Server operation failure. From least
severe to most severe, they are: Server Hang, Mac Crash, and
Hard Crash. The Watchdog can detect and recover
from the first two types on OS 9, but cannot detect the third type.
On OS X, the Watchdog can only detect and recover from the first
type. Each type of failure is listed below along with a description
of how Watchdog interacts with that type of failure.
- The Server is "hung" or "stuck". In this type of
failure, everything is running OK except that the Server is not
making any forward progress. For example, the Server could be
stuck in an infinite loop, or an alert may be displaying an error
message and no one has clicked "OK", or the Server may have called
upon the OS or a 3rd party component to perform a subtask and it
never finishes.
In this case, Watchdog will notice that no forward progress is
being made. After 20 minutes (10 minutes, starting with RMS Server
version 9.9.1), it will give up and automatically reboot the Mac.
Upon restart, the Server log will contain the line:
=1= Watchdog caused the restart; debugger dump follows:
A debugger dump will follow which tells us:
- When the reboot happened
- Which application was running at the time (most likely 'RMS
Server')
- A trace (recent history) of the last places in the Server
code that were executed
The debugger dump can help the developers pinpoint where in
the code the hang happened, and possibly allow them to change
the code to avoid the same type of hang in the future.
- The Mac crashes. In this type of failure something has
crashed. You'd normally see a Finder alert saying that "the
application has unexpectedly quit, error N", where N is often 2 or
11. Or you'd see the infamous "bomb" alert that lets you restart
the Mac. The crash might be caused by the Server, or the Finder,
or the MacOS, or a 3rd party component, or some combination.
Normally the RMS would just be stuck, but since the debugger is
installed, it is invoked instead (this works on OS 9 only since
there is no Macsbug for OS X). The debugger's special prefs cause
it to automatically create a debugger dump and then reboot the
Mac. Upon restart, the Server log will contain the line:
=1= Watchdog did NOT cause the restart (possibly a crash); debugger dump follows:
A debugger dump will follow which tells us:
- When the reboot happened
- Which application was running at the time (which app
crashed)
- Which CPU instruction was being executed at the time of the
crash
- The fatal error code (e.g. 2 or 11).
Again, the debugger dump may provide the developers with
some clue as to why the crash happened.
- Hard crash. With this type of failure, the Mac has
crashed so badly that there is no way to automatically recover;
a person must reboot the Mac. Examples of this kind of failure are:
- Hard crash - a crash happens and the
debugger is invoked as in the Mac crash case above, but the crash
is so severe that the debugger itself cannot function.
- Deep freeze - the Mac is "frozen" because it is
stuck in an infinite loop executing some high priority system
code; although the Watchdog runs in a "preemptive multitasking"
mode, the system code is such a high priority that the Watchdog
never gets a chance to run and thus cannot reboot the Mac.
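Because the two kinds of recoverable failure leave different lines in the Server log, a log can be scanned to see which kind of restart happened. The sketch below is a hypothetical helper in Python, not part of RMS. It assumes only that the two log lines quoted above appear somewhere within a line of the log; nothing else about the log format is assumed.

```python
# Marker text taken from the two log lines quoted in this document.
WATCHDOG_RESTART = "=1= Watchdog caused the restart"
OTHER_RESTART = "=1= Watchdog did NOT cause the restart"


def classify_restarts(log_lines):
    """Return one label per restart entry found, in log order.

    "hang"  -> Watchdog rebooted the Mac after a lack of forward progress.
    "crash" -> the restart was not caused by Watchdog (possibly a crash).
    """
    causes = []
    for line in log_lines:
        if WATCHDOG_RESTART in line:
            causes.append("hang")
        elif OTHER_RESTART in line:
            causes.append("crash")
    return causes
```

A hard crash leaves neither line behind, so it never shows up in the result of such a scan.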
Installing Watchdog
To install Watchdog on OS X, you need to:
- Use RMS Server X version 2.2 or later, and RMS
ServerConfig X version 2.1 or later
- Use a Server config document with the Watchdog option turned
ON (create the config document with RMS ServerConfig X v2.1 or
later)
To install Watchdog on OS 9, you need to:
- Use RMS Server 9 version 2.0 or later, and RMS
ServerConfig 9 version 1.4 or later
- Use a Server config document with the Watchdog option turned
ON (create the config document with RMS ServerConfig 9 v1.4 or
later)
- Place the two files "Macsbug" and "Debugger Prefs" into the
System Folder and reboot the Mac
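The minimum versions listed above can be written down as a small lookup table. The following Python sketch is a hypothetical pre-installation check, not an RMS tool: the function names and the shape of the `installed` dictionary are assumptions made for illustration. It covers only the version requirements; the Macsbug files and the Watchdog config option must still be checked by hand.

```python
# Minimum versions as stated in this document, keyed by platform.
MINIMUM_VERSIONS = {
    "OS X": {"RMS Server": (2, 2), "RMS ServerConfig": (2, 1)},
    "OS 9": {"RMS Server": (2, 0), "RMS ServerConfig": (1, 4)},
}


def parse_version(text):
    """Turn a dotted version string such as "2.1" into a comparable tuple."""
    parts = [int(piece) for piece in text.split(".")]
    while len(parts) < 2:  # treat "2" the same as "2.0"
        parts.append(0)
    return tuple(parts)


def components_below_minimum(platform, installed):
    """Return the components that are missing or older than the minimum.

    `installed` maps a component name to its version string.
    """
    too_old = []
    for component, minimum in MINIMUM_VERSIONS[platform].items():
        version = installed.get(component)
        if version is None or parse_version(version) < minimum:
            too_old.append(component)
    return too_old
```

An empty result means both applications are recent enough for Watchdog on that platform.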
The steps to install Watchdog remotely are:
- Put the two files "Macsbug" and "Debugger Prefs" into a folder
and e-mail the folder to the RMS. Be sure to set the subject of
the e-mail message to be "RMS".
- Once these two files have been e-mailed to the RMS, have the
caretaker do the following:
- Quit the RMS Server.
- Drag both files, "Macsbug" and "Debugger Prefs", into the
System Folder.
- Restart the Mac.
- After the restart, Macsbug will be installed and the RMS
Server should automatically start up (given there's an alias to
the Server in the "Startup Items" folder).
- Once the caretaker has completed the steps above, e-mail the new
RMS Server v2.0 to the RMS. As usual, the new Server will be
activated the next time the RMS Server is restarted.
- Once RMS Server v2.0 is running on the RMS (you can confirm
this by looking in the Server log file for the log entry
"=1= RMS Server STARTUP (OS 9.2.2; RMS 2.0)"), e-mail
the new Server config document to the RMS Server. Be sure to turn
the Watchdog option ON in the Server config document.
- When the v2.0 RMS Server receives the new config file, it will
automatically restart the RMS Server application. Once restarted,
the Watchdog functionality in the RMS Server will be
operational.
- Lastly, e-mail the new RMS ServerConfig v1.4 to the RMS so
that the caretaker can open the config document if necessary.
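The log check used in the steps above to confirm which Server version is running can be sketched as a short Python function. This is a hypothetical helper, not part of RMS. It assumes the startup entry always has the shape of the one example quoted above, "=1= RMS Server STARTUP (OS ...; RMS ...)"; other variations of the entry are not known from this document.

```python
import re

# Pattern modeled on the single startup entry quoted in this document.
STARTUP_PATTERN = re.compile(r"=1= RMS Server STARTUP \(OS ([^;]+); RMS ([^)]+)\)")


def last_startup(log_lines):
    """Return (os_version, rms_version) from the most recent STARTUP entry.

    Returns None if the log contains no startup entry.
    """
    result = None
    for line in log_lines:
        match = STARTUP_PATTERN.search(line)
        if match:
            # Later entries overwrite earlier ones, so the last one wins.
            result = (match.group(1).strip(), match.group(2).strip())
    return result
```

If the second element of the result is "2.0", the new Server has been activated and the config document can be sent.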
If the RMS site is managed by a very capable caretaker (i.e. one
who is very comfortable working on the Mac), then you can perform the
following steps to install the Watchdog. This alternative
installation method achieves the same result as the above method; it
simply reduces the time it takes to install all of the components and
get them running:
- Put the five files, "RMS Server", "RMS ServerConfig", the RMS
site's Server config document, "Macsbug" and "Debugger Prefs",
into a folder and e-mail the folder to the RMS. Be sure to set the
subject of the e-mail message to be "RMS", and make sure to turn
the Watchdog option ON in the Server config document.
- Once these five files have been e-mailed to the RMS, have the
caretaker do the following:
- Quit the RMS Server.
- Drag the two files, "Macsbug" and "Debugger Prefs", into
the System Folder.
- Replace the existing RMS Server application with the new
v2.0 one.
- Replace the existing RMS ServerConfig application with the
new v1.4 one.
- Replace the existing Server config document with the new
one included in the e-mail.
- Restart the Mac.
- After the restart, Macsbug will be installed and the RMS
Server should automatically start up (given there's an alias to
the Server in the "Startup Items" folder). Once restarted, the
Watchdog functionality in the RMS Server will be operational.