RMS II
RMS Server "Watchdog"

(rev. 30-Jul-04)


Introduction

Starting with RMS Server 9.9, a new capability called "Watchdog" has been added. Watchdog continuously watches over the RMS Server and ensures that it is still running. If the Mac crashes, or if the Server stops making forward progress, then Watchdog automatically reboots the Mac. With the Server set as a Startup Item, the reboot will automatically resurrect the dead or stalled Server without requiring any human intervention.

With Watchdog, the Server is now "fault tolerant" in that it has a chance to automatically recover from software bugs and glitches, be they in the Server itself, or the MacOS, or a third party component.
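The "forward progress" check described above can be illustrated with a small sketch. This is not the actual Watchdog code; the names `Watchdog`, `kick`, `is_stalled` and `check_once` are hypothetical, and the reboot action is passed in as a plain function so that nothing here touches the real machine. The idea is only that the Server records each sign of progress, and a separate check fires once too much time has passed without one.

```python
import time


class Watchdog:
    """Tracks the last sign of forward progress and reports a stall after a timeout."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock              # injectable so the logic can be tested
        self._last_progress = clock()

    def kick(self):
        """Called by the server whenever it makes forward progress."""
        self._last_progress = self._clock()

    def is_stalled(self):
        """True once no progress has been reported for longer than the timeout."""
        return self._clock() - self._last_progress > self.timeout_seconds


def check_once(watchdog, reboot):
    """One polling step: call the (hypothetical) reboot action if the server stalled."""
    if watchdog.is_stalled():
        reboot()
        return True
    return False
```

A caller would create the object with the desired limit, for example `Watchdog(600)` for a 10-minute limit, call `kick()` from the main work loop, and run `check_once` periodically from a separate thread or timer.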

Watchdog consists of the following pieces:

  1. Server 9
  2. Server X

The Different Types of Server Operation Failure

There are three main types of Server operation failure. From least severe to most severe, they are: Server Hang, Mac Crash, Hard Crash. The Watchdog can detect and recover from the first two types on OS 9, but cannot detect the third type. On OS X, the Watchdog can only detect and recover from the first type. Each type of failure is listed below along with a description of how Watchdog interacts with that type of failure.

  1. The Server is "hung" or "stuck". In this type of failure, everything is running OK except that the Server is not making any forward progress. For example, the Server could be stuck in an infinite loop, or an alert may be displaying an error message and no one has clicked "OK", or the Server may have called upon the OS or a 3rd party component to perform a subtask and it never finishes.
    In this case, Watchdog will notice that no forward progress is being made. After 20 minutes (10 minutes, starting with RMS Server version 9.9.1), it will give up and automatically reboot the Mac. Upon restart, the Server log will contain the line:
    =1= Watchdog caused the restart; debugger dump follows:

    A debugger dump will follow which tells us:

    • When the reboot happened
    • Which application was running at the time (most likely 'RMS Server')
    • A trace (recent history) of the last places in the Server code that were executed

    The debugger dump can help the developers pinpoint where in the code the hang happened, and possibly allow them to change the code to avoid the same type of hang in the future.
  2. The Mac crashes. In this type of failure something has crashed. You'd normally see a Finder alert saying that "the application has unexpectedly quit, error N", where N is often 2 or 11. Or you'd see the infamous "bomb" alert that lets you restart the Mac. The crash might be caused by the Server, or the Finder, or the MacOS, or a 3rd party component, or some combination.
    Normally the RMS would just be stuck, but because the debugger (Macsbug) is installed, the debugger is invoked instead. This works on OS 9 only, since there is no Macsbug for OS X. The debugger's special prefs cause it to automatically create a debugger dump and then reboot the Mac. Upon restart, the Server log will contain the line:
    =1= Watchdog did NOT cause the restart (possibly a crash); debugger dump follows:

    A debugger dump will follow which tells us:

    • When the reboot happened
    • Which application was running at the time (which app crashed)
    • Which CPU instruction was being executed at the time of the crash
    • The fatal error code (e.g. 2 or 11).

    Again, the debugger dump may provide the developers with some clue as to why the crash happened.
  3. Hard crash. With this type of failure, the Mac has crashed so badly that there is no way to automatically recover (requires human reboot). Examples of this kind of failure are:
    1. Hard crash - a crash happens and the debugger is invoked as in #2 above, but the crash is so severe that the debugger itself cannot function.
    2. Deep freeze - the Mac is "frozen" because it is stuck in an infinite loop executing some high priority system code; although the Watchdog runs in a "preemptive multitasking" mode, the system code is such a high priority that the Watchdog never gets a chance to run and thus cannot reboot the Mac.
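When reviewing a Server log after an unattended restart, the two log lines quoted above are enough to tell a hang from a crash. The following is a minimal sketch of that check, assuming the log is available as a list of text lines; the function names and the returned labels are hypothetical, and the exact layout of a real log line (for example, any timestamp prefix) is not shown in this document, which is why the sketch searches for the marker text anywhere in the line.

```python
# Marker text taken from the two log lines quoted in this document.
WATCHDOG_RESTART = "=1= Watchdog caused the restart"
OTHER_RESTART = "=1= Watchdog did NOT cause the restart"


def classify_restart(log_line):
    """Return a label for a restart log line, or None for any other line."""
    if WATCHDOG_RESTART in log_line:
        return "server hang"        # failure type 1: Watchdog gave up and rebooted
    if OTHER_RESTART in log_line:
        return "possible crash"     # failure type 2: the debugger rebooted the Mac
    return None


def restart_causes(log_lines):
    """Collect the labels of all restart entries, in log order."""
    labels = (classify_restart(line) for line in log_lines)
    return [label for label in labels if label is not None]
```

A hard crash (type 3) leaves no such entry, so the absence of both lines after an outage is itself a hint that a human had to reboot the Mac.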

Installing Watchdog

To install Watchdog on OS X, you need to:

To install Watchdog on OS 9, you need to:

The steps to install Watchdog remotely are:

  1. Put the two files "Macsbug" and "Debugger Prefs" into a folder and e-mail the folder to the RMS. Be sure to set the subject of the e-mail message to be "RMS".
  2. Once these two files have been e-mailed to the RMS, have the caretaker do the following:
    • Quit the RMS Server.
    • Drag both files, "Macsbug" and "Debugger Prefs", into the System Folder.
    • Restart the Mac.
    • After the restart, Macsbug will be installed and the RMS Server should automatically start up (given there's an alias to the Server in the "Startup Items" folder).
  3. Once step #2 is complete, e-mail the new RMS Server v2.0 to the RMS. As usual, the new Server will be activated after the next time the RMS Server is restarted.
  4. Once RMS Server v2.0 is running on the RMS, e-mail the new Server config document to the RMS Server. You can confirm that v2.0 is running by checking the Server log file for the entry "=1= RMS Server STARTUP (OS 9.2.2; RMS 2.0)". Be sure to turn the Watchdog option ON in the Server config document.
  5. When the v2.0 RMS Server receives the new config file, it will automatically restart the RMS Server application. Once restarted, the Watchdog functionality in the RMS Server will be operational.
  6. Lastly, e-mail the new RMS ServerConfig v1.4 to the RMS so that the caretaker can open the config document if necessary.
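The version check in step 4 can also be done mechanically if a copy of the Server log is at hand. The sketch below assumes the startup entry always has the shape quoted in step 4, "=1= RMS Server STARTUP (OS <os version>; RMS <server version>)"; that general shape is inferred from the single example in this document, and the function names are hypothetical.

```python
import re

# Pattern inferred from the one startup entry quoted in this document.
STARTUP_PATTERN = re.compile(r"=1= RMS Server STARTUP \(OS ([^;]+); RMS ([^)]+)\)")


def parse_startup(log_line):
    """Return (os_version, server_version) for a startup entry, else None."""
    match = STARTUP_PATTERN.search(log_line)
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


def is_running_version(log_lines, wanted):
    """True if the most recent startup entry in the log reports the wanted version."""
    latest = None
    for line in log_lines:
        parsed = parse_startup(line)
        if parsed is not None:
            latest = parsed          # later entries override earlier ones
    return latest is not None and latest[1] == wanted
```

For example, `is_running_version(lines, "2.0")` is true only if the last startup entry found in `lines` reports RMS 2.0.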

If the RMS site is managed by a very capable caretaker (i.e. one who is very comfortable working on the Mac), then you can perform the following steps to install the Watchdog. This alternative installation method achieves the same result as the above method; it simply reduces the time it takes to install all of the components and get them running:

  1. Put the five files, "RMS Server", "RMS ServerConfig", the RMS site's Server config document, "Macsbug" and "Debugger Prefs", into a folder and e-mail the folder to the RMS. Be sure to set the subject of the e-mail message to be "RMS", and make sure to turn the Watchdog option ON in the Server config document.
  2. Once these five files have been e-mailed to the RMS, have the caretaker do the following:
    • Quit the RMS Server.
    • Drag the two files, "Macsbug" and "Debugger Prefs", into the System Folder.
    • Replace the existing RMS Server application with the new v2.0 one.
    • Replace the existing RMS ServerConfig application with the new v1.4 one.
    • Replace the existing Server config document with the new one included in the e-mail.
    • Restart the Mac.
  3. After the restart, Macsbug will be installed and the RMS Server should automatically start up (given there's an alias to the Server in the "Startup Items" folder). Once restarted, the Watchdog functionality in the RMS Server will be operational.