What is a watchdog?

In computer terms, a watchdog is a very reliable piece of hardware that ensures the computer is always running. You find such devices in the Mars Pathfinder (who wants to send a person to Mars to press the reset button?) and in some very expensive servers.

The idea behind such a watchdog is very simple: the computer has to "say hello" to the watchdog hardware from time to time to let it know that it is still alive. If the computer fails to do so, the watchdog triggers a hardware reset.
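
The "say hello or get reset" idea can be sketched in a few lines of Python. This is only a software model for illustration, not a driver for any real watchdog hardware: the names SimulatedWatchdog, say_hello, poll and FakeClock are made up for this example, the 60 second timeout is an arbitrary assumption, and the clock is faked so that the behaviour can be followed step by step.

```python
class FakeClock:
    """Hypothetical clock that only moves when told to, so the example is deterministic."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class SimulatedWatchdog:
    """Software model of a watchdog timer (illustration only, not real hardware)."""

    def __init__(self, timeout_s, clock):
        self.timeout_s = timeout_s   # how long the watchdog waits for a hello
        self.clock = clock           # callable returning the current time in seconds
        self.last_hello = clock()    # the timer starts when the watchdog is armed
        self.resets = 0              # number of simulated hardware resets

    def say_hello(self):
        """Called by the computer to signal that it is still alive."""
        self.last_hello = self.clock()

    def poll(self):
        """Return True and count a reset if the last hello is overdue."""
        if self.clock() - self.last_hello > self.timeout_s:
            self.resets += 1
            # After the reset the machine boots again and the timer restarts.
            self.last_hello = self.clock()
            return True
        return False


clock = FakeClock()
watchdog = SimulatedWatchdog(timeout_s=60, clock=clock)  # 60 s is an assumed value

# Healthy machine: it says hello well within the timeout.
clock.advance(30)
watchdog.say_hello()
clock.advance(30)
healthy_reset = watchdog.poll()

# Locked-up machine: no hello arrives for longer than the timeout.
clock.advance(61)
lockup_reset = watchdog.poll()
```

In this model healthy_reset stays False because the hello arrived in time, while lockup_reset becomes True once the machine has been silent for longer than the timeout. A real watchdog does the same thing in hardware, which is why it keeps working even when the operating system has locked up completely.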

Why do I need a watchdog?

Note that a normal Linux server should be able to run uninterrupted for several months, on average probably 1-2 years, without locking up. If you have a machine that locks up every week, then something else is wrong and a watchdog is not the solution. You should check for defective RAM (see memtest86.com), overheated CPUs, IDE cables that are too long, and so on.

If Linux is so reliable that it will run for a year without any problems, then why do you need a watchdog? The answer is simple: to make it even more reliable. There is also a human problem. A server that has caused no trouble for a year is basically unknown to the service staff, so if it fails, nobody knows where it is. It might also lock up just before Christmas, when everybody is at home. In all such cases a watchdog can be very useful.

A watchdog, however, does not solve every problem. It offers no protection against defective hardware. If you include a watchdog in your server, you should also make sure that the hardware is well dimensioned: properly cooled, and probably not the very latest components, which tend to come with fresh BIOS and chipset bugs.

Links

A Hardware watchdog and shutdown button
Watchdog RPM Package