Despite substantial investments in HA infrastructure, a Proxmox cluster can still be vulnerable to service interruptions caused by unstable guests. There is one great feature provided by KVM that can help minimize the impact of such incidents.
HA, HA everywhere
From FC SANs, through redundant networking, all the way up to the HA services included in Proxmox, it’s now easier than ever to build a virtualization cluster immune to hardware failures. And yet, all this effort and spending can be in vain if the guests misbehave.
We’ve recently come across a nasty low-level kernel bug which manifests itself when running CentOS 7 with an ASL-based web hosting stack inside a KVM virtual machine. Certain kernel versions would cause the system to spend more and more CPU ticks on otherwise perfectly normal processes (no I/O wait, regular load), until the whole thing came to a standstill.
Once frozen, the VM required a manual restart to bring it back to life. Proxmox’s HA won’t help in this case, as from the outside the whole ordeal looks like a healthy virtual machine, albeit one with unusually high CPU load.
If only there were a way to easily test the responsiveness of a guest operating system and give it a little slap when it goes stale. Luckily, there is, even though the Proxmox developers have kept it well out of plain view, because why clog their pretty GUI with such a niche feature.
Impertinences aside, said watchdog is actually part of KVM, the virtualization engine which powers Proxmox VMs, and it only takes a few commands to have it up and running.
1. Enable the watchdog virtual device
As I mentioned, watchdog controls have not made it to the GUI yet (as of Proxmox 4.1), so first you need to locate your VM’s config file. It lives in the /etc/pve/qemu-server directory on the node that houses this particular VM. The file name is xxx.conf, where xxx is your VM ID (for example: 101.conf).
Open the file and add the following line:
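In the qemu-server config syntax, the watchdog option takes a hardware model and an action. Using the reset action (the choice this walkthrough assumes) the line looks like:

```
watchdog: i6300esb,action=reset
```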
There are two hardware models to choose from, but i6300esb is the more popular one and is supported by the vast majority of guest OSes. Action defines how Proxmox should react when the watchdog is triggered. Its possible values are reset, shutdown, poweroff, pause, debug and none.
2. Confirm the presence of the watchdog device
You will now have to reboot your VM. If you’re using a modern stock kernel inside your guest, it should pick up the watchdog automatically:
# dmesg | grep 6300
[    6.338422] i6300esb: Intel 6300ESB WatchDog Timer Driver v0.05
[    6.344916] i6300esb: initialized (0xffffc90003498000). heartbeat=30 sec (nowayout=0)
There should also be a /dev/watchdog node present:
# ll /dev/watchdog
crw------- 1 root root 10, 130 Aug 13 17:36 /dev/watchdog
As you can see, the heartbeat timeout is set to 30 seconds, meaning the host expects a heartbeat at least twice a minute; otherwise it will trigger the action defined in step 1.
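The heartbeat protocol itself is trivial: writing any byte to /dev/watchdog resets the countdown, and writing the magic character 'V' before closing disarms the timer (when the driver allows it). The sketch below is a minimal illustration of that protocol, not the watchdog daemon’s actual code; for experimentation it can be pointed at a regular file instead of the real device.

```python
import os

WATCHDOG_DEV = "/dev/watchdog"  # device node exposed by the i6300esb driver

def pet_watchdog(dev=WATCHDOG_DEV):
    """Send one heartbeat, then disarm the timer on the way out."""
    fd = os.open(dev, os.O_WRONLY)
    try:
        os.write(fd, b"\0")  # any write to the device resets its countdown
    finally:
        # Writing the magic character 'V' before closing tells the driver
        # (when built with nowayout=0) that this is an orderly shutdown,
        # so the timer is disarmed instead of firing after the process exits.
        os.write(fd, b"V")
        os.close(fd)
```

This is exactly what the watchdog daemon does for you, in a loop, once all its configured health checks pass.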
3. Configure watchdog daemon
For the watchdog to become active, it needs a heartbeat service inside your guest. This is the process responsible for reporting your VM as healthy. In this case (based on CentOS 7) I’m going to provide a minimal config, but it is possible to have it run all sorts of checks (ranging from network pings to system load) before sending a heartbeat.
The daemon should be available in most distributions:
yum install watchdog
To make it actually do something, edit the config file at /etc/watchdog.conf, and uncomment the following lines:
watchdog-device = /dev/watchdog
log-dir = /var/log/watchdog
realtime = yes
priority = 1
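If you want the daemon to do more than prove the kernel is alive, watchdog.conf also accepts health-check directives; the address and threshold below are purely illustrative values:

```
# withhold the heartbeat if this host stops answering pings (example address)
ping = 10.0.0.1
# withhold the heartbeat if the 1-minute load average exceeds this threshold
max-load-1 = 24
```

With checks like these in place, a VM that is technically running but effectively unusable will also get restarted.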
Afterwards, enable the daemon with systemd:
systemctl enable watchdog
It’s a good idea to reboot again at this point, to make sure it starts automatically. Afterwards, confirm the service is up and running:
# systemctl status watchdog
● watchdog.service - watchdog daemon
   Loaded: loaded (/usr/lib/systemd/system/watchdog.service; enabled; vendor preset: disabled)
   Active: active (running) since Sat 2016-08-13 17:36:41 CEST; 11h ago
 Main PID: 884 (watchdog)
   CGroup: /system.slice/watchdog.service
           └─884 /usr/sbin/watchdog
Pull the trigger
All that’s left is to see it in action. Depending on the i6300esb module build parameters, a graceful shutdown of the watchdog service might disable monitoring altogether. Let’s make sure that doesn’t happen by deliberately crashing our guest’s kernel:
echo c > /proc/sysrq-trigger
At this point, you should see your VM freeze due to kernel panic. And in less than 30 seconds, Proxmox should restart it automatically (unless you’ve changed the watchdog action to something else).
The only downside of this tool is that it does not report its actions anywhere: there is no trace of watchdog reboots to be found in the Proxmox logs.
Otherwise, I find it to be a great tool, one which should not replace monitoring but rather complement it for mission-critical guests.