Is your oVirt node upgrade failing after the "Stop services" step? It's probably a simple fix: /var might be running out of space. Read on to see how to confirm it.

My Scenario

I have multiple hosts that I have upgraded multiple times throughout oVirt's 4.4 and 4.5 lifecycle so far. Sometimes, these machines would fail to upgrade. The only hint in the main logs was "Stop services" followed by "failed":

[Image: onn-upgrade-fail-stop-service]

I have been chasing this issue for a while and found that the system log would show that the last action occurring before the failure was actually a software install:

python3[224883]: ansible-ansible.legacy.dnf Invoked with name=['ovirt-node-ng-image-update.noarch'] state=latest lock_timeout=300 conf_file=/tmp/yum.conf allow_downgrade=False autoremove=False bugfix=False cacheonly=False disable_gpg_check=False disable_plugin=[] disablerepo=[] download_only=False enable_plugin=[] enablerepo=[] exclude=[] installroot=/ install_repoquery=True install_weak_deps=True security=False skip_broken=False update_cache=False update_only=False validate_certs=True allowerasing=False nobest=False disable_excludes=None download_dir=None list=None releasever=None
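If you want to look for the same entry on your own host, something like the following might help. It is only a rough sketch: the function name dnf_invocations is made up for illustration, and it assumes your system log is readable through journalctl (on hosts that log elsewhere, pipe that log file into the function instead).

```shell
#!/bin/sh
# Sketch only: "dnf_invocations" is a hypothetical helper name, not an oVirt tool.

# Read log lines on stdin and print the package list of every
# Ansible-driven dnf call, as it appears in the name=[...] field.
dnf_invocations() {
    grep 'ansible-ansible.legacy.dnf Invoked' |
        sed -n 's/.*Invoked with name=\[\([^]]*\)\].*/\1/p'
}

# Assumption: the host logs to the systemd journal.
if command -v journalctl >/dev/null 2>&1; then
    journalctl --no-pager --since today 2>/dev/null | dnf_invocations
fi
```

If the image update shows up as the last dnf call before the failure, you are likely looking at the same situation described here.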

This was the last relevant-looking log entry before the upgrade timed out. It is an unpack / install / upgrade of the whole image, so I thought I'd check the vitals on the server to see what it was doing. Lo and behold:

[Image: onn-upgrade-fail-disk-space]

/var was running out of space. A quick hunt found that /var/cache/dnf was taking up the vast majority of the space on that volume. One rm -rfv /var/cache/dnf/* later, and the upgrade succeeded.
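To check for this before kicking off an upgrade, a small script along these lines could work. Treat it as a sketch under my own assumptions: the function names and the 90% threshold are arbitrary choices for illustration, not anything oVirt defines. It only reports; the cleanup commands are left commented out so nothing is deleted by accident.

```shell
#!/bin/sh
# Sketch only: function names and the 90% threshold are illustrative assumptions.

# Print the use percentage (digits only) of the filesystem that holds path $1.
usage_percent() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed if the filesystem holding path $1 is at or above $2 percent full.
is_nearly_full() {
    [ "$(usage_percent "$1")" -ge "$2" ]
}

# Print a short report for /var and the dnf cache.
report_var() {
    echo "/var usage: $(usage_percent /var)%"
    if [ -d /var/cache/dnf ]; then
        du -sh /var/cache/dnf
    fi
    if is_nearly_full /var 90; then
        echo "WARNING: /var is nearly full; consider clearing /var/cache/dnf"
    fi
}

report_var

# Cleanup, run manually as root once you have confirmed the cache is the culprit:
#   dnf clean all                # lets dnf remove its own cached data
#   rm -rfv /var/cache/dnf/*     # the blunt approach used in this post
```

I used rm because it was quick, but dnf clean all is the gentler option since it lets dnf tidy up its own cache.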