I’ve been a fan of VMware’s Site Recovery Manager since I began working with it about 5 years ago. It’s a simple but powerful orchestrator to fail over one or all of your VMs to another site. There are plenty of good guides on getting it working, and it is intuitive enough that most of the time you can just wing it; not that I condone that kind of administratorship… :-)
Recently, however, I had all of my Linux VMs in a protection group refuse to take their new IP address on failover. I saw the error “A general system error occurred: vix error codes = (1, 2).” Nothing much turned up on Google, but I found one Reddit post where a fellow admin found that their machine had VMware Tools installed without ‘vgauth’. They said reinstalling Tools with all of the default options fixed the issue. I was ready to dismiss this solution, as we run all of our Tools installs with default options, but I figured it was worth investigating. My Tools installs were slightly out of date anyhow… maybe the upgrade fairy would fix my problem for me?
As I was running the upgrade I paid extra attention to all of the prompts, and sure enough, there it was: the installer's prompts for vgauth and caf, both defaulting to no.
Apparently somewhere between vSphere 6.0 and 6.5, the default for vgauth and caf switched from yes to no. When my Linux boxes updated their Tools installs, they went from having vgauth to not, which broke SRM’s ability to work with the guest OS to update its IP. I finished the update on all of my VMs, this time with vgauth enabled, and all were again failing over successfully.
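If you'd rather catch this before a failover test does, a quick check inside each Linux guest can tell you whether the guest authentication daemon is actually running. Here is a minimal Python sketch of that idea. It rests on one assumption that is mine and not from any official procedure: that the daemon appears in the process list as `VGAuthService`. The function names are made up for illustration, so adjust them to taste.

```python
import subprocess

# Assumption: the vgauth daemon appears in the process list under this
# command name. Change it if your Tools build names it differently.
VGAUTH_PROCESS = "VGAuthService"


def vgauth_running(process_names):
    """Return True if any of the given command names is the vgauth daemon."""
    return any(name.strip() == VGAUTH_PROCESS for name in process_names)


def list_process_names():
    """Return the command name of every running process, using ps."""
    result = subprocess.run(
        ["ps", "-e", "-o", "comm="],  # "comm=" prints names with no header
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

On a guest, `vgauth_running(list_process_names())` coming back `False` would be my cue to rerun the Tools installer and answer yes to the vgauth prompt. Splitting the check from the `ps` call keeps the logic easy to test without a live VM.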
This is why we test our stuff. Well, if we’re lucky. This is why we want to test our stuff. You never know what change in what component could cause calamity when you least expect it.