UPDATE October 2020:
1. Hidden in the comments is a link to the VMware KB article “SFCB crashing and generating dumps: sfcb-vmware_bas-zdump” (78046).
2. The recently released VMware ESXi 6.7 Patch Release ESXi670-202008001 lists the following under Resolved Issues:
“PR 2560686: The small footprint CIM broker (SFCB) fails while fetching the class information of third-party provider classes
SFCB fails due to a segmentation fault while querying with the getClass command third-party provider classes such as OSLS_InstCreation, OSLS_InstDeletion and OSLS_InstModification under the root/emc/host namespace”.
We are still evaluating this patch in our test environment, so we cannot yet tell whether it really resolves the issue.
Recently, a number of ESXi hosts were upgraded from version 6.0 to the latest 6.7 update. Soon after, we detected the following error message: “An application (/bin/sfcbd) running on ESXi host has crashed (1 time(s) so far). A core file might have been created at /var/core/sfcb-vmware_bas-zdump.000.” The core file was indeed created. Luckily this was not a PSOD: the host was still up and running, and workloads were not impacted. All upgraded hosts were affected, and it became clear that about 24 hours after each (re)boot the same event recurred, creating a new dump file.
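Since a new dump shows up roughly every 24 hours, it helps to keep an eye on how many have accumulated on a host. A minimal sketch, assuming the dumps follow the sfcb-...-zdump.NNN naming from the message above; the function name sfcb_dump_count is hypothetical, and the directory defaults to /var/core:

```shell
# Count sfcb core dumps in a directory (defaults to /var/core).
# Assumption: dump files are named like sfcb-vmware_bas-zdump.000.
sfcb_dump_count() {
    dir=${1:-/var/core}
    count=0
    for f in "$dir"/sfcb-*-zdump.*; do
        # When nothing matches, the glob stays literal, so test for existence.
        if [ -e "$f" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

Running sfcb_dump_count without arguments on the host prints the number of dumps currently in /var/core; comparing that number before and after a 24-hour window shows whether the crash happened again.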
After some digging around in the log files, searching for events around the time the dump file was created, we found the following in syslog.log:
“sfcb-vmware_base: tool_mm_realloc_or_die: memory re-allocation failed(orig=400000 new=800000 msg=Cannot allocate memory, aborting”,
followed by: “sfcb-ProviderManager: handleSigChld:166681408 provider terminated, pid=2100157, exit=0 signal=6”. This looks like a memory-related issue.
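Signal 6 is SIGABRT, which fits the “aborting” in the first message. To pull every such termination out of a log in one go, something like the following could be used. It is only a sketch: the function name sfcb_terminations is hypothetical, and it assumes the log lines look exactly like the handleSigChld line quoted above.

```shell
# Print pid, exit code and signal for every provider termination logged by sfcbd.
# Reads log text on stdin; lines that do not match the assumed format are ignored.
sfcb_terminations() {
    grep 'handleSigChld' |
        sed -n 's/.*pid=\([0-9][0-9]*\), exit=\([0-9][0-9]*\) signal=\([0-9][0-9]*\).*/pid=\1 exit=\2 signal=\3/p'
}
```

On the host this would be fed from the syslog file, for example sfcb_terminations < /var/log/syslog.log (the path is an assumption; adjust it to wherever your syslog.log lives).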
As this is not an ideal situation, it was time to engage VMware support. Before we continue, some background: sfcbd stands for “Small Footprint CIM Broker (SFCB) daemon”. For performance and health monitoring, ESXi enables an agentless approach using industry standards like CIM (Common Information Model) and WBEM (Web-Based Enterprise Management). On the ESXi side there is the CIM agent, represented by sfcbd. CIM providers are the counterpart, often supplied by third parties such as hardware vendors. CIM providers come as .VIB files. After detecting a third-party CIM provider, ESXi automatically starts sfcbd (and with it the WBEM services).
After analyzing the log files, VMware support provided several recommendations: checking firmware, drivers etc. against the compatibility list (no issues were found), disabling the CIM agent, and engaging the supplier of the third-party CIM provider.
IMHO, shutting down the CIM agent completely is a last resort, so it was time to get some insight into the CIM providers. To get an overview of the CIM providers installed, log in to the ESXi host as user root and run the following command:
# esxcli system settings advanced list | grep CIM
This presents quite a long list of installed providers. CIM providers are handled by WBEM. For the status of WBEM, run the following:
# esxcli system wbem get
One of the items in the output should read: “Enabled: true”.
To get an overview of the status of the providers run:
# esxcli system wbem provider list
Name              Enabled  Loaded
----------------  -------  ------
emc_sehost           true    true
sfcb_base            true    true
vmw_base             true    true
vmw_hdr              true    true
vmw_iodmProvider     true    true
vmw_kmodule          true    true
vmw_omc              true    true
vmw_pci              true    true
vmw_smx-provider     true    true
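With more than a handful of providers, eyeballing the two columns gets tedious. A small filter can print only the providers that are not fully operational. This is a sketch under the assumption that the output always has the three columns Name, Enabled and Loaded preceded by two header lines; the function name wbem_inactive_providers is hypothetical.

```shell
# Print the names of providers whose Enabled or Loaded column is not "true".
# Expects the output of "esxcli system wbem provider list" on stdin.
wbem_inactive_providers() {
    # NR > 2 skips the column header and the dashed separator line.
    awk 'NR > 2 && NF >= 3 && ($2 != "true" || $3 != "true") { print $1 }'
}
```

Piping the provider list into it (esxcli system wbem provider list | wbem_inactive_providers) prints nothing when every provider is enabled and loaded.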
We see that all providers are loaded and enabled (that is, operational). The VIB behind the last provider in the list, “vmw_smx-provider”, can be found using this command:
# esxcli software vib list | grep smx
smx-provider  670.03.16.00.3-7535516  HPE  VmwareAccepted  2020-01-30
This is a hardware monitoring provider supplied by HPE. My guess was that one of the providers was causing the issue. To prove this, we did the following: disable just one of the providers, reboot the host, and check again after at least 24 hours. To disable a provider, e.g. the smx-provider, run the following command:
# esxcli system wbem provider set --enable false --name="vmw_smx-provider"
Then check the status. The disabled provider must now show both “Enabled” and “Loaded” as false:
# esxcli system wbem provider list
Name              Enabled  Loaded
----------------  -------  ------
emc_sehost           true    true
sfcb_base            true    true
vmw_base             true    true
vmw_hdr              true    true
vmw_iodmProvider     true    true
vmw_kmodule          true    true
vmw_omc              true    true
vmw_pci              true    true
vmw_smx-provider    false   false
Finally, reboot the host. We repeated this for the providers one by one. In every case a dump file was created again, except when the smx-provider was disabled. So the smx-provider seems to be the culprit, and it is time to engage HPE support for a solution.
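Because each round spans a reboot and a wait of a day, the procedure cannot simply be looped in one script. What can be scripted is the list of commands for each round. The sketch below only prints them (a dry run); the function name wbem_isolation_plan is hypothetical, and the re-enable line with --enable true is my assumption about how to restore a provider before testing the next one.

```shell
# For every provider in the list on stdin, print the command that disables it
# and the command that re-enables it afterwards. Nothing is executed.
# Expects the output of "esxcli system wbem provider list" on stdin.
wbem_isolation_plan() {
    awk 'NR > 2 && NF >= 3 {
        printf "esxcli system wbem provider set --enable false --name=\"%s\"\n", $1
        printf "esxcli system wbem provider set --enable true --name=\"%s\"\n", $1
    }'
}
```

Usage: esxcli system wbem provider list | wbem_isolation_plan. Run the first command of a pair, reboot, wait at least 24 hours, check for a new dump file, then run the second command before moving on to the next provider.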
As always, I thank you for reading, I hope this was useful.