Troubleshooting CIM on ESXi

Recently, a number of ESXi hosts were upgraded from version 6.0 to the latest 6.7 update. Soon after, we detected the following error message: “An application (/bin/sfcbd) running on ESXi host has crashed (1 time(s) so far). A core file might have been created at /var/core/sfcb-vmware_bas-zdump.000.” The core file had indeed been created. Luckily this was not a PSOD: the host was still up and running and workloads were not impacted. All upgraded hosts were affected, and it also became clear that about 24 hours after (re)booting a host the same event recurred, creating a new dump file.

After some digging in the log files for events around the time the dump file was created, we found this in syslog.log:
“sfcb-vmware_base[2100157]: tool_mm_realloc_or_die: memory re-allocation failed(orig=400000 new=800000 msg=Cannot allocate memory, aborting”,
followed by: “sfcb-ProviderManager[2100151]: handleSigChld:166681408 provider terminated, pid=2100157, exit=0 signal=6”. Signal 6 is SIGABRT, so the provider process aborted after the failed re-allocation. This looks like a memory-related issue.
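
To find these entries again on other hosts, the two messages can be filtered out of the log with a few lines of shell. This is only a sketch: the function names are mine, and it assumes grep and sed are available in the ESXi shell and that the messages have the same shape as the two quoted above.

```shell
# Keep only the log lines that point at an sfcb provider abort.
# Reads log text on stdin, so it also works on a saved excerpt.
sfcb_abort_lines() {
    grep -E 'tool_mm_realloc_or_die|handleSigChld.*provider terminated'
}

# From the "provider terminated" lines, print the pid and the signal
# that ended the provider process, e.g. "pid=2100157 signal=6".
terminated_providers() {
    sed -n 's/.*provider terminated, pid=\([0-9]*\), exit=[0-9]* signal=\([0-9]*\).*/pid=\1 signal=\2/p'
}
```

On a host this could be run as "sfcb_abort_lines < /var/log/syslog.log", assuming that is where syslog.log lives on your build. A pid reported by terminated_providers that matches the pid in the tool_mm_realloc_or_die line ties the abort to that provider process.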

As this is not an ideal situation, it was time to engage VMware support. Before we continue, some background: sfcbd stands for “Small Footprint CIM Broker (SFCB) daemon”. For performance and health monitoring, ESXi enables an agentless approach using industry standards like CIM (Common Information Model) and WBEM (Web-Based Enterprise Management). On the ESXi side there is the CIM agent, represented by sfcbd. CIM providers are the counterpart, often supplied by 3rd parties like hardware vendors. CIM providers come as .VIB files. After detecting a 3rd party CIM provider, ESXi automatically starts sfcbd (and with it the WBEM services).

After analyzing the log files, VMware support provided several recommendations: checking firmware, drivers etc. against the compatibility list (no issues were found), disabling the CIM agent, and engaging the supplier of the 3rd party CIM provider.
Imho, shutting down the CIM agent completely is a last resort, so it was time to get some insight into the CIM providers. To get an overview of the CIM providers installed, log in to the ESXi host as user root and run the following command:

# esxcli system settings advanced list | grep CIM

This presents quite a long list of installed providers. CIM providers are handled by WBEM. For the status of WBEM, run the following:

# esxcli system wbem get

One of the items in the output should read: “Enabled: true”.
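
When checking more than a handful of hosts, this test can be scripted. A minimal sketch, assuming awk is available in the ESXi shell and that the output consists of "Key: value" lines as described above; the function name wbem_enabled is made up for this example.

```shell
# Exit status 0 when the text on stdin contains the line "Enabled: true",
# non-zero otherwise. Other keys that merely end in "true" are ignored.
wbem_enabled() {
    awk -F': *' '
        {
            key = $1; val = $2
            gsub(/^[ \t]+/, "", key)       # strip leading indentation
            gsub(/[ \t\r]+$/, "", val)     # strip trailing whitespace
        }
        key == "Enabled" && val == "true" { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

Usage would be along the lines of: esxcli system wbem get | wbem_enabled && echo "WBEM is enabled".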

To get an overview of the status of the providers run:

# esxcli system wbem provider list
Name              Enabled  Loaded
----------------  -------  ------
emc_sehost           true    true
sfcb_base            true    true
vmw_base             true    true
vmw_hdr              true    true
vmw_iodmProvider     true    true
vmw_kmodule          true    true
vmw_omc              true    true
vmw_pci              true    true
vmw_smx-provider     true    true

We see that all providers are loaded and enabled (that is, operational). The VIB behind the last provider in the list, “vmw_smx-provider”, can be found using this command:

# esxcli software vib list | grep smx
smx-provider  670.03.16.00.3-7535516  HPE  VmwareAccepted  2020-01-30

This is a hardware monitoring provider supplied by HPE. My guess was that one of the providers was causing the issue. To prove this, we did the following: disable just one of the providers, reboot the host and check again after at least 24 hours. To disable a provider, e.g. the smx-provider, run the following command:

# esxcli system wbem provider set --enable false --name="vmw_smx-provider"

Then check the status. The disabled provider must now show both “Enabled” and “Loaded” as false.

# esxcli system wbem provider list
Name              Enabled  Loaded
----------------  -------  ------
emc_sehost           true    true
sfcb_base            true    true
vmw_base             true    true
vmw_hdr              true    true
vmw_iodmProvider     true    true
vmw_kmodule          true    true
vmw_omc              true    true
vmw_pci              true    true
vmw_smx-provider    false   false 
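
Reading the table by eye is fine for one host, but the same check can be done with a small helper. This is a sketch under the assumption that the table keeps the three-column layout shown above and that awk is available; provider_state and provider_is_off are hypothetical names, not esxcli features.

```shell
# Print "<Enabled> <Loaded>" for the provider named in $1,
# reading the provider table on stdin.
provider_state() {
    awk -v name="$1" '$1 == name { print $2, $3 }'
}

# Exit status 0 only when the provider is both disabled and not loaded.
provider_is_off() {
    [ "$(provider_state "$1")" = "false false" ]
}
```

For example: esxcli system wbem provider list | provider_is_off vmw_smx-provider && echo "smx-provider is off".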

Finally, reboot the host. We disabled the providers one by one; in every case a dump file was created, except when the smx-provider was disabled. So the smx-provider seems to be the culprit, and it was time to engage HPE support for a solution.
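
To keep track of the candidates during such a one-by-one elimination, the disable command for each enabled provider can be generated from the provider table. The sketch below only prints the commands and runs nothing; the function name disable_commands is mine, and it assumes awk and the table layout shown earlier.

```shell
# For every provider that is currently enabled, print (do not run)
# the command that would disable it. Header and separator rows are
# skipped because their second column is not "true".
disable_commands() {
    awk '$2 == "true" {
        printf "esxcli system wbem provider set --enable false --name=\"%s\"\n", $1
    }'
}
```

Run it as: esxcli system wbem provider list | disable_commands. Pick one line per test round, reboot, and wait the 24 hours. Before moving on to the next candidate, the previous provider should be re-enabled (presumably with the same command and --enable true), otherwise more than one provider is switched off at a time and the result no longer points at a single culprit.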

As always, I thank you for reading, I hope this was useful.

5 Responses to Troubleshooting CIM on ESXi

  1. brunom says:

    Same behaviour here. Any comments from HPE?

    • paulgrevink says:

      It is between HPE and VMware. Last action was to install P02, released April 28. Patch did not resolve. Recently found almost identical hardware, but with different NICs not having this issue. So it seems related to some I/O devices. We continue working on this

  2. Rusty says:

    Seeing the same issue with HPE HW….any updates?

  3. Oleg says:

    I have same issue on HPE DL 360 Gen10 on ESXi 6.7 P02 (16075168). Any updates now?
