Check your Internal Certificates!

02/01/2021

In the past year I have experienced two incidents in which important applications were no longer available. In both cases the cause turned out to be an expired internal certificate. Although these incidents can be solved using KB articles, the lesson is to check these critical components at least once a year. With the start of a new year, this is a good time to pay attention to this topic. First vRealize Operations Manager (vROPS).

vROPS

Expiration of the vROPS internal certificate has the following symptoms:
– Unable to log into the Admin UI.
– The cluster is Offline and you are unable to bring it Online with the message “Data Retriever is not initialized yet. Please wait.”.

The procedure to Replace expired internal certificate in vRealize Operations can be found in this KB.

The best way to check the validity of the certificate is using a browser; connect to the vROPS master node over port 6061 and check the validity of the certificate.
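The same check can be done from a shell instead of a browser. The snippet below is a minimal sketch, assuming openssl and GNU date are available on the machine you run it from; the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: read the expiry date of the certificate presented on a TLS port.
# Assumptions: openssl and GNU date are installed; function names are hypothetical.

# Print "notAfter=<date>" for the certificate served by host:port (default 6061).
cert_enddate() {
    local host="$1" port="${2:-6061}"
    # </dev/null closes stdin so s_client exits right after the handshake.
    openssl s_client -connect "${host}:${port}" </dev/null 2>/dev/null \
        | openssl x509 -noout -enddate
}

# Convert a "notAfter=..." line into whole days remaining.
# The optional second argument is "now" in epoch seconds.
days_left() {
    local enddate="${1#notAfter=}" now="${2:-$(date +%s)}"
    local end
    end=$(date -u -d "${enddate}" +%s) || return 1
    echo $(( (end - now) / 86400 ))
}
```

Usage would look something like `days_left "$(cert_enddate vrops-master.example.local)"`, where the hostname is a placeholder. The division truncates toward zero, so a certificate that expired less than a day ago still prints 0.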

You can also run the following command, but I don’t consider it very useful because it doesn’t return an expiration date.

 
 
# /bin/grep -E --color=always -B1 'java.security.cert.CertPathValidatorException: validity check failed|java.security.cert.CertificateExpiredException' $ALIVE_BASE/user/log/*.log | /usr/bin/tail -20

The KB describes the replacement procedure for both scenarios: “Certificate has expired” and “Certificate has not yet expired”. Unfortunately, the procedure for “Certificate has expired” is a bit more cumbersome to perform.

The KB mentioned before states that “Starting in vRealize Operations 8.0, a pop up is displayed in the UI, warning when certificate expiration will occur.”. But better safe than sorry: perform the check at a regular interval.

vCenter Server

Another product that comes with internal certificates is vCenter Server. After expiration of the STS certificate, you can no longer log in to vCenter Server. In some cases (see the KB below for more details), the STS certificate has a lifetime of only 2 years!

VMware KB Checking Expiration of STS Certificate on vCenter Server (79248) is there to help you to identify the expiration date. Attached to the KB, you will find a Python script named checksts.py. Follow the instructions and run the script. In my case (recent vCSA 7.x), no actions are needed.

However, in case the STS certificate is expired, you will find instructions for replacing this certificate for the vCSA or a vCenter Server on Windows.

In VMware KB “Signing certificate is not valid” error in VCSA 6.5.x/6.7.x and vCenter Server 7.0.x (76719) you will find instructions and another script named fixsts.sh for replacing the STS certificate.

Steps 5 and 6 in the resolution are important: the restart of services may fail if there are other expired certificates. Step 6 presents a one-liner to do the check.
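For reference, a check of this kind can be sketched as follows. This is not a copy of the KB one-liner but an approximation written from memory of the vCSA tooling; the vecs-cli path and sub-commands are assumptions to verify against the KB, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: list alias and expiry date for every certificate in the VECS stores.
# Assumption: vecs-cli lives at this path on the vCSA; verify against the KB.
VECS_CLI="${VECS_CLI:-/usr/lib/vmware-vmafd/bin/vecs-cli}"

# Keep only the alias and expiry lines from "vecs-cli entry list --text" output.
expiry_lines() {
    grep -E 'Alias|Not After'
}

# Walk all stores (the CRL store is skipped) and print the expiry info per store.
list_cert_expiry() {
    local store
    for store in $("${VECS_CLI}" store list | grep -v TRUSTED_ROOT_CRLS); do
        echo "[*] Store: ${store}"
        "${VECS_CLI}" entry list --store "${store}" --text | expiry_lines
    done
}
```

Run `list_cert_expiry` in a root shell on the vCSA and look for “Not After” dates that are in the past or close to today.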

The first certificates are about to expire in July 2022.
Also in this case, there are references to KBs for replacing these certificates using the vSphere Certificate Manager. Also be aware that you may encounter the situation described in VMware KB “Failed to login to vCenter as extension, Cannot complete login due to an incorrect user name or password”, ESX Agent Manager (com.vmware.vim.eam) solution user fails to log in after replacing the vCenter Server certificates in vCenter Server 6.x (2112577).

If you are on vSphere 7 (and some editions of vSphere 6.7), there is an even more convenient option.
One of my colleagues (thank you, Joop Kramp) discovered a pre-configured vSphere Alarm for this in more recent versions (at least in vSphere 7).

Hopefully these checks will help you to avoid unexpected downtime of important management applications in your vSphere environment.

As always, I thank you for reading.


Tips for vRealize Log Insight or Operations Manager

19/12/2020

Recently I wanted to test from a vROPS host whether certain ports on a remote system were reachable. In the good old days, a telnet client was a useful tool for this.
Today appliances like the vCSA, vRealize Log Insight and Operations Manager (vROPS) are based on PhotonOS and contain few unnecessary utilities (which is a good thing by the way, because what is not there, you do not have to maintain).
Another widely used utility, netcat, is also not present in vROPS (although it is still present in Log Insight).
Fortunately PhotonOS is based on Linux, and the following trick is available to check ports anyway.

To start, log in with an SSH client to a vROPS host as user root.
The following command will test a port for a given host:

# cat < /dev/tcp/<hostname or IP>/<port>

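If the requested port cannot be reached, a response like the following is received (a timeout typically points to a firewall dropping the traffic; a closed port on a reachable host returns “Connection refused” instead):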

root@vrli-1 [ ~ ]# cat < /dev/tcp/192.168.100.51/222
-bash: connect: Connection timed out
-bash: /dev/tcp/192.168.100.51/222: Connection timed out

A successful connection, on the other hand, will often return no response and must be ended with [Ctrl]+C, or it will show something like this:

root@vrli-1 [ ~ ]# cat < /dev/tcp/192.168.100.51/22
SSH-2.0-OpenSSH_8.1
^C
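To avoid waiting for the timeout or pressing [Ctrl]+C, the same /dev/tcp trick can be wrapped in a small function. This is a sketch, assuming bash and the coreutils `timeout` command are present on the appliance; the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: non-interactive port check using bash's /dev/tcp pseudo-device.
# Assumptions: bash and the coreutils `timeout` command are available.

# Return 0 when the argument is an integer between 1 and 65535.
valid_port() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 65535 ]
}

# Usage: check_port <host> <port> [seconds]; prints "open" or "closed".
check_port() {
    local host="$1" port="$2" secs="${3:-5}"
    valid_port "${port}" || { echo "invalid port: ${port}" >&2; return 2; }
    # The child bash opens the connection on fd 3; the fd is closed when it exits.
    if timeout "${secs}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "${host}" "${port}" 2>/dev/null; then
        echo "open"
    else
        echo "closed"
        return 1
    fi
}
```

For example, `check_port 192.168.100.51 22 3` waits at most three seconds. Note that “closed” here also covers filtered ports and unreachable hosts.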

Another tip: while investigating the options for sending Alerts from vROPS or Log Insight to a REST-enabled application, it may be necessary to gain more insight into the structure and content of those alerts. VMware does provide some information about the content, but when you want to see it for yourself, there are some useful tools. One of these tools is webhook.site. A prerequisite is that the application needs Internet access. For this reason, consider the (temporary) installation of a vROPS or Log Insight instance in a test environment.
Using webhook.site is very simple.
Go to https://webhook.site , where a unique URL is prepared for you.

Copy “Your unique URL”
Follow the instructions, given in the VMware documentation for the configuration of the REST plugin

For URL, paste the unique URL and under Content type, select: application/JSON.

Press TEST, to test the configuration.

In vROPS, you will receive the message “Validate Connection – Test connection successful.” if everything goes well. Now switch to webhook.site.

Here, you can review the structure/content of the alert sent by vROPS or Log Insight and other details. You can also export the alert in CURL or HAR format, see the “Export as” option at the top row.

As always, I thank you for reading.


vROPS, how to disable tagging

09/12/2020

Just a quick write-up for my own convenience. Recently I came across a situation where it was necessary to temporarily disable tagging on a vRealize Operations Manager 8.x (vROPS) cluster, due to an issue with a connected vCenter Server. I won’t go into the why here.

However, I did learn something useful about changing this setting.

The steps to disable tagging for a vROPS node are pretty simple:

  1. Log in to the vROPS console as user root.
  2. Add the following line:

    tagEnabled=false

    to the file /usr/lib/vmware-vcops/user/plugins/inbound/vmwarevi_adapter3/conf/vmware.properties.
  3. Restart the collector process with this command:

    # service vmware-vcops restart collector
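Steps 2 and 3 can also be scripted. The function below is a sketch, assuming GNU sed and grep; `set_property` is a hypothetical helper name, and it either rewrites an existing key or appends a new line so that repeated runs do not create duplicates.

```shell
#!/usr/bin/env bash
# Sketch: set key=value in a properties file without creating duplicate lines.
# Assumptions: GNU sed/grep; set_property is a hypothetical helper name.

# Usage: set_property <file> <key> <value>
set_property() {
    local file="$1" key="$2" value="$3"
    if grep -q "^${key}=" "${file}"; then
        # Key already present: rewrite the existing line in place.
        sed -i "s|^${key}=.*|${key}=${value}|" "${file}"
    else
        # Make sure the file ends with a newline before appending.
        [ -n "$(tail -c1 "${file}")" ] && echo >> "${file}"
        printf '%s\n' "${key}=${value}" >> "${file}"
    fi
}

# On the vROPS node (path taken from step 2), something like:
#   props=/usr/lib/vmware-vcops/user/plugins/inbound/vmwarevi_adapter3/conf/vmware.properties
#   cp -p "$props" "$props.bak"             # keep a backup first
#   set_property "$props" tagEnabled false
#   service vmware-vcops restart collector
```

Running `grep '^tagEnabled=' "$props"` afterwards is a quick way to confirm that the line is there.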

If you have a single vROPS server, you are good to go. However if you have a vROPS cluster, with a master node, data nodes and remote collectors, then take some extra caution.

In my case, I started with one of the RC nodes, made the change and noticed that a few minutes after restarting the service, the extra line had disappeared from the vmware.properties file.

The reason for that behavior is that during a restart Remote collectors inherit the content of the vmware.properties from the Master node.

So, first disable tagging on the Master node, restart the collector process, then restart the collector process on the remote collectors.

The takeaway is that remote collectors will inherit settings from the master node during a restart.

Finally, after an upgrade of vROPS, check the vmware.properties file; more than once, previously made changes were found to have disappeared after the upgrade.

That’s all for now. As always, I thank you for reading.

Photo by Vitaly Vlasov via Pexels


vSphere LCM – updating Images

06/12/2020

With the release of vSphere 7.0, the vSphere Update Manager has been transformed into vSphere Lifecycle Manager. The biggest improvement is the introduction of managing clusters with images, besides the familiar baselines concept. This post by Steven Bright is an excellent introduction to the concept of images and how to set them up. Of course there is also the official VMware documentation about LCM.
So after reading Steven’s post, I followed the instructions, created an image for ESXi 7.0 update 1 – 16850804, added the USB Fling as an additional component and upgraded the NUCs in my home lab without any problem.
Recently ESXi 7.0U1a and 7.0U1b were released, which raised the question: how to update the Image?

Screenshot taken after the upgrade…

Under Cluster > Updates > Hosts > Image, next to the EDIT option, is the “Check for recommended images” option. This reports no new images to download, so time for something else.

First download the ESXi depot file (I use https://my.vmware.com/group/vmware/patch#search for searching and downloading patches), in my case: VMware-ESXi-7.0U1b-17168206-depot.zip.

In the vSphere Client go to: Lifecycle Manager, under Actions select Import Updates and import the depot file. After a few moments, the latest updates will appear under Image Depot, ESXi versions. Note that, as updates are cumulative, update 7.0 U1a is now available as well.

Update 7.0 U1a and 7.0 U1b added…

Now go back to: Cluster > Updates, select Image and select EDIT. Now under ESXi Version the latest updates are available. Select the version of your choice (in my case 7.0 U1b).

Select the new ESXi version …

Note that the other components are still in the image and (if needed) can also be changed. When you are done, press VALIDATE to validate the image; if everything is OK, SAVE the image.

Almost immediately, LCM will notify you that the cluster is not compliant and you can start the remediation.

As always, I thank you for reading.


Home lab refresh

10/10/2020

Since 2010, my VMware home lab has been running on two servers: an HP ProLiant ML 110 G5 and an ML 110 G6. First the G5 was taken out of active duty because of its 8 GB memory limit. Fortunately, it was possible to upgrade the memory of the G6 from the supported 16 GB to 32 GB, so the G6 remained usable for quite some time, for labs with a vCenter Server and 3 virtual ESXi hosts. Recently, it became too tedious to run the latest vSphere editions.

A home lab is a valuable resource for various reasons. When preparing for a VMware exam, like the Datacenter VCP, you can practice installation and configuration of ESXi, vCenter Server, but also other tools like NSX, vROPS or LogInsight. A home lab is also useful for investigations which cannot be done at work in a production environment, to practice changes or upgrades and last but not least break and fix (one of my favorite use cases and highly educational).

If you want to practice with vSphere and other products, there are several options, which mainly depend on available budget, but also on other factors. The possibilities vary from a lab-in-the-cloud such as VMware Hands On Labs to VMware Workstation or a 19-inch rack filled with servers and switches. In my situation, the decisive factors were limited space (I live in an apartment), low noise production, low energy consumption and the requirement to run a nested ESXi cluster with tools like LogInsight and vROPS. For a full vSphere 7 plus Kubernetes lab, however, a reasonable amount of hardware is required!

The old and the new, small but powerful

After some searching on the Internet you will soon come across the Intel NUCs, which, although not listed on the official VMware HCL, are beloved by the community, see here and here.

Intel NUCs currently support 64GB of memory. Besides an i3, the tenth generation is available as an i5 (4 cores) and an i7 (6 cores). My choice fell on the i5 (budget). Intel NUCs come with a processor, but without memory and disk(s); the final composition can be found on my Gear page.

The set-up of the Intel NUCs is not difficult; on the previously mentioned blogs of Virtuallyghetto.com and Virten.net you can find enough information for a successful installation.

The NUCs are installed with the latest ESXi 7.0 and are managed by a vCSA. To support the deployment of vSphere 6.7 and 7.0 labs, I use two Windows domain controllers (DNS and DHCP), a Windows scripting host and a pfSense firewall. For the deployment of the labs I gratefully use the nested ESXi appliances and the deployment scripts as provided by William Lam. With this a complete environment will be available in no time.


PowerShell Tips 1

06/06/2020

As you probably know, PowerShell is built on .NET; to be more precise, Windows PowerShell is built on the .NET Framework, whereas PowerShell Core is built on .NET Core.

When you work with PowerShell in many cases you won’t be very concerned about this fact, but in some cases you can’t ignore it.

The other day I was working on a PowerCLI script to get and set the log forwarding for a vCenter Server Appliance (vCSA), see also this older post.
The “get” part worked well. To retrieve the hostname, the port and protocol of the forwarding log servers run the following line of code:

 
(Get-CisService -name 'com.vmware.appliance.logging.forwarding').get()

For the set part, I created:


$spec = New-Object PSObject -Property @{
	hostname="logger1.net"
	port=514
	protocol="UDP"
}

(Get-CisService -name 'com.vmware.appliance.logging.forwarding').set($spec)

However this failed, creating the following error message:

 

Parameter 'cfg_list' expects values of type  'System.Collections.Generic.List`1[[System.Management.Automation.PSObject, 
System.Management.Automation, Version=3.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]]' 
but received value of type 'System.Management.Automation.PSObject'.
At line:1 char:1
+ (Get-CisService -name 'com.vmware.appliance.logging.forwarding').set( ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OperationStopped: (:) [], CisException
    + FullyQualifiedErrorId : VMware.VimAutomation.Cis.Core.Types.V1.CisException

From the documentation, I already knew that a vCSA supports a total of 3 log forwarding hosts – hence the ‘cfg_list’ – but how to interpret this error message? The parameter ‘cfg_list’ must be of a certain type, but how to solve this? Luckily my colleague Bouke (you can see what is on his mind on https://www.jume.nl ) quickly showed me the solution: specifying the variable in the correct type.

The following piece of code does the ‘set’ job. The solution is in the first line; setting the correct type (variable $speclist) for the ‘cfg_list’ parameter.


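# 'cfg_list' expects a generic list of PSObjects, so create the list with exactly that type
$speclist = [System.Collections.Generic.List[PSObject]]::new()

# First log forwarding target
$spec = New-Object PSObject -Property @{
	hostname="logger1.net"
	port=514
	protocol="UDP"
}
$speclist.Add($spec)
# Second log forwarding target
$spec = New-Object PSObject -Property @{
	hostname="logger2.net"
	port=514
	protocol="UDP"
}
$speclist.Add($spec)

# Pass the typed list instead of a single PSObject
(Get-CisService -name 'com.vmware.appliance.logging.forwarding').set($speclist)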


As always, I thank you for reading.


The importance of good data / How to set-up a baseline document?

05/05/2020

Lately I’ve been working on machine learning and more specifically the Python scikit-learn library.
What I especially learned from this is the need to have a good data set before you do any kind of analysis or prediction.

But what does that have to do with subjects I usually write about? In the past period I have blogged regularly about configuration drift and tools like Vester and DSC resources for VMware.
We are also working on this within the company where I work.
Recently the assignment came to set up a baseline for the vCenter Server Appliances – you can’t solve configuration drift without thinking about the desired values, so time for a baseline. At first glance this seems simple: a baseline is a finite list of key-value pairs, with the setting on one side and the value on the other. In practice it turns out to be a bit more complicated. I have to add that this baseline is not meant for a single vCenter Server, but for quite a few.

To get started, after connecting to a vCenter Server, the following command produces an overview of all settings for that vCenter:

PS> Get-AdvancedSetting -Entity <vCSA FQDN or IP>

Next to the fields Name and Value, you will also get the Type (of the Value) and sometimes a brief Description. From vSphere 6.5 onward, you can also collect many appliance-related settings using the API.
You might think: collect the settings of all vCenters, set the desired values, and done! In practice, however, some obstacles soon appeared, such as:

  1. Not all vCenters are on the same version. Settings come and go. Some settings from vSphere 6.5 have disappeared in version 6.7, new settings have been introduced in version 6.7 and 7.0.
  2. Sometimes a setting exists, but returns an empty string. This is not equal to a setting that does not exist.
    Why worry about a setting with an empty string? What if, for whatever reason, a value does appear at any time?
  3. Not all settings are actually settings, but contain (status)information. We want to filter these out from our Configuration management tooling.

The baseline was created using PowerShell and PowerCLI. The first step is to collect the settings of all vCenters as described above. The result is a .csv file for each vCenter. Incorporate the name of the vCenter in the filename, like “vc01.csv”.
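Once the per-vCenter .csv files exist, the first obstacle (settings that come and go between versions) can be made visible with a quick comparison. The following is only a sketch and not the method used for the baseline itself. It assumes the first column holds the setting Name, fields are double-quoted without embedded commas, and there is one header row; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: compare which setting names exist in two per-vCenter CSV exports.
# Assumptions: first column is the setting Name, fields are double-quoted,
# one header row (an optional "#TYPE" line is skipped). Names are hypothetical.

# Print the sorted, unique setting names found in one CSV file.
setting_names() {
    grep -v '^#' "$1" | tail -n +2 | cut -d, -f1 | tr -d '"\r' | sort -u
}

# Print names that occur in the first file but not in the second.
only_in_first() {
    local a b
    a=$(mktemp) && b=$(mktemp) || return 1
    setting_names "$1" > "$a"
    setting_names "$2" > "$b"
    comm -23 "$a" "$b"
    rm -f "$a" "$b"
}
```

For example, `only_in_first vc01.csv vc02.csv` lists the settings that exist on vc01 but are missing on vc02. A setting with an empty value still shows up as existing, which matches the distinction made in point 2.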



Troubleshooting CIM on ESXi

11/03/2020

UPDATE October 2020:
1. Hidden in the comments is a link to VMware KB SFCB crashing and generating dumps: sfcb-vmware_bas-zdump (78046).
2. In the recently released VMware ESXi 6.7, Patch Release ESXi670-202008001, in the Resolved Issues section, you will find:
“PR 2560686: The small footprint CIM broker (SFCB) fails while fetching the class information of third-party provider classes
SFCB fails due to a segmentation fault while querying with the getClass command third-party provider classes such as OSLS_InstCreation, OSLS_InstDeletion and OSLS_InstModification under the root/emc/host namespace”.
We are still in the process of evaluating this patch in test, so we cannot yet tell whether it really resolves the issue.

Recently, a number of ESXi hosts were updated from version 6.0 to the latest 6.7 update. Soon after, we detected the following error message: “An application (/bin/sfcbd) running on ESXi host has crashed (1 time(s) so far). A core file might have been created at /var/core/sfcb-vmware_bas-zdump.000.”. The core file was indeed created. Luckily this was not a PSOD: the host was still up and running and workloads were not impacted. We also noticed that all upgraded hosts were affected, and it became clear that about 24 hours after (re)booting a host the same event re-occurred, creating a new dump file.

After some digging around in the log files, searching for events at the time the dump file was created we found in the syslog.log:
“sfcb-vmware_base[2100157]: tool_mm_realloc_or_die: memory re-allocation failed(orig=400000 new=800000 msg=Cannot allocate memory, aborting”,
followed by: “sfcb-ProviderManager[2100151]: handleSigChld:166681408 provider terminated, pid=2100157, exit=0 signal=6”. This looks like some memory related issue.
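To find these two kinds of messages again on other hosts, a grep along the following lines can help. It is a sketch that assumes the log lives at /var/log/syslog.log on the ESXi host and that the message texts match the ones quoted above; `sfcb_events` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: pull the sfcb crash-related lines out of a syslog file.
# Assumptions: default log path /var/log/syslog.log; sfcb_events is a hypothetical name.

# Usage: sfcb_events [logfile]
sfcb_events() {
    grep -E 'sfcb-[A-Za-z_]+\[[0-9]+\]: .*(memory re-allocation failed|provider terminated)' \
        "${1:-/var/log/syslog.log}"
}
```

Comparing the timestamps of the matching lines with the creation time of the dump files shows whether each crash follows the same pattern.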

As this is not an ideal situation, it was time to engage VMware support. Before we continue, some background: sfcbd stands for “Small Footprint CIM Broker (SFCB) daemon”. For performance and health monitoring, ESXi enables an agentless approach using industry standards like CIM (Common Information Model) and WBEM (Web-Based Enterprise Management). On the ESXi side, there is the CIM agent, represented by the sfcbd. CIM providers are the counterpart, often supplied by 3rd parties like hardware vendors. CIM providers come as .VIB files. After detecting a 3rd party CIM provider, the sfcbd (and with it the WBEM services) is automatically started by ESXi.



What is the sharedPolicyRefCount?

11/01/2020

Just a quick write-up for my own convenience.
Recently while working on a configuration management baseline for a vSphere environment, I stumbled on a particular advanced setting, present in ESXi. The setting is named config.globalsettings.guest.commands.sharedpolicyrefcount,
with description “Reference count to enable guest operations” and can have an integer value between 0 and 2147483647.

From its name I know it has something to do with the guest OS. A quick Google search did not reveal very useful information, in particular which value needs to be set (as I found “0” and “100” mentioned as preferred values).

From VMware Support (thank you, Pranita Kumari), I learned that vRealize Infrastructure Navigator uses VMware Tools to access the machines and configure the hosts and virtual machines for the discovery process. vRealize Infrastructure Navigator needs to set the ‘sharedpolicyrefcount’ parameter in order to do agentless discovery.
If you don’t use vRealize Infrastructure Navigator (this product has reached end of distribution and general support), the best practice is to set this option to the default value 0.

That’s all, I thank you for reading.


Vester and DSC, a comparison

30/12/2019

Over the past couple of months, I have published several posts about Configuration drift and tools like Vester and DSC Resources for VMware. Because Vester and DSC Resources for VMware serve the same goal, let us review what these tools have in common and see some of the differences.
The topics: general information about the tool, configuration of the tool, the tool in daily operations, performance and a summary.

Introduction

Both tools are built with PowerShell. Vester has been on the market for the longest time and dates from 2017. Vester comes as a PowerShell module and depends on two other modules: Pester and PowerCLI. Vester consists of three parts:

  • Commands that do the actual work, like creating configuration files, verifying the actual configuration and performing remediation in case the actual configuration does not match the desired configuration.
  • A set of Test files. Each test file contains code that checks and applies a configuration item.
  • Config files, which contain key-value pairs with the desired values of the configuration items. Some examples: NTP settings, DNS servers, etc.

Desired State Configuration (DSC) was introduced in PowerShell 4 and brings a declarative model for the configuration of Windows Servers. DSC can copy files, edit the registry, install Windows features and components. After initial configuration, DSC can also test the desired configuration and if necessary perform remediation.
DSC Resources are what can be configured on a Windows server, but today not only on Windows Servers! DSC Resources for VMware was first released in December 2018. Instead of Windows servers, these resources can configure ESXi hosts and vCenter Servers, although the first edition had only a few resources. The second edition, released in June 2019 offered considerably more resources.
Both tools are available in the PowerShell Gallery and can be found on GitHub.
