DSCR for VMware 2.2

17/10/2021

Over the past few years I have devoted several posts to configuration management of vCenter Server and ESXi. Back then I also reviewed one of the first versions of DSC Resources for VMware. At the time I was not entirely enthusiastic, especially with regard to security aspects.

In February 2021 the latest version 2.2 was released and a lot has changed. Besides support for PowerShell 5.1 and 7.0, there is now also support for PowerShell Core on Linux.

The best improvement, in my opinion, is that the developers have made good use of the Invoke-DSCResource cmdlet introduced by Microsoft, which allows DSC resources to be executed without the PowerShell LCM engine. This eliminates the need for the Windows proxy server (which was also one of my objections). The Invoke-DSCResource cmdlet is part of the new PSDesiredStateConfiguration module.

Based on these new capabilities, VMware has made available the module VMware.PSDesiredStateConfiguration. Looking at the contents of this module, we see the following functions:
Get-VmwDscConfiguration, New-VmwDscConfiguration, Start-VmwDscConfiguration and Test-VmwDscConfiguration. In these we recognize the three basic DSC functions: Test, Set (Start) and Get.

Another interesting enhancement, available only for PowerShell 7, is vSphereNode. vSphereNode is a keyword that represents a connection to a vCenter Server. A configuration can contain one or more vSphereNodes. The advantage: with a normal DSC resource, the Server and Credential properties must be declared for each resource, whereas vSphereNode uses a connection to a vCenter Server set up with the familiar Connect-VIServer cmdlet. This, in my opinion, makes the configuration much more manageable. Here are examples of configuration with and without vSphereNodes.

Read the rest of this entry »


Intel NUC – boot from iSCSI LUN

03/10/2021

A few hours after a brand new USB flash drive failed for the second time and one of my vSAN NUC nodes couldn’t boot, I came across VMware KB 85685, titled “Removal of SD card/USB as a standalone boot device option”. The message in this KB was clear: time for another way to boot the NUCs. The expansion capabilities of NUCs are limited and both disks are already in use, which leaves Auto Deploy or booting from an iSCSI target. I decided to try the latter option. The work consists of two parts: 1. configuring the iSCSI targets (the easiest part) and 2. configuring the NUCs correctly (somewhat more difficult). After some searching and experimenting I found the desired solution, including VLAN configuration.

The first step is creating the iSCSI targets. You will need a target for each ESXi host. In this example I will show how the targets are created on a Synology NAS.

Read the rest of this entry »


Error loading Alerts in vRealize Log Insight 8.x

14/07/2021

Recently, after upgrading a Log Insight cluster from version 8.2.0 to 8.4.1, the cluster seemed to be in good shape, but opening Alerts from the Administration menu showed “Error loading Alerts” instead of the Alerts.

Other symptoms:

  • Opening Certificates from the Administration menu shows “Error loading certificates” instead of the Certificates.
  • Testing integration with vROPS or vCenter shows “Could not connect to host”.

When we logged in to the Log Insight UI of all cluster nodes, it became clear that not all nodes had this issue, only about 50 percent of them.

This issue is caused by a corrupt certificate store on the unhealthy nodes.

How to check the state of the certificate store?

Open an SSH session to an unhealthy node as user root and run the following command:

# /usr/java/jre-vmware/bin/keytool -list -keystore /usr/java/jre-vmware/lib/security/cacerts

When the response is: “keytool error: java.io.EOFException”, the cacerts file is corrupt.

Also the /storage/var/loginsight/ui_runtime.log file will show errors like: “java.security.KeyStoreException: problem accessing trust store”.
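The check above can also be scripted, for example to run it on every node in turn. The following is only a minimal sketch: it classifies the keytool output using the two responses quoted in this post. The function names are hypothetical, and the assumption that keytool prints its password prompt even when stdin is closed is mine and untested.

```python
import subprocess

# Paths as used in the keytool command in this post.
KEYTOOL = "/usr/java/jre-vmware/bin/keytool"
CACERTS = "/usr/java/jre-vmware/lib/security/cacerts"


def classify_keytool_output(output: str) -> str:
    """Map keytool output to 'corrupt', 'ok' or 'unknown'."""
    if "java.io.EOFException" in output:
        return "corrupt"   # the error quoted in this post
    if "Enter keystore password" in output:
        return "ok"        # keytool could open the store and asks for a password
    return "unknown"       # anything else needs a manual look


def check_cacerts(timeout: float = 10.0) -> str:
    """Run keytool on the local node and classify the result.

    Assumption: with stdin closed, keytool still prints its prompt or error.
    """
    try:
        result = subprocess.run(
            [KEYTOOL, "-list", "-keystore", CACERTS],
            stdin=subprocess.DEVNULL,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (FileNotFoundError, subprocess.TimeoutExpired):
        return "unknown"
    return classify_keytool_output(result.stdout + result.stderr)
```

Calling `check_cacerts()` on a node returns one of the three labels; only the "corrupt" result corresponds to the situation described here.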

Workaround:

  1. Open an SSH session to a healthy node as user root.
  2. Before applying the workaround, the certificate store can be checked, using the same command as above.
    When the response is: “Enter keystore password:”, the cacerts file is OK.
    Cancel with keystroke: [Ctrl]+C.
  3. Using the built-in scp utility, copy the /usr/java/jre-vmware/lib/security/cacerts file from the healthy node to the /usr/java/jre-vmware/lib/security/ directory of all unhealthy nodes in the vRealize Log Insight cluster.
    # scp /usr/java/jre-vmware/lib/security/cacerts root@unhealthynode:/usr/java/jre-vmware/lib/security/cacerts

    Where unhealthynode is the IP address or fully qualified domain name (FQDN) of an unhealthy node.

  4. You don’t even have to restart the unhealthy Log Insight node. Just check the node after the file copy.

As always, I thank you for reading.


Log Insight Agent, first steps…

30/05/2021

Since Log Insight version 8.4, a new tab called “Log Sources” has been added to the user interface. Looking further, we see two agents there: besides the new FluentD agent, there is the well-known Log Insight Agent (LI Agent). The LI Agent is available for Windows and Linux platforms and can forward events from log files to a Log Insight server or other syslog destinations.

By the way, did you know that the LI Agent is also installed by default on a vRealize Operations Manager (vROPS) node? When you configure Log Forwarding via the GUI (Administration > Management > Log Forwarding), you are actually configuring the LI agent.

The reason to pay attention to the LI Agent was a recent situation regarding auditing of applications. Because Log Insight plays a central role in an SDDC environment, auditing of components such as ESXi, vCenter Server, etc. can take place via Log Insight, but what about the auditing of Log Insight itself?

The fairly old KB 53123 is a good starting point. Just like the vROPS nodes mentioned before, each Log Insight node has the LI Agent installed by default. LI Agents can be centrally configured from the Log Insight GUI or by manually editing the configuration file liagent.ini in the folder /var/lib/loginsight-agent. After editing, the new configuration automatically becomes active and can be observed in the file liagent-effective.ini.

<fig1>

The following test setup was created: vRLI-1 is the log source, a one-node Log Insight instance on which the Log Insight Agent is configured.
A second Log Insight instance, vRLI-2, serves as the destination host. For our auditing purposes we start by forwarding successful and unsuccessful login attempts on vRLI-1. These are logged, among other places, in the file ui_runtime.log in the /storage/var/loginsight folder.

Read the rest of this entry »


Log Insight Upgrade suddenly halted

25/04/2021

Just a quick write-up that might save you some time and frustration during your next vRealize Log Insight upgrade. Upgrade of Log Insight can be performed comfortably from the GUI, especially when the cluster consists of multiple nodes. After starting, the upgrade is performed on all nodes without further intervention. During a recent upgrade of a cluster from version 8.2 to 8.4, the upgrade of the master node was performed successfully. Normally, after the upgrade of the master node completes, one of the following nodes will go in maintenance and the process continues … but not this time. The upgrade stopped without further notice.

Investigation of the upgrade.log in /storage/var/loginsight on the impacted node showed that before starting the actual upgrade a few checks are performed, like available disk space in /tmp, /storage/core and /storage/var. BTW it is not clear how much free space is required.

It turned out that /storage/var was at 100% usage! In addition to a “lost+found” folder, this partition contains the Log Insight runtime and other log files located in /storage/var/loginsight.

The subfolder i18n (containing large VIP-error.log and VIP-info.log files, related to internationalization) is the culprit: there is no log rotation for these files, so they keep filling up the partition.
So, after reverting the snapshots (good practice for this kind of upgrade) and manually deleting old log files in the i18n folder, a new attempt was made and this time the upgrade went flawlessly.

So, before upgrading Log Insight nodes, first check the free space on the partitions mentioned above (SSH to the nodes and run df -h).
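For those who prefer a scripted pre-check over reading df -h output, here is a minimal sketch to run on a node. The 90 percent threshold is purely my assumption, since the required free space is not documented, and the function names are hypothetical. The percentage is computed as used/total, so it can differ slightly from the Use% column of df.

```python
import shutil

# Partitions that the upgrade pre-check looks at, according to upgrade.log.
PARTITIONS = ("/tmp", "/storage/core", "/storage/var")


def percent_used(total: int, used: int) -> float:
    """Return the used space as a percentage of the total."""
    return 100.0 * used / total if total else 0.0


def full_partitions(paths=PARTITIONS, threshold=90.0, usage=shutil.disk_usage):
    """Return (path, percent used) for every partition at or above threshold.

    threshold=90.0 is an arbitrary example value, not a documented requirement.
    """
    flagged = []
    for path in paths:
        try:
            stats = usage(path)
        except OSError:
            continue  # path does not exist on this machine
        pct = percent_used(stats.total, stats.used)
        if pct >= threshold:
            flagged.append((path, round(pct, 1)))
    return flagged
```

An empty list from `full_partitions()` means none of the three partitions is above the chosen threshold.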


Skyline Health Detector

01/04/2021

Skyline is VMware’s proactive self-service support technology, available to customers with an active Production Support or Premier Services contract and based on Skyline Collector and Skyline Advisor. But there is more: in the latest vCenter Server 7.x edition you will also find Skyline Health. Skyline Health is built into vCenter Server, so no additional installation is required. To use its online health checks, you must participate in the Customer Experience Improvement Program (CEIP) and vCenter Server must be able to reach the Internet. Skyline Health runs about 136 health checks and presents the results grouped in categories. While browsing the results, the “Self support Diagnostics” section caught my attention, in particular “VMware Skyline Health Diagnostics”.

According to the documentation: “VMware Skyline Health Diagnostics (SHD) is VMware’s self-service diagnostics platform. It uses product logs to detect problems and provides recommendations in the form of KB articles or steps to remediate them. A vSphere Administrator can use this tool to troubleshoot before contacting the VMware Global Support Service.”

SHD can detect issues in vCenter Server, ESXi and vSAN. Some of the benefits of SHD:
1. SHD runs on-prem, it can also work offline without any internet connectivity.
2. Based on the detected symptoms, the tool provides correct VMware Knowledge Base articles/remediation steps.
3. Get recommendations for a problem from VMware support services.
4. Early recommendations and remediation help business continuity.

Read the rest of this entry »


Log Insight disable SSHD, permanently

03/03/2021

For many administrative operations, you may want to establish an SSH connection to a Log Insight host and log in as user root. However, there may be situations where SSH access needs to be shut down and enabled only for maintenance.

Before we continue, let us get one thing clear; in Linux country, ssh is the SSH Client, and sshd (ssh daemon) is the server side, accepting connections. So we want to disable sshd!

The “Secure Configuration” guide for Log Insight describes the procedure for disabling SSH. The steps to disable sshd are:

# systemctl disable sshd

Stop the sshd:

# systemctl stop sshd

Read the rest of this entry »


Log Insight REST API

27/02/2021

I am currently working on some PowerShell scripts to verify the user and group permissions on VMware products. For vCenter Server, the PowerCLI provides cmdlets to do the job. However for vRealize Operations Manager (vROPS) and Log Insight this is not the case (Yes, vROPS has some cmdlets but not for getting permissions). Luckily, both products do include a REST API, so time to investigate.

PowerShell offers two commands to interact with REST APIs: Invoke-WebRequest and Invoke-RestMethod. After reading this post from Adam Bertram I decided to give Invoke-RestMethod a try. The next question was how to start.

Reading “Introduction to PowerShell Rest API Authentication” by Joshua Stenhouse was very helpful. After some practicing with his vROPS example, which can be found here, it was time to figure out how to set up authentication for Log Insight.

A good starting point is the API documentation. Besides the documentation, the Log Insight GUI also provides access: in the upper-right corner, open the drop-down menu and select Help. On this page, you will find a link to the REST API Documentation.

The REST-API can also directly be accessed by this URL:
https://fqdnLogInsight/rest-api.

Read the rest of this entry »


vROPS Alerts, a closer look

03/02/2021

In a previous post I demonstrated a way to see the contents of REST messages.

Since the properties of the REST message were not quite clear to me, I also configured the Standard Email plugin in the Outbound Settings of vROPS in addition to the REST notification plugin and added it to the notification Settings. Alerts are now sent in 2 ways and can be compared.

As an example, here is the content of a “Virtual machine disk I/O write latency is high” alert.
BTW, my vROPS version is 8.2.0.

First, the email at the time the Alert was generated.

New alert was generated at Tue Feb 02 20:21:00 GMT 2021:
Info:DC3 VirtualMachine is acting abnormally since Tue Feb 02 20:21:00 GMT 2021 and was last updated at Tue Feb 02 20:21:00 GMT 2021

Alert Definition Name: Virtual machine disk I/O write latency is high
Alert Definition Description: Virtual machine disk I/O write latency is high
Object Name : DC3
Object Type : VirtualMachine
Alert Impact: health
Alert State : warning
Alert Type : Storage
Alert Sub-Type : Performance
Object Health State: warning
Object Risk State: info
Object Efficiency State: info
Control State: Open
Symptoms: SYMPTOM SET - self

Symptom Name: Virtual machine disk write latency at Warning level
Object Name: DC3
Object ID: 77a07045-4f83-4bb2-8dca-0c88bd237984
Metric: virtualDisk:Aggregate of all instances|totalWriteLatency_average
Message Info: 22.867 > 15.0

Recommendations:
- Use Storage vMotion to migrate this virtual machine to a different datastore with higher IOPS
- Use Storage vMotion to migrate some virtual machines to a different datastore
- Increase IOPS for the datastores connected to the virtual machine
- If the virtual machine has multiple snapshots, delete the older snapshots
- Check whether you have enabled Storage IO Control on the datastores connected to the virtual machine
Notification Rule Name: Gmail
Notification Rule Description: vROPS in Home lab
Alert ID : 42d1a03b-58b8-4c29-89df-f9c9b5d3c776
VCOps Server - vrops-1.virtual.local

Alert details

The last line of the email “Alert details” is a URL that directly opens the appropriate Alert in vROPS.
https://vrops.acme.com/ui/index.action#/object/all/77a07045-4f83-4bb2-8dca-0c88bd237984/alertsAndSymptoms/alerts/42d1a03b-58b8-4c29-89df-f9c9b5d3c776
In the URL we recognize two items:

– 77a07045-4f83-4bb2-8dca-0c88bd237984 is the Object ID, which identifies a vROPS object such as a VM or Cluster.
– 42d1a03b-58b8-4c29-89df-f9c9b5d3c776 is the unique Alert ID.
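To pull both IDs out of such a link in a script, something like the sketch below could be used. It assumes the URL always has the layout of the example above (everything after the # follows the pattern object/all/ObjectID/alertsAndSymptoms/alerts/AlertID); I have only seen this one form, and the function name is hypothetical.

```python
from urllib.parse import urlsplit


def parse_alert_url(url: str) -> dict:
    """Extract the Object ID and Alert ID from a vROPS "Alert details" URL."""
    fragment = urlsplit(url).fragment            # the part after '#'
    parts = [p for p in fragment.split("/") if p]
    # Expected: object / all / <objectId> / alertsAndSymptoms / alerts / <alertId>
    if len(parts) != 6 or parts[0] != "object" or parts[4] != "alerts":
        raise ValueError(f"unexpected alert URL layout: {url}")
    return {"object_id": parts[2], "alert_id": parts[5]}


url = ("https://vrops.acme.com/ui/index.action#/object/all/"
       "77a07045-4f83-4bb2-8dca-0c88bd237984/alertsAndSymptoms/alerts/"
       "42d1a03b-58b8-4c29-89df-f9c9b5d3c776")
ids = parse_alert_url(url)
```

The same two values come back in the REST message as "resourceId" and "alertId".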

The first REST message is a POST and looks like this:

{
  "updateDate": 1612297260600,
  "resourceId": "77a07045-4f83-4bb2-8dca-0c88bd237984",
  "adapterKind": "VMWARE",
  "Health": 2,
  "impact": "health",
  "criticality": "ALERT_CRITICALITY_LEVEL_WARNING",
  "Risk": 1,
  "resourceName": "DC3",
  "type": "ALERT_TYPE_STORAGE_PROBLEM",
  "resourceKind": "VirtualMachine",
  "alertName": "Virtual machine disk I/O write latency is high",
  "Efficiency": 1,
  "subType": "ALERT_SUBTYPE_PERFORMANCE_PROBLEM",
  "alertId": "42d1a03b-58b8-4c29-89df-f9c9b5d3c776",
  "startDate": 1612297260600,
  "info": "9027",
  "status": "ACTIVE"
}

Note that the properties have a different name compared to the email message, see the overview below:

REST property : Email field
========================================================
“updateDate”: Alert Updated
“resourceId”: Object ID
“adapterKind”: Adapter Kind, see AlertDefinition below
“Health”: Object Health State
“impact”: Alert Impact
“criticality”: Alert State
“Risk”: Object Risk State
“resourceName”: Object Name
“type”: Alert Type
“resourceKind”: Object Type
“alertName”: Alert Definition Name
“Efficiency”: Object Efficiency State
“subType”: Alert Sub-Type
“alertId”: Alert ID
“startDate”: Alert Started
“info”: ?????
“status”: Alert Status (Active or Cancelled)

– Date/time values are in Unix epoch format in milliseconds (e.g. 1612297260600 converts to Tuesday 2 February 2021 20:21:00.600 GMT).
Conversions can be done here.
– Health, Risk and Efficiency scores range from 1 to 4 (1=Info, 2=Warning, 3=Immediate, 4=Critical).
– Alert State / Criticality levels: Info, Warning, Immediate or Critical
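The notes above translate into a few lines of code. This is a small sketch, not part of vROPS: it converts the millisecond timestamps to UTC and maps the numeric Health, Risk and Efficiency scores to the labels listed above. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Score-to-label mapping as listed above.
BADGE_LEVELS = {1: "Info", 2: "Warning", 3: "Immediate", 4: "Critical"}
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_ms_to_utc(ms: int) -> datetime:
    """Convert a Unix epoch value in milliseconds to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


def describe_alert(payload: dict) -> dict:
    """Return readable dates and badge labels for a REST alert payload."""
    summary = {}
    for key in ("startDate", "updateDate", "cancelDate"):
        if key in payload:
            summary[key] = epoch_ms_to_utc(payload[key])
    for key in ("Health", "Risk", "Efficiency"):
        if key in payload:
            summary[key] = BADGE_LEVELS.get(payload[key], "Unknown")
    return summary


# Values taken from the POST message in this post.
summary = describe_alert(
    {"startDate": 1612297260600, "Health": 2, "Risk": 1, "Efficiency": 1}
)
```

For the example alert this gives a start date of 2 February 2021 20:21:00.600 UTC, Health "Warning" and Risk and Efficiency "Info", which matches the email.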

The property “info” still puzzles me. I can’t find the value anywhere. However, there does seem to be a relationship with the Alert Definition description: with identical consecutive alerts I see the same value of “info”. If someone has more information about this property, please leave a message.

The mail in which the same alert is cancelled looks like this:

Alert was cancelled at Tue Feb 02 21:01:00 GMT 2021:
Info:DC3 VirtualMachine is acting abnormally since Tue Feb 02 20:21:00 GMT 2021 and was last updated at Tue Feb 02 20:21:00 GMT 2021

Alert Definition Name: Virtual machine disk I/O write latency is high
Alert Definition Description: Virtual machine disk I/O write latency is high
Object Name : DC3
Object Type : VirtualMachine
Alert Impact: health
Alert State : warning
Alert Type : Storage
Alert Sub-Type : Performance
Object Health State: warning
Object Risk State: info
Object Efficiency State: info
Control State: Open
Recommendations:
- Use Storage vMotion to migrate this virtual machine to a different datastore with higher IOPS
- Use Storage vMotion to migrate some virtual machines to a different datastore
- Increase IOPS for the datastores connected to the virtual machine
- If the virtual machine has multiple snapshots, delete the older snapshots
- Check whether you have enabled Storage IO Control on the datastores connected to the virtual machine
Notification Rule Name: Gmail
Notification Rule Description: Hi, a new alert from vROPS in Home lab
Alert ID : 42d1a03b-58b8-4c29-89df-f9c9b5d3c776
VCOps Server - vrops-1.virtual.local

The REST message is now a PUT message and looks like this.
A new property “cancelDate” was added, and “status” has a new value.

{
  "cancelDate": 1612299660606, 
  "updateDate": 1612297260600,                              
  "resourceId": "77a07045-4f83-4bb2-8dca-0c88bd237984",     
  "adapterKind": "VMWARE",
  "Health": 2,
  "impact": "health",
  "criticality": "ALERT_CRITICALITY_LEVEL_WARNING",
  "Risk": 1,
  "resourceName": "DC3",                                    
  "type": "ALERT_TYPE_STORAGE_PROBLEM",
  "resourceKind": "VirtualMachine",
  "alertName": "Virtual machine disk I/O write latency is high",
  "Efficiency": 1,
  "subType": "ALERT_SUBTYPE_PERFORMANCE_PROBLEM",
  "alertId": "42d1a03b-58b8-4c29-89df-f9c9b5d3c776",        
  "startDate": 1612297260600,
  "info": "9027",
  "status": "CANCELED"
}
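Comparing the two messages by eye works, but a small diff function makes the differences explicit. The sketch below uses abbreviated copies of the two payloads (only a few of the properties, with the values shown in this post); `diff_payloads` is a hypothetical helper name.

```python
def diff_payloads(before: dict, after: dict) -> dict:
    """Report properties that were added, removed or changed between two payloads."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {
        k: (before[k], after[k])
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Abbreviated POST (alert raised) and PUT (alert cancelled) messages.
post_message = {
    "updateDate": 1612297260600,
    "alertId": "42d1a03b-58b8-4c29-89df-f9c9b5d3c776",
    "startDate": 1612297260600,
    "info": "9027",
    "status": "ACTIVE",
}
put_message = {
    "cancelDate": 1612299660606,
    "updateDate": 1612297260600,
    "alertId": "42d1a03b-58b8-4c29-89df-f9c9b5d3c776",
    "startDate": 1612297260600,
    "info": "9027",
    "status": "CANCELED",
}

differences = diff_payloads(post_message, put_message)
```

For these two messages the result lists "cancelDate" as added and "status" as changed from ACTIVE to CANCELED. Note that "updateDate" stays the same.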

Below, the corresponding Alert Definition in JSON format.


{
    "id": "AlertDefinition-VMWARE-VMWriteLatency",
    "name": "Virtual machine disk I/O write latency is high",
    "description": "Virtual machine disk I/O write latency is high",
    "adapterKindKey": "VMWARE",
    "resourceKindKey": "VirtualMachine",
    "waitCycles": 1,
    "cancelCycles": 1,
    "type": 18,
    "subType": 19,
    "states": [
        {
            "severity": "AUTO",
            "base-symptom-set": {
                "type": "SYMPTOM_SET",
                "relation": "SELF",
                "symptomSetOperator": "OR",
                "symptomDefinitionIds": [
                    "SymptomDefinition-VMWARE-VMWriteLatencyCritical",
                    "SymptomDefinition-VMWARE-VMWriteLatencyImmediate",
                    "SymptomDefinition-VMWARE-VMWriteLatencyWarning"
                ]
            },
            "impact": {
                "impactType": "BADGE",
                "detail": "health"
            },
            "recommendationPriorityMap": {
                "Recommendation-df-VMWARE-StorageVMotionVM": 4,
                "Recommendation-df-VMWARE-IncreaseIopsForDatastores": 5,
                "Recommendation-df-VMWARE-StorageVmotionVmIOPS": 3,
                "Recommendation-df-VMWARE-CheckStorageIOControl": 1,
                "Recommendation-df-VMWARE-DeleteOldSnapshotOfVm": 2
            }
        }
    ]
}

That is all for now. I welcome any additional information regarding this topic.


Check your Internal Certificates!

02/01/2021

In the past year I have experienced two incidents in which important applications were no longer available. In both cases the cause turned out to be an expired internal certificate. Although these incidents can be solved using KB articles, the lesson is to check these critical components at least once a year. With the start of a new year, this is a good time to pay attention to this topic. First vRealize Operations Manager (vROPS).

vROPS

Expiration of vROPS internal certificate has the following symptoms:
– Unable to log into the Admin UI.
– The cluster is Offline and you are unable to bring it Online with the message “Data Retriever is not initialized yet. Please wait.”.

The procedure to Replace expired internal certificate in vRealize Operations can be found in this KB.

The best way to check the validity of the certificate is using a browser; connect to the vROPS master node over port 6061 and check the validity of the certificate.
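If you would rather script this check than click through a browser, the sketch below is one possible approach. It is an assumption on my part, not something from the KB: it fetches the certificate with Python's ssl module and relies on the openssl command-line tool being installed on the machine running the script. The function names are hypothetical.

```python
import ssl
import subprocess
from datetime import datetime, timezone


def fetch_enddate(host: str, port: int = 6061) -> str:
    """Return the 'notAfter=...' line of the certificate served on host:port.

    Assumption: the openssl CLI is available on this machine.
    """
    pem = ssl.get_server_certificate((host, port))  # no trust validation
    result = subprocess.run(
        ["openssl", "x509", "-noout", "-enddate"],
        input=pem, capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def parse_not_after(enddate_line: str) -> datetime:
    """Parse a line such as 'notAfter=Jul  1 00:00:00 2022 GMT'."""
    value = enddate_line.strip().split("=", 1)[-1]
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(value), tz=timezone.utc)


def days_until_expiry(not_after: datetime, now: datetime = None) -> int:
    """Return the number of whole days left before the certificate expires."""
    now = now or datetime.now(timezone.utc)
    return (not_after - now).days
```

Usage would be along the lines of `days_until_expiry(parse_not_after(fetch_enddate("vrops-1.virtual.local")))`, where the host name is the vROPS master node. A negative result means the certificate has already expired.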

You can also run the following command, but I don’t consider it very useful because it doesn’t return an expiration date.

# /bin/grep -E --color=always -B1 'java.security.cert.CertPathValidatorException: validity check failed|java.security.cert.CertificateExpiredException' $ALIVE_BASE/user/log/*.log | /usr/bin/tail -20

The KB describes the procedure for replacing the internal certificate for both scenarios: “Certificate has expired” and “Certificate has not yet expired”. The procedure for “Certificate has expired” is unfortunately a bit more cumbersome to perform.

The KB mentioned before states that “Starting in vRealize Operations 8.0, a pop up is displayed in the UI, warning when certificate expiration will occur.” But better safe than sorry: perform the check at regular intervals.

vCenter Server

Another product that comes with internal certificates is vCenter Server. After expiration of the STS certificate, you cannot login to vCenter Server anymore. In some cases (see KB below for more details), the STS certificate has a lifetime of only 2 years!

VMware KB Checking Expiration of STS Certificate on vCenter Server (79248) is there to help you to identify the expiration date. Attached to the KB, you will find a Python script named checksts.py. Follow the instructions and run the script. In my case (recent vCSA 7.x), no actions are needed.

However, in case the STS certificate is expired, you will find instructions for replacing this certificate for the vCSA or a vCenter Server on Windows.

In VMware KB “Signing certificate is not valid” error in VCSA 6.5.x/6.7.x and vCenter Server 7.0.x (76719) you will find instructions and another script named fixsts.sh for replacing the STS certificate.

Steps 5 and 6 in the resolution are important: restarting the services may fail if there are other expired certificates. Step 6 presents a one-liner to do the check.

The first certificates are about to expire in July 2022.
For this case as well, there are references to KBs for replacing these certificates using the vSphere Certificate Manager. Also be aware that you may encounter the situation described in VMware KB “Failed to login to vCenter as extension, Cannot complete login due to an incorrect user name or password”, ESX Agent Manager (com.vmware.vim.eam) solution user fails to log in after replacing the vCenter Server certificates in vCenter Server 6.x (2112577).

If you are on vSphere 7 (and some editions of vSphere 6.7), there is an even more convenient option.
One of my colleagues (thank you, Joop Kramp) discovered this pre-configured vSphere Alarm in more recent versions (at least in vSphere 7).

Hopefully these checks will help you to avoid unexpected downtime of important management applications in your vSphere environment.

As always, I thank you for reading.