Skyline Health Detector

01/04/2021

Skyline is VMware’s proactive self-service support technology, available to customers with an active Production Support or Premier Services contract and based on Skyline Collector and Skyline Advisor. But there is more: in the latest vCenter Server 7.x edition you will also find Skyline Health. Skyline Health is built into vCenter Server, so no additional installation is required. To use the online health checks, you must participate in the Customer Experience Improvement Program (CEIP) and vCenter Server must be able to reach the Internet. Skyline Health runs about 136 health checks and presents the results grouped in categories. While browsing the results, the “Self support Diagnostics” section caught my attention, in particular “VMware Skyline Health Diagnostics”.

According to the documentation: “VMware Skyline Health Diagnostics (SHD) is VMware’s self-service diagnostics platform. It uses product logs to detect problems and provides recommendations in the form of KB articles or steps to remediate them. A vSphere Administrator can use this tool to troubleshoot before contacting the VMware Global Support Service.”

SHD can detect issues in vCenter Server, ESXi and vSAN. Some of the benefits of SHD:
1. SHD runs on-prem and can also work offline, without any Internet connectivity.
2. Based on the detected symptoms, the tool provides the correct VMware Knowledge Base articles or remediation steps.
3. Get recommendations for a problem from VMware support services.
4. Early recommendations and remediation help business continuity.



Log Insight disable SSHD, permanently

03/03/2021

For many administrative operations, you may want to establish an SSH connection to a Log Insight host and log in as user root. However, there may be situations where SSH access needs to be shut down and enabled only for maintenance.

Before we continue, let us get one thing clear: in Linux country, ssh is the SSH client, and sshd (the ssh daemon) is the server side, accepting connections. So we want to disable sshd!

The “Secure Configuration” guide for Log Insight describes the procedure for disabling SSH. First, disable sshd so that it no longer starts at boot:

# systemctl disable sshd

Then stop the running sshd:

# systemctl stop sshd
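
To verify the result of the two commands above, a small helper along these lines could be used. This is only a sketch: the function name sshd_state is made up for illustration, and it assumes that the service unit is called sshd, as in the commands above.

```shell
#!/usr/bin/env bash
# Report whether sshd is enabled at boot and whether it is currently running.
# sshd_state is a hypothetical helper name; the unit name "sshd" is taken
# from the commands above.
sshd_state() {
  local enabled active
  # Both subcommands exit non-zero for "disabled"/"inactive", hence "|| true".
  enabled=$(systemctl is-enabled sshd 2>/dev/null || true)
  active=$(systemctl is-active sshd 2>/dev/null || true)
  printf 'enabled=%s active=%s\n' "${enabled:-unknown}" "${active:-unknown}"
}
```

After disabling and stopping the daemon, calling sshd_state should report enabled=disabled active=inactive. For a maintenance window, systemctl start sshd starts the daemon without enabling it at boot again.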



Log Insight REST API

27/02/2021

I am currently working on some PowerShell scripts to verify the user and group permissions on VMware products. For vCenter Server, PowerCLI provides cmdlets to do the job. However, for vRealize Operations Manager (vROPS) and Log Insight this is not the case (yes, vROPS has some cmdlets, but not for getting permissions). Luckily, both products include a REST API, so it was time to investigate.

PowerShell offers two commands to interact with REST APIs: Invoke-WebRequest and Invoke-RestMethod. After reading this post from Adam Bertram, I decided to give Invoke-WebRequest a try. The next question was how to start.

Reading “Introduction to PowerShell Rest API Authentication” by Joshua Stenhouse was very helpful. After some practicing with his vROPS example, which can be found here, it was time to figure out how to set up authentication for Log Insight.

A good starting point is the API documentation. Besides the documentation, the Log Insight GUI also provides access: in the upper-right corner, open the drop-down menu and select Help. On this page, you will find a link to the REST API Documentation.

The REST API can also be accessed directly at this URL:
https://fqdnLogInsight/rest-api.



vROPS Alerts, a closer look

03/02/2021

In a previous post I demonstrated a way to see the contents of REST messages.

Since the properties of the REST message were not quite clear to me, I also configured the Standard Email plugin in the Outbound Settings of vROPS, in addition to the REST notification plugin, and added it to the Notification Settings. Alerts are now sent in two ways and can be compared.

As an example, here is the content of a “Virtual machine disk I/O write latency is high” alert.
BTW, my vROPS version is 8.2.0.

First, the email at the time the Alert was generated.

New alert was generated at Tue Feb 02 20:21:00 GMT 2021:
Info:DC3 VirtualMachine is acting abnormally since Tue Feb 02 20:21:00 GMT 2021 and was last updated at Tue Feb 02 20:21:00 GMT 2021

Alert Definition Name: Virtual machine disk I/O write latency is high
Alert Definition Description: Virtual machine disk I/O write latency is high
Object Name : DC3
Object Type : VirtualMachine
Alert Impact: health
Alert State : warning
Alert Type : Storage
Alert Sub-Type : Performance
Object Health State: warning
Object Risk State: info
Object Efficiency State: info
Control State: Open
Symptoms: SYMPTOM SET - self

Symptom Name: Virtual machine disk write latency at Warning level
Object Name: DC3
Object ID: 77a07045-4f83-4bb2-8dca-0c88bd237984
Metric: virtualDisk:Aggregate of all instances|totalWriteLatency_average
Message Info: 22.867 > 15.0

Recommendations:
- Use Storage vMotion to migrate this virtual machine to a different datastore with higher IOPS
- Use Storage vMotion to migrate some virtual machines to a different datastore
- Increase IOPS for the datastores connected to the virtual machine
- If the virtual machine has multiple snapshots, delete the older snapshots
- Check whether you have enabled Storage IO Control on the datastores connected to the virtual machine
Notification Rule Name: Gmail
Notification Rule Description: vROPS in Home lab
Alert ID : 42d1a03b-58b8-4c29-89df-f9c9b5d3c776
VCOps Server - vrops-1.virtual.local

Alert details

The last line of the email, “Alert details”, is a URL that directly opens the corresponding Alert in vROPS.
https://vrops.acme.com/ui/index.action#/object/all/77a07045-4f83-4bb2-8dca-0c88bd237984/alertsAndSymptoms/alerts/42d1a03b-58b8-4c29-89df-f9c9b5d3c776
In the URL we recognize two items:

– 77a07045-4f83-4bb2-8dca-0c88bd237984 is the Object ID, which identifies a vROPS object like a VM or Cluster.
– 42d1a03b-58b8-4c29-89df-f9c9b5d3c776 is the unique Alert ID.

The first REST message is a POST and looks like this:

{
  "updateDate": 1612297260600,
  "resourceId": "77a07045-4f83-4bb2-8dca-0c88bd237984",
  "adapterKind": "VMWARE",
  "Health": 2,
  "impact": "health",
  "criticality": "ALERT_CRITICALITY_LEVEL_WARNING",
  "Risk": 1,
  "resourceName": "DC3",
  "type": "ALERT_TYPE_STORAGE_PROBLEM",
  "resourceKind": "VirtualMachine",
  "alertName": "Virtual machine disk I/O write latency is high",
  "Efficiency": 1,
  "subType": "ALERT_SUBTYPE_PERFORMANCE_PROBLEM",
  "alertId": "42d1a03b-58b8-4c29-89df-f9c9b5d3c776",
  "startDate": 1612297260600,
  "info": "9027",
  "status": "ACTIVE"
}

Note that the properties have different names than in the email message; see the overview below:

“REST” : email
========================================================
“updateDate”: Alert Updated
“resourceId”: Object ID
“adapterKind”: Adapter Kind, see AlertDefinition below
“Health”: Object Health State
“impact”: Alert Impact
“criticality”: Alert State
“Risk”: Object Risk State
“resourceName”: Object Name
“type”: Alert Type
“resourceKind”: Object Type
“alertName”: Alert Definition Name
“Efficiency”: Object Efficiency State
“subType”: Alert Sub-Type
“alertId”: Alert ID
“startDate”: Alert Started
“info”: ?????
“status”: Alert Status (Active or Cancelled)

– Date/time values are in Unix epoch format, in milliseconds (e.g. 1612297260600 converts to Tuesday 2 February 2021 20:21:00.600 GMT).
Conversions can be done here.
– Health, Risk and Efficiency scores range from 1 to 4 (1=Info, 2=Warning, 3=Immediate, 4=Critical).
– Alert State / Criticality levels: Info, Warning, Immediate or Critical.
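
The conversions in the notes above can also be done on the command line. The following is a minimal sketch, assuming GNU date (which accepts the "@seconds" input form); the function names epoch_ms_to_utc and score_label are made up for illustration.

```shell
#!/usr/bin/env bash
# Convert a timestamp from the REST message (Unix epoch in milliseconds)
# to a readable UTC date. Assumes GNU date.
epoch_ms_to_utc() {
  local ms="$1"
  local secs=$(( ms / 1000 ))   # whole seconds
  local frac=$(( ms % 1000 ))   # remaining milliseconds
  printf '%s.%03d UTC\n' "$(date -u -d "@${secs}" '+%Y-%m-%d %H:%M:%S')" "$frac"
}

# Translate a Health, Risk or Efficiency score (1 to 4) into its label.
score_label() {
  case "$1" in
    1) echo "Info" ;;
    2) echo "Warning" ;;
    3) echo "Immediate" ;;
    4) echo "Critical" ;;
    *) echo "Unknown" ;;
  esac
}
```

For example, epoch_ms_to_utc 1612297260600 prints 2021-02-02 20:21:00.600 UTC, which matches the time in the email, and score_label 2 prints Warning.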

The property “info” still puzzles me. I can’t find the value anywhere. However, there does seem to be a relationship with the Alert Definition description: with identical consecutive alerts I see the same value of “info”. If someone has more information about this property, please leave a message.

The mail in which the same alert is cancelled looks like this:

Alert was cancelled at Tue Feb 02 21:01:00 GMT 2021:
Info:DC3 VirtualMachine is acting abnormally since Tue Feb 02 20:21:00 GMT 2021 and was last updated at Tue Feb 02 20:21:00 GMT 2021

Alert Definition Name: Virtual machine disk I/O write latency is high
Alert Definition Description: Virtual machine disk I/O write latency is high
Object Name : DC3
Object Type : VirtualMachine
Alert Impact: health
Alert State : warning
Alert Type : Storage
Alert Sub-Type : Performance
Object Health State: warning
Object Risk State: info
Object Efficiency State: info
Control State: Open
Recommendations:
- Use Storage vMotion to migrate this virtual machine to a different datastore with higher IOPS
- Use Storage vMotion to migrate some virtual machines to a different datastore
- Increase IOPS for the datastores connected to the virtual machine
- If the virtual machine has multiple snapshots, delete the older snapshots
- Check whether you have enabled Storage IO Control on the datastores connected to the virtual machine
Notification Rule Name: Gmail
Notification Rule Description: Hi, a new alert from vROPS in Home lab
Alert ID : 42d1a03b-58b8-4c29-89df-f9c9b5d3c776
VCOps Server - vrops-1.virtual.local

The REST message is now a PUT message and looks like this.
A new property, “cancelDate”, has been added, and “status” has a new value.

{
  "cancelDate": 1612299660606, 
  "updateDate": 1612297260600,                              
  "resourceId": "77a07045-4f83-4bb2-8dca-0c88bd237984",     
  "adapterKind": "VMWARE",
  "Health": 2,
  "impact": "health",
  "criticality": "ALERT_CRITICALITY_LEVEL_WARNING",
  "Risk": 1,
  "resourceName": "DC3",                                    
  "type": "ALERT_TYPE_STORAGE_PROBLEM",
  "resourceKind": "VirtualMachine",
  "alertName": "Virtual machine disk I/O write latency is high",
  "Efficiency": 1,
  "subType": "ALERT_SUBTYPE_PERFORMANCE_PROBLEM",
  "alertId": "42d1a03b-58b8-4c29-89df-f9c9b5d3c776",        
  "startDate": 1612297260600,
  "info": "9027",
  "status": "CANCELED"
}

Below, the corresponding Alert Definition in JSON format.

{
    "id": "AlertDefinition-VMWARE-VMWriteLatency",
    "name": "Virtual machine disk I/O write latency is high",
    "description": "Virtual machine disk I/O write latency is high",
    "adapterKindKey": "VMWARE",
    "resourceKindKey": "VirtualMachine",
    "waitCycles": 1,
    "cancelCycles": 1,
    "type": 18,
    "subType": 19,
    "states": [
        {
            "severity": "AUTO",
            "base-symptom-set": {
                "type": "SYMPTOM_SET",
                "relation": "SELF",
                "symptomSetOperator": "OR",
                "symptomDefinitionIds": [
                    "SymptomDefinition-VMWARE-VMWriteLatencyCritical",
                    "SymptomDefinition-VMWARE-VMWriteLatencyImmediate",
                    "SymptomDefinition-VMWARE-VMWriteLatencyWarning"
                ]
            },
            "impact": {
                "impactType": "BADGE",
                "detail": "health"
            },
            "recommendationPriorityMap": {
                "Recommendation-df-VMWARE-StorageVMotionVM": 4,
                "Recommendation-df-VMWARE-IncreaseIopsForDatastores": 5,
                "Recommendation-df-VMWARE-StorageVmotionVmIOPS": 3,
                "Recommendation-df-VMWARE-CheckStorageIOControl": 1,
                "Recommendation-df-VMWARE-DeleteOldSnapshotOfVm": 2
            }
        }
    ]
}

That is all for now. I welcome any additional information regarding this topic.


Check your Internal Certificates!

02/01/2021

In the past year I have experienced two incidents in which important applications were no longer available. In both cases the cause turned out to be an expired internal certificate. Although these incidents can be solved using KB articles, the lesson is to check these critical components at least once a year. With the start of a new year, this is a good time to pay attention to this topic. First vRealize Operations Manager (vROPS).

vROPS

Expiration of the vROPS internal certificate has the following symptoms:
– Unable to log into the Admin UI.
– The cluster is Offline and you are unable to bring it Online, with the message “Data Retriever is not initialized yet. Please wait.”.

The procedure to replace an expired internal certificate in vRealize Operations can be found in this KB.

The best way to check the validity of the certificate is with a browser: connect to the vROPS master node over port 6061 and check the validity of the certificate.

You can also run the following command, but I don’t consider it very useful because it doesn’t return an expiration date.

# /bin/grep -E --color=always -B1 'java.security.cert.CertPathValidatorException: validity check failed|java.security.cert.CertificateExpiredException' $ALIVE_BASE/user/log/*.log | /usr/bin/tail -20
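
As an alternative to the browser check, the expiration date can probably also be read from the command line. The sketch below is not taken from the KB: it assumes that the openssl command-line tool and GNU date are available on the machine you run it from, and that port 6061 answers a TLS handshake the same way it does for a browser. The function names cert_enddate and days_left are made up for illustration.

```shell
#!/usr/bin/env bash
# Print the "notAfter=..." line of the certificate served on host:port.
# Assumption: the openssl command-line tool is installed.
cert_enddate() {
  local host="$1" port="$2"
  echo | openssl s_client -connect "${host}:${port}" 2>/dev/null \
    | openssl x509 -noout -enddate
}

# Number of whole days between now and the notAfter date.
# An epoch value can be passed as second argument instead of "now".
# Assumption: GNU date. A negative result means the certificate has expired.
days_left() {
  local end="${1#notAfter=}"           # strip the "notAfter=" prefix
  local now="${2:-$(date +%s)}"
  local end_epoch
  end_epoch=$(date -u -d "$end" +%s) || return 1
  echo $(( (end_epoch - now) / 86400 ))
}
```

A possible use is days_left "$(cert_enddate vrops-1.virtual.local 6061)", with the host name replaced by your own master node. Run from a scheduled job, this gives an early warning well before the certificate expires.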

The KB describes the procedure for replacing the internal certificate for both scenarios: “Certificate has expired” and “Certificate has not yet expired”. Unfortunately, the procedure for “Certificate has expired” is a bit more cumbersome to perform.

The KB mentioned before states that “Starting in vRealize Operations 8.0, a pop up is displayed in the UI, warning when certificate expiration will occur.” But better safe than sorry: perform the check at regular intervals.

vCenter Server

Another product that comes with internal certificates is vCenter Server. After expiration of the STS certificate, you can no longer log in to vCenter Server. In some cases (see the KB below for more details), the STS certificate has a lifetime of only 2 years!

VMware KB “Checking Expiration of STS Certificate on vCenter Server” (79248) helps you identify the expiration date. Attached to the KB, you will find a Python script named checksts.py. Follow the instructions and run the script. In my case (a recent vCSA 7.x), no action was needed.

However, if the STS certificate has expired, you will find instructions for replacing this certificate on the vCSA or on a vCenter Server on Windows.

In VMware KB “Signing certificate is not valid” error in VCSA 6.5.x/6.7.x and vCenter Server 7.0.x (76719), you will find instructions and another script, named fixsts.sh, for replacing the STS certificate.

Steps 5 and 6 in the resolution are important: the restart of services may fail if there are other expired certificates. Step 6 presents a one-liner to do the check.

The first certificates are about to expire in July 2022.
In this case too, there are references to KBs for replacing these certificates using the vSphere Certificate Manager. Also be aware that you may encounter the situation described in VMware KB “Failed to login to vCenter as extension, Cannot complete login due to an incorrect user name or password”, ESX Agent Manager (com.vmware.vim.eam) solution user fails to log in after replacing the vCenter Server certificates in vCenter Server 6.x (2112577).

If you are on vSphere 7 (and some editions of vSphere 6.7), there is an even more convenient option.
One of my colleagues (thank you, Joop Kramp) discovered a pre-configured vSphere Alarm for this, available in more recent versions (at least in vSphere 7).

Hopefully these checks will help you to avoid unexpected downtime of important management applications in your vSphere environment.

As always, I thank you for reading.


Tips for vRealize Log Insight or Operations Manager

19/12/2020

Recently I wanted to test from a vROPS host whether certain ports on a remote system were reachable. In the good old days, a telnet client was a useful tool for this.
Today, appliances like the vCSA, vRealize Log Insight and vRealize Operations Manager (vROPS) are based on Photon OS and contain few unnecessary utilities (which is a good thing by the way, because what is not there, you do not have to maintain).
Another widely used utility, netcat, is also not present in vROPS (although it is still present in Log Insight).
Fortunately, Photon OS is based on Linux and the following trick, a feature of the bash shell, is available to check ports anyway.

To start, log in with an SSH client to a vROPS host as user root.
The following command will test a port for a given host:

# cat < /dev/tcp/<hostname or IP>/<port>

If the requested port is not reachable, a response like the following is received:

root@vrli-1 [ ~ ]# cat < /dev/tcp/192.168.100.51/222
-bash: connect: Connection timed out
-bash: /dev/tcp/192.168.100.51/222: Connection timed out

On the other hand, a successful connection will often return no response and should be ended by pressing [Ctrl]+C, or it will show something like this:

root@vrli-1 [ ~ ]# cat < /dev/tcp/192.168.100.51/22
SSH-2.0-OpenSSH_8.1
^C
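
To avoid waiting for the time-out or having to press [Ctrl]+C, the same /dev/tcp trick can be wrapped in a small function. This is only a sketch: it assumes that the coreutils timeout command is present on the appliance, and the function name check_port is made up for illustration.

```shell
#!/usr/bin/env bash
# Test a TCP port via bash's /dev/tcp, with a time limit in seconds
# (default 3). Assumption: the coreutils "timeout" command is available.
check_port() {
  local host="$1" port="$2" wait="${3:-3}"
  # The child bash only opens the connection on file descriptor 3 and exits;
  # host and port are passed as $0 and $1 to avoid quoting problems.
  if timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null; then
    echo "${host}:${port} reachable"
  else
    echo "${host}:${port} unreachable"
    return 1
  fi
}
```

For example, check_port 192.168.100.51 22 reports whether the SSH port from the example above is reachable, and check_port 192.168.100.51 222 5 gives up after five seconds.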

Another tip: while investigating the options for sending Alerts from vROPS or Log Insight to a REST-enabled application, it may be necessary to gain more insight into the structure and content of the messages. VMware does provide some information about the content, but when you want to see it for yourself, there are some useful tools. One of these tools is webhook.site. A prerequisite is that the application has Internet access. For this reason, consider the (temporary) installation of a vROPS or Log Insight instance in a test environment.
Using webhook.site is very simple:
1. Go to https://webhook.site ; a unique URL is prepared for you.
2. Copy “Your unique URL”.
3. Follow the instructions given in the VMware documentation for the configuration of the REST plugin.
4. For URL, paste the unique URL and under Content type, select: application/JSON.
5. Press TEST to test the configuration.

In vROPS, you will receive the message “Validate Connection – Test connection successful.” if everything goes well. Now switch to webhook.site.

Here, you can review the structure and content of the alert sent by vROPS or Log Insight, along with other details. You can also export the alert in cURL or HAR format; see the “Export as” option in the top row.

As always, I thank you for reading.


vROPS, how to disable tagging

09/12/2020

Just a quick write-up for my own convenience. Recently I came across a situation where it was necessary to temporarily disable tagging on a vRealize Operations Manager 8.x (vROPS) cluster, due to an issue with a connected vCenter Server. I won’t go further into the reasons here.

However, I did learn something useful about changing this setting.

The steps to disable tagging for a vROPS node are pretty simple:

  1. Log in to the vROPS console as user root.
  2. Add the following line:

    tagEnabled=false

    to the file /usr/lib/vmware-vcops/user/plugins/inbound/vmwarevi_adapter3/conf/vmware.properties.
  3. Restart the collector process with this command:

    # service vmware-vcops restart collector
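
Step 2 can also be scripted so that running it twice does not add the line twice. The following is a minimal sketch; the function name set_property is made up for illustration, it assumes simple key=value lines and GNU sed, and it does not restart the collector for you.

```shell
#!/usr/bin/env bash
# Set key=value in a properties file: replace the line if the key already
# exists, otherwise append it. Hypothetical helper, not part of vROPS.
# Assumes the key and value contain no characters special to sed or grep.
set_property() {
  local file="$1" key="$2" value="$3"
  if grep -q "^${key}=" "$file"; then
    sed -i "s|^${key}=.*|${key}=${value}|" "$file"
  else
    # Make sure the file ends with a newline before appending.
    if [ -n "$(tail -c1 "$file")" ]; then echo >> "$file"; fi
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}
```

On a vROPS node, the call would be set_property with the path of the vmware.properties file from step 2, followed by tagEnabled false. Making a copy of the file first is a sensible precaution.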

If you have a single vROPS server, you are good to go. However, if you have a vROPS cluster with a master node, data nodes and remote collectors, then take some extra care.

In my case, I started with one of the remote collector (RC) nodes, made the change and noticed that a few minutes after restarting the service, the extra line had disappeared from the vmware.properties file.

The reason for this behavior is that, during a restart, remote collectors inherit the content of vmware.properties from the master node.

So, first disable tagging on the master node and restart the collector process there, then restart the collector process on the remote collectors.

The takeaway is that remote collectors inherit settings from the master node during a restart.

Finally, after an upgrade of vROPS, check the vmware.properties file; more than once, I found that previously made changes had disappeared after the upgrade.

That’s all for now. As always, I thank you for reading.

Photo by Vitaly Vlasov via Pexels


vSphere LCM – updating Images

06/12/2020

With the release of vSphere 7.0, the vSphere Update Manager has been transformed into vSphere Lifecycle Manager (LCM). The biggest improvement is the introduction of managing clusters with images, alongside the familiar baselines concept. This post by Steven Bright is an excellent introduction to the concept of images and how to set them up. Of course, there is also the official VMware documentation about LCM.
So after reading Steven’s post, I followed the instructions, created an image for ESXi 7.0 Update 1 (build 16850804), added the USB Fling as an additional component and upgraded the NUCs in my home lab without any problem.
Recently ESXi 7.0U1a and 7.0U1b were released, which raised the question: how do you update the image?

Screenshot taken after the upgrade…

Under Cluster > Updates > Hosts > Image, next to the EDIT option, is the “Check for recommended images” option. This reported no new images to download, so it was time for something else.

First, download the ESXi depot file (I use https://my.vmware.com/group/vmware/patch#search for searching and downloading patches), in my case: VMware-ESXi-7.0U1b-17168206-depot.zip.

In the vSphere Client, go to Lifecycle Manager, select Import Updates under Actions and import the depot file. After a few moments, the latest updates will appear under Image Depot, ESXi Versions. As updates are cumulative, note that update 7.0 U1a is now available as well.

Update 7.0 U1a and 7.0 U1b added…

Now go back to Cluster > Updates, select Image and select EDIT. Under ESXi Version, the latest updates are now available. Select the version of your choice (in my case 7.0 U1b).

Select the new ESXi version …

Note that the other components are still in the image and (if needed) can also be changed. When you are done, press VALIDATE to validate the image; if everything is OK, SAVE the image.

Almost immediately, LCM will notify you that the cluster is not compliant and you can start the remediation.

As always, I thank you for reading.


Home lab refresh

10/10/2020

Since 2010, my VMware home lab had been running on two servers: an HP ProLiant ML 110 G5 and an ML 110 G6. First, the G5 was taken out of active duty because of its 8 GB memory limit. Fortunately, it was possible to upgrade the memory of the G6 from the supported 16 GB to 32 GB, so the G6 remained usable for quite some time, for labs with a vCenter Server and 3 virtual ESXi hosts. Recently, it became too tedious to run the latest vSphere editions.

A home lab is a valuable resource for various reasons. When preparing for a VMware exam, like the Datacenter VCP, you can practice installation and configuration of ESXi and vCenter Server, but also of other tools like NSX, vROPS or Log Insight. A home lab is also useful for investigations which cannot be done at work in a production environment, to practice changes or upgrades and, last but not least, for break and fix (one of my favorite use cases and highly educational).

If you want to practice with vSphere and other products, there are several options, which mainly depend on available budget, but also on other factors. The possibilities vary from a lab-in-the-cloud such as VMware Hands On Labs to VMware Workstation or a 19-inch rack filled with servers and switches. In my situation, the decisive factors were limited space (I live in an apartment), low noise production, low energy consumption and the requirement to run a nested ESXi cluster with tools like Log Insight and vROPS. For a full vSphere 7 plus Kubernetes lab, however, a reasonable amount of hardware is required!

The old and the new, small but powerful

After some searching on the Internet, you will soon come across the Intel NUCs. Although not listed on the official VMware HCL, they are beloved by the community, see here and here.

Intel NUCs currently support 64 GB of memory. Besides an i3, the tenth generation is available with an i5 (4 cores) and an i7 (6 cores). My choice fell on the i5 (budget). Intel NUCs come with a processor, but without memory and disk(s); the final composition can be found on my Gear page.

The set-up of the Intel NUCs is not difficult; on the previously mentioned blogs, Virtuallyghetto.com and Virten.net, you can find enough information for a successful installation.

The NUCs are installed with the latest ESXi 7.0 and are managed by a vCSA. To support the deployment of vSphere 6.7 and 7.0 labs, I use two Windows domain controllers (DNS and DHCP), a Windows scripting host and a pfSense firewall. For the deployment of the labs I gratefully use the nested ESXi appliances and the deployment scripts provided by William Lam. With these, a complete environment is available in no time.


PowerShell Tips 1

06/06/2020

As you probably know, PowerShell is built on .NET. To be more precise, Windows PowerShell is built on the .NET Framework, whereas PowerShell Core is built on .NET Core.

When you work with PowerShell, in many cases you won’t be very concerned about this fact, but in some cases you can’t ignore it.

The other day I was working on a PowerCLI script to get and set the log forwarding for a vCenter Server Appliance (vCSA), see also this older post.
The “get” part worked well. To retrieve the hostname, port and protocol of the log forwarding servers, run the following line of code:

(Get-CisService -name 'com.vmware.appliance.logging.forwarding').get()

For the set part, I created:


$spec = New-Object PSObject -Property @{
	hostname="logger1.net"
	port=514
	protocol="UDP"
}

(Get-CisService -name 'com.vmware.appliance.logging.forwarding').set($spec)

However, this failed with the following error message:

Parameter 'cfg_list' expects values of type  'System.Collections.Generic.List`1[[System.Management.Automation.PSObject, 
System.Management.Automation, Version=3.0.0.0, Culture=neutral, PublicKeyToken=31bf3856ad364e35]]' 
but received value of type 'System.Management.Automation.PSObject'.
At line:1 char:1
+ (Get-CisService -name 'com.vmware.appliance.logging.forwarding').set( ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    + CategoryInfo          : OperationStopped: (:) [], CisException
    + FullyQualifiedErrorId : VMware.VimAutomation.Cis.Core.Types.V1.CisException

From the documentation, I already knew that a vCSA supports a total of 3 log forwarding hosts, hence the ‘cfg_list’. But how to interpret this error message? In other words: the set method expects a generic list of PSObjects, even for a single log server, while I passed a single PSObject. Luckily my colleague Bouke (you can see what is on his mind on https://www.jume.nl ) quickly showed me the solution: creating the variable with the correct type.

The following piece of code does the ‘set’ job. The solution is in the first line: setting the correct type (variable $speclist) for the ‘cfg_list’ parameter.


$speclist = [System.Collections.Generic.List[PSobject]]::new()

$spec = New-Object PSObject -Property @{
	hostname="logger1.net"
	port=514
	protocol="UDP"
}
$speclist.add($spec)
$spec = New-Object PSObject -Property @{
	hostname="logger2.net"
	port=514
	protocol="UDP"
}
$speclist.add($spec)

(Get-CisService -name 'com.vmware.appliance.logging.forwarding').set($speclist)


As always, I thank you for reading.