Aria Suite Lifecycle – Error LCMVRLICONF40004

Recently, I wanted to start a belated upgrade of an Aria Operations for Logs cluster using VMware Aria Suite Lifecycle. The first step is running an inventory sync. Within seconds, error LCMVRLICONF40004 was presented.
Fig. 1
The “invalid hostname” message was not very helpful, but the “show more” section provided the following information:

com.vmware.vrealize.lcm.vrli.
Exception: Cannot execute ssh commands.
Exception encountered : Session.connect: java.security.spec.
at com.vmware.vrealize.lcm.
at com.vmware.vrealize.lcm.
at java.base/java.util.
at java.base/java.util.
at java.base/java.lang.Thread.

Based on the “Cannot execute ssh commands” message, the first step was to check whether the nodes of the Aria Operations for Logs cluster were reachable and whether the root password used by Lifecycle was correct. Result: everything OK.
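The reachability part of this check can be scripted. The following is a minimal sketch in Python and not part of the original troubleshooting; the function name `ssh_port_reachable` is my own. It only tests whether TCP port 22 accepts a connection and does not log in.

```python
import socket


def ssh_port_reachable(host, port=22, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and opens the TCP connection
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers DNS failures, refused connections and timeouts
        return False
```

Call it once per cluster node, for example `ssh_port_reachable("logs-node1.example.com")`, where the node name is a placeholder. A positive result only proves that the port is open. It says nothing about whether both sides can agree on SSH ciphers and MACs.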

As a next step, after some googling, I found the VMware KB “InvalidKeySpecException Error Code : ‘LCMVRNICONFIG90115’ when performing inventory sync in Aria Suite Lifecycle Manager Inventory Sync for Aria Operations for Networks (96553)”, which contains a reference to error LCMVRLICONF40004.

The KB reveals the cause of the issue: “Recent Aria Suite Lifecycle PSPACKs specifically version 8.14 Pspack 4 and above have hardened the SSH settings on the Aria Suite Lifecycle appliance. This can cause communication issues for products which do not support any of the newer macs or ciphers.” So the cause is clear.

For the final fix, open KB “Steps for removing weak SHA1 algorithms and ciphers from VMware Aria Products (95835)”, mentioned here.

From the second KB, follow the instructions in section “VMware Aria Operations for Logs”, which point to the KB “Remove SHA1 from SSH service in VMware Aria Operations for Logs 8.12.x and 8.14.x (95974)”. This third KB finally contains the steps that solve the issue. For VMware Aria Operations for Logs version 8.12, follow the steps listed there. It comes down to saving the current version of the /etc/ssh/sshd_config file and making some modifications as described. Do not forget to restart the sshd daemon and to repeat the steps on all nodes of the VMware Aria Operations for Logs cluster.
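Saving the current version of the configuration file before editing it is the step that is easiest to forget. As an illustration only, here is a small Python sketch that makes a timestamped copy next to the original. The helper name `backup_file` and the `.bak-<timestamp>` suffix are my own choices and do not come from the KB; the actual changes to the file must be taken from the KB itself.

```python
import shutil
import time
from pathlib import Path


def backup_file(path):
    """Copy path to <name>.bak-YYYYmmdd-HHMMSS in the same directory.

    Returns the Path of the copy. The original file is left untouched.
    """
    source = Path(path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = source.with_name(f"{source.name}.bak-{stamp}")
    # copy2 copies the content and also preserves permissions and timestamps
    shutil.copy2(source, target)
    return target
```

Run it as root on each node with the path of the sshd_config file as argument, and keep the returned path so that the copy can be restored if sshd refuses to start after the change.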

Now you should be able to successfully update the VMware Aria Operations for Logs cluster.

Aria Operations for Logs 8.14.1

VMware recently published VMware Aria Operations for Logs version 8.14 (8.14 Build 22564181, released on 19-10-2023).
Like most updates, this one comes with functional improvements and security enhancements.
In my opinion, the main reason to install this version is that it fixes a serious vulnerability (CVSS score 8.1) that is present in previous versions since 8.6. See this bulletin for more information.

For a complete overview of the enhancements, see the 8.14 release notes. The most prominent are:

  • The product now supports the external NSX Advanced Load Balancer, providing more load-balancing configurations than the Aria Operations internal Load Balancer.
  • New content packs have been added for OpenShift and Tanzu Kubernetes Grid.


Aria Operations for Logs, found a bug in Access Control

It looks like VMware Aria Operations for Logs (previously Log Insight) contains a bug in the Access Control part. Let me explain. Access is granted by assigning a role to users or groups. A role is a group of permissions that ultimately determines what a user is allowed to do.
For more information about the roles and permissions in Aria Operations for Logs, read my post.

Fig. 1 – Four predefined Roles


Aria Operations for Logs API and Workspace ONE issue

A brief description: recently I ran into an issue when working with an Aria Operations for Logs cluster configured to use Workspace ONE Access, enabled for Active Directory support, as an authentication source. Logging in via the GUI was not the problem; the issue appeared while I was updating the scripts described in this post to work with Workspace ONE Access.

When working with the Aria Operations for Logs API, the first step is to authenticate. Authentication requires three parameters: username, password and provider.

The username and password do not require further explanation, although the username needs some attention in certain cases.
The provider refers to one of the three supported authentication sources: Local (in case the local admin account is used), Active Directory (in case Active Directory support has been configured) and Workspace ONE Access.

The first part of the solution: the name of the provider must be written with capital letters in the right places, that is “Local”, “ActiveDirectory” and “vIDM” for Workspace ONE Access.

The second part: pay attention to the username. For example, when Workspace ONE Access has been enabled for an Active Directory domain named “acme.com”, the username should be something like “user@acme.com”. The notation “acme\user” will not work.
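Both pitfalls can be caught before a request is sent. The following Python sketch builds the JSON body with the three parameters and rejects the two mistakes described above. The function name `build_login_payload` is hypothetical. I am also assuming that the body is sent with a POST request to the sessions endpoint of the API (`/api/v1/sessions`); verify this against the API documentation of your version.

```python
import json

# Provider names as listed above; the spelling is case-sensitive.
VALID_PROVIDERS = ("Local", "ActiveDirectory", "vIDM")


def build_login_payload(username, password, provider):
    """Return the JSON body for an authentication request.

    Raises ValueError for a wrongly spelled provider or for a
    domain\\user style username in combination with vIDM.
    """
    if provider not in VALID_PROVIDERS:
        raise ValueError(
            f"provider must be one of {VALID_PROVIDERS}, got {provider!r}"
        )
    if provider == "vIDM" and "\\" in username:
        raise ValueError("use the user@domain notation, not domain\\user")
    return json.dumps(
        {"username": username, "password": password, "provider": provider}
    )
```

For example, `build_login_payload("user@acme.com", "secret", "vIDM")` returns a body that can be posted, while `"vidm"` as provider or `"acme\\user"` as username raises an error before anything goes over the wire.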

I hope you can benefit from this, thank you for reading.

Log Insight – Cassandra 101

Intro

Recently, I was involved in some Log Insight incidents. To successfully resolve them, an action in Log Insight’s Cassandra database was required. Besides application logic, Cassandra, the distributed NoSQL database, is an important component of Log Insight. So, time for a closer look at Cassandra in Log Insight.

Warning: be careful after logging in to the database, and never execute commands without knowing their impact. Certain commands can render Log Insight unusable!

Access

For this article, I am using Log Insight version 8.10. VMware has published KB 57901 “How to Access the Cassandra Database in vRealize Log Insight”, which contains instructions on how to access the Cassandra database. The first requirement is to start an SSH session to a Log Insight node and log in as user root. In older versions of Log Insight, the process was a bit more cumbersome and required retrieving the password; since version 8.8, a single command suffices:

root@li-vip3 [ ~ ]# cqlsh-no-pass
Connected to loginsight at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.11 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
lisuper@cqlsh> 

Now cqlsh, the CLI for interacting with Cassandra using the Cassandra Query Language (CQL), has been started as user “lisuper”.


vRSLCM – Exception while validating for Scale-Out

Recently, I have been exploring the capabilities of VMware’s vRealize Suite Lifecycle Manager (from now on vRSLCM). vRSLCM is a product for the deployment, configuration, upgrading and patching, scale-up and scale-out of VMware products such as vRealize Automation, Orchestrator, Operations Manager, Network Insight, Log Insight and Business for Cloud. See this link for more information.

I usually do this by first installing the product and then running various scenarios, such as this one for Log Insight:
1. Deploy a 3-node Log Insight cluster, version 8.6.0
2. Upgrade to version 8.6.2
3. Scale-Out, by adding an extra worker node to the cluster

My first attempt at adding an extra worker node to the existing Log Insight cluster ended, after choosing “ADD COMPONENTS”, with the following message:

Fig. 1

The existing Log Insight nodes were up and running, so what happened?
The log file /var/log/vrlcm/vmware_vrlcm.log was very useful and explains what happens during this action. See the following lines:

2022-05-26 12:57:06.750 INFO  [pool-3-thread-20] c.v.v.l.p.v.VrliScaleoutOvaValidationTask -  -- vRLI instance :: {
  "vrliHostName" : "vRLI-1.virtual.local",
  "port" : "9543",
  "username" : "admin",
  "password" : "JXJXJXJX",
  "provider" : "Local",
  "agentId" : null
}
2022-05-26 12:57:09.817 ERROR [pool-3-thread-20] c.v.v.l.d.v.r.c.VRLIRestClient -  -- Failed to get the vRLI authentication token. No route to host (Host unreachable)
2022-05-26 12:57:12.889 ERROR [pool-3-thread-20] c.v.v.l.p.v.VrliScaleoutOvaValidationTask -  -- Exception while validating the vRLI VA OVA for Scaleout : 

Port 9543 is used for communication with the Log Insight API. The “Failed to get the vRLI authentication token” message makes it clear that communication with the primary node, named vRLI-1.virtual.local, is not possible, hence the “No route to host”. A ping from vRSLCM to the primary node by hostname yields no results, which confirms that the DNS registration has gone haywire.

After the DNS registration is restored, the primary node is resolvable again and the scale-out can be continued.

Bottom line: when you see this message, check DNS and/or network connectivity to the targets.
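The DNS part of that check is easy to script. Below is a minimal Python sketch, with a function name (`resolve_host`) of my own choosing, that returns the addresses a hostname resolves to, or an empty list when resolution fails. It should be run from a machine that uses the same DNS servers as vRSLCM, otherwise the result says little.

```python
import socket


def resolve_host(hostname):
    """Return the sorted, unique IP addresses for hostname.

    An empty list means the name could not be resolved.
    """
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # each entry is (family, type, proto, canonname, sockaddr);
    # the address string is the first element of sockaddr
    return sorted({info[4][0] for info in infos})
```

In the situation described above, `resolve_host("vRLI-1.virtual.local")` would have returned an empty list until the DNS registration was restored.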

Log Insight Duplicate Webhooks

After upgrading Log Insight to version 8.4.x, something strange happened. Let me explain.
Log Insight can forward Alerts for further processing via a Webhook. For some information about the usage of Webhooks, see this post.

Webhooks are configured separately and can then be selected as a notification option in the configuration of an Alert, with or without email address(es).

Fig. 1 – Alert with email configured, webhook not yet selected


Log Insight node fails to start

Recently I experienced the following situation: the primary node of a 3-node Log Insight cluster (version 8.4.1) would not start. The OS started, but the loginsight service was stuck in an endless start/stop loop.

It was also not possible to log in to the primary node’s web interface. The other cluster nodes were still accessible via the internal load balancer. After logging in, it appeared that the status of the cluster (Administration > Cluster Nodes) was also unavailable.

Time to set up an SSH session to the impacted node and examine the log files.

In the /storage/var/loginsight/runtime.log, I noticed this event, which does not come as a surprise:

[2021-10-26 11:03:48.462+0200]
["main"/10.11.12.13 FATAL]
[com.vmware.loginsight.daemon.LogInsightDaemon] 
[Error starting services]
com.vmware.loginsight.daemon.LogInsightDaemon$StartupFailedException:
Daemon startup failed: 
All host(s) tried for query failed (tried: /10.11.12.13:9042
(com.datastax.driver.core.exceptions.TransportException: 
[/10.11.12.13:9042] Cannot connect))

Note: 10.11.12.13 being the IP address of the failed node.

In the /storage/var/loginsight/cassandra.log, I noticed this event:

ERROR [HintsDispatcher:1] 2021-10-26 09:52:34,026 
HintsDispatchExecutor.java:243 - 
Failed to dispatch hints file 8a6f9de8-5d96-455d-a709-3e9d54826031-1634732135629-1.hints:
file is corrupted ({})

So it seems that the hints file is corrupt. Some research on Google shows that in that case the hints file should be deleted.

The hints file is part of the Cassandra database and can be found in folder:

/usr/lib/loginsight/application/lib/apache-cassandra-3.11.10/data/hints

and has a name like: 8a6f9de8-5d96-455d-a709-3e9d54826031-1634732135629-1.hints
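When the log is long, the names of the affected hints files can be pulled out with a short script. This Python sketch is my own addition; it assumes the error message has the wording shown above and accepts the message either on one line or wrapped over two.

```python
import re

# Matches the HintsDispatchExecutor error shown above and captures the file name.
HINTS_ERROR = re.compile(
    r"Failed to dispatch hints file (\S+\.hints):\s*file is corrupted"
)


def corrupt_hints_files(log_text):
    """Return the sorted, unique names of hints files reported as corrupted."""
    return sorted(set(HINTS_ERROR.findall(log_text)))
```

Feed it the content of the Cassandra log; each returned name should correspond to a file in the hints folder mentioned above.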


Error loading Alerts in vRealize Log Insight 8.x

Recently, after upgrading a Log Insight cluster from version 8.2.0 to 8.4.1, the cluster seemed to be in good shape, but opening Alerts from the Administration menu showed “Error loading Alerts” instead of the Alerts.

Other symptoms:

  • Opening Certificates from the Administration menu shows “Error loading certificates” instead of the Certificates.
  • Testing integration with vROPS or vCenter shows “Could not connect to host”.

We logged into the Log Insight UI of all cluster nodes, and it became clear that not all nodes had this issue, but only about 50 percent of them.

This issue is caused by a corrupt certificate store on the unhealthy nodes.

How to check the state of the certificate store?

Open an SSH session to an unhealthy node as user root and run the following command:

# /usr/java/jre-vmware/bin/keytool -list -keystore /usr/java/jre-vmware/lib/security/cacerts

When the response is: “keytool error: java.io.EOFException”, the cacerts file is corrupt.

In addition, the /storage/var/loginsight/ui_runtime.log file will show errors such as: “java.security.KeyStoreException: problem accessing trust store”.
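If many nodes have to be checked, the keytool output can be captured per node and classified automatically. The following Python sketch is only an illustration; the function name and the three result labels are my own. It treats the EOFException as a corrupt store and the password prompt as a healthy one.

```python
def classify_keytool_output(output):
    """Classify captured keytool output as 'corrupt', 'ok' or 'unknown'."""
    if "java.io.EOFException" in output:
        # keytool could not read the store at all
        return "corrupt"
    if "Enter keystore password" in output:
        # keytool opened the store and is asking for its password
        return "ok"
    return "unknown"
```

Anything classified as `unknown` deserves a manual look, since the sketch only knows the two responses described in this post.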

Workaround:

  1. Open an SSH session to a healthy node as user root.
  2. Before applying the workaround, the certificate store can be checked, using the same command as above.
    When the response is: “Enter keystore password:”, the cacerts file is OK.
    Cancel with keystroke: [Ctrl]+C.
  3. Using the built-in SCP Utility, copy the /usr/java/jre-vmware/lib/security/cacerts file from the healthy node to the /usr/java/jre-vmware/lib/security/ directory of all unhealthy nodes in the vRealize Log Insight cluster.
    # scp /usr/java/jre-vmware/lib/security/cacerts root@unhealthynode:/usr/java/jre-vmware/lib/security/cacerts
    

    Where unhealthynode is the IP address or Fully Qualified Domain Name (FQDN) of an unhealthy node.

  4. You don’t even have to restart the unhealthy Log Insight node. Just check the node after the file copy.
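With several unhealthy nodes, typing the copy command for each of them invites mistakes. As a small convenience, this Python sketch generates the scp command from step 3 for a list of nodes without executing anything. The function names are hypothetical and the node names in the example are placeholders.

```python
import shlex

CACERTS = "/usr/java/jre-vmware/lib/security/cacerts"


def scp_commands(unhealthy_nodes):
    """Return one scp argument list per unhealthy node (nothing is executed)."""
    return [["scp", CACERTS, f"root@{node}:{CACERTS}"] for node in unhealthy_nodes]


def as_shell_lines(commands):
    """Render the argument lists as shell command lines for review."""
    return [shlex.join(cmd) for cmd in commands]
```

For example, `as_shell_lines(scp_commands(["li-node2.example.com", "li-node3.example.com"]))` gives two lines that can be reviewed and then pasted into the SSH session on the healthy node.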

As always, I thank you for reading.

Skyline Health Detector

Skyline is VMware’s proactive self-service support technology, available to customers with an active Production Support or Premier Services contract and based on Skyline Collector and Skyline Advisor. But there is more: in the latest vCenter Server 7.x edition you will also find Skyline Health. Skyline Health is built into vCenter Server, so no additional installation is required. To use the online health checks of Skyline Health, you must participate in the Customer Experience Improvement Program (CEIP), and vCenter Server must be able to reach the Internet. Skyline Health runs about 136 health checks and presents the results grouped in categories. While browsing the results, the “Self support Diagnostics” section caught my attention, in particular “VMware Skyline Health Diagnostics”.

According to the documentation: “VMware Skyline Health Diagnostics (SHD) is VMware’s self-service diagnostics platform. It uses product logs to detect problems and provides recommendations in the form of KB articles or steps to remediate them. A vSphere Administrator can use this tool to troubleshoot before contacting the VMware Global Support Service.”

SHD can detect issues in vCenter Server, ESXi and vSAN. Some of the benefits of SHD:
1. SHD runs on-prem and can also work offline without any Internet connectivity.
2. Based on the detected symptoms, the tool provides the relevant VMware Knowledge Base articles and remediation steps.
3. Get recommendations for a problem from VMware support services.
4. Early recommendations and remediation help business continuity.
