2. vRealize Operations

vRealize Operations

Log Insight sports out of the box dashboards that visualizes security-related activities in your vRealize Operation. It also has a dashboard to help you watch and troubleshoot upgrade.

Authentication Analysis

Auth analysis

Auth analysis

Let’s go through some of its contents in-depth.

There are two privileged login ID that customers have access that audit requires compliance reporting: root and admin. Other privileged account such as MaintenanceAdmin is not accessible. How many times do admin ID login with the wrong passwords in a specific period? The following chart counts each time a login failure happens.

Login failures

The count is possible because Log Insights create a field out of the log entries. The field is named vmw_vr_ops_admin_attempt and that’s what plotted over time. I’m showing two examples of actual log entries, with value added context by Log Insight.

Log list

The log entries themselves are in turn filtered using the following query. Log Insight has awareness of vRealize Operations via the variable vmw_vr_ops_appname.

Filters

vRealize Operations has 2 UI that are accessed via separate URLs. The Admin UI is for platform administration such as upgrading vRealize Operations, so it’s important to track login activities. How many failed login attempts made in the Admin UI is also provided out of the box.

We’ve covered admin. How about root? As this is a Linux account as opposed to vRealize Operations account, check the Linux content pack. The concept is the same.

What IP address do they login from? This helps you trace the location of the user who used the account. The following chart shows 6 different users and when they log in. We can drill down to any of them to see the exact time and the IP address used.

IP login source

Alternatively, you can present in table format to show the IP address information. If you want the time stamp, use the Field Table feature.

Table view

Users addition, especially unexpected ones, can be a cause of security concern. Audit team may ask for the lists of users added in certain period. Users deletion is typically not an audit concern, but could be useful in troubleshooting. You can figure who deleted the user account and when.

The widgets in the third row of the dashboard covers users creation, import and deletion. By now you can guess that we can plot this event over time too. The following shows the list of user accounts that got deleted and when. I cut the fullname as that’s part of actual product validation.

User Deletion events

The query that produces the above chart is the following. To some extend, it’s actually human-readable!

Deletion filter

Insufficient permission access logs events where users tried performing activities that they do not have sufficient privilege. The following shows some example where users were denied access.

Privileged access denied

If user has the access, you will see something like this

Privileged access allowed

The widget “Security related message” is my personal favourite, as it demonstrate the adhoc troubleshooting ability of Log Insight. The following shows the many security related messages. They are grouped by the vRealize Operations nodes, so you can exclude certain types or zoom into a particular type.

Node events

How was the above achieved? The following shows the actual query. Log Insight automatically group log entries of similar type and give them a unique event_type. This means you can filter out the types that are not relevant, like what I have done below.

Filter out

Activity Audit

Activity dashboard

Use this dashboard to select a user and audit that user’s actions. The type of actions you can audit are

  • Dashboards, Views and Reports. This covers creation, update, deletion, import, schedule (report) and generate (report).

  • Alerts. This covers alert definition, symptom definition, notification rules and recommendation.

  • Environment. This covers Application, Custom Data Center and Custom Group activities

  • Inventory. This covers resource creation, resource deletion, resource changes, collection start, collection stop, maintenance mode start, maintenance mode end.

  • Configuration. This covers changes in global setting, cost settings, credential, collector groups, etc.

The widgets are separated as in the menu of vRealize Operations Manager UI, so you can easily to orientate.

Let’s go through some of its contents in-depth.

Activity query

The query that produces the above chart is the following. I’ve excluded dashboard_add_tab from the category as it dominates the chart.

Query example

Let’s drill down into a specific task. Let’s say we have some views deleted and we need to know who deleted them and when. For that, we select the view_definition_delete and add it. Log Insight will automatically add the field name and operator. No need to manually type!

Changed filter

Since we’re down into a single activity, we can now plot the chart over time, grouped by the user. We can see here that the user is the system account maintenanceadmin.

MaintenanceAdmin activity

Let’s take another example from the dashboard. This time we will take user management, where you can track things like user deletion, role creation, user password change and many others. The following shows some of those activities.

User account activities

I’ve filtered out user_archive event as its value dominates the chart. As mentioned earlier, no need to manually type. Simply click on one of the log entries, choose a filter from the pop-up menu and that’s it!

Filter user archive

As usual, you can have a table of who did what when.

User actions

Let’s take one last example. I will take the configuration activity as it shows a range of interesting events. As usual, I started with a bar chart as it lets me see the activity name. We can see that changes in global settings dominate the result.

All activities

We already know how to filter it, so I’ll show you another way. Go to the Event Types. You’ll see the log entries grouped by type of events. To filter out, simply click on the X icon.

Alternative filtering

Once filtered out, the following is what I get. Let us know if there are events that you need to be logged that’s not trapped.

Filtered results

The query to get the above is complex

Query screenshot

vmw_vr_ops_category value:

.*POLICY.*|super_metric_.*|disk_rebalance|collector_group.*|global_settings_set|GLOBAL_SETTINGS_OVERRIDE|dynamic_threshold_.*|license.*|DESCRIBE|CREDENTIAL_.*|CERTIFICATE_.*|MAINTENANCE_SCHEDULE_.*|WLP_.*|SCHEDULE_.*|RIGHTSIZING_.*|RECLAIM.*|OUTBOUND_.*|LOG_CONFIGURATION_.*|COST_.*|REFLIB_UPDATE

Delete Activity Audit

Delete activity dashboard

Let’s dive into “Deletion activity per resource” widget by opening it in Interactive Analytics page. You get the same information shown on the widget, but this time you can adjust it. I’ve made each time block to be 10 minute instead of 1 hour so I can see the changes better.

Deletion per resource

Note that not all activities show the actual resource name being added/modified/deleted. In the following screenshot, I’ve highlighted in green where the affected resource name being shown, and in orange where it is not captured.

Missing resource name

All the columns above are Log Insight field, a type of variable. Each extracted field has its own rules for extraction. Log Insight scans events and extracts fields whenever predefined patterns get matched. Let’s take vmw_vr_ops_username as an example, and shows its extraction formula.

Extraction field Extraction regex

All the VMware extracted fields are prefixed with vmw_ followed by the product name.

We’re now ready to evaluate the filters used in the preceding chart. It requires three filters working together, meaning they all must be true. It’s an AND operator, not an OR operator.

Filters

The first filter uses vmw_vr_ops_category field to filter out the events which were generated as a result to some kind of deletion.

The second filter uses filepath filter, a special system wide filter, to track the name and the path of the file from where events are collected. In vRealize Operations, all audit related events are collected from analytics.audit log files.

The last filter uses vmw_vr_ops_username field to exclude logs generated by service and system users. We have to exclude automation admin, maintenance admin, migration admin and system as they are not accessible by users.

Upgrade Troubleshooting

You can monitor the above stages as they progress via Log Insight. Yup, pretty much like watching a live streaming, as the logs are streamed into the Log Insight dashboard. The dashboard sports 9 widgets arranged in 4 rows.

Upgrade dashboard

The “Upgrade Range” widget shows when the upgrade started and when it completed. It covers the time range of the upgrade process. If the process was successful, you’ll see two columns, one marking the start and one marking the end, as shown in the following.

Upgrade events

This widget is actually capable of monitoring multiple upgrades running in parallel. You can filter to only show the environment or nodes you are going to monitor by specifying their values in the fields vmw_vr_ops_clustername, vmw_vr_ops_hostname, and vmw_vr_ops_nodename.

The actual query to produce the chart is this.

Upgrade query

When you see “uploaded into reserved” that means the upgrade process has started. When you see “Completed operation CLEANUP for pakID” that means the upgrade process for that node has been completed successfully. The following shows examples of actual messages you should expect to see.

Upgrade messages

The “Overall events over time” widget is showing the proportion of all logs generated during the upgrade.

The “Overall errors, warnings and exceptions” widget is showing the proportion of all logs with errors, warnings and exceptions generated during the upgrade.

The second row of the dashboard shows 2 widgets. They cover the main two services responsible for upgrade: PAK Manager and CaSa. The widget monitoring errors, warnings and exceptions generated from that services.

The query is as the following:

Upgrade Query

The third row of the dashboard covers errors, warning and exceptions separately. Ensure that the spikes here are not unusual high.

An example of the errors during the upgrade:

Upgrade error

Some examples of warnings during the upgrade:

Warnings

An example of the exceptions:

Exceptions

Errors, warnings and exceptions may be not critical in this case and upgrade process may not be affected.

The last row contains the list of queries which are based on known Knowledge Base articles. Use it to check if a particular KB article is relevant in your environment. The last widget covers all log entries with exit code: 1. This is typically the reason for the failure of the upgrade process.

Collection Troubleshooting

vRealize Operations collect data every 5 minutes by default. For metric, it will store the value. For property, it will only store if there is a change. That means for metric, you should see a consistent collection every 5 minutes, something like this.

Metric collection interval

If you see missing dots, that means there is no metric stored. That could be because there is no metric to begin with (e.g. the VM is powered off) or there is a collection failure.

If you suspect a problem in collection, dive in “VMware - vROps 6.7+” content pack and go to the “Cluster - Collector Overview” dashboard. Check under “Collector Service shutdown events by Node, Role” widget and “Collector Service error events by node, role” widget and see if there are error messages.


This page was last updated on July 1, 2021 by Stellios Williams with commit message: "Added image alt-text, re-added links, cleaned Markdown, re-wrote most tables in Markdown"

VMware and the VMware taglines, logos and product names are trademarks or registered trademarks of VMware in the U.S. and other countries.