Monday, 2 October 2017

System Uptime Report SQL Query


    Some time ago a customer asked me how to use SCOM to tell actual server downtime apart from a network connectivity failure. From the perspective of SCOM's Availability Reports, which were the customer's main tool for assessing the state of their assets, there is no difference between the two. To come up with a backup solution providing this information, we turned to the System Up Time performance counter. Unfortunately, the built-in reporting mechanism is not very convenient for this particular counter: when you run the report for a group of multiple Health Service objects, it aggregates all of them and calculates a mean value for every sample, which makes no sense in this case and produces a saw-shaped diagram like the one below.

System UpTime report for a group of Health Service objects

    Creating one report subscription per server would be a daunting task for a few hundred objects, so we decided to take the data directly from the database.

Usage:
    The SQL query presented below returns the samples of the System Up Time performance rule for a particular object. It has to be run against the SCOM Data Warehouse database. Replace the XXXX value below with a SQL LIKE pattern that suits your needs. An example of the output produced by the query is shown below.



Example output of the System Uptime SQL query

Code:
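-- Samples of the System Up Time counter for the matching Health Service objects.
-- Run against the SCOM Data Warehouse database; replace XXXX with your LIKE pattern.
SELECT me.DisplayName,
       pr.[DateTime],
       pr.SampleValue  -- counter value recorded for the sample
FROM Perf.vPerfRaw AS pr
JOIN vManagedEntity AS me ON pr.ManagedEntityRowId = me.ManagedEntityRowId
WHERE pr.PerformanceRuleInstanceRowId IN
      (SELECT PerformanceRuleInstanceRowId FROM vPerformanceRuleInstance
       WHERE RuleRowId IN (SELECT RuleRowId FROM vPerformanceRule WHERE CounterName LIKE 'System Up Time'))
  AND me.DisplayName LIKE '%XXXX%'
  AND me.FullName LIKE '%HealthService%'
ORDER BY me.DisplayName, pr.[DateTime] DESC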
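
Once the samples are exported, the original question (reboot or connectivity failure?) can be answered mechanically. The Python sketch below is only an illustration and rests on several assumptions: each sample carries the counter value (the SampleValue column of Perf.vPerfRaw), that value is the uptime in seconds, and a hole longer than 15 minutes counts as missing data. The function name, the threshold and the sample data are hypothetical. A counter that goes backwards means the operating system restarted; a hole in the samples with a counter that keeps growing means the server stayed up and only the data did not arrive.

```python
from datetime import datetime, timedelta


def classify_gaps(samples, max_gap=timedelta(minutes=15)):
    """Label the intervals between consecutive uptime samples of ONE server.

    samples: iterable of (timestamp, uptime_seconds) tuples.
    Returns a list of (start, end, label) tuples.
    """
    ordered = sorted(samples)  # the SQL query returns newest first; we need oldest first
    findings = []
    for (t1, up1), (t2, up2) in zip(ordered, ordered[1:]):
        if up2 < up1:
            # The counter went backwards, so the OS restarted in this interval.
            findings.append((t1, t2, "reboot"))
        elif t2 - t1 > max_gap:
            # Samples are missing but the counter kept growing: the server was up.
            findings.append((t1, t2, "missing data, server stayed up"))
    return findings


# Hypothetical samples, for illustration only
samples = [
    (datetime(2017, 10, 2, 10, 0), 86400.0),
    (datetime(2017, 10, 2, 10, 5), 86700.0),
    (datetime(2017, 10, 2, 11, 0), 90000.0),  # 55-minute hole, counter still growing
    (datetime(2017, 10, 2, 11, 5), 120.0),    # counter reset: reboot
]

for start, end, label in classify_gaps(samples):
    print(start, end, label)
```

The threshold should be set a little above the real collection interval of the rule in your environment, otherwise every normal interval would be reported as a hole.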
   

Tuesday, 26 September 2017

Troubleshooting - Unusual Error during Linux Agent Deployment

Symptoms:

   During the installation of the SCOM agent on RHEL servers I encountered the following error message while the agent's certificate was being signed:

Exception message: Unable to create certificate context
; {ASN1 bad tag value met.
}


    The message is quite unusual: apart from one previous case, I could not find any other reference to this problem associated in any way with SCOM.

Resolution:

    The only suggested cause, a firewall problem, was ruled out first. After trying several approaches it turned out that during the certificate signing process the SCOM agent was trying to use older versions of two particular libraries than the ones present on the system, and failed because of this. The workaround was to create symbolic links named after the old library files and pointing to the new files, and then to re-initiate the certificate signing process manually, with the following commands:

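# Make the old library names expected by the agent point to the newer files on the system
cd /usr/lib
sudo ln -s libcrypto.so.1.0.1e libcrypto.so.1.0.0
sudo ln -s libssl.so.1.0.1e libssl.so.1.0.0
# Re-initiate certificate generation for the agent
sudo /opt/microsoft/scx/bin/tools/scxsslconfig -f -v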

    Following up on the threads that suggested this approach (albeit for a different problem), I found that the problems reported as fixed by these commands were addressed in the next Cumulative Update of the Management Pack for UNIX and Linux Operating Systems. After verification it turned out that the agent binaries had been taken from the SCOM 2012 R2 SP1 ISO and did not contain the latest fixes applied to the Management Pack. After downloading the latest version of the binaries, the "ASN1 bad tag value" problem disappeared on all the Linux servers.

Sunday, 27 August 2017

Troubleshooting - Disappearing Run As Profiles Configuration Settings

Symptoms:

   Sometimes you have a general feeling that something is wrong with the infrastructure, and by looking around you catch one symptom after another until you can compose an overall picture of the problem. This is what happened in a case I recently had with one of my customers, which was resolved together with Microsoft Premier Support. I found it interesting enough to share with you. Here are all the symptoms observed before pinning the problem down, in more or less chronological order:

1. The groups created in SCOM were not available for selection in the reports. They appeared in the console, but not in the Reporting part of SCOM (which suggests problems with processing data from the Ops DB to the Data Warehouse DB)
2. A large amount of data stored in the staging area of the Data Warehouse DB. Running the following T-SQL queries revealed hundreds of thousands of rows in the Alert and State staging tables:

SELECT count(*) from Alert.AlertStage
SELECT count(*) from Event.EventStage
SELECT count(*) from Perf.PerformanceStage
SELECT count(*) from State.StateStage

3. Data Warehouse Data Collection State errors showing up in the Health Explorer of the Management Servers themselves in SCOM
4. A large number of 31551 events in the Operations Manager event log, reporting failures to store data in the Data Warehouse. They look similar to the following event:
Log Name:      Operations Manager
Source:        Health Service Modules
Date:          27/01/2013 22:00:15
Event ID:      31551
Task Category: Data Warehouse
Level:         Error
Keywords:      Classic
User:          N/A
Computer:      XXX
Description:
Failed to store data in the Data Warehouse. The operation will be retried.
Exception 'SqlException': Management Group with id 'VVVVVVVV-VVVV-VVVV-VVVV-VVVVVVVVVVVV' is not allowed to access Data Warehouse under login 'YYY\WRITER'

One or more workflows were affected by this.

Workflow name: Microsoft.SystemCenter.DataWarehouse.CollectPerformanceData
Instance name: XXX
Instance ID: {WWWWWWWW-WWWW-WWWW-WWWW-WWWWWWWWWWWW}
Management group: ZZZ

Reason:

    It turned out that we had hit an issue, acknowledged by Microsoft as a kind of bug, that seems to occur randomly in different environments. On rare occasions the default configuration of the SCOM Run As accounts for the Data Warehouse, created during the installation of the SCOM servers, may disappear from the Run As profiles configuration. Unfortunately, Microsoft has not yet identified the root cause of this behavior.

Resolution:

    To resolve the problem you have to re-introduce the settings. Below you can find screenshots of properly configured Data Warehouse Account and Data Warehouse Report Deployment Account Run As profiles.

Data Warehouse Run As Profiles default configuration

     After re-introducing the configuration everything should get back to normal.

Thursday, 10 August 2017

Life Hacks - Removing Direct Membership Rules of Certain Type in Bulk

    Today I was cleaning up a collection with hundreds of direct membership rule entries. Direct rules had been used as a temporary approach and have now been replaced with include membership rules. Even though it is very easy to add multiple computers to a collection via the GUI, based on a chosen pattern, with the Create Direct Membership Rule Wizard, it is not so convenient to remove them: the window is not resizable and shows only three rows at a time:

The view of Membership rules tab in Collection Properties window

 
    Here is where a little script could come in handy and help to take care of this task.

Usage:
    Swap the following strings in the script below with the names from your SCCM infrastructure:

  • _CollectionId_ - put the ID of the target collection
  • _CollectionName_ - put the name of the target collection
  • _NamingRegExp_ - put the wildcard pattern fitting your needs (the -like operator matches wildcards such as * and ?, not regular expressions). If you want to remove all rules, just put "*" or remove that part of the code: ...| ?{$_.RuleName -like "_NamingRegExp_"}...

     The code has to be saved as a .ps1 file and run on the SCCM server in a PowerShell session connected to the SCCM site server.

Code:
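# Collect the direct membership rules whose names match the wildcard pattern
$Rules = Get-CMDeviceCollectionDirectMembershipRule -CollectionId "_CollectionId_" | Where-Object { $_.RuleName -like "_NamingRegExp_" }
foreach ($Rule in $Rules) {
    Write-Host "Removing $($Rule.RuleName) Rule" -ForegroundColor Yellow
    # Remove the rule for this resource without a confirmation prompt
    Remove-CMDeviceCollectionDirectMembershipRule -CollectionId "_CollectionId_" -ResourceId $Rule.ResourceID -Force
}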

The output of the script
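
Because the removal is forced and happens in bulk, it can help to check beforehand which rule names a given pattern would select. The -like operator used in the script matches wildcard patterns, not regular expressions. The Python sketch below only approximates that matching with the standard fnmatch module so the difference can be tried out offline; the helper name like and the rule names are hypothetical, and the case-insensitive comparison is an assumption made to mirror the default behaviour of -like.

```python
from fnmatch import fnmatchcase


def like(value, pattern):
    """Approximate PowerShell's case-insensitive -like wildcard match."""
    return fnmatchcase(value.lower(), pattern.lower())


# Hypothetical rule names, for illustration only
rule_names = ["SRV-APP-001", "SRV-APP-002", "SRV-DB-001", "WKS-0042"]

# A wildcard pattern selects the intended rules
selected = [name for name in rule_names if like(name, "srv-app-*")]
print(selected)

# Regular-expression syntax is taken literally and selects nothing here
regex_style = [name for name in rule_names if like(name, "SRV-APP-.*")]
print(regex_style)
```

In the PowerShell script itself, the same preview can be obtained by running only the first line and inspecting the contents of $Rules before the loop is executed.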

Additional Notes:
    I have experienced certain problems using this script on systems with PowerShell version 4.0. I received the following error while trying to pass the ResourceID parameter to the Get-CMDeviceCollectionDirectMembershipRule cmdlet:


The error message received on a system with PowerShell 4.0

    Instead of troubleshooting the problem I created a workaround that uses the CollectionName parameter. You can use the code below in case you encounter similar issues:

$Rules = Get-CMDeviceCollectionDirectMembershipRule -CollectionName "_CollectionName_" | Where-Object { $_.RuleName -like "_NamingRegExp_" }
foreach ($Rule in $Rules) {
    Write-Host "Removing $($Rule.RuleName) Rule" -ForegroundColor Yellow
    Remove-CMDeviceCollectionDirectMembershipRule -CollectionName "_CollectionName_" -ResourceId $Rule.ResourceID -Force
}

Monday, 8 May 2017

Life Hacks - Finding SCOM Monitoring Objects from The Alerts by GUIDs

    From time to time SCOM generates alerts like the following, which do not show the FQDN of the problematic server but only the GUID of the managed object:

The agent was not able to submit data on behalf of another computer because agent proxy is not enabled. Details:Health service ( XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX ) should not generate data about this managed object ( YYYYYYYY-YYYY-YYYY-YYYY-YYYYYYYYYYYY ).
 

    For some reason these alerts do not contain the name of the server for which proxying has to be enabled, but obviously we need it in order to fix the problem.

Usage:
     To get the name, run the cmdlet below in PowerShell after making one substitution:
  • XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX - replace it with the GUID found in the corresponding place in the received alert
     The cmdlet has to be executed in the Operations Manager Shell on the SCOM server. It returns the FQDN of the server for which you need to enable proxying.

Code:

Get-SCOMMonitoringObject | ?{$_.Id -match "XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX"}


The output of the cmdlet
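
When many such alerts arrive, copying the GUIDs by hand gets tedious. The Python sketch below shows one way to pull both GUIDs out of the alert description with a regular expression so they can be pasted into the cmdlet above. It assumes the alert text follows the wording quoted in this post; the function name, the group names and the GUID values are hypothetical.

```python
import re

# Textual form of a GUID: 8-4-4-4-12 hexadecimal digits
GUID = r"[0-9A-Fa-f]{8}(?:-[0-9A-Fa-f]{4}){3}-[0-9A-Fa-f]{12}"

ALERT_PATTERN = re.compile(
    r"Health service \(\s*(?P<health_service>" + GUID + r")\s*\)"
    + r".*?managed object \(\s*(?P<managed_object>" + GUID + r")\s*\)",
    re.DOTALL,
)


def extract_guids(alert_text):
    """Return both GUIDs from the alert description, or None if it does not match."""
    match = ALERT_PATTERN.search(alert_text)
    return match.groupdict() if match else None


# Hypothetical alert text with made-up GUIDs, for illustration only
alert = (
    "The agent was not able to submit data on behalf of another computer "
    "because agent proxy is not enabled. Details:Health service "
    "( 11111111-2222-3333-4444-555555555555 ) should not generate data about "
    "this managed object ( aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee )."
)

print(extract_guids(alert))
```

The health_service value corresponds to the XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX placeholder used in the cmdlet.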

Wednesday, 8 March 2017

Advanced Troubleshooting - SCCM 2007 WMI permissions issue

Symptoms:
    I decided to share one of the older, but also most cryptic, cases I have resolved, because I could not find any trace of this issue on the Internet.


    The problem was observed several times in an SCCM 2007 infrastructure, usually after a power outage or another unexpected reboot of an SCCM server. Once the system is back up and running, users who are configured to have access to SCCM are greeted with the following view when starting the console:



Configuration Manager Console error
 
    Another interesting aspect of the problem was that local admins could still access SCCM without any problem. The issue affected only SCCM admins without local admin rights on the SCCM server. The temporary workaround was to grant such privileges to all users, but obviously this could not be considered a permanent solution.

Reason:
    After numerous in-depth investigations it turned out that the root cause of the problem was missing permissions on WMI namespaces. The reason for the loss of these permissions remains unknown, but the problem is reproducible.

Resolution:
      In order to fix the problem, perform the following actions:

1. Open the Start Menu, open the Run prompt, and execute the mmc command
2. Once the console is open, press Ctrl+M to add a snap-in
3. Scroll down, choose WMI Control and click the Add button
4. Choose to connect to the local computer and click the OK button
5. Click the OK button again
6. Click the arrow next to the WMI Control (Local) snap-in and, when it disappears, right-click the snap-in and open its Properties
7. Open the Security tab and drill down to the root\sms and root\sms\site_XXX namespaces. These are the namespaces that are missing the permissions
   
    The permissions have to be set up properly in order to allow the SMS Admins group to access the SCCM Console again and perform delegated actions. The settings can be retrieved from a fresh SCCM 2007 installation by comparison, or you can use the screenshots of a proper configuration below as a reference.
 
Proper configuration of WMI permissions for the root\sms namespace

Proper configuration of WMI permissions for the root\sms\site_XXX namespace

Friday, 10 February 2017

SCOM Agents Migration Part 2. Migrating to Another Server within the Same Management Group

    Sometimes you might need to redirect the SCOM Agents to new infrastructure after changing Management Servers. This can be done from the SCOM Console, assuming that:
1. You deployed the SCOM agents via Discovery from the SCOM Console, or you performed a change on the Ops DB necessary in order to be able to manage manually deployed SCOM Agents from the Console
2. The old Management Servers are still active and communicate with your Management Group
    It might happen, though, that you have an SCCM Deployment, or any other way of distributing the agent, which for some reason was not updated after the migration and kept deploying agents with the old settings despite the SCOM infrastructure change. The situation is problematic, because simply re-installing the agent will not change the settings. The SCOM Agent keeps the old configuration and keeps trying to contact non-existing servers. Recently I posted a solution to migrate the SCOM agents to a new Management Group. You can find it here:

    Unfortunately this solution is not invasive enough to migrate objects within the same Management Group. SCOM is pretty persistent when it comes to keeping the configuration of the agent. When you remove the configuration and then add a new entry for the same Management Group, the configuration reverts to the previous one. What is needed in addition is a flush of the Health Service State, which the script in this post performs as well.
Usage:
     Swap the following strings in the script below with the names from your SCOM infrastructure:
  • _MGName_ - put the name of your Management Group
  • _NewMGServerFQDNName_1_ - put the FQDN of the first new Management Server
  • _NewMGServerFQDNName_2_ - put the FQDN of the second new Management Server
     The code has to be saved as a .vbs file and run on the SCOM client either manually or for instance via SCCM deployment
Additional Notes:
    Like the previous version, the script uses a random draw to spread the load between the two Management Servers.
Code:
Option Explicit
On Error Resume Next

' COM object exposing the agent's Management Group configuration
Dim objMSConfig
Set objMSConfig = CreateObject("AgentconfigManager.MgmtSvcCfg")

Dim oShell
Set oShell = CreateObject("Wscript.Shell")

Dim objMG

' Stop the agent, delete any previous backup of the state folder, then flush
' the Health Service State by renaming the current folder to .old
oShell.run "CMD /C ""NET STOP HealthService""", 0, true
oShell.run "CMD /C ""RMDIR /S /Q ""C:\Program Files\System Center Operations Manager\Agent\Health Service State.old""""", 0, true
oShell.run "CMD /C ""REN ""C:\Program Files\System Center Operations Manager\Agent\Health Service State"" ""Health Service State.old""""", 0, true

' Remove the old Management Group entry
Call objMSConfig.RemoveManagementGroup ("_MGName_")

' Add the Management Group again, picking one of the two new servers at random
Randomize
If Rnd > 0.5 Then
    Call objMSConfig.AddManagementGroup ("_MGName_", "_NewMGServerFQDNName_1_",5723)
Else
    Call objMSConfig.AddManagementGroup ("_MGName_", "_NewMGServerFQDNName_2_",5723)
End If

' Apply the new configuration
Call objMSConfig.ReloadConfiguration