MonitorAgent installation and configuration

MonitorAgent is an agent that is part of the QBroker project. As an agent, it periodically checks predefined occurrences or listens on data sources for certain patterns. It is a standalone Java process running as a daemon that hosts a set of monitor objects, each designed for a category of occurrences. An occurrence can be any incident or event generated by applications, hardware, etc. Each occurrence is monitored by an instance of a monitor object that is registered with the container, MonitorAgent.

A monitor object consists of two components. The first is the report component, which has a method for testing or detecting the occurrence. The result of the report component is also called a report. The second is the action component, which evaluates the test result (report) and invokes various actions in case of failures or exceptions. Periodically, MonitorAgent launches the report component and passes the result to the action component. The action component checks the report to determine the priority of the event and sends an event to a centralized event collector, EventFlow, for further analysis and evaluation. In case of failure, MonitorAgent can invoke pre-configured actions, such as sending an email alert, launching an action script to restart the service, or logging the errors to ServiceNow or syslog, according to the pre-configured policies.

It is the customized components that do the actual work. MonitorAgent is just a container that runs and manages all registered monitor instances. It also provides services such as scheduling, thread pooling, report sharing, a centralized repository, dynamic deployment and workflow support. Besides the monitors for reports and actions, instances of MessageFlow can also run inside MonitorAgent. The integration with MessageFlow provides support for dynamic monitors, node-level event processing and flexible workflows.

MonitorAgent is supposed to run on every physical or virtual box. Together with EventFlow, StatsFlow, the centralized configuration repository and the WebAdmin console, it can be used to build a Monitor Network of domains, sites and nodes and to manage them through web browsers. Here is a conceptual diagram of a Monitor Network.

In the diagram, there are multiple components. Agent is the instance of MonitorAgent running on each node. EventCollector is a web service that logs the events and metrics from agents or other applications. EventFlow is an instance of QBroker Flow for event correlations and escalations. StatsFlow is another instance of QBroker Flow that processes metrics. Configuration Repository is a web-based centralized configuration file store for deployment. WebAdmin is a web application for managing the repository and runtime operations. With this kind of hierarchical framework, decisions and actions can be made at any level: the monitor level, the node level, the domain level or the organization level.

Even though MonitorAgent can be owned by any user and homed anywhere, here we focus on the default setup. By default, MonitorAgent runs as qbadm and is homed at /opt/qbroker on Unix platforms. On Windows, it runs as system and is homed at C:\home\qbroker. The directory for configurations is agent. It is recommended to enable the configuration repository and to make changes on the configuration repository only. If the web-based admin tool is available, always use it to manage the repository and the operation tasks on all monitors. If you are interested in the source code, you can find the open source project of QBroker on GitHub.

Installation

By default, MonitorAgent will be installed in /opt/qbroker and owned by qbadm:qb.

Unix Platform

The installation on Unix platforms is simple, especially if your box has web access to https://yannanlu.github.io. In that case, you just need to log in on the box and run the following command to have it installed:

wget -O - https://yannanlu.github.io/misc/installQB.sh | sudo bash

In some cases, web access to https://yannanlu.github.io may not be allowed. You will then have to download the tarball and the installation script from https://yannanlu.github.io on another machine and copy them to the box for the installation. Here is the procedure with the step-by-step tasks:

Windows Platform

The installation on Windows platform is a bit different. We are not going to discuss it here.

Operation

Once MonitorAgent is installed and configured properly, you should have the following filesystem layout on a Unix platform:

path | function | example
/opt/qbroker | home dir of QBroker |
/opt/qbroker/bin | dir for startup script and other utilities |
/opt/qbroker/bin/agentctl | startup script of MonitorAgent | ./agentctl restart
/opt/qbroker/lib | dir for Java libraries and shared libraries |
/opt/qbroker/agent | dir of configuration files for MonitorAgent |
/opt/qbroker/agent/Agent.json | master configuration file | Agent.json
/opt/qbroker/templates | dir for template files |
/opt/qbroker/templates/mail.txt | template file for email alert | mail.txt
/var/log/qbroker | dir for QBroker logs |
/var/log/qbroker/MonitorAgent.log | log file of MonitorAgent |
/var/log/qbroker/MonitorAgent.out | stdout and stderr of MonitorAgent |
/var/log/qbroker/completed | status file of MonitorAgent |
/var/log/qbroker/.status | dir for stateful files |
/var/log/qbroker/archive | dir for archived logs |
/var/log/qbroker/checkpoint | dir for checkpoint data |
/var/log/qbroker/stats | dir for statistical and historical logs |

The normal operation tasks involve start, stop, restart, troubleshooting with the log files, configuration management and deployment. MonitorAgent keeps its configuration files in /opt/qbroker/agent. Among the various configuration files, the master configuration file, /opt/qbroker/agent/Agent.json, is the most important. Here is an example: Agent.json. The details of the configuration are covered in a dedicated section. Here we focus on how to start, stop and restart MonitorAgent on the local box and where the logs are.

To start MonitorAgent, go to /opt/qbroker/bin and run ./agentctl restart as the owner or root. If the command is not invoked by the owner, it will try to su to the owner and will probably prompt for the owner's password. MonitorAgent runs as qbadm by default. If you want MonitorAgent to restart applications owned by other users, you should use sudo for the access. Please check sudo's man pages on how to allow a user to run scripts on behalf of someone else. There are plenty of examples in the repository that you may find helpful. To stop MonitorAgent, go to /opt/qbroker/bin and run ./agentctl stop as the owner or root. To check whether the process is running, run ./agentctl status. Alternatively, you can use WebAdmin's operation view to start, stop or restart MonitorAgent remotely.

If you are lucky, MonitorAgent's process will be running as a daemon. Otherwise, you need to troubleshoot the problem. Always check the errors in /var/log/qbroker/MonitorAgent.log and /var/log/qbroker/MonitorAgent.out. If you are still not able to start MonitorAgent, please ask around for help.

Deployment

Once MonitorAgent has been installed, you can start to configure the monitor components for your needs. The next section will explain the configurations on each component. Here we focus on the repository and deployment.

Since you may have MonitorAgent installed on multiple boxes, it would be very difficult to manage them without a centralized repository. The current MonitorAgent supports a web-based JSON configuration repository. You define and modify the config files in the repository and then publish the changes on the repository as the deployment. MonitorAgent monitors the repository and picks up changes within a couple of minutes. It then reloads the modified objects and continues to work. If the web repository feature is not enabled for an instance of MonitorAgent, that instance will never see the changes. In this case, you have to push the changes to the box and bounce the instance. We call this synchronous deployment. In contrast, the former is asynchronous deployment.

In fact, a JSON web-based repository is just a regular web site serving a set of JSON configuration files via HTTP. There are two different deployment processes. One is asynchronous deployment that is just to publish the JSON content, very similar to a production web site with static content. But it requires the application's active involvement. The other is synchronous deployment, requiring a push of changes and a restart on the application. There are three questions to be answered before you make a decision on which deployment method to use. The first two are whether the application supports the web-based repository and whether the feature is enabled or not. If both answers are positive, the next question will be whether the application is running and working properly. If any of the answers is no, you will have to use the synchronous deployment. Otherwise, you can use either of the deployment methods.
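
The three questions above can be sketched as a small decision function. This is only an illustration of the reasoning in the text, written in Python; the function and parameter names are hypothetical and are not part of MonitorAgent or deploy.sh.

```python
def deployment_options(supports_repo, repo_enabled, app_healthy):
    """Return the usable deployment methods for one application instance.

    Asynchronous deployment needs the application's active involvement, so it
    is only an option when all three answers are yes. Synchronous deployment
    (push the changes and restart) is always possible.
    """
    if supports_repo and repo_enabled and app_healthy:
        return ["asynchronous", "synchronous"]
    return ["synchronous"]


# Repository supported and enabled, application running properly.
print(deployment_options(True, True, True))
# Repository feature not enabled on this instance.
print(deployment_options(True, False, True))
```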

If WebAdmin has been installed, it should be used for the management of the configuration repository. WebAdmin is powered by JavaScript on the client side, QBroker in the middle tier and MySQL at the backend. It is a web-based tool designed for generic applications. For applications like MonitorAgent, EventFlow and QFlow, it allows users to manage their configuration repository and carry out routine operation tasks via web browsers such as Firefox or Safari. It supports both synchronous and asynchronous deployment. It is also easy to integrate and extend because of its QBroker middle tier. We are not going to cover how to use WebAdmin to deploy changes here. We focus on the command line deployment only.

In case WebAdmin is not available, you still can make the deployment via the command line utility. If you have made any changes to the repository, please remember to import the changes to the database once WebAdmin is back up. It is very important to keep the repository and the database in sync. We will revisit this issue later.

The command line deployment is done by running deploy.sh as the owner of the repository. The utility supports both synchronous and asynchronous deployment. Either way, you do not have to log in on each box to repeat the task. The command line utility will save you time, especially when you are dealing with clusters of multiple boxes.

By default, the repository is at /www/wdap/docroot/agent. Since MonitorAgent runs as qbadm, you need write access to the files in the repository in order to modify them. For your own domain, you can choose a box with Nginx and have /www/wdap/docroot/agent copied over. Inside /www/wdap/docroot/agent, there are multiple directories named after categories. All the boxes are grouped into categories. For example, panda1 and panda2 are grouped into the service of panda, simply because they share a lot of functionality.

Here is the procedure to deploy changes asynchronously:

Here is the procedure to deploy changes synchronously:

Configuration

The configuration files of MonitorAgent are a set of JSON files. Usually, these JSON files are stored in the repository and deployed to the boxes. If you know what to do, you can use your favorite editor to modify them either on the box or in the repository. If WebAdmin is available, always use it to create, modify, delete, import, export, upload and deploy configuration files of MonitorAgent. However, we are not going to cover how to use WebAdmin to modify JSON configuration files. Here we focus on the structure of the configuration file, its content and what the content is for. After all, you have to know what to change in the JSON configuration files before you modify them.

Each type of MonitorAgent component has its own configuration file. The data schema of the components varies from type to type. Please do not feel bad if you are confused by the various configuration files for MonitorAgent. MonitorAgent is designed to support various monitor components, and it is up to the developers to specify what is required in the configuration file. However, you can find a set of sample files for each scenario in the repository or in /opt/qbroker/agent/examples on some of the boxes. As a good start, you can copy a similar example and modify it according to your needs. The JSON files are usually self-explanatory. Even though you may not understand every property, you will still be able to figure out what to change in most cases. If you encounter problems configuring a monitor, just ask around for help.

The master configuration file of MonitorAgent is /opt/qbroker/agent/Agent.json. Here is an example: Agent.json. Each individual monitor may have a dedicated configuration file. Once the configuration is ready, please run /opt/qbroker/bin/agentctl filename in the config directory to check the syntax of the configuration file. If the syntax check fails, fix the reported problem and check again.

If you view the master configuration file, you will see four parts. The first part is the properties for the MonitorAgent container itself. It specifies how often MonitorAgent runs each of the components (heartbeat in seconds), where to send the events (http://panda:8082/event), where to log locally, where the repository is, whether to turn on the disable feature, etc. The second part is the AdminServer, which supports synchronous remote control and queries. The third is a MonitorGroup list. Each MonitorGroup lists the names of its monitors. A name has to be unique within the container, and the monitor can be defined in the same file or in a separate file in the same directory. Whenever you add a new monitor, you need to define the monitor first and then add the entry to one of the MonitorGroups in the master file. If the repository is configured, you just need to deploy the changes. Otherwise, you have to bounce the monitor to activate the new changes. The last is a MessageFlow list. A MessageFlow contains multiple message nodes representing a certain workflow. If the repository is defined, the changes to MessageFlows will be reloaded automatically after they are deployed. Otherwise, a synchronous deployment and a bounce of the container will be required.

Here is an example of the AdminServer's definition:

{
  ...
  "AdminServer": {
    "Name": "admin",
    "ClassName": "org.qbroker.net.SimpleHttpServer",
    "URI": "https://localhost:6627/admin/jms",
    "Operation": "handle",
    "Capacity": "64",
    "Partition": "0,32",
    "KeyStoreFile": "/opt/qbroker/agent/keystore.jks",
    "KeyStorePassword": "xxxx",
    "TrustAllCertificates": "true",
    "Timeout": "10",
    "RestartScript": "/bin/bash -c \"/opt/qbroker/bin/agentctl restart &\""
  },
  ...
}

where TrustAllCertificates is set to true for client queries in case keystore.jks is self signed.

Here is an example of definition for a MessageFlow:

{
  ...
  "MessageFlow": [{
    "Name": "default",
    "Description": "dispatch events for Agent",
    "Capacity": "1024",
    "XAMode": "0",
    "Debug": "1",
    "PauseTime": "2",
    "StandbyTime": "60",
    "Node": [
      "node_switch"
    ],
    "Persister": [
      "pstr_event",
      "pstr_nohit"
    ]
  }] // end of MessageFlow
}

MessageFlow is optional. If there is a need, you can define multiple MessageFlows. MessageFlow can be used to provide arbitrary services and/or to listen for certain requests or data streams. Both the dynamic monitor support and the node-level event correlations are implemented via MessageFlow.

All the monitors in the same group are processed within the same thread, in the order of the list. You can have multiple groups for independent monitors. The order of the groups is not well defined in a multi-threaded environment, but the order within a group is honored. Each group can have its own heartbeat, timeout and debug settings. Among all the groups, the default group is special. First, it can never be disabled by any means. Second, it runs first at the time of startup or reload. So you should only put basic monitors and reports in the default group. The property of MaxNumberThread controls the maximum number of concurrent threads in the thread pool.

Each MonitorGroup lists the names of all its monitors. It can also contain Map objects for monitor templates. Here is an example:

{
  ...
  "MonitorGroup": [
    {
      "Name": "default",
      "Monitor": [
        "global_var",
        "rotation_agent_out",
        "rotation_agent_stats"
      ] // end of monitor
    },{
      "Name": "queue",
      "Heartbeat": "120",
      "Capacity": "128",
      "Monitor": [
        "MyQueue_jlog",
        {
          "Name": "broker_sonic",
          "Template": "broker##id##",
          "Item": ["1","2"]
        }
      ]
    }
  ], // end of MonitorGroup
  ...
}

As you can see, there are two monitor objects defined in the group of queue. The first one references the name of the monitor, MyQueue_jlog. There should be a JSON file, MyQueue_jlog.json, in the folder. The second is a Map containing a Name, broker_sonic, as the reference to the configuration template file; a Template, broker##id##, as the name template for the names of the new monitors; and a list of items used as the values of ##id## to generate the new monitors from the configuration template. For example, the name of the monitor for the first item will be broker1, since the variable ##id## in the name template is replaced by 1, the value of the first item. So MonitorAgent will generate 2 new monitors sharing the same property template. This is convenient since you do not have to define a monitor for every host.
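
The name expansion described above can be sketched in a few lines of Python. This is a minimal illustration of the substitution only, assuming a single ##variable## in the name template; the function name is hypothetical and the sketch does not reproduce how MonitorAgent loads the property template itself.

```python
import re

# Matches a template variable such as ##id##.
VARIABLE = re.compile(r"##(\w+)##")


def expand_static_template(entry):
    """Generate monitor names from a static monitor template entry."""
    template = entry["Template"]
    variables = set(VARIABLE.findall(template))
    if len(variables) != 1:
        raise ValueError("expected exactly one ##variable## in " + template)
    placeholder = "##" + variables.pop() + "##"
    # Each item becomes the value of the variable in the name template.
    return [template.replace(placeholder, item) for item in entry["Item"]]


entry = {"Name": "broker_sonic", "Template": "broker##id##", "Item": ["1", "2"]}
print(expand_static_template(entry))  # ['broker1', 'broker2']
```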

In the example above, the item list is static, so we call the monitor template static. MonitorAgent also supports dynamic monitor templates. If a monitor template is dynamic, the data for Item has to be a Map that defines a MonitorReport object to generate the item list dynamically. Here is an example:

{
  ...
  "MonitorGroup": [
    {
      "Name": "queue",
      "Heartbeat": "120",
      "Capacity": "256",
      "Monitor": [
        {
          "Name": "queue_sonic",
          "Template": "##queue##",
          "Item": {
            "Type": "GenericList",
            "Description": "JMS/JMX listing on SonicMQ",
            "URI": "tcp://##hostname##:2506",
            "Username": "qbadm",
            "Password": "xxxx",
            "RequestCommand": "DISPLAY Domain1.brQA01Container:ID=brQA01,category=metric,type=queue",
            "DataField": "List",
            "KeyTemplate": "##name##",
            "ReportMode": "local",
            "Step": "1",
            "XPatternGroup": [{
              "Pattern": [
                "SampleQ\\d+"
              ]
            }]
          }
        }
      ]
    }
  ],
  ...
}

If you compare this example to the previous one, you will notice the changes in the content and data type of Item. There is no list of items any more. Instead, the block of Item defines a MonitorReport object to generate the list of items dynamically. The data field of the report is specified as List. When MonitorAgent dispatches the dynamic group to a working thread, it invokes the method generateReport() and retrieves the list of queues from the report. This way, you do not need to know in advance which queues are available.

Dynamic monitor templates are really useful for monitoring dynamic objects. Monitoring the queues of Apache ActiveMQ is a good example. Queues in ActiveMQ can be created by applications, so some of the queues come and go, and it is very challenging to keep track of them. With a dynamic monitor template, MonitorAgent is able to discover new queues to monitor and to remove the monitor when a queue is gone. Here is the configuration example:

{
  ...
  "MonitorGroup": [
    {
      "Name": "amq",
      "Heartbeat": "60",
      "Monitor": [
        {
          "Name": "queue_jmx",
          "Template": "##queue##_jmx",
          "Substitution": "s/^.*,Destination=//",
          "Item": {
            "Type": "GenericList",
            "Description": "JMX listing on ActiveMQ",
            "URI": "service:jmx:rmi:///jndi/rmi://localhost:8999/jmxrmi",
            "Username": "admin",
            "Password": "xxxx",
            "MBeanName": "org.apache.activemq:BrokerName=localhost,Type=Queue,*",
            "DataField": "List",
            "ReportMode": "local",
            "Step": "5",
            "XPatternGroup": [{
              "Pattern": [
                "(example|sample|test)$"
              ]
            }]
          }
        }
      ]
    }
  ],
  ...
}

In this example, the MonitorGroup launches the JMX query against the ActiveMQ service every 5 minutes (a Heartbeat of 60 seconds with a Step of 5). The query generates a list of queues on the service. With that list, a monitor is generated for each queue based on the template of queue_jmx. Each monitor runs every minute to watch its queue. If a queue disappears, its monitor is removed accordingly. In other words, a dynamic monitor template manages a list of monitors based on another monitor. Currently, only one variable is supported in the monitor template.
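
The add-and-remove behavior for the ActiveMQ example above can be sketched as a reconciliation step. The Python below is only an illustration under stated assumptions: XPatternGroup is treated as a list of exclusion patterns, the Substitution is applied before the patterns are checked, and the MBean names and function names are hypothetical samples rather than output from a real broker or MonitorAgent code.

```python
import re


def queue_names(mbean_names, exclude_patterns):
    """Reduce MBean names to queue names and drop the excluded ones."""
    names = []
    for mbean in mbean_names:
        # Same effect as the Substitution "s/^.*,Destination=//".
        queue = re.sub(r"^.*,Destination=", "", mbean)
        # Assumption: a queue matching any X pattern is not monitored.
        if any(re.search(p, queue) for p in exclude_patterns):
            continue
        names.append(queue)
    return names


def reconcile(active_monitors, queues):
    """Return (monitors to add, monitors to remove) for the new queue list."""
    wanted = {queue + "_jmx" for queue in queues}  # Template "##queue##_jmx"
    return sorted(wanted - active_monitors), sorted(active_monitors - wanted)


# Hypothetical MBean names returned by the JMX listing.
mbeans = [
    "org.apache.activemq:BrokerName=localhost,Type=Queue,Destination=orders",
    "org.apache.activemq:BrokerName=localhost,Type=Queue,Destination=billing.test",
]
queues = queue_names(mbeans, [r"(example|sample|test)$"])
to_add, to_remove = reconcile({"legacy_jmx"}, queues)
print(queues)     # ['orders']
print(to_add)     # ['orders_jmx']
print(to_remove)  # ['legacy_jmx']
```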

To define an individual monitor, you need to specify its properties and policies. The property set and the rule set depend on the type of the monitor. Not all occurrences are supported by MonitorAgent. For each type of supported occurrence, there is at least one Java class that implements the report component and the action component for the occurrence. Some of the properties are mandatory, some are conditionally mandatory and others are optional. Some of the properties are common ones used for classification and correlation. Others are unique to an individual type of monitor. Here is the list of common properties for all monitors.

Property Name | Data Type | Requirement | Description | Examples
Name | alphanumeric with no spaces | mandatory | name of the monitor | qmgr_proc
Site | alphanumeric with no spaces | optional | site that the monitor is associated with | DEVOPS
Category | alphanumeric with no spaces | optional | category of the monitor for event correlation | WMQ or ESB
Type | alphanumeric with no spaces | mandatory | type of the monitor | ProcessMonitor
ClassName | alphanumeric with no spaces | mandatory | full name of the Java class for the implementation | org.qbroker.monitor.FileMonitor
URI | string of URL | mandatory | the universal resource identifier | file:///var/log/nginx/access.log
Description | text | optional | brief description for the monitor | cross-watch for MonitorAgent on panda
Step | integer | optional | to generate the report once every specific number of heartbeats | 2 (default is 1)
Tolerance | integer | optional | to ignore the first specific number of consecutive failures | 2 (default is 0)
MaxRetry | integer | optional | to invoke the action up to specific number of times if failure persists | 1 (default is 2)
MaxPage | integer | optional | to send page alerts up to specific number of times if failure still persists | 0 (default is 2)
QuietPeriod | integer | optional | to keep quiet up to specific number of times if failure still persists | 12 (default is 0)
ExceptionTolerance | integer | optional | to ignore the first specific number of the exceptions from the monitor | 5 (default is 2)
DependencyGroup | list | optional | list of dependency groups | see DependencyGroup below
StaticDependencyGroup | list | optional | list of static dependency groups | see StaticDependencyGroup below
ActionGroup | list | optional | list of the actions | see ActionGroup below
Reference | map | optional | the map of reference | see Reference below
ActiveTime | map | mandatory | the map with the active time slot for MonitorAgent to watch the occurrence | see ActiveTime below

Most of the common properties are simple Strings. Among them, the most important properties are Type, ClassName and URI. Type is the short name of the monitor object designed for a specific type of occurrence. ClassName is the full name of the Java class for the implementation. URI specifies what is to be monitored and where it is. The other properties are either Maps or Lists. A Map contains multiple key-value pairs, whereas a List is a sequential list of items whose order matters. Let's go over the common properties that are either Maps or Lists.

DependencyGroup

DependencyGroup is a List containing the dependencies of the monitor. A given monitor may depend on other monitors. In this case, we say the monitor has dependencies. On the other hand, the monitor may have its own dependents, i.e., some other monitors may depend on the current monitor.

If DependencyGroup is defined for a monitor, the monitor will check the dependency first. The result may be success or failure. The monitor continues its normal operation if the result is a success. Otherwise, the monitor will be disabled. In other words, MonitorAgent uses the DependencyGroup to mimic the IF statement so that it can control the workflow. You can put multiple dependencies into DependencyGroup to mimic logical AND and logical OR relationships.

In fact, a Dependency is an instance of Monitor with both ReportMode and DisableMode set properly. The Dependency can be defined in-line or in a separate file. Once it is defined, you just need to reference the dependency via its name. Here is an example of a monitor which can be used as a Dependency:

{
  "Name": "rpt_panda",
  "ClassName": "org.qbroker.monitor.ScriptLauncher",
  "Site": "DEVOPS",
  "Type": "ScriptLauncher",
  "Category": "REPORT",
  "Description": "report on hostname",
  "Step": "1",
  "Tolerance": "0",
  "MaxRetry": "2",
  "MaxPage": "1",
  "QuietPeriod": "12",
  "ExceptionTolerance": "2",
  "Script": "/bin/uname -n",
  "ScriptTimeout": "40",
  "ReportMode": "final",
  "DisableMode": "1",
  "XPatternGroup": [{
    "Pattern": ["^panda\\.?"]
  }],
  "ActiveTime": {
    "TimeWindow": [{
      "Interval": "00:00:00-24:00:00"
    }]
  }
}

where DisableMode identifies this report as one for a dependency. ReportMode defines the scope of the report. If this report is deployed to a box whose hostname does not match panda, the report will be a failure. As a result, it will disable all its dependents. If DisableMode is set to -1, the test result is reversed, i.e., the dependents will be disabled only if the report is a success.

Here is an example of DependencyGroup with a Dependency defined in-line:

{
  ...
  "DependencyGroup": [{
    "Dependency": [{
      "Name": "repo_agent",
      "ClassName": "org.qbroker.monitor.URLMonitor",
      "URI": "http://panda:8082/agent/panda/agent.json",
      "Operation": "HEAD",
      "Username": "omadm",
      "Password": "xxxx",
      "MaxBytes": "0",
      "Pattern": "Last-[mM]odified: (\\w+, \\d+ \\w+ \\d+ \\d+:\\d+:\\d+ \\w+)",
      "DateFormat": "EE, dd MMM yyyy HH:mm:ss zz",
      "Timeout": "60",
      "TimeOffset": "0"
    }]
  }],
  ...
}

More dependencies can be added to the list for the logical relationships of AND and OR. All dependencies inside the same list of Dependency are evaluated as AND. All the DependencyGroups behave like OR.

With a group of dependencies, you can easily control the monitor flow dynamically.
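
The AND/OR evaluation described above can be sketched as follows. This Python snippet only illustrates the logic stated in the text; the data layout (a pair of report result and DisableMode per dependency) and the function names are hypothetical and do not reflect MonitorAgent's internal classes.

```python
def dependency_satisfied(report_ok, disable_mode=1):
    """Evaluate one dependency; a DisableMode of -1 reverses the test result."""
    return (not report_ok) if disable_mode == -1 else report_ok


def monitor_enabled(dependency_groups):
    """Dependencies in one group are ANDed; the groups themselves are ORed."""
    if not dependency_groups:
        return True  # no dependency defined, nothing to check
    return any(
        all(dependency_satisfied(ok, mode) for ok, mode in group)
        for group in dependency_groups
    )


# Group 1: both dependencies must succeed. Group 2: a single reversed one.
groups = [
    [(True, 1), (False, 1)],  # fails, since the second dependency failed
    [(False, -1)],            # passes, since the failure is reversed
]
print(monitor_enabled(groups))  # True
```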

StaticDependencyGroup

StaticDependencyGroup is a DependencyGroup that is evaluated at startup only. If it fails, the monitor will be disabled permanently. It is used to partition monitors across multiple platforms or hosts.

{
  ...
  "StaticDependencyGroup": [{
    "Dependency": ["rpt_panda"]
  }],
  ...
}
where the Dependency of rpt_panda has been defined separately.

ActionGroup

ActionGroup is a list of the actions for MonitorAgent to invoke as the response to certain events. In an action, the content of the event is accessible via the variable names, such as ##hostname## for hostname of the event. Here is an example:

{
  ...
  "ActionGroup": [{
    "URI": "script://localhost",
    "Priority": "^ERR$",
    "Timeout": "30",
    "Script": "/opt/qbroker/init.d/S50QFlow_EVENT restart"
  },{
    "URI": "smtp://web.qbroker.org",
    "Priority": "^ERR$",
    "Email": ["warn@web.qbroker.org"],
    "Subject": "##hostname##: ##priority## ##name## died",
    "TemplateFile": "/opt/qbroker/templates/mail_proc.txt"
  },{
    "URI": "smtp://web.qbroker.org",
    "Priority": "^CRIT$",
    "Email": ["page@web.qbroker.org"],
    "Subject": "##hostname##: ##priority## ##name## died",
    "TemplateFile": "/opt/qbroker/templates/mail_proc.txt"
  }],
  ...
}

This ActionGroup defines three actions. The first one bounces the process if the priority of the event is ERR. The other two are for email alerts. One reacts only to the ERR event and sends an email alert as a warning. The other reacts to the CRIT event and sends the alert as a page. In each action, you can define format templates for specific event types. You can also define a substitution rule to modify the content of the events. In order to see what happens in an action, you can add a Debug tag and set it to 1. Here is an example of an Action with Substitution defined to modify the content of the event:

{
  ...
  "ActionGroup": [{
    "URI": "jdbc:oracle:thin:@localhost:1520/mydb",
    "Priority": "^ERR$",
    "Username": "monitor",
    "Password": "xxxx",
    "SQLStatement": "DELETE FROM MY_LOCK WHERE LOCK_NAME = '##leadingBlock##' AND CREATE_DATE < SYSDATE - 1.0/24",
    "Substitution": [{
      "leadingBlock": "s/^[-:0-9 ]+\\| //"
    }],
    "DBTimeout": "50"
  }],
  ...
}
where ##leadingBlock## is referencing the attribute of leadingBlock of the event. The value of the attribute is something as follows:
2016-08-17 13:27:22 | PurgeProcessedMessageCommandLock
The Substitution cuts off the timestamp, the spaces and the pipe character, so that only PurgeProcessedMessageCommandLock is used to replace the variable ##leadingBlock## in SQLStatement.

This Action is to run the SQLStatement on an Oracle DB to delete the aged lock with the lock name specified in the event.
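
How an action is selected by priority and how its variables are filled in can be sketched in Python. This is a rough illustration, not MonitorAgent's implementation: it only handles the simple s/pattern/replacement/ form without slashes inside the pattern, the event dictionary is a hypothetical sample, and the shortened SQLStatement is used only to render a string, not to run anything against a database.

```python
import re


def apply_substitution(expr, value):
    """Apply a simple sed-style rule of the form s/pattern/replacement/."""
    parts = expr.split("/")
    if len(parts) != 4 or parts[0] != "s":
        raise ValueError("unsupported substitution: " + expr)
    return re.sub(parts[1], parts[2], value, count=1)


def matching_actions(actions, event):
    """Select the actions whose Priority pattern matches the event."""
    return [a for a in actions if re.search(a["Priority"], event["priority"])]


def render(action, field, event):
    """Fill the ##name## variables of one field with the event attributes."""
    values = dict(event)
    for rule in action.get("Substitution", []):
        for key, expr in rule.items():
            if key in values:
                values[key] = apply_substitution(expr, values[key])
    return re.sub(r"##(\w+)##",
                  lambda m: str(values.get(m.group(1), m.group(0))),
                  action[field])


actions = [{
    "Priority": "^ERR$",
    "SQLStatement": "DELETE FROM MY_LOCK WHERE LOCK_NAME = '##leadingBlock##'",
    "Substitution": [{"leadingBlock": r"s/^[-:0-9 ]+\| //"}],
}, {
    "Priority": "^CRIT$",
    "Subject": "##hostname##: ##priority## ##name## died",
}]
event = {
    "priority": "ERR",
    "leadingBlock": "2016-08-17 13:27:22 | PurgeProcessedMessageCommandLock",
}
selected = matching_actions(actions, event)
print(len(selected))  # 1
print(render(selected[0], "SQLStatement", event))
# DELETE FROM MY_LOCK WHERE LOCK_NAME = 'PurgeProcessedMessageCommandLock'
```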

Reference

Reference is a Map specifying a separate report as the reference object. It is used in time or number correlations. For example, a CMS publisher has a primary template and secondary templates for each publish event. The rendering of the primary template is a synchronous process, but the rendering of the secondary templates is asynchronous. Therefore, the files of the secondary templates may be delayed or may fail in their publish processes. MonitorAgent can be used to monitor the timestamps of the files of the secondary templates and correlate them with the file of the primary template.

Here is an example:

{
  ...
  "Reference": {
    "Name": "url_timestamp",
    "URI": "ftp://www.qbroker.org/www/wdap/rrd/generic.json",
    "Type": "FileMonitor",
    "User": "qbadm",
    "Password": "xxxx",
    "Timeout": "30",
    "TimeZone": "GMT"
  },
  ...
}
This reference defines a report on the timestamp of the URL. Its timestamp will be used as the reference in the time correlation process.

ActiveTime

ActiveTime is a Map specifying when the monitor is active. It was introduced to address the blackout issue. For example, most web servers have daily log rotations. You do not want MonitorAgent to alert you during the rotation. Therefore, you can specify a blackout window so that the monitor will not be active during the rotation. ActiveTime is also used to schedule time-driven jobs, just like cron jobs.

Here is an example of ActiveTime:

{
  ...
  "ActiveTime": {
    "StartTime": "2005/03/18.08:00:00.EST",
    "StopTime": "2005/12/18.23:59:59.EST",
    "Blackout": ["6,00:00:00-24:00:00", "7", "4/6,20:00:00-04:00:00"],
    "TimeWindow": [{
      "Interval": "00:30:00-09:45:00"
    },{
      "Interval": "10:30:00-17:45:00"
    },{
      "Interval": "18:30:00-23:45:00"
    }]
  },
  ...
}
This example says the monitor is active from 3/18 through 12/18, on weekdays only, except for April 6th from 8:00 PM through 4:00 AM the next day. On each active day, there are three active time windows.
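
The TimeWindow part of the example can be sketched as follows. This is only an illustration, not MonitorAgent's code: the function names are hypothetical, both ends of an interval are assumed to be inclusive, and intervals that wrap past midnight are not handled.

```python
from datetime import time

def parse_interval(spec):
    """Split 'HH:mm:ss-HH:mm:ss' into a (start, stop) pair of time objects."""
    start, stop = spec.split("-")
    return time.fromisoformat(start), time.fromisoformat(stop)

def in_time_window(now, intervals):
    """Return True if `now` falls inside at least one interval.

    Assumption: both ends of an interval are inclusive.
    """
    for spec in intervals:
        start, stop = parse_interval(spec)
        if start <= now <= stop:
            return True
    return False

# The three Interval values from the ActiveTime example above
WINDOWS = ["00:30:00-09:45:00", "10:30:00-17:45:00", "18:30:00-23:45:00"]

print(in_time_window(time(12, 0, 0), WINDOWS))  # noon is inside the second window
print(in_time_window(time(10, 0, 0), WINDOWS))  # 10:00 is in the gap between windows
```

Under these assumptions, a monitor would be skipped at 10:00 and run at noon on an active day.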

As you can see, ActiveTime contains at least one active time window. Within an active time window, the user can define a threshold for certain monitors. For example, FileMonitor and AgeMonitor require a threshold to be defined. A threshold is two or three numbers delimited by commas. The range below the first number always means NORMAL. The range between the first two numbers means WARNING. The range beyond the second number means ERR.
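
The threshold rule can be illustrated with a short sketch. The function names and the sample threshold "600,1800" are made up for illustration. How a value exactly equal to a threshold number is treated is an assumption, and an optional third number is ignored here because the rule above only uses the first two.

```python
def parse_threshold(spec):
    """Parse a comma-delimited threshold such as '600,1800' into numbers."""
    return [float(x) for x in spec.split(",")]

def classify(value, spec):
    """Map a measured value to NORMAL, WARNING or ERR.

    Assumption: a value equal to a threshold number falls into the higher range.
    """
    numbers = parse_threshold(spec)
    first, second = numbers[0], numbers[1]
    if value < first:
        return "NORMAL"
    if value < second:
        return "WARNING"
    return "ERR"

# Hypothetical threshold: 600 and 1800 seconds
for seconds in (100, 900, 3600):
    print(seconds, classify(seconds, "600,1800"))
```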

Events

In the diagram of the Enterprise Monitor Network, MonitorAgent acts as an agent on the nodes. Each monitor instance is represented by its own type of Event. An Event is a self-describing structured message, similar to a JSON message or a JMS MapMessage. It has certain mandatory properties, such as priority, name, site, type, text, etc. It may also have other customized and free-form properties. MonitorAgent uses Event to store information about what the monitor observed. The benefit of Event is that different applications on different platforms can easily parse, match, evaluate, correlate, present and process the information carried by the message.

The primary task of the action component is to transform the raw data from the report into a more readable, actionable and interchangeable event. With Event, the details of the occurrence are able to flow across the network. This mobility allows any monitor to publish its reports as Events and allows other applications to subscribe to the content based on their interests.

Since all the monitor alerts are actually Events, it is important to understand the structure of the specific type of Event when you are creating or configuring the monitor. In fact, each type of monitor has its own type of Event. Among the various properties, some are mandatory. Others may be common or unique to the type. Here are the common properties of an Event:

Attribute Name | Description
priority | priority of Event
name | name of monitor
uri | URI of the monitor
type | type of Event
site | site of Event
category | category of Event
text | message of Event
hostname | hostname where the event is from
program | application name that sends Event
owner | owner of application
pid | Unix process ID of the application
date | date and time of the event
testTime | date and time of the report
status | status of the report
actionScript | status of the action script: executed, skipped and not configured
actionCount | number of times that it occurs in a row
description | description of the monitor

The most important thing is to know what properties are available for certain types of events and what they mean. The developers are supposed to document the data structure or schema for each type of event. If you cannot find the documentation on a specific type of Event, you are still able to figure out most of the properties after such an event is delivered to you.

When something occurs, the action component generates an event on it and increases its ActionCount. By evaluating the ActionCount, the monitor knows the history of the occurrence and decides if any escalation is needed. If the same thing is happening again and again, the action component escalates the priority of the event based on the predefined policies. Here is the matrix of event escalation for common scenarios:
ActionCount(c) | WARNING | ERR | CRIT
0<c<=T | log | - | -
0<c-T<=R | - | log, mail, run script | -
0<c-T-R<=P | - | - | log, page
0<c-T-R-P<=Q | nothing
where c is ActionCount, T is Tolerance, R is MaxRetry, P is MaxPage and Q is QuietPeriod. Once ActionCount exceeds QuietPeriod, the escalation starts over again at ERR.
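
The matrix can be read as a small decision function. The sketch below is one interpretation, not MonitorAgent's implementation: the function and argument names are hypothetical, and "start over again at ERR" is assumed to mean that after the quiet stage the count re-enters the retry (ERR) stage and cycles through ERR, CRIT and quiet again.

```python
def escalate(c, tolerance, max_retry, max_page, quiet_period):
    """Return (priority, actions) for ActionCount c, following the matrix above.

    Assumption: once the quiet stage is used up, the cycle restarts at the ERR stage.
    """
    if c <= 0:
        raise ValueError("ActionCount must be positive")
    if c <= tolerance:
        return "WARNING", ["log"]
    cycle = max_retry + max_page + quiet_period
    if cycle == 0:
        return None, []
    k = (c - tolerance - 1) % cycle  # position inside the ERR/CRIT/quiet cycle
    if k < max_retry:
        return "ERR", ["log", "mail", "run script"]
    if k < max_retry + max_page:
        return "CRIT", ["log", "page"]
    return None, []  # quiet stage: nothing is done

# Hypothetical policy: T=1, R=2, P=1, Q=2
for count in range(1, 8):
    print(count, escalate(count, 1, 2, 1, 2))
```

With this hypothetical policy, counts 1 through 7 give WARNING, ERR, ERR, CRIT, nothing, nothing, and then ERR again.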

Supported Monitors

Currently, MonitorAgent only supports the following types:
Type | ClassName | Description
AgeMonitor | org.qbroker.monitor.AgeMonitor | to monitor the age of an object and correlate with other objects
ChannelMonitor | org.qbroker.wmq.ChannelMonitor | to monitor a WMQ channel and check its flow rate
DBRecord | org.qbroker.monitor.DBRecord | to query a DB table and scan all records for certain patterns
ExpectedLog | org.qbroker.monitor.ExpectedLog | to launch a script and expect specific new log entries showing up in the log file
FileMonitor | org.qbroker.monitor.FileMonitor | to monitor the latest modified time of a file
IncrementalMonitor | org.qbroker.monitor.IncrementalMonitor | to monitor the increment of a number to see if it is out of range
JMSHealthChecker | org.qbroker.jms.JMSHealthChecker | to health-check a JMS Destination by sending a message to it
JMSLogMonitor | org.qbroker.jms.JMSLogMonitor | to monitor a JMS Destination implemented via a log file
JMSMonitor | org.qbroker.jms.JMSMonitor | to monitor a JMS application via its queue and the log file
JMXQMonitor | org.qbroker.monitor.JMXQMonitor | to monitor a generic JMS Destination via JMX Service
LatestRecord | org.qbroker.monitor.LatestRecord | to monitor the update time of a database record and correlate with other updates
MultiFileMonitor | org.qbroker.monitor.MultiFileMonitor | to monitor the latest modified time of a group of files
NumberMonitor | org.qbroker.monitor.NumberMonitor | to monitor a number to see if it is out of range
ProcessMonitor | org.qbroker.monitor.ProcessMonitor | to monitor a unix process with specific patterns
PropertyMonitor | org.qbroker.monitor.PropertyMonitor | to monitor a JSON property file to see if it is modified or not
QueueMonitor | org.qbroker.wmq.QueueMonitor | to monitor the queue depth or its changes on a WMQ queue
ScriptLauncher | org.qbroker.monitor.ScriptLauncher | to launch a script and check its output for errors
ServiceMonitor | org.qbroker.monitor.ServiceMonitor | to monitor service metrics via Monit status page
SyntheticMonitor | org.qbroker.monitor.SyntheticMonitor | to launch a Selenium script to check a web site
SonicMQMonitor | org.qbroker.sonicmq.SonicMQMonitor | to monitor the metrics and its changes on a SonicMQ broker
URLMonitor | org.qbroker.monitor.URLMonitor | to monitor the update time of a web page and correlate with other updates
UnixlogMonitor | org.qbroker.monitor.UnixlogMonitor | to monitor a log file and detect the new occurrences of log entries matching at least one of the given patterns
WebOperator | org.qbroker.monitor.WebOperator | to access a given URL and check its content for errors
WinlogMonitor | org.qbroker.monitor.WinlogMonitor | to monitor Windows Event log

Each type of monitor object handles one specific scenario. It is up to the report component to file a report and determine the status of the report. It is up to the action component to determine if the report is a failure and the level of priority of the failure as well as the actions to invoke. Even though the number of supported monitors is limited, you can still combine them just like Lego to create and monitor new occurrences.

No system is perfect, and neither is MonitorAgent. Sometimes, MonitorAgent's test part may fail due to some unexpected exceptions. For example, a FileMonitor tries to get the timestamp of a remote file, and a network outage makes the ftp process fail. As a result, MonitorAgent is not able to judge when the file was updated. In this case, MonitorAgent treats it as an exception and the action component is supposed to handle the exception.

UnixlogMonitor

UnixlogMonitor scrapes the logs of applications. It detects new occurrences of log entries that match at least one of the given patterns. If a matching log entry is detected, the monitor treats it as an error that in turn triggers the actions. For example, if the SportsTicker receiver catches an exception, it logs the exception into its own log file. UnixlogMonitor can catch the exception logs and send alerts. UnixlogMonitor guarantees that each of the new entries is checked only once, provided that each log entry has a well-defined timestamp and the timestamps increase monotonically.

Once the monitor catches any entries, it sends an event. The event only contains the last entry and the number of matching entries. Therefore, it is the customer's job to check the log for the details. Otherwise, you may miss important log entries.

Currently, UnixlogMonitor uses Java SimpleDateFormat to parse the timestamps. If you do not know how to configure the timestamp pattern, please read the Javadoc of that class or ask someone for help.

Here are the type specific properties in order to configure an instance of UnixlogMonitor.
Property Name | Requirement | Description | Examples
URI | mandatory | URI of the log file | log:///var/log/nginx/error_log
TimePattern | mandatory | A SimpleDateFormat pattern to parse timestamp of logs | yyyy-MM-dd HH:mm:ss,SSS
PerlPattern | optional | A Perl pattern to parse out the portion of timestamp of logs if needed | ^\w+ (\d\d\d\d-\d+-\d+ \d+:\d+:\d+,\d\d\d)
LogSize | optional | maximum number of lines scanned for each log entry | 5 (default 1)
ReferenceFile | mandatory | full filename for storing state info | /var/log/qbroker/.status/nginx_error.log
ErrorIgnored | optional | threshold for number of matching entries to trigger CRIT event (bypassing normal escalations) | 10 (default is 0 for off)
PatternGroup | mandatory | a list of Perl pattern groups to match certain logs | see example
XPatternGroup | optional | a list of Perl pattern groups to exclude certain logs | see example
Here is an example: ersr_log.json.
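
To show how PerlPattern and TimePattern work together, here is a rough sketch. It uses Python's re and datetime.strptime as stand-ins for the Perl pattern engine and Java SimpleDateFormat that UnixlogMonitor actually uses, so the strptime format is only an approximate counterpart of yyyy-MM-dd HH:mm:ss,SSS. The function name and the sample log line are made up.

```python
import re
from datetime import datetime

# The PerlPattern example from the table: one word, then the timestamp to capture
PERL_PATTERN = re.compile(r"^\w+ (\d\d\d\d-\d+-\d+ \d+:\d+:\d+,\d\d\d)")
# Stand-in for the TimePattern yyyy-MM-dd HH:mm:ss,SSS (assumption: strptime equivalent)
STRPTIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"

def entry_timestamp(line):
    """Cut the timestamp portion out of a log line and parse it, or return None."""
    match = PERL_PATTERN.search(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), STRPTIME_FORMAT)

line = "INFO 2002-12-31 07:23:01,762 worker started"  # made-up log line
print(entry_timestamp(line))
```

A line whose leading portion does not match the pattern yields None, which is why PerlPattern is only needed when the timestamp is not at the very beginning of the entry.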

The most confusing property is TimePattern. Given a log file, how can you determine what TimePattern to use? TimePattern is determined by the timestamp of the logs. Compare the timestamp of the log entries with the following table and try to find one with exactly the same pattern:

TimePattern | Example
EE MMM d HH:mm:ss yyyy | "Mon Aug 27 10:03:16 2001 ..."
yyyy-MM-dd HH:mm:ss,SSS | "2002-12-31 07:23:01,762 ..."
yyyy-MM-dd'T'HH:mm:ss.SSS'Z' | "2002-07-21T10:45:23.215Z ..."
dd/MMM/yyyy:HH:mm:ss | "23/Feb/2002:17:03:51 ..."
[yyyy-MM-dd HH:mm:ss | "[2002-05-28 11:43:13] ..."
[yyyy/MM/dd HH:mm:ss | "[2002/07/21 10:45:23] ..."
[yyyy-MM-dd HH:mm:ss,SSS | "[2002-05-28 11:43:13,276] ..."
MM/dd/yy HH:mm:ss | "09/21/02 10:45:23 ..."
MM/dd/yy hh:mm:ss a | "03/21/99 07:15:43 PM ..."
yyyy-MM-dd.HH:mm:ss | "2002-07-21.10:45:23 ..."
yyyy-MM-dd HH:mm:ss | "2002-07-21 10:45:23 ..."
yyyy-MM-dd HH:mm:ss zz | "2002-07-21 10:45:23 EDT ..."
MMM d HH:mm:ss | "Aug 3 10:03:16 ..."
MM-dd-yyyy HH:mm:ss SSS | "09-21-2002 10:45:23 267 ..."
yyyy.MM.dd ss:mm:HH | "2002.07.21 30:45:23 ..."
ss:mm:HH:dd:MM:yyyy | "53:45:23:29:11:2003 ..."
ss | "1464507681 ..."
ss.SSS | "1464507681.456 ..."

If there is no similar one, you just need to create one for the new type of the timestamp. TimePattern is implemented via Java SimpleDateFormat. Its Javadoc will be really helpful for you to create a new TimePattern. If it is too complicated for you, please ask around for help.

Let's talk about the PatternGroup. PatternGroup is a list of Perl Pattern groups used to match certain patterns in the log entries. Here is an example:

{
  ...
  "PatternGroup": [{
    "Pattern": ["(ERROR|FATAL) "]
  },{
    "Pattern": ["WARN ", "\\.NullPointerException"]
  }],
  ...
}
This PatternGroup defines two pattern groups. The first one matches either ERROR or FATAL in the log. The second one contains two patterns and catches the WARN log with .NullPointerException. As you can see, within a pattern group, there may be multiple patterns. A log entry must match all of them at the same time to count as a match for that group. The relationship between the two groups is a logical OR: an entry that matches either group counts as a match.
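
The AND-within-a-group, OR-across-groups rule can be sketched as below. Python's re module stands in for the Perl pattern engine, the function names are hypothetical, and the handling of XPatternGroup (a match on any exclusion group wins) is an assumption.

```python
import re

# The PatternGroup from the example above
PATTERN_GROUP = [
    {"Pattern": ["(ERROR|FATAL) "]},
    {"Pattern": ["WARN ", "\\.NullPointerException"]},
]

def group_matches(group, entry):
    """A group matches only if every one of its patterns is found in the entry."""
    return all(re.search(p, entry) for p in group["Pattern"])

def entry_matches(entry, pattern_groups, x_pattern_groups=()):
    """An entry matches if any group matches and no exclusion group does.

    Assumption: exclusion groups take precedence over matching groups.
    """
    if any(group_matches(g, entry) for g in x_pattern_groups):
        return False
    return any(group_matches(g, entry) for g in pattern_groups)

print(entry_matches("[2005-06-24 18:22:39,657] ERROR - exception", PATTERN_GROUP))
print(entry_matches("WARN java.lang.NullPointerException at Foo", PATTERN_GROUP))
print(entry_matches("WARN disk is 80% full", PATTERN_GROUP))
```

The third entry contains WARN but not .NullPointerException, so it fails the second group and is not counted.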

By default, the monitor sends an Event to react upon any occurrences according to the preconfigured ruleset. Here is an example of the email alert for the type of UnixlogMonitor:

This message was automatically generated.

           Type: UnixlogMonitor
          Owner: qbadm
   ActionScript: not configured
         Status: Normal
    ActionCount: 1
           Date: 2005/06/24.18:26:15.226.EDT
     NumberLogs: 2
            Pid: 11711
      LastEntry: [2005-06-24 18:22:39,657] ERROR - exception
org.qbroker.publisher.DistributionException: failed to put /www/wdap/rrd/images/mq/abc.png: 550 /: Permission denied 
           Text: 'log:///var/log/qbroker/publisher.log' has 2 new matching entries
            Uri: log:///var/log/qbroker/publisher.log
    Description: QBroker publisher log for testing
       TestTime: 2005/06/24.18:26:15.216.EDT
       Hostname: panda
       Priority: WARNING
           Name: publisher_log
        Program: MonitorAgent

A brief explanation on the monitor and this type of message can be found at
https://yannanlu.github.io/agent.html#UnixlogMonitor.

This alert tells us the log file /var/log/qbroker/publisher.log contains 2 error entries and the last one occurred at [2005-06-24 18:22:39,657]. As you can see, the monitor only sends the first 5 lines of the last error entry. In order to get the details of the error, you have to open the log file and look for it. It is up to the admins to figure out what caused the error and how to handle it. Apart from the common properties, the following table lists all other type specific properties and their descriptions:

Attribute | Description
numberLogs | number of matching entries
lastEntry | the original content of the last entry up to the limited lines
Please pay attention to the first letter of the attribute names. They are all in lower case, even though they are displayed with upper case in the email.

ScriptLauncher

ScriptLauncher launches a script provided by customers. It assumes that the script produces no unexpected output if it succeeds. If anything is printed, the monitor treats it as a failure and invokes the pre-configured actions.

ScriptLauncher allows you to integrate your own monitor script with the MonitorAgent. The MonitorAgent will provide history tracking, event generating, email alerting, and other services. For example, you can ignore the first occurrence, call restart script at second and third failures, log to Bugzilla at next two occurrences, etc.

Here are the type specific properties in order to configure an instance of ScriptLauncher.

Property Name | Requirement | Description | Examples
Script | mandatory | full path and name of the script | /mydir/test.sh
ScriptTimeout | optional | seconds to timeout the script | 30
XPatternGroup | optional | a list of Perl pattern groups to ignore some of the output from the script | ^Sun Micro
Here is an example: rrd_daily.json.
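
The "any unexpected output means failure" idea can be sketched as follows. This is not ScriptLauncher's code: the function name is hypothetical, the ignore list is simplified to a flat list of patterns instead of pattern groups, and combining stdout with stderr is an assumption. The demo launches the Python interpreter itself so that it runs anywhere.

```python
import re
import subprocess
import sys

def unexpected_output(command, timeout_sec, ignore_patterns=()):
    """Run the command and return the output lines that are not ignored.

    An empty result means success; anything else would be treated as a failure.
    Assumption: stderr is merged into stdout.
    subprocess.TimeoutExpired is raised if the timeout is exceeded.
    """
    result = subprocess.run(command, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True,
                            timeout=timeout_sec)
    lines = [ln for ln in result.stdout.splitlines() if ln.strip()]
    return [ln for ln in lines
            if not any(re.search(p, ln) for p in ignore_patterns)]

# Stand-in scripts: one prints an ignorable banner, the other prints an error
banner = [sys.executable, "-c", "print('Sun Micro banner')"]
broken = [sys.executable, "-c", "print('boom')"]
print(unexpected_output(banner, 30, [r"^Sun Micro"]))
print(unexpected_output(broken, 30, [r"^Sun Micro"]))
```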

SyntheticMonitor

SyntheticMonitor runs a Selenium script as a headless browser on a web site in terms of a list of tasks. The first step is always to get the page of the given URI. If NextTask is defined in the config, each of the tasks will be executed one after another in the listed order. If any task fails, the entire test will fail.

NextTask defines a list of actions implemented via Selenium script engine. Each member of NextTask will contain Operation, LocatorType, LocatorValue, PauseTime in ms and WaitTime in sec, etc. The basic idea for each of the actions is to look for certain DOM object with certain value and click on it or send some text to it.

Here are the type specific properties in order to configure an instance of SyntheticMonitor.

Property Name | Requirement | Description | Examples
NextTask | mandatory | a list of actions of the test script | see example
ScriptTimeout | optional | milliseconds to timeout the script | 60000
PageLoadTimeout | optional | milliseconds to timeout the first page load | 90000
Here is an example: synthetic_dev.json.

ProcessMonitor

ProcessMonitor monitors the processes on a box. It checks the process table for the special patterns. If there is a match to all the patterns, it assumes the process is running. Otherwise, the process is down and the actions will be invoked as the response to the failure. This monitor is quite effective as long as the given pattern list is unique. However, it will not be able to detect anything wrong if the process is hung.

Sometimes the system is busy or resources are tight, and MonitorAgent may fail to get the process info. In this case, MonitorAgent treats it as an exception rather than a failure. The ExceptionTolerance property determines how many times MonitorAgent will ignore the exceptions. If the exception persists, MonitorAgent upgrades its priority and treats it as a failure.

Here are the type specific properties in order to configure an instance of ProcessMonitor.

Property Name | Requirement | Description | Examples
PatternGroup | mandatory | a list of Perl pattern groups in order to single out the process | SportsTicker
PSCommand | optional | ps command | /usr/ucb/ps -auxwww (default)
PSTimeout | optional | seconds to timeout ps command | 30
PidPattern | optional | pattern to match the pid | ^\w+\s+(\d+)

Here is an example: cron_proc.json.
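
Here is a rough sketch of the process table scan. It is an illustration only: the function name and the sample ps output are made up, a single pattern group is assumed, and Python's re stands in for the Perl pattern engine. The pid is extracted with the PidPattern example from the table.

```python
import re

def matching_pids(ps_output, patterns, pid_pattern=r"^\w+\s+(\d+)"):
    """Return the pids of all process lines that match every pattern."""
    pids = []
    for line in ps_output.splitlines():
        if all(re.search(p, line) for p in patterns):
            match = re.search(pid_pattern, line)
            if match:
                pids.append(int(match.group(1)))
    return pids

# Made-up ps output for illustration
SAMPLE_PS = """\
USER       PID %CPU %MEM COMMAND
qbadm    28376  0.1  2.3 java EventFlow
root       412  0.0  0.1 /usr/sbin/cron
"""

print(matching_pids(SAMPLE_PS, ["java", "EventFlow"]))  # the process is running
print(matching_pids(SAMPLE_PS, ["SportsTicker"]))       # empty list: the process is down
```

An empty list corresponds to numberPids being 0, which is the case that triggers the actions.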

By default, the monitor sends an Event to react upon any occurrences according to the preconfigured ruleset. Here is an example of the email alert for the type of ProcessMonitor:

This message was automatically generated.
Process of mlb_scorecast died on panda1:

    ActionCount: 2
   ActionScript: executed
           Date: 2005/12/07.11:12:27.859.EST
       Hostname: panda1
         Status: normal
          Owner: qbadm
            Pid: 28376
       Priority: ERR
           Name: eventflow_proc
     NumberPids: 0
        Program: MonitorAgent
       TestTime: 2005/12/07.11:12:24.449.EST
           Text: eventflow_proc is down
           Type: ProcessMonitor
    Description: process monitor on EventFlow

This monitor checks Unix process table for a predefined patterns periodically.
If this is the first occurrence of the error, please connect to panda1, and
then restart the process.  If it persists, please contact the on call.

A brief explanation on the monitor and this type of message can be found at
https://yannanlu.github.io/agent.html#ProcessMonitor.

This alert tells us the process of EventFlow on panda1 was not running at 2005/12/07.11:12:24.449.EST. It is the second time in a row that the monitor has detected the failure. The action script has been executed to respond to the failure. Apart from the common properties, the following table lists all other type specific properties and their descriptions:

Attribute | Description
numberPids | number of pids matching the patterns
pids | list of the pids
Please pay attention to the first letter of the attribute names. They are all in lower case, even though they are displayed with upper case in the email.

NumberMonitor

NumberMonitor monitors a generic number and compares it to the predefined ranges. If the number falls into certain range, the monitor will send an event as the alert. It can be used to monitor memory usage of a process, or disk usage of a host, number of connections, etc.

Property Name | Requirement | Description | Examples
URI | mandatory | URI of the test | proc:///bin/ps?rss
Pattern | mandatory | pattern for parsing the number | \s+\d+\s+(\d+) [^0-9]
Operation | optional | method for aggregation | count (default)
CriticalRange | optional | data range for CRIT | [90,)
ErrorRange | mandatory | data range for ERR | [60,90)
WarningRange | optional | data range for WARNING | [0,60)
Timeout | optional | seconds to timeout the test | 30 (default is 60)

Among the three ranges, it is required to define at least one of them. Here is an example: db_conn_num.json.
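
The range notation in the examples can be read as standard interval notation. The sketch below is written under that assumption and is not NumberMonitor's code: the function names are hypothetical, an empty bound is taken to mean unbounded, and the three ranges are assumed to map to CRIT, ERR and WARNING respectively, checked in that order.

```python
def parse_range(spec):
    """Parse '[60,90)' into (low, low_closed, high, high_closed).

    Assumption: '[' or ']' means inclusive, '(' or ')' exclusive, empty means unbounded.
    """
    low_closed = spec[0] == "["
    high_closed = spec[-1] == "]"
    low_text, high_text = spec[1:-1].split(",")
    low = float(low_text) if low_text.strip() else float("-inf")
    high = float(high_text) if high_text.strip() else float("inf")
    return low, low_closed, high, high_closed

def in_range(value, spec):
    low, low_closed, high, high_closed = parse_range(spec)
    above = value >= low if low_closed else value > low
    below = value <= high if high_closed else value < high
    return above and below

def classify_number(value, critical=None, error=None, warning=None):
    """Return the first priority whose range contains the value, or None."""
    for priority, spec in (("CRIT", critical), ("ERR", error), ("WARNING", warning)):
        if spec is not None and in_range(value, spec):
            return priority
    return None

for number in (95, 60, 59.5):
    print(number, classify_number(number, "[90,)", "[60,90)", "[0,60)"))
```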

IncrementalMonitor

IncrementalMonitor monitors the increment of a generic number and compares it to the predefined ranges. If the increment falls into a certain range, the monitor sends an event as the alert. If the data is reset, the monitor resets its state, too. So it assumes the number is always increasing except for resets. It can be used to monitor network traffic, byte counts, etc.

Property Name | Requirement | Description | Examples
URI | mandatory | URI of the test | proc:///bin/ps?rss
Pattern | mandatory | pattern for parsing the number | \s+\d+\s+(\d+) [^0-9]
Operation | optional | method for aggregation | max, min, first, last or count (default)
CriticalRange | optional | data range for CRIT | [90,)
ErrorRange | mandatory | data range for ERR | [60,90)
WarningRange | optional | data range for WARNING | [0,60)
Timeout | optional | seconds to timeout the test | 30 (default is 60)

As you can see, it is very similar to NumberMonitor. Please remember those data ranges are for incrementals.
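
The increment and reset handling can be sketched with a tiny state holder. The class name is hypothetical, and returning None on the first sample and on a reset (instead of raising an alert) is an assumption about how a reset would be treated.

```python
class IncrementTracker:
    """Remember the previous sample and report the increment of each new one."""

    def __init__(self):
        self.previous = None

    def update(self, value):
        """Return the increment, or None for the first sample or after a reset.

        Assumption: a value lower than the previous one means the counter was reset.
        """
        previous, self.previous = self.previous, value
        if previous is None or value < previous:
            return None
        return value - previous

tracker = IncrementTracker()
for sample in (100, 130, 5, 12):  # made-up byte counts; 5 simulates a reset
    print(sample, tracker.update(sample))
```

The returned increment, not the raw number, is what would then be compared against the data ranges.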

WebOperator

WebOperator monitors a web listener locally or remotely. It sends a GET request to the web server periodically. It also checks the content of the requested web page. In case of fatal failures, like connection refused or wrong content, it invokes the action to send alert or to restart the web server.

Property Name | Requirement | Description | Examples
URI | mandatory | URI of the page | http://www.qbroker.org/index.html
Timeout | optional | seconds to timeout the request | 30 (default is 60)
Pattern | optional | a string expected to be returned with the page | QBroker (default is 200 OK)
MaxBytes | optional | max bytes to read from the page | 1020000 (default is 512)
Here is an example: was_web.json.

FileMonitor

FileMonitor monitors the lateness of expected updates to a given local or remote file. It assumes that the file always exists and is readable at all times. Once the monitor gets the timestamp of the file, it compares the timestamp with the predefined thresholds and determines if the update to the file is late. If it is late, the monitor treats the occurrence as an error and sends an event.

FileMonitor also supports mtime correlation and size correlation between two updates. In this case, there are two objects involved. One is the reference. The other is the target file to be monitored. The reference controls the correlation process. Whenever the reference is updated or modified, FileMonitor adjusts its timer and correlates this change with the target file. If the target file has been updated accordingly, FileMonitor treats it as OK. Otherwise, it sends alerts according to the predefined tolerance on the lateness of the target file being updated.

Here are the type specific properties in order to configure an instance of FileMonitor.

Property Name | Requirement | Description | Examples
URI | mandatory | uri of the file | ftp://panda/var/log/qbroker/completed
User | mandatory only for ftp | userid used in ftp | orb
Password | mandatory only for ftp | password used in ftp | aabbcc
SetPassiveMode | optional | to use passive ftp | false (default is not defined)
Timeout | optional | seconds to timeout in ftp | 30 (default is 60)
Reference | optional | a map with all info of the reference object | see examples
TriggerSize | optional | threshold of size that triggers change | 1

You also need to specify the threshold for the lateness. The thresholds are defined in ActiveTime. A threshold takes three non-zero numbers in seconds delimited by commas. You can also use HH:mm:ss to specify the time. The first number is for NORMAL, the second for WARNING and the third for ERR. Here is an example: completed_f.json.

Here are two scenarios for using this monitor object. The first case is an end-to-end health-check on a mover. Periodically, a script is called to move a file. The FileMonitor in turn checks the timestamp of the file on the other end. If the file has not been updated within a certain amount of time, it indicates the move either failed or is extremely slow.

The other case is an end-to-end test on a web server. Periodically, the WebOperator requests a page at a given URL. The web server logs each access request into its access log. The FileMonitor in turn checks the timestamp of the access log. If the log has not been updated within a certain amount of time, it indicates the web server's logging process is malfunctioning.

In order to configure FileMonitor to do time correlation, you have to specify a map named Reference in its property map. The reference map contains most of the properties required by an Update object, such as URI, Name, etc. The tolerance of the lateness is controlled by the threshold parameters. In fact, FileMonitor creates a separate instance for the reference. The action part actually does the time correlation between the two objects.

For size correlation, you must specify the TriggerSize in the property map. The TriggerSize is zero or any positive number that defines two different states. One is the state in which the size is less than the TriggerSize. The other is the opposite. When the state of the reference changes, FileMonitor checks the state of the target file. If both files are in the same state, FileMonitor treats it as OK. Otherwise, FileMonitor sends alerts according to the predefined tolerance on the lateness of the target file keeping its state in sync.
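
The two size states can be sketched as below. The function names and state labels are hypothetical, and the sizes are only sample values. With a TriggerSize of 1, the two states simply mean "empty" and "not empty".

```python
def size_state(size, trigger_size):
    """Return which side of TriggerSize the file size is on."""
    return "below" if size < trigger_size else "at_or_above"

def sizes_in_sync(reference_size, target_size, trigger_size):
    """True if the reference and the target file are in the same size state."""
    return size_state(reference_size, trigger_size) == size_state(target_size, trigger_size)

# TriggerSize of 1: the reference has content, so the target should have content too
print(sizes_in_sync(11399, 2048, 1))  # both non-empty: in sync
print(sizes_in_sync(11399, 0, 1))     # target is still empty: out of sync
```

Only the out-of-sync case would start the lateness clock that is compared against the thresholds.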

By default, the monitor sends an Event to react upon any occurrences according to the preconfigured ruleset. Here is an example of the email alert for the type of FileMonitor:

This message was automatically generated.

  ReferenceSize: 11399
           Type: FileMonitor
          Owner: qbadm
  ReferenceTime: 2005/06/23.06:06:32.000.EDT
   ActionScript: not configured
    ActionCount: 3
           Date: 2005/06/23.06:36:49.710.EDT
            Pid: 21382
    Description: time correlation on Money intl page
     LatestTime: 2005/06/22.16:30:30.000.EDT
         Status: very late
           Text: 'ftp://json.qbroker.org/www/wdap/rrd/data.json' has not been updated in the last 846 minutes
            Uri: ftp://json.qbroker.org/www/wdap/rrd/data.json
         Status: very late
       TestTime: 2005/06/23.06:36:49.237.EDT
       Hostname: panda1
       Priority: CRIT
           Name: data_json
        Program: MonitorAgent
      Reference: ftp://loon/www/wdap/rrd/data.json

This monitor periodically checks the mtime of the files,
ftp://json.qbroker.org/www/wdap/rrd/data.json and
ftp://loon/www/wdap/rrd/data.json, to see if they have been updated recently.
If not, it indicates something wrong somewhere along the line.  Please login on
the box and look into it.

A brief explanation on the monitor and this type of message can be found at
https://yannanlu.github.io/agent.html#FileMonitor.

Here Uri tells you what file is being checked. If the file is on a remote machine, the URI shows you what server it is on. Text tells you for how many minutes the file has not been updated. We call this a very late occurrence. In this case, the test is for the mover. It is up to the admin to determine what caused this late event. Apart from the common properties, the following table lists all other type specific properties and their descriptions:

Attribute | Description
latestTime | mtime of the file
reference | uri of the reference (optional)
referenceTime | mtime of the reference (optional)
referenceSize | size of the reference (optional)
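
As a worked check of the sample alert, the 846 minutes in Text appears to be the gap between LatestTime and Date. That reading is an inference from the sample, not a documented rule. The sketch below redoes the arithmetic; the function names are hypothetical and both stamps are assumed to be in the same time zone, so the zone suffix is simply dropped.

```python
from datetime import datetime

FORMAT = "%Y/%m/%d.%H:%M:%S.%f"

def parse_stamp(stamp):
    """Parse '2005/06/23.06:36:49.710.EDT', ignoring the trailing zone name."""
    return datetime.strptime(stamp.rsplit(".", 1)[0], FORMAT)

def minutes_late(latest_time, test_date):
    """Whole minutes between the file's mtime and the time of the event."""
    delta = parse_stamp(test_date) - parse_stamp(latest_time)
    return int(delta.total_seconds() // 60)

# LatestTime and Date from the FileMonitor alert above
print(minutes_late("2005/06/22.16:30:30.000.EDT", "2005/06/23.06:36:49.710.EDT"))  # 846
```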

MultiFileMonitor

MultiFileMonitor monitors the lateness of expected updates on a group of files, local or remote. It assumes that all the files always exist and are readable at all times. Once the monitor gets the timestamps of all the files, it sorts them and takes the earliest one as the leading timestamp. The monitor compares the leading timestamp with the predefined thresholds and determines if it is late or not. If it is late, the monitor treats the occurrence as an error and sends an event.

MultiFileMonitor also supports mtime correlation and size correlation. In this case, there are two objects involved. One is the reference. The other is the leading timestamp to be monitored. The reference controls the correlation process. Whenever the reference is updated or modified, MultiFileMonitor adjusts its timer and correlates this change with the leading timestamp. If the leading timestamp has been updated accordingly, MultiFileMonitor treats it as OK. Otherwise, it sends an event according to the predefined tolerance on the lateness.

Here are the type specific properties in order to configure an instance of MultiFileMonitor.

Property Name | Requirement | Description | Examples
URI | mandatory | uri of the base directory | ftp://loon/www/wwap/rrd/
FileName | mandatory | a list of all files to be monitored | data/mq/abc.rrd
User | mandatory only for ftp | userid used in ftp | qbadm
Password | mandatory only for ftp | password used in ftp | aabbcc
SetPassiveMode | optional | to use passive ftp | false (default is not defined)
Timeout | optional | seconds to timeout in ftp | 30 (default is 60)
Reference | optional | a map with all info of the reference object | see examples
TriggerSize | optional | threshold of size that triggers change | 1

You also need to specify the threshold for the lateness. The thresholds are defined in ActiveTime. A threshold takes three non-zero numbers in seconds delimited by commas. You can also use HH:mm:ss to specify the time. The first number is for NORMAL, the second for WARNING and the third for ERR. Here is an example: aged_group.json.

By default, the monitor sends an Event to react upon any occurrences according to the preconfigured ruleset. Here is an example of the email alert for the type of MultiFileMonitor:

This message was automatically generated.

           Type: MultiFileMonitor
          Owner: qbadm
   ActionScript: not configured
        Details: data/mq/abc.rrd: 2313 2005/02/17.17:35:02.000.EST
         data/num/xyz.rrd: 2112 2005/02/17.18:28:25.000.EST
    ActionCount: 6
         Status: Very late
           Date: 2005/02/17.18:48:32.340.EST
            Pid: 27272
           Text: ftp://loon/www/wdap/rrd/: at least one of the files has not been updated in the last 73 minutes
    LeadingTime: 2005/02/17.17:35:02.000.EST
            Uri: ftp://loon/www/wdap/rrd/
    Description: check mtime on mobile files
       TestTime: 2005/02/17.18:48:26.571.EST
       Hostname: panda1
       Priority: CRIT
           Name: rrd_files
        Program: MonitorAgent

A brief explanation on the monitor and this type of message can be found at
https://yannanlu.github.io/agent.html#MultiFileMonitor.

Here Uri gives you the base uri for all the files. Details lists the files with their relative path, size and mtime. LeadingTime shows the mtime of the oldest (leading) file. Text tells you for how many minutes the leading file has not been updated. We call this a very late occurrence. It is up to the admin to determine what caused this late event. Apart from the common properties, the following table lists all other type specific properties and their descriptions:

Attribute | Description
leadingTime | mtime of the oldest file
details | list of files with the relative path to the uri and their size and mtime
reference | uri of the reference (optional)
referenceTime | mtime of the reference (optional)
referenceSize | size of the reference (optional)
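
The leading timestamp in the sample alert can be reproduced with a short sketch. It takes the two files from Details, picks the earliest mtime and measures its gap to Date. The function names are hypothetical, the zone suffix is dropped on the assumption that all stamps share one time zone, and reading the 73 minutes as whole minutes between LeadingTime and Date is an inference from the sample.

```python
from datetime import datetime

FORMAT = "%Y/%m/%d.%H:%M:%S.%f"

def parse_stamp(stamp):
    """Parse '2005/02/17.17:35:02.000.EST', ignoring the trailing zone name."""
    return datetime.strptime(stamp.rsplit(".", 1)[0], FORMAT)

def leading_file(mtimes):
    """Return (name, mtime) of the file with the earliest mtime."""
    return min(mtimes.items(), key=lambda item: item[1])

# The two files listed in Details of the alert above
mtimes = {
    "data/mq/abc.rrd": parse_stamp("2005/02/17.17:35:02.000.EST"),
    "data/num/xyz.rrd": parse_stamp("2005/02/17.18:28:25.000.EST"),
}
name, leading = leading_file(mtimes)
now = parse_stamp("2005/02/17.18:48:32.340.EST")
late_minutes = int((now - leading).total_seconds() // 60)
print(name, leading, late_minutes)
```

The earliest file is data/mq/abc.rrd, whose mtime equals LeadingTime in the alert, and the gap works out to 73 whole minutes.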

ExpectedLog

ExpectedLog monitors a generic log file periodically and expects new log entries with certain patterns to show up frequently in the log file. If the expected log entry does not show up in the log file within a predefined period of time, ExpectedLog treats the log as late and sets this occurrence as a failure for actions. If the application does not log frequently, ExpectedLog can run a script to trigger certain logs before each check.

ExpectedLog also supports time correlations. You can use the script to touch a file with a specified timestamp. The module checks the log for the new entries and compares their timestamps with the mtime of the file.

For example, we use ExpectedLog to monitor syslog. Every 5 minutes, ExpectedLog runs logger to log a syslog entry. Then MonitorAgent checks the syslog to see if the new entry is there. If not, the syslog daemon may be either hung or dead.

Here are the type specific properties in order to configure an instance of ExpectedLog.
Property Name | Requirement | Description | Examples
URI | mandatory | URI of the log file | log:///var/log/nginx/errors_log
TimePattern | mandatory | A SimpleDateFormat pattern to parse timestamp of logs | yyyy-MM-dd HH:mm:ss,SSS
LogSize | optional | maximum number of lines for a log entry | 5 (default 1)
ReferenceFile | mandatory | filename for storing state info | /var/log/qbroker/.status/web_int.log
PatternGroup | mandatory | an Array instance of patterns | Error
XPatternGroup | optional | an Array instance of patterns to be excluded | Test
TestScript | optional | full path and name of the script | /usr/bin/true
SleepTime | optional | seconds to sleep between script executing and log checking | 60 (default is 0)

You also need to specify the thresholds for the lateness. The thresholds are defined in ActiveTime, which takes three non-zero numbers in seconds separated by commas. You can also use HH:mm:ss to specify the times. The first number is for NORMAL, the second for WARNING and the third for ERR. Here is an example: syslog_elog.json.
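To make the threshold format concrete, here is a small Java sketch that turns such a specification into seconds. It is a hedged illustration only: the class name is hypothetical, and it assumes that each of the three values may be written either as plain seconds or as HH:mm:ss. How MonitorAgent itself parses the ActiveTime settings is not shown here.

```java
// Hypothetical helper for illustration; not part of QBroker.
final class ThresholdSketch {
    // Converts one value, either plain seconds ("300") or HH:mm:ss ("00:05:00"), to seconds.
    static long toSeconds(String token) {
        String t = token.trim();
        if (!t.contains(":")) {
            return Long.parseLong(t);
        }
        String[] parts = t.split(":");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected HH:mm:ss but got: " + token);
        }
        return Long.parseLong(parts[0]) * 3600
                + Long.parseLong(parts[1]) * 60
                + Long.parseLong(parts[2]);
    }

    // Returns {NORMAL, WARNING, ERR} thresholds in seconds.
    static long[] parse(String spec) {
        String[] tokens = spec.split(",");
        if (tokens.length != 3) {
            throw new IllegalArgumentException("expected three values but got: " + spec);
        }
        long[] thresholds = new long[3];
        for (int i = 0; i < 3; i++) {
            thresholds[i] = toSeconds(tokens[i]);
            if (thresholds[i] == 0) {
                throw new IllegalArgumentException("thresholds must be non-zero: " + spec);
            }
        }
        return thresholds;
    }
}
```

For instance, ThresholdSketch.parse("00:05:00,00:10:00,00:15:00") and ThresholdSketch.parse("300,600,900") describe the same three thresholds.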

ChannelMonitor

ChannelMonitor monitors the channels of WMQ Queue Managers and their attachments. It connects to a specified queue manager and checks the status of a given channel. It also checks the queue associated with the channel for the flow rate and for the number of processes that have the queue open for reading or writing. Many scenarios can cause fatal failures on the channel or its attachments. When a fatal error occurs, the action is invoked.

Here are the type-specific properties needed to configure an instance of ChannelMonitor.

| Property Name | Requirement | Description | Examples |
| --- | --- | --- | --- |
| QueueManager | mandatory | name of the WMQ queue manager | BROKER1 |
| ChannelName | mandatory | name of the WMQ channel | BKR1.SUB2 |
| ChannelType | mandatory | type of the WMQ channel | Sender or Receiver |
| QueueName | optional | name of the queue associated with the channel | ST_IN |
| QueueOpenMode | optional | how the queue is opened | KeepOpen or NotKeepOpen |
| StatsLog | optional | filename for storing channel stats | /var/log/qbroker/stats/broker2.log |

Here is an example: host.sub1_chl.json.

QueueMonitor

QueueMonitor monitors the queues of WMQ Queue Managers. It connects to a specified queue manager and checks the number of messages in a given queue. For each queue watched by the monitor, there is a threshold for the maximum number of messages in that queue. If the current number of messages exceeds the threshold, an error is raised. This error triggers email alerts and, if configured, the cleanup of the queue.

Here are the type-specific properties needed to configure an instance of QueueMonitor.

| Property Name | Requirement | Description | Examples |
| --- | --- | --- | --- |
| QueueManager | mandatory | name of the WMQ queue manager | BROKER1 |
| QueueName | mandatory | name of the WMQ queue | ST_IN |
| WaterMark | mandatory | threshold of number of messages in the queue | 1000 or 0.8 (80%) |

Here is an example: catch_q.json.
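The two forms of WaterMark shown in the table (an absolute count such as 1000, or a fraction such as 0.8) suggest a check along the following lines. This Java sketch is an assumption-laden illustration: the class name is hypothetical, and it assumes that a value below 1 is a fraction of the queue's maximum depth, which the text above does not state explicitly.

```java
// Hypothetical helper for illustration; not part of QBroker.
final class WaterMarkSketch {
    // Returns true if the current depth is above the water mark.
    static boolean exceeds(long currentDepth, long maxDepth, double waterMark) {
        // Assumption: a water mark below 1 is a fraction of the queue's maximum depth,
        // anything else is an absolute number of messages.
        double limit = waterMark < 1.0 ? waterMark * maxDepth : waterMark;
        return currentDepth > limit;
    }
}
```

With a maximum depth of 5000, a WaterMark of 0.8 would trip at a depth of 4500, and a WaterMark of 1000 would trip at a depth of 1001.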

JMSMonitor

JMSMonitor monitors a JMS application that reads messages from or writes messages to an MQ queue. Either the application itself or another application logs to a log file. The monitor assumes that an entry is logged whenever the application picks up messages from the queue or puts a message on the queue. It combines QueueMonitor and UnixlogMonitor and checks the message dequeue rate (how many messages have been picked up). If the dequeue rate is zero but the current depth of the queue is non-zero, the monitor treats the incident as an error indicating that the application is hung. This error triggers email alerts and the action program, if one is configured.

Here are the type-specific properties needed to configure an instance of JMSMonitor.

| Property Name | Requirement | Description | Examples |
| --- | --- | --- | --- |
| URI | mandatory | uri for the connection | t3://jcmsref1:7001 |
| ConnectionFactoryName | mandatory for JNDI | connection factory of JNDI | j2eecms_i2/jms/JMSConnectionFactory |
| ContextFactory | mandatory for JNDI | context factory of JNDI | com.sun.jndi.fscontext.RefFSContextFactory |
| Username | optional | username for JNDI | jmstester |
| Password | optional | password for JNDI | test |
| QueueName | mandatory | name of the WMQ queue | CMS_APS |
| Operation | mandatory | the way to handle messages | Get or Put |
| LogFile | optional | filename of the log file | /var/log/qbroker/nohit.log |
| TimePattern | mandatory | time pattern of the log | yyyy-MM-dd HH:mm:ss,SSS |
| LogSize | optional | maximum number of lines for a log entry | 5 (default 1) |
| StatsLog | optional | filename for storing queue stats | /var/log/qbroker/stats/broker_jms.log |
| ReferenceFile | mandatory | filename for storing state info | /var/log/qbroker/.status/broker_jms.log |
| PatternGroup | mandatory | an Array instance of the pattern | Just pushed (\d+) events. |
| NumberDataFields | mandatory | number of fields to sum up | 1 or 0 (default) |

Here is an example: broker_jms.json.
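The hang detection described above (zero dequeue rate with a non-zero queue depth) can be sketched in Java as follows. The class and method names are hypothetical, and the sketch assumes a single numeric capture group per log entry, as in the PatternGroup example from the table; it is not the JMSMonitor source.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of QBroker.
final class JmsHungSketch {
    // Sums the first capture group over all log entries that match the pattern.
    static long sumCounts(List<String> entries, Pattern pattern) {
        long total = 0;
        for (String entry : entries) {
            Matcher m = pattern.matcher(entry);
            if (m.find()) {
                total += Long.parseLong(m.group(1));
            }
        }
        return total;
    }

    // Nothing was dequeued since the last check although messages are waiting.
    static boolean looksHung(long dequeuedSinceLastCheck, long currentDepth) {
        return dequeuedSinceLastCheck == 0 && currentDepth > 0;
    }
}
```

Here sumCounts would be fed the new log entries since the previous check, and its result passed to looksHung together with the queue depth reported by the queue manager.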

JMSHealthChecker

JMSHealthChecker sends a health-check message to a JMS destination and checks whether the message is accepted. On failure, the monitor treats the incident as an error indicating that the destination is not available. This error triggers email alerts and the action program, if one is configured.

Here are the type-specific properties needed to configure an instance of JMSHealthChecker.

| Property Name | Requirement | Description | Examples |
| --- | --- | --- | --- |
| URI | mandatory | uri for the connection | wmq://panda1 |
| ConnectionFactoryName | mandatory for JNDI | connection factory of JNDI | QueueConnectionFactory |
| ContextFactory | mandatory for JNDI | context factory of JNDI | com.sun.jndi.fscontext.RefFSContextFactory |
| Username | optional | username for JNDI | jmstester |
| Password | optional | password for JNDI | test |
| QueueName | mandatory for put | name of the JMS queue | ESB_XML |
| TopicName | mandatory for pub | name of the JMS topic | MyTopic |
| JMSPropertyGroup | optional | map to define msg properties | {"JMSType": "healthcheck"} |
| MessageBody | optional | text of the health-check msg | This is a health-check |

Here is an example: panda1_qhc.json.

JMXQMonitor

JMXQMonitor monitors a generic JMS destination via the JMX service. Most JMS vendors support JMX for management and monitoring purposes, and this monitor takes advantage of that. It monitors the metrics of the destination, such as enqueue rate, dequeue rate, current depth, number of consumers and number of producers. If messages get stuck in the destination, it sends an event as an alert.

Here are the type-specific properties needed to configure an instance of JMXQMonitor.

| Property Name | Requirement | Description | Examples |
| --- | --- | --- | --- |
| URI | mandatory | uri for the JMX connection | service:jmx:rmi:///jndi/rmi://localhost:8686/jmxrmi |
| Username | optional | username for JMX | admin |
| Password | optional | password for JMX | admin |
| MBeanName | mandatory | name of the MBean for the destination | com.sun.messaging.jms.server:type=Destination,subtype=Monitor,desttype=q,name="EVENT_Q_1" |
| StatsLog | optional | filename for storing queue stats | /var/log/qbroker/stats/EVENT_Q_1.mq |

Here is an example: queue_jmx.json.
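For readers unfamiliar with JMX, the following Java sketch shows how a numeric MBean attribute can be read with the standard javax.management API. It is not JMXQMonitor's code: the class and method names are hypothetical, and the demo reads a thread count from the local JVM instead of a broker metric so that it runs without a JMS server. Which attribute names a given broker exposes depends on the vendor.

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

// Hypothetical helper for illustration; not part of QBroker.
final class JmxReadSketch {
    // Reads a numeric attribute from the MBean with the given name.
    static long readNumber(MBeanServerConnection conn, String mbeanName, String attribute) {
        try {
            Object value = conn.getAttribute(new ObjectName(mbeanName), attribute);
            return ((Number) value).longValue();
        } catch (Exception e) {
            throw new IllegalStateException(
                    "cannot read " + attribute + " from " + mbeanName, e);
        }
    }

    public static void main(String[] args) {
        // Local demo against the JVM's own MBean server. A remote broker would be reached via
        // JMXConnectorFactory.connect(new JMXServiceURL(uri)).getMBeanServerConnection().
        MBeanServerConnection conn = ManagementFactory.getPlatformMBeanServer();
        System.out.println(readNumber(conn, "java.lang:type=Threading", "ThreadCount"));
    }
}
```

Sampling such attributes at each check and comparing them with the previous sample is one way to derive rates like the enqueue and dequeue rates mentioned above.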

SonicMQMonitor

SonicMQMonitor monitors the message storage on a SonicMQ broker via JMS/JMX. Currently, it only supports broker, queues and durable subscriptions. Since there are no metrics for the number of dequeued and enqueued messages, it cannot report the flow rate.

Here are the type-specific properties needed to configure an instance of SonicMQMonitor.

| Property Name | Requirement | Description | Examples |
| --- | --- | --- | --- |
| URI | mandatory | uri for the SonicMQ | tcp://localhost:2506 |
| Username | optional | username | admin |
| Password | optional | password | xxxx |
| ObjectName | mandatory | name of the object to monitor | Domain1.brQA01Container:ID=brQA01,category=metric,type=queue,name="EventQueue" |
| StatsLog | optional | filename for storing queue stats | /var/log/qbroker/stats/EventQueue.mq |

Here is an example: queue_sonic.json.

URLMonitor

Use URLMonitor to get the update time and/or size of a given page. It is assumed that the page exists and is accessible at all times. The page should contain an update timestamp that URLMonitor can parse.

You can use URLMonitor to check whether the page has been updated and, if not, how late it is. If the page is very late, URLMonitor sends alerts.

URLMonitor also supports mtime correlations and size correlations between two updates. In this case, two objects are involved: the reference and the target page to be monitored. The reference controls the correlation process. Whenever the reference is updated or modified, URLMonitor adjusts its timer and correlates this change with the target page. If the target page has been updated accordingly, URLMonitor treats it as OK. Otherwise, it sends alerts according to the predefined tolerance on the lateness of the target page being updated.

To configure URLMonitor for time correlations, you have to specify a map named reference in its property map. The reference map contains most of the properties required by an Update object, such as URI, Name, Type, etc. The tolerance of the lateness is controlled by the threshold parameters. In fact, URLMonitor creates a separate instance for the reference object. The performAction() method does the actual time correlation between the two objects.
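One way to picture the time correlation is the Java sketch below. It is an assumption, not the logic of performAction(): the class name is hypothetical, and it assumes that the lateness of the target is counted from the moment the reference was last updated.

```java
// Hypothetical helper for illustration; not part of QBroker.
final class TimeCorrelationSketch {
    // Returns how many seconds the target lags behind the reference, or 0 if it has caught up.
    // All arguments are epoch milliseconds.
    static long latenessSeconds(long referenceMtime, long targetMtime, long nowMillis) {
        if (targetMtime >= referenceMtime) {
            return 0; // the target was updated after the reference changed
        }
        // Assumption: lateness is measured from the reference's update time.
        return Math.max(0L, (nowMillis - referenceMtime) / 1000);
    }
}
```

The returned lateness would then be compared with the configured thresholds to decide whether an alert is due.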

For size correlations, you must specify the TriggerSize in the property map. The TriggerSize is zero or any positive number that defines two different states: one in which the size is less than the TriggerSize, and the opposite one. When the state of the reference changes, URLMonitor checks the state of the target page. If both objects are in the same state, URLMonitor considers it OK. Otherwise, URLMonitor sends alerts according to the predefined tolerance on the lateness of the target page keeping its state in sync.
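The two states defined by TriggerSize can be expressed in a few lines of Java. The class and method names below are hypothetical and the sketch only restates the rule from the paragraph above.

```java
// Hypothetical helper for illustration; not part of QBroker.
final class SizeCorrelationSketch {
    // State 1: the size is below the trigger size. State 2: it is not.
    static boolean belowTrigger(long size, long triggerSize) {
        return size < triggerSize;
    }

    // The reference and the target are in sync when both are in the same state.
    static boolean inSync(long referenceSize, long targetSize, long triggerSize) {
        return belowTrigger(referenceSize, triggerSize) == belowTrigger(targetSize, triggerSize);
    }
}
```

With a TriggerSize of 1, for example, an empty reference (size 0) and a non-empty target (size 42) are out of sync.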

| Property Name | Requirement | Description | Examples |
| --- | --- | --- | --- |
| URI | mandatory | URI of the page | http://www.qbroker.org/index.html |
| Timeout | optional | seconds to timeout the request | 30 (default is 60) |
| Pattern | mandatory | pattern for parsing update time | Updated: (\d+):(\d+) ([ap])\.m\. \w+ \(\d+ GMT\) (\w+) (\d+), (\d\d\d\d) |
| DateFormat | mandatory | format string for update time | hh mm a MMM dd yyyy |

You also need to specify the thresholds for the lateness. The thresholds are defined in ActiveTime, which takes three non-zero numbers in seconds separated by commas. You can also use HH:mm:ss to specify the times. The first number is for NORMAL, the second for WARNING and the third for ERR. Here is an example: panda_url.json.

LatestRecord

Use LatestRecord to get the update time of the latest record and/or the number of certain records in a database table. It is assumed that the database and the table exist and are accessible at all times. The record should contain an update timestamp that LatestRecord can parse.

You can use LatestRecord to check whether the record has been updated and, if not, how late it is. If the record is very late, LatestRecord sends alerts.

LatestRecord also supports mtime correlations and size correlations between two updates. In this case, two objects are involved: the reference and the target record to be monitored. The reference controls the correlation process. Whenever the reference is updated or modified, LatestRecord adjusts its timer and correlates this change with the target. If the target record has been updated accordingly, LatestRecord treats it as OK. Otherwise, it sends alerts according to the predefined tolerance on the lateness of the target record being updated.

To configure LatestRecord for time correlations, you have to specify a map named reference in its property map. The reference map contains most of the properties required by an Update object, such as URI, Name, Type, etc. The tolerance of the lateness is controlled by the threshold parameters. In fact, LatestRecord creates a separate instance for the reference object. The performAction() method does the actual time correlation between the two objects.

For size correlations, you must specify the TriggerSize in the property map. The TriggerSize is zero or any positive number that defines two different states: one in which the size is less than the TriggerSize, and the opposite one. When the state of the reference changes, LatestRecord checks the state of the target record. If both objects are in the same state, LatestRecord considers it OK. Otherwise, LatestRecord sends alerts according to the predefined tolerance on the lateness of the target record keeping its state in sync.

| Property Name | Requirement | Description | Examples |
| --- | --- | --- | --- |
| URI | mandatory | URI of the database | jdbc:oracle:thin:@broker1:1530:mqsibkdb |
| DBDriver | mandatory | classname of the database driver | oracle.jdbc.driver.OracleDriver |
| Username | optional | username for DB connection | qbadm |
| Password | optional | password of the username | xxxx |
| SQLQuery | mandatory | SQL query for the latest record | SELECT max(timestamp) FROM stories |
| Timeout | optional | seconds to timeout the query | 30 (default is 60) |
| TimePattern | mandatory | SimpleDateFormat pattern for parsing update time | yyyy-MM-dd HH:mm:ss |

You also need to specify the thresholds for the lateness. The thresholds are defined in ActiveTime, which takes three non-zero numbers in seconds separated by commas. You can also use HH:mm:ss to specify the times. The first number is for NORMAL, the second for WARNING and the third for ERR. Here is an example: batchqueue_db.json.

DBRecord

Use DBRecord to query a database table and then scan all the records for certain predefined patterns. It is assumed that the database and the table exist and are accessible at all times. If the monitor catches any records, it sends an event. The event contains only the last matching record and the number of matching records. Therefore, it is the customer's job to check the table for the details; otherwise, important records may be missed.

| Property Name | Requirement | Description | Examples |
| --- | --- | --- | --- |
| URI | mandatory | URI of the database | jdbc:oracle:thin:@broker1:1530:mqsibkdb |
| DBDriver | mandatory | classname of the database driver | oracle.jdbc.driver.OracleDriver |
| Username | optional | username for DB connection | qbadm |
| Password | optional | password of the username | xxxx |
| SQLQuery | mandatory | SQL query for the latest record | SELECT max(timestamp) FROM stories |
| Timeout | optional | seconds to timeout the query | 30 (default is 60) |
| PatternGroup | mandatory | list of patterns | error |

Here is an example: query_db.json.
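The scan described above, which reports only the last matching record and the number of matches, might look like this in Java. The sketch is illustrative only: the class name is hypothetical, the records are plain strings rather than rows read through JDBC, and java.util.regex stands in for the monitor's pattern engine.

```java
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of QBroker.
final class RecordScanSketch {
    final int count;         // number of matching records
    final String lastRecord; // last matching record, or null if none matched

    private RecordScanSketch(int count, String lastRecord) {
        this.count = count;
        this.lastRecord = lastRecord;
    }

    // Scans all records and keeps only the match count and the last match.
    static RecordScanSketch scan(List<String> records, List<Pattern> patterns) {
        int count = 0;
        String last = null;
        for (String record : records) {
            for (Pattern p : patterns) {
                if (p.matcher(record).find()) {
                    count++;
                    last = record;
                    break; // count each record once, even if several patterns match
                }
            }
        }
        return new RecordScanSketch(count, last);
    }
}
```

Because earlier matches are discarded, a result with count 2 tells you that two records matched but shows only the second one, which is why the table itself has to be consulted for the details.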

PropertyMonitor

Use PropertyMonitor to watch a JSON property file. If the timestamp of the file changes, it reloads the file and diffs it against the cached copy to see if there are any changes. If there is a change, the monitor sends an event about the change. MonitorAgent uses it to monitor its own configuration files and automatically reload them when a change is detected.

| Property Name | Requirement | Description | Examples |
| --- | --- | --- | --- |
| URI | mandatory | URI of the config file | file:///opt/qbroker/agent/Agent.json |
| Basename | mandatory | name of the master config file | Agent |
| ComponentGroup | mandatory | map of components | Monitor |
| PropertyFile | optional | full name of the property file | /opt/qbroker/agent/Agent.json |

Here is an example: repository_agent.json.
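The diff step can be illustrated with a short Java sketch that compares a cached property map with a freshly loaded one. The class name and the "added", "removed" and "changed" labels are hypothetical, and the sketch only compares top-level keys; it does not claim to reproduce how PropertyMonitor diffs nested JSON.

```java
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of QBroker.
final class PropertyDiffSketch {
    // Returns a map from each changed top-level key to "added", "removed" or "changed".
    static Map<String, String> diff(Map<String, Object> cached, Map<String, Object> reloaded) {
        Map<String, String> changes = new TreeMap<>();
        for (String key : cached.keySet()) {
            if (!reloaded.containsKey(key)) {
                changes.put(key, "removed");
            } else if (!Objects.equals(cached.get(key), reloaded.get(key))) {
                changes.put(key, "changed");
            }
        }
        for (String key : reloaded.keySet()) {
            if (!cached.containsKey(key)) {
                changes.put(key, "added");
            }
        }
        return changes;
    }
}
```

An empty result means the file was touched without any effective change, in which case no event would be needed.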

AgeMonitor

Use AgeMonitor to watch the lifetime of an object. If the lifetime of the object falls into certain ranges, the monitor sends an event as an alert. For example, you can use AgeMonitor to monitor some processes. If a process runs too long, the monitor can kill it, provided the action script is configured to do that.

| Property Name | Requirement | Description | Examples |
| --- | --- | --- | --- |
| URI | mandatory | URI of the test | proc:///bin/ps?etime |
| Pattern | mandatory | pattern for parsing time | \d+\s+(\d[^ ]+\d) |
| DateFormat | mandatory | format string for time | D-HH:mm:ss |
| Operation | optional | method for aggregation | max, first, last or min (default) |
| Timeout | optional | seconds to timeout the test | 30 (default is 60) |

There are some other parameters required by the monitor, depending on the scheme of the URI. You also need to specify the thresholds for the age. The thresholds are defined in ActiveTime for time dependence support. ActiveTime takes three non-zero numbers in seconds separated by commas. You can also use HH:mm:ss to specify the times. The first number is for NORMAL, the second for WARNING and the third for ERR. Here is an example: mqm_age.json.
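As an illustration of the process-age case, the Java sketch below extracts elapsed times from ps-style output with the Pattern shown in the table and aggregates them according to Operation. The class and method names are hypothetical, the sample lines are made up, and the elapsed time is parsed by hand (days, hours, minutes, seconds) instead of through the DateFormat property; AgeMonitor itself may work differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of QBroker.
final class AgeSketch {
    // Converts an elapsed time such as "2-03:04:05", "03:04:05" or "12:34" to seconds.
    static long etimeToSeconds(String etime) {
        long days = 0;
        String rest = etime;
        int dash = etime.indexOf('-');
        if (dash >= 0) {
            days = Long.parseLong(etime.substring(0, dash));
            rest = etime.substring(dash + 1);
        }
        long seconds = 0;
        for (String part : rest.split(":")) {
            seconds = seconds * 60 + Long.parseLong(part);
        }
        return days * 86400 + seconds;
    }

    // Extracts the age from every matching line and aggregates with max, first, last or min.
    static long aggregate(List<String> lines, Pattern pattern, String operation) {
        List<Long> ages = new ArrayList<>();
        for (String line : lines) {
            Matcher m = pattern.matcher(line);
            if (m.find()) {
                ages.add(etimeToSeconds(m.group(1)));
            }
        }
        if (ages.isEmpty()) {
            throw new IllegalArgumentException("no age found in the input");
        }
        switch (operation) {
            case "max":
                return Collections.max(ages);
            case "first":
                return ages.get(0);
            case "last":
                return ages.get(ages.size() - 1);
            default:
                return Collections.min(ages); // min is the documented default
        }
    }
}
```

The aggregated age in seconds is what would be compared with the thresholds defined in ActiveTime.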

WinlogMonitor

WinlogMonitor scrapes the Windows event log with pattern-matching support. Unlike UnixlogMonitor, which depends on increasing timestamps, WinlogMonitor uses WMIC to query the Windows event log via a JavaScript script.

| Property Name | Requirement | Description | Examples |
| --- | --- | --- | --- |
| URI | mandatory | URI of monitor | wlog:///application/AGILITYSDK |
| Script | mandatory | a JavaScript script to query the event log | cmd.exe /c "cscript c:\home\qbroker\bin\extractLog.js //Nologo -l Application -s AGILITYSDK -s MSSQLSERVER -t 2 -t 1 -a ##yyyy####MM####dd####HH####mm####ss##.##SSS## -i ##RN## 2>nul" |
| LogSize | optional | maximum number of lines for a log entry | 1 |
| ReferenceFile | mandatory | filename for storing state info | C:\home\qbroker\status\live.log |
| PatternGroup | mandatory | a list of Perl pattern groups to match certain logs | see example |
| XPatternGroup | optional | a list of Perl pattern groups to exclude certain logs | see example |

Here is an example: ecs_wlog.json.

Todo list

MonitorAgent is part of QBroker, an ongoing open source project on GitHub.