MonitorAgent is an agent that is part of the QBroker project. As an agent, it periodically checks predefined occurrences or listens on data sources for certain patterns. It is a standalone Java process running as a daemon that hosts a set of monitor objects, each designed for a category of occurrences. An occurrence can be any incident or event generated by applications, hardware, etc. Each occurrence is monitored by an instance of a monitor object that is registered with the container, MonitorAgent.
A monitor object consists of two components. The first is the report component, which has a method for testing or detecting the occurrence. The result of the report component is also called a report. The second is the action component, which evaluates the test result (report) and invokes various actions in case of failures or exceptions. Periodically, MonitorAgent launches the report component and passes the result to the action component. The action component checks the report to determine the priority of the event and sends an event to a centralized event collector, EventFlow, for further analysis and evaluation. In case of failure, MonitorAgent can invoke the pre-configured actions, such as sending an email alert, launching an action script to restart the service, or logging the errors to ServiceNow or syslog, according to the pre-configured policies.
The customized components do the actual work. MonitorAgent is just a container that runs and manages all registered monitor instances. It also provides services such as scheduling, thread pooling, report sharing, a centralized repository, dynamic deployment and workflow support. Besides the monitors for reports and actions, instances of MessageFlow can also run inside MonitorAgent. The integration with MessageFlow provides support for dynamic monitors, node level event processing and flexible workflows.
MonitorAgent is supposed to run on every physical or virtual box. Together with the EventFlow, StatsFlow and the centralized configuration repository, as well as the webadmin console, it is not difficult to build a Monitor Network with domain, sites and nodes and manage them through web browsers. Here is a conceptual diagram of a Monitor Network.
In the diagram, there are multiple components. Agent stands for the instance of MonitorAgent running on each of the nodes. EventCollector is a web service that logs the events and metrics from agents or other applications. EventFlow is an instance of QBroker Flow for event correlation and escalation. StatsFlow is another instance of QBroker Flow that processes metrics. Configuration Repository is a web-based centralized configuration file store for deployment. WebAdmin is a web application for management of the repository and runtime operations. With this kind of hierarchical framework, decisions and actions can be made at any level: the monitor level, the node level, the domain level or the organization level.
Even though MonitorAgent can be owned by any user and installed anywhere, here we focus on the default setup. By default, MonitorAgent runs as qbadm and is homed at /opt/qbroker on Unix platforms. On Windows, it runs as system and is homed at C:\home\qbeoker. The directory for configurations is agent. It is recommended to enable the configuration repository and to make changes on the configuration repository only. If the web-based admin tool is available, always use it to manage the repository and the operation tasks on all monitors. If you are interested in the source code, you can find the open source project of QBroker on GitHub.
By default, MonitorAgent will be installed in /opt/qbroker and owned by qbadm:qb.
The installation on a Unix platform is simple, especially if your box has web access to https://yannanlu.github.io. In that case, you just need to log in to the box and run the following command to have it installed:
wget -O - https://yannanlu.github.io/misc/installQB.sh | sudo bash
In some cases, web access to https://yannanlu.github.io may not be allowed. Then you will have to download the tarball and the installation script from https://yannanlu.github.io and copy them to the box for the installation. Here is the step-by-step procedure:
wget --no-check-certificate https://75.131.197.149/qbroker.tgz
wget https://yannanlu.github.io/misc/installQB.sh
sudo bash ./installQB.sh .
The installation on Windows platform is a bit different. We are not going to discuss it here.
Once MonitorAgent is installed and configured properly, you should have the following filesystem layout on a Unix platform:
path | function | example |
---|---|---|
/opt/qbroker | home dir of QBroker | |
/opt/qbroker/bin | dir for startup script and other utilities | |
/opt/qbroker/bin/agentctl | startup script of MonitorAgent | ./agentctl restart |
/opt/qbroker/lib | dir for Java libraries and shared libraries | |
/opt/qbroker/agent | dir of configuration files for MonitorAgent | |
/opt/qbroker/agent/Agent.json | master configuration file | Agent.json |
/opt/qbroker/templates | dir for template files | |
/opt/qbroker/templates/mail.txt | template file for email alert | mail.txt |
/var/log/qbroker | dir for QBroker logs | |
/var/log/qbroker/MonitorAgent.log | log file of MonitorAgent | |
/var/log/qbroker/MonitorAgent.out | stdout and stderr of MonitorAgent | |
/var/log/qbroker/completed | status file of MonitorAgent | |
/var/log/qbroker/.status | dir for stateful files | |
/var/log/qbroker/archive | dir for archived logs | |
/var/log/qbroker/checkpoint | dir for checkpoint data | |
/var/log/qbroker/stats | dir for statistical and historical logs | |
The normal operation tasks include start, stop, restart, troubleshooting with the log files, configuration management and deployment. MonitorAgent keeps its configuration files in /opt/qbroker/agent. Among the various configuration files, the master configuration file is the most important. It is /opt/qbroker/agent/Agent.json. Here is an example: Agent.json. The details of the configuration will be covered in a dedicated section. Here we will focus on how to start, stop and restart MonitorAgent on the local box and where the logs are.
To start MonitorAgent, go to /opt/qbroker/bin and run ./agentctl restart as the owner or root. If this command is not invoked by its owner, it will try to su to the owner and probably prompt for the password of the owner. MonitorAgent runs as qbadm by default. In case you want MonitorAgent to restart applications owned by others, you should use sudo for the access. Please check with sudo's man pages on how to allow a user to run scripts on behalf of someone else. There are plenty of examples in the repository. You may find something helpful. To stop MonitorAgent, go to /opt/qbroker/bin and run ./agentctl stop as the owner or root. To check if the process is running or not, run ./agentctl status. The alternative way to operate on MonitorAgent is to use WebAdmin's operation view to start/stop/restart MonitorAgent remotely.
If you are lucky, MonitorAgent's process will be running as a daemon. Otherwise, you need to troubleshoot the problem. Always check the errors in /var/log/qbroker/MonitorAgent.log and /var/log/qbroker/MonitorAgent.out. If you are still not able to start MonitorAgent, please ask around for help.
Once MonitorAgent has been installed, you can start to configure the monitor components for your needs. The next section will explain the configurations on each component. Here we focus on the repository and deployment.
Since you may have MonitorAgent installed on multiple boxes, it will be very difficult to manage them without a centralized repository. The current MonitorAgent supports a web-based JSON configuration repository. You just need to define or modify the config files in the repository and then publish the changes on the repository as the deployment. MonitorAgent monitors the repository and picks up the changes within a couple of minutes. Then it reloads the modified objects and continues to work. We call this asynchronous deployment. If the web repository feature is not enabled for the instance of MonitorAgent, it will never see the changes. In this case, you will have to push the changes to the box and bounce the instance. We call this synchronous deployment.
In fact, a JSON web-based repository is just a regular web site serving a set of JSON configuration files via HTTP. There are two different deployment processes. One is asynchronous deployment that is just to publish the JSON content, very similar to a production web site with static content. But it requires the application's active involvement. The other is synchronous deployment, requiring a push of changes and a restart on the application. There are three questions to be answered before you make a decision on which deployment method to use. The first two are whether the application supports the web-based repository and whether the feature is enabled or not. If both answers are positive, the next question will be whether the application is running and working properly. If any of the answers is no, you will have to use the synchronous deployment. Otherwise, you can use either of the deployment methods.
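The three questions above can be sketched as a small decision helper. This is only an illustration in Python, not part of MonitorAgent or deploy.sh; the function name and its boolean parameters are hypothetical.

```python
def allowed_deployments(supports_repo, repo_enabled, running_ok):
    """Return the deployment methods usable for one application instance.

    supports_repo: the application supports the web-based repository
    repo_enabled:  the repository feature is enabled on this instance
    running_ok:    the application is running and working properly
    """
    if supports_repo and repo_enabled and running_ok:
        # All three answers are positive: either method works.
        return ["asynchronous", "synchronous"]
    # Any negative answer leaves only the push-and-restart method.
    return ["synchronous"]


print(allowed_deployments(True, True, True))
print(allowed_deployments(True, False, True))
```

The helper only encodes the rule stated in the text; how each answer is obtained for a given box is left out on purpose.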
If WebAdmin has been installed, it should be used for the management of the configuration repository. WebAdmin is powered by JavaScript on the client side, QBroker in the middle tier and MySQL at the backend. It is a web-based tool designed for generic applications. For applications like MonitorAgent, EventFlow and QFlow, it allows users to manage their configuration repository and carry out the routine operation tasks via web browsers such as Firefox or Safari. It supports both synchronous and asynchronous deployment. It is also highly integrable and extensible because of its QBroker middle tier. We are not going to cover how to use WebAdmin to deploy changes here. We are going to focus on the command line deployment only.
In case WebAdmin is not available, you still can make the deployment via the command line utility. If you have made any changes to the repository, please remember to import the changes to the database once WebAdmin is back up. It is very important to keep the repository and the database in sync. We will revisit this issue later.
The command line deployment is done by running deploy.sh as the owner of the repository. The utility supports both synchronous and asynchronous deployment. Either way, you do not have to log in to each box and repeat the task. The command line utility will save you time, especially when you are dealing with clusters of multiple boxes.
By default, the repository is at /www/wdap/docroot/agent. Since MonitorAgent runs as qbadm, you will need the access rights to modify the files in it. For your own domain, you can choose a box with Nginx and have /www/wdap/docroot/agent copied over. Inside /www/wdap/docroot/agent, there are multiple directories named after categories. All the boxes are grouped into categories. For example, panda1 and panda2 are grouped into the service panda, simply because they share a lot of functionality.
Here is the procedure to deploy changes asynchronously:
Here is the procedure to deploy changes synchronously:
The configuration files of MonitorAgent are a set of JSON files. Usually, these JSON files are stored in the repository and deployed to the boxes. If you know what to do, you can use your favorite editor to modify them either on the box or in the repository. If WebAdmin is available, always use it to create, modify, delete, import, export, upload and deploy the configuration files of MonitorAgent. However, we are not going to cover how to use WebAdmin to modify JSON configuration files. Here we will focus on the structure of the configuration file, its content and what the content is for. After all, you have to know what to change in the JSON configuration files before you modify them.
Each type of MonitorAgent component has its own configuration file. The data schema of the components varies from type to type. Please do not feel bad if you are confused by the various configuration files for MonitorAgent. MonitorAgent is designed to support various monitor components, and it is up to the developers to specify what is required in the configuration file. However, you may find a set of sample files for each scenario in the repository or in /opt/qbroker/agent/examples on some of the boxes. As a good start, you can copy a similar example and modify it according to your needs. The JSON files are usually self-explanatory. Even though you may not understand every property, you will still be able to figure out what to change in most cases. If you run into problems configuring a monitor, just ask around for help.
The master configuration file of MonitorAgent is /opt/qbroker/agent/Agent.json. Here is an example: Agent.json. For each individual monitor, there may be a dedicated configuration file. Once the configuration is ready, please run /opt/qbroker/bin/agentctl filename in the config directory to check the syntax of the configuration file. If the syntax check fails, fix the reported problem and check again.
If you view the master configuration file, you will see four parts. The first part is the set of properties for the MonitorAgent container itself. It specifies how often MonitorAgent runs each of the components (heartbeat in seconds), where to send the events (http://panda:8082/event), where to log locally, where the repository is, whether to turn on the disable feature, etc. The second part is the AdminServer, which supports synchronous remote control and queries. The third is a MonitorGroup list. Each MonitorGroup lists the names of its monitors. A name has to be unique within the container, and the monitor can be defined in the same file or in a separate file in the same directory. Whenever you add a new monitor, you need to define the monitor first and then add the entry to one of the MonitorGroups in the master file. If the repository is configured, you just need to deploy the changes. Otherwise, you have to bounce the monitor to activate the new changes. The last is a MessageFlow list. A MessageFlow contains multiple message nodes representing a certain workflow. If the repository is defined, the changes to MessageFlows will be reloaded automatically after they are deployed. Otherwise, a synchronous deployment and a bounce of the container will be required.
Here is an example of the AdminServer's definition:
{
  ...
  "AdminServer": {
    "Name": "admin",
    "ClassName": "org.qbroker.net.SimpleHttpServer",
    "URI": "https://localhost:6627/admin/jms",
    "Operation": "handle",
    "Capacity": "64",
    "Partition": "0,32",
    "KeyStoreFile": "/opt/qbroker/agent/keystore.jks",
    "KeyStorePassword": "xxxx",
    "TrustAllCertificates": "true",
    "Timeout": "10",
    "RestartScript": "/bin/bash -c \"/opt/qbroker/bin/agentctl restart &\""
  },
  ...
}
where TrustAllCertificates is set to true for client queries in case keystore.jks is self-signed.
Here is an example of definition for a MessageFlow:
{
  ...
  "MessageFlow": [{
    "Name": "default",
    "Description": "dispatch events for Agent",
    "Capacity": "1024",
    "XAMode": "0",
    "Debug": "1",
    "PauseTime": "2",
    "StandbyTime": "60",
    "Node": [
      "node_switch"
    ],
    "Persister": [
      "pstr_event",
      "pstr_nohit"
    ]
  }] // end of MessageFlow
}
MessageFlow is optional. If there is a need, you can define multiple MessageFlows. A MessageFlow can be used to provide arbitrary services and/or to listen for certain requests or data streams. Both the dynamic monitor support and the node level event correlations are implemented via MessageFlow.
All the monitors in the same group are processed within the same thread, in the order of the list. You can have multiple groups for independent monitors. The order in which the groups run is not well defined in a multi-threaded environment. However, the order within the same group is honored. Each group can have its own heartbeat, timeout and debug settings. Among all the groups, the default group is special. First, it can never be disabled. Second, it runs first at the time of startup or reload. So you should only put basic monitors and reports in the default group. The property MaxNumberThread controls the maximum number of concurrent threads in the thread pool.
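The scheduling rule described here (sequential within a group, concurrent across groups, default group first) can be illustrated with a short Python sketch. It is not MonitorAgent's actual Java implementation; run_group, run_groups and the run_monitor callback are hypothetical names, and every Monitor entry is assumed to be a plain name rather than a template map.

```python
from concurrent.futures import ThreadPoolExecutor


def run_group(group, run_monitor):
    """Run every monitor of one group sequentially, in list order."""
    return [run_monitor(name) for name in group["Monitor"]]


def run_groups(groups, run_monitor, max_threads=4):
    """Run the default group first, then the other groups concurrently.

    max_threads plays the role of MaxNumberThread (an assumption).
    """
    results = {}
    for group in groups:
        if group["Name"] == "default":
            results["default"] = run_group(group, run_monitor)
    others = [g for g in groups if g["Name"] != "default"]
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        futures = {g["Name"]: pool.submit(run_group, g, run_monitor)
                   for g in others}
        for name, future in futures.items():
            results[name] = future.result()
    return results


groups = [
    {"Name": "default", "Monitor": ["global_var", "rotation_agent_out"]},
    {"Name": "queue", "Monitor": ["MyQueue_jlog", "broker1", "broker2"]},
]
print(run_groups(groups, lambda name: name.upper(), max_threads=2))
```

Groups finish in an unpredictable order relative to each other, but each group's result list always follows the order of its Monitor list.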
In each MonitorGroup, it lists all the names of the monitors. It can also contain Map objects for monitor templates. Here is an example:
{
  ...
  "MonitorGroup": [
    {
      "Name": "default",
      "Monitor": [
        "global_var",
        "rotation_agent_out",
        "rotation_agent_stats"
      ] // end of monitor
    },{
      "Name": "queue",
      "Heartbeat": "120",
      "Capacity": "128",
      "Monitor": [
        "MyQueue_jlog",
        {
          "Name": "broker_sonic",
          "Template": "broker##id##",
          "Item": ["1","2"]
        }
      ]
    }
  ], // end of MonitorGroup
  ...
}
As you can see, there are two monitor objects defined in the group queue. The first one references the name of the monitor, MyQueue_jlog. There should be a JSON file, MyQueue_jlog.json, in the folder. The second is a Map containing a Name, broker_sonic, as the reference to the configuration template file; a Template, broker##id##, as the name template for the new monitors; and a list of items used as the values of ##id## to generate the new monitors from the configuration template. For example, the name of the monitor for the first item will be broker1, because the variable ##id## in the name template is replaced by 1, the value of the first item. So MonitorAgent will generate 2 new monitors sharing the same property template. This is convenient since you do not have to define a monitor for every host.
In the example above, the item list is static, so we call the monitor template static. MonitorAgent also supports dynamic monitor templates. If a monitor template is dynamic, the data for Item has to be a Map that defines a MonitorReport object to generate the item list dynamically. Here is an example:
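The name expansion of a static template can be sketched in a few lines of Python. This only illustrates the naming rule from the text; expand_static_template is a hypothetical helper, and the real expansion also fills the variable into the properties of the configuration template, which is not shown.

```python
import re


def expand_static_template(entry):
    """Return (monitor_name, config_template) pairs for a static template."""
    pairs = []
    for item in entry["Item"]:
        # Only one variable is supported, so replace whatever ##...## appears.
        name = re.sub(r"##\w+##", lambda m: item, entry["Template"])
        pairs.append((name, entry["Name"]))
    return pairs


entry = {"Name": "broker_sonic", "Template": "broker##id##", "Item": ["1", "2"]}
print(expand_static_template(entry))
# [('broker1', 'broker_sonic'), ('broker2', 'broker_sonic')]
```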
{
  ...
  "MonitorGroup": [
    {
      "Name": "queue",
      "Heartbeat": "120",
      "Capacity": "256",
      "Monitor": [
        {
          "Name": "queue_sonic",
          "Template": "##queue##",
          "Item": {
            "Type": "GenericList",
            "Description": "JMS/JMX listing on SonicMQ",
            "URI": "tcp://##hostname##:2506",
            "Username": "qbadm",
            "Password": "xxxx",
            "RequestCommand": "DISPLAY Domain1.brQA01Container:ID=brQA01,category=metric,type=queue",
            "DataField": "List",
            "KeyTemplate": "##name##",
            "ReportMode": "local",
            "Step": "1",
            "XPatternGroup": [{
              "Pattern": [ "SampleQ\\d+" ]
            }]
          }
        }
      ]
    }
  ],
  ...
}
If you compare this example to the previous one, you will notice the changes in the content and data type of Item. There is no list of items any more. Instead, the Item block defines a MonitorReport object that generates the list of items dynamically. The data field of the report is specified as List. When MonitorAgent dispatches the dynamic group to a working thread, it invokes the method generateReport() and retrieves the list of queues from the report. This way, you do not need to care about which queues are available.
Dynamic monitor templates are really useful for monitoring dynamic objects. Monitoring the queues of Apache ActiveMQ is a good example. Queues in ActiveMQ can be created by applications, so some queues come and go. It is very challenging to keep track of them. With a dynamic monitor template, MonitorAgent is able to discover new queues to monitor and to remove the monitor when a queue is gone. Here is the configuration example:
{
  ...
  "MonitorGroup": [
    {
      "Name": "amq",
      "Heartbeat": "60",
      "Monitor": [
        {
          "Name": "queue_jmx",
          "Template": "##queue##_jmx",
          "Substitution": "s/^.*,Destination=//",
          "Item": {
            "Type": "GenericList",
            "Description": "JMX listing on ActiveMQ",
            "URI": "service:jmx:rmi:///jndi/rmi://localhost:8999/jmxrmi",
            "Username": "admin",
            "Password": "xxxx",
            "MBeanName": "org.apache.activemq:BrokerName=localhost,Type=Queue,*",
            "DataField": "List",
            "ReportMode": "local",
            "Step": "5",
            "XPatternGroup": [{
              "Pattern": [ "(example|sample|test)$" ]
            }]
          }
        }
      ]
    }
  ],
  ...
}
In this example, the MonitorGroup will launch the JMX query against the ActiveMQ service every 5 minutes (a Step of 5 on a 60-second heartbeat). The query generates a list of queues on the service. With that list, a monitor is generated from the template queue_jmx for each queue. Each monitor runs every minute to watch its queue. If a queue disappears, its monitor is removed accordingly. In other words, a dynamic monitor template manages a list of monitors based on another monitor. Currently, only one variable is supported in the monitor template.
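The discover-and-reconcile cycle for the ActiveMQ example can be sketched as follows. This is a hedged Python illustration, not the MonitorAgent code: the sample MBean names, the helper names and the order of filtering and substitution are assumptions. It reuses the Substitution and XPatternGroup values from the configuration above.

```python
import re


def discover_queues(mbean_names, exclude_patterns):
    """Drop excluded names, then strip everything up to ',Destination='."""
    queues = []
    for raw in mbean_names:
        if any(re.search(p, raw) for p in exclude_patterns):
            continue  # XPatternGroup: names matching a pattern are skipped
        queues.append(re.sub(r"^.*,Destination=", "", raw))
    return queues


def reconcile(active, queues, name_template="##queue##_jmx"):
    """Return (monitors to add, monitors to remove) as sorted lists."""
    wanted = {name_template.replace("##queue##", q) for q in queues}
    return sorted(wanted - active), sorted(active - wanted)


prefix = "org.apache.activemq:BrokerName=localhost,Type=Queue,Destination="
mbeans = [prefix + "orders", prefix + "unit.test", prefix + "billing"]
queues = discover_queues(mbeans, [r"(example|sample|test)$"])
print(queues)                                         # ['orders', 'billing']
print(reconcile({"orders_jmx", "old_jmx"}, queues))   # (['billing_jmx'], ['old_jmx'])
```

Running the reconcile step on every listing is what lets monitors appear for new queues and disappear for removed ones.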
To define an individual monitor, you need to specify its properties and policies. The property set and the rule set depend on the type of the monitor. Not all occurrences are supported by MonitorAgent. For each type of supported occurrence, there is at least one Java class that implements the report component and the action component for the occurrence. Some of the properties are mandatory, some are conditionally mandatory, and others are optional. Some of the properties are common ones used for classification and correlation. Others are unique to an individual type of monitor. Here is the list of common properties for all monitors.
Property Name | Data Type | Requirement | Description | Examples |
---|---|---|---|---|
Name | alphanumeric with no spaces | mandatory | name of the monitor | qmgr_proc |
Site | alphanumeric with no spaces | optional | site that the monitor is associated with | DEVOPS |
Category | alphanumeric with no spaces | optional | category of the monitor for event correlation | WMQ or ESB |
Type | alphanumeric with no spaces | mandatory | type of the monitor | ProcessMonitor |
ClassName | alphanumeric with no spaces | mandatory | fullname of the Java class for the implementation | org.qbroker.monitor.FileMonitor |
URI | string of URL | mandatory | the universal resource identifier | file:///var/log/nginx/access.log |
Description | text | optional | brief description for the monitor | cross-watch for MonitorAgent on panda |
Step | integer | optional | to generate the report once every specific number of heartbeats | 2 (default is 1) |
Tolerance | integer | optional | to ignore the first specific number of consecutive failures | 2 (default is 0) |
MaxRetry | integer | optional | to invoke the action up to specific number of times if failure persists | 1 (default is 2) |
MaxPage | integer | optional | to send page alerts up to specific number of times if failure still persists | 0 (default is 2) |
QuietPeriod | integer | optional | to keep quiet up to specific number of times if failure still persists | 12 (default is 0) |
ExceptionTolerance | integer | optional | to ignore the first specific number of the exceptions from the monitor | 5 (default is 2) |
DependencyGroup | list | optional | list of dependency group | see the description below |
StaticDependencyGroup | list | optional | list of static dependency group | see the description below |
ActionGroup | list | optional | list of the actions | see the description below |
Reference | map | optional | the map of reference | see the description below |
ActiveTime | map | mandatory | the map with the active time slot for MonitorAgent to watch the occurrence | see the description below |
DependencyGroup is a List containing the dependencies of the monitor. A given monitor may depend on other monitors. In this case, we say the monitor has dependencies. On the other hand, the monitor may have its own dependents, i.e., some other monitors may depend on the current monitor.
If DependencyGroup is defined for a monitor, the monitor will check the dependency first. The result may be success or failure. The monitor continues the normal operation if it is a success. Otherwise, the monitor will be disabled. In other words, MonitorAgent uses the DependencyGroup to mimic the IF statement so that it can control the workflow. You can put multiple dependencies into DependencyGroup to mimic logical AND and logical OR relationships.
A Dependency is an instance of Monitor with both ReportMode and DisableMode set properly. The Dependency can be defined in-line or in a separate file. Once it is defined, you just need to reference the dependency via its name. Here is an example of a monitor that can be used as a Dependency:
{
  "Name": "rpt_panda",
  "ClassName": "org.qbroker.monitor.ScriptLauncher",
  "Site": "DEVOPS",
  "Type": "ScriptLauncher",
  "Category": "REPORT",
  "Description": "report on hostname",
  "Step": "1",
  "Tolerance": "0",
  "MaxRetry": "2",
  "MaxPage": "1",
  "QuietPeriod": "12",
  "ExceptionTolerance": "2",
  "Script": "/bin/uname -n",
  "ScriptTimeout": "40",
  "ReportMode": "final",
  "DisableMode": "1",
  "XPatternGroup": [{
    "Pattern": ["^panda\\.?"]
  }],
  "ActiveTime": {
    "TimeWindow": [{
      "Interval": "00:00:00-24:00:00"
    }]
  }
}
where DisableMode identifies this report as a dependency. ReportMode defines the scope of the report. If this report is deployed to a box whose hostname does not match panda, the report will be a failure. As a result, it will disable all of its dependents. If DisableMode is set to -1, it will reverse the test result, i.e., the dependents will be disabled only if the report is a success.
Here is an example of DependencyGroup with a Dependency defined in-line:
{
  ...
  "DependencyGroup": [{
    "Dependency": [{
      "Name": "repo_agent",
      "ClassName": "org.qbroker.monitor.URLMonitor",
      "URI": "http://panda:8082/agent/panda/agent.json",
      "Operation": "HEAD",
      "Username": "omadm",
      "Password": "xxxx",
      "MaxBytes": "0",
      "Pattern": "Last-[mM]odified: (\\w+, \\d+ \\w+ \\d+ \\d+:\\d+:\\d+ \\w+)",
      "DateFormat": "EE, dd MMM yyyy HH:mm:ss zz",
      "Timeout": "60",
      "TimeOffset": "0"
    }]
  }],
  ...
}
More dependencies can be added to the list to express the logical relationships AND and OR. All dependencies inside the same Dependency list are evaluated as AND. The entries of DependencyGroup are combined as OR.
With a group of dependencies, you can easily control the monitor flow dynamically.
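The AND/OR rule for DependencyGroup can be written as a one-line evaluation. The Python sketch below is an illustration only: dependencies_satisfied and the check callback are hypothetical, dependencies are referenced by name, the name rpt_backup is made up for the example, and the DisableMode reversal is left out.

```python
def dependencies_satisfied(dependency_groups, check):
    """OR across the groups, AND within each group's Dependency list."""
    return any(
        all(check(dep) for dep in group["Dependency"])
        for group in dependency_groups
    )


groups = [
    {"Dependency": ["rpt_panda", "repo_agent"]},   # both must succeed
    {"Dependency": ["rpt_backup"]},                # or this one alone
]
status = {"rpt_panda": True, "repo_agent": False, "rpt_backup": True}
print(dependencies_satisfied(groups, status.get))  # True
```

Here the first group fails because repo_agent failed, but the second group succeeds, so the monitor would keep running.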
StaticDependencyGroup is a DependencyGroup that is evaluated at startup only. If it fails, the monitor will be disabled permanently. It is used to partition monitors across multiple platforms or hosts.
{
  ...
  "StaticDependencyGroup": [{
    "Dependency": ["rpt_panda"]
  }],
  ...
}
where the Dependency rpt_panda has been defined separately.
ActionGroup is a list of the actions for MonitorAgent to invoke as the response to certain events. In an action, the content of the event is accessible via the variable names, such as ##hostname## for hostname of the event. Here is an example:
{
  ...
  "ActionGroup": [{
    "URI": "script://localhost",
    "Priority": "^ERR$",
    "Timeout": "30",
    "Script": "/opt/qbroker/init.d/S50QFlow_EVENT restart"
  },{
    "URI": "smtp://web.qbroker.org",
    "Priority": "^ERR$",
    "Email": ["warn@web.qbroker.org"],
    "Subject": "##hostname##: ##priority## ##name## died",
    "TemplateFile": "/opt/qbroker/templates/mail_proc.txt"
  },{
    "URI": "smtp://web.qbroker.org",
    "Priority": "^CRIT$",
    "Email": ["page@web.qbroker.org"],
    "Subject": "##hostname##: ##priority## ##name## died",
    "TemplateFile": "/opt/qbroker/templates/mail_proc.txt"
  }],
  ...
}
This ActionGroup defines three actions. The first one bounces the process if the priority of the event is ERR. The other two are for email alerts. One reacts only to ERR events and sends an email alert as a warning. The other reacts to CRIT events and sends the alert as a page. In each action, you can define format templates for specific event types. You can also define a substitution rule to modify the content of the events. In order to see what happens in an action, you can add the Debug tag and set it to 1. Here is an example of an Action with Substitution defined to modify the content of the event:
{
  ...
  "ActionGroup": [{
    "URI": "jdbc:oracle:thin:@localhost:1520/mydb",
    "Priority": "^ERR$",
    "Username": "monitor",
    "Password": "xxxx",
    "SQLStatement": "DELETE FROM MY_LOCK WHERE LOCK_NAME = '##leadingBlock##' AND CREATE_DATE < SYSDATE - 1.0/24",
    "Substitution": [{
      "leadingBlock": "s/^[-:0-9 ]+\\| //"
    }],
    "DBTimeout": "50"
  }],
  ...
}
where ##leadingBlock## references the attribute leadingBlock of the event. The value of the attribute looks like this:
2016-08-17 13:27:22 | PurgeProcessedMessageCommandLock
The Substitution cuts off the timestamp, the spaces and the pipe character, so that only PurgeProcessedMessageCommandLock is used to replace the variable ##leadingBlock## in SQLStatement.
This Action is to run the SQLStatement on an Oracle DB to delete the aged lock with the lock name specified in the event.
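To make the effect of the Substitution concrete, here is a small Python sketch that applies the rule to the sample value and then fills the ##leadingBlock## variable. It is an approximation under stated assumptions: apply_substitution and fill_template are hypothetical helpers, the rule parser only handles the simple form s/pattern/replacement/ without slashes inside the pattern, and Python's regex dialect stands in for the one MonitorAgent actually uses.

```python
import re


def apply_substitution(rule, value):
    """Apply a simple sed-style rule s/pattern/replacement/ to value."""
    parsed = re.fullmatch(r"s/(.*)/(.*)/", rule)
    if parsed is None:
        raise ValueError("unsupported substitution rule: " + rule)
    pattern, replacement = parsed.groups()
    return re.sub(pattern, replacement, value, count=1)


def fill_template(template, attributes):
    """Replace each ##name## with the event attribute of that name."""
    return re.sub(r"##(\w+)##",
                  lambda m: attributes.get(m.group(1), m.group(0)),
                  template)


event = {"leadingBlock": "2016-08-17 13:27:22 | PurgeProcessedMessageCommandLock"}
rules = [{"leadingBlock": r"s/^[-:0-9 ]+\| //"}]
sql = ("DELETE FROM MY_LOCK WHERE LOCK_NAME = '##leadingBlock##' "
       "AND CREATE_DATE < SYSDATE - 1.0/24")

attributes = dict(event)
for rule in rules:
    for key, expression in rule.items():
        attributes[key] = apply_substitution(expression, attributes[key])
statement = fill_template(sql, attributes)
print(statement)
```

Unknown variables are left untouched by fill_template, which is a choice made for this sketch only.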
Reference is a Map specifying a separate report as the reference object. It is used in time or number correlations. For example, a CMS publisher has a primary template and secondary templates for each publish event. The rendering of the primary template is a synchronous process, but the secondary templates are rendered asynchronously. Therefore, the files of the secondary templates may get delayed or fail in their publish processes. MonitorAgent can be used to monitor the timestamps of the files of the secondary templates and correlate them with the file of the primary template.
Here is an example:
{
  ...
  "Reference": {
    "Name": "url_timestamp",
    "URI": "ftp://www.qbroker.org/www/wdap/rrd/generic.json",
    "Type": "FileMonitor",
    "User": "qbadm",
    "Password": "xxxx",
    "Timeout": "30",
    "TimeZone": "GMT"
  },
  ...
}
This reference defines a report on the timestamp of the URL. Its timestamp will be used as the reference in the time correlation process.
ActiveTime is a Map specifying when the monitor is active. It was introduced to address the blackout issue. For example, most web servers have daily log rotations. You do not want MonitorAgent to alert you during the rotation. Therefore, you can specify a blackout window so that the monitor will not be active during the rotation. ActiveTime is also used to schedule time-driven jobs, just like cron jobs.
Here is an example of ActiveTime:
{
  ...
  "ActiveTime": {
    "StartTime": "2005/03/18.08:00:00.EST",
    "StopTime": "2005/12/18.23:59:59.EST",
    "Blackout": ["6,00:00:00-24:00:00", "7", "4/6,20:00:00-04:00:00"],
    "TimeWindow": [{
      "Interval": "00:30:00-09:45:00"
    },{
      "Interval": "10:30:00-17:45:00"
    },{
      "Interval": "18:30:00-23:45:00"
    }]
  },
  ...
}
In this example, the monitor is active from 3/18 through 12/18 on weekdays, except from 8:00 PM on April 6th through 4:00 AM the next day. On each active day, there are three active time windows.
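Checking whether a time of day falls into one of the TimeWindow intervals can be sketched as below. This Python snippet only covers the Interval strings; StartTime, StopTime and Blackout are ignored, intervals that wrap past midnight are not handled, and treating the end of an interval as exclusive is an assumption. The function names are hypothetical.

```python
def to_seconds(hms):
    """Convert HH:MM:SS (24:00:00 allowed) to seconds since midnight."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def in_time_windows(now_hms, windows):
    """Return True if now_hms falls inside any Interval of the windows."""
    now = to_seconds(now_hms)
    for window in windows:
        start, end = (to_seconds(t) for t in window["Interval"].split("-"))
        if start <= now < end:
            return True
    return False


windows = [
    {"Interval": "00:30:00-09:45:00"},
    {"Interval": "10:30:00-17:45:00"},
    {"Interval": "18:30:00-23:45:00"},
]
print(in_time_windows("08:00:00", windows))  # True
print(in_time_windows("10:00:00", windows))  # False
```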
As you can see, ActiveTime contains at least one active time window. Within an active time window, the user can define thresholds for certain monitors. For example, FileMonitor and AgeMonitor require a threshold to be defined. A threshold is two or three numbers delimited by commas. The range below the first number always means NORMAL. The range between the first two numbers means WARNING. The range beyond the second number means ERR.
In the diagram of the Enterprise Monitor Network, MonitorAgent acts as an agent on the nodes. Each monitor instance is represented by its type of Event. An Event is a self-describing structured message, similar to a JSON message or a JMS MapMessage. It has certain mandatory properties, such as priority, name, site, type, text, etc. It may also have other customized and free-form properties. MonitorAgent uses an Event to store information about what the monitor has observed. The benefit of an Event is that different applications on different platforms can easily parse, match, evaluate, correlate, present and process the information carried by the message.
The primary task of the action component is to transform the raw data from the report into a more readable, actionable and interchangeable event. With Event, the details of the occurrence are able to flow across the network. This mobility allows any monitor to publish its reports as Events and allows other applications to subscribe to the content based on their interests.
Since all the monitor alerts are Events, it is important to understand the structure of the specific type of Event when you are creating or configuring a monitor. In fact, each type of monitor has its own type of Event. Among the various properties, some are mandatory. Others may be common or unique to the type. Here are the common properties of an Event:
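The threshold rule can be sketched as a tiny classifier. This Python snippet is illustrative only: classify is a hypothetical name, the handling of values exactly on a boundary is an assumption, and the optional third number is parsed but ignored because its meaning is not described here.

```python
def classify(value, threshold):
    """Map a measured value to NORMAL, WARNING or ERR for a threshold string."""
    numbers = [float(part) for part in threshold.split(",")]
    first, second = numbers[0], numbers[1]
    if value < first:
        return "NORMAL"
    if value < second:
        return "WARNING"
    return "ERR"


print(classify(5, "10,20"))      # NORMAL
print(classify(15, "10,20"))     # WARNING
print(classify(25, "10,20,30"))  # ERR
```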
Attribute Name | Description |
---|---|
priority | priority of Event |
name | name of monitor |
uri | URI of the monitor |
type | type of Event |
site | site of Event |
category | category of Event |
text | message of Event |
hostname | hostname where the event is from |
program | application name that sends Event |
owner | owner of application |
pid | Unix process ID of the application |
date | date and time of the event |
testTime | date and time of the report |
status | status of the report |
actionScript | status of the action script: executed, skipped and not configured |
actionCount | number of times that it occurs in a row |
description | description of the monitor |
The most important thing is to know what properties are available for certain types of events and what they mean. The developers are supposed to document the data structure or schema for each type of event. If you cannot find the documentation on a specific type of Event, you can still figure out most of the properties after such an event is delivered to you.
When something occurs, the action component generates an event for it and increases its ActionCount. By evaluating the ActionCount, the monitor knows the history of the occurrence and decides if any escalation is needed. If the same thing happens again and again, the action component escalates the priority of the event based on the predefined policies. Here is the matrix of event escalation for common scenarios:
ActionCount(c) | WARNING | ERR | CRIT |
---|---|---|---|
0<c<=T | log | | |
0<c-T<=R | log, mail, run script | | |
0<c-T-R<=P | log, page | | |
0<c-T-R-P<=Q | nothing | | |
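As a rough sketch, the rows of the matrix can be read as consecutive stages of the ActionCount. The Java below is hypothetical and only mirrors the rows as written: this section does not define T, R, P and Q, so they are taken as plain stage lengths, the priority columns are ignored, and counts beyond T+R+P+Q are reported as unspecified.

```java
// Hypothetical illustration of the escalation matrix; not QBroker code.
final class EscalationSketch {
    // t, r, p, q stand for the stage lengths written as T, R, P, Q above.
    static String actionsFor(int c, int t, int r, int p, int q) {
        if (c <= 0) return "unspecified";
        if (c <= t) return "log";                        // 0 < c <= T
        if (c - t <= r) return "log, mail, run script";  // 0 < c-T <= R
        if (c - t - r <= p) return "log, page";          // 0 < c-T-R <= P
        if (c - t - r - p <= q) return "nothing";        // 0 < c-T-R-P <= Q
        return "unspecified"; // the matrix does not cover larger counts
    }

    public static void main(String[] args) {
        for (int c = 1; c <= 8; c++) {
            System.out.println(c + ": " + actionsFor(c, 2, 2, 2, 2));
        }
    }
}
```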
Currently, MonitorAgent only supports the following types:
Type | ClassName | Description |
---|---|---|
AgeMonitor | org.qbroker.monitor.AgeMonitor | to monitor the age of an object and correlate with other objects |
ChannelMonitor | org.qbroker.wmq.ChannelMonitor | to monitor a WMQ channel and check its flow rate |
DBRecord | org.qbroker.monitor.DBRecord | to query a DB table and scan all records for certain patterns |
ExpectedLog | org.qbroker.monitor.ExpectedLog | to launch a script and expect specific new log entries to show up in the log file |
FileMonitor | org.qbroker.monitor.FileMonitor | to monitor the latest modified time of a file |
IncrementalMonitor | org.qbroker.monitor.IncrementalMonitor | to monitor the incremental of a number to see if it is out of range |
JMSHealthChecker | org.qbroker.jms.JMSHealthChecker | to health-check a JMS Destination via sending a msg to it |
JMSLogMonitor | org.qbroker.jms.JMSLogMonitor | to monitor a JMS Destination implemented via a log file |
JMSMonitor | org.qbroker.jms.JMSMonitor | to monitor a JMS application via its queue and the log file |
JMXQMonitor | org.qbroker.monitor.JMXQMonitor | to monitor a generic JMS Destination via JMX Service |
LatestRecord | org.qbroker.monitor.LatestRecord | to monitor the update time of a database record and correlate with other updates |
MultiFileMonitor | org.qbroker.monitor.MultiFileMonitor | to monitor the latest modified time of a group of files |
NumberMonitor | org.qbroker.monitor.NumberMonitor | to monitor a number to see if it is out of range |
ProcessMonitor | org.qbroker.monitor.ProcessMonitor | to monitor a unix process with specific patterns |
PropertyMonitor | org.qbroker.monitor.PropertyMonitor | to monitor a JSON property file to see if it is modified or not |
QueueMonitor | org.qbroker.wmq.QueueMonitor | to monitor the queue depth or its changes on a WMQ queue |
ScriptLauncher | org.qbroker.monitor.ScriptLauncher | to launch a script and check its output for errors |
ServiceMonitor | org.qbroker.monitor.ServiceMonitor | to monitor service metrics via Monit status page |
SyntheticMonitor | org.qbroker.monitor.SyntheticMonitor | to launch a Selenium script to check a web site |
SonicMQMonitor | org.qbroker.sonicmq.SonicMQMonitor | to monitor the metrics and its changes on a SonicMQ broker |
URLMonitor | org.qbroker.monitor.URLMonitor | to monitor the update time of a web page and correlate with other updates |
UnixlogMonitor | org.qbroker.monitor.UnixlogMonitor | to monitor a log file and detect the new occurrences of log entries matching at least one of the given patterns |
WebOperator | org.qbroker.monitor.WebOperator | to access a given URL and check its content for errors |
WinlogMonitor | org.qbroker.monitor.WinlogMonitor | to monitor Windows Event log |
Each type of monitor object handles one specific scenario. It is up to the report component to file a report and determine the status of the report. It is up to the action component to determine if the report is a failure, the priority of the failure, and the actions to invoke. Even though the number of supported monitors is limited, you can still combine them like Lego bricks to create and monitor new occurrences.
No system is perfect, and neither is MonitorAgent. Sometimes, the test part of a monitor may fail due to unexpected exceptions. For example, a FileMonitor tries to get the timestamp of a remote file, but a network outage causes the ftp process to fail. As a result, MonitorAgent cannot tell when the file was updated. In this case, MonitorAgent treats it as an exception and the action component is supposed to handle the exception.
UnixlogMonitor scrapes the logs of applications. It detects new occurrences of log entries that match at least one of the given patterns. If a matching log entry is detected, the monitor treats it as an error, which in turn triggers the actions. For example, if the SportsTicker receiver catches an exception, it logs it to its own log file. UnixlogMonitor can catch the exception logs and send alerts. UnixlogMonitor guarantees that each new entry is checked only once, provided that each log entry has a well-defined timestamp and the timestamp increases monotonically.
Once the monitor catches any entries, it sends an event. The event contains only the last entry and the number of matching entries. Therefore, it is the user's job to check the log for the details. Otherwise, you may miss important log entries.
Currently, UnixlogMonitor uses Java SimpleDateFormat to parse the timestamps. If you do not know how to configure the timestamp pattern, please read the Javadoc of that class or ask someone for help.
Here are the type specific properties in order to configure an instance of UnixlogMonitor.
Property Name | Requirement | Description | Examples |
---|---|---|---|
URI | mandatory | URI of the log file | log:///var/log/nginx/error_log |
TimePattern | mandatory | A SimpleDateFormat pattern to parse timestamp of logs | yyyy-MM-dd HH:mm:ss,SSS |
PerlPattern | optional | A Perl pattern to parse out the portion of timestamp of logs if needed | ^\w+ (\d\d\d\d-\d+-\d+ \d+:\d+:\d+,\d\d\d) |
LogSize | optional | maximum number of lines scanned for each log entry | 5 (default 1) |
ReferenceFile | mandatory | full filename for storing state info | /var/log/qbroker/.status/nginx_error.log |
ErrorIgnored | optional | threshold for number of matching entries to trigger CRIT event (bypassing normal escalations) | 10 (default is 0 for off) |
PatternGroup | mandatory | a list of Perl pattern groups to match certain logs | see example |
XPatternGroup | optional | a list of Perl pattern groups to exclude certain logs | see example |
The most confusing property is TimePattern. Given a log file, how can you determine what TimePattern to use? TimePattern is determined by the timestamp format of the log entries. Compare the timestamps in the log file with the following table and try to find one with exactly the same pattern:
TimePattern | Example |
---|---|
EE MMM d HH:mm:ss yyyy | "Mon Aug 27 10:03:16 2001 ..." |
yyyy-MM-dd HH:mm:ss,SSS | "2002-12-31 07:23:01,762 ..." |
yyyy-MM-dd'T'HH:mm:ss.SSS'Z' | "2002-07-21T10:45:23.215Z ..." |
dd/MMM/yyyy:HH:mm:ss | "23/Feb/2002:17:03:51 ..." |
[yyyy-MM-dd HH:mm:ss | "[2002-05-28 11:43:13] ..." |
[yyyy/MM/dd HH:mm:ss | "[2002/07/21 10:45:23] ..." |
[yyyy-MM-dd HH:mm:ss,SSS | "[2002-05-28 11:43:13,276] ..." |
MM/dd/yy HH:mm:ss | "09/21/02 10:45:23 ..." |
MM/dd/yy hh:mm:ss a | "03/21/99 07:15:43 PM ..." |
yyyy-MM-dd.HH:mm:ss | "2002-07-21.10:45:23 ..." |
yyyy-MM-dd HH:mm:ss | "2002-07-21 10:45:23 ..." |
yyyy-MM-dd HH:mm:ss zz | "2002-07-21 10:45:23 EDT ..." |
MMM d HH:mm:ss | "Aug 3 10:03:16 ..." |
MM-dd-yyyy HH:mm:ss SSS | "09-21-2002 10:45:23 267 ..." |
yyyy.MM.dd ss:mm:HH | "2002.07.21 30:45:23 ..." |
ss:mm:HH:dd:MM:yyyy | "53:45:23:29:11:2003 ..." |
ss | "1464507681 ..." |
ss.SSS | "1464507681.456 ..." |
If there is no similar one, you just need to create one for the new type of the timestamp. TimePattern is implemented via Java SimpleDateFormat. Its Javadoc will be really helpful for you to create a new TimePattern. If it is too complicated for you, please ask around for help.
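A quick way to try out a candidate TimePattern is to feed it to SimpleDateFormat directly with a sample line from the log. The sketch below is a hypothetical helper, not part of QBroker. Using `Locale.US` and strict (non-lenient) parsing are assumptions; MonitorAgent may configure the parser differently.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper for trying out a TimePattern; not part of QBroker.
final class TimePatternCheck {
    // Parses the timestamp at the start of a log line.
    // Returns null if the line does not start with a matching timestamp.
    static Date parseLeading(String timePattern, String logLine) {
        SimpleDateFormat format = new SimpleDateFormat(timePattern, Locale.US);
        format.setLenient(false); // assumption: reject out-of-range fields
        return format.parse(logLine, new ParsePosition(0));
    }

    public static void main(String[] args) {
        String line = "2002-12-31 07:23:01,762 INFO sample entry";
        System.out.println(parseLeading("yyyy-MM-dd HH:mm:ss,SSS", line));
        System.out.println(parseLeading("dd/MMM/yyyy:HH:mm:ss", line));
    }
}
```

If the call returns null for a line that clearly has a timestamp, the pattern does not fit that log and needs to be adjusted.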
Let's talk about the PatternGroup. PatternGroup is a list of Perl Pattern groups used to match certain patterns in the log entries. Here is an example:

    {
      ...
      "PatternGroup": [{
        "Pattern": ["(ERROR|FATAL) "]
      },{
        "Pattern": ["WARN ", "\\.NullPointerException"]
      }],
      ...
    }

This PatternGroup defines two pattern groups. The first one matches either ERROR or FATAL in the log. The second one contains two patterns and catches a WARN log entry containing .NullPointerException. As you can see, a pattern group may contain multiple patterns. A log entry must match all of them at the same time to count as a match for that group. The relationship between the two groups is a logical OR: a match on either group counts as a match.
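The matching rule of PatternGroup (AND within a group, OR across groups) can be sketched as follows. This is an illustration only: it uses `java.util.regex` as a stand-in for the Perl pattern engine, and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of PatternGroup semantics; not QBroker code.
final class PatternGroupSketch {
    // The outer list is OR-ed; the patterns inside each group are AND-ed.
    static boolean matches(List<List<Pattern>> groups, String entry) {
        for (List<Pattern> group : groups) {
            boolean all = true;
            for (Pattern p : group) {
                if (!p.matcher(entry).find()) {
                    all = false;
                    break;
                }
            }
            if (all) return true; // every pattern of this group matched
        }
        return false;
    }

    // The two groups from the example above.
    static List<List<Pattern>> sampleGroups() {
        return Arrays.asList(
            Arrays.asList(Pattern.compile("(ERROR|FATAL) ")),
            Arrays.asList(Pattern.compile("WARN "),
                          Pattern.compile("\\.NullPointerException")));
    }

    public static void main(String[] args) {
        System.out.println(matches(sampleGroups(), "FATAL out of memory"));
        System.out.println(matches(sampleGroups(), "WARN disk almost full"));
    }
}
```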
By default, the monitor sends an Event to react upon any occurrences according to the preconfigured ruleset. Here is an example of the email alert for the type of UnixlogMonitor:

    This message was automatically generated.

    Type: UnixlogMonitor
    Owner: qbadm
    ActionScript: not configured
    Status: Normal
    ActionCount: 1
    Date: 2005/06/24.18:26:15.226.EDT
    NumberLogs: 2
    Pid: 11711
    LastEntry: [2005-06-24 18:22:39,657] ERROR - exception org.qbroker.publisher.DistributionException: failed to put /www/wdap/rrd/images/mq/abc.png: 550 /: Permission denied
    Text: 'log:///var/log/qbroker/publisher.log' has 2 new matching entries
    Uri: log:///var/log/qbroker/publisher.log
    Description: QBroker publisher log for testing
    TestTime: 2005/06/24.18:26:15.216.EDT
    Hostname: panda
    Priority: WARNING
    Name: publisher_log
    Program: MonitorAgent

    A brief explanation on the monitor and this type of message can be found at https://yannanlu.github.io/agent.html#UnixlogMonitor.

This alert tells us the log file /var/log/qbroker/publisher.log contains 2 error entries and the last one occurred at [2005-06-24 18:22:39,657]. As you can see, the monitor only sends the first 5 lines of the last error entry. In order to get the details of the error, you have to open the log file and look for it. It is up to the admins to figure out what caused the error and how to handle it. Apart from the common properties, the following table lists all other type specific properties and their descriptions:
Attribute | Description |
---|---|
numberLogs | number of matching entries |
lastEntry | the original content of the last entry up to the limited lines |
ScriptLauncher launches a script provided by the user. It assumes that the script produces no unexpected output if it succeeds. If anything is printed, the monitor treats it as a failure and invokes the pre-configured actions.
ScriptLauncher allows you to integrate your own monitor script with the MonitorAgent. The MonitorAgent will provide history tracking, event generating, email alerting, and other services. For example, you can ignore the first occurrence, call restart script at second and third failures, log to Bugzilla at next two occurrences, etc.
Here are the type specific properties in order to configure an instance of ScriptLauncher.
Property Name | Requirement | Description | Examples |
---|---|---|---|
Script | mandatory | full path and name of the script | /mydir/test.sh |
ScriptTimeout | optional | seconds to timeout the script | 30 |
XPatternGroup | optional | a list of Perl pattern groups to ignore some of the output from the script | ^Sun Micro |
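The rule that any unexpected output means failure, with XPatternGroup filtering out expected lines, can be sketched as below. This is a simplified, hypothetical illustration: it treats the exclusions as a flat list of `java.util.regex` patterns rather than full pattern groups, and it does not launch a script itself.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of how script output could be judged; not QBroker code.
final class ScriptOutputSketch {
    // Returns true if at least one output line is not covered by an ignore pattern.
    static boolean isFailure(List<String> outputLines, List<Pattern> ignored) {
        for (String line : outputLines) {
            boolean skip = false;
            for (Pattern p : ignored) {
                if (p.matcher(line).find()) {
                    skip = true;
                    break;
                }
            }
            if (!skip) return true; // unexpected output: treat as failure
        }
        return false; // no output, or only ignorable output
    }

    public static void main(String[] args) {
        List<Pattern> ignored = Arrays.asList(Pattern.compile("^Sun Micro"));
        System.out.println(isFailure(Arrays.asList("Sun Microsystems Inc."), ignored));
        System.out.println(isFailure(Arrays.asList("cannot connect"), ignored));
    }
}
```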
SyntheticMonitor runs a Selenium script as a headless browser on a web site in terms of a list of tasks. The first step is always to get the page of the given URI. If NextTask is defined in the config, each of the tasks will be executed one after another in the listed order. If any task fails, the entire test will fail.
NextTask defines a list of actions implemented via Selenium script engine. Each member of NextTask will contain Operation, LocatorType, LocatorValue, PauseTime in ms and WaitTime in sec, etc. The basic idea for each of the actions is to look for certain DOM object with certain value and click on it or send some text to it.
Here are the type specific properties in order to configure an instance of SyntheticMonitor.
Property Name | Requirement | Description | Examples |
---|---|---|---|
NextTask | mandatory | a list of actions of the test script | see example |
ScriptTimeout | optional | milliseconds to timeout the script | 60000 |
PageLoadTimeout | optional | milliseconds to timeout the first page load | 90000 |
ProcessMonitor monitors the processes on a box. It checks the process table for the given patterns. If an entry matches all the patterns, it assumes the process is running. Otherwise, the process is considered down and the actions are invoked in response to the failure. This monitor is quite effective as long as the given pattern list identifies the process uniquely. However, it cannot detect that anything is wrong if the process is hung.
Sometimes, the system is busy or resources are tight, and MonitorAgent may fail to get the process info. In this case, MonitorAgent treats it as an exception rather than a failure. The ExceptionTolerance property determines how many times MonitorAgent will ignore the exceptions. If the exception persists, MonitorAgent upgrades its priority and treats it as a failure.
Here are the type specific properties in order to configure an instance of ProcessMonitor.
Property Name | Requirement | Description | Examples |
---|---|---|---|
PatternGroup | mandatory | a list of Perl pattern groups in order to single out the process | SportsTicker |
PSCommand | optional | ps command | /usr/ucb/ps -auxwww (default) |
PSTimeout | optional | seconds to timeout ps command | 30 |
PidPattern | optional | pattern to match the pid | ^\w+\s+(\d+) |
Here is an example: cron_proc.json.
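To make the roles of the patterns and PidPattern concrete, here is a hypothetical Java sketch that scans lines of ps output. It is not the QBroker implementation: it uses `java.util.regex` in place of the Perl pattern engine, a flat list of patterns instead of full pattern groups, and made-up ps lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of process matching; not QBroker code.
final class ProcessMatchSketch {
    // Returns the pids of all ps lines that match every pattern.
    static List<Integer> matchingPids(List<String> psLines,
                                      List<Pattern> patterns,
                                      Pattern pidPattern) {
        List<Integer> pids = new ArrayList<>();
        for (String line : psLines) {
            boolean all = true;
            for (Pattern p : patterns) {
                if (!p.matcher(line).find()) {
                    all = false;
                    break;
                }
            }
            if (!all) continue;
            Matcher m = pidPattern.matcher(line);
            if (m.find()) pids.add(Integer.parseInt(m.group(1)));
        }
        return pids; // an empty list means the process looks down
    }

    public static void main(String[] args) {
        List<String> ps = Arrays.asList(
            "USER       PID %CPU COMMAND",
            "qbadm    28376  0.1 java -cp qbroker.jar SportsTicker",
            "root         1  0.0 /sbin/init");
        System.out.println(matchingPids(ps,
            Arrays.asList(Pattern.compile("SportsTicker")),
            Pattern.compile("^\\w+\\s+(\\d+)")));
    }
}
```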
By default, the monitor sends an Event to react upon any occurrences according to the preconfigured ruleset. Here is an example of the email alert for the type of ProcessMonitor:

    This message was automatically generated.

    Process of mlb_scorecast died on panda1:

    ActionCount: 2
    ActionScript: executed
    Date: 2005/12/07.11:12:27.859.EST
    Hostname: panda1
    Status: normal
    Owner: qbadm
    Pid: 28376
    Priority: ERR
    Name: eventflow_proc
    NumberPids: 0
    Program: MonitorAgent
    TestTime: 2005/12/07.11:12:24.449.EST
    Text: eventflow_proc is down
    Type: ProcessMonitor
    Description: process monitor on EventFlow

    This monitor checks Unix process table for a predefined patterns periodically. If this is the first occurrence of the error, please connect to panda1, and then restart the process. If it persists, please contact the on call.

    A brief explanation on the monitor and this type of message can be found at https://yannanlu.github.io/agent.html#ProcessMonitor.

This alert tells us the EventFlow process on panda1 was not running at 2005/12/07.11:12:24.449.EST. It is the second time in a row that the monitor has detected the failure. The action script has been executed in response to the failure. Apart from the common properties, the following table lists all other type specific properties and their descriptions:
Attribute | Description |
---|---|
numberPids | number of pids matching the patterns |
pids | list of the pids |
NumberMonitor monitors a generic number and compares it to the predefined ranges. If the number falls into certain range, the monitor will send an event as the alert. It can be used to monitor memory usage of a process, or disk usage of a host, number of connections, etc.
Property Name | Requirement | Description | Examples |
---|---|---|---|
URI | mandatory | URI of the test | proc:///bin/ps?rss |
Pattern | mandatory | pattern for parsing the number | \s+\d+\s+(\d+) [^0-9] |
Operation | optional | method for aggregation | count (default) |
CritialRange | optional | data range for CRIT | [90,) |
ErrorRange | mandatory | data range for ERR | [60,90) |
WarningRange | optional | data range for WARNING | [0,60) |
Timeout | optional | seconds to timeout the test | 30 (default is 60) |
Among the three ranges, it is required to define at least one of them. Here is an example: db_conn_num.json.
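The range values in the table use interval notation, where a square bracket includes the bound, a parenthesis excludes it, and a missing bound means unbounded. A minimal parser for that notation might look like the sketch below. The class is hypothetical and makes no claim about how NumberMonitor parses its ranges internally.

```java
// Hypothetical parser for interval strings such as "[60,90)" or "[90,)".
final class RangeSketch {
    private final double low, high;
    private final boolean lowInclusive, highInclusive;

    private RangeSketch(double low, boolean lowInclusive,
                        double high, boolean highInclusive) {
        this.low = low;
        this.lowInclusive = lowInclusive;
        this.high = high;
        this.highInclusive = highInclusive;
    }

    static RangeSketch parse(String text) {
        String s = text.trim();
        if (s.length() < 3) throw new IllegalArgumentException("bad range: " + text);
        boolean lowInc = s.charAt(0) == '[';
        boolean highInc = s.charAt(s.length() - 1) == ']';
        // limit -1 keeps the trailing empty string of "[90,)"
        String[] parts = s.substring(1, s.length() - 1).split(",", -1);
        if (parts.length != 2) throw new IllegalArgumentException("bad range: " + text);
        String lo = parts[0].trim(), hi = parts[1].trim();
        double low = lo.isEmpty() ? Double.NEGATIVE_INFINITY : Double.parseDouble(lo);
        double high = hi.isEmpty() ? Double.POSITIVE_INFINITY : Double.parseDouble(hi);
        return new RangeSketch(low, lowInc, high, highInc);
    }

    boolean contains(double x) {
        boolean aboveLow = lowInclusive ? x >= low : x > low;
        boolean belowHigh = highInclusive ? x <= high : x < high;
        return aboveLow && belowHigh;
    }

    public static void main(String[] args) {
        System.out.println(parse("[60,90)").contains(75));
        System.out.println(parse("[90,)").contains(75));
    }
}
```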
IncrementalMonitor monitors the increment of a generic number and compares it to the predefined ranges. If the increment falls into a certain range, the monitor sends an event as the alert. If the data is reset, the monitor resets its state, too. It therefore assumes that the number always increases except on a reset. It can be used to monitor network traffic, byte counts, etc.
Property Name | Requirement | Description | Examples |
---|---|---|---|
URI | mandatory | URI of the test | proc:///bin/ps?rss |
Pattern | mandatory | pattern for parsing the number | \s+\d+\s+(\d+) [^0-9] |
Operation | optional | method for aggregation | max, min, first, last or count (default) |
CritialRange | optional | data range for CRIT | [90,) |
ErrorRange | mandatory | data range for ERR | [60,90) |
WarningRange | optional | data range for WARNING | [0,60) |
Timeout | optional | seconds to timeout the test | 30 (default is 60) |
As you can see, it is very similar to NumberMonitor. Please remember that those data ranges apply to the increments, not to the raw numbers.
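The idea of tracking increments and resetting the state when the counter drops can be sketched as follows. This is a hypothetical illustration: detecting a reset by a decrease in the value is an assumption based on the description that the number always increases except on a reset.

```java
import java.util.OptionalLong;

// Hypothetical illustration of increment tracking; not QBroker code.
final class IncrementSketch {
    private long previous;
    private boolean hasPrevious = false;

    // Returns the increment since the last sample, or empty when there is
    // no usable baseline (first sample, or the counter was reset).
    OptionalLong next(long current) {
        OptionalLong delta;
        if (!hasPrevious || current < previous) {
            delta = OptionalLong.empty(); // start over from this sample
        } else {
            delta = OptionalLong.of(current - previous);
        }
        previous = current;
        hasPrevious = true;
        return delta;
    }

    public static void main(String[] args) {
        IncrementSketch counter = new IncrementSketch();
        long[] samples = {100, 150, 20, 45};
        for (long s : samples) {
            System.out.println(s + " -> " + counter.next(s));
        }
    }
}
```

The returned increment, when present, is the number that would be compared against the configured ranges.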
WebOperator monitors a web listener locally or remotely. It sends a GET request to the web server periodically. It also checks the content of the requested web page. In case of fatal failures, like connection refused or wrong content, it invokes the action to send alert or to restart the web server.
Property Name | Requirement | Description | Examples |
---|---|---|---|
URI | mandatory | URI of the page | http://www.qbroker.org/index.html |
Timeout | optional | seconds to timeout the request | 30 (default is 60) |
Pattern | optional | a string expected to be returned with the page | QBroker (default is 200 OK) |
MaxBytes | optional | max bytes to read from the page | 1020000 (default is 512) |
FileMonitor monitors the lateness of expected updates to a given local or remote file. It assumes that the file always exists and is readable at all times. Once the monitor gets the timestamp of the file, it compares the timestamp with the predefined thresholds and determines if the update to the file is late. If it is late, the monitor treats the occurrence as an error and sends an event.
FileMonitor also supports mtime correlation and size correlation between two updates. In this case, there are two objects involved. One is the reference. The other is the target file to be monitored. The reference controls the correlation process. Whenever the reference is updated or modified, FileMonitor adjusts its timer and correlates this change with the target file. If the target file has been updated accordingly, FileMonitor treats it as OK. Otherwise, it sends alerts according to the predefined tolerance on the lateness of the target file being updated.
Here are the type specific properties in order to configure an instance of FileMonitor.
Property Name | Requirement | Description | Examples |
---|---|---|---|
URI | mandatory | uri of the file | ftp://panda/var/log/qbroker/completed |
User | mandatory only for ftp | userid used in ftp | orb |
Password | mandatory only for ftp | password used in ftp | aabbcc |
SetPassiveMode | optional | to use passive ftp | false (default is not defined) |
Timeout | optional | seconds to timeout in ftp | 30 (default is 60) |
Reference | optional | a map with all info of the reference object | see examples |
TriggerSize | optional | threshold of size that triggers change | 1 |
You also need to specify the threshold for the lateness. The thresholds are defined in ActiveTime. A threshold takes three non-zero numbers in seconds delimited by commas. You can also use HH:mm:ss to specify the time. The first number is for NORMAL, the second for WARNING and the third for ERR. Here is an example: completed_f.json.
Here are two scenarios for using this monitor object. The first case is an end-to-end health-check on a mover. Periodically, a script is called to move a file. The FileMonitor in turn checks the timestamp of the file on the other end. If the file has not been updated within a certain amount of time, it indicates the move either failed or is extremely slow.
The other case is an end-to-end test on a web server. Periodically, the WebOperator requests a page at a given URL. The web server logs each access request to its access log. The FileMonitor in turn checks the timestamp of the access log. If the log has not been updated within a certain amount of time, it indicates the web server's logging process is malfunctioning.
In order to configure FileMonitor to do time correlation, you have to specify a map named Reference in its property map. The reference map contains most of the properties required by an Update object, such as URI, Name, etc. The tolerance of the lateness will be controlled by the threshold parameters. In fact, FileMonitor will create a separate instance for the reference. The action part will actually do the time correlation between two objects.
For size correlation, you must specify the TriggerSize in the property map. The TriggerSize is zero or any positive number that defines two different states. One is the state in which the size is less than the TriggerSize. The other is the opposite. When the state of the reference changes, FileMonitor checks the state of the target file. If both files are in the same state, FileMonitor treats it as OK. Otherwise, FileMonitor sends alerts according to the predefined tolerance on the lateness of the target file keeping its state in sync.
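The two states defined by TriggerSize can be sketched in a few lines. The class below is hypothetical and only illustrates the state comparison. It leaves out the timing and tolerance handling that FileMonitor applies on top of it.

```java
// Hypothetical illustration of size correlation states; not QBroker code.
final class SizeCorrelationSketch {
    // State 1: size is less than TriggerSize. State 2: the opposite.
    static boolean belowTrigger(long size, long triggerSize) {
        return size < triggerSize;
    }

    // The target is in sync when it is in the same state as the reference.
    static boolean inSync(long referenceSize, long targetSize, long triggerSize) {
        return belowTrigger(referenceSize, triggerSize)
            == belowTrigger(targetSize, triggerSize);
    }

    public static void main(String[] args) {
        // With TriggerSize 1, the two states are "empty" and "non-empty".
        System.out.println(inSync(11399, 2048, 1));
        System.out.println(inSync(11399, 0, 1));
    }
}
```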
By default, the monitor sends an Event to react upon any occurrences according to the preconfigured ruleset. Here is an example of the email alert for the type of FileMonitor:

    This message was automatically generated.

    ReferenceSize: 11399
    Type: FileMonitor
    Owner: qbadm
    ReferenceTime: 2005/06/23.06:06:32.000.EDT
    ActionScript: not configured
    ActionCount: 3
    Date: 2005/06/23.06:36:49.710.EDT
    Pid: 21382
    Description: time correlation on Money intl page
    LatestTime: 2005/06/22.16:30:30.000.EDT
    Status: very late
    Text: 'ftp://json.qbroker.org/www/wdap/rrd/data.json' has not been updated in the last 846 minutes
    Uri: ftp://json.qbroker.org/www/wdap/rrd/data.json
    TestTime: 2005/06/23.06:36:49.237.EDT
    Hostname: panda1
    Priority: CRIT
    Name: data_json
    Program: MonitorAgent
    Reference: ftp://loon/www/wdap/rrd/data.json

    This monitor periodically checks the mtime of the files, ftp://json.qbroker.org/www/wdap/rrd/data.json and ftp://loon/www/wdap/rrd/data.json, to see if they have been updated recently. If not, it indicates something wrong somewhere along the line. Please login on the box and look into it.

    A brief explanation on the monitor and this type of message can be found at https://yannanlu.github.io/agent.html#FileMonitor.

Here Uri tells you which file is being checked. If the file is on a remote machine, the URI shows which server it is on. Text tells you how many minutes the file has gone without an update. We call this a very late occurrence. In this case, the test is for the mover. It is up to the admin to determine what caused this late event. Apart from the common properties, the following table lists all other type specific properties and their descriptions:
Attribute | Description |
---|---|
latestTime | mtime of the file |
reference | uri of the reference (optional) |
referenceTime | mtime of the reference (optional) |
referenceSize | size of the reference (optional) |
MultiFileMonitor monitors the lateness of expected updates on a group of files, local or remote. It assumes that all the files always exist and are readable at all times. Once the monitor gets the timestamps of all the files, it sorts them and takes the earliest one as the leading timestamp. The monitor compares the leading timestamp with the predefined thresholds and determines if it is late or not. If it is late, the monitor treats the occurrence as an error and sends an event.
MultiFileMonitor also supports mtime correlation and size correlation. In this case, there are two objects involved. One is the reference. The other is the leading timestamp to be monitored. The reference controls the correlation process. Whenever the reference is updated or modified, MultiFileMonitor adjusts its timer and correlates this change with the leading timestamp. If the leading timestamp has been updated accordingly, MultiFileMonitor treats it as OK. Otherwise, it sends an event according to the predefined tolerance on the lateness.
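Picking the leading timestamp amounts to taking the minimum mtime of the group. The sketch below is hypothetical: it works on a map of file names to mtimes in milliseconds and does not fetch anything over ftp. The file names and numbers in `main` are made up.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the leading timestamp; not QBroker code.
final class LeadingTimeSketch {
    // The leading timestamp is the earliest mtime in the group.
    // Throws NoSuchElementException if the map is empty.
    static long leadingTime(Map<String, Long> mtimes) {
        return Collections.min(mtimes.values());
    }

    // Whole minutes elapsed since the leading timestamp.
    static long latenessMinutes(Map<String, Long> mtimes, long nowMillis) {
        return (nowMillis - leadingTime(mtimes)) / 60000L;
    }

    public static void main(String[] args) {
        Map<String, Long> mtimes = new LinkedHashMap<>();
        mtimes.put("data/mq/abc.rrd", 0L);            // oldest file
        mtimes.put("data/num/xyz.rrd", 3_203_000L);   // updated later
        System.out.println(latenessMinutes(mtimes, 4_404_000L));
    }
}
```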
Here are the type specific properties in order to configure an instance of MultiFileMonitor.
Property Name | Requirement | Description | Examples |
---|---|---|---|
URI | mandatory | uri of the base directory | ftp://loon/www/wwap/rrd/ |
FileName | mandatory | a list of all files to be monitored | data/mq/abc.rrd |
User | mandatory only for ftp | userid used in ftp | qbadm |
Password | mandatory only for ftp | password used in ftp | aabbcc |
SetPassiveMode | optional | to use passive ftp | false (default is not defined) |
Timeout | optional | seconds to timeout in ftp | 30 (default is 60) |
Reference | optional | a map with all info of the reference object | see examples |
TriggerSize | optional | threshold of size that triggers change | 1 |
By default, the monitor sends an Event to react upon any occurrences according to the preconfigured ruleset. Here is an example of the email alert for the type of MultiFileMonitor:

    This message was automatically generated.

    Type: MultiFileMonitor
    Owner: qbadm
    ActionScript: not configured
    Details: data/mq/abc.rrd: 2313 2005/02/17.17:35:02.000.EST
             data/num/xyz.rrd: 2112 2005/02/17.18:28:25.000.EST
    ActionCount: 6
    Status: Very late
    Date: 2005/02/17.18:48:32.340.EST
    Pid: 27272
    Text: ftp://loon/www/wdap/rrd/: at least one of the files has not been updated in the last 73 minutes
    LeadingTime: 2005/02/17.17:35:02.000.EST
    Uri: ftp://loon/www/wdap/rrd/
    Description: check mtime on mobile files
    TestTime: 2005/02/17.18:48:26.571.EST
    Hostname: panda1
    Priority: CRIT
    Name: rrd_files
    Program: MonitorAgent

    A brief explanation on the monitor and this type of message can be found at https://yannanlu.github.io/agent.html#MultiFileMonitor.

Here Uri gives you the base uri for all the files. Details lists the files with their relative path, size and mtime. LeadingTime shows the mtime of the oldest (leading) file. Text tells you how many minutes the leading file has gone without an update. We call this a very late occurrence. It is up to the admin to determine what caused this late event. Apart from the common properties, the following table lists all other type specific properties and their descriptions:
Attribute | Description |
---|---|
leadingTime | mtime of the oldest file |
details | list of files with the relative path to the uri and their size and mtime |
reference | uri of the reference (optional) |
referenceTime | mtime of the reference (optional) |
referenceSize | size of the reference (optional) |
ExpectedLog monitors a generic log file periodically and expects new log entries with certain patterns to show up frequently in the log file. If the expected log entry does not show up in the log file within a predefined period of time, ExpectedLog treats the log as late and sets this occurrence as a failure for actions. If the application does not log frequently, ExpectedLog can run a script to trigger certain logs before each check.
ExpectedLog also supports time correlation. You can use the script to touch a file with a specified timestamp. The module will check the log for the new entries and compare their timestamps with the mtime of the file.
For example, we use ExpectedLog to monitor syslog. Every 5 minutes, ExpectedLog runs logger to log a syslog entry. Then MonitorAgent checks the syslog to see if the new entry is there. If not, the syslog daemon may be either hung or dead.
Here are the type specific properties in order to configure an instance of ExpectedLog.
Property Name | Requirement | Description | Examples |
---|---|---|---|
URI | mandatory | URI of the log file | log:///var/log/nginx/errors_log |
TimePattern | mandatory | A SimpleDateFormat pattern to parse timestamp of logs | yyyy-MM-dd HH:mm:ss,SSS |
LogSize | optional | maximum number of lines for a log entry | 5 (default 1) |
ReferenceFile | mandatory | filename for storing state info | /var/log/qbroker/.status/web_int.log |
PatternGroup | mandatory | an Array instance of patterns | Error |
XPatternGroup | optional | an Array instance of patterns to be excluded | Test |
TestScript | optional | full path and name of the script | /usr/bin/true |
SleepTime | optional | seconds to sleep between script executing and log checking | 60 (default is 0) |
You also need to specify the threshold for the lateness. The thresholds are defined in ActiveTime. A threshold takes three non-zero numbers in seconds delimited by commas. You can also use HH:mm:ss to specify the time. The first number is for NORMAL, the second for WARNING and the third for ERR. Here is an example: syslog_elog.json.
ChannelMonitor monitors the channels and their attachments of WMQ Queue Managers. It connects to a specified queue manager and checks the status of a given channel. It also checks the queue associated with the channel for the flow rate, and for the number of the processes that open the queue for read or write. There are many scenarios causing fatal failures on the channel as well as its attachments. In case of the fatal errors, the action will be invoked.
Here are the type specific properties in order to configure an instance of ChannelMonitor.
Property Name | Requirement | Description | Examples |
---|---|---|---|
QueueManager | mandatory | name of the WMQ queue manager | BROKER1 |
ChannelName | mandatory | name of the WMQ channel | BKR1.SUB2 |
ChannelType | mandatory | type of the WMQ channel | Sender or Receiver |
QueueName | optional | name of the queue associated to the channel | ST_IN |
QueueOpenMode | optional | how the queue is opened | KeepOpen or NotKeepOpen |
StatsLog | optional | filename for storing channel stats | /var/log/qbroker/stats/broker2.log |
QueueMonitor monitors the queues of WMQ Queue Managers. It connects to a specified queue manager and checks the number of messages in a given queue. For each queue watched by the monitor, there is a threshold for the maximum number of messages in that queue. If the current number of messages exceeds the threshold, it is treated as an error. This error triggers email alerts and, if configured, a cleanup of the queue.
Here are the type specific properties in order to configure an instance of QueueMonitor.
Property Name | Requirement | Description | Examples |
---|---|---|---|
QueueManager | mandatory | name of the WMQ queue manager | BROKER1 |
QueueName | mandatory | name of the WMQ queue | ST_IN |
WaterMark | mandatory | threshold of number of messages in the queue | 1000 or 0.8 (80%) |
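The example values for WaterMark (1000 or 0.8) suggest that the threshold can be given either as an absolute count or as a fraction. The following is a minimal sketch of such a check, not the QBroker implementation. It assumes that a value strictly between 0 and 1 is a fraction of the queue's maximum depth. The class `WaterMarkCheck` and its methods are hypothetical names.

```java
/** Hypothetical helper for illustration; not part of QBroker. */
final class WaterMarkCheck {
    private WaterMarkCheck() {}

    /**
     * Converts a WaterMark value into an absolute message count.
     * Assumption: a value strictly between 0 and 1 is a fraction of maxDepth,
     * anything else is already an absolute count.
     */
    static long threshold(double waterMark, long maxDepth) {
        if (waterMark > 0.0 && waterMark < 1.0) {
            return Math.round(waterMark * maxDepth);
        }
        return (long) waterMark;
    }

    /** Returns true if the current depth exceeds the threshold. */
    static boolean exceeds(long currentDepth, double waterMark, long maxDepth) {
        return currentDepth > threshold(waterMark, maxDepth);
    }

    public static void main(String[] args) {
        // 80% of a maximum depth of 5000 is 4000, so a depth of 4200 is over the mark.
        System.out.println(exceeds(4200, 0.8, 5000)); // true
    }
}
```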
JMSMonitor monitors a JMS application that reads messages from or writes messages to an MQ queue. Either the application itself or another application logs to a log file. The monitor assumes that the application logs an entry whenever it picks up messages from the queue or puts a message on the queue. It combines QueueMonitor and UnixlogMonitor and checks the message dequeue rate (how many messages have been picked up). If the dequeue rate is zero but the current depth of the queue is non-zero, the monitor treats the incident as an error indicating that the application is hung. This error triggers the email alert action and, if configured, the action program.
Here are the type-specific properties for configuring an instance of JMSMonitor.
Property Name | Requirement | Description | Examples |
---|---|---|---|
URI | mandatory | uri for the connection | t3://jcmsref1:7001 |
ConnectionFactoryName | mandatory for JNDI | connection factory of JNDI | j2eecms_i2/jms/JMSConnectionFactory |
ContextFactory | mandatory for JNDI | context factory of JNDI | com.sun.jndi.fscontext.RefFSContextFactory |
Username | optional | username for JNDI | jmstester |
Password | optional | password for JNDI | test |
QueueName | mandatory | name of the WMQ queue | CMS_APS |
Operation | mandatory | the way to handle messages | Get or Put |
LogFile | optional | filename of the log file | /var/log/qbroker/nohit.log |
TimePattern | mandatory | time pattern of the log | yyyy-MM-dd HH:mm:ss,SSS |
LogSize | optional | maximum number of lines for a log entry | 5 (default 1) |
StatsLog | optional | filename for storing queue stats | /var/log/qbroker/stats/broker_jms.log |
ReferenceFile | mandatory | filename for storing state info | /var/log/qbroker/.status/broker_jms.log |
PatternGroup | mandatory | an Array instance of the pattern | Just pushed (\d+) events. |
NumberDataFields | mandatory | number of fields to sum up | 1 or 0 (default) |
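The hang detection described for JMSMonitor can be illustrated with a small sketch. It assumes that the new log lines have already been collected and that the first capture group of the pattern holds the number of messages handled, as in the PatternGroup example above. The class `DequeueCheck` and its methods are hypothetical and do not reflect the actual JMSMonitor code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of QBroker. */
final class DequeueCheck {
    private DequeueCheck() {}

    /** Sums the first capture group of every log line that matches the pattern. */
    static long countDequeued(List<String> newLogLines, Pattern pattern) {
        long total = 0;
        for (String line : newLogLines) {
            Matcher m = pattern.matcher(line);
            if (m.find()) {
                total += Long.parseLong(m.group(1));
            }
        }
        return total;
    }

    /** Messages are waiting in the queue but nothing has been picked up. */
    static boolean looksHung(long dequeued, long currentDepth) {
        return dequeued == 0 && currentDepth > 0;
    }

    public static void main(String[] args) {
        Pattern p = Pattern.compile("Just pushed (\\d+) events.");
        List<String> lines = Arrays.asList(
            "2024-01-01 10:00:00,123 Just pushed 5 events.",
            "2024-01-01 10:00:05,456 Just pushed 7 events.");
        long dequeued = countDequeued(lines, p);
        System.out.println(dequeued + " dequeued, hung=" + looksHung(dequeued, 12));
    }
}
```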
JMSHealthChecker sends a health-check message to a JMS destination and checks whether the message is accepted. In case of failure, the monitor treats the incident as an error indicating that the destination is not available. This error triggers the email alert action and, if configured, the action program.
Here are the type-specific properties for configuring an instance of JMSHealthChecker.
Property Name | Requirement | Description | Examples |
---|---|---|---|
URI | mandatory | uri for the connection | wmq://panda1 |
ConnectionFactoryName | mandatory for JNDI | connection factory of JNDI | QueueConnectionFactory |
ContextFactory | mandatory for JNDI | context factory of JNDI | com.sun.jndi.fscontext.RefFSContextFactory |
Username | optional | username for JNDI | jmstester |
Password | optional | password for JNDI | test |
QueueName | mandatory for put | name of the JMS queue | ESB_XML |
TopicName | mandatory for pub | name of the JMS topic | MyTopic |
JMSPropertyGroup | optional | map to define msg properties | {"JMSType": "healthcheck"} |
MessageBody | optional | text of the health-check msg | This is a health-check |
JMXQMonitor monitors a generic JMS destination via the JMX service. Most JMS vendors support JMX for management and monitoring purposes, and this monitor takes advantage of that. It monitors the metrics of the destination, such as enqueue rate, dequeue rate, current depth, number of consumers and number of producers. If messages get stuck in the destination, it sends an event as the alert.
Here are the type-specific properties for configuring an instance of JMXQMonitor.
Property Name | Requirement | Description | Examples |
---|---|---|---|
URI | mandatory | uri for the JMX connection | service:jmx:rmi:///jndi/rmi://localhost:8686/jmxrmi |
Username | optional | username for JMX | admin |
Password | optional | password for JMX | admin |
MBeanName | mandatory | name of the MBean for the destination | com.sun.messaging.jms.server:type=Destination,subtype=Monitor,desttype=q,name="EVENT_Q_1" |
StatsLog | optional | filename for storing queue stats | /var/log/qbroker/stats/EVENT_Q_1.mq |
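To show how the URI, Username, Password and MBeanName properties fit together, here is a minimal sketch that uses only the standard `javax.management` API. It is not the JMXQMonitor code. The class `JmxDestinationProbe` is a hypothetical name, and the attribute to read is vendor-specific, so it is left as a parameter rather than guessed.

```java
import java.util.HashMap;
import java.util.Map;
import javax.management.MBeanServerConnection;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

/** Hypothetical helper for illustration; not part of QBroker. */
final class JmxDestinationProbe {
    private JmxDestinationProbe() {}

    /** Extracts the unquoted value of the "name" key from an MBean name. */
    static String destinationName(String mbeanName) {
        try {
            String name = new ObjectName(mbeanName).getKeyProperty("name");
            // Quoted values are returned with their quotes, so strip them.
            return (name != null && name.startsWith("\"")) ? ObjectName.unquote(name) : name;
        } catch (MalformedObjectNameException e) {
            throw new IllegalArgumentException("bad MBean name: " + mbeanName, e);
        }
    }

    /** Reads one attribute of the MBean over a remote JMX connection. */
    static Object readAttribute(String uri, String username, String password,
                                String mbeanName, String attribute) throws Exception {
        Map<String, Object> env = new HashMap<>();
        if (username != null) {
            env.put(JMXConnector.CREDENTIALS, new String[] {username, password});
        }
        try (JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(uri), env)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            return conn.getAttribute(new ObjectName(mbeanName), attribute);
        }
    }

    public static void main(String[] args) {
        String mbean = "com.sun.messaging.jms.server:type=Destination,"
            + "subtype=Monitor,desttype=q,name=\"EVENT_Q_1\"";
        System.out.println(destinationName(mbean)); // EVENT_Q_1
    }
}
```

Calling `readAttribute` requires a reachable JMX service, so the `main` method above only exercises the name parsing.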
SonicMQMonitor monitors the message storage on a SonicMQ broker via JMS/JMX. Currently, it only supports brokers, queues and durable subscriptions. Since there is no metric for the number of dequeued and enqueued messages, it cannot report the flow rate.
Here are the type-specific properties for configuring an instance of SonicMQMonitor.
Property Name | Requirement | Description | Examples |
---|---|---|---|
URI | mandatory | uri for the SonicMQ | tcp://localhost:2506 |
Username | optional | username | admin |
Password | optional | password | xxxx |
ObjectName | mandatory | name of the object to monitor | Domain1.brQA01Container:ID=brQA01,category=metric,type=queue,name="EventQueue" |
StatsLog | optional | filename for storing queue stats | /var/log/qbroker/stats/EventQueue.mq |
Use URLMonitor to get the update time and/or size of a given page. It assumes that the page exists and is accessible at all times. The page should contain an update timestamp that URLMonitor can parse.
You can use URLMonitor to monitor whether the page has been updated and, if not, how late the update is. If the update is very late, URLMonitor sends alerts.
URLMonitor also supports mtime correlations and size correlations between two updates. In this case, two updates are involved: the reference and the target page to be monitored. The reference controls the correlation process. Whenever the reference is updated or modified, URLMonitor adjusts its timer and correlates this change with the target page. If the target page has been updated accordingly, URLMonitor treats it as OK. Otherwise, it sends alerts according to the predefined tolerance on the lateness of the target page's update.
To configure URLMonitor for time correlations, specify a map named reference in its property map. The reference map contains most of the properties required by an Update object, such as URI, Name, Type, etc. The tolerance of the lateness is controlled by the threshold parameters. In fact, URLMonitor creates a separate instance for the reference object. The performAction() method does the actual time correlation between the two objects.
For size correlations, you must specify TriggerSize in the property map. TriggerSize is zero or any positive number, and it defines two states: one in which the size is less than TriggerSize, and one in which it is not. When the state of the reference changes, URLMonitor checks the state of the target page. If both objects are in the same state, URLMonitor treats it as OK. Otherwise, URLMonitor sends alerts according to the predefined tolerance on how late the target page is in bringing its state back in sync.
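The two-state comparison behind size correlations can be sketched as follows. This is an illustration of the rule described above, not the actual performAction() logic. The class `SizeCorrelation` and its methods are hypothetical names.

```java
/** Hypothetical helper for illustration; not part of QBroker. */
final class SizeCorrelation {
    private SizeCorrelation() {}

    /** Returns true if the size is in the "less than TriggerSize" state. */
    static boolean belowTrigger(long size, long triggerSize) {
        return size < triggerSize;
    }

    /** The reference and the target are in sync when both are in the same state. */
    static boolean inSync(long referenceSize, long targetSize, long triggerSize) {
        return belowTrigger(referenceSize, triggerSize) == belowTrigger(targetSize, triggerSize);
    }

    public static void main(String[] args) {
        long triggerSize = 1024;
        // Reference has grown past the trigger, target has not followed yet.
        System.out.println(inSync(4096, 100, triggerSize));  // false
        // Both are past the trigger now.
        System.out.println(inSync(4096, 2048, triggerSize)); // true
    }
}
```

With a TriggerSize of zero, no size is ever below the trigger, so both objects are always in the same state under this reading.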
Property Name | Requirement | Description | Examples |
---|---|---|---|
URI | mandatory | URI of the page | http://www.qbroker.org/index.html |
Timeout | optional | seconds to timeout the request | 30 (default 60) |
Pattern | mandatory | pattern for parsing update time | Updated: (\d+):(\d+) ([ap])\.m\. \w+ \(\d+ GMT\) (\w+) (\d+), (\d\d\d\d) |
DateFormat | mandatory | format string for update time | hh mm a MMM dd yyyy |
You also need to specify the thresholds for the lateness. The thresholds are defined in ActiveTime as three non-zero numbers in seconds, delimited by commas. You can also use HH:mm:ss to specify each time. The first number is for NORMAL, the second for WARNING and the third for ERR. Here is an example: panda_url.json.
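As a rough illustration of the threshold format, the sketch below parses a comma-delimited value where each part is either plain seconds or HH:mm:ss. The class `LatenessThreshold` is a hypothetical name. The `classify` method is an assumption about how the three values might map to levels (at or beyond the third value is ERR, at or beyond the second is WARNING). The exact rule used by MonitorAgent is not stated here.

```java
/** Hypothetical helper for illustration; not part of QBroker. */
final class LatenessThreshold {
    private LatenessThreshold() {}

    /** Parses one value: plain seconds ("600") or HH:mm:ss ("00:10:00"). */
    static long toSeconds(String value) {
        String v = value.trim();
        if (!v.contains(":")) {
            return Long.parseLong(v);
        }
        String[] p = v.split(":");
        if (p.length != 3) {
            throw new IllegalArgumentException("expected HH:mm:ss: " + value);
        }
        return Long.parseLong(p[0]) * 3600 + Long.parseLong(p[1]) * 60 + Long.parseLong(p[2]);
    }

    /** Parses "normal,warning,err" into three non-zero values in seconds. */
    static long[] parse(String threshold) {
        String[] parts = threshold.split(",");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected three values: " + threshold);
        }
        long[] seconds = new long[3];
        for (int i = 0; i < 3; i++) {
            seconds[i] = toSeconds(parts[i]);
            if (seconds[i] == 0) {
                throw new IllegalArgumentException("values must be non-zero: " + threshold);
            }
        }
        return seconds;
    }

    /** Assumed mapping of a lateness in seconds to a level; see the note above. */
    static String classify(long latenessSeconds, long[] t) {
        if (latenessSeconds >= t[2]) return "ERR";
        if (latenessSeconds >= t[1]) return "WARNING";
        return "NORMAL";
    }

    public static void main(String[] args) {
        long[] t = parse("00:10:00,00:30:00,01:00:00");
        System.out.println(classify(2000, t)); // WARNING
    }
}
```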
Use LatestRecord to get the update time of the latest record and/or the number of certain records in a database table. It assumes that the database and the table exist and are accessible at all times. The record should contain an update timestamp that LatestRecord can parse.
You can use LatestRecord to monitor whether the record has been updated and, if not, how late the update is. If the update is very late, LatestRecord sends alerts.
LatestRecord also supports mtime correlations and size correlations between two updates. In this case, two updates are involved: the reference and the target record to be monitored. The reference controls the correlation process. Whenever the reference is updated or modified, LatestRecord adjusts its timer and correlates this change with the target. If the target record has been updated accordingly, LatestRecord treats it as OK. Otherwise, it sends alerts according to the predefined tolerance on the lateness of the target record's update.
To configure LatestRecord for time correlations, specify a map named reference in its property map. The reference map contains most of the properties required by an Update object, such as URI, Name, Type, etc. The tolerance of the lateness is controlled by the threshold parameters. In fact, LatestRecord creates a separate instance for the reference object. The performAction() method does the actual time correlation between the two objects.
For size correlations, you must specify TriggerSize in the property map. TriggerSize is zero or any positive number, and it defines two states: one in which the size is less than TriggerSize, and one in which it is not. When the state of the reference changes, LatestRecord checks the state of the target record. If both objects are in the same state, LatestRecord treats it as OK. Otherwise, LatestRecord sends alerts according to the predefined tolerance on how late the target record is in bringing its state back in sync.
Property Name | Requirement | Description | Examples |
---|---|---|---|
URI | mandatory | URI of the database | jdbc:oracle:thin:@broker1:1530:mqsibkdb |
DBDriver | mandatory | classname of the database driver | oracle.jdbc.driver.OracleDriver |
Username | optional | username for DB connection | qbadm |
Password | optional | password of the username | xxxx |
SQLQuery | mandatory | SQL query for the latest record | SELECT max(timestamp) FROM stories |
Timeout | optional | seconds to timeout the query | 30 (default 60) |
TimePattern | mandatory | SimpleDateFormat pattern for parsing update time | yyyy-MM-dd HH:mm:ss |
You also need to specify the thresholds for the lateness. The thresholds are defined in ActiveTime as three non-zero numbers in seconds, delimited by commas. You can also use HH:mm:ss to specify each time. The first number is for NORMAL, the second for WARNING and the third for ERR. Here is an example: batchqueue_db.json.
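Since TimePattern is a SimpleDateFormat pattern, the lateness of a record can be derived from the timestamp string returned by the query. The sketch below shows that step only, without the JDBC call. The class `RecordLateness` is a hypothetical name, and passing the time zone explicitly is an assumption made so that the result does not depend on the host's default zone.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

/** Hypothetical helper for illustration; not part of QBroker. */
final class RecordLateness {
    private RecordLateness() {}

    /** Returns how many seconds have passed between the record timestamp and nowMillis. */
    static long latenessSeconds(String timestamp, String timePattern,
                                TimeZone zone, long nowMillis) {
        SimpleDateFormat format = new SimpleDateFormat(timePattern);
        format.setTimeZone(zone);
        format.setLenient(false); // reject values such as month 13
        try {
            Date updated = format.parse(timestamp);
            return (nowMillis - updated.getTime()) / 1000;
        } catch (ParseException e) {
            throw new IllegalArgumentException("cannot parse timestamp: " + timestamp, e);
        }
    }

    public static void main(String[] args) {
        long late = latenessSeconds("2024-01-01 10:00:00", "yyyy-MM-dd HH:mm:ss",
            TimeZone.getTimeZone("UTC"), System.currentTimeMillis());
        System.out.println("record is " + late + " seconds old");
    }
}
```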
Use DBRecord to query a database table and scan all the records for certain predefined patterns. It assumes that the database and the table exist and are accessible at all times. If the monitor catches any matching records, it sends an event. The event contains only the last record and the number of matching records. Therefore, it is the customer's job to check the table for the details; otherwise, important records may be missed.
Property Name | Requirement | Description | Examples |
---|---|---|---|
URI | mandatory | URI of the database | jdbc:oracle:thin:@broker1:1530:mqsibkdb |
DBDriver | mandatory | classname of the database driver | oracle.jdbc.driver.OracleDriver |
Username | optional | username for DB connection | qbadm |
Password | optional | password of the username | xxxx |
SQLQuery | mandatory | SQL query for the latest record | SELECT max(timestamp) FROM stories |
Timeout | optional | seconds to timeout the query | 30 (default 60) |
PatternGroup | mandatory | list of patterns | error |
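The scan that DBRecord reports on (the number of matching records plus the last match) can be sketched as below. The records are plain strings here, standing in for rows returned by the SQL query. The class `RecordScan` is a hypothetical name, and treating a record as a match when any one pattern is found in it is an assumption about how PatternGroup is evaluated.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical scan result for illustration; not part of QBroker. */
final class RecordScan {
    final int count;        // number of matching records
    final String lastMatch; // last matching record, or null if none matched

    private RecordScan(int count, String lastMatch) {
        this.count = count;
        this.lastMatch = lastMatch;
    }

    /** Scans all records and keeps only the count and the last match. */
    static RecordScan scan(List<String> records, List<Pattern> patterns) {
        int count = 0;
        String last = null;
        for (String record : records) {
            for (Pattern p : patterns) {
                if (p.matcher(record).find()) {
                    count++;
                    last = record;
                    break; // count each record once
                }
            }
        }
        return new RecordScan(count, last);
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("job 1 ok", "job 2 error: timeout", "job 3 error: disk");
        RecordScan result = scan(rows, Arrays.asList(Pattern.compile("error")));
        System.out.println(result.count + " matches, last: " + result.lastMatch);
    }
}
```

Because only the last match survives, earlier matching records are lost in the event, which is why the table itself has to be checked for the details.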
Use PropertyMonitor to watch a JSON property file. If the timestamp of the file changes, the monitor reloads the file and diffs it against the cached copy to see if there are any changes. If there is a change, the monitor sends an event about it. MonitorAgent uses PropertyMonitor to monitor its own configuration files and automatically reload them when a change is detected.
Property Name | Requirement | Description | Examples |
---|---|---|---|
URI | mandatory | URI of the config file | file:///opt/qbroker/agent/Agent.json |
Basename | mandatory | name of the master config file | Agent |
ComponentGroup | mandatory | map of components | Monitor |
PropertyFile | optional | full name of the property file | /opt/qbroker/agent/Agent.json |
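The reload-and-diff step can be pictured with the sketch below. It assumes the JSON file has already been parsed into a map by some JSON library and compares only the top-level keys. The class `PropertyDiff` and its methods are hypothetical names, not the PropertyMonitor API.

```java
import java.io.File;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not part of QBroker. */
final class PropertyDiff {
    private PropertyDiff() {}

    /** Returns true if the file has been modified after the last check. */
    static boolean modifiedSince(File file, long lastCheckedMillis) {
        return file.lastModified() > lastCheckedMillis;
    }

    /** Returns the sorted keys that were added, removed or changed. */
    static Set<String> changedKeys(Map<String, ?> cached, Map<String, ?> reloaded) {
        Set<String> allKeys = new TreeSet<>(cached.keySet());
        allKeys.addAll(reloaded.keySet());
        Set<String> changed = new TreeSet<>();
        for (String key : allKeys) {
            boolean inBoth = cached.containsKey(key) && reloaded.containsKey(key);
            if (!inBoth || !Objects.equals(cached.get(key), reloaded.get(key))) {
                changed.add(key);
            }
        }
        return changed;
    }
}
```

An empty result means the timestamp changed but the content did not, in which case no event would be needed.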
Use AgeMonitor to watch the lifetime of an object. If the lifetime of the object falls into certain ranges, the monitor sends an event as the alert. For example, you can use AgeMonitor to monitor processes: if a process runs too long, the monitor can kill it, provided the action script is configured to do that.
Property Name | Requirement | Description | Examples |
---|---|---|---|
URI | mandatory | URI of the test | proc:///bin/ps?etime |
Pattern | mandatory | pattern for parsing time | \d+\s+(\d[^ ]+\d) |
DateFormat | mandatory | format string for time | D-HH:mm:ss |
Operation | optional | method for aggregation | max, first, last or min (default) |
Timeout | optional | seconds to timeout the test | 30 (default 60) |
There are some other parameters required by the monitor, depending on the scheme of the URI. You also need to specify the thresholds for the age. The thresholds are defined in ActiveTime, for time dependence support, as three non-zero numbers in seconds delimited by commas. You can also use HH:mm:ss to specify each time. The first number is for NORMAL, the second for WARNING and the third for ERR. Here is an example: mqm_age.json.
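For the `proc:///bin/ps?etime` example, the age has to be extracted from the output of ps and turned into seconds. The sketch below uses the Pattern value from the table to pick the elapsed time out of each line, then parses the `[[dd-]hh:]mm:ss` layout by hand instead of going through the DateFormat property. The class `ElapsedTime` is a hypothetical name, and the max aggregation is just one of the listed Operation values.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of QBroker. */
final class ElapsedTime {
    private ElapsedTime() {}

    // Pattern from the property table: a number, whitespace, then the elapsed time.
    private static final Pattern LINE = Pattern.compile("\\d+\\s+(\\d[^ ]+\\d)");
    // Elapsed time layout: optional days and hours, then minutes and seconds.
    private static final Pattern ETIME =
        Pattern.compile("(?:(?:(\\d+)-)?(\\d+):)?(\\d+):(\\d+)");

    /** Converts a value such as "2-03:15:42" or "15:42" to seconds. */
    static long toSeconds(String etime) {
        Matcher m = ETIME.matcher(etime.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("not an elapsed time: " + etime);
        }
        long days = m.group(1) == null ? 0 : Long.parseLong(m.group(1));
        long hours = m.group(2) == null ? 0 : Long.parseLong(m.group(2));
        long minutes = Long.parseLong(m.group(3));
        long seconds = Long.parseLong(m.group(4));
        return ((days * 24 + hours) * 60 + minutes) * 60 + seconds;
    }

    /** Returns the largest age found in the lines, or -1 if no line matches. */
    static long maxAge(List<String> psLines) {
        long max = -1;
        for (String line : psLines) {
            Matcher m = LINE.matcher(line);
            if (m.find()) {
                max = Math.max(max, toSeconds(m.group(1)));
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("  PID     ELAPSED", "  101       05:10",
            " 1234  2-03:15:42");
        System.out.println(maxAge(lines)); // 184542
    }
}
```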
WinlogMonitor scrapes the Windows event log with pattern match support. Unlike UnixlogMonitor, which depends on increasing timestamps, WinlogMonitor uses WMIC to query the Windows event log via a JavaScript script.
Property Name | Requirement | Description | Examples |
---|---|---|---|
URI | mandatory | URI of monitor | wlog:///application/AGILITYSDK |
Script | mandatory | command that runs a JavaScript script to query the event log | cmd.exe /c "cscript c:\home\qbroker\bin\extractLog.js //Nologo -l Application -s AGILITYSDK -s MSSQLSERVER -t 2 -t 1 -a ##yyyy####MM####dd####HH####mm####ss##.##SSS## -i ##RN## 2>nul" |
LogSize | optional | maximum number of lines for a log entry | 1 |
ReferenceFile | mandatory | filename for storing state info | C:\home\qbroker\status\live.log |
PatternGroup | mandatory | a list of Perl pattern groups to match certain logs | see example |
XPatternGroup | optional | a list of Perl pattern groups to exclude certain logs | see example |
Here is an example: ecs_wlog.json.
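The interplay of PatternGroup and XPatternGroup can be sketched as an include filter followed by an exclude filter. This is a simplified reading in which each group is a flat list of patterns and java.util.regex stands in for the Perl patterns. The class `LogFilter` is a hypothetical name and the sample log entries are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of QBroker. */
final class LogFilter {
    private LogFilter() {}

    private static boolean matchesAny(String entry, List<Pattern> patterns) {
        for (Pattern p : patterns) {
            if (p.matcher(entry).find()) {
                return true;
            }
        }
        return false;
    }

    /** Keeps entries that match an include pattern and no exclude pattern. */
    static List<String> select(List<String> entries, List<Pattern> include,
                               List<Pattern> exclude) {
        List<String> selected = new ArrayList<>();
        for (String entry : entries) {
            if (matchesAny(entry, include) && !matchesAny(entry, exclude)) {
                selected.add(entry);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<String> entries = Arrays.asList(
            "AGILITYSDK error: connection lost",
            "AGILITYSDK error: heartbeat test",
            "MSSQLSERVER info: backup done");
        List<String> hits = select(entries,
            Arrays.asList(Pattern.compile("error")),
            Arrays.asList(Pattern.compile("heartbeat")));
        System.out.println(hits); // [AGILITYSDK error: connection lost]
    }
}
```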
MonitorAgent is part of QBroker, an ongoing open source project on GitHub.