monitoring, metrics, alerting

In the previous post we saw what information is useful for monitoring a Windows application server. While useful, that data alone won’t give a complete picture of our applications’ behaviour if we don’t also monitor the important services, and what service is more important than your database?

If you are working in a Windows environment there’s a good chance you are using Microsoft SQL Server, and you will have to monitor and fine-tune its usage to prevent problems and identify incorrect usage.

Telegraf already has a plugin for SQL Server that extracts information from your instance by executing a query. While this works well it has two problems:

  • due to a bug it can only connect to SQL Server 2008 R2 SP3 or later (a minor problem; I hope your SQL Servers are at least updated to a version that has not run out of support)
  • it extracts a lot of data, most of which is useful only in limited cases.

To keep things simple we can craft a configuration that includes only the most relevant and important counters (a configuration sketch is shown at the end of this post). As in the case of an application server you will need the same base data:

  • use the disk, cpu and mem plugins
  • Network Interface:
    • “Bytes Received/sec”
    • “Bytes Sent/sec”
    • “Bytes Total/sec”
  • Processor:
    • “% Idle Time”
    • “% Interrupt Time”
    • “% Privileged Time”
    • “% User Time”
    • “% Processor Time”
  • LogicalDisk:
    • “Disk Bytes/sec”
    • “Disk Read Bytes/sec”
    • “Disk Write Bytes/sec” (all instances except _Total)

Now on to the SQL Server-specific Performance Counters (replace INSTANCE with the name of your instance; there can be more than one on a single server):

  • MSSQL$INSTANCE:Buffer Manager:
    • “Buffer cache hit ratio”
    • “Database Pages”
    • “Free list stalls/sec”
    • “Page life expectancy”
    • “Page lookups/sec”
    • “Page reads/sec”
    • “Page writes/sec”
  • MSSQL$INSTANCE:Databases:
    • “Active transactions”
    • “Transactions/sec”
    • “Write transactions/sec”
  • MSSQL$INSTANCE:General Statistics:
    • “Processes blocked”
  • MSSQL$INSTANCE:Latches:
    • “Average Latch Wait Time (ms)”
    • “Latch Waits/sec”
    • “Total Latch Wait Time (ms)”
  • MSSQL$INSTANCE:Locks:
    • “Average wait time (ms)”
    • “Lock Requests/sec”
    • “Lock Timeouts (timeout > 0)/sec”
    • “Lock Timeouts/sec”
    • “Lock Wait Time (ms)”
    • “Lock Waits/sec”
    • “Number of Deadlocks/sec”
  • MSSQL$INSTANCE:SQL Statistics:
    • “Batch Requests/sec”
    • “SQL Compilations/sec”
    • “SQL Re-Compilations/sec”
  • MSSQL$INSTANCE:Transactions:
    • “Free Space in tempdb (KB)”
    • “Transactions”
    • “Version Cleanup rate (KB/s)”
    • “Version Generation rate (KB/s)”
    • “Version Store Size (KB)”
    • “Version Store unit count”
    • “Version Store unit creation”
    • “Version Store unit truncation”
  • MSSQL$INSTANCE:Wait Statistics:
    • “Lock waits”
    • “Log buffer waits”
    • “Log write waits”
    • “Memory grant queue waits”
    • “Network IO waits”
    • “Non-Page latch waits”
    • “Page IO latch waits”
    • “Page latch waits”
    • “Thread-safe memory objects waits”
    • “Transaction ownership waits”
    • “Wait for the worker”
    • “Workspace synchronization waits”

These performance counters will give you a fairly complete view of your SQL Server’s behaviour, allowing you to inspect and correlate performance problems in your applications with the state of your database.
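
As a rough sketch, here is how part of that list could be expressed with Telegraf’s win_perf_counters input. INSTANCE is your instance name, the Measurement names are arbitrary placeholders, and I’m assuming the usual "------" value as the placeholder for objects that have no instances:

[[inputs.win_perf_counters]]
  [[inputs.win_perf_counters.object]]
    # Buffer Manager has no instances, hence the "------" placeholder
    ObjectName = "MSSQL$INSTANCE:Buffer Manager"
    Counters = ["Buffer cache hit ratio", "Page life expectancy", "Page reads/sec", "Page writes/sec"]
    Instances = ["------"]
    Measurement = "sqlserver_buffer_manager"

  [[inputs.win_perf_counters.object]]
    # one instance per database
    ObjectName = "MSSQL$INSTANCE:Databases"
    Counters = ["Active transactions", "Transactions/sec", "Write transactions/sec"]
    Instances = ["*"]
    Measurement = "sqlserver_databases"

  [[inputs.win_perf_counters.object]]
    ObjectName = "MSSQL$INSTANCE:SQL Statistics"
    Counters = ["Batch Requests/sec", "SQL Compilations/sec", "SQL Re-Compilations/sec"]
    Instances = ["------"]
    Measurement = "sqlserver_sql_statistics"

The remaining objects (General Statistics, Latches, Locks, Transactions, Wait Statistics) follow the same pattern.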

monitoring, alerting, metrics, performance counters

Once all the programs described in the previous post are installed we can start to familiarize ourselves with the system.

One question arises: what data should we capture? It depends on the role of the server where the collecting agent is installed.

For an application server we can collect some basic data:

  • we can use the Telegraf cpu, disk and mem inputs
  • some useful Performance Counters to collect are (a configuration sketch follows the list):
    • Processor: “% Idle Time”, “% Interrupt Time”, “% Privileged Time”, “% User Time”, “% Processor Time” (only _Total)
    • Network Interface: “Bytes Received/sec”, “Bytes Sent/sec”, “Bytes Total/sec” (all instances except _Total)
    • LogicalDisk: “Disk Bytes/sec”, “Disk Read Bytes/sec”, “Disk Write Bytes/sec” (all instances except _Total)
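
Sketched as a Telegraf win_perf_counters section, those counters would look roughly like this (the Measurement names are arbitrary placeholders, and I’m assuming the wildcard excludes _Total unless IncludeTotal is set to true):

[[inputs.win_perf_counters]]
  [[inputs.win_perf_counters.object]]
    ObjectName = "Processor"
    Counters = ["% Idle Time", "% Interrupt Time", "% Privileged Time", "% User Time", "% Processor Time"]
    Instances = ["_Total"]   # only the _Total instance
    Measurement = "win_cpu"

  [[inputs.win_perf_counters.object]]
    ObjectName = "Network Interface"
    Counters = ["Bytes Received/sec", "Bytes Sent/sec", "Bytes Total/sec"]
    Instances = ["*"]        # all instances except _Total
    Measurement = "win_net"

  [[inputs.win_perf_counters.object]]
    ObjectName = "LogicalDisk"
    Counters = ["Disk Bytes/sec", "Disk Read Bytes/sec", "Disk Write Bytes/sec"]
    Instances = ["*"]        # all instances except _Total
    Measurement = "win_disk"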

If the server hosts web applications you can add these Performance Counters (they are also included in the configuration sketch below):

  • ASP.NET: “Requests Queued”
  • ASP.NET Apps v4.0.30319: “Requests/sec”, “Sessions active”
  • Web Service: “Connection Attempts/sec”, “Current Connections”

If you are using MSMQ, add these as well (a configuration sketch follows the list):

  • MSMQ Queue: “Messages in Queue” (you can read all queues or specify individual ones)
  • MSMQ Service: “Total bytes in all queues” (this is important because by default MSMQ has a 1 GB limit for stored messages and will refuse new messages once that limit is reached)
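
A sketch of the corresponding objects, added under the same [[inputs.win_perf_counters]] section shown above (object and counter names as listed; the Measurement names are placeholders):

  [[inputs.win_perf_counters.object]]
    ObjectName = "ASP.NET"
    Counters = ["Requests Queued"]
    Instances = ["------"]   # object has no instances
    Measurement = "aspnet"

  [[inputs.win_perf_counters.object]]
    ObjectName = "ASP.NET Apps v4.0.30319"
    Counters = ["Requests/sec", "Sessions active"]
    Instances = ["*"]
    Measurement = "aspnet_apps"

  [[inputs.win_perf_counters.object]]
    ObjectName = "Web Service"
    Counters = ["Connection Attempts/sec", "Current Connections"]
    Instances = ["*"]
    Measurement = "iis_websvc"

  [[inputs.win_perf_counters.object]]
    ObjectName = "MSMQ Queue"
    Counters = ["Messages in Queue"]
    Instances = ["*"]        # or list individual queues
    Measurement = "msmq_queue"

  [[inputs.win_perf_counters.object]]
    ObjectName = "MSMQ Service"
    Counters = ["Total bytes in all queues"]
    Instances = ["------"]
    Measurement = "msmq_service"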

monitoring, windows, metrics, alerting

In the previous post I described some of the challenges of monitoring a Windows-only environment. Now I’ll describe the solution I’ve chosen, which has been monitoring about a hundred VMs for almost a year without a hitch.

If you remember from the previous post, the four areas to consider are:

  • Data collection
  • Data storage
  • Data visualization
  • Alerting

Data collection

As we said, one of the most important requirements is the ability to read Windows Performance Counters and write the data to our chosen storage. Here the solution was simple: Telegraf.

Telegraf can be easily installed as a Windows service and, most importantly:

  • can write to numerous types of storage, giving you flexibility in choosing the storage layer
  • already has numerous plugins that let you scrape data from many sources (if a plugin does not exist, it’s simple to create a new one if you know a little Go); a minimal configuration sketch is shown below
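
As an illustration, the agent section of telegraf.conf controls how often data is collected and flushed (the values below are just examples); on Windows the service itself can then be registered with something along the lines of telegraf.exe --service install (check your version’s documentation for the exact flags):

[agent]
  interval = "10s"         # how often to collect metrics
  flush_interval = "10s"   # how often to write them to the configured outputs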

Data storage

The second choice concerned the storage layer. While Telegraf allows a great degree of flexibility, we had to pick a solution with good Windows support, low hardware requirements and the ability to be queried by the tools we chose for data visualization and alerting.

The choice was simple: we chose InfluxDB, which can be installed as a Windows service, has low requirements and integrates well with the other tools in the stack.

Notably, it also has a concept of data retention, allowing you to automatically clean up data older than, for example, 30 days.
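
As a sketch, pointing Telegraf at InfluxDB then only requires an output section along these lines (the URL, database and retention policy names are placeholders; the thirty-day retention policy itself is defined on the InfluxDB side):

[[outputs.influxdb]]
  urls = ["http://influxdb.example.local:8086"]
  database = "telegraf"
  retention_policy = "thirty_days"   # leave empty to use the default policy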

Data visualization

To visualize the collected data we chose Grafana, which can plot graphs reading directly from InfluxDB and many other data sources, giving you a lot of flexibility and letting you easily navigate and correlate your data.

Alerting

Initially we tried Grafana’s new support for alerts, but we soon realized it was too limited to work well (no multiple severity levels, no differentiated alerting rules, …), so we later switched to Bosun, a monitoring and alerting tool created by Stack Exchange. It can easily query InfluxDB, and allows you to define complex rules and tweak them over time.

It also has a dashboard with all active alerts, where you can handle them by acknowledging them or closing them when solved.

Remarks

While this solution has proved to work well, other combinations are possible:

  • swapping Grafana and Bosun for Chronograf and Kapacitor (made by the same developers as InfluxDB and Telegraf)
  • using Prometheus as data storage
  • using Elasticsearch as storage, Kibana for visualization and Watcher for alerting

monitoring, windows, time-series, metrics, alerting

Developing distributed applications in a Windows-only environment, with hosted virtual machines, can present some challenges and require you to find solutions that are not very standardized, since this type of environment is not very common.

One of the challenges is collecting information about your VMs, some important services (SQL Server, RabbitMQ, Elasticsearch, …) and your virtualization environment, so you can inspect them or be alerted in case of problems (or better, before they happen).

To set up this kind of data collection and monitoring there are four areas to consider. I will describe them and then propose a solution I’ve found works really well with minimal setup and is easy to configure.

The four areas to consider are:

  • Data collection
  • Data storage
  • Data visualization
  • Alerting

Data collection

The first step is deciding what data you need to collect and finding a program that can read it at a pre-defined interval and store it where you want. While there are many tools of this type, we need one that works on Windows and adapts to its peculiarities. This means having something able to read Windows Performance Counters.

Performance Counters are a Windows-only mechanism used to report values about the operating system, but they can also be used by applications like SQL Server. Some examples are:

  • number of bytes of memory used
  • number of messages in a MSMQ queue
  • output throughput for every network card
  • number of physical reads per second (SQL Server)

The full list of these counters can be seen in Performance Monitor, where they are divided into categories for easier navigation. The Performance Monitor interface is a throwback to the nineties, but all we need from it is the list of counters, reachable when adding a new counter.

Pay attention to the fact that Performance Counter names are localized, so names may differ between your local computer and the production servers.

Options: Telegraf, Collectd, scollector, Elastic Beats,….

Data storage

The second and most important part is the storage of your collected data. The usual solution is a time-series database, a form of database specialized in saving and reading time series (metrics). Multiple options are available that run easily on Windows, like InfluxDB and Prometheus. Your choice should be based primarily on two factors:

  • can my data collector program write to the chosen storage?
  • can I easily visualize data from the storage?

Availability and disaster recovery may play a part in your choice, but after running our chosen system for a year we treat this data as important but disposable: we have no high availability and we accept the possibility of losing the data. We also keep only the last thirty days of data, a compromise that lets us inspect trends in a metric while limiting the storage needed.

Another important factor to consider is how much data your chosen tool is able to ingest. If you collect 10 metrics per machine every ten seconds, what works with a hundred servers (about 100 points per second) may fail with ten thousand (about 10,000 points per second). Fortunately most tools are able to ingest massive quantities of data with limited resources, and if this becomes a problem you can use multiple databases, partitioning the load by type or group of servers.

Options: Prometheus, InfluxDb, ElasticSearch, Graphite, OpenTSDB

Data visualization

The third part is data visualization, for which you need a tool that is:

  • easy to use
  • able to show the current status at a glance
  • able to let you inspect and correlate historical data to pinpoint problems in a post-mortem
  • easily configurable by everyone

While this part seems “easy”, a good visualization tool will make all the difference. It will help you correlate a spike in your response times with anomalies in your database metrics. It will help you pinpoint configuration errors in your servers. It will let you uncover stray queries that read the entire database every two hours to generate a report. In short, a good visualization tool will let you discover far more about your systems and applications than you ever thought possible.

The fourth point is also important. While creating standard visualizations is really useful, anyone should also be able to explore the data freely without having to ask a central group to create a new visualization every time one is needed. This lets everyone move quickly without having to wait for someone else.

Options: Grafana, Kibana, Chronograf,…

Alerting

The last part is alerting on the collected data. Ideally you want to be able to:

  • define rules about possible problems, with levels of importance (at least warning and critical)
  • quickly change the defined rules
  • define different alerting destinations
  • view at a glance the status of all your alerts and track their status over time

Defining different levels of importance will let you see problems, like a disk filling up, well before they matter, and gives you time to act or to ignore them during the night. Being able to tweak these rules lets you cut down on the number of alerts, which, if left unchecked, leads to alert fatigue: you end up ignoring an important alert because it is buried under a thousand less important notifications.

If you can define different alert destinations you can escalate the important things while leaving the minor ones to be handled when you have time. A dashboard that lets you see all active alerts, their current status and their history will also make it easier to coordinate with the rest of your team.

Options: Grafana, Bosun, Chronograf

documentation, knowledge base

One of the classic problems arising when developing professionally is how to create documentation about our software in a way that is easily discoverable and updatable by every developer. We faced this problem and tried some options along the way before settling on the easiest and most approachable one we could find.

First solution: Sharepoint

This was the solution in place when I was hired. It basically boils down to a website where documents can be uploaded in a folder-like structure, with permissions to upload, read and modify them. While it worked, it had some serious problems:

  • lack of search across documents
  • permissions had to be requested from an administrator to view documents (admittedly, this could have been avoided by giving read permissions to everyone)
  • permissions also had to be requested to modify documents, even to correct a typo
  • finding documents was not easy: they could be there, but you could be missing the permissions needed to see them
  • the only real advantage was its folder-like structure, which allowed content to be organized in a logical way

Second solution: The wiki

In an attempt to solve some of these problems we decided to use a wiki for our documentation. It solved some of them, but brought its own trade-offs:

  • we gained full text search across all the documentation
  • permissions were no longer needed, allowing everyone to contribute to and amend the documentation with full traceability
  • on the other hand, a logical organization of the documentation was missing, so each developer had to maintain it by hand with links on every page, which was tedious and error-prone

The final solution

At this point we realized we needed a different solution, something allowing us to:

  • organize our documentation hierarchically with little effort
  • let everyone update it easily

To reach this goal we used Metalsmith to build a static site from a git repository containing our documentation written in Markdown. Each time someone edits a page, the repository is built by the CI system and the static site is deployed. The structure of the site exactly reflects the on-disk folder structure, which provides the topic organization.

In this configuration no permissions are needed as every edit is reflected in the source control history and can easily be reverted.

If you want to try it, you can download a customizable skeleton from https://github.com/Bjornej/KnowledgeBase. To build it, just run

npm install
node --harmony build.js

and your static site will be built.