History

In 2011, following a tweet by John Vincent, a whole new movement started in the open source world. It all began with a blunt statement: “monitoring sucks”.

Indeed, at the time, the open source monitoring tool set was pretty poor.

There were some commercial solutions that were pretty good for the domain they covered (HP OpenView for SNMP, Splunk for logs,…), but in the open source world the “go to” solution was Nagios, and not much else.

Nagios is not bad (and indeed we still use it at Nuxeo), but it has its limits. It’s mainly designed to poll at fixed intervals, and it’s pretty binary in what it reports (working/broken). We needed some new tools.

John Vincent’s tweet probably came at the right time, with the right audience. It coalesced an ever-growing group of like-minded people around the idea of “we need better tools, let’s make them”.

Another important milestone in this movement was Coda Hale’s talk about metrics, which pretty much says: “Don’t just check for a few predetermined things, measure and record everything, you never know when you’ll need them and what they will tell you”.

A lot has changed since then: there is now a plethora of rapidly evolving tools seeing wide adoption, and there has been a paradigm shift from “poll status from the application” to “push things to the monitoring tools and let them react in realtime”.

Monitoring has evolved, and it’s still changing.

What is this monitoring thing you’ve been talking about anyway?

“Monitoring” is an often-used term (well, in the IT world anyway), but it actually covers different things. Mostly, it’s about “does my application work?”, “will it keep working?” and “what happened that made it stop working correctly?”.

Past, present, future, they are all important in the life of an application. That is the root of Coda Hale’s talk: you need to know how your application is actually behaving in the real world, see how things evolve, and from that determine what you need to watch, and plan what you need to do to keep it working. Don’t just react, be proactive so that you never need to react.

So, how does that translate into practical terms?

Mainly, record everything that you think is relevant (and possibly some things that you don’t think are relevant): metrics, logs, any information you have, really.

This doesn’t just have value for your operations team, it can also have business value.

For instance, if you sell an application/service but also offer a free trial, you can analyze logs to see who uses your application, which users are customers, which are trials, and how many new trial signups you get. You can then split it all by geographical location to see which sales department needs to make a bigger push to encourage potential customers to actually sign up with you (or which ones to give up on, but that’s not something monitoring can tell you: it gives you the information, you decide how to act on it!).

The more information you have, the better you can exploit it. It’s up to you how to analyze it; the main point is to have the information in the first place.

Ok, so I digressed a bit there. Back to down-to-earth things!

With the information you’re keeping, you can make graphs from metrics and archive structured log data.

That can be used for post-mortem diagnosis, or to predict trends; for instance:

  • When your application was not available, the database connection pool was saturated (which gives you a hint: you need an alert on this),
  • At the current rate your disk will be full in 3 months so you need to place an order for more capacity (but you have a little time to do it).

So, in the end, “monitoring” is: gathering data and acting on it.

I fell asleep at the “History” part, do you have some examples?

Yes, I do!

A warning note first: the tools we use evolve very fast, the examples below may not work for you depending on the versions you use.

Let’s see how we can put the theory in practice with a basic Nuxeo installation: a single Nuxeo instance with an Apache reverse-proxy in front and a local PostgreSQL database (this is pretty much what we have on the VMs we distribute).

Collecting

We want to get the Apache access logs and the Nuxeo server log for indexing, as well as some metrics about Nuxeo, the JVM, the database and the system itself.

For the logs, the files already exist; it’s just a matter of reading new lines as they are added. We’ll use logstash [1] for this.

Logstash is a tool that can read text data from multiple sources, manipulate it (parsing it to get structured data), then send it somewhere else. It does this through a set of inputs, filters and outputs. There are a lot of existing ones already, and it’s rather easy to add your own if you have specific needs.

Let’s see how the configuration looks for our example:

input {
  file {
    path => "/var/log/nuxeo/server.log"
    type => "nuxeo"
    sincedb_path => "/opt/logstash/.sincedb_serverlog"
  }
  file {
    path => "/var/log/apache2/nuxeo_access.log"
    type => "apache"
    sincedb_path => "/opt/logstash/.sincedb_accesslog"
  }
}

filter {
  # Add a tag to all events to identify the instance
  mutate {
    add_tag => ["mynuxeo"]
  }
  # Group all lines until they start with a new timestamp (useful for stacktraces)
  multiline {
    type => "nuxeo"
    pattern => "^20"
    negate => true
    what => "previous"
  }
}

output {
  redis {
    host => "indexer.example.com"
    data_type => "list"
    key => "logstash"
  }
}

Here, we have two simple file inputs that just keep reading the files. We give them each a type which will be used later to determine which filters and outputs will be applied to them.

In the filters, we start by adding a tag to identify where all those events come from; it can be something like “intranet” or “customer X”, or anything that suits you, really. We could just use the hostname, but hostnames can change if you move a service, so we prefer to add a custom tag.

The second filter is used to group lines together. Keep in mind that each line in the logs is read individually, but the lines of a stacktrace, for instance, are not individual events, so we want to group them into one. Note that the example above is pretty basic and will not work in all cases (and it’s not year-2100 compliant, but there should be time to adapt it before then!).
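
A slightly more robust variant (an untested sketch; it assumes your logstash version lets the multiline filter use grok-style patterns) would anchor on a full timestamp instead of a literal “20”:

multiline {
  type => "nuxeo"
  # Any line starting with an ISO8601 timestamp begins a new event
  pattern => "^%{TIMESTAMP_ISO8601}"
  negate => true
  what => "previous"
}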

Afterwards, we just send everything to a redis [2] server.

“Hold on! You haven’t parsed or indexed anything!” you’re going to say.

And you’d be right!

Most of the time, you are not just going to collect data from a single host, so what we do is gather events, add some metadata to say what they are and where they come from, then ship them off to a central indexer that will do most of the work.

Another consideration is resource usage: when you have a lot of activity, parsing and indexing can take a lot of resources, and you really don’t want to do that on the same host as your production application!

In our case, we just use redis as a pub/sub queue. If you prefer, there are other options, such as AMQP.
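
A quick way to check that events are actually flowing (the host and key are the ones from the redis output shown earlier) is to look at the queue depth with redis-cli:

# Number of events currently waiting in the "logstash" list
redis-cli -h indexer.example.com llen logstash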

We’ll come back to logstash later to see what happens next in the life of our events (or you can scroll down now if you can’t stand the suspense); for now, let’s stick with our instance.

So far, we’ve handled collection of log events, but we also want metrics.

Since Nuxeo 5.7.1, we have integrated Coda Hale’s aptly-named Metrics [3] library (see https://doc.nuxeo.com/x/gBDF for more details). Normally, you would just push those metrics directly to a Graphite [4] server, but in our case we want to manipulate them first, so we send them to logstash instead.

We activate metrics in the nuxeo.conf with the following configuration:

metrics.graphite.enabled=true
metrics.graphite.host=localhost
metrics.graphite.port=2030
metrics.graphite.period=10
metrics.log4j.enabled=true
metrics.log4j.period=10
metrics.tomcat.enabled=true
metrics.tomcat.period=10

And on the logstash side we add the following in the “input” section:

graphite {
  host => "127.0.0.1"
  port => 2030
  type => "graphite"
}

To collect system and database metrics, we use Diamond [5], which can also push metrics in the Graphite format; we just configure it to output to the same logstash input that Nuxeo uses, as sketched below.
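
As a rough sketch (section names and defaults may differ between Diamond versions), the relevant part of diamond.conf simply points the Graphite handler at that logstash input:

# /etc/diamond/diamond.conf (excerpt)
[server]
handlers = diamond.handler.graphite.GraphiteHandler

[handlers]
[[GraphiteHandler]]
# Send to the local logstash graphite input rather than to a real Graphite server
host = 127.0.0.1
port = 2030

[collectors]
[[default]]
# Collect every 10 seconds, matching the Nuxeo metrics period
interval = 10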

At this point, we have the Apache logs, the Nuxeo logs, the Nuxeo metrics (which include the JVM ones), the database metrics and the system metrics, all sent to logstash, which forwards them to a remote server through redis.

Monitoring

Indexing

So, what do we do with all those events piling up in redis?

You guessed it, logstash again! It’s time to do the actual parsing and/or manipulation of our events.

We want to send different types of events to different processors, each expecting its own format, so based on the event type we apply different manipulations, then send the results to their final destinations.

On the collector (indexer) side, the configuration will look like this:

input {
  redis {
    host => "127.0.0.1"
    type => "redis-input"
    # these settings should match the output of the agent
    data_type => "list"
    key => "logstash"
    # We use json_event here since the sender is a logstash agent
    format => "json_event"
  }
}

filter {
  # Graphite to Riemann
  clone {
    type => "graphite"
    clones => ["riemann"]
  }
  grok {
    type => "riemann"
    pattern => "servers.(?<host>[wd]+).(?<service>[S]+) (?<metric>[S]+) %{INT:epoch}n?"
    singles => true
  }
  mutate {
    type => "riemann"
    replace => ["@source_host", "%{host}"]
  }
  date {
    type => "riemann"
    match => [ "epoch", "UNIX" ]
  }
  # Parse server.log lines
  grok {
    type => "nuxeo"
    pattern => "(?m)%{TIMESTAMP_ISO8601:date} %{LOGLEVEL:level}(?: [%{JAVAFILE:class}])?(?: %{GREEDYDATA:logmessage})?"
    add_tag => ["logs"]
  }
  # Parse Apache lines
  grok {
    type => "apache"
    pattern => "%{COMBINEDAPACHELOG}"
    add_tag => ["logs"]
  }
  # Fix apache logs timestamp
  date {
    type => "apache"
    timestamp => "dd/MMM/yyyy:HH:mm:ss Z"
  }
  geoip {
    type => "apache"
    source => "clientip"
  }
}

output {
  graphite {
    type => "graphite"
    host => "127.0.0.1"
    port => 2003
    fields_are_metrics => true
  }
  elasticsearch_http {
    tags => ["logs"]
    host => "127.0.0.1"
    flush_size => 100
  }
  riemann {
    type => "riemann"
    host => "127.0.0.1"
    port => 5555
    riemann_event => [
      "service", "%{service}",
      "metric", "%{metric}"
    ]
  }

}

The first half of the filters takes the graphite metrics and duplicates them with the new type “riemann”.

Riemann [6] is a monitoring-oriented event processor. We use it for alerting based on some of the metrics we’re going to graph; we’ll come back to it later. The main point here is that we copy the graphite metrics, then format the copy into something suitable for consumption by Riemann.

The second half of the filters deals with logs.

We use regular expressions to extract fields from the different logs, hence the “structured logs” part. We can further manipulate those fields to extract more information; for instance, we run the geoip filter on the client IP address from the Apache logs, which adds geolocation fields to our Apache events.

Once we’ve prodded our events into giving us all the information we want, formatted the way the logstash outputs expect, we just send them to their final destinations:

  • Graphite metrics go into Graphite (surprise!)
  • Riemann events go into Riemann (…)
  • Structured logs go into elasticsearch [7]

Monitoring Infrastructure

Graphite will index metrics, which you can then query to make graphs of… anything you like. You can see an example dashboard using the default included front-end on the Metrics page in our documentation (linked earlier).
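
If you prefer to build your own graphs or dashboards, Graphite’s render API lets you pull a graph, or the raw data behind it, straight from a URL (the hostname here is a placeholder; the metric path follows the naming used throughout this example):

# PNG graph of the JVM uptime over the last day
http://graphite.example.com/render?target=servers.mynuxeo.nuxeo.jvm.uptime&from=-24hours

# The same data as JSON, handy for custom dashboards
http://graphite.example.com/render?target=servers.mynuxeo.nuxeo.jvm.uptime&from=-24hours&format=json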

As mentioned before, Riemann is a monitoring-oriented event processor. We are just starting to use it and it can be kind of complex to configure (especially if you don’t know Clojure), so I’m just going to give you a simple example of how it can be used and leave the rest as an exercise for the reader!

You will remember we are getting JVM information in the Graphite metrics; the JVM uptime comes in the form of “servers.mynuxeo.nuxeo.jvm.uptime X”. In logstash, we transformed it for Riemann into:

  • host: mynuxeo - service: nuxeo.jvm.uptime - metric: X

We can then configure Riemann to alert us when it stops receiving this event.

First, we define a custom mailer (this goes into the main part of your riemann.config):

(def email-state (mailer {} {:subject (fn [events] (str
        "State "
        (let [state (:state (first events))] (if (clojure.string/blank? state) "up" state))
        " on "
        (:host (first events))))
}))

Then, we tell Riemann to send us an email when the state of this event changes (this goes into the streams part of your riemann.config):

    (by :host
      (changed-state
        (where (service "nuxeo.jvm.uptime")
           (email-state "[email protected]")
           (fn [event] (info "State change" event))
        )
      )
    )

(Note that this example is probably far from perfect; as mentioned, we’re only just starting with it.)
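
One thing worth mentioning: for changed-state to notice that a metric has stopped arriving, the events have to be indexed with a TTL so that Riemann can expire them (the “expired” event is what triggers the state change). A minimal sketch, with arbitrary values, could look like this (periodically-expire goes in the main part of riemann.config, the rest wraps your streams):

; Scan the index and reinject expired events every 10 seconds
(periodically-expire 10)

; Depending on your Riemann version, you may need (update-index (index)) instead
(let [index (index)]
  (streams
    ; Give events a TTL a bit longer than the metrics period, then index them
    (default :ttl 30
      index)))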

This means that if your Nuxeo crashes for some reason, you’ll be alerted very fast (depending on how often you configure your metrics to be sent and on the time-to-live you configure in Riemann). If you were using “classical” polling to check whether your server was still up, you wouldn’t know until the next time the “is it up?” check happens.

In Nuxeo 5.7.2, we introduce metrics based on the database connection pool; they give a good indication of whether your Nuxeo is saturated.

You can of course also use the system metrics to create different alerts: CPU is over 80% for one minute, disk space is getting low, …
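
As a hedged illustration (the service name depends on what your Diamond CPU collector reports, so "cpu.total.user" is an assumption, and the email address is a placeholder), a simple CPU threshold alert in the streams section could look like:

; Email whenever a CPU metric goes above 80.
; This fires on every matching event; "over 80% for one minute"
; would need a windowed stream on top of it.
(by :host
  (where (and (service "cpu.total.user") (> metric 80))
    (with :state "warning"
      (email-state "alerts@example.com"))))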

If you try Riemann out, don’t hesitate to share your configuration rules!

Last but not least, we have the elasticsearch/Kibana [8] duo. When you send your structured log data to elasticsearch, every field you defined is indexed (with a Lucene back-end). Elasticsearch then provides you with a powerful API to search those indexed logs. Kibana is a front-end that helps you delve into your logs using this API.

By indexing structured logs from different sources (you could add the system logs in there too), it becomes much easier to correlate things. You can find out whether the requests that return a 500 response code in Apache all come from the same IP, or from the same country, or what requests happen at the same time as you get stacktraces in your Nuxeo logs, or pull some statistics out of your logs. As elasticsearch is very fast, you can start with a broad request and then fine-tune it as you go.
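
To make that concrete (treat the field names as assumptions: they depend on your logstash version and event schema, and older versions prefix them, e.g. @type or @fields.response), a Lucene query string typed into Kibana’s search box, or sent to elasticsearch’s _search endpoint, answers the “500s per country” question directly:

# In Kibana: all Apache events that returned a 500 from a given country
type:apache AND response:500 AND geoip.country_name:"France"

# A similar query against the daily logstash-* indexes in elasticsearch
curl 'http://127.0.0.1:9200/logstash-*/_search?q=type:apache+AND+response:500&size=10'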

Once you are comfortable with it, you can also build near-realtime dashboards out of it; here is an example of what we did while trying out Kibana v3 and the GeoIP fields:

Monitoring with Kibana

Conclusion

What all those tools aim to do is give you easy access to your information. Information is not knowledge, but it’s the basis for it; you just have to find out how to exploit it! Give those tools a whirl (or others; there are plenty of fine ones that we didn’t mention), you won’t regret it!

[1] http://logstash.net/

[2] http://redis.io/

[3] https://github.com/codahale/metrics

[4] http://graphite.wikidot.com/

[5] https://github.com/BrightcoveOS/Diamond

[6] http://riemann.io/

[7] http://www.elasticsearch.org/

[8] https://www.elastic.co/products/kibana (v2)