Telemetry and Metrics in Elixir

Learn about the metrics you can collect in Elixir with the telemetry_metrics library

In the previous article, I introduced you to Telemetry. The telemetry library provides a way to:

  • generate events
  • create handler functions that will receive the events
  • register which handlers will be called when an event occurs

The principal advantage of telemetry is that it provides a standardized way to emit events and collect measurements in Elixir. Third-party libraries and the code we write ourselves can both use it to generate events in a consistent way.
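
As a quick refresher, here is a minimal sketch of those three steps in a standalone script. The Demo.Handler module, the "demo-handler" id, and the [:demo, :request, :done] event are names I made up for illustration, and I'm assuming the script is run with the elixir command so that Mix.install can fetch the dependency.

```elixir
# Assumption: run as a standalone script (for example `elixir refresher.exs`).
Mix.install([{:telemetry, "~> 1.0"}])

defmodule Demo.Handler do
  # A handler receives the event name, the measurements, the metadata,
  # and the config that was given when it was attached.
  def handle_event(event, measurements, metadata, config) do
    send(config.reply_to, {:handled, event, measurements, metadata})
  end
end

# Register which handler is called when the event occurs.
:ok =
  :telemetry.attach(
    "demo-handler",
    [:demo, :request, :done],
    &Demo.Handler.handle_event/4,
    %{reply_to: self()}
  )

# Generate the event. Handlers run synchronously in the calling process.
:ok = :telemetry.execute([:demo, :request, :done], %{duration: 42}, %{path: "/"})

result =
  receive do
    {:handled, event, measurements, metadata} -> {event, measurements, metadata}
  after
    0 -> :no_event
  end

IO.inspect(result)
```

The handler here only forwards what it received to the calling process, which is enough to see the three pieces working together.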

Let's now take a step back and consider a couple of things. An event is just a measurement of something that happened at a given point in time. By itself, it doesn't tell us much: we usually need more data to compare it against, or to infer a trend or a rate of change.

With that in mind, the telemetry developers created a library called telemetry_metrics. According to their documentation, telemetry_metrics is:

Common interface for defining metrics based on :telemetry events.

Ok, but what are metrics, and how are they different from the measurements taken with events? Well, again according to their docs:

Metrics are aggregations of Telemetry events with specific name, providing a view of the system's behavior over time.

Ok, it seems clear, right? Not really.

If you start using the telemetry_metrics library, you'll notice that it in fact DOESN'T AGGREGATE anything!!!

What?

Exactly that. It doesn't aggregate anything.

If you generate 1000 events with a measurement each, this library is not going to keep, manipulate, calculate, or summarize anything by itself.

So, what is it useful for, you might ask?

Its aim is to define what kind of metrics you're going to create from events. That's it. Just a declaration, a definition, a contract if you will. Nothing else.
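
To make the "just a declaration" point concrete, here is a small sketch, again assuming a standalone script run with Mix.install. It defines a counter and then asks telemetry which handlers are attached to the event; the event name [:metrics, :emit] is the one used later in this article.

```elixir
# Assumption: run as a standalone script, outside of any Mix project.
Mix.install([{:telemetry, "~> 1.0"}, {:telemetry_metrics, "~> 0.6.1"}])

metric = Telemetry.Metrics.counter("metrics.emit.value")

# The definition is only a struct that describes what we want to measure.
IO.inspect(metric.__struct__)

# Defining the metric did not attach any handler to the event...
handlers_before = :telemetry.list_handlers([:metrics, :emit])

# ...so emitting the event reaches nobody and nothing is stored anywhere.
:ok = :telemetry.execute([:metrics, :emit], %{value: 4}, %{})
handlers_after = :telemetry.list_handlers([:metrics, :emit])

IO.inspect({handlers_before, handlers_after})
```

Both handler lists are empty: until something consumes the definition, the event goes nowhere.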

So, how do you get the real metrics, the processed results of your carefully collected measurements?

For that, you need something else. You need a reporter.

The reporter's responsibility is to aggregate and process each event received so that it conforms to the specified metric it handles.

telemetry_metrics ships with a simple reporter called ConsoleReporter that doesn't do much. It only outputs the measurement and metric details to the console. That's not much, but it allows you to verify that everything works and that you're collecting the data needed to generate metrics.

For more useful reporters, you should either write them yourself or get them from someone who has already written them.

In this article, I am going to show you how to use the default ConsoleReporter and how to create a custom reporter.

Prerequisites

Use asdf to install these versions of Erlang and Elixir:

asdf install erlang 24.2.1
asdf global erlang 24.2.1
asdf install elixir 1.13.3-otp-24
asdf global elixir 1.13.3-otp-24

Create an Elixir app

mix new metrics --sup

This will create an Elixir app with an application callback where we can attach our telemetry events and configure our metrics.

Install dependencies

Add telemetry and telemetry_metrics to your mix.exs:

  defp deps do
    [
      {:telemetry, "~> 1.0"},
      {:telemetry_metrics, "~> 0.6.1"}
    ]
  end

Create an event emitter

Let's create a simple event emitter that we can use to test our metrics. Open the lib/metrics.ex file that mix created for you and replace its contents with this:

defmodule Metrics do
  def emit(value) do
    :telemetry.execute([:metrics, :emit], %{value: value})
  end
end

As you can see, it uses :telemetry.execute to emit an event, passing the value we provide as the measurement.

Define some metrics

Ok, now the interesting part: let's define some metrics we want to collect. telemetry_metrics allows you to create several types of metrics:

  • counter/2 which counts the total number of emitted events
  • sum/2 which keeps track of the sum of the selected measurement
  • last_value/2 holding the value of the selected measurement from the most recent event
  • summary/2 calculating statistics of the selected measurement, like maximum, mean, percentiles, etc.
  • distribution/2 which builds a histogram of the selected measurement
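
As a rough sketch, this is how the five definitions could look for the event we emit in this article. I'm assuming a standalone script here, and the buckets entry under reporter_options is only an example of a reporter-specific option: whether it is used, and how, depends on the reporter you choose.

```elixir
# Assumption: run as a standalone script, outside of any Mix project.
Mix.install([{:telemetry_metrics, "~> 0.6.1"}])

import Telemetry.Metrics

metrics = [
  counter("metrics.emit.value"),
  sum("metrics.emit.value"),
  last_value("metrics.emit.value"),
  summary("metrics.emit.value"),
  # Assumption: the bucket boundaries are illustrative and reporter-specific.
  distribution("metrics.emit.value", reporter_options: [buckets: [1, 5, 10]])
]

# Each function returns a struct of a different type describing the metric.
types = Enum.map(metrics, & &1.__struct__)
IO.inspect(types)
```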

We are going to start with the basics and define a counter metric, assuming we want to count how many times the event has happened.

Create a new file lib/metrics/telemetry.ex and put this in it:

defmodule Metrics.Telemetry do
  use Supervisor
  import Telemetry.Metrics

  def start_link(arg) do
    Supervisor.start_link(__MODULE__, arg, name: __MODULE__)
  end

  def init(_arg) do
    children = [
      {Telemetry.Metrics.ConsoleReporter, metrics: metrics()}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end

  defp metrics do
    [
      counter("metrics.emit.value")
    ]
  end
end

Let's analyze it. First of all, it is a simple Supervisor that we need to start from some other process in order to work. We'll attach it to our main application supervision tree. More on that later.

Then, the init function is configuring the children for this supervisor. We have a single child here, the ConsoleReporter. And we are passing a list of metrics to it as initial arguments.

Here is where it becomes interesting. Telemetry.Metrics.counter is the definition of the metric we want to use. We pass a string: "metrics.emit.value". This string will be split using the period as separator, and everything but the last part (metrics.emit) will be the event to attach to. The last part (value) is the key that will be read from the measurements passed to the telemetry event handler.

Let's explain it. We have this call in Metrics.emit/1:

    :telemetry.execute([:metrics, :emit], %{value: value})

and we have a metrics definition in Metrics.Telemetry.metrics/0:

      counter("metrics.emit.value")

As you see, the counter is tied to the [:metrics, :emit] event and to the :value key of the second argument of the :telemetry.execute call (the measurements):

  • event: "metrics.emit" -> [:metrics, :emit]
  • measurement attribute: "value" -> :value

If we were collecting the :query_time of [:my_app, :repo, :query] event, we would write "my_app.repo.query.query_time".
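
You can check this split yourself by inspecting the struct that the definition returns. This is a small sketch, assuming a standalone script; it also shows that, as far as I know, the name can be given as a list of atoms instead of a string.

```elixir
# Assumption: run as a standalone script, outside of any Mix project.
Mix.install([{:telemetry_metrics, "~> 0.6.1"}])

metric = Telemetry.Metrics.sum("my_app.repo.query.query_time")

IO.inspect(metric.event_name)   # [:my_app, :repo, :query]
IO.inspect(metric.measurement)  # :query_time

# The same definition written with a list of atoms as the name.
same = Telemetry.Metrics.sum([:my_app, :repo, :query, :query_time])
IO.inspect(same.event_name == metric.event_name)
```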

Ok, let's continue.

We are using the ConsoleReporter to collect our metrics. We defined the counter metric. We also said that ConsoleReporter does nothing but output the values received. That is enough, for now, to check that we are in fact collecting metrics with our ConsoleReporter.

But one thing is missing. We need to attach our Metrics.Telemetry supervisor to the application supervisor; otherwise, nothing will happen. Open application.ex and change the start/2 function to this:


  def start(_type, _args) do
    children = [
      Metrics.Telemetry
    ]

    opts = [strategy: :one_for_one, name: Metrics.Supervisor]
    Supervisor.start_link(children, opts)
  end

As you see, we are putting our Metrics.Telemetry as a child of the application Supervisor.

Test it

Let's try it with iex. Open a shell terminal, get the dependencies and open an iex session:

mix deps.get
iex -S mix

Emit an event:

iex(1)> Metrics.emit(4)
[Telemetry.Metrics.ConsoleReporter] Got new event!
Event name: metrics.emit
All measurements: %{value: 4}
All metadata: %{}

Metric measurement: :value (counter)
Tag values: %{}

:ok
iex(2)>

Yay, it works. Although we didn't attach any handler to the [:metrics, :emit] event manually, the ConsoleReporter handled it for us thanks to telemetry_metrics. As I said, it only echoes what is passed in, but the important thing here is that it works.
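
If you want to see the handler that the reporter attached for us, you can ask telemetry directly. This is a sketch under the assumption of a standalone script that starts the ConsoleReporter by hand instead of from a supervisor.

```elixir
# Assumption: run as a standalone script, outside of any Mix project.
Mix.install([{:telemetry, "~> 1.0"}, {:telemetry_metrics, "~> 0.6.1"}])

{:ok, _pid} =
  Telemetry.Metrics.ConsoleReporter.start_link(
    metrics: [Telemetry.Metrics.counter("metrics.emit.value")]
  )

# The reporter attached a handler for the event behind our metric definition.
handlers = :telemetry.list_handlers([:metrics, :emit])
event_names = Enum.map(handlers, & &1.event_name)
IO.inspect(event_names)
```

In the iex session of our app, calling :telemetry.list_handlers([:metrics, :emit]) should show the same kind of entry.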

If the reporter were a little more advanced it would do something with the passed value.

CustomReporter

Let's create a simple CustomReporter ourselves to see how this works. One thing to notice is that the reporter needs to remember previous values in order to do its calculations. In our case, we need to remember how many events we have received so that we can increment that number every time we receive a new one. We could use :ets, a GenServer, or an Agent, depending on how complex our implementation needs to be. For this tutorial, I am going to use an Agent, as it is a very simple way to keep a value around between events.
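
For comparison, this is roughly what the :ets alternative could look like. It is only a sketch and it is not what we'll use in the rest of the article; the table name :reporter_counters and the :count key are names I chose for illustration.

```elixir
# A public table, so that any process handling events can update it.
table = :ets.new(:reporter_counters, [:set, :public])

# update_counter/4 increments the counter atomically and inserts the
# default {:count, 0} first if the key does not exist yet.
count_event = fn -> :ets.update_counter(table, :count, 1, {:count, 0}) end

count_event.()
count_event.()
latest = count_event.()

IO.inspect(latest)
# prints 3
```

The nice part of this approach is that the increment happens inside :ets, so there is no separate process to call for every event.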

Create a file called lib/metrics/telemetry/reporter_state.ex and write this:

defmodule Metrics.Telemetry.ReporterState do
  use Agent

  def start_link(initial_value) do
    Agent.start_link(fn -> initial_value end, name: __MODULE__)
  end

  def value do
    Agent.get(__MODULE__, & &1)
  end

  def increment do
    Agent.update(__MODULE__, &(&1 + 1))
  end
end

Let's attach it as a child to our main application supervisor so that it is started when the app starts. Edit application.ex and change the start/2 function to this:

  def start(_type, _args) do
    children = [
      {Metrics.Telemetry.ReporterState, 0},
      Metrics.Telemetry
    ]

    opts = [strategy: :one_for_one, name: Metrics.Supervisor]
    Supervisor.start_link(children, opts)
  end

Now the application will automatically start an agent with an initial value of 0 that we can use from every other part of the app through its name, Metrics.Telemetry.ReporterState. Let's try it:

iex -S mix
iex(1)> Metrics.Telemetry.ReporterState.value()
0
iex(2)> Metrics.Telemetry.ReporterState.increment()
:ok
iex(3)> Metrics.Telemetry.ReporterState.value()
1

Nice, we have our agent working. We can use it to maintain the state of our custom reporter.

Let's write our reporter. Create a file called lib/metrics/telemetry/custom_reporter.ex:

defmodule Metrics.Telemetry.CustomReporter do
  use GenServer

  alias Metrics.Telemetry.ReporterState
  alias Telemetry.Metrics

  def start_link(metrics: metrics) do
    GenServer.start_link(__MODULE__, metrics)
  end

  @impl true
  def init(metrics) do
    Process.flag(:trap_exit, true)

    groups = Enum.group_by(metrics, & &1.event_name)

    for {event, metrics} <- groups do
      id = {__MODULE__, event, self()}
      :telemetry.attach(id, event, &__MODULE__.handle_event/4, metrics)
    end

    {:ok, Map.keys(groups)}
  end

  def handle_event(_event_name, measurements, metadata, metrics) do
    # Dispatch the event to every metric defined for it; only the side effects matter.
    Enum.each(metrics, &handle_metric(&1, measurements, metadata))
  end

  defp handle_metric(%Metrics.Counter{} = metric, _measurements, _metadata) do
    ReporterState.increment()

    current_value = ReporterState.value()

    IO.puts("Metric: #{metric.__struct__}. Current value: #{current_value}")
  end

  defp handle_metric(metric, _measurements, _metadata) do
    IO.puts("Unsupported metric: #{metric.__struct__}")
  end

  @impl true
  def terminate(_, events) do
    for event <- events do
      :telemetry.detach({__MODULE__, event, self()})
    end

    :ok
  end
end

A lot of things are happening here, but don't get distracted by them. What you should notice is the handle_event function, which passes the event to the handle_metric clause of each metric defined for it.

The handle_metric function uses ReporterState to keep the current state and increments the running counter that we use to track how many times we have received this event. It also prints the new current state to the console.

Let's use our CustomReporter instead of the ConsoleReporter. Change the init/1 function in telemetry.ex to be like this:

  def init(_arg) do
    children = [
      {Metrics.Telemetry.CustomReporter, metrics: metrics()}
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end

As you see, now we use our CustomReporter and we pass the metrics to it just as we did with the ConsoleReporter. Let's try it. Open the console and start iex:

iex -S mix
iex(1)> Metrics.emit(4)
Metric: Elixir.Telemetry.Metrics.Counter. Current value: 1
:ok
iex(2)> Metrics.emit(5)
Metric: Elixir.Telemetry.Metrics.Counter. Current value: 2
:ok
iex(3)> Metrics.emit(2)
Metric: Elixir.Telemetry.Metrics.Counter. Current value: 3
:ok

And there it is, now the counter is correctly keeping track of how many times we have received the event.

Let's review a few more things about the CustomReporter. Most of them are better explained in the official docs here.

It traps exits so that terminate/2 runs and detaches the handlers when the app terminates or the process is about to finish. It also groups the metrics by event, so an event tracked by several metrics gets a single handler. For example, suppose you want to collect counter and sum metrics for the query_time measurement of the [:my_app, :repo, :query] event, that is, "my_app.repo.query.query_time". This code receives the event once and then calls handle_metric twice, once with the %Telemetry.Metrics.Counter{} as the first argument and once with the %Telemetry.Metrics.Sum{}. For both of those calls, the second argument is the measurements map we just received.
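
Here is the grouping step in isolation, as a small sketch in a standalone script. The metric list is made up for the example; the Enum.group_by call is the same one used in init/1.

```elixir
# Assumption: run as a standalone script, outside of any Mix project.
Mix.install([{:telemetry_metrics, "~> 0.6.1"}])

import Telemetry.Metrics

metrics = [
  counter("my_app.repo.query.query_time"),
  sum("my_app.repo.query.query_time"),
  counter("metrics.emit.value")
]

# One entry per event, each holding every metric defined for that event.
groups = Enum.group_by(metrics, & &1.event_name)

# Keep only the struct types so the result is easy to read.
by_event =
  Map.new(groups, fn {event, event_metrics} ->
    {event, Enum.map(event_metrics, & &1.__struct__)}
  end)

IO.inspect(by_event)
```

Two events come out of three definitions, so the reporter attaches two handlers, and the handler for [:my_app, :repo, :query] receives both of its metrics as its config.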

Add support for more metrics

Ok, so far so good. But notice that we are handling only one metric. Let's add support for the sum metric.

First, to sum all the event measurements, we need to also keep track of the running sum so far. Our ReporterState is only handling a single integer as the state. Let's change it to now store a tuple, with the first element being the count and the second element being the sum.

Change ReporterState to this:

defmodule Metrics.Telemetry.ReporterState do
  use Agent

  def start_link(initial_value) do
    Agent.start_link(fn -> initial_value end, name: __MODULE__)
  end

  def value do
    Agent.get(__MODULE__, & &1)
  end

  def increment do
    Agent.update(__MODULE__, fn {count, sum} -> {count + 1, sum} end)
  end

  def sum(value) do
    Agent.update(__MODULE__, fn {count, sum} -> {count, sum + value} end)
  end
end

Now the agent has a composite state, a function to increment the count part, and a function to add to the sum part. This is a naive implementation, of course: it doesn't guard against the two parts of the state getting out of sync, and the value we read back after an update may already include a later event. For our purposes, it will suffice. We needed a way to store state, and we have it.
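
If you wanted to tighten this up a little, one option is to update the state and read it back in a single call with Agent.get_and_update. The sketch below uses a hypothetical AtomicState module so that it doesn't clash with our ReporterState; it only closes the gap between an update and the read that follows it, since count and sum are still updated by separate calls.

```elixir
# Hypothetical variant of the state agent, not the module used in this article.
defmodule AtomicState do
  use Agent

  def start_link(initial_value) do
    Agent.start_link(fn -> initial_value end, name: __MODULE__)
  end

  # Updates the count and returns the new state in one step.
  def increment do
    Agent.get_and_update(__MODULE__, fn {count, sum} ->
      new_state = {count + 1, sum}
      {new_state, new_state}
    end)
  end

  # Adds to the sum and returns the new state in one step.
  def sum(value) do
    Agent.get_and_update(__MODULE__, fn {count, sum} ->
      new_state = {count, sum + value}
      {new_state, new_state}
    end)
  end
end

{:ok, _pid} = AtomicState.start_link({0, 0})

after_count = AtomicState.increment()
after_sum = AtomicState.sum(4)

IO.inspect({after_count, after_sum})
# prints {{1, 0}, {1, 4}}
```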

Let's change our CustomReporter. Remove the old handle_metric functions and put these in their place:

  defp handle_metric(%Metrics.Counter{} = metric, _measurements, _metadata) do
    ReporterState.increment()

    current_value = ReporterState.value()

    IO.puts("Metric: #{metric.__struct__}. Current value: #{inspect(current_value)}")
  end

  defp handle_metric(%Metrics.Sum{} = metric, %{value: value}, _metadata) do
    ReporterState.sum(value)

    current_value = ReporterState.value()

    IO.puts("Metric: #{metric.__struct__}. Current value: #{inspect(current_value)}")
  end

  defp handle_metric(metric, _measurements, _metadata) do
    IO.puts("Unsupported metric: #{metric.__struct__}")
  end

Again, nothing fancy. Now we handle a second type of metric, the %Metrics.Sum{}, and, as with the counter, we use the ReporterState to keep track of the event measurements so far.

Let's tell telemetry_metrics to also handle this type of metric. Update the metrics/0 function of telemetry.ex to this:

  defp metrics do
    [
      counter("metrics.emit.value"),
      sum("metrics.emit.value")
    ]
  end

Now the sum metric is also being collected.

Finally, change the start/2 function in application.ex to this:

  def start(_type, _args) do
    children = [
      {Metrics.Telemetry.ReporterState, {0, 0}},
      Metrics.Telemetry
    ]

    opts = [strategy: :one_for_one, name: Metrics.Supervisor]
    Supervisor.start_link(children, opts)
  end

We are set. Let's try it. In the shell:

iex -S mix
iex(1)> Metrics.emit(4)
Metric: Elixir.Telemetry.Metrics.Counter. Current value: {1, 0}
Metric: Elixir.Telemetry.Metrics.Sum. Current value: {1, 4}
:ok
iex(2)> Metrics.emit(3)
Metric: Elixir.Telemetry.Metrics.Counter. Current value: {2, 4}
Metric: Elixir.Telemetry.Metrics.Sum. Current value: {2, 7}
:ok
iex(3)> Metrics.emit(2)
Metric: Elixir.Telemetry.Metrics.Counter. Current value: {3, 7}
Metric: Elixir.Telemetry.Metrics.Sum. Current value: {3, 9}
:ok
iex(4)> Metrics.emit(1)
Metric: Elixir.Telemetry.Metrics.Counter. Current value: {4, 9}
Metric: Elixir.Telemetry.Metrics.Sum. Current value: {4, 10}
:ok

And that's it. Our CustomReporter is now capable of tracking two different metrics for us. Of course, a production system will have a better way to store and keep track of the series of measurements so that it has more guarantees about the data ingested. But that is out of the scope of this article.

Summary

We learned that telemetry_metrics:

  • offers a way to define 5 types of metrics in a standard way
  • doesn't care about or dictate how the measurements should be stored, aggregated, or manipulated
  • leaves it to the reporter to ingest the events and do the real manipulation of the data

There are several open-source implementations of reporters that allow us to make our measurements available to tools like Prometheus or StatsD servers.

In a future article, I'll talk about integrating with them.

Source code

You can find the source code for this article in this GitHub repository.

About

I'm Miguel Cobá. I write about Elixir, Elm, Software Development, and eBook writing.