
Use OpenTelemetry

Use OpenTelemetry and Grafana for metrics, logging, and tracing.

Grafana is an open-source platform for data visualization and monitoring. It is available as a Docker image, grafana/otel-lgtm, that appears easy to run locally or in a Nomad cluster.

OpenTelemetry (OTEL) is a collection of APIs, SDKs, and tools for producing and exporting telemetry (metrics, logs, and traces) over a standardized protocol.

Possible uses:

  • Analytics of IO usage saved as metrics (for internal product analysis)
  • Performance analysis (tracing) of IO components (for internal debugging)
  • Metrics for APIs and web backends managed with IO (to eventually be offered to external users).

This overview of OpenTelemetry looked promising, but on first reading, it describes a usability layer over the OTEL Go package and doesn’t reveal much about what’s happening underneath. It’s also missing a good example that exercises both the usability layer and OTEL itself.

To understand more, I worked through Getting Started. From that I was able to send metrics, logs, and traces to a Grafana container running locally:

docker run -p 3000:3000 -p 4317:4317 -p 4318:4318 --rm -ti grafana/otel-lgtm
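
For reference, the metrics half of that test reduces to something like the following sketch, using the OTEL Go SDK against the container’s plaintext gRPC port (the scope and metric names here are illustrative):

package main

import (
	"context"
	"log"
	"time"

	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// Export over gRPC to the collector port published by the
	// grafana/otel-lgtm container (4317, plaintext).
	exporter, err := otlpmetricgrpc.New(ctx,
		otlpmetricgrpc.WithEndpoint("localhost:4317"),
		otlpmetricgrpc.WithInsecure(),
	)
	if err != nil {
		log.Fatal(err)
	}

	// A periodic reader batches and pushes metrics on an interval.
	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter,
			sdkmetric.WithInterval(5*time.Second))),
	)
	defer provider.Shutdown(ctx) // Shutdown flushes any pending metrics

	// Record a counter; it should show up in the local Grafana.
	meter := provider.Meter("io-test") // instrumentation scope name (illustrative)
	requests, err := meter.Int64Counter("io.requests")
	if err != nil {
		log.Fatal(err)
	}
	requests.Add(ctx, 1)
}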

Ultimately, it seems (very) likely that IO will build directly on the OTEL protos. They are simple and easy to integrate, and using them directly seems simpler than pulling in Go implementations that depend on externally compiled protos and Google’s gRPC. It would also allow IO to use Buf Connect for these calls; since all OTEL gRPC calls are unary, direct API usage stays simple. We would have to reimplement buffering and export scheduling, but the official package likely contains much more than IO actually needs, and building this ourselves gives us a high level of control. For example, it would be easy to send metrics to multiple backends (e.g. a user-specific backend and a service-wide backend).
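
As a rough sketch of what direct proto usage could look like with connect-go: assume the OTLP protos have been compiled locally with protoc-gen-go and protoc-gen-connect-go (the example.com/io/gen import paths below are placeholders for that generated code, and the metric itself is illustrative):

package main

import (
	"context"
	"crypto/tls"
	"log"
	"net"
	"net/http"
	"time"

	"connectrpc.com/connect"
	"golang.org/x/net/http2"

	// Placeholder paths for locally generated OTLP code.
	collectormetrics "example.com/io/gen/opentelemetry/proto/collector/metrics/v1"
	"example.com/io/gen/opentelemetry/proto/collector/metrics/v1/metricsv1connect"
	metricspb "example.com/io/gen/opentelemetry/proto/metrics/v1"
)

func main() {
	// gRPC requires HTTP/2; for a plaintext collector endpoint that
	// means an h2c-capable client rather than http.DefaultClient.
	h2c := &http.Client{Transport: &http2.Transport{
		AllowHTTP: true,
		DialTLSContext: func(ctx context.Context, network, addr string, _ *tls.Config) (net.Conn, error) {
			return (&net.Dialer{}).DialContext(ctx, network, addr)
		},
	}}

	client := metricsv1connect.NewMetricsServiceClient(h2c, "http://localhost:4317", connect.WithGRPC())

	// Build a minimal ExportMetricsServiceRequest by hand:
	// one gauge with a single data point.
	req := &collectormetrics.ExportMetricsServiceRequest{
		ResourceMetrics: []*metricspb.ResourceMetrics{{
			ScopeMetrics: []*metricspb.ScopeMetrics{{
				Metrics: []*metricspb.Metric{{
					Name: "io.queue.depth",
					Data: &metricspb.Metric_Gauge{Gauge: &metricspb.Gauge{
						DataPoints: []*metricspb.NumberDataPoint{{
							TimeUnixNano: uint64(time.Now().UnixNano()),
							Value:        &metricspb.NumberDataPoint_AsInt{AsInt: 42},
						}},
					}},
				}},
			}},
		}},
	}

	// Export is a unary call: one request, one response.
	if _, err := client.Export(context.Background(), connect.NewRequest(req)); err != nil {
		log.Fatal(err)
	}
}

The buffering and export scheduling we’d reimplement would sit in front of this call: accumulate data points, then flush a batch per interval.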

Someone else doesn’t care for the Go OTEL library.

Currently I can compile the opentelemetry protos in the IO repo, and I have a small test program that sends metrics and logs that are visible in a local Grafana instance. I’m also proxying exports through IO to observe the messages.
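
On the proxy side, observation can be as small as a unary interceptor attached to the generated service handlers; a sketch (the log line is illustrative):

package main

import (
	"context"
	"log"

	"connectrpc.com/connect"
)

// logExports logs the procedure of every OTLP export passing
// through the proxy before handing it to the real handler.
var logExports = connect.UnaryInterceptorFunc(
	func(next connect.UnaryFunc) connect.UnaryFunc {
		return func(ctx context.Context, req connect.AnyRequest) (connect.AnyResponse, error) {
			log.Printf("otel export: %s", req.Spec().Procedure)
			return next(ctx, req)
		}
	},
)

It would be registered with connect.WithInterceptors(logExports) when constructing each service handler.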

Next steps:

  • Run the grafana image in Nomad on agentio.dev (under the otel subdomain) and secure it with something, probably an API key. This would be available as an IO ingress with HTTPS.
  • Configure IO to send analytics to otel.agentio.dev.


Pros

  • OTEL has a nice set of features for logging, monitoring, and tracing.
  • OTEL has broad adoption with many available tools.
  • OTEL gRPC is easy to observe and manage with IO.
  • Envoy also supports OpenTelemetry, though this has not been explored.

Cons

As things progress, the cons are rapidly fading.

  • Complexity? Integration complexity is much less if we use the protos directly.
  • Reimplementation cost? We don’t seem to need much to send metrics and logs.
  • Upsells? It’s not clear what we need on the collection side. Will this lead us to a dependency on something proprietary and costly?

Grafana is resource-intensive. My current droplets aren’t up to this.

As one sizing recommendation puts it: “The Grafana instance itself will also need adequate resources. While the exact amount will depend on the complexity of your dashboards and the number of users accessing them, a minimum of 2 vCPUs and 4GB of RAM is a good starting point.”

I can run it at home just fine in Docker, so I’m setting up an instance at grafana.timbx.me and otel.timbx.me (collection).

Investigating further, I found that Nomad has been killing this allocation because it exceeds the default 300MB memory limit. Raising the limit to 2GB lets the container run on my home cluster. This might also be fine on the droplets, but I’ll keep the deployment on my faster home system for now.
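
For reference, a sketch of the job spec consistent with the service listing below (stanzas reconstructed, not copied from the repo):

job "otel-lgtm" {
  datacenters = ["dc1"]

  group "otel-lgtm" {
    network {
      port "http" { to = 3000 } # Grafana UI
      port "grpc" { to = 4317 } # OTLP gRPC ingestion
    }

    service {
      name = "grafana"
      port = "http"
    }

    service {
      name = "otel"
      port = "grpc"
    }

    task "otel-lgtm" {
      driver = "docker"

      config {
        image = "grafana/otel-lgtm"
        ports = ["http", "grpc"]
      }

      resources {
        memory = 2048 # the 300MB default gets the container OOM-killed
      }
    }
  }
}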


Running the otel-lgtm instance in Nomad requires exposing two services:

tim@oscar:~/clusters/oscar/jobs$ curl http://localhost:4646/v1/allocation/0184f2b2-2fe3-989c-9c2c-15212f568a9a/services | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   715  100   715    0     0   224k      0 --:--:-- --:--:-- --:--:--  232k
[
  {
    "Address": "146.190.175.199",
    "AllocID": "0184f2b2-2fe3-989c-9c2c-15212f568a9a",
    "CreateIndex": 42444,
    "Datacenter": "dc1",
    "ID": "_nomad-task-0184f2b2-2fe3-989c-9c2c-15212f568a9a-group-otel-lgtm-grafana-http",
    "JobID": "otel-lgtm",
    "ModifyIndex": 42444,
    "Namespace": "default",
    "NodeID": "be88b7fb-e8ef-2cab-94ca-49bf7c95db25",
    "Port": 30289,
    "ServiceName": "grafana",
    "Tags": []
  },
  {
    "Address": "146.190.175.199",
    "AllocID": "0184f2b2-2fe3-989c-9c2c-15212f568a9a",
    "CreateIndex": 42444,
    "Datacenter": "dc1",
    "ID": "_nomad-task-0184f2b2-2fe3-989c-9c2c-15212f568a9a-group-otel-lgtm-otel-grpc",
    "JobID": "otel-lgtm",
    "ModifyIndex": 42444,
    "Namespace": "default",
    "NodeID": "be88b7fb-e8ef-2cab-94ca-49bf7c95db25",
    "Port": 27521,
    "ServiceName": "otel",
    "Tags": []
  }
]

Currently, services are referenced as nomad:job, which assumes one service per job. This needs to be extended to nomad:job:service (a parsing sketch follows the list below):

  • nomad:otel-lgtm:otel for the otel gRPC ingestion service (exposed internally on port 4317)
  • nomad:otel-lgtm:grafana for the grafana service (exposed internally on port 3000)
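
A minimal parsing sketch for the extended form (the ServiceRef type and the empty-service convention for the legacy form are illustrative, not IO’s actual code):

package main

import (
	"fmt"
	"strings"
)

// ServiceRef names a Nomad-registered service. An empty Service
// means "the job's only service" (the legacy nomad:job form).
type ServiceRef struct {
	Job     string
	Service string
}

// parseServiceRef accepts both nomad:job and nomad:job:service.
func parseServiceRef(ref string) (ServiceRef, error) {
	parts := strings.Split(ref, ":")
	switch {
	case len(parts) == 2 && parts[0] == "nomad" && parts[1] != "":
		return ServiceRef{Job: parts[1]}, nil
	case len(parts) == 3 && parts[0] == "nomad" && parts[1] != "" && parts[2] != "":
		return ServiceRef{Job: parts[1], Service: parts[2]}, nil
	default:
		return ServiceRef{}, fmt.Errorf("invalid service reference %q", ref)
	}
}

func main() {
	for _, ref := range []string{"nomad:otel-lgtm:otel", "nomad:otel-lgtm:grafana", "nomad:otel-lgtm"} {
		fmt.Println(parseServiceRef(ref))
	}
}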
