It's a Tuesday afternoon, and management has been in a state of poorly contained hysteria all day, running in and out of meetings continuously. The team is trying to focus on work, but it's hard to avoid gossiping.
"Do you think it's ransomware?" says your colleague, and you shake your head. You've had the same thought, of course, but #security in Slack is in the normal state of blissful silence, and there's nothing going on in #ops either. It's about time for the yearly reorganization, but that usually doesn't cause so much... distress. Maybe there's been some scandal at the recent strategy retreat. Did the CEO lose their executive decision dice again?
"Do you think they're fighting about reintroducing OKRs again?"
You're about to pack up your things and go home when your team lead finally comes into the team office. They have a determined look on their face, like a decision has been made, and they don't like it. You get a sinking feeling in your gut, like you're not going to like it either.
"Rachel has quit, she won't be coming in to work again. Apparently, she has already moved to the Cayman Islands to avoid having to be on-call ever again."
Oh crap, that's bad. Rachel has been there since forever, single-handedly running dozens of important enterprise services that generate tons of revenue for the business. She's never had time to document anything, always running from one fire to another, while somehow magically managing to do maintenance work on absolutely everything. The engineers have known for years that she's the single most important person in the organization.
"Management has decided that we're taking over the Camelo service. Unfortunately, we're not getting any new resources for maintenance. Since it's business-critical and undocumented, we're not allowed to make changes to it. Oh, and from now on, we're on call."
No changes? On call?? You've never heard of this Camelo service before; how critical can it be? It's definitely not in any of the observability platforms... Is it one of those ancient things running in some tmux session left hanging by a long-retired engineer?
"It's a critical revenue stream to the business. Nobody is really sure what exactly it's doing, but it seems to have something to do with certain dairy products. It's some JVM stack, so I said our team could take it on. They've given us some jar files, but we haven't found the source code yet."
...
Adding observability to an application without changing code
This is the second part in our blog series about using OpenTelemetry;
you can read the first one here if you missed it. In
this post, we're going to cover what the engineer in the introduction can do
to set up some observability and discover what the Camelo service is doing. The goal
is to give you a place to start experimenting and tinkering. The official
documentation is vast and somewhat challenging to navigate without going in circles.
In this post, we're introducing two software components of the OpenTelemetry project:
- The OpenTelemetry Collector, which can serve as a one-stop destination for all your telemetry and make sure it ends up in the right observability solution.
- The OpenTelemetry Java Agent, which can instrument your application without requiring any source code changes.
camelo.jar and all the code are available in this GitHub repository if you'd like to experiment with anything yourself.
Debugging OpenTelemetry setup on a developer machine
It is hard to get anything done without a short feedback loop, so we'll start by setting up an OpenTelemetry collector locally. Later on, we will configure the collector to send data to an observability solution. For now, we will configure it to print the telemetry it receives to the console, so we can find out what this Camelo service is all about. The collector needs a configuration file; we can use this one:
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      exporters: [ nop ]
    metrics:
      receivers: [ otlp ]
      exporters: [ nop ]
    logs:
      receivers: [ otlp ]
      exporters: [ debug ]

exporters:
  debug:
    verbosity: detailed
  nop:
This instructs the collector to listen on port 4317 for OTLP/gRPC data and port 4318 for OTLP/HTTP. Both of these can receive logs, traces and metrics data. We've set up pipelines for all three kinds of telemetry data: we discard metrics and traces for now, and forward all logs to an exporter that prints them on the console.
We can use this collector configuration to see what kind of data we're able to pick up from Camelo and to verify that we've instrumented it correctly. The configuration reference is here, and the most important concepts, in a logical order, are:
- receivers are used to retrieve data from a myriad of protocols
- processors can augment or rewrite data in a pipeline (not used in this configuration; see the sketch after this list)
- exporters are used to put that data somewhere
- pipelines connect receivers with exporters, optionally using processors to process the data in between
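We don't need a processor for this first experiment, but as a sketch of where one would fit: the batch processor, which ships with the collector, could be slotted into the logs pipeline like this:

processors:
  batch:

service:
  pipelines:
    logs:
      receivers: [ otlp ]
      processors: [ batch ]
      exporters: [ debug ]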
It's convenient to run the OpenTelemetry collector in docker for local development, and the configuration file above
will work with this docker-compose.yml
file:
services:
  otel-collector:
    image: otel/opentelemetry-collector:latest
    container_name: otel-collector
    command: [ "--config=/etc/otel-collector-config.yaml" ]
    volumes:
      - ./otel-collector-debug-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317" # OTLP gRPC
      - "4318:4318" # OTLP HTTP
Starting it with docker compose up
will yield something like this:
[+] Running 1/1
✔ Container otel-collector Created 0.0s
Attaching to otel-collector
otel-collector | 2025-02-09T11:56:41.697Z info service@v0.119.0/service.go:186 Setting up own telemetry...
otel-collector | 2025-02-09T11:56:41.697Z info builders/builders.go:26 Development component. May change in the future. {"kind": "exporter", "data_type": "logs", "name": "debug"}
otel-collector | 2025-02-09T11:56:41.697Z info service@v0.119.0/service.go:252 Starting otelcol... {"Version": "0.119.0", "NumCPU": 12}
otel-collector | 2025-02-09T11:56:41.697Z info extensions/extensions.go:39 Starting extensions...
otel-collector | 2025-02-09T11:56:41.698Z info otlpreceiver@v0.119.0/otlp.go:112 Starting GRPC server {"kind": "receiver", "name": "otlp", "data_type": "metrics", "endpoint": "0.0.0.0:4317"}
otel-collector | 2025-02-09T11:56:41.698Z info otlpreceiver@v0.119.0/otlp.go:169 Starting HTTP server {"kind": "receiver", "name": "otlp", "data_type": "metrics", "endpoint": "0.0.0.0:4318"}
otel-collector | 2025-02-09T11:56:41.698Z info service@v0.119.0/service.go:275 Everything is ready. Begin running and processing data.
This is enough to start instrumenting Camelo and figure out what kind of observability data we can get from it.
What even is a camelo.jar?
Let's first try to run this using java, just to see what happens:
java -jar camelo.jar
17:17:30.129 [main] INFO CameloServer$ -- Server starting on port 8080
17:17:30.131 [main] INFO CameloServer$ -- Access on i.e. http://localhost:8080/
^C%
Apparently, it's some sort of web service and, thankfully, it appears to have logging. That would be perfect for the OpenTelemetry collector we just configured! We've downloaded the Java agent for OpenTelemetry from this page.
The OpenTelemetry Java agent is capable of instrumenting the bytecode of our application before it starts, so that telemetry data can be made available to our collector. Let's try it!
java -javaagent:opentelemetry-javaagent.jar -jar camelo.jar
[otel.javaagent 2025-02-09 17:24:14:188 +0100] [main] INFO io.opentelemetry.javaagent.tooling.VersionLogger - opentelemetry-javaagent - version: 2.12.0
17:24:15.382 [main] INFO CameloServer$ -- Server starting on port 8080
17:24:15.392 [main] INFO CameloServer$ -- Access on i.e. http://localhost:8080/
It's working: we're already seeing the logs in the collector. They're very verbose, so here's an excerpt:
...
ScopeLogs #0
ScopeLogs SchemaURL:
InstrumentationScope CameloServer$
LogRecord #0
ObservedTimestamp: 2025-02-09 16:24:15.386833 +0000 UTC
Timestamp: 2025-02-09 16:24:15.382002 +0000 UTC
SeverityText: INFO
SeverityNumber: Info(9)
Body: Str(Server starting on port 8080)
Trace ID:
Span ID:
...
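Note that the agent found our local collector without any extra configuration, because it exports OTLP to a collector on localhost by default. If your collector runs somewhere else, or you want the telemetry tagged with a proper service name, you can set the standard properties explicitly. A sketch (the hostname here is made up):

java \
  -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=camelo \
  -Dotel.exporter.otlp.endpoint=http://my-collector.internal:4317 \
  -jar camelo.jar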
So at least we know we'll get something now. What else could we pick up with the OpenTelemetry agent? One way to check is to run this command:
java \
-Dotel.javaagent.debug=true \
-javaagent:opentelemetry-javaagent.jar \
-jar camelo.jar \
&> server-otel-startup.log
This creates a lot of output. Let's highlight a few things we've found:
grep -o 'Applying instrumentation: [^ ]*' server-otel-startup.log
Applying instrumentation: executors
Applying instrumentation: internal-lambda
Applying instrumentation: internal-reflection
Applying instrumentation: internal-class-loader
Applying instrumentation: internal-url-class-loader
Applying instrumentation: undertow
Applying instrumentation: logback-appender
Applying instrumentation: logback-mdc
Applying instrumentation: executors
Applying instrumentation: hikaricp
Applying instrumentation: jdbc
Applying instrumentation: java-util-logging
Applying instrumentation: internal-class-loader
It looks like we'll get some data from jdbc and hikaricp, so camelo probably uses a database. undertow is an HTTP server, and we've already seen that we're getting logs from something -- probably logback. Since these things are instrumented now, we can expect to pick up logs, traces and/or metrics from them. Cool!
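As an aside, if one of these instrumentations ever turns out to be too noisy, the agent lets you switch them off individually; the exact key names are listed in the agent's documentation on suppressing instrumentation. A sketch, assuming we wanted to drop the jdbc instrumentation:

java \
  -javaagent:opentelemetry-javaagent.jar \
  -Dotel.instrumentation.jdbc.enabled=false \
  -jar camelo.jar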
What is this tracing thing we keep hearing about?
If we modify the OpenTelemetry collector configuration, we can pick up only traces instead, so we can figure out what they are. We'll do that by setting the exporters in the logs: section of the configuration to [nop] and in the traces: section to [debug], then taking the collector down with docker compose down and running docker compose up again.
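Only the pipelines section of the configuration needs to change; it ends up looking like this:

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      exporters: [ debug ]
    metrics:
      receivers: [ otlp ]
      exporters: [ nop ]
    logs:
      receivers: [ otlp ]
      exporters: [ nop ]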
To receive any trace data, some event must start a trace. Hopefully, the undertow
instrumentation will take care of doing that for us, so let's try making a request
to the application by running:
curl http://localhost:8080
Givsgud!
Place orders at http://localhost:8080/order/:n (i.e. http://localhost:8080/order/3)
Check order inventory at http://localhost:8080/orders
Huh, what a strange message. Looks like this is some sort of system for placing orders? But look, the collector picked up something!
otel-collector | InstrumentationScope io.opentelemetry.undertow-1.4 2.12.0-alpha
otel-collector | Span #0
otel-collector | Trace ID : a7d3845d1c03bb597ac49df0d5efa035
otel-collector | Parent ID :
otel-collector | ID : 84a979344483557e
otel-collector | Name : GET
otel-collector | Kind : Server
otel-collector | Start time : 2025-02-10 16:58:20.907362 +0000 UTC
otel-collector | End time : 2025-02-10 16:58:20.90890325 +0000 UTC
otel-collector | Status code : Unset
otel-collector | Status message :
otel-collector | Attributes:
otel-collector | -> thread.id: Int(47)
otel-collector | -> http.request.method: Str(GET)
otel-collector | -> http.response.status_code: Int(200)
otel-collector | -> url.path: Str(/)
otel-collector | -> server.address: Str(localhost)
otel-collector | -> client.address: Str(127.0.0.1)
otel-collector | -> server.port: Int(8080)
otel-collector | -> network.peer.address: Str(127.0.0.1)
otel-collector | -> url.scheme: Str(http)
otel-collector | -> thread.name: Str(XNIO-1 I/O-5)
otel-collector | -> network.protocol.version: Str(1.1)
otel-collector | -> user_agent.original: Str(curl/8.7.1)
otel-collector | -> network.peer.port: Int(60690)
otel-collector | {"kind": "exporter", "data_type": "traces", "name": "debug"}
Notice how there's a Trace ID now. Any other span created within the same trace, for example when an HTTP request is made to a different system, will carry the same Trace ID, and the Span ID of its parent span will appear as its Parent ID. The OpenTelemetry agent will also instrument any HTTP clients it can find, so that the trace context can propagate correctly to other systems that may also send trace information. This is super useful for some kinds of architectures (looking at you, microservices).
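Concretely, the Java agent propagates this context in the W3C Trace Context format by default, so an outgoing HTTP request made while handling this one would carry a header along these lines (illustrated with the trace and span IDs from the output above; the last field is the sampling flag):

traceparent: 00-a7d3845d1c03bb597ac49df0d5efa035-84a979344483557e-01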
Traces are a lot like structured logs that allow nesting in a parent/child relationship. There's a lot of structured information associated with our trace: we can see where the client came from, the user agent string and the thread ids.
This is a good time to note that the debug exporter is very helpful if you want to use the collector to process incoming data, for example to remove attributes that could contain personally identifiable information -- maybe the client.address in this case?
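As a rough sketch, a processor that drops that attribute before export could look something like the following. It uses the attributes processor, which may or may not be included in the collector distribution you're running (the contrib distribution ships it), so treat it as a starting point rather than a drop-in config:

processors:
  attributes/scrub-pii:
    actions:
      - key: client.address
        action: delete

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      processors: [ attributes/scrub-pii ]
      exporters: [ debug ]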
Checking out metrics
We've checked two of the three kinds of data that the OpenTelemetry agent can pick up for us. Let's check if there's any kind of metric data available by editing the collector configuration again, this time setting the traces exporters to [nop] and the metrics exporters to [debug]; the resulting pipelines section is sketched below. After a docker compose down followed by docker compose up, we wait and see if we get something...
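For reference, here is the pipelines section with only metrics routed to the debug exporter:

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      exporters: [ nop ]
    metrics:
      receivers: [ otlp ]
      exporters: [ debug ]
    logs:
      receivers: [ otlp ]
      exporters: [ nop ]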
...
otel-collector | Metric #6
otel-collector | Descriptor:
otel-collector | -> Name: db.client.connections.create_time
otel-collector | -> Description: The time it took to create a new connection.
otel-collector | -> Unit: ms
otel-collector | -> DataType: Histogram
otel-collector | -> AggregationTemporality: Cumulative
otel-collector | HistogramDataPoints #0
otel-collector | Data point attributes:
otel-collector | -> pool.name: Str(HikariPool-1)
otel-collector | StartTimestamp: 2025-02-10 16:55:20.38978 +0000 UTC
otel-collector | Timestamp: 2025-02-10 17:11:56.454185 +0000 UTC
otel-collector | Count: 4
otel-collector | Sum: 2.000000
otel-collector | Min: 0.000000
otel-collector | Max: 1.000000
...
Oh crap, it's actually making database connections to something. Let's hope it's not production! Maybe it's best to stop it before we break something...
Is auto-instrumentation without code supported for other platforms?
Yes! If you want to get started with experimenting, there's an interesting collection of links to check out at zero code instrumentation.
At the time of writing, these runtimes have some sort of support for this:
- Go
- .NET
- Python
- PHP
- Java
- JavaScript (node.js)
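To give a flavour of what this looks like outside the JVM, here's roughly what the Python equivalent would be, sending to the same local collector (a sketch based on the OpenTelemetry Python zero-code docs; app.py stands in for whatever service you want to instrument):

pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install
OTEL_SERVICE_NAME=camelo-py \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
opentelemetry-instrument python app.py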
Now what?
In the next post of this series, we'll expand our docker-compose setup with an observability solution that we can use to visualize logs, traces and metrics, so we can figure out what camelo.jar is actually doing. Stay tuned!