The future of latency profiling in Go
Tue, Jul 18, 2017Note: This article contains non-finalized ideas; we may end up not implementing any of this but ideally we should do work towards the direction explained here.
Go is the language to write servers, Go is the language to write microservices. Yet, we haven’t done much in the past for latency analysis and observability/diagnostics of request/RPC performance.
GopherCon 2017 was an opportunity for me to discuss our roadmap for latency analysis. I have talked to many whose main job is to provide instrumentation solutions to the ecosystem. A few common problems have been pointed out by pretty much everyone I talked to:
- Instrumentation requires manual labor. Go code cannot be auto-instrumented by intercepting calls.
- Lack of a standard library package; third party packages cannot provide out-of-the-box instrumentation without external dependencies.
- Dropped traces; libraries don’t know how to propagate traces to the outside world. We need a context.Contextkey to propagate traces and be able to discover the current trace by looking into the incoming context.
- Lack of standard library support; e.g. packages like database/sqlcan be instrumented to create spans for each ExecContext if the given context has already has a trace ID.
- Lack of diagnostics data available from runtime per trace ID. It would be ideal to be able to record runtime events (e.g. scheduling events) with trace IDs and then pull them to further investigate low-level runtime events happened in the lifetime of a request.
Apart from the Go-specific issues, we often came back to the problem of the wild fragmentation in the tracing community and how the lack of the compatibility among tracing backends damage the possibility of establishing more in the library space. There is not much we can do beyond the boundaries of Go other than advocating for a requirement to fix fragmentation which I already personally do.
Instrumentation
We are currently not interested to solve (1) in a fashion other languages do by providing primitives that can intercept every call. Initially there is a lot to be done by creating a common instrumentation library and putting manual spans in place. A common instrumentation layer also solves the problems explained at (2) and (3).
To address these items, we will propose a package with a trace context representation, FromContext/NewContext to propagate trace context via context.Context, and a small API to create/end/annotate spans.
Users will be able to start and stop trace collection in a Go program dynamically; and export collected trace data.
Users will need to write transformation code if they would like to follow an existing distributed trace (e.g. an existing Zipkin trace propagated via an incoming HTTP request’s header).
Once we establish a package, we can revise the standard library packages to see where we can inject out-of-the-box instrumentation. Some existing ideas:
- database/sql: A span can be created for each ExecContext and finished when exec is completed to measure latency.
- os/exec: A span can be created for CommandContext to measure command exec latency.
- net/http: http.Transport can create spans for outgoing requests.
The next steps for net/http is to be able to propagate traces via http.Request. Ideally we want http.Transport to inject the right trace context header to the outgoing requests and http.Handlers to extract trace contexts into req.Context. The wild fragmentation in the tracing backends don’t help us much here. Each backend requires a different encoding/decoding to serialize/deserialize trace contexts and different HTTP headers to put them in place. There is an ongoing effort to unify things in this area and we will wait for it rather than trying to meet the backend-specific requirements.
There is also an experimental work to annotate runtime events recorded by the execution tracer with trace IDs, which will address the basic requirements of (5). If you need to collect more precise data on what else is happening in the lifetime of a trace, you will be optionally record runtime events and attach them to the current trace.
Visualization
Nothing has been planned so far to visualize the per-node data. We expect the exported data can be transformed into the data format of the user’s existing distributed tracing backend and visualized there. For those who are looking for a local env setup, I suggest Zipkin given it is very easy to run it locally as a standalone service. I am in also favor of maintaining high-quality transformation drivers for Zipkin or OpenTracing somewhere outside of the standard lib.
Conclusion
We have clearer idea what we want to achieve in the scope of Go for latency profiling. The next steps are converting these ideas into proposals and discuss them with the broader Go community, give feedback to the tracing community for the standardization efforts, and create awareness of these concepts and tools.