In conversation with Charity Majors, CTO of Honeycomb
In this episode of Tractable, Kshitij Grover is joined by Charity Majors, CTO of Honeycomb, to discuss Observability 1.0 and 2.0 and the critical shift from fragmented data sources to a unified source of truth for more efficient debugging and problem-solving. From identifying signs of organizations entrenched in '1.0 thinking' to the need for shorter feedback loops, Charity examines the key factors driving the evolution of observability tools and practices.
Kshitij Grover [00:00:05]:
Welcome to another episode of Tractable. I'm your host, Kshitij, cofounder and CTO here at Orb. And today, I'm really excited to have Charity on the podcast. Charity is the CTO of Honeycomb, and Honeycomb is a platform that helps you understand your application in production, sometimes called observability, but I'm sure we'll dive into that today. Honeycomb is used by orgs like Vanguard, Slack, and Intercom. And, as many of you all know, Charity also writes and shares a bunch around engineering management and organizational design in addition to technical topics, so we'll have a ton to talk about. Charity, welcome.
Charity Majors [00:00:38]:
Thank you for having me. We've been trying to make this work for a while now.
Kshitij Grover [00:00:43]:
Yes. I wanna get into the meat of it. I know you have a ton of really well-formed thoughts at Honeycomb around observability, or what the industry at large calls observability. Let's just start with a core thesis that you all have, which is this difference between Observability 1.0 and 2.0. So maybe you can give some context around what that means, and then we can dive into what the precise differences are.
Charity Majors [00:01:08]:
So when we started talking about observability back in 2016, it was a way of differentiating between what we were trying to build and what the rest of the world was doing, which was very rooted in metrics, which have no context, and the monitoring sort of approach, which works really well for systems that fail in predictable ways, which increasingly is not true of the systems that we're working on. And so at this point, I would say Observability 1.0 refers to tools that have many sources of truth. There's a famous definition that observability has 3 pillars: metrics, logs, and traces. Most people are paying for way more than 3 pillars. They've usually got RUM tools and APM tools and logging tools. Maybe they've got structured and unstructured logs, and they've got their SLOs, and all of these tools are not connected.
Charity Majors [00:01:53]:
The only thing that connects them is the engineer who sits in the middle visually going, that spike looks like that spike, and maybe copy-pasting IDs around from tool to tool. But you've got many sources of truth, nothing ties them together, which means that you have to rely a lot on guessing and intuition and, like, past experience when debugging systems. And Observability 2.0, it's a single source of truth: arbitrarily wide structured data blobs, logs, events, whatever you wanna call them. But because there's one source of truth with all the shared context, you can derive metrics from them. You can derive traces from them by visualizing them over time. But this is all connected, and you as an engineer can go, here's my SLO. It's burning down at this rate.
Charity Majors [00:02:36]:
Here are the events that are violating the SLO. Here's how they're different from the events that are not violating the SLO. Like, I was just talking to a customer of ours who's just started rolling out front-end observability, and this weeks-long process of identifying latency problems and trying to repro them, literally weeks, has been reduced down to minutes. Because when that connective tissue is there, you can just ask the question. You can just see exactly what's happening. So much of debugging boils down to, here's the thing I care about. I might not know why I care about it.
Charity Majors [00:03:08]:
And in fact, most of the process of debugging is figuring out what's different about this thing that I care about versus everything else that I don't care about.
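[Editor's note: to make the "arbitrarily wide structured event" idea concrete, here is a minimal sketch. The field names, thresholds, and service names are purely illustrative assumptions, not Honeycomb's schema or SDK; the point is that one context-rich blob per request becomes the single source of truth, and metrics or SLO checks are derived from those events afterwards.]

```python
import json
import time

def handle_request(user_id, endpoint):
    # One wide, structured event per request, carrying all the context
    # (IDs, timing, flags, errors) together so nothing has to be stitched
    # across separate tools later. Field names here are hypothetical.
    event = {
        "timestamp": time.time(),
        "service": "checkout-api",
        "endpoint": endpoint,
        "user_id": user_id,
        "trace_id": "abc123",      # ties this event to a trace
        "duration_ms": 187.4,
        "db_query_count": 12,
        "feature_flag.new_cart": True,
        "error": None,
    }
    # Emit the whole event as one structured log line.
    print(json.dumps(event))
    return event

# Metrics and SLO checks are *derived* from the same events,
# rather than stored as separate, disconnected time series.
events = [handle_request("u1", "/cart"), handle_request("u2", "/cart")]
slow = [e for e in events if e["duration_ms"] > 150]  # events violating a 150ms latency SLO
print(f"{len(slow)}/{len(events)} requests over the 150ms threshold")
```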
Kshitij Grover [00:03:18]:
Yeah, that's interesting. I'm wondering, like, and you mentioned this a little bit, but what are some signs on the ground? So let's say I'm an IC engineer. What are some signs that my org is stuck in this, like, 1.0 style of thinking? You know, no matter what tool I use, like, what does that look like tactically?
Charity Majors [00:03:34]:
The huge difference between 1.0 and 2.0 is also that 1.0 is very much about operating your code, and it's intrinsically reactive. Right? And 2.0 is very much about how you develop your code. There's so much of what I think of as just, like, dark matter in software engineering. It's like, it doesn't seem like we should be moving this slow. Why are we moving so slow? You can't really put your finger on it because you can't see it. You know? And, like, when you have an Observability 2.0 mindset and toolkit, you can see where that time is going. You can see exactly what's happening.
Charity Majors [00:04:06]:
And this is something, it's Plato's allegory of the cave, like trying to explain to a blind person who's lived in a cave their whole life what it's like outside. There's a bit of, look, you almost have to take it on faith that it is this different. Because vendors have been lying through their, well, let's not say lying, but, like, exaggerating the impact of what they sell you from day 1. Right? And so there are very few things in software engineering that actually have this kind of outsized impact. And in my experience, good observability tooling is one of them. Because as managers, directors especially, you're always looking at the feedback loops inside your organization. Because these feedback loops amplify each other. Right? And so if it takes you 15 minutes from the time that you write the code to deploy the code, versus 2 hours...
Full transcript here.