Lorin Hochstein (Netflix)

July 27, 2021

Today’s conversation is about resilience, and as today’s guest, Lorin Hochstein, notes, “Resilience is about the stuff that isn’t visible through the metrics.” Lorin is a senior software engineer at Netflix who is on a mission to improve the company’s engineering department by creating a culture in which peer-to-peer learning and reflecting on past mistakes are foundational. Lorin is responsible for developing a few grassroots programs at Netflix that address the company’s lack of deliberate knowledge sharing, which he talks about today. We also discuss the value of close calls as opposed to incidents, and how Lorin works around the challenge of measuring negative outcomes that didn’t occur. Although he makes sure to point out that he does not bear a “staff” title (Netflix does not have them), he is certainly doing some interesting staff-type work, and his passion for value creation is inspiring.

Transcript

Note: This transcript was generated using automated transcription and may contain errors.

David: Welcome to the Staff Eng podcast where we interview software engineers who have progressed beyond the career level, into staff levels and beyond. We’re interested in the areas of work that set staff-plus level engineers apart from other individual contributors. Things like setting technical direction, mentorship and sponsorship, providing engineering perspective to the org, etc. My name is David Noël-Romas and I’m joined by my co-host, Alex Kessinger. We’re both staff-plus engineers who have been working in software for over a decade. Alex, please tell us a bit about today’s guest.

Alex: Yeah, Lorin Hochstein is a senior software engineer at Netflix, where he works on the Managed Delivery team. And you might also know him from Twitter as @norootcause. I’m excited to share this episode with you all because we touch on the topics of resilience and reliability, which I really think underscore probably most staff engineering roles. So let’s dive in.

David: All right, Lorin, thank you so much for taking the time to join us today. I’m looking forward to chatting with you. Could you please start by just sort of telling us who you are and what you do?

Lorin: Sure. So, I’m Lorin Hochstein. I work at Netflix. I feel a little bit like a fraud here because I am not a staff engineer. I am a senior software engineer at Netflix, because Netflix doesn’t really have levels like that. Pretty much everyone, almost all the ICs, are seniors. So I work in the delivery engineering part of Netflix these days on a project called Managed Delivery. So we’re working on a sort of declarative delivery system that is easier to reason about than traditional pipeline stuff. So there’s a lot of interesting problems around automation and a system that’s automatically doing things that I found really interesting. And so I kind of convinced that team to take me on. I like to jump around a lot. It’s my third team at Netflix now, since I’ve been here for about six years.

Alex: Cool. Yeah, and we totally understand. I think that the title is not as important as, like, you know, the work that you do. So really don’t worry about it. I think we’re going to talk about some really interesting things. So I’m curious, you know, regardless of sort of titles, you know, what do senior software engineers do at Netflix? Is there a typical set of expectations or does everyone sort of have their own spin on the role?

Lorin: Yeah, so one of the really interesting things about Netflix is that historically they’ve only hired seniors, so there is not a mix of juniors and seniors on the teams. So everyone on the team is a senior, and also we do “you build it, you run it.” So everyone on the team who’s doing software development is also doing operations. So in a sense, kind of like everyone is expected to be a leader and do leadership stuff. And so in addition to doing, everyone does development, right? So I do, you know, definitely a lot of coding. Everyone does design work. Everyone does, you know, coordination with other teams, right, to do cross-functional kind of cat-herdy stuff. Right. Everyone has to do a little bit of cat herding, and so you can sort of choose how much of that you want to do. And it’s sort of up to you, based on your particular interest, where you’re going to spend your time, right. Like, are you going to spend your time thinking about, like, what is the long-term design stuff with this project that we’re working on, so that we don’t, you know, hit a whole lot of pain in a couple months because it’s too hard to change. Maybe you’re interested in, okay, I want to coordinate with this other team. You know, we say, what is it? “Loosely coupled, highly aligned,” right? Is the way we talk about it. Although, you know, often it’s loosely coupled, loosely aligned, right. When you’re loosely coupled, you’re sort of optimized for moving quickly individually, but not necessarily for alignment. That’s much harder to do. And so there are a lot of engineers now who work more around cat-herding kind of stuff, trying to get the teams to move in the same direction, trying to coordinate, like, all these different teams are doing different efforts and you want to make sure things are coherent. I work under the platform engineering org, so all our customers are internal Netflix engineers.
And historically it’s been a very sort of disjointed experience for them. Like, all the tools are totally different, and so there’s a bigger push now to make that more cohesive. But that means better coordination. And since there’s no architects or anything, it’s sort of tricky to move things forward and get big things done that way. I actually personally don’t do as much of the cat-herding kind of stuff. The stuff that I personally do that crosses teams is, like, me crossing teams. The way I think about it is I like to spread knowledge around by moving around the org. And one thing that I think we at Netflix have not been as good at as other companies is that historically people have not moved around as much. It’s gotten a lot better recently. But when I started there was no internal mechanism to let you move around. It was almost taboo a little bit. And now there’s an internal job site and it’s more of a thing. But that was definitely something that was quite different when I first got there.

Alex: Interesting. So it sounds like you talked a little bit about how you sort of approach your job maybe a little bit differently than necessarily the broad expectations. Is there anything else that you feel like you do that’s special to the way that you practice being a senior software engineer?

Lorin: Yeah, so I mean, I’m personally kind of interested in grassroots-y kind of stuff, like bottom-up things, working with the engineers, not necessarily on larger initiatives, but improving things. So I, for example, run Systems Reading, which is a paper reading group inside of Netflix where people get together and talk about interesting papers. It’s funny, when I got there I saw there was a group that had existed at one time, there was like a Google group, but it had lain fallow, like everyone involved had gone and it had stopped. And so we started that up again. I did something called, I tried to do something called Oops, where I get people to talk about sort of near misses that have happened inside of Netflix. Not just the big incidents, but the stuff that didn’t necessarily have customer impact. But there’s interesting things to learn from there. So that’s another example. I brought in Hillel Wayne. He teaches on TLA+. One of the first workshops he did was at Netflix. I brought him in there. So there’s interest in that. So these are things that are not like we’re going to move a big rock up a hill to accomplish this, but it’s trying to kind of upskill the engineers inside the organization.

David: This is a really interesting aspect of what I would sort of think of as staff-level work. And obviously at Netflix, that label doesn’t exist, but this idea of knowledge sharing, one of the challenges that I think a lot of organizations face is that they fail to incentivize that properly or fail to reward it properly. First of all, would you agree with that? And do you think that Netflix does things better in that regard? Is there sort of, like, you know, beyond sort of thinking it’s the right thing to do, are there any incentives nudging you in the direction of sort of helping the folks around you level up?

Lorin: Yeah. So this is another area where I think Netflix has gotten better in the past few years. When I first started, the expectation was we are going to hire seniors and we’re just sort of going to assume that you’re going to be at that level, and we’re not really going to invest in upskilling you. We’re hiring you to be high skilled. Now there is a developer education org. So it’s changed over time and now there’s more investment in, you know, I would say, like, improving the education, the skills of the people inside the org. A lot of that is around, like, you know, sort of classes and training-y kind of stuff. But the stuff that I’m more interested in is learning from other people inside the organization. Right. I find I have always personally learned best by, like, you know, looking over someone else’s shoulder, working next to someone who’s really, really good. And, I mean, that happens organically on teams, but if you’re not deliberate about sharing that, it doesn’t happen. And I don’t think there’s a huge organizational push for that. That’s just sort of something I’m trying to push from the bottom. I mean, one of the challenges is that there’s not enough time to do anything. Everyone everywhere has more work than they have capacity. You always have issues. So it’s always hard to make space for stuff that does not obviously have a near-term impact. And so spending the resources to do stuff like that is hard to justify. And so I find you sort of kind of have to do it on the side a little bit. Right. Like, one of my motivations for doing the Oops work, for getting people to do write-ups of near misses, is because I want them to teach other people how to deal with operational stuff. So one thing I do with my team... so before I was on this team, actually, I started at Netflix on what’s called the Chaos Team. I applied from the website, I was like, oh, I’ve heard of Chaos Monkey, that’s really cool.
And so I got on the team and I thought that was really interesting. And what happened, though, when I was on that team building tools that were intentionally causing failures in production, is that I found that the real failures are more interesting than the sort of synthetic ones we were injecting. And I just sort of got sucked into that world of incidents. And I’d always look over at the incident management team, which was our sister team, and I was like, oh, that’s really cool. And I ended up moving on to that team because I just wanted to spend all my time studying incidents. And those folks are super good at operations. That’s all they do. Then I did that for about a year. I was on the incident management team, and I’m like, okay, I want to be a regular software engineer again. And then I came onto this team, and this team did not have as much operational expertise as I did, because I had been on that other team for a year. So what I did on my team is I run a meeting called This Week in Managed Delivery Operations, where we talk about all the things that have happened this week that are interesting operations-wise. And the goal of it is to have people talk through, okay, what did you see at this time? Okay, what were you thinking? Where did you look? And to try to teach people from the experiences of other people, to understand how they were debugging in the moment, which is not typically the way people think in terms of talking about what happened after an incident. So you have to kind of be deliberate about that. You have to have that as a goal, that you want people to walk through, to see through other people’s eyes, to learn from their experience. It’s hard to scale something like that. I’d say it’s working pretty well on my team. People have gotten much better. It’s a lot of fun. But the trick is to sort of infect the rest of the org so that people start doing that and sort of spread it that way.
But that’s not quite a process thing, but it’s sort of like a habit that has to be developed. Right. And so building better habits across an org is, I would say, sort of the kind of interesting staffish level work that I’m trying to do in some small way.

David: Yeah, I think that idea of really trying to change culture. So first of all, the only way you can change culture is by influence. Right. You can’t mandate a change to culture. Right. And so one of the things that Alex and I talk about a lot in the podcast and otherwise is that, like, the main distinction, or sort of the interesting distinction, between staff engineers and, like, more traditional types of leaders within organizations is that we’re explicitly expected to influence folks, but we’re not handed the authority to do it. Right. And that might seem like a handicap, but I think when you’re thinking about changing culture, it’s sort of the only way that you can do it. And so it’s fascinating that you sort of, like, intentionally set out, you’ve realized this area where, like, the business obviously would get a lot of value if you were able to change the culture toward one that approached operations differently. And actually, I think we’ll circle back to sort of what that culture would look like, because I have a lot of questions there. But assuming that such a culture exists, you’re trying to shift the culture in that direction, and the only option toward doing that is to influence other folks. Is this something that, like, is this an explicit strategy that you’ve outlined to management and they bought into it? Is it more sort of like, oh, Lorin’s off doing his thing and we sort of trust him, or, like, how does that situation work?

Alex: Yeah.

Lorin: So on my current team, it’s more the latter. More like, okay, Lorin’s off, like, doing this thing. On my first team, it was like that too. Like, my first team, it was like, okay, Lorin’s off doing these weird things, studying, you know, incidents, even though that’s not what he does. When I was on the second team, when I was on what’s called the CORE team, that was more explicit. That was more, okay, I’m going to be doing some resilience-type stuff. This is sort of my scope. I had to go on call, because like all the engineers on that team at the time, the only way I could really get on that team was to do incident response as well as the analysis. I didn’t want to do the response, but I’m like, all right, I’ll do it. The other option was to become a TPM. I didn’t want to do that. But there it was quite explicit. And we ended up hiring a couple of new, like, resilience engineers onto that team. And I worked on creating job descriptions for that and I was involved in hiring those folks. So there, it was a little more deliberate. And then I honestly sort of kind of got burnt out on that. So I did that, and after a year, I’m like, wow, this is really hard. And it was too much, I think, to do both, the on-call kind of work, to move back and forth. So one of the challenges I found at an organization where you have to work at different levels like we do, one day you’re coding and debugging and the next day you’re doing sort of larger-scope project stuff. I have a hard time moving up and down those levels. And on that team I had a hard time switching back and forth between doing incident response and then doing the sort of broader analysis. And then, okay, how do we look across a whole bunch of incidents and find themes, and what do we do with this? And so I was like, okay. The problem was I got what I wanted. It was the second time in my career that I went after something that was hard and I got it. And I was like, wow.
Actually, day to day, this is not really what I want to do. And so I went back to a more traditional role. But I still am very interested in that style. I sort of have to do it out of the corner of my eye, I think. I feel like I have to do the, like, higher impact stuff on the side rather than being the primary focus, because otherwise it’s just too much for me.

Alex: That’s a really interesting insight. When you felt like you recognized that the role that you had got wasn’t what you wanted, how did you go about having that conversation with your manager or your organization to transition to a different role?

Lorin: Yeah, so my manager at the time was really, really great. So he was the one, I would say, who was mostly responsible for me being able to move over to that team. At the time, when I moved over to that team, the CORE team, my manager was then an IC on that team, and he sort of sponsored me to come over. And then he became manager, and then he created space for these other roles and the more human factors stuff. But he was super easy to talk to, and he knew that I was getting stressed out. And I told him at one point that I couldn’t do both. I just told him at one point, I just can’t do both. And he took me off call for a while, and I just told him that I did not find myself being happy with that. But he was just, like, he’s great. We still get along really, really well. And he was just a very, very approachable person to talk to. And he’s like, okay, you want to switch teams? We’ll make it happen. So I was very fortunate. Yeah.

Alex: I think there’s a lot of people who listen to this who are staff, and they may not be exactly where they want to be or senior. And having those kinds of conversations could probably be incredibly stressful because you have to sort of acknowledge, maybe I don’t want to do the thing that I’m doing. But in my experience, and I know it’s not universal, when you actually just bring these things up and talk to your managers, they’re usually very compassionate about these kinds of things. So it’s good to hear more examples of that. Yeah.

Lorin: I mean, I would say that the hardest part of that is actually saying it out loud to someone when you are at the point where you’re like, actually, this isn’t really what I want to do. You might think that and feel that, but saying it out loud to someone is extremely cathartic. And especially saying it out loud to your manager is a huge thing. Yeah. I would say I see a lot of people at Netflix switch back and forth from IC to manager and then back, because they think they want to do something, they think they want to try that path. And you go there and you’re like, well, actually, no, this isn’t a good fit for me. And many of them just oscillate back to IC again.

Alex: Nice. I wanted to talk a little bit about the work that you’re doing around resiliency. I thought a really interesting example that you brought up was the Oops group or the Oops talks. I’m curious about that because at a lot of places I’ve worked, if an incident didn’t happen, people would have been like, great, we did our job and we didn’t have an incident. And so do you think you could explain a little bit, like, why an Oops or a close call is almost as important to learn from as an incident might be?

Lorin: Yeah. So, I mean, it depends on what you’re trying to get out of it. Right. So to me, I think of incidents as a way of understanding how the system actually works. So one of the challenges is we all work in these sort of huge machines and we all only see these little tiny parts of it, like our own part, right? And when something unexpected happens, when an operational surprise happens, something happened in the system that somebody didn’t expect, there was something we didn’t know about the system, and that’s usually really interesting. And it’s very often an interaction between two parts. We all have our own parts, and these things sort of fit together and we don’t realize that something weird is going to happen. And even if there’s no customer impact, you can still learn just as much about this thing about your system you didn’t know from those sort of close calls. And the other thing is that I am interested in things like, okay, is there something confusing about a control interface, like an operator interface? And you can still learn from those about that, and you can still deal with problems like that. Or just watching, I mean, my favorite is watching experts in action. And with the close calls, typically there’s an expert that caught something early. So I want to be able to learn from their experience. And so if I can get them to capture that experience and I can read over it, then I can learn from that. There was one guy on a team who, this always blows my mind. So there’s this service at Netflix and it’s Java-based, and you could actually run a REPL on it. And it basically runs a lot of jars that people want. And he connected to it and ran a REPL and was querying the internal state of it to see that it had gotten into a bad state. And that just sort of blew my mind that you could do that. The usual thought is, like, you can’t do a REPL in production. That’s nuts. But of course, the Rails people do that all the time, right? So, yeah.
So if you’re interested in learning in particular about how experts do things, I think that close calls are great or maybe even better. The challenge, once again, is like making space for that. It takes time to do that. I mean, we get very few people doing it and even me, I try to do them when they happen. We have operational surprises on my team and sometimes I get halfway through and I’m like, oh, I’m too busy, I’m not going to finish this up. And I have several. I have many half finished oopses that I just never ended up publishing, which I feel bad about. And then the real irony, the scary thing, is that I’ll hear something and I’ll go talk to someone and say, hey, I saw this surprise happen. Can you write it up? And the person’s like, no, I’m totally underwater, I can’t. And I’m like, well, actually. Actually, that’s really dangerous. The oopses that don’t get written up are on the teams that are running too close to the margin. And so the places where we have the least signal are the ones where the most danger is. And that’s kind of scary. And so one thing that I’ve always been really, really interested in is how do we collect those kinds of signals that we don’t usually see about teams that are running into trouble so that we can act early on them? Yeah.

Alex: The thing that I’m struck by is that the value of a near miss is, like, it’s easier to talk about because you didn’t cause an incident. Right. And so people are, I think, more open to the idea of talking about it, which is always nice. Do you feel like these things that you have done, are they influencing, you know, the organization you work in in a positive manner?

Lorin: Yeah. So I think it’s very small scale. So, like, I sort of am able to infect people in different parts of the organization, right. Like, I think if you step back, you probably won’t see that much impact. It’s hard to see. And honestly, sometimes I don’t even really know, but I think you can find, like, little clusters, right? And it sort of starts to spread around that way. Like putting a drop of ink in the water. And I have found that tends to be the most effective way to make these sort of changes. And this is, like, well known, right. You need a champion, right. So the only way to really get a change to happen is to have a champion who’s pushing it. And so if you can build champions, then you can sort of orchestrate change that way. And I guess I’m trying to get people excited and interested in this sort of thing. Like, the people who write up the oopsies are the people who start to get really into it, right. Like, that is their self-motivator. They’re like, oh, this is really cool. I like reading about these. I want to write them up myself. And, like, I have an oopsies channel in Slack, about like, hey, look at this cool thing that happened here. And so, I mean, I don’t know, maybe there’s very little impact. I mean, it’s very easy to say, look, I don’t really see anything. But I’m hoping that as I sort of infect people, it sort of spreads that way.

Alex: Nice. Do you feel like you could name sort of like the cultural value that you’re hoping to spread throughout the company?

Lorin: Yeah. So, I don’t know if I’d phrase it as a value. Like, I’m trying to think how to articulate it.

Alex: There’s definitely a notion of value. Whatever you. I’m not so worried about the specific verbiage.

Lorin: Sure. So I’m very interested in distributing operational expertise. Right. So, I mean, operational in particular, that’s my personal interest. But expertise. So basically, at every organization there are people who are really, really good at stuff, right? And I’m sure you both can name people in your orgs you’ve worked with that are really good. And my question is always, like, how do we leverage those people in a way to bring everyone else up? And so that is sort of the value that I push the hardest on, that I’m most interested in: how do you take people that are good and make them better, by leveraging the people in the org who are better and spreading their skill around? We’re good as a society, I would say, at training people up from novice to, like, intermediate, but going beyond that is different. It’s not like training. Right. The learning is different and it’s more experiential. And so how do you sort of scale up people’s experience, scale up their expertise? That, to me, is the kind of grand challenge of improving engineering in an organization. Yeah.

Alex: Do you feel like this sort of blocker to going from intermediate to expert is that, like, complexity is growing at such a rate and our ability to build capacity is probably what moves us into the expert level. But, like, building capacity into being an expert is such a mysterious thing at this point because the complexity is so high. Do you think that that’s like maybe one of the big blockers to sort of leveling up expertise in our modern. Especially when we work in tech and we work in distributed systems and that kind of stuff?

Lorin: So, interestingly, I don’t think so. Complexity is definitely an issue, and we all face that all the time. We are overwhelmed with the amount of complexity, but that’s always a problem, and the systems are always too complex for us to really get a true handle on. I think the primary obstacle to upskilling is the carving out time for reflection. Right. The way you get better from your experiences, the way you leverage experiences, either your own or someone else’s, is by reflecting on them, by spending that time to look back. And when you’re stretched, when you don’t have time to think about it, then you don’t have an opportunity to actually make the most of those experiences and get better. And so that, to me is the hardest part. So there’s like, capacity in that sense is carving out the time to look back and understand what happened. So as an example, once an organization reaches a certain size, migrations are going to be happening all the time at some point. It’s not like, are you doing a migration? It’s like, how many migrations are happening and you have to get good at. Every organization has to have. Once it reaches a certain level, doing migrations well has to become a core competency. And I don’t know about you, but in my experience, many times the migrations are very painful. But I found it extremely rare for people to reflect and say, okay, those migrations, what happened? What did we think was going to happen? How did it actually go? What did we learn from them? Usually it’s like, okay, it’s done, let’s forget about it and move on. And I think this is my pet theory, is one of the reasons we don’t really get better over time, even though we think we will. Okay, last time that was more terrible, but this time it’ll be better, is that we don’t spend that effort to learn as much as we can from the previous migrations so that in the future we can design our systems to make the next one easier. And I just see this happening again and again. 
And I have on my list of things to do. I would love to go back at Netflix and treat as case studies the various migrations that we’ve done to understand what can we learn from them. But it hasn’t happened. I haven’t carved that time out, and that would be an interesting role. But, like, and I don’t know, I mean, I don’t know if you two have had experience with that, like, looking back at migrations, but, you know, I have to say, I haven’t really seen it happen very much.

David: Yeah, I think that’s a good point. I think, broadly speaking, sort of retroactively analyzing anything is hard to do in our organizations, right? They’re trying to move forward so quickly. And I know that I kind of harped on this already a little bit earlier, but I’m tempted to go back to it because now I sort of understand a bit more about the changes that you’re trying to drive. For myself, and I think probably for a lot of people listening, like, you’re preaching to the converted, right? It’s like, yes, let’s make more time for this stuff. And I think the sort of refrain, or the sort of, like, hesitation that I certainly feel and that I think a lot of other people feel is like, sure, but, like, how do I justify that to management? Right? And so going back to that question of, like, what’s the story that you tell? Right. To a certain extent, you can just kind of do stuff, right? I’ve been there, done that. Don’t ask for permission, schedule the retro meeting, whatever it takes. Right? But, like, you know, it sounds like this has become a pretty big part of your job, and after a point, someone’s going to ask, all right, Lorin, like, you know, write your performance evaluation for the half or whatever. Right? And it’s like, what goes in there?

Lorin: Yeah. So we don’t have performance evaluations.

David: Oh, awesome.

Lorin: Right. Which is kind of wild. Which is actually one of the things I like about the org. But of course, you need to get resources, right? Like, it’s one thing to, on my own, do things on the side. But it’s another thing to say, okay, now I want to spin up a team to do this, and then it’s going to be like, well, are we going to get an ROI on this? Is it worth it? And honestly, I have not been super successful at that, to be honest with you. But here’s my sort of general thoughts on that. And it’s funny because Netflix, at least in my org in platform, has not been as, I don’t know, explicit about thinking in terms of, okay, how much progress have we made on certain things? And now we’re doing more OKR-ish kind of stuff. So I would say in the future it’s going to be even harder to justify. I was fortunate that I convinced my manager that this stuff was important. They bought into it, and my manager’s skip level at the time was also into it. And so you had champions throughout the hierarchy. And this is one of the things with resilience: you have to be able to justify doing things even if you can’t show a metric for it, that this is the right thing to do.

David: That’s one of the worst things, right, because the metric is basically that bad things don’t happen, and the action that you’re trying to take is cultural change. So it’s a very slow change where the feedback loop is going to be that nothing happened. It’s really difficult to measure.

Lorin: Right? Yeah. I can’t give you a count of the number of incidents that didn’t happen. That’s the metric that I would like, but I can’t. And so you kind of have to infect management. And so the question is like, how do you do that? Right. And so one thing that I was doing right before I switched teams, and unfortunately didn’t finish because of COVID and stuff, was like, it’s one thing to look at individual incidents and go into a lot of detail, but I was doing some work with some peers, including Ryan Kitchens, who’s still on the team, looking at, okay, let’s look across the incidents that happened this year. And not like metrics-wise and buckets, but like, what are themes that we can see, because we did more qualitative analysis on the incidents. Can we look at patterns? Okay, here’s something that somebody didn’t know. One huge problem that you’re going to see again and again and again is that there is some missing bit of shared context. Like this person didn’t know X and this person didn’t know Y. And in an organization, this is also the hardest problem to solve: getting the information into the heads of the people who need it, the right information. And Netflix is the opposite of Apple. It’s like super open in terms of information. But that means that you could spend full time just reading docs and do nothing else, and you still wouldn’t get all the information and you would get no work done. Right? And so it’s not just an access thing. It is like, how do you figure out what the important bits are? And that is really, really hard. But it’s a critical factor that comes up again and again and again. And one of the other challenges here is that this sort of approach is good at finding problems, but not necessarily solutions.
You’re going to sort of try different things, but if you can provide insights to management about stuff that they wouldn’t see otherwise, I think that is how you show that there’s value. Look at this thing. Look, I saw that this team is starting to go underwater, and if we don’t do something, then three people are going to leave and they’re going to get burnt out. If you can provide those insights and you say, look, this is how I know this, and it’s qualitative analysis, then I think you can make an argument for more resources to do that. You’ve got to provide the insights. And there’s a famous quote by, I think, Danny Kahneman, the psychology researcher. He says that no one ever made a decision based on a number. They need a story. And what we do, the resilience stuff, it’s all stories. And so if you can tell a good story about why this stuff is valuable, then, I hope, you can argue for it. But I mean, to be honest, very few orgs are able to justify this. And it’s hard. And I would not say I’ve cracked this nut yet. And management can change and that’s it. And the whole thing changes and you lose it. And so it’s very, I would say, fragile and precarious and very contingent on the particular details of your org. You can kind of do what you can to foster this sort of, I don’t know, qualitative analysis of what’s going on, but it’s easy to lose.

David: So maybe going back five years or so, every engineer that you asked would agree that developer productivity is important, and your ability to deploy changes quickly to production is important, and your ability to have automated test coverage is important. All these things, engineers broadly agreed on. And managers who came up as engineers probably agreed as well, but, like, they didn’t have a way of quantifying it. And then, you know, the main change that I think happened in that arena is when Accelerate was published by Gene Kim and Jez Humble and Nicole Forsgren. And, you know, they sort of coalesced around these four key metrics and they tried to support that: delivery lead time, deployment frequency, mean time to restore, and change fail percentage, as sort of the gold standard by which all developer productivity can be judged. And I don’t think it actually made a difference on the ground. Engineers always knew that stuff was important and they continued to know that stuff was important. But I think it made a difference to management, because now people could point and say, hey, like, here’s the rationale, right? These are now our metrics for the org. And you know, you guys have to judge us based on that, basically. Do you think there’s sort of an analogous thing that’s possible for resilience engineering? And do you think that’s coming?

Lorin: Yeah. So I think the real challenge for resilience engineering is to tell management that you cannot get away with relying on a small number of metrics to do these sorts of things. That’s the key thing, and it’s really hard. So the appeal of metrics, and I totally think you’re absolutely right: the findings that Dr. Forsgren published and wrote up in Accelerate, any of those things, if you talk to engineers, they would say, yeah, these are important. We knew this. No one says, oh, no, I don’t care how long it takes to deploy, I don’t mind waiting an extra two hours or a day. This was known. But it is very tempting for leadership, which is trying to oversee an organization that they can’t see much of. I don’t know about you, but my manager doesn’t know what I do during the day. They have no visibility. It’s very, very hard to manage something where you just can’t see what’s going on. And so metrics give them visibility. You can say, okay, how are we doing? What’s our MTTR look like? How’s the trend? What’s the time between when you commit and when it actually goes out to production? But resilience is about the fact that the interesting stuff is, well, I don’t know if it’s about the fact, but a big part of it, I would say, at least from my perspective, is that it’s the stuff that you can’t see that way. It’s the stuff that is not visible through the metrics. It is like the workarounds that people are doing to get those metrics up, but they’re actually taking additional risks because of that. What are we sacrificing to improve those metrics? So there’s all these signals. So you could say, okay, do a huge number of metrics, but that’s not practical for leadership, because if you give them 1000 metrics, what are they going to do with that? So the challenge is, how do leaders get signals about what’s important, what’s dangerous? So what I worry about is when the metrics are fine, but there’s a danger.
So the thing with the metrics, if the metrics are bad, that usually means there’s a problem. If the metrics are fine, there can be a problem, but you don’t see it. And that’s what I worry about the most: the metrics are fine, they’re stable, but there’s a problem and there’s a risk. And we don’t see it happening because people are putting off this tech debt or whatever, or sacrificing some operational stuff. And so the challenge for leadership is, okay, how does leadership get better at collecting those kinds of signals from the organization in ways that are not easily visible? And that is really, really hard. And that is a very tough sell because leaders are already completely squeezed, the same way line managers are, the same way we are. Directors, everyone up the chain is stretched to capacity. And to tell them, okay, I’m going to make your life harder. You’re going to have to work harder to figure out new ways to collect information that you didn’t see before. I’m going to write qualitative reports that are like 50 pages or something, or 30 pages, which I’ve done, rather than give you a graph that shows you our products. It’s, wah, forget it. That is a really tough sell. And I’m not a manager. It’s a very difficult thing. I’m very comfortable as an IC. But I think that is the pitch we have to make. And that is a very, very difficult pitch. And if you look historically at, I would say, trends around management, they are usually like, here is a process that will make this tractable. Whereas we are saying, look, you just have to become an expert and you have to build these muscles and figure out how to talk to people and listen and get information from different sources. And it’s a much tougher sell. And I don’t exactly know how to sell them. We sort of have to, I’m hoping, actually. So you mentioned stuff comes up from engineering management.
I’m hoping that new generations of ICs who learn about the resilience engineering kind of stuff, when they become managers, will have these perspectives. But you’re talking about generational change. This is like a progress one funeral at a time kind of thing. It’s like maybe multi-generational.

Alex: One thing I’m struck by is I often have the experience I think that you’re talking about, which is like, I want to protect you from negative outcomes. People are like, great, do that. But at the end of the day, let’s say you do that. It’s hard to prove that you have protected people from X number of negative outcomes. But I think it seems like a lot of the folks who are focusing on resilience are starting to understand that the same things that contribute to your resilience contribute to your capacity to do more work. Right, because you talk a lot about Rasmussen and that sort of like the boundaries around work, there’s like a financial boundary, there’s sort of like other things, but work is constantly pushing us towards an error boundary. And so if you aren’t doing the work to increase your capacity, you’re going to hit the error boundary. And that’s where incidents happen. But the same thing that sort of pushes the error boundary away from us as we do more and more and bigger and bigger and more complex work is also increasing our capacity to do work. Do you feel like there’s a story that we could tell that’s more of the positive? Like we are increasing your team’s ability to do more things over time. And do you feel like that could be a more interesting or a more compelling story to tell management than like, we’ve protected you from X number of negative events?

Lorin: Yeah, I totally think you can. And so to make a pitch about improving expertise: on my team, improving operational expertise meant that we were more quickly able to diagnose problems. We spent a lot less time debugging certain issues because we had visibility, because of metrics. And so less time spent troubleshooting is more time spent developing and delivering value. And also, the engineers just perform at a higher level. We do become more efficient in that sense. And so I think the learning aspect, the sort of upskilling, is compelling because it’s saying, look, we’re going to sort of reduce the overhead of the firefighting kind of stuff, the kind of stuff that drags on us, in a way that it’s not just, okay, we’re spending a whole bunch of time paying down tech debt. That’s one way to improve productivity, but that’s also a chunk of time. So I think you can definitely make the arguments around improving expertise. There’s clearly an ROI there. Clearly, we are going to get better as an organization. Right? Everyone knows experts are more valuable. That’s why we pay seniors higher salaries than juniors. Right? Everyone is aware of that. And so I think that’s an easier pitch to make about the learning as a mechanism for improving the, yeah, capacity is a good term. The challenge in increasing capacity is then you just get asked to do more, right? So you improve capacity and then they throw more work at you: okay, you can move faster, then we’re going to throw more work at you. And so you move the boundary out and then you move closer to the boundary. The harder part is, I mean, it’s always an eternal struggle to carve out the additional. The thing about capacity is that you have to keep some of it, right? Like, capacity means you have some extra juice that you can use when you need it, right? And you’re not sort of.
And you need some social organizational capital to justify not running at full capacity. I mean, this is a challenge with the centralized incident management team. Like, these folks just sit around waiting for incidents to happen. Like, you could have them be software engineers building stuff, right? But they’re extra capacity that’s around.

Alex: One of the things that I think is interesting about this is it sounds like what we’re sort of saying is like the work that Dr. Forsgren has done in Accelerate, it’s valuable, but it doesn’t paint the whole picture.

Lorin: Right.

Alex: There’s lots of things that we’re saying is like, there’s always going to be this squishy space. And the company or the culture that you work in has to value exploring the space constantly, because that’s where you’re going to find the things that you can’t measure is in that sort of squishy space. But maybe there could be. What about things like psychological safety and other things that if you know that a team has psychological safety, maybe they’re better at exploring the squishy space, right? So maybe there’s ways in which you can measure or evaluate a team where it’s not like, are you following these 10 metrics? But it’s like, do you have the right environment to create the ability to discover the unknowable at the moment.

Lorin: Yeah, I feel like psychological safety has really sort of caught on. I think everyone, at least, I don’t know, pays lip service to it. Netflix is pretty good, I would say, because once again they’re all seniors. Everyone sort of has strong opinions. I mean, a lot of people have imposter syndrome coming in. But then people are comfortable disagreeing with each other and are okay with that. So one of the things, and I blogged about this: I try to make it okay, when I have done incident write-ups, that I name everyone’s names. I put them there explicitly because that’s okay. If you’re here, then we trust that you are good. And so the assumption is that if we want to learn as much as possible, we should assume that everyone who was involved was doing things that made sense to them at the time. And by putting the names in, we’re signaling there’s nothing to be ashamed of here. And I do this myself. But of course when you push to production and something breaks, you feel terrible. We’re humans, we feel bad when we’re involved in breaking things. To me, the psychological safety thing, I’m very lucky to work in an organization where I feel it’s there. And so I can say these things, but I don’t work in places like that. I have read horror stories about government contractor stuff during healthcare.gov, where someone basically got fired, sort of got fired on the spot kind of thing, or they accidentally dropped a database. There are environments that are like that. But I don’t know what to do about that. I’m fortunate I don’t work in one of those. I would just leave. I could choose my environment. I’m very, very privileged about that. I think if you don’t have psychological safety you have a huge problem.
And it’s much harder to do these things unless you’re at a place where you feel you can go and say half-baked things to your team, like, here, I’m sketching out a doc and it’s probably all wrong, but we’re just going to talk about it. I’ve recently been reading a book about engineers. It’s called Designing Engineers, and it’s about how engineers actually do design. And the guy who wrote it is a professor of engineering at MIT, and he did some case studies. He went out to various companies and sort of observed what was going on, and what he found was that a lot of the design work happens in the meetings, in the interactions between people, where different people have sort of incomplete views of what’s going on. And then they talk and they sort of negotiate what’s happening. And it’s in those interactions between people where the design actually happens. And I think one of the things that I would like to try to push is to think of the team or the org as the unit. It’s not like I’m designing it or I’m operating the system because I’m on call, but we are collectively doing this and each of us only has a partial view. And it’s the emergent result of what we do that is the thing. It’s not like I did this and you did this, but we are doing this together. And you should not expect to have the whole picture. You only have one perspective. And it’s the interaction of us together that is the thing that is developing and operating these services. We talk a little bit about that, but it’s a big perspective shift. And even I’m still wrapping my head around that. This is a joint cognitive system, is what the resilience people would say. This is what we have. It’s not just us, it’s the system that we care about.

David: I think it’s interesting, though, that they’re using psychological safety as a reference, because there too, I would argue that there’s an analog to the Accelerate book, which was Google’s Project Aristotle, which was the seminal thing that translated psychological safety into a concept that all managers could get behind, because now there’s a research paper that validates it. And there again, they actually have metrics that you can use to measure psychological safety in a team. Whereas, to us as engineers, I don’t think we would have tried to go about and do that. It’s just sort of a yes or no thing. And I’m still sort of left thinking that, you know, either there needs to be a sea change in management, which maybe goes back to what you were alluding to, like one funeral at a time, but I feel like even then we might still be looking for something that can translate the resilience culture that you’re describing into CliffsNotes for managers that they can measure. I don’t know if that’s ever going to happen or if it’s even sort of realistic to talk about it that way. I do want to hear your thoughts there.

Lorin: So I think what we need to do is figure out a way to provide management with a tool for aggregating the sort of massive amount of information that they have access to that is not simply metrics. Right. We need to give them an alternative. And I think we don’t have a good story around that today. Right. Like you mentioned just now, metrics around psychological safety or whatever. And once again, that’s a way of aggregating data. Right. And they need that. They only have a certain amount of bandwidth, and we need to figure out a way to provide them or upskill them with a way to aggregate the signals without relying on metrics. And I think we just haven’t figured that out yet. No one’s written, well, I guess there’s been the Resilient Management book, but we need more in that direction.

David: Interesting. So we have a few minutes left, and there’s two questions that we ask everybody. And it sounds like you’ve got a lot, so I’m excited to hear your answer to this question. The question is, what sort of resources have influenced the way that you work? And that can be books, of course, and research papers and conference talks, but it can also be just people that you follow, et cetera.

Lorin: Sure, yeah. So, I mean, I got sucked into this two ways. One is reading a book by Sidney Dekker called Drift Into Failure, which really was sort of like my entree into this resilience world. So I have an academic background, and that’s sort of one of his more academic books. And it just completely, I don’t know, I loved it, and I strongly recommend it. And the other one is John Allspaw, who has been banging this drum for a very, very long time. And at one point I was like, okay, fine, let me start looking into this stuff. And John works with David Woods, who was Sidney Dekker’s PhD advisor. So the connection is there. And so after John constantly evangelizing about this material, I started to read about it, and then I just got completely sucked in and started reading tons and tons of papers. And if you go to resiliencepapers.club, you can see my list of papers that I’ve collected. I haven’t even read all of them, but I’ve read many of them, and there’s just a ton there.

Alex: Nice. Sidney Dekker was my entryway as well. I love the in the tunnel, out of the tunnel perspective. That was the first thing that really resonated with me in terms of like, oh, we’re looking at this all wrong. So highly recommended. So our last question is how much of your time do you spend coding nowadays?

Lorin: Quite a bit. I would say. Roughly half my time is spent coding. So I’m really like a traditional, you know, software engineer. It varies. You know, some days it’s more docs and meetings. But, you know, I do spend a good chunk of my time still coding.

Alex: Nice. Awesome.

David: Awesome. Well, Lorin, thanks so much for joining us on the show today. It was really lots of fun.

Lorin: Yeah, I enjoyed it.

David: That’s it. Thanks so much for listening to Staff Eng. If you enjoyed today’s show, please consider adding a review on iTunes, Spotify or your podcatcher of choice. It helps others find the show. It is a really useful signal to us that folks are finding value in this so that we keep doing it.

Alex: You can find the notes from today’s episode at our website, podcast.staffeng.com. The website also has our contact info. Please don’t be shy.
