Engineering and Autonomy in the Age of Microservices – Nic Benders

Nic Benders (New Relic)

Description

Microservices architectures have removed many of the traditional technical constraints from engineering teams. But most organizations are still designed to be run in a monolithic top-down mode.

At New Relic we started our microservices transformation by restructuring our teams, but the results didn’t turn out the way we wanted. It turns out that moving teams around wasn’t enough. We needed to rethink the way the teams were built and run. Ultimately, we went back to our #datanerd roots to solve this problem and ended up giving our engineers far more control than anyone thought possible.

Presentation Slides

Transcript

Nic Benders: Hi everybody.

My name is Nic Benders, I’m the chief architect at New Relic. I’m a master of the slide advancer. No? No. The chaos is with us. That’s much more promising. All right, I’m going to talk to you today about engineering autonomy. I hope that you are ready for some org design, and that that’s what gets you stoked, because that’s what we’re going to go in to.

A lot of people have been talking about the people element of microservices. I want to dive deeper into that, and really look at what you can accomplish in an engineering org, once you start to change the way that you’re looking at why you have an engineering org, and how an org operates. For those of you who were here last year, I spoke last year, I talked about API design. I did not talk about stock advice, and I, again, will not give you stock advice.

New Relic started off our microservices journey with these two relatively micro services. One that did all of our data collection, and one that did all of our data display. Over time, they got to be quite a bit bigger. Once they got to be quite a bit bigger, we started to have a bad time. We had all of the problems that people talk about with a monolith. We had teams stepping on each other during critical releases, we had a lot of confusion. People were afraid to refactor, because the system is so large and complicated.

Microservices arrived, and everything was grand, and we won. This is an actual chart that shows a dependency diagram of 380 services that make up our production environment. It’s a little bit complicated. The reason it’s a little bit complicated is something you’ve heard about a lot today, which is this guy. This is Mel Conway, the man behind Conway’s Law.

This is something everybody has talked about, it’s very near and dear to the microservices philosophy. The reason is because this architecture diagram of our production environment is constrained to be a replica of this; our organizational diagram.

Now, a lot of people will say Well, you know, I’m in a pretty strange organization, but at least my org chart doesn’t look quite like that, unless I work at Microsoft. It’s not the org chart. This is a diagram that shows the communications between our teams and the dependencies between our teams. Every Post-it Note, or set of cards there represents a team, and all these lines are all the ways in which one team needs another team to do work, in order for them to succeed. This is the situation that gave rise to that fabulous microservices diagram that you saw.

To understand why this is so important, we have to take a look back at Mel Conway, here. Often overlooked, when you look at Conway’s Law are two elements. One is that he’s talking about your communication structures. Not your organizational structures. The key isn’t Do you have three teams to build your three-pass compiler. The key is How do those teams work together? This is going to set up not just how many dots are on that chart, whether it shows 380 dots, or whether it shows 200 dots, but all of those lines between them. Those edges between the nodes are the key to complexity in your system. Not the nodes themselves.

The other part about Conway’s Law that’s really important is that it’s a law. He says constrained. This is not Conway’s suggestion. You cannot cheat this. It’s not quite E=MC squared, but we’re talking about that level of tenacity. We tried to kind of nibble around it, and we said Oh, well, you know, we will have good intentions, we will make our teams small, we will divide them up into an org that is set up around business principles and not around platform layers. We still have that chart. We still have that chart because we didn’t fix the underlying issues as to how the communications worked.

Keeping Mel Conway with us for our journey, let’s turn back to microservices for a minute. There are a lot of definitions of microservices. This comes from Martin Fowler’s website. I’m sure everyone in the room has seen it and read through this list before, or seen a close facsimile of it.

The characteristics of a microservices architecture: this is what we wanted to build. We want to break things up via services, we want to have business capabilities. Again, not technical capabilities, but delivering business value, and we want to have these products and not projects. A team isn’t taking a unit of work that’s assigned to it, delivering that piece, and then moving on with their lives. A team is owning their work product.

In order to achieve this, we need to take this list of microservices characteristics, and turn them into microservices organization characteristics. How do we want our organization to look and to operate? You’ll see that there’s very little change here. What we’ve done is we’ve said You know, componentization via teams. The team is the core element in your organization. The same way the service is the core element in your architecture. Everything that we do is about building those teams, operating those teams, and ensuring that they have the correct interfaces around them.

Again, we’re organizing to business capabilities. Again, we’re thinking about long-term ownership of the software that we’re building, and of the market that we are serving from our teams. I’ve snuck a few more words in here, the old smart end points and dumb pipes. This applies to teams, too. If you think about the way in which you have been rebuilding your software architecture to move to simple transports for data, where each end point, it knows what to do, it handles its job, and then uses a simple transport to the next piece. Follow this pattern in your communications tools, also.

The teams, again, are the actors, here. You don’t need a super heavyweight status process. You don’t need all of these expensive pipes to move that data. You need simple ways to move the data, which, in this case, might be status, it might be context, it might be the work that needs to be done. This is easy. You can put it on a wiki page. You can post it on an internal blog. Use a lightweight tool that gets the information to the teams, and then let the teams be the smarts in your system, not the communication.

If we go down the rest of this list, what we see is that we need something like this. I am going to read this slide. We need durable ownership teams, organized around business capabilities with the authority to choose their own tasks and the ability to complete those tasks independently. Almost every word in this is chosen very carefully. We have to emphasize that durability, emphasize the business capability, the authority, the ability, and the independence. These are the main elements that make up a nice, independent, autonomous team.

In order to do this, we’re going to have to step away from central control, in the same way that we had to step away from the monolith. When we had the monolith, you had to carefully organize all of your work, in the microservices world, we let teams make choices. Everyone is comfortable with these basic concepts. This is how we end up with half of our engineering team working in Node, and the other half in Scala. It goes beyond these simple technical decisions into really, how do you make decisions in your organization? Are you making your decisions in a monolithic manner, and then pushing that out to the teams, or are you moving the information? Moving the data, to compute in your org, so that the teams can make the decisions.

In order to achieve that independence, you have to eliminate the dependencies between the teams. A dependency, as we’ve said on that chart, is a place where my team needs another team to do work in order for me to succeed. Put another way, this is an opportunity to fail. Every time a team has a dependency on another team, this is a chance for something to go wrong in your org, and cause a project to fail for reasons outside of the primary team’s control.

We’re going to do this through org structure changes, and through some tooling improvements, but mostly through those org changes. Now, of course, anytime you see org changes, everybody’s ears perk up, because it’s our favorite topic. We’re going to do a re-org. Engineers love re-orgs. We do them all the time, and they always solve our problems, right? You know, it works out well.

It doesn’t tend to work out well. Leading up to this realization we had, we had probably done three departmental org changes in the last two years. Each one of them consisted of taking some teams, changing the names of one or two of the teams, and reassigning them to different directors. Nothing says I’ve totally changed the way I work like having my boss report to a new boss, so my grand-boss has changed. That really fires you up.

We wanted to do something a little bit different. We’d say Well, we’ve got a pretty tough problem here. What do we do when we have tough problems? Easy, we’re engineers, we solve problems for a living. We’re going to solve this one. We’re going to approach this the same way we would approach any engineering challenge. We do impossible things every day, that’s what everyone in this room does for a living. Coming up with an org structure that actually accomplishes our goals shouldn’t be too far from that.

We started off, the first thing we realized was that we’re going to need to optimize for agility. Earlier today, you heard, right? Microservices is about speed, and it’s that agility, the ability to make changes quickly, not the efficiency of operation. Our org structure has to follow this same idea. Up until this point, people’s primary concern had often been the efficiency or the utilization of a team. When you are sitting on a team, and you are with all of the people who are experts in the same technology, and you’re focused down to a single type of work, you have a work queue that stretches back that ensures that you never under-run your buffer. This is an ultra-high utilization solution. It is a recipe for disaster when it comes to latency.

If you’ve done network engineering, you know that long transmit queues equal poor latency. Because a small change has to wait until it moves through there. Then you institute quality of service, and all these manual change or control processes. Well, I have a high-priority ticket, it needs to move through. Before you know it, the only work that’s getting done in your org is P-1 critical hot bugs. This requires, then, to kind of back off of this, a lot of management. We have that centralized control again. If we want teams to be able to operate successfully without centralized control, what we need are short work queues, and we need decision-making at the edge.
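As a rough illustration of the queueing point above (this was not shown in the talk), the trade-off between utilization and latency can be sketched with the textbook M/M/1 single-server queue, where the mean time an item spends in the system is the service time divided by one minus the utilization. The choice of M/M/1, the function name `mean_time_in_system`, and the one-day service time are all assumptions made for this example, not New Relic figures.

```python
def mean_time_in_system(service_time: float, utilization: float) -> float:
    """Mean time an item waits plus is worked on, under an M/M/1 model.

    Assumption: random (Poisson) arrivals and a single server, which is a
    simplification of how a real team's work queue behaves.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time / (1.0 - utilization)


if __name__ == "__main__":
    # Hypothetical team whose average task takes one day of actual work.
    for utilization in (0.5, 0.9, 0.99):
        days = mean_time_in_system(service_time=1.0, utilization=utilization)
        print(f"utilization {utilization:.0%}: about {days:.0f} days per item")
```

Under these assumptions, a one-day task takes about 2 days end to end at 50% utilization, about 10 days at 90%, and about 100 days at 99%, which is the sense in which an ultra-high-utilization team is a recipe for poor latency.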

You know, we’ve hired smart people. Let them be smart. It takes a tremendous amount of energy and resources to hire your engineers, to hire your PMs, to hire your designers. Let them do their jobs, let them do a job that’s bigger than what you thought they were capable of doing. They know the domain way better than anyone else does. As chief architect, I know almost nothing about the level of detail that an engineer working on a project does. That engineer has far better information than I do.

Let that person be smart, let them get the information that I do have, which is more easily transmitted than information from every person in the org back to me. Then, let them make decisions. The next thing we said was You know what? We’re data nerds, it’s part of our brand, it’s part of our core values as New Relic. We should use data to do our org change. Many times, when I’ve been involved with re-orgs, the way it happens is we go out to some off-site, and we sit around, all the VPs get together. We say You know what, we’ve got a serious problem, and we need a serious solution. We brainstorm for, could be 40 minutes, could be a really long time, and then we come up with some org changes, and we want to try them.

This is appealing because it’s got a good bias to action, it means that we’re not letting problems fester, but it also means that we’re making decisions on imperfect information. An organization can be quantified, just like a piece of software. When we do the analysis, the design and the rollout of our org change, we want engineers to make the decisions, and we want to be deliberate about this. The first thing, and the most important thing in our org change is that we needed to break these dependencies. That chart is a nightmare. Everything there is just opportunity for failure.

We take the chart that we built, and we built this by actually going around to all of the teams in our engineering department, all of the teams in our product department, and asking them to come and crowdsource it. This is two very large pieces of paper that were rolled out in the lunchroom, and we have people collaborating remotely, as well as in-person, and we drew it all out. Then, because this is basically impossible to work with, we translated it into a graph, because I love graphs, and here, the nodes on the graph are teams and the edges are the dependencies. You’ll note immediately that there are less edges on this graph than on the hand one. When we first tried to translate it, it was too gnarly for us to even make sense of.

We removed a lot of duplicate edges. We removed things like Well, everyone depends on the network being up. Okay, noted, just put a checkbox over here that says The networking team has to do their job. Then we started working on proposals to simplify this. Again, this came from engineering. This didn’t come from VPs. This didn’t come from our CEO, this was the individual contributors on the team, and the managers coming in, looking at our org chart, and making suggestions and trying out new ideas.
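The cleanup described above can be sketched in a few lines of code. This is only an illustration under stated assumptions: the talk does not say what tooling was used, the `simplify` function and its `universal_threshold` parameter are hypothetical, and apart from "insights" the team names are invented. Edges are `(dependent, provider)` pairs; duplicates are collapsed, and any provider that nearly every other team depends on (like the network) is pulled out into a separate checklist.

```python
from collections import Counter


def simplify(dependencies, universal_threshold=0.8):
    """Return (kept_edges, universal_providers) for a team dependency graph.

    dependencies: iterable of (dependent_team, provider_team) pairs.
    universal_threshold: hypothetical cutoff; a provider needed by at least
    this fraction of the other teams is treated as a checkbox, not an edge.
    """
    # Collapse duplicate edges and ignore self-dependencies.
    edges = {(a, b) for a, b in dependencies if a != b}
    teams = {team for edge in edges for team in edge}

    # Count how many distinct teams depend on each provider.
    dependents = Counter(provider for _, provider in edges)
    universal = {
        provider
        for provider, count in dependents.items()
        if count >= universal_threshold * (len(teams) - 1)
    }

    kept = sorted(edge for edge in edges if edge[1] not in universal)
    return kept, universal


if __name__ == "__main__":
    raw = [
        ("insights", "network"), ("insights", "network"),  # duplicate edge
        ("browser", "network"), ("apm", "network"),
        ("insights", "apm"), ("browser", "apm"),
    ]
    kept, universal = simplify(raw)
    print("edges to work on:", kept)
    print("checkbox dependencies:", universal)
```

In this toy input, every other team depends on "network", so it becomes a checkbox, and only the two edges into "apm" remain as dependencies worth redesigning around.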

We got to this, which is a little bit better. There’s a lot fewer edges on this, if you count them, but there’s still too many. Then, we proposed some solutions that look like this. This is a lot better. This is a pretty clean dependency diagram. Something that you feel like Yeah, I could understand this, I know who holds the keys to my success, but we’re going to have to make some more serious changes to get here. These changes are going to have to be technical and organizational.

The first thing we’re going to have to do, is we’ve got to make really strong teams. If each of those teams are going to carry that quantity of weight and that amount of responsibility, we have to change the way that they are built, and the way that they are run. We need something called a full ownership team. Many people are familiar with full stack teams. A full stack team is one where you have people with an operations expertise, maybe database, back-end services and some UI, so you can touch all of those technology bases.

But you’re missing something in a full stack team. What you’re missing is the business ownership. You’re missing having a product manager, or business owner on the team who can help you make decisions in your business context. To make full ownership teams from our full stack teams, we put PMs on the teams that were product-facing, and we selected technical product managers for our internal teams who did not have an outside customer.

Even with this, we needed, again, more. We need T-shaped engineers on the team; we need people who have a breadth as well as a depth. Now, note, although I, myself, am a total generalist, I have very little in the vertical department there. That’s not really what we need in our teams. We need people who are technical experts in one, or, maybe, several fields. They have to also have a good general basis to understand the problems that their team’s facing, and to be able to work together so that you don’t have five one-person teams who sit together. You really have a single five-person team.

For instance, someone who is a great UI engineer, knows JavaScript forwards and backwards, can dance with React. Should also be able to be on-call for your Java services and to restart them, deploy them, read the source code, understand the JVM tuning. They don’t need to be an amazing Java engineer. They just have to understand the problem that those components are trying to solve so that they can see how all the pieces fit together.

This is really important for autonomy and for agility. Because we’re not going to have a centralized control structure who’s going to come in and say This is the way your UI should talk to your mid-tier. The team is going to make these decisions. The team has to be made up of people who understand how they all fit together. Saying We are inverting control in this org.

As chief architect, one of the first things I did was actually remove architecture reviews. Because architecture reviews are a top-down method of control, where we say Oh, bring your ideas to us, and we’ll tell you if they’re good. Then what would happen is, often, we were wrong, and that’s one bad outcome, but a worse outcome is a team brings an idea they think is wrong and we say it’s right. Now, they say You know what, the architecture team said we had to do it this way.

We have taken away from them the autonomy by taking away the accountability for their own decisions. We’re inverting control in this org so that just like in our microservices pattern, each service should stand alone. Each team must stand alone and must own their decisions. It came down to looking at this radical org structure change. This was coming in to March of last year and saying You know what, we’re going to have to make a lot of changes. We need to take our teams today, kind of blow up the teams and reform them into these different containers to change the way that people work.

All right, all of these teams, we’ve got Java skills needed on a bunch of teams, [inaudible 00:20:08] skills needed on a bunch of teams. We started going through our little Rolodex, and we said All right, well, this engineer was hired for Ruby, but I happen to know they know Java. This rapidly kind of broke down, and we said the only people who really know what the engineers are capable of are the engineers. If you’re trying to build a system that gives control to your individual contributors, that gives control to your product managers, and to your engineers and to your designers, then you can’t do it by taking away control from them in the process.

We decided that we were going to do team self-selection. It was true to the purpose of the exercise, and it solved this thorny problem, where we don’t really know what people know. Only they do.

Self-selection is a seemingly crazy idea, where you put out all of the jobs that are available to do in your department, and you let all of your individual contributors show up and pick the ones that they are going to do. It is not self-application. It is not manager selection. We are not saying Oh, come by and like, we’ll do an internal job interview. No, it’s literal self-selection. If an engineer says I have decided that I’m going to be on the insights team, then that engineer is on the insights team, and nobody can say boo about it. This, obviously, is harder than it looks.

This turned out to be a lot harder than it looked. It sounded really great in that meeting, let me tell you. The most obvious problem was that managers really didn’t like it. A manager has spent their career learning how to form the perfect team. To bring people together who have different problem-solving approaches, and different technical backgrounds and personalities and form them into this great team. We just told them Actually, you don’t get a say in who’s on your team. That was super unpopular.

It was also a missed opportunity for us, we, essentially said You know what, you’re managers and one of the great parts about your job is that you pretty much have to do what I say. They did, because managers are used to taking a lot of really unpleasant tasks, and so they just kind of took this one in stride. This is a major change management failure, and I want to call this out. Because if you’re ever rolling out this type of change, your managers should be your best friends on this.

Because we didn’t go to the managers first, and include them in the process, we included the engineer’s view, but not the manager’s view. Then, when the engineers said Hey, you know, I’ve got some reservations about this idea. The managers said Yeah, tell me about it. Not because they’re trying to be mean, because honestly, they had reservations too. Managers are also people.

I wish that we had done better with this. Ultimately, we were able to, mostly, soothe our way past it. The big surprise for us is that the engineers didn’t like it either. We thought the engineers were going to be over the moon. They’re going to be like Finally, I get to pick, you know, what it is I’m doing. Like, I am in charge of my own destiny. There were two problems with this. One, people thought we were lying. People thought that this was some kind of elaborate musical chairs ritual, which would result in fewer jobs available than jobs they wanted to do.

Like Well, but I love doing JavaScript, and I’m going to show up, and there are going to be three jobs available, and they’re all going to be DB2 DBAs. People were really worried about that, and I think there’s an element of just kind of growing up, being picked last at dodge ball, maybe, that kind of appeals to the engineer’s psyche. That was hard. The other thing that’s a little more subtle was that people were worried that engineers wouldn’t make good decisions. That they wouldn’t pick a team that was great for their career or that somebody who they really just didn’t like was going to join their team. It was just going to be socially awkward, and that they were going to have to show up at this critical moment, and make a decision that could not be undone.

You know, in the past, this is something their manager’s taken care of, there are professionals for this. You know, I’ll be completely honest, we almost backed down. I went up to the executive sponsor for this project, and I said It is torches and pitchforks as far as the eye can see. We have trashed morale in this department attempting to do something that we really kind of decided a little bit on a whim. I don’t think it’s worth this quantity of suffering. He said to me If we want to create an org where individuals are empowered to make meaningful choices, then we cannot do it by having the VPs pick who goes on the teams. Go back, figure out what’s causing all this trouble from the engineers and the managers, and fix it. So we did.

We figured out that the first failure, as I’ve described, is a failure of us to empathize. We sat in our meeting room, and we thought this was going to be grand, and we were taken aback when people said No, this doesn’t look grand. We had to really understand what their concerns were, and not just what we thought they were. It’s not Oh, people are worried because there’s one really loud guy, and they don’t want him to join their team. No, people were worried about real, substantial things. We had to talk to them one-on-one, and we had to listen much more carefully.

Then we had to communicate and communicate and keep communicating and just say over and over again This is not some kind of stealth layoff, this is not a job fair, this is not a place where you need to reapply to get a job that you already have. This is nothing more than an opportunity for you to exercise some self-determination and to do something that you want to do, that we didn’t know you wanted to do, and just keep hammering this message. Let people know what the opportunities were, so they weren’t making a snap decision. You could go through all of this.

We had to rely on our values. One of our key values has always been that We will take care of you. If an engineer had come to me before self-selection and said You know, I’m really unhappy on my team, and I’m thinking of leaving the company, because I don’t like this type of work. Of course, I would scramble, and I would make sure that we found a place for them to work where they felt engaged and enjoyed what they were doing. Self-selection will not change that. We were not shifting the burden of being responsible for our employees to the employee. We were shifting the burden of making day-to-day decisions to them, but we’re still there to look out for their career.

We went over this and went over it and went over it. After about three weeks, the fear level went from complete chaos to kind of wariness and we were ready to go. What we did is we did it all in a big event. This is the event hall that we rented out. We put up these big balloons with the names of all the teams. We had circulated, a couple weeks in advance, a list of every team and all of the skills that the team needed. Not the roles on the team; so not This team needs two Java engineers and one JavaScript engineer. Instead saying This team needs people who know Java and JavaScript. That could be one person who knows both, and a bunch of people who are rank newbies. It could be a mix of some people who know each. Just as long as the team itself … Because, remember, the team is our unit. The team has to be viable, not the individual members of the team.
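The skills-not-roles idea above amounts to checking coverage at the team level rather than per person. A minimal sketch, assuming each member is described by a set of skills; the `missing_skills` function and the member names are hypothetical and were not part of the talk:

```python
def missing_skills(needed, members):
    """Return the skills a team still lacks.

    needed: set of skills the team as a whole must cover.
    members: dict mapping a member's name to that member's set of skills.
    The team is the unit: it is viable when nothing is missing, no matter
    how the skills are spread across individual members.
    """
    covered = set().union(*members.values())
    return set(needed) - covered


if __name__ == "__main__":
    needed = {"java", "javascript"}

    # One person who knows both, plus newcomers: the team is still viable.
    team_a = {"alex": {"java", "javascript"}, "sam": set(), "kim": set()}
    # A mix of people who each know one of the skills: also viable.
    team_b = {"alex": {"java"}, "sam": {"javascript"}}
    # Nobody knows JavaScript yet: the team is not viable as formed.
    team_c = {"alex": {"java"}, "sam": {"java"}}

    for name, team in (("a", team_a), ("b", team_b), ("c", team_c)):
        print(name, "missing:", missing_skills(needed, team))
```

Both of the first two teams cover the list, even though neither was staffed by role count; only the third has a gap to fill.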

We, then, had everyone start in their old teams at the event. To get people to look around, we said Regardless of whether you’re changing teams or not, get up, walk around, look at the signs and all the other teams, meet the managers of all the other teams and find out a little bit about this org, and then you can go right back down to your old teams. We brought in our leadership, and our chief product officer and our SVP of engineering kind of ceremonially kicked it off. They released people to go find new teams, and reinforced, again, that we will take care of them.

At this point, no one had changed teams yet, but we had already found opportunities. We had taken people who had spent their entire careers at New Relic on a single team, or in a single function, and forced them to look hard at what else was done in the organization, and think about all of the different places where they could be contributing. At this point, I was almost ready to declare victory. Even if nobody switched teams, what we had is we’d created a feeling that if the job that you’re doing today maybe doesn’t turn out to be a great choice for you in the long-term, there are several other jobs that are also interesting. Those other teams, who you might have thought Oh, this is just a dependency of mine, every time I need something, that other team lets me down. Now you know what they’re trying to produce.

You know what it is, why they exist, and what their charter is, and what they want. We created a shared understanding in the organization. We actually did switch teams. About a third of the people who were involved in the exercise, and we had, I think, 300-some people in the exercise, switched teams. This gentleman here, this is Cory Johannsen, who is the engineer who inspired me to give this talk. I actually co-presented with him on this topic at FutureStack, and he is just as rarefied as the picture there with the pipe cleaner monocle makes him look.

Cory changed teams. He had been a back-end engineer for his entire career. He had worked in embedded systems, he worked in some really way-back-in-the-stack, high-throughput Java. He came to this thinking There’s not going to be any interesting jobs for Java people. Which, of course, was the exact opposite of what I was terrified about the whole time. Where I said I need more people who have back-end skills. He walked around, discovered there were tons of teams that needed back-end skills that he didn’t know about, and he kind of just fell in love with the charter of a team that wasn’t the team he had been on. He didn’t intend to switch teams, but he just was like You know what, let’s try it. Like, what’s the worst thing that can happen? They’ll take care of me anyway. He joined a new team. In fact, almost all of the members of that team were new.

Truly, it wasn’t switching the people on a team, it was a new team, that had been formed. Lots of these new teams formed. People came from all over the org, came together and created new teams with diverse skillsets, diverse technology backgrounds, diverse personal views, and we got this fabulous cross-pollination in the org. We really got those teams that you see in those diagrams when we talk about Conway’s Law. A team that contains lots of different types of people who are brought together because they are trying to solve the same business problem, not because of a technical background.

Once we have these teams, now what? The first thing we needed to do was to determine how they were going to function. Again, this was a task that we assigned to the teams. We asked them to create their working agreements. Working agreements, for those of you who aren’t quite as deep in some of these agile org design things, simply put, it’s the answer to this … The completion of this sentence: We work together best when …

In the lead-up to this, we’d talked a lot about psychological safety, and about what makes a good team, and those things. A lot of teams answered this question with several working agreements about ensuring that all viewpoints on their team were equally shared, or the mechanics of their meetings, or when they went out for coffee, or which days of the week they would not have meetings.

There was also some nuts and bolts things. Cory’s team picked these. This is the insights team that Cory joined. They picked continuous deployment, no big surprise there. Weekly demos and retros, and then this kind of oddball: mob programming. Which is just super-duper extreme pair programming, where, literally, the entire team sits around a single computer and one person types at a time.

Normal me would’ve said The hell you are. But that’s not really the theme here. The theme is teams make choices. The team has made a choice, and so let’s let them try it. They’ve actually had huge success with it. It’s something that I recommend on a maybe, more limited basis to many teams. Because what we had here is a team of six individuals, four of whom had never worked on the product in question. Cory was learning JavaScript from scratch, as part of this. A good way to learn these technologies, and to learn your business domain, is to actually literally work together as a team to solve each problem.

This, again, is an optimization for agility, and not for efficiency or utilization. It seems like a wild waste of utilization to have five people overlooking the shoulder of the one person typing. But, it let them make decisions quickly as a group that everyone understood the impact of, and they understood the history of, and this gave them tremendous agility. It really worked. The whole thing really worked. We were sitting before this re-org, I am not afraid to tell you, thinking Well, perhaps this was it, we have killed the engineering org, everyone will get up, be miserable, leave. I will be fired tomorrow. The teams came together, they were fantastic teams, people who we never would’ve picked, which was the whole point. The engineers knew things we didn’t know.

When it came together, they shipped a ton of software. We exceeded all of our projections for what we were going to get done this year, even though we spent the first six months of the year, essentially, taking on friction from the re-org, and we had people form entirely new teams, so they had to learn their technologies again from scratch, and they still exceeded our goals. Because they were able to work so much faster on the things that matter. Not working more hours, not having more tasks per engineer, but doing what mattered.

This was in May that we kicked this off. There was one item here, at the bottom of our characteristics of microservices architecture, the evolutionary design that I wanted to revisit. We just did our revisiting. We had our retrospective. We looked back on what worked and what didn’t. One of the things that didn’t really work for us the first time was teams felt like they understood their technical domains, they knew how to work together, but they didn’t really understand what their boundaries were. The way of working for us, for all of us, for decades has largely been one of centralized control. Where, when you do things on your own, it’s subversive, you kind of try to keep it quiet, perhaps. You have those Well, I’m going to work on this on the side until it’s guaranteed success, because I want to make sure that nobody cancels this or throws a bunch of wrenches in it.

We wanted to step away from all of that, really change the way that people worked, and they needed some more clarity on this. This is what we came up with, which is a team’s rights and responsibilities document. This one here, this isn’t exactly the one that we’re using internally. It’s got some little cleanups and simplifications, but this is the basic idea, is that teams have things that they have a right to do, but in return, you have created responsibility. Teams write their own MMFs, which are a minimum marketable feature.

This is the unit of work that a team has committed to do, and we have said Yes, once that is finished, the business will have value. That value could be knowledge, it could be revenue, it could be increased capabilities, but it’s something that, when it’s done, has a durable value that isn’t lost if you never return to this topic. In return, for the right to create their own MMFs, the teams have to listen. Other people have ideas, and they can bring their ideas to them. The teams’ MMFs really need to be minimal, because we’re making an organizational trade here, where we’re saying We’re going to try as much as we can to avoid disrupting your work part-way through. Because we know that if we disrupt an MMF halfway through, we’ve paid half the cost for zero value.

In return, the team needs to make these MMFs small, so that the org can, essentially, gut it out for the length of time it takes to finish them. Even if we say Well, that really isn’t the top priority anymore, but we have to trust you. And so on. What we see here is that each one of these rights is linked to a key responsibility that is the trade. This is the contract between our teams and our organization. This is those strong boundaries around each of the components that go into our diagram. I put all the slides online also, if you’re curious to kind of read some of the details later.
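The trade described here, where an interrupted MMF has cost but delivers no value, can be put into rough numbers. The following is a minimal sketch, not anything shown in the talk: the function name `interruption_outcome`, the twelve-week plan, and the week-seven priority change are all made-up assumptions for illustration.

```python
def interruption_outcome(mmf_weeks, change_week):
    """Walk through back-to-back MMFs and find the one in flight when
    priorities change at `change_week`.

    Returns (finished_count, weeks_sunk_in_current, weeks_left_in_current).
    All inputs are hypothetical planning numbers, not data from the talk.
    """
    elapsed = 0
    for finished, length in enumerate(mmf_weeks):
        if change_week < elapsed + length:
            sunk = change_week - elapsed
            return finished, sunk, length - sunk
        elapsed += length
    # Priorities changed after every MMF was already done.
    return len(mmf_weeks), 0, 0


# Same twelve weeks of work, sliced two ways (assumed numbers).
one_big_mmf = [12]
four_small_mmfs = [3, 3, 3, 3]

big = interruption_outcome(one_big_mmf, change_week=7)
small = interruption_outcome(four_small_mmfs, change_week=7)

print(big)    # (0, 7, 5): nothing shipped, 7 weeks at risk, 5 more to finish
print(small)  # (2, 1, 2): two MMFs shipped, 1 week at risk, 2 more to finish
```

Under these assumptions, the org either abandons seven weeks of work or waits five more weeks with the large MMF, while with small MMFs it only has to "gut it out" for two weeks and already holds the value of two finished pieces.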

When we take a step back, maybe running a six-month-long re-org isn’t something you’re going to go home immediately and do, but there are smaller things in your organization that you can do right away. There are key fundamental ideas here. The most important one to me, honestly, is that you hire smart people, you should trust them. You should trust them to do things that are well beyond what you thought they were capable of doing. The team that drove this entire re-org was made up of PMs and engineers and designers working together.

We did not have a whole bunch of MBAs come in here, we did not have a bunch of VPs driving it. We really had a team of the people who live this every day step outside of software development, and outside of design, and outside of product, and learn org design, learn change management, analyze the org and pursue it. Then, at every step of the way, we’ve attempted to really counteract our biases towards centralized control. Towards wanting that kind of safety of Well, it’s easier to do it all linked together. Instead say No, you know what, let’s let the teams try this. Sometimes it’s going to be wrong, sometimes, we’re going to come in and we’re going to correct this and we’re going to say You know what, we need the whole org to move together.

You know, it’s not a total anarchy here, but try to push further than you think you could. Because, better teams make better products. This quality of the software, and the velocity, but it’s people who are able to control everything that goes into that system. They have more sense of ownership, they have more pride, and they have great ideas. Our software engineers are super smart. They can think of a way to make this better. They can write an MMF that the product management team, and the centralized product management team would never have thought of, because they’re not close enough to the problem.

You have to let the teams pursue these ideas, even when we don’t see them as leaders. I want to give some acknowledgements here. I want to thank, first off, this gentleman, Jim Shore. He was an outside consultant who came in, and brought this idea to us of taking a much more rigorous approach, and being a lot bolder. This is his book, The Art of Agile Development. Really, what his focus is is working with high-growth companies, who are reaching the limits of their org structure, and trying to find a new way. There’s a bunch more reading behind this that we went through, under this process. The Liftoff Book by Diana Larsen is where the working agreements and a lot of what we did, and generally, the team formation exercise, came from.

This book, Creating Great Teams, on self-selection, is not a book we used, but it’s pretty much the best book that we could find on the topic of self-selection, for those of you who want to know more about it. It talks about doing it at some significantly larger companies. My two personal favorite management books here: Turn the Ship Around, which really drums in to you this idea that you cannot create strong decision-makers and leaders by making decisions for them. And The Principles of Product Development Flow, which is a fabulous book, if you love math, and talks a lot about the economics of delay, and why it is that you absolutely have to emphasize that agility over the total throughput. It’s more important to have a short lead time, than it is to have more product per month coming out of your teams.

This is something that is surprisingly hard to convince, even your individual contributors of. They will feel bad that they are not immediately jumping off and working on something where they can move faster. You have to walk everybody back, show them the charts, get them to agree Yes, I do believe in this. And to work on something where they feel like You know, they can’t be as great of a contributor in a technology they’re unfamiliar with, but, it does help get that task out the door faster. That’s what the business really needs.
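The lead-time-over-throughput argument can be illustrated with a small calculation. This is a rough sketch under loudly stated assumptions, not a model from the talk or from the book: three features of nine person-weeks each, three engineers, a flat cost of delay of one unit per feature per week, and a 25% efficiency penalty when the whole team works on one feature at a time. All function names are hypothetical.

```python
def parallel_finish_times(num_features, work_weeks):
    # One engineer per feature, all started at week 0,
    # so every feature ships at the same (late) time.
    return [float(work_weeks)] * num_features


def swarm_finish_times(num_features, work_weeks, engineers, efficiency):
    # The whole team works on one feature at a time. `efficiency` < 1
    # models the utilization lost by working together (assumed value).
    per_feature = work_weeks / (engineers * efficiency)
    return [per_feature * (i + 1) for i in range(num_features)]


def total_delay_cost(finish_times, cost_per_week):
    # Each feature costs `cost_per_week` for every week it is not shipped.
    return sum(t * cost_per_week for t in finish_times)


parallel = parallel_finish_times(num_features=3, work_weeks=9)
swarm = swarm_finish_times(num_features=3, work_weeks=9,
                           engineers=3, efficiency=0.75)

print(parallel, total_delay_cost(parallel, cost_per_week=1))  # [9.0, 9.0, 9.0] 27.0
print(swarm, total_delay_cost(swarm, cost_per_week=1))        # [4.0, 8.0, 12.0] 24.0
```

With these made-up numbers the swarming team finishes everything three weeks later, so its throughput is lower, yet the total cost of delay is smaller because the first features reach customers much sooner. Different penalties or delay costs can reverse the result, which is why the charts matter.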

I want to thank you for listening to me ramble on this topic for 40 minutes. And the slides are available online here, on NicBenders.com. This is my Twitter handle, down so low that you can’t see it, @NicBenders. You can feel free to Tweet your general gripes, complaints, or questions, or come up and grab me afterwards.

Thank you.

Flynn:
You mentioned at one point, the fear that you had just destroyed your engineering organization, you were all going to be fired, the world was going to come to an end, and things like that. The night before your big event, what was the failure mode that was keeping you awake at night, that you were horrified was going to happen the next morning?

Nic Benders: Oh man, there are so many.

Flynn: Any other questions, go ahead and raise hands please.

Nic Benders:
The main thing that I was worried about tactically was that you have 300 jobs for 300 people, will they find them? Will they get in to the right configuration, or will we end up with everybody on some popular teams, and nobody on the unpopular teams? By and large, this did not happen. We were more worried that people would be angry, and would feel that they were excluded from something they wanted to do.

One of the mitigations we took is although we told everyone teams had a size leading up to the event, for the event itself, we weren’t going to actually enforce that. Instead of saying Well, this is a six-person team, and there’s seven of you, so somebody better scoot. We just said You know what, maybe it’s going to be seven, as long as we can figure it out entirely in the org.

Even with that, there were still a few people who honestly had to get the tap on the shoulder and be told, they really wanted to work on this technology or on this team, but that the company needed them in this critical role. We owe those people, because they are the ones who made the organization work. There were not a lot of people in that position, which was good from an org perspective, but made it, I think, even more uncomfortable to be in that position.

I can’t see anything, so I will let Flynn pick people.

Audience:
Thanks for the great talk. You mentioned the anxieties of managers before this self-selection event. I was wondering what the role of managers on these little mini agile teams is. How do you avoid that turning into a little command and control unit, where it’s imbalanced local decision-making?

Nic Benders:
The question of what does a manager do is a perennial one. There’s a lot to do to … Like I said, we have a responsibility, regardless of how we structure our org, to look out for the careers of our employees to ensure that they are growing, that they have what they need, that they’re happy, and that the team is working together.

While we said managers can’t tell people Hey, you can’t be on my team during self-selection, if after three months, someone is not working out, or it’s like there is a tension on the team, it’s absolutely the manager’s responsibility to get to the bottom of that. We have not seen a lot of managers trying to grab control away from the team, I would say. I think that the managers are still kind of timid, and they are afraid that product management is just squinting at it. We’ve actually spent more time, I think, encouraging people to grab hold of the destiny. To work with their embedded PM.

Remember, every time I say team, I include a product manager who is on that team. To work with them, and to pick their MMFs, and that we won’t then go and slap their wrists. I think that the teams who are comfortable doing this are the teams who are the happiest and the most productive.

Audience: Giving the teams the ability to choose process, technology, et cetera, with technology, specifically, across the org, what kind of impact, or what did you feel, giving the teams total control of technologies?

Nic Benders:

Yeah, this is another one where I have my personal feelings. Where I’m like I think that, perhaps, that is not the best technology to choose, or that’s just some Hacker News thing. It’s more important for teams to have autonomy than it is for teams to have cohesiveness, in terms of technology. If people pick a technology on their team, and then half the members of that team leave, and they take with them that knowledge, and other people are like Well, actually I hate this, and we have to rewrite the component. Frankly, that’s a small price to pay. If people are going out and innovating and creating faster in this system.

We’ve tried to set some basic boundaries over things. We are container-based. We want people to be deploying via containers. We are a monitoring company, we sell a monitoring product. We want you to monitor your technology, no matter what it is. If you are building your services in Elixir, which is something that people do, we don’t have an Elixir agent, so that has to go on your list, also.

You have to build, essentially, an Elixir agent as part of your project. If you still feel like this is a way to get to your goal faster, then go for it. We had a team do this, and they basically built just enough of an Elixir agent to meet our monitoring requirements, and the component runs great. I’m happy that they were able to do something that I would have absolutely said no to, if we had a more, kind of, command and control org.

Audience:
Two questions: one is, you did mention briefly, about six months, but is that the time period from the time the idea floated, to actually things settled down and people were, again, productive?

Nic Benders:

No, teams were really humming within the first month. They weren’t back to full speed, but to give you kind of an idea of timeframes, every team is different, and I am up here in complete terror, which is hard to see, perhaps, that I’m going to say something, and I’m going to be going back to my office, and people are going to be like Let me tell you, I was miserable and my team had a terrible time, and you’re making it sound great. I want to put a caveat that everybody had a different experience, and this was not like, universally grand. Teams like the insights team, which went from having two-thirds new members in May, to standing on stage in November with a GA of a product that was built in that intervening time that had significant complexity.

Teams were able to move past it, over the period of a single quarter, I would generally say. Today, I’m happier with the org than I’ve ever been.

Audience:
Back to the manager part of it. Did the managers also go around the team, finding new teams, or were they the ones that defined the team?

Nic Benders:
Yes, the managers were chained to the tables, which is maybe one of the reasons why the managers were not as thrilled, is that we needed a kernel to form each team. In most cases, this was a manager, in some cases, it was an engineer or a PM. We needed somebody to take the draft charters that we had produced as kind of a change group, and turn them into a list of responsibilities, and to stand up at all hands, and say I’ve got a great team solving this great problem. We needed somebody to be that seed in the center of the team, and for most of the teams, it was the manager.

Audience:
How did this interact with performance management?

Nic Benders: The personnel performance management, I assume?

Audience: Yes.

Nic Benders:

Not application performance management, which is kind of a different subject for me. For some people, it was a real reset. We had people who … It’s like if they had lost some of the wind in their sails in their previous team and joined a new team, we didn’t want to give them totally a clean slate, but we wanted to say You know what, this is why we are doing this. We are doing this so that people who weren’t feeling engaged, they weren’t feeling like they had an opportunity to really contribute at full capacity had that.

For those people, yeah, it was a chance to start again. If you had a different type of personnel issue, that, obviously, wasn’t changed by the team structure. For people who we just felt like maybe weren’t a top performer, we tried to put all of that behind us, and to look again at their performance since then. We definitely had a number of people who just really jumped way up in performance in their new roles.

Audience: [inaudible]

Nic Benders: How many failed?

Audience: [inaudible]

Nic Benders:

Boy, I should have prepared for that question. We had turnover, absolutely, and I would be dishonest if I didn’t say that some people were really unhappy with both the change management process, were unhappy with the final result, or unhappy with the teams. Just kind of the whole way that we approached it, and they left. I think that we will still come in at, basically, our kind of annual average. I don’t think that it changed the year-on-year numbers, but we definitely saw a clustering where a lot of people were maybe in kind of a wait-and-see-what-happens. Then they were like Actually, nothing good happened. Then they left.

You know, it’s one of the worst parts of the job. You know that, that there are good people who will not find traction with what you think is a good idea and good outcomes. You’re not going to win them all.

Audience:
If teams owned all of their technology decisions, who owns the space between the teams? Do you have a soup of SOAP and REST and JSON and gRPC, and Thrift, and every other communication technology under the sun?

Nic Benders:
We do not. The space between the teams is owned by the architecture team. In addition to the architecture team, which has always looked after inter-team communication protocols and things like that, where your decision affects others directly, we created a number of communities of practice. Because now, instead of having all of our Java engineers under one director in a department, where they could have their team meetings together, we have them spread out throughout the whole org, so we created a Java community of practice, and they have bi-annual offsite conferences. They have monthly meetings, they get together and they share those practices and they say Well, you know, this is what we’ve learned, these are the emerging technologies that we see as promising.

Again, the architecture team, our job is to be responsible for kind of the interstate commerce clause. Most of that work had been done before the self-selection or the re-org, so we haven’t had to revisit a lot of those decisions, but when that comes up, that is what I have to actually do. Not just get up here and be pleasant.

Flynn:
We have three more we’re going to take. There are a couple others that I think raised hands beyond that. If we don’t get to you, then find me and we’ll get answers.

Audience:
I think you just kind of partially answered my question. You said you have engineering and product standards that everybody has to meet. Who controls those, and who manages their changes?

Nic Benders:
The engineering standards are agreed upon by the architecture team. We set things like This is what your health check has to look like. We are using Thrift for our serialization both in Kafka and over HTTP. Things like that that are engineering standards. We have an RFC-like process internally, called the architecture notes where that’s all documented. One of the big shifts that I didn’t have a chance to kind of talk about was, again, if you want teams to make decisions, teams need information.

Traditionally, a lot of that information is hoarded kind of in the middle of your organization. Your leaders know the business context and they know the strategy, and they know kind of all of these things so that they’re making decisions saying Well, I know things that you don’t. If you’re going to make the decisions, you need to know these. We established a product council who sets our strategy, so this is our SVP of product management and, on down, the VPs reporting to him. As the chief architect, I’m on the product council. Our job is to say These are the most important problems facing the business. These are the things that we absolutely must do. This is what it looks like to be successful in alerts or in dashboards, or in this other market.

Then, the teams, and the PMs on those teams help to turn those into those requirements. The product council is also, outside of what we call those product briefs, which are like This is how dashboards are going to look. That’s also a cross-cutting product standard. Like these are the browsers that everyone has to support. This is how linking between products will work, and those types of things.

Audience:
From what it sounds like, you built these new teams, and everyone moved to the new teams. How was the transition? What were you doing with the existing products and what was running in production, who was taking care of those things, et cetera?

Nic Benders:
We did an extensive mapping as part of the organizational design to find every product, both external and internally facing. Every library, every service, every piece of information. For instance, engineering standards were a product that was attached to the card that said architecture team. We created a big transition sheet to ensure that everything on the origin sheet was landed somewhere on the destination sheet, and we did a lot of audits through that. Including some pretty interesting work that Ward Cunningham did, cross-checking our various documentation systems into a graph database, which I’d love to be able to talk to you about some other time.

That caught most of it. Then what we did is we had a transition period of two weeks in between the self-selection event, and essentially, your new team go live, where you had to ensure that your pager rotation was handed over, that the new members of your team, or the new team who would be handling your service knew how to start and stop and deploy, and roll back all of the systems and things like that. We also dropped a few on the floor, too.

We did our best to get everything, but certainly, reliability had a spike there where people were unfamiliar with, especially, the deploy rollback process, and that caused some crossed wires. I think that we did a lot of prep work over pager and deploy, but we could’ve done more and had an even better outcome.
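The transition-sheet audit described in this answer, checking that everything on the origin sheet lands somewhere on the destination sheet, amounts to a coverage check. Below is a minimal sketch of that idea only. The data layout, the function name `audit_transition`, and every asset and team name except "engineering standards" and "architecture" are hypothetical; the talk does not describe the actual sheets or tooling.

```python
from collections import defaultdict


def audit_transition(origin_assets, destination):
    """Compare an origin inventory against a destination team mapping.

    origin_assets: set of asset names that existed before the re-org.
    destination:   dict of new team name -> list of assets it will own.
    Returns (unowned, contested): assets nobody picked up, and assets
    that more than one team claims.
    """
    owners = defaultdict(list)
    for team, assets in destination.items():
        for asset in assets:
            owners[asset].append(team)

    unowned = sorted(a for a in origin_assets if a not in owners)
    contested = {a: sorted(t) for a, t in owners.items() if len(t) > 1}
    return unowned, contested


# Hypothetical inventory for illustration only.
origin_assets = {
    "collector-service",
    "dashboard-ui",
    "engineering-standards",
    "billing-cron",
}
destination = {
    "architecture": ["engineering-standards"],
    "ingest": ["collector-service"],
    "insights": ["dashboard-ui", "collector-service"],
}

unowned, contested = audit_transition(origin_assets, destination)
print(unowned)    # ['billing-cron']
print(contested)  # {'collector-service': ['ingest', 'insights']}
```

A check like this only catches assets that made it onto the origin sheet in the first place, which fits the observation that a few things were still dropped on the floor.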

Flynn: Last question.

Audience:
After all these changes, do you feel, or your team feels that I need another change? Would you repeat it? Maybe a momentum for some of the folks, I need a change to make me perform better. It’s like the frequency may vary, right? What is the general feeling about it, or the management thinks that we need to repeat it every two years, kind of thing, or what’s the general idea there?

Nic Benders:
This is a pretty hot question. A lot of people asked, originally Well, are we going to do this annually? We know that some companies do it. It was such a production that we would prefer to not. We would like to have a continuous improvement process, as opposed to needing the big kind of spike. What we’ve tried to do is offer periodic retrospective tuning of the organization. I would actually say that a quarterly rhythm of revisiting your team charters and things like that is probably better than the twice a year. Just because those things change so rapidly in the business context.

We also wanted to preserve the feeling of self-selection, where engineers could find a change for themselves. We’re working to try and make that into a permanent process. A lot of this just has to do with reducing friction around internal transfers. I don’t have the answers for this one yet. This is very much kind of an in-flight thing, where we’re trying to find out how to get some of the benefits without so much disruption.

Flynn: Thank you.

Nic Benders: All right, thank you.
