Designing APIs as products, always asking how your consumers will think about your data model.
Hi, everybody. So as said, I’m Nic Benders. I’m the Chief Architect at New Relic and I’m going to talk to you today about applying UX principles to designing the interfaces between services. And along the way, I’m going to talk a little bit about New Relic’s services transition. So these slides are available online. You can get them from my website. And then here’s my Twitter handle so that you can complain about me or ask me questions later. I’ll also put these up at the end of the presentation so you don’t need to grab them down right away.
For those of you who don’t know New Relic, we are the cloud-based software analytics company and so we give visibility to our customers into how their applications and their business is running. And we do this through a variety of agents, which are installed into applications or on servers or synthetic browser checks or things that people use to monitor cloud services. And this all comes together into the New Relic cluster where we collect all that data, we process it, we store it, we make it available for people to query it in a fast, ad hoc manner using our web and mobile interfaces.
So now that you are all experts in New Relic, you have the background necessary, I’m not going to talk any more about what the product does. I’m also not going to give you investment advice. Instead, I want to talk about kind of how we built the product and the decisions that we made as, in particular, we were undergoing our microservices transition, which started about two years ago. So before we talk about the microservices part, or the services part itself, I want to say why on Earth would you do such a thing?
And for us, we started off, actually, with a true monolith. We had a single application that ran our entire business that contained the agent, and the data collection pipeline, and the web interface. Back in ancient times, this was then divided out. And so, really, once we started getting a good clip of customers through, we had two large applications, or two small applications in the beginning.
One was this Ruby on Rails web application, which is the user interface, and another, which is a Java data collection pipeline. As our customer base grew, as our feature set grew, most importantly, I think, as our company grew, the services grew. As the services grew, we started to have problems, which are probably familiar to many of you. And as the services continued to grow, the problems also continued to grow. And after several years of talking about this and trying to figure out what the right answer was, trying a couple of approaches, we decided we have to do something more serious about this.
We need to really tackle this problem. So we said, “What is it that we want?” What we want is we want to have components that have a single owner so that we don’t have to worry about Team A is trying to ship their feature, Team B is working inside the same component trying to ship their feature, which has nothing to do with Team A’s work, and so they step on each other. We want systems that are easy for people to understand. We want to have those dependencies so if Team A’s feature depends on our account data model, we want to know that.
We don’t want to discover, “Oh, I made a change to the account data model and I broke something in the partnership interface.” And because all of those things led to a high maintenance cost for our systems where we didn’t know how hard it was going to be to make any given change, this also led to these issues with a divergent system where small problems could escalate into big problems.
So we wanted to improve the safety of our overall system, so that if you make a bad decision as a developer or as an operator or something weird happens on the internet, instead of blowing up a large part of the system, you should have a failure that is contained. You should have a convergent system. And this involves graceful degradation. When we looked at these requirements, we said, “You know, really the key here is explicit dependencies and allowing independence and autonomy for our teams, we want our teams to work quickly, and we want them to know what it is they are requiring on top of and who is requiring their systems.”
So that is how we came to our services. That was the goal of our services work was to create an organization that supported, more than anything, the growth of the company. The growth of the service complexity itself was manageable, but as we had more and more teams who were trying to all work at the same time, this is where our problems arose. And so this brings us to this, which I believe everyone here has probably seen.
This is Conway’s Law. Conway’s Law maps our software architecture to our organizational architecture. In fact, it dooms us to it. When we think of making changes to our software architecture, the place where we had to start was to make changes to our organizational architecture. If we want to have loose coupling between software components, we want to have clear dependencies, and long-term ownership, then we have to create an engineering organization structure that has loose coupling and has these clear boundaries and ownership.
So we embarked on this great services transition about two years ago this month. And some things went well and some things didn’t. And talking all about the stuff that went well is very boring. And so we’re going to talk about the things that didn’t go well. There’s lots to be learned there and the scars still seem fresh. The first thing that…whoops, that was very surprising to me. Oh, well anyway. The first thing that we learned, which is not this slide, was that everybody was really into services.
And so once we told people, “We’re going to create a service architecture,” and everyone jumped out and built lots and lots and lots of services. And as part of that, we had decided, we say, “Well you know what? I think that the services architecture thing could really take off, so we will probably have dozens of services.” And I believe our architect at the time said, “Oh, my friend, it could be 100,” and everyone said, “Oh, that’s bullshit. There’s no way it’s going to be 100.” It’s obviously north of 200 now. So we said, “We need a way to deploy these faster.”
We chose Docker. Keep in mind, we chose Docker in January of 2014, which was perhaps ambitious. But it had all of the characteristics we wanted. It isolated the experience of developing your application from the maintenance of that service. Before this point we had been rewarded with great efficiency and good uptime when you really understood everything you were running. Now we were saying, “Hey, operations group, instead of running two or three services, you’re going to run 50 or 70 or 200 services.” And so they needed to have a much more standardized interface on top of that to let them do that.
Unfortunately, when we did this, we didn’t really pay attention to our developer experience. So developers were very excited about Docker. They’d heard all the best things about it on Hacker News and everybody said, “Great, we’re going to Docker everything we can find and we’re going to microservice the living daylights out of this architecture.” And so most teams had then launched, by the time that we had finished announcing that we were going to have services, they all had two or three services in development.
And our tools were not there. Our tools had barely started to be put together. And worse, our tools were very operations-centric. The people who were building the tools looked at the problems they were trying to solve. I was in Production Operations Group at that time, and we fancied ourselves to be rather handy with Docker. We said, “Oh, we’re quite good with this. Here’s how you use it. Developers, you’re smart people. Have fun.” And some developers had fun and most developers did not have fun.
They were smart people and they knew nothing about Docker. And it changed their metaphors. So the teams had been used to deploying their software using Capistrano, so it SSHes out to a server, it copies files in place, and it starts them. And if you need to debug that, you SSH to the same server, maybe use some handy cap scripts. Maybe you just go yourself and you just look at the process, and you look at the log files, and you look at the file system. And now we had taken that away from people, and had not realized that this was what they did when they considered themselves to be doing their jobs.
We said, “Well, no. Why would you go to the log files? You should be sending it to the log service.” They said, “Well, we hadn’t been. I’ve always done my job by SSHing to the machine.” And we didn’t really appreciate this when we rolled out this tool. And so within a few months, the developer experience of creating software at New Relic had gone downhill significantly. We also found, quite surprisingly, that the performance of our application had gone down quite significantly.
In hindsight, it’s rather obvious. There was a lot of finger pointing about JSON marshalling or Thrift marshalling, or things like that, which is largely beside the point. The issue is that previously, a single application could run through a set of code in a single thread that would do something like call a database, which is very close to me, make another database call, merge these results, talk to this piece of logic, talk to this piece of logic, send it back to the customer. Today, once we made our services change, you came in the top and now you say, “Well, I’m going to use services, so I need to make an HTTP call here and a call here and a call here and a call here.”
And we’re doing this from a Ruby on Rails application, which is really kind of mono-task focused, very, very good focus. The Rails app ended up taking much longer to perform the same task as before. And we spent a long time having to patch over this with caching and various things because we didn’t understand, again, that when we had handed this new set of tools to people we had said, “Great, you can use services,” that it would change the way they needed to write their software and the way they needed to deploy their software.
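To make the latency point above concrete, here is a minimal sketch in Python rather than Ruby. It is not New Relic's code: the service names, the 50 ms latency, and every function name are assumptions chosen for illustration. It simulates a page that needs several downstream calls and compares issuing them one after another with issuing them all at once. The talk says the team patched the problem with caching; concurrency is shown here only as one way to see why sequential calls add up.

```python
import asyncio
import time
from typing import Callable, List, Sequence, Tuple

# Hypothetical per-call latency for every downstream service, in seconds.
SIMULATED_LATENCY = 0.05


async def call_service(name: str) -> str:
    """Stand-in for one HTTP call to a downstream service."""
    await asyncio.sleep(SIMULATED_LATENCY)
    return f"{name}-result"


async def render_sequentially(services: Sequence[str]) -> List[str]:
    """One call after another: total time is roughly the sum of the latencies."""
    return [await call_service(name) for name in services]


async def render_concurrently(services: Sequence[str]) -> List[str]:
    """All calls in flight at once: total time is roughly the slowest call."""
    return list(await asyncio.gather(*(call_service(name) for name in services)))


def timed(render: Callable, services: Sequence[str]) -> Tuple[List[str], float]:
    """Run one of the render functions and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(render(services))
    return results, time.perf_counter() - start
```

With four services, the sequential version waits for about four latencies while the concurrent one waits for about one, and both return the same results in the same order.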
But worst of all, we were still in the monolith. After creating another 75, another 200 services, many of our new features still required work in one of those two original code bases because we hadn’t thought through the process of what exactly is it that leads you to write a line of code into our UI app or it leads you to add a line of code into our data collection app. And so we had built lots of new functionality, lots of great things. The giant balance of code was now outside of these systems, all of these teams working outside of these systems.
But for many, many tasks, you had to go back into the original monolith and make your modifications there. So let’s take a step back and say other than you, Nic, what is the common element in all of these failures? And other than me and my skills with the keynote, I believe the element is that we weren’t thinking like a designer. We were thinking like an engineer and we were starting with what we had. We were starting with our data models. The very first service we wrote was an account service where we took the account data model and we said, “Everyone needs this in order to do their job so we’re going to expose out our data model to services over a service boundary.”
We should have been thinking about this like a designer. At New Relic, our designers have a good philosophy. They start with what is the problem that the user is trying to solve? And then they look at how the product they’re building will help them solve it. And then they test this against real users. We can do this, too. Just because we’re not pushing pixels doesn’t mean that we aren’t designers. If we start with what the problem the user is trying to solve is, the problem that our users of our development tools are trying to solve is “I want to deploy and then debug my application.”
How are they going to use the tool that we’ve given to solve that? Does it make sense? Does it help them do the task that they wanted to do or does it help them do a task that I thought they might want to do? And the same thing when it comes to APIs. Our first mistake that we made on that very first service was that we started with the data model. We didn’t start with the outside of that system. By starting with the data model, we looked at what we already had and we exposed it out in a way that made sense to our new architecture. But it didn’t actually solve the problem of our users.
Our users, the developers who are our customers as we make platform tools or development tools, they wanted to know real questions like “I have a user session and I need to know what their account ID is, what the name of their account is so I can display it on a page, what the user’s email address is so I can display it on a page, and what their product levels are.” These are very concrete tasks and aren’t related necessarily to the model as it exists in the database, but instead are related to how our developers are trying to solve a problem.
We also have to keep in mind that the design of the system changes the way people use it. A system that makes it easy to build new services will encourage your users, developers, to build many services. A system that makes it easy to report your log centrally will get all those logs in centrally. And so as we are making these design decisions, we’re not just trying to solve those existing problems, but we’re trying to steer the long-term usage of this system by making some tasks easy, and some tasks we might just not invest in, or we might attempt to aggressively make hard doing things like fault injection.
If I want to make sure that everybody is designing their database clients so that they can deal with a flaky network connection, then I should make the network connection flaky in that test environment and make sure that people suffer for doing the wrong thing. And so by changing the design shape of our system, we can encourage users to follow a particular pattern of behavior without mandating it. We can still allow them to move over the edges, but if we make that easy path the right path we will get more people onto it.
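The fault-injection idea can be sketched as a wrapper that makes a connection deliberately flaky in a test environment, so that a client without retry handling fails early and visibly. This is an illustration only; `FlakyConnection`, `query_with_retry`, and the failure-rate parameter are hypothetical names, not tools described in the talk.

```python
import random
from typing import Any, Callable


class FlakyConnection:
    """Test-environment wrapper that drops a fraction of queries on purpose."""

    def __init__(self, query_fn: Callable[[str], Any], failure_rate: float, seed: int = 0):
        self._query_fn = query_fn
        self._failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so test runs are repeatable
        self.injected_failures = 0

    def query(self, sql: str) -> Any:
        # random() is in [0, 1), so a rate of 0.0 never fails and 1.0 always fails.
        if self._rng.random() < self._failure_rate:
            self.injected_failures += 1
            raise ConnectionError("injected fault: connection dropped")
        return self._query_fn(sql)


def query_with_retry(conn: FlakyConnection, sql: str, attempts: int = 5) -> Any:
    """The behavior the flaky environment is meant to encourage in clients."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return conn.query(sql)
        except ConnectionError as err:
            last_error = err
    raise last_error
```

A client that calls `conn.query` directly will break in the test environment; one that goes through something like `query_with_retry` will usually not, which is the nudge toward the right path without a mandate.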
And so when we talk about design, again, I mentioned that account service, our mistake in the account service. Our mistake was starting in the center. We were doing an inside-out design. We were starting with what we had and moving outwards, add our business logic onto that, and outwards add an API that makes sense for the business logic and the data you have. Seems very logical. Unfortunately, what it leads to is you’re doing the API last.
By the time it came around to do our APIs, we’d already made a lot of the decisions that would force the shape of that system. If we can do our APIs first, then we get a nice, elegant API and then it’s left up to us to figure out how on Earth we’re going to support that in the underlying system. This is similar to test-driven development, where one of the features people often talk about is that it leads to a well-testable [inaudible]. Of course, because I’m lazy and so when I write a simple test first, I’m going to go in later and build an implementation that works with a simple test so I don’t have to go back and make a really complicated test.
To do this with APIs, we want to design that API before actually building the service. Luckily, there’s a lot of techniques that are available to us. I often start, when I’m doing this, with just writing down some pseudocode. I just sit in a text window and I just pretend that I’m programming, which as Chief Architect is about as close as I get. I write some method calls for a system that doesn’t exist. I’m like, “Well how does that look? It looks awkward. Let’s change the arguments.”
And you kind of come up with the ergonomics on there. We can write the documentation first. So Swagger was mentioned. There’s tools also, API Blueprint, these ways for building docs for APIs that you haven’t built yet. This is a great technique. It lets people argue over the interface. It lets you show somebody, “Here’s my documentation. Here’s the API reference. Here’s some examples I wrote. Can you work with this?” And get your users to sign off on something when it’s still relatively cheap for you to make those changes.
And the next level is, when you do start implementing the service, start with some of these just really bogus stubs, like a method that always returns the same string. It’s fine. Really, do everything you can to drive your risk out of the system first. And your risk is not that your implementation is going to be hard. Your risk is that you will create a system that is hard to use or that encourages poor behavior. We go back to our principles of design. We see how these apply here.
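As a sketch of the API-first approach with a bogus stub, the example below takes the consumer task described earlier in the talk (given a user session, get the account ID, account name, user email, and product levels) and writes the interface and a canned implementation before any real service exists. All names, fields, and values here are hypothetical, chosen only to show the shape of the technique.

```python
from dataclasses import dataclass
from typing import Protocol, Tuple


@dataclass(frozen=True)
class SessionContext:
    """What a page actually needs, not the shape of the database tables."""
    account_id: int
    account_name: str
    user_email: str
    product_levels: Tuple[str, ...]


class SessionContextService(Protocol):
    """The interface consumers review and sign off on before anything is built."""

    def context_for_session(self, session_token: str) -> SessionContext:
        ...


class StubSessionContextService:
    """Bogus stub: ignores its input and always returns the same canned answer."""

    def context_for_session(self, session_token: str) -> SessionContext:
        return SessionContext(
            account_id=1,
            account_name="Example Account",
            user_email="user@example.com",
            product_levels=("example-product:pro",),
        )


def render_header(service: SessionContextService, session_token: str) -> str:
    """Consumer code written against the interface while the service is a stub."""
    context = service.context_for_session(session_token)
    return f"{context.account_name} ({context.user_email})"
```

If `render_header` feels awkward to write, that is a signal to change the interface while the change is still cheap.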
Start with the problem. Look at what you can do to solve that user’s problem. And then test it. We enjoy, as engineers, an unfair advantage here versus our friends in the design world because our customers work at the same companies we do, generally. And you can just go and sit there. And we don’t have to trick them with Amazon gift cards or something so we can watch how they work. But think of it the same way. If you were doing a UX experiment, you would find some likely customer or current customer.
You would cajole them into spending an afternoon with you and ask them to think out loud as they try to solve tasks. Like, “Hey, could you create a new synthetics check. Tell me what you’re thinking as you’re doing this.” Try that same approach when you’re building an API. Ask somebody who’s on that client team who’s going to consume your service, “What would it be like to go and write a client to create a new system via this API? Think out loud for me as you type. Show me where you would go to find documentation. Show me the types of tooling that you use.”
You really want to understand the user so that you’re building the right system for them. Of course, another thing that we can do from the design world is we can steal. When we think about what makes an elegant interface, the best place to start is often “What are the current interfaces that I enjoy using? Do I like the GitHub REST API? Do I like the command line interface for Git? Do I like the way that Apiary works?” Find systems that make sense to your users.
So if you’re building a Ruby application or you’ve got Ruby developers, find the systems that they like, that will feel natural to them as examples. If you’re building an operations tool, talk to your ops people. Ask them, “Hey, what are the tools that you really enjoy using?” Understand what makes those interfaces work. And then not only do you have the opportunity to copy off of someone who has built a great system ahead of time, but you can start by building a system that already makes native sense to the people who you want to be your audience.
This is another form of understanding your user. And that’s the UX principles. That’s how we can bring them into services. It’s not necessarily a textbook system. You can’t go through and say, “Well, UX tells me I should always use Thrift or I should do this or I should do that.” But we’ve found that many of our architectural arguments and discussions can be settled if we look at them from the standpoint of the consumer of the service and what they’re trying to accomplish as opposed to looking at it from the construction of the service itself and what the service is trying to implement.
And I’m under time so we will have plenty of time for questions.
Nic: So the question is what’s an example of one of the smallest services that we’ve built where it would be right on that dividing line where you say it might not be useful to build a service to perform this task at all because of the latency? We have, I think, several hundred services live. I think a good example might be we have some cache brokers that really are just frontends onto [inaudible] cluster. And so the reason why we have a service there instead of talking directly into the cache system is to ensure that we can maintain that same interface contract as we switch back ends.
This is a borderline case where the customer, in this case the client application, is trying to cache something. So some places in our architecture, we’ve said it’s all right for the consumer to talk directly to a cache. And in other places in our architecture, we’ve wanted those calls to have an intermediary.
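The cache broker idea, a thin service whose only job is to keep the interface contract stable while the backend changes, might look roughly like the sketch below. The class names and the in-memory backend are hypothetical stand-ins; the talk does not say how the brokers are implemented.

```python
from typing import Dict, Optional, Protocol


class CacheBackend(Protocol):
    """Whatever cache cluster currently sits behind the broker."""

    def get(self, key: str) -> Optional[bytes]:
        ...

    def set(self, key: str, value: bytes) -> None:
        ...


class InMemoryBackend:
    """Stand-in backend so the sketch runs without a real cache cluster."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._data.get(key)

    def set(self, key: str, value: bytes) -> None:
        self._data[key] = value


class CacheBroker:
    """The contract clients depend on; the backend behind it can be swapped."""

    def __init__(self, backend: CacheBackend) -> None:
        self._backend = backend

    def fetch(self, key: str) -> Optional[bytes]:
        return self._backend.get(key)

    def store(self, key: str, value: bytes) -> None:
        self._backend.set(key, value)

    def switch_backend(self, backend: CacheBackend) -> None:
        # Clients keep calling fetch/store; only the broker knows about the change.
        self._backend = backend
```

The cost is one extra hop on every call, which is why the answer above treats this as a borderline case.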
Nic: So the question is with engineers being engineers, how do we build organizational inertia around design thinking? For us, we’ve always considered ourselves to be…we have a strong product culture so it’s something that has been ingrained in many of our engineers… [audio cuts] start with. Jim Gochee, who is our Chief Product Officer, we have a little iOS app that has an animated version of his face that just says, “What’s the problem you’re trying to solve?” That type of thinking is deeply ingrained.
My advice to people is that engineers have this mindset. They get kind of tricked away from it when they start getting into the details and they really want to build something. But it’s a problem-solving scenario. Engineers love to solve problems. So if you can make sure that the communications channels exist so that your teams implementing things are close to the teams who use them, then make sure that those using teams feel like they are allowed to give feedback.
Then people will dive into this. One of our earlier missteps that I mentioned around that early rollout on Docker was largely because it was built in a silo and then the teams who were using Docker, who were unfamiliar with it, they would get frustrated and they wouldn’t tell anyone. And so we wasted a lot of time with people being angry with a solution that could have easily been fixed, but we didn’t know that it was broken because we hadn’t created a formal environment.
We didn’t have a forum for our consuming development teams to talk back to the production operations team. We had assumed most people are friends, everybody’s here in the same building, someone’s going to talk to us. But you actually do have to give that indication. So much the same way, you’ll see for UX feedback, people will use Intercom or something like that to get in there. Users could always email you, but if you can get to them a little bit earlier and solicit that feedback from them, then you can get to them before they become completely enraged.
Nic: So the question is even without microservices, you’re still creating APIs for your classes or for your code, and you should be assuming that there’s someone other than yourself who’s going to be using these, and you should be being kind to your fellow engineer, possibly your future self. This is absolutely true that microservices did not invent the API. However, what we found is that those in-product interfaces, people considered them to be too easy to change. To say, “Well, I built it this way and then I’ve revised it five times today to get the system that we want.”
People also often felt that they were the only ones who were going to call them. That even though it was an API that might have a public visibility on it, it’s really code that I wrote, it’s part of my system, forgetting about being nice to their future selves or to other members of their team, let alone to other teams. Once we broke the services apart, it became much clearer for people where the boundary is between the space in which they are permitted to make bad decisions and the space in which you really want to be sure about what you’re doing.
And so I think it really comes down to making dependencies and interfaces explicit instead of implicit. And we could have approached this using programming language functionality, using code reviews, and things like that. But by matching it to our service boundaries, it made it much more natural for people.
Before then, we weren’t doing a formal API method. We would use Javadoc, we would have people write their readme sections in their documents, but it wasn’t something that you could test against in a formal way. And as far as API complexity, I think that by and large our APIs have been very simple. There are a couple of them that have pretty tricky constructs, and we certainly, completely at my own encouraging, dove two feet into the thick client approach for many of our systems, for which I have now been scolded by Ben.
So I do think that that was a mistake, is that early on we tried to draw our service boundaries, not necessarily at the network level. And we’ve become better at that, although in an asynchronous system, many of our service boundaries actually occur via Kafka. And so when I define an API and I’m thinking about who my consumers are, I’m really thinking about I’m going to place some messages into a Kafka topic and someone who I’ve never met is going to pull them off and get business value out of them. And so this is the type of interface that we are now defining, for the most part, as opposed to synchronous interfaces.
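When the interface is a message on a Kafka topic, the contract is the message itself. The sketch below shows one way to make that contract explicit: a versioned event type with an encoder and a decoder that a consumer you have never met could rely on. The event name, its fields, and the choice of JSON are assumptions for illustration; the talk does not describe New Relic's message format, and no Kafka client is used here.

```python
import json
from dataclasses import asdict, dataclass

# Version of the contract; bumped only for changes that would break consumers.
SCHEMA_VERSION = 1


@dataclass(frozen=True)
class CheckCompleted:
    """Hypothetical event; the fields are illustrative, not a real schema."""
    account_id: int
    check_id: str
    duration_ms: float
    succeeded: bool


def encode(event: CheckCompleted) -> bytes:
    """Producer side: the bytes that would be placed on the topic."""
    payload = {"schema_version": SCHEMA_VERSION, **asdict(event)}
    return json.dumps(payload, sort_keys=True).encode("utf-8")


def decode(raw: bytes) -> CheckCompleted:
    """Consumer side: reject unknown versions, ignore fields added later."""
    payload = json.loads(raw.decode("utf-8"))
    version = payload.get("schema_version")
    if version != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version: {version!r}")
    known = {name: payload[name] for name in CheckCompleted.__dataclass_fields__}
    return CheckCompleted(**known)
```

Ignoring unknown fields lets the producer add data without coordinating with every consumer, while the version check makes a breaking change fail loudly instead of silently.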
Austin: Question from online. How did you actually approach the problem of dividing your monolith up into microservices?
Nic: So the question is…oh, so you’ve got the mic so is that recorded?
Austin: In theory they have it.
Nic: In theory, great. So how did we approach the problem of dividing our monolith? With terror and great trepidation. The monolith, you know…
Austin: Makes sense to me.
Nic: …started in 2007, so by the time we started seriously trying to break ground on our services, it was 7 years in. It has a lot of code. It has a lot of mixed concerns. So what we initially tried to do was to take those data items, as I said, pick the pieces of data that were the key integration points and break them out. I believe now that this approach was wrong. I think that the right approach is the one that we are now following, which is to look at teams who are trying to build new functionality and look at what they need to do that.
And so when we built the browser monitoring product, when we built the synthetics monitoring product, these are new product lines that were constructed outside of the monolith. And we can study those systems and say, “Well, I need this piece of data in this shape.” And so then we can give people the shape they need. And we discover things like, “Oh, well we need authentication is actually the most important thing, not account management.” So okay, well we built an authentication system.
And so we looked at that as use cases around building new functionality, instead of as a here’s the monolith, let’s start carving it up because it’s good to have a carved up monolith. This gave us a guide that we could use to evaluate the effectiveness of our operations.
Austin: One more. You mentioned earlier that as you switched into this, you found that there were cases where your developers did not have the tools that they needed for your brave new world. Can you give a couple of examples of the nastiest ones that you ran across there?
Nic: Yeah, absolutely. So our development tools, like I said, in the initial wave came up quite a bit short. They’re quite a lot better now thanks to the tireless efforts of our dev tools team. That’s a piece of advice that I would give to everyone is to have a dev tools team. This is a substantially different task than our previous production operation support. And so when we had those tasks deeply mingled, they would always starve each other out.
So having a dedicated developer experience team really has helped us. I would say the nastiest ones for us were that change in mental model, moving from the Capistrano deploys, SSH-based, into the Docker container ones. The container world is substantially different than the process world. And you need to really be sure that people understand what’s different about it and that you give them a very, very clear path through logging, in particular, process management, and alerting for things that are unique to the container as opposed to normal system level things.
And so that’s like how much memory is left in your cgroup before you trip the wire? That type of thing is very important. And before we had good visibility there and a good, easy to reproduce story for logging, many projects didn’t do a good job with those because they’re hard to solve individually.
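As a rough sketch of the memory-headroom question, the code below computes how far a container is from its limit using the cgroup v2 files `memory.current` and `memory.max`. It assumes cgroup v2 mounted at `/sys/fs/cgroup` inside the container; cgroup v1 uses different file names, and the talk does not say which New Relic's tooling read. The function names and the 90 percent warning threshold are hypothetical.

```python
from pathlib import Path
from typing import Optional, Tuple


def parse_cgroup_value(text: str) -> Optional[int]:
    """cgroup v2 writes the literal word 'max' when no limit is set."""
    value = text.strip()
    return None if value == "max" else int(value)


def memory_headroom(current_text: str, max_text: str) -> Optional[int]:
    """Bytes left before the limit is hit, or None if the cgroup is unlimited."""
    limit = parse_cgroup_value(max_text)
    if limit is None:
        return None
    return limit - int(current_text.strip())


def is_close_to_limit(current_text: str, max_text: str, warn_fraction: float = 0.9) -> bool:
    """True once usage reaches the chosen fraction of the limit."""
    limit = parse_cgroup_value(max_text)
    if limit is None:
        return False
    return int(current_text.strip()) >= warn_fraction * limit


def read_container_memory(cgroup_dir: Path = Path("/sys/fs/cgroup")) -> Tuple[str, str]:
    """Read the raw file contents; the default path assumes a cgroup v2 mount."""
    current = (cgroup_dir / "memory.current").read_text()
    maximum = (cgroup_dir / "memory.max").read_text()
    return current, maximum
```

Inside a container this would be used as `is_close_to_limit(*read_container_memory())`, with the result fed into whatever alerting the team already has.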
Austin: Thank you. There was one other question in the back, I think.
Nic: Ah, so we’ve taken a couple of different approaches at this. Oh sorry, the…how do we understand the architecture of the entire system both for our individual developers and for the architects and the other people trying to make decisions on this? So I should slip you a 20 or something. I should have talked about this.
Austin: You need to hold him to that.
Nic: As Adrian says, there’s a product we could use: we have a piece of functionality for this. So as we moved into this service world, we realized that many of our customers were already in this world or would be soon, and that our existing application-centric view of the world, where the way you use New Relic was always “I know there’s a problem in the UI, so I go to the UI app and I look at it, or I know there’s a problem in this app,” no longer fit.
And often today, you don’t know where the problem is. And so we’ve changed our views in our product to surface problems in a broader area, allowed more querying across things, which has been very helpful for us. We’ve also introduced a service map, which many of our teams build. So our teams that run a set of services that make up a piece of functionality will build that service map that’s restricted to their scope.
So we have a diagram that contains every service. It’s called the spaceship view. I didn’t include it in our deck. It’s almost illegible at any practical size. But every one of our teams can look at their own components and say, “Well, I depend on these things from the outside and these things depend on me. And then inside of this are a set of services that I control and here’s the monitoring view from those.” That’s the primary way we do it. The predecessor to that was a documentation tool, which actually generates these pretty circles that you see on the screen.
And so that was a way for before we were able to automatically detect this in the product, that you could go and make documentation changes to explicitly call out your dependencies.
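The documentation-driven predecessor described above, where teams explicitly declare their dependencies, can be sketched as a small data structure plus a per-team view. The service names and the `team_view` function are hypothetical; this is not the New Relic tool, only an illustration of deriving "what I depend on" and "what depends on me" from declared dependencies.

```python
from typing import Dict, List, Set

# Hypothetical declarations: each service lists the services it calls.
DECLARED_DEPENDENCIES: Dict[str, List[str]] = {
    "web-ui": ["auth", "accounts"],
    "synthetics-api": ["auth", "check-scheduler"],
    "check-scheduler": ["accounts"],
    "auth": [],
    "accounts": [],
}


def team_view(owned: Set[str], dependencies: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Restrict the full map to one team's scope."""
    # Outside services that the team's own services call.
    depends_on = {
        dep
        for service in owned
        for dep in dependencies.get(service, [])
        if dep not in owned
    }
    # Outside services that call into the team's services.
    depended_on_by = {
        service
        for service, deps in dependencies.items()
        if service not in owned and any(dep in owned for dep in deps)
    }
    return {"depends_on": depends_on, "depended_on_by": depended_on_by}
```

A team that owns `synthetics-api` and `check-scheduler` would see `auth` and `accounts` as outside dependencies, while the owner of `accounts` would see `web-ui` and `check-scheduler` depending on it.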