Bringing Learnings from Googley Microservices to gRPC – Varun Talwar

Varun Talwar (Google)


Varun Talwar, product manager on Google's gRPC project, discusses the fundamentals and specs of gRPC inside of a Google-scale microservices architecture.

Presentation Slides


Austin Gunter: Up next, Varun Talwar. He's a product manager on the gRPC project, and he's gonna googley-eye some microservices for you. If you guys want to switch his computer on, that'd be awesome. Take it away.

Varun Talwar:
All right. Thank you, Austin. Let people come in. Hi, my name is Varun. I'm from Google. I'm here to talk about some learnings from microservices at Google. Just a caveat, it's not covering the entire gamut of microservices problems. I'm touching on one aspect of it and will hopefully provide some insight, some learnings, and some actionable ways for you to take it to your organization.

All right. I'm gonna do a quick context. I think Austin started with it, I'll do like a couple of minutes on it. Primarily I want to focus on these two things. I think Matt touched upon something called Stubby. That's a framework we have used, internally, for all the service calls at Google for the last 15 years. There are some learnings, which I hope to share. This is not all. It's not a comprehensive list, but some things, which might be useful. The second half is how we are bringing some of those learnings into our open project called gRPC, which some of you may have heard about. How many people have heard about gRPC? That's about half the room, which is encouraging.

All right. So quickly, why are we here? This is sort of going into a little bit of the end goal and this notion of sort of up-streaming of why people are even interested in microservices. I think people want, you know, the agility and the resilience, as came out in the surveys of all these internet companies. Why is it that certain companies are able to operate faster, roll out stuff faster, are more productive? I think that's one of the end goals we are all here after.

This is another important end goal I think we are after. Stubby and gRPC both play into it, which is making our costly resources, our developers, extremely productive. The more we can take away some of the heavy lifting from them, yet give them the knobs of flexibility, the better. Matt spoke a lot about this, which is give them the stats and the observability. The more trust they have, the more productive they can be. I think that's also one of the end goals of all this.

Performance, again. At Google, scale becomes a larger problem, but I think it's a problem anyway, and it is directly connected to how customers perceive you. We had some interesting stats, which is even a 400 millisecond delay can give you like a 0.4% impact on search volume and search revenues. If you buffer for more than 0.5 seconds, four out of five users will leave. There are lots of these stats, which have been observed across a lot of our properties, where performance is critical to us and in general to people. When we talk performance, it's not just what the user observes, it's about service performance, network performance, resource utilization, et cetera. All of that.

I think I already said, Stubby is one of these RPC frameworks. RPC, by the way, stands for Remote Procedure Call. It's a framework that was written back when Google started. It's the back end of every Google service that you know of. Every call that comes from all of Google's services goes into what we call GFE, or Google Front End. After that, it is all Stubby calls for all services of Google. It's a pretty massive framework. Not all of it has come into the open, but pieces have started to come. We believe it's one of the core pieces that enables Google to get some of those advantages I spoke about, and can be one of the core pieces for service communication, service building, as people start to build their microservices in their own organizations.

Our scale is, of course, humongous. You're not reading the number wrong. This is a conservative estimate of what we do today. Tens of billions of RPCs per second, across the Stubby framework, is a very natural event. It happens every day. If you think about our product line, it doesn't sound that crazy. The way it works, from a developer productivity point of view, is that every Googler who comes in has this framework available. All they need to do is define their data types, define their service contracts, and a whole lot of magic around monitoring, load balancing, health checking, tracing, it just happens for them. They don't have to care about it. Right? That is, I think, one of the things we have seen as a big value and we hope to bring out with gRPC, as well.

This is interesting. Google doesn't run without memes. For people who are new to it, it has this magical feeling of like, Oh. I don't even have to worry about it. Right? I think that's what we are after, where a lot of heavy lifting is taken care of for you, and yet you have the knobs when you need them.

How does this happen? In terms of analogies, if people ... One of our open goals at Google is to bring some of these technologies out in the open. You've probably heard of Kubernetes, which is a Borg equivalent. gRPC is the equivalent for Stubby, and we are on our way to give that out in the open, as well. My focus for the next few minutes is about giving some lessons out of Stubby. I'll go into each one of them, some in more detail than the others, and hopefully these are insightful for you. Okay, let's get into it.

If you think about how people, with the explosion of browsers, apps, World Wide Web, I think this notion of REST and HTTP JSON just exploded as the way to do APIs, and rightfully so. There was debuggability, usability, nouns based, easy to understand. All sorts of tooling came with it. JSON is easy to read and debug. The ease of use of REST and HTTP JSON is great. We actually have a whole lot of Google APIs, which are exposed as REST. We think it's great, but it may not be great for certain use cases when it comes to internal APIs, when it comes to more stateful ones. Think storage. I think this is where we found that a more structured framework, a framework which has tighter contracts, is more efficient on the wire, and gives people language bindings so people can work in their languages of choice, was actually not only more efficient from a performance point of view, it also made people more productive. It also made people less error prone because of tighter contracts.

We don't try and say this is REST versus RPC, or something, but think of service use cases where you're in a trusted environment. People want to work in different languages. Polyglot is very much a reality. There's a whole lot of state you know about, which service is calling which other ones. One of the other things, which is super important, is versioning and changing of APIs, so evolving. I have one particular service, which is evolving faster than the other one. I want to roll it out more frequently and make sure I don't break contracts, I don't break dependencies. It sort of goes back to the resilience theme. In classic REST models, especially when you have added a new data type that is requested, or you've changed the type of a given data field, things start to get tricky, especially in polyglot environments where one language reads a given field as an integer and the other reads it as a string. It becomes hard to keep it backward compatible and make sure all kinds of clients are not breaking as individual services start to evolve.

Also, from a pure compute and network perspective, having text JSON in the wild is not great from a [inaudible 00:10:29] perspective and from a cloud bill perspective. That was sort of the first thing, in terms of evolving us to a more RPC framework. The second, which I think is probably one of the top learnings from Stubby, is this notion of a strongly typed IDL, what we call an interface definition language. A lot of you have probably heard of Protobuf, Protocol Buffers, being open source since 2003. It's a popular notion that this is the data language that Google uses. Essentially, the notion is up front, agree on your IDL, agree on your data semantics, and now with gRPC, your service semantics.

The way it works is you basically define your data types, up front, in a language agnostic format called Protos, Proto files. There's tooling available to generate code in various languages. With the open source version, there's a whole lot of languages supported. Basically at that point, once you've done the code gen, teams and developers can go off and start operating in their languages, in their environments, with stubs that have been generated out of that IDL, and just use them. All right? Essentially, for any new developer who comes in, it's easy to see, Oh, this is the data schema. I can do code gen and start off on my own front end JavaScript. Or, I'm a Python developer who joined. It's easy for me to just do code gen and start in my own language.

What it does is establish stronger, stricter contracts. The other thing that you will see with Protobuf is the syntax, and we'll get into that maybe in the next slide. There's a joke around, at Google, which is like we hire the smartest engineers and just make them do Proto files. It is of course exaggerated, but it sort of makes the point of how far we have taken the notion of easing out the work for the developer. We actually want them to define the schemas of all of their data and services. A lot of things are taken care of for them.

If you've never seen a Proto file, this is sort of what it looks like. Essentially, this is a very, very straightforward example just to convey a point. You have a service called weather. It has an RPC, which takes in a request and returns a response. Your request and response are basically message types, which can basically have both primitive and custom data types defined inside. You'll see that they are incrementally numbered. Coordinates, which has latitude 1, longitude 2, and so on, for response as well. You can have a simple nesting. I could refer to other types, as well. Basically, what this notion of numbering gives you is, let's say I were to evolve this service to say, Oh. By the way, now in the response type I am going to give you another field, which is format, which is equal to three. Right?

In a world of Proto, it is not ... Previous clients, which would not send you the new data type, would not break. Basically, the incoming service would just ignore the fact that the new field doesn't exist, and be able to parse that request and process it. This notion of forward and backward compatibility of your interface APIs is extremely critical to letting services evolve.
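To make the numbering point concrete, here is a minimal sketch in Python. It is a toy tagged encoding, not the real Protocol Buffers wire format, and the field names `temperature` and `humidity` are assumptions for illustration; only the new `format` field with number 3 comes from the talk. The idea it shows is that a reader skips field numbers it does not know, and treats missing ones as absent.

```python
# Toy illustration of numbered fields (NOT the real Protocol Buffers wire format).
# A message on the "wire" is a list of (field_number, value) pairs.

# Schema known to an OLD reader: field number -> field name (names are hypothetical).
OLD_RESPONSE_SCHEMA = {1: "temperature", 2: "humidity"}

# Schema known to a NEW reader, which added field number 3.
NEW_RESPONSE_SCHEMA = {1: "temperature", 2: "humidity", 3: "format"}


def encode(message, schema):
    """Turn a dict of named fields into (field_number, value) pairs."""
    number_by_name = {name: number for number, name in schema.items()}
    return [(number_by_name[name], value) for name, value in message.items()]


def decode(pairs, schema):
    """Rebuild a dict, silently skipping field numbers the reader does not know."""
    return {schema[number]: value for number, value in pairs if number in schema}


# A new server sends the extra field; an old client still decodes what it knows.
new_wire = encode(
    {"temperature": 21.5, "humidity": 0.4, "format": "metric"}, NEW_RESPONSE_SCHEMA
)
old_view = decode(new_wire, OLD_RESPONSE_SCHEMA)

# An old server omits field 3; a new client simply sees it as absent.
old_wire = encode({"temperature": 21.5, "humidity": 0.4}, OLD_RESPONSE_SCHEMA)
new_view = decode(old_wire, NEW_RESPONSE_SCHEMA)
```

Because fields are identified by number rather than by position or name, neither side has to be upgraded in lockstep with the other.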

The other big advantage that Proto gives you is carrying binary on the wire. Basically, in terms of your CPU utilization, network utilization, data itself, you will see anything from 2x, 3x, up to 10x improvements, just by the fact that you're not carrying JSON, but binary on the wire. Right? This, of course, is being largely used across lots of systems. Storage systems in particular see a tremendous benefit from having a service based on this.
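As a rough feel for the size difference between text and binary, here is a small Python sketch. It uses the standard library `struct` module as a stand-in for a binary encoding, so it is not Protocol Buffers and says nothing about the exact ratios quoted in the talk; the coordinate values are made up.

```python
import json
import struct

# Hypothetical payload: one pair of coordinates.
coordinates = {"latitude": 37.4220, "longitude": -122.0841}

# Text encoding: field names and digits are all spelled out as characters.
json_bytes = json.dumps(coordinates).encode("utf-8")

# Binary encoding: two little-endian 64-bit floats, fields identified by position.
binary_bytes = struct.pack("<dd", coordinates["latitude"], coordinates["longitude"])

# The binary form round-trips to the same values.
decoded = struct.unpack("<dd", binary_bytes)

print(len(json_bytes), "bytes as JSON vs", len(binary_bytes), "bytes as binary")
```

The saving comes from not repeating field names in every message and not rendering numbers as decimal text.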

Okay. Third. These are kind of what I think of as the basics of an RPC framework. You design for fault tolerance and give service owners control. The control piece, I think, is still [inaudible 00:15:58] at Google. When you define APIs, of course, you give the sync [inaudible 00:16:06]. You can have blocking calls, you can have non-blocking calls, you want to support all of those cases for service to service communication. The need for fault tolerance, this sort of goes into the resilience theme as well. I think Matt was touching upon this. In a polyglot world there are language specific frameworks and libraries, which will give you these capabilities. Like, you should do timeouts. You should use timeouts smartly.

But the thing is, they all do it in different ways. When you get into a microservices world where they're all communicating with each other, ultimately fulfilling a user request, it really becomes hard to debug, hard to have common nouns between the services teams, and leads to more debugging time, confusion, et cetera. In your worst case, outages, and so on.

I'll talk a little bit about deadlines, and cancellations, and some of the control knobs. We don't have a concept of timeout, but we have a first class feature in terms of deadlines. Pretty simple to understand. It indicates how long you should wait before you ... Basically, the RPC fails with a given status code. We have well defined status codes for all of the RPC requests, and that's pretty important too. So what does that mean? In a real call, say I have 200 milliseconds ... Get this thing working. You basically start with the current time and what the deadline time is. Each service should basically add on the time it has taken to process, so my current time is now whatever time it has taken, added up by 40. The cumulative time now adds up to 150. Now it adds up to a time, which is more than what I defined as the initial request from the user, hence deadline exceeded, and we are in sort of happy state.
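The walk-through above can be sketched as a small simulation in Python. This is not gRPC or Stubby code: the class and function names are hypothetical, there are no real clocks, and the hop names and the 110 ms and 80 ms figures are assumptions chosen so the running total passes through the 40 and 150 mentioned in the talk before crossing the 200 ms deadline.

```python
class DeadlineExceeded(Exception):
    """Raised when the accumulated time passes the caller's deadline."""


def call_chain(deadline_ms, hops):
    """Walk a chain of (service_name, processing_ms) hops against one deadline.

    Returns the elapsed time if every hop finishes in time; otherwise raises
    DeadlineExceeded naming the hop at which the budget ran out.
    """
    elapsed_ms = 0
    for service_name, processing_ms in hops:
        # Each service adds the time it took to the running total.
        elapsed_ms += processing_ms
        if elapsed_ms > deadline_ms:
            raise DeadlineExceeded(
                f"{service_name}: {elapsed_ms} ms > {deadline_ms} ms"
            )
    return elapsed_ms


# Two hops fit inside the 200 ms budget: 40, then 150 cumulative.
within_budget = call_chain(200, [("frontend", 40), ("search", 110)])

# A third hop pushes the total past the deadline.
try:
    call_chain(200, [("frontend", 40), ("search", 110), ("storage", 80)])
    outcome = "OK"
except DeadlineExceeded as error:
    outcome = f"DEADLINE_EXCEEDED at {error}"
```

The point of carrying one absolute deadline through the chain, rather than a separate timeout per hop, is that every service can tell how much of the original budget is left.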

It looks sort of simple, but in a lot of languages, frameworks, et cetera, the notion of timeouts gets pretty complicated, and we feel this is simply done with the notion of deadlines. Deadlines are expected. The server was slow, or the server responded with success, but it had already passed the deadline. But unpredictable things happen. What if the user canceled the request? That's the notion of cancellation. Now, typically, when it was one system, it was easy: you canceled the request, and that was it. It gets pretty hairy with the service graph and dependencies, so you need this notion of cascaded cancellations.

If you have a graph where you are calling a bunch of systems and the user cancels, you need to cascade it and make sure it cancels out all the dependent calls that were in flight when the user canceled. All right? And get them to the right state if they were changing state and it was sort of a non-idempotent request. How do you handle that? I think the notion of cancellation is important too. We propagate automatically. We fail with a canceled status code. I think that sort of conveys the point.
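A minimal way to picture cascaded cancellation is a tree of call contexts in which cancelling a node cancels everything beneath it. The Python sketch below is an illustration only; `CallContext` and the service names are hypothetical and do not correspond to a Stubby or gRPC API.

```python
class CallContext:
    """A node in a call tree; cancelling a node cancels everything below it."""

    def __init__(self, name, parent=None):
        self.name = name
        self.cancelled = False
        self.children = []
        if parent is not None:
            parent.children.append(self)
            # A call started under an already-cancelled parent is born cancelled.
            self.cancelled = parent.cancelled

    def cancel(self):
        """Mark this call and, recursively, all of its child calls as cancelled."""
        self.cancelled = True
        for child in self.children:
            child.cancel()

    def status(self):
        return "CANCELLED" if self.cancelled else "OK"


# One user request fans out to two services, one of which calls a third.
user_request = CallContext("user_request")
search = CallContext("search", parent=user_request)
index = CallContext("index", parent=search)
ads = CallContext("ads", parent=user_request)

# A call that belongs to a different request is not affected.
unrelated = CallContext("unrelated")

# The user cancels: the cancellation cascades through the whole graph.
user_request.cancel()
statuses = {c.name: c.status() for c in (user_request, search, index, ads, unrelated)}
```

What this sketch leaves out is the harder part raised in the talk: a service that was midway through a state change still has to bring itself back to a consistent state when it observes the cancellation.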

Moving over a little bit to flow control, which is this notion of when you have services talking. Flow control is not a very, very common occurrence, but it is very possible to have a slower client and a faster server, or the other scenario, a faster client and a slower server. How do you make sure one fast sender is not totally taking away my server? Not taking all of my server memory, or not taking up all of my server resources. In the case of the fast server, where I'm getting a whole bunch of responses, the client call basically gets canceled or unavailable. Then there is the other case of a faster client. The way Stubby, and even gRPC, handle it is that when you have a whole bunch of requests and a faster sender to a slower receiver, the receiver won't pick them up and will send a well-defined signal back to the sender to slow down. That sort of gives you the flow control capabilities.
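The pushback idea can be sketched with a receiver that has a fixed buffer and refuses messages once it is full. This Python example is a single-threaded simulation with hypothetical names; real flow control in gRPC rides on HTTP/2 windows and is more involved than this.

```python
from collections import deque


class Receiver:
    """Accepts messages only while it has buffer space (its "window")."""

    def __init__(self, window):
        self.window = window
        self.buffer = deque()

    def offer(self, message):
        """Return True if accepted; False is the signal for the sender to slow down."""
        if len(self.buffer) >= self.window:
            return False
        self.buffer.append(message)
        return True

    def process_one(self):
        """Consume one buffered message, freeing one slot of the window."""
        return self.buffer.popleft() if self.buffer else None


def send_all(messages, receiver):
    """Fast sender: on pushback, wait for the receiver to drain one message, then retry."""
    pushbacks = 0
    processed = []
    for message in messages:
        while not receiver.offer(message):
            pushbacks += 1
            processed.append(receiver.process_one())
    return pushbacks, processed


receiver = Receiver(window=2)
pushbacks, processed = send_all(range(5), receiver)
# The sender was told to slow down three times, and the receiver never held
# more than two messages in memory at once.
```

However fast the sender is, the receiver's memory use is capped by its own window, which is the protection the talk is describing.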

Where does it help? It helps to balance your computing power and network capacity. I think a big concern is typically, how do I protect my servers? Especially, how do I protect my server's memory? There's both client and server side flow control, and both are supported. Besides flow control, I think, as a service owner, if you are one of the owners of a service, you want to have more control in terms of how clients call you. This is the notion of setting up some policies where I can tell the clients, Hey. What should you do? Right? When you're calling me, when you're calling this particular method, in this Proto I have this service and this method, please use a deadline of 100 microseconds instead of 300 microseconds, or use higher than what is defined for the service.

These hints and suggestions, which make you feel, as a service owner, that you have control over how you are being accessed, are pretty important and help in control, help in avoiding overload situations. Other things that you can say, for example, in a service config are, Payload size for a given request from client to server should, at most, be so many gigs, or so many megs. Or, The LB policy that you should use for accessing a given method should be round robin. With this notion of suggestions, of course clients can override, but it's actually liked by the SRE teams because it sort of gives them more control.

How do clients discover it? Typically, the way we do it is via DNS, so when the client calls DNS, they get the service host name. Alongside, they get the policy, which is defined by the service config, and then the client library that the caller has can use the knobs exposed by the service config. All right? Another sort of notion of control for service owners.
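As a sketch of how such service-owner hints might be applied on the client, the Python below resolves call options from a config dictionary and lets the caller override them. The config layout is loosely modelled on gRPC's JSON service config, but treat the field names, the `weather.Weather` service, the `GetCurrent` method, and the `resolve_call_options` helper as assumptions for illustration rather than an exact API.

```python
# Hypothetical config published by the service owner (for example alongside DNS).
SERVICE_CONFIG = {
    "loadBalancingPolicy": "round_robin",
    "methodConfig": [
        {
            "name": [{"service": "weather.Weather", "method": "GetCurrent"}],
            "timeout": "0.100s",
            "maxRequestMessageBytes": 1048576,
        }
    ],
}


def resolve_call_options(config, service, method, client_overrides=None):
    """Combine the owner's suggestions with any explicit client overrides."""
    options = {
        "lb_policy": config.get("loadBalancingPolicy"),
        "timeout_s": None,
        "max_request_bytes": None,
    }
    for method_config in config.get("methodConfig", []):
        for name in method_config["name"]:
            # An entry without "method" applies to every method of the service.
            if name["service"] == service and name.get("method") in (None, method):
                if "timeout" in method_config:
                    options["timeout_s"] = float(method_config["timeout"].rstrip("s"))
                if "maxRequestMessageBytes" in method_config:
                    options["max_request_bytes"] = method_config["maxRequestMessageBytes"]
    # Suggestions are defaults: the caller's explicit settings win.
    options.update(client_overrides or {})
    return options


suggested = resolve_call_options(SERVICE_CONFIG, "weather.Weather", "GetCurrent")
overridden = resolve_call_options(
    SERVICE_CONFIG, "weather.Weather", "GetCurrent", {"timeout_s": 0.3}
)
```

Keeping the owner's values as defaults and the client's as overrides mirrors the point in the talk: owners get a say in how they are called without taking the final decision away from the caller.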

Okay. Another aspect in service to service calls is it's not just about the RPC contract that you already established. Your application interface, you agreed on a service contract. There's a declared interface, but there is other side-channel information called metadata information that needs to be exchanged between the sender and the receiver. I am deliberately using those terms because the client and server meanings sort of become quite overloaded in microservices world.

Things like auth tokens, things like trace context. I'll talk about tracing in a minute, next. Those can easily be exchanged via metadata between the sender and the receiver. In most cases, this is information that you need for systems to operate properly, but it would evolve and change in a different way, at a different pace than what your contracts are evolving at. Metadata is a first class concept in both Stubby and gRPC, and helps a lot in the exchange of the control information.

All right. That was sort of the design for fault tolerance and providing our service owners some control. My next two points are around analytics and tracing. I think Matt talked a lot about this notion of observability, and I completely agree on that topic, especially when you have huge service dependency graphs. Unless you have a common nomenclature, a common way of how you log, what stats you see, what those stats mean, it becomes extremely hard to debug real situations.

With Stubby, and with RPCs, every Stubby service exports stats in a consistent way while it streams these. You can basically go on to any service endpoint and load up a dashboard in a browser in real time and get a view of what are all the incoming RPCs, what are all the outgoing RPCs, which ones have greater than X second processing time, which ones are active now, where are the errors happening now? How many bytes were sent, how many RPCs are active right now, how many did I serve in the last minute?

Again, an easy notion of stats. Tracing, which I think is even more interesting. When you get to these service dependency graphs, it becomes extremely important to trace a request from service to service so you can get a view of the overall request. Now we have a system inside, where any service's resource usage and performance stats can basically be seen by arbitrary metadata. For example, the ads system or the Gmail system can monitor RPC latency broken down by whatever dimensions they choose. Let's say by client types, in this case, by web/Android/iOS. The trace context and the arbitrary metadata that is set by these service owners actually propagates through the entire service chain, which makes debugging and tracing extremely ... I would say this is not just useful, this is a must have, otherwise debugging is impossible.
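To illustrate metadata travelling alongside the request, here is a small Python simulation in which each service records its latency under the caller-supplied trace id and client type, then forwards the same metadata downstream. The service names, metadata keys, and millisecond figures are all hypothetical.

```python
from collections import defaultdict

# Stats that a dashboard could read, keyed by dimensions taken from metadata.
latency_by_client_type = defaultdict(list)
spans_by_trace = defaultdict(list)


def handle(service_name, metadata, processing_ms, downstream=None):
    """Do some simulated work, call downstream with the SAME metadata, record stats."""
    total_ms = processing_ms
    if downstream is not None:
        # Metadata is passed through untouched, so it reaches the whole chain.
        total_ms += downstream(metadata)
    spans_by_trace[metadata["trace-id"]].append((service_name, total_ms))
    latency_by_client_type[(service_name, metadata["client-type"])].append(total_ms)
    return total_ms


def storage(metadata):
    return handle("storage", metadata, processing_ms=5)


def mail(metadata):
    return handle("mail", metadata, processing_ms=10, downstream=storage)


# Two requests from different client types enter at the front of the chain.
mail({"trace-id": "t-1", "client-type": "android"})
mail({"trace-id": "t-2", "client-type": "web"})
```

Because the storage service never needed to know what "client-type" means, the dimension can be added or changed by the owning team without touching the service contracts in between.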

You often have cases ... I have a single request out of many that I need to find, and it's super important. You take a sample, store it, and that helps identify which request took a longer time. There's also a lot of hotspot analysis that happens. Along my entire service graph in the trace, which one is actually taking the most time, where is it stuck in the graph, and start to chase that. I think I've made the point.

The last two, one is load balancing. In Google, we had this. We've gone through many iterations in Stubby, so we first had a classic proxy-based load balancer called Stubby Balancer. What is this? You run it as a service, and all requests come to it, and it decides, based on certain algorithms, which back ends to send them to. It ran well for many years. There were a couple of problems. One is, at some scale, for some large teams like Search, you would have network concerns with everything going through the proxy. The second one would be, I still have to manage that service.

We went to edition two, which was client side load balancing, which is, I will have a library given to the client, and I will have some sort of Google global software load balancer. That's what GSLB stands for. All the back ends will report their CPU stats and what their health is. Based on that, it can give the client a list of back ends to hit and the client can directly talk to them, which is kind of what is being shown here. We took a hybrid of that approach, which is currently what is running, and some newer ideas, which we are bringing forward, which again, is taking away a lot of the client logic. The one thing that service owners didn't like in the client side load balancing model is, Hey. I'm losing control. I can't define how I'm being called and that's not something I like.

We are moving over to a world where even the client side logic is moving over to the controller, and all that the client is doing is basically round-robining over an address list. The controller has the intelligence of which back ends and IPs to hit.
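In that model the client side becomes very thin. A minimal Python sketch, assuming a hypothetical `RoundRobinClient` that is simply handed a non-empty address list by some controller, might look like this:

```python
import itertools


class RoundRobinClient:
    """Thin client: it only cycles over whatever addresses the controller sent."""

    def __init__(self, addresses):
        self.update_addresses(addresses)

    def update_addresses(self, addresses):
        """Called whenever the controller pushes a new (non-empty) address list."""
        self._addresses = list(addresses)
        self._cycle = itertools.cycle(self._addresses)

    def pick(self):
        """Return the next back end to call."""
        return next(self._cycle)


client = RoundRobinClient(["10.0.0.1:443", "10.0.0.2:443"])
first_three = [client.pick() for _ in range(3)]

# The controller decides one back end is unhealthy and pushes a new list.
client.update_addresses(["10.0.0.2:443", "10.0.0.3:443"])
after_update = [client.pick() for _ in range(2)]
```

All of the decisions about health, load, and locality live in whatever produces the list, which is what gives service owners back the control they felt they lost with smarter clients.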

These were some of the learnings, and the main point is when you have a single point of integration, and you have a single framework for a lot of these utilities like logging, monitoring, tracing, LB stats, it makes the lives of developers much easier and more productive. Right? We are bringing a lot of that into gRPC. All of the things that I mentioned are coming over to gRPC. It exists in three different stacks, C, Java, and Go. It's at 1.0 right now. One of the things we did with gRPC was project to more platforms and languages as we go into Open Source.

Essentially, as we went open, we got to more platforms. These are the languages we support today. Pretty much, if you're writing in any language and platform, you can use it. What it gives you is inter-op. You'll notice that we don't stop at service to service. You'll notice we go all the way to the client side, so including mobile and web, JavaScript coming, and so on. We're actually extending the scope of Stubby when we go to gRPC, to mobile and web as well.

I think that gets the point across. I'm short on time, so I'm gonna run through and skip some of this. One important point is that it is based on HTTP/2. I won't go into details of HTTP/2, but it gives us a whole bunch of features like streaming, over one TCP connection, and so on.

All right. There is an Open Performance Dashboard. You can go through it and all of the languages and graphs are open out there. You can easily see numbers like that on Open Dashboards, what QPS we get for unary RPCs, streaming RPCs, by language. It's all out there in the open. We run it in [inaudible 00:32:50], but you can run it on your own and get to see those numbers.

All right. I'll skip this and go straight to Q and A. Just before that, here's a quick list of what's coming soon. I don't think I have time to get into a whole lot of this list. Some important ones here are reflection, health checking, which is defined, and some automated unit testing support. When you have services, when you want to test locally and you don't want to bring up all the dependent services, how do you do automated mock testing, and so on? All right. That, some people are already using, which is great. You can find more of these on GitHub and our site.

I'll take Q and A. Sorry. We're running over.

Austin Gunter: You're good. Yeah, yeah, yeah.


Varun Talwar: Okay.

Flynn: There's one over there. I'll continue the trend by asking while I walk over. What sorts of problems did you have as you tried to get people to switch from Stubby to gRPC?

Varun Talwar:

Google itself is on the path of doing it. It hasn't done it yet. For us, I think, the Stubby to gRPC journey for Google is a higher bar because there's so much value already built into Stubby. We're trying to reach feature parity, that's one. The problem, I think, is sort of showing the value to service owners, like why should they switch. What's in it for them? I think the commonality, if services are giving external APIs, of having external and internal API contracts being similar is a big value. Having support all the way to mobile and web is another value.

I think the main challenge is showing the ROI for service owners to switch internally at Google. For a lot of people outside who haven't seen the Stubby side of things, I think the value becomes much clearer from the get-go. Especially painful for a lot of companies that I've seen is supporting developers in all of these languages; maintaining all these language bindings is not easy. Some big companies have attempted it and found it costly.

Audience member:
At Google, how do you do the Proto sharing and distribution? Do you have like one concept of a user used by every service, or does every service define its own user?

Varun Talwar: We have a notion of identity in terms of ... I don't know if you mean user or service.

Audience member: Do you use shared contracts between many services, or does each service implement its own completely independent contract?

Varun Talwar:
Each service implements its own, unless you have a common thing like, Oh, my contract. For every service, the way to talk to tracing is defined in the tracing Proto, or the logging Proto. So those are common, but otherwise Protos are service owned, so to speak.

Flynn: And I think we have time for one more.

Audience member:
As far as indistinctly used Protocol files, basically interface definition language, two questions. First, do you generate both the client and the service code from it? And does it limit your flexibility on the server side of things?

Varun Talwar:

Yes. And your second question was, does it affect flexibility on the server side? If you mean like the ability to add new features, or customize, or ... All we are generating is stubs in that language as defined in your contract. Whatever you define, in terms of your RPC methods, this is what you get. Of course, with each of those calls, you have further APIs in terms of whether I want to make it a sync or an async call. What we have tried to do is make our APIs as close as possible to the language you're choosing. All right? So if you're generating a Node API, it has support for futures, and so on. Things start to vary a little bit, by language. It sort of goes towards flexibility if you're writing a server in a given language. Otherwise, we don't see this limiting what you want to implement on the service side.

Thank you.

Varun Talwar: All right.
