Lyft’s Envoy: From Monolith to Service Mesh – Matt Klein

Matt Klein (Lyft)

Description

Envoy is a new high performance open source proxy which aims to make the network transparent to applications. Envoy’s out of process architecture allows it to be used alongside any language or runtime. At its core, Envoy is an L4 proxy with a pluggable filter chain model. On top of the L4 proxy it also includes a full HTTP stack with a parallel pluggable L7 filter chain. This programming model allows Envoy to be used for a variety of different scenarios including HTTP/2 gRPC proxying, MongoDB filtering and rate limiting, etc. Envoy includes advanced load balancing support including eventually consistent service discovery, circuit breakers, retries, zone aware load balancing, etc. Envoy also has best in class observability using both statistics and distributed tracing.

In this talk we will cover Lyft’s architectural migration from monolith to a fully distributed service mesh, the origins of Envoy, a high level architecture overview, and future directions.

For more information on Envoy see: https://www.envoyproxy.io/

Resources
Getting started with Lyft Envoy for microservices resilience
Deploying Envoy with a Python Flask webapp and Kubernetes
Deploying Envoy as an API Gateway for microservices / Ambassador: Kubernetes-native API Gateway built on Envoy

Introduction to modern network load balancing and proxying

Using API Gateways to facilitate your transition from monolith to microservices

Why Envoy Matters

What is a service mesh and do I need one when developing microservices? 

Service mesh data plane vs. control plane

Presentation Slides


Transcript

Richard:

Thank you, everyone, for coming. I remember when I first heard about Envoy, we were chatting with some of the folks from Airbnb, I think maybe about a year ago. Then, when we ran into one of the folks from Yelp, who’s actually somewhere in the audience, he told us, "You really need to check out this Envoy thing." It turned out it was pretty cool. Matt’s done an amazing job of getting Envoy off the ground and into production. When it came time to organize a summit, we were trying to figure out: what are interesting topics and cool things to talk about?

Envoy was the first thing that came to mind, so without further ado, here’s Matt Klein.

Matt:

Great, thanks, everyone. Yeah, so I’m Matt Klein, a software engineer at Lyft. Today, we’ll be talking about Envoy and, in general, about Lyft’s migration from a monolithic architecture to a service mesh architecture. A quick talk overview. We’re going to start by just talking about kind of where Lyft came from and how it got to where we are today. We’ll transition into talking about the current state of SOA networking within the industry. Then, we’ll actually jump to what Envoy is. We’ll talk about Envoy’s deployment at Lyft. We’ll talk about where Envoy is going. Then, we’ll do some Q&A.

This picture is going to look probably very familiar to folks that kind of started out with a basic architecture. This is Lyft about three and a half years ago. Lyft started entirely within AWS, built on a PHP/Apache monolith with MongoDB as the backing data store and a single load balancer. Clients would talk over the internet to the load balancer, the load balancer to PHP, PHP to Mongo. That’s essentially it. Even at this simple stage, and we say simple, right, no service oriented architecture, we already start seeing some problems. What happens here is that PHP and Apache, for example, are a process-per-connection model and can only handle so many connections at the same time.

The load balancer doesn’t actually know about that, so it starts opening many pre-connections. Then, we start having to talk to AWS to tell them to turn off features such as pre-connect and stuff like that to actually make this work. Even with this very, very simple architecture, three and a half years ago with no service oriented architecture, we already start seeing some problems in that networking space.

Let’s fast forward just one and a half years later, so this is about two years ago. We’ve gone from something that looks very, very simple to something that is probably, again, quite familiar to most of you who are working with these kinds of architectures. It’s not simple anymore. Now we have clients talking over the internet to the external ELB. We’ve still obviously got this monolith, but now the monolith is trying to make service calls to new back end services. The serving architecture of PHP doesn’t allow us to make concurrent connections, so we’ve dropped in HAProxy to try to fix that problem.

We’ve got some queuing now, so we’ve got all of this operational tooling and scripts to attempt to make that actually work. We’re making calls out to AWS internal ELBs. Now we’re calling a bunch of Python services. We’ve got two data stores now: Mongo as well as different DynamoDB tables. We’re attempting to keep this all together. From a deployment standpoint, it’s starting to get pretty complicated. From an observability standpoint, we’re starting to have fairly major problems, in the sense that the information that we’re getting out of this system in terms of logging and stats is limited, and we don’t have any tracing at this point. Very difficult to understand where things are actually starting to fail.

If a call fails, did it fail somewhere in HAProxy land? Was it somewhere in the queuing layer? Was it in the monolith? Was it somewhere within AWS? Very, very difficult to actually tell what’s going on. Using that picture that we just saw, I’d like to talk a little bit about what I think most people are experiencing right now with their service oriented architectures. What we have right now in the industry is tons of different languages and frameworks. It’s becoming increasingly common that companies have polyglot systems.

They might be deploying services in, say, three or five different languages, whether that’s Java or Go or Scala or PHP or Python, right? There’s all kinds of different technologies out there. In terms of frameworks themselves, people are using all sorts of things. They might be using gevent. They might be using different PHP frameworks, right? There’s all kinds of different things going on. More importantly, from the networking space, you typically have per-language libraries for actually making these service calls. If you’re using PHP, you might be using cURL, right? If you’re using a JVM thing, you might be using a very sophisticated library like Finagle.

Typically, every language has different libraries that you’re using to make these calls and also to get different stats and kind of observability out of them. We’ve obviously got lots of different protocols. We’ve got HTTP/1, HTTP/2, gRPC, which you’ll hear a lot more about soon. We’ve got tons of different databases, whether that’s Mongo, Dynamo, SQL, all types of different things. Obviously, different caching layers, whether that’s Redis or memcached. There’s all types of things going on.

Then, in terms of infrastructure, we’re kind of at a phase now where we have physical infrastructure. We have virtual infrastructure. We have infrastructure as a service. We have containers as a service. You know, you’re starting to talk about really heterogeneous deployments that can get quite complicated. Now, we get to load balancers, so again, we’re in this transition phase where we still have physical load balancers, things like F5, things from companies like Juniper. We have elastic load balancers, or virtual load balancers. We have software based load balancers, like HAProxy.

Then, we start talking about how you actually figure out what’s going on within these systems. Ultimately, observability from an operations standpoint ends up being extremely important. That typically boils down to stats, tracing, and logging. When you have this entire system of all these different components that are quite heterogeneous, the output tends to be very varied. Each of these systems is probably producing no tracing, or different stats, or different logging. From an operations standpoint, trying to figure out what’s actually going on in your system can be very, very complicated.

This is probably the most important point: from a distributed system standpoint, there’s a lot of best practices that go into service oriented architecture and networking. Those are things like retry, circuit breaking, rate limiting, timeouts, all of those kinds of things. What you’ll find is that when people use different libraries of varying quality, or they’re using different hardware appliances, the features that are provided in these systems tend to be varied. They either work differently, they might be implemented, or they might not be.

From a developer standpoint in your ecosystem, if you’re already using all of this stuff in your five languages, it can become nearly impossible to kind of grok what is available, what is happening, and what your best practices are. Then, of course, we’ve got authn and authz. You know, what I find when I talk to most companies and most customers at this point is that companies are either not doing it at all or they’re typically doing something like basic auth with no key rotation, which obviously has a lot of different problems with it.

Another large area. All of those things, you know, they tend to lead to this, right, which is people that are very frustrated. They’re typically scared of actually building service oriented architectures, because they don’t understand where things fail. Part of that is lack of features. Part of that is actually visibility. What it really comes down to is people do not understand how all of these components actually come together to build a reliable system. What most companies are feeling today is kind of that previous slide.

It’s a lot of pain. I think most companies and most organizations know that SOA is kind of the future and that there’s a lot of agility that comes from actually doing it, but on a rubber-meets-the-road kind of day in and day out basis, people are feeling a lot of hurt. That hurt is mostly around debugging. When I joined Lyft, people were actively fearful of making service calls, because when a call failed, or if there was high latency, people physically could not understand where the failure was happening. Was it happening in the app? Was there some type of GC pause? Did the Amazon hardware fail? Was there an actual networking issue? Basically impossible to tell.

Because of what I was saying before, where you have all these different components that are quite heterogeneous, just figuring out what your visibility is can be nearly impossible. You know, if you’re relying on something from Amazon like the Elastic Load Balancer, you’re basically beholden to the stats that they give you, which you can effectively see in CloudWatch. You don’t really have good access to logging, so it can be incredibly complicated to work with different vendors of different components to figure out what’s actually going on.

For some of these problems that tend to be so transient, if you can’t actively debug and understand what’s going on, the chance of a developer understanding what happened approaches zero, and their trust of the system approaches zero, so building trust from devs is quite important. I already touched on this, but what you tend to see with a lot of the existing libraries is partial implementations of best practices. There were so many times when we were building Envoy at Lyft, in the early days when we were rolling it out and getting all the services transitioned over, that people would often say to me, "Hey, Matt, I don’t understand why we need Envoy to do timeouts, or to do retries, or stuff like that."

To a lot of developers, some of the things that I’ll talk about that Envoy does, they seem simple, at least at a high level. Conceptually, retry seems simple, but as many of you probably know, retry is one of the best ways to take down your system, right? It’s so easy to basically have the entire system go down from exponential overload. You know, so many times people have told me these things are easy. They end up being partially broken. Then, we end up having outages because of these theoretically easy problems.

Most importantly, for the few companies that do have a good solution, so if you’re lucky enough to be on one or two languages, you’re on the JVM and you can use something like Finagle, that’s awesome. That’s a great library. If you’re Google and you have Stubby, that’s awesome, right? If you’re Amazon and you’ve invested in their internal tools, that’s great. What that typically means is that, A, you’re locked into a few languages, and you have to have large dev teams that are typically replicating technology across different stacks. In order to keep that kind of homogeneous, it tends to be a lot of effort to keep those libraries up to date.

There tends to be a lot of lock-in. If you’re invested in the JVM and you’ve got this amazing library system and suddenly you want to start writing Go services, the investment to do that can be absolutely huge, because you have to have someone port that huge library to do all these sophisticated things. Lastly, if you are using that library approach, they’re really painful to upgrade. I’m sure you’ve all been in this situation where you’ve got some library. You’ve got a bunch of services. You want people to deploy that standard library. Good luck.

It can take five months, right, where you beg and plead with people to actually upgrade that library. It’s one thing if you’re trying to get people to roll out new features, but if you’re trying to deal with some type of CVE or some type of general security issue, it can become imperative to actually get that library upgraded, and that pain just makes that part even worse. What I typically tell people, particularly with SOA networking, is that at the end of the day, observability and debugging are pretty much the most important thing.

You have developers who you would like to write services for any number of reasons, but if the developers cannot understand what’s happening, if they don’t feel comfortable that they can root cause an issue and know where latency is coming from, know where failure is coming from, the developers are not going to trust the system. It is absolutely imperative that we give people good tools to actually observe what’s going on in the system and also do debugging. What I think we’ve seen so far up to today is in the industry we have not given people good tools to actually do this debugging.

Because of kind of the heterogeneous system, the partial features, and a bunch of other things that I’ve already talked about, you’ll actually see at a certain point productivity start to decrease, where like I was saying, people will not trust the network. They’ll start actually shipping libraries again, because they don’t want to make service calls. We saw this at Lyft all the time. We were into our service oriented architecture. People didn’t trust the network. They started to ship fat libraries again. We started to build monoliths again. We were fighting. We wanted to go SOA, but then we were still fighting that kind of fear of going full SOA.

That begged the question, right? We’ve kind of seen the history of Lyft, and we’ve seen some of the problems we have today in the industry. Can we do better? Let’s talk about Envoy. This is kind of the mission statement of Envoy, which reads: the network should be transparent to applications. When network and application problems do occur, it should be easy to determine the source of the problem. That’s a great mission statement, but as is probably clear, it’s an incredibly difficult problem to solve.

Let’s talk about kind of what we do in Envoy to attempt to start to solve that problem. Then, we’ll go into a bunch of diagrams and details. One of the main things that we did in Envoy is that Envoy is not a library. It’s a completely self-contained proxy. You can think of it as something like NGINX or HAProxy. It’s a server that you install and you run. There’s different names for this. A lot of people call it the sidecar model. I just call it the service proxy model. The idea with Envoy is that you implement a lot of features, which I’m about to talk about.

You’re going to implement them in one place. You’re going to co-locate this proxy next to each and every application that you have. Then, no matter what your application language is, whether it’s Go or Scala or Java or Python or C++ or PHP, it doesn’t matter. What ends up happening is that the application talks to Envoy locally. Envoy does a lot of stuff. Then, it returns that response back to the application. We’ll start going into what all of that stuff is.

Real quick, Envoy is written in C++. I’m not going to talk about this more. That was chosen just for performance reasons. C++11 is actually a very fast and productive language. At its core, Envoy is what we call an L3/L4 proxy, and by that, I mean it’s a byte oriented proxy. Connections come in, you read some bytes, you’re going to operate on those bytes. Then, you’re going to push those bytes back out. That can be used for very simple things like just kind of a simple TCP proxy or a simple SSL proxy. It can also be used for much more complicated things.

It can be used for Mongo, it can be used for Redis. As I was saying, it can be used as an stunnel replacement. The idea is that Envoy is kind of a generic holder for different filtering mechanisms that can operate on bytes. Of course, as we all know, most of the industry is based on REST, so Envoy has very, very rich HTTP support. Along those lines, we have a second-layer filter stack that allows us to filter on kind of messages at that level, for example, headers or body data or something along those lines. We can do very rich routing, for example based on headers or based on content type.
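
To make the filter model concrete, here is a conceptual sketch in Python. Envoy’s real L4 filters are C++ classes with a much richer interface; the names below (PassthroughFilter, ByteCountingFilter, FilterChain) are made up purely to illustrate the "bytes in, operate on them, bytes out" shape of the chain.

```python
# Conceptual sketch only: Envoy's real L4 filters are C++ and their interface is
# richer than this. The point is the shape: bytes flow through a chain of
# filters, each of which may observe or transform them.

def record_stat(name, value):
    print(f"stat {name}={value}")  # stand-in for a real stats sink


class PassthroughFilter:
    def on_data(self, data: bytes) -> bytes:
        return data  # pass the bytes along unchanged


class ByteCountingFilter(PassthroughFilter):
    """Hypothetical filter that emits stats without modifying the stream,
    in the spirit of the Mongo/Dynamo sniffing described later in the talk."""

    def __init__(self, stat_name):
        self.stat_name = stat_name

    def on_data(self, data: bytes) -> bytes:
        record_stat(self.stat_name, len(data))
        return data


class FilterChain:
    def __init__(self, filters):
        self.filters = filters

    def on_data(self, data: bytes) -> bytes:
        for f in self.filters:  # each filter sees the previous filter's output
            data = f.on_data(data)
        return data


# chain = FilterChain([ByteCountingFilter("tcp.rx_bytes")])
# chain.on_data(b"\x00\x01hello")
```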

We can do rate limiting. We can do all types of different things. We also allow people to plug in different functionality. One of the things that we did with Envoy, which is actually not common for most of the proxies that exist today, is that Envoy was written to be H2 first. Most proxies today do support H2, but typically, they only support H2 on the front side. They still use H1 on the back side. Envoy is a fully transparent proxy, so it can do H2 on both sides and transparently go H2 to H1. Where that gets pretty important is that gRPC is an H2 protocol.

Envoy is one of the few proxies that can correctly proxy those types of messages. I’m actually not going to go too much into gRPC just because there’s a whole talk coming on that. Envoy does a lot in the area of both service discovery as well as active and passive health checking. Service discovery is how you know what all of your backend hosts are, and health checking is how you know whether those hosts are healthy. I’m going to talk a lot more about this in the following slides.

We do a whole lot in the area of what I would call advanced load balancing. Those are all the things that I was talking about previously around retry, timeouts, circuit breaking, rate limiting, shadowing, outlier detection, stuff like that. These are those features which, in most people’s architectures, are either non-existent or exist as partial implementations that people typically either don’t use correctly or don’t actually know how to use. Most importantly, and I’m going to keep talking about this throughout the talk, is observability.

I would say best in class. We have extremely rich stats output, logging, as well as tracing. I’m going to talk a lot more about this. Finally, Envoy typically supports enough features for most people to use it as an NGINX replacement in terms of front proxies. We do enough TLS features so that for almost every architecture, you can use Envoy as an L7 reverse proxy at your edge. You might ask why we would want to do that. What we find, going back to Lyft’s architecture where we were running NSQ and HAProxy and kind of a bunch of different things, is that from an operations standpoint, there’s huge benefit in running a single piece of software everywhere, just because you have the same set of stats, the same type of logging, the same tracing.

What we find is that being able to run our proxy both at the edge and in service-to-service mode makes it easier from an operations standpoint. When you actually think about it, 90% of what is done at the service-to-service layer as well as at the edge is essentially the same. All right. This is a little zoom-in of what a basic Envoy cluster looks like. What you’ll see here is we’ve got our service cluster, right, so that might be n nodes. This could be virtual machines. This could be pods within Kubernetes, doesn’t really matter.

You’ve got instances of your service, and co-located alongside every one, again, whether that’s a pod or whether that’s a virtual machine, doesn’t matter, is an Envoy instance. The Envoys communicate with each other typically over H2. They might be talking gRPC. They might be talking REST. Again, doesn’t really matter. Envoys might also be proxying out to external services, so think DynamoDB or other kind of third party vendors. Then, of course, the Envoys are talking to some type of discovery service, which we’ll talk about more, to find out: where are all of these pods? Where are all of these entities?

I think one of the most important things to actually realize about this picture is that from the service perspective, the service is only aware of the local Envoy. From a developer productivity perspective, this is actually truly amazing. If you’re a dev, it doesn’t matter whether you are coding in your dev environment, if you are in staging, or if you’re in production. You write code in exactly the same way. At Lyft, we have this very thin client that we call the Envoy client. What our devs do is they instantiate an Envoy client, and they literally just give it the name of the service they actually want to talk to.

At Lyft, let’s say they want to talk to the user service or the location service. They’ll instantiate a client, the Envoy client, and they’ll say, I want to talk to locations. Then, no matter what environment they’re in, the system, the mesh, is set up in such a way that it works. Our developers, when they’re running on their dev box on their laptop, it just works. When they’re working in staging, it just works. Prod just works.
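
As a rough illustration of what that thin client might look like, here is a minimal Python sketch. Lyft’s actual Envoy client is internal, so the class name, the local port 9001, and the Host-header routing convention are assumptions for illustration; the point is that the caller only names a service and only ever talks to 127.0.0.1.

```python
# A minimal sketch of a "thin client" over the local Envoy sidecar.
# Port 9001 and the header conventions are illustrative assumptions.
import http.client
import uuid


class EnvoyClient:
    def __init__(self, service_name, envoy_addr=("127.0.0.1", 9001)):
        self.service_name = service_name
        self.envoy_addr = envoy_addr

    def get(self, path, request_id=None):
        # Reuse an incoming request ID when there is one so traces stay linked;
        # otherwise mint a new one at this hop.
        request_id = request_id or str(uuid.uuid4())
        conn = http.client.HTTPConnection(*self.envoy_addr, timeout=1.0)
        conn.request("GET", path, headers={
            "Host": self.service_name,   # tells Envoy which upstream cluster to route to
            "x-request-id": request_id,  # propagated for logging and tracing
        })
        resp = conn.getresponse()
        body = resp.read()
        conn.close()
        return resp.status, body


# Dev, staging, and prod all look the same to the caller:
# locations = EnvoyClient("locations")
# status, body = locations.get("/v1/nearby_drivers")
```

Because discovery, load balancing, retries, and stats all live in the sidecar, this client stays tiny and identical in every environment.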

Furthermore, it unlocks actually really interesting scenarios during dev, where we can transparently offload certain services from the dev’s laptop into the cloud, and the devs don’t even know about it. You know, this type of abstraction, where the service doesn’t really know about the networking, can be very powerful. Just to briefly touch on what the edge looks like, it’s a typical kind of proxy scenario, but we’ve got external clients, we’ve got the internet, and at Lyft, we actually run two types of edge proxy scenarios. One of them is the standard proxy, so that is terminating TLS and doing L7 reverse proxy for the API, and then going back into our private infrastructure with all kinds of different services.

Then, we also actually run Envoys in what we would call a double proxy mode in multiple regions. For example, for Lyft, we terminate traffic here in San Francisco because we have so many customers here. Then, we end up proxying some of that traffic back to our east coast data center. That’s a very common paradigm, but it just kind of shows that we can use Envoy in all these different scenarios. We can use it in the double proxy scenario, the main edge proxy scenario, as well as this service-to-service proxy scenario.

This now is Lyft today. There’s more boxes here, but it’s actually a lot simpler. What we have now is we have clients, right? We have the internet. We have what we call our front Envoy, our edge proxy fleet. These still run within AWS. We still use basic TCP ELBs to get traffic into the front Envoy fleet. The front Envoy fleet is still proxying back to our legacy monolith. As you all know, monoliths take a long time to actually die. We’re actively working on that, but that’s still ongoing. We have a lot of Python services. We now have a lot of Go services.

Envoy runs on every service that we have here. Services do not talk to anything without talking to Envoy. Envoy is actually used to proxy to MongoDB. I’ll talk a little bit more about that later. We use it to proxy to DynamoDB. Envoy itself emits a ton of stats. It hooks into our tracing system. All that information comes directly from Envoy. That’s really incredible, because no matter what language you’re actually running, you just show up, you launch your service, you make thin calls through your local Envoy, and you immediately get fully distributed tracing, all these stats, logging.

It’s actually pretty amazing. Then, we obviously run our discovery service down at the bottom there. That’s how all of the Envoys actually talk to each other and figure out what’s going on. All right. Let’s talk just briefly about service discovery. Typically, the way that historically most companies have done service discovery is to use fully consistent systems. That’s typically ZooKeeper or Consul or something along those lines. That has worked for many companies, right? There’s still a lot of companies that are using ZooKeeper. There’s still a lot of companies that are using Consul, and people like these products.

One of the things I have always found very interesting about service discovery, and particularly about people using these fully consistent systems, is that service discovery is not really a fully consistent problem, right? Nodes come, nodes go, there’s failures, and as long as you have an eventually consistent view of the hosts in your mesh, and as long as that view converges, you’re ultimately going to get to your end state. I think one of my core distributed systems design philosophies is that if you have a fully consistent problem that can be made eventually consistent, always make it eventually consistent.

What you’ll see is that, because of the fully consistent nature of ZooKeeper and etcd and Consul, these systems tend to be hard to run at scale. Almost every company that I know that runs these systems has full teams dedicated to keeping ZooKeeper running or keeping etcd running at scale. Like I was saying before, because service discovery is an eventually consistent problem, every node does not need to have the same view of the network as long as it converges eventually.

We can greatly simplify this problem by making it an eventually consistent problem. We designed Envoy from the get-go to be eventually consistent. Our service discovery service, which we’ve open sourced, is literally, I think, about 300 lines of Python that writes into DynamoDB. It is dead simple. Every host, every minute, checks in and says, "Hi, here I am. Here’s my IP address. Here’s what I am." We do caching at every layer. Everything is eventually consistent, and the Envoys talk to this discovery service and essentially get a roughly up-to-date view of the world, and that converges eventually.
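
A minimal sketch of that check-in model is below. The endpoint URL, payload fields, and interval handling are hypothetical rather than the actual API of Lyft’s open-source discovery service; the point is just how little a host has to do when the data is allowed to be lossy and eventually consistent.

```python
# Sketch of the "every host checks in every minute" registration model.
# DISCOVERY_URL and the payload shape are illustrative assumptions.
import json
import time
import urllib.request

DISCOVERY_URL = "http://discovery.internal:8080/v1/registration"  # hypothetical
CHECK_IN_INTERVAL_S = 60


def check_in(service_name, host_ip, port):
    """Tell the discovery service 'here I am'. Losing one check-in is fine:
    the view is eventually consistent and health checking covers the gap."""
    payload = json.dumps({
        "service": service_name,
        "ip": host_ip,
        "port": port,
        "last_check_in": time.time(),
    }).encode()
    req = urllib.request.Request(
        f"{DISCOVERY_URL}/{service_name}",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req, timeout=2.0)


def check_in_loop(service_name, host_ip, port):
    while True:
        try:
            check_in(service_name, host_ip, port)
        except OSError:
            pass  # tolerate a lossy registration; try again next interval
        time.sleep(CHECK_IN_INTERVAL_S)
```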

What we do, because of this eventually consistent nature, is assume that service discovery data is actually lossy. We assume that hosts are going to come and go. We actually assume that we can’t trust that service at all. We assume that it can go away and then come back. What we do is we layer on top both active and passive health checking. By active, I mean every 15 seconds going to /healthcheck and saying, "Hi, are you there?" Passive health checking meaning: did that host return three 500s in a row? Let’s just boot it, right?

We layer on top those types of active and passive health checking. We produce this, I would say, consistent overlay. It’s super important, because what we end up doing here is, if you look at this two by two matrix, we end up trusting the active health checking data more than we trust the discovery service data. If the health checking is OK and the service is discovered, we obviously route. If discovery shows absent but health checking is still passing, we still route. We don’t trust the discovery data, because we assume that the service discovery service is lossy and the host might end up coming back later.

If the health checking has failed and the discovery service shows discovered, we trust the health checking data more than the discovery service, because again, we do a lot of caching in that layer. It’s eventually consistent, so we assume that the health checking data is more up to date than that background service discovery data. Only in the bottom right hand corner, where discovery shows absent and health checking has failed, do we fully take that node out.
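
That two by two decision can be written down in a few lines. This is just a restatement of the matrix described above; the function names are hypothetical helpers, not Envoy internals.

```python
# Restating the health-check / discovery matrix in code.
# "health_check_ok" means the active health check is passing;
# "discovered" means the host is still present in the eventually
# consistent discovery data.

def should_route(health_check_ok: bool, discovered: bool) -> bool:
    # Active health checking is trusted over possibly stale discovery data:
    # if the health check passes, route, even if discovery has (perhaps
    # temporarily) dropped the host.
    return health_check_ok


def fully_remove(health_check_ok: bool, discovered: bool) -> bool:
    # A host is only removed from the pool entirely when both signals agree
    # it is gone: health check failing AND absent from discovery.
    return (not health_check_ok) and (not discovered)
```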

What we found using this system is we have not touched our discovery service in... I think the last code change we made to it was six months ago. It’s literally fire and forget. Based on what we see from most of our peer companies and other companies that are using some of these other fully consistent systems, the care and feeding that people have to put into keeping those systems up can be very challenging. Furthermore, what you see is that people that are using ZooKeeper and etcd end up actually building eventually consistent behavior on top of them because of the inherent instability.

I find that interesting. Basically, they’re taking this fully consistent system and making it eventually consistent to kind of do the same thing. We took, I think, the slightly novel approach of just treating it as eventually consistent from the get-go. Very briefly, let’s go through some of the load balancing features Envoy supports internally. We have obviously a variety of different discovery types, whether that’s static configuration, DNS, or our own APIs around the service discovery service. We do zone aware, least request load balancing, which is a pretty cool feature.

If you’re in AWS or other kinds of multi-zone architectures, not only will calls potentially take longer to go between zones, but they may actually charge you money. Envoy will optimize to send requests to services in the same zone. Then, if things start to fail, it will actually properly fail over and send them across multiple zones. This is good from a cost-saving perspective. It’s good from a speed perspective and a reliability perspective. We do a ton of stats like I was saying before: per-zone stats, canary specific stats. We spit out really all kinds of stats to make it easy to operate.
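
As a much-simplified sketch of the zone-aware idea, not Envoy’s actual algorithm, which weighs zone capacity and uses least-request selection, the host-picking logic looks roughly like this:

```python
# Simplified zone-aware host selection: prefer healthy hosts in the caller's
# own zone, spill over to other zones only when the local zone has no healthy
# capacity. Random choice stands in for least-request selection for brevity.
import random


def pick_host(hosts, local_zone):
    """hosts: list of dicts like {"addr": "10.0.0.1:80", "zone": "us-east-1a", "healthy": True}."""
    healthy = [h for h in hosts if h["healthy"]]
    if not healthy:
        raise RuntimeError("no healthy hosts")
    local = [h for h in healthy if h["zone"] == local_zone]
    # Keeping traffic in-zone is cheaper and faster; failing over across zones
    # keeps requests succeeding when the local zone degrades.
    candidates = local if local else healthy
    return random.choice(candidates)


# pick_host(
#     [{"addr": "10.0.0.1:80", "zone": "us-east-1a", "healthy": True},
#      {"addr": "10.0.1.7:80", "zone": "us-east-1b", "healthy": True}],
#     local_zone="us-east-1a",
# )
```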

We do a lot of circuit breaking, so that’s things like max requests, max connections, max concurrent retries. We do rate limiting. Envoy has integration with a global rate limiting service. We will hopefully be open sourcing the rate limiting service next week. I am very excited about that, so that will be out there also. We do request shadowing, so we can fork traffic to a test cluster. Our devs love this feature just because they can do load testing. They can test new features without worry of actually affecting production.

We do all kinds of built-in retries, so we’ll have retry back-off with exponential jitter. We’ll have different retry policies. We’ll do timeouts, and those are both outer timeouts, including all retries, as well as inner timeouts, so that actually allows people to potentially set pretty tight single-try timeouts but still bound everything in some kind of reasonable outer limit. We do outlier detection, so currently we support consecutive 500s. If a node, say, returns three 500s in a row, we can boot it, kind of put it into a penalty box for some period of time. With all these features, we also allow you to do deploy control, so blue/green deployment, canary deployment, and stuff like that.
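
To show how inner (per-try) timeouts, exponential back-off with jitter, and an outer deadline fit together, here is a minimal sketch. The specific numbers are illustrative defaults, not Envoy’s configuration.

```python
# Sketch of a retry policy with per-try ("inner") timeouts, exponential
# back-off with jitter between tries, and an overall ("outer") deadline that
# bounds the whole operation including retries.
import random
import time


def call_with_retries(send, per_try_timeout_s=0.25, overall_timeout_s=1.0,
                      max_retries=3, base_backoff_s=0.025):
    deadline = time.monotonic() + overall_timeout_s
    last_exc = None
    for attempt in range(max_retries + 1):
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # outer deadline exhausted; stop retrying
        try:
            # The inner timeout can be tight because the outer deadline still
            # bounds the total time spent across all tries.
            return send(timeout=min(per_try_timeout_s, remaining))
        except Exception as exc:  # in practice: timeouts and 5xx responses
            last_exc = exc
            backoff = base_backoff_s * (2 ** attempt)
            # Jitter spreads retries out so a failing host isn't hammered in lockstep.
            sleep_for = min(random.uniform(0, backoff),
                            max(0.0, deadline - time.monotonic()))
            time.sleep(sleep_for)
    raise TimeoutError("request failed after retries") from last_exc
```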

All right, so let’s talk about observability. Observability, like I was saying, is just the most important thing. You can have all these features, but if you don’t know what’s going on, no one’s going to use your system, and no one’s going to want to write highly network-dependent service architectures. When all of the traffic transits through Envoy, you start getting a pretty incredible single place where you can generate really important stuff.

We’ve already talked about stats, so we can do stats on every hop. We can propagate a stable request ID as well as a tracing context. This is super important: if you’re doing logging across your architecture and, let’s say, you want to sample one percent of logs, randomly sampling one percent of logs ends up becoming almost useless, because there’s no context. Because we control this request ID and we propagate it throughout the entire system, we can actually stably sample at, say, one percent. It’s not one percent of random requests. It’s one percent of entire chains of requests, which ends up being a very powerful feature.
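
A small sketch of why the stable, propagated request ID makes one percent sampling useful: if the sampling decision is derived deterministically from the ID, every hop in a chain makes the same decision, so whole request chains are kept or dropped together. The header name x-request-id matches what Envoy propagates; the hashing scheme here is just an illustration.

```python
# Deterministic sampling keyed off a propagated request ID: every service in
# the chain computes the same answer, so the sampled 1% is 1% of complete
# request chains rather than 1% of disconnected requests.
import hashlib
import uuid


def get_or_create_request_id(incoming_headers):
    # Reuse the ID the caller sent; only the first hop mints a new one.
    return incoming_headers.get("x-request-id") or str(uuid.uuid4())


def should_sample(request_id, sample_rate=0.01):
    # Hash the ID into [0, 1) so the decision is stable across hops.
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```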

Just as I was saying, we can do consistent logging. We can do tracing via this stable request ID and this tracing context. We can link all of these things together. Hopefully this is not too small to see, but just to give you an example of the information that Envoy provides to our developers at Lyft, this is our service-to-service dashboard. What you’ll see in this dashboard is, in the upper left hand corner, you can select the origin service, and you can select the destination service. All of our services are in there, and you can select any two pairs. What the dashboard shows you is all the things you might imagine that you want to know between any two services.

You’ve got your connections per second, requests per second, your totals, your membership, your response times, retries, errors, all of these things. Then, it’s cut off, but this dashboard shows both halves. You’re seeing the outgoing, or egress, portion of this dashboard, but below the fold is the ingress, or incoming, portion of the dashboard. It creates this really cool thing where, if people are seeing errors, or if they just want to understand what’s going on at any particular hop, they can open this dashboard and very, very quickly understand what’s going on between any two hops in the system. This is our global dashboard. Again, you’ve got all of this data coming in, and this is showing you, at a high level, at Lyft, from all Envoys, what the current state of the system is.

We’ve got how many global RPS, how many global connections per second, overall success rate, connections, retries, failures, right? For "just show me one thing that tells me the overall network health of the system," this type of dashboard becomes extremely valuable. Then, using this dashboard, you can drill into where failures are happening. You can go back to the service-to-service one to figure out what’s actually going on. This is showing a tracing view from our tracing provider, Lightstep. This is entirely generated by Envoy. App developers did nothing to actually make this happen.

All of the data is transmitted through Envoy. We have stable request IDs and stable trace context. With absolutely no effort, you can start to see the flow of the request that failed, and you can start to actually dig into where it failed. I can’t actually drop it down here, but if you click on this stuff here, you can drop it down, and we’ve got all kinds of other information about the request headers and different parameters. That’s pretty cool.

All right. This is our logging system, Kibana. Like I was saying before, we’ve got that stable request ID. Because we’re using the stable request ID throughout the system in addition to traces, it also goes into our logging system. Now, when applications log or when Envoy logs, we can go through and tie it to a trace. We can link back to our Elasticsearch logging system, and people can do additional debugging. Given all these tools, dashboards, tracing, and logging, it becomes substantially easier, if a failure occurs, to figure out: is it in the network somewhere? Is it a single instance? Was it an application error? We’ve got lots of information.

All right, so I want to just really quickly talk about perf. From a perf perspective, what you’ll hear from lots of people is that performance only matters if you’re Google, right, or performance only matters if you’re someone like Amazon. It’s true to a certain extent that throughput is obviously very important when you reach a very large scale, and it’s also true that at smaller companies people make the argument that, well, my developer time is worth more than optimizing throughput. You’ve kind of got these two competing things.

What kind of ends up happening is people don’t think a lot about tail latency. In these systems, even for smaller companies, even if your devs are super expensive and you don’t care as much about total throughput, you end up spending a lot of time debugging tail latency issues. Developers will come to you, and they’ll say, I’ve got some SLA. Why is my request taking so long out at p99 or something like that? That ends up, from a networking perspective, being a lot of time that we actually spend helping people understand where that time is being spent.

You know, as we’ve already talked about through this talk, the deployments that we have today are incredibly complicated, right? We have virtual architecture, physical architecture. We have all kinds of different things. We have garbage collection, we have no garbage collection. All of the things that we have today, they’re awesome. I mean, they make us way more productive, but they actually make it really hard to understand where time is being spent. When we see problems, you know, where is it? Was it in the app, was it in the infrastructure? Who knows? It’s almost impossible to tell.

At the end of the day, though, even if throughput is not necessarily the most important thing for many companies, being able to reason about where the time is spent and about overall performance is very important even to small companies. What I would argue is that you’ve got this observer effect, right? Hopefully I’ve done an okay job of convincing you that this service proxy approach is pretty cool and it has a lot of different benefits, but if the service proxy itself has that physics observer effect, where by measuring something you affect that measurement, the whole thing starts to fall apart.

If you can’t trust the data that’s coming out of the proxy, if you can’t trust that p99 data, it starts to become a lot less valuable. I won’t touch on this much more, but it is important to consider, for some of these tools, how they are written and what the latency profiles of these systems are. If you’re going to consider building one of them into your architecture, I would strongly encourage you to take that into account and to make sure that you choose one that is not adding a lot of extra variance, because that ends up being pretty important.

All right, so real quick, let’s talk about our deployment at Lyft. We’ve already gone over most of this, but in terms of scale, we run Envoy now on over 100 services in that deployment graph that I showed you, over 10,000 hosts. We peak at over two million requests per second within the mesh. 100% of traffic goes through the Envoy mesh, so that’s both REST as well as gRPC. We use Envoy’s gRPC bridge capability to allow Python gevent clients to actually make gRPC calls. We use Envoy for proxying MongoDB. There’s actually a lot of really cool stuff there that I would love to talk about, but there’s just not time. I can chat about that more after if there’s interest.

We use it for DynamoDB. I would say, just really briefly about Mongo and Dynamo, one of the main things that Envoy does is actually sniff that traffic, so we sniff the Mongo conversation and the Dynamo conversation, and we generate stats that are very specific to Dynamo and to Mongo. What that gives us, again, is all about observability. No matter whether the application is written in Go or PHP or Python, we get really rich global stats for these systems, which ends up being pretty amazing. We use it to proxy to all of our partners. I’ve already touched on this. We use Kibana and Elasticsearch for logging.

We use a company called Lightstep for tracing. We use a company called Wavefront for stats, via writing the stats out to statsd. Really briefly on future directions: we’re going to be adding Redis support to Envoy. We use Redis extensively at Lyft, and there are some operational pain points that have been pretty bad for us, so that will be coming soon. We’ll be working more on outlier detection, so in addition to consecutive 500s, we’ll be adding success rate variance as well as latency variance detection. We’re going to be really investing in public APIs for interacting with Envoy.

One of the things we’ve tried to do with Envoy is to make all of the APIs, whether it be for rate limiting or service discovery or auth, very open, the idea being that there’s no lock-in to use some of the ancillary services with Envoy. We want to be open about the APIs and allow people to, for example, bring their own rate limit service. We’re going to be adding a bunch of stuff around configuration schemas and better error output, to make it a little bit easier to use. We’re currently getting started with a partnership with the folks at Google to actually add what I hope is going to end up being pretty nice Kubernetes support. We’ll start looking at authn and authz.

Just briefly, we’ve only been open source with Envoy for about four months, and the reception that we’ve gotten has been awesome. I’m really excited just to see all the interest that’s out there. We’ve actually got several companies, some public, some that are not yet public, that are interested in becoming contributors. We’re trying to figure out over the next six months how we build a larger community around Envoy. I’m actually pretty excited about that.

All right, so that’s all I had. Again, thank you very much. Just, again, as I was saying before, I’m really excited about getting people interested in Envoy. Happy to answer questions. We’ve got a bunch of information on our public landing page. I’ve got to do my plug that we’re hiring, so you know, you can always talk to me about that. That’s it. Thank you. Happy to take questions. Yeah.

Flynn:
I’m Flynn from Datawire. I’ll be running questions. I’m going to open up with one from me while I walk over to these other people. When you were getting Lyft to switch over to using Envoy as a company, was there any particular challenge you faced on that? How did you deal with that?

Matt:
Yeah. That was challenging, and I take the approach to engineering that everything should be incremental. I probably should have talked a little bit more about what that transition was like. It wasn’t like we magically developed Envoy and switched everyone over overnight. It was a pretty incremental process where we started by using Envoy for our front proxy, so kind of as the NGINX/ELB replacement. We then rolled out Envoy on our monolith. We added MongoDB support there. Then, we started gradually rolling Envoy out in a way that was concurrent with the existing ELBs.

It was a long process. I would say we’re now fully deployed, but it took a year to get there with concurrent development. Yeah.

Flynn: Thank you.

Audience: I had a question around tail latency. Do you recommend a deployment where you reserve a core for Envoy?

Matt:
I wouldn’t say I recommend it exactly. I think that it could help for sure. I’ve done a lot of work with core pinning and a bunch of other stuff over the years. I guess what I would say in short, and I’m happy to talk about this after, is you can get a lot of benefits by doing that, but it’s very complicated. You can also, I would say, make things worse pretty easily. Real quick, I didn’t cover this, but the architecture of Envoy is completely non-blocking and embarrassingly parallel. We recommend that people run basically one thread per core for Envoy.

We allow you to limit the number of threads. It would be very easy to basically do an "Envoy has one thread, pin it to one core" type of model for sure.

Audience: Are your data pipelines integrated with Envoy?

Matt: By data pipeline, do you mean-

Audience: Your spark clusters or anything along those lines.

Matt:
Right. We use it for Kinesis mostly at the RPC layer, but we don’t have integration yet with Spark or Hive or something along those lines.

Flynn: Go ahead.

Audience: Can you talk about Envoy’s extensibility story? What does it take to add a new filter or a new protocol to Envoy?

Matt:
Sure. Right now, the filters are written in C++, so there’s no scripting capability currently. I’ll also be honest in that we don’t have any public docs on it yet. I will say, however, that we’ve had multiple companies write filters already with little to no involvement from me. I feel very confident that kind of the interfaces are at least on the right path and the code documentation is good enough that it’s not actually unreasonable for people to start writing code, but that is something we want to invest in in terms of having better documentation and better extensibility story.

Audience:
One of the advantages, obviously, or one of the disadvantages, is it brings back the work into a single team or a single entity, like Envoy, so how do you [inaudible 00:51:13] of features that other teams want? If somebody’s implemented something, how do you bring it back into Envoy and get it out to everybody?

Matt: Do you mean features from a product perspective?

Audience:
No, specifically to Envoy, right? There’s a new rate limiting filter that one of the teams that already has Envoy actually wants to use, and it’s not in Envoy yet.

Matt:
You know, that hasn’t typically been a problem so far, because what I told people at Lyft, when we were developing Envoy, is that if the word "car" ever appears in the Envoy source code, we’ve done something very wrong. By that, I just mean that we’ve tried so far to build features into Envoy that are useful across a very large portion of the organization. I can’t say that we’ve actually had that problem yet, because almost every feature we’ve built, every team ends up using effectively.

I would say that, in the future, the filtering mechanisms are extensible and pluggable enough that I don’t see any particular reason it would be difficult for a team to use a filter that another team wasn’t using.

Audience:
Thanks for creating Envoy. That looks really exciting. From a health checking perspective, I’ve heard about some of the larger providers that are doing manual health checks against a service from the client side and burning a ton of network and CPU just doing health checks, whereas something like Kubernetes does the health checks locally and then updates the service registry. I’m wondering, you kind of lumped together the explicit health check with just watching the request/response stream. Is that separable?

Matt: Yeah.

Audience: You can just let the service registry deal with the health checks.

Matt: Yeah.

Audience: Okay.

Matt:

Yeah, so from a health checking perspective, everything is completely configurable. Whether you use active health checks or not is up to you. Whether you use passive health checks and what the parameters are is up to you. I do want to touch just really briefly on the burning CPU and network comment. There definitely is a perception that doing all this active health checking is very wasteful. It may seem, without thinking about it that much, that it is very wasteful, but the honest reality is that when you do health checking using plain text H1 calls and you use keep-alive connections, the actual load of doing it, if you measure it, is really not that large.

Especially when you’re doing all this other work on your nodes already, what we find is... I actually don’t know the current numbers, but I think we run health checks with jitter, every 15-30 seconds, and we use passive health checking also. We have a lot of health checking traffic, but it’s noise compared to everything else that’s going on. I would push back slightly on the idea that the active health checking model is not scalable. I think for very, very, very large infrastructures, there are cases where it doesn’t scale, and there are things that we’re going to do in the future around subset support within the load balancer to fix that, but I think very few companies and deployments will actually have an issue with active health checking.
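
For a sense of how cheap this can be, here is a sketch of an active health checker in the spirit described above: one keep-alive connection per host, a /healthcheck probe every 15-30 seconds with jitter, and a consecutive-failure threshold. The threshold value and the way the result is reported are assumptions for illustration.

```python
# Sketch of an active health checker over a keep-alive HTTP/1 connection.
import http.client
import random
import time


def health_check_loop(host, port, path="/healthcheck", unhealthy_threshold=3):
    conn = http.client.HTTPConnection(host, port, timeout=2.0)  # reused across probes
    failures = 0
    while True:
        try:
            conn.request("GET", path)
            resp = conn.getresponse()
            resp.read()  # drain the body so the keep-alive connection can be reused
            ok = resp.status == 200
        except (OSError, http.client.HTTPException):
            ok = False
            conn.close()
            conn = http.client.HTTPConnection(host, port, timeout=2.0)  # reconnect
        failures = 0 if ok else failures + 1
        if failures == unhealthy_threshold:
            print(f"{host}:{port} marked unhealthy")  # stand-in for ejecting from the LB pool
        # Jittered interval keeps probes cheap and avoids synchronized bursts.
        time.sleep(random.uniform(15, 30))
```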

Audience:
You mentioned that it was possible with Envoy to separate the discovery layer and health checks and all that sort of stuff. Do you have any experience on whether it’s a good idea?

Matt: Do you mean separate in which way?

Audience: The question from a moment ago was that systems like Kubernetes could do local health checks and then update their own [inaudible 00:55:13] stuff, which is kind of different, I believe, from the way you guys do it.

Matt:
Yeah, I don’t see any problem with that. I think from an Envoy perspective, and we’ve had a lot of conversations with different customers at this point about what they want, the system is actually extensible enough that there’s no reason, for example, that the service discovery data that comes back couldn’t also include health checking, right? The discovery service could also be doing centralized health checking. That could be part of it, or the discovery service could hand back a subset of hosts which are then actively health checked.

I think kind of depending on the deployment, there’s a lot of different things that are potentially possible for sure.

Flynn:
I think we have time for one more.

Audience:
Thanks for a great talk. On deployment, I like to deploy my microservices in Docker containers, so I was wondering how well Envoy would work in a Docker container, maybe deployed to the same host as a service also running in a Docker container.

Matt:
Yeah, there’s so much to talk about, and there’s not that much time, but we support Envoy hot restart, so Envoy is able to fully restart itself, full code change, without dropping any connections. Based on previous experience, we actually built that to be container aware. Envoy actually has a Unix domain socket protocol that it uses to talk to other Envoys, so it can gracefully drain. That was a little bit of a long-winded answer to say it will work fine within containers. Yeah.

Flynn: All right. Thank you very much.

Matt: Yeah. Thank you.

