This talk covers the past, present, and future of Microservices at Squarespace. We begin with our journey to microservices, and describe the platform that made this possible. We introduce our idea of the “Pillars of Microservices”, everything a developer needs to have a successful production service. For each pillar we describe why we think it is important and discuss the implementation and how we utilize it in our environment. Next, we look to the future evolution of our microservices environment including how we are using containerization and Kubernetes to overcome some of the problems we’ve faced with more static infrastructure.
Kelsey: All right, everyone, we’re back. I now have Doug Jones and Kevin Lynch who are going to be talking about adopting microservices at Squarespace. If you’re just joining us, I want to point out the Q&A button on the bottom of your screen which you can use to ask questions. Feel free to post some throughout the presentation and we’ll get to all of them at the end. Now, I’m going to hand it over to you, Kevin and Doug. You guys can get started.
Doug Jones: Okay, thank you Kelsey. Thank you to Richard Lee, Datawire and the Microservices Summit for hosting us today. My name is Doug Jones and I’m also here with Kevin Lynch. I don’t know if you can see him to the side here.
Kevin Lynch: Hello.
Doug Jones: We’re going to tell you about the story of microservices at Squarespace today. We’re going to have a talk in two parts. For the first part, I’m going to talk to you about building the pillars of microservices at Squarespace and for part two, Kevin is going to take over and go over containerization and orchestration with Kubernetes. Okay so part one, building the pillars. I’m going to talk to you about our journey to microservices and then, how we actually built the pillars of our microservices here at Squarespace. If you aren’t familiar with the company, Squarespace is a software service that lets our users create their own sites, build their online stores, register their domains, everything that they need to get themselves or their business online.
I think the story of microservices at Squarespace is really a story that’s tied with our growth. We’ve been lucky enough to grow in terms of the size of our engineering team and also our product and customer base. Overtime, looking back where we started and then where we’re at now, I think some of the chief challenges that we’ve experienced due to this growth is challenges, not only with how we architect our software for scalability and reliability, but also how we actually organize our engineering team and make sure that we’re staying productive. I think the challenges with a monolithic system are pretty well understood but I wanted to highlight a couple that were relevant to us.
Reliability and performance are two big ones. When you have a monolithic system and you have your entire functionality, your entire codebasis running within one process, it’s very easy to bring that process down and then therefore, bring your entire company down more or less. Performance, it’s hard to measure the performance of a system and make changes overtime. If you think about when you start a codebase and you start in year one, some of the architecture and performance-related decisions that you make and then, you’re in your five or ten or whatever of that codebase, it’s hard to go back and unwind those changes maybe and make different decisions about caching or other performance-related things like that.
I think, also, related to the challenges with growth comes challenges with how you actually are able to engineer things. If you look at the speed of an engineering team, how difficult is it for them to go from an idea to actually delivering that into production, how much time do they have to spend coordinating with other teams to make sure that they’re not breaking something that’s using shared code or that whatever they’re building is going to work for that team that might go on top of their stuff and how much time they spent doing that rather than actually delivering something. Another aspect to this is engineering time spent firefighting rather than building new functionality.
If you always have to be putting out a fire or investigating an issue or an outage, that takes time away from building new things. Let me share one of our stories about an outage and this was during our first Super Bowl ad actually. We put the ad, we knew we’re going to run the ad and so we spent a lot of time to make sure that we have caching and that we could handle the load that we expected from a Super Bowl. One of the things that we overlooked though was if you go to Google and you search for Squarespace, the first thing that you see is our ad. The ad is configured to add a unique idea as a query parameter for each person that clicks on that. That passed right through all the caching layers that we had setup and basically, we were trying to serve all those requests out of the database.
This brought our system to crawl and right when we needed it to be online the most and what’s worse is that it also was affecting our customer websites, not just our own homepage and not just our own sites related to the ad. I was part of the team that was there doing the Super Bowl, watching this happen and trying to keep our systems online and figure out what’s going on. It’s not easy to dig through a mountain of information and try to find the one single thing that actually caused the failure. If you’re trying to do this while there’s a fire going on, that can be very difficult to dig through a monolithic application and come up with a single cause of failure.
Yeah. Related to this, I think when you have a monolithic system and you think about how you’re going to monitor and find faults with it, I think that monitoring, typically, starts at the edges of a system. If you imagine a system with no monitoring, you’ll probably start by adding it to the request in. Then, your database queries or cache queries going out from that system and you’ll start to have graphs and metrics for this sort of stuff. What about the rest of that application like that center part of the application where the guts are and all of your business logic? Things can be very complicated there and how much visibility do most teams really have into how that operates.
When there is a fire going on, how long does it take for you to recover from an issue and also, to find the root cause and fix that issue? Yeah, these were some of the challenges that we were facing with our monolithic system. We started to investigate microservices and started to think about how that could be part of our organization and our way of building things. What we decided was that we would start with an experiment and whatever we learn from that, we would turn that into a platform that could be reused by other teams in our organization. We defined these ideas that we called pillars that we are considering all of the ideas that are necessary to have a successful production microservice.
We wanted to take these pillars and build them into our platform such that other teams could benefit from this and this would help reduce boiler plate and keep people from having to reinvent the wheel and instead, they can focus on building whatever their application is and getting their cool new functionality out into production. Okay, so here I just wanted to show you some of the pillars that I’ll be talking about today. I’m going to go through each one of these in turn and describe to you what it is and why we think it’s important and how we implemented them. To give you a little bit of context about how we actually delivered these pillars, so there’s basically three areas that we might have code in.
I think the first is the client platform. That’s the client, our HTTP client that’s running in all of our microservices and then, second is the server platform. Squarespace is a Java shop so we’re talking about Java, HTTP servers here. Finally, the tooling. That’s the whole world of tools around all these services that lets developers have understanding and then, be productive and have control over the system. On the slide, I just wanted to show you some of the features that we have in each of these parts of the platform and I’ll be touching on some of these as we go through the different pillars. Okay, so our first pillar is HTTP APIs.
We use HTTP with JSON and at the time that we made this decision, I think it was pretty easy because HTTP and JSON were industry standards, there’s lots of tools, lots of opensource. It was a really safe bet to be able to say, “No matter what route we end up going, we’ll be able to integrate with HTTP and JSON.” As I was saying, we’re a Java shop so we were looking at Java API server platforms and we started with Dropwizard because that’s what we were familiar with our monolithic system. Dropwizard has served us pretty well but I think there are some limitations when you try to use it for microservices and want to have some variation with how these services might be built.
We were using Dropwizard with dependency injection from Google GUIs and we have to integrate that ourselves. It wasn’t actually a part of Dropwizard and we found that to be a bit limiting because when we wanted to swap out different components or customize the server deployment a bit, it was a bit difficult to pull that off with Dropwizard. That’s what caused us to start to look at Spring Boot because Spring DI is a core part of the framework and it allow us to do interesting things like use autowiring so that we can drop a .jar on a class path and all of a sudden, that code is just wired up into the application without the developer having to write anything.
We have Spring Boot configured to use Jetty and Jersey 2. Okay, so now we have these API servers but we wanted to make it even easier for our developers to actually build APIs. We started to look at specifications for APIs and we landed on Swagger and the OpenAPI specification. From this, we can generate code from the specification and we generate models, the server API and the client. Here’s an example of a Swagger spec. You write the spec in the [inaudible 00:10:36] and then, you basically are describing all of your paths and the HTTP verbs on those paths. Then, you have some information that is metadata used for documentation, you’re describing your parameters and describing your responses and then, you can reference some models for the JSON objects that you’re going to consume in and return.
One of the cool things that you get with Swagger is an interactive dashboard and documentation. Here’s an example for one of our API endpoints. You can actually go to this dashboard and try out the API, interact with it against the live server. I think having this method of developing APIs has been really productive because it allows developers to focus on an API first mentality and they can just make a specification file and iterate on that and the code will just be generated for them. They’re not getting bogged down in trying to refine their API and having to make a lot of code changes as they go. Our next pillar is service discovery. This is important because this is how our services find each other in production, this is how you know what server you actually have to communicate with.
We wanted our services to announce themselves and just publish a very simple payload of their name and their host and port information. We also wanted our service discovery system to be able to have some concept of health checks such that if a service is degraded or down, it’s excluded from the pool of available servers. Our first service discovery system was Zookeeper and Zookeeper is really solid but I think it has some drawbacks for service discovery and one of the first ones is that it’s complicated. Zookeeper clients are complicated. There’s not an HTTP API, you have to take a thick Zookeeper client and even that client doesn’t have some of the primitives that you want for service discovery.
You actually have to build an application on top of that. That is a lot of work especially if you’re looking at integrating a new platform that doesn’t have a Zookeeper client already or it doesn’t have a library doing service discovery the way that you wanted to. Zookeeper is strongly consistent which is great for a lot of type of applications but it’s not necessary for service discovery. Zookeeper does have client heartbeats but you can’t really configure or expand upon that system. Finally, Zookeeper didn’t really have a great way that we could find to have a cluster span multiple datacenters. That led us to look at Consul as a replacement for Zookeeper for our service discovery needs.
We like Consul because it had first class support for service discovery and it also has built-in multi-datacenter support. Consul has a very simple HTTP API which, I think, makes it so easy to integrate just about anything out there that can speak HTTP. You can even just use the API through [inaudible 00:13:46] and start to be productive with it. Consul has configurable health checks so it’s pretty great that you can actually do some things like have your service provide more information back to Consul, publish your own health checks so that you can have your service itself inform Consul whether it’s healthy or not. Then, Consul has a key value store that we use to port over some of the functionality that we have on Zookeeper such as dynamic configuration and leader election.
This is a picture of how we typically deploy things in multiple datacenters with Consul. In each datacenter, you have a consistent set of Consuls and then, Consul will use Gossip across the WAN to communicate with other datacenters. When we deploy our services into each datacenter, they announce to their local Consul cluster and then we also configure our services to read and write from the primary database, regardless of what datacenter that’s in, so that we have rewrite capability and every service instance, no matter what datacenter it’s in. If there’s an outage for a service in a datacenter, what you can do is when you query Consul for discovery information, you can give it a parameter for what datacenter you want to query in.
Then, Consul will go ahead and forward that over to the remote Consul cluster and that datacenter and then, you’ll get back the services that are running there. On our case, we’ll get back to the services and datacenter too and they’ll be connected to that same primary database so we’ll be able to use the API as normal and have read or write capability. The downside here is that you’re paying a latency penalty to hop across datacenters but we think that that penalty is worth it to keep the service online. If there was an extended outage, we would fill to the other database over to the other datacenter.
Another pillar is software load balancers. We wanted to go with software load balancers and integrate them into our clients so that we could avoid extra middleware and configuration and just another moving piece that could break or fail. We wanted also the ability to customize are software load balancers and we thought it would be to our advantage to have it plugged in to the service discovery system and have its own view of the world so that you could do things like have better fault tolerance and potentially better performance. We built on top of Netflix Ribbon, Opensource software for our software load balancers.
A big pillar for us is observability and I started, in the beginning, talking about observability on monolith and how that can be challenging. I think microservices, there’s also challenges for observability there but I think you also have an advantage, in the sense, that you’re creating edges in a system where there might not have been edges before. If you think about a monolithic application with some business logic that’s maybe a package or a module inside of it. With microservices, you can expose that and give it an API and then, also give it an edge that you can monitor. Hopefully, each microservices, a smaller thing that’s more easily understood and observed.
Then, you run into the challenge of you don’t have microservices in isolation. You have this whole distributed system so how do you understand that and make sense of it especially when there’s problems or faults within the system. That’s where this observability pillar comes in and these techniques. Metrics and dashboards are huge for us at Squarespace. We use Graphite for our metric storage and we use Grafana for our dashboards and our services are able to publish their own dashboards into the Grafana. We use Zipkin for our distributed tracing and I think it’s really interesting to have this view where, for one logical request, you can see all of the services and all of the databases involved in that.
From there, you can get a picture of where time is being spent handling a request. If you want to optimize something, it tells you where you should be spending your focus. They key to being able to actually do distributed tracing is having some way to tie together all of these different requests that are happening in your distributed system. For us, that starts with a context ID at the edge of our system and so that’s just a unique ID per request and we carry that forward to all of the request from microservice to microservice. We also log that ID into each of our log lines that are handling that request. Our service client automatically takes this context ID and passes it along as a header for each outgoing request.
Another pillar for us is having an asynchronous HTTP client. I think this is really meant for performance and to be productive as engineers dealing all of these remote calls. We’re talking about the [inaudible 00:19:17] problem here and how we can have good latency despite this. We wanted to go with a reactive approach using our RxJava and RxNetty and I think this approach allows us to have better composition and reuse. If you compare it to other ways of implementing asynchronous system, you might avoid callback hell by having a reactive approach. I think RxJava allows for greater composition and reuse of components that otherwise might be deep within a nested callback that you couldn’t really easily extract.
Wanted to depict fanout and then try to help describe the situation when you have many services called out as part of handling one request. Here we have an application container with a client that’s calling out to services A and Z. Then A, in turn, calls out to B, C and D. If you have a typical blocking client with synchronous execution, what you’re going to see is that first, you call to A and then, A calls out to all the services it depends on one-by-one and then finally, the client calls out to Z after it gets an answer for A. If you were to add up the latency for all of these, you would find it’s just the sum of the latency for each service involved in handling that.
You can see how these really can balloon overtime and the total latency that handle request can really blow up if you are forced to execute these synchronously. With asynchronous execution, assuming that there’s no data dependencies between any services, you can execute them with perfect concurrency so we can have our client call out to A and Z concurrently and then, A can also call out to all of its services concurrently. That means that our total latency for handling this request, basically, is the slowest part of any of the services that it has to call out to so either A or Z will be the slowest to respond here and that will define our overall latency.
This is a much better situation than having to synchronously call out and potentially having latencies really balloon overtime. I think reactive RxJava approach allows us to really control and manipulate all of this concurrent computation that’s happening. A huge pillar is fault tolerance and so this is really what you need to make sure that if you have a problem in one of your microservices, it doesn’t cascade everywhere and bring down your whole distributed system. For us, some of the key techniques here, one is circuit breakers and so that’s the idea that if there’s a fault in the system somewhere, you’re going to break that circuit and isolate that system.
Then, retry logic so if you have services that are coming and going and you’re deploying, you want to be able to retry a request to make sure that you aren’t going to report an error just for some transient hiccup. Timeouts, make sure that you’re not waiting for forever a response from a remote system and finally, callbacks that you respond with cached or static values in the case that there is an error somewhere so that, potentially, your users will not even notice that there is a problem. We use Netflix Hystrix to implement these. Here’s a depiction of fault tolerance. Imagine that we have this one user request that depends upon services A, B and C.
If C is down or slow, and we are not fault tolerant, then that means that our user request is stuck there and blocked waiting for an answer from C. This can also cause request to pile up so you have all these user requests that are there blocked, the threads are piling up in your application container and this can cause an outage because you’ve overloaded that system. If you have fault tolerance, you can have a circuit that can break open for C and then, you can either fail quickly or pass a fallback value and still respond to that user request successfully. Because you’re able to respond quickly, at least, you’re not having these requests pile up in your system.
Okay, so those are all the pillars of microservices at Squarespace that I’m going to cover. These are ideas that we think are important to have to have a successful production microservice. For us, some of the future work, I think, one big area is we have asynchronous clients but we want to look at having an asynchronous system end-to-end. More easily, being able to develop asynchronous servers and take advantage of end-to-end streaming and we’ll probably look at technologies like gRPC and Netty. Another area is distributed task management. If you have this fleet of microservices, how do you make sure that you’re not wasting your hardware sources on tasks that maybe only run a fraction of the time. We think something like server-less computing can really help this area.
Alert definition and management is really important when you have a team that is operating many microservices. You need to make sure that they are able to automate alerting so that they don’t have to be glued to the dashboard. At a certain point, it becomes impossible to just be able to watch dashboards throughout all the different services that you might be responsible for for operating. Finally, we’re looking to build better tooling to create and deploy services just to make it that much easier for our developers to take something brand new and get it into production. Okay, so that was it for part one. Now, I’m going to turn things over to Kevin for part two.
Kevin Lynch: Thanks, Doug. Yeah, so I’m going to talk about some of the challenges we’ve faced with containerizing our Java applications and integrating them into Kubernetes for orchestration. I’m going to start the talk by talking about a little bit of the problems that we’ve had with our static infrastructure and how it doesn’t really work well with microservices and that high velocity world that we want to move towards. Also, we faced a lot of issues because we run our own datacenters. We don’t really have anything running in [inaudible 00:26:06] or Google Cloud so integrating Kubernetes into our datacenters has been a challenge that we’ve had to overcome.
Finally, there had been other issues with containerization especially related to Java that we’ve had to overcome so I’m going to outline those at the end of the talk. One of the problems that we found ourselves in is that we’re stuck in a loop. Our engineering organization was growing, we wanted to have more services because we wanted to move away from the model that Doug was talking about. That requires that we have a lot more infrastructure that we have to spin up because we were currently, at the time, running one instance of its service on one virtual machine. The only people who could really bring that up were the operations teams.
As we were doubling our engineering organization, we ended up moving from, I guess about a year ago, we had about a dozen microservices. Right now, we have about 50 microservices that are either running in production or going to be in production very soon. You can see how this can quickly become a problem. We pretty much came to the realization that static infrastructure and microservices don’t really mix. It’s difficult to find available resources. We were spinning up our VM instances [inaudible 00:27:43] and and we were using a very manual approach for identifying which hosts had free resources. It was very slow to provision and scale up these services and that was especially problematic when we were hitting rough times where one of our services was terribly underperforming and affecting end user performance.
We had to quickly spin-up these services and when it takes 15 minutes to spin up just one instance, then you’re in a really tough position. One of the other things that microservices pretty much require is a robust discovery system. Fortunately, as Doug talked about, we already have a Consul system up and running which has helped us ease that transition. Some of the other challenges are the Graphite system that Doug had mentioned. It doesn’t really support short-lived metrics, each service, each instance of a service that we bring up is going to start producing metrics immediately and those metrics take-up a certain amount of storage space and introduce a strain on the Graphite system.
We had to look for a different alternative especially one that would it be able to support much more ephemeral instances. Finally, as we’re moving towards the microservices world, we need better alerting particularly around moving away from the one alert per instance mentality where you’re treating things as pests rather than looking at the service more as a flock of cattle where you’re looking and more concerned about the SLA or the service rather than, “Do I have all of the instances up and running?” We’ve had to overcome some of these challenges. The traditional provisioning process was very painful.
In VMWare, you have to pick ESX with available resources, you have to pick an IP address for that instance, you have to register that instance or provisioning system called Cobler, you have to add a DNS entry to our system, we have to accurately create the virtual machine, PXE boot that machine and install the OS and base configuration on that. Finally, you can get to installing system dependencies that are required by organizations such as LDAP, NTP, CollectD, Sensu for alerting. You, then, have to install all of the application dependencies on that as well, Java, FluentD or Filebeat for log aggregation. You need to install a Consul agent as well and use Mongo [inaudible 00:30:44] Mongo for a lot of our services.
We have to install that and make sure that that can communicate with our database cluster. Finally, once you’ve done all that, you can install the app and it’s finally able to register with the discovery system and begin receiving traffic. Obviously, some of these can be optimized, we can start using VM template images, but then you run into other issues where you have to keep that up to date. We can start shipping containers directly on to the instance but you still have to spin up that whole VM instance as well.
So we were looking for a much bigger win. In the Kubernetes world, it’s as simple as just applying an application definition. Obviously, this is very over-simplified, and we ran into a lot of challenges, as I mentioned along this way. But, largely, how do we make this magic work? Provisioning, scaling, deploying. That’s all handled by the Kubernetes infrastructure automatically. It’s been very easy to adapt to that world because you’re really just defining what constraints, what you want for that service, and then Kubernetes will handle the rest of deploying whatever the application container is and hooking up all of the logic for that. Picking, doing the orchestration, handing scaling of instances.
For monitoring, we have been investigating the use of Prometheus for that. Prometheus is able to handle, automatically discover the instances that we need. It supports a much better tagging method as well, so you’re not really concerned about the storage space of an individual metric. But, instead, you’re able to create the metrics to include much more detailed information and the developer’s less concerned with taking up way too much resources of that Graphite cluster.
Additionally, alerting, rather than using Sensu that we were using before, we’re able to use AlertManager, which is provided along with Prometheus, and that allows developers to actually ship the actual alerts they’re concerned about alongside of the application. I’ll talk about those a little bit later in more detail. And finally, discovery. We’re using Consul and Kubernetes also has some very rudimentary discovery capabilities as well with the service IP.
One of the challenges that we’ve actually faced is, as we’re moving towards this model, we have to deal with decentralization much more. So that’s kind of an overarching problem with all of this. Better insight is to who is spitting up the service, what talks to the service, where do I go when these services is malfunctioning, who do I talk to, so that would tie back into the tooling that Doug was talking about a little bit earlier.
So moving on. The biggest problem, like I had mentioned before was running Kubernetes in a datacenter. Richard, earlier in his keynote, had mentioned that it’s very easy to spin up Kubernetes in an AWS or a Google Cloud or one of the other cloud offerings with pretty much just a click of the button.
We were not so fortunate. So the biggest challenges for us was really the networking aspect of that. Provisioning Kubernetes masters and Kubernetes nodes is very straightforward and just like provisioning any other application. The architecture of Kubernetes is pretty basic, where you just have these masters and a bunch of nodes. The master runs all of the core processes like the controller manager, the API server, or the scheduler and it stores all of its state in that CD.
And the worker nodes will just then communicate with these masters and will be scheduled tasks that are assigned to it. The biggest challenge for us was actually the networking, and one of the decisions that we’ve been making independent of this Kubernetes switch is actually a switch towards a spine and leaf layer three network, and what this means is it’s very different from the traditional networking typically found in datacenters. Most datacenters always had a layer two network, which was very easy to set up, but as your datacenters grow and grow, it becomes much more of a difficult and error-prone solution.
So when we got a new datacenter, we decided to switch towards the spine and leaf topology, and what this is is each rack is its own layer three network. So the switches don’t really communicate at the layer two MAC address layer. They’ll be talking at the layer three IP address layer of the network stack.
So each of these leafs is a top-of-rack switch, and all of the real heavy lifting is performed at this layer so you’re reducing a lot of east-west communication at the spine layer, and those spines are really just there for routing the traffic from one server on one rack to another server on another rack, and it’s much more easy to conceptualize and you don’t really run into issues like the spanning tree protocol that you would see in a full L2 network, where you could pretty much bring down the entire network.
And finally, each of these leafs is its own separate BGP domain, and this is important because this will allow us to use much more easily the Calico solution for Kubernetes networking, which I’ll talk about a little bit later.
So this is much more simple to understand, and it’s very easy to scale. If we just add another rack, we would just put links to each of the spine switches and each of the other switches are then independent and don’t need to know about that. It’s also very efficient because it’s predictable and consistent latency. All traffic going is coming into a leaf and is exiting a leaf, so you only have two hops.
And finally, this allows for what are known as Anycast IPs, so we can add an IP address and have that announced from one server in one rack and have that same IP address announced from another server in another rack. And what that gives us is the ability to kind of have like a high availability for one service, and all traffic would go to either one node or the other node, and if one goes down, then that IP address would stop the announcing from that top-of-rack switch and all traffic would then go to any of the remaining IP addresses.
This is useful because now we can just treat the Kubernetes masters as one IP address and all communication flows in through that one IP address for the masters, and if we end up losing a master, then we’re still at a consistent and stable state. We can just take our time to bring up that other master without any end user being affected.
So like I said, this allows us to take advantage of Calico Networking. Calico is different from some of the other solutions for network overlays and Kubernetes, such as Flannel or Weaveworks, where they would do IP and IP encapsulation. So basically you’re adding an overhead across all of the packets you’re trying to send.
So there’s no network overlay required and Calico itself, there’s a daemon running on each Kubernetes node that would do BGP peering with each of the top-of-rack switches. What this gives us is the ability to have the actual pod IP addresses be indistinguishable effectively from any of the traditional VM networked service instances.
This gives us a much more seamless transition from VM-based instances to ones running in … where is this application running? It just runs in either Kubernetes or on a VM, and there’s no real differentiation between the two. Finally, it allows us to continue using Consul for service discovery because, in each pod, we would bring up Consul. It would have its own independent IP, which is reachable from any other instance in that network and you don’t really need to worry about … You’d just announce that instance like you would any other instance in a VM.
And finally, Calico gives us the ability to add network policy firewall rules. So if we have a service that we only want to be communicating with a certain other service for security reasons or whatever, then all we have to do is add fire … The network policy firewall rule and only the allowed pods or source IP addresses can be communicating with that.
So moving onto monitoring, like I said, one of the problems is that Graphite doesn’t really scale well with ephemeral instances. Each time you have to provision a new server, it’ll start sending immediately hundreds, thousands of metrics that are unique to that instance, and if that instance is constantly being deployed or scaled up, scaled down, then each time you do that, you’ll end up sending hundreds of thousands of new metrics to the Graphite cluster.
It’s very difficult to see exactly where, how a service is behaving across all of the deploys because now you’re looking at different metric and path and points throughout the course of hours or days, rather than if we were using something like Prometheus, where that tagged metric, you’re just clearing the same metric, just across different instances.
And finally, Graphite is very difficult to control the combinatoric explosion of metrics, where, if a developer just thinks, “Oh, I’ll just add this one end point without taking into consideration the number of servers,” the number of endpoints for that, then it can easily end up with one line of code and ending up being thousands of individual metrics without intending to produce that many.
Sensu has similar issues as well, where the application system alerts are very tightly coupled together. This means that engineers don’t really have a sense of ownership of where their alerts are and it’s very difficult to separate how an instance alert from alerts being generated for the entire service itself, and these would be the … an SLA alert, such as what’s the …
Make sure that the P95 for a service only gets … P95 response time is only a certain level, rather than, “Do I have all of my instances up?” It’s more important to say, “Do I have at least one service up?” and, “Am I meeting my response time guarantees?” Finally, it’s difficult to route alerts in this model. The alerts are currently defined separate from the application codes, so we wanted a way to merge those two together.
So here’s an example of hosts alerts, and you can see that it’s very hard to identify which of these are application alerts and which of these are just system-level alerts that come along with every system. Developers are typically not concerned if the disk is full or something like that, whereas an operations person would be more concerned with that.
So the answer to this is the Prometheus operator, which we’ve adopted. This gives us a couple of advantages. Prometheus operator is something provided by core OS and all this does is it manages Prometheus instances within Kubernetes itself. So each team would define their own Prometheus instance and this Prometheus instance would be … It writes to its own storage location. Can be cleared independently. So you’re basically taking that gian Graphite monolith and breaking it out into its individual instances. Where each team would have ownership of that Prometheus instance.
This gives us a little bit better sense of ownership and it also protects us from, if a team does end up writing, sending a large amount of traffic to a single Prometheus instance. You’re not affecting metrics for the entire organization just for that one individual team and we can then address that problem with a little bit more thought, rather than a knee jerk response to just getting the entire system up and running again so that the rest of the organization is able to see their metrics and alerts.
All a service center would then have to do is just define, create what’s called a service monitor in this world that would then … And that service module would be per service and along with that would come alerts for that individual service. Both of those can then be deployed along with the application. And so those definitions can exist within the code repository itself.
So here’s an example of Prometheus alert manager alerts, and they’re very simple to define. Basically, what you would just do is define each of these alerts as that text block within an alert definition in the config map in Kubernetes and ship that along with the application code.
One of the other challenges that we’ve encountered with micro services in Kubernetes is actually related to docker. Our microservice pods are pretty simple. All we ship in that pod is the Java microservice itself, Fluentd for aggregating logs and shipping them to our elastic search cluster, and then Consul for discovery.
Each of these comes along with different resource constraints. We typically just give each microservice two CPU cores and four gigs of memory. What we found is that this doesn’t really play well with the Java.
So Kubernetes pretty much assumes that no other processes are consuming significant resources. That’s pretty easy to deal with. Basically, what Kubernetes does is it’ll calculate all of the resources available on that host and assign them. It’ll schedule pods, based on what it is running, what was requested, and what limits are in existence for the pods running on that service.
In order to understand how this really works, we need a better understanding of the completely fair scheduler in Linux. This is the scheduler that existed for a very long time in the kernel and what it does is it schedules tasks based on CPU shares and throttles these tasks on pre-defined CPU quota. And these shares and quotas map up to the CPU requests and limits.
The CPU shares assigned for a given pod are basically the number of CPU requests times 1,024. So in this case, the shares would be 2,048 and that share is relatively assigned based on what Kubernetes detects to be the number of cores on the entire box.
So for our boxes, we typically run on 64 core boxes so the total number of shares across all the Kubernetes pods would be about 64K. And then the quota is assigned by the number of resource … By the number of CPUs assigned times a hundred milliseconds and, in this case, it would be, each pod would be assigned 200 milliseconds over 100 millisecond period.
Unfortunately, the JVM is able to detect the number of cores available to the entire system and it scales tasks relative to this. So the JVM itself would see 64 cores but in reality it’s only really being assigned the ability to run on two of those cores. So this ended up being a huge problem for us. Our workaround for that was basically to provide a base container that calculates the containers’ resources.
So we detect the number of cores assigned to this by, and each container we’re able to see the actual number of quotas assigned to that and we would automatically tune the JVM with those resources using JVM flags. So we can set the number of PC threads, the number of concurrent PC threads, and then the Java [inaudible 00:51:32] pool we limit to two concurrent threads as well.
However, one of the other problems was that a lot of libraries end up relying on the number of available processors, including Jetty and whatever that mysterious dependency that we’ve included in our application itself. So the solution to this was actually to basically preload … Use the LD preload mechanism in Linux, where we override the actual JVM library call that ends up returning, so that it returns the number of available process as two in this instance, rather than the number of cores available on the entire system.
So one of the other problems that we encountered was that its very difficult to communicate with external services, in the sense that we wanted to deploy applications agnostically to different environments and different datacenters, and so we can’t really hard code datacenter DNS names for external services like Elk or something, or the Consul masters, and we can’t really use Consul itself for this because not all applications that we’re deploying are using Consul.
For this, we ended up coming up with a solution where we basically … In order to talk to, for instance, the Kafka Elk cluster in each of the datacenters independently, we would create a service with the name Kafka.Elk, where that is actually just a headless service defined in Kubernetes without any cluster IP, and we would basically explicitly define the endpoints for that.
So in this case, we would be able to add the three, six, whatever end points for Kafka and we can then ship a very simple FluentD container as a side cart or a Consul container as a side cart in this without that container really knowing or caring where it’s running. All it knew is it needs to talk to the Kafka.Elk DNS address.
So there are couple of challenges that are left that we’re going to be addressing in the future. One is as we end up with all these services, we don’t really know who is running these services or who best to talk to. It’s also very challenging to enforce constraints, such as this … We require that all services have certain requests and certain limits and what is this service running? Is it using the Java Microservice framework or is it a Python microservice? So we’re working towards adding custom admission controllers to Kubernetes that would require all these standards.
One of the other challenges is that it can be confusing for a service developer to remember to include the exact Consul configuration that we want, or if we want to change the Consul configuration across all of the services, it’s difficult to go through all of the microservice repositories and make this change. So in Kubernetes 17, there is what are known as custom initializers, so we’re going to be working towards a solution where all we have to provide is each service owner would have to include the annotation. Initializers.squarespace.net/consul is “true” to indicate that they want Consul to be automatically injected into this deployment deploy time.
So yeah, so those are just some of the challenges that we’re facing in running Kubernetes in a datacenter and just wanted to open this up to any questions that you guys have. Thank you.
Kelsey: All right, guys. Thanks so much. So we did already have a few questions come in so we’ll start with those, and if you didn’t get a chance to post your question yet, feel free to do that with the Q&A button or in the Gitter channel. So the first question is has your team investigated RAML to write specs?
Kevin Lynch No. We haven’t looked at RAML. I’m not really too familiar with it so I’m not sure how different it might be from Swagger in their spec.
Kelsey: Okay. The next question is, how do you support API versioning with Swagger?
Kevin Lynch Yes. So we’re using Swagger to generate a client library that is then versioned and shared, so I think it’s ongoing work for us to look at how we can version and host the specs themselves and then maybe just generate the clients on the fly. For now, I think what we typically have done for services that have needed to make like break in changes to their API is that they basically just create like version 2 of that service and then release a whole other library in that spec.
Kelsey: Okay. Can you explain a little bit why you need Consul and Kubernetes for discovery?
Kevin Lynch Yeah. So we haven’t switched over all of our microservices to be running within consul right now, and all of the microservices use Consul as the backbone for the discovery mechanism, so the easiest transition was to continue using Consul for that, and we needed a way for those instances running within Kubernetes to be able to communicate with those running outside of it. So continuing on with Consul was the easiest solution for that.
Kelsey: Okay. So if you had adopted Kubernetes earlier, do you think you would still have gone with Consul for service discovery, or would Kubernetes’ built-in DNS-based service discovery been sufficient for you?
Kevin Lynch Well, Kubernetes’ built-in discovery doesn’t really add some of the features that Doug was talking about for the microservice so it doesn’t really add the ability to do circuit breaking or anything like that and you would rely on Kubernetes itself to detect when a pod instance goes dead and hopefully removes that from the pool of instances.
So we feel we probably would have needed some sort of mesh network solution, and of course, Envoy at the time didn’t really exist so I think if we were starting from the ground up today, probably something like Envoy would be a little bit better than Consul, but yeah. Something would have been necessary.
Kelsey: Okay. How are you able to test performance and capacity management for critical tier one apps and services?
Kevin Lynch In Kubernetes or just in general? I don’t know.
Kelsey: I don’t know.
Kevin Lynch Doug, do you have an answer for that?
Doug Jones: I think it depends on the service, right? So you want to determine, I think, like what does it scale with. Like just incoming traffic, does this service have to be involved in every request coming in past the edge or is it scaled along some other dimension? I think we would look at that and if it has the scale … Like the worst case is that it has to scale with every hit that comes in along the edge. If that’s the case, we would probably build it and benchmark it and load test it and make sure that it’s able to do that, but most of our services, they don’t need that level of scale.
Speaker 4: So the questioner did clarify that they meant Kubernetes.
Kevin Lynch Okay. Well, to actually build on what Doug was talking about, we would need a better understanding of what actually it’s scaling on. If it’s scaling on requests or CPU, then we could take advantage of the Kubernetes auto-scaling functionality. That will, by default, it does auto-scaling based on the number of resources, based on the amount [inaudible 01:00:28] that’s used by that container and its pod limit.
It also is able to tap into other metrics as well, so we would … In most of ours, they end up being request-driven, so we would just add a rule that whatever metric announced by Prometheus, once it hits a certain level, then they auto-scale up. We’re not taking advantage of that right now but we’ve experimented with that and it seems to work pretty well.
Kelsey: Okay. We’ve actually had a lot of questions come in and we’re just about to run over so we’ll do a couple more but if we don’t get to your question, feel free to put it in the Gitter channel. So do you use an API gateway and how do you release API docs?
Doug Jones: I would say that our Monolith is our API gateway and we’re starting to develop the ability to move away from that. I think what we’re looking at is sort of having like each service … Like basically from our edge, like using that to route request to the relevant service and then each service there needs to do the relevant thing, such as authentication. What was the second part of that question? API docs?
Doug Jones: So each service serves up their own API docs so you can just navigate to that service on any host and read the docs and then use the Swagger dashboard there.
Kelsey: Okay. Have you thought about using SDO to manage a service mesh?
Doug Jones: Do you want to talk about that?
Kevin Lynch I mean, we’ve talked about it, yes. It would be nice to switch to something that’s a little bit more ingrained into Kubernetes but I think we would need to get more services running inside of Kubernetes right now and move away from the traditional VM deployments which we’re kind of in the process of doing right now, so once we do that, then we would consider switching, I think.
Kelsey: Okay. How do you handle API security?
Doug Jones: Again, I think our Monolithic system is still like handling all of the security concerns, like for the product itself and doing authentication and authorization. And so that’s ongoing work for us to try to break that out a little bit more, but then also, in some places, we’ve used mutual TLS between services.
Kelsey: Okay. And then the last question is did you evaluate or reject any other schedulers? Example, Mesos or Nomad?
Kevin Lynch: So we did actually have a Mesos instance up. About over a year ago now, we were having the … Basically the engineering team was the first adopter of that, so we set up Mesos because Kubernetes was in its very early stages but Mesos looked like it was a lot more robust. We ended up running into a lot of challenges with that. It was much more complicated to set up as a platform engineer, and it was even more difficult for our engineers who weren’t used to basically platform engineering.
It was much more difficult for them to get their heads around resource constraints and whatnot. And they ended up killing the cluster several times, so that and the quick rise of Kubernetes made it a clear answer to switch to that, and as soon as we switched to that, it was night and day, I think. But that’s just our opinion.
Kelsey: Awesome. All right. Well, thank you so much, Doug and Kevin. This is going to wrap up our presentations for today, the first day of the Virtual Summit. We will be sending an email out with the talk recording, as well as the slides from today’s presentations, and then we will start tomorrow morning at 10:00 am Eastern with Lauri Apple from Zalando, talking about how to avoid creating a GitHub junkyard. So we hope to see you then and, again, if you didn’t get a chance to ask any questions, feel free to post them in the Gitter. It’s Gitter.im/datawire/summit. All right, thank you, everyone. We’ll see you tomorrow.
Doug Jones: Thank you.
Kevin Lynch: Thank you.
Try the open source Datawire Blackbird deployment project.