Microservices are the Future (And Always Will Be) – Josh Holtzman

Josh Holtzman (PayPal)

Description

Xoom began its migration to microservices over two years ago. We continue to chip away at our monolith and develop all new functionality as microservices.

In this presentation, I will share the lessons learned from that journey, including issues that took us by surprise:
* Compute cost explosion
* Instrumentation and the metric explosion
* Polyglot programming and the need for containers
* The Infosec challenge
* Fidelity of environments
* Application orchestration, regardless of containers

And the tools we deployed or built to address these issues:
* XIB-manager
* InfluxDB + Grafana
* Docker Compose and Kubernetes
* EasyCert, secrets, and the HSM
* The "Microservices Uniform" and container checklist

And the cultural shifts:
* The rise of golang
* The increasing sophistication of, and expectations for, developers
* Empathy and collaboration between dev and ops teams

Transcript

Josh Holtzman:

All right, so first a little history on Xoom because you may not have heard about it. We're a digital remittance company. We were founded in 2001 and we acquired another startup that was located in Guatemala called BlueKite in 2014. Finally, in 2016, just last year, we were purchased by PayPal. I think it's really important to give that context and that history because the culture of the company really will define how you move toward microservices. What tools you choose. What techniques and what processes you put in place. Let me give you a little translation on what all this means.

Digital remittance company: this means sending US-outbound funds to, currently, 56 countries. If you want to send money to Mexico, the Philippines, or India, you can use xoom.com to do that. And because it's a financial company, we're heavily regulated. We have the usual PCI compliance standards, as well as regulators from all 50 states, and then regulatory obligations from 56 countries. There's a lot of scrutiny, a lot of people looking at what we do and how we do it. Anytime you make a change in an environment like that, it involves quite a bit of coordination, and communication, and planning. This isn't a retooling of a startup. Moving to microservices is a big deal for us.

Founded in 2001; that means we've got a lot of legacy. There are 16 years of code, 16 years of data that we need to bring along in this migration from monoliths into microservices. That presents a number of challenges. It also means that we have a lot of code that is intertwined in its implementation. There are hundreds of tables, and all of this code makes assumptions about those tables being joinable. That presents an interesting challenge when you try to pull all those components apart and build microservices.

Why is it important that we acquired BlueKite in 2014? That essentially introduced a polyglot environment. We were pretty much an all-Java shop, which could have made things easier for a move to microservices, but once we acquired another company, all of a sudden we had Java, Node.js, Ruby, Go. In terms of persistence, we had MySQL versus Postgres. We had Redis, Elasticsearch, RabbitMQ. There were so many different persistence technologies as well as languages that we really had a polyglot environment by that point.

In terms of the PayPal acquisition, basically that just imposed new sets of rules, and regulations, and standards on our company, which were pretty closely aligned with ours. A lot of PayPal uses different technology stacks than we do; even inside of PayPal there are a number of different groups that use different technologies. Mesos versus Kubernetes, that kind of thing. Really, PayPal has been great and very encouraging for us to follow the path that we set off on before the acquisition. It's been great because the different groups inside PayPal have a lot of experience that we can pull from and learn from as well.

Throwing down the gauntlet: this was a few years back, when we realized that our monoliths were no longer serving us. Somebody earlier, I think it might have been Christian, said you want to optimize for speed and stay with relational databases as long as you can, because it's a great world to be in. It's really very simple, and cozy, and warm-feeling. That's totally true, but at some point you start to hit some limits.

For us those limits were things like build-times and you want to make a change to the website and you have to wait for hours for the builds to complete. Then you try to deploy and you've got a bug in the payment processor so you have to roll back the website changes. That starts to just become untenable and at some point you have to break these things apart.

The leadership at Xoom said yes, go ahead, spend the engineering resources to figure out how to break this all apart so we can decouple the teams, reduce those build times, and optimize for speed. Really start to deliver things more quickly and not quite as coupled together. It also gave us an opportunity to understand the resource needs of each of these different business domains. When everything is bundled into one application, or just a few applications, it's hard to figure out which parts of the application are not scaling well or are consuming shared resources more than others. By breaking things apart, we can get a better understanding of which parts of our stack are really our bottlenecks. Then we can scale those separately and scale them appropriately.

Microservices, at least at the time, it was early days and we were all very excited, and of course we figured it would be a panacea. Of course, there are always risks when you retool your stack. There were changes that we had to deal with in terms of programming paradigms that we were not ready for, not comfortable with, and not proficient in. There were a lot of learnings there. I'll go through those.

We needed a service discovery system; I think that's common knowledge these days, but early on that wasn't necessarily obvious. Our monitoring systems were designed for these monoliths; how would that need to change? We didn't really know. We didn't know what would happen to the performance of our applications; would all of these network hops slow things down? Our infrastructure was not set up for this at all. We had snowflakes all over the place: our F5 load balancers at the edge were handcrafted carefully by security engineers to work in a specific way, which made it very difficult to replicate and to test in different environments. We needed to move toward infrastructure as code, and all of a sudden we needed to take what has traditionally been an engineering or developer practice, coding, testing, and having deployment pipelines, into our network operations teams and our site operations teams, who had never worked in that way before. There were a lot of cultural changes in those teams as well that needed to happen, and are happening.

Build and deploy pipeline; this is something that when you only have a few applications, again, you can have handcrafted, very carefully designed build systems for those few applications. When you have hundreds of applications and new ones coming every week, you need to have an efficient deployment pipeline. We had to put that in place, and I'll talk about that as well. Then there are the issues around data ownership. I mentioned earlier: hundreds of tables, cross-service joins. It becomes very difficult to pull apart and define data ownership and put contracts in place. I'll talk about how we do that as well.

Starting off with the programming paradigms and idioms: in some respects the move to microservices is a major distraction for application developers. Folks who were writing the web application want to deliver a beautiful experience to our users, our customers. They don't want to worry about things like circuit breakers, and timeouts, and retries, because a) these things are hard, b) they're distracting, and c) they have nothing to do with building that beautiful UI. So how do we train all of our developers to use those best practices, do it properly, and yet still be able to get their jobs done?
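One common answer is to bury those patterns in shared client libraries so that application developers get them by default. Here is a minimal, illustrative sketch of the timeout-plus-retry idea in Java (this is not Xoom's actual client code):

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public final class ResilientClient {
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofMillis(500))  // fail fast on connect
            .build();

    // Retry an idempotent GET a bounded number of times (maxAttempts >= 1),
    // with a per-attempt timeout, so one slow downstream service cannot
    // stall the caller forever.
    public static HttpResponse<String> get(URI uri, int maxAttempts)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri)
                .timeout(Duration.ofSeconds(1))      // per-attempt deadline
                .GET()
                .build();
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            } catch (IOException e) {                // includes timeouts
                last = e;
                Thread.sleep(100L * attempt);        // simple linear backoff
            }
        }
        throw last;                                  // all attempts failed
    }
}
```

In practice you would reach for an off-the-shelf resilience library (Hystrix, Resilience4j, and the like) that packages these patterns, circuit breakers included, rather than have every team hand-roll them.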

Things like throttles and API designs all generate a lot of interesting discussion inside the organization, but they can also be costly in terms of time. For instance, when we talk about API designs ... I think somebody mentioned it earlier as well: the n+1 problem. How do we get around that? You have to really look at your use cases to determine how best to design your APIs. RPC versus REST: I've been discussing this in the hallways a lot today. There are clearly advantages to using RPC over the JSON-over-HTTP approach; however, a lot of developers are more comfortable being able to open up a browser, hit an endpoint, and see what that payload looks like. There's a cost-benefit you have to think about and measure: how much performance do we really need out of this, versus the flexibility and comfort of using your standard REST operations?

Additionally, with API designs, the response code granularity can also be a big controversy. If you look up all the HTTP response codes, there's a gajillion of them, right? There are tons of them, so what do I do as a client of those services? Do I have to handle every single one of those possible response codes? Having a group to look at and review the APIs, I think, is really important. That's something that we continue to try to do, but it can be challenging when you have a large organization with different silos.
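One pragmatic answer is for clients to handle classes of response codes rather than every individual one. A hypothetical sketch of that policy:

```java
// Hypothetical client-side policy: collapse the many possible HTTP status
// codes into a few actionable categories instead of handling each one.
public enum ResponseAction {
    SUCCESS, RETRY, FAIL;

    public static ResponseAction classify(int status) {
        if (status >= 200 && status < 300) return SUCCESS;
        if (status == 429) return RETRY;     // throttled: back off and retry
        if (status >= 500) return RETRY;     // server-side, possibly transient
        return FAIL;                         // other 4xx: caller error, don't retry
    }
}
```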

I want to talk about contracts for a moment because I think it's a really important point for any company that's polyglot and is moving to microservices. Again, this is all from the Xoom perspective, so if you've got a non-polyglot environment where everything is in Java, this may not be as big of an issue for you, but for us this is really key. We needed to have strong contracts in place for how applications are packaged, metadata about those applications, and how you can manage them. We put those all in place over a period of about a year, and we finally have a good story to tell.

We use Docker containers today for our microservices. Roughly half of our applications are built and packaged as containers. Each of those applications includes metadata on the container itself; we use Artifactory to publish our containers, and we can apply metadata to the images there as well as at runtime. You can introspect on a service and find out the build number, the API version, the developers, the PagerDuty schedule for that application. All of that is packaged in and accessible at runtime, so if there are issues operationally, we can see what's going on and who to call. In addition, there are health checks as well; I'll talk about that in the next slide. The management uniform is what we call that set of API contracts. Between the packaging and the runtime contracts, it really helps us to be able to treat any application the same. We can deploy any application the same way, and we don't have to have custom code or custom ways of monitoring each app. They all look alike, at least from the packaging and management uniform perspectives.
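The talk doesn't spell out the uniform's exact routes or fields, but the idea is that every container answers the same management API. A hypothetical sketch using the JDK's built-in HTTP server (the endpoint path and metadata fields here are made up for illustration):

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Every service, regardless of language, would expose the same metadata
// route so deployment and monitoring tooling can introspect any app alike.
public final class ManagementEndpoint {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8081), 0);
        server.createContext("/manage/info", exchange -> {
            byte[] body = ("{\"build\":\"1234\",\"apiVersion\":\"2\","
                    + "\"team\":\"payments\",\"pagerDuty\":\"payments-oncall\"}")
                    .getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();  // health checks would hang off the same server
    }
}
```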

There's one other part I want to mention about being polyglot and having that kind of flexibility. It's great in some regards, but I feel like it can slow an organization down. It would be great to be able to upgrade a library in Java and have it take effect across the entire organization. Somebody mentioned this earlier as well, though: when you've got multiple frameworks and multiple languages, that becomes much trickier, and it takes a long time to roll out upgrades. Again, if you can stay in the cozy world of a single relational database, do it. If you can stay in a single language, do it. Going polyglot brings costs. I think Varun and Matt both touched on those challenges this morning.

Our service discovery solution is very similar to what Envoy does. We're actually getting rid of our custom code and moving to something called Linkerd, so there are a number of open source tools out on the Internet that are available to do this. I think most of them work reasonably well. For our use cases, which are fairly straightforward, I think pretty much any one of the off-the-shelf solutions would work. We used a custom, layer seven load balancer, and we've set up our DNS to respect a xoom.api top-level domain; anything.xoom.api resolves back to localhost, which is where our service proxy runs, as a sidecar essentially. All of our applications can talk to our entire service fabric, which is represented over there on the right, through the local xoom.api name. If I need to hit auth, our authentication API, version 2 of that API, I can hit auth.2.xoom.api and find all of those instances. Get load balancing. Get what we call reputation-based routing, so if you've got one backend instance that's responding faster than others, we'll hit that one preferentially.
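Because everything resolves through the local sidecar, calling another service is just an ordinary HTTP request against a well-known hostname. A sketch of what a caller looks like under that scheme (the hostname convention is from the talk; the request path is hypothetical):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// With *.xoom.api resolving to the localhost sidecar proxy, discovery is
// invisible to the caller: name the service and API version in the host.
public final class AuthClient {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("http://auth.2.xoom.api/tokens/validate"))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        // The sidecar already picked a healthy, well-reputed backend
        // instance; the caller just sees a normal HTTP response.
        System.out.println(response.statusCode());
    }
}
```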

We use a ZooKeeper backend with Curator. ZooKeeper gives us strong consistency, and then with Curator and asynchronous updates, we're relaxing some of that strong consistency. It's exactly what Matt was saying this morning. The service discovery system that we have uses health checks for routing, similar to what the others are using as well. We've built a UI on top of it too. I think this is really important from the observability perspective, so that you can actually see all the services registered in one place. Developers can see it visually, and again, you can look at a service and find out which team and PagerDuty schedule is responsible for that application. If you need to call someone, you can just go to the service portal and find them that way.
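Xoom's registration code isn't shown in the talk, but ZooKeeper-plus-Curator registration commonly looks something like this minimal sketch using Curator's service-discovery extension (service names, addresses, and paths are illustrative):

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.curator.x.discovery.ServiceDiscovery;
import org.apache.curator.x.discovery.ServiceDiscoveryBuilder;
import org.apache.curator.x.discovery.ServiceInstance;

public final class Register {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zookeeper.example.com:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        ServiceInstance<Void> instance = ServiceInstance.<Void>builder()
                .name("auth-2")           // service name plus API version
                .address("10.0.0.5")
                .port(8080)
                .build();

        ServiceDiscovery<Void> discovery = ServiceDiscoveryBuilder.builder(Void.class)
                .client(client)
                .basePath("/services")    // znode under which instances live
                .thisInstance(instance)   // registered as an ephemeral node
                .build();
        discovery.start();                // entry vanishes if the session dies
    }
}
```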

Like I mentioned, we're ripping out the guts of our service proxy and replacing it with Linkerd. It's very validating when you come up with a solution, there are many other solutions out there, and you can throw away your own code. It's the best feeling when you can just get rid of it.

One more point on the service discovery solution. This is something that can be tricky if you're moving to something like Kubernetes. I'm just curious: is anyone here using or considering Kubernetes for their microservice deployments? Yeah, so in a system like that, where internally there's a different set of routing than externally, if you're trying to have a unified service discovery system, you have to think carefully about how your system works and how you're going to set up your Kubernetes cluster. We chose to go to pod-native networking in our Kubernetes cluster using Calico. If you're not familiar with these terms, and you're looking at Kubernetes, and you know you need a service discovery system, we can talk about it more later. I can answer some questions about it, but you do have to think carefully about your IP routing if you want it to work seamlessly within and outside of the Kubernetes cluster. That's our service discovery.

Monitoring; we had a lot of learnings around monitoring. We somewhat naively just set up a Graphite instance and started pushing a lot of metrics into it without really thinking about it. We very, very quickly overwhelmed that system. Crashed it, and needed to, on the fly, think about how to scale our monitoring system. The reason we crashed it is because we knew that, when retooling our entire ecosystem, we had to measure everything before we made changes. We added code to instrument things like persistence operations: every time we called the database, we wanted to measure how long that particular type of operation took. Every time we made a remote call, how long did that take? Every time we called another endpoint or a third-party endpoint, how long did that take?

When you start to really measure everything that an application is doing, you're building thousands, even hundreds of thousands, of time series, and that can be overwhelming for a monitoring system. I would encourage anyone who's starting down this path and running a monitoring system in-house to really understand what the scaling requirements are for your monitoring solution.

One other point on service calls. As I mentioned, we chose to use HTTP and JSON for our internal calls between applications and APIs. When you think about how to measure that, you could group them together and say any call to auth counts as a call to auth, and I'm going to monitor all those together. The truth is that a call that involves a POST, a write, versus a read, or something more complicated, like send money versus ... for our transactions API, I want to send someone money versus I just want to check on the status of a previous send; these are very different operations, and if you lump them all together, your numbers are not going to be very meaningful.

We actually chose a naming convention for every RPC call, every REST call. It included the service, the path (the path pattern, basically, on the URL), the verb, and the response code. Every one of those combinations produces its own time series. Of course, you can aggregate those later, but you can imagine, if you've got a few microservices already in production, every one of those combinations adds up: you get combinatorial explosion and you get a lot of metrics. I think it's really important to think about your monitoring solution to make sure it's up to the task of truly monitoring your microservice fabric.

One of the things that we used for monitoring was the Dropwizard Metrics library. It's been very nice because it's given us the percentiles and histograms that we were just not getting with simpler measures like counters and gauges. That's something that we've leveraged and found a lot of use out of. I talked about the standard naming scheme.
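Putting the naming convention and the Dropwizard Metrics library together, a small illustrative sketch (not Xoom's actual instrumentation) might look like this:

```java
import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Timer;
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;

public final class RpcMetrics {
    private final MetricRegistry registry = new MetricRegistry();

    // Times a call, then records it under a name built from service, path
    // pattern, verb, and response code, e.g. "auth./tokens/{id}.GET.200".
    // Each distinct combination becomes its own time series, which is
    // exactly where the metric explosion comes from.
    public int timedCall(String service, String pathPattern, String verb,
                         IntSupplier call) {
        long start = System.nanoTime();
        int status = call.getAsInt();            // the call returns its HTTP status
        long elapsed = System.nanoTime() - start;
        Timer timer = registry.timer(MetricRegistry.name(
                service, pathPattern, verb, String.valueOf(status)));
        timer.update(elapsed, TimeUnit.NANOSECONDS);  // feeds histograms/percentiles
        return status;
    }
}
```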

Then the self-service dashboards. That's another really nice feature of Grafana. If you haven't used it: if your developers are pushing metrics to Grafana, they can define their own metrics and build their own dashboards. Empowering them to create these measurements, to ask, what is my hit rate on that cache? To be able to add code to instrument it, and then to go back to the dashboards and build the view on that cache hit rate, is very powerful. No one else has to get involved. The team is autonomous, and they can measure things properly.

Performance; this was a big one. We were really concerned about performance at the beginning, when we started down this journey toward microservices. We were convinced that by adding all of these network hops and remote calls, we were going to slow down the performance of our web applications as well as reduce throughput. We spent a lot of time instrumenting our code and looking at performance before we started making changes. I would really advocate for that methodology before you change anything in your stack. Make sure you're measuring.

It turned out that we actually improved the throughput of our systems dramatically. I wouldn't say it's because of our move to microservices; I don't think that was the reason. But the shift in our thinking, to start measuring everything, to wonder how long all of these different actions were taking, gave us the ability to introspect and to observe what was truly happening. It gave us a chance to say: it is time to separate these data stores, or it's time to separate the RabbitMQ clusters, because one team is impacting another. By separating those data stores and reducing contention on those shared resources, that's what gave us the ability to have higher throughput. It's a little bit counterintuitive that making more of these remote procedure calls, rather than doing things inside of a monolith, improved our performance, but in fact that is what happened.

Now, on the flip side of that, the latency distribution is wider, so if you look at any given HTTP request, any given user interaction with our website, it's more likely that the latency is higher on any given request. That has to do with jitter and all of the issues we've been talking about this morning. So that is a concern, and it comes back to the earlier comments about programming idioms and how you code in a distributed environment. That's an ongoing thing that we're all, as developers, still learning to deal with.

Finally, on the topic of performance, we have essentially three data centers: one on the West Coast, one on the East Coast, and then AWS. We're hybrid, and we've found that it's just very important, it sounds obvious, to choose where to deploy APIs very carefully. If you've got latency sensitivity, you need to take that into account if you're running multiple data centers.

Infrastructure as code. I touched on this a little bit earlier. Most developers are very comfortable with test-driven development. The thing is, it's not just for applications: if you're building hosts to run your microservices, you need to have that entire pipeline tested as well. We use Terraform and Packer for host provisioning, and we can deploy those hosts across any of our providers, so whether it's vSphere or AWS, we can deploy the same images. We use Puppet and Ansible, and we use Beaker to help test our Puppet and Ansible code. Don't treat that code as special and not needing tests. It all needs tests.

Same with networking gear: if you're handcrafting your F5 configurations, you're doing it wrong. Same with your switches and your top-of-rack routers. I talked about standard packaging; we use Docker, and it's really important to have those contracts for deployment, or you're always going to have delays in deploying new applications. We're using Kubernetes as our application control plane. That part is not in production yet, but it's a natural progression, I think, for companies to move into containers first and then move on to a control plane that can give you better management of those containers.

The build and deploy pipeline is an interesting one too. How many here use Git-flow for managing their code? How many here know what Git-flow is? Okay, so most of us, all right, good. This is a branching model for dealing with new features. What we've done is we have seed jobs that will create new Jenkins jobs that allow us to build and test every one of those feature branches. All of our developers, even within a single application or Git repository, can build, and deploy, and test their new feature independently of every other feature. We brought that same concept into Docker with a tool that we call Dockerflow; it's open source. We can build containers for every single branch of code and then test them out.

We have automated and self-service deployments; I'll show that in the next slide. One of the key things that I can't stress enough is to make sure that your environments are as close as possible to each other. Dev, QA, Staging, Production: keeping those all very high fidelity will lead to more success in your automation.

Here's a picture of the UI for our self-service portal to control an environment. These are our sandbox environments across the top. You can see all of the hosts involved in a single sandbox. You can start and stop the cluster. You can deploy features. This is all something that someone like a product manager can do in order to test out the latest and greatest feature on a feature branch. Operations doesn't have to get involved at all. It's been very important for us to have that kind of pipeline and the ability to deploy and test different versions of software.

The last piece on my list is data ownership. It is a really hard problem. All I can recommend is to start eliminating your cross-domain joins now if you have a big monolith. Start separating those out in order to prepare for microservices. Two years in, we're still trying to get applications to stop talking to tables that they don't own. As I said earlier, this is a long, long road. We may never finish it, but we're going to continue to try.

The last point on data ownership is that analytics becomes a very difficult challenge. Don't forget: if you're going to split up all your data into separate databases, how are you going to bring it all back together to let your analysts do their job? You need to make sure that you're staffing up your ETL groups, someone who can bring all that data back together.

Where are we now? We've been doing this for a few years. We've got roughly 100 distinct microservices across three production data centers. We still have our monoliths. We're deploying all of our new features as microservices, but the monoliths still exist. We're just slowly, slowly chipping away.

What have we learned through all of this exercise of adopting microservices? The key thing is to measure everything. To think about what you're doing, what the applications are really doing. Measure them and be prepared to scale your monitoring system so that it can keep up with that explosion of metrics.

Application packaging contracts and delivery pipelines are mandatory. Don't even try to start moving to microservices without those; you're just setting yourself up for failure. It's really important that you set up those contracts early. My recommendation is to staff tooling teams who can help you set up your build, test, and deployment pipelines. Again, really important: if you're expecting to go from one or two or three applications to hundreds, you can't do that without these pipelines in place. Enroll your network operations team. Get the network to be code as well, so you've got complete infrastructure as code. It's really important to get top-down fidelity, and that's the only way to do it.

Finally, I guess the key point of the title of the talk, microservices are the future and always will be, is that we've not succeeded in breaking up our monoliths. We still have them; they're still there. They may be for the foreseeable future. However, the infrastructure that we've built, the culture that we've built, and the improvements to the quality of the code have really paid off. Even though we're not fully migrated to microservices, just getting on the road toward microservices, I think, has really improved everything about the organization. It's improved our code quality. It's improved our scalability. It's been a great experiment; I think it's paid off and will continue to. With that, I will take questions.


Flynn:
You talked about some of the challenges that you had along the way. What would you say was the one that surprised you the most? The one that you were least expecting?

Josh Holtzman:
The metric explosion was the most surprising one. It just caught us off guard. When you ask developers to start measuring things, they really do it. Just be prepared for that.

Flynn: One moment.

Audience: Do you have any advice for how to do integration testing once you start having a large number of services?

Josh Holtzman:
Yeah, there are two different ways that we do it. One is that we have these clusters. We call them XIBs, Xoom-in-a-box. The screenshot that I showed earlier was part of that user interface. We've got a couple hundred of these environments, these clusters, that we can spin up and shut down on Amazon and also in our data centers. When someone is ready to deploy on an environment that is more production-like than their laptop, they can just deploy there with the click of a button and run their tests there. We also have a performance testing platform that's based on Taurus. We can write JMeter or Gatling tests and run them against one of those clusters, one of those XIBs.


That's one technique that we use, and the other is just writing mocks so that you can run your tests, your integrations, against mocks. I guess the last one is that, by having everything in a container, we can treat every app the same. You can run those containers locally in your environment and run integration tests that way.
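As one concrete (and hypothetical) illustration of that last approach, a library like Testcontainers, which isn't named in the talk, lets a test start a dependency's real container image and point assertions at its mapped port:

```java
import org.testcontainers.containers.GenericContainer;
import org.testcontainers.utility.DockerImageName;

// Start a dependency's published container locally and run the integration
// test against it. The image name here is made up for illustration.
public final class AuthIntegrationTest {
    public static void main(String[] args) {
        try (GenericContainer<?> auth = new GenericContainer<>(
                DockerImageName.parse("registry.example.com/auth:2.1"))
                .withExposedPorts(8080)) {
            auth.start();
            String baseUrl = "http://" + auth.getHost() + ":"
                    + auth.getMappedPort(8080);
            // ... run HTTP assertions against baseUrl here ...
            System.out.println("auth service up at " + baseUrl);
        }
    }
}
```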

Flynn: There's another one over here, hang on.

Audience: Do you use something for continuous integration?

Josh Holtzman:
Yeah, we have a Jenkins cluster. We've got a Jenkins master and many slave nodes. We use hooks in our Git repositories, so anytime there's a commit to any one of the feature branches that are in flight, a build job is created and regression tests are run. All of that is automated on every commit to every feature branch, which is, again, why, just like with the monitoring solution, we really had to scale up our build solution as well, because we've got thousands of builds running a day.

Audience: Early on you mentioned a complex regulatory PCI kind of environment. Did that impact your migration path and maybe you could tell us a little bit about that.

Josh Holtzman:
Yeah, it did. This might be very different for some of the startups in the room. For us, anyone who writes code cannot touch the production network or any of the hosts on it. Anyone who can touch the network, or the hosts on it, cannot write code. We have this separation. When you have a DevOps environment where you're trying to have developers own their own applications all the way into production, it can become very tricky. For every action that needs to happen, you need to have a process, and those need to be set up well in advance, in collaboration with the InfoSec group. There's just a lot more organizational coordination that has to happen in microservices and in DevOps.

That's also true for things like the code that we use for provisioning, or for building and configuring our network gear. I feel like that's an area where the traditional PCI separation of concerns isn't well understood. So yes, that requires a lot of coordination with InfoSec.

Audience:
It looks like you've pulled together best-of-breed solutions, versus the Lyft technique where they kind of built their whole thing. Are you able to see the performance, end to end, of a transaction through all these different layers, or how do you do that?

Josh Holtzman:
Yeah, we have some good stories to tell and some not so good. I think it's logging, metrics, and distributed tracing; those are roughly the three. For distributed tracing, for our Java applications, we actually wrote something called Iris, which is a Java agent. We can run our applications with that agent enabled, and it's essentially decorating any HTTP requests, any network requests, that we're making. All of our Java applications, out of the box, get distributed tracing for free. We're just now starting to add it to our non-Java applications.
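Iris itself isn't shown in the talk, but every Java agent shares the same entry point; the transformer body below is the hook where a tracing agent would rewrite HTTP client classes to inject trace headers (this skeleton is a no-op sketch, not Iris):

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

public final class TracingAgent {
    // Invoked by the JVM before main() when started with -javaagent:agent.jar
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new ClassFileTransformer() {
            @Override
            public byte[] transform(ClassLoader loader, String className,
                                    Class<?> classBeingRedefined,
                                    ProtectionDomain pd, byte[] classfileBuffer) {
                // A real tracing agent would rewrite HTTP client classes here
                // to decorate outbound requests with trace/span headers.
                // Returning null leaves the class unchanged.
                return null;
            }
        });
    }
}
```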

For logging, we just use Splunk to aggregate all of our logs, and we've got the metrics dashboard. Everyone's on board, pushing their metrics there. I feel like we've got a pretty good story there; with the distributed tracing, we still have some work to do.

Audience: Yeah, here. Here, here.

Josh Holtzman: Oh, hi.

Audience: I heard you mention, you're using Calico for IP routing basically [inaudible 00:38:16]. Can you elaborate a little bit on that?

Josh Holtzman:
Yeah, as I mentioned, we've got a custom service discovery system, and we're starting to look at ... we run roughly half of our applications with Docker. We really want to start managing those inside of Kubernetes. The problem becomes, once you move something into Kubernetes with its private IP space, you lose that view of the whole world, of all of your services, that we rely on now. There's some configuration that you can do inside of Kubernetes; I'm certainly not an expert, but there are folks on my team who know this really well. I can hook you up with them if you want to find out more. Calico gives us the ability to have routable IP addresses per pod, so whether something starts up inside of Kubernetes as a pod, versus running on bare metal, versus running in a Docker container, they all get essentially routable IP addresses so that they can participate in this universal service discovery system.

Audience:
Back on the regulatory side, what do you do in terms of testing and DevOps with data migrations? How do you know that it's going to run on the production data?

Josh Holtzman:
Yeah, so our DataOps team ... part of this whole process has been moving all of our operations teams into engineering, which they weren't in before. Well, most of our operations teams. DataOps joined engineering, and so they're actually reviewing all of our SQL queries. The migrations that occur happen in the code, as part of the application, so when you deliver an application, you can start it up with a migrate-only flag, essentially. You start up the application, it runs a migration, and then you can start up in the normal mode. All of this is tested every time we deploy in Dev, QA, Staging, and Production. We have a couple of production-like datasets that we also run against, and that gives us the confidence that we can do this in a performant way.
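The exact mechanism isn't shown in the talk; here is a minimal sketch of the migrate-only startup flag, using Flyway as a stand-in migration runner (the talk doesn't name Xoom's tool, and the JDBC details are made up):

```java
import org.flywaydb.core.Flyway;

public final class Main {
    public static void main(String[] args) {
        boolean migrateOnly = args.length > 0 && args[0].equals("--migrate-only");

        // The same deployable artifact either runs its schema migrations and
        // exits, or migrates and then starts serving in the normal mode.
        Flyway flyway = Flyway.configure()
                .dataSource("jdbc:mysql://db.example.com/payments", "app", "secret")
                .load();
        flyway.migrate();        // apply any pending migrations first

        if (migrateOnly) {
            return;              // migration-only run: don't serve traffic
        }
        startServer();           // normal startup
    }

    private static void startServer() { /* start HTTP listeners, etc. */ }
}
```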

There's another step that we'll do if we're doing something like a DDL change, where we're altering a massive table in a way that would normally lock that table. We can't do that in production, so we have offline tools to allow us to do that same alter, which is basically building a new table and then we flip them.

Audience:
You mentioned you have about 100 microservices, so how granular are these services, and how many dev team members actually own a typical service? What is a typical service for you?

Josh Holtzman:
Yeah, it's a question that you have to constantly be asking yourself. There's no easy silver bullet. We have, for instance, two microservices that really shouldn't be two. They should be one. We realize it all the time, because we're running into problems where we're constantly wanting to do cross-service joins in the database there, which means we can't separate them and move them into separate physical data sources. Those should clearly be combined into one microservice. It's one domain. We made a mistake there, and we'll need to correct it. It becomes almost part of your development cycle, where you realize: I think I can split these, and it makes sense to split them, because they have different data storage requirements or they have different deployment timelines. We wanted to deploy one every day and one not so much. You have to constantly be making those decisions, and there are always pros and cons, but you'll start to find the right granularity.

Audience: You mentioned that this is not just process and infrastructure, this is a cultural change for your company. I was just wondering just at a high level, what's the biggest cultural impact that you saw?

Josh Holtzman:
I think, for us, our product managers have always been very end-customer oriented. They are constantly looking at user experience and really trying to optimize and make the user experience brilliant. When you start to do things like this, you start to have eventual consistency, and you start to have to be concerned with these types of things. The product managers really need to join engineering in this evolution and start to think about these things. Start to think about APIs. What might an SLA look like? What does happen when this service times out? That type of thinking has been a major cultural shift, and a really good one, to bring engineering and product together. I think also, if I can have two, bringing what we call SiteOps and NetOps into using the more engineering-style test and development methodology. That's also been really great, because that gives us fidelity at every level of the stack, which you absolutely need if you're moving to microservices.

Flynn: I think we have three more questions here.

Josh Holtzman: All right.

Audience:
Actually I have two, I'm just kidding. You also mentioned batch jobs. Were there any challenges around batch jobs, especially when it comes to maintaining data integrity and consistency?

Josh Holtzman: In terms of batching?

Audience: Batching, yeah.

Josh Holtzman:
Yeah, most of the batch work that we do, we're trying to bring it into the microservice that owns it. For instance, we have ACH batching and that really belongs with the payment processor. Not separate, not inside the monolith. Thinking about those batch jobs and putting them with the domain that really owns them. That's, I think, the key part there.

Audience: You mentioned that you have three data centers, one in AWS and two physical data centers, and you have hundreds of microservices. How are you managing the firewalling with respect to microservices and the data centers, a physical data center calling to AWS? Are you doing, like, [inaudible] ports and security groups, or, like, when you add Azure or something, how is that getting handled?

Josh Holtzman:
We manage AWS very differently than we manage our physical data centers and that is part of just that history, but also because of the tooling involved. We have a ProdOps group that has permissions, that has the right security credentials in AWS to be able to get in there and do deployments and fix problems. Then in our physical data centers, they're managed very separately. We have Cisco firewalls and it's a completely different set of tooling that we have. We have to manage both.

Audience: What about service to service?

Josh Holtzman: Sorry?

Flynn: His question was what about service to service.

Josh Holtzman:
Oh, service to service, yes. This is something I didn't get into. I might've mentioned it on a slide but glossed over it. We use OAuth on the front end, and we actually use OAuth all the way down. Every service is validating that a request has a token and that that token includes the correct scopes. We do this at every layer, and that gives us the ability to treat a request from the outside world the same as a request from the inside world. Everything is TLS, and everything is protected by OAuth.
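A hypothetical sketch of that per-layer validation as a servlet filter (the scope name is invented, and real token verification, signature checks and all, is reduced to a placeholder):

```java
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import java.io.IOException;

// Every internal hop re-validates that the bearer token is present and
// carries the scope this particular service requires.
public final class ScopeFilter implements Filter {
    private static final String REQUIRED_SCOPE = "transactions:read"; // example

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        HttpServletRequest request = (HttpServletRequest) req;
        String header = request.getHeader("Authorization");
        if (header == null || !header.startsWith("Bearer ")
                || !tokenHasScope(header.substring(7), REQUIRED_SCOPE)) {
            ((HttpServletResponse) res).sendError(HttpServletResponse.SC_FORBIDDEN);
            return;
        }
        chain.doFilter(req, res);  // token is valid for this layer; continue
    }

    // Placeholder: a real implementation verifies the token's signature and
    // inspects its granted scopes.
    private boolean tokenHasScope(String token, String scope) {
        return token.contains(scope);
    }
}
```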

Flynn:
One last question. I, of course, saw a couple of hands go up as soon as I said we were going to be done so if you could try to find me afterwards, we'll see if we can get answers there.

Audience: Do you have different versions of the services? If yes, like how you manage which tool you use.

Josh Holtzman:
Every time somebody changes their contract, they will version their API, and we typically include multiple versions in a single application. We might launch a container, and it might include /v1 endpoints and a /v2 set of endpoints. If you want to discover these things, if you are looking to hit the version 2 authentication versus the version 3 authentication, we do that with layer seven routing. If you're hitting a URL, it's auth.2.xoom.api versus auth.3.xoom.api. We can deploy these things together if we choose, or, if the application developers would rather have separate containers and launch them separately, and even scale them separately, they can do that. The service discovery system will find those application instances based on the URLs.

Audience:
Thank you.

Josh Holtzman: Thanks.

Flynn: Thanks Josh.
