Breaking Things On Purpose

Kolton Andrus (Netflix)


How Netflix intentionally builds failure into their microservices strategy.



Kolton: Thank you. Thanks for the opportunity to speak. I’m excited to share a little bit of the hard fought wisdom we’ve learned. And as he said, I’m Kolton Andrus. I’m going to talk about breaking things on purpose. In particular, I want to convince you why failure testing is important, why it’s a critical part of running your microservices, and verifying that things behave as you expect. And additionally, I think it’s one of the proactive things that we can do when it comes to preventing outages, whereas most of the things that we do are reactive.

So I’m going to start with a little bit of context. So a little bit about me and Netflix. I was on the Edge Platform team. We were in charge of the reliability and the performance of the Edge services. I was going to make the joke earlier that we’re kind of one of those megaservices where all of the devices within Netflix communicate with us, and we talk to almost all of the streaming services.

Before that, I was at Amazon where I worked in the retail website. And I also had a focus on availability and performance. At both companies, I’ve had the privilege of being a Call Leader, so it’s someone that’s nominated to help manage and resolve large-scale incidents within the company. And I’ve done failure testing at both companies and I’ve seen it be successful. So additionally, I want to provide a little context on why failure testing as a concept, while kind of counterintuitive, is an important one.

So this wine glass, we might say it’s fragile. When it falls on the ground it breaks and it goes everywhere. So one of the questions I like to ask while talking about failure testing is “What’s the opposite of fragile?” Raise your hand if you think it’s robust or resilient? Nobody? You guys already know what I’m going to say, huh? All right, awesome. I like talking to a room that knows how it goes.

So this is what I would have thought. I would have thought something that is indifferent to change is robust or resilient, but what we want is to actually be anti-fragile. We want to improve in the face of change. And so some examples of this are ecosystems, organisms, communities in society, things that actually improve in the face of change and get better. So there’s one example I’d like to give on this front and it’s that of the vaccine.

And I think this drives home that kind of counterintuitive point. What we’re doing is we’re injecting a small amount of something harmful into our bodies in order to build an immunity to it. I feel like this translates very well into the microservice world. We’re going to inject a little bit of the harm into our microservices in order to see how it behaves and to build an immunity to those things.

So this is a little mathy, but part of what makes something anti-fragile is essentially that you have a bounded downside and you have a larger upside. So we want to be in situations where the worst thing that can happen isn’t that bad and the best thing is pretty good. In particular, when it comes to failure testing…

Man: We have a mic issue. Just keep going. I’m going to switch you out.

Kolton: All right. So the downside, when we’re failure testing, is the impact that we could potentially cause. So we’re going to do failure testing. We could actually break things. We could cause some customer impact. That’s our downside. But by doing it proactively, we can manage that downside.

Man: [inaudible]

Austin: Yeah.

Kolton: You know what? There’s so many great jokes about it, too. If the talk’s a failure, you can make jokes about it. It sounds like it’s in the room. Okay, well my apologies for the mic issues.

Man: It’s not your fault.

Kolton: So the downside is the impact of the test, but the upside is much larger. And it’s really, it’s the outages that we’re going to prevent, the things that aren’t going to happen. And there’s a couple different dimensions to this. One is we’d rather, as Austin mentioned while introducing me, we’d rather deal with this during the day while people are paying attention and in the office and while we have better insight and control into what’s going on.

And so we’re not debugging things in the middle of the night. We’re not playing Murder Mystery in our microservices, trying to figure out what went wrong. And I think there’s an additional upside here, more than just the test itself is that it helps our organizations prepare for the failures that occur. It gives us an opportunity to test and to see how things behave, not leave them to this point.

So this came across my Twitter feed last month and it made me laugh, in particular because I’ve been binging on Star Trek: The Next Generation lately. But Picard here says, “Hey, we need to run our crisis drills when things are good.” Two o’clock in the morning, when production’s on fire and everything’s wrong is not the time to ask a lot of questions, to be figuring out why things are going wrong. You want to do it on your own terms, during the day, after the caffeine has kicked in.

You want to practice, you want to train, you want to answer those questions. So I’m going to talk a little bit about the general process of effective failure testing and some of the things that I think about. So one analogy I want to draw is that of the threat model. So I think similar to in security we do threat models, when it comes to failure we need to be doing failure scenarios. We need to be thinking about what could go wrong. How could our systems fail? And I think that just asking this question gets us a fair amount of the way.

A lot of people haven’t even stopped to think about the things that could go wrong. So if we ask ourselves, “Okay, what are we worried about? What could go wrong?” we’ll be a little bit better prepared. The follow-up question to that is, “How likely is this to occur?” I think this could be a dangerous question, but I think it’s useful because if we see that there are common events or common classes of events causing this problem within our infrastructure or our ecosystem then we know that we should spend time on them. We know that they’re going to be useful.

Now, we can’t prepare for everything. And since I’m talking about anti-fragile, if you’re familiar with Nassim Nicholas Taleb, he also talks about black swan events. And those are events that we can’t really see coming or that catch us unaware. And I would say that those are still going to happen, but with proactive failure testing, I think that we’re going to do a better job when the unexpected happens because we’ll have prepared, we’ll have practiced.

And even if it’s not exactly what we trained for, it’s better than coming at it fresh with no background. And then another question I like is “What’s the cost of being wrong?” And I think this helps us weigh our prioritization of failure testing and our risk assessment. If we run entirely in one AWS region and that region goes down, what happens to our business? If it’s down for a day, are we going to be able to survive as a company? What’s that going to do to our customers’ trust?

So and then I think by asking these questions we can kind of run a cost-benefit analysis. We can think about not just what could go wrong, but what’s likely to go wrong and where should we spend our efforts? And this allows us to prioritize and spend time on the things that are going to get us the most bang for the buck. Hopefully, the things that are going to save us the most outages and prevent the most customer impact.

And as we’re talking about the cost of things, I’d just like to point out this has been really helpful as I’ve been talking to the business about the cost of failure. So I think that the cost of failure is, in some cases, easy to calculate. If you’re an eCommerce site and you’re down 1% of the time, you’re collecting 1% less revenue. And it could actually be worse. The example that comes to mind, not to pick on them, is Target on Black Friday.

And coming from an eCommerce background, if there’s one day a year that you do not want to be down, it’s Black Friday. Those are expensive outages. Additionally, I read that it’s estimated that an hour of downtime for Facebook costs $1.7 million in lost advertising revenue. So is it important to prioritize availability? When there’s a lot of money on the line, absolutely. Now I want to talk a little bit about how we run this process on my team and how we’ve used it to help make our system more resilient and improve our operational lives.

As I mentioned, we own the Edge services. We talk to a lot of the company. And so our general process is to meet with these teams, meet with the microservice teams that we depend upon, and sit down in a room, and go through some of these questions. Hey, what could go wrong? And oftentimes, it’s as simple as drawing a whiteboard diagram, drawing a diagram up on the whiteboard. And we look at it and we ask ourselves, “Hey, where could things go wrong?”

And often these are network bounds. These are losses of dependencies. Maybe interaction with caching or persistence. And these are the things that we want to target and think about how they’re going to fail and to test. And so when we’re running a test I think it’s important to think of it like an experiment. We want to begin with a hypothesis. Hey, if we lose the rating service, what happens? Are people going to be able to stream? Well, so in this case, yeah. If we lose the rating service, members should get default ratings. Hopefully, no one notices.

Next, I think we want a measurable outcome. When we’re going to run a failure exercise, we want to ensure that things are failing in the way we expect. And this is important because we have a lot of assumptions about how our systems behave and how we think they’re going to fail. And in my experience, those assumptions are almost always wrong. There’s always some detail or some subtlety that we’ve missed that burns us later on. So we want to verify that the system is failing in the correct way.

Next we want to have a set of criteria about what does success look like overall. Throughout the exercise, in this case we want to make sure that customers are able to stream, that we’re seeing that positive attribute and we’re seeing not just the absence of failure, but we’re seeing success. And then next we want some abort conditions. I think it’s important that we have a clear guideline of if things start to go wrong, if things start to go south, we need to halt this exercise.
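As a rough illustration of the experiment framing described here (hypothesis, measurable outcome, success criteria, abort conditions), the following Python sketch captures the four pieces in one object. Everything in it is an assumption made for illustration: the `FailureExperiment` class, the metric names, and the 99% and 95% thresholds are hypothetical and are not taken from Netflix's tooling.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Metrics = Dict[str, float]


@dataclass
class FailureExperiment:
    """Hypothetical container for the four parts of a failure exercise."""
    hypothesis: str
    failing_as_expected: Callable[[Metrics], bool]  # measurable outcome
    succeeding: Callable[[Metrics], bool]           # success criteria
    should_abort: Callable[[Metrics], bool]         # abort conditions

    def evaluate(self, metrics: Metrics) -> str:
        # Abort conditions are checked first: customer impact halts the exercise.
        if self.should_abort(metrics):
            return "abort"
        # Passing needs both the expected failure mode and positive success.
        if self.failing_as_expected(metrics) and self.succeeding(metrics):
            return "pass"
        return "investigate"


# Metric names and thresholds below are illustrative assumptions.
ratings_experiment = FailureExperiment(
    hypothesis="If the rating service is lost, members get default ratings and can still stream",
    failing_as_expected=lambda m: m["default_ratings_served"] > 0,
    succeeding=lambda m: m["stream_starts"] >= 0.99 * m["expected_stream_starts"],
    should_abort=lambda m: m["stream_starts"] < 0.95 * m["expected_stream_starts"],
)
```

Calling `ratings_experiment.evaluate(...)` with a snapshot of current metrics returns `"abort"`, `"pass"`, or `"investigate"`. The last value covers the case where nothing is broken for customers but the system is not failing in the way the hypothesis predicted.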

And so for us, it’s always been if there’s a customer impact. If customers can’t stream, if something’s breaking in a way that prevents people from using the service, halt the exercise. We found a bug. We found a problem. We’ve got a learning. We’ve got something to go fix. Halt it, clean it up, stop that customer impact. Another aspect I want to talk about is communication. I think failure testing is kind of a scary concept and when you’re talking to the business or people in your organization, they’re probably going to be like, “Hey, that sounds like a horrible idea.”

So you need to be clear in your communication and you need to involve the people that can potentially be impacted. In particular, I think inviting dependencies, inviting people that might be interested, inviting people that you think you might impact. And this has the added benefit of if many people and many eyes are kind of watching what’s going on, if something does start to go south you have more canaries in the coal mine. You have more people to kind of chime in and say, “Hey, something doesn’t look right here. We better stop and investigate.”

Share that pass/fail criteria with that group. Let them know what success looks like so that they know things are working as expected, so that the failure test is successful. And then have one command center. Have one place that people can come and tune in and know what’s going on. Whether you like to use your team bullpen or a conference room, whether you want to do it in a chatroom or over a conference bridge, people should be able to drop in and say, “Hey, what’s going on?”

We often did it in a chatroom. We would notify the general ops chatroom, “Hey, we’re going to run a failure test. If you’re interested, you can follow along in this room,” and we’d have that room where we would post when we were starting an impact, when we were stopping an impact, graphs of how things were behaving, those kind of criteria. So another important concept, I think, when it comes to failure testing is thinking about the smallest possible impact. So I call this the failure scope, but essentially we want to start by scoping the failure.

We want to do the smallest amount of badness that we can measure and see the impact on the system. So this might be running it on your local box first. It might be running it in test. It might be running it in Prod, but only running it for a single instance and seeing how it behaves. At Netflix, we’re able to scope our failures to individual customers or devices. And so what we do is we always do a functional test that we ensure that it behaves as we expect before we start to dial it up to a large percentage of customers.

And at each of those steps, we want to validate the outcome. We want to validate that things are behaving as we expect. This is a graph from the CDN selection team. And they ran a failure test. And I think it’s one of the best examples of a successful failure test graphic. The green is things behaving as normal. The red is things failing or falling back. And the black line is customers able to stream. So you can see that while they run this exercise, they ramp it up, failure increases, but there’s no customer impact.

This, to me, is what a failure test should look like. Great, so we’ve done it at a small scale. It’s working. Maybe we found some bugs and fixed them. Now what do we do? Well, now we want to turn it up. We want to ratchet it up. When we run our failure tests, we start at, like I mentioned, a single customer, single device, and then we go to 1% of customers, and then 5, and then 25, and then 50, and then 100. And not every test do we go to 100, but a lot of them we do. And we’d like to see the system behaving at scale under duress.
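The ramp described here (1%, then 5, 25, 50, and 100, validating at each step) could be sketched roughly as follows. This is a minimal Python illustration under stated assumptions: `run_ramp` and its three callbacks are hypothetical names, and the single-customer functional test is assumed to have passed before the ramp begins.

```python
from typing import Callable, List, Sequence, Tuple

# Percent of customers at each step, as listed in the talk.
RAMP_STEPS = (1, 5, 25, 50, 100)


def run_ramp(
    apply_failure: Callable[[int], None],  # hypothetical: inject failure for N% of customers
    outcome_ok: Callable[[], bool],        # hypothetical: are customers still able to stream?
    rollback: Callable[[], None],          # hypothetical: remove the injected failure
    steps: Sequence[int] = RAMP_STEPS,
) -> Tuple[bool, List[int]]:
    """Dial the failure up step by step, halting at the first bad outcome."""
    completed: List[int] = []
    try:
        for percent in steps:
            apply_failure(percent)
            # Validate the outcome before going any wider.
            if not outcome_ok():
                return False, completed
            completed.append(percent)
        return True, completed
    finally:
        # The injected failure is always cleaned up, whether the run
        # finished, was halted, or raised an unexpected error.
        rollback()
```

The function returns whether the full ramp completed and which steps were validated, so a halted run still records how far the system got before something went wrong.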

And I think this is important for one reason, and that’s you learn different things at different scales of failure testing. When you’re running on a small scale or a single instance or a single customer, you might catch some functional issues. Something might not degrade as you expected. But you’re going to see that. When you’re running at a large scale, you’re going to be seeing what might happen in production if things go wrong. You’re going to see resource constraints. You’re going to see queuing.

You might see cascading failure. You might run into some emergent behavior in your complex system. And again I would posit, while you have the potential, you could get into trouble. We’ve hopefully, as part of the exercise, we’ve thought about how do we stop it if things go wrong? How do we roll it back and clean it up quickly? There’s a chance that you could trigger something that you can’t easily clean up, but again, I would posit it’s best to do that during the day while people are watching than in the middle of the night while everything’s on fire.

And then test in Prod. I think you have to test in Prod. It’s Prod’s configuration that matters. It’s your mitigations in Prod that are going to be called upon when things go wrong. If you’re only testing in Test, you’re not going to validate your production configuration, your production network, your production hardware. There’s a great quote by James Hamilton, who’s at AWS now. He used to be at Microsoft. And it goes, “Those unwilling to test in Production aren’t yet confident that their service will continue operating through failures. And without production testing, recovery won’t work when called upon.”

So again, the purpose of this effort is to save us from pain later. If we do a lot of work and we create a lot of mitigations, and we don’t exercise them, and then we go to use them later and they don’t work, it might be a distraction during an outage, it might make it worse, and it certainly won’t be the happy case of, “Yep, that behaved as we expected and we were able to avoid customer impact.” I want to talk about a couple specific case studies from Netflix, but I want to share a funny anecdote first.

In Q3, we had some failure tests scheduled. So in general, before the holidays and as part of our Q4 readiness we go through, we meet with these teams, and we run these failure tests. We had a failure test scheduled for a service to run Wednesday afternoon. Well Tuesday night we got paged because that service went down. And in retrospect, if we had been able to run our failure test we’re pretty confident that we would have found that failure and fixed it. So run early, run often. You never know. A day’s delay might cost you or it might not save you that outage.

So Spinnaker. Spinnaker is the new way for us to deploy to the cloud at Netflix. It’s Open Source. It’s cloud independent. And it’s a critical piece of infrastructure. And it’s critical for two reasons. One, it encompasses a lot of the automation that we’ve learned and developed to help us deploy safely to production and move quickly. We don’t want to lose that if that service is down or unavailable. And two, a lot of engineers don’t know how to do it manually anymore. If they had to push bits to the box themselves, they’d probably be asking for help.

So this is, to me, this is almost a tier zero service. This can’t go down. We need it to work. We need it to be highly available. But as any young project that hasn’t matured yet, it had some flaws and some things. And we saw a couple issues with it. So I met with the Spinnaker team. Here’s a screenshot of the UI. It’s kind of a sentimental one. They tried to scrub it but you can see API Prod all over it. That was my baby.

We met with the Spinnaker team to discuss what could go wrong. And as we went through the whiteboard diagram and as we drew things out and we started talking through bits of the service, we found some low-hanging fruit. We found some cases where there were single instances deployed to a cluster. We found some cases where there was a cluster that was only in one availability zone. These are nice ones to find because they’re fairly easy to mitigate. Go throw some more hardware at this. Let’s have a little bit of redundancy so that if things go wrong we’re not out in the cold.

Next we spoke a bit about monitoring. And this has been mentioned several times today, and I’ll just echo the same sentiment. We need to monitor our services. We need to know how they behave. This spawned a great discussion about what success constituted for that service. It had some good discussions about do you have dashboards in place? Could you tell me, right now, if something went wrong, what it looks like? And then we added some alerting on top of those things. We went back, we saw what kind of normal behavior was, we went and applied some of the different alerts and automation to make sure that we’d be paged and we would know if things started to go south with Spinnaker.

So that was good. I consider that all kind of in the class of you just should do these things. So I don’t think you need failure testing to really exercise that. Then we talked about Hystrix with the Spinnaker team. For those of you not familiar, and obviously credit to Ben Christensen because he’s here today, the author of Hystrix, Hystrix is good for both fallbacks and for the circuit breaker pattern and for resource protection. And when we talked to the Spinnaker team, we found that they weren’t really protecting themselves, even amongst their own microservices.

So they had a couple of different services involved, but they had kind of a gateway service that could serve some non-critical fallbacks in the case of failure. So the first was we spoke to them about you should leverage this. You should go put it in place. And that was great. But then we still had some problems related to Hystrix and the fallbacks. And it’s because tuning this is a critical part of getting it right.

When you’re dealing with timeouts, when you’re dealing with thread pools, or the amount of resources you want to allocate to work that needs to be done, you have to look at it both in the happy case, as things are behaving as they should all through the day, and you need to see it under duress. You need to see where those ceilings are so that as things start to queue up and there starts to be a lot of contention, you’re ensuring that you’re protecting yourself, that you’re shedding load at the right place, that you’re timing out fast enough that you can continue to do work, and that those failures are isolated.

In particular, one of the problems was they had Hystrix commands grouped into the same resource bucket that were both critical and non-critical. And so the non-critical one went south and impacted the critical one. And the answer there is, “Okay, well let’s separate those out and make sure that they’re isolated.” And what’s been nice about this is that the Spinnaker team was really open to taking this feedback, to sitting down and walking through it. And the end result is a more resilient deployment system that we have at Netflix.
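The isolation problem described here, critical and non-critical work sharing one resource bucket, can be illustrated with a small sketch. Note that this is not the Hystrix API (Hystrix is a Java library); it is a generic Python approximation of the same idea, and the `Bulkhead` class, its limits, and the group names are all hypothetical.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class Bulkhead:
    """Hypothetical resource bucket that caps concurrent calls for one group."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, work: Callable[[], T], fallback: Callable[[], T]) -> T:
        # Shed load immediately when the bucket is full instead of queueing.
        if not self._slots.acquire(blocking=False):
            return fallback()
        try:
            return work()
        except Exception:
            # Any failure in the dependency degrades to the fallback.
            return fallback()
        finally:
            self._slots.release()


# Separate buckets: exhausting one cannot starve the other.
# The limits are illustrative, not tuned values.
critical = Bulkhead("critical", max_concurrent=10)
non_critical = Bulkhead("non_critical", max_concurrent=2)
```

With a single shared bucket, slow non-critical calls could hold every slot and cause critical calls to be rejected. With two buckets, a saturated `non_critical` group only ever triggers its own fallbacks. The right limits are exactly the kind of setting that has to be tuned by observing the system both in the happy case and under duress.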

The next failure test I want to talk about is Chaos Kong. And to be fair, this is run by the traffic team. I don’t really play a role in this, but I think it’s one of the best examples of the power of failure testing. We run Chaos Kong on a regular basis, where we evacuate AWS regions. And we find new learnings every few runs. There’s a new resource constraint, there’s a new scaling boundary, there’s some new code that was pushed, something that doesn’t behave quite right. But what’s important about this, because we run it regularly and we run it in production, is that it’s ready when we need it.

And we’ve used it a couple of times. Last Q4, when AWS had issues…and another kind of funny anecdote. The only time that I was paged of consequence in the last six months was the morning, Sunday morning, that AWS East had issues. And we got on and we looked at things, and we said, “Okay, it’s out of our control. It’s AWS. Cool, fail out.” We got people on. I don’t even think we needed to spin up a conference bridge. I think we were able to start it via chat, shift traffic over, and basically avoid any customer impact from that incident.

And then likewise, about two weeks ago, we were uncertain if there was an AWS issue and we chose to fail out for the evening, serve traffic out of our other regions, and return the next day. So knowing that your defense mechanisms will work is very important to helping prevent customer outage. Now there’s one counterpoint to this and that’s that when you have a big hammer, everything looks like a nail. So I’ve been on multiple calls where the first question is, “Should we shift traffic?” And sometimes it’s a premature question and you’re not sure.

And I think it’s clear in some cases. AWS in one region is having a problem. We should shift out. It’s unclear in others. There’s a single service in a single region that’s having trouble. Well, maybe we need to find out a little bit more. And then it could be bad in some. If we hit a scaling boundary in our busiest region the first time code’s been deployed, and then we shift traffic to another region, we may just hit that same scaling boundary in the other region.

So we have to be kind of thoughtful about when we use these tools. But that’s separate from do they work. So yeah. So why? So I carried a pager like this at Amazon. We have PagerDuty now, but I’m nostalgic for the old school pager. What’s important is that this proactive failure testing makes my life easier. I’m lazy. I don’t like to be on calls. I kind of do because I’m a glutton for punishment, but I don’t like there to be customer impact and I don’t like there to be outages.

And so this proactive failure testing has helped reduce the operational burden of my team, in particular. Between 2013 and 2014 we had a 20% reduction in the number of pages that we received. Between 2014 and 2015, we likewise had a 20% reduction in the pages we received. And actually, and this is more anecdotally, many of those events that we were involved in this past year didn’t really require any action from us. We’re a critical service so we’re used to being involved in all the discussions around an outage.

But because our service was well-tested, well-protected, it behaved as we expected. And that’s not to say that we just got lucky and things didn’t fail. We had many instances of services failing in modes that we test, and those were non-events. The best feeling in the world is when you come into the office the next morning, or in the morning, and your coworker turns to you and says, “Hey, did you know service X fell over last night?” And you went, “No.” “Did you get paged?” “No, we were fine.” Everything went on as normal.

And then it becomes a little bit more enjoyable. You still dig into it and figure it out, but it’s not the tire fire that it would be otherwise. And this culminated in perfect uptime for us over the holidays. And since holidays are our busiest time and the most important time for us to be available, it’s very critical that we’re able to provide that resilience and that good customer experience at peak under load.

And since I was the on call for New Year’s Eve I was thankful that I didn’t really have to worry too much. I won’t say I wasn’t worried. I was keeping an eye on things, but I wasn’t too worried. So that’s what I have. I want to give a callout to the Edge Platform team. And Netflix, they’re hiring. If you want to go run these kind of failure tests or learn from them, you’re welcome to. This is my Twitter handle, my email is kandrus@gmail. I love failure. I love talking about failure. If this is something that you’re interested in, if you have questions, hit me up, let’s chat.


Kolton: Yeah, I would say…

Man: Can you repeat the question, please?

Kolton: Yes. How has failure testing affected the development culture? So one thing we’ve seen is that we’ve gotten better about doing failure testing in our integration tests and doing things offline. We’re able to take our failure injection framework and instrument it so that we could run them as unit tests during deployment. So that’s been good. I think, in part, it’s just been a cultural change, like so many other things.

Sitting down in a room with a service and having them think through their failure modes and think about what could go wrong, and then test it and verify it, and then know how their service behaves. I think that just…I can’t point to any one thing when it comes to in development, but it just leads to a more robust culture.


Kolton: Right, so the question is has Netflix’s failure injection system been Open Sourced? The answer is, no, in part because it’s easier to Open Source libraries. I find it’s a little harder to Open Source full services. And it integrates pretty tightly with a lot of our internal components, Hystrix, Ribbon. Ribbon’s our RPC client. Cassandra, and our memcached client.


Kolton: Do we subject our data stores to testing? I haven’t gotten to the point where I need to go that deep yet. There’s been enough low-hanging fruit in just understanding how our services behave. That doesn’t mean that other people haven’t. I would say that, in part, some of the Chaos Monkey or other approaches are going to indirectly impact data stores.


Kolton: There you go. Go find the tech blog on Chaos Monkey on Cassandra.


Kolton: Yeah, so how does Netflix quantify the cost of an outage? I think it’s a great question. We do a lot of A/B testing. There’s a lot of debate on how you measure this. As you’ve said, when we’re subscription-based it’s a little harder. We could run an A/B test where we break everybody for a while and see if they get mad and quit their subscription. I don’t advocate that, just as I don’t want any of my failure testing to actually impact customers. But there’s been some kind of debate and discussion about how we would quantify it.

There was a number kind of thrown out by a C-level employee that it was worth this many tens of millions of dollars. But I don’t have a strong backing for that.

Austin: Do you feel like doing failure testing early in the lifecycle of a product or an organization would lead you towards premature optimization or other problems or what are your thoughts there?

Kolton: That’s a great general development question. I think you could get caught up in it early. I think thinking through it up front is important. I think that at the point that you’re designing your service and as you think through what could go wrong and how likely it is, that that’s a good point to figure out how you want to mitigate that so that it’s designed in by day one. But would I say, “You should go out and break things with your first deployment or your first iteration”? I don’t know. I think it should be part of your continuous deployment process. It should be automated where possible.


Kolton: Yeah, I would say…so the question is how…is too much failure testing a bad thing or can you do too much failure testing? I’m sure you can. I haven’t actually gotten to that point. Really, we do a push during Q4 to get ready. And we try to schedule with our critical services once a quarter to go through this. So we’re mindful of people’s time and we know that everyone needs to be involved and thoughtful during the exercise. If we got to the point that we’re really good at it, that it’s automated, it’s being run all the time, I think that’s the answer. I think you move from a manual one to a more automated one.


Kolton: So the question is, basically should we take the Chaos Monkey approach of running it on people without their knowledge? Should it be a surprise? I think that’s one failure testing strategy. I think if you want to exercise how teams are engaged, paged, know how to debug an issue, and if they don’t know that that failure test is happening, it’s a great way to drill and train. But I’m not really a fan of doing it randomly. The cost of being wrong is too high, in general. And when you want to run it large scale, you want to make sure that things are…you want to be really careful.


Kolton: So I’d say the question is how do you simulate failures in a realistic manner, because it’s difficult and there are a wide variety? What we do is we’ve kind of approximated it to have failure that occurs at different layers and seeing how that behaves. So if we know that when we go to make an RPC call, if we just fast fail that and we can handle whatever falls out of that, then there’s a lot of different underlying failure modes that we’ve covered. And it could be packet loss, it could be a connection reset, it could be a bad route.

And so we kind of covered all those. But at the end of the day it’s a model. It’s an approximation. We’re doing the best we can to simulate it well. Once we get through all the low-hanging fruit there’s probably a place where you want to get much better or much more prescriptive.
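The idea of fast-failing at the RPC layer, so that one injected failure stands in for packet loss, connection resets, and bad routes alike, might look roughly like this in Python. It is a sketch under stated assumptions: `FailureInjectingClient`, `InjectedFailure`, `get_rating`, and the default rating value are hypothetical and do not describe Netflix's actual framework.

```python
from typing import Callable, Set


class InjectedFailure(Exception):
    """Raised in place of the real call when a failure is injected."""


class FailureInjectingClient:
    """Hypothetical RPC client wrapper that can fast-fail chosen services."""

    def __init__(self, send: Callable[[str], str]) -> None:
        self._send = send              # the real transport, supplied by the caller
        self._failing: Set[str] = set()

    def inject(self, service: str) -> None:
        self._failing.add(service)

    def clear(self) -> None:
        self._failing.clear()

    def call(self, service: str) -> str:
        # Fail before touching the network. To the caller this looks the
        # same as many lower-level failures that surface as an exception.
        if service in self._failing:
            raise InjectedFailure(service)
        return self._send(service)


def get_rating(client: FailureInjectingClient, default: str = "default rating") -> str:
    """Caller under test: degrade to a default when the ratings call fails."""
    try:
        return client.call("ratings")
    except Exception:
        return default
```

Because the caller only sees an exception, verifying that `get_rating` degrades to its default under an injected failure gives some confidence about the lower-level failure modes too. It remains a model, though: slow responses or partial data, for example, would need their own kind of injection.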

Austin: What sorts of pushback have you gotten of the whole idea of testing in production itself? What sorts of things have you found yourself needing to do to try to mitigate against totally messing people over in production?

Kolton: Yeah, so pushback, this is why I came to Netflix, to do failure testing. I felt like they were the leaders and that if there was a place I could learn this would be it. And I haven’t been disappointed. I feel like it’s been a great experience. How do I…what was the second part?

Austin: What sorts of mitigation have you found yourselves having to do to try to be able to test in Prod without screwing over live customers?

Kolton: Yeah, so that’s really…and again, if you’re interested…

Austin: I realize that’s kind of open-ended.

Kolton: Sure. If you’re interested, the FIT blog post, F-I-T, Failure Injection Testing, you can go and read on the tech blog. But in particular, it’s this concept of a failure scope. So whenever you want to cause a failure in production, you have to basically define who you’re going to impact. And we can scope things to an individual customer or device. We can scope them to an individual application. We can scope them to different clusters. And then we can do kind of this vertical slice, a random sample percent of customers.

Austin: Just to clarify here. You’re talking about causing a failure that you are expecting not to screw the customer over, correct?

Kolton: Mm-hmm, yep.

Austin: So your expectation going in is that you will fail a piece of software. The customer won’t notice.

Kolton: Yep.

Austin: And then if they do notice, then that would be an immediate point where you stop the test, restore it, and go back.

Kolton: Yeah. And again, the kind of mental graph I have is there’s outages that occur in the middle of the night. And then there’s the outages that occur during a failure test and they’re small and they’re contained and they have a lot less impact. And so that’s really how, back to the anti-fragile analogy, that’s how we bound the downside and benefit from the larger upside. All right. One more?


Kolton: Yes, similar to like a canary approach, you want to test on real people, on real behavior. Now, when we’re doing our testing and we’re scoping it to an individual customer device, that’s often one in our control. It’s my laptop. It’s my customer ID. And I go and I’ve already kind of done some due diligence that things behave as I expect. So before we’re going out and running it for a percentage of customers, we have a pretty high degree of confidence that it’s going to behave correctly.


Kolton: So the way that we know that we’re impacting real customers is we’re watching carefully upon all of our important metrics and dashboards at Netflix. Most importantly, can people stream? And so that is the Holy Grail metric. If at any point we see a dip in the expected number of streams, even if we’re not confident that we’re playing a role, if we’re running a failure test we’re stopping it. We’re cleaning it up, and we’re stopping that impact right away.

Austin: Are there other things other than such blatant indications that a customer is having a problem that you would consider a roll back, roll forward question point?

Kolton: I think that’s going to be based upon individual services. So there are some services where…okay, so take ratings. If we know that customers are getting bad ratings for some reason, then we would say, “Okay, that’s a data quality issue. We need to stop that.” But in general, the top level ones are pretty good. Yes?


Kolton: Yeah, so he mentioned that we have some metrics around customer service engagement and different issues. Again, as part of the exercise we’ve kind of informed our general operations team that we’re doing this. And so if our customer service reps start talking about some behavior, some error, then that’s a good hint that maybe we’re causing that pain.

Austin: It seems that a lot of that boils down to start by having metrics and start by knowing what you expect your metrics to show you in normal use. I think that’s it.

Kolton: All right. Thank you very much.
