Dark Launching with Consul at Hootsuite – Bill Monkman

Bill Monkman (Hootsuite)

Description

Dark Launching (A.K.A. Feature Flagging) is a technique and mindset that has truly shaped the way we write, test, and deploy code at Hootsuite. It gives our team realtime, fine-grained control over our production systems which helps to prevent issues from reaching users, and build developer confidence in a culture of pushing code many times per day.
In this presentation I will go over how the system helps us both in the context of microservices and monoliths, and how we made use of Consul, HashiCorp's HA service discovery / KV store, to make it more resilient and performant at scale.

Transcript

Austin: "How's everybody doing? We are back again. Thanks for sticking with us so far. We've made it all the way from Berlin to the UK to New York, and now we're in Vancouver with Bill Monkman who is one of the engineers over at Hootsuite. For those of you that don't know what Hootsuite is, it's a really amazing social media management platform that a lot of really large companies use to engage customers for marketing and a lot of customer service. They have a massive user base and have been growing really, really quickly. Bill is going to chat with us about using Dark Launching or using Consul for Dark Launching. Take it away, bud."

Bill: "All right, thanks. Yeah, I work at Hootsuite. I'm a senior specialist engineer here. I work on our platform team, so doing a lot of distributed systems stuff and Microservices scaling, moving away from our [inaudible 00:01:11]. I've been here for almost seven years, since we were under ten people. Now we have almost a thousand. I've seen a lot of growth and a lot of change, a lot of interesting stuff. Yeah, we're talking about Dark Launching with Consul, so I'll hop into that. Let's see if this works."

"All right, so Dark Launching with Consul. [inaudible 00:01:42] It integrates with Twitter, Facebook, all the big social networks. A lot of big business users, also a lot of small, medium business and a lot of just end users as well. Almost 15 million users right now, I think, and yeah, still going strong. Everything that we are running right now is in AWS, somewhere in the low thousands of servers. We do this mostly [inaudible 00:02:19] PHP, and all of our services are Scala and some Go, and some Python thrown in to the mix there. We do usually 10 to 20 releases to production per day. That's a good pace. It's really important to our team. As I said, we've been going through [inaudible 00:02:42] moving a lot of our [inaudible 00:02:46] code base to Microservices. Right now, after a couple of years at it, we're somewhere past [inaudible 00:02:54], again, mostly in Scala."

"This focuses quite a bit around Consul. Consul is something that we started using a couple of years ago, in 2014. We also use a much better [inaudible 00:03:09] tools like Vagrant, Packer, Terraform and have been using Vault for some [inaudible 00:03:16], so a lot of great tools. We have Consul deployed in all of our data centers, our AWS Regions in our staging and production environments in clusters of three to five servers. The Consul agent is installed on almost every [inaudible 00:03:33] cluster. If you're not too familiar with Consul, it's basically a ... It's a great tool. It doesn't give you a ton right out of the box, but it can be the foundation for a lot of great tools. It does server [inaudible 00:03:51]. It has a distributed key value store, and it allows you to automate a lot of tasks in a really easy way within your systems. Our first use of Consul is for Dark Launching."

"Dark Launching is a technique. Some people call it Feature Flagging, Feature Toggles. There are a lot of different names with a lot of different [inaudible 00:04:16] of it, but basically, it allows you to control your systems in real time. This [inaudible 00:04:27] for our monolith and also for our services, basically everything that we have has Dark Launching baked into it because it's really an important piece of our work. It's something that's used here, also at Facebook, Etsy and a bunch of other places. We have [inaudible 00:04:46], basically anything that we use, and it's become a really powerful tool for us. It gives our engineering team a greater sense of confidence when they're pushing code. They don't have to worry as much about pushing it back towards production where you're breaking things or causing [inaudible 00:05:10] because they have a safety net or a way to have control of the system [inaudible 00:05:17] so that they can push code and basically just be able to test it themselves and not affect customers [inaudible 00:05:25]."

"It allowed us to do some interesting [inaudible 00:05:28]. It allowed other departments to control the system. In the case of Support, maybe there's something that's not working and they need to turn it off. We've had cases where the Marketing team is doing a press release for [inaudible 00:05:42]. They've been able to send out the press release and then actually turn the feature on in production, turn it on for real customers, which is pretty neat. This is what it looks like. This is just a snippet of PHP code, basically just checking is this feature enabled. We give it a name. Usually, it's [inaudible 00:06:05] reference, and then the block of code. If this evaluates to true, then [inaudible 00:06:15] and the guts of this are [inaudible 00:06:21] a little bit."

"We have various ways of controlling access to that code, controlling in which cases you enter that block and execute it. The most simplistic is Boolean, where we go turn that feature [inaudible 00:06:39]. Percentage_static, which just basically takes the user ID and does a modulus on it so that we can release something to 10% of users and 20% of users. It will always be the same batch of users. Percentage_random is [inaudible 00:06:57] just a ... It's a random percentage each time that [inaudible 00:07:00]. User_list, we can turn the features off for very specific users or groups of users or to organizations within our systems [inaudible 00:07:11] using Hootsuite. We could turn it on for free users or pro users, for [inaudible 00:07:17]. We can target features at users who are speaking a certain language. We can target things at specific [inaudible 00:07:26] other ones that have [inaudible 00:07:29] previous cases that we've run into."
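The rollout types named in this part of the talk could be evaluated roughly as in the sketch below. It is an illustration only, not Hootsuite's code: the flag record layout (`type`, `value`), the function name `is_enabled`, and the bucket size of 100 are all assumptions.

```python
import random


def is_enabled(flag, user_id=None, rng=random.random):
    """Evaluate a hypothetical flag record for a single check."""
    kind = flag["type"]
    if kind == "boolean":
        return bool(flag["value"])
    if kind == "percentage_static":
        # Modulus on the user ID: a given user always lands in the same bucket,
        # so raising 10 -> 20 only ever adds users, it never removes any.
        return user_id is not None and user_id % 100 < flag["value"]
    if kind == "percentage_random":
        # Fresh roll on every check, so the same user can get different answers.
        return rng() * 100 < flag["value"]
    if kind == "user_list":
        return user_id in flag["value"]
    return False  # unknown type: treat the feature as off


if __name__ == "__main__":
    ten_percent = {"type": "percentage_static", "value": 10}
    print(is_enabled(ten_percent, user_id=105))
    print(is_enabled(ten_percent, user_id=110))
```

Failing closed on an unknown type is a design choice made for this sketch; the talk does not say how the real library behaves in that case.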

"The typical workflow for this is basically, you push your code out there. It has a Dark Launch block in it. That Dark Launch [inaudible 00:07:48] to our management tool [inaudible 00:07:50] first use. Then you would see it up here, and then you could Dark Launch that new code that you just [inaudible 00:07:56] so you could go on to production and [inaudible 00:08:00] that new feature you just added. If it's working well, you can turn it on for our whole company so that everyone in it is able to use that feature. Get some people in there to test it out, then we can roll it out to 10% of all users, watch the graphs, make sure everything is working and then roll it out 50%, 100%. Then [inaudible 00:08:21] make sure it's performing well and nothing's spiking on you. Then if it is, you just set that back to zero and it's disabled for everyone, and then you figure out what went wrong."

"There are all sorts of uses for [inaudible 00:08:40] that have been really pretty easy to solve with Dark Launching. Migrations are another one. A lot of what we're getting with moving to Microservices, and I'm sure a lot of other people are facing the same issues, you're at [inaudible 00:08:56] moving to a Microservice and then you've got to do this migration from the monolith to the service. Dark Launching has made this really, really easy for us. We're able to find abstraction points within the code base [inaudible 00:09:15] and then come up with ways to do the migrations, either just a wholesale migration or a phased rollout, and be able to control it all [inaudible 00:09:27] back with Dark Launch codes. Also it's [inaudible 00:09:32] to let people try out new features ahead of full releases in order to get [inaudible 00:09:37]."

"Related to that, load testing, another thing that we've found it to be useful for. You're starting to move over to a new service you just made. You send it partial traffic. You can, again, roll it forward and backward, see how it performs, watch the graphs, watch the logs and make sure that everything is working the way you expect. Also, shadowing has been really easy for us. Even before we actually do a full migration [inaudible 00:10:09] to release a new service, we're able to start sending [inaudible 00:10:15] to a new service just to be able to get a sense of its performance and its capabilities and most of the capacity [inaudible 00:10:26] as we're bringing that sort of thing to production."
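The shadowing idea described here could look something like the following sketch. It assumes a hypothetical request handler with an existing code path (`old_impl`) and a new service client (`new_impl`); none of these names come from the talk, and a production version would more likely send the shadow call asynchronously so it cannot slow down the real response.

```python
def handle(request, old_impl, new_impl, shadow_enabled, log=print):
    """Serve from the existing path; optionally mirror the call to the new service."""
    result = old_impl(request)  # the user always gets the existing behavior
    if shadow_enabled:
        try:
            new_impl(request)  # response is discarded, only load and errors matter
        except Exception as exc:
            # A failing shadow call must never affect the real response.
            log(f"shadow call failed: {exc}")
    return result


if __name__ == "__main__":
    def flaky_new_service(_request):
        raise RuntimeError("new service not ready")

    print(handle({"id": 1}, lambda r: "old result", flaky_new_service, True))
```

The `shadow_enabled` argument stands in for a Dark Launch check, so the mirrored traffic can be dialed up or switched off without a deploy.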

"Another interesting use case for us has been security and protection, I guess, against attacks, which definitely wasn't in our minds when we were designing the system but it's become useful. [inaudible 00:10:45] protection of the system in cases where, for example, [inaudible 00:10:48]. For a while, Twitter was having some serious reliability issues, and [inaudible 00:10:55] issues with their API would start bubbling up and hurting our system because we were so heavily dependent on them. Then a quick way to get around this is we just added the kill Twitter stream Dark Launch code. Basically, whenever you sent us [inaudible 00:11:13], it just doesn't get involved in making requests to Twitter and just displacing us if we're using the same [inaudible 00:11:18] having issues. This was a really, really simple way for us to just say, "Okay, we know there are some issues right now. Let's just cut it off temporarily." Also, we've had cases where they're attacking us in various ways. We're trying to run [inaudible 00:11:35] or something like that, and we're able to foil a lot of that by being able to quickly change the properties of the system and the behavior of the system to mess with people's [inaudible 00:11:48]."

"A/B testing is another useful thing [inaudible 00:12:01] something complex that are also maybe testing for [inaudible 00:12:05] to use, but we found really simple cases where you really just want half the users get this old thing, half the users get a new thing and then see how that shapes our, see what they like more and stuff like that. It's just a really easy way for us to do that. It's not happening very often."

"The flow for this is you wrap your code in a Dark Launch block. You push that. As soon as it's executed for the first time, if it sees that the code there [inaudible 00:12:48] doesn't exist, then it will automatically register I think the key codes to our [inaudible 00:12:54] with some stampede protection to make sure [inaudible 00:12:58]. Then once that's written, you would use our [inaudible 00:13:04]. You can select the data center you want and the service you want. You can do some filtering and searching. You can go ahead and view the history of all the changes [inaudible 00:13:16] and actually go in and tweak them. [inaudible 00:13:22] and then you would be able to [inaudible 00:13:26] to the ones that I spoke about earlier, achieve value, [inaudible 00:13:32] descriptions and stuff like that, see history and also control whether or not it's available to JavaScript. This is something that we've [inaudible 00:13:41] anything that is accessible via the front end to anybody [inaudible 00:13:48]. We restrict which of our codes are actually sent to the front end because, number one, they're not always needed. Pushing all of our data is [inaudible 00:14:01]. Second, we don't necessarily want everybody to just be able to see all of them. [inaudible 00:14:08]"
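The talk does not explain how the register-on-first-use step or its stampede protection is implemented. One way to get a similar effect, shown purely as an assumption, is to remember per process which flags have already been registered and to make the store write a create-if-absent operation (Consul's KV API offers this through a check-and-set write with index 0). The class name, the `put_if_absent` callable and the default flag value below are hypothetical.

```python
import threading


class FlagRegistry:
    """Registers unknown flags at most once per process (hypothetical helper)."""

    def __init__(self, known_flags, put_if_absent):
        self._known = set(known_flags)
        self._put_if_absent = put_if_absent  # e.g. a create-if-absent KV write
        self._lock = threading.Lock()

    def ensure_registered(self, name):
        """Return True if this call triggered a registration attempt."""
        if name in self._known:  # fast path: flag already known
            return False
        with self._lock:
            if name in self._known:  # another thread got here first
                return False
            self._known.add(name)  # mark first so concurrent callers back off
        # New flags start disabled so unreleased code stays dark.
        self._put_if_absent(name, {"type": "boolean", "value": False})
        return True


if __name__ == "__main__":
    writes = []
    registry = FlagRegistry({"OLD_FLAG"}, lambda name, value: writes.append(name))
    registry.ensure_registered("NEW_FLAG")
    registry.ensure_registered("NEW_FLAG")
    print(writes)
```

With this shape, many concurrent requests hitting a brand-new flag produce one write per process rather than one per request, and the create-if-absent semantics keep processes from overwriting each other.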

"Dark Launching has really become core to our continuous delivery. It's something that you think about every time you're pushing code and something that really has been embraced by our team. It's not something that's mandated. It's something that you will really easily see the value of it because they can push code and not really worry about it. That lets us keep up the pace of our delivery and our deployment. It's even gone so far as to change the way that we use [inaudible 00:14:57]. We use [inaudible 00:15:01], and we use it maybe in a little bit of a non-standard way because we don't really do much branching. We work in very, very short-lived branches. Typically, we only have a branch active for a couple of days or [inaudible 00:15:16]. We want everyone to merge into master basically as quickly as possible so that everyone is all integrated and working in the same context [inaudible 00:15:24] mindset. Because we don't have branching [inaudible 00:15:34], that means the branching is pushed into production. [inaudible 00:15:38] branching in production where you have these [inaudible 00:15:43], but instead of being a branch in [inaudible 00:15:47]. It's easy for us to share and collaborate on those because it's how we can control that target."

"There are some associated costs. You've got to clean these things up. [inaudible 00:16:07] code base. Various teams have various ways of doing that. Some teams have Dark Launch shell cleanup allocated on [inaudible 00:16:17] there's a bunch of codes that they did this week that they don't need anymore. They'll just blast them [inaudible 00:16:23] everybody to be responsible for. Also, complexity can be an issue. If you're not using it in a responsible or smart way, it can lead to all sorts of ... if you have a bunch of these things, nested and then [inaudible 00:16:44], which can be a little crazy if you let it get out of control. We'll try to keep a handle on that, and everybody's pretty responsible. They understand how to make good use of the system."

"Our initial implementation of this was our staff [inaudible 00:17:11] servers, a bunch of PHP-FPM workers on each of them, MySQL database and [inaudible 00:17:18] servers. All the Dark Launching was just made in MySQL [inaudible 00:17:22]. That held us for a while, but we reached a certain point where it was no longer feasible. Dark Launching just became more and more ingrained in our culture. [inaudible 00:17:36] That became a burden on MySQL and also even on Memcached. That would have certain hot keys that would be accessed [inaudible 00:17:52], mix certain servers [inaudible 00:17:58] receiving way more traffic than the other ones that actually caused some issues [inaudible 00:18:04]. Also, it was too tied in to our core product, our monolith. There was no [inaudible 00:18:12] to work with it, and that became an issue when we moved into Microservices, not feasible for a distributed system."

"Consul came on our radar. We were already fans of HashiCorp products. We saw a huge potential for a push-based solution to this problem. We wanted to make use of some of the other capabilities of Consul, and so we needed some place to start, some proof of concept for it. The timing was [inaudible 00:18:50]. We really liked where it was going. We liked the team and the feature set. Also, it was built on a really solid foundation. It uses the Raft consensus algorithm and the [inaudible 00:19:04] protocols [inaudible 00:19:08] established and well understood ideas there. We started experimenting with it for [inaudible 00:19:16]."

"This is Consul [inaudible 00:19:19]. Basically, it's showing [inaudible 00:19:23] historical right there. The Consul key value store has that [inaudible 00:19:28] prefix which is like [inaudible 00:19:30]. You can see here we've got Dark Launch codes [inaudible 00:19:38], and there are a couple of subsections within that. There's [inaudible 00:19:41] the Dark Launch name [inaudible 00:19:48] the data that I showed you earlier. There's some data that's not actually used here or not included here. We don't include stuff like the description and [inaudible 00:20:11] and stuff like that in this block because it's not useful to distribute that to all of the nodes that are actually doing the work because they don't need it. It's only useful for the admin tool, and so it's stored with the tool [inaudible 00:20:32]."

"Consul has this concept of watches. You can set a watch on various things, including [inaudible 00:20:45] in this case. We're basically saying watch on this key prefix Dark Launch [inaudible 00:20:53]. Whenever something changes in here, anything the key value [inaudible 00:20:58] when it changes, it's [inaudible 00:21:03] triggered, and that is anything. It's any [inaudible 00:21:07] on your system. [inaudible 00:21:10] here will receive a big blob of whatever it was that actually changed. In our case, this is our Dark Launch handler."
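For readers who have not used watches: a Consul `keyprefix` watch runs a handler and passes it a JSON array on standard input, one object per key under the prefix, with the value base64-encoded in the `Value` field. The sketch below parses such a payload. The key layout (`darklaunch/<service>/<flag>`) and the choice to store each flag as a JSON document are assumptions made for illustration, not details confirmed by the talk.

```python
import base64
import json
import sys


def parse_watch_payload(raw):
    """Turn a keyprefix watch payload into {key: decoded flag dict}."""
    entries = json.loads(raw) or []  # tolerate a null payload for an empty prefix
    flags = {}
    for entry in entries:
        value = entry.get("Value")
        if value is None:  # key exists but holds no value
            continue
        # Assumption: each value is a small JSON document describing one flag.
        flags[entry["Key"]] = json.loads(base64.b64decode(value))
    return flags


if __name__ == "__main__":
    # When run as a watch handler, the payload arrives on stdin.
    print(json.dumps(parse_watch_payload(sys.stdin.read()), indent=2))
```

A handler built this way receives the full set of keys under the prefix each time it fires, so it can simply rebuild its output from scratch rather than applying diffs.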

"In the PHP implementation, the handler receives all the key value data, and it just writes out a PHP syntax config with all the data as an array. [inaudible 00:21:37] also will hit the web server that's running on localhost in the case of a [inaudible 00:21:44] to clear the in-memory cache. It basically just updates the file, clears the cache. The next time any code on that machine tries to read that cache and the data doesn't exist, it will read it off from the file and then back into the memory cache on that server. If the flag that you're checking doesn't exist in that data, then it will contact the local Consul agent and add to the key value store. This is an example of what that file would look like. It's a big array, again, same value that [inaudible 00:22:25]."
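The update-file, clear-cache, lazy-reload cycle described here can be sketched as below. This is not the PHP implementation from the talk: it swaps the PHP array file for a JSON file, and the write-to-temp-then-rename step is an added assumption so that readers never see a half-written file. All names are hypothetical.

```python
import json
import os
import tempfile


def write_flags_atomically(flags, path):
    """Write the flag file via a temp file and rename, so reads never see a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as tmp:
        json.dump(flags, tmp)
    os.replace(tmp_path, path)  # atomic when source and target share a filesystem


class FlagCache:
    """In-memory copy of the flag file, reloaded lazily after clear()."""

    def __init__(self, path):
        self._path = path
        self._flags = None

    def clear(self):
        # Called by the watch handler after it has rewritten the file.
        self._flags = None

    def get(self, name):
        if self._flags is None:  # cache miss: fall back to the file on disk
            with open(self._path) as fh:
                self._flags = json.load(fh)
        return self._flags.get(name)
```

Because every read is served from local memory or local disk, a flag check never makes a network call, which matches the push-based model the speaker describes.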

"When you're modifying a flag, in order to [inaudible 00:22:35] make a change, the server that's running this has the Consul agent running on it. When you make your change, it will communicate with the agent and send it to the leader of the Consul cluster, which will commit it to its log and then [inaudible 00:22:53] which will be [inaudible 00:22:53] to everything, to all of the other agents. It will [inaudible 00:23:00]. The agent will fire the watch, which will then write the config file. Then it will [inaudible 00:23:15] cache, and FPM will read the [inaudible 00:23:18] file. We have [inaudible 00:23:21] when you're creating a flag. It's reading that value [inaudible 00:23:28], which then stores that data into a cluster, and it gets sent out to all of the agents that care about it, including the [inaudible 00:23:38]."

"Similar implementation with Scala. The handler sees all the data. It writes out Typesafe HOCON to the config file, which is just a Typesafe config format. [inaudible 00:23:56] It's used a lot in [inaudible 00:23:58] and stuff like that. Our Scala library uses inotify to watch for changes to the file. When a file changes, there's an Akka agent, which is just [inaudible 00:24:12] that manages state. That [inaudible 00:24:14] will see that there's change. It will [inaudible 00:24:17] to get the state of the Dark Launch code. It basically just asks the actor if this code is [inaudible 00:24:28] or not and the actor is [inaudible 00:24:30] Dark Launch information. It's also [inaudible 00:24:36] with containers. We use Mesos and Marathon internally for running services in Scala and Go. Similar to the previous implementations, Consul is running on the Mesos slave host, and it is going to be using its handlers to provide all of the service Dark Launch data to disk. Then the containers [inaudible 00:25:00] the files that are written out by that handler."

"Some issues that we've seen, when we were setting up Consul initially, up until 0.5.1, it had some issues with how we had our networking set up. That was a real problem for going [inaudible 00:25:23] with it, but that's all been solved. [inaudible 00:25:28] Also, something to mention is that because this is a distributed system that deals with eventual consistency, atomicity is something that you don't have here. In our old system, we had Memcached, which has atomic operations. In here, we're dealing with the concept of convergence of data. This is not really an issue for how we use Dark Launching. We don't need to do things that require absolute millisecond synchronization. If we need to do that, we use other systems. Really, the typical convergence for us, we've seen almost all the time within a second. Here's a little graph where the time scale is one second. This is 2:55:35. This is 2:55:36. You can see that there are 60 web servers having their [inaudible 00:26:37] by Consul. They all happen within less than half a second."

"Some of the lessons that we learned from this. Consul has an ACL system, and it's really worthwhile to think about how you're going to lay out your system and make sure that you enable ACLs [inaudible 00:27:02], so you don't have to do it later on. Even if you don't necessarily have a plan for it, at least turn it on. Also, it's [inaudible 00:27:11] to how you're going to structure your [inaudible 00:27:14]. Because it does [inaudible 00:27:20] you need bi-directional communication between every node that wants to [inaudible 00:27:27] cluster on specific ports. That can be a little bit of a tough sell to the security team to convince our team that it was a good idea. Consul itself is [inaudible 00:27:40]. Security is a main concern of the [inaudible 00:27:48]. It's important to understand Consul's outage recovery process. You don't need it very often. We haven't had a case where we've had to use that in production, but it hasn't been [inaudible 00:28:07] everything goes down. We saw that in prefix events that will be delivered to nodes even if they were down at the time of the event. Key prefix [inaudible 00:28:22] within the system, I think."

"Consul, we found it to be a [inaudible 00:28:32]. It's worked really well for us from the start. Taking it to Dark Launching, which we're already using and [inaudible 00:28:44] a really key part of our system and being really good ways to introduce this [inaudible 00:28:51] technology to the team, which [inaudible 00:28:54] sooner. Due to our success with it for Dark Launching, we've gone on and expanded the scope of our usage of that. We're now using it for [inaudible 00:29:11] version of our [inaudible 00:29:13] service behind them, service discovery for a project that's using Akka cluster, distributed locking for various things within our system. It has a really great set up and easy to use locking functionality."

"Also, we have a fairly new Microservice discovery and routing system called Skyline that's built on [inaudible 00:29:41] and Consul. That's one of the key things in our push toward Microservices, a really easy way to do [inaudible 00:29:50] discover. The upgrade process has been interesting with Consul. I think they did put a lot of thought into the real world usage of these tools. Part of the real world usage is you've got it in production, you need to upgrade it. You can't be just taking things down and making them unavailable [inaudible 00:30:18] issues that are [inaudible 00:30:21] causing issues. Every update that we've done to it has been [inaudible 00:30:26] for keeping everything compatible even as you're walking up your versions and stuff like that. It's been a very great tool for us."

"It's increased the stability of the system. It decreased the load on Memcached and MySQL quite a bit. Since the data is pushed now rather than pulled, our systems can always rely on an up to date version of Dark Launch data existing on the disks at all times. They don't need to worry about some [inaudible 00:31:06] or how they need to connect to it or pull data out of it. There's a certain amount of freedom in knowing that. For us, it's now usable in all of our data centers, all of our projects and all of our environments, which is [inaudible 00:31:27] for us because that's ... the shared data across all of these projects and all of these data centers. This allows us to control access to not just the monolith but services and everything else we have going and do it all in tandem and have them be able to collaborate and share that [inaudible 00:31:49], which has been really, really valuable. I think that has helped us a lot in our [inaudible 00:31:55] to Microservices and helped the transition be relatively easy. That's what I've got."

Brian: "Great. Thank you very much, Bill. We have a few questions from the group, so let's just look below. Did you evaluate ZooKeeper as an alternative to Consul for service discovery? If you did, what was the reason for choosing Consul over it?"

Bill: "Yeah. We actually do use ZooKeeper for some things, our customer setup for one of our [inaudible 00:32:35]. Also, we use it with [inaudible 00:32:39] and other things. After using both side by side, Consul is just so much better to work with. It's been incredibly reliable. The functionality it has goes way beyond just the key values or just service discovery. They're just constantly putting more and more interesting stuff into it. I think the pace is just going to keep going, whereas I see ZooKeeper falling behind."

Brian: "Interesting. All right, great. Another question from the group. We all thought this in the room too when we saw the Dark Launch framework that you showed earlier. Is it open source? If it is, where can you find it? If not, any plans to open source it?"

Bill: "Yeah. That's definitely on the roadmap. There is definitely a question internally to get everything open sourced, including the PHP library, the Scala library, the [inaudible 00:33:50] interface and everything like that. It's been a little bit slower, but we've been taking chunks of it and making them more and more isolated [inaudible 00:34:04] and suitable for open source. I hope they make it sometime soon."

Brian: "Speaking of Dark Launching, how do you find it affects code readability? I know you mentioned that you do have to continually refactor and pull things out, but do you find if you hire someone who's never worked in a model, that they just get confused by all the ifs, the branching makes it hard to read? Or is it cleaned up enough that there's generally only a few of them for a short period of time at any one time?"

Bill: "Yeah. They can accrue over time, but people generally understand that with great power comes great responsibility. [inaudible 00:34:53] clean things up and try to keep everything [inaudible 00:34:56]. On the front of having new people come in, at first, people are a little bit shocked by it, but almost immediately, they're converts. Personally, after working this way for years now, I cannot work in any other way. It just makes everyone's job so much easier. It improves the reliability of your systems and the visibility and the control that you have over them. Going back to a system where you have to just build [inaudible 00:35:32] changing anything is to push out [inaudible 00:35:36] stuff like that, yeah, it would be hard to do. People really need to get better."

Brian: "Got it. Do you share the ... Do you find that there's a normalized responsibility for refactoring and pulling them out? Is it just automatic now, everybody does it? Or do you now and then have to put everybody in a room and say, "All right, guys, let's go clean up for a few hours.""

Bill: "Yeah, there's definitely a bit of both. As I mentioned, various teams and various people have their own methods for keeping on top of it, but we do have a yearly round of cleanup where we'll just go through [inaudible 00:36:19] and say, "Okay, these are the ones that [inaudible 00:36:23]." It's a little bit annoying to do that, but weighed against the benefits that we get out of it, it's ..."

Brian: "Yup, okay. We were also triggered by when you said distributed locking. The distributed systems nerds in the room were like, "Oh, danger, Will Robinson, danger." Is it core to the application flow? Is that a very common thing? Just thinking about all this locking and releasing, et cetera."

Bill: "Yeah, it's just something that ... It's not core to any of our products and functionality, but it's something that's come in really handy in certain cases. You often want to take a lock on something, and Consul gives ... to do that across ... One case that we're doing is [inaudible 00:37:28] in some of our tests. We have automated tests that need to basically just run against a big block of Twitter usernames or something like that. We want to have a way to just lock ... We have 20 [inaudible 00:37:41] that are all running, and they might all be running this test at the same time. [inaudible 00:37:45] and then release them. Doing that with Consul is just trivial. Also, you can use it for ... One of the things we want to do with it is use [inaudible 00:37:57] process. We use it for [inaudible 00:38:01]. Basically, as you're executing, you're deploying, you could pull down ... if you have a script that's [inaudible 00:38:10] whatever, you could basically execute that command on everything in your infrastructure at the same time but then use the [inaudible 00:38:24] in Consul and say only ever do this on five [inaudible 00:38:28] at a time and not automatically do it for [inaudible 00:38:30]. There's some really interesting uses."
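A minimal sketch of the kind of lock described here, using Consul's session and KV HTTP endpoints (`/v1/session/create`, `/v1/kv/<key>?acquire=<session>`, `?release=<session>`, `/v1/session/destroy/<session>`). The agent address, key name, and function names are assumptions, and this covers only a simple exclusive lock, not the "five at a time" limit mentioned in the talk. The `put` parameter exists so the logic can be exercised without a running agent.

```python
import json
import urllib.request

CONSUL = "http://127.0.0.1:8500"  # assumed address of the local Consul agent


def http_put(url, body=b""):
    """Send a PUT request and return the decoded JSON response."""
    req = urllib.request.Request(url, data=body, method="PUT")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def try_lock(key, owner, put=http_put):
    """Return a session ID if the lock was acquired, otherwise None."""
    payload = json.dumps({"Name": owner}).encode()
    session = put(f"{CONSUL}/v1/session/create", payload)["ID"]
    # The KV write succeeds only if no other session currently holds the key.
    if put(f"{CONSUL}/v1/kv/{key}?acquire={session}", owner.encode()):
        return session
    put(f"{CONSUL}/v1/session/destroy/{session}")  # do not leak unused sessions
    return None


def unlock(key, session, put=http_put):
    """Release the key and destroy the session that held it."""
    put(f"{CONSUL}/v1/kv/{key}?release={session}")
    put(f"{CONSUL}/v1/session/destroy/{session}")
```

A test runner could call `try_lock("locks/twitter-test-accounts", "runner-1")` before touching the shared accounts and `unlock(...)` afterwards; the key name is hypothetical.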

Brian: "You're actually using Consul to handle all that nasty distributed systems resilience."

Bill: "It does it in a [inaudible 00:38:44]."

Brian: "Great, great. Last question, you mentioned something about Skyline. Is that a product, an internal creation, open source thing?"

Bill: "Yeah. That's an internal thing that ... I've been working on that personally for a few months, and it's right now in the process of being open sourced. We're probably going to be releasing that hopefully within a couple of months. Yeah, so far, it's been working really well for us. It's built on a lot of existing tools that we've found to be really stable."

Brian: "Fantastic, great. Well, that's all of the questions, Bill. Thank you very much. We're out of time. This is great. Now we're going to go to your colleague, Adam. Appreciate your time, Bill. Thanks a lot."

Bill: "Thanks, everyone."
