During our SOA transition at Hootsuite, we have noticed that visibility into our service relationships, dependencies and status is paramount to keeping our team, our build pipeline and our application running smoothly. I'd like to share with you an API we baked into our SOA architecture that enables us to explore our application's service dependency graph in real time.
Austin: "What's up guys? We're with our final speaker today. This is Adam Arsenault, another speaker from Hootsuite, going to chat about how they have adopted microservices. He's going to get into a lot of the nitty gritty. Where Bill was talking about dark launching, Adam is going to talk about the [architecture 00:00:32] stuff. Go ahead and take it away Adam."
Adam: "All right. Let's get into the presentation. All right, my name is Adam Arsenault and I'm here to talk to you about how Hootsuite manages its growing microservice landscape. What are we going to talk about today? We're going to briefly talk about our road to SOA, so kind of like where we started, where we're going, where we are now. We're going to talk a little bit about the service graph, so what do we mean when we talk about the service graph. We're going to talk a little bit about an API that we built for monitoring and a tool that we built called Voltron. We'll do a quick demo and then do some lessons learned."
"I just want to double check that you guys can hear me still? All good? Okay great. Hootsuite started in 2008 and we started as a PHP Monolith, and in about 2013 we decided we were going to start a transition to SOA. During that time we were in a period of hyper growth and we were experimenting with continuous integration. Today we have over 20 services and counting in production, and yeah, we're still not finished but I think we're doing a pretty good job so far."
"During that time our CTO really liked this quote from Melvin Conway. Basically, "Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations." When we first started moving to SOA, we were basically a monolithic team. We were one giant development team."
"We stayed that way for a little bit as we started SOA, but then eventually we found that it would be better if we broke up into teams, so we broke up into a bunch of teams, and each team has their own goals, feature sets, and their own autonomy. Because we broke up into teams, teams didn't communicate as well as we did day to day when we were smaller and a monolith. Our teams were growing, so each team was getting more and more people on it, and each team was getting more services and more abstraction. We were hiding parts of the system from other teams, which previously we knew more about and they weren't necessarily hidden."
"This really led to a bunch of problems, a bunch of runtime issues, a bunch of integration issues and a bunch of explosions in our environments. Really, there were kind of two main ones that caused a lot of stress, the first one being integration failures. In this scenario we have a dev, they have some changes, they merge those changes into our build pipeline, and those changes go to staging, and then all of a sudden the integration tests fail. We freeze our release pipeline so no one can do any releases to production, and obviously this is stressful. People are trying to figure out what's going wrong. There were certain times when we had problems that were not visible, so tests failing with maybe no code changes to that area, or other things like that."
"Devs that were possibly changing one thing were seeing things break in another area, and that was stressful, and they didn't know how to fix it, and they didn't really know how to tell what was wrong. A really common example here is you have front end tests failing because backend services weren't working properly. People were saying, is it really my code? How do I fix this? It took us a really long time to diagnose these types of problems."
"The second is, you know, we all hate this, something goes down in production. All the on call teams affected start getting alerts and notifications, and you end up having to sift through a flood of notifications to figure out what's going wrong. This doesn't usually happen when your system is simpler and is a monolith, but when you start to break out into many, many different services, and these services have inter-dependencies, it's really challenging sometimes to find out what the root cause of the problem is."
"Really, for us, it boiled down to the concept of visibility. More teams leads to more services, leads to more moving parts, more abstraction. Thus, the need for more visibility into the overall application and its parts. People would often ask questions like, "What services are there? What services are failing? Why is this service failing? Did you know about service three?" or something like that, maybe when a new service has been added. So, we really wanted to find a way to help make all of this architecture that was changing much more visible."
"Let's talk a little bit about the service graph. We'll just do a couple definitions and some examples before we go into the API that we built here. When you're a monolith your service graph is very simple, right? You just have one node. It's just the app. As you start to add more services this starts to get a little bit more complicated, but at this point it's still pretty easy to reason about: you have an app with S1 and S2 services, so let's keep adding some more."
"Still pretty easy to reason about. We don't really have any service inter-dependencies yet, so we really just have two levels in this graph. It's pretty easy to understand, but when we start to add service inter-dependencies, things get a little bit more complicated. Add more services and this gets even more complicated. What happens if something fails, like what happens if S2 goes down? In this graph we're seeing well, S2 is going down but in reality when S2 goes down it really means this."
"In that example I talked about before where we had a production issue, if S2 goes down here, all the people in this chain are going to start receiving alerts probably, because their service isn't working the way it's expected to."
"This diagram here is an actual screenshot or a diagram of our services in production at Hootsuite. As you can tell, there's lots and lots of services. Because of that it's very complicated. You know, if we look at this diagram and if, for example, S21 goes down, way down here in the corner in the bottom left, wow what's going to happen, right? There could be lots and lots of failures, lots of people getting alerts, and it's just not really easy to understand what all the pieces in this application are."
"Let's talk a little bit about the API that we built for health checking and monitoring. When we talk about our API, we talk about- we have the concept of a dependency, so every app or service has dependencies. A dependency could be internal or traversable, that's the way we define them, internal or traversable. An internal dependency is something that the node that owns it only cares about, and an example of this would be, for example, a database or a cache. A traversable dependency would be something like a service, so S1 there for example."
"One of the first endpoints that we created to help describe what's going on in our infrastructure is the about endpoint. The about endpoint returns data about a service or app, such as version, description, maintainers, links, documentation and the status of each individual dependency for that node. This is pretty awesome because each node in the graph can describe itself. It can also tell you whether each of its dependencies is up or down, and you can check the status of each of those dependencies individually as well."
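To make the about endpoint concrete, here is a minimal sketch of what such a payload might look like, assuming a simple in-process dependency registry. All names, the version string, and the check functions are invented for illustration; this is not Hootsuite's actual implementation.

```python
# Hypothetical sketch of an "about" endpoint: the node describes itself and
# reports the status of each registered dependency (internal or traversable).

def check_db():
    # A real check would connect to the database; here we assume success.
    return "OK"

def check_service_core():
    # A real check would call the downstream service's status endpoint.
    return "OK"

# Registry of this node's dependencies: internal (db) and traversable (service).
DEPENDENCIES = {
    "db": {"type": "internal", "check": check_db},
    "service-core": {"type": "traversable", "check": check_service_core},
}

def about():
    """Return metadata about this node plus per-dependency status."""
    return {
        "name": "demo-app",
        "version": "1.2.3",
        "description": "Demo application",
        "maintainers": ["team@example.com"],
        "links": {"docs": "https://example.com/docs"},
        "dependencies": [
            {"name": name, "type": dep["type"], "status": dep["check"]()}
            for name, dep in DEPENDENCIES.items()
        ],
    }
```

Because every node exposes the same shape of payload, tooling can walk the graph and render any node's dashboard without node-specific code.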
"We also have, this is more of a dynamic endpoint that gets configured, but it is like status slash dependency. Every dependency that you register in your service gets its own endpoint, which you can call by itself to get the status of that dependency, and it's going to return OK or an error message. An example of this is, say you had a service called core service, and you registered it with your service, then you would have an endpoint called status slash service core, or status slash db. This is a way for you to get really fine grained status about a particular dependency within your service, and figure out if it's working, and if it's not working, why it's not working."
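A sketch of that per-dependency lookup might look like the following. The dependency names and check functions are invented; in the real framework an HTTP route like `status/db` would dispatch to the registered check.

```python
# Hypothetical per-dependency status lookup mirroring "status/<dependency>".

# Map of registered dependency names to their check functions (invented names).
CHECKS = {
    "service-core": lambda: "OK",
    "db": lambda: "OK",
}

def status(dependency):
    """Return OK, or a CRIT message explaining why the dependency is unhealthy."""
    check = CHECKS.get(dependency)
    if check is None:
        return "CRIT: unknown dependency '%s'" % dependency
    try:
        return check()
    except Exception as exc:
        # Surface the failure reason so callers can see *why* it is down.
        return "CRIT: %s" % exc
```

The key property is that each check is addressable on its own, so you can probe exactly one dependency rather than the whole service.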
"The next endpoint that we created is called the aggregate endpoint. This endpoint is meant to give the overall, really quick status of your service, so when you call this endpoint you really expect to see OK if everything is all good, or you expect to see CRIT with some sort of an error message when something goes wrong. You can also see warnings if the service is in a weird state where something could go wrong but it's not quite there yet, but I guess the point here is that aggregate checks all of the dependencies in your service. If, for example, you had the dependencies on the previous slide, it would check service core and it would check the database. If both of them were OK, it would return OK. If one of them was failing it would return CRIT with the error message. If two of them were failing, it would return the first error message; you fix that one and then it will return the second. That's kind of how we've decided the aggregate endpoint works."
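The aggregate behaviour described, run every check and surface one failure at a time, could be sketched like this (check names are invented; the one-error-at-a-time policy is as described in the talk):

```python
# Sketch of an aggregate status endpoint: OK only if every registered check
# passes; otherwise CRIT with the *first* failing dependency's error. Fixing
# that one and re-calling surfaces the next failure, as described above.

def aggregate(checks):
    """checks: dict of dependency name -> check function returning 'OK' or raising."""
    for name, check in checks.items():
        try:
            result = check()
        except Exception as exc:
            result = "CRIT: %s" % exc
        if result != "OK":
            return "CRIT: %s failed: %s" % (name, result)
    return "OK"
```

This makes the aggregate endpoint cheap to poll from a monitoring tool: a single string tells you whether anything at all is wrong.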
"The next endpoint we have, which I think is the most interesting is an endpoint that allows you to traverse the service graph, so on the right hand side we see an app, which has a dependency of S1, and S1 has a dependency of S2. This endpoint allows you to traverse from the app, down to S2 and then perform an action, and the only action that we have allowed at this point is the about action, so it will basically allow you to start at the app and go down to S2, and call about, so you could figure out what was going on at S2 without actually needing direct access to S2. You could just get it through the service graph."
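The traversal idea can be sketched as follows. The graph, node names and payloads are invented for illustration; in the real system each hop would be an HTTP call to a traversable dependency, with "about" as the only allowed action.

```python
# Sketch of service-graph traversal: start at one node, follow a path of
# traversable dependencies, and run the "about" action at the final node,
# so you can inspect S2 without direct network access to it.

GRAPH = {
    "app": {"deps": ["s1"], "about": {"name": "app"}},
    "s1":  {"deps": ["s2"], "about": {"name": "s1"}},
    "s2":  {"deps": [],     "about": {"name": "s2"}},
}

def traverse(start, path, action="about"):
    """Follow `path` hop by hop from `start`, then perform `action` there."""
    node = start
    for hop in path:
        if hop not in GRAPH[node]["deps"]:
            raise ValueError("%s is not a dependency of %s" % (hop, node))
        node = hop
    if action == "about":
        return GRAPH[node]["about"]
    raise ValueError("unsupported action: %s" % action)
```

Restricting the action to "about" keeps traversal read-only, which matters when any entry point can reach deep into the graph.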
"Now that I've talked a little bit about the API, I want to talk a little bit about how we use this API. We use the API to monitor our services, like I said before. We use a tool called [Sensu 00:13:07] to actually query these endpoints on the services, to monitor individual hosts and report success or failure, and to send alerts to people. We can also use this endpoint to calculate uptime, and we use a tool called Site24x7 to do that."
"With this API we can also debug what's going on in the graph. We can use the about endpoint to look at a node in the graph and see which dependencies are up, which dependencies are down. We can use traverse endpoint to then look at different places in the graph. In this example we have S1, which has a database, and we have S2, so if the database is down when we call the about endpoint we expect to see a failure in the dependencies for the database, and we expect the aggregate endpoint to report that the database is down."
"One of the really cool things about this as well is that, because it actually uses the real connectivity of your application itself, you can do things like debug connections between services and other things like that. You don't actually need to set up anything extra. You just get that for free because it uses the connections of the application itself."
"One of the other things that you can do, and this is I think the biggest benefit to this, is that it allows us to explore and learn about the service graph, or the application and all of its parts. So, if we look at this picture here, we have an app with a whole bunch of services and their inter-dependencies. With this API you can start at the app level and you can traverse to other levels of the service graph, and then have the node describe itself. That's pretty cool because now you can write code that knows how to traverse your service graph and do a whole bunch of tooling, and that's what we did to actually generate this picture here."
"This picture actually is generated every single night and allows us to see how our service graph has changed over time. If people add new services to the graph, they show up here. If people remove other services from the graph, they get removed, and if they change any inter-dependencies that also shows up here. That really, really helps us explore and learn. Oh yeah, here is- this, right here is our service graph changing over time. You can see that the dates are changing here, so this is actually us changing and adding services, and removing them over time. This gif was generated from that API. Let's wait for it to finish here. Getting pretty close. There's a link to that in the presentation if you want to check that out. Again, like I said before, this really allows us to document, which is awesome. Now, we can actually see how our service graph changes over time and record that."
"Our general monitoring strategy is that we use two tools. We use Sensu like I said before, and Sensu allows us to monitor single machines in our graph. It basically works at the machine level. It doesn't work at the whole service level. Its responsibility is to send alerts and notifications when one machine in a cluster goes down or starts reporting errors. The other tool that we have, called Voltron, its responsibility is the overall status of the application and services, and the main thing that we do there is we troubleshoot by drilling down, or we can explore the application. Let me give a quick demo here of the application."
"Is everybody following along still? I need to start up a couple things here, sorry. I'm just starting a couple demo apps that have some dependencies registered in our demo tool app. Once all this stuff gets going I'll open a browser here and we can check out the tool. Like I said before, the main responsibility of Voltron is to basically look at the application as a whole, and allow people to explore our service graph in real time, and actually see when things go up and down. It's basically an interactive web tool that's similar to that animated gif that we have, but it's a little bit different in the sense that it only shows a subset of the graph at one time. That has its benefits."
"Let's see if we can get this going. Looks like it's still trying to compile here. Sorry I had issues with my computer just before so I got to ... before I started. Looks like it's trying here. As this is going I'll just keep going and then we'll come back to the demo here."
"The Voltron app itself is a Play app. It doesn't really matter what technology you use, but it's a Play app. It has a [React.js 00:20:24] front end, uses websockets to connect to the backend, and in the backend it uses Play and Akka. The overall architecture looks like this. We have browsers, which in the front end connect to the backend using websockets. Each browser has an actor in the backend, which backs its websocket, so it keeps state for that websocket, and it communicates with a Status Poller Actor, which is the single place that polls against status endpoints to build the UI. This whole system is real time using message passing, and synchronized using websockets."
"The basic flow is, you with the browser will open some page to look at the status of some node in the app. It will send that state down the websocket. The websocket will send that state to the status actor, and if the actor's not already polling for that status, it will start polling for it. Every time it receives responses it will send them back to the actor, and then back to any browsers that care about that node. The reason we decided to use websockets here is that we really wanted to keep all of the users in sync as they were debugging services and trying to figure out what was wrong. It looks like there's a build error there, so I'll continue with the rest of the presentation and then at the end, hopefully I can clear that up and show you something before time is up here."
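The poller fan-out described above, where many browsers can watch the same node but only one poll loop runs per node, might be sketched like this. This is a plain-Python simplification with invented names; the real system uses Akka actors and websockets rather than callbacks.

```python
# Sketch of the Status Poller fan-out: subscriptions are deduplicated per
# node, so polling for a node starts only on its first subscriber, and each
# poll result is pushed to every browser (callback) watching that node.

class StatusPoller:
    def __init__(self):
        self.subscribers = {}  # node name -> list of callbacks (one per websocket)

    def subscribe(self, node, callback):
        """Register a watcher; return True if polling for `node` should start."""
        started = node not in self.subscribers
        self.subscribers.setdefault(node, []).append(callback)
        return started

    def on_poll_result(self, node, status):
        """Fan a status-endpoint response out to every watcher of this node."""
        for callback in self.subscribers.get(node, []):
            callback(node, status)
```

This is why websockets help with performance here: N browsers watching a node cost one poll, not N polls, and everyone sees the same result at the same time.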
"Basically, the key thing for these quotes here is that, "When there's a production issue, we see lots of people going to the Voltron tool to perform diagnostics on what might be wrong." "Voltron is often the first to tell us when Snowflake is down." Snowflake was a service that we had go down in staging sometimes that we had not great visibility into. When a critical service goes down, everything starts alerting and reporting problems, but Voltron gets through the noise by letting you drill down."
"This quote here is about debugging the actual service communication channels. So, Voltron told the people that were debugging their service that the communication channel was okay, and there was an actual bug in the service that they needed to fix. Our lessons learned. The biggest lesson here is that visibility empowers. It gives developers more power. It makes them much more productive, and helps with developer happiness."
"Create your SOA tools early, so automate all the things. Identify problems early and fix them, and think about the ten times factor. I like to think of the ten times factor as: if I had ten times more services than I have now, would I actually be able to manage them, or would it just be crazy? If it would just be crazy and you couldn't manage them, then you need to re-look at your tools and you need to try and find a better way."
"Make status checking easy. We did this by standardizing. We did it by building the API I showed you. We did it by adding our status checking to a service framework so that everybody gets it for free, and then we also did it by sharing common status checks between projects. By that I mean, if you're going to check a MySQL database and everybody needs to check MySQL databases in your group, then you should probably have one shared way of doing that, and everybody uses the same code. It just makes it easier, so everybody's checking the dependency the same way."
"I've mentioned this earlier, but a pretty big thing for us was using websockets in the tool for real time. It really helped synchronize the views so everybody's seeing the same thing, and it helped with performance. We didn't have to worry about multiple people trying to request data through an ajax endpoint at the same time, and causing other performance issues. Websockets are just much quicker."
"All right. I'd like to demo for you the Voltron app now. Like I mentioned before, the Voltron app uses the API that I talked about earlier in the presentation. One of the nicest things about the Voltron app is that it allows us to explore the application itself, to see what services are there and what it's made of. I'll just go a little bit over the UI here first. If you look at the big box at the top here in the center, this is the metadata about the current app that you're looking at, so there's a time-stamp as to when it was last checked, the status of it, there's a version, there is a brief description of the app, there's owners and there's a link to documentation, a link to the actual code, and links to logs and dashboards to look at the status of the app itself. This is great because people can come here directly and there are quick links into what is currently running, and they can find out who owns it, and they can find out what the application or service does."
"The next thing I want to talk about is, if we look at the top bar here, this is the actual history of the status checks that are happening on the application, so if we click on one we can see that at this time it was all good. The other thing that I wanted to point out is if we look down on the left hand side we have internal dependencies. The internal dependencies are dependencies that only this app cares about, so they are internal to this app, and if either one of those dependencies goes down, then we'll see an error. This app has Memcache and it has a MySQL database. On the right hand side we see our service dependencies, or dependencies that we deem are traversable."
"In this example app we have two services, one called service one, and one called service two. Anytime you see this sort of graph icon here, that means that it's clickable and you can navigate into the service graph at that point. One thing I'd like to point out as well is that the Voltron app is actually doing a heartbeat right now on this demo application, and every time it heartbeats the application, it actually traverses the whole service graph to find out the status of everything in the service graph, and errors propagate up the service graph, so if one service has an error, you would see that at the very, very top."
"Another thing that's interesting is that each node in the graph actually gets its own dashboard, so if we go and we click on service one, we can go and see that service one has a dashboard and so we can learn about what service one is. Service one looks like it's a demo service for this demo app, and it has its own documentation and it looks like it has a MongoDB. So that's pretty cool. Just by looking and clicking on that we can see that it's up, so it's green, and that it has a MongoDB. Let's go back."
"Oh, so it looks like we have another service called service two, so let's click on that and see what that's all about. Okay, so service two looks like it has a Redis database and it looks like it also has a service dependency on service one. That's interesting, so let's click on service one. Okay, so we can now traverse the graph from the main application into service two, and then into service one. One of the other things that's really interesting is without even traversing, you can actually enter the service graph at any point as long as you can connect to it. By going and changing the URL there, I can actually navigate directly to service one and it gets its own dashboard, and I can also go directly to service two."
"You can see that right now we're entering from service two, and now we're looking at the service two dashboard and then we can actually go into service one as well. The cool thing is each team that owns a service can actually have their own dashboard as well, and you can use this in dev, and staging, and production as long as you can actually connect to your service. If you can't connect directly to your service, you can always go in through an entry point. In this example here we've chosen to use the main app as the entry point, and then we can traverse down to the service that we want to look at. This is great because it means we can monitor the whole application just by looking at this one dashboard. Let's go and see what an error looks like."
"Right now what I'm going to do is I'm going to simulate the database in service one failing. As you can see, the app turns red and then we see a bunch of errors on the right hand side in the services. Service one is the one whose database we made fail, so you can see it's red and it has an error message describing that the database connection failed. We can actually still navigate into the service and we can see that it was the Mongo database that was causing the issues. That's great. It allowed us to drill down and see the error. Now, we also see that service two looks like it's reporting an error, so let's figure out why. Oh, it's reporting an error because it has a dependency on service one and service one is down."
"Let's go into that so we can basically drill down and we can see that the Mongo database is the issue here again. Let's resolve this issue. Great, so we just resolved the issue so now we're back. Let's go back up the graph here. Looks like service two is back to normal and our main dashboard is back to normal as well."
"One of the other things that's really cool about this is that new services will start showing up in the graph. I have another service running called service three, and it's not connected to the main application graph yet, but it has its own dashboard because I can connect directly to it. Let me go and add this to the service graph, and let's see it show up directly on the application. Here we're seeing the dashboard for service three still, and if I go back to the main demo app, oh, now you see service three is there. You can navigate into service three and you can see what's going on. The really cool thing about this is that this dashboard will grow as your application grows and changes, and it allows you to monitor, explore, document and debug your application."
Brian: "Hey Adam, that was great. We had a question here about services knowing their own status. In the framework you have these slash status endpoints; what exactly does it mean for a service to have a healthy status? Does it check its database connections, or does it have an internal error count or whatever?"
Adam: "Yeah, so the way that we've built it is that the framework basically allows people to register dependencies, so a dependency could be thought of as a database like you mentioned, or another service, or a cache or something like that. When you configure your service, if you have a database you would say, I have a database, and this is the function that you would use to check the status of that database, and this is the endpoint which I would like to expose for that status check. For example, [inaudible 00:34:38]. Once you've done that, the framework basically exposes status slash db, and will either answer with OK if it's okay, or it will return an error message. It will return CRIT plus some sort of error message that describes what went wrong. Generally the way that we build status checks is we would have something like connect to the database and try to run a query, and have that in a try-catch. If any exceptions happen, catch them and then output that as the error from the status check."
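The registration flow Adam describes might be sketched like this. The registry, handler and check functions are invented for illustration; the real framework wires these up as HTTP routes.

```python
# Sketch of dependency registration: a service declares a dependency plus a
# check function, and the framework exposes it at "status/<name>", returning
# OK on success or CRIT plus the caught exception's message on failure.

REGISTRY = {}

def register_dependency(name, check):
    """Expose `check` at the endpoint path status/<name>."""
    REGISTRY["status/" + name] = check

def handle(endpoint):
    """Run the registered check in a try-catch, as described in the answer."""
    check = REGISTRY[endpoint]
    try:
        check()  # e.g. connect to the database and run a trivial query
        return "OK"
    except Exception as exc:
        return "CRIT: %s" % exc

def good_check():
    pass  # a check that succeeds by not raising

def failing_check():
    raise RuntimeError("connection refused")

register_dependency("db", good_check)
register_dependency("broken", failing_check)
```

The try-catch is the important part: any exception from the check becomes the human-readable error the status endpoint reports.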
Brian: "Did you ever have a client, or have you ever thought about having the clients that invoke those services influence the status, have the service be made aware of its unhealthy status from the [inaudible 00:35:30]?"
Adam: "So we do have a concept of that called a circuit breaker, and that is another type of status check that we have. For example, if you have a dependency like another service or a database like you're saying, we can create a circuit breaker on it that will actually stop calling that dependency when it detects that it's failing a lot, and let it recover, and then come back."
Adam: "So, it'll check every once in a while and those are different types of status checks that you can configure. By default, for checking a database or for checking another service, we don't actually use the circuit breakers that often because we actually want to know at this time what is the status. We don't really want to derive status at that time but we use circuit breakers to back off when things are encountering like maybe too much load, or too many errors, or to fail fast. Does that make sense?"
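The circuit-breaker behaviour described, stop calling a failing dependency, fail fast while it recovers, then try again, could be sketched as follows. The thresholds, timing and class shape are invented for illustration, not Hootsuite's implementation.

```python
import time

# Minimal circuit-breaker sketch: after max_failures consecutive failures the
# breaker "opens" and calls fail fast; after reset_after seconds it allows one
# trial call ("half-open") to see whether the dependency has recovered.

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock          # injectable clock makes this testable
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            # Recovery window elapsed: go half-open and allow one trial call.
            self.opened_at = None
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open the circuit
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

This matches the trade-off Adam describes: a plain status check tells you the status right now, while a breaker derives status from recent history to back off under load or repeated errors.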
Brian: "Yeah, absolutely. Great. No, that was a very good answer. Thank you. I don't know if you want to retry your demo. I think if you're getting hit by demo demons we can move on, but you want to give it one more try?"
Adam: "Yeah. Looks like my- I don't know why this one's not running. Let's just try another question. I'll see if this actually loads, and if it does then good. If not I'll have to maybe- I think I have ... Yeah. I'll see if I have any screenshots or something."
Brian: "Okay, well the next question is actually about how much time you have left because we know you actually have to leave soon (laughs)."
Adam: "Yeah. I can go for another like two minutes, a few minutes here. I'd just really like to show the tool here."
Brian: "Mm-hmm (affirmative). I had a question on the whole sort of infrastructure you've got here with Voltron and all that. Is it pretty stable or do you guys constantly rev it? Is it something that just works now, and is nice and taken for granted?"
Adam: "We are still evolving it. We are still kind of figuring out the best way to monitor, but the tool itself, for exploration, is very stable and we use it day to day inside of our build pipelines and stuff. We built widgets that we can attach other places that allow people to jump directly into this dashboard and navigate through it, and understand what services are up and down. That's worked really well for us, and I would say that's very stable."
Brian: "Got it. Okay, so I think given your time-frame if-"
Brian: "It doesn't look like we're going to get the demo but that's not a problem."
Adam: "Yeah, sorry about that."
Brian: "Quite all right."
Adam: "Okay. Any other questions?"
Brian: "I think that's all we have time for so perfect timing for you as well."
Adam: "Okay, thanks a lot."
Brian: "Thank you Adam. Really great stuff and I'll send over to you."
Austin: "Thanks guys. All right. Guys, we're done. Thank you to everybody that came out. Thank you to all the speakers, Phil, Laurie, Dan, Daniel, Nick, Lachlan and Mike, and Bill and Adam. This has been a really great day. All of the talks are recorded and will be live on microservices.com sometime next week once I've got them edited and have the transcriptions done. Then, also if you were curious about checking out the Datawire MDK and Datawire mission control, you can do that at datawire.io. All the details are there. You can get started. You can actually code up with microservices in about ten minutes using the MDK which is, of course, open source. So, on behalf of everyone at Datawire, I'm Austin. Thank you guys so much for coming out. Thank you for tuning in. Thank you for all the questions. We're going to do another one of these summits in January. It's going to be live, probably in San Francisco, so stay tuned for that and we will see you guys then. Cheers."
Try the open source Datawire Blackbird deployment project.