HubSpot is an all-in-one sales and marketing platform made up of over 350 RESTful APIs deployed hundreds of times a day. This session will discuss the decisions and tradeoffs they made in their quest to maximize developer productivity in a fast-moving environment.
Tom: All right, hey, everyone. I’m Tom Petr. I’m from HubSpot. I was talking to Richard earlier and he was joking we’re like the only East Coast microservices company. So I guess you could call us the Harvard architecture of microservices. I thought that was funny. Well, I actually went to UMass, so I’m not going to call it the UMass architecture. Anyway, yep, so I’m Tom Petr. I’m Tech Lead on the Platform as a Service team at HubSpot and I’m here to talk to you about how we maximize developer productivity in a microservices environment.
HubSpot, for those who don’t know, is an all-in-one sales and marketing platform that lets you attract, acquire, and keep customers. We have a lot of different tools. There’s blogging, web analytics, email sending, social media, I could go on and on. Let’s make sure this clicker works. Oh, okay. So from a numbers perspective, we’re over a hundred different engineers and we have over 1,600 different deployable items. So that’s web services, background jobs, Cron jobs, one-off tasks.
We deploy hundreds of times a day, and actually go from a git push to having something live in production in about 10 minutes, which is really fast. And this all works because our engineers own the end-to-end success of their products. We have teams of typically three to four engineers that own the entire slice of email or analytics or social media. They’re the ones that make most of the technical decisions and they’re the ones that wear the pagers.
We don’t have an ops team in the classical sense of the word. And microservices are what have allowed us to scale so well. That maps to our team structure. It provides a good technical template. It minimizes merge conflicts and broken builds because we partition our services by GitHub repository. And it also compartmentalizes failure. But the magic here is that everyone gets to work on their own thing. People aren’t stepping on each other’s toes and they really feel a strong sense of ownership as to what they’re doing.
And it’s really nice. The problem is you can’t just do microservices. Whether it’s you have a monolith already and you want to chip away or you’re starting from scratch, you can’t just say, “I’m going to do microservices, done.” You’ve got to make investments in the environment, and the infrastructure, and your tooling. So what I want to do is go through the history of HubSpot and the different investments we’ve made along the way to get us where we are.
So I’m going to roll the clock all the way back to September 2010. This is actually not when HubSpot started. It’s just when we first started tracking our deploys. Back then, there were 14 engineers and 14 active web services. They were running on AWS, they were written in Java, they communicated over HTTP and JSON, and we were using MySQL for our databases. The first thing that we realized is that easy communication is essential.
Like I said, we communicate with HTTP. What we ended up doing was building a library just for communication. We call it HubSpot Connect and it wrapped around the Apache HTTP library. And it made a lot of things easier for us. We had good defaults for timeouts. We were able to sneak in extra things like auto-authentication. And just getting everyone on the same library allowed us to roll out improvements as we went on.
As we scaled up more, configuration got harder and harder. The way we were doing configuration was just properties files in source control. If you wanted to make a change to a credential or a timeout or whatever, you had to commit your change and deploy your code. If it was a cross-cutting change, that meant you had to redeploy everything. Huge pain in the ass. What we ended up doing was building a library called HubSpot Config. You can sense a theme here.
And what that allowed us to do was abstract away our configuration. Here’s a Java sample using Guice dependency injection. But basically what you do is you just have these variables that correspond to your configuration items and you can pull in the live value whenever you need it. This is very cool because it abstracts away the configuration. You can pull in from environment variables or Java properties or whatever else that you want.
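The slide with the Java sample is not part of this transcript. As a rough illustration of the idea of a variable that always yields the live value, here is a minimal sketch in Python rather than Java. The `LiveValue` class, the key name, and the source ordering are all hypothetical and are not HubSpot Config's actual API.

```python
import os
from typing import Callable, Dict, List, Optional

# A source is any function that maps a key to a value, or None if it has no value.
Source = Callable[[str], Optional[str]]


class LiveValue:
    """Hypothetical handle for one configuration item.

    The value is looked up on every get(), so a change made in any source
    is picked up without redeploying the code that holds the handle.
    """

    def __init__(self, key: str, default: str, sources: List[Source]):
        self._key = key
        self._default = default
        self._sources = sources

    def get(self) -> str:
        # The first source that knows the key wins; otherwise use the default.
        for source in self._sources:
            value = source(self._key)
            if value is not None:
                return value
        return self._default


# Assumed sources: an in-memory override map (standing in for a remote
# store) consulted first, then environment variables.
overrides: Dict[str, str] = {}
sources: List[Source] = [overrides.get, os.environ.get]

timeout_ms = LiveValue("EXAMPLE_HTTP_TIMEOUT_MS", "5000", sources)
```

Calling `timeout_ms.get()` at the point of use, instead of copying the value once at startup, is what lets a later change take effect in a running process.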
We ended up building a store backed by ZooKeeper that allowed us to change these values in a separate web UI, so that if we wanted to make a change without having to redeploy everything we could do that. Or if we had to do some kind of global change like change a timeout or invalidate a cache we could do that once and everything would pick up the change. Another thing is you got to monitor everything, especially in a microservices environment.
What is just a funky one-off service that you create today could be the thing that you depend on, that everyone depends on, six months from now. And back in, for the longest time, up to like 2012 at HubSpot, you had a couple different options for monitoring. You had Nagios checks, you had Pingdom checks, and you had Selenium checks. And all of them have their pros and cons, but the problem was you actually had to manually set all these things up. The developers just never wanted to do that.
So in 2012 we built a new monitoring system called Rodan. So we departed from the HubSpot naming scheme to the Godzilla monster naming scheme. And the beauty of Rodan is in its simplicity. All it does is just accept data points. So we instrumented all of our apps so that they periodically report into Rodan with different values. So you can see here in the web UI all the different families that we have.
And if you click into it you can actually see all the different instances. You can see the requests per second, error rates, blah blah blah. And that, developers are getting automatically. All they have to do when they spin up a new service is basically just say what its name is. And each of those metrics can have rules associated with them. So you could say, “If this is a critical request, if it goes to zero requests a second, something bad is probably going on.” There’s also rules baked in, like every web request should probably alert if a server error goes above 50%.
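To make the idea of rules attached to metrics concrete, here is a minimal sketch in Python. It is not Rodan's actual data model. The `Rule` type, the metric names, and the `evaluate` function are assumptions made for illustration. Only the two example rules (zero requests per second, server errors above 50%) come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    """Hypothetical alerting rule bound to one metric name."""
    metric: str
    description: str
    is_violated: Callable[[float], bool]


def evaluate(data_point: Dict[str, float], rules: List[Rule]) -> List[str]:
    """Return the description of every rule violated by one reported data point."""
    return [
        rule.description
        for rule in rules
        if rule.metric in data_point and rule.is_violated(data_point[rule.metric])
    ]


rules = [
    # Rule a service owner might add for a critical endpoint.
    Rule("requests_per_second", "critical endpoint is receiving no traffic",
         lambda value: value == 0),
    # Rule of the kind the talk describes as baked in for web requests.
    Rule("server_error_rate", "server error rate above 50%",
         lambda value: value > 0.5),
]

alerts = evaluate({"requests_per_second": 0.0, "server_error_rate": 0.1}, rules)
print(alerts)  # ['critical endpoint is receiving no traffic']
```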
So that made things a lot easier. We also beefed up our graphing capabilities. So we built this thing called lead.js that lets you have…it’s kind of like a Python notebook version of Graphite and OpenTSDB, so we could really easily graph and see what is going on when we’re trying to debug something. Another thing that we learned, this is something near and dear to my heart. Command line tools can burn you. So I don’t know about you guys, but for the longest time, when you wanted to deploy something at HubSpot you just ran this command line tool called deployer.
So you’d run “deployer deploy” and name your project. And what it’d be doing behind the scenes is SSHing into the hosts that your project runs on, laying down a new build, starting things up, SSHing into load balancers, applying configuration. There are lots of places where things can go wrong and a lot of room for error. And so what began as just a couple people using this one script turned into lots of people using it, and you’re kind of at the mercy of everyone’s machines.
If someone’s out of date or if someone is hacking on the script, that could totally screw up your deployment process. And the way we got around that is actually wrapping that in its own web service. So the same deployer code is now in a service that people hit through a web UI. And it’s nice because now we’re in control of what version we’re running, and people can do their deployments wherever they want, even on the beach if they want.
The other nice thing is they’re going from a world where you have just a command line tool where you say, “Yeah, I want to do it. Okay, go,” to a nice web UI where you can see, “Okay, I’m going to deploy this thing. Let me actually get a dry run of what’s going to happen. Okay, it’s going to deploy these things. It’s going to deploy these builds. Okay, cool. Deploy.” So people feel a lot more confident in the tools because they know what’s actually going to go on behind the scenes.
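The dry-run idea can be sketched as computing the full plan first and only executing it on request. This is an illustrative Python sketch, not the deployer's real code. The `Step` type, the function names, and the host names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Step:
    """One planned action against one machine."""
    host: str
    action: str


def plan_deploy(project: str, build: int, hosts: List[str],
                load_balancers: List[str]) -> List[Step]:
    """Work out everything a deploy would do, without doing any of it."""
    steps = [Step(host, f"install {project} build {build} and restart") for host in hosts]
    steps += [Step(lb, f"apply load balancer config for {project}") for lb in load_balancers]
    return steps


def deploy(project: str, build: int, hosts: List[str], load_balancers: List[str],
           dry_run: bool = True, execute: Callable[[Step], None] = print) -> List[Step]:
    """Return the plan; run each step through `execute` only when dry_run is False."""
    steps = plan_deploy(project, build, hosts, load_balancers)
    if not dry_run:
        for step in steps:
            execute(step)
    return steps


# Dry run: shows what would happen, touches nothing.
preview = deploy("example", 42, ["app1", "app2"], ["lb1"])
```

Because the preview and the real deploy share `plan_deploy`, what the user confirms in the UI is the same list of steps that is then executed.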
Builds must be fast and they must be cheap. We use Jenkins for our stuff, but we ended up…we kind of appropriated a Heroku Buildpack model so that we weren’t tied to Jenkins. We have so many different build jobs that Jenkins is actually kind of slow for us. So we’re in the process of building a new build system, but it’s going to be really simple to migrate to it because we have this Buildpack process. There’s nothing tying us to Jenkins.
It’s also really essential to decouple the front end from the back end. When people talk about microservices, they’re saying, “Okay, our email service is now separate from our navigation service or settings service.” But if you have your front end and your back end people working together on that email service, you’re still going to be held up one from the other. So one thing that we did is we had this realization as we were moving first to Backbone and then to React. We were seeing, “Oh, we have all these endpoints that are literally just serving HTML. There’s no point in having a server actually host this, so why don’t we just change our build process so that when we build our static content, we just shove it into a CDN, and when you’re deploying, just point our load balancers to that CDN.”
It’s really nice because now our front end people can work asynchronously from our back end people. This is probably obvious, but you’ve got to automate your deployments and your infrastructure. So I want to walk you through what developing an app used to be like at HubSpot. So first you develop locally. Once you’re ready to test something out, you provision hardware for a QA environment. So you go into this tool and you have to say, “Okay, what kind of EC2 machine do I want and how many do I want? Do I err on the side of caution? Do I try to save HubSpot some money, and choose smaller instances or do I want to do bigger instances and not have to worry about scaling issues later?”
Once I choose what kind of instances I want, I’ve got to wait 20 or 30 minutes for Puppet to run, for DNS to be updated, for all these different things before my hosts are ready. Once they’re ready, then I used that script that I was talking about to actually deploy my code. So then I test that in QA, everything’s good. And then I have to do the same thing for production. I’ve got to decide what kind of host do I want, how many do I need? Roll it out. Everything is good until, like people said before, 4:00 a.m. your hosts die and you get paged.
You’ve got to spin up new hosts, redeploy, blah blah blah. It’s not a great setup. What we ended up doing, kind of like the Yelp people, we went down the route of building our own PaaS as well. We had this thing called Singularity, which is an open source project, and it’s an Apache Mesos scheduler. And it’s really nice because the combination of Singularity and Mesos lets us abstract away machines.
We’re not thinking in terms of, “Okay, this service has these three hosts.” We’re thinking, “Okay, this service has three instances that each use five CPUs and 10 gigs of memory.” And it also promotes a homogeneous environment, because developers are just thinking in terms of the requirements of their app, not what Amazon instance can I shoehorn this into. It also allows us to scale out specific processes. So let’s say the email service, we have a huge email send.
In the past that would mean we’ve got to frantically spin up five more instances, wait 20 to 30 minutes for them to be ready, and then redeploy. Now we can just say, “Hey, that service in Mesos, just run double of it,” which is really nice. And the other nice thing is that it provides us with a centralized service registry. I’m going to be completely honest with you: before we were using Mesos, if you asked me, “Hey, tell me everything that’s running everywhere,” I would start sweating.
I would be frantically writing some SSH scripts. But with Mesos, now you just have this nice API where you can say, “Yeah, tell me what is running.” That’s really cool. So the workflow after is much simpler. You just develop locally. You just deploy to QA. You just throw it in. You test it out, and then deploy to PROD, and then you’re done. And the other nice thing about this, like I said, is you have these nice APIs. You can surface information much easier, whereas before, a developer would be SSHing into boxes to get information about their service.
Now you can go to this nice web page where you can see what’s running and what was running previously. You get options to scale or pause or bounce. If you click into a task, you can see the lifecycle of it. You can kill it if it’s misbehaving. You can get resource usage. You can get health check information. You can even go into the sandbox of the task and tail log files, which is really, really nice. It’s all things that developers could do before, but just not as easily, so it’s making them more productive because they can just do their job.
Another thing is use the environment to your advantage. The way we did service discovery, well still do service discovery, it’s not that fancy. We basically treat the developers like a customer. If you want to hit the example service, I just made up that URL, but if you want to hit example service it’s just a well-known URL: hubapi.com/example/v1/endpoint. And that’s really easy to understand, but when you look at it, it’s a huge waste of resources because we’re going to the public internet just to get back to our service.
And there are obviously better ways to do it. We could invest in SmartStack or some kind of new registry system, or service discovery system, but we realized, “Hey, we have control of our HTTP client because we wrote HubSpot Connect. And we know what’s running because we have Mesos. Why don’t we just make our client smart so that if we’re hitting something that we know about, it just talks straight to it?” So we’ve been able to simplify a lot of things and save money by using intelligent clients to circumvent load balancers.
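A smart client of this kind can be sketched as a URL rewrite step that runs before the request is sent. The Python below is an illustration under stated assumptions, not HubSpot Connect's code. The `SmartResolver` class, the registry shape (service name to a list of `host:port` addresses), the round-robin choice, and the assumption that the first path segment names the service are all hypothetical.

```python
import itertools
from typing import Dict, List
from urllib.parse import urlsplit, urlunsplit


class SmartResolver:
    """Rewrites well-known public URLs to a known internal instance."""

    def __init__(self, registry: Dict[str, List[str]], public_host: str = "hubapi.com"):
        self._public_host = public_host
        # Round-robin over each service's instances; services with no instances are skipped.
        self._instances = {
            name: itertools.cycle(addresses)
            for name, addresses in registry.items() if addresses
        }

    def resolve(self, url: str) -> str:
        parts = urlsplit(url)
        # Assumption: the first path segment identifies the service.
        service = parts.path.lstrip("/").split("/", 1)[0]
        if parts.netloc != self._public_host or service not in self._instances:
            return url  # unknown target: fall back to the public load balancer
        address = next(self._instances[service])
        return urlunsplit(("http", address, parts.path, parts.query, parts.fragment))


resolver = SmartResolver({"example": ["10.0.0.5:8080", "10.0.0.6:8080"]})
direct = resolver.resolve("https://hubapi.com/example/v1/endpoint")
print(direct)  # http://10.0.0.5:8080/example/v1/endpoint
```

Callers keep using the well-known URL, so anything the registry does not know about still works through the public route.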
Another important thing is to invest in onboarding. And that’s not just new hire training, but that’s making sure that your developers, when they get a new laptop, can be good to go in minutes. And we’ve struggled with that for a while. I think we’re on version 4 of our setup script. What it is now is based off of Homebrew and brew-cask. And it’s been really, really… it’s worked really well so far.
So any developer that gets a new machine basically runs a script, goes through, sets up SSH keys, installs Homebrew, installs brew-cask. We have a private tap as well for some special libraries, but also to pin languages to specific versions so we can make sure everyone is running the same Python, everyone’s running the same Java, and it mirrors what’s running in production. So that’s really nice.
The other thing that sounds kind of obvious but can be hard is just know what’s out there. So when you’re in an environment with many services, that means you’re running a lot of things and each of those things are pulling in libraries, and those libraries are pulling other libraries. What do you guys do if you realize that you rolled out a bug in one of your core libraries? Well, in the past that would mean email the engineering team saying, “Hey, there’s a bug in this library. Please rebuild and redeploy.”
But that’s a pain in the ass, and no one wants to do that. And people are going to fall through the cracks anyway. So one thing that we did, again, because we had control of our environment, we took the thing that was sending the data points into Rodan, our monitoring system, and also had it send information about what builds the app pulled in. So what that allowed us to do, it might be kind of hard to read, is we have this thing called Cattle Prod.
So when you know that you need to force some kind of change in our environment, instead of sending out an email to everyone saying, “Hey, this build might be bad. Please rebuild and redeploy,” you can actually say, “Okay, so for everything pulling in the content cache build before build number 263, instruct them to rebuild and redeploy,” or if they’re pulling in a specific version of another build. And what that means is the actual service owners will get this nice email saying, “Hey, these things that you own, they’re pulling things that need to change. Please redeploy.”
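The query behind that kind of notification can be sketched as a filter over the build numbers each service reports. This Python sketch is illustrative only. The `needs_redeploy` function, the service names, and the shape of the reported data are assumptions, not Cattle Prod's real interface. Only the “content cache before build 263” example comes from the talk.

```python
from typing import Dict, List


def needs_redeploy(reported: Dict[str, Dict[str, int]], library: str,
                   first_good_build: int) -> List[str]:
    """Return the services whose reported build of `library` is older than `first_good_build`.

    `reported` maps a service name to the library builds it was built against.
    Services that do not pull in the library at all are left alone.
    """
    return sorted(
        service
        for service, builds in reported.items()
        if library in builds and builds[library] < first_good_build
    )


# Hypothetical data of the kind each app could report alongside its metrics.
reported = {
    "email-api": {"content-cache": 250, "connect": 90},
    "social-api": {"content-cache": 263},
    "analytics-jobs": {"connect": 91},
}

stale = needs_redeploy(reported, "content-cache", 263)
print(stale)  # ['email-api']
```

The returned list is what would be matched against service ownership to decide who gets the “please redeploy” email.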
The thing that I really like about this is you can see that these investments really began to pay off. So you can see this blue line here is our number of engineers, and it’s slowly gone up. But you can see around, probably, late 2013 is when we really made lots of investments to our infrastructure and our number of active web APIs really skyrocketed. In addition, developers are getting more productive.
You can see that the average number of commits per month is on an upward trend, which is really nice. Additionally, because of all the investments we made in the infrastructure, people are logging in to our infrastructure tools less, so they’re operating less infrastructure. They’re condensing those machines. They’re setting up load balancers, which is really nice. But the really nice thing is you guys don’t have to do this from scratch.
We started on this very early on when there weren’t a ton of tools out there. But there are lots of really nice Open Source tools that you guys can use. That’s it. Thanks. Yeah.
Question:
Tom: Yeah, yeah. So the question was how are we tracking these metrics. So this one, we actually have a database going back to September 2010. HubSpot started in 2006, I think, but it doesn’t go back that far. And it’s just a very simple SQL query to get month-by-month the cumulative active, meaning deployed, web APIs and engineers that are deploying. The commit one is actually pretty funny. We had this tool that we rolled out called Kumonga, going on the Godzilla monster theme, and it was supposed to be like a Facebook timeline for your contributions at HubSpot.
And we used it for a little while and it kind of fell by the wayside. But as I was building this presentation, I was like, “I really want to get commit information. I wonder if Kumonga is still running.” And believe it or not, it’s been running for four years. So this is just paging through an API to get all the commits for everyone, and averaging it together. This one was just usage tracking on our infrastructure tools.
Austin: You mentioned being able to deploy into production 10 minutes after a git push. Is that including time in your QA environment or not so much? Can you talk a little bit about that?
Tom: Yeah, that’s…
Austin: Obviously, 10 minutes for that is really fast.
Tom: Yeah, yeah. So yeah, I guess the way that usually works is so that time includes the git push, the build, deploying to QA, very minimal testing, and then deploying to production. Some of the things that we use to safeguard against failures like this is we really focus around the mean time to recovery, so in the one sense we use feature gates. So if you’re doing things right, you’re wrapping your changes in a gate so that you can very easily pull it back if it’s not working, or slowly roll it out to beta-tolerant users. We also make it very quick to roll back. So if for whatever reason you weren’t able to wrap the thing in a gate or if you just need to pull back immediately, it’s a very quick operation to do that.
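A feature gate with a beta list and a gradual rollout can be sketched in a few lines. This Python example is an illustration of the general technique, not HubSpot's gating system. The `FeatureGate` class, the gate name, and the hash-bucket rollout are assumptions.

```python
import hashlib
from typing import Iterable


class FeatureGate:
    """Hypothetical gate: open for listed beta users plus a percentage of everyone else."""

    def __init__(self, name: str, rollout_percent: int = 0,
                 beta_users: Iterable[str] = ()):
        self.name = name
        self.rollout_percent = rollout_percent  # 0 closes the gate, 100 opens it for all
        self.beta_users = set(beta_users)

    def is_open(self, user_id: str) -> bool:
        if user_id in self.beta_users:
            return True
        # Hash the gate name and user id into a stable bucket from 0 to 99, so a
        # given user stays on the same side of the gate as the percentage grows.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return bucket < self.rollout_percent


gate = FeatureGate("new-editor", rollout_percent=0, beta_users={"alice"})


def render_editor(user_id: str) -> str:
    # The new code path is wrapped in the gate; the old path stays as the fallback.
    return "new editor" if gate.is_open(user_id) else "old editor"
```

Pulling a change back is then a matter of setting `rollout_percent` to 0 and clearing the beta list, with no rebuild or redeploy involved.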
Austin: And do you end up with separate Mesos clusters for production and QA and all that kind of stuff?
Tom: Yeah, so the way the PaaS team works is we actually have three environments. There’s production, which is the production for the product. There’s QA, which is QA for the product, but we still treat it like production. And then there’s test, which is our QA, which is, I guess, the Wild West. So we have three different Mesos clusters for that.
Question:
Tom: Yeah, so you’re asking about security between the endpoints. What I meant by that is we’ve, over time, had different authentication mechanisms. It was just a simple API key for a really long time, and then we transitioned to a more sophisticated approach later on. But the thing is, just like with the intelligent load balancing, because we control the HTTP client we can see, “Oh, this request is going to this service, it wants an API key or it wants this new type of key.” And it’ll inject it into your request for you automatically. So it’s a lot less thinking about what you need to do.
Question:
Tom: So I got to clarify, 2010 is just when the data…so you’re asking what tools are out there now, like has the landscape changed? 2010 is just when it was the beginning of the recorded data. We didn’t actually start the Platform as a Service team until, I think it was June of 2014. And before that it was all just random contributions from me or other people that were really into it, I think. It’s kind of a self-serving answer, but we Open Sourced most of our stuff so if we were doing it…if I went to another company I would just take all of our old stuff.
In terms of other things, though, SmartStack is really interesting. I know Datawire is working on something similar. I feel like I can’t give a great answer just because I’ve been so focused on our own tools, if you know what I mean.
Question:
Tom: Currently, it’s its own thing, yeah.
Austin: One last question from the net was back to Mesos again. So with HubSpot’s clients, you guys are running multi-tenant clusters or multiple clusters for clients or how do you do that? How do you handle that?
Tom: So we only, in terms of the app, each environment just has one cluster. We just run Singularity. We don’t have any other frameworks running so there’s no other contention for apps. We do a small amount of Spark stuff for some machine learning purposes. And that we actually do have in its own cluster, just to isolate it because Spark can be really resource intensive sometimes. But we almost don’t consider it a cluster because it’s just Spark doing one thing. It’s almost like more devoting machines to it, if you know what I mean.
Cool. Thanks, guys.