Here at HomeAdvisor we’ve been thinking a lot lately about what it means to have a truly resilient architecture. As we continue our record growth, we’re realizing that our current architecture won’t allow us to scale the way we need to over the next few years. Like most organizations, the answer for us is becoming clear: we need to move into the cloud. In this post we’ll discuss the motivations behind moving to a multi data center architecture and the pros and cons of each approach we’re considering. We’re still very early on in the process and expect it to take us well into next year, so this will likely be the first of many posts as we begin to make this important change.
Our Current Architecture
As we’ve discussed previously, our current architecture is a hybrid of several large monolithic web applications and a growing suite of microservices. The monoliths are traditional Java web applications based on Apache Tomcat, while our microservices are based on Spring Boot. In addition, all of our web applications and microservices rely on several core services to do work:
- Apache Kafka: Distributed message service.
- Apache ZooKeeper: Distributed configuration management used for service registration and Kafka consumer offset management.
- Oracle Database: Traditional RDBMS.
- Oracle Coherence: Distributed in-memory cache.
- ElasticSearch: Distributed denormalized document storage.
Redundancy comes in many different flavors. We run multiple instances of our custom web applications and microservices on different Virtual Machines (VMs). Traffic is load balanced across all instances so that we can lose a single application with little to no fuss. Since each VM runs on different physical hardware, a network rack failure is not detrimental, though it can certainly cause noticeable impacts. All of our core services mentioned previously run in some clustered fashion as well, with each cluster member physically isolated from the others. They also use some form of data replication and sharding so the loss of a node doesn’t cause loss of data.
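For that load balancing to work, the load balancer needs a way to tell when an instance is unhealthy. As a minimal sketch of the idea (the class name and the database check are illustrative, not our actual code), here's what an instance-level health check can look like in a Spring Boot service using the actuator's HealthIndicator interface:

```java
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;
import org.springframework.stereotype.Component;

// The load balancer polls the actuator health endpoint and stops routing
// traffic to any instance that reports DOWN, which is what lets us lose a
// single VM with little to no fuss.
@Component
public class CoreServiceHealthIndicator implements HealthIndicator {

    @Override
    public Health health() {
        // Hypothetical check: only report healthy if this instance can
        // still reach its database.
        if (!canReachDatabase()) {
            return Health.down().withDetail("database", "unreachable").build();
        }
        return Health.up().build();
    }

    private boolean canReachDatabase() {
        // Placeholder for a real connectivity check, e.g. a cheap
        // "SELECT 1 FROM dual" against Oracle.
        return true;
    }
}
```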
But for all the work we’ve put into making our applications and services fault tolerant, there is still one problem with our current deployment: everything lives in the same data center. If that data center goes offline for any reason, so does our system. That is the big problem we’re trying to solve.
Choosing A Multi Data Center Architecture
There are many different forms of resiliency that we are looking at, each one involving Amazon Web Services (AWS) and each with its own pros and cons. While it may seem easy enough to spin up a few virtual servers in AWS and deploy our code there, our core services create some interesting technical challenges for us.
For example, ZooKeeper is vulnerable to network partition scenarios in which the cluster members cannot communicate with each other. This is especially a concern when the cluster is spread over a Wide Area Network (WAN). If a cluster member fails a heartbeat, you can easily fall into a split-brain scenario in which you effectively have two clusters that may not have the same view of the world. For some use cases this may be tolerable. But in our case ZooKeeper stores information about where all of our microservices are running. In a split-brain scenario, services that are actually healthy either cannot be discovered, or cannot discover the other services they need to accomplish work. When this happens, it can bring our system to a halt very quickly.
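To make that failure mode concrete, here's roughly what ephemeral service registration in ZooKeeper looks like using Apache Curator's service discovery recipe. This is a sketch rather than our production code, and the ensemble addresses and service name are made up; the key point is that registrations live in ZooKeeper sessions, so a partition can make perfectly healthy services invisible to each other:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;
import org.apache.curator.x.discovery.ServiceDiscovery;
import org.apache.curator.x.discovery.ServiceDiscoveryBuilder;
import org.apache.curator.x.discovery.ServiceInstance;

public class ServiceRegistrationSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative three-node ensemble.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",
                new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Register this instance under /services. The registration is
        // backed by an ephemeral node: if the ZooKeeper session is lost
        // during a partition, it silently disappears and other services
        // can no longer discover us, even though we're still healthy.
        ServiceInstance<Void> instance = ServiceInstance.<Void>builder()
                .name("quote-service")          // hypothetical service name
                .address("10.0.1.15")
                .port(8080)
                .build();

        ServiceDiscovery<Void> discovery = ServiceDiscoveryBuilder.builder(Void.class)
                .client(client)
                .basePath("/services")
                .thisInstance(instance)
                .build();
        discovery.start();
    }
}
```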
We’ve also found in previous prototyping efforts that Coherence is not very forgiving of network timeouts. So to simplify things, we’re making some assumptions about any multi data center architecture we choose:
- Each data center will have its own clusters that do not communicate with each other.
- Service clusters will have limited or no replication across data centers.
- Microservices and web applications will only communicate within their own data center.
- Database replication will be handled separately from the applications, and database failover will be as simple as a DNS or TNS update (see the sketch after this list).
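To illustrate that last assumption: if every application connects to the database through a stable alias rather than a physical hostname, failover becomes an operational change instead of a code change. A minimal sketch with plain JDBC, where the alias, service name, and credentials are all hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;

public class AliasedConnectionSketch {
    public static void main(String[] args) throws Exception {
        // "proddb.internal.example.com" is a DNS alias (it could equally be
        // a TNS alias resolved through tnsnames.ora). Failing over to the
        // backup database means repointing the alias, not redeploying or
        // reconfiguring every application.
        String url = "jdbc:oracle:thin:@//proddb.internal.example.com:1521/PROD";
        try (Connection conn = DriverManager.getConnection(url, "app_user", "secret")) {
            System.out.println("Connected via alias: " + conn.getMetaData().getURL());
        }
    }
}
```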
With that background in mind, the following sections will take a closer look at each model we’re considering.
Disaster Recovery
A traditional Disaster Recovery (DR) architecture involves software in a standby mode waiting to be activated. The idea is that the backup (or cold) data center can be activated in the event of a primary data center failure. Because only one data center is active at a time, there needs to be some mechanism for keeping services in sync. For the database, we’d use a replication tool like Oracle GoldenGate to duplicate transactions from the primary database to the backup database. For Kafka and ZooKeeper, a file-level synchronization tool like rsync could be used to replicate data (remember that because of our simplifying assumption that applications never communicate outside their data center, ZooKeeper data would only include Kafka consumer data, not service registration data).
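For the Kafka and ZooKeeper piece, the file-level sync could be as simple as a periodic rsync push to the cold data center. As a sketch only (the paths, hostname, and interval are made up, and in practice a cron-driven shell script would work just as well), here's one way to drive that from a small Java scheduler:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ColdSyncJob {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();

        // Every five minutes, push the local Kafka log directory to the
        // standby data center. Because the backup services are down, we're
        // replicating files, not operations.
        scheduler.scheduleAtFixedRate(() -> {
            try {
                Process rsync = new ProcessBuilder(
                        "rsync", "-az", "--delete",
                        "/var/kafka-logs/",
                        "backup-dc.example.com:/var/kafka-logs/")
                        .inheritIO()
                        .start();
                rsync.waitFor();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }, 0, 5, TimeUnit.MINUTES);
    }
}
```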
ElasticSearch and Coherence are somewhat unique in this model. Ideally we’d like to synchronize operations instead of data, but since the services won’t be running in the backup data center, this isn’t an option. Of course, the easiest solution is to just run them in the backup data center. Then we could use a tool like Kafka to synchronize operations like index rebuilds, cache flushes, etc. If we don’t want to run these services in the backup data center, then we’re somewhat limited. ElasticSearch does provide a snapshot mechanism to back up indices, or we could use the same kind of file-level synchronization we would for Kafka and ZooKeeper. Coherence is problematic because, at least for us, we don’t store cache data on disk. It’s used as an in-memory cache that fronts our database, so our only option is to re-warm every cache in the event the backup center needs to be activated.
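For reference, ElasticSearch snapshots are driven through its REST API. A rough sketch using the low-level Java REST client, where the repository name and filesystem location are placeholders:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class SnapshotSketch {
    public static void main(String[] args) throws Exception {
        RestClient client = RestClient.builder(
                new HttpHost("es1.example.com", 9200, "http")).build();

        // Register a shared-filesystem snapshot repository (placeholder path).
        Request createRepo = new Request("PUT", "/_snapshot/dr_backup");
        createRepo.setJsonEntity(
                "{\"type\": \"fs\", \"settings\": {\"location\": \"/mnt/es_backups\"}}");
        client.performRequest(createRepo);

        // Snapshot all indices; the resulting files can then be shipped to
        // the backup data center like any other data.
        Request takeSnapshot = new Request(
                "PUT", "/_snapshot/dr_backup/snapshot_1?wait_for_completion=true");
        client.performRequest(takeSnapshot);

        client.close();
    }
}
```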
The disaster recovery approach has a couple of advantages. First, data replication is simplified because it’s essentially a one-way push from the active data center to the backup data center. Synchronizing data between two active data sources involves the overhead of deconflicting transactions, ensuring unique IDs, and so on; in terms of a database, this usually means table locks and the associated performance bottlenecks. Another advantage is that if the backup data center is hosted in the cloud, you can save tremendously on costs because you don’t have active services consuming CPU and memory. The backup data center runs a skeleton set of system services and is only activated in the event of a disaster at the primary data center.
Of course this approach has its disadvantages. First and foremost, it’s not responsive to failure. Because services and applications need to be started from scratch, valuable time is spent activating the backup data center, and during that time customers will almost certainly have a degraded or broken experience. Additionally, because data is only synchronized periodically, the backup data center will almost always lag the primary. Another tricky part of this model is what happens after a failover, when the original data center is ready to become the master again. It’s not enough to plan for the one-way sync from primary to backup; you also need a way to push data back for the period during which the backup data center was serving visitors and creating new data.
Shared Database
This model builds on the disaster recovery approach in that each data center still has its own unique set of services and applications, but they are always active and taking traffic. The key is that the applications and microservices in both data centers point back to the same database. The primary database still requires a periodic sync to the backup database, and in the event of a database or primary data center failure, the backup database is activated.
The main advantage of this approach is that we are very close to true site redundancy: each data center can handle live traffic in any ratio we want. It also lets us do some A/B deployments, meaning we can deploy new code into one data center and test it before rolling it out everywhere. This of course makes a lot of assumptions about the data model and backwards compatibility of new software releases, which is a topic entirely unto itself.
Of course, the main downside of this approach is that the database is still a single point of failure. If the backup data center goes offline we don’t lose much aside from scale and capacity, since the database stays online. But if the shared database fails, or the entire primary data center goes offline, we’re back to the same problem as the disaster recovery approach: we’ll have a brief period of downtime while we switch the applications to use the backup database. We will likely also have data integrity issues if the database synchronization is not real time. In addition, we now have to deal with two-way data synchronization, though since the services are active in both data centers, we have a lot more options. For example, we can use a Kafka mirroring tool to replicate Kafka topics instead of relying on a lower-level tool like rsync.
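To give a sense of what topic mirroring involves, here is a stripped-down version of the consume-from-one-cluster, produce-to-the-other loop that mirroring tools implement. The broker addresses and topic name are illustrative; a real tool like Kafka's MirrorMaker also handles offset tracking, partitioning, and failure recovery:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MirrorSketch {
    public static void main(String[] args) {
        // Consume from the cluster in data center 1.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "kafka.dc1.example.com:9092");
        consumerProps.put("group.id", "mirror");
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        // Produce to the cluster in data center 2.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "kafka.dc2.example.com:9092");
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(Collections.singletonList("orders")); // illustrative topic
            while (true) {
                ConsumerRecords<byte[], byte[]> records =
                        consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    // Re-publish each record to the same topic on the other side.
                    producer.send(new ProducerRecord<>("orders",
                            record.key(), record.value()));
                }
            }
        }
    }
}
```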
All in all, it’s a much better solution than the disaster recovery model. This is actually pretty close to our current state today: we already have a single database that periodically synchronizes to an offsite backup database. The part we’re missing is the second data center with applications and services.
Hot-Hot Databases
This is an extension of the shared database model. Instead of a single active database in one data center that is shared across all data centers, each data center uses its own active database. During normal operations, the applications within a data center use only that database. Transactions and changes are synchronized externally to the applications.
This is clearly the most robust of the three models. If a database fails in one data center, the applications in that data center simply begin using another database, just as in the shared database model. If an entire data center fails, or a particular service fails within a data center, you simply stop routing traffic to it and let the other data center absorb the load.
It’s also the most difficult to implement. It’s very easy to make a statement like “the databases will stay synchronized across each data center,” but in reality this is a very complex problem to solve. In the shared database model, the offline or standby data center has no active connections; it can tolerate some latency because nothing is relying on it, and there’s no transaction contention or fear of re-using a unique ID. In a true hot-hot scenario, every database potentially needs to know about changes to every other database to avoid data integrity issues such as re-used unique IDs. There are also additional concerns around structural changes like column modifications, package recompiles, etc.
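As a concrete example of the unique ID problem: one common mitigation is to partition the ID space so each data center can generate IDs independently with no chance of collision. A sketch of the interleaving approach (with Oracle, staggered sequences using different START WITH and INCREMENT BY values achieve the same effect):

```java
import java.util.concurrent.atomic.AtomicLong;

public class DataCenterIdGenerator {
    private static final int NUM_DATA_CENTERS = 2;

    private final int dataCenterIndex;           // 0 for DC1, 1 for DC2
    private final AtomicLong counter = new AtomicLong();

    public DataCenterIdGenerator(int dataCenterIndex) {
        this.dataCenterIndex = dataCenterIndex;
    }

    // Interleave the ID space: DC1 issues 0, 2, 4, ... while DC2 issues
    // 1, 3, 5, ... so both sides can insert rows independently and
    // replicate later without primary key collisions.
    public long nextId() {
        return counter.getAndIncrement() * NUM_DATA_CENTERS + dataCenterIndex;
    }
}
```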
Other Options For Resiliency
For as much time as we’ve spent thinking about how to build a redundant architecture across multiple data centers, there are some additional considerations that might make things easier. One of the hardest pieces of any of these models is data consistency, especially when it comes to Kafka, Coherence, and ElasticSearch. Since any model we choose requires us to keep certain aspects of our clusters in sync (Coherence cache flushes, ElasticSearch document re-indexing, etc.), we’ll likely spend a lot of time handling these services.
To that end, as part of our investigation, we’ll also look at moving some of these critical services into fully hosted cloud services. The idea is to not maintain any of these clusters ourselves, and instead let someone else do it. So instead of having distinct clusters in each data center and worrying about replication, we’d have a single cluster hosted by a third party that all data centers would use. Essentially we would be pushing the onus for fault tolerance and resiliency onto the vendor. Chances are they have a lot more experience with their respective products than we do, so letting them handle the hard work may not be a bad idea.
There are two providers in particular we’re looking at. For Kafka, it looks like CloudKafka is the leading provider of hosted Kafka services, and the de facto cloud ElasticSearch provider is Elastic themselves. For Coherence, we didn’t find a reputable cloud provider, but since we’re looking at using Kafka to synchronize specific cache actions, moving to hosted Kafka is probably sufficient. Recall that ZooKeeper is only used for Kafka consumer information and service registration data. Since we don’t need to duplicate service registration data across data centers, moving to hosted Kafka would also solve our ZooKeeper replication problems.
Of course both of these options sound great, but there is still some cost/benefit analysis to be done. Pushing your problems onto other people can certainly save time, but it comes at an increased cost over doing it yourself. We would also need to be sure that the performance of these two cloud services would meet the needs of both our existing and new AWS data centers.
Clearly we’re still very early on in our move to a multi data center architecture. We still have not decided which of the three models will suit us best, so there is much work to be done. But we feel like we have a good understanding of the pros and cons of each approach, and pretty soon we should be able to make a decision and start building out our new AWS infrastructure, whatever that ends up looking like. We’ll have a lot more to say in the coming weeks and months about our progress and anything we learn along the way.