Haystack: An Extensible Observability Platform from Expedia

Many companies run hundreds of microservices, but what happens when one or more services fail at the same time? To improve observability and quality of service, we need to connect these failure points across the distributed topology and so reduce the mean time to discover and resolve issues. To understand how Expedia implemented tracing and diagnostics by building a solution called Haystack, Ankita K.S. from the EFY Group had a chat with Shreya Sharma, technical product manager, and Keshav Peswani, software engineer, Expedia Inc., during Open Source India 2018. Excerpts follow…

Q How do you think open source software has developed in India over the past few years?

Around 1997 and 1998, Linux was the only key open source project available, but with time, we saw bigger and better companies come up with a lot of open source technologies. For example, at Expedia, we use many of the open source technologies developed by various other companies, as we are in the building mode. India is currently at a great stage, not only as a contributor but also as a creator of open source software solutions. Personally, I see a huge change. Several years ago, when I started working, there were hardly any open source contributors. But today, with many key open source players, the scenario has changed for the better.

Q What are the new technologies that have come up in the recent past and how will they impact the future?

Inventions yield yet more inventions. With the growth of the microservices architecture, the need arose for a centralised system for load balancing, traffic management, routing, health monitoring, etc. This is where a service mesh comes into the picture. A service mesh is a new architectural paradigm that provides these capabilities at the platform level, freeing application writers from those tasks. Istio and Linkerd are the most popular ones.

We’ve also observed growth in container orchestration systems like Kubernetes, Mesos and Marathon, which handle the infrastructure layer in a more efficient manner.

Q What is the main problem with respect to microservices?

As I mentioned before, a service mesh solves most of the problems in the microservices architecture. But there are a few problems, like observability, for which a service mesh doesn’t provide an out-of-the-box solution. What it provides is telemetry data that can be exported to any tracing backend for observability. This is where Haystack comes into the picture. It helps in diagnosing issues across multiple microservices, which is often a difficult and time-consuming task. Hence the problem boils down to finding a small needle in the haystack. The needle is the issue, the root cause or point of failure in your microservices architecture. For example, Expedia has more than 2200 microservices. If there is a problem in any one of them, finding that one problem in the huge set of microservices without distributed tracing is next to impossible.

Q Can you tell us more about Haystack?

Our journey started with logging context in log files, moved on to joining logs using log indexers and trying out Zipkin, and led to writing our own distributed tracing and observability platform.

All these changes were need-driven as we moved from monoliths to the microservices architecture. Haystack, in two simple words, is an observability platform. It helps detect problems in microservices-based applications.

Haystack includes several tools. Traces help visualise a waterfall view of all the downstream microservices called for a single request. Trends help in generating app-level metrics, such as the failure count and duration of each microservice and its operations. The service graph provides a high-level overview, at a glance, of how various microservices interact with each other. We also have the pipes sub-system to send trace data to external sinks such as AWS S3. This helps teams to store Haystack data for a longer duration in a cost-effective manner. Data scientists can also use this data for analysis.
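To make the three views concrete, here is a minimal Python sketch of how a waterfall, trend metrics and a service graph could each be derived from the same list of spans. It is an illustration only, not Haystack's implementation: the Span fields, the function names and the sample services are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical, simplified span record (not Haystack's actual schema).
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    start_ms: int
    duration_ms: int
    error: bool = False


def waterfall(spans):
    """Return (depth, span) rows, parents first, siblings ordered by start time."""
    children = defaultdict(list)
    for s in spans:
        children[s.parent_id].append(s)
    rows = []

    def visit(parent_id, depth):
        for s in sorted(children[parent_id], key=lambda x: x.start_ms):
            rows.append((depth, s))
            visit(s.span_id, depth + 1)

    visit(None, 0)  # root spans have no parent
    return rows


def trends(spans):
    """Aggregate call count, failure count and mean duration per (service, operation)."""
    acc = defaultdict(lambda: {"count": 0, "failures": 0, "total_ms": 0})
    for s in spans:
        m = acc[(s.service, s.operation)]
        m["count"] += 1
        m["failures"] += int(s.error)
        m["total_ms"] += s.duration_ms
    return {
        key: {
            "count": m["count"],
            "failures": m["failures"],
            "mean_ms": m["total_ms"] / m["count"],
        }
        for key, m in acc.items()
    }


def service_graph(spans):
    """Return the set of (caller_service, callee_service) edges."""
    by_id = {s.span_id: s for s in spans}
    edges = set()
    for s in spans:
        parent = by_id.get(s.parent_id)
        if parent is not None and parent.service != s.service:
            edges.add((parent.service, s.service))
    return edges


# Made-up sample trace: one request fanning out to three services.
SPANS = [
    Span("t1", "a", None, "web", "GET /search", 0, 120),
    Span("t1", "b", "a", "search", "query", 10, 80),
    Span("t1", "c", "b", "inventory", "lookup", 20, 50, error=True),
    Span("t1", "d", "a", "ads", "fetch", 95, 15),
]

if __name__ == "__main__":
    for depth, s in waterfall(SPANS):
        indent = "  " * depth
        print(f"{indent}{s.service}:{s.operation} start={s.start_ms}ms dur={s.duration_ms}ms")
```

In this sketch, the failed inventory lookup shows up in all three views: as the deepest row of the waterfall, as a failure count in the trends, and as the end of the web, search, inventory path in the service graph.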

Our system is built on the shoulders of important open source technologies such as Kafka (KStreams), Cassandra, Elasticsearch and Metrictank, with a key focus on extensibility and OpenTracing principles.

Q What were the main reasons behind Expedia coming up with a custom solution?

We wanted to build a platform that was extensible enough to solve our observability needs. Many of the existing systems in that space were not extensible enough for the additional features we needed to build. We also wanted to have a system that followed OpenTracing standards so that the organisation is not tied to a single implementation. OpenTracing compliance also enabled us to support UUIDs (Universally Unique Identifiers) for span ID and trace ID. Finally, this presented us with an opportunity to contribute back to the open source community.
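As a rough illustration of UUID-based identifiers, the Python sketch below shows a span context whose trace ID and span ID are random UUIDs, with child spans inheriting the trace ID. The SpanContext class and its child method are hypothetical names used for this example; they are not the Haystack or OpenTracing client API.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


def new_id() -> str:
    """Generate a random (version 4) UUID as a string."""
    return str(uuid.uuid4())


@dataclass
class SpanContext:
    # Hypothetical context object; field names are assumptions for illustration.
    trace_id: str = field(default_factory=new_id)
    span_id: str = field(default_factory=new_id)
    parent_id: Optional[str] = None

    def child(self) -> "SpanContext":
        """Create a context for a downstream call: same trace, new span ID."""
        return SpanContext(trace_id=self.trace_id, parent_id=self.span_id)


if __name__ == "__main__":
    root = SpanContext()
    downstream = root.child()
    print(root)
    print(downstream)
```

Because every service generates its IDs independently, no central ID allocator is needed; the shared trace ID is what lets a tracing backend stitch the spans of one request together.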

Q What is the future roadmap for Expedia with respect to open source?

We have a new platform called Adaptive Alerting coming up. It is responsible for identifying anomalies in services’ health and triggering alerts – it’s the next open source project by Expedia, and will include new machine learning and neural network based algorithms. Adaptive Alerting will be integrated with Haystack, so that the latter shows anomalies detected by the former platform. As a team, we have grown from the development phase and added a lot of features to various open source projects over two years of dedicated work. Now our focus is on developing our open source community. We really want contributors who can help develop the tool. We also want people who can use it and let us know if they don’t like it; we want to hear about the problems and improve.

Q What is your favourite open source project?

Although there are many open source projects out there that impact and help thousands of developers, we are biased towards Haystack, as we have been working on it since its inception. Haystack is not Expedia's first open source contribution, but it is the first end-to-end platform. This has opened the gates for our talented engineers to contribute back to open source.

Q What attracted you the most at OSI Days 2018?

When we first read about this conference, the first thing we noticed was that it wasn’t attached to any foundation, brand or company but was neutral. When we came here, what impressed us most was the amazing quality of talks, all of which were on cutting-edge technologies.
