We will soon start switching the first ACP clients over to a new search backend. This blog post gives the background on why we introduced a new search backend and what exactly we have done.
The ACP architecture
Most people only ever deal with acpcomposer. But ACP is actually a system of microservices, in which acpcomposer provides the public JSON API. Acpcomposer is not just a JSON endpoint, though: it also contains a lot of business logic. To produce a single content response, acpcomposer has to collect and combine many related pieces of data from different data sources.
To illustrate with an example, let’s look at how a story content is produced. We load the article itself from our Escenic CMS. The article may have some images and even a video, so we load image metadata from the same CMS and video metadata from a video backend. And then there is a link to another article embedded in the article’s body. For this embedded article we have to load the CMS content as well and find metadata for one suitable illustration. All these pieces are then put together in acpcomposer into one content document.
The same happens for a search in acpcomposer. First, we perform a search in Elasticsearch, but acpcomposer only takes IDs from the search response. All titles, author names, dates and so on are loaded from the CMS for each search result. There are two important reasons for this: we want the same transformations and formatting in our search results as in our content responses, and we must always make sure that the truth about publication state (published or not) lives in the CMS. The search index also contains information about publication state, but we have no guarantee that it is in sync with the CMS at all times. So we have to check with the CMS whether each search result item actually is published.
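In pseudocode-like form, the flow could look something like the sketch below. All class and method names here are illustrative, not the actual acpcomposer code; the point is that the index only supplies IDs, and both the presentation data and the authoritative publication state come from the CMS.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of the search flow described above.
public class SearchFlowSketch {

    record CmsArticle(String id, String title, boolean published) {}

    // Stand-in for the Elasticsearch query: returns matching IDs only.
    static List<String> searchIds(String query) {
        return List.of("a1", "a2", "a3");
    }

    // Stand-in for a CMS lookup per result ID.
    static CmsArticle loadFromCms(String id) {
        Map<String, CmsArticle> cms = Map.of(
            "a1", new CmsArticle("a1", "Local election results", true),
            "a2", new CmsArticle("a2", "Draft: unpublished story", false),
            "a3", new CmsArticle("a3", "New bridge opens", true));
        return cms.get(id);
    }

    static List<CmsArticle> search(String query) {
        return searchIds(query).stream()
            .map(SearchFlowSketch::loadFromCms)
            // The CMS, not the index, decides what is published.
            .filter(CmsArticle::published)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // a2 is dropped because the CMS says it is not published.
        search("election").forEach(a -> System.out.println(a.id() + ": " + a.title()));
    }
}
```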
Aligning search with the architecture
At the center of the ACP architecture is the idea of unifying various data sources through connectors, and then letting acpcomposer easily combine pieces of unified data into what ACP users know as content or sectiongrid entities.
A connector’s job is to transform a specific data format into the AtomPub XML format, so that article-like content from different data sources all looks the same to acpcomposer and can therefore be transformed the same way.
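As a rough illustration, a connector might emit an entry along these lines. This is only the standard Atom skeleton with made-up values, not ACP’s actual payload, which may carry additional namespaced elements:

```xml
<entry xmlns="http://www.w3.org/2005/Atom">
  <id>urn:example:article:12345</id>
  <title>New bridge opens</title>
  <updated>2016-05-10T08:30:00Z</updated>
  <author><name>Jane Doe</name></author>
  <content type="text">Article body goes here.</content>
</entry>
```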
The only exception to this architectural idea has been search. Here, acpcomposer accessed our Elasticsearch cluster directly and had to handle Elasticsearch data differently from all the other data. The reason was that we wanted the lowest possible latency between the Elasticsearch cluster and the end consumer. But experience told us that the slow searches we experienced mostly had other causes: things were slow inside of Elasticsearch, not necessarily in the networking between Elasticsearch and its callers.
So now acpsearch will be a proper connector in the ACP family. It is a connector that talks to Elasticsearch and produces AtomPub feeds and entries. This makes it possible to remove a lot of code and libraries from acpcomposer, which now only has to deal with AtomPub.
General support for syndication
Amedia Utvikling handles the data for a large number of publications, and we need to control access across these publications. In other words, we need syndication rules. The foremost reason is the need to implement legal restrictions between publications, but we also need to control access to test publications through search.
In addition to that, we have different syndication requirements for internal and public searches. Since acpcomposer is a public API, we have to use a stricter ruleset there. But journalists should be able to find content from a broader set of publications as well, for internal purposes.
With the old cluster, this was achieved by having different Elasticsearch clients for different purposes, all of which constructed their own queries and handled syndication differently. The Elasticsearch client in acpcomposer was only one of many.
With acpsearch it was easy for us to provide search endpoints for all requirements. Acpcomposer only consumes and exposes the ones that are compatible with public access, and internal searches can now also go through acpsearch instead of separate clients. In acpsearch we now have one microservice where all search logic is implemented, and where we can make sure that all syndication rules are implemented correctly.
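Centralizing the rules in one service could look something like the following sketch. The rulesets, publication names, and the specific restriction (hiding test publications from public searches) are made-up illustrations of the kind of logic described above, not the actual acpsearch implementation.

```java
import java.util.*;

// Hypothetical sketch: one place that narrows the set of searchable
// publications depending on whether the caller is public or internal.
public class SyndicationSketch {

    enum Ruleset { PUBLIC, INTERNAL }

    // Illustrative only: in reality the restricted set would be configured.
    static final Set<String> TEST_PUBLICATIONS = Set.of("testavisen");

    static Set<String> allowedPublications(Ruleset ruleset, Set<String> requested) {
        Set<String> allowed = new HashSet<>(requested);
        if (ruleset == Ruleset.PUBLIC) {
            // Public searches must never expose test publications.
            allowed.removeAll(TEST_PUBLICATIONS);
        }
        return allowed;
    }

    public static void main(String[] args) {
        Set<String> requested = Set.of("avisa-a", "avisa-b", "testavisen");
        System.out.println("public:   " + allowedPublications(Ruleset.PUBLIC, requested));
        System.out.println("internal: " + allowedPublications(Ruleset.INTERNAL, requested));
    }
}
```

Because every caller goes through the same method, a rule fixed here is fixed for all clients at once, which is the maintenance win described above.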
The introduction of a new search component also gave us the opportunity to upgrade Elasticsearch from version 1.x to 2.x. By building a completely new cluster and a new index, we could be sure that the “upgrade” would be safe: switching between the old and new cluster is simply a matter of changing configuration, and we can go back to the old cluster should anything happen.
Relevance of the indexed data
The old indexes contained quite a bit of unused data. When the indexer was originally set up, we could for the most part only guess which searches we would later perform. This led to indexing large documents that contained much more information than we ended up needing. We also indexed types of data that we never searched for.
The new index now only contains data that is relevant for how we use search in ACP.
The indexing engine for the old cluster was meant to be a flexible tool that should be able to index data from any source. And that made sense at the time, because we did not have ACP in today’s form yet.
The indexer became a somewhat overengineered, enterprise-grade piece of software. It is built on Apache Camel and contains routes and transformations for various input sources and types of data. A lot of the data transformation logic is actually very similar to what ACP does. Also, there are more than 300 instances of the indexer in production to cover all publications and their CMSes and archive databases. In short, the indexing software is not easy to maintain.
So, even before we decided on creating a new elasticsearch cluster, we realized that we wanted a new, simpler, more specialized indexer.
The new indexer consumes unified ACP data, removing the need for Camel and for duplicated transformation code. It can also handle all changelogs from all relevant data sources in one installation, so we don’t need several hundred instances. We still run a handful of indexers for redundancy, though.
The new indexer has about one third of the old one’s lines of Java code - not that LOC counts mean all that much in a Java project ;-).
Performance and reliability
The biggest issue we are facing at the moment is timeouts in Elasticsearch. We try to work around this problem in several places, from retry mechanisms to caching mechanisms. But these tools cannot counteract timeouts completely. It’s like covering a manhole with a band-aid.
One reason for this is certainly the constantly increasing number of search-based requests that pile up in ACP. The second reason is that our Elasticsearch cluster is not optimized for how we actually use search in ACP today. The first we cannot change: search is and will remain central in ACP. The latter we decided to address.
Optimizing indexes and sharding
As stated above, the old cluster was set up without knowing what types of searches we were going to execute, and without knowing too much about Elasticsearch itself. That is not an ideal starting point, because as it turned out, it was the index configuration that caused the biggest performance problems.
Each of the several hundred old indexers created its own index and posted its documents to it. At the time of writing, we have 496 indexes in the old cluster. Not all of these are actually active: some were abandoned due to renames, and others are internal indexes not used by ACP. But either way, we are talking about a big number here.
Also, Elasticsearch has a per-index setting for the number of shards, which cannot be changed once the index is created. The default is 5 shards with 1 replica each. We set it to 3 shards with 2 replicas each, a bit lower because we had many relatively small indexes. That amounts to a total of 3 * 3 = 9 shard copies per index. Granted, if a particular search falls within the index’s partitioning criteria, Elasticsearch may possibly only have to access one shard for that search. But for most of our searches we have to access the whole index, which means accessing 3 shards per search per index.
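In Elasticsearch’s create-index API, the configuration described above corresponds to settings like the following (the index name is made up here; `number_of_shards` is fixed at creation time, while `number_of_replicas` can be changed later):

```
PUT /articles
{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 2
  }
}
```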
Another important fact is that we quite often search for content in all or almost all publications.
The final piece of information for understanding how the sum of the above hurt us is that one shard in Elasticsearch is technically one Lucene index, and searching each Lucene index consumes its own threads and CPU on the node that hosts it. If we now run a search across multiple indexes, e.g. for searching in all publications, Elasticsearch will do the following:
First, it has to fan out the search to all involved shards. If we assume the search spans 300 indexes, we start 3 * 300 = 900 (shards-per-index * indexes) Lucene searches, so we have 900 tasks concurrently using CPU on the cluster nodes. Elasticsearch will then wait for all of the shards to finish, rank the collected results, and then fetch the selected documents.
The point here is that our example used 900 threads for a single search request, which obviously is too much, especially if you run many such searches at the same time. And that is what we see all the time now: many cluster nodes running at max load.
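The back-of-the-envelope arithmetic above can be written out as follows. The numbers are the ones from the example; the concurrency figure is an arbitrary illustration.

```java
// Fan-out arithmetic for a multi-index search, as described above.
public class FanOutSketch {

    // One Lucene search per searched shard, per index.
    static int luceneSearchesPerRequest(int shardsPerIndex, int indexes) {
        return shardsPerIndex * indexes;
    }

    public static void main(String[] args) {
        int perRequest = luceneSearchesPerRequest(3, 300);
        System.out.println("Lucene searches per request: " + perRequest); // 900

        // Illustrative: with 10 such searches running in parallel,
        // the cluster juggles 9000 concurrent shard-level searches.
        System.out.println("At 10 parallel requests: " + perRequest * 10);
    }
}
```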
So the only way to fix this was to build a new cluster, with fewer indexes and fewer shards.
The new cluster now has only one big index, with 5 * 3 shard copies, and we need fewer nodes in the cluster.