Intro
We need to identify a chunk of the data processing pipeline that is self-contained enough to migrate to another cluster and will yield an overall cost reduction, at least for a while. Users' callbacks and statistics are the most suitable candidates for that.
Callbacks
This block consists of two parts:
- API endpoint (callback-api)
- data processing service (callback-processor)
The first stage of data processing is pretty straightforward: we receive a callback request (an HTTP POST), convert it into a message-callback-event, and produce other event variations to other Kafka topics:
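The fan-out step above can be sketched roughly as follows. This is a hypothetical illustration, not the actual service code: only the message-callback-event and message-callback-event-parsed topic names come from this document; the payload fields and the `produce` callback are assumptions that stand in for a real Kafka producer.

```python
import json
from typing import Callable

def process_callback(payload: dict, produce: Callable[[str, bytes], None]) -> None:
    """Convert one raw callback payload into Kafka events.

    `produce(topic, value)` abstracts the Kafka producer, so the sketch
    stays broker-agnostic; in the real service this would be a Kafka client.
    """
    # Wrap the raw HTTP POST body into the base event.
    event = {"type": "message-callback-event", "raw": payload}
    produce("message-callback-event", json.dumps(event).encode())

    # Produce a derived variation for downstream consumers
    # (the exact set of derived topics is an assumption here).
    parsed = {"type": "message-callback-event-parsed", "fields": payload}
    produce("message-callback-event-parsed", json.dumps(parsed).encode())

# Usage with an in-memory stand-in for the producer:
sent: list[tuple[str, bytes]] = []
process_callback({"message_id": "42", "status": "delivered"},
                 lambda topic, value: sent.append((topic, value)))
```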
To make quick decisions about subscriber-sources-activities events, callback-processor uses a directory-sources-server instance, which requires a lot of memory to keep everything hot and ready to use.
Statistics
All of the events described above are consumed from Kafka by clickhouse and put into various materialized views. Another part of the statistics service, statistics-listener, also listens to message-callback-event-parsed and serializes this data into Prometheus metrics, which are then scraped by Signoz.
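Conceptually, the statistics-listener step amounts to aggregating parsed events into counters and rendering them in the Prometheus text exposition format for scraping. The sketch below shows the idea with stdlib only; the metric name, label, and event fields are assumptions, not the service's real schema.

```python
from collections import Counter

def render_metrics(events: list[dict]) -> str:
    """Aggregate parsed callback events into Prometheus exposition text.

    The metric name `callback_events_total` and the `status` label are
    hypothetical; a real listener would use a Prometheus client library.
    """
    counts = Counter(e.get("status", "unknown") for e in events)
    lines = [
        "# HELP callback_events_total Parsed callback events by status.",
        "# TYPE callback_events_total counter",
    ]
    for status, n in sorted(counts.items()):
        lines.append(f'callback_events_total{{status="{status}"}} {n}')
    return "\n".join(lines) + "\n"

# Example: three parsed events collapse into two labeled counters.
output = render_metrics([
    {"status": "delivered"},
    {"status": "delivered"},
    {"status": "failed"},
])
```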
Subscriber reach detection
directory-reach-detector works in a pair with directory-reach-extractor. The reach detector listens to message-callback-event (unparsed) and uses it to identify users who receive messages; those users are excluded from offline detection, and we continue sending them messages. Since we no longer rely on this feature, we can safely ignore this part of the pipeline and not touch it at all.
Workflow
Some workflow tasks consist of parts that heavily rely on clickhouse data that is only accessible via statistics-api. To preserve this functionality, we need to keep statistics-workflows connected to our Aurora cluster so it can still retrieve and transfer data within a workflow pipeline.
Migration plan
It all comes down to traffic and how we redirect it to the new cluster. Since we might run into trouble along the way, we need a way to do this smoothly. With nginx servers running in front and the callback-api endpoint pointed at them, we can set up traffic mirroring and then switch over to the new cluster once we are sure everything works as expected. The worst-case scenario is that our nginx servers cannot handle this much traffic, forcing us to switch the callback-api endpoint back to the old cluster; the process must therefore be flexible enough to handle that situation (and we must make sure no caching interferes with the switch).
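The mirroring setup described above could look roughly like this in nginx, using the stock `mirror` directive (ngx_http_mirror_module). This is a sketch under assumed hostnames and ports, not our actual config: the `mirror` directive copies each request to the new cluster, while clients keep getting responses from the old one, so the new pipeline can be validated under real traffic before the switch.

```nginx
upstream old_cluster { server callback-api.old.internal:8080; }  # placeholder hosts
upstream new_cluster { server callback-api.new.internal:8080; }

server {
    listen 80;

    location / {
        mirror /mirror;                  # duplicate each request to the new cluster
        mirror_request_body on;          # include POST bodies in mirrored requests
        proxy_pass http://old_cluster;   # clients are still served by the old cluster
    }

    location = /mirror {
        internal;                        # not reachable from outside
        proxy_pass http://new_cluster$request_uri;
    }
}
```

Switching over then becomes a matter of swapping `proxy_pass` targets (and, for rollback, swapping them back), which keeps the process reversible.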
Recap
In order to migrate callbacks and statistics to another cluster, we need to move the following services:
As well as all of these prerequisites:
- clickhouse and Kafka running on the new cluster
- Aurora cluster to be accessible from the new cluster
- traffic mirroring prepared, with its management flexible enough to handle any issues that might occur during the migration