Problem

Daily subscription counts in the statistics have been falling since 2024-05-30. The cause is not yet known, but only the statistics appear to be affected; the subscribers themselves still seem to be fine.

Investigation

The statistics are definitely losing data. I pulled daily totals from subscription_events, and the problem clearly started on 2024-05-30.

select toDate(timestamp) as date, sum(subscribe_count), sum(unsubscribe_count) from subscription_events where date > '2024-05-25' group by date;
┌───────date─┬─sum(subscribe_count)─┬─sum(unsubscribe_count)─┐
│ 2024-05-26 │                71776 │                  51958 │
│ 2024-05-27 │                70035 │                  60527 │
│ 2024-05-28 │                66456 │                  46801 │
│ 2024-05-29 │                67245 │                  46755 │
│ 2024-05-30 │                65771 │                  45128 │
│ 2024-05-31 │                52411 │                  35305 │
│ 2024-06-01 │                43668 │                  27825 │
│ 2024-06-02 │                46025 │                  28777 │
│ 2024-06-03 │                41321 │                  28881 │
│ 2024-06-04 │                40939 │                  28778 │
│ 2024-06-05 │                30068 │                  19112 │
└────────────┴──────────────────────┴────────────────────────┘

At the same time, I see a much larger number of new subscribers in the system.

>>> subs = Subscriber.objects.filter(subscribed_date__gte='2024-06-05')
>>> len([s for s in subs if bool(s.token)])
48796
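
To put the two numbers side by side: the ClickHouse output for 2024-06-05 reports 30068 subscribe events, while the ORM query above finds 48796 subscribers with a token. The following is a minimal sketch of that comparison. It assumes that both figures cover the same (still incomplete) day and that every subscriber with a token should have produced exactly one subscribe event; neither assumption was verified here, and the function name is illustrative only.

```python
# Figures are copied from the queries above; missing_share is a hypothetical helper.
def missing_share(db_count: int, stats_count: int) -> float:
    """Return the fraction of database records with no matching statistics event."""
    if db_count <= 0:
        raise ValueError("db_count must be positive")
    return (db_count - stats_count) / db_count


db_subscribers = 48796     # subscribers with a token since 2024-06-05 (ORM query)
clickhouse_events = 30068  # sum(subscribe_count) for 2024-06-05 (ClickHouse query)

gap = db_subscribers - clickhouse_events
share = missing_share(db_subscribers, clickhouse_events)
print(f"missing events: {gap} ({share:.1%})")
```

Under those assumptions, roughly 38% of today's subscriptions never reached the statistics.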

It is also not related to the unsubscription process: the number of unsubscribed users has decreased as well, as the previous query shows.

I checked how many users we unsubscribed today:

>>> len([s for s in subs if s.unsubscribed_date is not None])
9885

This did not provide any useful insight either.

At the moment, the only component that might be affected is the Kafka library that produces the events.

    def _register_subscription_event(self) -> None:
        kafka_producer = Producer(SubscriptionEvent)
        event = SubscriberEventsFactory.make_subscribe_event(self.subscriber)
        kafka_producer.send(event, async_mode=False)

Perhaps async_mode=False is not working as expected. I can try switching it to True, followed by a kafka_producer.poll() call.
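
One way to test the theory that events are not fully produced is to count delivery reports rather than trust the send call. The sketch below assumes the internal Producer wraps a client with a confluent-kafka-style interface (produce(topic, value=..., callback=...), poll(), flush()); that is an assumption about our wrapper, and DeliveryCounter and send_all are hypothetical names used only for illustration.

```python
from typing import Any, Iterable, Optional


class DeliveryCounter:
    """Delivery callback that tallies acknowledged and failed messages."""

    def __init__(self) -> None:
        self.delivered = 0
        self.failed = 0

    def __call__(self, err: Optional[Any], msg: Any) -> None:
        # confluent-kafka style: err is None when the broker acknowledged the message.
        if err is None:
            self.delivered += 1
        else:
            self.failed += 1


def send_all(producer: Any, topic: str, payloads: Iterable[bytes]) -> DeliveryCounter:
    """Produce every payload and block until all delivery reports have arrived."""
    counter = DeliveryCounter()
    for payload in payloads:
        producer.produce(topic, value=payload, callback=counter)
        producer.poll(0)  # serve delivery callbacks that are already queued
    producer.flush()  # wait for outstanding messages and their callbacks
    return counter
```

If counter.delivered matches the number of subscribers created in the same window and counter.failed stays at zero, the producer side is probably not where the events are lost.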

I also restarted the service to rule out any issues with the Kafka library cache. Tomorrow's results will show whether it helped.

Possible causes

  • Kafka library cache issue
  • Not all subscriber events are being produced
  • Schema registry rejects some of the events (subscribers_age field?)

Solution

None of the above helped. The problem turned out to be a ClickHouse issue: after a ClickHouse restart, the statistics started returning to normal. We should wait another day to confirm that the issue is gone.