Look for subscribers without tokens

from apps.subscribers.models import Subscriber
 
query = Subscriber.objects.filter(subscribed_date__date='2024-08-06', token__isnull=True)
print(query.count())

It turned out that there are 16681 users without tokens. This is a significant number, and it may be the reason for the drop in the number of subscriptions and all related metrics.

Let’s find which accounts are causing the problem

from django.db.models import Count

query = query.values('firebase_app').annotate(count=Count('id'))
result = list(query)
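
To see which apps account for most of the tokenless subscribers, the grouped rows can be sorted by count. This is a minimal sketch, assuming `result` is a list of dicts with `firebase_app` and `count` keys. The helper name `rank_apps` and the sample rows are hypothetical placeholders, not real data.

```python
def rank_apps(rows):
    """Sort grouped rows by count (descending) and add each app's share of the total."""
    total = sum(row['count'] for row in rows)
    ranked = sorted(rows, key=lambda row: row['count'], reverse=True)
    return [
        {**row, 'share': row['count'] / total if total else 0.0}
        for row in ranked
    ]


# Hypothetical placeholder rows, only to show the shape of the data
sample = [
    {'firebase_app': 'app-a', 'count': 10},
    {'firebase_app': 'app-b', 'count': 30},
]
for row in rank_apps(sample):
    print(row['firebase_app'], row['count'], f"{row['share']:.1%}")
```

In the shell, `rank_apps(result)` would be called instead of using the sample rows. The ordering could probably also be pushed to the database by appending `.order_by('-count')` to the queryset.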

Find out why these users can't be subscribed

First, we want to check if there are any error codes for these users by checking their statuses:

from collections import defaultdict

# Count inactive reasons and unsubscribe dates for tokenless subscribers
errors = defaultdict(int)
unsub_dates = defaultdict(int)
query = Subscriber.objects.filter(subscribed_date__date__gt='2024-07-25', token__isnull=True)
for s in query:
    if s.inactive_reason:
        errors[s.inactive_reason] += 1
    if s.unsubscribed_date:
        unsub_dates[s.unsubscribed_date.strftime('%Y-%m-%d')] += 1

The output is the following:

>>> errors
defaultdict(<class 'int'>, {'invalid_sub_data': 77234})
>>> unsub_dates
defaultdict(<class 'int'>, {'2024-08-07': 7959, '2024-07-27': 417, '2024-07-28': 761, '2024-07-29': 1811, '2024-07-30': 6671, '2024-07-31': 6324, '2024-08-01': 6390, '2024-08-02': 6768, '2024-08-03': 7324, '2024-08-04': 7283, '2024-08-05': 8845, '2024-08-06': 16681})

The only thing we can say for sure at this point is that all these users were processed by the subscribers tasks service. The reason why they were not granted new tokens is buried somewhere in the assistant service.
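
The two dictionaries can be cross-checked against each other. Summing the per-day counts gives the same total as the error count, which is consistent with (though it does not strictly prove) every affected subscriber having both an `inactive_reason` and an `unsubscribed_date`. A small sketch using the figures printed above:

```python
# Figures copied from the shell output above
errors = {'invalid_sub_data': 77234}
unsub_dates = {
    '2024-08-07': 7959, '2024-07-27': 417, '2024-07-28': 761,
    '2024-07-29': 1811, '2024-07-30': 6671, '2024-07-31': 6324,
    '2024-08-01': 6390, '2024-08-02': 6768, '2024-08-03': 7324,
    '2024-08-04': 7283, '2024-08-05': 8845, '2024-08-06': 16681,
}

# ISO-formatted date strings sort chronologically
for day in sorted(unsub_dates):
    print(day, unsub_dates[day])

total_unsubscribed = sum(unsub_dates.values())
total_errors = sum(errors.values())
print(total_unsubscribed, total_errors, total_unsubscribed == total_errors)
# prints: 77234 77234 True
```

Sorting by date also makes the trend easier to read: the daily count jumps on 2024-07-30 and more than doubles again on 2024-08-06.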

The root cause

After a detailed investigation on the assistant service side, I found 501 errors in the logs. The root cause of these errors is still unknown, but manually resubscribing the affected users mitigated the issue. For some reason, the assistant service did not handle these errors properly, even though it was supposed to (it has a lock mechanism meant to reduce the number of parallel requests during service downtime or whatever was happening there).

Solution

The issue was related to the old if-elif-else logic, which was difficult to maintain and understand. It was solved with some quick refactoring that made it easier to navigate the different cases and handle each of them properly. All 501 errors now trigger a retry mechanism, which tries to resubscribe the user after a certain period of time (postponed tasks).
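
The refactored code itself is not shown in this note. As an illustration only, one common way to replace an if-elif-else chain is a dispatch table that maps each status code to a handler. Every name below (`PostponedTasks`, `schedule_retry`, `handle_response`, `HANDLERS`, `RETRY_DELAY_SECONDS`) is hypothetical and does not come from the actual assistant service.

```python
from dataclasses import dataclass, field

RETRY_DELAY_SECONDS = 300  # hypothetical delay; the real value is not given in this note


@dataclass
class PostponedTasks:
    """Stand-in for a task queue: it only records what would be scheduled."""
    scheduled: list = field(default_factory=list)

    def schedule_retry(self, subscriber_id, delay):
        self.scheduled.append((subscriber_id, delay))


def on_success(subscriber_id, queue):
    return 'subscribed'


def on_501(subscriber_id, queue):
    # Postpone a resubscribe attempt instead of giving up on the user
    queue.schedule_retry(subscriber_id, RETRY_DELAY_SECONDS)
    return 'retry_scheduled'


def on_unknown(subscriber_id, queue):
    return 'failed'


# One entry per case replaces one branch of the old if-elif-else chain
HANDLERS = {200: on_success, 501: on_501}


def handle_response(status_code, subscriber_id, queue):
    handler = HANDLERS.get(status_code, on_unknown)
    return handler(subscriber_id, queue)


queue = PostponedTasks()
print(handle_response(501, 42, queue))  # prints: retry_scheduled
print(queue.scheduled)                  # prints: [(42, 300)]
```

With this layout, adding a new case means adding one handler and one table entry, and unhandled codes fall through to an explicit default rather than being silently skipped.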