Problem

I need to perform a query that will join data from PostgreSQL and ClickHouse databases. There is a subscribers activity table in Clickhouse that contains various metrics for each subscriber in case if there is an option activated for a corresponding channel id.

I want to aggregate this data with another table of subscribers from PostgreSQL to find out if there are some correlations between subscribers age and their activity (like delivery count, click count, etc).

Because of the old version of ClickHouse, I can’t use the PostgreSQL engine. This is something i will utilize in the future. As for now i will stick to the basic approach where i will insert all my data from CSV files in order to merge it later.

Create a proxy table in ClickHouse

CREATE TABLE subscribers_subscriber
(
    id UInt64,
    subscribed_date Date
) ENGINE = PostgreSQL('webpush-worker-rds-prd.cluster-ro-cmpslvys4mms.us-east-2.rds.amazonaws.com:5432', 'webpush', 'subscribers_subscriber', 'webpushuser', '#`;[_<7{&rUGQ>zs')

Create subscribers_subscriber table

CREATE TABLE subscribers_subscriber
(
    id Int64,
    subscribed_date Date
) engine = MergeTree() ORDER BY id

Insert data from CSV file

clickhouse-client --host=127.0.0.1 --port=9000 --user=default --database=statistics --query "INSERT INTO subscribers_subscriber FORMAT CSVWithNames" < /tmp/15050_subscribers.csv

Join two tables

SELECT date, age_range, sum(delivers) as delivers, sum(clicks) as clicks FROM (
    SELECT sa.date as date, dateDiff('day', ss.subscribed_date, sa.date) as age, sa.delivers, sa.clicks, CASE 
        WHEN age >= 0 AND age < 5 THEN '0-5'
        WHEN age >= 5 AND age < 10 THEN '5-10'
        WHEN age >= 10 AND age < 15 THEN '10-15'
        WHEN age >= 15 AND age < 20 THEN '15-20'
        WHEN age >= 20 AND age < 25 THEN '20-25'
        WHEN age >= 25 AND age < 30 THEN '25-30'
        WHEN age >= 30 AND age < 35 THEN '30-35'
        WHEN age >= 35 AND age < 40 THEN '35-40'
        WHEN age >= 40 AND age < 45 THEN '40-45'
        WHEN age >= 45 AND age < 50 THEN '45-50'
        WHEN age >= 50 AND age < 55 THEN '50-55'
        WHEN age >= 55 AND age < 60 THEN '55-60'
        WHEN age >= 60 AND age < 90 THEN '60-90'
        ELSE '90+'
    END AS age_range FROM (
        SELECT toDate(timestamp) as date, subscriber_id, sum(delivered_count) as delivers, sum(clicked_count) as clicks FROM subscribers_activities WHERE date >= '2024-07-06' and channel_id = 15050 group by subscriber_id, date
    ) sa JOIN subscribers_subscriber ss ON sa.subscriber_id = ss.id
) GROUP BY date, age_range ORDER BY date INTO OUTFILE '/tmp/15050_age_statistics_2.csv' FORMAT CSVWithNames;
SELECT * FROM subscribers_activities sa JOIN subscribers_subscriber ss ON sa.subscriber_id = ss.id WHERE toDate(sa.timestamp) > '2024-08-05' and sa.channel_id = 29057;