10 min read Part 8 of 10

Part 8: Materialized Views

Real-time analytics made easy with ClickHouse Materialized Views.

#ClickHouse #Materialized Views #Real-time #AggregatingMergeTree
Part 8: Materialized Views

Materialized Views in ClickHouse are one of its most misunderstood features. If you come from PostgreSQL or Oracle, you expect a materialized view to be a cached snapshot that you periodically refresh. In ClickHouse, a Materialized View is a write trigger. It is a real-time data pipe that processes data on the fly. Let’s see how they work.

The ClickHouse Paradigm: Views as Triggers

In PostgreSQL, when you run REFRESH MATERIALIZED VIEW, the database runs the view’s query from scratch, scans the source table, and overwrites the target table.

ClickHouse does not do this.

In ClickHouse, a Materialized View (MV) is a trigger that hooks into the ingestion stream of a source table.

  1. When a block of data is inserted into the Source Table, ClickHouse intercepts it in memory.
  2. It runs the MV’s query only on that inserted block.
  3. It writes the resulting transformed or aggregated block directly to a Target Table.
  4. The source table insert then completes.

This means the Materialized View never reads the source table on disk during writes. It is a pure, zero-read-overhead streaming ETL pipe.

ELI5: What is a Materialized View? Think of it like keeping a running total of your lemonade sales on a whiteboard, updating it with every sale, instead of counting 5,000 physical paper receipts from scratch at the end of the day. Read ELI5: Materialized Views for the analogy.


State-Based Aggregations: State and Merge Suffixes

If you want to count unique users per day, you can’t just sum up the counts. If Alice visits your site at 9:00 AM and again at 5:00 PM, she is 1 unique user. If you pre-aggregate hourly, she will count as 1 in the 9:00 AM bucket and 1 in the 5:00 PM bucket. Summing those up gives you 2, which is incorrect.

To solve this, ClickHouse stores the intermediate aggregation state rather than the final number.

For HyperLogLog, ClickHouse doesn’t write the integer 42 to the aggregated table; it writes the physical HLL sketch (a binary blob containing the state of the hash register).

To do this, we use two special suffixes:

  • -State: Used when inserting data via the MV. It converts a value into its aggregate state.
  • -Merge: Used when querying the target table. It combines the intermediate states to calculate the final aggregate.

Code Walkthrough: Real-time Page View Analytics

Let’s build a real-time aggregation pipeline to calculate hourly page views and unique users.

Step 1: Create the Source Table

This table stores raw clickstream events.

CREATE TABLE page_views (
    timestamp DateTime,
    user_id UInt64,
    url String
) ENGINE = MergeTree()
ORDER BY timestamp;

Step 2: Create the Target Table

This table holds the aggregated data. We use the AggregatingMergeTree engine. Notice the data types.

CREATE TABLE hourly_stats_target (
    hour DateTime,
    total_views UInt64,
    -- We store the intermediate unique user state, not a flat number
    unique_users AggregateFunction(uniq, UInt64)
) ENGINE = AggregatingMergeTree()
ORDER BY hour;

Step 3: Create the Materialized View Bridge

We use uniqState to feed the target table’s AggregateFunction column.

CREATE MATERIALIZED VIEW hourly_stats_mv TO hourly_stats_target AS
SELECT
    toStartOfHour(timestamp) AS hour,
    count() AS total_views,
    uniqState(user_id) AS unique_users
FROM page_views
GROUP BY hour;

Step 4: Insert Raw Data

Let’s insert some mock data:

INSERT INTO page_views VALUES 
('2023-10-08 12:05:00', 1, '/home'),
('2023-10-08 12:10:00', 1, '/about'), -- Same user
('2023-10-08 12:15:00', 2, '/home'),
('2023-10-08 13:02:00', 3, '/home');  -- Next hour

Step 5: Querying the Aggregated Data

When querying the target table, you must use uniqMerge on the state column:

SELECT
    hour,
    sum(total_views) AS total_views,
    uniqMerge(unique_users) AS unique_users
FROM hourly_stats_target
GROUP BY hour
ORDER BY hour;

Output:

┌─hour────────────────┬─total_views─┬─unique_users─┐
│ 2023-10-08 12:00:00 │           3 │            2 │
│ 2023-10-08 13:00:00 │           1 │            1 │
└─────────────────────┴─────────────┴──────────────┘

Even if background merges have combined rows in the target table, uniqMerge will mathematically merge the HyperLogLog registers correctly, yielding the true count of 2 unique users for the 12:00 PM hour.


Why POPULATE is Dangerous (and How to Backfill Safely)

When creating a materialized view, you might be tempted to use the POPULATE keyword:

-- DANGEROUS IN PRODUCTION
CREATE MATERIALIZED VIEW my_mv POPULATE TO my_target AS SELECT ...

The Problem:

POPULATE tells ClickHouse to scan the entire historical contents of the source table and insert them into the target table during creation.

  • This operation blocks any new incoming inserts to the source table.
  • If your source table has billions of rows, the creation query will run for hours, consume all your RAM, and likely fail or lock up your database.
  • Any data inserted into the source table while the populate query is running might not be processed, leading to data loss.

Tiger King POPULATE meme POV: You clicked run on a CREATE MATERIALIZED VIEW POPULATE statement on a production table with 5 billion rows.

The Safe Backfill Pattern:

Instead of POPULATE, create the view normally (it will start processing new incoming inserts from that second forward). Then, backfill the historical data manually in partitioned chunks using INSERT INTO ... SELECT:

-- Backfill historical data in chunks (e.g., month by month)
INSERT INTO hourly_stats_target
SELECT
    toStartOfHour(timestamp) AS hour,
    count() AS total_views,
    uniqState(user_id) AS unique_users
FROM page_views
WHERE timestamp < '2023-10-08 12:00:00'
GROUP BY hour;

This keeps the transaction lock-free and gives you full control over resource consumption. For further reference, read the ClickHouse Materialized Views Reference.

In Part 9, we’ll dive into performance tuning, skip indexes, and compressing files to speed up queries.