Part 4: Data Modeling
Best practices for schema design, primary keys, and denormalization in ClickHouse.
If you try to design a ClickHouse schema using the Third Normal Form (3NF) you learned in university, you are going to have a bad time. OLAP databases hate joins. They don’t want clean, separated entities. They want wide, flat, ugly, denormalized tables. Let’s look at how to model data for maximum query speed.
The Cost of Joins at Scale
In a traditional relational database (like PostgreSQL), you split data to avoid redundancy. You have an orders table, a customers table, and a products table. When you run a report, you join them:
-- Great in Postgres. Slow in ClickHouse at scale.
SELECT c.country, SUM(o.total_amount)
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.country;
In a distributed OLAP database, a JOIN is an incredibly expensive operation. The database engine has to copy the customers table across the network to every single node in the cluster (a Broadcast Join) or shuffle the data from both tables across the network based on the join key (a Hash Shuffle Join). Network latency and serialization will choke your queries.
ClickHouse’s design philosophy is simple: disk space is cheap, CPU cycles are cheap, but network and random memory access are slow.
Therefore, you should denormalize your tables. Put the customer’s country, name, and product category directly into the orders table at ingestion time. Yes, you will write duplicate strings millions of times, but ClickHouse’s compression will compress those duplicate values down to next to nothing, and your queries will scan a single table without touching the network.
Data engineers looking at a single 200-column flat table while their normalization textbook screams in horror.
Normalized vs. Denormalized: E-Commerce Case Study
Let’s design a schema for an e-commerce store tracking orders.
The Normalized Way (Anti-Pattern)
If you build this with separate tables:
orders:id,timestamp,customer_id,product_id,quantity,pricecustomers:id,name,email,country,marketing_segmentproducts:id,name,category,sku
To find revenue by country and product category, you have to run a 3-way join.
The ClickHouse Way (Flat Table Pattern)
We merge everything into a single, wide table. Save this DDL:
CREATE TABLE orders_flat (
order_id UInt64,
timestamp DateTime CODEC(DoubleDelta, LZ4),
customer_id UInt64,
customer_name String,
customer_country LowCardinality(String),
customer_segment LowCardinality(String),
product_id UInt64,
product_sku LowCardinality(String),
product_category LowCardinality(String),
quantity UInt16,
price Float64 CODEC(Gorilla, ZSTD),
revenue Float64 CODEC(Gorilla, ZSTD)
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (customer_country, product_category, timestamp);
For more schema patterns, consult the official ClickHouse Schema Design Guide.
ClickHouse Specific Data Types (The Optimization details)
Look closely at the data types in our DDL. We didn’t just use standard String or Float64 everywhere. Here is why:
1. LowCardinality(T)
When you have a column with a small number of unique values (like country, product_category, or customer_segment), wrap it in LowCardinality().
Instead of writing the full string "United States" a million times, ClickHouse creates a dictionary under the hood:
- It maps
"United States"to1,"Canada"to2, etc. - It writes the compact integer keys to disk.
- It only stores the string dictionary once.
This reduces the storage footprint by up to 90% and speeds up queries because comparing integers is much faster for a CPU than comparing strings. As a rule of thumb, use LowCardinality for any string column with less than 10,000 unique values.
2. Avoid Nullable Like the Plague
In PostgreSQL, you mark columns as NULL if they don’t have a value. In ClickHouse, avoid Nullable whenever possible.
When you define a column as Nullable(String), ClickHouse physically creates an extra file on disk called [column].null.
- This file contains a byte array of
1s and0s to keep track of which rows are null. - During queries, ClickHouse has to read this extra file, which degrades scan performance.
The Alternative: Use default sentinel values. Instead of a null string, use an empty string ''. Instead of a null integer, use -1 or 0. Set your column default:
-- Better than Nullable
parent_id Int64 DEFAULT -1
3. Compression Codecs
We explicitly added codecs to our DDL:
timestamp DateTime CODEC(DoubleDelta, LZ4):DoubleDeltacomputes the difference between adjacent timestamps. Since timestamps are sorted, the differences are tiny, making them compress incredibly well.price Float64 CODEC(Gorilla, ZSTD):Gorillais a specialized codec for floating-point values that exploits XOR patterns in consecutive floats.
Selecting the Sorting Key (ORDER BY)
Your primary sorting key (defined in ORDER BY) is the single most important decision you make.
In our table, we chose ORDER BY (customer_country, product_category, timestamp).
- Place the columns you filter by most frequently in the key.
- Put low-cardinality columns first. Since we put
customer_countryfirst, ClickHouse sorts the rows on disk by country first, then by category, then by timestamp. This maximizes compression and allows the sparse index to skip massive sections of disk. - Do not put high-cardinality columns (like
order_idor uniquetimestampvalues) at the front of your key unless you want a slow sparse index that eats up all your RAM.
Now that we have designed our flat table, we need to load data into it. In the next part, we’ll discuss the golden rules of ingestion and how to stream data from Kafka and S3.