Part 1: Introduction & Architecture
What is ClickHouse, why is it so fast, and how does it work under the hood?
Welcome to the ultimate ClickHouse guide. If you’ve ever watched a PostgreSQL query choke on 50 million rows while your CPU fans screamed in agony, you’re in the right place. ClickHouse doesn’t just run analytical queries faster; it runs them at speeds that feel like a cheat code. Let’s look under the hood.
The “Oh Crap, My Query is Still Running” Problem
We’ve all been there. You write a simple SELECT category, SUM(price), COUNT(*) FROM orders GROUP BY category on your transactional DB. It works fine in staging. Then you deploy to production, hit 100 million rows, and the database engine decides to go on a coffee break.
Why? Because traditional databases like PostgreSQL, MySQL, and Oracle are Row-Oriented (OLTP). They are built to handle transactions—like updating a user’s password, inserting a shopping cart item, or fetching a single profile. They store data on disk as complete rows.
ELI5: What is OLAP vs. OLTP, and why does it matter? Think of it like the difference between a supermarket cashier scanning a bar of soap and a regional manager calculating annual revenue trends across 500 stores. See ELI5: OLAP for the full breakdown.
Enter Column-Oriented Storage
ClickHouse is an open-source, column-oriented database management system (DBMS) designed for Online Analytical Processing (OLAP). Instead of organizing data by rows, it stores each column in its own physical files on disk.
Let’s look at how this changes the game:
Row-Oriented Storage (Postgres/MySQL)
On disk, a row-oriented table looks like this:
Row 1: [ID: 1, User: "Alice", Age: 30, Country: "US", Balance: 150.00]
Row 2: [ID: 2, User: "Bob", Age: 25, Country: "CA", Balance: 80.00]
Row 3: [ID: 3, User: "Charlie", Age: 35, Country: "US", Balance: 210.00]
If you want to calculate the average age of all users, the database engine has to read every single byte of Rows 1, 2, and 3 from the disk into memory, extract the Age column, and discard the rest. That is a massive amount of wasted disk I/O.
Column-Oriented Storage (ClickHouse)
On disk, ClickHouse stores the same data like this:
File 1 (ID): [1, 2, 3]
File 2 (User): ["Alice", "Bob", "Charlie"]
File 3 (Age): [30, 25, 35]
File 4 (Country): ["US", "CA", "US"]
File 5 (Balance): [150.00, 80.00, 210.00]
Now, when you run SELECT AVG(Age) FROM users, ClickHouse bypasses Files 1, 2, 4, and 5 entirely. It opens File 3, streams the integers straight to the CPU, and returns the result. You read exactly what you need—nothing more.
Data engineers when they discover columnar storage formats and say goodbye to waiting 40 minutes for simple aggregates.
ELI5: How does columnar storage work? Imagine tearing the “Age” column out of a spreadsheet and throwing the rest of the sheet in the shredder. Check out ELI5: Columnar Storage for a funny analogy.
Why ClickHouse is Blazing Fast
Columnar storage is just the entry ticket. Plenty of databases do columnar storage (Redshift, BigQuery, Snowflake), but ClickHouse is notoriously faster when running on single-node or bare-metal setups. Here is the secret sauce:
1. Vectorized Query Execution (SIMD)
ClickHouse doesn’t just read data in columns; it processes it in vectors (blocks of rows). Traditional databases process data row-by-row (using the Volcano iterator model). ClickHouse leverages modern CPU architectures using SIMD (Single Instruction, Multiple Data) instructions.
Instead of asking the CPU to add two numbers, ClickHouse loads a whole array (vector) of numbers into the CPU’s SSE or AVX registers and adds them in a single instruction cycle. For more on this hardware acceleration, check out the Official ClickHouse CPU Optimization Docs.
2. Extreme Data Compression
Because columns contain data of the same type, they compress incredibly well. In the Country file above, you have repeating values like "US", "CA", "US". ClickHouse applies specialized codecs like DoubleDelta, Gorilla, or standard ZSTD to shrink these files to a fraction of their original size. Smaller files on disk mean less physical data has to be read into memory, which speeds up queries.
Read the ClickHouse Compression Documentation to learn how compression shapes query speed.
3. Parallel Processing
ClickHouse is designed to consume every single CPU core you throw at it. When you run a query, ClickHouse breaks the columnar data blocks into chunks and processes them in parallel across all available cores. If you have a 64-core server, ClickHouse will happily max out all 64 cores to finish your query in milliseconds.
4. No Garbage Collection (C++)
Unlike Java or Scala-based query engines (like Presto, Trino, or Apache Spark) which run on the JVM and are subject to garbage collection pauses, ClickHouse is written in highly optimized C++. There is no garbage collector to freeze your query because the JVM decided to clean up memory.
The Trade-Offs (Or: When NOT to use ClickHouse)
ClickHouse is a race car. It is built to go extremely fast in a straight line on a track. But don’t try to use it to move furniture. Here are the things ClickHouse is terrible at:
- Transactional Workloads (OLTP): If you need to support millions of users updating their profile photos or checking out shopping carts, do not use ClickHouse. It does not support fast, single-row updates or deletes. Updates in ClickHouse are expensive background mutations.
- High-Concurrency Point Lookups: ClickHouse is built for heavy queries scanning billions of rows. If you have 50,000 users per second querying
SELECT * FROM users WHERE id = 12345, use Redis or Postgres. ClickHouse will choke because each query expects to utilize multiple cores. - No Real Joins at Scale (Mostly): While ClickHouse supports SQL
JOINs, joining massive tables requires loading the right-side table into memory. In a distributed cluster, large joins can quickly run out of memory. The ClickHouse way is to design denormalized schemas.
| Feature | ClickHouse (OLAP) | PostgreSQL (OLTP) |
|---|---|---|
| Storage Layout | Column-oriented | Row-oriented |
| Write Pattern | Large batch inserts (10k+ rows) | Single-row transactions (ACID) |
| Query Target | Aggregations over billions of rows | Point updates and fetches by ID |
| Compression | High (often 5x to 10x) | Low to Medium |
| CPU Usage | Maxes out all cores | Single thread per query (mostly) |
What’s Next?
Now that you understand the architecture, it’s time to build a playground. In the next part, we’ll install ClickHouse natively and via Docker, and configure our system settings to prevent the dreaded out-of-memory crashes.