10 min read Part 1 of 10

Part 1: Introduction & Architecture

What is ClickHouse, why is it so fast, and how does it work under the hood?

#ClickHouse #Architecture #OLAP
Part 1: Introduction & Architecture

Welcome to the ultimate ClickHouse guide. If you’ve ever watched a PostgreSQL query choke on 50 million rows while your CPU fans screamed in agony, you’re in the right place. ClickHouse doesn’t just run analytical queries faster; it runs them at speeds that feel like a cheat code. Let’s look under the hood.

The “Oh Crap, My Query is Still Running” Problem

We’ve all been there. You write a simple SELECT category, SUM(price), COUNT(*) FROM orders GROUP BY category on your transactional DB. It works fine in staging. Then you deploy to production, hit 100 million rows, and the database engine decides to go on a coffee break.

Why? Because traditional databases like PostgreSQL, MySQL, and Oracle are Row-Oriented (OLTP). They are built to handle transactions—like updating a user’s password, inserting a shopping cart item, or fetching a single profile. They store data on disk as complete rows.

ELI5: What is OLAP vs. OLTP, and why does it matter? Think of it like the difference between a supermarket cashier scanning a bar of soap and a regional manager calculating annual revenue trends across 500 stores. See ELI5: OLAP for the full breakdown.

Enter Column-Oriented Storage

ClickHouse is an open-source, column-oriented database management system (DBMS) designed for Online Analytical Processing (OLAP). Instead of organizing data by rows, it stores each column in its own physical files on disk.

Let’s look at how this changes the game:

Row-Oriented Storage (Postgres/MySQL)

On disk, a row-oriented table looks like this:

Row 1: [ID: 1, User: "Alice", Age: 30, Country: "US", Balance: 150.00]
Row 2: [ID: 2, User: "Bob",   Age: 25, Country: "CA", Balance: 80.00]
Row 3: [ID: 3, User: "Charlie", Age: 35, Country: "US", Balance: 210.00]

If you want to calculate the average age of all users, the database engine has to read every single byte of Rows 1, 2, and 3 from the disk into memory, extract the Age column, and discard the rest. That is a massive amount of wasted disk I/O.

Column-Oriented Storage (ClickHouse)

On disk, ClickHouse stores the same data like this:

File 1 (ID):      [1, 2, 3]
File 2 (User):    ["Alice", "Bob", "Charlie"]
File 3 (Age):     [30, 25, 35]
File 4 (Country): ["US", "CA", "US"]
File 5 (Balance): [150.00, 80.00, 210.00]

Now, when you run SELECT AVG(Age) FROM users, ClickHouse bypasses Files 1, 2, 4, and 5 entirely. It opens File 3, streams the integers straight to the CPU, and returns the result. You read exactly what you need—nothing more.

Distracted Boyfriend Columnar Storage Data engineers when they discover columnar storage formats and say goodbye to waiting 40 minutes for simple aggregates.

ELI5: How does columnar storage work? Imagine tearing the “Age” column out of a spreadsheet and throwing the rest of the sheet in the shredder. Check out ELI5: Columnar Storage for a funny analogy.


Why ClickHouse is Blazing Fast

Columnar storage is just the entry ticket. Plenty of databases do columnar storage (Redshift, BigQuery, Snowflake), but ClickHouse is notoriously faster when running on single-node or bare-metal setups. Here is the secret sauce:

1. Vectorized Query Execution (SIMD)

ClickHouse doesn’t just read data in columns; it processes it in vectors (blocks of rows). Traditional databases process data row-by-row (using the Volcano iterator model). ClickHouse leverages modern CPU architectures using SIMD (Single Instruction, Multiple Data) instructions.

Instead of asking the CPU to add two numbers, ClickHouse loads a whole array (vector) of numbers into the CPU’s SSE or AVX registers and adds them in a single instruction cycle. For more on this hardware acceleration, check out the Official ClickHouse CPU Optimization Docs.

2. Extreme Data Compression

Because columns contain data of the same type, they compress incredibly well. In the Country file above, you have repeating values like "US", "CA", "US". ClickHouse applies specialized codecs like DoubleDelta, Gorilla, or standard ZSTD to shrink these files to a fraction of their original size. Smaller files on disk mean less physical data has to be read into memory, which speeds up queries. Read the ClickHouse Compression Documentation to learn how compression shapes query speed.

3. Parallel Processing

ClickHouse is designed to consume every single CPU core you throw at it. When you run a query, ClickHouse breaks the columnar data blocks into chunks and processes them in parallel across all available cores. If you have a 64-core server, ClickHouse will happily max out all 64 cores to finish your query in milliseconds.

4. No Garbage Collection (C++)

Unlike Java or Scala-based query engines (like Presto, Trino, or Apache Spark) which run on the JVM and are subject to garbage collection pauses, ClickHouse is written in highly optimized C++. There is no garbage collector to freeze your query because the JVM decided to clean up memory.


The Trade-Offs (Or: When NOT to use ClickHouse)

ClickHouse is a race car. It is built to go extremely fast in a straight line on a track. But don’t try to use it to move furniture. Here are the things ClickHouse is terrible at:

  1. Transactional Workloads (OLTP): If you need to support millions of users updating their profile photos or checking out shopping carts, do not use ClickHouse. It does not support fast, single-row updates or deletes. Updates in ClickHouse are expensive background mutations.
  2. High-Concurrency Point Lookups: ClickHouse is built for heavy queries scanning billions of rows. If you have 50,000 users per second querying SELECT * FROM users WHERE id = 12345, use Redis or Postgres. ClickHouse will choke because each query expects to utilize multiple cores.
  3. No Real Joins at Scale (Mostly): While ClickHouse supports SQL JOINs, joining massive tables requires loading the right-side table into memory. In a distributed cluster, large joins can quickly run out of memory. The ClickHouse way is to design denormalized schemas.
FeatureClickHouse (OLAP)PostgreSQL (OLTP)
Storage LayoutColumn-orientedRow-oriented
Write PatternLarge batch inserts (10k+ rows)Single-row transactions (ACID)
Query TargetAggregations over billions of rowsPoint updates and fetches by ID
CompressionHigh (often 5x to 10x)Low to Medium
CPU UsageMaxes out all coresSingle thread per query (mostly)

What’s Next?

Now that you understand the architecture, it’s time to build a playground. In the next part, we’ll install ClickHouse natively and via Docker, and configure our system settings to prevent the dreaded out-of-memory crashes.