10 min read Part 5 of 10

Part 5: Ingestion

Getting data into ClickHouse: INSERTs, Formats, and Integrations.

#ClickHouse #Ingestion #Formats #Integrations
Part 5: Ingestion

ClickHouse is an ingestion beast. It can swallow millions of rows per second without breaking a sweat. But if you try to feed it like you feed PostgreSQL—inserting rows one by one as they happen—ClickHouse will throw a tantrum, lock up your disk, and crash with the dreaded “Too many parts in all partitions” error. Let’s learn how to ingest data the right way.

The Golden Rule: Batch or Die

Every time you run an INSERT statement in ClickHouse, it creates a new physical directory (part) on disk containing all the column files for those rows.

If you insert 100 rows per second one by one:

  • ClickHouse creates 100 directories on disk every second.
  • The background merger cannot keep up. ClickHouse wants to merge these folders, but it can only merge so fast.
  • The number of active parts climbs past 300, and ClickHouse aborts all inserts with: DB::Exception: Too many parts in all partitions in table...

The Rule: Always batch your inserts. Write at least 10,000 rows (ideally 100,000+) in a single INSERT statement.

Boromir row-by-row insert meme One does not simply row-by-row insert into a columnar database.

If your application cannot buffer data before sending it, you have two choices:

  1. Use an intermediary buffer like Kafka or Vector to aggregate rows.
  2. Turn on ClickHouse’s built-in Asynchronous Inserts feature.

How to Configure Asynchronous Inserts

Asynchronous inserts tell the ClickHouse server to buffer small inserts in memory and write them to disk in batches. Open your connection profile and run:

SET async_insert = 1;
SET wait_for_async_insert = 0;
  • async_insert = 1: Enables server-side buffering.
  • wait_for_async_insert = 0: The server returns a success status to the client immediately, writing the data in the background (fire-and-forget). If you want the client to wait until the data is safely written, set this to 1. For configuration specifics, see the ClickHouse Async Insert Documentation.

Loading Files via CLI

ClickHouse supports dozens of file formats (CSV, TSV, JSONEachRow, Parquet, Native). The CLI is highly optimized for pipes.

Inserting JSON Logs

Imagine you have a file logs.json where each line is a JSON object:

{"timestamp":"2023-10-05 12:00:00","user_id":42,"action":"click"}
{"timestamp":"2023-10-05 12:01:00","user_id":43,"action":"view"}

You can stream this file directly into ClickHouse:

clickhouse-client \
  --database="default" \
  --query="INSERT INTO user_logs FORMAT JSONEachRow" < logs.json

Loading from CSV

If you have a CSV file with headers, use the CSVWithNames format:

clickhouse-client \
  --query="INSERT INTO my_table FORMAT CSVWithNames" < data.csv

Querying Object Storage (S3 & GCS) Directly

You don’t always have to load data into a table to query it. ClickHouse can query files sitting in AWS S3 or Google Cloud Storage directly using the s3() table function.

SELECT 
    product_category,
    SUM(quantity) AS total_sold
FROM s3(
    'https://s3.us-east-1.amazonaws.com/my-bucket/orders/*.parquet',
    'AWS_ACCESS_KEY_ID',
    'AWS_SECRET_ACCESS_KEY',
    'Parquet'
)
GROUP BY product_category
ORDER BY total_sold DESC
LIMIT 5;

This is incredibly useful for ad-hoc exploration. If you decide you do want to load this data, just prepend INSERT INTO my_table to the query:

INSERT INTO orders_local
SELECT * FROM s3('https://s3.us-east-1.amazonaws.com/my-bucket/orders/*.parquet', 'Parquet');

Read the ClickHouse S3 Table Function Docs for format configurations.


The Streaming Pipeline: Kafka Engine + Materialized View

If you are streaming events from Apache Kafka, ClickHouse has a built-in integration pattern.

Warning: You cannot query a Kafka engine table directly (doing so consumes the messages, and they vanish). Instead, you use a Materialized View to act as a bridge that pulls messages from the Kafka topic and writes them to a standard MergeTree table.

[ Kafka Topic ] ===> [ Kafka Engine Table ] === ( Materialized View Bridge ) ===> [ Target MergeTree Table ]

The 3-Step Setup

Step 1: Create the Target Table (MergeTree)

This is where your data will live permanently.

CREATE TABLE events_local (
    timestamp DateTime,
    user_id UInt64,
    event_name String
) ENGINE = MergeTree()
ORDER BY (event_name, timestamp);

Step 2: Create the Kafka Engine Table

This acts as a consumer client listening to the Kafka topic.

CREATE TABLE events_kafka_queue (
    timestamp DateTime,
    user_id UInt64,
    event_name String
) ENGINE = Kafka
SETTINGS 
    kafka_broker_list = 'localhost:9092',
    kafka_topic_list = 'user-events',
    kafka_group_name = 'clickhouse-consumer-group',
    kafka_format = 'JSONEachRow',
    kafka_num_consumers = 2;

Step 3: Create the Materialized View Bridge

This background runner pulls data from the queue and inserts it into the target table.

CREATE MATERIALIZED VIEW events_kafka_mv TO events_local AS
SELECT timestamp, user_id, event_name
FROM events_kafka_queue;

Once the Materialized View is created, ClickHouse starts reading from Kafka in the background. The consumer offsets are managed automatically by ClickHouse Keeper or ZooKeeper. For scaling this pipeline, refer to the ClickHouse Kafka Integration Docs.

Now that you have loaded millions of rows into ClickHouse, we need to extract them. In the next part, we’ll master the art of querying and explore the secret weapon: PREWHERE.