Part 5: Ingestion
Getting data into ClickHouse: INSERTs, Formats, and Integrations.
ClickHouse is an ingestion beast. It can swallow millions of rows per second without breaking a sweat. But if you try to feed it like you feed PostgreSQL—inserting rows one by one as they happen—ClickHouse will throw a tantrum, lock up your disk, and crash with the dreaded “Too many parts in all partitions” error. Let’s learn how to ingest data the right way.
The Golden Rule: Batch or Die
Every time you run an INSERT statement in ClickHouse, it creates a new physical directory (part) on disk containing all the column files for those rows.
If you insert 100 rows per second one by one:
- ClickHouse creates 100 directories on disk every second.
- The background merger cannot keep up. ClickHouse wants to merge these folders, but it can only merge so fast.
- The number of active parts climbs past 300, and ClickHouse aborts all inserts with:
DB::Exception: Too many parts in all partitions in table...
The Rule: Always batch your inserts. Write at least 10,000 rows (ideally 100,000+) in a single INSERT statement.
One does not simply row-by-row insert into a columnar database.
If your application cannot buffer data before sending it, you have two choices:
- Use an intermediary buffer like Kafka or Vector to aggregate rows.
- Turn on ClickHouse’s built-in Asynchronous Inserts feature.
How to Configure Asynchronous Inserts
Asynchronous inserts tell the ClickHouse server to buffer small inserts in memory and write them to disk in batches. Open your connection profile and run:
SET async_insert = 1;
SET wait_for_async_insert = 0;
async_insert = 1: Enables server-side buffering.wait_for_async_insert = 0: The server returns a success status to the client immediately, writing the data in the background (fire-and-forget). If you want the client to wait until the data is safely written, set this to1. For configuration specifics, see the ClickHouse Async Insert Documentation.
Loading Files via CLI
ClickHouse supports dozens of file formats (CSV, TSV, JSONEachRow, Parquet, Native). The CLI is highly optimized for pipes.
Inserting JSON Logs
Imagine you have a file logs.json where each line is a JSON object:
{"timestamp":"2023-10-05 12:00:00","user_id":42,"action":"click"}
{"timestamp":"2023-10-05 12:01:00","user_id":43,"action":"view"}
You can stream this file directly into ClickHouse:
clickhouse-client \
--database="default" \
--query="INSERT INTO user_logs FORMAT JSONEachRow" < logs.json
Loading from CSV
If you have a CSV file with headers, use the CSVWithNames format:
clickhouse-client \
--query="INSERT INTO my_table FORMAT CSVWithNames" < data.csv
Querying Object Storage (S3 & GCS) Directly
You don’t always have to load data into a table to query it. ClickHouse can query files sitting in AWS S3 or Google Cloud Storage directly using the s3() table function.
SELECT
product_category,
SUM(quantity) AS total_sold
FROM s3(
'https://s3.us-east-1.amazonaws.com/my-bucket/orders/*.parquet',
'AWS_ACCESS_KEY_ID',
'AWS_SECRET_ACCESS_KEY',
'Parquet'
)
GROUP BY product_category
ORDER BY total_sold DESC
LIMIT 5;
This is incredibly useful for ad-hoc exploration. If you decide you do want to load this data, just prepend INSERT INTO my_table to the query:
INSERT INTO orders_local
SELECT * FROM s3('https://s3.us-east-1.amazonaws.com/my-bucket/orders/*.parquet', 'Parquet');
Read the ClickHouse S3 Table Function Docs for format configurations.
The Streaming Pipeline: Kafka Engine + Materialized View
If you are streaming events from Apache Kafka, ClickHouse has a built-in integration pattern.
Warning: You cannot query a Kafka engine table directly (doing so consumes the messages, and they vanish). Instead, you use a Materialized View to act as a bridge that pulls messages from the Kafka topic and writes them to a standard MergeTree table.
[ Kafka Topic ] ===> [ Kafka Engine Table ] === ( Materialized View Bridge ) ===> [ Target MergeTree Table ]
The 3-Step Setup
Step 1: Create the Target Table (MergeTree)
This is where your data will live permanently.
CREATE TABLE events_local (
timestamp DateTime,
user_id UInt64,
event_name String
) ENGINE = MergeTree()
ORDER BY (event_name, timestamp);
Step 2: Create the Kafka Engine Table
This acts as a consumer client listening to the Kafka topic.
CREATE TABLE events_kafka_queue (
timestamp DateTime,
user_id UInt64,
event_name String
) ENGINE = Kafka
SETTINGS
kafka_broker_list = 'localhost:9092',
kafka_topic_list = 'user-events',
kafka_group_name = 'clickhouse-consumer-group',
kafka_format = 'JSONEachRow',
kafka_num_consumers = 2;
Step 3: Create the Materialized View Bridge
This background runner pulls data from the queue and inserts it into the target table.
CREATE MATERIALIZED VIEW events_kafka_mv TO events_local AS
SELECT timestamp, user_id, event_name
FROM events_kafka_queue;
Once the Materialized View is created, ClickHouse starts reading from Kafka in the background. The consumer offsets are managed automatically by ClickHouse Keeper or ZooKeeper. For scaling this pipeline, refer to the ClickHouse Kafka Integration Docs.
Now that you have loaded millions of rows into ClickHouse, we need to extract them. In the next part, we’ll master the art of querying and explore the secret weapon: PREWHERE.