Part 7: Advanced Features
Unlock the power of Arrays, Maps, and Window Functions in ClickHouse.
Many developers assume analytical databases only work well with flat, boring tables. But ClickHouse treats nested data structures—like Arrays and Maps—as first-class citizens. You can perform functional programming, write inline lambdas, parse JSON on the fly, and run window functions directly inside ClickHouse. Let’s unlock these advanced tools.
Mastering Arrays with Functional Programming
ClickHouse has some of the most robust array manipulation functions in the database world. Instead of writing slow nested loops or subqueries, you can use lambda expressions to filter and transform arrays directly in your SQL.
The Basic Array Toolkit
Let’s look at standard array features first:
SELECT
[1, 2, 3, 4] AS numbers,
has(numbers, 3) AS contains_three, -- Returns 1 (true)
length(numbers) AS size, -- Returns 4
numbers[2] AS second_element -- Returns 2 (1-indexed!)
Functional Programming: arrayMap and arrayFilter
Say you have a table storing temperature metrics as an array: [15.0, 22.5, 30.1]. You want to convert these to Fahrenheit and filter out any temperatures below 70°F.
You can write this using lambdas:
SELECT
-- 1. Convert Celcius to Fahrenheit
arrayMap(x -> (x * 9/5) + 32, [15.0, 22.5, 30.1]) AS fahrenheit_temps,
-- 2. Filter out values less than 70
arrayFilter(x -> x >= 70, fahrenheit_temps) AS hot_temps;
arrayMap(x -> expression, array): Applies the lambda function to every element in the array.arrayFilter(x -> condition, array): Returns a new array containing only elements that satisfy the condition.
Unnesting Arrays with arrayJoin
If you need to convert an array into individual rows (similar to UNNEST in BigQuery or EXPLODE in Hive), use arrayJoin():
-- Create some dummy data
SELECT
user_id,
arrayJoin(['tech', 'finance', 'sports']) AS interest
FROM (SELECT 1 AS user_id);
Output:
┌─user_id─┬─interest─┐
│ 1 │ tech │
│ 1 │ finance │
│ 1 │ sports │
└─────────┴──────────┘
Performance Tip: Be careful with arrayJoin. It duplicates the other columns in the row for every array element, which multiplies the data volume in memory. Use it late in your query lifecycle.
For a complete list of array functions, refer to the ClickHouse Array Functions Docs.
Maps: Key-Value Pairs Under the Hood
The Map(Key, Value) data type is perfect for semi-structured data where the keys are not known in advance—such as custom user attributes or query parameters.
Physical Storage of Maps
A common misconception is that ClickHouse stores Maps as serialized JSON strings. It doesn’t.
Under the hood, Map(String, String) is stored as two parallel arrays: a keys array Array(String) and a values array Array(String).
- When you query
attributes['browser'], ClickHouse searches the keys array for'browser'. - It identifies the index of the key and pulls the corresponding value from the values array. This physical layout maintains columnar scan performance even for maps with thousands of keys.
Let’s look at a schema utilizing Maps:
CREATE TABLE events_map (
timestamp DateTime,
event_name String,
attributes Map(String, String)
) ENGINE = MergeTree()
ORDER BY timestamp;
-- Inserting a map
INSERT INTO events_map VALUES
('2023-10-07 12:00:00', 'click', {'browser':'Chrome', 'version':'118', 'os':'macOS'});
-- Querying the map
SELECT
attributes['os'] AS os_type,
count() AS total
FROM events_map
WHERE attributes['browser'] = 'Chrome'
GROUP BY os_type;
Processing JSON Logs on the Fly
If you are ingesting raw JSON strings into a standard String column, you can extract values dynamically using the JSONExtract* family of functions:
WITH '{"user": {"id": 101, "name": "Ankur"}, "tags": ["admin", "dev"]}' AS json_raw
SELECT
JSONExtractInt(json_raw, 'user', 'id') AS user_id,
JSONExtractString(json_raw, 'user', 'name') AS user_name,
JSONExtractArrayRaw(json_raw, 'tags') AS tags_raw;
For deep nested JSON parsing specs, see the ClickHouse JSON Extraction Documentation.
Window Functions: Syntactic Sugar with a Price
ClickHouse supports standard SQL window functions (like ROW_NUMBER(), RANK(), LEAD(), LAG(), and SUM() OVER ()):
SELECT
user_id,
timestamp,
revenue,
sum(revenue) OVER (PARTITION BY user_id ORDER BY timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_revenue
FROM orders_flat;
The Performance Caveat
While window functions are powerful, they should be used with caution in high-performance analytical environments.
ClickHouse is built to process data in streaming blocks. Because a window function requires sorting and partition grouping across potentially the entire dataset, it forces ClickHouse to break its streaming model, buffer the data in memory, and perform single-threaded sorts.
If you run a window function over a table with 500 million rows, you risk crashing your server with an Out-of-Memory (OOM) error. Where possible, try to pre-aggregate data using Materialized Views before applying window calculations.
One does not simply run a window function over a billion rows without checking the memory footprint.
We will cover Materialized Views in Part 8.