Backfill Plugin

This document covers practical usage of the optional backfill plugin.

  • Builds deterministic, immutable backfill plans that divide a time window into chunks.
  • Executes backfills against ClickHouse with per-chunk checkpointing, automatic retries, and idempotency tokens.
  • Detects materialized views and automatically generates correct CTE-wrapped replay queries.
  • Supports resume from checkpoint, cancel, status monitoring, and doctor-style diagnostics.
  • Integrates with chkit check for CI enforcement of pending backfills.
  • Persists all state as JSON/NDJSON on disk.

The plugin follows a plan-then-execute lifecycle:

  1. plan — Build an immutable backfill plan dividing the time window into chunks.
  2. run — Execute the plan with checkpointed progress.
  3. status — Monitor chunk progress and run state.

Additional commands: resume (continue from checkpoint), cancel (stop execution), doctor (actionable diagnostics).

chkit check integration reports pending or failed backfills in CI.

In clickhouse.config.ts, register backfill(...) from @chkit/plugin-backfill.

import { defineConfig } from '@chkit/core'
import { backfill } from '@chkit/plugin-backfill'

export default defineConfig({
  schema: './src/db/schema/**/*.ts',
  plugins: [
    backfill({
      stateDir: './chkit/backfill',
      defaults: {
        chunkHours: 6,
        maxParallelChunks: 1,
        maxRetriesPerChunk: 3,
        retryDelayMs: 1000,
        requireIdempotencyToken: true,
        timeColumn: 'created_at',
      },
      policy: {
        requireDryRunBeforeRun: true,
        requireExplicitWindow: true,
        blockOverlappingRuns: true,
        failCheckOnRequiredPendingBackfill: true,
      },
      limits: {
        maxWindowHours: 720,
        minChunkMinutes: 15,
      },
    }),
  ],
})

The run and resume commands execute SQL against ClickHouse when a connection is configured. Configure clickhouse at the top level of clickhouse.config.ts:

export default defineConfig({
  clickhouse: {
    url: process.env.CLICKHOUSE_URL || 'http://localhost:8123',
    username: 'default',
    password: '',
    database: 'default',
  },
  schema: './src/db/schema/**/*.ts',
  plugins: [backfill(...)],
})

In CI, the URL and credentials can be supplied through environment variables.

The plugin supports two strategies for backfilling data, chosen automatically based on your schema:

Table backfill (table strategy): For direct table targets, inserts data by selecting from the same table within the time window. This is the most common case.

Materialized view replay (mv_replay strategy): When the target is a materialized view’s TO table, the plugin detects the view’s aggregation query and wraps it in a CTE (Common Table Expression). This re-materializes the aggregation for each chunk window, ensuring correctness for aggregate backfills. Requires requireIdempotencyToken: true for safe, resumable retries.
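To make the two strategies concrete, here is an illustrative sketch of the shape of the SQL each one produces. The function names, the `Chunk` type, and the exact statement text are assumptions for illustration; chkit generates its own statements internally and they may differ in detail:

```typescript
// Illustrative only: shows the *shape* of the two strategies, not chkit's
// actual generated SQL.

interface Chunk {
  from: string // ISO timestamp, inclusive lower bound
  to: string   // ISO timestamp, exclusive upper bound
}

// table strategy: re-insert rows selected within the chunk window
function tableBackfillSql(target: string, timeColumn: string, chunk: Chunk): string {
  return (
    `INSERT INTO ${target} SELECT * FROM ${target} ` +
    `WHERE ${timeColumn} >= '${chunk.from}' AND ${timeColumn} < '${chunk.to}'`
  )
}

// mv_replay strategy: restrict the view's source to the chunk window via a CTE,
// then re-run the view's aggregation query into its TO table
function mvReplaySql(
  toTable: string,
  sourceTable: string,
  viewSelect: string, // the materialized view's SELECT, reading from sourceTable
  timeColumn: string,
  chunk: Chunk,
): string {
  return (
    `INSERT INTO ${toTable} WITH source AS (` +
    `SELECT * FROM ${sourceTable} ` +
    `WHERE ${timeColumn} >= '${chunk.from}' AND ${timeColumn} < '${chunk.to}'` +
    `) ${viewSelect.replace(sourceTable, 'source')}`
  )
}
```

The key difference is that mv_replay re-runs the aggregation per chunk rather than copying rows, which is why it needs idempotency tokens to make retries safe.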

The backfill plugin needs a time column to build WHERE clauses for each chunk. It resolves the column through a layered fallback chain:

  1. CLI flag — --time-column <column> on the plan command.
  2. Schema-level config — plugins.backfill.timeColumn on the table definition.
  3. Global default — defaults.timeColumn in the plugin options.
  4. Auto-detection — Scans ORDER BY columns and common time column names (created_at, timestamp, event_time, etc.) for DateTime/DateTime64 types.
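The fallback chain above is a first-defined-wins lookup. A minimal sketch, assuming hypothetical names (`TimeColumnSources`, `resolveTimeColumn` are not chkit APIs):

```typescript
// Illustrative sketch of the fallback chain, not chkit's implementation.
interface TimeColumnSources {
  cliFlag?: string      // 1. --time-column on the plan command
  schemaConfig?: string // 2. plugins.backfill.timeColumn on the table definition
  pluginDefault?: string // 3. defaults.timeColumn in the plugin options
  autoDetected?: string // 4. scan of ORDER BY / common DateTime column names
}

function resolveTimeColumn(s: TimeColumnSources): string {
  // ?? tries each source in order; the first defined value wins
  const column = s.cliFlag ?? s.schemaConfig ?? s.pluginDefault ?? s.autoDetected
  if (!column) throw new Error('no time column could be resolved')
  return column
}
```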

Schema-level configuration is the recommended approach when different tables use different time columns. Define it directly in the table() call:

import { table } from '@chkit/core'
export const events = table({
database: 'app',
name: 'events',
columns: [
{ name: 'event_time', type: 'DateTime' },
{ name: 'id', type: 'UInt64' },
],
engine: 'MergeTree',
orderBy: ['event_time', 'id'],
primaryKey: ['event_time', 'id'],
plugins: {
backfill: { timeColumn: 'event_time' },
},
})

This requires importing @chkit/plugin-backfill somewhere in the project (typically in clickhouse.config.ts) to activate the type augmentation. The plugins.backfill object is fully typed — autocomplete and type errors work as expected.

Configuration is organized into three groups plus a top-level stateDir.

Top-level:

  • stateDir (default: <metaDir>/backfill) — Directory for plan, run, and event state files.

defaults group:

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| chunkHours | number | 6 | Hours per chunk |
| maxParallelChunks | number | 1 | Max concurrent chunks |
| maxRetriesPerChunk | number | 3 | Retry budget per chunk |
| retryDelayMs | number | 1000 | Exponential backoff delay between retries (milliseconds) |
| requireIdempotencyToken | boolean | true | Generate deterministic tokens |
| timeColumn | string | auto-detect | Fallback column name for time-based WHERE clause (overridden by schema-level config) |
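The table describes retryDelayMs as an exponential backoff base, but does not spell out the curve. A sketch assuming the common base × 2^attempt doubling (the actual schedule may differ):

```typescript
// Assumed backoff curve: base * 2^attempt. With the default base of 1000ms,
// successive retries wait 1000ms, 2000ms, 4000ms, ...
function retryDelay(retryDelayMs: number, attempt: number): number {
  return retryDelayMs * 2 ** attempt
}
```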

policy group:

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| requireDryRunBeforeRun | boolean | true | Require plan before run |
| requireExplicitWindow | boolean | true | Require --from/--to |
| blockOverlappingRuns | boolean | true | Prevent concurrent runs |
| failCheckOnRequiredPendingBackfill | boolean | true | Fail chkit check on incomplete backfills |

limits group:

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| maxWindowHours | number | 720 (30 days) | Maximum window size |
| minChunkMinutes | number | 15 | Minimum chunk size |

Invalid option values fail fast at startup via plugin config validation.

All commands exit with: 0 (success), 1 (runtime error), 2 (config error).

plan — Build a deterministic backfill plan and persist immutable plan state.

| Flag | Required | Description |
| --- | --- | --- |
| --target <db.table> | Yes | Target table in database.table format |
| --from <timestamp> | Yes | Window start (ISO timestamp) |
| --to <timestamp> | Yes | Window end (ISO timestamp) |
| --chunk-hours <n> | No | Override chunk size (defaults to defaults.chunkHours) |
| --time-column <column> | No | Time column for WHERE clause (auto-detected if omitted) |
| --force-large-window | No | Allow windows exceeding limits.maxWindowHours |
| --force | No | Delete existing plan and regenerate from scratch |

run — Execute a planned backfill with checkpointed chunk progress.

| Flag | Required | Description |
| --- | --- | --- |
| --plan-id <hex16> | Yes | Plan ID (16-char hex) |
| --replay-done | No | Re-execute already-completed chunks |
| --replay-failed | No | Re-execute failed chunks |
| --force-overlap | No | Allow concurrent runs for the same target |
| --force-compatibility | No | Skip compatibility token check |
| --force-environment | No | Skip environment mismatch check (plan was created for a different ClickHouse cluster/database) |

resume — Resume a backfill run from the last checkpoint. Automatically retries failed chunks.

| Flag | Required | Description |
| --- | --- | --- |
| --plan-id <hex16> | Yes | Plan ID (16-char hex) |
| --replay-done | No | Re-execute already-completed chunks |
| --replay-failed | No | Re-execute failed chunks (enabled by default on resume) |
| --force-overlap | No | Allow concurrent runs for the same target |
| --force-compatibility | No | Skip compatibility token check |
| --force-environment | No | Skip environment mismatch check (plan was created for a different ClickHouse cluster/database) |

status — Show checkpoint and chunk progress for a backfill run.

| Flag | Required | Description |
| --- | --- | --- |
| --plan-id <hex16> | Yes | Plan ID (16-char hex) |

cancel — Cancel an in-progress backfill run and prevent further chunk execution.

| Flag | Required | Description |
| --- | --- | --- |
| --plan-id <hex16> | Yes | Plan ID (16-char hex) |

doctor — Provide actionable remediation steps for failed or pending backfill runs.

| Flag | Required | Description |
| --- | --- | --- |
| --plan-id <hex16> | Yes | Plan ID (16-char hex) |

When configured, chkit check includes a plugins.backfill block in its JSON output and can fail with findings sourced from plugin:backfill.

Finding codes:

  • backfill_required_pending — A plan has no run or the run is not completed.
  • backfill_chunk_failed_retry_exhausted — A run has exhausted retries on a failed chunk.
  • backfill_policy_relaxed — failCheckOnRequiredPendingBackfill is disabled (warning only).

When failCheckOnRequiredPendingBackfill is true (default), pending backfills cause chkit check to fail with an error. When false, they emit a warning instead.

All state is persisted to the configured stateDir:

<stateDir>/
  plans/<planId>.json     # Immutable plan state (written once)
  runs/<planId>.json      # Mutable run checkpoint (updated per chunk)
  events/<planId>.ndjson  # Append-only event log
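Since the event log is plain NDJSON, it can be inspected with ordinary tooling. A minimal parsing sketch; the event field names here (ts, type, chunkIndex) are illustrative, not chkit's actual event schema:

```typescript
// Hypothetical event shape for illustration; chkit's real fields may differ.
interface BackfillEvent {
  ts: string         // event timestamp
  type: string       // e.g. a chunk-completed or retry event
  chunkIndex?: number
}

// NDJSON = one JSON object per line; blank lines are skipped.
function parseEventLog(ndjson: string): BackfillEvent[] {
  return ndjson
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as BackfillEvent)
}
```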

Plan IDs are deterministic: sha256("<target>|<from>|<to>|<chunkHours>|<timeColumn>|<envFingerprint>") truncated to 16 hex characters. When a ClickHouse connection is configured, an environment fingerprint is included in the plan ID, so different clusters/databases automatically produce different plan files. Re-planning with the same parameters produces the same plan ID.
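The documented derivation can be reproduced with Node's crypto module. The function name is hypothetical; the hash input follows the formula above, with an empty fingerprint for offline plans that have no environment binding:

```typescript
import { createHash } from 'node:crypto'

// Reconstruction of the documented plan ID:
// sha256("<target>|<from>|<to>|<chunkHours>|<timeColumn>|<envFingerprint>")
// truncated to 16 hex characters.
function planId(
  target: string,
  from: string,
  to: string,
  chunkHours: number,
  timeColumn: string,
  envFingerprint: string, // '' for offline plans without an environment binding
): string {
  const input = [target, from, to, chunkHours, timeColumn, envFingerprint].join('|')
  return createHash('sha256').update(input).digest('hex').slice(0, 16)
}
```

Because the ID is a pure function of these inputs, re-planning the same window against the same environment always lands on the same plan file, while changing any parameter (or cluster) yields a new one.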

When clickhouse is configured in clickhouse.config.ts, backfill plans are bound to the specific ClickHouse cluster and database to prevent accidental cross-environment execution. The plan file stores:

  • environment.fingerprint — A 16-char hash of the URL origin + database name
  • environment.url — The cluster URL (for human readability)
  • environment.database — The target database
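A sketch of how such a fingerprint could be computed. The document only states it is a 16-char hash of the URL origin plus the database name; the hash function, separator, and function name here are assumptions:

```typescript
import { createHash } from 'node:crypto'

// Assumed derivation: sha256 of "<origin>|<database>" truncated to 16 hex chars.
function environmentFingerprint(url: string, database: string): string {
  const origin = new URL(url).origin // normalizes away paths, query strings, credentials
  return createHash('sha256').update(`${origin}|${database}`).digest('hex').slice(0, 16)
}
```

Hashing only the origin means cosmetic URL differences (trailing slash, path) do not change the fingerprint, while a different host, port, or database does.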

When running or resuming a plan, chkit verifies the plan’s environment matches the current config. If there’s a mismatch (e.g., you created the plan against staging but switched to production), execution is blocked with a clear error message.

To override the check (e.g., intentionally backfilling production using a staging plan), use --force-environment on the run or resume command.

Plans created without a ClickHouse config (offline/dry-run) have no environment binding and can run against any environment.

Basic backfill:

chkit plugin backfill plan --target analytics.events --from 2025-01-01 --to 2025-02-01
chkit plugin backfill run --plan-id <planId>
chkit plugin backfill status --plan-id <planId>

Failed chunk recovery:

chkit plugin backfill plan --target analytics.events --from 2025-01-01 --to 2025-02-01
chkit plugin backfill run --plan-id <planId> # some chunks fail
chkit plugin backfill resume --plan-id <planId> # automatically retries failed chunks

CI enforcement:

chkit check # fails if pending backfills exist
Known limitations:

  • maxParallelChunks is declared but execution is currently sequential.