Parallel sync

Parallel Sync (Experimental Feature)

Parallel Sync is an experimental feature that speeds up syncing by using multiple CPUs/threads. It is particularly useful in environments with high network latency, for example when the PostgreSQL, Elasticsearch/OpenSearch, and PGSync servers run on different networks.

Why Use Parallel Sync?

In distributed setups, slow request/response times, especially during database queries, can limit sync performance. Even with server-side cursors, the delay in fetching each next batch of records can become the bottleneck.

Parallel Sync addresses this by running an initial high-speed parallel sync that populates Elasticsearch/OpenSearch in a single pass. After that, a regular PGSync process can run as a daemon to handle ongoing sync.

How It Works:

PGSync leverages PostgreSQL's internal ctid system column, which uniquely identifies rows based on their position in a table (page and row number).

The sync process paginates records using ctid, distributing work evenly across available CPUs/threads.

Each worker thread processes a "chunk" of data in parallel, using efficient filtered queries based on page and row numbers.

Bulk inserts to Elasticsearch/OpenSearch are executed concurrently, maximizing throughput.
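
The steps above can be sketched roughly as follows. This is an illustration only, not PGSync's actual implementation: the helper names (page_ranges, chunk_query, process_chunk, plan) are hypothetical, the table name is interpolated directly for brevity, and the total page count is assumed to come from somewhere like pg_class.relpages, which is only an estimate. The sketch builds the ctid-filtered queries and fans them out to a thread pool without touching a database.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def page_ranges(total_pages: int, workers: int) -> List[Tuple[int, int]]:
    """Split pages [0, total_pages) into at most `workers` contiguous half-open ranges."""
    if total_pages <= 0 or workers <= 0:
        return []
    size, extra = divmod(total_pages, workers)
    ranges = []
    start = 0
    for i in range(workers):
        # The first `extra` ranges take one additional page each.
        end = start + size + (1 if i < extra else 0)
        if end > start:
            ranges.append((start, end))
        start = end
    return ranges


def chunk_query(table: str, start_page: int, end_page: int) -> str:
    """Build a query for rows whose ctid falls in pages [start_page, end_page).

    Assumption: `table` is a trusted identifier; real code should quote it.
    """
    return (
        f"SELECT ctid, * FROM {table} "
        f"WHERE ctid >= '({start_page},0)'::tid AND ctid < '({end_page},0)'::tid"
    )


def process_chunk(query: str) -> str:
    # Hypothetical stand-in: a real worker would execute the query and
    # bulk index the resulting rows into Elasticsearch/OpenSearch.
    return query


def plan(table: str, total_pages: int, workers: int) -> List[str]:
    """Create one query per page range and hand each to a worker thread."""
    queries = [chunk_query(table, s, e) for s, e in page_ranges(total_pages, workers)]
    with ThreadPoolExecutor(max_workers=max(1, workers)) as pool:
        return list(pool.map(process_chunk, queries))
```

Because each range is half-open and the ranges are contiguous, every page belongs to exactly one chunk, so no row is fetched twice. If the page count is an estimate, the last range would need an open upper bound in practice to catch pages beyond it.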

Parallel Sync Execution Modes

Parallel Sync supports multiple execution modes to match your system's architecture and performance needs. Select a mode with the -m option:

parallel_sync -c schema.json -m multiprocess

Available Modes:

synchronous: Runs in a single-threaded, sequential mode (baseline behavior).
multithreaded: Uses multiple threads within a single process for parallel sync.
multiprocess: Spawns multiple processes to perform sync in parallel.
multithreaded_async: Combines multithreading with asynchronous I/O for improved concurrency.
multiprocess_async: Combines multiple processes with asynchronous I/O for maximum parallelism and efficiency.
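
To make the difference between the modes more concrete, here is a minimal sketch of how such a mode switch could be built on Python's standard library. It is an assumption about the general technique, not a description of how PGSync implements its modes: run_chunks, _run_async, and square are hypothetical names, and square merely stands in for the work of syncing one chunk.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import Callable, Iterable, List


def square(n: int) -> int:
    """Hypothetical stand-in for syncing one chunk.

    Defined at module level so it can be pickled for process pools.
    """
    return n * n


async def _run_async(executor, worker: Callable, chunks: List) -> List:
    # Schedule every chunk on the executor and await them all from the event loop.
    loop = asyncio.get_running_loop()
    futures = [loop.run_in_executor(executor, worker, c) for c in chunks]
    return list(await asyncio.gather(*futures))


def run_chunks(mode: str, worker: Callable, chunks: Iterable, max_workers: int = 4) -> List:
    """Run `worker` over `chunks` using the named mode; results keep input order."""
    chunks = list(chunks)
    if mode == "synchronous":
        return [worker(c) for c in chunks]

    # "multithreaded_async" -> base "multithreaded", suffix "async"
    base, _, suffix = mode.partition("_")
    pools = {"multithreaded": ThreadPoolExecutor, "multiprocess": ProcessPoolExecutor}
    if base not in pools or suffix not in ("", "async"):
        raise ValueError(f"unknown mode: {mode}")

    with pools[base](max_workers=max_workers) as executor:
        if suffix == "async":
            return asyncio.run(_run_async(executor, worker, chunks))
        return list(executor.map(worker, chunks))
```

For example, run_chunks("multithreaded", square, [1, 2, 3]) and run_chunks("synchronous", square, [1, 2, 3]) return the same results; only the scheduling differs. On platforms that start processes with spawn, the multiprocess variants should be called from under an if __name__ == "__main__": guard.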