
Parallel Sync

Parallel Sync speeds up syncing by spreading the work across multiple CPUs/threads. It is most useful in environments with high network latency, for example when the PostgreSQL, Elasticsearch/OpenSearch, and PGSync servers run on different networks.

Why Use Parallel Sync?

In distributed setups, slow request/response times during database queries can limit sync performance. Even with server-side cursors, delays in fetching the next batch of records can bottleneck the process.

Parallel Sync addresses this by running an initial high-speed, parallel sync to populate Elasticsearch/OpenSearch in one iteration. After this, a regular PGSync process can continue to run as a daemon for ongoing sync.

How It Works

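PGSync leverages PostgreSQL's ctid system column, which gives the physical location of each row version in a table as a (page number, row index) pair. Because a row's ctid can change when the row is updated or the table is rewritten (for example by VACUUM FULL), it suits a one-off bulk pass rather than long-term row identification.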

  • The sync process paginates records using ctid, distributing work evenly across available CPUs/threads
  • Each worker (thread or process, depending on the mode) handles a "chunk" of data in parallel, using efficient filtered queries based on page and row numbers
  • Bulk inserts to Elasticsearch/OpenSearch are executed concurrently, maximizing throughput
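
The chunking step above can be sketched as follows. This is a minimal illustration, not PGSync's actual implementation: the helper names `page_chunks` and `ctid_query` are hypothetical, the page count is assumed to be known already (for instance from `pg_class.relpages`), and the range predicate on ctid is one possible way to filter by page; PGSync's real queries may differ.

```python
from typing import List, Tuple


def page_chunks(total_pages: int, workers: int) -> List[Tuple[int, int]]:
    """Split pages [0, total_pages) into at most `workers` contiguous
    half-open ranges whose sizes differ by no more than one page."""
    if total_pages <= 0 or workers <= 0:
        return []
    size, extra = divmod(total_pages, workers)
    chunks: List[Tuple[int, int]] = []
    start = 0
    for i in range(workers):
        # The first `extra` chunks take one additional page each.
        length = size + (1 if i < extra else 0)
        if length == 0:
            break  # more workers than pages: stop once pages run out
        chunks.append((start, start + length))
        start += length
    return chunks


def ctid_query(table: str, start_page: int, end_page: int) -> str:
    """Build a query selecting rows stored on pages [start_page, end_page).

    Illustration only: `table` is interpolated directly, so it must come
    from a trusted source. Row indexes within a page start at 1, so
    '(p,0)' sorts before every row on page p.
    """
    return (
        f"SELECT ctid, * FROM {table} "
        f"WHERE ctid >= '({start_page},0)'::tid "
        f"AND ctid < '({end_page},0)'::tid"
    )
```

For example, `page_chunks(10, 3)` yields three ranges covering pages 0 to 9, and each range can be turned into a query with `ctid_query` and handed to a separate worker.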

Execution Modes

Parallel Sync supports several execution modes to match your system's architecture and performance needs. Select one with the -m option, for example:

parallel_sync -c schema.json -m multiprocess

Available Modes:

Mode                 Description
synchronous          Runs in single-threaded, sequential mode (baseline behavior)
multithreaded        Uses multiple threads within a single process for parallel sync
multiprocess         Spawns multiple processes to perform sync in parallel
multithreaded_async  Combines multithreading with asynchronous I/O for improved concurrency
multiprocess_async   Combines multiple processes with asynchronous I/O for maximum parallelism and efficiency
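
To make the difference between the first three modes concrete, here is a rough sketch using Python's standard `concurrent.futures` module. It is an assumption about how such modes could be structured, not a description of PGSync's internals: `sync_chunk` and `run` are hypothetical names, and `sync_chunk` is a stand-in for "fetch the rows in this page range and bulk index them". The async variants are omitted for brevity.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import List, Sequence, Tuple

Chunk = Tuple[int, int]  # half-open page range (start_page, end_page)


def sync_chunk(chunk: Chunk) -> int:
    """Hypothetical worker: a real one would query PostgreSQL for the page
    range and bulk index the rows. Here it only reports the page count."""
    start, end = chunk
    return end - start


def run(mode: str, chunks: Sequence[Chunk], workers: int = 4) -> List[int]:
    """Process every chunk using the given mode; results keep chunk order."""
    if mode == "synchronous":
        # Baseline: one chunk after another in the calling thread.
        return [sync_chunk(c) for c in chunks]
    if mode == "multithreaded":
        # Threads share one process; useful when workers mostly wait on I/O.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(sync_chunk, chunks))
    if mode == "multiprocess":
        # Separate processes sidestep the GIL for CPU-bound work. Call this
        # from under an `if __name__ == "__main__":` guard in scripts.
        with ProcessPoolExecutor(max_workers=workers) as pool:
            return list(pool.map(sync_chunk, chunks))
    raise ValueError(f"unsupported mode: {mode}")
```

All three branches return the same results for the same chunks; only how the work is scheduled differs. Which mode is fastest depends on whether the bottleneck is network latency (threads or async I/O tend to help) or CPU-bound document transformation (processes tend to help).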