A Postgres-native sink for Apache Spark

Deterministic Postgres writes, without the JDBC guesswork.

PGStyx replaces Spark's generic JDBC with a connector built for PostgreSQL — keyed upserts, pool-safe concurrency, runtime schema tolerance, and type fidelity for JSONB, UUID, and arrays.

View implementation guide · Compare plans

Native UPSERT

Atomic keyed updates without hand-rolled merge logic

Schema safety

Handle widening and add-column drift on the write path

Type fidelity

First-class support for JSONB, UUID, arrays, and precision numerics

Pooling

Internal HikariCP pooling to prevent PostgreSQL connection storms

Upsert in one call

// Upsert into users by user_id — one call
df.write
  .format("pgstyx")
  .option("url", "jdbc:postgresql://host:5432/warehouse")
  .option("dbtable", "users")
  .option("writeMode", "upsert")
  .option("mergeKeys", "user_id")
  .option("schemaEvolution", "addColumns,widen")
  .save()

# Upsert into users by user_id — one call
(df.write
   .format("pgstyx")
   .option("url", "jdbc:postgresql://host:5432/warehouse")
   .option("dbtable", "users")
   .option("writeMode", "upsert")
   .option("mergeKeys", "user_id")
   .option("schemaEvolution", "addColumns,widen")
   .save())

-- Upsert into users by user_id — one call
CREATE TABLE users_upsert USING pgstyx
OPTIONS (
  url 'jdbc:postgresql://host:5432/warehouse',
  dbtable 'users',
  writeMode 'upsert',
  mergeKeys 'user_id',
  schemaEvolution 'addColumns,widen'
) AS SELECT * FROM incoming_users;

Ready for production v0.9.0 · JVM datasource

02 — How it works

One datasource swap. The rest of the job stays the same.

PGStyx is a JVM datasource registered as pgstyx. Your existing df.write call keeps its shape — PGStyx takes over inside that call to handle pooling, upserts, type coercion, and schema drift before the rows hit Postgres.

Source

Your DataFrame

Any Spark DataFrame or temp view. Scala, Python, or SQL — your choice of API.

df.write.format("pgstyx")

rows + schema

Connector

Write-path logic

Pooled connections, keyed upserts, type coercion, schema alignment, retries.

HikariCP · UPSERT · type map · evolve

safe writes

Sink

PostgreSQL

Rows land with full type fidelity. No connection storms, no staging tables, no manual DDL.

jsonb · uuid · int[] · numeric

Atomic: One .save() call. No foreachPartition, no staging-table rituals.

Pool-safe: Bounded connections. Spark parallelism decoupled from Postgres connection limits.

Drift-aware: Evolve at write time. Add-column and widening handled without rerun work.

Faithful: Types round-trip. JSONB, UUID, arrays, and precision numerics stay intact.

03 — Generic JDBC vs PGStyx

Same upsert. Half the code. None of the archaeology.

A realistic production upsert: merge incoming orders into a mutable table by order_id, preserve jsonb metadata, and keep Postgres responsive under parallel Spark writes. First, what you write today. Then, what PGStyx collapses it into.

Generic Spark JDBC — as things are

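// Stage table + merge + manual type handling
import java.sql.DriverManager
import java.util.UUID
import org.apache.spark.sql.functions.{col, to_json}

// strip hyphens so the name is a valid unquoted Postgres identifier
val staging = "orders_staging_" + UUID.randomUUID.toString.replace("-", "")

df.repartition(8)                          // cap parallelism manually
  .withColumn("metadata", to_json(col("metadata"))) // cast jsonb → string
  .write
  .format("jdbc")
  .option("url", url)
  .option("dbtable", staging)
  .option("numPartitions", "8")
  .save()

// Spark's JDBC source cannot run DML, so the merge needs a raw driver connection
val conn = DriverManager.getConnection(url)
try {
  conn.createStatement().executeUpdate(s"""
    -- hand-written merge into orders by order_id
    -- preserve metadata and audit columns
    -- update this every time the shape changes
  """)
  conn.createStatement().executeUpdate(s"DROP TABLE $staging")   // and hope nothing failed mid-run
} finally conn.close()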

Connection storms under partition fan-out
Manual jsonb casting, staging cleanup
A schema change upstream breaks the SQL
Errors leave orphan staging tables

Lines of code 24

PGStyx — same job

// Native upsert, pooled connections, schema drift absorbed

df.write
  .format("pgstyx")
  .option("url", url)
  .option("dbtable", "orders")
  .option("writeMode", "upsert")
  .option("mergeKeys", "order_id")
  .option("schemaEvolution", "addColumns,widen")
  .save()

Internal HikariCP pooling — no connection storms
Keyed upsert handled inside the connector
JSONB, UUID, arrays preserved with binary fidelity
Add-column and widening handled at runtime

Lines of code 9

04 — Failure modes

Where Spark-to-Postgres jobs usually start hurting.

The connector isn't the problem on day one. It becomes the problem when concurrency rises, tables go mutable, types get awkward, and a teammate merges a column on a Friday afternoon.

Connection exhaustion

Spark's JDBC writer opens one connection per partition, so rising parallelism can exhaust PostgreSQL's max_connections and start rejecting clients. PGStyx decouples tasks from connection limits with internal HikariCP pooling.

Resolves: too_many_connections (SQLSTATE 53300)
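
To make the pooling idea concrete, here is a minimal Python sketch of a bounded pool: many parallel writers share a fixed number of connections, and extra writers wait instead of opening new ones. It is an illustration of the general technique, not PGStyx's or HikariCP's implementation; `BoundedPool`, `simulate`, and the use of `object()` as a stand-in connection are all hypothetical.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


class BoundedPool:
    """Hands out at most `size` connections; extra callers block until one is free."""

    def __init__(self, size, connect):
        self._free = queue.Queue(maxsize=size)
        for _ in range(size):
            self._free.put(connect())

    def run(self, work):
        conn = self._free.get()      # blocks while every connection is busy
        try:
            return work(conn)
        finally:
            self._free.put(conn)     # always hand the connection back


def simulate(tasks, pool_size):
    """Run `tasks` parallel writers and return the peak number of busy connections."""
    lock = threading.Lock()
    state = {"busy": 0, "peak": 0}

    def write_partition(conn):
        with lock:
            state["busy"] += 1
            state["peak"] = max(state["peak"], state["busy"])
        # a real task would send its rows over `conn` here
        with lock:
            state["busy"] -= 1
        return conn

    # object() stands in for a real database connection (assumption for the sketch)
    pool = BoundedPool(pool_size, connect=object)
    with ThreadPoolExecutor(max_workers=tasks) as executor:
        list(executor.map(lambda _: pool.run(write_partition), range(tasks)))
    return state["peak"]
```

With 32 writers and a pool of 4, the peak returned by `simulate(32, 4)` never exceeds 4, which is the property that keeps task count from translating directly into Postgres connections.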

Brittle upsert logic

Standard Spark JDBC leaves keyed refresh logic to custom job code. Teams end up in foreachPartition loops or staging tables. PGStyx provides UPSERT in a single .save().

Replaces: staging-table merge rituals
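
For context, a keyed upsert in PostgreSQL is usually expressed as INSERT ... ON CONFLICT. The Python sketch below builds such a statement from a column list and merge keys. It shows the shape of the SQL a team would otherwise maintain by hand; it is not a description of what PGStyx generates internally. The `build_upsert` helper is hypothetical, the `%s` placeholders assume a psycopg-style driver, and the statement assumes a unique index exists on the merge keys.

```python
def build_upsert(table, columns, merge_keys):
    """Return a parameterized INSERT ... ON CONFLICT statement for PostgreSQL."""
    missing = [k for k in merge_keys if k not in columns]
    if missing:
        raise ValueError(f"merge keys not in columns: {missing}")

    def quote(name):
        # double-quote identifiers, escaping embedded quotes
        return '"' + name.replace('"', '""') + '"'

    cols = ", ".join(quote(c) for c in columns)
    params = ", ".join(["%s"] * len(columns))
    keys = ", ".join(quote(k) for k in merge_keys)

    # every non-key column is overwritten with the incoming (EXCLUDED) value
    updates = [c for c in columns if c not in merge_keys]
    if updates:
        action = "DO UPDATE SET " + ", ".join(
            f"{quote(c)} = EXCLUDED.{quote(c)}" for c in updates
        )
    else:
        action = "DO NOTHING"

    return (
        f"INSERT INTO {quote(table)} ({cols}) VALUES ({params}) "
        f"ON CONFLICT ({keys}) {action}"
    )
```

Calling `build_upsert("users", ["user_id", "email"], ["user_id"])` yields a single statement that inserts new rows and updates `email` for existing `user_id` values, with no staging table involved.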

Type mapping nightmares

JSONB, UUID, and arrays get mangled as strings or bytes in generic JDBC. PGStyx treats them as first-class citizens with binary fidelity and schema correctness.

Preserves: jsonb · uuid · int[]
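
As a rough picture of what a type map involves, the Python sketch below translates Spark type names (in their `simpleString` form, such as `array<int>` or `decimal(38,10)`) into PostgreSQL column types. Spark has no native JSON or UUID type, so the sketch assumes a per-column hint for those. The `pg_type` function, the `hint` parameter, and the contents of `SCALARS` are hypothetical and do not describe PGStyx's actual mapping.

```python
import re

# Spark simpleString scalar names mapped to PostgreSQL column types.
SCALARS = {
    "string": "text",
    "int": "integer",
    "bigint": "bigint",
    "double": "double precision",
    "boolean": "boolean",
}


def pg_type(spark_type, hint=None):
    """Map a Spark type name to a PostgreSQL type.

    `hint` overrides the mapping for string columns, e.g. 'jsonb' or 'uuid'.
    """
    if hint is not None:
        if spark_type != "string":
            raise ValueError("hints only apply to string columns")
        return hint

    # array<T> becomes T[]; nested arrays recurse
    match = re.fullmatch(r"array<(.+)>", spark_type)
    if match:
        return pg_type(match.group(1)) + "[]"

    # decimal(p,s) keeps its precision and scale as numeric(p,s)
    match = re.fullmatch(r"decimal\((\d+),(\d+)\)", spark_type)
    if match:
        return f"numeric({match.group(1)},{match.group(2)})"

    return SCALARS[spark_type]
```

The point of the hint is that a string column holding JSON or a UUID lands in a `jsonb` or `uuid` column rather than `text`, while arrays and decimals keep their element type and precision.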

Runtime schema drift

A widening column or new attribute should not break the write. PGStyx handles add-column and widening cases at runtime so jobs don't turn into rerun work.

Intercepts: drift before it becomes DDL
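
One way to think about add-column and widening drift is as a diff between the table's current columns and the incoming schema. The Python sketch below plans the ALTER TABLE statements for that diff and refuses anything that is not a known-safe widening. It is a simplified illustration under stated assumptions, not PGStyx's algorithm; `plan_drift` and the `SAFE_WIDENINGS` set are hypothetical.

```python
# Widening pairs treated as safe in this sketch: (current type, incoming type).
SAFE_WIDENINGS = {
    ("smallint", "integer"),
    ("smallint", "bigint"),
    ("integer", "bigint"),
    ("real", "double precision"),
}


def plan_drift(table, current, incoming):
    """Return ALTER TABLE statements that bring `current` in line with `incoming`.

    Both arguments map column name -> PostgreSQL type name.
    """
    statements = []
    for column, new_type in incoming.items():
        old_type = current.get(column)
        if old_type is None:
            # column is new upstream: add it
            statements.append(
                f'ALTER TABLE "{table}" ADD COLUMN "{column}" {new_type}'
            )
        elif old_type == new_type:
            continue
        elif (old_type, new_type) in SAFE_WIDENINGS:
            # type grew in a lossless direction: widen in place
            statements.append(
                f'ALTER TABLE "{table}" ALTER COLUMN "{column}" TYPE {new_type}'
            )
        else:
            # narrowing or unrelated change: fail loudly instead of guessing
            raise ValueError(
                f"unsafe change for {column}: {old_type} -> {new_type}"
            )
    return statements
```

For example, a table with `id integer` receiving `id bigint` plus a new `plan text` column produces one widening statement and one add-column statement, while an `integer` to `text` change raises instead of silently altering the table.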

05 — Plans

Start with the core write path.
Upgrade when the workload gets harder.

Community is usable for commercial work. Pro adds safer schema and validation controls. Enterprise covers tighter security, streaming, and deeper operational requirements.

Community

Free

Start with the core write path: append, overwrite, single-key upsert, pooling, retries, and metrics.

Start with docs →

Pro · Recommended

$899

Use Pro when the question changes from 'can we write?' to 'can we keep this job safe in production?'

Ask about Pro →

Enterprise

$4,999

Enterprise is for streaming workloads, custom certificate material, and deeper operational tuning on stricter platforms.

Ask about Enterprise →

Ship the write path. Stop re-writing it.

Start with the implementation guide and a working first upsert in Scala, Python, or SQL. Move to Pro when the workload asks for composite keys, schema evolution, or TLS.

Read the guide · Compare plans