A Postgres-native sink for Apache Spark

Deterministic Postgres writes, without the JDBC guesswork.

PGStyx replaces Spark's generic JDBC with a connector built for PostgreSQL — keyed upserts, pool-safe concurrency, runtime schema tolerance, and type fidelity for JSONB, UUID, and arrays.

Native UPSERT
Atomic keyed updates without hand-rolled merge logic
Schema safety
Handle widening and add-column drift on the write path
Type fidelity
First-class support for JSONB, UUID, arrays, and precision numerics
Pooling
Internal HikariCP pooling to prevent PostgreSQL connection storms
02 — How it works

One datasource swap. The rest of the job stays the same.

PGStyx is a JVM datasource registered as pgstyx. Your existing df.write call keeps its shape — PGStyx takes over inside that call to handle pooling, upserts, type coercion, and schema drift before the rows hit Postgres.

Source
Your DataFrame
Any Spark DataFrame or temp view. Scala, Python, or SQL — your choice of API.
df.write.format("pgstyx")
rows + schema
Connector
Write-path logic
Pooled connections, keyed upserts, type coercion, schema alignment, retries.
HikariCP · UPSERT · type map · evolve
safe writes
Sink
PostgreSQL
Rows land with full type fidelity. No connection storms, no staging tables, no manual DDL.
jsonb · uuid · int[] · numeric
Atomic: One .save() call. No foreachPartition, no staging-table rituals.
Pool-safe: Bounded connections. Spark parallelism decoupled from Postgres connection limits.
Drift-aware: Evolve at write time. Add-column and widening handled without rerun work.
Faithful: Types round-trip. JSONB, UUID, arrays, and precision numerics stay intact.
03 — Generic JDBC vs PGStyx

Same upsert. Half the code. None of the archaeology.

A realistic production upsert: merge incoming orders into a mutable table by order_id, preserve jsonb metadata, and keep Postgres responsive under parallel Spark writes. First, what you write today. Then, what PGStyx collapses it into.

Generic Spark JDBC — as things are
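// Stage table + merge + manual type handling
import java.util.UUID
import org.apache.spark.sql.functions.{col, to_json}

// Hyphens are not valid in unquoted Postgres identifiers, so swap them out
val staging = "orders_staging_" + UUID.randomUUID.toString.replace("-", "_")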

df.repartition(8)                          // cap parallelism manually
  .withColumn("metadata", to_json(col("metadata"))) // cast jsonb → string
  .write
  .format("jdbc")
  .option("url", url)
  .option("dbtable", staging)
  .option("numPartitions", "8")
  .save()

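// Spark's JDBC reader wraps the "query" option in a subquery, so DML
// cannot run through it; the merge needs a raw JDBC connection instead
val conn = java.sql.DriverManager.getConnection(url)
try {
  val stmt = conn.createStatement()
  stmt.execute(s"""
    -- hand-written merge into orders by order_id
    -- preserve metadata and audit columns
    -- update this every time the shape changes
  """)
  stmt.execute(s"DROP TABLE $staging")   // and hope nothing failed mid-run
} finally conn.close()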
  • Connection storms under partition fan-out
  • Manual jsonb casting, staging cleanup
  • A schema change upstream breaks the SQL
  • Errors leave orphan staging tables
Lines of code: 24
PGStyx — same job
// Native upsert, pooled connections, schema drift absorbed

df.write
  .format("pgstyx")
  .option("url", url)
  .option("dbtable", "orders")
  .option("writeMode", "upsert")
  .option("mergeKeys", "order_id")
  .option("schemaEvolution", "addColumns,widen")
  .save()
  • Internal HikariCP pooling — no connection storms
  • Keyed upsert handled inside the connector
  • JSONB, UUID, arrays preserved with binary fidelity
  • Add-column and widening handled at runtime
Lines of code: 9
04 — Failure modes

Where Spark-to-Postgres jobs usually start hurting.

The connector isn't the problem on day one. It becomes the problem when concurrency rises, tables go mutable, types get awkward, and a teammate merges a column on a Friday afternoon.

01

Connection exhaustion

Spark's JDBC writer opens one connection per partition by default, which can exhaust PostgreSQL's max_connections as concurrency rises. PGStyx decouples tasks from connection limits with internal HikariCP pooling.

Resolves: PG_ERR: too_many_connections
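
To make the decoupling idea concrete, here is a minimal sketch in plain Scala. It is not PGStyx code: BoundedPoolSketch and peakConnections are hypothetical names, and a java.util.concurrent.Semaphore stands in for a real pool such as HikariCP. It only shows that many parallel tasks can share a small, fixed number of connections.

```scala
import java.util.concurrent.{Executors, Semaphore, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for a connection pool: a semaphore caps how many
// tasks may hold a "connection" at once, regardless of task parallelism.
object BoundedPoolSketch {

  /** Runs `tasks` concurrent writers against `poolSize` permits and
    * returns the highest number of permits held at the same time. */
  def peakConnections(tasks: Int, poolSize: Int): Int = {
    val permits  = new Semaphore(poolSize)
    val inUse    = new AtomicInteger(0)
    val peak     = new AtomicInteger(0)
    val executor = Executors.newFixedThreadPool(tasks)

    (1 to tasks).foreach { _ =>
      executor.submit(new Runnable {
        def run(): Unit = {
          permits.acquire()               // block until a connection is free
          try {
            val now = inUse.incrementAndGet()
            peak.accumulateAndGet(now, (a: Int, b: Int) => math.max(a, b))
            Thread.sleep(5)               // simulate writing one batch
          } finally {
            inUse.decrementAndGet()
            permits.release()             // hand the connection back
          }
        }
      })
    }

    executor.shutdown()
    executor.awaitTermination(30, TimeUnit.SECONDS)
    peak.get()
  }
}
```

Calling BoundedPoolSketch.peakConnections(32, 4) runs 32 simulated tasks, yet the returned peak never exceeds 4. That is the property that keeps Spark fan-out below the database's connection limit.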
02

Brittle upsert logic

Standard Spark JDBC leaves keyed refresh logic to custom job code. Teams end up in foreachPartition loops or staging tables. PGStyx provides UPSERT in a single .save().

Replaces: staging-table merge rituals
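
For orientation, a keyed upsert in PostgreSQL usually comes down to INSERT ... ON CONFLICT ... DO UPDATE. The sketch below builds such a statement in plain Scala. UpsertSqlSketch and build are hypothetical helpers for illustration, not PGStyx's API, and the SQL the connector actually emits may differ.

```scala
// Hypothetical helper, not PGStyx's API: builds the PostgreSQL statement
// that a keyed upsert typically boils down to.
object UpsertSqlSketch {

  // Quote an identifier, doubling any embedded double quotes.
  private def q(ident: String): String =
    "\"" + ident.replace("\"", "\"\"") + "\""

  def build(table: String, columns: Seq[String], mergeKeys: Seq[String]): String = {
    require(mergeKeys.nonEmpty, "at least one merge key is required")
    require(mergeKeys.forall(k => columns.contains(k)), "merge keys must be table columns")

    val cols         = columns.map(q).mkString(", ")
    val placeholders = columns.map(_ => "?").mkString(", ")
    val keys         = mergeKeys.map(q).mkString(", ")

    // Every non-key column is overwritten with the incoming (EXCLUDED) value.
    val updates = columns
      .filterNot(c => mergeKeys.contains(c))
      .map(c => s"${q(c)} = EXCLUDED.${q(c)}")

    val action =
      if (updates.isEmpty) "DO NOTHING"
      else "DO UPDATE SET " + updates.mkString(", ")

    s"INSERT INTO ${q(table)} ($cols) VALUES ($placeholders) ON CONFLICT ($keys) $action"
  }
}
```

PostgreSQL requires a unique index or constraint on the conflict columns, so a statement like this assumes order_id is already a primary key or unique key on the target table.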
03

Type mapping nightmares

JSONB, UUID, and arrays get mangled as strings or bytes in generic JDBC. PGStyx treats them as first-class citizens with binary fidelity and schema correctness.

Preserves: jsonb · uuid · int[]
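
One common workaround with plain JDBC is to bind the value as text and cast it in the statement, for example ?::jsonb. The sketch below generates such placeholders. TypedPlaceholderSketch, Column, and the cast list are assumptions made for illustration and say nothing about how PGStyx maps types internally.

```scala
// Hypothetical sketch: choose a parameter placeholder per target column type
// so that text-bound values are cast to the PostgreSQL type on the server.
object TypedPlaceholderSketch {

  // Hypothetical descriptor: column name plus the target PostgreSQL type.
  final case class Column(name: String, pgType: String)

  // Scalar types that a plain string bind does not satisfy without a cast.
  private val needsCast = Set("jsonb", "json", "uuid")

  def placeholder(col: Column): String = {
    val t = col.pgType.toLowerCase
    // Array types (int[], text[], ...) also need an explicit cast from text.
    if (needsCast(t) || t.endsWith("[]")) s"?::$t" else "?"
  }

  def insertSql(table: String, cols: Seq[Column]): String = {
    val names  = cols.map(_.name).mkString(", ")
    val params = cols.map(placeholder).mkString(", ")
    s"INSERT INTO $table ($names) VALUES ($params)"
  }
}
```

With columns order_id (uuid), metadata (jsonb), tags (int[]), and total (numeric), insertSql produces VALUES (?::uuid, ?::jsonb, ?::int[], ?). This is the kind of per-column bookkeeping that job code otherwise carries by hand.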
04

Runtime schema drift

A widening column or new attribute should not break the write. PGStyx handles add-column and widening cases at runtime so jobs don't turn into rerun work.

Intercepts: drift before it becomes DDL
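
As a rough illustration of add-column and widening handling, the sketch below compares an existing table schema with an incoming one and plans the DDL needed to align them. SchemaDriftSketch, plan, and the widening rules are hypothetical. PGStyx's actual rules are not documented on this page.

```scala
// Hypothetical sketch: plan the DDL that aligns a table with incoming data.
object SchemaDriftSketch {

  // Assumed widening rules: each (from -> to) pair loses no information.
  private val widenings: Set[(String, String)] = Set(
    "smallint" -> "integer",
    "smallint" -> "bigint",
    "integer"  -> "bigint",
    "real"     -> "double precision",
    "varchar"  -> "text"
  )

  /** Returns the ALTER statements for added or widened columns, or an error
    * for a change that is neither an add-column nor a known widening. */
  def plan(
      table: String,
      existing: Map[String, String],
      incoming: Seq[(String, String)]
  ): Either[String, Seq[String]] = {
    val steps: Seq[Either[String, Option[String]]] = incoming.map {
      case (name, newType) =>
        existing.get(name) match {
          case None =>
            Right(Some(s"ALTER TABLE $table ADD COLUMN $name $newType"))
          case Some(old) if old == newType =>
            Right(None)                   // unchanged column, nothing to do
          case Some(old) if widenings(old -> newType) =>
            Right(Some(s"ALTER TABLE $table ALTER COLUMN $name TYPE $newType"))
          case Some(old) =>
            Left(s"unsupported change for $name: $old -> $newType")
        }
    }

    // Fail on the first unsupported change; otherwise keep the DDL steps.
    steps.collectFirst { case Left(err) => err } match {
      case Some(err) => Left(err)
      case None      => Right(steps.collect { case Right(Some(ddl)) => ddl })
    }
  }
}
```

Narrowing changes, such as bigint to integer, come back as a Left rather than DDL, which mirrors the page's scope of add-column and widening cases only.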
05 — Plans

Start with the core write path.
Upgrade when the workload gets harder.

Community is usable for commercial work. Pro adds safer schema and validation controls. Enterprise covers tighter security, streaming, and deeper operational requirements.

Community
Free

Start with the core write path: append, overwrite, single-key upsert, pooling, retries, and metrics.

Start with docs
Enterprise
$4,999

Enterprise is for streaming workloads, custom certificate material, and deeper operational tuning on stricter platforms.

Ask about Enterprise

Ship the write path. Stop re-writing it.

Start with the implementation guide and a working first upsert in Scala, Python, or SQL. Move to Pro when the workload asks for composite keys, schema evolution, or TLS.