Tokenized Datalake · Private Preview · v1.0

The data lake your security team approves of.

rjbase.io is a high-performance tokenized data lake platform. Ingest petabytes, tokenize sensitive fields on the way in, and run zero-trust analytics without exposing raw data to any analyst, model, or downstream system.

SOC 2 · TYPE II
HIPAA READY
PCI · TOKENIZED
GDPR · CCPA

8.4 PB indexed across preview clusters
142 µs median tokenized field lookup
99.995% control-plane availability SLO
0 raw values leaving the vault

Why rjbase

A lake that tokenizes by default.

Most lakes solve storage. rjbase.io solves storage and the question your security team asks five minutes later: who can actually see this row?

Field-level tokenization

Every sensitive column is replaced with a deterministic token at ingest. Raw values live in a separate vault that analysts and models never touch.
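
As a rough illustration of what deterministic tokenization at ingest means, here is a minimal Python sketch. It is not rjbase's implementation: the keyed-HMAC scheme, the `tokenize` helper, the `tok_` prefix, the in-memory `vault` dict, and the hard-coded key are all assumptions made for the example. The point it shows is that equal inputs map to equal tokens, so grouping and joining still work on data that never contains the raw value.

```python
import hashlib
import hmac

# Hypothetical key for the sketch; a real deployment would keep key material
# in an HSM rather than in source code.
TOKEN_KEY = b"demo-key-not-for-production"


def tokenize(value: str, domain: str, vault: dict[str, str]) -> str:
    """Return a deterministic token for `value` and record the mapping in `vault`."""
    mac = hmac.new(TOKEN_KEY, f"{domain}:{value}".encode("utf-8"), hashlib.sha256)
    token = f"tok_{mac.hexdigest()[:24]}"
    vault[token] = value  # the raw value is stored only in the vault
    return token


vault: dict[str, str] = {}
incoming = [
    {"customer_id": "alice@example.com", "amount": 40},
    {"customer_id": "bob@example.com", "amount": 15},
    {"customer_id": "alice@example.com", "amount": 60},
]

# Rows as they would land in the lake: the sensitive field is already a token.
lake_rows = [
    {"customer_id": tokenize(r["customer_id"], "pii.customer", vault), "amount": r["amount"]}
    for r in incoming
]

# Aggregation works on tokens alone because equal inputs yield equal tokens.
totals: dict[str, int] = {}
for row in lake_rows:
    totals[row["customer_id"]] = totals.get(row["customer_id"], 0) + row["amount"]
```

In this sketch the two rows for the same customer share one token, so `totals` has two entries even though no raw identifier appears in `lake_rows`.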

Reversible, but only for whom you say

Detokenization is a policy decision, not a query. Define who, where, and why; rjbase enforces it at the engine and audits every reveal.
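
One way to picture policy-gated detokenization with an audit trail is the sketch below. It assumes a much simpler model than a real policy engine: `RevealPolicy`, `Vault`, the service names, and the audit record fields are hypothetical names invented for the example, and only the "who" and "why" parts of a policy are modeled.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class RevealPolicy:
    """Hypothetical policy: which services may reveal tokens in a domain."""
    domain: str
    allowed_services: frozenset[str]
    require_reason: bool = True


@dataclass
class Vault:
    values: dict[str, str]
    policies: dict[str, RevealPolicy]
    audit_log: list[dict] = field(default_factory=list)

    def detokenize(self, token: str, domain: str, service: str, reason: str = "") -> str:
        policy = self.policies.get(domain)
        allowed = (
            policy is not None
            and service in policy.allowed_services
            and (bool(reason) or not policy.require_reason)
        )
        # Every attempt is recorded, whether or not it is allowed.
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "token": token,
            "service": service,
            "reason": reason,
            "allowed": allowed,
        })
        if not allowed:
            raise PermissionError(f"{service} may not reveal tokens in {domain}")
        return self.values[token]


vault = Vault(
    values={"tok_abc": "4111111111111111"},  # well-known test card number
    policies={"pci.pan": RevealPolicy("pci.pan", frozenset({"fraud-service"}))},
)

pan = vault.detokenize("tok_abc", "pci.pan", "fraud-service", reason="chargeback:48391")

try:
    vault.detokenize("tok_abc", "pci.pan", "analyst-notebook", reason="curious")
    denied = False
except PermissionError:
    denied = True
```

The design choice worth noting is that the audit entry is written before the permission check raises, so denied attempts leave a record too.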

Hot, warm & cold in one query

A tiered storage engine routes recent partitions to NVMe, warm partitions to SSD, and the long tail to object storage, so your SQL never has to know.
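
To make tier routing concrete, here is a small Python sketch of how a planner might map daily partitions to storage tiers by age. The thresholds (`HOT_DAYS`, `WARM_DAYS`), the tier labels, and the `tier_for` and `plan_scan` helpers are assumptions for illustration, and the SSD warm tier is taken from the architecture diagram further down the page. The actual routing rules are not described here.

```python
from datetime import date, timedelta

# Assumed age thresholds; real cut-offs would be configurable.
HOT_DAYS = 7
WARM_DAYS = 90


def tier_for(partition_day: date, today: date) -> str:
    """Pick a storage tier for a daily partition based on its age in days."""
    age = (today - partition_day).days
    if age <= HOT_DAYS:
        return "nvme"
    if age <= WARM_DAYS:
        return "ssd"
    return "object"


def plan_scan(start: date, end: date, today: date) -> dict[str, list[date]]:
    """Group the daily partitions in [start, end] by the tier that holds them."""
    plan: dict[str, list[date]] = {}
    day = start
    while day <= end:
        plan.setdefault(tier_for(day, today), []).append(day)
        day += timedelta(days=1)
    return plan


today = date(2024, 6, 30)
# A 30-day query window spans two tiers, but the caller only supplies dates.
plan = plan_scan(today - timedelta(days=29), today, today)
```

With these assumed thresholds, a 30-day window is served partly from NVMe and partly from SSD, and the query author never names a tier.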

Zero-trust access plane

Mutual-TLS service identities, short-lived row-scoped capabilities, and per-query justification — enforced before the planner ever runs.
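
A short-lived, row-scoped capability can be sketched as a signed set of claims that is checked before any planning happens. The Python below is an illustration under stated assumptions only: the claim fields, the `issue_capability` and `check_capability` helpers, and the shared HMAC signing key are invented for the example, and the mutual-TLS identity step is not modeled.

```python
import hashlib
import hmac
import json

# Hypothetical signing key; shown inline only to keep the sketch self-contained.
SIGNING_KEY = b"demo-signing-key"


def _sign(claims: dict) -> str:
    payload = json.dumps(claims, sort_keys=True).encode("utf-8")
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()


def issue_capability(service: str, table: str, row_filter: str,
                     justification: str, ttl_seconds: int, now: float) -> dict:
    """Issue a capability limited to one table, one row filter, and a short lifetime."""
    claims = {
        "service": service,
        "table": table,
        "row_filter": row_filter,
        "justification": justification,
        "expires_at": now + ttl_seconds,
    }
    return {"claims": claims, "sig": _sign(claims)}


def check_capability(cap: dict, table: str, now: float) -> bool:
    """Reject forged, expired, unjustified, or wrong-table capabilities."""
    if not hmac.compare_digest(_sign(cap["claims"]), cap["sig"]):
        return False
    claims = cap["claims"]
    return (
        claims["table"] == table
        and bool(claims["justification"])
        and now < claims["expires_at"]
    )


cap = issue_capability(
    service="fraud-service",
    table="events.payments",
    row_filter="region = 'EU'",
    justification="chargeback:48391",
    ttl_seconds=300,
    now=1000.0,
)

ok_now = check_capability(cap, "events.payments", now=1100.0)    # inside the TTL
ok_later = check_capability(cap, "events.payments", now=1400.0)  # after expiry
```

Passing `now` explicitly keeps the check deterministic; a real service would read a trusted clock instead.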

SQL, Python & gRPC

Use ANSI SQL, dataframe-style Python, or our streaming gRPC API. Every interface shares the same catalog, policy graph, and tokenization vault.

Operates itself

Auto-compaction, partition pruning, vacuum, and bloom-filter rebuilds happen on schedule. No 3 a.m. pages over “the lake” again.

Architecture

Four layers. One control plane.

rjbase.io is built around a single contract: every byte enters via the tokenization layer, and every byte that leaves passes through the policy engine. Storage, compute, and catalog all sit underneath that boundary.

  • Pluggable backends — S3, GCS, Azure Blob, on-prem MinIO
  • Open table format · Iceberg-compatible metadata
  • HSM-backed token vault with rotation-safe identifiers
  • Per-tenant compute pools with hard isolation

// access
SQL · PyClient · gRPC · REST · BI adapter
// policy
RBAC + ABAC · row scoping · column masking · audit ledger
// tokenize
deterministic · format-preserving · HSM vault · rotation
// storage
NVMe hot · SSD warm · Object cold · Iceberg · Parquet

Developer experience

Tokenization is a column type.

Declare which fields are sensitive in your table DDL. rjbase handles vault writes, masking, and detokenization based on the policies you wire up — your application code never changes.

-- Declare a tokenized table
CREATE TABLE events.payments (
  id          UUID     PRIMARY KEY,
  customer_id TEXT     TOKENIZED('pii.customer'),
  card_pan    TEXT     TOKENIZED('pci.pan', format = 'fpe'),
  amount      DECIMAL(12,2),
  ts          TIMESTAMP
) PARTITION BY days(ts);

-- Analysts query the lake; customer_id stays an opaque token,
-- and deterministic tokens still group correctly
SELECT customer_id, SUM(amount) AS total_amount
FROM   events.payments
WHERE  ts >= now() - INTERVAL '30 days'
GROUP BY customer_id;

-- Only the fraud service, with justification, can detokenize
SELECT detokenize(card_pan, reason => 'chargeback:48391')
FROM   events.payments
WHERE  id = '7f8c…';

Now in private preview

Tokenize your data. Keep your governance.

Talk to our team about an evaluation cluster. Engineering partnerships are typically up and running within two weeks.