Tokenized Datalake · Private Preview · v1.0

The data lake your
security team approves of.

rjbase.io is a high-performance tokenized datalake platform. Ingest petabytes, tokenize sensitive fields on the way in, and run zero-trust analytics without surfacing raw data to a single analyst, model, or downstream system.

Request access · See how it works

SOC 2 · TYPE II

HIPAA READY

PCI · TOKENIZED

GDPR · CCPA

8.4 PB

Indexed across preview clusters

142 µs

Median tokenized field lookup

99.995%

Control-plane availability SLO

Raw values leaving the vault

Why rjbase

A lake that tokenizes by default.

Most lakes solve storage. rjbase.io solves storage and the question your security team asks five minutes later: who can actually see this row?

Field-level tokenization

Every sensitive column is replaced with a deterministic token at ingest. Raw values live in a separate vault that analysts and models never touch.
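To make "deterministic token at ingest" concrete, here is a minimal sketch of the idea, not rjbase's implementation. It assumes a keyed HMAC-SHA256 as the token function and uses an in-memory dict as a stand-in for the vault; `NAMESPACE_KEYS`, `tokenize`, and `ingest` are hypothetical names chosen for illustration.

```python
import hashlib
import hmac

# Hypothetical per-namespace key. A real deployment would hold this in an HSM.
NAMESPACE_KEYS = {"pii.customer": b"demo-key-not-for-production"}

# Stand-in for the separate vault: token -> raw value.
vault: dict[str, str] = {}


def tokenize(namespace: str, raw: str) -> str:
    """Return a deterministic token for `raw` and record the mapping in the vault."""
    key = NAMESPACE_KEYS[namespace]
    digest = hmac.new(key, raw.encode("utf-8"), hashlib.sha256).hexdigest()
    token = f"tok_{digest[:32]}"
    vault[token] = raw  # the raw value lives only here
    return token


def ingest(row: dict, sensitive: dict[str, str]) -> dict:
    """Replace each sensitive column with its token; other columns pass through."""
    return {
        col: tokenize(sensitive[col], val) if col in sensitive else val
        for col, val in row.items()
    }


first = ingest({"customer_id": "alice@example.com", "amount": 10},
               {"customer_id": "pii.customer"})
second = ingest({"customer_id": "alice@example.com", "amount": 25},
                {"customer_id": "pii.customer"})
```

Because the token function is deterministic, `first` and `second` carry the same `customer_id` token, which is what lets joins and group-bys keep working on tokenized columns.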

Reversible — for who you say

Detokenization is a policy decision, not a query. Define who, where, and why; rjbase enforces it at the engine and audits every reveal.
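One way to picture "who, where, and why" is as a rule that every reveal request is matched against, with each decision written to an audit log. The sketch below is illustrative only: `RevealPolicy`, `may_reveal`, the `svc-fraud` principal, and the `prod-internal` zone are assumptions, not rjbase's policy language.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RevealPolicy:
    namespace: str      # which token namespace the rule covers
    principal: str      # who may reveal
    network_zone: str   # where the request must originate
    reason_prefix: str  # why: required prefix of the stated reason


# Hypothetical rule: only the fraud service, from inside prod, for chargebacks.
POLICIES = [RevealPolicy("pci.pan", "svc-fraud", "prod-internal", "chargeback:")]
AUDIT_LOG: list[dict] = []


def may_reveal(namespace: str, principal: str, zone: str, reason: str) -> bool:
    """Decide a reveal request and record the decision, allowed or not."""
    allowed = any(
        p.namespace == namespace
        and p.principal == principal
        and p.network_zone == zone
        and reason.startswith(p.reason_prefix)
        for p in POLICIES
    )
    AUDIT_LOG.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "namespace": namespace,
        "principal": principal,
        "zone": zone,
        "reason": reason,
        "allowed": allowed,
    })
    return allowed


fraud_ok = may_reveal("pci.pan", "svc-fraud", "prod-internal", "chargeback:48391")
analyst_ok = may_reveal("pci.pan", "analyst-42", "prod-internal", "curious")
```

Denied requests are logged as well as granted ones, so the audit trail shows attempts, not only successes.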

Hot, warm & cold in one query

A tiered storage engine routes recent partitions to NVMe and the long tail to object storage, so your SQL never has to know.
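A rough sketch of age-based tier routing, assuming day-granular partitions. The 7-day and 90-day cutoffs and the function names are made up for illustration; the page does not state rjbase's actual thresholds or routing rules.

```python
from datetime import date

# Assumed cutoffs, for illustration only.
HOT_DAYS = 7
WARM_DAYS = 90


def tier_for(partition_day: date, today: date) -> str:
    """Pick a storage tier from the partition's age in days."""
    age = (today - partition_day).days
    if age <= HOT_DAYS:
        return "nvme"
    if age <= WARM_DAYS:
        return "ssd"
    return "object"


def plan_scan(partition_days: list[date], today: date) -> dict[str, list[date]]:
    """Group the partitions a query touches by the tier that serves them."""
    plan: dict[str, list[date]] = {"nvme": [], "ssd": [], "object": []}
    for day in partition_days:
        plan[tier_for(day, today)].append(day)
    return plan


today = date(2024, 6, 30)
plan = plan_scan([date(2024, 6, 29), date(2024, 5, 1), date(2023, 1, 1)], today)
```

The query only names a time range; the split across tiers happens in the plan, which is the sense in which the SQL "never has to know".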

Zero-trust access plane

Mutual-TLS service identities, short-lived row-scoped capabilities, and per-query justification — enforced before the planner ever runs.
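As an illustration of a short-lived, row-scoped capability that carries its own justification, here is a minimal sketch using an HMAC-signed claim set. It covers only the capability part, not mutual TLS, and everything in it (`SIGNING_KEY`, the claim names, the 60-second default lifetime) is an assumption rather than rjbase's wire format.

```python
import base64
import hashlib
import hmac
import json
import time

SIGNING_KEY = b"demo-signing-key"  # hypothetical; never hard-code a real key


def issue_capability(service, row_scope, justification, ttl_seconds=60, now=None):
    """Sign a claim set naming the service, its row filter, reason, and expiry."""
    now = time.time() if now is None else now
    claims = {"svc": service, "scope": row_scope,
              "why": justification, "exp": now + ttl_seconds}
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    sig = hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{sig}"


def verify_capability(token, now=None):
    """Return the claims if the token is authentic, unexpired, and justified."""
    now = time.time() if now is None else now
    try:
        payload, sig = token.rsplit(".", 1)
    except ValueError:
        return None
    expected = hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if claims["exp"] <= now or not claims["why"].strip():
        return None
    return claims


cap = issue_capability("svc-fraud", "tenant_id = 'acme'", "chargeback:48391", now=1000)
claims = verify_capability(cap, now=1010)   # inside the 60 s window
expired = verify_capability(cap, now=1061)  # past expiry
```

Running this check before planning means an expired, tampered, or unjustified request is rejected without the query ever being parsed.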

SQL, Python & gRPC

Speak ANSI SQL, dataframe Python, or our streaming gRPC API. Engines share the same catalog, policy graph, and tokenization vault.

Operates itself

Auto-compaction, partition pruning, vacuum, and bloom-filter rebuilds happen on schedule. No 3 a.m. pages over “the lake” again.

Architecture

Four layers. One control plane.

rjbase.io is built around a single contract: every byte enters via the tokenization layer, and every byte that leaves passes through the policy engine. Storage, compute, and catalog all sit underneath that boundary.

Pluggable backends — S3, GCS, Azure Blob, on-prem MinIO
Open table format · Iceberg-compatible metadata
HSM-backed token vault with rotation-safe identifiers
Per-tenant compute pools with hard isolation

// access

SQL · PyClient · gRPC · REST · BI adapter

// policy

RBAC + ABAC · row scoping · column masking · audit ledger

// tokenize

deterministic · format-preserving · HSM vault · rotation

// storage

NVMe hot · SSD warm · Object cold · Iceberg · Parquet

Developer experience

Tokenization is a column type.

Declare which fields are sensitive in your table DDL. rjbase handles vault writes, masking, and detokenization based on the policies you wire up — your application code never changes.

Read the docs

-- Declare a tokenized table
CREATE TABLE events.payments (
  id          UUID     PRIMARY KEY,
  customer_id TEXT     TOKENIZED('pii.customer'),
  card_pan    TEXT     TOKENIZED('pci.pan', format = 'fpe'),
  amount      DECIMAL(12,2),
  ts          TIMESTAMP
) PARTITION BY days(ts);

-- Analysts query the lake; tokenized fields such as customer_id stay opaque
SELECT customer_id, sum(amount) AS total_amount
FROM   events.payments
WHERE  ts >= now() - INTERVAL '30 days'
GROUP BY customer_id;

-- Only the fraud service, with justification, can detokenize
SELECT detokenize(card_pan, reason => 'chargeback:48391')
FROM   events.payments
WHERE  id = '7f8c…';
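The analyst query above groups on customer_id tokens rather than raw identifiers. The short Python sketch below shows why that still gives the right totals when tokens are deterministic. The HMAC-based `token` function and the sample rows are assumptions for illustration, not rjbase internals or real data.

```python
import hashlib
import hmac
from collections import defaultdict
from decimal import Decimal

KEY = b"demo-key"  # hypothetical tokenization key


def token(raw: str) -> str:
    """Deterministic stand-in token: same input always yields the same output."""
    return "tok_" + hmac.new(KEY, raw.encode(), hashlib.sha256).hexdigest()[:16]


def sum_by(rows: list[dict], key_col: str) -> dict:
    """Equivalent of SELECT key_col, sum(amount) ... GROUP BY key_col."""
    totals: dict = defaultdict(Decimal)
    for row in rows:
        totals[row[key_col]] += row["amount"]
    return dict(totals)


raw_rows = [
    {"customer_id": "c-1", "amount": Decimal("10.00")},
    {"customer_id": "c-2", "amount": Decimal("4.50")},
    {"customer_id": "c-1", "amount": Decimal("2.25")},
]
tokenized_rows = [{**r, "customer_id": token(r["customer_id"])} for r in raw_rows]

raw_totals = sum_by(raw_rows, "customer_id")
token_totals = sum_by(tokenized_rows, "customer_id")
```

The group keys differ (tokens instead of identifiers), but the number of groups and the per-group sums are the same, so the analyst gets a correct aggregate without seeing a raw customer ID.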
