The standard pitch for tokenization services is something like this: "Replace sensitive values in your database with random tokens. We hold the originals; you hold the tokens. PCI scope reduced." Then you go to wire it up and discover the catch: the tokens are 32-character base32 blobs, your credit_card column is VARCHAR(16), your billing form runs Luhn validation client-side, and your fraud-detection vendor keys on the BIN. The "drop-in" claim was true only if you were starting from scratch.
The first time we shipped tokenization for a customer, we hit exactly that wall. Their cards table had been around for nine years. Three internal services validated card-shape on read. Their analytics pipeline grouped on BIN. The tokens we issued didn't fit any of it.
The second time, we built the tokens to fit.
## What "format-preserving" actually buys you
Format preservation means the token passes the same shape validation as the original. Concretely:
- A credit card token is 16 digits, passes the Luhn checksum, and preserves the BIN (first 6 digits) and the last 4 digits. Everything in between is random. Your billing form's `luhnCheck()` still returns true. Your fraud vendor still groups on BIN. Your `VARCHAR(16)` column doesn't change shape.
- An SSN token is `NNN-NN-NNNN`. The last 4 are preserved (so you can still display "ending in 1234" to the customer). The first three digits are `9XX`, and that is not arbitrary: it's the regulatory escape hatch.
- An email token is `<random>@example.com`. The domain is preserved (lowercased) so analytics and fraud signals that key on email domain still work. The local part is opaque base32.
Each of these takes a non-trivial implementation. Let's look at the SSN one because it shows the kind of detail you have to get right.
## The SSN trick: 9XX is not real
If you tokenize an SSN with a random 9-digit number, two problems show up immediately. First, your token might collide with a real person's actual SSN. Second, if a token leaks, nobody can tell whether the leaked string is a real SSN or a token, so it has to be treated as real, with all the compliance obligations that implies.
The fix is in the rules: the Social Security Administration does not issue SSNs that begin with 9. The 9XX-XX-XXXX range is reserved for ITINs (individual taxpayer identification numbers), but those have a fixed middle-digit structure that further constrains the space. KnoxCall's SSN tokens always start with 9 and avoid the ITIN middle-digit ranges. The result: a token shaped exactly like an SSN, validated by the same regex, that cannot collide with a real SSN under any circumstance.
If a token leaks: the leaked string is provably not anyone's actual SSN. The compliance posture is much narrower. The display story to the customer ("we tokenized your tax ID, last 4 are still 5678") is unchanged.
## The PAN trick: solving Luhn the lazy way
Luhn is a weighted-sum checksum. The last digit is always whatever digit makes the weighted sum a multiple of 10. If you preserve six digits at the front and four at the back, you have six middle digits to pick. Pick the first five randomly, then solve for the sixth so the whole 16-digit number passes Luhn.
```javascript
// PAN-format token generator (simplified)
function generatePanToken(originalPan) {
  const bin = originalPan.slice(0, 6);   // preserved
  const last4 = originalPan.slice(-4);   // preserved
  const middle5 = randomDigits(5);       // pure random
  const candidate = bin + middle5 + '_' + last4;
  // Solve the bridge digit so the whole 16-digit number passes Luhn:
  return setBridgeDigitForLuhn(candidate);
}
```
The "bridge digit" is always solvable in O(1) — Luhn has exactly one digit value that satisfies the equation for any given context. We never reject and retry; the math hands us the answer.
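The solve can be sketched in full, assuming the `_` placeholder convention from the generator above. The reason exactly one digit always works: the Luhn doubling map `d -> (2d > 9 ? 2d - 9 : 2d)` is a bijection on 0-9.

```javascript
// Standard Luhn: walking from the right, double every second digit,
// subtract 9 on overflow, and require sum % 10 === 0.
function luhnCheck(pan) {
  let sum = 0;
  for (let i = 0; i < pan.length; i++) {
    let d = Number(pan[pan.length - 1 - i]);
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

// Solve the one digit (marked "_") that makes the whole string pass Luhn.
function setBridgeDigitForLuhn(candidate) {
  const pos = candidate.indexOf("_");
  let sum = 0;
  for (let i = 0; i < candidate.length; i++) {
    if (i === pos) continue; // the bridge contributes nothing yet
    let d = Number(candidate[i]);
    if ((candidate.length - 1 - i) % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  const needed = (10 - (sum % 10)) % 10; // value the bridge must contribute
  const doubled = (candidate.length - 1 - pos) % 2 === 1;
  // Undoubled position: face value. Doubled position: invert the doubling map.
  const digit = doubled ? [0, 5, 1, 6, 2, 7, 3, 8, 4, 9][needed] : needed;
  return candidate.slice(0, pos) + digit + candidate.slice(pos + 1);
}
```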
The result: 100,000 distinct tokens per (BIN, last4) pair. That is more than enough: the BIN+last4 combination already narrows the entropy meaningfully, and the use case is "store this token in your DB, look it up later, detokenize when you need to charge the card." Collisions within a BIN+last4 pair are caught by a `UNIQUE(tenant_id, vault_id, token)` constraint and resolved by trying a fresh random middle.
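The retry shape looks like this, with `insertToken` and `generate` as hypothetical stand-ins for the DB insert and the PAN generator:

```javascript
// Collision-handling sketch: the insert enforces the UNIQUE constraint,
// and a failed insert just means "roll a new random middle".
function issueToken(insertToken, generate, maxAttempts = 5) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const token = generate();
    if (insertToken(token)) return token; // UNIQUE constraint was satisfied
  }
  throw new Error("token space exhausted for this BIN+last4 pair");
}
```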
## Encryption and rotation
Format preservation handles the read-and-write story. Encryption and rotation handle the lifecycle.
Each vault has its own AES-256-GCM key managed via KnoxCall's Crypto Keys primitive. Tokens carry the key version they were encrypted under (e.g. `v1`). When you rotate the vault:
- A new key version becomes active.
- New tokens encrypt under the new version.
- Existing tokens still decrypt under their original version. No re-encryption of stored data needed.
This matters because the alternative — re-encrypting every token in your database after a key rotation — is a brutally expensive operation that you will defer indefinitely the first time you have to schedule a four-hour read-write maintenance window for it. The right answer is to make rotation cheap, and that's what versioned ciphertext does.
## Cryptographic erasure
The hard part of GDPR right-to-be-forgotten and HIPAA records-management is finding every place a customer's data lives. Database, application logs, backup snapshots, downstream analytics extracts, BI tool caches. You can never be sure you got it all. The cryptographic answer changes the question.
If a customer's data is encrypted under a key version, and you destroy the key version, every copy of that ciphertext — wherever it ended up — becomes permanently unreadable. The bytes still exist; the bytes are now noise. You haven't deleted the data; you have made the data inaccessible to anyone, forever, including yourself.
KnoxCall Vaults makes this a one-click operation: destroy vault → cryptographic erasure of every token in it. The audit row stays. The token strings stay (in case anyone has a referencing FK). The plaintext is gone.
This is the closest the storage industry has to a "DELETE that actually means it." For data that lives across systems you don't control — think backup tapes off-site, BI tools that cache, snapshots in someone else's S3 — it's the only honest answer to "did you delete my data?"
## What you put in a vault
The honest answer: anything that's expensive to leak and cheap to look up. The starter cases:
- Credit card numbers (PAN format). Replace the primary value in your `cards` table. Detokenize at charge-time only. PCI auditors love this.
- SSNs / tax IDs (SSN format). US-only; the 9XX trick depends on SSA conventions. For non-US tax IDs, use generic format.
- Customer emails (email format). Especially in analytics warehouses where the marketing team needs to group by domain but has no business reason to see local-parts.
- Anything else (generic format). Free-text PII, JSON blobs, API tokens you want to manage centrally, document IDs you want to be able to revoke per-customer.
## Try it
Free tier on KnoxCall ships with one vault and 1,000 tokens — generic format only. Starter tier ($19/mo) adds another four vaults but keeps format-preserving formats locked. Pro tier ($99/mo) unlocks the format-preserving formats (PAN, SSN, email) and bumps you to 1M tokens. Enterprise removes the cap.
If you're already on KnoxCall: Protect → Vaults → New Vault. Pick the format. The Playground tab on the detail page lets you tokenize and detokenize without writing any code.
Deeper docs at `wiki/essentials/vaults`. The implementation lives in `src/vaults/`, and the format generators are in `src/vaults/token-formats.ts` if you want to read the bridge-digit solver in eight lines of TypeScript.