Designing a Stateless SSH Certificate Authority
Most certificate authorities maintain sessions, track state, and require careful coordination between replicas. The BeaconSSH certificate server does none of this. Every request is independently authenticated, authorized, and signed — then forgotten.
This post explains how that works and why it’s a deliberate security decision, not a limitation.
What Stateless Means Here
The certificate server doesn’t maintain login sessions. There are no session cookies, no JWTs for the server itself, no refresh tokens. Each signing request arrives with an OAuth access token, and the server validates that token against the OAuth provider on every single request.
This seems expensive. It is. Every request incurs a round-trip to GitHub’s OAuth endpoint, adding 100–200 ms of latency. But here’s what you get in return:
- No session hijacking surface. There’s no session token to steal, forward, or replay.
- Instant revocation. If a user’s GitHub access is revoked, their next certificate request fails immediately. No cache to wait out.
- No replica coordination. Any instance of the certificate server can handle any request. There’s no session affinity, no sticky routing, no shared session store.
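As a rough sketch of what per-request validation can look like, the Python below asks the provider who owns the token and keeps nothing afterwards. It assumes GitHub's REST `/user` endpoint serves as the identity check. The function name `validate_token` and the injectable `opener` parameter are illustrative, not BeaconSSH's actual code.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional

# Assumption: the provider answers "who owns this token?" at this URL.
# GitHub's REST API does so at /user; other providers use other endpoints.
USERINFO_URL = "https://api.github.com/user"


def validate_token(
    access_token: str,
    opener: Callable = urllib.request.urlopen,
    timeout: float = 5.0,
) -> Optional[str]:
    """Return the provider login that owns the token, or None if it is rejected.

    Nothing is cached or stored: every call is a fresh round-trip.
    """
    request = urllib.request.Request(
        USERINFO_URL,
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/vnd.github+json",
        },
    )
    try:
        with opener(request, timeout=timeout) as response:
            body = json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 401:
            return None  # token revoked, expired, or never valid
        raise  # any other provider error propagates, so no certificate is issued
    return body.get("login")
```

Passing `opener` as a parameter keeps the function testable without network access. In normal use the default `urllib.request.urlopen` performs the real call.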
The Request Lifecycle
Every signing request follows the same path:
1. Receive request (OAuth token + public key + principals)
2. Validate token → call OAuth provider
3. Extract identity → map to internal user
4. Check user enabled → PostgreSQL lookup
5. Evaluate policy → are these principals allowed?
6. Sign certificate → use active CA private key
7. Record audit event → write to PostgreSQL
8. Return certificate → done, forget everything
Step 2 is where the statelessness pays off. There’s no “is this user already logged in?” check. There’s no “refresh the session.” It’s a clean, independent verification every time. The server doesn’t know or care whether this user made a request 30 seconds ago.
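The eight steps can be sketched as a single handler whose only inputs are the request and a set of injected dependencies. Every name here (`SigningRequest`, `Dependencies`, `handle`, `Denied`) is hypothetical, and the hooks stand in for the OAuth call, the PostgreSQL lookups, and the CA signer, whose real interfaces this post does not show.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Set


class Denied(Exception):
    """Raised when any check in the lifecycle rejects the request."""


@dataclass(frozen=True)
class SigningRequest:  # step 1: everything the server receives
    access_token: str
    public_key: str
    principals: Sequence[str]


@dataclass(frozen=True)
class Dependencies:
    # Hypothetical hooks standing in for the real integrations.
    validate_token: Callable[[str], Optional[str]]   # step 2: provider round-trip
    map_identity: Callable[[str], Optional[str]]     # step 3: login -> internal user
    is_enabled: Callable[[str], bool]                # step 4: PostgreSQL lookup
    allowed_principals: Callable[[str], Set[str]]    # step 5: policy
    sign: Callable[[str, Sequence[str]], str]        # step 6: active CA key
    audit: Callable[[str, Sequence[str]], None]      # step 7: PostgreSQL write


def handle(req: SigningRequest, deps: Dependencies) -> str:
    login = deps.validate_token(req.access_token)
    if login is None:
        raise Denied("token rejected by provider")
    user = deps.map_identity(login)
    if user is None:
        raise Denied("no internal user for this identity")
    if not deps.is_enabled(user):
        raise Denied("user disabled")
    if not req.principals:
        # An empty list would pass the subset check below vacuously.
        raise Denied("no principals requested")
    extra = set(req.principals) - deps.allowed_principals(user)
    if extra:
        raise Denied(f"principals not allowed: {sorted(extra)}")
    cert = deps.sign(req.public_key, req.principals)
    deps.audit(user, req.principals)
    return cert  # step 8: the handler keeps nothing about this request
```

The handler has no fields and no module-level variables, so two consecutive calls share nothing except what the dependencies themselves persist.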
Why Not Cache OAuth Tokens?
The most common pushback on this design is: “why not cache valid tokens for 60 seconds? You’d cut latency in half.”
This approach seems reasonable, but fails because of the revocation window. If you cache tokens with a 60-second TTL, and a user’s access is revoked at second 5, they can still obtain SSH certificates for the remaining 55 seconds. For a system that grants access to production infrastructure, that’s 55 seconds of unauthorized certificate issuance — each certificate then valid for its full lifetime.
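The math works against you, and it adds rather than multiplies: a certificate issued in the last second of the cached window still runs its full 15-minute lifetime. With 55 seconds of cached validity plus a 15-minute certificate lifetime, access can persist for almost 16 minutes after the user should have been locked out.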
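Read as a timeline, the worst-case exposure is the remaining cache window plus the certificate lifetime. A small sketch using the figures above (60-second TTL, revocation at second 5, 15-minute lifetime); the function name is illustrative:

```python
from datetime import timedelta


def worst_case_exposure(
    cache_ttl: timedelta,
    cert_lifetime: timedelta,
    revoked_after: timedelta,
) -> timedelta:
    """How long after revocation a freshly issued certificate can still be valid.

    revoked_after is the time between the token being cached and the revocation.
    """
    # The cached entry keeps passing validation until its TTL runs out.
    issuance_window = max(cache_ttl - revoked_after, timedelta(0))
    # A certificate issued at the end of that window lives its full lifetime.
    return issuance_window + cert_lifetime


with_cache = worst_case_exposure(
    timedelta(seconds=60), timedelta(minutes=15), timedelta(seconds=5)
)
without_cache = worst_case_exposure(
    timedelta(0), timedelta(minutes=15), timedelta(seconds=5)
)
print(with_cache, without_cache)
```

With no cache the issuance window is zero, so the only remaining exposure is the lifetime of certificates issued before the revocation.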
No caching. Every request hits the provider.
CA Key Management
The certificate server maintains one or more CA key pairs, but only one is active for signing at any time.
This distinction exists for rotation. When you rotate:
- Generate new key pair, mark as active
- Old key becomes inactive — no new certs signed with it
- Both public keys published in the CA bundle
- Hosts trust both during the transition
- After all old-key certificates expire, remove old public key from bundle
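The overlapping trust window is critical. An atomic CA swap would instantly invalidate every outstanding certificate. Established SSH sessions would stay up, because sshd checks the certificate only at authentication time, but every new connection would be refused until the user obtained a certificate from the new CA. Every user would need one at the same moment, and that load spike alone could take down the server.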
CA rotation is a public key problem, not a private key problem. The private key switch is instant. The trust transition on hosts must be gradual.
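One way to model the overlap is a keyring that retires the old key at rotation time and keeps its public key in the bundle until the last certificate it could have signed has expired. This is a sketch under assumptions: `CAKey`, `CAKeyring`, and the `max_cert_lifetime` field are hypothetical names, and private key handling is left out entirely.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class CAKey:
    key_id: str
    public_key: str                        # OpenSSH public key line
    retired_at: Optional[datetime] = None  # None while the key is active


@dataclass
class CAKeyring:
    max_cert_lifetime: timedelta
    keys: List[CAKey] = field(default_factory=list)

    def rotate(self, new_key: CAKey, now: datetime) -> None:
        """Retire the current active key and make new_key the only signer."""
        for key in self.keys:
            if key.retired_at is None:
                key.retired_at = now
        self.keys.append(new_key)

    def active(self) -> CAKey:
        """The single key allowed to sign; raises if that invariant is broken."""
        (key,) = [k for k in self.keys if k.retired_at is None]
        return key

    def bundle(self, now: datetime) -> List[str]:
        """Public keys hosts should trust at time `now`.

        A retired key stays trusted until the last certificate it could have
        signed (issued right at retirement) has expired.
        """
        return [
            k.public_key
            for k in self.keys
            if k.retired_at is None or now < k.retired_at + self.max_cert_lifetime
        ]
```

The signer switches in one step inside `rotate`, while `bundle` shrinks only after the overlap has passed, which mirrors the split between the instant private key switch and the gradual trust transition.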
Scaling Without State
Because there’s no state to coordinate, horizontal scaling is straightforward:
- Put multiple certificate server instances behind a load balancer
- Any instance can handle any request
- PostgreSQL handles audit log writes (with standard replication)
- Redis (optional) tracks active certificates — but isn’t authoritative
If an instance crashes mid-request, nothing is lost. The client retries. The next instance handles it. There’s no orphaned session, no lock to release, no state to recover.
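On the client side, that retry can be as simple as trying the next instance. The sketch below models each instance as a callable that either returns a certificate or raises `ConnectionError`; the function name and the round-robin order are assumptions for illustration, since in practice the load balancer picks the instance.

```python
from typing import Callable, Optional, Sequence


def request_certificate(
    instances: Sequence[Callable[[], str]],
    attempts: int = 3,
) -> str:
    """Try instances in turn until one returns a certificate."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        instance = instances[attempt % len(instances)]
        try:
            return instance()
        except ConnectionError as err:
            # No session to resume: the retry is a complete, fresh request.
            last_error = err
    raise RuntimeError("all attempts failed") from last_error
```

A crash after signing but before the response reaches the client may leave an extra audit entry and an unused certificate behind. Neither needs cleanup, because the retried request depends on neither.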
The Tradeoff
Statelessness isn’t free:
- Higher latency — every request pays the OAuth validation cost
- Provider dependency — if GitHub’s OAuth endpoint is down, no certificates are issued
- No request batching — each request is fully independent, no amortization
For BeaconSSH, these tradeoffs are acceptable. Certificate issuance isn’t a hot path — it happens once every few minutes per user, not thousands of times per second. The security properties (instant revocation, no session hijacking, no replica coordination) outweigh the latency cost.
Summary
The stateless design means: no sessions to steal, no caches to poison, no replicas to coordinate, and no stale state to reconcile. Every request stands alone. The certificate server behaves like a pure function: token in, certificate out, nothing remembered beyond the audit record.
The next post examines what happens after the certificate is issued — why short-lived credentials change the security model of SSH access.