Island of Misfit Startups: Part II (KesslerGym)
Don’t train for World War III. Train for a Tuesday in LEO with bad data.
Bona Fides: I learned all of this by building a world-class simulation system for FDA-regulated safety-critical medical systems. The physics of metabolism are decently modeled and understood. (Not totally, but “ok”.) Noise from sensors, actuators, etc., was all reasonably modeled. But real systems on real people, in the real world? Man, shit just happens. Stuff you’re not even able to name, or understand. The “wtf was that right THERE in the trace?” So we captured the wtf moments, replayed them, systematized them, weaponized them against the design we hoped would survive. The algorithm that survived 500,000 in-silico pivotal trials? That’s the winner.
LEO Overcrowding
This is a right now problem. The FCC just approved 7,500 additional Starlink Gen2 satellites, bringing SpaceX’s authorized constellation to 15,000, with another 15,000 still under review. Half of those new satellites must be operational by December 2028. That’s less than three years to double the active constellation.
Starlink has plenty of company, too. Amazon’s Kuiper has authorization for 3,236 satellites. OneWeb is building out. Rivada is coming. The Chinese mega-constellations are deploying. Everyone is racing to fill the shell.
The conjunction rate scales superlinearly with object count. Double the satellites, and you more than double the potential collision events, because every new object can interact with every existing object. We’re heading toward thousands of conjunction assessments per day.
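Back-of-the-envelope (a toy sketch, not a conjunction model): the pool of potential interactions grows with the number of object pairs, n(n−1)/2, so doubling the population roughly quadruples what you have to screen. The catalog sizes below are illustrative, not real counts.

```python
# Toy illustration only: potential interaction pairs grow ~quadratically with
# object count. Real conjunction screening depends on altitude shells, inclination,
# and density, but the combinatorics alone make the point.

def interaction_pairs(n_objects: int) -> int:
    """Distinct pairs of objects that could, in principle, conjunct: n*(n-1)/2."""
    return n_objects * (n_objects - 1) // 2

for n in (10_000, 20_000, 40_000):  # illustrative catalog sizes, not real counts
    print(f"{n:>6,} objects -> {interaction_pairs(n):>13,} pairs to screen")
# 10,000 objects ->    49,995,000 pairs
# 20,000 objects ->   199,990,000 pairs   (double the objects, ~4x the pairs)
# 40,000 objects ->   799,980,000 pairs
```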
The new Starlink authorization includes orbital shells as low as 340km. Lower altitude means lower latency for internet service and faster deorbit for dead birds. It also means denser traffic, more conjunctions, and tighter reaction windows. The FCC’s order acknowledges the debris risk: they’ve granted themselves authority to pause deployments if collision thresholds are exceeded. Regulators are hedging because they know the math is getting ugly.
Meanwhile, 95% of the objects up there are silent. Zombies, dead satellites, rocket bodies, orbital… crud. Debris from that time China tested an anti-satellite weapon and created 3,500 new trackable objects. These things don’t broadcast. But they do tumble! And drift. And the data we have on them is “meh” at best. The public catalog runs 8-24 hours stale, TLEs are Keplerian approximations of non-Keplerian orbits, and the tracking infrastructure was designed for hundreds of objects, not tens of thousands.
Traffic has outgrown ground-based ops. Thus: autonomy isn’t a feature request. It’s the only architecture that survives.
Decisioning Must Move to the Edge
Collision avoidance for satellites currently works like this: ground stations receive tracking data, crunch the numbers, identify potential conjunctions, and uplink maneuver commands to the spacecraft. Human operators review the analysis. Decisions flow up and down a communication chain with latency measured in minutes to hours.
This worked when LEO was sparse, but I don’t believe it will survive what’s coming.
The math is mathy. Ground-based collision avoidance requires: downlink of tracking data, processing time, human review, command generation, and uplink of maneuver instructions. The issue isn’t the RF’s 200 milliseconds on a good day. It’s those pesky, slow humans. The ops team needs to validate the maneuver. When conjunctions happen weekly, that’s fine. When conjunctions happen hundreds of times per day across a constellation, you cannot keep humans in that loop.
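To put rough numbers on it (every figure below is an assumption for illustration, not measured from any operator), the human and batch-processing stages swamp the RF link by orders of magnitude:

```python
# Illustrative ground-loop timing budget. Every number is an assumption made up
# for this sketch; the point is the shape, not the values: humans and batch
# processing dominate, not the RF link.

ground_loop_s = {
    "downlink tracking data":            0.1,     # RF one-way, good day
    "orbit determination / screening":   300.0,   # batch processing (assumed)
    "human review & maneuver approval":  3600.0,  # ops team in the loop (assumed)
    "command generation & validation":   120.0,   # (assumed)
    "uplink maneuver instructions":      0.1,
}

total = sum(ground_loop_s.values())
print(f"total loop: ~{total / 60:.0f} minutes")
for stage, seconds in ground_loop_s.items():
    print(f"  {stage:<36} {seconds:>7.1f} s  ({100 * seconds / total:4.1f}%)")
```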
The only answer is autonomy, using onboard control systems. (::cough:: AI.) Edge compute has gotta make (at least some) maneuver decisions without phoning home and hoping Uncle Eddie picks up.
The Training Problem
You need autonomous agents making maneuver decisions at the edge. How do you train them?
The standard Reinforcement Learning approach: build a physics simulation, define reward functions, run millions of episodes, deploy the trained policy. Every serious space AI company is doing some version of this.
The problem is that every training environment is lovely, clean and beautiful.
The sim gives you crisp sensor readings, clean orbital elements, data that arrives on time and means what it says. You train your agent until it’s gorgeous on the benchmarks. Then you deploy it into the actual world, where the sensor just glinted off a tumbling rocket body and hallucinated three objects, the TLE is eight hours stale, and the ground link dropped out at the worst possible moment.
Your gorgeous agent rapidly deorbits because it learned to trust its inputs. In the real world, trusting your inputs will get you spontaneously disassembled. Physics is a bitch.
Gym Rats
Most Reinforcement Learning (RL) gyms are bad gyms that fail in predictable ways.
Operator-defined scenarios. Someone sat down and decided what situations the agent should encounter. Conjunction at 45 degrees. Debris field with known density. Sensor failure mode #3. The scenarios reflect what the designer thought was important, which is (if you’re lucky) a subset of what’s actually important. The agent masters the test and fails the deployment because reality didn’t read the spec.
Brittle data assumptions. The gym assumes data arrives on time, means what it says, and comes from calibrated sensors. Real data is late, ambiguous, and comes from sensors that drift, saturate, and hallucinate. The agent trained on clean data has no immune system for dirty data.
Frankenstein world models. The physics engine came from one source, the sensor models from another, the communication model from a third. Each component was validated independently against its own assumptions. Nobody validated them together. The seams between subsystems are where reality leaks in, and the gym papered over every seam.
Every world model is an abstraction, whether it’s white-box closed-form math, black-box neural net, or anything in between. Abstractions work by leaving things out. But sometimes, it turns out that what you ignored was more important than what you kept.
In statistical terms: every model has residuals. The residuals are the gap between what the model predicts and what actually happens. Standard practice is to minimize residuals during model fitting and then ignore them during deployment. The residuals become rounding error, noise, someone else’s problem.
But residuals aren’t noise. Residuals are everything your model doesn’t understand. Sensor glint off a tumbling rocket body is a residual. Atmospheric drag variation from a solar storm is a residual. Catalog cross-tagging error is a residual. Ground station dropout during a conjunction is a residual.
When your agent encounters a situation dominated by residuals, it has no training for what’s happening. It learned the model. Reality sent the residuals.
A good gym treats residuals as first-class citizens. It doesn’t just model the physics; it models the ways the physics model fails or is incomplete. It doesn’t just simulate sensors; it simulates sensor failures, sensor drift, sensor lies. It doesn’t train for the world as modeled; it trains for the gap between model and world.
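One concrete way to do that, in the same spirit as capturing and replaying the “wtf” moments: build a bank of residuals from real logs (observed minus modeled) and replay them on top of simulated observations during training. A minimal sketch; the array shapes and names here are assumptions, not anyone’s actual data format.

```python
import numpy as np

# Minimal sketch: treat residuals (observed minus modeled) as a training
# distribution instead of discarded noise. Shapes and names are assumptions.

def build_residual_bank(logged_obs: np.ndarray, model_pred: np.ndarray) -> np.ndarray:
    """Everything the world did that the model didn't predict, one row per event."""
    return logged_obs - model_pred

def replay_residual(clean_obs: np.ndarray, bank: np.ndarray,
                    rng: np.random.Generator) -> np.ndarray:
    """Perturb a simulated observation with a real, previously observed residual."""
    return clean_obs + bank[rng.integers(len(bank))]

# Usage: build the bank once from ops logs, then sample it on every training step
# so the agent never sees the idealized model output alone.
rng = np.random.default_rng(0)
bank = build_residual_bank(rng.normal(size=(1000, 6)), rng.normal(size=(1000, 6)))
noisy_obs = replay_residual(np.zeros(6), bank, rng)
```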
That’s the difference. Bad gyms teach agents to trust the model. Good gyms teach agents to survive when the model is wrong.
The Startup: KesslerGym
The physics is easy. We can simulate orbital mechanics to arbitrary precision. The hard problem is epistemic fog: the rot, the lies your sensors tell you, the lies the catalog tells you, the lies that accumulate because space is full of silent actors and everyone’s data is stale.
KesslerGym is a “messy reality” simulation environment for training autonomous space traffic management agents. It’s a gym that treats residuals as the curriculum.
1. The Rot Engine
A “grey agent” trying to confuse you (as opposed to a “red agent” trying to kill you). It injects (a rough code sketch follows the list):
- Zombie TLEs (objects that don’t exist, or existed eight hours ago)
- Sensor ghosting (glare, radar blooms, streak saturation)
- Cross-tagging errors (two objects become one, one becomes three)
- Stale timestamps (data that says “now” but means “yesterday”)
- Communication dropouts at the worst possible moment
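Here’s roughly what a rot layer could look like as an observation wrapper around a training environment. A sketch only: the corruption rates, observation layout, and class name are assumptions, not KesslerGym’s actual interface.

```python
import numpy as np
import gymnasium as gym

class RotEngine(gym.ObservationWrapper):
    """Grey-agent corruption layer: stale data, sensor ghosts, dropouts.
    A sketch under assumed rates and a flat array observation, not a real API."""

    def __init__(self, env, p_stale=0.10, p_ghost=0.02, p_dropout=0.05, seed=0):
        super().__init__(env)
        self.p_stale, self.p_ghost, self.p_dropout = p_stale, p_ghost, p_dropout
        self.rng = np.random.default_rng(seed)
        self._last_obs = None

    def observation(self, obs):
        obs = np.asarray(obs, dtype=float).copy()
        if self._last_obs is not None and self.rng.random() < self.p_stale:
            obs = self._last_obs.copy()          # stale: says "now", means "earlier"
        if self.rng.random() < self.p_ghost:
            obs += self.rng.normal(0.0, obs.std() + 1e-6, obs.shape)  # ghost/bloom
        if self.rng.random() < self.p_dropout:
            obs = np.zeros_like(obs)             # comm dropout, worst possible moment
        self._last_obs = obs.copy()
        return obs
```

Wrap any environment in it and the agent never trains on pristine observations again; the corruption schedule itself becomes the curriculum.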
Building a physics simulator is commoditized. Building a sensor failure simulator requires data on how sensors actually fail. What does a star tracker glitch look like during a solar storm? How does radar bloom behave at different aspect angles? You need proprietary datasets from satellite operators showing real anomalies, but operators don’t share that data.
This is the chicken-and-egg problem. You need failure data to build the Rot Engine, but you need the Rot Engine to attract operators who have failure data. The bootstrap probably requires partnership with an insurer or government program that can compel data sharing as a condition of coverage or licensing. Ugly, but solvable.
2. Edge-Native Constraints
The simulation doesn’t just penalize collisions. It penalizes compute.
Your agent runs in a virtualized environment that mimics the actual hardware it’ll deploy on: Nvidia Jetson Orin, Xilinx Versal, whatever. If your brilliant maneuver decision requires more FLOPS than your 15-watt edge processor can deliver, you fail. If it requires a ground link that takes 200ms and you’ve only got 90 seconds, you fail.
Most academic RL ignores this entirely. The agent trains on a cluster of A100s and then someone asks “wait, this has to run on a radiation-hardened chip that’s five generations behind a consumer GPU?” The policy doesn’t fit. The latency blows the budget. Back to square one.
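One way to make the budget concrete in training (a sketch; the 15-watt figure echoes the prose, while the latency budget and policy interface are hypothetical): time every decision and fail it, hard, when it blows the latency or power envelope.

```python
import time

# Sketch: charge each decision against an assumed edge budget. The 15 W figure
# echoes the prose; the 0.5 s latency budget and policy interface are assumptions.

LATENCY_BUDGET_S = 0.5
POWER_BUDGET_W = 15.0

def act_within_budget(policy, obs, est_power_w):
    """Return (action, info); action is None when the decision blows its budget."""
    t0 = time.perf_counter()
    action = policy(obs)
    latency = time.perf_counter() - t0
    over_budget = latency > LATENCY_BUDGET_S or est_power_w > POWER_BUDGET_W
    return (None if over_budget else action), {"latency_s": latency,
                                               "over_budget": over_budget}
```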
The agent must be pragmatic and parsimonious. Correct and cheap. Safe and possible.
3. The Skepticism Reward
The reward function doesn’t just reward “avoided collision.” It rewards epistemic humility.
The agent that says “The data is sketchy, I’m going to coast and watch” probably does better. The agent that fires thrusters based on a single noisy sensor reading gets penalized, even if it happens to work this time.
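In reward-shaping terms, that might look something like this (a sketch; the weights, the collision penalty, and the uncertainty threshold are all assumptions):

```python
def skepticism_reward(collided: bool, delta_v_used: float, input_uncertainty: float,
                      uncertainty_threshold: float = 0.5) -> float:
    """Sketch of a shaped reward: survival dominates, fuel costs, and acting
    aggressively on low-confidence data is penalized even when it works out."""
    reward = -1000.0 if collided else 1.0       # survival term
    reward -= 0.1 * delta_v_used                # delta-V is finite
    if delta_v_used > 0 and input_uncertainty > uncertainty_threshold:
        reward -= 5.0                           # fired thrusters on sketchy data
    return reward
```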
We’re training for robustness under uncertainty.
Competence from Confusion
The agent trained on perfect data learns “when conjunction probability exceeds X, execute maneuver Y.” That works great until the conjunction probability estimate is garbage because the underlying TLE was garbage, and now your agent just burned a week’s worth of delta-V dodging a ghost.
KesslerGym trains agents that question their own inputs. That recognize when data quality has degraded below the threshold where aggressive action is justified. That understand the difference between “I know where this object is” and “I have a stale estimate with error bars the size of a football field.”
The agent that survives 10,000 years in KesslerGym isn’t the one with the best physics model. It’s the one that learned when to distrust its physics model.
Business
Four customer segments, in order of realistic adoption:
1. The Insurers (AXA XL, Allianz, Marsh)
They’re terrified. “Autonomous maneuver” is an actuarial nightmare. If some startup’s AI fires thrusters wrong and triggers a collision cascade, who’s liable? How do you price the risk of Kessler Syndrome?
Insurers are the natural early adopters because they’re already in the business of quantifying risk. KesslerGym gives them what they need: “This agent has survived N simulated years with zero at-fault collisions under realistic noise conditions.” Now you can price the policy. Pilot programs here first.
2. The Regulators (FCC, Space Force, FAA-AST)
They want to say yes to autonomous systems. The industry is screaming for it. But regulators need cover. They need a standard they can point to when something eventually goes wrong. “The operator was KesslerGym certified” becomes the get-out-of-jail card for the regulator who approved the license.
This happens as shadow standards first. Informal guidance. “We recommend…” Eventually it hardens into requirement.
3. The Hardware Makers (Aethero, Little Place Labs, etc.)
They’re selling edge compute for space: Jetsons and Versals hardened for radiation. The pitch is “autonomous everything.” The customer asks: “Cool, what software do I run on this?”
They’ve got the hammer. They need nails. KesslerGym-trained policies become the reference architecture that proves their hardware can run real autonomous safety. Once insurers and regulators are on board, hardware vendors adopt to complete their stack.
4. The Mega-Constellations (Starlink, Kuiper, OneWeb)
They will resist until compelled. They always do.
Mega-constellations have internal teams, proprietary systems, and no desire to submit to external validation. They’ll argue their sims are better, their data is richer, their engineers are smarter. They’ll be right about some of it.
But when insurers refuse to underwrite without certification, when regulators make it a license condition, when a competitor’s collision gets blamed on “inadequate testing”… then they’ll adopt. Kicking and screaming, but they’ll adopt.
The Sneaky Part
The gym is a means to an end. The real business is the Kessler Score, the certification that lets autonomous systems operate and be insured.
Selling simulation software is a tough, niche SaaS business. Selling certification to insurers is a billion-dollar business. If Allianz says “We won’t insure your constellation unless your collision avoidance agent has a Kessler Score of 95+,” you own the market.
The strategic play:
Phase 1: Partner with insurers and regulators to establish the Kessler Score as the benchmark for autonomous compliance. Pilot programs. Shadow standards. “We recommend…” language that hardens into requirement over time.
Phase 2: Run certification as a service. Operators submit their agents; you run them through the gauntlet; you issue the score. The test itself remains a black box.
Phase 3: Defend the moat by keeping the certification suite proprietary.
That last point matters. The temptation is to open-source the Rot Engine to drive adoption and build community. This is a mistake. If students have the exact answer key, they overfit to the test. Starlink forks your open-source engine, trains their agents against it internally, and games the certification without ever developing true robustness.
The Rot Engine can have public components: basic sensor noise models, standard failure modes, documented corruption patterns. But the certification suite must include proprietary rot that operators never see. Undocumented failure modes. Novel corruption patterns. The equivalent of Moody’s keeping their exact rating methodology confidential.
You’re not selling software. You’re selling trust. And trust requires that the test be harder than what operators can train against on their own.
This is the Underwriters Laboratories play for space autonomy. UL doesn’t make toasters; UL provides the stamp that lets toasters be sold. Moody’s doesn’t issue bonds; Moody’s provides the rating that lets bonds be priced. KesslerGym doesn’t build satellites or train agents; KesslerGym provides the certification that lets autonomous systems operate and be insured.
The Generalized Wedge
KesslerGym is one instance of a broader pattern. The startup wedge is weaponizing epistemic pragmatism.
Everywhere machine automation meets the real world, there’s a gap between the training abstraction and deployment reality. Clean gym, dirty world. The agent learns the model; reality sends the residuals. This is true for:
- Autonomous vehicles: trained on millions of miles of simulation, deployed into weather, construction zones, and humans who don’t behave like the behavior model predicted
- Medical systems: trained on curated datasets, deployed into populations with different demographics, comorbidities, and equipment
- Financial trading: trained on historical data, deployed into regime changes, flash crashes, and correlated failures the backtest never saw
- Industrial robotics: trained in digital twins, deployed into factories where the gripper wears down, the lighting changes, and someone left a coffee cup in the workspace
- Agricultural automation: trained on idealized crop rows, deployed into fields with rocks, variable soil, and weather the sim didn’t model
In every domain, the same dynamic plays out. Operators train on clean data because clean data is what they have. They deploy into messy reality because messy reality is where the work happens. The gap is where failures live.
Simulations will always be abstractions, and abstractions will always leave things out. The business is certifying robustness to the gap: verifying that the agent knows what it doesn’t know, that it fails gracefully when the model breaks down, that it has trained against residuals rather than just against the model.
Whoever owns that certification in a given domain owns the trust infrastructure for automation in that domain. Space is the beachhead. The pattern generalizes.
I don’t have the astrodynamics background. I don’t have the relationships with the space insurers or the regulatory bodies. And frankly, I don’t want to spend five years lobbying the Space Force.
But someone should build this. The window is now. The mega-constellations are deploying. The autonomous systems are coming whether we’re ready or not. And every agent being trained right now is being trained on clean data, which means every agent being deployed is fragile. Gorgeous on benchmarks, blind to reality.
The first company that can certify “this AI knows what it doesn’t know” owns the trust layer for autonomous systems.
Space is just where the pain is sharpest right now. The wedge works everywhere.
This is Part II of “The Island of Misfit Startups.” Part I was LensReader, on fixing the thermodynamics of attention. The series explores startup architectures built on uncomfortable truths the market hasn’t priced in yet. Or maybe stuff I just don’t have time to do. Maybe both.