I've been building EabhaSeq for a while now, and there were probably five or six moments where I was genuinely convinced I'd cracked it. The model would produce something that looked right, I'd run the numbers, and some metric I hadn't been tracking closely enough would come back completely wrong. That cycle went on for months, and each iteration taught me something I hadn't understood before about what "correct" actually means for synthetic genomic data.

The project started as a synthetic cfDNA generator for NIPT research, backed by an Emergent Ventures grant, and the core idea is still the same: generate biologically accurate cell-free DNA so that labs can train and validate prenatal testing algorithms without needing large cohorts of patient samples. I've written about the broader project here and the open-source release here, so I won't repeat all of that. This post is specifically about the hardest technical problem I hit, what caused it, and how I fixed it; and then about what happened when I got my hands on real clinical data and tested everything end-to-end for the first time.

Dozens of tests, dozens of failures

The early versions of the model were statistically convincing. GC content distributions matched real cfDNA, fragment length profiles looked right, nucleotide transition patterns were plausible. I kept finding reasons to think the next version would be the one that held together. It never quite did.

The real test, the one that exposes everything, is alignment. You take your generated fragments, run bwa mem against GRCh38, and see how many of them actually map to the genome at reasonable mapping quality. For a long time I was getting rates under 10% for correct chromosome placement. The model had learned the texture of DNA without learning its geography. It was producing convincing-looking ATCG strings that were, genomically speaking, gibberish, and I spent a lot of time debugging why before I understood the root cause clearly enough to fix it properly.
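As a rough illustration of what that check involves (this is a sketch, not the actual EabhaSeq validation code), here's how you'd score the SAM output of bwa mem for chromosome placement and mapping quality. Per the SAM spec, column 3 is the reference name and column 5 is MAPQ; encoding the expected chromosome in the read name is a convention I've invented for this example.

```python
def placement_stats(sam_lines, mapq_cutoff=30):
    """Return (correct-chromosome rate, MAPQ >= cutoff rate) over SAM records."""
    total = placed = high_mapq = 0
    for line in sam_lines:
        if line.startswith("@"):  # skip header records
            continue
        fields = line.rstrip("\n").split("\t")
        qname, rname, mapq = fields[0], fields[2], int(fields[4])
        # Hypothetical convention: read name starts with the source chromosome.
        expected_chrom = qname.split("|")[0]  # e.g. "chr21|frag0001"
        total += 1
        if rname == expected_chrom:
            placed += 1
        if mapq >= mapq_cutoff:
            high_mapq += 1
    return placed / total, high_mapq / total

# Tiny made-up SAM excerpt: one fragment mapped back to its source
# chromosome at MAPQ 60, one mismapped to chr1 at MAPQ 5.
sam = [
    "@SQ\tSN:chr21\tLN:46709983",
    "chr21|f1\t0\tchr21\t10400000\t60\t50M\t*\t0\t0\tACGT\tFFFF",
    "chr21|f2\t0\tchr1\t20000000\t5\t50M\t*\t0\t0\tACGT\tFFFF",
]
chrom_rate, mapq_rate = placement_stats(sam)
```

On a real run you'd stream `bwa mem ref.fa fragments.fa` (or a `samtools view` of the BAM) into something like this instead of a hand-written list.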

The fix ended up being what I now call reference-conditioned generation, and once I understood the problem precisely the solution was reasonably clean to implement. After that fix, mapping quality jumped to over 94% at MAPQ ≥ 30, chromosome placement came out above 94%, and the distributional properties the model had always been good at remained intact. The full validation methodology is on the EabhaSeq site if you want the numbers in detail.

Then I got real T21 data

The alignment fix was the technical unlock, but the real validation came from getting access to actual karyotype-confirmed clinical samples. The dataset I used was from Lun et al. 2014 (PRJNA215135), 26 real clinical cfDNA samples including 6 confirmed trisomy 21 cases and 20 euploid controls. I had not touched this data at any point during training or development. It was purely a held-out test.

I trained a classifier entirely on synthetic data, zero real patient samples in training, and ran it blind on the 26 real clinical samples. The TSTR AUC (train synthetic, test real) came out at 0.87, and 0.98 when excluding one genuinely anomalous sample with a fetal fraction so low (~1-2%) that no chromosome-fraction NIPT method would detect it. Four of the six T21 cases in that dataset would fail standard NIPT by z-score, with z-scores of 2.47, 1.57, 1.29, and 0.86. The synthetic-trained model correctly ranked all four above the euploid samples (excluding the extreme anomaly). The permutation test on 1,000 random label shuffles came back at p = 0.002, so the result isn't noise.
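For context on what "fail standard NIPT by z-score" means: the classic test compares a sample's chromosome 21 read fraction against a panel of euploid references and calls trisomy when z crosses a threshold, conventionally z ≥ 3. A minimal sketch, with made-up panel fractions rather than the Lun et al. values:

```python
from statistics import mean, stdev

def chr21_zscore(sample_fraction, reference_fractions):
    """Standard chromosome-fraction z-score against a euploid panel."""
    mu = mean(reference_fractions)
    sigma = stdev(reference_fractions)  # sample std of the reference panel
    return (sample_fraction - mu) / sigma

# Hypothetical euploid chr21 fractions (illustrative numbers only).
euploid_panel = [0.0130, 0.0131, 0.0129, 0.0132, 0.0130, 0.0128]

# A T21 sample with low fetal fraction: chr21 is elevated, but not by much.
z = chr21_zscore(0.0133, euploid_panel)
```

A case like this sits below the usual z ≥ 3 call threshold even though the chr21 fraction is genuinely elevated, which is the low-fetal-fraction failure mode behind those four missed cases.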

The augmentation results were equally interesting. When you add synthetic data to a training set with only a handful of real samples, which is the reality for rare aneuploidies, the improvement is substantial. With just 3 real training samples, adding synthetic data improved AUC by +19.7 points, from 0.661 to 0.858, and won 88% of 100 random stratified splits. The variance also drops, which matters as much as the mean for clinical applications. WisecondorX, the clinical-grade NIPT tool used in European laboratories, showed 100% detection of T21, T18, and T13 at 8% fetal fraction using a purely synthetic reference panel.
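For anyone wanting to reproduce this kind of comparison: AUC here is just the rank statistic, the probability that a randomly chosen positive scores above a randomly chosen negative (equivalent to a scaled Mann-Whitney U). A stdlib-only sketch with invented classifier scores, not the real per-patient values:

```python
def auc(pos_scores, neg_scores):
    """Probability a random positive outranks a random negative; ties count half."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical classifier outputs for illustration only.
t21_scores = [0.91, 0.74, 0.68, 0.55]
euploid_scores = [0.60, 0.42, 0.38, 0.35, 0.20]
tstr_auc = auc(t21_scores, euploid_scores)
```

With small sample counts like 26, this pairwise formulation is fast enough that there's no need for sklearn, and it makes the "ranked all four above the euploid samples" claim above concrete: that ranking is exactly what pushes AUC toward 1.0.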

Full results, methodology, and per-patient breakdowns are at eabhaseq.com/validation and eabhaseq.com/validation/details.

I've been building this for long enough that getting those numbers back felt genuinely surprising. The approach works, and it works at a level that's useful for real clinical algorithm development, not just as a research curiosity.


The rest of this article covers the specific technical problem that caused the alignment failures and exactly how reference-conditioned generation solves it. It's a fairly deep dive into the architecture, the dual-level conditioning mechanism, and how the generation loop works in practice.

The rest is for subscribers. If you want to follow along with the technical details of how EabhaSeq works under the hood, and get future posts on the validation work, the 107 conditions the platform now covers, and what real clinical validation looks like, subscribing gets you all of that.
