How to build a Betfair market data recorder in Java — capturing live Streaming API data to disk for backtesting, replay, and strategy analysis.
Every serious Betfair strategy starts the same way: you have a theory about market behaviour, and you need data to test it against. The problem is that live market data is ephemeral — once a race goes in-play and settles, that pre-race ladder movement is gone unless you captured it. I learned this the hard way after spending two weeks building a weight-of-money signal, then realising I had no historical data to validate it against. The solution is a market data recorder: a component that sits alongside your live trading framework, writes raw Streaming API messages to disk, and lets you replay them later.
A recorder built properly will store months of data at modest cost, replay it at arbitrary speeds, and serve as the foundation for every strategy you develop.
The design is deliberately simple: Streaming API → message queue → writer thread → disk. The IO thread receives raw JSON lines from the ESA socket and drops them onto a LinkedBlockingQueue. A dedicated writer thread drains the queue to a buffered file writer. Keeping IO and write concerns on separate threads means a slow disk write never stalls the stream.
@Component
public class MarketDataRecorder {

    // A raw stream line, tagged with the market it belongs to and the receipt timestamp.
    public record RecordedMessage(String marketId, long timestampMs, String rawJson) {}

    private final BlockingQueue<RecordedMessage> writeQueue = new LinkedBlockingQueue<>(50_000);
    private final ExecutorService writerExecutor = Executors.newSingleThreadExecutor();

    @PostConstruct
    public void start() {
        writerExecutor.execute(this::writeLoop);
    }

    public void record(String marketId, String rawJson) {
        long timestampMs = System.currentTimeMillis();
        // offer() drops the message if the queue is full rather than blocking;
        // losing a line under extreme backlog beats stalling the stream's IO thread.
        writeQueue.offer(new RecordedMessage(marketId, timestampMs, rawJson));
    }

    private void writeLoop() {
        while (!Thread.currentThread().isInterrupted()) {
            RecordedMessage msg = null;
            try {
                msg = writeQueue.take();
                getWriter(msg.marketId()).writeLine(msg.timestampMs(), msg.rawJson());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag so the loop exits
            } catch (IOException e) {
                log.error("Write failure for market {}", msg.marketId(), e);
            }
        }
    }
}
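The recorder above leans on a getWriter(marketId) helper that isn't shown. Here is a minimal sketch, assuming a configurable dataDir field, the per-day directory layout described below, and the MarketFileWriter class that follows. Only the single writer thread touches the map, so a plain HashMap is safe:

private final Map<String, MarketFileWriter> writers = new HashMap<>();

private MarketFileWriter getWriter(String marketId) throws IOException {
    MarketFileWriter writer = writers.get(marketId);
    if (writer == null) {
        LocalDate today = LocalDate.now();
        Path dayDir = dataDir.resolve(today.toString());
        Files.createDirectories(dayDir);
        writer = new MarketFileWriter(dayDir, marketId, today);
        writers.put(marketId, writer);
    }
    return writer;
}

@PreDestroy
public void stop() throws IOException, InterruptedException {
    writerExecutor.shutdownNow();                         // interrupts take(), ending the loop
    writerExecutor.awaitTermination(5, TimeUnit.SECONDS); // let the last write finish
    for (MarketFileWriter writer : writers.values()) {
        writer.close(); // writes the gzip trailer so the file is readable
    }
}

Closing every writer on shutdown matters: a GZIPOutputStream that is never finished leaves the file without its gzip trailer, and whatever is still sitting in the buffers is lost.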
Call recorder.record(marketId, rawLine) directly from your stream read loop, before any parsing. Record the raw JSON, not your parsed model — if you later decide you need a field you didn’t previously parse, the data is still there.
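In the read loop that looks roughly like the sketch below. extractMarketId and changeHandler are hypothetical stand-ins for whatever your framework already provides; note that one change message can touch several markets, in which case you might record the line once per market it mentions.

String line;
while ((line = reader.readLine()) != null) {
    recorder.record(extractMarketId(line), line); // hypothetical cheap id scan; record first
    changeHandler.onLine(line);                   // then hand off to normal parsing/dispatch
}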
Each market gets its own gzip-compressed NDJSON file. NDJSON (newline-delimited JSON) is trivially writable and readable, and gzip compression brings file sizes down by roughly 85% without any meaningful complexity overhead.
public class MarketFileWriter implements Closeable {

    private final GZIPOutputStream gzip;
    private final BufferedWriter writer;

    public MarketFileWriter(Path outputDir, String marketId, LocalDate date) throws IOException {
        // e.g. 2026-03-15_1-234567890.ndjson.gz (dots in the market id swapped for dashes)
        String filename = date + "_" + marketId.replace(".", "-") + ".ndjson.gz";
        Path file = outputDir.resolve(filename);
        this.gzip = new GZIPOutputStream(new FileOutputStream(file.toFile()), 65_536);
        this.writer = new BufferedWriter(new OutputStreamWriter(gzip, StandardCharsets.UTF_8), 65_536);
    }

    public void writeLine(long timestampMs, String rawJson) throws IOException {
        // Prepend the receipt timestamp so replay can reconstruct timing
        writer.write(timestampMs + " " + rawJson);
        writer.newLine();
    }

    @Override
    public void close() throws IOException {
        writer.flush(); // push buffered characters through to the gzip stream
        gzip.finish();  // write the gzip trailer so the file isn't truncated
        writer.close();
    }
}
The timestamp prefix is critical. Without it, replay has no way to know the elapsed time between messages, and your backtested latency calculations are meaningless.
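A recorded line therefore looks like this (schematic, payload abridged):
1742034561234 {"op":"mcm","id":2,"clk":"...","mc":[{"id":"1.234567890", ...}]}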
Organise files by date and market ID. A typical directory layout:
data/
  2026-03-15/
    2026-03-15_1-234567890.ndjson.gz   (horse racing pre-race)
    2026-03-15_1-234567891.ndjson.gz
  2026-03-16/
    ...
The replay component reads files back in order, publishing messages to your existing strategy pipeline with timing that mirrors the original:
public class MarketDataReplayer {

    private final MarketStateCache cache;
    private final List<StrategyListener> listeners;

    public void replay(Path file, double speedMultiplier) throws IOException, InterruptedException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(
                        new FileInputStream(file.toFile())), StandardCharsets.UTF_8))) {
            String line;
            long prevTimestamp = -1;
            while ((line = reader.readLine()) != null) {
                int space = line.indexOf(' ');
                if (space < 0) continue; // skip malformed lines
                long timestamp = Long.parseLong(line.substring(0, space));
                String rawJson = line.substring(space + 1);
                if (prevTimestamp > 0) {
                    long gapMs = (long) ((timestamp - prevTimestamp) / speedMultiplier);
                    // Sleep to mirror the original pacing, but cap at 5s so long
                    // quiet spells (heartbeats, suspensions) don't stall the replay.
                    if (gapMs > 0 && gapMs < 5_000) {
                        Thread.sleep(gapMs);
                    }
                }
                prevTimestamp = timestamp;
                cache.applyRawChange(rawJson);
                listeners.forEach(l -> l.onMarketUpdate(cache));
            }
        }
    }
}
The speedMultiplier parameter lets you run at 10x or 50x for fast iteration during strategy development, then drop to 1x for a realistic final validation pass.
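A driver for a full recording day might look like the sketch below; note that replay now declares InterruptedException because of the Thread.sleep, so the caller does too.

// Replay every market recorded on one day, in filename order, at 10x speed.
// (Enclosing method declares throws IOException, InterruptedException.)
Path day = Path.of("data/2026-03-15");
try (Stream<Path> files = Files.list(day)) {
    for (Path file : files.sorted().toList()) {
        replayer.replay(file, 10.0);
    }
}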
A busy recording day — covering all UK and Irish horse racing plus a few football markets — produces around 2–4 GB of compressed data. That’s manageable, but it compounds. A few housekeeping practices keep it under control:
Only record markets you’re actually interested in. Filter by event type, country, and minimum liquidity before subscribing on the stream. There’s no value in archiving obscure trotting races if your strategy only targets flat racing.
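The event-type and country cut can go straight into the stream's marketFilter; a minimum-liquidity cut isn't a stream filter field, so it generally happens upstream via the REST listMarketCatalogue call, after which you subscribe by explicit marketIds. A sketch of the subscription message (eventTypeId 7 is horse racing):

// marketSubscription limited to GB and Irish horse racing WIN markets (sketch).
String subscription = """
    {"op":"marketSubscription","id":2,
     "marketFilter":{"eventTypeIds":["7"],"countryCodes":["GB","IE"],"marketTypes":["WIN"]},
     "marketDataFilter":{"fields":["EX_BEST_OFFERS","EX_LTP","EX_MARKET_DEF"],"ladderLevels":3}}
    """;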
Implement a rolling retention policy. Anything older than 90 days gets deleted automatically unless flagged for long-term storage:
@Scheduled(cron = "0 0 3 * * *") // 3am daily (Spring cron is: sec min hour day month weekday)
public void pruneOldFiles() {
    LocalDate cutoff = LocalDate.now().minusDays(retentionDays);
    try (Stream<Path> dirs = Files.list(dataDir)) {
        dirs.filter(p -> {
                try {
                    // Day directories are named by date; anything unparseable is left alone.
                    return LocalDate.parse(p.getFileName().toString()).isBefore(cutoff);
                } catch (DateTimeParseException e) {
                    return false;
                }
            })
            // (the long-term "keep" flag check is omitted here for brevity)
            .forEach(p -> FileUtils.deleteQuietly(p.toFile())); // Commons IO recursive delete
    } catch (IOException e) {
        log.error("Retention sweep failed", e);
    }
}
Most people record EX_BEST_OFFERS and EX_LTP. Here's what else is worth capturing; a marketDataFilter sketch pulling these together follows the list:
EX_TRADED_VOL per runner, not just per market. The per-runner traded volume breakdown lets you reconstruct how money flowed across selections over time — essential for understanding whether large movements were driven by a single runner or represented broad market reappraisal.
EX_MARKET_DEF changes. Market definition messages contain in-play status, race start time, and runner status (active, winner, removed). You need these to correctly identify market phase boundaries in your backtests.
EX_ALL_OFFERS, or at least five ladder levels. Best-price-only data can mislead you about genuine liquidity. EX_ALL_OFFERS captures the full ladder; if that's more data than you want, EX_BEST_OFFERS with ladderLevels set to 5 still lets you reconstruct what the top of the ladder looked like, which is important for modelling realistic execution at various stake sizes.
Message receipt timestamps with millisecond precision. Latency is invisible unless you record it: the receipt timestamp the recorder writes can be compared against the exchange's publish time (the pt field on each change message) to measure feed latency. Over time, that distribution gives you a realistic picture of execution feasibility.
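Pulling those together, the marketDataFilter portion of the earlier subscription sketch might read as follows; swap EX_BEST_OFFERS for EX_ALL_OFFERS if you want the full ladder rather than the top five levels.

// Richer marketDataFilter covering the fields discussed above (sketch).
String marketDataFilter = """
    {"fields":["EX_BEST_OFFERS","EX_TRADED_VOL","EX_LTP","EX_MARKET_DEF"],
     "ladderLevels":5}
    """;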
A final refinement worth considering: write to a plain .tmp file during recording, then compress and rename it on close. This makes it easier to recover partial files after a crash, since a truncated gzip stream is awkward to salvage while plain NDJSON is readable up to its last complete line.
If you're building a Betfair trading framework and want to talk architecture, get in touch.