Use JDK Flight Recorder and Mission Control to profile Java services in production — flame graphs, allocation hotspots, lock contention, and GC analysis with near-zero overhead.
A service starts underperforming and the usual tools tell you nothing useful. CPU is elevated but not maxed out. GC metrics look normal. Response times have crept up by 40% over the past fortnight. Nobody changed anything obvious. In fifteen years of production Java work I’ve seen this pattern more times than I can count — and the answer is almost always something the JVM can show you directly if you ask it the right questions.
JDK Flight Recorder (JFR) and JDK Mission Control (JMC) are the right tools. They’ve shipped free in every JDK since Java 11, the overhead in default mode is under 1%, and together they expose CPU hotspots, allocation pressure, GC behaviour, and lock contention in a way that no metrics dashboard can replicate. At UBS Warburg, profiling the bond pricing engine with an earlier version of these tools uncovered a tight serialisation loop that was throttling throughput on high-volume days. The fix was two lines. The investigation without profiling would have taken weeks.
JFR ships with the JDK — no agent, no library dependency, nothing to install. Three ways to capture a recording:
On-demand from the command line against a running process:
# Find your PID
jps -l
# Start a 2-minute recording with the profile settings
jcmd <pid> JFR.start duration=120s filename=/tmp/app-recording.jfr settings=profile
# Or dump the current continuous recording immediately
jcmd <pid> JFR.dump filename=/tmp/app-dump.jfr
At JVM startup — continuous recording, dump on exit:
java -XX:StartFlightRecorder=duration=0,\
filename=/var/log/jfr/app.jfr,\
settings=profile,\
dumponexit=true \
-jar your-service.jar
duration=0 means record indefinitely. dumponexit=true is the critical flag: it writes the in-memory recording to disk when the JVM exits, which means you capture data right up to a crash. Combine it with maxage=2h,maxsize=512mb for a rolling circular buffer — when an incident occurs, dump immediately and you have the two hours of JVM behaviour leading up to it.
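To make that concrete, here is a sketch of the rolling-buffer setup plus the incident-time dump (file paths are illustrative):
# Rolling circular buffer at startup: keep the most recent 2 hours / 512 MB on disk
java -XX:StartFlightRecorder=disk=true,maxage=2h,maxsize=512m,\
dumponexit=true,filename=/var/log/jfr/app-exit.jfr \
    -jar your-service.jar
# During an incident, snapshot the buffer without stopping the recording
jcmd <pid> JFR.dump filename=/tmp/incident.jfr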
Two settings profiles matter:
default — roughly 0.1% overhead; safe to run permanently in production.
profile — roughly 1% overhead; much richer data; use for investigation windows of a few minutes.
For an underperforming live service, I always start with a 3-minute profile recording during normal load. That window is almost always enough.
For services where you want to trigger recordings automatically — on a slow-request alert, during a load test, or via an actuator endpoint — drive JFR from code:
import java.io.IOException;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

import jdk.jfr.Recording;
import lombok.extern.slf4j.Slf4j;
import org.springframework.stereotype.Component;

@Component
@Slf4j
public class JfrRecordingService {

    public Path captureRecording(String label, Duration duration) throws IOException {
        Path destination = Path.of(
                "/var/log/jfr/recording-" + label + "-" + Instant.now().toEpochMilli() + ".jfr");

        Recording recording = new Recording();
        recording.setName(label);
        recording.setDuration(duration);
        recording.setDestination(destination);

        // Select the events you care about
        recording.enable("jdk.ExecutionSample").withPeriod(Duration.ofMillis(20)); // CPU method sampling
        recording.enable("jdk.ObjectAllocationInNewTLAB");
        recording.enable("jdk.ObjectAllocationOutsideTLAB");
        recording.enable("jdk.GarbageCollection");
        recording.enable("jdk.GCPhasePause");
        recording.enable("jdk.JavaMonitorEnter").withThreshold(Duration.ofMillis(1));
        recording.enable("jdk.JavaMonitorWait").withThreshold(Duration.ofMillis(10));
        recording.enable("jdk.SocketRead").withThreshold(Duration.ofMillis(20));

        recording.start();
        log.info("JFR recording '{}' started → {}", label, destination);
        return destination;
    }
}
Wire this to a Spring Boot actuator endpoint or a Micrometer alert callback and you get automatic profiling snapshots whenever an SLA threshold is breached — no manual intervention needed.
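One possible wiring, sketched as a custom actuator endpoint that delegates to the service above — the endpoint id and parameters are my own illustration, not a prescribed layout:
import java.io.IOException;
import java.nio.file.Path;
import java.time.Duration;

import org.springframework.boot.actuate.endpoint.annotation.Endpoint;
import org.springframework.boot.actuate.endpoint.annotation.WriteOperation;
import org.springframework.stereotype.Component;

@Component
@Endpoint(id = "jfr")   // expose via management.endpoints.web.exposure.include=jfr
public class JfrCaptureEndpoint {

    private final JfrRecordingService recordingService;

    public JfrCaptureEndpoint(JfrRecordingService recordingService) {
        this.recordingService = recordingService;
    }

    // POST /actuator/jfr with a JSON body such as {"label": "slow-requests", "seconds": 180}
    @WriteOperation
    public String capture(String label, int seconds) throws IOException {
        Path file = recordingService.captureRecording(label, Duration.ofSeconds(seconds));
        return file.toString();
    }
}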
Open the .jfr file in JMC (a separate download — builds are available from adoptium.net or your JDK vendor). Navigate to the Method Profiling view and select Flame Graph.
The flame graph shows call stacks sampled during recording. Width is proportional to the percentage of CPU samples where that method appeared on the stack. Taller stacks are deeper call chains; the methods at the top of the widest towers are your CPU hotspots.
What to look for first:
Wide blocks in your own application code are your primary targets — these represent genuine business logic that is consuming CPU. Ask whether the work is necessary and whether it can be made cheaper.
Wide blocks in framework code (Spring AOP proxies, Jackson serialisation, Hibernate SQL generation) usually indicate you’re doing too many small operations — N+1 query patterns, unnecessary DTO serialisation on every request, or redundant bean lookups inside tight loops.
Flat-topped wide blocks — methods appearing frequently at the very top of shallow stacks — are CPU-bound hotspots. String manipulation, regex compilation, hash computation, and primitive boxing are the usual culprits here.
At Mosaic Smart Data, the flame graph for the analytics ingestion service revealed that String.format inside a high-frequency logging statement was accounting for nearly 8% of all CPU samples. The format string was building a diagnostic message that was never actually written at the configured log level. One log.isDebugEnabled() guard eliminated it entirely.
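The guard itself is a one-liner; here is a sketch with illustrative variable names rather than the original code:
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class TickProcessor {
    private static final Logger log = LoggerFactory.getLogger(TickProcessor.class);

    void record(String tickId, long elapsedMicros) {
        // Before: the message is formatted on every call, even when DEBUG is disabled
        log.debug(String.format("Processed tick %s in %dus", tickId, elapsedMicros));

        // Guarded: the expensive formatting only runs when the level is actually enabled
        if (log.isDebugEnabled()) {
            log.debug(String.format("Processed tick %s in %dus", tickId, elapsedMicros));
        }

        // Simpler still with SLF4J parameterised messages: formatting is deferred
        // until the logger decides to write, so no guard is needed
        log.debug("Processed tick {} in {}us", tickId, elapsedMicros);
    }
}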
CPU profiling tells you where time goes. The Memory tab tells you where allocations come from — and these are often completely different places.
The allocation profile shows which classes are being allocated most, and which call stacks are responsible. Key patterns to diagnose:
byte[] and char[] — almost always string operations, serialisation, or logging. In high-throughput paths, even StringBuilder inside a loop that runs thousands of times per second generates significant allocation pressure.
Boxing types (Integer, Long, Double) — using Map<String, Long> in a hot counter path boxes a primitive on every merge or put. Consider a purpose-built structure for genuinely hot paths:
// Allocates a Long object on every update
Map<String, Long> counters = new HashMap<>();
counters.merge(key, 1L, Long::sum);
// Hot-path alternative: mutate a long[] holder, no boxing
Map<String, long[]> hotCounters = new HashMap<>();
hotCounters.computeIfAbsent(key, k -> new long[1])[0]++;
Short-lived event or DTO objects — in messaging pipelines, wrapping every inbound message in a new object that is immediately deserialized, processed, and discarded generates constant young-generation pressure. If GC frequency is high but heap never grows, this is usually why.
The Betfair streaming framework I run processes hundreds of market updates per second. Early versions created a new RunnerChange object per delta message. Switching to a mutable state model with in-place updates cut allocation rate by ~60% and reduced minor GC frequency noticeably.
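A rough sketch of the in-place update pattern, with hypothetical field names rather than the actual Betfair model:
// One mutable state object per runner, updated in place by the (assumed
// single-threaded) consumer, instead of allocating a new object per delta.
final class RunnerState {
    private double backPrice;
    private double layPrice;
    private long lastUpdatedMillis;

    void applyDelta(double newBackPrice, double newLayPrice, long timestampMillis) {
        this.backPrice = newBackPrice;
        this.layPrice = newLayPrice;
        this.lastUpdatedMillis = timestampMillis;
    }

    double backPrice() { return backPrice; }

    double layPrice() { return layPrice; }
}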
The GC tab in JMC visualises pause duration, heap occupancy, and collection frequency over the recording window. Long pauses in a latency-sensitive service are symptoms worth treating at the source rather than masking with GC tuning.
For genuine low-latency requirements — sub-5ms p99 — switch to ZGC:
java -XX:+UseZGC -Xmx8g -jar your-service.jar
ZGC delivers sub-millisecond pauses at the cost of slightly higher CPU. G1 (the default since Java 9) is a good general-purpose choice; only move to ZGC if GC pauses are measurably contributing to your latency distribution.
The Thread tab and Lock Instances view expose contention. Threads spending significant time in BLOCKED state are waiting on monitors held by other threads — these show up as wide blue segments in the thread timeline. The Lock Instances view shows which monitors have the most contention, how long threads waited, and which call stacks were involved. Common patterns: connection pool starvation, synchronized in-memory caches under concurrent load, and synchronous log appenders.
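The synchronised in-memory cache case has a well-worn fix; here is a sketch with illustrative names:
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class QuoteCache {

    record Quote(String symbol) {}

    // Contended: every reader and writer queues on the same monitor, which
    // shows up as BLOCKED time in the JMC thread timeline
    private final Map<String, Quote> lockedCache = new HashMap<>();

    // Lower contention: ConcurrentHashMap locks per bin, so threads touching
    // different keys no longer wait on each other
    private final ConcurrentHashMap<String, Quote> concurrentCache = new ConcurrentHashMap<>();

    synchronized Quote getLocked(String symbol) {
        return lockedCache.computeIfAbsent(symbol, this::load);
    }

    Quote getConcurrent(String symbol) {
        return concurrentCache.computeIfAbsent(symbol, this::load);
    }

    private Quote load(String symbol) {
        return new Quote(symbol); // placeholder for the real lookup
    }
}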
When you open a recording from a service that’s underperforming and you don’t know where to start, ask four questions:
1. Where is CPU time going? Open the flame graph and look for wide blocks in your own code or in framework code called from it.
2. What is being allocated, and from where? Check the Memory tab for byte[], char[], boxed primitives, and short-lived DTOs.
3. Is the collector keeping up? Check the GC tab for pause durations and collection frequency against your latency budget.
4. Are threads waiting on each other? Check the Thread timeline for BLOCKED time and the Lock Instances view for contended monitors.
Work through those four questions on most recordings and you’ll find something actionable within 20 minutes. Most production JVM performance problems are either a CPU hotspot, an allocation hotspot, a lock contention bottleneck, or a GC configuration mismatch — and JFR surfaces all four.
If you’re dealing with a Java service that’s underperforming and conventional diagnostics aren’t pointing at the cause, get in touch.