Private Client – Java Performance Optimisation & Profiling

The client’s Java service was hitting latency spikes under load that they couldn’t reproduce in development and couldn’t explain in production. The p99 response time was breaching their SLA several times a day. The team had tried the obvious things — increasing heap, adjusting GC settings, scaling horizontally — without lasting improvement.

The root causes were not obvious. They never are in these engagements, which is why the investigation phase matters more than any fix.

Investigation

Load profile first — JMeter load tests were run against a staging environment with production-equivalent data volumes to reproduce the latency pattern reliably before any changes were made. A problem you can’t reproduce consistently is a problem you can’t verify solving.
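To make the p99 figure concrete, here is a minimal sketch of a nearest-rank percentile over recorded response times. JMeter reports percentiles itself, so this helper is purely illustrative; the class name, method name and sample values are hypothetical and not taken from the engagement.

```java
import java.util.Arrays;

/** Illustrative helper: nearest-rank percentile over recorded latencies. */
final class LatencyPercentile {

    /** Returns the sample at percentile p (0 < p <= 100) using the nearest-rank method. */
    static long percentile(long[] latenciesMillis, double p) {
        if (latenciesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0.0 || p > 100.0) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = latenciesMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        // 1-based rank of the smallest sample that covers p percent of all samples.
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Hypothetical samples: mostly fast responses with a few slow outliers.
        long[] samples = new long[200];
        Arrays.fill(samples, 40);
        samples[10] = 800;
        samples[50] = 790;
        samples[120] = 810;
        System.out.println("p50=" + percentile(samples, 50) + "ms, p99=" + percentile(samples, 99) + "ms");
    }
}
```

A median over the same samples would look healthy, which is why tail percentiles rather than averages are the figure to reproduce before changing anything.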

async-profiler under load — CPU and allocation flame graphs captured during sustained load revealed the hottest call paths. Several were surprising: a supposedly “lightweight” validation step was allocating heavily due to regex compilation on every call; a serialisation library was performing reflection on a class hierarchy far deeper than anticipated.
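The regex finding can be sketched as follows. String.matches compiles a fresh Pattern on every call, whereas a Pattern held in a static final field is compiled once and can be shared safely between threads. The validator class and the pattern itself are hypothetical stand-ins, not the client's code.

```java
import java.util.regex.Pattern;

/** Hypothetical validator contrasting per-call and one-off regex compilation. */
final class ReferenceCodeValidator {

    // Hypothetical format: two upper-case letters followed by four digits.
    private static final String REGEX = "[A-Z]{2}\\d{4}";

    // Compiled once at class initialisation; Pattern instances are immutable and thread-safe.
    private static final Pattern PRECOMPILED = Pattern.compile(REGEX);

    /** Allocation-heavy variant: String.matches recompiles the regex on each call. */
    static boolean isValidPerCallCompile(String input) {
        return input != null && input.matches(REGEX);
    }

    /** Cheaper variant: only a short-lived Matcher is created per call. */
    static boolean isValidPrecompiled(String input) {
        return input != null && PRECOMPILED.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidPerCallCompile("AB1234"));
        System.out.println(isValidPrecompiled("AB1234"));
    }
}
```

Both methods return the same answers; the difference only shows up in an allocation profile taken under load.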

Connection pool analysis — HikariCP metrics exposed via Micrometer showed the pool was routinely exhausting under peak load, causing threads to queue waiting for a connection. The pool was sized for average throughput, not burst capacity — a classic configuration error that doesn’t show up in light testing.
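The gap between average and burst sizing can be estimated with Little's Law: connections in use is roughly the request rate multiplied by the time each request holds a connection. The sketch below is a rough worked example only. All rates and hold times are hypothetical, and a real pool needs headroom beyond this estimate because hold times tend to grow once the database is under contention.

```java
/** Rough pool-size estimate from Little's Law; all inputs below are hypothetical. */
final class PoolSizing {

    /**
     * Average number of connections in use, rounded up:
     * requests per second multiplied by seconds each request holds a connection.
     */
    static int connectionsNeeded(double requestsPerSecond, double holdTimeMillis) {
        return (int) Math.ceil(requestsPerSecond * holdTimeMillis / 1000.0);
    }

    public static void main(String[] args) {
        double holdTimeMillis = 20.0;    // assumed time a request keeps its connection
        double averageRate = 200.0;      // assumed average requests per second
        double burstRate = 1500.0;       // assumed peak requests per second

        System.out.println("average load: " + connectionsNeeded(averageRate, holdTimeMillis));
        System.out.println("burst load:   " + connectionsNeeded(burstRate, holdTimeMillis));
    }
}
```

With these assumed figures a pool sized for the average would be several times too small at peak, which matches the queueing behaviour the pool metrics showed.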

GC log analysis — G1GC pause events were correlating with the latency spikes. Heap sizing and region configuration adjustments eliminated the major pauses.
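GC logs give per-pause detail, but a coarser signal is also available in-process through the JDK's standard management beans. The sketch below is not the log analysis used in the engagement; it only shows how cumulative collection counts and times can be sampled, for example before and after a load window. The class and method names are hypothetical, and the reported time is accumulated collection time per collector, not individual pause durations.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

/** Hypothetical helper that samples cumulative GC activity from the platform MXBeans. */
final class GcSnapshot {

    /** Sum of accumulated collection time in ms; collectors reporting -1 (undefined) are skipped. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long time = gc.getCollectionTime();
            if (time > 0) {
                total += time;
            }
        }
        return total;
    }

    /** Sum of collection counts; collectors reporting -1 (undefined) are skipped. */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long timeBefore = totalGcTimeMillis();
        long countBefore = totalGcCount();

        // Churn through short-lived allocations so the collector has work to do.
        long checksum = 0;
        for (int i = 0; i < 200_000; i++) {
            byte[] junk = new byte[4096];
            checksum += junk.length;
        }

        System.out.println("allocated bytes: " + checksum);
        System.out.println("collections during window: " + (totalGcCount() - countBefore));
        System.out.println("GC time during window (ms): " + (totalGcTimeMillis() - timeBefore));
    }
}
```

Sampling these totals alongside request latency gives a quick first check on whether spikes line up with collector activity before turning to the full GC log.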

Fixes

Regex patterns pre-compiled to static fields — allocation in the validation path dropped by 85%
Serialisation moved to a single pre-configured ObjectMapper instance that is reused across calls, eliminating the repeated reflection cost
HikariCP pool size increased and idle timeout tuned to match observed burst traffic patterns
G1GC heap and region size tuned based on allocation rate data from the profiler
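The first two fixes share one idea: pay an expensive setup cost once and reuse the result. Assuming the ObjectMapper mentioned is Jackson's, a reused instance caches the serialisers it builds for each type, so the class hierarchy is inspected once rather than on every call. The sketch below illustrates that caching idea using only the JDK. It is not Jackson's implementation, and every class name in it is hypothetical.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical two-level hierarchy standing in for a deeper real one. */
class BaseRecord {
    long id;
}

class CustomerRecord extends BaseRecord {
    String name;
}

/** Caches the result of walking a class hierarchy so the reflection happens once per type. */
final class FieldCache {

    // Counts how often the expensive hierarchy walk actually runs.
    static final AtomicInteger SCANS = new AtomicInteger();

    private static final ClassValue<List<Field>> FIELDS = new ClassValue<List<Field>>() {
        @Override
        protected List<Field> computeValue(Class<?> type) {
            SCANS.incrementAndGet();
            List<Field> fields = new ArrayList<>();
            // Walk from the concrete class up to, but excluding, Object.
            for (Class<?> c = type; c != null && c != Object.class; c = c.getSuperclass()) {
                for (Field f : c.getDeclaredFields()) {
                    if (!f.isSynthetic()) {
                        fields.add(f);
                    }
                }
            }
            return Collections.unmodifiableList(fields);
        }
    };

    /** Returns the cached field list, computing it on first use for each type. */
    static List<Field> fieldsOf(Class<?> type) {
        return FIELDS.get(type);
    }

    public static void main(String[] args) {
        fieldsOf(CustomerRecord.class);
        fieldsOf(CustomerRecord.class); // served from the cache
        System.out.println("fields: " + fieldsOf(CustomerRecord.class).size());
        System.out.println("hierarchy scans: " + SCANS.get());
    }
}
```

Creating a new mapper for each request would be the equivalent of discarding this cache every time, which is the kind of cost that shows up as a deep reflection stack in a flame graph.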

Outcome

p99 response time dropped from 800ms (breaching SLA) to 95ms. The team received a full write-up of each finding, the fix, and the measurement that confirmed it — not just a diff, but an explanation. Two engineers attended a half-day knowledge transfer session on reading flame graphs and interpreting HikariCP metrics.