Systematic investigation and resolution of latency issues in a Java Spring Boot service — profiling under load to identify non-obvious bottlenecks, resolving them methodically, and bringing p99 response times within SLA. Delivered with full knowledge transfer so the team could sustain the gains.
A Java Spring Boot service was breaching its p99 SLA several times a day under load. The team had tried the obvious fixes — increasing heap, adjusting GC settings, scaling horizontally — without lasting improvement. The root causes were non-obvious, which is exactly why a systematic investigation was needed.
Load profile first: JMeter tests ran against staging with production-equivalent data to reproduce the pattern reliably before any changes. async-profiler CPU and allocation flame graphs, captured under sustained load, identified the hottest call paths. HikariCP pool analysis via Micrometer exposed pool exhaustion under burst load. GC log correlation mapped G1GC pause events to the observed latency spikes.
p99 response time dropped from 800ms to 95ms, an 88% reduction. A full write-up of each root cause, its fix, and the confirming measurement was delivered to the team, along with a half-day knowledge transfer session on flame graph reading and HikariCP metric interpretation.
The client’s Java service was hitting latency spikes under load that they couldn’t reproduce in development and couldn’t explain in production. The p99 response time was breaching their SLA several times a day. The team had tried the obvious things — increasing heap, adjusting GC settings, scaling horizontally — without lasting improvement.
The root causes were not obvious. They never are in these engagements, which is why the investigation phase matters more than any fix.
Load profile first: JMeter load tests were run against a staging environment with production-equivalent data volumes to reproduce the latency pattern reliably before any changes were made. If you can’t reproduce a problem consistently, you can’t verify that you’ve solved it.
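Because the SLA is stated as a p99, it helps to be precise about what that number means when comparing runs before and after a change. The following is a minimal sketch of a nearest-rank percentile calculation. The class name and the sample latencies are hypothetical and purely illustrative; they are not data from the engagement.

```java
import java.util.Arrays;

final class LatencyPercentileSketch {

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * `percentile` percent of all samples are less than or equal to it.
     */
    static long percentile(long[] latenciesMs, int percentile) {
        if (latenciesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (percentile < 1 || percentile > 100) {
            throw new IllegalArgumentException("percentile must be in 1..100");
        }
        long[] sorted = latenciesMs.clone();
        Arrays.sort(sorted);
        // rank = ceil(percentile * n / 100), computed in integer arithmetic
        int rank = (int) (((long) percentile * sorted.length + 99) / 100);
        return sorted[rank - 1];
    }

    public static void main(String[] args) {
        // Hypothetical run: 98 fast requests and 2 slow ones.
        long[] samples = new long[100];
        Arrays.fill(samples, 40);
        samples[10] = 800;
        samples[57] = 820;

        System.out.println("p50 = " + percentile(samples, 50) + " ms");
        System.out.println("p99 = " + percentile(samples, 99) + " ms");
    }
}
```

In this made-up run the median stays at 40 ms while the p99 lands on one of the two slow requests, which is why a tail-latency problem can be invisible in averages and in light testing.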
async-profiler under load: CPU and allocation flame graphs captured during sustained load revealed the hottest call paths. Several were surprising: a supposedly “lightweight” validation step allocated heavily because it compiled a regex on every call, and a serialisation library performed reflection on a class hierarchy far deeper than anticipated.
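The write-up does not reproduce the validation code, so the following is only a sketch of one common shape of the regex problem and its usual remedy. `String.matches` compiles its pattern on every invocation, whereas a `Pattern` held in a static field is compiled once and reused. The class name, method names, and the ID format are hypothetical.

```java
import java.util.regex.Pattern;

final class RegexValidationSketch {

    // Hypothetical ID format: three capital letters, a dash, six digits.
    private static final String ORDER_ID_REGEX = "[A-Z]{3}-\\d{6}";

    // Compiled once at class initialisation and shared; Pattern is thread-safe.
    private static final Pattern ORDER_ID = Pattern.compile(ORDER_ID_REGEX);

    /** Allocation-heavy variant: String.matches recompiles the regex per call. */
    static boolean isValidRecompiling(String id) {
        return id != null && id.matches(ORDER_ID_REGEX);
    }

    /** Same result, but only a Matcher is allocated per call. */
    static boolean isValidPrecompiled(String id) {
        return id != null && ORDER_ID.matcher(id).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidRecompiling("ABC-123456"));
        System.out.println(isValidPrecompiled("ABC-123456"));
        System.out.println(isValidPrecompiled("abc-123456"));
    }
}
```

Both methods return the same answers; the difference shows up only in an allocation flame graph under load, which is consistent with this kind of cost going unnoticed in development.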
Connection pool analysis: HikariCP metrics exposed via Micrometer showed that the pool was routinely exhausted under peak load, leaving threads queued waiting for a connection. The pool had been sized for average throughput rather than burst capacity, a classic configuration error that doesn’t show up in light testing.
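The gap between average and burst sizing can be estimated with Little’s law: the average number of connections in use equals the request arrival rate multiplied by how long each request holds a connection. The sketch below applies that estimate. The request rates and hold time are hypothetical placeholders rather than figures from the engagement, and the result is a starting point for HikariCP’s `maximumPoolSize`, to be confirmed under load.

```java
final class PoolSizingSketch {

    /**
     * Little's law estimate of concurrent connections:
     * arrival rate (requests/s) x time each request holds a connection (s).
     */
    static int connectionsNeeded(double requestsPerSecond, double holdTimeMs) {
        if (requestsPerSecond < 0 || holdTimeMs < 0) {
            throw new IllegalArgumentException("inputs must be non-negative");
        }
        return (int) Math.ceil(requestsPerSecond * holdTimeMs / 1000.0);
    }

    public static void main(String[] args) {
        // Hypothetical workload: 40 ms of connection hold time per request.
        double holdTimeMs = 40;
        int forAverage = connectionsNeeded(200, holdTimeMs); // average traffic
        int forBurst = connectionsNeeded(900, holdTimeMs);   // peak burst

        System.out.println("connections for average load: " + forAverage);
        System.out.println("connections for burst load:   " + forBurst);
    }
}
```

With these assumed numbers, a pool sized for the average (8 connections) would need roughly 36 during the burst, so most requests arriving at peak would queue for a connection. That matches the pattern of threads waiting on the pool described above.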
GC log analysis: G1GC pause events correlated with the latency spikes. Adjusting heap sizing and region configuration eliminated the major pauses.
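The correlation step amounts to checking which slow requests overlap a GC pause in time. The sketch below assumes the pause intervals have already been parsed out of the GC log (for example, one produced with `-Xlog:gc*` on JDK 9 or later) and that request start and end times come from access logs. All class names, method names, and timestamps are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class GcCorrelationSketch {

    /** A closed time interval in epoch milliseconds. */
    static final class Interval {
        final long startMs;
        final long endMs;

        Interval(long startMs, long endMs) {
            if (endMs < startMs) {
                throw new IllegalArgumentException("end before start");
            }
            this.startMs = startMs;
            this.endMs = endMs;
        }

        long durationMs() {
            return endMs - startMs;
        }

        boolean overlaps(Interval other) {
            return startMs <= other.endMs && other.startMs <= endMs;
        }
    }

    /** Returns the slow requests that overlap at least one GC pause. */
    static List<Interval> slowRequestsDuringPauses(
            List<Interval> requests, List<Interval> pauses, long slowThresholdMs) {
        List<Interval> hits = new ArrayList<>();
        for (Interval request : requests) {
            if (request.durationMs() < slowThresholdMs) {
                continue; // fast request, not part of the latency tail
            }
            for (Interval pause : pauses) {
                if (request.overlaps(pause)) {
                    hits.add(request);
                    break; // one overlapping pause is enough to flag it
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Hypothetical timestamps, in milliseconds since the test started.
        List<Interval> requests = List.of(
                new Interval(0, 50),        // fast
                new Interval(100, 900),     // slow, overlaps the pause
                new Interval(2000, 2900));  // slow, no pause nearby
        List<Interval> pauses = List.of(new Interval(400, 650));

        List<Interval> hits = slowRequestsDuringPauses(requests, pauses, 500);
        System.out.println(hits.size() + " of the slow requests overlap a GC pause");
    }
}
```

Slow requests that do not overlap any pause, like the third one here, point to a different cause, such as pool exhaustion or a hot code path, which is why the findings were kept separate and each fix was measured on its own.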
p99 response time dropped from 800ms (breaching SLA) to 95ms. The team received a full write-up of each finding, the fix, and the measurement that confirmed it — not just a diff, but an explanation. Two engineers attended a half-day knowledge transfer session on reading flame graphs and interpreting HikariCP metrics.