Private Client

Java Performance Optimisation & Profiling

Systematic investigation and resolution of latency issues in a Java Spring Boot service — profiling under load to identify non-obvious bottlenecks, resolving them methodically, and bringing p99 response times within SLA. Delivered with full knowledge transfer so the team could sustain the gains.

Java JVM Profiling Spring Boot HikariCP JMeter async-profiler Grafana
Sector: Technology / Java Services
Role: Freelance Java Performance Specialist
Duration: 2024
Team: Trinity Logic Ltd (solo) + client engineering team
Challenge

A Java Spring Boot service was breaching its p99 SLA several times a day under load. The team had tried the obvious fixes — increasing heap, adjusting GC settings, scaling horizontally — without lasting improvement. The root causes were non-obvious, which is exactly why a systematic investigation was needed.

Approach

  • Load profile first: JMeter tests against staging with production-equivalent data, to reproduce the pattern reliably before any changes.
  • async-profiler: CPU and allocation flame graphs under sustained load, to identify the hottest call paths.
  • HikariCP pool analysis: Micrometer metrics exposing pool exhaustion under burst load.
  • GC log correlation: mapping G1GC pause events to observed latency spikes.

Outcome

p99 response time dropped from 800ms to 95ms — an 88% improvement. Full write-up of each root cause, fix, and confirming measurement delivered to the team, plus a half-day knowledge transfer session on flame graph reading and HikariCP metric interpretation.

p99 before: 800ms
p99 after: 95ms
Improvement: 88%
Root causes identified: 4
Technical Deep Dive

Private Client – Java Performance Optimisation & Profiling

The client’s Java service was hitting latency spikes under load that they couldn’t reproduce in development and couldn’t explain in production. The p99 response time was breaching their SLA several times a day. The team had tried the obvious things — increasing heap, adjusting GC settings, scaling horizontally — without lasting improvement.

The root causes were not obvious. They never are in these engagements, which is why the investigation phase matters more than any fix.

Investigation

Load profile first — JMeter load tests were run against a staging environment with production-equivalent data volumes to reproduce the latency pattern reliably before any changes were made. A problem you can’t reproduce consistently is a problem you can’t verify solving.
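For readers less familiar with the metric, here is a minimal sketch of how a p99 can be computed from raw response times using the nearest-rank method. The class name and sample values are hypothetical, and load-testing tools may use a slightly different percentile definition, so treat this as an illustration rather than the exact calculation used in the engagement.

```java
import java.util.Arrays;

final class LatencyPercentile {

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * pct percent of all samples are less than or equal to it.
     */
    static long percentile(long[] samplesMs, double pct) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (pct <= 0.0 || pct > 100.0) {
            throw new IllegalArgumentException("pct must be in (0, 100]");
        }
        long[] sorted = samplesMs.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct * sorted.length / 100.0); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Hypothetical run: 196 fast responses and 4 slow ones.
        long[] elapsedMs = new long[200];
        Arrays.fill(elapsedMs, 40L);
        for (int i = 196; i < 200; i++) {
            elapsedMs[i] = 800L;
        }
        System.out.println("p50 = " + percentile(elapsedMs, 50.0) + " ms");
        System.out.println("p99 = " + percentile(elapsedMs, 99.0) + " ms");
    }
}
```

In this made-up data set only 2% of requests are slow, yet they dominate the p99 while leaving the median untouched, which is why the tail has to be measured directly rather than inferred from averages.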

async-profiler under load — CPU and allocation flame graphs captured during sustained load revealed the hottest call paths. Several were surprising: a supposedly “lightweight” validation step was allocating heavily due to regex compilation on every call; a serialisation library was performing reflection on a class hierarchy far deeper than anticipated.

Connection pool analysis — HikariCP metrics exposed via Micrometer showed the pool was routinely exhausting under peak load, causing threads to queue waiting for a connection. The pool was sized for average throughput, not burst capacity — a classic configuration error that doesn’t show up in light testing.
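The gap between average and burst sizing can be sketched with Little's law: connections in use are roughly the request rate multiplied by the time each request holds a connection. The traffic figures below are hypothetical and are not the client's numbers. The result is only an average concurrency estimate, so a real pool also needs headroom and must respect the database's own connection limit.

```java
final class PoolSizeEstimate {

    /**
     * Little's law: average connections in use =
     * requests per second * seconds each request holds a connection.
     */
    static int connectionsNeeded(double requestsPerSecond, double holdTimeMs) {
        if (requestsPerSecond < 0 || holdTimeMs < 0) {
            throw new IllegalArgumentException("inputs must be non-negative");
        }
        return (int) Math.ceil(requestsPerSecond * holdTimeMs / 1000.0);
    }

    public static void main(String[] args) {
        double holdTimeMs = 20.0;      // hypothetical time a request holds a connection
        double averageRps = 200.0;     // hypothetical steady-state traffic
        double burstRps = 900.0;       // hypothetical peak traffic

        int forAverage = connectionsNeeded(averageRps, holdTimeMs); // 4
        int forBurst = connectionsNeeded(burstRps, holdTimeMs);     // 18

        System.out.println("connections for average load: " + forAverage);
        System.out.println("connections for burst load:   " + forBurst);
    }
}
```

With these illustrative numbers, a pool of 10 (HikariCP's default maximum) comfortably covers the average case but falls well short during a burst, at which point request threads start queuing for a connection.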

GC log analysis: G1GC pause events lined up in time with the latency spikes. Adjusting heap sizing and region configuration eliminated the major pauses.
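One way to do this kind of correlation is to pull the long pauses out of the GC log and compare their timestamps with the spikes seen in the load test. The sketch below assumes JDK unified logging with the uptime decoration (lines beginning with something like [12.345s]); the exact line format depends on the JVM version and -Xlog options, and the class name, sample lines and threshold are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GcPauseScan {

    /** One stop-the-world pause: when it was logged (JVM uptime) and how long it lasted. */
    static final class Pause {
        final double uptimeSeconds;
        final double durationMs;

        Pause(double uptimeSeconds, double durationMs) {
            this.uptimeSeconds = uptimeSeconds;
            this.durationMs = durationMs;
        }
    }

    // Assumed line shape, for example:
    // [12.345s][info][gc] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(1024M) 250.500ms
    private static final Pattern PAUSE_LINE =
            Pattern.compile("\\[(\\d+\\.\\d+)s\\].*\\bPause\\b.*\\s(\\d+\\.\\d+)ms");

    /** Returns every pause whose duration is at least thresholdMs, in log order. */
    static List<Pause> pausesOver(List<String> logLines, double thresholdMs) {
        List<Pause> result = new ArrayList<>();
        for (String line : logLines) {
            Matcher m = PAUSE_LINE.matcher(line.trim());
            if (m.matches()) {
                double durationMs = Double.parseDouble(m.group(2));
                if (durationMs >= thresholdMs) {
                    result.add(new Pause(Double.parseDouble(m.group(1)), durationMs));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "[12.100s][info][gc] GC(6) Pause Young (Normal) (G1 Evacuation Pause) 300M->120M(1024M) 8.250ms",
                "[12.345s][info][gc] GC(7) Pause Young (Normal) (G1 Evacuation Pause) 512M->128M(1024M) 250.500ms");
        for (Pause p : pausesOver(sample, 100.0)) {
            System.out.printf("pause of %.1f ms logged at uptime %.3f s%n",
                    p.durationMs, p.uptimeSeconds);
        }
    }
}
```

The resulting list of uptimes can then be set against the request timeline from the load test to see whether slow responses cluster around the long pauses.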

Fixes

  • Regex patterns pre-compiled to static fields — allocation in the validation path dropped by 85%
  • Serialisation replaced with a pre-configured ObjectMapper instance — reflection cost eliminated
  • HikariCP pool size increased and idle timeout tuned to match observed burst traffic patterns
  • G1GC heap and region size tuned based on allocation rate data from the profiler
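
The first fix in the list can be illustrated with a small before-and-after sketch. The write-up does not say which API was compiling the regex on each call; String.matches is one common way this happens, and the validator name and pattern below are hypothetical.

```java
import java.util.regex.Pattern;

final class OrderIdValidator {

    // Hypothetical format, for illustration only: three capitals, a dash, six digits.
    private static final String ORDER_ID_REGEX = "[A-Z]{3}-\\d{6}";

    // Before: String.matches compiles a fresh Pattern on every call,
    // which allocates heavily on a hot validation path.
    static boolean isValidRecompiling(String id) {
        return id != null && id.matches(ORDER_ID_REGEX);
    }

    // After: compile once. Pattern is immutable and safe to share across threads;
    // only the short-lived Matcher is created per call.
    private static final Pattern ORDER_ID = Pattern.compile(ORDER_ID_REGEX);

    static boolean isValidPrecompiled(String id) {
        return id != null && ORDER_ID.matcher(id).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"ABC-123456", "abc-1", "ABC-1234567"};
        for (String s : samples) {
            System.out.println(s + " -> " + isValidPrecompiled(s));
        }
    }
}
```

Both methods return the same results; the difference only shows up in an allocation profile under load, which is why the flame graph caught it and a code review had not.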

Outcome

p99 response time dropped from 800ms (breaching SLA) to 95ms. The team received a full write-up of each finding, the fix, and the measurement that confirmed it — not just a diff, but an explanation. Two engineers attended a half-day knowledge transfer session on reading flame graphs and interpreting HikariCP metrics.