
Processing Large Datasets Without Killing Your Database

A practical guide to Spring Batch for bulk data processing — chunk-based processing, JdbcCursorItemReader, skip and retry policies, restartability, and parallel partitioning at scale.

Bulk data processing is one of those problems that looks simple until it isn’t. Load a list, transform it, write it out. Fine for a thousand records. At a million records, your naive implementation has killed your database connection pool, blown the JVM heap, and been running for four hours with no way to restart from where it failed. Spring Batch exists precisely for this problem space. I’ve used it extensively on smart metering systems processing tens of millions of meter read records overnight — here’s what I’ve learned.

The Core Model: Job, Step, Chunk

Spring Batch’s abstractions are small and composable: a Job is the top-level unit of work, a Step is one phase of that job (typically a reader, an optional processor, and a writer), and a chunk is the batch of items a step processes and commits in a single transaction.

The chunk model is the critical insight: rather than processing one record at a time and committing each, Spring Batch accumulates items up to the commit interval, processes them, writes them all, and then commits the transaction. One commit and one batched write per chunk instead of one per record.

@Configuration
@EnableBatchProcessing
public class MeterReadJobConfig {

    @Bean
    public Job meterReadJob(JobRepository jobRepository, Step meterReadStep) {
        return new JobBuilder("meterReadJob", jobRepository)
            .start(meterReadStep)
            .build();
    }

    @Bean
    public Step meterReadStep(
            JobRepository jobRepository,
            PlatformTransactionManager transactionManager,
            ItemReader<MeterReadRecord> reader,
            ItemProcessor<MeterReadRecord, NormalisedMeterRead> processor,
            ItemWriter<NormalisedMeterRead> writer) {

        return new StepBuilder("meterReadStep", jobRepository)
            .<MeterReadRecord, NormalisedMeterRead>chunk(500, transactionManager)
            .reader(reader)
            .processor(processor)
            .writer(writer)
            .faultTolerant()
            .skip(MalformedReadException.class)
            .skipLimit(1000)
            .retry(TransientDataAccessException.class)
            .retryLimit(3)
            .build();
    }
}

The chunk(500, transactionManager) call sets the commit interval to 500. Every 500 successfully processed items, the writer is called and the transaction commits. If the transaction fails, only that chunk is rolled back — not the entire job.

Reading Large Result Sets: JdbcCursorItemReader

The most common mistake I see in Spring Batch implementations is loading the entire dataset into memory before processing — either via a JdbcPagingItemReader with a huge page size, or by pulling a List<Entity> in a @BeforeStep and iterating over it.

For large datasets, you want JdbcCursorItemReader. It opens a database cursor and streams rows one at a time, keeping memory consumption constant regardless of result set size:

@Bean
@StepScope
public JdbcCursorItemReader<MeterReadRecord> meterReadReader(
        DataSource dataSource,
        @Value("#{jobParameters['processingDate']}") LocalDate processingDate) {

    return new JdbcCursorItemReaderBuilder<MeterReadRecord>()
        .name("meterReadReader")
        .dataSource(dataSource)
        .sql("""
            SELECT meter_id, read_timestamp, read_value, read_type, status
            FROM meter_reads
            WHERE processing_date = ?
              AND status = 'PENDING'
            ORDER BY meter_id, read_timestamp
            """)
        .queryArguments(processingDate)   // binds the late-bound job parameter to the ? placeholder
        .rowMapper(new MeterReadRowMapper())
        .fetchSize(1000)       // hint to the JDBC driver for pre-fetching
        .verifyCursorPosition(false)
        .build();
}

fetchSize(1000) tells the JDBC driver to pre-fetch 1000 rows at a time from the database into the client buffer, reducing network round-trips while still streaming from the application’s perspective. The actual memory footprint remains bounded.

@StepScope is important — it means the reader is created fresh for each Step execution, which enables late binding of JobParameters via SpEL:

@Value("#{jobParameters['processingDate']}") LocalDate processingDate

Without @StepScope, JobParameters values aren’t available when the bean is constructed.

ItemProcessor — Transform and Filter

The processor is where business logic lives. Return null to filter an item out (it won’t be passed to the writer):

@Bean
@StepScope
public ItemProcessor<MeterReadRecord, NormalisedMeterRead> meterReadProcessor() {
    return record -> {
        if (record.getReadValue() < 0) {
            log.warn("Negative read value for meter {}, skipping", record.getMeterId());
            return null; // filtered — not written
        }

        return NormalisedMeterRead.builder()
            .meterId(record.getMeterId())
            .readAt(record.getReadTimestamp())
            .kwh(convertToKwh(record.getReadValue(), record.getReadType()))
            .validated(true)
            .build();
    };
}

Processors should be stateless. If you need to accumulate state across records (e.g. running totals per meter), that’s better handled in the writer or a downstream aggregation step.
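
If you did need per-meter aggregates, the accumulation can live in a writer instead. The sketch below is illustrative only: it assumes kwh is a double, meterId a Long, and a hypothetical DailyTotalsRepository with an addToDailyTotal(meterId, kwh) upsert — none of which are part of the job above:

@Bean
public ItemWriter<NormalisedMeterRead> dailyTotalsWriter(DailyTotalsRepository totals) {
    return chunk -> {
        // Accumulate within the chunk only — no state survives between chunks.
        Map<Long, Double> kwhByMeter = new HashMap<>();
        for (NormalisedMeterRead read : chunk) {
            kwhByMeter.merge(read.getMeterId(), read.getKwh(), Double::sum);
        }
        // One upsert per meter in the chunk, committed with the chunk's transaction.
        kwhByMeter.forEach(totals::addToDailyTotal);
    };
}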

ItemWriter — Efficient Bulk Inserts

For writing, JdbcBatchItemWriter uses JDBC batch operations — it batches all items in the chunk into a single executeBatch() call:

@Bean
public JdbcBatchItemWriter<NormalisedMeterRead> meterReadWriter(DataSource dataSource) {
    return new JdbcBatchItemWriterBuilder<NormalisedMeterRead>()
        .dataSource(dataSource)
        .sql("""
            INSERT INTO normalised_meter_reads
                (meter_id, read_at, kwh, validated, created_at)
            VALUES
                (:meterId, :readAt, :kwh, :validated, NOW())
            ON CONFLICT (meter_id, read_at)
                DO UPDATE SET kwh = EXCLUDED.kwh, validated = EXCLUDED.validated
            """)
        .beanMapped()
        .build();
}

The ON CONFLICT ... DO UPDATE (Postgres UPSERT) makes the writer idempotent — if the job is restarted after a partial failure, re-writing the same records doesn’t produce duplicates.

Skip and Retry Policies

Batch jobs should be resilient to bad data without failing entirely. The faultTolerant() builder enables fine-grained control:

.<MeterReadRecord, NormalisedMeterRead>chunk(500, transactionManager)
    .reader(reader)
    .processor(processor)
    .writer(writer)
    .faultTolerant()
    .skip(MalformedReadException.class)      // skip these, log, continue
    .skip(ValidationException.class)
    .skipLimit(1000)                          // but not more than 1000 in total
    .noSkip(DataAccessException.class)        // never skip DB errors
    .retry(TransientDataAccessException.class) // retry transient DB errors
    .retryLimit(3)
    .build();

Spring Batch doesn’t persist the skipped items themselves — the job repository only records skip counts against the step execution (BATCH_STEP_EXECUTION). To know which records were skipped and why, register a SkipListener that writes them somewhere queryable; after the job completes you can inspect them and decide whether to reprocess them separately.
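
A minimal SkipListener sketch — here it only logs (using the same log as the other snippets), but writing the offending records to a triage table works the same way. Register it on the step with .listener(...) after faultTolerant():

public class MeterReadSkipListener implements SkipListener<MeterReadRecord, NormalisedMeterRead> {

    @Override
    public void onSkipInRead(Throwable t) {
        log.warn("Skipped an unreadable row", t);
    }

    @Override
    public void onSkipInProcess(MeterReadRecord item, Throwable t) {
        log.warn("Skipped meter {} during processing: {}", item.getMeterId(), t.getMessage());
    }

    @Override
    public void onSkipInWrite(NormalisedMeterRead item, Throwable t) {
        log.warn("Skipped meter {} during write: {}", item.getMeterId(), t.getMessage());
    }
}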

Restartability via JobParameters

A critical Spring Batch feature is restartability. If a job fails on chunk 847 of 2,000, you can restart it and it resumes at chunk 847 — the rolled-back chunk — without reprocessing the 846 already committed.

This works because Spring Batch keys each JobInstance on its identifying JobParameters, and a restart resumes the failed execution of that same instance. The identifying parameters therefore need to be stable across runs — use the business date (or an equivalent run identifier):

@Autowired
private JobLauncher jobLauncher;

@Autowired
private Job meterReadJob;

public void runJob(LocalDate processingDate) throws Exception {
    JobParameters params = new JobParametersBuilder()
        .addLocalDate("processingDate", processingDate)
        .addLong("startedAt", System.currentTimeMillis()) // unique per run
        .toJobParameters();

    jobLauncher.run(meterReadJob, params);
}

Marking startedAt as non-identifying keeps the JobInstance keyed on processingDate alone, so re-launching a failed run for that date resumes where it left off. Spring Batch won’t restart a COMPLETED instance — to genuinely re-run a date that already finished, you need a distinct identifying parameter.

Parallel Processing: Partitioned Steps

For very large datasets, a single-threaded step may not meet your SLA. Spring Batch’s partitioned steps split the work across multiple threads or JVM instances.

The Partitioner divides the data into ranges — for meter reads, split the meter_id key space into one contiguous range per partition:

@Bean
public Step partitionedMeterReadStep(
        JobRepository jobRepository,
        Step meterReadStep,
        Partitioner meterIdRangePartitioner) {

    return new StepBuilder("partitionedMeterReadStep", jobRepository)
        .partitioner("meterReadStep", meterIdRangePartitioner)
        .step(meterReadStep)
        .gridSize(8)                // 8 partitions, one per thread
        .taskExecutor(new SimpleAsyncTaskExecutor())
        .build();
}

@Bean
public Partitioner meterIdRangePartitioner(MeterRepository meterRepository) {
    return gridSize -> {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        long totalMeters = meterRepository.count();
        long rangeSize = totalMeters / gridSize; // assumes meter ids run roughly 0..totalMeters

        for (int i = 0; i < gridSize; i++) {
            ExecutionContext context = new ExecutionContext();
            context.putLong("minMeterId", i * rangeSize);
            context.putLong("maxMeterId", (i == gridSize - 1) ? Long.MAX_VALUE : (i + 1) * rangeSize - 1);
            partitions.put("partition" + i, context);
        }
        return partitions;
    };
}

Each partition gets its own StepExecution and its own reader scoped to its minMeterId/maxMeterId range. Failures in one partition don’t affect others.
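
For that scoping to happen, the worker step’s reader has to pull its bounds out of the partition’s ExecutionContext. A minimal sketch of what that looks like, assuming the same meter_reads table as above and numeric meter ids — the bounds are late-bound from the values the partitioner put into each partition’s context:

@Bean
@StepScope
public JdbcCursorItemReader<MeterReadRecord> partitionedMeterReadReader(
        DataSource dataSource,
        @Value("#{jobParameters['processingDate']}") LocalDate processingDate,
        @Value("#{stepExecutionContext['minMeterId']}") long minMeterId,
        @Value("#{stepExecutionContext['maxMeterId']}") long maxMeterId) {

    return new JdbcCursorItemReaderBuilder<MeterReadRecord>()
        .name("partitionedMeterReadReader")
        .dataSource(dataSource)
        .sql("""
            SELECT meter_id, read_timestamp, read_value, read_type, status
            FROM meter_reads
            WHERE processing_date = ?
              AND status = 'PENDING'
              AND meter_id BETWEEN ? AND ?
            ORDER BY meter_id, read_timestamp
            """)
        .queryArguments(processingDate, minMeterId, maxMeterId) // job parameter + partition bounds
        .rowMapper(new MeterReadRowMapper())
        .fetchSize(1000)
        .verifyCursorPosition(false)
        .build();
}

Because each partition’s reader is a separate @StepScope instance, restart state is tracked per partition as well.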

Monitoring via JobExplorer

Spring Batch persists all job and step execution state to the batch schema tables. JobExplorer gives you programmatic access:

@Autowired
private JobExplorer jobExplorer;

public BatchJobStatus getLastRun(String jobName) {
    List<JobInstance> instances = jobExplorer.getJobInstances(jobName, 0, 1);
    if (instances.isEmpty()) return BatchJobStatus.NEVER_RUN;

    JobInstance latest = instances.get(0);
    List<JobExecution> executions = jobExplorer.getJobExecutions(latest);
    JobExecution lastExecution = executions.get(0);

    return new BatchJobStatus(
        lastExecution.getStatus(),
        lastExecution.getStartTime(),
        lastExecution.getEndTime(),
        lastExecution.getStepExecutions().stream()
            .mapToLong(StepExecution::getWriteCount)
            .sum()
    );
}

Expose this via an Actuator endpoint or a monitoring dashboard — knowing whether last night’s job completed and how many records it processed is non-negotiable for overnight batch workloads.
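
One way to surface it is a small custom Actuator endpoint. A sketch, assuming the getLastRun logic above lives in a hypothetical BatchJobMonitor component and that the endpoint id is included in management.endpoints.web.exposure.include:

@Component
@Endpoint(id = "batchstatus")
public class BatchStatusEndpoint {

    private final BatchJobMonitor monitor; // wraps the JobExplorer lookup shown above

    public BatchStatusEndpoint(BatchJobMonitor monitor) {
        this.monitor = monitor;
    }

    @ReadOperation
    public BatchJobStatus lastMeterReadRun() {
        return monitor.getLastRun("meterReadJob");
    }
}

With web exposure enabled, the status is then available at /actuator/batchstatus.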


If you’re designing or troubleshooting batch processing pipelines at scale, get in touch.