Almost every application works with data in some form, and enterprise applications often deal with large volumes of it. At that scale, making a single operation faster is not enough: the data has to be processed in bulk, which requires robust transactional control and failure recovery mechanisms that are complex to implement. To handle all this, the Spring ecosystem provides a dedicated project, ‘Spring Batch’, for batch processing. Therefore, our topic for discussion is ‘Spring Batch Tutorial’.
Spring Batch processes data in the form of batch jobs and offers reusable functions for processing large volumes of records. These include logging/tracing, transaction management, job processing statistics, skip, job restart, and resource management. Spring Batch takes care of all of that while aiming for good performance. In this article, ‘Spring Batch Tutorial’, we are going to learn about Spring Batch and its related concepts.
What should you already know to work on Spring Batch Processing?
You should have experience in Java. Since Spring Batch is built on top of the Spring Framework, you should know the Spring Framework basics. Additionally, you should have a general understanding of relational databases and flat files, since we will be moving data to and from these data stores. Apart from that, you should know the basic concepts of Spring Batch, or at least the general terminology covered in this Spring Batch Tutorial.
What software is needed to work on Spring Batch?
1) Java Development Kit (JDK) – 14+ in our case
2) IDE for Enterprise Java Developers (STS in our case)
3) Database (MySQL in our case)
What is Batch Processing?
Let’s first understand what Batch Processing is before going through the Spring Batch Tutorial.
Batch Processing is a technique that processes data in large groups instead of one unit of data at a time. With batch processing, we can process a high volume of data with minimal human interaction. Unlike a web or desktop application, a batch job has no GUI (Graphical User Interface), which reflects the fact that no human interacts with it while it runs. Many live applications execute billions of transactions every day through batch processing.
What is Spring Batch?
Spring Batch is a project in the Spring ecosystem, built on top of the Spring Framework, that supports batch processing. Using Spring Batch, we can create a robust batch processing system tailored to our requirements.
Let’s discuss a real-world usage of the Spring Batch concept
Suppose you want to move a high volume of data from a source to a destination, and doing it the traditional way looks too complicated. In this situation, you can think of implementing Spring Batch. For example, let’s assume you have a report generation system where you want to retrieve a high volume of data from a database and write the same data to a CSV file. Here, the traditional way of generating a report may not handle the high volume of data. Hence, in this scenario we can use Spring Batch to fulfil our requirements. In this case, the database is the source and the CSV file is the destination.
What are the common use cases of using Spring Batch?
There are several common use cases where Spring Batch delivers a high-performing system and reduces the development effort. Let’s look at the most common ones in this Spring Batch Tutorial.
1) Data Migration: When a legacy application is replaced by a modern one, we have to migrate the legacy data to a database compatible with the modern system. This is where Spring Batch shines, with its ready-made components for reading, processing, and writing data.
2) ETL (Extract, Transform, Load): ETLs are common in integration scenarios. For example, an application may periodically generate files with data that needs to be loaded, transformed, and persisted into another application’s database.
3) Parallel Processing: Sometimes making a single operation faster doesn’t provide optimal performance because of the large volume of data. In that case, parallel processing becomes the only way to reduce the overall processing time. However, it requires robust transactional control and failure recovery mechanisms, which are complex to implement. Spring Batch takes care of these complexities.
4) Reporting: Reporting is also an ideal use case for Batch Processing. In reporting, we process large datasets to calculate and distribute information. It also requires collecting data over a period of time and is often time-based, such as a monthly bank statement or quarterly financials.
5) Exchange of Information: Batch Processing can also be an ideal fit for the exchange of information between two systems or within the same system. Batch jobs can generate information to be sent to another system or receive information from another system for processing. This typically takes place using a predefined integration strategy between the two systems, such as SFTP, messaging, or direct database connections.
What are the benefits of using Spring Batch?
The main benefits of using Spring Batch are listed below.
1) Spring Batch offers over 17 ItemReader and 15 ItemWriter implementations covering vast options for input and output (File, JDBC, NoSQL, JMS, etc). All of these provide declarative I/O options so that you don’t have to write and test code for stateful readers and writers.
2) A collection of Tasklet (Spring Batch’s equivalent to JSR-352’s Batchlet) implementations, including ones for executing shell commands and interfacing with Hadoop.
3) For developers, using Spring Batch is largely a matter of configuring the appropriate reader or writer for the data store.
4) The framework stores metadata regarding jobs and their executions out of the box in the job repository. This is very helpful when determining which jobs have executed and why a job has failed.
5) The ability to stop/start/restart jobs and maintain state between executions. It provides the capability to restart the job from where we left off based upon the information about the job in the job repository.
6) The ability to skip and retry records as they are being processed.
7) Transaction management: Spring Batch handles transactions for you. It provides transactional writers that can roll back in the event of an error.
8) The ability to notify other systems when errors occur via messaging by integrating Spring Integration.
9) It offers Java or XML based configuration.
10) All the Spring features like DI, AOP, testability, etc. are still available for developers.
11) Vendor independence – By using Spring Batch, you get to use a framework that is open source and not tied to any one vendor.
12) Big data support – Through the Spring for Apache Hadoop project, there are a number of extensions to Spring Batch that allow it to work well with Hadoop. You can run Spring Batch jobs on YARN, you can execute Pig, Hive, MapReduce, etc. jobs.
13) Integration with Spring XD – Spring XD provides a distributed runtime for the deployment, management, and execution of batch jobs.
14) Focus only on business rules: Most of the technical aspects surrounding the creation of batch applications have already been solved, so the developer can spend more time solving the business needs than building the whole framework.
What are the terminologies used in Spring Batch?
In order to work on a Spring Batch implementation confidently, we should have a clear understanding of each term used in Spring Batch Processing. Let’s discuss them one by one.
A job represents the entire batch process that we want to execute. It can have one or more steps that execute in a flow. In small use cases, one job generally has one step, but when multiple sources/destinations are involved, it can have multiple steps. The flow from one step to another can be dynamic: it can be conditional or occur in parallel. For example, transferring data from a source to a destination, with or without transforming the data, is a Job.
A step is a phase in a job that defines how the actual processing will occur for that portion of the job. There are two types of steps: Tasklet-based Step and Chunk-based Step.
A Tasklet-based Step is built on the Tasklet interface, which contains a single method named ‘execute()’ that runs over and over until it signals that it has finished. Tasklets are typically used for things like setup logic, stored procedures, or other custom logic that can’t be achieved with the out-of-the-box components.
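The repeat-until-finished behaviour of execute() can be modelled in a few lines of plain Java. This is only a minimal sketch under stated assumptions: in Spring Batch itself, execute() receives framework arguments and returns a RepeatStatus, whereas the `Status` enum, the `SimpleTasklet` interface, and the `runUntilFinished` helper below are hypothetical, simplified stand-ins so that the example runs without the framework.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class TaskletSketch {
    // Hypothetical stand-in for the "continue or stop" signal a tasklet returns.
    enum Status { CONTINUABLE, FINISHED }

    // Hypothetical stand-in for the Tasklet role: a single method, called repeatedly.
    interface SimpleTasklet {
        Status execute();
    }

    // Keeps calling execute() until the tasklet reports FINISHED.
    // Returns how many times execute() was invoked.
    static int runUntilFinished(SimpleTasklet tasklet) {
        int calls = 0;
        Status status;
        do {
            status = tasklet.execute();
            calls++;
        } while (status == Status.CONTINUABLE);
        return calls;
    }

    public static void main(String[] args) {
        // A tasklet with three units of setup work; it finishes after the third call.
        AtomicInteger remaining = new AtomicInteger(3);
        int calls = runUntilFinished(() ->
                remaining.decrementAndGet() > 0 ? Status.CONTINUABLE : Status.FINISHED);
        System.out.println("execute() was called " + calls + " times");
    }
}
```

A tasklet that does all of its work in one call would simply return the finished signal the first time, so the loop body would run exactly once.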
A Chunk-based Step is used in scenarios where we need to process data from a data source. In this case, each step has a reader, a writer, and an optional processor. The step uses the ItemReader interface to read chunks of data from a data source, and then writes each chunk in a transaction using the ItemWriter interface. Optionally, we can include an ItemProcessor implementation to perform transformations on the data. Most real-world use cases rely on the Chunk-based Step.
ItemReader: It reads data from the source.
ItemWriter: ItemWriter writes the data to the destination.
ItemProcessor: It does calculations, validations, filtering on the data as part of data transformation before writing the data into the destination.
The entire job starts execution using a Job Launcher, which may pass job parameters to the job. Many enterprises prefer to launch jobs using a scheduler.
As the job runs, metadata regarding the job is written to the Job Repository. Typically, the Job Repository stores metadata regarding JobParameters, JobInstances, JobExecutions, and StepExecutions. In that sense, the Job Repository acts as the memory of the batch system. We can back it with a database, such as in-memory H2 or MySQL, in order to store this metadata. We don’t even need to write any code for this.
When running Spring Batch jobs, the concept of a job instance and a job execution must be understood.
When a job launcher launches a job, it typically passes the name of the job and some parameters. The combination of the job name and its parameters identifies a JobInstance.
Each time we execute a JobInstance, we create a new JobExecution. When we execute the job with the same parameters again, it is the same JobInstance. On the other hand, when we execute the same job with different parameters, a different JobInstance is created. Either way, every execution produces a new JobExecution.
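The relationship between job name, parameters, JobInstance, and JobExecution can be illustrated with a small in-memory model. This is a sketch only: `JobInstanceSketch`, `launch`, `instanceCount`, and `executionCount` are hypothetical names invented for illustration, not Spring Batch classes, and the job name and parameter values are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JobInstanceSketch {
    // Key = (job name, parameters) identifies an instance; value = ids of its executions.
    private final Map<Map.Entry<String, Map<String, String>>, List<Integer>> executionsByInstance =
            new HashMap<>();
    private int nextExecutionId = 1;

    // Every launch yields a new execution id; the instance is reused
    // whenever the job name and the parameters are equal to an earlier launch.
    int launch(String jobName, Map<String, String> parameters) {
        Map.Entry<String, Map<String, String>> instanceKey =
                Map.entry(jobName, Map.copyOf(parameters));
        int executionId = nextExecutionId++;
        executionsByInstance.computeIfAbsent(instanceKey, key -> new ArrayList<>()).add(executionId);
        return executionId;
    }

    int instanceCount() {
        return executionsByInstance.size();
    }

    int executionCount() {
        return executionsByInstance.values().stream().mapToInt(List::size).sum();
    }

    public static void main(String[] args) {
        JobInstanceSketch repository = new JobInstanceSketch();
        repository.launch("importJob", Map.of("date", "2024-01-01")); // new instance, new execution
        repository.launch("importJob", Map.of("date", "2024-01-01")); // same instance, new execution
        repository.launch("importJob", Map.of("date", "2024-01-02")); // new instance, new execution
        System.out.println(repository.instanceCount() + " instances, "
                + repository.executionCount() + " executions");
    }
}
```

One simplification to keep in mind: this model accepts any relaunch, whereas the real framework by default refuses to run an instance again once it has completed successfully, so a second execution of the same instance is normally a restart after a failure or a stop.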
As the steps execute within a job, a concept very similar to JobExecution applies. Each execution of a step creates a new StepExecution, and every StepExecution is associated with a JobExecution.
How does the Step Execution happen in Spring Batch Chunk-Based Processing?
The Step Execution is the main part of the whole Job Execution. Let’s assume we have a use case where the source is a CSV file and the destination is a MySQL database. Here we want to pull records from the CSV file and after some data transformation (processing) write the records to MySQL database. For example, suppose that we have 5000 records in the CSV file. Let’s consider the chunk size as 600.
The Chunk-based Step Execution involves three components: ItemReader to read data from the source, ItemWriter to write the data to the destination, and optionally ItemProcessor to transform the data if needed.
Chunk Size: When reading, processing, and writing the items, we perform these operations on smaller groups of the data referred to as chunks. When performing a chunk-based step, we typically provide a chunk size which determines how many items will be found within a chunk. As aforementioned, in this example, the chunk size has been set to 600.
When processing starts, the ItemReader reads the first record from the source and passes it to the processor for processing. Next, it reads the second record and processes it in the same way. These read and process operations continue until the number of items reaches the chunk size, which is 600 records here. Once the chunk size is met, the entire chunk is passed to the ItemWriter for the first time.
In our case, the 600 processed records are collected into a collection (the chunk), and the records within this first chunk are then written to the database. This process continues until all the records have been written to the database. In our case, the write operation takes place only 9 times: the first 8 writes cover 600 records each (4,800 in total), and the 9th write covers the remaining 200 records.
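The walkthrough above can be simulated in plain Java to check the numbers. This is a simplified sketch that follows the article's description of the loop, not Spring Batch's actual implementation: the `Reader`, `Processor`, and `Writer` interfaces and the `runStep` and `simulate` helpers are hypothetical stand-ins, and an in-memory list plays the role of the MySQL destination.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.stream.IntStream;

public class ChunkLoopSketch {
    // Hypothetical, simplified stand-ins for the reader, processor and writer roles.
    interface Reader<T> { T read(); }                  // returns null when the source is exhausted
    interface Processor<I, O> { O process(I item); }
    interface Writer<T> { void write(List<T> chunk); }

    // Reads and processes one item at a time, and writes once per chunk.
    // Returns the size of every chunk that was handed to the writer.
    static <I, O> List<Integer> runStep(Reader<I> reader, Processor<I, O> processor,
                                        Writer<O> writer, int chunkSize) {
        List<Integer> writtenChunkSizes = new ArrayList<>();
        List<O> chunk = new ArrayList<>();
        I item;
        while ((item = reader.read()) != null) {
            chunk.add(processor.process(item));
            if (chunk.size() == chunkSize) {           // chunk is full: write it
                writer.write(chunk);
                writtenChunkSizes.add(chunk.size());
                chunk = new ArrayList<>();
            }
        }
        if (!chunk.isEmpty()) {                        // final, partially filled chunk
            writer.write(chunk);
            writtenChunkSizes.add(chunk.size());
        }
        return writtenChunkSizes;
    }

    // Simulates a source with recordCount records and an in-memory destination.
    static List<Integer> simulate(int recordCount, int chunkSize) {
        Iterator<Integer> source = IntStream.rangeClosed(1, recordCount).iterator();
        List<String> destination = new ArrayList<>();
        Reader<Integer> reader = () -> source.hasNext() ? source.next() : null;
        Processor<Integer, String> processor = id -> "record-" + id;
        Writer<String> writer = destination::addAll;
        return runStep(reader, processor, writer, chunkSize);
    }

    public static void main(String[] args) {
        // 5000 records with a chunk size of 600, as in the example above.
        System.out.println(simulate(5000, 600));
    }
}
```

Running the sketch prints `[600, 600, 600, 600, 600, 600, 600, 600, 200]`: nine writes in total, with the last one carrying the remaining 200 records.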
What are some common Readers & Writers in Spring Batch?
This section is the most important part of the whole Spring Batch concept for a developer. As discussed earlier, choosing the appropriate reader and/or writer is the developer’s major task. Hence, we must have a clear idea of the available options.
ItemReader is an interface with a single method named read(), provided by Spring Batch. Implementations of the ItemReader interface retrieve data from a data source one item at a time for processing within the batch job. The framework provides several implementations for reading from common data stores such as databases, files, and message queues, for example FlatFileItemReader, JdbcCursorItemReader, JpaPagingItemReader, and StaxEventItemReader. We can use them to consume items from different data sources.
When leveraging these ItemReaders, we will need to provide some specific configurations for each reader that will instruct the reader how to consume the items from the data store.
ItemWriter is an interface with a single method named write(), provided by Spring Batch. Multiple items are written as chunks to the data store as opposed to writing a single item at a time. The number of items written is determined by the chunk size, which helps keep our batch processing efficient. The framework provides several implementations for writing to common data stores like relational databases, flat files, or Kafka topics, for example FlatFileItemWriter, JdbcBatchItemWriter, JpaItemWriter, and KafkaItemWriter. We can use them to write items to different data stores.
During chunk-based step processing, business logic can be inserted between the point where items are read and the point where they are written. In order to implement this logic, Spring Batch provides the ItemProcessor interface with a single method, process(). Implementing this interface lets developers add custom batch processing logic between the ItemReader and the ItemWriter. Typical use cases for the ItemProcessor include transformation, validation, and filtering of the items flowing through chunk-based processing.
If you are simply moving data from one source to another, the ItemProcessor may not be necessary. However, most jobs require some sort of processing.
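To make the three typical processor duties concrete, here is a minimal plain-Java sketch. It relies on the Spring Batch convention that a processor returning null means the item is filtered out and never reaches the writer. Everything else is an assumption made for illustration: the `Processor` interface is a simplified stand-in, and the "name,age" line layout, the age threshold of 18, and the `ADULTS_UPPERCASE` and `processAll` names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ProcessorSketch {
    // Hypothetical, simplified stand-in for the processor role: returning null drops the item.
    interface Processor<I, O> { O process(I item); }

    // Validates, filters and transforms a raw "name,age" line (hypothetical record layout).
    static final Processor<String, String> ADULTS_UPPERCASE = line -> {
        String[] fields = line.split(",");
        if (fields.length != 2) {
            // Validation: reject lines that do not have exactly two fields.
            throw new IllegalArgumentException("Malformed line: " + line);
        }
        int age = Integer.parseInt(fields[1].trim());
        if (age < 18) {
            return null; // Filtering: this item will not be written.
        }
        // Transformation: normalise the name to upper case.
        return fields[0].trim().toUpperCase(Locale.ROOT) + "," + age;
    };

    // Applies the processor to every item and keeps only the non-null results.
    static <I, O> List<O> processAll(List<I> items, Processor<I, O> processor) {
        List<O> results = new ArrayList<>();
        for (I item : items) {
            O result = processor.process(item);
            if (result != null) {
                results.add(result);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> input = List.of("alice, 30", "bob, 12", "carol, 45");
        System.out.println(processAll(input, ADULTS_UPPERCASE));
    }
}
```

With the sample input, the line for "bob" is filtered out because of the age check, and the two remaining lines are passed on in upper case.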
After going through the theoretical part of ‘Spring Batch Tutorial’, we should be ready to implement Spring Batch Processing in a real-world project. We encourage you to extend the knowledge provided in this article and apply the concepts in your own project. For further learning, you may visit the Spring Batch pages on spring.io. If there is any update in the future, we will update the article accordingly. Feel free to share your thoughts in the comments section below.
Link To Spring Batch Example
The link below leads to an example built on this Spring Batch Tutorial:
How to transfer data from CSV file to MySQL database using Spring Boot Batch concept and Spring Data JPA?