Recently, in my Spring Boot application, I was working on importing data from multiple CSV files into my Google Cloud (GCP) Postgres DB.

Problem statement

  1. The import operation will be triggered by a REST endpoint in Spring Boot.
  2. All the CSV files will be available in the GCS bucket.
  3. A CSV file can contain 1 to 10 million records.
  4. We will have to import all the CSV files before proceeding to the next task.
  5. The import operation status should be recorded (see the sketch after this list).
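To cover requirement 5, a simple status record per file is usually enough. Below is a minimal sketch of such a record as a JPA entity, assuming Spring Data JPA is on the classpath; the class and field names (ImportJobStatus, gcsFilePath, rowsImported, etc.) are my own illustrative assumptions, not part of the original design.

// Hypothetical entity for recording the status of each file import (requirement 5).
// Uses jakarta.persistence (Spring Boot 3); older Spring Boot 2.x would use javax.persistence.
import jakarta.persistence.Entity;
import jakarta.persistence.EnumType;
import jakarta.persistence.Enumerated;
import jakarta.persistence.GeneratedValue;
import jakarta.persistence.GenerationType;
import jakarta.persistence.Id;
import java.time.Instant;

@Entity
public class ImportJobStatus {

    public enum Status { PENDING, IN_PROGRESS, COMPLETED, FAILED }

    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private String gcsFilePath;            // gs://bucket/file.csv being imported

    @Enumerated(EnumType.STRING)
    private Status status = Status.PENDING;

    private long rowsImported;             // updated as batches are saved
    private Instant startedAt;
    private Instant finishedAt;
    private String errorMessage;           // populated when status = FAILED

    // getters and setters omitted for brevity
}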

Below are the solutions I explored:

  • Method 1: Use the PostgreSQL COPY command to load a CSV directly into the table.
  • Method 2: Use the psql CLI \COPY command.
  • Method 3: Use the GCP import API. This might be a little slow, as the import API is synchronous and only allows importing one file at a time. It is only suitable if the import is straightforward and there are no follow-up operations on the imported data.
  • Method 4: Stream the file and save it in the DB. This is the fastest. We can read the file as a stream and save each CSV line we read into the database. We can leverage database batching to save multiple lines (e.g., 1000) at a time. By streaming the file, we do not need to download the full file locally. We can also stream files asynchronously, further improving speed (see the sketch after this list).
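To make Method 4 concrete, here is a minimal sketch of streaming a CSV object from a GCS bucket and saving it in batches of 1000 rows with Spring's JdbcTemplate and the google-cloud-storage client. The bucket/table/column names, the batch size, the naive comma split, and the yyyy-MM-dd date format are assumptions for illustration; a real implementation should use a proper CSV parser and the application's actual schema.

// Sketch of Method 4: stream a GCS object line by line and batch-insert into Postgres.
import com.google.cloud.storage.Blob;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import org.springframework.jdbc.core.JdbcTemplate;

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.channels.Channels;
import java.sql.Date;
import java.util.ArrayList;
import java.util.List;

public class CsvStreamImporter {

    private static final int BATCH_SIZE = 1000; // save 1000 rows per database round trip
    private static final String INSERT_SQL =
            "INSERT INTO employees (emp_id, first_name, last_name, dob, city) VALUES (?, ?, ?, ?, ?)";

    private final JdbcTemplate jdbcTemplate;
    private final Storage storage = StorageOptions.getDefaultInstance().getService();

    public CsvStreamImporter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public void importFile(String bucket, String objectName) throws IOException {
        Blob blob = storage.get(BlobId.of(bucket, objectName));
        // Stream the object; the full file is never downloaded to local disk.
        try (BufferedReader reader =
                     new BufferedReader(Channels.newReader(blob.reader(), "UTF-8"))) {
            List<Object[]> batch = new ArrayList<>(BATCH_SIZE);
            String line = reader.readLine(); // skip the CSV header row
            while ((line = reader.readLine()) != null) {
                // Naive split; a real implementation should use a CSV parser
                // to handle quoted fields that contain commas.
                String[] cols = line.split(",", -1);
                batch.add(new Object[]{
                        cols[0], cols[1], cols[2],
                        Date.valueOf(cols[3]),   // assumes dob is in yyyy-MM-dd format
                        cols[4]});
                if (batch.size() == BATCH_SIZE) {
                    jdbcTemplate.batchUpdate(INSERT_SQL, batch);
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                jdbcTemplate.batchUpdate(INSERT_SQL, batch); // flush the last partial batch
            }
        }
    }
}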

Method 1: Perform a PostgreSQL CSV import using the COPY command

COPY employees(emp_id, first_name, last_name, dob, city)
FROM 'C:\newdb\employees.csv'
DELIMITER ','
CSV HEADER;

Method 2: Use the Postgres CLI \COPY command

/* Login */
psql postgres://postgres:postgres@localhost:5432/postgres
/* Copy */
\COPY employee_table FROM 'employees.csv' WITH CSV HEADER;

Solution

  • Method 4 (stream the file and save in the DB) if you have to import multiple files in parallel
  • Method 3 (GCP import API) if you have fewer files and importing them one by one synchronously is not a concern

Disadvantages

Method 1 & Method 2:

We must do a lot of heavy lifting and handle many error cases.

  • The files are present in the GCS bucket. Hence, to pass a file to the command, I would have to download it first and keep it in system memory, which is not ideal.
  • After the import is done, I must delete the file from the server. If any error occurs here, the files accumulate and the system's memory usage keeps increasing.
  • During the import operation, my system's memory and CPU usage will spike, especially when the CSV files have millions of records.
  • Accessing production psql directly is generally not recommended and raises security concerns; for example, it makes SQL injection easier.

Method 3

The main problem with the GCP import API is that it is synchronous: we can only run one import operation at a time per DB instance, since Cloud SQL allows only a single import or export operation to run at once. Keeping imports sequential also avoids data inconsistency:

  • If one table is being loaded from 2 different files at the same time, the order of the data cannot be preserved.
  • Also, rolling back in case of a failure becomes added overhead when 2 or more files are being imported at once.

Hence, we have to wait for one file's import to complete before triggering the next, as sketched below.
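Since only one import can run at a time, the sequencing is simple: trigger an import, poll until it completes, then move to the next file. The sketch below illustrates that loop with a hypothetical CloudSqlImportClient wrapper; its method names (startCsvImport, isOperationDone) are my own placeholders for the import API calls, not a real client library.

// Sequential triggering of CSV imports, one file at a time (Method 3).
import java.util.List;

public class SequentialCsvImporter {

    /** Hypothetical wrapper around the import API; not a real client library. */
    public interface CloudSqlImportClient {
        /** Triggers an import of the given gs:// URI and returns an operation id. */
        String startCsvImport(String gcsUri, String database, String table);
        /** Returns true once the given import operation has finished. */
        boolean isOperationDone(String operationId);
    }

    private final CloudSqlImportClient client;

    public SequentialCsvImporter(CloudSqlImportClient client) {
        this.client = client;
    }

    public void importAll(List<String> gcsUris) throws InterruptedException {
        for (String uri : gcsUris) {
            String operationId = client.startCsvImport(uri, "mydb", "employees");
            // Only one import operation may run per instance, so poll until
            // this one finishes before triggering the next file.
            while (!client.isOperationDone(operationId)) {
                Thread.sleep(5_000);
            }
        }
    }
}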

Method 4

  • High CPU utilisation: processing files in parallel may require a higher system configuration.
  • High database utilisation: since we are inserting the rows ourselves, a huge number of queries run when saving large files, even with batching. Hence we will need a higher database configuration.
  • Harder error handling: asynchronous processes are often difficult to handle, and we need to plan for errors at multiple stages, such as connecting to the DB, thread locking, and query failures (see the sketch below).
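As a sketch of what that error handling can look like, the snippet below runs several file imports asynchronously and captures failures per file so they can be recorded (for example in an import-status table). CsvStreamImporter is the streaming importer sketched earlier; the thread-pool size and the logging are illustrative assumptions.

// Run multiple imports asynchronously and record failures per file.
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

public class AsyncImportRunner {

    private final ExecutorService pool = Executors.newFixedThreadPool(4); // tune to CPU/DB capacity
    private final CsvStreamImporter importer; // from the Method 4 sketch above

    public AsyncImportRunner(CsvStreamImporter importer) {
        this.importer = importer;
    }

    public CompletableFuture<Void> importAll(String bucket, List<String> files) {
        List<CompletableFuture<Void>> futures = files.stream()
                .map(file -> CompletableFuture
                        .runAsync(() -> {
                            try {
                                importer.importFile(bucket, file);
                            } catch (Exception e) {
                                // Re-throw unchecked so the failure surfaces via exceptionally().
                                throw new IllegalStateException("Import failed for " + file, e);
                            }
                        }, pool)
                        .exceptionally(ex -> {
                            // Record the failure instead of letting it disappear silently,
                            // e.g., mark this file as FAILED in the status table.
                            System.err.println(ex.getMessage());
                            return null;
                        }))
                .collect(Collectors.toList());
        // Completes only when every file has either been imported or marked as failed.
        return CompletableFuture.allOf(futures.toArray(new CompletableFuture[0]));
    }
}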

Author of article: Vishesh