In data teams today, writing Python code to create data pipelines has become second nature, especially for analytics workloads. Creating new DAGs, tasks, and operators in Airflow - now the industry standard for data orchestration - is just part of our daily routine. But what if there's a simpler, more accessible approach to building these pipelines?

Before we explore this alternative, let's examine how another domain - DevOps - evolved its approach to building and shipping software, and see what lessons we can apply to data orchestration.

Evolution of Infrastructure Management

Let's examine a simple but common task: deploying a web application with a database. The way teams handled this task evolved dramatically over time.

Shell scripts (2000s)

In the early 2000s, deploying applications meant writing detailed shell scripts that listed every command needed. These scripts were brittle, hard to maintain, and required deep system knowledge:

#!/bin/bash

# Install dependencies
apt-get update
apt-get install -y nginx mysql-server

# Configure MySQL
mysql -u root -e "CREATE DATABASE myapp;"
mysql -u root -e "CREATE USER 'myapp'@'localhost';"
mysql -u root -e "GRANT ALL PRIVILEGES ON myapp.*;"

# Deploy application
cd /var/www/html
tar -xzf myapp.tar.gz
chown -R www-data:www-data /var/www/html

# Start services
service mysql start
service nginx start

Configuration Management (2010s)

By 2010, tools like Puppet introduced a paradigm shift. Instead of listing commands, teams defined their desired system state in a declarative format. The tool would figure out how to achieve that state:

package { ['nginx', 'mysql-server']:
  ensure => installed,
}

service { 'nginx':
  ensure  => running,
  enable  => true,
  require => Package['nginx'],
}

service { 'mysql':
  ensure  => running,
  enable  => true,
  require => Package['mysql-server'],
}

exec { 'create-db':
  command => 'mysql -u root -e "CREATE DATABASE myapp;"',
  unless  => 'mysql -u root -e "SHOW DATABASES;" | grep myapp',
  require => Service['mysql'],
}

file { '/var/www/html/myapp':
  ensure  => directory,
  source  => 'puppet:///modules/myapp/files',
  recurse => true,
  owner   => 'www-data',
  group   => 'www-data',
}

Infrastructure as Code (2016+)

Cloud platforms like AWS took this declarative approach even further. With CloudFormation, engineers simply specified the resources they needed, and AWS handled all implementation details:

AWSTemplateFormatVersion: '2010-09-09'
Resources:
  WebServer:
    Type: AWS::EC2::Instance
    Properties:
      InstanceType: t2.micro
      ImageId: ami-0c55b159cbfafe1f0
      UserData: 
        Fn::Base64: !Sub |
          #!/bin/bash
          yum update -y
          yum install -y nginx

  Database:
    Type: AWS::RDS::DBInstance
    Properties:
      Engine: mysql
      DBName: myapp
      MasterUsername: admin
      MasterUserPassword: password
      DBInstanceClass: db.t2.micro

  SecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow web traffic
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0

The difference over the years is clear: instead of writing out the exact steps to execute (imperative), engineers moved to specifying what they wanted and letting the system figure out how to get there (declarative).

Evolution of Infrastructure Teams

This shift from imperative to declarative approaches fundamentally changed how infrastructure teams operated. Let's look at what this evolution meant in practice.

In the early days, deploying infrastructure involved a lot of back-and-forth between developers and system administrators: developers described what they needed, filed requests, and waited while sysadmins translated those requests into manual commands and scripts.

With declarative infrastructure, the interaction changed dramatically: developers could express what they needed as configuration, and infrastructure engineers reviewed and applied it.

The key difference was that teams now had a common language - YAML configurations - that both developers and infrastructure engineers could understand and work with effectively. This shift to configuration-driven workflows revolutionized how infrastructure teams operated.

Current state of data pipelines

Data teams today face similar challenges to what infrastructure teams encountered in the 2000s. To understand these parallels, let's examine how data pipelines are typically built, using a common scenario: moving data from S3 to GCS, then loading it into BigQuery - a pattern you might use when you have transactional database backups in S3 but your analytics stack runs on BigQuery.

import tempfile
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.google.cloud.hooks.gcs import GCSHook
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

default_args = {
    'owner': 'data-team',
    'start_date': datetime(2024, 1, 1),
    'retries': 1,
}


def transfer_s3_to_gcs(
    s3_bucket,
    s3_key,
    gcs_bucket,
    gcs_object_name,
    aws_conn_id='aws_default',
    gcp_conn_id='google_cloud_default'
):
    # Initialize hooks
    s3_hook = S3Hook(aws_conn_id=aws_conn_id)
    gcs_hook = GCSHook(gcp_conn_id=gcp_conn_id)

    # Create a temporary file to stage the data on the worker
    with tempfile.NamedTemporaryFile() as temp_file:
        # Download the object from S3 into the temporary file
        s3_hook.get_key(key=s3_key, bucket_name=s3_bucket).download_file(temp_file.name)

        # Upload the file to GCS
        gcs_hook.upload(
            bucket_name=gcs_bucket,
            object_name=gcs_object_name,
            filename=temp_file.name
        )

# Create DAG
dag = DAG(
    's3_to_gcs_transfer',
    default_args=default_args,
    description='Transfer files from S3 to GCS',
    schedule_interval='@daily',
    catchup=False
)

# Define the transfer task
transfer_task = PythonOperator(
    task_id='transfer_s3_to_gcs',
    python_callable=transfer_s3_to_gcs,
    op_kwargs={
        's3_bucket': 'your-s3-bucket-name',
        's3_key': 'path/to/your/file.csv',
        'gcs_bucket': 'your-gcs-bucket-name',
        'gcs_object_name': 'path/to/destination/file.csv'
    },
    dag=dag
)

# Load from GCS to BigQuery
load_to_bq = GCSToBigQueryOperator(
    task_id='load_to_bigquery',
    bucket='your-gcs-bucket-name',
    source_objects=['path/to/destination/file.csv'],
    destination_project_dataset_table='your-project:dataset.table',
    dag=dag
)

# Set task dependencies
transfer_task >> load_to_bq

Compared to the DevOps workflows we saw earlier, this code is imperative in nature: we tell the orchestrator how to perform each step in the data pipeline, rather than declaring what needs to be done and letting it figure out the how.

Data team interactions

Unsurprisingly, data team workflows within companies mirror the developer-sysadmin interactions of the 2000s. When a data scientist or analyst needs a new pipeline, they have to coordinate with data engineers who are often juggling multiple priorities.

Reusability

Another significant downside of the current imperative approach in Airflow is the difficulty of code reuse. Looking at Airflow's abstraction layers, we can see why:

While Airflow provides hooks and operators that can be reused across DAGs, the tasks themselves must be written directly in the DAG file along with the business logic. Even if we need to create a new DAG that shares most of the same tasks but adds just one additional step, we have to rewrite all the task definitions in the new DAG file.
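To make the duplication concrete, here is a rough sketch of what such a second DAG file might look like if it reused the same transfer and load steps and added a single validation task at the end. The DAG name, the validation step, and its callable are hypothetical, and the sketch assumes the same imports, default_args, and the transfer_s3_to_gcs function from the previous example are copied into this file - which is exactly the duplication problem.

# Hypothetical second DAG file. Everything except validate_row_count is
# re-declared even though only one step is actually new.
def validate_row_count():
    # Placeholder validation logic, for illustration only
    print('validating row count')

dag = DAG(
    's3_to_gcs_transfer_with_validation',
    default_args=default_args,
    schedule_interval='@daily',
    catchup=False
)

transfer_task = PythonOperator(
    task_id='transfer_s3_to_gcs',
    python_callable=transfer_s3_to_gcs,   # same callable, copied into this file
    op_kwargs={
        's3_bucket': 'your-s3-bucket-name',
        's3_key': 'path/to/your/file.csv',
        'gcs_bucket': 'your-gcs-bucket-name',
        'gcs_object_name': 'path/to/destination/file.csv'
    },
    dag=dag
)

load_to_bq = GCSToBigQueryOperator(
    task_id='load_to_bigquery',
    bucket='your-gcs-bucket-name',
    source_objects=['path/to/destination/file.csv'],
    destination_project_dataset_table='your-project:dataset.table',
    dag=dag
)

# The only genuinely new task in this DAG
validate_task = PythonOperator(
    task_id='validate_row_count',
    python_callable=validate_row_count,
    dag=dag
)

transfer_task >> load_to_bq >> validate_task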

The tasks have to be defined all over again in this DAG, even though the transfer_s3_to_gcs and load_to_bigquery tasks are identical to those in the previous DAG.

Tight coupling

The real challenge becomes apparent when we need to update implementation details. Consider a scenario where we need to make our transfer process more scalable. The current approach, which might work for smaller files, will fail for file sizes exceeding the Airflow worker's memory. Even worse, it could stall or crash the worker, disrupting other executing DAGs.

Instead, to ensure the workload scales independently of the Airflow workers, we can create a dedicated Compute Engine instance and use its startup script (STARTUP_SCRIPT below) to transfer the files.

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.compute import (
    ComputeEngineDeleteInstanceOperator,
    ComputeEngineInsertInstanceOperator,
)
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

default_args = {
    'owner': 'data-team',
    'start_date': datetime(2024, 1, 1),
}

# STARTUP_SCRIPT is a shell script (defined elsewhere) that reads the metadata
# values below and copies the file from S3 to GCS on the instance itself
STARTUP_SCRIPT = "..."

with DAG('s3_to_gcs_transfer',
         default_args=default_args,
         schedule_interval='@daily',
         catchup=False) as dag:

    # Create and start the instance
    # The startup script performs the transfer on the instance
    create_instance = ComputeEngineInsertInstanceOperator(
        task_id='create_transfer_instance',
        project_id='your-project-id',
        zone='us-central1-a',
        body={
            'name': 'transfer-instance-{{ ds_nodash }}',
            # machineType must be a partial URL; a complete body also needs
            # 'disks' and 'networkInterfaces' entries, omitted here for brevity
            'machineType': 'zones/us-central1-a/machineTypes/n1-standard-2',
            'metadata': {
                'items': [
                    {'key': 'startup-script', 'value': STARTUP_SCRIPT},
                    {'key': 's3_bucket', 'value': 'your-s3-bucket'},
                    {'key': 's3_key', 'value': 'path/to/your/file'},
                    {'key': 'gcs_bucket', 'value': 'your-gcs-bucket'},
                    {'key': 'gcs_path', 'value': 'destination-path'}
                ]
            }
        }
    )

    # Delete the instance once downstream work is done
    delete_instance = ComputeEngineDeleteInstanceOperator(
        task_id='delete_transfer_instance',
        project_id='your-project-id',
        zone='us-central1-a',
        resource_id='transfer-instance-{{ ds_nodash }}',
        trigger_rule='all_done'
    )

    # Load from GCS to BigQuery
    load_to_bq = GCSToBigQueryOperator(
        task_id='load_to_bigquery',
        bucket='your-gcs-bucket',
        source_objects=['path/to/destination/file.csv'],
        destination_project_dataset_table='project:dataset.table'
    )

    # In practice a sensor would wait for the transferred file to land in GCS
    # before loading; the chain below keeps the example simple
    create_instance >> load_to_bq >> delete_instance

When we need to update the DAGs to use this scalable approach, there isn't a clean way to do it. Every DAG using the transfer logic needs to be updated, tested, and redeployed individually. The refactoring effort multiplies with the number of files that need updating.

This limitation stems from Airflow's fundamental design pattern where business logic (the nodes and structure of the DAG) is tightly coupled with implementation logic (the actual task code). This coupling not only makes maintenance difficult but also creates a barrier between data engineers and other team members who could potentially define and modify pipelines themselves.

Declarative data pipelines

A declarative data platform could address these challenges by completely separating the technical implementation details (how to move data from S3 to GCS) from the pipeline business logic (which pipeline needs to move data from S3 to GCS). This separation would allow data engineers to focus on building robust, reusable components while enabling analysts and scientists to define pipelines without deep technical knowledge.

Config-based workflows

In summary, we can identify three key requirements for building an effective declarative data platform:

  1. Separation of Concerns: Pipeline definitions (what) should be completely separate from task implementations (how)
  2. Reusability: Tasks should be reusable across different pipelines without code duplication
  3. Simplified Interface: Teams should be able to define pipelines using a simple, declarative syntax

Here's how such a platform might work. Instead of writing Python code, you'd define your pipeline in YAML:

tasks:
  - name: daily_transfer
    task: s3_to_gcs
    params:
      s3_bucket: your-s3-bucket-name
      s3_key: path/to/your/file.csv
      gcs_bucket: your-gcs-bucket-name
      gcs_object_name: path/to/destination/file.csv

  - name: daily_load_to_bq
    task: gcs_to_bq
    depends_on: daily_transfer
    params:
      gcs_bucket: your-gcs-bucket-name
      gcs_object_name: path/to/destination/file.csv
      bq_dataset: your-bq-dataset-name
      bq_table: your-bq-table-name

The actual implementation of these tasks would be written in Python, but crucially, they would be independent of any specific pipeline. This separation allows data engineers to optimize and maintain task implementations without affecting pipeline definitions.
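As a rough illustration of what that might look like - the task registry, decorator, and function names below are hypothetical, not an existing framework - the task library maintained by data engineers could implement each task once and expose it under the name used in the YAML:

# Hypothetical task library maintained by data engineers.
# Pipeline YAML files refer to these implementations by name; none of the
# code below knows anything about a specific pipeline.
TASK_REGISTRY = {}

def register_task(name):
    """Register a task implementation under the name used in pipeline YAML."""
    def decorator(func):
        TASK_REGISTRY[name] = func
        return func
    return decorator

@register_task('s3_to_gcs')
def s3_to_gcs(s3_bucket, s3_key, gcs_bucket, gcs_object_name):
    # The transfer implementation (e.g. the Compute Engine based approach)
    # lives here and can change without touching any pipeline definition
    ...

@register_task('gcs_to_bq')
def gcs_to_bq(gcs_bucket, gcs_object_name, bq_dataset, bq_table):
    # Load the GCS object into the BigQuery table
    ...

def run_task(task_config):
    """Resolve a YAML task entry to its implementation and execute it."""
    func = TASK_REGISTRY[task_config['task']]
    return func(**task_config['params'])

A thin layer on top of the orchestrator would then read the YAML, look up each task name in the registry, and wire up the dependencies, so pipeline authors never have to touch the Python.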

Benefits

This approach also offers other key advantages.

Reduced cognitive load

Declarative systems mirror how humans naturally think about problem-solving. When designing a new pipeline, we typically think top-down about what tasks need to be included. However, imperative systems like Airflow require bottom-up thinking, where you must detail every implementation step before building the pipeline.

Improved collaboration

Since task implementations are decoupled from pipeline business logic, teams can work more efficiently:

  • Data analysts can create and modify pipelines without deep technical knowledge