Overview

This study note covers the video "DE Zoomcamp 2.2.7 - Manage Schedules and Backfills with BigQuery in Kestra". The video focuses on adding schedules to a BigQuery workflow and backfilling data from 2019 and 2020 using the workflow orchestration tool Kestra.

Key Concepts

  1. Workflow Orchestration:
    • Automating and coordinating the tasks in a data pipeline so they run reliably, in order, and on schedule.
  2. Kestra:
    • An open-source orchestration tool used to manage complex workflows.
  3. BigQuery:
    • A fully-managed, serverless data warehouse that enables scalable analysis over large datasets.
  4. Backfills:
    • Re-running a scheduled workflow for past dates to load historical data that the schedule did not originally cover.

Workflow Overview

  • The workflow allows the automatic scheduling of data extraction and uploading to BigQuery.
  • It replaces the manual date input with schedule triggers that run the flow automatically at specified times.
  • Supports both "green" and "yellow" datasets.
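The trigger setup described above can be sketched as a Kestra flow header. This is a minimal sketch, not the exact flow from the video; the flow id, namespace, and plugin type are illustrative assumptions:

```yaml
# Hypothetical flow header: two schedule triggers replace the manual date input.
id: gcp_taxi_scheduled        # assumed flow id
namespace: zoomcamp           # assumed namespace

triggers:
  - id: green_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 9 1 * *"         # 09:00 on the 1st of each month
    inputs:
      taxi: green
  - id: yellow_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 10 1 * *"        # 10:00 on the 1st of each month
    inputs:
      taxi: yellow
```

Each trigger passes a different `taxi` input, so one flow serves both datasets on their own schedules.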

Steps to Implement Schedules and Backfills

  1. Setup Triggers:
    • Triggers are set to run at different times for different datasets.
    • Example: "Green" dataset at 9:00 AM and "Yellow" dataset at 10:00 AM on the 1st of each month.
  2. Workflow Modifications:
    • Replace manual inputs with automatic triggers.
    • Modify the workflow to handle the new triggers.
  3. Handling Data Storage:
    • Extract data from GitHub.
    • Upload the data to Google Cloud Storage (GCS) for BigQuery to use.
  4. Executing Backfills:
    • Use Kestra's backfill feature on the schedule trigger to re-run the flow for previous months.
    • Example: Backfill for the entire year of 2019.
  5. Managing BigQuery Tables:
    • Delete existing tables in BigQuery to start with a fresh dataset.
    • Set backfill date range and execute backfill in Kestra.
    • Monitor execution and check for completion.
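Within the flow, the scheduled (or backfilled) execution date can replace the old manual inputs through Kestra's `trigger.date` expression. A hedged sketch of the extraction step (task ids and the download URL pattern are assumptions based on the dataset described):

```yaml
variables:
  # e.g. green_tripdata_2019-01.csv for a backfilled January 2019 run
  file: "{{ inputs.taxi }}_tripdata_{{ trigger.date | date('yyyy-MM') }}.csv"

tasks:
  - id: extract
    type: io.kestra.plugin.scripts.shell.Commands
    outputFiles:
      - "*.csv"
    commands:
      # download and decompress the month's file from the GitHub releases
      - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{ inputs.taxi }}/{{ render(vars.file) }}.gz | gunzip > {{ render(vars.file) }}
```

When a backfill is launched for, say, all of 2019, Kestra executes the flow once per scheduled date, and `trigger.date` resolves to each backfilled month in turn.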

Practical Steps

  1. Trigger Configuration:
    • Set up triggers in Kestra to run at the specified times.
    • Ensure triggers are set to run after the data is available (e.g., after 9:00 AM on the 1st of each month).
  2. Workflow Execution:
    • Add labels to track execution and differentiate between normal runs and backfills.
    • Monitor the progress in the execution logs.
  3. Data Management:
    • Ensure the BigQuery dataset is empty before starting backfills.
    • Check the Google Cloud Storage bucket for uploaded CSV files.
    • Verify the data in BigQuery tables.
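Flow-level labels (step 2 above) tag every execution; backfill-specific labels can then be added when launching the backfill so those runs stand out in the logs. A minimal sketch with illustrative label keys:

```yaml
# Hypothetical flow-level labels; applied to every execution of the flow.
labels:
  owner: zoomcamp
  project: taxi-pipeline
# When launching a backfill from the Kestra UI, an extra execution label
# such as backfill: true can be attached to distinguish backfill runs
# from normal scheduled runs.
```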

Summary of Execution

  • Labels are used to indicate backfill execution.
  • Data for all months is processed and stored in BigQuery.
  • Example: 12 CSV files, each around 600-700 MB, uploaded to GCS.
  • BigQuery table contains 18.9 GB of data with unique row IDs to prevent duplicates.
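Deduplication via unique row IDs is typically done with a BigQuery MERGE keyed on a hashed ID: only rows whose ID is not already in the target table get inserted, so re-running a month is safe. A hedged sketch as a Kestra BigQuery task (project, table names, and the id column are illustrative):

```yaml
tasks:
  - id: merge_into_final_table
    type: io.kestra.plugin.gcp.bigquery.Query
    sql: |
      -- insert only rows whose unique_row_id is not already present
      MERGE INTO `my-project.zoomcamp.green_tripdata` T
      USING `my-project.zoomcamp.green_tripdata_ext` S
      ON T.unique_row_id = S.unique_row_id
      WHEN NOT MATCHED THEN
        INSERT ROW
```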

Conclusion

  • The video demonstrates how to manage schedules and backfills effectively using Kestra and BigQuery.
  • Automating workflows helps in efficient data processing and handling large datasets.
  • Future videos will cover running workflows with dbt and deploying to the cloud.

Notes

  • Ensure your system has sufficient storage and processing power when running large datasets locally.
  • Always verify the scheduled times and data availability to avoid execution failures.
  • Utilize labels and logs to monitor and debug workflows efficiently.

Author of article: Pizofreude