Overview
This study note summarizes the video "DE Zoomcamp 2.2.7 - Manage Schedules and Backfills with BigQuery in Kestra". The video focuses on adding schedules to a BigQuery workflow and performing backfills for 2019 and 2020 data using the workflow orchestration tool Kestra.
Key Concepts
- Workflow Orchestration:
  - Automating tasks and data pipelines so that data workflows run reliably without manual intervention.
- Kestra:
  - An open-source orchestration tool used to manage complex workflows.
- BigQuery:
  - A fully managed, serverless data warehouse that enables scalable analysis over large datasets.
- Backfills:
  - Retroactively running a scheduled workflow for past dates to fill in missing data.
Workflow Overview
- The workflow automatically schedules data extraction and uploads to BigQuery.
- It replaces manual date-input selection with triggers that run tasks at specified times.
- It supports both the "green" and "yellow" taxi datasets (a minimal flow sketch follows this list).
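A minimal sketch of what such a flow can look like in Kestra's YAML, assuming a single `taxi` input that selects the dataset; the flow id, namespace, and placeholder task are illustrative, not taken from the video:

```yaml
id: gcp_taxi_scheduled   # illustrative flow name
namespace: zoomcamp      # illustrative namespace

inputs:
  - id: taxi
    type: SELECT
    displayName: Select taxi type
    values: [green, yellow]
    defaults: green

tasks:
  # the real tasks (extract from GitHub, upload to GCS, load into BigQuery)
  # are sketched in the sections below
  - id: placeholder
    type: io.kestra.plugin.core.log.Log
    message: "Processing {{ inputs.taxi }} taxi data"

# triggers replace the manual year/month inputs; see "Set Up Triggers" below
```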
Steps to Implement Schedules and Backfills
- Set Up Triggers:
  - Triggers are set to run at different times for different datasets.
  - Example: the "green" dataset at 9:00 AM and the "yellow" dataset at 10:00 AM on the 1st of each month, as in the sketch below.
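In Kestra these map to two `Schedule` triggers with cron expressions, each passing the dataset name through the flow's `taxi` input. A sketch:

```yaml
triggers:
  - id: green_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 9 1 * *"    # 09:00 on the 1st of every month
    inputs:
      taxi: green

  - id: yellow_schedule
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 10 1 * *"   # 10:00 on the 1st of every month
    inputs:
      taxi: yellow
```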
- Workflow Modifications:
  - Replace the manual date inputs with values derived from the trigger.
  - Update templated expressions (file names, table names) so they read the schedule date instead of user input, as in the sketch below.
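Concretely, expressions that previously read a manual date input can read the schedule's date instead. A sketch using Kestra's `trigger.date` variable, falling back to the execution start date so manual runs still work:

```yaml
variables:
  # resolves to e.g. green_tripdata_2019-01.csv
  file: "{{ inputs.taxi }}_tripdata_{{ trigger.date ?? execution.startDate | date('yyyy-MM') }}.csv"
```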
- Handling Data Storage:
  - Extract the data from GitHub.
  - Upload it to Google Cloud Storage (GCS) so BigQuery can load it (see the task sketch below).
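A sketch of the extract-and-upload pair, assuming the DataTalksClub release URL used in the course and an illustrative bucket name:

```yaml
tasks:
  - id: extract
    type: io.kestra.plugin.scripts.shell.Commands
    outputFiles:
      - "*.csv"
    taskRunner:
      type: io.kestra.plugin.core.runner.Process
    commands:
      # download the month's gzipped CSV from GitHub and decompress it
      - wget -qO- https://github.com/DataTalksClub/nyc-tlc-data/releases/download/{{ inputs.taxi }}/{{ render(vars.file) }}.gz | gunzip > {{ render(vars.file) }}

  - id: upload_to_gcs
    type: io.kestra.plugin.gcp.gcs.Upload
    from: "{{ outputs.extract.outputFiles[render(vars.file)] }}"
    to: "gs://your-bucket/{{ render(vars.file) }}"   # bucket name is illustrative
```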
- Executing Backfills:
  - Use the trigger to execute backfills for previous months (in the Kestra UI, backfills are launched from the flow's Triggers tab over a chosen date range).
  - Example: backfill the entire year of 2019.
- Managing BigQuery Tables:
  - Delete existing tables in BigQuery to start with a fresh dataset (see the SQL sketch below).
  - Set the backfill date range and execute the backfill in Kestra.
  - Monitor the execution and check for completion.
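Starting fresh can be done with standard BigQuery SQL; project, dataset, and table names here are illustrative:

```sql
-- drop the destination tables so the backfill rebuilds them from scratch
DROP TABLE IF EXISTS `your-project.zoomcamp.green_tripdata`;
DROP TABLE IF EXISTS `your-project.zoomcamp.yellow_tripdata`;
```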
Practical Steps
- Trigger Configuration:
  - Set up triggers in Kestra to run at the specified times (see the cron breakdown below).
  - Ensure triggers run only after the data is available (e.g., after 9:00 AM on the 1st of each month).
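For reference, the five fields of the cron expression used above read minute, hour, day of month, month, and day of week:

```yaml
# cron: "0 9 1 * *"
#        │ │ │ │ └─ day of week: any
#        │ │ │ └─── month: every month
#        │ │ └───── day of month: the 1st
#        │ └─────── hour: 9 (09:00)
#        └───────── minute: 0
```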
- Workflow Execution:
  - Add labels to track executions and differentiate normal runs from backfills (see the label sketch below).
  - Monitor progress in the execution logs.
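Kestra supports labels at the flow level and per execution; when launching a backfill from the UI, a label such as `backfill: true` can be attached to the generated executions. A sketch of flow-level labels plus the core `Labels` task for tagging an execution at runtime (label keys and values are illustrative):

```yaml
labels:
  owner: zoomcamp   # flow-level label, applied to every execution

tasks:
  - id: set_label
    type: io.kestra.plugin.core.execution.Labels
    labels:
      file: "{{ render(vars.file) }}"   # tag the execution with the file it processed
```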
- Data Management:
  - Ensure the BigQuery dataset is empty before starting backfills.
  - Check the Google Cloud Storage bucket for the uploaded CSV files.
  - Verify the data in the BigQuery tables (see the query sketch below).
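A quick verification query (project and dataset names are illustrative; `lpep_pickup_datetime` is the pickup column in the green dataset):

```sql
-- row counts per month confirm the backfill covered every period
SELECT
  FORMAT_DATE('%Y-%m', DATE(lpep_pickup_datetime)) AS month,
  COUNT(*) AS rows_loaded
FROM `your-project.zoomcamp.green_tripdata`
GROUP BY month
ORDER BY month;
```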
Summary of Execution
- Labels are used to indicate backfill executions.
- Data for all months is processed and stored in BigQuery.
  - Example: 12 CSV files, each around 600-700 MB, uploaded to GCS.
- The BigQuery table contains 18.9 GB of data, with unique row IDs preventing duplicates (see the MERGE sketch below).
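The deduplication relies on a deterministic `unique_row_id` (the course derives it by hashing identifying columns) combined with a `MERGE` that only inserts rows it has not seen. A simplified sketch with illustrative table names:

```sql
-- re-running a month's backfill never duplicates rows:
-- only rows whose unique_row_id is absent from the target get inserted
MERGE `your-project.zoomcamp.green_tripdata` T
USING `your-project.zoomcamp.green_tripdata_staging` S
  ON T.unique_row_id = S.unique_row_id   -- e.g. MD5 over vendor id + pickup/dropoff timestamps
WHEN NOT MATCHED THEN
  INSERT ROW;
```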
Conclusion
- The video demonstrates how to manage schedules and backfills effectively using Kestra and BigQuery.
- Automating workflows enables efficient data processing, even for large datasets.
- Future videos will cover running workflows with dbt and deploying to the cloud.
Additional Resources
- Kestra Documentation: https://kestra.io/docs
- BigQuery Documentation: https://cloud.google.com/bigquery/docs
- DE Zoomcamp Resources: https://github.com/DataTalksClub/data-engineering-zoomcamp
Notes
- Ensure your system has sufficient storage and processing power when running large datasets locally.
- Always verify the scheduled times and data availability to avoid execution failures.
- Utilize labels and logs to monitor and debug workflows efficiently.
Author: Pizofreude