Building a Scalable, Reliable, and Cost-Effective Event Scheduler for Asynchronous Jobs

Introduction

Welcome back to my blog! 😁 This is where I talk to myself—and hopefully, to you—about the engineering problems I solve at work. I do this mainly because finding solutions excites me. My journey of identifying inefficiencies, bottlenecks, and challenges has led me to tackle a common yet critical problem in software engineering.

That problem is the need to execute actions asynchronously—often with precise timing and sometimes on a recurring basis. Following my core approach to problem-solving (across space and time), I decided to build a solution that wasn’t just tailored to a single action but was extendable to various use cases. Whether it's sending notifications, processing transactions, or triggering system workflows, many tasks require scheduled execution. Without a robust scheduling mechanism, handling these jobs efficiently can quickly become complex, unreliable, and costly.

To address this, I set out to build a scalable, reliable, and cost-effective event scheduler—one that could manage delayed, immediate and recurring actions seamlessly.

In this article, I’ll walk you through:

The problem that led to the need for an event scheduler
The functional and non-functional requirements for an ideal solution
The system design and architecture decisions behind the implementation

By the end, you’ll have a clear understanding of how to build a serverless scheduled actions system that ensures accuracy, durability, and scalability while keeping costs in check. Let’s dive in!

The Problem: Managing Subscription Changes Across a Calendar Cycle

Subscription management comes with unique challenges, especially when handling cancellations or downgrades 😭. Users can request these changes at any time during their billing cycle, but due to the prepaid nature of subscriptions, such modifications can only take effect at the end of the cycle. This delay introduces a need for asynchronous execution—a system that can record these requests immediately but defer their execution until the appropriate time.

The Solution: A Proper Scheduling Mechanism

Without a proper scheduling mechanism, managing these deferred actions efficiently becomes complex. The system must ensure that every request is executed at the right time while preventing missed or duplicate actions. Furthermore, frequent executions—such as batch processing of multiple scheduled changes—must be handled without overwhelming the system. To address this, we needed a reliable, scalable, and cost-effective scheduler capable of handling delayed and recurring execution seamlessly.

Functional Requirements: Defining the Core Capabilities

A robust and scalable scheduled actions system must be able to efficiently schedule, execute, update, monitor, and retry actions while ensuring reliability and flexibility.

1. Scheduling and Creating Actions

The system must allow users to schedule actions with:

Action type, execution time, execution data, and metadata as required fields.
Optional fields like repeat, frequency, and execution remainder.
Early validation to ensure actions conform to their fulfillment requirements.
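
To make the early validation step concrete, here is a minimal sketch. The `validateAction` helper, the `CreateActionRequest` shape, and the set of accepted frequencies are my own assumptions for illustration; only the field names come from this article.

```typescript
// Hypothetical request shape; the field names follow the article.
interface CreateActionRequest {
  action?: string;
  executionTime?: number; // assumed to be epoch milliseconds
  data?: Record<string, unknown>;
  metadata?: Record<string, unknown>;
  repeat?: boolean;
  frequency?: string;
  executionRemainder?: number;
}

// Assumed set of frequencies; extend as needed.
const KNOWN_FREQUENCIES = new Set(["HOURLY", "DAILY", "WEEKLY"]);

// Returns a list of validation errors; an empty list means the request is valid.
function validateAction(req: CreateActionRequest): string[] {
  const errors: string[] = [];
  if (!req.action) errors.push("action is required");
  if (typeof req.executionTime !== "number" || Number.isNaN(req.executionTime)) {
    errors.push("executionTime must be a number");
  }
  if (!req.data) errors.push("data is required");
  if (!req.metadata) errors.push("metadata is required");
  if (req.repeat) {
    // Repeating actions additionally need a frequency and a positive run count.
    if (!req.frequency || !KNOWN_FREQUENCIES.has(req.frequency)) {
      errors.push("frequency must be a known value for repeating actions");
    }
    const remainder = req.executionRemainder;
    if (remainder === undefined || !Number.isInteger(remainder) || remainder < 1) {
      errors.push("executionRemainder must be a positive integer for repeating actions");
    }
  }
  return errors;
}
```

Returning every error at once, rather than failing on the first one, lets the caller fix a request in a single round trip.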

2. Updating or Deleting an Action

Users can update or delete an action before it is locked (2 minutes before execution). Once locked, no external changes are allowed.
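
The lock rule boils down to a single time comparison. This is a sketch only: `isLocked` is a hypothetical helper, and treating the exact 2-minute mark as already locked is my assumption.

```typescript
// 2 minutes, as stated in the requirement above.
const LOCK_WINDOW_MS = 2 * 60 * 1000;

// True when the action is inside the lock window and must reject external changes.
// Both arguments are epoch milliseconds.
function isLocked(executionTime: number, now: number = Date.now()): boolean {
  return executionTime - now <= LOCK_WINDOW_MS;
}
```

An update or delete endpoint would call this before touching the record and reject the request when it returns true.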

3. Action Status Management

Each action must have an internally managed status that reflects its execution progress. Status transitions and results must be logged in metadata for tracking.
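
Here is one way the status bookkeeping could look after an execution attempt. It is a sketch under assumptions: `finishExecution`, the `history` array inside metadata, and the fixed one-day interval are hypothetical. I also assume `executionRemainder` counts the runs still to be executed, including the current one.

```typescript
type Status = "PENDING" | "IN_PROGRESS" | "COMPLETED" | "FAILED" | "NO_ACTION";

interface ActionState {
  status: Status;
  executionTime: number; // epoch milliseconds
  repeat?: boolean;
  executionRemainder?: number;
  metadata: { history: { status: Status; at: number }[] };
}

// Assumed interval for a daily recurring action.
const DAY_MS = 24 * 60 * 60 * 1000;

// Computes the next state after an execution attempt and logs the transition in metadata.
function finishExecution(state: ActionState, succeeded: boolean, now: number = Date.now()): ActionState {
  const remaining = (state.executionRemainder ?? 1) - 1;
  let next: ActionState;
  if (!succeeded) {
    next = { ...state, status: "FAILED" };
  } else if (state.repeat && remaining > 0) {
    // Recurring action with runs left: decrement the remainder and reset to PENDING.
    next = { ...state, status: "PENDING", executionRemainder: remaining, executionTime: state.executionTime + DAY_MS };
  } else {
    next = { ...state, status: "COMPLETED", executionRemainder: state.repeat ? 0 : state.executionRemainder };
  }
  return { ...next, metadata: { history: [...state.metadata.history, { status: next.status, at: now }] } };
}
```

Because the function returns a new object instead of mutating its input, the previous state stays available for logging or for a conditional write.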

4. Action Fulfillment Mapping

Every action must map to a specific fulfillment service responsible for its execution. Actions without a matching fulfillment service must be flagged to prevent execution errors.
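
A simple registry is one way to express this mapping. The names below (`fulfillmentRegistry`, `resolveFulfillment`) are hypothetical, and the registered handler is a stand-in that only logs.

```typescript
type FulfillmentHandler = (data: Record<string, unknown>) => Promise<void>;

// Hypothetical registry: action type -> fulfillment handler.
const fulfillmentRegistry = new Map<string, FulfillmentHandler>();
fulfillmentRegistry.set("SEND_NOTIFICATION", async (data) => {
  console.log("sending notification", data); // stand-in for a real notification client
});

type Resolution =
  | { ok: true; handler: FulfillmentHandler }
  | { ok: false; reason: string };

// Looks up the handler for an action type; unknown types are flagged instead of executed.
function resolveFulfillment(action: string): Resolution {
  const handler = fulfillmentRegistry.get(action);
  if (!handler) {
    return { ok: false, reason: `No fulfillment service registered for "${action}"` };
  }
  return { ok: true, handler };
}
```

Running the same lookup at creation time would let the API reject unknown action types before they are ever stored.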

5. Retrying Failed Actions

Failed actions must retry using exponential backoff to handle temporary failures. Actions that exceed the maximum retry limit must be flagged for manual intervention.
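
Exponential backoff with a retry cap can be sketched in a few lines. The base delay and the maximum retry count are assumed values, and `nextRetryDelayMs` is a hypothetical helper.

```typescript
const BASE_DELAY_MS = 30000; // assumed base delay of 30 seconds
const MAX_RETRIES = 5; // assumed retry limit

// Returns the delay before the next retry, or null once the limit is reached
// and the action should be flagged for manual intervention.
function nextRetryDelayMs(retryCount: number): number | null {
  if (retryCount >= MAX_RETRIES) return null;
  return BASE_DELAY_MS * 2 ** retryCount; // 30s, 60s, 120s, 240s, 480s
}
```

In practice you may also want to add random jitter to the delay so that many failed actions do not retry at the same instant.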

6. Handling Immediate vs. Delayed Actions vs. Repeated Actions

The system must distinguish between immediate, delayed, and repeated actions to ensure timely execution:

Immediate actions (execution within 2 minutes) must be processed in real-time without scheduling delays.
Delayed actions (execution after 2 minutes) must be scheduled and processed at the correct time.
Repeated actions must be processed the required number of times at the required frequency.
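
The split between the two execution paths can be sketched as a single comparison against the 2-minute threshold. `classifyAction` is a hypothetical helper, and treating the exact 2-minute mark as immediate is my assumption.

```typescript
type ExecutionPath = "IMMEDIATE" | "DELAYED";

// 2 minutes, the threshold stated in the requirements above.
const IMMEDIATE_THRESHOLD_MS = 2 * 60 * 1000;

// Picks the execution path from the time left until executionTime (epoch milliseconds).
function classifyAction(executionTime: number, now: number = Date.now()): ExecutionPath {
  return executionTime - now <= IMMEDIATE_THRESHOLD_MS ? "IMMEDIATE" : "DELAYED";
}
```

A repeated action could go through the same check each time its next execution time is set.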

Non-Functional Requirements (NFR): Ensuring a Reliable and Scalable System

A scheduled actions system must meet key NFRs to guarantee reliability, scalability, security, and maintainability.

1. Reliability and Durability

Actions must execute correctly and on time (±2 minutes).
Repeating actions must execute exactly as scheduled with the correct frequency.
Failed actions must retry with exponential backoff, and non-repeating actions must execute only once.

2. Scalability

The system must scale dynamically to handle high request loads.
A serverless architecture ensures cost-efficiency and flexibility.
A queue-based approach (e.g., AWS SQS) must regulate execution frequency to prevent overloading downstream services.

3. Availability

The system must be always available with no cold starts, ensuring immediate execution when needed. A serverless architecture supports this at a reasonable cost.

4. Security

Signature-based validation must secure requests and prevent unauthorized execution.
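
The article does not name a signature scheme, so the sketch below assumes HMAC-SHA256 over the raw request body with a shared secret. `sign` and `verifySignature` are hypothetical helpers built on Node's `crypto` module.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Computes a hex-encoded HMAC-SHA256 signature of the request body.
function sign(body: string, secret: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// Verifies a received signature in constant time.
function verifySignature(body: string, signature: string, secret: string): boolean {
  const expected = Buffer.from(sign(body, secret), "hex");
  const received = Buffer.from(signature, "hex");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}
```

The caller signs the exact bytes it sends, and the middleware recomputes the signature before the request reaches any controller.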

5. Maintainability

The system must be modular, encapsulated, and organized within a single repository.
Infrastructure and database indexing rules must be codified.
A typed language must be used for better reliability.
Local testing must be enabled with encrypted environment variables.
A startup script can automate package installation and environment setup.
Comprehensive tests must ensure safe changes and integration.

6. Observability

API endpoints must expose:
- All scheduled actions.
- Actions filtered by status.
- Retry functionality for failed actions.
A centralized logging system must track execution issues consistently.

Tools: Powering the Scheduled Actions System

AWS Lambda: Serverless Compute for Execution

Enables event-driven execution without managing servers.
Handles action scheduling and validation.
Processes immediate actions using real-time event streams.
Executes delayed actions at the scheduled time.
Manages fulfillment tasks based on the action type.

Amazon EventBridge: Managing Scheduled Execution

Acts as a scheduler for delayed actions.
Triggers a Lambda function every 5 minutes that looks up due pending actions and enqueues them for processing.
Ensures execution happens within ±2 minutes of the scheduled time.

Amazon SQS: Queueing Actions for Scalability

Decouples execution workloads by handling scheduled actions asynchronously.
Controls fulfillment request frequency to prevent system overload.
Uses FIFO (First-In-First-Out) processing to maintain execution order and prevent duplicate executions.
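
On a FIFO queue, ordering and deduplication are driven by two message parameters, `MessageGroupId` and `MessageDeduplicationId`. The sketch below only builds the parameters for a send call. Grouping by action type and deriving the deduplication ID from the action ID plus execution time are my assumptions, not necessarily what the original system does.

```typescript
// Subset of the SQS SendMessage parameters relevant to FIFO queues.
interface FifoMessageParams {
  QueueUrl: string;
  MessageBody: string;
  MessageGroupId: string;
  MessageDeduplicationId: string;
}

// Hypothetical helper that maps a due action to a FIFO message.
function toFifoMessage(
  queueUrl: string,
  action: { id: string; action: string; executionTime: number },
): FifoMessageParams {
  return {
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify({ id: action.id }),
    // Assumption: messages of the same action type are processed in order.
    MessageGroupId: action.action,
    // Assumption: one message per action per scheduled run, so a recurring
    // action gets a fresh ID each time its executionTime moves forward.
    MessageDeduplicationId: `${action.id}:${action.executionTime}`,
  };
}
```

SQS drops messages that repeat a deduplication ID within its 5-minute deduplication interval, which protects against the scheduler enqueuing the same run twice in quick succession.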

Amazon DynamoDB: Storing Scheduled Actions

Serves as the primary database for storing scheduled actions.
Provides fast read/write operations for handling high workloads.
Stores metadata for tracking execution status, retries, and results.
Uses DynamoDB Streams to trigger immediate executions.

Amazon API Gateway: Exposing Endpoints for Management

Provides HTTP endpoints for creating, updating, and deleting scheduled actions.
Exposes monitoring endpoints to retrieve actions by status and retry failed actions.
Ensures secure access with authentication and authorization mechanisms.

System Design: Database Schema for Scheduled Actions

Field	Description
`id`	Unique identifier for each scheduled action.
`data`	Stores execution-specific details.
`action`	Defines the type of action to execute.
`executionTime`	Specifies when the action should run.
`repeat`	Indicates if the action should repeat.
`frequency`	Defines the interval for recurring actions.
`executionRemainder`	Tracks the remaining number of executions.
`status`	Execution state ("PENDING", "IN_PROGRESS", "COMPLETED", "FAILED").
`createdAt`	Timestamp when the action was created.
`updatedAt`	Last modified timestamp.
`retryCount`	Counts failed execution retries.
`metadata`	Stores logs and additional execution details.
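
Since the project uses a typed language, the schema maps naturally onto a TypeScript interface. This is a sketch: storing timestamps as epoch milliseconds and marking the recurrence fields as optional are assumptions, and `NO_ACTION` is included because it appears in the lifecycle described elsewhere in this article.

```typescript
type ActionStatus = "PENDING" | "IN_PROGRESS" | "COMPLETED" | "FAILED" | "NO_ACTION";

interface ScheduledAction {
  id: string;
  data: Record<string, unknown>; // execution-specific details
  action: string; // type of action to execute
  executionTime: number; // assumed epoch milliseconds
  repeat?: boolean;
  frequency?: string;
  executionRemainder?: number; // remaining number of executions
  status: ActionStatus;
  createdAt: number;
  updatedAt: number;
  retryCount: number;
  metadata: Record<string, unknown>; // logs and additional execution details
}

// Hypothetical record, as it might look right after creation.
const exampleAction: ScheduledAction = {
  id: "example-id",
  data: { notificationType: "SMS" },
  action: "SEND_NOTIFICATION",
  executionTime: 1736930117120,
  status: "PENDING",
  createdAt: 1736920000000,
  updatedAt: 1736920000000,
  retryCount: 0,
  metadata: {},
};
```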

Example: Scheduled Notification Action

{
    "data": {
        "mobile": "60123456789",
        "subject": "Test",
        "name": "Joojo",
        "templateType": "USER_LATE_PAYMENT_NOTIFICATION",
        "notificationType": "SMS"
    },
    "repeat": true,
    "frequency": "DAILY",
    "executionRemainder": 5,
    "action": "SEND_NOTIFICATION",
    "executionTime": 1736930117120
}

Project Structure

I used a layered, modular approach for maintainability, scalability, and ease of change. Different teams often need to extend a service without introducing unintended side effects, so I organized components into distinct modules. Let's dive deeper below.

1. Single Application with a Modular Design

The entire system is built as a single application, but with a modular structure that separates concerns. Each module is responsible for a specific aspect of the system, making the codebase easier to navigate and modify.

./src
├── app.ts
├── clients
├── config
├── controllers
├── handlers
├── helpers
├── middleware
├── models
├── routes
├── service
├── types
└── utils

2. Serverless Handlers for Distributed Execution

The project is designed around AWS Lambda, with different handlers exported and structured to allow seamless execution of scheduled actions. These handlers ensure that various tasks are processed independently, improving fault tolerance and scalability.

Action Handlers: Manage creating, scheduling, retrieving, updating, deleting, and processing scheduled actions. This keeps all action-related logic centralized, making it easy to modify without affecting other parts of the system.
Delayed Action Handlers: Specifically handle actions that need to be initiated at a later time. This separation ensures that delayed actions are efficiently scheduled and processed without interfering with real-time execution.
Immediate Action Handlers: Trigger execution for actions that must start within 2 minutes, using DynamoDB Streams to detect changes and initiate execution instantly. This ensures timely processing of urgent tasks.
Fulfillment Handlers: Ensure that scheduled actions are executed properly by interacting with the appropriate fulfillment services. This design allows fulfillment logic to evolve independently of action scheduling.

├── handlers
│   ├── fulfillment.ts
│   ├── http-apis.ts
│   ├── initiate-scheduled-actions.ts
│   ├── initiate-stream-actions.ts
│   └── process.ts
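
To give a feel for the immediate path, here is a sketch of how a stream handler might pick out actions that need to run right away. The record types are a trimmed-down version of the DynamoDB Streams event shape. `selectImmediateActionIds` is a hypothetical helper, and reacting only to `INSERT` events is my assumption. The stream must be configured to include new images for `NewImage` to be present.

```typescript
// Minimal subset of the DynamoDB Streams event shape used here.
interface AttributeValue { S?: string; N?: string }
interface StreamRecord {
  eventName?: "INSERT" | "MODIFY" | "REMOVE";
  dynamodb?: { NewImage?: Record<string, AttributeValue> };
}
interface StreamEvent { Records: StreamRecord[] }

const STREAM_IMMEDIATE_WINDOW_MS = 2 * 60 * 1000; // 2 minutes, as stated above

// Returns the IDs of newly created actions that are due within the immediate window.
function selectImmediateActionIds(event: StreamEvent, now: number = Date.now()): string[] {
  const ids: string[] = [];
  for (const record of event.Records) {
    if (record.eventName !== "INSERT") continue;
    const image = record.dynamodb?.NewImage;
    const id = image?.id?.S;
    // DynamoDB encodes numbers as strings in stream records.
    const executionTime = Number(image?.executionTime?.N);
    if (!id || Number.isNaN(executionTime)) continue;
    if (executionTime - now <= STREAM_IMMEDIATE_WINDOW_MS) ids.push(id);
  }
  return ids;
}
```

The handler would then hand these IDs to the processing step, while everything else waits for the periodic scheduler.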

3. Maintainability through Separation of Concerns

Each module in the project is self-contained, meaning changes to one component do not directly impact others. This reduces the risk of breaking existing functionality and simplifies debugging.

Controllers handle request routing and execution logic.
Services manage business logic and data interactions.
Clients interact with external services like databases, queues, and APIs.
Models define the data structures used across the system.
Middleware ensures that requests pass through validation and authentication layers.
Utilities provide reusable helper functions for logging, error handling, and retries.

./src
├── clients
├── controllers
├── middleware
├── models
├── service
└── utils

4. Ease of Extensibility

With a modular design, new features can be added without modifying core components. For example:

A new type of scheduled action can be introduced by adding a new action in the fulfillment service without modifying the existing scheduling or queuing logic.
A new external service integration can be implemented by extending the clients module, ensuring seamless communication with third-party systems.

Delayed Execution: Ensuring Timely Execution

The system efficiently processes scheduled actions through periodic execution, ensuring that all pending actions are executed at the right time without delays.

Periodic Execution for Scheduled Actions

A Lambda function periodically scans the database for actions with PENDING status and an executionTime that is due.
Amazon EventBridge acts as a scheduler, triggering this Lambda function every 5 minutes to ensure that actions are picked up on time.
The function enqueues these pending actions into Amazon SQS, ensuring a reliable and scalable execution pipeline.
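
The selection and enqueue steps above can be sketched without any AWS calls. `selectDueActions` and `toBatches` are hypothetical helpers; the batch size of 10 reflects the limit on entries in a single SQS `SendMessageBatch` request.

```typescript
interface StoredAction {
  id: string;
  status: string;
  executionTime: number; // epoch milliseconds
}

const SQS_MAX_BATCH_SIZE = 10; // SendMessageBatch accepts at most 10 entries per call

// Keeps only PENDING actions whose execution time has arrived.
function selectDueActions(actions: StoredAction[], now: number = Date.now()): StoredAction[] {
  return actions.filter((a) => a.status === "PENDING" && a.executionTime <= now);
}

// Splits items into batches small enough for one SendMessageBatch call each.
function toBatches<T>(items: T[], size: number = SQS_MAX_BATCH_SIZE): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

In a real handler the filtering would ideally happen in the database query itself, so the Lambda only reads the records it is about to enqueue.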

Why This Approach Works

Efficient batch processing ensures that multiple actions can be picked up at once.
Scalability is maintained by decoupling execution with SQS, preventing system overload. Queues are critical for regulating the load sent to downstream systems.
State Management: Actions follow a lifecycle (PENDING → IN_PROGRESS → COMPLETED/FAILED/NO_ACTION), with each state persisted in the database for tracking and recovery.
Execution Handling: Successful executions are marked COMPLETED, failures are marked FAILED, and recurring actions update their execution remainder before resetting to PENDING.
Automatic Retries: Failed actions use exponential backoff for retries. If retries exceed the limit, the action remains FAILED until manually reset.
Idempotency & Data Integrity: Execution remainder…

Author of article: JOOJO DONTOH