How to load-balance like a seasoned waiter

Software systems often parallel the real world. Imagine running a busy restaurant, where customers line up to place orders while the kitchen prepares the meals. In the software world, your users are the customers, and your backend services are the kitchen. With more people online than ever before, that line might start to grow out the front door. The ability to scale is no longer optional; it is essential.

Know Your Options

Vertical Scaling - Scale Up: Expanding your restaurant by adding more tables or a larger kitchen. In software terms, this means scaling up your infrastructure: more powerful CPUs, larger memory, increased throughput etc. This is a relatively quick fix, but it comes with diminishing returns and limits on how big everything can get.

Horizontal Scaling - Scale Out: Opening new restaurant locations to serve more customers simultaneously and distribute existing flows. In software terms, this means adding more API servers, more worker nodes or more database replicas. This approach is more flexible and scalable than vertical scaling in the long term.

A Steppingstone - Queues

Just as customers queue for their orders, we create a pull-based task queue for our order management system using AWS Simple Queue Service (SQS). Tasks are placed on the SQS queue, and a consumer service continuously polls the queue to process them.

This gives the queue consumer a lot of control over the polling frequency, which works well with systems that cannot handle high throughput or cannot process requests concurrently, like the SAP ERP (more on that later). SQS also provides built-in dead-letter queues, retry policies and an at-least-once delivery guarantee, and it scales automatically.

Vertical scaling means sizing up the compute power of the consumer (CPU, RAM etc.). Horizontal scaling means spinning up more consumers of the queue.
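As a rough illustration of the consumer side, here is a minimal Python sketch of one polling step. It assumes a boto3 SQS client (for example `boto3.client("sqs")`) is passed in as `sqs`; the names `poll_once`, `queue_url` and `handler` are hypothetical and not taken from our actual service. `WaitTimeSeconds` enables long polling, so the call waits for messages instead of returning immediately when the queue is empty.

```python
from typing import Any, Callable


def poll_once(sqs: Any, queue_url: str, handler: Callable[[str], None],
              wait_seconds: int = 20) -> int:
    """Receive one batch of messages, process them, delete the successes."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,        # SQS returns at most 10 messages per call
        WaitTimeSeconds=wait_seconds,  # long polling: wait rather than spin
    )
    processed = 0
    for message in response.get("Messages", []):
        try:
            handler(message["Body"])
        except Exception:
            # Do not delete: the message becomes visible again after the
            # visibility timeout and can be retried or dead-lettered.
            continue
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=message["ReceiptHandle"])
        processed += 1
    return processed
```

A consumer would call `poll_once` in a loop. Scaling out then amounts to running more copies of that loop, and scaling up amounts to giving each copy a bigger machine.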

However, queues have limitations:

  • Latency between order arrival and processing.
  • Inefficient polling that keeps checking even when the queue is empty.
  • Limited fan-out: a message is designed to be consumed by one service, which makes it difficult for different components to react to the same event.
  • Concurrency issues: when scaling out, the message visibility timeout must be tuned so that a message is not picked up by another consumer instance while it is still being processed.
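To make the visibility timeout concrete, here is a toy in-memory model in Python. It is not SQS itself, only a sketch of the behaviour: `ToyQueue` and its methods are hypothetical names, and time is passed in explicitly so the example stays deterministic.

```python
class ToyQueue:
    """In-memory stand-in for a queue with a visibility timeout."""

    def __init__(self, visibility_timeout: float) -> None:
        self.visibility_timeout = visibility_timeout
        self._bodies = {}        # message id -> body
        self._hidden_until = {}  # message id -> time it becomes visible again

    def send(self, message_id: str, body: str) -> None:
        self._bodies[message_id] = body
        self._hidden_until[message_id] = 0.0

    def receive(self, now: float):
        """Return one visible message and hide it for the timeout period."""
        for message_id, body in self._bodies.items():
            if self._hidden_until[message_id] <= now:
                self._hidden_until[message_id] = now + self.visibility_timeout
                return message_id, body
        return None

    def delete(self, message_id: str) -> None:
        self._bodies.pop(message_id, None)
        self._hidden_until.pop(message_id, None)


queue = ToyQueue(visibility_timeout=30.0)
queue.send("order-1", "2x ramen")
first = queue.receive(now=0.0)    # consumer A receives the order
second = queue.receive(now=10.0)  # consumer B sees nothing: still hidden
third = queue.receive(now=31.0)   # A never deleted it, so it is redelivered
```

If the timeout is shorter than the time a consumer needs to finish, a second consumer can receive the same order while the first is still working on it, which is why the value has to be tuned against real processing times.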

The Gold Standard - Events

Scalability starts at the architectural level; enter event-driven programming. Instead of queuing up, customers scan a QR code, and their order is sent instantly to the kitchen, the waiter and the pay desk all at once. No delay, no queues.

We recreate this by having an event publisher send messages to an event bus, which notifies the event subscribers (warehouse, emailing service and SAP) simultaneously, allowing them to react to the same order independently. Combined with the vertical and horizontal scaling options described earlier, this creates a powerful system for processing and dispatching orders, because the separate components can be scaled independently. It also lends itself well to a microservices architecture.
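The fan-out idea can be sketched in a few lines of Python. This is a toy in-process bus, not a managed service: `EventBus`, the `OrderPlaced` event type and the three subscriber lists are hypothetical names used for illustration, and delivery here is synchronous, whereas a real event bus delivers asynchronously.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class EventBus:
    """Toy event bus: every subscriber of an event type receives every event."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[event_type].append(handler)

    def publish(self, event_type: str, payload: dict) -> int:
        """Deliver the payload to each subscriber; return how many were notified."""
        handlers = self._subscribers.get(event_type, [])
        for handler in handlers:
            handler(dict(payload))  # copy so subscribers cannot affect each other
        return len(handlers)


bus = EventBus()
warehouse, emails, sap = [], [], []
bus.subscribe("OrderPlaced", warehouse.append)
bus.subscribe("OrderPlaced", emails.append)
bus.subscribe("OrderPlaced", sap.append)

notified = bus.publish("OrderPlaced", {"order_id": 42, "item": "ramen"})
```

The publisher does not know who is listening. Adding a fourth subscriber is one more `subscribe` call and requires no change to the code that publishes orders.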

This model mitigates a lot of the previous approach’s deficiencies:

  • Lower latency, as it is push-based rather than pull-based.
  • Subscribers can be scaled individually.
  • The event bus is built to handle concurrency.

There are two ways to implement this in AWS: EventBridge and SNS (Simple Notification Service). We choose EventBridge for its ability to handle more complex workflows and its native integrations with third-party SaaS applications like Zendesk.

With SNS, messages must be published to a specific topic, which risks the number of topics growing too large. EventBridge instead receives events from many sources at once. With its advanced filtering capabilities, it can inspect the full event payload and route each event to the appropriate consumers. Event archiving and replay are also supported, which helps with debugging.

Here is the basic implementation:

  • Publish events to EventBridge from your application.
  • Define event rules that filter events based on their payload.
  • Configure targets for each rule; a rule can have multiple targets.
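The three steps above could look roughly like this in Python. The sketch assumes a boto3 EventBridge client (for example `boto3.client("events")`) is passed in as `events`; the source `orders.api`, the detail type `OrderPlaced`, the `channel` field and the target id `order-queue` are hypothetical values chosen for illustration.

```python
import json
from typing import Any


def publish_order_placed(events: Any, bus_name: str, order: dict) -> dict:
    """Step 1: publish an event to the bus."""
    return events.put_events(Entries=[{
        "EventBusName": bus_name,
        "Source": "orders.api",       # hypothetical source name
        "DetailType": "OrderPlaced",  # hypothetical event type
        "Detail": json.dumps(order),  # the payload must be a JSON string
    }])


def route_web_orders(events: Any, bus_name: str, rule_name: str,
                     target_arn: str) -> dict:
    """Steps 2 and 3: define a rule on the payload and attach a target."""
    pattern = {
        "source": ["orders.api"],
        "detail-type": ["OrderPlaced"],
        "detail": {"channel": ["web"]},  # match on a field inside the payload
    }
    events.put_rule(Name=rule_name, EventBusName=bus_name,
                    EventPattern=json.dumps(pattern))
    events.put_targets(Rule=rule_name, EventBusName=bus_name,
                       Targets=[{"Id": "order-queue", "Arn": target_arn}])
    return pattern
```

With this pattern, only events whose payload has `"channel": "web"` reach the target. Other consumers would get their own rules on the same bus, rather than their own topics.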

Our target will be SQS, as it allows our pre-existing .NET services to plug into the new event-driven system without major modifications. However, serverless Lambda functions are on the table if we can remove the dependency on SAP (more on this later).
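On the consuming side, the main change for an existing queue consumer is that the message body now carries an EventBridge event envelope rather than a bare task. The sketch below is in Python for brevity, not in the .NET of our services. It assumes the rule forwards the whole matched event unchanged, and the `OrderPlaced` detail type and the `extract_order` helper are hypothetical.

```python
import json


def extract_order(message_body: str) -> dict:
    """Unwrap the event envelope found in the SQS message body."""
    envelope = json.loads(message_body)
    if envelope.get("detail-type") != "OrderPlaced":
        raise ValueError(f"unexpected event type: {envelope.get('detail-type')!r}")
    return envelope["detail"]  # the payload the publisher sent


sample_body = json.dumps({
    "version": "0",
    "id": "00000000-0000-0000-0000-000000000000",
    "detail-type": "OrderPlaced",
    "source": "orders.api",
    "detail": {"order_id": 42, "item": "ramen"},
})
order = extract_order(sample_body)
```

The rest of the consumer (polling, deleting, retrying) can stay as it was, which is what makes SQS a convenient first target.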

While powerful, event-driven architecture is not without its drawbacks:

Debugging and tracing: Events are asynchronous and loosely coupled, making it difficult to trace cause and effect. Comprehensive logging and distributed tracing need to be set up.

Eventual consistency: System components may be temporarily out of sync, which makes the system’s behaviour harder to understand. Logic must also be built to handle stale data gracefully.
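One common way to handle stale data is to have each event carry a version number and to ignore anything older than what a subscriber has already applied. The Python sketch below assumes the publisher attaches a per-order, monotonically increasing `version`; that field and the `OrderView` class are hypothetical.

```python
class OrderView:
    """A subscriber's local view of order status, tolerant of late events."""

    def __init__(self) -> None:
        self._state = {}  # order id -> (version, status)

    def apply(self, order_id: int, version: int, status: str) -> bool:
        """Apply an event unless it is older than what is already stored."""
        current = self._state.get(order_id)
        if current is not None and version <= current[0]:
            return False  # stale or duplicate event: ignore it
        self._state[order_id] = (version, status)
        return True

    def status(self, order_id: int):
        entry = self._state.get(order_id)
        return entry[1] if entry else None


view = OrderView()
view.apply(42, version=1, status="placed")
view.apply(42, version=3, status="shipped")
late = view.apply(42, version=2, status="paid")  # arrives out of order
```

Here `late` is `False` and the order stays `shipped`. The same check also makes duplicate deliveries harmless.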

Event schema evolution: Changing the payload structure can break downstream services. Document clearly how the payload is consumed and have a versioning strategy.
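One possible versioning strategy is to stamp every payload with a schema version and let consumers upgrade old payloads step by step before using them. The Python sketch below is only an illustration: the `schema_version` field, the v1 and v2 shapes and the helper names are all hypothetical.

```python
LATEST_VERSION = 2


def upgrade_v1_to_v2(payload: dict) -> dict:
    """v1 had a single sku/quantity pair; v2 has a list of items."""
    upgraded = {k: v for k, v in payload.items() if k not in ("sku", "quantity")}
    upgraded["items"] = [{"sku": payload["sku"],
                          "quantity": payload.get("quantity", 1)}]
    upgraded["schema_version"] = 2
    return upgraded


UPGRADES = {1: upgrade_v1_to_v2}


def normalize(payload: dict) -> dict:
    """Bring any known payload version up to the latest shape."""
    version = payload.get("schema_version", 1)  # unversioned payloads count as v1
    if version > LATEST_VERSION:
        raise ValueError(f"unsupported schema version: {version}")
    while version < LATEST_VERSION:
        payload = UPGRADES[version](payload)
        version = payload["schema_version"]
    return payload


old_event = {"order_id": 42, "sku": "RAMEN-01", "quantity": 2}
normalized = normalize(old_event)
```

Handlers then only ever deal with the latest shape, and supporting a future v3 means adding one more upgrade function.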

Despite these challenges, with the right tooling and implementation, event-driven architectures can be made highly observable, testable and resilient. 

A Slow Chef in the Kitchen

Sometimes, the bottleneck isn’t your system but its external dependencies. In our case, SAP is that slow chef. The Data Interface API (DI API) is single-threaded and does not support batch processing. If you throw too many requests at it, it will choke no matter how fast the other components are. Identifying such bottlenecks is crucial, lest your other scaling efforts be wasted.
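Until such a dependency can be replaced, a common stopgap is to funnel every call through a single gate so that it never sees more than one request at a time. The Python sketch below uses a lock for this. `SerializedGateway` and `slow_chef` are hypothetical stand-ins and do not call SAP.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class SerializedGateway:
    """Allow only one in-flight call to a dependency that cannot run concurrently."""

    def __init__(self, call) -> None:
        self._call = call
        self._lock = threading.Lock()

    def submit(self, request):
        with self._lock:  # other callers wait here for their turn
            return self._call(request)


active = 0
peak = 0
counter_lock = threading.Lock()


def slow_chef(order):
    """Stand-in for the single-threaded dependency; records peak concurrency."""
    global active, peak
    with counter_lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.01)  # pretend to cook
    with counter_lock:
        active -= 1
    return f"cooked {order}"


gateway = SerializedGateway(slow_chef)
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(gateway.submit, range(5)))
```

Eight workers compete, but `peak` never rises above 1. The gate protects the dependency without making it faster, so total throughput is still capped by the slow chef.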

Luckily, in our case we can upgrade SAP from the DI API to a modern alternative called the Service Layer. It is designed with scalability in mind:

  • Uses HTTP and OData protocols
  • Can process requests in parallel
  • Automatic load-balancing
  • Does not require a local installation, unlike the DI API

These properties make it much easier to develop web and mobile applications, which are much more accessible than the DI API and SAP Windows client installations. The Service Layer’s more stateless nature lends itself nicely to the event-driven architecture described above. By adopting it, we can bring SAP in line with the rest of our system in terms of scalability.
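Because the Service Layer speaks HTTP and OData, a request is essentially a URL with standard OData query options. The Python sketch below only builds such a URL. The host, the `/b1s/v1` base path, the `Orders` resource and the field names are assumptions made for illustration and should be checked against your own installation, and `service_layer_url` is a hypothetical helper.

```python
from typing import Optional, Sequence
from urllib.parse import quote, urlencode


def service_layer_url(base_url: str, resource: str,
                      select: Optional[Sequence[str]] = None,
                      filter_: Optional[str] = None,
                      top: Optional[int] = None) -> str:
    """Build a query URL using the OData $select, $filter and $top options."""
    options = {}
    if select:
        options["$select"] = ",".join(select)
    if filter_:
        options["$filter"] = filter_
    if top is not None:
        options["$top"] = str(top)
    url = f"{base_url.rstrip('/')}/{resource}"
    if options:
        # Keep "$" and "," readable; percent-encode everything else unsafe.
        url += "?" + urlencode(options, safe="$,", quote_via=quote)
    return url


url = service_layer_url(
    "https://sap.example.com:50000/b1s/v1",  # assumed host and base path
    "Orders",
    select=["DocEntry", "DocNum"],
    filter_="DocTotal gt 100",
    top=5,
)
```

Each such request is self-describing, so any service that can make an HTTP call can issue it, including the event subscribers described earlier.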

Conclusion

Just like a restaurant, software systems should be designed with future scalability in mind. Start with simple abstractions like task queues and evolve to fully decoupled event-driven systems. Horizontal scaling is often better and more flexible than vertical scaling. When faced with external bottlenecks, tackle them head-on. Architecture is the business: with intentional design, your restaurant won’t just keep up but thrive.


Jerry Liu