Optimizing communication in distributed services
Performance, reliability, and maintainability are essential elements of everything we build here at Moov. In design discussions, we pay particular attention to how our services are defined and how information moves between them. Whether you are designing a new system or reasoning about an existing one, it’s helpful to think of it from two dimensions: the individual service definitions and the communication between them.
Right-sizing service scope
One of the challenges in designing a distributed system is defining the domain and scope of the individual services.
Maintaining separation of concerns is essential when defining a service. Isolating its responsibility to a single functional area helps to enforce that separation. A well-designed service exposes a simple, high-level interface with which other services or external entities interact. The less one service knows about the domain of another, the better. Implementation details are kept separate from the public interface, limiting the impact maintenance or refactoring will have on consumers. Essentially, a service’s interface should describe what, not how.
If services are defined too broadly, having responsibilities beyond a single area, the separation between those areas will erode over time. The resulting loss of precision in the domain model will hurt maintainability and limit future extensibility.
On the other hand, if a service is defined too narrowly with operations for a single functional area spread between multiple services, efficiency will be lost. Such services are often tightly-coupled, working in concert to support a single domain. This coupling increases the scope and frequency of inter-service communication and the number of possible points of failure. Overly granular services typically lead to more specific and complex interfaces. These fine-grained interfaces leak implementation details and expose business logic, placing more burden on other services.
Choosing the right communication type
Distributed services communicate with each other to orchestrate higher-level functions. Similar to communication between people, the more connections and modes of communication there are, the harder it is to communicate effectively and efficiently. Compare, for example, the dynamics of a face-to-face conversation between two people to that of a group of people communicating remotely over multiple channels: text, phone, social media, and other chat-based systems.
At a basic level, communication falls into two categories: synchronous and asynchronous.
Synchronous communication is the simpler of the two. It results in a clear sequential flow of messages that follow a predictable path. Synchronous exchanges follow a back-and-forth pattern with one party (the caller) controlling the exchange. Outgoing messages are requests; incoming messages are responses. One message cycle completes before the next begins. Success or failure is immediately evident and can be handled in the same execution context. The HTTP response/request pattern is an example of a synchronous communication pattern.
Synchronous communication is direct and predictable. It is also easy to reason about and debug. However, it is limited to two parties per exchange and is prone to failure when connected services are unavailable.
Unlike synchronous communication, asynchronous exchanges are not bidirectional; there is no “response” component, and no party has sole control. Messages may receive an immediate response or go unanswered indefinitely. Several messages may be sent before receiving any response, and requests and responses can be interleaved, arriving out of sequence.
When you send or receive an email or a text message, you participate in asynchronous communication.
An underlying queueing mechanism or message broker facilitates asynchronous communication in a distributed architecture. Two common examples are RabbitMQ and Apache Kafka. The patterns that arise from asynchronous communication are usually described as producer/consumer or publish/subscribe. Success or failure in this mode is less obvious and more challenging to detect than in a synchronous mode.
Asynchronous communication enables concurrent communication with multiple parties. Because there is no expectation of a response, a wider range of recovery and retry mechanisms are available. However, with no one party “in charge,” failure detection and recovery become more complicated.
While there are examples of purely synchronous and purely asynchronous configurations, most distributed systems use a hybrid of the two.
Orchestrating communication flow
Determining where to use which type of communication is an essential facet of system design.
Synchronous patterns are best suited for scenarios where requestors require immediate responses. This may be because they are blocked from completing some task until the request is fulfilled or have conditional logic dependent on some aspect of the request. Because of its serial nature, synchronous is also a natural choice when the order of operations matters. Synchronous communication follows a “pull” model; services pull in information.
Asynchronous patterns are typical for long-running batch processes or other “fire and forget” scenarios. They are also preferred in situations where the flow of messages is inconsistent or subject to spikes or bursts. The underlying queue platform throttles traffic to ease pressure on downstream services. Situations where information needs to be shared with multiple parties are also well-suited for asynchronous handling. Asynchronous communication follows a “push” model; services push out information.
Service communication at Moov
At Moov, we use a hybrid approach: synchronous communication over HTTP and asynchronous via events over Kafka. External communication differs by payment rail and could be HTTP, TCP/IP socket, or file exchange via SFTP.
All incoming customer requests are handled via HTTP through our Dashboard app or Rest APIs.
- Requests for information (GETs), including all internal service communications, are handled synchronously to provide the fastest turnaround.
- Requests for service (e.g., POST, PUT) typically initiate event-driven processes. Our event-driven processes enable queueing to accommodate back pressure during peak load, and delivery guarantees greater stability.
Transactional processing is primarily event-based, allowing separate services to handle different aspects of any given process. Communication with external entities such as the individual payment rails is dependent on what those entities support.
The following is a high-level abstracted view of handling an incoming request to transfer funds. Many details have been left out for the sake of simplicity and to focus on communication patterns rather than business logic. Only the first leg of the process is presented. Note that some asynchronous events happen in parallel and others are “chained.”
The initial steps are synchronous and are executed serially:
- A request comes into the Transfer Service’s Rest API over HTTP.
- The Transfer Service makes an internal HTTP request to retrieve payment method data.
- A response is returned to the caller.
The first asynchronous message is produced as a side-effect of the HTTP handling prior to sending the response, dove-tailing the end of the synchronous activity with the start of the asynchronous activity.
- The Ledger Service consumes and processes the first message before producing a second message.
- The second message is consumed by both the Transfers Service and one of our rail-specific services (ACH, Card, RTP, etc.). The handling of the second message may happen concurrently or one service may process it before the other.
- Lastly, the rail-specific service communicates directly with the payment rail. This may be synchronous or asynchronous, depending on the rail.
From this example, you can see how our communication choices reinforce our commitment to performance, reliability, and maintainability.
- The decision to use synchronous HTTP for the initial call ensures a predictable, responsive customer-facing API.
- Choosing asynchronous events for transaction processing offers us clear patterns that are fast and reliable at scale.
- Well thought out, repeatable patterns result in more obvious designs that are easier to understand, maintain and extend.
Interested in sharing your own distributed system design experiences or learning more about ours? Join the conversation with our community of builders or join our team here at Moov.