Building a High-Performance, High-Availability Service

Dec 7, 2020



This document highlights key points for ensuring that a service meets its performance and availability goals.

Ensuring High Performance

Every enterprise-level service must meet the following criteria:

  1. Request throughput requirements
    The number of requests the service can process in a specified period of time, as agreed upon by the client and the service.
  2. Response time SLA requirements
    The response time that the client and the service agreed upon. This can be measured in several ways:
      • Average response time: the mean response time across all requests.
      • 50th percentile response time: the threshold that 50% of response times fall below.
      • 95th percentile response time: the threshold that 95% of response times fall below.
      • 99th percentile response time: the threshold that 99% of response times fall below.

Load Testing

The first step in ensuring high performance is to run load tests to learn the current state of the service and act accordingly. Load testing determines the current performance characteristics of a service. It should be run on the production service infrastructure, with regular incoming traffic disallowed for the duration of the test. If it is not possible to run load tests on the existing cluster, create a cluster with a similar configuration to mimic the behavior of the actual system. The following metrics should be captured during load testing:

  1. Number of requests processed per second
  2. Average response time
  3. 50th percentile response time
  4. 95th percentile response time
  5. 99th percentile response time
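As a rough illustration of how these metrics could be derived from raw load-test output, here is a minimal Python sketch. It assumes the test tool gives you a list of per-request latencies in milliseconds and the test duration; the function names are hypothetical. It uses the nearest-rank percentile definition, which is only one of several, so tools that interpolate may report slightly different numbers.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def summarize(latencies_ms, duration_s):
    """Summarize one load-test run (hypothetical helper)."""
    return {
        "requests_per_second": len(latencies_ms) / duration_s,
        "avg_ms": statistics.fmean(latencies_ms),
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
    }
```

Comparing the average with the 99th percentile is often informative: a low average with a high p99 suggests that a small share of requests is much slower than the rest.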

Meeting Response Time SLA Requirements

The service should strive to meet the 99th percentile response time SLA. If the response time is above the SLA, determine the slowest hop in the request path. For a highly available system, the request path may include the following hops (see diagram below):

  1. Client → load balancer
  2. Time taken by load balancer
  3. Load balancer → Pod or VM
  4. Time taken by the application in the Pod or VM
  5. Pod or VM → backend services/storage
  6. Time taken by the backend service/storage

Measure time spent in each of these hops to determine where you may be able to improve the performance.
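For the hops that are visible from inside the application (application time and backend calls), a small timer can be enough to get started. The sketch below is an assumption about how you might instrument this in Python; `HopTimer` and the hop names are hypothetical, and the load balancer hops would have to be measured from load balancer logs or client-side timings instead.

```python
import time
from contextlib import contextmanager


class HopTimer:
    """Accumulates wall-clock time per named hop (hypothetical helper)."""

    def __init__(self):
        self.durations_ms = {}

    @contextmanager
    def hop(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.durations_ms[name] = self.durations_ms.get(name, 0.0) + elapsed_ms

    def slowest(self):
        """Name of the hop with the largest accumulated time."""
        return max(self.durations_ms, key=self.durations_ms.get)


if __name__ == "__main__":
    timer = HopTimer()
    with timer.hop("application"):
        time.sleep(0.005)  # stands in for request handling
    with timer.hop("backend"):
        time.sleep(0.020)  # stands in for a storage call
    print(timer.durations_ms, "slowest:", timer.slowest())
```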

Pro Tip

  1. A quick and simple approach is to create a controller endpoint that returns an empty response. Its response time shows whether an intermediary component between the client and the application server is responsible for the slowness.
  2. A separate controller endpoint can be created that returns a large enough payload.

The difference between the response times of these two endpoints indicates whether the payload size is contributing to the slowness.
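The two diagnostic endpoints could look roughly like the following, sketched here with Python's standard library HTTP server rather than any particular web framework. The paths `/diag/empty` and `/diag/payload` and the 1 MB payload size are assumptions chosen for illustration.

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

PAYLOAD = b"x" * 1_000_000  # hypothetical "large enough" payload (1 MB)


class DiagnosticHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/diag/empty":
            body = b""
        elif self.path == "/diag/payload":
            body = PAYLOAD
        else:
            self.send_error(404)
            return
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the diagnostic endpoints quiet


def start_server():
    """Start the diagnostic server on a free local port."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), DiagnosticHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def timed_get(url):
    """Return (response size in bytes, elapsed milliseconds)."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    start = time.perf_counter()
    with opener.open(url) as response:
        size = len(response.read())
    return size, (time.perf_counter() - start) * 1000


if __name__ == "__main__":
    server = start_server()
    base = f"http://127.0.0.1:{server.server_address[1]}"
    for path in ("/diag/empty", "/diag/payload"):
        size, elapsed_ms = timed_get(base + path)
        print(f"{path}: {size} bytes in {elapsed_ms:.2f} ms")
    server.shutdown()
```

In a real investigation the requests would go through the load balancer from the client's network, not over loopback, so that the intermediaries are actually on the measured path.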

Meeting Request Throughput Requirements

Typically, the throughput of a service is increased by adding more resources, either by scaling horizontally (adding more pods to the cluster), vertically (adding more CPU and memory), or both.

  • Ideally, adding resources should increase throughput linearly.
  • Generally, a combination of vertical and horizontal scaling gives the best throughput.

Horizontal Scaling

To ensure effective horizontal scaling:

  • Remove all pods in the cluster except one.
  • Run a load test on the one-pod cluster and record the throughput.
  • Gradually add pods, rerunning the load test each time, and observe how throughput changes.
  • Make sure throughput increases linearly as more pods are added.

Note: In theory, a service is infinitely scalable horizontally. The downside is that with too many pods, the deploy process takes longer.
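One way to check the "increases linearly" step is to compare each measured throughput against what perfect linear scaling from the smallest configuration would give. The following is a minimal sketch; the function names, the 0.9 tolerance, and the sample numbers are all hypothetical. The same calculation works for vertical scaling if the keys are CPU cores instead of pod counts.

```python
def scaling_efficiency(throughput_by_pods):
    """Map pod count -> measured throughput / ideal linear throughput.

    The smallest pod count in the input is the baseline, so its efficiency is 1.0.
    """
    base_pods = min(throughput_by_pods)
    per_pod = throughput_by_pods[base_pods] / base_pods
    return {
        pods: rps / (per_pod * pods)
        for pods, rps in sorted(throughput_by_pods.items())
    }


def first_sublinear(throughput_by_pods, tolerance=0.9):
    """Return the first pod count whose efficiency drops below the tolerance."""
    for pods, efficiency in scaling_efficiency(throughput_by_pods).items():
        if efficiency < tolerance:
            return pods
    return None


if __name__ == "__main__":
    # Hypothetical load-test results: pods -> requests per second
    measured = {1: 500, 2: 990, 4: 1900, 8: 2800}
    print(scaling_efficiency(measured))
    print("scaling falls off at:", first_sublinear(measured))
```

A drop in efficiency at a given pod count usually means something other than the pods has become the bottleneck, which is worth finding before adding more.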

Vertical Scaling

To ensure effective vertical scaling:

  • Remove all pods in the cluster except one.
  • Lower the CPU/memory allocation to the minimum.
  • Run a load test on the one-pod cluster and record the throughput.
  • Gradually add more CPU/memory, rerunning the load test each time, and observe how throughput changes.
  • Make sure throughput increases linearly as more CPU cores are added.

Note: Vertical scaling is limited by the maximum number of CPU cores a system can have.

Scaling External Dependencies

If the service has external resource dependencies (DB, remote cache, remote service, etc.), it is important to make sure these dependencies can support the additional load.

Avoiding Disk Operation

Disk operations are slow. A high performance service must avoid disk access in any form. Even if the service does not access the disk directly, a framework or library may use the disk for logging, tracing, or buffering. Check the dotnet 3.0 pitfalls if the service is written in .NET Core 3.0+.
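Where a slow logging sink cannot be removed entirely, one possible mitigation is to keep it off the request thread. The sketch below is an assumption, not something the original prescribes: it uses Python's standard `logging.handlers.QueueHandler` and `QueueListener` so that request code only enqueues records in memory while a background thread drives the real handler. The helper name `build_nonblocking_logger` is hypothetical.

```python
import logging
import logging.handlers
import queue


def build_nonblocking_logger(name, target_handler):
    """Return (logger, listener). The logger only enqueues records in memory;
    the listener thread passes them to target_handler in the background."""
    log_queue = queue.Queue(-1)  # unbounded in-memory queue
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid also writing through root handlers
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    listener = logging.handlers.QueueListener(log_queue, target_handler)
    listener.start()
    return logger, listener


if __name__ == "__main__":
    # StreamHandler writes to stderr here; swap in whatever sink the service uses.
    logger, listener = build_nonblocking_logger("svc", logging.StreamHandler())
    logger.info("request handled")
    listener.stop()  # flushes queued records and stops the background thread
```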

Reducing Middleware Overhead

Middleware (also called filters) is software that is assembled into an app pipeline to handle requests and responses. Middleware components are intended to be lightweight and should do only simple, non-compute-intensive work. Examine every middleware in use and make sure none adds significant latency.
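To see which middleware adds latency, each one can be wrapped so that only its own time is recorded, excluding the time spent further down the pipeline. This is a framework-agnostic Python sketch: the `(request, call_next)` middleware signature and the helper names are assumptions for illustration, not the API of any specific framework.

```python
import time
from functools import partial


def instrument(name, middleware, timings):
    """Wrap a middleware so its own time (excluding downstream) is recorded."""

    def wrapped(request, call_next):
        downstream = 0.0

        def timed_next(req):
            nonlocal downstream
            t0 = time.perf_counter()
            try:
                return call_next(req)
            finally:
                downstream += time.perf_counter() - t0

        start = time.perf_counter()
        try:
            return middleware(request, timed_next)
        finally:
            total = time.perf_counter() - start
            timings[name] = timings.get(name, 0.0) + (total - downstream)

    return wrapped


def build_pipeline(middlewares, handler, timings):
    """Compose (name, middleware) pairs around handler; the first pair is outermost."""
    pipeline = handler
    for name, middleware in reversed(middlewares):
        pipeline = partial(instrument(name, middleware, timings), call_next=pipeline)
    return pipeline
```

After a load test, sorting `timings` by value gives a quick ranking of the middleware that deserve a closer look.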

Avoiding Blocking I/O Calls

Blocking I/O operations waste compute resources. Replace them with non-blocking, asynchronous operations whenever possible.
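The benefit is easiest to see when a request needs several independent I/O calls. In the Python sketch below, `asyncio.sleep` stands in for a non-blocking network or storage call, and the call names and delays are made up. Because the three awaits overlap, the total time is close to the slowest call and not the sum of all three.

```python
import asyncio
import time


async def fetch(name, delay_s):
    # asyncio.sleep stands in for a non-blocking network or storage call
    await asyncio.sleep(delay_s)
    return name


async def handle_request():
    # The three calls run concurrently on one thread while each waits on I/O
    return await asyncio.gather(
        fetch("profile", 0.1),
        fetch("inventory", 0.1),
        fetch("pricing", 0.1),
    )


def timed_run():
    """Run one simulated request and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(handle_request())
    return results, time.perf_counter() - start


if __name__ == "__main__":
    results, elapsed = timed_run()
    print(results, f"{elapsed:.3f}s")
```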

Avoiding Mixing Sync and Async Patterns

Sync over async or async over sync can cause deadlocks. Follow either the sync or the async pattern, but not both.

Finding Optimal Load Balancing Algorithm

Every highly available service is configured behind a load balancer (HAProxy, NGINX, ingress). By default, a load balancer typically applies a round-robin algorithm to distribute traffic among service instances. This may not be ideal and can negatively impact service capacity. Run load tests with each of the available load balancing algorithms and select the one that performs best.
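To illustrate why the algorithm can matter, here is a toy Python simulation comparing round robin with least connections. It is only a sketch under strong assumptions (each backend handles one request at a time, requests arrive at a fixed interval, and the service times are invented), and it is not a model of how HAProxy or NGINX behave internally. Real decisions should come from load tests against the actual load balancer.

```python
def simulate(policy, service_times, n_backends=4, interarrival=1.0):
    """Return the average response time for a policy in a toy queueing model."""
    in_flight = [[] for _ in range(n_backends)]  # finish times of open requests
    free_at = [0.0] * n_backends  # when each backend finishes its queue
    total_response = 0.0
    for i, service in enumerate(service_times):
        now = i * interarrival
        # Forget requests that have already completed
        for b in range(n_backends):
            in_flight[b] = [t for t in in_flight[b] if t > now]
        if policy == "round_robin":
            target = i % n_backends
        elif policy == "least_connections":
            target = min(range(n_backends), key=lambda b: len(in_flight[b]))
        else:
            raise ValueError(f"unknown policy: {policy}")
        start = max(now, free_at[target])  # wait behind queued work, if any
        done = start + service
        free_at[target] = done
        in_flight[target].append(done)
        total_response += done - now
    return total_response / len(service_times)


if __name__ == "__main__":
    # Contrived workload: every fourth request is ten times slower
    workload = [10, 1, 1, 1] * 50
    for policy in ("round_robin", "least_connections"):
        print(policy, simulate(policy, workload))
```

In this contrived workload, round robin sends every slow request to the same backend, so its queue keeps growing, while least connections routes around the busy backends. A workload with uniform request costs would show little or no difference.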

High Availability

High availability means the system is functioning for a high percentage of the time. It can be formally defined as (1 - (downtime / total time)) * 100%. Although the minimum required availability varies by task, systems often aim for as much as 99.999% (five nines) availability. Downtime can be quantified as the number of times the service returns a 5XX HTTP response relative to the total number of calls made. The formula can then be written as:

HA = (1 - (number of 5xx responses / total number of requests made)) * 100%
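The formula above translates directly into code. The Python sketch below also includes a companion helper, which is an addition for illustration, that converts an availability target into the downtime it allows over a period. The function names are hypothetical.

```python
def availability_pct(total_requests, error_5xx):
    """HA = (1 - (number of 5xx responses / total requests)) * 100."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (1 - error_5xx / total_requests) * 100


def allowed_downtime_seconds(target_pct, period_seconds):
    """Downtime budget implied by an availability target over a period."""
    return (1 - target_pct / 100) * period_seconds


if __name__ == "__main__":
    print(availability_pct(total_requests=1_000_000, error_5xx=10))
    print(allowed_downtime_seconds(99.999, 365 * 24 * 3600))
```

For example, 10 failed requests out of 1,000,000 corresponds to 99.999% availability, and a 99.999% target over a 365-day year leaves a budget of roughly 315 seconds (a little over five minutes) of downtime.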

Fault Tolerance Fallback

While high availability is acceptable in most cases, fault tolerance can be more appropriate in certain situations. Fault tolerance seeks to provide 100% availability, either through redundancy or through a gracefully degraded fallback.

  • Fallback can be implemented in the same datacenter, supporting a subset of features.
  • Fallback can be implemented in a different environment/datacenter. This option is more reliable but adds latency.
  • Fallback can be triggered automatically (through a circuit breaker) or manually (through a feature toggle).

  • Fallback should be a temporary measure and used as sparingly as possible.
  • Performance penalties associated with the fallback should be clearly communicated and agreed upon with the stakeholders.

  • A service must notify upstream services/stakeholders when fallback is deployed.
  • A service should keep a record of requests that were handled by fallback.
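The automatic trigger mentioned above can be sketched as a small circuit breaker. This is a simplified Python illustration, not a production implementation: the class, its parameters, and the thresholds are hypothetical. After a number of consecutive failures it stops calling the primary for a cool-down period and serves the fallback instead, and it counts the requests the fallback handled so they can be reported.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed
        self.fallback_count = 0  # record of requests served by the fallback

    def _is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.reset_after_s:
            # Half-open: allow one trial call; a single failure reopens the circuit
            self.opened_at = None
            self.failures = self.failure_threshold - 1
            return False
        return True

    def call(self, primary, fallback):
        if not self._is_open():
            try:
                result = primary()
                self.failures = 0
                return result
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = self.clock()
        self.fallback_count += 1
        return fallback()
```

A call site might look like `breaker.call(lambda: fetch_from_primary(key), lambda: fetch_from_fallback(key))`, where both fetch functions are placeholders for the service's own code. Notifying upstream services when the circuit opens would hook in where `opened_at` is set.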

Separation of Concerns

A complex service is made up of several components, each with its own resource and availability requirements. Running these components in a decoupled manner makes them much easier to manage and monitor. A product cache service is a good example. It has two components: an API that serves product data and a worker that keeps product data up to date. From an availability point of view, the API is unaffected by the worker cluster. The API cluster can be scaled up or down without affecting the workers, and vice versa.

Multi-datacenter Applications

Load Testing

Running load tests in one location may not be sufficient because of potential differences in hardware configuration, distance, etc. The test should be run in each data center, and the results should be compared to identify performance discrepancies.

Data Freshness

If the service depends on replicated data, it should be built with replication latency taken into consideration.

(Credit Timothy B.)