Automatic Reliability

The Software Development Life Cycle (SDLC)

We usually think of the software development life cycle for any new piece of code as a vector progressing to the right: from design through development, integration, and deployment to production. The life cycle usually ends when the software is replaced by a new version, at which point the cycle starts again.

[Figure: Software Development Life Cycle]

While the SDLC is intentional and the essence of our work, there is another life cycle to consider: that of a production failure.

The Production Failure Life Cycle

When developing software, we often inadvertently introduce faults and defects that become production failures, and these too have a life cycle.

[Figure: Production Failure Life Cycle]


"Fault is defined as an abnormal condition or defect at the component, equipment, or sub-system level which may lead to a failure." (Wikipedia)

A fault can result from a defect (e.g. a bug) or an abnormal condition (e.g. poor configuration), and it may lead to a production failure.

Faults are detected by running tests at different stages of the SDLC (e.g. via unit, integration or acceptance testing).

Traditionally, it was the responsibility of QA engineers to find faults early. Today it is far more common for developers to share responsibility for detecting faults early.


"Failure is the state or condition of not meeting a desirable or intended objective, and may be viewed as the opposite of success." (Wikipedia)

A failure in production is expensive because it is likely to disrupt normal business operations and cause monetary loss.

The Cost of Detecting a Fault in Pre-production vs. a Production Failure

The cost of a production failure is significantly higher than the cost of detecting a fault in pre-production. We would much rather detect faults early, in pre-production, long before they spiral into production failures.

Reliability Gates

Defects, bugs, faults and failures are not new. They are part of software engineering, and we use techniques such as unit, integration, acceptance, load, stress and chaos testing, as well as monitoring, to detect and prevent them, to contain their 'blast radius', or to recover quickly once failures occur.

[Figure: Modern Reliability Gates]

The Modern Test Pyramid

If we consider the traditional testing pyramid alongside the nature of the microservice architecture, we find that the scope of testing has narrowed.

Unit Testing

Unit testing is our first reliability gate. However, developers get stuck fairly quickly with unit testing, mostly because of service dependencies and independent service roadmaps.

Service Testing

While there are practices that support service testing (e.g. mocks, API-contract testing), they are complex and costly to implement. As a result, very little service testing is done, in spite of the rising number of services in a modern architecture.
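As a rough illustration of the mocking practice mentioned above, the sketch below tests one service in isolation by replacing its dependency with a mock from Python's standard `unittest.mock` module. `OrderService`, the inventory client, and its `get_stock` and `reserve` methods are hypothetical names invented for this example; they are not taken from any real system.

```python
from unittest.mock import Mock


class OrderService:
    """Hypothetical service that depends on a separate inventory service."""

    def __init__(self, inventory_client):
        self.inventory_client = inventory_client

    def place_order(self, item_id, quantity):
        # In a real deployment this call would cross the network.
        available = self.inventory_client.get_stock(item_id)
        if available < quantity:
            return {"status": "rejected", "reason": "insufficient stock"}
        self.inventory_client.reserve(item_id, quantity)
        return {"status": "accepted"}


def test_order_rejected_when_stock_is_low():
    inventory = Mock()
    inventory.get_stock.return_value = 1  # the mock stands in for the remote service
    result = OrderService(inventory).place_order("sku-1", 5)
    assert result["status"] == "rejected"
    inventory.reserve.assert_not_called()


def test_order_accepted_when_stock_is_sufficient():
    inventory = Mock()
    inventory.get_stock.return_value = 10
    result = OrderService(inventory).place_order("sku-1", 5)
    assert result["status"] == "accepted"
    inventory.reserve.assert_called_once_with("sku-1", 5)
```

Even in this tiny case the mock encodes assumptions about how the inventory service behaves, and those assumptions have to be maintained by hand as that service evolves, which is part of the cost described above.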

End-to-end Testing

Most modern engineering teams reduce the scope of testing to unit testing and end-to-end testing, the more common techniques, while component, service and integration testing are becoming harder and harder to implement.

While gaps are forming in the testing pyramid, new techniques are emerging to increase confidence, such as canary deployments and a stronger reliance on monitoring and fast rollbacks.
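To make the canary idea concrete, here is a minimal sketch of a promotion gate that compares the error rate of a canary against the current baseline. The function name, the thresholds (`max_ratio`, `min_requests`, and the 1% floor), and the three outcomes are all assumptions chosen for illustration; real deployment tooling typically looks at many more signals.

```python
def canary_decision(baseline_errors, baseline_requests,
                    canary_errors, canary_requests,
                    max_ratio=2.0, min_requests=100):
    """Return "wait", "promote" or "rollback" for a canary release.

    All thresholds are illustrative assumptions, not recommended values.
    """
    # Not enough canary traffic yet to judge either way.
    if canary_requests < min_requests:
        return "wait"

    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    canary_rate = canary_errors / canary_requests

    # Tolerate up to max_ratio times the baseline error rate, with a small
    # absolute floor so a near-zero baseline does not trigger a rollback
    # on a single failed request.
    threshold = max(baseline_rate * max_ratio, 0.01)
    return "rollback" if canary_rate > threshold else "promote"
```

A gate like this only limits the blast radius of a fault that has already reached production traffic; it does not replace detection in earlier stages.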

[Figure: Testing Pyramid Shift]

Microservices Testing is Hard

Microservice architecture is relatively new and complex, especially when combined with tools such as Kubernetes and Docker, which strongly influence system behaviour.

Testing a single service in a microservice architecture is easy in theory but not in practice, since a service is typically part of a larger piece of business logic and depends on other services to complete it. As mentioned before, the existing practices that support service testing (e.g. mocks, API-contract testing) are complex and costly to implement.
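API-contract testing, the second practice mentioned above, can be sketched in a few lines: the consumer of a service declares which fields it relies on and their types, and a check reports any response that breaks that expectation. `contract_violations` and `ORDER_CONTRACT` are hypothetical names for this example, and the flat field-to-type mapping is a deliberate simplification of what dedicated contract-testing tools do.

```python
def contract_violations(response, contract):
    """Return a list of violations; an empty list means the contract holds."""
    violations = []
    for field, expected_type in contract.items():
        if field not in response:
            violations.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            actual = type(response[field]).__name__
            violations.append(f"{field}: expected {expected_type.__name__}, got {actual}")
    return violations


# Hypothetical contract: the fields a consumer expects from an order endpoint.
ORDER_CONTRACT = {"order_id": str, "total_cents": int, "items": list}

good = {"order_id": "a1", "total_cents": 100, "items": []}
bad = {"order_id": "a1", "total_cents": "100"}

print(contract_violations(good, ORDER_CONTRACT))
print(contract_violations(bad, ORDER_CONTRACT))
```

One such contract is needed for every consumer and provider pair, so the maintenance effort grows with the number of services and the connections between them.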

The testing challenge is exacerbated when you factor in the number of microservices you'd like your testing strategy to cover.

Overall, what was supposed to be a federation of lightweight, decoupled and testable services becomes very challenging to test, which makes it hard to prevent faults from reaching production.

Automatic Fault Detection

UP9 introduces a new, automatic testing technique that further increases deployment confidence. It automatically detects faults early in pre-production and prevents them from becoming production failures by filling the gap between unit testing and end-to-end testing, and it provides additional layers of protection once in production.

[Figure: Automatic Fault Detection]

By launching active protection layers that provide API test-coverage in multiple stages (e.g. Integration, Canary, Production), UP9 can detect faults early and prevent them from becoming production failures.

The machine-generated API test-coverage is instant and self-updating. It supports large numbers of rapidly changing distributed microservices, and the test-coverage increases over time.


UP9 provides out-of-the-box test automation for microservices, Kubernetes and cloud-native environments, replacing the need for developers to constantly build and maintain tests while providing comprehensive service test-coverage.


  • Automatic generation and maintenance of CI-ready test-code, based on service traffic
  • Observability into API-contracts, business-logic and service architecture
  • Automatic reliability, test-coverage and root-cause analysis
  • Machine-generated tests that include functional, regression, performance and edge-case test-cases, covering all services and all service-endpoints
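The general idea of deriving tests from service traffic, as in the first item above, can be illustrated with a toy sketch. This is not UP9's implementation, which the text does not describe; the record format, `generate_test_cases`, `run_test_case`, and the `send` callable are all assumptions made for this example.

```python
def generate_test_cases(recorded_traffic):
    """Derive one test case per (method, path) from recorded request/response pairs."""
    cases = []
    seen = set()
    for entry in recorded_traffic:
        key = (entry["method"], entry["path"])
        if key in seen:
            continue  # this sketch keeps only the first observation per endpoint
        seen.add(key)
        cases.append({
            "method": entry["method"],
            "path": entry["path"],
            "expected_status": entry["status"],
            "expected_fields": sorted(entry["body"].keys()),
        })
    return cases


def run_test_case(case, send):
    """Replay a case through `send(method, path) -> (status, body)`; True if it passes."""
    status, body = send(case["method"], case["path"])
    return (status == case["expected_status"]
            and sorted(body.keys()) == case["expected_fields"])


# Hypothetical recorded traffic for a single endpoint, observed twice.
traffic = [
    {"method": "GET", "path": "/orders/1", "status": 200, "body": {"id": 1, "total": 5}},
    {"method": "GET", "path": "/orders/1", "status": 200, "body": {"id": 1, "total": 5}},
]
cases = generate_test_cases(traffic)
```

Here the expectations come from observed behaviour rather than hand-written assertions, so regenerating the cases from fresh traffic is what would keep them current as services change.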

UP9 offloads the microservice testing workload from developers, giving them precious time back.