What Stops You From Having Zero-Touch Deployment?

2 months ago

9 Min Read

This article aims to prove that eliminating coordinated deployments and human-supervised deployment processes is not only possible but essential for modern software development. Organizations can achieve steady CI/CD without operator involvement in happy-path scenarios by implementing comprehensive testing strategies, proper deployment culture, and automated quality controls that build trust in the deployment pipeline.

Background

The deployment of complex services and infrastructures has traditionally gone hand in hand with manual monitoring of deployment processes and their success. However, in the world of effective software development, each organization should consider automating as many processes as possible that are not related to creativity and value generation. CI/CD processes and software release procedures are typically maintenance processes: they allow teams to review, test, and deliver updates to customers successfully, but they are not creative work in themselves.

The creative work that teams should focus on is the actual updates and new project capabilities themselves. Therefore, the goal of each organization should be to achieve a state of zero-touch CI/CD processes, where human intervention occurs only as a final approval control and doesn’t require significant operational effort.

This challenge becomes particularly sensitive in startup environments. Early-stage companies are intensely focused on delivering value. First, they work on functionality to deliver a minimum viable product to the market. In this situation, they don’t spend much effort building a foundation for a healthy CI/CD environment that minimizes failure risks, simply because there are no customers yet.

As soon as companies become exposed to external customers who depend on their services and generate revenue, they start prioritizing the implementation of proper processes. However, at that moment, it becomes more complicated. Development processes start to slow down. Teams begin investing in operational excellence and continuous integration, but they try to do the minimum work necessary to ensure they still have the capacity to deliver features. The more companies delay operational investments, the more technical debt they accumulate for the future.

Problem

The problem statement is clear: to achieve zero-touch CI/CD processes, we need processes where engineers are involved only where they’re critically needed, and don’t spend time in places where it’s unnecessary. We need to ensure that development and deployment can co-exist smoothly, with development teams having all processes set up and pointing toward automation of repetitive tasks.

Opportunity

The solution for achieving zero-touch deployment is fundamentally related to trust and confidence. Teams should trust their automated approval workflows and CI/CD quality controls enough to delegate to them the identification and signaling of issues before those issues reach production. They can build that trust through the following efforts:

Comprehensive Testing Strategy

From a continuous integration perspective, engineers should have properly defined testing approaches to ensure new changes don’t break existing functionality and meet all technical standards and expectations.

Essential Test Coverage:

Unit Tests: Cover lines of business logic (not necessarily 100% code coverage). Focus on testing business logic and decisions, ensuring consistency and proper functionality. No need to unit test object definitions – concentrate on the logic that makes business decisions.
Integration and End-to-End Tests: Ensure that contracts for the services are followed and that key user stories remain functional in each release candidate. End-to-end tests should cover all existing user stories implemented in the project, along with their corresponding edge cases. The team can collaborate with a Project Owner to run a user story mapping activity, identifying all active user stories and ensuring each of them is covered with dedicated tests.
Security Vulnerability Scanning: Verify that code artifacts don’t depend on third-party packages with identified security vulnerabilities. This workflow should fail immediately when such vulnerabilities are detected. Tools like Dependabot on GitHub, as well as many other solutions, make this achievable. To ensure that the team doesn’t introduce new security vulnerabilities of its own, automated security tests should also be implemented. Use ScoutSuite for automated cloud infrastructure audits, combined with self-written authentication and authorization tests, as a foundation for your security assessment.
Code Review Processes: While not automated, peer review remains essential because there are many ways to implement the same business logic. Aspects such as maintainability, code reuse, and logical structure should align with organizational standards.
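
The "fail immediately" rule for vulnerability scanning can be sketched as a small gate that turns scanner findings into a workflow exit code. This is only an illustration: the Finding shape, the severity names, and the default "high" threshold are assumptions made for this example and do not reflect the output format of Dependabot or any other scanner.

```python
from dataclasses import dataclass

# Hypothetical shape of one scanner finding; real scanners emit richer reports.
@dataclass(frozen=True)
class Finding:
    package: str
    severity: str  # one of "low", "medium", "high", "critical"

SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}

def blocking_findings(findings, threshold="high"):
    """Return the findings whose severity is at or above the threshold."""
    minimum = SEVERITY_RANK[threshold]
    return [f for f in findings if SEVERITY_RANK[f.severity] >= minimum]

def gate_exit_code(findings, threshold="high"):
    """Return 1 (fail the workflow) if any blocking finding exists, else 0."""
    return 1 if blocking_findings(findings, threshold) else 0
```

A CI step could pass this value to sys.exit so that the job fails as soon as a blocking finding appears. Setting the threshold to "low" makes the gate fail on every finding, which is the strictest reading of the rule.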

Previously, teams had to spend days or weeks reaching this level of coverage, and in many cases writing tests was as time-consuming as implementing the business logic. This is where LLMs can and should support teams. Modern code-oriented agents can produce 90% of the test code you need in minutes. Teams will still need to allocate time to ensure all edge cases are covered and to instruct the LLM to replicate specific behaviors of the business logic, but that effort is far smaller than writing everything from scratch.

Deployment Pipeline Architecture

The general guidance is to perform as many deployments as possible, with each pull request converted into a release version and steadily deployed through the deployment process. To ensure quality in each release, the project should have pre-production stages:

1. Development Environment (Unstable)

Used by engineers for features they’re currently developing. Engineers can load code that hasn’t been submitted for review to test how the change fits into the overall project picture.
Shouldn’t be a part of the main deployment pipeline. It can be a set of sidecar environments that replicate the main infrastructure.

2. Testing/QA Environment (Can be Unstable)

Used for quality assurance of specific features already in the build
Can have lightweight automated tests as part of the workflow that approves the propagation of the change set to the next stage of release deployment.
Should hold a fraction of the production data volume (dummy auto-generated data or anonymized data from production). This stage should automatically test the release candidate against basic functional and security requirements. Those tests should not take longer than 30 minutes and serve a single purpose: to signal issues faster.
The project may not have this stage if the approval workflows of the next stage have an acceptable duration, or if maintaining additional replicas of the service is not sustainable for the company.

3. Pre-Production Environment (Stable)

Should have minimal differences from the production environment.
Database size should emulate production scale (with dummy but sophisticated data).
Should be stable enough for customer or stakeholder demonstrations.
Used for final comprehensive testing.

Comprehensive Testing in Pre-Production:

Before reaching production, a release candidate should pass all possible tests:

Unit tests to verify no unexpected changes in the depth of the business logic.
Full integration test suite to verify API contracts and interactions of the service with other services.
Security tests to ensure no critical security vulnerabilities are introduced in the service code or new dependencies.
Load tests to ensure the service can handle the existing production load and scale with the growth.
Infrastructure tests (chaos testing) to ensure the service can handle the failure of non-critical dependencies properly and signal failures in critical areas as soon as possible.
Canary tests – fast tests of the key user capabilities. Those tests should run every 1-5 minutes against each production environment to ensure the health of key features and signal any unexpected behaviors.
End-to-end tests to ensure all user stories work as expected from start to end through all the services and project capabilities.

These tests ensure releases have sufficient quality for production deployment. Certainly, there are more test methodologies to consider, but here we focus on the basic case, avoiding the specifics of each project, and concentrate on building minimal confidence.
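
One way to picture the pre-production gate is as an ordered list of suites where the release candidate is promoted only if every suite passes. The sketch below is a simplified model: each suite is a hypothetical callable that returns True on success, and the fail-fast ordering is a design choice of this example rather than something the stages above prescribe.

```python
from typing import Callable, List, Tuple

# A suite is a name plus a callable that returns True when the suite passes.
Suite = Tuple[str, Callable[[], bool]]

def run_gate(suites: List[Suite]) -> Tuple[bool, List[str]]:
    """Run suites in order and stop at the first failure.

    Returns (promote, executed_names) so the pipeline can report
    which suite blocked the release candidate.
    """
    executed = []
    for name, run in suites:
        executed.append(name)
        if not run():
            return False, executed
    return True, executed
```

Placing the cheapest suites first (unit, then integration, then load and chaos tests) keeps the feedback loop short when a release candidate is broken.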

Production Deployment Controls

Production environments should have their own controls and guardrails:

Functional Tests: Run core functional tests, integration tests, and end-to-end tests in production.
Gradual Deployment: Implement blue-green deployment with a gradual traffic allocation (e.g., 5% → 50% → 100%, or any other suitable distribution).
Canary Testing: Very lightweight tests running every 1-5 minutes to verify basic service functionality.
Staged Rollout: Include wait times or “bake time” between deployment stages to allow issues to surface through test failures or configured alarms.
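
The interaction between gradual traffic allocation, bake time, and automatic rollback can be sketched as a short loop. The set_traffic and is_healthy callables are hypothetical hooks into your load balancer and alarm system, and the ten-minute default bake time is an assumption chosen only for illustration.

```python
import time
from typing import Callable, Sequence

def gradual_rollout(
    set_traffic: Callable[[int], None],
    is_healthy: Callable[[], bool],
    steps: Sequence[int] = (5, 50, 100),
    bake_seconds: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Shift traffic to the new version step by step.

    After each step the function waits for the bake time and then checks
    health. On the first unhealthy check it sends all traffic back to the
    previous version (0% on the new one) and returns False.
    """
    for percent in steps:
        set_traffic(percent)
        sleep(bake_seconds)  # bake time: let alarms and canaries surface issues
        if not is_healthy():
            set_traffic(0)   # roll back to the previous version
            return False
    return True
```

The sleep parameter is injectable so the same function can be exercised in tests without waiting for real bake times.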

Cut the Costs of Test Creation

Leveraging AI for Test Development:

Modern development can significantly benefit from Large Language Models (LLMs) for test creation. While business logic coding may require human expertise, test code has lower quality standards than production code and serves the purpose of building confidence in releases.

Testing typically involves flat structures of modules and page models that LLMs can handle with good quality.

Behavioral Testing with Gherkin:

Product managers and those responsible for functional acceptance can write tests using Gherkin syntax. Test frameworks convert these user stories into coding constructs and run them as regular tests. This ensures user stories are directly converted into test cases that verify release quality and automatically validate acceptance criteria.

While this won’t replace product managers manually checking that everything looks correct, it ensures experience consistency and that each release follows the initially defined user stories.
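
To make the Gherkin workflow concrete, here is a deliberately tiny runner that maps Given/When/Then lines to Python functions. It is a sketch, not a replacement for a real framework such as Cucumber or behave: the step registry, the cart example, and all function names are invented for this illustration.

```python
import re

STEPS = []  # (compiled pattern, handler) pairs, filled by the @step decorator

def step(pattern):
    """Register a handler for Gherkin lines that match the pattern."""
    def register(func):
        STEPS.append((re.compile(pattern), func))
        return func
    return register

@step(r"a cart with (\d+) items")
def given_cart(ctx, count):
    ctx["items"] = int(count)

@step(r"the user removes (\d+) items?")
def when_remove(ctx, count):
    ctx["items"] -= int(count)

@step(r"the cart contains (\d+) items?")
def then_cart(ctx, count):
    assert ctx["items"] == int(count), f"expected {count}, got {ctx['items']}"

def run_scenario(text):
    """Execute each step line against the registered handlers."""
    ctx = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("Scenario"):
            continue
        # Drop the leading keyword so only the step text is matched.
        body = re.sub(r"^(Given|When|Then|And|But)\s+", "", line)
        for pattern, handler in STEPS:
            match = pattern.fullmatch(body)
            if match:
                handler(ctx, *match.groups())
                break
        else:
            raise LookupError(f"no step matches: {line}")
    return ctx

SCENARIO = """
Scenario: Removing items from the cart
  Given a cart with 3 items
  When the user removes 1 item
  Then the cart contains 2 items
"""
```

Calling run_scenario(SCENARIO) executes the three steps in order. A scenario whose Then step does not hold raises AssertionError, which a test runner reports as a failed acceptance criterion.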

Deployment Culture and Operational Excellence

Timing and Scheduling:

Restrict deployments to working hours when troubleshooting resources are available. Consider disabling production deployments on weekends and, optionally, on Friday afternoons. Troubleshooting deployment issues is intensive work that shouldn’t happen during off-hours.
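
A deployment window like this can be enforced by a small check at the start of the production stage. The 09:00 to 17:00 window and the Friday noon cutoff below are assumptions picked for the example; adjust them to your team’s working hours and time zone.

```python
from datetime import datetime, time

def deployment_allowed(
    now: datetime,
    start: time = time(9, 0),
    end: time = time(17, 0),
    friday_cutoff: time = time(12, 0),
) -> bool:
    """Return True if a production deployment may start at `now`."""
    weekday = now.weekday()  # Monday is 0, Sunday is 6
    if weekday >= 5:
        return False  # no deployments on weekends
    if weekday == 4 and now.time() >= friday_cutoff:
        return False  # no deployments on Friday afternoons
    return start <= now.time() < end
```

The pipeline would pass the current local time of the on-call team, for example deployment_allowed(datetime.now()), and postpone the release when the result is False.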

Infrastructure Health Monitoring:

If the service is running in the cloud, automated pipeline processes should verify the cloud service’s health status before initiating deployments. All popular cloud providers have health dashboards where they actively report any ongoing incidents. Automated workflows should verify that underlying infrastructure services aren’t impacted before proceeding.
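
A pre-deployment health check could look like the sketch below. The fetch_status callable is a hypothetical adapter that you would write around your provider’s status feed; the service names and the "operational" status string are likewise assumptions, since each provider reports incidents in its own format.

```python
from typing import Callable, Dict, Iterable, List

def impacted_services(
    fetch_status: Callable[[], Dict[str, str]],
    required: Iterable[str],
) -> List[str]:
    """Return the required services not reported as operational.

    A service missing from the report is treated as impacted, so the
    check fails closed when the status feed is incomplete.
    """
    status = fetch_status()
    return [s for s in required if status.get(s, "unknown") != "operational"]

def safe_to_deploy(fetch_status, required) -> bool:
    """Return True only when every required service is operational."""
    return not impacted_services(fetch_status, required)
```

Limiting the check to the services your release actually depends on avoids blocking deployments because of unrelated provider incidents.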

Configuration Drift Detection:

Analyze the infrastructure before deploying to a new stage to ensure the infrastructure hasn’t been manually modified. Detect any drifts that could impact deployment health, stop the deployment, and raise alarms if critical drifts are detected.
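
At its core, drift detection is a comparison between the declared configuration and what is actually running. The sketch below assumes both have already been loaded into flat dictionaries; the keys and the notion of "critical" keys are hypothetical, and real infrastructure-as-code tools perform a much deeper comparison.

```python
from typing import Any, Dict, Set, Tuple

def detect_drift(
    declared: Dict[str, Any], actual: Dict[str, Any]
) -> Dict[str, Tuple[Any, Any]]:
    """Map each drifted key to (declared_value, actual_value).

    A key present on only one side shows None for the other side.
    """
    keys = declared.keys() | actual.keys()
    return {
        key: (declared.get(key), actual.get(key))
        for key in sorted(keys)
        if declared.get(key) != actual.get(key)
    }

def should_stop_deployment(drift: Dict[str, Tuple[Any, Any]], critical_keys: Set[str]) -> bool:
    """Stop the deployment when any drifted key is marked critical."""
    return any(key in critical_keys for key in drift)
```

Non-critical drift can still be logged or raised as a low-priority alarm so that manual changes are reconciled later.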

Comprehensive Alarming Strategy:

Each deployment stage should have dedicated alarms configured. For blue-green deployments with traffic allocation, alarming becomes complex because you may not have enough data when traffic allocation is small, but waiting for larger allocation might be too late.

The solution is to:

Deploy to small infrastructure pieces first.
Have alarms dedicated to those specific pieces.
Ensure that consistent failure detection triggers alarms. Configure alarms for API availability, latency, and all service SLAs.
Allow automation to respond to alarms before human operators are involved. If the service starts to raise alarms after deployment, the pipeline automation should be capable of automatically rolling back all instances of this release candidate. The alarm should involve an operator to evaluate the incident itself, but during off-hours it can take a significant time for a person to wake up, read alarms, apply incident runbooks, and understand the root cause.
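
The steps above can be sketched as an alarm evaluation scoped to one small infrastructure slice, with automation acting before the operator is paged. All thresholds (1% error rate, 500 ms p99 latency, 200 requests as the minimum sample) and the SliceMetrics shape are assumptions for illustration only; derive real values from your own SLAs.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class SliceMetrics:
    requests: int
    errors: int
    p99_latency_ms: float

def evaluate_slice(
    metrics: SliceMetrics,
    max_error_rate: float = 0.01,
    max_p99_ms: float = 500.0,
    min_requests: int = 200,
) -> str:
    """Return 'insufficient_data', 'rollback', or 'healthy' for one slice."""
    if metrics.requests < min_requests:
        return "insufficient_data"  # too little traffic to judge yet
    error_rate = metrics.errors / metrics.requests
    if error_rate > max_error_rate or metrics.p99_latency_ms > max_p99_ms:
        return "rollback"
    return "healthy"

def respond(
    metrics: SliceMetrics,
    rollback: Callable[[], None],
    page_operator: Callable[[], None],
) -> str:
    """Roll back automatically first, then involve a human operator."""
    verdict = evaluate_slice(metrics)
    if verdict == "rollback":
        rollback()
        page_operator()
    return verdict
```

The explicit "insufficient_data" verdict reflects the trade-off described earlier: with a small traffic allocation the pipeline should keep baking rather than promote or roll back on too few samples.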

Conclusion

The path to zero-touch deployment isn’t about eliminating human oversight entirely – it’s about building systems trustworthy enough that human intervention becomes the exception rather than the rule. Organizations that invest in comprehensive testing strategies, proper deployment culture, and automated quality controls will find themselves with the competitive advantage of rapid, reliable deployments.

The question isn’t whether zero-touch deployment is possible – it’s whether your organization is ready to invest in the trust-building infrastructure that makes it achievable.

Maksim

I build AI-powered products and lead engineering teams. I've launched platforms from zero to millions of users and learned most lessons the hard way. I write about the gap between engineering theory and practice, what actually matters when building products, and the decisions that shape teams and systems.
