Are You Choosing Cloud Services Based on Their Marketing Promises?


This article highlights the critical challenges engineers and leaders face when selecting cloud services and providers. While functional requirements typically drive service selection, operational reliability and incident management capabilities are often overlooked, creating significant business risks. The article explores practical approaches to evaluate cloud services beyond their feature sets, focusing on operational maturity, control mechanisms, and historical reliability data to ensure your chosen solution meets both functional and operational requirements.

Background

The cloud computing landscape has fundamentally transformed how organizations operate. Today, the majority of companies run their businesses on cloud infrastructure, and even organizations with strict regulatory requirements or security concerns (those that previously insisted on isolated, on-premises environments) are migrating to the cloud (Inc., 2024). Modern cloud providers now offer highly secure, compliant environments that meet even the most stringent regulatory standards.

However, the abundance of choice creates its own challenges. Engineers and leaders must navigate not only which cloud provider to select, but also which specific service within that provider’s ecosystem best solves their business problem. Most cloud providers offer multiple ways to solve the same engineering challenge, each with different functional capabilities, behavioral characteristics, and operational trade-offs.

While the functional capabilities of cloud services are typically well documented with clear happy-path scenarios, another critical aspect demands attention before committing to any infrastructure decision: operational risk assessment. Official documentation excels at describing what services do, but rarely provides comprehensive guidance on how services behave during incidents, how quickly they recover from failures, and what control mechanisms you have during operational disruptions.

The rise of serverless computing over the past decade has amplified this challenge. The serverless paradigm promises that businesses can offload operational and security responsibilities to cloud providers, relying on provider engineering teams to handle failure scenarios quickly and effectively. In many cases, this approach has proven sustainable as cloud services develop mature operational backbones with polished processes for availability, resilience, and fault tolerance.

However, from a business perspective (especially for companies with critical operational requirements, such as those in healthcare), relying on the hope that a service will mature over several years isn’t viable. Businesses need solutions that meet their resilience requirements right now. The choice of cloud service can make or break the business, which makes proper evaluation essential from day one.

Problem

The core challenge is clear: How do you properly select a cloud service or solution that balances your operational risks and business requirements without unpleasant surprises?

Serverless solutions, while attractive for their simplicity and reduced operational overhead, remove your ability to influence the operational state of the service. When incidents occur, recovery responsibility lies entirely with the cloud provider’s service team. Providers typically publish uptime commitments (99.95%, 99.99%, and so on) in their Service Level Agreements (SLAs), but these figures alone don’t tell the complete story.

The critical questions that remain unanswered are:

  • What factors should you evaluate to ensure a service works for your specific requirements out of the box?
  • How do you prepare your architecture to minimize business impact when service issues occur?
  • What level of operational control do you need to enforce your own reliability standards?

Opportunity

Let’s explore a concrete scenario to understand how to approach cloud service selection with operational requirements in mind.

Critical Service Example:

Imagine a startup developing a real-time health tracking service for critical patients. The system provides centralized monitoring with the ability to react within seconds to any deviation in the patient’s health status. For this service, even a one-minute interruption could be life-threatening. The availability requirements are extreme: there’s no acceptable downtime window longer than one minute.

When evaluating cloud solutions for such a service, serverless options appear attractive initially. You upload your business logic, configure scaling parameters and provisioning settings, integrate with other services, and, in theory, forget about operational concerns. However, the moment an incident occurs, the limitations become apparent.

Understanding Incident Reality:

Incidents are inherently difficult to predict from an engineering perspective. The best a service development team can do is ensure they have fast, automated fallback mechanisms to reroute traffic when problems arise. Modern cloud architectures help by partitioning each region into multiple availability zones. When an outage affects one zone, traffic and requests can be processed in another with minimal impact.

The critical factor here is detection and rebalancing speed: How quickly can the service detect an unhealthy zone and rebalance traffic to healthy locations?
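As a rough way to reason about that speed, the sketch below estimates a worst-case delay from a probe interval, a failure threshold, and a propagation delay. The function name and all three parameters are assumptions made for illustration; they are not settings of any particular cloud service, and probe timeouts are ignored.

```python
def worst_case_eviction_seconds(check_interval_s, failures_to_mark_unhealthy, propagation_s=0.0):
    """Rough upper bound on the time between a zone failing and traffic leaving it.

    All three parameters are assumptions of this sketch, not settings of any
    specific cloud service. Probe timeouts are ignored for simplicity.
    """
    # A failure that starts right after a successful probe is first seen one full
    # interval later, and each further required failure costs another interval.
    detection_s = check_interval_s * failures_to_mark_unhealthy
    return detection_s + propagation_s


# Example: probes every 10 s, 3 consecutive failures required, 5 s to propagate
# the routing change, compared against the one-minute budget from the scenario.
delay_s = worst_case_eviction_seconds(10, 3, propagation_s=5)
print(delay_s, "seconds; within budget:", delay_s <= 60)
```

With these example numbers the estimate is 35 seconds, which leaves little margin inside a one-minute budget. A 30-second probe interval with the same threshold would already exceed it.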

The Partial Failure Challenge:

Here’s what most teams learn the hard way: The majority of operational failures don’t cause complete outages of entire regions or availability zones. Instead, impacts are partial, meaning health checks to a fraction of the service still succeed. Basic routing strategies, such as health-check-based load balancing, will continue allocating some requests to the degraded location.

At the same time, each cloud service applies its own criteria for deciding when a zone or data center is unhealthy and should stop serving traffic. Service teams set these thresholds based on their own availability targets and their view of an acceptable customer experience, which may not match yours.
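The effect of a partial failure combined with a provider-chosen threshold can be illustrated with a small model. This is a simplified sketch, not how any real load balancer works: the `Zone` type, the even traffic split, and the `min_healthy_fraction` rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    name: str
    hosts_total: int
    hosts_failing: int  # hosts that fail every request they receive

    @property
    def healthy_fraction(self) -> float:
        return (self.hosts_total - self.hosts_failing) / self.hosts_total


def passes_health_check(zone, min_healthy_fraction):
    # Hypothetical zone-level rule: the zone stays in rotation while at least
    # this fraction of its hosts still responds.
    return zone.healthy_fraction >= min_healthy_fraction


def expected_error_rate(zones, min_healthy_fraction=0.5):
    """Share of requests expected to fail with traffic split evenly across zones in rotation."""
    in_rotation = [z for z in zones if passes_health_check(z, min_healthy_fraction)]
    if not in_rotation:
        return 1.0  # nothing left to serve traffic
    return sum(1 - z.healthy_fraction for z in in_rotation) / len(in_rotation)


zones = [Zone("zone-a", 10, 0), Zone("zone-b", 10, 3), Zone("zone-c", 10, 0)]
lenient = expected_error_rate(zones, min_healthy_fraction=0.5)  # zone-b stays in rotation
strict = expected_error_rate(zones, min_healthy_fraction=0.9)   # zone-b is evicted
print(f"lenient threshold: {lenient:.0%} errors, strict threshold: {strict:.0%} errors")
```

In this model, a zone with 3 of 10 hosts failing still passes the lenient check, so roughly one request in ten fails until someone evicts the zone. Who sets that threshold, and whether you can override it, is the control question discussed in the rest of this article.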

Evaluating Service Commitments:

For our critical health monitoring example, here’s the evaluation framework:

1. Check SLA Commitments:

First, examine the service’s availability commitment and translate it into time. If it’s 99.95% and measured monthly, the service has a budget of roughly 21.6 minutes of downtime in a 30-day month, and nothing requires that budget to be spread out: the service could be down for a single stretch of about 20 minutes and still meet its SLA. Ask yourself: Would an outage of that length be acceptable for your business? If the answer is no, you need either a more reliable service with stronger guarantees or a different architectural approach. Also, keep in mind that an SLA is a target, and the service can miss it. When that happens, the cloud provider may compensate you for the cost of the service, but this will not cover the losses to your business.
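The arithmetic behind a downtime budget is simple enough to script for any commitment you are comparing. This is a minimal sketch that assumes a 30-day measurement period by default; real SLAs define their own measurement windows and exclusions, so treat the output as an approximation.

```python
def allowed_downtime_minutes(sla_percent, period_days=30):
    """Downtime budget implied by an availability percentage over a period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100 - sla_percent) / 100


for sla in (99.9, 99.95, 99.99):
    monthly = allowed_downtime_minutes(sla)
    yearly = allowed_downtime_minutes(sla, period_days=365)
    print(f"{sla}%: about {monthly:.1f} min per 30 days, {yearly / 60:.1f} h per year")
```

For the health monitoring scenario, where any interruption longer than one minute is unacceptable, even the 99.99% row (about 4.3 minutes per 30 days) leaves room for a single outage that breaks the requirement.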

2. Assess Operational Control:

This is where serverless solutions often fall short. Many serverless offerings don’t provide sufficient control over operational posture. You cannot manually mark specific availability zones or areas as faulty; that control remains entirely with the cloud provider.

If the service team’s reaction time (based on their best practices and automated responses) doesn’t align with your requirements, you face a critical mismatch. Consider services that offer greater manual control, enabling you to enforce your own operational standards.

Examples of Control Limitations:

AWS Lambda and API Gateway don’t allow you to manage which availability zones host your instances. You configure the service, it launches across the entire region, and only the service team controls the deployment topology. While these services offer fast, automated failure detection and traffic migration, if your business has extremely critical uptime requirements, you may need alternatives.

Container-Based Alternatives:

Services like Amazon ECS, EKS, or similar container orchestration platforms provide more granular control. You can manually terminate instances or evict them from specific availability zones whenever needed, independent of the service team’s automated responses. If you observe service interruptions in a specific zone, you can remove all instances in that zone from the request routing path within seconds.

With this level of control, you can enforce your business requirements and operational standards directly, rather than depending entirely on the provider’s automated systems.
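To make the idea of manual zone eviction concrete, here is a toy model of the operation. `RoutingTable` is a hypothetical in-memory stand-in, not an AWS or Kubernetes API; on a real platform the eviction step would map to that platform's own deregistration or drain mechanism, which is not shown here. The `min_remaining` guard is also an assumption, added to illustrate one safety check worth having around a manual control.

```python
from collections import defaultdict


class RoutingTable:
    """Hypothetical in-memory stand-in for a load balancer's target list."""

    def __init__(self):
        self._targets = defaultdict(set)  # zone name -> instance ids

    def register(self, zone, instance_id):
        self._targets[zone].add(instance_id)

    def routable(self):
        """Instance ids that can currently receive traffic."""
        return sorted(i for ids in self._targets.values() for i in ids)

    def evict_zone(self, zone, min_remaining=1):
        """Remove every instance in `zone` from the routing path.

        Refuses to act if fewer than `min_remaining` instances would be left,
        so an operator cannot evict the last healthy capacity by mistake.
        """
        evicted = self._targets.get(zone, set())
        remaining = len(self.routable()) - len(evicted)
        if remaining < min_remaining:
            raise ValueError(f"evicting {zone} would leave {remaining} instances")
        self._targets.pop(zone, None)
        return sorted(evicted)


table = RoutingTable()
for zone, instance in [("zone-a", "i-1"), ("zone-a", "i-2"), ("zone-b", "i-3"), ("zone-c", "i-4")]:
    table.register(zone, instance)

print("evicted:", table.evict_zone("zone-a"))
print("still routable:", table.routable())
```

The point of the sketch is the shape of the control: one call, scoped to a zone, that you can trigger on your own signal rather than waiting for the provider's automation.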

3. Analyze Historical Incident Data:

Cloud providers maintain health dashboards with incident histories. AWS Health Dashboard, for example, lists ongoing service interruptions and provides one year of historical incident data for each service in each region, including:

  • Incident start and end times
  • Impact type (full outage, partial outage, increased latency, etc.)
  • Affected components

By analyzing this historical data, you can identify patterns:

  • What types of operational issues does this service experience?
  • How long does recovery typically take?
  • Are interruptions usually partial or complete?
  • Are issues localized to specific availability zones?
  • What’s the frequency of incidents?
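Once the incident history is copied into a structured form, the questions above reduce to a few aggregates. The sketch below uses made-up placeholder records (they are not real incidents from any provider) and a record layout chosen for illustration; how you export or transcribe the dashboard data is up to you.

```python
from datetime import datetime
from statistics import median

# Placeholder records for illustration only; not real incident data.
INCIDENTS = [
    {"start": "2024-01-10T08:00", "end": "2024-01-10T08:45", "impact": "partial", "zone": "zone-a"},
    {"start": "2024-03-02T14:00", "end": "2024-03-02T16:00", "impact": "full", "zone": None},
    {"start": "2024-06-20T22:10", "end": "2024-06-20T22:25", "impact": "partial", "zone": "zone-b"},
    {"start": "2024-09-05T03:00", "end": "2024-09-05T03:30", "impact": "latency", "zone": "zone-a"},
]


def duration_minutes(incident):
    start = datetime.fromisoformat(incident["start"])
    end = datetime.fromisoformat(incident["end"])
    return (end - start).total_seconds() / 60


def summarize(incidents):
    """Aggregate an incident list into the figures that matter for service selection."""
    durations = [duration_minutes(i) for i in incidents]
    return {
        "count": len(incidents),
        "median_recovery_min": median(durations),
        "longest_recovery_min": max(durations),
        "partial_share": sum(i["impact"] == "partial" for i in incidents) / len(incidents),
        "zone_scoped_share": sum(i["zone"] is not None for i in incidents) / len(incidents),
    }


summary = summarize(INCIDENTS)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Comparing `longest_recovery_min` against your own tolerance is usually the most revealing check: if typical recovery is measured in tens of minutes and your budget is one minute, you need a way to route around the problem yourself.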

Making the Trade-off Decision:

The fundamental trade-off is between operational simplicity and operational control:

Serverless Solutions:

  • Minimal operational overhead
  • Rich feature sets (API Gateway provides authentication, throttling, schema validation, etc.)
  • Limited control during incidents
  • Dependency on the provider’s incident response

Container/Semi-Managed Solutions:

  • More operational responsibility
  • Greater control over traffic routing and instance management
  • Ability to manually intervene during incidents (e.g., removing traffic from specific zones with a single API call in seconds)
  • Requirement to implement features yourself (authentication logic, throttling, schema validation)

For many services, the operational simplicity of serverless solutions outweighs the control limitations. However, for critical services with stringent availability requirements, the additional operational work of semi-managed solutions becomes justified.

Conclusion

Choosing cloud services based only on functional capabilities is like buying a car based solely on its entertainment system: you’re ignoring the engine, brakes, and other features that determine whether you’ll reach your destination safely.

The operational characteristics of cloud services (their incident history, recovery patterns, and the control mechanisms they provide) are just as critical as their feature sets. For services with stringent availability requirements, the luxury of serverless simplicity may be a risk you cannot afford.

The question isn’t whether serverless or semi-managed services are better; it’s whether you’ve assessed your operational requirements and matched them against the service’s operational reality. Have you looked beyond the marketing materials and happy-path documentation to understand how the service behaves when things go wrong?

Before committing to your next cloud service, ask yourself: Are you making your decision based on complete information, or are you choosing blindly and hoping for the best?

About the author

Maksim

I build AI-powered products and lead engineering teams. I've launched platforms from zero to millions of users and learned most lessons the hard way. I write about the gap between engineering theory and practice, what actually matters when building products, and the decisions that shape teams and systems.
