Control Planes and Data Planes in AWS

The control plane (CP) comprises the systems that configure the resources running in the data plane. Control planes provide the administrative APIs used to create, read/describe, update, delete, and list (CRUDL) resources. In other words, the control plane allocates and configures resources, and the data plane runs them.

The data plane consists of the systems that let you consume those resources, which is the primary function of the service.

Control planes and data planes (DP) are decoupled to improve resiliency, so that a failure in the control plane should not impact the data plane. AWS applies this design principle in most of its services to improve their performance and availability.

Data Plane’s Operation with Impaired Control Plane

The control plane replicates its configuration data to multiple data plane replicas distributed across Regions. That enables the data plane to continue working when the control plane is impaired. The data plane can access and operate on resources that have already been provisioned, whether or not the control plane is available.

For instance, if the ability to deploy and configure load balancers is unavailable, pre-deployed load balancers continue to serve requests based on their current configuration. Similarly, the EC2 control plane is responsible for allocating and reconfiguring instances, while the data plane is responsible for running instances and the interactions with them, and it can do so even when the control plane is unavailable.
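The idea of a data plane that keeps working from its last known configuration can be sketched in a few lines. This is a minimal illustration, not how any AWS service is implemented: the `DataPlaneConfig` class, the `fetch` callable standing in for a control plane call, and the `routes` layout are all hypothetical names chosen for the example.

```python
from typing import Callable, Optional


class DataPlaneConfig:
    """Holds the last configuration successfully obtained from the control plane."""

    def __init__(self, fetch: Callable[[], dict]):
        self._fetch = fetch  # hypothetical call into the control plane
        self._last_good: Optional[dict] = None

    def refresh(self) -> bool:
        """Try to pull fresh configuration; keep the old copy on failure."""
        try:
            self._last_good = self._fetch()
            return True
        except Exception:
            # Control plane impaired: keep serving from the cached copy.
            return False

    def current(self) -> dict:
        if self._last_good is None:
            raise RuntimeError("no configuration was ever loaded")
        return self._last_good


def pick_backend(config: DataPlaneConfig, path: str) -> str:
    """Data plane operation: route a request using whatever config is cached."""
    routes = config.current()["routes"]
    return routes.get(path, routes["/"])
```

A failed `refresh()` only means the configuration may be stale; requests for already provisioned routes are still answered.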

Difference Between Control and Data Planes

Control planes are usually more complex than data planes, so failures are more common in control planes. Based on the principles described in the section above, you can generally tell whether you are using the control plane or the data plane from the service and the API action. Following are some examples:

Control Plane Actions:

  • Create resource
    • Launch an EC2 instance
    • Create an S3 bucket
    • Create a Lambda function
  • Read resource attributes (describe a resource)
  • Update resource attributes
    • Update network configuration for an ALB
  • Delete resource

Data Plane Actions:

  • Interacting with resources
    • Running EC2 instance
    • SSH to EC2
    • Reading/writing to EBS volume
    • Putting objects in S3
    • Answering DNS queries
    • HTTP/HTTPS/TCP to load balancer
    • Lambda invocations
    • Getting item from DynamoDB table
    • Putting item into DynamoDB table
    • Performing health checks
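The lists above can be turned into a small lookup, sketched here under the assumption that each call is identified by a (service, action) pair. The table is illustrative and hand-maintained, not an official AWS classification. It is deliberately a lookup rather than a verb-based rule, because the same verb can land on either plane (`PutObject` writes data, `PutBucketPolicy` changes configuration).

```python
# Illustrative, hand-maintained table; extend it from your own API inventory.
PLANE_BY_ACTION = {
    ("ec2", "RunInstances"): "control",
    ("s3", "CreateBucket"): "control",
    ("s3", "PutBucketPolicy"): "control",
    ("s3", "PutObject"): "data",
    ("lambda", "CreateFunction"): "control",
    ("lambda", "Invoke"): "data",
    ("dynamodb", "CreateTable"): "control",
    ("dynamodb", "GetItem"): "data",
    ("dynamodb", "PutItem"): "data",
}


def classify(service: str, action: str) -> str:
    """Return "control", "data", or "unknown" for calls not yet reviewed."""
    return PLANE_BY_ACTION.get((service.lower(), action), "unknown")
```

Returning "unknown" instead of guessing keeps unreviewed calls visible, so they can be checked against the service documentation.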

Application Dependencies

An application’s degree of dependency on the control plane varies by use case. Each workload must be assessed individually by thoroughly examining the AWS API calls it issues and answering the following questions:

  • What is the ratio of control plane to data plane API calls?
    • Creating resources? (CP)
      • EC2 autoscaling
      • EC2 reconfiguring
      • RDS, Fargate, SageMaker, etc. depend on the EC2 control plane
    • Working with existing resources, reading/writing data only? (DP)
  • How does this ratio contribute to business continuity risk factors?
  • How can these risks be mitigated?
  • How must failover runbooks change to react accordingly?
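The first question can be answered mechanically once the workload's calls have been labelled. The sketch below assumes the input is a list of (API name, plane) pairs that you have already classified by hand or from logs. The function name and the shape of the report are assumptions made for this example.

```python
from collections import Counter
from typing import Iterable, Tuple


def control_plane_exposure(calls: Iterable[Tuple[str, str]]) -> dict:
    """Summarise (api_name, plane) pairs, where plane is "control" or "data"."""
    calls = list(calls)
    per_plane = Counter(plane for _, plane in calls)
    total = per_plane["control"] + per_plane["data"]
    return {
        "control": per_plane["control"],
        "data": per_plane["data"],
        # Share of all calls that need a healthy control plane.
        "control_share": per_plane["control"] / total if total else 0.0,
        # Distinct control plane APIs: candidates for mitigation and canaries.
        "control_apis": sorted({api for api, plane in calls if plane == "control"}),
    }
```

The `control_apis` list is usually more useful than the ratio itself, since a single control plane call in a failover runbook can matter more than thousands of routine data plane calls.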

Control Plane Risk Mitigations

The risk mitigation strategy depends on the application’s exposure to control plane impairments. Here are some examples:

  1. Adjust resource capacity (VPC, subnets, EC2, load balancers, Lambda, RDS, Fargate, etc.) preemptively. Pre-allocate enough to absorb predicted load spikes, and scale down in line with your load cycle.
  2. Plan to execute manual failover from a healthy Region, e.g., RDS read replica promotion or enabling Lambda triggers.
  3. Make your applications Region-agnostic and idempotent. That alone may help eliminate regional control plane dependencies.
  4. Eliminate or reduce dependency on the control plane.
  5. Avoid depending on Route 53 routing policy updates for failover. Instead, leverage endpoint health checks and/or other CloudWatch metrics to accomplish the same.
  6. Perform AZ evacuation when a single Availability Zone is impaired and is degrading availability or latency. This works well with services that offer Availability Zone Independence (AZI), such as Amazon EC2 and EBS, because parts of those services have control planes that are also zonally independent.
    1. Data Plane: Mitigate impact by preventing work from being routed to, or done in, the impacted Availability Zone.
    2. Control Plane: Update the configuration of resources with control plane actions, both to prevent capacity from being provisioned in the impacted Availability Zone and to stop inter-Availability Zone communication with that Availability Zone.
    3. Use the Route 53 Application Recovery Controller (ARC) routing control API through a regional endpoint of a cluster.
  7. Periodically test and validate your risk mitigation plans.
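The pre-allocation in item 1 and the AZ evacuation in item 6 meet in a simple piece of arithmetic: if capacity must not be created during an impairment, each Availability Zone has to carry enough spare instances up front. The sketch below assumes load is spread evenly across zones and that capacity is counted in whole instances. The function name and parameters are hypothetical.

```python
import math


def instances_per_az(peak_instances: int, az_count: int, az_failures_tolerated: int = 1) -> int:
    """Instances to pre-provision in each AZ so the surviving AZs carry peak load."""
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("at least one Availability Zone must survive")
    # Peak load is shared by the surviving zones only, so round up per zone.
    return math.ceil(peak_instances / surviving)
```

For example, a workload that needs 12 instances at peak across 3 zones would run 6 per zone (18 in total), so losing one zone requires no control plane call to scale up.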

Canary Probes to Monitor Control Plane Hiccups

Once the application’s dependency on the control plane is established, with a detailed list of the APIs used in its business as usual (BAU) activities, those API calls should be incorporated into the application health checks. For example, a static HTML status page can be replaced with a dynamic page that performs all the required API calls on test/dummy resources (to avoid data corruption) and only then returns an OK (200) status.
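A dynamic health check of this kind could be structured as below. This is a sketch under stated assumptions: each probe is a zero-argument callable that raises on failure, and the handler returns an HTTP status plus a per-probe report. The 503 status for a failed probe is a choice made for this example, not something the approach prescribes.

```python
import time
from typing import Callable, Dict, Tuple


def deep_health_check(probes: Dict[str, Callable[[], None]]) -> Tuple[int, dict]:
    """Run every probe against dummy resources; 200 only if all of them succeed."""
    report = {}
    for name, probe in probes.items():
        started = time.monotonic()
        try:
            probe()  # expected to raise if the API call fails or times out
            report[name] = {"ok": True}
        except Exception as exc:
            report[name] = {"ok": False, "error": repr(exc)}
        report[name]["seconds"] = round(time.monotonic() - started, 3)
    status = 200 if all(entry["ok"] for entry in report.values()) else 503
    return status, report
```

The report can be serialized into the response body so that whoever reads the health page sees which control plane call failed and how long it took.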

Canary Probe Example-1: S3

A canary probe (e.g., a Lambda function) for S3 control plane actions may consist of the following S3 API operations, performed on a dummy resource within a specific Region.

1. CreateBucket:

PUT /v20180820/bucket/<name> HTTP/1.1

Host: Bucket.s3-control.amazonaws.com

…LocationConstraint…

2. PutBucketPolicy:

PUT /v20180820/bucket/<name>/policy HTTP/1.1

Host: Bucket.s3-control.amazonaws.com

x-amz-account-id: AccountId

3. PutBucketTagging:

PUT /v20180820/bucket/<name>/tagging HTTP/1.1

Host: Bucket.s3-control.amazonaws.com

x-amz-account-id: AccountId

. . .
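The same sequence can be sketched in Python. Note the assumptions: the request lines above target the S3 Control endpoint, whereas this sketch uses the standard S3 API through a boto3-style client passed in as `s3` (for example `boto3.client("s3", region_name=region)`). The function name, the bucket name prefix, the tag, the policy statement, and the final `DeleteBucket` cleanup step are all choices made for this example.

```python
import json
import uuid


def s3_canary(s3, region: str, prefix: str = "cp-canary") -> list:
    """Exercise S3 control plane calls on a throwaway bucket; return completed steps."""
    bucket = f"{prefix}-{uuid.uuid4().hex[:12]}"  # random suffix avoids name clashes
    create_args = {"Bucket": bucket}
    if region != "us-east-1":
        # us-east-1 is the default location and must not be passed as a constraint.
        create_args["CreateBucketConfiguration"] = {"LocationConstraint": region}
    # Harmless example policy (assumes the standard "aws" partition): deny non-TLS access.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }
    done = []
    s3.create_bucket(**create_args)
    done.append("CreateBucket")
    try:
        s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
        done.append("PutBucketPolicy")
        s3.put_bucket_tagging(
            Bucket=bucket,
            Tagging={"TagSet": [{"Key": "purpose", "Value": "canary"}]},
        )
        done.append("PutBucketTagging")
    finally:
        # Always remove the dummy bucket, even if an intermediate step failed.
        s3.delete_bucket(Bucket=bucket)
        done.append("DeleteBucket")
    return done
```

Any exception raised by a step propagates to the caller after cleanup, so the function can be plugged into a health check as a probe that fails when the control plane does.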

Canary Probe Example-2: API Gateway

A canary probe for API Gateway actions may consist of the following control plane API operations, performed on a dummy resource within a specific Region.

1. CreateRestApi:

Creates a new RestApi resource.

POST /restapis HTTP/1.1

Content-type: application/json

2. CreateDeployment:

POST /restapis/restapi_id/deployments HTTP/1.1

Content-type: application/json

3. DeleteDeployment:

DELETE /restapis/restapi_id/deployments/deployment_id HTTP/1.1

4. DeleteRestApi:

DELETE /restapis/restapi_id HTTP/1.1
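The four operations above could be chained as follows. This is a sketch under assumptions, not a reference implementation: `apigw` is a boto3-style client (for example `boto3.client("apigateway")`) passed in by the caller, and the function name and API name are made up for the example. It also adds a mock `GET` method on the root resource before deploying, on the assumption that API Gateway rejects a deployment of an API that has no methods.

```python
def apigw_canary(apigw, name: str = "cp-canary") -> list:
    """Create, deploy, and tear down a throwaway REST API; return completed steps."""
    done = []
    api_id = apigw.create_rest_api(name=name)["id"]
    done.append("CreateRestApi")
    try:
        # Find the root resource ("/") that every new REST API starts with.
        resources = apigw.get_resources(restApiId=api_id)["items"]
        root_id = next(r["id"] for r in resources if r["path"] == "/")
        # Assumed prerequisite for a deployment: one method with an integration.
        apigw.put_method(restApiId=api_id, resourceId=root_id,
                         httpMethod="GET", authorizationType="NONE")
        apigw.put_integration(restApiId=api_id, resourceId=root_id,
                              httpMethod="GET", type="MOCK")
        # No stage is attached, so the deployment can be deleted directly.
        deployment_id = apigw.create_deployment(restApiId=api_id)["id"]
        done.append("CreateDeployment")
        apigw.delete_deployment(restApiId=api_id, deploymentId=deployment_id)
        done.append("DeleteDeployment")
    finally:
        # Always remove the dummy API, even if an intermediate step failed.
        apigw.delete_rest_api(restApiId=api_id)
        done.append("DeleteRestApi")
    return done
```

API Gateway management calls are throttled per account, so a probe like this is better run at a low frequency than on every health check request.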


Image Credit: AWS Architecture Blog: Doing Constant Work to Avoid Failures
