
BYOC Control Plane/Data Plane Architectures

Sometimes you go to the data — sometimes the data comes to you


Fred Diego

Architect | Code Mechanic

6 min read
[Diagram: BYOC with control plane and data plane architectures]

Introduction

Architectures that split up the control plane and data plane are becoming increasingly common with the growing popularity of Bring Your Own Cloud (BYOC). This architecture, in which the vendor’s control plane can connect to and manage a data plane deployed on customers’ cloud infrastructure, has quite a few benefits. A data plane is often simpler to deploy than an entire stack, a vendor-managed control plane allows for seamless use of a product regardless of where the data plane lives, and the customer can have strong data sovereignty guarantees and potentially even increased resilience. The customer hands over the management of the data plane to the vendor, which frees up human resources as well. In many cases, this is the best of both worlds.

Let’s take a look at “ACME Inc.,” a hypothetical vendor using a demo ClickHouse application to offer a solution that feels like SaaS but whose data plane is deployed into their customer’s account. We will discuss architecture, Nuon app configuration, permissions, and extended vendor access via role delegation. All of the code and configs referenced in this article are available on GitHub and, as a matter of fact, you can become an ACME “customer” yourself, here.

The Challenge

Imagine, if you will, you are an engineer in the ClickHouse division at ACME, and you’ve just been tasked with developing a BYOC solution for your ClickHouse cluster product. Times are changing, and some of your clients working at large enterprises in highly regulated industries are starting to explore customer-hosted options.

Today, ACME offers control and data planes in their cloud infrastructure that allow customers to create and connect to multi- and single-tenant ClickHouse clusters. The planned future state is a hybrid model where ACME continues to provide the control plane for cluster orchestration. Customers will host the ClickHouse data plane (the clusters) on their own cloud infrastructure using best practices from the existing SaaS environment.

Architecture

You and your team arrive at the following architecture: a fully customer-hosted data plane with any number of ClickHouse clusters, which can be configured to have private, public, or Tailscale ingresses, allowing for a variety of access patterns for customers of all shapes and sizes. Your control plane remains as is and, with minor modifications, it can connect to clusters at arbitrary endpoints.

Additional Requirements

You’ve got the architecture down, but naturally the product team has requirements. Customers must be able to log into ACME, create a new organization, provision an EKS cluster, and then create ClickHouse clusters for that organization. Customers must have the option to enable Tailscale, which lets them bring their own tailnet for added security. Finally, some customers have asked that ACME retain limited access to their cloud resources for debugging, while others have opted to deny any additional access.

Your goal is now to enable customers to provision an EKS cluster and its related infrastructure on their own AWS account and then apply a set of CRDs and Helm charts on the cluster.

How can we go from a simple form to this end-to-end BYOC experience without overwhelming the user?

Solving with Nuon

Fortunately, you are no stranger to Terraform and Helm. Now you just need to sort out the best way to configure, package, and distribute all of this IaC. The infrastructure at play includes a VPC network with the desired topology and subnets, plus an AWS EKS cluster with Karpenter so each ClickHouse cluster can run on a dedicated node pool and avoid those pesky noisy neighbors. On top of that come a couple of customer-provided secrets for Tailscale (if enabled), a few Kubernetes operators, and finally the ACME-CH data plane agent, which runs alongside the clusters, maintains order, and ensures the ClickHouse clusters on the customer’s infrastructure stay up to date with the configurations from the control plane. If the customer opts to enable it, you’ll also want an IAM role ACME can assume in the unlikely scenario that you need to look at EKS cluster logs directly.

If this sounds more than a bit daunting, that’s because it is. IaC is a powerful tool, but packaging and distributing it, ensuring it executes correctly, and monitoring the results is a bit beyond a simple `helm install` or `terraform apply`. Let’s take a look at how this type of application architecture can be implemented with Nuon.

App Config

The code for the following app config can be found here.

First, some core infrastructure. Nuon’s default VPC CloudFormation stack is good enough for your purposes and the aws-eks-karpenter-sandbox is perfect for this use case. You create a new app and configure the sandbox.toml and stack.toml files accordingly.

We’ll need to define a few inputs to allow for customizing each install, which can be done via the control plane by either ACME or the customer. This app has been configured with inputs to make Tailscale and role delegation optional.

If a customer decides to enable Tailscale, they will need to provide some secrets. Other secrets are required, but these can be auto-generated without customer input. User-provided secrets become inputs in the CloudFormation stack template (more on that in a moment).
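As a rough illustration of the auto-generated secrets, here is a minimal Python sketch (the secret name and length are hypothetical, not ACME’s actual values) that generates a random value and stores it in AWS Secrets Manager with boto3:

```python
import secrets

import boto3


def ensure_generated_secret(name: str, region: str = "us-east-1") -> str:
    """Create a random secret in AWS Secrets Manager if it does not already exist."""
    client = boto3.client("secretsmanager", region_name=region)
    try:
        # Reuse the existing value on subsequent runs so the install stays stable.
        return client.get_secret_value(SecretId=name)["SecretString"]
    except client.exceptions.ResourceNotFoundException:
        value = secrets.token_urlsafe(32)  # 32 bytes of entropy, URL-safe
        client.create_secret(Name=name, SecretString=value)
        return value


# Example: an auto-generated credential for the data plane agent (name is illustrative).
ensure_generated_secret("acme-ch/agent-credentials")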

Once we’ve set up a VPC, an EKS cluster with Karpenter, user-provided inputs, and some secrets in AWS Secrets Manager, we can think about components.

Next, we define some images: images for the Tailscale and ClickHouse operators, as well as images for the ClickHouse keeper and server (what fun is an unreplicated single-node deployment?). These images are pulled into an ECR repository for the data plane installation on the customer’s account, where they can be used by downstream components. This also allows customers to scan the images and lock down the cluster so that only images from this ECR repo are used in pods.
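Nuon handles this image sync for you, but conceptually the mirroring step looks roughly like the sketch below. The docker-py/boto3 plumbing, repository names, and tags here are illustrative, not Nuon’s implementation:

```python
import base64

import boto3
import docker


def mirror_image_to_ecr(source_image: str, repo_name: str, tag: str, region: str) -> str:
    """Pull an upstream image and push it into the install's ECR repository."""
    ecr = boto3.client("ecr", region_name=region)
    auth = ecr.get_authorization_token()["authorizationData"][0]
    username, password = base64.b64decode(auth["authorizationToken"]).decode().split(":")
    registry = auth["proxyEndpoint"].removeprefix("https://")

    client = docker.from_env()
    client.login(username=username, password=password, registry=registry)

    image = client.images.pull(source_image)
    image.tag(f"{registry}/{repo_name}", tag)
    client.images.push(f"{registry}/{repo_name}", tag=tag)
    return f"{registry}/{repo_name}:{tag}"


# Illustrative example: mirror the ClickHouse server image for downstream components.
mirror_image_to_ecr("clickhouse/clickhouse-server:24.8", "acme-ch/clickhouse-server", "24.8", "us-east-1")
```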

After the images and a couple of core components — such as the default Karpenter node pools — are in place, we use the `kubernetes_manifest` component (docs) to apply the CRDs for Tailscale. Tailscale is optional, so if it’s disabled, we set the operator replicas to zero. We use a `helm` component (docs) to install the ClickHouse operator CRDs, ensuring we reference the images we defined precisely for this purpose. Once the CRDs are applied, we can deploy the Tailscale proxy, which can also be scaled down if disabled. With this, all of the core cluster components are in place. Now we can deploy the ACME-CH data plane agent — a simple script that reconciles resources on the cluster with resources that are defined in the control plane and retrieved via the ACME-CH control plane API.

The final component we define is a Terraform module that accepts a `vendor_role_arn` and, if provided, creates a role in the customer account, grants that role access to the EKS cluster and its CloudWatch logs, and allows the vendor role to assume it. The component lives alongside the rest of the configuration for transparency, and it only creates its resources if the customer has opted in and we pass a non-empty `vendor_role_arn`. The role delegation component is a great way to provide an escape hatch that allows direct access to specific resources using the AWS API. This type of cross-account permission can be useful for debugging and monitoring, and is a common pattern in BYOC deployments. In short, Nuon offers a flexible way to manage vendor access to an application running in a customer’s cloud.
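The Terraform module itself is in the repo, but the effect is easy to picture: create a role whose trust policy only allows the vendor’s role to assume it, then scope its permissions down. A rough boto3 equivalent (the role name, policy name, and permission set are hypothetical) looks like this:

```python
import json

import boto3


def create_vendor_debug_role(vendor_role_arn: str, eks_cluster_arn: str) -> str:
    """Create a role the vendor can assume, limited to EKS metadata and CloudWatch logs."""
    iam = boto3.client("iam")

    # Trust policy: only the vendor's role may assume this role.
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": vendor_role_arn},
            "Action": "sts:AssumeRole",
        }],
    }
    role = iam.create_role(
        RoleName="acme-ch-vendor-debug",  # hypothetical name
        AssumeRolePolicyDocument=json.dumps(trust_policy),
    )

    # Scope permissions to the EKS cluster and its CloudWatch logs (illustrative subset).
    permissions = {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": ["eks:DescribeCluster"], "Resource": eks_cluster_arn},
            {"Effect": "Allow", "Action": ["logs:GetLogEvents", "logs:DescribeLogStreams"], "Resource": "*"},
        ],
    }
    iam.put_role_policy(
        RoleName="acme-ch-vendor-debug",
        PolicyName="vendor-debug-access",
        PolicyDocument=json.dumps(permissions),
    )
    return role["Role"]["Arn"]
```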

Control Plane Nuon Integration

With the app defined and synced to the ACME Nuon organization, we are now ready and able to create installs, or individual instances of this app. When a customer creates an organization and provides inputs, the control plane uses those to create an install with the Nuon Python SDK. At this point, the Nuon control plane starts a provision workflow, which can be fetched by the ACME-CH control plane and displayed to the customer.
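The exact SDK calls live in the ACME-CH control plane code; the sketch below only shows the shape of that flow. The client object and method names are hypothetical placeholders, not the real Nuon Python SDK API:

```python
# Hypothetical sketch: "nuon_client" stands in for an authenticated Nuon Python SDK
# client, and the method/attribute names below are placeholders, not the real API.
def create_customer_install(nuon_client, org_name: str, customer_inputs: dict) -> str:
    """Create an install for a new ACME-CH organization and return its ID."""
    install = nuon_client.create_install(        # placeholder method name
        app_id="acme-clickhouse-byoc",           # the app synced to the ACME Nuon org
        name=f"{org_name}-data-plane",
        inputs=customer_inputs,                  # e.g. {"tailscale_enabled": "true"}
    )
    # The returned install ID is what the ACME-CH control plane uses to poll the
    # provision workflow and surface progress to the customer.
    return install["id"]
```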

In the Beginning, There Was a CloudFormation Stack

Once the customer has created an org, they’ll be prompted to run the stack on AWS using a Quick-Create link. We also provide a CloudFormation stack preview link so the customer can review the infrastructure before applying the stack.
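Quick-Create links are ordinary CloudFormation console URLs with the template URL, stack name, and parameters pre-filled. Here is a hedged sketch of assembling one; the bucket, template, and parameter names are illustrative:

```python
from urllib.parse import urlencode


def quick_create_link(template_url: str, stack_name: str, region: str, params: dict) -> str:
    """Build a CloudFormation Quick-Create URL that pre-fills the stack creation form."""
    query = {
        "templateURL": template_url,
        "stackName": stack_name,
        # Stack parameters are passed as param_<ParameterName>.
        **{f"param_{key}": value for key, value in params.items()},
    }
    return (
        f"https://console.aws.amazon.com/cloudformation/home?region={region}"
        f"#/stacks/create/review?{urlencode(query)}"
    )


# Illustrative values only; the real template comes from the ACME/Nuon install flow.
print(quick_create_link(
    "https://example-bucket.s3.amazonaws.com/acme-ch-runner.yaml",
    "acme-ch-data-plane",
    "us-east-1",
    {"InstallID": "inst-123"},
))
```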

The CloudFormation stack creates the VPC and an Auto Scaling group for the Nuon runner from the stacks specified in the stack.toml file. The runner is responsible for the sandbox, images, components, and actions.

Let There Be a Sandbox

Once the stack is created and the runner is, well, running, the provision workflow proceeds to deploy the sandbox. In the case of this application, we use the Nuon `aws-eks-karpenter-sandbox`, which means the sandbox will create an EKS cluster with Karpenter installed in addition to an ECR repository for the installation.

Components of Every Kind

Once the sandbox is deployed, the components can be deployed to the cluster. Nuon maintains a dependency graph so that components deploy in an orderly fashion. Some of the components in this app require secrets to be munged, so we have defined a Nuon action that writes secrets in the place and shape the operators expect. For this Nuon app, components include CRDs and operators for Tailscale and AWS ALBs as Kubernetes manifests, an ACM certificate Terraform module, a Helm chart for the ACME data plane agent, and a Terraform module for role delegation.
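As an example of the kind of secret munging such an action might do, the sketch below copies a value out of AWS Secrets Manager into a Kubernetes Secret shaped the way an operator expects. The secret IDs, namespace, and key layout are illustrative:

```python
import boto3
from kubernetes import client, config


def sync_operator_secret(source_secret_id: str, namespace: str, secret_name: str) -> None:
    """Copy a Secrets Manager value into a Kubernetes Secret for an operator to consume."""
    sm = boto3.client("secretsmanager")
    value = sm.get_secret_value(SecretId=source_secret_id)["SecretString"]

    config.load_incluster_config()  # the action runs on the EKS cluster itself
    core = client.CoreV1Api()
    body = client.V1Secret(
        metadata=client.V1ObjectMeta(name=secret_name, namespace=namespace),
        string_data={"client-secret": value},  # key name the operator expects (illustrative)
    )
    core.create_namespaced_secret(namespace=namespace, body=body)


# Example: hand a Tailscale OAuth secret to the Tailscale operator (names illustrative).
sync_operator_secret("acme-ch/tailscale-oauth", "tailscale", "operator-oauth")
```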

Provisioning ClickHouse Clusters

Once the provision workflow is complete, we are ready to start creating ClickHouse clusters. For the purposes of this app, we support deploying either a single ClickHouse node or a 2-node cluster backed by a 3-node ClickHouse Keeper cluster for replication. Customers decide which kind of ingress they want for each cluster. Clusters with a private ingress are only accessible from within the EKS cluster. Clusters with a public ingress get an ingress backed by an ALB. Clusters with a Tailscale ingress get an ingress of class “tailscale,” which creates a service that is only accessible via the tailnet. Each cluster gets its own EC2NodeClass and NodePool, so each node runs on its own host.
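To make the three ingress options concrete, here is a hedged sketch of how an Ingress manifest could be built per cluster. The names, annotations, and port are illustrative, and the private case simply returns no Ingress at all:

```python
from typing import Optional


def build_ingress(cluster_name: str, ingress_type: str) -> Optional[dict]:
    """Return an Ingress manifest for the requested ingress type, or None for private."""
    if ingress_type == "private":
        return None  # reachable only inside the EKS cluster via its Service

    if ingress_type == "public":
        class_name = "alb"  # AWS Load Balancer Controller provisions an ALB
        annotations = {"alb.ingress.kubernetes.io/scheme": "internet-facing"}
    elif ingress_type == "tailscale":
        class_name = "tailscale"  # exposes the Service only on the customer's tailnet
        annotations = {}
    else:
        raise ValueError(f"unknown ingress type: {ingress_type}")

    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": f"{cluster_name}-ingress", "annotations": annotations},
        "spec": {
            "ingressClassName": class_name,
            "defaultBackend": {
                "service": {"name": cluster_name, "port": {"number": 8123}}  # ClickHouse HTTP port
            },
        },
    }
```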

When a cluster is created, the ACME-CH data plane agent running on the EKS cluster will poll the ACME-CH control plane API, create the relevant resources on the cluster, and report the status of the resources (e.g. Ingress, ClickHouse installation) back to the control plane. This is useful for sending the status of the load balancer to the control plane so the ingress address can be displayed to the user.
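The agent itself can be thought of as a small polling loop. The endpoint paths, payload shapes, and the apply function below are hypothetical, but the overall shape matches the description above:

```python
import time

import requests

CONTROL_PLANE = "https://control.acme.example"  # hypothetical ACME-CH control plane URL


def reconcile_forever(install_id: str, token: str, interval: int = 30) -> None:
    """Poll the control plane for desired clusters, apply them, and report status back."""
    headers = {"Authorization": f"Bearer {token}"}
    while True:
        desired = requests.get(
            f"{CONTROL_PLANE}/installs/{install_id}/clusters", headers=headers, timeout=10
        ).json()

        for cluster in desired:
            status = apply_cluster(cluster)  # create/update the ClickHouse installation, ingress, node pool
            requests.post(
                f"{CONTROL_PLANE}/installs/{install_id}/clusters/{cluster['id']}/status",
                json=status,  # e.g. {"phase": "Ready", "ingress_address": "..."}
                headers=headers,
                timeout=10,
            )

        time.sleep(interval)


def apply_cluster(cluster: dict) -> dict:
    """Placeholder for the Kubernetes reconciliation logic."""
    raise NotImplementedError
```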

Presto!

At this point, the customer has completed the setup of an ACME-CH data plane in their own AWS account. They can now create ClickHouse clusters on their EKS cluster.

If the cluster has an ingress, it can be queried directly from the control plane.

In Closing

As a member of the ACME Inc. ClickHouse division, you have now successfully created a new deployment option called Bring Your Own Cloud (BYOC). ACME leadership is excited because this solution unlocks new revenue, and customers in regulated industries can securely provision ClickHouse clusters and access their ClickHouse data. On the technical front, you have configured your app in Nuon to easily and securely create customer installations in their cloud accounts. Customers benefit from ACME managing the control plane in the ACME cloud while their ClickHouse data stays in their own cloud.
