inference chart. Each model becomes its own Deployment, Service, and Ingress, reachable at its own hostname through a shared Application Load Balancer.
Prerequisites
Poolside distributes the Helm deployment bundle as a.tar.gz archive. Extract it before you start:
Required AWS infrastructure
You provision the following AWS foundation before you deploy the chart. For a turnkey foundation that provisions all of it, apply the Terraform reference architecture in thepoolsideai/reference_architectures repository, or reproduce the same architecture in your own infrastructure-as-code. For the architecture diagram and design decisions, see Reference architecture.
- EKS cluster, Kubernetes 1.29 or later, with an IAM OIDC provider enabled. The OIDC provider is what makes IRSA work, so the model servers can read checkpoints from S3 without static credentials.
- GPU node group with enough GPU memory for the models you deploy. The reference architecture uses
p5e.48xlarge;p5andp5eninstances also fit. The node group runs an EKS-optimized GPU AMI that can schedule containerized GPU workloads, so each node advertises thenvidia.com/gpuresource. Apply thenvidia.com/gpu=true:NoScheduletaint to keep non-GPU workloads off these nodes; the chart’s default tolerations already tolerate it. These instances are usually not available on demand and need reserved capacity. For instance shapes, model packing, and capacity reservations, see Reference architecture. - NVIDIA GPU Operator to expose GPUs to the cluster. Run it in one of two modes, depending on your AMI:
- If the AMI already includes the NVIDIA driver and container toolkit, such as the AL2023 or AL2 NVIDIA accelerated AMIs or the Bottlerocket NVIDIA variant, run the GPU Operator in device-plugin-only mode with the driver and toolkit subcomponents turned off. This is what the reference architecture does.
- If the AMI ships without drivers, run the full GPU Operator so it installs the driver and container toolkit.
- AWS Load Balancer Controller, installed and running in the cluster. It reconciles the per-model
Ingressobjects into an Application Load Balancer. Its admission webhook must be reachable, which requires the controller to have ready pods on schedulable nodes: a controller scaled to zero or stuckPendingrejects everyServiceandIngressthe chart creates, and the install fails. Tag your subnets for load balancer discovery:kubernetes.io/role/elb=1on public subnets for an internet-facing load balancer, orkubernetes.io/role/internal-elb=1on private subnets for an internal one. - Amazon S3 bucket for the model checkpoints. Server-side encryption with a KMS key is recommended. Bucket versioning is unnecessary because the checkpoints are content-addressed.
- Amazon ECR to host the bundled container images.
- AWS Certificate Manager certificate covering the hostnames you assign to the models. The load balancer terminates TLS with this certificate. Decide the per-model hostnames now: you set them as the
ingressHostvalues in Step 6, and the certificate must cover every one of them.
Workstation tools
Install the following tools on the host you use to run the deployment:helm3.12or laterkubectl, configured for your EKS clusterskopeo, to copy the bundled images into Amazon ECRawsCLI, to create AWS resources and upload checkpoints to S3jq, to parse JSON responses from the inference APItar, to extract the deployment bundle and model checkpointscurl, to call the inference APIeksctl(optional), to associate an IAM OIDC provider if your cluster lacks one, or to create the IRSA role and service account with the alternative in Step 4
atlas container image, and each model checkpoint is tens of GB more. On a constrained workstation, run the extraction and the uploads from an EC2 instance in the bucket’s region that has enough disk, which also speeds up the checkpoint upload, and delete each local copy once it is uploaded.
Step 1: Create the namespace
The inference stack runs in a single namespace:Step 2: Upload the container images to Amazon ECR
The bundle ships theatlas inference server image and the public-docs documentation site image as OCI archives under ./containers/. Copy them into Amazon ECR.
Amazon ECR does not create repositories on push, so create the repositories first. Each repository name must match an image name from the bundle:
skopeo to your ECR registry:
<account-id>.dkr.ecr.<aws-region>.amazonaws.com/<image-name>:<image-tag>. The tags are specific to the bundle you received, not fixed values, and the chart’s image.name, image.tag, docs.image.name, and docs.image.tag are preset to match the archives, so you set only image.registry at install time. You do not need to type the tags anywhere, but you can confirm what was pushed. For example, to inspect the atlas image:
You do not need an image pull secret. The GPU node group’s instance role authorizes ECR pulls through the AWS-managed
AmazonEC2ContainerRegistryReadOnly policy, so kubelet pulls the image directly.Step 3: Upload model checkpoints to S3
The model servers download their checkpoints from S3 on pod startup, so the checkpoints must be in place before you deploy the chart. Poolside provides the checkpoint files separately from the deployment bundle. Confirm the local path and the destination prefix with your Poolside contact. Uploading checkpoints is time consuming. Start it now and continue with the remaining steps in parallel. Poolside ships each model checkpoint as a single.tar archive that contains one top-level directory holding the checkpoint files. The model server loads the unpacked files (*.safetensors, *.json, and the tokenizer files) directly from the prefix you point at, so extract each archive’s contents into its own directory, not the .tar and not the nested folder. The --strip-components=1 flag drops the archive’s top-level directory so the files land at the directory root:
./checkpoints/laguna-m, ./checkpoints/laguna-xs, and ./checkpoints/point upload to checkpoints/laguna-m, checkpoints/laguna-xs, and checkpoints/point under the bucket:
s3:// prefix for each model. You reference it in the models.<key>.model paths in Step 6, and the paths must match what you uploaded exactly: a misspelled or missing prefix downloads nothing and the pod never starts.
Checkpoints are typically tens of GB per model. To speed up the transfer, run the upload from an EC2 instance in the same region as the bucket, and tune
aws configure set default.s3.max_concurrent_requests and default.s3.multipart_chunksize.Step 4: Create the IRSA role
The model servers read checkpoints from S3 through an IAM role assumed by theinference service account. This is the recommended path on EKS, and it keeps static AWS credentials out of the cluster.
Save the following permissions policy to a file named inference-pod-policy.json. It grants read-only access to the checkpoint bucket and decrypt access to the bucket’s KMS key, and nothing else:
inference-pod-policy.json
KMSDecryptForS3 statement.
Save the role’s trust policy to a file named inference-pod-trust.json. It allows the cluster’s OIDC provider to assume the role only for the inference service account in the poolside-models namespace:
inference-pod-trust.json
<oidc-provider> with your cluster’s OIDC issuer host and path, such as oidc.eks.<aws-region>.amazonaws.com/id/<oidc-id>. Retrieve it from the cluster, stripping the https:// scheme:
eksctl utils associate-iam-oidc-provider --cluster <cluster-name> --approve.
The trust policy’s
sub condition embeds the namespace: system:serviceaccount:poolside-models:inference. This guide deploys into poolside-models. If you deploy into a different namespace, change the namespace in the sub value to match, or the model servers get AccessDenied when they read from S3.inference-pod-policy.json, create the role with the trust policy from inference-pod-trust.json, and attach the policy to the role:
arn:aws:iam::<account-id>:role/inference-pod-role. In Step 6, you annotate the chart’s service account with this ARN, and the chart creates the annotated inference service account for you.
Alternative: create the role and service account with eksctl
Alternative: create the role and service account with eksctl
eksctl create iamserviceaccount creates the IAM role and the Kubernetes service account together, and builds the trust policy for you, so you do not need inference-pod-trust.json. You still create the permissions policy first, because --attach-policy-arn attaches an existing policy:eksctl already creates the service account, set serviceAccount.create: false and serviceAccount.name: inference in your values file so the chart uses the existing account instead of creating a second one.If you cannot use IRSA, create a Kubernetes secret with static credentials and set
s3.secretName in your values file instead:Step 5: Create the API key secret (recommended for internet-facing)
To require an API key on the model servers, create a secret containing the key inpoolside-models:
authentication.secretName in the next step. For an internet-facing load balancer, Poolside strongly recommends enabling an API key; for an internal load balancer it is optional.
Step 6: Configure the values file
Create aninference_values.yaml file in the bundle root. Set the fields that apply to your environment. The example below deploys three models, two Laguna variants and Point, exposes each through its own ALB ingress, and reads checkpoints through IRSA:
inference_values.yaml
models.<key>.model must match the locations you uploaded in Step 3. The image.name and image.tag fields default to the values that match the bundled archive, so you do not set them.
Each model is exposed at its own hostname through a separate
Ingress named inference-<model-key>. All three ingresses share the poolside load balancer group, so the AWS Load Balancer Controller provisions one Application Load Balancer for all of them. Give every model a unique ingressHost, and point each hostname’s DNS record at the load balancer once it is provisioned.modelType selects the default serving arguments for that class of model. It takes one of three values:
agent: defaults for the agent models, such as Laguna M.agent_small: the agent defaults with the context length and batch size capped, for smaller variants such as Laguna XS or for GPUs with less memory.completion: defaults for completion models, such as Point.
modelType for each model in your bundle. Use the value provided for your checkpoint; the example above reflects the current Laguna and Point models.
GPU count and tensor parallelism
Set gpus to the number of GPUs each model needs. The model server reads the number of GPUs allocated to its pod and shards the model across them, so you do not pass a tensor-parallel-size argument. The number of GPUs must be a value the server supports, such as 1, 2, 4, or 8, and your hardware must meet the model’s minimum GPU memory. For the per-model memory requirements, see Supported configurations, where the example keys laguna-m, laguna-xs, and point correspond to Laguna M.1, Laguna XS.2, and Point. The values above target the reference architecture’s p5e.48xlarge nodes: laguna-m uses 4 GPUs, while laguna-xs and point use 1 each.
Step 7: Install the chart
Install theinference chart into poolside-models:
Step 8: Verify the deployment
Check that the model pods are running:<model-key> is the key you set under models in the values file, such as laguna-m:
ingressHost, pointing it at the load balancer address. Then confirm routing works, where <model-hostname> is the ingressHost you set for that model:
Authorization header.
Step 9: Call the inference API
Each model serves the OpenAI-compatible API at its own hostname. The base URL has the form:model field in the request body is the served model name (modelName), which the server validates against what it loaded. It does not need to be unique across models, because the hostname already selects the backend: in the example values, both Laguna variants use the modelName Laguna but answer at different hostnames.
Send a chat completion request, where <served-model-name> is the model’s modelName:
laguna-m model served as Laguna:
TLS
The load balancer terminates TLS with the ACM certificate you set inalb.ingress.kubernetes.io/certificate-arn, and the listen-ports and ssl-redirect annotations in Step 6 serve every model over HTTPS on port 443 and redirect HTTP to HTTPS. To cover additional hostnames, issue or import a certificate in AWS Certificate Manager that includes them, then update the certificate-arn annotation. Unlike the upstream Kubernetes deployment, you do not create a TLS secret in the cluster or add a tls block to the values file.
Offline documentation (optional)
The bundle also ships the Poolside documentation site, which the sameinference chart can deploy in-cluster so operators have local access to the docs. It is off by default. To enable and expose it through the Application Load Balancer, see Set up offline documentation.
Troubleshooting
- If a model pod is
Pending, confirm the cluster has enough GPUs for thegpusvalue you requested and that the NVIDIA GPU Operator is healthy. Runkubectl describe pod <pod-name> -n poolside-modelsand check the scheduling events. - If pods stay in
Initor restart in a loop, check the init container logs withkubectl logs -n poolside-models <pod-name> -c model-downloader. A stale or misspelled checkpoint path syncs nothing and the pod never starts. AnAccessDeniederror usually means the IRSA role’s policy does not cover the bucket or its KMS key. - If the load balancer never receives an address, confirm the AWS Load Balancer Controller is running and that your subnets carry the
kubernetes.io/role/elborkubernetes.io/role/internal-elbtags. Check the controller logs for theIngressevents. - If requests return a 5xx from the load balancer, confirm the target group is healthy. The ALB health check uses
/healthon each model’s port; a model that is still downloading its checkpoint stays unhealthy until it is ready. - If image pulls fail, confirm the GPU node group’s instance role has the
AmazonEC2ContainerRegistryReadOnlypolicy and that theatlasrepository exists in ECR. - If
helm installfails withno endpoints available for service "aws-load-balancer-webhook-service", the AWS Load Balancer Controller has no ready pods, so its admission webhook rejects theServiceobjects the chart creates. Confirm the controller is running withkubectl get pods -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller, make sure the cluster has schedulable nodes, then runhelm installagain. - If model pods never become ready and the init container logs show no checkpoint files synced, the S3 prefix may hold the checkpoint
.tarinstead of its extracted contents. The downloader only fetches unpacked files such as*.safetensorsand*.json. Extract the archive and re-upload its contents, as in Step 3.