Follow these steps to deploy Poolside model inference on your Amazon EKS cluster and serve models through an OpenAI-compatible API. For an overview of this deployment approach and architecture, see Amazon EKS deployment. This guide deploys the standalone inference chart. Each model becomes its own Deployment, Service, and Ingress, reachable at its own hostname through a shared Application Load Balancer.

Prerequisites

Poolside distributes the Helm deployment bundle as a .tar.gz archive. Extract it before you start:
tar -xzf <bundle-name>.tar.gz
cd <bundle-name>
Confirm that you are working from the root of the extracted bundle. The bundle root contains the following directories:
./scripts/
./containers/
./charts/
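The directory check above can be scripted so that later steps fail early when run from the wrong place. This is a minimal sketch; the helper name check_bundle_root is introduced here for illustration and is not part of the bundle.

```shell
# Return 0 when the given directory (default: current directory) contains
# the three directories expected at the bundle root.
check_bundle_root() (
  root="${1:-.}"
  status=0
  for dir in scripts containers charts; do
    if [ ! -d "$root/$dir" ]; then
      echo "missing directory: $root/$dir" >&2
      status=1
    fi
  done
  return "$status"
)
```

Run it as check_bundle_root . from the extracted bundle; a non-zero exit status means at least one directory is missing.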

Required AWS infrastructure

Provision the following AWS infrastructure before you deploy the chart. For a turnkey foundation that provisions all of it, apply the Terraform reference architecture in the poolsideai/reference_architectures repository, or reproduce the same architecture in your own infrastructure-as-code. For the architecture diagram and design decisions, see Reference architecture.
  • EKS cluster, Kubernetes 1.29 or later, with an IAM OIDC provider enabled. The OIDC provider is what makes IRSA work, so the model servers can read checkpoints from S3 without static credentials.
  • GPU node group with enough GPU memory for the models you deploy. The reference architecture uses p5e.48xlarge; p5 and p5en instances also fit. The node group runs an EKS-optimized GPU AMI that can schedule containerized GPU workloads, so each node advertises the nvidia.com/gpu resource. Apply the nvidia.com/gpu=true:NoSchedule taint to keep non-GPU workloads off these nodes; the chart’s default tolerations already tolerate it. These instances are usually not available on demand and need reserved capacity. For instance shapes, model packing, and capacity reservations, see Reference architecture.
  • NVIDIA GPU Operator to expose GPUs to the cluster. Run it in one of two modes, depending on your AMI:
    • If the AMI already includes the NVIDIA driver and container toolkit, such as the AL2023 or AL2 NVIDIA accelerated AMIs or the Bottlerocket NVIDIA variant, run the GPU Operator in device-plugin-only mode with the driver and toolkit subcomponents turned off. This is what the reference architecture does.
    • If the AMI ships without drivers, run the full GPU Operator so it installs the driver and container toolkit.
  • AWS Load Balancer Controller, installed and running in the cluster. It reconciles the per-model Ingress objects into an Application Load Balancer. Its admission webhook must be reachable, which requires the controller to have ready pods on schedulable nodes: a controller scaled to zero or stuck Pending rejects every Service and Ingress the chart creates, and the install fails. Tag your subnets for load balancer discovery: kubernetes.io/role/elb=1 on public subnets for an internet-facing load balancer, or kubernetes.io/role/internal-elb=1 on private subnets for an internal one.
  • Amazon S3 bucket for the model checkpoints. Server-side encryption with a KMS key is recommended. Bucket versioning is unnecessary because the checkpoints are content-addressed.
  • Amazon ECR to host the bundled container images.
  • AWS Certificate Manager certificate covering the hostnames you assign to the models. The load balancer terminates TLS with this certificate. Decide the per-model hostnames now: you set them as the ingressHost values in Step 6, and the certificate must cover every one of them.
An S3 gateway VPC endpoint keeps checkpoint downloads off your NAT gateways and reduces data transfer cost. If your worker nodes pull from Amazon ECR through private networking, also configure the Amazon ECR interface VPC endpoints required for private image pulls.
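If you tag subnets by hand rather than through Terraform, the load balancer discovery tags described in the AWS Load Balancer Controller item above could be applied with the AWS CLI, roughly as follows. The helper name tag_subnets_for_alb is hypothetical and the subnet IDs are yours to supply; aws ec2 create-tags is the standard CLI call.

```shell
# Tag subnets for load balancer discovery.
# Usage: tag_subnets_for_alb <public|private> <aws-region> <subnet-id>...
tag_subnets_for_alb() (
  kind="$1"; region="$2"; shift 2
  case "$kind" in
    public)  key="kubernetes.io/role/elb" ;;           # internet-facing load balancer
    private) key="kubernetes.io/role/internal-elb" ;;  # internal load balancer
    *) echo "kind must be public or private" >&2; return 2 ;;
  esac
  aws ec2 create-tags --region "$region" --resources "$@" --tags "Key=$key,Value=1"
)
```

If Terraform owns the subnets, set the tags there instead so that the next apply does not remove them.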

Workstation tools

Install the following tools on the host you use to run the deployment:
  • helm 3.12 or later
  • kubectl, configured for your EKS cluster
  • skopeo, to copy the bundled images into Amazon ECR
  • aws CLI, to create AWS resources and upload checkpoints to S3
  • jq, to parse JSON responses from the inference API
  • tar, to extract the deployment bundle and model checkpoints
  • curl, to call the inference API
  • eksctl (optional), to associate an IAM OIDC provider if your cluster lacks one, or to create the IRSA role and service account with the alternative in Step 4
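A quick way to confirm the tools above are installed is to look each one up on PATH. The sketch below uses a hypothetical helper named missing_tools; it only checks presence and does not verify the helm version.

```shell
# Print every tool from the argument list that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Required tools from the list above; eksctl is optional and left out.
REQUIRED_TOOLS="helm kubectl skopeo aws jq tar curl"

# Usage (prints nothing when everything is installed):
#   missing_tools $REQUIRED_TOOLS
```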
Disk space: Stage the deployment from a host with tens of GB free. The extracted bundle is roughly 20 GB, because it carries the atlas container image, and each model checkpoint adds tens of GB more. On a constrained workstation, run the extraction and the uploads from an EC2 instance in the bucket’s region that has enough disk, and delete each local copy once it is uploaded. Working in the bucket’s region also speeds up the checkpoint upload.

Step 1: Create the namespace

The inference stack runs in a single namespace:
kubectl create namespace poolside-models

Step 2: Upload the container images to Amazon ECR

The bundle ships the atlas inference server image and the public-docs documentation site image as OCI archives under ./containers/. Copy them into Amazon ECR. Amazon ECR does not create repositories on push, so create the repositories first. Each repository name must match an image name from the bundle:
for image_name in $(find ./containers -name "*.tar" -type f -exec basename {} .tar \; | sed 's/__.*//' | sort -u); do
  aws ecr describe-repositories --repository-names "$image_name" --region <aws-region> >/dev/null 2>&1 \
    || aws ecr create-repository --repository-name "$image_name" --region <aws-region>
done
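To see what the loop above derives for a single archive, the same basename and sed steps can be applied to one path. The archive file names below are hypothetical; the sketch assumes archives are named with the image name, optionally followed by a double underscore and a suffix, which is what the sed expression implies.

```shell
# Derive the ECR repository name from one archive path, using the same
# basename and sed steps as the loop above.
repo_name_from_archive() {
  basename "$1" .tar | sed 's/__.*//'
}

repo_name_from_archive ./containers/atlas__1.2.3.tar   # prints: atlas
repo_name_from_archive ./containers/public-docs.tar    # prints: public-docs
```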
Authenticate skopeo to your ECR registry:
aws ecr get-login-password --region <aws-region> \
  | skopeo login --username AWS --password-stdin <account-id>.dkr.ecr.<aws-region>.amazonaws.com
Upload the images with the provided script. Pass your ECR registry host as the target:
chmod +x ./scripts/upload_images.sh
./scripts/upload_images.sh <account-id>.dkr.ecr.<aws-region>.amazonaws.com
The script pushes each archive to <account-id>.dkr.ecr.<aws-region>.amazonaws.com/<image-name>:<image-tag>. The tags are specific to the bundle you received, not fixed values, and the chart’s image.name, image.tag, docs.image.name, and docs.image.tag are preset to match the archives, so you set only image.registry at install time. You do not need to type the tags anywhere, but you can confirm what was pushed. For example, to inspect the atlas image:
aws ecr describe-images --repository-name atlas --region <aws-region> \
  --query 'sort_by(imageDetails,&imagePushedAt)[-1].imageTags' --output text
You do not need an image pull secret. The GPU node group’s instance role authorizes ECR pulls through the AWS-managed AmazonEC2ContainerRegistryReadOnly policy, so kubelet pulls the image directly.

Step 3: Upload model checkpoints to S3

The model servers download their checkpoints from S3 on pod startup, so the checkpoints must be in place before you deploy the chart. Poolside provides the checkpoint files separately from the deployment bundle. Confirm the local path and the destination prefix with your Poolside contact.
Uploading checkpoints is time-consuming. Start it now and continue with the remaining steps in parallel.
Poolside ships each model checkpoint as a single .tar archive that contains one top-level directory holding the checkpoint files. The model server loads the unpacked files (*.safetensors, *.json, and the tokenizer files) directly from the prefix you point it at, so upload the extracted files themselves, not the .tar and not the nested folder. Extract each archive into its own directory. The --strip-components=1 flag drops the archive’s top-level directory so the files land at the directory root:
mkdir -p ./checkpoints/laguna-xs
tar -xf laguna_xs_<bundle-suffix>.tar -C ./checkpoints/laguna-xs --strip-components=1
Confirm the files sit at the root of the directory, not under a subfolder:
ls ./checkpoints/laguna-xs
# config.yaml  generation_config.json  model.safetensors  tokenizer/
You choose the S3 layout. Give each model its own prefix, because every model entry in the values file points at one prefix. The command below preserves the local directory structure, so ./checkpoints/laguna-m, ./checkpoints/laguna-xs, and ./checkpoints/point upload to checkpoints/laguna-m, checkpoints/laguna-xs, and checkpoints/point under the bucket:
aws s3 cp ./checkpoints s3://<bucket-name>/checkpoints --recursive --region <aws-region>
Note the full s3:// prefix for each model. You reference it in the models.<key>.model paths in Step 6, and the paths must match what you uploaded exactly: with a misspelled or missing prefix, the downloader fetches nothing and the pod never starts.
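Before moving on, it can help to confirm that each prefix holds unpacked files rather than the archive. The sketch below reads the output of aws s3 ls s3://<bucket-name>/<prefix>/ on standard input; the helper name check_checkpoint_listing is hypothetical, and the rule it applies (at least one .safetensors file, no .tar) is an assumption based on the file list described in this step.

```shell
# Read `aws s3 ls s3://<bucket-name>/<prefix>/` output on stdin and check
# that the prefix holds unpacked checkpoint files rather than a .tar archive.
check_checkpoint_listing() {
  awk '
    $NF ~ /\.tar$/         { tar = 1 }
    $NF ~ /\.safetensors$/ { weights = 1 }
    END {
      if (tar)      { print "found a .tar archive; upload its extracted contents instead"; exit 1 }
      if (!weights) { print "no .safetensors files at this prefix"; exit 1 }
      print "ok"
    }'
}

# Usage:
#   aws s3 ls s3://<bucket-name>/checkpoints/laguna-xs/ | check_checkpoint_listing
```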
Checkpoints are typically tens of GB per model. To speed up the transfer, run the upload from an EC2 instance in the same region as the bucket, and tune aws configure set default.s3.max_concurrent_requests and default.s3.multipart_chunksize.
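As a sketch of the tuning mentioned above, the two settings can be written to the default AWS CLI profile as follows. The helper name tune_s3_transfer and the default values 32 and 64MB are illustrative starting points chosen here, not Poolside recommendations; the AWS CLI's own defaults are 10 concurrent requests and an 8 MB chunk size.

```shell
# Raise the AWS CLI's S3 transfer parallelism and multipart chunk size
# for the default profile.
# Usage: tune_s3_transfer [max-concurrent-requests] [chunk-size]
tune_s3_transfer() {
  aws configure set default.s3.max_concurrent_requests "${1:-32}"
  aws configure set default.s3.multipart_chunksize "${2:-64MB}"
}
```

Higher values use more memory and bandwidth on the uploading host, so adjust them to the instance you run the upload from.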

Step 4: Create the IRSA role

The model servers read checkpoints from S3 through an IAM role assumed by the inference service account. This is the recommended path on EKS, and it keeps static AWS credentials out of the cluster. Save the following permissions policy to a file named inference-pod-policy.json. It grants read-only access to the checkpoint bucket and decrypt access to the bucket’s KMS key, and nothing else:
inference-pod-policy.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "S3ListBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::<bucket-name>"
    },
    {
      "Sid": "S3ReadObjects",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:GetObjectTagging"],
      "Resource": "arn:aws:s3:::<bucket-name>/*"
    },
    {
      "Sid": "S3GetBucketLocation",
      "Effect": "Allow",
      "Action": ["s3:GetBucketLocation"],
      "Resource": "*"
    },
    {
      "Sid": "KMSDecryptForS3",
      "Effect": "Allow",
      "Action": ["kms:Decrypt", "kms:DescribeKey"],
      "Resource": "<s3-kms-key-arn>"
    }
  ]
}
If the checkpoint bucket uses SSE-S3 rather than SSE-KMS, omit the KMSDecryptForS3 statement.
Save the role’s trust policy to a file named inference-pod-trust.json. It allows the cluster’s OIDC provider to assume the role only for the inference service account in the poolside-models namespace:
inference-pod-trust.json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<account-id>:oidc-provider/<oidc-provider>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "<oidc-provider>:aud": "sts.amazonaws.com",
          "<oidc-provider>:sub": "system:serviceaccount:poolside-models:inference"
        }
      }
    }
  ]
}
Replace <oidc-provider> with your cluster’s OIDC issuer host and path, such as oidc.eks.<aws-region>.amazonaws.com/id/<oidc-id>. Retrieve it from the cluster, stripping the https:// scheme:
aws eks describe-cluster --name <cluster-name> --region <aws-region> \
  --query 'cluster.identity.oidc.issuer' --output text \
  | sed 's#^https://##'
If the command returns no issuer, the cluster has no IAM OIDC provider yet. Associate one before you create the role, either through the reference architecture’s Terraform or with eksctl utils associate-iam-oidc-provider --cluster <cluster-name> --approve.
The trust policy’s sub condition embeds the namespace: system:serviceaccount:poolside-models:inference. This guide deploys into poolside-models. If you deploy into a different namespace, change the namespace in the sub value to match, or the model servers get AccessDenied when they read from S3.
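Because the account ID, OIDC provider, and namespace each appear in the trust policy, one way to avoid hand-editing mistakes is to render the file from parameters. The following is a minimal sketch that prints the same policy shown above; the helper name render_trust_policy is hypothetical.

```shell
# Print an IRSA trust policy for one service account.
# Usage: render_trust_policy <account-id> <oidc-provider> <namespace> [service-account]
render_trust_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::$1:oidc-provider/$2"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "$2:aud": "sts.amazonaws.com",
          "$2:sub": "system:serviceaccount:$3:${4:-inference}"
        }
      }
    }
  ]
}
EOF
}

# Usage:
#   render_trust_policy <account-id> <oidc-provider> poolside-models > inference-pod-trust.json
```

The service account name defaults to inference, which matches the name used throughout this guide.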
Create the policy from inference-pod-policy.json, create the role with the trust policy from inference-pod-trust.json, and attach the policy to the role:
aws iam create-policy \
  --policy-name inference-pod-policy \
  --policy-document file://inference-pod-policy.json

aws iam create-role \
  --role-name inference-pod-role \
  --assume-role-policy-document file://inference-pod-trust.json

aws iam attach-role-policy \
  --role-name inference-pod-role \
  --policy-arn arn:aws:iam::<account-id>:policy/inference-pod-policy
The role’s ARN is arn:aws:iam::<account-id>:role/inference-pod-role. In Step 6, you annotate the chart’s service account with this ARN, and the chart creates the annotated inference service account for you.
Alternatively, use eksctl instead of the create-role and attach-role-policy commands above. eksctl create iamserviceaccount creates the IAM role and the Kubernetes service account together, and builds the trust policy for you, so you do not need inference-pod-trust.json. You still create the permissions policy first, because --attach-policy-arn attaches an existing policy:
aws iam create-policy \
  --policy-name inference-pod-policy \
  --policy-document file://inference-pod-policy.json

eksctl create iamserviceaccount \
  --cluster <cluster-name> \
  --namespace poolside-models \
  --name inference \
  --attach-policy-arn arn:aws:iam::<account-id>:policy/inference-pod-policy \
  --approve
Because eksctl already creates the service account, set serviceAccount.create: false and serviceAccount.name: inference in your values file so the chart uses the existing account instead of creating a second one.
If you cannot use IRSA, create a Kubernetes secret with static credentials and set s3.secretName in your values file instead:
kubectl create secret generic aws-credentials \
  --from-literal=AWS_ACCESS_KEY_ID=<access-key-id> \
  --from-literal=AWS_SECRET_ACCESS_KEY=<secret-access-key> \
  -n poolside-models

Step 5: Create the API key secret (optional)

To require an API key on the model servers, create a secret containing the key in poolside-models:
kubectl create secret generic vllm-auth \
  --from-literal=VLLM_API_KEY=<vllm-api-key> \
  -n poolside-models
Reference it through authentication.secretName in the next step. For an internet-facing load balancer, Poolside strongly recommends enabling an API key; for an internal load balancer it is optional.
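The value of <vllm-api-key> is yours to choose. One possible way to produce a random value is sketched below; the helper name generate_api_key and the 64-character hexadecimal format are assumptions made here, not a format Poolside requires.

```shell
# Generate a 64-character hexadecimal key from 32 random bytes.
generate_api_key() {
  head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n'
}

# Usage:
#   kubectl create secret generic vllm-auth \
#     --from-literal=VLLM_API_KEY="$(generate_api_key)" \
#     -n poolside-models
```

Store the generated key somewhere safe before creating the secret, because clients need it for the Authorization header.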

Step 6: Configure the values file

Create an inference_values.yaml file in the bundle root. Set the fields that apply to your environment. The example below deploys three models, two Laguna variants and Point, exposes each through its own ALB ingress, and reads checkpoints through IRSA:
inference_values.yaml
# Name the release resources "inference" so the service account matches the IRSA trust subject
fullnameOverride: inference

image:
  # Registry the atlas image was uploaded to in Step 2
  registry: <account-id>.dkr.ecr.<aws-region>.amazonaws.com

serviceAccount:
  create: true
  annotations:
    # ARN of the IRSA role from Step 4
    eks.amazonaws.com/role-arn: arn:aws:iam::<account-id>:role/inference-pod-role

# Required on EKS. Unlike OpenShift, upstream Kubernetes does not assign a user ID automatically
podSecurityContext:
  runAsNonRoot: true
  runAsUser: 10003
  seccompProfile:
    type: RuntimeDefault

authentication:
  # API key auth. Strongly recommended for an internet-facing load balancer.
  # Uses the Step 5 secret; set to "" to disable (reasonable only for an internal LB).
  secretName: vllm-auth

ingress:
  enabled: true
  className: alb
  annotations:
    # Use "internal" for a VPC-internal load balancer
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP":80},{"HTTPS":443}]'
    alb.ingress.kubernetes.io/ssl-redirect: "443"
    alb.ingress.kubernetes.io/ssl-policy: ELBSecurityPolicy-TLS13-1-2-2021-06
    alb.ingress.kubernetes.io/group.name: poolside
    # ARN of the ACM certificate covering the model hostnames below
    alb.ingress.kubernetes.io/certificate-arn: <acm-certificate-arn>
    alb.ingress.kubernetes.io/healthcheck-path: /health

models:
  laguna-m:
    model: s3://<bucket-name>/checkpoints/laguna-m
    modelName: Laguna
    modelType: agent
    gpus: 4
    ingressHost: <laguna-m-hostname>
  laguna-xs:
    model: s3://<bucket-name>/checkpoints/laguna-xs
    modelName: Laguna
    modelType: agent_small
    gpus: 1
    ingressHost: <laguna-xs-hostname>
  point:
    model: s3://<bucket-name>/checkpoints/point
    modelName: Point
    modelType: completion
    gpus: 1
    ingressHost: <point-hostname>
The checkpoint paths in models.<key>.model must match the locations you uploaded in Step 3. The image.name and image.tag fields default to the values that match the bundled archive, so you do not set them.
The example exposes the models through an internet-facing load balancer. Poolside strongly recommends enabling API-key authentication on any internet-facing endpoint, so the example sets authentication.secretName to the secret from Step 5. The chart does not enforce this: if you set secretName: "", the endpoint is reachable without a key. Only do that for an internal load balancer.
Each model is exposed at its own hostname through a separate Ingress named inference-<model-key>. All three ingresses share the poolside load balancer group, so the AWS Load Balancer Controller provisions one Application Load Balancer for all of them. Give every model a unique ingressHost, and point each hostname’s DNS record at the load balancer once it is provisioned.
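Since every model needs a unique ingressHost, a small check on the values file can catch a copy-paste duplicate before install. This sketch assumes each ingressHost sits on its own line as in the example above; the helper name duplicate_ingress_hosts is hypothetical.

```shell
# Print any ingressHost value that appears more than once in a values file.
# Prints nothing when all hostnames are unique.
duplicate_ingress_hosts() {
  sed -n 's/^[[:space:]]*ingressHost:[[:space:]]*//p' "$1" | sort | uniq -d
}

# Usage:
#   duplicate_ingress_hosts ./inference_values.yaml
```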

Model type

Each model’s modelType selects the default serving arguments for that class of model. It takes one of three values:
  • agent: defaults for the agent models, such as Laguna M.
  • agent_small: the agent defaults with the context length and batch size capped, for smaller variants such as Laguna XS or for GPUs with less memory.
  • completion: defaults for completion models, such as Point.
Poolside specifies the modelType for each model in your bundle. Use the value provided for your checkpoint; the example above reflects the current Laguna and Point models.

GPU count and tensor parallelism

Set gpus to the number of GPUs each model needs. The model server reads the number of GPUs allocated to its pod and shards the model across them, so you do not pass a tensor-parallel-size argument. The number of GPUs must be a value the server supports, such as 1, 2, 4, or 8, and your hardware must meet the model’s minimum GPU memory. For the per-model memory requirements, see Supported configurations, where the example keys laguna-m, laguna-xs, and point correspond to Laguna M.1, Laguna XS.2, and Point. The values above target the reference architecture’s p5e.48xlarge nodes: laguna-m uses 4 GPUs, while laguna-xs and point use 1 each.
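To compare the requested GPUs against what the node group offers, the gpus values can be summed from the values file. This is a rough sketch with a hypothetical helper name, total_gpus, and it assumes each gpus entry sits on its own line as in the example.

```shell
# Sum the gpus values in a values file.
total_gpus() {
  awk '$1 == "gpus:" { sum += $2 } END { print sum + 0 }' "$1"
}

# Usage:
#   total_gpus ./inference_values.yaml
```

For the example values the total is 4 + 1 + 1 = 6. A p5e.48xlarge node has 8 GPUs, so the three example models would fit on one node as long as nothing else claims GPUs there.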

Step 7: Install the chart

Install the inference chart into poolside-models:
helm install inference ./charts/inference \
  --namespace poolside-models \
  -f ./inference_values.yaml

Step 8: Verify the deployment

Check that the model pods are running:
kubectl get pods -n poolside-models
Each model server takes time to become ready on first start because it downloads its checkpoint from S3. Watch a model’s logs to track progress, where <model-key> is the key you set under models in the values file, such as laguna-m:
kubectl logs -f -n poolside-models deploy/inference-<model-key>
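Instead of watching logs by hand, a script can block until each model Deployment reports the Available condition. The sketch below uses kubectl wait; the helper name wait_for_models is hypothetical, and the timeout is something you pick based on how long your checkpoint downloads take.

```shell
# Wait until each model Deployment reports the Available condition.
# Usage: wait_for_models <timeout> <model-key>...
wait_for_models() (
  timeout="$1"; shift
  for key in "$@"; do
    kubectl wait --for=condition=Available "deploy/inference-$key" \
      --namespace poolside-models --timeout="$timeout" || return 1
  done
)

# Usage:
#   wait_for_models 60m laguna-m laguna-xs point
```

The function stops at the first Deployment that does not become available within the timeout and returns a non-zero status.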
Confirm that an ingress was created for each model and that the load balancer has an address:
kubectl get ingress -n poolside-models
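To capture the load balancer address for a DNS record, the hostname can be read from one of the Ingress objects with a JSONPath query. This sketch uses a hypothetical helper named alb_hostname and relies on the inference-<model-key> Ingress naming described in Step 6.

```shell
# Print the load balancer hostname assigned to one model's Ingress.
# Usage: alb_hostname <model-key>
alb_hostname() {
  kubectl get ingress "inference-$1" --namespace poolside-models \
    -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
}
```

Because all model ingresses share one load balancer group, any model key should return the same address. An empty result means the load balancer has not been provisioned yet.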
Create or update a DNS record for each ingressHost, pointing it at the load balancer address. Then confirm routing works, where <model-hostname> is the ingressHost you set for that model:
curl -s https://<model-hostname>/v1/models \
  -H "Authorization: Bearer <vllm-api-key>"
If API key authentication is off, omit the Authorization header.

Step 9: Call the inference API

Each model serves the OpenAI-compatible API at its own hostname. The base URL has the form:
https://<model-hostname>/v1
Requests are routed to a model by hostname, so each hostname serves exactly one model. The model field in the request body is the served model name (modelName), which the server validates against what it loaded. It does not need to be unique across models, because the hostname already selects the backend: in the example values, both Laguna variants use the modelName Laguna but answer at different hostnames. Send a chat completion request, where <served-model-name> is the model’s modelName:
curl https://<model-hostname>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<served-model-name>",
    "messages": [{"role": "user", "content": "Write a function that reverses a string."}]
  }'
For example, to call the laguna-m model served as Laguna:
curl https://laguna-m.example.com/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Laguna",
    "messages": [{"role": "user", "content": "Write a function that reverses a string."}]
  }'
If you enabled API key authentication in Step 5, include the key as a bearer token:
curl https://<model-hostname>/v1/chat/completions \
  -H "Authorization: Bearer <vllm-api-key>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "<served-model-name>",
    "messages": [{"role": "user", "content": "Write a function that reverses a string."}]
  }'
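The responses are JSON, so jq can pull out just the generated text. The sketch below assumes the response follows the OpenAI chat completions shape, with the reply under choices[0].message.content; the helper name extract_reply is hypothetical.

```shell
# Print the assistant's reply from a chat completion response read on stdin.
extract_reply() {
  jq -r '.choices[0].message.content'
}

# Usage: pipe any of the curl commands above into it, adding -s to curl
# to silence the progress meter:
#   curl -s https://<model-hostname>/v1/chat/completions ... | extract_reply
```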

TLS

The load balancer terminates TLS with the ACM certificate you set in alb.ingress.kubernetes.io/certificate-arn, and the listen-ports and ssl-redirect annotations in Step 6 serve every model over HTTPS on port 443 and redirect HTTP to HTTPS. To cover additional hostnames, issue or import a certificate in AWS Certificate Manager that includes them, then update the certificate-arn annotation. Unlike the upstream Kubernetes deployment, you do not create a TLS secret in the cluster or add a tls block to the values file.

Offline documentation (optional)

The bundle also ships the Poolside documentation site, which the same inference chart can deploy in-cluster so operators have local access to the docs. It is off by default. To enable and expose it through the Application Load Balancer, see Set up offline documentation.

Troubleshooting

  • If a model pod is Pending, confirm the cluster has enough GPUs for the gpus value you requested and that the NVIDIA GPU Operator is healthy. Run kubectl describe pod <pod-name> -n poolside-models and check the scheduling events.
  • If pods stay in Init or restart in a loop, check the init container logs with kubectl logs -n poolside-models <pod-name> -c model-downloader. A stale or misspelled checkpoint path syncs nothing and the pod never starts. An AccessDenied error usually means the IRSA role’s policy does not cover the bucket or its KMS key.
  • If the load balancer never receives an address, confirm the AWS Load Balancer Controller is running and that your subnets carry the kubernetes.io/role/elb or kubernetes.io/role/internal-elb tags. Check the controller logs for the Ingress events.
  • If requests return a 5xx from the load balancer, confirm the target group is healthy. The ALB health check uses /health on each model’s port; a model that is still downloading its checkpoint stays unhealthy until it is ready.
  • If image pulls fail, confirm the GPU node group’s instance role has the AmazonEC2ContainerRegistryReadOnly policy and that the atlas repository exists in ECR.
  • If helm install fails with no endpoints available for service "aws-load-balancer-webhook-service", the AWS Load Balancer Controller has no ready pods, so its admission webhook rejects the Service objects the chart creates. Confirm the controller is running with kubectl get pods -n kube-system -l app.kubernetes.io/name=aws-load-balancer-controller, make sure the cluster has schedulable nodes, then run helm install again.
  • If model pods never become ready and the init container logs show no checkpoint files synced, the S3 prefix may hold the checkpoint .tar instead of its extracted contents. The downloader only fetches unpacked files such as *.safetensors and *.json. Extract the archive and re-upload its contents, as in Step 3.
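For the Pending case in the list above, one quick check is how many GPUs each node advertises as allocatable. The following sketch wraps a kubectl custom-columns query in a hypothetical helper named gpu_capacity; the column titles are arbitrary.

```shell
# List each node with the number of GPUs it advertises as allocatable.
gpu_capacity() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
}
```

A node that shows <none> in the GPUS column does not advertise the nvidia.com/gpu resource, which usually points at the GPU Operator or the AMI.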
For questions about hardware requirements, infrastructure configuration, or deployment issues, contact Poolside support.