Introduction

Use this page when you need to inspect or troubleshoot an on-premises Poolside model inference deployment. The commands on this page assume you have shell access to the deployment host and kubectl access to the RKE2 cluster.

Helpful aliases

Set these aliases for the current shell session, or add them to your shell profile.
# Shorthand kubectl, for example, k get pods.
alias k=kubectl

# Switch the current namespace, for example, kcs poolside-models.
alias kcs='kubectl config set-context --current --namespace '

# Common get commands.
alias kgp='kubectl get pods'
alias kgd='kubectl get deployments'

# Fetch container logs.
alias kl='kubectl logs'

# Describe Kubernetes resources.
alias kd='kubectl describe'

Check namespaces

On-premises model inference deployments commonly use these namespaces:
  • poolside-models for model inference workloads
  • poolside-services for supporting infrastructure services, such as S3 object storage and model checkpoint uploads to S3
  • poolside-registry for the embedded OCI registry that serves containers
  • kube-system for RKE2 system components
  • poolside-cert-manager for certificate management
List namespaces:
kubectl get namespaces
Check all pods:
kubectl get pods -A
Pods should normally be Running or Completed. Pending, ContainerCreating, and Init states are expected briefly during startup; investigate pods that stay in those states, and any pod in CrashLoopBackOff or ImagePullBackOff.
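To narrow a long pod list to the pods that need attention, a small filter over the kubectl output can help. The following is a minimal sketch, not part of the Poolside tooling: the function name filter_unhealthy_pods is hypothetical, and it assumes the default column layout of kubectl get pods -A --no-headers (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Print pods whose STATUS column is neither Running nor Completed.
# Expects the output of `kubectl get pods -A --no-headers` on stdin,
# where column 4 is STATUS. (Hypothetical helper, for illustration.)
filter_unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }'
}

# Usage (requires cluster access):
#   kubectl get pods -A --no-headers | filter_unhealthy_pods
```

A pod that just started may appear briefly in this output, so rerun the check before investigating further.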

Check model workloads

The poolside-models namespace contains deployed model inference workloads and model upload jobs.
kcs poolside-models
kubectl get deployments,svc,ingress,pods,jobs
Check model pod logs:
kubectl logs <pod-name> -n poolside-models
If a model pod is still initializing, check the model downloader container logs:
kubectl logs <pod-name> -c model-downloader -n poolside-models
Inspect recent events:
kubectl get events -n poolside-models --sort-by=.lastTimestamp
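When several model pods are initializing at once, the two log commands above can be combined into a loop. This is a sketch under assumptions: the function name downloader_logs_for_init_pods is hypothetical, it relies on the STATUS column of kubectl get pods starting with "Init" while init containers run, and it assumes the container is named model-downloader as in the command above.

```shell
# Show recent model-downloader logs for every pod in poolside-models
# whose STATUS starts with "Init". (Hypothetical helper, for illustration.)
downloader_logs_for_init_pods() {
  # Without -A, the columns are NAME READY STATUS RESTARTS AGE.
  kubectl get pods -n poolside-models --no-headers |
    awk '$3 ~ /^Init/ { print $1 }' |
    while read -r pod; do
      echo "== ${pod} =="
      # Redirect stdin so kubectl cannot consume the pod list.
      kubectl logs "${pod}" -c model-downloader -n poolside-models --tail=20 < /dev/null
    done
}

# Usage (requires cluster access):
#   downloader_logs_for_init_pods
```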

Check supporting services

The poolside-services namespace contains infrastructure services required by model inference, such as S3-compatible object storage.
kcs poolside-services
kubectl get deployments,statefulsets,svc,ingress,pods
Inspect recent events:
kubectl get events -n poolside-services --sort-by=.lastTimestamp
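Event lists can be noisy. One option is to show only Warning events by using a field selector on the event type. The sketch below wraps that in a function; the name warning_events is hypothetical, and the namespace argument can be any namespace mentioned on this page.

```shell
# List only Warning events in a namespace, oldest first.
# (Hypothetical helper, for illustration.)
warning_events() {
  namespace="${1:?usage: warning_events <namespace>}"
  kubectl get events -n "${namespace}" \
    --field-selector type=Warning \
    --sort-by=.lastTimestamp
}

# Usage (requires cluster access):
#   warning_events poolside-services
```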

Check GPU availability

Confirm that the host detects the expected NVIDIA GPU devices:
lspci | grep -i nvidia
Confirm that Kubernetes reports GPUs as allocatable:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
Check GPU Operator pods:
kubectl get pods -n gpu-operator
Check the NVIDIA Container Toolkit DaemonSet:
kubectl get daemonset -n gpu-operator | grep nvidia-container-toolkit
kubectl get pods -n gpu-operator | grep nvidia-container-toolkit
Host-level nvidia-smi is available only when NVIDIA drivers are installed on the host. If nvidia-smi is not available on the host, use a GPU-enabled Kubernetes pod or a GPU Operator validation container to confirm runtime access to the GPUs.
If model pods are stuck in Pending or ContainerCreating, inspect the pod details and GPU Operator events:
kubectl describe pod <pod-name> -n poolside-models
kubectl get events -n gpu-operator --sort-by=.lastTimestamp
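The allocatable-GPU query earlier in this section prints one "<node> <count>" line per node, and the count is empty for nodes that do not advertise nvidia.com/gpu. The following sketch summarizes that output; the function name summarize_gpus is hypothetical, and it assumes the exact line format produced by that jsonpath expression.

```shell
# Summarize "<node> <gpu-count>" lines read from stdin.
# A node without the nvidia.com/gpu resource has an empty count,
# which is treated as zero. (Hypothetical helper, for illustration.)
summarize_gpus() {
  awk '
    { count = $2 + 0; total += count }
    count == 0 { print "no allocatable GPUs: " $1 }
    END { print "total allocatable GPUs: " (total + 0) }
  '
}

# Usage (requires cluster access):
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}' | summarize_gpus
```

If the total is lower than the number of devices that lspci reports, the GPU Operator pods and events are a reasonable next place to look.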

Check certificates

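Poolside on-premises deployments use cert-manager to issue and renew certificates for internal and ingress endpoints. The commands below assume that cert-manager runs in the cert-manager namespace. If your deployment uses poolside-cert-manager, as listed under Check namespaces, substitute that namespace.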
kubectl get certificates,issuers,clusterissuers -A
kubectl get pods -n cert-manager
kubectl get events -n cert-manager --sort-by=.lastTimestamp
If a certificate is not ready, describe it:
kubectl describe certificate <certificate-name> -n <namespace>
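To find the certificates worth describing, you can filter the certificate list on its READY column. This is a minimal sketch: the function name filter_not_ready_certificates is hypothetical, and it assumes the default cert-manager column layout for kubectl get certificates -A --no-headers (NAMESPACE, NAME, READY, SECRET, AGE).

```shell
# Print "<namespace>/<name>" for certificates whose READY column is not True.
# Expects `kubectl get certificates -A --no-headers` output on stdin.
# (Hypothetical helper, for illustration.)
filter_not_ready_certificates() {
  awk '$3 != "True" { print $1 "/" $2 }'
}

# Usage (requires cluster access):
#   kubectl get certificates -A --no-headers | filter_not_ready_certificates
```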

Resolve SSL and x509 errors

The cluster is the source of truth for certificates. Use Kubernetes certificate and secret resources when you need to inspect the current certificate state. Exported certificate files under poolside-install/certs are local copies generated by the deployment process.
The deployment installs CA certificates into the deployment host’s trust chain. When the deployment uses self-signed certificates, clients that connect to Poolside services also need to trust the self-signed CA certificate.
If a client returns an x509 or certificate authority error, import the self-signed CA certificate into that client’s trusted root store. After you import the certificate, restart the application, browser, shell session, or client process so it reloads the trust store.
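To inspect the certificate that the cluster currently holds, one approach is to decode it from its Kubernetes secret and print its subject, issuer, and expiry date. The sketch below makes several assumptions: the function name show_secret_cert is hypothetical, the secret is a standard TLS secret that stores the certificate under the tls.crt key, GNU base64 -d is available for decoding, and openssl is installed on the host.

```shell
# Print subject, issuer, and expiry of the certificate in a TLS secret.
# Assumes the certificate is stored under the tls.crt key.
# (Hypothetical helper, for illustration.)
show_secret_cert() {
  namespace="${1:?usage: show_secret_cert <namespace> <secret>}"
  secret="${2:?usage: show_secret_cert <namespace> <secret>}"
  kubectl get secret "${secret}" -n "${namespace}" \
    -o jsonpath='{.data.tls\.crt}' |
    base64 -d |
    openssl x509 -noout -subject -issuer -enddate
}

# Usage (requires cluster access):
#   show_secret_cert <namespace> <secret-name>
```

If the secret contains a certificate chain, openssl x509 reads only the first certificate in it. Comparing the issuer line with the CA certificate that a client trusts is a quick way to tell whether an x509 error comes from a missing CA import.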

Import on Windows

Double-click the certificate file (.crt or .pem).
Click Install Certificate.
Choose Local Machine.
Select Place all certificates in the following store.
Select Trusted Root Certification Authorities.
Finish the import.

Import on macOS

Double-click the certificate file.
Open the certificate in Keychain Access.
Expand Trust.
Set When using this certificate to Always Trust.
Close the window and enter your password when prompted.
Alternatively, import the certificate into the system keychain:
sudo security add-trusted-cert -d -r trustRoot -k /Library/Keychains/System.keychain <self-signed-ca-file>

Import on Ubuntu and Debian

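# update-ca-certificates only picks up PEM-encoded files that end in .crt.
sudo cp <self-signed-ca-file> /usr/local/share/ca-certificates/<self-signed-ca-name>.crt
sudo update-ca-certificates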

Import on Red Hat Enterprise Linux and Fedora

# /etc/pki/ca-trust/source/anchors is the standard directory for locally trusted CA certificates.
sudo cp <self-signed-ca-file> /etc/pki/ca-trust/source/anchors/
sudo update-ca-trust extract

Distribute the self-signed CA certificate

Use your organization’s normal endpoint management process to distribute the self-signed CA certificate to clients that access Poolside services. Common options include:
  • Group Policy for domain-joined Windows hosts
  • Configuration management tools, such as Ansible or Puppet
  • Mobile device management tools, such as Microsoft Intune
  • Browser enterprise policies
  • Internal file shares or package repositories