Troubleshooting

Common issues, their root causes, and how to fix them. Each entry includes the symptom, underlying cause, and step-by-step resolution.

"Connection refused" on service access

Symptom: Intermittent Connection refused errors when accessing a Kubernetes Service via its ClusterIP.

Cause: The kube-proxy flushes and rebuilds all iptables rules on every sync cycle. During the brief window between flushing the old rules and installing the new ones, connections are refused. This was fixed with hash-based state comparison — kube-proxy now computes an order-independent hash (XOR) of the current state and skips the flush+rebuild if nothing changed.
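The comparison can be sketched like this (an illustrative sketch, not the project's actual kube-proxy code; the rule representation is hypothetical):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// XOR-combine the hash of each rule. XOR is commutative, so the result is
/// independent of iteration order: two rule sets with the same members
/// always produce the same hash. (Caveat: duplicate entries cancel out
/// under XOR, which is fine when rules form a set.)
fn state_hash<T: Hash>(rules: &[T]) -> u64 {
    rules.iter().fold(0u64, |acc, rule| {
        let mut h = DefaultHasher::new();
        rule.hash(&mut h);
        acc ^ h.finish()
    })
}

fn main() {
    let a = ["rule-a", "rule-b", "rule-c"];
    let b = ["rule-c", "rule-a", "rule-b"]; // same rules, different order
    assert_eq!(state_hash(&a), state_hash(&b));

    // A changed rule set produces a different hash, triggering flush+rebuild:
    assert_ne!(state_hash(&a), state_hash(&["rule-a", "rule-d"]));
}
```

When the hash of the desired state equals the hash of the last applied state, kube-proxy skips the flush entirely, so no connection-refused window occurs.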

Fix:

Update kube-proxy to a build that includes the hash-based sync, then rebuild the image and restart the service (e.g. podman compose -f compose.yml up -d --build kube-proxy).

"Watch failed: context canceled"

Symptom: Watch connections drop immediately with context canceled errors. Controllers fail to receive events, and kubectl get -w exits unexpectedly.

Cause: HTTP/2 ALPN was not configured on the TLS server. Go's client-go library requires the server to negotiate HTTP/2 via ALPN. Without it, the client falls back to HTTP/1.1 and watches fail.

Fix:

Configure the TLS server to advertise HTTP/2 ("h2") in its ALPN protocol list, then restart the API server.
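If the server's TLS stack is rustls-based, the key step looks like the following sketch (illustrative only; the builder API varies across rustls versions, and cert_chain / private_key are assumed to be loaded elsewhere):

```rust
// Sketch, assuming a rustls ServerConfig; not the project's actual code.
let mut config = rustls::ServerConfig::builder()
    .with_no_client_auth()
    .with_single_cert(cert_chain, private_key)?;

// Advertise HTTP/2 via ALPN, with HTTP/1.1 as a fallback, so client-go
// can negotiate h2 for long-lived watch connections.
config.alpn_protocols = vec![b"h2".to_vec(), b"http/1.1".to_vec()];
```

Verify the negotiation with curl -vk https://localhost:6443/healthz and look for a line like "ALPN: server accepted h2" in the handshake output.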

LIST resourceVersion

A related issue was LIST operations returning timestamps instead of etcd mod_revisions as the resourceVersion. This caused 1123+ watch failures in conformance testing. Ensure your storage backend returns proper revision numbers.

DNS not working

Symptom: Pods cannot resolve service names. nslookup kubernetes.default fails from inside a pod.

Cause: One of two issues:

  1. CoreDNS was not deployed — the bootstrap script was not run
  2. The br_netfilter kernel module is not loaded, preventing bridged traffic from being processed by iptables

Fix:

    # Run the bootstrap script to deploy CoreDNS
    bash scripts/bootstrap-cluster.sh

    # Verify CoreDNS is running
    kubectl get pods -n kube-system -l k8s-app=kube-dns

    # Load br_netfilter kernel module (Linux only)
    sudo modprobe br_netfilter
    sudo sysctl net.bridge.bridge-nf-call-iptables=1

Pods stuck in Pending

Symptom: Pods remain in Pending phase indefinitely. No events show scheduling attempts.

Cause: Several possible reasons:

  1. No node is in Ready state to accept pods
  2. Node taints with no matching tolerations on the pod
  3. The scheduler is not running, so no scheduling attempts are made

Fix:

    # Check node status
    kubectl get nodes

    # Check for taints on nodes
    kubectl describe node node-1 | grep -A5 Taints

    # Check scheduler is running
    podman compose -f compose.yml ps scheduler
    podman compose -f compose.yml logs scheduler

    # Describe the pod to see scheduling events
    kubectl describe pod <pod-name>

Container restart loops

Symptom: Containers keep restarting. Pod shows high restart count in kubectl get pods.

Cause: Several possibilities:

  1. The application inside the container crashes on startup
  2. A liveness probe is failing, so the kubelet kills and restarts the container
  3. The container is OOM-killed for exceeding its memory limit

Fix:

    # Check current container logs
    kubectl logs <pod-name>

    # Check previous container logs (before restart)
    kubectl logs <pod-name> --previous

    # Check pod events for restart reasons
    kubectl describe pod <pod-name>

    # Check container exit code and reason
    kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].lastState}'

PVC stuck in Pending

Symptom: PersistentVolumeClaim stays in Pending status. Pods that reference it cannot start.

Cause: One of the following:

  1. The StorageClass referenced by the claim does not exist
  2. No PersistentVolume matches the claim's requested size and access mode
  3. The claim specifies no StorageClass and no default StorageClass is set

Fix:

    # Check StorageClasses
    kubectl get storageclasses

    # Check available PersistentVolumes
    kubectl get pv

    # Describe the PVC for events
    kubectl describe pvc <pvc-name>

    # Check if there is a default StorageClass
    kubectl get sc -o jsonpath='{.items[?(@.metadata.annotations.storageclass\.kubernetes\.io/is-default-class=="true")].metadata.name}'

"Port already in use" on startup

Symptom: podman compose -f compose.yml up fails with bind: address already in use on port 6443 or 2379.

Cause: Another process (minikube, kind, a real Kubernetes cluster, or a previous Rūsternetes instance) is already listening on the same port.

Fix:

    # Find what is using port 6443
    lsof -i :6443

    # Find what is using port 2379 (etcd)
    lsof -i :2379

    # Stop a previous Rūsternetes cluster
    podman compose -f compose.yml down

    # Or change the port mapping in compose.yml
    # ports: "7443:6443"

"Cannot connect to API server"

Symptom: kubectl commands return Unable to connect to the server.

Cause: The API server is not running, TLS certificates are missing or invalid, or the kubeconfig is not set.

Fix:

    # Check all services are running
    podman compose -f compose.yml ps

    # Check API server logs
    podman compose -f compose.yml logs api-server

    # Verify TLS certs exist
    ls -la .rusternetes/certs/

    # Verify KUBECONFIG is set
    echo $KUBECONFIG  # Should be: ~/.kube/rusternetes-config

    # Full restart
    podman compose -f compose.yml down
    podman compose -f compose.yml up -d
    bash scripts/bootstrap-cluster.sh

Build is slow

Symptom: podman compose -f compose.yml build takes 10–15 minutes. Test binary compilation takes 5–10 minutes.

Cause: This is expected for the first build. Rust compilation of the full workspace (216,000+ lines) is CPU-intensive. Subsequent builds use layer caching and only recompile changed crates.

Fix:

None needed for the first build; the duration is expected. For faster iteration, rebuild only the service you changed (podman compose -f compose.yml build <service>) so layer caching skips the unchanged crates.

Podman: permission denied

Symptom: kube-proxy fails to start with permission errors related to iptables. Volume mounts fail with EACCES.

Cause: kube-proxy requires CAP_NET_ADMIN capability for iptables manipulation. Rootless Podman does not grant this by default.

Fix:

    # Run with rootful Podman
    sudo podman-compose -f compose.yml up -d

    # Or set Podman Machine to rootful mode
    podman machine set --rootful
    podman machine stop
    podman machine start

Podman Machine fails on macOS

Symptom: Podman Machine fails to start with VZErrorDomain Code=1 or similar virtualization errors on macOS Sequoia 15.7+.

Cause: Compatibility issues between Podman Machine's virtualization framework and newer macOS versions.

Fix:

Update Podman to the latest release, then recreate the machine (podman machine rm, followed by podman machine init and podman machine start). If the error persists, check the Podman issue tracker for reports matching your macOS version.

Debugging Commands

A reference of useful commands for diagnosing issues:

    # Describe a resource for events and status
    kubectl describe pod <name>
    kubectl describe node <name>
    kubectl describe svc <name>

    # View pod logs (current and previous container)
    kubectl logs <pod-name>
    kubectl logs <pod-name> --previous
    kubectl logs <pod-name> -c <container-name>

    # View cluster events sorted by time
    kubectl get events --sort-by='.lastTimestamp'
    kubectl get events -A --sort-by='.lastTimestamp'

    # View compose service logs
    podman compose -f compose.yml logs -f api-server
    podman compose -f compose.yml logs -f scheduler
    podman compose -f compose.yml logs -f controller-manager
    podman compose -f compose.yml logs -f kubelet
    podman compose -f compose.yml logs -f kube-proxy

    # Enable debug logging for a service
    RUST_LOG=debug podman compose -f compose.yml up api-server

    # Check service health
    podman compose -f compose.yml ps
    curl -k https://localhost:6443/healthz

    # Check iptables rules (kube-proxy)
    podman compose -f compose.yml exec kube-proxy iptables -t nat -L KUBE-SERVICES

    # Check etcd health
    podman compose -f compose.yml exec etcd etcdctl endpoint health