When the Pod Is Fine but Nothing Works — Debugging Kubernetes Services and PVCs
Last time I deliberately broke Pods three ways and watched them fail to start
(CrashLoopBackOff and friends).
This time the Pods are all Running and healthy — and things still don't work.
The bug isn't in the Pod; it's in the plumbing around it: the Service → Endpoints → Pod
forwarding chain, and the PVC → StorageClass → PV binding chain.
The skill here is knowing which link broke without guessing.
Part 1: Service connectivity — get endpoints splits the problem in two
When you can't reach a Service, the single most useful first command is not describe pod.
It's:
kubectl get endpoints <svc>
Endpoints is the table that says "which Pod IPs does this Service actually forward to." That one command cleaves every connectivity bug into two halves: can't find the Pods (endpoints empty) vs found them but can't reach them (endpoints populated). Each half points somewhere completely different.
Failure 1: selector doesn't match the Pod labels → endpoints empty
kubectl expose deployment app --port=80 --target-port=80 --name=app-bad-selector
kubectl patch svc app-bad-selector -p '{"spec":{"selector":{"app":"WRONG"}}}'
- Symptom: can't connect; the Pods are all Running.
- Investigation: `kubectl get endpoints app-bad-selector` → empty (`<none>`). Then `kubectl get svc app-bad-selector -o wide` to read the Selector, and compare against `kubectl get pods --show-labels`.
- Cause: the Service selector matches no Pod labels, so the Endpoints controller can't assemble a backend list.
- Fix: correct the selector (or the Pod labels) so they line up.
- Concept: a Service doesn't know about Pods directly; it only knows labels. The "which Pod IPs" lookup lives in the Endpoints / EndpointSlice object. Wrong selector → that table is empty. When the Pods are Running and Ready, empty endpoints almost always means a selector problem (Pods that fail their readiness probe are also left out of the list).
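One way to make the selector-versus-labels comparison mechanical is to turn the Service's selector into a label query and ask which Pods it matches. The sketch below assumes `kubectl -o jsonpath` prints the selector as a JSON map such as `{"app":"WRONG"}` (older kubectl versions may print a different format). `selector_to_label_query` is a hypothetical helper name, not a kubectl feature.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn a JSON selector map like {"app":"web","tier":"frontend"}
# into the comma-separated key=value form that `kubectl get pods -l` expects.
# Assumes a flat map of string values, which is what a Service selector holds.
selector_to_label_query() {
  printf '%s' "$1" | tr -d '{}" ' | tr ':' '='
}

# Intended use against a live cluster (not run here):
#   sel=$(kubectl get svc app-bad-selector -o jsonpath='{.spec.selector}')
#   kubectl get pods -l "$(selector_to_label_query "$sel")" --show-labels
```

If the second command finds no Pods, the selector matches nothing, which is the same fact the empty endpoints list was reporting.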
Failure 2: targetPort points at a port nothing is listening on → connection refused
kubectl expose deployment app --port=80 --target-port=8080 --name=app-bad-port
- Symptom: still can't connect — but it breaks differently than Failure 1.
- Investigation: `kubectl get endpoints app-bad-port` → this time there are IPs (the selector matches, so endpoints list `podIP:8080`). Then curl it and watch how it fails → connection refused, immediately, not a hang.
- Cause: the Service port is fine, but `targetPort` is 8080 while the container actually listens on 80. kube-proxy DNATs the packet to `podIP:8080`, nothing is listening there, and the kernel replies with a TCP RST.
- Fix: set `targetPort` to the port the container really listens on.
- Concept: endpoints populated + connection refused → it's a targetPort / wrong-container-port problem. "Refused" means the packet reached the Pod and got rejected, so it's not a routing problem; it's knocking on the wrong door.
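To see at a glance which port the Service is really forwarding to, you can pull the port numbers out of the ENDPOINTS column. This is a minimal sketch: `endpoint_ports` is a hypothetical helper, and it assumes the column has the usual `ip:port,ip:port` shape (kubectl truncates long lists, so it only sees the entries that are printed).

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the distinct ports in an ENDPOINTS column value
# such as "10.244.0.5:8080,10.244.0.6:8080".
endpoint_ports() {
  # Split on commas, keep what follows the last colon of each entry, dedupe.
  printf '%s\n' "$1" | tr ',' '\n' | sed 's/.*://' | sort -u
}

# Intended use against a live cluster (not run here):
#   eps=$(kubectl get endpoints app-bad-port --no-headers | awk '{print $2}')
#   endpoint_ports "$eps"
```

In this example it would report 8080, while the container listens on 80; that mismatch is the bug.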
Failure 3: NetworkPolicy silently drops the packet → timeout
kubectl get netpol -A
kubectl describe netpol <name> -n <ns>
- Symptom: endpoints are populated, targetPort is correct, and curl times out (hangs with no response) — not "refused".
- Investigation: `kubectl get netpol -A` to see if any policy selects the source or destination Pod; `describe` to read its PodSelector / Ingress / Egress.
- Cause: a NetworkPolicy is dropping this traffic. Dropped packets get no reply (no RST), which is exactly why you see a timeout instead of a refusal.
- Fix: allow the source→destination flow in the relevant policy, or adjust its PodSelector.
Three things about NetworkPolicy that trip people up:
- Whitelist flip. A Pod with no policy selecting it is "allow all." The moment any policy selects it (even one that only lists Ingress), that direction flips to "deny all except what's explicitly allowed." A common gotcha is a `podSelector: {}` default-deny someone added.
- Two directions. Either the destination Pod's Ingress doesn't allow the source, or the source Pod's Egress doesn't allow the destination. Either one blocks the flow, so check policies in both namespaces.
- The CNI has to enforce it. This is the big one for local labs: in the kind version I tested, the default CNI (kindnet) did not enforce NetworkPolicy (newer kind releases may, so check yours). You can `kubectl apply` a policy, see it in `kubectl get netpol`, and traffic still flows, because nothing is enforcing it. To actually reproduce Failure 3 you need a CNI that enforces policy, like Calico or Cilium. I verified this by spinning up a separate kind cluster with `disableDefaultCNI: true` and installing Calico: the same default-deny policy that did nothing under kindnet turned curl into a clean timeout under Calico. Same YAML, different behavior: the difference was never the policy, it was the CNI.
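For anyone reproducing this, the two files involved can be sketched as follows. The file names and the policy name `default-deny-ingress` are illustrative choices, not taken from the original setup, and the policy shown is the generic "select every Pod, allow no ingress" form rather than the exact one used in the experiment.

```shell
#!/usr/bin/env bash
# kind cluster config that skips the default CNI so another one can be installed.
cat > kind-no-default-cni.yaml <<'EOF'
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
networking:
  disableDefaultCNI: true
EOF

# Default-deny policy: the empty podSelector selects every Pod in the namespace,
# and listing Ingress with no allow rules denies all inbound traffic to them.
cat > default-deny-ingress.yaml <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes:
    - Ingress
EOF

# Next steps on a real machine (not run here):
#   kind create cluster --config kind-no-default-cni.yaml
#   (install Calico or Cilium following its own documentation)
#   kubectl apply -f default-deny-ingress.yaml
```

With an enforcing CNI in place, curl against a Pod in that namespace should hang until it times out instead of being refused.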
Service decision tree
Can't reach a Service → ① kubectl get endpoints <svc>
├─ endpoints empty ─────────────► selector doesn't match labels → fix selector
└─ endpoints populated → curl and watch HOW it fails
├─ connection refused (instant) ─► wrong targetPort / nothing listening
└─ timeout (hangs) ─────────────► suspect NetworkPolicy dropping packets
The cheapest signal in the whole tree is refused vs timeout: refused means the packet arrived and was rejected; timeout means it was swallowed on the way.
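The tree can also be written down as a small function, which is handy as a checklist you can run. This is a sketch under assumptions: `diagnose_service` is a hypothetical name, and it relies on curl's documented exit codes (7 for a failed connect, which is what a refusal produces, and 28 for a timeout) when curl is run with a connect timeout from a Pod inside the cluster. Exit 7 can have other causes too, so treat the output as a hint.

```shell
#!/usr/bin/env bash
# Hypothetical helper encoding the decision tree.
#   $1: ENDPOINTS column from `kubectl get endpoints <svc>` ("<none>" or empty if no backends)
#   $2: exit status of a curl run against the Service with --connect-timeout set
diagnose_service() {
  local endpoints=$1 curl_exit=$2
  case "$endpoints" in
    ""|"<none>")
      echo "endpoints empty: compare the Service selector with Pod labels"
      return ;;
  esac
  case "$curl_exit" in
    0)  echo "reachable" ;;
    7)  echo "connection refused: check targetPort and what the container listens on" ;;
    28) echo "timeout: suspect a NetworkPolicy dropping packets" ;;
    *)  echo "curl exit $curl_exit: outside this tree" ;;
  esac
}

# Example calls with literal inputs:
#   diagnose_service "<none>" 7
#   diagnose_service "10.244.0.5:8080" 7
#   diagnose_service "10.244.0.5:80" 28
```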
Part 2: PVC stuck in Pending — read the Events, not the logs
A PVC that never leaves Pending drags its Pod down with it (the Pod stays Pending too,
because it can't be scheduled while its claim is unbound). There are no logs to read here;
the answer is in `kubectl describe pvc`.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: stuck-pvc
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: nonexistent-sc
resources:
requests:
storage: 1Gi
EOF
- Symptom: `kubectl get pvc stuck-pvc` sits at `Pending`, never `Bound`.
- Investigation: `kubectl describe pvc stuck-pvc` (read the Events), then `kubectl get storageclass` to see what actually exists.
- Cause (this example): the requested `storageClassName: nonexistent-sc` doesn't exist, so no provisioner claims the PVC and it binds to nothing.
- Fix: use a StorageClass that exists (`kubectl get sc`; there's usually one marked `(default)`), or omit `storageClassName` to take the default.
The root causes worth memorizing as a checklist:
- StorageClass doesn't exist — Events say the class can't be found (this example). Fix the name.
- Static provisioning, no matching PV — with no dynamic provisioner you must pre-create a PV; if none has capacity ≥ the request with a matching accessMode and storageClassName, the PVC stays Pending. Create a matching PV, or switch to a class that supports dynamic provisioning.
- `volumeBindingMode: WaitForFirstConsumer`: this Pending is normal. The class deliberately waits until a Pod actually uses the PVC before binding, so the volume lands in the right node/zone. Not an error; create a Pod that mounts it and binding happens.
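The checklist can be sketched as a pattern match over the Events text. The strings matched below are assumptions about how these events are typically worded; the wording can differ between Kubernetes versions and provisioners, so read the Events yourself whenever the helper (a hypothetical `classify_pvc_events`) reports something unrecognized.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map Events text from `kubectl describe pvc <pvc>` to the
# three root causes. The matched substrings are assumed wordings, not a stable API.
classify_pvc_events() {
  case "$1" in
    *WaitForFirstConsumer*|*"waiting for first consumer"*)
      echo "normal: binds once a Pod mounts the PVC" ;;
    *storageclass*"not found"*)
      echo "StorageClass does not exist: fix storageClassName" ;;
    *"no persistent volumes available"*)
      echo "no matching PV: create one or use a dynamic class" ;;
    *)
      echo "unrecognized: read the Events by hand" ;;
  esac
}

# Intended use against a live cluster (not run here):
#   classify_pvc_events "$(kubectl describe pvc stuck-pvc)"
```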
PVC Pending and Pod Pending look similar but are diagnosed completely differently.
Pod Pending is a scheduling failure (insufficient resources, taints, nodeSelector) — you
look at kubectl describe pod Events and the scheduler. PVC Pending is a binding
failure — you look at kubectl describe pvc and kubectl get sc. Don't conflate them.
The one table to keep
| Symptom | First command | What to read | Likely cause |
|---|---|---|---|
| Service: endpoints empty | kubectl get endpoints <svc> | <none> | selector ≠ Pod labels |
| Service: refused (instant) | curl + kubectl get endpoints | populated, but RST | wrong targetPort / nothing listening |
| Service: timeout (hangs) | kubectl get netpol -A | matching PodSelector | NetworkPolicy dropping packets |
| PVC stuck Pending | kubectl describe pvc <pvc> | Events + get sc | no/wrong StorageClass, no matching PV, or WaitForFirstConsumer |
The thread running through both halves: when a Pod is healthy but unreachable or unschedulable, stop staring at the Pod. Walk the chain it depends on — Endpoints for networking, StorageClass and PV for storage — and let how it fails (empty vs refused vs timeout vs Pending) tell you which link to fix.
