Before the First Log Line

The cluster was "running." That was the word they used in the handoff. A bare RKE deployment — nodes up, control plane healthy, kube-system pods in some approximation of Running. What it did not have: storage classes, a functioning ingress controller, a trusted certificate authority, or any of the infrastructure prerequisites that make Kubernetes useful rather than merely operational. The vendor choices had already been made. I wasn't in those meetings.

A federal program office was waiting on observability before they could stand up their mission system. The Elasticsearch cluster was the blocker for the blocker. Every day it wasn't running was a day something else wasn't running either.

I spent the first three weeks without once opening the Elasticsearch documentation. That's the pattern: the problem the ticket names is the last one you reach.

The handoff

Enterprise Kubernetes engagements have a consistent pattern: a set of decisions that seemed reasonable from a procurement standpoint, handed to an engineer who now has to make them work. Storage vendor, CSI driver, certificate authority, ingress controller — all selected through a process that ended before implementation began. The diagram looks coherent. The reality is a cluster waiting to be told what it is.

In this case, the storage vendor was Hitachi Vantara. The CSI driver had been selected by someone who would not be the one implementing it. The org ran its own certificate authority — properly structured, with root and intermediates, but not globally trusted. The RKE deployment was clean in the narrow sense: the cluster itself was healthy. It just had no opinions about storage, certificates, or how traffic should reach it.

ECK requires all three before a single pod stays Running. Persistent storage for data nodes. A trusted cert chain for inter-component TLS. A reachable endpoint for Kibana. None of it is in the Elastic documentation, because Elastic's documentation reasonably assumes your cluster is already production-ready. Most clusters in enterprise environments are not.

Storage first, always

The standard orientation was fast: node status, pod health, available namespaces. Then kubectl get storageclass. It returned nothing. That is not an error, just a fact: no StorageClasses had been provisioned. ECK requires persistent storage for Elasticsearch data nodes. Without a StorageClass to satisfy them, the PersistentVolumeClaims never bind, and the pods that depend on them sit in Pending indefinitely. You learn this very quickly.
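For reference, this is roughly where that storage requirement enters an ECK manifest. A minimal sketch, not the manifest from this engagement: the resource name, version, node count, and volume size are placeholders, and the StorageClass name is whatever the cluster actually provides.

```yaml
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: observability                 # hypothetical resource name
spec:
  version: 8.x                        # placeholder; pin a real version
  nodeSets:
    - name: data
      count: 3                        # illustrative
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data  # the claim name ECK expects for the data volume
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: "<storage-class-name>"   # must exist in the cluster
            resources:
              requests:
                storage: 100Gi        # illustrative
```

If storageClassName points at a class that does not exist, or is omitted on a cluster with no default class, nothing provisions the volume and the pod never schedules.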

The Hitachi Vantara CSI driver was the answer the org had purchased. The vendor's delivery mechanism was a single YAML manifest — a file that had presumably worked somewhere, for someone, under conditions that were not documented. The conversation that accompanied it was brief: "Here's what we use. Figure it out." No Helm chart. No values reference. No migration path to the organization's GitOps patterns. No explanation of which parameters were environment-specific and which were defaults.

Converting a raw Kubernetes manifest to a Helm chart is not technically difficult. It is tedious and requires understanding everything the manifest assumes. Every hardcoded value becomes a template variable. Every environment-specific string — namespaces, storage backend addresses, driver names, node selectors — gets surfaced, named, and made configurable. Every piece the vendor had silently tuned for their own infrastructure now has to be understood and documented. We did this over about a week.
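As an illustration of what that conversion looks like for a single resource, here is a sketch of a templated StorageClass. The values keys are hypothetical names chosen for this sketch, not the vendor's, and the parameter set is trimmed to two entries. Helm's required function makes the render fail when an environment-specific value is left empty, instead of silently shipping a blank.

```yaml
# values.yaml (hypothetical keys, illustrative only)
storageClass:
  name: hitachi-block
  provisioner: ""            # driver name exactly as the vendor manifest registers it
  reclaimPolicy: Retain
  parameters:
    pool: ""                 # environment-specific; must be set per cluster
    fsType: ext4

# templates/storageclass.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: {{ .Values.storageClass.name }}
provisioner: {{ required "storageClass.provisioner must be set" .Values.storageClass.provisioner }}
parameters:
  pool: {{ required "storageClass.parameters.pool must be set" .Values.storageClass.parameters.pool | quote }}
  fsType: {{ .Values.storageClass.parameters.fsType }}
reclaimPolicy: {{ .Values.storageClass.reclaimPolicy }}
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
```

Running helm template against the chart with an empty pool value should stop with the message above, which is the point: the hardcoded string has become a named decision.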

The saving grace was a storage administrator, the kind of engineer who has been paged at 2am when a SAN goes down and has the failure modes memorized. He reviewed our parameter assumptions one by one. He knew which values in the manifest were vestigial from an older driver version, which defaults would silently cause performance degradation at real data volumes, and which options were load-bearing in ways that weren't obvious from the manifest itself. One catch he flagged immediately: the vendor manifest defaulted reclaimPolicy to Delete. On a stateful cluster, that means deleting a PVC takes the underlying volume and its data with it. It is the kind of default that surfaces the first time someone removes and recreates a claim during routine maintenance and finds their indices gone. That knowledge doesn't live in documentation. It accumulates through years of incidents.

Without him, we'd have gotten the driver running. With him, we got it running correctly — which is a different thing, and a gap that only becomes visible under production load.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hitachi-block
provisioner: hspc.csi.hitachi.com   # must match the driver name the CSI plugin registers
parameters:
  pool: "<storage-pool-id>"         # environment-specific — not in the manifest
  storageType: block
  fsType: ext4
reclaimPolicy: Retain              # Retain, not Delete — learned the hard way
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer

When the first PersistentVolumeClaim finally bound — green, Running, actually persisting data — it felt disproportionately significant for what was functionally a storage provisioner test. It was significant. Everything downstream depended on that moment, and none of it was the work the ticket had named.
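That bind can be checked in isolation, before ECK is involved at all. A minimal sketch, assuming the StorageClass is named hitachi-block and that some small image with a shell is available from the internal registry; the claim name, pod name, and image reference are placeholders.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: csi-smoke-test                # hypothetical name
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: hitachi-block
  resources:
    requests:
      storage: 1Gi
---
apiVersion: v1
kind: Pod
metadata:
  name: csi-smoke-test                # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: writer
      image: "<internal-registry>/busybox:<tag>"   # placeholder image reference
      # Write a file to the mounted volume, then read it back.
      command: ["sh", "-c", "echo ok > /data/probe && cat /data/probe"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: csi-smoke-test
```

Because the class uses WaitForFirstConsumer, the claim stays Pending until the pod is scheduled; that is expected. The test passes when the claim reports Bound and the pod completes after reading back its own write.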

The certificate problem nobody warns you about

The org ran its own certificate authority. Not self-signed in the casual sense — a proper PKI with a root CA and intermediate issuers, managed through an enterprise process. But also not present in any default trust store: not in the OS, not in Kubernetes, not in any of the container images ECK deploys. Nothing in the cluster had been told to trust it.

This matters because ECK's inter-component communication is TLS everywhere. Elasticsearch nodes verify each other. Fleet Server verifies Elasticsearch. Kibana verifies both. Elastic Agents verify Fleet Server. The system is a mesh of mutual TLS, and every participant needs to trust the CA that signed the certificates it encounters. Getting there means hitting every trust boundary in the cluster, in sequence, and confirming each one before moving to the next.

The failure mode that burns the most time: connection refused and certificate verify failed are often indistinguishable from the outside. A Fleet agent that can't reach its server because the server is misconfigured looks identical, in the logs, to a Fleet agent that can't reach its server because it doesn't trust the cert chain. Both surface as connectivity failures. You learn to treat them as the same problem and check both surfaces simultaneously until evidence eliminates one.

My first hypothesis when Fleet agents couldn't reach their server was a firewall rule. The errors read as network failures — timeouts, refused connections, nothing in the logs pointing explicitly at TLS. I spent the better part of a day requesting a network policy audit and combing through Calico rules. Then a targeted curl --verbose from inside the cluster against the Fleet Server endpoint showed the TLS handshake failing before the connection dropped. The network was fine. The trust chain wasn't. That was the signal that changed the approach: certificate failures present as connectivity failures often enough that cert validation now comes before network investigation.
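That check is easy to keep around as a throwaway pod. A sketch under assumptions: the org CA chain is already stored in a ConfigMap (named org-ca-bundle here, with key ca.crt, both hypothetical), a curl image has been mirrored into the internal registry, and the target URL is replaced with the endpoint under suspicion.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: tls-probe                     # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: curl
      image: "<internal-registry>/curlimages/curl:<tag>"   # placeholder image reference
      command:
        - curl
        - --verbose                   # prints each step of the TLS handshake
        - --cacert
        - /etc/org-ca/ca.crt          # trust only the org CA chain
        - "https://<fleet-server-service>:8220/api/status"  # 8220 is Fleet Server's default port
      volumeMounts:
        - name: org-ca
          mountPath: /etc/org-ca
          readOnly: true
  volumes:
    - name: org-ca
      configMap:
        name: org-ca-bundle           # hypothetical ConfigMap with root and intermediates
```

The pod's logs then separate the two cases. A trust failure is reported as a certificate verification error after the connection opens; a timeout or refused connection with no handshake at all points back at the network.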

The trust boundary problem isn't that it's hard. It's that every component has its own trust store, and they all fail silently in the same direction.

The fix requires touching each layer independently: OS trust stores on nodes, Kubernetes secrets and ConfigMaps, ECK operator configuration, Fleet Server certificate references, and Elastic Agent enrollment configuration. Each is a separate surface. Each is documented in a different part of a different project's documentation. None of them cross-reference each other because none of them were designed assuming an org CA. PKI engineering isn't in an ECK deployment guide. It's just what this deployment required.

After the trust chain was resolved, TLS was still failing. Not everywhere, not consistently, just failing in ways that didn't match any configuration state we'd changed. The standard advice is to look for horses: cert problems, misconfigured trust stores, wrong endpoints. We looked for horses for a while. It turned out to be a zebra: asymmetric routing somewhere upstream of the cluster. Traffic was leaving on one path and returning on another, which was enough to break TLS handshakes before they completed. We didn't find it through insight. We found it by eliminating everything in our domain and handing the remaining suspect to someone who understood that layer. The fix was upstream. I couldn't tell you exactly what they changed.

Airgapped Fleet and the EPR sidecar

Fleet Server pulls integration packages from the Elastic Package Registry. By default, that means the internet. This environment had no internet access. The gap between those two facts is solved by a deployment pattern that exists in Elastic's documentation the way important things sometimes exist there: technically present, practically invisible unless you already know to look for it.

The solution is a co-deployed EPR instance — a sidecar container running alongside Fleet Server, serving integration packages from a local mirror. Fleet Server's configuration is updated to point at this local endpoint rather than the public registry. Agents pull from Fleet, Fleet pulls from the sidecar, the sidecar has no idea it's not the real thing.

spec:
  deployment:
    podTemplate:
      spec:
        containers:
          - name: fleet-server
            env:
              - name: FLEET_SERVER_ELASTICSEARCH_PACKAGE_REGISTRY_URL
                value: "http://localhost:8080"   # sidecar endpoint
          - name: elastic-package-registry   # the sidecar
            image: docker.elastic.co/package-registry/distribution:8.x
            ports:
              - containerPort: 8080
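            volumeMounts:
              - name: epr-packages
                mountPath: /packages/production   # pre-seeded package mirror
        volumes:
          - name: epr-packages                    # backs the mount above
            persistentVolumeClaim:
              claimName: epr-packages             # hypothetical claim holding the mirror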

The implementation is clean once you know the pattern. The discovery cost is high. And there is one specific failure mode worth calling out: if the registry URL that Fleet Server is given for the EPR sidecar is wrong by a single path segment, Fleet Server starts successfully, agents connect and enroll, and then integrations silently fail to install. The error surfaces as a package resolution failure that reads, from the outside, like a networking problem. You spend time on the network before realizing the problem is a URL.
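One setting worth checking when installs fail this way lives on the Kibana side. Kibana's Fleet plugin reads the registry location from xpack.fleet.registryUrl, and with ECK that goes under spec.config on the Kibana resource. A minimal sketch; the resource names and the registry host are placeholders, and whether this setting is the relevant one depends on how the Fleet components are wired in a given environment.

```yaml
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana                        # hypothetical resource name
spec:
  version: 8.x                        # placeholder; pin a real version
  count: 1
  elasticsearchRef:
    name: observability               # hypothetical Elasticsearch resource
  config:
    # Base URL of the local registry: scheme, host, and port only.
    xpack.fleet.registryUrl: "http://<local-epr-host>:8080"
```

The registry also answers on /health, which gives a quick way to confirm that the host and port are right before suspecting anything else.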

The other constraint of airgapped EPR: the package mirror has to be pre-seeded before you're isolated from the outside. You need to know in advance which integrations you'll need. This requires actually understanding the observability requirements before you've built the observability platform — a planning discipline that airgapped environments enforce whether you're ready for it or not. Package distribution engineering: another domain that had nothing to do with Elasticsearch, and everything to do with deploying it.

Ingress: the last mile

Kibana needed to be reachable by humans. That meant nginx ingress. That meant the ingress controller needed to handle TLS with org certificates — the same certificates that nothing trusted by default, now also passing through an ingress layer with its own opinions about certificate handling.

The core question with any nginx ingress in an enterprise TLS environment: are you terminating TLS at the ingress and talking plain HTTP to the backend, passing TLS through without terminating it, or terminating at the ingress and re-encrypting to the backend? Each option has different annotation sets, different certificate configuration requirements, and different interactions with ECK's own certificate management. The wrong choice produces errors that look like certificate problems but are actually routing problems, or vice versa.
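In ingress-nginx terms, the three strategies come down to a small difference in annotations. A sketch of the distinguishing lines only, assuming the community ingress-nginx controller; each fragment belongs under the metadata of an otherwise complete Ingress.

```yaml
# Option 1: terminate at the ingress, plain HTTP to the backend.
# With ECK this also means turning off TLS on Kibana's HTTP layer
# (spec.http.tls.selfSignedCertificate.disabled: true on the Kibana resource).
metadata:
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTP"
---
# Option 2: pass TLS through untouched. The controller must be started
# with --enable-ssl-passthrough, and the browser sees the backend's own cert.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/ssl-passthrough: "true"
---
# Option 3: terminate at the ingress, then re-encrypt to the backend.
# The ingress presents the cert from its tls section; the backend keeps its own.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
```

Options 1 and 3 both need a tls section with a certificate secret on the Ingress. Option 2 does not use one, because the ingress never sees the decrypted traffic.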

The pattern that worked in this environment: TLS termination at the ingress, presenting the org cert to the outside, with re-encrypted backend communication over ECK's internally managed certificate. Kibana gets a recognizable cert from the perspective of the user's browser. The cluster's internal traffic stays encrypted. The ingress controller handles the translation between the two cert chains.

annotations:
  nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
  nginx.ingress.kubernetes.io/ssl-passthrough: "false"
  nginx.ingress.kubernetes.io/proxy-ssl-verify: "off"     # ECK manages its own internal cert — no org CA injection on the backend leg
tls:
  - hosts:
      - kibana.internal.agency.gov
    secretName: kibana-org-tls   # org-signed cert, imported as K8s secret

Getting here required understanding the ingress controller, ECK's certificate management, and the org PKI simultaneously — because the failure modes span all three, and fixing one layer while the other two are misconfigured produces misleading signal about which fix actually worked. Four separate problem domains in four weeks, each requiring real expertise. Still no Elasticsearch.

What "deploy ECK" actually meant

The prerequisites, assembled:

ENTERPRISE STORAGE · CSI: Convert vendor manifest to Helm chart. Understand every parameter. Validate PVC provisioning before touching ECK. Requires knowing what "correct" looks like under production load, which is not something you can determine from the manifest alone.

ORGANIZATIONAL PKI: Inject org CA into OS trust stores, Kubernetes secrets, ECK operator config, Fleet Server config, and agent enrollment, each independently and each with its own failure mode. Treat connectivity failures and cert failures as the same problem until proven otherwise.

AIRGAPPED PACKAGE REGISTRY: Deploy an EPR sidecar alongside Fleet Server. Pre-seed the package mirror before network isolation. Wire Fleet Server to the local endpoint. Understand that silent integration install failures are URL problems, not networking problems.

NGINX INGRESS · ENTERPRISE TLS: Choose a TLS strategy (terminate, passthrough, or re-encrypt) before writing any annotation. The wrong choice produces errors that look like cert problems but are routing problems. Get the strategy right first, then configure it.

RKE · ECK OPERATOR: Actually deploying Elasticsearch. The part the ticket was about. Last item on the list.

None of this is in the Elastic documentation. None of it is in the RKE documentation. It spans four separate vendor ecosystems, each of which documents its own surface in isolation. The integration points — where storage meets Kubernetes meets PKI meets ingress meets ECK — exist in the space between documentation pages.

What I'd do differently

Running this engagement again from day one, five changes would compress weeks into days:

Start with a storage validation test before any application deployment. Create a PVC, mount it in a Pod, confirm the bind. Ten minutes. Skipping it meant discovering the CSI driver was misconfigured on day one of the ECK deployment — week three of the engagement.

Run cert validation from inside the cluster immediately — before Fleet, before Agents, before anything that produces ambiguous connectivity errors. A single curl --cacert /path/to/org-ca.pem https://elasticsearch:9200 from a test pod would have surfaced the trust chain failure before it masqueraded as a network problem.

Get the storage administrator in the room before touching the CSI manifest. His knowledge of which defaults would fail under production load wasn't in any documentation. The week we spent before he reviewed our parameters was largely wasted. Senior domain expertise is a prerequisite, not a review gate.

Pre-stage the EPR package mirror with a real integration list before entering the airgap. Discovering you need a package you didn't mirror after the network is cut is not recoverable without going back out.

When TLS fails after certs are confirmed good, escalate to a network engineer before running another cert validation pass. Asymmetric routing can destroy TLS sessions in ways that look exactly like certificate problems. The probability is low; the cost of missing it is high.

The actual point

The T-shape that gets talked about in engineering hiring — broad knowledge with depth in a specialty — gets framed as a stable configuration. You pick your deep skill, you build horizontal awareness around it, and that's the shape of your career. What this engagement made clear is that the expertise requirement is not fixed. It moves to wherever the current problem is.

I went deep on enterprise storage provisioning. Then organizational PKI. Then airgapped package distribution. Then nginx ingress TLS. Not because I had anticipated needing those skills, but because the deployment path ran through them and there was no one else to go deep. The breadth is what got me to the right problem. The depth, whether borrowed, learned, or extracted from someone who'd been there before, is what got me through it.

Enterprise deployments don't start where the documentation starts. They start where the environment is — with the storage vendor that was already chosen, the CA that was already running, the network that was already segmented.

Observability at enterprise scale is not an observability problem at the start. It's a storage problem, a networking problem, a certificate problem, a package distribution problem. The observability work begins after you've solved all of those — and the engineer who treats those as "someone else's problem" will be waiting a long time for someone else to show up.

The storage admin who saved us three weeks of trial and error on the CSI driver will never appear in a dashboard. He's not visible to the program office that needed observability before they could ship. He's not in the architecture diagram. But without him, the first PVC never binds, and none of the rest of it follows. That's the kind of dependency that doesn't show up until you're the one building the thing.

The first log line arrived in week four. It was the easiest part.
