Observability Stack
Deployed the kube-prometheus-stack Helm chart plus Loki for log aggregation. Every node, pod, and VM now has metrics and logs.
Key dashboards:
- Node exporter: CPU, memory, disk, network per host
- Kubernetes: pod restarts, PVC usage, API server latency
- Custom: Ceph OSD health, network interface errors
First thing I discovered with real observability: my backup CronJob had been silently failing for two weeks due to a misconfigured S3 endpoint. I'd been living dangerously without knowing it.