Alessio D'Ambrosio — Backend & Infrastructure

Observability Stack

Deployed the kube-prometheus-stack Helm chart for metrics, plus Loki for log aggregation. Every node, pod, and VM now exports metrics and ships its logs to a central place.

Key dashboards:

Node exporter: CPU, memory, disk, network per host
Kubernetes: pod restarts, PVC usage, API server latency
Custom: Ceph OSD health, network interface errors
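To give a feel for what sits behind panels like these, here is a minimal sketch of the kind of PromQL a dashboard might run, plus a helper that builds an instant-query URL for the Prometheus HTTP API. The metric names are ones that node exporter, kube-state-metrics, the kubelet, and the Ceph manager's Prometheus module commonly export, but the exact expressions, the panel keys, and the base URL are assumptions for illustration, not the queries actually used here.

```python
from urllib.parse import urlencode

# Hypothetical panel -> PromQL mapping. The expressions are illustrative
# guesses, not the real dashboard definitions.
PANEL_QUERIES = {
    # Node exporter: fraction of CPU time spent non-idle, per host
    "node_cpu": '1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))',
    # Node exporter: receive errors per second, per interface
    "nic_errors": "rate(node_network_receive_errs_total[5m])",
    # kube-state-metrics: container restarts over the last hour
    "pod_restarts": "increase(kube_pod_container_status_restarts_total[1h])",
    # kubelet: used fraction of each PVC
    "pvc_usage": "kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes",
    # Ceph mgr Prometheus module: OSDs currently reported down
    "ceph_osd_down": "ceph_osd_up == 0",
}


def build_query_url(base_url: str, expr: str) -> str:
    """Return a URL for Prometheus' instant-query endpoint (/api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


if __name__ == "__main__":
    # The base URL is a placeholder; substitute the real Prometheus address.
    for name, expr in PANEL_QUERIES.items():
        print(name, build_query_url("http://prometheus.example:9090", expr))
```

Keeping the expressions in one mapping makes it easy to test them against the API with curl before wiring them into a dashboard.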

First thing I discovered with real observability: my backup CronJob had been silently failing for two weeks due to a misconfigured S3 endpoint. I'd been living dangerously without knowing it.
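A failure like this can be caught by alerting on how long it has been since the CronJob last succeeded, rather than waiting for someone to look. Below is a minimal sketch of that check, assuming kube-state-metrics is exporting kube_cronjob_status_last_successful_time (it ships with kube-prometheus-stack). The CronJob name "backup" and the 36-hour threshold are hypothetical values chosen for the example.

```python
from datetime import datetime, timedelta, timezone

# Exported by kube-state-metrics as a Unix timestamp per CronJob.
LAST_SUCCESS_METRIC = "kube_cronjob_status_last_successful_time"


def staleness_expr(cronjob: str, max_age: timedelta) -> str:
    """Build a PromQL expression that matches when the last success is too old."""
    seconds = int(max_age.total_seconds())
    return f'time() - {LAST_SUCCESS_METRIC}{{cronjob="{cronjob}"}} > {seconds}'


def is_stale(last_success: datetime, now: datetime, max_age: timedelta) -> bool:
    """Same check done locally, e.g. on a timestamp read back from the API."""
    return now - last_success > max_age


if __name__ == "__main__":
    # "backup" and 36 hours are placeholder values for a daily job.
    print(staleness_expr("backup", timedelta(hours=36)))
    now = datetime.now(timezone.utc)
    print(is_stale(now - timedelta(days=14), now, timedelta(hours=36)))
```

One caveat with this approach: a CronJob that has never succeeded may not report the metric at all, so the expression would match nothing. Pairing it with an absent() check or an alert on failed Jobs would cover that case.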