Production monitoring on Kubernetes

The monitoring stack that had been running locally on Docker Compose landed on the production Kubernetes cluster (OVH Managed Kubernetes). Prometheus, Grafana, Loki and Promtail — the same set, but installed via Helm and accessible externally under dedicated subdomains.

Installation via Helm

Locally the stack started through docker-compose.yml with manual configuration. On Kubernetes I used two Helm charts:

helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace \
  -f k8s/helm-values/kube-prometheus-stack-values.yaml

helm install loki grafana/loki-stack \
  -n monitoring \
  -f k8s/helm-values/loki-stack-values.yaml

kube-prometheus-stack installs Prometheus, Grafana, Alertmanager, node-exporter and kube-state-metrics. loki-stack adds Loki and Promtail as a DaemonSet on every node.

Promtail and containerd

The first configuration mounted /var/lib/docker/containers — the standard path for Docker. OVH Managed Kubernetes uses containerd, and logs go to /var/log/pods/.

Promtail started but found no log files. The fix required three changes:

# loki-stack-values.yaml
promtail:
  extraVolumes:
    - name: pods
      hostPath:
        path: /var/log/pods
  extraVolumeMounts:
    - name: pods
      mountPath: /var/log/pods
      readOnly: true
  config:
    snippets:
      scrapeConfigs: |
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          pipeline_stages:
            - cri: {}
          relabel_configs:
            - source_labels:
                - __meta_kubernetes_pod_uid
                - __meta_kubernetes_pod_container_name
              separator: /
              replacement: /var/log/pods/*$1/$2/*.log
              target_label: __path__

Key elements: pipeline_stages: - cri: {} parses the containerd log format (not Docker JSON), and the relabel using __meta_kubernetes_pod_uid builds the path /var/log/pods/*/uid/container/*.log.

nginx-ingress metrics

On local dev I monitored Traefik as the reverse proxy. In production, traffic flows through the nginx-ingress controller. By default it did not expose detailed HTTP metrics.

Enabling them required two steps: adding the --enable-metrics=true flag and port 10254 to the deployment, then creating a Service and ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ingress-nginx
  namespace: monitoring
  labels:
    release: kube-prometheus
spec:
  namespaceSelector:
    matchNames:
      - ingress-nginx
  selector:
    matchLabels:
      app.kubernetes.io/name: ingress-nginx
  endpoints:
    - port: metrics
      interval: 15s

The release: kube-prometheus label is required — Prometheus Operator filters ServiceMonitors by this label.

External access

Both panels are available under dedicated subdomains with TLS (Let's Encrypt via cert-manager):

grafana.borowski.services — Grafana with login
prometheus.borowski.services — Prometheus UI

Initially I protected access with an IP whitelist via an nginx annotation:

nginx.ingress.kubernetes.io/whitelist-source-range: "79.186.58.130/32"

It did not work. The OVH load balancer performs SNAT — nginx saw the load balancer's IP (146.59.117.234), not the real client IP. Attempting to enable proxy protocol broke connections (the LB does not send it). The solution: basic auth instead of a whitelist.

nginx.ingress.kubernetes.io/auth-type: basic
nginx.ingress.kubernetes.io/auth-secret: monitoring-basic-auth
nginx.ingress.kubernetes.io/auth-realm: "Monitoring"

The monitoring-basic-auth secret was generated with htpasswd. Grafana has its own login on top — a double layer.

Dashboards

Dashboards are loaded automatically by the Grafana sidecar. A ConfigMap with the grafana_dashboard=1 label is detected and imported without a restart:

Portfolio Services Dashboard — request rate, latency (p50/p95/p99), error rate and Loki logs
Loki Logs — Portfolio — log browser with filtering by namespace, pod and container

Queries were rewritten from Traefik to nginx-ingress (nginx_ingress_controller_requests instead of traefik_service_requests_total). Dashboard configuration details are covered in a separate article.

Versions

infra → v0.1.0 (new minor: production monitoring, cert-manager, nginx configs, basic auth)