Scale the built-in Prometheus

Service Mesh Manager relies on Prometheus (and Thanos in HA mode) to store historical values of metrics and calculate Topology, SLOs, or Health scores. In large-scale deployments (clusters with many Workloads and Services), Prometheus is usually the component that uses the most resources.

If Prometheus does not have enough resources, the dashboard might become slow or even unresponsive. In such cases, start by scaling the Prometheus instance.

Scale Prometheus

Prometheus (as configured in Service Mesh Manager) can only be scaled vertically. To find the right resource limits, follow the procedure for scaling a specific workload.

Note: If Prometheus is being CPU throttled, that is usually the primary reason for a slow Service Mesh Manager dashboard.
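
One way to check for throttling is to read the CPU statistics of the Prometheus container's cgroup. The following is a minimal sketch, not part of Service Mesh Manager: the helper name throttled_pct is hypothetical, and it assumes you can obtain the contents of the container's cpu.stat file (which exposes the nr_periods and nr_throttled counters), for example with kubectl exec.

```shell
# Hypothetical helper: reads cgroup cpu.stat content on stdin and prints the
# percentage of CFS periods in which the container was throttled.
throttled_pct() {
  awk '
    $1 == "nr_periods"   { periods = $2 }
    $1 == "nr_throttled" { throttled = $2 }
    END {
      if (periods > 0) printf "%d\n", int(100 * throttled / periods)
      else             print 0
    }'
}

# Example usage (assumption: replace the placeholders with your Prometheus pod
# and namespace; the cpu.stat path depends on the cgroup version of the node):
#   kubectl exec -n <namespace> <prometheus-pod> -- cat /sys/fs/cgroup/cpu.stat | throttled_pct
```

A value that stays well above zero while the dashboard is slow suggests that the CPU limit of Prometheus is too low.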

Monitoring resolution

Besides the conventional methods for scaling Prometheus, note that its resource usage depends on the amount of data ingested (among many other things). Service Mesh Manager by default configures Prometheus to have a 5s resolution (that is, to have a data point every five seconds), which is great for initial experimentation and spotting small spikes in the metrics.

For large-scale deployments, we highly recommend decreasing the resolution to 15s or 30s, which decreases CPU and memory usage by roughly 66% (15s) or 83% (30s).
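
These percentages follow from the number of samples stored per time series: going from a 5s to a 15s interval keeps one sample out of three. A minimal sketch of the arithmetic follows. The function name is illustrative, and the sketch assumes that resource usage scales roughly linearly with the number of ingested samples.

```shell
# Illustrative helper: percentage of samples saved when the scrape interval
# grows from $1 seconds to $2 seconds (integer arithmetic, rounded down).
sample_reduction_pct() {
  old=$1
  new=$2
  echo $(( 100 * (new - old) / new ))
}

sample_reduction_pct 5 15   # prints 66
sample_reduction_pct 5 30   # prints 83
```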

To change the monitoring resolution of Prometheus, run the following commands.

cat > change-prometheus-scraping-frequency.yaml <<EOF
spec:
  smm:
    prometheus:
      scraping:
        frequency:
          interval: 15s
          timeout: 15s
EOF

kubectl patch controlplane --type=merge --patch "$(cat change-prometheus-scraping-frequency.yaml)" smm
  • If you are using Service Mesh Manager in Operator Mode, then the Istio deployment is updated automatically.
  • If you are using the imperative mode, run the smm operator reconcile command to apply the changes.
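
To confirm that the new interval was recorded, you can read the value back from the ControlPlane resource. This is a minimal sketch: the function name is hypothetical, and the JSONPath expression mirrors the field path used in the patch file.

```shell
# Hypothetical helper: prints the scrape interval currently set on the
# ControlPlane resource named "smm" (the same resource the patch targets).
current_scrape_interval() {
  kubectl get controlplane smm \
    -o jsonpath='{.spec.smm.prometheus.scraping.frequency.interval}'
}

# Example usage:
#   current_scrape_interval
```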

Note: When using persistent storage for Prometheus, resource utilization decreases more slowly than it would with an empty database, because Service Level Objectives still access older data that was stored at the higher resolution.

Scale Thanos

For HA setups, Service Mesh Manager uses thanos-query for metric deduplication. The service scales horizontally and is usually CPU bound.

By default, Service Mesh Manager provisions a Horizontal Pod Autoscaler so that thanos-query can scale dynamically based on resource usage. To set the resource requests and limits of Thanos, use the .spec.smm.prometheus.thanos.query.resources field of the ControlPlane custom resource, as detailed in SMM scaling.
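
For example, the resources could be set with a merge patch similar to the one used for the scraping frequency. This is a sketch only: the file name, the function names, and the CPU and memory values are placeholders, and it assumes that the resources field uses the standard Kubernetes requests/limits layout.

```shell
# Hypothetical helper: writes a merge patch for the thanos-query resources.
# The values below are placeholders; size them for your own cluster.
write_thanos_query_patch() {
  cat > change-thanos-query-resources.yaml <<EOF
spec:
  smm:
    prometheus:
      thanos:
        query:
          resources:
            requests:
              cpu: 500m
              memory: 512Mi
            limits:
              cpu: "1"
              memory: 1Gi
EOF
}

# Hypothetical helper: applies the patch to the ControlPlane resource "smm".
apply_thanos_query_patch() {
  write_thanos_query_patch
  kubectl patch controlplane --type=merge \
    --patch "$(cat change-thanos-query-resources.yaml)" smm
}
```

As with the scraping frequency, run the smm operator reconcile command afterwards if you are using the imperative mode.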