I have installed the Prometheus/AlertManager Helm charts in a Rancher cluster.
Rancher: v2.5.5
rancher-monitoring: 9.4.202
k8s: v1.19.7
For some unknown reason, AlertManager has been showing the following alerts for several consecutive days now: KubeSchedulerDown, KubeControllerManagerDown, KubeAPIDown and KubeletDown.
Unfortunately, the runbook links attached to those alerts don’t give much insight into what the problem is or how it could be solved.
I don’t even know what is actually broken, as the cluster behaves normally (or at least it looks OK): pods can be scheduled and restarted, external traffic is handled as usual, and I can access the cluster from outside using kubectl or k9s.
So my question is: what is broken, and how can I fix it?
I did notice that Prometheus/Grafana appear broken: they don’t show any cluster data, and all dashboards only show “N/A” or “no data”.
Prometheus also shows the following logs (which I have seen every time in the past when the persistent volume for Prometheus went down or had problems because of Longhorn issues):
prometheus level=warn ts=2021-05-25T07:48:36.616Z caller=manager.go:595 component="rule manager" group=kube-scheduler.rules msg="Rule sample appending failed" err="write to WAL: log samples: write /prometheus/wal/00007531: read-only file system"
prometheus level=warn ts=2021-05-25T07:48:36.617Z caller=manager.go:595 component="rule manager" group=kube-scheduler.rules msg="Rule sample appending failed" err="write to WAL: log samples: write /prometheus/wal/00007531: read-only file system"
Somehow Prometheus neither restarts nor reports an unhealthy status in this situation.
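Since Prometheus does not flag this condition itself, one option is to scan its logs for the error. The following is only a minimal sketch, not something shipped with rancher-monitoring: the function names `parse_logfmt` and `is_readonly_wal_error` are made up for illustration, and it assumes the log lines use plain double quotes (as Prometheus emits them) rather than the typographic quotes a forum may substitute.

```python
import re

# Matches logfmt key=value pairs where the value is bare or double-quoted.
_PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_logfmt(line):
    """Return the key=value pairs of one logfmt line as a dict (hypothetical helper)."""
    return {key: value.strip('"') for key, value in _PAIR.findall(line)}


def is_readonly_wal_error(line):
    """True if the line reports a WAL write failing on a read-only file system."""
    err = parse_logfmt(line).get("err", "")
    return "write to WAL" in err and "read-only file system" in err


# Sample taken from the log output above (container-name prefix included).
sample = (
    'prometheus level=warn ts=2021-05-25T07:48:36.616Z caller=manager.go:595 '
    'component="rule manager" group=kube-scheduler.rules '
    'msg="Rule sample appending failed" '
    'err="write to WAL: log samples: write /prometheus/wal/00007531: '
    'read-only file system"'
)
print(is_readonly_wal_error(sample))  # prints True
```

Something like this could be fed with the pod's log stream to raise a notification before the dashboards go empty; how to wire it up depends on the setup.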
In the past I used to “fix” such problems by manually restarting the Prometheus pod.
This time too, after restarting the Prometheus pod, the “N/A” and “no data” placeholders in Grafana disappeared and real data was shown.
The strange AlertManager alerts (KubeSchedulerDown, KubeControllerManagerDown, KubeAPIDown and KubeletDown) were resolved as well.
So it looks like the problem was the Prometheus pod not being able to write to its underlying database: with no fresh samples being stored, the alerts that fire when a target has no data were triggered even though the cluster components themselves were fine.
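To catch this earlier, the idea behind a health check would be to test whether the data directory is still writable. Below is a rough sketch of such a probe in Python; `is_writable` is a hypothetical helper, the `/prometheus` path is taken from the log lines above, and whether and how this can be hooked into the Prometheus pod (for example as a sidecar or an exec-style probe) is an assumption I have not verified.

```python
import os
import tempfile


def is_writable(directory):
    """Return True if a temporary file can be created and written in `directory`."""
    try:
        # The file is removed automatically when the context manager exits.
        with tempfile.NamedTemporaryFile(dir=directory) as probe:
            probe.write(b"ok")
            probe.flush()
            os.fsync(probe.fileno())
        return True
    except OSError:
        # Covers a read-only file system as well as a missing directory.
        return False


if __name__ == "__main__":
    # Assumed data directory, based on the WAL path in the logs.
    data_dir = "/prometheus"
    print("writable" if is_writable(data_dir) else "NOT writable")
```

A failing check of this kind would at least make the pod's state visible instead of leaving it looking healthy while the WAL cannot be written.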