Hi team,
We upgraded our RKE2 cluster from v1.27.16+rke2r2 to v1.28.15+rke2r1
(Calico v3.27.3 → v3.28.2, tigera-operator v1.32.7 → v1.34.5) and are
hitting two unresolved upstream bugs plus an unexplained IPAM block
deletion. Looking for guidance on how to handle these on RKE2/Calico.
Cluster:
- 30 nodes (3 masters, 27 workers)
- RKE2: v1.28.15+rke2r1
- Calico: v3.28.2
- tigera-operator: v1.34.5
- OS: Ubuntu 24.04
- CNI: Calico VXLAN
Bug 1 — tigera-operator controller_installation reconcile loop
tigera-operator reconciles every ~1 second continuously and never
converges. Confirmed in both v1.32.7 (bundled with RKE2 v1.27.16)
and v1.34.5 (bundled with RKE2 v1.28.15).
Log pattern (repeating every 1s indefinitely):
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-node"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-cni-plugin"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-kube-controllers"}
No errors are logged between reconcile messages. The Installation CR
spec and status.computed are identical; there is no visible drift.
We tried --manage-crds=false but the loop continued, so it is not
CRD-related.
This generates ~3,000-3,200 etcd writes/min, which caused the Typha
watchercache to fall behind etcd compaction and led to a cluster-wide
network outage.
Current mitigation: tigera-operator scaled to 0 replicas.
Filed: github.com/tigera/operator/issues/4615
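For anyone wanting to reproduce the measurement: this is roughly how we counted reconcile events (sample.log here is a hypothetical capture; on the live cluster we pulled it with `kubectl -n tigera-operator logs deploy/tigera-operator --since=1m`):

```shell
# Write a hypothetical one-minute capture of tigera-operator output.
cat > sample.log <<'EOF'
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-node"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-cni-plugin"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-kube-controllers"}
EOF

# Count reconcile messages in the capture; on our cluster this came
# out to ~50-53 per second, i.e. ~3,000-3,200/min.
grep -c '"msg":"Reconciling Installation' sample.log   # → 3
```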
Questions:
- Is this fixed in a specific tigera-operator version?
- Which RKE2 version bundles a fixed tigera-operator?
Bug 2 — Typha watchercache "too old resource version" on /calico/ipam/v2/host/
We are still hitting this in Calico v3.28.2 (we thought libcalico-go
PR #9690 had fixed it):
watchercache.go 125: Watch error received from Upstream ListRoot="/calico/ipam/v2/host/" error=too old resource version: 869060569 (1075704479)
watchercache.go 181: Full resync is required ListRoot="/calico/ipam/v2/host/"
Occurs every ~5-10 minutes. Current mitigation: a daily Typha restart
via crontab.
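For reference, the cron entry looks roughly like this (a sketch: it assumes an operator-managed install with Typha as a Deployment in calico-system, kubectl on root's PATH, and the standard RKE2 kubeconfig path; adjust for your setup):

```shell
# Hypothetical root crontab entry: rolling-restart Typha daily at 03:00.
0 3 * * * KUBECONFIG=/etc/rancher/rke2/rke2.yaml kubectl -n calico-system rollout restart deployment calico-typha
```

A rolling restart keeps some Typha replicas up, so Felix instances fail over rather than all losing their sync connection at once.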
Shaun Crampton from Tigera mentioned v3.31 includes improved Typha
reconnection logic that restarts the Typha connection without
restarting the dataplane.
Questions:
- Is the /calico/ipam/v2/host/ watchercache path fixed in v3.29+?
- Which RKE2 version bundles Calico v3.31?
Bug 3 — Unknown actor deleted all 27 IPAM blocks simultaneously
On April 7, all 27 IPAM blocks were deleted from etcd at 18:55:09 UTC,
one every ~200ms:
18:55:09.686 ipam.go 511: Received block delete block="10.42.1.0/24"
18:55:09.888 ipam.go 511: Received block delete block="10.42.111.0/24"
… (all 27 blocks within 6 seconds)
This triggered RouteRemove on all 30 Felix nodes simultaneously →
Felix restarted → all Typha connections dropped → grace period expired
→ vxlan.calico DOWN → 5h48m outage.
The kube-controllers log only shows that it received the delete
events, not who initiated them. tigera-operator was scaled to 0 at
the time.
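Since kube-controllers only sees the watch events, we plan to enable kube-apiserver audit logging so the deleting identity is recorded next time (with the Kubernetes datastore, blocks are IPAMBlock CRs, so deletes go through the apiserver). A sketch of the query, using a made-up audit.log excerpt (the usernames, block name, and log location are all hypothetical; RKE2 audit logging must be enabled explicitly):

```shell
# Hypothetical audit log excerpt; real entries come from kube-apiserver
# audit logging (the identities below are invented for the example).
cat > audit.log <<'EOF'
{"verb":"delete","user":{"username":"system:serviceaccount:example:example-sa"},"objectRef":{"resource":"ipamblocks","name":"10-42-1-0-24"}}
{"verb":"get","user":{"username":"admin"},"objectRef":{"resource":"ipamblocks","name":"10-42-1-0-24"}}
EOF

# Who deleted IPAM blocks? Filter delete verbs on ipamblocks and
# extract the username field.
grep '"verb":"delete"' audit.log \
  | grep '"resource":"ipamblocks"' \
  | sed -E 's/.*"username":"([^"]+)".*/\1/'
```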
Question:
- What component in Calico v3.27.3 could delete all IPAM blocks
simultaneously? Is this a known bug fixed in v3.28+?
Summary of what we need:
- Which RKE2 version bundles a tigera-operator with the
  controller_installation reconcile loop fixed?
- Which RKE2 version bundles Calico v3.31 (Typha reconnect fix)?
- Any insight on the simultaneous IPAM block deletion?
Thanks