Tigera-operator v1.34.5 controller_installation reconcile loop + Typha watchercache bug still present after RKE2 v1.28.15 upgrade: seeking fix

Hi team,

We upgraded our RKE2 cluster from v1.27.16+rke2r2 to v1.28.15+rke2r1
(Calico v3.27.3 → v3.28.2, tigera-operator v1.32.7 → v1.34.5) and are
still hitting two unresolved upstream bugs, plus one unexplained IPAM
deletion incident.

Looking for guidance on how to resolve these on RKE2/Calico.


Cluster:

  • 30 nodes (3 masters, 27 workers)
  • RKE2: v1.28.15+rke2r1
  • Calico: v3.28.2
  • tigera-operator: v1.34.5
  • OS: Ubuntu 24.04
  • CNI: Calico VXLAN

Bug 1 — tigera-operator controller_installation reconcile loop

tigera-operator reconciles every ~1 second continuously and never
converges. Confirmed in both v1.32.7 (bundled with RKE2 v1.27.16)
and v1.34.5 (bundled with RKE2 v1.28.15).

Log pattern (repeating every 1s indefinitely):
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-node"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-cni-plugin"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-kube-controllers"}

No errors are logged between reconcile messages. The Installation CR
spec and status.computed are identical; there is no visible drift.

We tried --manage-crds=false, but the loop continued, so it is not CRD-related.
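For anyone trying to reproduce, here is a self-contained sketch of how we quantified the loop from saved operator logs. It only assumes the JSON log format shown above; the deployment/namespace names in the comment are the RKE2 defaults.

```shell
# Count reconcile events per Request.Name in a saved operator log.
# Sample lines are inlined so the pipeline runs standalone; in practice pipe
#   kubectl -n tigera-operator logs deploy/tigera-operator --since=1m
# through the same grep to get a per-minute rate.
cat > /tmp/op.log <<'EOF'
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-node"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-cni-plugin"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-node"}
EOF
grep -o '"Request.Name":"[^"]*"' /tmp/op.log | sort | uniq -c | sort -rn
```

Against live logs this makes the ~1 reconcile/second per resource immediately visible.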

This causes ~3000-3200 etcd writes/min, which caused the Typha
watchercache to fall behind etcd compaction, leading to a cluster-wide
network outage.
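The write-rate figure came from diffing etcd's put counter between two scrapes. A sketch of the arithmetic with inlined sample scrapes (etcd_mvcc_put_total is an upstream etcd metric; the port in the comment is the RKE2 default etcd metrics port and should be verified on your nodes):

```shell
# Diff etcd_mvcc_put_total between two scrapes taken 60s apart. In practice
# fetch http://127.0.0.1:2381/metrics twice on an etcd node; sample scrapes
# are inlined here so this runs standalone.
cat > /tmp/scrape1 <<'EOF'
etcd_mvcc_put_total 1000
EOF
cat > /tmp/scrape2 <<'EOF'
etcd_mvcc_put_total 4100
EOF
a=$(awk '/^etcd_mvcc_put_total/ {print $2}' /tmp/scrape1)
b=$(awk '/^etcd_mvcc_put_total/ {print $2}' /tmp/scrape2)
echo "$((b - a)) puts/min"   # → 3100 puts/min
```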

Current mitigation: tigera-operator scaled to 0 replicas.

Filed: github.com/tigera/operator/issues/4615

Questions:

  • Is this fixed in a specific tigera-operator version?
  • Which RKE2 version bundles a fixed tigera-operator?

Bug 2 — Typha watchercache "too old resource version" on /calico/ipam/v2/host/

We are still hitting this in Calico v3.28.2 (we thought libcalico-go
PR #9690 had fixed it):

watchercache.go 125: Watch error received from Upstream ListRoot="/calico/ipam/v2/host/" error=too old resource version: 869060569 (1075704479)

watchercache.go 181: Full resync is required ListRoot="/calico/ipam/v2/host/"

Occurs every ~5-10 minutes. Current mitigation: a daily Typha restart
via crontab.
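For completeness, the band-aid we run. The kubeconfig path and the calico-typha deployment/namespace are the RKE2 and operator defaults; adjust to your install.

```shell
# /etc/crontab entry: rolling-restart Typha nightly at 04:00, before the
# watchercache falls too far behind compaction. This is a rollout restart,
# not a hard kill, so Felix connections drain gracefully.
0 4 * * * root kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml \
  -n calico-system rollout restart deployment calico-typha
```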

Shaun Crampton from Tigera mentioned v3.31 includes improved Typha
reconnection logic that restarts the Typha connection without
restarting the dataplane.

Questions:

  • Is the /calico/ipam/v2/host/ watchercache path fixed in v3.29+?
  • Which RKE2 version bundles Calico v3.31?

Bug 3 — Unknown actor deleted all 27 IPAM blocks simultaneously

On April 7, all 27 IPAM blocks were deleted from etcd at 18:55:09 UTC,
one every ~200ms:

18:55:09.686 ipam.go 511: Received block delete block="10.42.1.0/24"
18:55:09.888 ipam.go 511: Received block delete block="10.42.111.0/24"
… all 27 blocks within 6 seconds

This triggered RouteRemove on all 30 Felix nodes simultaneously →
Felix restarted → all Typha connections dropped → grace period expired
→ vxlan.calico DOWN → 5h48m outage.

The kube-controllers log only shows that it received the delete events,
not who initiated them. tigera-operator was scaled to 0 at the time.

Question:

  • What component in Calico v3.27.3 could delete all IPAM blocks
    simultaneously? Is this a known bug fixed in v3.28+?
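One avenue we are exploring, in case it helps others: with the Kubernetes API datastore (the RKE2 default), IPAM blocks are ipamblocks.crd.projectcalico.org resources, so the server-side-apply field managers on the recreated blocks at least hint at which component writes them (kubectl hides managedFields unless you pass --show-managed-fields). A self-contained sketch; the "calico-ipam" manager name in the sample is illustrative, not taken from our cluster:

```shell
# List distinct field managers in an IPAMBlock dump. Sample JSON is inlined
# so the pipeline runs standalone; in practice dump real data with:
#   kubectl get ipamblocks.crd.projectcalico.org -o json --show-managed-fields > /tmp/blocks.json
cat > /tmp/blocks.json <<'EOF'
{"items":[{"metadata":{"name":"10-42-1-0-24","managedFields":[{"manager":"calico-ipam"}]}}]}
EOF
grep -o '"manager":"[^"]*"' /tmp/blocks.json | sort -u
```

Deletes themselves leave no managedFields behind, so catching a repeat of the April 7 event would need API server audit logging scoped to the crd.projectcalico.org group.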

Summary of what we need:

  1. Which RKE2 version bundles a tigera-operator with the
    controller_installation reconcile loop fixed?
  2. Which RKE2 version bundles Calico v3.31 (Typha reconnect fix)?
  3. Any insight on the simultaneous IPAM block deletion?

Thanks