Tigera-operator v1.34.5 controller_installation reconcile loop + Typha watchercache bug still present after RKE2 v1.28.15 upgrade: seeking fix

Hi team,

We upgraded our RKE2 cluster from v1.27.16+rke2r2 to v1.28.15+rke2r1
(Calico v3.27.3 → v3.28.2, tigera-operator v1.32.7 → v1.34.5) and are
hitting two unresolved upstream bugs.

Looking for guidance on how to solve this on RKE2/Calico.


Cluster:

  • 30 nodes (3 masters, 27 workers)
  • RKE2: v1.28.15+rke2r1
  • Calico: v3.28.2
  • tigera-operator: v1.34.5
  • OS: Ubuntu 24.04
  • CNI: Calico VXLAN

Bug 1 — tigera-operator controller_installation reconcile loop

tigera-operator reconciles every ~1 second continuously and never
converges. Confirmed in both v1.32.7 (bundled with RKE2 v1.27.16)
and v1.34.5 (bundled with RKE2 v1.28.15).

Log pattern (repeating every 1s indefinitely):
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-node"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-cni-plugin"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-kube-controllers"}

No errors logged between reconcile messages. Installation CR spec
and status.computed are identical, no visible drift.

We tried --manage-crds=false, but the loop continued, so it’s not CRD-related.

This causes ~3000-3200 etcd writes/min, which caused the Typha watchercache
to fall behind etcd compaction, leading to a cluster-wide network outage.

Current mitigation: tigera-operator scaled to 0 replicas.

Filed: github.com/tigera/operator/issues/4615

Questions:

  • Is this fixed in a specific tigera-operator version?
  • Which RKE2 version bundles a fixed tigera-operator?

Bug 2 — Typha watchercache "too old resource version" on /calico/ipam/v2/host/

Still hitting this in Calico v3.28.2 (we thought libcalico-go PR #9690
had fixed it):

watchercache.go 125: Watch error received from Upstream ListRoot="/calico/ipam/v2/host/" error=too old resource version: 869060569 (1075704479)

watchercache.go 181: Full resync is required ListRoot="/calico/ipam/v2/host/"
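To quantify how far the cache falls behind, the two revision numbers in these errors can be parsed and diffed (a small sketch; the regex assumes the exact log format shown above):

```python
import re

# Matches the "too old resource version: <cached> (<current>)" tail of
# a Typha/Felix watchercache error line.
LINE_RE = re.compile(r'too old resource version: (\d+) \((\d+)\)')

def revision_lag(log_line):
    """Return (cached_rev, current_rev, lag) for a watchercache
    'too old resource version' line, or None if it doesn't match."""
    m = LINE_RE.search(log_line)
    if not m:
        return None
    cached, current = int(m.group(1)), int(m.group(2))
    return cached, current, current - cached

line = ('watchercache.go 125: Watch error received from Upstream '
        'ListRoot="/calico/ipam/v2/host/" '
        'error=too old resource version: 869060569 (1075704479)')
print(revision_lag(line))  # (869060569, 1075704479, 206643910)
```

Run over a day of Typha logs, this makes it easy to see whether the lag is growing between resyncs.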

Occurs every ~5-10 minutes. Current mitigation: a daily Typha restart
via crontab.
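For reference, the mitigation cron entry looks roughly like this (a sketch, not our exact file; the schedule and kubeconfig path are illustrative, and it assumes operator-managed Typha as a `calico-typha` deployment in the `calico-system` namespace):

```
# /etc/cron.d/calico-typha-restart
# System crontab format: minute hour dom month dow user command.
# Rolling restart avoids dropping all Typha connections at once.
30 4 * * * root kubectl --kubeconfig=/etc/rancher/rke2/rke2.yaml \
  -n calico-system rollout restart deployment/calico-typha
```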

Shaun Crampton from Tigera mentioned v3.31 includes improved Typha
reconnection logic that restarts the Typha connection without
restarting the dataplane.

Questions:

  • Is the /calico/ipam/v2/host/ watchercache path fixed in v3.29+?
  • Which RKE2 version bundles Calico v3.31?

Bug 3 — Unknown actor deleted all 27 IPAM blocks simultaneously

On April 7, all 27 IPAM blocks were deleted from etcd at 18:55:09 UTC,
one every ~200ms:

18:55:09.686 ipam.go 511: Received block delete block="10.42.1.0/24"
18:55:09.888 ipam.go 511: Received block delete block="10.42.111.0/24"
… all 27 blocks within 6 seconds

This triggered RouteRemove on all 30 Felix nodes simultaneously →
Felix restarted → all Typha connections dropped → grace period expired
→ vxlan.calico DOWN → 5h48m outage.

kube-controllers log only shows it received the delete events —
not who initiated them. tigera-operator was scaled to 0 at the time.

Question:

  • What component in Calico v3.27.3 could delete all IPAM blocks
    simultaneously? Is this a known bug fixed in v3.28+?
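Since kube-controllers only logs that it received the deletes, we are planning to capture the initiating identity next time via API-server auditing. A sketch of the policy fragment (assumption: with the Kubernetes datastore, as on RKE2, IPAM blocks are `ipamblocks.crd.projectcalico.org` objects; the resource list here is our guess at the relevant ones):

```yaml
# Audit policy fragment: record who deletes Calico IPAM resources.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: RequestResponse
    verbs: ["delete", "deletecollection"]
    resources:
      - group: "crd.projectcalico.org"
        resources: ["ipamblocks", "blockaffinities", "ipamhandles"]
```

The audit log's `user` and `userAgent` fields would then identify the actor behind any future bulk deletion.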

Summary of what we need:

  1. Which RKE2 version bundles a tigera-operator with the
    controller_installation reconcile loop fixed?
  2. Which RKE2 version bundles Calico v3.31 (Typha reconnect fix)?
  3. Any insight on the simultaneous IPAM block deletion?

Thanks


Hi,

We hit a problem with IPAM block removal similar to the one you describe.
We upgraded our RKE2 cluster from:
rke2 v1.30.11+rke2r1
calico v3.29.2
tigera-operator v1.36.5
to:
rke2 v1.31.14+rke2r1
calico v3.30.4
tigera-operator v1.38.7

In our case, the IP pool configuration was completely removed and the cluster upgrade stalled. The logs showed vxlan.calico "Link DOWN" entries.
The calico-kube-controllers-* pod was stuck in a crash/restart loop because the IPAM plugin reported: "cannot find a qualified ippool".
However, we still do not know the root cause of this behavior.

After diagnosis and extensive research, we can only offer assumptions about the probable causes of what happened:

  • IPAM GC behavior (the component that can delete IPAMBlocks): This runs in calico-kube-controllers and cleans up orphaned/leaked blocks during node events or controller restarts. Earlier versions (e.g., 3.27.x) had rate-limiter and GC heuristic issues that could lead to rapid or stalled bulk deletions. These were improved in 3.28+ and further refined in 3.29/3.30.
  • RKE2 Calico Helm chart upgrade problems: Older RKE2 upgrades (e.g., 1.28.x → 1.29.x) had cases where the Helm job incorrectly tried to uninstall Tigera Operator CRDs / Calico CRDs, causing install loops or partial resource loss.
  • Calico CRD Uninstall Loop
  • Race Condition with Calico (calico-node CNI Plugin vs ClusterInformation, calico-typha vs its own ClusterRole)
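If the IPAM GC hypothesis is right, one possible guard is to slow the leak GC down so a transient stale view cannot trigger rapid bulk deletion. A sketch (assumption: your Calico version exposes `spec.controllers.node.leakGracePeriod` on the default KubeControllersConfiguration — verify the field against your version’s reference docs):

```yaml
apiVersion: projectcalico.org/v3
kind: KubeControllersConfiguration
metadata:
  name: default
spec:
  controllers:
    node:
      # Wait this long before GC'ing an allocation that looks leaked,
      # rather than deleting it on the first stale observation.
      leakGracePeriod: 15m
```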

Regards,
R2

Hi R2,

Thank you, this is extremely helpful and closely matches what we experienced.

Our incident (April 7-8):

  • RKE2 v1.27.16 → v1.28.15 upgrade completed April 3-4
  • IPAM block deletion happened April 7, 4 days AFTER the upgrade
  • All 27 IPAM blocks deleted simultaneously at 18:55:09 UTC, one every ~200ms
  • kube-controllers logs only showed it received delete events, not who initiated
  • Felix detected encapsulation change → self-restart → all 30 Typha connections dropped simultaneously → vxlan.calico DOWN

Because of the 4-day gap, the RKE2 Helm CRD uninstall hypothesis doesn’t fully fit our case; that would have caused an immediate failure during the upgrade.

The IPAM GC behavior hypothesis fits better:

  • We were on Calico v3.27.3 which you mentioned had rate-limiter and GC heuristic issues
  • tigera-operator reconcile storm was running continuously (~185,000 etcd writes/hour)
  • On April 7, etcd was visibly degraded (700ms+ requests, back-to-back compactions) in the 3 minutes before the IPAM deletion
  • Something likely triggered the GC around 18:55 UTC, possibly a watch error caused by etcd compaction
  • The v3.27.3 GC bug then bulk-deleted all 27 blocks.

Can this theory hold? Your note that you hit a similar issue again on the
v1.30 → v1.31 upgrade is concerning, since that’s our next upgrade path.
Did you find a definitive root cause, or any precautions to take before
the next upgrade?

Thanks again for sharing.