Tigera-operator v1.34.5 controller_installation reconcile loop + Typha watchercache bug still present after RKE2 v1.28.15 upgrade: seeking fix

Hi team,

We upgraded our RKE2 cluster from v1.27.16+rke2r2 to v1.28.15+rke2r1
(Calico v3.27.3 → v3.28.2, tigera-operator v1.32.7 → v1.34.5) and are
hitting two unresolved upstream bugs.

Looking for guidance on how to solve this on RKE2/Calico.


Cluster:

  • 30 nodes (3 masters, 27 workers)
  • RKE2: v1.28.15+rke2r1
  • Calico: v3.28.2
  • tigera-operator: v1.34.5
  • OS: Ubuntu 24.04
  • CNI: Calico VXLAN

Bug 1 — tigera-operator controller_installation reconcile loop

tigera-operator reconciles every ~1 second continuously and never
converges. Confirmed in both v1.32.7 (bundled with RKE2 v1.27.16)
and v1.34.5 (bundled with RKE2 v1.28.15).

Log pattern (repeating every 1s indefinitely):
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-node"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-cni-plugin"}
{"logger":"controller_installation","msg":"Reconciling Installation.operator.tigera.io","Request.Name":"calico-kube-controllers"}

No errors logged between reconcile messages. Installation CR spec
and status.computed are identical, no visible drift.

We tried --manage-crds=false, but the loop continued, so it’s not CRD-related.

This causes ~3000-3200 etcd writes/min, which caused the Typha watchercache
to fall behind etcd compaction, leading to a cluster-wide network outage.

Current mitigation: tigera-operator scaled to 0 replicas.

Filed: github.com/tigera/operator/issues/4615

Questions:

  • Is this fixed in a specific tigera-operator version?
  • Which RKE2 version bundles a fixed tigera-operator?

Bug 2 — Typha watchercache "too old resource version" on /calico/ipam/v2/host/

Still hitting this in Calico v3.28.2 (we thought libcalico-go PR #9690
had fixed it):

watchercache.go 125: Watch error received from Upstream ListRoot="/calico/ipam/v2/host/" error=too old resource version: 869060569 (1075704479)

watchercache.go 181: Full resync is required ListRoot="/calico/ipam/v2/host/"
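To quantify how far the cache falls behind, the two revision numbers in these errors can be parsed and diffed (a small sketch; the regex assumes the exact log format shown above):

```python
import re

# Matches the "too old resource version: <cached> (<current>)" tail of
# a Typha/Felix watchercache error line.
LINE_RE = re.compile(r'too old resource version: (\d+) \((\d+)\)')

def revision_lag(log_line):
    """Return (cached_rev, current_rev, lag) for a watchercache
    'too old resource version' line, or None if it doesn't match."""
    m = LINE_RE.search(log_line)
    if not m:
        return None
    cached, current = int(m.group(1)), int(m.group(2))
    return cached, current, current - cached

line = ('watchercache.go 125: Watch error received from Upstream '
        'ListRoot="/calico/ipam/v2/host/" '
        'error=too old resource version: 869060569 (1075704479)')
print(revision_lag(line))  # (869060569, 1075704479, 206643910)
```

Run over a day of Typha logs, this makes it easy to see whether the lag is growing between resyncs.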

Occurs every ~5-10 minutes. Current mitigation: a daily Typha restart
via crontab.
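For reference, the mitigation cron entry looks roughly like this (a sketch, not our exact file; the schedule and kubeconfig path are illustrative, and it assumes operator-managed Typha as a `calico-typha` deployment in the `calico-system` namespace):

```
# /etc/cron.d/calico-typha-restart
# System crontab format: minute hour dom month dow user command.
# Rolling restart avoids dropping all Typha connections at once.
30 4 * * * root kubectl --kubeconfig=/etc/rancher/rke2/rke2.yaml \
  -n calico-system rollout restart deployment/calico-typha
```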

Shaun Crampton from Tigera mentioned v3.31 includes improved Typha
reconnection logic that restarts the Typha connection without
restarting the dataplane.

Questions:

  • Is the /calico/ipam/v2/host/ watchercache path fixed in v3.29+?
  • Which RKE2 version bundles Calico v3.31?

Bug 3 — Unknown actor deleted all 27 IPAM blocks simultaneously

On April 7, all 27 IPAM blocks were deleted from etcd at 18:55:09 UTC,
one every ~200ms:

18:55:09.686 ipam.go 511: Received block delete block="10.42.1.0/24"
18:55:09.888 ipam.go 511: Received block delete block="10.42.111.0/24"
… all 27 blocks within 6 seconds

This triggered RouteRemove on all 30 Felix nodes simultaneously →
Felix restarted → all Typha connections dropped → grace period expired
→ vxlan.calico DOWN → 5h48m outage.

kube-controllers log only shows it received the delete events —
not who initiated them. tigera-operator was scaled to 0 at the time.

Question:

  • What component in Calico v3.27.3 could delete all IPAM blocks
    simultaneously? Is this a known bug fixed in v3.28+?
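Since kube-controllers only logs that it received the deletes, we are planning to capture the initiating identity next time via API-server auditing. A sketch of the policy fragment (assumption: with the Kubernetes datastore, as on RKE2, IPAM blocks are `ipamblocks.crd.projectcalico.org` objects; the resource list here is our guess at the relevant ones):

```yaml
# Audit policy fragment: record who deletes Calico IPAM resources.
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: RequestResponse
    verbs: ["delete", "deletecollection"]
    resources:
      - group: "crd.projectcalico.org"
        resources: ["ipamblocks", "blockaffinities", "ipamhandles"]
```

The audit log's `user` and `userAgent` fields would then identify the actor behind any future bulk deletion.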

Summary of what we need:

  1. Which RKE2 version bundles a tigera-operator with the
    controller_installation reconcile loop fixed?
  2. Which RKE2 version bundles Calico v3.31 (Typha reconnect fix)?
  3. Any insight on the simultaneous IPAM block deletion?

Thanks


Hi,

We hit a problem with IPAM block removal similar to the one you describe.
We upgraded our RKE2 cluster from:
rke2 v1.30.11+rke2r1
calico v3.29.2
tigera-operator v1.36.5
to:
rke2 v1.31.14+rke2r1
calico v3.30.4
tigera-operator v1.38.7

In our case, the IP pool configuration was completely removed and the cluster upgrade stalled. The logs showed vxlan.calico "Link DOWN" entries.
The calico-kube-controllers-* pod was stuck in a crash/restart loop because the IPAM plugin reported: "cannot find a qualified ippool".
However, we still do not know the root cause of this behavior.

After diagnosis and extensive research, we can only offer assumptions about the probable causes of what happened:

  • IPAM GC behavior (the component that can delete IPAMBlocks): This runs in calico-kube-controllers and cleans up orphaned/leaked blocks during node events or controller restarts. Earlier versions (e.g., 3.27.x) had rate-limiter and GC heuristic issues that could lead to rapid or stalled bulk deletions. These were improved in 3.28+ and further refined in 3.29/3.30.
  • RKE2 Calico Helm chart upgrade problems: Older RKE2 upgrades (e.g., 1.28.x → 1.29.x) had cases where the Helm job incorrectly tried to uninstall Tigera Operator CRDs / Calico CRDs, causing install loops or partial resource loss.
  • Calico CRD Uninstall Loop
  • Race Condition with Calico (calico-node CNI Plugin vs ClusterInformation, calico-typha vs its own ClusterRole)
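If the IPAM GC hypothesis is right, one possible guard is to slow the leak GC down so a transient stale view cannot trigger rapid bulk deletion. A sketch (assumption: your Calico version exposes `spec.controllers.node.leakGracePeriod` on the default KubeControllersConfiguration — verify the field against your version’s reference docs):

```yaml
apiVersion: projectcalico.org/v3
kind: KubeControllersConfiguration
metadata:
  name: default
spec:
  controllers:
    node:
      # Wait this long before GC'ing an allocation that looks leaked,
      # rather than deleting it on the first stale observation.
      leakGracePeriod: 15m
```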

Regards,
R2

Hi R2,

Thank you, this is extremely helpful and closely matches what we experienced.

Our incident (April 7-8):

  • RKE2 v1.27.16 → v1.28.15 upgrade completed April 3-4
  • IPAM block deletion happened April 7, 4 days AFTER the upgrade
  • All 27 IPAM blocks deleted simultaneously at 18:55:09 UTC, one every ~200ms
  • kube-controllers logs only showed it received delete events, not who initiated
  • Felix detected encapsulation change → self-restart → all 30 Typha connections dropped simultaneously → vxlan.calico DOWN

Because of the 4-day gap, the RKE2 Helm CRD uninstall hypothesis doesn’t fully fit our case; that would have caused an immediate failure during the upgrade.

The IPAM GC behavior hypothesis fits better:

  • We were on Calico v3.27.3 which you mentioned had rate-limiter and GC heuristic issues
  • tigera-operator reconcile storm was running continuously (~185,000 etcd writes/hour)
  • On April 7, etcd was visibly degraded (700ms+ requests, back-to-back compactions) in the 3 minutes before the IPAM deletion
  • Something likely triggered the GC around 18:55 UTC, possibly a watch error caused by etcd compaction
  • The v3.27.3 GC bug then bulk-deleted all 27 blocks.

Can this theory hold? Your note that you hit a similar issue again on the
v1.30 → v1.31 upgrade is concerning, since that’s our next upgrade path.
Did you find a definitive root cause, or any precautions to take before
the next upgrade?

Thanks again for sharing.