Hi R2,
Thank you, this is extremely helpful and closely matches what we experienced.
Our incident (April 7-8):
- RKE2 v1.27.16 → v1.28.15 upgrade completed April 3-4
- IPAM block deletion happened April 7, 4 days AFTER the upgrade
- All 27 IPAM blocks deleted in a single burst starting at 18:55:09 UTC, one every ~200ms
- kube-controllers logs only showed that it received delete events, not who initiated them (see the audit-log sketch after this list)
- Felix detected encapsulation change → self-restart → all 30 Typha connections dropped simultaneously → vxlan.calico DOWN
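Since kube-controllers only logged the delete events, we are now combing the kube-apiserver audit logs for the initiator. A minimal sketch of the script we're using, assuming audit logging was enabled with JSONL output; the log path is a placeholder for ours:

```python
#!/usr/bin/env python3
"""Find who deleted Calico IPAMBlock objects in kube-apiserver audit logs."""
import json
import sys

AUDIT_LOG = "/var/log/kube-apiserver/audit.log"  # placeholder path, adjust to your cluster

def find_ipamblock_deletes(path):
    """Yield (time, block name, user, user agent) for every IPAMBlock delete event."""
    with open(path) as f:
        for line in f:
            try:
                ev = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip truncated/corrupt lines
            ref = ev.get("objectRef") or {}
            if ev.get("verb") == "delete" and ref.get("resource") == "ipamblocks":
                yield (
                    ev.get("requestReceivedTimestamp"),
                    ref.get("name"),
                    (ev.get("user") or {}).get("username"),
                    ev.get("userAgent"),
                )

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else AUDIT_LOG
    for ts, name, user, agent in find_ipamblock_deletes(path):
        print(f"{ts}  {name}  deleted by {user} ({agent})")
```

If audit logging wasn't on during the incident this won't help retroactively, but we're keeping it enabled going forward.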
Because of the 4-day gap, the RKE2 Helm CRD uninstall hypothesis doesn't fully fit our case; that would have caused an immediate failure during the upgrade.
The IPAM GC behavior hypothesis fits better:
- We were on Calico v3.27.3, which you mentioned had rate-limiter and GC heuristic issues
- tigera-operator reconcile storm was running continuously (~185,000 etcd writes/hour)
- On April 7, etcd was visibly degraded (700ms+ requests, back-to-back compactions) in the 3 minutes before the IPAM deletion
- Something likely triggered the GC around 18:55 UTC, possibly a watch error caused by etcd compaction
- The v3.27.3 GC bug then bulk-deleted all 27 blocks (see the watchdog sketch below)
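One gap this exposed: nothing alerted us until routing actually broke. We're adding a crude watchdog on the IPAMBlock count; a minimal sketch, with an arbitrary poll interval and drop threshold, shelling out to kubectl to stay dependency-free:

```python
#!/usr/bin/env python3
"""Alert when the IPAMBlock count drops sharply between polls."""
import json
import subprocess
import time

POLL_SECONDS = 30      # arbitrary choice
DROP_THRESHOLD = 0.2   # warn if >20% of blocks vanish in one interval

def count_ipam_blocks():
    """Return the current number of IPAMBlock objects in the cluster."""
    out = subprocess.run(
        ["kubectl", "get", "ipamblocks.crd.projectcalico.org", "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    return len(json.loads(out.stdout)["items"])

if __name__ == "__main__":
    prev = count_ipam_blocks()
    while True:
        time.sleep(POLL_SECONDS)
        cur = count_ipam_blocks()
        if prev and (prev - cur) / prev > DROP_THRESHOLD:
            print(f"ALERT: IPAMBlock count fell from {prev} to {cur}")
        prev = cur
```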
Does this chain of events sound plausible to you? Your note that you hit a similar issue again on v1.30 → v1.31 is concerning, since that's our next upgrade path. Did you find a definitive root cause, or any precautions we should take before that upgrade?
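For what it's worth, the one precaution we've settled on ourselves is snapshotting the Calico IPAM objects right before the upgrade, so the allocations are at least on file if something bulk-deletes them again. A rough sketch; the resource list and output layout are just our choices:

```python
#!/usr/bin/env python3
"""Dump Calico IPAM-related objects to timestamped YAML files."""
import pathlib
import subprocess
import time

RESOURCES = [
    "ipamblocks.crd.projectcalico.org",
    "ipamhandles.crd.projectcalico.org",
    "blockaffinities.crd.projectcalico.org",
]

outdir = pathlib.Path("ipam-backup-" + time.strftime("%Y%m%dT%H%M%SZ", time.gmtime()))
outdir.mkdir()

for res in RESOURCES:
    out = subprocess.run(
        ["kubectl", "get", res, "-o", "yaml"],
        capture_output=True, text=True, check=True,
    )
    dest = outdir / (res + ".yaml")
    dest.write_text(out.stdout)
    print(f"saved {res} -> {dest}")
```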
Thanks again for sharing.