Tigera-operator v1.34.5 controller_installation reconcile loop + Typha watchercache bug still present after RKE2 v1.28.15 upgrade: seeking fix

Hi R2,

Thank you, this is extremely helpful and closely matches what we experienced.

Our incident (April 7-8):

  • RKE2 v1.27.16 → v1.28.15 upgrade completed April 3-4
  • IPAM block deletion happened April 7, 4 days AFTER the upgrade
  • All 27 IPAM blocks were deleted in a single burst starting at 18:55:09 UTC, one every ~200ms
  • kube-controllers logs only showed that it received the delete events, not who initiated them
  • Felix detected encapsulation change → self-restart → all 30 Typha connections dropped simultaneously → vxlan.calico DOWN

Because of the 4-day gap, the RKE2 Helm CRD uninstall hypothesis doesn’t fully fit our case: that would have caused an immediate failure during the upgrade.

The IPAM GC behavior hypothesis fits better:

  • We were on Calico v3.27.3, which you mentioned had rate-limiter and GC heuristic issues
  • The tigera-operator reconcile storm was running continuously (~185,000 etcd writes/hour)
  • On April 7, etcd was visibly degraded (700ms+ requests, back-to-back compactions) in the 3 minutes before the IPAM deletion
  • Something likely triggered the GC around 18:55 UTC, possibly a watch error caused by etcd compaction
  • The v3.27.3 GC bug then bulk-deleted all 27 blocks

Is this sequence plausible? Your note that you hit a similar issue again on v1.30 → v1.31 is concerning, since that’s our next upgrade path. Did you find a definitive root cause, or are there any precautions to take before the next upgrade?
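
For scale, here is a quick back-of-envelope sketch converting the two figures mentioned above (~185,000 etcd writes/hour, and 27 deletions spaced ~200ms apart) into a per-second rate and a burst duration. The function names are ours, for illustration only, and the inputs are the approximate values from our logs, so treat the results as rough.

```python
# Back-of-envelope conversions for the approximate figures quoted in this post.
WRITES_PER_HOUR = 185_000   # approximate etcd writes/hour during the reconcile storm
BLOCKS_DELETED = 27         # IPAM blocks deleted in the burst
DELETE_INTERVAL_S = 0.2     # approximate spacing between deletions (~200ms)


def writes_per_second(writes_per_hour: float) -> float:
    """Convert an hourly write count into an average per-second rate."""
    return writes_per_hour / 3600


def burst_duration_s(count: int, interval_s: float) -> float:
    """Time from first to last event: `count` events have `count - 1` gaps."""
    return (count - 1) * interval_s


if __name__ == "__main__":
    print(f"{writes_per_second(WRITES_PER_HOUR):.1f} etcd writes/s on average")
    print(f"{burst_duration_s(BLOCKS_DELETED, DELETE_INTERVAL_S):.1f} s to delete all blocks")
```

So the storm averaged roughly 51 writes per second, and the whole IPAM deletion took only about 5 seconds from first block to last.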

Thanks again for sharing.