Hi R2,
Thank you, this is extremely helpful and closely matches what we experienced.
Our incident (April 7-8):
- RKE2 v1.27.16 → v1.28.15 upgrade completed April 3-4
- IPAM block deletion happened April 7, 4 days AFTER the upgrade
- All 27 IPAM blocks deleted in a single burst starting at 18:55:09 UTC, one every ~200ms
- kube-controllers logs only showed that it received delete events, not who initiated them (see the audit-log sketch after this list)
- Felix detected encapsulation change → self-restart → all 30 Typha connections dropped simultaneously → vxlan.calico DOWN
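Since kube-controllers only logged the delete events, we are now combing the kube-apiserver audit logs for the initiator. A minimal sketch of the script we're using, assuming audit logging was enabled with JSONL output; the log path is a placeholder for ours:

```python
#!/usr/bin/env python3
"""Find who deleted Calico IPAMBlock objects in kube-apiserver audit logs."""
import json
import sys

AUDIT_LOG = "/var/log/kube-apiserver/audit.log"  # placeholder path, adjust to your cluster

def find_ipamblock_deletes(path):
    """Yield (time, block name, user, user agent) for every IPAMBlock delete event."""
    with open(path) as f:
        for line in f:
            try:
                ev = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip truncated/corrupt lines
            ref = ev.get("objectRef") or {}
            if ev.get("verb") == "delete" and ref.get("resource") == "ipamblocks":
                yield (
                    ev.get("requestReceivedTimestamp"),
                    ref.get("name"),
                    (ev.get("user") or {}).get("username"),
                    ev.get("userAgent"),
                )

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else AUDIT_LOG
    for ts, name, user, agent in find_ipamblock_deletes(path):
        print(f"{ts}  {name}  deleted by {user} ({agent})")
```

If audit logging wasn't on during the incident this won't help retroactively, but we're keeping it enabled going forward.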
Because of the 4-day gap, the RKE2 Helm CRD uninstall hypothesis doesn't fully fit our case; that would have caused an immediate failure during the upgrade.
The IPAM GC behavior hypothesis fits better:
- We were on Calico v3.27.3, which you mentioned had rate-limiter and GC heuristic issues
- tigera-operator reconcile storm was running continuously (~185,000 etcd writes/hour)
- On April 7, etcd was visibly degraded (700ms+ requests, back-to-back compactions) in the 3 minutes before the IPAM deletion
- Something likely triggered the GC around 18:55 UTC, possibly a watch error caused by etcd compaction
- The v3.27.3 GC bug then bulk-deleted all 27 blocks (see the watchdog sketch below)
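One gap this exposed: nothing alerted us until routing actually broke. We're adding a crude watchdog on the IPAMBlock count; a minimal sketch, with an arbitrary poll interval and drop threshold, shelling out to kubectl to stay dependency-free:

```python
#!/usr/bin/env python3
"""Alert when the IPAMBlock count drops sharply between polls."""
import json
import subprocess
import time

POLL_SECONDS = 30      # arbitrary choice
DROP_THRESHOLD = 0.2   # warn if >20% of blocks vanish in one interval

def count_ipam_blocks():
    """Return the current number of IPAMBlock objects in the cluster."""
    out = subprocess.run(
        ["kubectl", "get", "ipamblocks.crd.projectcalico.org", "-o", "json"],
        capture_output=True, text=True, check=True,
    )
    return len(json.loads(out.stdout)["items"])

if __name__ == "__main__":
    prev = count_ipam_blocks()
    while True:
        time.sleep(POLL_SECONDS)
        cur = count_ipam_blocks()
        if prev and (prev - cur) / prev > DROP_THRESHOLD:
            print(f"ALERT: IPAMBlock count fell from {prev} to {cur}")
        prev = cur
```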
Does this chain of events sound plausible to you? Your note that you hit a similar issue again on v1.30 → v1.31 is concerning, since that's our next upgrade path. Did you find a definitive root cause, or any precautions we should take before that upgrade?
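For what it's worth, the one precaution we've settled on ourselves is snapshotting the Calico IPAM objects right before the upgrade, so the allocations are at least on file if something bulk-deletes them again. A rough sketch; the resource list and output layout are just our choices:

```python
#!/usr/bin/env python3
"""Dump Calico IPAM-related objects to timestamped YAML files."""
import pathlib
import subprocess
import time

RESOURCES = [
    "ipamblocks.crd.projectcalico.org",
    "ipamhandles.crd.projectcalico.org",
    "blockaffinities.crd.projectcalico.org",
]

outdir = pathlib.Path("ipam-backup-" + time.strftime("%Y%m%dT%H%M%SZ", time.gmtime()))
outdir.mkdir()

for res in RESOURCES:
    out = subprocess.run(
        ["kubectl", "get", res, "-o", "yaml"],
        capture_output=True, text=True, check=True,
    )
    dest = outdir / (res + ".yaml")
    dest.write_text(out.stdout)
    print(f"saved {res} -> {dest}")
```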
Thanks again for sharing.