Hi,
I am working with a Harvester / SUSE Virtualization 1.7.1 cluster and I am trying to remove a control-plane / management node following the official Harvester host removal procedure.
The node I want to remove is still shown in the cluster after several days.
Procedure followed
I followed the documented process as closely as possible:
- I cordoned the node.
- In Longhorn, I disabled storage scheduling for the node.
- I set Eviction Requested = True.
- I waited until Longhorn showed no active replicas/engines on that node.
- I put the node into Maintenance mode.
- I ran the script indicated in the Harvester documentation to stop/remove the RKE2 components:
/opt/rke2/bin/rke2-uninstall.sh
- I shut down the node.
- Finally, I removed the host from the Harvester UI.
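For reference, the commands behind those steps were roughly the following (the Longhorn and maintenance-mode steps were done through the UI, so the comments below are only a summary of what I clicked):
kubectl cordon rvpc-g02
# Longhorn UI: Node > rvpc-g02 > disable scheduling, then set Eviction Requested = True
# Harvester UI: Hosts > rvpc-g02 > Enable Maintenance Mode
/opt/rke2/bin/rke2-uninstall.sh    # executed on the node itself, per the docs
# then powered the node off and deleted the host from the Harvester UI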
Current state
After approximately 3 days, the node still appears in Harvester and in Kubernetes.
kubectl get nodes shows it as:
NotReady,SchedulingDisabled
It still has the roles:
control-plane,etcd
From what I can see:
- The node has no VMs running on it.
- Longhorn is clean for this node.
- There are no active replicas or engines on it.
- Storage scheduling is disabled.
- The remaining control-plane nodes are healthy.
- The node has already been powered off.
- In kubectl describe node, the state looks mostly as expected for a powered-off/removed node, but Kubernetes/Harvester still treats it as a control-plane/etcd member.
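For completeness, these are the kinds of checks I used to verify the above (the Longhorn CRD names are my assumption for the CLI equivalent of what the Longhorn UI shows):
kubectl get node rvpc-g02 -o wide
kubectl describe node rvpc-g02
kubectl get vmi -A -o wide | grep rvpc-g02                          # no VMIs left on the node
kubectl -n longhorn-system get nodes.longhorn.io rvpc-g02           # scheduling disabled, eviction requested
kubectl -n longhorn-system get replicas.longhorn.io | grep rvpc-g02 # no replicas left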
Question
What is the recommended recovery procedure in this situation?
Is it expected that the node remains visible for several days after being removed from the Harvester UI?
If the node is already powered off, has no VMs, no Longhorn replicas, no engines, and the remaining control-plane nodes are healthy, is it safe/recommended to remove the residual Kubernetes node object manually with:
kubectl delete node <node-name>
Or is there a Harvester-specific cleanup step that should be used instead?
I want to avoid breaking etcd quorum or leaving Harvester/RKE2 in an inconsistent state.
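In other words, if manual removal is the right call, the sequence I have in mind is the one below; the etcdctl invocation and certificate paths are my assumption of the standard RKE2 layout, not something taken from the Harvester docs:
# check the etcd member list from a healthy control-plane node first
kubectl -n kube-system exec etcd-<healthy-node> -- etcdctl \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  member list
# if the removed node is still listed, remove it: ... member remove <member-id>
# then delete the residual node object
kubectl delete node rvpc-g02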
Today I noticed something new:
Machine deletion is stuck in DrainingNode.
The blocking pod is cattle-fleet-local-system/fleet-agent.
fleet-agent is continuously recreated and scheduled to the deleting node.
The deleting node has the unschedulable, unreachable, kubevirt.io/drain, and out-of-service taints.
fleet-agent tolerates those taints.
Would adding a temporary custom NoSchedule taint not tolerated by fleet-agent, then deleting the fleet-agent pod, be an acceptable recovery?
Or is there a Harvester-supported cleanup for this Machine drain loop?
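Concretely, the workaround I have in mind looks like this; the taint key is a placeholder I would pick, and the app=fleet-agent label is my assumption about how the fleet-agent pods are labelled:
kubectl taint node rvpc-g02 fleet-drain-workaround=true:NoSchedule
kubectl -n cattle-fleet-local-system delete pod -l app=fleet-agent
# once the Machine leaves DrainingNode, remove the temporary taint again
kubectl taint node rvpc-g02 fleet-drain-workaround-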
Additional finding:
The node removal seems to be stuck during the Machine drain phase.
The Machine object is:
Namespace: fleet-local
Machine: custom-5f61d9859ecf
Node: rvpc-g02
Phase: Deleting
Stage: DrainingNode
The Machine condition shows:
Drain not completed yet:
Pod cattle-fleet-local-system/fleet-agent-... deletionTimestamp set, but still not removed from the Node
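For reference, this is how I am inspecting the stuck Machine (CAPI resource names as they appear in my cluster):
kubectl -n fleet-local get machines.cluster.x-k8s.io
kubectl -n fleet-local describe machine custom-5f61d9859ecf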
The fleet-agent Deployment is not pinned to rvpc-g02:
replicas: 1
nodeSelector:
  kubernetes.io/os: linux
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - matchExpressions:
      - key: fleet.cattle.io/agent
        operator: In
        values:
        - "true"
The taints currently on the node are:
node.kubernetes.io/unschedulable
node.kubernetes.io/unreachable
node.kubernetes.io/out-of-service
kubevirt.io/drain
The current hypothesis is that the Machine controller drains the node, the fleet-agent pod is deleted, but the Deployment recreates it and the scheduler assigns it again to rvpc-g02, so the Machine stays stuck in DrainingNode.
Is there a recommended Harvester/SUSE Virtualization recovery procedure for this specific drain loop?
Would temporarily preventing new pods from being scheduled on the deleting node, or temporarily scaling the fleet-agent Deployment, be considered safe/supported? Or should this be handled by a Harvester controller automatically?
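If scaling is an acceptable approach, the version of that workaround I am considering is simply the following (Deployment name and namespace taken from my cluster):
kubectl -n cattle-fleet-local-system scale deployment fleet-agent --replicas=0
# wait for the Machine to finish DrainingNode and for the node object to go away
kubectl -n cattle-fleet-local-system scale deployment fleet-agent --replicas=1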