Hi,
I am working with a Harvester / SUSE Virtualization 1.7.1 cluster and I am trying to remove a control-plane / management node following the official Harvester host removal procedure.
The node I want to remove is still shown in the cluster after several days.
Procedure followed
I followed the documented process as closely as possible:
- I cordoned the node.
- In Longhorn, I disabled storage scheduling for the node.
- I set Eviction Requested = True.
- I waited until Longhorn showed no active replicas/engines on that node.
- I put the node into Maintenance mode.
- I ran the script indicated in the Harvester documentation to stop/remove the RKE2 components:
/opt/rke2/bin/rke2-uninstall.sh
- I shut down the node.
- Finally, I removed the host from the Harvester UI.
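For reference, the commands behind those steps were roughly the following (the Longhorn and maintenance-mode steps were done through the UI, so the comments below are only a summary of what I clicked):
kubectl cordon rvpc-g02
# Longhorn UI: Node > rvpc-g02 > disable scheduling, then set Eviction Requested = True
# Harvester UI: Hosts > rvpc-g02 > Enable Maintenance Mode
/opt/rke2/bin/rke2-uninstall.sh    # executed on the node itself, per the docs
# then powered the node off and deleted the host from the Harvester UI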
Current state
After approximately 3 days, the node still appears in Harvester and in Kubernetes.
kubectl get nodes shows it as:
NotReady,SchedulingDisabled
It still has the roles:
control-plane,etcd
From what I can see:
- The node has no VMs running on it.
- Longhorn is clean for this node.
- There are no active replicas or engines on it.
- Storage scheduling is disabled.
- The remaining control-plane nodes are healthy.
- The node has already been powered off.
- In kubectl describe node, the state looks mostly as expected for a powered-off/removed node, but Kubernetes/Harvester still treats it as a control-plane/etcd member.
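For completeness, these are the kinds of checks I used to verify the above (the Longhorn CRD names are my assumption for the CLI equivalent of what the Longhorn UI shows):
kubectl get node rvpc-g02 -o wide
kubectl describe node rvpc-g02
kubectl get vmi -A -o wide | grep rvpc-g02                          # no VMIs left on the node
kubectl -n longhorn-system get nodes.longhorn.io rvpc-g02           # scheduling disabled, eviction requested
kubectl -n longhorn-system get replicas.longhorn.io | grep rvpc-g02 # no replicas left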
Question
What is the recommended recovery procedure in this situation?
Is it expected that the node remains visible for several days after being removed from the Harvester UI?
If the node is already powered off, has no VMs, no Longhorn replicas, no engines, and the remaining control-plane nodes are healthy, is it safe/recommended to remove the residual Kubernetes node object manually with:
kubectl delete node <node-name>
Or is there a Harvester-specific cleanup step that should be used instead?
I want to avoid breaking etcd quorum or leaving Harvester/RKE2 in an inconsistent state.
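In other words, if manual removal is the right call, the sequence I have in mind is the one below; the etcdctl invocation and certificate paths are my assumption of the standard RKE2 layout, not something taken from the Harvester docs:
# check the etcd member list from a healthy control-plane node first
kubectl -n kube-system exec etcd-<healthy-node> -- etcdctl \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  member list
# if the removed node is still listed, remove it: ... member remove <member-id>
# then delete the residual node object
kubectl delete node rvpc-g02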
Today I noticed something new:
Machine deletion is stuck in DrainingNode.
The blocking pod is cattle-fleet-local-system/fleet-agent.
fleet-agent is continuously recreated and scheduled to the deleting node.
The deleting node has the unschedulable, unreachable, kubevirt.io/drain, and out-of-service taints.
fleet-agent tolerates those taints.
Would adding a temporary custom NoSchedule taint not tolerated by fleet-agent, then deleting the fleet-agent pod, be an acceptable recovery?
Or is there a Harvester-supported cleanup for this Machine drain loop?
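Concretely, the workaround I have in mind looks like this; the taint key is a placeholder I would pick, and the app=fleet-agent label is my assumption about how the fleet-agent pods are labelled:
kubectl taint node rvpc-g02 fleet-drain-workaround=true:NoSchedule
kubectl -n cattle-fleet-local-system delete pod -l app=fleet-agent
# once the Machine leaves DrainingNode, remove the temporary taint again
kubectl taint node rvpc-g02 fleet-drain-workaround-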
Additional finding:
The node removal seems to be stuck during the Machine drain phase.
The Machine object is:
Namespace: fleet-local
Machine: custom-5f61d9859ecf
Node: rvpc-g02
Phase: Deleting
Stage: DrainingNode
The Machine condition shows:
Drain not completed yet:
Pod cattle-fleet-local-system/fleet-agent-... deletionTimestamp set, but still not removed from the Node
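For reference, this is how I am inspecting the stuck Machine (CAPI resource names as they appear in my cluster):
kubectl -n fleet-local get machines.cluster.x-k8s.io
kubectl -n fleet-local describe machine custom-5f61d9859ecf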
The fleet-agent Deployment is not pinned to rvpc-g02:
replicas: 1
nodeSelector:
  kubernetes.io/os: linux
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - matchExpressions:
      - key: fleet.cattle.io/agent
        operator: In
        values:
        - "true"
The taints currently on the node are:
node.kubernetes.io/unschedulable
node.kubernetes.io/unreachable
node.kubernetes.io/out-of-service
kubevirt.io/drain
The current hypothesis is that the Machine controller drains the node, the fleet-agent pod is deleted, but the Deployment recreates it and the scheduler assigns it again to rvpc-g02, so the Machine stays stuck in DrainingNode.
Is there a recommended Harvester/SUSE Virtualization recovery procedure for this specific drain loop?
Would temporarily preventing new pods from being scheduled on the deleting node, or temporarily scaling the fleet-agent Deployment, be considered safe/supported? Or should this be handled by a Harvester controller automatically?
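If scaling is an acceptable approach, the version of that workaround I am considering is simply the following (Deployment name and namespace taken from my cluster):
kubectl -n cattle-fleet-local-system scale deployment fleet-agent --replicas=0
# wait for the Machine to finish DrainingNode and for the node object to go away
kubectl -n cattle-fleet-local-system scale deployment fleet-agent --replicas=1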