So we have four Harvester nodes, and each node has four 8 TB NVMe disks assigned to the same Longhorn storage group in Harvester. The replica count is set to 3. What would you say the maximum size for a local VM disk would be for this Harvester cluster? Assume automatic snapshots are disabled, and assume we are not copying the disks locally for backup. Also, please assume one VM will run on this cluster.
Harvester uses Longhorn for its storage. The "Space Configuration Suggestions for Volumes" section of the Longhorn docs [1] has the following formula:
A general estimation for the maximum space consumption of a volume is
(1 + 1 + 1 + snapshot count) x head/snapshot average actual size
My understanding is that even if you have 0 user snapshots, temporary system snapshots are still created when rebuilding volumes. So the worst-case situation for a full volume would be 3x the actual size for a single replica.
From the Longhorn docs [1]:
- The worst case that leads to so much space usage:
- At some point the 1st rebuilding/expansion is triggered, which leads to the 1st system snapshot creation.
- The snapshot purges before and after the 1st rebuilding do nothing.
- There is data written to the new volume head, and the 2nd rebuilding/expansion is somehow triggered.
- The snapshot purge before the 2nd rebuilding may lead to the shrink of the 1st system snapshot.
- Then the 2nd system snapshot is created and the rebuilding is started.
- After the rebuilding is done, the subsequent snapshot purge leads to the coalescing of the 2 system snapshots. This coalescing requires temporary space.
- During the snapshot purging that follows the 2nd rebuilding, more data is written to the new volume head.
- The explanation of the formula:
- The 1st 1 means the volume head.
- The 2nd 1 is the second system snapshot mentioned in the worst case.
- The 3rd 1 is for the temporary space that may be required by the 2 system snapshot purge/coalescing.
If you are planning to avoid a worst-case scenario with a single full VM volume (a bit of an edge case, I think), I would guess you can consider 2.5 TB a safe maximum.
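To make the arithmetic explicit, here is a rough sketch of that estimate (assuming zero user snapshots and a fully written volume head; the 8 TB figure comes from the cluster described above):

```python
# Rough worst-case sizing sketch based on the Longhorn formula in [1]:
#   max consumption = (1 + 1 + 1 + snapshot count) x head/snapshot average actual size
# Assumes zero user-created snapshots and a fully written volume head.

DISK_SIZE_TB = 8       # smallest NVMe disk backing a Longhorn replica
USER_SNAPSHOTS = 0     # automatic snapshots disabled per the question

worst_case_multiplier = 1 + 1 + 1 + USER_SNAPSHOTS  # head + 2nd system snapshot + coalesce space
max_volume_size_tb = DISK_SIZE_TB / worst_case_multiplier

print(f"Worst-case multiplier: {worst_case_multiplier}x")
print(f"Max safe volume size on one 8 TB disk: {max_volume_size_tb:.2f} TB")
# ~2.67 TB, which is why ~2.5 TB is a conservative ceiling. The replica count of 3
# does not change this per-disk math: each replica is a full copy that must fit
# on a single disk of a different node.
```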
Thank you, Philip. This matches what we have found in our testing, and we feel SUSE Virtualization should document this situation better; perhaps the upstream product should consider a formula that warns the user when a Harvester/Longhorn disk configuration, combined with a given number of VM disks at a given size, will not be sustainable. Many of our production VMs running under SLE 15 KVM on local RAID10 have local disks above 1 TB, and they perform extremely well on modern hardware. Migrating these very same VMs en masse into a new 1.5.x Harvester/SUSE Virtualization cluster sized to match CPU cores, RAM, and total available Longhorn disk space works initially, but quickly leads to significant disk corruption and multiple issues due to lack of space (because the VM disks in our testing had a very high percentage of modifications and changes). I would appreciate any thoughts/guidance on how we should be configuring these nodes and which trade-offs we should simply accept if we want to migrate to SUSE Virtualization/Harvester from straight KVM.
For future searchers with similar issues: in general, the maximum size of your VM disks on Harvester/Longhorn really depends on how much of your VM's data changes, because that drives the space required for rebuilding with temporary snapshots. If you have VMs where a very large percentage of the data is being modified, the required space can outgrow the largest available NVMe disk on your Harvester node (this is what I believe the 25% warning per disk in Harvester is for, but it's a general warning). As the Harvester admin, you really need to understand how much of your VM disk data is changing over time. You can manually move or delete other replicas or other VM disks (as we have done, but it's ugly and requires a significant amount of time to do properly, and even then, for two test VMs with 5 TB disks, we had to fully restore them). Eventually, I feel this should be automated in Harvester's disk control, to the point where the admin has the option to enable automatic VM pausing/live migration to other nodes to prevent disk corruption.
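To illustrate what I mean by churn-driven sizing, here is a purely hypothetical back-of-the-envelope check (not a Harvester feature; the function name, the numbers, and the "head plus two copies of the churned data" reading of the worst-case formula in [1] are all my own simplifications):

```python
# Hypothetical pre-migration sanity check, not a Harvester/Longhorn tool.
# Idea: the space a replica can need during rebuilds grows with how much of
# the volume's data is rewritten (churn), since snapshots only hold changed blocks.

def fits_on_disk(volume_size_tb: float, churn_fraction: float,
                 disk_size_tb: float) -> bool:
    """Crude worst case: volume head + 2 system-snapshot copies of the churned data."""
    worst_case_tb = volume_size_tb + 2 * (volume_size_tb * churn_fraction)
    return worst_case_tb <= disk_size_tb

# A 5 TB VM disk with heavy churn blows past an 8 TB NVMe; a 1 TB disk never does.
print(fits_on_disk(5.0, churn_fraction=0.9, disk_size_tb=8.0))  # False (5 + 9 = 14 TB)
print(fits_on_disk(1.0, churn_fraction=1.0, disk_size_tb=8.0))  # True  (1 + 2 = 3 TB)
```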
For the disk corruption and other issues, I would recommend creating a support ticket [1] if you have SUSE support. I may be misunderstanding your setup, but if you are running Longhorn on RAID10 disks, then you are effectively layering one replicated storage system on top of another. I wouldn't be surprised if that increased data corruption, because all the disk writes would be compounded. It is recommended to run Longhorn directly on physical disks.
I would also check out the best practices for Longhorn here [2]. You can reduce the replica count to 2 for better performance, and run recurring jobs for cleaning up snapshots (see the sketch after the references).
[1] https://scc.suse.com/home
[2] https://longhorn.io/docs/1.9.1/best-practices/
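As a concrete example of the recurring cleanup idea, here is a minimal sketch of creating a Longhorn RecurringJob with the `snapshot-cleanup` task through the official `kubernetes` Python client; the hourly cron schedule, the `default` group, and the concurrency value are assumptions to adjust for your cluster:

```python
# Minimal sketch: create a Longhorn RecurringJob that periodically purges
# removable/system snapshots. Equivalent to applying the manifest with kubectl.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when run inside the cluster

recurring_job = {
    "apiVersion": "longhorn.io/v1beta2",
    "kind": "RecurringJob",
    "metadata": {"name": "snapshot-cleanup", "namespace": "longhorn-system"},
    "spec": {
        "name": "snapshot-cleanup",
        "task": "snapshot-cleanup",  # purge removable snapshots, incl. system snapshots
        "cron": "0 * * * *",         # hourly (assumption; tune to your churn)
        "retain": 0,                 # not used by cleanup tasks
        "concurrency": 2,            # how many volumes are processed in parallel
        "groups": ["default"],       # applies to volumes in the default group
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="longhorn.io", version="v1beta2", namespace="longhorn-system",
    plural="recurringjobs", body=recurring_job,
)
```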
Hi Philip, I am allowing Harvester/Longhorn to use the entire NVMe (all four of them) in each node, as it wants to. There is no RAID in between; did I imply this in my original post somehow? I apologize if I did. It seems that when you let Harvester/Longhorn manage the NVMes directly, you are limited by the smallest NVMe disk size and the formula you have nicely documented as the worst case for file size. The number of changes and modifications on the VM disk itself will determine the actual "worst case" requirement, which can be difficult to calculate if the administrator isn't tracking such things prior to migrating VMs into Harvester/Longhorn.
[edit] - Oh! I see your confusion. You are misunderstanding my reply. I am comparing our production SLE 15 KVM single server, with RAID10 and XFS for the local VM disk images, to a 1.5 Harvester/Longhorn four-node cluster on a dedicated 25 GbE network. Same exact virtual machines, on new-generation hardware, on the four-node Harvester cluster as configured. We haven't experienced any disk corruption on SLE 15 KVM alone in a decade. It happened on day three of testing on Harvester.
Yeah, sorry, I just wasn’t sure if you were using RAID for the Longhorn setup or not.
It might be good to know exactly what you mean by disk corruption. Is it at the physical disk level or at the Longhorn level?
- Are you seeing degraded replicas? Are volumes failing to mount?
- Is it at the disk level, and do you find yourself running xfs_repair often?
- Are there specific I/O error messages in the logs?
I agree that it is tricky to plan out your storage, because you do need to allow extra space for the data management overhead, and that is somewhat dependent on knowing the average actual size of a Longhorn volume.
I have seen more issues with Longhorn when the disks are almost full. In my opinion, you need to set aside quite a bit of extra space to allow for data management operations, but it is hard to determine exactly how much. The formula from the Longhorn documentation is only for a single volume; once you have multiple volumes and their replicas being scheduled, it can be tricky to estimate accurately. Perhaps you can increase the storage-minimal-available-percentage setting to ensure you have enough “buffer” space.
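As a rough illustration of that buffer, here is a quick sketch assuming the documented scheduling behavior, where a disk stops accepting new replicas once its available space falls below the configured percentage (the numbers are just examples):

```python
# Back-of-the-envelope view of the Longhorn scheduling "buffer": new replicas are
# not scheduled to a disk once its available space drops below
# storage-minimal-available-percentage of the disk's capacity.

DISK_SIZE_TB = 8
minimal_available_pct = 25  # Longhorn default; raising it enlarges the buffer

buffer_tb = DISK_SIZE_TB * minimal_available_pct / 100
schedulable_tb = DISK_SIZE_TB - buffer_tb
print(f"Reserved buffer per disk: {buffer_tb:.1f} TB; schedulable: {schedulable_tb:.1f} TB")
# Raising the percentage to, say, 40 keeps more headroom for rebuild/coalesce
# space at the cost of usable capacity (40% -> 3.2 TB reserved per 8 TB disk).
```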
By corruption, I mean corruption of the Longhorn disk assigned to the VM (red alerts, you can’t do anything, to the point where you need to delete it out of Longhorn and re-create it). This seems to only happen with disks above 5 TB assigned to VMs, again depending entirely on how much data changes for those “hidden” snapshot items you discussed. For our configuration, we have 140 nodes configured exactly as described in my original post. We have found that under Harvester 1.6.x, so far there is no Harvester/Longhorn corruption when you assign disks of 1 TB or under to VMs. As long as the VM is able to span multiple disks for the specific use case, this has worked great. We never get to the 25% warning on each NVMe, but I suspect the 5 TB+ Longhorn disk becomes “full” if the VM disk has a LOT of changes that need to be tracked. I hope I am making sense here. So far, decreasing to 1 TB groups has eliminated this.
[edit] I think the 25% warning on each NVMe (each Longhorn disk) should account for this hidden snapshot data. We don’t think it is accounting for this, and an earlier warning here may solve the problem for those with large VM disk files on Longhorn, especially when multiple NVMes are assigned to Longhorn on each node. When you look at the entire available pool of many dozens of TB in a clean, shiny new Harvester cluster, the admin’s “instinct” is to just assign each VM whatever it needs in one disk. There is no note or warning describing how you need to account for replicas and hidden snapshots. The documentation does help a bit, but I think the interface should be a little stricter. Or perhaps we’re the only ones trying to use Harvester like we would VMware/ESX.