Optimizing for Virtualization, Part 2
by Liz van Dijk on June 29, 2009 12:00 AM EST - Posted in IT Computing
Last but not least in our discussion is the use of the proper software, and its configuration. ESX offers a veritable waterfall of settings for those willing to dig in and tweak them, but they should definitely be used with care. Furthermore, it holds quite a few surprises, both good and bad (like the sequential read performance drop discussed earlier), that warrant a closer look for heavy consolidators.
First of all, though we mentioned before that the Monitor remains the same across VMware's virtualization products, not all of them can be put to the same use. VMware Server and Workstation, sturdy products though they may be, are in no way meant to rival ESX in performance and scalability, and yet perfectly viable testing setups are quite often discarded because of their seemingly inferior performance. Hosted virtualization products are forced to comply with the existing OS's scheduling mechanics, which makes them adequate for proof-of-concept setups and development sandboxes, but not at all meant as a high-performance alternative to running natively.
Secondly, there are some very important choices to make when installing and running new VMs. Though the "Guest OS" drop-down list when setting up a new VM may seem like an unnecessary extra, it actually determines the choice of monitor type and a plethora of optimizations and settings, including the choice of storage adapter and the specific type of VMware Tools that will be installed. For that reason it is important to choose the correct OS, or at least one as close as possible to the one that will actually be installed. A typical pitfall would be selecting Windows 2000 as the guest operating system but installing Windows 2003: this forces Windows 2003 to run with Binary Translation and Shadow Page Tables, even though hardware-assisted virtualization is available.
Preventing interrupts
Thirdly, as it turns out, not all operating systems are equally fit to be virtualized. An OS uses timed interrupts to maintain control of the system, and knowing how sensitive ESX is to interrupts, we need to make sure we are not using an OS that pushes this over the top. A standard Windows installation sends about 100 timer interrupts per second to every vCPU assigned to it, but some Linux 2.6 distributions (for example RHEL 5) have been known to send over 1000 per second, per vCPU. This generates quite a bit of extra load for ESX, which has to keep up with the VM's demands through a software-based timer rather than a hardware-based one. Luckily, this issue has been addressed in later Red Hat releases (5.1 onward), where a divider can be configured to reduce the number of interrupts generated.
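To put those numbers in perspective, here is a quick back-of-the-envelope sketch (plain Python, nothing VMware ships) that multiplies the per-vCPU tick rate by the number of vCPUs and VMs. The tick rates are the ones quoted above; the consolidation figures are invented purely for illustration.

```python
# Back-of-the-envelope sketch, not VMware code: estimate how many virtual timer
# interrupts per second a host has to emulate for a given consolidation.
# Tick rates are the figures quoted above; the VM counts are made up.

def timer_interrupts_per_second(vms: int, vcpus_per_vm: int, tick_rate_hz: int) -> int:
    """Total timer interrupts per second the hypervisor must deliver."""
    return vms * vcpus_per_vm * tick_rate_hz

windows_load = timer_interrupts_per_second(vms=10, vcpus_per_vm=2, tick_rate_hz=100)
rhel5_load = timer_interrupts_per_second(vms=10, vcpus_per_vm=2, tick_rate_hz=1000)

print(f"10 dual-vCPU Windows VMs: {windows_load:,} interrupts/s")  # 2,000
print(f"10 dual-vCPU RHEL 5 VMs:  {rhel5_load:,} interrupts/s")    # 20,000
```

The point is simply that the same modest consolidation generates an order of magnitude more timer work for ESX when the guests tick at 1000Hz instead of 100Hz.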
For the same reason, avoid adding any hardware devices to the VM that it doesn't really need (be they USB controllers or CD-ROM drives). All of these cause interrupts that could be done without, and even though their cost has been reduced significantly, taking steps to prevent them goes a long way.
Scheduling
A fun part of letting multiple systems use the same physical platform is thinking about the logistics required to make that happen. To better grasp the way ESX makes its scheduling decisions, it is interesting to dig a bit deeper into the underlying architecture. Though it is obviously built to be as robust as possible, there are still ways to give it a hand.
Earlier in this article we discussed NUMA, and how to make sure a VM does not make unnecessary node switches. The VMkernel's scheduler is built to support NUMA as well as possible, but how does that work, and why are node switches sometimes impossible to prevent?
Up to and including ESX 3.5, it has been impossible to create a VM with more than 4 vCPUs. Why is that? Because VMware locks vCPUs into so-called cells, which force them to "live" together on a single socket. These cells are in reality no more than a construct, grouping physical CPUs into a limited set and preventing scheduling outside the cell. In ESX 3.5 the standard cell size is 4 physical CPUs, with one cell usually corresponding to one socket. This means that on dual-core systems, a cell of size 4 would span two sockets.
The upside of a cell size of 4 on a quad-core NUMA system is that VMs will never accidentally get scheduled on a "remote" socket. Because one cell is bound to one socket and a VM can never leave its assigned cell, the potential overhead of socket migrations is avoided.
The downside is that cells can really limit the scheduling options available to ESX when the number of physical cores per socket is no longer a power of 2, or when the cells get too cramped to allow several VMs to be scheduled in a single time slot.
With standard settings, a dual-socket 6-core Intel Dunnington or AMD Istanbul system would be divided into three cells: one bound to each socket, and one spanning the two sockets. This puts the VMs placed in the latter at a disadvantage, because inter-vCPU communication slows down, making scheduling "unfair".
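As a rough illustration of why that happens, the sketch below (a simplification in Python, not the VMkernel's actual cell-building algorithm; the core numbering and greedy grouping are assumptions) carves a dual-socket hexcore machine into cells of 4 and flags the cell that ends up straddling both sockets.

```python
# Simplified illustration only -- not the VMkernel's real logic. We label each
# core by (socket, core), greedily group them into cells of a fixed size, and
# check which cells touch more than one socket.

def build_cells(sockets: int, cores_per_socket: int, cell_size: int) -> None:
    cores = [(s, c) for s in range(sockets) for c in range(cores_per_socket)]
    cells = [cores[i:i + cell_size] for i in range(0, len(cores), cell_size)]
    for idx, cell in enumerate(cells):
        sockets_touched = {s for s, _ in cell}
        note = "spans both sockets" if len(sockets_touched) > 1 else "single socket"
        print(f"cell {idx}: {cell} -> {note}")

# Dual-socket, 6 cores per socket (Dunnington/Istanbul style), default cell size 4
build_cells(sockets=2, cores_per_socket=6, cell_size=4)
# cell 0 and cell 2 stay on one socket, cell 1 straddles both
```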
Luckily, it is possible to change the standard cell size to better suit these hexacores by going into Advanced Settings in the VI client, selecting VMkernel and setting VMkernel.Boot.cpuCellSize to 6. The change takes effect as soon as the ESX host is rebooted, allowing 4-way VMs to be scheduled much more freely within a single socket while still preventing them from migrating to the other one.
Changing the cell size to reflect Istanbul's core count boosted performance in our vApus Mark I test by up to 25%. This improvement is easily explained by the large number of scheduling possibilities it adds: when trying to fit a 4-way VM into a cell of 4 physical cores, there is only ever one placement choice. When fitting that same VM into a cell of 6 physical cores, there are suddenly 15 different ways to place it inside the cell, allowing the scheduler to pick the best configuration in any given situation.
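That count of 15 is simply "6 choose 4"; a minimal sketch of the arithmetic (plain Python, nothing ESX-specific):

```python
# The scheduler's freedom grows combinatorially with cell size: the number of
# ways to place a 4-vCPU VM inside a cell is "cell size choose 4".
from math import comb

for cell_size in (4, 6):
    print(f"cell of {cell_size} cores: {comb(cell_size, 4)} placement(s) for a 4-vCPU VM")

# cell of 4 cores: 1 placement(s)  -> no freedom at all
# cell of 6 cores: 15 placement(s) -> the scheduler can pick the least loaded cores
```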
People who have made the switch to vSphere may have noticed that it is no longer possible to change the cell size, as VMware has reworked the way the scheduler operates: it now configures itself automatically to best match the socket size.
Comments
vorgusa - Monday, June 29, 2009 - link
Just out of curiosity, will you guys be adding KVM to the list?
JohanAnandtech - Wednesday, July 1, 2009 - link
In our upcoming hypervisor comparison, we look at Hyper-V, Xen (Citrix and Novell) and ESX. So far KVM has gotten a lot of press (in the OS community), but I have yet to see anything KVM in a production environment. We are open to suggestions, but it seems that we should give priority to the three hypervisors mentioned and look at KVM later. It is only now, June 2009, that Red Hat announces a "beta-virtualization" product based on KVM. When running many VMs on a hypervisor, robustness and reliability are by far the most important criteria, and it seems to us that KVM is not there yet. Opinions (based on some good observations, not purely opinions :-))?
Grudin - Monday, June 29, 2009 - link
Something that is becoming more important as higher-I/O systems are virtualized is disk alignment. Make sure your guest OSes are aligned with the SAN blocks.
yknott - Monday, June 29, 2009 - link
I'd like to second this point. Misalignment of physical blocks with virtual blocks can result in two or more physical disk operations for a single VM operation. It's a quick way to kill I/O performance!
thornburg - Monday, June 29, 2009 - link
Actually, I'd like to see an in-depth article on SANs. It seems like a technology space that has been evolving rapidly over the past several years, but doesn't get a lot of coverage.
JohanAnandtech - Wednesday, July 1, 2009 - link
We are definitely working on that. Currently Dell and EMC have shown interest. Right now we are trying to finish off the low-power server (and server CPU) comparison and the quad-socket comparison. After the summer break (mid-August) we'll focus on a SAN comparison. I personally have not seen any test on SANs. Most sites that cover them seem to repeat press releases... but I may have missed some. It is of course a pretty hard thing to do, as some of this stuff costs $40k and more. We'll focus on the more affordable SANs :-).
thornburg - Monday, June 29, 2009 - link
Some Linux systems using the 2.6 kernel make 10x as many interrupts as Windows? Can you be more specific? Does it matter which specific 2.6 kernel you're using? Does it matter what filesystem you're using? Why do they do that? Can they be configured to behave differently?
The way you've said it, it's like a blanket FUD statement that you shouldn't use Linux. I'm used to higher standards than that on Anandtech.
LizVD - Monday, June 29, 2009 - link
As yknott already clarified, this is not in any way meant to be a jab at Linux, but is in fact a real problem caused by the gradual evolution of the Linux kernel. Sure enough, fixes have been implemented by now, and I will make sure to have that clarified in the article. If white papers aren't your thing, you could have a look at http://communities.vmware.com/docs/DOC-3580 for more info on this issue.
thornburg - Monday, June 29, 2009 - link
Thanks, both of you.
thornburg - Monday, June 29, 2009 - link
Now that I've read the whitepaper and looked at the kernel revisions in question, it seems that only people who don't update their kernel should worry about this. Based on a little searching and a Wikipedia entry, it appears that only Red Hat (of the major distros) is still on the older kernel version.