Want to run a storage stress test to see what your newly bought all-flash storage array is capable of? Perfect, then let’s have a look at IO queues within ESXi and how they influence your storage performance! But before we dive in, as always, please read my Disclaimer first!
Alright, let’s get started… Before we start tuning our IO queues and push our storage arrays to the limit, we first need to understand what an IO is and how it relates to IO queues. I will try to briefly explain the life of an IO based on the following scenario: We have an application installed on a Windows VM, running on an ESXi host which is attached to a Fibre Channel SAN and gets its storage from a traditional block storage array. This application wants to read a file in full.
Whenever this application wants to read the file from the guest file system (NTFS in this case), the request is first translated by the file system’s IO stack. The IO device driver within Windows then issues one or multiple SCSI Read commands through the virtual SCSI adapter. Once received by the hypervisor, the SCSI requests sent by the guest OS need to be translated, and in rare cases even split, by the ESXi IO stack until they finally reach the device driver and pass through the HBA. Afterwards, they are sent over the Fibre Channel network to the storage array, where the whole game starts again in a similar fashion. Once the requested data arrives at the guest OS level, the IOs are treated as completed.
As described in the previous section, depending on what the application requests from the file system, a single file operation can consist of 1 up to N IOs. Since most of today’s disks and storage arrays still expose 512-byte sectors in the form of logical block addresses (LBAs), file systems usually group them into so-called clusters to reduce metadata overhead and to improve performance (locking etc.). In Windows, for example, the default cluster size is 4 KB. That means the smallest addressable unit of disk space in the guest OS is 4 KB. One SCSI operation can cover N contiguous clusters: a SCSI Read request, for example, includes the starting logical block address and the number of consecutive blocks. In Figure 1 “Example file”, we can see a file on a standard formatted NTFS volume which is spread across 4 cluster ranges. Therefore, to fully read this file, we need at least 4 IO operations. Since the IO stack and the device drivers have maximum IO sizes configured, and since the result also depends on application and memory buffer mechanisms, we most probably need even more.
To sum this up, to read this file, we need at least 4 IO operations, but most probably even more.
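The arithmetic behind this can be sketched in a few lines of Python. The extent layout, the 64 KB maximum transfer size and the helper name `split_into_reads` are assumptions made purely for illustration; they are not taken from Figure 1 or from any particular driver.

```python
SECTOR_SIZE = 512        # bytes per logical block (LBA) exposed by the disk
CLUSTER_SIZE = 4096      # default NTFS cluster size in bytes
MAX_IO_SIZE = 64 * 1024  # ASSUMED maximum transfer size per SCSI command


def split_into_reads(extents, max_io=MAX_IO_SIZE):
    """Turn (start_cluster, cluster_count) extents into (start_lba, block_count) reads."""
    sectors_per_cluster = CLUSTER_SIZE // SECTOR_SIZE
    max_blocks = max_io // SECTOR_SIZE
    reads = []
    for start_cluster, cluster_count in extents:
        lba = start_cluster * sectors_per_cluster
        remaining = cluster_count * sectors_per_cluster
        # A contiguous extent still needs several commands if it exceeds max_io.
        while remaining > 0:
            blocks = min(remaining, max_blocks)
            reads.append((lba, blocks))
            lba += blocks
            remaining -= blocks
    return reads


# Hypothetical file made of 4 cluster ranges: (start cluster, number of clusters).
example_extents = [(100, 8), (300, 16), (900, 40), (2000, 4)]
reads = split_into_reads(example_extents)
print(len(reads), "read commands")
for lba, blocks in reads:
    print(f"READ lba={lba} blocks={blocks}")
```

With these assumed numbers, the third extent (160 KB) is larger than the maximum transfer size and is split into three commands, so the 4 cluster ranges end up as 6 reads.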
As previously explained, whenever a file gets read or written, one or multiple IOs need to be issued. Normally, IOs are issued asynchronously. That means that once an IO request has been issued, the operating system as well as the hypervisor continue to work on other threads until the IO operation has completed and the waiting thread can continue. From the moment a read or write request is sent until the data or acknowledgement arrives, each request occupies a “slot” in the different IO queues.
As illustrated in Figure 2, IO queues exist on multiple levels. We have a queue at the guest OS level, at the VMkernel level, at the adapter and the LUN level, and later of course also at the storage level. Now, to fully utilize our resources, we would need to tune all these queues to support as many parallel IO requests as possible. Of course, we also need enough memory for transfer buffers as well as CPU resources to process the IO requests. Furthermore, the Storage Area Network should also be able to transport all of the requests fast enough…
So, let’s begin with the first IO queue a request has to go through after being issued. At the operating system level, you’ll have different default queue depths depending on the virtual SCSI adapter you use. The LSI Logic SAS adapter, for example, has a default queue depth of 128 per adapter and 32 per virtual disk. The Paravirtual (PVSCSI) adapter, on the other hand, has a default queue depth of 256 per adapter and 64 per virtual disk. In special cases you can even increase the PVSCSI queue depth to 1024 per adapter and 254 per disk, as described in the following KB article. Now, if you have an IO-hungry application or you just want to stress test your storage array, it is good practice to distribute your workload across 4 PVSCSI adapters with 4 virtual disks each.
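To get a feeling for what these numbers mean for a single VM, here is a minimal sketch that estimates the upper bound of IOs a guest can keep in flight. It assumes a simplified model in which each adapter is limited by the smaller of its own queue depth and the sum of its disks’ queue depths; the function name is made up for this example.

```python
def guest_outstanding_ios(adapters, disks_per_adapter, adapter_qd, disk_qd):
    """Upper bound of IOs the guest can keep in flight (simplified model)."""
    # Each adapter is capped by its own queue and by the sum of its disk queues.
    per_adapter = min(adapter_qd, disks_per_adapter * disk_qd)
    return adapters * per_adapter


# Queue depths quoted above, with 4 adapters and 4 virtual disks each.
print(guest_outstanding_ios(4, 4, adapter_qd=128, disk_qd=32))    # LSI Logic SAS: 512
print(guest_outstanding_ios(4, 4, adapter_qd=256, disk_qd=64))    # PVSCSI default: 1024
print(guest_outstanding_ios(4, 4, adapter_qd=1024, disk_qd=254))  # PVSCSI raised: 4064
```

With 4 disks per adapter, the default per-disk and per-adapter limits line up exactly, which is one way to read the 4 x 4 recommendation.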
The next queue an IO has to traverse sits at the VMFS level and is called “Number of outstanding IOs with competing worlds”. By default, this queue is set to 32 and is only active when two or more VMs are running on the same Datastore and I/O Control is disabled. This setting is configurable on a per-Datastore basis and can be checked as shown in Figure 3:
If you would like to increase this queue, you can do so with the following command. Note that since ESXi 6.5, the value must be smaller than or equal to the “Device Max Queue Depth”, which we will cover in the next section.
esxcli storage core device set -d naa.60003ff44dc75adcb5452b2ab87f66e7 -O 256
Keep in mind that this is a per-host setting and should be applied equally to all hosts in a cluster.
The “Device Max Queue Depth”, or the IO queue per LUN, is a setting in the HBA driver configuration and, as the name indicates, describes the maximum queue depth per LUN. In general, this queue depth is active when only one VM is running on the Datastore or when I/O Control is activated (discussed in the I/O control section). Depending on the HBA vendor, you’ll see different default values, usually ranging from 64 to 128. As mentioned, this setting can be adjusted in the HBA driver’s configuration. The process differs from vendor to vendor, as they use different drivers. Just follow the instructions in this KB article.
As I have neither a Fibre Channel network nor a dedicated storage array in my lab, I used iSCSI for testing, where the CLI command looks like this:
esxcli system module parameters set -p iscsivmk_LunQDepth=256 -m iscsi_vmk
Afterwards you can check whether the configuration has been applied correctly:
Please note that you have to restart the host for this setting to become active as it is a driver/module modification.
Last but not least, there is the adapter queue depth of the HBA in your server. Contrary to the other queues, this IO queue is not configurable, as it is a fixed value supported by the hardware. You can check the current value by opening “esxtop” on the ESXi command line and going to the storage adapter overview by hitting “d”. Afterwards, you need to enable the queue stats by pressing “f” to show the available columns, selecting “d” and going back:
As you can see in Figure 5, the two-port HBA supports a maximum accumulated queue depth of 2048.
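As a rough sizing sketch for the adapter queue: if every LUN can hold at most its device max queue depth in flight, the number of fully loaded LUNs needed to fill the adapter queue is simply the ratio of the two, rounded up. The function name and the per-LUN depths below are illustrative assumptions, and multipathing as well as per-port limits are ignored.

```python
import math


def luns_to_fill_adapter(adapter_qd, device_max_qd):
    """Number of fully loaded LUNs needed to saturate the adapter queue."""
    return math.ceil(adapter_qd / device_max_qd)


# 2048 is the accumulated adapter queue depth mentioned above.
for lun_qd in (64, 128, 256):  # assumed per-LUN queue depths
    print(f"LUN queue depth {lun_qd}: {luns_to_fill_adapter(2048, lun_qd)} LUNs")
```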
Okay, the “Device max queue depth” dictates the maximum queue depth and is active when only one VM is running on a Datastore, whereas the VMFS queue, or the “Number of outstanding IOs with competing worlds” setting, becomes active when two or more VMs share the same Datastore. So far so good, but the whole thing changes when I/O control is activated on these particular Datastores. As soon as I/O control is active, the queue depth at the VMFS level is ignored and the device queue depth is taken as the baseline. Whenever congestion occurs, I/O control reduces the queue depth adaptively until the Datastore latency reaches a “normal” value again. I/O control detects contention when the Datastore latency exceeds a threshold, which is either defined manually or derived automatically from observed behaviour. In production environments it is best practice to activate I/O control anyway, but for stress testing purposes it doesn’t make a lot of sense.
To fully utilize the potential of your storage array, you should tackle all potential performance bottlenecks on the way to it. This means optimizing the queues from the guest OS all the way down to the storage array. The following short table sums up which queue is active in which scenario:
| VMs on LUN | I/O Control | Active queue |
| 2 or more | Disabled | VMFS queue |
| 2 or more | Enabled | Device queue |
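The same decision logic, including the single-VM case described earlier, can be written down as a small function. This is only a sketch of the rules as stated in this article, not of how the VMkernel implements them; the function and parameter names are made up for illustration.

```python
def active_lun_queue(vms_on_lun, io_control_enabled, device_max_qd, vmfs_outstanding=32):
    """Return (queue name, queue depth) limiting a LUN on one host."""
    if vms_on_lun >= 2 and not io_control_enabled:
        # Since ESXi 6.5 the VMFS value cannot exceed the device max queue depth.
        return "VMFS queue", min(vmfs_outstanding, device_max_qd)
    # Single VM, or I/O control enabled: the device queue is the baseline.
    # I/O control may throttle below this under congestion (not modelled here).
    return "Device queue", device_max_qd


print(active_lun_queue(1, False, device_max_qd=64))
print(active_lun_queue(3, False, device_max_qd=64))
print(active_lun_queue(3, True, device_max_qd=64))
```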
To check the current queue depth of a particular LUN after your configuration, open esxtop and press “u” to get a list of all devices. The “DQLEN” column shows the current queue depth. Apart from that, there are other metrics, such as ACTV and QUED, which show how full the queues get during your tests.
But keep in mind that with full queues, the latency of individual requests will also go up. For stress testing this usually doesn’t matter, as you will most probably run ramp-up tests to see how many IOPS your system delivers at what latency.
One last point I’d like to mention here: the whole picture will change as soon as NVMe and/or NVMe over Fabrics are fully adopted.
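The relationship between outstanding IOs, latency and IOPS follows Little’s Law (throughput = IOs in flight / average latency), which is a general queueing result rather than anything ESXi-specific. The sketch below applies it to a made-up ramp-up series; the latency figures are hypothetical and not measurements from my lab.

```python
def iops_at(outstanding_ios, latency_ms):
    """Little's Law: throughput = IOs in flight / average latency per IO."""
    return outstanding_ios * 1000 / latency_ms


# Hypothetical ramp-up results: (outstanding IOs, average latency in ms).
ramp = [(32, 0.5), (64, 0.5), (128, 0.8), (256, 1.6)]
for qd, lat in ramp:
    print(f"{qd:>4} outstanding @ {lat} ms -> {iops_at(qd, lat):,.0f} IOPS")
```

In this made-up series the last step doubles the outstanding IOs but also doubles the latency, so the IOPS stay flat. That is the point in a ramp-up test where deeper queues stop buying you anything.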
Storage Array stress testing
Additionally, here are a few key points to pay attention to when testing for performance:
- Configure your SATP & PSP rules properly (Use round robin to fully utilize HBAs and FC Ports)
- Set CPU and Memory reservations for workload VMs
- Set VM latency sensitivity to high
- Provision as many LUNs as you need to optimally utilize max adapter queue
- Distribute workload across multiple VMs with multiple virtual disks and SCSI adapters
- Increase queue depth on VMFS level as well as on LUN level
- Configure your IO testing tool to issue as many parallel IOs as the setup allows