Hello and welcome to the second part of the recently introduced performance monitoring series! In the first post we looked into how the CPU influences the overall performance of your virtual workload and how you can identify potential bottlenecks. Today we’ll have a closer look at the whole disk and storage section. However, multiple approaches and architectures exist, such as traditional block storage over SAN, file storage over NFS or hyperconverged storage. To reduce complexity, I’ll limit today’s post to a traditional FC setup. But don’t worry, an additional post about vSAN performance is also planned 😉
Okay, as always I’d like to highlight my Disclaimer before we start 😉
As usual, I’m a fan of starting with some basics to bring the whole audience up to the same level of understanding. So first of all, I’d like to point you to the first section of a previous post I wrote about the basics of an IO operation. To better understand what follows, I recommend that you quickly go through it.
Alright, now that you know more about an IO operation, I would also like to tell you a little bit more about the different flavors of IOs and how they influence all the parameters and metrics around storage monitoring. As soon as you start talking to the person responsible for storage in your company about storage performance, the first terms you’ll most probably hear are “IOps” and “latency”. Therefore, let’s take these as a starting point.
IOps is short for “Input/Output operations per second” and hence refers to how many IO operations a system (server or storage) is doing per second. That means a server (physical or virtual) is doing or requesting a certain number of IO operations per second, while the storage array is delivering them. During “normal” phases, you see as many IOps as the systems are requesting based on the current application demands. During peak phases, the IOps value you’re going to observe is only as high as the “weakest” component in the chain allows. For example, if the storage array is not able to deliver the requested number of IO operations, either because it has to serve other servers as well and is therefore limited, or because the hardware (storage processors, disks etc.) is simply not capable of delivering more, the IOps value will get stuck at a certain threshold.
To summarize, as an SDDC admin you can use this metric to tell how “storage-intensive” your virtual server or application is, as long as no contention is observed, which brings us to the next section.
Now you know what IOps are and what they are used for. Unfortunately, the raw IOps value doesn’t tell us too much, as a very important parameter is missing to fully understand how the storage stack is performing. As mentioned before, you can use the IOps metric to tell how “storage-intensive” your virtual server is if you do not face contention. But how can you tell when you’re starting to face contention?
Exactly: with the help of the latency value! The latency tells you how long it takes the storage array and/or the network in between to serve an IO request. IOps and latency are therefore always related to each other. The more IOps you want to achieve during peak loads, the lower your latency has to be. But latency is of course not only important during peak loads. In general, the lower the latency, the faster your systems get the data they have requested and, as a logical consequence, the faster they can process that data in the compute stack. For example, you can have a setup where a server is showing 1000 IOps with an average latency of 1ms, or another setup with the same number of IOps but an average latency of 20ms. I don’t think I have to explain what’s better :-).
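To put numbers on that relationship, here is a minimal Python sketch based on Little’s Law (throughput = concurrency / response time). The function name and the value of 32 outstanding IOs are assumptions for illustration only, not measurements from a real array.

```python
def achievable_iops(outstanding_ios: int, latency_ms: float) -> float:
    """Little's Law: IOps = outstanding IOs / latency (in seconds)."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return outstanding_ios * 1000.0 / latency_ms


# Assumed example: 32 IOs in flight at all times, two different latencies.
for latency_ms in (1.0, 20.0):
    iops = achievable_iops(outstanding_ios=32, latency_ms=latency_ms)
    print(f"{latency_ms:4.1f} ms -> {iops:,.0f} IOps")
```

With the same number of IOs in flight, the 1ms setup can sustain 20 times more IOps than the 20ms one, which is why latency sets the ceiling during peak loads.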
In general, as the number of requested IO operations increases, so does the latency, due to increased utilization of all the components in a storage array, until the array hits its upper limit. In the example below (Figure 1: IOps vs Latency), you see that this particular storage array serves a maximum of around 60,000 IOps (30,000 reads and 30,000 writes) at an average latency of 4ms:
This was the maximum this storage array was able to deliver. You can also see how latency increases with rising IO requests. Even though we’ve primarily talked about the storage array so far, the whole stack (server, network etc.) as well as the configuration (path selection etc.) needs to be set up properly, so that the bottleneck ends up at the weakest part in the stack, as mentioned earlier.
Alright, you should now be able to understand the role of latency and IOps when it comes to storage performance monitoring. In general, these two metrics are the basic parameters. However, the number of IOps and their latency are influenced by a bunch of further parameters, which you need to consider in order to properly interpret the performance of your environment.
In the storage world, people usually talk about the “workload characteristics” of virtual machines or their applications. In general, this term covers the following parameters:
- IO intensity
- Block sizes used
- Bandwidth intensity
- Read/Write ratio
- Access pattern
- Working set
IO intensity is basically what we’ve already discussed. It describes how IO-intensive an application is under normal conditions: is it a rather storage-intensive application like a database, or a calmer one like a web server?
Block size used
Block size indicates how large the requested data blocks within one or multiple IO operations are. The bigger they are, the longer it takes to read/write and transfer them. Usually your applications work with different block sizes depending on what they want to achieve. For example, backup applications mostly use huge block sizes, as the primary goal is to transfer a huge amount of data in a sequential fashion (covered in a bit) and therefore make better use of bandwidth without a lot of overhead. Databases, on the other hand, mostly work with smaller block sizes, as the accessed data is mostly of a random nature.
Bandwidth intensity is closely related to block size. As described in the previous section, when the primary goal is to transfer a huge amount of data, it is more likely that you’ll see fewer but larger IO requests. This of course increases the latency of the individual IOs, but bandwidth utilization is more efficient.
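A quick way to see this trade-off is to compute bandwidth as IOps multiplied by block size. The sketch below is only an illustration; the two workload profiles and their numbers are made up for the example.

```python
def throughput_mib_s(iops: float, block_size_kib: float) -> float:
    """Bandwidth in MiB/s = IOps * block size (KiB) / 1024."""
    return iops * block_size_kib / 1024.0


# Hypothetical profiles: (IOps, block size in KiB)
profiles = {
    "database-like (small blocks)": (20_000, 8),
    "backup-like (large blocks)": (2_000, 256),
}

for name, (iops, block_kib) in profiles.items():
    print(f"{name}: {throughput_mib_s(iops, block_kib):.2f} MiB/s")
```

In this made-up example the backup-like profile moves more than three times the data with a tenth of the IO requests.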
The read/write ratio matters because writes are usually more resource-intensive for a storage array than reads (parity calculations, mirroring etc.). Hence a write-intensive application will usually face higher average latencies than a read-intensive one.
The access pattern refers to how random the data addresses are that an application is requesting. Nowadays, with SSD devices, this is less of a concern than it was back when spinning disks were in place.
And last but not least, the working set indicates how large the actual data set is with which the application is working. If it always requests the same data, most of the stuff can be delivered out of caches instead of reading everything from the disks.
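As a rough illustration of why the working set matters, the following sketch models the average read latency as a weighted mix of cache hits and backend reads. This is a deliberately simple model, and the default latencies (0.2 ms from cache, 5 ms from the backend) are assumed values, not figures from a specific array.

```python
def effective_read_latency_ms(
    hit_ratio: float,
    cache_latency_ms: float = 0.2,    # assumed cache latency
    backend_latency_ms: float = 5.0,  # assumed disk/backend latency
) -> float:
    """Weighted average of cache and backend latency for a given hit ratio."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return hit_ratio * cache_latency_ms + (1.0 - hit_ratio) * backend_latency_ms


# A small working set fits into cache (high hit ratio), a large one does not.
for hit_ratio in (0.5, 0.9, 0.99):
    latency = effective_read_latency_ms(hit_ratio)
    print(f"hit ratio {hit_ratio:.0%}: {latency:.2f} ms")
```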
If this confuses you, don’t worry, you don’t have to know them by heart. However, keep them in mind when troubleshooting performance issues. For example, if you look at a virtual machine which has higher latency values, it may be that this particular machine is aiming for maximum transfer rates and is therefore using bigger block sizes to optimize for bandwidth.
Alright, if you’re still reading, congratulations, you made it to the actual part of the post :-D! Now we’ll have a look at how you actually monitor the metrics described above and how you can interpret them.
In general, there are a couple of tools available to monitor the storage performance. Based on how granular I’d like to see the data, I use one of the following:
- esxtop (2s)
- vSphere Real Time Charts (20s)
- vRealize Operations Manager Charts (1m)
- Log Insight (Logs)
As you have probably recognized, the time specifications in the above list describe the finest granularity you can achieve with each tool. Hence, if you want to get a general overview of your environment, using vRealize Operations Manager (vROps) is a good idea, as it can retain the configured granularity (default 5 min, but configurable to 1 min) for a long period of time. If some of your objects face contention, I would look at the vSphere Real Time Charts to investigate the problem further, and if this is not sufficient or you want to do some tests, I recommend using esxtop, as it can show you average values at a 2 second granularity.
IOps, Latency & Bandwidth
Okay, now that you know the tools, let’s start with the basics. When noticing bad performance, I usually take a look at the latency values first. Basically you have two options: either look directly at the virtual machine latency values, which include the whole stack up to the virtual SCSI or NVMe controller, or take a look at the datastore metrics first to check the underlying infrastructure.
Usually I like to start with the second approach. If you’re using esxtop, it shows you not only the overall latency value (GAVG/cmd), but also its components, such as the raw device latency (DAVG/cmd) and the kernel latency (KAVG/cmd), as shown in the following screenshot:
This already narrows down the issue. If only the device latency is high, chances are that either the storage array itself or the connection to it has a problem. Of course, it could also be a configuration issue, but we’ll discuss this part of the story later. If you see high kernel latencies, then something is wrong within the ESXi host. In general it is difficult to provide hard thresholds here, as they strongly depend on the storage array, the network and the workload characteristics mentioned earlier. However, experience has shown that average latency values beyond 10 ms deserve some attention.
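The triage logic described above can be sketched in a few lines of Python. It assumes the commonly documented relationship that the guest latency is roughly the sum of device and kernel latency (GAVG ≈ DAVG + KAVG). The 10 ms device threshold is the rule of thumb from this post, while the 2 ms kernel threshold and the function name are my own assumptions for the example.

```python
def triage_latency(
    davg_ms: float,
    kavg_ms: float,
    device_threshold_ms: float = 10.0,  # rule of thumb from this post
    kernel_threshold_ms: float = 2.0,   # assumed value, tune for your environment
) -> tuple[float, list[str]]:
    """Return the approximate guest latency and a list of findings."""
    gavg_ms = davg_ms + kavg_ms  # GAVG is roughly DAVG + KAVG
    findings = []
    if davg_ms > device_threshold_ms:
        findings.append("device latency high: check the array or the path to it")
    if kavg_ms > kernel_threshold_ms:
        findings.append("kernel latency high: check the ESXi host")
    if not findings:
        findings.append("no threshold exceeded")
    return gavg_ms, findings


# Example values as you might read them from esxtop (made up for illustration).
gavg, findings = triage_latency(davg_ms=15.0, kavg_ms=0.5)
print(f"GAVG ~ {gavg:.1f} ms")
for finding in findings:
    print("-", finding)
```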
If you’re going for maximum bandwidth, as stated in the previous section, some of the metrics become more important while others become less so. Latency still plays a role, but it is no longer the most important metric to monitor. Here I normally rely more on the raw bandwidth value and check where it gets stuck. If you don’t have a lot of overhead, you’ll most probably hit the connectivity limits, such as the maximum bandwidth of your HBAs or of your FC ISLs.
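To judge how close you are to such a connectivity limit, you can relate the measured throughput to the nominal bandwidth of the HBA port. The sketch below assumes the commonly quoted nominal per-direction figures for Fibre Channel (8GFC ≈ 800 MB/s, 16GFC ≈ 1600 MB/s, 32GFC ≈ 3200 MB/s) and uses the 80% value from the summary table at the end of this post as the warning level. The function name and the measured value are hypothetical.

```python
# Commonly quoted nominal throughput per direction, in MB/s (assumed values).
NOMINAL_MB_S = {8: 800, 16: 1600, 32: 3200}


def hba_utilization_pct(measured_mb_s: float, fc_speed_gbit: int) -> float:
    """Utilization of one HBA port in percent of its nominal bandwidth."""
    return measured_mb_s * 100.0 / NOMINAL_MB_S[fc_speed_gbit]


measured = 1280.0  # MB/s, made-up reading for a 16GFC port
utilization = hba_utilization_pct(measured, fc_speed_gbit=16)
status = "check for a bottleneck" if utilization >= 80.0 else "ok"
print(f"{utilization:.0f}% utilized -> {status}")
```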
Of course, there are a bunch of further useful metrics, but in the interest of time I’m not going to outline each and every one of them. However, I put some of them, along with some critical threshold recommendations, into the summary table at the end of this post.
Configuration & Logs
Besides monitoring the actual workload, it is usually also a good idea to check the configuration of the whole storage setup. For example, check whether I/O control is activated and how the disk shares are configured. Another famous example is the multipathing configuration of the storage array within ESXi. Even though the FC environment is properly set up (zoning, masking etc.), you have to assign the LUNs to the correct PSP (Path Selection Policy) rules within ESXi in order to fully utilize the bandwidth and resources. For instance, if you have 4 paths to a storage array but you’re using MRU (Most Recently Used), which in many ESXi versions is the default for active/passive and ALUA arrays (generic active/active arrays default to Fixed, which likewise uses a single path), you end up using only one path, which is one quarter of the overall available bandwidth. If you now switch to Round Robin, the IOs are distributed over all active paths, which increases parallelism. For this reason, I also like to check the traffic distribution over the physical HBAs to see whether the host makes use of proper multipathing:
But always make sure to consult the best practices of your storage vendor on how to configure ESXi to optimally leverage the array without overloading it.
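If you export the per-HBA throughput values, a small script can flag an uneven distribution for you. The following is a minimal sketch; the adapter names, the sample values and the 15 percentage point tolerance are assumptions for illustration.

```python
def path_shares(throughput_by_hba: dict[str, float]) -> dict[str, float]:
    """Share of the total traffic handled by each HBA (0.0 to 1.0)."""
    total = sum(throughput_by_hba.values())
    if total == 0:
        return {name: 0.0 for name in throughput_by_hba}
    return {name: value / total for name, value in throughput_by_hba.items()}


def is_uneven(throughput_by_hba: dict[str, float], tolerance: float = 0.15) -> bool:
    """True if any HBA deviates from an even split by more than the tolerance."""
    if not throughput_by_hba or sum(throughput_by_hba.values()) == 0:
        return False  # nothing to compare
    even_share = 1.0 / len(throughput_by_hba)
    shares = path_shares(throughput_by_hba)
    return any(abs(share - even_share) > tolerance for share in shares.values())


# Made-up readings in MB/s: one host using a single path, one balancing well.
single_path = {"vmhba1": 400.0, "vmhba2": 0.0}
balanced = {"vmhba1": 210.0, "vmhba2": 190.0}
print("single path uneven:", is_uneven(single_path))
print("balanced uneven:", is_uneven(balanced))
```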
If you have checked the performance charts as well as the configuration and still face performance issues, having a look at the “vmkernel” logs is another good idea. For this, it helps to have a log management system like vRealize Log Insight in place in order to evaluate the log messages efficiently. In general, unless you’re an early adopter using the NVMe protocol, the ESXi hosts communicate using the SCSI protocol. Hence you’ll get all SCSI-related messages and their corresponding event codes.
If you have Log Insight in place, it already provides a nice set of dashboards and queries to investigate these messages. However, let’s have a quick look at a raw SCSI event message and how to translate it. As you can see in “Figure 2: SCSI Code example”, such a message consists of two parts. The first part shows you status information, whereas the second part gives some additional information if necessary:
The first part is further divided into the following sections:
- H: Host related codes
- D: Device related codes
- P: Plugin related codes
That means you immediately see where the log message is coming from. Most of the time you’ll probably see a value in the “D” field, which means it comes from the storage array or the storage endpoint. If it shows the value “2” (Check Condition), more information about the event is located in the sense code part of the message, which should provide you with more details.
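To make the structure more tangible, here is a small Python sketch that splits such a status string into its H/D/P fields and, if present, the sense data. The sample line is an illustrative string in the typical vmkernel layout, not taken from a real log, and the function name is hypothetical. The status and sense key names come from the SCSI standards, but the tables are intentionally incomplete.

```python
import re

# Subset of SCSI device status codes and sense keys (not exhaustive).
DEVICE_STATUS = {
    0x00: "GOOD", 0x02: "CHECK CONDITION", 0x08: "BUSY",
    0x18: "RESERVATION CONFLICT", 0x28: "TASK SET FULL",
}
SENSE_KEYS = {
    0x0: "NO SENSE", 0x1: "RECOVERED ERROR", 0x2: "NOT READY",
    0x3: "MEDIUM ERROR", 0x4: "HARDWARE ERROR", 0x5: "ILLEGAL REQUEST",
    0x6: "UNIT ATTENTION", 0x7: "DATA PROTECT", 0xB: "ABORTED COMMAND",
}

HEX = r"(0x[0-9a-fA-F]+)"
STATUS_RE = re.compile(
    rf"H:{HEX}\s+D:{HEX}\s+P:{HEX}"
    rf"(?:.*?sense data:\s*{HEX}\s+{HEX}\s+{HEX})?"
)


def decode_scsi_status(message: str) -> dict:
    """Extract host/device/plugin codes and optional sense data from a log line."""
    match = STATUS_RE.search(message)
    if match is None:
        raise ValueError("no H:/D:/P: status triplet found")
    host, device, plugin = (int(v, 16) for v in match.group(1, 2, 3))
    result = {
        "host": host, "device": device, "plugin": plugin,
        "device_status": DEVICE_STATUS.get(device, "unknown"),
        "sense_key": None, "asc": None, "ascq": None,
    }
    if match.group(4) is not None:  # sense data is only present in some messages
        key, asc, ascq = (int(v, 16) for v in match.group(4, 5, 6))
        result.update(sense_key=SENSE_KEYS.get(key, "unknown"), asc=asc, ascq=ascq)
    return result


sample = "H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x24 0x0"  # illustrative only
print(decode_scsi_status(sample))
```

The sketch only names the sense key. The additional sense code and qualifier (the last two values) still have to be looked up in the T10 lists or with a decoder.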
As raw numbers don’t tell us much, there is of course a meaning behind every single one of them, standardized by the T10 committee. To avoid browsing through lists and translating everything on your own, there are some nice decoders out there. One of these decoders can be found here.
If I translate the above stated code, it looks like the following:
Perfect, we’ve done it! With the metrics and tools described above you should be able to identify 95% of the issues that occur. However, if you’d like to go a little deeper, the following is a summary of the already mentioned metrics, supplemented with some additional counters you could use.
|Metric||Critical threshold|
|---|---|---|
|Disk queue utilization||100%|
|HBA Utilization (Bandwidth)||80%|
|HBA Port Utilization||Uneven|