One of the nine basic analysis tabs that every object in vRealize Operations Manager has is “Stress”. But what does Stress indicate, and when should you care? Let's dive in…
What is Stress?
Basically, stress is the resource demand of an object, such as a VM or a cluster, that exceeds a certain threshold. The default value for this threshold is 70% of the available resources (e.g. CPU or memory). This means that whenever, for instance, a VM requests more than 70% of its configured CPU resources, this is considered stress. Now, even though a single short peak of resource demand over 70% already counts as stress, it is not really relevant from an overall perspective (at least for the stress calculation), and you certainly don't want to get alerted about it… The more interesting part is how many times you had such peaks, which immediately leads to the question of how much time your object spent in this stress area. And here it gets interesting 😉
How does stress get calculated?
As described before, the more interesting value regarding stress is how long your object spent in the stress area. So how are stress and its duration calculated? Looking at Diagram 1, you can see the demand of an example VM that exceeds the stress threshold line at 70% at some point. After a while the demand decreases, and later it goes up again. By default, vROps looks at the last 30 days and takes the “worst” hour into account. Worst hour means the hour with the highest accumulated stress. vROps therefore accumulates everything over 70% up to 100% (it could even go over 100%, since a VM can request more than 100% of its resources). The accumulated demand over 70% is then converted to a percentage of the capacity within the stress zone. To come back to our example, this particular VM was requesting on average 50% of the available capacity of the stress zone (between 70% and 100%) during this specific hour.
To get more practical, let's assume this VM has a total capacity of 10 GHz assigned. The resources between 7 GHz and 10 GHz are considered the stress zone, so the 3 GHz difference represents 100% of the stress zone. The VM had an average demand of about 8.5 GHz in this particular one-hour analysis window. This means the VM was using 50% of the capacity in the stress zone between 7 GHz and 10 GHz, and this value is called the stress score.
10 GHz (Total capacity) – 7 GHz (Stress free capacity) = 3 GHz (Stress zone capacity)
8.5 GHz (Average resource demand in analysis window) – 7 GHz (Stress free capacity) = 1.5 GHz (Average demand in Stress zone)
1.5 GHz (Average demand in Stress zone) / 3 GHz (Stress zone capacity) = 50% (Stress score)
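The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the worked example, not how vROps implements it internally; the function name `stress_score` and its parameters are made up for this post.

```python
def stress_score(avg_demand_ghz, capacity_ghz, stress_line=0.70):
    """Return the stress score in percent for one analysis window.

    Hypothetical helper that mirrors the worked example above.
    stress_line is the stress line threshold as a fraction (default 70%).
    """
    stress_free = capacity_ghz * stress_line      # e.g. 7 GHz
    stress_zone = capacity_ghz - stress_free      # e.g. 3 GHz
    # Demand below the stress line contributes no stress at all.
    demand_in_zone = max(0.0, avg_demand_ghz - stress_free)
    return demand_in_zone / stress_zone * 100.0


# The example VM: 10 GHz capacity, 8.5 GHz average demand in the window.
print(round(stress_score(8.5, 10.0), 1))
```

Note that the result is not capped at 100%, because a VM can demand more than its configured capacity.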
What does Stress tell you?
The question now is: why should I care about stress, and why can't I simply use Workload to notify me when a VM exceeds a certain threshold? First, having objects like clusters or VMs running over the stress line does not necessarily indicate that the resources are insufficient at the moment. Rather, it can help you identify potential future performance bottlenecks or give you better insight into your object's workload behavior. For example, imagine a situation where you have several peaks like in Diagram 2, where workload contention was only a short issue each time but happened repeatedly. Because stress is accumulated, vROps is capable of noticing this situation and alerting you accordingly. It therefore shows you not just a snapshot of resource contention, but resource contention over a period of time.
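To illustrate why accumulation matters, here is a small Python sketch that scores a series of demand samples and looks for the worst window. It rests on several assumptions of mine: samples are evenly spaced and given in percent of capacity, the window slides one sample at a time, and the names `window_stress_score` and `worst_window` are invented. How vROps actually slices its windows is not covered here.

```python
def window_stress_score(samples, stress_line=70.0):
    """Average share of the stress zone used across one window, in percent.

    samples: demand values in percent of capacity (assumed evenly spaced).
    """
    zone = 100.0 - stress_line
    # Only the part of each sample above the stress line is accumulated.
    excess = [max(0.0, s - stress_line) for s in samples]
    return sum(excess) / len(samples) / zone * 100.0


def worst_window(samples, window_size, stress_line=70.0):
    """Highest stress score over all sliding windows of window_size samples."""
    worst = 0.0
    for start in range(len(samples) - window_size + 1):
        window = samples[start:start + window_size]
        worst = max(worst, window_stress_score(window, stress_line))
    return worst


# Several short peaks at 85% with quiet phases at 50% in between.
demand = [50, 85, 50, 85, 50, 85, 50, 50]
print(worst_window(demand, window_size=4))
```

No single sample in this series stays above the stress line for long, yet the repeated peaks add up to a noticeable score for the worst window. A pure snapshot check would miss that pattern.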
What is configurable?
As with everything else, you can of course adjust several parameters for calculating the stress values in order to tune this nice little helper to your needs. To modify these values, you need to edit the corresponding policy in the policy manager. There you'll find the following parameters to adjust:
- Stress line threshold (Default: 70%)
- Stress Score Threshold (Default: 25, 35 and 50%)
- Stress analysis window (Default: Any)
- Stress analysis window duration (Default: 60 Minute Peak)
By default, at a 25% stress score vROps generates an alert with severity “Warning”, at 35% one with “Immediate”, and at 50% one with “Critical”. Furthermore, you can adjust whether vROps should take any hour in the desired time range into account for the stress score calculation or whether it should look at the entire time range (for example 30 days). I strongly recommend leaving it at “Any”, as otherwise vROps waits until the stress score is reached over a period of 30 days. If you leave it at “Any”, you are also able to modify or extend the stated 60 minutes. For instance, you may have some less important objects that should only alert when the stress score reaches the threshold within a 4-hour window.
Last but not least, you can specify the time range in which vROps should search for the “worst” hour (or whatever you configured), and whether it should look at all 24 hours on all 7 days or only at limited times, because you may only have workload during the day.
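As a quick recap of the default stress score thresholds listed above, the mapping from score to alert severity could be sketched like this in Python. The function name `stress_severity` is hypothetical, and treating each threshold as inclusive ("at 25%" means 25% and above) is my assumption.

```python
# Default stress score thresholds from the policy, highest first.
DEFAULT_THRESHOLDS = (
    (50.0, "Critical"),
    (35.0, "Immediate"),
    (25.0, "Warning"),
)


def stress_severity(score, thresholds=DEFAULT_THRESHOLDS):
    """Return the alert severity for a stress score, or None if no alert.

    Assumes thresholds are inclusive and sorted from highest to lowest.
    """
    for limit, severity in thresholds:
        if score >= limit:
            return severity
    return None


for score in (10.0, 30.0, 40.0, 50.0):
    print(score, stress_severity(score))
```

If you change the thresholds in your policy, you would pass a different tuple instead of the defaults.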