1. Complaint-Based Operations
How do you know that the Infrastructure as a Service (IaaS) Platform (be it on-prem or in the cloud) is serving its workload well? If you depend on complaints, then you run a complaint-based operation.
Changing from reactive to proactive is unfortunately a complex undertaking, especially in large organizations where there are many roles and personas. It requires operations transformation and a paradigm shift. It is not easy to get customers to agree on a Service Level Agreement (SLA) when you’ve promised them “good” for years already. This book aims to provide practical guidance, something you can implement with the current version of vRealize products.
The litmus test below helps you assess the maturity of your IaaS.
Do your customers blame your infrastructure?
If the answer is yes, take a moment to ponder why. There is a high chance that your operations rely on complaints, which means you actually encourage them: no complaint, no problem. That’s why it’s aptly named Complaint-Based Operations.
You rely on complaints because operations has no other means by which to measure success. You have not defined the performance of your IaaS.
Defining it is the goal of this book.
A sign of mature operations is that you have complete, correct and accurate Service Level Agreements. Complete means you have Performance SLAs and Compliance SLAs, not just Availability SLAs. Correct means the SLA is measured on each paying VM, not at the infrastructure level, and that it uses the right metric. Accurate means the metric is sampled every 5 minutes, as any longer interval can miss the problem.
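To illustrate why the sampling interval matters, here is a minimal Python sketch. The sample values, the 5% threshold, and the function name `sla_compliance` are hypothetical and are not taken from any vRealize product. The point is only that a short spike visible at 5-minute granularity can disappear once the hour is averaged.

```python
from statistics import mean


def sla_compliance(samples, threshold):
    """Return the fraction of samples at or below the threshold."""
    if not samples:
        raise ValueError("at least one sample is required")
    within = sum(1 for value in samples if value <= threshold)
    return within / len(samples)


# Twelve hypothetical 5-minute samples (one hour) of a per-VM
# contention metric, in percent. One sample is a clear spike.
five_minute_samples = [1.0, 1.0, 1.0, 1.0, 1.0, 20.0,
                       1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
threshold = 5.0  # assumed SLA threshold, for illustration only

# Measured every 5 minutes, the spike counts as one breached interval.
five_minute_compliance = sla_compliance(five_minute_samples, threshold)

# Averaged over the hour, the same data looks perfectly healthy.
hourly_average = mean(five_minute_samples)
hourly_compliance = sla_compliance([hourly_average], threshold)

print(f"5-minute compliance: {five_minute_compliance:.1%}")
print(f"hourly compliance:   {hourly_compliance:.1%}")
```

The same function would be applied to each paying VM separately, rather than to a cluster-wide average, in keeping with the "correct" criterion above.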
Is your IaaS cheaper than public cloud or hybrid cloud?
The commoditization of infrastructure means your IaaS is being compared with similar platforms such as VMware Cloud on AWS and Amazon Web Services.
If it is not cheaper, your CIO may question your business value. The reason for having an in-house architect is that you can deliver a lower cost, even after your salary is taken into account.
Does Help Desk provide a good first line of defence?
If Help Desk simply passes issues through to the next level, you need to look at why.
Help Desk is your first line of defence. They are not as technical as you are. Equip them with a simple dashboard so that they can handle VM Owner complaints by discovering:
- Is the problem caused by IaaS not serving the VM well?
- If yes, which part of the infrastructure: CPU, RAM, Disk or Network?
- If not, how to prove it convincingly?
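The three questions above can be sketched as a simple triage function. This is an illustrative sketch only: the metric keys, the threshold values, and the function name `triage` are assumptions, not names or values from vRealize. A real dashboard would use the contention metrics and thresholds agreed in your SLA.

```python
# Hypothetical per-VM contention thresholds, in percent.
# Real values would come from your agreed Performance SLA.
THRESHOLDS = {"cpu": 5.0, "ram": 1.0, "disk": 20.0, "network": 0.5}


def triage(vm_metrics, thresholds=THRESHOLDS):
    """Return the infrastructure components that are not serving the VM well.

    An empty list means every measured component is within its threshold,
    which is the evidence that the problem lies outside the IaaS.
    A missing metric raises KeyError, because a gap in the data
    cannot prove anything either way.
    """
    return [
        component
        for component, limit in thresholds.items()
        if vm_metrics[component] > limit
    ]


# A VM suffering from CPU contention only.
slow_vm = {"cpu": 8.0, "ram": 0.0, "disk": 10.0, "network": 0.0}
# A VM that the infrastructure is serving well.
healthy_vm = {"cpu": 0.5, "ram": 0.0, "disk": 3.0, "network": 0.0}

print(triage(slow_vm))     # components to escalate
print(triage(healthy_vm))  # empty: the infrastructure is not the cause
```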
Can you justify new infrastructure when utilization is not high?
This does not refer to the additional budget that comes with new projects. It refers to existing workloads on existing clusters and storage.
Capacity is measured on both utilization and performance. A cluster is full if it can’t serve its VMs well, even when its utilization is not high. Since it takes time to buy hardware, you need an early warning system to detect this performance degradation.
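As a rough sketch of this idea, the function below treats a cluster as full when either signal crosses its limit, and raises a warning when performance starts to degrade. The function name, parameters, and limits are all hypothetical values chosen for illustration.

```python
def cluster_capacity_status(
    utilization_pct,
    vms_not_served_pct,
    util_limit=80.0,   # assumed utilization limit, in percent
    perf_limit=5.0,    # assumed limit on VMs not served well, in percent
    perf_warn=2.0,     # assumed early-warning level, in percent
):
    """Classify cluster capacity using performance first, then utilization."""
    if vms_not_served_pct > perf_limit:
        return "full: performance"
    if utilization_pct > util_limit:
        return "full: utilization"
    if vms_not_served_pct > perf_warn:
        return "warning"
    return "ok"


# Utilization is moderate, but too many VMs are not served well.
print(cluster_capacity_status(45.0, 8.0))
# Performance is starting to degrade: time to begin the purchase process.
print(cluster_capacity_status(45.0, 3.0))
```

Checking performance before utilization reflects the argument above: a cluster at 45% utilization can still be full if its VMs are not served well.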
Do you struggle with many over-provisioned VMs?
This is an indicator that you are operating as a System Builder as opposed to a Service Provider. As a System Builder, you are meddling with each System (read: Application). You size them and argue with the application team, who are actually your customers. You are busy as there are many applications and you are outnumbered.
If you are operating as an internal Cloud Service Provider, you are not “in the way” of the business. You use an effective pricing model to drive the right behaviour. Does a public cloud provider block application teams from buying 40-CPU AWS EC2 VMs when they need only 2 CPUs? It doesn’t, hence neither should you.
Does Troubleshooting mean all hands on deck?
Do you have a process that is followed by all teams (network, storage, server, OS, application)? Does that process end with Root Cause Analysis?
As part of RCA, do you set up alerts so the same issue can be detected faster if it happens again? Without an alert configured, the RCA is not closed. The alert is also critical as it will trigger the RCA process.
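The closure rule can be expressed as a small check. This is a hypothetical model, not part of any ticketing or vRealize API: the class name `RootCauseAnalysis` and its fields are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RootCauseAnalysis:
    """Hypothetical RCA record used to illustrate the closure rule."""
    issue: str
    root_cause: str = ""
    alerts: list = field(default_factory=list)

    def can_close(self):
        # An RCA needs both an identified root cause and at least one
        # alert that will detect the same issue if it happens again.
        return bool(self.root_cause) and bool(self.alerts)


rca = RootCauseAnalysis(issue="Slow database VM", root_cause="CPU contention")
closable_before_alert = rca.can_close()

rca.alerts.append("Per-VM CPU contention above SLA threshold")
closable_after_alert = rca.can_close()

print(closable_before_alert, closable_after_alert)
```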
This page was last updated on June 29, 2021 by Stellios Williams with commit message: "Fixed non-ascii double quotes"