
I spy with my little eye ...
What a peaceful idea it is. Everything in your IT environment perfectly in order. Dashboards show green lights, monitoring tools report no critical incidents. Time for a coffee!
And yet - at the most unpredictable and inconvenient times - productivity stalls. Delays, hiccups, systems suddenly unresponsive for seconds or minutes. How?
This probably sounds familiar. It happened recently at a company I was visiting. The IT manager sighed exactly what I wrote above. Including that coffee. They had just rolled out a new CRM system in the cloud. Everything seemed to be working flawlessly. But the users complained that the system became completely unresponsive several times a day. A few minutes may not seem long to an outsider, but when you have customers on the phone that you can't help for minutes on end, frustration rises and you rapidly lose sales.
Look, the standard monitoring tools did see that there seemed to be a problem somewhere, but they couldn't answer why. The Internet provider saw no problems on the line. Adding cloud resources made no difference. No one could find the cause.
That was the moment I got to work. The cause turned out to be simple. But I should actually say ... The cause was simple for me. What I do is measure, just like that software. But I measure differently. And I dare say ... I measure better. Because I measure with a different, better resolution.
Soon I discovered what was going on. Mobile devices turned out to be performing massive updates at random times, gobbling up the entire Internet connection. The available bandwidth was simply swallowed by the more powerful servers on the other end.
The Internet provider didn't see it because they were measuring with a 15-minute resolution. A peak of about 4 to 5 minutes then averages out. The APM tooling hadn't found it either because it had no visibility into the network connection.
Inefficiencies in IT arise not only from what you see, but mainly from what you don't see. By measuring more intelligently and in more detail, you can uncover hidden problems. And now you're thinking ... how many blind spots are there in my IT environment? I'm going to help you!
Tip: Why an average will lead you astray
“On average it's nice here,” the man said with his head in the oven and his feet in the icebox. Kind of lame, but immediately obvious, right? An average tells only part of the story. It “masks” peaks and valleys, leaving important problems invisible. And in IT, that can be costly. And from what I see, it's more often costly than we wish.
Take the case of the company with the new CRM system in the cloud. The users complained, while the Internet provider reassured them with averages: “We measure every 15 minutes and see no problems.” The users didn't feel taken seriously, because the problem was real. The average load on the line stayed well within the margins, yet the line was saturated for minutes at a time. Everyone was relieved when I found the problem quickly, and with that it was quickly fixed.
A usage averaged over fifteen minutes says virtually nothing. At least nothing useful. Just as an average temperature says nothing about how comfortable someone feels when they alternate between experiencing extreme heat and icy cold.
Standard monitoring tools usually don't raise alarms, because they don't look closely at short overloads. If they measured maxima in addition to averages, they would have found the problem. So here's a tip: what about your monitoring tools? Do they report only averages, or also the outliers?
In IT, measuring only averages is simply not enough. Because if you blindly rely on averages, you run the risk of keeping your biggest problems exactly out of the picture.
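The effect is easy to see in a small sketch. All numbers below are made up for illustration: a 100 Mbps line with a modest baseline load and a single 4-minute burst that saturates it, averaged over the provider's 15-minute window:

```python
# Toy illustration: a 4-minute saturation spike inside a 15-minute window.
# All numbers are invented for the example.

link_capacity_mbps = 100.0
window_seconds = 15 * 60  # the provider's 15-minute measurement interval

# Baseline load of 20 Mbps, plus a 4-minute burst that saturates the link.
usage = [20.0] * window_seconds
for t in range(300, 300 + 240):  # burst from t=300s to t=540s
    usage[t] = link_capacity_mbps

average = sum(usage) / len(usage)
peak = max(usage)

# The average lands around 41% of capacity - "no problem" on paper -
# while the peak shows the line was completely full for four minutes.
print(f"15-min average: {average:.1f} Mbps ({average / link_capacity_mbps:.0%} of capacity)")
print(f"peak in window: {peak:.1f} Mbps ({peak / link_capacity_mbps:.0%} of capacity)")
```

An alert on the average would never fire here; an alert on the maximum (or on a high percentile) fires immediately. That is the whole difference.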
The right resolution: measuring what really matters
An APM vendor once told me proudly, “We measure every ten seconds. So what can you miss?” Not wanting to dampen his enthusiasm, I thought, “Well, almost everything.” In the world of IT, ten seconds is an eternity. Crucial events can happen in milliseconds, and if you don't capture them, you have a big “blind spot.”
Compare it to measuring temperature. Suppose you take the temperature every day at noon, while someone else does so at 6 a.m. Both of you are trying to measure the same thing - the day's temperature - but the results will be totally different. One might measure 20 degrees, the other 8 degrees. Which one is correct? Neither, or both. The problem lies in the resolution as well as in the measurement time: you miss the variation in temperature, and with it the essence of the weather.
It works the same way in IT. Suppose a microservice is only active for half a second, but you only sample every ten seconds. Then there is a 95% chance that a given run of this microservice never shows up in your measurements. And if you don't see it, you can't include it in your analysis of the problem.
Half a second may sound negligible, but in modern applications, a user often calls dozens of microservices for a single action. If several of these services are momentarily delayed and you don't measure them, you miss the cause of slow application performance. What is not measured remains invisible - but does impact the user experience, among other things.
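A quick way to convince yourself of that 95% figure is to simulate it. This toy sketch uses the 10-second sampling interval and the 0.5-second task from the example above:

```python
import random

# Toy simulation: a task runs for 0.5 s starting at a random moment,
# while a monitor samples once every 10 s. How often is the task seen?

sample_interval = 10.0   # seconds between monitoring samples
task_duration = 0.5      # seconds the microservice is active
trials = 100_000

seen = 0
for _ in range(trials):
    # The task starts at a random offset within one sampling interval.
    start = random.uniform(0.0, sample_interval)
    # The next sample (at t = sample_interval) catches the task only if
    # the task is still running at that exact instant.
    if start <= sample_interval <= start + task_duration:
        seen += 1

# The hit rate is task_duration / sample_interval = 0.5 / 10 = 5%,
# so roughly 95% of the runs are missed entirely.
print(f"task observed in {seen / trials:.1%} of runs")
```

The general rule the simulation confirms: the chance of catching a short event is its duration divided by your sampling interval. Halving the interval doubles your odds, but only event-driven or high-resolution measurement makes the blind spot disappear.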
The reverse is also a pitfall. If you measure with too high a resolution, you get so much data that it becomes difficult to extract valuable information. You can no longer see the forest for the trees.
Measuring is professional work and cannot simply be automated. I always adjust the measurement resolution to the problems I hear about and actually want to understand. That is partly intuition and partly experience. Not too coarse, not too fine, but just right to bring the real problems to the surface. Because measuring without understanding is as useless as not measuring at all.
Measuring the right things: not everything that runs is useful
Cloud vendors tout their FinOps tools as the solution for optimizing costs. Useful, but they look primarily at usage, not business utility. To avoid unnecessary costs, ask yourself: does that active server actually provide value to the organization?
A striking example: a FinOps tool indicated that a server was “active” and “running fine” and saw no reason to turn it off. After all, software was running on it. But when I looked closer, it turned out the server was only making backups of itself. No remaining function, no value to the company, just cost. With that finding alone, my fee had already paid for itself.
The lesson? The cloud vendor only looked at whether something was running, not what was running and whether that still made sense. By taking that step further, it quickly became clear that this server was only costing money without delivering anything.
True cost optimization requires more than just a checklist of active resources. It's about understanding what you're paying for and whether that money is actually contributing to your business goals. Otherwise, you may be optimizing your usage, but not your costs.
Don't keep searching - let's solve it
Are you facing a persistent IT problem that no one can put their finger on? Or do you see your cloud costs steadily rising without a clear cause? In many cases we find the cause quickly, and the solution is often simpler than people think.
Do you also want that insight and those savings? Let's look at your situation together. Schedule a no-obligation appointment and discover how, within a short time, you can gain clear insight, get a grip on your costs, and eliminate productivity loss and unnecessary expenses.