Why Most Cloud Cost Optimization Strategies Fail After 90 Days

Written by Deepak Bhagat, in Technology

The first month is typically successful. Teams eliminate reserved-instance waste, terminate unused dev environments, and downsize a few overprovisioned VMs. The dashboard turns green. Someone builds a PowerPoint presentation.

Then comes the third month, and the bill is up again.

This isn’t a coincidence. It’s almost routine at this point: the initial cleanup efforts are effective, the structural issues are not addressed, and the costs creep back up. Not always to exactly the same level, but close enough that the savings discussion needs to be repeated.

The causes are not typically technical. Or at least, not primarily. They’re organizational, behavioral, and sometimes simply a matter of what teams measure.

The First 90 Days Are the Easiest

Most cloud environments contain a lot of clearly visible waste: unattached volumes, instances running at 3% CPU usage, test environments that have outlived their tests, and S3 buckets full of years-old logs that no one ever asks for. It’s not difficult to clean that up once someone decides to do it.
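The kind of visible waste listed above can be flagged with very little code. The following is a minimal sketch, assuming a hypothetical inventory that has already been exported into simple records. The `Resource` fields, the `find_obvious_waste` helper, and the 5% CPU cutoff are illustrative assumptions, not any provider’s API.

```python
from dataclasses import dataclass

# Hypothetical inventory record; field names are assumptions for illustration.
@dataclass
class Resource:
    resource_id: str
    kind: str               # "instance" or "volume"
    avg_cpu_percent: float  # average CPU over the lookback window (0 for volumes)
    attached: bool          # for volumes: attached to an instance or not

CPU_IDLE_THRESHOLD = 5.0    # assumed cutoff; tune it per environment

def find_obvious_waste(resources):
    """Return (resource_id, reason) pairs for clearly idle resources."""
    flagged = []
    for r in resources:
        if r.kind == "volume" and not r.attached:
            flagged.append((r.resource_id, "unattached volume"))
        elif r.kind == "instance" and r.avg_cpu_percent < CPU_IDLE_THRESHOLD:
            flagged.append((r.resource_id, "idle instance"))
    return flagged

inventory = [
    Resource("i-001", "instance", 3.0, True),
    Resource("i-002", "instance", 62.0, True),
    Resource("vol-001", "volume", 0.0, False),
]
waste = find_obvious_waste(inventory)
```

A check like this finds what already exists. It does nothing to stop the same resources from accumulating again.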

The issue is that this type of cleanup is a one-time action. You do it, costs decrease, and then you’re done. The next bill, the month-four bill, the month-six bill, depends on whether the organization changed the way it builds and deploys things, not only on whether it cleaned up what already exists.

  • Most organizations don’t change the underlying behavior, so the waste returns.
  • One-off cleanups feel like a strategy. They’re not. They’re maintenance.
  • A team that kills 40 idle instances has resolved a symptom. A team that builds a process to keep idle instances from accumulating in the first place has solved the actual problem. These are entirely different things, and they call for different types of work.

Nobody Owns the Cost After the Project Ends

One thing that keeps coming up is that cloud cost optimization is treated as a project. It has a start date, a deliverable, perhaps an outside consultant, or a specific sprint, and then it’s done.

Cloud costs are not a project, however. They are a continuous operational reality that changes every time something is shipped, scaled, or left running. If the project is completed and no one is designated as the owner of the number, it drifts.

Typically, the teams that build and deploy never see the bill. The finance team sees the bill but has little control over it. The platform or DevOps team sits in between, and frequently has no authority to enforce anything. That gap is where costs hide.

The key is to bring cost ownership closer to where spending decisions are made: the team or service level. Not corporate-level budgets that are reviewed quarterly, but real-time visibility per service or team, monitored regularly, with someone’s name on it.

Some organizations do this very well, using their own chargeback models. Others simply give teams a live dashboard of their service costs and check in monthly. It’s not about the mechanism; it’s about the accountability.
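Whatever the mechanism, the underlying rollup is simple. Here is a minimal sketch, assuming billing line items have been exported as dictionaries with a `cost` and a `tags` mapping. The record shape, the `team` tag key, and the `UNOWNED` bucket are hypothetical choices for illustration.

```python
from collections import defaultdict

def cost_by_team(line_items, tag_key="team"):
    """Sum cost per owning team; untagged spend lands in an 'UNOWNED' bucket."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item.get("tags", {}).get(tag_key, "UNOWNED")
        totals[owner] += item["cost"]
    return dict(totals)

# Hypothetical export of billing line items.
line_items = [
    {"service": "compute", "cost": 120.0, "tags": {"team": "checkout"}},
    {"service": "storage", "cost": 80.0, "tags": {"team": "checkout"}},
    {"service": "compute", "cost": 45.5, "tags": {"team": "search"}},
    {"service": "compute", "cost": 30.0, "tags": {}},
]
report = cost_by_team(line_items)
```

Surfacing the `UNOWNED` total explicitly is a deliberate choice in this sketch: spend with no name on it is exactly the spend that drifts.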

The Metrics Are Wrong

Many teams consider the monthly bill an indicator of cloud cost optimization. If it went down, then things are working. If it went up, there’s a problem.

That’s too blunt to be useful.

Total cloud spend is tied to business growth, so optimizing for absolute spend will eventually come into conflict with scaling the product. What you want to measure is cost efficiency: expenditure divided by a useful measure of output, such as revenue, active users, transactions processed, or data ingested. Anything that makes sense to the business.

When revenue doubles and cloud spending increases by 30%, that’s a win. When revenue stays flat and cloud spending increases by 30%, that’s a problem. The monthly bill alone cannot distinguish between the two.

Without making this distinction, teams optimize for the wrong thing, or they treat rising costs as a sign that something is wrong when, in fact, it is just business growth.
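The two scenarios above can be worked through directly. This is a minimal sketch with made-up figures; the function names are hypothetical, and revenue stands in for whichever output measure fits the business.

```python
def cost_per_unit(cloud_spend, output_units):
    """Cloud spend divided by a business output measure (revenue, users, ...)."""
    if output_units <= 0:
        raise ValueError("output_units must be positive")
    return cloud_spend / output_units

def efficiency_change(prev_spend, prev_output, cur_spend, cur_output):
    """Relative change in unit cost between two periods; negative means cheaper."""
    prev = cost_per_unit(prev_spend, prev_output)
    cur = cost_per_unit(cur_spend, cur_output)
    return (cur - prev) / prev

# Illustrative figures only: spend rises 30% in both scenarios.
growth_case = efficiency_change(100_000, 1_000_000, 130_000, 2_000_000)  # revenue doubles
flat_case = efficiency_change(100_000, 1_000_000, 130_000, 1_000_000)    # revenue flat

print(f"revenue doubles: unit cost changes by {growth_case:+.0%}")
print(f"revenue flat:    unit cost changes by {flat_case:+.0%}")
```

In the first case the cost per unit of revenue falls by 35% (1.3 / 2 = 0.65 of the previous unit cost). In the second it rises by 30%. The bill grew by the same amount in both.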

The other element missing from most dashboards is forecasting. Most teams look at last month’s spending. Very few have a clear idea of what they will pay next month or which specific changes will drive the bill. That makes it difficult to react before a spike rather than after.
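Even a crude forecast is better than none. The sketch below fits a straight line through past monthly totals and extends it one month ahead. It is an illustrative assumption about how one might start, not a recommended forecasting model: it ignores seasonality, planned launches, and pricing changes, and `forecast_next_month` is a hypothetical helper.

```python
def forecast_next_month(monthly_spend):
    """Least-squares linear trend over past months, extrapolated one month ahead."""
    n = len(monthly_spend)
    if n < 2:
        raise ValueError("need at least two months of history")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(monthly_spend) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, monthly_spend))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * n  # month index n is the next, unseen month

history = [100.0, 110.0, 120.0, 130.0]  # illustrative monthly totals
next_month = forecast_next_month(history)
```

Comparing the forecast with the actual bill each month also gives a simple signal: a large gap means something changed that nobody planned for.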


Engineering Teams Don’t Think About Cost During Development

This may be the most structural issue. The practices that lead to cloud waste typically emerge during development, as teams race to ship features and don’t really care if they provision a little extra.

Over-provisioning is the safe thing to do. No one gets paged for an instance that is too large. They do get paged if it is too small.

So teams add a buffer. They create environments with no clear plan for tearing them down. They design things that work, but not necessarily cheaply, since no one asked them to.

Dashboards alone don’t change this. It changes when:

  • Cost visibility is not just in a finance report; it’s part of the development process.
  • Engineers can view the cost of their services before they ship, not after.
  • Cost is a consideration in architectural decisions, along with performance and reliability.

None of this requires slowing down. It’s primarily a matter of tools and culture. However, it does need someone who thinks it’s important.

Reserved Capacity Is Bought Without a Usage Model

Reserved instances and savings plans are genuinely useful. The discounts are real: roughly 40-60% off compared with on-demand pricing. But they are a commitment to future usage, and whether that usage materializes depends on your company’s needs. Locking in is a choice, and it is yours to get right.

The typical failure mode is to purchase reservations for current peak usage without accounting for future usage changes. Workloads migrate. Features get retired. Business priorities change. A three-year reservation on instance types that get replaced by a newer architecture is not a saving.

The teams that do this well treat the decision to reserve capacity as a continuous process, rather than a single event. They check coverage quarterly, update it as the environment changes, and are wary of long-term commitments in areas where architecture is still evolving.
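The arithmetic behind that caution is short. The sketch below assumes a simplified billing model in which every committed hour is paid for at the discounted rate whether it is used or not, with no upfront payment. Real pricing terms vary by provider and plan, and both function names are hypothetical.

```python
def break_even_utilization(discount):
    """Fraction of committed hours that must be used for a reservation to pay off."""
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1.0 - discount

def reservation_savings(on_demand_hourly, discount, committed_hours, used_hours):
    """Savings versus paying on demand for the hours actually used.

    Assumes all committed hours are billed at the discounted rate,
    used or not. A negative result means the reservation cost more.
    """
    reserved_cost = on_demand_hourly * (1.0 - discount) * committed_hours
    on_demand_cost = on_demand_hourly * used_hours
    return on_demand_cost - reserved_cost

# Illustrative numbers: a 40% discount on a $1.00/hour instance, 1,000 hours committed.
needed = break_even_utilization(0.40)                            # must use 60% of the hours
fully_used = reservation_savings(1.00, 0.40, 1_000, 1_000)       # saves money
half_used = reservation_savings(1.00, 0.40, 1_000, 500)          # loses money
```

Under these assumptions, a 40% discount only pays off if at least 60% of the committed hours are actually used. A workload that migrates or is retired halfway through turns the reservation into a net cost, which is why quarterly coverage checks matter.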

Teams also tend to avoid spot instances or preemptible VMs because they don’t want to build fault tolerance into workloads that could run on cheaper, interruptible capacity. That is reasonable, but it means savings are left on the table for workloads that could tolerate interruption with some effort.


The 90-Day Pattern and How to Break It

So what works? No single tool, no single audit, no finance policy memo.

Organizations that sustain cloud cost efficiency over the long haul do a few things consistently:

  • They make costs visible at the team level, in near real time.

Not a monthly finance report. Something engineers and team leads can review without filing a procurement ticket. A trend line, rather than a single point-in-time number.

  • They set the standard for what good unit economics should look like.

A target cost per transaction, per user, per API call, whatever is appropriate for the business. This provides teams with a target that won’t hinder growth.

  • They have lightweight governance that operates continuously.

Tag enforcement, automatic shutdown of untagged environments after X days, and alerts when spend anomalies cross a threshold. Not bureaucracy, just resistance to waste building up unnoticed.

  • They check the reserved capacity regularly.

At least quarterly. Coverage reports, utilization rates, what’s expiring, and what needs to be right-sized or let go.

  • They regard cloud cost as an engineering problem.

Not a budgeting exercise. Engineers who know the cost of their architectural decisions make better decisions, not because they’re cutting corners, but because they have a more complete picture.

The companies that get this right don’t optimize once. They create an environment that makes it harder to generate waste and harder to hide it.
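The lightweight governance item in the list above (tag enforcement, shutdown of untagged environments after X days, anomaly alerts) can be sketched in a few lines. Everything specific here is an assumption for illustration: the required tag names, the 7-day value for X, the 1.5x anomaly multiplier, and the shape of the environment records.

```python
from datetime import date, timedelta

REQUIRED_TAGS = {"owner", "team"}  # assumed tagging policy
MAX_UNTAGGED_AGE_DAYS = 7          # the "X days"; the value is an assumption

def environments_to_shut_down(environments, today):
    """Names of environments missing required tags and at least X days old."""
    cutoff = today - timedelta(days=MAX_UNTAGGED_AGE_DAYS)
    return [
        env["name"]
        for env in environments
        if not REQUIRED_TAGS.issubset(env.get("tags", {}))
        and env["created"] <= cutoff
    ]

def is_spend_anomaly(recent_daily_spend, today_spend, threshold=1.5):
    """True when today's spend exceeds the recent daily average by the threshold."""
    baseline = sum(recent_daily_spend) / len(recent_daily_spend)
    return today_spend > baseline * threshold

today = date(2024, 6, 15)  # fixed date so the example is reproducible
environments = [
    {"name": "demo-old", "created": date(2024, 6, 1), "tags": {}},
    {"name": "demo-new", "created": date(2024, 6, 14), "tags": {}},
    {"name": "checkout-staging", "created": date(2024, 5, 1),
     "tags": {"owner": "sam", "team": "checkout"}},
]
doomed = environments_to_shut_down(environments, today)
alert = is_spend_anomaly([100.0, 110.0, 90.0], 200.0)
```

Run on a schedule, checks like these apply the policy continuously instead of waiting for the next audit.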

This Takes Longer Than 90 Days to Fix

The truth is, real cost efficiency in the cloud is a 6-12 month journey, and only if the organization is serious about it. The first 90 days get rid of the obvious. The months after that are about changing how the organization works.

It’s more difficult, requires more internal alignment, and doesn’t create as sharp a “before and after” graph. However, it’s the part that actually sticks.

If your cost optimization program is still being tracked on a single dashboard that compares your savings to those from three months ago, then it’s likely you’re on the path to another upward trend.

This article was contributed by costimizer.ai
