FinOps — The complete cost optimization guide
Two of the primary motivations for using the cloud are flexibility and cost savings. The problem is that without ongoing management and oversight, the cloud budget can grow to a size that eliminates profitability.
A recent study on the cost of using the cloud has shown that companies pay an average of 35% more on cloud services than they really should. A total of over $60 billion is being wasted on cloud services that are not in use.
In this article, I will share my conclusions from a recent cloud cost optimization process I managed, and review a number of ways and tools to reduce cloud costs by 30–50% without sacrificing performance. The recommendations in this article are valid for all cloud vendors unless otherwise specified.
Tools for managing cloud costs
Before you start saving, it is important to understand how your costs are composed. Each cloud has cost management tools that break costs down by service type. These tools can also automatically detect underutilized servers and recommend downsizing them or turning them off.
- Multi-cloud: VMware CloudHealth
- Azure: Cost advisor, Cloudyn
- AWS: Trusted advisor, Cost management
- GCP: Cost management
Part 1 — Server costs
Servers are usually the largest component of the overall cost, so we will start with server cost-saving strategies. We will then look at ways to lower your storage and network expenses, and finally review some business options for obtaining significant discounts on cloud spending.
Turn off servers
Estimated discount — 15% of total server cost
- Turn off servers that are not being used.
- Right-size each server so that its size matches its actual usage load.
- Restrict user permissions to create new servers.
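The first two steps can be sketched as a simple utilization check. This is only an illustration: the `Server` class, the sample data, and the 5% and 30% thresholds are assumptions made for the example, not values taken from any cloud vendor's tool.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Server:
    name: str
    cpu_samples: List[float]  # CPU utilization samples in percent (made-up data)


def recommend(server: Server, idle_below: float = 5.0, small_below: float = 30.0) -> str:
    """Return 'turn off', 'downsize' or 'keep' based on average CPU utilization.

    The thresholds are illustrative assumptions; tune them to your workloads.
    """
    avg = mean(server.cpu_samples)
    if avg < idle_below:
        return "turn off"
    if avg < small_below:
        return "downsize"
    return "keep"


servers = [
    Server("build-agent", [1.0, 2.0, 0.5]),
    Server("api", [55.0, 70.0, 62.0]),
    Server("reports", [12.0, 20.0, 16.0]),
]
recommendations = {s.name: recommend(s) for s in servers}
print(recommendations)
```

In practice you would feed in metrics exported from your cloud's monitoring service and probably look at memory and network usage as well, not CPU alone.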
Reserved Instances
Estimated discount — 50% of the cost of long-term servers
Reserved Instances (RI) can significantly reduce server costs, provided you commit to a long-term contract. Usually, a one-year commitment gives a 40% discount, while a three-year commitment gives a 60% discount.
If you still want to stop using a server you have committed to, there are several ways to get out of the RI commitment:
- Replace the commitment with a different type of server
- Cancel the RI and pay an early termination fee
From a calculation I made for Azure, it is more profitable to commit to three years with a 60% discount and cancel after a year (paying a 12% penalty) than to commit to one year with a 40% discount.
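The comparison can be reproduced with a few lines of arithmetic. The sketch below assumes that the 12% penalty is charged on the remaining two years of the commitment and that the reservation is paid in equal yearly amounts; both are assumptions for illustration, so check your provider's current cancellation terms before relying on the numbers.

```python
def one_year_ri_cost(on_demand_per_year: float, discount: float = 0.40) -> float:
    """Cost of one year of usage under a one-year reservation."""
    return on_demand_per_year * (1 - discount)


def three_year_ri_cancelled_after_one(
    on_demand_per_year: float, discount: float = 0.60, penalty: float = 0.12
) -> float:
    """Cost of one year of usage under a three-year reservation cancelled after year one.

    Assumption: the penalty applies to the two remaining years of commitment.
    """
    yearly = on_demand_per_year * (1 - discount)
    remaining_commitment = 2 * yearly
    return yearly + penalty * remaining_commitment


on_demand = 100.0  # normalized on-demand cost per year
cost_1y = one_year_ri_cost(on_demand)
cost_3y_cancelled = three_year_ri_cancelled_after_one(on_demand)
print(round(cost_1y, 2), round(cost_3y_cancelled, 2))  # 60.0 49.6
```

Under these assumptions, one year of usage costs about 49.6 instead of 60 per 100 of on-demand spend, which matches the conclusion above.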
Spot servers
Estimated discount — 70% of server costs that are not “mission critical”
Spot servers, or low-priority servers, are 70–90% cheaper than regular servers of the same power. Furthermore, the discount today is more or less fixed (it no longer depends on the auction process that was used in the past).
The catch is that these servers are low priority, so you run the risk that they will be shut down suddenly, with little or no notice.
There are however several ways to deal with this unfortunate state of affairs:
- Run stateless services that are not critical; if a server is shut down, you can always start it again (the disk is not deleted).
- Work with queues; if a server is shut down, the task remains in the queue and waits for another server.
- Use autoscaling rules to automatically maintain a target instance count.
- An interesting company called Spotinst helps reduce costs by using Spot effectively. They can identify spot servers that are about to be shut down and replace them with other spot servers with virtually no downtime. From an inquiry I made, their Azure support is still incomplete (the Azure managed Kubernetes service is not fully supported); AWS is better supported.
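The queue-based approach from the list above can be illustrated with a small in-process simulation. Everything here is hypothetical: `SpotInterrupted` stands in for a spot server being reclaimed mid-task, and a real system would use a managed queue service rather than Python's `queue` module.

```python
import queue


class SpotInterrupted(Exception):
    """Raised by the simulated worker when its spot server is reclaimed."""


def process(task: str, interrupted_tasks: set) -> str:
    # Hypothetical task handler: fails once for each task in interrupted_tasks.
    if task in interrupted_tasks:
        interrupted_tasks.discard(task)
        raise SpotInterrupted(task)
    return task.upper()


def run_workers(tasks, interrupted_tasks):
    pending = queue.Queue()
    for t in tasks:
        pending.put(t)
    results = []
    while not pending.empty():
        task = pending.get()
        try:
            results.append(process(task, interrupted_tasks))
        except SpotInterrupted:
            # The server disappeared mid-task: put the task back for another worker.
            pending.put(task)
    return results


results = run_workers(["resize", "encode", "upload"], interrupted_tasks={"encode"})
print(results)  # ['RESIZE', 'UPLOAD', 'ENCODE']
```

The interrupted task is not lost; it is simply completed later. This only works safely if tasks are idempotent, since a task may be partly executed before the server is shut down.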
Serverless / Autoscaling Architectures
Estimated discount — 90% of the cost of servers that are kept running constantly to handle peak loads
Our systems have to withstand heavy loads, but there is no reason to keep all the servers working during periods of low demand.
Autoscaling ensures that the number of servers will automatically scale according to the measured load.
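As a rough sketch of what such a rule computes, the function below scales the instance count in proportion to the measured load relative to a target load per instance. The function name, the load units, and the minimum and maximum limits are assumptions made for this example; each cloud and orchestrator has its own autoscaling configuration.

```python
import math


def desired_instances(
    current: int,
    measured_load: float,
    target_load: float,
    min_instances: int = 1,
    max_instances: int = 20,
) -> int:
    """Proportional scaling rule, clamped to [min_instances, max_instances].

    measured_load and target_load are average load per instance (e.g. CPU %).
    """
    wanted = math.ceil(current * measured_load / target_load)
    return max(min_instances, min(max_instances, wanted))


peak = desired_instances(current=4, measured_load=90.0, target_load=60.0)
night = desired_instances(current=4, measured_load=10.0, target_load=60.0)
print(peak, night)  # 6 1
```

With four servers at 90% load and a 60% target, the rule asks for six servers; at 10% load overnight, it drops to one.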
Serverless functions run only on demand, without the need for permanent servers.
Proper architecture is one of the essential elements of an effective cloud deployment but is beyond the scope of this article. For architecture and cloud design patterns in general, see awesome-design-patterns.
In systems managed by Kubernetes, the allocation of resources is usually more efficient and cost-effective.
Dev / Test
Estimated discount — 50% of the cost of non-production servers
A large portion of our servers belong to dev, test and pre-prod environments.
In Azure, there are discounts on dev/test environments (I do not know of a parallel program in other clouds).
In many cases, we work on dev/test environments during the day only, so these servers can be shut down at night and on weekends, saving over 50% of the cost. There are tools to shut down servers automatically according to a schedule:
- ParkMyCloud: servers are turned on only during the time slots marked green in the schedule.
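To see where the "over 50%" figure comes from, here is a small sketch of a weekday-only schedule and the saving it implies. The working window of 08:00 to 20:00, Monday to Friday, is an assumption chosen for illustration.

```python
from datetime import datetime


def should_run(now: datetime, start_hour: int = 8, end_hour: int = 20) -> bool:
    """True on weekdays between start_hour (inclusive) and end_hour (exclusive)."""
    is_weekday = now.weekday() < 5  # Monday=0 ... Sunday=6
    return is_weekday and start_hour <= now.hour < end_hour


def weekly_saving(start_hour: int = 8, end_hour: int = 20) -> float:
    """Fraction of the weekly hours during which the servers are off."""
    hours_on = 5 * (end_hour - start_hour)
    return 1 - hours_on / (7 * 24)


saving = weekly_saving()
print(f"{saving:.0%}")  # 64%
```

Running 12 hours a day on weekdays means the servers are on for 60 of the 168 hours in a week, so roughly 64% of the compute hours are saved. Disks attached to stopped servers are usually still billed, so the saving on the total bill is somewhat lower.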
Part 2 — Storage and Network
Storage
Estimated discount — 20% of storage costs
Here are the top 5 cloud storage categories ordered from cheap to expensive (and from slow to fast):
- Archive storage - AWS Glacier
- Object storage - S3 / Blob
- File storage - network file shares that can be mounted on multiple servers
- Block storage - SSD disks
- Database storage - SQL, MongoDB …
Within each storage category, there are several price levels depending on speed and redundancy.
Raw data can be stored in a cheap storage category, but metadata, used for frequent queries, is best kept in an expensive storage category.
Archive storage is the cheapest form of storage but is not practical for ongoing work due to slow retrieval times. Object storage is the next cheapest category, so it is the preferred choice for the majority of the data.
You should set up a “Storage Lifecycle” policy, which lets you define rules for automatically moving old files to cheaper storage categories (available in AWS and in preview in Azure).
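The idea behind a lifecycle policy can be sketched as a set of age-based rules. The tier names and the 30-day and 365-day thresholds below are assumptions for illustration; real policies are defined in each provider's own configuration format and applied by the storage service itself.

```python
from datetime import date

# Hypothetical rules: (minimum age in days, storage tier), checked from oldest to newest.
LIFECYCLE_RULES = [
    (365, "archive"),
    (30, "infrequent-access"),
    (0, "standard"),
]


def tier_for(last_modified: date, today: date) -> str:
    """Return the storage tier a file should live in, given its age."""
    age_days = (today - last_modified).days
    for min_age, tier in LIFECYCLE_RULES:
        if age_days >= min_age:
            return tier
    return "standard"  # files dated in the future stay in the default tier


today = date(2024, 6, 1)
print(tier_for(date(2024, 5, 25), today))  # standard
print(tier_for(date(2024, 3, 1), today))   # infrequent-access
print(tier_for(date(2022, 1, 1), today))   # archive
```

When choosing thresholds, keep in mind that cheaper tiers often charge for retrieval or for early deletion, so moving data that is still read frequently can cost more than it saves.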
Network
Estimated discount — 20% of the cost of network traffic
The network traffic consists of internal traffic (within your network) and external traffic between the server and the customers.
For internal traffic I recommend:
- Allowing processes to operate within the same geographical area if possible.
- Using only internal addresses.
To understand ways to reduce external traffic, I talked with John Graham-Cumming, Cloudflare’s CTO, who pointed out Cloudflare’s networking cost benefits*:
- DDoS protection — DDoS attacks can incur networking costs.
- CDN — files are downloaded from Cloudflare servers closest to the client location (with no additional network costs).
- Bandwidth Alliance — promises cheaper cloud networking costs when transferring data between members of the alliance.
- Brotli compression — better than gzip.
*This does not imply Cloudflare is the only vendor in this domain.
Part 3 — Business discounts
Cloud providers and their partners have many programs that can offer significant discounts. Large customers can also bargain directly with cloud providers for discounts.
Cloud partners
Estimated discount — 5–10% off the total cost + consultation
There is no substitute for consultation from an experienced cloud architect. All the major cloud providers have partners that provide consulting services. Working with them can provide a number of benefits:
- Discounts on the total cost of the cloud
- Support for malfunctions
- Architectural consulting
- Tools for cost management
- Flexible payment terms
Start-up programs
Estimated discount — credit that can reach tens of thousands of dollars or more
All cloud companies have plans that can benefit start-ups.
The benefits can include:
- credit for cloud usage
- consulting services
- business promotion/accelerator
Free tier programs
Estimated discount — first year free for small servers
All cloud providers have free tier plans. These programs allow free use of the lowest cost services for a year or for a fixed number of uses. There is usually no limit to the number of free tier accounts that can be created.
- The topics mentioned above reflect my professional opinion only. I have no affiliation with any of the services mentioned in the article.