The Cloudy Tales

Risk analysis for resource planning

2023-01-13T00:00:00+00:00

Risk analysis is a crucial component of resource forecasting for datacenters. By identifying and evaluating potential risks, datacenter managers can ensure that their operations are prepared to handle any potential challenges that may arise, allowing them to deliver reliable, high-quality service to their customers. In this blog post, we will explore the different types of risks that can affect resource forecasting in a datacenter. We will also discuss how to use risk analysis to inform resource allocation and capacity planning decisions.

Risk analysis for capacity planning is challenging because there are different sources of variability. Let’s consider 3 major categories of risk for resourcing datacenters:

Demand variability
Supply lead time variability
Repairs and failure variability

Demand variability represents the fact that customer behavior is unpredictable, and different applications have different levels of variability. For example, a data warehouse that regularly executes the same set of workflows on a similar-sized dataset for BI would have highly predictable demand, whereas traffic to Wikipedia probably follows a far less predictable demand curve that is influenced by real-world events. One way to think about demand variability is to use historical data as a model for demand growth and uncertainty. You could draw a forecast for future demand at different levels of confidence. As you predict further and further out, the error bars add up and the range of potential demand widens. This is called the cone of uncertainty. The figure below shows a cone of uncertainty for the minimum demand expected to be presented by customers to the system based on historical variability.
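As a rough sketch of how such a cone could be computed, the snippet below bootstraps future demand paths from historical period-over-period growth. The history values, the function name demand_cone, and its parameters are all made up for illustration; a real forecast would likely use a proper time-series model.

```python
import numpy as np

def demand_cone(history, horizon, percentiles=(50, 90, 99), n_paths=10_000, seed=0):
    """Bootstrap a cone of uncertainty from historical growth ratios (illustrative only)."""
    history = np.asarray(history, dtype=float)
    growth = history[1:] / history[:-1]  # observed period-over-period growth ratios
    rng = np.random.default_rng(seed)
    # Resample the observed ratios with replacement for every path and future period.
    samples = rng.choice(growth, size=(n_paths, horizon), replace=True)
    paths = history[-1] * np.cumprod(samples, axis=1)
    # One array per percentile, with one entry per future period.
    return {p: np.percentile(paths, p, axis=0) for p in percentiles}

history = [1000, 1040, 1025, 1100, 1150, 1130, 1210, 1280]  # hypothetical weekly peak QPS
cone = demand_cone(history, horizon=12)
spread = cone[99] - cone[50]  # distance between the median and the P99 forecast
print(spread[0], spread[-1])
```

The spread between the median and the P99 line grows with the forecast horizon, which is what gives the cone its shape.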

So now we have two questions to answer:

What confidence level of the cone do we pick for the forecast?
How far out in time should we forecast?

In a previous blog post, we walked through the process of performance modeling an application (but we cheated on the forecasting problem then). By applying a simplifying assumption that the application will burn its error budget if it does not have all the resources it needs, we could answer the first question by mapping the forecast confidence level to the application SLO. In this case, let’s say we have a P99 SLO, and thus we could choose the forecast line that represents the 99th percentile confidence.

The second question depends on how long it takes us to acquire the resources. In the case of a cloud service, this is simple: acquisition is often instantaneous, or has a small turnaround time for quota increase requests. But if you manage your own datacenter, you have to work with your supply chain contract manufacturers to identify the machine lead time. That leads us to the second source of variability.

Supply lead time variability represents the amount of time that the supply chain requires to fulfill an order. This depends on several factors, including the length and complexity of the supply chain. Supply chain modeling and optimization is a field in itself that we will cover in a future post. But the key intuition to grasp here is that resource delivery lead times are not deterministic. If you have a supply chain with multiple contractors depending on each other, a delay at a single contractor deep in the chain (say, because of procurement delays or unexpected logistics trouble) propagates through the chain and can grow along the way, much like the bullwhip effect.

As with demand variability, we could draw a cone of uncertainty for supply variability, based either on historical delivery times or on the SLAs provided by each of the nodes in the supply chain. A different way to think about this is as a CDF of the time to fulfill orders.
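A small sketch of the CDF view, assuming we have a list of past order lead times in days. The numbers and helper names below are invented for illustration.

```python
import numpy as np

def lead_time_cdf(lead_times_days):
    """Empirical CDF: fraction of orders fulfilled within x days."""
    x = np.sort(np.asarray(lead_times_days, dtype=float))
    cdf = np.arange(1, len(x) + 1) / len(x)
    return x, cdf

def lead_time_at_confidence(lead_times_days, confidence):
    """Lead time (in days) that covers the given fraction of past orders."""
    return float(np.quantile(lead_times_days, confidence))

past_orders = [30, 35, 35, 40, 42, 45, 50, 60, 75, 120]  # hypothetical lead times in days
days, cdf = lead_time_cdf(past_orders)
median_days = lead_time_at_confidence(past_orders, 0.50)
p90_days = lead_time_at_confidence(past_orders, 0.90)
print(median_days, p90_days)
```

Planning against the P90 lead time rather than the median means ordering earlier, which is the supply-side equivalent of picking a higher line in the demand cone.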

Finally, repairs and failure variability is the loss of capacity in the datacenter owing to failures and repairs. While supply is the ability to add resources to the datacenter, this risk takes active resources away.

Now that we have a probability distribution for each of these risks, how can we use them to determine an appropriate level of resource ordering?

Option #1: Choose a confidence interval in the cone of uncertainty based on the application SLO.

The risk the application is exposed to depends on how correlated the risks are. If they are perfectly correlated with each other, all three go wrong in the same 1% of cases, so the stock-out probability is MAX(1%, 1%, 1%) == 1%, i.e., we would keep our 99% confidence and stock out 1% of the time.

But what about the worst-case scenario? If the risks are perfectly anti-correlated, so that no two of them ever strike at the same time, we would end up using an error budget that is additive: SUM(1%, 1%, 1%) == 3%, i.e., we would stock out 3% of the time.

In reality, these risks are generally independent of each other, so neither scenario is correct.
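To make the three cases concrete, here is a small sketch that combines per-risk stock-out probabilities under each assumption. The function name and scenario labels are made up for this post.

```python
def combined_stockout(risks, scenario):
    """Combine per-risk stock-out probabilities under a correlation assumption."""
    if scenario == "perfectly_correlated":
        # All risks strike in the same cases, so the largest one dominates.
        return max(risks)
    if scenario == "never_overlapping":
        # Risks never strike together, so the error budgets add up.
        return min(1.0, sum(risks))
    if scenario == "independent":
        # We are fine only if every risk independently stays within bounds.
        all_ok = 1.0
        for r in risks:
            all_ok *= 1.0 - r
        return 1.0 - all_ok
    raise ValueError(f"unknown scenario: {scenario}")

risks = [0.01, 0.01, 0.01]  # demand, supply lead time, repairs and failures
for scenario in ("perfectly_correlated", "independent", "never_overlapping"):
    print(scenario, combined_stockout(risks, scenario))
```

With three independent 1% risks the combined stock-out probability is 1 - 0.99^3, or roughly 2.97%, which sits between the two extremes and much closer to the additive case.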

Option #2: Use a Monte Carlo simulation to optimize parameter choices.

So the next best option is to simulate the stock-out risk, by sampling from the three distributions. If we rerun the simulation millions of times, this is in effect a Monte Carlo simulation of the system’s resource use. This will allow us to draw a CDF graph of the resource utilization rates. By tweaking the simulation to take different values of confidence percentiles for each of the risk vectors, we could thus also find optimal planning parameters to address each of the risks.
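Here is a minimal sketch of such a simulation. Everything in it is an assumption made for illustration: the distributions (lognormal demand and lead time, beta-distributed failed fraction), their parameters, the installed fleet of 100 servers, and the 90-day horizon. In practice these would be fitted to your own historical data.

```python
import numpy as np

def stockout_probability(order_size, n_trials=200_000, seed=0):
    """Estimate the chance that usable capacity falls short of demand at the horizon."""
    rng = np.random.default_rng(seed)
    installed = 100.0   # servers already installed (assumed)
    horizon_days = 90   # when the extra capacity is needed (assumed)
    # Demand at the horizon, in servers' worth of load (assumed lognormal).
    demand = rng.lognormal(mean=np.log(110), sigma=0.10, size=n_trials)
    # Days until the order is delivered (assumed lognormal).
    lead_time = rng.lognormal(mean=np.log(60), sigma=0.30, size=n_trials)
    # Fraction of the fleet that is out for repair (assumed beta).
    failed_fraction = rng.beta(2, 50, size=n_trials)
    # The order only helps if it arrives before the horizon.
    arrived = np.where(lead_time <= horizon_days, order_size, 0.0)
    usable = (installed + arrived) * (1.0 - failed_fraction)
    return float(np.mean(usable < demand))

for order in (10, 20, 40):
    print(order, stockout_probability(order))
```

Sweeping the order size (or the horizon) and reading off where the estimated stock-out probability drops below the error budget is one way to pick planning parameters.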

In summary, by drawing deeper insight into the risks involved in resource provisioning, we can mitigate the impact of stock-outs, and ensure the reliability and stability of operations. A key insight here is that these risks can be mitigated by being conservative and holding a large “Safety Stock”. But this buffer of resources comes at a cost. Thus, we are well motivated to keep this inventory as low as possible. Inventory optimization is a large problem in its own right and will be discussed in a subsequent blog post.

Let’s do some capacity planning

2023-01-03T00:00:00+00:00

Cloud computing has revolutionized the way we think about computing resources and storage capacity. With the ability to access virtually unlimited resources on demand, it can feel like we have an almost infinite capacity at our fingertips. But resources still have to be provisioned: without enough of them, system performance will degrade and grind to a halt. Capacity planning is a critical challenge in running systems, and we don’t always learn how to be effective at it. It can be time-consuming and requires careful analysis and consideration of various factors such as cost, scalability, and reliability.

Let’s go through the process of capacity planning with a toy example. Let’s start with the following assumptions:

the system is processing 1,000 QPS (queries per second) on average.
this is processed with a pool of servers each of which can process 100 QPS.
each server needs 1 core and 4 GiB of RAM.

How much compute should we provision for this system?

Attempt #1

We need 10 servers to service the 1,000 QPS. That means 10 cores and 40 GiB of RAM. Easy peasy!

Hmm… the customers are complaining about performance. They are hard to please, aren’t they? Let’s look deeper.

Attempt #2

Crikey! The 1,000 QPS was the average load. Duh. Demand varies across the day and the week. We should be capacity planning for peak load. Turns out that the peak load is 2,000 QPS. That means we need twice as many servers! Expensive, but at least they stop whining now.
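The arithmetic in the first two attempts is simple enough to write down. This is only a sketch; the function name size_fleet is made up, and the defaults come from the toy assumptions above.

```python
import math

def size_fleet(qps, qps_per_server=100, cores_per_server=1, ram_gib_per_server=4):
    """Naive sizing: just enough servers to absorb the given load."""
    servers = math.ceil(qps / qps_per_server)
    return {
        "servers": servers,
        "cores": servers * cores_per_server,
        "ram_gib": servers * ram_gib_per_server,
    }

average = size_fleet(1_000)  # Attempt #1: size for the average load
peak = size_fleet(2_000)     # Attempt #2: size for the peak load
print(average)
print(peak)
```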

Well not quite. This ungrateful lot is still unhappy.

Attempt #3

Of course. The queries aren’t equal cost - some queries are more expensive and take more compute time to respond to. That means we are never quite going to be able to distribute them perfectly and run them at a 100% utilization rate, are we?

Say the load is Poisson distributed and we want to maintain 3 9s (i.e., ensure 99.9% of queries comply). Ideally, a server that handles 100 QPS should take 1/100 = 0.01s, or 10ms, per query. Our customers expect an SLO of < 20ms for 99.9% of requests. Let’s start by checking what the latency would be at the 99.9th percentile with the current provisioning.

import math

import ciw
import numpy as np

def get_lognormal_vars(mean, sd):
    # Convert the desired mean and standard deviation of the service time
    # into the (mu, sigma) of the underlying normal distribution.
    var = math.log((sd * sd) / (mean * mean) + 1)
    mu = math.log(mean) - var / 2
    return mu, math.sqrt(var)

def get_service_time(qps, utilization_ratio, throughput, sigma):
    # utilization_ratio scales the server count: 1.0 provisions exactly
    # qps / throughput servers, and larger values add headroom.
    server_count = math.ceil(qps / throughput * utilization_ratio)
    mean_service_time = 1.0 / throughput
    N = ciw.create_network(
        # Poisson arrivals, i.e. exponentially distributed inter-arrival times.
        arrival_distributions=[ciw.dists.Exponential(rate=qps)],
        # Lognormal service times with a standard deviation of sigma * mean.
        service_distributions=[ciw.dists.Lognormal(*get_lognormal_vars(mean_service_time, sigma * mean_service_time))],
        number_of_servers=[server_count]
    )
    Q = ciw.Simulation(N)
    Q.simulate_until_max_customers(10**6, progress_bar=True)
    # Latency seen by the customer: time spent queueing plus time in service, in seconds.
    servicetimes = [r.waiting_time + r.service_time for r in Q.get_all_records()]
    return (server_count, np.percentile(servicetimes, 50), np.percentile(servicetimes, 99), np.percentile(servicetimes, 99.9))

>>> server_count, p50_latency, p99_latency, p999_latency = get_service_time(2 * 1000, 1.0, 100, 0.2)
>>> print((server_count, p50_latency, p99_latency, p999_latency))
(20, 0.24202101961395783, 0.47748099008185135, 0.49158595663888643)

Yikes! 490ms is a long way from the 20ms that we are targeting (and at 100% utilization the queue never settles, so the exact figure varies from run to run). We could add more servers to improve the performance. But how many do we add? Let’s do a sensitivity analysis of how the 99.9th percentile service time changes with increased servers.

Server count	P99.9 Latency (ms)
20	658.552149
21	40.867156
22	25.780134
23	22.037392
24	19.995492
25	18.984054
26	18.523606
27	18.390090
28	18.239713

The latency curve has a sharp knee just past 20 servers, and at 24 servers the P99.9 latency drops below our target of 20ms. Cool, that means we provision 20% additional machines, or in other words, run the servers at a utilization of 83.3%.
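Reading the answer off the table can also be done in a few lines. The latencies below are copied from the sensitivity analysis; the helper name is made up.

```python
# P99.9 latency (ms) by server count, copied from the sensitivity analysis table.
p999_latency_ms = {
    20: 658.552149, 21: 40.867156, 22: 25.780134, 23: 22.037392, 24: 19.995492,
    25: 18.984054, 26: 18.523606, 27: 18.390090, 28: 18.239713,
}

def smallest_fleet_meeting_slo(latency_by_servers, target_ms):
    """Return the smallest server count whose P99.9 latency is under the target."""
    for servers in sorted(latency_by_servers):
        if latency_by_servers[servers] < target_ms:
            return servers
    return None  # no measured configuration meets the target

baseline = 20  # servers needed at 100% utilization for the 2,000 QPS peak
chosen = smallest_fleet_meeting_slo(p999_latency_ms, target_ms=20.0)
headroom = chosen / baseline - 1  # extra machines relative to the baseline
utilization = baseline / chosen   # resulting target utilization
print(chosen, headroom, utilization)
```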

Attempt #4

Dagnabbit! The datacenter is in the path of a hurricane. Oh, how ye pain me. The customers are not going to be happy with downtime. Actually, come to think of it, maybe I just got lucky until now. The datacenter could have become unavailable for any number of reasons: network outages, acts of god, physical infrastructure failure, and whatnot.

I know what I will do: I will spread the servers across two different datacenters, so 12 in DC1 and 12 in DC2. Hmmm, but what happens if the hurricane hits? We will need to shut down DC1, so the service will need to run on just 12 servers. We know that makes people unhappy. I need “N+1” redundancy: the service must keep its full capacity even when one datacenter is down. Since I have two datacenters, I need 24 servers in each. But that doubles the number of servers again.

Attempt #5

Just as things were getting a bit quiet, the customer complaints are back. What is it now? Oh humbug, the original problem requirements are no longer correct. Turns out the customers like the application, and more and more customers are signing up.

I suppose historical usage is representative of future growth. We appear to be growing at about 10% quarter on quarter. I will need to make sure to provision additional machines at regular intervals now. And I still need to make sure they are positioned properly to have the redundancy we need. This means that by next year, we will need (1 + 0.1)^4 * 24 * 2 ~= 71 servers.

[In a real setting, we would do this more rigorously. But you get the idea.]
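The back-of-the-envelope growth math, as a sketch. The function name is made up, and like the text it compounds the server count directly instead of re-running the latency analysis on the larger load.

```python
import math

def servers_after_growth(current_servers, quarterly_growth, quarters):
    """Compound the current fleet size by a fixed quarterly growth rate, rounding up."""
    return math.ceil(current_servers * (1 + quarterly_growth) ** quarters)

# 24 servers in each of 2 datacenters, growing 10% per quarter for 4 quarters.
next_year = servers_after_growth(24 * 2, 0.10, 4)
print(next_year)
```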

Attempt #6

I am beginning to think that my primary job is to listen to complaints. Now that the customers are happy with the performance, the finance team is unhappy. The service costs way too much to run. Let’s see what I can do.

Hey, I have an idea. The N+1 redundancy buffer is equal to the size of the largest failure domain. If I could increase the number of failure domains, the buffer should shrink. But hmm, we only have a presence in 3 datacenters, and DC3 only has enough resources to run 10 servers. If we size the service as 14 servers in DC1, 14 servers in DC2, and 10 in DC3, losing either of the largest datacenters still leaves 24 servers running, and we reduce the number of deployed servers from 48 to 38.
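A quick way to sanity-check a placement is to remove the largest failure domain and see whether the 24 servers we need are still standing. This is a sketch with a made-up helper name; it only checks a given allocation and does not search for the cheapest one.

```python
def survives_largest_failure(allocation, required):
    """True if losing the largest failure domain still leaves `required` servers."""
    total = sum(allocation.values())
    largest = max(allocation.values())
    return total - largest >= required

required = 24  # servers needed to meet the latency SLO at peak load

two_dc_split = {"DC1": 12, "DC2": 12}
two_dc_n_plus_1 = {"DC1": 24, "DC2": 24}
three_dc = {"DC1": 14, "DC2": 14, "DC3": 10}

for name, allocation in [("12/12", two_dc_split), ("24/24", two_dc_n_plus_1), ("14/14/10", three_dc)]:
    print(name, sum(allocation.values()), survives_largest_failure(allocation, required))
```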

Attempt #7

Maybe we could do more to optimize this. We have been focusing on horizontal scaling. Maybe we could also try vertical scaling? If each server is larger, it can handle more requests, but will it be able to scale super-linearly? Presumably it could, up to a certain extent, since the static costs of initialization and such will be amortized and it will have better in-process cache effects. Let’s do a stress test on different machine sizes and see how the service rate changes.

[Again, this is much more complex in reality, but the principle stands. There are a lot more dimensions: do you scale the speed of the processor, the number of cores on the processors, hyperthreading support, the processor platform, etc.?]

Oh sweet. It looks like if I double the server size, I can triple the throughput. That means reducing the total processing core demand from 24 cores to 24 * 2 / 3 ~= 16 cores across 8 servers. Since I have 8 server slots available in each DC, I can spread the servers equally to minimize the redundancy buffer: 4 in each, so that any two DCs still provide the 8 servers we need. We went from 38 servers down to 12 servers (or 38 CPU cores to 24 cores).

An astute reader would notice that this was primarily performance modeling rather than capacity planning. That is true, but this is indeed the first step toward capacity planning. In subsequent posts, we will dig into forecasting, risk pooling, and supply chains.

The Software Engineering Ladder

2022-12-18T00:00:00+00:00

Mentoring and coaching engineers is a part of my role that I enjoy a lot. A topic that comes up often in conversations with mentees is career growth and promotions. Promotions are important for a number of reasons including increased compensation, expanded scope, accessing more challenging problems and perhaps even the validation that one is growing and performing better and better over time.

Being excellent in your current role is not always enough: The important question is whether you are ready for the next level and, if not, what skills you should be working on to get there. This is often confusing because a transition between levels is often a transition between roles. This means that in spite of getting high ratings at your current level, you may not be performing the job functions of the next level. While this post focuses on the major functional role requirements, it does not talk about a number of other essential aspects such as clear communication, community contributions, working well with others and so on. These are critical aspects as well, and I have come across cases of unsuccessful promotions because the candidate’s profile was not well rounded.

Mapping levels across companies: Software engineering levels do not map one to one across companies. In this post I use the levels at Google (the ladder I am most familiar with) as a reference, but I expect these expectations to map across companies within +/-1 level. levels.fyi maps SWE levels across different companies. Many early-stage companies start with no levels at all and only adopt a ladder much later on. For example, Netflix is only adopting a software engineer ladder in 2022, a full 25 years after starting operations.

Entry level software engineer: L3 is the level at which a new graduate typically joins Google. At Amazon this is L4, and at Microsoft this starts at 59. At this level, an engineer is still learning the trade. The engineer would need lots of mentoring time and support from senior engineers in the team. An engineer at this level should expect to be assigned well-defined tasks (e.g., “Implement a module that generates corresponding RPC calls to the storage layer for each incoming write”), and should be able to execute these tasks with support from others.

An L3 engineer should be evaluating the L4 role once the engineer is able to execute the assigned tasks with little to no external support. The engineer should also be starting to write design docs with support from senior members of the team.

Junior software engineer: L4 at Google is a natural progression for an engineer. It is important to note that some companies expect entry level engineers to keep growing in their career up to a certain level. Google used to expect that all engineers be able to grow into L5, but this changed a few years back to L4. At this level, engineers work on reasonably well-defined projects (e.g., “Implement a high throughput module that accepts user writes and reads”) and break them up into independently executable tasks. Specifically, the engineer would be responsible for designing the solution with limited help, and delegating the tasks to L3 engineers in the team.

As the L4 engineer matures in the role, their independence should continue to improve (especially in system design skills) until they require very limited support. A good sign of maturity is when the engineer is able to identify new projects without external support.

Senior software engineer: While the next level is Senior Software Engineer, this level figures as a mid-level role in the context of the software engineering ladder and is often also the band with the majority of the engineers in the company. At this level, the engineer transitions from projects to problems (e.g., “The high latency of the system is causing customer dissatisfaction”) and is responsible for developing a roadmap of projects that can solve the problem. While leadership skills are important at each level, they are especially critical from this level onward. Leadership is often construed as managing other people, but this could not be further from the truth. Leadership is the ability of the engineer to influence others: this could be influencing the work of their peers, influencing their manager about the priority of different work items, or possibly influencing dependent teams to produce a feature that the engineer needs. The other major difference at this level is the need to internalize business requirements. At this level, an engineer might also have the title of “tech lead”.

Once the engineer starts exhibiting “influencing without authority” and is able to consistently scale their impact through others, it is a strong indication that they are getting ready for the next level.

Engineers make a choice at this level of their career: continue down the individual contributor (IC) ladder, or start managing a team. This post focuses on the IC ladder, but this point in an engineer’s career is an ideal time to make the switch. If you are thinking about becoming a manager, a book I highly recommend is The Manager’s Path.

Staff software engineer: The transition from L3 to L5 is a progression of handling increasing ambiguity and independent execution. However, a “Staff Software Engineer” represents a different job profile. This is a highly selective role, and most organizations will have fewer than 5% of their engineers in Staff+ roles. In this role, the engineer often owns an area (e.g., capacity planning, monitoring, or scalability) and is responsible for identifying the problems worth solving in that area. This role is an extension of the execs in the org, and they will often lean on the engineer for decision making. Leadership skills are the central focus of this role, and a staff engineer will need to influence across teams and product areas. Each subsequent level represents a fundamentally different role, but the expectations become more and more murky as they depend on the needs of the organization. A couple of books on the topic that I would highly recommend reading are “Staff Engineer” by Will Larson and “The Staff Engineer’s Path” by Tanya Reilly. Even if you don’t read the books, I would highly recommend reading this excerpt about the Staff Engineer archetypes.

Senior staff software engineer: This level is an extension of the Staff engineer where the engineer is able to either scale their impact through expansive breadth and scope or execute multiple Staff-level projects in parallel. At this level, the expectation of influence is company-wide, and the engineer would be the de facto expert on a business-critical area. The engineer is also expected to work with leadership to define the areas that need investment and provide a roadmap for executing the initiatives.

(Sr) Principal engineer, Distinguished engineer and Technical Fellow: These levels are equivalent to executive positions. The engineers represent the crème de la crème of the engineering community. At these levels, the engineers play roles that are critical to changing the direction of the company. They develop product strategy, and partner widely across different roles (UX, PM, etc.) to ensure long-term business success.