<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.3.1">Jekyll</generator><link href="https://thecloudytales.com/atom.xml" rel="self" type="application/atom+xml" /><link href="https://thecloudytales.com/" rel="alternate" type="text/html" /><updated>2023-01-13T10:11:43+00:00</updated><id>https://thecloudytales.com/atom.xml</id><title type="html">The Cloudy Tales</title><subtitle>A behind the scenes take on datacenter technologies</subtitle><author><name>Prashanth Mohan</name></author><entry><title type="html">Risk analysis for resource planning</title><link href="https://thecloudytales.com/risk-analysis/" rel="alternate" type="text/html" title="Risk analysis for resource planning" /><published>2023-01-13T00:00:00+00:00</published><updated>2023-01-13T00:00:00+00:00</updated><id>https://thecloudytales.com/risk-analysis</id><content type="html" xml:base="https://thecloudytales.com/risk-analysis/"><![CDATA[<p>Risk analysis is a crucial component of resource forecasting for datacenters. By
identifying and evaluating potential risks, datacenter managers can ensure that
their operations are prepared to handle any potential challenges that may arise,
allowing them to deliver reliable, high-quality service to their customers. In
this blog post, we will explore the different types of risks that can affect
resource forecasting in a datacenter. We will also discuss how to use risk
analysis to inform resource allocation and capacity planning decisions.</p>

<p>Risk analysis for capacity planning is challenging because there are different
sources of variability. Let’s consider 3 major categories of risk for resourcing
datacenters:</p>
<ol>
  <li>Demand variability</li>
  <li>Supply lead time variability</li>
  <li>Repairs and failure variability</li>
</ol>

<p><strong>Demand Variability</strong> represents the fact that customer behavior is
unpredictable. Different applications have different levels of variability.
E.g., a data warehouse that regularly executes the same set of workflows on a
similar-sized dataset for BI would have highly predictable demand. But traffic
to Wikipedia probably has a highly unpredictable demand curve that is influenced
by real-world events. One way to think about demand variability is to use
historical data as a model for demand growth and uncertainty. You could draw a
forecast for future demand with different levels of confidence. As you predict
further and further out, the error bars add up and you have a wider range of
potential demand. This is called the cone of uncertainty. The figure below shows
a cone of uncertainty for the minimum demand expected to be presented by
customers to the system based on historical variability.</p>

<p><img src="https://thecloudytales.com/images/uncertainity-cone.webp" alt="image-center" class="align-center" /></p>
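
<p>To make the widening of the cone concrete, here is a minimal sketch. It assumes
(this is not from the data above) that per-period forecast errors are independent
and roughly normal, so the spread grows with the square root of the horizon. The
function name and all numbers are hypothetical.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math
from statistics import NormalDist


def demand_upper_bound(base, growth, sigma, horizon, confidence):
    """Upper edge of the demand cone `horizon` periods out.

    Assumes independent, roughly normal per-period errors, so the
    spread grows with the square root of the horizon.
    """
    z = NormalDist().inv_cdf(confidence)
    expected = base + growth * horizon
    spread = sigma * math.sqrt(horizon)
    return expected + z * spread


# Hypothetical numbers: 1,000 QPS today, +50 QPS per quarter, 40 QPS of noise.
for quarters in (1, 2, 4, 8):
    p50 = demand_upper_bound(1000, 50, 40, quarters, 0.50)
    p99 = demand_upper_bound(1000, 50, 40, quarters, 0.99)
    print(f"{quarters} quarters out: P50={p50:.0f} QPS, P99={p99:.0f} QPS")
</code></pre></div></div>

<p>The gap between the P50 and P99 lines is the width of the cone at that horizon.
A real forecast would fit the growth and noise terms to historical data instead
of hard-coding them.</p>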

<p>So now we have two questions to answer:</p>
<ol>
  <li>What confidence level of the cone do we pick for the forecast?</li>
  <li>How far out in time should we forecast?</li>
</ol>

<p>In a previous <a href="/capacity-planning/">blog post</a>, we
walked through the process of performance modeling an application (but we
cheated on the forecasting problem then). By applying a simplifying assumption
that the application will burn its error budget if it does not have all the
resources it needs, we could answer the first question by mapping the forecast
confidence level to the application SLO. In this case, let’s say we have a P99
SLO, and thus we could choose the forecast line that represents the 99th
percentile confidence.</p>

<p>The second question depends on how long it takes us to acquire the resources. In
the case of a cloud service, this is simple: capacity is often available instantly
or after a short turnaround on a quota increase request. But if you manage your
own datacenter, you have to work with your supply chain contract manufacturers to
identify the machine lead time. That leads us to the second source of
variability.</p>

<p><strong>Supply lead time variability</strong> represents the amount of time that the supply
chain requires to fulfill an order. This depends on several factors including
the length and complexity of the supply chain. Supply chain modeling and
optimization is a field in itself that we will cover in a future
post. But the key intuition to grasp here is that resource delivery lead times
are not deterministic. If you have a supply chain with multiple contractors
depending on each other, a delay with a single contractor deep in the chain (say
because of procurement delays or unexpected logistics trouble) propagates and
can amplify through the rest of the chain, much like the <a href="https://en.wikipedia.org/wiki/Bullwhip_effect">bullwhip
effect</a>.</p>

<p><img src="https://thecloudytales.com/images/supply-chain-nodes.webp" alt="image-center" class="align-center" /></p>

<p>As with the demand cone of uncertainty, we could draw a
cone for supply variability as well, based either on historical delivery times
or on the SLAs provided by each of the nodes in the supply chain. A
different way to think about this is as a CDF of time to fulfill orders.</p>

<p><img src="https://thecloudytales.com/images/supply-lead-time-cdf.webp" alt="image-center" class="align-center" /></p>
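
<p>Reading a planning horizon off such a CDF can be sketched as below. The
delivery history is made up for illustration, and the nearest-rank method is
just one simple way to pick a percentile from a small sample.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math


def lead_time_quantile(lead_times_weeks, confidence):
    """Smallest observed lead time that covers `confidence` of past orders."""
    ordered = sorted(lead_times_weeks)
    rank = math.ceil(confidence * len(ordered))  # nearest-rank method
    return ordered[max(rank, 1) - 1]


# Hypothetical delivery times, in weeks, for the last ten orders.
history = [8, 9, 9, 10, 10, 10, 11, 12, 14, 19]
for confidence in (0.50, 0.95):
    weeks = lead_time_quantile(history, confidence)
    print(f"{confidence:.0%} of orders arrived within {weeks} weeks")
</code></pre></div></div>

<p>The long tail is the point: planning for the median lead time would leave
us short whenever one of the slow deliveries happens.</p>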

<p><strong>Repairs and failure variability</strong>, finally, is the loss of capacity in the
datacenter owing to failures and repairs. While supply adds
resources to the datacenter, this risk takes active resources away.</p>

<p>Now that we have a probability distribution for each of these risks,
how can we use them to determine an appropriate level of resource ordering?</p>

<h3 id="option-1-choose-a-confidence-interval-in-the-cone-of-uncertainty-based-on-the-application-slo">Option #1: Choose a confidence interval in the cone of uncertainty based on the application SLO.</h3>

<p>The risk offered to the application depends on how correlated the risks are. If
they are perfectly correlated with each other, the three shortfalls always happen
together, so the stock-out probability is MAX(1%, 1%, 1%) == 1% – i.e., we
would have a stock out 1% of the time and still meet the 99% SLO.</p>

<p>But what about the worst-case scenario? If the risks are perfectly
anti-correlated, the shortfalls never coincide and the error budget we use is additive
i.e., SUM(1%, 1%, 1%) == 3% – i.e., we would stock out for 3% of the time.</p>

<p>In reality, these risks are rarely perfectly correlated or perfectly
anti-correlated, so neither scenario is exactly right. Even if they were fully
independent, we would stock out about 1 - 0.99^3 ~= 2.97% of the time, which
still misses a 99% SLO.</p>
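
<p>The three cases can be compared with a few lines of arithmetic. This is only
a sketch: the function name is hypothetical, and the 1% figures are the
per-risk miss probabilities implied by choosing the 99th percentile for each risk.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def stock_out_bounds(miss_probabilities):
    """Return the stock-out risk if the risks are perfectly correlated,
    independent, or never coincide (mutually exclusive)."""
    correlated = max(miss_probabilities)
    all_fine = 1.0
    for p in miss_probabilities:
        all_fine *= 1.0 - p
    independent = 1.0 - all_fine
    exclusive = min(1.0, sum(miss_probabilities))
    return correlated, independent, exclusive


best, independent, worst = stock_out_bounds([0.01, 0.01, 0.01])
print(f"correlated={best:.2%} independent={independent:.2%} exclusive={worst:.2%}")
</code></pre></div></div>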

<h3 id="option-2-use-a-monte-carlo-simulation-to-optimize-parameter-choices">Option #2: Use a Monte Carlo simulation to optimize parameter choices.</h3>

<p>So the next best option is to simulate the stock-out risk, by sampling from the
three distributions. If we rerun the simulation millions of times, this is in
effect a <a href="https://en.wikipedia.org/wiki/Monte_Carlo_method">Monte Carlo
simulation</a> of the system’s
resource use. This will allow us to draw a CDF graph of the resource utilization
rates. By tweaking the simulation to take different values of confidence
percentiles for each of the risk vectors, we could thus also find optimal
planning parameters to address each of the risks.</p>
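
<p>Here is a minimal sketch of such a simulation using only the Python standard
library. Every distribution and parameter below is a made-up placeholder (a real
model would fit them to historical demand, delivery, and failure data), and the
single review date is a simplification.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random


def simulate_stock_out_rate(order_size, trials=20_000, seed=42):
    """Estimate how often demand exceeds usable capacity at the review date."""
    rng = random.Random(seed)
    installed = 100.0     # hypothetical capacity already in the datacenter
    review_week = 12.0    # hypothetical date at which demand must be covered
    stock_outs = 0
    for _ in range(trials):
        demand = rng.gauss(110.0, 8.0)                # demand variability
        lead_time = rng.lognormvariate(2.3, 0.25)     # supply lead time (weeks)
        failed_fraction = rng.betavariate(2.0, 48.0)  # failures and repairs
        # The order only helps if it is delivered before the review date.
        arrived = order_size if lead_time &lt;= review_week else 0.0
        usable = (installed + arrived) * (1.0 - failed_fraction)
        if demand &gt; usable:
            stock_outs += 1
    return stock_outs / trials


for order_size in (10, 20, 30, 40):
    rate = simulate_stock_out_rate(order_size)
    print(f"order {order_size} units: stock-out rate {rate:.2%}")
</code></pre></div></div>

<p>Sweeping the order size (or the order date) and reading off the smallest
value that keeps the stock-out rate within the error budget is one way to turn
the simulation into a planning parameter.</p>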

<p>In summary, by drawing deeper insight into the risks involved in resource
provisioning, we can mitigate the impact of stock-outs, and ensure the
reliability and stability of operations. A key insight here is that these risks
can be mitigated by being conservative and holding a large “Safety Stock”. But
this buffer of resources comes at a cost. Thus, we are well motivated to keep
this inventory as low as possible. Inventory optimization is a large problem in
its own right and will be discussed in a subsequent blog post.</p>]]></content><author><name>Prashanth Mohan</name></author><category term="capacity planning" /><category term="risk analysis" /><summary type="html"><![CDATA[Risk analysis is a crucial component of resource forecasting for datacenters. By identifying and evaluating potential risks, datacenter managers can ensure that their operations are prepared to handle any potential challenges that may arise, allowing them to deliver reliable, high-quality service to their customers. In this blog post, we will explore the different types of risks that can affect resource forecasting in a datacenter. We will also discuss how to use risk analysis to inform resource allocation and capacity planning decisions.]]></summary></entry><entry><title type="html">Let’s do some capacity planning</title><link href="https://thecloudytales.com/capacity-planning/" rel="alternate" type="text/html" title="Let’s do some capacity planning" /><published>2023-01-03T00:00:00+00:00</published><updated>2023-01-03T00:00:00+00:00</updated><id>https://thecloudytales.com/capacity-planning</id><content type="html" xml:base="https://thecloudytales.com/capacity-planning/"><![CDATA[<p>Cloud computing has revolutionized the way we think about computing resources
and storage capacity. With the ability to access virtually unlimited resources
on demand, it can feel like we have an almost infinite capacity at our
fingertips. But without enough resources, system performance will degrade and grind
to a halt. Capacity planning is a critical challenge in running systems and we
don’t always learn how to be effective at it. It can be
time-consuming and requires careful analysis and consideration of various
factors such as cost, scalability, and reliability.</p>

<p>Let’s go through the process of capacity planning with a toy example. Let’s start
with the following assumptions:</p>
<ul>
  <li>the system is processing <code class="language-plaintext highlighter-rouge">1,000 QPS</code> (queries per second) on average.</li>
  <li>this is processed with a pool of servers each of which can process <code class="language-plaintext highlighter-rouge">100 QPS</code>.</li>
  <li>each server needs 1 core and 4 GiB of RAM.</li>
</ul>

<p>How much compute should we provision for this system?</p>

<h4 id="attempt-1">Attempt #1</h4>

<p>We need 10 servers to service the <code class="language-plaintext highlighter-rouge">1,000 QPS</code>. That means 10 cores and 40 GiB of
RAM. Easy peasy!</p>

<p>Hmm… the customers are complaining about performance. They are hard to
please, aren’t they? Let’s look deeper.</p>

<h4 id="attempt-2">Attempt #2</h4>
<p>Crikey! The <code class="language-plaintext highlighter-rouge">1,000 QPS</code> was the <strong>average</strong> load. Duh. Demand varies across the
day and the week. We should be capacity planning for peak load. Turns out
that the peak load is <code class="language-plaintext highlighter-rouge">2,000 QPS</code>. That means we need twice as many servers!
Expensive, but at least they stop whining now.</p>
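
<p>The arithmetic behind the first two attempts, as a small sketch (the helper
name is hypothetical; the numbers are the assumptions listed above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math


def servers_needed(qps, per_server_qps):
    """Servers required to carry `qps` with every server fully loaded."""
    return math.ceil(qps / per_server_qps)


# Each server needs 1 core and 4 GiB of RAM.
for label, load in (("average", 1000), ("peak", 2000)):
    count = servers_needed(load, 100)
    print(f"{label}: {count} servers, {count} cores, {count * 4} GiB RAM")
</code></pre></div></div>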

<p>Well not quite. This ungrateful lot is still unhappy.</p>

<h4 id="attempt-3">Attempt #3</h4>
<p>Of course. The queries aren’t equal cost - some queries are more expensive and
take more compute time to respond to. That means we are never quite going to be
able to distribute them perfectly and run them at a 100% utilization rate, are
we?</p>

<p>Say the load is Poisson distributed and we want to maintain 3 9s (i.e., ensure
99.9% of queries comply). Ideally, our throughput should provide a latency of
<code class="language-plaintext highlighter-rouge">1/100=0.01s</code> or <code class="language-plaintext highlighter-rouge">10ms</code>. Our customers expect an SLO of <code class="language-plaintext highlighter-rouge">&lt; 20ms</code> for 99.9% of
requests. Let’s start by checking what it would be at the 99.9<sup>th</sup>
percentile with the current provisioning.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">ciw</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">math</span>

<span class="k">def</span> <span class="nf">get_lognormal_vars</span><span class="p">(</span><span class="n">mean</span><span class="p">,</span> <span class="n">sd</span><span class="p">):</span>
    <span class="n">var</span> <span class="o">=</span> <span class="n">math</span><span class="p">.</span><span class="n">log</span><span class="p">((</span><span class="n">sd</span> <span class="o">*</span> <span class="n">sd</span><span class="p">)</span><span class="o">/</span><span class="p">(</span><span class="n">mean</span> <span class="o">*</span> <span class="n">mean</span><span class="p">)</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
    <span class="n">mu</span> <span class="o">=</span> <span class="n">math</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">mean</span><span class="p">)</span> <span class="o">-</span> <span class="n">var</span><span class="o">/</span><span class="mi">2</span>
    <span class="k">return</span> <span class="n">mu</span><span class="p">,</span> <span class="n">math</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">var</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">get_service_time</span><span class="p">(</span><span class="n">qps</span><span class="p">,</span> <span class="n">utilization_ratio</span><span class="p">,</span> <span class="n">throughput</span><span class="p">,</span> <span class="n">sigma</span><span class="p">):</span>
    <span class="c1"># A target utilization below 1.0 must increase the server count, so divide by it.</span>
    <span class="n">server_count</span> <span class="o">=</span> <span class="n">math</span><span class="p">.</span><span class="n">ceil</span><span class="p">(</span><span class="n">qps</span> <span class="o">/</span> <span class="p">(</span><span class="n">throughput</span> <span class="o">*</span> <span class="n">utilization_ratio</span><span class="p">))</span>
    <span class="k">print</span><span class="p">(</span><span class="n">server_count</span><span class="p">)</span>
    <span class="n">N</span> <span class="o">=</span> <span class="n">ciw</span><span class="p">.</span><span class="n">create_network</span><span class="p">(</span>
        <span class="n">arrival_distributions</span><span class="o">=</span><span class="p">[</span><span class="n">ciw</span><span class="p">.</span><span class="n">dists</span><span class="p">.</span><span class="n">Exponential</span><span class="p">(</span><span class="n">rate</span><span class="o">=</span><span class="n">qps</span><span class="p">)],</span>
        <span class="n">service_distributions</span><span class="o">=</span><span class="p">[</span><span class="n">ciw</span><span class="p">.</span><span class="n">dists</span><span class="p">.</span><span class="n">Lognormal</span><span class="p">(</span><span class="o">*</span><span class="n">get_lognormal_vars</span><span class="p">(</span><span class="mf">1.0</span><span class="o">/</span><span class="n">throughput</span><span class="p">,</span> <span class="n">sigma</span><span class="o">*</span> <span class="mf">1.0</span><span class="o">/</span><span class="n">throughput</span><span class="p">))],</span>
        <span class="n">number_of_servers</span><span class="o">=</span><span class="p">[</span><span class="n">server_count</span><span class="p">]</span>
    <span class="p">)</span>
    <span class="n">Q</span> <span class="o">=</span> <span class="n">ciw</span><span class="p">.</span><span class="n">Simulation</span><span class="p">(</span><span class="n">N</span><span class="p">)</span>
    <span class="n">Q</span><span class="p">.</span><span class="n">simulate_until_max_customers</span><span class="p">(</span><span class="mi">10</span><span class="o">**</span><span class="mi">6</span><span class="p">,</span> <span class="n">progress_bar</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">servicetimes</span> <span class="o">=</span> <span class="p">[</span><span class="n">r</span><span class="p">.</span><span class="n">waiting_time</span> <span class="o">+</span> <span class="n">r</span><span class="p">.</span><span class="n">service_time</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">Q</span><span class="p">.</span><span class="n">get_all_records</span><span class="p">()]</span>
    <span class="k">return</span> <span class="p">(</span><span class="n">server_count</span><span class="p">,</span> <span class="n">np</span><span class="p">.</span><span class="n">percentile</span><span class="p">(</span><span class="n">servicetimes</span><span class="p">,</span> <span class="mi">50</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="n">percentile</span><span class="p">(</span><span class="n">servicetimes</span><span class="p">,</span> <span class="mi">99</span><span class="p">),</span> <span class="n">np</span><span class="p">.</span><span class="n">percentile</span><span class="p">(</span><span class="n">servicetimes</span><span class="p">,</span> <span class="mf">99.9</span><span class="p">))</span>
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">&gt;&gt;&gt;</span> <span class="n">server_count</span><span class="p">,</span> <span class="n">p50_latency</span><span class="p">,</span> <span class="n">p99_latency</span><span class="p">,</span> <span class="n">p999_latency</span> <span class="o">=</span> <span class="n">get_service_time</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="mi">1000</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mi">100</span><span class="p">,</span> <span class="mf">0.2</span><span class="p">)</span>
<span class="o">&gt;&gt;&gt;</span> <span class="k">print</span><span class="p">(</span><span class="n">server_count</span><span class="p">,</span> <span class="n">p50_latency</span><span class="p">,</span> <span class="n">p99_latency</span><span class="p">,</span> <span class="n">p999_latency</span><span class="p">)</span>
<span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mf">0.24202101961395783</span><span class="p">,</span> <span class="mf">0.47748099008185135</span><span class="p">,</span> <span class="mf">0.49158595663888643</span><span class="p">)</span>
</code></pre></div></div>

<p>Yikes! <code class="language-plaintext highlighter-rouge">490ms</code> is a long way from the <code class="language-plaintext highlighter-rouge">20ms</code> that we are targeting. At
100% utilization the queue never gets a chance to drain, so this number keeps
growing the longer the simulation runs and varies from run to run. We
could add more servers to improve the performance. But how many do we add? Let’s
do a sensitivity analysis of how the 99.9<sup>th</sup> percentile service time
changes with increased servers.</p>

<table>
  <thead>
    <tr>
      <th>Server count</th>
      <th>P99.9 Latency (ms)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>20</td>
      <td>658.552149</td>
    </tr>
    <tr>
      <td>21</td>
      <td>40.867156</td>
    </tr>
    <tr>
      <td>22</td>
      <td>25.780134</td>
    </tr>
    <tr>
      <td>23</td>
      <td>22.037392</td>
    </tr>
    <tr>
      <td><strong>24</strong></td>
      <td><strong>19.995492</strong></td>
    </tr>
    <tr>
      <td>25</td>
      <td>18.984054</td>
    </tr>
    <tr>
      <td>26</td>
      <td>18.523606</td>
    </tr>
    <tr>
      <td>27</td>
      <td>18.390090</td>
    </tr>
    <tr>
      <td>28</td>
      <td>18.239713</td>
    </tr>
  </tbody>
</table>

<p>There is a clear knee in the table right after 20 servers, and at 24 servers the P99.9 latency drops
below our target of <code class="language-plaintext highlighter-rouge">20ms</code>. Cool, that means we provision 20% additional
machines, or in other words, run the servers at a utilization of 83.3%.</p>
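
<p>As a rough analytical cross-check on the simulation, the classic Erlang C
formula gives the probability that a query has to queue at each server count.
Note the swapped-in assumption: Erlang C models exponential service times
(M/M/c), not the lognormal service times simulated above, so treat this as a
sanity check on the shape of the curve and not as a replacement for the
simulation.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def erlang_c(servers, offered_load):
    """Probability that an arriving query must queue in an M/M/c system.

    `offered_load` is arrival rate / per-server service rate (in Erlangs).
    """
    headroom = servers - offered_load
    if headroom &lt;= 0:
        return 1.0  # unstable: the queue grows without bound
    # Accumulate the terms a^k / k! iteratively to avoid huge factorials.
    term = 1.0
    total = 1.0
    for k in range(1, servers):
        term *= offered_load / k
        total += term
    tail = term * offered_load / headroom  # (a^c / c!) * c / (c - a)
    return tail / (total + tail)


peak_qps, per_server_qps = 2000, 100
offered_load = peak_qps / per_server_qps  # 20 servers' worth of work
for servers in range(20, 29):
    print(f"{servers} servers: utilization={offered_load / servers:.1%}, "
          f"P(queue)={erlang_c(servers, offered_load):.3f}")
</code></pre></div></div>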

<h4 id="attempt-4">Attempt #4</h4>
<p>Dagnabbit! The data center is in the path of a hurricane. Oh, how ye pain me.
The customers are not going to be happy with downtime. Actually, come to think
of it, maybe I just got lucky until now. The datacenter could have
become unavailable for any number of reasons - network outages, acts of god,
physical infrastructure failure, and whatnot.</p>

<p>I know what I will do, I will spread the servers across two different
datacenters - so 12 in DC1 and 12 in DC2. Hmmm, but what happens if the
hurricane hits? We will need to shut down DC1, so the service will need to run on
just 12 servers. We know that makes people unhappy. I need to do “N+1”
redundancy. Since I have two datacenters, I need 24 servers in each. But that
doubles the number of servers again.</p>

<h4 id="attempt-5">Attempt #5</h4>
<p>Just as things were getting a bit quiet, the customer complaints are back. What
is it now? Oh humbug, the original problem requirements are no longer correct.
Turns out the customers like the application, and more and more customers are
signing up.</p>

<p>I suppose historical usage is representative of future growth. We appear to be
growing at about 10% quarter on quarter. I will need to make sure to provision additional
machines at regular intervals now. And I still need to make sure they are
positioned properly to have the redundancy we need. This means that by next
year, we will need <code class="language-plaintext highlighter-rouge">(1 + 0.1)^4 * 24 * 2 ~= 71</code> servers.</p>
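
<p>The same compounding, quarter by quarter, as a small sketch (the helper name
is hypothetical; the 10% growth rate is the estimate above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math


def servers_after_growth(current, quarterly_growth, quarters):
    """Compound the fleet size and round up to whole servers."""
    return math.ceil(current * (1 + quarterly_growth) ** quarters)


deployed_today = 24 * 2  # 24 servers in each of two datacenters
for quarter in range(1, 5):
    needed = servers_after_growth(deployed_today, 0.10, quarter)
    print(f"Q{quarter}: {needed} servers")
</code></pre></div></div>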

<p><em>[In a real setting, we would do this more rigorously. But you get the idea.]</em></p>

<h4 id="attempt-6">Attempt #6</h4>
<p>I am beginning to think that my primary job is to listen to complaints. Now that
the customers are happy with the performance, the finance team is unhappy. The
service costs way too much to run. Let’s see what I can do.</p>

<p>Hey, I have an idea. The N+1 redundancy buffer is equal to the size of the
largest failure domain. If I could increase the number of failure domains, the
buffer should shrink. But hmm, we only have a presence in 3 datacenters, and DC3
only has enough resources to run 10 servers. If we size the service as 14 servers
in DC1, 14 servers in DC2, and 10 in DC3, losing any one datacenter still leaves
at least 24 servers running, and we
reduce the number of deployed servers from 48 to 38.</p>
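
<p>A quick way to check a placement is to remove the largest failure domain and
count what is left. This is a sketch with a hypothetical helper; it only covers
the loss of a single datacenter.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def surviving_capacity(placement):
    """Servers still running after the largest failure domain is lost."""
    return sum(placement.values()) - max(placement.values())


two_dcs = {"DC1": 24, "DC2": 24}
three_dcs = {"DC1": 14, "DC2": 14, "DC3": 10}
for placement in (two_dcs, three_dcs):
    deployed = sum(placement.values())
    print(f"{deployed} deployed, {surviving_capacity(placement)} survive a DC loss")
</code></pre></div></div>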

<h4 id="attempt-7">Attempt #7</h4>
<p>Maybe we could do more to optimize this. We have been focusing on horizontal
scaling. Maybe we could also try vertical scaling? If each server is larger, it
can handle more requests, but will it be able to scale super-linearly?
Presumably, it could up to a certain extent, since the static costs of
initialization and such will be amortized and it will have better in-process
cache effects. Let’s do a stress test on different machine sizes and see how the
service rate changes.</p>

<p><em>[Again, this is much more complex in reality but the principle stands - there
are a lot more dimensions - do you scale the speed of the processor, number of
cores on the processors, hyperthreading support, processor platform, etc]</em></p>

<p><img src="https://thecloudytales.com/images/vertical-scaling-tradeoff.webp" alt="image-center" class="align-center" /></p>

<p>Oh sweet. It looks like if I double the server size, I can triple the
throughput. That means reducing the total processing core demand from 24 cores
to <code class="language-plaintext highlighter-rouge">24 * 2 / 3 ~= 16 cores</code> across 8 servers. Since I have 8 server slots
available in each DC, I can spread the servers equally to minimize the redundancy
buffer: 4 in each, so that any two datacenters still hold the 8 servers we
need. We went from 38 servers down to 12 servers (or 38 CPU
cores to 24 cores).</p>
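
<p>The core-demand arithmetic before adding any redundancy, as a sketch (the
helper name is hypothetical; the 2x size and 3x throughput factors are the ones
read off the stress test above):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import math


def vertical_scaling(base_servers, size_factor, throughput_factor, cores_per_server=1):
    """Servers and cores needed after moving to larger machines."""
    servers = math.ceil(base_servers / throughput_factor)
    cores = servers * cores_per_server * size_factor
    return servers, cores


servers, cores = vertical_scaling(24, size_factor=2, throughput_factor=3)
print(f"{servers} servers, {cores} cores before redundancy")
</code></pre></div></div>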

<p>An astute reader would notice that this was primarily performance modeling
rather than capacity planning. That is true, but this is indeed the first step
toward capacity planning. In subsequent posts, we will dig into forecasting,
risk pooling, and supply chains.</p>]]></content><author><name>Prashanth Mohan</name></author><category term="capacity planning" /><category term="resource management" /><category term="queueing theory" /><summary type="html"><![CDATA[Cloud computing has revolutionized the way we think about computing resources and storage capacity. With the ability to access virtually unlimited resources on demand, it can feel like we have an almost infinite capacity at our fingertips. Without enough resources, system performance will degrade and grind to a stall. Capacity planning is a critical challenge in running systems and we don’t always learn how to be effective at it. capacity planning can be time-consuming and requires careful analysis and consideration of various factors such as cost, scalability, and reliability.]]></summary></entry><entry><title type="html">The Software Engineering Ladder</title><link href="https://thecloudytales.com/swe-levels/" rel="alternate" type="text/html" title="The Software Engineering Ladder" /><published>2022-12-18T00:00:00+00:00</published><updated>2022-12-18T00:00:00+00:00</updated><id>https://thecloudytales.com/swe-levels</id><content type="html" xml:base="https://thecloudytales.com/swe-levels/"><![CDATA[<p>Mentoring and coaching engineers is a part of my role that I enjoy a lot. A
topic that comes up often in conversations with mentees is career growth and
promotions. Promotions are important for a number of reasons including increased
compensation, expanded scope, accessing more challenging problems and perhaps
even the validation that one is growing and performing better and better over
time.</p>

<p><strong>Being excellent in your current role is not always enough:</strong> The important
question is knowing if you are ready for the next level and if not, what skills
you should be working on to get to the next level. This is often confusing
because transitioning between these levels is often a transition between roles.
This means that in spite of getting high ratings at your current level, you may
not be performing the job functions of the next level. While this post focuses
on the major functional role requirements, it does not talk about a number of
other essential aspects such as clear communication, community contributions,
working well with others and so on. These are critical aspects as well and I
have come across cases of unsuccessful promotions because the candidate’s profile was not
well rounded.</p>

<p><strong>Mapping levels across companies:</strong> Software engineering levels do not map one
to one across companies. I use as a reference in this post the levels used at
Google (which is the one I am most familiar with). But I expect that these
expectations map across companies within +/-1 level.
<a href="https://levels.fyi">levels.fyi</a> maps SWE levels across different companies.
Many early stage companies start with no levels at all, and only adopt a ladder
much later on. For example, Netflix only <a href="https://blog.pragmaticengineer.com/netflix-levels/">adopted a software engineer ladder
in 2022</a>, a full 25 years
after starting operations.</p>

<p><img src="https://thecloudytales.com/images/swe-ladder-progression.webp" alt="image-center" class="align-center" /></p>

<p><strong>Entry level software engineer:</strong> L3 is the level at which a new graduate
typically joins Google. At Amazon this is L4, and at Microsoft this starts
at 59. At this level, an engineer is still learning the trade. The engineer
would need lots of mentoring time and support from senior engineers in the team.
An engineer at this level should expect to be assigned well defined tasks
(e.g., “Implement a module that generates corresponding RPC calls to the storage
layer for each incoming write”), and should be able to execute these tasks with
support from others.</p>

<p class="notice">An L3 engineer should be evaluating the L4 role once the engineer is able to
execute the assigned tasks with little to no external support. The engineer
should also be starting to write design docs with support from senior members of
the team.</p>

<p><strong>Junior software engineer:</strong> L4 at Google is a natural progression for an
engineer. It is important to note that some companies expect
entry level engineers to keep growing in their career. Google used to expect that
all engineers be able to grow into L5, but this changed a few years back to be
L4. At this level, engineers work on reasonably well-defined projects (e.g.,
“Implement a high throughput module that accepts user writes and reads”) and
break them up into independently executable tasks. Specifically, the engineer would be
responsible for designing the solution with limited help, and delegating the
tasks to L3 engineers in the team.</p>

<p class="notice">As the L4 engineer matures in the role, their independence should continue to
improve (especially on system design skills) until they require very limited support. A
good sign of maturity is when the engineer is able to identify new projects
without external support.</p>

<p><strong>Senior software engineer:</strong> While the next level is Senior Software Engineer,
this level figures as a mid-level role in the context of the software
engineering ladder and often is also the band with the majority of the engineers
in the company. At this level, the engineer transitions from projects to
problems (e.g., “The high latency of the system is causing customer
dissatisfaction”) and is responsible for developing a roadmap of projects that
can solve the problem. While leadership skills are important at each level, they
are especially critical from this level on. Leadership is often construed as
managing other people - this could not be further from the truth. Leadership is
the ability of the engineer to influence others - this could be influencing the
work of their peers, or influencing their manager about the priority of
different work items, or possibly influencing dependent teams to produce a
feature that the engineer needs. The other major difference at this level is the
need to internalize business requirements. At this level, an engineer might also
have the title of a “tech lead”.</p>

<p class="notice">Once the engineer starts exhibiting “influencing without authority” and is able
to consistently scale their impact through others, it is a strong indication that
they are getting ready for the next level.</p>

<p class="notice--warning">Engineers make a choice at this level of their career if they want to
continue down the individual contributor (IC) ladder, or want to start managing
a team. This post focuses on the IC ladder, but this point in an engineer’s
career is an ideal time to make the switch over. If you are thinking about
becoming a manager, a book I highly recommend is the <a href="https://www.goodreads.com/book/show/33369254-the-manager-s-path">Manager’s
Path</a>.</p>

<p><strong>Staff software engineer:</strong> The transition from L3 to L5 is a progression of
handling increasing ambiguity and independent execution. However a “Staff
Software Engineer” represents a different job profile. This is a highly
selective role and most organizations will have fewer than 5% of their engineers
in Staff+ roles. In this role, the engineer often owns an area (e.g.,
capacity planning, monitoring or scalability) and is
responsible for identifying the problems worth solving in that area. This role is an extension of
the execs in the org, who will often lean on the engineer for decision
making. Leadership skills are the central focus of this role, and a staff engineer
will need to influence across teams and product areas. Each subsequent level
represents a fundamentally different role, but the expectations become more and
more murky as it depends on the needs of the organization. A couple of books on
the topic that I would highly recommend reading - <a href="https://staffeng.com/book">“Staff Engineer” by Will
Larson</a> and <a href="https://www.goodreads.com/book/show/59694859-the-staff-engineer-s-path">The Staff Engineer’s path by Tanya
Reilly</a>.
Even if you don’t read the books, I would highly recommend reading this excerpt
about the <a href="https://staffeng.com/guides/staff-archetypes">Staff Engineer
archetypes</a>.</p>

<p><strong>Senior staff software engineer:</strong> This level is an extension of the Staff
engineer where the engineer is able to either scale their impact through
expansive breadth and scope or is able to execute multiple staff engineer leveled
projects in parallel. At this level, the expectation of influence is cross
company and the engineer would be the de facto expert on a business critical
area. The engineer is also expected to work with leadership to define the areas
that need investment and provide a roadmap for executing the initiatives.</p>

<p><strong>(Sr) Principal engineer, Distinguished engineer and Technical Fellow:</strong> These
levels are equivalent to executive positions. The engineers represent the crème
de la crème of the engineering community. At these levels, the engineers play
roles that are critical to changing the direction of the company. They develop
product strategy, and partner widely across different roles (UX, PM, etc) to
ensure long term business success.</p>]]></content><author><name>Prashanth Mohan</name></author><category term="SWE Career" /><summary type="html"><![CDATA[Mentoring and coaching engineers is a part of my role that I enjoy a lot. A topic that comes up often in conversations with mentees is career growth and promotions. Promotions are important for a number of reasons including increased compensation, expanded scope, accessing more challenging problems and perhaps even the validation that one is growing and performing better and better over time.]]></summary></entry></feed>