Implementing Graceful Degradation

All software uses resources, whether concrete like RAM, or more abstract like an external service you rely on. If possible, when resources become constrained or unavailable, we want our software to react to its environment by reducing performance or disabling secondary functionality instead of crashing. That is, we want our software to gracefully degrade.

In this post, I’ll describe a software architecture pattern for implementing graceful degradation.

Domain

To implement graceful degradation, we need:

A way to periodically retrieve a relatively current utilization metric for the resource
An action to perform when the resource’s utilization gets above a threshold

We can model our domain as:

Policy {
    action: function
    utilization_threshold: double // (0.0, 1.0)
}

ResourceManager {
    get_utilization()
    add_policy(Policy p)
    // Apply policies that have their threshold below the last measured utilization
    apply_policies()
}

Implementation

If our program is loop-driven, we can invoke ResourceManager#apply_policies() in our main control loop at our desired frequency. Note that if our program is synchronous and gets blocked on a long-running task, our policies might not be applied on time.

A better approach might be to register a timer that interrupts our program periodically and invokes our ResourceManager to guarantee it will run.
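As a rough sketch of the timer idea, the class below calls a function at a fixed interval from a background thread. The name PeriodicRunner is hypothetical, and a thread is a stand-in for a true interrupt: it keeps the check off the main loop, but it is not a hard guarantee on every runtime.

```python
import threading
from typing import Callable


class PeriodicRunner:
    """Calls `fn` every `interval_s` seconds on a daemon thread until stopped."""

    def __init__(self, fn: Callable[[], None], interval_s: float) -> None:
        self._fn = fn
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self) -> None:
        self._thread.start()

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()

    def _run(self) -> None:
        # Event.wait returns False on timeout and True once stop() is called.
        while not self._stop.wait(self._interval_s):
            self._fn()
```

In use, `fn` would be the resource manager's apply_policies method, for example `PeriodicRunner(manager.apply_policies, 5.0).start()`.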

Our implementation of the apply_policies() method is as follows:

apply_policies() {
    utilization = get_utilization()
    for (p : policies) {
        if (utilization > p.utilization_threshold)
            p.action()
    }
}
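Putting the domain model and apply_policies() together, a minimal Python sketch might look like the following. Passing get_utilization in as a callable and validating the threshold in add_policy are assumptions made for illustration, not part of the model above.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Policy:
    action: Callable[[], None]      # what to do when the resource is constrained
    utilization_threshold: float    # fraction in the open range (0.0, 1.0)


@dataclass
class ResourceManager:
    # Assumed: a callable that returns current utilization as a fraction.
    get_utilization: Callable[[], float]
    policies: List[Policy] = field(default_factory=list)

    def add_policy(self, policy: Policy) -> None:
        if not 0.0 < policy.utilization_threshold < 1.0:
            raise ValueError("threshold must be in (0.0, 1.0)")
        self.policies.append(policy)

    def apply_policies(self) -> None:
        # Run every policy whose threshold is below the measured utilization.
        utilization = self.get_utilization()
        for p in self.policies:
            if utilization > p.utilization_threshold:
                p.action()
```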

Use Case: Reducing RAM consumption

Let’s assume we own an /employees API that supports a query for retrieving an employee’s manager hierarchy, /employees?f.ancestors_of=<employee-id>. Furthermore, we have an in-memory store that caches retrieved hierarchy paths from our database.

In examples like this, it can be hard to accurately predict the memory consumption of our cache. Likewise, it’s often difficult to effectively bound the size of caches.

With this pattern, when RAM usage gets above 80%, our policy can be to delete a random 10% of leaves (employees who have no current direct reports in the cache). If an employee is re-requested in the future, we will re-materialize the full or partial hierarchy path from the database.

Our policy will get periodically repeated until memory consumption is below the threshold. This allows the size of our cache to grow and shrink dynamically and allows more efficient utilization of our resources.
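The eviction action could be sketched as below. The cache layout (a dict mapping employee id to manager id) and the function name evict_random_leaves are assumptions for illustration; the real cache may be shaped differently.

```python
import random
from typing import Dict, Optional


def evict_random_leaves(
    cache: Dict[str, Optional[str]],
    fraction: float = 0.10,
    rng: Optional[random.Random] = None,
) -> int:
    """Delete a random `fraction` of leaf employees; return how many were removed.

    Assumed layout: `cache` maps employee id -> manager id (None for the root).
    A leaf is an employee who is nobody's manager within the cache.
    """
    rng = rng or random.Random()
    managers = {m for m in cache.values() if m is not None}
    leaves = [e for e in cache if e not in managers]
    if not leaves or fraction <= 0.0:
        return 0
    # Remove at least one leaf so repeated runs always make progress.
    count = min(len(leaves), max(1, int(len(leaves) * fraction)))
    for employee_id in rng.sample(leaves, count):
        del cache[employee_id]
    return count
```

Because only leaves are removed, every remaining employee still has an intact path to the root, so cached hierarchy paths are shortened rather than broken.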

Scaling policies

After observing the behavior of our graceful degradation policy for our /employees API, we’ve decided that deleting 10% of leaves might be too aggressive if we’re relatively close to our threshold, and that our policy should scale its impact based on how far utilization is above the threshold.

We can extend our policy model’s action to accept an energy parameter:

Policy {
    action: function(energy: number) // [0.0, 1.0]
    utilization_threshold: number // (0.0, 1.0)
}

Our apply_policies() implementation now looks as follows:

apply_policies() {
    utilization = get_utilization()
    for (p : policies) {
        energy = 0.0
        if (utilization > p.utilization_threshold) {
            energy = 1.0 * (utilization - p.utilization_threshold)
                / (1.0 - p.utilization_threshold)
        }
        p.action(energy)
    }
}

Our new policy is to linearly scale the percentage of leaves we delete from 0% to 10% with our energy parameter (leaf_percentage = 0.1 * energy).
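To make the arithmetic concrete, here is a small sketch of the energy calculation and the resulting leaf percentage. It assumes utilization and the threshold are both fractions in [0.0, 1.0], as the Policy comments indicate, so the headroom above the threshold is 1.0 minus the threshold. The function names are hypothetical.

```python
def energy_for(utilization: float, threshold: float) -> float:
    """Linear energy in [0.0, 1.0]: 0 at the threshold, 1 at full utilization."""
    if utilization <= threshold:
        return 0.0
    return min(1.0, (utilization - threshold) / (1.0 - threshold))


def leaf_fraction(energy: float, max_fraction: float = 0.10) -> float:
    """Fraction of leaves to delete, scaled linearly from 0 to `max_fraction`."""
    return max_fraction * energy
```

With a threshold of 0.8 and utilization at 0.9, we are halfway through the headroom, so energy is roughly 0.5 and the policy deletes about 5% of leaves instead of the full 10%.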

Use Case: Reducing CPU usage

Let’s assume we have a repeating task that periodically flushes an in-memory buffer to disk and that when this task runs, it temporarily spikes CPU usage. Our desired policy is to defer processing our task until CPU utilization drops below the threshold.

Notice that in our energy-based apply_policies() loop, zero energy is supplied to the policy’s action when utilization is below the threshold. If the energy supplied is greater than zero, we can set a flag to prevent running our task. When energy equals zero, we clear the flag to allow running our task.
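A minimal sketch of this flag-based policy follows. The class name DeferrableFlush and the flush_count counter (a stand-in for the real flush to disk) are hypothetical.

```python
import threading


class DeferrableFlush:
    """A repeating task that a policy can defer by setting a flag."""

    def __init__(self) -> None:
        self._deferred = threading.Event()
        self.flush_count = 0

    def policy_action(self, energy: float) -> None:
        # Any energy above zero means utilization is over the threshold.
        if energy > 0.0:
            self._deferred.set()
        else:
            self._deferred.clear()

    def run_if_allowed(self) -> bool:
        """Run the task unless it is deferred; return whether it ran."""
        if self._deferred.is_set():
            return False
        self.flush_count += 1  # stand-in for flushing the buffer to disk
        return True
```

The scheduler keeps calling run_if_allowed() on its normal cadence; the policy only decides whether each call does any work.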

Supporting hysteresis

A common problem with having a single threshold is that it can cause thrashing:

our task is enabled and starts running
it might cause CPU to spike
which causes our policy to run
which disables our task
which drops CPU usage below the threshold, re-enabling our task

We can reduce the frequency of oscillations by supporting hysteresis. We can add a lower threshold so that our task is not re-enabled until utilization drops below the lower bound.

We need to add a lower threshold to our Policy model:

Policy {
    action: function(energy: number) // [0.0, 1.0]
    utilization_threshold_high: number // (0.0, 1.0)
    utilization_threshold_low: number // (0.0, 1.0)
}

Our new apply_policies() method:

apply_policies() {
    utilization = get_utilization()
    for (p : policies) {
        energy = 0.0
        if (utilization > p.utilization_threshold_high)
            energy = 1.0
        else if (utilization > p.utilization_threshold_low) {
            energy = 1.0 * (utilization - p.utilization_threshold_low) 
                / (p.utilization_threshold_high - p.utilization_threshold_low)
        }
        p.action(energy)
    }
}

Our new apply_policies() implementation also has the added benefit of linearly scaling energy within a defined resource utilization range.
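For a flag-style policy, one way to get the hysteresis behavior is for the action to keep state: block only when energy reaches 1.0, unblock only when it returns to 0.0, and hold the previous state in between. The sketch below is one possible interpretation, with hypothetical names, and is not the only way an action could consume energy.

```python
def banded_energy(utilization: float, low: float, high: float) -> float:
    """Energy as computed by the two-threshold apply_policies() above."""
    if utilization > high:
        return 1.0
    if utilization > low:
        return (utilization - low) / (high - low)
    return 0.0


class LatchedGate:
    """Blocks above the high threshold; unblocks at or below the low one."""

    def __init__(self) -> None:
        self.blocked = False

    def policy_action(self, energy: float) -> None:
        if energy >= 1.0:
            self.blocked = True
        elif energy <= 0.0:
            self.blocked = False
        # For 0 < energy < 1 the previous state is kept, which is the hysteresis.
```

With low = 0.6 and high = 0.9, a utilization of 0.75 leaves the gate open on the way up but keeps it closed on the way down, so the task is not re-enabled until utilization falls to 0.6 or lower.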

Use Case: Handling external resources

Let’s consider the case where we have an internal tool that allows manually importing spreadsheets and ingesting them into our database using our ETL. Our ETL supports a DSL that allows complex, CPU-intensive transformations on the imported data.

We have decided that these import jobs don’t need to happen immediately, and we want to run our ETL on AWS spot instances, only when the market price has dropped below a threshold.

Our resource manager would retrieve the utilization by querying AWS’ spot price history API and our action would be to toggle a flag, causing the ETL functionality to be enabled/disabled and the import button to be enabled/disabled in our UI.
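One way to fit a price into the same model is to normalize it against a ceiling so it behaves like a utilization fraction. In the sketch below, fetch_price is a hypothetical callable standing in for whatever code queries the spot price, and price_ceiling and ImportFeature are likewise assumptions for illustration. No real AWS call is shown.

```python
from typing import Callable


def price_utilization(fetch_price: Callable[[], float], price_ceiling: float) -> float:
    """Map a spot price onto [0.0, 1.0] so it can be compared to a threshold."""
    if price_ceiling <= 0.0:
        raise ValueError("price_ceiling must be positive")
    return max(0.0, min(1.0, fetch_price() / price_ceiling))


class ImportFeature:
    """Holds the flag that gates both the ETL and the import button in the UI."""

    def __init__(self) -> None:
        self.enabled = True

    def policy_action(self, energy: float) -> None:
        # Zero energy means the price is at or below the threshold.
        self.enabled = energy == 0.0
```

The resource manager's get_utilization would then wrap price_utilization, and the policy threshold expresses the acceptable price as a fraction of the ceiling.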

Graceful shutdowns

We can extend this pattern to support graceful shutdown functionality. If we track the rate of change of utilization over time, we can predict scenarios where we might hit hard resource limits and crash.

In these scenarios, we might want to take actions like retrieving diagnostics and flushing logs to disk.
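As a rough sketch of the prediction step, we can extrapolate linearly from recent samples to estimate how long until utilization reaches the hard limit. The function name and the use of only the first and last samples are simplifying assumptions; a real implementation might smooth the data or fit a trend line.

```python
from typing import Optional, Sequence, Tuple


def seconds_until_limit(
    samples: Sequence[Tuple[float, float]], limit: float = 1.0
) -> Optional[float]:
    """Estimate seconds until utilization hits `limit`, or None if it is not rising.

    `samples` is a time-ordered sequence of (timestamp_seconds, utilization) pairs.
    """
    if len(samples) < 2:
        return None
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    rate = (u1 - u0) / (t1 - t0)  # utilization change per second
    if rate <= 0.0:
        return None
    return max(0.0, (limit - u1) / rate)
```

If the estimate falls below some lead time, for example the time needed to flush logs, the shutdown actions can start before the limit is actually reached.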

Conclusion

Implementing graceful degradation provides many benefits:

Increases the efficiency of our resource utilization by removing the need for hard bounds on things like caches
Allows our software to dynamically scale with the size of the machine
Enables us to be more intentional about how our system should respond to resource constraints

Consider using this pattern if your software is running in a resource constrained environment or you need tight control over how your software behaves under load.

Published Jan 27, 2023

Tech Lead, Staff Software Engineer