Feature Flags for Fun & Profit

By Gerald Singleton & Clint Parker

Intro

Continuous integration & deployment. Automated testing. Refactoring & taking control of the monolith. Reducing cycle time. Increased uptime. Optimizing the data layer. Putting stakeholders in control. Making customers happy.

We’ve done all that. And it was all easier with our aggressive adoption of feature flags. Using flags by default has unblocked all of the more visible initiatives we wanted to achieve. In this document, we’ll showcase what flags are (and aren’t), why we use them to the extent we do, and how you can quickly take advantage of these patterns to improve your codebases.

Guidelines

Not (just) for features

A common misconception is that feature flags are only for toggling application features. Development teams often tend to think that an application will have a finite number of features at any given time, and those features can be managed via configuration settings. This thinking limits the overall flexibility of your application. In our SDLC, we don’t use feature flags just to toggle application features. In fact, we primarily use feature flags as a deployment tool. They enable us to deploy changes to our product frequently, purposefully, and most importantly…safely.

Columns vs. Layers

A foundational concept of feature flags is context vs. config. Many applications use a variety of pre-production environments, each with its own config settings and slightly different expectations. Examples include connection strings for different versions of the same DB, or sandbox vs. production integration dependencies. These environments are usually diagrammed as a vertical stack, separated by horizontal lines into layers, hence the phrase “lower environments.” Layers are expensive. There are tricks to reduce costs, but nevertheless, they are expensive.

On the other hand, horizontal segmentations, or columns, tend to be much less expensive and more dynamic. These are usage characteristics, like the number of users, geography, and account settings. For our purposes, we group all the varieties of horizontal segmentation under the term “context.” Contexts don’t have to have names, nor do they need special dependencies. Contexts are unmanaged. They simply exist within the runtime space of the code.

Context

Context is not a reserved word in this case; it’s a concept. LaunchDarkly gives the term a specific meaning, but that’s not what we mean here. As mentioned, context already exists in your application. You already have settings for the application itself, organizations, users, geographies, and time/date. Contexts can be split or combined. The point is that your application already accommodates this perspective. Accepting this is important because the appropriate usage of feature flags makes the most of it.

Feature flags are just additional context. You should pick one to three known contexts to start with. In a B2B application, the first context you should accommodate in your flagging system is “Company ID.” The new context is flag state plus company.

The myth of tech debt

I’m sure you’re asking yourself, “But aren’t you just increasing the technical debt in the system by adding temporary code?” The short answer is…not really. In our experience, when managed correctly, this is not a problem. When feature flags are implemented in a way in which they can be easily removed (i.e. simple conditional statements), they can (and should) be the shortest-lived code sections in the repo. With this in mind, feature flags can actually be a tool that can be used by the development team to reduce overall technical debt. Armed with this new tool, contributors are empowered to improve the code base aggressively.

Risk mitigation

Feature flagging’s top value is as a risk mitigator. Flags can be used for all sorts of other things, but this should be the top priority. If you have such an amazing codebase that you prefer to fly without this safety net, congratulations, you’ve found the perfect engineering shop and should never leave! But in most of the engineering teams I’ve been on, this hasn’t been the case. Engineers are usually working in code that has been through several development teams with varying levels of skill and backgrounds. The result is a codebase that is often brittle. This is where proper usage of feature flags can shine. Think of how inexpensive it is to add a flag to mitigate any potential risk to your application. Any side effects can be quickly reverted to the original behavior with the simple flip of a feature flag.

With the safety net of feature flags in place, you can do the unthinkable…test in production. Feature flags enable you to pick a context (user, company, etc.) and test a change in a real production environment. The impact is limited to that context, and if something goes wrong, the change can be quickly disabled without affecting the rest of the team and their deliveries.
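To make the idea concrete, here is a minimal sketch of context-scoped flag evaluation, assuming the context is a company ID. InMemoryFlagClient and its method names are hypothetical; a real system would delegate to an external flag manager rather than hold state in memory.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for an external flag manager.
public class InMemoryFlagClient
{
    // Flag key -> company IDs for which the flag is switched on.
    private readonly Dictionary<string, HashSet<string>> _enabled =
        new Dictionary<string, HashSet<string>>();

    // Turns the flag on for a single company (the "context").
    public void EnableFor(string flagKey, string companyId)
    {
        if (!_enabled.TryGetValue(flagKey, out var companies))
        {
            companies = new HashSet<string>();
            _enabled[flagKey] = companies;
        }
        companies.Add(companyId);
    }

    // The kill switch: turns the flag off for every company at once.
    public void DisableEverywhere(string flagKey) => _enabled.Remove(flagKey);

    // Unknown flags and missing contexts fall back to the original behavior.
    public bool IsEnabled(string flagKey, string companyId) =>
        companyId != null
        && _enabled.TryGetValue(flagKey, out var companies)
        && companies.Contains(companyId);
}
```

Enabling a flag for one company exposes the new code path to that company only. Everyone else keeps the original behavior, and DisableEverywhere reverts the change without a deployment.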

Refactors unleashed

With the safety of feature flags, you can take bigger swings, especially when refactoring code. This proves especially beneficial in older code bases where the system may be more brittle. You can refactor a whole vertical slice, deploying frequently and confirming along the way that the change is working in production with little to no impact to your users. If you find that a piece of refactored code that was deployed causes unintended side effects, you don’t have to go through the process of rushing to make a code fix and redeploying your code to production. It’s as simple as turning off the feature flag. You can take the time to mindfully fix the issue and continue refactoring.

Quality mindset / User experience first

Since nothing is now stopping the team from improving the system, quality and an intolerance for degraded user experiences can become the norm. Imagine a world where rapid delivery of value to your users is possible without the historical fear of unintended downtime. Part of the SDLC now involves verifying assumptions in production, constantly maintaining the system, and never disappointing your users. Feature flags can literally let you “swap the engine while driving down the freeway.”

Aggressive/Liberal usage

With all of these benefits, why not aggressively apply feature flags across your code base? The reality is that if the feature flag usage described here is going to be successful, it must become an enforced standard. It must be a requirement on pull requests that the changes be behind a feature flag. Why does it require that level of enforcement to be successful? Because change is hard, and this change in particular requires a mindset shift that may not be easy for some engineers. The same is true of writing tests, creating documentation, and following coding standards: none of them comes easily at first.

Implementation examples and guidelines

Identify your flags not just by the change but also by the team and implementer. Start with a common context for your application. In B2B, the context should be the organization identifier. In B2C, geography is a great place to start.

To make sure that this point gets across, we want to repeat: “Feature flags are meant to have a short lifespan.” Ensuring that feature flags are temporary is one of the foundations of implementing the strategies outlined here. Ignoring this foundational topic could lead to situations where you have nested feature flag implementations. This can severely reduce the maintainability of the code base.

using System.Threading.Tasks;

public class SampleClass : ISampleClass
{
    private readonly IFeatureFlagProvider _featureFlagProvider;

    public SampleClass(IFeatureFlagProvider featureFlagProvider)
    {
        _featureFlagProvider = featureFlagProvider;
    }

    // Anti-pattern: a long-lived flag has picked up a second, nested flag
    public async Task DoSomething(string inputValue)
    {
        if (_featureFlagProvider.IsEnabled(FeatureFlagEnums.FeatureFlag1))
        {
            if (_featureFlagProvider.IsEnabled(FeatureFlagEnums.FeatureFlag2))
            {
                // Some code here (both flags on)
            }
            else
            {
                // Some code here (only FeatureFlag1 on)
            }
        }
        else
        {
            // Original behavior (FeatureFlag1 off)
        }
    }
}

Figure 1.1 Nested Feature Flags

One of the issues that we encountered pretty quickly when implementing feature flags was merge conflicts. In our initial implementation, the feature flags were all defined in a single class file. To solve this, the class was split into several partial class files, with each developer having their own file (FeatureFlagEnums.Dev1.cs, FeatureFlagEnums.Dev2.cs, etc.). Each file contains a partial declaration of the FeatureFlagEnums class where that developer lists the feature flags they are working on. Because the partial declarations compile into a single class, a duplicate flag name fails the build, which gives the developer compile-time notification of potential conflicts.

using System.Threading.Tasks;

/// Sample service class
public class CompanyService : ICompanyService
{
    public async Task<string> DoSomethingCool(string inputValue1)
    {
        // Imagine some code here
        throw new System.NotImplementedException();
    }

    public async Task<string> DoSomethingCooler(string inputValue1)
    {
        // Imagine some cooler code here
        throw new System.NotImplementedException();
    }
}

/// This is a context object
/// It would be used to pass into the feature flag client
public class FeatureFlagContext : IFeatureFlagContext
{
    public string CompanyId { get; set; }
}

/// This class would represent your feature flag
/// management. This could be a wrapper around an external feature flag
/// management client such as LaunchDarkly
public class FeatureFlagClient : IFeatureFlagClient
{
    // Flags are identified by the string constants declared in FeatureFlagEnums
    public bool IsFeatureFlagEnabledForContext(string contextId, string featureFlag)
    {
        // Call to your external flag manager here to
        // retrieve the flag state for the given context
        throw new System.NotImplementedException();
    }
}

/// Pseudo code implementation of the feature flag provider class.
/// The IFeatureFlagContext carries the identifiers (here, the CompanyId)
/// that flags are evaluated against.
public class FeatureFlagProvider : IFeatureFlagProvider
{
    private readonly IFeatureFlagContext _featureFlagContext;
    private readonly IFeatureFlagClient _featureFlagClient;

    public FeatureFlagProvider(IFeatureFlagContext featureFlagContext, IFeatureFlagClient featureFlagClient)
    {
        // The context could be the HttpContext of the session
        // (HttpContext.Current) or some other context object.
        // The provider needs to account for a null context object
        // and return the appropriate value from the IsEnabled method
        _featureFlagContext = featureFlagContext;
        _featureFlagClient = featureFlagClient;
    }

    public bool IsEnabled(string featureFlag)
    {
        // Without a context (or a company in it), fall back to the original behavior
        if (_featureFlagContext?.CompanyId == null)
        {
            return false;
        }

        // Use the context to determine whether the feature is turned on for this company
        return _featureFlagClient.IsFeatureFlagEnabledForContext(_featureFlagContext.CompanyId, featureFlag);
    }
}

//Once compiled, both of these feature flags will be part of the FeatureFlagEnums class

/// <summary>
/// This specific file belongs to: FeatureFlagEnums.Dev1.cs
/// </summary>
public static partial class FeatureFlagEnums
{
    public const string InternalBugFixIssueFlag = "internal-31119-sample-feature-flag";
}

/// <summary>
/// This specific file belongs to: FeatureFlagEnums.Dev2.cs
/// </summary>
public static partial class FeatureFlagEnums
{
    public const string Company321BugFixIssue = "internal-40101-sample-feature-flag";
}

//Use of the feature flag in code would look like this
public class FooService : IFooService
{
    private readonly IFeatureFlagProvider _featureFlagProvider;
    private readonly ICompanyService _companyService;


    public FooService(IFeatureFlagProvider featureFlagProvider, ICompanyService companyService)
    {
        _featureFlagProvider = featureFlagProvider;
        _companyService = companyService;
    }


    public async Task<string> DoFoo(string inputValue)
    {
        if (_featureFlagProvider.IsEnabled(FeatureFlagEnums.InternalBugFixIssueFlag))
        {
            // Flag on: new behavior
            return await _companyService.DoSomethingCooler(inputValue);
        }
        else
        {
            // Flag off: original behavior
            return await _companyService.DoSomethingCool(inputValue);
        }
    }
}

Figure 1.2 Initial Introduction of the feature flag into code.

//Foo service once the feature flag has been removed
public class FooService : IFooService
{
    private readonly IFeatureFlagProvider _featureFlagProvider;
    private readonly ICompanyService _companyService;

    public FooService(IFeatureFlagProvider featureFlagProvider, ICompanyService companyService)
    {
        _featureFlagProvider = featureFlagProvider;
        _companyService = companyService;
    }

    public async Task<string> DoFoo(string inputValue)
    {
        // Only the new behavior remains
        return await _companyService.DoSomethingCooler(inputValue);
    }
}


/// Unused code removed from the company service
public class CompanyService : ICompanyService
{
    public async Task<string> DoSomethingCooler(string inputValue1)
    {
        // Imagine some cooler code here
        throw new System.NotImplementedException();
    }

    public async Task<string> DoTheCoolestThing(string inputValue1)
    {
        // Imagine some of the coolest code here
        throw new System.NotImplementedException();
    }
}

Figure 1.3 Same code as Figure 1.2 after the feature flag has been removed.

Real World Scenarios

The Big Refactor

If your development team is not made up of AI agents yet, then you’ve heard the phrase “I want to rewrite that whole feature from scratch.” In one of the development teams I worked with, we wanted to remove the use of stored procedures and replace them with an object-relational mapping (ORM) tool. On the surface it doesn’t sound crazy, until you factor in that the application had over 900 stored procedures. Where in most development organizations this would be a non-starter, we were able to start work on this immediately. How? Look at Figure 2.1 to see where we started.

using System.Collections.Generic;
using System.Threading.Tasks;

/// Parameter Values Class
public class ParameterValue
{
    public string ParameterName { get; set; }

    // Named Value because a member cannot share the name of its enclosing type
    public object Value { get; set; }

    public ParameterValue(string parameterName, object value)
    {
        ParameterName = parameterName;
        Value = value;
    }
}

/// Generic Data Service
public class DataService : IDataService
{
    public async Task GenericQuery1(string inputValue)
    {
        await ExecuteStoredProcedure("GenericStoredProcedure",
            new List<ParameterValue>(){
                new ParameterValue("Value1", inputValue)
            });
    }

    public async Task GenericQuery2(string inputValue)
    {
        await ExecuteStoredProcedure("AnotherStoredProcedure",
            new List<ParameterValue>(){
                new ParameterValue("Value1", inputValue)
            });
    }

    private async Task ExecuteStoredProcedure(string procedureName, List<ParameterValue> parameters)
    {
        // Code to execute the stored procedure against a data store here
    }
}

Figure 2.1 - Before any Changes

Nothing super interesting in that code snippet, just your normal boilerplate stored procedure execution. But look at how easy it was for us to start changing how we access our data with a few lines of code. See Figure 2.2.

using System.Collections.Generic;
using System.Threading.Tasks;

/// Generic Data Service
public class DataService : IDataService
{
    private readonly IFeatureFlagProvider _featureFlagProvider;

    public DataService(IFeatureFlagProvider featureFlagProvider)
    {
        _featureFlagProvider = featureFlagProvider;
    }

    public async Task GenericQuery1(string inputValue)
    {
        // Add a feature flag if statement here
        if (_featureFlagProvider.IsEnabled(FeatureFlagEnums.UseNewQuery1))
        {
            // Flag on: new data access path
            await GetData(
                new List<ParameterValue>(){
                    new ParameterValue("Value1", inputValue)
                });
        }
        else
        {
            // Flag off: legacy stored procedure
            await ExecuteStoredProcedure("GenericStoredProcedure",
                new List<ParameterValue>(){
                    new ParameterValue("Value1", inputValue)
                });
        }
    }

    // Not migrated yet, so no flag is needed
    public async Task GenericQuery2(string inputValue)
    {
        await ExecuteStoredProcedure("AnotherStoredProcedure",
            new List<ParameterValue>(){
                new ParameterValue("Value1", inputValue)
            });
    }

    /// New method that doesn't use stored procedures
    private async Task GetData(List<ParameterValue> parameters)
    {
        // New ORM-based data access goes here
    }

    private async Task ExecuteStoredProcedure(string procedureName, List<ParameterValue> parameters)
    {
        // Code to execute the stored procedure against a data store here
    }
}

Figure 2.2

Using feature flags we were able to start chipping away at a major refactor while still maintaining the legacy code. By using a context, we could control how many users were executing our new source code. We could deploy several changes with little to no user impact.


2023 Year in Review

My team has seen a lot of changes in the last year. These are things that we didn’t really have in 2022 but that became a part of our day-to-day in 2023.

Feature flags

We started to introduce the concept of flags in late 2022 but didn’t adopt them until 2023. We’ve rewritten the framework a few times. The team has created guidelines for flag creation, management, and removal. We’ve introduced over 200 flags in 2023. The adoption of our feature flag process has led to…

Deploying multiple times per day

In May of 2023, we moved to hourly deploys. We had previously been on a structured 2-week deployment cadence. There are some specific challenges with a 2-week cadence: maintaining the “release branch,” being beholden to the release schedule whether or not work was done in time, deploying a bundle of 2 weeks of work at once, and hotfixes bypassing the whole process. We currently deploy on the hour and will be moving to full continuous deployment in January. Production incident remediation times are now tracked in minutes, not hours.

DDoS protection

In 2023, we moved our WAF to Cloudflare. This has given us DDoS protection and a CDN. The DDoS mitigation has proved extremely valuable, as our system has been able to withstand attacks of over 10M requests per minute.

WASM

We’ve introduced Blazor to our stack to add frontend code quickly and reliably. We’re using Blazor WASM, which is C# and HTML compiled to WebAssembly. This allows us to use our C# knowledge and best practices (including automated testing) for browser code.

Running on Linux in prod

In the first half of 2023, we migrated our production servers to Linux. In the second half of the year, we migrated our remaining dev and staging servers to Linux. We’ve also migrated our build servers to Linux. These migrations saved costs on the computing side, allowing us to scale up our data side without any overall cost increase.

Latest .NET

Staying on the latest version of the framework is uncommon in most .NET shops. In 2022, we migrated to dotnet 6. In 2023, we’ve done it again and migrated to dotnet 7. In early 2024, we’ll move to the newly released dotnet 8.

Increased automated testing

In August, we increased our expectations around automated testing. We’re now near 40% for total line coverage for all codebases. We’ve adopted behavioral testing across all of the backend code. We’ve introduced Playwright, which allows us to test our frontend code in a more automated fashion.

Codified SDLC

In 2022, our SDLC was very loose and ad-hoc. In 2023, we’ve codified our SDLC. Our SDLC is meant to be flexible while maintaining consistency across the department. Our SDLC guidelines represent sensible defaults, and we hope they will continue to evolve to best serve the teams leveraging them.

Structured teams

At the end of 2023, we had one team of 12, one team of 5, and one team of 2 with QA floating across teams. We’ve since restructured into 3 teams of even size and even staffing.

Job descriptions

I know the engineering team had been working on some job descriptions/matrices, but they never quite made it to fruition. This year, Engineering leadership created measurable job expectations for software engineering levels 1-4. We’ve published these to our team and are using them in our 1:1s and reviews. This gives clarity to both our team members and managers. We’ll be creating similar documents for our managers and QA and DevOps teams in 2024.

Consistent meeting schedule

In addition to the meeting guidelines of our SDLC, we’ve also established a monthly department-wide meeting. This meeting is an opportunity to showcase the great work done each month, share department-level information, and keep each other accountable for our organizational goals.

Company-wide bug reporting

Open bug reporting is a sign of engineering team maturity, and in May of 2023, we opened up our bug reporting process to the whole company. We previously had two competing processes. Not only did this reduce transparency and create confusion, but issues reported in the support team’s system had to be verified and triaged before being added to the engineering backlog. This dual process limited visibility into the bug backlog and also skewed reporting.

This has been one of the most remarkable years of my career. Teams rarely see this much evolution in such a short time. I can’t wait to see what interesting enhancements 2024 delivers.


Improving Software Team Metrics

A healthy engineering organization (or any healthy team, for that matter) should be tracking itself across a variety of metrics. This is not covered by the standard CS curriculum but is readily encountered in the real world. Once someone is paying for software, there will invariably be questions about how that money is being spent. The most common metrics are bug count and velocity, followed by automated code coverage. These are common because they’re the cheapest to produce. Bugs are, unfortunately, the most visible part of engineering output. Counting them is the start of reducing them. Code coverage is freely available in every modern build pipeline, although not always enabled. And velocity is the treasured metric of any young engineering leader, the end-all answer to the question “How much work are we getting done!?”

However, once you start looking, there is so much more insight you can gain and so many more things to track and compare. And, eventually, when you’re answering to very clever investors, you’ll need to provide the metrics that they care about. One of those, which I have come to appreciate, is the sprint completion percentage. This expounds on velocity and compares that actual value to the estimated or planned value. A high velocity is excellent, but accurate forecasting is even better for the overall business. This metric is easy enough to retrieve. Azure DevOps (ADO) has this baked into its velocity dashboards. The granularity is obviously at the sprint level.

With a little API magic, we can easily get:

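| Team           | Iteration Path | Start Date | End Date   | Planned | Completed | Completed Late | Incomplete | Total |
|----------------|----------------|------------|------------|---------|-----------|----------------|------------|-------|
| Avengers       | 21             | 2023-10-10 | 2023-10-23 | 87      | 58        | 0              | 0          | 58    |
| Avengers       | 20             | 2023-09-26 | 2023-10-09 | 46      | 38        | 0              | 0          | 38    |
| Avengers       | 19             | 2023-09-12 | 2023-09-25 | 51      | 50        | 0              | 0          | 50    |
| X-Men          | 21             | 2023-10-10 | 2023-10-23 | 51      | 41        | 0              | 0          | 41    |
| X-Men          | 20             | 2023-09-26 | 2023-10-09 | 66      | 79        | 0              | 3          | 79    |
| X-Men          | 19             | 2023-09-12 | 2023-09-25 | 18      | 30        | 0              | 0          | 30    |
| Justice League | 21             | 2023-10-10 | 2023-10-23 | 90      | 75        | 0              | 0          | 75    |
| Justice League | 20             | 2023-09-26 | 2023-10-09 | 120     | 121       | 8              | 0          | 129   |
| Justice League | 19             | 2023-09-12 | 2023-09-25 | 108     | 77        | 0              | 0          | 77    |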

The definitions for these states can be found here.

We need to do a little more math, though, for this to become a valuable reporting metric. Unfortunately, the rest of the business and the investors don’t care about your sprints; they care about monthly and quarterly aggregates.

So, let’s start there with the math that rolls up sprints to a monthly value. It’s pretty fun. Each sprint gets a Completion % (Total divided by Planned), and we need to determine which month a sprint falls into. My calculation chooses the month that contains more days of the sprint, and if the split is equal, the month in which the sprint starts.
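The month rule can be sketched as follows. This is a minimal illustration, not the production calculation: SprintCalendar and ReportingMonth are hypothetical names, and the dates are assumed to be calendar days with no time component.

```csharp
using System;

public static class SprintCalendar
{
    // Returns the first day of the month that contains the most days of the sprint.
    // On a tie, the month in which the sprint starts wins.
    public static DateTime ReportingMonth(DateTime start, DateTime end)
    {
        var best = new DateTime(start.Year, start.Month, 1);
        var bestDays = 0;

        // Walk every month the sprint touches, starting with the start month.
        for (var month = best; month <= end; month = month.AddMonths(1))
        {
            var monthEnd = month.AddMonths(1).AddDays(-1);
            var from = start > month ? start : month;
            var to = end < monthEnd ? end : monthEnd;
            var days = (to - from).Days + 1; // inclusive day count

            // Strict '>' keeps the earlier (start) month when the split is equal.
            if (days > bestDays)
            {
                best = month;
                bestDays = days;
            }
        }

        return best;
    }
}
```

For the sprint running 2023-09-26 to 2023-10-09, five days fall in September and nine in October, so the sprint is reported under October.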

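| Team           | Iteration Path | Start Date | End Date   | Planned | Completed | Completed Late | Incomplete | Total | Completion % | Month | Year |
|----------------|----------------|------------|------------|---------|-----------|----------------|------------|-------|--------------|-------|------|
| Avengers       | 21             | 2023-10-10 | 2023-10-23 | 87      | 58        | 0              | 0          | 58    | 67%          | 10    | 2023 |
| Avengers       | 20             | 2023-09-26 | 2023-10-09 | 46      | 38        | 0              | 0          | 38    | 83%          | 10    | 2023 |
| Avengers       | 19             | 2023-09-12 | 2023-09-25 | 51      | 50        | 0              | 0          | 50    | 98%          | 9     | 2023 |
| X-Men          | 21             | 2023-10-10 | 2023-10-23 | 51      | 41        | 0              | 0          | 41    | 80%          | 10    | 2023 |
| X-Men          | 20             | 2023-09-26 | 2023-10-09 | 66      | 79        | 0              | 3          | 79    | 120%         | 10    | 2023 |
| X-Men          | 19             | 2023-09-12 | 2023-09-25 | 18      | 30        | 0              | 0          | 30    | 167%         | 9     | 2023 |
| Justice League | 21             | 2023-10-10 | 2023-10-23 | 90      | 75        | 0              | 0          | 75    | 83%          | 10    | 2023 |
| Justice League | 20             | 2023-09-26 | 2023-10-09 | 120     | 121       | 8              | 0          | 129   | 108%         | 10    | 2023 |
| Justice League | 19             | 2023-09-12 | 2023-09-25 | 108     | 77        | 0              | 0          | 77    | 71%          | 9     | 2023 |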

Aggregating these values can be done in a few different ways. We’re combining teams and sprints to get a monthly representation for the group as a whole. I’ve found four reasonable ways to calculate this value across teams and sprints:

  • Basic Average
  • Unweighted Average
  • Weighted Average
  • “Inverted”

Basic Average

The most basic average. This would be the average of all the values for the Completion % column for a given month and year. While this is a straightforward value to calculate, I’ve found it gives too much weight to the individual sprints. For example, one lousy sprint, even with a minimal planned value, can drastically change this calculation.

Unweighted

This is the sum of the Total column divided by the sum of the Planned column for a given month and year. This assigns too little weight to individual sprints and doesn’t address the discrepancies in point values across teams.

Weighted

This has been my go-to calculation for years. This is a two-phased calculation. First, we roll up the value for the individual teams. We do this with the unweighted model but filter by Team in addition to month and year. Then, we average those values. This handles a team having a lousy sprint but recovering in the next, as well as the differences in point values.
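The three calculations described so far can be compared side by side. The sketch below is illustrative only: the Sprint record and CompletionMetrics class are hypothetical names, the data is the six October 2023 rows from the table, and every sprint is assumed to have a Planned value greater than zero.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// One sprint row, reduced to the columns the calculations need.
public record Sprint(string Team, int Planned, int Total);

public static class CompletionMetrics
{
    // The October 2023 rows from the table above.
    public static readonly Sprint[] October2023 =
    {
        new Sprint("Avengers", 87, 58),
        new Sprint("Avengers", 46, 38),
        new Sprint("X-Men", 51, 41),
        new Sprint("X-Men", 66, 79),
        new Sprint("Justice League", 90, 75),
        new Sprint("Justice League", 120, 129),
    };

    // Basic average: the mean of each sprint's own completion ratio.
    public static double BasicAverage(IEnumerable<Sprint> sprints) =>
        sprints.Average(s => (double)s.Total / s.Planned);

    // Unweighted: all delivered points divided by all planned points.
    public static double Unweighted(IEnumerable<Sprint> sprints)
    {
        var list = sprints.ToList();
        return (double)list.Sum(s => s.Total) / list.Sum(s => s.Planned);
    }

    // Weighted: the unweighted value per team, then the mean across teams.
    public static double Weighted(IEnumerable<Sprint> sprints) =>
        sprints.GroupBy(s => s.Team).Average(team => Unweighted(team));
}
```

For the October rows, this works out to roughly 90.0% (basic), 91.3% (unweighted), and 90.6% (weighted). The values are close here, but they drift apart quickly when one team plans far more points than another.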

But what about the team that fell short? They didn’t get all that work done. It doesn’t feel like the numbers represent reality if the work not getting done was high value or high visibility. The first phase of the weighted model allows for a disappointing sprint, and if the team is working ahead or catching up, we’re sweeping that bad sprint under the rug. While this hadn’t always directly worried me, my colleagues, who had been expecting certain things and not seeing them delivered despite the 100%+ completion rates, were getting a little frustrated.

So I’ve come up with a new number to properly represent just that: how much work we aren’t getting done every month.

“Inverted”

“Inverted” may be more representative of the commitment to the business. It shows if we did what we committed to but discounts the value of above and beyond work. This calculation has a maximum of 100%. The calculation is multi-phased. The first phase is the same as weighted. Then, we “invert” the monthly team values. If the number is less than 100%, we report the difference; otherwise, we report 0. Next, we average those shortfall percentages. And finally, we subtract that value from 100%.

The inverted value is more representative of our accountability to the business. It should be noted that this value doesn’t entirely neglect above and beyond work but severely discounts it. Namely, when the X-Men go above and beyond, it won’t outweigh the shortcomings of the Avengers that month.
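As a rough sketch of the arithmetic, the function below takes each team’s monthly completion ratio (the first phase of the weighted model) and applies the inversion. InvertedCompletion and Calculate are hypothetical names chosen for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class InvertedCompletion
{
    // teamCompletion holds one ratio per team for the month:
    // sum of Total divided by sum of Planned, so 1.0 means 100%.
    public static double Calculate(IEnumerable<double> teamCompletion)
    {
        // A team's shortfall is its gap below 100%.
        // Teams at or above 100% contribute a shortfall of 0.
        var averageShortfall = teamCompletion
            .Select(ratio => Math.Max(0.0, 1.0 - ratio))
            .Average();

        // Subtracting the average shortfall caps the result at 100%.
        return 1.0 - averageShortfall;
    }
}
```

Using the October 2023 rows, the team ratios are 96/133 for the Avengers, 120/117 for the X-Men, and 204/210 for the Justice League. `InvertedCompletion.Calculate(new[] { 96.0 / 133, 120.0 / 117, 204.0 / 210 })` comes out at roughly 89.8%, slightly below the weighted value of about 90.6%, because the X-Men’s extra work no longer offsets the other teams’ shortfalls.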

Conclusion

Tracking software team metrics is an essential aspect of maintaining a healthy engineering organization. While common metrics such as bug count and velocity provide a basic understanding of team performance, they often fall short in providing a comprehensive view of the team’s efficiency and productivity. This article has explored the concept of sprint completion percentage as a more insightful metric, offering a comparison of actual work done against planned work.

In essence, the choice of metric and calculation method should align with the team’s objectives and the expectations of stakeholders. By adopting a more nuanced approach to tracking software team metrics, organizations can gain deeper insights into team performance, improve forecasting accuracy, and ultimately drive better business outcomes.


What Even Is Innovation?

I was once asked about the most inventive or innovative thing I’d done. Where to start? I’m a middling engineer at best. I fully subscribe to my own pitch as a leader that engineers should prioritize simplicity and obviousness over performance and cleverness.

That said, I have an obvious answer to “What is the most interesting problem you ever solved?” And just to be transparent and fair, I didn’t solve this in a vacuum. I worked with a great team and would not have succeeded without their help.

The innovation I’m proud of is a little embarrassing due to the underlying technology. While I was at Mindbody, we uncovered an impactful limitation of scaling Classic ASP web applications. That’s right, Mindbody was still very much reliant on Classic ASP, which had been deprecated with the arrival of .NET. The solution to this scaling problem wasn’t particularly complex, but the novelty and impact qualify as innovative. In the end, we were able to proactively identify, remediate, and prevent future consequences of the limitation.

In late 2017, our VP of Engineering asked me to investigate an issue plaguing another team in his org. I was a Senior Manager overseeing other teams in technically a different department, but I and some of my group had historical experience in the code in question. The nominal problem: a deployed bundle of changes resulted in a 10% increase in CPU usage in production. Rolling the deployment back brought the usage back down, and vice versa. Additionally, the CPU increase was not detectable outside of the production environment. ☹️

I started by enlisting one of the senior engineers on my team, and we began reviewing the changes in the associated deployment. Nothing initially jumped out at us, but on the third pass, I began to suspect that the problem could be related to a change to an #include reference file. Please see my earlier post about conditional include references to understand why this is already a potential issue (and to begin to understand my absolute hatred of the continued use of VBScript). Side note: VBScript was awesome circa 1997. But, like everything else in the universe, we evolved, and the evolution of VBScript on the server was .NET. Now, if you want to complain about people choosing to use VBScript after 2001, I’d be happy to drink my sorrows away beside you. Rant over, for now. This reference file had itself added another reference, which is typical. But in this case, the outer file was referenced by almost every top-level file. The heavy usage of the modified file meant that this small change was probably causing a wider-than-obvious impact.

To test the hypothesis that this one-line change was the culprit, we removed that commit from the bundle and redeployed it without issue. The CPU usage increase disappeared! While the immediate problem was solved, I still wanted to know the root cause and prevention methods.

I then endeavored to prove this issue was detectable via static code analysis. My second hypothesis was that this was related to the server doing more work interpreting more lines per request. The structure of Classic ASP requires that every single line be interpreted when served. Therefore, I suspected that more lines interpreted meant more work being done per request and, in turn, higher CPU usage.

To measure this, we created a NodeJS command line tool to analyze the codebase. We used NodeJS because it truly is the best way to share multi-platform CLIs. And thank you, TJ, for commander.js! The references in the include files created an easily traversed tree. The tree was then flattened and converted to a total number of interpreted lines for any given top-level file.

We enhanced the tool to provide additional insights, such as the theoretical minimum total lines (fully optimized but impractical to maintain) and the specific references to any included file, as well as a bloat factor, which represented how far the structure of a file was from the optimal. The results were output as one CSV file and a collection of JSON files.
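As a rough illustration of the bloat factor, the sketch below treats it as a simple ratio of actual interpreted lines to the theoretical minimum. That definition is an assumption for this example; the real tool may have calculated it differently. The inputs are the rounded totals reported in this post.

```javascript
// Assumed definition: bloat factor = actual interpreted lines / theoretical minimum.
// A value of 1.0 would mean the structure is already optimal.
function bloatFactor(totalLines, minimumLines) {
  if (minimumLines <= 0) {
    throw new RangeError("minimumLines must be positive");
  }
  return totalLines / minimumLines;
}

// Rounded figures from this post ("just over 12 million" is taken as 12 million).
const theoreticalMinimum = 12e6;
const snapshots = [
  ["before the change", 26e6],
  ["after the one-line change", 52e6],
  ["after restructuring", 19e6],
];

for (const [label, total] of snapshots) {
  console.log(`${label}: ${bloatFactor(total, theoreticalMinimum).toFixed(2)}x`);
}
```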

The results were astounding! The original (problematic) one-line change increased the total number of interpreted lines from 26 million to 52 million. On the other side of the spectrum, the theoretical optimal number of lines was just over 12 million.

From the insights gleaned from the analysis, we could then restructure the file references to a more optimal state. Finally, we submitted pull requests to the owning team and reduced the total interpreted lines to 19 million.

Lastly, I saw that these new insights could keep this specific issue from recurring. So, we added a step to the build process that runs the analysis and enforces a configurable maximum on the total interpreted lines.
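A build gate like this can be very small. The sketch below is an assumption about its shape, not the original step: the MAX_INTERPRETED_LINES environment variable, the default limit, and the sample totals are all hypothetical.

```javascript
// Sum the per-file totals and compare the result against a budget.
function checkLineBudget(totalsByFile, maxTotal) {
  const total = Object.values(totalsByFile).reduce((sum, n) => sum + n, 0);
  return { total, maxTotal, ok: total <= maxTotal };
}

// Hypothetical configuration: the limit comes from an environment
// variable so it can be adjusted without a code change.
const maxTotal = Number(process.env.MAX_INTERPRETED_LINES || 5000);

// Hypothetical analysis output: interpreted lines per top-level file.
const totalsByFile = { "page.asp": 2800, "report.asp": 1200 };

const result = checkLineBudget(totalsByFile, maxTotal);
if (result.ok) {
  console.log(`OK: ${result.total} interpreted lines (limit ${result.maxTotal})`);
} else {
  console.error(`FAIL: ${result.total} interpreted lines exceeds limit ${result.maxTotal}`);
  process.exitCode = 1; // a non-zero exit code is what fails the build
}
```

Setting the exit code rather than throwing lets the step print a readable message and still stop the pipeline.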

Over the years, other engineers extended the tool with visualizations of the reference tree, along with various library upgrades and bug fixes. It was still a critical build step at the time of my departure.

While none of the technology is particularly glamorous, I am proud of this innovation. Over a few weeks, existing concepts and platforms were reorganized to create something novel and beneficial. We didn’t patent anything. We didn’t write a new language. Heck, we couldn’t even really talk about it for two main reasons: 1. The org didn’t want to admit to using outdated technology, 2. Who else was using that tech and would be interested in listening?

So, as I said at the beginning, I subscribe to my own pitch of simplicity. We used basic tools and concepts and put them together in a new way.

P.S. I’m not sure how much we saved the company, but it has to be substantial. At least 10 teams were blocked for 3 weeks from deploying to production. I think they would’ve continued to run into this issue, even if they found it in this instance, and probably would’ve resorted to massively overscaling production infrastructure. Yikes!

P.P.S. Let’s take a minute to discuss what was probably happening here. I say probably because I don’t know for sure the absolute underlying issue, and even if I did, there really isn’t any fixing it for this ecosystem.

VBScript works by retrieving the requested page/resource (something.asp) and then processing the contents based on the context/request and rendering the output. Again, top-notch for 1997.

VBScript is a v1 product. It isn’t optimized beyond what the engineers fathomed at the time of writing. So, VBScript pulls the initial ASP file from disk and processes it line-by-line. If there is an #include, it retrieves that and also processes it line-by-line. Why does it process every line? Because it’s a scripting language at heart, and those lines can modify global state outside of a method body (again, see my post on VBScript conditional includes). So, it is doing a lot of work for each page request. The engineers knew about this, so they created a cache of page contents to avoid going to disk every time.

In our case, though, these two concepts collide and clobber each other. The need to process each request creates a ton of work, and the page sizes themselves become massive due to the (substantial but not infinite) recursive nature of the pages. The server is doing more work, the cache can’t keep up, and so it’s doing even more work in vain. Brutal.

In the end, they did improve Classic ASP/VBScript … they created .NET.


When to Microservice

I’m enjoying Microsoft Build 2022. Developer experience (especially in the face of common and complicated IaaS and PaaS scenarios) was my favorite topic of the day 1 keynotes.

Later, watching the keynote after hours, I stumbled on a gem of a conversation between Scott Hanselman and Scott Guthrie.

Lots of classic “it depends,” which is totally true. For me, it depends on at least one of three macro factors being present:

  1. Teams/people need to develop and deploy at different paces.
  2. Parts of the system need to scale independently.
  3. Parts of the system need to be segmented for security purposes. Ex: only engineers from the payments team can make changes to payments systems.

You can have any decomposition you like, but in that video Scott Guthrie alludes to the challenges you can face on either end of the spectrum (1 engineer with 100 micro-services or 100 engineers with one service).

One last note: I may start saying containers instead of micro-services going forward. I usually try to say that I prefer macro-services, but then we have to have a whole discussion about the difference. Maybe the term container will become the de facto descriptor of services and their boundaries.