(2023-06-20) Hebert Embrace Complexity Tighten Your Feedback Loops

Fred Hebert: Embrace Complexity; Tighten Your Feedback Loops. "Instead I decided to follow my gut feeling and go with what I think really explains my perspective and the approach I bring with me to work, and even to my life in general."

So “this is all going to hell anyway” pervades my approach.

Any improvement will be used to bring it right to that edge

In what is probably my favorite paper ever, titled Moving Off The Map, Ruthanne Huising ran ethnographic studies by embedding herself in projects within many large corporations doing planned organizational changes. In support of these efforts, the organizations were doing “tracing” of their own functions. cf (2019-12-31) Can You Know Too Much About Your Organization

They generally reached out to experts within the organization who were supposed to know how things worked. Even then, they were really surprised.

Others would state that “the problem is that it was not designed in the first place.”

One of the most surprising results reported there came from tracking the people who participated in organizing and running the change projects, and seeing who got promoted and who left.

The CEO sat down, put his head on the table, and said, “This is even more fucked up than I imagined.” He realized that the operation of his organization was out of his control, and that his grasp on it was imaginary. (illusion of control)

She found there were two main types of outcomes. The first group was filled with people who got promotions: mostly folks who worked in communications or training, who managed the costs and savings of the projects, or who helped do process design.

The other group, however, mostly contained people who moved to the periphery: away from core roles in the organization, sometimes becoming consultants, or leaving altogether. Those who fit this category happened to be the people who collected the data and created the map. They attributed their moves to finally understanding the organization better, to feeling more empowered to change things, or to becoming so alienated by the results that they wanted out.

As a continuation of this, the way people work every day is often different from the way people around them imagine their work is being done.

When you see this mismatch causing people to ignore or bend rules, you can choose to apply authority and demand stricter rule-following. Enforcing the rules harder will likely drive these adaptations underground rather than stamping them out, because real constraints drive that behavior.

The majority of answers, nearly 60%, came from people saying "my time tracking was always fake and lies," with some stating they even wrote applications to generate realistic-looking time sheets.

Part of the reason is that everyday decisions are made by trying to deal with all sorts of pressures coming from the workplace, including the values communicated both as spoken and as acted out. People generally want to do a good job, and they'll balance these conflicting values and pressures as well as they can.

These small decisions accumulate based on the feedback we get from each of them, and can end up compounding.

that’s one way your culture can define itself.

You can easily foster your own local counterculture within a team if you want to. This can be good (say, in a skunkworks where you bypass a structure to do important work) or bad (normalizing behaviors that are counterproductive and create conflict).

you nevertheless get the best results by also aligning with or re-aligning some of the organizational pressures and values usually set from above.

So let's start with negotiating trade-offs, with a bit more of an ops-y perspective, because that's where I'm coming from.

This is a painful one sometimes, especially when you have highly professional people who take their jobs seriously.

I just asked off-hand: "Are you trying to deliver more reliability than people are asking for? What if you just stopped, let it burn more, and rested your people?" He thought about it seriously and said, "yeah, maybe."

In some cases, the answer will be "yes, we want to be this reliable". But you just won't be given the right tools to do it.

At Honeycomb, we want on-call rotations to have 5-8 people on them, because that’s the size we think gives a good pace and maintains balance.

But many services are owned by smaller teams of 3-4 people.

This makes us prepare to deal with more unknowns: fewer runbooks, more high-level switches and manual circuit breakers to gracefully degrade parts of the system and keep it running off-hours, and different patterns of escalation.

We're going to accept a bit of well-scoped, partial unavailability—something that happens a lot in large distributed systems—in order to keep the system stable.
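The "high-level switches" idea above can be sketched as a small manual kill-switch registry. This is a hypothetical illustration, not Honeycomb's actual tooling; all names (`FeatureSwitches`, the "enrichment" subsystem) are made up.

```python
import threading

class FeatureSwitches:
    """Thread-safe registry of manually operated kill switches.

    An on-call engineer flips a switch off to shed a non-critical
    subsystem instead of escalating at 3 a.m.
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._disabled = set()

    def disable(self, subsystem: str) -> None:
        with self._lock:
            self._disabled.add(subsystem)

    def enable(self, subsystem: str) -> None:
        with self._lock:
            self._disabled.discard(subsystem)

    def is_enabled(self, subsystem: str) -> bool:
        with self._lock:
            return subsystem not in self._disabled

switches = FeatureSwitches()

def handle_request(payload):
    # The core path always runs; optional work degrades gracefully.
    result = {"stored": payload}
    if switches.is_enabled("enrichment"):
        result["enriched"] = True  # stand-in for expensive optional work
    return result

# Off-hours, on-call sheds the optional subsystem to keep the core stable.
switches.disable("enrichment")
```

The design choice here is that the switch scopes the unavailability: only the enrichment step disappears, while ingestion keeps working.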

That’s one of the complex trade-offs we can make between staffing, training/onboarding, capacity planning, iterative development, testing approaches, operations, roadmap, and feature delivery.

To make these tricky decisions, you have to be able to bring up these constraints and challenges, and have them be discussed openly, without repression that forces them underground.

We went over 30 or so incident reports written over the previous year, and a pattern quickly came up: many reports mentioned "lack of tests."

they knew that the code was buggy. But they felt in general that it was safer to be on-time with a broken project than late with a working one. They were afraid that being late would put them in trouble and have someone yell at them for not doing a good job.

When I went up to upper management, they absolutely believed that engineers were empowered and should feel safe pressing a big red button that stopped feature work if they thought their code wasn't ready. The engineers on that team felt that while this is what they were being told, in practice they'd still get in trouble.

Sometimes you can eat downtime or degraded service because it’s going to keep your workload manageable and people from burning out.

Conversely however, you have to be able to call out when your teams are strained, when targets aren’t being met and customers are complaining about it.

Metrics are good to direct your attention and confirm hypotheses, but not as a target, and they’re unlikely to be good for insights. (Goodhart's Law)

This loss of context is a critical part of dealing with systems that are too complex to be adequately represented by a single aggregate.

As a related concept, if you act on a leading indicator, it stops leading, particularly when it’s influenced by trade-offs.

Our storage engine's disk storage used to be our main bottleneck.

This was a useful signal, but it also drove costs up, and eventually became the target of optimization.

An engineer successfully made our data offloading almost an order of magnitude faster.

Removing this limit, however, messed with our ability to know when to scale, which then revealed issues with file descriptors, memory, and snapshotting times.

Writing a procedure means little unless people actually see its value and believe it’s worth following.

A related concept: if you track things like action items after an incident analysis and they go into the backlog to die, it may not be that your people are failing to follow through. It might be impractical to do so, or the action items may never have felt useful, and the process itself needs to be revisited rather than reinforced.

The shortest feedback loop may be attained by giving people the tools to make the right decisions right there and then, and let them do it. Cut the middlemen, including yourself.

How do you make that work? We come back to goal alignments and top priorities being harmonized and well understood. If the pressures and goals are understood better, the decisions made also work better. (strategic context)

they need to trust you back with critical and unpleasant information as well. The feedback flows both ways, and this hinges on psychological safety.

Trust also means that if you want people to be innovative, you have to allow them to make mistakes.

Finally, let's look at shifting perspective away from a bare analysis and onto a more systemic point of view.

The most basic point here is that you can’t expect to change the outcome of the small decisions that accumulate all the time if you never address the pressures within the system that foster them.

If your root cause sits at the weed level, you’ll keep pulling on them forever and will rarely make decent progress. The weeds will keep growing back no matter how many roots you remove.

But the tip here is probably: look at which behaviors you want to see happen, and give them room to grow.

I find it useful to keep focusing on what behavior an indicator triggers (the interaction) rather than only what it reports directly.

SLOs aren’t hard and fast rules. When the error budget is empty, the main thing that matters to me is that we have a conversation about it, and decide what we want to happen from there on. Are we going to hold off on deploys and experiments? Are we able to meet the objectives while on-call, with some scheduled corrective work or some major re-architecting? Can we just talk to the customers? Were our targets too ambitious, or are we going to eat dirt for a while?

For any of these choices, we also have to know how this is going to be communicated to users and customers, and having these discussions is the true value of SLOs to me. SLOs that flow outside of engineering teams provide a greater feedback loop about our practices, further upstream, than those that are used exclusively by the teams defining them, regardless of their use for alerting.
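To ground the "error budget is empty" conversation, it helps to have concrete numbers. A minimal sketch, with an illustrative 99.9% target and 30-day window (not Honeycomb's actual figures):

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Total allowed 'bad' minutes for the window under the SLO target."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)

def budget_remaining(slo_target: float, window_days: int,
                     bad_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative = blown)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - bad_minutes) / budget

# A 99.9% SLO over 30 days allows 43.2 minutes of unavailability.
# Spend 21.6 of them and half the budget remains; spend 86.4 and it's
# blown twice over, which is when the conversation above has to happen.
```

The point of the arithmetic is that a tight SLO on a small team's service converts directly into very few tolerable bad minutes, which is exactly the staffing-versus-reliability trade-off discussed earlier.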

Finally, this is where SREs are well placed to shine. You can be away from the central roles, away from the decision-making, on the periphery. By being outside of silos and floating around the organization’s structure, you can take information from many levels, carry it around, and close the loop on so many decisions made in the organization by noting their impact once they’ve hit a production system and carrying it back.

It is an iterative exercise; our sociotechnical systems are alive. By carrying pertinent signals and amplifying them, you can influence how long it’s gonna take before it all goes to hell anyway.

