I Hope We Don't Learn the Wrong Lesson From the Crowdstrike Incident

Published: 2024-08-01
Tagged: essay coordination news progress

I often find myself at a loss when I try to explain how crappy most software is. People outside the industry just don't believe me. The recent Crowdstrike incident may change that, though. Unfortunately, I think it'll also nudge us, however indirectly, towards regulation and stagnation.

Friday's events--grounded planes, closed supermarkets, and bewildered bankers--are hard evidence of shoddy software engineering at Crowdstrike. First, they went against conventional wisdom and deployed a change to all their agent programs at once. Standard operational practice demands that such changes be rolled out incrementally, beginning with a tiny number of nodes, usually in some controlled test environment.
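
To make that concrete, here's a minimal sketch of a staged rollout, assuming a hypothetical fleet API. The stage sizes, bake time, and function names are all invented for illustration--this is not Crowdstrike's actual tooling.

```python
import time

# Hypothetical staged rollout: `fleet` is a list of node handles with an
# `apply` method, `healthy` is a caller-supplied health check. All invented.
STAGES = [0.001, 0.01, 0.1, 0.5, 1.0]  # fraction of the fleet at each stage
BAKE_TIME_SECONDS = 30 * 60            # let each stage soak before expanding

def rollout(update, fleet, healthy):
    deployed = 0
    for fraction in STAGES:
        target = max(1, int(len(fleet) * fraction))
        for node in fleet[deployed:target]:
            node.apply(update)
        deployed = target
        time.sleep(BAKE_TIME_SECONDS)
        if not healthy(fleet[:deployed]):
            raise RuntimeError(f"halting rollout at {fraction:.1%}: health check failed")
```

The point of the structure is that a bad change gets caught while it affects a fraction of a percent of machines, not the whole fleet.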

More damning, however, is the nature of the failure itself. From the facts that have emerged, it appears that Crowdstrike pushed malformed data to their agent programs. These programs accepted the bad data, which is surprising because it's standard practice to validate inputs before processing them. Then, instead of handling the failure gracefully, i.e., minimizing the damage or interference to the user (also standard practice), the agent temporarily transformed the machine into a useless paperweight. Finally, there's also the question of how the malformed data was produced. Normally that kind of thing would be handled by a program written specifically to ensure the data it produces is valid.
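
Here's a hedged sketch of what validate-then-degrade-gracefully looks like, written in Python rather than the kernel-level code a real agent would use. The file format and field names are invented for illustration.

```python
import json
import logging

def load_update(path):
    """Parse and validate a content update; never crash the host on bad data."""
    try:
        with open(path, "rb") as f:
            data = json.loads(f.read())
        # Reject anything that doesn't match the expected shape *before* use.
        if not isinstance(data, dict) or not isinstance(data.get("rules"), list):
            raise ValueError("missing or malformed 'rules' field")
        return data
    except (OSError, ValueError) as err:
        # Graceful degradation: log the problem, keep running with the last
        # known-good rules, and leave the machine usable.
        logging.error("rejected malformed update %r: %s", path, err)
        return None
```

Both standard practices are right there: the shape check rejects bad data up front, and the except branch chooses degraded operation over taking the machine down.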

One failure could be labeled an incident. It's guaranteed that a complex system will eventually fail. But multiple failures, especially basic ones like deploying a change to all agents at once, suggest instead an accumulation of shortcuts, bad practices, and half-baked design decisions. In other words, a subpar engineering process.

I'm comfortable with that judgement--at least until more facts emerge. We're talking, after all, about a critical component of a core product. A core product usually gets the largest slice of leadership's attention and, thus, resources for development. That such a spectacular failure occurred anyway turns my suspicion squarely towards Crowdstrike's leadership.

By the way, I don't feel some sick satisfaction writing this. I'm just sad to see such hard evidence of the sorry state of our industry. A glitch like this snowballing into what we witnessed that Friday also casts a shadow on IT specialists the world over. They decided to give Crowdstrike's software administrator-level access and to accept whatever updates Crowdstrike decided to push down the wire.

The first decision is defensible to an extent. Cybersecurity software needs the highest possible level of access to do its job effectively. Lacking that, malware could easily avoid detection and extermination. But any business that puts this kind of software into its critical path should act with care, even paranoia, because if something goes wrong there, the whole system can go down (which it did).

Moreover, unconditionally accepting updates puts all of Crowdstrike into the critical path. Changes to hiring practices, the career ladder, or even the quality of snacks at the office can all affect a customer's computer infrastructure. It's essentially like grafting a whole company onto your own!

I can imagine the sales pitch describing all of the above as a great idea: get real-time automated threat protection--just check a few boxes and let the good people at Crowdstrike handle all the work. Go ahead, enjoy a good night's sleep or a Starbucks coffee. You've earned it.

But when dealing with software dependencies like what Crowdstrike sells, the accepted practice is to rely on pull-based updates. Doing so allows organizations to control not only what to install and when, but also how changes get deployed, giving the people closest to the business a chance to decide the right balance between risk and speed. Crowdstrike can't do that. They're not in their customers' business.
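
A pull-based scheme fits in a dozen lines: the customer's own tooling decides when to check, what to adopt, and holds back anything it hasn't tested. The endpoint and field names below are invented for illustration.

```python
import json
import urllib.request

APPROVED_VERSION = "1.2.3"  # invented; set by local staff after testing

def check_for_update(feed_url):
    """Ask the vendor what's available, but install nothing automatically."""
    with urllib.request.urlopen(feed_url) as resp:
        manifest = json.load(resp)          # e.g. {"latest_version": "1.2.4"}
    latest = manifest["latest_version"]     # invented field name
    if latest == APPROVED_VERSION:
        return None                         # already on the vetted version
    # A newer version exists: surface it for local testing, don't deploy it.
    return latest
```

Nothing reaches production until someone who understands the business bumps APPROVED_VERSION.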

I think the scale and breadth of this incident will motivate the government to foist more regulations on the software industry. (Let's just ignore for the moment the irony that Crowdstrike is in business because it makes regulatory compliance easier.) It seems like the obvious path forward because that's what society has done to all sorts of other engineering disciplines. The safety of our cars and bridges, and of the wiring and plumbing in our homes, is ready evidence for the effectiveness of this approach.

But producing software is only partially like engineering. However critical it may be to the functioning of modern civilization, a major part of making it involves the creative and the personal. That's best reflected in all the websites people make for fun, alongside the countless game servers, chatrooms, and other online spaces that folks build and operate just for the heck of it.

I must concede that this kind of software probably doesn't make up more than 10% of all the software out there, and it's definitely not as civilizationally load-bearing as the other 90%. But I believe that the fun software is where a lot of great ideas originate, and also what attracts--maybe even produces--engineers who push the limits of what's possible. This is evident if you ever listen to someone like Rich Hickey or John Carmack or other well-known programmers. They all sound like they love what they're doing.

I'm not saying that designing a bridge isn't fun. I'm sure there are many certified engineers who love what they're doing. But software is special in that someone like Aaron Swartz or Linus Torvalds can nudge or even push the whole industry in a new direction.

That would, however, become much more difficult if we regulated software just like the other engineering disciplines. The extra requirements and constraints would filter out a lot of people, and even those who got in would find their freedom of movement, of creation, severely limited by having to follow rules composed by clueless bureaucrats. Look at those ugly cookie banners: do they actually serve any useful function?

We could continue this discussion and hope that we'll arrive at some optimal balance between freedom and safety, but I think there's a more promising path forward, one that I haven't seen talked about much.

What if, instead of tightening the screws on engineers, we shifted the burden of accountability to leadership?

Leadership explicitly sets the quality bar by structuring the hiring and promotion processes. They also do so implicitly by enforcing certain norms, like when it's OK to take shortcuts, or what happens when a deadline is about to be missed. Perhaps being on the hook for the consequences of these decisions would incentivize them to take a keener interest in how engineering is done in their company. What's more, because it's unlikely they would have the necessary expertise, they would have to give engineers a bigger seat at the table instead of treating them like warm bodies sitting in a box labeled "cost center."

The obvious problem with this approach is that leadership can wiggle their way out of such accountability. Their position in the company hierarchy gives them the means to try, and not half-bad odds of succeeding. It would be foolish to trust ordinary people to willingly accept just punishment. A few probably would, but many wouldn't. We've already seen that in the VW emissions case, where executives blamed "rogue engineers".

I suspect there are a few reasons why this approach isn't talked about more. For one, it's simply less familiar. Licensing and similar restrictions are so commonplace that it's no wonder our minds instinctively reach for them first. Another reason is internal to the industry. Putting a government stamp on anything makes it look more serious and professional, which I think quite a few software engineers would welcome, since it would make them seem more accomplished to people outside the industry.

Whatever path we take, we'll get safer, more reliable software--eventually. If we regulate software engineers, however, I expect it'll take us longer because we'll encounter the same sort of stagnation we see in the nuclear energy sector. There, we haven't really tallied the damage and suffering caused by regulation that forced us to use fossil fuels for longer than necessary. Let's hope we've learned a lesson and avoid a similar future with software.
