LLMs Turned Out Pretty Useful, But

Published: 2024-09-08
Tagged: essay learning productivity progress sre software work

I first began using LLMs at work about a year ago. I was curious how good they were and also secretly hoped they'd make me more productive. They did--but not in the way I expected.

First, what is "work"?

My title is "site reliability engineer". That's someone who makes sure some critical piece of software isn't crashing too often or too badly. Day to day, my focus is split across development and operations. Development means figuring out solutions to problems and translating them into code. A big chunk of that involves reading code, since the new pieces have to work with what's already there. Usually, there's a lot already there.

Operations means actually running that code. There are many layers that need configuring to ensure the right version of the right program is running in the right place. What makes this tricky is that both the program and the system it's part of are constantly changing, and each change carries some risk of something going wrong. It's important to continuously update one's mental image of the whole contraption so that when something breaks, it can be found and fixed quickly.

In this context, LLMs--specifically ChatGPT 4o and Copilot--have proven useful in four distinct ways: power ducking, speed search, smart code completion, and communication cleanup.

Power ducking is a close cousin of rubber ducking. Rubber ducking means taking a little toy duck and explaining your problem to it. Often, a few minutes in, a perfectly viable solution materializes in your head. I don't know the mechanism behind this, but talking with colleagues and taking long showers seem to work in a similar fashion.

ChatGPT takes this technique to another level because it combines reflection with injections of new information. For example, let's say I need to add a feature that limits how much incoming traffic my service accepts so it doesn't get overwhelmed. It's not a domain I'm familiar with.

I begin by writing two or three paragraphs about my problem: what the current shape of my program is, why too much traffic is bad for it, what I think should happen to the rejected traffic, and so on. I then ask about prior art in this domain. ChatGPT replies with "token buckets", "weighted fair queues", and "proportional-integral-derivative controllers." In response, I ask about applications of these in software, their trade-offs, and boundary conditions. The first two are well-known algorithms, often used in networking, while the last one is what makes cruise control work. I decide to ignore the last one and request example implementations of the first two in Python.

Comparing the two, token buckets appear simpler, so I begin sketching out solutions to my problem using them. Then I ask questions about what I expect to become problems--what happens if I need to restart the service, what happens if I need to revalue the tokens, and so on--and I iterate on my sketches. At some point it looks good enough, so I go ahead and implement it in code.
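
To give a concrete sense of where those sketches end up, here's a minimal token-bucket limiter in Python. It's a rough sketch of the general technique, not the code from my service, and the rate and capacity numbers are purely illustrative:

```python
import time

class TokenBucket:
    """Minimal token bucket: refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum burst size
        self.tokens = capacity        # start with a full bucket
        self.last_refill = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True if the request may proceed, False if it should be rejected."""
        now = time.monotonic()
        # Add tokens for the time elapsed since the last check, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Illustrative usage: sustain roughly 100 requests per second, with bursts of up to 20.
limiter = TokenBucket(rate=100, capacity=20)
if not limiter.allow():
    print("rejecting request")  # shed load instead of queueing
```

The restart question above boils down to whether state like `tokens` and `last_refill` needs to survive a process restart; in a sketch this simple, it just starts over with a full bucket.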

Along the way, I verify the information ChatGPT is giving me. I look for existing code or papers and compare them with what I learned to see if I'm being taken for a ride. It doesn't happen often, but I've caught enough small mistakes to be wary.

Another way ChatGPT is useful is for finding specific details. In my work, I often have to look for documentation for some setting or function. I know it exists, but it's buried somewhere I'd normally have to click through a bunch of pages and skim a whole lot of text to find. This process is slowly becoming more tiring as Google Search gets worse. But with ChatGPT, I punch in a sentence and get back exactly what I want in a succinct paragraph.

Consider the problem of a server not doing what it's supposed to be doing. I encounter some variation of this problem a few times a year. Using what I have in finger memory, I can check some basics, like whether it's running the program it should (ps) or whether it has enough resources (top, vmstat). If those look good, though, I need to probe deeper. I know the commands for that (lsof, ss/netstat, etc.), but I don't remember the exact invocations because I only run them a handful of times a year. So I fire off a simple question to ChatGPT and get precisely what I need.
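
For flavor, these are the kinds of invocations I mean--nothing exotic, and the PID and port below are placeholders:

```sh
# Which processes are listening on which ports?
ss -tlnp

# What files and sockets does a given process have open? (1234 is a placeholder PID)
lsof -p 1234

# Which process has a given port open? (8080 is a placeholder port)
lsof -i :8080
```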

Now, each of these commands comes with extensive, well-written documentation. But again, I would need to search and skim, and because I usually need to run a dozen such commands, the overall time balloons quickly, so ChatGPT is a welcome help.

Next up, smart code completion. Regular code completion is a code editor feature that works by suggesting ways to complete a name. It makes coding go faster and helps avoid typos because many names in a program repeat again and again.

With a tool like Copilot, the editor instead suggests ways to complete a whole line of code--or even a whole chunk of a program. It was astonishing to witness this the first couple of times. My thoughts oscillated between "Wow, I can get 10x the work done now" and "My job will be gone in a year or two." Alas, neither looks true today.

Larger pieces of code that Copilot outputs are about 60-80% there. That's still amazing considering it takes just a second or two to get. And even with the work needed to get it to 100%, it's still much, much faster than punching in code the old-fashioned way. However, what keeps me from relying on this method is that the code is very average in quality. I assume that's because most of the code out there is simply not great--it's stuff that "compiles" or "works on my machine" or "is good enough for the deadline."

But code is like shared thinking, which makes good code an investment in the future and bad code a drag on development and a constant source of bugs. I'm particularly aware of this dynamic now because I'm onboarding into a new team. The code is well written, so I'm learning the system at what feels like breakneck speed.

So right now I use Copilot as just a more powerful version of code completion. I still wouldn't want to work without it, given that it probably saves me thousands of keystrokes per day and makes the work flow more smoothly.

The last area where LLMs have proven helpful is shaping up reports, announcements, and even peer reviews.

At a remote-first company, a lot of communication happens through writing. Because of that, I feel writing needs extra emphasis on clarity and simplicity. So whenever I put together anything longer than a paragraph, I copy it into ChatGPT and ask for suggestions for improving it in those terms.

It's like having an always available editor. I imagine a real one would be more helpful, what with experience and creativity, but ChatGPT is always online and infinitely patient--and, perhaps more importantly, the stuff I write needs a quick, pragmatic looking over. I would feel bad for forcing that kind of dull work on a real live person. Though I do need to spend more time developing my prompts here because otherwise I get very bland output.

Generally, I think large language models have been a net positive for me. They weren't the silver bullet I was secretly hoping for, but the unanticipated ways I'm putting them to use every day have been pretty helpful. I think I can get even more out of them if I improve my prompts. But, given how useful power ducking has been, I feel there's more value in exploring the types of problems I apply LLMs to.

Time to get creative.
