*26 April, 2024*
There is a thought experiment known as the Chinese Room Argument (Searle, 1980), in which you imagine yourself alone in a room, armed only with a computer program for responding to Chinese characters slipped under the door. You understand nothing of Chinese, yet by faithfully following the program's instructions to manipulate symbols and numerals, you can send back seemingly appropriate strings of Chinese characters. Those outside the room are misled into supposing there is a Chinese speaker inside.
The experiment argues that programming a computer may make it appear to understand language without producing real understanding; hence, the argument goes, the Turing test is inadequate. The first time you interact with systems like ChatGPT or Sora, it's hard not to feel some empathy with "those outside the room", marveling at their apparent intelligence. But as we peek behind the curtain to understand the "causes" of how these models operate, we may slide to the opposite conclusion: there's no real intelligence at work, it's "just a statistical machine". The mere existence of causes for a belief is commonly treated as raising a presumption that it is groundless. By design, given an input, the machine is trained to offer the "most likely continuation" based on patterns in web data. It seems counterintuitive that you could profit from the "average stuff" found online. After all, progress rarely comes from stating the obvious.
Generative AI can be thought of as a new form of electricity, useful for many different things. While much of the hype focuses on consumer-facing tools, there's one trend that deserves more appreciation: using AI as a developer tool. I use "developer" broadly here to mean any professional harnessing AI as a tool to iteratively elevate their work output. Compensation usually follows productivity, which means we should see roles and professions transformed as we discover new ways of being more productive and shift the creative kernels of our activities to higher levels of abstraction. This emerging force will, for instance, turn translators into editors (the creative component of the role is enhanced). In medicine, we'll probably need fewer radiologists because each can process 100 x-rays a day instead of 10. That brings down the cost of medicine and frees physicians to focus on higher-order tasks, perhaps even tackling historically neglected conditions like bad humour.
Agentic workflows are one of the best examples of emergent patterns that can support the proliferation of developer tools. But before delving into that, it’s useful to reflect on how we’ve been using language models and frame a different way of understanding the role of these models.
Picture a typical prompt to ChatGPT, whether the ask is to write code or an essay. What these two examples have in common is that we set high expectations for the model's output; in particular, we expect the answer to be right on the first attempt.
![[Pasted image 20250426161453.png]]
These expectations are at odds with our experience. Humans don't produce complex outputs in a single stroke. They need time to think, erase, reframe, revise, correct and iterate. Expecting perfection from a language model on the first go is like demanding a human programmer write bug-free code without ever running it. A more realistic scenario is to run the first version of the program only to discover a buggy program that either doesn't compile or doesn't do what you want. How do you solve it? You begin with an initial mental model of how your code operates. The existence of the bug is sufficient proof that this mental model is flawed. To rectify it, you devise an experiment to track down where the imperfection lies. This might involve writing tests or inserting `print()` statements to verify that the program's state aligns with your expectations. Analyzing the output allows you to tinker and refine your mental model. Few bugs can withstand this process unblemished.
![[Pasted image 20250426161548.png]]
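The debugging loop above can be made concrete with a toy example. The function names and the bug here are invented for illustration; the point is the cycle of hypothesis, experiment, and refinement:

```python
def average(xs):
    # First draft: the divisor was hard-coded while testing only on pairs.
    return sum(xs) / 2

# Mental model: average([1, 2, 3, 4]) should be 2.5; the bug proves otherwise.
# Experiment: print the program's state to locate the flaw.
data = [1, 2, 3, 4]
print(sum(data), len(data))  # prints "10 4": the divisor should be len(xs)

def average(xs):  # revised after analyzing the output
    return sum(xs) / len(xs)

assert average(data) == 2.5
```

No programmer expects the first version to survive contact with real inputs; the experiment is what corrects the mental model.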
In light of this example, it seems unreasonable to expect language models to exhibit a degree of computational reducibility that would allow them to effectively "cut corners" and arrive at solutions in a single inference step. So how do we give LLMs "time to think"? An obvious proxy is the volume of output tokens: we need to get them to think before they speak. A more powerful model for arriving at solutions is therefore to allow a (potentially large) number of inference steps for a given task:
![[Pasted image 20250426161604.png]]
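In code, the shift is from one model call to a loop of calls. The sketch below uses a placeholder `llm` function (an assumption, standing in for a real model API) so the control flow is runnable; each extra iteration is one more "inference step" spent revising the draft:

```python
def llm(prompt: str) -> str:
    """Placeholder for a real model call -- an assumption, not a real API.
    Here it just echoes the draft with a marker appended."""
    return prompt.split("DRAFT:")[-1].strip() + " (refined)"

def solve(task: str, n_steps: int = 3) -> str:
    # One inference step produces a first attempt...
    draft = llm(f"Give a first attempt at: {task}\nDRAFT: {task}")
    # ...and each remaining step in the budget critiques and revises it.
    for _ in range(n_steps - 1):
        draft = llm(f"Improve this answer.\nDRAFT: {draft}")
    return draft
```

With `n_steps=1` this collapses back to the single-shot prompt; larger budgets buy the model more "time to think".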
So how can we effectively utilize these "N inference cycles" to tackle our tasks? It brings to mind the fractal nature of writing quality software: you want certain properties to repeat at different scales. Just as a well-designed function should do one thing, be testable, composable, and have a clean interface, these same characteristics should scale up to modules, libraries, services, and even entire systems. For our present example, we can look at the "N inference cycles" as a computation budget, in the same way that we traditionally write a program to run under a certain number of CPU clock cycles. The difference is the level of abstraction (the scale). With a single CPU cycle you can afford to sum two numbers or read a register. With a single LLM cycle, you can "create a thought".
![[Pasted image 20250426161618.png]]
Obviously we can physically implement one using the other, but from a computational point of view each can be thought of as a basic unit of computation. Agentic workflows are ways of spending this computational budget in pursuit of better solutions.
Let’s take an example. Imagine needing to compare workers' compensation laws across two states, like New York and California. A reasonable approach would be the following:
1. Retrieve the full legal text for both states' laws
2. Summarize the key provisions of each
3. Compare the summaries to highlight differences
4. Refine the analysis and present the findings clearly
We could use our LLM computational budget (the number of inference steps) to explicitly ask the model to solve each of the steps above. A more general approach would be to ask the model itself to come up with a plan and then execute it.
![[Pasted image 20250426161627.png]]
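A minimal plan-then-execute loop might look like the following. The `llm` stub and its hard-coded plan are assumptions made purely so the sketch runs; a real system would call a model API and parse its free-form output more carefully:

```python
def llm(prompt: str) -> str:
    """Stub standing in for a model call; the plan is canned so the sketch runs."""
    if "numbered plan" in prompt:
        return ("1. Retrieve the full legal text for both states\n"
                "2. Summarize the key provisions of each\n"
                "3. Compare the summaries\n"
                "4. Refine and present the findings")
    return f"[result of: {prompt}]"

def plan_and_execute(task: str) -> list[str]:
    # One inference cycle to produce a plan...
    plan = llm(f"Write a numbered plan for: {task}")
    steps = [line.split(". ", 1)[1] for line in plan.splitlines()]
    # ...then one cycle per step, feeding prior results forward as context.
    results, context = [], task
    for step in steps:
        out = llm(f"{step} (context: {context})")
        results.append(out)
        context = out
    return results
```

The program no longer encodes the four steps itself; it only encodes "plan, then execute each step with the accumulated context".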
As programmers, our task is less about the mechanics of how a provision is identified, or the intricacies of how two similar laws should be compared. Instead, the abstraction level is raised and our role is to program "general solvers" that follow logically sound procedures to arrive at solutions. Central to this process is the conductor, which ensures seamless communication and effective integration of the outputs from the program execution. The conductor might start by asking the LLM to generate a plan (one inference cycle); it then parses the output, stitches together the appropriate context and executes the plan sequentially. During execution, at every time step, the conductor can defer the decision of what the next step should be to the LLM itself: "Should the current best solution be revised?" (reflection); "Is it worth performing a web search to validate a point?" (tool usage). Note that the content of the execution is itself dynamic; the conductor only enforces a certain approach to problem solving.
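The conductor's control flow can be sketched as a bounded loop that routes its own branching decisions through the model. The canned replies in the `llm` stub are assumptions so the logic is runnable; in practice each branch would be a real model call whose answer is parsed:

```python
def llm(prompt: str) -> str:
    """Stub for a model call; replies are canned so the control flow runs."""
    if "Should the current answer be revised" in prompt:
        return "yes" if "draft" in prompt else "no"
    if "Is a web search needed" in prompt:
        return "no"
    return "final answer"

def conduct(task: str) -> str:
    answer = f"draft for {task}"
    for _ in range(5):  # the bounded computation budget of inference cycles
        # Tool usage: the model decides whether external evidence is needed.
        if llm(f"Is a web search needed to validate: {answer}?") == "yes":
            answer += " + search evidence"
        # Reflection: the model decides whether the answer needs another pass.
        if llm(f"Should the current answer be revised? {answer}") == "yes":
            answer = llm(f"Revise: {answer}")
        else:
            return answer
    return answer
```

The conductor fixes only the shape of the loop (check, revise, stop); what actually happens inside each cycle is decided at run time by the model.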
Having established this as a computational paradigm, one can get creative and expand the execution to a multi-agent environment. Say you want to build a Polymath Lawyer: a lawyer who is an expert in every domain. You can model this program as a collaboration among multiple agents, each an expert in one domain, coordinated by a conductor whose role is to break complex tasks down into smaller, more manageable subtasks.
![[Pasted image 20250426161644.png]]
Note that in these architectures each subtask or agent can be supported by an independent LM (potentially fine-tuned for the task), or they could just be "virtual" instances of the same LM, each operating under specific, tailored instructions. One can imagine similar approaches to accelerating scientific research, creative endeavors, and complex decision-making.
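The "virtual instances" variant is easy to sketch: each expert is the same model under a different system instruction, and the conductor routes subtasks between them. The expert list, the keyword routing, and the `llm` stub are all illustrative assumptions:

```python
def llm(system: str, prompt: str) -> str:
    """Stub model call; each 'agent' is the same model under a tailored
    system instruction -- a virtual instance, as described above."""
    return f"[{system}] answer to: {prompt}"

EXPERTS = {
    "tax": "You are an expert in tax law.",
    "labor": "You are an expert in labor law.",
    "ip": "You are an expert in intellectual-property law.",
}

def polymath_lawyer(question: str) -> str:
    # The conductor routes the question to the relevant expert. Routing is
    # keyword-based here for brevity; a real conductor would ask the model
    # itself to classify the question (one more inference cycle).
    domain = next((d for d in EXPERTS if d in question.lower()), "labor")
    expert_view = llm(EXPERTS[domain], question)
    # A final cycle integrates the expert output into one answer.
    return llm("You synthesize expert opinions.", expert_view)
```

Adding a new specialty is then just adding an entry to `EXPERTS`, not retraining anything.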
This agentic paradigm does imply a significant increase in serving costs per user query, with workflows perhaps running for minutes or even hours at current inference speeds. But market demand is likely to drive rapid efficiency gains. As the operational costs of these models become more affordable, it becomes natural to ask whether one might use scaffolding systems and leverage multiple LM queries not only to refine but also to enhance the accuracy and robustness of model outputs. Modern processors typically have clock speeds in the range of 1 GHz to 3.8 GHz. Consider the kinds of programs we could write if we could run a billion LLM inferences per second. The move towards compound systems in AI introduces a variety of intriguing design considerations: it means leading AI results can be achieved through clever engineering, not just by scaling up training. While agentic systems can offer clear benefits, the art of designing, optimizing, and operating them is still emerging.
For businesses aiming to stay competitive, integrating AI agents won't be optional - it will be an operational imperative. The question isn't whether to adopt AI, but how creatively it can be harnessed to transform industries.