Okay, let's consider the more accurate (I would rather say my description of the "room" was more realistic) question first. Large language models are based on neural networks with, according to multiple sources, many millions of nodes and up to tens of billions of connections. Let's say there are a mere one million nodes and ten billion connections.
If you're going to execute the LLM algorithm with a person and paper, which is theoretically quite possible, at the very least you need the weight of each interconnection written down on a list. At a hundred entries per page (hope you brought your reading glasses) and a thousand pages per book, that requires a hundred thousand books. Sounds like a library to me. By nature, the calculations for a neural network involve many parallel computations, so you'll want to keep track of the cumulative sums of the inputs to each neuron as you go; at a thousand per whiteboard, you need a thousand whiteboards. (You can reduce the number of whiteboards at the cost of erasing and rewriting them more often.)
Now, doing the calculations at ten seconds per interconnection (hopefully the books are well organized to minimize how often you have to go back to the stacks to fetch new ones, you're a prodigy at adding and multiplying high-precision real numbers in your head, and of course you never need to eat or sleep or take any breaks at all) will take you about 3,169 years. The final steps, sorting the weights of the final layer of nodes and then looking up the matching Chinese character(s), are trivial by comparison. But I'm pretty sure that's per word of output, so you'd better clear your schedule for a few hundred millennia if you're planning a conversation or an answer to some involved question. And I sure hope it's an important question being answered!
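For anyone who wants to check my arithmetic, here's the back-of-the-envelope version in Python; every number in it is one of the round assumptions above, not a real LLM specification:

```python
# Back-of-the-envelope check of the figures above; all numbers are the
# round assumptions from the text, not real LLM specs.
connections = 10_000_000_000    # assumed interconnection (weight) count
nodes = 1_000_000               # assumed node count

entries_per_page = 100
pages_per_book = 1_000
books = connections // (entries_per_page * pages_per_book)
print(f"books of weights:      {books:,}")        # 100,000

sums_per_whiteboard = 1_000
whiteboards = nodes // sums_per_whiteboard
print(f"whiteboards of sums:   {whiteboards:,}")  # 1,000

seconds_per_connection = 10
seconds_per_year = 365.25 * 24 * 3600             # Julian year
years = connections * seconds_per_connection / seconds_per_year
print(f"years per output word: {years:,.0f}")     # ~3,169
```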
Okay, so you don't absolutely need those hundred assistants to finish before the sun burns out. But I'd bet you'd be glad to have them.
Again though - you're doubling down on emphasizing the necessary scale and complexity of the process. I conceptualized all of this already from your first response - what I'm asking you to do is convince me that it's relevant. Explain why the thought experiment needs to specify that there are hundreds of people who don't know Chinese consulting thousands of instruction manuals at once, rather than just one person who doesn't know Chinese consulting one book, in order for the analogy to do what it's intended to do.
The working premise of the analogy is the idea that an individual who is perhaps theoretically, technically capable of learning to read and write Chinese but as a point of fact does not, is able to use some kind of process (simplified as "an instruction book") to mechanically receive and respond to Chinese writing without comprehending the input or the output, in a manner that an external user of the "system" who does know Chinese interprets as coherent. The main ideas are that the person doing the work does not know Chinese and that following some set of instructions allows them to appear to an outside observer as if they do. A thousand people following a thousand sets of instructions doesn't change the premise or those ideas.
Now, you might be thinking that such a computation-intensive neural network is the wrong way to design the Chinese room. It's certainly not what Searle had in mind. At the time, it was already known that just looking up a list of possible inputs in a book and following some simple if-then and/or random-choice instructions to choose a response is nowhere near adequate to create the semblance of intelligent discourse. (Computer game NPCs work that way.) And that making the book bigger wouldn't work either, even if it filled the room.
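For concreteness, that NPC-style approach amounts to something like the following toy sketch (the keywords and canned replies are invented for illustration):

```python
import random

# Toy NPC-style responder: look the input up against a fixed list of
# keyword patterns and emit a canned (or randomly chosen) reply.
# The keywords and replies here are invented for illustration.
RULES = [
    ("hello", ["Greetings, traveler.", "Well met."]),
    ("quest", ["Seek the old mill at dawn.", "I have no work for you."]),
    ("weather", ["A storm is coming, mark my words."]),
]

def respond(text: str) -> str:
    for keyword, replies in RULES:
        if keyword in text.lower():
            return random.choice(replies)
    return "I know nothing of that."   # fallback when nothing matches

print(respond("Hello there!"))
print(respond("What do you make of this weather?"))
```

Making the rule list bigger gets you more canned replies, not more intelligence - which is the point.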
What was considered the most likely method for passing the Turing test was to compile a database of facts about the world and then use algorithms of logical deduction to work out facts or hypotheses about the intent of the input question and what output would correctly resolve it. That sounds more like something that might fit in a room.
But, it isn't. To be useful for general AI (or the illusion thereof), the database of facts has to include "common-sense knowledge," which requires millions of entries. At a hundred cryptic facts per page, it's still thousands of books. (Why cryptic? Because of course, if the knowledge base were in English, you'd learn Chinese in the process.) Which is fine, but it turns out there's no way to tell in advance which facts will be relevant. That kind of logical deduction, the kind used for proving mathematical theorems for instance, works by trial and error, exploring an ever-increasing space of possibilities, and practically every step along the way requires searching through the database and trying each entry. The amount of computation becomes intractable even for machines that can do the LLM neural network calculations routinely. It might seem there should be some clever way around this, but the researchers who worked on it for decades consider the problem so intractable that some have argued it proves that AI capable of doing, well, what it turns out LLMs can do is completely impossible. The human brain just "somehow knows" which information is relevant to the problem at hand in a way that a computer can't imitate, according to that argument.
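To make the blowup concrete, here's the skeleton of that kind of deduction as a toy Python sketch (my own illustration, not any particular system). The point to notice is that every inference pass rescans the entire fact base against every rule, and real rules combine multiple facts, multiplying the search:

```python
# Toy forward-chaining deducer: facts are (subject, relation, object)
# triples, and rules derive new facts from old ones. Note that EVERY
# pass rescans the whole fact base against every rule; with millions
# of common-sense facts, and rules that combine several facts at once,
# the number of candidate combinations explodes.
facts = {("hot", "opposite_of", "cold")}

# One invented rule for illustration: opposition is symmetric.
rules = [lambda f: (f[2], f[1], f[0]) if f[1] == "opposite_of" else None]

def forward_chain(facts, rules):
    derived = set(facts)
    while True:
        new = set()
        for fact in derived:          # search the entire database...
            for rule in rules:        # ...trying each entry against each rule
                conclusion = rule(fact)
                if conclusion and conclusion not in derived:
                    new.add(conclusion)
        if not new:                   # fixed point: nothing new to derive
            return derived
        derived |= new

print(forward_chain(facts, rules))
# {('hot', 'opposite_of', 'cold'), ('cold', 'opposite_of', 'hot')}
```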
The approach does work where only a smaller database of specialized knowledge is needed, and it can also work when dedicated hardware like Watson's can muster enough cycles to manage it. But chatting with Watson still quickly reveals more severe limitations than we see with LLMs. And if you were to encode Watson in a room full of books, the steps needed to run the algorithm by hand would far exceed those of the neural-net version. Then you really would need sun-burns-out amounts of time in the Chinese room.
Why do I harp on that when you've already acknowledged the possibility that maybe you really do need the huge library of books and the hundred assistants, and claimed that doesn't change anything important? Because it's not the scale of the mechanism that matters (that just determines how long it takes, whether it's one person or a hundred assistants, or even ten billion people each doing one calculation nearly simultaneously so the answer would come out lickety-split), but the amount of information the mechanism takes into account. It could be hundreds of thousands of books of neural net weights, or tens of thousands of books of facts, or, heck, in principle one long fixed sequence of printed if-then choose-your-own-adventure instructions, in which case the number of books would more than fill the known universe but you'd only have to consult a tiny fraction of them to produce any single response. In all those cases the room doesn't fit within the parameters of Searle's description of it.
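If the "fill the known universe" claim sounds hyperbolic, the estimate is easy to sketch (the character count and prompt length below are round assumptions of mine, and this only counts single prompts, not whole conversations):

```python
# Illustrative sizing of the pure lookup-table room. The figures are
# assumptions for the sake of the estimate: ~3,000 common Chinese
# characters and prompts up to 40 characters long.
alphabet = 3_000
prompt_length = 40
distinct_prompts = alphabet ** prompt_length

magnitude = len(str(distinct_prompts)) - 1
print(f"distinct prompts: about 10^{magnitude}")   # about 10^139
# For comparison, the observable universe holds roughly 10^80 atoms,
# so one printed entry per prompt overfills it by ~59 orders of magnitude.
```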
And of course, all those conceptual Chinese Rooms (and LLMs) are toys, in terms of data content and parallel computing power, compared to a mature human brain. 80+ billion neurons, 100 trillion connections...
There's nothing strange about a whole system being able to do things that its component parts cannot do individually. What individual part of an airplane can fly? What individual organism can evolve? Which soldier was the one who surrounded the enemy formation?
Again, within the scope of the thought experiment, you're asserting as a just-so proposition that (essentially) the size of the instruction book (as a function of how much information it contains) matters. Why? The exercise never goes into detail about what the instructions are or how many of them there are.
Obviously such a contraption, if it were built in the real world, would require more than a single volume to be able to answer more than a small number of prompts effectively. The process is distilled into a single book for the purposes of the thought experiment because the only important concept is that it is a set of instructions and nothing else. "But the cat would need air to breathe and food to sustain itself" is objectively true but really doesn't belong in a discussion about Schrödinger's quantum superposition analogy.
Since this is the science subforum rather than the philosophy one, what test for "understanding" other than the ability to respond coherently to probing questions about the concept(s) in question are you proposing?
Consider the question I suggested for ChatGPT (which Pulvinar kindly tested out for real) about opposites. Or simpler questions like "which two of these words are opposites: cold, fast, dark, hot, night." Or "Write a sentence that uses opposites." Questions/tasks like these are how we test students' understanding of concepts, and we interpret correct answers as demonstrating understanding. How else?
We test students' understanding of concepts in that way to prove that they have learned a thing that we have positively taught them. We teach students facts, such as that hot is the opposite of cold, and encourage them both to remember those facts and to extrapolate additional facts from them using deductive reasoning.
ChatGPT does not learn or retain facts, and it does not employ reasoning when composing its answers. It builds responses through what are effectively complex weighted dice rolls, with the weights based on statistical correlations between words. We know this for an absolute fact because that's what it was programmed to do, and it does not have the ability to change its own programming so that it can do something else instead. So any kind of test that presumes the subject has or should have retained facts, or expects them to employ deductive reasoning, when used on ChatGPT, just doesn't demonstrate what the test is intended to demonstrate, even if the returned answers are objectively correct.
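The "weighted dice roll" isn't a figure of speech, either; stripped of the network that computes the weights, the final step of producing each word looks roughly like this (a minimal sketch with invented candidates and weights):

```python
import random

# Minimal next-word selection: the model assigns each candidate word a
# weight, and the response is assembled by repeated weighted dice rolls.
# These candidates and weights are invented; in a real LLM they come
# out of the network's final layer, conditioned on the text so far.
candidates = ["hot", "cold", "fast", "night"]
weights = [0.70, 0.15, 0.10, 0.05]

next_word = random.choices(candidates, weights=weights, k=1)[0]
print(next_word)  # usually "hot", but any candidate can turn up
```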
Imagine giving students a test on word definitions that is intended to assess their memory or comprehension, but also allowing them to look up the definitions of the ones they don't know in a dictionary or textbook or some such. In the end you get a page full of "correct" answers, but you have no way of discerning, on a given student's paper, which definitions are repeated from memory and which were just copied from the textbook. What can you conclude about the student's retention or comprehension from that test result? Nothing. All of the answers are right, and yet the test has not achieved its purpose. It wasn't designed to see merely if students could produce the correct answer somehow; it was designed to see specifically whether they remembered or could deduce the answer.