Alibaba’s new model, QwQ-32B-Preview, has received praise for its reasoning capabilities. Unlike models focused solely on reproducing language, QwQ, like OpenAI’s o1, emphasizes reasoning in its training. Curious about its capabilities, especially since it was quickly available on Ollama and runnable on a decent laptop, I decided to test it. I also compared it with o1 and Gemma-2-27B. OpenAI’s o1 was the first model to emphasize reasoning in its training; Gemma makes no such claim, as far as I know.
Testing a reasoning model isn’t as straightforward as checking facts. I thought about asking it to prove Fermat’s Last Theorem, but quickly realized that the only answer I’d be able to evaluate was “I can’t prove this,” which was in fact the response from all three models. I don’t have the expertise to follow the actual proof, so if QwQ had attempted one, the only way I’d know it was wrong is that the real proof runs to over a hundred pages of dense mathematics. That wouldn’t have been a valid test.
What about Fermat’s Little Theorem? It’s much simpler, but it’s also so well documented that I couldn’t be sure whether QwQ had reasoned its way through the proof or just recalled it; the theorem is almost certainly covered in the online math textbooks that were part of its training data.
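For reference, Fermat’s Little Theorem says that for any prime \( p \) and any integer \( a \) not divisible by \( p \),
\[ a^{p-1} \equiv 1 \pmod{p}. \]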
So I chose a less common example: proving that if a number whose digits are all 1s (11, 111, 1111, and so on) is prime, then the number of digits must itself be prime. I’ve proved this myself, and it isn’t well known enough to show up in most textbooks. Here’s QwQ’s attempt at the proof:
>>> Can you prove that if all the digits of a prime number are 1, then the number of digits must be prime?
Let’s approach this step by step. The statement is: If all the digits of a prime number are 1, then the number of digits must be prime. Let’s break it down.
We have a prime number with all digits being 1, like 11, 111, 1111, etc., but only those that are prime. We need to prove that the number of digits in such a prime number must itself be prime.
Let’s consider the contrapositive of this statement. The contrapositive of “If \( R_n \) is prime, then \( n \) is prime” is “If \( n \) is not prime, then \( R_n \) is not prime.” If we can prove the contrapositive, that will establish the original statement.
…and eventually leading to…
Final Answer: \[ \boxed{\text{If } R_n \text{ is prime, then } n \text{ is prime.}} \]
The proof was correct. o1 and Gemma-2-27B produced similar proofs, with less verbosity; Gemma omitted some steps that a human might consider obvious.
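For readers who want the step the excerpt elides (this is my summary of the standard argument, not necessarily QwQ’s exact wording): write \( R_n \) for the repunit with \( n \) ones. If \( n \) is composite, say \( n = ab \) with \( a, b > 1 \), then
\[ R_n = \frac{10^{ab} - 1}{9} = R_a \left( 1 + 10^{a} + 10^{2a} + \cdots + 10^{(b-1)a} \right), \]
and both factors are greater than 1, so \( R_n \) is composite. The contrapositive is exactly the statement to be proved.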
Next, I asked if the proof holds for bases other than base 10:
>>> Does this proof hold for bases other than base 10?
That’s interesting! Let’s see if this proof can be applied to bases other than base 10.
In base \( b \), a repunit of length \( n \) is:
\[ R_n^{(b)} = \frac{b^n - 1}{b - 1} \]
…and so on…
QwQ successfully proved this as well. This second result is even less likely to appear in textbooks; it’s less a separate theorem than an observation that nothing in the original proof depended on the base being 10.
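Neither proof needs a computer, but a quick empirical check makes the factorization at the heart of the argument concrete. Here’s a minimal sketch (my own, not part of any model’s output); the bases and the range of \( n \) are arbitrary choices:

```python
def repunit(n: int, base: int = 10) -> int:
    """The number written as n ones in the given base: (base**n - 1) / (base - 1)."""
    return (base**n - 1) // (base - 1)

# Empirical sanity check, not a proof: whenever n is composite, the base-b repunit
# with n digits is divisible by the smaller repunit whose length is a proper divisor
# of n, so it cannot be prime.
for base in (2, 3, 10, 16):
    for n in range(4, 60):
        proper_divisors = [d for d in range(2, n) if n % d == 0]
        if not proper_divisors:      # n is prime; the theorem makes no claim here
            continue
        d = proper_divisors[0]       # smallest nontrivial divisor of n
        assert repunit(n, base) % repunit(d, base) == 0, (base, n, d)

print("Composite n always gave a composite repunit in every base tested.")
```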
When I asked GPT to prove the same theorem, I received a similarly correct proof, more formally stated. That wasn’t surprising, since GPT is also trained for reasoning. To my surprise, Gemma-2-27B also delivered a correct proof. Gemma, although trained on mathematical texts, isn’t explicitly trained for “reasoning,” but it succeeded nonetheless, omitting steps that are obvious to humans but necessary for completeness.
Have we proven that training models for reasoning is effective? We can’t claim that based on one trial, or even on many trials. But we have learned something important. Consider the model sizes: OpenAI hasn’t disclosed o1’s size, but it’s rumored to exceed a trillion parameters. QwQ has 32 billion parameters and Gemma-2-27B has 27 billion, making both significantly smaller than GPT. Moreover, GPT runs on one of the world’s largest supercomputers. We don’t know its exact size, but OpenAI’s infrastructure is massive, with huge numbers of high-end GPUs. QwQ and Gemma ran smoothly on my MacBook Pro, doing little more than spinning the fan and draining the battery, whereas GPT is serving thousands of users simultaneously, so the comparison isn’t entirely fair. Still, it’s important to realize that GPT isn’t the only option and that models running locally can match it on complex tasks. Most people who’ve tested local models reach a similar conclusion: developers don’t need to rely solely on OpenAI. Smaller models can be sufficient, and they save money, keep data private, and leave you in control.
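If you want to try this yourself, the barrier is low. Here’s a minimal sketch of sending the same question to a local QwQ model from Python with the ollama package (one way to do it, not necessarily how I ran my tests; the qwq model tag is a stand-in, so check the Ollama library for the current name):

```python
# Minimal sketch: ask a locally running QwQ model the repunit question via the
# ollama Python package. Assumes Ollama is installed and the model has been
# pulled, e.g. with `ollama pull qwq`; the tag for the Preview release may differ.
import ollama

response = ollama.chat(
    model="qwq",
    messages=[{
        "role": "user",
        "content": "Can you prove that if all the digits of a prime number are 1, "
                   "then the number of digits must be prime?",
    }],
)
print(response["message"]["content"])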
What else can we learn? I had wondered how a language model could be trained for logic; my intuition said that would be a harder problem than training it for language. I was wrong. I don’t know how these models were trained, but I now think that producing logic is simpler than producing language. Why? QwQ’s verbose output gives a clue: “Let’s consider the contrapositive of this statement.” A contrapositive is a logical pattern: if A implies B, then not-B implies not-A. There are other logical patterns: syllogisms, proof by contradiction, induction, and so on.
We could easily expand the list. There are surely better ways to describe these patterns, but what matters is recognizing that they are patterns. Composing logical patterns into statements or proofs isn’t fundamentally different from composing words into sentences. Is manipulating patterns the essence of logic? It’s logic if you do it correctly, illogic if you don’t: the logic isn’t in the patterns themselves but in knowing how to assemble them to solve a problem. Training must therefore focus on how patterns are combined, using large numbers of worked examples of logical reasoning. Logical patterns can also mislead you if you aren’t careful; it’s easy to build a false syllogism on incorrect premises. So training on reasoning won’t solve the problem of hallucination, but it may teach models logic better than training on words alone. That’s the gamble that OpenAI, Alibaba, and possibly Google are taking, and they’re succeeding.
Can we advance further? Are there other patterns for training language models? Yes. Generative AI excels in code generation but hasn’t significantly impacted software design. Could training models on design patterns be a breakthrough? I don’t know, but I’d like to see it attempted. A design-specialized model would be valuable.
Could we improve generative music by training models on the patterns of music theory in addition to audio? Apps like Suno are a lot of fun, but they mostly rehash musical clichés. Would knowledge of music theory, an understanding of the patterns behind the music rather than just the sound, make Suno more capable? Would language models write better poetry if they were trained on the patterns of poetic language (rhetorical devices, for instance) rather than just on words? One of my first experiments with generative AI was asking GPT-3 to write a Petrarchan sonnet, which has a different structure from the more common Shakespearean sonnet. GPT-3 and its contemporaries failed: they could define a Petrarchan sonnet, but they couldn’t write one, because they were trained on words, not patterns.
Is this a way forward for AI? I don’t know, but I’d like to see researchers explore it. For now, it’s enough to realize that while GPT models are powerful, small open models can perform just as well on laptops or phones.
Footnotes
I tested o1-preview, which has since been promoted to o1. I didn’t retest with the final o1, which has presumably been further trained and optimized.

I later added Claude’s small Haiku model (8B parameters) to my tests; it also produced a correct proof, though it was terse and took more thought to verify.

Design patterns are usually associated with object-oriented design, but the concept is broader: they identify solutions to common problems and give us a way to talk about them. That definition applies across disciplines, including functional programming and architecture.