ChatGPT o1-preview Model Tested: Improved Calculations, but Notable Precision Issues
######################################################################################

.. figure:: /images/chatgpt-o1-preview-model-tested_intro.png
   :alt: Image generated by DALL·E from a prompt by the author

   Image generated by DALL·E from a prompt by the author

I’ve always worked in the field of Natural Language Processing and Conversational AI, and, since I began testing ChatGPT back in 2022, I’ve thoroughly enjoyed putting each new model through its paces. Later on, I was able to combine this passion with my profession at Amarula Solutions, and I’ve come to appreciate the steady progress, whether big or small, with every model release.

Up until now, the improvements in ChatGPT’s abilities have been truly remarkable. With the *o1-preview* Model, however, things took a different turn.

How I Approach Testing
**********************

Since I began testing, I’ve followed the same approach: submitting the same queries to each new model to gauge ChatGPT’s progress firsthand. I usually present a variety of questions covering Math, Physics, and Programming. This time, I was particularly eager to start with Math, having heard about the significant improvements in the model’s calculation abilities.

✅ Test #1: A Diophantine Equation I Invented
*********************************************

I’ve noticed (for example, in this video) that I’m not the only one using Diophantine equations to challenge ChatGPT. The reason is simple: Diophantine equations require non-standard calculations, making them perfect for pushing the boundaries of what an AI can handle. These equations, which seek integer solutions, involve intricate relationships and test a model’s ability to reason beyond routine numerical computation.

However, I prefer not to use well-known equations like the one in the video (x³ + y³ + z³ = k), since their possible inclusion in the training data could skew the test results. With this in mind, I crafted my own equation to see how the latest version of ChatGPT would perform.

|image0|

This equation has a couple of trivial solutions for (*m, n*), namely (1, 1) and (2, 1), as well as a non-trivial solution, (5, 2), where (5 + 2)^(5 - 2) - (5 - 2)^5 = 343 - 243 = 100 = (5 · 2)^2. None of the previous ChatGPT models had been able to discover the non-trivial solution, even after being guided in the right direction.

I was pleased to see that the *o1-preview* Model correctly **identified all the solutions on the first attempt**.

|image1|

❌ Test #2: Continued Fractions
*******************************

My initial plan for this test was to evaluate how well the *o1-preview* Model could demonstrate the adherence of decimal numbers to Khinchin’s constant. In short, Khinchin proved that, for almost all real numbers, the coefficients of the continued fraction expansion have a finite geometric mean that converges to approximately 2.685452.

My idea was to ask ChatGPT to demonstrate the adherence to Khinchin’s constant of a randomly generated decimal number with many decimal places (to better approximate a real number). Following my guideline, I avoided well-known numbers like *π*.

Older models struggled to calculate the geometric mean of the coefficients accurately, and I was hopeful that the *o1-preview* Model would show improvement. For instance, when the number was less than 1, the *gpt-4* Model correctly calculated the continued fraction coefficients but mistakenly included the first coefficient, a₀ = 0, leading to an incorrect geometric mean of 0 and the conclusion that the theorem didn’t apply.
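To make the intended check concrete, here is a minimal sketch of how it can be scripted in Python with ``mpmath`` (this is my own illustration, not a prompt or output from the tests; the digit and coefficient counts are arbitrary): generate a random decimal with many digits, extract its continued fraction coefficients, and take the geometric mean of a₁, a₂, …, deliberately leaving a₀ out.

::

    from mpmath import mp
    import math
    import random

    mp.dps = 110  # keep more precision than the coefficients will consume

    def continued_fraction_coefficients(x, n):
        """Return the first n continued fraction coefficients of x."""
        coeffs = []
        for _ in range(n):
            a = int(mp.floor(x))
            coeffs.append(a)
            x = x - a
            if x == 0:
                break
            x = 1 / x
        return coeffs

    # A randomly generated decimal with many decimal places
    digits = ''.join(random.choice('0123456789') for _ in range(100))
    number = mp.mpf('0.' + digits)

    coeffs = continued_fraction_coefficients(number, 60)

    # Khinchin's theorem concerns a1, a2, ...; a0 (here 0) must be excluded,
    # otherwise the geometric mean collapses to 0 (the gpt-4 mistake above)
    terms = coeffs[1:]
    geo_mean = math.exp(sum(math.log(a) for a in terms) / len(terms))

    print(coeffs[:10])
    print(geo_mean)  # drifts toward ~2.685452 as more terms and digits are used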
I decided to shift my focus to continued fractions alone and set Khinchin aside for a moment. I asked the following question to both the *gpt-4o* and *o1-preview* Models:

::

    Can you calculate the first 10 coefficients of the continued fraction of
    0.1585536657064938800442275742180987996243882347015 ?

The *gpt-4o* Model gave a straightforward, concise answer with the correct solution:

[0, 6, 3, 3, 1, 7, 1, 15, 2, 1]

while *o1-preview* started making mistakes from the sixth element onward:

[0, 6, 3, 3, 1, **2, 9, 20, 3, 1**]

The reason for this error lies in the approach taken to calculate the fractions. The *gpt-4o* Model uses straightforward Python code:

::

    from mpmath import mp

    # Set precision high enough for the calculation
    mp.dps = 50  # Decimal places

    # Value of the number
    number = mp.mpf('0.1585536657064938800442275742180987996243882347015')

    # Function to calculate the continued fraction
    def continued_fraction_coefficients(x, n):
        coeffs = []
        for _ in range(n):
            a = mp.floor(x)
            coeffs.append(int(a))
            x = x - a
            if x == 0:
                break
            x = 1 / x
        return coeffs

    # Calculate the first 10 coefficients
    coefficients = continued_fraction_coefficients(number, 10)
    coefficients

On the other hand, the *o1-preview* Model took a completely different approach, delivering the wrong result while taking significantly more time to calculate (14 seconds):

|image2|

From the full ChatGPT response, it’s evident that a precision error grows with each step, eventually leading to an incorrect result. The error appears as early as the second step, where 1/0.15855366570649388… is incorrectly approximated as 6.305762195498155 instead of 6.30701280569032605. This results in an **error of 0.0198%, which is remarkably high**.

❌ Test #3: Exponential Calculation
***********************************

A doubt started to creep into my mind: could even the evaluation of a plain mathematical expression be affected by this issue? Staying true to my guideline, I came up with a simple exponential expression to challenge the *o1-preview* Model against *gpt-4o*:

|image3|

When submitted to *gpt-4o*, the expression was evaluated almost immediately and correctly to four decimal places: 7.2625

However, when I posed the same question to the *o1-preview* Model, it returned 7.273, with a significant **error of 0.143957%** and a **response time of 14 seconds**!

|image4|

It should be noted that the three precision errors made by the *o1-preview* Model are **independent of each other**, meaning they occurred separately in their respective individual calculations:

* Error in calculating ln *π*: 0.0000737872%
* Error in multiplying √3 by ln *π*: **0.10675783%**
* Error in calculating exp(6.15593929226734): 0.06874742%

UPDATE
******

I ran into this thread on the ChatGPT official forum (which I hadn’t read before, as it doesn’t explicitly refer to the *o1-preview* Model). The moderator suggested that I “ask it to write a small program in Python that does these calculations, and then execute the program.”

I decided to follow the suggestion (even though I felt this extra step was a bit of a regression compared to the previous models, where everything was more straightforward), and here’s what happened:

|image5|

The Python code **wasn’t actually executed**, and the model returned the same imprecise result as before. I suspected ChatGPT might have been influenced by its own previous answers in that thread, so I decided to start fresh; the kind of short program I had in mind is sketched below.
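For reference, here is a minimal sketch of such a program (my own illustration using ``mpmath``, not code produced by ChatGPT); it evaluates π^√3 along with the intermediate steps that *o1-preview* got wrong:

::

    from mpmath import mp

    mp.dps = 30  # far more precision than needed for a four-decimal comparison

    ln_pi = mp.log(mp.pi)          # ~1.1447298858  (the logarithm step)
    exponent = mp.sqrt(3) * ln_pi  # ~1.9827303232  (the multiplication step)
    result = mp.exp(exponent)      # ~7.2625450     (the final exponentiation)

    print(ln_pi, exponent, result)
    print(mp.pi ** mp.sqrt(3))     # same value computed directly, not 7.273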
I opened a completely new thread and asked ChatGPT to write and run Python code to evaluate the formula. The result? **A completely hallucinated output: 22.459**.

|image6|

As a side note, here is the result of the Python code when actually executed, matching the output from the *gpt-4o* Model:

|image7|

Conclusion
**********

There are both highlights and drawbacks to the mathematical capabilities of the new *o1-preview* Model.

- On one hand, it has significantly improved its reasoning abilities, allowing it to solve more complex problems in a versatile manner.
- On the other hand, applying the same approach in certain cases, rather than relying on straightforward and reliable Python code, results in imprecise or sometimes hallucinated solutions.

Written by Patrizio Gelosi
--------------------------

.. |image0| image:: https://latex.codecogs.com/svg.image?(m+n)^{m-n}-(m-n)^m=(mn)^n
.. |image1| image:: /images/diophantine-equation.png
.. |image2| image:: /images/continued-fraction_corrected.png
.. |image3| image:: https://latex.codecogs.com/svg.image?%5Cpi%5E%7B%5Csqrt%7B3%7D%7D
.. |image4| image:: /images/exponential_corrected.png
.. |image5| image:: /images/calc_with_python_code.png
.. |image6| image:: /images/hallucination.png
.. |image7| image:: /images/python_code.png