OpenAI's o1-preview model aced my coding tests, and showed its work (in surprising detail)

Monday, 16 September 2024



Usually, when a software company pushes out a major new release in May, they don't try to top it with another major new release four months later. But there's nothing usual about the pace of innovation in the AI business.

Although OpenAI dropped its new omni-powerful GPT-4o model in mid-May, the company has been busy. As far back as last November, Reuters published a rumor that OpenAI was working on a next-generation language model, then known as Q*. Reuters doubled down on that report in July, stating that Q* was being worked on under the code name Strawberry.


Strawberry, as it turns out, is actually a model called o1-preview, which is available now as an option to ChatGPT Plus subscribers. You can choose the model from the selection dropdown:

Screenshot by David Gewirtz/ZDNET

As you might imagine, if there's a new ChatGPT model available, I'm going to put it through its paces. And that's what I'm doing here.


The new Strawberry model focuses on reasoning, breaking down prompts and problems into steps. OpenAI showcases this approach through a reasoning summary that can be displayed before each answer.

When o1-preview is asked a question, it does some thinking and then displays how long it took to do that thinking. If you toggle the dropdown, you'll see some reasoning. Here's an example from one of my coding tests:

Screenshot by David Gewirtz/ZDNET

It's good that the AI knew enough to add error handling, but I find it interesting that o1-preview categorizes that step under "Regulatory compliance".

I also discovered the o1-preview model provides more exposition after the code. In my first test, which created a WordPress plugin, the model provided explanations of the header, class structure, admin menu, admin page, logic, security measures, compatibility, installation instructions, operating instructions, and even test data. That's a lot more information than was provided by previous models.

But really, the proof is in the pudding. Let's put this new model through our standard tests and see how well it works.

1. Writing a WordPress plugin

This straightforward coding test requires knowledge of the PHP programming language and the WordPress framework. The challenge asks the AI to write both interface code and functional logic, with the twist being that instead of removing duplicate entries, it has to separate the duplicate entries, so they're not next to each other.
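
To make the "separate the duplicates" requirement concrete, here is a minimal sketch in Python rather than the PHP the test actually calls for. It is not the plugin o1-preview produced, whose code isn't reproduced here. The function names, the retry limit, and every sample name other than Abigail Williams are assumptions made up for illustration. The idea is simply to reshuffle until no two identical lines sit next to each other.

```python
import random


def has_adjacent_duplicates(lines):
    """Return True if any two neighboring lines are identical."""
    return any(a == b for a, b in zip(lines, lines[1:]))


def randomize_lines(lines, max_attempts=1000, rng=None):
    """Shuffle lines, retrying until duplicates are not side by side.

    Hypothetical helper for illustration only. If no valid order turns up
    within max_attempts (for example, when one line dominates the input),
    the last shuffle is returned as-is.
    """
    rng = rng or random.Random()
    shuffled = list(lines)  # copy so the caller's list is left untouched
    for _ in range(max_attempts):
        rng.shuffle(shuffled)
        if not has_adjacent_duplicates(shuffled):
            break
    return shuffled


# Made-up sample data; only the duplicated name comes from the article.
names = [
    "Abigail Williams",
    "John Proctor",
    "Abigail Williams",
    "Mary Warren",
    "Giles Corey",
]
result = randomize_lines(names, rng=random.Random(42))
print("\n".join(result))
```

Retrying the shuffle keeps every valid ordering equally likely, at the cost of wasted attempts when many lines repeat. A real plugin might instead place the most frequent entries first and fill in around them.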


The o1-preview model excelled. It presented the UI first as just the entry field:

Screenshot by David Gewirtz/ZDNET

Once the data was entered and Randomize Lines was clicked, the plugin displayed an output field with properly randomized data. You can see that Abigail Williams is duplicated and, in compliance with the test instructions, the two entries are not listed side by side:

Screenshot by David Gewirtz/ZDNET

In my tests of other LLMs, only four of the 10 models passed this test. The o1-preview model completed this test perfectly.

2. Rewriting a string function

Our second test asks the AI to fix a regular expression after a user reported a bug in it. The original code was designed to test whether an entered number was valid for dollars and cents. Unfortunately, the code only allowed integers (so 5 was allowed, but not 5.25).
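
For a sense of what this kind of fix looks like, here is a small Python sketch. The original code and o1-preview's rewrite aren't shown in this article, so both patterns below are assumptions. One accepts only whole numbers, mirroring the reported bug. The other also accepts an optional decimal point followed by one or two digits.

```python
import re

# Assumed stand-in for the buggy behavior: whole numbers only, so "5.25" fails.
INTEGER_ONLY = re.compile(r"\d+")

# Assumed fix: digits, optionally followed by a decimal point and 1-2 digits.
DOLLARS_AND_CENTS = re.compile(r"\d+(\.\d{1,2})?")


def is_valid_amount(text):
    """Return True if text looks like a dollars-and-cents amount."""
    return DOLLARS_AND_CENTS.fullmatch(text.strip()) is not None


for sample in ["5", "5.25", "5.255", "abc"]:
    print(sample, is_valid_amount(sample))
```

Whether inputs such as ".50", "5.", or "$5.25" should pass depends on the real requirements, which aren't spelled out here. This sketch rejects all three.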


The o1-preview LLM rewrote the code successfully. The model joined four of the LLMs I tested previously in the winners' circle.

3. Finding an annoying bug

This test was created from a real-world bug I had difficulty resolving. Identifying the root cause requires knowledge of the programming language (in this case PHP) and the nuances of the WordPress API.

The error messages provided were not technically accurate. They referenced the beginning and the end of the calling sequence I was running, but the bug was actually in the middle part of the code.


I wasn't alone in struggling to solve the problem. Three of the other LLMs I tested couldn't identify the root cause of the problem and recommended the more obvious (but wrong) solution of changing the beginning and ending of the calling sequence.

The o1-preview model provided the correct solution. In its explanation, the model also pointed to the WordPress API documentation for the functions I used incorrectly, providing an added resource to learn why it had made its recommendation. Very helpful.

4. Writing a script

This challenge requires the AI to integrate knowledge of three separate coding spheres: the AppleScript language, the Chrome DOM (how a web page is structured internally), and Keyboard Maestro (a specialty programming tool from a single programmer).

Answering this question requires an understanding of all three technologies, as well as how they have to work together.

Once again, o1-preview succeeded, joining only three of the other 10 LLMs that have solved this problem.

A very chatty chatbot

The new reasoning approach for o1-preview certainly doesn't diminish ChatGPT's ability to ace our programming tests. The output from my initial WordPress plugin test, in particular, seemed to function as a more sophisticated piece of software than previous versions.


It's great that ChatGPT provides reasoning steps at the beginning of its work and some explanatory data at the end. However, the explanations can be chatty. I asked o1-preview to write "Hello world" in C#, the canonical test line in programming. This is how GPT-4o responded:

Screenshot by David Gewirtz/ZDNET

And this is how o1-preview responded to the same test:

Screenshot by David Gewirtz/ZDNET

I mean, wow, right? That's a lot of chat from ChatGPT. You can also flip the reasoning dropdown and get even more information:

Screenshot by David Gewirtz/ZDNET

All of this information is great, but it's a lot of text to filter through. I prefer a concise explanation, with the additional information tucked into dropdowns separate from the main answer.

Yet ChatGPT's o1-preview model performed excellently. I look forward to seeing how well it works when integrated more fully with the GPT-4o features, such as file analysis and web access.

Have you tried coding with o1-preview? What were your experiences? Let us know in the comments below.

You can follow my day-to-day project updates on social media. Be sure to subscribe to my weekly update newsletter, and follow me on Twitter/X at @DavidGewirtz, on Facebook at Facebook.com/DavidGewirtz, on Instagram at Instagram.com/DavidGewirtz, and on YouTube at YouTube.com/DavidGewirtzTV.
