One of my favorite tests for AI large language models is asking them to identify prime powers. I consistently don’t get the right answer on the first try.

Here is my latest request for OpenAI’s ChatGPT 4:

what are the first 10 prime powers. a prime power is a prime number raised to an exponent of 2 or more. use an “is prime power” function

I added the suggestion to use an “is prime power” function because, in the past, queries without that suggestion generated obviously incorrect results. This time the error was more subtle. I didn’t notice it at first.
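For reference, here is a minimal sketch of what an “is prime power” function could look like in Python, using the definition from my request (a prime raised to an exponent of 2 or more). This is not the code ChatGPT produced; the names `is_prime_power` and `first_prime_powers` are my own illustrative choices, and trial division is assumed only because it is simple.

```python
def is_prime_power(n: int) -> bool:
    """Return True if n == p**k for some prime p and exponent k >= 2."""
    if n < 4:
        return False
    # Find the smallest divisor of n greater than 1; it is always prime.
    p = 2
    while p * p <= n:
        if n % p == 0:
            break
        p += 1
    else:
        # No divisor up to sqrt(n): n is itself prime (exponent 1).
        return False
    # Divide out every factor of p and count the exponent.
    k = 0
    while n % p == 0:
        n //= p
        k += 1
    # n is a prime power only if nothing but factors of p remained.
    return n == 1 and k >= 2


def first_prime_powers(count: int) -> list[int]:
    """Collect the first `count` prime powers by testing 2, 3, 4, ..."""
    found = []
    n = 2
    while len(found) < count:
        if is_prime_power(n):
            found.append(n)
        n += 1
    return found


print(first_prime_powers(10))
# [4, 8, 9, 16, 25, 27, 32, 49, 64, 81]
```

Because the smallest divisor is found first and then divided out completely, the base is prime by construction, which avoids the need for a separate primality check on the base.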

After that, I asked ChatGPT 4 to show me the factorization of the first 25 prime powers past 78120. (A few days ago, I had requested a list of the first 100 prime powers. 78125 is the 100th. I wanted to extend the list.)

I was surprised by this comment at the end of its answer:

It appears there was a mistake in interpreting some numbers as prime powers; specifically, those represented as \(x^2\) where \(x\) is not prime. This discrepancy arises from the function initially designed to identify prime powers without checking if the base is prime for squared values.

It had the insight to see a mistake without my pointing it out. When I asked it to recreate the list, it updated the offending function and gave correct results.
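To make the kind of error it described concrete, here is a hypothetical reconstruction in Python. I am not reproducing ChatGPT’s actual code; `perfect_powers`, `is_prime_power_buggy`, and `is_prime_power_fixed` are names I made up for this sketch, and the buggy version simply assumes the flaw was “accept any square without checking that the base is prime.”

```python
def is_prime(n: int) -> bool:
    """Trial-division primality test."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def perfect_powers(n: int):
    """Yield every (base, exponent) with base**exponent == n and exponent >= 2."""
    r = 2
    while r * r <= n:
        value, k = r * r, 2
        while value < n:
            value *= r
            k += 1
        if value == n:
            yield r, k
        r += 1


def is_prime_power_buggy(n: int) -> bool:
    # Assumed bug: squares are accepted without checking that the base is prime.
    return any(k == 2 or is_prime(r) for r, k in perfect_powers(n))


def is_prime_power_fixed(n: int) -> bool:
    # Fix: the base must be prime for every exponent, including 2.
    return any(is_prime(r) for r, _ in perfect_powers(n))


false_positives = [
    n for n in range(2, 200)
    if is_prime_power_buggy(n) and not is_prime_power_fixed(n)
]
print(false_positives)
# [36, 100, 144, 196]
```

Numbers such as 16 and 64 are squares of non-primes but still pass the fixed check, because they are also \(2^4\) and \(2^6\). That overlap is one reason this sort of error is easy to miss in a short list.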

This remark is an example of ChatGPT 4’s ability to understand what it is creating. It’s a higher-order thinking process to create the code and then evaluate its results autonomously.

I’ve never trusted that current AI models would generate correct code without some coaching. I am more impressed with an ability to understand the results, see where they hadn’t matched my intent, and correct the error.

I tried a similar experiment with ChatGPT 3.5. It gave correct code, but the output it presented from the execution didn’t match a run of that code with a real Python interpreter. By the time I realized that the problem was correct code/incorrect output, the chat had devolved into confused statements trying to explain what happened.

I had been trusting that, if a query produced Python code, there would be an actual Python engine to execute it. The situation is more complicated than that. I’m not sure how to probe version 4 to see whether its Python is running correctly.

The hallucinations of an LLM can be more subtle than I anticipated. Their successes can be subtle as well.