ChatGPT Turing Test
Results

I collected 111 responses from 28 people over 2 weeks, an average of 3.96 responses per person. In total, 54% of the statements received the correct ‘label’ (robot/human). Not every statement was equally easy to judge: the phrase “While I appreciate…” (from ChatGPT) scored the most correct answers (71%) and the phrase “The concept…” (also from ChatGPT) scored the fewest (36%). A complete overview can be found below.

The ten sentences and how well people were able to correctly identify the origin. The greener the bar, the higher the score. The 'C' or 'H' before each sentence represents the source, ChatGPT (C) or a human (H).
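To make the per-statement scores concrete, here is a minimal sketch of how they can be computed from the raw responses. This is not the actual analysis script from the GitHub repository; the data layout and field names are assumptions.

```python
# Minimal sketch of the per-statement scoring. The response data below is an
# illustrative placeholder, not the real survey data; only the overall idea
# (count correct labels per statement) reflects the experiment.
from collections import defaultdict

# Each response: (statement, true_source, guess), with "C" = ChatGPT, "H" = human.
responses = [
    ("While I appreciate…", "C", "C"),
    ("The concept…", "C", "H"),
    # … 111 responses in total
]

totals = defaultdict(int)
correct = defaultdict(int)
for statement, true_source, guess in responses:
    totals[statement] += 1
    correct[statement] += int(guess == true_source)

for statement in sorted(totals, key=lambda s: correct[s] / totals[s], reverse=True):
    print(f"{statement}: {correct[statement] / totals[statement]:.0%} correct")

overall = sum(correct.values()) / sum(totals.values())
print(f"Overall: {overall:.0%} correct")  # 54% in the actual experiment
```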

Now you might think: an average score of 54%, well, that’s still more than half. So people aren’t that bad at this.

But unfortunately that’s not how it works. In this type of test, ‘the lower the worse’ does not apply. Instead, the rule is ‘the closer to 50%, the worse.’ An example makes this clear. Suppose Bert scores 10%. That sounds bad, but if we do exactly the opposite of what Bert says, we end up at 90%, which is very good. A score of 50%, however, cannot be reversed into anything useful: inverting it still gives 50%, so we might as well toss a coin, which carries exactly as much information, namely none. In short, 54% is a poor score.
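A toy illustration of this in Python (my own sketch, not part of the experiment’s code): a guesser who is consistently wrong can simply be inverted into a good one, while a coin-flip guesser stays at roughly 50% no matter what you do.

```python
# Toy illustration (not from the experiment's code): inverting a consistently
# wrong guesser gives a good one, but a ~50% guesser stays at ~50%.
import random

def score(guesses, truth):
    """Fraction of guesses that match the true source ('C' or 'H')."""
    return sum(g == t for g, t in zip(guesses, truth)) / len(truth)

def invert(guesses):
    """Do exactly the opposite of every guess."""
    return ["H" if g == "C" else "C" for g in guesses]

truth = ["C", "H"] * 50                      # 100 statements, half ChatGPT, half human
wrong = ["H" if t == "C" else "C" for t in truth]
bert = wrong[:90] + truth[90:]               # wrong on 90, right on 10 -> 10% correct

print(score(bert, truth))                    # 0.10
print(score(invert(bert), truth))            # 0.90: just do the opposite of Bert

random.seed(0)
coin = [random.choice("CH") for _ in truth]  # coin-toss guesser
print(score(coin, truth))                    # roughly 0.5
print(score(invert(coin), truth))            # still roughly 0.5: nothing to gain
```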

ChatGPT detectors

ChatGPT detectors are entering the market. Detectors are programs, websites, or other pieces of software that claim to recognize ChatGPT’s work. I ran the database of sentences through 14 different detectors. By pure coincidence, this yielded exactly as many computer responses as I had human responses: 111. The average score of the ChatGPT detectors is 70.3%.

How well ChatGPT detectors were able to correctly identify the origin. Green are correct answers, orange are wrong answers.

Two things stand out:

  1. Some detectors perform very well, such as Copyleaks and SEO.ai. Others, such as Scribbr and ChatGPTdetector, suck at it.
  2. Not every detector is willing to judge every sentence, most often because the sentence is too short. For example, the OpenAI text classifier from OpenAI itself (!) requires a minimum of 1000 characters. In other cases, a detector indicates that it cannot say with certainty what the source is. A sketch of how such skipped sentences can be handled in the scoring follows below.
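Here is a small sketch of the tallying when a detector refuses a sentence. The detector names and verdicts are made-up placeholders, and the real scripts on GitHub may do this differently; the idea is simply that skipped sentences are excluded from a detector’s score.

```python
# Sketch of detector bookkeeping: skipped sentences (too short, or "unsure")
# are left out of that detector's score. All data below is illustrative.
SKIPPED = None  # detector declined to judge this sentence

# verdicts[detector] = list of (true_source, detector_verdict) per sentence
verdicts = {
    "Detector A": [("C", "C"), ("H", "H"), ("C", SKIPPED), ("H", "C")],
    "Detector B": [("C", "C"), ("H", "H"), ("C", "C"), ("H", "H")],
}

all_correct = all_judged = 0
for name, results in verdicts.items():
    judged = [(t, v) for t, v in results if v is not SKIPPED]
    correct = sum(t == v for t, v in judged)
    all_correct += correct
    all_judged += len(judged)
    print(f"{name}: {correct}/{len(judged)} correct, "
          f"{len(results) - len(judged)} skipped")

print(f"Average over judged sentences: {all_correct / all_judged:.1%}")
```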

Discussion

One can object to plenty of aspects of my scientific method. But it was never my intention to provide conclusive scientific evidence; food for thought is enough. Pseudoscience is okay. Since the start of my research, real scientific research has been done into the human likeness of ChatGPT: researchers at New York University found an average Turing test score of about 65%.

Despite my pseudoscientific approach, it is still worth looking at where my research falls short. The biggest weakness is the group of people I polled: I was really investigating whether a follower of Stach can recognize ChatGPT’s messages, not whether ‘an average person’ can.

If we look at the facts, my followers are:

And this creates a biased group of subjects, which is actually very nice. I find it more interesting to know how well people with an interest in technology are able to recognize ChatGPT, rather than how well an average person can do this.

Conclusion

Stach’s followers are hardly able to distinguish ChatGPT from a human. I tested this with a self-developed, open-source ChatGPT Turing test. I collected 111 responses from 28 people over 2 weeks; 54% of the answers were correct. It might be time to send everyone to a ‘spot the chatbot’ training. Or just to gift everyone a decent ChatGPT detector. After all, that works too.

Disclosure

Stach Redeker is not affiliated with OpenAI, ChatGPT, or any other service providers mentioned in this article.


The code for this experiment is open source. You can find it on my GitHub. If you have any questions, shoot me a message.