ChatGPT Turing Test
Results
I collected 111 responses from 28 people over 2 weeks, an average of 3.96 responses per person. In total, 54% of the statements received the correct ‘label’ (robot/human). Not every statement proved equally hard to judge: the phrase “While I appreciate…” (from ChatGPT) was labeled correctly most often (71%) and the phrase “The concept…” (also from ChatGPT) least often (36%). A complete overview can be found below.
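As an illustration, here is a minimal sketch of how such aggregate and per-statement scores can be computed. The CSV layout and column names are assumptions made for this example, not necessarily the format used in the actual experiment code.

```python
import csv
from collections import defaultdict

# Assumed layout: one row per response, with the statement id, the true
# source ("robot" or "human"), and the participant's guess.
correct = 0
total = 0
per_statement = defaultdict(lambda: [0, 0])  # statement -> [correct, answered]

with open("responses.csv", newline="") as f:
    for row in csv.DictReader(f):
        hit = row["guess"] == row["source"]
        correct += hit
        total += 1
        per_statement[row["statement"]][0] += hit
        per_statement[row["statement"]][1] += 1

print(f"Overall accuracy: {correct / total:.0%}")
for statement, (c, n) in sorted(per_statement.items(),
                                key=lambda kv: kv[1][0] / kv[1][1]):
    print(f"{statement}: {c / n:.0%} correct ({n} responses)")
```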
Now you might think: an average score of 54%, well, that’s still more than half. So people aren’t that bad at this.
But unfortunately that’s not how it works. In this type of test, ‘lower is worse’ does not apply; instead, the rule is ‘the closer to 50%, the worse.’ An example makes that clear. Suppose Bert scores 10%. That sounds bad, but if we do exactly the opposite of what Bert says, we end up with 90%, and that is very good. A score of 50% cannot be reversed in this way: we might as well toss a coin, because that gives us the same amount of information. In short, 54% is a poor score.
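To make the ‘closer to 50% is worse’ rule concrete, here is a tiny numerical illustration (not part of the experiment itself): the usable accuracy of a guesser is the better of trusting it or inverting it, and that is lowest at exactly 50%.

```python
def usable_accuracy(accuracy: float) -> float:
    """Best accuracy achievable by either trusting or inverting the guesses."""
    return max(accuracy, 1.0 - accuracy)

for score in (0.10, 0.50, 0.54, 0.90):
    print(f"raw {score:.0%} -> usable {usable_accuracy(score):.0%}")
# raw 10% -> usable 90%   (do the opposite of what Bert says)
# raw 50% -> usable 50%   (no better than a coin toss)
# raw 54% -> usable 54%   (barely better than a coin toss)
# raw 90% -> usable 90%
```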
ChatGPT detectors
ChatGPT detectors are entering the market. Detectors are programs, websites, or other pieces of software that claim to recognize ChatGPT’s work. I ran the database of sentences through 14 different detectors. By pure coincidence, this yielded exactly as many computer responses as I had human responses: 111. The average score of the ChatGPT detectors is 70.3%.
Two things stand out:
- Some detectors perform very well, such as Copyleaks and SEO.ai. Others, such as Scribbr and ChatGPTdetector, suck at it.
- Not every detector is willing to judge every sentence, most often because the sentence is too short. For example, the OpenAI text classifier from OpenAI itself (!) requires a minimum of 1000 characters. In other cases, a detector indicates that it cannot state with certainty what the source is. (A sketch of how such refusals can be handled when scoring follows this list.)
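As a sketch of how such refusals can be handled when scoring a detector, the snippet below simply leaves out the sentences a detector would not judge. The data structures and the skip-undecided rule are illustrative assumptions, not how the detectors themselves report their results.

```python
# Hypothetical verdicts per sentence: "robot", "human",
# or None when the detector refuses (e.g. the text is too short).
def detector_accuracy(verdicts, truths):
    scored = [(v, t) for v, t in zip(verdicts, truths) if v is not None]
    if not scored:
        return None  # the detector judged nothing at all
    hits = sum(v == t for v, t in scored)
    return hits / len(scored)

verdicts = ["robot", None, "human", "robot"]
truths   = ["robot", "robot", "human", "human"]
print(detector_accuracy(verdicts, truths))  # 0.666...: 2 of the 3 judged sentences
```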
Discussion
One can object to plenty of aspects of my scientific method. But it was never my intention to provide conclusive scientific evidence. Food for thought is enough. Pseudoscience is okay. Since the start of my research, real scientific research has been done into the human likeness of ChatGPT: researchers at New York University found an average Turing test score of about 65%.
Despite my pseudoscientific approach, it is still worth looking at where my research falls short. The biggest weakness is the group of people polled. I was actually researching whether a follower of Stach can recognize ChatGPT’s messages, not whether ‘an average person’ can.
If we look at the facts, my followers are:
- WEIRD (Western, Educated, Industrialized, Rich, and Democratic);
- In possession of an internet connection;
- Sufficiently proficient in English to work through my instructions and tool;
- Somewhat interested in digital technologies.
And this creates a biased group of subjects, which is actually very nice. I find it more interesting to know how well people with an interest in technology can recognize ChatGPT than how well an average person can.
Conclusion
Stach’s followers are hardly able to distinguish ChatGPT from a human. I tested this with a self-developed, open-source ChatGPT Turing test. I collected 111 responses from 28 people over 2 weeks; 54% of the answers were correct. It might be time to send everyone to a ‘spot the chatbot’ training. Or just to gift everyone a decent ChatGPT detector. After all, that also works.
Disclosure
Stach Redeker is not affiliated with OpenAI, ChatGPT, or any other service providers mentioned in this article.
Return to the page about the experiment.
The code for this experiment is open source. You can find it on my GitHub. If you have any questions, shoot me a message.