GPT Detector Accuracy: #TidyTuesday 2023, Week 29
- toldham2
- Jul 19, 2023
- 2 min read
Using R/tidyverse ridgeline plots to compare accuracy distributions.

This week's #TidyTuesday challenge takes us into the fascinating world of GPT detectors. For those who may be unaware, #TidyTuesday is a weekly project that encourages data enthusiasts to explore and analyze a new dataset using R. It's a fantastic opportunity to hone data visualization skills, learn from others, and share your insights with the R community.
In this era of advanced artificial intelligence, distinguishing between human-generated and AI-generated text is becoming increasingly important. That's where GPT detectors come in. These detectors are designed to predict whether a given piece of text was written by a human or an AI, such as ChatGPT. But how accurately can they make this distinction?
To answer this question, I've delved into this week's #TidyTuesday dataset and used R's ggplot2 and ggridges libraries to create a series of ridgeline plots. These visualizations show the distribution of prediction probabilities for each detector, making it easy to compare their behavior at a glance.
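For anyone following along at home, here's a minimal setup sketch. I'm assuming the standard raw-GitHub path that the #TidyTuesday repository uses for this week's file; the column names in the comments reflect the published data dictionary.

```r
# Minimal setup sketch: load the week's data straight from the
# #TidyTuesday repository (path assumed from the repo's usual convention).
library(tidyverse)
library(ggridges)

detectors <- read_csv(
  "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-07-18/detectors.csv"
)

# Columns of interest: `detector` names the tool, `kind` is the true
# class of the text (Human or AI), and `.pred_AI` is the predicted
# probability that the text is AI-generated.
glimpse(detectors)
```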
The Findings
My analysis reveals that every GPT detector in the dataset produced a substantial number of false negatives and false positives, indicating that their accuracy in distinguishing human from AI-generated text was not as high as hoped.
Among the detectors analyzed, ZeroGPT stood out as the most reliable, largely because it produced relatively few false positives. Since GPT detectors should ideally err on the side of caution (it is better to let AI-generated text pass as human than to falsely flag a human writer), that low false positive rate made ZeroGPT the standout performer.
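As a rough way to put numbers on that pattern, here's a sketch that tallies misclassification rates per detector. The 0.5 cutoff on `.pred_AI` is my simplifying assumption for illustration, not something taken from the detectors themselves.

```r
# Rough misclassification tally, assuming a 0.5 threshold on .pred_AI.
# Rows where kind == "Human" but pred == "AI" are false positives;
# rows where kind == "AI" but pred == "Human" are false negatives.
detectors |>
  mutate(pred = if_else(.pred_AI >= 0.5, "AI", "Human")) |>
  count(detector, kind, pred) |>
  group_by(detector, kind) |>
  mutate(rate = n / sum(n)) |>
  ungroup() |>
  filter(kind != pred) |>
  arrange(detector, kind)
```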
The Visualization
The use of ridgeline plots for this analysis offered an intuitive way to compare the performance of different detectors. Each 'ridge' represents a different detector, the x-axis shows the predicted probability that a piece of text is AI-generated, and the height of the ridge shows how often that detector's predictions fell in a given range.
By using facet_wrap() to create separate facets for human-written and AI-generated text, I could compare the detectors' performance on each class directly (see the sketch below). This visual comparison made it clear that while some detectors flagged text as AI-generated too readily (producing more false positives), others were overly cautious (producing more false negatives).
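For reference, here's a sketch of the kind of plot described above. It's a reconstruction under the assumptions already noted, not necessarily the exact code behind my final figure.

```r
# Ridgeline comparison: one ridge per detector, x-axis is the predicted
# probability of AI authorship, facets split the texts by true class.
detectors |>
  ggplot(aes(x = .pred_AI, y = detector)) +
  geom_density_ridges(alpha = 0.7, scale = 1.2) +
  facet_wrap(~ kind) +
  labs(
    x = "Predicted probability of AI authorship",
    y = NULL
  ) +
  theme_minimal()
```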
Wrapping Up
This week's #TidyTuesday challenge has provided an insightful exploration into the world of GPT detectors. While these detectors are valuable tools in the era of AI, it's clear that there is still plenty of room for improvement.
Through the power of data visualization, we've been able to delve into the performance of these complex models. As always, #TidyTuesday has shown us the importance of continuous learning and improvement in our data visualization skills and the technologies we rely on daily.