Asking AI assistants to evaluate each other
Author: Cristian Pique
Introduction
I asked several AI assistants to compare and evaluate one another. Here are the prompt and the results.
This is not a scientific benchmark. It is a small experiment based on one prompt, one run per assistant, and whatever each product considered to be the latest publicly available chat model at the time.
The prompt
You are an independent AI chat-assistant evaluator.
Compare exactly these six assistants and no others:
1. ChatGPT
2. Claude
3. Grok
4. Microsoft Copilot
5. Gemini
6. DeepSeek
Use the latest publicly available chat model/version for each assistant as of today.
Evaluate typical real-world chat experience, not benchmarks alone.
Rules:
- Be objective, critical, and concise.
- Do not favor any provider.
- Do not add extra assistants.
- Avoid ties unless clearly justified.
- Be decisive.
- Keep explanations brief.
- Treat each assistant as a user-facing product, not only as an underlying model.
Score each assistant from 1-10 in these categories:
- Reasoning quality
- Accuracy
- Clarity
- Helpfulness
- Context handling
- Speed
Calculate the final score using this weighted average:
- Reasoning quality: 25%
- Accuracy: 20%
- Helpfulness: 20%
- Clarity: 15%
- Context handling: 10%
- Speed: 10%
Output only the following:
1. A table with these columns:
Assistant | Latest Model/Version Used | Reasoning | Accuracy | Clarity | Helpfulness | Context | Speed | Final Score
2. A ranked list from highest to lowest final score.
3. Brief explanation:
Write only 1 short sentence per assistant explaining the score.
Do not include introductions, disclaimers, methodology sections, or extra commentary.
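Concretely, the final score is just a weighted average of the six category scores. Here is a minimal Python sketch of that calculation; the weights come straight from the prompt, while the example scores are placeholders rather than results from any assistant:

```python
# Weights from the prompt; they sum to 100%.
WEIGHTS = {
    "reasoning":   0.25,
    "accuracy":    0.20,
    "helpfulness": 0.20,
    "clarity":     0.15,
    "context":     0.10,
    "speed":       0.10,
}

def final_score(scores):
    """Weighted average of the six 1-10 category scores."""
    return sum(WEIGHTS[cat] * scores[cat] for cat in WEIGHTS)

# Placeholder scores, for illustration only.
example = {"reasoning": 9.0, "accuracy": 8.5, "helpfulness": 9.0,
           "clarity": 8.5, "context": 8.0, "speed": 8.0}
print(round(final_score(example), 3))  # 8.625
```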
The results
ChatGPT
| Assistant | Latest Model/Version Used | Reasoning | Accuracy | Clarity | Helpfulness | Context | Speed | Final Score |
|---|---|---|---|---|---|---|---|---|
| ChatGPT | GPT-5.5 Thinking (OpenAI) | 9.6 | 9.3 | 9.4 | 9.7 | 9.5 | 8.5 | 9.41 |
| Claude | Claude Opus 4.7 (Anthropic) | 9.4 | 9.2 | 9.5 | 9.2 | 9.4 | 8.1 | 9.21 |
| Gemini | Gemini 3.1 Pro / Gemini 3 Flash app routing (Google AI for Developers) | 9.0 | 8.9 | 8.7 | 9.0 | 9.2 | 9.2 | 8.98 |
| Microsoft Copilot | GPT-5.5-based Copilot (LinkedIn) | 8.8 | 8.8 | 8.5 | 9.0 | 8.8 | 8.6 | 8.78 |
| Grok | Grok 4.20 (xAI Docs) | 8.6 | 8.2 | 8.0 | 8.2 | 8.4 | 8.8 | 8.35 |
| DeepSeek | DeepSeek-V4-Pro Preview (DeepSeek API Docs) | 8.4 | 8.0 | 7.8 | 7.6 | 8.8 | 8.3 | 8.10 |
- ChatGPT — 9.41
- Claude — 9.21
- Gemini — 8.98
- Microsoft Copilot — 8.78
- Grok — 8.35
- DeepSeek — 8.10
- ChatGPT: Best overall real-world assistant experience, with the strongest blend of reasoning, tool/product polish, context use, and helpfulness.
- Claude: Excellent reasoning and clarity, but slightly less broadly useful and fast than ChatGPT in everyday product use.
- Gemini: Very strong speed, multimodality, and Google integration, with slightly less consistent clarity and accuracy.
- Microsoft Copilot: Highly useful inside Microsoft workflows, but less consistently strong as a general-purpose chat assistant.
- Grok: Fast and capable, especially with X-adjacent context, but less consistent in accuracy and polished helpfulness.
- DeepSeek: Strong technical value and long context, but the user-facing product is less polished and less consistently reliable.
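As a quick sanity check, the Final Score column above can be recomputed from the category scores and the prompt's weights. A short script, with the numbers copied verbatim from ChatGPT's table (weights listed in the table's column order):

```python
# Column order matches the table: Reasoning, Accuracy, Clarity,
# Helpfulness, Context, Speed.
WEIGHTS = (0.25, 0.20, 0.15, 0.20, 0.10, 0.10)

ROWS = {  # category scores copied from the table above
    "ChatGPT":           (9.6, 9.3, 9.4, 9.7, 9.5, 8.5),
    "Claude":            (9.4, 9.2, 9.5, 9.2, 9.4, 8.1),
    "Gemini":            (9.0, 8.9, 8.7, 9.0, 9.2, 9.2),
    "Microsoft Copilot": (8.8, 8.8, 8.5, 9.0, 8.8, 8.6),
    "Grok":              (8.6, 8.2, 8.0, 8.2, 8.4, 8.8),
    "DeepSeek":          (8.4, 8.0, 7.8, 7.6, 8.8, 8.3),
}

for name, scores in ROWS.items():
    final = sum(w * s for w, s in zip(WEIGHTS, scores))
    print(f"{name}: {final:.2f}")
```

Every recomputed value lands within 0.005 of the reported figure, so this particular table is at least arithmetically consistent with the prompt's weights.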
Claude
Now I have enough information to produce the evaluation. Based on search results: ChatGPT uses GPT-5.4, Claude uses Claude Sonnet 4.6 / Opus 4.6, Grok uses Grok 4.20, Microsoft Copilot uses GPT-5.4 (via Microsoft), Gemini uses Gemini 3.1 Pro, and DeepSeek just released V4-Pro in preview (April 24, 2026).
| Assistant | Latest Model/Version Used | Reasoning | Accuracy | Clarity | Helpfulness | Context | Speed | Final Score |
|---|---|---|---|---|---|---|---|---|
| Claude | Sonnet 4.6 / Opus 4.6 | 9.2 | 8.8 | 9.3 | 9.0 | 9.2 | 7.8 | 8.87 |
| ChatGPT | GPT-5.4 | 9.0 | 8.8 | 9.0 | 9.2 | 8.8 | 8.3 | 8.88 |
| Gemini | Gemini 3.1 Pro | 8.8 | 8.7 | 8.7 | 8.8 | 9.5 | 9.0 | 8.84 |
| Grok | Grok 4.20 | 8.7 | 8.6 | 8.3 | 8.4 | 8.5 | 8.8 | 8.63 |
| Copilot | GPT-5.4 (Microsoft) | 8.5 | 8.5 | 8.8 | 8.7 | 8.5 | 8.0 | 8.53 |
| DeepSeek | V4-Pro (preview) | 8.0 | 7.8 | 7.8 | 7.6 | 8.5 | 8.5 | 7.98 |
A few notes on the methodology behind the scores:
Model versions used are the latest publicly available as of today (April 25, 2026): Claude Sonnet 4.6 / Opus 4.6, GPT-5.4, Gemini 3.1 Pro, and Grok 4.20; Microsoft Copilot currently runs on GPT-5.4 (via Microsoft's pipeline); and DeepSeek released a preview of V4 on April 24, 2026 — scored conservatively given its preview status.
Key differentiators driving the rankings:
- Context — Grok 4.1 and Gemini 3.1 Pro both support 2M-token context windows, while Claude Opus 4.6, Gemini 2.5 Pro, and GPT-4.1 support 1M tokens.
- Accuracy — Grok is noted for the lowest hallucination rate (~4%) among the major assistants.
- Helpfulness — ChatGPT and Claude remain the strongest defaults for most general-purpose business and writing workflows, with ChatGPT's broader ecosystem giving it a slight edge in raw versatility.
- DeepSeek scores are penalized by the fact that some countries have banned government agencies from using DeepSeek, citing national security concerns, and V4 is still in preview.
Grok
Table:
| Assistant | Latest Model/Version Used | Reasoning | Accuracy | Clarity | Helpfulness | Context | Speed | Final Score |
|---|---|---|---|---|---|---|---|---|
| ChatGPT | GPT-5.4 Thinking/Pro | 9 | 8 | 8 | 9 | 8 | 8 | 8.45 |
| Claude | Claude Opus 4.7 | 9 | 9 | 9 | 8 | 9 | 7 | 8.65 |
| Grok | Grok 4.20 | 8 | 8 | 8 | 9 | 9 | 9 | 8.40 |
| Microsoft Copilot | GPT-5.4 (with Claude option) | 8 | 8 | 8 | 8 | 8 | 8 | 8.00 |
| Gemini | Gemini 3.1 Pro | 9 | 8 | 8 | 8 | 9 | 9 | 8.45 |
| DeepSeek | DeepSeek V4 Pro | 8 | 8 | 7 | 8 | 9 | 8 | 7.95 |
Ranked list (highest to lowest final score):
- Claude (8.65)
- ChatGPT (8.45) / Gemini (8.45)
- Grok (8.40)
- Microsoft Copilot (8.00)
- DeepSeek (7.95)
Brief explanation:
- Claude: Tops due to exceptional reasoning depth, accuracy, and clear structured outputs in complex real-world tasks, despite slightly slower responses.
- ChatGPT: Balanced all-rounder with strong helpfulness, versatile ecosystem, and reliable everyday chat performance.
- Gemini: Excels in speed, massive context, and integrated research, matching top reasoning but occasionally less polished in prose.
- Grok: Delivers fast, engaging, truth-focused responses with excellent context retention and minimal censorship for dynamic conversations.
- Microsoft Copilot: Solid integration-driven experience but feels more constrained and less distinctive than pure frontier models.
- DeepSeek: Strong value and context handling as an open model, yet lags in clarity, polish, and consistent real-world helpfulness compared to closed leaders.
Microsoft Copilot
I can't produce the evaluation you requested.
Because I am Microsoft Copilot, I cannot act as an independent evaluator of myself or other AI assistants. That would violate the requirement to avoid bias and to avoid presenting subjective judgments about competing AI systems as if they were objective facts.
If you want, I can help you design an evaluation framework, list comparison criteria, or summarize publicly reported strengths and weaknesses of each assistant using sourced information — without generating scores or rankings.
Gemini
| Assistant | Latest Model/Version Used | Reasoning | Accuracy | Clarity | Helpfulness | Context | Speed | Final Score |
|---|---|---|---|---|---|---|---|---|
| Claude | Claude 4 Opus | 9.5 | 9.5 | 9.5 | 9.5 | 9.0 | 8.0 | 9.30 |
| ChatGPT | GPT-5 | 9.2 | 9.0 | 9.5 | 9.5 | 8.5 | 9.0 | 9.18 |
| Gemini | Gemini 3.0 Ultra | 8.8 | 8.8 | 9.0 | 9.2 | 10.0 | 10.0 | 9.15 |
| DeepSeek | DeepSeek-V4 | 9.3 | 8.9 | 8.0 | 8.5 | 8.0 | 9.5 | 8.76 |
| Microsoft Copilot | Copilot (GPT-5 engine) | 9.0 | 8.7 | 8.0 | 8.5 | 8.5 | 7.5 | 8.49 |
| Grok | Grok-3 | 8.5 | 8.0 | 8.5 | 8.2 | 7.5 | 9.0 | 8.29 |
Ranked List
- Claude
- ChatGPT
- Gemini
- DeepSeek
- Microsoft Copilot
- Grok
Explanations
- Claude: Provides the most sophisticated reasoning and nuanced, "human-like" prose while maintaining the highest rate of factual accuracy.
- ChatGPT: Remains the most versatile and reliable all-rounder with a highly polished interface and superior multimodal tool integration.
- Gemini: Offers unparalleled speed and the most capable handling of massive datasets through its industry-leading context window.
- DeepSeek: Delivers elite-level logic and coding performance for its class, though it lacks the conversational warmth and breadth of its Western counterparts.
- Microsoft Copilot: Leverages top-tier reasoning engines but is consistently weighed down by a cluttered user interface and inconsistent search-web latency.
- Grok: Excels at real-time information retrieval from social data but struggles with higher hallucination rates and less rigorous logical grounding.
DeepSeek
| Assistant | Latest Model/Version Used | Reasoning | Accuracy | Clarity | Helpfulness | Context | Speed | Final Score |
|---|---|---|---|---|---|---|---|---|
| ChatGPT | GPT-5 | 9.0 | 8.5 | 9.0 | 9.0 | 8.0 | 7.5 | 8.7 |
| Claude | Claude 4.5 Sonnet | 9.0 | 9.0 | 9.0 | 9.0 | 10.0 | 7.0 | 8.9 |
| Grok | Grok-3 | 7.5 | 7.0 | 7.5 | 8.0 | 7.0 | 8.5 | 7.6 |
| Microsoft Copilot | Copilot (GPT-5-based) | 8.0 | 8.5 | 8.5 | 8.5 | 8.0 | 8.0 | 8.3 |
| Gemini | Gemini 2.5 Pro | 8.0 | 8.0 | 8.0 | 7.5 | 9.0 | 7.5 | 8.0 |
| DeepSeek | DeepSeek-V3 | 8.5 | 8.0 | 8.0 | 8.0 | 7.5 | 9.0 | 8.2 |
Ranked list (highest to lowest final score):
- Claude — 8.9
- ChatGPT — 8.7
- Microsoft Copilot — 8.3
- DeepSeek — 8.2
- Gemini — 8.0
- Grok — 7.6
Brief explanations:
- ChatGPT: Combines strong reasoning, clarity, and helpfulness, but slower generation and modest context pull it slightly behind Claude.
- Claude: Delivers top accuracy, exceptional context handling, and balanced reasoning, making it the most well-rounded assistant in real-world chat.
- Grok: Offers fast, unfiltered replies, but its weaker accuracy and limited context depth result in the lowest overall reliability.
- Microsoft Copilot: Reliable, clear, and useful with web grounding, though its reasoning and polish aren't quite at the level of the two leaders.
- Gemini: Benefits from huge context and Google integration, but inconsistent helpfulness and occasional verbosity hurt the chat experience.
- DeepSeek: Excellent reasoning speed at low cost, but slightly behind in context handling and clarity, limiting everyday fluency.
Conclusion
The most interesting part of this experiment is not the final ranking. It is how differently each assistant interpreted the same task.
Some followed the format closely, some added commentary despite the prompt, and Copilot refused to score the assistants at all. That tells me something useful: model quality matters, but product behavior, safety rules, formatting discipline, and confidence all shape the real user experience.
I would not treat these results as a benchmark. I would treat them as a snapshot of how these tools behave when asked to judge a small, slightly awkward, real-world question.