The Multi-Modal LLM Evaluation Benchmark provides a standardized framework for assessing the deepfake detection capabilities of large language models across various modalities and complexity levels.
Benchmark Overview
This benchmark addresses the need for systematic evaluation of multi-modal LLMs in the context of deepfake detection, providing researchers with standardized metrics and evaluation protocols.
Dataset Components
Image Collections
• High-Quality Deepfakes: State-of-the-art synthetic images
• Traditional Manipulations: Classic photo editing techniques
• Edge Cases: Challenging examples that test model limits
• Control Sets: Authentic images for baseline comparison (see the loading sketch after this list)
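For concreteness, here is a minimal sketch of how these collections might be loaded for evaluation. The directory-per-subset layout, the subset names, and the labels.csv manifest are illustrative assumptions, not the benchmark's actual on-disk structure.

```python
from pathlib import Path

# Assumed layout (illustrative only): one folder per subset, each containing a
# labels.csv manifest that maps image filenames to "real" or "fake".
SUBSETS = [
    "high_quality_deepfakes",
    "traditional_manipulations",
    "edge_cases",
    "control_authentic",
]

def load_subset(root: str, subset: str) -> list[tuple[Path, str]]:
    """Return (image_path, label) pairs for one benchmark subset."""
    labels_file = Path(root) / subset / "labels.csv"
    samples = []
    for line in labels_file.read_text().splitlines()[1:]:  # skip the header row
        filename, label = line.split(",")
        samples.append((Path(root) / subset / filename, label.strip()))
    return samples
```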
Evaluation Protocols
• Zero-shot Evaluation: No task-specific training
• Few-shot Learning: Limited example-based adaptation
• Prompt Engineering: Optimized instruction formats (see the prompt-construction sketch after this list)
• Reasoning Analysis: Explanation quality assessment
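The sketch below shows one way zero-shot and few-shot prompts could be assembled; the instruction wording and the demonstration format are assumptions made for illustration, and the image under test would be attached through each model's multimodal API rather than embedded in the text.

```python
def build_prompt(demonstrations: list[tuple[str, str]] | None = None) -> str:
    """Build the text portion of a deepfake-detection query.

    `demonstrations` holds (description, label) pairs for few-shot runs; pass
    None for a zero-shot run. The wording is illustrative, not a fixed prompt
    prescribed by the benchmark.
    """
    instruction = (
        "Examine the attached image and decide whether it is an authentic "
        "photograph or a synthetic/manipulated one. Answer 'real' or 'fake' "
        "and briefly describe the visual evidence for your decision."
    )
    parts = [instruction]
    for description, label in demonstrations or []:  # few-shot demonstrations
        parts.append(f"Example:\n{description}\nAnswer: {label}")
    parts.append("Now classify the attached image.\nAnswer:")
    return "\n\n".join(parts)
```

A zero-shot run passes no demonstrations; a few-shot run prepends a handful of labeled examples before the final query.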
Model Coverage
The benchmark supports evaluation of major LLM families:
• OpenAI models (GPT-4o, o1)
• Google models (Gemini 2.0 Flash)
• Anthropic models (Claude 3.5/3.7 Sonnet)
• Open-source alternatives (Llama, Qwen, Mistral); see the adapter-interface sketch after this list
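One plausible way to support this breadth is a uniform adapter interface plus a registry of model identifiers; the interface and the identifier strings below are illustrative placeholders, not the benchmark's actual API.

```python
from typing import Protocol

class DetectionModel(Protocol):
    """Minimal interface an adapter could expose to the evaluation loop
    (an assumption about structure, not the benchmark's real API)."""
    name: str

    def classify(self, image_path: str, prompt: str) -> tuple[str, float]:
        """Return a 'real'/'fake' verdict and a confidence in [0, 1]."""
        ...

# Hypothetical registry; concrete adapters would wrap each provider's SDK or a
# local open-weights runtime. Identifier strings are placeholders.
MODEL_REGISTRY: dict[str, list[str]] = {
    "openai": ["gpt-4o", "o1"],
    "google": ["gemini-2.0-flash"],
    "anthropic": ["claude-3.5-sonnet", "claude-3.7-sonnet"],
    "open_source": ["llama", "qwen", "mistral"],
}
```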
Evaluation Metrics
Performance Measures
• Accuracy: Correct classification rate
• Precision/Recall: Detailed performance breakdown
• F1-Score: Balanced performance metric
• AUC-ROC: Area under the receiver operating characteristic curve (see the metric sketch after this list)
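Assuming binary labels with "fake" as the positive class and a per-sample confidence score from the model, these measures can be computed with scikit-learn; this is a sketch, not the benchmark's prescribed scoring script.

```python
from sklearn.metrics import (
    accuracy_score,
    precision_recall_fscore_support,
    roc_auc_score,
)

def compute_metrics(y_true: list[int], y_pred: list[int], y_score: list[float]) -> dict:
    """Binary detection metrics: 1 = fake (positive class), 0 = real.

    y_score is the model's confidence that the image is fake, which AUC-ROC
    requires in addition to the hard predictions.
    """
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", pos_label=1
    )
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "auc_roc": roc_auc_score(y_true, y_score),
    }
```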
Reasoning Quality
• Explanation Coherence: Logic and consistency of reasoning
• Evidence Identification: Ability to point out specific artifacts
• Confidence Calibration: Alignment between confidence and accuracy (see the calibration sketch after this list)
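Confidence calibration is often summarized with expected calibration error (ECE), which bins predictions by stated confidence and compares each bin's average confidence with its empirical accuracy. The sketch below assumes that formulation with ten equal-width bins; other calibration measures would work equally well.

```python
import numpy as np

def expected_calibration_error(
    confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10
) -> float:
    """ECE over equal-width confidence bins (bin count is an illustrative choice).

    `confidences` holds the model's stated confidence in [0, 1] for each answer;
    `correct` holds 1 if that answer was right and 0 otherwise.
    """
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)
```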
Research Applications
This benchmark enables:
• Comparative analysis across LLM architectures
• Investigation of reasoning capabilities
• Development of improved prompting strategies
• Understanding of model limitations and strengths
Usage Guidelines
The benchmark includes detailed documentation for:
• Setup and installation procedures
• Evaluation script usage
• Result interpretation guidelines
• Best practices for fair comparison
Future Extensions
We plan to expand the benchmark with:
• Additional modalities (audio, video)
• More diverse synthetic content types
• Cross-lingual evaluation capabilities
• Adversarial robustness testing