
March 25, 2025

Deepfake · Multi-modal LLMs · Computer Vision

Can Multi-modal (reasoning) LLMs work as deepfake detectors?

Exploring the potential of state-of-the-art multi-modal reasoning large language models for deepfake image detection, benchmarking 12 of the latest models against traditional detection methods.

Deepfake detection remains a critical challenge in the era of advanced generative models, particularly as synthetic media becomes more sophisticated. In this study, we explore the potential of state-of-the-art multi-modal (reasoning) large language models (LLMs) for deepfake image detection.


Abstract


We benchmark 12 of the latest multi-modal LLMs against traditional deepfake detection methods across multiple datasets, including recently published real-world deepfake imagery. The models evaluated include OpenAI O1/4o, Gemini 2.0 Flash Thinking, DeepSeek Janus, Grok 3, Llama 3.2, Qwen 2/2.5 VL, Mistral Pixtral, and Claude 3.5/3.7 Sonnet.


Methodology


To enhance performance, we employ prompt tuning and conduct an in-depth analysis of the models' reasoning pathways to identify key contributing factors in their decision-making process.
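As a concrete illustration of the zero-shot setup, here is a minimal sketch in Python using the OpenAI SDK. The prompt wording, model name, and answer format are illustrative assumptions for this post, not the tuned prompts used in the study.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative detection prompt; the study's tuned prompts are not reproduced here.
PROMPT = (
    "You are a forensic image analyst. Examine this face image for signs of "
    "AI generation or manipulation (blending artifacts, inconsistent lighting, "
    "irregular textures). Answer 'real' or 'fake', then give a short justification."
)

def classify_image(path: str, model: str = "gpt-4o") -> str:
    """Send one image with the detection prompt; return the model's raw reply."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

print(classify_image("sample_face.jpg"))  # hypothetical test image
```

Swapping the model string (or the client, for non-OpenAI families) lets the same prompt be replayed across vendors, which is essentially what a cross-model benchmark requires.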


Key Research Questions


Can multi-modal LLMs effectively detect deepfakes?

How do they compare to traditional detection methods?

What factors contribute to their decision-making process?

How do model size and reasoning capabilities affect performance?


Key Findings


Our findings indicate several important insights:


Performance Variability: The best multi-modal LLMs achieve competitive performance with promising zero-shot generalization.

Outperformance: The top models even surpass traditional deepfake detection pipelines on out-of-distribution datasets.

Model Family Differences: The remaining LLM families perform poorly, some even worse than random guessing (see the scoring sketch after this list).

Version Impact: Newer model versions and reasoning capabilities do not necessarily improve performance on a niche task like deepfake detection.

Size Matters: Larger model size does help in some cases.
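For context on the random-guess baseline mentioned above: on a balanced real/fake benchmark, chance accuracy is 0.5, so a detector scoring below that is being actively misled by its own cues. A minimal scoring sketch, with made-up labels and predictions purely for illustration:

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical per-image results: ground truth (1 = fake) and parsed verdicts.
y_true = [1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]  # 'real'/'fake' answers mapped to 0/1

print(f"accuracy: {accuracy_score(y_true, y_pred):.2f}")  # chance level is 0.50

# If a model also emits a confidence score, AUC gives a threshold-free view:
y_score = [0.9, 0.4, 0.2, 0.6, 0.8, 0.1, 0.7, 0.3]
print(f"AUC: {roc_auc_score(y_true, y_score):.2f}")  # chance level is 0.50
```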


Practical Implications


This study highlights the potential of integrating multi-modal reasoning in future deepfake detection frameworks and provides insights into model interpretability for robustness in real-world scenarios.


Subjects


Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)


Citation


arXiv:2503.20084 [cs.CV] (or arXiv:2503.20084v2 [cs.CV] for this version), 2025

Authors: Simiao Ren, Yao Yao, Kidus Zewde, Zisheng Liang, Tsang (Dennis) Ng, Ning-Yau Cheng, Xiaoou Zhan, Qinzhe Liu, Yifei Chen, Hengwei Xu
