
January 10, 2025

Fraud Detection, Digital Communications, Multi-language

ScamNet Fraud Communications Dataset

Large-scale dataset of labeled fraudulent and legitimate digital communications across multiple platforms and languages for training and evaluating fraud detection systems.

Dataset Size: 800 GB

Samples: 1,000,000

License: Custom Research License

Type: Dataset

The ScamNet Fraud Communications Dataset is presented as the largest publicly available collection of labeled fraudulent communications. It is intended to support research into digital fraud detection and prevention.


Dataset Overview


This dataset addresses the critical need for large-scale, diverse training data in fraud detection research by providing real-world examples of fraudulent communications across multiple platforms and languages.


Content Composition


Communication Types

Email: Phishing and legitimate emails

SMS/Text: Smishing and normal messages

Social Media: Fraudulent and authentic posts

Instant Messages: Chat-based scam attempts


Language Coverage

English (primary)

Spanish, French, German

Mandarin Chinese, Japanese

Arabic, Portuguese, Russian

Regional dialects and variations


Annotation Framework


Fraud Categories

Phishing: Credential theft attempts

Financial Fraud: Investment and payment scams

Identity Theft: Personal information harvesting

Romance Scams: Relationship-based fraud

Tech Support: Fake technical assistance


Quality Assurance

Multi-annotator consensus

Expert validation process

Inter-annotator agreement metrics

Continuous quality monitoring
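
As an illustration of the consensus and agreement items above, the following minimal sketch shows a majority vote over annotator labels and Cohen's kappa for two annotators. The page does not say which agreement metric or tie-breaking rule the dataset actually uses, so both functions and the label strings are assumptions chosen for the example.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common label, or None when the top labels tie."""
    if not votes:
        raise ValueError("votes must not be empty")
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no consensus; a real pipeline might escalate to an expert
    return counts[0][0]


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    a = ["phishing", "phishing", "legit", "legit"]
    b = ["phishing", "legit", "legit", "legit"]
    print(majority_label(["phishing", "phishing", "legit"]))
    print(cohens_kappa(a, b))
```

Kappa corrects raw agreement for chance, which matters when one class (for example, legitimate messages) dominates the sample.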


Technical Specifications


Data Format

JSON structured format

Metadata preservation

Privacy-compliant processing

Standardized schema
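
The schema itself is not published on this page. To make the idea of a standardized JSON record concrete, here is a hedged sketch of what parsing and validating one record could look like. Every field name (`sample_id`, `channel`, `language`, `text`, `is_fraud`, `fraud_category`) and every allowed value is hypothetical, loosely derived from the communication types and fraud categories listed above.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical vocabularies, based on the categories named on this page.
CHANNELS = {"email", "sms", "social_media", "instant_message"}
FRAUD_CATEGORIES = {
    "phishing", "financial_fraud", "identity_theft",
    "romance_scam", "tech_support",
}


@dataclass(frozen=True)
class Sample:
    sample_id: str
    channel: str
    language: str
    text: str
    is_fraud: bool
    fraud_category: Optional[str] = None


def parse_sample(line: str) -> Sample:
    """Parse one JSON record and check it against the assumed schema."""
    raw = json.loads(line)
    if not isinstance(raw["is_fraud"], bool):
        raise ValueError("is_fraud must be a JSON boolean")
    sample = Sample(
        sample_id=str(raw["sample_id"]),
        channel=raw["channel"],
        language=raw["language"],
        text=raw["text"],
        is_fraud=raw["is_fraud"],
        fraud_category=raw.get("fraud_category"),
    )
    if sample.channel not in CHANNELS:
        raise ValueError(f"unknown channel: {sample.channel!r}")
    if sample.is_fraud and sample.fraud_category not in FRAUD_CATEGORIES:
        raise ValueError("fraudulent samples need a known fraud_category")
    if not sample.is_fraud and sample.fraud_category is not None:
        raise ValueError("legitimate samples must not carry a fraud_category")
    return sample


if __name__ == "__main__":
    record = json.dumps({
        "sample_id": "demo-1", "channel": "sms", "language": "en",
        "text": "Your parcel is held. Confirm your details.",
        "is_fraud": True, "fraud_category": "phishing",
    })
    print(parse_sample(record))
```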


Privacy Protection

Personal information anonymization

Differential privacy techniques

Compliance with GDPR/CCPA

Ethical review board approval
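
The page does not describe how anonymization is implemented. As a rough illustration only, a rule-based pass that masks email addresses and phone numbers might look like the sketch below. The regular expressions and the `[EMAIL]` and `[PHONE]` placeholders are assumptions, and patterns this simple will miss many real-world formats.

```python
import re

# Deliberately simple patterns; these are illustrative, not exhaustive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)  # emails first, so digits in them are gone
    text = PHONE_RE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    print(mask_pii("Contact jane.doe@example.com or call +1 (555) 010-9999 now"))
```

Placeholders keep the position of the removed value visible, so a classifier can still learn that a message contained a phone number without seeing the number itself.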


Research Applications


Machine Learning

Supervised learning model training

Transfer learning across languages

Few-shot learning evaluation

Adversarial robustness testing
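
One common way to set up a cross-language transfer experiment is to train on a single source language and evaluate on each remaining language separately. The sketch below shows such a split. It assumes each record is a dict with a `language` key holding a code such as `"en"`, which is a hypothetical layout and not a documented part of the dataset.

```python
from collections import defaultdict


def split_by_language(samples, source_language="en"):
    """Return (train, eval_sets): source-language records for training,
    and all other records grouped by language for evaluation."""
    train = []
    eval_sets = defaultdict(list)
    for sample in samples:
        if sample["language"] == source_language:
            train.append(sample)
        else:
            eval_sets[sample["language"]].append(sample)
    return train, dict(eval_sets)


if __name__ == "__main__":
    data = [
        {"language": "en", "text": "Verify your account", "is_fraud": True},
        {"language": "es", "text": "Verifique su cuenta", "is_fraud": True},
        {"language": "en", "text": "Lunch at noon?", "is_fraud": False},
    ]
    train, eval_sets = split_by_language(data)
    print(len(train), sorted(eval_sets))
```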


Natural Language Processing

Text classification benchmarks

Multi-lingual model evaluation

Feature engineering research

Semantic analysis studies
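
For text classification benchmarks on fraud data, accuracy alone can mislead when classes are imbalanced, so precision, recall, and F1 on the fraud class are the usual reported figures. The sketch below computes them from boolean labels. The function name and the convention that `True` means fraudulent are assumptions for illustration, since the page does not specify an evaluation protocol.

```python
def binary_metrics(y_true, y_pred, positive=True):
    """Precision, recall, and F1 for the positive (fraud) class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Fall back to 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    truth = [True, True, False, False]
    preds = [True, False, True, False]
    print(binary_metrics(truth, preds))
```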


Access and Licensing


The dataset is available for academic and research purposes under our custom research license. Commercial applications require separate licensing agreements. All usage must comply with ethical guidelines and privacy regulations.


Contributing


We welcome contributions from the research community:

Additional language samples

New fraud category examples

Improved annotation guidelines

Quality enhancement suggestions


Maintenance and Updates


The dataset receives regular updates:

Quarterly content additions

Annual quality reviews

Schema version updates

Community feedback integration
