The ScamNet Fraud Communications Dataset is the largest publicly available collection of labeled fraudulent communications, built to support research into digital fraud detection and prevention.
Dataset Overview
Fraud detection research needs large-scale, diverse training data. This dataset meets that need with real-world examples of fraudulent communications, alongside legitimate ones for contrast, across multiple platforms and languages.
Content Composition
Communication Types
• Email: Phishing and legitimate emails
• SMS/Text: Smishing and normal messages
• Social Media: Fraudulent and authentic posts
• Instant Messages: Chat-based scam attempts
Language Coverage
• English (primary)
• Spanish, French, German
• Mandarin Chinese, Japanese
• Arabic, Portuguese, Russian
• Regional dialects and variations
Annotation Framework
Fraud Categories
• Phishing: Credential theft attempts
• Financial Fraud: Investment and payment scams
• Identity Theft: Personal information harvesting
• Romance Scams: Relationship-based fraud
• Tech Support: Fake technical assistance
Quality Assurance
• Multi-annotator consensus
• Expert validation process
• Inter-annotator agreement metrics
• Continuous quality monitoring
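As an illustration of what an inter-annotator agreement metric measures, the sketch below computes Cohen's kappa for two annotators. It is a minimal example, not the dataset's actual QA tooling: the function name `cohen_kappa`, the label strings, and the toy annotations are all assumptions made for this example.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Toy annotations (hypothetical, not taken from the dataset).
annotator_1 = ["phishing", "phishing", "legitimate",
               "romance_scam", "legitimate", "phishing"]
annotator_2 = ["phishing", "legitimate", "legitimate",
               "romance_scam", "legitimate", "phishing"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, so it is usually more informative than percent agreement when one label dominates.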
Technical Specifications
Data Format
• JSON structured format
• Metadata preservation
• Privacy-compliant processing
• Standardized schema
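The sketch below shows how a single JSON record might be parsed and checked against a schema. The field names (`id`, `channel`, `language`, `text`, `label`), the channel values, and the sample record are hypothetical, chosen only for illustration. Consult the published schema for the real field names.

```python
import json

# Hypothetical field names and types; the real schema may differ.
REQUIRED_FIELDS = {
    "id": str,
    "channel": str,
    "language": str,
    "text": str,
    "label": str,
}
# Hypothetical channel values mirroring the communication types listed above.
CHANNELS = {"email", "sms", "social_media", "instant_message"}


def parse_record(line):
    """Parse one JSON object and check it against the assumed schema."""
    record = json.loads(line)
    if not isinstance(record, dict):
        raise ValueError("record must be a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], expected_type):
            raise ValueError(f"field {field!r} has the wrong type")
    if record["channel"] not in CHANNELS:
        raise ValueError(f"unknown channel: {record['channel']}")
    return record


# Made-up sample record, not taken from the dataset.
line = ('{"id": "msg-0001", "channel": "sms", "language": "en", '
        '"text": "Your parcel is held. Pay the fee at the link.", '
        '"label": "phishing"}')
record = parse_record(line)
print(record["channel"], record["label"])
```

Validating each record on load surfaces schema version mismatches early rather than partway through training.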
Privacy Protection
• Personal information anonymization
• Differential privacy techniques
• Compliance with GDPR/CCPA
• Ethical review board approval
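To make the anonymization step concrete, the sketch below masks email addresses and phone numbers with placeholder tokens. It is a simplified, regex-based illustration and not the dataset's actual pipeline: the patterns, the `redact` function, and the placeholder tokens are assumptions, and it does not cover differential privacy.

```python
import re

# Deliberately simple patterns; production anonymization needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text):
    """Replace email addresses and phone numbers with placeholder tokens."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text


# Made-up sample message.
sample = "Contact agent.smith@example.com or call +1 (555) 010-9999 today."
redacted = redact(sample)
print(redacted)
```

Keeping a typed placeholder such as `[EMAIL]` instead of deleting the span preserves the signal that the message contained contact details, which is often useful for fraud classification.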
Research Applications
Machine Learning
• Supervised learning model training
• Transfer learning across languages
• Few-shot learning evaluation
• Adversarial robustness testing
Natural Language Processing
• Text classification benchmarks
• Multilingual model evaluation
• Feature engineering research
• Semantic analysis studies
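For text classification benchmarks on imbalanced fraud categories, macro-averaged F1 is a common choice of metric. The sketch below computes it from scratch. The function name `macro_f1` and the toy labels are assumptions for illustration, not an official evaluation script.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("need two equal-length, non-empty label lists")
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels (hypothetical, not taken from the dataset).
y_true = ["phishing", "phishing", "legitimate", "legitimate"]
y_pred = ["phishing", "legitimate", "legitimate", "legitimate"]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.3f}")
```

Because every class counts equally, a model that ignores a rare category such as romance scams is penalized even if overall accuracy stays high.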
Access and Licensing
The dataset is available for academic and research purposes under our custom research license. Commercial applications require separate licensing agreements. All usage must comply with ethical guidelines and privacy regulations.
Contributing
We welcome contributions from the research community:
• Additional language samples
• New fraud category examples
• Improved annotation guidelines
• Quality enhancement suggestions
Maintenance and Updates
The dataset receives regular updates:
• Quarterly content additions
• Annual quality reviews
• Schema version updates
• Community feedback integration