A deep dive into how Hugging Face created the FineWeb dataset: starting from Common Crawl snapshots, extracting high-quality text from raw web data, filtering noisy content, deduplicating at web scale, and building FineWeb-Edu with model-assisted educational quality filtering.
---
🔗 Links
- FineWeb dataset: https://huggingface.co/datasets/HuggingFaceFW/fineweb
- FineWeb-Edu dataset: https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu
- FineWeb paper: https://arxiv.org/abs/2406.17557
- FineWeb blog post: https://huggingface.co/spaces/HuggingFaceFW/blogpost-fineweb-v1
- Common Crawl: https://commoncrawl.org/
- Trafilatura: https://trafilatura.readthedocs.io/
---
👋 Connect with me
- My website: https://alejandro-ao.com/
- X (Twitter): https://x.com/_alejandroao
- LinkedIn: https://www.linkedin.com/in/alejandro-ao/
---
🤓 Topics Covered
- FineWeb dataset creation pipeline
- Common Crawl filtering and deduplication
- FineWeb-Edu educational data filtering
---
⏱️ Timestamps
00:00 Introduction
00:52 Why FineWeb matters
02:58 Common Crawl as data source
06:10 Base filtering techniques
07:17 Deduplication within snapshots
13:05 C4-style quality filters
17:20 FineWeb-Edu extraction
21:41 Key lessons learned
24:21 Synthetic data on the web
28:39 Conclusion