Reverse Dependencies of warcio
The following projects have a declared dependency on warcio:
- aiu — Tools for interacting with Archive-It.
- archive-query-log — Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives.
- auto-archiver — Automatically archive links to videos, images, and social media content from Google Sheets (and more).
- cdx-toolkit — A toolkit for working with CDX indices
- cdxj-indexer — CDXJ Indexer for WARC and ARC files
- CmonCrawl — no summary
- cocrawler — A modern web crawler framework for Python
- crau — Easy-to-use Web archiver
- crocoite — Save website to WARC using Google Chrome.
- datatrove — HuggingFace library to process and filter large amounts of webdata
- fch — A Python library to find historical Twitter follower count using the web archives
- forum-dl — Scrape posts and threads from forums, news aggregators, mail archives
- har2warc — Convert HTTP Archive (HAR) -> Web Archive (WARC) format
- html2tei — Map the HTML schema of portals to valid TEI XML with the tags and structures used in them using small manual portal-specific configurations.
- invisible-rabbit — Scalable Data Preprocessing Tool for Training Large Language Models
- invisible-unicorn — Scalable Data Preprocessing Tool for Training Large Language Models
- ipwb — InterPlanetary Wayback (ipwb): Web Archive integration with IPFS
- mailbagit — A tool for preserving email in multiple preservation formats.
- marginados-warc-scraper — Scrape Marginados data from a WARC file
- nemo-curator — Scalable Data Preprocessing Tool for Training Large Language Models
- news-please — news-please is an open source easy-to-use news extractor that just works.
- otmt — Tools for determining if web archive collections are Off-Topic
- pyplexity — Perplexity filter for documents and bulk HTML and WARC boilerplate removal.
- pywb — Pywb Webrecorder web archive replay and capture tools
- scrapy-webarchive — A webarchive extension for Scrapy
- smallpond — A lightweight data processing framework built on DuckDB and shared file system.
- sp-ccrawl — The base for commoncrawl analysis based on sparkcc
- warc-cache — Easy WARC records disk cache.
- warc-s3 — Scalable and easy WARC records storage on S3.
- warc2graph — Warc2graph extracts a graph data structure from WARC files.
- warc2summary — warc2summary
- warc2zim — Convert WARC to ZIM
- warcdb — WarcDB: Web crawl data as SQLite databases
- warcit — Convert Directories, Files and Zip Files to Web Archives (WARC)
- web-archive-api — Unified, type-safe access to web archive APIs.
- web-archive-get — a tool to find archived web pages from different websites using multiple different services
- webarticlecurator — A crawler program to download content from portals (news, forums, blogs) and convert it to the desired output format according to the configuration.
- webrefine — Workflow for refining datasets from World Wide Web data