Reverse Dependencies of warcio
The following projects have a declared dependency on warcio:
- aiu — Tools for interacting with Archive-It.
- archive-query-log — Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives.
- auto-archiver — Automatically archive links to videos, images, and social media content from Google Sheets (and more).
- cdx-toolkit — A toolkit for working with CDX indices
- cdxj-indexer — CDXJ Indexer for WARC and ARC files
- CmonCrawl — no summary
- cocrawler — A modern web crawler framework for Python
- crau — Easy-to-use Web archiver
- crocoite — Save website to WARC using Google Chrome.
- datatrove — HuggingFace library to process and filter large amounts of webdata
- fch — A Python library to find historical Twitter follower count using the web archives
- forum-dl — Scrape posts and threads from forums, news aggregators, mail archives
- har2warc — Convert HTTP Archive (HAR) -> Web Archive (WARC) format
- html2tei — Map the HTML schema of portals to valid TEI XML with the tags and structures used in them using small manual portal-specific configurations.
- invisible-rabbit — Scalable Data Preprocessing Tool for Training Large Language Models
- invisible-unicorn — Scalable Data Preprocessing Tool for Training Large Language Models
- ipwb — InterPlanetary Wayback (ipwb): Web Archive integration with IPFS
- mailbagit — A tool for preserving email in multiple preservation formats.
- marginados-warc-scraper — Scrape Marginados data from a WARC file
- nemo-curator — Scalable Data Preprocessing Tool for Training Large Language Models
- news-please — news-please is an open source easy-to-use news extractor that just works.
- otmt — Tools for determining if web archive collections are Off-Topic
- pyplexity — Perplexity filter for documents and bulk HTML and WARC boilerplate removal.
- pywb — Pywb Webrecorder web archive replay and capture tools
- scrapy-webarchive — A webarchive extension for Scrapy
- smallpond — A lightweight data processing framework built on DuckDB and shared file system.
- sp-ccrawl — The base for commoncrawl analysis based on sparkcc
- warc-cache — Easy WARC records disk cache.
- warc-s3 — Scalable and easy WARC records storage on S3.
- warc2graph — Warc2graph extracts a graph data structure from WARC files.
- warc2summary — warc2summary
- warc2zim — Convert WARC to ZIM
- warcdb — WarcDB: Web crawl data as SQLite databases
- warcit — Convert Directories, Files and Zip Files to Web Archives (WARC)
- web-archive-api — Unified, type-safe access to web archive APIs.
- web-archive-get — a tool to find archived web pages from different websites using multiple different services
- webarticlecurator — A crawler program to download content from portals (news, forums, blogs) and convert it to the desired output format according to the configuration.
- webrefine — Workflow for refining datasets from World Wide Web data