trafilatura


2.0.0 trafilatura-2.0.0-py3-none-any.whl

Wheel Details

Project: trafilatura
Version: 2.0.0
Filename: trafilatura-2.0.0-py3-none-any.whl
Size: 132557 bytes
MD5: b9fd6ba484493b8c7c0ed8f73fce1cb6
SHA256: 77eb5d1e993747f6f20938e1de2d840020719735690c840b9a1024803a4cd51d
Uploaded: 2024-12-03 15:23:21 +0000
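
The SHA256 value above can be used to check a downloaded copy of the wheel. The following is a minimal sketch using only the standard library; the helper names `sha256_of_file` and `matches_listing` are illustrative and not part of trafilatura or any packaging tool.

```python
import hashlib

# Hex digest copied from the wheel details listed above.
EXPECTED_SHA256 = "77eb5d1e993747f6f20938e1de2d840020719735690c840b9a1024803a4cd51d"


def sha256_of_file(path, chunk_size=1 << 16):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_listing(path, expected=EXPECTED_SHA256):
    """Return True if the file at `path` has the digest listed for this wheel."""
    return sha256_of_file(path) == expected
```

Calling `matches_listing("trafilatura-2.0.0-py3-none-any.whl")` on a local download should return `True` only if the file is byte-for-byte identical to the one described here.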

dist-info

METADATA

Metadata-Version: 2.1
Name: trafilatura
Version: 2.0.0
Summary: Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML.
Author-Email: Adrien Barbaresi <barbaresi[at]bbaw.de>
Project-Url: Homepage, https://trafilatura.readthedocs.io
Project-Url: Source, https://github.com/adbar/trafilatura
Project-Url: Blog, https://adrien.barbaresi.eu/blog/tag/trafilatura.html
Project-Url: Tracker, https://github.com/adbar/trafilatura/issues
License: Apache 2.0
Keywords: corpus,html2text,news-crawler,natural-language-processing,scraper,tei-xml,text-extraction,webscraping,web-scraping
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft
Classifier: Operating System :: POSIX
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Security
Classifier: Topic :: Text Editors :: Text Processing
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Text Processing :: Markup :: HTML
Classifier: Topic :: Text Processing :: Markup :: Markdown
Classifier: Topic :: Text Processing :: Markup :: XML
Classifier: Topic :: Utilities
Requires-Python: >=3.8
Requires-Dist: certifi
Requires-Dist: charset_normalizer (>=3.4.0)
Requires-Dist: courlan (>=1.3.2)
Requires-Dist: htmldate (>=1.9.2)
Requires-Dist: justext (>=3.0.1)
Requires-Dist: lxml (==4.9.2); platform_system == "Darwin" and python_version <= "3.8"
Requires-Dist: lxml (>=5.3.0); platform_system != "Darwin" or python_version > "3.8"
Requires-Dist: urllib3 (<3,>=1.26)
Requires-Dist: flake8; extra == "dev"
Requires-Dist: mypy; extra == "dev"
Requires-Dist: pytest; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: types-lxml; extra == "dev"
Requires-Dist: types-urllib3; extra == "dev"
Requires-Dist: brotli; extra == "all"
Requires-Dist: cchardet (>=2.1.7); python_version < "3.11" and extra == "all"
Requires-Dist: faust-cchardet (>=2.1.19); python_version >= "3.11" and extra == "all"
Requires-Dist: htmldate[speed] (>=1.9.2); extra == "all"
Requires-Dist: py3langid (>=0.3.0); extra == "all"
Requires-Dist: pycurl (>=7.45.3); extra == "all"
Requires-Dist: urllib3[socks]; extra == "all"
Requires-Dist: zstandard (>=0.23.0); extra == "all"
Provides-Extra: dev
Provides-Extra: all
Description-Content-Type: text/markdown
License-File: LICENSE
[Description omitted; length: 9647 characters]
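
The two `lxml` requirements above carry complementary environment markers, so exactly one of them applies on any given interpreter. The sketch below mirrors that logic in plain Python as an illustration; `lxml_specifier` is a hypothetical helper, and real installers evaluate markers with their own machinery rather than code like this.

```python
def lxml_specifier(platform_system, python_version):
    """Return the lxml version specifier selected by the two markers above.

    `python_version` is the "major.minor" string used in markers, e.g. "3.8".
    """
    major, minor = (int(part) for part in python_version.split(".")[:2])
    # First marker: platform_system == "Darwin" and python_version <= "3.8"
    old_macos = platform_system == "Darwin" and (major, minor) <= (3, 8)
    # Second marker is the logical negation of the first.
    return "==4.9.2" if old_macos else ">=5.3.0"


print(lxml_specifier("Darwin", "3.8"))   # ==4.9.2
print(lxml_specifier("Linux", "3.8"))    # >=5.3.0
print(lxml_specifier("Darwin", "3.12"))  # >=5.3.0
```

Comparing `(major, minor)` tuples avoids the pitfall of comparing version strings lexically, where `"3.13"` would sort before `"3.8"`.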

WHEEL

Wheel-Version: 1.0
Generator: setuptools (75.6.0)
Root-Is-Purelib: true
Tag: py3-none-any

RECORD

Path Digest Size (bytes)
trafilatura/__init__.py sha256=X932hB00C6KrDeT3TbZwRug4EW8J5tAU8xwk5h9trkw 756
trafilatura/baseline.py sha256=6vEFWkaiZ8YCq8cICfOZuJMC7s-ndA1BFYAoKgZrP8Q 4029
trafilatura/cli.py sha256=pX7EYYt0TZCDBosbqdo3ufq786hSTfAMoOR0GS4OA_k 10341
trafilatura/cli_utils.py sha256=U5SVH7m7CFNW2mXu3gEbX5pPBMNqXHbgtbyHnADu3gk 17590
trafilatura/core.py sha256=jLG3-vgnvk8-G1sU4aijLhckmNkcdsPKLTdtmBjn-II 18611
trafilatura/deduplication.py sha256=LXNoSCYnTi5Bm9bdf6kT-v2LwsBcUYhTsN02BkWh6W4 9644
trafilatura/downloads.py sha256=vu05IM8d0VABnNAG9TM5YgOYSSu13ZNWDKmSio-ZwF0 17281
trafilatura/external.py sha256=lOYPqNdycFA5u8_5qAqOY9lYRcoll5osbmlmQ-5YE5k 7631
trafilatura/feeds.py sha256=Vr6miPHazXr30B4HL4P47E2laLJ1SbYv9aOTYhM4908 9697
trafilatura/htmlprocessing.py sha256=GI0CQ9R4-erwUaSIgMOFaq3-LdQmKWe8jEPMT_R_DDQ 14672
trafilatura/json_metadata.py sha256=m3pQVaXwMgJsIJDNAnvlSLydJgGgivDRAMrspuC4d9s 13020
trafilatura/main_extractor.py sha256=vvqabD3TVxcLwEV2M2tmiVcim1vg2CnRy2owYqoW_K4 29582
trafilatura/meta.py sha256=vDQ3xUBbwdJ4rWiURme4WzVlSFJRTezXzoBngRi4qCg 931
trafilatura/metadata.py sha256=XPYK49Y7Rk2Klo2Tu5NDWmu-4mtr08K_sPdkcon-vKs 19110
trafilatura/py.typed sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU 0
trafilatura/readability_lxml.py sha256=crNfDFUNOMM3u820B5hqtPVcjIqw8wdiY3SC3z1q6Cw 19294
trafilatura/settings.cfg sha256=kTF3BW5g9gxB1t0rHhnDgKHJ_z-GmcVqkjK7OVXnRZ8 830
trafilatura/settings.py sha256=Q1oa6SY7F3Omlyx0vns_sh6GI7PpjdjSOd0ZuzObOkM 12484
trafilatura/sitemaps.py sha256=XWKF-bw3BsAtw6HziQXrdmfG8ZzgizgrpSgSjA7lGGQ 9841
trafilatura/spider.py sha256=RGDUUGZuLwX4QVojBftB9-3_mQzCNKM4fvJD3Da6cKw 12132
trafilatura/utils.py sha256=VU1KEOrZoKKEAtnY1JbRvoanfNvVeXP9vjA9INqBEgY 17436
trafilatura/xml.py sha256=ztG1ZKq-OLgOD2DtMp3hYx3_UjA1QL_9z107qWmfMts 23470
trafilatura/xpaths.py sha256=x2FxeB3wyn-G6HijbK-j0mDNjbupbXkBuW_woSLsIFg 15194
trafilatura/data/tei_corpus.dtd sha256=UB5xeOqR0n2uXY6itoSZlSsOltIjra1UM8pJL_jXHZc 196033
trafilatura-2.0.0.dist-info/LICENSE sha256=psuoW8kuDP96RQsdhzwOqi6fyWv0ct8CR6Jr7He_P_k 10173
trafilatura-2.0.0.dist-info/METADATA sha256=8sMCA2jLJ8Vj-li8nmqjULdJ1_7LYvHx_s1WoO8vwOw 12765
trafilatura-2.0.0.dist-info/WHEEL sha256=PZUExdf71Ui_so67QXpySuHtCi3-J3wvF4ORK6k_S8U 91
trafilatura-2.0.0.dist-info/entry_points.txt sha256=Y8rgPtCp7nrr_zEnxJr1nbEBbEhsd8rKma2UCZulTp8 53
trafilatura-2.0.0.dist-info/top_level.txt sha256=FNlkTX9sAktQsHwwXze9RAexePfOXsqPY9cF86PNlnE 12
trafilatura-2.0.0.dist-info/RECORD
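
Each digest in the listing above is a SHA256 hash encoded as URL-safe base64 with the trailing `=` padding removed. The sketch below reproduces that encoding and checks it against the `trafilatura/py.typed` row, whose listed size is 0 bytes. The helper names are illustrative, and the space-separated row format follows this listing; the RECORD file inside a wheel is comma-separated.

```python
import base64
import hashlib


def record_digest(data: bytes) -> str:
    """Return a RECORD-style digest string for the given file contents."""
    raw = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    return "sha256=" + encoded


def check_record_row(row: str, data: bytes) -> bool:
    """Check one 'path digest size' row from the listing against file contents."""
    _path, digest, size = row.rsplit(" ", 2)
    return digest == record_digest(data) and int(size) == len(data)


# py.typed is an empty marker file, so its contents are b"".
row = "trafilatura/py.typed sha256=47DEQpj8HBSa-_TImW-5JCeuQeRkm5NMpJWZG3hSuFU 0"
print(check_record_row(row, b""))  # True
```

The final `RECORD` row has no digest or size because the file cannot contain a hash of itself.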

top_level.txt

trafilatura

entry_points.txt

trafilatura = trafilatura.cli:main
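
The entry point above maps the command name `trafilatura` to the function `main` in the module `trafilatura.cli`. A minimal sketch of how that string is interpreted, using the standard library's `importlib.metadata` (Python 3.9 or later for the `module` and `attr` attributes), follows. The group name `console_scripts` is an assumption, since the listing does not show the section header of `entry_points.txt`.

```python
from importlib.metadata import EntryPoint

# Group name is assumed; the listing shows only the name/value pair.
entry = EntryPoint(
    name="trafilatura",
    value="trafilatura.cli:main",
    group="console_scripts",
)

print(entry.module)  # trafilatura.cli
print(entry.attr)    # main

# With the package installed, entry.load() would import trafilatura.cli
# and return its `main` callable; it is not called here.
```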