trafilatura

View on PyPI | Reverse Dependencies (78)

1.12.2 trafilatura-1.12.2-py3-none-any.whl

Wheel Details

Project: trafilatura
Version: 1.12.2
Filename: trafilatura-1.12.2-py3-none-any.whl
Size: 132246 bytes
MD5: 64995f0e6bd6511a15ac7388a5487ca2
SHA256: 6df5b666f625c9579a50d7cc715005f450fa75606696aceab73eeda0a76dbe96
Uploaded: 2024-09-10 12:42:30 +0000
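
The size and SHA256 values listed above can be used to check a downloaded copy of the wheel. The following is a minimal sketch using only the standard library; the function names `size_and_sha256` and `matches_listing` are illustrative and do not come from trafilatura or from any packaging tool.

```python
import hashlib


def size_and_sha256(path, chunk_size=65536):
    """Return (size_in_bytes, sha256_hex) for the file at `path`.

    The file is read in chunks so that large archives are not loaded
    into memory all at once.
    """
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return size, digest.hexdigest()


def matches_listing(path, expected_size, expected_sha256):
    """Compare a local file against the size and SHA256 shown in a listing."""
    size, sha256_hex = size_and_sha256(path)
    return size == expected_size and sha256_hex == expected_sha256.lower()


# Example call (assumes the wheel was saved to the current directory):
# matches_listing(
#     "trafilatura-1.12.2-py3-none-any.whl",
#     132246,
#     "6df5b666f625c9579a50d7cc715005f450fa75606696aceab73eeda0a76dbe96",
# )
```

Only SHA256 is checked here; the MD5 value could be compared the same way, but it offers weaker guarantees.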

dist-info

METADATA

Metadata-Version: 2.1
Name: trafilatura
Version: 1.12.2
Summary: Python package and command-line tool designed to gather text on the Web, includes all necessary discovery and text processing components to perform web crawling, downloads, scraping, and extraction of main texts, metadata and comments.
Author: Adrien Barbaresi
Author-Email: barbaresi[at]bbaw.de
Home-Page: https://trafilatura.readthedocs.io
Project-Url: Documentation, https://trafilatura.readthedocs.io
Project-Url: Source, https://github.com/adbar/trafilatura
Project-Url: Blog, https://adrien.barbaresi.eu/blog/tag/trafilatura.html
License: Apache-2.0
Keywords: corpus,html2text,news-crawler,natural-language-processing,scraper,tei-xml,text-extraction,webscraping,web-scraping
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: Information Technology
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: MacOS
Classifier: Operating System :: Microsoft
Classifier: Operating System :: POSIX
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Internet :: WWW/HTTP
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Security
Classifier: Topic :: Text Editors :: Text Processing
Classifier: Topic :: Text Processing :: Linguistic
Classifier: Topic :: Text Processing :: Markup :: HTML
Classifier: Topic :: Text Processing :: Markup :: Markdown
Classifier: Topic :: Text Processing :: Markup :: XML
Classifier: Topic :: Utilities
Requires-Python: >=3.6
Requires-Dist: certifi
Requires-Dist: courlan (>=1.2.0)
Requires-Dist: htmldate (>=1.8.1)
Requires-Dist: justext (>=3.0.1)
Requires-Dist: lxml (>=5.2.2); platform_system != "Darwin" or python_version > "3.8"
Requires-Dist: lxml (==4.9.2); platform_system == "Darwin" and python_version <= "3.8"
Requires-Dist: charset-normalizer (>=3.0.1); python_version < "3.7"
Requires-Dist: urllib3 (<2,>=1.26); python_version < "3.7"
Requires-Dist: importlib-metadata; python_version < "3.8"
Requires-Dist: charset-normalizer (>=3.2.0); python_version >= "3.7"
Requires-Dist: urllib3 (<3,>=1.26); python_version >= "3.7"
Requires-Dist: brotli; extra == "all"
Requires-Dist: htmldate[speed] (>=1.8.1); extra == "all"
Requires-Dist: py3langid (>=0.2.2); extra == "all"
Requires-Dist: pycurl (>=7.45.3); extra == "all"
Requires-Dist: urllib3[socks]; extra == "all"
Requires-Dist: zstandard (>=0.20.0); extra == "all"
Requires-Dist: cchardet (>=2.1.7); python_version < "3.11" and extra == "all"
Requires-Dist: faust-cchardet (>=2.1.19); python_version >= "3.11" and extra == "all"
Requires-Dist: Gooey (>=1.0.1); extra == "gui"
Provides-Extra: all
Provides-Extra: gui
Description-Content-Type: text/markdown
License-File: LICENSE
[Description omitted; length: 10818 characters]
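
METADATA uses an email-style header layout, so the `Requires-Dist` entries above can be read with the standard library's `email.parser`. The sketch below separates each requirement from its environment marker (the part after `;`) and extracts the extra name when there is one. It assumes simple markers like those shown here and does not evaluate them; the helper names and the shortened `SAMPLE` text are illustrative only.

```python
import re
from email.parser import Parser

# A shortened copy of the fields listed above, for illustration.
SAMPLE = """\
Metadata-Version: 2.1
Name: trafilatura
Version: 1.12.2
Requires-Dist: certifi
Requires-Dist: courlan (>=1.2.0)
Requires-Dist: lxml (>=5.2.2); platform_system != "Darwin" or python_version > "3.8"
Requires-Dist: Gooey (>=1.0.1); extra == "gui"
"""


def parse_requirements(metadata_text):
    """Return a list of (requirement, marker_or_None) pairs."""
    message = Parser().parsestr(metadata_text)
    pairs = []
    for value in message.get_all("Requires-Dist", []):
        requirement, _, marker = value.partition(";")
        pairs.append((requirement.strip(), marker.strip() or None))
    return pairs


def extra_name(marker):
    """Return the extra named in a marker such as 'extra == "gui"', else None."""
    if marker is None:
        return None
    match = re.search(r"""extra\s*==\s*["']([^"']+)["']""", marker)
    return match.group(1) if match else None


for requirement, marker in parse_requirements(SAMPLE):
    print(requirement, "|", marker, "|", extra_name(marker))
```

Requirements whose marker names an extra are installed only when that extra is requested, for example with `pip install trafilatura[gui]`.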

WHEEL

Wheel-Version: 1.0
Generator: setuptools (74.1.2)
Root-Is-Purelib: true
Tag: py3-none-any
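
The tag `py3-none-any` also appears in the wheel filename, which follows the pattern `distribution-version[-build]-python-abi-platform.whl`. As a rough sketch, the filename can be split into those components as follows; `WheelName` and `parse_wheel_filename` are hypothetical names chosen for this example.

```python
from typing import NamedTuple, Optional


class WheelName(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename):
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        distribution, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        distribution, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError(f"unexpected number of components in {filename!r}")
    return WheelName(distribution, version, build, python_tag, abi_tag, platform_tag)


print(parse_wheel_filename("trafilatura-1.12.2-py3-none-any.whl"))
```

Here `py3` means any Python 3 interpreter, `none` means no ABI requirement, and `any` means no platform restriction, which is consistent with `Root-Is-Purelib: true`.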

RECORD

Path Digest Size
trafilatura/__init__.py sha256=VSg0wcT0_bwf-chIxmZZeAKGOCeRpxrxVNyobhbRWUM 630
trafilatura/baseline.py sha256=3A1znAjj4ZMPRTXwxHIiNvWPSu5UCqSf-lP02KmMkNM 3921
trafilatura/cli.py sha256=Ck7VxDBglGXPJ1X9Mk9z9p473-6xz4uwJYnXi0xCzh4 12805
trafilatura/cli_utils.py sha256=cXaZ8MQ_e5RtzWW9pvMvekp6xzMQy9hnqe49xl1Q9EQ 16990
trafilatura/core.py sha256=upNodd9wqXAvameSPCSP6jO7jjHrZapMAw1F9rZymUc 16910
trafilatura/deduplication.py sha256=OMntXvr-uxGu_2lV_YTLpIrS6aBso-PSJ2gou0uIYU4 9651
trafilatura/downloads.py sha256=jFx1rbrLfNHCXENMSUfYZ_ztG8eogFQcGbUjOsvggkE 16024
trafilatura/external.py sha256=aaAQ_-Fg8tZx7bP_KZbIqSg717PRytTNAd2N4ICTja4 7294
trafilatura/feeds.py sha256=1y1qKTG-tlfFdwXAc44RHI6SuhVwZi_B4THTXByG_70 9666
trafilatura/gui.py sha256=CClJtffvEmV7lFKZaNM8BBdUZuP6pHOa4MTNdzYPMSo 1755
trafilatura/htmlprocessing.py sha256=JoJ7NOaocS0HsJoIZRZ_gKE9PXQ1IbU_VnrATfLDRpY 13548
trafilatura/json_metadata.py sha256=6YfZ8_2RccBTG-5Cld0W5rjqneiUedXgXSxUaw4BRYg 13051
trafilatura/main_extractor.py sha256=4OFfyvOX7r5Lm09GdnoyRRktViA6KOqP1-fYx6oxoHA 27897
trafilatura/meta.py sha256=bu2SIdAqMceJi-bIWhBC4YG2tSaa2B518kaSxm3QnYE 915
trafilatura/metadata.py sha256=R6mXhoNNi5g7VR8CV-G-snVK5XCGugeUXkOq4oLbemI 19388
trafilatura/readability_lxml.py sha256=ORjkIXyfbeJNaKuMvcswg-8x_-wgdq-jpcST0bJaI8c 19072
trafilatura/settings.cfg sha256=IuClQnwBQffOXk2j31pVo6V-o5cdnjjZVKTEd9iyQaU 739
trafilatura/settings.py sha256=rIDOigPm5IROD_TiGpPxqhaQAURUnJFkHcJBPRRjLG8 9701
trafilatura/sitemaps.py sha256=YkuuBxfY1aLeqGByDuRzqABnR7jBTr9vXlBVKsbH_rk 9822
trafilatura/spider.py sha256=foxYm97J0T7FyVSEUJSSZONDPItrq7ZDK8nFi03fZ-k 12051
trafilatura/utils.py sha256=ieVGU5K000V62jhKZOXMG6YKypgJ5r51YbVmdcJjxZI 16905
trafilatura/xml.py sha256=klz4wIlvbyQXodO_9L4injozZBWAdZShqo5jEqJq2yo 23604
trafilatura/xpaths.py sha256=x2FxeB3wyn-G6HijbK-j0mDNjbupbXkBuW_woSLsIFg 15194
trafilatura/data/tei_corpus.dtd sha256=UB5xeOqR0n2uXY6itoSZlSsOltIjra1UM8pJL_jXHZc 196033
trafilatura-1.12.2.dist-info/LICENSE sha256=psuoW8kuDP96RQsdhzwOqi6fyWv0ct8CR6Jr7He_P_k 10173
trafilatura-1.12.2.dist-info/METADATA sha256=-ylqhqm8i6-8opoBsfmePRiB4IpjTEUBmlYaULIIMP4 14141
trafilatura-1.12.2.dist-info/WHEEL sha256=cVxcB9AmuTcXqmwrtPhNK88dr7IR_b6qagTj0UvIEbY 91
trafilatura-1.12.2.dist-info/entry_points.txt sha256=G-TALznoHb9Ad0G2dVyBlvbbRSoMRjY3kNT3bzJeGiw 92
trafilatura-1.12.2.dist-info/top_level.txt sha256=FNlkTX9sAktQsHwwXze9RAexePfOXsqPY9cF86PNlnE 12
trafilatura-1.12.2.dist-info/RECORD
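
Each digest in RECORD is the SHA256 hash of the file, encoded as URL-safe Base64 with the trailing `=` padding removed, which is why it looks different from the hexadecimal SHA256 of the wheel itself. The sketch below computes a digest in that form and parses RECORD rows. It assumes the on-disk file is comma-separated (this page displays the columns separated by spaces), and the function names are illustrative.

```python
import base64
import csv
import hashlib
import io


def record_digest(data):
    """Return the RECORD-style digest string for the given bytes."""
    raw = hashlib.sha256(data).digest()
    encoded = base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    return "sha256=" + encoded


def parse_record(text):
    """Map each path in a RECORD file to (digest_or_None, size_or_None)."""
    entries = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        path, digest, size = row
        entries[path] = (digest or None, int(size) if size else None)
    return entries


# Two rows from the listing above, written in comma-separated form.
SAMPLE_RECORD = (
    "trafilatura/meta.py,sha256=bu2SIdAqMceJi-bIWhBC4YG2tSaa2B518kaSxm3QnYE,915\n"
    "trafilatura-1.12.2.dist-info/RECORD,,\n"
)

for path, (digest, size) in parse_record(SAMPLE_RECORD).items():
    print(path, digest, size)
```

To check an installed file, read its bytes and compare `record_digest(data)` and `len(data)` with the parsed entry. The RECORD file lists itself without a digest or size, since it cannot contain its own hash.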

top_level.txt

trafilatura

entry_points.txt

trafilatura = trafilatura.cli:main
trafilatura_gui = trafilatura.gui:main
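
Each entry point line has the form `name = module:attribute`, so `trafilatura = trafilatura.cli:main` maps the command `trafilatura` to the `main` function in `trafilatura/cli.py`. The group header of the file is not shown on this page, so the sketch below only illustrates how one such line can be resolved to a Python object. `load_entry_point_line` is a hypothetical helper, it ignores optional extras in square brackets, and the demonstration uses a standard-library target so that it runs without trafilatura installed.

```python
import importlib


def load_entry_point_line(line):
    """Resolve a 'name = module:attr' line to (name, object)."""
    name, _, target = line.partition("=")
    module_name, _, attr_path = target.strip().partition(":")
    obj = importlib.import_module(module_name.strip())
    # Walk dotted attribute paths such as "Class.method"; an empty path
    # leaves the module itself as the result.
    for attr in filter(None, attr_path.strip().split(".")):
        obj = getattr(obj, attr)
    return name.strip(), obj


# Demonstration with a standard-library function instead of trafilatura.cli:main.
command, func = load_entry_point_line("to_json = json:dumps")
print(command, func([1, 2, 3]))
```

With trafilatura installed, passing `"trafilatura = trafilatura.cli:main"` would be expected to return the same callable that the installed `trafilatura` command invokes.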