Metadata-Version: |
2.1 |
Name: |
trafilatura |
Version: |
1.12.2 |
Summary: |
Python package and command-line tool designed to gather text on the Web. It includes all necessary discovery and text-processing components to perform web crawling, downloads, scraping, and extraction of main texts, metadata, and comments. |
Author: |
Adrien Barbaresi |
Author-Email: |
barbaresi[at]bbaw.de |
Home-Page: |
https://trafilatura.readthedocs.io |
Project-Url: |
Documentation, https://trafilatura.readthedocs.io |
Project-Url: |
Source, https://github.com/adbar/trafilatura |
Project-Url: |
Blog, https://adrien.barbaresi.eu/blog/tag/trafilatura.html |
License: |
Apache-2.0 |
Keywords: |
corpus,html2text,news-crawler,natural-language-processing,scraper,tei-xml,text-extraction,webscraping,web-scraping |
Classifier: |
Development Status :: 5 - Production/Stable |
Classifier: |
Environment :: Console |
Classifier: |
Intended Audience :: Developers |
Classifier: |
Intended Audience :: Education |
Classifier: |
Intended Audience :: Information Technology |
Classifier: |
Intended Audience :: Science/Research |
Classifier: |
License :: OSI Approved :: Apache Software License |
Classifier: |
Operating System :: MacOS |
Classifier: |
Operating System :: Microsoft |
Classifier: |
Operating System :: POSIX |
Classifier: |
Programming Language :: Python |
Classifier: |
Programming Language :: Python :: 3 |
Classifier: |
Programming Language :: Python :: 3.6 |
Classifier: |
Programming Language :: Python :: 3.7 |
Classifier: |
Programming Language :: Python :: 3.8 |
Classifier: |
Programming Language :: Python :: 3.9 |
Classifier: |
Programming Language :: Python :: 3.10 |
Classifier: |
Programming Language :: Python :: 3.11 |
Classifier: |
Programming Language :: Python :: 3.12 |
Classifier: |
Topic :: Internet :: WWW/HTTP |
Classifier: |
Topic :: Scientific/Engineering :: Information Analysis |
Classifier: |
Topic :: Security |
Classifier: |
Topic :: Text Editors :: Text Processing |
Classifier: |
Topic :: Text Processing :: Linguistic |
Classifier: |
Topic :: Text Processing :: Markup :: HTML |
Classifier: |
Topic :: Text Processing :: Markup :: Markdown |
Classifier: |
Topic :: Text Processing :: Markup :: XML |
Classifier: |
Topic :: Utilities |
Requires-Python: |
>=3.6 |
Requires-Dist: |
certifi |
Requires-Dist: |
courlan (>=1.2.0) |
Requires-Dist: |
htmldate (>=1.8.1) |
Requires-Dist: |
justext (>=3.0.1) |
Requires-Dist: |
lxml (>=5.2.2); platform_system != "Darwin" or python_version > "3.8" |
Requires-Dist: |
lxml (==4.9.2); platform_system == "Darwin" and python_version <= "3.8" |
Requires-Dist: |
charset-normalizer (>=3.0.1); python_version < "3.7" |
Requires-Dist: |
urllib3 (<2,>=1.26); python_version < "3.7" |
Requires-Dist: |
importlib-metadata; python_version < "3.8" |
Requires-Dist: |
charset-normalizer (>=3.2.0); python_version >= "3.7" |
Requires-Dist: |
urllib3 (<3,>=1.26); python_version >= "3.7" |
Requires-Dist: |
brotli; extra == "all" |
Requires-Dist: |
htmldate[speed] (>=1.8.1); extra == "all" |
Requires-Dist: |
py3langid (>=0.2.2); extra == "all" |
Requires-Dist: |
pycurl (>=7.45.3); extra == "all" |
Requires-Dist: |
urllib3[socks]; extra == "all" |
Requires-Dist: |
zstandard (>=0.20.0); extra == "all" |
Requires-Dist: |
cchardet (>=2.1.7); python_version < "3.11" and extra == "all" |
Requires-Dist: |
faust-cchardet (>=2.1.19); python_version >= "3.11" and extra == "all" |
Requires-Dist: |
Gooey (>=1.0.1); extra == "gui" |
Provides-Extra: |
all |
Provides-Extra: |
gui |
Description-Content-Type: |
text/markdown |
License-File: |
LICENSE |