Functions for extracting text and tables from
PDF-based order documents. The package provides an n-gram-based approach for identifying
the language of an order document. It uses the R package 'pdftools' to
extract the text from an order document. If the PDF document
contains only an image (because it is a scanned document), the R package 'tesseract'
is used for OCR. The package also provides functionality for identifying
and extracting order position tables in order documents based on a clustering approach.
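
The n-gram idea mentioned above can be illustrated with a minimal, language-agnostic sketch. It is written in Python for brevity and is not the package's R API or its actual algorithm. The reference texts, the function names (`char_ngrams`, `cosine_similarity`, `detect_language`), and the choice of character trigrams with cosine similarity are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt

N = 3  # assumed n-gram length (character trigrams)

# Tiny, made-up reference texts. A real system would build its profiles
# from much larger corpora; these exist only to make the sketch runnable.
REFERENCE_TEXTS = {
    "en": "please find the order below and deliver the items to the following "
          "address the quantity and the price of each position are listed in the table",
    "de": "bitte liefern sie die bestellung an die folgende adresse die menge "
          "und der preis jeder position sind in der tabelle aufgelistet",
}


def char_ngrams(text, n=N):
    """Count overlapping character n-grams of the lowercased, space-padded text."""
    normalized = " ".join(text.lower().split())
    padded = f" {normalized} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# One n-gram profile per candidate language.
PROFILES = {lang: char_ngrams(text) for lang, text in REFERENCE_TEXTS.items()}


def detect_language(text):
    """Return the language whose profile is most similar to the text."""
    query = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine_similarity(query, PROFILES[lang]))


if __name__ == "__main__":
    print(detect_language("Please deliver the order to the address listed below"))
    print(detect_language("Bitte liefern Sie die Bestellung an die folgende Adresse"))
```

Character n-grams are a common choice for this task because they need no tokenizer or dictionary and tolerate some OCR noise, which matters when the text comes from scanned documents.
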
Version: | 1.0.0 |
Depends: | R (≥ 4.3.0), tidyselect |
Imports: | data.table, dplyr, matrixcalc, quanteda, rlist, stringr, tibble, tidyr, utils, purrr, digest, lubridate |
Suggests: | pdftools, tesseract, xml2 |
Published: | 2024-12-12 |
DOI: | 10.32614/CRAN.package.orderanalyzer |
Author: | Michael Scholz [cre, aut], Joerg Bauer [aut] |
Maintainer: | Michael Scholz <michael.scholz at th-deg.de> |
License: | GPL-3 |
NeedsCompilation: | no |
SystemRequirements: | Tesseract >= 5.0.0, libtesseract-dev (deb), tesseract-devel (rpm), libleptonica-dev (deb), leptonica-devel (rpm), tesseract-ocr-eng (deb), libpoppler-cpp-dev (deb), poppler-cpp-devel (rpm), poppler-data (rpm/deb), libxml2-dev (deb), libxml2-devel (rpm) |
CRAN checks: | orderanalyzer results |