Extracting Order Position Tables from PDF-Based Order Documents [R package orderanalyzer version 1.0.0]

orderanalyzer: Extracting Order Position Tables from PDF-Based Order Documents

Functions for extracting text and tables from PDF-based order documents. The package provides an n-gram-based approach for identifying the language of an order document. It uses the R package 'pdftools' to extract the text from an order document. If the PDF document contains only an image (because it is a scanned document), the R package 'tesseract' is used for OCR. The package also provides functionality for identifying and extracting order position tables in order documents based on a clustering approach.
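
The n-gram idea mentioned in the description can be illustrated with a small base-R sketch. This is not the package's implementation: the function names (ngram_profile, profile_similarity, identify_language), the use of character trigrams, the cosine similarity measure, and the two tiny reference texts are all assumptions chosen for illustration. A real language identifier would be trained on much larger reference corpora.

```r
# Hypothetical helper: relative frequencies of character n-grams in a text.
ngram_profile <- function(text, n = 3) {
  # Lower-case the text and collapse every run of non-letters to one space.
  clean <- tolower(gsub("[^[:alpha:]]+", " ", text))
  # Pad with spaces so that word boundaries are part of the n-grams.
  clean <- paste0(" ", trimws(clean), " ")
  if (nchar(clean) < n) return(setNames(numeric(0), character(0)))
  starts <- seq_len(nchar(clean) - n + 1)
  grams <- substring(clean, starts, starts + n - 1)
  counts <- table(grams)
  setNames(as.numeric(counts) / length(grams), names(counts))
}

# Hypothetical helper: cosine similarity between two n-gram profiles.
profile_similarity <- function(p, q) {
  shared <- intersect(names(p), names(q))
  if (length(shared) == 0) return(0)
  sum(p[shared] * q[shared]) / (sqrt(sum(p^2)) * sqrt(sum(q^2)))
}

# Return the name of the reference text whose profile is most similar.
identify_language <- function(text, reference_texts, n = 3) {
  target <- ngram_profile(text, n)
  scores <- vapply(
    reference_texts,
    function(ref) profile_similarity(target, ngram_profile(ref, n)),
    numeric(1)
  )
  names(scores)[which.max(scores)]
}

# Tiny illustrative reference texts (assumed, far too small for real use).
reference_texts <- c(
  en = "the order contains the following items and the total price is due within thirty days of delivery",
  de = "die bestellung umfasst die folgenden artikel und der gesamtpreis ist innerhalb von dreissig tagen zu zahlen"
)

identify_language("Please deliver the following items to our warehouse", reference_texts)
identify_language("Bitte liefern Sie die folgenden Artikel an unser Lager", reference_texts)
```

Character n-grams are a common choice for this task because they capture typical letter sequences of a language without requiring a dictionary, so they tolerate the OCR noise and abbreviations that are frequent in order documents.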

Version:	1.0.0
Depends:	R (≥ 4.3.0), tidyselect
Imports:	data.table, dplyr, matrixcalc, quanteda, rlist, stringr, tibble, tidyr, utils, purrr, digest, lubridate
Suggests:	pdftools, tesseract, xml2
Published:	2024-12-12
DOI:	10.32614/CRAN.package.orderanalyzer
Author:	Michael Scholz [cre, aut], Joerg Bauer [aut]
Maintainer:	Michael Scholz <michael.scholz at th-deg.de>
License:	GPL-3
NeedsCompilation:	no
SystemRequirements:	Tesseract >= 5.0.0, libtesseract-dev (deb), tesseract-devel (rpm), libleptonica-dev (deb), leptonica-devel (rpm), tesseract-ocr-eng (deb), libpoppler-cpp-dev (deb), poppler-cpp-devel (rpm), poppler-data (rpm/deb), libxml2-dev (deb), libxml2-devel (rpm)
CRAN checks:	orderanalyzer results

Documentation:

Reference manual:	orderanalyzer.pdf

Downloads:

Package source:	orderanalyzer_1.0.0.tar.gz
Windows binaries:	r-devel: orderanalyzer_1.0.0.zip, r-release: orderanalyzer_1.0.0.zip, r-oldrel: orderanalyzer_1.0.0.zip
macOS binaries:	r-devel (arm64): orderanalyzer_1.0.0.tgz, r-release (arm64): orderanalyzer_1.0.0.tgz, r-oldrel (arm64): orderanalyzer_1.0.0.tgz, r-devel (x86_64): orderanalyzer_1.0.0.tgz, r-release (x86_64): orderanalyzer_1.0.0.tgz, r-oldrel (x86_64): orderanalyzer_1.0.0.tgz

Linking:

Please use the canonical form https://CRAN.R-project.org/package=orderanalyzer to link to this page.