sentencepiece: Text Tokenization using Byte Pair Encoding and Unigram Modelling

Unsupervised text tokenizer that performs byte pair encoding and unigram modelling. Wraps the 'sentencepiece' library <https://github.com/google/sentencepiece>, which provides a language-independent tokenizer to split text into words and smaller subword units. The techniques are explained in the paper "SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing" by Taku Kudo and John Richardson (2018) <doi:10.18653/v1/D18-2012>. Also provides straightforward access to pretrained byte pair encoding models and subword embeddings trained on Wikipedia using 'word2vec', as described in "BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages" by Benjamin Heinzerling and Michael Strube (2018) <http://www.lrec-conf.org/proceedings/lrec2018/pdf/1049.pdf>.

Version: 0.2.3
Depends: R (≥ 2.10)
Imports: Rcpp (≥ 0.11.5), stats
LinkingTo: Rcpp
Suggests: tokenizers.bpe, word2vec (≥ 0.2.0)
Published: 2022-11-13
DOI: 10.32614/CRAN.package.sentencepiece
Author: Jan Wijffels [aut, cre, cph] (R wrapper),
    BNOSAC [cph] (R wrapper),
    Google Inc. [ctb, cph] (Files at src/sentencepiece/src (Apache License, Version 2.0)),
    The Abseil Authors [ctb, cph] (Files at src/third_party/absl (Apache License, Version 2.0)),
    Google Inc. [ctb, cph] (Files at src/third_party/protobuf-lite (BSD-3 License)),
    Kenton Varda (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: coded_stream.cc, extension_set.cc, generated_message_util.cc, message_lite.cc, repeated_field.cc, wire_format_lite.cc, zero_copy_stream.cc, zero_copy_stream_impl_lite.cc, google/protobuf/extension_set.h, google/protobuf/generated_message_util.h, google/protobuf/wire_format_lite.h, google/protobuf/wire_format_lite_inl.h, google/protobuf/message_lite.h, google/protobuf/repeated_field.h, google/protobuf/io/coded_stream.h, google/protobuf/io/zero_copy_stream_impl_lite.h, google/protobuf/io/zero_copy_stream.h, google/protobuf/stubs/common.h, google/protobuf/stubs/hash.h, google/protobuf/stubs/once.h, google/protobuf/stubs/once.h.org (BSD-3 License)),
    Sanjay Ghemawat (Google Inc.) [ctb, cph] (Design of files at src/third_party/protobuf-lite: coded_stream.cc, extension_set.cc, generated_message_util.cc, message_lite.cc, repeated_field.cc, wire_format_lite.cc, zero_copy_stream.cc, zero_copy_stream_impl_lite.cc, google/protobuf/extension_set.h, google/protobuf/generated_message_util.h, google/protobuf/wire_format_lite.h, google/protobuf/wire_format_lite_inl.h, google/protobuf/message_lite.h, google/protobuf/repeated_field.h, google/protobuf/io/coded_stream.h, google/protobuf/io/zero_copy_stream_impl_lite.h, google/protobuf/io/zero_copy_stream.h (BSD-3 License)),
    Jeff Dean (Google Inc.) [ctb, cph] (Design of files at src/third_party/protobuf-lite: coded_stream.cc, extension_set.cc, generated_message_util.cc, message_lite.cc, repeated_field.cc, wire_format_lite.cc, zero_copy_stream.cc, zero_copy_stream_impl_lite.cc, google/protobuf/extension_set.h, google/protobuf/generated_message_util.h, google/protobuf/wire_format_lite.h, google/protobuf/wire_format_lite_inl.h, google/protobuf/message_lite.h, google/protobuf/repeated_field.h, google/protobuf/io/coded_stream.h, google/protobuf/io/zero_copy_stream_impl_lite.h, google/protobuf/io/zero_copy_stream.h (BSD-3 License)),
    Laszlo Csomor (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: io_win32.cc, google/protobuf/stubs/io_win32.h (BSD-3 License)),
    Wink Saville (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: message_lite.cc, google/protobuf/wire_format_lite.h, google/protobuf/wire_format_lite_inl.h, google/protobuf/message_lite.h (BSD-3 License)),
    Jim Meehan (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: structurally_valid.cc (BSD-3 License)),
    Chris Atenasio (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: google/protobuf/wire_format_lite.h (BSD-3 License)),
    Jason Hsueh (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: google/protobuf/io/coded_stream_inl.h (BSD-3 License)),
    Anton Carver (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: google/protobuf/stubs/map_util.h (BSD-3 License)),
    Maxim Lifantsev (Google Inc.) [ctb, cph] (Files at src/third_party/protobuf-lite: google/protobuf/stubs/mathlimits.h (BSD-3 License)),
    Susumu Yata [ctb, cph] (Files at src/third_party/darts_clone (BSD-3 License)),
    Daisuke Okanohara [ctb, cph] (File src/third_party/esaxx/esa.hxx (MIT License)),
    Yuta Mori [ctb, cph] (File src/third_party/esaxx/sais.hxx (MIT License)),
    Benjamin Heinzerling [ctb, cph] (Files data/models/nl.wiki.bpe.vs1000.d25.w2v.txt, data/models/nl.wiki.bpe.vs1000.d25.w2v.bin and data/models/nl.wiki.bpe.vs1000.model (MIT License))
Maintainer: Jan Wijffels <jwijffels at bnosac.be>
License: MPL-2.0
URL: https://github.com/bnosac/sentencepiece
NeedsCompilation: yes
Materials: README, NEWS
In views: NaturalLanguageProcessing
CRAN checks: sentencepiece results

Documentation:

Reference manual: sentencepiece.pdf

Downloads:

Package source: sentencepiece_0.2.3.tar.gz
Windows binaries: r-devel: sentencepiece_0.2.3.zip, r-release: sentencepiece_0.2.3.zip, r-oldrel: sentencepiece_0.2.3.zip
macOS binaries: r-release (arm64): sentencepiece_0.2.3.tgz, r-oldrel (arm64): sentencepiece_0.2.3.tgz, r-release (x86_64): sentencepiece_0.2.3.tgz, r-oldrel (x86_64): sentencepiece_0.2.3.tgz
Old sources: sentencepiece archive

Reverse dependencies:

Reverse suggests: textrecipes

Linking:

Please use the canonical form https://CRAN.R-project.org/package=sentencepiece to link to this page.