tokenizers

Fast, Consistent Tokenization of Natural Language Text

Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, tweets, Penn Treebank, regular expressions, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.
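The consistent interface described above can be sketched as follows (a minimal example, assuming the package is installed): every `tokenize_*` function takes a character vector of documents and returns a list with one character vector of tokens per document.

```r
library(tokenizers)

text <- "The quick brown fox jumps over the lazy dog."

# Words: lowercased, with punctuation stripped by default
tokenize_words(text)
#> [[1]]
#> [1] "the"   "quick" "brown" "fox"   "jumps" "over"  "the"   "lazy"  "dog"

# Shingled n-grams (here bigrams) over the same word tokens
tokenize_ngrams(text, n = 2)

# Sentences and characters return the same list-of-vectors shape
tokenize_sentences("First sentence. Second sentence.")
tokenize_characters("abc")
```

Because every tokenizer returns the same list-of-vectors structure, downstream code can swap one tokenizer for another without changes.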

Downloads

Total: 270,000
Last month: 26,831
Last week: 5,607
Average per day: 894

Description file content

Package: tokenizers
Type: Package
Title: Fast, Consistent Tokenization of Natural Language Text
Version: 0.2.1
Date: 2018-03-29
Description: Convert natural language text into tokens. Includes tokenizers for shingled n-grams, skip n-grams, words, word stems, sentences, paragraphs, characters, shingled characters, lines, tweets, Penn Treebank, regular expressions, as well as functions for counting characters, words, and sentences, and a function for splitting longer texts into separate documents, each with the same number of words. The tokenizers have a consistent interface, and the package is built on the 'stringi' and 'Rcpp' packages for fast yet correct tokenization in 'UTF-8'.
License: MIT + file LICENSE
LazyData: yes
URL: https://lincolnmullen.com/software/tokenizers/
BugReports: https://github.com/ropensci/tokenizers/issues
RoxygenNote: 6.0.1
Depends: R (>= 3.1.3)
Imports: stringi (>= 1.0.1), Rcpp (>= 0.12.3), SnowballC (>= 0.5.1)
LinkingTo: Rcpp
Suggests: covr, knitr, rmarkdown, stopwords (>= 0.9.0), testthat
VignetteBuilder: knitr
NeedsCompilation: yes
Packaged: 2018-03-29 17:26:00 UTC; lmullen
Author: Lincoln Mullen [aut, cre], Os Keyes [ctb], Dmitriy Selivanov [ctb], Jeffrey Arnold [ctb], Kenneth Benoit [ctb]
Maintainer: Lincoln Mullen
Repository: CRAN
Date/Publication: 2018-03-29 20:07:40 UTC

install.packages('tokenizers')
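Once installed, the counting and chunking helpers named in the description can be used as sketched below (a hedged example; the sample text and chunk size are illustrative choices, not part of the package):

```r
library(tokenizers)

# A longer document: one 10-word sentence repeated 10 times
doc <- paste(rep("All work and no play makes Jack a dull boy.", 10),
             collapse = " ")

# Counting helpers return one count per input document
count_words(doc)       # 10 words x 10 sentences = 100
count_sentences(doc)   # 10
count_characters(doc)

# Split the long text into separate documents of 25 words each
chunks <- chunk_text(doc, chunk_size = 25)
length(chunks)         # 100 words / 25 per chunk = 4
```

Chunking with `chunk_text()` is useful when downstream models expect documents of roughly equal length, as the description notes.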

https://lincolnmullen.com/software/tokenizers/
