Provides a high-performance interface for calculating string similarities and distances, leveraging the efficient C++ library RapidFuzz developed by Max Bachmann and Adam Cohen. This package integrates the C++ implementation, allowing R users to access cutting-edge algorithms for fuzzy matching and text analysis.
You can install RapidFuzz directly from CRAN, or install the development version from GitHub with:
# install.packages("pak")
pak::pak("StrategicProjects/RapidFuzz")
library(RapidFuzz)
The RapidFuzz package is an R wrapper around the highly efficient RapidFuzz C++ library. It provides implementations of multiple string comparison and similarity metrics, such as Levenshtein, Jaro-Winkler, and Damerau-Levenshtein distances. This package is particularly useful for applications like record linkage, approximate string matching, and fuzzy text processing.
String comparison algorithms calculate distances and similarities between two sequences of characters. These distances help to quantify how similar two strings are. For example, the Levenshtein distance measures the minimum number of single-character edits required to transform one string into another.
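To make the Levenshtein definition concrete, here is a minimal base R sketch of the classic dynamic-programming recurrence. It is illustrative only: `lev_sketch()` is a hypothetical name, not a function of this package, and it makes no attempt to match the performance of the C++ implementation.

```r
# Illustrative Levenshtein distance via dynamic programming.
# lev_sketch() is a hypothetical helper, not part of RapidFuzz.
lev_sketch <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)

  # d[i + 1, j + 1] holds the distance between the first i characters
  # of a and the first j characters of b.
  d <- matrix(0L, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m

  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0L else 1L
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + 1L,  # delete a character from a
        d[i + 1, j] + 1L,  # insert a character into a
        d[i, j] + cost     # substitute, or keep a matching character
      )
    }
  }
  d[n + 1, m + 1]
}

lev_sketch("kitten", "sitting")
# 3 (substitute k -> s, substitute e -> i, insert g)
```

Base R's `utils::adist()` computes the same quantity and can serve as a cross-check on small inputs.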
RapidFuzz leverages advanced algorithms to ensure high performance while maintaining accuracy. The original library is open source and is available in the RapidFuzz GitHub repository.
processString(): Process a string with options to trim, convert to lowercase, and transliterate to ASCII.
opcodes_apply_str(): Apply Opcodes to transform a string.
opcodes_apply_vec(): Apply Opcodes to transform a string into a character vector.
get_editops(): Retrieve Edit Operations between two strings.
editops_apply_str(): Apply Edit Operations to transform a string.
editops_apply_vec(): Apply Edit Operations to transform a string into a character vector.
damerau_levenshtein_distance(): Calculate the Damerau-Levenshtein Distance.
damerau_levenshtein_normalized_distance(): Calculate the Normalized Damerau-Levenshtein Distance.
damerau_levenshtein_normalized_similarity(): Calculate the Normalized Damerau-Levenshtein Similarity.
damerau_levenshtein_similarity(): Calculate the Damerau-Levenshtein Similarity.
fuzz_QRatio(): Perform a Quick Ratio Calculation.
fuzz_WRatio(): Perform a Weighted Ratio Calculation.
fuzz_partial_ratio(): Calculate Partial Ratio.
fuzz_ratio(): Calculate a Simple Ratio.
fuzz_token_ratio(): Calculate Combined Token Ratio.
fuzz_token_set_ratio(): Perform Token Set Ratio Calculation.
fuzz_token_sort_ratio(): Perform Token Sort Ratio Calculation.
hamming_distance(): Calculate Hamming Distance.
hamming_normalized_distance(): Calculate Normalized Hamming Distance.
hamming_normalized_similarity(): Calculate Normalized Hamming Similarity.
hamming_similarity(): Calculate Hamming Similarity.
indel_distance(): Calculate Indel Distance.
indel_normalized_distance(): Calculate Normalized Indel Distance.
indel_normalized_similarity(): Calculate Normalized Indel Similarity.
indel_similarity(): Calculate Indel Similarity.
jaro_distance(): Calculate Jaro Distance.
jaro_normalized_distance(): Calculate Normalized Jaro Distance.
jaro_normalized_similarity(): Calculate Normalized Jaro Similarity.
jaro_similarity(): Calculate Jaro Similarity.
jaro_winkler_distance(): Calculate Jaro-Winkler Distance.
jaro_winkler_normalized_distance(): Calculate Normalized Jaro-Winkler Distance.
jaro_winkler_normalized_similarity(): Calculate Normalized Jaro-Winkler Similarity.
jaro_winkler_similarity(): Calculate Jaro-Winkler Similarity.
lcs_seq_distance(): Calculate LCSseq Distance.
lcs_seq_editops(): Retrieve LCSseq Edit Operations.
lcs_seq_normalized_distance(): Calculate Normalized LCSseq Distance.
lcs_seq_normalized_similarity(): Calculate Normalized LCSseq Similarity.
lcs_seq_similarity(): Calculate LCSseq Similarity.
levenshtein_distance(): Calculate Levenshtein Distance.
levenshtein_normalized_distance(): Calculate Normalized Levenshtein Distance.
levenshtein_normalized_similarity(): Calculate Normalized Levenshtein Similarity.
levenshtein_similarity(): Calculate Levenshtein Similarity.
osa_distance(): Calculate Distance Using OSA.
osa_editops(): Retrieve Edit Operations Using OSA.
osa_normalized_distance(): Calculate Normalized Distance Using OSA.
osa_normalized_similarity(): Calculate Normalized Similarity Using OSA.
osa_similarity(): Calculate Similarity Using OSA.
prefix_distance(): Calculate the Prefix Distance between two strings.
prefix_normalized_distance(): Calculate the Normalized Prefix Distance between two strings.
prefix_normalized_similarity(): Calculate the Normalized Prefix Similarity between two strings.
prefix_similarity(): Calculate the Prefix Similarity between two strings.

prefix_distance("abcdef", "abcxyz")
# Output: 3
prefix_normalized_similarity("abcdef", "abcxyz", score_cutoff = 0.0)
# Output: 0.5
postfix_distance("abcdef", "xyzdef")
# Output: 3
damerau_levenshtein_distance("abcdef", "abcfed")
# Output: 2
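The prefix results above (a distance of 3 and a normalized similarity of 0.5) can be reproduced by hand. The base R sketch below assumes the prefix similarity is the length of the common prefix, the distance is the longer string's length minus that similarity, and normalization divides by the longer length. These definitions are an assumption that happens to agree with the outputs shown here; `common_prefix_len()` is a hypothetical helper, not a package function.

```r
# Hypothetical helper: length of the common prefix of two strings.
common_prefix_len <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  k <- min(length(x), length(y))
  if (k == 0) return(0L)
  same <- x[seq_len(k)] == y[seq_len(k)]
  # Position of the first mismatch, or NA when the shorter string
  # is entirely a prefix of the longer one.
  first_diff <- match(FALSE, same)
  if (is.na(first_diff)) k else first_diff - 1L
}

a <- "abcdef"
b <- "abcxyz"

sim      <- common_prefix_len(a, b)   # shared prefix "abc"
max_len  <- max(nchar(a), nchar(b))
dist     <- max_len - sim             # assumed distance definition
norm_sim <- sim / max_len             # assumed normalization

c(similarity = sim, distance = dist, normalized_similarity = norm_sim)
```

Under these assumptions the distance and the similarity always add up to the length of the longer string, which is why the normalized distance and normalized similarity sum to 1.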
# Example data
query <- "new york jets"
choices <- c("Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys")
score_cutoff <- 0.0

# Find the best match
extract_matches(query, choices, score_cutoff, scorer = "PartialRatio")
# Output:
# choice score
# 1 New York Jets 100.00000
# 2 New York Giants 81.81818
# 3 Atlanta Falcons 33.33333
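For readers curious about the general shape of such a ranking step, the following base R sketch scores every choice against the query, drops scores below a cutoff, and sorts the rest. It is not how `extract_matches()` is implemented: `rank_choices()` is a hypothetical helper, and it uses a normalized Levenshtein similarity from `utils::adist()` rather than the "PartialRatio" scorer, so its scores will differ from the output above.

```r
# Hypothetical ranking helper in base R; not the package's extract_matches().
rank_choices <- function(query, choices, score_cutoff = 0) {
  q  <- tolower(query)
  ch <- tolower(choices)

  # Levenshtein distance from the query to every choice
  d <- as.vector(utils::adist(q, ch))

  # Normalized similarity on a 0-100 scale
  score <- 100 * (1 - d / pmax(nchar(q), nchar(ch)))

  out <- data.frame(choice = choices, score = score, stringsAsFactors = FALSE)
  out <- out[out$score >= score_cutoff, , drop = FALSE]    # apply the cutoff
  out <- out[order(out$score, decreasing = TRUE), , drop = FALSE]
  rownames(out) <- NULL
  out
}

query   <- "new york jets"
choices <- c("Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys")

result <- rank_choices(query, choices, score_cutoff = 50)
result
```

Lowercasing both sides before scoring is a design choice made for this sketch so that "new york jets" and "New York Jets" compare as identical.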
The RapidFuzz package is a wrapper of the RapidFuzz C++ library, developed by Max Bachmann and Adam Cohen. The library implements efficient algorithms for approximate string matching and comparison.