RapidFuzz

Provides a high-performance interface for calculating string similarities and distances, built on the RapidFuzz C++ library developed by Max Bachmann and Adam Cohen. The package wraps the C++ implementation so that R users can call its fuzzy matching and text analysis algorithms directly.

Installation

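You can install the released version of RapidFuzz from CRAN, or the development version from GitHub:

# CRAN release
install.packages("RapidFuzz")

# Development version from GitHub (requires pak: install.packages("pak"))
pak::pak("StrategicProjects/RapidFuzz")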

library(RapidFuzz)

Overview

The RapidFuzz package is an R wrapper around the highly efficient RapidFuzz C++ library. It provides implementations of multiple string comparison and similarity metrics, such as Levenshtein, Jaro-Winkler, and Damerau-Levenshtein distances. This package is particularly useful for applications like record linkage, approximate string matching, and fuzzy text processing.

String comparison algorithms calculate distances and similarities between two sequences of characters. These distances help to quantify how similar two strings are. For example, the Levenshtein distance measures the minimum number of single-character edits required to transform one string into another.
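
To make the definition concrete, here is a minimal base R sketch of the classic dynamic-programming approach to Levenshtein distance. It is an illustration only, not the RapidFuzz implementation, and the name levenshtein_sketch is hypothetical.

```r
# Illustrative dynamic program for Levenshtein distance (base R only).
# levenshtein_sketch is a hypothetical helper, not part of the RapidFuzz package.
levenshtein_sketch <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)

  # d[i + 1, j + 1] holds the number of edits needed to turn the
  # first i characters of a into the first j characters of b.
  d <- matrix(0L, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n  # delete every character of the prefix of a
  d[1, ] <- 0:m  # insert every character of the prefix of b

  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0L else 1L
      d[i + 1, j + 1] <- min(
        d[i, j + 1] + 1L,  # deletion
        d[i + 1, j] + 1L,  # insertion
        d[i, j] + cost     # substitution (or a free match)
      )
    }
  }
  d[n + 1, m + 1]
}

levenshtein_sketch("kitten", "sitting")
# 3: substitute k -> s, substitute e -> i, insert g
```

Base R's utils::adist() computes the same quantity with its default costs, so it can serve as a cross-check for the sketch.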

RapidFuzz is designed for high performance without sacrificing accuracy. The original C++ library is open source and available in the RapidFuzz GitHub repository.


Functions

Process String Function

Opcode Functions

Edit Operation Utilities

Edit Operations Functions

Damerau-Levenshtein Functions

Fuzz Ratio Functions

Hamming Functions

Indel Functions

Jaro Functions

Jaro-Winkler Functions

Longest Common Subsequence (LCSseq) Functions

Levenshtein Functions

Optimal String Alignment (OSA) Functions

Prefix Functions


Example Usage

Prefix Functions

prefix_distance("abcdef", "abcxyz")
# Output: 3

prefix_normalized_similarity("abcdef", "abcxyz", score_cutoff = 0.0)
# Output: 0.5

Postfix Functions

postfix_distance("abcdef", "xyzdef")
# Output: 3
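
The prefix and postfix results above can be reproduced by hand. The sketch below assumes that the distance is the length of the longer string minus the length of the shared prefix (or suffix), and that the normalized similarity is one minus that distance divided by the longer length. This assumption matches the outputs shown here. The *_sketch helpers are hypothetical base R functions, not package functions.

```r
# Hypothetical helper: number of leading elements two character vectors share.
common_prefix_len <- function(x, y) {
  k <- min(length(x), length(y))
  if (k == 0) return(0L)
  mismatch <- which(x[seq_len(k)] != y[seq_len(k)])
  if (length(mismatch) == 0) k else mismatch[1] - 1L
}

# Assumed definition: longer length minus shared prefix length.
prefix_distance_sketch <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  max(length(x), length(y)) - common_prefix_len(x, y)
}

# A shared suffix is a shared prefix of the reversed strings.
postfix_distance_sketch <- function(a, b) {
  x <- rev(strsplit(a, "")[[1]])
  y <- rev(strsplit(b, "")[[1]])
  max(length(x), length(y)) - common_prefix_len(x, y)
}

# Assumed normalization: 1 - distance / longer length.
prefix_normalized_similarity_sketch <- function(a, b) {
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)
  1 - prefix_distance_sketch(a, b) / longest
}

prefix_distance_sketch("abcdef", "abcxyz")               # 3
prefix_normalized_similarity_sketch("abcdef", "abcxyz")  # 0.5
postfix_distance_sketch("abcdef", "xyzdef")              # 3
```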

Damerau-Levenshtein Functions

damerau_levenshtein_distance("abcdef", "abcfed")
# Output: 2
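
In the example above, "def" and "fed" differ by swapping two characters that are not adjacent, so no single transposition applies and the result is 2. To show how an adjacent transposition changes the count, here is a base R sketch of the simpler restricted variant, optimal string alignment (OSA). It is not the unrestricted Damerau-Levenshtein algorithm and not the package's implementation. The name osa_sketch is hypothetical.

```r
# Illustrative optimal string alignment (restricted Damerau-Levenshtein) distance.
# osa_sketch is a hypothetical base R helper, not part of the RapidFuzz package.
osa_sketch <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)

  # d[i + 1, j + 1] = cost of aligning the first i chars of a with the first j chars of b.
  d <- matrix(0L, nrow = n + 1, ncol = m + 1)
  d[, 1] <- 0:n
  d[1, ] <- 0:m

  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      cost <- if (x[i] == y[j]) 0L else 1L
      best <- min(
        d[i, j + 1] + 1L,  # deletion
        d[i + 1, j] + 1L,  # insertion
        d[i, j] + cost     # substitution (or a free match)
      )
      # Adjacent transposition: the last two characters are swapped.
      if (i > 1 && j > 1 && x[i] == y[j - 1] && x[i - 1] == y[j]) {
        best <- min(best, d[i - 1, j - 1] + 1L)
      }
      d[i + 1, j + 1] <- best
    }
  }
  d[n + 1, m + 1]
}

osa_sketch("ab", "ba")                    # 1: one adjacent transposition
utils::adist("ab", "ba")[1, 1]            # 2: plain Levenshtein needs two substitutions
osa_sketch("abcdef", "abcfed")            # 2: d and f are not adjacent
```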

Extract Matches

# Example data
query <- "new york jets"
choices <- c("Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys")
score_cutoff <- 0.0
# Rank the choices by similarity to the query
extract_matches(query, choices, score_cutoff, scorer = "PartialRatio")
# Output:
#            choice     score
# 1   New York Jets 100.00000
# 2 New York Giants  81.81818
# 3 Atlanta Falcons  33.33333
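
The general idea of ranking candidates against a query can be sketched in base R. This sketch uses a plain normalized Levenshtein score computed with utils::adist() after lowercasing, not the PartialRatio scorer, so its scores differ from the output above. rank_choices_sketch is a hypothetical helper and does not reflect how extract_matches is implemented.

```r
# Hypothetical helper: score each choice against the query and sort by score.
# Uses normalized Levenshtein similarity on a 0-100 scale, not PartialRatio.
rank_choices_sketch <- function(query, choices, score_cutoff = 0) {
  q <- tolower(query)
  cand <- tolower(choices)

  # Levenshtein distance from the query to every candidate.
  dist <- as.vector(utils::adist(q, cand))

  # The distance never exceeds the longer length, so this stays within [0, 100].
  longest <- pmax(nchar(q), nchar(cand))
  score <- ifelse(longest == 0, 100, 100 * (1 - dist / longest))

  out <- data.frame(choice = choices, score = score, stringsAsFactors = FALSE)
  out <- out[out$score >= score_cutoff, , drop = FALSE]
  out <- out[order(-out$score), , drop = FALSE]
  rownames(out) <- NULL
  out
}

choices <- c("Atlanta Falcons", "New York Jets", "New York Giants", "Dallas Cowboys")
ranked <- rank_choices_sketch("new york jets", choices)
ranked$choice[1]
# "New York Jets"
```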

Original Library

The RapidFuzz package wraps the RapidFuzz C++ library, developed by Max Bachmann and Adam Cohen. The library implements efficient algorithms for approximate string matching and comparison.
