BP International: Almost Optimal Algorithms for Detecting Near-Duplicates in Domain-Independent Big Data | Chapter 04 | Recent Advances in Mathematical Research and Computer Science Vol. 9

Friday, 11 March 2022


In this chapter, we propose Merge-Filter Representative-based Clustering (Merge-Filter-RC), a general domain-independent method for finding near-duplicate records within and across different data sources. We then develop three classes of nearly optimal algorithms, collectively called the All-Three algorithms: constant threshold (CT), variable threshold (VT), and function threshold (FT). Merge-Filter-RC and All-Three form the backbone of this work. Merge-Filter-RC recursively divides and merges near-duplicates into hierarchical clusters with prototype representatives in order to distill local and global near-duplicates. Each cluster is characterised by one or more dynamically refined representatives. To limit the number of pairwise comparisons, and hence the search space, further similarity comparisons are made against these representatives only. We classify the outcome of each comparison as "very similar," "similar," or "not comparable." We complement the All-Three algorithms with a more thorough reexamination of the original well-tuned features of the seminal Monge-Elkan (ME) algorithm, which we circumvent by employing an affine variant of the Smith-Waterman (SW) similarity measure. We conducted multiple experiments and a comprehensive study on real-world benchmarks as well as synthetically generated data sets to demonstrate that the All-Three algorithms based on the Merge-Filter-RC technique surpass the Monge-Elkan algorithm in accuracy of near-duplicate detection. Furthermore, the All-Three algorithms are as computationally efficient as the Monge-Elkan algorithm.
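The representative-based idea in the abstract can be illustrated with a minimal Python sketch. It is not the chapter's Merge-Filter-RC or any of the All-Three algorithms: it is a single-pass, constant-threshold toy under several assumptions. The similarity function uses the standard library's `difflib.SequenceMatcher` as a stand-in for the affine Smith-Waterman variant, the cut-offs 0.9 and 0.7 are arbitrary, and the rule for refining a representative (keep the longest member) is hypothetical. The names `similarity`, `label`, and `cluster_by_representatives` are invented for this illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; NOT the affine Smith-Waterman variant."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def label(score: float, high: float = 0.9, low: float = 0.7) -> str:
    """Map a score to the three outcomes named in the abstract.

    The cut-offs `high` and `low` are assumed values, not the chapter's.
    """
    if score >= high:
        return "very similar"
    if score >= low:
        return "similar"
    return "not comparable"


def cluster_by_representatives(records, threshold: float = 0.7):
    """Assign each record to the cluster whose representative it matches best.

    Each record is compared only against cluster representatives, never
    against every earlier record, which is what limits the number of
    pairwise comparisons.
    """
    clusters = []  # each cluster: {"rep": str, "members": [str, ...]}
    for record in records:
        best_cluster, best_score = None, 0.0
        for cluster in clusters:
            score = similarity(record, cluster["rep"])
            if score > best_score:
                best_cluster, best_score = cluster, score
        if best_cluster is not None and best_score >= threshold:
            best_cluster["members"].append(record)
            # Hypothetical refinement rule: the longest member becomes the rep.
            best_cluster["rep"] = max(best_cluster["members"], key=len)
        else:
            clusters.append({"rep": record, "members": [record]})
    return clusters


records = [
    "John Smith, 12 Main St",
    "Jon Smith, 12 Main St.",
    "Alice Wong, 5 Oak Ave",
]
clusters = cluster_by_representatives(records)
```

With a constant `threshold` this loosely mirrors the CT setting; a variable or function threshold would replace the fixed value with one chosen per cluster or computed from the records, which this sketch does not attempt.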

Author(s) Details

Aziz Fellah
School of Computer Science and Information Systems, Northwest Missouri State University, Maryville, MO 64468, USA.

View Book:- https://stm.bookpi.org/RAMRCS-V9/article/view/6022
