Performance Analysis of Sketching Methods

Authors

  • Dinesh Maharjan Ph.D. Scholar at the University of Seoul, Seoul, South Korea

DOI:

https://doi.org/10.3126/nccsrj.v2i1.60082

Keywords:

MinHash, Sketch, Jaccard Similarity, Densification

Abstract

One of the problems in data mining or machine learning is the high-dimensional dataset. MinHash can generate sketches of sparse datasets efficiently reducing the dimension to a few thousand. These sketches, then, can be used for different machine-learning applications. It takes O(kd) computations to generate k hash values for a data point with d non-zeros. The Sketch of a data point is the vector of k hash values. Weighted Minwise Hashing is another method to generate a sketch of size k in O(kp/d) computations. Here, p is the size of the universal set. Optimal Densification is the most efficient and accurate method as it can generate k hash values in mere O(d + k) computations. In this paper, we investigate the performance of Optimal Densification, Weighted Minwise Hashing, and Vanilla MinHash by performing two experiments. Firstly, we investigate Jaccard similarity estimation accuracy on six different synthetic datasets. Then, we perform one nearest neighbor classification (1NN) of four real datasets. Optimal Densification outperforms both Weighted Minwise Hashing and Vanilla MinHash in terms of accuracy and time taken to generate the sketches.

Downloads

Download data is not yet available.
Abstract
40
PDF
38

Downloads

Published

2023-11-27

Issue

Section

Articles