Traditionally, algorithms for counting distinct items in a stream of data would store all the items. A new algorithm, called CVM, uses randomization to estimate the number of distinct items with minimal memory usage. The trick is to keep track of items by recording them and then randomly deleting some. The probability of an item staying on the list is related to the number of rounds it survives. With this method, the researchers were able to accurately estimate the number of distinct words in Hamlet.

  • davel [he/him]@lemmy.ml
    link
    fedilink
    English
    arrow-up
    5
    ·
    8 months ago

    Optimize Memory and Performance with the CVM Algorithm in JavaScript: A Comprehensive Guide for Text Analysis, Big Data, and More

    Final CVM Algorithm Code in a JavaScript “oneliner”

    You can use this JS function by providing any text string as input, and it will simulate the described algorithm, returning the final set of unique words. This is as condensed as it can get, but if you can condese it more , please do post and Ill update it! If you are interested in seeing how this ‘oneliner’ started, before it was condensed, I have included the long form version of this same code at the bottom of the article.

    const cvmAlgorithm = t => { const f = n => !(Math.random() * (1 << n) | 0); const w = t.match(/\b\w+\b/g) || []; let r = 1, u = new Set(); for (let i of w.map(x => x.toLowerCase())) u.size < 100 ? u.add(i) : f(r) && u.delete(i), u.size >= 100 && r++, u.size > 50 && (u = new Set([...u].filter(() => f(r)))); return [...u]; };
    
    // Example usage:
    const text = "To be, or not to be, that is the question..."; 
    const finalWords = cvmAlgorithm(text);
    console.log(finalWords);  // [ 'to', 'be', 'or', 'not', 'that', 'is', 'the', 'question' ]
    
  • conditional_soup@lemm.ee
    link
    fedilink
    arrow-up
    5
    ·
    8 months ago

    This is really cool and clever. I think for a lot of applications, the output will be truthy enough to be useful.

    • a1studmuffin@aussie.zone
      link
      fedilink
      English
      arrow-up
      2
      ·
      8 months ago

      Especially considering it scales by adjusting accuracy! That makes it very adaptable so it could be used everywhere from microcontrollers through to Google data servers.

  • davel [he/him]@lemmy.ml
    link
    fedilink
    English
    arrow-up
    2
    ·
    edit-2
    8 months ago

    High praise indeed! waow-based Variyam co-develops innovative new data analysis algorithm

    The authors presented their new algorithm at the 2022 European Symposium on Algorithms, a premier international conference on algorithm design. Since then, their discovery has been gaining recognition and praise from computer scientists across the field, including world-renowned computer scientist Donald Knuth, author of “The Art of Computer Programming,” who is often referred to as "the father of the analysis of algorithms.”

    In May 2023, Knuth published his own paper on the algorithm, “The CVM Algorithm for Estimating Distinct Elements in Streams [PDF],” offering extensive praise and even naming it the CVM algorithm in honor of its inventors. Knuth said in addition to “explaining the ideas to just about everybody [he] meets,” he expects the algorithm to become a foundational aspect of computer science in the near future.

    "Their algorithm is not only interesting; it is extremely simple,” Knuth said in the paper. “It’s wonderfully suited to teaching students who are learning the basics of computer science. I’m pretty sure that something like this will eventually become a standard textbook topic.”

    As Knuth predicted and Variyam hoped, the algorithm has already found a place in computer science courses.

    From Knuth’s paper:

    Algorithm D (Distinct element estimation). Given an arbitrary data stream A = (a1, a2, . . . , am) and a buffer size s ≥ 1 as described above, this algorithm returns an unbiased estimate of |A| = |{a1, a2, . . . , am}|. It uses a buffer B that’s capable of holding up to s ordered pairs (a, u), where a is an element of the stream and u is a real number, 0 ≤ u < 1.

    D1. [Initialize.] Set t ← 0, p ← 1, and B ← ∅.
    D2. [Done?] Terminate if t = m, returning the estimate |B|/p.
    D3. [Input a.] Set tt + 1 and aat, the next element of the stream.
    D4. [Remove a from B.] If B contains the pair (a, u), for any u, delete it.
    D5. [Maybe put a in B.] Let u be a uniform deviate, independent of all others (namely a random real number in the range 0 ≤ u < 1). If up, go back to step D2. Otherwise, if |B| < s, insert (a, u) into B and go back to step D2.
    D6. [Maybe swap a into B.] (At this point u < p and |B| = s.) Let (a’, u’) be the element of B for which u’ is maximum. If u > u’, set pu. Otherwise replace (a’, u’) in B by (a, u) and set pu’. Then go back to step D2.