Compressing an infinite stream of data
Exploring a novel pattern recognition algorithm as a support for lossless compression in data centres.
Let’s cut through the complexity. Like Alexander did.
In the annals of history, Alexander the Great stands as an iconic figure whose strategic brilliance was matched by his audacious feats. Legend has it that during his conquest of Asia Minor, Alexander encountered a seemingly insurmountable challenge: the Gordian Knot, a complex and intricately woven knot binding an ox-cart to a pole in the city of Gordium.
Prophecy held that the person who could unravel this intricate knot would become the ruler of all of Asia. Rather than succumbing to the puzzle’s complexity, Alexander, in a moment of decisive action, drew his sword and sliced through the Gordian Knot, effortlessly dismantling the supposedly unsolvable tangle. This bold and unconventional approach challenges the conventional wisdom that complex problems demand intricate solutions.
Cutting through the intro with a pen rather than a sword, and to explain the purpose of this text: we explore the idea of the simplest solutions resolving the most complex problems.
Welcome to the realm where simplicity, audacity, and a willingness to challenge the status quo become the keys to unravelling the metaphorical Gordian Knots of our modern challenges.
General Purpose Algorithm
In our article, the Gordian Knot stands for all the complex nuances of data science, machine learning and compression algorithms, which quite often rely on a deep knowledge of mathematics and statistics. Although this is not an article about machine learning, it is impossible to avoid the reference. And it is not because of bizarre SEO (Search Engine Optimisation) techniques, but rather because of the general-purpose algorithm I want to propose as a solution for compressing an infinite amount of data with a finite, or even modest, amount of RAM.
How to compress the infinite?
This article touches on the matter of lossless compression for a finite or infinite amount of data, assuming that infinity is characterised by a never-ending stream, just like in a data centre or a single instance. Assuming we never want to worry about the process itself, algorithms can rely on dictionaries, which are nothing new and well explored.
The novelty of the proposed method is a technique that allows for dynamic dictionaries, endlessly adjusting the process to the character of incoming phrases and, in theory, working with an infinite amount of data.
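To make the idea concrete, here is a minimal sketch of how dynamically designated dictionary phrases could be handed to an off-the-shelf lossless codec. It uses zlib's preset-dictionary feature (a DEFLATE/LZ77-family codec, chosen purely for illustration rather than LZW), and the dictionary_candidates list is a hypothetical stand-in for whatever phrases the pattern-recognition stage designates at a given moment; none of this is the method's reference implementation.

```python
import zlib

# Hypothetical output of the pattern-recognition stage: phrases that were
# found to repeat in the incoming stream. In a real deployment these would
# be emitted and refreshed continuously.
dictionary_candidates = [b"GET /index.html HTTP/1.1\r\n", b"User-Agent: curl/8.0\r\n"]

# zlib lets us prime DEFLATE with a preset dictionary (zdict). The most
# valuable phrases should sit towards the end of the dictionary.
zdict = b"".join(dictionary_candidates)

compressor = zlib.compressobj(level=9, zdict=zdict)
payload = b"GET /index.html HTTP/1.1\r\nUser-Agent: curl/8.0\r\nHost: example.com\r\n"
compressed = compressor.compress(payload) + compressor.flush()

# The receiver must be primed with the same dictionary to decompress.
decompressor = zlib.decompressobj(zdict=zdict)
assert decompressor.decompress(compressed) == payload

print(f"{len(payload)} bytes -> {len(compressed)} bytes")
```

In a streaming setting the dictionary would be rotated as the character of the data drifts, with sender and receiver agreeing on the current dictionary version.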
Digital Hippocampus
The base for this operation is a novel algorithm conceptualised and developed by the author, called Digital Hippocampus. It specialises in designating patterns and anomalies in a data stream on the fly. It can run indefinitely (as well as be saved, transferred or paused), fluently designating dictionary candidates for a lossless compression method of your choice.
Below, you can see the abstract principles behind the general mechanics of Digital Hippocampus.
Monkey Wrench Algorithm
Digital Hippocampus was designed to work as a monkey-wrench algorithm, meaning it is a sort of general-purpose tool. That is why we mentioned data science and machine learning in a previous paragraph. Its natural ability to designate patterns allows for a streamlined operation in which terabytes of data are processed and marked as repetitive, while the set of patterns adjusts to new types of input over time.
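As a rough illustration of that mode of operation, here is a sketch of a stream processor that marks chunks as repetitive while letting stale patterns fade out. The fixed-size chunks, hash fingerprints and LRU-bounded memory are simplifications introduced for this example only; they are not part of the Digital Hippocampus design described here.

```python
from collections import OrderedDict
from typing import Iterable, Iterator, Tuple

def mark_repetitive(chunks: Iterable[bytes], capacity: int = 1_000_000) -> Iterator[Tuple[bytes, bool]]:
    """Yield (chunk, is_repetitive) while keeping only `capacity` recent chunks in memory."""
    seen: "OrderedDict[int, None]" = OrderedDict()
    for chunk in chunks:
        key = hash(chunk)              # cheap fingerprint; collisions ignored in this sketch
        repetitive = key in seen
        if repetitive:
            seen.move_to_end(key)      # refresh recency for patterns still in use
        else:
            seen[key] = None
            if len(seen) > capacity:   # forget the stalest pattern: adapt to drifting input
                seen.popitem(last=False)
        yield chunk, repetitive
```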
Originally, the author was trying to find the simplest possible alternative to neural networks in machine learning: an algorithm that could create small classifiers, interconnected in a hive of knowledge. Being aware that this is not the scope of this article, allow me to direct your curious mind to the website where you can see the concept unravelled as an AI framework called MarieAI, named after Marie Curie: https://marieai.com/
Let The Data Speak
Assuming we are already familiar with the fundamental principles of how Digital Hippocampus works, let's dive into some practical information. As you might have noticed, Hippocamp relies on a set of arrays, each shorter than the previous one. The base array length, the overall number of tiers and the length of the data we feed the algorithm all affect how much RAM we need. The good thing is that Hippocamp has a predictable RAM ceiling, which allows for massively optimised operations during the compression process.
Here are some tables, with a few caveats.
A. The main assumption is that the Entry Array is simply a memory buffer through which data flows as an uninterrupted stream. All other arrays are used only if the direct ascendant has a duplicate.
B. Each child array (2nd tier, 3rd, etc.) is equal in length to the previous one multiplied by 0.618. To put it simply, the arrays follow the Fibonacci sequence in reversed order (starting with the highest number). Worth mentioning: this is just the author's preference, not an official guideline. A minimal sketch of this tier structure follows below.
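The sketch below is a purely illustrative toy model of caveats A and B: array lengths shrink by the 0.618 factor, and a duplicate found in one tier seeds the child tier. The promotion rule, the fixed-length eviction and all parameter values are one possible reading of the description above, not the author's reference implementation.

```python
from typing import List

GOLDEN_RATIO_INV = 0.618  # child array length = parent length * 0.618 (author's preference)

def tier_lengths(entry_len: int, tiers: int) -> List[int]:
    """Lengths of the Entry Array and its child arrays, shrinking geometrically."""
    lengths = [entry_len]
    for _ in range(tiers - 1):
        lengths.append(max(1, int(lengths[-1] * GOLDEN_RATIO_INV)))
    return lengths

class TieredBuffers:
    """A toy model of the pyramid: duplicates in one tier are promoted to the next."""
    def __init__(self, entry_len: int, tiers: int):
        self.capacities = tier_lengths(entry_len, tiers)
        self.tiers: List[List[bytes]] = [[] for _ in self.capacities]

    def push(self, item: bytes, tier: int = 0) -> None:
        buf, cap = self.tiers[tier], self.capacities[tier]
        if item in buf and tier + 1 < len(self.tiers):
            self.push(item, tier + 1)   # duplicate found: promote it to the child tier
        buf.append(item)
        if len(buf) > cap:              # keep the tier within its fixed length
            buf.pop(0)

buffers = TieredBuffers(entry_len=4096, tiers=5)
print(buffers.capacities)               # -> [4096, 2531, 1564, 966, 596]
```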
RAM vs. model size
Strong vs weak patterns
Digital Hippocampus's pyramidal structure creates a range of pattern types, starting with some we could call weak (a single duplicate within the 'buffer array') up to the most commonly occurring ones in a large sample dataset (sometimes spanning petabytes of data).
Below is a table showing how an increase in tiers (pattern strengths) affects the appetite for RAM. As we can see, it is not significant if you consider the possibility of finding a pattern hidden in an exabyte of data using less than 64 MB of RAM.
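The table values themselves are not reproduced here, but the order of magnitude can be sanity-checked from caveat B alone. The figure of 8 bytes per array slot is an assumption for this back-of-the-envelope estimate; the real cost depends on what each slot stores.

```python
# Rough RAM estimate for the tier pyramid under the 0.618 rule from caveat B.
BYTES_PER_SLOT = 8  # assumption: 8-byte fingerprints per slot

def ram_bytes(entry_len: int, tiers: int) -> int:
    total, length = 0, entry_len
    for _ in range(tiers):
        total += length * BYTES_PER_SLOT
        length = int(length * 0.618)
    return total

for tiers in range(1, 9):
    print(tiers, f"{ram_bytes(1_000_000, tiers) / 2**20:.1f} MiB")

# The geometric shrinkage means each extra tier adds progressively less:
# the whole pyramid stays below ~1 / (1 - 0.618) ≈ 2.6x the Entry Array alone.
```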
Conclusions: Stay lean
A. As you can see (especially with models №1 and №2), Digital Hippocampus for compression purposes scales linearly, which creates more opportunities than many other approaches, where doubling the number of entries quadruples the RAM requirements or the number of operations.
B. Tiers are the smaller arrays where patterns dwell; the higher the tier, the more pronounced the pattern. We are not sure whether going beyond the 5th tier is significantly practical, but it does not cost much (illustrated as a percentage increase in the total amount of RAM).
C. The array search technique heavily affects the processing speed and the feasibility of Digital Hippocampus itself. Being directly correlated with the entry array length, it creates limits for programming languages or even types of RAM. One possible trade-off is sketched below.
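The tiny benchmark below illustrates what point C is getting at: the same membership question answered by a linear scan over the entry array versus an auxiliary hash index, trading speed for extra RAM. Neither technique is prescribed by the original design; this is only a comparison sketch.

```python
import random
import time

# Two candidate techniques for "has this phrase already appeared in the Entry Array?"
entry_array = [random.randbytes(16) for _ in range(200_000)]
needle = entry_array[-1]                      # worst case for the linear scan

t0 = time.perf_counter()
found_linear = needle in entry_array          # O(n) scan over the whole buffer
t1 = time.perf_counter()

index = set(entry_array)                      # auxiliary hash index (costs extra RAM)
t2 = time.perf_counter()
found_hashed = needle in index                # O(1) expected lookup
t3 = time.perf_counter()

print(found_linear, f"linear scan: {t1 - t0:.4f}s")
print(found_hashed, f"hash lookup: {t3 - t2:.6f}s (index build: {t2 - t1:.3f}s)")
```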
Envisioning possibilities
As we conclude our exploration into a less travelled realm of compression technology, it becomes evident that the journey has been one of envisioning possibilities rather than certainties.
The proposed novel compression framework, while it could be utilised with standard LZW-like techniques, stands as a theoretical beacon, illuminating a path yet to be traversed. While its efficacy remains untested in the field as of today, the very essence of scientific inquiry into the unlimited data pool lies in the daring pursuit of uncharted territories.
The potential benefits, if realised, could change how we deal with the entropy of data centres, transforming the digital landscape. A sense of anticipation and a spark of curiosity ignite our imaginations, and we invite you, dear reader, on a journey of future research and pioneering models to bring this theoretical concept to life.
The journey from theory to reality is a testament to the relentless pursuit of knowledge, where today’s hypotheses pave the way for tomorrow’s technological triumphs.
About Me & Final note
Marcin Rybicki, former game developer, algorithm enthusiast. If you have any questions regarding this work, don't hesitate to reach out to me through LinkedIn or the Digital Hippocampus website: https://marieai.com/hippocampus/