Pushing the envelope - Probing the possible in information retrieval
For some time now I’ve been enthralled with the idea that data physical locality should be descriptive rather than prescriptive.
That is to say: Data locality should only ever be based on its origin and its utility. Who’s emitting it, and who wants it - Data’s physical locality within a given volume of space should be based on this, and nothing else. Of course, this is a very abstract way of looking at things, but it’s very important for the future of computing that we get this right.
Today, we tend to take a very prescriptive approach to assigning data locality. In the old days of on premises servers, we would prescribe that location to be down the hall in the server room. We hardly cached anything* because round-tripping 50 feet is no big deal. These days, more often than not there’s some distant and elaborate cache hierarchy which is rooted in some cloud provider’s storage system or database persistent storage. In some more avant-garde systems such as the blockchain flavor-of-the-month, or other peer to peer networks, this should be mitigated, or so one would think. Unfortunately no, pretty much all of these have some sort of prescriptive data locality assignment. In the case of most public blockchains, that prescribed location is: everywhere (modulo some half-baked trusted-full-node scheme). In other Distributed-Hash-Table based storage systems like IPFS, that prescribed location (at least of the index) is some redundant set of logical addresses, which map arbitrarily to physical localities.
( *Actually, we’ve always used memory cache hierarchies on the CPU and memory bus of the client machine. It’s easy to contemplate that cache hierarchy as being fundamentally separate from the server’s cache hierarchy, but in actuality it isn’t – Rather it’s single overarching hierarchy built of a composition the two. From pixel/keyboard/mouse, through the network, and all the way to the persistent storage medium – This is a single hierarchical composition of cache consistency models. More on this later)
The big issue with prescriptive data-locality-assignment is primarily that location may or may not be well aligned in physical space with those parties who want it (or those who are authorized to have it!) In the case of DHT schemes, if you start with a very small physical volume; say for instance your office; which is fully connected with a reliable network, the logical address which contains the data you want is somewhere within the building. Not bad in terms of latency, and not a worry with regard to connectivity.
Now, imagine the physical volume is a lot larger… for instance a planet – Earth lets say. If we have singly-assigned logical addresses based on the DHT scheme, then that location is some random host somewhere on the globe. That host might be offline, or might be on the other side of the world, or worse still might have had a backhoe cut their last mile network cable five minutes ago. Not to worry though, Most sensible DHT networks assign many redundant copies of this data at differing logical addresses, usually with the number of replicas being vaguely commensurate with the overall network size. This way, you can ask several of these logical addresses for the information. The one with the backhoe cut won’t respond to your query, but others will, and you can just discard the responses after the first. Case closed!
The problem is: even with a massive number of redundant copies of the data at predictably differing logical/physical addresses, the probability that any of their commensurate physical addresses is within your office is rather low. Of course you could make some server in your office a “full node” meaning that it’s guaranteed to have the data for all possible queries, but now you need to receive a copy of all worldwide traffic. That’s going to get expensive.
So, why such a big deal? so what if most of my queries need to reach out to some public full node or hosting provider? Well, in the short term, you want system availability. If your internet connection gets cut or degraded somehow, Bob still wants to keep collaborating on that document with Alice down the hall. Does that mean they have to use an on-premises system? For now I guess it does. They’ll probably just put up with it until it becomes unbearable however. One can construct all manner of pathological situations where Bob and Alice are in different cities or continents, with network partitions occurring in some other place which is not between them, but still affecting their ability to collaborate or communicate. This will only stand to worsen as the IOT revolution gradually morphs into the shitty-IOT-everyday-reality.
In the medium and long terms, I believe we are going to run into serious limitations in terms of data size vs bandwidth within various spatial volumes, (dis)trust in our service providers, new demands placed on our cache hierarchies by modern federated/decentralized systems, and also increasing expectations of decreasing latency for various applications.
For this reason, I have imagined unba.se – A highly speculative endeavor, and one that will be nigh impossible to monetize (at least directly as a database company) even in the unlikely event that it ever turns into a fully working product. Yes, there’s a whole other conversation to be had about cryptocurrency-based computational frameworks. It’s not wholly impossible to take that approach, but I’ll leave my ethical grievances over modern cryptocurrencies for another post. The key feature of unbase, as it is imagined, is to provide a strong causal consistency model with zero apriori planning of data locality – only behavioral localization based on utility and availability.
The key feature of such a system is to approach the upper-bound of performance which is physically possible for a strong causal consistency model within an arbitrary spatial volume. Of course there’s a vastly more technical definition which is warranted here, but for now lets just say that we intend for knowledge to be infectious – Every time I hear about something, even vicariously, I will always include that in my projection of state – querying distant nodes and waiting as necessary to get a better-than-eventual-consistency representation of the “current” state of a given datum. (If you want to talk about linearizability or CAP theorem, your offramp is here)
And so, we posit that for any given set of originators and consumers of (immutable+copyable) data within a physical volume, there exists some optimal distribution of data; and continuous transformation thereto; such that latency, durability, energy, and storage volume are collectively minimized.
If we had infinite bandwidth and infinite storage density, we could simply broadcast all the data to all the places. Of course we do not. Density, locality, energy, latency, and durability are in tension. I believe that stasis or a priori decision-making in the selection of locality (usually chosen in the interest of complexity reduction) is necessarily a sub-optimal solution. A better answer is to continually transform the locality of our data based on a series of simple behaviors which create emergent optima in terms of data locality. The challenge is in designing the bookkeeping system such that you can actually find the data you want, without choking out useful work with bookkeeping traffic by itself.
Candidly I feel like some profound insight lay just beyond my grasp in terms of this relationship – something to do with the “shape” in a higher dimensional sense, of this bookkeeping traffic, compared with the shape of the physical space in which the data resides. (Anybody know a mathematician looking for a fun project? Contact me!)
Through experimentation, I am searching for a function which describes the the lower bound of latency (and energy in a physical sense) required for logical recursion and retrieval of information distributed within a physical volume. This applies to understanding the properties of non-optimal/arbitrary distributions within the physical volume, as
well as the limits of optimal distributions, and most importantly the rate and energy required to transformed the one into the other.
The purpose of probing the properties of the former arbitrary/non-optimal distributions is to understand tractability under the average starting case, and the latter being to understand tractability under the best case – To measure various behavioral systems to see which one is, ahem, optimal, at finding optima. The objective being to create the best-case distribution of fine grained data as often and reliably as possible; And for that matter, understanding on a mathematical level what is physically possible under what conditions.
In practical terms, we in CS tend to think of our own memory bus as having a logical distance of zero to our locality of computation. In fact, we undergo Herculean efforts to abstract this away and maintain the illusion so we can manage the complexity of our systems; or arguably reduce the complexity of our individual efforts. This comes at the great cost of overall complexity and perverse control dynamics in virtually every sense: The functioning of information systems in a changing world, and the ethics of privacy and control of one’s information.
As a quite precocious digression, as IANA physicist: It is probably fair to say, more or less, that physically no information is ever “here” where “here” is a single point in space. I suppose we could argue that maybe this is true for a single bit / qbit, but probably not two. There is arguably always some system of indirection at hand. In the degenerate case it’s only (zero or) one hop, however in most real-world cases it’s a dozen or more hops, as with register, L1-L3 cache, memory controller, bus, network controller, b-tree index nodes, disk allocation table, etc etc. We have various ways to spread the work around, but the principle holds, I think. I think it’s reciprocally fair to say (and admittedly with a nontrivial degree of physics fudge-ey-ness) that every bit of information must occupy nonzero space.
Today, the applicability of this principle on the microscopic scale may be of debatable utility beyond academic curiosity (not me, I think it’s fascinating, but, you know… businessey types) – whereas I think there is a much stronger argument for utility in the application of this principle in the macroscopic sense. That is to say: among today’s networks of today’s computers. The manner in which they locate, organize, and retrieve data; locate, organize and perform computation is of tremendous value, particularly if such a scheme is not merely able to converge on such optima once, but in fact do so continuously.
And here lay the ragged ends of my intuitions – with my educational background in mathematics and physics ruefully lacking; And yet, I’m fairly certain on some instinctual level there’s a strong connection to information theory and physics. No doubt substantial research has been done on some or all of the facets of this. Through experimentation and recruitment of collaborators more skilled and educated than I, I seek to at least ask the right questions – To dream of a revolution in computer science and information technology, and so in turn the quite timely death of the database.