(2017-06-26) Cull Dataspace Those Memex Dreams Again

Nate Cull: Dataspace 0: Those Memex Dreams Again ....a very rough sketch of an idea about what a future computing system might look like.

Let’s start with what the ‘here’ is that is less satisfactory

Back around 1980, as interactive systems were starting to become commonplace and home computers were showing that they were the future, the idea of Objects started to catch hold, driven by Alan Kay’s Smalltalk.

anyone remember OpenDoc?

The object vision was pretty neat for 1980. Now that we’ve had a few years to play with it, I think we’ve also found that it has some serious limitations:

This set of requirements, taken together, roughly equates to what I call ‘dataspace

This ‘space’ concept seems to be fairly powerful both for computers to use to do processing and for humans to use to store data. We use it all over: in filesystems, for example, and in directories like DNS or LDAP.

What a filesystem doesn’t give you, though, is a standard method of computing a value that doesn’t already exist as literal data.

Most seriously: after 27 or so years with the concept, we still don’t have anything approaching a standard, formal definition of what an object is.

In the Internet age we often find that we have data that doesn’t quite fit into the category of ‘document’ or ‘typed object’.

Interestingly, it turns out that although we have a huge pile of software and data standards, we don’t really have anything like this! And so this is an exploration of what kinds of properties a system like this might need, and how we might go about getting them.

But there’s one thing that objects do give you: and that’s the idea that the virtual ‘giant computer’ of the network or Internet is something like a space.

when we do a computation, we’d like for it not to have side effects

The 1990s is littered with distributed object systems that either failed spectacularly, or got trapped in various tiny niches.

Dataspace 1: In Search of a Data Model

Because the core HTTP data model is not good enough is exactly why we have so many web application server frameworks on the servers and Javascript frameworks on the client – to try to reimplement the pieces that weren’t there in the original HTTP/HTML Web vision.

We need a data model that’s coherent and built for the job of ‘pointing at small pieces of data from anywhere’. So let’s look at some of our options

Option 1: Filesystems

Option 2: Object systems

but there is no ‘one true object system’ that exists everywhere, or even enough places that we can use. Much worse: the very idea of an object has no formal definition, there’s no clear way to transport them or point at them between systems, and they just generally feel too complicated and too heavy.

Option 3: Dictionaries and functions

Javascript’s subset JSON (Javascript Object Notation), for instance, has ‘objects’ which don’t have methods

It’s sometimes called a hash (hash-table), a map, an associative array, or a dictionary. I’ll call it a dictionary.

But again – and this is more than a little surprising – it’s not quite enough! Let’s look at why, because this is a controversial claim to many working programmers

The kind of information I want to store in a Memex type system turns out to be very similar to logical statements

Semantic Web

JSON will only let you use a text string as a key, not a dictionary itself.

If we’re going to be storing logical statements, though, it’s going to be really really important that we can have dictionaries with keys that are arbitrary dictionaries.

But that case – one key with multiple values – is in short is why we can’t use dictionaries – and why dictionaries are, in fact, not a sufficient data type on which to build programming or data representation.

Option 4: Relational databases

and the graph is built out of a vast number of triples

The small problem is right at the top: ‘cats are funny’

But there’s a small problem and a big problem.

It looks like, finally, we’ve found our ultimate data model.

becomes a medium-sized problem if we want to represent a very old, very simple data structure: lists.

What part of a spreadsheet is the ‘program’, for example? Something seems awry if we have to split our brain into two separate hemispheres like this.

A third problem which might be so huge it’s not even registering on our radar: triples (or Codd relations generally) don’t have a programming model associated with them.

But the big problem is that you might want to put your triples – or your sets of triples – somewhere… and that means you might want to have graphs of graphs, or relations containing relations… and although Codd relations can do that, most triple stores, or graph query languages, haven’t quite got there.

Functions are just a special case of a much more fundamental mathematical structure – relations.

The Semantic Web concept is based on the graph

Functions are a special case of a special kind of relation: binary relations

Option 5: Graphs and triples (TripleSpace)

But what computing generally gives us isn’t binary relations. Instead we have finitary relations (relations of many places)

Dataspace 2: Revenge of the Data Model

we’re forced into putting the relation first, we’ve actually lost the ability to organise data by objects.

Option 6. Logic programming

The main language known for logic programming, Prolog, goes back to 1972 in France.

Suddenly three categories which seemed entirely separate: code, data and types – all merge into one. A huge simplification in thought.

The standard free Prolog today is SWI Prolog, which has a wonderful web-based scratchpad called SWISH. Give it a try.

Prolog has some very rough edges, which we’d like to sand off.

The whole thing is called a term

It’s called a functor, a horrible and ambiguous word, but the definition I think comes not from mathematics but from linguistics – basically it’s a ‘function-like word’ that expresses a relation.

The underlying data model that Prolog uses, called term structure, seems useful, though

The important thing is that functors don’t actually have any inherent meaning except what we give them. And a term doesn’t have any meaning other

There is a modern revival of the logic programming concept, Kanren – or now, miniKanren – which is sort of less of a language and more of an approach to embedding Prolog-like programming inside other languages. It’s worth looking at, but one problem with Kanren is that it depends on all the machinery of its host language, originally Scheme.

The general approach though, is: a program is built out of function-like things called (in Prolog) predicates.

Dataspace 3: It Came from the S-Expressions

‘A tuple might actually be a linked list, with a hidden nil at the end’ just wasn’t the paradigm they were working in – but in a Lisp frame of mind, converting between tuples and lists comes naturally.

Micro-Prolog

Pilog

names can be structured entities

So maybe S-expression syntax is a very useful thing to hold to, and can give us an edge here. But there are still problems with it.

HiLog

Option 7: S-expressions

Lisp, which almost entirely minimises the distance between its in-memory semantics and its syntax. Its semantics is linked lists; its syntax is S-expressions.

*If we look at Prologs based on S-expressions, we might see some more flexible ways of handling Prolog term structure.

There are three interesting answers here.*

Dataspace 4: The Term-inator

So perhaps we had a list looking like this: (a b c / d e f)

That ‘/’ is the magic character that replaces the dot. Anything after a / is a logical term. It can have arbitrary structure.

There is a very simple way forward from here, and it’s one that I haven’t seen described before. I think it has potential as a fundamental data semantics for building very large or very small distributed systems.

What if, instead of just a nil, or a dot and then a symbol, we allow a S-expression list to end with – or simply be – a third option: an arbitrary logical term?

This is an approach to representing term structure in S-expressions, so I’m calling it term expressions.

Dataspace 5: Introducing /all

*To get ‘pointable data’, I want a data model that allows me to embed pointers anywhere, but keeps the simplicity and clarity that, eg, RDF triples and XML/HTML don’t have.

The data model that I am currently exploring is what I call term-expressions (or T-expressions): a modified S-expression*

*Because term-expressions allow us to end a list with a term (which, remember, you can’t do in standard Prolog term structure) – we suddenly have an extra degree of freedom, which we can use in multiple ways.

/all means ‘what follows is a set, expressed as a list. Interpret all of the above as completions of this term’*

But /all , just by using the syntactic trick of ending a list with a term, suddenly gives us a spacelike operator which is still just an ordinary logical term. You can see how entire ‘tables’ could be nested inside ‘databases’ inside ‘servers’ inside ‘DNS domains’ – or any other spacelike, setlike containment abstraction that we can imagine for our data – simply by using this one operator /all to represent ‘sets of lists’.

Dataspace 6: Terms as Types

I’m not a huge fan of type theory as it currently exists in functional programming languages such as Haskell. Type theory seems, to me, to be merely an application of logic – and I think what we’d find much more useful than a type-inference engine that only runs at compile time, is a general logical inference engine like Prolog that operates on general logical terms, and can operate at runtime.

But given a generalised logical term structure like term-expressions, we can markup any piece of data we wish as a term in a very lightweight manner. And we can interpret such terms as something like a type-annotated piece of data. This means that we could implement any number of different type systems at runtime, using terms to represent typed data

This could be a hive of security issues, so would need to be very carefully thought through

Dataspace 7: A Low-Level Encoding

CDR coding systems like this typically don’t work well with mutable data. But having the option of ‘jumps’ or ‘redirects’ which take up just one cell means that we have some options for moving data around in memory and leaving a ‘forwarding address’.

Dataspace 8: Example: Movie data

Sidebar: Here’s a quick comparison of what I’m hoping to achieve in terms of syntax and readability, and an example of why I think it’s important to spend a fair bit of time thinking about syntax. Particulary, about what’s not in the syntax, so it’s not there to get in the way.

Dataspace 9: A Tower of Nulls, And Awkward Sets

Looking at term-expressions, one of the first things we notice is that there are a large number of null-like terms. I’m wondering what the meaning of these varieties of null might be.

*In other words, /all is just a ‘multiway AND’. It’s a simpler way of writing a whole bunch of /ands, and simplicity and clarity mostly seem to be what we want when we’re representing data.

But a multiway AND is not quite a set, because a set has a boundary which a logical AND doesn’t have.*

This seems to be quite a deep issue, and I don’t feel like I have a lot of guidance from either logic or set theory at this point. I assume that there are deep connections between set theory and logic – but here, I seem to have struck an equally deep discontinuity between them.

Dataspace 10: An Array Representation

the most popular languages today are not Lisp or Scheme, and don’t usually have a native cons-cell implementation. Further, the model of all storage as a big undifferentiated soup of cons-cells has a couple of big limitations: 1) an O(n) to O(log n) access time, depending on the data structure, if we don’t already have a pointer, and 2) pointers are relative to a big memory pool – they don’t give us an easy way to break our data into chunks and make sure that related data is stored close by.

Given this representation, we can now think about implementing term-expressions directly on top of JSON, in a Javascript environment (or on top of any other array facility in a mainstream programming language). We could use JSON objects as fast, system-provided representations where they make sense (for simple string-indexed key-value maps) and use array-expressions for all other expressions that JSON objects struggle with.


Edited:    |       |    Search Twitter for discussion

No twinpages!