Truth, Lakes and Time

Eduardo Bellani
Jul 15, 2021
The persistence of memory

What, then, is time? I know well enough what it is,
provided that nobody asks me; but if I am asked what it is and try to explain,
I am baffled.
– Saint Augustine of Hippo:
Confessions (397)

My core theses here are the following:

  1. In order to leverage its data commercially, a company needs a single source of truth (SST).
  2. A Data Lake can be a solution for the problem of not having a SST.
  3. A Data Lake needs to contain some amount of structure to be useful. The most important structure to add is time or, more precisely, transaction time.

It is usual for companies big and small to have valuable data. Unfortunately, it's also usual for them to be unable to leverage that data to generate business value. This state of affairs, I think it is safe to say, is in large part due to not having a SST for their data. Why? Because the absence of a SST impacts the bottom line in at least these 2 ways:

  1. Grouping the data to generate reports and knowledge is expensive. This means only big projects are worth undertaking, and a myriad of opportunities are wasted because questions go unanswered.
  2. The trustworthiness of the data erodes, since the same data can be stored in several places and updated by different parties with different views and interests.

In the past, the popular concept for tackling the SST problem was the Data Warehouse. These days it seems that the Data Lake concept is gaining more traction, probably because it is much cheaper to set up. The following sections apply to both concepts, with an emphasis on the Data Lake.

Let’s begin by defining a Data Lake:

A Data Lake is a single repository of data that seeks to be a foundational single source of truth for a given context by storing copies of source-data.

Notice that for a Data Lake instance to be coherent with the above, data is never erased. Why? Let's take the source-data x. A copy of it at time t could be represented as such:

At day t I have noticed x

One could instantiate the abstract language above like this:

"At time [2021-07-14 Wed 13:00] I have observed that the name of the Customer is John"

Even if the value of x changes in the future, nothing is altered about the above fact. It is still the case that, at time t, I believed that x was the case. I.e., at time [2021-07-14 Wed 13:00] I believed that the Customer’s name was John. It does not matter at all if I later discovered that the name of the customer was Betty.
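As a minimal sketch of this idea, assume the lake is an append-only list of observations. The names `Observation`, `ObservationLog` and `believed_at`, the `customer-1` identifier and the second timestamp are hypothetical, chosen only for illustration:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Observation:
    """One immutable fact: 'at observed_at I noticed that attribute = value'."""
    observed_at: datetime
    entity: str
    attribute: str
    value: str


class ObservationLog:
    """Append-only store: facts are added, never updated or deleted."""

    def __init__(self):
        self._facts = []

    def record(self, fact):
        self._facts.append(fact)

    def believed_at(self, entity, attribute, when):
        """Return the value we believed at `when`, or None if nothing was recorded yet."""
        candidates = [
            f for f in self._facts
            if f.entity == entity and f.attribute == attribute and f.observed_at <= when
        ]
        if not candidates:
            return None
        return max(candidates, key=lambda f: f.observed_at).value


log = ObservationLog()
# The observation from the text.
log.record(Observation(datetime(2021, 7, 14, 13, 0), "customer-1", "name", "John"))
# A later, hypothetical observation that does not erase the first one.
log.record(Observation(datetime(2021, 7, 20, 9, 0), "customer-1", "name", "Betty"))
```

Asking `log.believed_at("customer-1", "name", datetime(2021, 7, 14, 13, 0))` still answers with the earlier belief, even though a newer one has been recorded since.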

Keeping such temporal facts in the Data Lake is crucial for leveraging statistics to create business value.

Notice also that we can revise data that contains period information without erasing anything. For instance:

At day t I have noticed x that spans period p

At day t+1, we could have noticed that p was actually p+1. What we would have now stored in our lake would be the following 2 facts:

1- At day t I have noticed x that spans period p

2- At day t+1 I have updated my belief about x, such that the spanned period is now p+1
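The two facts above could be sketched as follows. This is an illustration under assumptions: `PeriodFact`, `period_as_of`, the `contract-42` subject and the concrete dates are all hypothetical, and "p+1" is modelled here simply as a longer period.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PeriodFact:
    recorded_on: date    # the day we noticed or revised the fact
    subject: str         # the x in the text
    period_start: date   # start of the spanned period
    period_end: date     # end of the spanned period (inclusive)


facts = []  # append-only list standing in for the lake

# Day t: we notice x spanning period p.
facts.append(PeriodFact(date(2021, 7, 14), "contract-42", date(2021, 1, 1), date(2021, 6, 30)))
# Day t+1: we revise our belief about the period; the earlier fact stays untouched.
facts.append(PeriodFact(date(2021, 7, 15), "contract-42", date(2021, 1, 1), date(2021, 7, 31)))


def period_as_of(subject, day):
    """Period we believed `subject` spanned on `day`, or None if unknown then."""
    known = [f for f in facts if f.subject == subject and f.recorded_on <= day]
    if not known:
        return None
    latest = max(known, key=lambda f: f.recorded_on)
    return (latest.period_start, latest.period_end)
```

Both rows remain in `facts`, so the lake can answer what it believed on day t as well as what it believes now.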

What we are driving at here are the following 2 concepts: Valid and Transaction times. What are the definitions of such concepts? Quoting Time and Relational Theory: Temporal Databases in the Relational Model and SQL (Date, Darwen and Lorentzos):

The transaction time for a proposition p is the set of times t such that, according to what the database stated at time t, p was true. Note that proposition p

  1. must have been represented in the database at some time, either explicitly or implicitly, and
  2. can itself refer to the past and/or present and/or future.

The valid time for a proposition q is the set of times t such that, according to what the database currently states (which is to say, according to our current beliefs), q is, was, or will be true at time t.

In a nutshell:

  • Transaction times are the times in the past when the database said we believed something is, was, or will be true.
  • Valid times are the times (past and/or present and/or future) when, according to what we believe right now, something is, was, or will be true.

More informally: valid times are kept in the database; transaction times are kept in the log.

In order to properly leverage the SST to create business value, a Data Lake must keep the log as searchable as the database. The best way to keep that searchability is to store the log information in the same format as the rest of the data, i.e., in the database.
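One way to picture this, as a rough sketch rather than a recommended design, is a table where the transaction time is an ordinary column and rows are only ever inserted. The example uses Python's standard `sqlite3` module; the table name `customer_name`, the column `recorded_at`, the helper `name_as_of` and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customer_name (
        customer_id TEXT NOT NULL,
        name        TEXT NOT NULL,
        recorded_at TEXT NOT NULL  -- transaction time, ISO 8601 so text order = time order
    )
    """
)

# Rows are only ever inserted; a correction is a new row, not an UPDATE.
conn.executemany(
    "INSERT INTO customer_name VALUES (?, ?, ?)",
    [
        ("customer-1", "John", "2021-07-14T13:00:00"),
        ("customer-1", "Betty", "2021-07-20T09:00:00"),  # hypothetical later correction
    ],
)


def name_as_of(customer_id, as_of):
    """Name the database stated for the customer at time `as_of`, or None."""
    row = conn.execute(
        """
        SELECT name FROM customer_name
        WHERE customer_id = ? AND recorded_at <= ?
        ORDER BY recorded_at DESC
        LIMIT 1
        """,
        (customer_id, as_of),
    ).fetchone()
    return row[0] if row else None
```

Because the log lives in the same table as the data, a question such as "what did we believe on July 15?" becomes a plain query, `name_as_of("customer-1", "2021-07-15T00:00:00")`, instead of a search through a separate log.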
