
4. Data replication

When talking about Blockchain in its original Bitcoin sense, many authors use the term “Distributed Database”. This is a misuse of the term, because transaction data is replicated (copied) to every participating node rather than distributed between some of them. So the term “Replicated Database” is more accurate.

“Replicated Database” means that data storage requirements grow at least linearly (and network traffic grows super-linearly) with the number of Validation Nodes, regardless of whether such heavy replication is actually feasible. Effectively, this rules out efficient data partitioning (sharding). (Google “CAP theorem” for more details.)
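To make the linear growth concrete, here is a back-of-the-envelope sketch (the ledger size and node count below are invented round numbers, not measurements of any real network):

```python
def total_storage_gb(ledger_gb: float, nodes: int) -> float:
    """Full replication: every validation node stores the entire ledger,
    so total storage across the network grows linearly with node count."""
    return ledger_gb * nodes

# Hypothetical example: a 400 GB ledger replicated to 10,000 nodes
# consumes 4,000,000 GB (4 PB) in aggregate, versus 400 GB if the data
# were partitioned (sharded) across those nodes instead.
print(total_storage_gb(400, 10_000))
```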

Another consideration is that modern Cloud and other infrastructure providers supply customers with data storage solutions that are already backed by hardware-level data replication.

5. Latency and Throughput

Let’s look into scalability a bit more closely.

Bitcoin has a known limitation of about 3 transactions per second, which is a laughable number these days, and the more I talk to (ex-)banking people, the more I hear the same concern.

The same applies to block confirmation time. Whether it averages one block per 10 minutes (Bitcoin), per 12 seconds (Ethereum), or per 1–5 seconds (some other platforms), it doesn't matter: it is still far too slow. Most Ecommerce solutions would reject this out of hand. Do I need one million transactions per second, or one per microsecond?

Generally, "n transactions per second" is a slightly cheeky term, as it actually conflates two dimensions: throughput and latency. To some extent this maps onto horizontal vs vertical scaling.

A frank example: the same reasonably large amount of something can be moved from A to B using Formula 1 cars or freight trains. F1 is low latency / small throughput; a freight train is high latency / huge throughput.

So when we are addressing throughput (and the calculations we run are reasonably parallelisable), we can potentially split the work into independent realms (partitions, shards) and execute them on separate chains. For whatever target number, you just add more chains and machines, set up cascading aggregations, and there you go.
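A minimal sketch of such partitioning (the transaction fields, shard count, and account names below are all invented for illustration): each transaction is deterministically routed to one of several independent chains by hashing its sender account. Note this sketch sidesteps the hard part, cross-shard transactions, which require extra coordination.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical number of independent chains

def shard_for(account: str) -> int:
    """Deterministically map an account to a shard/chain via SHA-256."""
    digest = hashlib.sha256(account.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

def partition(transactions):
    """Split a transaction stream into per-shard sub-streams that can be
    processed on separate chains in parallel."""
    shards = {i: [] for i in range(NUM_SHARDS)}
    for tx in transactions:
        shards[shard_for(tx["from"])].append(tx)
    return shards

txs = [{"from": "alice", "to": "bob", "amount": 10},
       {"from": "carol", "to": "dave", "amount": 5}]
shards = partition(txs)
```

Because routing is a pure function of the account, every node agrees on which chain owns which transaction without any coordination, and adding throughput is a matter of raising the shard count and re-balancing.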

If we are addressing latency, the situation is slightly more interesting, since every single hashing, network, or persistence operation pushes that latency up, and the question becomes what we are sacrificing: consistency, business risk, availability, immutability, and so on.

Granted, millisecond latencies no longer qualify as "true HFT", but even in a garbage-collected language you can still reach on the order of tens of millions of relatively simple operations per second per thread.
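A rough sanity check of single-thread throughput might look like the micro-benchmark below. The numbers it reports depend entirely on the hardware and runtime; interpreted CPython will land well below JIT-compiled GC languages such as Java or C#, which is where the tens-of-millions figure applies.

```python
import time

def ops_per_second(n: int = 1_000_000) -> float:
    """Time n trivial integer additions and report throughput (ops/sec)."""
    start = time.perf_counter()
    acc = 0
    for i in range(n):
        acc += i  # stand-in for a "relatively simple operation"
    elapsed = time.perf_counter() - start
    return n / elapsed

print(f"{ops_per_second():,.0f} ops/sec")
```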

Additionally, the initial stream of transactions can be partitioned into a number of parallel chains to increase the throughput at the same time.

6. Write vs Read

The vast majority of existing Blockchain implementations give users a convenient way of writing data into the system (putting it into a replicated transaction log or distributed ledger). But when it comes to reading, and especially querying, the data, the situation is far less straightforward.

Users are presented with the current state of the system (in the case of Bitcoin, via balances). But finding out how some specific account looked at the moment when, say, the difference between two other accounts was minimal and a fourth one held at least 100 BTC requires writing very non-trivial data processing logic. It may also require traversing an enormous amount of data, as there is no guarantee about which blocks contain transactions related to the accounts in question.
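Even the simplest historical query illustrates the problem. The sketch below (transaction fields and accounts are invented; real Bitcoin uses a UTXO model rather than account balances) reconstructs balances at a given block height, and the only way to do it without a purpose-built index is to replay the log from genesis:

```python
from collections import defaultdict

def balances_at(transactions, block_height):
    """Reconstruct all account balances as of block_height.

    With no secondary index, every historical query is a full scan
    of the transaction log from the very first block."""
    balances = defaultdict(int)
    for tx in transactions:
        if tx["block"] <= block_height:
            balances[tx["from"]] -= tx["amount"]
            balances[tx["to"]] += tx["amount"]
    return balances

log = [
    {"block": 1, "from": "coinbase", "to": "alice", "amount": 50},
    {"block": 2, "from": "alice",    "to": "bob",   "amount": 20},
]
print(dict(balances_at(log, 1)))  # alice's balance before block 2
```

The richer predicates mentioned above ("when the difference between two other accounts was minimal") would need such a replay at every candidate block height, which is exactly the kind of workload a log with no query layer makes painful.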

If we put payments aside and try to build a more general-purpose Blockchain-based solution, we quickly realize that standard SQL databases are simply not designed to work efficiently with data represented as a stream of facts in the perfect tense (transactions, events, etc.) rather than in the continuous tense (balances, final states).
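The perfect-tense vs continuous-tense distinction is essentially event sourcing: the log stores what *happened*, and the current state is derived by folding the events. A minimal sketch (event types and field names are invented for illustration):

```python
from functools import reduce

def apply(state, event):
    """Fold one 'perfect tense' fact into the 'continuous tense' state."""
    account = event["account"]
    if event["type"] == "deposited":
        state[account] = state.get(account, 0) + event["amount"]
    elif event["type"] == "withdrawn":
        state[account] = state.get(account, 0) - event["amount"]
    return state

def current_state(events):
    """Derive balances (final states) from the immutable event stream."""
    return reduce(apply, events, {})

events = [
    {"type": "deposited", "account": "alice", "amount": 100},
    {"type": "withdrawn", "account": "alice", "amount": 30},
]
print(current_state(events))
```

A conventional SQL schema stores only the result of this fold, which is why reconstructing intermediate states, trivial here, becomes awkward once the events themselves are the primary data.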

(to be continued ...)

