Transaction Semantics in Apache Kudu (incubating)

This is a brief introduction to Kudu’s transaction and consistency semantics. For an in-depth technical exposition of most of what is mentioned here, and why it is correct, see the technical report [1].

Kudu’s transactional semantics and architecture are inspired by state-of-the-art systems such as Spanner [2] and Calvin [3]. Kudu builds upon decades of database research. The core philosophy is to make the lives of developers easier by providing transactions with simple, strong semantics, without sacrificing performance or the ability to tune to different requirements.

Kudu is designed to eventually be fully ACID, however, multi-tablet transactions are not yet implemented. As such, this discussion focuses on single-tablet write operations, and only briefly touches multi-tablet reads. Eventually Kudu will support fully strict-serializable semantics. In fact it already does in a limited context, but not all corner cases are covered as this is still a work in progress.

Kudu currently allows the following operations:

  • Write operations are sets of rows to be inserted, updated, or deleted in the storage engine, in a single tablet with multiple replicas. Write operations do not have separate "read sets" i.e. they do not scan existing data before performing the write. Each write is only concerned with previous state of the rows that are about to change. Writes are not "committed" explicitly by the user. Instead, they are committed automatically by the system, after completion.

  • Scans are read operations that can traverse multiple tablets and read information with some consistency or correctness guarantees. Scans can perform time-travel reads, i.e. the user is able to set a scan timestamp in the past and get back results that reflect the state of the storage engine at that point in time.

Before We Begin
  • The term timestamp is mentioned several times to illustrate the functionality, but timestamp is an internal concept mostly invisible to users, except when setting timestamp on a KuduScanner.

  • We generally refer to methods and classes of the async java client. While the C++ client mostly has analogous methods and classes, parity between the APIs is still a work in progress. At times, we may refer specifically to the C++ client.

Single tablet write operations

Kudu employs Multiversion Concurrency Control (MVCC) and the Raft consensus algorithm [4]. Each write operation in Kudu must go through the tablet’s leader.

  1. The leader acquires all locks for the rows that it will change.

  2. The leader assigns the write a timestamp before the write is submitted for replication. This timestamp will be the write’s "tag" in MVCC.

  3. After a majority of replicas acknowledges the change, the actual rows are changed.

  4. After the changes are complete, they are made visible to concurrent writes and reads, atomically.

All replicas of a tablet observe the same order of operations and if a write operation is assigned timestamp n and changes row x, a second write operation at timestamp m > n is guaranteed to see the new value of x.

This strict ordering of lock acquisition and timestamp assignment is enforced to be consistent across all replicas of a tablet through consensus. Therefore, write operations are totally ordered with regard to clock-assigned timestamps, relative to other writes in the same tablet. In other words, writes have strict-serializable semantics, though in an admittedly limited context. See this blog post for a little more context regarding what these semantics mean.

While Isolated and Durable in an ACID sense, write operations are not yet fully Atomic. The failure of a single write in a batch operation does not roll back the operation, but produces per-row errors.

Writing to multiple tablets

Kudu does not yet support transactions that span multiple tablets. However, consistent snapshot reads are possible (with caveats in the current implementation) as explained below.

Writes to a Kudu client are optionally buffered in memory until they are flushed and sent to the server. During the client’s session flush, the rows for each tablet are batched together, and sent to the tablet server which hosts the leader replica of the tablet. Since there are no inter-tablet transactions, each of these batches represents a single, independent write operation with its own timestamp. However you have the option to impose some constraints on the assigned timestamps and on how writes to different tablets can be observed by clients.

Kudu, like Spanner, was designed to be externally consistent [5], preserving consistency even when operations span multiple tablets and even multiple data centers. In practice this means that, if a write operation changes item x at tablet A, and a following write operation changes item y at tablet B, you might want to enforce that if the change to y is observed, the change to x must also be observed. There are many examples where this can be important. For example, if Kudu is storing clickstreams for further analysis, and two clicks follow each other but are stored in different tablets, subsequent clicks should be assigned subsequent timestamps so that the causal relationship between them is captured.

CLIENT_PROPAGATED Consistency

Kudu’s default external consistency mode is called CLIENT_PROPAGATED. See [1] for an extensive explanation on how it works. In brief, this mode causes writes from a single client to be automatically externally consistent. In this mode, writes are only externally consistent from the perspective of a single client. In the clickstream scenario above, if the two clicks are submitted by different client instances, the application must manually propagate timestamps from one client to the other for the causal relationship to be captured.

CLIENT_PROPAGATED consistency is currently only available on the java client and is exposed through the AsyncKuduClient#getLastPropagatedTimestamp() and AsyncKuduClient#setLastPropagatedTimestamp() methods.

Commit Wait Consistency

Kudu also implements an experimental implementation of an external consistency model used in Google’s Spanner , called Commit Wait. Commit Wait works by tightly synchronizing the clocks on all machines in the cluster. Then, when a write occurs, timestamps are assigned and the results of the write are not made visible until enough time has passed so that no other machine in the cluster could possibly assign a lower timestamp to a following write.

For the moment, Kudu’s experimental implementation of Commit Wait is only available in the java client, by setting KuduSession#setExternalConsistencyMode() to COMMIT_WAIT. When using this mode, the latency of writes is tightly tied to the accuracy of clocks on all the cluster hosts, and using this mode with loose clock synchronization causes writes to take a long time to complete or even time out. See Known Issues and Limitations.

Read Operations (Scans)

Scans are read operations performed by clients that may span one or more rows across one or more tablets. When a server receives a scan, it takes a snapshot of the MVCC state and then proceeds in one of two ways depending on the read mode selected by the user by means of the KuduScanner::SetReadMode() method.

READ_LATEST

This is the default read mode. The server takes a snapshot of the MVCC state and proceeds with the read immediately. Reads in this mode only yield 'Read Committed' isolation.

READ_AT_SNAPSHOT

In this read mode, scans are consistent and repeatable. A timestamp for the snapshot is selected either by the server, or set explicitly by the user through KuduScanner::SetSnapshotMicros(). Explicitly setting the timestamp is recommended; see Recommendations. The server waits until this timestamp is 'safe' (until all write operations that have a lower timestamp have completed and are visible). This delay, coupled with an external consistency method, will eventually allow Kudu to have full strict-serializable semantics for reads and writes. This is still a work in progress and some anomalies are still possible (see Known Issues and Limitations). Only scans in this mode can be fault-tolerant.

Selecting between read modes requires balancing the trade-offs and making a choice that fits your workload. For instance, a reporting application that needs to scan the entire database might need to perform careful accounting operations, so that scan may need to be fault-tolerant, but probably doesn’t require a to-the-microsecond up-to-date view of the database. In that case, you might choose READ_AT_SNAPSHOT and select a timestamp that is a few seconds in the past when the scan starts. On the other hand, a machine learning workload that is not ingesting the whole data set and is already statistical in nature might not require the scan to be repeatable, so you might choose READ_LATEST instead.

Known Issues and Limitations

We plan to fix the following issues. Monitor the linked JIRAs for progress.

Serialization

There are several gaps and corner cases that prevent Kudu from being fully strictly-serializable in some situations, at the moment. Below are the details and next, some recommendations.

Scans

  • Support for COMMIT_WAIT is experimental and requires careful tuning of the time-synchronization protocol, such as NTP (Network Time Protocol).

  • Support for externally-consistent write modes is only fully available in the Java API at this time. (see KUDU-1187)

  • In some rare circumstances, the READ_AT_SNAPSHOT scan mode may yield anomalous, non-repeatable reads.

    • When scanning a replica at a snapshot, the replica may not have received all the writes from the leader and might reply immediately, yielding a non-repeatable read (see KUDU-798).

    • On a leader change, scans at a snapshot whose timestamp is beyond the last write may also yield non-repeatable reads (see KUDU-1188). See Recommendations for a workaround.

    • When performing multi-tablet scans without selecting a snapshot timestamp (see KUDU-1189).

  • Impala scans are currently performed as READ_LATEST and have no consistency guarantees.

Writes

  • When a write fails with a timeout or is aborted, it is possible that it may actually be committed. Kudu is currently missing a way to determine if a particular timed-out write ever actually succeeded. On a retry, the write may succeed but may also generate errors if some rows have already been inserted, or deleted (see KUDU-568).

  • When a delete is performed to a row that has already been flushed, and the row is reinserted all history is reset (see KUDU-237). This is not the case for rows that haven’t been flushed yet and still reside in memory.

Recommendations

  • If repeatable snapshot reads are a requirement, use READ_AT_SNAPSHOT with a timestamp that is slightly in the past (between 2-5 seconds, ideally). This will circumvent the anomalies described in Scans. Even when the anomalies have been addressed, back-dating the timestamp will always make scans faster, since they are unlikely to block.

  • If external consistency is a requirement and you decide to use Commit Wait, the time-synchronization protocol needs to be tuned carefully. Each transaction will wait 2x the maximum clock error at the time of execution, which is usually in the 100 msec. to 1 sec. range with the default settings, maybe more. Thus, transactions would take at least 200 msec. to 2 sec. to complete when using the default settings and may even time out.

    • A local server should be used as a time server. We’ve performed experiments using the default NTP time source available in a Google Compute Engine data center and were able to obtain a reasonable tight max error bound, usually varying between 12-17 milliseconds.

    • The following parameters should be adjusted in /etc/ntp.conf to tighten the maximum error:

      • server my_server.org iburst minpoll 1 maxpoll 8

      • tinker dispersion 500

      • tinker allan 0

The above parameters minimize maximum error at the expense of estimated error, the latter might be orders of magnitude above it’s "normal" value. These parameters also may place a greater load on the time server, since they make the servers poll much more frequently.
References
  • [1] David Alves, Todd Lipcon and Vijay Garg. Technical Report: HybridTime - Accessible Global Consistency with High Clock Uncertainty. April, 2014. http://users.ece.utexas.edu/~garg/pdslab/david/hybrid-time-tech-report-01.pdf

  • [2] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, J. J. Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. Spanner: Google’s globally-distributed database. In Proceedings of the 10th USENIX conference on Operating Systems Design and Implementation (OSDI'12). USENIX Association, Berkeley, CA, USA, 251-264.

  • [3] Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. 2012. Calvin: fast distributed transactions for partitioned database systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data (SIGMOD '12). ACM, New York, NY, USA, 1-12. DOI=10.1145/2213836.2213838 http://doi.acm.org/10.1145/2213836.2213838

  • [4] Diego Ongaro and John Ousterhout. 2014. In search of an understandable consensus algorithm. In Proceedings of the 2014 USENIX conference on USENIX Annual Technical Conference (USENIX ATC'14), Garth Gibson and Nickolai Zeldovich (Eds.). USENIX Association, Berkeley, CA, USA, 305-320.

  • [5] Kwei-Jay Lin, "Consistency issues in real-time database systems," in System Sciences, 1989. Vol.II: Software Track, Proceedings of the Twenty-Second Annual Hawaii International Conference on , vol.2, no., pp.654-661 vol.2, 3-6 Jan 1989 doi: 10.1109/HICSS.1989.48069