Apache Kudu 1.2.0 Release Notes

New features

  • Kudu clients and servers now redact user data such as cell values from log messages, Java exception messages, and Status strings. User metadata such as table names, column names, and partition bounds are not redacted.

    Redaction is enabled by default, but may be disabled by setting the new log_redact_user_data flag to false.

  • Kudu’s ability to provide consistency guarantees has been substantially improved:

    • Replicas now correctly track their "safe timestamp". This timestamp is the maximum timestamp at which reads are guaranteed to be repeatable.

    • A scan created using the SCAN_AT_SNAPSHOT mode will now either wait for the requested snapshot to be "safe" at the replica being scanned, or be re-routed to a replica where the requested snapshot is "safe". This ensures that all such scans are repeatable.

    • Kudu Tablet Servers now properly retain historical data when a row with a given primary key is inserted and deleted, followed by the insertion of a new row with the same key. Previous versions of Kudu would not retain history in such situations. This allows the server to return correct results for snapshot scans with a timestamp in the past, even in the presence of such "reinsertion" scenarios.

    • The Kudu clients now automatically retain the timestamp of their latest successful read or write operation. Scans using the READ_AT_SNAPSHOT mode without a client-provided timestamp automatically assign a timestamp higher than the timestamp of their most recent write. Writes also propagate the timestamp, ensuring that sequences of operations with causal dependencies between them are assigned increasing timestamps. Together, these changes allow clients to achieve read-your-writes consistency, and also ensure that snapshot scans performed by other clients return causally-consistent results.

  • Kudu servers now automatically limit the number of log files. The number of log files retained can be configured using the max_log_files flag. By default, 10 log files will be retained at each severity level.

Optimizations and improvements

  • The logging in the Java and C++ clients has been substantially quieted. Clients no longer log messages in normal operation unless there is some kind of error.

  • The C++ client now includes a KuduSession::SetErrorBufferSpace API which can limit the amount of memory used to buffer errors from asynchronous operations.

  • The Java client now fetches tablet locations from the Kudu Master in batches of 1000, increased from batches of 10 in prior versions. This can substantially improve the performance of Spark and Impala queries running against Kudu tables with large numbers of tablets.

  • Table metadata lock contention in the Kudu Master was substantially reduced. This improves the performance of tablet location lookups on large clusters with a high degree of concurrency.

  • Lock contention in the Kudu Tablet Server during high-concurrency write workloads was also reduced. This can reduce CPU consumption and improve performance when a large number of concurrent clients are writing to a smaller number of a servers.

  • Lock contention when writing log messages has been substantially reduced. This source of contention could cause high tail latencies on requests, and when under high load could contribute to cluster instability such as election storms and request timeouts.

  • The BITSHUFFLE column encoding has been optimized to use the AVX2 instruction set present on processors including Intel® Sandy Bridge and later. Scans on BITSHUFFLE-encoded columns are now up to 30% faster.

  • The kudu tool now accepts hyphens as an alternative to underscores when specifying actions. For example, kudu local-replica copy-from-remote may be used as an alternative to kudu local_replica copy_from_remote.

Fixed Issues

  • KUDU-1508 Fixed a long-standing issue in which running Kudu on ext4 file systems could cause file system corruption.

  • KUDU-1399 Implemented an LRU cache for open files, which prevents running out of file descriptors on long-lived Kudu clusters. By default, Kudu will limit its file descriptor usage to half of its configured ulimit.

  • Gerrit #5192 Fixed an issue which caused data corruption and crashes in the case that a table had a non-composite (single-column) primary key, and that column was specified to use DICT_ENCODING or BITSHUFFLE encodings. If a table with an affected schema was written in previous versions of Kudu, the corruption will not be automatically repaired; users are encouraged to re-insert such tables after upgrading to Kudu 1.2 or later.

  • Gerrit #5541 Fixed a bug in the Spark KuduRDD implementation which could cause rows in the result set to be silently skipped in some cases.

  • KUDU-1551 Fixed an issue in which the tablet server would crash on restart in the case that it had previously crashed during the process of allocating a new WAL segment.

  • KUDU-1764 Fixed an issue where Kudu servers would leak approximately 16-32MB of disk space for every 10GB of data written to disk. After upgrading to Kudu 1.2 or later, any disk space leaked in previous versions will be automatically recovered on startup.

  • KUDU-1750 Fixed an issue where the API to drop a range partition would drop any partition with a matching lower or upper bound, rather than any partition with matching lower and upper bound.

  • KUDU-1766 Fixed an issue in the Java client where equality predicates which compared an integer column to its maximum possible value (e.g. Integer.MAX_VALUE) would return incorrect results.

  • KUDU-1780 Fixed the kudu-client Java artifact to properly shade classes in the com.google.thirdparty namespace. The lack of proper shading in prior releases could cause conflicts with certain versions of Google Guava.

  • Gerrit #5327 Fixed shading issues in the kudu-flume-sink Java artifact. The sink now expects that Hadoop dependencies are provided by Flume, and properly shades the Kudu client’s dependencies.

  • Fixed a few issues using the Python client library from Python 3.

Wire Protocol compatibility

Kudu 1.2.0 is wire-compatible with previous versions of Kudu:

  • Kudu 1.2 clients may connect to servers running Kudu 1.0. If the client uses features that are not available on the target server, an error will be returned.

  • Kudu 1.0 clients may connect to servers running Kudu 1.2 without limitations.

  • Rolling upgrade between Kudu 1.1 and Kudu 1.2 servers is believed to be possible though has not been sufficiently tested. Users are encouraged to shut down all nodes in the cluster, upgrade the software, and then restart the daemons on the new version.

Incompatible Changes in Kudu 1.2.0

  • The replication factor of tables is now limited to a maximum of 7. In addition, it is no longer allowed to create a table with an even replication factor.

  • The GROUP_VARINT encoding is now deprecated. Kudu servers have never supported this encoding, and now the client-side constant has been deprecated to match the server’s capabilities.

New Restrictions on Data, Schemas, and Identifiers

Kudu 1.2.0 introduces several new restrictions on schemas, cell size, and identifiers:

Number of Columns

By default, Kudu will not permit the creation of tables with more than 300 columns. We recommend schema designs that use fewer columns for best performance.

Size of Cells

No individual cell may be larger than 64KB. The cells making up a a composite key are limited to a total of 16KB after the internal composite-key encoding done by Kudu. Inserting rows not conforming to these limitations will result in errors being returned to the client.

Valid Identifiers

Identifiers such as column and table names are now restricted to be valid UTF-8 strings. Additionally, a maximum length of 256 characters is enforced.

Client Library Compatibility

  • The Kudu 1.2 Java client is API- and ABI-compatible with Kudu 1.1. Applications written against Kudu 1.1 will compile and run against the Kudu 1.2 client and vice-versa.

  • The Kudu 1.2 C++ client is API- and ABI-forward-compatible with Kudu 1.1. Applications written and compiled against the Kudu 1.1 client will run without modification against the Kudu 1.2 client. Applications written and compiled against the Kudu 1.2 client will run without modification against the Kudu 1.1 client unless they use one of the following new APIs:

    • kudu::DisableSaslInitialization()

    • KuduSession::SetErrorBufferSpace(…​)

  • The Kudu 1.2 Python client is API-compatible with Kudu 1.1. Applications written against Kudu 1.1 will continue to run against the Kudu 1.2 client and vice-versa.

Known Issues and Limitations

Please refer to the Known Issues and Limitations section of the documentation.

Installation Options

For full installation details, see Kudu Installation.