Apache Kudu (incubating) Schema Design

Kudu tables have a structured data model similar to tables in a traditional RDBMS. Schema design is critical for achieving the best performance and operational stability from Kudu. Every workload is unique, and there is no single schema design that is best for every table. This document outlines effective schema design philosophies for Kudu, paying particular attention to where they differ from approaches used for traditional RDBMS schemas.

At a high level, there are three concerns in Kudu schema design: column design, primary keys, and data distribution. Of these, only data distribution will be a new concept for those familiar with traditional relational databases. The next sections discuss altering the schema of an existing table, and known limitations with regard to schema design.

Column Design

A Kudu table consists of one or more columns, each with a predefined type. Columns that are not part of the primary key may optionally be nullable. Supported column types include:

  • boolean

  • 8-bit signed integer

  • 16-bit signed integer

  • 32-bit signed integer

  • 64-bit signed integer

  • timestamp

  • single-precision (32-bit) IEEE-754 floating-point number

  • double-precision (64-bit) IEEE-754 floating-point number

  • UTF-8 encoded string

  • binary

Kudu takes advantage of strongly-typed columns and a columnar on-disk storage format to provide efficient encoding and serialization. To make the most of these features, columns must be specified as the appropriate type, rather than simulating a 'schemaless' table using string or binary columns for data which may otherwise be structured. In addition to encoding, Kudu optionally allows compression to be specified on a per-column basis.
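For illustration, the following is a minimal sketch of declaring strongly-typed columns with the Java client's ColumnSchemaBuilder. The column names are hypothetical, and the package names reflect later Apache Kudu releases (they may differ in the incubating releases):

import java.util.Arrays;

import org.apache.kudu.ColumnSchema.ColumnSchemaBuilder;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;

public class ColumnDesignSketch {
  public static void main(String[] args) {
    // Each column is declared with an explicit type; only non-key columns may be nullable.
    Schema schema = new Schema(Arrays.asList(
        new ColumnSchemaBuilder("user_id", Type.INT64).key(true).build(),
        new ColumnSchemaBuilder("country", Type.STRING).nullable(true).build(),
        new ColumnSchemaBuilder("score", Type.DOUBLE).nullable(true).build(),
        new ColumnSchemaBuilder("active", Type.BOOL).nullable(true).build()));
    System.out.println("columns: " + schema.getColumnCount());
  }
}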

Column Encoding

Each column in a Kudu table can be created with an encoding; the available encodings depend on the type of the column. Columns use plain encoding by default.

Table 1. Encoding Types

Column Type          Encoding
-------------------  -----------------------------
integer, timestamp   plain, bitshuffle, run length
float                plain, bitshuffle
bool                 plain, dictionary, run length
string, binary       plain, prefix, dictionary

Plain Encoding

Data is stored in its natural format. For example, int32 values are stored as fixed-size 32-bit little-endian integers.

Bitshuffle Encoding

Data is rearranged to store the most significant bit of every value, followed by the second most significant bit of every value, and so on. Finally, the result is LZ4 compressed. Bitshuffle encoding is a good choice for columns that have many repeated values, or values that change by small amounts when sorted by primary key. The bitshuffle project has a good overview of performance and use cases.

Run Length Encoding

Runs (consecutive repeated values) are compressed in a column by storing only the value and the count. Run length encoding is effective for columns with many consecutive repeated values when sorted by primary key.

Dictionary Encoding

A dictionary of unique values is built, and each column value is encoded as its corresponding index in the dictionary. Dictionary encoding is effective for columns with low cardinality. If the number of unique values in a given row set is too high to be encoded efficiently, Kudu will transparently fall back to plain encoding for that row set. This is evaluated during flush.

Prefix Encoding

Common prefixes are compressed in consecutive column values. Prefix encoding can be effective for values that share common prefixes, or the first column of the primary key, since rows are sorted by primary key within tablets.
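
As a minimal sketch, an encoding can be chosen per column at creation time through the Java client (enum names follow the client's ColumnSchema.Encoding class; the column names here are hypothetical):

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.ColumnSchema.ColumnSchemaBuilder;
import org.apache.kudu.ColumnSchema.Encoding;
import org.apache.kudu.Type;

public class ColumnEncodingSketch {
  public static void main(String[] args) {
    // A low-cardinality string column is a good fit for dictionary encoding.
    ColumnSchema country = new ColumnSchemaBuilder("country", Type.STRING)
        .encoding(Encoding.DICT_ENCODING)
        .build();
    // An integer column whose values change by small amounts in primary key
    // order compresses well with bitshuffle encoding.
    ColumnSchema reading = new ColumnSchemaBuilder("reading", Type.INT32)
        .encoding(Encoding.BIT_SHUFFLE)
        .build();
    System.out.println(country + " / " + reading);
  }
}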

Column Compression

Kudu allows per-column compression using LZ4, snappy, or zlib compression codecs. By default, columns are stored uncompressed. Consider using compression if reducing storage space is more important than raw scan performance.

Every data set will compress differently, but in general LZ4 has the least effect on performance, while zlib will compress to the smallest data sizes. Bitshuffle-encoded columns are inherently compressed using LZ4, so it is not typically beneficial to apply additional compression on top of this encoding.
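
A sketch of per-column compression with the Java client follows (CompressionAlgorithm values as in the client's ColumnSchema class; the column names are hypothetical):

import org.apache.kudu.ColumnSchema;
import org.apache.kudu.ColumnSchema.ColumnSchemaBuilder;
import org.apache.kudu.ColumnSchema.CompressionAlgorithm;
import org.apache.kudu.Type;

public class ColumnCompressionSketch {
  public static void main(String[] args) {
    // zlib trades scan speed for the smallest on-disk footprint.
    ColumnSchema payload = new ColumnSchemaBuilder("payload", Type.BINARY)
        .compressionAlgorithm(CompressionAlgorithm.ZLIB)
        .build();
    // LZ4 has the least impact on scan performance.
    ColumnSchema notes = new ColumnSchemaBuilder("notes", Type.STRING)
        .compressionAlgorithm(CompressionAlgorithm.LZ4)
        .build();
    System.out.println(payload + " / " + notes);
  }
}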

Primary Keys

Each Kudu table must declare a primary key composed of one or more columns. Primary key columns must be non-nullable, and may not be a boolean or floating-point type. Every row in a table must have a unique set of values for its primary key columns. As with a traditional RDBMS, primary key selection is critical to ensuring performant database operations.

Unlike an RDBMS, Kudu does not provide an auto-incrementing column feature, so the application must always provide the full primary key during insert or ingestion. In addition, Kudu does not allow the primary key values of a row to be updated.

Within a tablet, rows are stored sorted lexicographically by primary key. Advanced schema designs can take advantage of this ordering to achieve good distribution of data among tablets, while retaining consistent ordering in intra-tablet scans. See Data Distribution for more information.
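
As a rough Java-client sketch of a compound primary key (column names are illustrative), key columns are declared non-nullable and listed first, in key order:

import java.util.Arrays;

import org.apache.kudu.ColumnSchema.ColumnSchemaBuilder;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;

public class PrimaryKeySketch {
  public static void main(String[] args) {
    // Rows within a tablet are sorted by (last_name, first_name).
    Schema customers = new Schema(Arrays.asList(
        new ColumnSchemaBuilder("last_name", Type.STRING).key(true).build(),
        new ColumnSchemaBuilder("first_name", Type.STRING).key(true).build(),
        new ColumnSchemaBuilder("order_count", Type.INT32).nullable(true).build()));
    System.out.println(customers.getPrimaryKeyColumnCount() + " key columns");
  }
}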

Data Distribution

Kudu tables, unlike traditional relational tables, are partitioned into tablets and distributed across many tablet servers. A row always belongs to a single tablet (and its replicas). The method of assigning rows to tablets is specified in a configurable partition schema for each table, during table creation.

Choosing a data distribution strategy requires you to understand the data model and expected workload of a table. For write-heavy workloads, it is important to design the distribution such that writes are spread across tablets in order to avoid overloading a single tablet. For workloads involving many short scans, performance can be improved if all of the data for the scan is located in the same tablet. Understanding these fundamental trade-offs is central to designing an effective partition schema.

Kudu provides two types of partition schema: range partitioning and hash bucketing. These schema types can be used together or independently. Kudu does not yet allow tablets to be split after creation, so you must design your partition schema ahead of time to ensure that a sufficient number of tablets are created.

Range Partitioning

With range partitioning, rows are distributed into tablets using a totally-ordered distribution key. Each tablet is assigned a contiguous segment of the table’s distribution keyspace. By default, the distribution key uses all of the columns of the primary key, but it may be configured to use any subset of the primary key columns.

During table creation, tablet boundaries are specified as a sequence of split rows. Consider the following table schema (using SQL syntax for clarity):

CREATE TABLE customers (
  first_name STRING NOT NULL,
  last_name STRING NOT NULL,
  order_count INT32,
  PRIMARY KEY (last_name, first_name)
)

Specifying the split rows as (("b", ""), ("c", ""), ("d", ""), .., ("z", "")) (25 split rows total) will result in the creation of 26 tablets, with each tablet containing a range of customer surnames all beginning with a given letter. This is an effective partition schema for a workload where customers are inserted and updated uniformly by last name, and scans are typically performed over a range of surnames.
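
A sketch of creating this table with split rows through the Java client follows (the master address and table name are placeholders; method names follow the client's CreateTableOptions API):

import java.util.Arrays;

import org.apache.kudu.ColumnSchema.ColumnSchemaBuilder;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.PartialRow;

public class RangePartitionSketch {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema(Arrays.asList(
        new ColumnSchemaBuilder("last_name", Type.STRING).key(true).build(),
        new ColumnSchemaBuilder("first_name", Type.STRING).key(true).build(),
        new ColumnSchemaBuilder("order_count", Type.INT32).nullable(true).build()));

    CreateTableOptions options = new CreateTableOptions()
        .setRangePartitionColumns(Arrays.asList("last_name", "first_name"));
    // One split row per letter "b".."z" (25 split rows) yields 26 tablets.
    for (char c = 'b'; c <= 'z'; c++) {
      PartialRow split = schema.newPartialRow();
      split.addString("last_name", String.valueOf(c));
      split.addString("first_name", "");
      options.addSplitRow(split);
    }

    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      client.createTable("customers", schema, options);
    } finally {
      client.close();
    }
  }
}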

It may make sense to partition a table by range using only a subset of the primary key columns, or with a different ordering than the primary key. For instance, you can change the above example to specify that the range partition should only include the last_name column. In that case, Kudu would guarantee that all customers with the same last name would fall into the same tablet, regardless of the provided split rows.

Hash Bucketing

Hash bucketing distributes rows by hash value into one of many buckets. Each tablet is responsible for the rows falling into a single bucket. The number of buckets (and therefore tablets) is specified during table creation. Typically, all of the primary key columns are used as the columns to hash, but as with range partitioning, any subset of the primary key columns can be used.

Hash partitioning is an effective strategy to increase the amount of parallelism for workloads that would otherwise skew writes into a small number of tablets. Consider the following table schema.

CREATE TABLE metrics (
  host STRING NOT NULL,
  metric STRING NOT NULL,
  time TIMESTAMP NOT NULL,
  measurement DOUBLE,
  PRIMARY KEY (time, metric, host)
)

If you use the default range partitioning over the primary key columns, inserts will tend to only go to the tablet covering the current time, which limits the maximum write throughput to the throughput of a single tablet. If you use hash partitioning, you can guarantee a number of parallel writes equal to the number of buckets specified when defining the partition schema. The trade-off is that a scan over a single time range now must touch each of these tablets, instead of (possibly) a single tablet. Hash bucketing can be an effective tool for mitigating other types of write skew as well, such as monotonically increasing values.
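
A sketch of hash bucketing this table via the Java client follows; the bucket count of 16 is arbitrary, the master address is a placeholder, and the time column is declared as a 64-bit integer to keep the sketch version-agnostic:

import java.util.Arrays;

import org.apache.kudu.ColumnSchema.ColumnSchemaBuilder;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.KuduClient;

public class HashBucketSketch {
  public static void main(String[] args) throws Exception {
    Schema schema = new Schema(Arrays.asList(
        new ColumnSchemaBuilder("time", Type.INT64).key(true).build(),
        new ColumnSchemaBuilder("metric", Type.STRING).key(true).build(),
        new ColumnSchemaBuilder("host", Type.STRING).key(true).build(),
        new ColumnSchemaBuilder("measurement", Type.DOUBLE).nullable(true).build()));

    // 16 hash buckets over all primary key columns spreads inserts across 16 tablets
    // instead of concentrating them on the tablet covering the current time.
    CreateTableOptions options = new CreateTableOptions()
        .addHashPartitions(Arrays.asList("time", "metric", "host"), 16);

    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      client.createTable("metrics", schema, options);
    } finally {
      client.close();
    }
  }
}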

As an advanced optimization, you can create a table with more than one hash bucket component, as long as the column sets included in each are disjoint, and all hashed columns are part of the primary key. The total number of tablets created will be the product of the hash bucket counts. For example, the above metrics table could be created with two hash bucket components, one over the time column with 4 buckets, and one over the metric and host columns with 8 buckets. The total number of tablets will be 32. The advantage of using two separate hash bucket components is that scans which specify equality constraints on the metric and host columns will be able to skip 7/8 of the total tablets, leaving a total of just 4 tablets to scan.

This optimization is not yet implemented. See known limitations for details.

Hash Bucketing and Range Partitioning

Hash bucketing can be combined with range partitioning. Adding hash bucketing to a range partitioned table has the effect of parallelizing operations that would otherwise operate sequentially over the range. The total number of tablets is the number of hash buckets multiplied by the number of range partitions (the number of split rows plus one).
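
For example, 4 hash buckets over (host, metric) combined with range partitioning on time and 2 split rows yields 4 x (2 + 1) = 12 tablets. A rough Java-client sketch follows; the bucket count, split values, and integer time column are illustrative assumptions:

import java.util.Arrays;

import org.apache.kudu.ColumnSchema.ColumnSchemaBuilder;
import org.apache.kudu.Schema;
import org.apache.kudu.Type;
import org.apache.kudu.client.CreateTableOptions;
import org.apache.kudu.client.PartialRow;

public class HashAndRangeSketch {
  public static void main(String[] args) {
    Schema schema = new Schema(Arrays.asList(
        new ColumnSchemaBuilder("time", Type.INT64).key(true).build(),
        new ColumnSchemaBuilder("metric", Type.STRING).key(true).build(),
        new ColumnSchemaBuilder("host", Type.STRING).key(true).build(),
        new ColumnSchemaBuilder("measurement", Type.DOUBLE).nullable(true).build()));

    // 4 hash buckets over (host, metric), plus range partitioning on time with
    // 2 split rows (3 ranges): 4 buckets * 3 ranges = 12 tablets in total.
    CreateTableOptions options = new CreateTableOptions()
        .addHashPartitions(Arrays.asList("host", "metric"), 4)
        .setRangePartitionColumns(Arrays.asList("time"));
    for (long splitTime : new long[] {1420070400L, 1451606400L}) {
      PartialRow split = schema.newPartialRow();
      split.addLong("time", splitTime);
      options.addSplitRow(split);
    }
    System.out.println("partition schema configured");
  }
}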

Schema Alterations

You can alter a table’s schema in the following ways:

  • Rename the table

  • Rename, add, or drop columns

  • Rename (but not drop) primary key columns

You cannot modify the partition schema after table creation.
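
A sketch of the supported alterations through the Java client's AlterTableOptions follows (table, column, and master names are placeholders):

import org.apache.kudu.Type;
import org.apache.kudu.client.AlterTableOptions;
import org.apache.kudu.client.KuduClient;

public class AlterTableSketch {
  public static void main(String[] args) throws Exception {
    KuduClient client = new KuduClient.KuduClientBuilder("kudu-master:7051").build();
    try {
      AlterTableOptions alter = new AlterTableOptions()
          .renameTable("customers_v2")                      // rename the table
          .addColumn("loyalty_tier", Type.STRING, "none")   // add a column with a default value
          .renameColumn("order_count", "lifetime_orders")   // rename a column
          .dropColumn("obsolete_flag");                     // drop a column
      client.alterTable("customers", alter);
    } finally {
      client.close();
    }
  }
}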

Known Limitations

Kudu currently has some known limitations that may factor into schema design:

Immutable Primary Keys

Kudu does not allow you to update the primary key of a row after insertion.

Non-alterable Primary Key

Kudu does not allow you to alter the primary key columns after table creation.

Non-alterable Partition Schema

Kudu does not allow you to alter the partition schema after table creation.

Partition Pruning

When tables use hash buckets, the Java and C++ clients do not yet use scan predicates to prune tablets for scans over these tables. In the future, specifying an equality predicate on all columns in the hash bucket component will limit the scan to only the tablets corresponding to the hash bucket.

Tablet Splitting

You currently cannot split or merge tablets after table creation. You must create the appropriate number of tablets in the partition schema at table creation. As a workaround, you can copy the contents of one table to another by using a CREATE TABLE AS SELECT statement, or by creating an empty table and populating it with an INSERT ... SELECT statement.