Apache Kudu - Contributing to Apache Kudu

Contributing to Apache Kudu

Contributing Patches Using Gerrit

The Kudu team uses Gerrit for code review, rather than Github pull requests. Typically, you pull from Github but push to Gerrit, and Gerrit is used to review code and merge it into Github.

See the Gerrit Tutorial for an overview of using Gerrit for code review.

Initial Setup for Gerrit

Sign in to Gerrit using your Github username.
Go to Settings. Update your name and email address on the Contact Information page, and upload a SSH public key under SSH Public Keys if you would like to use SSH to connect to Gerrit. Generate an HTTP password under HTTP Password if you would like to use HTTP or HTTPS to connect to Gerrit. (Most Kudu developers use the SSH option.)

If you do not update your name, it will appear as "Anonymous Coward" in Gerrit reviews.
If you have not done so, clone the main Kudu repository. By default, the main remote is called origin. When you fetch or pull, you will do so from origin.
```
git clone https://github.com/apache/kudu
```
Change to the new kudu directory.

Add a gerrit remote.

If using SSH to connect to Gerrit, use the following command to add the Gerrit remote (substitute <username> with your Github username):

git remote add gerrit ssh://<username>@gerrit.cloudera.org:29418/kudu

If using HTTP or HTTPS to connect to Gerrit, use the following command to add the Gerrit remote (http:// also works):

git remote add gerrit https://gerrit.cloudera.org/a/kudu

If you are using Gerrit’s HTTP or HTTPS endpoints and prefer not to type a username and password each time you submit a patch, you can put your login and password into a .netrc file located at $HOME/.netrc and Git will use it. The password is stored as plaintext and the file format is as follows:

machine <hostname>
  login <username>
  password <password>

Run the following command to install the Gerrit commit-msg hook:

cd kudu
gitdir=$(git rev-parse --git-dir)
curl -LSsf https://gerrit.cloudera.org/tools/hooks/commit-msg -o ${gitdir}/hooks/commit-msg
chmod +x ${gitdir}/hooks/commit-msg

Be sure you have set the Kudu repository to use pull --rebase by default. You can use the following two commands, assuming you have only ever checked out master so far:
```
git config branch.autosetuprebase always
git config branch.master.rebase true
```
If for some reason you had already checked out branches other than master, substitute master for the other branch names in the second command above.

Submitting Patches

To submit a patch, first commit your change (using a descriptive multi-line commit message if possible), then push the request to the gerrit remote. For instance, to push a change to the master branch:

git push gerrit HEAD:refs/for/master --no-thin

or to push a change to the gh-pages branch (to update the website):

git push gerrit HEAD:refs/for/gh-pages --no-thin

While preparing a patch for review, it’s a good idea to follow generic git commit guidelines and good practices.

The --no-thin argument is a workaround to prevent an error in Gerrit. See https://code.google.com/p/gerrit/issues/detail?id=1582.

Consider creating Git aliases for the above commands. Gerrit also includes a command-line tool called git-review, which you may find helpful.

You can add reviewers automatically for a patch by adding their GitHub username or associated email address to the remote branch name following with the "r" flag:

git push gerrit HEAD:refs/for/master%r=githubuser,r=example@apache.org

To find possible reviewer candidates for your commit, use git blame or git log to find out who are involved with the area you’re touching. It’s also a good idea to add as reviewer whoever is involved with the JIRA you’re working on.

Gerrit will add a change ID to your commit message and will create a Gerrit review, whose URL will be emitted as part of the push reply. If desired, you can send a message to the kudu-dev mailing list, explaining your patch and requesting review.

After getting feedback, you can update or amend your commit, (for instance, using a command like git commit --amend) while leaving the Change ID intact. Push your change to Gerrit again, and this will create a new patch set in Gerrit and notify all reviewers about the change.

When your code has been reviewed and is ready to be merged into the Kudu code base, a Kudu committer will merge it using Gerrit. You can discard your local branch.

Abandoning a Review

If your patch is not accepted or you decide to pull it from consideration, you can use the Gerrit UI to Abandon the patch. It will still show in Gerrit’s history, but will not be listed as a pending review.

Reviewing Patches In Gerrit

You can view a unified or side-by-side diff of changes in Gerrit using the web UI. To leave a comment, click the relevant line number or highlight the relevant part of the line, and type 'c' to bring up a comment box. To submit your comments and/or your review status, go up to the top level of the review and click Reply. You can add additional top-level comments here, and submit them.

To check out code from a Gerrit review, click Download and paste the relevant Git commands into your Git client. You can then update the commit and push to Gerrit to submit a patch to the review, even if you were not the original reviewer.

Gerrit allows you to vote on a review. A vote of +2 from at least one committer (besides the submitter) is required before the patch can be merged.

Code Style

C++ Code Style

Get familiar with these guidelines so that your contributions can be reviewed and integrated quickly and easily.

In general, Kudu follows the Google C++ Style Guide, with the following exceptions:

Notes on C++ 11

Kudu uses C++ 11. Check out this handy guide to C++ 11 move semantics and rvalue references: https://www.chromium.org/rvalue-references

We aim to follow most of the same guidelines, such as, where possible, migrating away from foo.Pass() in favor of std::move(foo).

Limitations on `boost` Use

boost classes from header-only libraries can be used in cases where a suitable replacement does not exist in the Kudu code base. However:

Do not introduce dependencies on boost classes where equivalent functionality exists in the standard C++ library or in src/kudu/gutil/. For example, prefer strings::Split() from gutil rather than boost::split.
Prefer using functionality from boost rather than re-implementing the same functionality, unless using the boost functionality requires excessive use of C++ features which are disallowed by our style guidelines. For example, boost::spirit is heavily based on template metaprogramming and should not be used.
Do not use boost in any public headers for the Kudu C++ client, because boost commonly breaks backward compatibility, and passing data between two boost versions (one by the user, one by Kudu) causes serious issues.

When in doubt about introducing a new dependency on any boost functionality, it is best to email dev@kudu.apache.org to start a discussion.

Line length

The Kudu team allows line lengths of 100 characters per line, rather than Google’s standard of 80. Try to keep under 80 where possible, but you can spill over to 100 or so if necessary.

Pointers

Smart Pointers and Singly-Owned Pointers

Generally, most objects should have clear "single-owner" semantics. Most of the time, singly-owned objects can be wrapped in a unique_ptr<> which ensures deletion on scope exit and prevents accidental copying.

If an object is singly owned, but referenced from multiple places, such as when the pointed-to object is known to be valid at least as long as the pointer itself, associate a comment with the constructor which takes and stores the raw pointer, as in the following example.

  // 'blah' must remain valid for the lifetime of this class
  MyClass(const Blah* blah) :
    blah_(blah) {
  }

Older parts of the Kudu code base use gscoped_ptr instead of unique_ptr. These are hold-overs from before Kudu adopted C++11. New code should not use gscoped_ptr except when necessary to interface with existing code. Alternatively, consider updating usages as you come across them.

Using std::auto_ptr is strictly disallowed because of its difficult and bug-prone semantics. Besides, std::auto_ptr is declared deprecated since C++11.

Smart Pointers for Multiply-Owned Pointers:

Although single ownership is ideal, sometimes it is not possible, particularly when multiple threads are in play and the lifetimes of the pointers are not clearly defined. In these cases, you can use either std::shared_ptr or Kudu’s own scoped_refptr from gutil/ref_counted.hpp. Each of these mechanisms relies on reference counting to automatically delete the referent once no more pointers remain. The key difference between these two types of pointers is that scoped_refptr requires that the object extend a RefCounted base class, and stores its reference count inside the object storage itself, while shared_ptr maintains a separate reference count on the heap.

The pros and cons are:

shared_ptr

can be used with any type of object, without the object deriving from a special base class
part of the standard library and familiar to most C++ developers
supports the weak_ptr use cases:
- a temporary ownership when an object needs to be accessed only if it exists
- break circular references of shared_ptr, if any exists due to aggregation
you can convert from the shared_ptr into the weak_ptr and back
if creating an instance with std::make_shared<>() only one allocation is made (since C++11; a non-binding requirement in the Standard, though)
if creating a new object with shared_ptr<T> p(new T) requires two allocations (one to create the ref count, and one to create the object)
the ref count may not be near the object on the heap, so extra cache misses may be incurred on access
the shared_ptr instance itself requires 16 bytes (pointer to the ref count and pointer to the object)

scoped_refptr

only requires a single allocation, and ref count is on the same cache line as the object
the pointer only requires 8 bytes (since the ref count is within the object)
you can manually increase or decrease reference counts when more control is required
you can convert from a raw pointer back to a scoped_refptr safely without worrying about double freeing
since we control the implementation, we can implement features, such as debug builds that capture the stack trace of every referent to help debug leaks.
the referred-to object must inherit from RefCounted
does not support the weak_ptr use cases

Since scoped_refptr is generally faster and smaller, try to use it rather than shared_ptr in new code. Existing code uses shared_ptr in many places. When interfacing with that code, you can continue to use shared_ptr.

Function Binding and Callbacks

Existing code uses boost::bind and boost::function to capture and manage functors. For new code, use std::bind and std::function, which are functionally equivalent.

Alternatively, the Bind and Callback classes in gutil may also be used to capture functors. See gutil/callback.h for more details and util/callback_bind-test.cc for examples. While less full-featured, they provide different options from their counterparts by the way of argument lifecycle management. The benefits of each are described below:

std::bind and std::function

natively supports binding shared_ptr and weak_ptr objects, so a bound smartpointer argument will increment its count during bind, and decrement it when the function leaves scope
supports argument placeholders, wrapped function pointers, and function objects

Bind and Callback

natively supports binding RefCounted objects, so a bound argument whose class extends RefCounted will increment its count during Bind and decrement it when the Callback goes out of scope

GFlags

Kudu uses gflags for both command-line and file-based configuration. Use these guidelines to add a new gflag. All new gflags must conform to these guidelines. Existing non-conformant ones will be made conformant in time.

Name

The gflag’s name conveys a lot of information, so choose a good name. The name will propagate into other systems, such as the Configuration Reference.

The different parts of a multi-word name should be separated by underscores. For example, fs_data_dirs.
The name should be prefixed with the context that it affects. For example, webserver_num_worker_threads and cfile_default_block_size. Context can be difficult to define, so bear in mind that this prefix will be used to group similar gflags together. If the gflag affects the entire process, it should not be prefixed.
If the gflag is for a quantity, the name should be suffixed with the units. For example, tablet_copy_idle_timeout_ms.
Where possible, use short names. This will save time for those entering command line options by hand.
The name is part of Kudu’s compatibility contract, and should not change without very good reason.

Default value

Choosing a default value is generally simple, but like the name, it propagates into other systems.

The default value is part of Kudu’s compatibility contract, and should not change without very good reason.

Description

The gflag’s description should supplement the name and provide additional context and information. Like the name, the description propagates into other systems.

The description may include multiple sentences. Each should begin with a capital letter, end with a period, and begin one space after the previous.
The description should NOT include the gflag’s type or default value; they are provided out-of-band.
The description should be in the third person. Do not use words like you.
A gflag description can be changed freely; it is not expected to remain the same across Kudu releases.

Java Code Style

Preconditions vs assert in the Kudu Java client

Use assert for verification of the static (i.e. non-runtime) internal invariants. Internal means the pre- and post-conditions which are completely under control of the code of a class or a function itself and cannot be influenced by input parameters and other runtime/dynamic conditions.

Use Preconditions for verification of the input parameters and the other conditions which are outside of the control of the local code, or conditions which are dependent on the state of other objects/components in runtime.

Object pop() {
  // Use Preconditions here because the external user of the class should not
  // call pop() on an empty stack, but the stack itself is internally consistent
  Preconditions.checkState(curSize > 0, "queue must not be empty");
  Object toReturn = data[--curSize];
  // Use an assert here because if we ended up with a negative size counter,
  // that's an indication of a broken implementation of the stack; i.e. it's
  // an invariant, not a state check.
  assert curSize >= 0;
  return toReturn;
}

However, keep in mind that assert checks are enabled only when the JVM is run with -ea option. So, if some dynamic condition is crucial for the overall consistency (e.g. a data loss can occur if some dynamic condition is not satisfied and the code continues its execution), consider throwing an AssertionError:

if (!isCriticalConditionSatisfied) {
  throw new AssertionError("cannot continue: data loss is possible otherwise");
}

References

`CMake` Style Guide

CMake allows commands in lower, upper, or mixed case. To keep the CMake files consistent, please use the following guidelines:

built-in commands in lowercase

add_subdirectory(some/path)

built-in arguments in uppercase

message(STATUS "message goes here")

custom commands or macros in uppercase

ADD_KUDU_TEST(some-test)

Third-party dependencies

Like many complex applications, Kudu depends on a number of third-party dependencies. Some (such as OpenSSL) are expected to be found on the build system itself. However, the vast majority are "vendored" in the thirdparty/ tree. These dependencies are all versioned and pinned. They are also source-based; the dependencies are built before the rest of Kudu is built using the build-if-necessary.sh script.

Third-party dependencies and their versions are defined in vars.sh. The source code for each dependency is located in a tarball, typically named <dependency>-<version>.tar.gz. The tarballs are stored in an Amazon S3 bucket operated by Cloudera. The bucket is cached in Amazon CloudFront to maximize download performance and reliability.

If as part of your contribution you need to add a new third-party dependency, here’s what you need to do:

Begin by preparing a source tarball for the new dependency. Ideally it should be a vanilla tarball obtained directly from an upstream project, but sometimes either its name or the contents need to be massaged to meet Kudu’s expectations.
Add the new dependency to the third-party build. You’ll need to modify vars.sh, download-thirdparty.sh, build-definitions.sh, and build-thirdparty.sh.
On your local machine, extract the source tarball into thirdparty/src.
Test the dependency’s build by running build-thirdparty.sh <dependency>. This should build and install the dependency into thirdparty/installed, making it available for the Kudu build.
Test the Kudu build using the new dependency. You will need to pass NO_REBUILD_THIRDPARTY=1 in the environment to prevent the Kudu build from rebuilding the thirdparty/ tree (whereupon it’ll fail to download the new dependency).
When everything checks out, contact a Kudu committer who is also a Cloudera employee and ask them to upload your source tarball to S3.
After the tarball has been uploaded, test the entire third-party build end-to-end by running build-if-necessary.sh.
Publish your patch to gerrit. With the tarball uploaded, the precommit builds should download and build the new dependency successfully.

Testing

All new code should have tests.: Add new tests either in existing files, or create new test files as necessary.
All bug fixes should have tests.: It’s OK to fix a bug without adding a new test if it’s triggered by an existing test case. For example, if a race shows up when running a multi-threaded system test after 20 minutes or so, it’s worth trying to make a more targeted test case to trigger the bug. But if that’s hard to do, the existing system test should be enough.
Tests should run quickly (< 1s).: If you want to write a time-intensive test, make the runtime dependent on KuduTest#AllowSlowTests, which is enabled via the KUDU_ALLOW_SLOW_TESTS environment variable and is used by Jenkins test execution.
Tests which run a number of iterations of some task should use a gflags command-line argument for the number of iterations.: This is handy for writing quick stress tests or performance tests.
Commits which may affect performance should include before/after perf-stat(1) output.: This will show performance improvement or non-regression. Performance-sensitive code should include some test case which can be used as a targeted benchmark.

Documentation

See the Documentation Style Guide for guidelines about contributing to the official Kudu documentation.

Blog posts

Writing a post on the Kudu blog

If you are using or integrating with Kudu, consider doing a write-up about your use case and your integration with Kudu and submitting it to be posted as an article on the Kudu blog. People in the community love to read about how Kudu is being used around the world.

Consider checking with the project developers on the Kudu Slack instance or on dev@kudu.apache.org if you have any questions about the content or the topic of a potential Kudu blog post.

Submitting a blog post in Google Doc format

If you don’t have the time to learn Markdown or to submit a Gerrit change request, but you would still like to submit a post for the Kudu blog, feel free to write your post in Google Docs format and share the draft with us publicly on dev@kudu.apache.org — we’ll be happy to review it and post it to the blog for you once it’s ready to go.

If you would like to submit the post directly to Gerrit for review in Markdown format (the developers will appreciate it if you do), please read below.

How to format a Kudu blog post

Blog posts live in the gh-pages branch under the _posts directory in Markdown format. They’re automatically rendered by Jekyll so for those familiar with Markdown or Jekyll, submitting a blog post should be fairly straightforward.

Each post is a separate file named in the following format:

YYYY-MM-DD-title-of-the-post.md

The YYYY-MM-DD part is the date which will be included in the link as /YYYY-MM-DD, then title-of-the-post is used verbatim. The words should be separated by dashes and should contain only lowercase letters of the English alphabet and numbers. Finally, the .md extension will be replaced with .html.

The header contains the layout information (which is always "post"), the title and the author’s name.

---
layout: post
title: Example Post
author: John Doe
---

The actual text of the blog post goes below this header, beginning with the "lead" which is a short excerpt that shows up in the index. This is separated by the  string from the rest of the post.

How to check the rendering of a blog post

Once you’ve finished the post, there is a command you can run to make sure it looks good called site_tool in the root of the gh-pages branch that can start up Jekyll and serve the rendered site locally. To run this, you need Ruby and Python to be installed on your machine, and you can start it with the below command.

$ ./site_tool jekyll serve

When starting, it will print the URL where you can reach the site, but it should be http://localhost:4000, or to reach the blog directly, http://localhost:4000/blog

You should be able to see the title and lead of your post along with your name at the top of this page, and after clicking on the title or the "Read full post…", the whole post.

How to submit a blog post

To submit the post, you’ll need to commit your change and push it to Gerrit for review. If the post is deemed useful for the community and all comments are addressed, a committer can merge and publish your post.

If you have a GitHub account, you can fork Kudu from https://github.com/apache/kudu and push the change to your fork too. GitHub will automatically render it on https://<yourname>.github.io/blog and you can link it directly on Gerrit.

This way the reviewers can see that the post renders well without having to download it, which can speed up the review process.