Apache Kudu Quickstart

Follow these instructions to set up and run the Kudu VM, and start with Kudu, Kudu_Impala, and CDH in minutes.

Get The Kudu Quickstart VM

Prerequisites

  1. Install Oracle Virtualbox. The VM has been tested to work with VirtualBox version 4.3 on Ubuntu 14.04 and VirtualBox version 5 on OSX 10.9. VirtualBox is also included in most package managers: apt-get, brew, etc.

  2. After the installation, make sure that VBoxManage is in your PATH by using the which VBoxManage command.

Installation

To download and start the VM, execute the following command in a terminal window.

$ curl -s https://raw.githubusercontent.com/cloudera/kudu-examples/master/demo-vm-setup/bootstrap.sh | bash

This command downloads a shell script which clones the kudu-examples Git repository and then downloads a VM image of about 1.2GB size into the current working directory.[1] You can examine the script after downloading it by removing the | bash component of the command above. Once the setup is complete, you can verify that everything works by connecting to the guest via SSH:

$ ssh demo@quickstart.cloudera

The username and password for the demo account are both demo. In addition, the demo user has password-less sudo privileges so that you can install additional software or manage the guest OS. You can also access the kudu-examples as a shared folder in /home/demo/kudu-examples/ on the guest or from your VirtualBox shared folder location on the host. This is a quick way to make scripts or data visible to the guest.

You can quickly verify if Kudu and Impala are running by executing the following commands:

$ ps aux | grep kudu
$ ps aux | grep impalad

If you have issues connecting to the VM or one of the processes is not running, make sure to consult the Troubleshooting section.

Load Data

To perform some typical operations with Kudu and Impala, you can load the SFO Passenger Data into Impala and then load it into Kudu.

  1. Upload the sample data from the home directory to HDFS.

    $ hdfs dfs -mkdir /data
    $ hdfs dfs -put examples/SFO_Passenger_Data/MonthlyPassengerData_200507_to_201506.csv /data
  2. Create a new external Impala table to access the plain text data. To connect to Impala in the virtual machine issue the following command:

    ssh demo@quickstart.cloudera -t impala-shell

    Now, you can execute the following commands:

    CREATE EXTERNAL TABLE passenger_data_raw (
      id int,
      activity_period int,
      operating_airline string,
      airline_iata_code string,
      published_airline string,
      published_airline_iata_code string,
      geo_summary string,
      geo_region string,
      activity_type_code string,
      price_category_code string,
      terminal string,
      boarding_area string,
      passenger_count bigint
    )
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    LOCATION '/data/';
  3. Validate if the data was actually loaded run the following command:

    SELECT count(*) FROM passenger_data_raw;
    
    +----------+
    | count(*) |
    +----------+
    | 13901    |
    +----------+
  4. It’s easy to convert data from any Hadoop file format and store it Kudu using the CREATE TABLE AS SELECT statement.

    CREATE TABLE passenger_data
    DISTRIBUTE BY HASH (id) INTO 16 BUCKETS
    TBLPROPERTIES(
    'storage_handler' = 'com.cloudera.kudu.hive.KuduStorageHandler',
    'kudu.table_name' = 'passenger_data',
    'kudu.master_addresses' = '127.0.0.1',
    'kudu.key_columns' = 'id'
     ) AS SELECT * FROM passenger_data_raw;
    
    +-----------------------+
    | summary               |
    +-----------------------+
    | Inserted 13901 row(s) |
    +-----------------------+
    Fetched 1 row(s) in 1.26s

For CREATE TABLE …​ AS SELECT we currently require that the first columns that are projected in the SELECT statement correspond to the Kudu table keys and are in the same order (id in the example above). If the default projection generated by does not meet this requirement, the user should avoid using and explicitly mention the columns to project, in the correct order.

+ The created table uses a simple single column primary key. See Kudu Impala Integration for a more detailed introduction to the extended SQL syntax for Impala.

+ The columns of the created table are copied from the passenger_data_raw base table. See Impala’s documentation for more details about the extended SQL syntax for Impala.

Read and Modify Data

Now that the data is stored in Kudu, you can run queries against it. The following query lists the airline with the highest passenger volume over the entire reporting timeframe.

SELECT sum(passenger_count) AS total, operating_airline FROM passenger_data
  GROUP BY operating_airline
  HAVING total IS NOT null
  ORDER BY total DESC LIMIT 10;

+-----------+----------------------------------+
| total     | operating_airline                |
+-----------+----------------------------------+
| 105363917 | United Airlines - Pre 07/01/2013 |
| 51319845  | United Airlines                  |
| 32657456  | SkyWest Airlines                 |
| 31727343  | American Airlines                |
| 23801507  | Delta Air Lines                  |
| 23685267  | Virgin America                   |
| 22507320  | Southwest Airlines               |
| 16235520  | US Airways                       |
| 11860630  | Alaska Airlines                  |
| 6706438   | JetBlue Airways                  |
+-----------+----------------------------------+

Looking at the result, you can already see a problem with the dataset. There is a duplicate airline name. Since the data is stored in Kudu rather than HDFS, you can quickly change any individual record and fix the problem without having to rewrite the entire table.

UPDATE passenger_data
  SET operating_airline="United Airlines"
  WHERE operating_airline LIKE "United Airlines - Pre%";

SELECT sum(passenger_count) AS total, operating_airline FROM passenger_data
  GROUP BY operating_airline
  HAVING total IS NOT null
  ORDER BY total DESC LIMIT 10;

+-----------+--------------------+
| total     | operating_airline  |
+-----------+--------------------+
| 156683762 | United Airlines    |
| 32657456  | SkyWest Airlines   |
| 31727343  | American Airlines  |
| 23801507  | Delta Air Lines    |
| 23685267  | Virgin America     |
| 22507320  | Southwest Airlines |
| 16235520  | US Airways         |
| 11860630  | Alaska Airlines    |
| 6706438   | JetBlue Airways    |
| 6266220   | Northwest Airlines |
+-----------+--------------------+

Troubleshooting

Problems accessing the VM via SSH

  • Make sure the host has a SSH client installed.

  • Make sure the VM is running, by running the following command and checking for a VM called kudu-demo:

    $ VBoxManage list runningvms
  • Verify that the VM’s IP address is included in the host’s /etc/hosts file. You should see a line that includes an IP address followed by the hostname quickstart.cloudera. To check the running VM’s IP address, use the VBoxManage command below.

    $ VBoxManage guestproperty get kudu-demo /VirtualBox/GuestInfo/Net/0/V4/IP
    Value: 192.168.56.100
  • If you’ve used a Cloudera Quickstart VM before, your .ssh/known_hosts file may contain references to the previous VM’s SSH credentials. Remove any references to quickstart.cloudera from this file.

Failing with lack of SSE4.2 support when running inside VirtualBox

  • Running Kudu currently requires a CPU that supports SSE4.2 (Nehalem or later for Intel). To pass through SSE4.2 support into the guest VM, refer to the VirtualBox documentation


1. In addition, the script will create a host-only network between host and guest and setup an enty in the /etc/hosts file with the name quickstart.cloudera and the guest’s IP address.