Apache Kudu Ecosystem

While the Apache Kudu project provides client bindings that allow users to mutate and fetch data, more complex access patterns are often expressed through SQL and compute engines. This is a non-exhaustive list of projects that integrate with Kudu to enhance ingest, querying capabilities, and orchestration.

Frequently used

The following integrations are among the most commonly used with Apache Kudu (sorted alphabetically).

SQL

Apache Drill

Apache Drill provides a schema-free SQL query engine for Hadoop, NoSQL, and cloud storage. See the Drill Kudu API documentation for more details.

Apache Hive

The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. See the Hive Kudu integration documentation for more details.

Apache Impala

Apache Impala is the open source, native analytic database for Apache Hadoop. See the Kudu Impala integration documentation for more details.

Apache Spark SQL

Spark SQL is a Spark module for structured data processing. See the Kudu Spark integration documentation for more details.

Presto

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. See the Presto Kudu connector documentation for more details.

Computation

Apache Beam

Apache Beam is a unified model for defining both batch and streaming data-parallel processing pipelines, as well as a set of language-specific SDKs for constructing pipelines and Runners for executing them on distributed processing backends. See the Beam Kudu source and sink documentation for more details.

Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing. See the Kudu Spark integration documentation for more details.

Pandas

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Kudu Python scanners can be converted to Pandas DataFrames. See Kudu’s Python tests for example usage.
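The scanner-to-DataFrame conversion mentioned above can be sketched roughly as follows. This is a minimal illustration, not a definitive recipe: it assumes the kudu-python client and Pandas are installed, that a Kudu master is reachable, and that the table already exists. The helper names (scan_to_dataframe, connect_and_load), the host, and the table name are hypothetical placeholders.

```python
def scan_to_dataframe(client, table_name):
    """Scan every row of a Kudu table into a Pandas DataFrame.

    `client` is expected to be a kudu-python client object.
    """
    table = client.table(table_name)   # handle to an existing table
    scanner = table.scanner()          # scanner with no predicates: reads all rows
    # open() starts the scan and returns the scanner itself;
    # to_pandas() collects the results (requires Pandas to be installed).
    return scanner.open().to_pandas()


def connect_and_load(host="localhost", port=7051, table_name="example_table"):
    """Connect to a Kudu master and load one table (placeholder defaults)."""
    # Imported here so scan_to_dataframe stays usable without kudu-python,
    # for example when testing against a stub client.
    import kudu

    client = kudu.connect(host=host, port=port)
    return scan_to_dataframe(client, table_name)
```

Calling connect_and_load() with the address of a real master and an existing table name should return a DataFrame with one column per table column. For large tables it is likely worth adding predicates or a column projection to the scanner before opening it, since this sketch reads the whole table into memory.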

Talend Big Data

Talend simplifies and automates big data integration projects with on-demand serverless Spark and machine learning. See Talend’s Kudu component documentation for more details.

Ingest

Akka

Akka facilitates building highly concurrent, distributed, and resilient message-driven applications on the JVM. See the Alpakka Kudu connector documentation for more details.

Apache Flink

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. See the Flink Kudu connector documentation for more details.

Apache NiFi

Apache NiFi supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic. See the PutKudu processor documentation for more details.

Apache Spark Streaming

Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. See Kudu’s Spark Streaming tests for example usage.

Confluent Platform Kafka

Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications. See the Kafka Kudu connector documentation for more details.

StreamSets Data Collector

StreamSets Data Collector is a lightweight, powerful engine that streams data in real time. See the StreamSets Data Collector Kudu destination documentation for more details.

Striim

Striim is real-time data integration software that enables continuous data ingestion, in-flight stream processing, and delivery. See the Striim Kudu Writer documentation for more details.

TIBCO StreamBase

TIBCO StreamBase® is an event processing platform for applying mathematical and relational processing to real-time data streams. See the StreamBase Kudu operator documentation for more details.

Informatica PowerExchange

Informatica® PowerExchange® is a family of products that enables retrieval of a variety of data sources without having to develop custom data-access programs. See the PowerExchange for Kudu documentation for more details.

Deployment and Orchestration

Apache Camel

Camel is an open source integration framework that empowers you to quickly and easily integrate various systems consuming or producing data. See the Camel Kudu component documentation for more details.

Cloudera Manager

Cloudera Manager is an end-to-end application for managing CDH clusters. See the Cloudera Manager documentation for Kudu for more details.

Docker

Docker facilitates packaging software into standardized units for development, shipment, and deployment. See the official Apache Kudu Docker Hub repository and the Apache Kudu Docker Quickstart for more details.

Wavefront

Wavefront is a high-performance streaming analytics platform that supports 3D observability. See the Wavefront Kudu integration documentation for more details.

Visualization

Zoomdata

Zoomdata provides a high-performance BI engine and visually engaging, interactive dashboards. See Zoomdata’s Kudu page for more details.

Distribution and Support

While Kudu is an Apache-licensed open source project, software vendors may package and license it with other components to facilitate consumption. These offerings are typically bundled with support to tune and facilitate administration.