@InterfaceAudience.Public @InterfaceStability.Evolving public class KuduTableInputFormat extends org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.NullWritable,RowResult> implements org.apache.hadoop.conf.Configurable
This input format generates one split per tablet, and the only location for each split is that tablet's leader.

Hadoop has no notion of "closing" an input format, so in order to release its resources (mainly, the Kudu client) we assume that once either
getSplits(org.apache.hadoop.mapreduce.JobContext)
or KuduTableInputFormat.TableRecordReader.close()
has been called, the object will not be used again, and the AsyncKuduClient is shut down.
To prevent a premature shutdown of the client, the KuduTableInputFormat and the
TableRecordReader each get their own client; they do not share one.

The default behavior of Hadoop is to call getSplits(org.apache.hadoop.mapreduce.JobContext)
in the MRAppMaster and, for each InputSplit (in our case, a Kudu tablet), to spawn one Mapper
whose TableRecordReader reads that tablet.
The total number of Kudu clients opened over the course of an MR application can therefore be
estimated as (#Tablets + 1). To reduce the number of concurrently open clients, it may be
advisable to restrict the resources of the MR application or to implement
org.apache.hadoop.mapred.lib.CombineFileInputFormat over this InputFormat.
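As a sketch of how this input format is typically wired into a map-only job: the configuration property names below and the table name are assumptions for illustration only; in practice Kudu ships configurator utilities (see KuduTableMapReduceUtil) that set up the format for you.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.kudu.client.RowResult;
import org.apache.kudu.mapreduce.KuduTableInputFormat;

public class KuduRowCountJob {

  // Trivial mapper: counts rows via a counter. The NullWritable keys carry
  // no information; the RowResult values come from the TableRecordReader.
  public static class CountMapper
      extends Mapper<NullWritable, RowResult, NullWritable, NullWritable> {
    @Override
    protected void map(NullWritable key, RowResult row, Context ctx) {
      ctx.getCounter("kudu", "rows").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed property names and values, for illustration only.
    conf.set("kudu.mapreduce.master.addresses", "kudu-master:7051");
    conf.set("kudu.mapreduce.input.table", "my_table");

    Job job = Job.getInstance(conf, "kudu-row-count");
    job.setJarByClass(KuduRowCountJob.class);
    job.setInputFormatClass(KuduTableInputFormat.class); // one split per tablet
    job.setMapperClass(CountMapper.class);
    job.setNumReduceTasks(0); // map-only job
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Because each split's only location is the tablet's leader, the scheduler will try to place each of these map tasks on the tablet server hosting that leader.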
| Constructor and Description |
|---|
| KuduTableInputFormat() |
| Modifier and Type | Method and Description |
|---|---|
| org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.NullWritable,RowResult> | createRecordReader(org.apache.hadoop.mapreduce.InputSplit inputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext) |
| org.apache.hadoop.conf.Configuration | getConf() |
| List<org.apache.hadoop.mapreduce.InputSplit> | getSplits(org.apache.hadoop.mapreduce.JobContext jobContext) |
| void | setConf(org.apache.hadoop.conf.Configuration entries) |
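The reader returned by createRecordReader emits NullWritable keys and RowResult values, so a mapper over this format deserializes columns directly from each RowResult. A minimal sketch, assuming a table with string column "city" and 64-bit integer column "population" (both hypothetical; substitute your schema's columns):

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.kudu.client.RowResult;

// Emits (city, population) pairs from a Kudu scan. The "city" and
// "population" column names are hypothetical examples.
public class CityMapper
    extends Mapper<NullWritable, RowResult, Text, LongWritable> {
  @Override
  protected void map(NullWritable key, RowResult row, Context ctx)
      throws IOException, InterruptedException {
    ctx.write(new Text(row.getString("city")),
              new LongWritable(row.getLong("population")));
  }
}
```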
public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext jobContext) throws IOException, InterruptedException

Specified by: getSplits in class org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.NullWritable,RowResult>
Throws: IOException, InterruptedException
public org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.NullWritable,RowResult> createRecordReader(org.apache.hadoop.mapreduce.InputSplit inputSplit, org.apache.hadoop.mapreduce.TaskAttemptContext taskAttemptContext) throws IOException, InterruptedException

Specified by: createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.NullWritable,RowResult>
Throws: IOException, InterruptedException
public void setConf(org.apache.hadoop.conf.Configuration entries)

Specified by: setConf in interface org.apache.hadoop.conf.Configurable
public org.apache.hadoop.conf.Configuration getConf()

Specified by: getConf in interface org.apache.hadoop.conf.Configurable