2. Hadoop Input and Output

厚积薄发 (My world has started to snow)

Chapter: 2. Hadoop Input and Output
2012-03-03 21:22:52

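The book only touches on this topic briefly in Chapter 3, where it covers reading from and writing to HDFS. That is enough to make the point, but on the principle that a first reading should make a book "thicker", I think it is well worth expanding on it myself. Besides, everything comes back to the same fundamentals: every program goes "input --> compute --> output". Start with input. Hadoop's default input format treats each line of the input as one record, with the line's byte offset in the file as the key and the line's content as the value. That will not satisfy every business need, of course. So on the one hand Hadoop ships a number of other input formats, and on the other hand, for more freedom, it lets you define your own.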
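To make the default record layout concrete, here is a minimal plain-Java sketch (no Hadoop dependency) that turns text into (byte offset, line) pairs the way the paragraph above describes. The class and method names (LineRecordSketch, toRecords) are made up for illustration and are not part of Hadoop.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Illustration only: mimics the (offset, line) records described above. */
final class LineRecordSketch {

    /** One record: the byte offset where the line starts, and the line text. */
    static final class Record {
        final long offset;
        final String line;

        Record(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    /** Splits UTF-8 content on '\n' into (offset, line) records. */
    static List<Record> toRecords(String content) {
        byte[] bytes = content.getBytes(StandardCharsets.UTF_8);
        List<Record> records = new ArrayList<>();
        int start = 0;
        for (int i = 0; i <= bytes.length; i++) {
            boolean atEnd = (i == bytes.length);
            if (atEnd || bytes[i] == '\n') {
                // Skip the empty tail that follows a final newline.
                if (!(atEnd && start == i)) {
                    String line = new String(bytes, start, i - start, StandardCharsets.UTF_8);
                    records.add(new Record(start, line));
                }
                start = i + 1;
            }
        }
        return records;
    }

    public static void main(String[] args) {
        for (Record r : toRecords("ball, 3.5\ncar, 15\n")) {
            System.out.println(r.offset + " : " + r.line);
        }
    }
}
```

The key is a byte offset rather than a line number, which is why a reader can start in the middle of a file without counting the lines before it.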

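First, a few concepts:
InputFiles: straightforward; these are simply the files that hold the input data.
InputFormat: this one deserves a few words, although it is also simple. This Java interface determines what data a Mapper instance receives from the Hadoop framework, that is, what kind of key-value pairs.
InputSplit: application code does not touch this directly, but the concept is worth understanding. YDN has this passage:
(Note: the text below is marked as a quotation only to make it stand out in this note; it is not the book's original wording. I ask the author and readers for their understanding, and please point out any copyright problem.)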
Another important job of the InputFormat is to divide the input data sources (e.g., input files) into fragments that make up the inputs to individual map tasks. These fragments are called "splits" and are encapsulated in instances of the InputSplit interface.
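Generally speaking, the InputSplit determines the chunk of data each Mapper processes,
while the InputFormat determines how the data inside each split is structured into records.
I am not sure that makes it clear. Roughly, you can think of the InputSplit as the physical (byte-oriented) view of the input and the InputFormat as the logical (record-oriented) view.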
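The idea of cutting a file into splits can be sketched in a few lines of plain Java. This is a simplified illustration under the assumption of a fixed split size. Hadoop's own FileInputFormat also takes the block size and other settings into account, which this sketch ignores. The names SplitSketch and computeSplits are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: divides a file of a given length into fixed-size byte ranges. */
final class SplitSketch {

    /** A byte range of the file: where it starts and how long it is. */
    static final class Split {
        final long start;
        final long length;

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    /** Returns consecutive ranges of at most splitSize bytes that cover the file. */
    static List<Split> computeSplits(long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            splits.add(new Split(start, Math.min(splitSize, fileLength - start)));
        }
        return splits;
    }

    public static void main(String[] args) {
        for (Split s : computeSplits(250, 100)) {
            System.out.println("start=" + s.start + " length=" + s.length);
        }
    }
}
```

Each range would go to one map task. Note that a range boundary can fall in the middle of a line, so turning the bytes of a split into whole records is the job of the record reader, not of the split itself.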





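Hadoop itself provides the following input formats (listed as key : value):
(Note: as above, the list below is marked as a quotation only to make it stand out in this note; it is not the book's original wording.)
    TextInputFormat: byte offset of the line in the file : the whole line
    KeyValueTextInputFormat: the text before the first "\t" : the rest of the line after it
    SequenceFileInputFormat: the input is a binary file, so the key and value types are whatever the user specifies
    NLineInputFormat: same key and value as TextInputFormat; the difference is that each split holds N lines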
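The KeyValueTextInputFormat rule in the list above (key before the first tab, value after it) can be illustrated with a small plain-Java helper. This is only a sketch of the splitting rule, not Hadoop's code; the names KeyValueLineSketch and splitOnFirstTab are hypothetical, and treating a line without a tab as "whole line is the key, empty value" is an assumption of this sketch.

```java
/** Illustration only: splits a line into key and value at the first tab. */
final class KeyValueLineSketch {

    /** Returns a two-element array: {key, value}. */
    static String[] splitOnFirstTab(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            // Assumption: with no separator the whole line is the key.
            return new String[] { line, "" };
        }
        return new String[] { line.substring(0, tab), line.substring(tab + 1) };
    }

    public static void main(String[] args) {
        String[] kv = splitOnFirstTab("ball\t3.5\t12.7");
        System.out.println("key=" + kv[0] + " value=" + kv[1]);
    }
}
```

Only the first tab separates key from value; any later tabs stay inside the value.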

The standard InputFormat interface (the JobConf-based org.apache.hadoop.mapred API) looks like this:
public interface InputFormat<K, V> 
{
	InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;
	
	RecordReader<K, V> getRecordReader(InputSplit split,
	                                                            JobConf job,
	                                                            Reporter reporter) throws IOException;
}

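To customize the input, you implement this interface. The two methods serve the following purposes:
■ Identify all the files used as input data and divide them into input splits. Each map task is assigned one split.
■ Provide an object (RecordReader) to iterate through records in a given split, and to parse each record into key and value of predefined types.
Following the book's advice, if you really must customize the input, it is best to subclass FileInputFormat rather than implement the InputFormat interface directly, because FileInputFormat already implements getSplits() well enough for the vast majority of real-world needs.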


Here is an example.

Suppose your input data looks like this:
ball, 3.5, 12.7, 9.0
car, 15, 23.76, 42.23
device, 0.0, 12.4, -67.1

Each line holds a point's name, followed by its coordinate values in the coordinate system.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

/* Only getRecordReader() is implemented; getSplits() is inherited from FileInputFormat. */
/* Point3D is a user-defined value class (not shown) with public float fields x, y, z. */
public class ObjectPositionInputFormat extends FileInputFormat<Text, Point3D> {

	public RecordReader<Text, Point3D> getRecordReader(InputSplit input, JobConf job, Reporter reporter) throws IOException {
		reporter.setStatus(input.toString());
		return new ObjPosRecordReader(job, (FileSplit) input);
	}
}


/* The ObjPosRecordReader class: wraps a LineRecordReader and parses each line into (name, point). */
class ObjPosRecordReader implements RecordReader<Text, Point3D> {

	private LineRecordReader lineReader;
	private LongWritable lineKey;
	private Text lineValue;
	
	public ObjPosRecordReader(JobConf job, FileSplit split) throws IOException {
		lineReader = new LineRecordReader(job, split);
		lineKey = lineReader.createKey();
		lineValue = lineReader.createValue();
	}
	
	public boolean next(Text key, Point3D value) throws IOException {
		// Read the next raw line; false means the split is exhausted.
		if (!lineReader.next(lineKey, lineValue)) {
			return false;
		}
		
		// Expected layout: name, x, y, z
		String[] pieces = lineValue.toString().split(",");
		if (pieces.length != 4) {
			throw new IOException("Invalid record received");
		}
		
		float fx, fy, fz;
		try {
			fx = Float.parseFloat(pieces[1].trim());
			fy = Float.parseFloat(pieces[2].trim());
			fz = Float.parseFloat(pieces[3].trim());
		} catch (NumberFormatException nfe) {
			throw new IOException("Error parsing floating point value in record");
		}
		
		key.set(pieces[0].trim());
		
		value.x = fx;
		value.y = fy;
		value.z = fz;
		
		return true;
	}
	
	public Text createKey() {
		return new Text("");
	}
	
	public Point3D createValue() {
		return new Point3D();
	}
	
	public long getPos() throws IOException {
		return lineReader.getPos();
	}
	
	public void close() throws IOException {
		lineReader.close();
	}
	
	public float getProgress() throws IOException {
		return lineReader.getProgress();
	}
}
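The example above uses a Point3D value type that this note never defines. Below is a minimal sketch of what such a class might look like, assuming only what the record reader needs: a no-argument constructor and public float fields x, y, z. In a real Hadoop job the class would also declare `implements org.apache.hadoop.io.Writable`, whose two methods have the same signatures as write() and readFields() here; the interface is left off only so the sketch compiles without Hadoop on the classpath. The roundTrip() helper is hypothetical and exists purely to demonstrate the serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Sketch of the value type; in Hadoop it would implement Writable. */
class Point3D {
    public float x;
    public float y;
    public float z;

    public Point3D() {
        this(0.0f, 0.0f, 0.0f);
    }

    public Point3D(float x, float y, float z) {
        this.x = x;
        this.y = y;
        this.z = z;
    }

    /** Serializes the three coordinates in a fixed order. */
    public void write(DataOutput out) throws IOException {
        out.writeFloat(x);
        out.writeFloat(y);
        out.writeFloat(z);
    }

    /** Reads the coordinates back in the same order they were written. */
    public void readFields(DataInput in) throws IOException {
        x = in.readFloat();
        y = in.readFloat();
        z = in.readFloat();
    }

    @Override
    public String toString() {
        return x + ", " + y + ", " + z;
    }

    /** Hypothetical helper: writes p to a byte buffer and reads it into a new object. */
    static Point3D roundTrip(Point3D p) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            p.write(new DataOutputStream(buffer));
            Point3D copy = new Point3D();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(new Point3D(3.5f, 12.7f, 9.0f)));
    }
}
```

The fields are public and mutable so that a reader can fill in a reused object on every call to next(), which is the pattern the record reader above follows.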


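As for output, the usual task is controlling the output format, for example emitting XML or JSON. I will skip that part and save some typing, since overall it works much like input.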
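For a rough idea of what controlling the output format involves, here is a plain-Java sketch that writes key-value pairs as one JSON object per line. It has no Hadoop dependency; by analogy with the input side, this logic would presumably sit in a record writer handed out by a custom output format, but that wiring is not shown. The class name JsonLineWriterSketch and the "key"/"value" field names are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Illustration only: emits each (key, value) pair as a single-line JSON object. */
final class JsonLineWriterSketch {

    private final Writer out;

    JsonLineWriterSketch(Writer out) {
        this.out = out;
    }

    /** Writes one record, for example {"key":"ball","value":"3.5, 12.7, 9.0"}. */
    void write(String key, String value) {
        try {
            out.write("{\"key\":\"" + escape(key) + "\",\"value\":\"" + escape(value) + "\"}\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Escapes quotes, backslashes and control characters for a JSON string. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        StringWriter sw = new StringWriter();
        new JsonLineWriterSketch(sw).write("ball", "3.5, 12.7, 9.0");
        System.out.print(sw);
    }
}
```

Writing one object per line keeps the output line-oriented, so a later job can read it back with a plain line-based input format.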
